Stage 4, day 13:
Great success this morning. I have finally solved the remaining mystery that did obsess me for the last 36 hours or so.
It wasn't easy at all to find out. The bug did occur in a 2 seconds window. About 500 lines of logs are covering the events that did occur within those 2 seconds. And somewhere in those, something not very obvious is off. This is a perfect example of the classical finding a needle in a haystack.
Usually, I do inspect core dumps which make solving these kind of issues trivial. In this case, it wasn't available. Despite having root access to the server, I'm discovering surprising side-effects of being a VM.
1. It is impossible to change system time. It is protected by some Linux security feature that I'm not very familiar with. Capabilities. The init process. which is the ancestor of every processes on the system, possess a capability bounding set which all the system processes inherit. This capability set is missing the CAP_SYS_TIME capability. There doesn't seem to have any possible workaround that. I did ask the host machine admin to synchronize the time. He did (I wonder what he did thought of my request. It must be pretty unusual that one of his users complains about the clock being 20 seconds off....). It is amazing, in order to not disrupt anything running that expect continuous time, the time adjustment is performed by slowing just a bit the time. I did follow the adjustment. It took roughly 12 hours to lose a 20 seconds advance. Bottomline, I did adjust my code to take protective measures and not assume that the time is good.
2. Even with root access, it seems like it is impossible to enable core dump creation... I did work with such limitations. The only tool you have in that situation is very carefully crafted logs. That is it.
Back to my mystery. Despite looking at the logs, I couldn't find the problem. I started to reformat them to make them easier to look at and understand the flow of events. I started to annotate them. This morning, I was about to give up. I was in the process of creating a plan B. Such as adding a bit more logs that could help to catch the problem. Despite having succeeded making the software run on the server, rollback to my dev machine where I can have core dumps. and so on.... But before giving up... I did another annotation pass... And with this... It did allow me to spot 1 message among the few dozens messages exchanged back and forth... And that message was out of order, actually, it was missing... Upon that realization, I immediately understood that it was the reception of that message at an unexpected time that did crash the process. Without looking at the code to get confirmation, in my head just with the realization, it was game-over. I got the bastard. I did smile and laugh while I am alone at my desk... Best way to start the day....
Ok, now that this is out of the way, lets get back to the regular program and lets finish the money making software so that I can fully enjoy my quarantine...
Next priority once I stabilize the code base and have a satisfactorily working prototype will be to perform some refactoring.
Two of my modules are getting pretty big (1600 and 2500 lines of code). This starts to make them pretty complex and hard to maintain.
Breaking them into smaller components would definitely help in managing them.
Great success this morning. I have finally solved the remaining mystery that did obsess me for the last 36 hours or so.
It wasn't easy at all to find out. The bug did occur in a 2 seconds window. About 500 lines of logs are covering the events that did occur within those 2 seconds. And somewhere in those, something not very obvious is off. This is a perfect example of the classical finding a needle in a haystack.
Usually, I do inspect core dumps which make solving these kind of issues trivial. In this case, it wasn't available. Despite having root access to the server, I'm discovering surprising side-effects of being a VM.
1. It is impossible to change system time. It is protected by some Linux security feature that I'm not very familiar with. Capabilities. The init process. which is the ancestor of every processes on the system, possess a capability bounding set which all the system processes inherit. This capability set is missing the CAP_SYS_TIME capability. There doesn't seem to have any possible workaround that. I did ask the host machine admin to synchronize the time. He did (I wonder what he did thought of my request. It must be pretty unusual that one of his users complains about the clock being 20 seconds off....). It is amazing, in order to not disrupt anything running that expect continuous time, the time adjustment is performed by slowing just a bit the time. I did follow the adjustment. It took roughly 12 hours to lose a 20 seconds advance. Bottomline, I did adjust my code to take protective measures and not assume that the time is good.
2. Even with root access, it seems like it is impossible to enable core dump creation... I did work with such limitations. The only tool you have in that situation is very carefully crafted logs. That is it.
Back to my mystery. Despite looking at the logs, I couldn't find the problem. I started to reformat them to make them easier to look at and understand the flow of events. I started to annotate them. This morning, I was about to give up. I was in the process of creating a plan B. Such as adding a bit more logs that could help to catch the problem. Despite having succeeded making the software run on the server, rollback to my dev machine where I can have core dumps. and so on.... But before giving up... I did another annotation pass... And with this... It did allow me to spot 1 message among the few dozens messages exchanged back and forth... And that message was out of order, actually, it was missing... Upon that realization, I immediately understood that it was the reception of that message at an unexpected time that did crash the process. Without looking at the code to get confirmation, in my head just with the realization, it was game-over. I got the bastard. I did smile and laugh while I am alone at my desk... Best way to start the day....
Ok, now that this is out of the way, lets get back to the regular program and lets finish the money making software so that I can fully enjoy my quarantine...
Next priority once I stabilize the code base and have a satisfactorily working prototype will be to perform some refactoring.
Two of my modules are getting pretty big (1600 and 2500 lines of code). This starts to make them pretty complex and hard to maintain.
Breaking them into smaller components would definitely help in managing them.