Cycle 12, day #4:
Not long after I put the new version containing the refactoring online, it got triggered by the detection of a good trade. It crashed midway through due to heap corruption.
This is the worst type of problem because the crash does not occur at the place where the corruption was made.
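To illustrate why that hurts: the write that does the damage completes without incident, and the crash only shows up later, when the allocator stumbles over its trashed bookkeeping. A toy example:

```cpp
#include <cstdlib>
#include <cstring>

// Toy heap corruption: the bad write succeeds silently; the crash, if any,
// happens later in unrelated code when the allocator walks its metadata.
int main() {
    char* a = static_cast<char*>(std::malloc(16));
    std::memset(a, 0, 32);  // BUG: 16-byte overrun corrupts heap metadata
    char* b = static_cast<char*>(std::malloc(16));  // may crash HERE instead
    std::free(b);
    std::free(a);           // ...or here, or never -- undefined behavior
}
```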
There are currently 2 recent suspects: the new JSON code and my order code refactoring.
Since the client and the server receive the same JSON messages and only the server crashed, I would think this almost eliminates the JSON code from the suspect list. That said, the server does log a few more JSON msgs. Because JSON decoding is done in place, in the input buffer, logging a msg means I have to re-encode it... The JSON re-encoding code might be the cause. I don't have 100% confidence in it...
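For context, the decode-then-re-encode path looks roughly like this, sketched with RapidJSON as a stand-in (the in-situ parsing matches what I described; actual names in my code differ):

```cpp
#include <cstdio>
#include "rapidjson/document.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"

// In-situ parsing unescapes strings directly inside the input buffer, so
// the original JSON text is destroyed by the parse; logging the msg after
// the fact means serializing the parsed DOM back to text.
void logIncomingMsg(char* inputBuffer) {
    rapidjson::Document doc;
    doc.ParseInsitu(inputBuffer);  // destructive: reuses inputBuffer storage
    if (doc.HasParseError()) return;

    rapidjson::StringBuffer out;
    rapidjson::Writer<rapidjson::StringBuffer> writer(out);
    doc.Accept(writer);            // the re-encoding step under suspicion
    std::printf("msg: %s\n", out.GetString());
}
```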
Otherwise, the problem comes from the new order code... I have logs... I may try to recreate the same sequence of events in the hope that the problem will show up...
This one is going to be tough to track down...
Update:
I reworked the code a little bit. For one, I now place the encoding buffer on the stack instead of allocating it on the heap. The hope is that if there is a buffer overrun there, it will make the program crash right away instead of quietly trashing the heap.
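Something along these lines (the encoder and log sink are stand-ins; only the buffer placement matters for this experiment):

```cpp
#include <cstddef>
#include <cstdio>

// Stand-in encoder for illustration.
static std::size_t encodeJson(const char* payload, char* out, std::size_t cap) {
    int n = std::snprintf(out, cap, "{\"payload\":\"%s\"}", payload);
    return n < 0 ? 0 : static_cast<std::size_t>(n);
}

void logReencodedMsg(const char* payload) {
    // Heap buffer: an overrun silently corrupts neighboring allocations and
    // the crash surfaces much later, somewhere unrelated.
    // Stack buffer: an overrun tends to trip the stack protector or smash
    // the return address, so the process dies right at the guilty frame.
    char buf[4096];  // assumed upper bound on an encoded msg
    std::size_t n = encodeJson(payload, buf, sizeof(buf));
    if (n > 0 && n < sizeof(buf))
        std::fwrite(buf, 1, n, stderr);  // stand-in for the real log sink
}
```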
A lot of activity happened this afternoon and no crash occurred. This is increasing my confidence level in the latest order refactoring code. Unless I finally nail down the problem, I will end up writing a test that reproduces the events that led to the crash, but I am not sure this will recreate the crash. At worst, it will just give me another test that I can run to validate that I did not introduce a regression.
I think that I have found some smoke. The heap corruption is detected while execution is inside the JSON decoder, which is supposed to NOT allocate memory... yet it is allocating. I created a custom allocator that simply aborts the program if the decoder attempts to allocate, and abort it did! Something is fishy in that zone... I am onto something...
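The trap itself is tiny. A minimal sketch of the idea, written against RapidJSON's allocator concept (kNeedFree/Malloc/Realloc/Free), since that is my stand-in for the library here:

```cpp
#include <cstddef>
#include <cstdlib>

// Allocator that refuses to allocate: plug it into the decoder and any
// hidden allocation turns into an immediate, debuggable abort at the exact
// call site instead of silent heap traffic.
struct AbortingAllocator {
    static const bool kNeedFree = false;
    void* Malloc(std::size_t size) {
        if (size) std::abort();  // any real allocation kills the process here
        return nullptr;          // zero-size requests may legally return null
    }
    void* Realloc(void*, std::size_t, std::size_t newSize) {
        if (newSize) std::abort();
        return nullptr;
    }
    static void Free(void*) {}
};
```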
Update #2:
I have resolved the JSON decoder memory allocation mystery. The doc says that it doesn't allocate memory when using in-place parsing mode, which is true for the low-level API. But the high-level API, which I use in one particular context (the one where the crash occurred), does allocate memory to build its DOM.
I am now preallocating memory on the stack and providing it to the decoder so that it stops doing dynamic allocations, which was originally the #1 motivation for switching libraries. I thought I wasn't doing any allocation, and I was still doing some...
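In RapidJSON terms (again my best stand-in for the library; the buffer sizes are invented), the fix looks roughly like this:

```cpp
#include "rapidjson/allocators.h"
#include "rapidjson/document.h"

// Give the DOM parser stack-backed arenas so it stops touching the heap.
// Caveat: MemoryPoolAllocator falls back to its base allocator once the
// user buffer is exhausted, so the arenas must cover the worst-case msg.
void parseMsgInPlace(char* msg) {
    char valueBuf[16384];  // backing store for DOM values
    char parseBuf[1024];   // backing store for the parser's internal stack
    rapidjson::MemoryPoolAllocator<> valueAlloc(valueBuf, sizeof(valueBuf));
    rapidjson::MemoryPoolAllocator<> parseAlloc(parseBuf, sizeof(parseBuf));
    rapidjson::GenericDocument<rapidjson::UTF8<>,
                               rapidjson::MemoryPoolAllocator<>,
                               rapidjson::MemoryPoolAllocator<>>
        doc(&valueAlloc, sizeof(parseBuf), &parseAlloc);
    doc.ParseInsitu(msg);  // in-place parse; DOM nodes live in valueBuf
    // ... use doc ...
}
```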
I don't think this was the cause of my corruption incident, but by minimizing heap usage, I also minimize the probability of accidentally corrupting it.
I am currently preparing the engine test framework that I developed back in August to investigate another weird incident. I'm still not convinced this will produce good leads on the corruption mystery, but nothing is lost: in the worst-case scenario, I'll at least have made my frameworks more reusable.
Unless the problem happens again, I'm left with one other, very unlikely explanation: since the latest refactoring changed the Order object's size, some compilation units might not have been recompiled (no idea how that could happen, assuming make is doing its job). That would create a size mismatch between two units that pass Order object pointers between them, and this mismatch could have caused the corruption... but this is just a wild guess...
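If I wanted to turn that wild guess into something testable, a cheap guard would be to bake the compile-time sizeof(Order) into every unit and verify it at startup against one freshly built reference unit. All names here are hypothetical:

```cpp
// order_size_check.h -- every unit records the sizeof(Order) it was built
// with; the checker, compiled in exactly ONE unit, compares against its own
// fresh view, so a stale object file fails loudly at startup instead of
// passing mismatched Order pointers around. (If the reference unit itself
// is stale, the guard is blind, so keep it next to the Order definition.)
#include <cstddef>
#include "order.h"  // assumed project header defining Order

void checkOrderSize(std::size_t compiledSize, const char* unit);

#define CHECK_ORDER_SIZE() \
    [[maybe_unused]] static const int orderSizeGuard = \
        (checkOrderSize(sizeof(Order), __FILE__), 0)

// order.cpp -- the single reference definition:
#include <cstdio>
#include <cstdlib>
void checkOrderSize(std::size_t compiledSize, const char* unit) {
    if (compiledSize != sizeof(Order)) {
        std::fprintf(stderr,
                     "stale build: %s has sizeof(Order)=%zu, expected %zu\n",
                     unit, compiledSize, sizeof(Order));
        std::abort();
    }
}

// any_other_unit.cpp -- one line per unit that handles Order objects:
CHECK_ORDER_SIZE();
```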
The end result is that I furiously investigated this issue and, in doing so, made several small enhancements here and there...
I'm going to put those in place whenever I can. The server is currently in the middle of a trade that isn't about to close, since the markets have decided to go in the opposite direction. BTW, BTC did reach the $12K level. The last few times that happened, it crashed back fast toward $10K... I feel like this scenario is very probable... and I'm seeing a lot of crypto websites being very bullish about the news... I feel like this could be the proverbial sell-the-news... I guess we will know soon... Too bad my margin trading code isn't ready yet, with all these unexpected events...