There is actually nothing wrong; I just did not take enough samples. I have seen others ask the same question, and I guess they did not look deep enough into the sampled data either. The microphone always seems to start with negative numbers after power-up, and the first few samples are 0x0000 while the microphone is powering up.
The microphone data was climbing towards its highest binary value of 0x3FFFF (which is -1 in 18-bit two's complement, i.e. approaching zero from the negative side in the decimal world), and after about 600 samples the values reached zero and the sign flipped. See the attached picture. So the previous scope picture above seems to show a correct decoding after all.
Note that my processing reduces the original 18 bits to 16 bits to save memory. The mic had its maximum at 0x3FFFF; I now have the maximum at 0xFFFF. This does not affect the MSB (the sign bit): the value still flips to positive.
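For anyone who wants to check the numbers, here is a minimal sketch of how an 18-bit two's complement sample can be interpreted and reduced to 16 bits. It assumes the raw sample sits right-aligned in the low 18 bits of a 32-bit word and that the reduction simply drops the two least significant bits; the function names `sign_extend_18` and `mic18_to_16` are made up for illustration and are not from my actual code.

```c
#include <stdint.h>

/* Interpret the low 18 bits of 'raw' as a two's complement number.
 * Assumption: the sample is right-aligned in the 32-bit word. */
static int32_t sign_extend_18(uint32_t raw)
{
    raw &= 0x3FFFFu;                            /* keep only the 18 data bits */
    /* Flip the sign bit (bit 17), then subtract its weight again:
     * values with bit 17 set end up negative. */
    return (int32_t)(raw ^ 0x20000u) - 0x20000;
}

/* Reduce an 18-bit sample to 16 bits by dropping the two lowest bits.
 * The sign bit moves from bit 17 to bit 15, so the sign is preserved. */
static uint16_t mic18_to_16(uint32_t raw)
{
    return (uint16_t)((raw & 0x3FFFFu) >> 2);
}
```

With this, 0x3FFFF decodes to -1 and maps to 0xFFFF in 16 bits, while 0x20000 (the most negative 18-bit value) maps to 0x8000.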

The next step was to convert the two's complement values to positive-only values. I simply added a 'half-way' offset, which for 16 bits is 0x8000:
*(audio_data + i) = 0x8000 + data_from_mic; // change values up to a positive world!
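To illustrate what that offset does, here is a small sketch of the same operation as a standalone function. It assumes the sample is stored as a 16-bit two's complement value in a `uint16_t`; the name `to_offset_binary` is hypothetical. The addition wraps around at 16 bits, which is what moves the negative half of the range below the positive half.

```c
#include <stdint.h>

/* Shift a 16-bit two's complement sample into the unsigned range
 * 0x0000..0xFFFF by adding half the range. The cast truncates the
 * result to 16 bits, so the addition wraps around as intended. */
static uint16_t to_offset_binary(uint16_t sample)
{
    return (uint16_t)(sample + 0x8000u);
}
```

After the conversion, silence (0x0000) sits at mid-scale 0x8000, a sample of -1 (0xFFFF) becomes 0x7FFF, and the most negative value (0x8000) becomes 0x0000.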
As I do not need the time resolution, I only record every 40th sample from the I2S. After this, the data looks human-readable. See the recording of 1000 samples (2 seconds) of me saying "Hey!" and clapping my hands. Notice how, after power-up, the mic starts at the above-mentioned zero level, then likely auto-detects the ambient volume and settles to the middle/zero level:
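The "keep every 40th sample" step could look roughly like the sketch below. It is not my exact code: the function name `keep_every_nth` and the buffer layout are assumptions, and it works on samples already collected in an array rather than on the live I2S stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy every 'step'-th sample from 'in' to 'out', starting with in[0].
 * Stops when the input is exhausted or 'out' is full.
 * Returns the number of samples written. */
static size_t keep_every_nth(const uint16_t *in, size_t n_in, size_t step,
                             uint16_t *out, size_t out_cap)
{
    size_t n_out = 0;

    if (step == 0) {
        return 0;                 /* avoid an endless loop on a bad step */
    }
    for (size_t i = 0; i < n_in && n_out < out_cap; i += step) {
        out[n_out++] = in[i];
    }
    return n_out;
}
```

Calling it with `step` set to 40 gives the reduction described above. Note that dropping samples without low-pass filtering first lets higher frequencies alias into the result, which is fine for eyeballing a plot but not for audio playback.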

Hopefully this helps someone.