I have a data stream that looks like

x\xdam\xdd]\xc4\xeemZ\xc7q)2\xcc\xd6cL\x1a1\x8c\xc9hc\x0c)\xeb^\xaf\xd7Z (...)

How can I find out what I am looking at here? At the moment I'm trying to dive into binary, ASCII, codecs, UTF-8 and so on, but only in general terms. Hence, I would be grateful for any help that lets me focus.

edit: The data stream represents audio data that is allegedly given in FLAC format. But I can't tell whether the data above is already FLAC, or binary that encodes FLAC in some way, or something else. Ultimately, my goal is very plain: to find out what exactly can be seen above. What is it, and in which format is it given?

It is audio data. Ok, but..

  1. is it given in binary data
  2. is the binary data given in ASCII?
  3. is it FLAC but in binary form?
  4. is it binary that has to be converted via ASCII into FLAC?
  5. ..

edit: Here is the stream (I had to investigate where I can find the beginning..):

b'fLaC\x00\x00\x00"\x10\x00\x10\x00\x00\x03\xf5\x00\x07\xd9\x0b\xb8\x00\xf0\x00\x03\xa8\x00\t[\x92\x01*\xb7\xf2\xb5\xa0\xb7Ts\x89{\xce\x8a\x84\x00\x00( \x00\x00\x00reference libFLAC 1.3.2 20170101\x00\x00\x00\x00\xff\xf8\xca\x08\x00(H\xff\xfe\x00\x01\x00\x02\xff\xfd\xff\xff\xb5\xe7[\x1a\x1308\x0f0\x04\x95\xc8lA\x151\x9b'
(...)
Ben
  • Where does this data stream come from? – Jakob Halskov Mar 24 '20 at 09:55
  • Please edit your question to explain what is transmitting this information, what you are using to capture it, the capture settings and provide a link to the user manual for the device. (Is it not obvious that we would need this information?) – Transistor Mar 24 '20 at 09:56
  • From an online server. It represents audio data in a FLAC format. I just don't know what I see exactly here.. is it binary? Is it ASCII? Do I have to convert it somehow? Is it already FLAC but just in binary?.. I'm totally confused. – Ben Mar 24 '20 at 09:57
  • That's better. You have a good chance of getting a sensible answer now. I am not familiar with FLAC so I probably can't help further. – Transistor Mar 24 '20 at 10:06
  • This is a text representation of binary data. \xda, for example, means 13*16+10 = 218. Everything that does not follow this \x12 pattern is an ASCII character. Look them up to find their binary values. – Janka Mar 24 '20 at 10:44

1 Answer

Your original sample is too small to tell whether it's FLAC or not. You need a longer sample, which you can then search for the bytes 0xff 0xf8 to see whether it contains FLAC block headers. Your additional sample from the beginning of the stream is definitely FLAC.

FLAC is a complex format for compressing audio (format overview), with many variations on sample size and so on. It's binary, and resyncable, which means that periodically there is some known marker.

FLAC streams begin with a FLAC header (starting with 66 4C 61 43, which is fLaC in ASCII) and have a FLAC block header at least every few thousand bytes (these headers begin with ff f8). There is a good general overview in the Wikipedia article.

Your original stream

Your stream is given as "ASCII with hexadecimal escapes", as might be used in many programming languages, including C, Python, PHP.

x\xdam\xdd]\xc4\xeemZ\xc7q)2\xcc\xd6cL\x1a1\x8c\xc9hc\x0c)\xeb^\xaf\xd7Z

As it's just binary, your stream is better represented as:

78 da 6d dd 5d c4 ee 6d  5a c7 71 29 32 cc d6 63
4c 1a 31 8c c9 68 63 0c  29 eb 5e af d7 5a 0a

(Easily seen in almost any Linux shell with echo -e "x\xdam\xdd]\xc4\xeemZ\xc7q)2\xcc\xd6cL\x1a1\x8c\xc9hc\x0c)\xeb^\xaf\xd7Z" | hd. The final 0a is the newline that echo appends; it is not part of your stream.)
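
The same conversion can be done in Python, which may be more convenient on a machine without a Linux shell. This is only a minimal sketch: `escaped_text_to_bytes` and `hex_dump` are helper names made up for this illustration, and the first helper assumes the data arrived as text containing literal backslash escapes. If you already hold a Python bytes object (printed as b'...'), no conversion is needed and only the dump applies.

```python
def escaped_text_to_bytes(text):
    """Turn ASCII text containing literal \\xNN escapes into the bytes it describes."""
    # unicode_escape interprets the escapes; latin-1 maps code points 0-255
    # one-to-one back onto single bytes.
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def hex_dump(data, width=16):
    """Format bytes as lines of two-digit hex values, `width` bytes per line."""
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        lines.append(" ".join("{:02x}".format(b) for b in chunk))
    return "\n".join(lines)


# The short sample from the question, written as a Python bytes literal.
stream = b"x\xdam\xdd]\xc4\xeemZ\xc7q)2\xcc\xd6cL\x1a1\x8c\xc9hc\x0c)\xeb^\xaf\xd7Z"
print(hex_dump(stream))
# 78 da 6d dd 5d c4 ee 6d 5a c7 71 29 32 cc d6 63
# 4c 1a 31 8c c9 68 63 0c 29 eb 5e af d7 5a
```

Unlike the shell version, no trailing 0a appears here, because nothing appends a newline to the data.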

Assuming your stream actually is FLAC, you will need a larger sample to decode it, as you are missing the header information. Without the header or block header information, it's impossible to reliably tell all the compression details.

Your longer sample

Your longer sample from the beginning of the stream is definitely FLAC. You can see this because it begins with 'fLaC' and is decodable by the program metaflac, part of the open-source program flac (download page).

We see this by putting your sample in a file (as binary) and then running metaflac to see if it can decode it. It can. The following is in a Linux shell:

# create SAMPLE file with echo -e or python or whatever
$ metaflac --list SAMPLE
METADATA block #0
  type: 0 (STREAMINFO)
  is last: false
  length: 34
  minimum blocksize: 4096 samples
  maximum blocksize: 4096 samples
  minimum framesize: 1013 bytes
  maximum framesize: 2009 bytes
  sample_rate: 48000 Hz
  channels: 1
  bits-per-sample: 16
  total samples: 239616
  MD5 signature: 095b92012ab7f2b5a0b75473897bce8a
METADATA block #1
  type: 4 (VORBIS_COMMENT)
  is last: true
  length: 40
  vendor string: reference libFLAC 1.3.2 20170101
  comments: 0
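
If metaflac is not available, the STREAMINFO values can also be read straight from the first bytes in Python. The following is a minimal sketch based on the field layout in the FLAC format definition, not a full metadata parser: `parse_streaminfo` and `SAMPLE` are names chosen for this illustration, and `SAMPLE` holds only the start of the longer sample from the question.

```python
# Start of the longer sample from the question (stream marker, STREAMINFO,
# VORBIS_COMMENT and the first four bytes of the first block header).
SAMPLE = b'fLaC\x00\x00\x00"\x10\x00\x10\x00\x00\x03\xf5\x00\x07\xd9\x0b\xb8\x00\xf0\x00\x03\xa8\x00\t[\x92\x01*\xb7\xf2\xb5\xa0\xb7Ts\x89{\xce\x8a\x84\x00\x00( \x00\x00\x00reference libFLAC 1.3.2 20170101\x00\x00\x00\x00\xff\xf8\xca\x08'


def parse_streaminfo(data):
    """Decode the STREAMINFO block that directly follows the fLaC marker."""
    if data[:4] != b"fLaC":
        raise ValueError("missing fLaC stream marker")
    block_type = data[4] & 0x7F               # low 7 bits: block type
    length = int.from_bytes(data[5:8], "big")  # 24-bit block length
    if block_type != 0 or length != 34:
        raise ValueError("first metadata block is not STREAMINFO")
    body = data[8:8 + length]
    if len(body) < length:
        raise ValueError("sample too short for STREAMINFO")
    # Bytes 10..17 pack four fields: 20 + 3 + 5 + 36 bits.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "min_blocksize": int.from_bytes(body[0:2], "big"),
        "max_blocksize": int.from_bytes(body[2:4], "big"),
        "min_framesize": int.from_bytes(body[4:7], "big"),
        "max_framesize": int.from_bytes(body[7:10], "big"),
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x07) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
        "md5": body[18:34].hex(),
    }


for name, value in parse_streaminfo(SAMPLE).items():
    print(name, value)
```

For this sample the decoded values agree with the metaflac listing (48000 Hz, 1 channel, 16 bits per sample, 239616 total samples).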

FLAC details

If you want to examine by hand (just to see if you have FLAC), you can use the definition and look for sync flags on the blocks.

A Sony sample file is 24-bit, 96 kHz sampling rate, and has portions which look like this:

00060099: ff f8 3b 1c 00 2c 10 0d e0 e0 a1 23 e5 a6 64 a2
00063511: ff f8 3b 1c 01 2b 14 ff ff eb 00 02 38 0f 20 0c
00067822: ff f8 3b 1c 02 22 14 fb 7b 42 fb 57 cd 0b 4c c6
00072135: ff f8 3b 1c 03 25 14 f4 3c b9 f3 e0 98 03 01 0d
00076338: ff f8 3b 1c 04 30 14 12 cf 29 12 b5 16 0b 2c ac

These are found just by looking for the sync pattern ff f8 in the file, and displaying 16 bytes from there: this is how we find block headers.
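
A search of this kind takes only a few lines of Python. This is a rough sketch rather than the tool used for the listing above: the function names are invented for the example, offsets are printed in decimal, and "SAMPLE.flac" is a placeholder file name.

```python
def find_sync_candidates(data, pattern=b"\xff\xf8"):
    """Return every offset at which the two-byte pattern occurs."""
    offsets = []
    pos = data.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(pattern, pos + 1)
    return offsets


def show_candidates(data, count=16):
    """Format each candidate as 'offset: first `count` bytes in hex'."""
    lines = []
    for offset in find_sync_candidates(data):
        chunk = data[offset:offset + count]
        hex_bytes = " ".join("{:02x}".format(b) for b in chunk)
        lines.append("{:08d}: {}".format(offset, hex_bytes))
    return lines


# Hypothetical usage with a file on disk:
# with open("SAMPLE.flac", "rb") as f:
#     print("\n".join(show_candidates(f.read())))
```

Every hit is only a candidate: the pattern can also occur inside the compressed audio data, so further checks on the bytes that follow are still needed.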

The beginning of the header is:

/-----ff------\ /------f8-----\ /-----3b------\ /-----1c------\
1_1_1_1 1_1_1_1 1_1_1_1 1_0_0_0 0_0_1_1 1_0_1_1 0_0_0_1 1_1_0_0
1 1 1 1 1 1 1 1 1 1 1 1 1 0                                     sync flag
                            0                                   reserved
                              0                                 blocking strategy (0 = fixed block size)
                                0 0 1 1                         1152 samples
                                        1 0 1 1                 96 kHz
                                                0 0 0 1         2 channels (stereo)
                                                        1 1 0   24 bit/sample
                                                              0 filler
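
The same bit-by-bit reading can be done in Python. The sketch below follows the field tables in the FLAC format definition but is deliberately partial: codes that mean "value stored elsewhere" or "reserved" simply come back as None, and `decode_header_start` is a name invented for this example. Note that the channel field stores the channel count minus one, so the code 0001 means two channels.

```python
SAMPLE_RATES = {
    0b0001: 88200, 0b0010: 176400, 0b0011: 192000, 0b0100: 8000,
    0b0101: 16000, 0b0110: 22050, 0b0111: 24000, 0b1000: 32000,
    0b1001: 44100, 0b1010: 48000, 0b1011: 96000,
}
SAMPLE_SIZES = {0b001: 8, 0b010: 12, 0b100: 16, 0b101: 20, 0b110: 24}


def decode_block_size(code):
    """Map the 4-bit block size code to a sample count (None if not direct)."""
    if code == 0b0001:
        return 192
    if 0b0010 <= code <= 0b0101:
        return 576 << (code - 2)
    if code >= 0b1000:
        return 256 << (code - 8)
    return None  # reserved, or stored at the end of the header


def decode_header_start(header):
    """Decode the first four bytes of a candidate block header."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    first16 = int.from_bytes(header[:2], "big")
    # 14-bit sync code, then a reserved bit that must be 0.
    if first16 >> 2 != 0b11111111111110 or first16 & 0b10:
        raise ValueError("no valid sync code")
    channel_code = header[3] >> 4
    if channel_code < 8:
        channels = channel_code + 1   # independent channels
    elif channel_code <= 10:
        channels = 2                  # stereo with inter-channel decorrelation
    else:
        channels = None               # reserved
    return {
        "variable_blocksize": bool(first16 & 1),
        "block_size": decode_block_size(header[2] >> 4),
        "sample_rate": SAMPLE_RATES.get(header[2] & 0x0F),
        "channels": channels,
        "bits_per_sample": SAMPLE_SIZES.get((header[3] >> 1) & 0x07),
    }


print(decode_header_start(bytes.fromhex("fff83b1c")))  # Sony sample
print(decode_header_start(bytes.fromhex("fff8ca08")))  # sample from the question
```

Applied to ff f8 ca 08 from the question's longer sample, this gives 4096 samples, 48 kHz, 1 channel and 16 bits per sample, which agrees with what metaflac reports for that stream.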

This is followed by the increasing block numbers 00, 01, 02, and the other portions of the header. (Note that block numbers above 127 take more than one byte, using a variable-length coding similar to UTF-8, so they look less obvious but still follow a recognisable pattern.)

Not every sequence beginning fff8 will be a block header, as some might be data. You have to check other details in the block header to be certain it is a header and not data. But what I've shown here is enough to show it's almost certainly a FLAC stream.

From the format definition (https://xiph.org/flac/format.html):

Since a decoder may start decoding in the middle of a stream, there must be a method to determine the start of a frame. A 14-bit sync code begins each frame. The sync code will not appear anywhere else in the frame header. However, since it may appear in the subframes, the decoder has two other ways of ensuring a correct sync. The first is to check that the rest of the frame header contains no invalid data. Even this is not foolproof since valid header patterns can still occur within the subframes. The decoder's final check is to generate an 8-bit CRC of the frame header and compare this to the CRC stored at the end of the frame header.

Again, since a decoder may start decoding at an arbitrary frame in the stream, each frame header must contain some basic information about the stream because the decoder may not have access to the STREAMINFO metadata block at the start of the stream. This information includes sample rate, bits per sample, number of channels, etc.

The FLAC reference continues with the exact details of all the bits, which was used to give the example above.
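
The CRC check mentioned in the quoted passage can be tried by hand as well. The sketch below assumes the CRC-8 from the format definition (polynomial x^8 + x^2 + x + 1, initial value 0) and, for simplicity, the shortest possible header: four fixed bytes, a one-byte block number (below 128) and then the CRC byte. `flac_crc8` and `short_header_crc_ok` are names made up for this example, and longer headers are not handled.

```python
def flac_crc8(data):
    """CRC-8 with polynomial 0x07, initial value 0, most significant bit first."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def short_header_crc_ok(header):
    """Check a 6-byte header: 5 bytes of content followed by its CRC-8.

    Assumes a one-byte block number and no extra block size or sample
    rate bytes; other header layouts need a longer slice.
    """
    if len(header) < 6:
        raise ValueError("need at least 6 bytes")
    return flac_crc8(header[:5]) == header[5]


# First candidate from the Sony listing and the first header in the
# question's longer sample.
print(short_header_crc_ok(bytes.fromhex("fff83b1c002c")))  # True
print(short_header_crc_ok(bytes.fromhex("fff8ca080028")))  # True
```

A candidate that passes this check is far more likely to be a real header than one found by the sync pattern alone, although a full decoder still validates the rest of the frame.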

Fully decoding a stream

In the first instance, use the open source program flac to see what you have. On this little sample, the program reports FLAC__STREAM_DECODER_ERROR_STATUS_LOST_SYNC (which is what we would expect for a small chunk from the middle of a stream.)

Once you have something you can make sense of, you can use the open source library libFLAC to do what you need.

jonathanjo
  • Thanks a lot for the basic but also detailed answer! I will study it deeply! – Ben Mar 24 '20 at 13:13
  • Already the first question: regarding the missing header, does it have to be present because the software requires it (for decoding or similar), or is it sufficient if I have some details like the sampling rate and so on? – Ben Mar 24 '20 at 13:18
  • the 2nd question: How did you convert the binary data? – Ben Mar 24 '20 at 13:36
  • If you post a longer segment of the data it will be clear whether it's FLAC or not from the block headers (or absence of block headers). If you know the sampling rate and other details, the block header will confirm that. – jonathanjo Mar 24 '20 at 16:10
  • Ok, thank you! Then it's probably better to use the header anyway (header means metadata, right?). I added the first 120 entries of the stream. I wonder about the presence of "libFLAC 1.3.2" within the stream.. – Ben Mar 25 '20 at 09:16
  • I updated answer with your longer sample. Yes, header is metadata, and so are the block headers. "libFLAC 1.3.2" in the stream is just a clue: it's actually the program which made the stream. – jonathanjo Mar 25 '20 at 10:59
  • Great answer! Well explained and very comprehensive! Thanks a lot again! I have one more question: I've seen the metaflac tool, but I need a tool within Python to make use of that audio stream. Preferably something that converts it into numerical values so I can use them for visualizations and so on. I thought of 'Mutagen', but when I try to use "FLAC()" I receive an error that it's an invalid start byte. And according to the documentation I'm not even sure if I can apply it already. Actually, I'm not sure what to do with FLAC audio given in ASCII with hexadecimal escapes..? – Ben Mar 25 '20 at 15:54
  • Thanks for compliment. Where are you getting the stream from? What OS are you using for your python? – jonathanjo Mar 26 '20 at 10:22
  • It's audio data from a machine in our company, we want to check if we can see some patterns in it. I'm using python 3.7. – Ben Mar 26 '20 at 12:58
  • @Ben I actually meant how do you get the audio stream? Have you got it in a file, or is it being captured somehow? And your python, running on linux/windows/macos/something else? – jonathanjo Mar 26 '20 at 13:00
  • Ah, sorry, I have to make a request to a server so I can get the data as a Python object. I could also download .flac files, but this is too complicated, and the data has to be merged with other data, so I need to use the data in principle as it is. I think I would be very happy if I could "just" convert it to.. an np.array or so? At the beginning, I didn't expect it would be so complicated even to read the data. I'm on a Windows machine, and I'm using Anaconda/Spyder for Python. – Ben Mar 26 '20 at 14:32