Being fooled by randomness
This is fine
"The data stream is compressed," said my collaborator. "Your application can read a block of compressed data, do whatever you like with it, and then read the next block."
"Sounds great," I replied. "What's the compression method?"
"zlib," they answered. "There's a block of compressed data, a separator sequence and then another block, another separator, and so on."
"What's the separator sequence?"
"The ASCII sequence SEPSEPSEP," they said.
Something tickled in the back of my mind. "Is that a good idea?" I asked. "What if the separator shows up in the data?"
"That's pretty unlikely, right? What's the probability of that string occurring in real binary data?"
I thought for a moment. "I think we can estimate this, actually. Well-compressed data is basically random, so isn't it just a 1 in 256 chance for each byte?"
"That sounds right."
"So, 256 to the ninth power, or one in 2^72," they said.
"I guess you're right," I admitted. "It seems wrong to rely on that, though."
"We can change it if you really want," they said. "But it would be a pain to do."
"No, a one in 256^9 chance works out to once every billion years for the volume we're considering," I said. "We'll just use what you've got."
A billion years later
"Hey," I said. "I think your data stream producer is broken."
"What do you mean? It's pretty simple," my collaborator responded, a little defensively. "Here it is in essence."
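In essence, it was something like this minimal Python sketch (the function and variable names here are illustrative, not the real code):

```python
import zlib

SEPARATOR = b"SEPSEPSEP"

def produce(messages):
    """Compress each message with zlib and join the blocks with the separator."""
    return SEPARATOR.join(zlib.compress(m) for m in messages)
```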
"That's about what I expected," I replied. "But I keep getting corrupt messages."
"Well, what does your consumer application look like?"
"Here it is," I said, now defensive myself:
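Something along these lines (again a minimal Python sketch rather than the real code):

```python
import zlib

SEPARATOR = b"SEPSEPSEP"

def consume(stream):
    """Split the stream on the separator and decompress each block."""
    return [zlib.decompress(block) for block in stream.split(SEPARATOR)]
```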
"That looks right. What's the problem?"
"Well, those messages were fine. But I keep getting ones that fail to decompress. Here's the hex string for such a message."
"Uh, look at those first few bytes. 53 45 50, isn't that... S E P in ASCII?"
"Yeah, you're right. I knew it! The separator is appearing in the data! Didn't I warn you about this?"
"Yeah, but then you also said it would only happen once every billion years. That was... yesterday."
"I think we, uh, miscalculated."
Everything is obvious (once you know the answer)
So what went wrong? It's clear in retrospect. We were expecting the data stream to look like this:
first_message | SEP SEP SEP | next_message
Because each message is compressed, and therefore effectively random, we were right that a particular string of nine bytes was unlikely to occur. However, we failed to consider the case where a message ended in the three byte string S E P. What we would want is this:
...SEP | SEP SEP SEP | next_message
But what we would get is this:
... | SEP SEP SEP | SEP next_message
That is, we'd split too early, and truncate the first message. The next message would then have a spare SEP prepended to it, so it would be corrupt too.
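The failure is easy to reproduce with plain bytes, skipping compression entirely (the trailing SEP here stands in for an unlucky compressed message):

```python
SEPARATOR = b"SEPSEPSEP"

# An unlucky first message whose last three bytes happen to be S E P.
first = b"first_message...SEP"
second = b"next_message"
stream = first + SEPARATOR + second

# A naive split matches the separator three bytes too early,
# truncating the first message and corrupting the second.
parts = stream.split(SEPARATOR)
```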
What we thought was a one-in-256^9 chance of corruption was actually a one-in-256^3 chance: a message merely needed to end with part of the separator sequence. Our billion years turned into a few dozen minutes. Whoops.
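The arithmetic of the mistake, spelled out:

```python
# What we assumed: all nine separator bytes must appear by chance
# inside effectively random compressed data.
assumed_odds = 256 ** 9        # one in 2^72 per position

# What actually mattered: a message only has to *end* with the
# three bytes "SEP" for the split to land too early.
actual_odds = 256 ** 3         # one in 2^24 per message

# The mistake made corruption 256^6 (about 2.8e14) times likelier.
error_factor = assumed_odds // actual_odds
```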
Do the right thing
What would have been a better way to produce the compressed stream? There are a number of cromulent answers.
One method would be to take advantage of the fact that the zlib format looks like this:
header | DEFLATE block | DEFLATE block | last DEFLATE block | checksum
Since you can tell which DEFLATE block is the last one (see the DEFLATE stream format), you know when you're at the end of a message. There's no need for separators at all!
A simplified reader based on this method is below:
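A minimal Python sketch of that reader might look like this. It leans on `zlib.decompressobj`, which stops at the end of one zlib stream (the final DEFLATE block plus checksum) and exposes whatever bytes follow as `unused_data`:

```python
import zlib

def read_messages(stream):
    """Split a byte string of back-to-back zlib streams into messages.

    Each zlib stream self-terminates at its last DEFLATE block, so no
    separator is needed between messages.
    """
    messages = []
    while stream:
        d = zlib.decompressobj()
        messages.append(d.decompress(stream))
        assert d.eof  # this sketch assumes each message is complete
        stream = d.unused_data
    return messages
```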
Another method would be to use gzip instead of plain zlib. It adds some framing on top of zlib, in particular a header and trailer. Most gzip tools allow you to simply concatenate compressed blocks together.
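In Python terms, peeling concatenated gzip members back apart might look like this sketch (in the zlib API, wbits=31 selects the gzip header/trailer framing):

```python
import zlib

def read_gzip_members(stream):
    """Recover each concatenated gzip member as a separate message."""
    messages = []
    while stream:
        d = zlib.decompressobj(wbits=31)  # 31 = gzip framing
        messages.append(d.decompress(stream))
        stream = d.unused_data
    return messages
```

Note that gzip tools generally treat the concatenation as one logical stream whose output is all the payloads run together, so if you need the message boundaries back you have to walk the members yourself, as above.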
A third would be to prepend each compressed block with its length. Then the reader knows exactly how much to read.
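A sketch of that length-prefix framing in Python (the 4-byte big-endian prefix is one arbitrary but common choice):

```python
import struct
import zlib

def frame(message):
    """Compress a message and prepend the block's 4-byte big-endian length."""
    block = zlib.compress(message)
    return struct.pack(">I", len(block)) + block

def read_frames(stream):
    """Parse back-to-back length-prefixed blocks from a byte string."""
    messages = []
    while stream:
        (length,) = struct.unpack(">I", stream[:4])
        messages.append(zlib.decompress(stream[4:4 + length]))
        stream = stream[4 + length:]
    return messages
```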
In the end
The lesson here is this: never believe anyone when they say "the bad thing probably won't happen."