19 May 2003 bluefoxicy   » (Observer)

First I'd like to say, I am not affiliated with Xiph or xiph.org at all. That said...

Well, I'm working on an Ogg structure for fcomp2 ( http://fcomp.sourceforge.net/__fcomp2.plan ). I'll have to modify the format a bit. Here's what I'm thinking right now. This is with N frames of compressed data (one for each file) and assuming m Master Dictionaries.

Now I'm starting to see. Streams. heh.

First off, each Stream starts with 4 bytes of data that identify its type. If the Attribute or File List Streams are damaged, other mirror streams can be located in this way.

The Ogg streams for an archive are interlaced. Citing the Ogg RFC:

      -------------------------------------------------------------
      |*A*|*B*|*C*|A|A|C|B|A|B|#A#|C|...|B|C|#B#|#C#|*D*|D|...|#D#|
      -------------------------------------------------------------
       bos bos bos             eos           eos eos bos       eos

Our interlacing scheme is simple: BOS each stream. Then for each stream IN ORDER, place the entire stream contiguously. Then place an EOS. This is a way to "fool" Ogg into doing a sort of self-contained Interleaved-Chained stream (i.e. [A][b][endb][c][endc][ENDA] is illegal, so it's [A][b][c][bdata][bdata][ENDb][cdata][cdata][ENDc][ENDA], which is legal). You CAN interlace however you want but why bother slowing down the decoding process? It would (preferably) look like this:

      -------------------------------------------------------------
      |*A*|*B*|*C*|B|B|B|B|B|B|#B#|C|C|C|C|C|C|C|#C#||#A#|
      -------------------------------------------------------------
       bos bos bos             eos               eos  eos

This is because there is no reason to interleave; we're not streaming video and audio or anything.


Stream 0: Start Stream

This indicates the start of an interlaced Ogg fcomp2 stream. Our magic number is 0x21050E17, or {'!', ('f' - 'a'), ('o' - 'a'), ('x' - 'a')}; so 21 05 0E 17.

It has no data in it, just a BOS and EOS.


Stream 1: Attribute Stream

The first and last stream are Attribute Streams. They store the following information:

Magic Number -- 0x00000000

Stream Numbers of File List Streams -- There are usually 2 File List Streams. Sometimes there's more. These are in a set. This gives the Serial Numbers of each stream.

Stream Numbers of Attribute Mirrors -- Attribute Mirror serial numbers. If this attribute stream is corrupted, these are read and seeked to first, and their magic numbers checked, to avoid a long-winded full physical stream scan (if possible).

Master Dictionary Index -- Gives a list of Master Dictionary numbers, and a list of stream serial numbers for each of them.

May itself be gzip'd. If it's gzip'd, its magic number is 0x80000000


Stream 2: File List Stream

The File List Stream stores the path/file name/stream serial number/Master Dictionary of each file. Each entry is stored as this.

Magic Number: 0x00000001

May itself be gzip'd. If it's gzip'd, its magic number is 0x80000001


Stream 3..(m+2): Master Dictionaries (originals)

These don't exist if you have no Master Dictionaries. If they do, they contain both the data and the length for each string in the dictionary. This means that the first occurance in the fcomp2 compressed stream of a dictionary entry CAN be replaced with a dictionary index.

Master Dictionaries store both the data AND the length; the indicies are determined during the reading of the dictionary.

Magic Number: 0x00000002

May itself be gzip'd. If it's gzip'd, its magic number is 0x80000002


Stream X..(X+N-1): Compressed Files

The compressed files can go into any streams. Spewed or interleaved throughout are any mirrors of Master Dictionaries, the Attributes Streams, and the File List Streams that you want for error recovery.

I've got to write out an adapted fcomp2 storage format for this. For now, we'll go with the following:

Magic Number: 0x00000003

Master Dictionaries -- A list of Master Dictionaries. If more than one is specified, then each is appended to the end of the dictionary as it is read. Also, the stream may have its own dictionary, which is appended to the end of this.

Dictiorary -- This dictionary is appended to the end of the final dictionary made by the Master Dictionaries. If it exists anyway.

Huffman Encoding information -- Flags and values to tell how to work the Huffman Encoding.

Compressed Stream -- The compressed stream. This uses all of the above to decode to the original stream.

May itself be gzip'd. If it's gzip'd, its magic number is 0x80000003. This is provided for consistency, i.e. the code can just look for the MSB to be on in the Magic Number to see if it's gzip'd. Don't do this. :)

Second to Last Stream -- File List Mirror

Mandatory File List Mirror on the end of the physical stream, for error control.

Final Stream -- Attributes Mirror

Primary Attributes Mirror. If the Attributes Stream is dead, then this should exist at least, as a failsafe, even if no other mirrors exist.



The above should provide enough to allow seeking through the stream to any file; a type of "Solid" compression (shared dictionary; real "Solid" would be a compressed tarball); and enough to actually stream by sending the Attributes and File List, allowing a client to select a file, and then sending the Master Dictionary and File Streams. This is especially good for compressing entire web sites and letting a web server supply these to a (compatible) browser, as the server can find everything from the browser request.

--Bluefox Icy

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!