Byte Ordering and File Formats

Posted 7 Apr 2000 at 14:42 UTC by caolan

Some simple thoughts on binary file formats: portable handling, structure description, and so on.

Current Reality

When writing wv, libwmf and other file format parsers I do a lot of reading of binary C structures, so I'd like to share some of my thoughts on it.

Microsoft code in particular is rife with fwrites and freads of structures, along with mmapping files and casting offsets within the file to pointers to structs. This all works perfectly fine for them, of course, because their compiler does not pad their structures and because all their data is little-endian. These restrictions are hardly acceptable for the rest of the world that has to deal with this sort of data.

Most graphics file and other storage converters and writers for unix etc. are strewn with workarounds. Some, like wine, take the approach of targeting only little-endian systems, require the compiler to be able to pack their structures without holes, and can then blithely use fread as well. OK, that works, but it's limiting. It is acceptable for wine, which never had any intention of being an emulator and emulating windows intel instructions on systems that do not have them, so the restriction creates no serious problems.

Other solutions just byteswap the data as it is read; most programs do this. wv, for instance, currently reads each individual struct member from a FILE * one by one, swapping if necessary on a big-endian platform. The advantage here is that it is perfectly and completely portable, as witnessed by the ports to VAX, AmigaOS and numerous unices, and it avoids the entire padding issue. Indeed the in-memory data types do not even have to be exactly the same size as the types used in the storage; as long as they are big enough it will work fine.

The problem here is that a member-by-member read is agonizingly slow in comparison to one efficient read from the file. So in many places in wv we do one large fread of the raw data into scratch memory and then fill the struct from it element by element, swapping as necessary. Again cross-platform, but unnecessary on systems which would have been able to read the data directly into the struct in the first place.
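
For illustration, here is roughly the shape of that pattern in C; the header struct and its on-disk layout are invented for the example:

#include <stdio.h>

typedef unsigned char  U8;
typedef unsigned short U16;
typedef unsigned int   U32;

/* a made-up on-disk header: 8 bytes, packed, little-endian */
struct header { U32 magic; U16 version; U16 flags; };

/* assemble little-endian values byte by byte: portable on any host */
static U32 read_lu32(const U8 *p)
{
    return (U32)p[0] | ((U32)p[1] << 8) | ((U32)p[2] << 16) | ((U32)p[3] << 24);
}

static U16 read_lu16(const U8 *p)
{
    return (U16)(p[0] | (p[1] << 8));
}

/* one big fread into scratch, then fill the struct member by member */
int read_header(FILE *fp, struct header *h)
{
    U8 scratch[8];

    if (fread(scratch, 1, sizeof scratch, fp) != sizeof scratch)
        return 0;
    h->magic   = read_lu32(scratch + 0);
    h->version = read_lu16(scratch + 4);
    h->flags   = read_lu16(scratch + 6);
    return 1;
}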

I had some speed stats for wv. I cannot track down the exact figures at the moment, but the general overview of the results was: single member-by-member reads with byteswapping are much slower than without byteswapping (surprise eh!); reading bigger structs in chunks is phenomenally faster than element by element, and the speed gain from reading in chunks makes any byteswapping losses insignificant in comparison; mmapping the file and either copying the data to a structure from there, or using it in place, is faster again, but only by very little indeed, though you do gain by not having to manage your own scratch space in the copy-from-mmap-to-struct case. Nothing too surprising of course; reading from physical media is by far the slowest operation for most activities.

What I'd prefer

The desirable solution as I imagine it is a simple mechanism that will generate the fastest and most elegant code for reading and writing binary structures of a given platform A on a platform B.

With a given C structure definition:

Given equal A and B endianness
    {
    Given compatible padding system
        direct freads and fwrites to/from structs
    else
        freads and fwrites to a scratch buffer,
        fall back to element by element copies between
            scratch and structs
    }
Given unequal A and B endianness
    {
    (padding compatibility no longer matters)
        freads and fwrites to a scratch buffer,
        fall back to element by element copies with byteswapping
            between scratch and structs
    }
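
The endianness half of that decision can be made at compile time in C. A sketch, assuming an autoconf-style WORDS_BIGENDIAN macro, a hypothetical PACKED_MATCHES_DISK define standing in for the padding probe, and the read_header fallback sketched earlier:

int read_header_fast(FILE *fp, struct header *h)
{
#if !defined(WORDS_BIGENDIAN) && defined(PACKED_MATCHES_DISK)
    /* equal endianness, compatible padding: one direct fread */
    return fread(h, sizeof *h, 1, fp) == 1;
#else
    /* otherwise fall back to scratch plus member-by-member copying */
    return read_header(fp, h);
#endif
}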

Fair enough; rpcgen and other IDL languages will do pretty much this sort of thing. My first thought was that it would be nice to be able to tag certain structs in a header and use rpcgen to generate the fastest read/write infrastructure automatically. rpcgen though uses big-endian layout as its internal standard endianness, which needlessly penalises the majority case of reading windows formats, and it's generally a bit crufty anyway.

The interesting thing now is that while this would work quite easily for fixed size structs, it runs into problems where we might have a struct such as

typedef struct _twiddle
{
    U32  noEntries; /* [1] */
    U32 *entries;
} twiddle;

which would have noEntries entries stored linearly. And then there are beauties such as

typedef struct _twiddle
{
    U8  fFlagsA;
    U8  fFlagsB;
    U8  therest:6;
    U8  *entries8;
    U16 *entries16;
} twiddle;
where each bit set in fFlagsA means that there is a U8 entry stored, each bit set in fFlagsB means there is a U16 entry stored, etc. etc.
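
Reading that by hand comes out something like the following sketch. It reuses the U8/U16/U32 typedefs from above, glosses over where exactly the therest bits live on disk (here assumed to be the low six bits of a third byte, which is exactly the kind of thing no compiler will promise you), and invents the function names:

#include <stdio.h>
#include <stdlib.h>

static int count_bits(unsigned char b)
{
    int n = 0;
    for (; b; b >>= 1)
        n += b & 1;
    return n;
}

int read_twiddle(FILE *fp, twiddle *t)
{
    unsigned char head[3], b[2];
    int i, n8, n16;

    if (fread(head, 1, 3, fp) != 3)
        return 0;
    t->fFlagsA = head[0];
    t->fFlagsB = head[1];
    t->therest = head[2] & 0x3f; /* assuming the 6 bits are the low ones */

    /* each set bit in a flag byte means one stored entry of that width */
    n8  = count_bits(t->fFlagsA);
    n16 = count_bits(t->fFlagsB);

    t->entries8  = n8  ? malloc(n8  * sizeof(U8))  : NULL;
    t->entries16 = n16 ? malloc(n16 * sizeof(U16)) : NULL;
    if ((n8 && !t->entries8) || (n16 && !t->entries16))
        return 0;

    if (n8 && fread(t->entries8, 1, n8, fp) != (size_t)n8)
        return 0;
    for (i = 0; i < n16; i++)
    {
        if (fread(b, 1, 2, fp) != 2)
            return 0;
        t->entries16[i] = (U16)(b[0] | (b[1] << 8)); /* little-endian */
    }
    return 1;
}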

We still have to handle these cases manually with the simple approach. It would be nice to have a simple IDL-style file format description language explicitly intended to describe existing styles of binary struct layout: flexible enough to describe the above two cases, but simple enough that rocket scientist capability is not required to use it. It should not penalize the implementation with undue extra baggage in situations where it is not necessary, i.e. fixed length structs on little-endian platforms should be implemented with a simple fread.

It would be very helpful to have a standardization/agreement effort in free software on a good way to describe binary file formats. It would allow efficient, error free, cross platform and cross language automatic boilerplate generation to serialise and deserialise common file formats. Plugged into a documentation system, it would give further benefits: a file format could be explained in, for instance, a DocBook document, and tools could extract the structures explained and described therein directly into working code.

Is there an existing system which can already be used for this? Is there a need for one? It offends my sense of what is right to generate code to read such structs manually, or even automatically but in a hodge podge, on a case by case basis. During wv I put together some dumbo awk and shell scripts to convert structs in the office 97 documentation into such code, but I found it unsatisfactory to consider the amount of wasted effort that will be repeated in the future for other similar cases.

[1] Inventing new names for datatypes whose size must be exact is something else that we all do and shouldn't have to; see DWORD, UINT32, u_int32 etc. for examples.


We need a language-independent type system, posted 7 Apr 2000 at 15:29 UTC by nether » (Journeyer)

Oh lordy, this is exactly the sort of thing I was about to write about.

I, too, have been wondering how come there is no portable solution for serialization and deserialization of arbitrary types of data.

This relates somewhat, at least in my mind, to the article about XML, and the ensuing discussion.

Anyway, I've been considering starting a project along these lines, but I haven't been sure whether there would be actual use for it.

The way I see it, the real problem is more generic than just how to go about dumping data into a file and back from it. The question is how to make programs, possibly on different architectures, possibly written in different languages, simply and consistently operate on the same kind of data and be able to communicate it to each other.

It would first seem that this is just the sort of thing that CORBA is good for. However, CORBA is not really about handling data, it's about handling objects. So CORBA has lots and lots of cruft that really is not necessary here. Moreover, it doesn't really help with writing data into a file, although the standard does define a standard way of serializing objects and other values over the wire.

Now, think on a higher level than just endianness and binary values. When we say "U32" what we really mean is an integer bounded in [0, 2^32[. That's the crucial point that we are interested in. Now, this is just an abstract concept, but if we have a mapping into a serialization method, we can say ....

So, here's the idea.

At the core there is the type system. The type system can be very very powerful, because it handles only data. No functions, no objects, no other problems with consistency. The type system is used to define the data types that we are interested in at an abstract level. We can say "an integer between 0 and 99", for example, or "either a tree or a leaf".

The actual syntax of the type definition language is not very essential. There will probably be many. In fact, if we can describe the type system _in_ the type definition language, we can do lots of nifty stuff and forget about syntax completely, because it will be determined by the serialization method.

Now, once we have a set of types defined in this language, we need the de/serialization capability. So we can have many marshalling protocols for representing values of the specified types within a file. CORBA's CDR is one example of a protocol of this kind, and, well, XML and sexps and whatnot could be used too, as well as a custom written binary protocol, such as something with compression or whatever.

The beauty of this is that we need only write a generic serialization protocol for the type system, and serializers for any constructed compound types can be systematically generated.
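
To make that concrete (everything below is invented for illustration, not a real API): in C, a described type could carry a table of its fields, each with a primitive writer, and the compound serializer then falls out mechanically:

#include <stdio.h>
#include <stddef.h>

typedef int (*write_fn)(FILE *fp, const void *value);

/* one entry per field of the compound type; names invented */
struct field { size_t offset; write_fn write; };

struct type_desc { int nfields; const struct field *fields; };

/* serialize any described compound value by walking its field table */
int serialize(FILE *fp, const struct type_desc *t, const void *value)
{
    int i;

    for (i = 0; i < t->nfields; i++)
        if (!t->fields[i].write(fp, (const char *)value + t->fields[i].offset))
            return 0;
    return 1;
}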

Now, we have a way of keeping the data in a file, but of course we need to read it into a program. So, we have language bindings.

A language binding defines how types and values in our abstract generic type system map into the types and values in the specified language. This again is familiar from CORBA, gtk, and whatnot.

Note that if we have defined multiple bindings, we could automatically generate stuff to convert values in one language into values in another language. So the system would have uses beyond just reading and writing files.

Sorry, this was probably a little confusing and most likely I couldn't convince anyone how utterly cool this system would be, but I'm in a bit of a hurry now. Will rant further later.

Standard formats exist. Let's use them more efficiently., posted 7 Apr 2000 at 15:35 UTC by Raphael » (Master)

In your article, you give a pointer to an entry in the C FAQ. That entry contains an indirect link to another one which lists some standard formats for (binary) data exchange. That list is of course far from complete, because there are many other file formats and network protocols that are more or less standard and can be used to exchange structured data. See also the previous discussions on Advogato about XML (although XML is a text format and not a binary format, some people have developed binary encodings for it, which ruins the point of the thing, but that is another story).

Anyway, when you look at these standards, you will probably think: "Eek! Reading ASN.1 B.E.R. data is awfully slow and complex!". Well... Nothing prevents us from taking a not-too-complex subset of the standard, and writing some routines for loading/saving these files efficiently. The tricks that you describe above (loading data in chunks, etc.) are only some optimizations for reading the data in the fastest way. Of course it is usually not possible to choose a file format that is optimal for all platforms at the same time... But as long as we pick a standard file format that does not prevent these optimizations for the most common case, then that should be enough. Every project would then be free to implement some code for reading/writing these files in the most efficient way on their platform of choice.

And anyway, these optimizations are not necessary most of the time. Taking Moore's law and Murphy's law into account, you will often prefer to have a simple text-based file format instead of a binary file format. The former makes debugging easier, and you will have some problems to debug. Gaining speed is only interesting if you can gain a factor of 10 or more, not a factor of 2. Otherwise, the speed of the processors will be faster by the time you have finished debugging your code.

Not trying to define a standardized new binary file format, posted 7 Apr 2000 at 15:53 UTC by caolan » (Master)

I would certainly look at using existing standards if and when I go to create my own data storage file format (though to be fair, I would never inflict another file format on the world if there were any possibility I could store the data in an existing format). Even perhaps XML, though we've had the XML discussion in relation to storing data which is inherently not suited to XML; I shudder to think of all the graphics that will be turned into some sort of uuencoded text so as to be stored embedded inside XML documents, for instance.

But my main focus is not on defining a standard mechanism to save binary data, but on a standard mechanism to read existing binary files: a sweet way to make that process simple, effective and standardized, with some software tools to do the dirty work.

In particular my main focus is on the filthy world of windows file formats, so I'd like a little descriptor language flexible enough to describe some of the mechanisms microsoft uses, for example to indicate the number of records of varying types in variable length structures.

C.

Text files. Text files. Text files., posted 7 Apr 2000 at 16:00 UTC by argent » (Master)

For things that are inherently big binary blobs like PNG or MP3, where there's a well defined, independently implemented binary standard, binary files are OK. Temporary files for big binary-hacking programs like geophysical data sets can also get away with binary. Or Oracle...

But yeah, I'd like to echo the previous message. Binary data formats should be a last resort. Plain text or XML is always preferred: EPS instead of Acrobat, HTML instead of binary data.

I'm in the middle of a discussion on this very subject on the Palm newsgroup. There's these people insisting that DOC (simple block-compressed text) on the Palm is "obsolete" and that proprietary binary formats that get a little better compression (less than 2:1 improvement) and are a little easier to parse are inherently better than HTML or some other SGML-ish format in DOC files.

When in doubt, use text files, and textual protocols. It's so gratifying when grokkable interfaces with fattening redundant information win out, like using HTML as well as, or even instead of, SNMP for router management.

Encoding and readability, posted 7 Apr 2000 at 16:03 UTC by ingvar » (Master)

Once upon a time, I had to write a routine (a portable routine, to be used on big- and little-endian machines with differing word sizes). Now, I had some constraints on my data (so this isn't a general approach): it was all numerical, it was all positive, it was of unbounded but finite length, and it always came in pairs (key and modulus, if that matters).

What I had internally was (basically) a:
struct number {
  unsigned int no_bits;
  unsigned long *data;
};

So, the output section was (basically, and in pseudo-code): "write the number of bits, in hex, then a line break; while there are unwritten bits in the number, write the most significant unwritten word, using %0<wordlength/4>x formatting, then write a line break".
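
In C that writer comes out roughly as below; it assumes the words in data are stored least significant first (the original may have differed) and glosses over error checking:

#include <stdio.h>
#include <limits.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

void write_number(FILE *fp, const struct number *n)
{
    unsigned int words = (n->no_bits + WORD_BITS - 1) / WORD_BITS;
    unsigned int i;

    /* number of bits, in hex, then a line break */
    fprintf(fp, "%x\n", n->no_bits);

    /* most significant unwritten word first, fixed-width hex */
    for (i = words; i > 0; i--)
        fprintf(fp, "%0*lx\n", (int)(WORD_BITS / 4), n->data[i - 1]);
}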

The reader was equally simple.

The code worked on a 386, it worked on a M68k, it worked on a HP-PA. What the lib vendor suggested was, instead, to just fread()/fwrite() the data to disk.

What I'm trying to say, I think, is

  • Keep the encoding as simple as possible, but no simpler.
  • Try to keep the encoding "machine-close", but still human-readable (e.g. use hex or octal, instead of decimal).
  • If you go for a description language, keep the syntax as clean as possible.

IDL's and code generators are not good enough, posted 7 Apr 2000 at 16:39 UTC by jop » (Journeyer)

My experience with automatically generated externalization (marshalling, serialization, pickling, whatever...) code has been far from satisfactory, mostly because the compiler never generates the data structure you want from the abstract specification (e.g. should a sequence be an array, a list<>, something custom?) or, the other way around, is not able to guess the abstract structure right (e.g. how would something as hairy as a "doubly linked list with sentry element" be automatically recognized as a sequence?). As a consequence, you end up converting back and forth between the structure you need for your algorithm and the structure the generator wants you to use, which is a major objection to using any IDL/mapping combination (e.g. CORBA).

In addition, I agree with you that using C (or C++) types alone as input to the code generator is simply not possible, even for simple structures. Expanding on your example, I think the general issue is that C/C++ programmers use the same syntactic construct to express several distinct concepts. For instance a pointer might be a mutable reference, an optional item, a variable length array, a polymorphic reference, etc. As such, annotation is definitely required even if my first argument is not considered.

This is, in short, the motivation for developing XTL, a C++ library which enables the programmer to do the required annotations for externalization without requiring any sort of pre-processing: it is all done with C++ itself! On the other hand, the result is obviously C++ specific, and I really don't see how it could be ported to other languages, although it is currently being ported to several compilers and architectures.

Regarding speed, static typing and inlining allow the compiler to produce code that for me has been fast enough, even if tricks like direct fread()/fwrite() of structs are not allowed (which I personally consider something one has to unlearn in order to program in anything other than C, or you end up with weird stuff like virtual function tables in your files ;-).

Separation of meaning and representation, posted 7 Apr 2000 at 17:13 UTC by nether » (Journeyer)

That is exactly the sort of thing that I would like to get solved right. The point is that the data structure definition must be as abstract and as high-level as possible. Forget about linked lists, think about _sequence_. All we need is a serializer for sequences, for example, one word to indicate the length followed by the elements, and a mapping between the sequence and some programming language-specific type. In some cases, it may be a list, in others, a vector. I would really like to get the semantics of the data represented in a very abstract fashion and have their binary representations, as well as language bindings, as completely separate modules.
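
For instance (the wire format here is made up, but in the spirit of the above), a single C routine can serialize any sequence as a length word followed by the elements, with the element writer passed in:

#include <stdio.h>
#include <stddef.h>

typedef int (*write_elem_fn)(FILE *fp, const void *elem);

/* invented wire format: a 32-bit little-endian length word, then the
   elements in order */
int write_sequence(FILE *fp, const void *elems, size_t count,
                   size_t elem_size, write_elem_fn write_elem)
{
    unsigned char len[4];
    size_t i;

    len[0] = (unsigned char)(count & 0xff);
    len[1] = (unsigned char)((count >> 8) & 0xff);
    len[2] = (unsigned char)((count >> 16) & 0xff);
    len[3] = (unsigned char)((count >> 24) & 0xff);
    if (fwrite(len, 1, 4, fp) != 4)
        return 0;

    for (i = 0; i < count; i++)
        if (!write_elem(fp, (const char *)elems + i * elem_size))
            return 0;
    return 1;
}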

That XTL link doesn't seem to work, btw.

More on separation of meaning and representation, posted 7 Apr 2000 at 17:41 UTC by jop » (Journeyer)

I also think that separation of meaning (i.e. external representation) and (internal) representation is required. However, general purpose mappings between the two are not useful, as the mapping between them is an n-to-n relation, as I tried to explain with the sequence example. As such, I don't think it is possible to have standard language bindings and still be efficient. What is needed is an API that lets you do your own mapping with as little effort as possible. And that is what I tried to achieve with XTL.

BTW, thanks for the hint, the link is now working again. :)

There's a reason it hasn't been done, posted 7 Apr 2000 at 20:02 UTC by Rasputin » (Journeyer)

If I understand correctly what you're trying to accomplish, you would like some variety of definition language that supplies the mapping between stored binary data representations (i.e. serialized data) and the language-based structures that would use them. As is mentioned elsewhere in this discussion, this has been done in a variety of ways for specific instances. The problem is that this is actually a very wide set of problems for which, as far as I have ever been able to find out, there is no general case solution.

What you're really asking for is a language that will allow the mapping of an arbitrary data representation to another arbitrary data representation, given the meta-data of the initial state and the semantics of the final state. You are, unfortunately, short a couple of data points here. If you make an initial assumption about one end of the transition (e.g. the final data structure will be of the form length, count) then you can do this, because you have a meta-data to meta-data mapping. Equally important is that you also now have a many-to-one relationship.

This is actually a very simplistic way of describing the problem, but I think you can see where I'm going. Where this problem is solvable it is in specific and limited ways. The general case is much more intractable, but may, of course, eventually be solvable. I would be willing to offer a small amount of my time to assist in a limited case implementation (e.g. endian/padding independent to sequential structures), but having fought the larger battle in the International Standards arena, I'm unwilling to go further down that path.

I suspect I might have simplified this to the point where it stopped making sense in the context of the original article. Unfortunately, I don't have the time to expand (I have a database doing a marvelous impression of a smoking hole to deal with). If that turns out to be the case, I'll follow up with more detail when time allows.

There exists a third possibility...that I've completely misunderstood what it is you're trying to accomplish. In that case, disregard all after I ;)

Re: Standard formats exist. Let's use them..., posted 9 Apr 2000 at 10:12 UTC by rth » (Master)

The pointer Raphael gave is a good one. If you really feel the need to use a binary format, you'll be hard pressed to do better than netCDF or HDF. This is not exactly a new problem you've stumbled upon.

Re: Not trying to define..., posted 9 Apr 2000 at 20:02 UTC by rth » (Master)

Ho hum. So read all the replies before replying next time...

To make up for it, here are some things that aren't a problem, assuming we're talking about GCC:

Structure packing

Use __attribute__((packed)) as needed to force the use of unaligned members. For instance

struct foo { char x; short y; int z; } __attribute__((packed));
has size 7 (for all targets worth mentioning; discussion of type sizes later). The compiler will emit unaligned loads in whatever form necessary for the target. Note that the compiler does know that ia32 can perform the unaligned load in hardware, and so does nothing special.

For slightly better performance on particular structures on systems that don't do unaligned loads in hardware, use packed on individual structure members:

struct bar { int a; int b; char c; int d __attribute__((packed)); };
This differs from foo in that the structure as a whole has alignment 4, rather than alignment 1. This does affect how the structure may be aligned in memory, but if you're doing freads of individual structures you don't care. Anyway, the knowledge that a and b are sufficiently aligned allows the compiler to always emit direct loads.

Other alignments

This isn't an issue with Windows structures, but there are other assorted padding alternatives you may encounter. Half-word alignment and over-alignment are the most common here. Let's say you want something like

struct baz { char e; char pad; int f; };
i.e. 2-byte alignment for f. That can be accomplished by using another attribute:
struct baz { char e; int f __attribute__((packed, aligned(2))); };
Note that aligned will not lower a member's alignment below its natural alignment without also using packed.

Type sizes

Ideally you'd use ISO C99's stdint.h, in particular int_least8_t, int_least16_t, uint_least8_t etc. But ISO C99 is probably not something you've encountered on your favourite Amiga, so an alternative is required. For GCC you can reference specific "machine modes" like so:

typedef int signed8 __attribute__((mode(QI)));
typedef int signed16 __attribute__((mode(HI)));
typedef unsigned int unsigned32 __attribute__((mode(SI)));
typedef unsigned int unsigned64 __attribute__((mode(DI)));
The original meanings of QI, HI, etc from the gcc 1.0 days are gone; they now mean 1, 2, 4 and 8 units respectively, where unit is the smallest addressable element on the target. Which is of course a byte on all current desktop hardware. (Yes, GCC does work on word-addressable machines; yes, you're in for a world of hurt porting to a bit-addressable machine.)
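
A cheap safety net to go with these: break the build if a size assumption turns out wrong, via the old negative-array-size trick (the typedef names here are mine):

typedef unsigned int unsigned32 __attribute__((mode(SI)));
/* array size is 1 if the assumption holds, -1 (a compile error) if not */
typedef char assert_unsigned32_is_4[sizeof(unsigned32) == 4 ? 1 : -1];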

Pointers in structures

We were talking about an on-disk format, right? Surely there are no embedded pointers per se, only file offsets or following arrays. Surely your example really should have been

typedef struct _twiddle {
    U32 noEntries;
    U32 entries[N];
} twiddle;
In which case you define the structure's array with length 1 instead of N, tweak the total size calculation (see the sketch below), and life is golden. If you've got either 8-bit array elements or 16-bit array elements, but not both, make entries a union.
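
That is, something along these lines (a sketch of the pre-C99 trick):

/* with entries declared as U32 entries[1]: allocate the header plus
   room for n entries in one block */
twiddle *t = malloc(sizeof(twiddle) + (n - 1) * sizeof(U32));
t->noEntries = n;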

Byte ordering

Tougher. You could do worse than to steal the asm/byteorder.h bits from the Linux kernel sources. If you're going to be manipulating the structures a lot once read, byteswapping the data once up front can pay off. Also note that the le32_to_cpup macros typically won't work on unaligned data.
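
The rough shape of those bits (a sketch, not the kernel's actual macros; WORDS_BIGENDIAN again stands in for your config system):

#define SWAB32(x) ((((x) & 0x000000ffUL) << 24) | \
                   (((x) & 0x0000ff00UL) <<  8) | \
                   (((x) & 0x00ff0000UL) >>  8) | \
                   (((x) & 0xff000000UL) >> 24))

#ifdef WORDS_BIGENDIAN
#define le32_to_cpu(x) SWAB32(x) /* disk is little-endian: swap */
#else
#define le32_to_cpu(x) (x)       /* already matches: no-op */
#endif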

Bit fields

Here you're screwed. There are no compiler knobs you can turn to get things laid out the same way on all systems. If at all possible use a 32-bit word with explicit bit masking. Otherwise you'll have to resort to by-member copying.
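
For example, with an invented field layout and a le32_to_cpu like the one above:

/* read the whole word, swap once, then mask the fields out explicitly;
   the layout here is made up for the example */
U32 word = le32_to_cpu(raw);

unsigned fFlagsA = word         & 0xff; /* bits  0-7  */
unsigned fFlagsB = (word >> 8)  & 0xff; /* bits  8-15 */
unsigned therest = (word >> 16) & 0x3f; /* bits 16-21 */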

r~

Network Data Representation: DCE/RPC, posted 12 Apr 2000 at 07:14 UTC by lkcl » (Master)

I've been dealing with exactly this issue since mid-1997. There are well over two hundred different data structures in Samba that are sent over-the-wire, and Samba compiles and runs on HP/UX, Sparcs, RS/6000, x86, MIPS, CRAY, just to name a few.

NDR is designed to handle arrays (fixed and variable length), pointers (references and real pointers), structures, uint8,16,32, floating point, the whole damn lot.

The point of NDR in DCE/RPC is that it can be used to transmit data over-the-wire: the receiver is told what the byte order is, and the *receiver* is expected to "make good" (the assumption being that this keeps the sender's NDR-generating code simpler).

Coding examples, see http://samba.org/cgi-bin/cvsweb/samba/source, then: rpc_parse/*.c and include/rpc_*.h

Good starting points: create_user_creds() in rpc_parse/parse_creds.c and get_user_creds() in msrpc/msrpcd_process.c

Another one, for those people interested in storing binary data C-structures (complex ones), in gdbm or tdb: lib/vuser_db.c, see tdb_store_vuid(), lookup and delete. This is used to store, retrieve and delete samba-user information based on a lookup key (the smbd process-id and an internal smb vuid).

NDR is *normally* auto-generated from IDL. In this case, we've spent two years writing 40,000 lines of code by hand. We *are* working on a project called sidlc (Samba IDL compiler): it will generate the NDR marshalling / unmarshalling code.

NDR and DCE/RPC are well-thought-out, and if in this brief post it looks like it won't do what you think, well blame me for having r.s.i. NDR and DCE/RPC are *well* worth investigating.

Network Data Representation: DCE/RPC, posted 12 Apr 2000 at 07:33 UTC by lkcl » (Master)

typedef struct _twiddle {
    uint32 num_entries1;
    uint32 num_entries2;
    uint32 *entries;
} twiddle;

OK, I decided to write some code for you to do this, based on the samba source. Not compiled. We have so many of these in the samba source now that it's boring to write, just cut/paste; it took me five minutes. There are definite benefits to doing this auto-generated: it would have taken thirty seconds.

/*******************************************************************
reads or writes a structure.
********************************************************************/
BOOL parse_io_r_twiddle(char *desc, twiddle * r_u,
                        prs_struct *ps, int depth)
{
        uint32 i;
        fstring tmp;
        uint32 ptr_entries;

        if (r_u == NULL)
                return False;

        prs_debug(ps, depth, desc, "parse_io_r_twiddle");
        depth++;

        if (UNMARSHALLING(ps))
        {
                ZERO_STRUCTP(r_u);
        }

        prs_align(ps);

        prs_uint32("num_entries1", ps, depth, &(r_u->num_entries1));

        /* looks weird, but copes with both marshalling and
           unmarshalling */
        ptr_entries = r_u->entries != NULL ? 1 : 0;
        prs_uint32("ptr_entries ", ps, depth, &ptr_entries);

        if (ptr_entries != 0)
        {
                prs_uint32("num_entries2", ps, depth,
                           &(r_u->num_entries2));

                if (r_u->num_entries2 != r_u->num_entries1)
                {
                        /* RPC fault */
                        return False;
                }

                if (UNMARSHALLING(ps))
                        r_u->entries = g_new(uint32, r_u->num_entries2);

                if (r_u->entries == NULL)
                {
                        DEBUG(0, ("NULL entries in parse_io_r_twiddle\n"));
                        parse_free_r_twiddle(r_u);
                        return False;
                }

                for (i = 0; i < r_u->num_entries2; i++)
                {
                        slprintf(tmp, sizeof(tmp) - 1, "rid[%02d] ", i);
                        prs_uint32(tmp, ps, depth, &(r_u->entries[i]));
                }
        }

        if (MARSHALLING(ps))
        {
                /* storing. memory no longer needed */
                parse_free_r_twiddle(r_u);
        }

        return True;
}

/*******************************************************************
frees a structure.
********************************************************************/
void parse_free_r_twiddle(twiddle * r_l)
{
        if (r_l->entries != NULL)
        {
                free(r_l->entries);
                r_l->entries = NULL;
        }

        r_l->num_entries1 = 0;
        r_l->num_entries2 = 0;
}

HTML, posted 12 Apr 2000 at 07:59 UTC by lkcl » (Master)

uh... the Elitism Argument Wins. advogato sucks. i have to learn html, now, to post code-fragments on advogato??? *yuck*. and who was it talking about "binary" formats being too much? well, html is too much! give me straight, unadulterated text, any day. Down With HTML, especially pre and /pre!
