Binary I/O and Filesystems

CS 321 2007 Lecture, Dr. Lawlor

So a file's full of bytes. You don't want bytes. You first want to stick bytes together to make ints, doubles, and the other types in your program. You then want to stick those together into structs, like "std::string". These structs need to be laid out into data structures, which all need to get stored on disk somehow.

Let's take those one at a time.

Storing Ints as Bytes

Say you've got the hex value 0xa0b1c2d3. The obvious way to store this 32-bit int in memory is using 4 bytes, as follows:

Byte:	0	1	2	3
Value:	0xa0	0xb1	0xc2	0xd3

This is called "big-endian" notation--the first byte is the big end of the int. Many CPUs, such as the Motorola 68000, SPARC, and PowerPC, have used big-endian storage.

Not every CPU is big-endian, though: the ancient VAX and the modern x86 are the best-known exceptions. Sadly, those two are exceedingly popular machines, which store that same int 0xa0b1c2d3 in "little-endian" notation:

Byte:	0	1	2	3
Value:	0xd3	0xc2	0xb1	0xa0

The difference between big and little endian machines ("endianness") stinks, but that's life. You can verify what's happening by writing out an int and reading bytes, or by copying memory between byte and int variables:

#include <string.h>

int foo(void) 
{
	unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3};  // 4 bytes (unsigned, so 0xa0 fits)
	int i=0; // 1 int
	memcpy(&i,&b[0],sizeof(i)); // copy bytes from b to i
	return i;
}

On a little-endian x86 machine, this program will print 0xd3c2b1a0.
On a big-endian PowerPC machine, this program will print 0xa0b1c2d3.
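
The same memcpy trick can tell you at run time which kind of machine you are on. This is only a minimal sketch; the function name is_little_endian is made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Returns true if this machine stores the low byte of an int first.
bool is_little_endian() {
	std::uint32_t i = 0xa0b1c2d3u;
	unsigned char b[4];
	std::memcpy(b, &i, sizeof(i)); // copy the int's bytes out of memory
	return b[0] == 0xd3;           // little-endian: the low byte comes first
}
```

On an x86 machine this should return true; on a big-endian PowerPC it should return false.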

Note that this is a function of how the CPU stores an "int" in the bytes of memory, so you have the exact same endian-dependent situation with files:

#include <fstream>

int foo(void) 
{
	// Write out 4 bytes into this file:
	std::ofstream fo("test.bin",std::ios_base::binary);
	unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3};  // 4 bytes (unsigned, so 0xa0 fits)
	fo.write((char *)&b[0],4);
	fo.flush(); //<- else fo leaves our bytes in output buffer!
	
	// Read those same bytes out as a binary int:
	std::ifstream fi("test.bin",std::ios_base::binary);
	int i=0;
	fi.read((char *)&i,sizeof(i));
	return i;
}

Stupid ways to deal with endianness

Ignore it. Hope that everybody for the rest of time will use 32-bit little-endian systems.

Hardcode a byte swap into your program, like this. This is stupid because NOT every machine will need a byte swap, and byte swapping is slow and ugly. It's also way too easy to accidentally swap bytes twice, or corrupt the in-memory data.

#include <algorithm> // for std::swap
#include <fstream>

int foo(void) 
{
	// Write out 4 bytes into this file:
	std::ofstream fo("test.bin",std::ios_base::binary);
	unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3};  // 4 bytes (unsigned, so 0xa0 fits)
	
	// Watch out!!!!!!!!   Gotta swap bytes!!!!!!!
	for (int k=0;k<2;k++) std::swap(b[k],b[3-k]);
	
	fo.write((char *)&b[0],4);
	fo.flush(); //<- else fo leaves our bytes in output buffer!
	
	// Read those same bytes out as a binary int:
	std::ifstream fi("test.bin",std::ios_base::binary);
	int i=0;
	fi.read((char *)&i,sizeof(i));
	return i;
}

Smart ways to deal with endianness

Don't use binary data. For example, normal ASCII output, like std::cout<<i<<" "<<j<<" "<<k<<"\n"; does not have endianness problems, because it's ASCII data. Unfortunately, ASCII is slow (see below), and while easy for humans, it's tricky for machines to work with.
Use a good library, like HDF4, which handles endianness inside the library and automatically converts the data when needed.

Use a smart reader class, like below, which always has the in-memory and on-disk layout you need. For example, this class will always be 32 bits long, and always have big-endian byte layout, even if your machine's "int" has none of those things:

typedef unsigned char io_byte;
class Big32 { //Big-endian (network byte order) 32-bit integer
	io_byte d[4];
public:
	Big32() {}
	Big32(unsigned int i) { set(i); }
	operator unsigned int () const { return (unsigned int)d[3]|((unsigned int)d[2]<<8)|((unsigned int)d[1]<<16)|((unsigned int)d[0]<<24); }
	unsigned int operator=(unsigned int i) {set(i);return i;}
	void set(unsigned int i) { 
		d[3]=(io_byte)i; 
		d[2]=(io_byte)(i>>8); 
		d[1]=(io_byte)(i>>16); 
		d[0]=(io_byte)(i>>24); 
	}
};

That last technique, the magic class, is by far my favorite. The biggest advantage of this is that now if you have fourteen things you need to store on disk, you can make a new class out of Big32 objects, and the new class will also have a known on-disk byte layout:

class stuffpile {
public:
	Big32 foos;
	Big32 bars[11];
	Big32 baz,boz;
};

"stuffpile" objects can now be written and read easily and portably as bytes, just like Big32s.

Argh! I Hate Binary! Why not just use ASCII?

ASCII really is fine if you don't care too much about:

File size. ASCII can be much larger than the equivalent binary value. "123456789 " is ten characters, while any 32-bit binary int, such as 0x12345678, is four bytes. (Be careful, though! Since "1 " is two characters, sometimes ASCII is smaller!)
I/O speed. On my machine using the same "ofstream" object, ASCII output is about 4x slower than binary output--about 10MB/s instead of 40MB/s. ASCII is hence slower than your disk drive!
Security. ASCII has a huge number of "special characters" that can cause trouble during reads (think about newlines, escape characters, high-ASCII/Unicode, and so on). Binary input is usually much safer.
Seeking. Since every int written in ASCII is a different number of bytes, you can't just jump to the i'th int. Since every int written in binary is the same size, you can just jump to the i'th int, with something like "fi.seekg(i*sizeof(int));"

Unfortunately, we often do care about all four of these things. Hence it's important for you to learn about reading binary files.

Real-life complicated binary files

A real binary file usually has an interesting structure. The first thing in the file is a "header". This is a sequence of stuff at known locations. For simple files everything in the file is at a known fixed location, but real life is rarely simple. Instead, often the header will give the file locations ("offsets") to where you can find the other stuff in the file.

Example: EXE file format

Modern Windows executables are in the "PE" format (Portable Executable). They start with an old MS-DOS program header, but that data isn't used anymore (it's just a tiny DOS program that prints "This program can't be run in DOS mode"). To find the real executable info, you jump to byte 0x3c in the file (with a seek) and then read 4 bytes, which are a little-endian byte offset. At this byte in the file (again, you get there with a seek), there's a whole struct full of information about the program. Here's a complete example program. Through the use of the "lil32" class, this program can run on any machine, not just little-endian Windows machines.

Example: FAT file system

Any filesystem is just a big binary data structure sitting on your disk. One common filesystem, used in USB keychain drives, floppy disks, and old hard disks, is the "File Allocation Table" filesystem.

The first thing on disk is the FAT "boot sector", which tells you how many entries are actually in the FAT. Then comes the FAT itself (read the Wikipedia article, it's good!). Then come the blocks of data in the normal user files sitting on the disk. Because the boot sector, FAT, and user data blocks are all a known size, the OS can directly seek (the disk) to a particular location to read a particular file.
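
As a rough illustration of "seek to a known location", here is a minimal sketch that works out where the FAT sits from the boot sector bytes. It assumes a FAT12 or FAT16 disk (FAT32 keeps its FAT size in a different, wider field), and the names lil16, FatLocation, and locate_fat are made up for this example. The boot sector fields are little-endian, so they are assembled byte by byte.

```cpp
typedef unsigned char io_byte;

// Read a little-endian 16-bit value starting at p, on any host byte order.
unsigned int lil16(const io_byte *p) { return p[0] | (p[1]<<8); }

struct FatLocation {
	unsigned long start;  // byte offset of the first FAT from the start of the disk
	unsigned long bytes;  // size of one copy of the FAT, in bytes
	unsigned int copies;  // number of FAT copies stored back to back
};

// boot must point at the first sector of the disk (at least 24 bytes of it).
FatLocation locate_fat(const io_byte *boot) {
	unsigned long bytes_per_sector = lil16(boot+11); // usually 512
	unsigned long reserved_sectors = lil16(boot+14); // sectors before the FAT
	unsigned long sectors_per_fat  = lil16(boot+22); // FAT12/FAT16 field
	FatLocation loc;
	loc.start  = reserved_sectors*bytes_per_sector;
	loc.bytes  = sectors_per_fat*bytes_per_sector;
	loc.copies = boot[16];
	return loc;
}
```

With loc.start in hand, reading the FAT is just a seek to that byte offset followed by a read of loc.bytes bytes.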