Binary I/O and Filesystems
CS 321 2007 Lecture, Dr. Lawlor
So a file's full of bytes. You don't want bytes. You first
want to stick bytes together to make ints, doubles, and the other types
in your program. You then want to stick those together into
structs, like "std::string". These structs need to be laid out
into data structures, which all need to get stored on disk somehow.
Let's take those one at a time.
Storing Ints as Bytes
Say you've got the hex value 0xa0b1c2d3. The obvious way to store this 32-bit int in memory is using 4 bytes, as follows:
Byte:  | 0    | 1    | 2    | 3    |
Value: | 0xa0 | 0xb1 | 0xc2 | 0xd3 |
This is called "big-endian" notation: the first byte is the big end of
the int. Many CPUs in the history of computers, such as the Motorola
68000, SPARC, and PowerPC, have used big-endian storage.
Sadly, some exceedingly popular machines, including the ancient VAX and
the modern x86, store that same int
0xa0b1c2d3 in "little-endian" notation:
Byte:  | 0    | 1    | 2    | 3    |
Value: | 0xd3 | 0xc2 | 0xb1 | 0xa0 |
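Both byte orders can also be produced explicitly with shifts, independent of the CPU the code runs on. The following is a minimal sketch; the function names to_big_endian and to_little_endian are made up for illustration and are not part of the lecture's code.

```cpp
#include <array>
#include <cstdint>

// Split a 32-bit value into 4 bytes, most significant byte first (big-endian).
std::array<std::uint8_t, 4> to_big_endian(std::uint32_t v) {
    return {{ static_cast<std::uint8_t>(v >> 24),
              static_cast<std::uint8_t>(v >> 16),
              static_cast<std::uint8_t>(v >> 8),
              static_cast<std::uint8_t>(v) }};
}

// Same value, least significant byte first (little-endian).
std::array<std::uint8_t, 4> to_little_endian(std::uint32_t v) {
    return {{ static_cast<std::uint8_t>(v),
              static_cast<std::uint8_t>(v >> 8),
              static_cast<std::uint8_t>(v >> 16),
              static_cast<std::uint8_t>(v >> 24) }};
}
```

Because shifts operate on the value rather than on its bytes in memory, these functions should give the same answer on any machine.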
The difference between big and little endian machines ("endianness")
stinks, but that's life. You can verify what's happening by
writing out an int and reading bytes, or by copying memory between byte
and int variables:
#include <string.h>
int foo(void)
{
    // unsigned char: values above 0x7f do not fit in a signed char
    unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3}; // 4 bytes
    int i=0; // 1 int
    memcpy(&i,&b[0],sizeof(i)); // copy bytes from b to i
    return i;
}
On a little-endian x86 machine, this program will print 0xd3c2b1a0.
On a big-endian PowerPC machine, this program will print 0xa0b1c2d3.
Note that this is a function of how the CPU stores an "int" in the
bytes of memory, so you have the exact same endian-dependent situation with
files:
#include <fstream>
int foo(void)
{
    // Write out 4 bytes into this file:
    std::ofstream fo("test.bin",std::ios_base::binary);
    // unsigned char: values above 0x7f do not fit in a signed char
    unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3}; // 4 bytes
    fo.write((const char *)&b[0],4);
    fo.flush(); //<- else fo leaves our bytes in output buffer!
    // Read those same bytes out as a binary int:
    std::ifstream fi("test.bin",std::ios_base::binary);
    int i=0;
    fi.read((char *)&i,sizeof(i));
    return i;
}
Stupid ways to deal with endianness
- Ignore it. Hope that everybody for the rest of time will use 32-bit little-endian systems.
- Hardcode a byte swap into your program, like this. This is
stupid because NOT every machine will need a byte swap, and byte
swapping is slow and ugly. It's also way too easy to accidentally
swap bytes twice, or corrupt the in-memory data.
#include <algorithm> // std::swap before C++11
#include <utility>   // std::swap since C++11
#include <fstream>
int foo(void)
{
    // Write out 4 bytes into this file:
    std::ofstream fo("test.bin",std::ios_base::binary);
    unsigned char b[4]={0xa0, 0xb1, 0xc2, 0xd3}; // 4 bytes
    // Watch out!!!!!!!! Gotta swap bytes!!!!!!!
    for (int k=0;k<2;k++) std::swap(b[k],b[3-k]); // reverse the 4 bytes in place
    fo.write((const char *)&b[0],4);
    fo.flush(); //<- else fo leaves our bytes in output buffer!
    // Read those same bytes out as a binary int:
    std::ifstream fi("test.bin",std::ios_base::binary);
    int i=0;
    fi.read((char *)&i,sizeof(i));
    return i;
}
Smart ways to deal with endianness
That last technique, the magic class, is by far my favorite. The
biggest advantage of this is that now if you have fourteen things you
need to store on disk, you can make a new class out of Big32 objects,
and the new class will also have a known on-disk byte layout:
class stuffpile {
public:
Big32 foos;
Big32 bars[11];
Big32 baz,boz;
};
"stuffpile" objects can now be written and read easily and portably as bytes, just like Big32s.
Argh! I Hate Binary! Why not just use ASCII?
ASCII really is fine if you don't care too much about:
- File size. ASCII can be much larger than the equivalent
binary value. "123456789 " is ten characters, while the same value
stored as a 32-bit binary int is four bytes. (Be careful, though! Since "1 " is two
characters, sometimes ASCII is smaller!)
- I/O speed. On my machine using the same "ofstream" object,
ASCII output is about 4x slower than binary output--about 10MB/s
instead of 40MB/s. ASCII is hence slower than your disk drive!
- Security. ASCII has a huge number of "special characters"
that can cause trouble during reads (think about newlines, escape
characters, high-ASCII/Unicode, and so on). Binary input is
usually much safer.
- Seeking. Since every int written in ASCII is a different
number of bytes, you can't just jump to the i'th int. Since every
int written in binary is the same size, you can just jump to the i'th int, with something like "fi.seekg(i*sizeof(int));"
Unfortunately, we often do care about all four of these things.
Hence it's important for you to learn about reading binary files.
Real-life complicated binary files
A real binary file usually has an interesting structure. The
first thing in the file is a "header". This is a sequence of
stuff at known locations. For simple files everything
in the file is at a known fixed location, but real life is rarely
simple. Instead, often the header will give the file locations
("offsets") to where you can find the other stuff in the file.
Example: EXE file format
Modern Windows executables are in the "PE" format (Portable Executable). They start with an old MS-DOS program header,
but that data isn't used anymore (it's just a tiny DOS program that
prints "This program cannot be run in DOS mode"). To find the real
executable info, you jump to byte 0x3c in the file (with a seek) and
then read 4 bytes, which are a little-endian byte offset. At this
byte in the file (again, you get there with a seek), there's a whole
struct full of information about the program. Here's a complete example program. Through the use of the "lil32" class, this program can run on any machine, not just little-endian Windows machines.
Example: FAT file system
Any filesystem is just a big binary data structure sitting on your
disk. One common filesystem, used in USB keychain drives, floppy
disks, and old hard disks, is the "File Allocation Table" filesystem.
The first thing on disk is the FAT "boot sector", which tells you how
many entries are actually in the FAT. Then comes the FAT itself
(read the Wikipedia article, it's good!). Then come the blocks
of data in the normal user files sitting on the disk. Because the
boot sector, FAT, and user data blocks are all a known size, the OS can
directly seek (the disk) to a particular location to read a particular
file.
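As a rough sketch of how those known sizes turn into seek locations, the code below computes byte offsets from a boot sector already read into memory. It assumes the classic FAT12/FAT16 boot sector field positions (bytes per sector at byte 11, reserved sectors at 14, number of FATs at 16, root directory entries at 17, sectors per FAT at 22, all little-endian); FAT32 lays things out differently. The names FatLayout, fat_layout, and read_lil16 are made up for illustration.

```cpp
#include <cstdint>

// Read a 16-bit little-endian value from two bytes.
static std::uint32_t read_lil16(const unsigned char *p) {
    return (std::uint32_t)p[0] | ((std::uint32_t)p[1] << 8);
}

struct FatLayout {
    std::uint32_t fat_start;  // byte offset of the first FAT
    std::uint32_t root_start; // byte offset of the root directory
    std::uint32_t data_start; // byte offset of the first user data block
};

// Compute where each region starts, assuming a FAT12/FAT16 boot sector.
FatLayout fat_layout(const unsigned char *boot) {
    FatLayout L = {0, 0, 0};
    std::uint32_t bytes_per_sector = read_lil16(boot + 11);
    if (bytes_per_sector == 0) return L; // not a usable boot sector
    std::uint32_t reserved_sectors = read_lil16(boot + 14);
    std::uint32_t fat_count        = boot[16];
    std::uint32_t root_entries     = read_lil16(boot + 17);
    std::uint32_t sectors_per_fat  = read_lil16(boot + 22);

    L.fat_start  = reserved_sectors * bytes_per_sector;
    L.root_start = L.fat_start + fat_count * sectors_per_fat * bytes_per_sector;
    // Each root directory entry is 32 bytes; round up to whole sectors.
    std::uint32_t root_sectors =
        (root_entries * 32 + bytes_per_sector - 1) / bytes_per_sector;
    L.data_start = L.root_start + root_sectors * bytes_per_sector;
    return L;
}
```

For a 1.44MB floppy (512-byte sectors, 1 reserved sector, 2 FATs of 9 sectors each, 224 root entries) this puts the first FAT at byte 512 and the user data at byte 16896.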