OK, this is a brain dump of everything I've learned about MDB files. I am
using Access 97, so everything I say applies to that version and may or may
not apply to others.
Right, so here goes:
Note: It appears that much of the data in the pages is uninitialized garbage.
This makes the task of figuring out the format a bit more challenging.
Pages
-----
MDB files are a set of pages. These pages are 2K (2048 bytes) in size, so in a
hex dump of the data they start on addresses like xxx000 and xxx800.
The first byte of each page seems to be a type identifier. For instance, the
first page in the MDB file is 0x00, which no other page seems to share. Other
pages have values of 0x01, 0x02, 0x03, or 0x04, though the exact meaning of
these is currently a mystery (0x04 seems to be data, I guess).
The second byte is always 0x01 as far as I can tell.
At some point in the file the page layout is apparently abandoned, though the
very last 2K in the file again looks like a valid page. The purpose of this
non-paged region is so far unknown.
Bytes after the first and second seem to depend on the type of page, although
bytes 4-7 seem to indicate a page type of some sort. 02 00 00 00 is found on
all catalog pages.
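A rough Python sketch of the observations so far: split a file into 2K pages and look at the type byte and the 02 00 00 00 catalog marker. The names split_pages and describe_page are made up for illustration, and I am assuming "bytes 4-7" means zero-based offsets 4 through 7.

```python
PAGE_SIZE = 2048  # 2K pages, as observed in Access 97 files


def split_pages(data: bytes):
    """Yield (page_number, page_bytes) for every complete 2K page."""
    for offset in range(0, len(data) - PAGE_SIZE + 1, PAGE_SIZE):
        yield offset // PAGE_SIZE, data[offset:offset + PAGE_SIZE]


def describe_page(page: bytes) -> dict:
    """Pull out the few header bytes whose meaning has been guessed at."""
    return {
        "type": page[0],          # 0x00 for the first page, 0x01..0x04 elsewhere
        "second_byte": page[1],   # always 0x01 as far as observed
        # Assumption: "bytes 4-7" are zero-based offsets 4..7.
        "looks_like_catalog": page[4:8] == b"\x02\x00\x00\x00",
    }
```

Any trailing bytes that do not fill a whole page are simply skipped here, which sidesteps the non-paged region rather than explaining it.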
Pages seem to have two parts, a header and a data portion. The header starts
at the front of the page and builds up. The data is packed to the end of the
page. This means the last byte of the data portion is the last byte of the
page.
Byte Order
----------
All offsets to data within the file are in little endian (Intel) order.
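For anyone poking at a hex dump, here is a small sketch of reading little endian values with Python's struct module. The helper names read_u16 and read_u32 are my own, not anything from the file format.

```python
import struct


def read_u16(buf: bytes, offset: int) -> int:
    """Read an unsigned 16 bit little endian value at offset."""
    return struct.unpack_from("<H", buf, offset)[0]


def read_u32(buf: bytes, offset: int) -> int:
    """Read an unsigned 32 bit little endian value at offset."""
    return struct.unpack_from("<I", buf, offset)[0]
```

So the bytes 00 08 read as a 16 bit value give 0x0800, not 0x0008.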
Catalogs
--------
So far the first page of the catalog has always been seen at 0x9000 bytes into
the file. It is unclear whether this is always where it occurs, or whether a
pointer to this location exists elsewhere.
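If the 2K page size holds, 0x9000 is the start of page 18 (counting from 0). A tiny sketch of the conversion, with hypothetical helper names:

```python
PAGE_SIZE = 2048  # 2K pages, as observed in Access 97 files


def page_offset(page_number: int) -> int:
    """Byte offset in the file where the given page starts."""
    return page_number * PAGE_SIZE


def page_number(offset: int) -> int:
    """Page that contains the given byte offset."""
    return offset // PAGE_SIZE
```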
The header of the catalog page(s) looks something like this:
Column Type may be one of the following (not complete).
0x03 Integer (16 bit)
0x04 Long Integer (32 bit)
0x08 Short Date/Time
0x0a Text
0x0c Hyperlink
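The same (incomplete) list as a lookup table, in case it is handy when dumping column records. The function name column_type_name is just for illustration, and unknown codes are reported rather than guessed at.

```python
# Column type codes seen so far; the list is known to be incomplete.
COLUMN_TYPES = {
    0x03: "Integer (16 bit)",
    0x04: "Long Integer (32 bit)",
    0x08: "Short Date/Time",
    0x0A: "Text",
    0x0C: "Hyperlink",
}


def column_type_name(code: int) -> str:
    """Return a readable name for a column type code."""
    return COLUMN_TYPES.get(code, "unknown (0x%02x)" % code)
```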
Following the 18 byte column records are the column names, listed in order,
each preceded by a 1 byte size prefix.
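A sketch of walking the size-prefixed names. The caller has to supply where the names start and how many columns there are, since where those numbers live is not covered here. Decoding as latin-1 is my assumption, and parse_column_names is a made-up name.

```python
def parse_column_names(buf: bytes, offset: int, count: int):
    """Read `count` names, each preceded by a 1 byte length.

    Returns (names, offset_just_past_the_last_name).
    """
    names = []
    for _ in range(count):
        size = buf[offset]
        offset += 1
        # Assumption: names are single-byte text; latin-1 never fails to decode.
        names.append(buf[offset:offset + size].decode("latin-1"))
        offset += size
    return names, offset
```

The returned offset should be where the 39 byte index fields begin, if the layout described here is right.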
After this is a series of 39 byte fields, one for each index. At offset 34 is
a 4 byte page number where the index lives.
Beyond this is a series of 20 byte fields, one for each 'index entry'. There
may be more entries than indexes, and byte 20 represents its type (0x00 for
normal index, 0x01 for Primary Key, and 0x02 otherwise).
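A sketch of pulling the page number out of each 39 byte index field. I am assuming offset 34 is zero-based within the field and that the page number is little endian like other offsets. The 20 byte index entries are not attempted here. index_page_numbers is a hypothetical name.

```python
import struct

INDEX_FIELD_SIZE = 39   # one field per index
INDEX_PAGE_OFFSET = 34  # assumption: zero-based offset within the field


def index_page_numbers(buf: bytes, offset: int, num_indexes: int):
    """Return the page number stored in each 39 byte index field."""
    pages = []
    for i in range(num_indexes):
        base = offset + i * INDEX_FIELD_SIZE
        (page,) = struct.unpack_from("<I", buf, base + INDEX_PAGE_OFFSET)
        pages.append(page)
    return pages
```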
It is currently unknown how indexes are mapped to columns or the format of the index pages.
Indices are not completely understood but here is what we know.
On the page pointed to by the table definition a series of records start at
byte offset 0xf8.
The record generally begins with 0x7f or 0x80. 0x80 is the one's complement of
0x7f, and all text data in the index would then need to be negated. The reason
for this negation is unknown, although I suspect it has to do with descending
order.
Access stores an 'alphabetic sort order' version of the text key columns in the
index. Basically this means that upper and lower case characters A-Z are merged
and start at 0x60. Digits are 0x56 through 0x5f. Once converted into this
(non-ASCII) character set, the text value can be sorted in 'alphabetic'
order. A text column will end with a NULL (0x00, or 0xff if negated).
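Here is how I would sketch the conversion in Python, going only by what is described above: letters map to 0x60 upward with case merged, digits to 0x56 through 0x5f, and a terminating NULL. How other characters (spaces, punctuation) are encoded is not known, so this sketch refuses them. encode_index_text is a made-up name and the leading 0x7f/0x80 byte is not included.

```python
def encode_index_text(text: str, negated: bool = False) -> bytes:
    """Convert text to the 'alphabetic sort order' form seen in indexes."""
    out = bytearray()
    for ch in text:
        if ch.isascii() and ch.isalpha():
            out.append(0x60 + ord(ch.upper()) - ord("A"))  # A/a -> 0x60
        elif ch.isascii() and ch.isdigit():
            out.append(0x56 + int(ch))                     # 0 -> 0x56
        else:
            # Encoding of other characters has not been worked out.
            raise ValueError("no known index encoding for %r" % ch)
    out.append(0x00)  # text columns end with a NULL
    if negated:
        # One's complement of every byte, so the NULL becomes 0xff.
        return bytes(b ^ 0xFF for b in out)
    return bytes(out)
```

With this mapping, "apple" and "APPLE" produce the same bytes, and plain byte comparison sorts "apple" before "Banana".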
Beyond the key columns is stored a 3 byte page number and 1 byte row number.
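A sketch of decoding those trailing 4 bytes. The byte order of the 3 byte page number is not stated here, so it is left as a parameter; defaulting to little endian is only a guess based on the Byte Order section. parse_row_pointer is a hypothetical name.

```python
def parse_row_pointer(tail: bytes, byteorder: str = "little"):
    """Split the 4 bytes after the key columns into (page, row)."""
    if len(tail) != 4:
        raise ValueError("expected 3 byte page number plus 1 byte row number")
    # Assumption: byte order of the page number is unverified.
    page = int.from_bytes(tail[0:3], byteorder)
    row = tail[3]
    return page, row
```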
So to search the index, you need to convert your value into the alphabetic
character set, compare against each index entry, and on successful comparison
follow the page and row number to the data. Because text data is mangled
during this conversion, no 'covered queries' are possible (a query that can
be satisfied by reading the index, without descending to the leaf page to read