mdbtools/HACKING

Ok, this is a brain-dump of everything I've learned about MDB files.  I am
using Access 97, so everything I say applies to that and may or may not apply
to other versions.

Right, so here goes:

Note: It appears that much of the data in the pages is uninitialized garbage.
This makes the task of figuring out the format a bit more challenging.

Pages
-----

MDB files are a set of pages.  These pages are 2K (2048 bytes) in size, so in a
hex dump of the data they start on addresses like xxx000 and xxx800.

The first byte of each page seems to be a type identifier.  For instance, the
first page in the mdb file is 0x00, which no other page seems to share.  Other
pages have values of 0x01, 0x02, 0x03, 0x04, though the exact meaning of these
is currently a mystery (0x04 seems to be data, I guess).

The second byte is always 0x01 as far as I can tell.
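
As a rough sketch of the page arithmetic above, locating a page and reading
its type byte could look like this in C.  The names MDB_PGSIZE,
mdb_page_offset and mdb_page_type are made up for illustration and are not
taken from the mdbtools source.

```c
#define MDB_PGSIZE 2048  /* page size observed in Access 97 files */

/* Byte offset in the file at which page number pg starts. */
static long mdb_page_offset(unsigned long pg)
{
    return (long)(pg * MDB_PGSIZE);
}

/* First byte of a page buffer: the apparent page type identifier. */
static unsigned char mdb_page_type(const unsigned char *page)
{
    return page[0];
}
```

For example, mdb_page_offset(1) is 0x800 and mdb_page_offset(18) is 0x9000.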

At some point in the file the page layout is apparently abandoned, though the
very last 2K in the file again looks like a valid page.  The purpose of this
non-paged region is so far unknown.

Bytes after the first and second seem to depend on the type of page, although
bytes 4-7 seem to indicate a page type of some sort.  02 00 00 00 is found on
all catalog pages.

Pages seem to have two parts, a header and a data portion.  The header starts 
at the front of the page and builds up.  The data is packed to the end of the 
page.  This means the last byte of the data portion is the last byte of the 
page.

Byte Order
----------

All offsets to data within the file are in little endian (Intel) order.
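
A minimal sketch of reading such values portably, one byte at a time, so the
result does not depend on the byte order or alignment rules of the host.  The
helper names mdb_get16 and mdb_get32 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 16 bit little endian value starting at buf[off]. */
static uint16_t mdb_get16(const unsigned char *buf, size_t off)
{
    return (uint16_t)(buf[off] | (buf[off + 1] << 8));
}

/* Read a 32 bit little endian value starting at buf[off]. */
static uint32_t mdb_get32(const unsigned char *buf, size_t off)
{
    return (uint32_t)buf[off]
         | ((uint32_t)buf[off + 1] << 8)
         | ((uint32_t)buf[off + 2] << 16)
         | ((uint32_t)buf[off + 3] << 24);
}
```

So the byte pair 0x0d 0x00 reads as 13, and 02 00 00 00 reads as 2.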

Catalogs
--------

So far the first page of the catalog has always been seen at 0x9000 bytes into
the file.  It is unclear whether this is always where it occurs, or whether a 
pointer to this location exists elsewhere.

The header of a catalog page looks something like this:

+------+---------+--------------------------------------------------------+
| 0x01 | 1 byte  | Page type                                              |
| 0x01 | 1 byte  | Unknown                                                |
| ???? | 2 bytes | A pointer of unknown use into the page                 |
| 0x02 | 1 byte  | Unknown                                                |
| 0x00 | 3 bytes | Possibly part of a 32 bit int including the 0x02 above |
| ???? | 2 bytes | a 16bit int of the number of records on this page      |
+-------------------------------------------------------------------------+
| Iterate for the number of records                                       |
+-------------------------------------------------------------------------+
| ???? | 2 bytes | offset to the records location on this page            |
+-------------------------------------------------------------------------+

The rest of the data is packed to the end of the page, such that the last 
record ends on byte 2047 (0 based). 
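
Adding up the field sizes in the table above puts the record count at byte 8
and the first record offset at byte 10.  A sketch of reading them, assuming
the layout is exactly as tabulated (all names here are invented for the
example):

```c
#include <stddef.h>
#include <stdint.h>

#define CAT_NUM_RECS_OFF 8   /* 1 + 1 + 2 + 1 + 3 bytes precede the count */
#define CAT_OFFSETS_OFF  10  /* the 2 byte record offsets follow the count */

/* Read a 16 bit little endian value starting at p[off]. */
static uint16_t get16(const unsigned char *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

/* Number of records the header says are on this catalog page. */
static uint16_t cat_num_records(const unsigned char *page)
{
    return get16(page, CAT_NUM_RECS_OFF);
}

/* Raw offset entry for record i; it is not validated here. */
static uint16_t cat_record_offset(const unsigned char *page, unsigned i)
{
    return get16(page, CAT_OFFSETS_OFF + 2u * i);
}
```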

Some of the offsets are not within the bounds of the page.  The reason for this
is not presently understood and the current code discards them silently.
Offsets that have 0x40 in the high order byte point to a location within the
page where a pointer to another catalog page is stored.  This does not seem to
yield a complete chain of catalog pages and is currently being ignored in favor
of a brute force read of the entire database for catalog pages.
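
The three cases above could be told apart roughly as follows.  This is only a
sketch: it assumes "0x40 in the high order byte" means that bit is set (rather
than the byte being exactly 0x40), and the enum and function names are made up.

```c
#include <stdint.h>

#define MDB_PGSIZE 2048

enum cat_off_kind {
    CAT_OFF_RECORD,        /* points at a record on this page */
    CAT_OFF_NEXT_PAGE_PTR, /* 0x40 flag: location of a catalog page pointer */
    CAT_OFF_OUT_OF_BOUNDS  /* not understood, silently discarded */
};

/* Classify one 16 bit offset entry from a catalog page header. */
static enum cat_off_kind cat_classify_offset(uint16_t off)
{
    if ((off >> 8) & 0x40)      /* assumption: test the bit, not equality */
        return CAT_OFF_NEXT_PAGE_PTR;
    if (off >= MDB_PGSIZE)
        return CAT_OFF_OUT_OF_BOUNDS;
    return CAT_OFF_RECORD;
}
```

The flag is checked first because any offset carrying it is necessarily larger
than the page size.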

Little is understood of the meaning of the bytes that make up the records.  They
vary in size, but the portion prior to the object's name seems to be fixed.  All
records start with 0x11.  The next two bytes are a page number pointing to the
column definitions (see Table Definition).

Byte offset 9 from the beginning of the record contains its type.  Here is a
table of known types:

0x00 Form
0x01 User Table
0x02 Macro
0x03 System Table
0x04 Report
0x05 Query
0x06 Linked Table
0x07 Module
0x0b Unknown but used for two objects (AccessLayout and UserDefined)
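
The table above maps directly onto a lookup function.  A small sketch
(cat_type_name is a hypothetical name; values not listed above fall through to
a catch-all string):

```c
/* Human readable name for the type byte at offset 9 of a catalog record. */
static const char *cat_type_name(unsigned char type)
{
    switch (type) {
    case 0x00: return "Form";
    case 0x01: return "User Table";
    case 0x02: return "Macro";
    case 0x03: return "System Table";
    case 0x04: return "Report";
    case 0x05: return "Query";
    case 0x06: return "Linked Table";
    case 0x07: return "Module";
    case 0x0b: return "Unknown (AccessLayout, UserDefined)";
    default:   return "Unrecognized";
    }
}
```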

Byte offset 31 from the beginning of the record starts the object's name.  I am
not presently aware of any field defining the length of the name, so the present
course of action has been to stop at the first non-printable character
(generally a 0x03 or 0x02).
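
A sketch of that stop-at-first-non-printable approach.  The function name and
its parameters are invented for this example; rec_len is however many bytes of
the record are known to be available, and the output is truncated to fit.

```c
#include <ctype.h>
#include <stddef.h>

#define CAT_NAME_OFF 31  /* object name starts here within the record */

/* Copy the object name out of a catalog record into out, NUL terminated.
 * Stops at the first non-printable byte, at the end of the record, or
 * when out is full.  Returns the number of name bytes copied. */
static size_t cat_entry_name(const unsigned char *rec, size_t rec_len,
                             char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;
    for (size_t i = CAT_NAME_OFF; i < rec_len && n + 1 < out_size; i++) {
        if (!isprint(rec[i]))
            break;
        out[n++] = (char)rec[i];
    }
    out[n] = '\0';
    return n;
}
```

Note that this heuristic would cut short any name that legitimately contains
a byte the C locale does not consider printable.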

After the name there is sometimes (not yet determined why only sometimes) a
page pointer and offset to the KKD records (see below).  There is also a
pointer to other catalog pages, but I'm not really sure how to parse those.

Table Definition
-----------------

The second and third bytes of each catalog entry store a 16 bit page pointer to
a table definition, which includes name, type, size, number of data rows, a
pointer to the first data page, and possibly more.  I haven't fully figured
this out so what follows is rough.

The header of a table definition page looks something like this:

+------+---------+--------------------------------------------------------+
| 0x02 | 1 byte  | Page type                                              |
| 0x01 | 1 byte  | Unknown                                                |
| 'VC' | 2 bytes | ???                                                    |
| 0x00 | 4 bytes | Pointer to continuation page (if multipage table def)  |
| ???? | 4 bytes | appears to be a length of the data                     |
| ???? | 4 bytes | number of rows of data in this table                   |
| 0x00 | 4 bytes | ???                                                    |
| 0x4e | 1 byte  | ???                                                    |
| ???? | 2 bytes | generally same as # of cols but not always             |
| ???? | 2 bytes | ???                                                    |
| ???? | 2 bytes | number of columns in table                             |
| ???? | 4 bytes | number of indexes for this table                       |
| ???? | 4 bytes | number of index entries for this table                 |
| 0x00 | 1 byte  | ???                                                    |
| ???? | 2 bytes | page number of first datapage for table                |
| ???? | 2 bytes | ???                                                    |
| ???? | 2 bytes | page number of first datapage for table                |
| 0x00 | 1 byte  | ???                                                    |
+-------------------------------------------------------------------------+
| Iterate for 2 x number of indexes                                       |
+-------------------------------------------------------------------------+
| ???? | 4 bytes | number of rows in table                                |
| ???? | 4 bytes | number of rows in the index                            |
+-------------------------------------------------------------------------+
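
Summing the field sizes in the fixed part of the table gives the byte offsets
used in the sketch below (the fixed part is 43, or 0x2B, bytes long).  The
struct and function names are hypothetical, and only the fields whose meaning
is guessed at above are read.  Several fields sit at odd offsets, so everything
is assembled byte by byte rather than by casting to a struct.

```c
#include <stddef.h>
#include <stdint.h>

struct tdef_header {
    uint32_t next_page;         /* offset 4:  continuation page */
    uint32_t num_rows;          /* offset 12: rows of data in the table */
    uint16_t num_cols;          /* offset 25: number of columns */
    uint32_t num_indexes;       /* offset 27: number of indexes */
    uint32_t num_index_entries; /* offset 31: number of index entries */
    uint16_t first_data_page;   /* offset 36: first data page */
};

static uint16_t get16(const unsigned char *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

static uint32_t get32(const unsigned char *p, size_t off)
{
    return (uint32_t)get16(p, off) | ((uint32_t)get16(p, off + 2) << 16);
}

/* Fill *h from a table definition page.  Returns 0, or -1 if the
 * page type byte is not 0x02. */
static int tdef_read_header(const unsigned char *page, struct tdef_header *h)
{
    if (page[0] != 0x02)
        return -1;
    h->next_page         = get32(page, 4);
    h->num_rows          = get32(page, 12);
    h->num_cols          = get16(page, 25);
    h->num_indexes       = get32(page, 27);
    h->num_index_entries = get32(page, 31);
    h->first_data_page   = get16(page, 36);
    return 0;
}
```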

The next few bytes are somewhat of a mystery right now, but around 0x2B from
the start of the page (though not always) begins a series of 18 byte records,
one for each column present.  Their format is as follows:

+------+---------+--------------------------------------------------------+
| ???? | 1 byte  | Column Type (see table below)                          |
| ???? | 2 bytes | Column Number, ascending sequential number, starts at 0|
| ???? | 1 byte  | unknown. 1 is sometimes seen in text types             |
| ???? | 1 byte  | unknown                                                |
| ???? | 4 bytes | Column Number (again)                                  |
| ???? | 6 bytes | ??? (timestamp?)                                       |
| ???? | 1 byte  | bitmask of some sort. low order bit indicates variable |
|      |         | length column                                          |
| ???? | 2 bytes | length of column                                       |
+-------------------------------------------------------------------------+
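
A sketch of decoding one of these 18 byte records, using offsets obtained by
adding up the field sizes above (type at 0, column number at 1, bitmask at 15,
length at 16).  The struct and function names are made up, and the unknown
fields are simply skipped.

```c
#include <stddef.h>
#include <stdint.h>

#define COL_REC_SIZE 18

struct col_def {
    uint8_t  type;        /* column type code */
    uint16_t number;      /* sequential column number, from 0 */
    uint8_t  flags;       /* bitmask of some sort */
    int      is_variable; /* low order bit of flags */
    uint16_t length;      /* length of column */
};

static uint16_t get16(const unsigned char *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

/* Decode the column record starting at rec (COL_REC_SIZE bytes). */
static void col_read(const unsigned char *rec, struct col_def *c)
{
    c->type        = rec[0];
    c->number      = get16(rec, 1);
    c->flags       = rec[15];
    c->is_variable = rec[15] & 0x01;
    c->length      = get16(rec, 16);
}
```

Record i of a table would then start at (first column record) + i * 18.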

Column Type may be one of the following (not complete).

0x03   Integer (16 bit)
0x04   Long Integer (32 bit)
0x08   Short Date/Time
0x0a   Text
0x0c   Hyperlink
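
The list above can be captured in a pair of small lookups.  This is a sketch
with invented names; the fixed width is only filled in where the list states
it (16 bit and 32 bit integers), and 0 means "not stated here".

```c
/* Name for a column type code from the (incomplete) list above. */
static const char *col_type_name(unsigned char type)
{
    switch (type) {
    case 0x03: return "Integer";
    case 0x04: return "Long Integer";
    case 0x08: return "Short Date/Time";
    case 0x0a: return "Text";
    case 0x0c: return "Hyperlink";
    default:   return "Unknown";
    }
}

/* Width in bytes where the list above states it, otherwise 0. */
static unsigned col_type_width(unsigned char type)
{
    switch (type) {
    case 0x03: return 2;  /* 16 bit */
    case 0x04: return 4;  /* 32 bit */
    default:   return 0;
    }
}
```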

Following the 18 byte column records begins the column names, listed in order
with a 1 byte size prefix preceding each name.
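
Reading such length-prefixed names is a matter of walking a cursor through the
buffer.  A sketch, with a hypothetical function name and bounds checks added
because the length byte cannot be trusted:

```c
#include <stddef.h>
#include <string.h>

/* Read one name with a 1 byte length prefix starting at buf[*pos] and
 * advance *pos past it.  out receives a NUL terminated copy.
 * Returns 0 on success, -1 if the name would overrun buf or out. */
static int tdef_read_name(const unsigned char *buf, size_t buf_len,
                          size_t *pos, char *out, size_t out_size)
{
    size_t len;

    if (*pos >= buf_len)
        return -1;
    len = buf[*pos];
    if (*pos + 1 + len > buf_len || len + 1 > out_size)
        return -1;
    memcpy(out, buf + *pos + 1, len);
    out[len] = '\0';
    *pos += 1 + len;
    return 0;
}
```

Calling it once per column, starting right after the last 18 byte column
record, should yield the names in column order.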

After this is a series of 39 byte fields, one for each index.  At offset 34 is
a 4 byte page number where the index lives.

Beyond this is a series of 20 byte fields, one for each 'index entry'.  There
may be more entries than indexes, and byte 20 represents its type (0x00 for
normal index, 0x01 for Primary Key, and 0x02 otherwise).

It is currently unknown how indexes are mapped to columns or what the format of
the index pages is.

Data Rows
---------

The header of a data page looks like this:

+------+---------+--------------------------------------------------------+
| 0x01 | 1 byte  | Page type                                              |
| 0x01 | 1 byte  | Unknown                                                |
| ???? | 2 bytes | Unknown                                                |
| ???? | 2 bytes | Page pointer to table definition                       |
| 0x00 | 2 bytes | Unknown                                                |
| ???? | 4 bytes | number of rows of data in this table                   |
+------+---------+--------------------------------------------------------+
| Iterate for the number of records                                       |
+-------------------------------------------------------------------------+
| ???? | 2 bytes | offset to the records location on this page            |
+-------------------------------------------------------------------------+
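
Going by the table above, the table definition pointer sits at byte 4, the row
count at byte 8 and the first record offset at byte 12.  A sketch of accessors
for these (names invented for the example):

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t get16(const unsigned char *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

/* Page number of the table definition this data page belongs to. */
static uint16_t data_pg_tdef_page(const unsigned char *page)
{
    return get16(page, 4);
}

/* The 4 byte row count field at offset 8. */
static uint32_t data_pg_num_rows(const unsigned char *page)
{
    return (uint32_t)get16(page, 8) | ((uint32_t)get16(page, 10) << 16);
}

/* Offset within the page of record i. */
static uint16_t data_pg_row_offset(const unsigned char *page, unsigned i)
{
    return get16(page, 12 + 2u * i);
}
```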

Each data row looks like this:

+------+---------+--------------------------------------------------------+
| ???? | 1 byte  | Number of columns stored in this row                   |
| ???? | n bytes | Fixed length columns                                   |
| ???? | n bytes | Variable length columns                                |
| ???? | 1 byte  | length of data from beginning of record                |
| ???? | n bytes | offset from start of row for each variable length col  |
| ???? | 1 byte  | number of variable length columns                      |
| ???? | n bytes | Null indicator.  size is 1 byte per 8 columns.         |
|      |         | 0 indicates a null value.                              |
+------+---------+--------------------------------------------------------+

Note: it is possible for the offset to the beginning of a variable length 
column to require more than one byte (if the sum of the lengths of columns is
greater than 255).  I have no idea how this is represented in the data as I
have not looked at tables large enough for this to occur yet. 
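
Since the bookkeeping fields sit at the end of the row, one way to parse a row
is to work backwards from its last byte.  The sketch below does that under
several assumptions that the notes above do not settle: the total row length
is already known (say, from neighbouring record offsets), every variable
column offset fits in one byte, and column 0 corresponds to the low order bit
of the first null indicator byte.  All names are hypothetical.

```c
#include <stddef.h>

struct row_trailer {
    unsigned ncols;      /* first byte of the row */
    size_t   mask_off;   /* where the null indicator starts */
    unsigned nvar;       /* number of variable length columns */
    size_t   var_tab_off;/* where the per-column offset bytes start */
    unsigned data_len;   /* "length of data from beginning of record" */
};

/* Locate the trailing fields of a row.  Returns 0, or -1 if the row
 * is too short to hold what its own counts claim. */
static int row_find_trailer(const unsigned char *row, size_t row_len,
                            struct row_trailer *t)
{
    size_t mask_len;

    if (row_len < 1)
        return -1;
    t->ncols = row[0];
    mask_len = (t->ncols + 7) / 8;      /* 1 byte per 8 columns */
    if (row_len < 1 + mask_len + 1)
        return -1;
    t->mask_off = row_len - mask_len;
    t->nvar = row[t->mask_off - 1];
    if (t->mask_off < (size_t)t->nvar + 3)
        return -1;
    t->var_tab_off = t->mask_off - 1 - t->nvar;
    t->data_len = row[t->var_tab_off - 1];
    return 0;
}

/* A 0 bit in the null indicator means the column is null. */
static int row_col_is_null(const unsigned char *row,
                           const struct row_trailer *t, unsigned col)
{
    unsigned char byte = row[t->mask_off + col / 8];
    return ((byte >> (col % 8)) & 1) == 0;
}
```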

Indices
-------

Indices are not completely understood but here is what we know.

On the page pointed to by the table definition a series of records start at
byte offset 0xf8.

The record generally begins with 0x7f or 0x80.  0x80 is the one's complement of
0x7f, and all text data in the index would then need to be negated as well.
The reason for this negation is unknown, although I suspect it has to do with
descending order.

Access stores an 'alphabetic sort order' version of the text key columns in the
index.  Basically this means that upper and lower case characters A-Z are
merged and start at 0x60.  Digits are 0x56 through 0x5f.  Once converted into
this (non-ascii) character set, the text value is able to be sorted in
'alphabetic' order.  A text column will end with a NULL (0x00 or 0xff if
negated).
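
A sketch of converting a search value into that character set.  It assumes
the letters are assigned consecutively from 0x60 (A and a are 0x60, B and b
are 0x61, and so on), which the description above suggests but does not prove.
Characters other than letters and digits are rejected because their mapping
is not known.  The function name is made up, and the leading 0x7f or 0x80
byte is not emitted.

```c
#include <stddef.h>

/* Encode ASCII text as index key bytes plus terminator.  If negate is
 * nonzero every byte is one's complemented.  Returns the number of
 * bytes written, or -1 on an unmappable character or a full buffer. */
static int idx_encode_text(const char *text, int negate,
                           unsigned char *out, size_t out_size)
{
    size_t n = 0;

    for (; *text; text++) {
        unsigned char c = (unsigned char)*text;
        unsigned char k;

        if (c >= 'A' && c <= 'Z')
            k = (unsigned char)(0x60 + (c - 'A'));
        else if (c >= 'a' && c <= 'z')
            k = (unsigned char)(0x60 + (c - 'a'));
        else if (c >= '0' && c <= '9')
            k = (unsigned char)(0x56 + (c - '0'));
        else
            return -1;
        if (n + 1 >= out_size)  /* keep room for the terminator */
            return -1;
        out[n++] = negate ? (unsigned char)~k : k;
    }
    if (n >= out_size)
        return -1;
    out[n++] = negate ? 0xff : 0x00;
    return (int)n;
}
```

Under these assumptions "Ab1" encodes to 60 61 57 00, or 9f 9e a8 ff when
negated.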

Beyond the key columns is stored a 3 byte page number and 1 byte row number.

So to search the index, you need to convert your value into the alphabetic
character set, compare against each index entry, and on successful comparison
follow the page and row number to the data.  Because text data is mangled
during this conversion there are no 'covered queries' possible (a query that
can be satisfied by reading the index, without descending to the leaf page to
read the data).

KKD Records
-----------

Design View table definitions appear to be stored in 'KKD' records (my name for 
them...they always start with 'KKD\0'). Again these reside on pages, packed to 
the end of the page. 

They look a little like this (this needs work; see kkd.c):

'K' 'K' 'D' 0x00
16 bit length value    (this includes the length)
0x00 0x00
0x80 0x00              (0x80 seems to indicate a header)
Then one or more of: 16 bit length field and a value of that size.
For instance: 
0x0d 0x00 and 'AccessVersion' (AccessVersion is 13 bytes, 0x0d 0x00 intel order)

Next come one or more rows of data (column names, descriptions, etc.).
16 bit length value    (this includes the length)
0x00 0x00
0x00 0x00
   16bit length field (this includes the length itself)
   4 bytes of unknown purpose
      16 bit length field (non-inclusive)
      value (07.53 for the AccessVersion example above)

See kkd.c for an example, although it needs cleanup.
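
To make the header layout a little more concrete, here is a sketch (not taken
from kkd.c) that checks for the 'KKD\0' signature and counts the names in the
header.  It assumes the 16 bit length at byte 4 counts from its own first byte
and therefore marks where the list of names ends; that reading is a guess.
The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t get16(const unsigned char *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

/* Does buf start with the 'K' 'K' 'D' 0x00 signature? */
static int kkd_is_record(const unsigned char *buf, size_t len)
{
    return len >= 4 && memcmp(buf, "KKD\0", 4) == 0;
}

/* Count the length-prefixed names in the KKD header.
 * Returns the count, or -1 if the data does not look as expected. */
static int kkd_count_header_names(const unsigned char *buf, size_t buf_len)
{
    size_t end, pos = 10;   /* 4 signature + 2 length + 2 + 2 bytes */
    int count = 0;

    if (!kkd_is_record(buf, buf_len) || buf_len < 10)
        return -1;
    end = 4 + (size_t)get16(buf, 4);  /* assumption: length includes itself */
    if (end > buf_len || buf[8] != 0x80)
        return -1;
    while (pos + 2 <= end) {
        size_t n = get16(buf, pos);   /* 16 bit length, then the value */
        if (pos + 2 + n > end)
            return -1;
        pos += 2 + n;
        count++;
    }
    return count;
}
```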