Figuring out the WhatApp database format
I’m currently fiddling with the WhatsApp ChatStorage.sqlite
database that I extracted from a recent local iOS backup. I want to parse the contents into properly marked-up HTML files, and store them outside of the iOS backup. To become more independent from the iOS backup and WhatsApp itself
I already got pretty far (massively improving my SQL skills in the process), but of course I want to add as much context to the messages as possible. WhatsApp saves the metadata for media items (namely links, replies, image thumbnails) for messages in the ZWAMEDIAITEM.ZMETADATA
column of the database. On iOS this column contains blobs of binary property lists, that can be inspected on MacOS using the plutil
tool. Still there is some figuring-out for me to do, and I’d like your help for that.
Among other things, it contains the senderJID
(JID standing in for Jabber ID since WhatsApp was built on Jabber) of the referenced metadata. The thing that I am really after is the quotedMessageData
field. It contains a lot more data. For replies for example it contains the text of the message your reply was referring to. When the metadata contained a link, and WhatsApp managed to scrape a link preview of the web, the field contains all stuff you would need to rebuild that preview: the link itself, the contents of the HTML <title />
tag, and a tiny thumbnail image.
It’s all clearly visible when viewed in a hex editor, the text, the link, the magic number of the thumbnail JPEG (FF D8
/0xFF 0xD8
) but even after hours of fiddling and researching binary message and serialization patterns, control characters and the like, I can’t seem to fully figure it out. After all it’s the first time I’m dealing with this sort of things. Things that are quite apparent:
- The first byte signals how the message will look like.
0x0A
(Line feed): the element is plain text. If this is the very first byte of the blob, there no other control code after the ULEB128 (see below), the element begins right away0x32
: drop-in for0x0A
, except there will be more than one element?0x1A
(Substitute character):0x7A
: the element is a senderJID- There are a few other characters that seem to act as control codes (
0x42
,0x22
,0x20
) but I can’t figure out, why they are used instead of another LF. Probably to distinguish between datatypes or to initiate some sort of nesting?
- The next n bytes after a control code seem to always belong to a ULEB128 defining the length of the next part. Trying out different positions in different files it almost always matches up towards the next human-differentiable part of the blob. But not always always, so there must be another trick here
- (Also I guess magic happens?)
- Whenever the sequence
0x82 0x01
occurs, it marks the beginning of a JPEG image. It’s length is defined by another ULEB128 following the sequence. Directly after that, the JPEG starts with its magic number. Writing the data out to disk, always results in a properly readable image. 0x22
followed by0x20
(doublequotes followed by a space) seems to have a special meaning as well. Maybe it ends a section?- Now, that I am seeing all this in one place, I realize that looking at bytes might be a bit to coarse. Maybe it would help to check out the nibbles instead?