Strong opinions loosely held

Figuring out the WhatsApp database format

I’m currently fiddling with the WhatsApp ChatStorage.sqlite database that I extracted from a recent local iOS backup. I want to parse the contents into properly marked-up HTML files and store them outside of the iOS backup, to become more independent from the iOS backup and WhatsApp itself.

I already got pretty far (massively improving my SQL skills in the process), but of course I want to add as much context to the messages as possible. WhatsApp saves the metadata for media items (namely links, replies, image thumbnails) in the ZWAMEDIAITEM.ZMETADATA column of the database. On iOS this column contains binary property list blobs that can be inspected on macOS using the plutil tool. Still, there is some figuring out left for me to do, and I’d like your help with that.
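For anyone who wants to follow along without plutil, here is a minimal Python sketch of reading those blobs with the standard library. The table and column names come from the database as described above; the Z_PK primary key column is an assumption based on the usual Core Data naming, and iter_media_metadata is just a name I made up for this example.

```python
import plistlib
import sqlite3


def iter_media_metadata(con):
    """Yield (row id, parsed plist) for every readable ZMETADATA blob.

    `con` is an open sqlite3 connection to ChatStorage.sqlite.
    Z_PK is assumed to be the primary key column (Core Data convention).
    """
    rows = con.execute(
        "SELECT Z_PK, ZMETADATA FROM ZWAMEDIAITEM WHERE ZMETADATA IS NOT NULL"
    )
    for pk, blob in rows:
        try:
            # plistlib detects the binary plist format on its own.
            yield pk, plistlib.loads(blob)
        except plistlib.InvalidFileException:
            # Skip blobs that are not property lists at all.
            continue
```

Usage would look roughly like `for pk, meta in iter_media_metadata(sqlite3.connect("ChatStorage.sqlite")): print(pk, list(meta))`, which prints the top-level keys of each metadata entry.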

Among other things, the metadata contains the senderJID (JID stands for Jabber ID, since WhatsApp was built on Jabber) of the referenced metadata. The thing that I am really after is the quotedMessageData field, which contains a lot more data. For replies, for example, it contains the text of the message your reply was referring to. When the metadata contains a link, and WhatsApp managed to scrape a link preview off the web, the field contains everything you would need to rebuild that preview: the link itself, the contents of the HTML <title /> tag, and a tiny thumbnail image.
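As a stopgap until the actual structure of the field is understood, a crude scan can already pull out the obvious parts. The sketch below is purely heuristic and does not decode the format: it treats quotedMessageData as raw bytes, looks for anything shaped like a URL, and cuts out the span between the first JPEG start marker (FF D8) and the last JPEG end marker (FF D9). The function name and the returned dictionary layout are made up for this example, and the URL match may grab a stray trailing byte if one happens to be printable.

```python
import re

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker
# Any run of printable, non-space ASCII following http:// or https://
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")


def scan_quoted_message_data(blob):
    """Heuristically find URLs and one embedded JPEG in a raw bytes blob."""
    urls = [m.group().decode("ascii") for m in URL_PATTERN.finditer(blob)]

    thumbnail = None
    start = blob.find(JPEG_SOI)
    if start != -1:
        end = blob.rfind(JPEG_EOI)
        if end > start:
            # Include the two end-marker bytes in the slice.
            thumbnail = blob[start:end + len(JPEG_EOI)]

    return {"urls": urls, "thumbnail": thumbnail}
```

Writing the thumbnail bytes to a .jpg file and opening it is a quick way to check whether the cut was right.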

It’s all clearly visible in a hex editor: the text, the link, the magic number of the thumbnail JPEG (FF D8/0xFF 0xD8). But even after hours of fiddling and researching binary message and serialization patterns, control characters and the like, I can’t seem to fully figure it out. After all, it’s the first time I’m dealing with this sort of thing. Things that are quite apparent: