Maildir Deduplication by Message-ID
Introduction
This is a set of scripts for finding duplicate messages in one or more Maildir-style directories (one message per file). I wrote them to de-duplicate personal archives of mailing lists, which held multiple copies of the same message, delivered via different transport paths (MX, protocol, mail sync time, etc.).
Assumptions and caveats
These tools do not have much in the way of child-proofing. If you ask them to do something stupid, they will blindly do it. It's quite easy to construct a pipeline that will destroy data. If you're not comfortable with that, please don't use these.
Running these scripts while something else is modifying the same messages is likely to lead to confusion, race conditions, and maybe lost data. Adding new messages is fine, and reading messages is fine; just don't move, change, or delete anything.
These scripts expect Maildir. If you manage to feed in an mbox-style mailbox, they will likely treat it as just one message.
For duplicate detection, we assume the Message-ID field is suitable for identifying duplicate messages. That is not a universally valid assumption. Be careful. ckmddupes can confirm the assumption.
The scripts
mddupes script
File: mddupes.pl
Grovel over a tree of message files, reading each message, finding the Message-ID from each, and reporting any duplicates encountered.
Theory of operation
- Given a list of directory names, read every file in those directories
- Treat every file as an RFC-822 message, and remember the Message-ID
- Report on stdout any messages (files) which duplicate a seen Message-ID
- Each report line lists both file names, separated by tab
- Warn if a Message-ID line is not found in a file
- Progress and problems are reported to stderr
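The steps above can be sketched roughly in Python. This is an illustration only, not the code in mddupes.pl. The function name find_duplicates is made up for the example, and the column order (first-seen file, then the duplicate) is an assumption inferred from the cut -f2 pipeline in the usage examples. Unlike the real script, this sketch skips anything that is not a regular file.

```python
import os
import sys
from email.parser import BytesHeaderParser


def find_duplicates(directories):
    """Return (first_seen_path, duplicate_path) pairs, keyed on Message-ID.

    Hypothetical helper for illustration; the column order is an assumption.
    """
    seen = {}    # Message-ID -> first file in which it was seen
    pairs = []
    parser = BytesHeaderParser()
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue  # this sketch skips subdirectories and the like
            with open(path, "rb") as handle:
                headers = parser.parse(handle)  # reads headers, not the body
            message_id = headers.get("Message-ID")
            if message_id is None:
                print(f"no Message-ID found: {path}", file=sys.stderr)
                continue
            message_id = str(message_id).strip()
            if message_id in seen:
                pairs.append((seen[message_id], path))
            else:
                seen[message_id] = path
    return pairs
```

Printing each pair as two tab-separated file names on stdout would give the report format described above.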
Assumptions and caveats
Assumes every entry in each directory is a message file.
It makes no attempt to identify subdirectories and/or non-message files.
If it encounters a subdirectory it will try to open it as a file; this ends up doing nothing on my system, but for all I know it could make demons fly out of your nose. If it encounters a non-message file it will try to open it like a message. At best it will then complain it could not find a Message-ID. At worst it will interpret random data as a Message-ID and give bad results.
Feed it the names of the Maildir "new" and/or "cur" directories, while nothing else is using those directories, and it should be OK. The findmaildirs -d command is useful for this.
nmdupes script
File: nmdupes.py
Run a Notmuch search query, and print Message-IDs for any resulting messages that are contained in more than one file.
Commentary
If you have your mail indexed by Notmuch, you can use it to look for duplicate Message-IDs. Since Notmuch already has the Message-IDs and file names indexed, it can run much faster than mddupes (which has to open and read each message file). On my system, it was hours versus minutes.
Provide a Notmuch search query. To run against all mail, simply use * as the query (escaped for your shell, as appropriate).
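For a feel of how the index can be queried, here is a small Python sketch that shells out to the notmuch command line. It is not the code in nmdupes.py, which may well use a different interface. The helper names are invented for the example, and the tab-separated file-pair output is an assumption based on what ckmddupes and linkify-pairs expect as input. It runs one notmuch process per message, so it will be slower than a real implementation, and it passes the id: terms back to notmuch without any extra quoting.

```python
import subprocess


def notmuch_lines(*args):
    """Run the notmuch CLI and return its stdout as a list of lines."""
    result = subprocess.run(["notmuch", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.splitlines()


def pairs_for(files):
    """Pair the first file with every other file holding the same message."""
    return [(files[0], other) for other in files[1:]]


def duplicate_pairs(query):
    """Yield file pairs for each matching message stored in several files.

    Hypothetical helper: the pair layout is an assumption, not taken from
    nmdupes.py.
    """
    # --output=messages prints one "id:<message-id>" term per message.
    for message in notmuch_lines("search", "--output=messages", query):
        # --output=files lists every file name indexed for that message.
        files = notmuch_lines("search", "--output=files", message)
        yield from pairs_for(files)
```

Calling duplicate_pairs("*") and printing each pair with a tab between the names would cover all indexed mail.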
ckmddupes script
File: ckmddupes.pl
Read the output of mddupes / nmdupes, and check each pair of duplicates, by reading the entire file for each and making sure they are identical in content (not just Message-ID).
Theory of operation
- Reads stdin, expecting it to be pairs of file names, one pair per line, separated by tab
- Output of mddupes/nmdupes is suitable as input
- Compares the file pairs to see if they really are the same message
- Ignores selected headers which can normally vary (different transport paths for same msg, meaningless mail software IDs, etc.)
- Ignores blank lines at the end of one message but not the other
- Otherwise messages must match, headers and body, all lines, same order
- Report to stderr messages which differ in some way (not actually a dupe)
- Report both names of identical messages (true dupes) to stdout
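The comparison described above might look something like the following Python sketch. It is not the code in ckmddupes.pl. In particular, the IGNORED_HEADERS set is a hypothetical stand-in, since the list of headers the real script ignores is not given here, and the function names are invented for the example.

```python
import sys

# Hypothetical list; the headers the real script ignores are not documented here.
IGNORED_HEADERS = {b"received", b"return-path", b"delivered-to"}


def normalized_lines(path):
    """Return a message's lines with the ignorable differences removed."""
    with open(path, "rb") as handle:
        lines = handle.read().split(b"\n")
    kept = []
    in_headers = True
    skipping = False  # True while inside an ignored (possibly folded) header
    for line in lines:
        if in_headers:
            if not line.strip():
                in_headers = False      # blank line ends the header block
            elif line[:1] in (b" ", b"\t"):
                if skipping:
                    continue            # folded continuation of an ignored header
            else:
                name = line.split(b":", 1)[0].strip().lower()
                skipping = name in IGNORED_HEADERS
                if skipping:
                    continue
        kept.append(line)
    while kept and not kept[-1].strip():
        kept.pop()                      # drop blank lines at the end
    return kept


def same_message(first, second):
    """True if both files hold the same message after normalization."""
    return normalized_lines(first) == normalized_lines(second)


def true_dupes(pair_lines):
    """Return the input lines naming real duplicates; note the rest on stderr."""
    confirmed = []
    for line in pair_lines:
        first, second = line.rstrip("\n").split("\t")
        if same_message(first, second):
            confirmed.append(line)
        else:
            print(f"differ: {first} {second}", file=sys.stderr)
    return confirmed
```

Feeding true_dupes the lines read from stdin and writing the result to stdout would mimic the filter behavior described in the list.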
Assumptions and caveats
Minor differences in messages (in particular, footers added to some but not others) will still be reported as differing. Whether or not that's the right thing depends on your scenario.
linkify-pairs script
File: linkify-pairs.pl
Read the output of mddupes / nmdupes / ckmddupes, and turn each pair of distinct duplicate files into a pair of hardlinks to the same file (inode).
Theory of operation
- Reads stdin, expecting it to be pairs of file names, one pair per line, separated by tab
- Output of mddupes/nmdupes/ckmddupes is suitable as input
- Unlinks one file, and then creates a link to the other, with the same name as the one that was just unlinked
- If the files are already a set of hardlinks to the same inode, skips them. The dupe-finders (above) don't know about hardlinks and will continue reporting them as dupes.
- Sets of three or more duplicates should converge to one inode by the end of the run
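The unlink-then-link step can be sketched in a few lines of Python. This is an illustration, not the code in linkify-pairs.pl. The function name linkify is invented, and which file of the pair gets unlinked is an assumption (here, the second one).

```python
import os


def linkify(keep, replace):
    """Turn `replace` into a hardlink to `keep`.

    Returns False if the two names already refer to the same inode.
    Hypothetical helper; unlinking the second name is an assumption.
    """
    if os.path.samefile(keep, replace):
        return False            # already hardlinked together, skip
    os.unlink(replace)          # remove the duplicate's directory entry...
    os.link(keep, replace)      # ...and recreate the name as a link to `keep`
    return True
```

Note that there is a brief window between the unlink and the link in which the second name does not exist, which is one more reason not to run this while something else is using the mailbox.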
Assumptions and caveats
Assumes the input is correct, i.e., that the files actually are duplicates. If you feed it a list of unrelated files, it will happily delete half of them.
findmaildirs script
File: findmaildirs.sh
Find directories that look like Maildir mail folders.
Theory of operation
- Looks for directories containing the cur, new, and tmp directories that every Maildir should have
- Uses find(1) to do the search
- Specify one or more directories to search under, or specify none and it will search under the current directory, just like find(1) (this is not a coincidence)
- By default, it just reports the pathnames of the Maildir proper. Give it the -d switch, and it will instead report the cur and new directories of each Maildir. This output is suitable as arguments to mddupes.
- It avoids descending into cur or new, since those may have many message files and be slow to search. This means it won't find Maildirs nested within part of another Maildir, but that is fairly pathological.
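The real script is a shell wrapper around find(1). The same search can be sketched in Python with os.walk, mainly to show the pruning idea. The function name find_maildirs and its message_dirs parameter (standing in for the -d switch) are invented for this example.

```python
import os

MAILDIR_PARTS = ("cur", "new", "tmp")
PRUNED = ("cur", "new")   # may hold many message files, so never descend


def find_maildirs(top=".", message_dirs=False):
    """Return Maildirs under `top`.

    With message_dirs=True, return each Maildir's cur and new directories
    instead. Hypothetical helper, not a translation of findmaildirs.sh.
    """
    found = []
    for path, dirs, _files in os.walk(top):
        if all(part in dirs for part in MAILDIR_PARTS):
            if message_dirs:
                found.append(os.path.join(path, "cur"))
                found.append(os.path.join(path, "new"))
            else:
                found.append(path)
        # Editing dirs in place tells os.walk which subdirectories to skip.
        dirs[:] = sorted(d for d in dirs if d not in PRUNED)
    return found
```

As with the shell version, a Maildir nested inside another Maildir's cur or new directory would not be found.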
Assumptions and caveats
Assumes any directory with cur, new, and tmp subdirectories is a Maildir.
Usage examples
By themselves, none of the scripts does anything to fix a problem. They're intended to be used as building blocks, with pipelines and redirection.
mddupes $( findmaildirs /oldmail -d ) | ckmddupes | linkify-pairs
The above will look for duplicate messages in Maildirs under the /oldmail directory, confirm the candidates are duplicates, and turn any duplicates into one hardlinked file.
nmdupes \* | linkify-pairs
The above will look for duplicate messages in all mail known to Notmuch, and turn any messages with duplicate Message-IDs into one hardlinked file. Whether or not the content of said messages is the same is not checked.
nmdupes \* | ckmddupes | linkify-pairs
The above will look for duplicate messages in all mail known to Notmuch, confirm the candidates are duplicates, and turn any duplicates into one hardlinked file. It will take much longer to run, since it has to read and compare every message.
nmdupes \* >nmdupes.out 2>nmdupes.err
ckmddupes <nmdupes.out >ckmddupes.out 2>ckmddupes.err
wc -l nmdupes.out ckmddupes.out ckmddupes.out
linkify-pairs <ckmddupes.out
The above does the same thing as the previous example, but breaks it up into stages, allowing opportunities for review. It uses wc(1) to count lines for some basic statistics.
mddupes $( findmaildirs -d ) | ckmddupes | cut -f2 | xargs rm
The above looks for duplicates in Maildirs under the current directory (and subdirectories), and permanently deletes duplicates, leaving just one copy of each pair.