Maildir Deduplication by Message-ID

Introduction

This is a set of scripts for finding duplicate messages in one or more Maildir-style directories (one message per file). I wrote them to de-duplicate personal archives of mailing lists, which held multiple copies of the same message, each delivered via a different transport path (MX, protocol, mail sync time, etc.).

Assumptions and caveats

These tools do not have much in the way of child-proofing. If you ask them to do something stupid, they will blindly do it. It's quite easy to construct a pipeline that will destroy data. If you're not comfortable with that, please don't use these.

Running these scripts while something else is modifying the same messages is likely to lead to confusion, race conditions, and maybe lost data. Adding new messages is fine, reading messages is fine, just don't move/change/delete anything.

These scripts expect Maildir. If you manage to feed in an mbox-style mailbox, they will likely treat it as just one message.

For duplicate detection, we assume the Message-ID field is suitable for identifying duplicate messages. That is not a universally valid assumption. Be careful. ckmddupes can check the assumption for each reported pair by comparing the full message content.

The scripts

mddupes script

File: mddupes.pl

Grovel over a tree of message files, reading each message, finding the Message-ID from each, and reporting any duplicates encountered.

Theory of operation

Given a list of directory names, read every file in those directories
Treat every file as an RFC-822 message, and remember the Message-ID
Report on stdout any messages (files) which duplicate a seen Message-ID
Each report line lists both file names, separated by tab
Warn if a Message-ID line is not found in a file
Progress and problems are reported to stderr
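
The steps above can be sketched in Python roughly as follows. This is an illustration only, not the code in mddupes.pl. The function name find_duplicates is made up for the example, and the column order (first-seen file, then the duplicate) is an assumption, since the description above only says that both names appear on the line.

```python
import os
import sys
from email.parser import BytesHeaderParser


def find_duplicates(directories):
    """Yield (first_seen_path, duplicate_path) for each repeated Message-ID."""
    seen = {}  # Message-ID -> first file in which it was seen
    parser = BytesHeaderParser()  # parses headers only, skips the body
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            try:
                with open(path, "rb") as handle:
                    headers = parser.parse(handle)
            except OSError as err:
                # Subdirectories and unreadable files end up here.
                print(f"{path}: {err}", file=sys.stderr)
                continue
            message_id = headers.get("Message-ID")
            if message_id is None:
                print(f"{path}: no Message-ID found", file=sys.stderr)
                continue
            message_id = str(message_id).strip()
            if message_id in seen:
                # Assumed column order: first-seen file, then the duplicate.
                yield seen[message_id], path
            else:
                seen[message_id] = path
```

Printing each yielded pair as two tab-separated names gives the report format described above.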

Assumptions and caveats

Assumes every entry in each directory is a message file.

It makes no attempt to identify subdirectories and/or non-message files.

If it encounters a subdirectory it will try to open it as a file; this ends up doing nothing on my system, but for all I know it could make demons fly out of your nose. If it encounters a non-message file it will try to open it like a message. At best it will then complain it could not find a Message-ID. At worst it will interpret random data as a Message-ID and give bad results.

Feed it the names of the Maildir "new" and/or "cur" directories, while nothing else is using those directories, and it should be OK. The findmaildirs -d command is useful for this.

nmdupes script

File: nmdupes.py

Run a Notmuch search query, and print Message-IDs for any resulting messages that are contained in more than one file.

Commentary

If you have your mail indexed by Notmuch, you can use it to look for duplicate Message-IDs. Since Notmuch already has the Message-IDs and file names indexed, it can run much faster than mddupes (which has to open and read each message file). On my system, mddupes took hours where nmdupes took minutes.

Provide a Notmuch search query. To run against all mail, simply use * as the query (escaped for your shell, as appropriate).
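
As a rough illustration of the idea, and not the code in nmdupes.py, the sketch below shells out to the notmuch command line with its --output=messages and --output=files search options. The helper names are made up for the example. It assumes the output wanted is tab-separated file pairs, as ckmddupes and linkify-pairs expect, and it runs one notmuch process per message, so it will be far slower than a script using the Notmuch library directly.

```python
import subprocess


def notmuch_lines(*args):
    """Run a notmuch subcommand and return its non-empty output lines."""
    result = subprocess.run(
        ["notmuch", *args], check=True, capture_output=True, text=True
    )
    return [line for line in result.stdout.splitlines() if line]


def files_by_message(query):
    """Map each matching message (as an 'id:...' term) to its file names."""
    mapping = {}
    for id_term in notmuch_lines("search", "--output=messages", query):
        # Each 'id:...' line is itself a valid search term for that message.
        mapping[id_term] = notmuch_lines("search", "--output=files", id_term)
    return mapping


def duplicate_pairs(mapping):
    """Yield (first_file, other_file) for messages stored in several files."""
    for files in mapping.values():
        for other in files[1:]:
            yield files[0], other
```

Something like duplicate_pairs(files_by_message("*")) would then cover all indexed mail.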

ckmddupes script

File: ckmddupes.pl

Read the output of mddupes / nmdupes, and check each pair of duplicates, by reading the entire file for each and making sure they are identical in content (not just Message-ID).

Theory of operation

Reads stdin, expecting it to be pairs of file names, one pair per line, separated by tab
- Output of mddupes/nmdupes is suitable as input
Compares the file pairs to see if they really are the same message
- Ignores selected headers which can normally vary (different transport paths for same msg, meaningless mail software IDs, etc.)
- Ignores blank lines at the end of one message but not the other
- Otherwise messages must match, headers and body, all lines, same order
Report to stderr messages which differ in some way (not actually a dupe)
Report both names of identical messages (true dupes) to stdout
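
A minimal Python sketch of the comparison described above follows. It is not the code in ckmddupes.pl. The IGNORED_HEADERS list is hypothetical, since the headers the real script skips are not listed here, and the function names are made up for the example.

```python
# Hypothetical ignore list; the headers ckmddupes really skips are not given here.
IGNORED_HEADERS = {"received", "return-path", "delivered-to", "x-uidl", "status"}


def comparable_lines(path):
    """Return the message's lines minus ignored headers and trailing blank lines."""
    with open(path, "rb") as handle:
        lines = handle.read().split(b"\n")
    kept = []
    in_headers = True
    skipping = False  # True while inside an ignored (possibly folded) header
    for line in lines:
        if in_headers:
            if line.strip() == b"":
                in_headers = False  # blank line ends the header block
            elif line[:1] in (b" ", b"\t"):
                if skipping:
                    continue  # continuation of an ignored header
            else:
                name = line.split(b":", 1)[0].strip().lower()
                skipping = name.decode("ascii", "replace") in IGNORED_HEADERS
                if skipping:
                    continue
        kept.append(line)
    while kept and kept[-1].strip() == b"":
        kept.pop()  # ignore blank lines at the end of the message
    return kept


def same_message(path_a, path_b):
    """True if both files hold the same message once ignorable parts are dropped."""
    return comparable_lines(path_a) == comparable_lines(path_b)


def check_pairs(lines):
    """Split tab-separated pairs into (identical, differing) lists."""
    identical, differing = [], []
    for record in lines:
        first, second = record.rstrip("\n").split("\t", 1)
        target = identical if same_message(first, second) else differing
        target.append((first, second))
    return identical, differing
```

Everything outside the ignore list must match line for line, so a footer added to only one copy lands the pair in the differing list.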

Assumptions and caveats

Minor differences in messages (in particular, footers added to some but not others) will still be reported as differing. Whether or not that's the right thing depends on your scenario.

linkify-pairs script

File: linkify-pairs.pl

Read the output of mddupes / nmdupes / ckmddupes, and turn each pair of distinct duplicate files into a pair of hardlinks to the same file (inode).

Theory of operation

Reads stdin, expecting it to be pairs of file names, one pair per line, separated by tab
- Output of mddupes/nmdupes/ckmddupes is suitable as input
Unlinks one file, and then creates a link to the other, with the same name as the one that was just unlinked
If the files are already a set of hardlinks to the same inode, skips them. The dupe-finders (above) don't know about hardlinks and will continue reporting them as dupes.
Sets of three or more duplicates should converge to one inode by the end of the run
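
The unlink-then-link step can be sketched in Python as below. This is an illustration, not the code in linkify-pairs.pl. Which file of a pair gets unlinked is not stated above, so the sketch assumes it is the second one. The function names are made up for the example.

```python
import os


def linkify(keep, replace):
    """Make `replace` a hard link to `keep`; return False if already linked."""
    if os.path.samefile(keep, replace):
        return False  # same device and inode already, nothing to do
    # Assumption: the second file of the pair is the one that gets unlinked.
    os.unlink(replace)
    os.link(keep, replace)
    return True


def linkify_pairs(lines):
    """Process tab-separated pairs; return how many files were relinked."""
    changed = 0
    for record in lines:
        first, second = record.rstrip("\n").split("\t", 1)
        if linkify(first, second):
            changed += 1
    return changed
```

Between the unlink and the link, the second name briefly does not exist. A more cautious variant would link to a temporary name and rename that over the target.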

Assumptions and caveats

Assumes the input is correct, i.e., that the files actually are duplicates. If you feed it a list of unrelated files, it will happily delete half of them.

findmaildirs script

File: findmaildirs.sh

Find directories that look like Maildir mail folders.

Theory of operation

Looks for directories containing the cur, new, and tmp directories that every Maildir should have
Uses find(1) to do the search
Specify one or more directories to search under, or specify none and it will search under the current directory, just like find(1) (this is not a coincidence)
By default, it just reports the pathnames of the Maildir proper. Give it the -d switch, and it will instead report the cur and new directories of each Maildir. This output is suitable as arguments to mddupes.
It avoids descending into cur or new, since those may have many message files and be slow to search. This means it won't find Maildirs nested within part of another Maildir, but that is fairly pathological.
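
The same search can be sketched in Python with os.walk standing in for find(1). This is not the code in findmaildirs.sh. The function name find_maildirs and the message_dirs parameter (playing the role of the -d switch) are made up for the example.

```python
import os

MAILDIR_PARTS = ("cur", "new", "tmp")


def find_maildirs(roots=(".",), message_dirs=False):
    """Yield Maildir paths, or their cur and new directories if message_dirs."""
    for root in roots:
        for dirpath, dirnames, _filenames in os.walk(root):
            if all(part in dirnames for part in MAILDIR_PARTS):
                if message_dirs:
                    yield os.path.join(dirpath, "cur")
                    yield os.path.join(dirpath, "new")
                else:
                    yield dirpath
            # Prune in place so os.walk never descends into cur or new.
            dirnames[:] = [d for d in dirnames if d not in ("cur", "new")]
```

Because cur and new are pruned, a Maildir nested inside another Maildir's cur or new is not found, matching the behaviour described above.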

Assumptions and caveats

Assumes any directory with cur, new, and tmp subdirectories is a Maildir.

Usage examples

By themselves, none of the scripts does anything to fix a problem. They're intended to be used as building blocks, with pipelines and redirection.

mddupes $( findmaildirs /oldmail -d ) | ckmddupes | linkify-pairs

The above will look for duplicate messages in Maildirs under the /oldmail directory, confirm the candidates are duplicates, and turn any duplicates into one hardlinked file.

nmdupes \* | linkify-pairs

The above will look for duplicate messages in all mail known to Notmuch, and turn any messages with duplicate Message-IDs into one hardlinked file. Whether or not the content of said messages is the same is not checked.

nmdupes \* | ckmddupes | linkify-pairs

The above will look for duplicate messages in all mail known to Notmuch, confirm the candidates are duplicates, and turn any duplicates into one hardlinked file. It will take much longer to run, since it has to read and compare every candidate message.

nmdupes \* >nmdupes.out 2>nmdupes.err
ckmddupes <nmdupes.out >ckmddupes.out 2>ckmddupes.err
wc -l nmdupes.out ckmddupes.out ckmddupes.err
linkify-pairs <ckmddupes.out

The above does the same thing as the previous example, but breaks it up into stages, allowing opportunities for review. It uses wc(1) to count lines for some basic statistics.

mddupes $( findmaildirs -d ) | ckmddupes | cut -f2 | xargs rm

The above looks for duplicates in Maildirs under the current directory (and subdirectories), and permanently deletes duplicates, leaving just one copy of each pair.