Message stores and Mailman archivers

All Mailman archivers need access to the archived mails in some way. The current Mailman 2 design stores all archives in mbox files, and the pipermail archiver builds static HTML pages for every new message received by a mailing list. So each mail is stored at least twice: once in the mbox and once in the HTML page (and maybe a third time in a search index). Mailman 3 should handle mails in a smarter way. The mails should be kept in a single location, accessible to all archivers. It should be possible to access one specific mail quickly (so storage in a shared mbox does not work, quite apart from the locking issues).

The problem is that some archivers need more data about the mails than others. The Hyperkitty web archiver has to know which mails start a thread in order to display an overview of threads quickly. Fast access to the sender and the subject could also be useful (for search). The NNTP archiver does not need as much data about individual mails. It needs a way to select a specific mail of one mailing list, identified by an index. Thread information is completely irrelevant to it. But the NNTP archiver does, for example, need to find a mail by its message id across all archived mailing lists. For this requirement it is very bad to store the archive of each mailing list in a separate database table: searching for a mail across all mailing lists would then require as many SELECT queries as there are mailing lists.
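To illustrate the last point, here is a minimal sketch of the single-table layout, using an in-memory SQLite database. The table name, the column names and the helper functions are assumptions made for this example, not an existing Mailman schema. With one shared table and an index on the message id, the lookup across all mailing lists is a single query.

```python
import sqlite3


def open_archive():
    """Create an in-memory database with ONE table shared by all lists."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE message ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " list_name TEXT NOT NULL,"   # hypothetical column names
        " message_id TEXT)"
    )
    # The index makes the cross-list lookup cheap.
    conn.execute("CREATE INDEX message_by_message_id ON message (message_id)")
    return conn


def add_message(conn, list_name, message_id):
    """Record one archived mail and return its row id."""
    cur = conn.execute(
        "INSERT INTO message (list_name, message_id) VALUES (?, ?)",
        (list_name, message_id),
    )
    return cur.lastrowid


def find_by_message_id(conn, message_id):
    """One SELECT, no matter how many mailing lists are archived."""
    return conn.execute(
        "SELECT id, list_name FROM message WHERE message_id = ?",
        (message_id,),
    ).fetchall()
```

With one table per mailing list, find_by_message_id would instead have to loop over all list tables and run one query for each.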

Recently there was a discussion on the mailman-developers mailing list about the storage of user information. It is the same problem: some components need some data, other components need other data, but all components need some common data. The difficulty with user information is that the common data is not static. Users may change their password or their email address. The archivers have an easier job, because archived messages should not change. The only change I can imagine is that some messages get deleted (because the whole mailing list is deleted, or because deletion was requested due to private data). All archivers could simply update their data by checking whether specific mails are still available.

One possibility raised in the discussion about user data is that a user who wants the Postorius web interface has to use the Postorius user manager, and a user who does not want Postorius uses the core's user manager. That might work for user information, but I can imagine situations where this approach causes problems. If one archiver requires some special information and another requires other information, there has to be one IMessageStore implementation that satisfies the requirements of both. That means the user/admin cannot freely combine the available components. For archivers I would propose a system where the core simply saves the messages and provides an interface to access them (via zope.component for archivers in the core, and via REST for archivers outside the core, like Hyperkitty). Each archiver could keep private extensions in its own storage. The NNTP archiver would, for example, create an index of the message ids; Hyperkitty would create an index of the subject and thread information needed to display a nice website efficiently. The archivers should then get the mailing list information and an index for accessing the message. For archiving, the messages do not need to be delivered to the individual archivers, as they can access the messages via the API exposed by the core.
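A rough sketch of such a private extension, assuming an archiver is only told the mailing list and the store id and fetches the raw message itself. The class MessageIdIndex, its method names and the fetch callable are hypothetical; fetch stands in for whatever access the core would expose (zope.component or REST). The header parsing is done by the archiver, not by the core.

```python
from email import message_from_bytes


class MessageIdIndex:
    """Private message-id index of a hypothetical NNTP archiver."""

    def __init__(self, fetch):
        # fetch(store_id) -> raw bytes; stand-in for the core's API.
        self._fetch = fetch
        self._by_message_id = {}

    def archive_message(self, list_name, store_id):
        """Index one message; return False if it has no Message-ID."""
        msg = message_from_bytes(self._fetch(store_id))
        message_id = msg.get("Message-ID")
        if message_id is None:
            # The archiver decides what to do with unusable messages.
            return False
        self._by_message_id[message_id.strip()] = (list_name, store_id)
        return True

    def lookup(self, message_id):
        """Return (list_name, store_id), or None if the id is unknown."""
        return self._by_message_id.get(message_id)
```

A message without a Message-ID header is still kept by the core in this sketch; it is simply not reachable through this one archiver's index.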

Kittystore

So why not use the Kittystore of Hyperkitty directly? While testing Kittystore I had several issues inserting messages into the store. The message parsing has various problems with the archives I used for testing. The core should not fiddle with parsing different sender formats (From: could contain just a single mail address, "Surname, Name" mail address, Name Surname <mail address>, or even other formats) if it does not need that information itself. The services that need this information should know how to parse it and what to do if parsing fails. Kittystore even fails on standard multipart mails: it tries to insert the content into a database column, but for a multipart mail the content is a list of messages. If an archiver in the core relies on that information, the core needs a much higher quality of parsing.
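Both parsing concerns can be shown with the Python standard library alone. This is only a sketch of what a consuming service might do; the helper names sender_address and text_parts and the sample message are made up for the example. parseaddr copes with the common From: formats, and for a multipart message get_payload() returns a list of sub-messages rather than a string, so the content cannot be written to a text column as is.

```python
from email import message_from_string
from email.utils import parseaddr


def sender_address(from_header):
    """Return the bare address of a From: value, or None if parsing fails."""
    _name, address = parseaddr(from_header)
    return address or None


def text_parts(msg):
    """Collect the text/plain payloads, descending into multipart containers."""
    return [
        part.get_payload()
        for part in msg.walk()
        if part.get_content_type() == "text/plain"
    ]


# Made-up sample: a standard multipart mail with two text parts.
RAW = (
    "From: Jane Doe <jane@example.org>\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "second part\n"
    "--XYZ--\n"
)
msg = message_from_string(RAW)
```

A service that only needs the plain text would use something like text_parts(msg); a store that keeps the raw message never has to make this decision.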

In addition, Kittystore is based on a different ORM than the Mailman core. That is bad, apart from the longer dependency list, because the relations between archived mails and mailing lists cannot be modeled in a sane way.

BasicMessageStore

I propose an implementation of IMessageStore in the core, called BasicMessageStore, that requires no parsing of the message. It should therefore not use the message id as the unique identifier, so that even a malformed message can be archived. The complete message should simply be saved as is. Once a unique id has been assigned, it should point to that single message forever; only if the message gets deleted should the id stop being valid. Each message should be related to the mailing list it was posted to, so that the archive can be deleted when the mailing list is removed.
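The properties listed above can be sketched in a few lines. This is an in-memory toy, not the real IMessageStore interface; the method names are assumptions chosen for the example. It stores the raw bytes without looking at them, hands out ids from a counter that never goes backwards, and can drop everything belonging to one mailing list.

```python
import itertools


class BasicMessageStore:
    """Toy store: raw bytes in, opaque integer ids out, no parsing."""

    def __init__(self):
        self._ids = itertools.count(1)   # ids are never handed out twice
        self._messages = {}              # store id -> (list_name, raw bytes)

    def add(self, list_name, raw):
        """Save the message exactly as received and return its new id."""
        store_id = next(self._ids)
        self._messages[store_id] = (list_name, bytes(raw))
        return store_id

    def get(self, store_id):
        """Return the raw bytes; raises KeyError for deleted or unknown ids."""
        return self._messages[store_id][1]

    def delete(self, store_id):
        """Remove one message; its id stays invalid afterwards."""
        del self._messages[store_id]

    def delete_list(self, list_name):
        """Remove the whole archive of one mailing list; return the count."""
        doomed = [
            store_id
            for store_id, (name, _raw) in self._messages.items()
            if name == list_name
        ]
        for store_id in doomed:
            del self._messages[store_id]
        return len(doomed)
```

Because the id comes from a counter and not from the Message-ID header, a message with a missing or duplicated Message-ID can be stored like any other.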

The BasicMessageStore should be the first archiver executed, and it should be the only one that gets the whole mail. The other archivers should only get the id of the archived message and request the content via the API. Maybe the API should be able to return the headers and the body of the mail separately. When it is called via the REST interface, it could matter that the body is only transferred if needed.
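Separating headers and body does not require parsing the message either: in the standard mail format the header block ends at the first empty line. A minimal sketch, with split_raw as a hypothetical helper name:

```python
import re

# The first empty line (LF or CRLF line endings) ends the header block.
_BLANK_LINE = re.compile(rb"\r?\n\r?\n")


def split_raw(raw):
    """Split raw message bytes into (headers, body) without parsing them."""
    match = _BLANK_LINE.search(raw)
    if match is None:
        # No empty line found: treat everything as headers.
        return raw, b""
    return raw[:match.start()], raw[match.end():]
```

A REST call that only asks for the headers could then return the first element and skip transferring the body. How a message without any header block should be handled is left open here.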

Comments !