Replication protocol design #2

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Replication protocol design #2

Timo Sirainen
Changes:

 - Added goal 8 and rewrote mailbox synchronization plan.
 - Added new SELECT command to change active mailbox and removed mailbox
ID from command parameters

Goals
-----

1. If a single (or configurable number of) server dies, no mails must be
lost that have been reported to IMAP/SMTP clients as being successfully
stored.

2. Must be able to automatically recover from a server desynchronization,
such as:
 - server has been offline for a long time
 - some mail files have been manually added/deleted
 - corrupted data/mail files if they're noticed

3. In multi-master setup if the link between servers die, the servers must
be able to proceed autonomously (kind of conflicts with goal 1 though).
When the link comes back, the changes are replicated as soon as possible.

4. Normal IMAP commands must not be able to cause desynchronization between
servers. For example making conflicting flag changes simultaneously in two
servers must not result in the servers having different flags.

5. Must perform well with at least 3 servers in a multi-master setup,
preferrably still with tens of servers.

6. Latency shouldn't be increased noticeably when using servers distributed
into 3 or more data centers. Must be usable over high-latency links (in
optimistic async mode).

7. In normal operation send minimal incremental changes.

8. Make non-incremental synchronization fast enough to allow replication
with users' untrusted Dovecot servers which may connect only rarely.
Support also super high latency replication (e.g. using USB sticks).

Protocol
--------

There are two major parts in the protocol: Handling the normal incremental
changes and fixing desynchronization.

Originally I thought that maybe the changes could be sent using the same
format as Dovecot's transaction logs, but now I'm beginning to think that
it's probably not that good idea. The code reuse potential is quite minimal
and the format would still have to be extended in several ways so that it
won't be directly compatible anyway.

So I'm thinking the protocol could be something text-based. The main
benefit is that text-based protocols are easier to debug. Stream
compression should drop most of the extra overhead if bandwidth is a
problem.

The exact on-wire protocol anyway doesn't matter in the design that's
discussed below.

The commands have tags similar to IMAP commands, because some commands may
have to be forwarded to other servers and it may take a while to get a
reply. During the wait the server may process other commands.

Mailbox master
--------------

Each mailbox has a single master server selected. In multi-master setups
the master server may be moved by having the destination server simply
request it from the current master. The current master must then give it
up. If link is lost to the current master, one of the remaining servers
will become the new master within the remaining servers.

Each server must be connected to the current master server. Since each
mailbox can have a different master server, this typically means all
servers are connected to each others. However it's possible to create
setups where server connects to only one other server, which in turn
connect to more servers. This is useful if there are bandwidth bottlenecks
between some servers. This kind of a situation can also happen if in a
network A-B-C the link A-C dies, but A-B and B-C continues to work. Because
of this all commands must be able to function in a way that the server
proxies them to the current master, instead of failing the command and
trying to make it the caller's problem to resend the command to the actual
master.

When the cluster starts up, a single server is selected as the root master
for all mailboxes. If a server doesn't know who the current mailbox master
is, it asks from the root. All servers cache the currently known mailbox
masters to avoid constant requests to the root.

If the root server dies, another server is selected as the root. Because
the new root doesn't know what masters have been requested (and asking all
of them from all servers would just waste bandwidth), all the servers are
expected to flush their master caches and drop their own master status. The
new root doesn't respond to any requests before all servers have notified
that they've dropped being a master.

The master status doesn't have to be at mailbox level granularity. It could
just as well be configured to move at user, domain or even global level.
Perhaps this could be done dynamically, so that higher granularity is used
when the master is beginning to change too often between servers.

Mailbox ID
----------

Mailbox IDs are session-specific numbers dynamically assigned for
user+mailbox+UIDVALIDITY combinations. All connections have different
mailbox IDs. Also send and receive directions have different IDs. This
allows the sender to easily replace existing IDs to point to new mailboxes
without causing any confusion.

MBOX:
 - Mailbox ID
 - User name
 - Mailbox name
 - Mailbox UIDVALIDITY
Reply:
 - [Mailbox UIDVALIDITY] (if changed - command failed)
 - Mailbox UIDNEXT
 - Highest modseq

If the receiver finds out it has a different UIDVALIDITY, the mailbox
requires a full resync. UIDNEXT and modseq can be used to determine if
replication servers are out of sync.

After mailbox ID has been assigned, the active mailbox can be changed using
a command:

SELECT:
 - Mailbox ID

Requesting master status
------------------------

MASTER-MOVE:
 - [Destination SID] (if forwarding)

The command is sent to the last known master for the selected mailbox. The
server will keep forwarding the command until it reaches the current
master. During the forwarding other servers may want to request something
from the master. These requests must be delayed by the forwarding servers
until the move is finished.

Saving messages
---------------

SAVE:
 - Received date
 - [IMAP UID] (only if we're the master)
 - Global UID (stays the same when copying the message)
 - Message text
Reply:
 - [IMAP UID] (only if not specified in parameters)
 - [Current mailbox master SID] (if it was moved)

If current server is not the master, the SAVE is sent to the master which
gives the message its IMAP UID. The master server then replicates the
message to other servers with the IMAP UID parameter set.

The mailbox master may have already changed by the time server receives a
save request. If server receives a SAVE without IMAP UID parameter, it's
responsible for finding out the new mailbox master and sending a new SAVE
request to it. Once the new master replies with the IMAP UID, the server
can reply to the original SAVE request, also providing the new master SID
so the future requests can be sent there directly.

To be sure the message doesn't get lost, the server should not reply OK to
the IMAP/SMTP client until it has received a reply from SAVE.

Copying messages
----------------

FIXME: Source or destination mailbox ID parameter should probably be
removed and the selected mailbox's ID used instead.

COPY:
 - Source mailbox ID
 - Destination mailbox ID
 - Source IMAP UID
 - Global UID
 - [Destination IMAP UID] (only if we're the master)
 - Destination received date
Reply:
 - [Destination IMAP UID] (only if not specified in parameters)
 - [Current mailbox master SID] (if it was moved)

Source mailbox ID + source IMAP UID identifies the message to be copied.
It's expected to contain the given global UID (which is just an extra
sanity check). Otherwise it works the same way as SAVE.

Since the message already exists, it's probably not necessary to wait for a
reply before replying OK to originating IMAP client.

Expunging messages
------------------

EXPUNGE:
 - IMAP UIDs
(No reply)

Expunges also have to be sent via master server (the same way as SAVE) to
avoid COPY command failing in some servers because it was just expunged.

Changing message flags
----------------------

STORE:
 - IMAP UIDs
 - Added flags/keywords
 - Removed flags/keywords
 - [Current modseq] (master sends)
 - [Highest modseq of the messages before this change] (non-master sends)
 - [flag: this is a CONDSTORE STORE UNCHANGEDSINCE] (non-master may send)
[Reply:
 - IMAP UIDs where STORE was rejected to (if CONDSTORE flag was used)
]

Stores also have to be sent via master server to avoid flag
desynchronization. Master first checks if it has higher modseqs in the
messages. Then it applies all the changes and forwards the changes to other
servers. For messages that had higher initial modseqs their flags are sent
to the server sending the STORE to fix a potential desync.

If CONDSTORE flag is set, the change is rejected for messages that had a
higher modseq. Non-masters shouldn't reply to a STORE UNCHANGEDSINCE
command before the change has been replicated to master and the rejections
have been processed.

Mailbox synchronization
-----------------------

Network connections to other servers are always initiated by the same
servers. From replication point of view they could be thought of as
client/server.

The clients keep track of the last seen server state for all mailboxes:
 - UIDVALIDITY
 - UIDNEXT
 - Highest modseq

Client requests mailbox ID with MBOX command. If UIDVALIDITY changed, state
is reset by setting UIDNEXT=1 and highest modseq=0. MBOX command is then
resent using the updated UIDVALIDITY.

If server's UIDNEXT (server-UIDNEXT) is higher than last seen UIDNEXT
(old-UIDNEXT), it means the server has added new messages and we want to
fetch them. If the client's UIDNEXT (client-UIDNEXT) is higher than the
old-UIDNEXT, it means the client has added new messages and they need to be
sent to the server. All messages with UID >= old-UIDNEXT and UID <
min(client-UIDNEXT, server-UIDNEXT) have an UID conflict and they must be
given new unused UIDs.

The synchronization is done in two batches of commands. Client first
selects the mailbox and then starts the sync:

SYNC-BEGIN:
 - server's last highest modseq
 - client's current highest modseq
 - QRESYNC-like sequence + IMAP UID lists for optimizing EXPUNGE replies

Client then proceeds to send all its changes using SAVE, EXPUNGE and STORE
commands. The sync is finished with SYNC-END command.

Server replies with:
 - List of expunged IMAP UIDs (EXPUNGE command):
   - IMAP UIDs
 - Added messages (SAVE command?):
   - IMAP UID
   - Global UID
   - Modseq
   - Flags and keywords
   - Received date
   - [Message text]
 - Changed flags/keywords/modseqs (STORE commands):
   - IMAP UID
   - Modseq
   - Changed flags and keywords (or if not easily determined,
     their current values)
 - IMAP UID conflict fixes (UIDFIX command):
   - Source IMAP UIDs
   - Destination IMAP UIDs
 - Sync finish:
   - Updated UIDNEXT
   - Updated highest modseq

Server may also reply with STORE commands that change nothing except
modseqs. This is needed to get the modseqs synchronized.

If in high-latency + high-bandwidth mode the message texts are always sent.
The message texts are also sent if the server can determine that the client
couldn't have previously seen the message (i.e. the message was saved, not
copied). Otherwise they have to be requested for the messages with
previously unseen global UIDs:

FETCH:
 - IMAP UIDs
Reply:
 - Message texts

By the time client receives the reply from server, it may already have done
further changes. This means it may have to fix modseqs and IMAP UIDs
internally as well.


signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Replication protocol design #2

Timo Sirainen
On Thu, 2008-05-01 at 14:36 +0300, Timo Sirainen wrote:
> Support also super high latency replication (e.g. using USB sticks).

If it wasn't clear how this works:

"Client" and "server" are both Dovecots, client just has no/slow network
connection.

1. Client runs a command to create a replication batch file that's
stored on USB memory stick (or whatever).

2. The USB stick is plugged into another machine with a higher network
bandwidth to the server.

3. The replication batch file is copied to the server and a command is
run to process it. The command creates a reply batch file.

4. The reply file is copied to the USB stick and the file gets moved to
the client.

5. Client runs a command to import the reply file.

This works because the mailbox synchronization doesn't require more than
one roundtrip.


signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Replication protocol design #2

Ed Wildgoose-2
In reply to this post by Timo Sirainen
8. Make non-incremental synchronization fast enough to allow replication
with users' untrusted Dovecot servers which may connect only rarely.
Support also super high latency replication (e.g. using USB sticks).


The USB stick sync in particular would be a MASSIVELY powerful feature for an admittedly small userbase!

Ed W

Reply | Threaded
Open this post in threaded view
|

Re: Replication protocol design #2

Timo Sirainen
In reply to this post by Timo Sirainen
Added "current modseq" parameter to all commands and MODSEQ command.
Other things missing from the replication protocol are related to
mailbox list, i.e. figuring out which mailboxes to replicate, which are
newly created, deleted, renamed, etc. There will probably have to be a
new "unique global mailbox ID" which is preserved across renames.

I think global message UIDs will also become 128bit IDs instead of
64bit. 128bit GIDs could be generated from:

 - 64bit timestamp with nanosecond resolution (or microsecond if
clock_gettime() not available). When delivering multiple mails within a
process make sure this value increases always.
 - 32bit process ID
 - 32bit server ID (configurable? based on IPv4 address? 48bit MAC
address by reducing PID/timestamp by 16bits? is exposing these to a
normal user a security problem?)

I think it's pretty safe to assume that two processes can't deliver a
mail within the same nanosecond using the same PID.

Modseq changing
---------------

MODSEQ:
 - IMAP UIDs
 - Updated modseq

When doing expunges, stores or whatever else in future updates modseqs,
there's a problem with keeping them synchronized between all servers. This
command allows fixing them so that the entire cluster will have the modseq
from the server that assigned the highest modseq for the change. This means
that some servers may temporarily assign a lower modseq for a message only
to be soon updated to a slightly higher modseq.

When master is processing commands if a command's modseq parameter <=
highest-modseq, the master sends back a MODSEQ which changes the message's
modseq to highest-modseq+1. The command is forwarded to other servers using
the updated modseq.

When server receives a MODSEQ command, the modseq is updated only for the
messages which currently have a lower modseq. The modseq must be updated
also if the message has already been expunged, so that syncing (replication
or QRESYNC) can send expunges correctly.


signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Replication protocol design #2

Charles Marcus
In reply to this post by Timo Sirainen
On 5/1/2008 7:36 AM, Timo Sirainen wrote:
> Changes:

Hi Timo,

The replication work sounds way cool and I look forward to being able to
take advantage of it, but...

I'm also very keen on seeing full support for Shared Folders and client
managed ACLs...

Has this fallen to the back-burner? I seem to recall this was planned
for 1.2, but haven't heard much about it since...

Thanks for all your hard work!

--

Best regards,

Charles
Reply | Threaded
Open this post in threaded view
|

Re: Replication protocol design #2

Timo Sirainen
On May 1, 2008, at 7:46 PM, Charles Marcus wrote:

> I'm also very keen on seeing full support for Shared Folders and  
> client managed ACLs...
>
> Has this fallen to the back-burner? I seem to recall this was  
> planned for 1.2, but haven't heard much about it since...

At least some kind of a shared mailbox support is required for  
replication. I was thinking that there would be one replication  
process which can write (and read) all users' all mailboxes.

Other than that ACLs aren't high on my TODO list.


PGP.sig (201 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Replication protocol design #2

Ed Wildgoose-2
In reply to this post by Timo Sirainen

>  - 32bit server ID (configurable? based on IPv4 address? 48bit MAC
> address by reducing PID/timestamp by 16bits? is exposing these to a
> normal user a security problem?)
>  


If in doubt MD5 the information and pick some bits from the MD5 hash