[2.3.8] possible replication issue


Dovecot mailing list
Hi,

some of our customers have discovered a replication issue after
upgrading from 2.3.7.2 to 2.3.8.

Running 2.3.8, several replication connections hang until the defined
timeout. So after a few seconds there are $replication_max_conns hanging
connections.
Other replications run fast and complete successfully.

Also, running doveadm sync tcp:... works fine for all users.

I can't tell for sure, but I haven't seen the same mailboxes timing out
again and again. So I would assume it's not related to the mailbox.

From the logs:

server1:
Oct 16 08:29:25 server1 dovecot[5715]: dsync-local([hidden email])<FXnVDW22pl0tGAAA1cwDxA>: Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version not received)
Oct 16 08:29:25 server1 dovecot[5715]: dsync-local([hidden email])<FXnVDW22pl0tGAAA1cwDxA>: Error: Timeout during state=master_recv_handshake

server2:

Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1) failed: EOF (last sent=handshake, last recv=handshake)

There aren't any additional logs regarding the replication.
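
For anyone who wants to check how often this happens on their own systems, here is a minimal sketch that counts the two error messages quoted above. It assumes the log lines are unwrapped (one entry per line) and worded exactly as in this post; the function name summarize and the sample list are only illustrative, and other Dovecot versions may phrase these errors differently.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts in this thread; other Dovecot
# versions may word these messages differently (assumption).
STALL_RE = re.compile(r"I/O has stalled, no activity for (\d+) seconds \(([^)]*)\)")
STATE_RE = re.compile(r"Timeout during state=(\w+)")


def summarize(lines):
    """Count stalled-dsync errors by reason and timeouts by dsync state."""
    stalls = Counter()
    states = Counter()
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalls[m.group(2)] += 1
        m = STATE_RE.search(line)
        if m:
            states[m.group(1)] += 1
    return stalls, states


# Sample lines copied from the log excerpts above (unwrapped).
sample = [
    "Oct 16 08:29:25 server1 dovecot[5715]: dsync-local([hidden email])<FXnVDW22pl0tGAAA1cwDxA>: Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version not received)",
    "Oct 16 08:29:25 server1 dovecot[5715]: dsync-local([hidden email])<FXnVDW22pl0tGAAA1cwDxA>: Error: Timeout during state=master_recv_handshake",
    "Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1) failed: EOF (last sent=handshake, last recv=handshake)",
]
stalls, states = summarize(sample)
print(dict(stalls))
print(dict(states))
```

Feeding it a real log file instead of the sample list (for example, the lines of an open file object) would show whether the stalls always occur in the same handshake state.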

I have tried increasing vsz_limit or reducing replication_max_conns.
Nothing changed.

--

Both customers have 10k+ users. So far I couldn't reproduce this on
smaller test systems.

Both installations were downgraded to 2.3.7.2 to fix the issue for now.

--

I've attached a tcpdump showing that the client stops
sending any data after the mailbox_guid table headers.

Any idea what could be wrong here, or how to debug this issue?

Thanks.

Carsten Rosenberg

doveconf-n.txt (4K) Download Attachment
repl-dump.txt (3K) Download Attachment

Re: [2.3.8] possible replication issue

Dovecot mailing list
I have the same problem here.
All systems are running Debian 9 amd64.

My dovecot director servers are running 2.3.8, but the mailbox servers have sync/replication problems with 2.3.8. So I have downgraded the mailbox servers to 2.3.7, and now everything works fine again.

On 18 October 2019 at 13:52:37 CEST, Carsten Rosenberg via dovecot <[hidden email]> wrote:
> [...]

Re: [2.3.8] possible replication issue

Dovecot mailing list
Hello,

upgrading to 2.3.9 unfortunately does *not* solve this issue:

I upgraded one of my replicators from 2.3.7.2 to 2.3.9 and after a few
seconds replication stopped. The other replicator remained on 2.3.7.2.
After downgrading to 2.3.7.2, replication is working fine again.

I have not tried upgrading both replicators so far, as this is a live
production system. Is there a chance that upgrading both replicators
will solve the problem?

The machines are running Ubuntu 18.04

Any help is appreciated.

Thanks,
Andreas

--
________________________________________________________________________
Dr. Andreas Piper, Hochschulrechenzentrum der Philipps-Univ. Marburg
           Hans-Meerwein-Straße 6, 35032 Marburg, Germany
Phone: +49 6421 28-23521  Fax: -26994  E-Mail: [hidden email]



Re: [2.3.8] possible replication issue

Dovecot mailing list
I think there's a good chance that upgrading both will fix it. The bug already existed in old versions, it just wasn't normally triggered. Since v2.3.8 this situation is triggered on one dsync side, so the v2.3.9 fix needs to be on the other side.
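
To make that rule concrete, here is a minimal sketch that flags a risky version pairing between two replication partners. It encodes only what is reported in this thread (v2.3.8 and later trigger the situation, v2.3.9 contains the fix), not an official compatibility matrix; the names parse_version and pair_at_risk are hypothetical.

```python
def parse_version(text):
    """Turn '2.3.7.2' or 'v2.3.9' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Assumption drawn from this thread, not from release notes: v2.3.8 started
# triggering the stall, and v2.3.9 carries the fix for the other side.
TRIGGER = parse_version("2.3.8")
FIXED = parse_version("2.3.9")


def pair_at_risk(version_a, version_b):
    """True if one side can trigger the stall while a side lacks the fix."""
    a, b = parse_version(version_a), parse_version(version_b)
    return max(a, b) >= TRIGGER and min(a, b) < FIXED


# The three pairings reported in this thread.
for pair in [("2.3.7.2", "2.3.7.2"), ("2.3.9", "2.3.7.2"), ("2.3.9", "2.3.9")]:
    print(pair, "at risk" if pair_at_risk(*pair) else "ok")
```

Under these assumptions, the mixed 2.3.9 / 2.3.7.2 pairing is the only one of the three flagged, which matches the reports in this thread.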

On 5. Dec 2019, at 8.34, Piper Andreas via dovecot <[hidden email]> wrote:

> [...]

Re: [2.3.8] possible replication issue

Dovecot mailing list
Hello Timo,

upgrading both replicators did the job! Both replicators now run v2.3.9
and replication works fine; all sync jobs that queued up during the
upgrade have been processed successfully.

Thanks for the reassurance and all your great work with dovecot,

Andreas


On 5 Dec 2019 at 13:15, Timo Sirainen via dovecot wrote:

> I think there's a good chance that upgrading both will fix it. The bug
> already existed in old versions, it just wasn't normally triggered.
> Since v2.3.8 this situation is triggered on one dsync side, so the
> v2.3.9 fix needs to be on the other side.

RE: [2.3.8] possible replication issue

Dovecot mailing list
Hello all,

I tested this morning and can confirm that the issue seems to be resolved for me after upgrading both servers from 2.3.7.2 to 2.3.9.

Refs:
  * https://dovecot.org/pipermail/dovecot/2019-October/117353.html
  * https://dovecot.org/pipermail/dovecot/2019-November/117467.html

No more "I/O has stalled" error messages, and replication works fine now.
Thanks very much to the Dovecot team.

Have a nice day.
Fabien

-----Original Message-----
From: dovecot <[hidden email]> On Behalf Of Piper Andreas via dovecot
Sent: Friday, 6 December 2019 07:10
To: [hidden email]
Subject: Re: [2.3.8] possible replication issue

[...]