Hi WhoCares!
Thank you so much for the detailed bug report! Both issues were spot-on.
v1.5.1 is now available from the dashboard download link with the following fixes:
1) Printf mixing (agent): The hb_failed status line used \r without a terminating \n, so it overwrote normal output. Fixed: it now uses \n delimiters, so the "SERVER OFFLINE" message prints cleanly on its own line.
2) Heartbeat timeouts every ~300s (server): This was the more critical one. save_state() was holding the global lock during the entire disk write — serializing millions of DPs with struct.pack in a loop while every API endpoint waited. As the DP table grows, save time grows, and at ~15M+ DPs it was blocking long enough to trigger agent heartbeat timeouts.
Fix: new save_state_background() takes a fast snapshot of all data structures under lock (milliseconds), then releases the lock and writes to disk outside it. Agents no longer see any interruption during auto-save.
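For anyone curious what the snapshot-then-write pattern looks like, here is a minimal sketch. It is not the actual server code: `state_lock`, `dp_table`, the record layout, and `start_background_save()` are made up for illustration, and the real save_state_background() may differ in every detail. The only point is that the lock is held for the copy, not for the packing and the disk write.

```python
import os
import struct
import tempfile
import threading

# Hypothetical in-memory state; these names are illustrative only.
state_lock = threading.Lock()
dp_table = {}  # maps a DP key (int) to a distance value (int)


def save_state_background(path):
    """Copy the table under the lock, then pack and write with the lock released."""
    # Hold the lock only long enough to take a shallow copy.
    with state_lock:
        snapshot = list(dp_table.items())

    # Slow part: struct packing and disk I/O run without the lock,
    # so request handlers that need state_lock are not blocked.
    record = struct.Struct("<QQ")  # assumed layout: two unsigned 64-bit ints
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("<Q", len(snapshot)))  # record count header
            for key, dist in snapshot:
                f.write(record.pack(key, dist))
        # Swap the finished file into place so a crash mid-write
        # never leaves a truncated state file at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return len(snapshot)


def start_background_save(path):
    """Run the save on a worker thread so the caller returns immediately."""
    t = threading.Thread(target=save_state_background, args=(path,), daemon=True)
    t.start()
    return t
```

The trade-off is memory: the snapshot briefly holds a second list of references to every entry, which is the price for not stalling heartbeats during the write.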
Fleet is currently at 28 workers / ~21 G/s, 9.3% progress, efficiency 99.9%. No more periodic disconnections.
Thanks again for catching these. The heartbeat timeout in particular would only have gotten worse as the DP table keeps growing toward a collision.