Worker prover error recovery #463

@rpanic

o1js isn't exactly known for completing proving work deterministically.
Apparently, as nori has also reported, proving sometimes stops working after a certain number of proofs have been generated.
Apart from that, things can always go wrong, so we need a robust way to deal with errors in workers.
Currently, we catch and log errors but leave the worker running, without retrying the failed tasks or recovering the worker (which may be in a faulty state through no fault of our own).

So a strategy to fix this:

  • Crash a worker whenever its proving work fails
    • Docker restarts those workers automatically
  • Retry tasks when using bullmq (we already do this for the localqueue). The restarted worker will pick the task up as soon as it is back up
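
To make the strategy above concrete, here is a minimal sketch of the "crash on failure" half, under stated assumptions. `ProvingTask`, `runOrCrash`, and the injected `prove` and `exit` functions are hypothetical names for illustration, not existing code in this repository. `retryJobOptions` uses the `attempts` and `backoff` fields that bullmq accepts as job options; the numbers are placeholders. The sketch also assumes the worker container runs under a Docker restart policy, since that is what brings a crashed worker back.

```typescript
// Hypothetical task shape; the real task type in the codebase may differ.
interface ProvingTask {
  id: string;
  payload: unknown;
}

// Injected so the crash behaviour can be tested without killing the process.
type ExitFn = (code: number) => void;

// Retry settings in the shape of bullmq job options (`attempts`, `backoff`).
// The values are placeholders, not tuned recommendations.
const retryJobOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 5_000 },
};

// Runs one proving task. On any error it logs and terminates the worker
// instead of swallowing the error and continuing in a possibly faulty state.
async function runOrCrash<T>(
  task: ProvingTask,
  prove: (task: ProvingTask) => Promise<T>,
  exit: ExitFn = (code) => process.exit(code),
): Promise<T | undefined> {
  try {
    return await prove(task);
  } catch (error) {
    console.error(`Proving task ${task.id} failed, crashing worker`, error);
    // A non-zero exit code lets the container runtime treat this as a failure.
    exit(1);
    // Only reached when `exit` is a test stub that does not terminate.
    return undefined;
  }
}
```

Whether the retry is triggered by the queue's attempt counter or by its handling of jobs abandoned by a dead worker depends on when exactly the process exits, so that part would need checking against the actual bullmq setup.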

Status: In Progress
