Worker prover error recovery

o1js isn't exactly known for its determinism in completing proving work successfully.
Apparently, as nori has also reported, proving sometimes stops working after a certain number of proofs have been generated.
Apart from that, things can always go wrong, so we need a robust way to deal with errors in workers.
Currently, we catch and log errors but leave the worker running, without retrying the failed tasks or recovering the worker, which may be stuck in a faulty state through no fault of our own.

A strategy to fix this:
- [ ] Crash the worker whenever its proving work fails
    - Docker restarts crashed workers automatically
- [ ] Retry failed tasks when using bullmq (we already do this for the localqueue). The worker picks the task up again as soon as it has restarted
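
The two items above could be sketched roughly as follows. This is only an illustration, not the project's actual worker code: `retryOptions` and `runOrCrash` are hypothetical helper names, and the attempt count and delay are placeholder values. The only part taken from bullmq is the shape of the job options (`attempts` and `backoff`). The exit function is injected so the crash path can be exercised without killing the process.

```typescript
// Hypothetical helpers; only the option shape (`attempts`, `backoff`)
// mirrors bullmq's job options.
type Backoff = { type: "exponential" | "fixed"; delay: number };
type RetryOptions = { attempts: number; backoff: Backoff };

// Build per-job retry options. The defaults are placeholders, not tuned values.
function retryOptions(attempts = 3, delayMs = 5_000): RetryOptions {
  return { attempts, backoff: { type: "exponential", delay: delayMs } };
}

// Run a proving task; on any error, log it and terminate the worker so that
// the container runtime can restart it in a clean state.
async function runOrCrash<T>(
  task: () => Promise<T>,
  exit: (code: number) => void,
): Promise<T> {
  try {
    return await task();
  } catch (error) {
    console.error("Proving task failed, terminating worker:", error);
    exit(1);
    // Only reached when `exit` does not actually end the process (e.g. in tests).
    throw error;
  }
}
```

In a bullmq setup, the options would presumably be passed as the third argument of `queue.add(name, data, retryOptions())`, and the worker's processor would wrap its proving call in `runOrCrash(..., (code) => process.exit(code))`. Whether a job interrupted by the exit is retried through `attempts` or through bullmq's stalled-job recovery depends on when the process terminates, so that is worth verifying before relying on it.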

Worker prover error recovery #463
