
Proposal: long-horizon tension crash tests as a new eval dimension #399

@onestardao

Description


Hi, and thanks for releasing LLMeBench and the related papers.

From what I understand, this repo is positioning itself as a flexible framework for benchmarking LLMs across tasks, datasets and providers, not just a single leaderboard. That is exactly why I am asking this question here.

On a separate track, I have been working on something I call long-horizon tension crash tests for LLMs. The goal is not to ask a single hard question, but to push a model through a long sequence of S-class problems and watch where its internal state quietly drifts or collapses.

Concretely, I maintain an open TXT pack called "WFGY 3.0 · Singularity Demo (BlackHole-131)". It is a plain-text universe of 131 S-class questions (alignment, extreme physics, long-horizon civilization decisions, etc.), designed so that any LLM that supports file input can be stress-tested on high-tension reasoning. Everything runs as pure text; no extra code or infrastructure is required.

The way people use it today is roughly the following; a minimal scripted version of the same flow is sketched right after this list:

  • upload the TXT pack to an LLM that supports file input
  • type "run" to let the model read a short boot menu
  • choose "go" to trigger a short demo run over several of those S-class questions
  • observe where the model starts to drift, contradict itself, or collapse
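
For anyone who prefers scripting this instead of clicking through it, here is a minimal sketch of the same flow. To be clear, crash_test and chat_fn are illustrative names I made up for this issue; chat_fn stands for whatever provider call your setup already uses, and nothing below depends on LLMeBench internals.

```python
import json

def crash_test(pack_path, chat_fn, log_path="trace.jsonl"):
    """Drive the BlackHole-131 boot flow against any chat backend.

    chat_fn is whatever your setup uses to talk to a model: it takes
    a list of {"role": ..., "content": ...} messages and returns the
    assistant's reply as a string. Nothing here is provider-specific.
    """
    # Seed the whole plain-text pack as context, mimicking a file upload.
    with open(pack_path, encoding="utf-8") as f:
        messages = [{"role": "user", "content": f.read()}]

    trace = []
    # "run" surfaces the boot menu, "go" starts the short demo run over
    # several S-class questions; every turn is kept for later analysis.
    for command in ("run", "go"):
        messages.append({"role": "user", "content": command})
        reply = chat_fn(messages)
        messages.append({"role": "assistant", "content": reply})
        trace.append({"command": command, "reply": reply})

    # One JSON object per turn, so drift can be inspected offline.
    with open(log_path, "w", encoding="utf-8") as f:
        for turn in trace:
            f.write(json.dumps(turn, ensure_ascii=False) + "\n")
    return trace
```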

My question for you is:

From the perspective of LLMeBench, would this kind of long-horizon tension crash test make sense as a task family or eval dimension inside your framework? For example, as a "stress run" asset that focuses on stability under extreme semantic tension, instead of just accuracy on a single dataset.
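
To make "stress run asset" less abstract, the sketch below shows the knobs I imagine such a task family exposing. Every key name here is hypothetical and mine; none of it is taken from LLMeBench's actual asset API.

```python
# Hypothetical shape of a "stress run" asset description. None of these
# keys exist in LLMeBench today; they only illustrate which parameters
# a long-horizon stress task family would need to expose.
STRESS_RUN_ASSET = {
    "task_family": "long_horizon_stress",
    "source": "BlackHole-131 TXT pack (plain text, file input only)",
    "sequence_length": 131,          # S-class questions pushed through one session
    "probe_interval": 10,            # re-ask a fixed anchor question every N turns
    "signals": [
        "probe_drift",               # similarity of anchor answers over time
        "self_contradiction_count",  # flagged by a human or a judge model
        "collapse_turn",             # first turn where output degenerates
    ],
}
```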

I am not asking you to adopt anything immediately. I am mainly trying to understand:

  • where a long-horizon, text-only crash test like this would sit inside LLMeBench’s taxonomy, if at all
  • what kind of logging or metrics you would expect if someone from the community contributed such an asset (a rough sketch follows this list)
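
On the second point, here is the kind of minimal drift signal I have in mind, reusing the same chat_fn stand-in as above. It deliberately uses crude lexical similarity from the Python standard library as a stand-in for a proper semantic comparison; probe_drift and the 0.3 cutoff are illustrative choices, not an existing LLMeBench metric.

```python
from difflib import SequenceMatcher

def probe_drift(chat_fn, messages, probe, baseline_answer):
    """Re-ask a fixed probe question mid-run and compare the new answer
    against the baseline answer collected before the stress sequence.

    A slowly falling similarity over successive probes is the "quiet
    drift" signal; a sharp drop marks a collapse candidate worth
    inspecting by hand.
    """
    answer = chat_fn(messages + [{"role": "user", "content": probe}])
    similarity = SequenceMatcher(None, baseline_answer, answer).ratio()
    return {
        "probe": probe,
        "similarity": round(similarity, 3),
        "collapse_candidate": similarity < 0.3,  # heuristic cutoff
    }
```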

If this sounds even slightly relevant, I am happy to share more details, traces, or a minimal TXT subset that could be used as a first experiment. Everything is plain text in a public GitHub repo, so it can run entirely inside your existing benchmarking flow.

Thanks again for building LLMeBench and for any guidance on how (or whether) this kind of stress test belongs here.
