Hi, and thanks for releasing LLMeBench and the related papers.
From what I understand, this repo positions itself as a flexible framework for benchmarking LLMs across tasks, datasets, and providers, not just a single leaderboard. That is exactly why I am asking this question here.
On a separate track, I have been working on something I call long-horizon tension crash tests for LLMs. The goal is not only to ask one hard question, but to push a model through a long sequence of S-class problems and watch where its internal state quietly drifts or collapses.
Concretely, I maintain an open TXT pack called "WFGY 3.0 · Singularity Demo (BlackHole-131)". It is a plain-text universe of 131 S-class questions (alignment, extreme physics, long-horizon civilization decisions, etc), designed so that any LLM that supports file input can be stress-tested on high-tension reasoning. Everything runs as pure text; no extra code or infra is required.
The way people use it today is roughly:

- upload the TXT pack to an LLM that supports file input
- type `run` to let the model read a short boot menu
- choose `go` to trigger a short demo run over several of those S-class questions
- observe where the model starts to drift, contradict itself, or collapse
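For concreteness, the loop above can be sketched as a small provider-agnostic driver. Everything here is my own illustration, not LLMeBench code: `query_model` is a placeholder for whatever chat-completion call a provider exposes, and the question list stands in for a subset of the TXT pack.

```python
from typing import Callable, Dict, List

def stress_run(query_model: Callable[[List[Dict[str, str]]], str],
               questions: List[str]) -> List[Dict[str, str]]:
    """Feed S-class questions sequentially into one growing dialogue,
    so later answers can drift under the accumulated context."""
    history: List[Dict[str, str]] = []
    transcript: List[Dict[str, str]] = []
    for q in questions:
        history.append({"role": "user", "content": q})
        answer = query_model(history)          # real provider call goes here
        history.append({"role": "assistant", "content": answer})
        transcript.append({"question": q, "answer": answer})
    return transcript

# Usage with a trivial stub in place of a real provider:
stub = lambda msgs: f"answer #{len(msgs) // 2 + 1}"
out = stress_run(stub, ["Q1", "Q2", "Q3"])
print([t["answer"] for t in out])  # one answer per question, in order
```

The point of keeping the full `history` rather than resetting per question is exactly what distinguishes this from single-dataset accuracy runs: the failure mode being probed lives in the accumulated context.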
My question for you is:
From the perspective of LLMeBench, would this kind of long-horizon tension crash test make sense as a task family or eval dimension inside your framework? For example, as a "stress run" asset that focuses on stability under extreme semantic tension, instead of just accuracy on a single dataset.
I am not asking you to adopt anything immediately. I am mainly trying to understand:
- where a long-horizon, text-only crash test like this would sit inside LLMeBench’s taxonomy, if at all
- what kind of logging or metrics you would expect if someone from the community contributed such an asset
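To make the second point concrete, here is the kind of per-turn log record I have in mind, as one JSON line per turn. The field names and the similarity heuristic are my own suggestion, not an existing LLMeBench schema; `self_similarity` is just a crude stand-in for a real drift metric.

```python
import json
from difflib import SequenceMatcher
from typing import Optional

def log_turn(turn_idx: int, question: str, answer: str,
             prev_answer: Optional[str]) -> str:
    """Serialize one turn as a JSON line.  `self_similarity` compares
    this answer to the previous one as a rough proxy for drift or
    collapse (e.g. the model repeating itself verbatim)."""
    sim = (SequenceMatcher(None, prev_answer, answer).ratio()
           if prev_answer is not None else None)
    record = {
        "turn": turn_idx,
        "question": question,
        "answer_len": len(answer),
        "self_similarity": sim,
    }
    return json.dumps(record)

print(log_turn(0, "Q1", "a stable answer", None))
print(log_turn(1, "Q2", "a stable answer!", "a stable answer"))
```

A contributed asset could emit these lines alongside whatever accuracy-style metrics the framework already logs, so reviewers can see *where* in the sequence a run started to degrade, not just the final score.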
If this sounds even slightly relevant, I am happy to share more details, traces, or a minimal TXT subset that could be used as a first experiment. Everything is plain text in a public GitHub repo, so it can run entirely inside your existing benchmarking flow.
Thanks again for building LLMeBench and for any guidance on how (or whether) this kind of stress test belongs here.