
Proposal: long-horizon tension crash tests as a new eval dimension #399

@onestardao

Description


Hi, and thanks for releasing LLMeBench and the related papers.

From what I understand, this repo is positioning itself as a flexible framework for benchmarking LLMs across tasks, datasets and providers, not just a single leaderboard. That is exactly why I am asking this question here.

On a separate track, I have been working on something I call long-horizon tension crash tests for LLMs. The goal is not to ask a single hard question, but to push a model through a long sequence of S-class problems and watch where its internal state quietly drifts or collapses.

Concretely, I maintain an open TXT pack called "WFGY 3.0 · Singularity Demo (BlackHole-131)". It is a plain-text universe of 131 S-class questions (alignment, extreme physics, long-horizon civilization decisions, etc.), designed so that any LLM that supports file input can be stress-tested on high-tension reasoning. Everything runs as pure text; no extra code or infrastructure is required.

The way people use it today is roughly the following; a minimal scripted version of the same flow is sketched right after this list:

  • upload the TXT pack to an LLM that supports file input
  • type "run" to let the model read a short boot menu
  • choose "go" to trigger a short demo run over several of those S-class questions
  • observe where the model starts to drift, contradict itself, or collapse
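
For anyone who prefers scripting this instead of clicking through it, here is a minimal sketch of the same flow. To be clear, crash_test and chat_fn are illustrative names I made up for this issue; chat_fn stands for whatever provider call your setup already uses, and nothing below depends on LLMeBench internals.

```python
import json

def crash_test(pack_path, chat_fn, log_path="trace.jsonl"):
    """Drive the BlackHole-131 boot flow against any chat backend.

    chat_fn is whatever your setup uses to talk to a model: it takes
    a list of {"role": ..., "content": ...} messages and returns the
    assistant's reply as a string. Nothing here is provider-specific.
    """
    # Seed the whole plain-text pack as context, mimicking a file upload.
    with open(pack_path, encoding="utf-8") as f:
        messages = [{"role": "user", "content": f.read()}]

    trace = []
    # "run" surfaces the boot menu, "go" starts the short demo run over
    # several S-class questions; every turn is kept for later analysis.
    for command in ("run", "go"):
        messages.append({"role": "user", "content": command})
        reply = chat_fn(messages)
        messages.append({"role": "assistant", "content": reply})
        trace.append({"command": command, "reply": reply})

    # One JSON object per turn, so drift can be inspected offline.
    with open(log_path, "w", encoding="utf-8") as f:
        for turn in trace:
            f.write(json.dumps(turn, ensure_ascii=False) + "\n")
    return trace
```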

My question for you is:

From the perspective of LLMeBench, would this kind of long-horizon tension crash test make sense as a task family or eval dimension inside your framework? For example, as a "stress run" asset that focuses on stability under extreme semantic tension, instead of just accuracy on a single dataset.
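
To make "stress run asset" less abstract, the sketch below shows the knobs I imagine such a task family exposing. Every key name here is hypothetical and mine; none of it is taken from LLMeBench's actual asset API.

```python
# Hypothetical shape of a "stress run" asset description. None of these
# keys exist in LLMeBench today; they only illustrate which parameters
# a long-horizon stress task family would need to expose.
STRESS_RUN_ASSET = {
    "task_family": "long_horizon_stress",
    "source": "BlackHole-131 TXT pack (plain text, file input only)",
    "sequence_length": 131,          # S-class questions pushed through one session
    "probe_interval": 10,            # re-ask a fixed anchor question every N turns
    "signals": [
        "probe_drift",               # similarity of anchor answers over time
        "self_contradiction_count",  # flagged by a human or a judge model
        "collapse_turn",             # first turn where output degenerates
    ],
}
```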

I am not asking you to adopt anything immediately. I am mainly trying to understand:

  • where a long-horizon, text-only crash test like this would sit inside LLMeBench’s taxonomy, if at all
  • what kind of logging or metrics you would expect if someone from the community contributed such an asset (a rough sketch follows this list)
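
On the second point, here is the kind of minimal drift signal I have in mind, reusing the same chat_fn stand-in as above. It deliberately uses crude lexical similarity from the Python standard library as a stand-in for a proper semantic comparison; probe_drift and the 0.3 cutoff are illustrative choices, not an existing LLMeBench metric.

```python
from difflib import SequenceMatcher

def probe_drift(chat_fn, messages, probe, baseline_answer):
    """Re-ask a fixed probe question mid-run and compare the new answer
    against the baseline answer collected before the stress sequence.

    A slowly falling similarity over successive probes is the "quiet
    drift" signal; a sharp drop marks a collapse candidate worth
    inspecting by hand.
    """
    answer = chat_fn(messages + [{"role": "user", "content": probe}])
    similarity = SequenceMatcher(None, baseline_answer, answer).ratio()
    return {
        "probe": probe,
        "similarity": round(similarity, 3),
        "collapse_candidate": similarity < 0.3,  # heuristic cutoff
    }
```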

If this sounds even slightly relevant, I am happy to share more details, traces, or a minimal TXT subset that could be used as a first experiment. Everything is plain text in a public GitHub repo, so it can run entirely inside your existing benchmarking flow.

Thanks again for building LLMeBench and for any guidance on how (or whether) this kind of stress test belongs here.
