A linear classifier extended with a bounded uncertainty region ±ε around the decision boundary. Inputs whose net activation satisfies −ε ≤ z ≤ ε yield a third output state instead of a forced classification, and are escalated to an LLM oracle — a two-tier architecture that trades latency for precision on the inputs where a linear model is least reliable.
Standard step function:
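For reference, with net activation $z = \mathbf{w} \cdot \mathbf{x} + b$, the conventional step function can be written as below. Assigning $z = 0$ to class 1 is an assumed convention; the text does not fix it.

$$
f(z) =
\begin{cases}
1 & \text{if } z \ge 0 \\
0 & \text{if } z < 0
\end{cases}
$$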
Epsilon activation (EUAF):
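Following the description above (a band $-\varepsilon \le z \le \varepsilon$ that yields a third state, which the inference call reports as 0.5), the epsilon activation can be written as:

$$
f_{\varepsilon}(z) =
\begin{cases}
1 & \text{if } z > \varepsilon \\
0.5 & \text{if } -\varepsilon \le z \le \varepsilon \\
0 & \text{if } z < -\varepsilon
\end{cases}
$$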
The ε parameter controls the width of the uncertainty band. A prediction of 0.5 means the model's net input landed too close to the decision boundary to be trusted. What you do with that signal is up to the caller — escalate to a slower model, log it, or route it differently.
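A minimal sketch of the three-state activation and one possible caller policy. The function names `net_input`, `epsilon_activation`, and `route` are hypothetical and are not part of the project's API.

```python
def net_input(weights, features, bias):
    """Weighted sum of the features plus the bias (the value z)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def epsilon_activation(z: float, epsilon: float) -> float:
    """Return 0.5 inside the band [-epsilon, epsilon], otherwise 0.0 or 1.0."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if -epsilon <= z <= epsilon:
        return 0.5  # too close to the decision boundary to be trusted
    return 1.0 if z > epsilon else 0.0


def route(prediction: float) -> str:
    """Hypothetical caller policy: escalate uncertain inputs, accept the rest."""
    return "escalate" if prediction == 0.5 else "accept"
```

A wider band sends more traffic to the slower tier, so ε is effectively a latency budget as much as a confidence threshold.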
Converting variable-length text into a fixed-size numeric vector is a solved problem — but a predictable vectorization is an attack surface. If an adversary can predict where their input lands in vector space, they can craft inputs that evade the classifier.
epsilon salts the hash with a 384-bit secret at initialization:
Where S is the salt, p = 53 is the polynomial base, and m = 1,000,000,007. The salt shifts both the base and the output scale, making the resulting vector space private to each deployment.
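As an illustration only, a salted polynomial hash in this spirit might look like the sketch below. Only p = 53 and m = 1,000,000,007 come from the text. The way `base` and `scale` are derived from the salt, the final division by m, and the helper names are all assumptions, and the project's own formula may differ.

```python
P = 53              # polynomial base, from the text
M = 1_000_000_007   # modulus, from the text


def salted_hash(ngram: str, salt: int) -> float:
    """Hash one n-gram to a float in [0, 1) with a salt-dependent base and scale."""
    base = 2 + (P + salt) % (M - 2)  # assumed: the salt shifts the base into [2, M-1]
    scale = 1 + salt % (M - 1)       # assumed: salt-dependent multiplier in [1, M-1]
    h, power = 0, 1
    for ch in ngram:
        h = (h + ord(ch) * power) % M
        power = (power * base) % M
    return ((h * scale) % M) / M


def ngrams(text: str, n: int = 3) -> list[str]:
    """Sliding character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

The same n-gram maps to different values under different salts, which is the property the paragraph above relies on: without the salt, an attacker cannot precompute where an input lands.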
from secrets import randbits
from text_vectorizer.ngram_hasher import NGramHashVectorizer
salt = randbits(384)
vectorizer = NGramHashVectorizer(salt, chunk_size=150, ngram_size=3)
vectors = vectorizer.vectorized_slices("SELECT * FROM users WHERE id=1 OR 1=1--")
# → list[list[float]] (one sublist per 150-char chunk)

Weights and bias update only on misclassified examples:
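Assuming the classic perceptron rule, with learning rate $\eta$, true label $y$, and predicted label $\hat{y}$, the update can be written as follows. How a 0.5 prediction is treated during training is not specified here.

$$
\mathbf{w} \leftarrow \mathbf{w} + \eta \, (y - \hat{y}) \, \mathbf{x},
\qquad
b \leftarrow b + \eta \, (y - \hat{y})
$$

When $y = \hat{y}$ the correction term is zero, so correctly classified examples leave the parameters unchanged.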
Features are normalized with z-score before training and inference so the learning rate is scale-independent:
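In the usual z-score form, with $\mu_j$ and $\sigma_j$ the mean and standard deviation of feature $j$:

$$
x'_j = \frac{x_j - \mu_j}{\sigma_j}
$$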
The normalization parameters (μ, σ) are computed from the training set and stored alongside the model weights, so inference uses the same scale as training without requiring the original dataset.
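A minimal sketch of fitting the parameters once and reusing them at inference time. The helper names are hypothetical; using the population standard deviation and substituting 1.0 for constant features are assumptions made so the example cannot divide by zero.

```python
from statistics import fmean, pstdev


def fit_zscore(rows):
    """Compute per-feature mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    mu = [fmean(col) for col in columns]
    sigma = [pstdev(col) or 1.0 for col in columns]  # guard against constant features
    return mu, sigma


def apply_zscore(features, mu, sigma):
    """Scale one feature vector with parameters stored at training time."""
    return [(x - m) / s for x, m, s in zip(features, mu, sigma)]
```

Persisting `mu` and `sigma` next to the weights means a model file is self-sufficient: inference never needs to touch the training data again.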
Models live in a shared LRU cache — a threading.Lock-guarded OrderedDict. Multiple threads can run inference concurrently while a new model is being loaded into a separate slot. When the cache is full, the least-recently-used model is evicted.
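The eviction behavior described above can be sketched with the standard library alone. `LRUModelCache` is a hypothetical stand-in and not the project's `ModelCache`; its method names are assumptions.

```python
import threading
from collections import OrderedDict


class LRUModelCache:
    """Hypothetical lock-guarded LRU cache keyed by model id."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._lock = threading.Lock()
        self._items = OrderedDict()

    def put(self, key, model):
        with self._lock:
            self._items[key] = model
            self._items.move_to_end(key)          # mark as most recently used
            while len(self._items) > self._capacity:
                self._items.popitem(last=False)   # evict the least recently used

    def get(self, key):
        with self._lock:
            if key not in self._items:
                return None
            self._items.move_to_end(key)          # a read also refreshes recency
            return self._items[key]
```

The lock covers only the dictionary bookkeeping. Once `get` returns, the caller runs inference on the model object outside the lock, so threads do not serialize on each other's predictions.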
from core.perceptron_cache import ModelCache
from core.uncertainty_perceptron import SimplePerceptron
cache = ModelCache(cache_length=10)
perceptron = SimplePerceptron(cache)
cache_id = perceptron.train(
epochs=30,
patience=3,
labeled_dataset_path="data/labeled.json",
learning_rate=0.65,
model_metadata={"model_name": "v1", "description": "...", "author": "..."}
)
prediction = perceptron.inference(features=[0.82, 0.44, 0.91], cache_id=cache_id, epsilon=0.12)
# → 0 | 0.5 | 1

Dataset format (list[dict]):
[
{"features": [0.82, 0.44, 0.91], "label": 1},
{"features": [0.11, 0.20, 0.30], "label": 0}
]

Four extractors that produce numeric features from the components of an HTTP request:
| Module | Extracts |
|---|---|
| features/uri_syntax.py | URL length, path depth, query string |
| features/http_header.py | Browser type, OS, referer depth, cookie count |
| features/payload_statistical.py | Shannon entropy, digit count, special chars, max word length |
| features/client_profiler.py | HTTP method encoding |
Shannon entropy (normalized):
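One common form, with $p_i$ the relative frequency of byte value $i$ in the payload. Dividing by 8 (the maximum entropy of a byte, in bits) is an assumed normalization; the project may normalize differently.

$$
H_{\text{norm}} = -\frac{1}{8} \sum_{i} p_i \log_2 p_i
$$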
High entropy in a short payload is a strong signal for encoding or obfuscation. Low entropy in a long payload often indicates pattern repetition typical of scanners.
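A short sketch of a normalized entropy feature. It assumes entropy is measured over UTF-8 bytes and divided by 8 bits so the result falls in [0, 1]; the function name and that normalization are assumptions, not the project's confirmed definition.

```python
import math
from collections import Counter


def normalized_entropy(payload: str) -> float:
    """Shannon entropy of the payload's bytes, scaled to [0, 1]."""
    data = payload.encode("utf-8")
    if not data:
        return 0.0
    total = len(data)
    probs = [count / total for count in Counter(data).values()]
    bits = sum(-p * math.log2(p) for p in probs)  # entropy in bits per byte
    return bits / 8.0                             # assumed normalization


repetitive = normalized_entropy("A" * 64)
mixed = normalized_entropy("%27%20OR%201%3D1--")
```

Here `repetitive` is 0.0 because a single repeated byte carries no information, while the percent-encoded string scores higher because its bytes are spread over more values.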
When the perceptron outputs 0.5, the input can be passed to a language model for a second opinion. The oracle uses structured output (Pydantic) and isolates the untrusted payload inside <payload>...</payload> delimiters so prompt injection attempts are classified, not executed.
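The isolation step can be sketched as follows. This is not the project's implementation: `wrap_untrusted`, `Verdict`, and `parse_verdict` are hypothetical names, a standard-library dataclass stands in for the Pydantic model, and restricting the label to the range 0 to 1 is an assumption.

```python
from dataclasses import dataclass

OPEN_TAG = "<payload>"
CLOSE_TAG = "</payload>"


def wrap_untrusted(payload: str) -> str:
    """Wrap the payload in delimiters, neutralizing any delimiters it contains."""
    cleaned = (
        payload.replace(OPEN_TAG, "&lt;payload&gt;")
               .replace(CLOSE_TAG, "&lt;/payload&gt;")
    )
    return f"{OPEN_TAG}{cleaned}{CLOSE_TAG}"


@dataclass(frozen=True)
class Verdict:
    label: float
    explanation: str


def parse_verdict(raw: dict) -> Verdict:
    """Validate the oracle's structured reply before trusting it."""
    label = float(raw["label"])
    if not 0.0 <= label <= 1.0:
        raise ValueError("label must be between 0 and 1")
    return Verdict(label=label, explanation=str(raw["explanation"]))
```

Escaping embedded delimiters matters because a payload that contains its own `</payload>` could otherwise close the block early and have the rest of its text read as instructions.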
from feedback.knowledge_client import KnowledgeDistillerLLM
distiller = KnowledgeDistillerLLM("grok-4-fast-reasoning", api_key="xai-...")
result = distiller.inference_query(
payload="rate: 120 req/min; <payload>SELECT * FROM users WHERE 1=1--</payload>"
)
# → {"label": 1.0, "explanation": "SQL tautology → SQLi. High rate compounds risk."}

# Train and infer on the included OR gate dataset
python -B -m core.uncertainty_perceptron
# LRU cache isolated test
python -B -m core.perceptron_cache
# N-gram vectorizer
python -B -m text_vectorizer.ngram_hasher

numpy
pydantic
xai-sdk
Dylan Sutton Chávez (2025).