feat: Add HTML representation #2236
I tend to think this is the way to go. Let's not bite off more than we need to here. I genuinely don't have a good grasp on what the use-case is here in strong terms - MuData already has its own renderer for example.
Right, and this could build off of the work in #2290 and extend the JSON schema there. I would also go for a less feature-complete but more robust version of a JSON schema. For example, I know that categories can get big, but I think we should not worry about that. That is a v2 feature.
Thanks for the detailed proposal in your latest comments. I've mapped each feature onto the TypedDict + Jinja architecture to understand what transfers and what doesn't. To help navigate, here's where I address each of your points:
I've linked to relevant earlier comments throughout. Some points below build on arguments from earlier in the thread; I'd find it most productive if we can engage with those discussions rather than revisiting them from scratch.
Before diving in: the instinct behind TypedDict + Jinja is architecturally sound in the general case (separating data from presentation, defaulting to auto-escaping, enabling JSON-serializable intermediates). If the rendering layer were the complex part of this system, I'd agree templates are the right tool. But in this system, the complexity lives in the Python formatting layer (type dispatch, error recovery, context-dependent decisions), which survives a Jinja migration unchanged. Jinja replaces the rendering layer, which is the simpler part. I want to walk through that concretely rather than assert it.
The core tradeoff
@flying-sheep outlined two options: (1) TypedDict + Jinja with extensibility built around it, or (2) leave out extensibility for now. @ilan-gold favors option 2:
I understand the core concerns here are maintainability and security: you'll be maintaining this code long-term and are responsible for ensuring it's safe. I share those goals. But as I'll show below, TypedDict + Jinja does not deliver the improvements in robustness and maintainability it appears to promise, and introduces a new maintenance cost at the Python/Jinja boundary.
The key thing to surface is that this isn't just about deferring extensibility: TypedDict + Jinja is architecturally at odds with it. The features that require extensibility (ecosystem custom HTML, per-type dispatch) can't be expressed in a fixed TypedDict schema without falling back to So the real question is: do we want extensibility?
But first, since the proposal seems to assume there's no structured intermediate representation, let me recap the architecture so we're working from the same mental model.
How the current architecture works
The PR doesn't go from object to HTML in one step. There are three layers:
We already have separation of concerns. The question is whether the rendering half should be written in Python or in Jinja, not whether separation exists. TypedDict + Jinja would replace layers 2 and 3, but the complexity doesn't live there. It lives in the formatters (~2,200 lines across
What maps cleanly to TypedDict + Jinja
These features are fully compatible with a JSON-serializable intermediate representation, roughly 60-70% of the visual output:
What a Jinja migration must reimplement
The remaining features CAN be expressed as TypedDict fields, but the formatting logic that populates those fields requires Python features that can't move into Jinja templates. With TypedDict + Jinja, this logic must be reimplemented as a Python "crawl phase" that does the same work as the current formatters.
None of this logic goes away; it's reimplemented targeting TypedDict output instead of
The maintenance cost at the boundary
In addition, TypedDict + Jinja introduces a new maintenance cost that the current system doesn't have: a dual-contract boundary between the crawl phase and the template. In the current system, With TypedDict + Jinja, this changes. The TypedDict carries unresolved data: nullable fields for each attribute, raw category lists, error sentinels. The template must handle every combination with its own conditionals:
{% if entry.shape is not none %}({{ entry.shape|join(', ') }}){% endif %}
{% if entry.dtype %} {{ entry.dtype }}{% endif %}
{% if entry.error %}<span class="warning">⚠ {{ entry.error }}</span>{% endif %}
{% if entry.colors %} {# render swatches #} {% endif %}
That's two layers that must agree on: what fields are nullable, what null means, how partial results compose. A change to what the crawl phase produces can silently break the template, with no compile-time check across the Python/Jinja boundary. Mypy checks the TypedDict definition in Python; it cannot check that the template handles every nullable combination correctly. This is in tension with the typing rigor we've established elsewhere in this PR: the strict typing that motivated removing This compounds with JSON export. Adding a JSON consumer to the same TypedDict creates a third site that must handle the same combinatorial space of nullable fields (crawl, template, JSON serializer), all implementing their own conditional logic for the same partial-failure scenarios, all kept in sync manually.
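To make the nullable-field point concrete, here is a minimal Python sketch of one consumer of such an intermediate. `EntryDict` and `render_entry` are hypothetical names chosen for illustration; they are not part of this PR or of any proposed schema, and the field names simply mirror the template snippet above.

```python
from html import escape
from typing import Optional, TypedDict


class EntryDict(TypedDict):
    """Hypothetical intermediate entry; every attribute may be missing."""

    name: str
    shape: Optional[tuple[int, ...]]
    dtype: Optional[str]
    error: Optional[str]


def render_entry(entry: EntryDict) -> str:
    """Render one entry, branching on each nullable field in turn."""
    parts = [escape(entry["name"])]
    if entry["shape"] is not None:
        parts.append("(" + ", ".join(str(n) for n in entry["shape"]) + ")")
    if entry["dtype"]:
        parts.append(escape(entry["dtype"]))
    if entry["error"]:
        parts.append(f'<span class="warning">⚠ {escape(entry["error"])}</span>')
    return " ".join(parts)
```

Any second consumer of `EntryDict` (a template, a JSON serializer) would need to repeat the same three branches and agree on what `None` means in each.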
Concretely, when
This is the opposite of reduced maintenance burden. The current system resolves ambiguity once, in the formatter. TypedDict + Jinja defers it to every consumer.
What TypedDicts structurally prevent
Unlike the items above, these features are genuinely incompatible with a fixed TypedDict schema, not because of implementation effort but because of structural limitations in how Jinja2 extensibility works.
These aren't features that could be added later on top of TypedDict + Jinja. The only escape hatch is
Why extensibility matters
@ilan-gold, you mentioned:
Here are the concrete cases:
Discoverability of analysis results. Ecosystem tools store results across multiple AnnData slots, but there's no way for a user to see what was computed or which tool put it there; they just see generic arrays and columns. Our package kompot writes DE results to
Reusable components for MuData and SpatialData. Early in this PR, @Zethson asked for exactly this:
The README rendering for collaborators. When sharing AnnData files between lab members, it's common to store a description in
The extensibility API has been in this PR for months and is covered by 607 tests (108 adversarial). If there are specific maintainability or correctness concerns, I'd like to understand them so I can address them concretely. If you're concerned about API lock-in, Option B below keeps the API internal while preserving the architecture that makes it possible.
On Jinja and security
I understand this is framed primarily as a security question, and I want to engage with that directly. I evaluated template-based architectures early on and explained this reasoning in detail. Let me revisit it in light of the specific proposal. The security argument for Jinja is: auto-escaping by default means a contributor can't accidentally forget to escape user data, preventing XSS from maliciously crafted AnnData files. That's a real concern, and I take it seriously. But let's be precise about the threat model and what Jinja actually changes.
The threat is narrow. The attack surface is: an attacker crafts an AnnData file with malicious strings (e.g.,
Jinja's advantage is real but bounded. The failure mode asymmetry is genuine: forgetting
The cost is disproportionate to the security gain. This improvement in default escaping for internal rendering comes at the cost of: a new dependency, a cross-language boundary with unchecked contracts (see Maintenance cost at the boundary above), and structural barriers to extensibility (see What TypedDicts structurally prevent above). And for ecosystem extensions that produce custom visualizations, the escaping responsibility moves to third-party code; Jinja provides no safety improvement there.
Ecosystem extensions reintroduce the risk. If extensibility is supported, ecosystem packages would supply their own templates or generate HTML for custom visualizations. The escaping responsibility shifts to code outside anndata's control. But more fundamentally, ecosystem packages already run arbitrary Python in the user's process: a malicious or buggy package can execute code, access the filesystem, or exfiltrate data, none of which is constrained by HTML escaping. XSS in a formatter is a strictly lesser risk than what ecosystem code can already do. Jinja's auto-escaping on anndata's side doesn't change this threat model.
CSS injection isn't addressed by Jinja either. Category color values from
On robustness and scope
and
I want to address both the scope concern and the robustness expectation.
On review burden: The PR is large, and I understand that reviewing +22K lines is daunting. As I broke down earlier, 41% of those lines are tests, 15% is the visual test harness, and 8% is static assets (CSS/JS). The actual source code is ~29% (~6.4K lines). Dropping extensibility (
On robustness: The expectation of improved robustness from TypedDict + Jinja is misleading. The logic where robustness matters must be reimplemented in a crawl phase regardless (see above), and the boundary between crawl and template replaces a single-contract system with a dual-contract system, adding a maintenance surface rather than removing one. I agree that dropping the extensibility API reduces scope; that's Option B below, and I'm happy to go that route. But the robustness question remains: is TypedDict + Jinja more robust than f-strings for the code that stays? The internal formatting logic (type dispatch,
For context on the rendering approach: xarray's repr uses f-strings and Jinja was never considered in that project's design discussion. Dask did migrate to Jinja (dask#8019), but for a different use case: Dask renders one known type per repr call (one template per type:
On JSON export
JSON export is a valuable goal and I'm in favor of it. But rather than motivating a Jinja migration, JSON export highlights the cost of TypedDict + Jinja. As discussed in Maintenance cost at the boundary, TypedDict + Jinja creates a dual-contract system where the crawl phase produces unresolved data and the template handles nullable combinations. Adding a JSON consumer to the same TypedDict creates a triple-contract system: three sites implementing conditional logic for the same partial-failure scenarios, kept in sync manually. With the current system, adding JSON export means adding a serialization method to
There's also a schema mismatch. The HTML path truncates and summarizes:
I raised several design questions in my earlier detailed response that I'd like to resolve before designing the schema:
I'd appreciate engagement on these questions; they need to be resolved regardless of which rendering architecture we choose.
Path forward
I think there are three reasonable options for the rendering architecture, plus JSON export as a separate follow-up:
Option A: Merge with extensibility API. The
Option B: Merge without extensibility API. I remove
Option C: Adopt TypedDict + Jinja, strip extensibility. Replace
JSON export can be added as a follow-up to any option above, once the design questions above are resolved.
My recommendation is A (or B as a compromise on review scope). I believe the current architecture provides the foundation for both extensibility and JSON export without the costs of a Jinja migration.
I've created a visual side-by-side comparison (gist source) showing what each approach can express for the features discussed above: basic layout, category colors, error recovery, ecosystem custom HTML, and the maintenance cost of adding JSON export.
I want to make sure we're making this decision on a shared understanding of the implementation. If there are specific parts of the code that feel hard to maintain or that raise security concerns, I'd welcome those pointers; they'd help me improve the implementation regardless of which direction we go.
JupyterLab strips <style> tags from untrusted notebooks (e.g. executed via nbconvert or transferred between users), leaving the HTML repr unstyled. Add a fallback div that is visible by default and hidden by CSS, showing a summary line and instructions to run `jupyter trust`.
repr(adata) can crash when aligned mappings contain objects with broken .shape properties, because _gen_repr accesses all mappings which triggers validation. This caused _repr_html_ to return None for adversarial AnnData objects (visible as "None" in test 24). Wrap the repr() call in a try/except with a simple shape-based fallback string.
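A minimal sketch of the guarded-repr pattern this commit describes. The helper name `safe_text_repr` and the exact wording of the fallback string are assumptions for illustration; only the try/except around `repr()` with a shape-based fallback comes from the commit message.

```python
def safe_text_repr(obj: object) -> str:
    """Return repr(obj), or a short shape-based summary if repr() raises."""
    try:
        return repr(obj)
    except Exception:
        # Hypothetical fallback: read only the two dimensions, each guarded,
        # so a broken property cannot crash the fallback itself.
        dims = []
        for attr in ("n_obs", "n_vars"):
            try:
                dims.append(str(getattr(obj, attr)))
            except Exception:
                dims.append("?")
        return f"{type(obj).__name__} object with n_obs × n_vars = {dims[0]} × {dims[1]}"
```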
SectionFormatter can now define render_html(obj, context) to produce custom HTML directly, bypassing the standard foldable <details> section. If render_html fails, falls back to get_entries gracefully. This enables compact inline representations (e.g., TreeData's label/alignment/allow_overlap as a single line like the X entry) alongside the standard entry-grid sections. Includes: - render_html support in _render_custom_section with fallback - TreeData visual test example showing both patterns - Unit tests for render_html, escaping, and crash fallback
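The fallback order described here could look roughly like the sketch below. It assumes a formatter object with an optional `render_html(obj, context)` method, a `get_entries(obj, context)` method returning `(key, value)` pairs, and a `name` attribute; the entry markup is simplified and is not the PR's actual `_render_custom_section`.

```python
from html import escape


def render_custom_section(formatter, obj, context) -> str:
    """Prefer the formatter's own HTML; fall back to the entry list on failure."""
    render_html = getattr(formatter, "render_html", None)
    if callable(render_html):
        try:
            return render_html(obj, context)
        except Exception:
            pass  # fall through to the standard foldable section
    rows = "".join(
        f"<div>{escape(str(key))}: {escape(str(value))}</div>"
        for key, value in formatter.get_entries(obj, context)
    )
    return f"<details><summary>{escape(formatter.name)}</summary>{rows}</details>"
```

A formatter whose `render_html` raises still produces a usable section, because the entry path escapes every key and value itself.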
Null bytes in user data (e.g., column name "null\x00byte") leaked through html.escape into the HTML output as literal \x00 bytes, breaking HTML parsers and causing truncated rendering in browsers. Replace null bytes with U+FFFD (Unicode replacement character) before escaping, per the HTML spec.
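As a small illustration of the fix, assuming a helper named `escape_user_text` (the name is hypothetical; the PR's actual helper may differ):

```python
import html


def escape_user_text(value: object) -> str:
    """Replace NUL with U+FFFD, then HTML-escape the result."""
    text = str(value).replace("\x00", "\ufffd")
    return html.escape(text)
```

The replacement has to happen before or alongside escaping, because `html.escape` only rewrites `&`, `<`, `>` and quotes and passes a NUL character through unchanged.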
…notebooks) Use the xarray-style dual-representation pattern: emit a text <pre> fallback (visible by default) alongside the rich HTML (hidden via inline display:none). When CSS loads it flips visibility; when CSS is stripped the text repr shows.
Replace div cells with span cells so entries stay on one line without CSS. Add inline min-width via CSS custom variables for column alignment, monospace font fallback, comma-separated categories (hidden by CSS), and contextual hints for no-CSS and no-JS environments. - Entry cells: <div> → <span> with inline-block + min-width fallback - Category items: comma separators (hidden by CSS which uses margin) - Wrap buttons: hidden inline, shown only by JS overflow detection - Nested content: inline margin-left for indentation without CSS - No-CSS hint: visible by default, hidden by CSS - No-JS hint: hidden by default, shown by CSS :not(.anndata-repr--js), hidden again by JS init - CSS resets inline fallback styles on grid children (!important) - Drop text <pre> fallback — rich HTML now degrades well enough
Use anndata.utils.iter_outer from scverse#2372 as the canonical source for standard section iteration in the HTML repr. Drop the redundant SECTION_ORDER tuple; display order now follows iter_outer. - _render_all_sections iterates (name, elem) pairs from iter_outer and passes elem to downstream renderers, avoiding a second getattr (which would trigger another file open/close cycle on backed AnnData). - _render_dataframe_section, _render_mapping_section, _render_uns_section, _render_raw_section now take the elem directly. - _detect_unknown_sections and _get_custom_sections_by_position use a local STANDARD_SECTIONS frozenset for name-only membership checks so they don't pay iter_outer's per-yield I/O just to get names. - Drop unused adata param from _render_uns_entry. Display order changes to X, obs, var, obsm, varm, obsp, varp, layers, uns, raw (uns moves from position 4 to position 9).
Drop the STANDARD_SECTIONS frozenset (which mirrored iter_outer's internal name list). Instead, materialize iter_outer once at the top of _render_all_sections and reuse the collected names for the membership checks in _get_custom_sections_by_position and _detect_unknown_sections. Same in _collect_all_field_names: the names come from the iter_outer loop that was already running for column/key collection. Removes the maintenance burden of keeping a separate constant in sync with iter_outer's internal list.
Thanks for landing this @ilan-gold! Just adopted One small friction point worth sharing. The repr has a couple of name-only checks (filtering registered custom-section formatters, detecting unknown attributes that aren't standard sections) where we just need the set of canonical section names, not their values. We currently collect the names while iterating values for rendering, then thread that set through to the membership-check sites. It works, but it means we're either reconstructing the set per render, or carrying it around as a parameter. It would be nicer if the canonical section names were exposed as a public constant alongside
Not a blocker, happy with the current state. Just flagging in case it fits with the follow-ups you have in mind.
Hi! Responding to your Jinja comparison gist, please just read the basic documentation: https://jinja.palletsprojects.com/en/stable/templates/ Few of the things that you say are “Structurally prevented” are actually prevented in any way, e.g.
I think it’s still not clear to you what MarkupSafe does. Or you know it but your LLM picks up obsolete text and you accept its output unquestioningly. In any case, custom HTML is very much possible, please stop going back to pretending it isn’t.
Not at all, we’re very free to design blocks however we like, e.g. very trivially you could extend a chain at the start (there are also other solutions such as a nested block in the
{% block thing %}
{% if thing == "foo" %}
foo
{% elif thing == "bar" %}
bar
{% else %}
baz
{% endif %}
{% endblock %}
{% extends "parent.j2" %}
{% block thing %}
{% if thing == "spam" %}
spam!
{% else %}
{{ super() }}
{% endif %}
{% endblock %}
Something else: why do you paste in the JavaScript in multiple places? Shouldn’t it be loaded once and reused?
Hi @flying-sheep, appreciate the engagement. On MarkupSafe: On custom HTML: my concern covers both bespoke visualizations and, more importantly, pre-existing HTML representations. Third-party scientific packages that define custom types typically already have To make the middle-ground concrete, I put up a small POC on our fork: settylab/anndata#8. It routes the top-level repr through one autoescape-enabled Jinja template and wraps existing formatter-produced fragments in Stepping back, the architectural choice hinges on requirements we haven't explicitly agreed on:
On JavaScript composition: you're right that The Jinja question is forward-looking (reducing future XSS risk through default-safe idioms, type-level trust discipline) rather than correcting a present defect, so the trade-off depends on which requirements above weigh heaviest. Open to iterating on the POC or the current design.
I disagree. I might have been unclear about why I brought up jinja, but using it won’t and can’t make anything “structurally impossible”. To clarify: I simply consider jinja a better replacement for functions that are designed to turn structured data into HTML, compared to simple functions that use Python string manipulation APIs. The approach “mark pieces of data as markup safe before passing it into a rendering black box” works better than having a bunch of parameters to a function, some of which contain valid markup while others contain data-derived raw strings that need to be escaped so as not to accidentally break page layout because there’s a stray “<”. Safety against malice isn’t the primary concern, it’s more about robustness, where safety is achieved along the way.
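The "mark trusted pieces, escape everything else" idea can be sketched with the standard library alone. `TrustedHtml`, `to_html` and `render_row` are hypothetical names for illustration; MarkupSafe's `Markup` class plays a similar role in Jinja, but this sketch does not use or reproduce it.

```python
from html import escape


class TrustedHtml(str):
    """Marker type: a string that is already valid markup."""


def to_html(value: object) -> str:
    """Pass trusted markup through; escape everything else."""
    if isinstance(value, TrustedHtml):
        return str(value)
    return escape(str(value))


def render_row(name: object, preview: object) -> str:
    """The renderer never needs to know which arguments came from user data."""
    return f"<tr><td>{to_html(name)}</td><td>{to_html(preview)}</td></tr>"
```

With this shape, forgetting to mark something as trusted yields over-escaped output, which is visible but harmless, rather than broken layout from a stray `<`.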
That’s a realistic pattern, but I think we’re not on the same page about what “extending” means in the context of this PR. I mean by it that there are other packages building on anndata, which might want to both hook into the existing structure (e.g. add an attribute or so next to the others, but otherwise reuse the rest as-is) and also use the existing render machinery for stylistically integrated rendering of the new parts. You’re talking about integrating 3rd party HTML representations while extending.
Ah I saw it being called in multiple places and assumed it’d be included multiple times per repr. I think we can’t do better than one inclusion per repr. We can however improve runtime behavior – it could create a global API and reuse it if it finds it instead of creating a copy of the whole API.
Two related robustness fixes to the inlined repr JS:
1. `repr.js`: replace the last remaining `innerHTML = \`<h3 id="${modalTitleId}">…\`` in the README modal with plain DOM construction (`createElement`/`textContent`/`appendChild`). The literal `id="${modalTitleId}"` substring inside the inlined JS source was being regex-matched by the Jupyter-compatibility tests as a duplicate HTML ID attribute across cells. Using DOM APIs removes the problematic substring entirely and matches the surrounding code style.
2. `javascript.py`: wrap the per-container initialisation body in an `install-once` guard. Every cell still ships the full source so any cell stays self-sufficient across deletion, reorder, or notebook reopen, but only the first to execute actually installs `window.anndataRepr`; subsequent cells reuse the installed `init(container)`. Addresses the runtime-redundancy concern without giving up per-cell portability.
…hover Replace iter_outer as the section-iteration source in the repr, and derive the canonical section list from get_literal_members(AnnDataElem). Why not iter_outer: iter_outer yields (name, getattr(adata, name)) pairs, and propagates the first exception it hits — so a single broken section (corrupt aligned mapping, subclass with a crashing property, etc.) terminates the generator mid-iteration. _repr_html_'s top-level except catches that, the repr returns None, and the whole cell output disappears. The adversarial "Evil AnnData" case hit this. Iterating get_literal_members(AnnDataElem) ourselves and doing the getattr inside each section's try/except isolates the failure: a broken section renders as an error placeholder and the remaining sections still appear. iter_outer stays for callers that want strict semantics (AnnData.__str__, to_memory, _reduce, I/O). Also: use the same Literal as the single source of truth wherever the repr previously derived a set of section names from iter_outer (_detect_unknown_sections, _get_custom_sections_by_position). Those helpers now compute the set locally and no longer need the caller to thread it through. CSS fix: .anndata-entry:hover .anndata-entry__copy propagated the :hover state to every ancestor entry, revealing every ancestor's copy button when a deeply-nested row was hovered. Scope the trigger to the entry's own row: the entry itself for plain div rows (which never contain nested entries) and the <summary> for expandable rows (nested children live in .anndata-entry__nested-content, outside the summary).
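The failure-isolation argument can be shown with a small sketch. `SectionName` is a hypothetical stand-in for the `AnnDataElem` literal, `typing.get_args` stands in for `get_literal_members`, and the placeholder text is simplified.

```python
from typing import Literal, get_args

# Hypothetical stand-in for the AnnDataElem literal mentioned above.
SectionName = Literal["obs", "var", "uns"]


def render_sections(obj: object) -> list[str]:
    """Render each section independently so one failure stays local."""
    rows = []
    for name in get_args(SectionName):
        try:
            value = getattr(obj, name)  # may raise for a broken section
            rows.append(f"{name}: {value}")
        except Exception as exc:
            rows.append(f"{name}: <error: {type(exc).__name__}>")
    return rows
```

A generator that performs the `getattr` itself would stop at the first broken section; moving the attribute access inside the per-section `try` is what lets the remaining sections still appear.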
On JS composition: taken. Just pushed an On extensibility and the Styler analogy: useful pointer, and it maps cleanly onto the case you described. Packages that hook into anndata's structure (adding an attribute or section next to the others) and want stylistically integrated rendering of the new parts. The current design is already shaped similarly: On Jinja / the POC: the reframe is useful. Landing on robustness (one trust boundary + autoescape vs. threading mixed markup/data through ad-hoc |
Rich HTML representation for AnnData
Summary
Implements rich HTML representation (`_repr_html_`) for AnnData objects in Jupyter notebooks. Builds on previous draft PRs (#784, #694, #521, #346) with a complete, production-ready implementation.
Live Demo | Reviewer's Guide (technical details, design decisions, extensibility examples)
Screenshot
Features
Interactive Display
`.raw` section showing unprocessed data (Report `n_vars` of `.raw` in `__repr__` #349)
Visual Indicators
unspalettes (e.g.,cell_type_colors)unsvaluesuns["README"])Serialization Warnings
Proactively warns about data that won't serialize:
/(deprecated)Compatibility
.anndata-reprprevents style conflictsread_lazy()(categories, colors)Extensibility
Three extension mechanisms for ecosystem packages (MuData, SpatialData, TreeData):
obst/vart,mod)See the Reviewer's Guide for examples and API documentation.
Testing
python tests/visual_inspect_repr_html.pyRelated
sparse_datasetby removingscipyinheritance #1927 (sparse scipy changes), feat: array-api compatibility #2063 (Array-API)Acknowledgments
Thanks to @selmanozleyen (#784), @gtca (#694), @VolkerH (#521), @ivirshup (#346, #675), and @Zethson (#675) for prior work and discussions.
Technical Notes and Edits
Lazy Loading
Constants are in `_repr_constants.py` (outside `_repr/`) to prevent loading ~6K lines on `import anndata`. The full module loads only when `_repr_html_()` is called.
Config Changes
`pyproject.toml`: Added `vart` to codespell ignore list (TreeData section name).
Edit (Dec 27, 2024)
To simplify review and reduce the diff, I've merged settylab/anndata#3 into this PR. That PR was originally created as a follow-up to explore additional features based on the discussion with @Zethson about SpatialData/MuData extensibility.
What changed:
`.raw` section - Expandable row showing unprocessed data (Report `n_vars` of `.raw` in `__repr__` #349)
Edit (Jan 4, 2025)
Moved detailed implementation documentation (architecture, design decisions, extensibility examples, configuration reference) to the Reviewer's Guide to keep this PR description focused on features.
Code refactoring:
html.pyinto focused modules for maintainabilitycomponents.py(badges, buttons, icons)sections.py(obs/var, mapping, uns, raw)core.py(avoids circular imports)utils.pyFormatterContextconsolidates all 6 rendering settings (read once at entry, propagated via context)html.pyreduced from ~2100 to ~740 lines, clean import hierarchyNew features:
read_lazy()AnnData objects (experimental) - indicates when obs/var are xarray-backed(lazy)indicator on columnsBug fixes:
adata-text-mutedclass for uniform appearanceRelated issue discovered:
`read_lazy()` returns index values as byte-representation strings (e.g., `"b'cell_0'"` instead of `"cell_0"`) - see `ISSUE_READ_LAZY_INDEX.md`
Edit (Jan 6, 2025)
Smart partial loading for `read_lazy()` AnnData:
Previously, lazy AnnData showed no category previews to avoid disk I/O. Now we do minimal, configurable loading to get richer visualization cheaply: only the first N category labels and their colors are read from storage (not the full column data). New setting `repr_html_max_lazy_categories` (default: 100, set to 0 for metadata-only mode).
Visual tests reorganized: 8 (Dask), 8b (lazy categories), 8c (metadata-only), 9 (backed).
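A rough sketch of the bounded read, assuming the categories are available as a lazily evaluated iterable. `preview_categories` is a hypothetical helper; the PR's actual implementation reads from the storage backend rather than from a Python iterator.

```python
from collections.abc import Iterable
from itertools import islice


def preview_categories(categories: Iterable[str], max_categories: int = 100) -> list[str]:
    """Consume at most max_categories labels; 0 means metadata-only."""
    if max_categories <= 0:
        return []  # metadata-only mode: nothing is read
    return list(islice(categories, max_categories))
```

Because `islice` stops pulling from the source once the limit is reached, the cost is bounded by the setting rather than by the number of categories in the column.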
Edit (Jan 6, 2025 - continued)
FormattedOutput API and architecture:
Clean separation between formatters and renderers: formatters inspect data and produce a complete `FormattedOutput`; renderers only receive `FormattedOutput` (never the original data).
The
FormattedOutputdataclass fields were renamed to be self-documenting:meta_contentpreview(text) orpreview_html(HTML)html_content+is_expandable=Trueexpanded_htmlhtml_content+is_expandable=Falsepreview_htmlis_expandableexpanded_html is not Nonetype_htmltype_namevisually)Naming convention:
`*_html` suffix indicates raw HTML (caller responsible for escaping); plain text fields are auto-escaped.
UI/UX improvements:
▼/▲arrows instead of⋯/▲for consistencyEdit (Jan 7, 2025)
Test architecture overhaul:
Tests reorganized from a single file into 10 focused modules for maintainability and parallel execution:
test_repr_core.pytest_repr_sections.pytest_repr_formatters.pytest_repr_ui.pytest_repr_warnings.pytest_repr_registry.pytest_repr_lazy.pytest_html_validator.pyHTMLValidator class (
conftest.py) provides structured HTML assertions:Key features: regex-based (no dependencies), section-aware matching, exact attribute matching to avoid "obs" matching "obsm".
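The exact-match point can be illustrated with a short regex sketch. The `data-section` attribute and the `has_section` helper are assumptions for illustration and may not match the attribute names the PR's `HTMLValidator` actually inspects.

```python
import re


def has_section(html: str, name: str) -> bool:
    """Match the attribute value exactly, including its closing quote."""
    pattern = rf'data-section="{re.escape(name)}"'
    return re.search(pattern, html) is not None
```

Anchoring on the closing quote is what keeps a query for `obs` from matching a section named `obsm`; a plain substring search for `obs` would match both.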
Optional strict validation when dependencies available:
- `validate_html5()` - W3C HTML5 + ARIA (requires `vnu`)
- `validate_js()` - JavaScript syntax (requires `esprima`)
Jupyter Notebook/Lab compatibility tests (13 new tests in `TestJupyterNotebookCompatibility`):
Validates CSS scoping, JavaScript isolation, unique IDs across multiple cells, and Jupyter dark mode support.
Bug fix:
`readme-modal-title` ID is now unique per container to prevent ID collisions when multiple AnnData objects are displayed in the same notebook.
Edit (Jan 8, 2025)
Maintainability improvements:
_render_entry_rowandrender_formatted_entryto eliminate duplicationget_formatter_for()andlist_formatters()methods to FormatterRegistry__init__.pystatic/directorytests/repr/html_validator.pymodule (conftest.py: 960→270 lines)_repr_constants.pyrender_entry_type_cell()signaturelazy.pymodulestatic/css_colors.txtfor easy updatesFile structure changes:
API simplifications:
render_entry_type_cell()now acceptsTypeCellConfigdataclass instead of 10 individual parametersis_lazy_adata(),is_lazy_column(),get_lazy_categories(),get_lazy_categorical_info()importlib.resources.files()(Python 3.9+)Edit (Jan 9, 2025)
Robustness & escaping coverage testing:
Added 108 tests in
test_repr_robustness.pyacross 14 test classes:html.escape()is called at every user-data insertion point using a<b>MARKER</b>probe__repr__,__len__,__sizeof__, properties)Escaping tests trust
`html.escape()` (stdlib) and only verify it's called at every insertion point, rather than exercising the escaping mechanism itself with attack vectors.
Test cleanup:
Removed redundant and overly-specific tests to focus on meaningful coverage. Tests now verify behavior that matters (e.g., XSS escaped, errors visible, truncation applied) rather than testing identical code paths multiple times.
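The probe-based escaping checks mentioned in this edit follow a simple shape, sketched below. `assert_probe_escaped` and `render_column_name` are hypothetical names; the real tests render a full AnnData repr rather than a single span.

```python
from html import escape

PROBE = "<b>MARKER</b>"


def assert_probe_escaped(rendered: str) -> None:
    """The raw probe must be absent and its escaped form present."""
    assert PROBE not in rendered, "probe reached the output unescaped"
    assert escape(PROBE) in rendered, "probe was dropped instead of escaped"


def render_column_name(name: str) -> str:
    """Hypothetical insertion point under test."""
    return f'<span class="name">{escape(name)}</span>'
```

Checking for the escaped form as well as the absence of the raw form guards against a renderer that passes the test only because it silently dropped the value.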
Visual inspection: Consolidated to 26 scenarios with single comprehensive "Evil AnnData" test combining all adversarial patterns.
Fixes:
repr_html_max_readme_sizeto_settings.pyitype stubspytest.warnsfor expected warnings)Updated stats:
Edit (Jan 16, 2025)
Error handling consolidation:
Refactored error handling to use a single `error` field in `FormattedOutput` instead of separate `is_hard_error` parameters scattered across the codebase.
Key changes:
FormattedOutputerror: str | Nonefield with documented precedence overpreview/preview_htmlFallbackFormatterFormatterRegistry.format_value()render_formatted_entry()is_hard_errorparam, now detects viaoutput.error_validate_key_and_collect_warnings()(key_warnings, is_key_not_serializable)- key issues mark as not serializable, preserving previewError vs Warning separation:
- `output.error`: Hard rendering failure - row highlighted red, error message replaces preview
- `output.is_serializable=False`: Serialization warning - red background, but preview preserved
New behavior when formatters fail:
This prevents long error messages from appearing in HTML while preserving full details in warnings for debugging. Serialization issues (like non-string keys, lambdas, custom objects) preserve the value preview while showing the reason in the tooltip.
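The precedence of `error` over the preview fields can be sketched as follows. This `FormattedOutput` is a simplified stand-in with only the fields named in this description, and the `error` CSS class is an assumption; the PR's dataclass has more fields and different markup.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class FormattedOutput:
    """Simplified stand-in for the PR's dataclass."""

    type_name: str
    preview: Optional[str] = None       # plain text, escaped by the renderer
    preview_html: Optional[str] = None  # raw HTML, escaped by the formatter
    error: Optional[str] = None         # hard failure, replaces any preview
    is_serializable: bool = True        # warning only, preview is kept


def render_preview(out: FormattedOutput) -> str:
    """Apply the documented precedence: error, then preview_html, then preview."""
    if out.error is not None:
        return f'<span class="error">{escape(out.error)}</span>'
    if out.preview_html is not None:
        return out.preview_html
    return escape(out.preview or "")
```

Note that `is_serializable=False` does not enter this function at all: it changes how the row is styled, not what the preview cell contains.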
Updated stats:
Edit (Jan 26, 2025)
Review response changes (addressing @flying-sheep's review):
Typing: `Any` → `object`
Replaced all ~95 uses of `Any` across 7 files. Formatter method signatures now use `obj: object` since AnnData's `uns` accepts genuinely arbitrary objects and formatters handle AnnData-like objects (e.g., MuData) via duck typing. `dict[str, Any]` with known structure replaced with precise union types.
CSS: Native nesting + dark mode + variable dedup
repr.cssto native CSS nesting (&). Selector repetitions of.anndata-reprreduced from 173 to 13. File length unchanged (~1164 lines) because the feature surface is genuinely large (~68 component blocks, 14 dtype colors, copy button, README styling, state variants), not because of repetition.[data-theme="dark"]for Furo/sphinx-book-theme) alongside existing Jupyter/VS Code detection.@media (prefers-color-scheme: dark)block and theme-selector block.&--variant) produce invalid CSS at nesting depth 2+ (browser treats&as:is(parent child), so&--viewbecomes:is(.anndata-repr .anndata-badge)--view). 7 modifier rules flattened to sibling selectors.Security tests simplified
Replaced ~34 attack-vector-heavy tests with 12 focused escaping-coverage tests. Each test puts a `<b>MARKER</b>` probe at one user-data insertion point and verifies it appears escaped. Removed `TestCSSAttacks`, `TestEncodingAttacks`; trimmed `TestBadColorArrays`, `TestEvilReadme`; consolidated `TestUltimateEvilAnnData` to 1 test. Total: 108 tests (14 classes), down from 123 (16 classes).
Other:
FormatterContext.column_namerenamed toFormatterContext.keyFormatterRegistry.format_value()Future-Proofing: Related PRs and Issues
This PR includes explicit handling and/or code references to track compatibility with several in-progress or future changes. The following PRs/issues may trigger updates to the `_repr` module:
Already Handled
_reprSparseMatrixFormatteruses duck typing fallbackformatters.py:242,260,307ArrayAPIFormattervia duck typingformatters.py:771,1135ArrayAPIFormatterMay Require Updates When Merged
LazyCategoricalDtypeAPICategoricalArrayinternalslazy.py(all functions)obsformatters.py:159Recommended Post-Merge Actions
When feat: add
LazyCategoricalDtypefor lazy categorical columns #2288 merges:CategoricalFormatterandlazy.pyto use the newLazyCategoricalDtypeAPIget_lazy_categorical_info()extracts category count by manually navigatingobj.variable._data.array— replace withdtype.n_categoriesanddtype.head_categories(n)isinstance(dtype, LazyCategoricalDtype)for cleaner detectionWhen Add support for lists in obs #1923 is resolved:
_check_series_serializability()informatters.pyto recognize list-of-strings as serializableWhen feat: allow gpu io in
sparse_datasetby removingscipyinheritance #1927 merges:SparseMatrixFormatterstill works with new sparse array classesis_sparse()utility or the new classes have a stable API, the duck typing incan_format()(checking fornnz,tocsr,tocsc) could be simplified to direct type checksWhen feat: array-api compatibility #2063/feat: support array-api #2071 stabilize:
ArrayAPIFormatterduck typing (shape/dtype/ndim) follows the Array API standard and is the correct approachis_array_api_compatible(), could use that instead of manual attribute checks"cubed": "Cubed"toknown_backendsdict inArrayAPIFormatterfor prettier display labelsInternal API Usage Inventory
Current patterns accessing internal/private APIs that may be replaceable:
lazy.py:_get_categorical_array()col.variable._data.arrayisinstance(dtype, LazyCategoricalDtype)lazy.py:get_lazy_category_count()CategoricalArray._categories["values"].shape[0]dtype.n_categorieslazy.py:get_lazy_categorical_info()._categories,._ordereddtype.n_categories,dtype.orderedlazy.py:get_lazy_categories()read_elem_partial()on private._categoriesdtype.head_categories(n)lazy.py:is_lazy_adata()obs.__class__.__name__ == "Dataset2D"SparseMatrixFormatter.can_format()nnz,tocsr,tocscArrayAPIFormatter.can_format()shape,dtype,ndimBackedSparseDatasetFormatter.can_format()formatattr
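For reference, the duck-typing checks listed in this inventory follow a pattern like the sketch below. `looks_sparse` and `looks_array_like` are hypothetical helper names; the attribute lists are the ones given above for `SparseMatrixFormatter.can_format()` and `ArrayAPIFormatter.can_format()`.

```python
def looks_sparse(obj: object) -> bool:
    """Duck-typed check using the attributes listed for sparse matrices."""
    return all(hasattr(obj, attr) for attr in ("nnz", "tocsr", "tocsc"))


def looks_array_like(obj: object) -> bool:
    """Duck-typed check using the attributes listed for Array-API objects."""
    return all(hasattr(obj, attr) for attr in ("shape", "dtype", "ndim"))
```

Checks of this kind avoid importing optional backends just to run `isinstance`, at the cost of accepting any object that happens to expose the same attribute names.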