55
66## Summary
77
8- One paragraph explanation of the proposed change.
8+ This RFC formalizes the Vortex type system by grounding ` DType ` as a quotient type over physical
9+ encodings and establishes a decision framework (based on refinement types) for when new ` DType `
10+ variants are justified. It additionally proposes extending canonicalization from a single target
11+ (one canonical "normal form" per ` DType ` ) to parameterized "sections" via ` to_canonical_target ` ,
12+ allowing consumers to choose among multiple valid canonical targets (e.g., ` List ` vs. ` ListView ` ).
913
1014## Motivation
1115
@@ -34,7 +38,7 @@ The second proposal is to relax (or extend) the concept of a "canonical" type fr
3438physical encoding for every logical type (a unique normal form) to allowing many possible canonical
3539targets (multiple normal forms).
3640
37- # Type Theory Background
41+ ## Type Theory Background
3842
3943This section introduces the type-theoretic concepts that underpin Vortex's ` DType ` system and its
4044relationship to physical encodings. To reiterate, most of the maintainers understand these concepts
@@ -43,9 +47,9 @@ intuitively, but there is value in mapping these implicit concepts to explicit t
4347Note that this section made heavy use of LLMs to help research and identify terms and definitions,
4448as the author of this RFC is notably _ not_ a type theory expert.
4549
46- ## Equivalence Classes and ` DType ` as a Quotient Type
50+ ### Equivalence Classes and ` DType ` as a Quotient Type
4751
48- ### In Theory
52+ #### In Theory
4953
5054An ** equivalence relation** ` ~ ` on a set ` S ` is a relation that is reflexive (` a ~ a ` ), symmetric
5155(` a ~ b ` implies ` b ~ a ` ), and transitive (` a ~ b ` and ` b ~ c ` implies ` a ~ c ` ). An equivalence
@@ -62,7 +66,7 @@ must produce the same result regardless of which member of the class you operate
6266` f : A → B ` respects the equivalence relation (` a ~ a' ` implies ` f(a) = f(a') ` ), then ` f ` descends
6367to a well-defined function on the quotient ` f' : A/~ → B ` .
6468
65- ### In Vortex
69+ #### In Vortex
6670
6771Consider the set of all physical array representations / encodings in Vortex: a dictionary-encoded
6872` i32 ` array, a run-end-encoded ` i32 ` array, a bitpacked ` i32 ` array, a flat Arrow ` i32 ` buffer,
@@ -77,7 +81,7 @@ A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equiv
7781tells us what logical data we are working with, but says nothing about which physical encoding is
7882representing it. Thus, we can say that logical types in Vortex form equivalence classes, and ` DType `
7983is the set of equivalence classes. More formally, ` DType ` is the quotient type over the space of
80- physical encodings, collapsed by decoded / decompressed equivalence relation.
84+ physical encodings, collapsed by the decoded / decompressed equivalence relation.
8185
8286This quotient structure imposes a concrete requirement: any operation defined on ` DType ` must
8387produce the same result regardless of which physical encoding backs the data.
@@ -86,7 +90,7 @@ For example, operations like `filter`, `take`, and `scalar_at` all satisfy this:
8690the logical values, not on how those values are stored. However, an operation like "return the
8791` ends ` buffer" is not well-defined on the quotient type as that only exists for run-end encoding.
8892
89- ## Sections and Canonicalization
93+ ### Sections and Canonicalization
9094
9195Observe that every physical array (a specific encoding combined with actual data) maps back to a
9296` DType ` . A run-end-encoded ` i32 ` array maps to ` Primitive(I32) ` , as does a dictionary-encoded ` i32 `
@@ -100,7 +104,7 @@ which physical encoding should I use to represent it?"
100104
101105** In Vortex** , the current ` to_canonical ` function is a section. For each ` DType ` , it selects
102106exactly one canonical physical form. Observe how the ` Canonical ` enum is essentially identical to
103- ` DType ` enum (with the exception of ` VarBinView ` with ` Utf8 ` and ` Binary ` ):
107+ the ` DType ` enum (with the exception of ` VarBinView ` with ` Utf8 ` and ` Binary ` ):
104108
105109``` rust
106110/// The different logical types in Vortex (the different equivalence classes).
@@ -141,7 +145,7 @@ variable-length list data. Both are valid sections (since both pick a representa
141145equivalence class), and both satisfy ` π(s(d)) = d ` . The current system in Vortex simply hardcodes
142146one particular section. The second proposal in this RFC is to allow _ multiple sections_ .
143147
144- ## The Church-Rosser Property (Confluence)
148+ ### The Church-Rosser Property (Confluence)
145149
146150A rewriting system has the ** Church-Rosser property** (or is ** confluent** ) if, whenever a term can
147151be reduced in two different ways, both reduction paths can be continued to reach the same final
@@ -165,13 +169,13 @@ In Vortex, a similar scenario would be defining multiple strategies of canonical
165169strategy could target ` List ` as a canonical target (normal form) for list data, and another strategy
166170could target ` ListView ` . See the [ ` List ` vs. ` ListView ` ] ( #list-vs-listview ) section for more info.
167171
168- ## Refinement Types and the DType Decision Framework
172+ ### Refinement Types and the DType Decision Framework
169173
170174A ** refinement type** ` { x : T | P(x) } ` is a type ` T ` restricted to values satisfying a predicate
171175` P ` . Refinement types express subtypes without changing the underlying representation, instead they
172176add constraints that gate operations or impose invariants.
173177
174- For example in Vortex, ` Utf8 ` is a refinement of ` Binary ` :
178+ For example, in Vortex, ` Utf8 ` is a refinement of ` Binary ` :
175179
176180```
177181Utf8 ~= { b : Binary | valid_utf8(b) }
@@ -236,11 +240,6 @@ earns its place in `DType`.
236240
237241Under this reading, ` FixedSizeBinary ` is an encoding or canonical form, not a logical type.
238242
239- ### Decision
240-
241- It is somewhat hard to decide which is the right way to go. However, this section provides some more
242- structure to the discussions we have been holding.
243-
244243## List vs. ListView
245244
246245There is also a separate question about whether the current ` Canonical ` system is the most ideal.
@@ -262,7 +261,7 @@ The distinction is entirely in the buffer layout:
262261 belong to no element).
263262
264263This is a purely physical distinction, as no query operation can observe the difference.
265- For example, ` scalar_at(i) ` will always returns the same list, and ` filter ` , ` take ` , and ` slice ` all
264+ For example, ` scalar_at(i) ` will always return the same list, and ` filter ` , ` take ` , and ` slice ` all
266265produce logically identical results.
267266
268267However, this physical distinction has massive performance implications. Converting from ` ListView `
@@ -274,7 +273,7 @@ a form with stronger structural guarantees (no aliasing, no gaps). The section t
274273` ListView ` is cheaper and permits aliasing and faster random access.
275274
276275Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer ` List `
277- because Arrow has is adding support for ` ListView ` slowly. Other consumers prefer ` ListView ` because
276+ because Arrow is adding support for ` ListView ` slowly. Other consumers prefer ` ListView ` because
278277some operations (namely random access and potentially dependent operations) are faster.
279278
280279It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to
@@ -345,7 +344,26 @@ For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `N
345344produce the same canonical form. The parameterization only has an effect where multiple valid
346345canonical forms exist.
347346
348- TODO wrap everything up and explain how the type theory helps justify this.
347+ Because ` DType ` is a quotient type and canonicalization is a section on that quotient, supporting
348+ multiple canonical targets is theoretically sound (adding this feature will not weaken any invariant
349+ of the Vortex type system). Because each ` CanonicalTarget ` defines an internally confluent reduction
350+ strategy, the quotient structure of ` DType ` guarantees that all well-defined operations produce
351+ the same logical results regardless of which target is chosen.
352+
353+ ```
354+ Logical type (DType: purely semantic, not physical)
355+ ▲
356+ │ section 1: Default (ListView, VarBinView)
357+ │ section 2: Contiguous (List, VarBin)
358+ │ section n: ... (future targets)
359+ ▼
360+ Canonical form (a specific "normal form" encoding chosen by the section)
361+ ▲
362+ │ encode
363+ │ decode
364+ ▼
365+ Physical encoding (dictionary, REE, bitpacked, etc.)
366+ ```
349367
350368## Compatibility
351369
@@ -367,32 +385,32 @@ not fit the use case.
367385
368386## Prior Art
369387
370- TODO
371-
372- How have other systems solved this or similar problems? Consider:
373-
374- - Other columnar formats (Parquet, Arrow, etc.).
375- - Database internals (DuckDB, DataFusion, Velox, etc.).
376- - Relevant academic papers or blog posts.
377-
378- This section helps frame the design in a broader context. If there is no relevant prior art, that is fine.
388+ - ** Arrow** defines both ` List ` and ` ListView ` (and ` LargeList ` /` LargeListView ` ) as separate type
389+ IDs in its columnar format. Arrow's canonical layout uses monotonic offsets (` List ` ), and
390+ ` ListView ` support is being added incrementally across implementations. Arrow also distinguishes
391+ ` FixedSizeBinary ` from ` FixedSizeList<uint8> ` at the type level.
392+ - ** Parquet** has ` FIXED_LEN_BYTE_ARRAY ` as a distinct primitive type, separate from repeated
393+ groups. This is a nominal distinction similar to the ` FixedSizeBinary ` question.
394+ - ** DuckDB** uses a single canonical form per logical type and does not support multiple
395+ canonicalization targets.
379396
380397## Unresolved Questions
381398
382399- Should ` FixedSizeBinary<n> ` be a ` DType ` variant (refinement type) or extension type metadata?
383400 See the [ analysis above] ( #should-fixedsizebinary-be-a-dtype ) for the case for and against.
384-
385- TODO
386-
387- - What parts of the design need to be resolved during the RFC process?
388- - What is explicitly out of scope for this RFC?
389- - Are there open questions that can be deferred to implementation?
401+ - What is the concrete set of ` CanonicalTarget ` variants? This RFC proposes ` Default ` and
402+ ` Contiguous ` , but there may be other useful targets. Theoretically, we could have a target that
403+ would allow us to return compressed arrays as a ` Canonical ` type!
390404
391405## Future Possibilities
392406
393- TODO
394-
395- What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing.
407+ - We can have extension types be a generalization of the refinement type pattern: for example we
408+ could enforce user-defined predicates that gate custom operations on existing ` DType ` s.
409+ - User-defined or plugin-defined canonical targets for custom FFI boundaries or serialization
410+ formats.
411+ - Using the decision framework to audit existing ` DType ` variants and determine if any should be
412+ consolidated or split, as well as to make decisions about logical types we want to add (namely
413+ ` FixedSizeBinary ` and ` Variant ` ).
396414
397415## Further Reading
398416
0 commit comments