Skip to content

Commit 25d9ccc

Browse files
committed
first draft done
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent ab1da6e commit 25d9ccc

1 file changed

Lines changed: 55 additions & 37 deletions

File tree

proposed/9999-types.md

Lines changed: 55 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,11 @@
55

66
## Summary
77

8-
One paragraph explanation of the proposed change.
8+
This RFC formalizes the Vortex type system by grounding `DType` as a quotient type over physical
9+
encodings and establishes a decision framework (based on refinement types) for when new `DType`
10+
variants are justified. It additionally proposes extending canonicalization from a single target
11+
(one canonical "normal form" per `DType`) to parameterized "sections" via `to_canonical_target`,
12+
allowing consumers to choose among multiple valid canonical targets (e.g., `List` vs. `ListView`).
913

1014
## Motivation
1115

@@ -34,7 +38,7 @@ The second proposal is to relax (or extend) the concept of a "canonical" type fr
3438
physical encoding for every logical type (a unique normal form) to allowing many possible canonical
3539
targets (multiple normal forms).
3640

37-
# Type Theory Background
41+
## Type Theory Background
3842

3943
This section introduces the type-theoretic concepts that underpin Vortex's `DType` system and its
4044
relationship to physical encodings. To reiterate, most of the maintainers understand these concepts
@@ -43,9 +47,9 @@ intuitively, but there is value in mapping these implicit concepts to explicit t
4347
Note that this section made heavy use of LLMs to help research and identify terms and definitions,
4448
as the author of this RFC is notably _not_ a type theory expert.
4549

46-
## Equivalence Classes and `DType` as a Quotient Type
50+
### Equivalence Classes and `DType` as a Quotient Type
4751

48-
### In Theory
52+
#### In Theory
4953

5054
An **equivalence relation** `~` on a set `S` is a relation that is reflexive (`a ~ a`), symmetric
5155
(`a ~ b` implies `b ~ a`), and transitive (`a ~ b` and `b ~ c` implies `a ~ c`). An equivalence
@@ -62,7 +66,7 @@ must produce the same result regardless of which member of the class you operate
6266
`f : A → B` respects the equivalence relation (`a ~ a'` implies `f(a) = f(a')`), then `f` descends
6367
to a well-defined function on the quotient `f' : A/~ → B`.
6468

65-
### In Vortex
69+
#### In Vortex
6670

6771
Consider the set of all physical array representations / encodings in Vortex: a dictionary-encoded
6872
`i32` array, a run-end-encoded `i32` array, a bitpacked `i32` array, a flat Arrow `i32` buffer,
@@ -77,7 +81,7 @@ A Vortex `DType` like `Primitive(I32, NonNullable)` **names** one of these equiv
7781
tells us what logical data we are working with, but says nothing about which physical encoding is
7882
representing it. Thus, we can say that logical types in Vortex form equivalence classes, and `DType`
7983
is the set of equivalence classes. More formally, `DType` is the quotient type over the space of
80-
physical encodings, collapsed by decoded / decompressed equivalence relation.
84+
physical encodings, collapsed by the decoded / decompressed equivalence relation.
8185

8286
This quotient structure imposes a concrete requirement: any operation defined on `DType` must
8387
produce the same result regardless of which physical encoding backs the data.
@@ -86,7 +90,7 @@ For example, operations like `filter`, `take`, and `scalar_at` all satisfy this:
8690
the logical values, not on how those values are stored. However, an operation like "return the
8791
`ends` buffer" is not well-defined on the quotient type as that only exists for run-end encoding.
8892

89-
## Sections and Canonicalization
93+
### Sections and Canonicalization
9094

9195
Observe that every physical array (a specific encoding combined with actual data) maps back to a
9296
`DType`. A run-end-encoded `i32` array maps to `Primitive(I32)`, as does a dictionary-encoded `i32`
@@ -100,7 +104,7 @@ which physical encoding should I use to represent it?"
100104

101105
**In Vortex**, the current `to_canonical` function is a section. For each `DType`, it selects
102106
exactly one canonical physical form. Observe how the `Canonical` enum is essentially identical to
103-
`DType` enum (with the exception of `VarBinView` with `Utf8` and `Binary`):
107+
the `DType` enum (with the exception of `VarBinView` with `Utf8` and `Binary`):
104108

105109
```rust
106110
/// The different logical types in Vortex (the different equivalence classes).
@@ -141,7 +145,7 @@ variable-length list data. Both are valid sections (since both pick a representa
141145
equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes
142146
one particular section. The second proposal in this RFC is to allow _multiple sections_.
143147

144-
## The Church-Rosser Property (Confluence)
148+
### The Church-Rosser Property (Confluence)
145149

146150
A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can
147151
be reduced in two different ways, both reduction paths can be continued to reach the same final
@@ -165,13 +169,13 @@ In Vortex, a similar scenario would be defining multiple strategies of canonical
165169
strategy could target `List` as a canonical target (normal form) for list data, and another strategy
166170
could target `ListView`. See the [`List` vs. `ListView`](#list-vs-listview) section for more info.
167171

168-
## Refinement Types and the DType Decision Framework
172+
### Refinement Types and the DType Decision Framework
169173

170174
A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate
171175
`P`. Refinement types express subtypes without changing the underlying representation, instead they
172176
add constraints that gate operations or impose invariants.
173177

174-
For example in Vortex, `Utf8` is a refinement of `Binary`:
178+
For example, in Vortex, `Utf8` is a refinement of `Binary`:
175179

176180
```
177181
Utf8 ~= { b : Binary | valid_utf8(b) }
@@ -236,11 +240,6 @@ earns its place in `DType`.
236240

237241
Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type.
238242

239-
### Decision
240-
241-
It is somewhat hard to decide which is the right way to go. However, this section provides some more
242-
structure to the discussions we have been holding.
243-
244243
## List vs. ListView
245244

246245
There is also a separate question about whether the current `Canonical` system is the most ideal.
@@ -262,7 +261,7 @@ The distinction is entirely in the buffer layout:
262261
belong to no element).
263262

264263
This is a purely physical distinction, as no query operation can observe the difference.
265-
For example, `scalar_at(i)` will always returns the same list, and `filter`, `take`, and `slice` all
264+
For example, `scalar_at(i)` will always return the same list, and `filter`, `take`, and `slice` all
266265
produce logically identical results.
267266

268267
However, this physical distinction has massive performance implications. Converting from `ListView`
@@ -274,7 +273,7 @@ a form with stronger structural guarantees (no aliasing, no gaps). The section t
274273
`ListView` is cheaper and permits aliasing and faster random access.
275274

276275
Many of our consumers (particularly Arrow FFI boundaries and consumers of DataFusion) prefer `List`
277-
because Arrow has is adding support for `ListView` slowly. Other consumers prefer `ListView` because
276+
because Arrow is adding support for `ListView` slowly. Other consumers prefer `ListView` because
278277
some operations (namely random access and potentially dependent operations) are faster.
279278

280279
It is highly unfortunate that we cannot canonicalize into different targets, and we are forced to
@@ -345,7 +344,26 @@ For `DType`s where the distinction does not apply (e.g., `Primitive`, `Bool`, `N
345344
produce the same canonical form. The parameterization only has an effect where multiple valid
346345
canonical forms exist.
347346

348-
TODO wrap everything up and explain how the type theory helps justify this.
347+
Because `DType` is a quotient type and canonicalization is a section on that quotient, supporting
348+
multiple canonical targets is theoretically sound (adding this feature will not weaken any invariant
349+
of the Vortex type system). Because each `CanonicalTarget` defines an internally confluent reduction
350+
strategy, the quotient structure of `DType` guarantees that all well-defined operations produce
351+
the same logical results regardless of which target is chosen.
352+
353+
```
354+
Logical type (DType: purely semantic, not physical)
355+
356+
│ section 1: Default (ListView, VarBinView)
357+
│ section 2: Contiguous (List, VarBin)
358+
│ section n: ... (future targets)
359+
360+
Canonical form (a specific "normal form" encoding chosen by the section)
361+
362+
│ encode
363+
│ decode
364+
365+
Physical encoding (dictionary, REE, bitpacked, etc.)
366+
```
349367

350368
## Compatibility
351369

@@ -367,32 +385,32 @@ not fit the use case.
367385

368386
## Prior Art
369387

370-
TODO
371-
372-
How have other systems solved this or similar problems? Consider:
373-
374-
- Other columnar formats (Parquet, Arrow, etc.).
375-
- Database internals (DuckDB, DataFusion, Velox, etc.).
376-
- Relevant academic papers or blog posts.
377-
378-
This section helps frame the design in a broader context. If there is no relevant prior art, that is fine.
388+
- **Arrow** defines both `List` and `ListView` (and `LargeList`/`LargeListView`) as separate type
389+
IDs in its columnar format. Arrow's canonical layout uses monotonic offsets (`List`), and
390+
`ListView` support is being added incrementally across implementations. Arrow also distinguishes
391+
`FixedSizeBinary` from `FixedSizeList<uint8>` at the type level.
392+
- **Parquet** has `FIXED_LEN_BYTE_ARRAY` as a distinct primitive type, separate from repeated
393+
groups. This is a nominal distinction similar to the `FixedSizeBinary` question.
394+
- **DuckDB** uses a single canonical form per logical type and does not support multiple
395+
canonicalization targets.
379396

380397
## Unresolved Questions
381398

382399
- Should `FixedSizeBinary<n>` be a `DType` variant (refinement type) or extension type metadata?
383400
See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against.
384-
385-
TODO
386-
387-
- What parts of the design need to be resolved during the RFC process?
388-
- What is explicitly out of scope for this RFC?
389-
- Are there open questions that can be deferred to implementation?
401+
- What is the concrete set of `CanonicalTarget` variants? This RFC proposes `Default` and
402+
`Contiguous`, but there may be other useful targets. Theoretically, we could have a target that
403+
would allow us to return compressed arrays as a `Canonical` type!
390404

391405
## Future Possibilities
392406

393-
TODO
394-
395-
What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing.
407+
- We can have extension types be a generalization of the refinement type pattern: for example we
408+
could enforce user-defined predicates that gate custom operations on existing `DType`s.
409+
- User-defined or plugin-defined canonical targets for custom FFI boundaries or serialization
410+
formats.
411+
- Using the decision framework to audit existing `DType` variants and determine if any should be
412+
consolidated or split, as well as to make decisions about logical types we want to add (namely
413+
`FixedSizeBinary` and `Variant`).
396414

397415
## Further Reading
398416

0 commit comments

Comments
 (0)