Skip to content

Conversation

@gh-yzou
Copy link

@gh-yzou gh-yzou commented May 28, 2025

Add spec and type description for YEAR_MONTH_INTERVAL and DURATION types. The type can be used to map the SQL types YearMonthInterval and DayTimeInterval

This carries over the work in #43

LogicalTypes.md Outdated

#### INTERVAL_DAY_TIME
`INTERVAL_DAY_TIME` is used to represent a day-time time interval, such as
`5 days, 10 hours and 30 minutes`. It must annotate and 16-byte `FIXED_LEN_BYTE_ARRAY`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a line that is like "As in ansi standard ... day_time [link or citiation) a day is always considered to be 24 hours (equivalent to X nanos)"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added one more paragraph at the end of this section. There is actually no documentation explicitly stated that 1 day is equal to 24 hours, those seems indicated from the value range definition of the SQL standard, so I linked the whole SQL standard doc here.

LogicalTypes.md Outdated

#### INTERVAL_DAY_TIME
`INTERVAL_DAY_TIME` is used to represent a day-time time interval, such as
`5 days, 10 hours and 30 minutes`. It must annotate and 16-byte `FIXED_LEN_BYTE_ARRAY`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still have to decide whether we also just want to introduce at 128bit int primitive type, but this is ok for now and we can even shape it like a 128bit int primitive ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg!, I am keeping that comment here, so we can carry the discussion along

LogicalTypes.md Outdated

#### INTERVAL
`INTERVAL` is *deprecated*. Please use `INTERVAL_YEAR_MONTH` and `INTERVAL_DAY_TIME`
as a more precise representation per [ANSI SQL Standard](https://www.ibm.com/docs/en/informix-servers/14.10.0?topic=types-ansi-sql-standards-datetime-interval-values).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more official link rather than an IBM documentation?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find an oracle definition, which might be better than the IBM one. updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say either IBM or Oracle are definitive on the ANSI SQL standard? (The problem being that the spec is not freely available to any of us, so this becomes a "trust what I'm saying" situation w.r.t the spec's wording...)

@gh-yzou gh-yzou requested a review from RussellSpitzer May 29, 2025 02:00
@gh-yzou gh-yzou marked this pull request as ready for review May 29, 2025 17:15
@gh-yzou gh-yzou requested a review from wgtmac June 10, 2025 17:57
*
* Allowed for physical type: FIXED_LEN_BYTE_ARRAY
*/
struct IntervalMonthDayType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest either naming this with nanoseconds or parameterizing it in case we want a different granularity in the future.

Copy link
Author

@gh-yzou gh-yzou Jun 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more beneficial to keep IntervalDayTimeType to be more consistent with the SQL standard. For the granularity, i think we can potentially introduce scale and precision for engine to interpret, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gh-yzou Parquet is not tied to SQL, so we shouldn't feel constrained by SQL's choice of vocabulary, especially when it could be misleading or confusing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I made a mistake before, this the type is not really MonthDayType, it is actually DayTimeIntervalType, the types doesn't have a Month component. Is the suggestion to have name like "IntervalDayNanoType" ? I think the problem is if we call it DayNano is that in the future we might need to introduce type like DayMills, DayMirco for different precision usage, which could be quite tedious. I think it might be cleaner to have DayTime and then introduce scale and precision parameter for different granularity in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using "scale" and "precision", can we instead just parameterize the unit (Milli, Micro, Nano, etc.)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just parameterize the unit should also work. However, if the current representation is already the highest precision, it should cover the capability of representing other unit, right? i am thinking how much extra benefit it could bring by supporting different granularity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts here were similar, generally we want to add different precisions is because we want to cover different ranges of possible values, but if our underlying type can support the full range we expect of other precisions it probably isn't important.

I think this is basically the rational behind some of the other interval types like in MonthDayNanos (assuming you treat days not as calendar days - not in spec but theoretically you could have done that)

/**
* Month-Day Interval logical type annotation
*
* The data is stored as a 16-byte signed value, which represents the number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signed little

Suggested change
* The data is stored as a 16-byte signed value, which represents the number
* The data is stored as a 16-byte signed little endian value, which represents the number

? Also, consider providing some hex strings with corresponding nano-second count to avoid ambiguity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, also added and example

LogicalTypes.md Outdated
The time interval is independent of any timezone.

Based on the [ANSI SQL standard](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-7690645A-0EE3-46CA-90DE-C96DF5A01F8F) definition of the INTERVAL data type and fields values,
an interval of 1 day is equivalent to 24 hours, regardless of the specific number of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this detail should also probably be in parquet.thrift (reference to SQL standard probably isn't strictly necessary there, just how 1 day is defined).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definition also turns this into what other systems often call a "duration" not an "interval" right? For instance this would map to pandas/datetime Timedelta and not pandas DateOffset.

Copy link
Contributor

@emkornfield emkornfield Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm you are correct. SQL standard calls this type interval which is where the name comes from. But for DayTime the SQL standard appears to make the distinction somewhat meaningless. For year-month there is a clear semantic difference.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the DayTimeInterval sometimes are referred as duration to be more clear on the unit used, and the pandas TimeDelta can be mapped to DayTimeInterval type, which is one of the biggest use case also. The YearMonth is clearly different as duration as @lidavidm pointed out.

@lidavidm

this detail should also probably be in parquet.thrift (reference to SQL standard probably isn't strictly necessary there, just how 1 day is defined).

is your suggestion to remove the SQL reference here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't make that suggestion, to be clear ^_^ but I do think it would help to be clear (somewhere) about whether each type is a calendar interval/relative delta or a duration/absolute time delta

* The data is stored as a 16-byte signed value, which represents the number
* of nanoseconds. The value can be negative to indicate a backward duration.
*
* Allowed for physical type: FIXED_LEN_BYTE_ARRAY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one issue with this is it potentially limits encodings, I wonder if we should be defining a primitive int128 physical type, or we might need to just special case some encodings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think my concern with int128 is that would that be just used by IntervalMonthDayType, and we might need to also introduce uint128 to be more complete. For now, I think just special case some encodings might be good enough.

Copy link
Member

@pitrou pitrou Jun 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, only DELTA_BINARY_PACKED would come on top of a hypothetical INT128 type... assuming it's ever implemented for 128-bit ints (and, if possible, implemented efficiently!).

OTOH, FLBA gives you several encodings for free that are already implemented. In particular, BYTE_STREAM_SPLIT should be quite efficient at making this new data type more compressible.

/**
* Month-Day Interval logical type annotation
*
* The data is stored as a 16-byte signed value, which represents the number
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason to go with a 16-byte signed integer as opposed to using Arrow's MonthDayNano type? That way Parquet maintains a simple and efficient interoperability with Arrow instead of forcing a new Arrow type to be compatible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer is MonthDayNanos more closely mimics Postgres's notion of interval, where each field is separate (so a day is not guaranteed to be 24 hours). Apparently, according the SQL spec (which is not publicly available, and I don't currently have access to and at least some other DB providers also follow the convention) is that a day is always 24 hours/86400 seconds as documented.

The main difference that arise from the representations are:

  1. MonthDayNanos in arrow allows for days that are not 24 hours, and you can have more the 86400 seconds in the nanos field. One short-coming of the arrow representation is it can't represent +/- 10000 years at nanosecond precision which is typical in SQL.
  2. This representations makes it so everything can be normalized. But I think a fair question is why not encode it as days + nanoseconds and do the multiplication instead of division to get backs (in this case I think it just means more math then applying nanosecond addition directly to nanosecond timestamps).

That being said, I think there might be a place for all 3 of these representations just like we have all three in arrow, if someone wants to add the third.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nanosecond timestamps are 8 byte though right? So it's not just a trivial addition (if we're using 16 bytes here)

One short-coming of the arrow representation is it can't represent +/- 10000 years at nanosecond precision which is typical in SQL.

MonthDayNano has a 32-bit months field and 64-bit nanos field which together I would think gives you full precision for 10,000 years? (As the spec says there is no constraint on how many seconds are in the nanos field.)

Copy link
Contributor

@emkornfield emkornfield Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nanosecond timestamps are 8 byte though right? So it's not just a trivial addition (if we're using 16 bytes here)

Yes you are correct, but for most cases I imagine the 8 most significant bytes will be sign extension which is likely slightly cheaper to check and mask out then force multiplications.

MonthDayNano has a 32-bit months field and 64-bit nanos field which together I would think gives you full precision for 10,000 years? (As the spec says there is no constraint on how many seconds are in the nanos field.)

Taken in aggregate then give you 10,000 years. I was referring to only the nanosecond portion (Arrow specifically calls out that each field is independent). Days was specifically in Arrow where meant to be a "calendar" type which might not be 24 hours (which is what postgres does) so the 10K year coverage using Arrow definitions would be slightly different semantics, then this representation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it. I suppose the strictly closest Arrow type, if there were one, would be 128 bit duration then (and I suppose we would want to define int128 while we're at it).

I think if we're going to do this, we should kick off a parallel discussion in Arrow (and maybe RAPIDS/cuDF) for a potential new (extension) type since otherwise we're won't be able to actually represent the new type in (for instance) all our client libraries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically my main goal is to have the type map cleanly onto an Arrow type. Currently this being 16 bytes makes that very hard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing i'm a little worried about with the Arrow definition for MonthDayNano is the handling of bounds. According to the arrow spec the Nano and Day portions are allowed to vary in signs. Wouldn't this make calculations of min-max values a bit more confusing? I think if we did go that route, and i'm not sure we should, but I think we would have to add a few other limitations which would diverge a bit there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limitations that are still valid for the Arrow MonthDayNano would still be very useful as it would still provide ~zero-copy reads in that we wouldn't need to modify/transform the values during read. Writing would just need to update the value to meet the limitations before we write.

According to the arrow spec the Nano and Day portions are allowed to vary in signs. Wouldn't this make calculations of min-max values a bit more confusing?

It would make it a bit more complex, sure, but I don't think they would be confusing. Every value would still refer to a specific interval of time that can be compared.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking selfishly about how I could do a straight binary comparison if we don't mix signs, but if we do mix signs I have to deserialize first right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero-copy reads don't really exist in Parquet, except if you use no nulls, no compression, no encryption, PLAIN encoding, and you're willing to carry data in page-sized chunks.

LogicalTypes.md Outdated

#### INTERVAL

`INTERVAL` is *deprecated*. Please use `INTERVAL_YEAR_MONTH` and `INTERVAL_DAY_TIME`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why deprecated? Parquet is not limited to SQL workloads.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the discussion at https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk, I think the existing Parquet interval types are not well-thought and not adopted widely (may be biased).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I read your comment there and I think you're right. The new type is less trivially convertible from Arrow's month_day_nano, though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new type is less trivially convertible from Arrow's month_day_nano, though.

I think the problem exists regardless whether have two types "YearMonthInterval" and "DayTimeInterval", or just have MonthDayNano type. Since the number of days in a month changes, converting from MonthDayNano to YearMonthInterval or DayTimeInterval is not possible unless we force that either the Month component is empty or the day nano components are empty.
If we want to provide full interoperability across different engines with different types, we might eventually need to support all three types.

LogicalTypes.md Outdated

`INTERVAL_DAY_TIME` is used to represent a day-time time interval, such as
`5 days, 10 hours and 30 minutes`. It must annotate a 16-byte `FIXED_LEN_BYTE_ARRAY`
that stores the total number of nanoseconds representing the interval. The value is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call it "day-time" if it only stores nanoseconds? Please let's make the naming clear and non-misleading.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is mainly choose based on the SQL Standard, and which is also close to the type name used by engines like spark. I think the type name doesn't have to be exactly the same as the underlying storage used. Similarly the YearMonthInterval type stores the total number of months.

@gh-yzou gh-yzou force-pushed the yzou-interval-spec branch from da5c30b to 0c42791 Compare June 23, 2025 18:24
@gh-yzou gh-yzou force-pushed the yzou-interval-spec branch from 38793c0 to 2196689 Compare July 25, 2025 16:52
@gh-yzou gh-yzou changed the title Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types Add YEAR_MONTH_INTERVAL and DURATION types Jul 25, 2025
@gh-yzou gh-yzou requested review from lidavidm and pitrou July 25, 2025 17:01
@gh-yzou gh-yzou requested review from emkornfield and zeroshade July 25, 2025 17:01
@gh-yzou
Copy link
Author

gh-yzou commented Jul 25, 2025

@RussellSpitzer @lidavidm @pitrou @wgtmac @zeroshade I updated the PR based on the discussion result on this thread https://lists.apache.org/thread/n8jdft4mltdcf91v7t8qf1hz5cg8nbnz, please help take one more look. Thanks a lot for helping on the review!

LogicalTypes.md Outdated
The `DURATION` type takes `unit` as a parameter, and the value must be one of
`MILLIS`, `MICROS` or `NANOS`.

Duration of 1 day is defined exactly of 24 hours, regardless of the specific number of
Copy link
Contributor

@emkornfield emkornfield Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this makes sense without mentioning, that Durations support the ANSI SQL definition of DAY_TIME. we might want to rephrase a little bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Duration of 1 day is defined exactly of 24 hours, regardless of the specific number of
Durations can be used to represent ANSI SQL's definition of DayTime Intervals. In this case, a duration of 1 day is defined exactly of 24 hours, regardless of the specific number of

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. mentioned ANSI SQL standard in the description

LogicalTypes.md Outdated
`YEAR_MONTH_INTERVAL` is used to represent a year-month time interval, such as
`4 years and 6 months`. It must annotate an `int32` that stores the total number
of months as a signed integer, which represents the interval and can be negative.
The time duration is independent of any timezone.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can delete this line?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg! deleted

}

/**
* Day-Time Interval logical type annotation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Day-Time Interval logical type annotation
* Duration logical type annotation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, but overall looks good to me.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this addresses my concerns vis-a-vis Arrow compatibility.

@gh-yzou gh-yzou requested a review from emkornfield July 28, 2025 04:54
@gh-yzou gh-yzou requested a review from sfc-gh-yzou July 28, 2025 04:59
LogicalTypes.md Outdated

`DURATION` is used to represent a span of time, such as `5 days`. It must
annotate an `int64` value that stores the total number of time units for the
duration. The value is a signed integer and can be negative to indicate backward
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
duration. The value is a signed integer and can be negative to indicate backward
duration. The value is a signed integer and can be negative to indicate a backwards

Minor grammar fix^

But I am wondering if we have a better way to describe what a negative duration represents.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i updated it to

The value is a signed integer, where a negative value indicates the
duration moves backward in time (e.g., -5 days means going backward for 5 days).  

what do you think about the current description?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@gh-yzou
Copy link
Author

gh-yzou commented Jul 29, 2025

@emkornfield Thanks a lot for helping on the discussion and review! Do you think we can merge this PR? or shall we wait for a bit longer?

@emkornfield
Copy link
Contributor

@gh-yzou We typically wait for reference implementations before merging the format change (i.e. at this point we are at step 2 https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md).

As an example this is how geography was done (https://lists.apache.org/thread/s6s714c98cn9gg22mnk5nsn7xymym8xo). Variant has been handled a little bit differently because it originated outside of the project.

@sfc-gh-yzou
Copy link

@emkornfield Thanks a lot for the pointer! I will start the reference implementation for java.

The duration is purely a measure of time and is independent of any time zone.

The `DURATION` type takes `unit` as a parameter, and the value must be one of
`MILLIS`, `MICROS` or `NANOS`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`MILLIS`, `MICROS` or `NANOS`.
`MILLIS`, `MICROS` or `NANOS`.

The `DURATION` type takes `unit` as a parameter, and the value must be one of
`MILLIS`, `MICROS` or `NANOS`.

`Duration` can be used to represent DayTime Intervals as defined by ANSI SQL. In
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`Duration` can be used to represent DayTime Intervals as defined by ANSI SQL. In
`DURATION` can be used to represent day-time interval as defined by ANSI SQL. In

Just to be consistent with above.

/**
* Year-Month Interval logical type annotation
*
* The data is stored as an 4 byte signed integer which represents the number
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The data is stored as an 4 byte signed integer which represents the number
* The data is stored as a 4-byte signed integer which represents the number

@wgtmac
Copy link
Member

wgtmac commented Jul 30, 2025

Any plan to add PoC implementation to arrow-cpp or arrow-rs to meet the 2 impls requirement? @sfc-gh-yzou

@gh-yzou
Copy link
Author

gh-yzou commented Aug 27, 2025

Hi @wgtmac sorry for the late reply! Got distracted by some other projects, the current plan is to add java and cpp implementation for the interval type, no specific plan for arrow yet. Perhaps, i can swap the cpp support to arrow-cpp if that is more useful?

@wgtmac
Copy link
Member

wgtmac commented Aug 29, 2025

@gh-yzou I'm just wondering what cpp implementation are you talking about. Usually we refer to parquet-cpp in the arrow-cpp repo: https://github.com/apache/arrow/tree/main/cpp/src/parquet

@gh-yzou
Copy link
Author

gh-yzou commented Aug 29, 2025

@wgtmac yes, that is the one i refer to. Sorry, i am new to the community, I guess, we are talking about the same thing about arrow-cpp and parquet-cpp then :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants