-
Notifications
You must be signed in to change notification settings - Fork 461
Add YEAR_MONTH_INTERVAL and DURATION types #496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
LogicalTypes.md
Outdated
|
|
||
| #### INTERVAL_DAY_TIME | ||
| `INTERVAL_DAY_TIME` is used to represent a day-time time interval, such as | ||
| `5 days, 10 hours and 30 minutes`. It must annotate and 16-byte `FIXED_LEN_BYTE_ARRAY` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a line that is like "As in ansi standard ... day_time [link or citiation) a day is always considered to be 24 hours (equivalent to X nanos)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i added one more paragraph at the end of this section. There is actually no documentation explicitly stated that 1 day is equal to 24 hours, those seems indicated from the value range definition of the SQL standard, so I linked the whole SQL standard doc here.
LogicalTypes.md
Outdated
|
|
||
| #### INTERVAL_DAY_TIME | ||
| `INTERVAL_DAY_TIME` is used to represent a day-time time interval, such as | ||
| `5 days, 10 hours and 30 minutes`. It must annotate and 16-byte `FIXED_LEN_BYTE_ARRAY` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still have to decide whether we also just want to introduce at 128bit int primitive type, but this is ok for now and we can even shape it like a 128bit int primitive ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg!, I am keeping that comment here, so we can carry the discussion along
LogicalTypes.md
Outdated
|
|
||
| #### INTERVAL | ||
| `INTERVAL` is *deprecated*. Please use `INTERVAL_YEAR_MONTH` and `INTERVAL_DAY_TIME` | ||
| as a more precise representation per [ANSI SQL Standard](https://www.ibm.com/docs/en/informix-servers/14.10.0?topic=types-ansi-sql-standards-datetime-interval-values). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more official link rather than an IBM documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find an oracle definition, which might be better than the IBM one. updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say either IBM or Oracle are definitive on the ANSI SQL standard? (The problem being that the spec is not freely available to any of us, so this becomes a "trust what I'm saying" situation w.r.t the spec's wording...)
src/main/thrift/parquet.thrift
Outdated
| * | ||
| * Allowed for physical type: FIXED_LEN_BYTE_ARRAY | ||
| */ | ||
| struct IntervalMonthDayType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest either naming this with nanoseconds or parameterizing it in case we want a different granularity in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more beneficial to keep IntervalDayTimeType to be more consistent with the SQL standard. For the granularity, i think we can potentially introduce scale and precision for engine to interpret, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gh-yzou Parquet is not tied to SQL, so we shouldn't feel constrained by SQL's choice of vocabulary, especially when it could be misleading or confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I made a mistake before, this the type is not really MonthDayType, it is actually DayTimeIntervalType, the types doesn't have a Month component. Is the suggestion to have name like "IntervalDayNanoType" ? I think the problem is if we call it DayNano is that in the future we might need to introduce type like DayMills, DayMirco for different precision usage, which could be quite tedious. I think it might be cleaner to have DayTime and then introduce scale and precision parameter for different granularity in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using "scale" and "precision", can we instead just parameterize the unit (Milli, Micro, Nano, etc.)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just parameterize the unit should also work. However, if the current representation is already the highest precision, it should cover the capability of representing other unit, right? i am thinking how much extra benefit it could bring by supporting different granularity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts here were similar, generally we want to add different precisions is because we want to cover different ranges of possible values, but if our underlying type can support the full range we expect of other precisions it probably isn't important.
I think this is basically the rational behind some of the other interval types like in MonthDayNanos (assuming you treat days not as calendar days - not in spec but theoretically you could have done that)
src/main/thrift/parquet.thrift
Outdated
| /** | ||
| * Month-Day Interval logical type annotation | ||
| * | ||
| * The data is stored as a 16-byte signed value, which represents the number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signed little
| * The data is stored as a 16-byte signed value, which represents the number | |
| * The data is stored as a 16-byte signed little endian value, which represents the number |
? Also, consider providing some hex strings with corresponding nano-second count to avoid ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, also added and example
LogicalTypes.md
Outdated
| The time interval is independent of any timezone. | ||
|
|
||
| Based on the [ANSI SQL standard](https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-7690645A-0EE3-46CA-90DE-C96DF5A01F8F) definition of the INTERVAL data type and fields values, | ||
| an interval of 1 day is equivalent to 24 hours, regardless of the specific number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this detail should also probably be in parquet.thrift (reference to SQL standard probably isn't strictly necessary there, just how 1 day is defined).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definition also turns this into what other systems often call a "duration" not an "interval" right? For instance this would map to pandas/datetime Timedelta and not pandas DateOffset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidavidm you are correct. SQL standard calls this type interval which is where the name comes from. But for DayTime the SQL standard appears to make the distinction somewhat meaningless. For year-month there is a clear semantic difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the DayTimeInterval sometimes are referred as duration to be more clear on the unit used, and the pandas TimeDelta can be mapped to DayTimeInterval type, which is one of the biggest use case also. The YearMonth is clearly different as duration as @lidavidm pointed out.
this detail should also probably be in parquet.thrift (reference to SQL standard probably isn't strictly necessary there, just how 1 day is defined).
is your suggestion to remove the SQL reference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't make that suggestion, to be clear ^_^ but I do think it would help to be clear (somewhere) about whether each type is a calendar interval/relative delta or a duration/absolute time delta
src/main/thrift/parquet.thrift
Outdated
| * The data is stored as a 16-byte signed value, which represents the number | ||
| * of nanoseconds. The value can be negative to indicate a backward duration. | ||
| * | ||
| * Allowed for physical type: FIXED_LEN_BYTE_ARRAY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one issue with this is it potentially limits encodings, I wonder if we should be defining a primitive int128 physical type, or we might need to just special case some encodings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think my concern with int128 is that would that be just used by IntervalMonthDayType, and we might need to also introduce uint128 to be more complete. For now, I think just special case some encodings might be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, only DELTA_BINARY_PACKED would come on top of a hypothetical INT128 type... assuming it's ever implemented for 128-bit ints (and, if possible, implemented efficiently!).
OTOH, FLBA gives you several encodings for free that are already implemented. In particular, BYTE_STREAM_SPLIT should be quite efficient at making this new data type more compressible.
src/main/thrift/parquet.thrift
Outdated
| /** | ||
| * Month-Day Interval logical type annotation | ||
| * | ||
| * The data is stored as a 16-byte signed value, which represents the number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason to go with a 16-byte signed integer as opposed to using Arrow's MonthDayNano type? That way Parquet maintains a simple and efficient interoperability with Arrow instead of forcing a new Arrow type to be compatible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Short answer is MonthDayNanos more closely mimics Postgres's notion of interval, where each field is separate (so a day is not guaranteed to be 24 hours). Apparently, according the SQL spec (which is not publicly available, and I don't currently have access to and at least some other DB providers also follow the convention) is that a day is always 24 hours/86400 seconds as documented.
The main difference that arise from the representations are:
- MonthDayNanos in arrow allows for days that are not 24 hours, and you can have more the 86400 seconds in the nanos field. One short-coming of the arrow representation is it can't represent +/- 10000 years at nanosecond precision which is typical in SQL.
- This representations makes it so everything can be normalized. But I think a fair question is why not encode it as days + nanoseconds and do the multiplication instead of division to get backs (in this case I think it just means more math then applying nanosecond addition directly to nanosecond timestamps).
That being said, I think there might be a place for all 3 of these representations just like we have all three in arrow, if someone wants to add the third.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nanosecond timestamps are 8 byte though right? So it's not just a trivial addition (if we're using 16 bytes here)
One short-coming of the arrow representation is it can't represent +/- 10000 years at nanosecond precision which is typical in SQL.
MonthDayNano has a 32-bit months field and 64-bit nanos field which together I would think gives you full precision for 10,000 years? (As the spec says there is no constraint on how many seconds are in the nanos field.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nanosecond timestamps are 8 byte though right? So it's not just a trivial addition (if we're using 16 bytes here)
Yes you are correct, but for most cases I imagine the 8 most significant bytes will be sign extension which is likely slightly cheaper to check and mask out then force multiplications.
MonthDayNano has a 32-bit months field and 64-bit nanos field which together I would think gives you full precision for 10,000 years? (As the spec says there is no constraint on how many seconds are in the nanos field.)
Taken in aggregate then give you 10,000 years. I was referring to only the nanosecond portion (Arrow specifically calls out that each field is independent). Days was specifically in Arrow where meant to be a "calendar" type which might not be 24 hours (which is what postgres does) so the 10K year coverage using Arrow definitions would be slightly different semantics, then this representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it. I suppose the strictly closest Arrow type, if there were one, would be 128 bit duration then (and I suppose we would want to define int128 while we're at it).
I think if we're going to do this, we should kick off a parallel discussion in Arrow (and maybe RAPIDS/cuDF) for a potential new (extension) type since otherwise we're won't be able to actually represent the new type in (for instance) all our client libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically my main goal is to have the type map cleanly onto an Arrow type. Currently this being 16 bytes makes that very hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing i'm a little worried about with the Arrow definition for MonthDayNano is the handling of bounds. According to the arrow spec the Nano and Day portions are allowed to vary in signs. Wouldn't this make calculations of min-max values a bit more confusing? I think if we did go that route, and i'm not sure we should, but I think we would have to add a few other limitations which would diverge a bit there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limitations that are still valid for the Arrow MonthDayNano would still be very useful as it would still provide ~zero-copy reads in that we wouldn't need to modify/transform the values during read. Writing would just need to update the value to meet the limitations before we write.
According to the arrow spec the Nano and Day portions are allowed to vary in signs. Wouldn't this make calculations of min-max values a bit more confusing?
It would make it a bit more complex, sure, but I don't think they would be confusing. Every value would still refer to a specific interval of time that can be compared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking selfishly about how I could do a straight binary comparison if we don't mix signs, but if we do mix signs I have to deserialize first right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zero-copy reads don't really exist in Parquet, except if you use no nulls, no compression, no encryption, PLAIN encoding, and you're willing to carry data in page-sized chunks.
LogicalTypes.md
Outdated
|
|
||
| #### INTERVAL | ||
|
|
||
| `INTERVAL` is *deprecated*. Please use `INTERVAL_YEAR_MONTH` and `INTERVAL_DAY_TIME` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why deprecated? Parquet is not limited to SQL workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the discussion at https://docs.google.com/document/d/12ghQxWxyAhSQeZyy0IWiwJ02gTqFOgfYm8x851HZFLk, I think the existing Parquet interval types are not well-thought and not adopted widely (may be biased).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I read your comment there and I think you're right. The new type is less trivially convertible from Arrow's month_day_nano, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new type is less trivially convertible from Arrow's month_day_nano, though.
I think the problem exists regardless whether have two types "YearMonthInterval" and "DayTimeInterval", or just have MonthDayNano type. Since the number of days in a month changes, converting from MonthDayNano to YearMonthInterval or DayTimeInterval is not possible unless we force that either the Month component is empty or the day nano components are empty.
If we want to provide full interoperability across different engines with different types, we might eventually need to support all three types.
LogicalTypes.md
Outdated
|
|
||
| `INTERVAL_DAY_TIME` is used to represent a day-time time interval, such as | ||
| `5 days, 10 hours and 30 minutes`. It must annotate a 16-byte `FIXED_LEN_BYTE_ARRAY` | ||
| that stores the total number of nanoseconds representing the interval. The value is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call it "day-time" if it only stores nanoseconds? Please let's make the naming clear and non-misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is mainly choose based on the SQL Standard, and which is also close to the type name used by engines like spark. I think the type name doesn't have to be exactly the same as the underlying storage used. Similarly the YearMonthInterval type stores the total number of months.
da5c30b to
0c42791
Compare
38793c0 to
2196689
Compare
|
@RussellSpitzer @lidavidm @pitrou @wgtmac @zeroshade I updated the PR based on the discussion result on this thread https://lists.apache.org/thread/n8jdft4mltdcf91v7t8qf1hz5cg8nbnz, please help take one more look. Thanks a lot for helping on the review! |
LogicalTypes.md
Outdated
| The `DURATION` type takes `unit` as a parameter, and the value must be one of | ||
| `MILLIS`, `MICROS` or `NANOS`. | ||
|
|
||
| Duration of 1 day is defined exactly of 24 hours, regardless of the specific number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this makes sense without mentioning, that Durations support the ANSI SQL definition of DAY_TIME. we might want to rephrase a little bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Duration of 1 day is defined exactly of 24 hours, regardless of the specific number of | |
| Durations can be used to represent ANSI SQL's definition of DayTime Intervals. In this case, a duration of 1 day is defined exactly of 24 hours, regardless of the specific number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense. mentioned ANSI SQL standard in the description
LogicalTypes.md
Outdated
| `YEAR_MONTH_INTERVAL` is used to represent a year-month time interval, such as | ||
| `4 years and 6 months`. It must annotate an `int32` that stores the total number | ||
| of months as a signed integer, which represents the interval and can be negative. | ||
| The time duration is independent of any timezone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can delete this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg! deleted
src/main/thrift/parquet.thrift
Outdated
| } | ||
|
|
||
| /** | ||
| * Day-Time Interval logical type annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Day-Time Interval logical type annotation | |
| * Duration logical type annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits, but overall looks good to me.
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this addresses my concerns vis-a-vis Arrow compatibility.
LogicalTypes.md
Outdated
|
|
||
| `DURATION` is used to represent a span of time, such as `5 days`. It must | ||
| annotate an `int64` value that stores the total number of time units for the | ||
| duration. The value is a signed integer and can be negative to indicate backward |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| duration. The value is a signed integer and can be negative to indicate backward | |
| duration. The value is a signed integer and can be negative to indicate a backwards |
Minor grammar fix^
But I am wondering if we have a better way to describe what a negative duration represents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i updated it to
The value is a signed integer, where a negative value indicates the
duration moves backward in time (e.g., -5 days means going backward for 5 days).
what do you think about the current description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
|
@emkornfield Thanks a lot for helping on the discussion and review! Do you think we can merge this PR? or shall we wait for a bit longer? |
|
@gh-yzou We typically wait for reference implementations before merging the format change (i.e. at this point we are at step 2 https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md). As an example this is how geography was done (https://lists.apache.org/thread/s6s714c98cn9gg22mnk5nsn7xymym8xo). Variant has been handled a little bit differently because it originated outside of the project. |
|
@emkornfield Thanks a lot for the pointer! I will start the reference implementation for java. |
| The duration is purely a measure of time and is independent of any time zone. | ||
|
|
||
| The `DURATION` type takes `unit` as a parameter, and the value must be one of | ||
| `MILLIS`, `MICROS` or `NANOS`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `MILLIS`, `MICROS` or `NANOS`. | |
| `MILLIS`, `MICROS` or `NANOS`. |
| The `DURATION` type takes `unit` as a parameter, and the value must be one of | ||
| `MILLIS`, `MICROS` or `NANOS`. | ||
|
|
||
| `Duration` can be used to represent DayTime Intervals as defined by ANSI SQL. In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `Duration` can be used to represent DayTime Intervals as defined by ANSI SQL. In | |
| `DURATION` can be used to represent day-time interval as defined by ANSI SQL. In |
Just to be consistent with above.
| /** | ||
| * Year-Month Interval logical type annotation | ||
| * | ||
| * The data is stored as an 4 byte signed integer which represents the number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * The data is stored as an 4 byte signed integer which represents the number | |
| * The data is stored as a 4-byte signed integer which represents the number |
|
Any plan to add PoC implementation to arrow-cpp or arrow-rs to meet the 2 impls requirement? @sfc-gh-yzou |
|
Hi @wgtmac sorry for the late reply! Got distracted by some other projects, the current plan is to add java and cpp implementation for the interval type, no specific plan for arrow yet. Perhaps, i can swap the cpp support to arrow-cpp if that is more useful? |
|
@gh-yzou I'm just wondering what cpp implementation are you talking about. Usually we refer to parquet-cpp in the arrow-cpp repo: https://github.com/apache/arrow/tree/main/cpp/src/parquet |
|
@wgtmac yes, that is the one i refer to. Sorry, i am new to the community, I guess, we are talking about the same thing about arrow-cpp and parquet-cpp then :). |
Add spec and type description for YEAR_MONTH_INTERVAL and DURATION types. The type can be used to map the SQL types YearMonthInterval and DayTimeInterval
This carries over the work in #43