-
Notifications
You must be signed in to change notification settings - Fork 110
GH-946: Add Variant extension type support #947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
|
We're about to overhaul the API for implementing extension writers so you may want to hold off |
Thanks @lidavidm, I will wait for that, then rebase. You are talking about the |
Implements VariantType extension type with VariantVector for storing variant data with metadata and value buffers. Includes reader/writer implementations and comprehensive test coverage.
2b45ca4 to
a297cc6
Compare
| @Override | ||
| public Object getObject(int index) { | ||
| return getUnderlyingVector().getObject(index); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is useful this way. I don't have a strong opinion though what should it return instead. @lidavidm, do you have an idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it stands, this returns the raw struct row? I suppose decoding to a Java object may be what's expected but that may be expensive. Or returning an explicit Variant object that has methods for inspecting/manipulating the data in that row
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if the variant representation of parquet-java would be suitable here. That is a java object backed by nio ByteBuffers. It would be beneficial as arrow-java would not need to maintain/test the related code. Meanwhile, it does not support returning "rich" objects (e.g. timestamps) but the primitive representations only (e.g. long for a timestamp). I'm not sure how it would fit the current concept.
Alternatively, we would be able to wrap the parquet-java Variant object and extend it with those rich types the different Arrow vectors support (e.g. LocalDateTime for a timestamp nano value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we go this route, I'd rather have an Arrow Variant to avoid having to break APIs down the line if we decide to extend functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be sure I understand you correctly, @lidavidm, do you agree with the following?
Add the new parquet-java dependency and use its variant implementation as a bases of a new Arrow Variant class to be returned by VariantVector.getObject. This class can return the corresponding java objects for the related primitives according to the java objects returned by their typed vectors. No parquet-java classes shall face the public API of Arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, arrow-parquet-variant might be OK to isolate the dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with arrow-parquet-variant approach. It cleanly isolated the dependency (to avoid any side effect in other parts of Arrow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I've misinterpreted @jbonofre's suggestion. I wanted to reference the parquet-java's parquet-variant module. That contains only 10 prod classes. But right, it has a dependency on other parquet-java modules as well. So, we have two options:
- Create the separate
arrow-variantmodule that contains any variant related classes, including the ones already included in this PR. - Duplicate the variant related code from
parquet-java, and implement our own classes based on it in the existingvectormodule.
I do not have a strong opinion which one would be the best choice for Arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was my concern about parquet-variant: the "parent" and transitive dependencies. I don't think we should fetch a large part of Parquet in Arrow.
I have a preference to create arrow-variant module with clearly defined dependencies.
Thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. @tmater, could you update your PR accordingly?
Implements VariantType extension type with VariantVector for storing variant data with metadata and value buffers. Includes reader/writer implementations and comprehensive test coverage.
What's Changed
Closes #GH-946.