(issue 724): bypass pandas using pytables directly to work with HDF5 files #761
alixdamman wants to merge 4 commits into larray-project:master
Conversation
I know you didn't ask for it, but FWIW, I will not have time to review this initial POC before the 26th of April. I need a clear head to do this and that's not the case these days... My initial gut feeling is that you are putting too much into the LArray HDF layer (e.g. sort_rows, sort_columns & fill_value). Those are only useful when loading HDF files not produced via larray, which is nice to have but is not what I expected this issue to be about. I just wanted a simple binary save of larray objects, and to load them back as fast as possible, thus bypassing pandas entirely. Loading arbitrary HDF objects (not produced via larray) is much more complex and I wouldn't tackle it now, unless you have a clear idea of what is needed.
PytablesHDFStore is not yet implemented; the class is almost empty. My first objective is to ensure backward compatibility with previously produced HDF files. Currently, the PR shows a fully implemented PandasHDFStore which just reproduces what we had before. Nothing else. I opened the PR to let you know that I'm working on it. I don't expect you to review this soon, but maybe, after the 26/4, you could first look at the class structure.
I know. I just wanted to clarify my vision. Not that you have to follow it, but I wanted to avoid any misunderstanding.
To also clarify my own position, my wish now is to implement a
larray/inout/hdf.py (outdated)
    return array.astype(str)
    try:
        array = np.asarray(array)
        if array.dtype == np.object_ or (not PY2 and array.dtype.kind == 'S'):
Converting object arrays to strings seems like a bad idea (imagine an object array containing floats or ints), or mixed arrays with both strings and numbers (I think most of our real-life object arrays are of that kind).
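To make the concern concrete, here is a minimal sketch (the `arr` variable and the blind `astype(str)` cast are illustrative, not code from this PR):

```python
import numpy as np

# a mixed object array with both strings and numbers,
# as commonly found in real-life data
arr = np.array(['a', 1, 2.5], dtype=np.object_)

# blindly casting to str turns the numbers into text,
# so they can no longer round-trip as numbers
print(arr.astype(str))  # ['a' '1' '2.5']
```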
Reminder: add a test with an array of dtype=np.object_.
Force-pushed from 9264074 to 01669f2 (compare)
Force-pushed from df28aea to 03e2fc4 (compare)
The pandas bypass and the internals cleanup both seem great. However, I am not sure about making this (LHDFStore) part of the public API yet, because it overlaps heavily with a lazy Session backed by an HDF file.

At this point I am unsure what is the best API to offer our users for opening an HDF file, loading/writing some arrays when needed, then closing the file. I fear that having both lazy sessions and LHDFStore as public API could confuse our users, because that would essentially be two ways to do the same thing (in addition to the current confusion about read_X and sessions), as sketched below.

But do not revert anything (except maybe the changes to api.rst); in the worst case, LHDFStore will be used by lazy sessions. We might also decide this is a better API than lazy sessions, or that it's worth having both APIs, but I simply cannot tell for now. I would like to avoid you publishing the 0.32 release with this new API being advertised, and then us adding another similar API in the next release, with our users confused as a result. This might be what we do anyway in the end, but it at least needs to be thought through.
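To make the overlap concrete, a hedged sketch of the two styles side by side (the exact indexing/attribute syntax of both APIs is an assumption for illustration; the LHDFStore import path follows this PR's commits, and lazy Session loading is not implemented yet):

```python
from larray import Session
from larray.inout.hdf import LHDFStore  # location per this PR's commits

# style 1: an explicit store object, similar in spirit to pandas.HDFStore
with LHDFStore('data.h5') as store:
    arr = store['arr1']        # load a single array on demand
    store['arr2'] = arr * 2    # write another one back

# style 2: a lazy Session backed by the same file (hypothetical)
ses = Session('data.h5')       # would not load anything yet
arr = ses.arr1                 # each array would be loaded on first access
```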
- moved LHDFStore to inout/hdf.py
- implemented PandasStorer and PytablesStorer
- updated LArray/Axis/Group.to_hdf
- removed Metadata.to_hdf and Metadata.from_hdf
- renamed PandasHDFHandler as HDFHandler
Force-pushed from 03e2fc4 to 559d0c0 (compare)
Force-pushed from 559d0c0 to fa1c222 (compare)
        engine = 'tables'
    else:
        import tables
        handle = tables.open_file(filepath, mode='r')
wouldn't it be better to use a context manager here?
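For reference, a minimal sketch of that suggestion (the file name is hypothetical; PyTables `File` objects support the context manager protocol):

```python
import tables

filepath = 'data.h5'  # hypothetical file name

# the handle is closed automatically, even if an exception is raised
with tables.open_file(filepath, mode='r') as handle:
    print(handle.root)  # access the needed nodes here
```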
gdementen left a comment:
I just had a good look at this again with fresh eyes, and I think there are too many abstraction layers in there:

    # -> means "uses"
    # ( ) means "inherits from"
    Session -> HDFHandler(FileHandler) -> LHDFStore -> PytablesStorer(AbstractStorer)
                                                    -> PandasStorer(AbstractStorer)

I imagined something much "flatter":

    Session -> PytablesHDFHandler(FileHandler)
            -> PandasHDFHandler(FileHandler)

or (more likely) to avoid a bit of code duplication:

    Session -> PytablesHDFHandler(HDFHandler(FileHandler))
            -> PandasHDFHandler(HDFHandler(FileHandler))

Note that nothing forbids us from having extra methods in HDFHandler and/or P*HDFHandler for HDF-specific stuff (if any is actually necessary). We could probably also make a few of the methods in FileHandler public, and add a few extra methods so that it can be used directly as a context manager, like you do with LHDFStore. I don't understand why we need those two extra abstraction layers. I could understand one extra layer if we could not accommodate the HDF specificities with extra methods/attributes (but off the top of my head, I don't see why that would be the case).
I know it has been a while, but do you remember why you did it this way instead of implementing specific FileHandlers (and enhancing the FileHandler class/protocol as needed)?
PS: The LazySession stuff can come on top of the FileHandler paradigm I think, so here I am happy I didn't go with Session subclasses for each type of file.
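For illustration, a skeleton of the flatter variant (the class names come from the diagram above; the `_open_for_read` method and the exact FileHandler protocol are assumptions, not code from this PR):

```python
import pandas as pd
import tables


class FileHandler:
    """Base protocol: open a file, read/dump items, close it."""
    def __init__(self, fname):
        self.fname = fname
        self.handle = None

    def _open_for_read(self):
        raise NotImplementedError()

    def close(self):
        self.handle.close()


class HDFHandler(FileHandler):
    """Code shared by both HDF engines would live here."""


class PandasHDFHandler(HDFHandler):
    def _open_for_read(self):
        self.handle = pd.HDFStore(self.fname, mode='r')


class PytablesHDFHandler(HDFHandler):
    def _open_for_read(self):
        self.handle = tables.open_file(self.fname, mode='r')
```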
        Key (path) of the array within the HDF file (see Notes below).
    engine: {'auto', 'tables', 'pandas'}, optional
        Dump using `engine`. Use 'pandas' to update an HDF file generated with a LArray version previous to 0.31.
        Defaults to 'auto' (use default engine if you don't know the LArray version used to produced the HDF file).
change "used to produced" to either "used to produce" or "which produced"
    if group is not None:
        self._handle.remove_node(group, recursive=True)
    paths = key.split('/')
    # recursively create the parent groups
doesn't the createparents argument do this?
https://www.pytables.org/usersguide/libref/file_class.html#tables.File.create_group
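For reference, a minimal sketch of what that would look like (hypothetical file and group names):

```python
import tables

with tables.open_file('demo.h5', mode='w') as h5file:
    # createparents=True creates the missing parents ('/a', then '/a/b')
    # on the fly, so no manual recursion over key.split('/') is needed
    h5file.create_group('/a/b', 'c', createparents=True)
```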
Not yet finished, but I started the PR to open the discussion.