Have you seen tiledb? https://tiledb.com/data-types/dataframes My team is current...

stavrospap · on May 4, 2022

Hi folks, Stavros from TileDB here. Here are my two cents on tabular data. TileDB (Embedded) is a very serious competitor to Parquet, the only other sane choice IMO when it comes to storing large volumes of tabular data (especially when combined with Arrow). Admittedly, we haven’t been advertising TileDB’s tabular capabilities, but that’s only because we were busy with much more challenging applications, such as genomics (population and single-cell), LiDAR, imaging and other very convoluted (from a data format perspective) domains.

Similar to Parquet:

* TileDB is columnar and comes with many compression, checksum, and encryption filters.

* TileDB is built in C++ with multi-threading and vectorization in mind

* TileDB integrates with Arrow, using zero-copy techniques

* TileDB has numerous optimized APIs (C, C++, C#, Python, R, Java, Go)

* TileDB pushes compute down to storage, similar to what Arrow does

Better than Parquet:

* TileDB is multi-dimensional, allowing rapid multi-column conditions

* TileDB builds versioning and time-traveling into the format (no need for Delta Lake, Iceberg, etc)

* TileDB allows for lock-free parallel writes / parallel reads with ACID properties (no need for Delta Lake, Iceberg, etc)

* TileDB can handle more than tables, for example n-dimensional dense arrays (e.g., for imaging, video, etc)
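To make the versioning and time-travel point a bit more concrete, here is a toy Python sketch of the general idea: every write lands as a new immutable, timestamped batch, and a read "as of" some time merges only the batches written up to that time. This is only an illustration under my own assumptions, not TileDB's API or on-disk format; `Fragment`, `ToyVersionedTable`, and the sample keys are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One immutable batch of writes, stamped with a logical timestamp."""
    timestamp: int
    cells: dict  # maps (row_key, column) -> value


class ToyVersionedTable:
    """Hypothetical toy model: append-only writes, time-travel reads."""

    def __init__(self):
        self._fragments = []

    def write(self, timestamp, cells):
        # Writers only append a new fragment; earlier fragments are never
        # modified, so older versions stay readable.
        self._fragments.append(Fragment(timestamp, dict(cells)))

    def read(self, as_of=None):
        # Keep only the fragments written at or before `as_of`
        # (all fragments when `as_of` is None).
        visible = [
            f for f in self._fragments
            if as_of is None or f.timestamp <= as_of
        ]
        merged = {}
        # Apply fragments oldest first so later writes win for the same cell.
        for frag in sorted(visible, key=lambda f: f.timestamp):
            merged.update(frag.cells)
        return merged


table = ToyVersionedTable()
table.write(1, {("sample1", "depth"): 30})
table.write(2, {("sample1", "depth"): 35, ("sample2", "depth"): 12})

snapshot_at_1 = table.read(as_of=1)  # only the first write is visible
latest = table.read()                # both writes, the later value wins
```

Since the toy writers never touch existing batches, reading an older version is just a matter of ignoring the newer ones.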

Useful links:

* GitHub repo (https://github.com/TileDB-Inc/TileDB)

* TileDB Embedded overview (https://tiledb.com/products/tiledb-embedded/)

* Docs (https://docs.tiledb.com/)

* Webinar on why arrays work as a universal data model (https://tiledb.com/blog/why-arrays-as-a-universal-data-model)

Happy to hear everyone’s thoughts.

elmolino89 · on May 4, 2022

TileDB-VCF does work very well. Which types of data stored as HDF5 are you ingesting into TileDB?