The Birth of Parquet

https://sympathetic.ink/2024/01/24/Chapter-1-The-birth-of-Parquet.html

145 | whinvik | 1 week ago | 79 | HN

jjgreen 1 week ago | parent | next

Why is it not in Debian? Is there some deep and dark secret?

https://search.debian.org/cgi-bin/omega?DB=en&P=parquet

mistrial9 1 week ago | parent

pandas is a Python-centric tabular data handler that works well in clouds (and on desktop Debian). Pandas can read parquet data today, as can the other libs mentioned. The binary dot-so driver style is single-host centric and not the emphasis of these cloudy projects (and their cloudy funders).

https://pandas.pydata.org/docs/reference/api/pandas.read_par...

https://packages.debian.org/buster/python3-pandas

Perhaps more alarm would be called for if python+pandas with parquet did not work on Debian, but that is not the case today.

PS: data access in clouds often uses an S3:// endpoint, in contrast to a POSIX endpoint using _fread()_ or similar. Many parquet-aware clients prefer the cloudy, un-POSIX method of accessing data, and that is another reason it is not a simple package in Debian today.
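
To make the S3-versus-POSIX distinction concrete, here is a minimal sketch that classifies a data location by its URI scheme. The function name `access_style`, the example paths, and the set of schemes treated as object stores are all illustrative assumptions, not part of any parquet client's API.

```python
from urllib.parse import urlparse

# Schemes treated as object-store endpoints in this sketch.
# This set is an assumption for illustration, not an exhaustive list.
OBJECT_STORE_SCHEMES = {"s3", "gs"}


def access_style(location: str) -> str:
    """Classify a data location as 'object-store', 'posix', or 'other'."""
    scheme = urlparse(location).scheme.lower()
    if scheme in OBJECT_STORE_SCHEMES:
        return "object-store"
    if scheme in ("", "file"):
        # A bare path or a file:// URI is read through ordinary file calls.
        return "posix"
    return "other"


if __name__ == "__main__":
    # Hypothetical locations, used only to exercise the function.
    for loc in ("s3://some-bucket/data.parquet", "/srv/data/data.parquet"):
        print(loc, "->", access_style(loc))
```

A client built around the first style needs credentials and a network stack before it can read a single byte, which is part of why it does not map neatly onto a small, self-contained distribution package.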

datadrivenangel 1 week ago | root | parent | next

Pandas often has significant memory overhead, so it's not uncommon to need ~3-5x your file size in memory.

Polars and DuckDB are much better about memory management.
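
As a rough back-of-the-envelope version of that rule of thumb, the sketch below turns a file size into a memory range. The function name `estimate_pandas_memory` is hypothetical, and the 3x and 5x factors are simply the figures quoted above, not measured values; real overhead depends on dtypes, compression, and the data itself.

```python
def estimate_pandas_memory(file_size_bytes: int,
                           low_factor: int = 3,
                           high_factor: int = 5) -> tuple[int, int]:
    """Return a (low, high) estimate in bytes of the RAM needed to load a file.

    The default factors come from the ~3-5x rule of thumb and are
    assumptions, not guarantees.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    return file_size_bytes * low_factor, file_size_bytes * high_factor


if __name__ == "__main__":
    GIB = 1024 ** 3
    low, high = estimate_pandas_memory(2 * GIB)
    print(f"2 GiB file: roughly {low / GIB:.0f}-{high / GIB:.0f} GiB of RAM")
```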

jjgreen 1 week ago | root | parent

As I understand it, pandas can read parquet if the pyarrow or fastparquet packages are available, but that's not the case in Debian, and attempts to fix that have been underway for several years.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
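
One way to check whether a given Python environment has either engine is sketched below. It uses only the standard library to look for the `pyarrow` and `fastparquet` modules; the helper name `available_parquet_engines` is hypothetical, and the check only confirms that a module can be found, not that pandas will accept its version.

```python
from importlib.util import find_spec

# The two engine packages named in the comment above.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines() -> list[str]:
    """Return the parquet engine modules that are importable here."""
    return [name for name in PARQUET_ENGINES if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_parquet_engines()
    if found:
        print("parquet engines found:", ", ".join(found))
    else:
        print("no parquet engine found; reading parquet from pandas would fail")
```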
