Why is it not in Debian? Is there some deep and dark secret?

https://search.debian.org/cgi-bin/omega?DB=en&P=parquet

pandas is a Python-centric tabular data handler that works well in clouds (and on desktop Debian). pandas can read Parquet data today, as can the other libs mentioned. The binary dot-so driver style is single-host-centric and not the emphasis of these cloudy projects (and their cloudy funders).

https://pandas.pydata.org/docs/reference/api/pandas.read_par...

https://packages.debian.org/buster/python3-pandas

Perhaps more alarm would be called for if Python + pandas with Parquet did not work on Debian, but that is not the case today.

PS: data access in clouds often uses an s3:// endpoint, in contrast to a POSIX endpoint using _fread()_ or similar. Many Parquet-aware clients prefer the cloudy, un-POSIX method of accessing data, and that is another reason it is not a simple package in Debian today.
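
To make the s3:// versus POSIX contrast concrete, here is a minimal sketch that tells the two kinds of location apart. The function name `classify_location` and the example bucket and paths are made up for illustration; this is not how any particular Parquet client does it.

```python
from urllib.parse import urlsplit


def classify_location(location):
    """Return ('s3', bucket, key) for an s3:// URI, else ('posix', path).

    Hypothetical helper for illustration only; real clients have their
    own dispatch logic.
    """
    parts = urlsplit(location)
    if parts.scheme == "s3":
        # Object stores address data by bucket and key, not by file path.
        return ("s3", parts.netloc, parts.path.lstrip("/"))
    # Anything else is treated here as a plain filesystem path.
    return ("posix", location)


print(classify_location("s3://example-bucket/data/part-0.parquet"))
print(classify_location("/srv/data/part-0.parquet"))
```

The POSIX case can be handed straight to open() or fread(); the s3 case needs an HTTP client and credentials, which is the extra machinery a plain distro package does not bring along by itself.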

pandas often has significant memory overhead, so it's not uncommon to need ~3-5x your file size in memory.

Polars and DuckDB are much better about memory management.
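
As a back-of-envelope illustration of the ~3-5x rule of thumb above (the factors are just that rule of thumb, not a measurement, and `estimate_pandas_memory` is a made-up helper name):

```python
GIB = 1024 ** 3


def estimate_pandas_memory(file_size_bytes, low_factor=3, high_factor=5):
    """Return a rough (low, high) RAM estimate in bytes for loading a file.

    The default factors encode the ~3-5x rule of thumb; actual overhead
    depends on dtypes, compression, and the data itself.
    """
    return file_size_bytes * low_factor, file_size_bytes * high_factor


low, high = estimate_pandas_memory(2 * GIB)
print(f"2 GiB file: roughly {low / GIB:.0f}-{high / GIB:.0f} GiB of RAM")
```

For a file on disk, os.path.getsize(path) would supply the first argument.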

As I understand it, pandas can read Parquet only if the pyarrow or fastparquet package is available, but that's not the case in Debian, and attempts to fix that have been underway for several years.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
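
A quick way to check this on a given machine, as a minimal sketch: it only tests whether the two engine modules are importable and does not call pandas at all. The helper name `available_parquet_engines` is made up for illustration.

```python
import importlib.util

# The two engines named above that pandas.read_parquet can use.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate modules that can be imported in this environment."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


found = available_parquet_engines()
if found:
    print("Parquet engine(s) found:", ", ".join(found))
else:
    print("No Parquet engine found; pandas.read_parquet would fail to import one")
```

On a system with only distro-packaged pandas and neither engine installed, this should report that none was found.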
