(Marco from Crunchy Data)

With PostgreSQL extensions, we find it most effective to keep them single-purpose and modular.

For instance, I created pg_cron a few years ago, and it's on basically every PostgreSQL service because it does one thing and does it well.
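
To give a sense of how small that surface area is, pg_cron's API is essentially a few SQL functions and a jobs table (the job name and schedule below are made up for illustration):

    -- run VACUUM every night at 3am, in the database where the job was scheduled
    SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');

    -- inspect and remove jobs
    SELECT jobid, schedule, command FROM cron.job;
    SELECT cron.unschedule('nightly-vacuum');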

We wanted to create a lightweight Parquet implementation that does not pull a multi-threaded library into every PostgreSQL process.

When you get to more complex features, a lot of questions around trade-offs, user experience, and deployment model start to appear. For instance, when querying an Iceberg table, caching becomes quite important, but that raises lots of other questions around cache management. Also, how do you deal with a memory-hungry, multi-threaded query engine running in every process without things constantly falling over?

It's easier to answer those questions in the context of a managed service where you control the environment, so we built a product that can query Iceberg/Parquet/CSV/etc. in S3, does automatic caching, figures out the region of your bucket, can create tables directly from files, and uses DuckDB to accelerate queries reliably. This is powered partly by a set of custom extensions, partly by other things running on the managed service. https://docs.crunchybridge.com/analytics
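
Roughly, pointing a table at a file in S3 looks something like the sketch below; the table name and path are placeholders, and the exact syntax is in the linked docs, so treat this as an approximation:

    -- columns can be inferred from the file itself
    CREATE FOREIGN TABLE sales ()
      SERVER crunchy_lake_analytics
      OPTIONS (path 's3://mybucket/sales.parquet');

    -- queries against it are pushed to the accelerated engine
    SELECT count(*) FROM sales;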

However, some components, like COPY TO/FROM Parquet, can be neatly extracted and shared broadly. We find it very useful for archiving old partitions, importing public and private data sets, preparing data for analytics, and moving data between PostgreSQL servers.
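
As a rough sketch of what that looks like in practice (table and bucket names here are placeholders):

    -- archive an old partition as Parquet in S3
    COPY sales_2023 TO 's3://archive/sales_2023.parquet' WITH (format 'parquet');

    -- load it back in on another server
    COPY sales_2023 FROM 's3://archive/sales_2023.parquet' WITH (format 'parquet');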