Why DuckDB is my first choice for data processing

(robinlinacre.com)

310 points | by tosh 22 days ago

38 comments

mrtimo 22 days ago
What I love about duckdb:
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in a multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- if the files don't have the same schema, union_by_name is amazing.
-- The .csv parser is amazing. Auto assigns types well.
-- It's small! The Web Assembly version is 2mb! The CLI is 16mb.
-- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical persons alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like Typescript transpiles to Javascript.
[-]
- skeeter2020 22 days ago
  >> The .csv parser is amazing
  Their csv support coupled with lots of functions and fast & easy iterative data discovery has totally changed how I approach investigation problems. I used to focus a significant amount of time on understanding the underlying schema of the problem space first, and often there really wasn't one - but you didn't find out easily. Now I start with pulling in data, writing exploratory queries to validate my assumptions, then cleaning & transforming data and creating new tables from that state; rinse and repeat. Aside from getting much deeper much quicker, you also hit dead ends sooner, saving a lot of otherwise wasted time.
  There's an interesting paper out there on how the CSV parser works, and some ideas for future enhancements. I couldn't seem to find it but maybe someone else can?
  [-]
  - tosh 22 days ago
    not a paper but I found this: https://duckdb.org/2025/04/16/duckdb-csv-pollock-benchmark
- HowardStark 21 days ago
  Been playing around with Clickhouse a lot recently and have had a great experience particularly because it hits many of these same points. In my case the "local files" hasn't been a huge fixture but the Parquet and JSON ingestion have been very convenient and I think CH intends for `clickhouse-local` to be some sort of analog to the "add duckdb" point.
  One of my favorite features is `SELECT ... FROM s3Cluster('<ch cluster>', 'https://...<s3 url>.../data//.json', ..., 'JSON')`[0] which lets you wildcard ingest from an S3 bucket and distributes the processing across nodes in your configured cluster. Also, I think it works with `schema_inference_mode` (mentioned below) though I haven't tried it. Very cool time for databases / DB tooling.
  (I actually wasn't familiar with `union_by_name` but it looks to be like Clickhouse has implemented that as well [1,2] Neat feature in either case!)
  [0] https://clickhouse.com/docs/sql-reference/table-functions/s3... [1] https://clickhouse.com/docs/interfaces/schema-inference [2] https://github.com/ClickHouse/ClickHouse/pull/55892
- oulipo2 22 days ago
  Malloy and PRQL (https://prql-lang.org/book/) are quite cool
- jorin 21 days ago
  I built Shaper following Malloy's idea of combining data queries and visualizations. But Shaper uses SQL instead of a custom language. It turns DuckDB into a dashboard builder for when you all you need is SQL.
  https://github.com/taleshape-com/shaper
- freakynit 21 days ago
  I built https://zenquery.app and have used duckdb internally to do all procssing. The speed is crazy, schema auto-detection works correctly (most of the times) and LLM's generate correct SQL's for given queries in plain english.
- arjie 22 days ago
  This is a great sell. I have this annoyingly manual approach with a SQLite import and so on. This is great. Thank you!
- exographicskip 22 days ago
  Thanks for the excellent comment! Now excuse me while I go export my spotify history to play around with duckdb <3
  [-]
  - mrtimo 21 days ago
    Spotify says it will take 30 days for the export... it really only takes about 48 hours if I remember correctly. While you wait for the download here is an example listening history exploration in malloy - I converted the listening history to .parquet: https://github.com/mrtimo/spotify-listening-history
- falconroar 21 days ago
  Polars has all of these benefits (to some degree), but also allows for larger-than-memory datasets.
  [-]
  - prometheon1 21 days ago
    DuckDB supports this as well, depending on which benchmark you look at it regularly performs better on those datasets than Polars.
- hk1337 22 days ago
  -- hive partitioning
- newusertoday 22 days ago
  its 32 mb uncompressed and around 6MB compressed, its not that small https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm/dist/
  it is also difficult to customize as compared to sqlite so for example if you want to use your own parser for csv than it becomes hard.
  But yes it provides lot of convenience out of the box as you have already listed.
steve_adams_86 22 days ago
It has become a favourite tool for me as well.
I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.
A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.
DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.
I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!
This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.
And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!
edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.
[-]
- anotherpaul 21 days ago
  What's the advantage over using Polars for the same task? It seems to me the natural competitor here and I vastly prefer the Polars syntax over SQL any day. So I was curious if I should try duckdb or stick with polars
  [-]
  - steve_adams_86 21 days ago
    Polars would be better in some ways for sure. It was in one of my early prototypes. What put me off was that I was essentially designing my own database which I didn't trust as much as something like DuckDB.
    Polars would let me have a lot of luxuries that are lost at the boundaries between my application and DuckDB, but those are weighed in the tradeoffs I was talking about. I do a lot of parsing at the boundaries to ensure data structures are sound, and otherwise DuckDB is enforcing strict schemas at runtime which provides as much safety as a dataset's schema requires. I do a lot of testing to ensure that I can trust how schemas are built and enforced as well.
    Things like foreign keys, expressions that span multiple tables effortlessly, normalization, check constraints, unique constraints, and primary keys work perfectly right off the shelf. It's kind of perfect because the spec I'm supporting is fundamentally about normalized relational data.
    Another consideration was that while Polars is a bit faster, we don't encounter datasets that require more speed. The largest dataset I've processed, including extensive transformations and complex validations (about as complex as they get in this spec), takes ~3 seconds for around 580k rows. That's on an M1 Max with 16GB of RAM, for what it's worth.
    Our teams have written countless R scripts to do the same work with less assurance that the outputs are correct, having to relearn the spec each time, and with much worse performance (these people are not developers). So, we're very happy with DuckDB's performance despite that Polars would probably let us do it faster.
    Having said that, if someone built the same project and chose Polars I wouldn't think they were wrong to do so. It's a great choice too, which is why your question is a good one.
  - microflash 21 days ago
    Familiarity with SQL is a plus in my opinion. Also, DuckDB has SDKs in more languages compared to Polars.
    [-]
    - steve_adams_86 21 days ago
      I wasn't all that excited about SQL at first, but I've come around to it. Initially I really wanted to keep all of my data and operations in the application layer, and I'd gone to great lengths to model that to make it possible. I had this vision of all types of operations, queries, and so on being totally type safe and kept in a code-based registry such that I could do things like provide a GUI on top of data and functions I knew were 100% valid an compile-time. The only major drawback was that some kinds of changes to the application would require updating the repository.
      I still love that idea but SQL turns out to be so battle-proven, reliable, flexible, capable, and well-documented that it's really hard to beat. After giving it a shot for a couple of weeks it became clear that it would yield a way more flexible and capable application. I'm confident enough that I can overcome the rough edges with the right abstractions and some polish over time.
  - falconroar 21 days ago
    Polars has all of the benefits of DuckDB (to some degree), but also allows for larger-than-memory datasets.
    [-]
    - steve_adams_86 21 days ago
      DuckDB has this capability as well: https://duckdb.org/docs/stable/guides/performance/how_to_tun...
      [-]
      - falconroar 21 days ago
        Interesting, I wasn't aware; thanks for that. I will say, Polars' implementation is much more centered on out-of-core processing, and bypasses some of DuckDB's limitations ("DuckDB cannot yet offload some complex intermediate aggregate states to disk"). Both incredible pieces of software.
        To expand on this, Polars' `LazyFrame` implementation allows for simple addition of new backends like GPU, streaming, and now distributed computing (though it's currently locked to a vendor). The DuckDB codebase just doesn't have this flexibility, though there are ways to get it to run on GPU using external software.
        [-]
        steve_adams_86 19 days ago
        Thanks for that insight as well! My needs don't tend to be so demanding so I've gotten away without knowing these details, but I suspect I the not-so-distant future this could be useful to know.
        Being able to use distributed backends to process frames sounds kind of incredible, but I can't imagine my little projects ever making use of it. Still, very cool stuff.
        noworriesnate 21 days ago
        Have you seen Ibis[1]? It's a dataframe API that translates calls to it into various backends, including Polars and DuckDB. I've messed around with it a little for cases where data engineering transforms had to use pyspark but I wanted to do exploratory analysis in an environment that didn't have pyspark.
        [1] https://ibis-project.org/
- geysersam 21 days ago
  What file formats are the existing datasets you have? I also work on data processing in a scientific domain where HDF5 is a common format. Unfortunately Duckdb doesn't support HDF5 out of the box, and the existing hdf5 extension wasn't fast enough and didn't have the features needed, so I made a new one based on the c++ extension template. I'd love to collaborate on it if anyone is interested.
  [-]
  - steve_adams_86 19 days ago
    That's really fascinating. Is your format open source? I don't know if I'd have overlapping needs for something like that (though I did investigate hdf5 early on, it seemed very promising as a place to store our outputs) but I'd be curious to explore it and see what you're doing with it.
    Right now we typically read from CSV or Excel, because that's what the scientists prefer to work with. For better or worse. There's a bit of parquet kicking around. The wrappers around handling imports for DuckDB are very, very thin. It handles just about everything seamlessly
    Is this the extension that was too slow? https://duckdb.org/community_extensions/extensions/hdf5
uwemaurer 22 days ago
We use DuckDB to process analytics and feeds for Bluesky (https://bluefacts.app)
To get fast access to the query results we use the Apache Arrow interface and generate the code directly from DuckDB SQL queries using the SQG tool ( https://sqg.dev/generators/java-duckdb-arrow/)
[-]
- udkl 22 days ago
  I'm very interested to see a behind the scenes of your tech - do you guys have a writeup somewhere? You HN tool is also promising!

100% agree.

> Writing SQL code

Language integration is paramount for med/lg projects. There's an experimental Java lang project, manifold-sql [1], that does the impossible: inline native DuckDB SQL + type-safety.

    """
    [.sql/] SELECT station_name, count(*) AS num_services
      FROM 'http://blobs.duckdb.org/train_services.parquet'
      WHERE monthname(date) = 'May'
      GROUP BY ALL
      ORDER BY num_services DESC
      LIMIT 3
    """
    .fetch()
    .forEach(row -> out.println(row.stationName + ": " + row.numServices));

1. https://github.com/manifold-systems/manifold/blob/master/doc...

lz400 21 days ago
I do a lot of data processing and my tool of choice is polars. It's blazing fast and has (like pandas) a lot of very useful functions that aren't in SQL or are awkward to emulate in SQL. I can also just do Python functions if I want something that's not offered.
Please sell DuckDB to me. I don't know it very well but my (possibly wrong) intuition is that even giving equal performance, it's going to drop me to the awkwardness of SQL for data processing.
[-]
- bovinejoni 21 days ago
  I could anecdotally tell you it’s significantly faster and more concise for my workloads, but it’s a standalone executable so just try it out and benchmark for your use case. It doesn’t require fine tuning or even a learning curve
film42 22 days ago
Just 10 minutes ago I was working with a very large semi-malformed excel file generated by a mainframe. DuckDB was able to load it with all_varchar (just keep everything a string) in under a second.
I'm still waiting for Excel to load the file.
majkinetor 22 days ago
Anybody with experience in using duckdb to quickly select page of filtered transactions from the single table having a couple of billions of records and let's say 30 columns where each can be filtered using simple WHERE clausule? Lets say 10 years of payment order data. I am wondering since this is not analytical scenario.
Doing that in postgres takes some time, and even simple count(*) takes a lot of time (with all columns indexed)
[-]
- morganherlocker 22 days ago
  I've used duckdb a lot at this scale, and I would not expect something like this to take more than a few seconds, if that. The only slow duckdb queries I have encountered either involve complex joins or glob across many files.
- cess11 22 days ago
  I'm not so sure the common index algorithms would work to speed up a count. How often is the table updated? If it's often, and it's also queried often for the count, then run the count somewhat often on a schedule and store the result separately, and if it isn't queried often, do it more seldom.
  From what you describe I'd expect a list of column-value pairs under a WHERE to resolve pretty fast if it uses indices and don't fish out large amounts of data at once.
DangitBobby 22 days ago
Being able to use SQL on CSV and json/jsonl files is pretty sweet. Of course it does much more than that, but that's what I do most often with it. Love duckdb.
[-]
- samuell 22 days ago
  Indeed! I generally like awk a lot for simpler CSV/TSV processing, but when it comes to cases where you need things like combining/joining multiple CSV files or aggregating for certain columns, SQL really shines IME.
efromvt 22 days ago
DuckDB can't get enough love! Beyond being a really performant database, being so friendly [sql and devx] is really underrated and hard to pull off well and is key to it being so fun - you get a compounding ecosystem because it's so easy to get started. Hoping that they can manage to keep it vibrant without it slowing down the pace of innovation at all.
The web/WASM integration is also fabulous. Looking forward to more "small engines" getting into that space to provide some competition and keep pushing it forward.
smithclay 22 days ago
Agree with the author, will add: duckdb is an extremely compelling choice if you’re a developer and want to embed analytics in your app (which can also run in a web browser with wasm!)
Think this opens up a lot of interesting possibilities like more powerful analytics notebooks like marimo (https://marimo.io/) … and that’s just one example of many.
[-]
- dimitrieh 21 days ago
  100%
  We recently created a survey website for the community survey results for Node-RED making it completely dynamic and segment-able. Creates lots of value and allows everyone to look at the data through their own lens. It uses DuckDB with WASM under the hood. Awesome technologies.
  https://nodered.org/community-survey/
- canadiantim 22 days ago
  The wasm is pretty heavy data-wise tho, I’m hoping eventually it’ll be lighter for easier loading on not so good devices.
  [-]
  - efromvt 22 days ago
    I'd really love a minimalist version, I'm not sure how small it's feasible for them to shrink it. As long as it doesn't get bigger and devices keep getting faster, I suppose?
willtemperley 21 days ago
It's a great tool, but the dynamic loading of extensions in DuckDB makes working with code signing very difficult. Second, the spatial extension uses LGPL components which adds another headache for commercial apps.
As such, it's not readily usable as a library, or set of libraries. I really prefer Apache's approach to analytics where it's possible to pick and choose the parts you need, and integrate them with standard package maangers.
Need GB/S arrays over HTTP? Use Arrow Flight. Want to share self-describing structured arrays with files? Use Arrow IPC. Need to read Parquet? Add that package trait.
Another potential issue with DuckDB is the typing at the SQL interface.
Arrow allows direct access to primitive arrays, but DuckDB uses a slightly different type system at the SQL interface. Even small differences in type systems can lead to combinatoric type explosion. This is more a critiscm of SQL interfaces than DuckDB however.
Additionally Arrow has native libraries in most mainstream languages.
morgul 15 days ago
I did large-scale molecular data engineering with DuckDB and Polars (https://github.com/scikit-fingerprints/MolPILE_dataset, https://arxiv.org/abs/2509.18353 for those interested). Both were amazing, but for large-scale JOINs, only DuckDB didn't result in OOM (tried Dask, DuckDB, Polars). Really a pleasure to use too.
I could also JOIN local CSV datasets, Postgres database, and even Excel files from chemists. All of this in Jupyter Notebook and really seamless Python integration. This also means that one can easily do heavy lifting in DuckDB, export to Polars variable, put into Plotly, and get an interactive plot. Neat stuff.
morgul 15 days ago
I did large-scale molecular data engineering with DuckDB and Polars (https://github.com/scikit-fingerprints/MolPILE_dataset, https://arxiv.org/abs/2509.18353 for those interested). Both were amazing, but for large-scale JOINs, only DuckDB didn't result in OOM. Really a pleasure to use too.
I could also JOIN local CSV datasets, Postgres database, and even Excel files from chemists. All of this in Jupyter Notebook and really seamless Python integration. This also means that one can easily do heavy lifting in DuckDB, export to Polars variable, put into Plotly, and get an interactive plot. Neat stuff.
netcraft 21 days ago
I love duckdb, I use it as much as I can. I just wish that the support for node/bun was as good as python. And I wish that they would bundle it differently for node/bun - the way it is now it depends on a dynamic link to a library which means I cant bundle it into a bun executable.
[-]
- benpacker 21 days ago
  I get segfaults all the time when using it with Bun and the new Node neo api. What’s your recipe, are you not getting these?
tjchear 22 days ago
I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?
[-]
- RobinL 22 days ago
  It is true that for json and csv you need a full scan but there are several mitigations.
  The first is simply that it's fast - for example, DuckDB has one of the best csv readers around, and it's parallelised.
  Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
  In addition, engines like DuckDB have predicate pushdown so if your data is stored in parquet format, then you do not need to scan every row because the parquet files themselves hold metadata about the values contained within the file.
  Finally, when data is stored in formats like parquet, it's a columnar format, so it only needs to scan the data in that column, rather than needing to process the whole row even though you may be only interested in one or two columns
- ayhanfuat 22 days ago
  If you are going to query it frequently then json/csv might become an issue. I think the reason it doesn't become a problem for duckdb/polars users is that we generally convert them to parquet after first read.
- biophysboy 22 days ago
  Zonemaps are created for columns automatically. I process somewhat large tables w/ duckdb regularly (100M rows) and never have any problems.
  [-]
  - riku_iki 22 days ago
    that's true for duckdb native tables, but the question was about json.
    [-]
    - biophysboy 22 days ago
      They said json and csv - it handles both!
      [-]
      - riku_iki 22 days ago
        handles depends on size. But I tried to say there is no zonemaps for json.
- mr_toad 21 days ago
  Many analytical queries require full scans of fact tables anyway, so indexes are less useful. Joins are usually to dimensional tables, which are quite small. Snowflake doesn’t use indexes at all, and it’s built for handling the largest volumes of data.
  However, you wouldn’t want to use either for transaction processing, the lack of indexes would really hurt.
- simlevesque 22 days ago
  But when indexing your json or csv, if you have say 10 rows, each row is separated on your disk instead of all together. So a scan for one columb only needs to read a tenth of the disk space used for the data. Obviously this depends on the columns' content.
  [-]
  - gdulli 22 days ago
    But you can have a surprisingly large amount of data before the inefficiency you're talking about becomes untenable.
- akhundelar 22 days ago
  Not a duckdb user, but I use polars a lot (mentioned in the article).
  Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.
- 0cf8612b2e1e 22 days ago
  Visidata is in Python and has offered “real time” analytics of fixed files for a long time. Computers are stupidly fast. You can do a lot of operations within a few seconds time window.
- mpalmer 22 days ago
  I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.
oulu2006 22 days ago
That's really interesting, I love the idea of being able to use columnar support directly within postgresql.
I was thinking of using Citus for this, but possibly using duckdb is a better way to do. Citus comes with a lot more out of the box but duckdb could be a good stepping stone.
[-]
- biophysboy 22 days ago
  Its a really handy tool. I've queried basically everything you can w/ duckdb - csv, json, s3 buckets, MS SQL servers, excel sheets, pandas dataframes, etc - and have had very few issues.
yakkomajuri 22 days ago
Been quite a fan of DuckDB and we actually even use it in production.
But coincidentally today I was exploring memory usage and I believe I'm finding memory leaks. Anybody have similar experiences?
Still debugging more deeply but looking reasonably conclusive atm.
[-]
- yakkomajuri 22 days ago
  As far as I can tell pretty conclusive results:
  https://github.com/duckdb/duckdb/issues/20569
  If someone can explain this or has a fix for it I'd love to hear it!
  [-]
  - fauigerzigerk 20 days ago
    You shouldn't expect a database system to release memory back to the OS just because it's no longer needed for some specific purpose.
    DuckDB has a memory_limit setting with a default of 80% of RAM. If you want to set a lower limit you can do something like
    SET memory_limit = '1GB';
    https://duckdb.org/2024/07/09/memory-management
biophysboy 22 days ago
I think my favorite part of duckdb is its flexibility. Its such a handly little swiss army knife for doing analytical processing in scientific environments (messy data w/ many formats).
s-a-p 22 days ago
"making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data" > I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduce DuckLake.
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up s3 + Iceberg + duckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.
[-]
- nchagnet 22 days ago
  We're using ducklake with data storage on Google cloud storage and the catalog inside a postgres database and it's a breeze! It may not be the most mature product, but it's definitely a good setup for small to medium applications which still require a data lake.
- biophysboy 22 days ago
  DuckLake is pretty new, so I guess it would depend on if you need a more mature, fully-featured app.
  [-]
  - pattar 22 days ago
    I went to a talk by the Motherduck team about why they built DuckLake instead of leaning more in on Iceberg. The key takeaway is that instead of storing all the table metadata inside files on s3 and dealing with latency and file io they instead store all of that info inside a duckdb table. Seems like a good idea and worked smoothly when I tried, however it is not quite in a stable production state it is still <1.0. They have a nice talk about it on youtube: https://youtu.be/hrTjvvwhHEQ?si=WaT-rclQHBxnc9qV
    [-]
    - willvarfar 22 days ago
      (I work a lot with BigQuery's BigLake adaptor and it's basically caching the metadata of the iceberg manifest and parquet footers in Bigtable (this is Google) so query planning is super fast etc. Really helps)
nylonstrung 21 days ago
DuckDB is awesome.
If you want it's power as a query engine but like to write python instead of SQL, I highly recommend using it as a backend for the Ibis dataframe library
It let's you interchange pythonic dataframe syntax (like Pandas and Polars) with SQL that 'compile' down to SQL in DuckDB dialect
And you can use those queries interchangably in postgres, sqlite, polars, spark, etc
https://ibis-project.org/
davidtm 22 days ago
It is the darling of laptop-based analysis. I’m always surprised that duckdb doesn’t get used more to enhance existing, and stubbornly bedded-in data pipelines. We use it in browser and on server at Count (https://count.co) to make data warehouses more responsive for multiuser and agentic work.
PLenz 22 days ago
Duckdb is wonderful. I have several multi-TB pipelines that I moved over from spark and dask. It's so much easier to think in terms of knew machine that gets it's resources used efficiently instead of distributed/cloud processes. It even let me take some big data things back to on-prem from aws.
countrymile 22 days ago
It's worth noting that R has a great duckdb API as well. Saved me a lot of time when dealing with a 29GB CSV file and splitting it into separate parquet files on a low RAM ONS server a few months back.
https://r.duckdb.org/
microflash 21 days ago
I've been using DuckDB to process massive Excel files. Despite weird quirks [1], it has been great experience so far. I now use it to process CSV, JSON, Parquet files. It is very fast, and extremely approachable, thanks to SQL being the language for interaction.
[1]: https://github.com/duckdb/duckdb-excel/issues/76
wswin 21 days ago
Comparison to Spark in terms of speed and simplicity is pretty unfair since Spark is designed for multi-node use and horizontal scaling isn't going anywhere.
[-]
- potamic 21 days ago
  What is duckdb better compared to? What were people using before it entered the scene?
tobilg 22 days ago
Built with DuckDB-Wasm: SQL Terminal in the browser
https://terminal.sql-workbench.com
red2awn 22 days ago
It was mentioned that the performance of DuckDB is similar to that of Polars (among others). In that case why would one choose DuckDB over Polars? The only differentiator seems to be that you do the querying with standard SQL instead of the library specific APIs.
[-]
- dkdcio 22 days ago
  - performance is often better, especially on “out of core” (“streaming”, spill to disk data sizes). Polars has done a ton of work on their streaming engine but they’re still catching up
  - you don’t need to use Python (but Pythonic wrappers like Ibis exist; disclaimer I worked on Ibis, you can find my blogs on performance comparisons easily there); CLI, WASM, etc. w/o Python
  - governance: DuckDB as OSS is setup in a more sustainable way (DuckDB Labs + DuckDB Foundation). while there is a VC-backed company (MotherDuck), it doesn’t employ the primary developers/control the project in the same way the Polars company does
  - overall just simplicity and focus. tends to break less, solely focused on single-node, easy to extend, etc. — not trying to do a cloud product, distributing computing, supporting GPU execution
fastasucan 20 days ago
I dont get the either/or posts here about polars. Some times I do duckdb sql queries against a polars df, some times I convert a duckdb result to polars. Its great to use whatever is best for the situation.
[-]
- RobinL 20 days ago
  I think that's totally fine for individual work, but in larger data engineering teams it's less good to switch between tools because other people may have to maintain your code.
  That said, polars is good, and if the team agree to standardise on it then that's a totally reasonable choice.
  I guess one of my reservations is I've been (historically) burned by decisions within data eng teams to use pandas, causing all sorts of problems with data typing and memory and eventually having to rewrite it all. But I accept polars doesn't suffer from the same problems (and actually some of them are even mitigated in more recent versions of pandas)
mastermage 20 days ago
I love duckdb for a way different reason than most. It has a much more expressive type system compared to Sqlite. Like Time/Date also Unions. Its like a mini postgress.
clumsysmurf 22 days ago
DuckDB has experimental builds for Android ... I'm wondering how much work it would take to implement a Java API for it similar to sqlite (Cursor, etc).
[-]
- jtbaker 22 days ago
  Something different than https://duckdb.org/docs/stable/clients/java?
  [-]
  - clumsysmurf 21 days ago
    Android doesn't use JDBC.
n_u 21 days ago
Question for folks in data science / ML space: Has DuckDB been replacing Pandas and NumPy for basic data processing?
falconroar 21 days ago
Polars has all of these benefits (to some degree), but also allows for larger-than-memory datasets.
fsdfasdsfadfasd 21 days ago
ctrl + f clickhouse doesn't return anything, surprising
[-]
- Onavo 21 days ago
  Clickhouse chDB is excellent, much less buggy than DuckDB
  https://clickhouse.com/chdb
rustyconover 22 days ago
DuckDB is awesome and Robin is too!
vivzkestrel 21 days ago
stupid question:
- what is wrong with postgresql for doing this?
[-]
- jorin 21 days ago
  postgres (without extensions) is slower for analytical queries since it stores data as rows, not columns. Also, duckdb is an in-memory database so it's more comparable to sqlite than postgres.
  [-]
  - vivzkestrel 20 days ago
    well what was wrong with apache cassandra? isnt that a columnar database too?
noo_u 22 days ago
I'd say the author's thoughts are valid for basic data processing. Outside of that, most of claims in this article, such as:
"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."
become very debatable. Depending on how you want to pivot/ scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
[-]
- RobinL 22 days ago
  Author here. Re: 'SQL should be the first option considered', there are certainly advantages to other dataframe APIs like pandas or polars, and arguably any one is better in the moment than SQL. At the moment Polars is ascendent and it's a high quality API.
  But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API.
  I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are no where big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.
  I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.
  [-]
  - chaps 22 days ago
    Yeah, my experiences match yours and I very, very much work with messy data (FOIA data), though I use postgres instead of duckdb.
    Most of the datasets I work with are indeed <10GB but the ones that are much larger follow the same ETL and analysis flows. It helps that I've built a lot of tooling to help with types and memory-efficient inserts. Having to rewrite pipelines because of "that one dataframe API" is exactly what solidified my thoughts around SQL over everything else. So much of my life time has been lost trying to get dataframe and non-dataframe libraries to work together.
    Thing about SQL is that it can be taken just about anywhere, so the time spent improving your SQL skills is almost always well worth it. R and pandas much less so.
    [-]
    - gofreddygo 22 days ago
      I advocated for a SQL solution at work this week and it seems to have worked. My boss is wary of the old school style SQL databases with their constraints and just being a pain to work with. As a developer, these pains aren't too hard to get used to or automate or document away for me and never understood the undue dislike of sql.
      The fact that I can use sqlite / local sql db for all kinds of development and reliably use the same code (with minor updates) in the cloud hosted solution is such a huge benefit that it undermines anything else that any other solution has to offer. I'm excited about the sql stuff I learned over 10 years ago being of of great use to me in the coming months.
      The last product I worked heavily on used a nosql database and it worked fine till you start tweak it just a little bit - split entities, convert data types or update ids. Most of the data access layer logic dealt with conversion between data coming in from the database and the guardrails to keep the data integrity in check while interacting with the application models. To me this is something so obviously solved years ago with a few lines of constraints. Moving over to sql was totally impossible. Learned my lesson, advocated hard for sql. Hoping for better outcomes.
      [-]
      - chaps 22 days ago
        I totally understand the apprehension towards SQL. It's esoteric as all hell and it tends to be managed by DBAs who you really only go to whenever there's a problem. Organizationally it's simpler to just do stuff in-memory without having to fiddle around with databases.
  - noo_u 22 days ago
    I think both of us are ultimately wary of using the wrong tool for the job.
    I see your point, even though my experience has been somewhat the opposite. E.g. a pipeline that used to work fast enough/at all up until some point in time because the scale of the data or requirements allowed it. Then some subset of these conditions changes, the pipeline cannot meet them, and one has to reverse engineer obscure SQL views/stored procedures/plugins, and migrate the whole thing to python or some compiled language.
    I work with high density signal data now, and my SQL knowledge occupies the "temporary solution" part of my brain for the most part.
- __mharrison__ 21 days ago
  My experience is that if you want to do ML, viz, or advanced analytics, dataframes give a better experience.
  If you are shuffling data around in pipelines, sure, go for SQL.
  Readability is in the eye of the beholder. I much prefer dataframes for that, though a good chunk of the internet claims to throw up in their mouths upon seeing it...
- hnthrowaway0315 22 days ago
  SQL is popular because everyone can learn and start using it after a while. I agree that Python sometimes is a better tool but I don't see SQL going away anytime.
  From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.
- alastairr 22 days ago
  I'm running duckdb over 500gb of parquet on a largish desktop (50gb ram) and it's been smooth & fast. I guess OOM issues will matter at some point, but I think it's going to be in the top 1% of real world use cases.
  [-]
  - jauntywundrkind 22 days ago
    With a decent SSD (or eight), spilling to disk is really not bad these days! Yes!
    And if that's still not enough, if you just need to crunch data a couple times a week, it's not unreasonable to get a massive massive cloud box with ridiculous amounts of ram or ram+SSD. I7i or i8g boxes. Alas, we have cheap older gen epycs & some amazing cheap motherboards but RAM prices to DIY are off the charts unbelievable, but so be it.
- falconroar 21 days ago
  Polars also has all of these benefits (to some degree), but also allows for larger-than-memory datasets. Also has GPU backend, distributed backend, etc. Polars is heavily underrated, even with the recent hype.
  [-]
  - camgunz 21 days ago
    I downvoted all your recs for polars, 1 because this is a DuckDB thread and it's low-key rude, and 2 because there are 4 of them. I wouldn't have minded if there were a single post that were like "DuckDB is cool, polars could be an alternative if..."
- physicsguy 22 days ago
  You can get 32TiB of RAM instances on AWS these days
  [-]
  - jeffbee 22 days ago
    That sounds damned near useless for typical data analysis purposes and I would very much prefer a distributed system to a system that would take an hour to fill main memory over its tiny network port. Also, those cost $400/hr and are specifically designed for businesses where they have backed themselves into a corner of needing to run a huge SAP HANA instance. I doubt they would even sell you one before you prove you have an SAP license.
    For a tiny fraction of the cost you can get numerous nodes with 600gbps ethernet ports that can fill their memory in seconds.
    [-]
    - akvadrako 21 days ago
      Seems they come with 200gbit ports so it takes 20 minutes to fill memory.
  - mr_toad 21 days ago
    Which is a lot for a single user, but when you have a dozens or hundreds of analysts who all want to run their own jobs on your hundred terabyte data warehouse then even the largest single machine wont cut it.
  - RobinL 22 days ago
    Exactly - these huge machines are surely eating a lot into the need for distributed systems like Spark. So much less of a headache to run as well
- theLiminator 22 days ago
  Yeah, i'm also similarly confused.
  > "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
  SQL has nothing to do with fast. Not sure what makes it any more testable than polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?
  [-]
  - wood_spirit 22 days ago
    I’m also a duckdb convert. All my notebooks have moved from Pandas and polars to Duckdb. It is faster to write and faster to read (after you return to a notebook after time away) and often faster to run. Certainly not slower to run.
    My current habit is to suck down big datasets to parquet shards and then just query them with a wildcard in duckdb. I move to bigquery when doing true “big data” but a few GB of extract from BQ to a notebook VM disk and duckdb is super ergonomic and performant most of the time.
    It’s the sql that I like. Being a veteran of when the world went mad for nosql it is just so nice to experience the revenge of sql.
    [-]
    - theLiminator 22 days ago
      I personally find polars easier to read/write than sql. Especially when you start doing UDFs with numpy/et. al. I think for me, duckdb's clear edge is the cli experience.
      > It is faster to write and faster to read
      At least on clickbench, polars and duckdb are roughly comparable (with polars edging out duckdb).
      [-]
      - erikcw 21 days ago
        I use them both depending on which feels more natural for the task, often within the same project. The interop is easy and very high performance thanks to Apache Arrow: `df = duckdb.sql(sql).pl()` and `result = duckdb.sql("SELECT * FROM df")`.
    - exagolo 19 days ago
      For the bigger tasks, Exasol might also be a very neat option for you. We have a free personal edition that can scale regarding data volumes, #servers (MPP architecture) and complex workloads.
      Recently, we have also compared ourselves against DuckDB and were 4 times faster even on a single node. We are in-memory optimized, but data doesn't need to fit in the RAM.
      Disclaimer: I'm CTO@Exasol
  - RobinL 22 days ago
    Author here. I wouldn't argue SQL or duckdb is _more_ testable than polars. But I think historically people have criticised SQL as being hard to test. Duckdb changes that.
    I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelization for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.
    In terms of future proof - yes in the sense that the code will still be easy to run in 20 years time.
    [-]
    - theLiminator 22 days ago
      > I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.
      Yeah, but SQL isn't really portable between query all query engines. You always have to be speaking the same dialect. Also, SQL isn't the only "declarative" dsl, polars's lazyframe api is similarly declarative. Technically Ibis's dataframe dsl also works as a multi-frontend declarative query language. Or even substrait.
      Anways my point is that SQL is not inherently a faster paradigm than "dataframes", but that you're conflating declarative query planning with SQL.