Open sourcing Dicer: Databricks's auto-sharder

(databricks.com)

50 points | by vivek-jain 3 hours ago

4 comments

  • khaki54 1 hour ago
    Seems weird to call it sharding since it's not sharding indexed datasets or anything like that. Is this just a tool to mitigate Databricks’ internal service-scaling challenges?
    • atuladya 7 minutes ago
      Right, this is not about sharding data or datasets. It is for sharding the in-memory state that a service might hold. Building services with low cost, high scale, low latency, and high throughput is a common problem in many environments, including our services at Databricks, and Dicer helps with that.
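
      To make "sharding in-memory state" concrete, here is a minimal sketch. It is not Dicer's API: `RangeAssignment`, `handle_request`, and the pod names are hypothetical, and the even split of a 64-bit hash space is an assumption chosen for illustration. Each key is hashed into a range owned by exactly one service instance, so that instance can keep the key's state in memory.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2**64  # assumption: keys are hashed into a 64-bit space


def key_hash(key: str) -> int:
    """Map a key to a stable integer in [0, HASH_SPACE)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class RangeAssignment:
    """Hypothetical assignment: each instance owns one contiguous hash range."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self.instances = list(instances)
        n = len(self.instances)
        # starts[i] is the first hash value owned by instances[i].
        self.starts = [i * HASH_SPACE // n for i in range(n)]

    def owner(self, key: str) -> str:
        # Find the last range whose start is <= the key's hash.
        idx = bisect_right(self.starts, key_hash(key)) - 1
        return self.instances[idx]


def handle_request(assignment, state_by_instance, key):
    """Increment a per-key counter held in memory by the key's owner."""
    owner = assignment.owner(key)
    counters = state_by_instance.setdefault(owner, {})
    counters[key] = counters.get(key, 0) + 1
    return owner


assignment = RangeAssignment(["pod-0", "pod-1", "pod-2"])
state = {}
for k in ["user:1", "user:2", "user:1", "user:3"]:
    handle_request(assignment, state, k)
```

      Because every request for a key lands on the same instance, that key's counter lives in one place. An auto-sharder would additionally move ranges between instances as load shifts, which this sketch leaves out.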
  • ayf 3 hours ago
    Does anyone else have something similar?

    What are some use cases that you have found useful?

    • louis-paul 53 minutes ago
      • atuladya 9 minutes ago
        It is similar to Slicer in terms of the abstraction (I built Slicer at Google), but the architecture, implementation, and algorithms have a lot of differences.
    • WookieRushing 30 minutes ago
      These show up once you reach a certain scale, where things are either cost-inefficient or the hot spots are very dynamic. They also try to avoid added latency by running as eventually consistent sidecars instead of proxies.

      I’ve seen them used for traffic routing, storage system metadata, distributed caches, etc.
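
      A rough sketch of the sidecar idea, under stated assumptions: `AssignmentCache` and its version numbers are hypothetical and not taken from any particular system. The caller keeps a local, possibly stale copy of the key-to-instance table and routes from it directly, so a lookup is a local read rather than an extra hop through a proxy.

```python
from dataclasses import dataclass, field


@dataclass
class AssignmentCache:
    """Hypothetical local copy of a key -> instance table; may lag the source."""

    version: int = 0
    owners: dict = field(default_factory=dict)

    def apply_update(self, version: int, owners: dict) -> bool:
        """Accept an update only if it is newer than the one held."""
        if version <= self.version:
            return False  # stale or duplicate update, ignore it
        self.version = version
        self.owners = dict(owners)
        return True

    def route(self, key: str, default: str) -> str:
        """Pick a destination from local state, with no network call."""
        return self.owners.get(key, default)


cache = AssignmentCache()
cache.apply_update(2, {"user:1": "pod-1"})
cache.apply_update(1, {"user:1": "pod-0"})  # arrives late, so it is ignored
destination = cache.route("user:1", default="pod-0")
```

      The trade-off is that a caller may briefly route on an old table, so the receiving instance has to tolerate requests for keys it no longer owns.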

    • vivek-jain 3 hours ago
      Sharded in-memory caching turns out to be rather useful at scale :)

      Some of the key examples highlighted on our blog are Unity Catalog, which is essentially the metadata layer for Databricks, our Query Orchestration Engine, and our distributed remote cache. See the blog post for more!
