Flash-KMeans: Fast and Memory-Efficient Exact K-Means

(arxiv.org)

142 points | by matt_d 3 days ago

6 comments

  • frakt0x90 5 hours ago
    They created this in service of their video generation model which "clusters and reorders tokens based on semantic similarity using k-means.":

    http://arxiv.org/pdf/2505.18875

  • jacquesm 3 hours ago
    Nice one. K-Means is one of those neat little powertools that once you get the hang of it you find more and more applications for, but it can be a bit slow for larger data sets. So this is very nice to have, thank you matt_d for posting.
  • leecarraher 4 hours ago
    Do they mean deterministic k-means, k-means++ ... ? Global optimal k-means is NP-Hard, so linear speedups aren't terribly helpful. It's nice, until you add more input. Standard k-means would be nice, or the k-means++ seed algorithm.
    • n4r9 12 minutes ago
      The abstract suggests they're proposing speed-up techniques for the assignment and centroid update stages of the classic k-means algorithm. Which would therefore also apply to k-means++.
    • jmalicki 4 hours ago
      Kmeans++ is just a seed, this is the inner loop.

      Also analogous to flash attention, a linear speedup in big O sense based on the typical algorithmoc complexity computing model can be a polynomial speedup in measured wall clock time due to memory hierarchy differences.

      Still small compared to exponential differences, but for an NP-Hard problem, a linear 100x speedup is the difference between practically computable vs. not. There are a ton of things I'd wait 2 hours for that I wouldn't wait a week for.

  • wood_spirit 7 hours ago
    Does this have corresponding speed ups or memory gains for normal CPUs too? Just thinking about all the cups of coffee that have been made and drunk while scikit-learn kmeans chugs through a notebook :)
    • snovv_crash 7 hours ago
      For CPU with bigger K you would put the centroids in a search tree, so take advantage of the sparsity, while a GPU would calculate the full NxK distance matrix. So from my understanding the bottleneck they are fixing doesn't show up on CPU.
      • xavxav 7 hours ago
        search trees tend not to scale well to higher dimensions though, right?

        from what I've seen I had the impression that Yinyang k-means was the best way to take advantage of the sparsity.

        • snovv_crash 2 hours ago
          Most data I've used is for geospatial with D<=4 (xyzt) so for me search trees worked great. But for things like descriptor or embedding clustering yes, trees wouldn't be useful.
    • openclaw01 6 hours ago
      [dead]
  • matrix2596 8 hours ago
    looks like flash attention concepts applied to kmeans, nice speedup results
  • maiconburn 5 hours ago
    [dead]