ONNX Runtime and CoreML May Silently Convert Your Model to FP16

(ym2132.github.io)

72 points | by Two_hands 11 hours ago

4 comments

  • smcleod 5 hours ago
    This was an interesting read, thanks for sharing. I've recently been building something that uses Parakeet v2/v3 models. I'm using the parakeet-rs package (https://github.com/altunenes/parakeet-rs), which has had a few issues running models with CoreML (unrelated to the linked post), e.g. https://github.com/microsoft/onnxruntime/issues/26355
    • Two_hands 2 hours ago
      Thank you for reading.

      Also, generally I think CoreML isn't the best. The best solution for ORT would probably be to introduce a pure MPS provider (https://github.com/microsoft/onnxruntime/issues/21271), but given they've already bought into CoreML the effort may not be worth the reward for the core team. Which is fair enough, as it's a pretty mammoth task.

      • pzo 1 hour ago
        However, one benefit of CoreML is that it is the only way for third parties to execute on the ANE (Apple Neural Engine, aka NPU). For some models the ANE can execute even faster than GPU/MPS and consume even less battery.

        But I agree CoreML in ONNX Runtime is not perfect. Most of the time when I tested models, the graph was split into too many partitions and the whole thing ran slower than the same model in native CoreML format.

        • Two_hands 1 hour ago
          To be honest it's a shame the whole thing is closed up. I guess it's to be expected from Apple, but I reckon CoreML would benefit a lot from at least exposing the internals/allowing users to define new ops.

          Also, the ANE only allows some operators to be run on it, right? There's very little transparency/control over what can and cannot be offloaded to it, which makes using it difficult.

  • trashtensor 6 hours ago
    if you double click the coreml file on a mac and open it in xcode there is a profiler you can run. the profiler will show you the operations it's using and what the bit depth is.
    • Two_hands 2 hours ago
      cheers for the tip, I'll give it a go
  • yousifa 5 hours ago
    On the coreml side this is likely because the neural engine supports fp16, and offloading some/all layers to the ANE significantly reduces inference time and power usage when running models. You can inspect the model in the Xcode profiler to see what is running on each part of the device and at what precision.
    • Two_hands 2 hours ago
      Yeah I can see why they let it be that way, but the fact it is pretty undefined is what bugged me. I suppose it depends on what your goals are - efficiency vs reproducibility.

      Also I did run a test of FP16 vs FP32 for a large matmul on the Apple GPU and the FP16 calculation was 1.28x faster so it makes sense that they'd go for FP16 as a default.
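
To make the precision side of that trade-off concrete, here is a minimal sketch of what rounding to FP16 does to individual values. It uses only Python's standard `struct` module (format code `e` is half precision, `f` is single precision). The helper names `to_fp16` and `to_fp32` and the sample values are made up for illustration, and this says nothing about how CoreML or ONNX Runtime perform the conversion internally.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical sample values, chosen only to show the effect.
for value in (0.1, 1000.1, 2049.0):
    fp32, fp16 = to_fp32(value), to_fp16(value)
    print(
        f"{value!r}: fp32 error={abs(fp32 - value):.3e}, "
        f"fp16 error={abs(fp16 - value):.3e}"
    )
```

FP16 keeps an 11-bit significand, so not every integer above 2048 is representable: `to_fp16(2049.0)` comes back as `2048.0`, while FP32 stores 2049.0 exactly.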

  • DiabloD3 10 hours ago
    [flagged]
    • noosphr 7 hours ago
      While this is a bit too harsh - and the solution is naive at best - the problem is real.

      The idea of bitwise reproducibility for floating point computations is completely laughable in any part of the DL landscape. Meanwhile in just about every other area that uses fp computation it's been the de facto standard for decades.

      From NVIDIA not guaranteeing bitwise reproducibility even on the same GPU: https://docs.nvidia.com/deeplearning/cudnn/backend/v9.17.0/d...

      To frameworks somehow being even worse. Where the best you can do is order the frameworks in terms of how bad they are - with tensorflow being far down at the bottom and jax being (currently) at the top - and try to use the best one.

      This is a huge issue to anyone serious about developing novel models and I see no one talking about it, let alone trying to solve it.
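
As background (the comment above does not spell this out): one common reason results differ between runs is that floating-point addition is not associative, so a reduction split across a different number of workers can round differently. A minimal pure-Python sketch of the effect follows. The `naive_sum` and `chunked_sum` helpers and the data are hypothetical, and this is a toy model of a parallel reduction, not any framework's actual kernel.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation, written as a loop so the order is explicit."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Sum each chunk separately, then add the partial sums,
    loosely mimicking a parallel reduction over num_chunks workers."""
    size = math.ceil(len(values) / num_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Hypothetical data; the exact sum is 2.0.
values = [1e17, 1.0, -1e17, 1.0]

for n in (1, 2, 4):
    print(n, chunked_sum(values, n))
```

With this input the one-chunk reduction returns 1.0 and the two-chunk reduction returns 0.0, and neither matches the exact sum of 2.0. The same data and the same arithmetic give different bits purely because of how the work was grouped.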

      • arthur2e5 7 hours ago
        > Meanwhile in just about every other area that uses fp computation it's been the de facto standard for decades.

        Not that strongly for more parallel things, quite similar to the situation with atomics on cuDNN. cuBLAS for example has a similar issue with multi-stream handling, though this can be overcome with a proper workspace allocation: https://docs.nvidia.com/cuda/cublas/index.html?highlight=Rep....

        Still better than cuDNN where some operations just don't have a reproducible version though. The other fields are at least trying. DL doesn't seem to be.

        On that note Intel added reproducible BLAS to oneMKL on CPU and GPU last year. https://www.intel.com/content/www/us/en/developer/archive/tr...
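
For intuition about what a reproducible reduction buys you, here is a toy sketch in Python. It is not how oneMKL (or any BLAS) implements reproducibility. It only contrasts an order-sensitive accumulation loop with the standard library's `math.fsum`, which on typical IEEE 754 platforms returns the correctly rounded exact sum and so does not depend on element order. The data and the `naive_sum` helper are illustrative assumptions.

```python
import itertools
import math


def naive_sum(values):
    """Order-sensitive left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical data; the exact sum is 2.0.
values = [1e17, 1.0, -1e17, 1.0]

# Collect the distinct results seen over every ordering of the inputs.
naive_results = {naive_sum(p) for p in itertools.permutations(values)}
exact_results = {math.fsum(p) for p in itertools.permutations(values)}

print("naive loop:", sorted(naive_results))
print("math.fsum: ", sorted(exact_results))
```

The naive loop produces more than one distinct answer depending on the ordering, while `math.fsum` returns 2.0 for every permutation. The price is extra bookkeeping per element, which is the kind of performance cost reproducible modes tend to carry.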

      • Two_hands 2 hours ago
        Wow I didn't know that.

        The worst part of it is, as you say, we all accept it and no one talks about it.

        Is there any recommended reading you'd suggest to look into this more and the impacts of it?

        • noosphr 2 hours ago
          Caveat emptor, but this seems like an up-to-date paper on the state of bitwise reproducibility in DL, with a bunch of citations to other papers that go into more depth: https://arxiv.org/pdf/2510.09180
      • pca006132 6 hours ago
        > The idea of bitwise reproducibility for floating point computations is completely laughable in any part of the DL landscape. Meanwhile in just about every other area that uses fp computation it's been the de facto standard for decades.

        It is quite annoying when you do parallelization, and idk if that many people cared about bitwise reproducibility, especially when it requires compromising a bit of performance.

    • omneity 8 hours ago
      Not until it gets tensor parallelism.
    • ipython 10 hours ago
      Eh, those “ai researchers” are too busy rolling around in mounds of freshly minted Benjamins to care about “quality software”