Building voice agents with Nvidia open models

(daily.co)

126 points | by kwindla 30 days ago

8 comments

  • amelius 30 days ago
    I've been using festival under Linux.

    https://manpages.ubuntu.com/manpages/trusty/man1/festival.1....

    But it is quite old now and pre-dates the DL/AI era.

    Does anybody know of a good modern replacement that I can "apt install"?

    • sigmonsays 30 days ago
      I used piper with a model I found online. It's _a lot_ better than festival, afaik. I'm not sure you can apt install it, though.

      echo "hello" | piper --model ~/.local/share/piper/en_US-lessac-medium.onnx --output_file - | aplay

      • gunalx 30 days ago
        You can in fact apt install piper.
        • amelius 30 days ago
          That's a different piper.

              piper - GTK application to configure gaming devices
          • gunalx 17 days ago
            ^piper-tts exists.
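            For reference, a minimal sketch of the piper-tts route, using the package name and model path from the comments above (the voice model still has to be downloaded separately):

```shell
# Install the TTS engine; the plain "piper" package is the GTK gaming-device tool
sudo apt install piper-tts

# Synthesize a phrase to WAV on stdout and play it with ALSA
echo "hello world" | piper \
  --model ~/.local/share/piper/en_US-lessac-medium.onnx \
  --output_file - | aplay
```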
  • smusamashah 29 days ago
    Do any of the top models let you pause and think while speaking? I have to speak non-stop to the Gemini assistant and ChatGPT, which makes voice mode very unnatural and close to useless. Especially for non-English speakers, probably. I sometimes need extra time to translate my thoughts into English.
    • fragmede 29 days ago
      Have you tried talking to ChatGPT in your native tongue? I was blown away by my mother speaking her native tongue to ChatGPT and having it respond in that language. (It's ever so slightly not a mainstream one.)
      • smusamashah 29 days ago
        Even in my own language I can't talk without any pauses.
  • jjcm 30 days ago
    These have gotten good enough to really make command-by-voice interactions pleasant. I'd love to try this with Cursor - just use it fully with voice.
  • rickydroll 29 days ago
    <pedantic>Voice recognition identifies who you are, speech recognition identifies what you say. </pedantic>

    Example:

    Voice recognition: arrrrrrgh! (Oh, I know that guy. He always gets irritated when someone uses the terms speech recognition and voice recognition incorrectly)

    Speech Recognition: "Why can't you guys keep it straight? It is as simple as knowing the difference between hypothesis and theory."

  • nowittyusername 30 days ago
    This is perfect for me. I just started working on the voice related stuff for my agent framework and this will be of real use. Thanks.
  • jauntywundrkind 30 days ago
    There's also the excellent, likewise open-source unmute.sh, which alas is also Nvidia-only at this point. https://unmute.sh/
    • vikboyechko 30 days ago
      The game show is pretty good. Have a feeling this project will consume all my attention this week, thanks for the tip.
  • atonse 29 days ago
    Can't wait for this to land in MacWhisper. I like the idea of the streaming dictation especially when dictating long prompts to Claude Code.
  • deckar01 30 days ago
    It supports Turing T4, but not Ampere…
    • nsbk 30 days ago
      Any ideas on how to add Ampere support? I have a use case in mind that I would love to try on my 3090 rig
      • deckar01 29 days ago
        Magpie-TTS needs a kernel compiled targeting Ampere, but it appears to be closed source. It was compiled for the 2018 T4, but not for the 2020-2024 consumer cards, only the 2025 ones.
        • nsbk 26 days ago
          I actually forked the repo, modified the Dockerfile and build/run scripts to target Ampere, and the whole setup runs seamlessly on my 3090: Magpie is running fine in under 3 GB of memory, ~2 GB for Nemotron STT, and ~18 GB for Nemotron Nano 30b. Latencies are great and the turn detection works really well!

          I'm going to use this setup as the base for a language-learning app for my gf :)
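          In case it helps others: the fork isn't linked here, but the usual knob for this kind of rebuild is the CUDA arch list. A rough sketch, assuming a PyTorch-style extension build inside the Dockerfile (the image tag and build-arg wiring are hypothetical):

```shell
# Hypothetical sketch: rebuild custom CUDA kernels for Ampere (SM 8.6, e.g. an RTX 3090).
# TORCH_CUDA_ARCH_LIST is read by PyTorch extension builds at compile time.
export TORCH_CUDA_ARCH_LIST="8.6"
docker build --build-arg TORCH_CUDA_ARCH_LIST="8.6" -t voice-agent-ampere .
docker run --gpus all -it voice-agent-ampere
```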

          • deckar01 13 days ago
            I got your fork working (also on a 3090). I was not impressed with the latency or the recommended LLM’s quality.
            • nsbk 13 days ago
              Make sure you’re using the nemotron-speech ASR model. I added support for Spanish via the Canary models, but those have roughly 10x the latency: 160 ms for nemotron-speech vs 1.5 s for Canary.

              For the LLM I’m currently using Mistral-Small-3.2-24B-Instruct instead of Nemotron 3 and it works well for my use case