Thank you very much for your positive and supportive comments about our new /e/OS Voice-to-text!
Regarding its implementation in /e/OS, I’d like to explain why we chose an OpenAI STT API and how it is going to evolve in the future:
- What we have learned from our experiments with STT models that run locally on the smartphone:
  - they work quite poorly and make a lot of recognition mistakes
  - they cannot mix languages: you have to preset one language before use, and any part of what you say that is not in that language (which happens all the time) is not recognized
  - they need a huge amount of memory to run (on the order of hundreds of MB), plus CPU overhead
- At some point it became clear that offline STT was a no-go unless we wanted to offer a degraded UX to /e/OS users. So we looked for alternatives to implement this service, with Privacy constraints in mind. It soon became clear that OpenAI’s Whisper or OpenAI’s new GPT-4 transcribe API was the best-quality option we could offer:
  - it’s fast
  - recognition is very accurate
  - recognition can mix different languages to some degree
  - the transcribe API allows real-time transcription by default
- To ensure Privacy protection, we have decided to add an anonymization proxy to the system, which offers two benefits:
  - it makes transcriptions fully anonymous: server-side, the API receives various audio streams from our proxy that cannot be associated with any specific user.
  - it allowed us to use the QUIC protocol between the /e/OS smartphone and the proxy, so that whatever the network conditions are (they can vary a lot when moving on 3G/4G/LTE), the service can recover easily (which is not possible with TCP).
- What’s next? We acknowledge that this service is not perfect in terms of Privacy protection, although it offers a decent level of protection because it anonymizes the streams. In any case, we’re looking at better alternative services that we could use in the future to improve the current implementation. One option would be to run our own instances of Whisper (which is open source software), but this is quite a big project that we cannot prioritize at the moment.
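
To make the anonymization idea above more concrete, here is a minimal sketch in Python of what such a proxy could do to a request before forwarding it. This is not the actual /e/OS proxy code: the function name `build_upstream_request`, the `ALLOWED_HEADERS` allowlist and the returned dictionary layout are all assumptions chosen for illustration. The idea is simply that the proxy drops everything that could identify the user or device and authenticates upstream with its own key, shared by all users.

```python
# Illustrative sketch only: names and structure are hypothetical,
# not taken from the real /e/OS anonymization proxy.

# Allowlist of client headers the proxy forwards (an assumption).
# Everything else (cookies, user agent, client IP hints, ...) is dropped.
ALLOWED_HEADERS = {"content-type"}


def build_upstream_request(client_headers: dict[str, str],
                           audio: bytes,
                           proxy_api_key: str) -> dict:
    """Build the request the proxy would send upstream for one client."""
    # Keep only allowlisted headers, compared case-insensitively.
    forwarded = {
        name: value
        for name, value in client_headers.items()
        if name.lower() in ALLOWED_HEADERS
    }
    # The proxy authenticates with its own key, so the upstream API
    # sees the proxy's identity, never a per-user credential.
    forwarded["Authorization"] = f"Bearer {proxy_api_key}"
    return {"headers": forwarded, "body": audio}


if __name__ == "__main__":
    request = build_upstream_request(
        {
            "Content-Type": "audio/ogg",
            "User-Agent": "example-phone",
            "Cookie": "session=abc",
            "X-Forwarded-For": "203.0.113.7",
        },
        b"\x00\x01",
        "proxy-key",
    )
    print(sorted(request["headers"]))
```

An allowlist is used here rather than a blocklist so that any header not anticipated in advance is dropped by default. The audio itself is forwarded unchanged, so this only hides who is speaking at the network and account level.
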
So we think that this /e/OS voice-to-text feature is super useful for sending quick messages, for instance when walking, or when driving a car, where typing can be extremely dangerous, and that for this kind of usage it has a decent level of Privacy protection.
Also, there is no obligation to use this service at all: people can choose other alternatives if they prefer them, and obviously, users who are dealing with very sensitive information should probably not be using ANY online voice-to-text service (but again, again and again, it has never been and will never be the purpose of /e/OS to protect users targeted by a government agency or by a criminal organization).
Thank you for your positive support and suggestions!
Of course, we welcome positive and viable suggestions on how we can make this service better (in terms of features and in terms of privacy).