In /e/OS 3.0 we introduced a “speech to text” feature that lets Premium users who cannot, or don’t want to, type a message on the on-screen keyboard simply dictate it and get it transcribed into text.
The use case we envisioned was replying to short messages quickly when typing is impractical, for instance while driving, where it becomes extremely dangerous.
We wanted it to be real-time, multilingual with automatic language detection, easy to use, and high quality.
To build this feature, which is by now expected of any modern OS, we tried several approaches, including small local language models running on the phone. Unfortunately, all our previous attempts failed: the resulting transcription was either poor in quality, too slow, or consumed too much RAM/CPU on the device.
Finally, we settled on an API approach, which is basically: send the voice audio stream to an external server and receive the transcribed text in return.
To implement this approach, we chose OpenAI’s gpt-4o-transcribe API, which was designed for this purpose and is cost-effective and efficient. Unfortunately, we have not found any other comparable service (in terms of quality, efficiency, and cost).
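To make the shape of this API concrete, here is a minimal sketch in Python of a single transcription request. It is a simplification: the real feature streams audio rather than uploading finished clips, and `transcribe_clip` is an illustrative name, not our actual code.

```python
import requests

OPENAI_URL = "https://api.openai.com/v1/audio/transcriptions"

def transcribe_clip(audio_path: str, api_key: str) -> str:
    """Send one recorded clip upstream and return the transcribed text."""
    with open(audio_path, "rb") as audio:
        resp = requests.post(
            OPENAI_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": audio},                # multipart upload of the audio clip
            data={"model": "gpt-4o-transcribe"},  # the transcription model
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["text"]
```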
But we didn’t want /e/OS to connect directly to OpenAI’s servers, because our mission is to offer strong privacy by default, and direct access to third-party servers can be used to track users.
So we implemented an anonymizing proxy that sits between the /e/OS Voice to text client and the OpenAI servers. It even uses the QUIC protocol between /e/OS and the proxy, so that it runs well over mobile networks and recovers easily from signal drops.
As a result, an /e/OS user of Voice to text touches the microphone icon on the keyboard, which starts a software stack that will:
- connect to the anonymizer proxy
- open an audio stream to the proxy
- wait for text transcription in return
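As a rough sketch of that client flow, here is what it could look like in Python with the aioquic library; the proxy hostname, port, and ALPN label are placeholders, and the real /e/OS client feeds audio from the microphone continuously rather than from a ready-made list of chunks.

```python
import asyncio
from aioquic.asyncio import connect
from aioquic.quic.configuration import QuicConfiguration

PROXY_HOST = "stt-proxy.example.org"  # placeholder, not the real proxy address
PROXY_PORT = 4433

async def dictate(audio_chunks: list[bytes]) -> str:
    config = QuicConfiguration(is_client=True, alpn_protocols=["stt"])
    async with connect(PROXY_HOST, PROXY_PORT, configuration=config) as client:
        reader, writer = await client.create_stream()  # one QUIC stream per dictation
        for chunk in audio_chunks:                     # stream the audio to the proxy
            writer.write(chunk)
            await writer.drain()
        writer.write_eof()                             # signal the end of the audio
        return (await reader.read()).decode()          # wait for the transcribed text
```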
The proxy itself receives audio streams from users, relays them to the OpenAI servers through the gpt-4o-transcribe API, waits for the transcribed text in return, and relays it back to the /e/OS device. As a result, OpenAI’s servers see:
- hundreds of successive or simultaneous audio streams to process
- all streams coming from a single IP address (the proxy’s)
On the proxy side, logs show up to 1,000 concurrent audio streams open.
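For illustration, the heart of such a relay can be very small. The sketch below uses plain asyncio TCP streams as a stand-in for the QUIC side, caps concurrency with a semaphore at the scale we observe, and calls a hypothetical `upstream_transcribe()` that would wrap the OpenAI request shown earlier.

```python
import asyncio

MAX_STREAMS = 1000                   # order of magnitude seen in our proxy logs
slots = asyncio.Semaphore(MAX_STREAMS)

async def relay(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    async with slots:
        audio = await reader.read()  # read the caller's clip until end-of-stream
        # upstream_transcribe is a hypothetical blocking wrapper around the
        # OpenAI call; run it in a thread so the event loop keeps serving.
        text = await asyncio.to_thread(upstream_transcribe, audio)
        writer.write(text.encode())
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(relay, "0.0.0.0", 4433)
    async with server:
        await server.serve_forever()
```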
Why we think this feature is acceptable in terms of Privacy
- audio streams from all callers are mixed together: it would be very difficult to match a given user with a series of messages
- all streams come from a single IP address, making it impossible to track a specific user by IP
- the feature NEVER runs in the background by default. To start it, an /e/OS user has to touch the microphone icon on the on-screen keyboard, and it is visually obvious that it is running, as the keyboard is replaced by a dedicated widget. Once you switch back to the normal keyboard or leave the feature, it simply stops sending any audio.
Why we think our service is not perfect
- even though OpenAI’s business model for this service does not rely on processing user data (it is not ad-based, they don’t train their models on the paid API we use, and there is no data retention on their servers), we are still sending some private data (the user’s voice) to a private company that we have to trust (or not…).
- we have to improve the opt-in process: when users first run it, we should explain better what this service is, its limitations, and our usage recommendations, and ask for their explicit approval.
- we could reduce the amount of raw voice data sent to the servers (quantization…); see the sketch below.
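As a sketch of what that last point could mean, assuming 48 kHz float samples from the microphone: decimating to 16 kHz and quantizing to 16-bit keeps what a speech recognizer needs while shedding bandwidth and some speaker-identifying detail. A real implementation would at least low-pass filter before decimating.

```python
import numpy as np

def shrink_audio(pcm: np.ndarray, src_rate: int = 48000, dst_rate: int = 16000) -> bytes:
    """Crude decimation + 16-bit quantization (illustration only)."""
    step = src_rate // dst_rate          # 48 kHz -> 16 kHz: keep every 3rd sample
    down = pcm[::step]                   # naive decimation, no anti-alias filter
    q = np.clip(down, -1.0, 1.0) * np.iinfo(np.int16).max
    return q.astype(np.int16).tobytes()  # 16-bit PCM, ready to send
```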
What’s next with /e/OS Voice to text?
While we think that this new /e/OS service, implemented as it is with Privacy in mind, offers an acceptable balance between a better user experience and privacy, two things stand out:
- some parts need to be improved, especially the explicit opt-in and the anonymization of voice audio data
- we didn’t expect that, when it comes to big tech in general and OpenAI in particular, discussions at some point stop being rational.
Regarding the first point, this is an ongoing process and we will improve it in upcoming releases of /e/OS.
The second point is more personal, since I take full responsibility for this feature’s software architecture and for the choice of API provider. I have come to understand that I underestimated the emotional side of these discussions, and that no matter what we do to improve this Voice to text feature and make it perfect in terms of Privacy, as long as it uses an OpenAI service, some will use this to spread FUD against /e/OS and Murena and hurt our reputation as a pro-Privacy OS.
Therefore, we have started looking again at possible alternatives. The only credible alternative today (i.e., one that does not rely on a private service you have to trust) would be to run a speech-to-text model ourselves for this feature.
One possible way to do this is with the Whisper model, which is open source and which we could install and run on our own servers. The blocking issue for now is cost: inference for such models is very expensive in terms of GPU usage.
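For reference, running Whisper is simple in software terms; the hard part is the GPU bill, not the code. A minimal server-side sketch with the open-source whisper package:

```python
import whisper

# Load once at startup; larger checkpoints are more accurate but need more GPU.
model = whisper.load_model("small")

def transcribe(audio_path: str) -> str:
    """Transcribe one audio file; Whisper detects the language automatically."""
    return model.transcribe(audio_path)["text"]
```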
So we’ll continue to explore this approach, figure out how to solve the economic equation to make it sustainable, and finally stop using OpenAI once it’s ready.
Meanwhile, to all users who don’t trust the current implementation: just don’t use it and you are safe.
Stay tuned,
Gaël
Regain your privacy! Adopt /e/OS, the deGoogled mobile OS and online services