In /e/OS 3.0 we introduced a “speech to text” feature that lets Premium users who cannot, or don’t want to, type a message on the on-screen keyboard simply dictate it and get it transcribed into text.
The use case we envisioned was replying to short messages quickly when typing is impractical, for instance while driving, where it becomes extremely dangerous.
We wanted it to be real-time, multilingual with automatic language detection, easy to use, and high quality.
To build this feature, which is by now expected of any modern OS, we tried several approaches, including small local language models running on the phone. Unfortunately, all our previous attempts failed: the resulting transcription was either poor in quality, too slow, or consumed too much RAM/CPU on the device.
Finally, we settled on an API approach, which is basically: send the voice audio stream to an external server and receive the transcribed text in return.
To implement this approach, we chose OpenAI’s gpt-4o-transcribe API, which was designed for this purpose and is cost-effective and efficient. Unfortunately, we have not found any other comparable service (in terms of quality, efficiency, and cost).
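To make the shape of this API concrete, here is a minimal sketch in Python of a single transcription request. It is a simplification: the real feature streams audio rather than uploading finished clips, and `transcribe_clip` is an illustrative name, not our actual code.

```python
import requests

OPENAI_URL = "https://api.openai.com/v1/audio/transcriptions"

def transcribe_clip(audio_path: str, api_key: str) -> str:
    """Send one recorded clip upstream and return the transcribed text."""
    with open(audio_path, "rb") as audio:
        resp = requests.post(
            OPENAI_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": audio},                # multipart upload of the audio clip
            data={"model": "gpt-4o-transcribe"},  # the transcription model
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["text"]
```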
But we didn’t want /e/OS to connect directly to OpenAI’s servers, because our mission is to offer strong privacy by default, and direct access to third-party servers can be used to track users.
So we implemented an anonymizing proxy that sits between the /e/OS Voice to text client and the OpenAI servers. It even uses the QUIC protocol between /e/OS and the proxy, so that it runs well over mobile networks and recovers easily from signal drops.
As a result, an /e/OS user of Voice to text touches the microphone icon on the keyboard, which starts a software stack that will:
- connect to the anonymizer proxy
- open an audio stream to the proxy
- wait for text transcription in return
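As a rough sketch of that client flow, here is what it could look like in Python with the aioquic library; the proxy hostname, port, and ALPN label are placeholders, and the real /e/OS client feeds audio from the microphone continuously rather than from a ready-made list of chunks.

```python
import asyncio
from aioquic.asyncio import connect
from aioquic.quic.configuration import QuicConfiguration

PROXY_HOST = "stt-proxy.example.org"  # placeholder, not the real proxy address
PROXY_PORT = 4433

async def dictate(audio_chunks: list[bytes]) -> str:
    config = QuicConfiguration(is_client=True, alpn_protocols=["stt"])
    async with connect(PROXY_HOST, PROXY_PORT, configuration=config) as client:
        reader, writer = await client.create_stream()  # one QUIC stream per dictation
        for chunk in audio_chunks:                     # stream the audio to the proxy
            writer.write(chunk)
            await writer.drain()
        writer.write_eof()                             # signal the end of the audio
        return (await reader.read()).decode()          # wait for the transcribed text
```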
The proxy itself receives audio streams from users, relays them to the OpenAI servers through the gpt-4o-transcribe API, waits for the transcribed text in return, and relays it back to the /e/OS device. As a result, OpenAI’s servers see:
- hundreds of successive or simultaneous audio streams to process
- all streams coming from a single IP address (the proxy’s)
On the proxy side, logs show up to 1,000 concurrent audio streams open.
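For illustration, the heart of such a relay can be very small. The sketch below uses plain asyncio TCP streams as a stand-in for the QUIC side, caps concurrency with a semaphore at the scale we observe, and calls a hypothetical `upstream_transcribe()` that would wrap the OpenAI request shown earlier.

```python
import asyncio

MAX_STREAMS = 1000                   # order of magnitude seen in our proxy logs
slots = asyncio.Semaphore(MAX_STREAMS)

async def relay(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    async with slots:
        audio = await reader.read()  # read the caller's clip until end-of-stream
        # upstream_transcribe is a hypothetical blocking wrapper around the
        # OpenAI call; run it in a thread so the event loop keeps serving.
        text = await asyncio.to_thread(upstream_transcribe, audio)
        writer.write(text.encode())
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(relay, "0.0.0.0", 4433)
    async with server:
        await server.serve_forever()
```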
Why we think this feature is acceptable in terms of Privacy
- audio streams from all callers are mixed together: it would be very difficult to match a given user with a series of messages
- all streams come from a single IP address, making it impossible to track a specific user by IP
- the feature NEVER runs in the background by default. To start it, an /e/OS user has to touch the microphone icon on the on-screen keyboard, and it is visually obvious that it is running, as the keyboard is replaced by a dedicated widget. Once you switch back to the normal keyboard or leave the feature, it simply stops sending any audio.
Why we think our service is not perfect
- even though OpenAI’s business model for this service does not rely on processing user data (it is not ad-based, they don’t train their models on the paid API we use, and there is no data retention on their servers), we are still sending some private data (the user’s voice) to a private company that we have to trust (or not…).
- we have to improve the opt-in process: when users first run it, we should explain better what this service is, its limitations, and our usage recommendations, and ask for their explicit approval.
- we could reduce the amount of raw voice data sent to the servers (quantization…); see the sketch below.
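As a sketch of what that last point could mean, assuming 48 kHz float samples from the microphone: decimating to 16 kHz and quantizing to 16-bit keeps what a speech recognizer needs while shedding bandwidth and some speaker-identifying detail. A real implementation would at least low-pass filter before decimating.

```python
import numpy as np

def shrink_audio(pcm: np.ndarray, src_rate: int = 48000, dst_rate: int = 16000) -> bytes:
    """Crude decimation + 16-bit quantization (illustration only)."""
    step = src_rate // dst_rate          # 48 kHz -> 16 kHz: keep every 3rd sample
    down = pcm[::step]                   # naive decimation, no anti-alias filter
    q = np.clip(down, -1.0, 1.0) * np.iinfo(np.int16).max
    return q.astype(np.int16).tobytes()  # 16-bit PCM, ready to send
```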
What’s next with /e/OS Voice to text?
While we think that this new /e/OS service, implemented as it is with Privacy in mind, offers an acceptable balance between a better user experience and privacy, two things stand out:
- some parts need to be improved, especially the explicit opt-in and the anonymization of voice audio data
- we didn’t expect that, when it comes to big tech in general and OpenAI in particular, discussions at some point stop being rational.
Regarding the first point, this is an ongoing process and we will improve it in upcoming releases of /e/OS.
The second point is more personal, since I take full responsibility for this feature’s software architecture and for the choice of API provider. I have come to understand that I underestimated the emotional side of these discussions, and that no matter what we do to improve this Voice to text feature and make it perfect in terms of Privacy, as long as it uses an OpenAI service, some will use this to spread FUD against /e/OS and Murena and hurt our reputation as a pro-Privacy OS.
Therefore, we have started looking again at possible alternatives. The only credible alternative today (i.e., one that does not rely on a private service you have to trust) would be to run a speech-to-text model ourselves for this feature.
One possible way to do this is with the Whisper model, which is open source and which we could install and run on our own servers. The blocking issue for now is cost: inference for such models is very expensive in terms of GPU usage.
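For reference, running Whisper is simple in software terms; the hard part is the GPU bill, not the code. A minimal server-side sketch with the open-source whisper package:

```python
import whisper

# Load once at startup; larger checkpoints are more accurate but need more GPU.
model = whisper.load_model("small")

def transcribe(audio_path: str) -> str:
    """Transcribe one audio file; Whisper detects the language automatically."""
    return model.transcribe(audio_path)["text"]
```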
So we’ll continue to explore this approach, figure out how to solve the economic equation to make it sustainable, and finally stop using OpenAI once it’s ready.
Meanwhile, to all users who don’t trust the current implementation: just don’t use it and you are safe.
Stay tuned,
Gaël
Regain your privacy! Adopt /e/OS, the deGoogled mobile OS and online services