108 lines
4.3 KiB
Markdown
108 lines
4.3 KiB
Markdown
|
# Research Regarding STT/TTS
|
|||
|
|
|||
|
This repository is a mess! It's my personal notepad — a pure collection of snippets and experiments that cost me blood, sweat, and many tears.
|
|||
|
|
|||
|
**Special thanks to:** Google. *You know what you did.*
|
|||
|
**To OpenAI:** You're amazing! Quality stuff. Sadly, I'm not rich enough to run a 24/7 service with your pricing regarding STT/TTS, so I use only `gpt4o-mini`.
|
|||
|
|
|||
|
The end result of this repository is a working **STT/TTS system** that allows you to talk with ChatGPT.
|
|||
|
|
|||
|
To save money, I use TTS/STT from Google Cloud (paid). It's surprisingly cheap!
|
|||
|
|
|||
|
Do not take the way I communicate with the LLM too seriously — that wasn’t the main focus. The implementation in this project has no context, memory, or system messages. Every call is treated as a new session.
|
|||
|
|
|||
|
If you're interested in this technology but get stuck due to lack of documentation, feel free to email me at **retoor@molodetz.nl**.
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## How to Play Immediately (Without Configuration)
|
|||
|
You can get started in just 5 minutes:
|
|||
|
1. Create a virtual environment.
|
|||
|
2. Install the requirements file: `pip install -r requirements.txt`.
|
|||
|
3. Execute `tts.py`.
|
|||
|
|
|||
|
With these steps, you'll have a working `gpt4o-mini` model listening to you and responding in text.
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## Application Output (`tts.py`)
|
|||
|
|
|||
|
The output is speech, but here’s how a typical conversation looks:
|
|||
|
|
|||
|
```
|
|||
|
Adjusting for ambient noise, please wait...
|
|||
|
Listening...
|
|||
|
Recognized Text: what is the name of the dog of ga
|
|||
|
Response from gpt4o_mini: Please provide more context or details about what "GA" refers to, so I can assist you accurately.
|
|||
|
Recognized Text: Garfield the gas has a dog friends what is his name
|
|||
|
Response from gpt4o_mini: Garfield's dog friend is named Odie.
|
|||
|
Recognized Text: is FTP still used
|
|||
|
Response from gpt4o_mini: Yes, FTP (File Transfer Protocol) is still used for transferring files over a network, although more secure alternatives like SFTP (Secure File Transfer Protocol) and FTPS (FTP Secure) are often preferred due to security concerns.
|
|||
|
Recognized Text: why is Linux better than
|
|||
|
Response from gpt4o_mini: Please complete your question for a more specific comparison about why Linux might be considered better than another operating system or software.
|
|||
|
```
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## Repository Structure
|
|||
|
|
|||
|
The repository contains:
|
|||
|
- **`play.py`**: For playing audio with Python.
|
|||
|
- **`gcloud.py`**: A wrapper around the Google Cloud SDK (this was the most time-consuming to build).
|
|||
|
- **`tts.py`**: Execute this script to talk with GPT.
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## Requirements and Preparation
|
|||
|
|
|||
|
- **A paid Google Cloud account**
|
|||
|
- Google Cloud CLI
|
|||
|
- You get $300 and 90 days for free, but you'll need to attach a credit card. I used it extensively and didn't spend a cent!
|
|||
|
- The free credit barely depletes even with heavy usage.
|
|||
|
|
|||
|
- **Google Cloud SDK + CLI** installed
|
|||
|
*Important:* These standalone applications affect the behavior of Python's Google library regarding authentication.
|
|||
|
|
|||
|
- **Python 3** and the following:
|
|||
|
- `python3-venv`
|
|||
|
- `python3-pip`
|
|||
|
|
|||
|
> I initially installed a lot using `apt-get`, but I can’t recall if it was all necessary in the end.
|
|||
|
|
|||
|
---
|
|||
|
|
|||
|
## Installation Steps
|
|||
|
|
|||
|
1. Activate the virtual environment:
|
|||
|
```bash
|
|||
|
python3 -m venv venv && source venv/bin/activate
|
|||
|
```
|
|||
|
2. Install the requirements:
|
|||
|
```bash
|
|||
|
pip install -r requirements.txt
|
|||
|
```
|
|||
|
## Testing the setup
|
|||
|
1. Check Google Authentication & TTS
|
|||
|
```bash
|
|||
|
python gcloud.py
|
|||
|
```
|
|||
|
- If successful, it will speak a sentence.
|
|||
|
- If not, you'll likely encounter some authentication issues — brace yourself for Google-related configuration struggles.
|
|||
|
|
|||
|
2. Check Speech Recognition (No API Needed)
|
|||
|
```bash
|
|||
|
python tts.py
|
|||
|
```
|
|||
|
- This sends your text to the gpt4o-mini model and prints the response.
|
|||
|
- Requires no configuration and works out of the box.
|
|||
|
|
|||
|
## Conclusion
|
|||
|
Play stupid games, win stupid prizes. Figuring this out was a nightmare. If OpenAI's services were financially viable, I would have chosen them — better quality and much easier to implement.
|
|||
|
|
|||
|
Now, I have a fully operational project that communicates perfectly and even follows conversations. For example, I can:
|
|||
|
- Assign numbers.
|
|||
|
- Perform calculations (e.g., divide "the first number by the second").
|
|||
|
- Use the microphone full-time to ask or say anything I want. I have a wireless JBL GO speaker that's directly ready for the job when I turn it on.
|
|||
|
|
|||
|
I hope some people appreciate the snippets!
|