The world of text-to-speech (TTS) has seen incredible advances, but these powerful models often demand hefty hardware such as GPUs. What if you could run a top-tier TTS model locally on your CPU? Enter Kokoro, a compact TTS model that delivers impressive results even on resource-constrained devices.
Kokoro: Small but Mighty
Kokoro stands out for its remarkable efficiency. With just 82 million parameters, it outperforms models several times its size, including XTTS (467M parameters) and MetaVoice (1.2B parameters). This shows that cutting-edge TTS does not have to mean massive models and powerful GPUs.
Running Kokoro with ONNX on Your CPU
The key to running Kokoro efficiently on your CPU is ONNX (Open Neural Network Exchange), an open format for representing machine learning models. An ONNX export can run on many platforms and hardware targets, including plain CPUs, with solid performance; the kokoro-onnx package wraps the model and runs it through onnxruntime. Here's how you can set up and run Kokoro on your CPU using ONNX:
Steps
Install Dependencies
Ensure you have Python installed along with the gradio, kokoro-onnx, and soundfile packages (for example: pip install gradio kokoro-onnx soundfile). The tempfile and os modules used in the script are part of Python's standard library.
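If you want to confirm that inference will run on the CPU, you can ask onnxruntime (installed as a dependency of kokoro-onnx) which execution providers it has available. This is just a quick sanity check, assuming a default CPU-only install:

import onnxruntime as ort

# A CPU-only install exposes "CPUExecutionProvider", which is all
# Kokoro needs to run locally without a GPU.
print(ort.get_available_providers())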
Obtain the Kokoro ONNX Model
Download the kokoro-v0_19.onnx model file.
Download the Voices File
Obtain the voices.json file, which contains information about the available voices; a scripted download sketch for both files is shown below.
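Both files are published by the Kokoro / kokoro-onnx projects (see the GitHub and Hugging Face links at the end of this post). If you prefer to script the downloads, here is a minimal sketch; the URLs below are placeholders, so substitute the actual links from those project pages:

import urllib.request

# Placeholder URLs: replace these with the real download links for
# kokoro-v0_19.onnx and voices.json from the project pages.
MODEL_URL = "https://example.com/kokoro-v0_19.onnx"
VOICES_URL = "https://example.com/voices.json"

urllib.request.urlretrieve(MODEL_URL, "kokoro-v0_19.onnx")
urllib.request.urlretrieve(VOICES_URL, "voices.json")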
Create a Python Script
The provided Python code demonstrates how to set up a simple Gradio interface to interact with the Kokoro model. You can modify this code to suit your needs.
import gradio as gr
from kokoro_onnx import Kokoro
import soundfile as sf
import tempfile
import os


class TextToSpeechApp:
    def __init__(self):
        # Initialize Kokoro
        self.kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")
        # Available voices
        self.voices = [
            'af', 'af_bella', 'af_nicole', 'af_sarah', 'af_sky',
            'am_adam', 'am_michael', 'bf_emma', 'bf_isabella',
            'bm_george', 'bm_lewis'
        ]

    def generate_speech(self, text, voice, speed):
        try:
            # Generate audio
            samples, sample_rate = self.kokoro.create(
                text,
                voice=voice,
                speed=float(speed)
            )
            # Create temporary file
            temp_dir = tempfile.mkdtemp()
            temp_path = os.path.join(temp_dir, "output.wav")
            # Save to temporary file
            sf.write(temp_path, samples, sample_rate)
            return temp_path
        except Exception as e:
            # Surface the error in the Gradio UI instead of returning a
            # string that the Audio output component cannot play
            raise gr.Error(str(e))

    def create_interface(self):
        interface = gr.Interface(
            fn=self.generate_speech,
            inputs=[
                gr.Textbox(label="Enter text to convert", lines=5),
                gr.Dropdown(choices=self.voices, label="Select Voice", value=self.voices[0]),
                gr.Slider(minimum=0.5, maximum=2.0, value=1.0, step=0.1, label="Speech Speed")
            ],
            outputs=gr.Audio(label="Generated Speech"),
            title="Text to Speech Converter",
            description="Convert text to speech using different voices and speeds."
        )
        return interface


def main():
    app = TextToSpeechApp()
    interface = app.create_interface()
    # Launch with a public URL
    interface.launch(server_name="0.0.0.0", share=True)


if __name__ == "__main__":
    main()
Run the Script
Execute your Python script (for example, python app.py if you saved it as app.py). The Gradio interface lets you enter text, select a voice, adjust the speech speed, and generate the audio output. If you do not need the web UI, a minimal command-line sketch follows below.
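The same kokoro-onnx calls used in the app above are enough on their own. Here is a minimal sketch (the sample text and voice are just examples) that writes the generated speech straight to a WAV file:

from kokoro_onnx import Kokoro
import soundfile as sf

# Load the model and voice data, exactly as in the Gradio app above
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

# create() returns the audio samples and their sample rate
samples, sample_rate = kokoro.create(
    "Hello from Kokoro running on a plain CPU.",
    voice="af_sarah",
    speed=1.0
)

# Write the result to a WAV file
sf.write("output.wav", samples, sample_rate)
print("Wrote output.wav")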
You can download the code from GitHub: https://github.com/nkalra0123/kokoro-tts
Hugging Face repo: https://huggingface.co/hexgrad/Kokoro-82M
Benefits of Running Kokoro Locally on Your CPU
Accessibility: You don’t need a high-end GPU to experience high-quality TTS.
Offline Use: Once set up, you can use Kokoro offline, making it ideal for scenarios with limited or no internet connectivity.
Privacy: Processing text locally ensures your data remains private.
Conclusion
Kokoro is a testament to the fact that efficient and powerful TTS is possible even on modest hardware. By leveraging the ONNX format and running it on your CPU, you can enjoy impressive text-to-speech capabilities without the need for a dedicated GPU. This opens up new possibilities for integrating TTS into a wide range of applications, even on devices with limited processing power.