Kokoro: High-Quality Text-to-Speech (TTS) on Your CPU with ONNX

This audio was generated with Kokoro TTS.

Text-to-speech (TTS) has advanced rapidly, but the strongest models often require hefty hardware such as GPUs. What if you could run a high-quality TTS model locally on your CPU? Kokoro is a compact TTS model that delivers impressive results even on resource-constrained devices.

Kokoro: Small but Mighty

Kokoro stands out for its efficiency. With just 82 million parameters, it has been reported to outperform models several times its size, including XTTS (467M parameters) and MetaVoice (1.2B parameters). This suggests that high-quality TTS is achievable without massive models or powerful GPUs.

Running Kokoro with ONNX on Your CPU

The key to running Kokoro efficiently on your CPU is ONNX (Open Neural Network Exchange), an open format for representing machine learning models. A model exported to ONNX can be executed on a variety of platforms and hardware, including ordinary CPUs, without needing the framework it was trained in. Here’s how you can set up and run Kokoro on your CPU using ONNX:

Steps

Install Dependencies

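Make sure Python is installed, then install the third-party packages gradio, kokoro-onnx, and soundfile (for example, with pip). The tempfile and os modules used in the script are part of Python's standard library and need no installation.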
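A quick way to confirm that the third-party packages are importable before going further is sketched below. It assumes the import names gradio, kokoro_onnx (as imported in the script in this article), and soundfile; missing_packages is a hypothetical helper name. The tempfile and os modules are left out because they ship with Python.

```python
import importlib.util

# Import names of the third-party packages used in this article (assumed).
REQUIRED = ["gradio", "kokoro_onnx", "soundfile"]

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it before moving on to the next step.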

Obtain the Kokoro ONNX Model

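Download the kokoro-v0_19.onnx model file and place it in the directory you will run the script from, since the script loads it by relative path.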

Download the Voices File

Obtain the voices.json file, which contains information about the available voices, and save it next to the model file.
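Before starting the app, it can help to verify that both files are where the script expects them. The sketch below assumes they sit in the current working directory under the names used in this article, and that voices.json is a JSON object keyed by voice name. That layout is an assumption, so check it against your copy of the file. check_assets and list_voice_names are hypothetical helper names.

```python
import json
from pathlib import Path

# File names as used in this article; adjust if yours differ.
MODEL_PATH = Path("kokoro-v0_19.onnx")
VOICES_PATH = Path("voices.json")

def check_assets(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]

def list_voice_names(voices_path):
    """Return sorted voice names, assuming a top-level JSON object keyed by voice name."""
    with open(voices_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object keyed by voice name")
    return sorted(data)

missing = check_assets([MODEL_PATH, VOICES_PATH])
if missing:
    print("Missing files:", ", ".join(str(p) for p in missing))
else:
    print("Voices:", list_voice_names(VOICES_PATH))
```

Comparing the printed names with the voice list hardcoded in the script is a simple way to catch a mismatched voices file.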

Create a Python Script

The Python code below sets up a simple Gradio interface for interacting with the Kokoro model. You can modify it to suit your needs.

import gradio as gr
from kokoro_onnx import Kokoro
import soundfile as sf
import tempfile
import os

class TextToSpeechApp:
    def __init__(self):
        # Initialize Kokoro
        self.kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")
        
        # Available voices
        self.voices = [
            'af', 'af_bella', 'af_nicole', 'af_sarah', 'af_sky',
            'am_adam', 'am_michael', 'bf_emma', 'bf_isabella',
            'bm_george', 'bm_lewis'
        ]

    def generate_speech(self, text, voice, speed):
        try:
            # Generate audio
            samples, sample_rate = self.kokoro.create(
                text,
                voice=voice,
                speed=float(speed)
            )
            
            # Create temporary file
            temp_dir = tempfile.mkdtemp()
            temp_path = os.path.join(temp_dir, "output.wav")
            
            # Save to temporary file
            sf.write(temp_path, samples, sample_rate)
            
            return temp_path
            
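        except Exception as e:
            # gr.Audio expects an audio file path, so returning an error string
            # would be treated as a (nonexistent) file; raise a Gradio error instead
            raise gr.Error(str(e))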

    def create_interface(self):
        interface = gr.Interface(
            fn=self.generate_speech,
            inputs=[
                gr.Textbox(label="Enter text to convert", lines=5),
                gr.Dropdown(choices=self.voices, label="Select Voice", value=self.voices[0]),
                gr.Slider(minimum=0.5, maximum=2.0, value=1.0, step=0.1, label="Speech Speed")
            ],
            outputs=gr.Audio(label="Generated Speech"),
            title="Text to Speech Converter",
            description="Convert text to speech using different voices and speeds."
        )
        return interface

def main():
    app = TextToSpeechApp()
    interface = app.create_interface()
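    # server_name="0.0.0.0" listens on all network interfaces, and share=True
    # asks Gradio for a temporary public URL; remove both for local-only use
    interface.launch(server_name="0.0.0.0", share=True)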

if __name__ == "__main__":
    main()

Run the Script

Execute your Python script. The Gradio interface lets you enter text, select a voice, adjust the speech speed, and generate speech output.

You can download the code from GitHub: https://github.com/nkalra0123/kokoro-tts

Hugging Face repo: https://huggingface.co/hexgrad/Kokoro-82M

Benefits of Running Kokoro Locally on Your CPU

Accessibility: You don’t need a high-end GPU to experience high-quality TTS.

Offline Use: Once set up, you can use Kokoro offline, making it ideal for scenarios with limited or no internet connectivity.

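Privacy: Text is processed on your own machine rather than sent to a TTS service. Note that launching Gradio with share=True exposes the interface through a public URL, so disable it if privacy matters.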
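Whether a given CPU is fast enough depends on the machine, so it may be worth measuring. One common yardstick is the real-time factor: synthesis time divided by the duration of the audio produced, where values below 1 mean speech is generated faster than it plays back. The sketch below is an illustration rather than part of the original script. measure_rtf is a hypothetical helper, and synthesize stands for any callable that returns (samples, sample_rate), such as a wrapper around the kokoro.create call shown earlier.

```python
import time

def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """Synthesis time divided by audio duration (lower is faster)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds

def measure_rtf(synthesize, text):
    """Time one call to `synthesize(text)` and return its real-time factor."""
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)

# Hypothetical usage with a Kokoro object like the one created in the script:
# rtf = measure_rtf(lambda t: kokoro.create(t, voice="af", speed=1.0), "Hello there.")
# print(f"Real-time factor: {rtf:.2f}")
```

Averaging over several sentences of different lengths gives a steadier picture than a single call.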

Conclusion

Kokoro shows that efficient, capable TTS is possible even on modest hardware. By running the ONNX model on your CPU, you get impressive text-to-speech without a dedicated GPU. This opens up possibilities for integrating TTS into a wide range of applications, even on devices with limited processing power.