AI Agent with Voice

AI Assistant Development March 2025 Personal Project GitHub Repository

Project Overview

The AI Agent with Voice is a Python-based application that extends the functionality of large language models (LLMs) by adding speech input and output capabilities. It creates a natural voice interface that allows users to have spoken conversations with AI models running locally through Ollama.

This project combines offline speech recognition with high-quality text-to-speech to provide a seamless conversational experience. By implementing a Model-View-Controller (MVC) architecture, the application maintains a clean separation of concerns, making it both maintainable and extensible.

Unlike cloud-based voice assistants, this application focuses on privacy by processing all speech locally, while still delivering a responsive and natural interaction experience. Users can choose from multiple voice options and connect to different language models based on their needs.

Key Features

Voice Input: Offline speech recognition using the Vosk library for private, accurate transcription of user speech.
Voice Output: Dual text-to-speech engines with XTTS-v2 for high-quality natural-sounding voices and system TTS as a fallback option.
Conversation Display: Color-coded conversation history that shows both user inputs and AI responses for easy reference.
Multiple LLM Support: Connect to different language models through Ollama, allowing users to choose the most appropriate model for their needs.
Voice Selection: Choice of multiple voices for AI responses, with support for custom voice cloning.
MVC Architecture: Clean separation of concerns with model for AI logic, view for UI, and controller for connecting components.
Offline Operation: All speech processing happens locally, protecting user privacy and allowing operation without internet connectivity.

See it in Action

Demonstration of voice interaction with the AI agent

Development Process

The development of this project focused on creating a modular, extensible system that could provide a natural voice interface to AI models while maintaining user privacy and control. The project was built using Python and several specialized libraries.

The development process included:

Designing a flexible MVC architecture to separate the AI logic, user interface, and control flow
Implementing offline speech recognition using Vosk for privacy-focused voice input
Building a hybrid text-to-speech system with XTTS-v2 for high-quality voices and system TTS as a reliable fallback
Creating a responsive TKinter interface with intuitive controls and conversation history
Developing an Ollama integration for connecting to locally running language models
Implementing voice selection and custom voice options
Thorough testing and optimization for responsiveness and performance

Technical Highlights

Python

Core language used for application development with various specialized libraries

Vosk

Offline speech recognition engine for private, accurate voice input processing

XTTS-v2

Advanced deep learning text-to-speech for natural-sounding AI responses

TKinter

Python's built-in GUI toolkit used to create the user interface

Ollama

Local LLM server for running AI models without cloud dependencies

MVC Architecture

Software design pattern used for clean separation of concerns

Challenges & Solutions

Voice Recognition Accuracy

Challenge: Achieving reliable speech recognition without relying on cloud-based services.

Solution: Implemented Vosk, an offline speech recognition system with customizable models, and added silence detection with adaptive thresholds to improve accuracy in different environments.

Natural-Sounding Speech

Challenge: Creating natural-sounding AI voices that don't break immersion during conversation.

Solution: Integrated XTTS-v2, a state-of-the-art neural text-to-speech system, while maintaining system TTS as a fallback. Added sentence splitting for better prosody and custom voice support.

System Performance

Challenge: Ensuring responsive performance despite running speech recognition, LLM inference, and TTS simultaneously.

Solution: Implemented a multi-threaded architecture with background processing for speech recognition and TTS, along with asynchronous processing of audio to maintain UI responsiveness.

Cross-Platform Compatibility

Challenge: Creating a consistent experience across different operating systems.

Solution: Designed fallback mechanisms for both speech recognition and TTS that automatically adapt to available system resources and capabilities.

Results & Implementation Details

The AI Agent with Voice project successfully creates a seamless voice interface for interacting with AI models. The application demonstrates how advanced speech technologies can be combined with local LLMs to create privacy-focused voice assistants.

The implementation follows the MVC pattern:

Model (model.py): Handles communication with Ollama, maintains conversation history, and processes AI responses
View (view.py): Implements the TKinter GUI with user controls and conversation display
Controller (controller.py): Connects the model and view, manages speech recognition and TTS components

Specialized components include:

Speech Recognition (speech.py): Implements real-time voice input with Vosk, supporting word-by-word feedback and adaptive silence detection
Text-to-Speech (tts.py): Provides high-quality voice output with XTTS-v2, with system TTS fallback and custom voice support

Key learning outcomes from this project include:

Building MVC-structured applications for complex AI projects
Working with offline speech recognition and neural TTS systems
Creating responsive, multi-threaded applications that handle resource-intensive AI tasks
Designing user interfaces that provide natural conversation experiences
Integrating with local LLM servers for privacy-focused AI applications

Links & Resources

GitHub Repository Documentation Download Release Demo Video