# audio_meaning_db
A self-contained semantic audio search tool for **spoken audio**. Upload audio clips (optionally with a description) to build up a database, then search by audio to find the nearest neighbors by what's being said. Runs as a single Docker service: a FastAPI backend that transcribes speech locally with Whisper (`openai/whisper-base`), embeds the transcripts with a sentence-transformer (`all-MiniLM-L6-v2`), and stores vectors in ChromaDB, served behind a minimal browser UI.
Long uploads are split into **60-second segments**; each segment is transcribed and indexed independently. Search returns the best-matching segments, each with a link to the full parent clip.
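As a rough illustration of the segmenting step, the sketch below computes the boundaries a fixed 60-second splitter could produce. The helper name `segment_bounds` and the treatment of the final, shorter segment are assumptions made for this example, not the backend's actual code.

```python
def segment_bounds(duration_sec: float, segment_sec: float = 60.0) -> list[tuple[float, float]]:
    """Return (start, end) pairs, in seconds, covering a clip in fixed-size steps.

    Hypothetical helper: when the duration is not a multiple of
    segment_sec, the last segment is simply shorter.
    """
    if duration_sec <= 0 or segment_sec <= 0:
        return []
    bounds = []
    start = 0.0
    while start < duration_sec:
        end = min(start + segment_sec, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds
```

Under these assumptions, a 150-second upload yields three segments: 0 to 60, 60 to 120, and 120 to 150 seconds.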
## Why Whisper + sentence embeddings (and not CLAP)
This tool is optimized for finding audio by *what is said*, not by *how it sounds*. CLAP embeds acoustic character, so two recordings that sound alike would look similar regardless of content, while recordings saying "I love cats" and "I'm fond of felines" could look very different despite meaning the same thing. Whisper transcription + text embeddings inverts that: meaning-preserving paraphrases match, and acoustic differences (voice, accent, background noise) are largely ignored.
If you want music/SFX/ambient similarity instead, swap in CLAP — the plumbing is the same.
## Prerequisites
You need Docker Engine and the Docker Compose plugin. If you don't already have them:
- **Linux (Ubuntu/Debian):** follow the official install guide at https://docs.docker.com/engine/install/ubuntu/. After installing, add your user to the `docker` group so you don't need `sudo`:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
- **macOS / Windows:** install Docker Desktop from https://docs.docker.com/desktop/. Compose is bundled.
Verify it works:
```bash
docker --version
docker compose version
```
## Running it
From this directory:
```bash
docker compose up -d --build
```
Then open http://localhost:8082 in your browser.
### What to expect on the first run
The first `up --build` is slow because it:
1. Installs Python deps including CPU-only PyTorch (~200 MB pip download) and `ffmpeg` + `libsndfile`.
2. Downloads the Whisper model (~150 MB) and the sentence-transformer (~80 MB) into a cached volume on first server start.
3. Starts with an **empty database** — no seeding. Upload your own audio.
Watch progress with:
```bash
docker compose logs -f backend
```
You'll see `ASR model openai/whisper-base ready.` and `Text embedding model ... ready.` once it's warmed up. Subsequent runs reuse the cached models and existing database, so startup is fast.
## Using the UI
Two tabs:
- **Submit Audio** — drop or click to select an audio file (mp3, wav, m4a, flac, ogg). Add an optional description and click *Submit to Database*. The file is chunked into 60-second segments, each transcribed and embedded. You'll see the per-segment transcripts once it's done.
- **Search by Audio** — drop or click to select a query clip (≤ 60 seconds, hard-enforced). The backend transcribes it, embeds the transcript, and returns the most semantically similar stored segments, ranked by cosine similarity. Each result shows the segment's transcript, a playable audio slice of just that segment, and a toggle to play the full parent clip.
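To make "ranked by cosine similarity" concrete, here is a minimal pure-Python sketch. In the real service ChromaDB performs the nearest-neighbor search over 384-dimensional vectors; the function names and the 2-dimensional toy vectors below are illustrative assumptions only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(query_vec: list[float], stored: dict[str, list[float]], n: int = 10):
    """Return up to n (segment_id, score) pairs, most similar first."""
    scored = [(seg_id, cosine_similarity(query_vec, vec)) for seg_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy data: real embeddings come from the sentence-transformer model.
stored = {"seg-a": [1.0, 0.0], "seg-b": [0.0, 1.0], "seg-c": [1.0, 1.0]}
ranking = rank_segments([1.0, 0.0], stored, n=2)
```

Here `ranking` keeps the two segments whose vectors point most nearly in the same direction as the query, which is the same ordering principle the search tab relies on.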
## API
Direct endpoints:
- `POST /api/submit` — multipart form: `file` (audio), optional `description` (string). Returns `{parent_id, filename, segments_added, total_segments, duration_sec, segments}` where `segments` includes per-segment timestamps and transcripts.
- `POST /api/search` — multipart form: `file` (audio, ≤ 60 s), optional query param `n` (default 10). Returns `{results, query_transcript}` with ranked matches.
- `GET /api/audio/{filename}` — serves a stored audio file in its original format.
- `GET /api/segment/{parent_filename}?start=<sec>&end=<sec>` — serves a WAV slice of a segment.
- `GET /api/stats` returns `{total_segments, total_clips}`.
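For scripted access, the segment endpoint's URL can be assembled with the standard library. The sketch below assumes the default host port `8082` and assumes the server accepts a percent-encoded filename in the path; `segment_url` is a hypothetical helper, not part of the project.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8082"  # assumption: default host port from docker-compose.yml


def segment_url(parent_filename: str, start: float, end: float, base_url: str = BASE_URL) -> str:
    """Build the GET /api/segment/{parent_filename}?start=<sec>&end=<sec> URL."""
    # Percent-encode the filename so spaces and slashes cannot break the path.
    path = "/api/segment/" + quote(parent_filename, safe="")
    query = urlencode({"start": start, "end": end})
    return f"{base_url}{path}?{query}"


url = segment_url("my clip.mp3", 60, 120)
```

The resulting URL can be passed to any HTTP client, or pasted into a browser, to fetch the WAV slice for that segment.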
## Stopping and resetting
```bash
docker compose down # stop containers, keep data
docker compose down -v # also delete the database, cached models, and stored audio
```
If you wipe volumes, the next start will re-download both models.
## Configuration
Environment variables set in `docker-compose.yml`:
- `ASR_MODEL` — HuggingFace Whisper model name. Default: `openai/whisper-base`. Smaller/faster: `openai/whisper-tiny`. Better quality: `openai/whisper-small` (~500 MB, slower on CPU). English-only variants (`-base.en`, `-tiny.en`) are slightly better for English-only content. If you change this, existing transcripts stay valid (only query behavior changes).
- `TEXT_EMBEDDING_MODEL` — sentence-transformers model name. Default: `sentence-transformers/all-MiniLM-L6-v2` (384-d). If you change this, wipe the `chroma_data` volume — embedding dimensions must match across all stored vectors.
Host port mapping is also in `docker-compose.yml`; change the left side of `"8082:8080"` if 8082 conflicts with something else.
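As a hedged sketch of how a service like this might pick up those variables, the snippet below reads them from the environment and falls back to the documented defaults. The `load_config` helper and `DEFAULTS` mapping are hypothetical; the actual backend may read its settings differently.

```python
import os

# Defaults as documented above; the variable names match docker-compose.yml.
DEFAULTS = {
    "ASR_MODEL": "openai/whisper-base",
    "TEXT_EMBEDDING_MODEL": "sentence-transformers/all-MiniLM-L6-v2",
}


def load_config(env=None) -> dict[str, str]:
    """Return the effective settings, preferring values found in env."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


# Example: override only the ASR model, keep the default embedding model.
config = load_config({"ASR_MODEL": "openai/whisper-tiny"})
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.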