example-projects/examples/audio_meaning_db
2026-05-28 12:16:02 -05:00

audio_meaning_db

A self-contained semantic search tool for spoken audio. Upload audio clips (optionally with a description) to build up a database, then search with an audio clip to find the stored segments closest in meaning to what is being said. It runs as a single Docker service: a FastAPI backend that transcribes speech locally with Whisper (openai/whisper-base), embeds the transcripts with a sentence-transformer (all-MiniLM-L6-v2), stores the vectors in ChromaDB, and serves a minimal browser UI.

Long uploads are split into 60-second segments; each segment is transcribed and indexed independently. Search returns the best-matching segments, each with a link to its full parent clip.
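
The 60-second split can be pictured with a small sketch. This is only an illustration, not the backend's code: the function name segment_bounds is hypothetical, and treating the final partial window as a shorter segment is an assumption.

```python
from typing import List, Tuple

SEGMENT_SEC = 60.0  # segment length stated in this README


def segment_bounds(duration_sec: float,
                   segment_sec: float = SEGMENT_SEC) -> List[Tuple[float, float]]:
    """Return (start, end) pairs that cover the clip in fixed-size windows.

    Assumption: a trailing remainder becomes one shorter final segment.
    """
    if duration_sec <= 0:
        return []
    bounds = []
    start = 0.0
    while start < duration_sec:
        end = min(start + segment_sec, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds


# A 150-second upload yields two full segments and one 30-second remainder.
print(segment_bounds(150.0))
```

Each (start, end) pair would then be transcribed and embedded on its own, which is why a search hit points at one segment rather than a whole upload.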

Why Whisper + sentence embeddings (and not CLAP)

This tool is optimized for finding audio by what is said, not by how it sounds. CLAP embeds acoustic character, so two recordings that sound alike (same voice, same room) would look similar regardless of content, while two recordings saying "I love cats" vs "I'm fond of felines" would look very different. Whisper transcription + text embeddings inverts that: meaning-preserving paraphrases match, and acoustic differences (voice, accent, background noise) are largely ignored.

If you want music/SFX/ambient similarity instead, swap in CLAP — the plumbing is the same.

Prerequisites

You need Docker Engine and the Docker Compose plugin. If you don't already have them, install both by following Docker's official instructions for your platform.

Verify that both are installed:

docker --version
docker compose version

Running it

From this directory:

docker compose up -d --build

Then open http://localhost:8082 in your browser.

What to expect on the first run

The first docker compose up --build is slow. During that run, the service:

  1. Installs Python deps including CPU-only PyTorch (~200 MB pip download) and ffmpeg + libsndfile.
  2. Downloads the Whisper model (~150 MB) and the sentence-transformer (~80 MB) into a cached volume on first server start.
  3. Starts with an empty database — no seeding. Upload your own audio.

Watch progress with:

docker compose logs -f backend

You'll see "ASR model openai/whisper-base ready." and "Text embedding model ... ready." in the logs once the models are warmed up. Subsequent runs reuse the cached models and the existing database, so startup is fast.

Using the UI

Two tabs:

  • Submit Audio — drop or click to select an audio file (mp3, wav, m4a, flac, ogg). Add an optional description and click Submit to Database. The file is chunked into 60-second segments, each transcribed and embedded. You'll see the per-segment transcripts once it's done.
  • Search by Audio — drop or click to select a query clip (≤ 60 seconds, hard-enforced). The backend transcribes it, embeds the transcript, and returns the most semantically similar stored segments, ranked by cosine similarity. Each result shows the segment's transcript, a playable audio slice of just that segment, and a toggle to play the full parent clip.
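
Ranking by cosine similarity can be sketched in plain Python. This is a toy illustration: in the running service ChromaDB performs the vector search over 384-dimensional embeddings, and the names cosine_similarity, rank_segments, and the segment ids below are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(query: Sequence[float],
                  stored: Dict[str, Sequence[float]],
                  n: int = 10) -> List[Tuple[str, float]]:
    """Return the n stored segments most similar to the query, best first."""
    scored = [(seg_id, cosine_similarity(query, vec))
              for seg_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy 3-d vectors standing in for real transcript embeddings.
stored = {
    "clip_a#0": [1.0, 0.0, 0.0],
    "clip_a#1": [0.0, 1.0, 0.0],
    "clip_b#0": [0.9, 0.1, 0.0],
}
ranked = rank_segments([1.0, 0.0, 0.0], stored, n=2)
print([seg_id for seg_id, _ in ranked])
```

Because the vectors come from transcripts, two segments score highly when their wording means the same thing, even if the speakers and recording conditions differ.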

API

Direct endpoints:

  • POST /api/submit — multipart form: file (audio), optional description (string). Returns {parent_id, filename, segments_added, total_segments, duration_sec, segments} where segments includes per-segment timestamps and transcripts.
  • POST /api/search — multipart form: file (audio, ≤ 60 s), optional query param n (default 10). Returns {results, query_transcript} with ranked matches.
  • GET /api/audio/{filename} — serves a stored audio file in its original format.
  • GET /api/segment/{parent_filename}?start=<sec>&end=<sec> — serves a WAV slice of a segment.
  • GET /api/stats returns {total_segments, total_clips}.
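
As a rough sketch of how a client might build URLs for the two GET audio endpoints listed above, using only the standard library. The helper names segment_url and audio_url are hypothetical, the base URL assumes the default host port 8082, and percent-encoding the filename is an assumption about what the server accepts.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8082"  # default host port from this README


def segment_url(parent_filename: str, start: float, end: float,
                base_url: str = BASE_URL) -> str:
    """Build a URL for GET /api/segment/{parent_filename}?start=..&end=.."""
    # Percent-encode the filename so spaces and slashes are safe in the path.
    path = "/api/segment/" + quote(parent_filename, safe="")
    query = urlencode({"start": start, "end": end})
    return f"{base_url}{path}?{query}"


def audio_url(filename: str, base_url: str = BASE_URL) -> str:
    """Build a URL for GET /api/audio/{filename}."""
    return f"{base_url}/api/audio/{quote(filename, safe='')}"


print(segment_url("my clip.mp3", 60, 120))
print(audio_url("a.wav"))
```

The resulting URLs can be opened in a browser or passed to any HTTP client; the POST endpoints need a multipart form body and are not covered by this sketch.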

Stopping and resetting

docker compose down             # stop containers, keep data
docker compose down -v          # also delete the database, cached models, and stored audio

If you wipe volumes, the next start will re-download both models.

Configuration

Environment variables set in docker-compose.yml:

  • ASR_MODEL — HuggingFace Whisper model name. Default: openai/whisper-base. Smaller/faster: openai/whisper-tiny. Better quality: openai/whisper-small (~500 MB, slower on CPU). English-only variants (-base.en, -tiny.en) are slightly better for English-only content. If you change this, existing transcripts stay valid (only query behavior changes).
  • TEXT_EMBEDDING_MODEL — sentence-transformers model name. Default: sentence-transformers/all-MiniLM-L6-v2 (384-d). If you change this, wipe the chroma_data volume — embedding dimensions must match across all stored vectors.
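
A minimal sketch of how a Python service could read these two variables with the defaults listed above. The Settings type and load_settings function are hypothetical and are not taken from the backend source.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    asr_model: str
    text_embedding_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read model names from the environment, falling back to the README defaults."""
    return Settings(
        asr_model=env.get("ASR_MODEL", "openai/whisper-base"),
        text_embedding_model=env.get(
            "TEXT_EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        ),
    )


# Passing a dict instead of os.environ makes the override easy to see.
print(load_settings({"ASR_MODEL": "openai/whisper-tiny"}))
```

Overriding only ASR_MODEL leaves the embedding model at its default, which matters because only a change to the embedding model requires wiping the chroma_data volume.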

Host port mapping is also in docker-compose.yml; change the left side of "8082:8080" if 8082 conflicts with something else.