Creating a Wake Word with Rustpotter and Edge-TTS

Example wake word: “Hei, Dorele!”

This guide walks you step-by-step through:
- installing Rustpotter,
- generating wake-word and “none” samples with Edge-TTS,
- training a wake-word model,
- and testing it on files and live microphone audio.


The example wake word is “Hei, Dorele!”, but you can replace it with any phrase.


1. What is Rustpotter?

Rustpotter is an open-source wake word / keyword spotter written in Rust. You give it examples of:
- positive (wake word) audio clips, and
- negative (“none”) audio clips,
and it trains a lightweight neural network that can run in real time on modest hardware (mini PCs, SBCs, etc.).


Rustpotter learns labels from the filenames:
- files whose names start with [label] → class label (e.g. [hei_dorele]...)
- files without brackets → special label none (background / not wake word)


In this tutorial we’ll use:
- wake word label: hei_dorele
- special negative label: none


2. Prerequisites


You’ll need:
- A 64-bit Linux or Windows machine (I use a Dell OptiPlex 3000 Thin Client with an Intel N5105 CPU).
- A working microphone (for testing).
- Python 3.9+.
- rustpotter-cli binary.
- edge-tts Python library (to synthesize audio).
- FFmpeg (to convert audio to 16 kHz mono WAV).

2.1 Install rustpotter-cli

Go to the rustpotter-cli GitHub releases page and download the appropriate binary for your OS/architecture.

Linux (example for Debian/Ubuntu, x86_64):

cd ~/Downloads
curl -OL https://github.com/GiviMAD/rustpotter-cli/releases/download/v3.0.2/rustpotter-cli_debian_amd64
chmod +x rustpotter-cli_debian_amd64
sudo mv rustpotter-cli_debian_amd64 /usr/local/bin/rustpotter-cli

rustpotter-cli --version

2.2 Install Python dependencies

Install Edge-TTS:

python -m pip install --upgrade edge-tts
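
You can sanity-check the install and browse the available Romanian voices with the CLI that the pip package installs (grep shown for Linux; filter however you like on Windows):

edge-tts --list-voices | grep ro-RO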

Install FFmpeg:

sudo apt install ffmpeg
ffmpeg -version

3. Project layout

Create a workspace for your wake word:

mkdir -p ~/rustpotter-data/hei_dorele
cd ~/rustpotter-data/hei_dorele

4. Generate wake word and “none” samples with Edge-TTS

We’ll use Edge-TTS to synthesize:
- wake word samples: “Hei, Dorele!”
- none samples: random Romanian phrases that do not contain the wake word.

Rustpotter uses filename labels like this:
- [hei_dorele]something.wav → label hei_dorele (our wake word)
- none_*.wav (no brackets) → label none (non-wake data)

4.1 Script: generate_hei_dorele_samples.py

Note: the FFMPEG and NOISE_DIR paths at the top of the script are set for Windows; on Linux, point FFMPEG at your ffmpeg binary (e.g. /usr/bin/ffmpeg) and NOISE_DIR at your noise folder.

#!/usr/bin/env python3 
import asyncio
import random
import re
import subprocess
from pathlib import Path

import edge_tts  # pip install edge-tts

# =======================
#  Wake word
# =======================
WAKE_PHRASE = "Hei, Dorele!"

# Prefix for all final WAV filenames
WAV_PREFIX = "[hei_dorele]"

# Number of distinct TTS base utterances per voice
BASES_PER_VOICE = 40          # 2 voices * 40 = 80 bases
NOISY_VARIANTS_PER_BASE = 4   # 1 clean + 4 noisy = 5 per base
# Total files = 2 * 40 * 5 = 400

# =======================
#  Paths / Config
# =======================

# Output root for wake-word
OUTPUT_DIR = Path("train/hei_dorele")

# ffmpeg path (change if needed)
FFMPEG = r"C:\ffmpeg\bin\ffmpeg.exe"

# Absolute path to noise directory with MP3/WAV noise files
NOISE_DIR = Path(r"C:\Users\YourUser\Desktop\Voice\Noise") # Change with your path

# Edge TTS voices
EDGE_VOICES = [
    "ro-RO-AlinaNeural",
    "ro-RO-EmilNeural",
]

# Rate/pitch randomization ranges (for variety)
EDGE_RATE_MIN = -15   # percent
EDGE_RATE_MAX = +15
EDGE_PITCH_MIN = -30  # Hz
EDGE_PITCH_MAX = +30

# Noise level (relative gain)
NOISY_GAIN_MIN = 0.10
NOISY_GAIN_MAX = 0.25

# =======================
#  Helpers
# =======================

def ensure_ffmpeg():
    if not Path(FFMPEG).exists():
        raise SystemExit(f"ffmpeg not found at {FFMPEG}. Fix the path or install ffmpeg there.")

def ensure_noise_dir():
    if not NOISE_DIR.exists():
        print(f"[!] Noise dir {NOISE_DIR} does not exist. No noisy samples will be created.")
        return False
    noise_files = list(NOISE_DIR.glob("*.mp3")) + list(NOISE_DIR.glob("*.wav"))
    if not noise_files:
        print(f"[!] No .mp3/.wav noise files found in {NOISE_DIR}. No noisy samples will be created.")
        return False
    return True

def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[ăâ]", "a", text)
    text = re.sub(r"[î]", "i", text)
    text = re.sub(r"[șş]", "s", text)
    text = re.sub(r"[țţ]", "t", text)
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")

def ffmpeg_resample_to_16k(in_path: Path, out_path: Path):
    """Resample any audio to 16 kHz mono WAV."""
    cmd = [
        FFMPEG,
        "-y",
        "-i", str(in_path),
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz
        "-acodec", "pcm_s16le",
        str(out_path),
    ]
    print(f"[+] Resampling {in_path.name} -> {out_path.name}")
    subprocess.run(cmd, check=True)

def add_noise(clean_wav: Path, out_wav: Path, noise_files):
    """Mix clean_wav with a random noise file from noise_files."""
    noise_file = random.choice(noise_files)
    noise_vol = f"{random.uniform(NOISY_GAIN_MIN, NOISY_GAIN_MAX):.2f}"

    # IMPORTANT: duration=first to keep full speech (never cut "Dorele")
    cmd = [
        FFMPEG,
        "-y",
        "-i", str(clean_wav),
        "-i", str(noise_file),
        "-filter_complex",
        f"[1:a]volume={noise_vol}[n];"
        f"[0:a][n]amix=inputs=2:duration=first:dropout_transition=2",
        "-ac", "1",
        "-ar", "16000",
        "-acodec", "pcm_s16le",
        str(out_wav),
    ]
    print(f"[+] Adding noise ({noise_file.name}, vol={noise_vol}) -> {out_wav.name}")
    subprocess.run(cmd, check=True)

def random_rate_pitch():
    """Return random (rate, pitch) within configured ranges."""
    rate = random.randint(EDGE_RATE_MIN, EDGE_RATE_MAX)
    pitch = random.randint(EDGE_PITCH_MIN, EDGE_PITCH_MAX)
    return rate, pitch

# =======================
#  Synthesis for one base
# =======================

async def synth_one_base(base_id: int, voice: str, noise_files):
    """
    Create one TTS base utterance (clean) for given voice,
    then generate NOISY_VARIANTS_PER_BASE noisy versions.
    """
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    voice_tag = voice.split("-")[-1].replace("Neural", "").lower()  # 'alina', 'emil'
    base_slug = slugify(WAKE_PHRASE) or "sample"

    # Temporary MP3 from Edge (no prefix needed, will be deleted)
    tmp_mp3 = OUTPUT_DIR / f"edge_tmp_base{base_id:03d}_{voice_tag}.mp3"

    # Clean WAV (16 kHz mono) with [hei_dorele] prefix
    clean_wav = OUTPUT_DIR / f"{WAV_PREFIX}edge_base{base_id:03d}_{voice_tag}_{base_slug}_clean.wav"

    rate, pitch = random_rate_pitch()
    rate_str = f"{rate:+d}%"
    pitch_str = f"{pitch:+d}Hz"

    communicate = edge_tts.Communicate(
        WAKE_PHRASE,
        voice,
        rate=rate_str,
        pitch=pitch_str,
    )

    print(f"[EDGE] base {base_id:03d} voice={voice_tag} rate={rate} pitch={pitch} -> {tmp_mp3.name}")
    await communicate.save(str(tmp_mp3))

    # Convert to clean 16kHz mono WAV
    ffmpeg_resample_to_16k(tmp_mp3, clean_wav)
    tmp_mp3.unlink()  # delete temp mp3

    # Generate noisy variants, also prefixed with [hei_dorele]
    if noise_files:
        for k in range(1, NOISY_VARIANTS_PER_BASE + 1):
            noisy_wav = OUTPUT_DIR / (
                f"{WAV_PREFIX}edge_base{base_id:03d}_{voice_tag}_{base_slug}_noisy{k}.wav"
            )
            add_noise(clean_wav, noisy_wav, noise_files)

    # Small delay to be nice to Edge (avoid 429)
    await asyncio.sleep(0.3)

# =======================
#  Main
# =======================

async def main():
    ensure_ffmpeg()
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    have_noise = ensure_noise_dir()
    noise_files = []
    if have_noise:
        noise_files = list(NOISE_DIR.glob("*.mp3")) + list(NOISE_DIR.glob("*.wav"))
        print(f"[i] Using {len(noise_files)} noise files from {NOISE_DIR}")
    else:
        print("[!] Proceeding with clean samples only (no noise).")

    total_bases = len(EDGE_VOICES) * BASES_PER_VOICE
    total_files = total_bases * (1 + (NOISY_VARIANTS_PER_BASE if have_noise else 0))
    print(f"[i] Target: {total_files} WAV files "
          f"({len(EDGE_VOICES)} voices × {BASES_PER_VOICE} bases × "
          f"{'1 clean + ' + str(NOISY_VARIANTS_PER_BASE) + ' noisy' if have_noise else '1 clean'}).")

    base_counter = 0
    for voice in EDGE_VOICES:
        for _ in range(BASES_PER_VOICE):
            base_counter += 1
            await synth_one_base(base_counter, voice, noise_files)

    print("\nDone. Check folder:", OUTPUT_DIR)

if __name__ == "__main__":
    asyncio.run(main())

4.2 Run the generator

Save the script above as generate_hei_dorele_samples.py and run it from your workspace folder:
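python generate_hei_dorele_samples.py

With the defaults above this produces 2 voices × 40 bases × (1 clean + 4 noisy) = 400 WAV files in train/hei_dorele.

4.3 Generate “none” samples

The split script in section 5 expects negative examples in train/none, but so far we have only synthesized the wake word. Below is a minimal sketch for generating negatives with Edge-TTS; the phrase list, file names, and ffmpeg path are illustrative assumptions: any short utterances that do not contain the wake word will do. Note the output names have no brackets, so Rustpotter assigns them the special none label.

#!/usr/bin/env python3
# Sketch: generate "none" (negative) samples with Edge-TTS.
# The phrases, names, and paths are illustrative; adapt them.
import asyncio
import re
import subprocess
from pathlib import Path

import edge_tts  # pip install edge-tts

OUTPUT_DIR = Path("train/none")
FFMPEG = "ffmpeg"  # or the full path used in the wake-word script

VOICES = ["ro-RO-AlinaNeural", "ro-RO-EmilNeural"]

# Short Romanian phrases that do NOT contain the wake word.
PHRASES = [
    "Bună dimineața!",
    "Ce mai faci astăzi?",
    "Afară plouă destul de tare.",
    "Televizorul este pornit în sufragerie.",
    "Mi-ar plăcea o cafea caldă.",
]

def slugify(text: str) -> str:
    # Same normalization as the wake-word script.
    text = text.strip().lower()
    text = re.sub(r"[ăâ]", "a", text)
    text = re.sub(r"[î]", "i", text)
    text = re.sub(r"[șş]", "s", text)
    text = re.sub(r"[țţ]", "t", text)
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")

async def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    counter = 0
    for voice in VOICES:
        for phrase in PHRASES:
            counter += 1
            tmp_mp3 = OUTPUT_DIR / f"tmp_{counter:03d}.mp3"
            # No brackets in the name -> Rustpotter labels it "none".
            out_wav = OUTPUT_DIR / f"none_edge_{counter:03d}_{slugify(phrase)}.wav"
            await edge_tts.Communicate(phrase, voice).save(str(tmp_mp3))
            # Convert to 16 kHz mono WAV, as Rustpotter expects.
            subprocess.run(
                [FFMPEG, "-y", "-i", str(tmp_mp3),
                 "-ac", "1", "-ar", "16000", "-acodec", "pcm_s16le",
                 str(out_wav)],
                check=True,
            )
            tmp_mp3.unlink()
            await asyncio.sleep(0.3)  # avoid hammering the Edge endpoint

if __name__ == "__main__":
    asyncio.run(main())

A handful of phrases is only a starting point: expand PHRASES and, ideally, add noisy variants the same way the wake-word script does, so the none class covers your real acoustic environment.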

5. Create train/test split (80/20)

Rustpotter’s trainer expects two folders, passed via --train-dir and --test-dir:
- rp_train/ – training wavs
- rp_test/ – test wavs
Both should contain a mix of [hei_dorele]*.wav and none_*.wav. The script below produces them.

#!/usr/bin/env python3
import random
import shutil
from pathlib import Path

# Base folder where your current structure lives
BASE = Path(r"C:\Users\YourUser\Desktop\Voice")

# Source folders
HEI_DIR = BASE / "train" / "hei_dorele"  # [hei_dorele]*.wav
NONE_DIR = BASE / "train" / "none"       # negative examples

# Output folders for Rustpotter training
TRAIN_OUT = BASE / "rp_train"
TEST_OUT  = BASE / "rp_test"

# Train/test ratio
TRAIN_RATIO = 0.8  # 80% train, 20% test

def split_and_copy(src_dir: Path, train_out: Path, test_out: Path, label_name: str):
    """Split wavs in src_dir into train/test and copy them."""
    files = sorted(src_dir.glob("*.wav"))
    total = len(files)
    if total == 0:
        print(f"[!] No wav files found in {src_dir} for label '{label_name}'")
        return

    print(f"[i] Found {total} '{label_name}' files in {src_dir}")

    random.shuffle(files)
    n_train = int(total * TRAIN_RATIO)
    train_files = files[:n_train]
    test_files = files[n_train:]

    # Copy train files
    for f in train_files:
        dst = train_out / f.name
        shutil.copy2(f, dst)

    # Copy test files
    for f in test_files:
        dst = test_out / f.name
        shutil.copy2(f, dst)

    print(f"    -> {len(train_files)} to {train_out}")
    print(f"    -> {len(test_files)} to {test_out}")

def main():
    # Create output dirs
    TRAIN_OUT.mkdir(parents=True, exist_ok=True)
    TEST_OUT.mkdir(parents=True, exist_ok=True)

    # Fixed seed for reproducibility
    random.seed(42)

    # Split wake word and non-wake word
    split_and_copy(HEI_DIR, TRAIN_OUT, TEST_OUT, "hei_dorele")
    split_and_copy(NONE_DIR, TRAIN_OUT, TEST_OUT, "none")

    print("\nDone. You can now use:")
    print(f"  rustpotter-cli train -t small --train-dir {TRAIN_OUT} --test-dir {TEST_OUT} ...")

if __name__ == "__main__":
    main()
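
Save it as, for example, split_train_test.py (the name is arbitrary) and run:

python split_train_test.py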

Now you have:

rp_train/
  [hei_dorele]edge_base001_alina_hei_dorele_clean.wav
  [hei_dorele]edge_base001_alina_hei_dorele_noisy1.wav
  none_edge_001_buna_dimineata.wav
  ...
rp_test/
  [hei_dorele]edge_base0XX_..._clean.wav
  none_edge_0YY_...wav

6. Train the wake word model

On Linux, cd into your data folder and run:

rustpotter-cli train \
  -t small \
  --train-dir rp_train \
  --test-dir rp_test \
  -l 0.017 \
  -e 2500 \
  --test-epochs 10 \
  hei_dorele_small.rpw

Explanation:

-t small – model type; options: tiny, small, medium, large.
--train-dir, --test-dir – our generated folders.
-l 0.017 – learning rate.
-e 2500 – total epochs.
--test-epochs 10 – evaluate with the test set every 10 epochs.
hei_dorele_small.rpw – output model file.

You should see something like:

Start training hei_dorele_small.rpw!
Model type: small.
Labels: ["none", "hei_dorele"].
Training with 670 records.
Testing with 163 records.
Training on XXXXms of audio.
...
2500 train loss: 0.00002 test acc: 99.39%
hei_dorele_small.rpw created!


This .rpw file is your trained wake word model for “Hei, Dorele!”.

7. Test the model on sample files

Use the test command to see how the model behaves on individual WAV files.
Positive example (wake word):

rustpotter-cli test \
  -g \
  --gain-ref 0.004 \
  hei_dorele_small.rpw \
  "test/[hei_dorele]edge_tts_001_hei_dorele.wav"

-g enables gain normalization.
--gain-ref sets the reference gain level.
You should see a “wakeword detection” log with label hei_dorele.

Negative example (“none”):

rustpotter-cli test \
  -g \
  --gain-ref 0.004 \
  hei_dorele_small.rpw \
  rp_test/none_edge_001_buna_dimineata.wav

You should get no detection or clearly lower scores (depending on threshold and options).
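
To sweep the whole test folder at once, a simple shell loop works (bash shown; file names will vary with your split):

for f in rp_test/*.wav; do
  echo "== $f"
  rustpotter-cli test -g --gain-ref 0.004 hei_dorele_small.rpw "$f"
done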

8. Live detection from microphone (spot)

Finally, run the model live to detect when someone says “Hei, Dorele!” in real time.

First, list the available audio devices:

rustpotter-cli devices -c -m 1

This shows the input devices and their supported configurations (sample rate, channels). Note the device index of your microphone (e.g., 0) and a config index that supports mono input (e.g., 0).

Then run spot with the wake word model:

rustpotter-cli spot \
  -g \
  --gain-ref 0.004 \
  -t 0.5 \
  -m 6 \
  -e \
  --device-index 0 \
  --config-index 0 \
  hei_dorele_small.rpw

Options:

-g --gain-ref 0.004 – gain normalization.
-t 0.5 – detection threshold (tune up/down as needed).
-m 6 – require at least 6 consecutive positive frames.
-e – eager: emit detection as soon as conditions are met.
--device-index, --config-index – input device and audio format.

Now speak “Hei, Dorele!” into the mic:

The CLI should print a detection line whenever it recognizes the wake word.
Try from different distances, with different background noise.

If it misses too often → lower the threshold (-t 0.45) or reduce -m.
If it triggers too easily (false positives) → raise threshold (0.6) or increase -m.

9. Where to go next

- Add real microphone recordings (not just TTS) to your dataset and retrain.
- Add more diverse none samples from your real environment (TV, people talking).
- Use spot’s recording options to capture false positives and feed them back as new none examples.
- Integrate Rustpotter into your voice assistant pipeline, e.g.: Rustpotter detects “Hei, Dorele!”, you record the next few seconds and send them to Whisper or another STT, and then you control your smart home, bots, etc. A rough sketch of this loop is shown below.
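
The sketch below wraps rustpotter-cli spot in a subprocess and records a few seconds with ffmpeg whenever a line of its output mentions the label. The substring match on "hei_dorele" and the ALSA recording command are assumptions (the exact detection log format depends on your rustpotter-cli version), so treat this as a starting point rather than a drop-in implementation.

#!/usr/bin/env python3
import subprocess

# Launch rustpotter-cli spot with the same options as above.
SPOT_CMD = [
    "rustpotter-cli", "spot",
    "-g", "--gain-ref", "0.004",
    "-t", "0.5", "-m", "6", "-e",
    "--device-index", "0", "--config-index", "0",
    "hei_dorele_small.rpw",
]

def record_command(path: str, seconds: int = 4):
    # Assumption: ffmpeg with ALSA input on Linux; adapt for your OS.
    return ["ffmpeg", "-y", "-f", "alsa", "-i", "default",
            "-t", str(seconds), "-ac", "1", "-ar", "16000", path]

def main():
    proc = subprocess.Popen(SPOT_CMD, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        print(line, end="")
        # Assumption: detection lines contain the label name.
        if "hei_dorele" in line:
            # Wake word heard: capture the follow-up utterance,
            # then hand utterance.wav to Whisper or another STT.
            subprocess.run(record_command("utterance.wav"), check=True)

if __name__ == "__main__":
    main()

Note that many systems will not let two processes open the microphone at once, so in practice you may need to stop spotting while recording, or route audio through a shared server such as PulseAudio.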
This setup gives new users a full path from nothing to a working wake word based on Rustpotter + Edge-TTS, with both wake and none classes covered.
