Creating a Wake Word with Rustpotter and Edge-TTS
Example wake word: “Hei, Dorele!”
This guide walks you step-by-step through:
- installing Rustpotter,
- generating wake-word and “none” samples with Edge-TTS,
- training a wake-word model,
- and testing it on files and live microphone audio.
The example wake word is “Hei, Dorele!”, but you can replace it with any phrase.
1. What is Rustpotter?
Rustpotter is an open-source wake word / keyword spotter written in Rust. You give it examples of:
- positive (wake word) audio clips, and
- negative (“none”) audio clips,
and it trains a lightweight neural network that can run in real time on modest hardware (mini PCs, SBCs, etc.).
Rustpotter learns labels from the filenames:
- files whose names start with [label] → class label (e.g. [hei_dorele]...)
- files without brackets → special label none (background / not wake word)
In this tutorial we’ll use:
- wake word label: hei_dorele
- special negative label: none
2. Prerequisites
You’ll need:
- A 64-bit Linux or Windows machine (I use a Dell OptiPlex 3000 Thin Client with an Intel N5105 CPU).
- A working microphone (for testing).
- Python 3.9+.
- rustpotter-cli binary.
- edge-tts Python library (to synthesize audio).
- FFmpeg (to convert audio to 16 kHz mono WAV).
2.1 Install rustpotter-cli
Go to the rustpotter-cli releases and download the appropriate binary for your OS/architecture.
Linux (example for Debian/Ubuntu, x86_64):
cd ~/Downloads
curl -OL https://github.com/GiviMAD/rustpotter-cli/releases/download/v3.0.2/rustpotter-cli_debian_amd64
chmod +x rustpotter-cli_debian_amd64
sudo mv rustpotter-cli_debian_amd64 /usr/local/bin/rustpotter-cli
rustpotter-cli --version
2.2 Install Python dependencies
Install Edge-TTS:
python -m pip install --upgrade edge-tts
Install FFmpeg (Debian/Ubuntu):
sudo apt install ffmpeg
ffmpeg -version
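The scripts below invoke FFmpeg for you, but for reference, this is the conversion they perform (any input format to 16 kHz mono 16-bit PCM WAV):
ffmpeg -y -i input.mp3 -ac 1 -ar 16000 -acodec pcm_s16le output.wav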
3. Project layout
Create a workspace for your wake word:
mkdir -p ~/rustpotter-data/hei_dorele
cd ~/rustpotter-data/hei_dorele
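For orientation, this is roughly how the workspace will look by the end of section 5 (the folder names follow the scripts in sections 4 and 5):
~/rustpotter-data/hei_dorele/
  train/
    hei_dorele/   # generated wake-word WAVs ([hei_dorele]*.wav)
    none/         # generated negative WAVs (none_*.wav)
  rp_train/       # 80% of both classes, used for training
  rp_test/        # 20% of both classes, used for evaluation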
4. Generate wake word and “none” samples with Edge-TTS
We’ll use Edge-TTS to synthesize:
- wake word samples: “Hei, Dorele!”
- none samples: random Romanian phrases that do not contain the wake word.
Rustpotter uses filename labels like this:
- [hei_dorele]something.wav → label hei_dorele (our wake word)
- none_*.wav (no brackets) → label none (non-wake data)
4.1 Script: generate_hei_dorele_samples.py
#!/usr/bin/env python3
import asyncio
import random
import re
import subprocess
from pathlib import Path

import edge_tts  # pip install edge-tts

# =======================
# Wake word
# =======================
WAKE_PHRASE = "Hei, Dorele!"

# Prefix for all final WAV filenames
WAV_PREFIX = "[hei_dorele]"

# Number of distinct TTS base utterances per voice
BASES_PER_VOICE = 40          # 2 voices * 40 = 80 bases
NOISY_VARIANTS_PER_BASE = 4   # 1 clean + 4 noisy = 5 per base
# Total files = 2 * 40 * 5 = 400

# =======================
# Paths / Config
# =======================
# Output root for wake-word samples
OUTPUT_DIR = Path("train/hei_dorele")

# ffmpeg path (change if needed; on Linux plain "ffmpeg" works if it is on PATH)
FFMPEG = r"C:\ffmpeg\bin\ffmpeg.exe"

# Absolute path to noise directory with MP3/WAV noise files
NOISE_DIR = Path(r"C:\Users\YourUser\Desktop\Voice\Noise")  # change to your path

# Edge TTS voices
EDGE_VOICES = [
    "ro-RO-AlinaNeural",
    "ro-RO-EmilNeural",
]

# Rate/pitch randomization ranges (for variety)
EDGE_RATE_MIN = -15   # percent
EDGE_RATE_MAX = +15
EDGE_PITCH_MIN = -30  # Hz
EDGE_PITCH_MAX = +30

# Noise level (relative gain)
NOISY_GAIN_MIN = 0.10
NOISY_GAIN_MAX = 0.25

# =======================
# Helpers
# =======================
def ensure_ffmpeg():
    if not Path(FFMPEG).exists():
        raise SystemExit(f"ffmpeg not found at {FFMPEG}. Fix the path or install ffmpeg there.")

def ensure_noise_dir():
    if not NOISE_DIR.exists():
        print(f"[!] Noise dir {NOISE_DIR} does not exist. No noisy samples will be created.")
        return False
    noise_files = list(NOISE_DIR.glob("*.mp3")) + list(NOISE_DIR.glob("*.wav"))
    if not noise_files:
        print(f"[!] No .mp3/.wav noise files found in {NOISE_DIR}. No noisy samples will be created.")
        return False
    return True

def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[ăâ]", "a", text)
    text = re.sub(r"[î]", "i", text)
    text = re.sub(r"[șş]", "s", text)
    text = re.sub(r"[țţ]", "t", text)
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")

def ffmpeg_resample_to_16k(in_path: Path, out_path: Path):
    """Resample any audio to 16 kHz mono WAV."""
    cmd = [
        FFMPEG,
        "-y",
        "-i", str(in_path),
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz
        "-acodec", "pcm_s16le",
        str(out_path),
    ]
    print(f"[+] Resampling {in_path.name} -> {out_path.name}")
    subprocess.run(cmd, check=True)

def add_noise(clean_wav: Path, out_wav: Path, noise_files):
    """Mix clean_wav with a random noise file from noise_files."""
    noise_file = random.choice(noise_files)
    noise_vol = f"{random.uniform(NOISY_GAIN_MIN, NOISY_GAIN_MAX):.2f}"
    # IMPORTANT: duration=first to keep full speech (never cut "Dorele")
    cmd = [
        FFMPEG,
        "-y",
        "-i", str(clean_wav),
        "-i", str(noise_file),
        "-filter_complex",
        f"[1:a]volume={noise_vol}[n];"
        f"[0:a][n]amix=inputs=2:duration=first:dropout_transition=2",
        "-ac", "1",
        "-ar", "16000",
        "-acodec", "pcm_s16le",
        str(out_wav),
    ]
    print(f"[+] Adding noise ({noise_file.name}, vol={noise_vol}) -> {out_wav.name}")
    subprocess.run(cmd, check=True)

def random_rate_pitch():
    """Return random (rate, pitch) within configured ranges."""
    rate = random.randint(EDGE_RATE_MIN, EDGE_RATE_MAX)
    pitch = random.randint(EDGE_PITCH_MIN, EDGE_PITCH_MAX)
    return rate, pitch

# =======================
# Synthesis for one base
# =======================
async def synth_one_base(base_id: int, voice: str, noise_files):
    """
    Create one TTS base utterance (clean) for the given voice,
    then generate NOISY_VARIANTS_PER_BASE noisy versions.
    """
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    voice_tag = voice.split("-")[-1].replace("Neural", "").lower()  # 'alina', 'emil'
    base_slug = slugify(WAKE_PHRASE) or "sample"

    # Temporary MP3 from Edge (no prefix needed, will be deleted)
    tmp_mp3 = OUTPUT_DIR / f"edge_tmp_base{base_id:03d}_{voice_tag}.mp3"
    # Clean WAV (16 kHz mono) with [hei_dorele] prefix
    clean_wav = OUTPUT_DIR / f"{WAV_PREFIX}edge_base{base_id:03d}_{voice_tag}_{base_slug}_clean.wav"

    rate, pitch = random_rate_pitch()
    rate_str = f"{rate:+d}%"
    pitch_str = f"{pitch:+d}Hz"

    communicate = edge_tts.Communicate(
        WAKE_PHRASE,
        voice,
        rate=rate_str,
        pitch=pitch_str,
    )
    print(f"[EDGE] base {base_id:03d} voice={voice_tag} rate={rate} pitch={pitch} -> {tmp_mp3.name}")
    await communicate.save(str(tmp_mp3))

    # Convert to clean 16 kHz mono WAV
    ffmpeg_resample_to_16k(tmp_mp3, clean_wav)
    tmp_mp3.unlink()  # delete temp mp3

    # Generate noisy variants, also prefixed with [hei_dorele]
    if noise_files:
        for k in range(1, NOISY_VARIANTS_PER_BASE + 1):
            noisy_wav = OUTPUT_DIR / (
                f"{WAV_PREFIX}edge_base{base_id:03d}_{voice_tag}_{base_slug}_noisy{k}.wav"
            )
            add_noise(clean_wav, noisy_wav, noise_files)

    # Small delay to be nice to Edge (avoid 429)
    await asyncio.sleep(0.3)

# =======================
# Main
# =======================
async def main():
    ensure_ffmpeg()
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    have_noise = ensure_noise_dir()
    noise_files = []
    if have_noise:
        noise_files = list(NOISE_DIR.glob("*.mp3")) + list(NOISE_DIR.glob("*.wav"))
        print(f"[i] Using {len(noise_files)} noise files from {NOISE_DIR}")
    else:
        print("[!] Proceeding with clean samples only (no noise).")

    total_bases = len(EDGE_VOICES) * BASES_PER_VOICE
    total_files = total_bases * (1 + (NOISY_VARIANTS_PER_BASE if have_noise else 0))
    print(f"[i] Target: {total_files} WAV files "
          f"({len(EDGE_VOICES)} voices × {BASES_PER_VOICE} bases × "
          f"{'1 clean + ' + str(NOISY_VARIANTS_PER_BASE) + ' noisy' if have_noise else '1 clean'}).")

    base_counter = 0
    for voice in EDGE_VOICES:
        for _ in range(BASES_PER_VOICE):
            base_counter += 1
            await synth_one_base(base_counter, voice, noise_files)

    print("\nDone. Check folder:", OUTPUT_DIR)

if __name__ == "__main__":
    asyncio.run(main())
4.2 Run the generator
python generate_hei_dorele_samples.py
With the defaults above this produces 400 wake-word WAVs in train/hei_dorele (80 clean + 320 noisy), or just the 80 clean ones if no noise files are available.
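The generator above only creates the wake-word class, but section 5 also expects negative samples in train/none. The following is a minimal sketch for producing them with Edge-TTS, assuming the same two voices and an ffmpeg binary on PATH; the phrase list is only an example and should be much larger (and must never contain the wake word):
#!/usr/bin/env python3
# Sketch: generate "none" (negative) samples with Edge-TTS.
# Assumptions: ffmpeg is on PATH; PHRASES below are placeholders only.
import asyncio
import re
import subprocess
from pathlib import Path

import edge_tts

OUTPUT_DIR = Path("train/none")
VOICES = ["ro-RO-AlinaNeural", "ro-RO-EmilNeural"]
PHRASES = [  # extend with many more phrases; none may contain the wake word
    "Bună dimineața",
    "Cât e ceasul",
    "Aprinde lumina în sufragerie",
]

def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[ăâ]", "a", text)
    text = re.sub(r"[î]", "i", text)
    text = re.sub(r"[șş]", "s", text)
    text = re.sub(r"[țţ]", "t", text)
    return re.sub(r"[^a-z0-9]+", "_", text).strip("_")

async def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    i = 0
    for voice in VOICES:
        for phrase in PHRASES:
            i += 1
            tmp_mp3 = OUTPUT_DIR / f"tmp_{i:03d}.mp3"
            # No [brackets] in the name -> Rustpotter assigns the "none" label.
            out_wav = OUTPUT_DIR / f"none_edge_tts_{i:03d}_{slugify(phrase)}.wav"
            await edge_tts.Communicate(phrase, voice).save(str(tmp_mp3))
            subprocess.run(["ffmpeg", "-y", "-i", str(tmp_mp3), "-ac", "1",
                            "-ar", "16000", "-acodec", "pcm_s16le", str(out_wav)],
                           check=True)
            tmp_mp3.unlink()
            await asyncio.sleep(0.3)  # be nice to the Edge endpoint

if __name__ == "__main__":
    asyncio.run(main())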
5. Create train/test split (80/20)
Rustpotter’s trainer expects two folders, passed via --train-dir and --test-dir:
- a training folder (here: rp_train/),
- a test folder (here: rp_test/).
Both should contain a mix of [hei_dorele]*.wav and none_*.wav files. The script below shuffles each class and copies 80% into rp_train/ and 20% into rp_test/.
#!/usr/bin/env python3
import random
import shutil
from pathlib import Path

# Base folder where your current structure lives
BASE = Path(r"C:\Users\YourUser\Desktop\Voice")

# Source folders
HEI_DIR = BASE / "train" / "hei_dorele"   # [hei_dorele]*.wav
NONE_DIR = BASE / "train" / "none"        # negative examples

# Output folders for Rustpotter training
TRAIN_OUT = BASE / "rp_train"
TEST_OUT = BASE / "rp_test"

# Train/test ratio
TRAIN_RATIO = 0.8  # 80% train, 20% test

def split_and_copy(src_dir: Path, train_out: Path, test_out: Path, label_name: str):
    """Split wavs in src_dir into train/test and copy them."""
    files = sorted(src_dir.glob("*.wav"))
    total = len(files)
    if total == 0:
        print(f"[!] No wav files found in {src_dir} for label '{label_name}'")
        return
    print(f"[i] Found {total} '{label_name}' files in {src_dir}")

    random.shuffle(files)
    n_train = int(total * TRAIN_RATIO)
    train_files = files[:n_train]
    test_files = files[n_train:]

    # Copy train files
    for f in train_files:
        dst = train_out / f.name
        shutil.copy2(f, dst)
    # Copy test files
    for f in test_files:
        dst = test_out / f.name
        shutil.copy2(f, dst)

    print(f" -> {len(train_files)} to {train_out}")
    print(f" -> {len(test_files)} to {test_out}")

def main():
    # Create output dirs
    TRAIN_OUT.mkdir(parents=True, exist_ok=True)
    TEST_OUT.mkdir(parents=True, exist_ok=True)

    # Fixed seed for reproducibility
    random.seed(42)

    # Split wake word and non-wake-word data
    split_and_copy(HEI_DIR, TRAIN_OUT, TEST_OUT, "hei_dorele")
    split_and_copy(NONE_DIR, TRAIN_OUT, TEST_OUT, "none")

    print("\nDone. You can now use:")
    print(f"  rustpotter-cli train -t small --train-dir {TRAIN_OUT} --test-dir {TEST_OUT} ...")

if __name__ == "__main__":
    main()
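Save the script as, say, split_dataset.py (the filename is your choice), adjust BASE to your own folder, and run it:
python split_dataset.py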
Now you have:
rp_train/
  [hei_dorele]edge_base001_alina_hei_dorele_clean.wav
  [hei_dorele]edge_base001_alina_hei_dorele_noisy1.wav
  none_edge_tts_001_buna_dimineata.wav
  ...
rp_test/
  [hei_dorele]edge_base0XX_..._clean.wav
  none_edge_tts_0YY_...wav
6. Train the wake word model
On Linux, cd into the folder that contains rp_train/ and rp_test/, then run:
rustpotter-cli train \
-t small \
--train-dir rp_train \
--test-dir rp_test \
-l 0.017 \
-e 2500 \
--test-epochs 10 \
hei_dorele_small.rpw
Explanation:
-t small – model type; other options are tiny, medium, and large.
--train-dir, --test-dir – our generated folders.
-l 0.017 – learning rate.
-e 2500 – total epochs.
--test-epochs 10 – evaluate with the test set every 10 epochs.
hei_dorele_small.rpw – output model file.
You should see something like:
Start training hei_dorele_small.rpw!
Model type: small.
Labels: ["none", "hei_dorele"].
Training with 670 records.
Testing with 163 records.
Training on XXXXms of audio.
...
2500 train loss: 0.00002 test acc: 99.39%
hei_dorele_small.rpw created!
This .rpw file is your trained wake word model for “Hei, Dorele!”.
7. Test the model on sample files
Use the test subcommand to see how the model behaves on individual WAV files.
Positive example (wake word) – pick any wake-word file from rp_test:
rustpotter-cli test \
-g \
--gain-ref 0.004 \
hei_dorele_small.rpw \
"rp_test/[hei_dorele]edge_base001_alina_hei_dorele_clean.wav"
-g enables gain normalization.
--gain-ref sets the reference gain level.
You should see a “wakeword detection” log with label hei_dorele.
Negative example (“none”):
rustpotter-cli test \
-g \
--gain-ref 0.004 \
hei_dorele_small.rpw \
rp_test/none_edge_tts_001_buna_dimineata.wav
You should get no detection or clearly lower scores (depending on threshold and options).
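To sanity-check the whole held-out set in one go, you can loop over rp_test with a small script. This is just a sketch that shells out to the same test command shown above:
#!/usr/bin/env python3
# Sketch: run rustpotter-cli test on every WAV in rp_test.
import subprocess
from pathlib import Path

for wav in sorted(Path("rp_test").glob("*.wav")):
    print("==>", wav.name)
    subprocess.run(["rustpotter-cli", "test", "-g", "--gain-ref", "0.004",
                    "hei_dorele_small.rpw", str(wav)], check=False)
Files whose names start with [hei_dorele] should produce detections; the none_* files should not.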
8. Live detection from microphone (spot)
Finally, run the model live to detect when someone says “Hei, Dorele!” in real time.
First, list the audio devices:
rustpotter-cli devices -c -m 1
This shows the available input devices and their configurations (sample rate, channels). Note the device index of your microphone (e.g., 0) and a config index that supports mono input (e.g., 0).
Then run spot with the wake word model:
rustpotter-cli spot \
-g \
--gain-ref 0.004 \
-t 0.5 \
-m 6 \
-e \
--device-index 0 \
--config-index 0 \
hei_dorele_small.rpw
Options:
-g --gain-ref 0.004 – gain normalization.
-t 0.5 – detection threshold (tune up/down as needed).
-m 6 – require at least 6 consecutive positive frames.
-e – eager: emit detection as soon as conditions are met.
--device-index, --config-index – input device and audio format.
Now speak “Hei, Dorele!” into the mic:
- The CLI should print a detection line whenever it recognizes the wake word.
- Try from different distances and with different background noise.
- If it misses too often → lower the threshold (-t 0.45) or reduce -m.
- If it triggers too easily (false positives) → raise the threshold (-t 0.6) or increase -m.
9. Where to go next
- Add real microphone recordings (not just TTS) to your dataset and retrain.
- Add more diverse none samples from your real environment (TV, people talking).
- Use spot’s recording options to capture false positives and feed them back as new none examples.
- Integrate Rustpotter into your voice assistant pipeline (see the sketch below): Rustpotter detects “Hei, Dorele!”, you record the next few seconds and send them to Whisper or another STT engine, and then control your smart home, bots, etc.
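As a starting point for that integration, here is a rough sketch that wraps the spot command from section 8 and fires a callback when a detection line mentions the label. It assumes detections are printed to stdout and contain the label text; check the actual log format of your rustpotter-cli version before relying on this.
#!/usr/bin/env python3
# Sketch only: wrap `rustpotter-cli spot` and react to detections.
# Assumption: each detection is printed as a stdout line containing the label.
import subprocess

CMD = ["rustpotter-cli", "spot", "-g", "--gain-ref", "0.004",
       "-t", "0.5", "-m", "6", "-e",
       "--device-index", "0", "--config-index", "0",
       "hei_dorele_small.rpw"]

def on_wake():
    # Placeholder: record a few seconds here and send them to Whisper/your STT.
    print(">>> Wake word detected! Handing off to the rest of the pipeline...")

with subprocess.Popen(CMD, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        print(line, end="")           # pass through rustpotter's own logging
        if "hei_dorele" in line:      # naive match on the label in the log line
            on_wake()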
This setup gives new users a full path from nothing to a working wake word based on Rustpotter + Edge-TTS, with both wake and none classes covered.