TTS ONNX Node.js Implementation
Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech from text.
Requirements
- Node.js v16 or higher
- npm or yarn
Installation
cd nodejs
npm install
Basic Usage
Example 1: Default Inference
Run inference with default settings:
npm start
Or:
node example_onnx.js
This will use:
- Voice style:
assets/voice_styles/M1.json - Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory:
results/ - Total steps: 5
- Number of generations: 4
Example 2: Batch Inference
Process multiple voice styles and texts at once:
node example_onnx.js \
--voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch
Example 3: High Quality Inference
Increase denoising steps for better quality:
node example_onnx.js \
--total-step 10 \
--voice-style "assets/voice_styles/M1.json" \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
Available Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
--use-gpu |
flag | False | Use GPU for inference (not supported yet) |
--onnx-dir |
str | assets/onnx |
Path to ONNX model directory |
--total-step |
int | 5 | Number of denoising steps (higher = better quality, slower) |
--n-test |
int | 4 | Number of times to generate each sample |
--voice-style |
str+ | assets/voice_styles/M1.json |
Voice style file path(s). Separate multiple files with commas |
--text |
str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
--save-dir |
str | results |
Output directory |
Notes
- Batch Processing: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
- Quality vs Speed: Higher
--total-stepvalues produce better quality but take longer - GPU Support: GPU mode is not supported yet
Architecture
-
helper.js: Node.js port of Python'shelper.pyPreprocessor: Audio preprocessing (STFT, Mel Spectrogram)UnicodeProcessor: Text preprocessing- Utility functions (mask generation, tensor conversion, etc.)
-
example_onnx.js: Main inference script- ONNX model loading
- TTS inference pipeline execution
- WAV file saving
-
package.json: Node.js project configuration and dependencies
Implementation Notes
-
Pure Node.js WAV Processing: Writes WAV files without external native libraries. Outputs 16-bit PCM format.
-
Memory Efficiency: Note that Node.js may consume significant memory when processing large arrays.
-
Performance: The mel spectrogram extraction (Step 1-1) is currently slower than Python's Librosa, which uses highly optimized C extensions. This bottleneck could be further improved with additional optimizations such as WASM-based FFT libraries or native addons.