This commit is contained in:
ANLGBOY
2025-11-19 01:18:16 +09:00
commit d31536d9fc
74 changed files with 10681 additions and 0 deletions
+1
@@ -0,0 +1 @@
assets/onnx/*.onnx filter=lfs diff=lfs merge=lfs -text
+61
@@ -0,0 +1,61 @@
assets/*
assets/.git
assets/.gitignore
assets/.gitattributes
*.onnx
onnx
# Output files
results
# Python
__pycache__
*.py[cod]
*$py.class
*.so
.Python
# Virtual environments
.venv
venv/
ENV/
env/
# Node.js
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*
package-lock.json
# Swift
.build/
.swiftpm/
*.xcodeproj
*.xcworkspace
xcuserdata/
DerivedData/
# Distribution / packaging
build/
dist/
*.egg-info/
.eggs/
# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
+21
@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 Supertone Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
+393
@@ -0,0 +1,393 @@
# Supertonic — Lightning Fast, On-Device TTS
[![Demo](https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow)](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-blue)](https://huggingface.co/Supertone/supertonic)
<p align="center">
<img src="img/Supertonic_IMG_v02_4x.webp" alt="Supertonic Banner">
</p>
**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic)
### Table of Contents
- [Why Supertonic?](#why-supertonic)
- [Language Support](#language-support)
- [Getting Started](#getting-started)
- [Performance](#performance)
- [Citation](#citation)
- [License](#license)
## Why Supertonic?
- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro), faster than every other system in the [Performance](#performance) benchmarks
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
- **📱 On-Device Capable**: **Complete privacy** and **no network latency**, since all processing happens locally on your device
- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
- **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends
## Language Support
We provide ready-to-use TTS inference examples across multiple ecosystems:
| Language/Platform | Path | Description |
|-------------------|------|-------------|
| [**Python**](py/) | `py/` | ONNX Runtime inference |
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
| [**Java**](java/) | `java/` | Cross-platform JVM |
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
| [**Go**](go/) | `go/` | Go implementation |
| [**Swift**](swift/) | `swift/` | macOS applications |
| [**iOS**](ios/) | `ios/` | Native iOS apps |
| [**Rust**](rust/) | `rust/` | Memory-safe systems |
> For detailed usage instructions, please refer to the README.md in each language directory.
## Getting Started
First, clone the repository:
```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
```
### Prerequisites
Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:
> **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
> - macOS: `brew install git-lfs && git lfs install`
> - Generic: see `https://git-lfs.com` for installers
```bash
git clone https://huggingface.co/Supertone/supertonic assets
```
### Quick Start
**Python Example** ([Details](py/))
```bash
cd py
uv sync
uv run example_onnx.py
```
**Node.js Example** ([Details](nodejs/))
```bash
cd nodejs
npm install
npm start
```
**Browser Example** ([Details](web/))
```bash
cd web
npm install
npm run dev
```
**Java Example** ([Details](java/))
```bash
cd java
mvn clean install
mvn exec:java
```
**C++ Example** ([Details](cpp/))
```bash
cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```
**C# Example** ([Details](csharp/))
```bash
cd csharp
dotnet restore
dotnet run
```
**Go Example** ([Details](go/))
```bash
cd go
go mod download
go run example_onnx.go helper.go
```
**Swift Example** ([Details](swift/))
```bash
cd swift
swift build -c release
.build/release/example_onnx
```
**Rust Example** ([Details](rust/))
```bash
cd rust
cargo build --release
./target/release/example_onnx
```
**iOS Example** ([Details](ios/))
```bash
cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
```
- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
- Choose your iPhone as run destination → Build & Run
### Technical Details
- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
- **Browser Support**: onnxruntime-web for client-side inference
- **Batch Processing**: Supports batch inference for improved throughput
- **Audio Output**: Outputs 16-bit WAV files
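To illustrate the 16-bit output format mentioned above, here is a minimal sketch of converting floating-point samples to 16-bit PCM. The function name `floatToPcm16` is hypothetical and is not taken from this repository, whose own `writeWavFile` helper may work differently; the sketch also assumes samples are nominally in the range [-1, 1].

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the repository's writeWavFile):
// converts float samples, assumed to lie in [-1, 1], to signed 16-bit PCM.
std::vector<int16_t> floatToPcm16(const std::vector<float>& samples) {
    std::vector<int16_t> pcm;
    pcm.reserve(samples.size());
    for (float s : samples) {
        // Clamp out-of-range samples so the conversion cannot overflow.
        float clamped = std::max(-1.0f, std::min(1.0f, s));
        // Scale to the int16 range and round to the nearest integer.
        pcm.push_back(static_cast<int16_t>(std::lround(clamped * 32767.0f)));
    }
    return pcm;
}
```

Clamping before scaling is a common design choice here: it trades a small amount of distortion on overshooting samples for a guarantee that the integer conversion stays in range.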
## Performance
We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
**Metrics:**
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
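The two metrics can be written out as a short sketch. The function names below are illustrative only and do not come from the benchmark code, which is not part of this README.

```cpp
#include <cstddef>

// Throughput: number of input characters divided by synthesis time (seconds).
double charsPerSecond(std::size_t num_chars, double synthesis_seconds) {
    return static_cast<double>(num_chars) / synthesis_seconds;
}

// Real-time factor: synthesis time divided by the duration of the generated audio.
double realTimeFactor(double synthesis_seconds, double audio_seconds) {
    return synthesis_seconds / audio_seconds;
}

// Speed-up over real time is the reciprocal of the RTF.
double speedupOverRealTime(double rtf) {
    return 1.0 / rtf;
}
```

As a hypothetical example, a 266-character input synthesized in 0.2 seconds into 16 seconds of audio gives 1330 characters per second and an RTF of 0.0125. An RTF of 0.006 corresponds to roughly 167× faster than real time.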
### Characters per Second
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |
> **Notes:**
> `API` = Cloud-based API services (measured from Seoul)
> `Open` = Open-source models
> Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
> Supertonic (RTX4090): Tested with PyTorch model
> Kokoro: Tested on M4 Pro CPU with ONNX
> NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
### Real-time Factor
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
<details>
<summary><b>Additional Performance Data (5-step inference)</b></summary>
<br>
**Characters per Second (5-step)**
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |
**Real-time Factor (5-step)**
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |
</details>
### Natural Text Handling
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
> 🎧 **Listen to the samples more easily**: Our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#text-handling) offers a more convenient way to play all of the audio examples
**Overview of Test Cases:**
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini |
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ |
<details>
<summary><b>Example 1: Financial Expression</b></summary>
<br>
**Text:**
> "The startup secured **$5.2M** in venture capital, a huge leap from their initial **$450K** seed round."
**Challenges:**
- Decimal point in currency ($5.2M should be read as "five point two million")
- Abbreviated magnitude units (M for million, K for thousand)
- Currency symbol ($) that needs to be properly pronounced as "dollars"
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1eancUOhiSXCVoTu9ddh4S-OcVQaWrPV-/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1-r2scv7XQ1crIDu6QOh3eqVl445W6ap_/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1MFDXMjfmsAVOqwPx7iveS0KUJtZvcwxB/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1dEHpNzfMUucFTJPQK0k4RcFZvPwQTt09/view?usp=sharing) |
</details>
<details>
<summary><b>Example 2: Time and Date</b></summary>
<br>
**Text:**
> "The train delay was announced at **4:45 PM** on **Wed, Apr 3, 2024** due to track maintenance."
**Challenges:**
- Time expression with PM notation (4:45 PM)
- Abbreviated weekday (Wed)
- Abbreviated month (Apr)
- Full date format (Apr 3, 2024)
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1ehkZU8eiizBenG2DgR5tzBGQBvHS0Uaj/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ta3r6jFyebmA-sT44l8EaEQcMLVmuOEr/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1sskmem9AzHAQ3Hv8DRSZoqX_pye-CXuU/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1zx9X8oMsLMXW0Zx_SURoqjju-By2yh_n/view?usp=sharing) |
</details>
<details>
<summary><b>Example 3: Phone Number</b></summary>
<br>
**Text:**
> "You can reach the hotel front desk at **(212) 555-0142 ext. 402** anytime."
**Challenges:**
- Area code in parentheses that should be read as separate digits
- Phone number with hyphen separator (555-0142)
- Abbreviated extension notation (ext.)
- Extension number (402)
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1z-e5iTsihryMR8ll1-N1YXkB2CIJYJ6F/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1HAzVXFTZfZm0VEK2laSpsMTxzufcuaxA/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/15tjfAmb3GbjP_kmvD7zSdIWkhtAaCPOg/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1BCL8n7yligUZyso970ud7Gf5NWb1OhKD/view?usp=sharing) |
</details>
<details>
<summary><b>Example 4: Technical Unit</b></summary>
<br>
**Text:**
> "Our drone battery lasts **2.3h** when flying at **30kph** with full camera payload."
**Challenges:**
- Decimal time duration with abbreviation (2.3h = two point three hours)
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
- Technical abbreviations (h for hours, kph for kilometers per hour)
- Technical/engineering context requiring proper pronunciation
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1kvOBvswFkLfmr8hGplH0V2XiMxy1shYf/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1_SzfjWJe5YEd0t3R7DztkYhHcI_av48p/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1P5BSilj5xFPTV2Xz6yW5jitKZohO9o-6/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1GU82SnWC50OvC8CZNjhxvNZFKQb7I9_Y/view?usp=sharing) |
</details>
> **Note:** These samples demonstrate how each system handles text normalization and pronunciation of complex expressions **without requiring pre-processing or phonetic annotations**.
## Citation
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
### SupertonicTTS: Main Architecture
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
```bibtex
@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}
```
### Length-Aware RoPE: Text-Speech Alignment
This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
```bibtex
@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}
```
### Self-Purifying Flow Matching: Training with Noisy Labels
This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
```bibtex
@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}
```
## License
This project's sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.
The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic/blob/main/LICENSE) file for details.
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
Copyright (c) 2025 Supertone Inc.
+122
@@ -0,0 +1,122 @@
cmake_minimum_required(VERSION 3.15)
project(Supertonic_CPP)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Default to an optimized Release build when no build type is specified
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release)
endif()
# Add optimization flags (GCC/Clang syntax; MSVC does not accept these options)
if(NOT MSVC)
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
endif()
# Find required packages
find_package(PkgConfig REQUIRED)
find_package(OpenMP)
# ONNX Runtime - Try multiple methods
# Method 1: Try to find via CMake config
find_package(onnxruntime QUIET CONFIG)
if(NOT onnxruntime_FOUND)
# Method 2: Try pkg-config
pkg_check_modules(ONNXRUNTIME QUIET libonnxruntime)
if(ONNXRUNTIME_FOUND)
set(ONNXRUNTIME_INCLUDE_DIR ${ONNXRUNTIME_INCLUDE_DIRS})
set(ONNXRUNTIME_LIB ${ONNXRUNTIME_LIBRARIES})
else()
# Method 3: Manual search in common locations
find_path(ONNXRUNTIME_INCLUDE_DIR
NAMES onnxruntime_cxx_api.h
PATHS
/usr/local/include
/opt/homebrew/include
/usr/include
${CMAKE_PREFIX_PATH}/include
PATH_SUFFIXES onnxruntime
)
find_library(ONNXRUNTIME_LIB
NAMES onnxruntime libonnxruntime
PATHS
/usr/local/lib
/opt/homebrew/lib
/usr/lib
${CMAKE_PREFIX_PATH}/lib
)
endif()
if(NOT ONNXRUNTIME_INCLUDE_DIR OR NOT ONNXRUNTIME_LIB)
message(FATAL_ERROR "ONNX Runtime not found. Please install it:\n"
" macOS: brew install onnxruntime\n"
" Ubuntu: See README.md for installation instructions")
endif()
message(STATUS "Found ONNX Runtime:")
message(STATUS " Include: ${ONNXRUNTIME_INCLUDE_DIR}")
message(STATUS " Library: ${ONNXRUNTIME_LIB}")
endif()
# nlohmann/json
find_package(nlohmann_json REQUIRED)
# Include directories
if(NOT onnxruntime_FOUND)
include_directories(${ONNXRUNTIME_INCLUDE_DIR})
endif()
# Helper library
add_library(tts_helper STATIC
helper.cpp
helper.h
)
if(onnxruntime_FOUND)
target_link_libraries(tts_helper
onnxruntime::onnxruntime
nlohmann_json::nlohmann_json
)
else()
target_include_directories(tts_helper PUBLIC ${ONNXRUNTIME_INCLUDE_DIR})
target_link_libraries(tts_helper
${ONNXRUNTIME_LIB}
nlohmann_json::nlohmann_json
)
endif()
# Enable OpenMP if available
if(OpenMP_CXX_FOUND)
target_link_libraries(tts_helper OpenMP::OpenMP_CXX)
message(STATUS "OpenMP enabled for parallel processing")
else()
message(WARNING "OpenMP not found - parallel processing will be disabled")
endif()
# Example executable
add_executable(example_onnx
example_onnx.cpp
)
if(onnxruntime_FOUND)
target_link_libraries(example_onnx
tts_helper
onnxruntime::onnxruntime
nlohmann_json::nlohmann_json
)
else()
target_link_libraries(example_onnx
tts_helper
${ONNXRUNTIME_LIB}
nlohmann_json::nlohmann_json
)
endif()
# Installation
install(TARGETS example_onnx DESTINATION bin)
install(TARGETS tts_helper DESTINATION lib)
install(FILES helper.h DESTINATION include)
+101
@@ -0,0 +1,101 @@
# Supertonic C++ Implementation
High-performance text-to-speech inference using ONNX Runtime.
## Requirements
- C++17 compiler, CMake 3.15+
- Libraries: ONNX Runtime, nlohmann/json
## Installation
**Ubuntu/Debian:**
> ⚠️ **Note:** Installation instructions not yet verified.
```bash
sudo apt-get install -y cmake g++ nlohmann-json3-dev
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-1.16.3.tgz
tar -xzf onnxruntime-linux-x64-1.16.3.tgz
sudo cp -r onnxruntime-linux-x64-1.16.3/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-1.16.3/lib/* /usr/local/lib/
sudo ldconfig
```
**macOS:**
```bash
brew install cmake nlohmann-json onnxruntime
```
**Windows (vcpkg):**
> ⚠️ **Note:** Installation instructions not yet verified.
```powershell
vcpkg install nlohmann-json:x64-windows onnxruntime:x64-windows
vcpkg integrate install
```
## Building
```bash
cd cpp && mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
./example_onnx
```
This will use:
- Voice style: `../assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
./example_onnx \
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
./example_onnx \
--total-step 10 \
--voice-style ../assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
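To give some intuition for the step-count trade-off, the sketch below integrates a toy ordinary differential equation with fixed-size Euler steps. This is only an illustration under an assumption: that the sampler behaves like a fixed-step ODE solver, as is common for flow-matching models. The names `eulerIntegrate` and `toyError` are hypothetical, and the toy velocity field `dx/dt = x` stands in for the model; none of this is taken from `helper.cpp`.

```cpp
#include <cmath>
#include <functional>

// Integrates dx/dt = velocity(x, t) from t = 0 to t = 1 using
// total_step fixed-size Euler steps (assumed solver, for illustration only).
double eulerIntegrate(const std::function<double(double, double)>& velocity,
                      double x0, int total_step) {
    double x = x0;
    const double dt = 1.0 / total_step;
    for (int step = 0; step < total_step; ++step) {
        double t = step * dt;
        x += dt * velocity(x, t);  // one solver step = one velocity evaluation
    }
    return x;
}

// Toy problem: dx/dt = x with x(0) = 1, whose exact value at t = 1 is e.
// Returns the absolute error of the Euler approximation.
double toyError(int total_step) {
    double approx = eulerIntegrate([](double x, double) { return x; }, 1.0, total_step);
    return std::fabs(approx - std::exp(1.0));
}
```

In this toy setting `toyError(10)` is smaller than `toyError(5)`, while the number of velocity evaluations doubles, which mirrors the quality-versus-speed trade-off described above.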
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
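The batch pairing rule can be sketched as follows. The helper names `splitOn` and `pairStylesWithTexts` are hypothetical and written for illustration; they mirror the documented behavior (comma-separated styles, pipe-separated texts, equal counts) rather than reproduce the example program's code.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Splits input on a single-character delimiter, keeping empty fields.
std::vector<std::string> splitOn(const std::string& input, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    std::size_t pos;
    while ((pos = input.find(delim, start)) != std::string::npos) {
        parts.push_back(input.substr(start, pos - start));
        start = pos + 1;
    }
    parts.push_back(input.substr(start));
    return parts;
}

// Pairs the i-th voice style with the i-th text; throws if the counts differ.
std::vector<std::pair<std::string, std::string>> pairStylesWithTexts(
    const std::string& voice_style_arg, const std::string& text_arg) {
    std::vector<std::string> styles = splitOn(voice_style_arg, ',');
    std::vector<std::string> texts = splitOn(text_arg, '|');
    if (styles.size() != texts.size()) {
        throw std::invalid_argument("number of voice styles must match number of texts");
    }
    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::size_t i = 0; i < styles.size(); ++i) {
        pairs.emplace_back(styles[i], texts[i]);
    }
    return pairs;
}
```

One consequence of delimiter-based splitting is that a text containing `|`, or a style path containing `,`, is treated as multiple entries.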
Symlink
+1
@@ -0,0 +1 @@
../assets
+109
@@ -0,0 +1,109 @@
#include "helper.h"
#include <iostream>
#include <filesystem>
#include <algorithm>
#include <string>
#include <vector>
namespace fs = std::filesystem;
struct Args {
std::string onnx_dir = "../assets/onnx";
int total_step = 5;
int n_test = 4;
std::vector<std::string> voice_style = {"../assets/voice_styles/M1.json"};
std::vector<std::string> text = {
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
};
std::string save_dir = "results";
};
auto splitString = [](const std::string& str, char delim) {
std::vector<std::string> result;
size_t start = 0, pos;
while ((pos = str.find(delim, start)) != std::string::npos) {
result.push_back(str.substr(start, pos - start));
start = pos + 1;
}
result.push_back(str.substr(start));
return result;
};
Args parseArgs(int argc, char* argv[]) {
Args args;
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
if (arg == "--onnx-dir" && i + 1 < argc) args.onnx_dir = argv[++i];
else if (arg == "--total-step" && i + 1 < argc) args.total_step = std::stoi(argv[++i]);
else if (arg == "--n-test" && i + 1 < argc) args.n_test = std::stoi(argv[++i]);
else if (arg == "--voice-style" && i + 1 < argc) args.voice_style = splitString(argv[++i], ',');
else if (arg == "--text" && i + 1 < argc) args.text = splitString(argv[++i], '|');
else if (arg == "--save-dir" && i + 1 < argc) args.save_dir = argv[++i];
}
return args;
}
int main(int argc, char* argv[]) {
std::cout << "=== TTS Inference with ONNX Runtime (C++) ===\n\n";
// --- 1. Parse arguments --- //
Args args = parseArgs(argc, argv);
int total_step = args.total_step;
int n_test = args.n_test;
std::string save_dir = args.save_dir;
std::vector<std::string> voice_style_paths = args.voice_style;
std::vector<std::string> text_list = args.text;
if (voice_style_paths.size() != text_list.size()) {
std::cerr << "Error: Number of voice styles (" << voice_style_paths.size()
<< ") must match number of texts (" << text_list.size() << ")\n";
return 1;
}
int bsz = voice_style_paths.size();
// --- 2. Load Text to Speech --- //
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "TTS");
Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(
OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault
);
auto text_to_speech = loadTextToSpeech(env, args.onnx_dir, false);
std::cout << std::endl;
// --- 3. Load Voice Style --- //
auto style = loadVoiceStyle(voice_style_paths, true);
// --- 4. Synthesize speech --- //
fs::create_directories(save_dir);
for (int n = 0; n < n_test; n++) {
std::cout << "\n[" << (n + 1) << "/" << n_test << "] Starting synthesis...\n";
auto result = timer("Generating speech from text", [&]() {
return text_to_speech->call(memory_info, text_list, style, total_step);
});
int sample_rate = text_to_speech->getSampleRate();
int wav_shape_1 = result.wav.size() / bsz;
for (int b = 0; b < bsz; b++) {
std::string fname = sanitizeFilename(text_list[b], 20) + "_" + std::to_string(n + 1) + ".wav";
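// Clamp to the per-sample buffer length so the copy below cannot read past the end
int wav_len = std::min(static_cast<int>(sample_rate * result.duration[b]), wav_shape_1);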
std::vector<float> wav_out(
result.wav.begin() + b * wav_shape_1,
result.wav.begin() + b * wav_shape_1 + wav_len
);
std::string output_path = save_dir + "/" + fname;
writeWavFile(output_path, wav_out, sample_rate);
std::cout << "Saved: " << output_path << "\n";
}
clearTensorBuffers();
}
std::cout << "\n=== Synthesis completed successfully! ===\n";
return 0;
}
+714
@@ -0,0 +1,714 @@
#include "helper.h"
#include <fstream>
#include <iostream>
#include <cmath>
#include <algorithm>
#include <random>
#include <sstream>
#include <stdexcept>
#include <nlohmann/json.hpp>
using json = nlohmann::json;
// Global tensor buffers for memory management
static std::vector<std::vector<float>> g_tensor_buffers_float;
static std::vector<std::vector<int64_t>> g_tensor_buffers_int64;
void clearTensorBuffers() {
g_tensor_buffers_float.clear();
g_tensor_buffers_int64.clear();
}
// ============================================================================
// UnicodeProcessor implementation
// ============================================================================
UnicodeProcessor::UnicodeProcessor(const std::string& unicode_indexer_json_path) {
indexer_ = loadJsonInt64(unicode_indexer_json_path);
}
std::string UnicodeProcessor::preprocessText(const std::string& text) {
// Placeholder: the C++ standard library has no built-in Unicode normalization,
// so no NFKD normalization is applied and the text is returned unchanged.
// TODO: add proper Unicode normalization
return text;
}
std::vector<uint16_t> UnicodeProcessor::textToUnicodeValues(const std::string& text) {
std::vector<uint16_t> unicode_values;
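// Note: this iterates over bytes, not code points, so a multi-byte UTF-8
// character is split into separate byte values (only ASCII maps one-to-one).
for (char c : text) {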
unicode_values.push_back(static_cast<uint16_t>(static_cast<unsigned char>(c)));
}
return unicode_values;
}
std::vector<std::vector<std::vector<float>>> UnicodeProcessor::getTextMask(
const std::vector<int64_t>& text_ids_lengths
) {
return lengthToMask(text_ids_lengths);
}
void UnicodeProcessor::call(
const std::vector<std::string>& text_list,
std::vector<std::vector<int64_t>>& text_ids,
std::vector<std::vector<std::vector<float>>>& text_mask
) {
std::vector<std::string> processed_texts;
for (const auto& text : text_list) {
processed_texts.push_back(preprocessText(text));
}
std::vector<int64_t> text_ids_lengths;
for (const auto& text : processed_texts) {
text_ids_lengths.push_back(static_cast<int64_t>(text.length()));
}
int64_t max_len = *std::max_element(text_ids_lengths.begin(), text_ids_lengths.end());
text_ids.resize(text_list.size());
for (size_t i = 0; i < processed_texts.size(); i++) {
text_ids[i].resize(max_len, 0);
auto unicode_vals = textToUnicodeValues(processed_texts[i]);
for (size_t j = 0; j < unicode_vals.size(); j++) {
if (unicode_vals[j] < indexer_.size()) {
text_ids[i][j] = indexer_[unicode_vals[j]];
}
}
}
text_mask = getTextMask(text_ids_lengths);
}
// ============================================================================
// Style implementation
// ============================================================================
Style::Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape)
: ttl_data_(ttl_data), ttl_shape_(ttl_shape), dp_data_(dp_data), dp_shape_(dp_shape) {}
// ============================================================================
// TextToSpeech implementation
// ============================================================================
TextToSpeech::TextToSpeech(
const Config& cfgs,
UnicodeProcessor* text_processor,
Ort::Session* dp_ort,
Ort::Session* text_enc_ort,
Ort::Session* vector_est_ort,
Ort::Session* vocoder_ort
) : cfgs_(cfgs),
text_processor_(text_processor),
dp_ort_(dp_ort),
text_enc_ort_(text_enc_ort),
vector_est_ort_(vector_est_ort),
vocoder_ort_(vocoder_ort) {
sample_rate_ = cfgs.ae.sample_rate;
base_chunk_size_ = cfgs.ae.base_chunk_size;
chunk_compress_factor_ = cfgs.ttl.chunk_compress_factor;
ldim_ = cfgs.ttl.latent_dim;
}
void TextToSpeech::sampleNoisyLatent(
const std::vector<float>& duration,
std::vector<std::vector<std::vector<float>>>& noisy_latent,
std::vector<std::vector<std::vector<float>>>& latent_mask
) {
int bsz = duration.size();
float wav_len_max = *std::max_element(duration.begin(), duration.end()) * sample_rate_;
std::vector<int64_t> wav_lengths;
for (float d : duration) {
wav_lengths.push_back(static_cast<int64_t>(d * sample_rate_));
}
int chunk_size = base_chunk_size_ * chunk_compress_factor_;
int latent_len = static_cast<int>((wav_len_max + chunk_size - 1) / chunk_size);
int latent_dim = ldim_ * chunk_compress_factor_;
// Generate random noise with normal distribution
std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution<float> dist(0.0f, 1.0f);
noisy_latent.resize(bsz);
for (int b = 0; b < bsz; b++) {
noisy_latent[b].resize(latent_dim);
for (int d = 0; d < latent_dim; d++) {
noisy_latent[b][d].resize(latent_len);
for (int t = 0; t < latent_len; t++) {
noisy_latent[b][d][t] = dist(gen);
}
}
}
latent_mask = getLatentMask(wav_lengths, base_chunk_size_, chunk_compress_factor_);
// Apply mask
for (int b = 0; b < bsz; b++) {
for (int d = 0; d < latent_dim; d++) {
for (size_t t = 0; t < noisy_latent[b][d].size(); t++) {
noisy_latent[b][d][t] *= latent_mask[b][0][t];
}
}
}
}
TextToSpeech::SynthesisResult TextToSpeech::call(
Ort::MemoryInfo& memory_info,
const std::vector<std::string>& text_list,
const Style& style,
int total_step
) {
int bsz = text_list.size();
if (bsz != style.getTtlShape()[0]) {
throw std::runtime_error("Number of texts must match number of style vectors");
}
// Process text
std::vector<std::vector<int64_t>> text_ids;
std::vector<std::vector<std::vector<float>>> text_mask;
text_processor_->call(text_list, text_ids, text_mask);
std::vector<int64_t> text_ids_shape = {bsz, static_cast<int64_t>(text_ids[0].size())};
std::vector<int64_t> text_mask_shape = {bsz, 1, static_cast<int64_t>(text_mask[0][0].size())};
auto text_ids_tensor = intArrayToTensor(memory_info, text_ids, text_ids_shape);
auto text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
// Create style tensors
auto style_ttl_tensor = Ort::Value::CreateTensor<float>(
memory_info,
const_cast<float*>(style.getTtlData().data()),
style.getTtlData().size(),
style.getTtlShape().data(),
style.getTtlShape().size()
);
auto style_dp_tensor = Ort::Value::CreateTensor<float>(
memory_info,
const_cast<float*>(style.getDpData().data()),
style.getDpData().size(),
style.getDpShape().data(),
style.getDpShape().size()
);
// Run duration predictor
const char* dp_input_names[] = {"text_ids", "style_dp", "text_mask"};
const char* dp_output_names[] = {"duration"};
std::vector<Ort::Value> dp_inputs;
dp_inputs.push_back(std::move(text_ids_tensor));
dp_inputs.push_back(std::move(style_dp_tensor));
dp_inputs.push_back(std::move(text_mask_tensor));
auto dp_outputs = dp_ort_->Run(
Ort::RunOptions{nullptr},
dp_input_names, dp_inputs.data(), dp_inputs.size(),
dp_output_names, 1
);
auto* dur_data = dp_outputs[0].GetTensorMutableData<float>();
std::vector<float> duration(dur_data, dur_data + bsz);
// Create new tensors for text encoder (previous ones were moved)
text_ids_tensor = intArrayToTensor(memory_info, text_ids, text_ids_shape);
text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
style_ttl_tensor = Ort::Value::CreateTensor<float>(
memory_info,
const_cast<float*>(style.getTtlData().data()),
style.getTtlData().size(),
style.getTtlShape().data(),
style.getTtlShape().size()
);
// Run text encoder
const char* text_enc_input_names[] = {"text_ids", "style_ttl", "text_mask"};
const char* text_enc_output_names[] = {"text_emb"};
std::vector<Ort::Value> text_enc_inputs;
text_enc_inputs.push_back(std::move(text_ids_tensor));
text_enc_inputs.push_back(std::move(style_ttl_tensor));
text_enc_inputs.push_back(std::move(text_mask_tensor));
auto text_enc_outputs = text_enc_ort_->Run(
Ort::RunOptions{nullptr},
text_enc_input_names, text_enc_inputs.data(), text_enc_inputs.size(),
text_enc_output_names, 1
);
// Sample noisy latent
std::vector<std::vector<std::vector<float>>> xt, latent_mask;
sampleNoisyLatent(duration, xt, latent_mask);
std::vector<int64_t> latent_shape = {
bsz,
static_cast<int64_t>(xt[0].size()),
static_cast<int64_t>(xt[0][0].size())
};
std::vector<int64_t> latent_mask_shape = {
bsz, 1,
static_cast<int64_t>(latent_mask[0][0].size())
};
// Scalar inputs are passed as length-bsz vectors; the tensor that wraps this
// buffer is created inside the denoising loop, once per step
std::vector<float> total_step_vec(bsz, static_cast<float>(total_step));
// Store text_emb data to reuse across iterations
auto text_emb_info = text_enc_outputs[0].GetTensorTypeAndShapeInfo();
size_t text_emb_size = text_emb_info.GetElementCount();
auto* text_emb_data = text_enc_outputs[0].GetTensorMutableData<float>();
std::vector<float> text_emb_vec(text_emb_data, text_emb_data + text_emb_size);
auto text_emb_shape = text_emb_info.GetShape();
// Iterative denoising
for (int step = 0; step < total_step; step++) {
std::vector<float> current_step_vec(bsz, static_cast<float>(step));
text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
auto latent_mask_tensor = arrayToTensor(memory_info, latent_mask, latent_mask_shape);
auto noisy_latent_tensor = arrayToTensor(memory_info, xt, latent_shape);
style_ttl_tensor = Ort::Value::CreateTensor<float>(
memory_info,
const_cast<float*>(style.getTtlData().data()),
style.getTtlData().size(),
style.getTtlShape().data(),
style.getTtlShape().size()
);
auto text_emb_tensor = Ort::Value::CreateTensor<float>(
memory_info,
text_emb_vec.data(),
text_emb_vec.size(),
text_emb_shape.data(),
text_emb_shape.size()
);
auto current_step_tensor = Ort::Value::CreateTensor<float>(
memory_info,
current_step_vec.data(),
current_step_vec.size(),
std::vector<int64_t>{bsz}.data(),
1
);
const char* vector_est_input_names[] = {
"noisy_latent", "text_emb", "style_ttl", "text_mask", "latent_mask", "total_step", "current_step"
};
const char* vector_est_output_names[] = {"denoised_latent"};
std::vector<Ort::Value> vector_est_inputs;
vector_est_inputs.push_back(std::move(noisy_latent_tensor));
vector_est_inputs.push_back(std::move(text_emb_tensor));
vector_est_inputs.push_back(std::move(style_ttl_tensor));
vector_est_inputs.push_back(std::move(text_mask_tensor));
vector_est_inputs.push_back(std::move(latent_mask_tensor));
// Create a new total_step tensor for each iteration
auto total_step_tensor_iter = Ort::Value::CreateTensor<float>(
memory_info,
total_step_vec.data(),
total_step_vec.size(),
std::vector<int64_t>{bsz}.data(),
1
);
vector_est_inputs.push_back(std::move(total_step_tensor_iter));
vector_est_inputs.push_back(std::move(current_step_tensor));
auto vector_est_outputs = vector_est_ort_->Run(
Ort::RunOptions{nullptr},
vector_est_input_names, vector_est_inputs.data(), vector_est_inputs.size(),
vector_est_output_names, 1
);
// Update xt with denoised output
auto* denoised_data = vector_est_outputs[0].GetTensorMutableData<float>();
size_t idx = 0;
for (int b = 0; b < bsz; b++) {
for (size_t d = 0; d < xt[b].size(); d++) {
for (size_t t = 0; t < xt[b][d].size(); t++) {
xt[b][d][t] = denoised_data[idx++];
}
}
}
}
// Run vocoder
auto latent_tensor = arrayToTensor(memory_info, xt, latent_shape);
const char* vocoder_input_names[] = {"latent"};
const char* vocoder_output_names[] = {"wav_tts"};
std::vector<Ort::Value> vocoder_inputs;
vocoder_inputs.push_back(std::move(latent_tensor));
auto vocoder_outputs = vocoder_ort_->Run(
Ort::RunOptions{nullptr},
vocoder_input_names, vocoder_inputs.data(), vocoder_inputs.size(),
vocoder_output_names, 1
);
auto wav_info = vocoder_outputs[0].GetTensorTypeAndShapeInfo();
size_t wav_size = wav_info.GetElementCount();
auto* wav_data = vocoder_outputs[0].GetTensorMutableData<float>();
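SynthesisResult result;
result.wav.assign(wav_data, wav_data + wav_size);
result.duration = duration;
// arrayToTensor/intArrayToTensor park their backing buffers in global lists.
// All outputs have been copied by now, so release those buffers to keep
// memory from growing with every call
clearTensorBuffers();
return result;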
}
// ============================================================================
// Utility functions
// ============================================================================
std::vector<std::vector<std::vector<float>>> lengthToMask(
const std::vector<int64_t>& lengths, int max_len
) {
if (max_len == -1) {
max_len = *std::max_element(lengths.begin(), lengths.end());
}
std::vector<std::vector<std::vector<float>>> mask;
for (auto len : lengths) {
std::vector<std::vector<float>> batch_mask(1);
batch_mask[0].resize(max_len);
for (int i = 0; i < max_len; i++) {
batch_mask[0][i] = (i < len) ? 1.0f : 0.0f;
}
mask.push_back(batch_mask);
}
return mask;
}
std::vector<std::vector<std::vector<float>>> getLatentMask(
const std::vector<int64_t>& wav_lengths,
int base_chunk_size,
int chunk_compress_factor
) {
int latent_size = base_chunk_size * chunk_compress_factor;
std::vector<int64_t> latent_lengths;
for (auto len : wav_lengths) {
latent_lengths.push_back((len + latent_size - 1) / latent_size);
}
return lengthToMask(latent_lengths);
}
// ============================================================================
// ONNX model loading
// ============================================================================
std::unique_ptr<Ort::Session> loadOnnx(
Ort::Env& env,
const std::string& onnx_path,
const Ort::SessionOptions& opts
) {
return std::make_unique<Ort::Session>(env, onnx_path.c_str(), opts);
}
OnnxModels loadOnnxAll(
Ort::Env& env,
const std::string& onnx_dir,
const Ort::SessionOptions& opts
) {
OnnxModels models;
models.dp = loadOnnx(env, onnx_dir + "/duration_predictor.onnx", opts);
models.text_enc = loadOnnx(env, onnx_dir + "/text_encoder.onnx", opts);
models.vector_est = loadOnnx(env, onnx_dir + "/vector_estimator.onnx", opts);
models.vocoder = loadOnnx(env, onnx_dir + "/vocoder.onnx", opts);
return models;
}
// ============================================================================
// Configuration and processor loading
// ============================================================================
Config loadCfgs(const std::string& onnx_dir) {
std::string cfg_path = onnx_dir + "/tts.json";
std::ifstream file(cfg_path);
if (!file.is_open()) {
throw std::runtime_error("Failed to open config file: " + cfg_path);
}
json j;
file >> j;
Config cfg;
cfg.ae.sample_rate = j["ae"]["sample_rate"];
cfg.ae.base_chunk_size = j["ae"]["base_chunk_size"];
cfg.ttl.chunk_compress_factor = j["ttl"]["chunk_compress_factor"];
cfg.ttl.latent_dim = j["ttl"]["latent_dim"];
return cfg;
}
std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir) {
std::string unicode_indexer_path = onnx_dir + "/unicode_indexer.json";
return std::make_unique<UnicodeProcessor>(unicode_indexer_path);
}
// ============================================================================
// Voice style loading
// ============================================================================
Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose) {
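if (voice_style_paths.empty()) {
throw std::runtime_error("At least one voice style path is required");
}
int bsz = voice_style_paths.size();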
// Read first file to get dimensions
std::ifstream first_file(voice_style_paths[0]);
if (!first_file.is_open()) {
throw std::runtime_error("Failed to open voice style file: " + voice_style_paths[0]);
}
json first_json;
first_file >> first_json;
auto ttl_dims = first_json["style_ttl"]["dims"].get<std::vector<int64_t>>();
auto dp_dims = first_json["style_dp"]["dims"].get<std::vector<int64_t>>();
int64_t ttl_dim1 = ttl_dims[1];
int64_t ttl_dim2 = ttl_dims[2];
int64_t dp_dim1 = dp_dims[1];
int64_t dp_dim2 = dp_dims[2];
// Pre-allocate arrays with full batch size
size_t ttl_size = bsz * ttl_dim1 * ttl_dim2;
size_t dp_size = bsz * dp_dim1 * dp_dim2;
std::vector<float> ttl_flat(ttl_size);
std::vector<float> dp_flat(dp_size);
// Fill in the data
for (int i = 0; i < bsz; i++) {
std::ifstream file(voice_style_paths[i]);
if (!file.is_open()) {
throw std::runtime_error("Failed to open voice style file: " + voice_style_paths[i]);
}
json j;
file >> j;
// Flatten data
auto ttl_data_nested = j["style_ttl"]["data"].get<std::vector<std::vector<std::vector<float>>>>();
std::vector<float> ttl_data;
for (const auto& batch : ttl_data_nested) {
for (const auto& row : batch) {
ttl_data.insert(ttl_data.end(), row.begin(), row.end());
}
}
auto dp_data_nested = j["style_dp"]["data"].get<std::vector<std::vector<std::vector<float>>>>();
std::vector<float> dp_data;
for (const auto& batch : dp_data_nested) {
for (const auto& row : batch) {
dp_data.insert(dp_data.end(), row.begin(), row.end());
}
}
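// Every file must match the dimensions read from the first one; otherwise
// the copies below would write past the pre-allocated slot
size_t ttl_stride = static_cast<size_t>(ttl_dim1 * ttl_dim2);
size_t dp_stride = static_cast<size_t>(dp_dim1 * dp_dim2);
if (ttl_data.size() != ttl_stride || dp_data.size() != dp_stride) {
throw std::runtime_error("Voice style dimensions differ from the first file: " + voice_style_paths[i]);
}
// Copy to pre-allocated array
std::copy(ttl_data.begin(), ttl_data.end(), ttl_flat.begin() + i * ttl_stride);
std::copy(dp_data.begin(), dp_data.end(), dp_flat.begin() + i * dp_stride);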
}
std::vector<int64_t> ttl_shape = {bsz, ttl_dim1, ttl_dim2};
std::vector<int64_t> dp_shape = {bsz, dp_dim1, dp_dim2};
if (verbose) {
std::cout << "Loaded " << bsz << " voice styles" << std::endl;
}
return Style(ttl_flat, ttl_shape, dp_flat, dp_shape);
}
// ============================================================================
// TextToSpeech loading
// ============================================================================
std::unique_ptr<TextToSpeech> loadTextToSpeech(
Ort::Env& env,
const std::string& onnx_dir,
bool use_gpu
) {
Ort::SessionOptions opts;
if (use_gpu) {
throw std::runtime_error("GPU mode is not supported yet");
} else {
std::cout << "Using CPU for inference" << std::endl;
}
auto cfgs = loadCfgs(onnx_dir);
auto models = loadOnnxAll(env, onnx_dir, opts);
auto text_processor = loadTextProcessor(onnx_dir);
// TextToSpeech only stores non-owning raw pointers; ownership stays with
// `models` and `text_processor`
auto tts = std::make_unique<TextToSpeech>(
cfgs,
text_processor.get(),
models.dp.get(),
models.text_enc.get(),
models.vector_est.get(),
models.vocoder.get()
);
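// Keep the sessions and processor alive in function-local statics.
// Caveat: a second call to loadTextToSpeech replaces these objects and leaves
// any previously returned TextToSpeech with dangling pointers
// (In production, you'd want better lifetime management)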
static OnnxModels static_models;
static std::unique_ptr<UnicodeProcessor> static_text_processor;
static_models = std::move(models);
static_text_processor = std::move(text_processor);
return tts;
}
// ============================================================================
// WAV file writing
// ============================================================================
void writeWavFile(
const std::string& filename,
const std::vector<float>& audio_data,
int sample_rate
) {
std::ofstream file(filename, std::ios::binary);
if (!file.is_open()) {
throw std::runtime_error("Failed to open file for writing: " + filename);
}
int num_channels = 1;
int bits_per_sample = 16;
int byte_rate = sample_rate * num_channels * bits_per_sample / 8;
int block_align = num_channels * bits_per_sample / 8;
int data_size = audio_data.size() * bits_per_sample / 8;
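// RIFF header (all numeric fields are written in host byte order, so this
// produces a valid file only on little-endian machines)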
file.write("RIFF", 4);
int32_t chunk_size = 36 + data_size;
file.write(reinterpret_cast<char*>(&chunk_size), 4);
file.write("WAVE", 4);
// fmt chunk
file.write("fmt ", 4);
int32_t fmt_chunk_size = 16;
file.write(reinterpret_cast<char*>(&fmt_chunk_size), 4);
int16_t audio_format = 1; // PCM
file.write(reinterpret_cast<char*>(&audio_format), 2);
int16_t num_channels_16 = num_channels;
file.write(reinterpret_cast<char*>(&num_channels_16), 2);
file.write(reinterpret_cast<char*>(&sample_rate), 4);
file.write(reinterpret_cast<char*>(&byte_rate), 4);
int16_t block_align_16 = block_align;
file.write(reinterpret_cast<char*>(&block_align_16), 2);
int16_t bits_per_sample_16 = bits_per_sample;
file.write(reinterpret_cast<char*>(&bits_per_sample_16), 2);
// data chunk
file.write("data", 4);
file.write(reinterpret_cast<char*>(&data_size), 4);
// Write audio data
for (float sample : audio_data) {
float clamped = std::max(-1.0f, std::min(1.0f, sample));
int16_t int_sample = static_cast<int16_t>(clamped * 32767);
file.write(reinterpret_cast<char*>(&int_sample), 2);
}
}
// ============================================================================
// Tensor conversion utilities
// ============================================================================
Ort::Value arrayToTensor(
Ort::MemoryInfo& memory_info,
const std::vector<std::vector<std::vector<float>>>& array,
const std::vector<int64_t>& dims
) {
// Flatten the array
std::vector<float> flat;
for (const auto& batch : array) {
for (const auto& row : batch) {
for (float val : row) {
flat.push_back(val);
}
}
}
// Store in global buffer to keep data alive
g_tensor_buffers_float.push_back(std::move(flat));
auto& buffer = g_tensor_buffers_float.back();
return Ort::Value::CreateTensor<float>(
memory_info,
buffer.data(),
buffer.size(),
dims.data(),
dims.size()
);
}
Ort::Value intArrayToTensor(
Ort::MemoryInfo& memory_info,
const std::vector<std::vector<int64_t>>& array,
const std::vector<int64_t>& dims
) {
// Flatten the array
std::vector<int64_t> flat;
for (const auto& row : array) {
for (int64_t val : row) {
flat.push_back(val);
}
}
// Store in global buffer to keep data alive
g_tensor_buffers_int64.push_back(std::move(flat));
auto& buffer = g_tensor_buffers_int64.back();
return Ort::Value::CreateTensor<int64_t>(
memory_info,
buffer.data(),
buffer.size(),
dims.data(),
dims.size()
);
}
// ============================================================================
// JSON loading helpers
// ============================================================================
std::vector<int64_t> loadJsonInt64(const std::string& file_path) {
std::ifstream file(file_path);
if (!file.is_open()) {
throw std::runtime_error("Failed to open file: " + file_path);
}
json j;
file >> j;
return j.get<std::vector<int64_t>>();
}
// ============================================================================
// Sanitize filename
// ============================================================================
std::string sanitizeFilename(const std::string& text, int max_len) {
std::string result;
int count = 0;
for (char c : text) {
if (count >= max_len) break;
if (std::isalnum(static_cast<unsigned char>(c))) {
result += c;
} else {
result += '_';
}
count++;
}
return result;
}
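The ceiling-division step shared by `getLatentMask` and `lengthToMask` above can be sketched in isolation as follows. This is a minimal illustration, not the repository's code: `ceilDiv` and `sketchLatentMask` are hypothetical names, the mask is returned as a 2-D array rather than the `[bsz][1][len]` layout used above, and the chunk values in the usage note are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Number of chunks needed to cover `len` samples (ceiling division).
inline int64_t ceilDiv(int64_t len, int64_t chunk) {
    assert(chunk > 0 && len >= 0);
    return (len + chunk - 1) / chunk;
}

// Hypothetical standalone variant of getLatentMask: one row per batch item,
// 1.0 for valid latent frames and 0.0 for padding up to the longest item.
std::vector<std::vector<float>> sketchLatentMask(
    const std::vector<int64_t>& wav_lengths,
    int base_chunk_size,
    int chunk_compress_factor
) {
    const int64_t chunk =
        static_cast<int64_t>(base_chunk_size) * chunk_compress_factor;

    std::vector<int64_t> latent_lengths;
    latent_lengths.reserve(wav_lengths.size());
    for (int64_t len : wav_lengths) {
        latent_lengths.push_back(ceilDiv(len, chunk));
    }

    const int64_t max_len = latent_lengths.empty()
        ? 0
        : *std::max_element(latent_lengths.begin(), latent_lengths.end());

    std::vector<std::vector<float>> mask;
    mask.reserve(latent_lengths.size());
    for (int64_t len : latent_lengths) {
        std::vector<float> row(static_cast<size_t>(max_len), 0.0f);
        std::fill(row.begin(), row.begin() + len, 1.0f);
        mask.push_back(std::move(row));
    }
    return mask;
}
```

With an assumed chunk of 512 * 6 = 3072 samples, lengths of 3072, 3073 and 7000 samples map to 1, 2 and 3 latent frames, so the shorter items are padded with zeros to three frames.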
+202
View File
@@ -0,0 +1,202 @@
#pragma once
#include <cstdint>
#include <string>
#include <vector>
#include <memory>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <onnxruntime_cxx_api.h>
/**
* Model configuration read from tts.json (see loadCfgs)
*/
struct Config {
struct AEConfig {
int sample_rate;
int base_chunk_size;
} ae;
struct TTLConfig {
int chunk_compress_factor;
int latent_dim;
} ttl;
};
/**
* Unicode text processor
*/
class UnicodeProcessor {
public:
explicit UnicodeProcessor(const std::string& unicode_indexer_json_path);
// Process text list to text IDs and mask
void call(
const std::vector<std::string>& text_list,
std::vector<std::vector<int64_t>>& text_ids,
std::vector<std::vector<std::vector<float>>>& text_mask
);
private:
std::vector<int64_t> indexer_;
std::string preprocessText(const std::string& text);
std::vector<uint16_t> textToUnicodeValues(const std::string& text);
std::vector<std::vector<std::vector<float>>> getTextMask(
const std::vector<int64_t>& text_ids_lengths
);
};
/**
* Voice style: flattened style_ttl and style_dp tensors together with their shapes
*/
class Style {
public:
Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape);
const std::vector<float>& getTtlData() const { return ttl_data_; }
const std::vector<float>& getDpData() const { return dp_data_; }
const std::vector<int64_t>& getTtlShape() const { return ttl_shape_; }
const std::vector<int64_t>& getDpShape() const { return dp_shape_; }
private:
std::vector<float> ttl_data_;
std::vector<float> dp_data_;
std::vector<int64_t> ttl_shape_;
std::vector<int64_t> dp_shape_;
};
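/**
* TextToSpeech pipeline: duration predictor, text encoder, vector estimator
* (iterative denoising) and vocoder. Holds non-owning pointers, so the caller
* must keep the sessions and the text processor alive.
*/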
class TextToSpeech {
public:
TextToSpeech(
const Config& cfgs,
UnicodeProcessor* text_processor,
Ort::Session* dp_ort,
Ort::Session* text_enc_ort,
Ort::Session* vector_est_ort,
Ort::Session* vocoder_ort
);
struct SynthesisResult {
std::vector<float> wav;
std::vector<float> duration;
};
SynthesisResult call(
Ort::MemoryInfo& memory_info,
const std::vector<std::string>& text_list,
const Style& style,
int total_step
);
int getSampleRate() const { return sample_rate_; }
private:
Config cfgs_;
UnicodeProcessor* text_processor_;
Ort::Session* dp_ort_;
Ort::Session* text_enc_ort_;
Ort::Session* vector_est_ort_;
Ort::Session* vocoder_ort_;
int sample_rate_;
int base_chunk_size_;
int chunk_compress_factor_;
int ldim_;
void sampleNoisyLatent(
const std::vector<float>& duration,
std::vector<std::vector<std::vector<float>>>& noisy_latent,
std::vector<std::vector<std::vector<float>>>& latent_mask
);
};
// Utility functions
std::vector<std::vector<std::vector<float>>> lengthToMask(
const std::vector<int64_t>& lengths, int max_len = -1
);
std::vector<std::vector<std::vector<float>>> getLatentMask(
const std::vector<int64_t>& wav_lengths,
int base_chunk_size,
int chunk_compress_factor
);
// ONNX model loading
struct OnnxModels {
std::unique_ptr<Ort::Session> dp;
std::unique_ptr<Ort::Session> text_enc;
std::unique_ptr<Ort::Session> vector_est;
std::unique_ptr<Ort::Session> vocoder;
};
std::unique_ptr<Ort::Session> loadOnnx(
Ort::Env& env,
const std::string& onnx_path,
const Ort::SessionOptions& opts
);
OnnxModels loadOnnxAll(
Ort::Env& env,
const std::string& onnx_dir,
const Ort::SessionOptions& opts
);
// Configuration and processor loading
Config loadCfgs(const std::string& onnx_dir);
std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir);
// Voice style loading
Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose = false);
// TextToSpeech loading
std::unique_ptr<TextToSpeech> loadTextToSpeech(
Ort::Env& env,
const std::string& onnx_dir,
bool use_gpu = false
);
// WAV file writing
void writeWavFile(
const std::string& filename,
const std::vector<float>& audio_data,
int sample_rate
);
// Tensor conversion utilities
void clearTensorBuffers();
Ort::Value arrayToTensor(
Ort::MemoryInfo& memory_info,
const std::vector<std::vector<std::vector<float>>>& array,
const std::vector<int64_t>& dims
);
Ort::Value intArrayToTensor(
Ort::MemoryInfo& memory_info,
const std::vector<std::vector<int64_t>>& array,
const std::vector<int64_t>& dims
);
// JSON loading helpers
std::vector<int64_t> loadJsonInt64(const std::string& file_path);
// Timer utility
template<typename Func>
auto timer(const std::string& name, Func&& func) -> decltype(func()) {
auto start = std::chrono::high_resolution_clock::now();
std::cout << name << "..." << std::endl;
auto result = func();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << " -> " << name << " completed in "
<< std::fixed << std::setprecision(2) << elapsed.count() << " sec" << std::endl;
return result;
}
// Sanitize filename
std::string sanitizeFilename(const std::string& text, int max_len);
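The `writeWavFile` declared above emits a 44-byte PCM header followed by 16-bit samples. A byte-order-independent way to build the same layout in memory might look like the sketch below. It is an illustration under assumptions, not the repository's implementation: `putLE`, `putTag` and `encodeWavPcm16` are hypothetical helpers, and mono 16-bit PCM is assumed to match the values hard-coded in `writeWavFile`.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Append the low `bytes` bytes of `value` in little-endian order,
// regardless of the host's native byte order.
inline void putLE(std::vector<uint8_t>& out, uint32_t value, int bytes) {
    for (int i = 0; i < bytes; i++) {
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Append a four-character chunk tag such as "RIFF" or "data".
inline void putTag(std::vector<uint8_t>& out, const std::string& tag) {
    out.insert(out.end(), tag.begin(), tag.end());
}

// Hypothetical in-memory encoder: mono, 16-bit PCM, 44-byte header.
std::vector<uint8_t> encodeWavPcm16(const std::vector<float>& audio,
                                    uint32_t sample_rate) {
    const uint32_t channels = 1;
    const uint32_t bits = 16;
    const uint32_t block_align = channels * bits / 8;
    const uint32_t data_size =
        static_cast<uint32_t>(audio.size()) * block_align;

    std::vector<uint8_t> out;
    out.reserve(44 + data_size);

    putTag(out, "RIFF");
    putLE(out, 36 + data_size, 4);          // remaining file size
    putTag(out, "WAVE");

    putTag(out, "fmt ");
    putLE(out, 16, 4);                      // fmt chunk size
    putLE(out, 1, 2);                       // audio format: PCM
    putLE(out, channels, 2);
    putLE(out, sample_rate, 4);
    putLE(out, sample_rate * block_align, 4); // byte rate
    putLE(out, block_align, 2);
    putLE(out, bits, 2);

    putTag(out, "data");
    putLE(out, data_size, 4);

    for (float s : audio) {
        // Clamp to [-1, 1] and scale to the int16 range, as writeWavFile does.
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        const int16_t v = static_cast<int16_t>(clamped * 32767);
        putLE(out, static_cast<uint16_t>(v), 2);
    }
    return out;
}
```

Writing the resulting buffer with a single `ofstream::write` call would also avoid one stream write per sample.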
+41
View File
@@ -0,0 +1,41 @@
# Build results
bin/
obj/
[Dd]ebug/
[Rr]elease/
x64/
x86/
[Aa]rm/
[Aa]rm64/
bld/
[Bb]in/
[Oo]bj/
[Ll]og/
# Visual Studio files
.vs/
*.suo
*.user
*.userosscache
*.sln.docstates
*.userprefs
# Rider
.idea/
*.sln.iml
# User-specific files
*.rsuser
# Output directory
results/*.wav
# OS files
.DS_Store
Thumbs.db
+118
View File
@@ -0,0 +1,118 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
namespace Supertonic
{
class Program
{
class Args
{
public bool UseGpu { get; set; } = false;
public string OnnxDir { get; set; } = "assets/onnx";
public int TotalStep { get; set; } = 5;
public int NTest { get; set; } = 4;
public List<string> VoiceStyle { get; set; } = new List<string> { "assets/voice_styles/M1.json" };
public List<string> Text { get; set; } = new List<string>
{
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
};
public string SaveDir { get; set; } = "results";
}
static Args ParseArgs(string[] args)
{
var result = new Args();
for (int i = 0; i < args.Length; i++)
{
switch (args[i])
{
case "--use-gpu":
result.UseGpu = true;
break;
case "--onnx-dir" when i + 1 < args.Length:
result.OnnxDir = args[++i];
break;
case "--total-step" when i + 1 < args.Length:
result.TotalStep = int.Parse(args[++i]);
break;
case "--n-test" when i + 1 < args.Length:
result.NTest = int.Parse(args[++i]);
break;
case "--voice-style" when i + 1 < args.Length:
result.VoiceStyle = args[++i].Split(',').ToList();
break;
case "--text" when i + 1 < args.Length:
result.Text = args[++i].Split('|').ToList();
break;
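case "--save-dir" when i + 1 < args.Length:
result.SaveDir = args[++i];
break;
default:
// Report typos and flags that are missing their value instead of silently ignoring them
Console.Error.WriteLine($"Warning: ignoring unrecognized or incomplete argument '{args[i]}'");
break;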
}
}
return result;
}
static void Main(string[] args)
{
Console.WriteLine("=== TTS Inference with ONNX Runtime (C#) ===\n");
// --- 1. Parse arguments --- //
var parsedArgs = ParseArgs(args);
int totalStep = parsedArgs.TotalStep;
int nTest = parsedArgs.NTest;
string saveDir = parsedArgs.SaveDir;
var voiceStylePaths = parsedArgs.VoiceStyle;
var textList = parsedArgs.Text;
if (voiceStylePaths.Count != textList.Count)
{
throw new ArgumentException(
$"Number of voice styles ({voiceStylePaths.Count}) must match number of texts ({textList.Count})");
}
int bsz = voiceStylePaths.Count;
// --- 2. Load Text to Speech --- //
var textToSpeech = Helper.LoadTextToSpeech(parsedArgs.OnnxDir, parsedArgs.UseGpu);
Console.WriteLine();
// --- 3. Load Voice Style --- //
var style = Helper.LoadVoiceStyle(voiceStylePaths, verbose: true);
// --- 4. Synthesize speech --- //
for (int n = 0; n < nTest; n++)
{
Console.WriteLine($"\n[{n + 1}/{nTest}] Starting synthesis...");
var (wav, duration) = Helper.Timer("Generating speech from text", () =>
textToSpeech.Call(textList, style, totalStep)
);
if (!Directory.Exists(saveDir))
{
Directory.CreateDirectory(saveDir);
}
for (int b = 0; b < bsz; b++)
{
string fname = $"{Helper.SanitizeFilename(textList[b], 20)}_{n + 1}.wav";
int wavLen = (int)(textToSpeech.SampleRate * duration[b]);
var wavOut = new float[wavLen];
Array.Copy(wav, b * wav.Length / bsz, wavOut, 0, Math.Min(wavLen, wav.Length / bsz));
string outputPath = Path.Combine(saveDir, fname);
Helper.WriteWavFile(outputPath, wavOut, textToSpeech.SampleRate);
Console.WriteLine($"Saved: {outputPath}");
}
}
Console.WriteLine("\n=== Synthesis completed successfully! ===");
}
}
}
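The per-item trimming that `Main` performs with `Array.Copy` (the batched output holds `wav.Length / bsz` padded samples per item, of which only `SampleRate * duration[b]` are kept) can be sketched in C++ as below. `splitBatchedWav` is a hypothetical helper written for this illustration. Unlike the C# loop, it truncates to the available samples instead of zero-padding when the predicted duration exceeds the padded length.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: split a flat [bsz * stride] waveform into one vector
// per batch item, keeping only the samples covered by each duration.
std::vector<std::vector<float>> splitBatchedWav(
    const std::vector<float>& wav,
    const std::vector<float>& duration,   // seconds, one entry per item
    int sample_rate
) {
    std::vector<std::vector<float>> out;
    const std::size_t bsz = duration.size();
    if (bsz == 0) {
        return out;
    }

    // Every item occupies the same padded length in the batched buffer.
    const std::size_t stride = wav.size() / bsz;

    out.reserve(bsz);
    for (std::size_t b = 0; b < bsz; b++) {
        // Negative durations are treated as zero before converting to samples.
        const float seconds = std::max(0.0f, duration[b]);
        const std::size_t wanted =
            static_cast<std::size_t>(seconds * static_cast<float>(sample_rate));
        const std::size_t n = std::min(wanted, stride);

        const auto first = wav.begin() + static_cast<std::ptrdiff_t>(b * stride);
        out.emplace_back(first, first + static_cast<std::ptrdiff_t>(n));
    }
    return out;
}
```

Each returned vector could then be passed to a WAV writer on its own, which mirrors the inner `for (int b = 0; b < bsz; b++)` loop above.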
+612
View File
@@ -0,0 +1,612 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
namespace Supertonic
{
// ============================================================================
// Configuration classes
// ============================================================================
public class Config
{
public AEConfig AE { get; set; } = null!;
public TTLConfig TTL { get; set; } = null!;
public class AEConfig
{
public int SampleRate { get; set; }
public int BaseChunkSize { get; set; }
}
public class TTLConfig
{
public int ChunkCompressFactor { get; set; }
public int LatentDim { get; set; }
}
}
// ============================================================================
// Style class
// ============================================================================
public class Style
{
public float[] Ttl { get; set; }
public long[] TtlShape { get; set; }
public float[] Dp { get; set; }
public long[] DpShape { get; set; }
public Style(float[] ttl, long[] ttlShape, float[] dp, long[] dpShape)
{
Ttl = ttl;
TtlShape = ttlShape;
Dp = dp;
DpShape = dpShape;
}
}
// ============================================================================
// Unicode text processor
// ============================================================================
public class UnicodeProcessor
{
private readonly Dictionary<int, long> _indexer;
public UnicodeProcessor(string unicodeIndexerPath)
{
var json = File.ReadAllText(unicodeIndexerPath);
var indexerArray = JsonSerializer.Deserialize<long[]>(json) ?? throw new Exception("Failed to load indexer");
_indexer = new Dictionary<int, long>();
for (int i = 0; i < indexerArray.Length; i++)
{
_indexer[i] = indexerArray[i];
}
}
private string PreprocessText(string text)
{
// Unicode compatibility decomposition (NFKD) via the built-in string.Normalize
return text.Normalize(NormalizationForm.FormKD);
}
private int[] TextToUnicodeValues(string text)
{
return text.Select(c => (int)c).ToArray();
}
private float[][][] GetTextMask(long[] textIdsLengths)
{
return Helper.LengthToMask(textIdsLengths);
}
public (long[][] textIds, float[][][] textMask) Call(List<string> textList)
{
var processedTexts = textList.Select(t => PreprocessText(t)).ToList();
var textIdsLengths = processedTexts.Select(t => (long)t.Length).ToArray();
long maxLen = textIdsLengths.Max();
var textIds = new long[textList.Count][];
for (int i = 0; i < processedTexts.Count; i++)
{
textIds[i] = new long[maxLen];
var unicodeVals = TextToUnicodeValues(processedTexts[i]);
for (int j = 0; j < unicodeVals.Length; j++)
{
if (_indexer.TryGetValue(unicodeVals[j], out long val))
{
textIds[i][j] = val;
}
}
}
var textMask = GetTextMask(textIdsLengths);
return (textIds, textMask);
}
}
// ============================================================================
// TextToSpeech class
// ============================================================================
public class TextToSpeech
{
private readonly Config _cfgs;
private readonly UnicodeProcessor _textProcessor;
private readonly InferenceSession _dpOrt;
private readonly InferenceSession _textEncOrt;
private readonly InferenceSession _vectorEstOrt;
private readonly InferenceSession _vocoderOrt;
public readonly int SampleRate;
private readonly int _baseChunkSize;
private readonly int _chunkCompressFactor;
private readonly int _ldim;
public TextToSpeech(
Config cfgs,
UnicodeProcessor textProcessor,
InferenceSession dpOrt,
InferenceSession textEncOrt,
InferenceSession vectorEstOrt,
InferenceSession vocoderOrt)
{
_cfgs = cfgs;
_textProcessor = textProcessor;
_dpOrt = dpOrt;
_textEncOrt = textEncOrt;
_vectorEstOrt = vectorEstOrt;
_vocoderOrt = vocoderOrt;
SampleRate = cfgs.AE.SampleRate;
_baseChunkSize = cfgs.AE.BaseChunkSize;
_chunkCompressFactor = cfgs.TTL.ChunkCompressFactor;
_ldim = cfgs.TTL.LatentDim;
}
private (float[][][] noisyLatent, float[][][] latentMask) SampleNoisyLatent(float[] duration)
{
int bsz = duration.Length;
float wavLenMax = duration.Max() * SampleRate;
var wavLengths = duration.Select(d => (long)(d * SampleRate)).ToArray();
int chunkSize = _baseChunkSize * _chunkCompressFactor;
int latentLen = (int)((wavLenMax + chunkSize - 1) / chunkSize);
int latentDim = _ldim * _chunkCompressFactor;
// Generate random noise
var random = new Random();
var noisyLatent = new float[bsz][][];
for (int b = 0; b < bsz; b++)
{
noisyLatent[b] = new float[latentDim][];
for (int d = 0; d < latentDim; d++)
{
noisyLatent[b][d] = new float[latentLen];
for (int t = 0; t < latentLen; t++)
{
// Box-Muller transform for normal distribution
double u1 = 1.0 - random.NextDouble();
double u2 = 1.0 - random.NextDouble();
noisyLatent[b][d][t] = (float)(Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2));
}
}
}
var latentMask = Helper.GetLatentMask(wavLengths, _baseChunkSize, _chunkCompressFactor);
// Apply mask
for (int b = 0; b < bsz; b++)
{
for (int d = 0; d < latentDim; d++)
{
for (int t = 0; t < latentLen; t++)
{
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return (noisyLatent, latentMask);
}
public (float[] wav, float[] duration) Call(List<string> textList, Style style, int totalStep)
{
int bsz = textList.Count;
if (bsz != style.TtlShape[0])
{
throw new ArgumentException("Number of texts must match number of style vectors");
}
// Process text
var (textIds, textMask) = _textProcessor.Call(textList);
var textIdsShape = new long[] { bsz, textIds[0].Length };
var textMaskShape = new long[] { bsz, 1, textMask[0][0].Length };
var textIdsTensor = Helper.IntArrayToTensor(textIds, textIdsShape);
var textMaskTensor = Helper.ArrayToTensor(textMask, textMaskShape);
var styleTtlTensor = new DenseTensor<float>(style.Ttl, style.TtlShape.Select(x => (int)x).ToArray());
var styleDpTensor = new DenseTensor<float>(style.Dp, style.DpShape.Select(x => (int)x).ToArray());
// Run duration predictor
var dpInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
NamedOnnxValue.CreateFromTensor("style_dp", styleDpTensor),
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
};
using var dpOutputs = _dpOrt.Run(dpInputs);
var durOnnx = dpOutputs.First(o => o.Name == "duration").AsTensor<float>().ToArray();
// Run text encoder
var textEncInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
};
using var textEncOutputs = _textEncOrt.Run(textEncInputs);
var textEmbTensor = textEncOutputs.First(o => o.Name == "text_emb").AsTensor<float>();
// Sample noisy latent
var (xt, latentMask) = SampleNoisyLatent(durOnnx);
var latentShape = new long[] { bsz, xt[0].Length, xt[0][0].Length };
var latentMaskShape = new long[] { bsz, 1, latentMask[0][0].Length };
var totalStepArray = Enumerable.Repeat((float)totalStep, bsz).ToArray();
// Iterative denoising
for (int step = 0; step < totalStep; step++)
{
var currentStepArray = Enumerable.Repeat((float)step, bsz).ToArray();
var vectorEstInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("noisy_latent", Helper.ArrayToTensor(xt, latentShape)),
NamedOnnxValue.CreateFromTensor("text_emb", textEmbTensor),
NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor),
NamedOnnxValue.CreateFromTensor("latent_mask", Helper.ArrayToTensor(latentMask, latentMaskShape)),
NamedOnnxValue.CreateFromTensor("total_step", new DenseTensor<float>(totalStepArray, new int[] { bsz })),
NamedOnnxValue.CreateFromTensor("current_step", new DenseTensor<float>(currentStepArray, new int[] { bsz }))
};
using var vectorEstOutputs = _vectorEstOrt.Run(vectorEstInputs);
var denoisedLatent = vectorEstOutputs.First(o => o.Name == "denoised_latent").AsTensor<float>();
// Update xt
int idx = 0;
for (int b = 0; b < bsz; b++)
{
for (int d = 0; d < xt[b].Length; d++)
{
for (int t = 0; t < xt[b][d].Length; t++)
{
xt[b][d][t] = denoisedLatent.GetValue(idx++);
}
}
}
}
// Run vocoder
var vocoderInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("latent", Helper.ArrayToTensor(xt, latentShape))
};
using var vocoderOutputs = _vocoderOrt.Run(vocoderInputs);
var wavTensor = vocoderOutputs.First(o => o.Name == "wav_tts").AsTensor<float>();
return (wavTensor.ToArray(), durOnnx);
}
}
// ============================================================================
// Helper class with utility functions
// ============================================================================
public static class Helper
{
// ============================================================================
// Utility functions
// ============================================================================
public static float[][][] LengthToMask(long[] lengths, long maxLen = -1)
{
if (maxLen == -1)
{
maxLen = lengths.Max();
}
var mask = new float[lengths.Length][][];
for (int i = 0; i < lengths.Length; i++)
{
mask[i] = new float[1][];
mask[i][0] = new float[maxLen];
for (int j = 0; j < maxLen; j++)
{
mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
}
}
return mask;
}
public static float[][][] GetLatentMask(long[] wavLengths, int baseChunkSize, int chunkCompressFactor)
{
int latentSize = baseChunkSize * chunkCompressFactor;
var latentLengths = wavLengths.Select(len => (len + latentSize - 1) / latentSize).ToArray();
return LengthToMask(latentLengths);
}
// ============================================================================
// ONNX model loading
// ============================================================================
public static InferenceSession LoadOnnx(string onnxPath, SessionOptions opts)
{
return new InferenceSession(onnxPath, opts);
}
public static (InferenceSession dp, InferenceSession textEnc, InferenceSession vectorEst, InferenceSession vocoder)
LoadOnnxAll(string onnxDir, SessionOptions opts)
{
var dpPath = Path.Combine(onnxDir, "duration_predictor.onnx");
var textEncPath = Path.Combine(onnxDir, "text_encoder.onnx");
var vectorEstPath = Path.Combine(onnxDir, "vector_estimator.onnx");
var vocoderPath = Path.Combine(onnxDir, "vocoder.onnx");
return (
LoadOnnx(dpPath, opts),
LoadOnnx(textEncPath, opts),
LoadOnnx(vectorEstPath, opts),
LoadOnnx(vocoderPath, opts)
);
}
// ============================================================================
// Configuration loading
// ============================================================================
public static Config LoadCfgs(string onnxDir)
{
var cfgPath = Path.Combine(onnxDir, "tts.json");
var json = File.ReadAllText(cfgPath);
using var doc = JsonDocument.Parse(json);
var root = doc.RootElement;
return new Config
{
AE = new Config.AEConfig
{
SampleRate = root.GetProperty("ae").GetProperty("sample_rate").GetInt32(),
BaseChunkSize = root.GetProperty("ae").GetProperty("base_chunk_size").GetInt32()
},
TTL = new Config.TTLConfig
{
ChunkCompressFactor = root.GetProperty("ttl").GetProperty("chunk_compress_factor").GetInt32(),
LatentDim = root.GetProperty("ttl").GetProperty("latent_dim").GetInt32()
}
};
}
public static UnicodeProcessor LoadTextProcessor(string onnxDir)
{
var unicodeIndexerPath = Path.Combine(onnxDir, "unicode_indexer.json");
return new UnicodeProcessor(unicodeIndexerPath);
}
// ============================================================================
// Voice style loading
// ============================================================================
public static Style LoadVoiceStyle(List<string> voiceStylePaths, bool verbose = false)
{
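int bsz = voiceStylePaths.Count;
if (bsz == 0)
{
// Fail early with a clear message instead of an index error on voiceStylePaths[0]
throw new ArgumentException("At least one voice style path is required", nameof(voiceStylePaths));
}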
// Read first file to get dimensions
var firstJson = File.ReadAllText(voiceStylePaths[0]);
using var firstDoc = JsonDocument.Parse(firstJson);
var firstRoot = firstDoc.RootElement;
var ttlDims = ParseInt64Array(firstRoot.GetProperty("style_ttl").GetProperty("dims"));
var dpDims = ParseInt64Array(firstRoot.GetProperty("style_dp").GetProperty("dims"));
long ttlDim1 = ttlDims[1];
long ttlDim2 = ttlDims[2];
long dpDim1 = dpDims[1];
long dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
int ttlSize = (int)(bsz * ttlDim1 * ttlDim2);
int dpSize = (int)(bsz * dpDim1 * dpDim2);
var ttlFlat = new float[ttlSize];
var dpFlat = new float[dpSize];
// Fill in the data
for (int i = 0; i < bsz; i++)
{
var json = File.ReadAllText(voiceStylePaths[i]);
using var doc = JsonDocument.Parse(json);
var root = doc.RootElement;
// Flatten data
var ttlData3D = ParseFloat3DArray(root.GetProperty("style_ttl").GetProperty("data"));
var ttlDataFlat = new List<float>();
foreach (var batch in ttlData3D)
{
foreach (var row in batch)
{
ttlDataFlat.AddRange(row);
}
}
var dpData3D = ParseFloat3DArray(root.GetProperty("style_dp").GetProperty("data"));
var dpDataFlat = new List<float>();
foreach (var batch in dpData3D)
{
foreach (var row in batch)
{
dpDataFlat.AddRange(row);
}
}
// Copy to pre-allocated array
int ttlOffset = (int)(i * ttlDim1 * ttlDim2);
ttlDataFlat.CopyTo(ttlFlat, ttlOffset);
int dpOffset = (int)(i * dpDim1 * dpDim2);
dpDataFlat.CopyTo(dpFlat, dpOffset);
}
var ttlShape = new long[] { bsz, ttlDim1, ttlDim2 };
var dpShape = new long[] { bsz, dpDim1, dpDim2 };
if (verbose)
{
Console.WriteLine($"Loaded {bsz} voice styles");
}
return new Style(ttlFlat, ttlShape, dpFlat, dpShape);
}
private static float[][][] ParseFloat3DArray(JsonElement element)
{
var result = new List<float[][]>();
foreach (var batch in element.EnumerateArray())
{
var batch2D = new List<float[]>();
foreach (var row in batch.EnumerateArray())
{
var rowData = new List<float>();
foreach (var val in row.EnumerateArray())
{
rowData.Add(val.GetSingle());
}
batch2D.Add(rowData.ToArray());
}
result.Add(batch2D.ToArray());
}
return result.ToArray();
}
private static long[] ParseInt64Array(JsonElement element)
{
var result = new List<long>();
foreach (var val in element.EnumerateArray())
{
result.Add(val.GetInt64());
}
return result.ToArray();
}
// ============================================================================
// TextToSpeech loading
// ============================================================================
public static TextToSpeech LoadTextToSpeech(string onnxDir, bool useGpu = false)
{
var opts = new SessionOptions();
if (useGpu)
{
throw new NotImplementedException("GPU mode is not supported yet");
}
else
{
Console.WriteLine("Using CPU for inference");
}
var cfgs = LoadCfgs(onnxDir);
var (dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) = LoadOnnxAll(onnxDir, opts);
var textProcessor = LoadTextProcessor(onnxDir);
return new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
}
// ============================================================================
// WAV file writing
// ============================================================================
public static void WriteWavFile(string filename, float[] audioData, int sampleRate)
{
using var writer = new BinaryWriter(File.Open(filename, FileMode.Create));
int numChannels = 1;
int bitsPerSample = 16;
int byteRate = sampleRate * numChannels * bitsPerSample / 8;
short blockAlign = (short)(numChannels * bitsPerSample / 8);
int dataSize = audioData.Length * bitsPerSample / 8;
// RIFF header
writer.Write(Encoding.ASCII.GetBytes("RIFF"));
writer.Write(36 + dataSize);
writer.Write(Encoding.ASCII.GetBytes("WAVE"));
// fmt chunk
writer.Write(Encoding.ASCII.GetBytes("fmt "));
writer.Write(16); // fmt chunk size
writer.Write((short)1); // audio format (PCM)
writer.Write((short)numChannels);
writer.Write(sampleRate);
writer.Write(byteRate);
writer.Write(blockAlign);
writer.Write((short)bitsPerSample);
// data chunk
writer.Write(Encoding.ASCII.GetBytes("data"));
writer.Write(dataSize);
// Write audio data
foreach (var sample in audioData)
{
float clamped = Math.Max(-1.0f, Math.Min(1.0f, sample));
short intSample = (short)(clamped * 32767);
writer.Write(intSample);
}
}
// ============================================================================
// Tensor conversion utilities
// ============================================================================
public static DenseTensor<float> ArrayToTensor(float[][][] array, long[] dims)
{
var flat = new List<float>();
foreach (var batch in array)
{
foreach (var row in batch)
{
flat.AddRange(row);
}
}
return new DenseTensor<float>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
}
public static DenseTensor<long> IntArrayToTensor(long[][] array, long[] dims)
{
var flat = new List<long>();
foreach (var row in array)
{
flat.AddRange(row);
}
return new DenseTensor<long>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
}
// ============================================================================
// Timer utility
// ============================================================================
public static T Timer<T>(string name, Func<T> func)
{
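// Stopwatch is monotonic, so the measurement is unaffected by system clock adjustments
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
Console.WriteLine($"{name}...");
var result = func();
var elapsed = stopwatch.Elapsed.TotalSeconds;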
Console.WriteLine($" -> {name} completed in {elapsed:F2} sec");
return result;
}
public static string SanitizeFilename(string text, int maxLen)
{
var result = new StringBuilder();
int count = 0;
foreach (char c in text)
{
if (count >= maxLen) break;
if (char.IsLetterOrDigit(c))
{
result.Append(c);
}
else
{
result.Append('_');
}
count++;
}
return result.ToString();
}
}
}
+99
View File
@@ -0,0 +1,99 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `ExampleONNX.cs`.
## Installation
### Prerequisites
- .NET 9.0 SDK or later
- [Download .NET SDK](https://dotnet.microsoft.com/download)
### Install dependencies
```bash
dotnet restore
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
dotnet run
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
dotnet run -- \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
dotnet run -- \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
| `--text` | str+ | (long default text) | Text(s) to synthesize (pipe-separated: `|`) |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
## Building the Project
### Build for Release
```bash
dotnet build -c Release
```
### Run the compiled executable
```bash
./bin/Release/net9.0/Supertonic
```
## Project Structure
```
csharp/
├── ExampleONNX.cs # Main inference script
├── Helper.cs # Helper functions and classes
├── Supertonic.csproj # Project configuration
├── README.md # This file
└── results/ # Output directory (created automatically)
```
+17
View File
@@ -0,0 +1,17 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net9.0</TargetFramework>
<LangVersion>13.0</LangVersion>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.20.1" />
<PackageReference Include="System.Text.Json" Version="9.0.1" />
</ItemGroup>
</Project>
+1
View File
@@ -0,0 +1 @@
../assets
+17
View File
@@ -0,0 +1,17 @@
# Binaries
tts_example
example_onnx
*.exe
# Go build artifacts
*.o
*.a
*.so
# Results
results/
# Go workspace
go.work
go.work.sum
+128
View File
@@ -0,0 +1,128 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `example_onnx.go`.
## Installation
This project uses Go modules for dependency management.
### Prerequisites
1. Install Go 1.21 or later from [https://golang.org/dl/](https://golang.org/dl/)
2. Install ONNX Runtime C library:
**macOS (via Homebrew):**
```bash
brew install onnxruntime
```
**Linux:**
```bash
# Download ONNX Runtime from GitHub releases
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.0/onnxruntime-linux-x64-1.16.0.tgz
tar -xzf onnxruntime-linux-x64-1.16.0.tgz
sudo cp onnxruntime-linux-x64-1.16.0/lib/* /usr/local/lib/
sudo cp -r onnxruntime-linux-x64-1.16.0/include/* /usr/local/include/
sudo ldconfig
```
### Install Go dependencies
```bash
go mod download
```
### Configure ONNX Runtime Library Path (Optional)
If the ONNX Runtime library is not in a standard location, set the `ONNXRUNTIME_LIB_PATH` environment variable to the full path of the shared library:
**Automatic Detection (Recommended):**
```bash
# macOS
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
# Linux
export ONNXRUNTIME_LIB_PATH=$(find /usr/local/lib /usr/lib -name "libonnxruntime.so*" 2>/dev/null | head -n 1)
```
**Manual Configuration:**
```bash
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.so # Linux
# or
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib # macOS
```
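As a rough illustration of how a Go program could consume this variable, the sketch below resolves the library path from `ONNXRUNTIME_LIB_PATH` and falls back to a conventional per-OS file name. The function name `resolveOrtLibPath` and the fallback names are assumptions made for this example; the lookup in `helper.go` may work differently.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// resolveOrtLibPath is a hypothetical helper (not part of this project's API).
// It prefers the ONNXRUNTIME_LIB_PATH environment variable and otherwise
// returns a conventional shared-library file name for the given OS.
func resolveOrtLibPath(getenv func(string) string, goos string) string {
	if p := getenv("ONNXRUNTIME_LIB_PATH"); p != "" {
		return p
	}
	switch goos {
	case "darwin":
		return "libonnxruntime.dylib"
	case "windows":
		return "onnxruntime.dll"
	default:
		return "libonnxruntime.so"
	}
}

func main() {
	fmt.Println("ONNX Runtime library:", resolveOrtLibPath(os.Getenv, runtime.GOOS))
}
```

With `github.com/yalue/onnxruntime_go`, a path resolved this way would typically be passed to `ort.SetSharedLibraryPath` before `ort.InitializeEnvironment` is called.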
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
go run example_onnx.go helper.go
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
go run example_onnx.go helper.go \
-voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
-text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice (M1.json) for the first text
- Use female voice (F1.json) for the second text
- Process both samples in a single batch
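The pairing rule behind batch mode can be sketched in a few lines: `-voice-style` is split on commas, `-text` on pipes, and the two lists must have the same length. The helper names `splitAndTrim` and `pairBatch` are illustrative only; the example program does the equivalent work inside its own argument parsing.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAndTrim splits s on sep and trims whitespace around each part.
func splitAndTrim(s, sep string) []string {
	parts := strings.Split(s, sep)
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// pairBatch (hypothetical name) turns the two flag values into aligned
// slices and rejects inputs whose counts differ.
func pairBatch(voiceStyles, texts string) ([]string, []string, error) {
	styles := splitAndTrim(voiceStyles, ",")
	txts := splitAndTrim(texts, "|")
	if len(styles) != len(txts) {
		return nil, nil, fmt.Errorf("number of voice styles (%d) must match number of texts (%d)",
			len(styles), len(txts))
	}
	return styles, txts, nil
}

func main() {
	styles, txts, err := pairBatch(
		"assets/voice_styles/M1.json, assets/voice_styles/F1.json",
		"First sentence.|Second sentence.",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i := range styles {
		fmt.Printf("%s -> %q\n", styles[i], txts[i])
	}
}
```

Because the pipe character separates entries, a text that itself contains `|` would be split into two entries under this scheme.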
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
go run example_onnx.go helper.go \
-total-step 10 \
-voice-style "assets/voice_styles/M1.json" \
-text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
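The cost side of this trade-off follows from the shape of the denoising loop: the estimator model runs once per step, so the number of runs equals `-total-step`. The sketch below shows only that loop structure. The `stepFn` type and the toy estimator are stand-ins invented for this example and do not reflect the real model's inputs or outputs.

```go
package main

import "fmt"

// stepFn stands in for a single estimator run. The signature is hypothetical;
// the real model takes several more inputs (text embedding, style, masks).
type stepFn func(latent []float64, currentStep, totalStep float32) []float64

// denoise applies the estimator totalStep times and reports how many runs
// were needed, which is where the extra inference time comes from.
func denoise(latent []float64, totalStep int, estimate stepFn) ([]float64, int) {
	calls := 0
	for step := 0; step < totalStep; step++ {
		latent = estimate(latent, float32(step), float32(totalStep))
		calls++
	}
	return latent, calls
}

func main() {
	// Toy estimator: halves every value on each step.
	toy := func(x []float64, cur, total float32) []float64 {
		out := make([]float64, len(x))
		for i, v := range x {
			out[i] = v * 0.5
		}
		return out
	}
	for _, n := range []int{5, 10} {
		_, calls := denoise([]float64{1, -1}, n, toy)
		fmt.Printf("total-step=%d -> %d estimator runs\n", n, calls)
	}
}
```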
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `-use-gpu` | flag | false | Use GPU for inference (not supported yet; CPU is used by default) |
| `-onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `-total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `-n-test` | int | 4 | Number of times to generate each sample |
| `-voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
| `-save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `-voice-style` files must match the number of `-text` entries
- **Quality vs Speed**: Higher `-total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
## Building a Binary
To build a standalone executable:
```bash
go build -o tts_example example_onnx.go helper.go
```
Then run it:
```bash
./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
```
Symlink
+1
View File
@@ -0,0 +1 @@
../assets
+144
View File
@@ -0,0 +1,144 @@
package main
import (
"flag"
"fmt"
"os"
"path/filepath"
"strings"
ort "github.com/yalue/onnxruntime_go"
)
// Args holds command line arguments
type Args struct {
useGPU bool
onnxDir string
totalStep int
nTest int
voiceStyle []string
text []string
saveDir string
}
func parseArgs() *Args {
args := &Args{}
flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
flag.StringVar(&args.onnxDir, "onnx-dir", "assets/onnx", "Path to ONNX model directory")
flag.IntVar(&args.totalStep, "total-step", 5, "Number of denoising steps")
flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")
var voiceStyleStr, textStr string
flag.StringVar(&voiceStyleStr, "voice-style", "assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")
flag.Parse()
// Parse comma-separated voice-style
if voiceStyleStr != "" {
args.voiceStyle = strings.Split(voiceStyleStr, ",")
for i := range args.voiceStyle {
args.voiceStyle[i] = strings.TrimSpace(args.voiceStyle[i])
}
}
// Parse pipe-separated text
if textStr != "" {
args.text = strings.Split(textStr, "|")
for i := range args.text {
args.text[i] = strings.TrimSpace(args.text[i])
}
}
return args
}
func main() {
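// Println already appends a newline, so print the blank line separately (avoids a go vet warning)
fmt.Println("=== TTS Inference with ONNX Runtime (Go) ===")
fmt.Println()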
// --- 1. Parse arguments --- //
args := parseArgs()
totalStep := args.totalStep
nTest := args.nTest
saveDir := args.saveDir
voiceStylePaths := args.voiceStyle
textList := args.text
if len(voiceStylePaths) != len(textList) {
fmt.Printf("Error: Number of voice styles (%d) must match number of texts (%d)\n",
len(voiceStylePaths), len(textList))
os.Exit(1)
}
bsz := len(voiceStylePaths)
// Initialize ONNX Runtime
if err := InitializeONNXRuntime(); err != nil {
fmt.Printf("Error initializing ONNX Runtime: %v\n", err)
os.Exit(1)
}
defer ort.DestroyEnvironment()
// --- 2. Load config --- //
cfg, err := LoadCfgs(args.onnxDir)
if err != nil {
fmt.Printf("Error loading config: %v\n", err)
os.Exit(1)
}
// --- 3. Load TTS components --- //
textToSpeech, err := LoadTextToSpeech(args.onnxDir, args.useGPU, cfg)
if err != nil {
fmt.Printf("Error loading TTS components: %v\n", err)
os.Exit(1)
}
defer textToSpeech.Destroy()
// --- 4. Load voice styles --- //
style, err := LoadVoiceStyle(voiceStylePaths, true)
if err != nil {
fmt.Printf("Error loading voice styles: %v\n", err)
os.Exit(1)
}
defer style.Destroy()
// --- 5. Synthesize speech --- //
if err := os.MkdirAll(saveDir, 0755); err != nil {
fmt.Printf("Error creating save directory: %v\n", err)
os.Exit(1)
}
for n := 0; n < nTest; n++ {
fmt.Printf("\n[%d/%d] Starting synthesis...\n", n+1, nTest)
var wav []float32
var duration []float32
Timer("Generating speech from text", func() interface{} {
w, d, err := textToSpeech.Call(textList, style, totalStep)
if err != nil {
fmt.Printf("Error generating speech: %v\n", err)
os.Exit(1)
}
wav = w
duration = d
return nil
})
// Save outputs
for i := 0; i < bsz; i++ {
fname := fmt.Sprintf("%s_%d.wav", sanitizeFilename(textList[i], 20), n+1)
wavOut := extractWavSegment(wav, duration[i], textToSpeech.SampleRate, i, bsz)
outputPath := filepath.Join(saveDir, fname)
if err := writeWavFile(outputPath, wavOut, textToSpeech.SampleRate); err != nil {
fmt.Printf("Error writing wav file: %v\n", err)
continue
}
fmt.Printf("Saved: %s\n", outputPath)
}
}
fmt.Println("\n=== Synthesis completed successfully! ===")
}
+12
View File
@@ -0,0 +1,12 @@
module supertonic-tts
go 1.21
require (
github.com/go-audio/audio v1.0.0
github.com/go-audio/wav v1.1.0
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12
github.com/yalue/onnxruntime_go v1.11.0
)
require github.com/go-audio/riff v1.0.0 // indirect
+10
View File
@@ -0,0 +1,10 @@
github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
github.com/go-audio/audio v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
github.com/go-audio/riff v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
github.com/go-audio/riff v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12 h1:dd7vnTDfjtwCETZDrRe+GPYNLA1jBtbZeyfyE8eZCyk=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12/go.mod h1:i/KKcxEWEO8Yyl11DYafRPKOPVYTrhxiTRigjtEEXZU=
github.com/yalue/onnxruntime_go v1.11.0 h1:aKH4yPIbqfcB3SfnQWq/WxzLelkyolntHnffL3eMBHY=
github.com/yalue/onnxruntime_go v1.11.0/go.mod h1:b4X26A8pekNb1ACJ58wAXgNKeUCGEAQ9dmACut9Sm/4=
+734
View File
@@ -0,0 +1,734 @@
package main
import (
"encoding/json"
"fmt"
"math"
"math/rand"
"os"
"path/filepath"
"time"
"github.com/go-audio/audio"
"github.com/go-audio/wav"
ort "github.com/yalue/onnxruntime_go"
)
// Config structures
type SpecProcessorConfig struct {
NFFT int `json:"n_fft"`
WinLength int `json:"win_length"`
HopLength int `json:"hop_length"`
NMels int `json:"n_mels"`
Eps float64 `json:"eps"`
NormMean float64 `json:"norm_mean"`
NormStd float64 `json:"norm_std"`
}
type EncoderConfig struct {
SpecProcessor SpecProcessorConfig `json:"spec_processor"`
}
type AEConfig struct {
SampleRate int `json:"sample_rate"`
BaseChunkSize int `json:"base_chunk_size"`
Encoder EncoderConfig `json:"encoder"`
}
type StyleTokenLayerConfig struct {
NStyle int `json:"n_style"`
StyleValueDim int `json:"style_value_dim"`
}
type StyleEncoderConfig struct {
StyleTokenLayer StyleTokenLayerConfig `json:"style_token_layer"`
}
type ProjOutConfig struct {
Idim int `json:"idim"`
Odim int `json:"odim"`
}
type TextEncoderConfig struct {
ProjOut ProjOutConfig `json:"proj_out"`
}
type TTLConfig struct {
ChunkCompressFactor int `json:"chunk_compress_factor"`
LatentDim int `json:"latent_dim"`
StyleEncoder StyleEncoderConfig `json:"style_encoder"`
TextEncoder TextEncoderConfig `json:"text_encoder"`
}
type DPStyleEncoderConfig struct {
StyleTokenLayer StyleTokenLayerConfig `json:"style_token_layer"`
}
type DPConfig struct {
LatentDim int `json:"latent_dim"`
ChunkCompressFactor int `json:"chunk_compress_factor"`
StyleEncoder DPStyleEncoderConfig `json:"style_encoder"`
}
type Config struct {
AE AEConfig `json:"ae"`
TTL TTLConfig `json:"ttl"`
DP DPConfig `json:"dp"`
}
// VoiceStyleData holds voice style JSON structure
type VoiceStyleData struct {
StyleTTL struct {
Data [][][]float64 `json:"data"`
Dims []int64 `json:"dims"`
Type string `json:"type"`
} `json:"style_ttl"`
StyleDP struct {
Data [][][]float64 `json:"data"`
Dims []int64 `json:"dims"`
Type string `json:"type"`
} `json:"style_dp"`
}
// UnicodeProcessor for text processing
type UnicodeProcessor struct {
indexer []int64
}
// NewUnicodeProcessor creates a new UnicodeProcessor
func NewUnicodeProcessor(unicodeIndexerPath string) (*UnicodeProcessor, error) {
indexer, err := loadJSONInt64(unicodeIndexerPath)
if err != nil {
return nil, fmt.Errorf("failed to load unicode indexer: %w", err)
}
return &UnicodeProcessor{indexer: indexer}, nil
}
// Call processes text list to text IDs and mask
func (up *UnicodeProcessor) Call(textList []string) ([][]int64, [][][]float64) {
// Preprocess texts
processedTexts := make([]string, len(textList))
for i, text := range textList {
processedTexts[i] = preprocessText(text)
}
// Get text lengths
textLengths := make([]int64, len(processedTexts))
maxLen := 0
for i, text := range processedTexts {
textLengths[i] = int64(len([]rune(text)))
if int(textLengths[i]) > maxLen {
maxLen = int(textLengths[i])
}
}
// Create text IDs
textIDs := make([][]int64, len(processedTexts))
for i, text := range processedTexts {
row := make([]int64, maxLen)
runes := []rune(text)
for j, r := range runes {
unicodeVal := int(r)
if unicodeVal < len(up.indexer) {
row[j] = up.indexer[unicodeVal]
} else {
row[j] = -1
}
}
textIDs[i] = row
}
// Create text mask
textMask := lengthToMask(textLengths, maxLen)
return textIDs, textMask
}
// Utility functions
func preprocessText(text string) string {
// No normalization is applied: the text is returned unchanged because the Go
// standard library has no NFKD normalization.
// For full Unicode normalization, use golang.org/x/text/unicode/norm
return text
}
func lengthToMask(lengths []int64, maxLen int) [][][]float64 {
bsz := len(lengths)
mask := make([][][]float64, bsz)
for i := 0; i < bsz; i++ {
row := make([]float64, maxLen)
for j := 0; j < maxLen; j++ {
if int64(j) < lengths[i] {
row[j] = 1.0
} else {
row[j] = 0.0
}
}
mask[i] = [][]float64{row}
}
return mask
}
func getTextMask(textLengths []int64, maxLen int) [][][]float64 {
return lengthToMask(textLengths, maxLen)
}
func getLatentMask(wavLengths []int64, cfg Config) [][][]float64 {
baseChunkSize := int64(cfg.AE.BaseChunkSize)
chunkCompressFactor := int64(cfg.TTL.ChunkCompressFactor)
latentSize := baseChunkSize * chunkCompressFactor
latentLengths := make([]int64, len(wavLengths))
maxLen := int64(0)
for i, wavLen := range wavLengths {
latentLengths[i] = (wavLen + latentSize - 1) / latentSize
if latentLengths[i] > maxLen {
maxLen = latentLengths[i]
}
}
return lengthToMask(latentLengths, int(maxLen))
}
func writeWavFile(filename string, audioData []float64, sampleRate int) error {
file, err := os.Create(filename)
if err != nil {
return err
}
defer file.Close()
// Convert float64 to int
intData := make([]int, len(audioData))
for i, sample := range audioData {
// Clamp to [-1, 1] and convert to 16-bit int
clamped := math.Max(-1.0, math.Min(1.0, sample))
intData[i] = int(clamped * 32767)
}
encoder := wav.NewEncoder(file, sampleRate, 16, 1, 1)
buf := &audio.IntBuffer{
Data: intData,
Format: &audio.Format{SampleRate: sampleRate, NumChannels: 1},
SourceBitDepth: 16,
}
if err := encoder.Write(buf); err != nil {
return err
}
return encoder.Close()
}
// Style holds style tensors
type Style struct {
TtlTensor *ort.Tensor[float32]
DpTensor *ort.Tensor[float32]
}
func (s *Style) Destroy() {
if s.TtlTensor != nil {
s.TtlTensor.Destroy()
}
if s.DpTensor != nil {
s.DpTensor.Destroy()
}
}
// LoadVoiceStyle loads voice style from JSON files
func LoadVoiceStyle(voiceStylePaths []string, verbose bool) (*Style, error) {
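bsz := len(voiceStylePaths)
if bsz == 0 {
// Fail early instead of panicking on voiceStylePaths[0]
return nil, fmt.Errorf("at least one voice style path is required")
}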
// Read first file to get dimensions
firstData, err := os.ReadFile(voiceStylePaths[0])
if err != nil {
return nil, fmt.Errorf("failed to read voice style file: %w", err)
}
var firstStyle VoiceStyleData
if err := json.Unmarshal(firstData, &firstStyle); err != nil {
return nil, fmt.Errorf("failed to parse voice style JSON: %w", err)
}
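ttlDims := firstStyle.StyleTTL.Dims
dpDims := firstStyle.StyleDP.Dims
// Both style tensors are indexed as [batch, dim1, dim2] below
if len(ttlDims) != 3 || len(dpDims) != 3 {
return nil, fmt.Errorf("voice style dims must have 3 entries, got %v and %v", ttlDims, dpDims)
}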
ttlDim1 := ttlDims[1]
ttlDim2 := ttlDims[2]
dpDim1 := dpDims[1]
dpDim2 := dpDims[2]
// Pre-allocate arrays with full batch size
ttlSize := int(int64(bsz) * ttlDim1 * ttlDim2)
dpSize := int(int64(bsz) * dpDim1 * dpDim2)
ttlFlat := make([]float32, ttlSize)
dpFlat := make([]float32, dpSize)
// Fill in the data
for i := 0; i < bsz; i++ {
data, err := os.ReadFile(voiceStylePaths[i])
if err != nil {
return nil, fmt.Errorf("failed to read voice style file: %w", err)
}
var voiceStyle VoiceStyleData
if err := json.Unmarshal(data, &voiceStyle); err != nil {
return nil, fmt.Errorf("failed to parse voice style JSON: %w", err)
}
// Flatten TTL data
ttlOffset := int(int64(i) * ttlDim1 * ttlDim2)
idx := 0
for _, batch := range voiceStyle.StyleTTL.Data {
for _, row := range batch {
for _, val := range row {
ttlFlat[ttlOffset+idx] = float32(val)
idx++
}
}
}
// Flatten DP data
dpOffset := int(int64(i) * dpDim1 * dpDim2)
idx = 0
for _, batch := range voiceStyle.StyleDP.Data {
for _, row := range batch {
for _, val := range row {
dpFlat[dpOffset+idx] = float32(val)
idx++
}
}
}
}
ttlShape := []int64{int64(bsz), ttlDim1, ttlDim2}
dpShape := []int64{int64(bsz), dpDim1, dpDim2}
ttlTensor, err := ort.NewTensor(ttlShape, ttlFlat)
if err != nil {
return nil, fmt.Errorf("failed to create TTL tensor: %w", err)
}
dpTensor, err := ort.NewTensor(dpShape, dpFlat)
if err != nil {
ttlTensor.Destroy()
return nil, fmt.Errorf("failed to create DP tensor: %w", err)
}
if verbose {
fmt.Printf("Loaded %d voice styles\n\n", bsz)
}
return &Style{
TtlTensor: ttlTensor,
DpTensor: dpTensor,
}, nil
}
// TextToSpeech generates speech from text
type TextToSpeech struct {
cfg Config
textProcessor *UnicodeProcessor
dpOrt *ort.DynamicAdvancedSession
textEncOrt *ort.DynamicAdvancedSession
vectorEstOrt *ort.DynamicAdvancedSession
vocoderOrt *ort.DynamicAdvancedSession
SampleRate int
baseChunkSize int
chunkCompress int
ldim int
}
func (tts *TextToSpeech) sampleNoisyLatent(durOnnx []float32) ([][][]float64, [][][]float64) {
bsz := len(durOnnx)
maxDur := float64(0)
for _, d := range durOnnx {
if float64(d) > maxDur {
maxDur = float64(d)
}
}
wavLenMax := maxDur * float64(tts.SampleRate)
wavLengths := make([]int64, bsz)
for i, d := range durOnnx {
wavLengths[i] = int64(float64(d) * float64(tts.SampleRate))
}
chunkSize := tts.baseChunkSize * tts.chunkCompress
latentLen := int((wavLenMax + float64(chunkSize) - 1) / float64(chunkSize))
latentDim := tts.ldim * tts.chunkCompress
rng := rand.New(rand.NewSource(time.Now().UnixNano()))
noisyLatent := make([][][]float64, bsz)
for b := 0; b < bsz; b++ {
batch := make([][]float64, latentDim)
for d := 0; d < latentDim; d++ {
row := make([]float64, latentLen)
for t := 0; t < latentLen; t++ {
// Box-Muller transform for normal distribution
// Add epsilon to avoid log(0)
const eps = 1e-10
u1 := math.Max(eps, rng.Float64())
u2 := rng.Float64()
row[t] = math.Sqrt(-2.0*math.Log(u1)) * math.Cos(2.0*math.Pi*u2)
}
batch[d] = row
}
noisyLatent[b] = batch
}
latentMask := getLatentMask(wavLengths, tts.cfg)
// Apply mask
for b := 0; b < bsz; b++ {
for d := 0; d < latentDim; d++ {
for t := 0; t < latentLen; t++ {
noisyLatent[b][d][t] *= latentMask[b][0][t]
}
}
}
return noisyLatent, latentMask
}
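// Call synthesizes speech for a batch of texts with the given style, running
// totalStep denoising iterations. It returns the flattened batch waveform and
// the predicted duration in seconds for each item.
func (tts *TextToSpeech) Call(textList []string, style *Style, totalStep int) ([]float32, []float32, error) {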
bsz := len(textList)
// Process text
textIDs, textMask := tts.textProcessor.Call(textList)
textIDsShape := []int64{int64(bsz), int64(len(textIDs[0]))}
textMaskShape := []int64{int64(bsz), 1, int64(len(textMask[0][0]))}
textIDsTensor := IntArrayToTensor(textIDs, textIDsShape)
defer textIDsTensor.Destroy()
textMaskTensor := ArrayToTensor(textMask, textMaskShape)
defer textMaskTensor.Destroy()
// Predict duration
dpOutputs := []ort.Value{nil}
err := tts.dpOrt.Run(
[]ort.Value{textIDsTensor, style.DpTensor, textMaskTensor},
dpOutputs,
)
if err != nil {
return nil, nil, fmt.Errorf("failed to run duration predictor: %w", err)
}
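durTensor := dpOutputs[0].(*ort.Tensor[float32])
defer durTensor.Destroy()
// Copy the durations: GetData exposes the tensor's backing memory, which the
// deferred Destroy releases, and durOnnx is returned to the caller.
durOnnx := append([]float32(nil), durTensor.GetData()...)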
// Encode text
textIDsTensor2 := IntArrayToTensor(textIDs, textIDsShape)
defer textIDsTensor2.Destroy()
textEncOutputs := []ort.Value{nil}
err = tts.textEncOrt.Run(
[]ort.Value{textIDsTensor2, style.TtlTensor, textMaskTensor},
textEncOutputs,
)
if err != nil {
return nil, nil, fmt.Errorf("failed to run text encoder: %w", err)
}
textEmbTensor := textEncOutputs[0].(*ort.Tensor[float32])
defer textEmbTensor.Destroy()
// Sample noisy latent
xt, latentMask := tts.sampleNoisyLatent(durOnnx)
latentShape := []int64{int64(bsz), int64(len(xt[0])), int64(len(xt[0][0]))}
latentMaskShape := []int64{int64(bsz), 1, int64(len(latentMask[0][0]))}
// Prepare constant arrays
totalStepArray := make([]float32, bsz)
for b := 0; b < bsz; b++ {
totalStepArray[b] = float32(totalStep)
}
scalarShape := []int64{int64(bsz)}
totalStepTensor, err := ort.NewTensor(scalarShape, totalStepArray)
if err != nil {
return nil, nil, fmt.Errorf("failed to create total step tensor: %w", err)
}
defer totalStepTensor.Destroy()
// Denoising loop
for step := 0; step < totalStep; step++ {
currentStepArray := make([]float32, bsz)
for b := 0; b < bsz; b++ {
currentStepArray[b] = float32(step)
}
currentStepTensor, err := ort.NewTensor(scalarShape, currentStepArray)
if err != nil {
return nil, nil, fmt.Errorf("failed to create current step tensor: %w", err)
}
noisyLatentTensor := ArrayToTensor(xt, latentShape)
latentMaskTensor := ArrayToTensor(latentMask, latentMaskShape)
textMaskTensor2 := ArrayToTensor(textMask, textMaskShape)
vectorEstOutputs := []ort.Value{nil}
err = tts.vectorEstOrt.Run(
[]ort.Value{noisyLatentTensor, textEmbTensor, style.TtlTensor, latentMaskTensor, textMaskTensor2,
currentStepTensor, totalStepTensor},
vectorEstOutputs,
)
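if err != nil {
// Release this step's tensors before returning so they do not leak.
noisyLatentTensor.Destroy()
latentMaskTensor.Destroy()
textMaskTensor2.Destroy()
currentStepTensor.Destroy()
return nil, nil, fmt.Errorf("failed to run vector estimator: %w", err)
}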
denoisedTensor := vectorEstOutputs[0].(*ort.Tensor[float32])
denoisedData := denoisedTensor.GetData()
// Update latent
idx := 0
for b := 0; b < bsz; b++ {
for d := 0; d < len(xt[b]); d++ {
for t := 0; t < len(xt[b][d]); t++ {
xt[b][d][t] = float64(denoisedData[idx])
idx++
}
}
}
noisyLatentTensor.Destroy()
latentMaskTensor.Destroy()
textMaskTensor2.Destroy()
currentStepTensor.Destroy()
denoisedTensor.Destroy()
}
// Generate waveform
finalLatentTensor := ArrayToTensor(xt, latentShape)
defer finalLatentTensor.Destroy()
vocoderOutputs := []ort.Value{nil}
err = tts.vocoderOrt.Run(
[]ort.Value{finalLatentTensor},
vocoderOutputs,
)
if err != nil {
return nil, nil, fmt.Errorf("failed to run vocoder: %w", err)
}
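wavBatchTensor := vocoderOutputs[0].(*ort.Tensor[float32])
defer wavBatchTensor.Destroy()
// Copy the samples: GetData exposes the tensor's backing memory, which the
// deferred Destroy releases before the caller can read it.
wav := append([]float32(nil), wavBatchTensor.GetData()...)
return wav, durOnnx, nil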
}
// Destroy releases the ONNX Runtime sessions held by the TextToSpeech instance.
func (tts *TextToSpeech) Destroy() {
if tts.dpOrt != nil {
tts.dpOrt.Destroy()
}
if tts.textEncOrt != nil {
tts.textEncOrt.Destroy()
}
if tts.vectorEstOrt != nil {
tts.vectorEstOrt.Destroy()
}
if tts.vocoderOrt != nil {
tts.vocoderOrt.Destroy()
}
}
// LoadTextToSpeech loads TTS components
func LoadTextToSpeech(onnxDir string, useGPU bool, cfg Config) (*TextToSpeech, error) {
if useGPU {
return nil, fmt.Errorf("GPU mode is not supported yet")
}
fmt.Print("Using CPU for inference\n\n")
// Load models
dpPath := filepath.Join(onnxDir, "duration_predictor.onnx")
textEncPath := filepath.Join(onnxDir, "text_encoder.onnx")
vectorEstPath := filepath.Join(onnxDir, "vector_estimator.onnx")
vocoderPath := filepath.Join(onnxDir, "vocoder.onnx")
// Each failure path destroys the sessions created so far so they do not leak.
dpOrt, err := ort.NewDynamicAdvancedSession(dpPath, []string{"text_ids", "style_dp", "text_mask"},
[]string{"duration"}, nil)
if err != nil {
return nil, fmt.Errorf("failed to load duration predictor: %w", err)
}
textEncOrt, err := ort.NewDynamicAdvancedSession(textEncPath, []string{"text_ids", "style_ttl", "text_mask"},
[]string{"text_emb"}, nil)
if err != nil {
dpOrt.Destroy()
return nil, fmt.Errorf("failed to load text encoder: %w", err)
}
vectorEstOrt, err := ort.NewDynamicAdvancedSession(vectorEstPath,
[]string{"noisy_latent", "text_emb", "style_ttl", "latent_mask", "text_mask", "current_step", "total_step"},
[]string{"denoised_latent"}, nil)
if err != nil {
dpOrt.Destroy()
textEncOrt.Destroy()
return nil, fmt.Errorf("failed to load vector estimator: %w", err)
}
vocoderOrt, err := ort.NewDynamicAdvancedSession(vocoderPath, []string{"latent"},
[]string{"wav_tts"}, nil)
if err != nil {
dpOrt.Destroy()
textEncOrt.Destroy()
vectorEstOrt.Destroy()
return nil, fmt.Errorf("failed to load vocoder: %w", err)
}
// Load text processor
unicodeIndexerPath := filepath.Join(onnxDir, "unicode_indexer.json")
textProcessor, err := NewUnicodeProcessor(unicodeIndexerPath)
if err != nil {
dpOrt.Destroy()
textEncOrt.Destroy()
vectorEstOrt.Destroy()
vocoderOrt.Destroy()
return nil, fmt.Errorf("failed to load text processor: %w", err)
}
textToSpeech := &TextToSpeech{
cfg: cfg,
textProcessor: textProcessor,
dpOrt: dpOrt,
textEncOrt: textEncOrt,
vectorEstOrt: vectorEstOrt,
vocoderOrt: vocoderOrt,
SampleRate: cfg.AE.SampleRate,
baseChunkSize: cfg.AE.BaseChunkSize,
chunkCompress: cfg.TTL.ChunkCompressFactor,
ldim: cfg.TTL.LatentDim,
}
return textToSpeech, nil
}
// InitializeONNXRuntime initializes ONNX Runtime environment
func InitializeONNXRuntime() error {
libPath := os.Getenv("ONNXRUNTIME_LIB_PATH")
if libPath == "" {
libPath = "/usr/local/lib/libonnxruntime.so"
if _, err := os.Stat("/usr/local/lib/libonnxruntime.dylib"); err == nil {
libPath = "/usr/local/lib/libonnxruntime.dylib"
} else if _, err := os.Stat("/usr/lib/libonnxruntime.so"); err == nil {
libPath = "/usr/lib/libonnxruntime.so"
}
}
ort.SetSharedLibraryPath(libPath)
if err := ort.InitializeEnvironment(); err != nil {
return fmt.Errorf("failed to initialize ONNX Runtime: %w\nHint: Set ONNXRUNTIME_LIB_PATH environment variable", err)
}
return nil
}
// sanitizeFilename creates a safe filename from text
func sanitizeFilename(text string, maxLen int) string {
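// Truncate by runes rather than bytes so a multi-byte character is never split.
runes := []rune(text)
if len(runes) > maxLen {
runes = runes[:maxLen]
}
result := make([]rune, 0, len(runes))
for _, r := range runes {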
if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
result = append(result, r)
} else {
result = append(result, '_')
}
}
return string(result)
}
// extractWavSegment extracts a single audio segment from batch output
func extractWavSegment(wav []float32, duration float32, sampleRate int, index int, batchSize int) []float64 {
wavLen := int(float64(sampleRate) * float64(duration))
wavPerBatch := len(wav) / batchSize
wavStart := index * wavPerBatch
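// Bound reads to this item's slot so samples from the next batch item are
// never copied; any shortfall is left as trailing silence.
wavEnd := wavStart + wavPerBatch
if wavEnd > len(wav) {
wavEnd = len(wav)
}
wavOut := make([]float64, wavLen)
for j := 0; j < wavLen && wavStart+j < wavEnd; j++ {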
wavOut[j] = float64(wav[wavStart+j])
}
return wavOut
}
// Timer measures execution time
func Timer(name string, fn func() interface{}) interface{} {
start := time.Now()
fmt.Printf("%s...\n", name)
result := fn()
elapsed := time.Since(start).Seconds()
fmt.Printf(" -> %s completed in %.2f sec\n", name, elapsed)
return result
}
// LoadCfgs loads configuration from JSON file
func LoadCfgs(onnxDir string) (Config, error) {
cfgPath := filepath.Join(onnxDir, "tts.json")
data, err := os.ReadFile(cfgPath)
if err != nil {
return Config{}, err
}
var cfg Config
if err := json.Unmarshal(data, &cfg); err != nil {
return Config{}, err
}
return cfg, nil
}
// JSON loading helpers
func loadJSONInt64(filePath string) ([]int64, error) {
data, err := os.ReadFile(filePath)
if err != nil {
return nil, err
}
var result []int64
if err := json.Unmarshal(data, &result); err != nil {
return nil, err
}
return result, nil
}
// Tensor conversion utilities
func ArrayToTensor(array [][][]float64, shape []int64) *ort.Tensor[float32] {
// Flatten array
totalSize := int64(1)
for _, dim := range shape {
totalSize *= dim
}
flat := make([]float32, totalSize)
idx := 0
for b := 0; b < len(array); b++ {
for d := 0; d < len(array[b]); d++ {
for t := 0; t < len(array[b][d]); t++ {
flat[idx] = float32(array[b][d][t])
idx++
}
}
}
tensor, err := ort.NewTensor(shape, flat)
if err != nil {
panic(err)
}
return tensor
}
func IntArrayToTensor(array [][]int64, shape []int64) *ort.Tensor[int64] {
// Flatten array
totalSize := int64(1)
for _, dim := range shape {
totalSize *= dim
}
flat := make([]int64, totalSize)
idx := 0
for b := 0; b < len(array); b++ {
for t := 0; t < len(array[b]); t++ {
flat[idx] = array[b][t]
idx++
}
}
tensor, err := ort.NewTensor(shape, flat)
if err != nil {
panic(err)
}
return tensor
}
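The tensor helpers above flatten nested slices in row-major order and panic if tensor creation fails. They do not check that the nested slices actually match the declared shape. As a minimal sketch of a stricter variant, the standalone program below flattens a `[batch][dim][time]` slice and returns an error on any mismatch. The name `flatten3D` is hypothetical and not part of the original file, and the sketch deliberately leaves out the ONNX Runtime tensor creation step.

```go
package main

import (
	"errors"
	"fmt"
)

// flatten3D copies a [batch][dim][time] nested slice into a single row-major
// []float32 and checks every level against the declared shape.
// Hypothetical helper for illustration only; not part of the original file.
func flatten3D(array [][][]float64, shape []int64) ([]float32, error) {
	if len(shape) != 3 {
		return nil, errors.New("shape must have exactly 3 dimensions")
	}
	for _, dim := range shape {
		if dim < 0 {
			return nil, errors.New("shape dimensions must be non-negative")
		}
	}
	if int64(len(array)) != shape[0] {
		return nil, fmt.Errorf("batch size %d does not match shape[0]=%d", len(array), shape[0])
	}
	flat := make([]float32, 0, shape[0]*shape[1]*shape[2])
	for b := range array {
		if int64(len(array[b])) != shape[1] {
			return nil, fmt.Errorf("item %d has %d rows, want %d", b, len(array[b]), shape[1])
		}
		for d := range array[b] {
			if int64(len(array[b][d])) != shape[2] {
				return nil, fmt.Errorf("item %d row %d has %d values, want %d",
					b, d, len(array[b][d]), shape[2])
			}
			// The innermost index varies fastest, which gives row-major order.
			for _, v := range array[b][d] {
				flat = append(flat, float32(v))
			}
		}
	}
	return flat, nil
}

func main() {
	x := [][][]float64{
		{{1, 2, 3}, {4, 5, 6}},
	}

	flat, err := flatten3D(x, []int64{1, 2, 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(flat) // [1 2 3 4 5 6]

	// A declared shape that disagrees with the data is reported, not padded.
	_, err = flatten3D(x, []int64{1, 2, 4})
	fmt.Println(err != nil) // true
}
```

Returning an error rather than panicking lets a caller such as `Call` report a shape mismatch through its existing error return. Whether that trade-off suits this codebase is a design decision for the maintainers.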
+250
View File
@@ -0,0 +1,250 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg id="_레이어_2" data-name="레이어 2" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 1920 1080">
<defs>
<style>
.cls-1, .cls-2 {
fill: none;
}
.cls-3 {
fill: #227cff;
}
.cls-4 {
fill: #ff0;
}
.cls-2 {
stroke: #0a0a0a;
stroke-miterlimit: 10;
stroke-width: 1.72px;
}
.cls-5 {
fill: #f2f2f2;
}
.cls-6 {
fill: #0a0a0a;
}
.cls-7 {
clip-path: url(#clippath);
}
</style>
<clipPath id="clippath">
<rect class="cls-1" x="181.43" width="1626.55" height="1080"/>
</clipPath>
</defs>
<g id="Work">
<g>
<rect class="cls-5" width="1920" height="1080"/>
<g>
<circle class="cls-3" cx="1679.4" cy="880.43" r="59.81"/>
<path class="cls-6" d="M1713.35,805.14c-5.14,12.55-10.93,23.87-18.55,37.52l14.39-2.21c-12.75,23.11-27.5,45.05-44.09,65.57l14.46-2.22c-18.8,23.34-39.84,44.76-62.85,63.95,37.17-24.18,63.26-49.33,92.3-75.66l-16.71,2.56c24.03-21.85,46.85-44.98,68.39-69.3l-22.1,3.39c10.89-12.33,20.91-24.28,30.11-35.66l-55.34,12.05Z"/>
</g>
<g class="cls-7">
<path class="cls-4" d="M1036.15-20.65c-38.28,93.45-81.37,177.77-138.14,279.42l107.16-16.43c-94.95,172.09-204.79,335.46-328.29,488.29l107.68-16.51c-139.96,173.79-296.71,333.29-468.05,476.23,276.77-180.08,471.07-367.33,687.29-563.39l-124.4,19.07c178.91-162.71,348.9-334.99,509.26-516.04l-164.57,25.23c81.12-91.85,155.67-180.81,224.18-265.57l-412.13,89.69Z"/>
</g>
<g>
<path class="cls-6" d="M157.78,462.19h7.78v52.03h28.69v7.36h-36.47v-59.39Z"/>
<path class="cls-6" d="M204.45,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM200.77,479.75h7.19v41.82h-7.19v-41.82Z"/>
<path class="cls-6" d="M237.74,539.9c-9.03,0-16.06-3.85-19.66-8.53l5.02-5.02c3.35,4.27,8.2,7.03,14.64,7.03,7.11,0,13.97-4.43,13.97-14.47v-4.77c-2.84,4.18-8.37,7.36-14.56,7.36-11.63,0-20.49-9.37-20.49-21.33s8.87-21.25,20.49-21.25c6.19,0,11.71,3.09,14.56,7.28v-6.44h7.19v39.4c0,13.97-9.2,20.74-21.16,20.74ZM238.16,514.8c8.37,0,14.14-6.19,14.14-14.64s-5.77-14.64-14.14-14.64-14.14,6.27-14.14,14.64,5.77,14.64,14.14,14.64Z"/>
<path class="cls-6" d="M270.78,458.84h7.19v27.35c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-62.74Z"/>
<path class="cls-6" d="M334.1,521.99c-7.28,0-12.8-4.1-12.8-12.8v-22.84h-8.87v-6.61h8.87v-11.63h7.19v11.63h12.05v6.61h-12.05v21.92c0,5.35,2.43,7.11,6.94,7.11,1.76,0,3.76-.33,5.1-.84v6.44c-1.76.59-3.85,1-6.44,1Z"/>
<path class="cls-6" d="M348.07,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
<path class="cls-6" d="M399.35,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM395.67,479.75h7.19v41.82h-7.19v-41.82Z"/>
<path class="cls-6" d="M414.82,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
<path class="cls-6" d="M480.23,539.9c-9.03,0-16.06-3.85-19.66-8.53l5.02-5.02c3.35,4.27,8.2,7.03,14.64,7.03,7.11,0,13.97-4.43,13.97-14.47v-4.77c-2.84,4.18-8.37,7.36-14.56,7.36-11.63,0-20.49-9.37-20.49-21.33s8.87-21.25,20.49-21.25c6.19,0,11.71,3.09,14.56,7.28v-6.44h7.19v39.4c0,13.97-9.2,20.74-21.16,20.74ZM480.65,514.8c8.37,0,14.14-6.19,14.14-14.64s-5.77-14.64-14.14-14.64-14.14,6.27-14.14,14.64,5.77,14.64,14.14,14.64Z"/>
<path class="cls-6" d="M511.93,494.56h21.33v7.19h-21.33v-7.19Z"/>
<path class="cls-6" d="M544.97,462.19h34.55v7.36h-26.77v17.98h21.16v7.36h-21.16v26.68h-7.78v-59.39Z"/>
<path class="cls-6" d="M600.68,478.92c6.27,0,11.96,3.26,14.72,7.28v-6.44h7.19v41.82h-7.19v-6.44c-2.76,4.02-8.45,7.28-14.72,7.28-11.71,0-20.49-9.79-20.49-21.75s8.78-21.75,20.49-21.75ZM601.77,485.52c-8.45,0-14.3,6.69-14.3,15.14s5.86,15.14,14.3,15.14,14.22-6.69,14.22-15.14-5.77-15.14-14.22-15.14Z"/>
<path class="cls-6" d="M647.11,522.41c-7.44,0-13.89-3.18-16.39-9.7l5.86-3.26c1.5,4.27,6.02,6.61,10.62,6.61,4.01,0,7.28-2.01,7.28-5.6,0-3.01-1.84-4.85-7.28-6.52l-4.18-1.34c-6.86-2.01-10.46-6.36-10.46-12.21.08-7.19,6.36-11.46,14.39-11.46,6.19,0,11.04,2.59,13.8,7.28l-5.35,3.68c-1.84-2.68-4.68-4.77-8.7-4.77-3.51,0-6.94,1.92-6.94,5.02,0,2.51,1.34,4.6,5.94,6.02l4.6,1.42c7.19,2.17,11.38,5.94,11.38,12.3,0,8.03-6.19,12.55-14.55,12.55Z"/>
<path class="cls-6" d="M686.25,521.99c-7.28,0-12.8-4.1-12.8-12.8v-22.84h-8.87v-6.61h8.87v-11.63h7.19v11.63h12.04v6.61h-12.04v21.92c0,5.35,2.43,7.11,6.94,7.11,1.76,0,3.76-.33,5.1-.84v6.44c-1.76.59-3.85,1-6.44,1Z"/>
<path class="cls-6" d="M700.56,533.54h-4.77l5.6-11.96c-2.01-1-3.43-3.09-3.43-5.6,0-3.35,2.76-6.27,6.19-6.27s6.19,2.93,6.19,6.27c0,1.42-.5,2.68-1.17,3.76l-8.62,13.8Z"/>
<path class="cls-6" d="M766.14,522.58c-17.06,0-30.7-13.47-30.7-30.7s13.63-30.7,30.7-30.7,30.7,13.47,30.7,30.7-13.55,30.7-30.7,30.7ZM766.14,515.22c13.05,0,22.92-10.29,22.92-23.34s-9.87-23.34-22.92-23.34-22.84,10.29-22.84,23.34,9.87,23.34,22.84,23.34Z"/>
<path class="cls-6" d="M805.7,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
<path class="cls-6" d="M851.96,494.56h21.33v7.19h-21.33v-7.19Z"/>
<path class="cls-6" d="M885,462.19h17.06c18.24,0,31.62,12.71,31.62,29.7s-13.38,29.7-31.62,29.7h-17.06v-59.39ZM902.06,514.3c14.3,0,23.76-9.54,23.76-22.42s-9.45-22.42-23.76-22.42h-9.29v44.84h9.29Z"/>
<path class="cls-6" d="M961.03,478.92c10.96,0,19.83,7.61,19.91,21,0,.75,0,1.25-.08,2.17h-34.3c.25,7.86,6.19,13.72,14.47,13.72,6.44,0,10.46-2.84,13.05-7.28l5.69,3.93c-3.76,6.11-10.12,9.95-18.82,9.95-12.97,0-21.67-9.45-21.67-21.75s9.03-21.75,21.75-21.75ZM947.07,496.23h26.52c-1-6.86-6.44-10.96-12.8-10.96s-12.38,4.02-13.72,10.96Z"/>
<path class="cls-6" d="M982.53,479.75h8.03l14.3,31.62,14.3-31.62h8.03l-19.32,41.82h-6.02l-19.32-41.82Z"/>
<path class="cls-6" d="M1036.48,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM1032.8,479.75h7.19v41.82h-7.19v-41.82Z"/>
<path class="cls-6" d="M1070.61,522.41c-12.63,0-21.92-9.62-21.92-21.75s9.28-21.75,21.92-21.75c8.62,0,15.64,4.52,19.32,11.21l-6.27,3.51c-2.34-4.77-7.03-8.03-13.05-8.03-8.7,0-14.64,6.69-14.64,15.06s5.94,15.06,14.64,15.06c6.02,0,10.71-3.26,13.05-8.03l6.27,3.51c-3.68,6.69-10.71,11.21-19.32,11.21Z"/>
<path class="cls-6" d="M1116.03,478.92c10.96,0,19.83,7.61,19.91,21,0,.75,0,1.25-.08,2.17h-34.3c.25,7.86,6.19,13.72,14.47,13.72,6.44,0,10.46-2.84,13.05-7.28l5.69,3.93c-3.76,6.11-10.12,9.95-18.82,9.95-12.97,0-21.67-9.45-21.67-21.75s9.03-21.75,21.75-21.75ZM1102.06,496.23h26.52c-1-6.86-6.44-10.96-12.8-10.96s-12.38,4.02-13.72,10.96Z"/>
<path class="cls-6" d="M1173.33,469.55h-18.49v-7.36h44.58v7.36h-18.49v52.03h-7.61v-52.03Z"/>
<path class="cls-6" d="M1219.08,469.55h-18.49v-7.36h44.58v7.36h-18.49v52.03h-7.61v-52.03Z"/>
<path class="cls-6" d="M1253.71,507.19c3.18,5.02,8.03,8.11,14.72,8.11,6.11,0,10.96-3.35,10.96-8.87,0-4.77-3.01-7.95-8.53-10.04l-7.86-3.01c-9.29-3.43-13.47-8.28-13.47-16.4,0-9.7,7.95-15.81,18.57-15.81,7.44,0,13.63,3.35,17.32,8.11l-5.69,5.02c-3.01-3.6-6.69-5.86-11.79-5.86-5.86,0-10.71,3.26-10.71,8.2s2.93,7.28,8.95,9.54l7.19,2.76c8.78,3.35,13.8,8.36,13.8,17.15,0,10.12-7.86,16.48-18.9,16.48-9.79,0-17.73-4.6-20.83-10.71l6.27-4.68Z"/>
<path class="cls-6" d="M1299.64,522.25c-3.43,0-6.19-2.84-6.19-6.27s2.76-6.27,6.19-6.27,6.19,2.93,6.19,6.27-2.76,6.27-6.19,6.27Z"/>
</g>
<g>
<g>
<path class="cls-6" d="M175.59,793.41c0,5.19-3.89,9.12-9.89,9.12h-5.03v10.5h-3.77v-28.77h8.79c6,0,9.89,3.97,9.89,9.16ZM171.86,793.41c0-3.2-2.19-5.63-6.16-5.63h-5.03v11.27h5.03c3.97,0,6.16-2.43,6.16-5.63Z"/>
<path class="cls-6" d="M186.24,792.36c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM186.77,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M201.8,792.76h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M222.67,792.36c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM223.2,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M238.23,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M281.43,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM274.66,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M301.98,813.23c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M316.61,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM309.84,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M329.57,792.76h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M348.5,813.43c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
</g>
<g>
<path class="cls-6" d="M642.94,793.41c0,5.19-3.89,9.12-9.89,9.12h-5.03v10.5h-3.77v-28.77h8.79c6,0,9.89,3.97,9.89,9.16ZM639.22,793.41c0-3.2-2.19-5.63-6.16-5.63h-5.03v11.27h5.03c3.97,0,6.16-2.43,6.16-5.63Z"/>
<path class="cls-6" d="M645.58,792.76h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M667.05,813.43c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM667.05,810.19c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
<path class="cls-6" d="M689.63,821.9c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM689.83,809.74c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M704.82,792.76h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M725.69,792.36c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM726.22,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M741.25,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M775.49,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M811.52,787.33c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM809.73,792.76h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M818.2,792.76h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M849.08,821.9c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM849.29,809.74c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M623.73,824.36h3.48v30.4h-3.48v-30.4Z"/>
<path class="cls-6" d="M640.55,834.09c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM641.08,837.29c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M656.11,834.49h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M686.99,863.63c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM687.19,851.47c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M701.9,834.49h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
<path class="cls-6" d="M732.38,834.09c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM732.9,837.29c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M756.57,863.63c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM756.77,851.47c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M780.72,834.09c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM773.95,842.48h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M799.81,855.16c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
</g>
<g>
<path class="cls-6" d="M1102.25,813.51c-8.27,0-14.87-6.52-14.87-14.87s6.61-14.87,14.87-14.87,14.87,6.52,14.87,14.87-6.56,14.87-14.87,14.87ZM1102.25,809.94c6.32,0,11.1-4.98,11.1-11.31s-4.78-11.31-11.1-11.31-11.06,4.98-11.06,11.31,4.78,11.31,11.06,11.31Z"/>
<path class="cls-6" d="M1120.61,792.76h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M1142.21,799.93h10.33v3.49h-10.33v-3.49Z"/>
<path class="cls-6" d="M1157.4,784.25h8.27c8.83,0,15.32,6.16,15.32,14.39s-6.48,14.39-15.32,14.39h-8.27v-28.77ZM1165.67,809.5c6.93,0,11.51-4.62,11.51-10.86s-4.58-10.86-11.51-10.86h-4.5v21.72h4.5Z"/>
<path class="cls-6" d="M1193.43,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM1186.66,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M1203.03,792.76h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
<path class="cls-6" d="M1228.36,787.33c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM1226.57,792.76h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M1244.08,813.43c-6.12,0-10.62-4.66-10.62-10.54s4.5-10.54,10.62-10.54c4.17,0,7.58,2.19,9.36,5.43l-3.04,1.7c-1.13-2.31-3.4-3.89-6.32-3.89-4.21,0-7.09,3.24-7.09,7.29s2.88,7.29,7.09,7.29c2.92,0,5.19-1.58,6.32-3.89l3.04,1.7c-1.78,3.24-5.19,5.43-9.36,5.43Z"/>
<path class="cls-6" d="M1265.27,792.36c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM1258.5,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
</g>
<g>
<path class="cls-6" d="M1097.68,909.11h-11.46l5.86-7.11h12.8v59.39h-7.19v-52.28Z"/>
<path class="cls-6" d="M1135.41,962.23c-12.21,0-21-9.12-21-22.33v-16.56c0-13.13,8.78-22.33,21-22.33s21.08,9.2,21.08,22.33v16.56c0,13.22-8.87,22.33-21.08,22.33ZM1135.41,955.2c8.11,0,13.89-5.69,13.89-15.31v-16.56c0-9.62-5.77-15.31-13.89-15.31s-13.8,5.69-13.8,15.31v16.56c0,9.62,5.77,15.31,13.8,15.31Z"/>
<path class="cls-6" d="M1184.17,962.23c-12.21,0-21-9.12-21-22.33v-16.56c0-13.13,8.78-22.33,21-22.33s21.08,9.2,21.08,22.33v16.56c0,13.22-8.87,22.33-21.08,22.33ZM1184.17,955.2c8.11,0,13.89-5.69,13.89-15.31v-16.56c0-9.62-5.77-15.31-13.89-15.31s-13.8,5.69-13.8,15.31v16.56c0,9.62,5.77,15.31,13.8,15.31Z"/>
<path class="cls-6" d="M1226.08,934.29c-9.12,0-16.48-7.44-16.48-16.48s7.36-16.48,16.48-16.48,16.48,7.36,16.48,16.48-7.45,16.48-16.48,16.48ZM1226.08,927.76c5.44,0,9.87-4.6,9.87-9.95s-4.43-9.95-9.87-9.95-9.87,4.6-9.87,9.95,4.35,9.95,9.87,9.95ZM1263.3,902h7.78l-37.64,59.39h-7.78l37.64-59.39ZM1270.66,962.06c-9.12,0-16.48-7.44-16.48-16.48s7.36-16.48,16.48-16.48,16.48,7.36,16.48,16.48-7.44,16.48-16.48,16.48ZM1270.66,955.53c5.44,0,9.87-4.6,9.87-9.95s-4.43-9.95-9.87-9.95-9.87,4.6-9.87,9.95,4.35,9.95,9.87,9.95Z"/>
</g>
<g>
<path class="cls-6" d="M644.38,962.23c-11.63,0-19.99-7.86-19.99-18.65,0-7.11,3.51-12.63,9.45-15.56-3.85-2.59-6.19-6.78-6.19-11.71,0-8.87,7.19-15.31,16.73-15.31s16.65,6.44,16.65,15.31c0,4.94-2.26,9.12-6.02,11.71,5.86,2.93,9.29,8.45,9.29,15.56,0,10.79-8.11,18.65-19.91,18.65ZM644.38,955.37c7.95,0,12.63-5.27,12.63-11.79s-4.68-11.88-12.63-11.88-12.71,5.19-12.71,11.88,4.94,11.79,12.71,11.79ZM644.38,925.25c6.02,0,9.54-4.18,9.54-8.78,0-4.94-3.6-8.7-9.54-8.7s-9.62,3.76-9.62,8.7c0,4.6,3.76,8.78,9.62,8.78Z"/>
<path class="cls-6" d="M668.05,933.95h14.47v-14.39h6.53v14.39h14.39v6.44h-14.39v14.39h-6.53v-14.39h-14.47v-6.44Z"/>
</g>
<g>
<path class="cls-6" d="M176.26,962.23c-11.54,0-20.16-8.78-20.16-19.57,0-6.27,3.01-10.79,6.69-15.14l21.67-25.51h9.12l-18.15,21.5c.84-.08,1.59-.17,2.34-.17,9.95,0,18.57,8.78,18.57,19.16,0,10.96-8.53,19.74-20.08,19.74ZM176.26,955.28c7.19,0,12.71-5.77,12.71-12.8s-5.52-12.8-12.71-12.8-12.8,5.77-12.8,12.8,5.52,12.8,12.8,12.8Z"/>
<path class="cls-6" d="M219.68,962.23c-11.54,0-20.16-8.78-20.16-19.57,0-6.27,3.01-10.79,6.69-15.14l21.67-25.51h9.12l-18.15,21.5c.84-.08,1.59-.17,2.34-.17,9.95,0,18.57,8.78,18.57,19.16,0,10.96-8.53,19.74-20.08,19.74ZM219.68,955.28c7.19,0,12.71-5.77,12.71-12.8s-5.52-12.8-12.71-12.8-12.8,5.77-12.8,12.8,5.52,12.8,12.8,12.8Z"/>
<path class="cls-6" d="M255.56,915.22v46.17h-7.78v-59.39h7.11l22.08,29.61,22.08-29.61h7.03v59.39h-7.7v-46.17l-21.41,28.78-21.41-28.78Z"/>
</g>
<line class="cls-2" x1="157.77" y1="878.71" x2="407.74" y2="878.71"/>
<line class="cls-2" x1="621.9" y1="878.71" x2="871.87" y2="878.71"/>
<line class="cls-2" x1="1086.03" y1="878.71" x2="1336" y2="878.71"/>
</g>
<g>
<path class="cls-6" d="M158.3,582.04h3.77v28.77h-3.77v-28.77Z"/>
<path class="cls-6" d="M167.46,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M198.74,611.22c-6.12,0-10.62-4.66-10.62-10.54s4.5-10.54,10.62-10.54c4.17,0,7.58,2.19,9.36,5.43l-3.04,1.7c-1.13-2.31-3.4-3.89-6.32-3.89-4.21,0-7.09,3.24-7.09,7.29s2.88,7.29,7.09,7.29c2.92,0,5.19-1.58,6.32-3.89l3.04,1.7c-1.78,3.24-5.19,5.43-9.36,5.43Z"/>
<path class="cls-6" d="M210.98,590.55h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M232.42,590.15c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM225.65,598.53h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M253.73,590.15c3.04,0,5.79,1.58,7.13,3.53v-13.25h3.48v30.4h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM254.26,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M271.08,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM269.29,590.55h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M281.25,610.81h-3.49v-30.4h3.49v13.25c1.34-1.95,4.09-3.53,7.13-3.53,5.63,0,9.93,4.74,9.93,10.54s-4.3,10.54-9.93,10.54c-3.04,0-5.8-1.58-7.13-3.53v3.12ZM287.85,593.35c-4.09,0-6.89,3.24-6.89,7.34s2.8,7.34,6.89,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M301.67,580.42h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M311.64,619.28l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
<path class="cls-6" d="M337.94,580.42h3.48v30.4h-3.48v-30.4Z"/>
<path class="cls-6" d="M348.19,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM346.41,590.55h3.48v20.26h-3.48v-20.26Z"/>
<path class="cls-6" d="M363.51,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM363.71,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M378.71,580.42h3.48v13.25c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-30.4Z"/>
<path class="cls-6" d="M408.57,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M426.73,596.22l-5.03,14.59h-3.08l-6.89-20.26h3.65l4.9,14.79,5.07-14.79h2.67l5.07,14.79,4.9-14.79h3.69l-6.89,20.26h-3.08l-4.99-14.59Z"/>
<path class="cls-6" d="M452.7,590.15c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM445.93,598.53h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M467.45,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM465.67,590.55h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M482.77,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM482.97,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M497.96,580.42h3.48v13.25c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-30.4Z"/>
<path class="cls-6" d="M527.83,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M550.28,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM550.81,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M565.84,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M596.43,590.15c3.04,0,5.8,1.58,7.13,3.53v-13.25h3.49v30.4h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM596.96,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M623.62,610.81h-3.48v-30.4h3.48v13.25c1.34-1.95,4.09-3.53,7.13-3.53,5.63,0,9.93,4.74,9.93,10.54s-4.3,10.54-9.93,10.54c-3.04,0-5.79-1.58-7.13-3.53v3.12ZM630.23,593.35c-4.09,0-6.89,3.24-6.89,7.34s2.8,7.34,6.89,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M644.05,580.42h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M660.87,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM661.39,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M674.76,608.06l11.63-14.31h-11.47v-3.2h15.93v2.8l-11.55,14.27h11.92v3.2h-16.45v-2.76Z"/>
<path class="cls-6" d="M696.2,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM694.42,590.55h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M702.89,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M733.76,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM733.97,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M748.96,580.42h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M758.93,619.28l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
<path class="cls-6" d="M786.4,593.75h-4.3v-3.2h4.3v-4.09c0-4.17,2.67-6.28,6.2-6.28,1.22,0,2.27.2,3.16.53v3.08c-.65-.24-1.7-.41-2.51-.41-2.23,0-3.36.89-3.36,3.44v3.73h5.88v3.2h-5.88v17.06h-3.48v-17.06Z"/>
<path class="cls-6" d="M805.78,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM806.3,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M827.45,611.22c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
<path class="cls-6" d="M845.61,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.17,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M851.73,616.61h-2.31l2.72-5.8c-.97-.49-1.66-1.5-1.66-2.72,0-1.62,1.34-3.04,3-3.04s3,1.42,3,3.04c0,.69-.24,1.3-.57,1.82l-4.17,6.69Z"/>
<path class="cls-6" d="M157.78,639.18h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M170.7,639.18h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
<path class="cls-6" d="M192.83,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M215.07,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M239.1,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM237.32,639.18h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M245.79,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M276.67,668.32c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM276.87,656.16c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M300.01,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M330.61,638.77c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM331.13,641.98c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M354.03,659.65c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.17,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M361.77,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM359.98,639.18h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M365.41,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
<path class="cls-6" d="M397.47,638.77c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM390.7,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M410.43,629.05h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M420.4,667.91l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
<path class="cls-6" d="M448.49,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM446.7,639.18h3.48v20.26h-3.48v-20.26Z"/>
<path class="cls-6" d="M455.17,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M486.13,667.91l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
<path class="cls-6" d="M513.73,659.85c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM513.73,656.61c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
<path class="cls-6" d="M527.38,639.18h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
<path class="cls-6" d="M549.51,639.18h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M578.81,638.77c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM572.04,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M591.78,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M610.54,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
<path class="cls-6" d="M635.86,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM634.08,639.18h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M642.55,639.18h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M664.03,659.85c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM664.03,656.61c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
<path class="cls-6" d="M677.97,639.18h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M700.21,639.18h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M743.41,638.77c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM736.64,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M756.38,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M786.24,659.65c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M796.42,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
<path class="cls-6" d="M821.74,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM819.96,639.18h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M836.78,638.77c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM837.3,641.98c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M873.9,659.93c-8.27,0-14.87-6.52-14.87-14.87s6.61-14.87,14.87-14.87,14.87,6.52,14.87,14.87-6.56,14.87-14.87,14.87ZM873.9,656.36c6.32,0,11.1-4.98,11.1-11.31s-4.78-11.31-11.1-11.31-11.06,4.98-11.06,11.31,4.78,11.31,11.06,11.31Z"/>
<path class="cls-6" d="M914.1,659.44l-17.51-22.41v22.41h-3.77v-28.77h3.28l17.51,22.41v-22.41h3.73v28.77h-3.24Z"/>
<path class="cls-6" d="M944.65,659.44l-17.51-22.41v22.41h-3.77v-28.77h3.28l17.51,22.41v-22.41h3.73v28.77h-3.24Z"/>
<path class="cls-6" d="M963.7,647.89l-8.79,11.55h-4.62l11.1-14.55-10.66-14.23h4.5l8.47,11.23,8.47-11.23h4.5l-10.62,14.23,11.06,14.55h-4.58l-8.83-11.55Z"/>
<path class="cls-6" d="M980.75,659.77c-1.66,0-3-1.38-3-3.04s1.34-3.04,3-3.04,3,1.42,3,3.04-1.34,3.04-3,3.04Z"/>
</g>
<g>
<path class="cls-3" d="M312.56,274.03h-18c0-5.18-4.04-10.02-9.4-10.07l-104.25.02c-11.33,2.12-11.32,17.43-.38,20.06l102.54.02c38.59,2.04,38.78,52.82,2.1,56.56-33.79-1.6-69.45,2.06-103.02,0-11.72-.72-21.95-7.64-26.04-18.75-1.18-3.2-1.28-6.01-1.77-9.32h18c.28,4.87,3.95,9.62,8.98,10.07l104.25-.02c12.15-2.24,11.75-18.22-.05-20.47l-105.04-.03c-34.49-3.76-34.33-52.52,0-56.14l107.07.13c13.99,1.94,24.86,13.7,25.02,27.94Z"/>
<polygon class="cls-3" points="845.46 245.97 845.46 263.98 708.99 263.98 708.99 284.08 807.58 284.08 808.2 284.71 808.2 302.08 708.99 302.08 708.99 322.6 845.46 322.6 845.46 340.61 690.15 340.61 690.15 245.97 845.46 245.97"/>
<path class="cls-3" d="M979.42,302.08l40.19,38.52h-24.07l-41.02-38.52h-66.35v38.52h-18.42v-94.63h127.89c2.7,0,8.65,2.55,11.1,3.97,18.71,10.84,17.89,38.72-.98,48.86-1.77.95-7.92,3.28-9.7,3.28h-18.63ZM888.16,284.08h108.63c.19,0,2.17-.96,2.59-1.18,5.89-3.09,6.76-10.78,2.58-15.72-.77-.9-3.71-3.2-4.75-3.2h-109.05v20.1Z"/>
<path class="cls-3" d="M1223.38,246.09l104.97-.13c13.07,1.48,23.94,11.35,25.72,24.52-2.05,25.65,10.21,65.79-26.58,70.12l-101.32.02c-16.42-1.48-26.57-12.82-27.42-29.1-.57-10.96-.6-25.52,0-36.47.83-15.17,9.22-26.52,24.62-28.97ZM1225.46,264.08c-3.86,1.07-7.28,3.93-7.85,8.06,1.21,13.06-1.63,29.14.04,41.84.54,4.08,4.71,8.46,8.95,8.64l100.07-.02c4.88-.89,8.58-4.39,9.01-9.41,1.12-12.93-.84-27.49-.05-40.6-.75-4.36-4.01-8.14-8.53-8.64l-101.64.12Z"/>
<path class="cls-3" d="M537.36,302.08v38.52h-18.42v-94.63h127.89c2.29,0,8.64,2.6,10.8,3.85,19.96,11.5,17.66,41.29-3.33,50.11-1.6.67-5.94,2.15-7.48,2.15h-109.47ZM537.36,284.08h108.21c2.55,0,6.68-3.98,7.46-6.35,1.45-4.38-.08-9.62-3.93-12.25-.44-.3-2.83-1.49-3.11-1.49h-108.63v20.1Z"/>
<path class="cls-3" d="M1763.91,245.97v18h-128.72c-.58,0-3.75,1.77-4.41,2.29-2.94,2.33-3.53,5.62-3.78,9.2-.69,9.89-.63,24.87,0,34.79.15,2.35.5,5.46,1.76,7.45.92,1.44,4.77,4.88,6.42,4.88h128.72v18h-129.14c-14.54,0-25.37-14.67-26.18-28.25-.67-11.18-.67-26.96,0-38.14.74-12.49,10.69-28.25,24.51-28.25h130.82Z"/>
<polygon class="cls-3" points="1517.76 323.86 1517.76 245.97 1536.6 245.97 1536.6 340.61 1510.02 340.61 1399.71 262.72 1399.71 340.61 1381.29 340.61 1381.29 245.97 1409.55 245.97 1517.76 323.86"/>
<path class="cls-3" d="M355.26,245.97v69.3c0,2.61,5.06,7.86,8.15,7.34l101.74-.02c2.83.06,8.16-4.19,8.16-6.91v-69.72h18.42v71.39c0,11-15.16,23.94-26.14,23.26-33.52-1.59-68.88,2.04-102.18,0-10.81-.66-20.71-6.87-24.79-17.08-.41-1.02-1.77-4.96-1.77-5.76v-71.81h18.42Z"/>
<polygon class="cls-3" points="1188.31 245.97 1188.31 263.35 1187.68 263.98 1118.4 263.98 1118.4 340.61 1099.98 340.61 1099.98 263.98 1030.49 263.98 1030.49 245.97 1188.31 245.97"/>
<rect class="cls-3" x="1563.39" y="245.97" width="18.42" height="94.63"/>
</g>
<g>
<path class="cls-6" d="M156.1,179.65h54.1v-54.1h-54.1v54.1ZM183.15,129.76c5.25,0,10.08,1.79,13.94,4.77l-13.03,13.03,4.13,4.13,13.03-13.03c2.98,3.86,4.77,8.68,4.77,13.94,0,12.62-10.23,22.84-22.84,22.84s-22.84-10.23-22.84-22.84,10.23-22.84,22.84-22.84Z"/>
<path class="cls-6" d="M279.9,132.95c0,8.93,0,17.85,0,26.78,0,3-1.02,5.56-3.39,7.47-2,1.62-4.32,2.26-6.87,2.03-3.3-.31-5.79-1.89-7.41-4.8-.85-1.52-1.09-3.18-1.09-4.9,0-8.82,0-17.63,0-26.45v-.79h-4.64v.59c0,8.98,0,17.96,0,26.94,0,.86.03,1.74.18,2.58.55,3.28,2.09,6.04,4.68,8.15,3.29,2.68,7.09,3.66,11.28,3.08,3.43-.48,6.33-2.02,8.62-4.62,2.3-2.61,3.28-5.7,3.28-9.14v-27.6h-4.65v.67Z"/>
<path class="cls-6" d="M244.82,151.28c-1.95-1-4.04-1.33-6.2-1.37-1.41-.03-2.71-.42-3.92-1.17-2.75-1.7-3.99-5.22-2.86-8.25,1.07-2.87,3.65-4.63,6.95-4.71,3.71-.09,6.89,2.1,7.55,5.82.07.39.09.8.13,1.21h4.6c.03-3.38-1.3-6.12-3.67-8.34-2.21-2.06-4.9-3.07-7.93-3.15-4.59-.11-8.33,1.55-10.88,5.43-1.96,2.98-2.26,6.25-1.21,9.63.68,2.16,1.97,3.92,3.77,5.32,2.02,1.57,4.29,2.47,6.85,2.64,1.48.1,2.98.1,4.38.73,4.68,2.1,5.63,7.4,3.21,10.99-1.52,2.24-3.77,3.2-6.42,3.32-3.59.16-6.69-1.91-7.78-5.25-.23-.71-.29-1.47-.43-2.2h-4.55c-.13,4.12,2.51,8.5,6.32,10.4,3.97,1.98,8,2.08,12.03.2,4.76-2.22,7.47-7.32,6.61-12.5-.67-4.02-2.94-6.89-6.54-8.75Z"/>
<path class="cls-6" d="M310.07,135.16c-2.37-2.08-5.22-2.88-8.31-2.9-2.36-.01-4.72,0-7.09,0h0s-4.64,0-4.64,0v40.59h.03v.02h4.6v-.02h0v-18.01h.69c2.1,0,4.2,0,6.3,0,1.61,0,3.2-.18,4.72-.76,3.81-1.47,6.44-4.06,7.31-8.12.91-4.26-.33-7.89-3.62-10.78ZM302.79,150.34c-2.59.1-5.2.05-7.79.06-.12,0-.2-.04-.29-.04v-13.68h.18c2.8,0,5.61-.09,8.4.12,3.27.24,6.14,3.08,6.07,6.88-.06,3.69-2.97,6.52-6.56,6.66Z"/>
<path class="cls-6" d="M361.78,154.05c3.82-1.47,6.44-4.06,7.31-8.12.91-4.26-.33-7.89-3.62-10.78-2.37-2.08-5.22-2.88-8.31-2.9-3.7-.02-7.41,0-11.11,0h-.62c0,13.56.03,27.08.03,40.6h4.6v-18.03h.69c1.09,0,2.17,0,3.26,0l12.03,18.06h5.39l-12.13-18.21c.84-.11,1.67-.3,2.48-.61ZM350.4,150.4c-.12,0-.2-.04-.29-.04v-13.68h.18c2.8,0,5.61-.09,8.4.12,3.27.24,6.13,3.08,6.07,6.88-.06,3.69-2.97,6.52-6.56,6.66-2.6.1-5.2.05-7.79.06Z"/>
<polygon class="cls-6" points="469.03 165.24 447.05 132.34 442.53 132.34 442.53 172.84 447.05 172.84 447.05 139.95 469.03 172.84 473.55 172.84 473.55 132.34 469.03 132.34 469.03 165.24"/>
<path class="cls-6" d="M318.45,132.28h0v40.55h0c7.18,0,14.32,0,21.45,0h0s.05,0,.05,0v-4.44h-.05c-5.63,0-11.22,0-16.8,0v-14.17h16.2v-4.45h-16.2v-13.05h16.79s.05,0,.05,0v-4.44h-21.5Z"/>
<path class="cls-6" d="M397.79,136.72v-4.43h0s-25.96-.01-25.96-.01h0v4.47h10.65v36.08h4.65v-36.1h10.66v-.02Z"/>
<path class="cls-6" d="M479.04,132.28h0v40.55h0c6.99,0,13.94,0,20.89,0h.61v-4.44h-.05c-5.63,0-11.22,0-16.8,0v-14.17h16.2v-4.45h-16.2v-13.05h16.79s.05,0,.05,0v-4.44h-21.5Z"/>
<path class="cls-6" d="M417.64,131.69c-11.5,0-20.85,9.35-20.85,20.85s9.35,20.85,20.85,20.85,20.85-9.35,20.85-20.85-9.35-20.85-20.85-20.85ZM417.64,168.88c-9.01,0-16.34-7.33-16.34-16.34s7.33-16.34,16.34-16.34,16.34,7.33,16.34,16.34-7.33,16.34-16.34,16.34Z"/>
</g>
</g>
</g>
</svg>

+10
View File
@@ -0,0 +1,10 @@
import SwiftUI

@main
struct ExampleiOSApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}
+30
View File
@@ -0,0 +1,30 @@
import Foundation
import AVFoundation

final class AudioPlayer: NSObject, AVAudioPlayerDelegate {
    private var player: AVAudioPlayer?
    private var onFinish: (() -> Void)?

    func play(url: URL, onFinish: (() -> Void)? = nil) {
        self.onFinish = onFinish
        do {
            let data = try Data(contentsOf: url)
            let player = try AVAudioPlayer(data: data)
            player.delegate = self
            player.prepareToPlay()
            player.play()
            self.player = player
        } catch {
            print("Audio play error: \(error)")
        }
    }

    func stop() {
        player?.stop()
        player = nil
    }

    func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
        onFinish?()
    }
}
+98
View File
@@ -0,0 +1,98 @@
import SwiftUI
struct ContentView: View {
@StateObject private var vm = TTSViewModel()
var body: some View {
ZStack {
LinearGradient(gradient: Gradient(colors: [Color(.systemBackground), Color(.secondarySystemBackground)]), startPoint: .topLeading, endPoint: .bottomTrailing)
.ignoresSafeArea()
VStack(spacing: 20) {
Spacer()
VStack(spacing: 12) {
Text("SupertonicTTS iOS Demo")
.font(.title2.weight(.semibold))
.foregroundColor(.primary)
ZStack(alignment: .topLeading) {
if vm.text.isEmpty {
Text("Type text to synthesize")
.foregroundColor(.secondary)
.padding(.horizontal, 14)
.padding(.vertical, 12)
}
TextEditor(text: $vm.text)
.frame(minHeight: 120, maxHeight: 180)
.padding(8)
.background(Color(.secondarySystemBackground))
.cornerRadius(12)
.overlay(
RoundedRectangle(cornerRadius: 12)
.stroke(Color.secondary.opacity(0.3), lineWidth: 1)
)
}
.padding(.horizontal)
HStack(spacing: 12) {
Text("NFE")
.font(.subheadline)
.foregroundColor(.secondary)
Slider(value: $vm.nfe, in: 2...15, step: 1)
Text("\(Int(vm.nfe))")
.font(.subheadline.monospacedDigit())
.frame(width: 36)
}
.padding(.horizontal)
Picker("Voice", selection: $vm.voice) {
Text("M").tag(TTSService.Voice.male)
Text("F").tag(TTSService.Voice.female)
}
.pickerStyle(SegmentedPickerStyle())
.padding(.horizontal)
}
HStack(spacing: 16) {
Button(action: { vm.generate() }) {
Label(vm.isGenerating ? "Generating..." : "Generate", systemImage: vm.isGenerating ? "hourglass" : "wand.and.stars"
)
.labelStyle(.titleAndIcon)
}
.buttonStyle(.borderedProminent)
.tint(.accentColor)
.disabled(vm.isGenerating)
Button(action: { vm.togglePlay() }) {
Label(vm.isPlaying ? "Stop" : "Play", systemImage: vm.isPlaying ? "stop.fill" : "play.fill")
}
.buttonStyle(.bordered)
.disabled(vm.audioURL == nil)
}
if let rtf = vm.rtfText {
Text(rtf)
.font(.footnote.monospacedDigit())
.foregroundColor(.secondary)
.padding(.top, 2)
}
if let error = vm.errorMessage {
Text(error)
.foregroundColor(.red)
.font(.footnote)
.multilineTextAlignment(.center)
.padding(.horizontal)
}
Spacer()
}
}
.onAppear { vm.startup() }
}
}
#Preview {
ContentView()
}
+29
View File
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>CFBundleDevelopmentRegion</key>
<string>en</string>
<key>CFBundleExecutable</key>
<string>$(EXECUTABLE_NAME)</string>
<key>CFBundleIdentifier</key>
<string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
<key>CFBundleInfoDictionaryVersion</key>
<string>6.0</string>
<key>CFBundleName</key>
<string>ExampleiOSApp</string>
<key>CFBundlePackageType</key>
<string>APPL</string>
<key>CFBundleShortVersionString</key>
<string>1.0</string>
<key>CFBundleVersion</key>
<string>1</string>
<key>UILaunchScreen</key>
<dict/>
<key>UIApplicationSceneManifest</key>
<dict>
<key>UIApplicationSupportsMultipleScenes</key>
<false/>
</dict>
</dict>
</plist>
+140
View File
@@ -0,0 +1,140 @@
import Foundation
import OnnxRuntimeBindings
final class TTSService {
enum Voice { case male, female }
struct Settings {
var nTest: Int = 1
}
struct SynthesisResult {
let url: URL
let elapsedSeconds: Double
let audioSeconds: Double
var rtf: Double { elapsedSeconds / max(audioSeconds, 1e-6) }
}
private let env: ORTEnv
private let textToSpeech: TextToSpeech
private let bundleOnnxDir: String
private let sampleRate: Int
// Cached style per voice (precomputed at startup or on first use)
private var cachedStyle: [Voice: Style] = [:]
init() throws {
bundleOnnxDir = try Self.locateOnnxDirInBundle()
env = try ORTEnv(loggingLevel: .warning)
textToSpeech = try loadTextToSpeech(bundleOnnxDir, false, env)
sampleRate = textToSpeech.sampleRate
}
// Public warmup: precompute styles and run a quick generation to warm models
func warmup(nfe: Int = 1) async {
do { try precomputeStyle(for: .male) } catch { print("Warmup style (M) error: \(error)") }
do { try precomputeStyle(for: .female) } catch { print("Warmup style (F) error: \(error)") }
// Run a tiny synthesis to JIT/warm up kernels; discard file
do {
let res = try await synthesize(text: "Warm up", nfe: max(1, nfe), voice: .male)
try? FileManager.default.removeItem(at: res.url)
} catch {
print("Warmup synth error: \(error)")
}
}
func synthesize(text: String, nfe: Int, voice: Voice, settings: Settings = Settings()) async throws -> SynthesisResult {
let tic = Date()
// 1) Get or compute style for the selected voice
let style = try getStyle(voice: voice)
// 2) Synthesize via packed TextToSpeech component
let (wav, duration) = try textToSpeech.call([text], style, nfe)
let audioSeconds = Double(duration[0])
let wavLenSample = min(Int(Double(sampleRate) * audioSeconds), wav.count)
let wavOut = Array(wav[0..<wavLenSample])
let tmpURL = FileManager.default.temporaryDirectory.appendingPathComponent("supertonic_tts_\(UUID().uuidString).wav")
try writeWavFile(tmpURL.path, wavOut, sampleRate)
let elapsed = Date().timeIntervalSince(tic)
return SynthesisResult(url: tmpURL, elapsedSeconds: elapsed, audioSeconds: audioSeconds)
}
// MARK: - Style helpers
private func precomputeStyle(for voice: Voice) throws {
if cachedStyle[voice] != nil { return }
let styleURL = try Self.locateVoiceStyleURL(voice: voice)
let style = try loadVoiceStyle([styleURL.path], verbose: false)
cachedStyle[voice] = style
}
private func getStyle(voice: Voice) throws -> Style {
if let style = cachedStyle[voice] { return style }
try precomputeStyle(for: voice)
return cachedStyle[voice]!
}
// MARK: - Resource location helpers
private static func locateOnnxDirInBundle() throws -> String {
let bundle = Bundle.main
let fm = FileManager.default
func dirHasRequiredFiles(_ dir: URL) -> Bool {
let required = [
"tts.json",
"duration_predictor.onnx",
"text_encoder.onnx",
"vector_estimator.onnx",
"vocoder.onnx"
]
return required.allSatisfy { fm.fileExists(atPath: dir.appendingPathComponent($0).path) }
}
var candidates: [URL] = []
if let dir = bundle.resourceURL?.appendingPathComponent("onnx", isDirectory: true) { candidates.append(dir) }
if let dir = bundle.resourceURL?.appendingPathComponent("assets/onnx", isDirectory: true) { candidates.append(dir) }
if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "onnx") { candidates.append(url.deletingLastPathComponent()) }
if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "assets/onnx") { candidates.append(url.deletingLastPathComponent()) }
if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: nil) { candidates.append(url.deletingLastPathComponent()) }
if let root = bundle.resourceURL { candidates.append(root) }
for dir in candidates {
if dirHasRequiredFiles(dir) { return dir.path }
}
throw NSError(
domain: "TTS",
code: -100,
userInfo: [NSLocalizedDescriptionKey: "Could not find the onnx directory in the bundle. Please make sure the onnx folder (as a folder reference) is included in Copy Bundle Resources in Xcode."]
)
}
private static func locateVoiceStyleURL(voice: Voice) throws -> URL {
// Prefer M1/F1 defaults; search common subdirectories
let fileName = (voice == .male) ? "M1" : "F1"
let bundle = Bundle.main
let candidates: [URL?] = [
bundle.url(forResource: fileName, withExtension: "json", subdirectory: "voice_styles"),
bundle.url(forResource: fileName, withExtension: "json", subdirectory: "assets/voice_styles"),
bundle.url(forResource: fileName, withExtension: "json", subdirectory: nil)
]
for url in candidates {
if let url = url { return url }
}
// Fallback: scan folders if needed
if let folder1 = bundle.resourceURL?.appendingPathComponent("voice_styles", isDirectory: true) {
let file = folder1.appendingPathComponent("\(fileName).json")
if FileManager.default.fileExists(atPath: file.path) { return file }
}
if let folder2 = bundle.resourceURL?.appendingPathComponent("assets/voice_styles", isDirectory: true) {
let file = folder2.appendingPathComponent("\(fileName).json")
if FileManager.default.fileExists(atPath: file.path) { return file }
}
throw NSError(
domain: "TTS",
code: -102,
userInfo: [NSLocalizedDescriptionKey: "Could not find the voice style JSON (\(fileName).json) in the bundle. Ensure voice_styles folder is included in Copy Bundle Resources."]
)
}
}
+76
View File
@@ -0,0 +1,76 @@
import Foundation
import AVFoundation

@MainActor
final class TTSViewModel: ObservableObject {
    @Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
    @Published var nfe: Double = 5
    @Published var voice: TTSService.Voice = .male
    @Published var isGenerating: Bool = false
    @Published var isPlaying: Bool = false
    @Published var errorMessage: String?
    @Published var audioURL: URL?
    @Published var elapsedSeconds: Double?
    @Published var audioSeconds: Double?

    private var service: TTSService?
    private var player = AudioPlayer()

    var rtfText: String? {
        guard let e = elapsedSeconds, let a = audioSeconds, a > 0 else { return nil }
        let rtf = e / a
        return String(format: "RTF %.2fx · %.2fs / %.2fs", rtf, e, a)
    }

    func startup() {
        do {
            service = try TTSService()
            Task { await self.service?.warmup(nfe: 5) }
        } catch {
            errorMessage = "Failed to init TTS: \(error.localizedDescription)"
        }
    }

    func generate() {
        guard let service = service else { return }
        isGenerating = true
        errorMessage = nil
        audioURL = nil
        elapsedSeconds = nil
        audioSeconds = nil
        Task {
            do {
                let result = try await service.synthesize(text: text, nfe: Int(nfe), voice: voice)
                await MainActor.run {
                    self.audioURL = result.url
                    self.elapsedSeconds = result.elapsedSeconds
                    self.audioSeconds = result.audioSeconds
                    self.isGenerating = false
                }
                self.play(url: result.url)
            } catch {
                await MainActor.run {
                    self.errorMessage = error.localizedDescription
                    self.isGenerating = false
                }
            }
        }
    }

    func togglePlay() {
        if isPlaying {
            player.stop()
            isPlaying = false
        } else if let url = audioURL {
            play(url: url)
        }
    }

    private func play(url: URL) {
        player.play(url: url) { [weak self] in
            DispatchQueue.main.async { self?.isPlaying = false }
        }
        isPlaying = true
    }
}
+29
View File
@@ -0,0 +1,29 @@
name: ExampleiOSApp
options:
  minimumXcodeGenVersion: 2.37.0
packages:
  onnxruntime:
    url: https://github.com/microsoft/onnxruntime-swift-package-manager.git
    from: 1.16.0
targets:
  ExampleiOSApp:
    type: application
    platform: iOS
    deploymentTarget: "15.0"
    sources:
      - path: .
      - path: ../../swift/Sources/Helper.swift
        type: file
    resources:
      - path: onnx
        type: folder
      - path: audio
        type: folder
    settings:
      base:
        PRODUCT_BUNDLE_IDENTIFIER: com.supertonic.ExampleiOSApp
        SWIFT_VERSION: 5.9
        INFOPLIST_FILE: Info.plist
    dependencies:
      - package: onnxruntime
        product: onnxruntime
+59
View File
@@ -0,0 +1,59 @@
# Supertonic iOS Example App
A minimal iOS demo that runs Supertonic (ONNX Runtime) on-device. The app shows:
- Multiline text input
- NFE (denoising steps) slider
- Voice toggle (M/F)
- Generate & Play buttons
- RTF display (Elapsed / Audio seconds)
All ONNX models/configs are reused from `Supertonic/assets/onnx`, and voice style JSON files from `Supertonic/assets/voice_styles`.
## Prerequisites
- macOS 13+, Xcode 15+
- Swift 5.9+
- iOS 15+ device (recommended)
- Homebrew, XcodeGen
Install tools (if needed):
```bash
brew install xcodegen
```
## Quick Start (zero-click in Xcode)
0) Prepare assets next to the iOS target (one-time)
```bash
cd ios/ExampleiOSApp
mkdir -p onnx voice_styles
rsync -a ../../assets/onnx/ onnx/
rsync -a ../../assets/voice_styles/ voice_styles/
```
1) Generate the Xcode project with XcodeGen
```bash
xcodegen generate
open ExampleiOSApp.xcodeproj
```
2) Open in Xcode and select your iPhone as the run destination
- Targets → ExampleiOSApp → Signing & Capabilities: Select your Team
- iOS Deployment Target: 15.0+
3) Build & Run on device
- Type text → adjust NFE/Voice → Tap Generate → Audio plays automatically
- An RTF line shows like: `RTF 0.30x · 3.04s / 10.11s`
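
If the app reports that it cannot find the onnx directory, a quick check of the copied assets from step 0 may help. The sketch below assumes the five file names that `TTSService.swift` looks for; `missingOnnxFiles` is a hypothetical helper for illustration, not part of the app.

```swift
import Foundation

// File names TTSService.swift requires inside the bundled `onnx` folder.
let requiredOnnxFiles = [
    "tts.json",
    "duration_predictor.onnx",
    "text_encoder.onnx",
    "vector_estimator.onnx",
    "vocoder.onnx",
]

/// Returns the required file names that are absent from `dir`.
/// Hypothetical helper for illustration; the app performs its own lookup.
func missingOnnxFiles(in dir: URL, fileManager: FileManager = .default) -> [String] {
    requiredOnnxFiles.filter { name in
        !fileManager.fileExists(atPath: dir.appendingPathComponent(name).path)
    }
}

// Usage: run from ios/ExampleiOSApp after step 0.
let onnxDir = URL(fileURLWithPath: "onnx", isDirectory: true)
let missing = missingOnnxFiles(in: onnxDir)
print(missing.isEmpty ? "All ONNX assets present" : "Missing: \(missing.joined(separator: ", "))")
```

This only checks that the files exist on disk; it does not verify that Git LFS replaced the `.onnx` pointer files with the real model weights.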
## What's included (generated project)
- SwiftUI app files: `App.swift`, `ContentView.swift`, `TTSViewModel.swift`, `AudioPlayer.swift`
- Runtime wrapper: `TTSService.swift` (includes TTS inference logic)
- Resources (local, vendored in `ios/ExampleiOSApp/onnx` and `ios/ExampleiOSApp/voice_styles` after step 0)
These references are defined in `project.yml` and added to the app bundle by XcodeGen.
## App Controls
- **Text**: Multiline `TextEditor`
- **NFE**: Denoising steps (default 5)
- **Voice**: M1/M2/F1/F2 voice style selector (4 pre-extracted styles)
- **Generate**: Runs end-to-end synthesis
- **Play/Stop**: Controls playback of the last output
- **RTF**: Shows Elapsed / Audio seconds for quick performance intuition
+35
@@ -0,0 +1,35 @@
# Maven
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties
.mvn/wrapper/maven-wrapper.jar
# Compiled class files
*.class
# IntelliJ IDEA
.idea/
*.iml
*.iws
*.ipr
# Eclipse
.classpath
.project
.settings/
# VS Code
.vscode/
# Results
results/*.wav
# Mac
.DS_Store
+141
@@ -0,0 +1,141 @@
import ai.onnxruntime.*;
import java.io.File;
import java.util.*;
/**
* TTS Inference Example with ONNX Runtime (Java)
*/
public class ExampleONNX {
/**
* Command line arguments
*/
static class Args {
boolean useGpu = false;
String onnxDir = "assets/onnx";
int totalStep = 5;
int nTest = 4;
List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
List<String> text = Arrays.asList(
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
);
String saveDir = "results";
}
/**
* Parse command line arguments
*/
private static Args parseArgs(String[] args) {
Args result = new Args();
for (int i = 0; i < args.length; i++) {
switch (args[i]) {
case "--use-gpu":
result.useGpu = true;
break;
case "--onnx-dir":
if (i + 1 < args.length) result.onnxDir = args[++i];
break;
case "--total-step":
if (i + 1 < args.length) result.totalStep = Integer.parseInt(args[++i]);
break;
case "--n-test":
if (i + 1 < args.length) result.nTest = Integer.parseInt(args[++i]);
break;
case "--voice-style":
if (i + 1 < args.length) {
result.voiceStyle = Arrays.asList(args[++i].split(","));
}
break;
case "--text":
if (i + 1 < args.length) {
result.text = Arrays.asList(args[++i].split("\\|"));
}
break;
case "--save-dir":
if (i + 1 < args.length) result.saveDir = args[++i];
break;
}
}
return result;
}
/**
* Main inference function
*/
public static void main(String[] args) {
try {
System.out.println("=== TTS Inference with ONNX Runtime (Java) ===\n");
// --- 1. Parse arguments --- //
Args parsedArgs = parseArgs(args);
int totalStep = parsedArgs.totalStep;
int nTest = parsedArgs.nTest;
String saveDir = parsedArgs.saveDir;
List<String> voiceStylePaths = parsedArgs.voiceStyle;
List<String> textList = parsedArgs.text;
if (voiceStylePaths.size() != textList.size()) {
throw new RuntimeException("Number of voice styles (" + voiceStylePaths.size() +
") must match number of texts (" + textList.size() + ")");
}
int bsz = voiceStylePaths.size();
OrtEnvironment env = OrtEnvironment.getEnvironment();
// --- 2. Load TTS components --- //
TextToSpeech textToSpeech = Helper.loadTextToSpeech(parsedArgs.onnxDir, parsedArgs.useGpu, env);
// --- 3. Load voice styles --- //
Style style = Helper.loadVoiceStyle(voiceStylePaths, true, env);
// --- 4. Synthesize speech --- //
File saveDirFile = new File(saveDir);
if (!saveDirFile.exists()) {
saveDirFile.mkdirs();
}
for (int n = 0; n < nTest; n++) {
System.out.println("\n[" + (n + 1) + "/" + nTest + "] Starting synthesis...");
TTSResult ttsResult = Helper.timer("Generating speech from text", () -> {
try {
return textToSpeech.call(textList, style, totalStep, env);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
float[] wav = ttsResult.wav;
float[] duration = ttsResult.duration;
// Save outputs
int wavLen = wav.length / bsz;
for (int i = 0; i < bsz; i++) {
String fname = Helper.sanitizeFilename(textList.get(i), 20) + "_" + (n + 1) + ".wav";
int actualLen = (int) (textToSpeech.sampleRate * duration[i]);
float[] wavOut = new float[actualLen];
System.arraycopy(wav, i * wavLen, wavOut, 0, Math.min(actualLen, wavLen));
String outputPath = saveDir + "/" + fname;
Helper.writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
System.out.println("Saved: " + outputPath);
}
}
// Clean up
style.close();
textToSpeech.close();
System.out.println("\n=== Synthesis completed successfully! ===");
} catch (Exception e) {
System.err.println("Error during inference: " + e.getMessage());
e.printStackTrace();
System.exit(1);
}
}
}
+597
@@ -0,0 +1,597 @@
import ai.onnxruntime.*;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.LongBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.Normalizer;
import java.util.*;
/**
* Configuration classes
*/
class Config {
static class AEConfig {
int sampleRate;
int baseChunkSize;
}
static class TTLConfig {
int chunkCompressFactor;
int latentDim;
}
AEConfig ae;
TTLConfig ttl;
}
/**
* Voice Style Data from JSON
*/
class VoiceStyleData {
static class StyleData {
float[][][] data;
long[] dims;
String type;
}
StyleData styleTtl;
StyleData styleDp;
}
/**
* Unicode text processor
*/
class UnicodeProcessor {
private long[] indexer;
public UnicodeProcessor(String unicodeIndexerJsonPath) throws IOException {
this.indexer = Helper.loadJsonLongArray(unicodeIndexerJsonPath);
}
public TextProcessResult call(List<String> textList) {
List<String> processedTexts = new ArrayList<>();
for (String text : textList) {
processedTexts.add(preprocessText(text));
}
int[] textIdsLengths = new int[processedTexts.size()];
int maxLen = 0;
for (int i = 0; i < processedTexts.size(); i++) {
textIdsLengths[i] = processedTexts.get(i).length();
maxLen = Math.max(maxLen, textIdsLengths[i]);
}
long[][] textIds = new long[processedTexts.size()][maxLen];
for (int i = 0; i < processedTexts.size(); i++) {
int[] unicodeVals = textToUnicodeValues(processedTexts.get(i));
for (int j = 0; j < unicodeVals.length; j++) {
textIds[i][j] = indexer[unicodeVals[j]];
}
}
float[][][] textMask = getTextMask(textIdsLengths);
return new TextProcessResult(textIds, textMask);
}
private String preprocessText(String text) {
return Normalizer.normalize(text, Normalizer.Form.NFKD);
}
private int[] textToUnicodeValues(String text) {
int[] values = new int[text.length()];
for (int i = 0; i < text.length(); i++) {
values[i] = text.codePointAt(i);
}
return values;
}
private float[][][] getTextMask(int[] lengths) {
int bsz = lengths.length;
int maxLen = 0;
for (int len : lengths) {
maxLen = Math.max(maxLen, len);
}
float[][][] mask = new float[bsz][1][maxLen];
for (int i = 0; i < bsz; i++) {
for (int j = 0; j < maxLen; j++) {
mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
}
}
return mask;
}
static class TextProcessResult {
long[][] textIds;
float[][][] textMask;
TextProcessResult(long[][] textIds, float[][][] textMask) {
this.textIds = textIds;
this.textMask = textMask;
}
}
}
/**
* Text-to-Speech inference class
*/
class TextToSpeech {
private Config config;
private UnicodeProcessor textProcessor;
private OrtSession dpSession;
private OrtSession textEncSession;
private OrtSession vectorEstSession;
private OrtSession vocoderSession;
public int sampleRate;
private int baseChunkSize;
private int chunkCompress;
private int ldim;
public TextToSpeech(Config config, UnicodeProcessor textProcessor,
OrtSession dpSession, OrtSession textEncSession,
OrtSession vectorEstSession, OrtSession vocoderSession) {
this.config = config;
this.textProcessor = textProcessor;
this.dpSession = dpSession;
this.textEncSession = textEncSession;
this.vectorEstSession = vectorEstSession;
this.vocoderSession = vocoderSession;
this.sampleRate = config.ae.sampleRate;
this.baseChunkSize = config.ae.baseChunkSize;
this.chunkCompress = config.ttl.chunkCompressFactor;
this.ldim = config.ttl.latentDim;
}
public TTSResult call(List<String> textList, Style style, int totalStep, OrtEnvironment env)
throws OrtException {
int bsz = textList.size();
// Process text
UnicodeProcessor.TextProcessResult textResult = textProcessor.call(textList);
long[][] textIds = textResult.textIds;
float[][][] textMask = textResult.textMask;
// Create tensors
OnnxTensor textIdsTensor = Helper.createLongTensor(textIds, env);
OnnxTensor textMaskTensor = Helper.createFloatTensor(textMask, env);
// Predict duration
Map<String, OnnxTensor> dpInputs = new HashMap<>();
dpInputs.put("text_ids", textIdsTensor);
dpInputs.put("style_dp", style.dpTensor);
dpInputs.put("text_mask", textMaskTensor);
OrtSession.Result dpResult = dpSession.run(dpInputs);
Object dpValue = dpResult.get(0).getValue();
float[] duration;
if (dpValue instanceof float[][]) {
duration = ((float[][]) dpValue)[0];
} else {
duration = (float[]) dpValue;
}
// Encode text
Map<String, OnnxTensor> textEncInputs = new HashMap<>();
textEncInputs.put("text_ids", textIdsTensor);
textEncInputs.put("style_ttl", style.ttlTensor);
textEncInputs.put("text_mask", textMaskTensor);
OrtSession.Result textEncResult = textEncSession.run(textEncInputs);
OnnxTensor textEmbTensor = (OnnxTensor) textEncResult.get(0);
// Sample noisy latent
NoisyLatentResult noisyLatentResult = sampleNoisyLatent(duration);
float[][][] xt = noisyLatentResult.noisyLatent;
float[][][] latentMask = noisyLatentResult.latentMask;
// Prepare constant tensors
float[] totalStepArray = new float[bsz];
Arrays.fill(totalStepArray, (float) totalStep);
OnnxTensor totalStepTensor = OnnxTensor.createTensor(env, totalStepArray);
// Denoising loop
for (int step = 0; step < totalStep; step++) {
float[] currentStepArray = new float[bsz];
Arrays.fill(currentStepArray, (float) step);
OnnxTensor currentStepTensor = OnnxTensor.createTensor(env, currentStepArray);
OnnxTensor noisyLatentTensor = Helper.createFloatTensor(xt, env);
OnnxTensor latentMaskTensor = Helper.createFloatTensor(latentMask, env);
OnnxTensor textMaskTensor2 = Helper.createFloatTensor(textMask, env);
Map<String, OnnxTensor> vectorEstInputs = new HashMap<>();
vectorEstInputs.put("noisy_latent", noisyLatentTensor);
vectorEstInputs.put("text_emb", textEmbTensor);
vectorEstInputs.put("style_ttl", style.ttlTensor);
vectorEstInputs.put("latent_mask", latentMaskTensor);
vectorEstInputs.put("text_mask", textMaskTensor2);
vectorEstInputs.put("current_step", currentStepTensor);
vectorEstInputs.put("total_step", totalStepTensor);
OrtSession.Result vectorEstResult = vectorEstSession.run(vectorEstInputs);
float[][][] denoised = (float[][][]) vectorEstResult.get(0).getValue();
// Update latent
xt = denoised;
// Clean up
currentStepTensor.close();
noisyLatentTensor.close();
latentMaskTensor.close();
textMaskTensor2.close();
vectorEstResult.close();
}
// Generate waveform
OnnxTensor finalLatentTensor = Helper.createFloatTensor(xt, env);
Map<String, OnnxTensor> vocoderInputs = new HashMap<>();
vocoderInputs.put("latent", finalLatentTensor);
OrtSession.Result vocoderResult = vocoderSession.run(vocoderInputs);
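float[][] wavBatch = (float[][]) vocoderResult.get(0).getValue();
// Flatten [bsz, wavLen] into one array so callers can slice out each sample;
// taking only wavBatch[0] would drop every sample after the first.
int wavLenPerSample = wavBatch[0].length;
float[] wav = new float[wavBatch.length * wavLenPerSample];
for (int b = 0; b < wavBatch.length; b++) {
System.arraycopy(wavBatch[b], 0, wav, b * wavLenPerSample, wavLenPerSample);
}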
// Clean up
textIdsTensor.close();
textMaskTensor.close();
dpResult.close();
textEncResult.close();
totalStepTensor.close();
finalLatentTensor.close();
vocoderResult.close();
return new TTSResult(wav, duration);
}
private NoisyLatentResult sampleNoisyLatent(float[] duration) {
int bsz = duration.length;
float maxDur = 0;
for (float d : duration) {
maxDur = Math.max(maxDur, d);
}
long wavLenMax = (long) (maxDur * sampleRate);
long[] wavLengths = new long[bsz];
for (int i = 0; i < bsz; i++) {
wavLengths[i] = (long) (duration[i] * sampleRate);
}
int chunkSize = baseChunkSize * chunkCompress;
int latentLen = (int) ((wavLenMax + chunkSize - 1) / chunkSize);
int latentDim = ldim * chunkCompress;
Random rng = new Random();
float[][][] noisyLatent = new float[bsz][latentDim][latentLen];
for (int b = 0; b < bsz; b++) {
for (int d = 0; d < latentDim; d++) {
for (int t = 0; t < latentLen; t++) {
// Box-Muller transform
double u1 = Math.max(1e-10, rng.nextDouble());
double u2 = rng.nextDouble();
noisyLatent[b][d][t] = (float) (Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2));
}
}
}
float[][][] latentMask = Helper.getLatentMask(wavLengths, config);
// Apply mask
for (int b = 0; b < bsz; b++) {
for (int d = 0; d < latentDim; d++) {
for (int t = 0; t < latentLen; t++) {
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return new NoisyLatentResult(noisyLatent, latentMask);
}
public void close() throws OrtException {
if (dpSession != null) dpSession.close();
if (textEncSession != null) textEncSession.close();
if (vectorEstSession != null) vectorEstSession.close();
if (vocoderSession != null) vocoderSession.close();
}
}
/**
* Style holder class
*/
class Style {
OnnxTensor ttlTensor;
OnnxTensor dpTensor;
Style(OnnxTensor ttlTensor, OnnxTensor dpTensor) {
this.ttlTensor = ttlTensor;
this.dpTensor = dpTensor;
}
public void close() throws OrtException {
if (ttlTensor != null) ttlTensor.close();
if (dpTensor != null) dpTensor.close();
}
}
/**
* TTS result holder
*/
class TTSResult {
float[] wav;
float[] duration;
TTSResult(float[] wav, float[] duration) {
this.wav = wav;
this.duration = duration;
}
}
/**
* Noisy latent result holder
*/
class NoisyLatentResult {
float[][][] noisyLatent;
float[][][] latentMask;
NoisyLatentResult(float[][][] noisyLatent, float[][][] latentMask) {
this.noisyLatent = noisyLatent;
this.latentMask = latentMask;
}
}
/**
* Helper utility class
*/
public class Helper {
/**
* Load voice style from JSON files
*/
public static Style loadVoiceStyle(List<String> voiceStylePaths, boolean verbose, OrtEnvironment env)
throws IOException, OrtException {
int bsz = voiceStylePaths.size();
// Read first file to get dimensions
ObjectMapper mapper = new ObjectMapper();
JsonNode firstRoot = mapper.readTree(new File(voiceStylePaths.get(0)));
long[] ttlDims = new long[3];
for (int i = 0; i < 3; i++) {
ttlDims[i] = firstRoot.get("style_ttl").get("dims").get(i).asLong();
}
long[] dpDims = new long[3];
for (int i = 0; i < 3; i++) {
dpDims[i] = firstRoot.get("style_dp").get("dims").get(i).asLong();
}
long ttlDim1 = ttlDims[1];
long ttlDim2 = ttlDims[2];
long dpDim1 = dpDims[1];
long dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
int ttlSize = (int) (bsz * ttlDim1 * ttlDim2);
int dpSize = (int) (bsz * dpDim1 * dpDim2);
float[] ttlFlat = new float[ttlSize];
float[] dpFlat = new float[dpSize];
// Fill in the data
for (int i = 0; i < bsz; i++) {
JsonNode root = mapper.readTree(new File(voiceStylePaths.get(i)));
// Flatten TTL data
int ttlOffset = (int) (i * ttlDim1 * ttlDim2);
int idx = 0;
JsonNode ttlData = root.get("style_ttl").get("data");
for (JsonNode batch : ttlData) {
for (JsonNode row : batch) {
for (JsonNode val : row) {
ttlFlat[ttlOffset + idx++] = (float) val.asDouble();
}
}
}
// Flatten DP data
int dpOffset = (int) (i * dpDim1 * dpDim2);
idx = 0;
JsonNode dpData = root.get("style_dp").get("data");
for (JsonNode batch : dpData) {
for (JsonNode row : batch) {
for (JsonNode val : row) {
dpFlat[dpOffset + idx++] = (float) val.asDouble();
}
}
}
}
long[] ttlShape = {bsz, ttlDim1, ttlDim2};
long[] dpShape = {bsz, dpDim1, dpDim2};
OnnxTensor ttlTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(ttlFlat), ttlShape);
OnnxTensor dpTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(dpFlat), dpShape);
if (verbose) {
System.out.println("Loaded " + bsz + " voice styles\n");
}
return new Style(ttlTensor, dpTensor);
}
/**
* Load TTS components
*/
public static TextToSpeech loadTextToSpeech(String onnxDir, boolean useGpu, OrtEnvironment env)
throws IOException, OrtException {
if (useGpu) {
throw new RuntimeException("GPU mode is not supported yet");
}
System.out.println("Using CPU for inference\n");
// Load config
Config config = loadCfgs(onnxDir);
// Create session options
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
// Load models
OrtSession dpSession = env.createSession(onnxDir + "/duration_predictor.onnx", opts);
OrtSession textEncSession = env.createSession(onnxDir + "/text_encoder.onnx", opts);
OrtSession vectorEstSession = env.createSession(onnxDir + "/vector_estimator.onnx", opts);
OrtSession vocoderSession = env.createSession(onnxDir + "/vocoder.onnx", opts);
// Load text processor
UnicodeProcessor textProcessor = new UnicodeProcessor(onnxDir + "/unicode_indexer.json");
return new TextToSpeech(config, textProcessor, dpSession, textEncSession, vectorEstSession, vocoderSession);
}
/**
* Load configuration from JSON
*/
public static Config loadCfgs(String onnxDir) throws IOException {
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(new File(onnxDir + "/tts.json"));
Config config = new Config();
config.ae = new Config.AEConfig();
config.ae.sampleRate = root.get("ae").get("sample_rate").asInt();
config.ae.baseChunkSize = root.get("ae").get("base_chunk_size").asInt();
config.ttl = new Config.TTLConfig();
config.ttl.chunkCompressFactor = root.get("ttl").get("chunk_compress_factor").asInt();
config.ttl.latentDim = root.get("ttl").get("latent_dim").asInt();
return config;
}
/**
* Get latent mask from wav lengths
*/
public static float[][][] getLatentMask(long[] wavLengths, Config config) {
long baseChunkSize = config.ae.baseChunkSize;
long chunkCompressFactor = config.ttl.chunkCompressFactor;
long latentSize = baseChunkSize * chunkCompressFactor;
long[] latentLengths = new long[wavLengths.length];
long maxLen = 0;
for (int i = 0; i < wavLengths.length; i++) {
latentLengths[i] = (wavLengths[i] + latentSize - 1) / latentSize;
maxLen = Math.max(maxLen, latentLengths[i]);
}
float[][][] mask = new float[wavLengths.length][1][(int) maxLen];
for (int i = 0; i < wavLengths.length; i++) {
for (int j = 0; j < maxLen; j++) {
mask[i][0][j] = j < latentLengths[i] ? 1.0f : 0.0f;
}
}
return mask;
}
/**
* Write WAV file
*/
public static void writeWavFile(String filename, float[] audioData, int sampleRate) throws IOException {
// Convert float to byte array
byte[] bytes = new byte[audioData.length * 2];
ByteBuffer buffer = ByteBuffer.wrap(bytes);
buffer.order(ByteOrder.LITTLE_ENDIAN);
for (float sample : audioData) {
short val = (short) Math.max(-32768, Math.min(32767, sample * 32767));
buffer.putShort(val);
}
ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
AudioInputStream ais = new AudioInputStream(bais, format, audioData.length);
AudioSystem.write(ais, AudioFileFormat.Type.WAVE, new File(filename));
}
/**
* Sanitize filename
*/
public static String sanitizeFilename(String text, int maxLen) {
if (text.length() > maxLen) {
text = text.substring(0, maxLen);
}
return text.replaceAll("[^a-zA-Z0-9]", "_");
}
/**
* Timer utility
*/
public static <T> T timer(String name, java.util.function.Supplier<T> fn) {
long start = System.currentTimeMillis();
System.out.println(name + "...");
T result = fn.get();
long elapsed = System.currentTimeMillis() - start;
System.out.printf(" -> %s completed in %.2f sec\n", name, elapsed / 1000.0);
return result;
}
/**
* Create float tensor from 3D array
*/
public static OnnxTensor createFloatTensor(float[][][] array, OrtEnvironment env) throws OrtException {
int dim0 = array.length;
int dim1 = array[0].length;
int dim2 = array[0][0].length;
float[] flat = new float[dim0 * dim1 * dim2];
int idx = 0;
for (int i = 0; i < dim0; i++) {
for (int j = 0; j < dim1; j++) {
for (int k = 0; k < dim2; k++) {
flat[idx++] = array[i][j][k];
}
}
}
long[] shape = {dim0, dim1, dim2};
return OnnxTensor.createTensor(env, FloatBuffer.wrap(flat), shape);
}
/**
* Create long tensor from 2D array
*/
public static OnnxTensor createLongTensor(long[][] array, OrtEnvironment env) throws OrtException {
int dim0 = array.length;
int dim1 = array[0].length;
long[] flat = new long[dim0 * dim1];
int idx = 0;
for (int i = 0; i < dim0; i++) {
for (int j = 0; j < dim1; j++) {
flat[idx++] = array[i][j];
}
}
long[] shape = {dim0, dim1};
return OnnxTensor.createTensor(env, LongBuffer.wrap(flat), shape);
}
/**
* Load JSON long array
*/
public static long[] loadJsonLongArray(String filePath) throws IOException {
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(new File(filePath));
long[] result = new long[root.size()];
for (int i = 0; i < root.size(); i++) {
result[i] = root.get(i).asLong();
}
return result;
}
}
+97
@@ -0,0 +1,97 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `ExampleONNX.java`.
## Installation
This project uses [Maven](https://maven.apache.org/) for dependency management.
### Prerequisites
- Java 11 or higher
- Maven 3.6 or higher
### Install dependencies
```bash
mvn clean install
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
mvn exec:java
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
mvn exec:java -Dexec.args="--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant.'"
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice (M1.json) for the first text
- Use female voice (F1.json) for the second text
- Process both samples in a single batch
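The pairing described above can be sketched in a few lines of Java. This is a minimal illustration that mirrors the comma and pipe splitting done in `ExampleONNX.java`; the class name `BatchArgsSketch` and its helper methods are hypothetical and not part of the project.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: how one --voice-style value and one --text value
// become a batch of (style, text) pairs.
class BatchArgsSketch {
    /** Splits a comma-separated --voice-style value into individual paths. */
    static List<String> splitVoiceStyles(String arg) {
        return Arrays.asList(arg.split(","));
    }

    /** Splits a pipe-separated --text value into individual texts. */
    static List<String> splitTexts(String arg) {
        // "|" is a regex metacharacter, so it has to be escaped.
        return Arrays.asList(arg.split("\\|"));
    }

    /** Style i is paired with text i, so the two counts must be equal. */
    static void checkPairs(List<String> styles, List<String> texts) {
        if (styles.size() != texts.size()) {
            throw new IllegalArgumentException(
                "Number of voice styles (" + styles.size()
                    + ") must match number of texts (" + texts.size() + ")");
        }
    }

    public static void main(String[] args) {
        List<String> styles =
            splitVoiceStyles("assets/voice_styles/M1.json,assets/voice_styles/F1.json");
        List<String> texts = splitTexts("First sentence.|Second sentence.");
        checkPairs(styles, texts);
        for (int i = 0; i < styles.size(); i++) {
            System.out.println(styles.get(i) + " -> " + texts.get(i));
        }
    }
}
```

Because the text is split on every `|`, a pipe character cannot appear inside a single text with this scheme.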
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
mvn exec:java -Dexec.args="--total-step 10 --voice-style assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
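**Note**: The `-Dexec.args` examples wrap the text in single quotes, so an apostrophe inside the text clashes with that quoting. Escape it, or build the fat JAR (`mvn clean package`) and run it directly: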
```bash
java -jar target/tts-example.jar --total-step 10 --text "Text with apostrophe's here"
```
## Building a Fat JAR
To create a standalone JAR with all dependencies:
```bash
mvn clean package
```
Then run it directly:
```bash
java -jar target/tts-example.jar
```
Or with arguments:
```bash
java -jar target/tts-example.jar --total-step 10 --text "Your custom text here"
```
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; the example exits with an error) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries; otherwise the example exits with an error
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference
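In batch mode the example slices one flat waveform buffer into per-sample clips before saving them. The sketch below illustrates that step under stated assumptions: the buffer is laid out as `bsz` equal-length segments, and each clip is trimmed to `sampleRate * duration` samples. The class `BatchWavSplitSketch` and the method `sliceSample` are hypothetical names, and unlike `ExampleONNX.java` this version clamps the clip to the segment length instead of zero-padding it.

```java
import java.util.Arrays;

// Hypothetical sketch of cutting per-sample clips out of a batched waveform.
class BatchWavSplitSketch {
    /**
     * Returns sample i from a flat buffer of bsz equal-length segments,
     * trimmed to sampleRate * durationSec values (at most one segment).
     */
    static float[] sliceSample(float[] flatWav, int bsz, int i,
                               int sampleRate, float durationSec) {
        int wavLen = flatWav.length / bsz;            // length of one segment
        int actualLen = Math.min((int) (sampleRate * durationSec), wavLen);
        int start = i * wavLen;
        return Arrays.copyOfRange(flatWav, start, start + actualLen);
    }

    public static void main(String[] args) {
        // Toy values: 2 samples of 8 slots each and a made-up 4 Hz sample rate.
        float[] flat = new float[16];
        for (int k = 0; k < flat.length; k++) {
            flat[k] = k;
        }
        float[] durations = {2.0f, 1.0f};
        for (int i = 0; i < durations.length; i++) {
            float[] clip = sliceSample(flat, 2, i, 4, durations[i]);
            System.out.println("sample " + i + ": " + clip.length + " values");
        }
    }
}
```

Trimming by the predicted duration removes the padding that shorter samples carry when they are batched with longer ones.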
+1
@@ -0,0 +1 @@
../assets
+110
@@ -0,0 +1,110 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ai.supertonic</groupId>
<artifactId>tts-onnx-java</artifactId>
<version>1.0.0</version>
<packaging>jar</packaging>
<name>TTS ONNX Java Example</name>
<description>Text-to-Speech inference using ONNX Runtime in Java</description>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>11</maven.compiler.source>
<maven.compiler.target>11</maven.compiler.target>
<onnxruntime.version>1.23.1</onnxruntime.version>
<jackson.version>2.15.2</jackson.version>
</properties>
<dependencies>
<!-- ONNX Runtime -->
<dependency>
<groupId>com.microsoft.onnxruntime</groupId>
<artifactId>onnxruntime</artifactId>
<version>${onnxruntime.version}</version>
</dependency>
<!-- Jackson for JSON parsing -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>${jackson.version}</version>
</dependency>
<!-- JTransforms for Fast FFT -->
<dependency>
<groupId>com.github.wendykierp</groupId>
<artifactId>JTransforms</artifactId>
<version>3.1</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>.</sourceDirectory>
<plugins>
<!-- Maven Compiler Plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.11.0</version>
<configuration>
<source>11</source>
<target>11</target>
</configuration>
</plugin>
<!-- Maven Exec Plugin for running the example -->
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<mainClass>ExampleONNX</mainClass>
</configuration>
</plugin>
<!-- Maven Jar Plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<archive>
<manifest>
<mainClass>ExampleONNX</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<!-- Maven Shade Plugin for creating fat JAR -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.5.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>ExampleONNX</mainClass>
</transformer>
</transformers>
<finalName>tts-example</finalName>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
+102
@@ -0,0 +1,102 @@
# TTS ONNX Node.js Implementation
Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech from text.
## Requirements
- Node.js v16 or higher
- npm or yarn
## Installation
```bash
cd nodejs
npm install
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
npm start
```
Or:
```bash
node example_onnx.js
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
node example_onnx.js \
--voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
node example_onnx.js \
--total-step 10 \
--voice-style "assets/voice_styles/M1.json" \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
## Architecture
- `helper.js`: Node.js port of Python's `helper.py`
  - `Preprocessor`: Audio preprocessing (STFT, Mel Spectrogram)
  - `UnicodeProcessor`: Text preprocessing
  - Utility functions (mask generation, tensor conversion, etc.)
- `example_onnx.js`: Main inference script
  - ONNX model loading
  - TTS inference pipeline execution
  - WAV file saving
- `package.json`: Node.js project configuration and dependencies
## Implementation Notes
1. **Pure Node.js WAV Processing**: Writes WAV files without external native libraries. Outputs 16-bit PCM format.
2. **Memory Efficiency**: Note that Node.js may consume significant memory when processing large arrays.
3. **Performance**: The mel spectrogram extraction (Step 1-1) is currently slower than Python's Librosa, which uses highly optimized C extensions. This bottleneck could be further improved with additional optimizations such as WASM-based FFT libraries or native addons.
+1
@@ -0,0 +1 @@
../assets
+104
@@ -0,0 +1,104 @@
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import { loadTextToSpeech, loadVoiceStyle, timer, writeWavFile } from './helper.js';
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
/**
* Parse command line arguments
*/
function parseArgs() {
const args = {
useGpu: false,
onnxDir: 'assets/onnx',
totalStep: 5,
nTest: 4,
voiceStyle: ['assets/voice_styles/M1.json'],
text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
saveDir: 'results'
};
for (let i = 2; i < process.argv.length; i++) {
const arg = process.argv[i];
if (arg === '--use-gpu') {
args.useGpu = true;
} else if (arg === '--onnx-dir' && i + 1 < process.argv.length) {
args.onnxDir = process.argv[++i];
} else if (arg === '--total-step' && i + 1 < process.argv.length) {
// Pass an explicit radix so the value is always parsed as decimal
args.totalStep = parseInt(process.argv[++i], 10);
} else if (arg === '--n-test' && i + 1 < process.argv.length) {
args.nTest = parseInt(process.argv[++i], 10);
} else if (arg === '--voice-style' && i + 1 < process.argv.length) {
args.voiceStyle = process.argv[++i].split(',');
} else if (arg === '--text' && i + 1 < process.argv.length) {
args.text = process.argv[++i].split('|');
} else if (arg === '--save-dir' && i + 1 < process.argv.length) {
args.saveDir = process.argv[++i];
}
}
return args;
}
/**
* Main inference function
*/
async function main() {
console.log('=== TTS Inference with ONNX Runtime (Node.js) ===\n');
// --- 1. Parse arguments --- //
const args = parseArgs();
const totalStep = args.totalStep;
const nTest = args.nTest;
const saveDir = args.saveDir;
const voiceStylePaths = args.voiceStyle.map(p => path.resolve(__dirname, p));
const textList = args.text;
if (voiceStylePaths.length !== textList.length) {
throw new Error(`Number of voice styles (${voiceStylePaths.length}) must match number of texts (${textList.length})`);
}
const bsz = voiceStylePaths.length;
// --- 2. Load Text to Speech --- //
const onnxDir = path.resolve(__dirname, args.onnxDir);
const textToSpeech = await loadTextToSpeech(onnxDir, args.useGpu);
// --- 3. Load Voice Style --- //
const style = loadVoiceStyle(voiceStylePaths, true);
// --- 4. Synthesize speech --- //
for (let n = 0; n < nTest; n++) {
console.log(`\n[${n + 1}/${nTest}] Starting synthesis...`);
const { wav, duration } = await timer('Generating speech from text', async () => {
return await textToSpeech.call(textList, style, totalStep);
});
if (!fs.existsSync(saveDir)) {
fs.mkdirSync(saveDir, { recursive: true });
}
const wavShape = [bsz, wav.length / bsz];
for (let b = 0; b < bsz; b++) {
const fname = `${textList[b].substring(0, 20).replace(/[^a-zA-Z0-9]/g, '_')}_${n + 1}.wav`;
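// Clamp to the per-sample length so a long duration cannot read into the next batch item
const wavLen = Math.min(Math.floor(textToSpeech.sampleRate * duration[b]), wavShape[1]);
const wavOut = wav.slice(b * wavShape[1], b * wavShape[1] + wavLen);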
const outputPath = path.join(saveDir, fname);
writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
console.log(`Saved: ${outputPath}`);
}
}
console.log('\n=== Synthesis completed successfully! ===');
}
// Run main function
main().catch(err => {
console.error('Error during inference:', err);
process.exit(1);
});
+392
View File
@@ -0,0 +1,392 @@
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import * as ort from 'onnxruntime-node';
const __filename = fileURLToPath(import.meta.url);
/**
* Unicode text processor
*/
class UnicodeProcessor {
constructor(unicodeIndexerJsonPath) {
this.indexer = JSON.parse(fs.readFileSync(unicodeIndexerJsonPath, 'utf8'));
}
_preprocessText(text) {
// Simple NFKD normalization (JavaScript has normalize built-in)
return text.normalize('NFKD');
}
_textToUnicodeValues(text) {
return Array.from(text).map(char => char.charCodeAt(0));
}
_getTextMask(textIdsLengths) {
return lengthToMask(textIdsLengths);
}
call(textList) {
const processedTexts = textList.map(t => this._preprocessText(t));
// Count code points (not UTF-16 units) so lengths match _textToUnicodeValues
const textIdsLengths = processedTexts.map(t => Array.from(t).length);
const maxLen = Math.max(...textIdsLengths);
const textIds = [];
for (let i = 0; i < processedTexts.length; i++) {
const row = new Array(maxLen).fill(0);
const unicodeVals = this._textToUnicodeValues(processedTexts[i]);
for (let j = 0; j < unicodeVals.length; j++) {
row[j] = this.indexer[unicodeVals[j]];
}
textIds.push(row);
}
const textMask = this._getTextMask(textIdsLengths);
return { textIds, textMask };
}
}
/**
* Style class
*/
class Style {
constructor(styleTtlOnnx, styleDpOnnx) {
this.ttl = styleTtlOnnx;
this.dp = styleDpOnnx;
}
}
/**
* TextToSpeech class
*/
class TextToSpeech {
constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
this.cfgs = cfgs;
this.textProcessor = textProcessor;
this.dpOrt = dpOrt;
this.textEncOrt = textEncOrt;
this.vectorEstOrt = vectorEstOrt;
this.vocoderOrt = vocoderOrt;
this.sampleRate = cfgs.ae.sample_rate;
this.baseChunkSize = cfgs.ae.base_chunk_size;
this.chunkCompressFactor = cfgs.ttl.chunk_compress_factor;
this.ldim = cfgs.ttl.latent_dim;
}
sampleNoisyLatent(duration) {
const wavLenMax = Math.max(...duration) * this.sampleRate;
const wavLengths = duration.map(d => Math.floor(d * this.sampleRate));
const chunkSize = this.baseChunkSize * this.chunkCompressFactor;
const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
const latentDim = this.ldim * this.chunkCompressFactor;
// Generate random noise
const noisyLatent = [];
for (let b = 0; b < duration.length; b++) {
const batch = [];
for (let d = 0; d < latentDim; d++) {
const row = [];
for (let t = 0; t < latentLen; t++) {
// Box-Muller transform for normal distribution
// Add epsilon to avoid log(0)
const eps = 1e-10;
const u1 = Math.max(eps, Math.random());
const u2 = Math.random();
const randNormal = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
row.push(randNormal);
}
batch.push(row);
}
noisyLatent.push(batch);
}
const latentMask = getLatentMask(wavLengths, this.baseChunkSize, this.chunkCompressFactor);
// Apply mask
for (let b = 0; b < noisyLatent.length; b++) {
for (let d = 0; d < noisyLatent[b].length; d++) {
for (let t = 0; t < noisyLatent[b][d].length; t++) {
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return { noisyLatent, latentMask };
}
async call(textList, style, totalStep) {
if (textList.length !== style.ttl.dims[0]) {
throw new Error('Number of texts must match number of style vectors');
}
const bsz = textList.length;
const { textIds, textMask } = this.textProcessor.call(textList);
const textIdsShape = [bsz, textIds[0].length];
const textMaskShape = [bsz, 1, textMask[0][0].length];
const textMaskTensor = arrayToTensor(textMask, textMaskShape);
const dpResult = await this.dpOrt.run({
text_ids: intArrayToTensor(textIds, textIdsShape),
style_dp: style.dp,
text_mask: textMaskTensor
});
const durOnnx = Array.from(dpResult.duration.data);
const textEncResult = await this.textEncOrt.run({
text_ids: intArrayToTensor(textIds, textIdsShape),
style_ttl: style.ttl,
text_mask: textMaskTensor
});
const textEmbTensor = textEncResult.text_emb;
let { noisyLatent, latentMask } = this.sampleNoisyLatent(durOnnx);
const latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
const latentMaskShape = [bsz, 1, latentMask[0][0].length];
const latentMaskTensor = arrayToTensor(latentMask, latentMaskShape);
const totalStepArray = new Array(bsz).fill(totalStep);
const scalarShape = [bsz];
const totalStepTensor = arrayToTensor(totalStepArray, scalarShape);
for (let step = 0; step < totalStep; step++) {
const currentStepArray = new Array(bsz).fill(step);
const vectorEstResult = await this.vectorEstOrt.run({
noisy_latent: arrayToTensor(noisyLatent, latentShape),
text_emb: textEmbTensor,
style_ttl: style.ttl,
text_mask: textMaskTensor,
latent_mask: latentMaskTensor,
total_step: totalStepTensor,
current_step: arrayToTensor(currentStepArray, scalarShape)
});
const denoisedLatent = Array.from(vectorEstResult.denoised_latent.data);
// Update latent with the denoised output
let idx = 0;
for (let b = 0; b < noisyLatent.length; b++) {
for (let d = 0; d < noisyLatent[b].length; d++) {
for (let t = 0; t < noisyLatent[b][d].length; t++) {
noisyLatent[b][d][t] = denoisedLatent[idx++];
}
}
}
}
const vocoderResult = await this.vocoderOrt.run({
latent: arrayToTensor(noisyLatent, latentShape)
});
const wav = Array.from(vocoderResult.wav_tts.data);
return { wav, duration: durOnnx };
}
}
/**
* Convert lengths to binary mask
*/
function lengthToMask(lengths, maxLen = null) {
maxLen = maxLen || Math.max(...lengths);
const mask = [];
for (let i = 0; i < lengths.length; i++) {
const row = [];
for (let j = 0; j < maxLen; j++) {
row.push(j < lengths[i] ? 1.0 : 0.0);
}
mask.push([row]); // [B, 1, maxLen]
}
return mask;
}
/**
* Get latent mask from wav lengths
*/
function getLatentMask(wavLengths, baseChunkSize, chunkCompressFactor) {
const latentSize = baseChunkSize * chunkCompressFactor;
const latentLengths = wavLengths.map(len =>
Math.floor((len + latentSize - 1) / latentSize)
);
return lengthToMask(latentLengths);
}
/**
* Load ONNX model
*/
async function loadOnnx(onnxPath, opts) {
return await ort.InferenceSession.create(onnxPath, opts);
}
/**
* Load all ONNX models for TTS
*/
async function loadOnnxAll(onnxDir, opts) {
const dpPath = path.join(onnxDir, 'duration_predictor.onnx');
const textEncPath = path.join(onnxDir, 'text_encoder.onnx');
const vectorEstPath = path.join(onnxDir, 'vector_estimator.onnx');
const vocoderPath = path.join(onnxDir, 'vocoder.onnx');
const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = await Promise.all([
loadOnnx(dpPath, opts),
loadOnnx(textEncPath, opts),
loadOnnx(vectorEstPath, opts),
loadOnnx(vocoderPath, opts)
]);
return { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt };
}
/**
* Load configuration
*/
function loadCfgs(onnxDir) {
const cfgPath = path.join(onnxDir, 'tts.json');
const cfgs = JSON.parse(fs.readFileSync(cfgPath, 'utf8'));
return cfgs;
}
/**
* Load text processor
*/
function loadTextProcessor(onnxDir) {
const unicodeIndexerPath = path.join(onnxDir, 'unicode_indexer.json');
const textProcessor = new UnicodeProcessor(unicodeIndexerPath);
return textProcessor;
}
/**
* Load voice style from JSON file
*/
export function loadVoiceStyle(voiceStylePaths, verbose = false) {
const bsz = voiceStylePaths.length;
// Read first file to get dimensions
const firstStyle = JSON.parse(fs.readFileSync(voiceStylePaths[0], 'utf8'));
const ttlDims = firstStyle.style_ttl.dims;
const dpDims = firstStyle.style_dp.dims;
const ttlDim1 = ttlDims[1];
const ttlDim2 = ttlDims[2];
const dpDim1 = dpDims[1];
const dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
const ttlSize = bsz * ttlDim1 * ttlDim2;
const dpSize = bsz * dpDim1 * dpDim2;
const ttlFlat = new Float32Array(ttlSize);
const dpFlat = new Float32Array(dpSize);
// Fill in the data
for (let i = 0; i < bsz; i++) {
const voiceStyle = JSON.parse(fs.readFileSync(voiceStylePaths[i], 'utf8'));
const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
const ttlOffset = i * ttlDim1 * ttlDim2;
ttlFlat.set(ttlData, ttlOffset);
const dpData = voiceStyle.style_dp.data.flat(Infinity);
const dpOffset = i * dpDim1 * dpDim2;
dpFlat.set(dpData, dpOffset);
}
const ttlStyle = new ort.Tensor('float32', ttlFlat, [bsz, ttlDim1, ttlDim2]);
const dpStyle = new ort.Tensor('float32', dpFlat, [bsz, dpDim1, dpDim2]);
if (verbose) {
console.log(`Loaded ${bsz} voice styles`);
}
return new Style(ttlStyle, dpStyle);
}
/**
* Load text to speech components
*/
export async function loadTextToSpeech(onnxDir, useGpu = false) {
const opts = {};
if (useGpu) {
throw new Error('GPU mode is not supported yet');
} else {
console.log('Using CPU for inference');
}
const cfgs = loadCfgs(onnxDir);
const { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt } = await loadOnnxAll(onnxDir, opts);
const textProcessor = loadTextProcessor(onnxDir);
const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
return textToSpeech;
}
/**
* Convert a nested or flat numeric array to a float32 ONNX tensor with the given dims
*/
function arrayToTensor(array, dims) {
// Flatten the array
const flat = array.flat(Infinity);
return new ort.Tensor('float32', Float32Array.from(flat), dims);
}
/**
* Convert a 2D integer array to an int64 ONNX tensor
*/
function intArrayToTensor(array, dims) {
const flat = array.flat(Infinity);
return new ort.Tensor('int64', BigInt64Array.from(flat.map(x => BigInt(x))), dims);
}
/**
* Write WAV file
*/
export function writeWavFile(filename, audioData, sampleRate) {
const numChannels = 1;
const bitsPerSample = 16;
const byteRate = sampleRate * numChannels * bitsPerSample / 8;
const blockAlign = numChannels * bitsPerSample / 8;
const dataSize = audioData.length * bitsPerSample / 8;
const buffer = Buffer.alloc(44 + dataSize);
// RIFF header
buffer.write('RIFF', 0);
buffer.writeUInt32LE(36 + dataSize, 4);
buffer.write('WAVE', 8);
// fmt chunk
buffer.write('fmt ', 12);
buffer.writeUInt32LE(16, 16); // fmt chunk size
buffer.writeUInt16LE(1, 20); // audio format (PCM)
buffer.writeUInt16LE(numChannels, 22);
buffer.writeUInt32LE(sampleRate, 24);
buffer.writeUInt32LE(byteRate, 28);
buffer.writeUInt16LE(blockAlign, 32);
buffer.writeUInt16LE(bitsPerSample, 34);
// data chunk
buffer.write('data', 36);
buffer.writeUInt32LE(dataSize, 40);
// Write audio data
for (let i = 0; i < audioData.length; i++) {
const sample = Math.max(-1, Math.min(1, audioData[i]));
const intSample = Math.floor(sample * 32767);
buffer.writeInt16LE(intSample, 44 + i * 2);
}
fs.writeFileSync(filename, buffer);
}
/**
* Timer utility for measuring execution time
*/
export async function timer(name, fn) {
const start = Date.now();
console.log(`${name}...`);
const result = await fn();
const elapsed = ((Date.now() - start) / 1000).toFixed(2);
console.log(` -> ${name} completed in ${elapsed} sec`);
return result;
}
+26
View File
@@ -0,0 +1,26 @@
{
"name": "tts-onnx-nodejs",
"version": "1.0.0",
"description": "TTS inference using ONNX Runtime for Node.js",
"main": "example_onnx.js",
"type": "module",
"scripts": {
"start": "node example_onnx.js"
},
"keywords": [
"tts",
"onnx",
"speech-synthesis",
"nodejs"
],
"author": "",
"license": "MIT",
"dependencies": {
"fft.js": "^4.0.3",
"js-yaml": "^4.1.0",
"onnxruntime-node": "^1.19.2"
},
"engines": {
"node": ">=16.0.0"
}
}
+83
View File
@@ -0,0 +1,83 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `example_onnx.py`.
## Installation
This project uses [uv](https://docs.astral.sh/uv/) for fast package management.
### Install uv (if not already installed)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### Install dependencies
```bash
uv sync
```
Or if you prefer using traditional pip with requirements.txt:
```bash
pip install -r requirements.txt
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
uv run example_onnx.py
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
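
As a rough sketch of what lands in `results/` with these defaults: the example script names each file from the first 20 characters of the text, with non-alphanumeric characters replaced by underscores, followed by the run index. `sanitize_filename` below mirrors the helper shipped with this example; `output_names` is a hypothetical wrapper added only for illustration.

```python
import re
from typing import List


def sanitize_filename(text: str, max_len: int) -> str:
    """Keep the first max_len characters and replace non-alphanumerics with '_'."""
    prefix = text[:max_len]
    return re.sub(r"[^a-zA-Z0-9]", "_", prefix)


def output_names(text: str, n_test: int, max_len: int = 20) -> List[str]:
    """Hypothetical helper: list the file names written for one text."""
    stem = sanitize_filename(text, max_len)
    return [f"{stem}_{n + 1}.wav" for n in range(n_test)]


if __name__ == "__main__":
    default_text = "This morning, I took a walk in the park"
    for name in output_names(default_text, n_test=4):
        print(name)  # This_morning__I_took_1.wav through This_morning__I_took_4.wav
```

Because only the text prefix and the run index go into the name, two batch entries whose texts share the same first 20 characters would write to the same file.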
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
uv run example_onnx.py \
--voice-style assets/voice_styles/M1.json assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch
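
The pairing rule above can be sketched as follows. This is a minimal illustration, not the example's actual code path: `pair_inputs` is a hypothetical helper, and it raises `ValueError` where `example_onnx.py` uses an `assert`.

```python
from typing import List, Tuple


def pair_inputs(voice_styles: List[str], texts: List[str]) -> List[Tuple[str, str]]:
    """Pair the i-th voice style with the i-th text; both lists must have equal length."""
    if len(voice_styles) != len(texts):
        raise ValueError(
            f"Number of voice styles ({len(voice_styles)}) "
            f"must match number of texts ({len(texts)})"
        )
    return list(zip(voice_styles, texts))


if __name__ == "__main__":
    pairs = pair_inputs(
        ["assets/voice_styles/M1.json", "assets/voice_styles/F1.json"],
        ["The sun sets behind the mountains.", "The weather is beautiful."],
    )
    for style_path, text in pairs:
        print(f"{style_path} -> {text}")
```

The batch size is simply the number of pairs, so passing two styles and two texts yields one batch of two samples.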
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
uv run example_onnx.py \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
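
To illustrate why inference time grows with `--total-step`, here is a toy sketch of the refinement loop, assuming it mirrors the structure in `helper.py`: the estimator is invoked once per step and its output becomes the next input. `denoise` and `ToyEstimator` are hypothetical stand-ins, and the toy update rule is invented for illustration; it is not the model's actual computation.

```python
from typing import Callable, List

Latent = List[float]


def denoise(
    latent: Latent,
    total_step: int,
    estimator: Callable[[Latent, int, int], Latent],
) -> Latent:
    """Call the estimator once per step, feeding each output back in."""
    for step in range(total_step):
        latent = estimator(latent, step, total_step)
    return latent


class ToyEstimator:
    """Hypothetical stand-in for the vector estimator; counts its calls."""

    def __init__(self, target: float) -> None:
        self.target = target
        self.calls = 0

    def __call__(self, latent: Latent, current_step: int, total_step: int) -> Latent:
        self.calls += 1
        remaining = total_step - current_step
        # Move each value a 1/remaining fraction of the way toward the target.
        return [x + (self.target - x) / remaining for x in latent]


if __name__ == "__main__":
    for steps in (5, 10):
        est = ToyEstimator(target=1.0)
        denoise([0.0, 0.5, -2.0], steps, est)
        print(f"total_step={steps}: estimator called {est.calls} times")
```

Doubling the step count doubles the number of estimator calls, which is where the extra inference time comes from.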
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; raises `NotImplementedError`) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
| `--text` | str+ | (long default text) | Text(s) to synthesize |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
Symlink
+1
View File
@@ -0,0 +1 @@
../assets
+91
View File
@@ -0,0 +1,91 @@
import argparse
import os
import soundfile as sf
from helper import load_text_to_speech, timer, sanitize_filename, load_voice_style
def parse_args():
    parser = argparse.ArgumentParser(description="TTS Inference with ONNX")
    # Device settings
    parser.add_argument(
        "--use-gpu", action="store_true", help="Use GPU for inference (default: CPU)"
    )
    # Model settings
    parser.add_argument(
        "--onnx-dir",
        type=str,
        default="assets/onnx",
        help="Path to ONNX model directory",
    )
    # Synthesis parameters
    parser.add_argument(
        "--total-step", type=int, default=5, help="Number of denoising steps"
    )
    parser.add_argument(
        "--n-test", type=int, default=4, help="Number of times to generate"
    )
    # Input/Output
    parser.add_argument(
        "--voice-style",
        type=str,
        nargs="+",
        default=["assets/voice_styles/M1.json"],
        help="Voice style file path(s). Can specify multiple files for batch processing",
    )
    parser.add_argument(
        "--text",
        type=str,
        nargs="+",
        default=[
            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
        ],
        help="Text(s) to synthesize. Can specify multiple texts for batch processing",
    )
    parser.add_argument(
        "--save-dir", type=str, default="results", help="Output directory"
    )
    return parser.parse_args()
print("=== TTS Inference with ONNX Runtime (Python) ===\n")
# --- 1. Parse arguments --- #
args = parse_args()
total_step = args.total_step
n_test = args.n_test
save_dir = args.save_dir
voice_style_paths = args.voice_style
text_list = args.text
assert len(voice_style_paths) == len(
    text_list
), f"Number of voice styles ({len(voice_style_paths)}) must match number of texts ({len(text_list)})"
bsz = len(voice_style_paths)
# --- 2. Load Text to Speech --- #
text_to_speech = load_text_to_speech(args.onnx_dir, args.use_gpu)
# --- 3. Load Voice Style --- #
style = load_voice_style(voice_style_paths, verbose=True)
# --- 4. Synthesize Speech --- #
for n in range(n_test):
    print(f"\n[{n+1}/{n_test}] Starting synthesis...")
    with timer("Generating speech from text"):
        wav, duration = text_to_speech(text_list, style, total_step)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for b in range(bsz):
        fname = f"{sanitize_filename(text_list[b], 20)}_{n+1}.wav"
        w = wav[b, : int(text_to_speech.sample_rate * duration[b].item())]  # [T_trim]
        sf.write(os.path.join(save_dir, fname), w, text_to_speech.sample_rate)
        print(f"Saved: {save_dir}/{fname}")
print("\n=== Synthesis completed successfully! ===")
+249
View File
@@ -0,0 +1,249 @@
import json
import os
import time
from contextlib import contextmanager
from typing import Optional
from unicodedata import normalize
import numpy as np
import onnxruntime as ort
class UnicodeProcessor:
    def __init__(self, unicode_indexer_path: str):
        with open(unicode_indexer_path, "r") as f:
            self.indexer = json.load(f)

    def _preprocess_text(self, text: str) -> str:
        # TODO: add more preprocessing
        text = normalize("NFKD", text)
        return text

    def _get_text_mask(self, text_ids_lengths: np.ndarray) -> np.ndarray:
        text_mask = length_to_mask(text_ids_lengths)
        return text_mask

    def _text_to_unicode_values(self, text: str) -> np.ndarray:
        unicode_values = np.array(
            [ord(char) for char in text], dtype=np.uint16
        )  # 2 bytes
        return unicode_values

    def __call__(self, text_list: list[str]) -> tuple[np.ndarray, np.ndarray]:
        text_list = [self._preprocess_text(t) for t in text_list]
        text_ids_lengths = np.array([len(text) for text in text_list], dtype=np.int64)
        text_ids = np.zeros((len(text_list), text_ids_lengths.max()), dtype=np.int64)
        for i, text in enumerate(text_list):
            unicode_vals = self._text_to_unicode_values(text)
            text_ids[i, : len(unicode_vals)] = np.array(
                [self.indexer[val] for val in unicode_vals], dtype=np.int64
            )
        text_mask = self._get_text_mask(text_ids_lengths)
        return text_ids, text_mask


class Style:
    def __init__(self, style_ttl_onnx: np.ndarray, style_dp_onnx: np.ndarray):
        self.ttl = style_ttl_onnx
        self.dp = style_dp_onnx
class TextToSpeech:
    def __init__(
        self,
        cfgs: dict,
        text_processor: UnicodeProcessor,
        dp_ort: ort.InferenceSession,
        text_enc_ort: ort.InferenceSession,
        vector_est_ort: ort.InferenceSession,
        vocoder_ort: ort.InferenceSession,
    ):
        self.cfgs = cfgs
        self.text_processor = text_processor
        self.dp_ort = dp_ort
        self.text_enc_ort = text_enc_ort
        self.vector_est_ort = vector_est_ort
        self.vocoder_ort = vocoder_ort
        self.sample_rate = cfgs["ae"]["sample_rate"]
        self.base_chunk_size = cfgs["ae"]["base_chunk_size"]
        self.chunk_compress_factor = cfgs["ttl"]["chunk_compress_factor"]
        self.ldim = cfgs["ttl"]["latent_dim"]

    def sample_noisy_latent(
        self, duration: np.ndarray
    ) -> tuple[np.ndarray, np.ndarray]:
        bsz = len(duration)
        wav_len_max = duration.max() * self.sample_rate
        wav_lengths = (duration * self.sample_rate).astype(np.int64)
        chunk_size = self.base_chunk_size * self.chunk_compress_factor
        latent_len = ((wav_len_max + chunk_size - 1) / chunk_size).astype(np.int32)
        latent_dim = self.ldim * self.chunk_compress_factor
        noisy_latent = np.random.randn(bsz, latent_dim, latent_len).astype(np.float32)
        latent_mask = get_latent_mask(
            wav_lengths, self.base_chunk_size, self.chunk_compress_factor
        )
        noisy_latent = noisy_latent * latent_mask
        return noisy_latent, latent_mask

    def __call__(
        self, text_list: list[str], style: Style, total_step: int
    ) -> tuple[np.ndarray, np.ndarray]:
        assert (
            len(text_list) == style.ttl.shape[0]
        ), "Number of texts must match number of style vectors"
        bsz = len(text_list)
        text_ids, text_mask = self.text_processor(text_list)
        dur_onnx, *_ = self.dp_ort.run(
            None, {"text_ids": text_ids, "style_dp": style.dp, "text_mask": text_mask}
        )  # dur_onnx: [bsz]
        text_emb_onnx, *_ = self.text_enc_ort.run(
            None,
            {"text_ids": text_ids, "style_ttl": style.ttl, "text_mask": text_mask},
        )
        xt, latent_mask = self.sample_noisy_latent(dur_onnx)
        total_step_np = np.array([total_step] * bsz, dtype=np.float32)
        # Each step feeds the previous output back in as the next noisy latent
        for step in range(total_step):
            current_step = np.array([step] * bsz, dtype=np.float32)
            xt, *_ = self.vector_est_ort.run(
                None,
                {
                    "noisy_latent": xt,
                    "text_emb": text_emb_onnx,
                    "style_ttl": style.ttl,
                    "text_mask": text_mask,
                    "latent_mask": latent_mask,
                    "current_step": current_step,
                    "total_step": total_step_np,
                },
            )
        wav, *_ = self.vocoder_ort.run(None, {"latent": xt})
        return wav, dur_onnx
def length_to_mask(lengths: np.ndarray, max_len: Optional[int] = None) -> np.ndarray:
    """
    Convert lengths to binary mask.

    Args:
        lengths: (B,)
        max_len: int
    Returns:
        mask: (B, 1, max_len)
    """
    max_len = max_len or lengths.max()
    ids = np.arange(0, max_len)
    mask = (ids < np.expand_dims(lengths, axis=1)).astype(np.float32)
    return mask.reshape(-1, 1, max_len)


def get_latent_mask(
    wav_lengths: np.ndarray, base_chunk_size: int, chunk_compress_factor: int
) -> np.ndarray:
    latent_size = base_chunk_size * chunk_compress_factor
    # Ceiling division: number of latent frames needed to cover each waveform
    latent_lengths = (wav_lengths + latent_size - 1) // latent_size
    latent_mask = length_to_mask(latent_lengths)
    return latent_mask
def load_onnx(
    onnx_path: str, opts: ort.SessionOptions, providers: list[str]
) -> ort.InferenceSession:
    return ort.InferenceSession(onnx_path, sess_options=opts, providers=providers)


def load_onnx_all(
    onnx_dir: str, opts: ort.SessionOptions, providers: list[str]
) -> tuple[
    ort.InferenceSession,
    ort.InferenceSession,
    ort.InferenceSession,
    ort.InferenceSession,
]:
    dp_onnx_path = os.path.join(onnx_dir, "duration_predictor.onnx")
    text_enc_onnx_path = os.path.join(onnx_dir, "text_encoder.onnx")
    vector_est_onnx_path = os.path.join(onnx_dir, "vector_estimator.onnx")
    vocoder_onnx_path = os.path.join(onnx_dir, "vocoder.onnx")
    dp_ort = load_onnx(dp_onnx_path, opts, providers)
    text_enc_ort = load_onnx(text_enc_onnx_path, opts, providers)
    vector_est_ort = load_onnx(vector_est_onnx_path, opts, providers)
    vocoder_ort = load_onnx(vocoder_onnx_path, opts, providers)
    return dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
def load_cfgs(onnx_dir: str) -> dict:
    cfg_path = os.path.join(onnx_dir, "tts.json")
    with open(cfg_path, "r") as f:
        cfgs = json.load(f)
    return cfgs


def load_text_processor(onnx_dir: str) -> UnicodeProcessor:
    unicode_indexer_path = os.path.join(onnx_dir, "unicode_indexer.json")
    text_processor = UnicodeProcessor(unicode_indexer_path)
    return text_processor


def load_text_to_speech(onnx_dir: str, use_gpu: bool = False) -> TextToSpeech:
    opts = ort.SessionOptions()
    if use_gpu:
        raise NotImplementedError("GPU mode is not supported yet")
    else:
        providers = ["CPUExecutionProvider"]
        print("Using CPU for inference")
    cfgs = load_cfgs(onnx_dir)
    dp_ort, text_enc_ort, vector_est_ort, vocoder_ort = load_onnx_all(
        onnx_dir, opts, providers
    )
    text_processor = load_text_processor(onnx_dir)
    return TextToSpeech(
        cfgs, text_processor, dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
    )
def load_voice_style(voice_style_paths: list[str], verbose: bool = False) -> Style:
    bsz = len(voice_style_paths)
    # Read first file to get dimensions
    with open(voice_style_paths[0], "r") as f:
        first_style = json.load(f)
    ttl_dims = first_style["style_ttl"]["dims"]
    dp_dims = first_style["style_dp"]["dims"]
    # Pre-allocate arrays with full batch size
    ttl_style = np.zeros([bsz, ttl_dims[1], ttl_dims[2]], dtype=np.float32)
    dp_style = np.zeros([bsz, dp_dims[1], dp_dims[2]], dtype=np.float32)
    # Fill in the data
    for i, voice_style_path in enumerate(voice_style_paths):
        with open(voice_style_path, "r") as f:
            voice_style = json.load(f)
        ttl_data = np.array(
            voice_style["style_ttl"]["data"], dtype=np.float32
        ).flatten()
        ttl_style[i] = ttl_data.reshape(ttl_dims[1], ttl_dims[2])
        dp_data = np.array(voice_style["style_dp"]["data"], dtype=np.float32).flatten()
        dp_style[i] = dp_data.reshape(dp_dims[1], dp_dims[2])
    if verbose:
        print(f"Loaded {bsz} voice styles")
    return Style(ttl_style, dp_style)
@contextmanager
def timer(name: str):
    start = time.time()
    print(f"{name}...")
    yield
    print(f" -> {name} completed in {time.time() - start:.2f} sec")


def sanitize_filename(text: str, max_len: int) -> str:
    """Sanitize filename by replacing non-alphanumeric characters with underscores"""
    import re

    prefix = text[:max_len]
    return re.sub(r"[^a-zA-Z0-9]", "_", prefix)
+20
View File
@@ -0,0 +1,20 @@
[project]
name = "tts-onnx"
version = "1.0.0"
description = "TTS ONNX Inference"
requires-python = ">=3.10"
dependencies = [
"onnxruntime==1.23.1",
"numpy>=1.26.0",
"soundfile>=0.12.1",
"librosa>=0.10.0",
"PyYAML>=6.0",
]
[tool.setuptools]
py-modules = []
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
+5
View File
@@ -0,0 +1,5 @@
onnxruntime==1.23.1
numpy>=1.26.0
soundfile>=0.12.1
librosa>=0.10.0
PyYAML>=6.0
Generated
+1142
View File
File diff suppressed because it is too large
+21
View File
@@ -0,0 +1,21 @@
# Rust build artifacts
/target/
Cargo.lock
# Output directory
/results/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Debug
*.pdb
+41
View File
@@ -0,0 +1,41 @@
[package]
name = "supertonic-tts"
version = "0.1.0"
edition = "2021"
[dependencies]
# ONNX Runtime
ort = "2.0.0-rc.7"
# Array processing (like NumPy)
ndarray = { version = "0.16", features = ["rayon"] }
rand = "0.8"
rand_distr = "0.4"
# Parallel processing
rayon = "1.10"
# Audio processing
hound = "3.5"
rustfft = "6.2"
# JSON serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
# CLI argument parsing
clap = { version = "4.5", features = ["derive"] }
# Error handling
anyhow = "1.0"
# Unicode normalization
unicode-normalization = "0.1"
# System calls
libc = "0.2"
[[bin]]
name = "example_onnx"
path = "src/example_onnx.rs"
+101
View File
@@ -0,0 +1,101 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using Rust.
## Installation
This project uses [Cargo](https://doc.rust-lang.org/cargo/) for package management.
### Install Rust (if not already installed)
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Build the project
```bash
cargo build --release
```
## Basic Usage
You can run the inference in two ways:
1. **Using cargo run** (builds if needed, then runs)
2. **Direct binary execution** (faster if already built)
### Example 1: Default Inference
Run inference with default settings:
```bash
# Using cargo run
cargo run --release --bin example_onnx
# Or directly execute the built binary (faster)
./target/release/example_onnx
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
# Or using the binary directly
./target/release/example_onnx \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice (M1.json) for the first text
- Use female voice (F1.json) for the second text
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
# Or using the binary directly
./target/release/example_onnx \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; CPU is the default) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str | (long default text) | Text(s) to synthesize, pipe-separated (`\|`) |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Known Issues**: On some platforms (especially macOS), there might be a mutex cleanup warning during exit. This is a known ONNX Runtime issue and doesn't affect functionality. The implementation uses `libc::_exit()` and `mem::forget()` to bypass this issue.
+1
View File
@@ -0,0 +1 @@
../assets
+108
View File
@@ -0,0 +1,108 @@
use anyhow::Result;
use clap::Parser;
use std::path::PathBuf;
use std::fs;
use std::mem;
mod helper;
use helper::{
load_text_to_speech, load_voice_style, timer, write_wav_file, sanitize_filename,
};
#[derive(Parser, Debug)]
#[command(name = "TTS ONNX Inference")]
#[command(about = "TTS Inference with ONNX Runtime (Rust)", long_about = None)]
struct Args {
/// Use GPU for inference (default: CPU)
#[arg(long, default_value = "false")]
use_gpu: bool,
/// Path to ONNX model directory
#[arg(long, default_value = "assets/onnx")]
onnx_dir: String,
/// Number of denoising steps
#[arg(long, default_value = "5")]
total_step: usize,
/// Number of times to generate
#[arg(long, default_value = "4")]
n_test: usize,
/// Voice style file path(s)
#[arg(long, value_delimiter = ',', default_values_t = vec!["assets/voice_styles/M1.json".to_string()])]
voice_style: Vec<String>,
/// Text(s) to synthesize
#[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
text: Vec<String>,
/// Output directory
#[arg(long, default_value = "results")]
save_dir: String,
}
fn main() -> Result<()> {
println!("=== TTS Inference with ONNX Runtime (Rust) ===\n");
// --- 1. Parse arguments --- //
let args = Args::parse();
let total_step = args.total_step;
let n_test = args.n_test;
let voice_style_paths = &args.voice_style;
let text_list = &args.text;
let save_dir = &args.save_dir;
if voice_style_paths.len() != text_list.len() {
anyhow::bail!(
"Number of voice styles ({}) must match number of texts ({})",
voice_style_paths.len(),
text_list.len()
);
}
let bsz = voice_style_paths.len();
// --- 2. Load TTS components --- //
let mut text_to_speech = load_text_to_speech(&args.onnx_dir, args.use_gpu)?;
// --- 3. Load voice styles --- //
let style = load_voice_style(voice_style_paths, true)?;
// --- 4. Synthesize speech --- //
fs::create_dir_all(save_dir)?;
for n in 0..n_test {
println!("\n[{}/{}] Starting synthesis...", n + 1, n_test);
let (wav, duration) = timer("Generating speech from text", || {
text_to_speech.call(text_list, &style, total_step)
})?;
// Save outputs
let wav_len = wav.len() / bsz;
for i in 0..bsz {
let fname = format!("{}_{}.wav", sanitize_filename(&text_list[i], 20), n + 1);
let actual_len = (text_to_speech.sample_rate as f32 * duration[i]) as usize;
let wav_start = i * wav_len;
let wav_end = wav_start + actual_len.min(wav_len);
let wav_slice = &wav[wav_start..wav_end];
let output_path = PathBuf::from(save_dir).join(&fname);
write_wav_file(&output_path, wav_slice, text_to_speech.sample_rate)?;
println!("Saved: {}", output_path.display());
}
}
println!("\n=== Synthesis completed successfully! ===");
// Prevent ONNX Runtime sessions from being dropped, which causes mutex cleanup issues
mem::forget(text_to_speech);
// Use _exit to bypass all cleanup handlers and avoid ONNX Runtime mutex issues on macOS
unsafe {
libc::_exit(0);
}
}
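
The output-saving loop in `main` above treats the vocoder output as one flat, batch-major buffer in which every sample occupies the same padded length, and trims each sample to its predicted duration. A minimal sketch of that slicing, using only the standard library, might look as follows. The helper name `split_batch` is hypothetical, and the equal-length layout is an assumption taken from how the loop computes `wav_len`.

```rust
/// Split a flat, batch-major waveform buffer into one trimmed slice per sample.
/// `split_batch` is a hypothetical helper used only for illustration.
fn split_batch<'a>(wav: &'a [f32], durations: &[f32], sample_rate: u32) -> Vec<&'a [f32]> {
    let bsz = durations.len();
    if bsz == 0 {
        return Vec::new();
    }
    // Every sample is assumed to occupy the same padded length in the buffer.
    let padded_len = wav.len() / bsz;
    durations
        .iter()
        .enumerate()
        .map(|(i, &d)| {
            // Number of samples that carry audio for this item.
            let actual_len = (sample_rate as f32 * d) as usize;
            let start = i * padded_len;
            // Never read past this item's padded region.
            let end = start + actual_len.min(padded_len);
            &wav[start..end]
        })
        .collect()
}
```

With an 8-sample buffer, durations of 0.5 s and 1.0 s, and a sample rate of 4 Hz, the first slice holds 2 samples and the second holds 4.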
+507
View File
@@ -0,0 +1,507 @@
// ============================================================================
// TTS Helper Module - All utility functions and structures
// ============================================================================
use ndarray::{Array, Array3};
use serde::{Deserialize, Serialize};
use serde_json;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;
use anyhow::{Result, Context};
use unicode_normalization::UnicodeNormalization;
use hound::{WavWriter, WavSpec, SampleFormat};
use rand_distr::{Distribution, Normal};
// ============================================================================
// Configuration Structures
// ============================================================================
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Config {
pub ae: AEConfig,
pub ttl: TTLConfig,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AEConfig {
pub sample_rate: i32,
pub base_chunk_size: i32,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TTLConfig {
pub chunk_compress_factor: i32,
pub latent_dim: i32,
}
/// Load configuration from JSON file
pub fn load_cfgs<P: AsRef<Path>>(onnx_dir: P) -> Result<Config> {
let cfg_path = onnx_dir.as_ref().join("tts.json");
let file = File::open(cfg_path)?;
let reader = BufReader::new(file);
let cfgs: Config = serde_json::from_reader(reader)?;
Ok(cfgs)
}
// ============================================================================
// Voice Style Data Structure
// ============================================================================
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VoiceStyleData {
pub style_ttl: StyleComponent,
pub style_dp: StyleComponent,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StyleComponent {
pub data: Vec<Vec<Vec<f32>>>,
pub dims: Vec<usize>,
#[serde(rename = "type")]
pub dtype: String,
}
// ============================================================================
// Unicode Text Processor
// ============================================================================
pub struct UnicodeProcessor {
indexer: Vec<i64>,
}
impl UnicodeProcessor {
pub fn new<P: AsRef<Path>>(unicode_indexer_json_path: P) -> Result<Self> {
let file = File::open(unicode_indexer_json_path)?;
let reader = BufReader::new(file);
let indexer: Vec<i64> = serde_json::from_reader(reader)?;
Ok(UnicodeProcessor { indexer })
}
pub fn call(&self, text_list: &[String]) -> (Vec<Vec<i64>>, Array3<f32>) {
let processed_texts: Vec<String> = text_list
.iter()
.map(|t| preprocess_text(t))
.collect();
let text_ids_lengths: Vec<usize> = processed_texts
.iter()
.map(|t| t.chars().count())
.collect();
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
let mut text_ids = Vec::new();
for text in &processed_texts {
let mut row = vec![0i64; max_len];
let unicode_vals = text_to_unicode_values(text);
for (j, &val) in unicode_vals.iter().enumerate() {
if val < self.indexer.len() {
row[j] = self.indexer[val];
} else {
row[j] = -1;
}
}
text_ids.push(row);
}
let text_mask = get_text_mask(&text_ids_lengths);
(text_ids, text_mask)
}
}
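/// Normalize text to Unicode NFKD (compatibility decomposition) before indexing.
pub fn preprocess_text(text: &str) -> String {
    text.nfkd().collect()
}
/// Map each character to its Unicode scalar value, used as an index into the indexer table.
pub fn text_to_unicode_values(text: &str) -> Vec<usize> {
    text.chars().map(|c| c as usize).collect()
}
/// Build a `(batch, 1, max_len)` mask holding 1.0 for valid positions and 0.0 for padding.
pub fn length_to_mask(lengths: &[usize], max_len: Option<usize>) -> Array3<f32> {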
let bsz = lengths.len();
let max_len = max_len.unwrap_or_else(|| *lengths.iter().max().unwrap_or(&0));
let mut mask = Array3::<f32>::zeros((bsz, 1, max_len));
for (i, &len) in lengths.iter().enumerate() {
for j in 0..len.min(max_len) {
mask[[i, 0, j]] = 1.0;
}
}
mask
}
pub fn get_text_mask(text_ids_lengths: &[usize]) -> Array3<f32> {
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
length_to_mask(text_ids_lengths, Some(max_len))
}
/// Sample noisy latent from normal distribution and apply mask
pub fn sample_noisy_latent(
duration: &[f32],
sample_rate: i32,
base_chunk_size: i32,
chunk_compress: i32,
latent_dim: i32,
) -> (Array3<f32>, Array3<f32>) {
let bsz = duration.len();
let max_dur = duration.iter().fold(0.0f32, |a, &b| a.max(b));
let wav_len_max = (max_dur * sample_rate as f32) as usize;
let wav_lengths: Vec<usize> = duration
.iter()
.map(|&d| (d * sample_rate as f32) as usize)
.collect();
let chunk_size = (base_chunk_size * chunk_compress) as usize;
let latent_len = (wav_len_max + chunk_size - 1) / chunk_size;
let latent_dim_val = (latent_dim * chunk_compress) as usize;
let mut noisy_latent = Array3::<f32>::zeros((bsz, latent_dim_val, latent_len));
let normal = Normal::new(0.0, 1.0).unwrap();
let mut rng = rand::thread_rng();
for b in 0..bsz {
for d in 0..latent_dim_val {
for t in 0..latent_len {
noisy_latent[[b, d, t]] = normal.sample(&mut rng);
}
}
}
let latent_lengths: Vec<usize> = wav_lengths
.iter()
.map(|&len| (len + chunk_size - 1) / chunk_size)
.collect();
let latent_mask = length_to_mask(&latent_lengths, Some(latent_len));
// Apply mask
for b in 0..bsz {
for d in 0..latent_dim_val {
for t in 0..latent_len {
noisy_latent[[b, d, t]] *= latent_mask[[b, 0, t]];
}
}
}
(noisy_latent, latent_mask)
}
// ============================================================================
// WAV File I/O
// ============================================================================
pub fn write_wav_file<P: AsRef<Path>>(
filename: P,
audio_data: &[f32],
sample_rate: i32,
) -> Result<()> {
let spec = WavSpec {
channels: 1,
sample_rate: sample_rate as u32,
bits_per_sample: 16,
sample_format: SampleFormat::Int,
};
let mut writer = WavWriter::create(filename, spec)?;
for &sample in audio_data {
let clamped = sample.max(-1.0).min(1.0);
let val = (clamped * 32767.0) as i16;
writer.write_sample(val)?;
}
writer.finalize()?;
Ok(())
}
// ============================================================================
// Utility Functions
// ============================================================================
pub fn timer<F, T>(name: &str, f: F) -> Result<T>
where
F: FnOnce() -> Result<T>,
{
let start = std::time::Instant::now();
println!("{}...", name);
let result = f()?;
let elapsed = start.elapsed().as_secs_f64();
println!(" -> {} completed in {:.2} sec", name, elapsed);
Ok(result)
}
pub fn sanitize_filename(text: &str, max_len: usize) -> String {
    // Truncate by characters, not bytes: slicing `&text[..max_len]` panics
    // when `max_len` falls inside a multi-byte UTF-8 character.
    text.chars()
        .take(max_len)
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect()
}
// ============================================================================
// ONNX Runtime Integration
// ============================================================================
use ort::{
session::Session,
value::Value,
};
pub struct Style {
pub ttl: Array3<f32>,
pub dp: Array3<f32>,
}
pub struct TextToSpeech {
cfgs: Config,
text_processor: UnicodeProcessor,
dp_ort: Session,
text_enc_ort: Session,
vector_est_ort: Session,
vocoder_ort: Session,
pub sample_rate: i32,
}
impl TextToSpeech {
pub fn new(
cfgs: Config,
text_processor: UnicodeProcessor,
dp_ort: Session,
text_enc_ort: Session,
vector_est_ort: Session,
vocoder_ort: Session,
) -> Self {
let sample_rate = cfgs.ae.sample_rate;
TextToSpeech {
cfgs,
text_processor,
dp_ort,
text_enc_ort,
vector_est_ort,
vocoder_ort,
sample_rate,
}
}
pub fn call(
&mut self,
text_list: &[String],
style: &Style,
total_step: usize,
) -> Result<(Vec<f32>, Vec<f32>)> {
let bsz = text_list.len();
// Process text
let (text_ids, text_mask) = self.text_processor.call(text_list);
let text_ids_array = {
let text_ids_shape = (bsz, text_ids[0].len());
let mut flat = Vec::new();
for row in &text_ids {
flat.extend_from_slice(row);
}
Array::from_shape_vec(text_ids_shape, flat)?
};
let text_ids_value = Value::from_array(text_ids_array)?;
let text_mask_value = Value::from_array(text_mask.clone())?;
let style_dp_value = Value::from_array(style.dp.clone())?;
// Predict duration
let dp_outputs = self.dp_ort.run(ort::inputs!{
"text_ids" => &text_ids_value,
"style_dp" => &style_dp_value,
"text_mask" => &text_mask_value
})?;
let (_, duration_data) = dp_outputs["duration"].try_extract_tensor::<f32>()?;
let duration: Vec<f32> = duration_data.to_vec();
// Encode text
let style_ttl_value = Value::from_array(style.ttl.clone())?;
let text_enc_outputs = self.text_enc_ort.run(ort::inputs!{
"text_ids" => &text_ids_value,
"style_ttl" => &style_ttl_value,
"text_mask" => &text_mask_value
})?;
let (text_emb_shape, text_emb_data) = text_enc_outputs["text_emb"].try_extract_tensor::<f32>()?;
let text_emb = Array3::from_shape_vec(
(text_emb_shape[0] as usize, text_emb_shape[1] as usize, text_emb_shape[2] as usize),
text_emb_data.to_vec()
)?;
// Sample noisy latent
let (mut xt, latent_mask) = sample_noisy_latent(
&duration,
self.sample_rate,
self.cfgs.ae.base_chunk_size,
self.cfgs.ttl.chunk_compress_factor,
self.cfgs.ttl.latent_dim,
);
// Prepare constant arrays
let total_step_array = Array::from_elem(bsz, total_step as f32);
// Denoising loop
for step in 0..total_step {
let current_step_array = Array::from_elem(bsz, step as f32);
let xt_value = Value::from_array(xt.clone())?;
let text_emb_value = Value::from_array(text_emb.clone())?;
let latent_mask_value = Value::from_array(latent_mask.clone())?;
let text_mask_value2 = Value::from_array(text_mask.clone())?;
let current_step_value = Value::from_array(current_step_array)?;
let total_step_value = Value::from_array(total_step_array.clone())?;
let vector_est_outputs = self.vector_est_ort.run(ort::inputs!{
"noisy_latent" => &xt_value,
"text_emb" => &text_emb_value,
"style_ttl" => &style_ttl_value,
"latent_mask" => &latent_mask_value,
"text_mask" => &text_mask_value2,
"current_step" => &current_step_value,
"total_step" => &total_step_value
})?;
let (denoised_shape, denoised_data) = vector_est_outputs["denoised_latent"].try_extract_tensor::<f32>()?;
xt = Array3::from_shape_vec(
(denoised_shape[0] as usize, denoised_shape[1] as usize, denoised_shape[2] as usize),
denoised_data.to_vec()
)?;
}
// Generate waveform
let final_latent_value = Value::from_array(xt)?;
let vocoder_outputs = self.vocoder_ort.run(ort::inputs!{
"latent" => &final_latent_value
})?;
let (_, wav_data) = vocoder_outputs["wav_tts"].try_extract_tensor::<f32>()?;
let wav: Vec<f32> = wav_data.to_vec();
Ok((wav, duration))
}
}
// ============================================================================
// Component Loading Functions
// ============================================================================
/// Load voice style from JSON files
pub fn load_voice_style(voice_style_paths: &[String], verbose: bool) -> Result<Style> {
let bsz = voice_style_paths.len();
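// Read first file to get dimensions (fail cleanly instead of panicking on an empty list)
let first_path = voice_style_paths
    .first()
    .context("At least one voice style path is required")?;
let first_file = File::open(first_path)
    .with_context(|| format!("Failed to open voice style file: {}", first_path))?;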
let first_reader = BufReader::new(first_file);
let first_data: VoiceStyleData = serde_json::from_reader(first_reader)?;
let ttl_dims = &first_data.style_ttl.dims;
let dp_dims = &first_data.style_dp.dims;
let ttl_dim1 = ttl_dims[1];
let ttl_dim2 = ttl_dims[2];
let dp_dim1 = dp_dims[1];
let dp_dim2 = dp_dims[2];
// Pre-allocate arrays with full batch size
let ttl_size = bsz * ttl_dim1 * ttl_dim2;
let dp_size = bsz * dp_dim1 * dp_dim2;
let mut ttl_flat = vec![0.0f32; ttl_size];
let mut dp_flat = vec![0.0f32; dp_size];
// Fill in the data
for (i, path) in voice_style_paths.iter().enumerate() {
let file = File::open(path)
    .with_context(|| format!("Failed to open voice style file: {}", path))?;
let reader = BufReader::new(file);
let data: VoiceStyleData = serde_json::from_reader(reader)?;
// Flatten TTL data
let ttl_offset = i * ttl_dim1 * ttl_dim2;
let mut idx = 0;
for batch in &data.style_ttl.data {
for row in batch {
for &val in row {
ttl_flat[ttl_offset + idx] = val;
idx += 1;
}
}
}
// Flatten DP data
let dp_offset = i * dp_dim1 * dp_dim2;
idx = 0;
for batch in &data.style_dp.data {
for row in batch {
for &val in row {
dp_flat[dp_offset + idx] = val;
idx += 1;
}
}
}
}
let ttl_style = Array3::from_shape_vec((bsz, ttl_dim1, ttl_dim2), ttl_flat)?;
let dp_style = Array3::from_shape_vec((bsz, dp_dim1, dp_dim2), dp_flat)?;
if verbose {
println!("Loaded {} voice styles\n", bsz);
}
Ok(Style {
ttl: ttl_style,
dp: dp_style,
})
}
/// Load TTS components
pub fn load_text_to_speech(onnx_dir: &str, use_gpu: bool) -> Result<TextToSpeech> {
if use_gpu {
anyhow::bail!("GPU mode is not supported yet");
}
println!("Using CPU for inference\n");
let cfgs = load_cfgs(onnx_dir)?;
let dp_path = format!("{}/duration_predictor.onnx", onnx_dir);
let text_enc_path = format!("{}/text_encoder.onnx", onnx_dir);
let vector_est_path = format!("{}/vector_estimator.onnx", onnx_dir);
let vocoder_path = format!("{}/vocoder.onnx", onnx_dir);
let dp_ort = Session::builder()?
.commit_from_file(&dp_path)?;
let text_enc_ort = Session::builder()?
.commit_from_file(&text_enc_path)?;
let vector_est_ort = Session::builder()?
.commit_from_file(&vector_est_path)?;
let vocoder_ort = Session::builder()?
.commit_from_file(&vocoder_path)?;
let unicode_indexer_path = format!("{}/unicode_indexer.json", onnx_dir);
let text_processor = UnicodeProcessor::new(&unicode_indexer_path)?;
Ok(TextToSpeech::new(
cfgs,
text_processor,
dp_ort,
text_enc_ort,
vector_est_ort,
vocoder_ort,
))
}
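
`sample_noisy_latent` above derives the latent tensor shape from the predicted duration: the waveform length is divided by the chunk size with rounding up, and the channel count is the latent dimension times the compression factor. The sketch below isolates that arithmetic. The helper name `latent_shape` is hypothetical, and the numeric values in the usage note are illustrative assumptions, not values read from `tts.json`.

```rust
/// Compute `(channels, frames)` of the latent for a single utterance.
/// `latent_shape` is a hypothetical helper used only for illustration.
fn latent_shape(
    duration_sec: f32,
    sample_rate: u32,
    base_chunk_size: usize,
    chunk_compress: usize,
    latent_dim: usize,
) -> (usize, usize) {
    // Waveform length in samples, truncated toward zero as in the helper above.
    let wav_len = (duration_sec * sample_rate as f32) as usize;
    // One latent frame covers `base_chunk_size * chunk_compress` samples.
    let chunk_size = base_chunk_size * chunk_compress;
    // Ceiling division, so a partial final chunk still gets a frame.
    let latent_len = (wav_len + chunk_size - 1) / chunk_size;
    // Compression folds `chunk_compress` frames into the channel axis.
    (latent_dim * chunk_compress, latent_len)
}
```

For example, `latent_shape(2.0, 44100, 512, 6, 24)` gives `(144, 29)`: 88200 samples divided by a chunk size of 3072 rounds up to 29 frames, and 24 × 6 = 144 channels.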
+15
View File
@@ -0,0 +1,15 @@
# Swift Package Manager
.build/
.swiftpm/
*.xcodeproj
*.xcworkspace
# Build artifacts
example_onnx
# Results
results/*.wav
# macOS
.DS_Store
+14
View File
@@ -0,0 +1,14 @@
{
"pins" : [
{
"identity" : "onnxruntime-swift-package-manager",
"kind" : "remoteSourceControl",
"location" : "https://github.com/microsoft/onnxruntime-swift-package-manager.git",
"state" : {
"revision" : "12ce7374c86944e1f68f3a866d10105d8357f074",
"version" : "1.20.0"
}
}
],
"version" : 2
}
+22
View File
@@ -0,0 +1,22 @@
// swift-tools-version: 5.9
import PackageDescription
let package = Package(
name: "Supertonic",
platforms: [
.macOS(.v13)
],
dependencies: [
.package(url: "https://github.com/microsoft/onnxruntime-swift-package-manager.git", from: "1.16.0"),
],
targets: [
.executableTarget(
name: "example_onnx",
dependencies: [
.product(name: "onnxruntime", package: "onnxruntime-swift-package-manager")
],
path: "Sources"
)
]
)
+76
View File
@@ -0,0 +1,76 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `example_onnx`.
## Installation
This project uses Swift Package Manager (SPM) for dependency management.
### Prerequisites
- Swift 5.9 or later
- macOS 13.0 or later
### Build the project
```bash
swift build -c release
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
.build/release/example_onnx
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
.build/release/example_onnx \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
This will:
- Generate speech for 2 different voice-text pairs
- Use male voice (M1.json) for the first text
- Use female voice (F1.json) for the second text
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
.build/release/example_onnx \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
| `--save-dir` | str | `results` | Output directory |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
+122
View File
@@ -0,0 +1,122 @@
import Foundation
import OnnxRuntimeBindings
struct Args {
var useGpu: Bool = false
var onnxDir: String = "assets/onnx"
var totalStep: Int = 5
var nTest: Int = 4
var voiceStyle: [String] = ["assets/voice_styles/M1.json"]
var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
var saveDir: String = "results"
}
func parseArgs() -> Args {
var args = Args()
let arguments = CommandLine.arguments
var i = 1
while i < arguments.count {
let arg = arguments[i]
switch arg {
case "--use-gpu":
args.useGpu = true
case "--onnx-dir":
if i + 1 < arguments.count {
args.onnxDir = arguments[i + 1]
i += 1
}
case "--total-step":
if i + 1 < arguments.count {
args.totalStep = Int(arguments[i + 1]) ?? 5
i += 1
}
case "--n-test":
if i + 1 < arguments.count {
args.nTest = Int(arguments[i + 1]) ?? 4
i += 1
}
case "--voice-style":
if i + 1 < arguments.count {
args.voiceStyle = arguments[i + 1].components(separatedBy: ",")
i += 1
}
case "--text":
if i + 1 < arguments.count {
args.text = arguments[i + 1].components(separatedBy: "|")
i += 1
}
case "--save-dir":
if i + 1 < arguments.count {
args.saveDir = arguments[i + 1]
i += 1
}
default:
break
}
i += 1
}
return args
}
@main
struct ExampleONNX {
static func main() async {
print("=== TTS Inference with ONNX Runtime (Swift) ===\n")
// --- 1. Parse arguments --- //
let args = parseArgs()
guard args.voiceStyle.count == args.text.count else {
print("Error: Number of voice styles (\(args.voiceStyle.count)) must match number of texts (\(args.text.count))")
exit(1)
}
let bsz = args.voiceStyle.count
do {
let env = try ORTEnv(loggingLevel: .warning)
// --- 2. Load TTS components --- //
let textToSpeech = try loadTextToSpeech(args.onnxDir, args.useGpu, env)
// --- 3. Load voice styles --- //
let style = try loadVoiceStyle(args.voiceStyle, verbose: true)
// --- 4. Synthesize speech --- //
try FileManager.default.createDirectory(atPath: args.saveDir, withIntermediateDirectories: true)
for n in 0..<args.nTest {
print("\n[\(n + 1)/\(args.nTest)] Starting synthesis...")
let (wav, duration) = try timer("Generating speech from text") {
try textToSpeech.call(args.text, style, args.totalStep)
}
// Save outputs
let wavLen = wav.count / bsz
for i in 0..<bsz {
let fname = "\(sanitizeFilename(args.text[i], maxLen: 20))_\(n + 1).wav"
let actualLen = Int(Float(textToSpeech.sampleRate) * duration[i])
let wavStart = i * wavLen
let wavEnd = min(wavStart + actualLen, wavStart + wavLen)
let wavOut = Array(wav[wavStart..<wavEnd])
let outputPath = "\(args.saveDir)/\(fname)"
try writeWavFile(outputPath, wavOut, textToSpeech.sampleRate)
print("Saved: \(outputPath)")
}
}
print("\n=== Synthesis completed successfully! ===")
} catch {
print("Error during inference: \(error)")
exit(1)
}
}
}
+483
View File
@@ -0,0 +1,483 @@
import Foundation
import Accelerate
import OnnxRuntimeBindings
// MARK: - Configuration Structures
struct Config: Codable {
struct AEConfig: Codable {
let sample_rate: Int
let base_chunk_size: Int
}
struct TTLConfig: Codable {
let chunk_compress_factor: Int
let latent_dim: Int
}
let ae: AEConfig
let ttl: TTLConfig
}
// MARK: - Voice Style Data Structure
struct VoiceStyleData: Codable {
struct StyleComponent: Codable {
let data: [[[Float]]]
let dims: [Int]
let type: String
}
let style_ttl: StyleComponent
let style_dp: StyleComponent
}
// MARK: - Unicode Text Processor
class UnicodeProcessor {
let indexer: [Int64]
init(unicodeIndexerPath: String) throws {
let data = try Data(contentsOf: URL(fileURLWithPath: unicodeIndexerPath))
self.indexer = try JSONDecoder().decode([Int64].self, from: data)
}
func call(_ textList: [String]) -> (textIds: [[Int64]], textMask: [[[Float]]]) {
let processedTexts = textList.map { preprocessText($0) }
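// Count Unicode scalars, not Characters (grapheme clusters): the rows below are
// filled per scalar, so a shorter Character count could overflow `row`.
let textIdsLengths = processedTexts.map { $0.unicodeScalars.count }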
let maxLen = textIdsLengths.max() ?? 0
var textIds = [[Int64]]()
for text in processedTexts {
var row = Array(repeating: Int64(0), count: maxLen)
let unicodeValues = Array(text.unicodeScalars.map { Int($0.value) })
for (j, val) in unicodeValues.enumerated() {
if val < indexer.count {
row[j] = indexer[val]
} else {
row[j] = -1
}
}
textIds.append(row)
}
let textMask = getTextMask(textIdsLengths)
return (textIds, textMask)
}
}
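// Note: `precomposedStringWithCompatibilityMapping` applies NFKC, whereas the Rust helper normalizes with NFKD (`text.nfkd()`).
func preprocessText(_ text: String) -> String {
return text.precomposedStringWithCompatibilityMapping
}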
func lengthToMask(_ lengths: [Int], maxLen: Int? = nil) -> [[[Float]]] {
let actualMaxLen = maxLen ?? (lengths.max() ?? 0)
var mask = [[[Float]]]()
for len in lengths {
var row = Array(repeating: Float(0.0), count: actualMaxLen)
for j in 0..<min(len, actualMaxLen) {
row[j] = 1.0
}
mask.append([row])
}
return mask
}
func getTextMask(_ textIdsLengths: [Int]) -> [[[Float]]] {
let maxLen = textIdsLengths.max() ?? 0
return lengthToMask(textIdsLengths, maxLen: maxLen)
}
func sampleNoisyLatent(duration: [Float], sampleRate: Int, baseChunkSize: Int, chunkCompress: Int, latentDim: Int) -> (noisyLatent: [[[Float]]], latentMask: [[[Float]]]) {
let bsz = duration.count
let maxDur = duration.max() ?? 0.0
let wavLenMax = Int(maxDur * Float(sampleRate))
var wavLengths = [Int]()
for d in duration {
wavLengths.append(Int(d * Float(sampleRate)))
}
let chunkSize = baseChunkSize * chunkCompress
let latentLen = (wavLenMax + chunkSize - 1) / chunkSize
let latentDimVal = latentDim * chunkCompress
var noisyLatent = [[[Float]]]()
for _ in 0..<bsz {
var batch = [[Float]]()
for _ in 0..<latentDimVal {
var row = [Float]()
for _ in 0..<latentLen {
// Box-Muller transform
let u1 = Float.random(in: 0.0001...1.0)
let u2 = Float.random(in: 0.0...1.0)
let val = sqrt(-2.0 * log(u1)) * cos(2.0 * Float.pi * u2)
row.append(val)
}
batch.append(row)
}
noisyLatent.append(batch)
}
var latentLengths = [Int]()
for len in wavLengths {
latentLengths.append((len + chunkSize - 1) / chunkSize)
}
let latentMask = lengthToMask(latentLengths, maxLen: latentLen)
// Apply mask
for b in 0..<bsz {
for d in 0..<latentDimVal {
for t in 0..<latentLen {
noisyLatent[b][d][t] *= latentMask[b][0][t]
}
}
}
return (noisyLatent, latentMask)
}
func getLatentMask(_ wavLengths: [Int64], _ cfgs: Config) -> [[[Float]]] {
let baseChunkSize = cfgs.ae.base_chunk_size
let chunkCompressFactor = cfgs.ttl.chunk_compress_factor
let latentSize = baseChunkSize * chunkCompressFactor
var latentLengths = [Int]()
for len in wavLengths {
latentLengths.append((Int(len) + latentSize - 1) / latentSize)
}
let maxLen = latentLengths.max() ?? 0
return lengthToMask(latentLengths, maxLen: maxLen)
}
// MARK: - WAV File I/O
func writeWavFile(_ filename: String, _ audioData: [Float], _ sampleRate: Int) throws {
let url = URL(fileURLWithPath: filename)
// Convert float to int16
let int16Data = audioData.map { sample -> Int16 in
let clamped = max(-1.0, min(1.0, sample))
return Int16(clamped * 32767.0)
}
// Create WAV header
let numChannels: UInt16 = 1
let bitsPerSample: UInt16 = 16
let byteRate = UInt32(sampleRate) * UInt32(numChannels) * UInt32(bitsPerSample) / 8
let blockAlign = numChannels * bitsPerSample / 8
let dataSize = UInt32(int16Data.count * 2)
var data = Data()
// RIFF chunk
data.append("RIFF".data(using: .ascii)!)
withUnsafeBytes(of: UInt32(36 + dataSize).littleEndian) { data.append(contentsOf: $0) }
data.append("WAVE".data(using: .ascii)!)
// fmt chunk
data.append("fmt ".data(using: .ascii)!)
withUnsafeBytes(of: UInt32(16).littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: UInt16(1).littleEndian) { data.append(contentsOf: $0) } // PCM
withUnsafeBytes(of: numChannels.littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: UInt32(sampleRate).littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: byteRate.littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: blockAlign.littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: bitsPerSample.littleEndian) { data.append(contentsOf: $0) }
// data chunk
data.append("data".data(using: .ascii)!)
withUnsafeBytes(of: dataSize.littleEndian) { data.append(contentsOf: $0) }
// audio data
int16Data.withUnsafeBytes { data.append(contentsOf: $0) }
try data.write(to: url)
}
// MARK: - Utility Functions
func timer<T>(_ name: String, _ f: () throws -> T) rethrows -> T {
let start = Date()
print("\(name)...")
let result = try f()
let elapsed = Date().timeIntervalSince(start)
print(String(format: " -> %@ completed in %.2f sec", name, elapsed))
return result
}
func sanitizeFilename(_ text: String, maxLen: Int) -> String {
let truncated = text.count > maxLen ? String(text.prefix(maxLen)) : text
return truncated.map { char in
if char.isLetter || char.isNumber {
return char
} else {
return Character("_")
}
}.map(String.init).joined()
}
func loadCfgs(_ onnxDir: String) throws -> Config {
let cfgPath = "\(onnxDir)/tts.json"
let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath))
let config = try JSONDecoder().decode(Config.self, from: data)
return config
}
// MARK: - ONNX Runtime Integration
struct Style {
let ttl: ORTValue
let dp: ORTValue
}
class TextToSpeech {
let cfgs: Config
let textProcessor: UnicodeProcessor
let dpOrt: ORTSession
let textEncOrt: ORTSession
let vectorEstOrt: ORTSession
let vocoderOrt: ORTSession
let sampleRate: Int
init(cfgs: Config, textProcessor: UnicodeProcessor,
dpOrt: ORTSession, textEncOrt: ORTSession,
vectorEstOrt: ORTSession, vocoderOrt: ORTSession) {
self.cfgs = cfgs
self.textProcessor = textProcessor
self.dpOrt = dpOrt
self.textEncOrt = textEncOrt
self.vectorEstOrt = vectorEstOrt
self.vocoderOrt = vocoderOrt
self.sampleRate = cfgs.ae.sample_rate
}
func call(_ textList: [String], _ style: Style, _ totalStep: Int) throws -> (wav: [Float], duration: [Float]) {
let bsz = textList.count
// Process text
let (textIds, textMask) = textProcessor.call(textList)
// Flatten text IDs
let textIdsFlat = textIds.flatMap { $0 }
let textIdsShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: textIds[0].count)]
let textIdsValue = try ORTValue(tensorData: NSMutableData(bytes: textIdsFlat, length: textIdsFlat.count * MemoryLayout<Int64>.size),
elementType: .int64,
shape: textIdsShape)
// Flatten text mask
let textMaskFlat = textMask.flatMap { $0.flatMap { $0 } }
let textMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: textMask[0][0].count)]
let textMaskValue = try ORTValue(tensorData: NSMutableData(bytes: textMaskFlat, length: textMaskFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: textMaskShape)
// Predict duration
let dpOutputs = try dpOrt.run(withInputs: ["text_ids": textIdsValue, "style_dp": style.dp, "text_mask": textMaskValue],
outputNames: ["duration"],
runOptions: nil)
let durationData = try dpOutputs["duration"]!.tensorData() as Data
let duration = durationData.withUnsafeBytes { ptr in
Array(ptr.bindMemory(to: Float.self))
}
// Encode text
let textEncOutputs = try textEncOrt.run(withInputs: ["text_ids": textIdsValue, "style_ttl": style.ttl, "text_mask": textMaskValue],
outputNames: ["text_emb"],
runOptions: nil)
let textEmbValue = textEncOutputs["text_emb"]!
// Sample noisy latent
var (xt, latentMask) = sampleNoisyLatent(duration: duration, sampleRate: sampleRate,
baseChunkSize: cfgs.ae.base_chunk_size,
chunkCompress: cfgs.ttl.chunk_compress_factor,
latentDim: cfgs.ttl.latent_dim)
// Prepare constant arrays
let totalStepArray = Array(repeating: Float(totalStep), count: bsz)
let totalStepValue = try ORTValue(tensorData: NSMutableData(bytes: totalStepArray, length: totalStepArray.count * MemoryLayout<Float>.size),
elementType: .float,
shape: [NSNumber(value: bsz)])
// Denoising loop
for step in 0..<totalStep {
let currentStepArray = Array(repeating: Float(step), count: bsz)
let currentStepValue = try ORTValue(tensorData: NSMutableData(bytes: currentStepArray, length: currentStepArray.count * MemoryLayout<Float>.size),
elementType: .float,
shape: [NSNumber(value: bsz)])
// Flatten xt
let xtFlat = xt.flatMap { $0.flatMap { $0 } }
let xtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
let xtValue = try ORTValue(tensorData: NSMutableData(bytes: xtFlat, length: xtFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: xtShape)
// Flatten latent mask
let latentMaskFlat = latentMask.flatMap { $0.flatMap { $0 } }
let latentMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: latentMask[0][0].count)]
let latentMaskValue = try ORTValue(tensorData: NSMutableData(bytes: latentMaskFlat, length: latentMaskFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: latentMaskShape)
let vectorEstOutputs = try vectorEstOrt.run(withInputs: [
"noisy_latent": xtValue,
"text_emb": textEmbValue,
"style_ttl": style.ttl,
"latent_mask": latentMaskValue,
"text_mask": textMaskValue,
"current_step": currentStepValue,
"total_step": totalStepValue
], outputNames: ["denoised_latent"], runOptions: nil)
let denoisedData = try vectorEstOutputs["denoised_latent"]!.tensorData() as Data
let denoisedFlat = denoisedData.withUnsafeBytes { ptr in
Array(ptr.bindMemory(to: Float.self))
}
// Reshape to 3D
let latentDimVal = xt[0].count
let latentLen = xt[0][0].count
xt = []
var idx = 0
for _ in 0..<bsz {
var batch = [[Float]]()
for _ in 0..<latentDimVal {
var row = [Float]()
for _ in 0..<latentLen {
row.append(denoisedFlat[idx])
idx += 1
}
batch.append(row)
}
xt.append(batch)
}
}
// Generate waveform
let finalXtFlat = xt.flatMap { $0.flatMap { $0 } }
let finalXtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
let finalXtValue = try ORTValue(tensorData: NSMutableData(bytes: finalXtFlat, length: finalXtFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: finalXtShape)
let vocoderOutputs = try vocoderOrt.run(withInputs: ["latent": finalXtValue],
outputNames: ["wav_tts"],
runOptions: nil)
let wavData = try vocoderOutputs["wav_tts"]!.tensorData() as Data
let wav = wavData.withUnsafeBytes { ptr in
Array(ptr.bindMemory(to: Float.self))
}
return (wav, duration)
}
}
// MARK: - Component Loading Functions
func loadVoiceStyle(_ voiceStylePaths: [String], verbose: Bool) throws -> Style {
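let bsz = voiceStylePaths.count
// Fail with a clear error instead of crashing on voiceStylePaths[0] below
guard bsz > 0 else {
throw NSError(domain: "TTS", code: 2, userInfo: [NSLocalizedDescriptionKey: "At least one voice style path is required"])
}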
// Read first file to get dimensions
let firstData = try Data(contentsOf: URL(fileURLWithPath: voiceStylePaths[0]))
let firstStyle = try JSONDecoder().decode(VoiceStyleData.self, from: firstData)
let ttlDims = firstStyle.style_ttl.dims
let dpDims = firstStyle.style_dp.dims
let ttlDim1 = ttlDims[1]
let ttlDim2 = ttlDims[2]
let dpDim1 = dpDims[1]
let dpDim2 = dpDims[2]
// Pre-allocate arrays with full batch size
let ttlSize = bsz * ttlDim1 * ttlDim2
let dpSize = bsz * dpDim1 * dpDim2
var ttlFlat = [Float](repeating: 0.0, count: ttlSize)
var dpFlat = [Float](repeating: 0.0, count: dpSize)
// Fill in the data
for (i, path) in voiceStylePaths.enumerated() {
let data = try Data(contentsOf: URL(fileURLWithPath: path))
let voiceStyle = try JSONDecoder().decode(VoiceStyleData.self, from: data)
// Flatten TTL data
let ttlOffset = i * ttlDim1 * ttlDim2
var idx = 0
for batch in voiceStyle.style_ttl.data {
for row in batch {
for val in row {
ttlFlat[ttlOffset + idx] = val
idx += 1
}
}
}
// Flatten DP data
let dpOffset = i * dpDim1 * dpDim2
idx = 0
for batch in voiceStyle.style_dp.data {
for row in batch {
for val in row {
dpFlat[dpOffset + idx] = val
idx += 1
}
}
}
}
let ttlShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: ttlDim1), NSNumber(value: ttlDim2)]
let dpShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: dpDim1), NSNumber(value: dpDim2)]
let ttlValue = try ORTValue(tensorData: NSMutableData(bytes: &ttlFlat, length: ttlFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: ttlShape)
let dpValue = try ORTValue(tensorData: NSMutableData(bytes: &dpFlat, length: dpFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: dpShape)
if verbose {
print("Loaded \(bsz) voice styles\n")
}
return Style(ttl: ttlValue, dp: dpValue)
}
func loadTextToSpeech(_ onnxDir: String, _ useGpu: Bool, _ env: ORTEnv) throws -> TextToSpeech {
if useGpu {
throw NSError(domain: "TTS", code: 1, userInfo: [NSLocalizedDescriptionKey: "GPU mode is not supported yet"])
}
print("Using CPU for inference\n")
let cfgs = try loadCfgs(onnxDir)
let sessionOptions = try ORTSessionOptions()
let dpPath = "\(onnxDir)/duration_predictor.onnx"
let textEncPath = "\(onnxDir)/text_encoder.onnx"
let vectorEstPath = "\(onnxDir)/vector_estimator.onnx"
let vocoderPath = "\(onnxDir)/vocoder.onnx"
let dpOrt = try ORTSession(env: env, modelPath: dpPath, sessionOptions: sessionOptions)
let textEncOrt = try ORTSession(env: env, modelPath: textEncPath, sessionOptions: sessionOptions)
let vectorEstOrt = try ORTSession(env: env, modelPath: vectorEstPath, sessionOptions: sessionOptions)
let vocoderOrt = try ORTSession(env: env, modelPath: vocoderPath, sessionOptions: sessionOptions)
let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json"
let textProcessor = try UnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath)
return TextToSpeech(cfgs: cfgs, textProcessor: textProcessor,
dpOrt: dpOrt, textEncOrt: textEncOrt,
vectorEstOrt: vectorEstOrt, vocoderOrt: vocoderOrt)
}
+1
View File
@@ -0,0 +1 @@
../assets
Executable
+248
View File
@@ -0,0 +1,248 @@
#!/bin/bash
# Supertonic - Test All Language Implementations
# This script runs inference tests for all supported languages except web
set -e # Exit on error
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$SCRIPT_DIR"
echo "=================================="
echo "Supertonic - Testing All Examples"
echo "=================================="
echo ""
# Ask user to select test mode
echo "Select test mode:"
echo " 1) Default inference only"
echo " 2) Batch inference only"
echo " 3) Both default and batch inference"
echo -e "Enter your choice (1/2/3) [default: 1]: \c"
read -r test_mode
test_mode=${test_mode:-1}
case $test_mode in
1)
TEST_DEFAULT=true
TEST_BATCH=false
echo "Running default inference tests only"
;;
2)
TEST_DEFAULT=false
TEST_BATCH=true
echo "Running batch inference tests only"
;;
3)
TEST_DEFAULT=true
TEST_BATCH=true
echo "Running both default and batch inference tests"
;;
*)
echo "Invalid choice. Using default inference only."
TEST_DEFAULT=true
TEST_BATCH=false
;;
esac
echo ""
# Batch inference test data - base variables
BATCH_VOICE_STYLE_1="assets/voice_styles/M1.json"
BATCH_VOICE_STYLE_2="assets/voice_styles/F1.json"
BATCH_TEXT_1="The sun sets behind the mountains, painting the sky in shades of pink and orange."
BATCH_TEXT_2="The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
# Ask if user wants to clean results folders
echo -e "Do you want to clean all results folders before running tests? (y/N): \c"
read -r response
if [[ "$response" =~ ^[Yy]$ ]]; then
echo ""
echo "Cleaning results folders..."
# List of result directories
declare -a RESULT_DIRS=(
"py/results"
"nodejs/results"
"go/results"
"rust/results"
"csharp/results"
"java/results"
"swift/results"
"cpp/build/results"
)
for dir in "${RESULT_DIRS[@]}"; do
if [ -d "$SCRIPT_DIR/$dir" ]; then
echo " - Cleaning $dir"
rm -rf "$SCRIPT_DIR/$dir"/*
fi
done
echo "Results folders cleaned!"
echo ""
fi
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Track results
declare -a PASSED=()
declare -a FAILED=()
# Helper function to run tests
run_test() {
local name=$1
local dir=$2
shift 2
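local cmd="$*"
echo -e "${BLUE}[$name]${NC} Running inference..."
cd "$SCRIPT_DIR/$dir"
# Run command and prefix each output line with the language name.
# pipefail (scoped to a subshell) makes the pipeline report the command's
# exit status; without it the check only sees sed's status and never fails.
if (set -o pipefail; eval "$cmd" 2>&1 | sed "s/^/[$name] /"); then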
echo -e "${GREEN}[$name]${NC} ✓ Success"
PASSED+=("$name")
else
echo -e "${RED}[$name]${NC} ✗ Failed"
FAILED+=("$name")
fi
echo ""
cd "$SCRIPT_DIR"
}
# ====================================
# Python
# ====================================
echo -e "${YELLOW}Testing Python...${NC}"
if [ "$TEST_DEFAULT" = true ]; then
run_test "Python (default)" "py" "uv run example_onnx.py"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Python (batch)" "py" "uv run example_onnx.py --voice-style $BATCH_VOICE_STYLE_1 $BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2'"
fi
# ====================================
# JavaScript (Node.js)
# ====================================
echo -e "${YELLOW}Testing JavaScript (Node.js)...${NC}"
echo "Installing Node.js dependencies..."
cd nodejs && npm install --silent && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
fi
# ====================================
# Go
# ====================================
echo -e "${YELLOW}Testing Go...${NC}"
echo "Cleaning Go cache..."
cd go && go clean && cd ..
# macOS/Homebrew only: locate the ONNX Runtime shared library installed via brew
export ONNXRUNTIME_LIB_PATH="$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib"
if [ "$TEST_DEFAULT" = true ]; then
run_test "Go (default)" "go" "go run example_onnx.go helper.go"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Go (batch)" "go" "go run example_onnx.go helper.go --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
fi
# ====================================
# Rust
# ====================================
echo -e "${YELLOW}Testing Rust...${NC}"
echo "Building Rust project..."
cd rust && cargo clean && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "Rust (default)" "rust" "cargo run --release"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Rust (batch)" "rust" "cargo run --release -- --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
fi
# ====================================
# C#
# ====================================
echo -e "${YELLOW}Testing C#...${NC}"
echo "Building C# project..."
cd csharp && dotnet clean && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "C# (default)" "csharp" "dotnet run --configuration Release"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "C# (batch)" "csharp" "dotnet run --configuration Release -- --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
fi
# ====================================
# Java
# ====================================
echo -e "${YELLOW}Testing Java...${NC}"
echo "Building Java project..."
cd java && mvn clean install -q && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "Java (default)" "java" "mvn exec:java -q"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Java (batch)" "java" "mvn exec:java -q -Dexec.args='--voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\"'"
fi
# ====================================
# Swift
# ====================================
echo -e "${YELLOW}Testing Swift...${NC}"
echo "Building Swift project..."
cd swift && swift build -c release && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "Swift (default)" "swift" ".build/release/example_onnx"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Swift (batch)" "swift" ".build/release/example_onnx --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
fi
# ====================================
# C++
# ====================================
echo -e "${YELLOW}Testing C++...${NC}"
echo "Building C++ project..."
cd cpp && mkdir -p build && cd build && cmake .. && make && cd ../..
if [ "$TEST_DEFAULT" = true ]; then
run_test "C++ (default)" "cpp/build" "./example_onnx"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "C++ (batch)" "cpp/build" "./example_onnx --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
fi
# ====================================
# Summary
# ====================================
echo "=================================="
echo "Test Summary"
echo "=================================="
echo ""
if [ ${#PASSED[@]} -gt 0 ]; then
echo -e "${GREEN}Passed (${#PASSED[@]}):${NC}"
for lang in "${PASSED[@]}"; do
echo -e "  ${GREEN}✓${NC} $lang"
done
echo ""
fi
if [ ${#FAILED[@]} -gt 0 ]; then
echo -e "${RED}Failed (${#FAILED[@]}):${NC}"
for lang in "${FAILED[@]}"; do
echo -e "  ${RED}✗${NC} $lang"
done
echo ""
exit 1
else
echo -e "${GREEN}All tests passed! 🎉${NC}"
exit 0
fi
+4
View File
@@ -0,0 +1,4 @@
node_modules/
dist/
.DS_Store
*.log
+98
View File
@@ -0,0 +1,98 @@
# Supertonic Web Example
This example demonstrates how to use Supertonic in a web browser using ONNX Runtime Web.
## Features
- 🌐 Runs entirely in the browser (no server required for inference)
- 🚀 WebGPU support with automatic fallback to WebAssembly
- ⚡ Pre-extracted voice styles for instant generation
- 🎨 Modern, responsive UI
- 🎭 Multiple voice style presets (2 Male, 2 Female)
- 💾 Download generated audio as WAV files
- 📊 Detailed generation statistics (audio length, generation time)
- ⏱️ Real-time progress tracking
## Requirements
- Node.js (for development server)
- Modern web browser (Chrome, Edge, Firefox, Safari)
## Installation
1. Install dependencies:
```bash
npm install
```
## Running the Demo
Start the development server:
```bash
npm run dev
```
This will start a local development server (usually at http://localhost:3000) and open the demo in your browser.
## Usage
1. **Wait for Models to Load**: The app will automatically load models and the default voice style (M1)
2. **Select Voice Style**: Choose from available voice presets
- **Male 1 (M1)**: Default male voice
- **Male 2 (M2)**: Alternative male voice
- **Female 1 (F1)**: Default female voice
- **Female 2 (F2)**: Alternative female voice
3. **Enter Text**: Type or paste the text you want to convert to speech
4. **Adjust Settings** (optional):
- **Total Steps**: Number of denoising steps; more steps = better quality but slower (default: 5)
5. **Generate Speech**: Click the "Generate Speech" button
6. **View Results**:
- See the full input text
- View audio length and generation time statistics
- Play the generated audio in the browser
- Download as WAV file
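The last step above, turning generated samples into a downloadable WAV, comes down to converting floating-point audio to 16-bit PCM. The following is a minimal sketch, assuming mono audio with samples nominally in [-1, 1]. `floatToPcm16` and `wavByteLength` are hypothetical names used for illustration and are not part of the demo's `helper.js`; the sketch rounds to the nearest integer, which is one reasonable choice among several.

```javascript
// Hypothetical helper: convert float samples in [-1, 1] to 16-bit PCM.
function floatToPcm16(samples) {
    const pcm = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
        // Clamp first so out-of-range values do not wrap around.
        const clamped = Math.max(-1.0, Math.min(1.0, samples[i]));
        pcm[i] = Math.round(clamped * 32767);
    }
    return pcm;
}

// Byte size of a mono 16-bit WAV file: 44-byte header plus 2 bytes per sample.
function wavByteLength(numSamples) {
    return 44 + numSamples * 2;
}
```

For example, one second of audio at a 44100 Hz sample rate would occupy `wavByteLength(44100)` bytes, that is 88244.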
## Technical Details
### Browser Compatibility
This demo uses:
- **ONNX Runtime Web**: For running models in the browser
- **Web Audio API**: For playing generated audio
- **Vite**: For development and bundling
## Notes
- The ONNX models must be accessible at `assets/onnx/` relative to the web root
- Voice style JSON files must be accessible at `assets/voice_styles/` relative to the web root
- Pre-extracted voice styles enable instant generation without audio processing
- Four voice style presets are provided (M1, M2, F1, F2)
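As a quick reference, the files the demo requests can be enumerated with a small sketch like the one below. The file names are the ones requested by the demo's loader code and voice style selector; `expectedAssetPaths` itself is a hypothetical helper, and the default root of `assets` is an assumption about how the site is served.

```javascript
// Hypothetical helper: list every asset path the demo is expected to fetch.
function expectedAssetPaths(root = 'assets') {
    const onnxFiles = [
        'tts.json',
        'unicode_indexer.json',
        'duration_predictor.onnx',
        'text_encoder.onnx',
        'vector_estimator.onnx',
        'vocoder.onnx'
    ];
    const voiceStyles = ['M1', 'M2', 'F1', 'F2'];
    return [
        ...onnxFiles.map(name => `${root}/onnx/${name}`),
        ...voiceStyles.map(name => `${root}/voice_styles/${name}.json`)
    ];
}
```

Requesting each returned path from the browser's network tab (or with `fetch`) is one way to confirm that nothing is missing before debugging further.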
## Troubleshooting
### Models not loading
- Check browser console for errors
- Ensure `assets/onnx/` path is correct and models are accessible
- Check CORS settings if serving from a different domain
### WebGPU not available
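- WebGPU requires a recent browser: Chrome and Edge have shipped it since version 113, while support in Firefox and Safari arrived later and may depend on platform and version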
- The app will automatically fall back to WebAssembly if WebGPU is not available
- Check the backend badge to see which execution provider is being used
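The fallback behavior described above follows a try-in-order pattern. A minimal sketch, assuming a caller-supplied async factory: `createWithFallback` and `createSession` are hypothetical names standing in for whatever actually creates the inference session, not functions from the demo.

```javascript
// Try each execution provider in order and return the first one that works.
// `createSession` is a hypothetical async factory supplied by the caller.
async function createWithFallback(providers, createSession) {
    const errors = [];
    for (const provider of providers) {
        try {
            const session = await createSession(provider);
            return { provider, session };
        } catch (err) {
            // Remember why this provider failed, then try the next one.
            errors.push(`${provider}: ${err.message}`);
        }
    }
    throw new Error(`No execution provider available (${errors.join('; ')})`);
}
```

Called as `createWithFallback(['webgpu', 'wasm'], createSession)`, this resolves with the first provider that initializes, and the collected error messages make it easier to see why an earlier provider was skipped.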
### Out of memory errors
- Try shorter text inputs
- Reduce denoising steps
- Use a browser with more available memory
- Close other tabs to free up memory
### Audio quality issues
- Try different voice style presets
- Increase denoising steps for better quality
### Slow generation
- If using WebAssembly, try a browser that supports WebGPU
- Ensure no other heavy processes are running
- Consider using fewer denoising steps for faster (but lower quality) results
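Generation time grows with both the number of denoising steps and the length of the latent sequence, since the vector estimator runs once per step over the whole latent. The sketch below mirrors the arithmetic the demo's helper uses to derive the latent length from the predicted duration. `latentLength` is a hypothetical name, and the configuration values in the usage note are made-up placeholders, not the values shipped in `tts.json`.

```javascript
// Number of latent frames needed to cover `durationSec` seconds of audio.
// Each latent frame covers baseChunkSize * chunkCompress waveform samples.
function latentLength(durationSec, sampleRate, baseChunkSize, chunkCompress) {
    const wavLen = Math.floor(durationSec * sampleRate);
    const chunkSize = baseChunkSize * chunkCompress;
    // Integer ceiling division: a partially filled chunk still needs a frame.
    return Math.floor((wavLen + chunkSize - 1) / chunkSize);
}
```

With placeholder values of a 1000 Hz sample rate, a base chunk size of 10, and a compression factor of 5, doubling the duration from 1 s to 2 s doubles the latent length from 20 to 40 frames, which is why shorter inputs and fewer steps both reduce generation time.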
Symlink
+1
View File
@@ -0,0 +1 @@
../assets
+396
View File
@@ -0,0 +1,396 @@
import * as ort from 'onnxruntime-web';
/**
* Unicode Text Processor
*/
export class UnicodeProcessor {
constructor(indexer) {
this.indexer = indexer;
}
call(textList) {
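// Split each normalized string into code points so that characters outside the
// Basic Multilingual Plane (stored as surrogate pairs) count as one symbol.
const processedTexts = textList.map(text => Array.from(this.preprocessText(text)));
const textIdsLengths = processedTexts.map(chars => chars.length);
const maxLen = Math.max(...textIdsLengths);
const textIds = processedTexts.map(chars => {
const row = new Array(maxLen).fill(0);
for (let j = 0; j < chars.length; j++) {
const codePoint = chars[j].codePointAt(0);
// Code points beyond the indexer table map to -1
row[j] = (codePoint < this.indexer.length) ? this.indexer[codePoint] : -1;
}
return row;
});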
const textMask = this.getTextMask(textIdsLengths);
return { textIds, textMask };
}
preprocessText(text) {
return text.normalize('NFKC');
}
getTextMask(textIdsLengths) {
const maxLen = Math.max(...textIdsLengths);
return this.lengthToMask(textIdsLengths, maxLen);
}
lengthToMask(lengths, maxLen = null) {
const actualMaxLen = maxLen || Math.max(...lengths);
return lengths.map(len => {
const row = new Array(actualMaxLen).fill(0.0);
for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
row[j] = 1.0;
}
return [row];
});
}
}
/**
* Style class to hold TTL and DP tensors
*/
export class Style {
constructor(ttlTensor, dpTensor) {
this.ttl = ttlTensor;
this.dp = dpTensor;
}
}
/**
* Text-to-Speech class
*/
export class TextToSpeech {
constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
this.cfgs = cfgs;
this.textProcessor = textProcessor;
this.dpOrt = dpOrt;
this.textEncOrt = textEncOrt;
this.vectorEstOrt = vectorEstOrt;
this.vocoderOrt = vocoderOrt;
this.sampleRate = cfgs.ae.sample_rate;
}
async call(textList, style, totalStep, progressCallback = null) {
const bsz = textList.length;
// Process text
const { textIds, textMask } = this.textProcessor.call(textList);
const textIdsFlat = new BigInt64Array(textIds.flat().map(x => BigInt(x)));
const textIdsShape = [bsz, textIds[0].length];
const textIdsTensor = new ort.Tensor('int64', textIdsFlat, textIdsShape);
const textMaskFlat = new Float32Array(textMask.flat(2));
const textMaskShape = [bsz, 1, textMask[0][0].length];
const textMaskTensor = new ort.Tensor('float32', textMaskFlat, textMaskShape);
// Predict duration
const dpOutputs = await this.dpOrt.run({
text_ids: textIdsTensor,
style_dp: style.dp,
text_mask: textMaskTensor
});
const duration = Array.from(dpOutputs.duration.data);
// Encode text
const textEncOutputs = await this.textEncOrt.run({
text_ids: textIdsTensor,
style_ttl: style.ttl,
text_mask: textMaskTensor
});
const textEmb = textEncOutputs.text_emb;
// Sample noisy latent
let { xt, latentMask } = this.sampleNoisyLatent(
duration,
this.sampleRate,
this.cfgs.ae.base_chunk_size,
this.cfgs.ttl.chunk_compress_factor,
this.cfgs.ttl.latent_dim
);
const latentMaskFlat = new Float32Array(latentMask.flat(2));
const latentMaskShape = [bsz, 1, latentMask[0][0].length];
const latentMaskTensor = new ort.Tensor('float32', latentMaskFlat, latentMaskShape);
// Prepare constant arrays
const totalStepArray = new Float32Array(bsz).fill(totalStep);
const totalStepTensor = new ort.Tensor('float32', totalStepArray, [bsz]);
// Denoising loop
for (let step = 0; step < totalStep; step++) {
if (progressCallback) {
progressCallback(step + 1, totalStep);
}
const currentStepArray = new Float32Array(bsz).fill(step);
const currentStepTensor = new ort.Tensor('float32', currentStepArray, [bsz]);
const xtFlat = new Float32Array(xt.flat(2));
const xtShape = [bsz, xt[0].length, xt[0][0].length];
const xtTensor = new ort.Tensor('float32', xtFlat, xtShape);
const vectorEstOutputs = await this.vectorEstOrt.run({
noisy_latent: xtTensor,
text_emb: textEmb,
style_ttl: style.ttl,
latent_mask: latentMaskTensor,
text_mask: textMaskTensor,
current_step: currentStepTensor,
total_step: totalStepTensor
});
const denoised = Array.from(vectorEstOutputs.denoised_latent.data);
// Reshape to 3D
const latentDim = xt[0].length;
const latentLen = xt[0][0].length;
xt = [];
let idx = 0;
for (let b = 0; b < bsz; b++) {
const batch = [];
for (let d = 0; d < latentDim; d++) {
const row = [];
for (let t = 0; t < latentLen; t++) {
row.push(denoised[idx++]);
}
batch.push(row);
}
xt.push(batch);
}
}
// Generate waveform
const finalXtFlat = new Float32Array(xt.flat(2));
const finalXtShape = [bsz, xt[0].length, xt[0][0].length];
const finalXtTensor = new ort.Tensor('float32', finalXtFlat, finalXtShape);
const vocoderOutputs = await this.vocoderOrt.run({
latent: finalXtTensor
});
const wav = Array.from(vocoderOutputs.wav_tts.data);
return { wav, duration };
}
sampleNoisyLatent(duration, sampleRate, baseChunkSize, chunkCompress, latentDim) {
const bsz = duration.length;
const maxDur = Math.max(...duration);
const wavLenMax = Math.floor(maxDur * sampleRate);
const wavLengths = duration.map(d => Math.floor(d * sampleRate));
const chunkSize = baseChunkSize * chunkCompress;
const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
const latentDimVal = latentDim * chunkCompress;
const xt = [];
for (let b = 0; b < bsz; b++) {
const batch = [];
for (let d = 0; d < latentDimVal; d++) {
const row = [];
for (let t = 0; t < latentLen; t++) {
// Box-Muller transform
const u1 = Math.max(0.0001, Math.random());
const u2 = Math.random();
const val = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
row.push(val);
}
batch.push(row);
}
xt.push(batch);
}
const latentLengths = wavLengths.map(len => Math.floor((len + chunkSize - 1) / chunkSize));
const latentMask = this.lengthToMask(latentLengths, latentLen);
// Apply mask
for (let b = 0; b < bsz; b++) {
for (let d = 0; d < latentDimVal; d++) {
for (let t = 0; t < latentLen; t++) {
xt[b][d][t] *= latentMask[b][0][t];
}
}
}
return { xt, latentMask };
}
lengthToMask(lengths, maxLen = null) {
const actualMaxLen = maxLen || Math.max(...lengths);
return lengths.map(len => {
const row = new Array(actualMaxLen).fill(0.0);
for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
row[j] = 1.0;
}
return [row];
});
}
}
/**
* Load voice style from JSON files
*/
export async function loadVoiceStyle(voiceStylePaths, verbose = false) {
const bsz = voiceStylePaths.length;
// Read first file to get dimensions
const firstResponse = await fetch(voiceStylePaths[0]);
const firstStyle = await firstResponse.json();
const ttlDims = firstStyle.style_ttl.dims;
const dpDims = firstStyle.style_dp.dims;
const ttlDim1 = ttlDims[1];
const ttlDim2 = ttlDims[2];
const dpDim1 = dpDims[1];
const dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
const ttlSize = bsz * ttlDim1 * ttlDim2;
const dpSize = bsz * dpDim1 * dpDim2;
const ttlFlat = new Float32Array(ttlSize);
const dpFlat = new Float32Array(dpSize);
// Fill in the data
for (let i = 0; i < bsz; i++) {
const response = await fetch(voiceStylePaths[i]);
const voiceStyle = await response.json();
// Flatten TTL data
const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
const ttlOffset = i * ttlDim1 * ttlDim2;
ttlFlat.set(ttlData, ttlOffset);
// Flatten DP data
const dpData = voiceStyle.style_dp.data.flat(Infinity);
const dpOffset = i * dpDim1 * dpDim2;
dpFlat.set(dpData, dpOffset);
}
const ttlShape = [bsz, ttlDim1, ttlDim2];
const dpShape = [bsz, dpDim1, dpDim2];
const ttlTensor = new ort.Tensor('float32', ttlFlat, ttlShape);
const dpTensor = new ort.Tensor('float32', dpFlat, dpShape);
if (verbose) {
console.log(`Loaded ${bsz} voice styles`);
}
return new Style(ttlTensor, dpTensor);
}
/**
* Load configuration from JSON
*/
export async function loadCfgs(onnxDir) {
const response = await fetch(`${onnxDir}/tts.json`);
const cfgs = await response.json();
return cfgs;
}
/**
* Load text processor
*/
export async function loadTextProcessor(onnxDir) {
const response = await fetch(`${onnxDir}/unicode_indexer.json`);
const indexer = await response.json();
return new UnicodeProcessor(indexer);
}
/**
* Load ONNX model
*/
export async function loadOnnx(onnxPath, options) {
const session = await ort.InferenceSession.create(onnxPath, options);
return session;
}
/**
* Load all TTS components
*/
export async function loadTextToSpeech(onnxDir, sessionOptions = {}, progressCallback = null) {
console.log('Using WebAssembly/WebGPU for inference');
const cfgs = await loadCfgs(onnxDir);
const dpPath = `${onnxDir}/duration_predictor.onnx`;
const textEncPath = `${onnxDir}/text_encoder.onnx`;
const vectorEstPath = `${onnxDir}/vector_estimator.onnx`;
const vocoderPath = `${onnxDir}/vocoder.onnx`;
const modelPaths = [
{ name: 'Duration Predictor', path: dpPath },
{ name: 'Text Encoder', path: textEncPath },
{ name: 'Vector Estimator', path: vectorEstPath },
{ name: 'Vocoder', path: vocoderPath }
];
const sessions = [];
for (let i = 0; i < modelPaths.length; i++) {
if (progressCallback) {
progressCallback(modelPaths[i].name, i + 1, modelPaths.length);
}
const session = await loadOnnx(modelPaths[i].path, sessionOptions);
sessions.push(session);
}
const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = sessions;
const textProcessor = await loadTextProcessor(onnxDir);
const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
return { textToSpeech, cfgs };
}
/**
* Write WAV file to ArrayBuffer
*/
export function writeWavFile(audioData, sampleRate) {
const numChannels = 1;
const bitsPerSample = 16;
const byteRate = sampleRate * numChannels * bitsPerSample / 8;
const blockAlign = numChannels * bitsPerSample / 8;
const dataSize = audioData.length * 2;
// Create ArrayBuffer
const buffer = new ArrayBuffer(44 + dataSize);
const view = new DataView(buffer);
// Write WAV header
const writeString = (offset, string) => {
for (let i = 0; i < string.length; i++) {
view.setUint8(offset + i, string.charCodeAt(i));
}
};
writeString(0, 'RIFF');
view.setUint32(4, 36 + dataSize, true);
writeString(8, 'WAVE');
writeString(12, 'fmt ');
view.setUint32(16, 16, true);
view.setUint16(20, 1, true); // PCM
view.setUint16(22, numChannels, true);
view.setUint32(24, sampleRate, true);
view.setUint32(28, byteRate, true);
view.setUint16(32, blockAlign, true);
view.setUint16(34, bitsPerSample, true);
writeString(36, 'data');
view.setUint32(40, dataSize, true);
// Write audio data
const int16Data = new Int16Array(audioData.length);
for (let i = 0; i < audioData.length; i++) {
const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
int16Data[i] = Math.floor(clamped * 32767);
}
const dataView = new Uint8Array(buffer, 44);
dataView.set(new Uint8Array(int16Data.buffer));
return buffer;
}
+72
View File
@@ -0,0 +1,72 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Supertonic - Web Demo</title>
<link rel="stylesheet" href="/style.css">
</head>
<body>
<div class="container">
<h1>🎤 Supertonic</h1>
<p class="subtitle">Text-to-Speech with ONNX Runtime Web</p>
<div id="statusBox" class="status-box">
<div class="status-text-wrapper">
<div id="statusText"><strong>Loading models...</strong>
Please wait...</div>
</div>
<div id="backendBadge" class="backend-badge">WebAssembly</div>
</div>
<div class="main-content">
<div class="left-panel">
<div class="section">
<div class="ref-audio-label">
<label for="voiceStyleSelect">Voice Style: </label>
<span id="voiceStyleInfo"
class="ref-audio-info">Loading...</span>
</div>
<select id="voiceStyleSelect">
<option value="assets/voice_styles/M1.json">Male 1 (M1)</option>
<option value="assets/voice_styles/M2.json">Male 2 (M2)</option>
<option value="assets/voice_styles/F1.json">Female 1 (F1)</option>
<option value="assets/voice_styles/F2.json">Female 2 (F2)</option>
</select>
</div>
<div class="section">
<label for="text">Text to Synthesize:</label>
<textarea id="text"
placeholder="Enter the text you want to convert to speech...">This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.</textarea>
</div>
<div class="params-grid">
<div class="section">
<label for="totalStep">Total Steps (higher = better
quality):</label>
<input type="number" id="totalStep" value="5"
min="1" max="50">
</div>
</div>
<button id="generateBtn">Generate Speech</button>
<div id="error" class="error"></div>
</div>
<div class="right-panel">
<div id="results" class="results">
<div class="results-placeholder">
<div class="results-placeholder-icon">🎤</div>
<p>Generated speech will appear here</p>
</div>
</div>
</div>
</div>
</div>
<script type="module" src="/main.js"></script>
</body>
</html>
+285
View File
@@ -0,0 +1,285 @@
import {
loadTextToSpeech,
loadVoiceStyle,
writeWavFile
} from './helper.js';
// Configuration
const DEFAULT_VOICE_STYLE_PATH = 'assets/voice_styles/M1.json';
// Helper function to extract filename from path
function getFilenameFromPath(path) {
return path.split('/').pop();
}
// Global state
let textToSpeech = null;
let cfgs = null;
// Pre-computed style
let currentStyle = null;
let currentStylePath = DEFAULT_VOICE_STYLE_PATH;
// UI Elements
const textInput = document.getElementById('text');
const voiceStyleSelect = document.getElementById('voiceStyleSelect');
const voiceStyleInfo = document.getElementById('voiceStyleInfo');
const totalStepInput = document.getElementById('totalStep');
const generateBtn = document.getElementById('generateBtn');
const statusBox = document.getElementById('statusBox');
const statusText = document.getElementById('statusText');
const backendBadge = document.getElementById('backendBadge');
const resultsContainer = document.getElementById('results');
const errorBox = document.getElementById('error');
function showStatus(message, type = 'info') {
statusText.innerHTML = message;
statusBox.className = 'status-box';
if (type === 'success') {
statusBox.classList.add('success');
} else if (type === 'error') {
statusBox.classList.add('error');
}
}
function showError(message) {
errorBox.textContent = message;
errorBox.classList.add('active');
}
function hideError() {
errorBox.classList.remove('active');
}
function showBackendBadge() {
backendBadge.classList.add('visible');
}
// Load voice style from JSON
async function loadStyleFromJSON(stylePath) {
try {
const style = await loadVoiceStyle([stylePath], true);
return style;
} catch (error) {
console.error('Error loading voice style:', error);
throw error;
}
}
// Load models on page load
async function initializeModels() {
try {
showStatus('<strong>Loading configuration...</strong>');
const basePath = 'assets/onnx';
// Try WebGPU first, fallback to WASM
let executionProvider = 'wasm';
try {
const result = await loadTextToSpeech(basePath, {
executionProviders: ['webgpu'],
graphOptimizationLevel: 'all'
}, (modelName, current, total) => {
showStatus(`<strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
});
textToSpeech = result.textToSpeech;
cfgs = result.cfgs;
executionProvider = 'webgpu';
backendBadge.textContent = 'WebGPU';
backendBadge.style.background = '#4caf50';
} catch (webgpuError) {
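// Any failure here (not only missing WebGPU support) triggers the fallback, so log the cause
console.warn('WebGPU initialization failed, falling back to WebAssembly:', webgpuError);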
const result = await loadTextToSpeech(basePath, {
executionProviders: ['wasm'],
graphOptimizationLevel: 'all'
}, (modelName, current, total) => {
showStatus(`<strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
});
textToSpeech = result.textToSpeech;
cfgs = result.cfgs;
}
showStatus('<strong>Loading default voice style...</strong>');
// Load default voice style
currentStyle = await loadStyleFromJSON(currentStylePath);
voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
showStatus(`✅ <strong>Models loaded!</strong> Using ${executionProvider.toUpperCase()}. You can now generate speech.`, 'success');
showBackendBadge();
generateBtn.disabled = false;
} catch (error) {
console.error('Error loading models:', error);
showStatus(`❌ <strong>Error loading models:</strong> ${error.message}`, 'error');
}
}
// Handle voice style selection
voiceStyleSelect.addEventListener('change', async (e) => {
const selectedValue = e.target.value;
if (!selectedValue) return;
try {
generateBtn.disabled = true;
showStatus(`<strong>Loading voice style...</strong>`, 'info');
currentStylePath = selectedValue;
currentStyle = await loadStyleFromJSON(currentStylePath);
voiceStyleInfo.textContent = getFilenameFromPath(currentStylePath);
showStatus(`✅ <strong>Voice style loaded:</strong> ${getFilenameFromPath(currentStylePath)}`, 'success');
generateBtn.disabled = false;
} catch (error) {
showError(`Error loading voice style: ${error.message}`);
// Restore default style
currentStylePath = DEFAULT_VOICE_STYLE_PATH;
voiceStyleSelect.value = currentStylePath;
try {
currentStyle = await loadStyleFromJSON(currentStylePath);
voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
} catch (styleError) {
console.error('Error restoring default style:', styleError);
}
generateBtn.disabled = false;
}
});
// Main synthesis function
async function generateSpeech() {
const text = textInput.value.trim();
if (!text) {
showError('Please enter some text to synthesize.');
return;
}
if (!textToSpeech || !cfgs) {
showError('Models are still loading. Please wait.');
return;
}
if (!currentStyle) {
showError('Voice style is not ready. Please wait.');
return;
}
const startTime = Date.now();
try {
generateBtn.disabled = true;
hideError();
// Clear results and show placeholder
resultsContainer.innerHTML = `
<div class="results-placeholder generating">
<div class="results-placeholder-icon"></div>
<p>Generating speech...</p>
</div>
`;
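const totalStep = parseInt(totalStepInput.value, 10);
// Reject empty or non-numeric input before running the denoising loop
if (!Number.isInteger(totalStep) || totalStep < 1) {
throw new Error('Total Steps must be a positive integer.');
}
const textList = [text];
showStatus('<strong>Generating speech from text...</strong>');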
const tic = Date.now();
const { wav, duration } = await textToSpeech.call(
textList,
currentStyle,
totalStep,
(step, total) => {
showStatus(`<strong>Denoising (${step}/${total})...</strong>`);
}
);
const toc = Date.now();
console.log(`Text-to-speech synthesis: ${((toc - tic) / 1000).toFixed(2)}s`);
showStatus('<strong>Creating audio file...</strong>');
const wavLen = Math.floor(textToSpeech.sampleRate * duration[0]);
const wavOut = wav.slice(0, wavLen);
// Create WAV file
const wavBuffer = writeWavFile(wavOut, textToSpeech.sampleRate);
const blob = new Blob([wavBuffer], { type: 'audio/wav' });
const url = URL.createObjectURL(blob);
// Calculate total time and audio duration
const endTime = Date.now();
const totalTimeSec = ((endTime - startTime) / 1000).toFixed(2);
const audioDurationSec = duration[0].toFixed(2);
// Display result with full text
resultsContainer.innerHTML = `
<div class="result-item">
<div class="result-text-container">
<div class="result-text-label">Input Text</div>
<div class="result-text">${text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')}</div>
</div>
<div class="result-info">
<div class="info-item">
<span>📊 Audio Length</span>
<strong>${audioDurationSec}s</strong>
</div>
<div class="info-item">
<span>Generation Time</span>
<strong>${totalTimeSec}s</strong>
</div>
</div>
<div class="result-player">
<audio controls>
<source src="${url}" type="audio/wav">
</audio>
</div>
<div class="result-actions">
<button onclick="downloadAudio('${url}', 'synthesized_speech.wav')">
<span></span>
<span>Download WAV</span>
</button>
</div>
</div>
`;
showStatus('✅ <strong>Speech synthesis completed successfully!</strong>', 'success');
} catch (error) {
console.error('Error during synthesis:', error);
showStatus(`❌ <strong>Error during synthesis:</strong> ${error.message}`, 'error');
showError(`Error during synthesis: ${error.message}`);
// Restore placeholder
resultsContainer.innerHTML = `
<div class="results-placeholder">
<div class="results-placeholder-icon">🎤</div>
<p>Generated speech will appear here</p>
</div>
`;
} finally {
generateBtn.disabled = false;
}
}
// Download handler (make it global so it can be called from onclick)
window.downloadAudio = function(url, filename) {
const a = document.createElement('a');
a.href = url;
a.download = filename;
a.click();
};
// Attach generate function to button
generateBtn.addEventListener('click', generateSpeech);
// Initialize on load
window.addEventListener('load', async () => {
generateBtn.disabled = true;
await initializeModels();
});
+21
@@ -0,0 +1,21 @@
{
"name": "tts-onnx-web",
"version": "1.0.0",
"description": "TTS inference using ONNX Runtime for Web Browser",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"preview": "vite preview"
},
"keywords": ["tts", "onnx", "speech-synthesis", "web"],
"author": "",
"license": "MIT",
"dependencies": {
"fft.js": "^4.0.3",
"onnxruntime-web": "^1.17.0"
},
"devDependencies": {
"vite": "^5.0.0"
}
}
+453
@@ -0,0 +1,453 @@
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
min-height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.container {
background: white;
border-radius: 20px;
padding: 40px;
max-width: 1400px;
width: 100%;
box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
}
.main-content {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 40px;
margin-top: 30px;
align-items: start;
}
.left-panel {
display: flex;
flex-direction: column;
}
.right-panel {
display: flex;
flex-direction: column;
height: 100%;
}
@media (max-width: 1024px) {
.main-content {
grid-template-columns: 1fr;
}
}
h1 {
color: #333;
margin-bottom: 10px;
font-size: 2em;
}
.subtitle {
color: #666;
margin-bottom: 30px;
font-size: 1.1em;
}
.section {
margin-bottom: 25px;
}
label {
display: block;
font-weight: 600;
color: #333;
margin-bottom: 8px;
font-size: 0.95em;
}
input[type="file"],
textarea,
input[type="number"] {
width: 100%;
padding: 12px;
border: 2px solid #e0e0e0;
border-radius: 8px;
font-size: 1em;
transition: border-color 0.3s;
}
input[type="file"]:focus,
textarea:focus,
input[type="number"]:focus {
outline: none;
border-color: #667eea;
}
textarea {
resize: vertical;
min-height: 100px;
font-family: inherit;
}
.params-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 15px;
}
button {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
padding: 15px 30px;
font-size: 1.1em;
font-weight: 600;
border-radius: 8px;
cursor: pointer;
width: 100%;
transition: transform 0.2s, box-shadow 0.2s;
}
button:hover:not(:disabled) {
transform: translateY(-2px);
box-shadow: 0 5px 20px rgba(102, 126, 234, 0.4);
}
button:disabled {
opacity: 0.6;
cursor: not-allowed;
}
.status-box {
background: #e3f2fd;
border-left: 4px solid #2196f3;
padding: 15px;
margin-bottom: 10px;
border-radius: 4px;
font-size: 0.9em;
color: #1565c0;
transition: all 0.3s ease;
display: flex;
justify-content: space-between;
align-items: center;
flex-wrap: wrap;
gap: 15px;
min-height: 50px;
}
.status-box.success {
background: #e8f5e9;
border-left-color: #4caf50;
color: #2e7d32;
}
.status-box.error {
background: #ffebee;
border-left-color: #f44336;
color: #c62828;
}
.status-text-wrapper {
flex: 1;
min-width: 200px;
}
.backend-badge {
display: inline-block;
visibility: hidden;
padding: 6px 12px;
background: #ff9800;
color: white;
border-radius: 12px;
font-size: 0.85em;
font-weight: 600;
margin-left: 10px;
white-space: nowrap;
}
.backend-badge.visible {
visibility: visible;
}
.ref-audio-info {
color: #4caf50;
font-weight: 700;
font-size: 0.95em;
}
.ref-audio-label {
margin-bottom: 8px;
}
.ref-audio-label label {
display: inline;
margin-bottom: 0;
}
.results {
flex: 1;
display: flex;
flex-direction: column;
}
.result-item {
background: white;
border-radius: 16px;
box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
overflow: hidden;
transition: box-shadow 0.3s ease;
display: flex;
flex-direction: column;
flex: 1;
}
.result-item:hover {
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}
.result-item h3 {
color: #667eea;
margin-bottom: 15px;
font-size: 1.2em;
}
.result-text-container {
padding: 20px;
background: linear-gradient(135deg, #f8f9ff 0%, #ffffff 100%);
border-bottom: 1px solid #e8ecf5;
flex: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
.result-text-label {
font-size: 0.75em;
text-transform: uppercase;
letter-spacing: 0.5px;
color: #667eea;
font-weight: 600;
margin-bottom: 8px;
}
.result-text {
color: #333;
line-height: 1.7;
font-size: 0.95em;
overflow-wrap: break-word;
white-space: pre-wrap;
overflow-y: auto;
padding-right: 8px;
flex: 1;
}
.result-text::-webkit-scrollbar {
width: 6px;
}
.result-text::-webkit-scrollbar-track {
background: #f0f0f0;
border-radius: 3px;
}
.result-text::-webkit-scrollbar-thumb {
background: #c0c0c0;
border-radius: 3px;
}
.result-text::-webkit-scrollbar-thumb:hover {
background: #a0a0a0;
}
.result-info {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 0;
background: #fafbff;
}
.info-item {
padding: 16px 20px;
display: flex;
align-items: center;
gap: 8px;
font-size: 0.9em;
color: #666;
border-bottom: 1px solid #e8ecf5;
}
.info-item:nth-child(1) {
border-right: 1px solid #e8ecf5;
}
.info-item strong {
color: #333;
font-size: 1.1em;
font-weight: 600;
margin-left: auto;
}
.result-player {
padding: 20px;
background: white;
}
.result-item audio {
width: 100%;
height: 48px;
outline: none;
}
.result-item audio:focus {
outline: 2px solid #667eea;
outline-offset: 2px;
border-radius: 4px;
}
.result-actions {
padding: 16px 20px 20px;
background: white;
}
.result-item button {
width: 100%;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
padding: 12px 24px;
font-size: 0.95em;
font-weight: 600;
border-radius: 8px;
cursor: pointer;
transition: all 0.3s ease;
display: flex;
align-items: center;
justify-content: center;
gap: 8px;
}
.result-item button:hover {
transform: translateY(-2px);
box-shadow: 0 4px 16px rgba(102, 126, 234, 0.3);
}
.result-item button:active {
transform: translateY(0);
}
@media (max-width: 640px) {
.result-info {
grid-template-columns: 1fr;
}
.info-item:nth-child(1) {
border-right: none;
}
}
audio {
width: 100%;
margin-top: 10px;
}
.error {
background: #fee;
color: #c00;
padding: 15px;
border-radius: 8px;
margin-top: 20px;
display: none;
}
.error.active {
display: block;
}
.warning-box {
background: #fff3cd;
color: #856404;
padding: 12px 15px;
border-radius: 8px;
margin-top: 10px;
border-left: 4px solid #ffc107;
font-size: 0.9em;
display: none;
line-height: 1.5;
}
.warning-box.active {
display: block;
}
.warning-box::before {
content: "⚠️ ";
margin-right: 5px;
}
.results-placeholder {
background: white;
border-radius: 16px;
box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
padding: 60px 40px;
text-align: center;
color: #999;
transition: all 0.3s ease;
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
flex: 1;
min-height: 400px;
}
.results-placeholder:hover {
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}
.results-placeholder-icon {
font-size: 4em;
margin-bottom: 20px;
opacity: 0.6;
animation: float 3s ease-in-out infinite;
}
.results-placeholder.generating .results-placeholder-icon {
animation: spin 2s linear infinite;
}
@keyframes float {
0%, 100% {
transform: translateY(0px);
}
50% {
transform: translateY(-10px);
}
}
@keyframes spin {
0% {
transform: rotate(0deg);
}
100% {
transform: rotate(360deg);
}
}
.results-placeholder p {
font-size: 1.05em;
color: #888;
font-weight: 500;
margin: 0;
}
.hidden {
display: none;
}
+14
@@ -0,0 +1,14 @@
import { defineConfig } from 'vite';
export default defineConfig({
server: {
port: 3000,
open: true
},
build: {
target: 'esnext'
},
optimizeDeps: {
exclude: ['onnxruntime-web']
}
});