mirror of
https://github.com/supertone-inc/supertonic.git
synced 2026-07-03 14:08:32 +02:00
init
@@ -0,0 +1 @@
|
||||
assets/onnx/*.onnx filter=lfs diff=lfs merge=lfs -text
|
||||
@@ -0,0 +1,61 @@
|
||||
assets/*
|
||||
assets/.git
|
||||
assets/.gitignore
|
||||
assets/.gitattributes
|
||||
|
||||
*.onnx
|
||||
onnx
|
||||
|
||||
# Output files
|
||||
results
|
||||
|
||||
# Python
|
||||
__pycache__
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
*.so
|
||||
.Python
|
||||
|
||||
# Virtual environments
|
||||
.venv
|
||||
venv/
|
||||
ENV/
|
||||
env/
|
||||
|
||||
# Node.js
|
||||
node_modules/
|
||||
npm-debug.log*
|
||||
yarn-debug.log*
|
||||
yarn-error.log*
|
||||
package-lock.json
|
||||
|
||||
# Swift
|
||||
.build/
|
||||
.swiftpm/
|
||||
*.xcodeproj
|
||||
*.xcworkspace
|
||||
xcuserdata/
|
||||
DerivedData/
|
||||
|
||||
# Distribution / packaging
|
||||
build/
|
||||
dist/
|
||||
*.egg-info/
|
||||
.eggs/
|
||||
|
||||
# Testing
|
||||
.pytest_cache/
|
||||
.coverage
|
||||
htmlcov/
|
||||
.tox/
|
||||
|
||||
# IDE
|
||||
.vscode/
|
||||
.idea/
|
||||
*.swp
|
||||
*.swo
|
||||
*~
|
||||
|
||||
# OS
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
@@ -0,0 +1,21 @@
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2025 Supertone Inc.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
@@ -0,0 +1,393 @@
|
||||
# Supertonic — Lightning Fast, On-Device TTS
|
||||
|
||||
[Interactive Demo](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
[Hugging Face Hub](https://huggingface.co/Supertone/supertonic)
|
||||
|
||||
<p align="center">
|
||||
<img src="img/Supertonic_IMG_v02_4x.webp" alt="Supertonic Banner">
|
||||
</p>
|
||||
|
||||
**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
|
||||
|
||||
> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic)
|
||||
|
||||
### Table of Contents
|
||||
|
||||
- [Why Supertonic?](#why-supertonic)
|
||||
- [Language Support](#language-support)
|
||||
- [Getting Started](#getting-started)
|
||||
- [Performance](#performance)
|
||||
- [Citation](#citation)
|
||||
- [License](#license)
|
||||
|
||||
## Why Supertonic?
|
||||
|
||||
- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
|
||||
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
|
||||
- **📱 On-Device Capable**: **Complete privacy** and **no network latency**, since all processing happens locally on your device
|
||||
- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
|
||||
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
|
||||
- **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends.
|
||||
|
||||
|
||||
## Language Support
|
||||
|
||||
We provide ready-to-use TTS inference examples across multiple ecosystems:
|
||||
|
||||
| Language/Platform | Path | Description |
|-------------------|------|-------------|
| [**Python**](py/) | `py/` | ONNX Runtime inference |
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
| [**Java**](java/) | `java/` | Cross-platform JVM |
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
| [**Go**](go/) | `go/` | Go implementation |
| [**Swift**](swift/) | `swift/` | macOS applications |
| [**iOS**](ios/) | `ios/` | Native iOS apps |
| [**Rust**](rust/) | `rust/` | Memory-safe systems |
|
||||
|
||||
> For detailed usage instructions, please refer to the README.md in each language directory.
|
||||
|
||||
## Getting Started
|
||||
|
||||
First, clone the repository:

```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
```

### Prerequisites

Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:

> **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
> - macOS: `brew install git-lfs && git lfs install`
> - Generic: see `https://git-lfs.com` for installers

```bash
git clone https://huggingface.co/Supertone/supertonic assets
```
|
||||
|
||||
### Quick Start
|
||||
|
||||
**Python Example** ([Details](py/))
```bash
cd py
uv sync
uv run example_onnx.py
```

**Node.js Example** ([Details](nodejs/))
```bash
cd nodejs
npm install
npm start
```

**Browser Example** ([Details](web/))
```bash
cd web
npm install
npm run dev
```

**Java Example** ([Details](java/))
```bash
cd java
mvn clean install
mvn exec:java
```

**C++ Example** ([Details](cpp/))
```bash
cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```

**C# Example** ([Details](csharp/))
```bash
cd csharp
dotnet restore
dotnet run
```

**Go Example** ([Details](go/))
```bash
cd go
go mod download
go run example_onnx.go helper.go
```

**Swift Example** ([Details](swift/))
```bash
cd swift
swift build -c release
.build/release/example_onnx
```

**Rust Example** ([Details](rust/))
```bash
cd rust
cargo build --release
./target/release/example_onnx
```

**iOS Example** ([Details](ios/))
```bash
cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
```
- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
- Choose your iPhone as run destination → Build & Run
|
||||
|
||||
|
||||
### Technical Details
|
||||
|
||||
- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
|
||||
- **Browser Support**: onnxruntime-web for client-side inference
|
||||
- **Batch Processing**: Supports batch inference for improved throughput
|
||||
- **Audio Output**: Outputs 16-bit WAV files
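
To make the 16-bit output format concrete, here is a minimal sketch of one common way to turn floating-point samples into signed 16-bit PCM, which is the payload that follows a WAV header. The names `floatToPcm16` and `toPcm16` are hypothetical, and the assumption that samples lie in [-1.0, 1.0] is ours; the repository's own `writeWavFile` helper may do this differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not taken from the repository): map one float sample,
// assumed to be in [-1.0, 1.0], to a signed 16-bit PCM value.
// Out-of-range input is clamped rather than allowed to wrap around.
inline std::int16_t floatToPcm16(float sample) {
    const float clamped = std::clamp(sample, -1.0f, 1.0f);
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Convert a whole buffer of float samples to 16-bit PCM.
inline std::vector<std::int16_t> toPcm16(const std::vector<float>& samples) {
    std::vector<std::int16_t> pcm;
    pcm.reserve(samples.size());
    for (float s : samples) {
        pcm.push_back(floatToPcm16(s));
    }
    return pcm;
}
```

Scaling by 32767 rather than 32768 keeps the mapping symmetric around zero, so -1.0 becomes -32767 and the value -32768 is never produced.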
|
||||
|
||||
## Performance
|
||||
|
||||
We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
|
||||
|
||||
**Metrics:**
|
||||
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
|
||||
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
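
The two metrics above can be written as short functions. This is only an illustrative sketch: the function names are hypothetical, and it is not the benchmarking code used to produce the tables.

```cpp
#include <cstddef>

// Characters per second: number of input characters divided by the
// wall-clock time (in seconds) needed to generate the audio.
inline double charsPerSecond(std::size_t num_chars, double synthesis_seconds) {
    return static_cast<double>(num_chars) / synthesis_seconds;
}

// Real-time factor: synthesis time divided by the duration of the
// generated audio, both in seconds. Values below 1.0 are faster than real time.
inline double realTimeFactor(double synthesis_seconds, double audio_seconds) {
    return synthesis_seconds / audio_seconds;
}
```

The speed-up over real time is the reciprocal of the RTF, so an RTF of 0.006 corresponds to roughly 167× real time (1 / 0.006 ≈ 167).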
|
||||
|
||||
### Characters per Second
|
||||
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |
|
||||
|
||||
> **Notes:**
> - `API` = Cloud-based API services (measured from Seoul)
> - `Open` = Open-source models
> - Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
> - Supertonic (RTX4090): Tested with PyTorch model
> - Kokoro: Tested on M4 Pro CPU with ONNX
> - NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
|
||||
|
||||
### Real-time Factor
|
||||
|
||||
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
|
||||
|
||||
<details>
|
||||
<summary><b>Additional Performance Data (5-step inference)</b></summary>
|
||||
|
||||
<br>
|
||||
|
||||
**Characters per Second (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |

**Real-time Factor (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |
|
||||
|
||||
</details>
|
||||
|
||||
### Natural Text Handling
|
||||
|
||||
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
|
||||
|
||||
> 🎧 **View audio samples more easily**: Check out our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#text-handling) for a better viewing experience of all audio examples
|
||||
|
||||
**Overview of Test Cases:**
|
||||
|
||||
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini |
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ |
|
||||
|
||||
<details>
|
||||
<summary><b>Example 1: Financial Expression</b></summary>
|
||||
|
||||
<br>
|
||||
|
||||
**Text:**
|
||||
> "The startup secured **$5.2M** in venture capital, a huge leap from their initial **$450K** seed round."
|
||||
|
||||
**Challenges:**
|
||||
- Decimal point in currency ($5.2M should be read as "five point two million")
|
||||
- Abbreviated magnitude units (M for million, K for thousand)
|
||||
- Currency symbol ($) that needs to be properly pronounced as "dollars"
|
||||
|
||||
**Audio Samples:**
|
||||
|
||||
| System | Result | Audio Sample |
|
||||
|--------|--------|--------------|
|
||||
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1eancUOhiSXCVoTu9ddh4S-OcVQaWrPV-/view?usp=sharing) |
|
||||
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1-r2scv7XQ1crIDu6QOh3eqVl445W6ap_/view?usp=sharing) |
|
||||
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1MFDXMjfmsAVOqwPx7iveS0KUJtZvcwxB/view?usp=sharing) |
|
||||
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1dEHpNzfMUucFTJPQK0k4RcFZvPwQTt09/view?usp=sharing) |
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Example 2: Time and Date</b></summary>
|
||||
|
||||
<br>
|
||||
|
||||
**Text:**
|
||||
> "The train delay was announced at **4:45 PM** on **Wed, Apr 3, 2024** due to track maintenance."
|
||||
|
||||
**Challenges:**
|
||||
- Time expression with PM notation (4:45 PM)
|
||||
- Abbreviated weekday (Wed)
|
||||
- Abbreviated month (Apr)
|
||||
- Full date format (Apr 3, 2024)
|
||||
|
||||
**Audio Samples:**
|
||||
|
||||
| System | Result | Audio Sample |
|
||||
|--------|--------|--------------|
|
||||
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1ehkZU8eiizBenG2DgR5tzBGQBvHS0Uaj/view?usp=sharing) |
|
||||
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ta3r6jFyebmA-sT44l8EaEQcMLVmuOEr/view?usp=sharing) |
|
||||
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1sskmem9AzHAQ3Hv8DRSZoqX_pye-CXuU/view?usp=sharing) |
|
||||
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1zx9X8oMsLMXW0Zx_SURoqjju-By2yh_n/view?usp=sharing) |
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Example 3: Phone Number</b></summary>
|
||||
|
||||
<br>
|
||||
|
||||
**Text:**
|
||||
> "You can reach the hotel front desk at **(212) 555-0142 ext. 402** anytime."
|
||||
|
||||
**Challenges:**
|
||||
- Area code in parentheses that should be read as separate digits
|
||||
- Phone number with hyphen separator (555-0142)
|
||||
- Abbreviated extension notation (ext.)
|
||||
- Extension number (402)
|
||||
|
||||
**Audio Samples:**
|
||||
|
||||
| System | Result | Audio Sample |
|
||||
|--------|--------|--------------|
|
||||
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1z-e5iTsihryMR8ll1-N1YXkB2CIJYJ6F/view?usp=sharing) |
|
||||
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1HAzVXFTZfZm0VEK2laSpsMTxzufcuaxA/view?usp=sharing) |
|
||||
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/15tjfAmb3GbjP_kmvD7zSdIWkhtAaCPOg/view?usp=sharing) |
|
||||
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1BCL8n7yligUZyso970ud7Gf5NWb1OhKD/view?usp=sharing) |
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Example 4: Technical Unit</b></summary>
|
||||
|
||||
<br>
|
||||
|
||||
**Text:**
|
||||
> "Our drone battery lasts **2.3h** when flying at **30kph** with full camera payload."
|
||||
|
||||
**Challenges:**
|
||||
- Decimal time duration with abbreviation (2.3h = two point three hours)
|
||||
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
|
||||
- Technical abbreviations (h for hours, kph for kilometers per hour)
|
||||
- Technical/engineering context requiring proper pronunciation
|
||||
|
||||
**Audio Samples:**
|
||||
|
||||
| System | Result | Audio Sample |
|
||||
|--------|--------|--------------|
|
||||
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1kvOBvswFkLfmr8hGplH0V2XiMxy1shYf/view?usp=sharing) |
|
||||
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1_SzfjWJe5YEd0t3R7DztkYhHcI_av48p/view?usp=sharing) |
|
||||
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1P5BSilj5xFPTV2Xz6yW5jitKZohO9o-6/view?usp=sharing) |
|
||||
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1GU82SnWC50OvC8CZNjhxvNZFKQb7I9_Y/view?usp=sharing) |
|
||||
|
||||
</details>
|
||||
|
||||
> **Note:** These samples demonstrate how each system handles text normalization and pronunciation of complex expressions **without requiring pre-processing or phonetic annotations**.
|
||||
|
||||
## Citation
|
||||
|
||||
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
|
||||
|
||||
### SupertonicTTS: Main Architecture
|
||||
|
||||
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
|
||||
|
||||
```bibtex
|
||||
@article{kim2025supertonic,
|
||||
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
|
||||
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
|
||||
journal={arXiv preprint arXiv:2503.23108},
|
||||
year={2025},
|
||||
url={https://arxiv.org/abs/2503.23108}
|
||||
}
|
||||
```
|
||||
|
||||
### Length-Aware RoPE: Text-Speech Alignment
|
||||
|
||||
This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
|
||||
|
||||
```bibtex
|
||||
@article{kim2025larope,
|
||||
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
|
||||
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
|
||||
journal={arXiv preprint arXiv:2509.11084},
|
||||
year={2025},
|
||||
url={https://arxiv.org/abs/2509.11084}
|
||||
}
|
||||
```
|
||||
|
||||
### Self-Purifying Flow Matching: Training with Noisy Labels
|
||||
|
||||
This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
|
||||
|
||||
```bibtex
|
||||
@article{kim2025spfm,
|
||||
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
|
||||
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
|
||||
journal={arXiv preprint arXiv:2509.19091},
|
||||
year={2025},
|
||||
url={https://arxiv.org/abs/2509.19091}
|
||||
}
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
This project’s sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.

The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic/blob/main/LICENSE) file for details.

This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
|
||||
|
||||
Copyright (c) 2025 Supertone Inc.
|
||||
|
||||
@@ -0,0 +1,122 @@
|
||||
cmake_minimum_required(VERSION 3.15)
|
||||
project(Supertonic_CPP)
|
||||
|
||||
set(CMAKE_CXX_STANDARD 17)
|
||||
set(CMAKE_CXX_STANDARD_REQUIRED ON)
|
||||
|
||||
# Enable aggressive optimization
|
||||
if(NOT CMAKE_BUILD_TYPE)
|
||||
set(CMAKE_BUILD_TYPE Release)
|
||||
endif()
|
||||
|
||||
# Add optimization flags
|
||||
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
|
||||
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
|
||||
|
||||
# Find required packages
|
||||
find_package(PkgConfig REQUIRED)
|
||||
find_package(OpenMP)
|
||||
|
||||
# ONNX Runtime - Try multiple methods
|
||||
# Method 1: Try to find via CMake config
|
||||
find_package(onnxruntime QUIET CONFIG)
|
||||
|
||||
if(NOT onnxruntime_FOUND)
|
||||
# Method 2: Try pkg-config
|
||||
pkg_check_modules(ONNXRUNTIME QUIET libonnxruntime)
|
||||
|
||||
if(ONNXRUNTIME_FOUND)
|
||||
set(ONNXRUNTIME_INCLUDE_DIR ${ONNXRUNTIME_INCLUDE_DIRS})
|
||||
set(ONNXRUNTIME_LIB ${ONNXRUNTIME_LIBRARIES})
|
||||
else()
|
||||
# Method 3: Manual search in common locations
|
||||
find_path(ONNXRUNTIME_INCLUDE_DIR
|
||||
NAMES onnxruntime_cxx_api.h
|
||||
PATHS
|
||||
/usr/local/include
|
||||
/opt/homebrew/include
|
||||
/usr/include
|
||||
${CMAKE_PREFIX_PATH}/include
|
||||
PATH_SUFFIXES onnxruntime
|
||||
)
|
||||
|
||||
find_library(ONNXRUNTIME_LIB
|
||||
NAMES onnxruntime libonnxruntime
|
||||
PATHS
|
||||
/usr/local/lib
|
||||
/opt/homebrew/lib
|
||||
/usr/lib
|
||||
${CMAKE_PREFIX_PATH}/lib
|
||||
)
|
||||
endif()
|
||||
|
||||
if(NOT ONNXRUNTIME_INCLUDE_DIR OR NOT ONNXRUNTIME_LIB)
|
||||
message(FATAL_ERROR "ONNX Runtime not found. Please install it:\n"
|
||||
" macOS: brew install onnxruntime\n"
|
||||
" Ubuntu: See README.md for installation instructions")
|
||||
endif()
|
||||
|
||||
message(STATUS "Found ONNX Runtime:")
|
||||
message(STATUS " Include: ${ONNXRUNTIME_INCLUDE_DIR}")
|
||||
message(STATUS " Library: ${ONNXRUNTIME_LIB}")
|
||||
endif()
|
||||
|
||||
# nlohmann/json
|
||||
find_package(nlohmann_json REQUIRED)
|
||||
|
||||
# Include directories
|
||||
if(NOT onnxruntime_FOUND)
|
||||
include_directories(${ONNXRUNTIME_INCLUDE_DIR})
|
||||
endif()
|
||||
|
||||
# Helper library
|
||||
add_library(tts_helper STATIC
|
||||
helper.cpp
|
||||
helper.h
|
||||
)
|
||||
|
||||
if(onnxruntime_FOUND)
|
||||
target_link_libraries(tts_helper
|
||||
onnxruntime::onnxruntime
|
||||
nlohmann_json::nlohmann_json
|
||||
)
|
||||
else()
|
||||
target_include_directories(tts_helper PUBLIC ${ONNXRUNTIME_INCLUDE_DIR})
|
||||
target_link_libraries(tts_helper
|
||||
${ONNXRUNTIME_LIB}
|
||||
nlohmann_json::nlohmann_json
|
||||
)
|
||||
endif()
|
||||
|
||||
# Enable OpenMP if available
|
||||
if(OpenMP_CXX_FOUND)
|
||||
target_link_libraries(tts_helper OpenMP::OpenMP_CXX)
|
||||
message(STATUS "OpenMP enabled for parallel processing")
|
||||
else()
|
||||
message(WARNING "OpenMP not found - parallel processing will be disabled")
|
||||
endif()
|
||||
|
||||
# Example executable
|
||||
add_executable(example_onnx
|
||||
example_onnx.cpp
|
||||
)
|
||||
|
||||
if(onnxruntime_FOUND)
|
||||
target_link_libraries(example_onnx
|
||||
tts_helper
|
||||
onnxruntime::onnxruntime
|
||||
nlohmann_json::nlohmann_json
|
||||
)
|
||||
else()
|
||||
target_link_libraries(example_onnx
|
||||
tts_helper
|
||||
${ONNXRUNTIME_LIB}
|
||||
nlohmann_json::nlohmann_json
|
||||
)
|
||||
endif()
|
||||
|
||||
# Installation
|
||||
install(TARGETS example_onnx DESTINATION bin)
|
||||
install(TARGETS tts_helper DESTINATION lib)
|
||||
install(FILES helper.h DESTINATION include)
|
||||
|
||||
@@ -0,0 +1,101 @@
|
||||
# Supertonic C++ Implementation
|
||||
|
||||
High-performance text-to-speech inference using ONNX Runtime.
|
||||
|
||||
## Requirements
|
||||
|
||||
- C++17 compiler, CMake 3.15+
|
||||
- Libraries: ONNX Runtime, nlohmann/json
|
||||
|
||||
## Installation
|
||||
|
||||
**Ubuntu/Debian:**
|
||||
> ⚠️ **Note:** Installation instructions not yet verified.
|
||||
|
||||
```bash
|
||||
sudo apt-get install -y cmake g++ nlohmann-json3-dev
|
||||
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-1.16.3.tgz
|
||||
tar -xzf onnxruntime-linux-x64-1.16.3.tgz
|
||||
sudo cp -r onnxruntime-linux-x64-1.16.3/include/* /usr/local/include/
|
||||
sudo cp -r onnxruntime-linux-x64-1.16.3/lib/* /usr/local/lib/
|
||||
sudo ldconfig
|
||||
```
|
||||
|
||||
**macOS:**
|
||||
```bash
|
||||
brew install cmake nlohmann-json onnxruntime
|
||||
```
|
||||
|
||||
**Windows (vcpkg):**
|
||||
> ⚠️ **Note:** Installation instructions not yet verified.
|
||||
|
||||
```powershell
|
||||
vcpkg install nlohmann-json:x64-windows onnxruntime:x64-windows
|
||||
vcpkg integrate install
|
||||
```
|
||||
|
||||
## Building
|
||||
|
||||
```bash
|
||||
cd cpp && mkdir build && cd build
|
||||
cmake .. && cmake --build . --config Release
|
||||
./example_onnx
|
||||
```
|
||||
|
||||
## Basic Usage
|
||||
|
||||
### Example 1: Default Inference
|
||||
Run inference with default settings:
|
||||
```bash
|
||||
./example_onnx
|
||||
```
|
||||
|
||||
This will use:
|
||||
- Voice style: `../assets/voice_styles/M1.json`
|
||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||
- Output directory: `results/`
|
||||
- Total steps: 5
|
||||
- Number of generations: 4
|
||||
|
||||
### Example 2: Batch Inference
|
||||
Process multiple voice styles and texts at once:
|
||||
```bash
|
||||
./example_onnx \
|
||||
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
|
||||
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
|
||||
```
|
||||
|
||||
This will:
|
||||
- Generate speech for 2 different voice-text pairs
|
||||
- Use male voice style (M1.json) for the first text
|
||||
- Use female voice style (F1.json) for the second text
|
||||
- Process both samples in a single batch
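
In `example_onnx.cpp`, the batched result comes back as one flat waveform buffer plus a per-sample duration, and each sample is cut out of its own row before being saved. The sketch below illustrates that slicing step under stated assumptions: the function name `splitBatch` is hypothetical, the buffer is assumed to be row-major with equal-length rows, and durations are assumed to be non-negative seconds.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: split a flat, row-major buffer of shape [batch, row_len] into one
// vector per sample, trimming each row to that sample's own duration.
inline std::vector<std::vector<float>> splitBatch(
        const std::vector<float>& wav,
        const std::vector<float>& durations_sec,
        int sample_rate) {
    std::vector<std::vector<float>> out;
    const std::size_t batch = durations_sec.size();
    if (batch == 0) {
        return out;
    }
    const std::size_t row_len = wav.size() / batch;
    out.reserve(batch);
    for (std::size_t b = 0; b < batch; ++b) {
        // Number of samples this item actually needs, clamped to the row
        // length so the slice can never read into the next row or past the end.
        const std::size_t want =
            static_cast<std::size_t>(sample_rate * durations_sec[b]);
        const std::size_t len = std::min(want, row_len);
        const auto first = wav.begin() + static_cast<std::ptrdiff_t>(b * row_len);
        out.emplace_back(first, first + static_cast<std::ptrdiff_t>(len));
    }
    return out;
}
```

Shorter samples in a batch are padded up to the longest row, which is why each slice is trimmed by duration instead of taking the whole row.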
|
||||
|
||||
### Example 3: High Quality Inference
|
||||
Increase denoising steps for better quality:
|
||||
```bash
|
||||
./example_onnx \
|
||||
--total-step 10 \
|
||||
--voice-style ../assets/voice_styles/M1.json \
|
||||
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
|
||||
```
|
||||
|
||||
This will:
|
||||
- Use 10 denoising steps instead of the default 5
|
||||
- Produce higher quality output at the cost of slower inference
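
As a rough intuition for why more steps can help, consider a toy fixed-step Euler integrator: with more steps over the same interval, the numerical result gets closer to the exact solution, at proportionally higher cost. This is a generic illustration only. It assumes nothing about how Supertonic's sampler is implemented, and `eulerIntegrate` and `eulerError` are hypothetical names.

```cpp
#include <cmath>
#include <functional>

// Toy fixed-step Euler integrator for dx/dt = v(x, t) on t in [0, 1].
// Illustrative only: this is NOT Supertonic's sampler.
inline double eulerIntegrate(const std::function<double(double, double)>& v,
                             double x0, int total_step) {
    const double dt = 1.0 / total_step;
    double x = x0;
    for (int i = 0; i < total_step; ++i) {
        const double t = i * dt;
        x += dt * v(x, t);  // one Euler update
    }
    return x;
}

// Absolute error for dx/dt = x with x(0) = 1, whose exact value at t = 1 is e.
inline double eulerError(int total_step) {
    const double approx =
        eulerIntegrate([](double x, double) { return x; }, 1.0, total_step);
    return std::abs(approx - std::exp(1.0));
}
```

In this toy problem the error shrinks as `total_step` grows from 2 to 5 to 10, while the number of function evaluations grows linearly with the step count.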
|
||||
|
||||
## Available Arguments
|
||||
|
||||
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
| `--save-dir` | str | `results` | Output directory |
|
||||
|
||||
## Notes
|
||||
|
||||
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
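
The batch rule above can be sketched as follows: `--voice-style` is split on commas, `--text` is split on pipes, and the i-th style is paired with the i-th text. The names `splitOn` and `pairBatch` are hypothetical, and throwing an exception on a mismatch is our choice for the sketch; the example program reports the error and exits instead.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Split `str` on a single-character delimiter, keeping empty fields.
inline std::vector<std::string> splitOn(const std::string& str, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    std::size_t pos = 0;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        parts.push_back(str.substr(start, pos - start));
        start = pos + 1;
    }
    parts.push_back(str.substr(start));
    return parts;
}

// Pair the i-th voice style with the i-th text; throw when the counts differ.
inline std::vector<std::pair<std::string, std::string>> pairBatch(
        const std::string& voice_styles, const std::string& texts) {
    const auto styles = splitOn(voice_styles, ',');
    const auto lines = splitOn(texts, '|');
    if (styles.size() != lines.size()) {
        throw std::invalid_argument(
            "number of voice styles must match number of texts");
    }
    std::vector<std::pair<std::string, std::string>> pairs;
    pairs.reserve(styles.size());
    for (std::size_t i = 0; i < styles.size(); ++i) {
        pairs.emplace_back(styles[i], lines[i]);
    }
    return pairs;
}
```

Because `|` is the separator, a text passed this way cannot itself contain a literal pipe character.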
|
||||
Symlink
+1
@@ -0,0 +1 @@
|
||||
../assets
|
||||
@@ -0,0 +1,109 @@
|
||||
#include "helper.h"
|
||||
#include <iostream>
|
||||
#include <filesystem>
|
||||
#include <algorithm>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
namespace fs = std::filesystem;
|
||||
|
||||
struct Args {
|
||||
std::string onnx_dir = "../assets/onnx";
|
||||
int total_step = 5;
|
||||
int n_test = 4;
|
||||
std::vector<std::string> voice_style = {"../assets/voice_styles/M1.json"};
|
||||
std::vector<std::string> text = {
|
||||
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||
};
|
||||
std::string save_dir = "results";
|
||||
};
|
||||
|
||||
auto splitString = [](const std::string& str, char delim) {
|
||||
std::vector<std::string> result;
|
||||
size_t start = 0, pos;
|
||||
while ((pos = str.find(delim, start)) != std::string::npos) {
|
||||
result.push_back(str.substr(start, pos - start));
|
||||
start = pos + 1;
|
||||
}
|
||||
result.push_back(str.substr(start));
|
||||
return result;
|
||||
};
|
||||
|
||||
Args parseArgs(int argc, char* argv[]) {
|
||||
Args args;
|
||||
for (int i = 1; i < argc; i++) {
|
||||
std::string arg = argv[i];
|
||||
if (arg == "--onnx-dir" && i + 1 < argc) args.onnx_dir = argv[++i];
|
||||
else if (arg == "--total-step" && i + 1 < argc) args.total_step = std::stoi(argv[++i]);
|
||||
else if (arg == "--n-test" && i + 1 < argc) args.n_test = std::stoi(argv[++i]);
|
||||
else if (arg == "--voice-style" && i + 1 < argc) args.voice_style = splitString(argv[++i], ',');
|
||||
else if (arg == "--text" && i + 1 < argc) args.text = splitString(argv[++i], '|');
|
||||
else if (arg == "--save-dir" && i + 1 < argc) args.save_dir = argv[++i];
|
||||
}
|
||||
return args;
|
||||
}
|
||||
|
||||
int main(int argc, char* argv[]) {
|
||||
std::cout << "=== TTS Inference with ONNX Runtime (C++) ===\n\n";
|
||||
|
||||
// --- 1. Parse arguments --- //
|
||||
Args args = parseArgs(argc, argv);
|
||||
int total_step = args.total_step;
|
||||
int n_test = args.n_test;
|
||||
std::string save_dir = args.save_dir;
|
||||
std::vector<std::string> voice_style_paths = args.voice_style;
|
||||
std::vector<std::string> text_list = args.text;
|
||||
|
||||
if (voice_style_paths.size() != text_list.size()) {
|
||||
std::cerr << "Error: Number of voice styles (" << voice_style_paths.size()
|
||||
<< ") must match number of texts (" << text_list.size() << ")\n";
|
||||
return 1;
|
||||
}
|
||||
|
||||
int bsz = voice_style_paths.size();
|
||||
|
||||
// --- 2. Load Text to Speech --- //
|
||||
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "TTS");
|
||||
Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(
|
||||
OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault
|
||||
);
|
||||
|
||||
auto text_to_speech = loadTextToSpeech(env, args.onnx_dir, false);
|
||||
std::cout << std::endl;
|
||||
|
||||
// --- 3. Load Voice Style --- //
|
||||
auto style = loadVoiceStyle(voice_style_paths, true);
|
||||
|
||||
// --- 4. Synthesize speech --- //
|
||||
fs::create_directories(save_dir);
|
||||
|
||||
for (int n = 0; n < n_test; n++) {
|
||||
std::cout << "\n[" << (n + 1) << "/" << n_test << "] Starting synthesis...\n";
|
||||
|
||||
auto result = timer("Generating speech from text", [&]() {
|
||||
return text_to_speech->call(memory_info, text_list, style, total_step);
|
||||
});
|
||||
|
||||
int sample_rate = text_to_speech->getSampleRate();
|
||||
int wav_shape_1 = result.wav.size() / bsz;
|
||||
|
||||
for (int b = 0; b < bsz; b++) {
|
||||
std::string fname = sanitizeFilename(text_list[b], 20) + "_" + std::to_string(n + 1) + ".wav";
|
||||
int wav_len = static_cast<int>(sample_rate * result.duration[b]);
if (wav_len > wav_shape_1) wav_len = wav_shape_1;  // never read past this item's samples
|
||||
|
||||
std::vector<float> wav_out(
|
||||
result.wav.begin() + b * wav_shape_1,
|
||||
result.wav.begin() + b * wav_shape_1 + wav_len
|
||||
);
|
||||
|
||||
std::string output_path = save_dir + "/" + fname;
|
||||
writeWavFile(output_path, wav_out, sample_rate);
|
||||
std::cout << "Saved: " << output_path << "\n";
|
||||
}
|
||||
|
||||
clearTensorBuffers();
|
||||
}
|
||||
|
||||
std::cout << "\n=== Synthesis completed successfully! ===\n";
|
||||
return 0;
|
||||
}
|
||||
+714
@@ -0,0 +1,714 @@
|
||||
#include "helper.h"
|
||||
#include <fstream>
|
||||
#include <iostream>
|
||||
#include <cmath>
|
||||
#include <algorithm>
|
||||
#include <random>
|
||||
#include <sstream>
#include <stdexcept>  // std::runtime_error
#include <cctype>     // std::isalnum
|
||||
#include <nlohmann/json.hpp>
|
||||
|
||||
using json = nlohmann::json;
|
||||
|
||||
// Global tensor buffers for memory management
|
||||
static std::vector<std::vector<float>> g_tensor_buffers_float;
|
||||
static std::vector<std::vector<int64_t>> g_tensor_buffers_int64;
|
||||
|
||||
void clearTensorBuffers() {
|
||||
g_tensor_buffers_float.clear();
|
||||
g_tensor_buffers_int64.clear();
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// UnicodeProcessor implementation
|
||||
// ============================================================================
|
||||
|
||||
UnicodeProcessor::UnicodeProcessor(const std::string& unicode_indexer_json_path) {
|
||||
indexer_ = loadJsonInt64(unicode_indexer_json_path);
|
||||
}
|
||||
|
||||
std::string UnicodeProcessor::preprocessText(const std::string& text) {
|
||||
// No normalization is performed here: the C++ standard library has no
// built-in Unicode normalization (e.g. NFKD), so the text is returned as-is.
// TODO: add proper Unicode normalization
|
||||
return text;
|
||||
}
|
||||
|
||||
std::vector<uint16_t> UnicodeProcessor::textToUnicodeValues(const std::string& text) {
|
||||
std::vector<uint16_t> unicode_values;
|
||||
// Each byte is treated as one value, so this equals the code point only for ASCII input
for (char c : text) {
|
||||
unicode_values.push_back(static_cast<uint16_t>(static_cast<unsigned char>(c)));
|
||||
}
|
||||
return unicode_values;
|
||||
}
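// Note: textToUnicodeValues() above maps each byte of the string to one value,
// which matches Unicode code points only for ASCII input. The following is a
// minimal sketch of decoding UTF-8 into code points, assuming the indexer is
// keyed by code point. decodeUtf8 is a hypothetical helper and not part of
// this repository; it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Invalid or truncated sequences are replaced with U+FFFD.
std::vector<uint32_t> decodeUtf8(const std::string& text) {
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        if (i + extra >= text.size()) { out.push_back(0xFFFD); break; }  // truncated
        bool valid = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(text[i + k]);
            if ((cont & 0xC0) != 0x80) { valid = false; break; }
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (!valid) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

// If such a decoder were adopted, the text lengths used for padding and
// masking would have to count code points rather than bytes.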
|
||||
|
||||
std::vector<std::vector<std::vector<float>>> UnicodeProcessor::getTextMask(
|
||||
const std::vector<int64_t>& text_ids_lengths
|
||||
) {
|
||||
return lengthToMask(text_ids_lengths);
|
||||
}
|
||||
|
||||
void UnicodeProcessor::call(
|
||||
const std::vector<std::string>& text_list,
|
||||
std::vector<std::vector<int64_t>>& text_ids,
|
||||
std::vector<std::vector<std::vector<float>>>& text_mask
|
||||
) {
|
||||
std::vector<std::string> processed_texts;
|
||||
for (const auto& text : text_list) {
|
||||
processed_texts.push_back(preprocessText(text));
|
||||
}
|
||||
|
||||
std::vector<int64_t> text_ids_lengths;
|
||||
for (const auto& text : processed_texts) {
|
||||
text_ids_lengths.push_back(static_cast<int64_t>(text.length()));
|
||||
}
|
||||
|
||||
int64_t max_len = *std::max_element(text_ids_lengths.begin(), text_ids_lengths.end());
|
||||
|
||||
text_ids.resize(text_list.size());
|
||||
for (size_t i = 0; i < processed_texts.size(); i++) {
|
||||
text_ids[i].resize(max_len, 0);
|
||||
auto unicode_vals = textToUnicodeValues(processed_texts[i]);
|
||||
for (size_t j = 0; j < unicode_vals.size(); j++) {
|
||||
if (unicode_vals[j] < indexer_.size()) {
|
||||
text_ids[i][j] = indexer_[unicode_vals[j]];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
text_mask = getTextMask(text_ids_lengths);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Style implementation
|
||||
// ============================================================================
|
||||
|
||||
Style::Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
|
||||
const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape)
|
||||
: ttl_data_(ttl_data), ttl_shape_(ttl_shape), dp_data_(dp_data), dp_shape_(dp_shape) {}
|
||||
|
||||
// ============================================================================
|
||||
// TextToSpeech implementation
|
||||
// ============================================================================
|
||||
|
||||
TextToSpeech::TextToSpeech(
|
||||
const Config& cfgs,
|
||||
UnicodeProcessor* text_processor,
|
||||
Ort::Session* dp_ort,
|
||||
Ort::Session* text_enc_ort,
|
||||
Ort::Session* vector_est_ort,
|
||||
Ort::Session* vocoder_ort
|
||||
) : cfgs_(cfgs),
|
||||
text_processor_(text_processor),
|
||||
dp_ort_(dp_ort),
|
||||
text_enc_ort_(text_enc_ort),
|
||||
vector_est_ort_(vector_est_ort),
|
||||
vocoder_ort_(vocoder_ort) {
|
||||
|
||||
sample_rate_ = cfgs.ae.sample_rate;
|
||||
base_chunk_size_ = cfgs.ae.base_chunk_size;
|
||||
chunk_compress_factor_ = cfgs.ttl.chunk_compress_factor;
|
||||
ldim_ = cfgs.ttl.latent_dim;
|
||||
}
|
||||
|
||||
void TextToSpeech::sampleNoisyLatent(
|
||||
const std::vector<float>& duration,
|
||||
std::vector<std::vector<std::vector<float>>>& noisy_latent,
|
||||
std::vector<std::vector<std::vector<float>>>& latent_mask
|
||||
) {
|
||||
int bsz = duration.size();
|
||||
float wav_len_max = *std::max_element(duration.begin(), duration.end()) * sample_rate_;
|
||||
|
||||
std::vector<int64_t> wav_lengths;
|
||||
for (float d : duration) {
|
||||
wav_lengths.push_back(static_cast<int64_t>(d * sample_rate_));
|
||||
}
|
||||
|
||||
int chunk_size = base_chunk_size_ * chunk_compress_factor_;
|
||||
int latent_len = static_cast<int>((wav_len_max + chunk_size - 1) / chunk_size);
|
||||
int latent_dim = ldim_ * chunk_compress_factor_;
|
||||
|
||||
// Generate random noise with normal distribution
|
||||
std::random_device rd;
|
||||
std::mt19937 gen(rd());
|
||||
std::normal_distribution<float> dist(0.0f, 1.0f);
|
||||
|
||||
noisy_latent.resize(bsz);
|
||||
for (int b = 0; b < bsz; b++) {
|
||||
noisy_latent[b].resize(latent_dim);
|
||||
for (int d = 0; d < latent_dim; d++) {
|
||||
noisy_latent[b][d].resize(latent_len);
|
||||
for (int t = 0; t < latent_len; t++) {
|
||||
noisy_latent[b][d][t] = dist(gen);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
latent_mask = getLatentMask(wav_lengths, base_chunk_size_, chunk_compress_factor_);
|
||||
|
||||
// Apply mask
|
||||
for (int b = 0; b < bsz; b++) {
|
||||
for (int d = 0; d < latent_dim; d++) {
|
||||
for (size_t t = 0; t < noisy_latent[b][d].size(); t++) {
|
||||
noisy_latent[b][d][t] *= latent_mask[b][0][t];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
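// Note: sampleNoisyLatent() above derives the latent length from the longest
// duration by a ceiling division over base_chunk_size * chunk_compress_factor.
// A minimal sketch of that arithmetic in integer form follows. latentLength is
// a hypothetical helper, and the numbers in the usage note are illustrative
// assumptions, not values read from tts.json.

```cpp
#include <cstdint>

// Hypothetical helper: number of latent frames needed to cover one clip.
int64_t latentLength(float duration_sec, int sample_rate,
                     int base_chunk_size, int chunk_compress_factor) {
    // Truncate to whole samples, as the wav_lengths computation does.
    const int64_t wav_len = static_cast<int64_t>(duration_sec * sample_rate);
    const int64_t chunk_size =
        static_cast<int64_t>(base_chunk_size) * chunk_compress_factor;
    return (wav_len + chunk_size - 1) / chunk_size;  // integer ceiling division
}
```

// Usage: with an assumed 44100 Hz sample rate, base chunk size 512 and
// compress factor 6, a 1.0 s clip is 44100 samples over chunks of 3072 samples.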
|
||||
|
||||
TextToSpeech::SynthesisResult TextToSpeech::call(
|
||||
Ort::MemoryInfo& memory_info,
|
||||
const std::vector<std::string>& text_list,
|
||||
const Style& style,
|
||||
int total_step
|
||||
) {
|
||||
int bsz = text_list.size();
if (bsz == 0) {
    throw std::runtime_error("text_list must not be empty");
}
|
||||
|
||||
if (bsz != style.getTtlShape()[0]) {
|
||||
throw std::runtime_error("Number of texts must match number of style vectors");
|
||||
}
|
||||
|
||||
// Process text
|
||||
std::vector<std::vector<int64_t>> text_ids;
|
||||
std::vector<std::vector<std::vector<float>>> text_mask;
|
||||
text_processor_->call(text_list, text_ids, text_mask);
|
||||
|
||||
std::vector<int64_t> text_ids_shape = {bsz, static_cast<int64_t>(text_ids[0].size())};
|
||||
std::vector<int64_t> text_mask_shape = {bsz, 1, static_cast<int64_t>(text_mask[0][0].size())};
|
||||
|
||||
auto text_ids_tensor = intArrayToTensor(memory_info, text_ids, text_ids_shape);
|
||||
auto text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
|
||||
|
||||
// Create style tensors
|
||||
auto style_ttl_tensor = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
const_cast<float*>(style.getTtlData().data()),
|
||||
style.getTtlData().size(),
|
||||
style.getTtlShape().data(),
|
||||
style.getTtlShape().size()
|
||||
);
|
||||
|
||||
auto style_dp_tensor = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
const_cast<float*>(style.getDpData().data()),
|
||||
style.getDpData().size(),
|
||||
style.getDpShape().data(),
|
||||
style.getDpShape().size()
|
||||
);
|
||||
|
||||
// Run duration predictor
|
||||
const char* dp_input_names[] = {"text_ids", "style_dp", "text_mask"};
|
||||
const char* dp_output_names[] = {"duration"};
|
||||
std::vector<Ort::Value> dp_inputs;
|
||||
dp_inputs.push_back(std::move(text_ids_tensor));
|
||||
dp_inputs.push_back(std::move(style_dp_tensor));
|
||||
dp_inputs.push_back(std::move(text_mask_tensor));
|
||||
|
||||
auto dp_outputs = dp_ort_->Run(
|
||||
Ort::RunOptions{nullptr},
|
||||
dp_input_names, dp_inputs.data(), dp_inputs.size(),
|
||||
dp_output_names, 1
|
||||
);
|
||||
|
||||
auto* dur_data = dp_outputs[0].GetTensorMutableData<float>();
|
||||
std::vector<float> duration(dur_data, dur_data + bsz);
|
||||
|
||||
// Create new tensors for text encoder (previous ones were moved)
|
||||
text_ids_tensor = intArrayToTensor(memory_info, text_ids, text_ids_shape);
|
||||
text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
|
||||
style_ttl_tensor = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
const_cast<float*>(style.getTtlData().data()),
|
||||
style.getTtlData().size(),
|
||||
style.getTtlShape().data(),
|
||||
style.getTtlShape().size()
|
||||
);
|
||||
|
||||
// Run text encoder
|
||||
const char* text_enc_input_names[] = {"text_ids", "style_ttl", "text_mask"};
|
||||
const char* text_enc_output_names[] = {"text_emb"};
|
||||
std::vector<Ort::Value> text_enc_inputs;
|
||||
text_enc_inputs.push_back(std::move(text_ids_tensor));
|
||||
text_enc_inputs.push_back(std::move(style_ttl_tensor));
|
||||
text_enc_inputs.push_back(std::move(text_mask_tensor));
|
||||
|
||||
auto text_enc_outputs = text_enc_ort_->Run(
|
||||
Ort::RunOptions{nullptr},
|
||||
text_enc_input_names, text_enc_inputs.data(), text_enc_inputs.size(),
|
||||
text_enc_output_names, 1
|
||||
);
|
||||
|
||||
// Sample noisy latent
|
||||
std::vector<std::vector<std::vector<float>>> xt, latent_mask;
|
||||
sampleNoisyLatent(duration, xt, latent_mask);
|
||||
|
||||
std::vector<int64_t> latent_shape = {
|
||||
bsz,
|
||||
static_cast<int64_t>(xt[0].size()),
|
||||
static_cast<int64_t>(xt[0][0].size())
|
||||
};
|
||||
std::vector<int64_t> latent_mask_shape = {
|
||||
bsz, 1,
|
||||
static_cast<int64_t>(latent_mask[0][0].size())
|
||||
};
|
||||
|
||||
// Per-item total step count; a tensor over this buffer is created in each iteration
std::vector<float> total_step_vec(bsz, static_cast<float>(total_step));
|
||||
|
||||
// Store text_emb data to reuse across iterations
|
||||
auto text_emb_info = text_enc_outputs[0].GetTensorTypeAndShapeInfo();
|
||||
size_t text_emb_size = text_emb_info.GetElementCount();
|
||||
auto* text_emb_data = text_enc_outputs[0].GetTensorMutableData<float>();
|
||||
std::vector<float> text_emb_vec(text_emb_data, text_emb_data + text_emb_size);
|
||||
auto text_emb_shape = text_emb_info.GetShape();
|
||||
|
||||
// Iterative denoising
|
||||
for (int step = 0; step < total_step; step++) {
|
||||
std::vector<float> current_step_vec(bsz, static_cast<float>(step));
|
||||
|
||||
text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
|
||||
auto latent_mask_tensor = arrayToTensor(memory_info, latent_mask, latent_mask_shape);
|
||||
auto noisy_latent_tensor = arrayToTensor(memory_info, xt, latent_shape);
|
||||
style_ttl_tensor = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
const_cast<float*>(style.getTtlData().data()),
|
||||
style.getTtlData().size(),
|
||||
style.getTtlShape().data(),
|
||||
style.getTtlShape().size()
|
||||
);
|
||||
|
||||
auto text_emb_tensor = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
text_emb_vec.data(),
|
||||
text_emb_vec.size(),
|
||||
text_emb_shape.data(),
|
||||
text_emb_shape.size()
|
||||
);
|
||||
|
||||
auto current_step_tensor = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
current_step_vec.data(),
|
||||
current_step_vec.size(),
|
||||
std::vector<int64_t>{bsz}.data(),
|
||||
1
|
||||
);
|
||||
|
||||
const char* vector_est_input_names[] = {
|
||||
"noisy_latent", "text_emb", "style_ttl", "text_mask", "latent_mask", "total_step", "current_step"
|
||||
};
|
||||
const char* vector_est_output_names[] = {"denoised_latent"};
|
||||
|
||||
std::vector<Ort::Value> vector_est_inputs;
|
||||
vector_est_inputs.push_back(std::move(noisy_latent_tensor));
|
||||
vector_est_inputs.push_back(std::move(text_emb_tensor));
|
||||
vector_est_inputs.push_back(std::move(style_ttl_tensor));
|
||||
vector_est_inputs.push_back(std::move(text_mask_tensor));
|
||||
vector_est_inputs.push_back(std::move(latent_mask_tensor));
|
||||
|
||||
// Create a new total_step tensor for each iteration
|
||||
auto total_step_tensor_iter = Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
total_step_vec.data(),
|
||||
total_step_vec.size(),
|
||||
std::vector<int64_t>{bsz}.data(),
|
||||
1
|
||||
);
|
||||
vector_est_inputs.push_back(std::move(total_step_tensor_iter));
|
||||
vector_est_inputs.push_back(std::move(current_step_tensor));
|
||||
|
||||
auto vector_est_outputs = vector_est_ort_->Run(
|
||||
Ort::RunOptions{nullptr},
|
||||
vector_est_input_names, vector_est_inputs.data(), vector_est_inputs.size(),
|
||||
vector_est_output_names, 1
|
||||
);
|
||||
|
||||
// Update xt with denoised output
|
||||
auto* denoised_data = vector_est_outputs[0].GetTensorMutableData<float>();
|
||||
size_t idx = 0;
|
||||
for (int b = 0; b < bsz; b++) {
|
||||
for (size_t d = 0; d < xt[b].size(); d++) {
|
||||
for (size_t t = 0; t < xt[b][d].size(); t++) {
|
||||
xt[b][d][t] = denoised_data[idx++];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Run vocoder
|
||||
auto latent_tensor = arrayToTensor(memory_info, xt, latent_shape);
|
||||
const char* vocoder_input_names[] = {"latent"};
|
||||
const char* vocoder_output_names[] = {"wav_tts"};
|
||||
std::vector<Ort::Value> vocoder_inputs;
|
||||
vocoder_inputs.push_back(std::move(latent_tensor));
|
||||
|
||||
auto vocoder_outputs = vocoder_ort_->Run(
|
||||
Ort::RunOptions{nullptr},
|
||||
vocoder_input_names, vocoder_inputs.data(), vocoder_inputs.size(),
|
||||
vocoder_output_names, 1
|
||||
);
|
||||
|
||||
auto wav_info = vocoder_outputs[0].GetTensorTypeAndShapeInfo();
|
||||
size_t wav_size = wav_info.GetElementCount();
|
||||
auto* wav_data = vocoder_outputs[0].GetTensorMutableData<float>();
|
||||
|
||||
SynthesisResult result;
|
||||
result.wav.assign(wav_data, wav_data + wav_size);
|
||||
result.duration = duration;
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Utility functions
|
||||
// ============================================================================
|
||||
|
||||
std::vector<std::vector<std::vector<float>>> lengthToMask(
|
||||
const std::vector<int64_t>& lengths, int max_len
|
||||
) {
|
||||
if (max_len == -1) {
|
||||
max_len = *std::max_element(lengths.begin(), lengths.end());
|
||||
}
|
||||
|
||||
std::vector<std::vector<std::vector<float>>> mask;
|
||||
for (auto len : lengths) {
|
||||
std::vector<std::vector<float>> batch_mask(1);
|
||||
batch_mask[0].resize(max_len);
|
||||
for (int i = 0; i < max_len; i++) {
|
||||
batch_mask[0][i] = (i < len) ? 1.0f : 0.0f;
|
||||
}
|
||||
mask.push_back(batch_mask);
|
||||
}
|
||||
return mask;
|
||||
}
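// Note: lengthToMask() above returns a nested [batch][1][max_len] vector that
// is flattened again before it is handed to ONNX Runtime. The following is a
// minimal sketch that builds the same mask directly in one contiguous buffer.
// lengthToMaskFlat is a hypothetical helper, not part of this repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: row-major [batch, 1, max_len] mask in one flat buffer.
std::vector<float> lengthToMaskFlat(const std::vector<int64_t>& lengths,
                                    int64_t max_len = -1) {
    if (lengths.empty()) return {};
    if (max_len < 0) {
        max_len = *std::max_element(lengths.begin(), lengths.end());
    }
    if (max_len <= 0) return {};

    std::vector<float> mask(lengths.size() * static_cast<std::size_t>(max_len), 0.0f);
    for (std::size_t b = 0; b < lengths.size(); ++b) {
        // Positions [0, len) are valid (1.0f); the rest stay 0.0f.
        const int64_t valid = std::max<int64_t>(0, std::min(lengths[b], max_len));
        const auto offset = static_cast<std::ptrdiff_t>(b * static_cast<std::size_t>(max_len));
        std::fill_n(mask.begin() + offset, valid, 1.0f);
    }
    return mask;
}
```

// Design note: a flat buffer avoids one copy per tensor, at the cost of
// indexing with b * max_len + t instead of mask[b][0][t].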
|
||||
|
||||
std::vector<std::vector<std::vector<float>>> getLatentMask(
|
||||
const std::vector<int64_t>& wav_lengths,
|
||||
int base_chunk_size,
|
||||
int chunk_compress_factor
|
||||
) {
|
||||
int latent_size = base_chunk_size * chunk_compress_factor;
|
||||
std::vector<int64_t> latent_lengths;
|
||||
for (auto len : wav_lengths) {
|
||||
latent_lengths.push_back((len + latent_size - 1) / latent_size);
|
||||
}
|
||||
return lengthToMask(latent_lengths);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// ONNX model loading
|
||||
// ============================================================================
|
||||
|
||||
std::unique_ptr<Ort::Session> loadOnnx(
|
||||
Ort::Env& env,
|
||||
const std::string& onnx_path,
|
||||
const Ort::SessionOptions& opts
|
||||
) {
|
||||
return std::make_unique<Ort::Session>(env, onnx_path.c_str(), opts);
|
||||
}
|
||||
|
||||
OnnxModels loadOnnxAll(
|
||||
Ort::Env& env,
|
||||
const std::string& onnx_dir,
|
||||
const Ort::SessionOptions& opts
|
||||
) {
|
||||
OnnxModels models;
|
||||
models.dp = loadOnnx(env, onnx_dir + "/duration_predictor.onnx", opts);
|
||||
models.text_enc = loadOnnx(env, onnx_dir + "/text_encoder.onnx", opts);
|
||||
models.vector_est = loadOnnx(env, onnx_dir + "/vector_estimator.onnx", opts);
|
||||
models.vocoder = loadOnnx(env, onnx_dir + "/vocoder.onnx", opts);
|
||||
return models;
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Configuration and processor loading
|
||||
// ============================================================================
|
||||
|
||||
Config loadCfgs(const std::string& onnx_dir) {
|
||||
std::string cfg_path = onnx_dir + "/tts.json";
|
||||
std::ifstream file(cfg_path);
|
||||
if (!file.is_open()) {
|
||||
throw std::runtime_error("Failed to open config file: " + cfg_path);
|
||||
}
|
||||
|
||||
json j;
|
||||
file >> j;
|
||||
|
||||
Config cfg;
|
||||
cfg.ae.sample_rate = j["ae"]["sample_rate"];
|
||||
cfg.ae.base_chunk_size = j["ae"]["base_chunk_size"];
|
||||
cfg.ttl.chunk_compress_factor = j["ttl"]["chunk_compress_factor"];
|
||||
cfg.ttl.latent_dim = j["ttl"]["latent_dim"];
|
||||
|
||||
return cfg;
|
||||
}
|
||||
|
||||
std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir) {
|
||||
std::string unicode_indexer_path = onnx_dir + "/unicode_indexer.json";
|
||||
return std::make_unique<UnicodeProcessor>(unicode_indexer_path);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Voice style loading
|
||||
// ============================================================================
|
||||
|
||||
Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose) {
|
||||
int bsz = voice_style_paths.size();
if (bsz == 0) {
    throw std::runtime_error("At least one voice style path is required");
}
|
||||
|
||||
// Read first file to get dimensions
|
||||
std::ifstream first_file(voice_style_paths[0]);
|
||||
if (!first_file.is_open()) {
|
||||
throw std::runtime_error("Failed to open voice style file: " + voice_style_paths[0]);
|
||||
}
|
||||
json first_json;
|
||||
first_file >> first_json;
|
||||
|
||||
auto ttl_dims = first_json["style_ttl"]["dims"].get<std::vector<int64_t>>();
|
||||
auto dp_dims = first_json["style_dp"]["dims"].get<std::vector<int64_t>>();
|
||||
|
||||
int64_t ttl_dim1 = ttl_dims[1];
|
||||
int64_t ttl_dim2 = ttl_dims[2];
|
||||
int64_t dp_dim1 = dp_dims[1];
|
||||
int64_t dp_dim2 = dp_dims[2];
|
||||
|
||||
// Pre-allocate arrays with full batch size
|
||||
size_t ttl_size = bsz * ttl_dim1 * ttl_dim2;
|
||||
size_t dp_size = bsz * dp_dim1 * dp_dim2;
|
||||
std::vector<float> ttl_flat(ttl_size);
|
||||
std::vector<float> dp_flat(dp_size);
|
||||
|
||||
// Fill in the data
|
||||
for (int i = 0; i < bsz; i++) {
|
||||
std::ifstream file(voice_style_paths[i]);
|
||||
if (!file.is_open()) {
|
||||
throw std::runtime_error("Failed to open voice style file: " + voice_style_paths[i]);
|
||||
}
|
||||
|
||||
json j;
|
||||
file >> j;
|
||||
|
||||
// Flatten data
|
||||
auto ttl_data_nested = j["style_ttl"]["data"].get<std::vector<std::vector<std::vector<float>>>>();
|
||||
std::vector<float> ttl_data;
|
||||
for (const auto& batch : ttl_data_nested) {
|
||||
for (const auto& row : batch) {
|
||||
ttl_data.insert(ttl_data.end(), row.begin(), row.end());
|
||||
}
|
||||
}
|
||||
|
||||
auto dp_data_nested = j["style_dp"]["data"].get<std::vector<std::vector<std::vector<float>>>>();
|
||||
std::vector<float> dp_data;
|
||||
for (const auto& batch : dp_data_nested) {
|
||||
for (const auto& row : batch) {
|
||||
dp_data.insert(dp_data.end(), row.begin(), row.end());
|
||||
}
|
||||
}
|
||||
|
||||
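// Copy to pre-allocated array, rejecting files whose shape differs from the first file
if (ttl_data.size() != static_cast<size_t>(ttl_dim1 * ttl_dim2) ||
    dp_data.size() != static_cast<size_t>(dp_dim1 * dp_dim2)) {
    throw std::runtime_error("Voice style shape mismatch in file: " + voice_style_paths[i]);
}

size_t ttl_offset = i * ttl_dim1 * ttl_dim2;
std::copy(ttl_data.begin(), ttl_data.end(), ttl_flat.begin() + ttl_offset);

size_t dp_offset = i * dp_dim1 * dp_dim2;
std::copy(dp_data.begin(), dp_data.end(), dp_flat.begin() + dp_offset);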
|
||||
}
|
||||
|
||||
std::vector<int64_t> ttl_shape = {bsz, ttl_dim1, ttl_dim2};
|
||||
std::vector<int64_t> dp_shape = {bsz, dp_dim1, dp_dim2};
|
||||
|
||||
if (verbose) {
|
||||
std::cout << "Loaded " << bsz << " voice styles" << std::endl;
|
||||
}
|
||||
|
||||
return Style(ttl_flat, ttl_shape, dp_flat, dp_shape);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// TextToSpeech loading
|
||||
// ============================================================================
|
||||
|
||||
std::unique_ptr<TextToSpeech> loadTextToSpeech(
|
||||
Ort::Env& env,
|
||||
const std::string& onnx_dir,
|
||||
bool use_gpu
|
||||
) {
|
||||
Ort::SessionOptions opts;
|
||||
if (use_gpu) {
|
||||
throw std::runtime_error("GPU mode is not supported yet");
|
||||
} else {
|
||||
std::cout << "Using CPU for inference" << std::endl;
|
||||
}
|
||||
|
||||
auto cfgs = loadCfgs(onnx_dir);
|
||||
auto models = loadOnnxAll(env, onnx_dir, opts);
|
||||
auto text_processor = loadTextProcessor(onnx_dir);
|
||||
|
||||
// TextToSpeech stores non-owning raw pointers; ownership stays with this function's statics
|
||||
auto tts = std::make_unique<TextToSpeech>(
|
||||
cfgs,
|
||||
text_processor.get(),
|
||||
models.dp.get(),
|
||||
models.text_enc.get(),
|
||||
models.vector_est.get(),
|
||||
models.vocoder.get()
|
||||
);
|
||||
|
||||
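// Keep the models and processor alive for the rest of the program by storing them in statics.
// Caveat: calling loadTextToSpeech() again replaces these statics and leaves any
// previously returned TextToSpeech with dangling pointers.
// (In production, you'd want better lifetime management)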
|
||||
static OnnxModels static_models;
|
||||
static std::unique_ptr<UnicodeProcessor> static_text_processor;
|
||||
static_models = std::move(models);
|
||||
static_text_processor = std::move(text_processor);
|
||||
|
||||
return tts;
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// WAV file writing
|
||||
// ============================================================================
|
||||
|
||||
void writeWavFile(
|
||||
const std::string& filename,
|
||||
const std::vector<float>& audio_data,
|
||||
int sample_rate
|
||||
) {
|
||||
std::ofstream file(filename, std::ios::binary);
|
||||
if (!file.is_open()) {
|
||||
throw std::runtime_error("Failed to open file for writing: " + filename);
|
||||
}
|
||||
|
||||
int num_channels = 1;
|
||||
int bits_per_sample = 16;
|
||||
int byte_rate = sample_rate * num_channels * bits_per_sample / 8;
|
||||
int block_align = num_channels * bits_per_sample / 8;
|
||||
int data_size = audio_data.size() * bits_per_sample / 8;
|
||||
|
||||
// RIFF header
|
||||
file.write("RIFF", 4);
|
||||
int32_t chunk_size = 36 + data_size;
|
||||
file.write(reinterpret_cast<char*>(&chunk_size), 4);
|
||||
file.write("WAVE", 4);
|
||||
|
||||
// fmt chunk
|
||||
file.write("fmt ", 4);
|
||||
int32_t fmt_chunk_size = 16;
|
||||
file.write(reinterpret_cast<char*>(&fmt_chunk_size), 4);
|
||||
int16_t audio_format = 1; // PCM
|
||||
file.write(reinterpret_cast<char*>(&audio_format), 2);
|
||||
int16_t num_channels_16 = num_channels;
|
||||
file.write(reinterpret_cast<char*>(&num_channels_16), 2);
|
||||
file.write(reinterpret_cast<char*>(&sample_rate), 4);
|
||||
file.write(reinterpret_cast<char*>(&byte_rate), 4);
|
||||
int16_t block_align_16 = block_align;
|
||||
file.write(reinterpret_cast<char*>(&block_align_16), 2);
|
||||
int16_t bits_per_sample_16 = bits_per_sample;
|
||||
file.write(reinterpret_cast<char*>(&bits_per_sample_16), 2);
|
||||
|
||||
// data chunk
|
||||
file.write("data", 4);
|
||||
file.write(reinterpret_cast<char*>(&data_size), 4);
|
||||
|
||||
// Write audio data
|
||||
for (float sample : audio_data) {
|
||||
float clamped = std::max(-1.0f, std::min(1.0f, sample));
|
||||
int16_t int_sample = static_cast<int16_t>(clamped * 32767);
|
||||
file.write(reinterpret_cast<char*>(&int_sample), 2);
|
||||
}
|
||||
}
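// Note: writeWavFile() above clamps each sample to [-1, 1] and truncates
// toward zero when scaling to 16-bit PCM. The following is a minimal sketch of
// an alternative that rounds to the nearest integer and maps NaN to silence.
// floatToPcm16 is a hypothetical helper and not the repository's behaviour.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper: convert one float sample to signed 16-bit PCM.
int16_t floatToPcm16(float sample) {
    if (std::isnan(sample)) return 0;  // treat NaN as silence
    const float clamped = std::max(-1.0f, std::min(1.0f, sample));
    // Scale by 32767 so that +1.0 and -1.0 map symmetrically.
    return static_cast<int16_t>(std::lround(clamped * 32767.0f));
}
```

// Rounding removes the small bias toward zero that truncation introduces;
// whether that is audible in practice is not something this sketch measures.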
|
||||
|
||||
// ============================================================================
|
||||
// Tensor conversion utilities
|
||||
// ============================================================================
|
||||
|
||||
Ort::Value arrayToTensor(
|
||||
Ort::MemoryInfo& memory_info,
|
||||
const std::vector<std::vector<std::vector<float>>>& array,
|
||||
const std::vector<int64_t>& dims
|
||||
) {
|
||||
// Flatten the array
|
||||
std::vector<float> flat;
|
||||
for (const auto& batch : array) {
|
||||
for (const auto& row : batch) {
|
||||
for (float val : row) {
|
||||
flat.push_back(val);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Store in global buffer to keep data alive
|
||||
g_tensor_buffers_float.push_back(std::move(flat));
|
||||
auto& buffer = g_tensor_buffers_float.back();
|
||||
|
||||
return Ort::Value::CreateTensor<float>(
|
||||
memory_info,
|
||||
buffer.data(),
|
||||
buffer.size(),
|
||||
dims.data(),
|
||||
dims.size()
|
||||
);
|
||||
}
|
||||
|
||||
Ort::Value intArrayToTensor(
|
||||
Ort::MemoryInfo& memory_info,
|
||||
const std::vector<std::vector<int64_t>>& array,
|
||||
const std::vector<int64_t>& dims
|
||||
) {
|
||||
// Flatten the array
|
||||
std::vector<int64_t> flat;
|
||||
for (const auto& row : array) {
|
||||
for (int64_t val : row) {
|
||||
flat.push_back(val);
|
||||
}
|
||||
}
|
||||
|
||||
// Store in global buffer to keep data alive
|
||||
g_tensor_buffers_int64.push_back(std::move(flat));
|
||||
auto& buffer = g_tensor_buffers_int64.back();
|
||||
|
||||
return Ort::Value::CreateTensor<int64_t>(
|
||||
memory_info,
|
||||
buffer.data(),
|
||||
buffer.size(),
|
||||
dims.data(),
|
||||
dims.size()
|
||||
);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// JSON loading helpers
|
||||
// ============================================================================
|
||||
|
||||
std::vector<int64_t> loadJsonInt64(const std::string& file_path) {
|
||||
std::ifstream file(file_path);
|
||||
if (!file.is_open()) {
|
||||
throw std::runtime_error("Failed to open file: " + file_path);
|
||||
}
|
||||
|
||||
json j;
|
||||
file >> j;
|
||||
|
||||
return j.get<std::vector<int64_t>>();
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Sanitize filename
|
||||
// ============================================================================
|
||||
|
||||
std::string sanitizeFilename(const std::string& text, int max_len) {
|
||||
std::string result;
|
||||
int count = 0;
|
||||
for (char c : text) {
|
||||
if (count >= max_len) break;
|
||||
if (std::isalnum(static_cast<unsigned char>(c))) {
|
||||
result += c;
|
||||
} else {
|
||||
result += '_';
|
||||
}
|
||||
count++;
|
||||
}
|
||||
return result;
|
||||
}
|
||||
+202
@@ -0,0 +1,202 @@
|
||||
#pragma once
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
#include <memory>
|
||||
#include <iostream>
|
||||
#include <iomanip>
|
||||
#include <chrono>
#include <cstdint>  // int64_t, uint16_t
|
||||
#include <onnxruntime_cxx_api.h>
|
||||
|
||||
/**
|
||||
* Configuration structure
|
||||
*/
|
||||
struct Config {
|
||||
struct AEConfig {
|
||||
int sample_rate;
|
||||
int base_chunk_size;
|
||||
} ae;
|
||||
|
||||
struct TTLConfig {
|
||||
int chunk_compress_factor;
|
||||
int latent_dim;
|
||||
} ttl;
|
||||
};
|
||||
|
||||
/**
|
||||
* Unicode text processor
|
||||
*/
|
||||
class UnicodeProcessor {
|
||||
public:
|
||||
explicit UnicodeProcessor(const std::string& unicode_indexer_json_path);
|
||||
|
||||
// Process text list to text IDs and mask
|
||||
void call(
|
||||
const std::vector<std::string>& text_list,
|
||||
std::vector<std::vector<int64_t>>& text_ids,
|
||||
std::vector<std::vector<std::vector<float>>>& text_mask
|
||||
);
|
||||
|
||||
private:
|
||||
std::vector<int64_t> indexer_;
|
||||
|
||||
std::string preprocessText(const std::string& text);
|
||||
std::vector<uint16_t> textToUnicodeValues(const std::string& text);
|
||||
std::vector<std::vector<std::vector<float>>> getTextMask(
|
||||
const std::vector<int64_t>& text_ids_lengths
|
||||
);
|
||||
};
|
||||
|
||||
/**
|
||||
* Style class
|
||||
*/
|
||||
class Style {
|
||||
public:
|
||||
Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
|
||||
const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape);
|
||||
|
||||
const std::vector<float>& getTtlData() const { return ttl_data_; }
|
||||
const std::vector<float>& getDpData() const { return dp_data_; }
|
||||
const std::vector<int64_t>& getTtlShape() const { return ttl_shape_; }
|
||||
const std::vector<int64_t>& getDpShape() const { return dp_shape_; }
|
||||
|
||||
private:
|
||||
std::vector<float> ttl_data_;
|
||||
std::vector<float> dp_data_;
|
||||
std::vector<int64_t> ttl_shape_;
|
||||
std::vector<int64_t> dp_shape_;
|
||||
};
|
||||
|
||||
/**
|
||||
* TextToSpeech class
|
||||
*/
|
||||
class TextToSpeech {
|
||||
public:
|
||||
TextToSpeech(
|
||||
const Config& cfgs,
|
||||
UnicodeProcessor* text_processor,
|
||||
Ort::Session* dp_ort,
|
||||
Ort::Session* text_enc_ort,
|
||||
Ort::Session* vector_est_ort,
|
||||
Ort::Session* vocoder_ort
|
||||
);
|
||||
|
||||
struct SynthesisResult {
|
||||
std::vector<float> wav;
|
||||
std::vector<float> duration;
|
||||
};
|
||||
|
||||
SynthesisResult call(
|
||||
Ort::MemoryInfo& memory_info,
|
||||
const std::vector<std::string>& text_list,
|
||||
const Style& style,
|
||||
int total_step
|
||||
);
|
||||
|
||||
int getSampleRate() const { return sample_rate_; }
|
||||
|
||||
private:
|
||||
Config cfgs_;
|
||||
UnicodeProcessor* text_processor_;
|
||||
Ort::Session* dp_ort_;
|
||||
Ort::Session* text_enc_ort_;
|
||||
Ort::Session* vector_est_ort_;
|
||||
Ort::Session* vocoder_ort_;
|
||||
int sample_rate_;
|
||||
int base_chunk_size_;
|
||||
int chunk_compress_factor_;
|
||||
int ldim_;
|
||||
|
||||
void sampleNoisyLatent(
|
||||
const std::vector<float>& duration,
|
||||
std::vector<std::vector<std::vector<float>>>& noisy_latent,
|
||||
std::vector<std::vector<std::vector<float>>>& latent_mask
|
||||
);
|
||||
};
|
||||
|
||||
// Utility functions
|
||||
std::vector<std::vector<std::vector<float>>> lengthToMask(
|
||||
const std::vector<int64_t>& lengths, int max_len = -1
|
||||
);
|
||||
|
||||
std::vector<std::vector<std::vector<float>>> getLatentMask(
|
||||
const std::vector<int64_t>& wav_lengths,
|
||||
int base_chunk_size,
|
||||
int chunk_compress_factor
|
||||
);
|
||||
|
||||
// ONNX model loading
|
||||
struct OnnxModels {
|
||||
std::unique_ptr<Ort::Session> dp;
|
||||
std::unique_ptr<Ort::Session> text_enc;
|
||||
std::unique_ptr<Ort::Session> vector_est;
|
||||
std::unique_ptr<Ort::Session> vocoder;
|
||||
};
|
||||
|
||||
std::unique_ptr<Ort::Session> loadOnnx(
|
||||
Ort::Env& env,
|
||||
const std::string& onnx_path,
|
||||
const Ort::SessionOptions& opts
|
||||
);
|
||||
|
||||
OnnxModels loadOnnxAll(
|
||||
Ort::Env& env,
|
||||
const std::string& onnx_dir,
|
||||
const Ort::SessionOptions& opts
|
||||
);
|
||||
|
||||
// Configuration and processor loading
|
||||
Config loadCfgs(const std::string& onnx_dir);
|
||||
|
||||
std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir);
|
||||
|
||||
// Voice style loading
|
||||
Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose = false);
|
||||
|
||||
// TextToSpeech loading
|
||||
std::unique_ptr<TextToSpeech> loadTextToSpeech(
|
||||
Ort::Env& env,
|
||||
const std::string& onnx_dir,
|
||||
bool use_gpu = false
|
||||
);
|
||||
|
||||
// WAV file writing
|
||||
void writeWavFile(
|
||||
const std::string& filename,
|
||||
const std::vector<float>& audio_data,
|
||||
int sample_rate
|
||||
);
|
||||
|
||||
// Tensor conversion utilities
|
||||
void clearTensorBuffers();
|
||||
|
||||
Ort::Value arrayToTensor(
|
||||
Ort::MemoryInfo& memory_info,
|
||||
const std::vector<std::vector<std::vector<float>>>& array,
|
||||
const std::vector<int64_t>& dims
|
||||
);
|
||||
|
||||
Ort::Value intArrayToTensor(
|
||||
Ort::MemoryInfo& memory_info,
|
||||
const std::vector<std::vector<int64_t>>& array,
|
||||
const std::vector<int64_t>& dims
|
||||
);
|
||||
|
||||
// JSON loading helpers
|
||||
std::vector<int64_t> loadJsonInt64(const std::string& file_path);
|
||||
|
||||
// Timer utility
|
||||
template<typename Func>
|
||||
auto timer(const std::string& name, Func&& func) -> decltype(func()) {
|
||||
auto start = std::chrono::high_resolution_clock::now();
|
||||
std::cout << name << "..." << std::endl;
|
||||
auto result = func();
|
||||
auto end = std::chrono::high_resolution_clock::now();
|
||||
std::chrono::duration<double> elapsed = end - start;
|
||||
std::cout << " -> " << name << " completed in "
|
||||
<< std::fixed << std::setprecision(2) << elapsed.count() << " sec" << std::endl;
|
||||
return result;
|
||||
}
|
||||
|
||||
// Sanitize filename
|
||||
std::string sanitizeFilename(const std::string& text, int max_len);
|
||||
@@ -0,0 +1,41 @@
|
||||
# Build results
|
||||
bin/
|
||||
obj/
|
||||
[Dd]ebug/
|
||||
[Rr]elease/
|
||||
x64/
|
||||
x86/
|
||||
[Aa]rm/
|
||||
[Aa]rm64/
|
||||
bld/
|
||||
[Bb]in/
|
||||
[Oo]bj/
|
||||
[Ll]og/
|
||||
|
||||
# Visual Studio files
|
||||
.vs/
|
||||
*.suo
|
||||
*.user
|
||||
*.userosscache
|
||||
*.sln.docstates
|
||||
*.userprefs
|
||||
|
||||
# Rider
|
||||
.idea/
|
||||
*.sln.iml
|
||||
|
||||
# User-specific files
|
||||
*.rsuser
|
||||
*.suo
|
||||
*.user
|
||||
*.userosscache
|
||||
*.sln.docstates
|
||||
|
||||
# Output directory
|
||||
results/*.wav
|
||||
|
||||
# OS files
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
|
||||
|
||||
@@ -0,0 +1,118 @@
|
||||
using System;
|
||||
using System.Collections.Generic;
|
||||
using System.IO;
|
||||
using System.Linq;
|
||||
|
||||
namespace Supertonic
|
||||
{
|
||||
class Program
|
||||
{
|
||||
class Args
|
||||
{
|
||||
public bool UseGpu { get; set; } = false;
|
||||
public string OnnxDir { get; set; } = "assets/onnx";
|
||||
public int TotalStep { get; set; } = 5;
|
||||
public int NTest { get; set; } = 4;
|
||||
public List<string> VoiceStyle { get; set; } = new List<string> { "assets/voice_styles/M1.json" };
|
||||
public List<string> Text { get; set; } = new List<string>
|
||||
{
|
||||
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||
};
|
||||
public string SaveDir { get; set; } = "results";
|
||||
}
|
||||
|
||||
static Args ParseArgs(string[] args)
|
||||
{
|
||||
var result = new Args();
|
||||
|
||||
for (int i = 0; i < args.Length; i++)
|
||||
{
|
||||
switch (args[i])
|
||||
{
|
||||
case "--use-gpu":
|
||||
result.UseGpu = true;
|
||||
break;
|
||||
case "--onnx-dir" when i + 1 < args.Length:
|
||||
result.OnnxDir = args[++i];
|
||||
break;
|
||||
case "--total-step" when i + 1 < args.Length:
|
||||
result.TotalStep = int.Parse(args[++i]);
|
||||
break;
|
||||
case "--n-test" when i + 1 < args.Length:
|
||||
result.NTest = int.Parse(args[++i]);
|
||||
break;
|
||||
case "--voice-style" when i + 1 < args.Length:
|
||||
result.VoiceStyle = args[++i].Split(',').ToList();
|
||||
break;
|
||||
case "--text" when i + 1 < args.Length:
|
||||
result.Text = args[++i].Split('|').ToList();
|
||||
break;
|
||||
case "--save-dir" when i + 1 < args.Length:
|
||||
result.SaveDir = args[++i];
|
||||
break;
default:
// Fail loudly instead of silently ignoring a mistyped flag or a flag with no value
throw new ArgumentException($"Unknown argument or missing value: {args[i]}");
|
||||
}
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
static void Main(string[] args)
|
||||
{
|
||||
Console.WriteLine("=== TTS Inference with ONNX Runtime (C#) ===\n");
|
||||
|
||||
// --- 1. Parse arguments --- //
|
||||
var parsedArgs = ParseArgs(args);
|
||||
int totalStep = parsedArgs.TotalStep;
|
||||
int nTest = parsedArgs.NTest;
|
||||
string saveDir = parsedArgs.SaveDir;
|
||||
var voiceStylePaths = parsedArgs.VoiceStyle;
|
||||
var textList = parsedArgs.Text;
|
||||
|
||||
if (voiceStylePaths.Count != textList.Count)
|
||||
{
|
||||
throw new ArgumentException(
|
||||
$"Number of voice styles ({voiceStylePaths.Count}) must match number of texts ({textList.Count})");
|
||||
}
|
||||
|
||||
int bsz = voiceStylePaths.Count;
|
||||
|
||||
// --- 2. Load Text to Speech --- //
|
||||
var textToSpeech = Helper.LoadTextToSpeech(parsedArgs.OnnxDir, parsedArgs.UseGpu);
|
||||
Console.WriteLine();
|
||||
|
||||
// --- 3. Load Voice Style --- //
|
||||
var style = Helper.LoadVoiceStyle(voiceStylePaths, verbose: true);
|
||||
|
||||
// --- 4. Synthesize speech --- //
|
||||
for (int n = 0; n < nTest; n++)
|
||||
{
|
||||
Console.WriteLine($"\n[{n + 1}/{nTest}] Starting synthesis...");
|
||||
|
||||
var (wav, duration) = Helper.Timer("Generating speech from text", () =>
|
||||
textToSpeech.Call(textList, style, totalStep)
|
||||
);
|
||||
|
||||
// CreateDirectory is a no-op when the directory already exists
Directory.CreateDirectory(saveDir);
|
||||
|
||||
for (int b = 0; b < bsz; b++)
|
||||
{
|
||||
string fname = $"{Helper.SanitizeFilename(textList[b], 20)}_{n + 1}.wav";
|
||||
|
||||
int wavLen = (int)(textToSpeech.SampleRate * duration[b]);
|
||||
var wavOut = new float[wavLen];
|
||||
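// wav is one flat buffer holding bsz equally sized clips; copy clip b, trimmed to its predicted duration
Array.Copy(wav, b * wav.Length / bsz, wavOut, 0, Math.Min(wavLen, wav.Length / bsz));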
|
||||
|
||||
string outputPath = Path.Combine(saveDir, fname);
|
||||
Helper.WriteWavFile(outputPath, wavOut, textToSpeech.SampleRate);
|
||||
Console.WriteLine($"Saved: {outputPath}");
|
||||
}
|
||||
}
|
||||
|
||||
Console.WriteLine("\n=== Synthesis completed successfully! ===");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -0,0 +1,612 @@
|
||||
using System;
|
||||
using System.Collections.Generic;
|
||||
using System.IO;
|
||||
using System.Linq;
|
||||
using System.Text;
|
||||
using System.Text.Json;
|
||||
using Microsoft.ML.OnnxRuntime;
|
||||
using Microsoft.ML.OnnxRuntime.Tensors;
|
||||
|
||||
namespace Supertonic
|
||||
{
|
||||
// ============================================================================
|
||||
// Configuration classes
|
||||
// ============================================================================
|
||||
|
||||
public class Config
|
||||
{
|
||||
public AEConfig AE { get; set; } = null!;
|
||||
public TTLConfig TTL { get; set; } = null!;
|
||||
|
||||
public class AEConfig
|
||||
{
|
||||
public int SampleRate { get; set; }
|
||||
public int BaseChunkSize { get; set; }
|
||||
}
|
||||
|
||||
public class TTLConfig
|
||||
{
|
||||
public int ChunkCompressFactor { get; set; }
|
||||
public int LatentDim { get; set; }
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Style class
|
||||
// ============================================================================
|
||||
|
||||
public class Style
|
||||
{
|
||||
public float[] Ttl { get; set; }
|
||||
public long[] TtlShape { get; set; }
|
||||
public float[] Dp { get; set; }
|
||||
public long[] DpShape { get; set; }
|
||||
|
||||
public Style(float[] ttl, long[] ttlShape, float[] dp, long[] dpShape)
|
||||
{
|
||||
Ttl = ttl;
|
||||
TtlShape = ttlShape;
|
||||
Dp = dp;
|
||||
DpShape = dpShape;
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Unicode text processor
|
||||
// ============================================================================
|
||||
|
||||
public class UnicodeProcessor
|
||||
{
|
||||
private readonly Dictionary<int, long> _indexer;
|
||||
|
||||
public UnicodeProcessor(string unicodeIndexerPath)
|
||||
{
|
||||
var json = File.ReadAllText(unicodeIndexerPath);
|
||||
var indexerArray = JsonSerializer.Deserialize<long[]>(json) ?? throw new InvalidDataException($"Failed to load unicode indexer from '{unicodeIndexerPath}'");
|
||||
_indexer = new Dictionary<int, long>();
|
||||
for (int i = 0; i < indexerArray.Length; i++)
|
||||
{
|
||||
_indexer[i] = indexerArray[i];
|
||||
}
|
||||
}
|
||||
|
||||
private string PreprocessText(string text)
|
||||
{
|
||||
// Apply Unicode compatibility decomposition (NFKD) using the built-in string.Normalize
|
||||
return text.Normalize(NormalizationForm.FormKD);
|
||||
}
|
||||
|
||||
private int[] TextToUnicodeValues(string text)
|
||||
{
|
||||
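// Note: this yields UTF-16 code units, so a character outside the BMP produces two surrogate values
return text.Select(c => (int)c).ToArray();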
|
||||
}
|
||||
|
||||
private float[][][] GetTextMask(long[] textIdsLengths)
|
||||
{
|
||||
return Helper.LengthToMask(textIdsLengths);
|
||||
}
|
||||
|
||||
public (long[][] textIds, float[][][] textMask) Call(List<string> textList)
|
||||
{
|
||||
var processedTexts = textList.Select(t => PreprocessText(t)).ToList();
|
||||
var textIdsLengths = processedTexts.Select(t => (long)t.Length).ToArray();
|
||||
long maxLen = textIdsLengths.Max();
|
||||
|
||||
var textIds = new long[textList.Count][];
|
||||
for (int i = 0; i < processedTexts.Count; i++)
|
||||
{
|
||||
textIds[i] = new long[maxLen];
|
||||
var unicodeVals = TextToUnicodeValues(processedTexts[i]);
|
||||
for (int j = 0; j < unicodeVals.Length; j++)
|
||||
{
|
||||
if (_indexer.TryGetValue(unicodeVals[j], out long val))
|
||||
{
|
||||
textIds[i][j] = val;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
var textMask = GetTextMask(textIdsLengths);
|
||||
return (textIds, textMask);
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// TextToSpeech class
|
||||
// ============================================================================
|
||||
|
||||
public class TextToSpeech
|
||||
{
|
||||
private readonly Config _cfgs;
|
||||
private readonly UnicodeProcessor _textProcessor;
|
||||
private readonly InferenceSession _dpOrt;
|
||||
private readonly InferenceSession _textEncOrt;
|
||||
private readonly InferenceSession _vectorEstOrt;
|
||||
private readonly InferenceSession _vocoderOrt;
|
||||
public readonly int SampleRate;
|
||||
private readonly int _baseChunkSize;
|
||||
private readonly int _chunkCompressFactor;
|
||||
private readonly int _ldim;
|
||||
|
||||
public TextToSpeech(
|
||||
Config cfgs,
|
||||
UnicodeProcessor textProcessor,
|
||||
InferenceSession dpOrt,
|
||||
InferenceSession textEncOrt,
|
||||
InferenceSession vectorEstOrt,
|
||||
InferenceSession vocoderOrt)
|
||||
{
|
||||
_cfgs = cfgs;
|
||||
_textProcessor = textProcessor;
|
||||
_dpOrt = dpOrt;
|
||||
_textEncOrt = textEncOrt;
|
||||
_vectorEstOrt = vectorEstOrt;
|
||||
_vocoderOrt = vocoderOrt;
|
||||
SampleRate = cfgs.AE.SampleRate;
|
||||
_baseChunkSize = cfgs.AE.BaseChunkSize;
|
||||
_chunkCompressFactor = cfgs.TTL.ChunkCompressFactor;
|
||||
_ldim = cfgs.TTL.LatentDim;
|
||||
}
|
||||
|
||||
private (float[][][] noisyLatent, float[][][] latentMask) SampleNoisyLatent(float[] duration)
|
||||
{
|
||||
int bsz = duration.Length;
|
||||
float wavLenMax = duration.Max() * SampleRate;
|
||||
var wavLengths = duration.Select(d => (long)(d * SampleRate)).ToArray();
|
||||
int chunkSize = _baseChunkSize * _chunkCompressFactor;
|
||||
// Round up: number of latent frames needed to cover the longest waveform in the batch
int latentLen = (int)((wavLenMax + chunkSize - 1) / chunkSize);
|
||||
int latentDim = _ldim * _chunkCompressFactor;
|
||||
|
||||
// Generate random noise
|
||||
var random = new Random();
|
||||
var noisyLatent = new float[bsz][][];
|
||||
for (int b = 0; b < bsz; b++)
|
||||
{
|
||||
noisyLatent[b] = new float[latentDim][];
|
||||
for (int d = 0; d < latentDim; d++)
|
||||
{
|
||||
noisyLatent[b][d] = new float[latentLen];
|
||||
for (int t = 0; t < latentLen; t++)
|
||||
{
|
||||
// Box-Muller transform for normal distribution; 1.0 - NextDouble() maps [0, 1) to (0, 1] so Log(u1) never receives 0
|
||||
double u1 = 1.0 - random.NextDouble();
|
||||
double u2 = 1.0 - random.NextDouble();
|
||||
noisyLatent[b][d][t] = (float)(Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
var latentMask = Helper.GetLatentMask(wavLengths, _baseChunkSize, _chunkCompressFactor);
|
||||
|
||||
// Apply mask
|
||||
for (int b = 0; b < bsz; b++)
|
||||
{
|
||||
for (int d = 0; d < latentDim; d++)
|
||||
{
|
||||
for (int t = 0; t < latentLen; t++)
|
||||
{
|
||||
noisyLatent[b][d][t] *= latentMask[b][0][t];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return (noisyLatent, latentMask);
|
||||
}
|
||||
|
||||
public (float[] wav, float[] duration) Call(List<string> textList, Style style, int totalStep)
|
||||
{
|
||||
int bsz = textList.Count;
|
||||
if (bsz != style.TtlShape[0])
|
||||
{
|
||||
throw new ArgumentException("Number of texts must match number of style vectors");
|
||||
}
|
||||
|
||||
// Process text
|
||||
var (textIds, textMask) = _textProcessor.Call(textList);
|
||||
var textIdsShape = new long[] { bsz, textIds[0].Length };
|
||||
var textMaskShape = new long[] { bsz, 1, textMask[0][0].Length };
|
||||
|
||||
var textIdsTensor = Helper.IntArrayToTensor(textIds, textIdsShape);
|
||||
var textMaskTensor = Helper.ArrayToTensor(textMask, textMaskShape);
|
||||
|
||||
var styleTtlTensor = new DenseTensor<float>(style.Ttl, style.TtlShape.Select(x => (int)x).ToArray());
|
||||
var styleDpTensor = new DenseTensor<float>(style.Dp, style.DpShape.Select(x => (int)x).ToArray());
|
||||
|
||||
// Run duration predictor
|
||||
var dpInputs = new List<NamedOnnxValue>
|
||||
{
|
||||
NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
|
||||
NamedOnnxValue.CreateFromTensor("style_dp", styleDpTensor),
|
||||
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
|
||||
};
|
||||
using var dpOutputs = _dpOrt.Run(dpInputs);
|
||||
var durOnnx = dpOutputs.First(o => o.Name == "duration").AsTensor<float>().ToArray();
|
||||
|
||||
// Run text encoder
|
||||
var textEncInputs = new List<NamedOnnxValue>
|
||||
{
|
||||
NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
|
||||
NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
|
||||
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
|
||||
};
|
||||
using var textEncOutputs = _textEncOrt.Run(textEncInputs);
|
||||
var textEmbTensor = textEncOutputs.First(o => o.Name == "text_emb").AsTensor<float>();
|
||||
|
||||
// Sample noisy latent
|
||||
var (xt, latentMask) = SampleNoisyLatent(durOnnx);
|
||||
var latentShape = new long[] { bsz, xt[0].Length, xt[0][0].Length };
|
||||
var latentMaskShape = new long[] { bsz, 1, latentMask[0][0].Length };
|
||||
|
||||
var totalStepArray = Enumerable.Repeat((float)totalStep, bsz).ToArray();
|
||||
|
||||
// Iterative denoising
|
||||
for (int step = 0; step < totalStep; step++)
|
||||
{
|
||||
var currentStepArray = Enumerable.Repeat((float)step, bsz).ToArray();
|
||||
|
||||
var vectorEstInputs = new List<NamedOnnxValue>
|
||||
{
|
||||
NamedOnnxValue.CreateFromTensor("noisy_latent", Helper.ArrayToTensor(xt, latentShape)),
|
||||
NamedOnnxValue.CreateFromTensor("text_emb", textEmbTensor),
|
||||
NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
|
||||
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor),
|
||||
NamedOnnxValue.CreateFromTensor("latent_mask", Helper.ArrayToTensor(latentMask, latentMaskShape)),
|
||||
NamedOnnxValue.CreateFromTensor("total_step", new DenseTensor<float>(totalStepArray, new int[] { bsz })),
|
||||
NamedOnnxValue.CreateFromTensor("current_step", new DenseTensor<float>(currentStepArray, new int[] { bsz }))
|
||||
};
|
||||
|
||||
using var vectorEstOutputs = _vectorEstOrt.Run(vectorEstInputs);
|
||||
var denoisedLatent = vectorEstOutputs.First(o => o.Name == "denoised_latent").AsTensor<float>();
|
||||
|
||||
// Update xt
|
||||
int idx = 0;
|
||||
for (int b = 0; b < bsz; b++)
|
||||
{
|
||||
for (int d = 0; d < xt[b].Length; d++)
|
||||
{
|
||||
for (int t = 0; t < xt[b][d].Length; t++)
|
||||
{
|
||||
xt[b][d][t] = denoisedLatent.GetValue(idx++);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Run vocoder
|
||||
var vocoderInputs = new List<NamedOnnxValue>
|
||||
{
|
||||
NamedOnnxValue.CreateFromTensor("latent", Helper.ArrayToTensor(xt, latentShape))
|
||||
};
|
||||
using var vocoderOutputs = _vocoderOrt.Run(vocoderInputs);
|
||||
var wavTensor = vocoderOutputs.First(o => o.Name == "wav_tts").AsTensor<float>();
|
||||
|
||||
return (wavTensor.ToArray(), durOnnx);
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Helper class with utility functions
|
||||
// ============================================================================
|
||||
|
||||
public static class Helper
|
||||
{
|
||||
// ============================================================================
|
||||
// Utility functions
|
||||
// ============================================================================
|
||||
|
||||
public static float[][][] LengthToMask(long[] lengths, long maxLen = -1)
|
||||
{
|
||||
if (maxLen == -1)
|
||||
{
|
||||
maxLen = lengths.Max();
|
||||
}
|
||||
|
||||
var mask = new float[lengths.Length][][];
|
||||
for (int i = 0; i < lengths.Length; i++)
|
||||
{
|
||||
mask[i] = new float[1][];
|
||||
mask[i][0] = new float[maxLen];
|
||||
for (int j = 0; j < maxLen; j++)
|
||||
{
|
||||
mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
|
||||
}
|
||||
}
|
||||
return mask;
|
||||
}
|
||||
|
||||
public static float[][][] GetLatentMask(long[] wavLengths, int baseChunkSize, int chunkCompressFactor)
|
||||
{
|
||||
int latentSize = baseChunkSize * chunkCompressFactor;
|
||||
var latentLengths = wavLengths.Select(len => (len + latentSize - 1) / latentSize).ToArray();
|
||||
return LengthToMask(latentLengths);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// ONNX model loading
|
||||
// ============================================================================
|
||||
|
||||
public static InferenceSession LoadOnnx(string onnxPath, SessionOptions opts)
|
||||
{
|
||||
return new InferenceSession(onnxPath, opts);
|
||||
}
|
||||
|
||||
public static (InferenceSession dp, InferenceSession textEnc, InferenceSession vectorEst, InferenceSession vocoder)
|
||||
LoadOnnxAll(string onnxDir, SessionOptions opts)
|
||||
{
|
||||
var dpPath = Path.Combine(onnxDir, "duration_predictor.onnx");
|
||||
var textEncPath = Path.Combine(onnxDir, "text_encoder.onnx");
|
||||
var vectorEstPath = Path.Combine(onnxDir, "vector_estimator.onnx");
|
||||
var vocoderPath = Path.Combine(onnxDir, "vocoder.onnx");
|
||||
|
||||
return (
|
||||
LoadOnnx(dpPath, opts),
|
||||
LoadOnnx(textEncPath, opts),
|
||||
LoadOnnx(vectorEstPath, opts),
|
||||
LoadOnnx(vocoderPath, opts)
|
||||
);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Configuration loading
|
||||
// ============================================================================
|
||||
|
||||
public static Config LoadCfgs(string onnxDir)
|
||||
{
|
||||
var cfgPath = Path.Combine(onnxDir, "tts.json");
|
||||
var json = File.ReadAllText(cfgPath);
|
||||
|
||||
using var doc = JsonDocument.Parse(json);
|
||||
var root = doc.RootElement;
|
||||
|
||||
return new Config
|
||||
{
|
||||
AE = new Config.AEConfig
|
||||
{
|
||||
SampleRate = root.GetProperty("ae").GetProperty("sample_rate").GetInt32(),
|
||||
BaseChunkSize = root.GetProperty("ae").GetProperty("base_chunk_size").GetInt32()
|
||||
},
|
||||
TTL = new Config.TTLConfig
|
||||
{
|
||||
ChunkCompressFactor = root.GetProperty("ttl").GetProperty("chunk_compress_factor").GetInt32(),
|
||||
LatentDim = root.GetProperty("ttl").GetProperty("latent_dim").GetInt32()
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
public static UnicodeProcessor LoadTextProcessor(string onnxDir)
|
||||
{
|
||||
var unicodeIndexerPath = Path.Combine(onnxDir, "unicode_indexer.json");
|
||||
return new UnicodeProcessor(unicodeIndexerPath);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Voice style loading
|
||||
// ============================================================================
|
||||
|
||||
public static Style LoadVoiceStyle(List<string> voiceStylePaths, bool verbose = false)
|
||||
{
|
||||
int bsz = voiceStylePaths.Count;
|
||||
|
||||
// Read first file to get dimensions
|
||||
var firstJson = File.ReadAllText(voiceStylePaths[0]);
|
||||
using var firstDoc = JsonDocument.Parse(firstJson);
|
||||
var firstRoot = firstDoc.RootElement;
|
||||
|
||||
var ttlDims = ParseInt64Array(firstRoot.GetProperty("style_ttl").GetProperty("dims"));
|
||||
var dpDims = ParseInt64Array(firstRoot.GetProperty("style_dp").GetProperty("dims"));
|
||||
|
||||
long ttlDim1 = ttlDims[1];
|
||||
long ttlDim2 = ttlDims[2];
|
||||
long dpDim1 = dpDims[1];
|
||||
long dpDim2 = dpDims[2];
|
||||
|
||||
// Pre-allocate arrays with full batch size
|
||||
int ttlSize = (int)(bsz * ttlDim1 * ttlDim2);
|
||||
int dpSize = (int)(bsz * dpDim1 * dpDim2);
|
||||
var ttlFlat = new float[ttlSize];
|
||||
var dpFlat = new float[dpSize];
|
||||
|
||||
// Fill in the data
|
||||
for (int i = 0; i < bsz; i++)
|
||||
{
|
||||
var json = File.ReadAllText(voiceStylePaths[i]);
|
||||
using var doc = JsonDocument.Parse(json);
|
||||
var root = doc.RootElement;
|
||||
|
||||
// Flatten data
|
||||
var ttlData3D = ParseFloat3DArray(root.GetProperty("style_ttl").GetProperty("data"));
|
||||
var ttlDataFlat = new List<float>();
|
||||
foreach (var batch in ttlData3D)
|
||||
{
|
||||
foreach (var row in batch)
|
||||
{
|
||||
ttlDataFlat.AddRange(row);
|
||||
}
|
||||
}
|
||||
|
||||
var dpData3D = ParseFloat3DArray(root.GetProperty("style_dp").GetProperty("data"));
|
||||
var dpDataFlat = new List<float>();
|
||||
foreach (var batch in dpData3D)
|
||||
{
|
||||
foreach (var row in batch)
|
||||
{
|
||||
dpDataFlat.AddRange(row);
|
||||
}
|
||||
}
|
||||
|
||||
// Copy to pre-allocated array
|
||||
int ttlOffset = (int)(i * ttlDim1 * ttlDim2);
|
||||
ttlDataFlat.CopyTo(ttlFlat, ttlOffset);
|
||||
|
||||
int dpOffset = (int)(i * dpDim1 * dpDim2);
|
||||
dpDataFlat.CopyTo(dpFlat, dpOffset);
|
||||
}
|
||||
|
||||
var ttlShape = new long[] { bsz, ttlDim1, ttlDim2 };
|
||||
var dpShape = new long[] { bsz, dpDim1, dpDim2 };
|
||||
|
||||
if (verbose)
|
||||
{
|
||||
Console.WriteLine($"Loaded {bsz} voice styles");
|
||||
}
|
||||
|
||||
return new Style(ttlFlat, ttlShape, dpFlat, dpShape);
|
||||
}
|
||||
|
||||
private static float[][][] ParseFloat3DArray(JsonElement element)
|
||||
{
|
||||
var result = new List<float[][]>();
|
||||
foreach (var batch in element.EnumerateArray())
|
||||
{
|
||||
var batch2D = new List<float[]>();
|
||||
foreach (var row in batch.EnumerateArray())
|
||||
{
|
||||
var rowData = new List<float>();
|
||||
foreach (var val in row.EnumerateArray())
|
||||
{
|
||||
rowData.Add(val.GetSingle());
|
||||
}
|
||||
batch2D.Add(rowData.ToArray());
|
||||
}
|
||||
result.Add(batch2D.ToArray());
|
||||
}
|
||||
return result.ToArray();
|
||||
}
|
||||
|
||||
private static long[] ParseInt64Array(JsonElement element)
|
||||
{
|
||||
var result = new List<long>();
|
||||
foreach (var val in element.EnumerateArray())
|
||||
{
|
||||
result.Add(val.GetInt64());
|
||||
}
|
||||
return result.ToArray();
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// TextToSpeech loading
|
||||
// ============================================================================
|
||||
|
||||
public static TextToSpeech LoadTextToSpeech(string onnxDir, bool useGpu = false)
|
||||
{
|
||||
var opts = new SessionOptions();
|
||||
if (useGpu)
|
||||
{
|
||||
throw new NotImplementedException("GPU mode is not supported yet");
|
||||
}
|
||||
else
|
||||
{
|
||||
Console.WriteLine("Using CPU for inference");
|
||||
}
|
||||
|
||||
var cfgs = LoadCfgs(onnxDir);
|
||||
var (dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) = LoadOnnxAll(onnxDir, opts);
|
||||
var textProcessor = LoadTextProcessor(onnxDir);
|
||||
|
||||
return new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// WAV file writing
|
||||
// ============================================================================
|
||||
|
||||
public static void WriteWavFile(string filename, float[] audioData, int sampleRate)
|
||||
{
|
||||
using var writer = new BinaryWriter(File.Open(filename, FileMode.Create));
|
||||
|
||||
int numChannels = 1;
|
||||
int bitsPerSample = 16;
|
||||
int byteRate = sampleRate * numChannels * bitsPerSample / 8;
|
||||
short blockAlign = (short)(numChannels * bitsPerSample / 8);
|
||||
int dataSize = audioData.Length * bitsPerSample / 8;
|
||||
|
||||
// RIFF header
|
||||
writer.Write(Encoding.ASCII.GetBytes("RIFF"));
|
||||
writer.Write(36 + dataSize);
|
||||
writer.Write(Encoding.ASCII.GetBytes("WAVE"));
|
||||
|
||||
// fmt chunk
|
||||
writer.Write(Encoding.ASCII.GetBytes("fmt "));
|
||||
writer.Write(16); // fmt chunk size
|
||||
writer.Write((short)1); // audio format (PCM)
|
||||
writer.Write((short)numChannels);
|
||||
writer.Write(sampleRate);
|
||||
writer.Write(byteRate);
|
||||
writer.Write(blockAlign);
|
||||
writer.Write((short)bitsPerSample);
|
||||
|
||||
// data chunk
|
||||
writer.Write(Encoding.ASCII.GetBytes("data"));
|
||||
writer.Write(dataSize);
|
||||
|
||||
// Write audio data
|
||||
foreach (var sample in audioData)
|
||||
{
|
||||
float clamped = Math.Max(-1.0f, Math.Min(1.0f, sample));
|
||||
short intSample = (short)(clamped * 32767);
|
||||
writer.Write(intSample);
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Tensor conversion utilities
|
||||
// ============================================================================
|
||||
|
||||
public static DenseTensor<float> ArrayToTensor(float[][][] array, long[] dims)
|
||||
{
|
||||
var flat = new List<float>();
|
||||
foreach (var batch in array)
|
||||
{
|
||||
foreach (var row in batch)
|
||||
{
|
||||
flat.AddRange(row);
|
||||
}
|
||||
}
|
||||
return new DenseTensor<float>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
|
||||
}
|
||||
|
||||
public static DenseTensor<long> IntArrayToTensor(long[][] array, long[] dims)
|
||||
{
|
||||
var flat = new List<long>();
|
||||
foreach (var row in array)
|
||||
{
|
||||
flat.AddRange(row);
|
||||
}
|
||||
return new DenseTensor<long>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Timer utility
|
||||
// ============================================================================
|
||||
|
||||
public static T Timer<T>(string name, Func<T> func)
|
||||
{
|
||||
// Stopwatch is monotonic, unlike DateTime.Now, which can jump when the system clock changes
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
Console.WriteLine($"{name}...");
var result = func();
var elapsed = stopwatch.Elapsed.TotalSeconds;
|
||||
Console.WriteLine($" -> {name} completed in {elapsed:F2} sec");
|
||||
return result;
|
||||
}
|
||||
|
||||
public static string SanitizeFilename(string text, int maxLen)
|
||||
{
|
||||
var result = new StringBuilder();
|
||||
int count = 0;
|
||||
foreach (char c in text)
|
||||
{
|
||||
if (count >= maxLen) break;
|
||||
if (char.IsLetterOrDigit(c))
|
||||
{
|
||||
result.Append(c);
|
||||
}
|
||||
else
|
||||
{
|
||||
result.Append('_');
|
||||
}
|
||||
count++;
|
||||
}
|
||||
return result.ToString();
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,99 @@
|
||||
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `ExampleONNX.cs`.

## Installation

### Prerequisites
- .NET 9.0 SDK or later
- [Download .NET SDK](https://dotnet.microsoft.com/download)

### Install dependencies
```bash
dotnet restore
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
dotnet run
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
dotnet run -- \
  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```

This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch

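
The pairing rule behind batch mode can be sketched in a few lines. The example is written in Python for brevity and is not part of this project; `pair_inputs` is a hypothetical helper that mirrors what `ParseArgs` and `Main` in `ExampleONNX.cs` do: split `--voice-style` on commas, split `--text` on `|`, and reject mismatched counts.

```python
def pair_inputs(voice_style_arg: str, text_arg: str) -> list[tuple[str, str]]:
    """Split the raw CLI strings and pair each voice style with its text.

    Hypothetical helper for illustration only; it mirrors the comma / pipe
    splitting and the count check performed by the C# example.
    """
    voice_styles = voice_style_arg.split(",")
    texts = text_arg.split("|")
    if len(voice_styles) != len(texts):
        raise ValueError(
            f"Number of voice styles ({len(voice_styles)}) "
            f"must match number of texts ({len(texts)})"
        )
    return list(zip(voice_styles, texts))


pairs = pair_inputs(
    "assets/voice_styles/M1.json,assets/voice_styles/F1.json",
    "The sun sets behind the mountains.|The weather is beautiful.",
)
for style_path, text in pairs:
    print(f"{style_path} -> {text}")
```

Because `|` is the separator, a text that itself contains `|` cannot be passed as a single entry with this scheme.
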
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
dotnet run -- \
  --total-step 10 \
  --voice-style assets/voice_styles/M1.json \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

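
What `--total-step` controls can be sketched as a loop. The following Python sketch is not the project's code: `denoise`, `toy_estimator`, and `counting_estimator` are hypothetical names, and the toy estimator is only a stand-in for the `vector_estimator.onnx` model. It illustrates the loop structure used in `Helper.cs`, where the estimator is called once per step with the current latent, the step index, and the total step count, and its output becomes the next latent.

```python
from typing import Callable, List

Latent = List[float]
Estimator = Callable[[Latent, int, int], Latent]


def denoise(latent: Latent, estimator: Estimator, total_step: int) -> Latent:
    """Run the estimator `total_step` times, feeding each output back in."""
    xt = list(latent)
    for step in range(total_step):
        xt = estimator(xt, step, total_step)
    return xt


def toy_estimator(xt: Latent, current_step: int, total_step: int) -> Latent:
    """Stand-in for the real model: halves every value on each call."""
    return [0.5 * v for v in xt]


calls: List[int] = []


def counting_estimator(xt: Latent, current_step: int, total_step: int) -> Latent:
    """Wrap the toy estimator and record which step indices were requested."""
    calls.append(current_step)
    return toy_estimator(xt, current_step, total_step)


result = denoise([1.0, -2.0], counting_estimator, total_step=5)
print(len(calls))  # one model call per denoising step
print(result)
```

Each extra step adds one more run of the estimator, which is where the additional inference time comes from.
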
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
| `--text` | str+ | (long default text) | Text(s) to synthesize (pipe-separated: `\|`) |
| `--save-dir` | str | `results` | Output directory |

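
The flags and defaults in the table can be mirrored in a small parser. The sketch below uses Python's `argparse` purely for illustration; it is not the parser in `ExampleONNX.cs` (which walks `args` by hand), and `build_parser` and `DEFAULT_TEXT` are names made up for this example.

```python
import argparse

DEFAULT_TEXT = (
    "This morning, I took a walk in the park, and the sound of the birds and "
    "the breeze was so pleasant that I stopped for a long time just to listen."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults follow the arguments table."""
    parser = argparse.ArgumentParser(description="TTS inference options (sketch)")
    parser.add_argument("--use-gpu", action="store_true")
    parser.add_argument("--onnx-dir", default="assets/onnx")
    parser.add_argument("--total-step", type=int, default=5)
    parser.add_argument("--n-test", type=int, default=4)
    # Comma-separated paths are split into a list.
    parser.add_argument(
        "--voice-style",
        type=lambda s: s.split(","),
        default=["assets/voice_styles/M1.json"],
    )
    # Pipe-separated texts are split into a list.
    parser.add_argument("--text", type=lambda s: s.split("|"), default=[DEFAULT_TEXT])
    parser.add_argument("--save-dir", default="results")
    return parser


args = build_parser().parse_args(["--total-step", "10", "--text", "Hello.|Goodbye."])
print(args.total_step, args.n_test, args.text)
```

Any flag that is left out keeps the default listed in the table, so only the values that differ need to be passed.
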
## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet; passing `--use-gpu` throws a `NotImplementedException`

## Building the Project

### Build for Release
```bash
dotnet build -c Release
```

### Run the compiled executable
```bash
./bin/Release/net9.0/Supertonic
```

## Project Structure

```
csharp/
├── ExampleONNX.cs          # Main inference script
├── Helper.cs               # Helper functions and classes
├── Supertonic.csproj       # Project configuration
├── README.md               # This file
└── results/                # Output directory (created automatically)
```

@@ -0,0 +1,17 @@
|
||||
<Project Sdk="Microsoft.NET.Sdk">
|
||||
|
||||
<PropertyGroup>
|
||||
<OutputType>Exe</OutputType>
|
||||
<TargetFramework>net9.0</TargetFramework>
|
||||
<LangVersion>13.0</LangVersion>
|
||||
<Nullable>enable</Nullable>
|
||||
</PropertyGroup>
|
||||
|
||||
<ItemGroup>
|
||||
<PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.20.1" />
|
||||
<PackageReference Include="System.Text.Json" Version="9.0.1" />
|
||||
</ItemGroup>
|
||||
|
||||
</Project>
|
||||
|
||||
|
||||
Symlink
+1
@@ -0,0 +1 @@
|
||||
../assets
|
||||
@@ -0,0 +1,17 @@
# Binaries
tts_example
example_onnx
*.exe

# Go build artifacts
*.o
*.a
*.so

# Results
results/

# Go workspace
go.work
go.work.sum

+128
@@ -0,0 +1,128 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `example_onnx.go`.

## Installation

This project uses Go modules for dependency management.

### Prerequisites

1. Install Go 1.21 or later from [https://golang.org/dl/](https://golang.org/dl/)
2. Install ONNX Runtime C library:

**macOS (via Homebrew):**
```bash
brew install onnxruntime
```

**Linux:**
```bash
# Download ONNX Runtime from GitHub releases
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.0/onnxruntime-linux-x64-1.16.0.tgz
tar -xzf onnxruntime-linux-x64-1.16.0.tgz
sudo cp onnxruntime-linux-x64-1.16.0/lib/* /usr/local/lib/
sudo cp -r onnxruntime-linux-x64-1.16.0/include/* /usr/local/include/
sudo ldconfig
```

### Install Go dependencies

```bash
go mod download
```

### Configure ONNX Runtime Library Path (Optional)

If the ONNX Runtime library is not in a standard location, set the `ONNXRUNTIME_LIB_PATH` environment variable. When it is unset, the example checks `/usr/local/lib/libonnxruntime.dylib`, then `/usr/lib/libonnxruntime.so`, and otherwise uses `/usr/local/lib/libonnxruntime.so`.

**Automatic Detection (Recommended):**

```bash
# macOS
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib

# Linux
export ONNXRUNTIME_LIB_PATH=$(find /usr/local/lib /usr/lib -name "libonnxruntime.so*" 2>/dev/null | head -n 1)
```

**Manual Configuration:**

```bash
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.so  # Linux
# or
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib  # macOS
```
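
The lookup order can be sketched as follows. This is a minimal illustration modeled on `InitializeONNXRuntime` in `helper.go`; the names `resolveLibPath` and `fileExists` are hypothetical and not part of the project.

```go
package main

import (
	"fmt"
	"os"
)

// resolveLibPath returns the explicit override when it is non-empty,
// otherwise the first candidate for which exists reports true,
// otherwise the fallback. (Hypothetical helper, for illustration only.)
func resolveLibPath(override string, candidates []string, fallback string, exists func(string) bool) string {
	if override != "" {
		return override
	}
	for _, c := range candidates {
		if exists(c) {
			return c
		}
	}
	return fallback
}

// fileExists reports whether os.Stat succeeds for path.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	path := resolveLibPath(
		os.Getenv("ONNXRUNTIME_LIB_PATH"),
		[]string{
			"/usr/local/lib/libonnxruntime.dylib",
			"/usr/lib/libonnxruntime.so",
		},
		"/usr/local/lib/libonnxruntime.so",
		fileExists,
	)
	fmt.Println("ONNX Runtime library:", path)
}
```

Passing the existence check as a function keeps the lookup testable without touching the file system.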

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
go run example_onnx.go helper.go
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
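
Each generation is written to the output directory as a WAV file named after the start of the text plus the generation number. The sketch below illustrates that naming scheme under stated assumptions: it is modeled on `sanitizeFilename` and the save loop in the accompanying sources, `safePrefix` and `outputPath` are hypothetical names, and truncation here counts characters rather than bytes.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// safePrefix keeps at most maxLen characters of text and replaces everything
// except ASCII letters and digits with '_'. (Hypothetical helper.)
func safePrefix(text string, maxLen int) string {
	runes := []rune(text)
	if len(runes) > maxLen {
		runes = runes[:maxLen]
	}
	out := make([]rune, len(runes))
	for i, r := range runes {
		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			out[i] = r
		} else {
			out[i] = '_'
		}
	}
	return string(out)
}

// outputPath builds "<saveDir>/<prefix>_<run>.wav" with a 20-character prefix.
func outputPath(saveDir, text string, run int) string {
	return filepath.Join(saveDir, fmt.Sprintf("%s_%d.wav", safePrefix(text, 20), run))
}

func main() {
	for run := 1; run <= 4; run++ {
		fmt.Println(outputPath("results", "Hello, world!", run))
	}
}
```

Because the name depends only on the text prefix and the generation number, two texts in one batch that share their first 20 characters would map to the same file.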

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
go run example_onnx.go helper.go \
  -voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
  -text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```

This will:
- Generate speech for 2 different voice-text pairs
- Use male voice (M1.json) for the first text
- Use female voice (F1.json) for the second text
- Process both samples in a single batch
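
The pairing of voice styles and texts can be sketched as below. This is an illustration of the splitting and count check, not the project's exact code; `splitList`, `pair`, and `pairInputs` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// splitList splits s on sep and trims whitespace around each entry.
// An empty input yields no entries.
func splitList(s, sep string) []string {
	if s == "" {
		return nil
	}
	parts := strings.Split(s, sep)
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// pair couples one voice style file with the text it should speak.
type pair struct{ Style, Text string }

// pairInputs splits comma-separated styles and pipe-separated texts and
// matches them by position; it fails when the counts differ.
func pairInputs(styles, texts string) ([]pair, error) {
	s := splitList(styles, ",")
	t := splitList(texts, "|")
	if len(s) != len(t) {
		return nil, fmt.Errorf("number of voice styles (%d) must match number of texts (%d)", len(s), len(t))
	}
	pairs := make([]pair, len(s))
	for i := range s {
		pairs[i] = pair{Style: s[i], Text: t[i]}
	}
	return pairs, nil
}

func main() {
	pairs, err := pairInputs(
		"assets/voice_styles/M1.json,assets/voice_styles/F1.json",
		"First sentence.|Second sentence.",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, p := range pairs {
		fmt.Printf("%d: %s -> %q\n", i, p.Style, p.Text)
	}
}
```

Texts are separated by `|` rather than `,` so that commas inside a sentence do not split it.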

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
go run example_onnx.go helper.go \
  -total-step 10 \
  -voice-style "assets/voice_styles/M1.json" \
  -text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
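
Why more steps cost more time can be seen from the shape of the loop: every step is one full pass through the vector estimator, which receives the current step and the total step count for each batch item. The sketch below shows only that control flow; `estimator`, `filled`, and `runDenoising` are hypothetical stand-ins, and the toy estimator does no real denoising.

```go
package main

import "fmt"

// estimator stands in for one vector-estimator pass (hypothetical type).
type estimator func(latent, currentStep, totalStep []float32) []float32

// filled returns a slice of length n with every element set to v.
func filled(n int, v float32) []float32 {
	out := make([]float32, n)
	for i := range out {
		out[i] = v
	}
	return out
}

// runDenoising calls estimate once per step, feeding each result into the
// next pass, so the number of passes equals totalStep.
func runDenoising(latent []float32, bsz, totalStep int, estimate estimator) []float32 {
	total := filled(bsz, float32(totalStep))
	for step := 0; step < totalStep; step++ {
		latent = estimate(latent, filled(bsz, float32(step)), total)
	}
	return latent
}

func main() {
	calls := 0
	toy := func(latent, cur, total []float32) []float32 {
		calls++
		fmt.Printf("pass %d: current_step=%v total_step=%v\n", calls, cur, total)
		return latent // a real estimator would return an updated latent
	}
	runDenoising(make([]float32, 4), 2, 10, toy)
	fmt.Println("estimator passes:", calls)
}
```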

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `-use-gpu` | flag | false | Use GPU for inference (default: CPU) |
| `-onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `-total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `-n-test` | int | 4 | Number of times to generate each sample |
| `-voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
| `-save-dir` | str | `results` | Output directory |

## Notes

- **Batch Processing**: The number of `-voice-style` files must match the number of `-text` entries
- **Quality vs Speed**: Higher `-total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet; passing `-use-gpu` makes the example exit with an error

## Building a Binary

To build a standalone executable:
```bash
go build -o tts_example example_onnx.go helper.go
```

Then run it:
```bash
./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
```

@@ -0,0 +1,144 @@
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"

	ort "github.com/yalue/onnxruntime_go"
)

// Args holds command line arguments
type Args struct {
	useGPU     bool
	onnxDir    string
	totalStep  int
	nTest      int
	voiceStyle []string
	text       []string
	saveDir    string
}

func parseArgs() *Args {
	args := &Args{}

	flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
	flag.StringVar(&args.onnxDir, "onnx-dir", "assets/onnx", "Path to ONNX model directory")
	flag.IntVar(&args.totalStep, "total-step", 5, "Number of denoising steps")
	flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
	flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")

	var voiceStyleStr, textStr string
	flag.StringVar(&voiceStyleStr, "voice-style", "assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
	flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")

	flag.Parse()

	// Parse comma-separated voice-style
	if voiceStyleStr != "" {
		args.voiceStyle = strings.Split(voiceStyleStr, ",")
		for i := range args.voiceStyle {
			args.voiceStyle[i] = strings.TrimSpace(args.voiceStyle[i])
		}
	}

	// Parse pipe-separated text
	if textStr != "" {
		args.text = strings.Split(textStr, "|")
		for i := range args.text {
			args.text[i] = strings.TrimSpace(args.text[i])
		}
	}

	return args
}

func main() {
	// fmt.Print with explicit newlines avoids the redundant-newline warning
	// that go vet raises for fmt.Println("...\n").
	fmt.Print("=== TTS Inference with ONNX Runtime (Go) ===\n\n")

	// --- 1. Parse arguments --- //
	args := parseArgs()
	totalStep := args.totalStep
	nTest := args.nTest
	saveDir := args.saveDir
	voiceStylePaths := args.voiceStyle
	textList := args.text

	if len(voiceStylePaths) != len(textList) {
		fmt.Printf("Error: Number of voice styles (%d) must match number of texts (%d)\n",
			len(voiceStylePaths), len(textList))
		os.Exit(1)
	}

	bsz := len(voiceStylePaths)

	// Initialize ONNX Runtime
	if err := InitializeONNXRuntime(); err != nil {
		fmt.Printf("Error initializing ONNX Runtime: %v\n", err)
		os.Exit(1)
	}
	defer ort.DestroyEnvironment()

	// --- 2. Load config --- //
	cfg, err := LoadCfgs(args.onnxDir)
	if err != nil {
		fmt.Printf("Error loading config: %v\n", err)
		os.Exit(1)
	}

	// --- 3. Load TTS components --- //
	textToSpeech, err := LoadTextToSpeech(args.onnxDir, args.useGPU, cfg)
	if err != nil {
		fmt.Printf("Error loading TTS components: %v\n", err)
		os.Exit(1)
	}
	defer textToSpeech.Destroy()

	// --- 4. Load voice styles --- //
	style, err := LoadVoiceStyle(voiceStylePaths, true)
	if err != nil {
		fmt.Printf("Error loading voice styles: %v\n", err)
		os.Exit(1)
	}
	defer style.Destroy()

	// --- 5. Synthesize speech --- //
	if err := os.MkdirAll(saveDir, 0755); err != nil {
		fmt.Printf("Error creating save directory: %v\n", err)
		os.Exit(1)
	}

	for n := 0; n < nTest; n++ {
		fmt.Printf("\n[%d/%d] Starting synthesis...\n", n+1, nTest)

		var wav []float32
		var duration []float32
		Timer("Generating speech from text", func() interface{} {
			w, d, err := textToSpeech.Call(textList, style, totalStep)
			if err != nil {
				fmt.Printf("Error generating speech: %v\n", err)
				os.Exit(1)
			}
			wav = w
			duration = d
			return nil
		})

		// Save outputs
		for i := 0; i < bsz; i++ {
			fname := fmt.Sprintf("%s_%d.wav", sanitizeFilename(textList[i], 20), n+1)
			wavOut := extractWavSegment(wav, duration[i], textToSpeech.SampleRate, i, bsz)

			outputPath := filepath.Join(saveDir, fname)
			if err := writeWavFile(outputPath, wavOut, textToSpeech.SampleRate); err != nil {
				fmt.Printf("Error writing wav file: %v\n", err)
				continue
			}
			fmt.Printf("Saved: %s\n", outputPath)
		}
	}

	fmt.Println("\n=== Synthesis completed successfully! ===")
}
@@ -0,0 +1,12 @@
module supertonic-tts

go 1.21

require (
	github.com/go-audio/audio v1.0.0
	github.com/go-audio/wav v1.1.0
	github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12
	github.com/yalue/onnxruntime_go v1.11.0
)

require github.com/go-audio/riff v1.0.0 // indirect
@@ -0,0 +1,10 @@
github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
github.com/go-audio/audio v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
github.com/go-audio/riff v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
github.com/go-audio/riff v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12 h1:dd7vnTDfjtwCETZDrRe+GPYNLA1jBtbZeyfyE8eZCyk=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12/go.mod h1:i/KKcxEWEO8Yyl11DYafRPKOPVYTrhxiTRigjtEEXZU=
github.com/yalue/onnxruntime_go v1.11.0 h1:aKH4yPIbqfcB3SfnQWq/WxzLelkyolntHnffL3eMBHY=
github.com/yalue/onnxruntime_go v1.11.0/go.mod h1:b4X26A8pekNb1ACJ58wAXgNKeUCGEAQ9dmACut9Sm/4=
+734
@@ -0,0 +1,734 @@
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"math/rand"
	"os"
	"path/filepath"
	"time"

	"github.com/go-audio/audio"
	"github.com/go-audio/wav"
	ort "github.com/yalue/onnxruntime_go"
)

// Config structures
type SpecProcessorConfig struct {
	NFFT      int     `json:"n_fft"`
	WinLength int     `json:"win_length"`
	HopLength int     `json:"hop_length"`
	NMels     int     `json:"n_mels"`
	Eps       float64 `json:"eps"`
	NormMean  float64 `json:"norm_mean"`
	NormStd   float64 `json:"norm_std"`
}

type EncoderConfig struct {
	SpecProcessor SpecProcessorConfig `json:"spec_processor"`
}

type AEConfig struct {
	SampleRate    int           `json:"sample_rate"`
	BaseChunkSize int           `json:"base_chunk_size"`
	Encoder       EncoderConfig `json:"encoder"`
}

type StyleTokenLayerConfig struct {
	NStyle        int `json:"n_style"`
	StyleValueDim int `json:"style_value_dim"`
}

type StyleEncoderConfig struct {
	StyleTokenLayer StyleTokenLayerConfig `json:"style_token_layer"`
}

type ProjOutConfig struct {
	Idim int `json:"idim"`
	Odim int `json:"odim"`
}

type TextEncoderConfig struct {
	ProjOut ProjOutConfig `json:"proj_out"`
}

type TTLConfig struct {
	ChunkCompressFactor int                `json:"chunk_compress_factor"`
	LatentDim           int                `json:"latent_dim"`
	StyleEncoder        StyleEncoderConfig `json:"style_encoder"`
	TextEncoder         TextEncoderConfig  `json:"text_encoder"`
}

type DPStyleEncoderConfig struct {
	StyleTokenLayer StyleTokenLayerConfig `json:"style_token_layer"`
}

type DPConfig struct {
	LatentDim           int                  `json:"latent_dim"`
	ChunkCompressFactor int                  `json:"chunk_compress_factor"`
	StyleEncoder        DPStyleEncoderConfig `json:"style_encoder"`
}

type Config struct {
	AE  AEConfig  `json:"ae"`
	TTL TTLConfig `json:"ttl"`
	DP  DPConfig  `json:"dp"`
}

// VoiceStyleData holds voice style JSON structure
type VoiceStyleData struct {
	StyleTTL struct {
		Data [][][]float64 `json:"data"`
		Dims []int64       `json:"dims"`
		Type string        `json:"type"`
	} `json:"style_ttl"`
	StyleDP struct {
		Data [][][]float64 `json:"data"`
		Dims []int64       `json:"dims"`
		Type string        `json:"type"`
	} `json:"style_dp"`
}

// UnicodeProcessor for text processing
type UnicodeProcessor struct {
	indexer []int64
}

// NewUnicodeProcessor creates a new UnicodeProcessor
func NewUnicodeProcessor(unicodeIndexerPath string) (*UnicodeProcessor, error) {
	indexer, err := loadJSONInt64(unicodeIndexerPath)
	if err != nil {
		return nil, fmt.Errorf("failed to load unicode indexer: %w", err)
	}

	return &UnicodeProcessor{indexer: indexer}, nil
}

// Call processes text list to text IDs and mask
func (up *UnicodeProcessor) Call(textList []string) ([][]int64, [][][]float64) {
	// Preprocess texts
	processedTexts := make([]string, len(textList))
	for i, text := range textList {
		processedTexts[i] = preprocessText(text)
	}

	// Get text lengths
	textLengths := make([]int64, len(processedTexts))
	maxLen := 0
	for i, text := range processedTexts {
		textLengths[i] = int64(len([]rune(text)))
		if int(textLengths[i]) > maxLen {
			maxLen = int(textLengths[i])
		}
	}

	// Create text IDs
	textIDs := make([][]int64, len(processedTexts))
	for i, text := range processedTexts {
		row := make([]int64, maxLen)
		runes := []rune(text)
		for j, r := range runes {
			unicodeVal := int(r)
			if unicodeVal < len(up.indexer) {
				row[j] = up.indexer[unicodeVal]
			} else {
				row[j] = -1
			}
		}
		textIDs[i] = row
	}

	// Create text mask
	textMask := lengthToMask(textLengths, maxLen)

	return textIDs, textMask
}

// Utility functions
func preprocessText(text string) string {
	// Simple normalization (Go doesn't have built-in NFKD normalization)
	// For full Unicode normalization, use golang.org/x/text/unicode/norm
	return text
}

func lengthToMask(lengths []int64, maxLen int) [][][]float64 {
	bsz := len(lengths)
	mask := make([][][]float64, bsz)

	for i := 0; i < bsz; i++ {
		row := make([]float64, maxLen)
		for j := 0; j < maxLen; j++ {
			if int64(j) < lengths[i] {
				row[j] = 1.0
			} else {
				row[j] = 0.0
			}
		}
		mask[i] = [][]float64{row}
	}

	return mask
}

func getTextMask(textLengths []int64, maxLen int) [][][]float64 {
	return lengthToMask(textLengths, maxLen)
}

func getLatentMask(wavLengths []int64, cfg Config) [][][]float64 {
	baseChunkSize := int64(cfg.AE.BaseChunkSize)
	chunkCompressFactor := int64(cfg.TTL.ChunkCompressFactor)
	latentSize := baseChunkSize * chunkCompressFactor

	latentLengths := make([]int64, len(wavLengths))
	maxLen := int64(0)
	for i, wavLen := range wavLengths {
		latentLengths[i] = (wavLen + latentSize - 1) / latentSize
		if latentLengths[i] > maxLen {
			maxLen = latentLengths[i]
		}
	}

	return lengthToMask(latentLengths, int(maxLen))
}

func writeWavFile(filename string, audioData []float64, sampleRate int) error {
	file, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer file.Close()

	// Convert float64 to int
	intData := make([]int, len(audioData))
	for i, sample := range audioData {
		// Clamp to [-1, 1] and convert to 16-bit int
		clamped := math.Max(-1.0, math.Min(1.0, sample))
		intData[i] = int(clamped * 32767)
	}

	encoder := wav.NewEncoder(file, sampleRate, 16, 1, 1)
	buf := &audio.IntBuffer{
		Data:           intData,
		Format:         &audio.Format{SampleRate: sampleRate, NumChannels: 1},
		SourceBitDepth: 16,
	}

	if err := encoder.Write(buf); err != nil {
		return err
	}

	return encoder.Close()
}

// Style holds style tensors
type Style struct {
	TtlTensor *ort.Tensor[float32]
	DpTensor  *ort.Tensor[float32]
}

func (s *Style) Destroy() {
	if s.TtlTensor != nil {
		s.TtlTensor.Destroy()
	}
	if s.DpTensor != nil {
		s.DpTensor.Destroy()
	}
}

// LoadVoiceStyle loads voice style from JSON files
func LoadVoiceStyle(voiceStylePaths []string, verbose bool) (*Style, error) {
	bsz := len(voiceStylePaths)
	// Guard against an empty list, which would otherwise panic on index 0 below.
	if bsz == 0 {
		return nil, fmt.Errorf("no voice style files given")
	}

	// Read first file to get dimensions
	firstData, err := os.ReadFile(voiceStylePaths[0])
	if err != nil {
		return nil, fmt.Errorf("failed to read voice style file: %w", err)
	}

	var firstStyle VoiceStyleData
	if err := json.Unmarshal(firstData, &firstStyle); err != nil {
		return nil, fmt.Errorf("failed to parse voice style JSON: %w", err)
	}

	ttlDims := firstStyle.StyleTTL.Dims
	dpDims := firstStyle.StyleDP.Dims
	// Both dims arrays are indexed up to [2] below, so require three entries.
	if len(ttlDims) < 3 || len(dpDims) < 3 {
		return nil, fmt.Errorf("voice style dims need 3 entries, got %v and %v", ttlDims, dpDims)
	}

	ttlDim1 := ttlDims[1]
	ttlDim2 := ttlDims[2]
	dpDim1 := dpDims[1]
	dpDim2 := dpDims[2]

	// Pre-allocate arrays with full batch size
	ttlSize := int(int64(bsz) * ttlDim1 * ttlDim2)
	dpSize := int(int64(bsz) * dpDim1 * dpDim2)
	ttlFlat := make([]float32, ttlSize)
	dpFlat := make([]float32, dpSize)

	// Fill in the data
	for i := 0; i < bsz; i++ {
		data, err := os.ReadFile(voiceStylePaths[i])
		if err != nil {
			return nil, fmt.Errorf("failed to read voice style file: %w", err)
		}

		var voiceStyle VoiceStyleData
		if err := json.Unmarshal(data, &voiceStyle); err != nil {
			return nil, fmt.Errorf("failed to parse voice style JSON: %w", err)
		}

		// Flatten TTL data
		ttlOffset := int(int64(i) * ttlDim1 * ttlDim2)
		idx := 0
		for _, batch := range voiceStyle.StyleTTL.Data {
			for _, row := range batch {
				for _, val := range row {
					ttlFlat[ttlOffset+idx] = float32(val)
					idx++
				}
			}
		}

		// Flatten DP data
		dpOffset := int(int64(i) * dpDim1 * dpDim2)
		idx = 0
		for _, batch := range voiceStyle.StyleDP.Data {
			for _, row := range batch {
				for _, val := range row {
					dpFlat[dpOffset+idx] = float32(val)
					idx++
				}
			}
		}
	}

	ttlShape := []int64{int64(bsz), ttlDim1, ttlDim2}
	dpShape := []int64{int64(bsz), dpDim1, dpDim2}

	ttlTensor, err := ort.NewTensor(ttlShape, ttlFlat)
	if err != nil {
		return nil, fmt.Errorf("failed to create TTL tensor: %w", err)
	}

	dpTensor, err := ort.NewTensor(dpShape, dpFlat)
	if err != nil {
		ttlTensor.Destroy()
		return nil, fmt.Errorf("failed to create DP tensor: %w", err)
	}

	if verbose {
		fmt.Printf("Loaded %d voice styles\n\n", bsz)
	}

	return &Style{
		TtlTensor: ttlTensor,
		DpTensor:  dpTensor,
	}, nil
}

// TextToSpeech generates speech from text
type TextToSpeech struct {
	cfg           Config
	textProcessor *UnicodeProcessor
	dpOrt         *ort.DynamicAdvancedSession
	textEncOrt    *ort.DynamicAdvancedSession
	vectorEstOrt  *ort.DynamicAdvancedSession
	vocoderOrt    *ort.DynamicAdvancedSession
	SampleRate    int
	baseChunkSize int
	chunkCompress int
	ldim          int
}

func (tts *TextToSpeech) sampleNoisyLatent(durOnnx []float32) ([][][]float64, [][][]float64) {
	bsz := len(durOnnx)
	maxDur := float64(0)
	for _, d := range durOnnx {
		if float64(d) > maxDur {
			maxDur = float64(d)
		}
	}

	wavLenMax := maxDur * float64(tts.SampleRate)
	wavLengths := make([]int64, bsz)
	for i, d := range durOnnx {
		wavLengths[i] = int64(float64(d) * float64(tts.SampleRate))
	}

	chunkSize := tts.baseChunkSize * tts.chunkCompress
	latentLen := int((wavLenMax + float64(chunkSize) - 1) / float64(chunkSize))
	latentDim := tts.ldim * tts.chunkCompress

	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	noisyLatent := make([][][]float64, bsz)
	for b := 0; b < bsz; b++ {
		batch := make([][]float64, latentDim)
		for d := 0; d < latentDim; d++ {
			row := make([]float64, latentLen)
			for t := 0; t < latentLen; t++ {
				// Box-Muller transform for normal distribution
				// Add epsilon to avoid log(0)
				const eps = 1e-10
				u1 := math.Max(eps, rng.Float64())
				u2 := rng.Float64()
				row[t] = math.Sqrt(-2.0*math.Log(u1)) * math.Cos(2.0*math.Pi*u2)
			}
			batch[d] = row
		}
		noisyLatent[b] = batch
	}

	latentMask := getLatentMask(wavLengths, tts.cfg)

	// Apply mask
	for b := 0; b < bsz; b++ {
		for d := 0; d < latentDim; d++ {
			for t := 0; t < latentLen; t++ {
				noisyLatent[b][d][t] *= latentMask[b][0][t]
			}
		}
	}

	return noisyLatent, latentMask
}

func (tts *TextToSpeech) Call(textList []string, style *Style, totalStep int) ([]float32, []float32, error) {
	bsz := len(textList)

	// Process text
	textIDs, textMask := tts.textProcessor.Call(textList)
	textIDsShape := []int64{int64(bsz), int64(len(textIDs[0]))}
	textMaskShape := []int64{int64(bsz), 1, int64(len(textMask[0][0]))}

	textIDsTensor := IntArrayToTensor(textIDs, textIDsShape)
	defer textIDsTensor.Destroy()
	textMaskTensor := ArrayToTensor(textMask, textMaskShape)
	defer textMaskTensor.Destroy()

	// Predict duration
	dpOutputs := []ort.Value{nil}
	err := tts.dpOrt.Run(
		[]ort.Value{textIDsTensor, style.DpTensor, textMaskTensor},
		dpOutputs,
	)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to run duration predictor: %w", err)
	}
	durTensor := dpOutputs[0].(*ort.Tensor[float32])
	defer durTensor.Destroy()
	// Copy the durations: this slice is returned to the caller, and the
	// deferred Destroy() may release the memory that GetData() points into.
	durOnnx := append([]float32(nil), durTensor.GetData()...)

	// Encode text
	textIDsTensor2 := IntArrayToTensor(textIDs, textIDsShape)
	defer textIDsTensor2.Destroy()
	textEncOutputs := []ort.Value{nil}
	err = tts.textEncOrt.Run(
		[]ort.Value{textIDsTensor2, style.TtlTensor, textMaskTensor},
		textEncOutputs,
	)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to run text encoder: %w", err)
	}
	textEmbTensor := textEncOutputs[0].(*ort.Tensor[float32])
	defer textEmbTensor.Destroy()

	// Sample noisy latent
	xt, latentMask := tts.sampleNoisyLatent(durOnnx)
	latentShape := []int64{int64(bsz), int64(len(xt[0])), int64(len(xt[0][0]))}
	latentMaskShape := []int64{int64(bsz), 1, int64(len(latentMask[0][0]))}

	// Prepare constant arrays
	totalStepArray := make([]float32, bsz)
	for b := 0; b < bsz; b++ {
		totalStepArray[b] = float32(totalStep)
	}
	scalarShape := []int64{int64(bsz)}

	// Check the error instead of discarding it, so a failed allocation does
	// not leave a nil tensor for the deferred Destroy() and the Run calls.
	totalStepTensor, err := ort.NewTensor(scalarShape, totalStepArray)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to create total-step tensor: %w", err)
	}
	defer totalStepTensor.Destroy()

	// Denoising loop
	for step := 0; step < totalStep; step++ {
		currentStepArray := make([]float32, bsz)
		for b := 0; b < bsz; b++ {
			currentStepArray[b] = float32(step)
		}

		currentStepTensor, stepErr := ort.NewTensor(scalarShape, currentStepArray)
		if stepErr != nil {
			return nil, nil, fmt.Errorf("failed to create current-step tensor: %w", stepErr)
		}
		noisyLatentTensor := ArrayToTensor(xt, latentShape)
		latentMaskTensor := ArrayToTensor(latentMask, latentMaskShape)
		textMaskTensor2 := ArrayToTensor(textMask, textMaskShape)

		vectorEstOutputs := []ort.Value{nil}
		err = tts.vectorEstOrt.Run(
			[]ort.Value{noisyLatentTensor, textEmbTensor, style.TtlTensor, latentMaskTensor, textMaskTensor2,
				currentStepTensor, totalStepTensor},
			vectorEstOutputs,
		)

		// The per-step inputs are not needed after Run returns; releasing them
		// here also avoids leaking them when Run fails.
		noisyLatentTensor.Destroy()
		latentMaskTensor.Destroy()
		textMaskTensor2.Destroy()
		currentStepTensor.Destroy()
		if err != nil {
			return nil, nil, fmt.Errorf("failed to run vector estimator: %w", err)
		}

		denoisedTensor := vectorEstOutputs[0].(*ort.Tensor[float32])
		denoisedData := denoisedTensor.GetData()

		// Update latent
		idx := 0
		for b := 0; b < bsz; b++ {
			for d := 0; d < len(xt[b]); d++ {
				for t := 0; t < len(xt[b][d]); t++ {
					xt[b][d][t] = float64(denoisedData[idx])
					idx++
				}
			}
		}

		denoisedTensor.Destroy()
	}

	// Generate waveform
	finalLatentTensor := ArrayToTensor(xt, latentShape)
	defer finalLatentTensor.Destroy()

	vocoderOutputs := []ort.Value{nil}
	err = tts.vocoderOrt.Run(
		[]ort.Value{finalLatentTensor},
		vocoderOutputs,
	)
	if err != nil {
		return nil, nil, fmt.Errorf("failed to run vocoder: %w", err)
	}

	wavBatchTensor := vocoderOutputs[0].(*ort.Tensor[float32])
	defer wavBatchTensor.Destroy()
	// Copy the samples: the slice is returned to the caller, and the deferred
	// Destroy() may release the memory that GetData() points into.
	wav := append([]float32(nil), wavBatchTensor.GetData()...)

	return wav, durOnnx, nil
}

func (tts *TextToSpeech) Destroy() {
	if tts.dpOrt != nil {
		tts.dpOrt.Destroy()
	}
	if tts.textEncOrt != nil {
		tts.textEncOrt.Destroy()
	}
	if tts.vectorEstOrt != nil {
		tts.vectorEstOrt.Destroy()
	}
	if tts.vocoderOrt != nil {
		tts.vocoderOrt.Destroy()
	}
}

// LoadTextToSpeech loads TTS components
func LoadTextToSpeech(onnxDir string, useGPU bool, cfg Config) (*TextToSpeech, error) {
	if useGPU {
		return nil, fmt.Errorf("GPU mode is not supported yet")
	}
	// fmt.Print with explicit newlines avoids go vet's redundant-newline warning.
	fmt.Print("Using CPU for inference\n\n")

	// Load models
	dpPath := filepath.Join(onnxDir, "duration_predictor.onnx")
	textEncPath := filepath.Join(onnxDir, "text_encoder.onnx")
	vectorEstPath := filepath.Join(onnxDir, "vector_estimator.onnx")
	vocoderPath := filepath.Join(onnxDir, "vocoder.onnx")

	dpOrt, err := ort.NewDynamicAdvancedSession(dpPath, []string{"text_ids", "style_dp", "text_mask"},
		[]string{"duration"}, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to load duration predictor: %w", err)
	}

	textEncOrt, err := ort.NewDynamicAdvancedSession(textEncPath, []string{"text_ids", "style_ttl", "text_mask"},
		[]string{"text_emb"}, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to load text encoder: %w", err)
	}

	vectorEstOrt, err := ort.NewDynamicAdvancedSession(vectorEstPath,
		[]string{"noisy_latent", "text_emb", "style_ttl", "latent_mask", "text_mask", "current_step", "total_step"},
		[]string{"denoised_latent"}, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to load vector estimator: %w", err)
	}

	vocoderOrt, err := ort.NewDynamicAdvancedSession(vocoderPath, []string{"latent"},
		[]string{"wav_tts"}, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to load vocoder: %w", err)
	}

	// Load text processor
	unicodeIndexerPath := filepath.Join(onnxDir, "unicode_indexer.json")
	textProcessor, err := NewUnicodeProcessor(unicodeIndexerPath)
	if err != nil {
		return nil, err
	}

	textToSpeech := &TextToSpeech{
		cfg:           cfg,
		textProcessor: textProcessor,
		dpOrt:         dpOrt,
		textEncOrt:    textEncOrt,
		vectorEstOrt:  vectorEstOrt,
		vocoderOrt:    vocoderOrt,
		SampleRate:    cfg.AE.SampleRate,
		baseChunkSize: cfg.AE.BaseChunkSize,
		chunkCompress: cfg.TTL.ChunkCompressFactor,
		ldim:          cfg.TTL.LatentDim,
	}

	return textToSpeech, nil
}

// InitializeONNXRuntime initializes ONNX Runtime environment
func InitializeONNXRuntime() error {
	libPath := os.Getenv("ONNXRUNTIME_LIB_PATH")
	if libPath == "" {
		libPath = "/usr/local/lib/libonnxruntime.so"
		if _, err := os.Stat("/usr/local/lib/libonnxruntime.dylib"); err == nil {
			libPath = "/usr/local/lib/libonnxruntime.dylib"
		} else if _, err := os.Stat("/usr/lib/libonnxruntime.so"); err == nil {
			libPath = "/usr/lib/libonnxruntime.so"
		}
	}
	ort.SetSharedLibraryPath(libPath)

	if err := ort.InitializeEnvironment(); err != nil {
		return fmt.Errorf("failed to initialize ONNX Runtime: %w\nHint: Set ONNXRUNTIME_LIB_PATH environment variable", err)
	}
	return nil
}

// sanitizeFilename creates a safe filename from text
func sanitizeFilename(text string, maxLen int) string {
	// Truncate by runes rather than bytes so a multi-byte UTF-8 character
	// is never cut in half.
	if runes := []rune(text); len(runes) > maxLen {
		text = string(runes[:maxLen])
	}

	result := make([]rune, 0, len(text))
	for _, r := range text {
		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			result = append(result, r)
		} else {
			result = append(result, '_')
		}
	}
	return string(result)
}

// extractWavSegment extracts a single audio segment from batch output
func extractWavSegment(wav []float32, duration float32, sampleRate int, index int, batchSize int) []float64 {
	wavLen := int(float64(sampleRate) * float64(duration))
	wavPerBatch := len(wav) / batchSize

	// Keep the segment inside this item's slot of the flattened batch output,
	// and guard against a negative length from a negative duration.
	if wavLen > wavPerBatch {
		wavLen = wavPerBatch
	}
	if wavLen < 0 {
		wavLen = 0
	}

	wavStart := index * wavPerBatch

	wavOut := make([]float64, wavLen)
	for j := 0; j < wavLen; j++ {
		wavOut[j] = float64(wav[wavStart+j])
	}

	return wavOut
}

// Timer measures execution time
func Timer(name string, fn func() interface{}) interface{} {
	start := time.Now()
	fmt.Printf("%s...\n", name)
	result := fn()
	elapsed := time.Since(start).Seconds()
	fmt.Printf("  -> %s completed in %.2f sec\n", name, elapsed)
	return result
}

// LoadCfgs loads the configuration from tts.json inside onnxDir.
func LoadCfgs(onnxDir string) (Config, error) {
	cfgPath := filepath.Join(onnxDir, "tts.json")
	data, err := os.ReadFile(cfgPath)
	if err != nil {
		return Config{}, fmt.Errorf("failed to read %s: %w", cfgPath, err)
	}

	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, fmt.Errorf("failed to parse %s: %w", cfgPath, err)
	}

	return cfg, nil
}

// JSON loading helpers

// loadJSONInt64 reads a JSON array of integers from filePath.
func loadJSONInt64(filePath string) ([]int64, error) {
	data, err := os.ReadFile(filePath)
	if err != nil {
		return nil, fmt.Errorf("failed to read %s: %w", filePath, err)
	}

	var result []int64
	if err := json.Unmarshal(data, &result); err != nil {
		return nil, fmt.Errorf("failed to parse %s: %w", filePath, err)
	}

	return result, nil
}

// Tensor conversion utilities

// ArrayToTensor flattens a 3-D array in row-major order into a float32 tensor
// of the given shape. It panics if the tensor cannot be created or if array
// holds more elements than shape describes.
func ArrayToTensor(array [][][]float64, shape []int64) *ort.Tensor[float32] {
	// The flat buffer size is the product of the shape dimensions.
	totalSize := int64(1)
	for _, dim := range shape {
		totalSize *= dim
	}

	// Flatten array, converting float64 to float32.
	flat := make([]float32, totalSize)
	idx := 0
	for b := 0; b < len(array); b++ {
		for d := 0; d < len(array[b]); d++ {
			for t := 0; t < len(array[b][d]); t++ {
				flat[idx] = float32(array[b][d][t])
				idx++
			}
		}
	}

	tensor, err := ort.NewTensor(shape, flat)
	if err != nil {
		panic(err)
	}

	return tensor
}

// IntArrayToTensor flattens a 2-D array in row-major order into an int64
// tensor of the given shape. It panics if the tensor cannot be created or if
// array holds more elements than shape describes.
func IntArrayToTensor(array [][]int64, shape []int64) *ort.Tensor[int64] {
	// The flat buffer size is the product of the shape dimensions.
	totalSize := int64(1)
	for _, dim := range shape {
		totalSize *= dim
	}

	// Flatten array.
	flat := make([]int64, totalSize)
	idx := 0
	for b := 0; b < len(array); b++ {
		for t := 0; t < len(array[b]); t++ {
			flat[idx] = array[b][t]
			idx++
		}
	}

	tensor, err := ort.NewTensor(shape, flat)
	if err != nil {
		panic(err)
	}

	return tensor
}
@@ -0,0 +1,250 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg id="_레이어_2" data-name="레이어 2" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 1920 1080">
<defs>
<style>
.cls-1, .cls-2 {
  fill: none;
}

.cls-3 {
  fill: #227cff;
}

.cls-4 {
  fill: #ff0;
}

.cls-2 {
  stroke: #0a0a0a;
  stroke-miterlimit: 10;
  stroke-width: 1.72px;
}

.cls-5 {
  fill: #f2f2f2;
}

.cls-6 {
  fill: #0a0a0a;
}

.cls-7 {
  clip-path: url(#clippath);
}
</style>
<clipPath id="clippath">
<rect class="cls-1" x="181.43" width="1626.55" height="1080"/>
</clipPath>
</defs>
<g id="Work">
<g>
<rect class="cls-5" width="1920" height="1080"/>
<g>
<circle class="cls-3" cx="1679.4" cy="880.43" r="59.81"/>
<path class="cls-6" d="M1713.35,805.14c-5.14,12.55-10.93,23.87-18.55,37.52l14.39-2.21c-12.75,23.11-27.5,45.05-44.09,65.57l14.46-2.22c-18.8,23.34-39.84,44.76-62.85,63.95,37.17-24.18,63.26-49.33,92.3-75.66l-16.71,2.56c24.03-21.85,46.85-44.98,68.39-69.3l-22.1,3.39c10.89-12.33,20.91-24.28,30.11-35.66l-55.34,12.05Z"/>
</g>
<g class="cls-7">
<path class="cls-4" d="M1036.15-20.65c-38.28,93.45-81.37,177.77-138.14,279.42l107.16-16.43c-94.95,172.09-204.79,335.46-328.29,488.29l107.68-16.51c-139.96,173.79-296.71,333.29-468.05,476.23,276.77-180.08,471.07-367.33,687.29-563.39l-124.4,19.07c178.91-162.71,348.9-334.99,509.26-516.04l-164.57,25.23c81.12-91.85,155.67-180.81,224.18-265.57l-412.13,89.69Z"/>
</g>
<g>
<path class="cls-6" d="M157.78,462.19h7.78v52.03h28.69v7.36h-36.47v-59.39Z"/>
<path class="cls-6" d="M204.45,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM200.77,479.75h7.19v41.82h-7.19v-41.82Z"/>
<path class="cls-6" d="M237.74,539.9c-9.03,0-16.06-3.85-19.66-8.53l5.02-5.02c3.35,4.27,8.2,7.03,14.64,7.03,7.11,0,13.97-4.43,13.97-14.47v-4.77c-2.84,4.18-8.37,7.36-14.56,7.36-11.63,0-20.49-9.37-20.49-21.33s8.87-21.25,20.49-21.25c6.19,0,11.71,3.09,14.56,7.28v-6.44h7.19v39.4c0,13.97-9.2,20.74-21.16,20.74ZM238.16,514.8c8.37,0,14.14-6.19,14.14-14.64s-5.77-14.64-14.14-14.64-14.14,6.27-14.14,14.64,5.77,14.64,14.14,14.64Z"/>
<path class="cls-6" d="M270.78,458.84h7.19v27.35c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-62.74Z"/>
<path class="cls-6" d="M334.1,521.99c-7.28,0-12.8-4.1-12.8-12.8v-22.84h-8.87v-6.61h8.87v-11.63h7.19v11.63h12.05v6.61h-12.05v21.92c0,5.35,2.43,7.11,6.94,7.11,1.76,0,3.76-.33,5.1-.84v6.44c-1.76.59-3.85,1-6.44,1Z"/>
<path class="cls-6" d="M348.07,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
<path class="cls-6" d="M399.35,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM395.67,479.75h7.19v41.82h-7.19v-41.82Z"/>
<path class="cls-6" d="M414.82,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
<path class="cls-6" d="M480.23,539.9c-9.03,0-16.06-3.85-19.66-8.53l5.02-5.02c3.35,4.27,8.2,7.03,14.64,7.03,7.11,0,13.97-4.43,13.97-14.47v-4.77c-2.84,4.18-8.37,7.36-14.56,7.36-11.63,0-20.49-9.37-20.49-21.33s8.87-21.25,20.49-21.25c6.19,0,11.71,3.09,14.56,7.28v-6.44h7.19v39.4c0,13.97-9.2,20.74-21.16,20.74ZM480.65,514.8c8.37,0,14.14-6.19,14.14-14.64s-5.77-14.64-14.14-14.64-14.14,6.27-14.14,14.64,5.77,14.64,14.14,14.64Z"/>
<path class="cls-6" d="M511.93,494.56h21.33v7.19h-21.33v-7.19Z"/>
<path class="cls-6" d="M544.97,462.19h34.55v7.36h-26.77v17.98h21.16v7.36h-21.16v26.68h-7.78v-59.39Z"/>
<path class="cls-6" d="M600.68,478.92c6.27,0,11.96,3.26,14.72,7.28v-6.44h7.19v41.82h-7.19v-6.44c-2.76,4.02-8.45,7.28-14.72,7.28-11.71,0-20.49-9.79-20.49-21.75s8.78-21.75,20.49-21.75ZM601.77,485.52c-8.45,0-14.3,6.69-14.3,15.14s5.86,15.14,14.3,15.14,14.22-6.69,14.22-15.14-5.77-15.14-14.22-15.14Z"/>
<path class="cls-6" d="M647.11,522.41c-7.44,0-13.89-3.18-16.39-9.7l5.86-3.26c1.5,4.27,6.02,6.61,10.62,6.61,4.01,0,7.28-2.01,7.28-5.6,0-3.01-1.84-4.85-7.28-6.52l-4.18-1.34c-6.86-2.01-10.46-6.36-10.46-12.21.08-7.19,6.36-11.46,14.39-11.46,6.19,0,11.04,2.59,13.8,7.28l-5.35,3.68c-1.84-2.68-4.68-4.77-8.7-4.77-3.51,0-6.94,1.92-6.94,5.02,0,2.51,1.34,4.6,5.94,6.02l4.6,1.42c7.19,2.17,11.38,5.94,11.38,12.3,0,8.03-6.19,12.55-14.55,12.55Z"/>
<path class="cls-6" d="M686.25,521.99c-7.28,0-12.8-4.1-12.8-12.8v-22.84h-8.87v-6.61h8.87v-11.63h7.19v11.63h12.04v6.61h-12.04v21.92c0,5.35,2.43,7.11,6.94,7.11,1.76,0,3.76-.33,5.1-.84v6.44c-1.76.59-3.85,1-6.44,1Z"/>
<path class="cls-6" d="M700.56,533.54h-4.77l5.6-11.96c-2.01-1-3.43-3.09-3.43-5.6,0-3.35,2.76-6.27,6.19-6.27s6.19,2.93,6.19,6.27c0,1.42-.5,2.68-1.17,3.76l-8.62,13.8Z"/>
<path class="cls-6" d="M766.14,522.58c-17.06,0-30.7-13.47-30.7-30.7s13.63-30.7,30.7-30.7,30.7,13.47,30.7,30.7-13.55,30.7-30.7,30.7ZM766.14,515.22c13.05,0,22.92-10.29,22.92-23.34s-9.87-23.34-22.92-23.34-22.84,10.29-22.84,23.34,9.87,23.34,22.84,23.34Z"/>
<path class="cls-6" d="M805.7,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
<path class="cls-6" d="M851.96,494.56h21.33v7.19h-21.33v-7.19Z"/>
<path class="cls-6" d="M885,462.19h17.06c18.24,0,31.62,12.71,31.62,29.7s-13.38,29.7-31.62,29.7h-17.06v-59.39ZM902.06,514.3c14.3,0,23.76-9.54,23.76-22.42s-9.45-22.42-23.76-22.42h-9.29v44.84h9.29Z"/>
<path class="cls-6" d="M961.03,478.92c10.96,0,19.83,7.61,19.91,21,0,.75,0,1.25-.08,2.17h-34.3c.25,7.86,6.19,13.72,14.47,13.72,6.44,0,10.46-2.84,13.05-7.28l5.69,3.93c-3.76,6.11-10.12,9.95-18.82,9.95-12.97,0-21.67-9.45-21.67-21.75s9.03-21.75,21.75-21.75ZM947.07,496.23h26.52c-1-6.86-6.44-10.96-12.8-10.96s-12.38,4.02-13.72,10.96Z"/>
<path class="cls-6" d="M982.53,479.75h8.03l14.3,31.62,14.3-31.62h8.03l-19.32,41.82h-6.02l-19.32-41.82Z"/>
<path class="cls-6" d="M1036.48,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM1032.8,479.75h7.19v41.82h-7.19v-41.82Z"/>
<path class="cls-6" d="M1070.61,522.41c-12.63,0-21.92-9.62-21.92-21.75s9.28-21.75,21.92-21.75c8.62,0,15.64,4.52,19.32,11.21l-6.27,3.51c-2.34-4.77-7.03-8.03-13.05-8.03-8.7,0-14.64,6.69-14.64,15.06s5.94,15.06,14.64,15.06c6.02,0,10.71-3.26,13.05-8.03l6.27,3.51c-3.68,6.69-10.71,11.21-19.32,11.21Z"/>
<path class="cls-6" d="M1116.03,478.92c10.96,0,19.83,7.61,19.91,21,0,.75,0,1.25-.08,2.17h-34.3c.25,7.86,6.19,13.72,14.47,13.72,6.44,0,10.46-2.84,13.05-7.28l5.69,3.93c-3.76,6.11-10.12,9.95-18.82,9.95-12.97,0-21.67-9.45-21.67-21.75s9.03-21.75,21.75-21.75ZM1102.06,496.23h26.52c-1-6.86-6.44-10.96-12.8-10.96s-12.38,4.02-13.72,10.96Z"/>
<path class="cls-6" d="M1173.33,469.55h-18.49v-7.36h44.58v7.36h-18.49v52.03h-7.61v-52.03Z"/>
<path class="cls-6" d="M1219.08,469.55h-18.49v-7.36h44.58v7.36h-18.49v52.03h-7.61v-52.03Z"/>
<path class="cls-6" d="M1253.71,507.19c3.18,5.02,8.03,8.11,14.72,8.11,6.11,0,10.96-3.35,10.96-8.87,0-4.77-3.01-7.95-8.53-10.04l-7.86-3.01c-9.29-3.43-13.47-8.28-13.47-16.4,0-9.7,7.95-15.81,18.57-15.81,7.44,0,13.63,3.35,17.32,8.11l-5.69,5.02c-3.01-3.6-6.69-5.86-11.79-5.86-5.86,0-10.71,3.26-10.71,8.2s2.93,7.28,8.95,9.54l7.19,2.76c8.78,3.35,13.8,8.36,13.8,17.15,0,10.12-7.86,16.48-18.9,16.48-9.79,0-17.73-4.6-20.83-10.71l6.27-4.68Z"/>
<path class="cls-6" d="M1299.64,522.25c-3.43,0-6.19-2.84-6.19-6.27s2.76-6.27,6.19-6.27,6.19,2.93,6.19,6.27-2.76,6.27-6.19,6.27Z"/>
</g>
<g>
<g>
<path class="cls-6" d="M175.59,793.41c0,5.19-3.89,9.12-9.89,9.12h-5.03v10.5h-3.77v-28.77h8.79c6,0,9.89,3.97,9.89,9.16ZM171.86,793.41c0-3.2-2.19-5.63-6.16-5.63h-5.03v11.27h5.03c3.97,0,6.16-2.43,6.16-5.63Z"/>
<path class="cls-6" d="M186.24,792.36c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM186.77,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M201.8,792.76h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M222.67,792.36c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM223.2,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M238.23,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M281.43,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM274.66,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M301.98,813.23c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M316.61,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM309.84,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M329.57,792.76h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M348.5,813.43c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
</g>
<g>
<path class="cls-6" d="M642.94,793.41c0,5.19-3.89,9.12-9.89,9.12h-5.03v10.5h-3.77v-28.77h8.79c6,0,9.89,3.97,9.89,9.16ZM639.22,793.41c0-3.2-2.19-5.63-6.16-5.63h-5.03v11.27h5.03c3.97,0,6.16-2.43,6.16-5.63Z"/>
<path class="cls-6" d="M645.58,792.76h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M667.05,813.43c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM667.05,810.19c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
<path class="cls-6" d="M689.63,821.9c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM689.83,809.74c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M704.82,792.76h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M725.69,792.36c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM726.22,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M741.25,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M775.49,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M811.52,787.33c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM809.73,792.76h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M818.2,792.76h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M849.08,821.9c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM849.29,809.74c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M623.73,824.36h3.48v30.4h-3.48v-30.4Z"/>
<path class="cls-6" d="M640.55,834.09c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM641.08,837.29c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M656.11,834.49h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M686.99,863.63c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM687.19,851.47c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M701.9,834.49h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
<path class="cls-6" d="M732.38,834.09c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM732.9,837.29c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M756.57,863.63c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM756.77,851.47c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M780.72,834.09c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM773.95,842.48h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M799.81,855.16c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
</g>
<g>
<path class="cls-6" d="M1102.25,813.51c-8.27,0-14.87-6.52-14.87-14.87s6.61-14.87,14.87-14.87,14.87,6.52,14.87,14.87-6.56,14.87-14.87,14.87ZM1102.25,809.94c6.32,0,11.1-4.98,11.1-11.31s-4.78-11.31-11.1-11.31-11.06,4.98-11.06,11.31,4.78,11.31,11.06,11.31Z"/>
<path class="cls-6" d="M1120.61,792.76h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M1142.21,799.93h10.33v3.49h-10.33v-3.49Z"/>
<path class="cls-6" d="M1157.4,784.25h8.27c8.83,0,15.32,6.16,15.32,14.39s-6.48,14.39-15.32,14.39h-8.27v-28.77ZM1165.67,809.5c6.93,0,11.51-4.62,11.51-10.86s-4.58-10.86-11.51-10.86h-4.5v21.72h4.5Z"/>
<path class="cls-6" d="M1193.43,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM1186.66,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M1203.03,792.76h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
<path class="cls-6" d="M1228.36,787.33c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM1226.57,792.76h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M1244.08,813.43c-6.12,0-10.62-4.66-10.62-10.54s4.5-10.54,10.62-10.54c4.17,0,7.58,2.19,9.36,5.43l-3.04,1.7c-1.13-2.31-3.4-3.89-6.32-3.89-4.21,0-7.09,3.24-7.09,7.29s2.88,7.29,7.09,7.29c2.92,0,5.19-1.58,6.32-3.89l3.04,1.7c-1.78,3.24-5.19,5.43-9.36,5.43Z"/>
<path class="cls-6" d="M1265.27,792.36c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM1258.5,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
</g>
<g>
<path class="cls-6" d="M1097.68,909.11h-11.46l5.86-7.11h12.8v59.39h-7.19v-52.28Z"/>
<path class="cls-6" d="M1135.41,962.23c-12.21,0-21-9.12-21-22.33v-16.56c0-13.13,8.78-22.33,21-22.33s21.08,9.2,21.08,22.33v16.56c0,13.22-8.87,22.33-21.08,22.33ZM1135.41,955.2c8.11,0,13.89-5.69,13.89-15.31v-16.56c0-9.62-5.77-15.31-13.89-15.31s-13.8,5.69-13.8,15.31v16.56c0,9.62,5.77,15.31,13.8,15.31Z"/>
<path class="cls-6" d="M1184.17,962.23c-12.21,0-21-9.12-21-22.33v-16.56c0-13.13,8.78-22.33,21-22.33s21.08,9.2,21.08,22.33v16.56c0,13.22-8.87,22.33-21.08,22.33ZM1184.17,955.2c8.11,0,13.89-5.69,13.89-15.31v-16.56c0-9.62-5.77-15.31-13.89-15.31s-13.8,5.69-13.8,15.31v16.56c0,9.62,5.77,15.31,13.8,15.31Z"/>
<path class="cls-6" d="M1226.08,934.29c-9.12,0-16.48-7.44-16.48-16.48s7.36-16.48,16.48-16.48,16.48,7.36,16.48,16.48-7.45,16.48-16.48,16.48ZM1226.08,927.76c5.44,0,9.87-4.6,9.87-9.95s-4.43-9.95-9.87-9.95-9.87,4.6-9.87,9.95,4.35,9.95,9.87,9.95ZM1263.3,902h7.78l-37.64,59.39h-7.78l37.64-59.39ZM1270.66,962.06c-9.12,0-16.48-7.44-16.48-16.48s7.36-16.48,16.48-16.48,16.48,7.36,16.48,16.48-7.44,16.48-16.48,16.48ZM1270.66,955.53c5.44,0,9.87-4.6,9.87-9.95s-4.43-9.95-9.87-9.95-9.87,4.6-9.87,9.95,4.35,9.95,9.87,9.95Z"/>
</g>
<g>
<path class="cls-6" d="M644.38,962.23c-11.63,0-19.99-7.86-19.99-18.65,0-7.11,3.51-12.63,9.45-15.56-3.85-2.59-6.19-6.78-6.19-11.71,0-8.87,7.19-15.31,16.73-15.31s16.65,6.44,16.65,15.31c0,4.94-2.26,9.12-6.02,11.71,5.86,2.93,9.29,8.45,9.29,15.56,0,10.79-8.11,18.65-19.91,18.65ZM644.38,955.37c7.95,0,12.63-5.27,12.63-11.79s-4.68-11.88-12.63-11.88-12.71,5.19-12.71,11.88,4.94,11.79,12.71,11.79ZM644.38,925.25c6.02,0,9.54-4.18,9.54-8.78,0-4.94-3.6-8.7-9.54-8.7s-9.62,3.76-9.62,8.7c0,4.6,3.76,8.78,9.62,8.78Z"/>
<path class="cls-6" d="M668.05,933.95h14.47v-14.39h6.53v14.39h14.39v6.44h-14.39v14.39h-6.53v-14.39h-14.47v-6.44Z"/>
</g>
<g>
<path class="cls-6" d="M176.26,962.23c-11.54,0-20.16-8.78-20.16-19.57,0-6.27,3.01-10.79,6.69-15.14l21.67-25.51h9.12l-18.15,21.5c.84-.08,1.59-.17,2.34-.17,9.95,0,18.57,8.78,18.57,19.16,0,10.96-8.53,19.74-20.08,19.74ZM176.26,955.28c7.19,0,12.71-5.77,12.71-12.8s-5.52-12.8-12.71-12.8-12.8,5.77-12.8,12.8,5.52,12.8,12.8,12.8Z"/>
<path class="cls-6" d="M219.68,962.23c-11.54,0-20.16-8.78-20.16-19.57,0-6.27,3.01-10.79,6.69-15.14l21.67-25.51h9.12l-18.15,21.5c.84-.08,1.59-.17,2.34-.17,9.95,0,18.57,8.78,18.57,19.16,0,10.96-8.53,19.74-20.08,19.74ZM219.68,955.28c7.19,0,12.71-5.77,12.71-12.8s-5.52-12.8-12.71-12.8-12.8,5.77-12.8,12.8,5.52,12.8,12.8,12.8Z"/>
<path class="cls-6" d="M255.56,915.22v46.17h-7.78v-59.39h7.11l22.08,29.61,22.08-29.61h7.03v59.39h-7.7v-46.17l-21.41,28.78-21.41-28.78Z"/>
</g>
<line class="cls-2" x1="157.77" y1="878.71" x2="407.74" y2="878.71"/>
<line class="cls-2" x1="621.9" y1="878.71" x2="871.87" y2="878.71"/>
<line class="cls-2" x1="1086.03" y1="878.71" x2="1336" y2="878.71"/>
</g>
<g>
<path class="cls-6" d="M158.3,582.04h3.77v28.77h-3.77v-28.77Z"/>
<path class="cls-6" d="M167.46,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M198.74,611.22c-6.12,0-10.62-4.66-10.62-10.54s4.5-10.54,10.62-10.54c4.17,0,7.58,2.19,9.36,5.43l-3.04,1.7c-1.13-2.31-3.4-3.89-6.32-3.89-4.21,0-7.09,3.24-7.09,7.29s2.88,7.29,7.09,7.29c2.92,0,5.19-1.58,6.32-3.89l3.04,1.7c-1.78,3.24-5.19,5.43-9.36,5.43Z"/>
<path class="cls-6" d="M210.98,590.55h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M232.42,590.15c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM225.65,598.53h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M253.73,590.15c3.04,0,5.79,1.58,7.13,3.53v-13.25h3.48v30.4h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM254.26,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M271.08,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM269.29,590.55h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M281.25,610.81h-3.49v-30.4h3.49v13.25c1.34-1.95,4.09-3.53,7.13-3.53,5.63,0,9.93,4.74,9.93,10.54s-4.3,10.54-9.93,10.54c-3.04,0-5.8-1.58-7.13-3.53v3.12ZM287.85,593.35c-4.09,0-6.89,3.24-6.89,7.34s2.8,7.34,6.89,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M301.67,580.42h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M311.64,619.28l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
<path class="cls-6" d="M337.94,580.42h3.48v30.4h-3.48v-30.4Z"/>
<path class="cls-6" d="M348.19,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM346.41,590.55h3.48v20.26h-3.48v-20.26Z"/>
<path class="cls-6" d="M363.51,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM363.71,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M378.71,580.42h3.48v13.25c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-30.4Z"/>
<path class="cls-6" d="M408.57,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M426.73,596.22l-5.03,14.59h-3.08l-6.89-20.26h3.65l4.9,14.79,5.07-14.79h2.67l5.07,14.79,4.9-14.79h3.69l-6.89,20.26h-3.08l-4.99-14.59Z"/>
<path class="cls-6" d="M452.7,590.15c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM445.93,598.53h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
<path class="cls-6" d="M467.45,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM465.67,590.55h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M482.77,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM482.97,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M497.96,580.42h3.48v13.25c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-30.4Z"/>
<path class="cls-6" d="M527.83,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M550.28,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM550.81,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M565.84,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M596.43,590.15c3.04,0,5.8,1.58,7.13,3.53v-13.25h3.49v30.4h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM596.96,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M623.62,610.81h-3.48v-30.4h3.48v13.25c1.34-1.95,4.09-3.53,7.13-3.53,5.63,0,9.93,4.74,9.93,10.54s-4.3,10.54-9.93,10.54c-3.04,0-5.79-1.58-7.13-3.53v3.12ZM630.23,593.35c-4.09,0-6.89,3.24-6.89,7.34s2.8,7.34,6.89,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M644.05,580.42h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M660.87,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM661.39,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M674.76,608.06l11.63-14.31h-11.47v-3.2h15.93v2.8l-11.55,14.27h11.92v3.2h-16.45v-2.76Z"/>
<path class="cls-6" d="M696.2,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM694.42,590.55h3.49v20.26h-3.49v-20.26Z"/>
<path class="cls-6" d="M702.89,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
<path class="cls-6" d="M733.76,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM733.97,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
<path class="cls-6" d="M748.96,580.42h3.49v30.4h-3.49v-30.4Z"/>
<path class="cls-6" d="M758.93,619.28l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
<path class="cls-6" d="M786.4,593.75h-4.3v-3.2h4.3v-4.09c0-4.17,2.67-6.28,6.2-6.28,1.22,0,2.27.2,3.16.53v3.08c-.65-.24-1.7-.41-2.51-.41-2.23,0-3.36.89-3.36,3.44v3.73h5.88v3.2h-5.88v17.06h-3.48v-17.06Z"/>
<path class="cls-6" d="M805.78,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM806.3,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
<path class="cls-6" d="M827.45,611.22c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
<path class="cls-6" d="M845.61,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.17,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
<path class="cls-6" d="M851.73,616.61h-2.31l2.72-5.8c-.97-.49-1.66-1.5-1.66-2.72,0-1.62,1.34-3.04,3-3.04s3,1.42,3,3.04c0,.69-.24,1.3-.57,1.82l-4.17,6.69Z"/>
<path class="cls-6" d="M157.78,639.18h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
<path class="cls-6" d="M170.7,639.18h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
|
||||
<path class="cls-6" d="M192.83,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M215.07,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M239.1,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM237.32,639.18h3.49v20.26h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M245.79,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M276.67,668.32c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM276.87,656.16c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
|
||||
<path class="cls-6" d="M300.01,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M330.61,638.77c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM331.13,641.98c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
|
||||
<path class="cls-6" d="M354.03,659.65c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.17,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
|
||||
<path class="cls-6" d="M361.77,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM359.98,639.18h3.49v20.26h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M365.41,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
|
||||
<path class="cls-6" d="M397.47,638.77c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM390.7,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
|
||||
<path class="cls-6" d="M410.43,629.05h3.49v30.4h-3.49v-30.4Z"/>
|
||||
<path class="cls-6" d="M420.4,667.91l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
|
||||
<path class="cls-6" d="M448.49,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM446.7,639.18h3.48v20.26h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M455.17,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M486.13,667.91l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
|
||||
<path class="cls-6" d="M513.73,659.85c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM513.73,656.61c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
|
||||
<path class="cls-6" d="M527.38,639.18h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
|
||||
<path class="cls-6" d="M549.51,639.18h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M578.81,638.77c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM572.04,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
|
||||
<path class="cls-6" d="M591.78,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M610.54,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
|
||||
<path class="cls-6" d="M635.86,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM634.08,639.18h3.49v20.26h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M642.55,639.18h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M664.03,659.85c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM664.03,656.61c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
|
||||
<path class="cls-6" d="M677.97,639.18h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M700.21,639.18h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M743.41,638.77c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM736.64,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
|
||||
<path class="cls-6" d="M756.38,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
|
||||
<path class="cls-6" d="M786.24,659.65c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
|
||||
<path class="cls-6" d="M796.42,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
|
||||
<path class="cls-6" d="M821.74,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM819.96,639.18h3.49v20.26h-3.49v-20.26Z"/>
|
||||
<path class="cls-6" d="M836.78,638.77c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM837.3,641.98c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
|
||||
<path class="cls-6" d="M873.9,659.93c-8.27,0-14.87-6.52-14.87-14.87s6.61-14.87,14.87-14.87,14.87,6.52,14.87,14.87-6.56,14.87-14.87,14.87ZM873.9,656.36c6.32,0,11.1-4.98,11.1-11.31s-4.78-11.31-11.1-11.31-11.06,4.98-11.06,11.31,4.78,11.31,11.06,11.31Z"/>
|
||||
<path class="cls-6" d="M914.1,659.44l-17.51-22.41v22.41h-3.77v-28.77h3.28l17.51,22.41v-22.41h3.73v28.77h-3.24Z"/>
|
||||
<path class="cls-6" d="M944.65,659.44l-17.51-22.41v22.41h-3.77v-28.77h3.28l17.51,22.41v-22.41h3.73v28.77h-3.24Z"/>
|
||||
<path class="cls-6" d="M963.7,647.89l-8.79,11.55h-4.62l11.1-14.55-10.66-14.23h4.5l8.47,11.23,8.47-11.23h4.5l-10.62,14.23,11.06,14.55h-4.58l-8.83-11.55Z"/>
|
||||
<path class="cls-6" d="M980.75,659.77c-1.66,0-3-1.38-3-3.04s1.34-3.04,3-3.04,3,1.42,3,3.04-1.34,3.04-3,3.04Z"/>
|
||||
</g>
|
||||
<g>
|
||||
<path class="cls-3" d="M312.56,274.03h-18c0-5.18-4.04-10.02-9.4-10.07l-104.25.02c-11.33,2.12-11.32,17.43-.38,20.06l102.54.02c38.59,2.04,38.78,52.82,2.1,56.56-33.79-1.6-69.45,2.06-103.02,0-11.72-.72-21.95-7.64-26.04-18.75-1.18-3.2-1.28-6.01-1.77-9.32h18c.28,4.87,3.95,9.62,8.98,10.07l104.25-.02c12.15-2.24,11.75-18.22-.05-20.47l-105.04-.03c-34.49-3.76-34.33-52.52,0-56.14l107.07.13c13.99,1.94,24.86,13.7,25.02,27.94Z"/>
|
||||
<polygon class="cls-3" points="845.46 245.97 845.46 263.98 708.99 263.98 708.99 284.08 807.58 284.08 808.2 284.71 808.2 302.08 708.99 302.08 708.99 322.6 845.46 322.6 845.46 340.61 690.15 340.61 690.15 245.97 845.46 245.97"/>
|
||||
<path class="cls-3" d="M979.42,302.08l40.19,38.52h-24.07l-41.02-38.52h-66.35v38.52h-18.42v-94.63h127.89c2.7,0,8.65,2.55,11.1,3.97,18.71,10.84,17.89,38.72-.98,48.86-1.77.95-7.92,3.28-9.7,3.28h-18.63ZM888.16,284.08h108.63c.19,0,2.17-.96,2.59-1.18,5.89-3.09,6.76-10.78,2.58-15.72-.77-.9-3.71-3.2-4.75-3.2h-109.05v20.1Z"/>
|
||||
<path class="cls-3" d="M1223.38,246.09l104.97-.13c13.07,1.48,23.94,11.35,25.72,24.52-2.05,25.65,10.21,65.79-26.58,70.12l-101.32.02c-16.42-1.48-26.57-12.82-27.42-29.1-.57-10.96-.6-25.52,0-36.47.83-15.17,9.22-26.52,24.62-28.97ZM1225.46,264.08c-3.86,1.07-7.28,3.93-7.85,8.06,1.21,13.06-1.63,29.14.04,41.84.54,4.08,4.71,8.46,8.95,8.64l100.07-.02c4.88-.89,8.58-4.39,9.01-9.41,1.12-12.93-.84-27.49-.05-40.6-.75-4.36-4.01-8.14-8.53-8.64l-101.64.12Z"/>
|
||||
<path class="cls-3" d="M537.36,302.08v38.52h-18.42v-94.63h127.89c2.29,0,8.64,2.6,10.8,3.85,19.96,11.5,17.66,41.29-3.33,50.11-1.6.67-5.94,2.15-7.48,2.15h-109.47ZM537.36,284.08h108.21c2.55,0,6.68-3.98,7.46-6.35,1.45-4.38-.08-9.62-3.93-12.25-.44-.3-2.83-1.49-3.11-1.49h-108.63v20.1Z"/>
|
||||
<path class="cls-3" d="M1763.91,245.97v18h-128.72c-.58,0-3.75,1.77-4.41,2.29-2.94,2.33-3.53,5.62-3.78,9.2-.69,9.89-.63,24.87,0,34.79.15,2.35.5,5.46,1.76,7.45.92,1.44,4.77,4.88,6.42,4.88h128.72v18h-129.14c-14.54,0-25.37-14.67-26.18-28.25-.67-11.18-.67-26.96,0-38.14.74-12.49,10.69-28.25,24.51-28.25h130.82Z"/>
|
||||
<polygon class="cls-3" points="1517.76 323.86 1517.76 245.97 1536.6 245.97 1536.6 340.61 1510.02 340.61 1399.71 262.72 1399.71 340.61 1381.29 340.61 1381.29 245.97 1409.55 245.97 1517.76 323.86"/>
|
||||
<path class="cls-3" d="M355.26,245.97v69.3c0,2.61,5.06,7.86,8.15,7.34l101.74-.02c2.83.06,8.16-4.19,8.16-6.91v-69.72h18.42v71.39c0,11-15.16,23.94-26.14,23.26-33.52-1.59-68.88,2.04-102.18,0-10.81-.66-20.71-6.87-24.79-17.08-.41-1.02-1.77-4.96-1.77-5.76v-71.81h18.42Z"/>
|
||||
<polygon class="cls-3" points="1188.31 245.97 1188.31 263.35 1187.68 263.98 1118.4 263.98 1118.4 340.61 1099.98 340.61 1099.98 263.98 1030.49 263.98 1030.49 245.97 1188.31 245.97"/>
|
||||
<rect class="cls-3" x="1563.39" y="245.97" width="18.42" height="94.63"/>
|
||||
</g>
|
||||
<g>
|
||||
<path class="cls-6" d="M156.1,179.65h54.1v-54.1h-54.1v54.1ZM183.15,129.76c5.25,0,10.08,1.79,13.94,4.77l-13.03,13.03,4.13,4.13,13.03-13.03c2.98,3.86,4.77,8.68,4.77,13.94,0,12.62-10.23,22.84-22.84,22.84s-22.84-10.23-22.84-22.84,10.23-22.84,22.84-22.84Z"/>
|
||||
<path class="cls-6" d="M279.9,132.95c0,8.93,0,17.85,0,26.78,0,3-1.02,5.56-3.39,7.47-2,1.62-4.32,2.26-6.87,2.03-3.3-.31-5.79-1.89-7.41-4.8-.85-1.52-1.09-3.18-1.09-4.9,0-8.82,0-17.63,0-26.45v-.79h-4.64v.59c0,8.98,0,17.96,0,26.94,0,.86.03,1.74.18,2.58.55,3.28,2.09,6.04,4.68,8.15,3.29,2.68,7.09,3.66,11.28,3.08,3.43-.48,6.33-2.02,8.62-4.62,2.3-2.61,3.28-5.7,3.28-9.14v-27.6h-4.65v.67Z"/>
|
||||
<path class="cls-6" d="M244.82,151.28c-1.95-1-4.04-1.33-6.2-1.37-1.41-.03-2.71-.42-3.92-1.17-2.75-1.7-3.99-5.22-2.86-8.25,1.07-2.87,3.65-4.63,6.95-4.71,3.71-.09,6.89,2.1,7.55,5.82.07.39.09.8.13,1.21h4.6c.03-3.38-1.3-6.12-3.67-8.34-2.21-2.06-4.9-3.07-7.93-3.15-4.59-.11-8.33,1.55-10.88,5.43-1.96,2.98-2.26,6.25-1.21,9.63.68,2.16,1.97,3.92,3.77,5.32,2.02,1.57,4.29,2.47,6.85,2.64,1.48.1,2.98.1,4.38.73,4.68,2.1,5.63,7.4,3.21,10.99-1.52,2.24-3.77,3.2-6.42,3.32-3.59.16-6.69-1.91-7.78-5.25-.23-.71-.29-1.47-.43-2.2h-4.55c-.13,4.12,2.51,8.5,6.32,10.4,3.97,1.98,8,2.08,12.03.2,4.76-2.22,7.47-7.32,6.61-12.5-.67-4.02-2.94-6.89-6.54-8.75Z"/>
|
||||
<path class="cls-6" d="M310.07,135.16c-2.37-2.08-5.22-2.88-8.31-2.9-2.36-.01-4.72,0-7.09,0h0s-4.64,0-4.64,0v40.59h.03v.02h4.6v-.02h0v-18.01h.69c2.1,0,4.2,0,6.3,0,1.61,0,3.2-.18,4.72-.76,3.81-1.47,6.44-4.06,7.31-8.12.91-4.26-.33-7.89-3.62-10.78ZM302.79,150.34c-2.59.1-5.2.05-7.79.06-.12,0-.2-.04-.29-.04v-13.68h.18c2.8,0,5.61-.09,8.4.12,3.27.24,6.14,3.08,6.07,6.88-.06,3.69-2.97,6.52-6.56,6.66Z"/>
|
||||
<path class="cls-6" d="M361.78,154.05c3.82-1.47,6.44-4.06,7.31-8.12.91-4.26-.33-7.89-3.62-10.78-2.37-2.08-5.22-2.88-8.31-2.9-3.7-.02-7.41,0-11.11,0h-.62c0,13.56.03,27.08.03,40.6h4.6v-18.03h.69c1.09,0,2.17,0,3.26,0l12.03,18.06h5.39l-12.13-18.21c.84-.11,1.67-.3,2.48-.61ZM350.4,150.4c-.12,0-.2-.04-.29-.04v-13.68h.18c2.8,0,5.61-.09,8.4.12,3.27.24,6.13,3.08,6.07,6.88-.06,3.69-2.97,6.52-6.56,6.66-2.6.1-5.2.05-7.79.06Z"/>
|
||||
<polygon class="cls-6" points="469.03 165.24 447.05 132.34 442.53 132.34 442.53 172.84 447.05 172.84 447.05 139.95 469.03 172.84 473.55 172.84 473.55 132.34 469.03 132.34 469.03 165.24"/>
|
||||
<path class="cls-6" d="M318.45,132.28h0v40.55h0c7.18,0,14.32,0,21.45,0h0s.05,0,.05,0v-4.44h-.05c-5.63,0-11.22,0-16.8,0v-14.17h16.2v-4.45h-16.2v-13.05h16.79s.05,0,.05,0v-4.44h-21.5Z"/>
|
||||
<path class="cls-6" d="M397.79,136.72v-4.43h0s-25.96-.01-25.96-.01h0v4.47h10.65v36.08h4.65v-36.1h10.66v-.02Z"/>
|
||||
<path class="cls-6" d="M479.04,132.28h0v40.55h0c6.99,0,13.94,0,20.89,0h.61v-4.44h-.05c-5.63,0-11.22,0-16.8,0v-14.17h16.2v-4.45h-16.2v-13.05h16.79s.05,0,.05,0v-4.44h-21.5Z"/>
|
||||
<path class="cls-6" d="M417.64,131.69c-11.5,0-20.85,9.35-20.85,20.85s9.35,20.85,20.85,20.85,20.85-9.35,20.85-20.85-9.35-20.85-20.85-20.85ZM417.64,168.88c-9.01,0-16.34-7.33-16.34-16.34s7.33-16.34,16.34-16.34,16.34,7.33,16.34,16.34-7.33,16.34-16.34,16.34Z"/>
|
||||
</g>
|
||||
</g>
|
||||
</g>
|
||||
</svg>
|
||||
|
After Width: | Height: | Size: 43 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 115 KiB |
@@ -0,0 +1,10 @@
import SwiftUI

@main
struct ExampleiOSApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}
@@ -0,0 +1,30 @@
import Foundation
import AVFoundation

final class AudioPlayer: NSObject, AVAudioPlayerDelegate {
    private var player: AVAudioPlayer?
    private var onFinish: (() -> Void)?

    func play(url: URL, onFinish: (() -> Void)? = nil) {
        self.onFinish = onFinish
        do {
            let data = try Data(contentsOf: url)
            let player = try AVAudioPlayer(data: data)
            player.delegate = self
            player.prepareToPlay()
            player.play()
            self.player = player
        } catch {
            print("Audio play error: \(error)")
        }
    }

    func stop() {
        player?.stop()
        player = nil
    }

    func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
        onFinish?()
    }
}
@@ -0,0 +1,98 @@
import SwiftUI

struct ContentView: View {
    @StateObject private var vm = TTSViewModel()

    var body: some View {
        ZStack {
            LinearGradient(gradient: Gradient(colors: [Color(.systemBackground), Color(.secondarySystemBackground)]), startPoint: .topLeading, endPoint: .bottomTrailing)
                .ignoresSafeArea()

            VStack(spacing: 20) {
                Spacer()

                VStack(spacing: 12) {
                    Text("SupertonicTTS iOS Demo")
                        .font(.title2.weight(.semibold))
                        .foregroundColor(.primary)

                    ZStack(alignment: .topLeading) {
                        if vm.text.isEmpty {
                            Text("Type text to synthesize")
                                .foregroundColor(.secondary)
                                .padding(.horizontal, 14)
                                .padding(.vertical, 12)
                        }
                        TextEditor(text: $vm.text)
                            .frame(minHeight: 120, maxHeight: 180)
                            .padding(8)
                            .background(Color(.secondarySystemBackground))
                            .cornerRadius(12)
                            .overlay(
                                RoundedRectangle(cornerRadius: 12)
                                    .stroke(Color.secondary.opacity(0.3), lineWidth: 1)
                            )
                    }
                    .padding(.horizontal)

                    HStack(spacing: 12) {
                        Text("NFE")
                            .font(.subheadline)
                            .foregroundColor(.secondary)
                        Slider(value: $vm.nfe, in: 2...15, step: 1)
                        Text("\(Int(vm.nfe))")
                            .font(.subheadline.monospacedDigit())
                            .frame(width: 36)
                    }
                    .padding(.horizontal)

                    Picker("Voice", selection: $vm.voice) {
                        Text("M").tag(TTSService.Voice.male)
                        Text("F").tag(TTSService.Voice.female)
                    }
                    .pickerStyle(SegmentedPickerStyle())
                    .padding(.horizontal)
                }

                HStack(spacing: 16) {
                    Button(action: { vm.generate() }) {
                        Label(vm.isGenerating ? "Generating..." : "Generate", systemImage: vm.isGenerating ? "hourglass" : "wand.and.stars"
                        )
                        .labelStyle(.titleAndIcon)
                    }
                    .buttonStyle(.borderedProminent)
                    .tint(.accentColor)
                    .disabled(vm.isGenerating)

                    Button(action: { vm.togglePlay() }) {
                        Label(vm.isPlaying ? "Stop" : "Play", systemImage: vm.isPlaying ? "stop.fill" : "play.fill")
                    }
                    .buttonStyle(.bordered)
                    .disabled(vm.audioURL == nil)
                }

                if let rtf = vm.rtfText {
                    Text(rtf)
                        .font(.footnote.monospacedDigit())
                        .foregroundColor(.secondary)
                        .padding(.top, 2)
                }

                if let error = vm.errorMessage {
                    Text(error)
                        .foregroundColor(.red)
                        .font(.footnote)
                        .multilineTextAlignment(.center)
                        .padding(.horizontal)
                }

                Spacer()
            }
        }
        .onAppear { vm.startup() }
    }
}

#Preview {
    ContentView()
}
@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>CFBundleDevelopmentRegion</key>
    <string>en</string>
    <key>CFBundleExecutable</key>
    <string>$(EXECUTABLE_NAME)</string>
    <key>CFBundleIdentifier</key>
    <string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
    <key>CFBundleInfoDictionaryVersion</key>
    <string>6.0</string>
    <key>CFBundleName</key>
    <string>ExampleiOSApp</string>
    <key>CFBundlePackageType</key>
    <string>APPL</string>
    <key>CFBundleShortVersionString</key>
    <string>1.0</string>
    <key>CFBundleVersion</key>
    <string>1</string>
    <key>UILaunchScreen</key>
    <dict/>
    <key>UIApplicationSceneManifest</key>
    <dict>
        <key>UIApplicationSupportsMultipleScenes</key>
        <false/>
    </dict>
</dict>
</plist>
@@ -0,0 +1,140 @@
import Foundation
import OnnxRuntimeBindings

final class TTSService {
    enum Voice { case male, female }

    struct Settings {
        var nTest: Int = 1
    }

    struct SynthesisResult {
        let url: URL
        let elapsedSeconds: Double
        let audioSeconds: Double
        var rtf: Double { elapsedSeconds / max(audioSeconds, 1e-6) }
    }

    private let env: ORTEnv
    private let textToSpeech: TextToSpeech
    private let bundleOnnxDir: String
    private let sampleRate: Int

    // Cached style per voice (precomputed at startup or on first use)
    private var cachedStyle: [Voice: Style] = [:]

    init() throws {
        bundleOnnxDir = try Self.locateOnnxDirInBundle()
        env = try ORTEnv(loggingLevel: .warning)
        textToSpeech = try loadTextToSpeech(bundleOnnxDir, false, env)
        sampleRate = textToSpeech.sampleRate
    }

    // Public warmup: precompute styles and run a quick generation to warm models
    func warmup(nfe: Int = 1) async {
        do { try precomputeStyle(for: .male) } catch { print("Warmup style (M) error: \(error)") }
        do { try precomputeStyle(for: .female) } catch { print("Warmup style (F) error: \(error)") }
        // Run a tiny synthesis to JIT/warm up kernels; discard file
        do {
            let res = try await synthesize(text: "Warm up", nfe: max(1, nfe), voice: .male)
            try? FileManager.default.removeItem(at: res.url)
        } catch {
            print("Warmup synth error: \(error)")
        }
    }

    func synthesize(text: String, nfe: Int, voice: Voice, settings: Settings = Settings()) async throws -> SynthesisResult {
        let tic = Date()

        // 1) Get or compute style for the selected voice
        let style = try getStyle(voice: voice)

        // 2) Synthesize via packed TextToSpeech component
        let (wav, duration) = try textToSpeech.call([text], style, nfe)
        let audioSeconds = Double(duration[0])
        let wavLenSample = min(Int(Double(sampleRate) * audioSeconds), wav.count)
        let wavOut = Array(wav[0..<wavLenSample])

        let tmpURL = FileManager.default.temporaryDirectory.appendingPathComponent("supertonic_tts_\(UUID().uuidString).wav")
        try writeWavFile(tmpURL.path, wavOut, sampleRate)

        let elapsed = Date().timeIntervalSince(tic)
        return SynthesisResult(url: tmpURL, elapsedSeconds: elapsed, audioSeconds: audioSeconds)
    }

    // MARK: - Style helpers
    private func precomputeStyle(for voice: Voice) throws {
        if cachedStyle[voice] != nil { return }
        let styleURL = try Self.locateVoiceStyleURL(voice: voice)
        let style = try loadVoiceStyle([styleURL.path], verbose: false)
        cachedStyle[voice] = style
    }

    private func getStyle(voice: Voice) throws -> Style {
        if let style = cachedStyle[voice] { return style }
        try precomputeStyle(for: voice)
        return cachedStyle[voice]!
    }

    // MARK: - Resource location helpers
    private static func locateOnnxDirInBundle() throws -> String {
        let bundle = Bundle.main
        let fm = FileManager.default

        func dirHasRequiredFiles(_ dir: URL) -> Bool {
            let required = [
                "tts.json",
                "duration_predictor.onnx",
                "text_encoder.onnx",
                "vector_estimator.onnx",
                "vocoder.onnx"
            ]
            return required.allSatisfy { fm.fileExists(atPath: dir.appendingPathComponent($0).path) }
        }

        var candidates: [URL] = []
        if let dir = bundle.resourceURL?.appendingPathComponent("onnx", isDirectory: true) { candidates.append(dir) }
        if let dir = bundle.resourceURL?.appendingPathComponent("assets/onnx", isDirectory: true) { candidates.append(dir) }
        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "onnx") { candidates.append(url.deletingLastPathComponent()) }
        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "assets/onnx") { candidates.append(url.deletingLastPathComponent()) }
        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: nil) { candidates.append(url.deletingLastPathComponent()) }
        if let root = bundle.resourceURL { candidates.append(root) }

        for dir in candidates {
            if dirHasRequiredFiles(dir) { return dir.path }
        }
        throw NSError(
            domain: "TTS",
            code: -100,
            userInfo: [NSLocalizedDescriptionKey: "Could not find the onnx directory in the bundle. Please make sure the onnx folder (as a folder reference) is included in Copy Bundle Resources in Xcode."]
        )
    }

    private static func locateVoiceStyleURL(voice: Voice) throws -> URL {
        // Prefer M1/F1 defaults; search common subdirectories
        let fileName = (voice == .male) ? "M1" : "F1"
        let bundle = Bundle.main
        let candidates: [URL?] = [
            bundle.url(forResource: fileName, withExtension: "json", subdirectory: "voice_styles"),
            bundle.url(forResource: fileName, withExtension: "json", subdirectory: "assets/voice_styles"),
            bundle.url(forResource: fileName, withExtension: "json", subdirectory: nil)
        ]
        for url in candidates {
            if let url = url { return url }
        }
        // Fallback: scan folders if needed
        if let folder1 = bundle.resourceURL?.appendingPathComponent("voice_styles", isDirectory: true) {
            let file = folder1.appendingPathComponent("\(fileName).json")
            if FileManager.default.fileExists(atPath: file.path) { return file }
        }
        if let folder2 = bundle.resourceURL?.appendingPathComponent("assets/voice_styles", isDirectory: true) {
            let file = folder2.appendingPathComponent("\(fileName).json")
            if FileManager.default.fileExists(atPath: file.path) { return file }
        }
        throw NSError(
            domain: "TTS",
            code: -102,
            userInfo: [NSLocalizedDescriptionKey: "Could not find the voice style JSON (\(fileName).json) in the bundle. Ensure voice_styles folder is included in Copy Bundle Resources."]
        )
    }
}
@@ -0,0 +1,76 @@
import Foundation
import AVFoundation

@MainActor
final class TTSViewModel: ObservableObject {
    @Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
    @Published var nfe: Double = 5
    @Published var voice: TTSService.Voice = .male
    @Published var isGenerating: Bool = false
    @Published var isPlaying: Bool = false
    @Published var errorMessage: String?
    @Published var audioURL: URL?

    @Published var elapsedSeconds: Double?
    @Published var audioSeconds: Double?

    private var service: TTSService?
    private var player = AudioPlayer()

    var rtfText: String? {
        guard let e = elapsedSeconds, let a = audioSeconds, a > 0 else { return nil }
        let rtf = e / a
        return String(format: "RTF %.2fx · %.2fs / %.2fs", rtf, e, a)
    }

    func startup() {
        do {
            service = try TTSService()
            Task { await self.service?.warmup(nfe: 5) }
        } catch {
            errorMessage = "Failed to init TTS: \(error.localizedDescription)"
        }
    }

    func generate() {
        guard let service = service else { return }
        isGenerating = true
        errorMessage = nil
        audioURL = nil
        elapsedSeconds = nil
        audioSeconds = nil
        Task {
            do {
                let result = try await service.synthesize(text: text, nfe: Int(nfe), voice: voice)
                await MainActor.run {
                    self.audioURL = result.url
                    self.elapsedSeconds = result.elapsedSeconds
                    self.audioSeconds = result.audioSeconds
                    self.isGenerating = false
                }
                self.play(url: result.url)
            } catch {
                await MainActor.run {
                    self.errorMessage = error.localizedDescription
                    self.isGenerating = false
                }
            }
        }
    }

    func togglePlay() {
        if isPlaying {
            player.stop()
            isPlaying = false
        } else if let url = audioURL {
            play(url: url)
        }
    }

    private func play(url: URL) {
        player.play(url: url) { [weak self] in
            DispatchQueue.main.async { self?.isPlaying = false }
        }
        isPlaying = true
    }
}
@@ -0,0 +1,29 @@
name: ExampleiOSApp
options:
  minimumXcodeGenVersion: 2.37.0
packages:
  onnxruntime:
    url: https://github.com/microsoft/onnxruntime-swift-package-manager.git
    from: 1.16.0
targets:
  ExampleiOSApp:
    type: application
    platform: iOS
    deploymentTarget: "15.0"
    sources:
      - path: .
      - path: ../../swift/Sources/Helper.swift
        type: file
    resources:
      - path: onnx
        type: folder
      - path: audio
        type: folder
    settings:
      base:
        PRODUCT_BUNDLE_IDENTIFIER: com.supertonic.ExampleiOSApp
        SWIFT_VERSION: 5.9
        INFOPLIST_FILE: Info.plist
    dependencies:
      - package: onnxruntime
        product: onnxruntime
@@ -0,0 +1,59 @@
# Supertonic iOS Example App

A minimal iOS demo that runs Supertonic (ONNX Runtime) on-device. The app shows:
- Multiline text input
- NFE (denoising steps) slider
- Voice toggle (M/F)
- Generate & Play buttons
- RTF display (Elapsed / Audio seconds)

All ONNX models/configs are reused from `Supertonic/assets/onnx`, and voice style JSON files from `Supertonic/assets/voice_styles`.

## Prerequisites
- macOS 13+, Xcode 15+
- Swift 5.9+
- iOS 15+ device (recommended)
- Homebrew, XcodeGen

Install tools (if needed):
```bash
brew install xcodegen
```

## Quick Start (zero-click in Xcode)
0) Prepare assets next to the iOS target (one-time)
```bash
cd ios/ExampleiOSApp
mkdir -p onnx voice_styles
rsync -a ../../assets/onnx/ onnx/
rsync -a ../../assets/voice_styles/ voice_styles/
```

1) Generate the Xcode project with XcodeGen
```bash
xcodegen generate
open ExampleiOSApp.xcodeproj
```

2) Open in Xcode and select your iPhone as the run destination
- Targets → ExampleiOSApp → Signing & Capabilities: Select your Team
- iOS Deployment Target: 15.0+

3) Build & Run on device
- Type text → adjust NFE/Voice → Tap Generate → Audio plays automatically
- An RTF line shows like: `RTF 0.30x · 3.04s / 10.11s`
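
The RTF (real-time factor) in that line is the elapsed synthesis time divided by the duration of the generated audio, so a value below 1.0 means synthesis ran faster than playback. A minimal sketch of the formatting, modelled on the `rtfText` property in `TTSViewModel.swift`; the free function `formatRTF` is a hypothetical name used only for illustration:

```swift
import Foundation

/// Formats a real-time-factor line such as "RTF 0.30x · 3.04s / 10.11s".
/// Returns nil when the audio duration is not positive, because the ratio
/// is undefined in that case.
/// NOTE: `formatRTF` is a hypothetical helper, not part of the app sources.
func formatRTF(elapsedSeconds: Double, audioSeconds: Double) -> String? {
    guard audioSeconds > 0 else { return nil }
    let rtf = elapsedSeconds / audioSeconds
    return String(format: "RTF %.2fx · %.2fs / %.2fs", rtf, elapsedSeconds, audioSeconds)
}

// Usage: 3.04 s of compute for 10.11 s of audio.
if let line = formatRTF(elapsedSeconds: 3.04, audioSeconds: 10.11) {
    print(line)
}
```

With the numbers from the sample line, 3.04 / 10.11 rounds to 0.30, which matches the display shown above.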

## What's included (generated project)
- SwiftUI app files: `App.swift`, `ContentView.swift`, `TTSViewModel.swift`, `AudioPlayer.swift`
- Runtime wrapper: `TTSService.swift` (includes TTS inference logic)
- Resources (local, vendored in `ios/ExampleiOSApp/onnx` and `ios/ExampleiOSApp/voice_styles` after step 0)

These references are defined in `project.yml` and added to the app bundle by XcodeGen.
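
Because the models are plain files inside the `onnx` folder, a missing file otherwise only surfaces when the app starts. A minimal sketch of a check that could be run before building, assuming the file names that `TTSService.swift` requires; `missingFiles` is a hypothetical helper and not part of the project:

```swift
import Foundation

/// Returns the entries of `required` that do not exist inside `dir`.
/// NOTE: hypothetical helper for illustration only.
func missingFiles(in dir: URL, required: [String]) -> [String] {
    let fm = FileManager.default
    return required.filter { !fm.fileExists(atPath: dir.appendingPathComponent($0).path) }
}

// File names the example app looks for in its bundled `onnx` folder.
let requiredModelFiles = [
    "tts.json",
    "duration_predictor.onnx",
    "text_encoder.onnx",
    "vector_estimator.onnx",
    "vocoder.onnx",
]

// Usage: run from ios/ExampleiOSApp after copying the assets in step 0.
let onnxDir = URL(fileURLWithPath: "onnx", isDirectory: true)
let missing = missingFiles(in: onnxDir, required: requiredModelFiles)
print(missing.isEmpty ? "All model files present" : "Missing: \(missing.joined(separator: ", "))")
```

The check only tests for file existence; it does not verify that the ONNX files were fully downloaded, for example when Git LFS pointers were copied instead of the real models.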

## App Controls
- **Text**: Multiline `TextEditor`
- **NFE**: Denoising steps (2 to 15, default 5)
- **Voice**: M/F toggle (loads the `M1` or `F1` voice style)
- **Generate**: Runs end-to-end synthesis
- **Play/Stop**: Controls playback of the last output
- **RTF**: Shows Elapsed / Audio seconds for quick performance intuition
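
The NFE slider is bound to a `Double`, while synthesis takes an integer step count. A small sketch of that conversion, assuming the 2...15 range used by the slider in `ContentView.swift`; `nfeSteps` is a hypothetical helper, and the app itself simply calls `Int(nfe)`:

```swift
import Foundation

/// Converts a slider value to an integer NFE step count, clamped to `range`.
/// Non-finite input falls back to the lower bound instead of trapping in Int().
/// NOTE: hypothetical helper for illustration only.
func nfeSteps(fromSlider value: Double, range: ClosedRange<Int> = 2...15) -> Int {
    guard value.isFinite else { return range.lowerBound }
    // Clamp in Double space first so the Int conversion cannot overflow.
    let clamped = min(max(value.rounded(), Double(range.lowerBound)), Double(range.upperBound))
    return Int(clamped)
}

// Usage: the default slider position maps to 5 steps.
print(nfeSteps(fromSlider: 5.0))
```

Clamping keeps the step count valid even if the value is set programmatically outside the slider bounds.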
@@ -0,0 +1,35 @@
# Maven
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties
.mvn/wrapper/maven-wrapper.jar

# Compiled class files
*.class

# IntelliJ IDEA
.idea/
*.iml
*.iws
*.ipr

# Eclipse
.classpath
.project
.settings/

# VS Code
.vscode/

# Results
results/*.wav

# Mac
.DS_Store

@@ -0,0 +1,141 @@
import ai.onnxruntime.*;

import java.io.File;
import java.util.*;

/**
 * TTS Inference Example with ONNX Runtime (Java)
 */
public class ExampleONNX {

    /**
     * Command line arguments
     */
    static class Args {
        boolean useGpu = false;
        String onnxDir = "assets/onnx";
        int totalStep = 5;
        int nTest = 4;
        List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
        List<String> text = Arrays.asList(
            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
        );
        String saveDir = "results";
    }

    /**
     * Parse command line arguments
     */
    private static Args parseArgs(String[] args) {
        Args result = new Args();

        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "--use-gpu":
                    result.useGpu = true;
                    break;
                case "--onnx-dir":
                    if (i + 1 < args.length) result.onnxDir = args[++i];
                    break;
                case "--total-step":
                    if (i + 1 < args.length) result.totalStep = Integer.parseInt(args[++i]);
                    break;
                case "--n-test":
                    if (i + 1 < args.length) result.nTest = Integer.parseInt(args[++i]);
                    break;
                case "--voice-style":
                    if (i + 1 < args.length) {
                        result.voiceStyle = Arrays.asList(args[++i].split(","));
                    }
                    break;
                case "--text":
                    if (i + 1 < args.length) {
                        result.text = Arrays.asList(args[++i].split("\\|"));
                    }
                    break;
                case "--save-dir":
                    if (i + 1 < args.length) result.saveDir = args[++i];
                    break;
            }
        }

        return result;
    }

    /**
     * Main inference function
     */
    public static void main(String[] args) {
        try {
            System.out.println("=== TTS Inference with ONNX Runtime (Java) ===\n");

            // --- 1. Parse arguments --- //
            Args parsedArgs = parseArgs(args);
            int totalStep = parsedArgs.totalStep;
            int nTest = parsedArgs.nTest;
            String saveDir = parsedArgs.saveDir;
            List<String> voiceStylePaths = parsedArgs.voiceStyle;
            List<String> textList = parsedArgs.text;

            if (voiceStylePaths.size() != textList.size()) {
                throw new RuntimeException("Number of voice styles (" + voiceStylePaths.size() +
                    ") must match number of texts (" + textList.size() + ")");
            }

            int bsz = voiceStylePaths.size();
            OrtEnvironment env = OrtEnvironment.getEnvironment();
|
||||
// --- 2. Load TTS components --- //
|
||||
TextToSpeech textToSpeech = Helper.loadTextToSpeech(parsedArgs.onnxDir, parsedArgs.useGpu, env);
|
||||
|
||||
// --- 3. Load voice styles --- //
|
||||
Style style = Helper.loadVoiceStyle(voiceStylePaths, true, env);
|
||||
|
||||
// --- 4. Synthesize speech --- //
|
||||
File saveDirFile = new File(saveDir);
|
||||
if (!saveDirFile.exists()) {
|
||||
saveDirFile.mkdirs();
|
||||
}
|
||||
|
||||
for (int n = 0; n < nTest; n++) {
|
||||
System.out.println("\n[" + (n + 1) + "/" + nTest + "] Starting synthesis...");
|
||||
|
||||
TTSResult ttsResult = Helper.timer("Generating speech from text", () -> {
|
||||
try {
|
||||
return textToSpeech.call(textList, style, totalStep, env);
|
||||
} catch (Exception e) {
|
||||
throw new RuntimeException(e);
|
||||
}
|
||||
});
|
||||
|
||||
float[] wav = ttsResult.wav;
|
||||
float[] duration = ttsResult.duration;
|
||||
|
||||
// Save outputs
|
||||
int wavLen = wav.length / bsz;
|
||||
for (int i = 0; i < bsz; i++) {
|
||||
String fname = Helper.sanitizeFilename(textList.get(i), 20) + "_" + (n + 1) + ".wav";
|
||||
int actualLen = (int) (textToSpeech.sampleRate * duration[i]);
|
||||
|
||||
float[] wavOut = new float[actualLen];
|
||||
System.arraycopy(wav, i * wavLen, wavOut, 0, Math.min(actualLen, wavLen));
|
||||
|
||||
String outputPath = saveDir + "/" + fname;
|
||||
Helper.writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
|
||||
System.out.println("Saved: " + outputPath);
|
||||
}
|
||||
}
|
||||
|
||||
// Clean up
|
||||
style.close();
|
||||
textToSpeech.close();
|
||||
|
||||
System.out.println("\n=== Synthesis completed successfully! ===");
|
||||
|
||||
} catch (Exception e) {
|
||||
System.err.println("Error during inference: " + e.getMessage());
|
||||
e.printStackTrace();
|
||||
System.exit(1);
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,597 @@
|
||||
import ai.onnxruntime.*;
|
||||
import com.fasterxml.jackson.databind.JsonNode;
|
||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||
|
||||
import javax.sound.sampled.AudioFileFormat;
|
||||
import javax.sound.sampled.AudioFormat;
|
||||
import javax.sound.sampled.AudioInputStream;
|
||||
import javax.sound.sampled.AudioSystem;
|
||||
import java.io.*;
|
||||
import java.nio.ByteBuffer;
|
||||
import java.nio.ByteOrder;
|
||||
import java.nio.FloatBuffer;
|
||||
import java.nio.LongBuffer;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Paths;
|
||||
import java.text.Normalizer;
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Configuration classes
|
||||
*/
|
||||
class Config {
|
||||
static class AEConfig {
|
||||
int sampleRate;
|
||||
int baseChunkSize;
|
||||
}
|
||||
|
||||
static class TTLConfig {
|
||||
int chunkCompressFactor;
|
||||
int latentDim;
|
||||
}
|
||||
|
||||
AEConfig ae;
|
||||
TTLConfig ttl;
|
||||
}
|
||||
|
||||
/**
|
||||
* Voice Style Data from JSON
|
||||
*/
|
||||
class VoiceStyleData {
|
||||
static class StyleData {
|
||||
float[][][] data;
|
||||
long[] dims;
|
||||
String type;
|
||||
}
|
||||
|
||||
StyleData styleTtl;
|
||||
StyleData styleDp;
|
||||
}
|
||||
|
||||
/**
|
||||
* Unicode text processor
|
||||
*/
|
||||
class UnicodeProcessor {
|
||||
private long[] indexer;
|
||||
|
||||
public UnicodeProcessor(String unicodeIndexerJsonPath) throws IOException {
|
||||
this.indexer = Helper.loadJsonLongArray(unicodeIndexerJsonPath);
|
||||
}
|
||||
|
||||
public TextProcessResult call(List<String> textList) {
|
||||
List<String> processedTexts = new ArrayList<>();
|
||||
for (String text : textList) {
|
||||
processedTexts.add(preprocessText(text));
|
||||
}
|
||||
|
||||
int[] textIdsLengths = new int[processedTexts.size()];
|
||||
int maxLen = 0;
|
||||
for (int i = 0; i < processedTexts.size(); i++) {
|
||||
textIdsLengths[i] = processedTexts.get(i).length();
|
||||
maxLen = Math.max(maxLen, textIdsLengths[i]);
|
||||
}
|
||||
|
||||
long[][] textIds = new long[processedTexts.size()][maxLen];
|
||||
for (int i = 0; i < processedTexts.size(); i++) {
|
||||
int[] unicodeVals = textToUnicodeValues(processedTexts.get(i));
|
||||
for (int j = 0; j < unicodeVals.length; j++) {
|
||||
textIds[i][j] = indexer[unicodeVals[j]];
|
||||
}
|
||||
}
|
||||
|
||||
float[][][] textMask = getTextMask(textIdsLengths);
|
||||
return new TextProcessResult(textIds, textMask);
|
||||
}
|
||||
|
||||
private String preprocessText(String text) {
|
||||
return Normalizer.normalize(text, Normalizer.Form.NFKD);
|
||||
}
|
||||
|
||||
private int[] textToUnicodeValues(String text) {
|
||||
int[] values = new int[text.length()];
|
||||
for (int i = 0; i < text.length(); i++) {
|
||||
values[i] = text.codePointAt(i);
|
||||
}
|
||||
return values;
|
||||
}
|
||||
|
||||
private float[][][] getTextMask(int[] lengths) {
|
||||
int bsz = lengths.length;
|
||||
int maxLen = 0;
|
||||
for (int len : lengths) {
|
||||
maxLen = Math.max(maxLen, len);
|
||||
}
|
||||
|
||||
float[][][] mask = new float[bsz][1][maxLen];
|
||||
for (int i = 0; i < bsz; i++) {
|
||||
for (int j = 0; j < maxLen; j++) {
|
||||
mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
|
||||
}
|
||||
}
|
||||
return mask;
|
||||
}
|
||||
|
||||
static class TextProcessResult {
|
||||
long[][] textIds;
|
||||
float[][][] textMask;
|
||||
|
||||
TextProcessResult(long[][] textIds, float[][][] textMask) {
|
||||
this.textIds = textIds;
|
||||
this.textMask = textMask;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Text-to-Speech inference class
|
||||
*/
|
||||
class TextToSpeech {
|
||||
private Config config;
|
||||
private UnicodeProcessor textProcessor;
|
||||
private OrtSession dpSession;
|
||||
private OrtSession textEncSession;
|
||||
private OrtSession vectorEstSession;
|
||||
private OrtSession vocoderSession;
|
||||
public int sampleRate;
|
||||
private int baseChunkSize;
|
||||
private int chunkCompress;
|
||||
private int ldim;
|
||||
|
||||
public TextToSpeech(Config config, UnicodeProcessor textProcessor,
|
||||
OrtSession dpSession, OrtSession textEncSession,
|
||||
OrtSession vectorEstSession, OrtSession vocoderSession) {
|
||||
this.config = config;
|
||||
this.textProcessor = textProcessor;
|
||||
this.dpSession = dpSession;
|
||||
this.textEncSession = textEncSession;
|
||||
this.vectorEstSession = vectorEstSession;
|
||||
this.vocoderSession = vocoderSession;
|
||||
this.sampleRate = config.ae.sampleRate;
|
||||
this.baseChunkSize = config.ae.baseChunkSize;
|
||||
this.chunkCompress = config.ttl.chunkCompressFactor;
|
||||
this.ldim = config.ttl.latentDim;
|
||||
}
|
||||
|
||||
public TTSResult call(List<String> textList, Style style, int totalStep, OrtEnvironment env)
|
||||
throws OrtException {
|
||||
int bsz = textList.size();
|
||||
|
||||
// Process text
|
||||
UnicodeProcessor.TextProcessResult textResult = textProcessor.call(textList);
|
||||
long[][] textIds = textResult.textIds;
|
||||
float[][][] textMask = textResult.textMask;
|
||||
|
||||
// Create tensors
|
||||
OnnxTensor textIdsTensor = Helper.createLongTensor(textIds, env);
|
||||
OnnxTensor textMaskTensor = Helper.createFloatTensor(textMask, env);
|
||||
|
||||
// Predict duration
|
||||
Map<String, OnnxTensor> dpInputs = new HashMap<>();
|
||||
dpInputs.put("text_ids", textIdsTensor);
|
||||
dpInputs.put("style_dp", style.dpTensor);
|
||||
dpInputs.put("text_mask", textMaskTensor);
|
||||
|
||||
OrtSession.Result dpResult = dpSession.run(dpInputs);
|
||||
Object dpValue = dpResult.get(0).getValue();
|
||||
float[] duration;
|
||||
if (dpValue instanceof float[][]) {
|
||||
duration = ((float[][]) dpValue)[0];
|
||||
} else {
|
||||
duration = (float[]) dpValue;
|
||||
}
|
||||
|
||||
// Encode text
|
||||
Map<String, OnnxTensor> textEncInputs = new HashMap<>();
|
||||
textEncInputs.put("text_ids", textIdsTensor);
|
||||
textEncInputs.put("style_ttl", style.ttlTensor);
|
||||
textEncInputs.put("text_mask", textMaskTensor);
|
||||
|
||||
OrtSession.Result textEncResult = textEncSession.run(textEncInputs);
|
||||
OnnxTensor textEmbTensor = (OnnxTensor) textEncResult.get(0);
|
||||
|
||||
// Sample noisy latent
|
||||
NoisyLatentResult noisyLatentResult = sampleNoisyLatent(duration);
|
||||
float[][][] xt = noisyLatentResult.noisyLatent;
|
||||
float[][][] latentMask = noisyLatentResult.latentMask;
|
||||
|
||||
// Prepare constant tensors
|
||||
float[] totalStepArray = new float[bsz];
|
||||
Arrays.fill(totalStepArray, (float) totalStep);
|
||||
OnnxTensor totalStepTensor = OnnxTensor.createTensor(env, totalStepArray);
|
||||
|
||||
// Denoising loop
|
||||
for (int step = 0; step < totalStep; step++) {
|
||||
float[] currentStepArray = new float[bsz];
|
||||
Arrays.fill(currentStepArray, (float) step);
|
||||
OnnxTensor currentStepTensor = OnnxTensor.createTensor(env, currentStepArray);
|
||||
OnnxTensor noisyLatentTensor = Helper.createFloatTensor(xt, env);
|
||||
OnnxTensor latentMaskTensor = Helper.createFloatTensor(latentMask, env);
|
||||
OnnxTensor textMaskTensor2 = Helper.createFloatTensor(textMask, env);
|
||||
|
||||
Map<String, OnnxTensor> vectorEstInputs = new HashMap<>();
|
||||
vectorEstInputs.put("noisy_latent", noisyLatentTensor);
|
||||
vectorEstInputs.put("text_emb", textEmbTensor);
|
||||
vectorEstInputs.put("style_ttl", style.ttlTensor);
|
||||
vectorEstInputs.put("latent_mask", latentMaskTensor);
|
||||
vectorEstInputs.put("text_mask", textMaskTensor2);
|
||||
vectorEstInputs.put("current_step", currentStepTensor);
|
||||
vectorEstInputs.put("total_step", totalStepTensor);
|
||||
|
||||
OrtSession.Result vectorEstResult = vectorEstSession.run(vectorEstInputs);
|
||||
float[][][] denoised = (float[][][]) vectorEstResult.get(0).getValue();
|
||||
|
||||
// Update latent
|
||||
xt = denoised;
|
||||
|
||||
// Clean up
|
||||
currentStepTensor.close();
|
||||
noisyLatentTensor.close();
|
||||
latentMaskTensor.close();
|
||||
textMaskTensor2.close();
|
||||
vectorEstResult.close();
|
||||
}
|
||||
|
||||
// Generate waveform
|
||||
OnnxTensor finalLatentTensor = Helper.createFloatTensor(xt, env);
|
||||
Map<String, OnnxTensor> vocoderInputs = new HashMap<>();
|
||||
vocoderInputs.put("latent", finalLatentTensor);
|
||||
|
||||
OrtSession.Result vocoderResult = vocoderSession.run(vocoderInputs);
|
||||
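float[][] wavBatch = (float[][]) vocoderResult.get(0).getValue();
// Flatten [bsz][samples] row by row: the caller slices sample i at offset i * (wav.length / bsz),
// so returning only wavBatch[0] would break batches with more than one item.
int samplesPerItem = wavBatch[0].length;
float[] wav = new float[wavBatch.length * samplesPerItem];
for (int b = 0; b < wavBatch.length; b++) {
System.arraycopy(wavBatch[b], 0, wav, b * samplesPerItem, samplesPerItem);
}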
|
||||
|
||||
// Clean up
|
||||
textIdsTensor.close();
|
||||
textMaskTensor.close();
|
||||
dpResult.close();
|
||||
textEncResult.close();
|
||||
totalStepTensor.close();
|
||||
finalLatentTensor.close();
|
||||
vocoderResult.close();
|
||||
|
||||
return new TTSResult(wav, duration);
|
||||
}
|
||||
|
||||
private NoisyLatentResult sampleNoisyLatent(float[] duration) {
|
||||
int bsz = duration.length;
|
||||
float maxDur = 0;
|
||||
for (float d : duration) {
|
||||
maxDur = Math.max(maxDur, d);
|
||||
}
|
||||
|
||||
long wavLenMax = (long) (maxDur * sampleRate);
|
||||
long[] wavLengths = new long[bsz];
|
||||
for (int i = 0; i < bsz; i++) {
|
||||
wavLengths[i] = (long) (duration[i] * sampleRate);
|
||||
}
|
||||
|
||||
int chunkSize = baseChunkSize * chunkCompress;
|
||||
int latentLen = (int) ((wavLenMax + chunkSize - 1) / chunkSize);
|
||||
int latentDim = ldim * chunkCompress;
|
||||
|
||||
Random rng = new Random();
|
||||
float[][][] noisyLatent = new float[bsz][latentDim][latentLen];
|
||||
for (int b = 0; b < bsz; b++) {
|
||||
for (int d = 0; d < latentDim; d++) {
|
||||
for (int t = 0; t < latentLen; t++) {
|
||||
// Box-Muller transform
|
||||
double u1 = Math.max(1e-10, rng.nextDouble());
|
||||
double u2 = rng.nextDouble();
|
||||
noisyLatent[b][d][t] = (float) (Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
float[][][] latentMask = Helper.getLatentMask(wavLengths, config);
|
||||
|
||||
// Apply mask
|
||||
for (int b = 0; b < bsz; b++) {
|
||||
for (int d = 0; d < latentDim; d++) {
|
||||
for (int t = 0; t < latentLen; t++) {
|
||||
noisyLatent[b][d][t] *= latentMask[b][0][t];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return new NoisyLatentResult(noisyLatent, latentMask);
|
||||
}
|
||||
|
||||
public void close() throws OrtException {
|
||||
if (dpSession != null) dpSession.close();
|
||||
if (textEncSession != null) textEncSession.close();
|
||||
if (vectorEstSession != null) vectorEstSession.close();
|
||||
if (vocoderSession != null) vocoderSession.close();
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Style holder class
|
||||
*/
|
||||
class Style {
|
||||
OnnxTensor ttlTensor;
|
||||
OnnxTensor dpTensor;
|
||||
|
||||
Style(OnnxTensor ttlTensor, OnnxTensor dpTensor) {
|
||||
this.ttlTensor = ttlTensor;
|
||||
this.dpTensor = dpTensor;
|
||||
}
|
||||
|
||||
public void close() throws OrtException {
|
||||
if (ttlTensor != null) ttlTensor.close();
|
||||
if (dpTensor != null) dpTensor.close();
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* TTS result holder
|
||||
*/
|
||||
class TTSResult {
|
||||
float[] wav;
|
||||
float[] duration;
|
||||
|
||||
TTSResult(float[] wav, float[] duration) {
|
||||
this.wav = wav;
|
||||
this.duration = duration;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Noisy latent result holder
|
||||
*/
|
||||
class NoisyLatentResult {
|
||||
float[][][] noisyLatent;
|
||||
float[][][] latentMask;
|
||||
|
||||
NoisyLatentResult(float[][][] noisyLatent, float[][][] latentMask) {
|
||||
this.noisyLatent = noisyLatent;
|
||||
this.latentMask = latentMask;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Helper utility class
|
||||
*/
|
||||
public class Helper {
|
||||
|
||||
/**
|
||||
* Load voice style from JSON files
|
||||
*/
|
||||
public static Style loadVoiceStyle(List<String> voiceStylePaths, boolean verbose, OrtEnvironment env)
|
||||
throws IOException, OrtException {
|
||||
int bsz = voiceStylePaths.size();
|
||||
|
||||
// Read first file to get dimensions
|
||||
ObjectMapper mapper = new ObjectMapper();
|
||||
JsonNode firstRoot = mapper.readTree(new File(voiceStylePaths.get(0)));
|
||||
|
||||
long[] ttlDims = new long[3];
|
||||
for (int i = 0; i < 3; i++) {
|
||||
ttlDims[i] = firstRoot.get("style_ttl").get("dims").get(i).asLong();
|
||||
}
|
||||
long[] dpDims = new long[3];
|
||||
for (int i = 0; i < 3; i++) {
|
||||
dpDims[i] = firstRoot.get("style_dp").get("dims").get(i).asLong();
|
||||
}
|
||||
|
||||
long ttlDim1 = ttlDims[1];
|
||||
long ttlDim2 = ttlDims[2];
|
||||
long dpDim1 = dpDims[1];
|
||||
long dpDim2 = dpDims[2];
|
||||
|
||||
// Pre-allocate arrays with full batch size
|
||||
int ttlSize = (int) (bsz * ttlDim1 * ttlDim2);
|
||||
int dpSize = (int) (bsz * dpDim1 * dpDim2);
|
||||
float[] ttlFlat = new float[ttlSize];
|
||||
float[] dpFlat = new float[dpSize];
|
||||
|
||||
// Fill in the data
|
||||
for (int i = 0; i < bsz; i++) {
|
||||
JsonNode root = mapper.readTree(new File(voiceStylePaths.get(i)));
|
||||
|
||||
// Flatten TTL data
|
||||
int ttlOffset = (int) (i * ttlDim1 * ttlDim2);
|
||||
int idx = 0;
|
||||
JsonNode ttlData = root.get("style_ttl").get("data");
|
||||
for (JsonNode batch : ttlData) {
|
||||
for (JsonNode row : batch) {
|
||||
for (JsonNode val : row) {
|
||||
ttlFlat[ttlOffset + idx++] = (float) val.asDouble();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Flatten DP data
|
||||
int dpOffset = (int) (i * dpDim1 * dpDim2);
|
||||
idx = 0;
|
||||
JsonNode dpData = root.get("style_dp").get("data");
|
||||
for (JsonNode batch : dpData) {
|
||||
for (JsonNode row : batch) {
|
||||
for (JsonNode val : row) {
|
||||
dpFlat[dpOffset + idx++] = (float) val.asDouble();
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
long[] ttlShape = {bsz, ttlDim1, ttlDim2};
|
||||
long[] dpShape = {bsz, dpDim1, dpDim2};
|
||||
|
||||
OnnxTensor ttlTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(ttlFlat), ttlShape);
|
||||
OnnxTensor dpTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(dpFlat), dpShape);
|
||||
|
||||
if (verbose) {
|
||||
System.out.println("Loaded " + bsz + " voice styles\n");
|
||||
}
|
||||
|
||||
return new Style(ttlTensor, dpTensor);
|
||||
}
|
||||
|
||||
/**
|
||||
* Load TTS components
|
||||
*/
|
||||
public static TextToSpeech loadTextToSpeech(String onnxDir, boolean useGpu, OrtEnvironment env)
|
||||
throws IOException, OrtException {
|
||||
if (useGpu) {
|
||||
throw new RuntimeException("GPU mode is not supported yet");
|
||||
}
|
||||
System.out.println("Using CPU for inference\n");
|
||||
|
||||
// Load config
|
||||
Config config = loadCfgs(onnxDir);
|
||||
|
||||
// Create session options
|
||||
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
|
||||
|
||||
// Load models
|
||||
OrtSession dpSession = env.createSession(onnxDir + "/duration_predictor.onnx", opts);
|
||||
OrtSession textEncSession = env.createSession(onnxDir + "/text_encoder.onnx", opts);
|
||||
OrtSession vectorEstSession = env.createSession(onnxDir + "/vector_estimator.onnx", opts);
|
||||
OrtSession vocoderSession = env.createSession(onnxDir + "/vocoder.onnx", opts);
|
||||
|
||||
// Load text processor
|
||||
UnicodeProcessor textProcessor = new UnicodeProcessor(onnxDir + "/unicode_indexer.json");
|
||||
|
||||
return new TextToSpeech(config, textProcessor, dpSession, textEncSession, vectorEstSession, vocoderSession);
|
||||
}
|
||||
|
||||
/**
|
||||
* Load configuration from JSON
|
||||
*/
|
||||
public static Config loadCfgs(String onnxDir) throws IOException {
|
||||
ObjectMapper mapper = new ObjectMapper();
|
||||
JsonNode root = mapper.readTree(new File(onnxDir + "/tts.json"));
|
||||
|
||||
Config config = new Config();
|
||||
config.ae = new Config.AEConfig();
|
||||
config.ae.sampleRate = root.get("ae").get("sample_rate").asInt();
|
||||
config.ae.baseChunkSize = root.get("ae").get("base_chunk_size").asInt();
|
||||
|
||||
config.ttl = new Config.TTLConfig();
|
||||
config.ttl.chunkCompressFactor = root.get("ttl").get("chunk_compress_factor").asInt();
|
||||
config.ttl.latentDim = root.get("ttl").get("latent_dim").asInt();
|
||||
|
||||
return config;
|
||||
}
|
||||
|
||||
/**
|
||||
* Get latent mask from wav lengths
|
||||
*/
|
||||
public static float[][][] getLatentMask(long[] wavLengths, Config config) {
|
||||
long baseChunkSize = config.ae.baseChunkSize;
|
||||
long chunkCompressFactor = config.ttl.chunkCompressFactor;
|
||||
long latentSize = baseChunkSize * chunkCompressFactor;
|
||||
|
||||
long[] latentLengths = new long[wavLengths.length];
|
||||
long maxLen = 0;
|
||||
for (int i = 0; i < wavLengths.length; i++) {
|
||||
latentLengths[i] = (wavLengths[i] + latentSize - 1) / latentSize;
|
||||
maxLen = Math.max(maxLen, latentLengths[i]);
|
||||
}
|
||||
|
||||
float[][][] mask = new float[wavLengths.length][1][(int) maxLen];
|
||||
for (int i = 0; i < wavLengths.length; i++) {
|
||||
for (int j = 0; j < maxLen; j++) {
|
||||
mask[i][0][j] = j < latentLengths[i] ? 1.0f : 0.0f;
|
||||
}
|
||||
}
|
||||
return mask;
|
||||
}
|
||||
|
||||
/**
|
||||
* Write WAV file
|
||||
*/
|
||||
public static void writeWavFile(String filename, float[] audioData, int sampleRate) throws IOException {
|
||||
// Convert float to byte array
|
||||
byte[] bytes = new byte[audioData.length * 2];
|
||||
ByteBuffer buffer = ByteBuffer.wrap(bytes);
|
||||
buffer.order(ByteOrder.LITTLE_ENDIAN);
|
||||
|
||||
for (float sample : audioData) {
|
||||
short val = (short) Math.max(-32768, Math.min(32767, sample * 32767));
|
||||
buffer.putShort(val);
|
||||
}
|
||||
|
||||
ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
|
||||
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
|
||||
AudioInputStream ais = new AudioInputStream(bais, format, audioData.length);
|
||||
AudioSystem.write(ais, AudioFileFormat.Type.WAVE, new File(filename));
|
||||
}
|
||||
|
||||
/**
|
||||
* Sanitize filename
|
||||
*/
|
||||
public static String sanitizeFilename(String text, int maxLen) {
|
||||
if (text.length() > maxLen) {
|
||||
text = text.substring(0, maxLen);
|
||||
}
|
||||
return text.replaceAll("[^a-zA-Z0-9]", "_");
|
||||
}
|
||||
|
||||
/**
|
||||
* Timer utility
|
||||
*/
|
||||
public static <T> T timer(String name, java.util.function.Supplier<T> fn) {
|
||||
long start = System.currentTimeMillis();
|
||||
System.out.println(name + "...");
|
||||
T result = fn.get();
|
||||
long elapsed = System.currentTimeMillis() - start;
|
||||
System.out.printf(" -> %s completed in %.2f sec\n", name, elapsed / 1000.0);
|
||||
return result;
|
||||
}
|
||||
|
||||
/**
|
||||
* Create float tensor from 3D array
|
||||
*/
|
||||
public static OnnxTensor createFloatTensor(float[][][] array, OrtEnvironment env) throws OrtException {
|
||||
int dim0 = array.length;
|
||||
int dim1 = array[0].length;
|
||||
int dim2 = array[0][0].length;
|
||||
|
||||
float[] flat = new float[dim0 * dim1 * dim2];
|
||||
int idx = 0;
|
||||
for (int i = 0; i < dim0; i++) {
|
||||
for (int j = 0; j < dim1; j++) {
|
||||
for (int k = 0; k < dim2; k++) {
|
||||
flat[idx++] = array[i][j][k];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
long[] shape = {dim0, dim1, dim2};
|
||||
return OnnxTensor.createTensor(env, FloatBuffer.wrap(flat), shape);
|
||||
}
|
||||
|
||||
/**
|
||||
* Create long tensor from 2D array
|
||||
*/
|
||||
public static OnnxTensor createLongTensor(long[][] array, OrtEnvironment env) throws OrtException {
|
||||
int dim0 = array.length;
|
||||
int dim1 = array[0].length;
|
||||
|
||||
long[] flat = new long[dim0 * dim1];
|
||||
int idx = 0;
|
||||
for (int i = 0; i < dim0; i++) {
|
||||
for (int j = 0; j < dim1; j++) {
|
||||
flat[idx++] = array[i][j];
|
||||
}
|
||||
}
|
||||
|
||||
long[] shape = {dim0, dim1};
|
||||
return OnnxTensor.createTensor(env, LongBuffer.wrap(flat), shape);
|
||||
}
|
||||
|
||||
/**
|
||||
* Load JSON long array
|
||||
*/
|
||||
public static long[] loadJsonLongArray(String filePath) throws IOException {
|
||||
ObjectMapper mapper = new ObjectMapper();
|
||||
JsonNode root = mapper.readTree(new File(filePath));
|
||||
|
||||
long[] result = new long[root.size()];
|
||||
for (int i = 0; i < root.size(); i++) {
|
||||
result[i] = root.get(i).asLong();
|
||||
}
|
||||
return result;
|
||||
}
|
||||
}
|
||||
|
||||
@@ -0,0 +1,97 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `ExampleONNX.java`.

## Installation

This project uses [Maven](https://maven.apache.org/) for dependency management.

### Prerequisites

- Java 11 or higher
- Maven 3.6 or higher

### Install dependencies

```bash
mvn clean install
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
mvn exec:java
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
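
With these defaults, each generation writes one WAV file per text into `results/`. Judging from `ExampleONNX.java` and `Helper.sanitizeFilename`, the file name is the first 20 characters of the text with every non-alphanumeric character replaced by an underscore, followed by the 1-based generation index. The sketch below reproduces that naming rule in isolation; the class name `OutputNameSketch` and the `outputPath` helper are illustrative and not part of the example code.

```java
public class OutputNameSketch {
    // Same rule as Helper.sanitizeFilename: truncate first, then replace non-alphanumerics.
    static String sanitizeFilename(String text, int maxLen) {
        if (text.length() > maxLen) {
            text = text.substring(0, maxLen);
        }
        return text.replaceAll("[^a-zA-Z0-9]", "_");
    }

    // Hypothetical helper: builds the output path for a 1-based generation index.
    static String outputPath(String saveDir, String text, int run) {
        return saveDir + "/" + sanitizeFilename(text, 20) + "_" + run + ".wav";
    }

    public static void main(String[] args) {
        String text = "This morning, I took a walk in the park.";
        for (int run = 1; run <= 4; run++) {
            System.out.println(outputPath("results", text, run));
        }
        // First line printed: results/This_morning__I_took_1.wav
    }
}
```

Because only the first 20 characters are used, two texts that share the same opening would map to the same file name within a generation.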

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
mvn exec:java -Dexec.args="--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant.'"
```

This will:
- Generate speech for 2 different voice-text pairs
- Use male voice (M1.json) for the first text
- Use female voice (F1.json) for the second text
- Process both samples in a single batch
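
The pairing works by position: in `ExampleONNX.java`, `--voice-style` is split on commas and `--text` on the pipe character, and the two lists must have the same length. The following sketch isolates that logic so it can be run without the models; `BatchPairingSketch` and `pair` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPairingSketch {
    /** Splits both argument values and pairs entry i of one with entry i of the other. */
    static List<String[]> pair(String voiceStyleArg, String textArg) {
        String[] styles = voiceStyleArg.split(",");  // comma-separated style paths
        String[] texts = textArg.split("\\|");       // pipe-separated texts
        if (styles.length != texts.length) {
            throw new IllegalArgumentException("Number of voice styles (" + styles.length
                    + ") must match number of texts (" + texts.length + ")");
        }
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < styles.length; i++) {
            pairs.add(new String[] {styles[i], texts[i]});
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String[]> pairs = pair(
                "assets/voice_styles/M1.json,assets/voice_styles/F1.json",
                "First text, with a comma.|Second text.");
        for (String[] p : pairs) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

One consequence of this scheme is that commas inside a text are safe, while a literal pipe character inside a text would be treated as a separator.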

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
mvn exec:java -Dexec.args="--total-step 10 --voice-style assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
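
Inference gets slower because the vector estimator runs once per denoising step. The loop below mirrors the shape of the denoising loop in `Helper.java`, with the ONNX session replaced by a stand-in interface so the sketch runs on its own. `DenoiseLoopSketch` and `Estimator` are hypothetical names, and the toy estimator is only a counter, not the real model.

```java
public class DenoiseLoopSketch {
    /** Stand-in for the vector estimator; the real one is an ONNX Runtime session. */
    interface Estimator {
        float[] step(float[] latent, int currentStep, int totalStep);
    }

    /** Runs the estimator totalStep times, feeding each output back in as the next input. */
    static float[] denoise(float[] noisyLatent, int totalStep, Estimator estimator) {
        float[] xt = noisyLatent;
        for (int step = 0; step < totalStep; step++) {
            xt = estimator.step(xt, step, totalStep);
        }
        return xt;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Toy estimator: halves every value and counts how often it is invoked.
        Estimator toy = (latent, current, total) -> {
            calls[0]++;
            float[] out = new float[latent.length];
            for (int i = 0; i < latent.length; i++) {
                out[i] = latent[i] * 0.5f;
            }
            return out;
        };
        denoise(new float[] {1.0f, -2.0f}, 10, toy);
        System.out.println("Estimator calls: " + calls[0]);
    }
}
```

Doubling `--total-step` therefore roughly doubles the time spent in this loop, while the duration predictor, text encoder, and vocoder still run once per call.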

**Note**: The examples above wrap the `--text` value in single quotes inside `-Dexec.args`, so an apostrophe in the text is read as a closing quote. Escape it, or run the JAR directly (see "Building a Fat JAR" below):
```bash
java -jar target/tts-example.jar --total-step 10 --text "Text with apostrophe's here"
```

## Building a Fat JAR

To create a standalone JAR with all dependencies:
```bash
mvn clean package
```

Then run it directly:
```bash
java -jar target/tts-example.jar
```

Or with arguments:
```bash
java -jar target/tts-example.jar --total-step 10 --text "Your custom text here"
```

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; the example exits with an error) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, separated by a pipe character |
| `--save-dir` | str | `results` | Output directory |

## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference
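
Batching texts of different lengths relies on padding: in `Helper.java`, the text ID arrays are padded to the longest text in the batch and a mask marks which positions hold real characters. The sketch below reproduces that mask construction in isolation; the class name `TextMaskSketch` is illustrative and the lengths are made up.

```java
import java.util.Arrays;

public class TextMaskSketch {
    /** Builds a [batch][1][maxLen] mask: 1.0 for real positions, 0.0 for padding. */
    static float[][][] textMask(int[] lengths) {
        int maxLen = 0;
        for (int len : lengths) {
            maxLen = Math.max(maxLen, len);
        }
        float[][][] mask = new float[lengths.length][1][maxLen];
        for (int i = 0; i < lengths.length; i++) {
            for (int j = 0; j < lengths[i]; j++) {
                mask[i][0][j] = 1.0f; // remaining entries keep the default 0.0f
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        float[][][] mask = textMask(new int[] {2, 3});
        System.out.println(Arrays.toString(mask[0][0])); // [1.0, 1.0, 0.0]
        System.out.println(Arrays.toString(mask[1][0])); // [1.0, 1.0, 1.0]
    }
}
```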
|
||||
|
||||
Symlink
+1
@@ -0,0 +1 @@
|
||||
../assets
|
||||
+110
@@ -0,0 +1,110 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<project xmlns="http://maven.apache.org/POM/4.0.0"
|
||||
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
||||
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
|
||||
http://maven.apache.org/xsd/maven-4.0.0.xsd">
|
||||
<modelVersion>4.0.0</modelVersion>
|
||||
|
||||
<groupId>ai.supertonic</groupId>
|
||||
<artifactId>tts-onnx-java</artifactId>
|
||||
<version>1.0.0</version>
|
||||
<packaging>jar</packaging>
|
||||
|
||||
<name>TTS ONNX Java Example</name>
|
||||
<description>Text-to-Speech inference using ONNX Runtime in Java</description>
|
||||
|
||||
<properties>
|
||||
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
|
||||
<maven.compiler.source>11</maven.compiler.source>
|
||||
<maven.compiler.target>11</maven.compiler.target>
|
||||
<onnxruntime.version>1.23.1</onnxruntime.version>
|
||||
<jackson.version>2.15.2</jackson.version>
|
||||
</properties>
|
||||
|
||||
<dependencies>
|
||||
<!-- ONNX Runtime -->
|
||||
<dependency>
|
||||
<groupId>com.microsoft.onnxruntime</groupId>
|
||||
<artifactId>onnxruntime</artifactId>
|
||||
<version>${onnxruntime.version}</version>
|
||||
</dependency>
|
||||
|
||||
<!-- Jackson for JSON parsing -->
|
||||
<dependency>
|
||||
<groupId>com.fasterxml.jackson.core</groupId>
|
||||
<artifactId>jackson-databind</artifactId>
|
||||
<version>${jackson.version}</version>
|
||||
</dependency>
|
||||
|
||||
<!-- JTransforms for Fast FFT -->
|
||||
<dependency>
|
||||
<groupId>com.github.wendykierp</groupId>
|
||||
<artifactId>JTransforms</artifactId>
|
||||
<version>3.1</version>
|
||||
</dependency>
|
||||
</dependencies>
|
||||
|
||||
<build>
|
||||
<sourceDirectory>.</sourceDirectory>
|
||||
<plugins>
|
||||
<!-- Maven Compiler Plugin -->
|
||||
<plugin>
|
||||
<groupId>org.apache.maven.plugins</groupId>
|
||||
<artifactId>maven-compiler-plugin</artifactId>
|
||||
<version>3.11.0</version>
|
||||
<configuration>
|
||||
<source>11</source>
|
||||
<target>11</target>
|
||||
</configuration>
|
||||
</plugin>
|
||||
|
||||
<!-- Maven Exec Plugin for running the example -->
|
||||
<plugin>
|
||||
<groupId>org.codehaus.mojo</groupId>
|
||||
<artifactId>exec-maven-plugin</artifactId>
|
||||
<version>3.1.0</version>
|
||||
<configuration>
|
||||
<mainClass>ExampleONNX</mainClass>
|
||||
</configuration>
|
||||
</plugin>
|
||||
|
||||
<!-- Maven Jar Plugin -->
|
||||
<plugin>
|
||||
<groupId>org.apache.maven.plugins</groupId>
|
||||
<artifactId>maven-jar-plugin</artifactId>
|
||||
<version>3.3.0</version>
|
||||
<configuration>
|
||||
<archive>
|
||||
<manifest>
|
||||
<mainClass>ExampleONNX</mainClass>
|
||||
</manifest>
|
||||
</archive>
|
||||
</configuration>
|
||||
</plugin>
|
||||
|
||||
<!-- Maven Shade Plugin for creating fat JAR -->
|
||||
<plugin>
|
||||
<groupId>org.apache.maven.plugins</groupId>
|
||||
<artifactId>maven-shade-plugin</artifactId>
|
||||
<version>3.5.0</version>
|
||||
<executions>
|
||||
<execution>
|
||||
<phase>package</phase>
|
||||
<goals>
|
||||
<goal>shade</goal>
|
||||
</goals>
|
||||
<configuration>
|
||||
<transformers>
|
||||
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
|
||||
<mainClass>ExampleONNX</mainClass>
|
||||
</transformer>
|
||||
</transformers>
|
||||
<finalName>tts-example</finalName>
|
||||
</configuration>
|
||||
</execution>
|
||||
</executions>
|
||||
</plugin>
|
||||
</plugins>
|
||||
</build>
|
||||
</project>
|
||||
|
||||
@@ -0,0 +1,102 @@
# TTS ONNX Node.js Implementation

Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech from text.

## Requirements

- Node.js v16 or higher
- npm or yarn

## Installation

```bash
cd nodejs
npm install
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
npm start
```

Or:
```bash
node example_onnx.js
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

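As a rough illustration of where the file names in `results/` come from, the sketch below mirrors the naming expression used in `example_onnx.js`: the first 20 characters of the text, with every character that is not an ASCII letter or digit replaced by an underscore, followed by the run index. The helper name `outputFileName` and the shortened sample text are assumptions made for this example; they are not part of the repository.

```javascript
// Hypothetical helper (not part of the repository) that mirrors the
// file-naming expression in example_onnx.js.
function outputFileName(text, runIndex) {
    // Keep the first 20 characters and replace anything that is not
    // an ASCII letter or digit with an underscore.
    const prefix = text.substring(0, 20).replace(/[^a-zA-Z0-9]/g, '_');
    return `${prefix}_${runIndex}.wav`;
}

// Shortened stand-in for the default text; only the first 20 characters matter.
const defaultText = 'This morning, I took a walk in the park.';

// With the default of 4 generations, one file per run index is produced.
for (let n = 1; n <= 4; n++) {
    console.log(outputFileName(defaultText, n));
}
// First line printed: This_morning__I_took_1.wav
```

Because only the first 20 characters are used, two texts that share the same opening will map to the same file names and overwrite each other within one output directory.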
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
node example_onnx.js \
  --voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```

This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch

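The pairing rule above can be sketched in a few lines. This is a minimal illustration, assuming the same separators the script uses (commas for `--voice-style`, pipes for `--text`); the function name `pairBatch` and the two short sample sentences are hypothetical and only serve the example.

```javascript
// Hypothetical helper: split the two argument strings and pair them up.
function pairBatch(voiceStyleArg, textArg) {
    const styles = voiceStyleArg.split(','); // comma-separated style files
    const texts = textArg.split('|');        // pipe-separated texts

    // The script rejects batches whose two lists differ in length.
    if (styles.length !== texts.length) {
        throw new Error(
            `Number of voice styles (${styles.length}) must match number of texts (${texts.length})`
        );
    }

    // Entry i of the batch uses style i for text i.
    return styles.map((style, i) => ({ style, text: texts[i] }));
}

const batch = pairBatch(
    'assets/voice_styles/M1.json,assets/voice_styles/F1.json',
    'First sentence.|Second sentence.'
);
console.log(batch);
```

One consequence of this scheme is that a text cannot itself contain a `|` character, and a style path cannot contain a comma, since both are treated as separators.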
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
node example_onnx.js \
  --total-step 10 \
  --voice-style "assets/voice_styles/M1.json" \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

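To see why more steps cost more time, it may help to look at the shape of the loop: the latent is passed through the estimator once per step, so the number of estimator calls equals `--total-step`. The sketch below is only a stand-in under stated assumptions. `denoise` and `countingEstimator` are hypothetical names, the estimator here is a toy function rather than an ONNX session, and the real script awaits an asynchronous model call at each step.

```javascript
// Hypothetical loop skeleton: feed the latent back into the estimator
// once per step, passing the current step and the total step count.
function denoise(initialLatent, totalStep, estimateStep) {
    let latent = initialLatent;
    for (let step = 0; step < totalStep; step++) {
        latent = estimateStep(latent, step, totalStep);
    }
    return latent;
}

// Toy stand-in for the model: counts its calls and halves every value.
let calls = 0;
const countingEstimator = (latent, step, totalStep) => {
    calls += 1;
    return latent.map(x => x * 0.5);
};

const out = denoise([1, -2, 4], 10, countingEstimator);
console.log(calls); // 10
```

Under this structure, doubling `--total-step` doubles the number of estimator calls, while the other stages of the pipeline still run once per generation.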
## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
| `--save-dir` | str | `results` | Output directory |

## Notes

- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet

## Architecture

- `helper.js`: Node.js port of Python's `helper.py`
  - `Preprocessor`: Audio preprocessing (STFT, Mel Spectrogram)
  - `UnicodeProcessor`: Text preprocessing
  - Utility functions (mask generation, tensor conversion, etc.)

- `example_onnx.js`: Main inference script
  - ONNX model loading
  - TTS inference pipeline execution
  - WAV file saving

- `package.json`: Node.js project configuration and dependencies

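As an example of the mask-generation utilities mentioned above, here is a compact restatement of `lengthToMask` and `getLatentMask` from `helper.js`. The function names come from that file, but the bodies are condensed for illustration, and the chunk parameters in the usage lines (512 and 6) are made-up values, not the ones stored in the model configuration.

```javascript
// Build a [B, 1, maxLen] mask: 1.0 for valid positions, 0.0 for padding.
function lengthToMask(lengths, maxLen = null) {
    maxLen = maxLen || Math.max(...lengths);
    return lengths.map(len => [
        Array.from({ length: maxLen }, (_, j) => (j < len ? 1.0 : 0.0))
    ]);
}

// Convert waveform lengths (in samples) to latent lengths by ceiling
// division, then build the mask over latent frames.
function getLatentMask(wavLengths, baseChunkSize, chunkCompressFactor) {
    const latentSize = baseChunkSize * chunkCompressFactor;
    const latentLengths = wavLengths.map(len =>
        Math.floor((len + latentSize - 1) / latentSize)
    );
    return lengthToMask(latentLengths);
}

console.log(JSON.stringify(lengthToMask([3, 1])));
// [[[1,1,1]],[[1,0,0]]]

// Illustrative parameters only: 512 * 6 = 3072 samples per latent frame.
console.log(JSON.stringify(getLatentMask([6144, 3000], 512, 6)));
// [[[1,1]],[[1,0]]]
```

The extra middle dimension of size 1 lets the same mask be multiplied against every channel of a `[B, D, T]` latent, which is how the noisy latent is masked in `helper.js`.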
## Implementation Notes

1. **Pure Node.js WAV Processing**: Writes WAV files without external native libraries. Outputs 16-bit PCM format.

2. **Memory Efficiency**: Note that Node.js may consume significant memory when processing large arrays.

3. **Performance**: The mel spectrogram extraction (Step 1-1) is currently slower than Python's Librosa, which uses highly optimized C extensions. This bottleneck could be further improved with additional optimizations such as WASM-based FFT libraries or native addons.
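The 16-bit PCM conversion from note 1 can be sketched as follows. This is a simplified illustration of the sample-conversion step only, assuming float samples nominally in the range [-1, 1]; `floatTo16BitPcm` is a hypothetical name, and the `writeWavFile` function in `helper.js` additionally writes a 44-byte RIFF/WAVE header in front of this data.

```javascript
// Hypothetical helper: convert float samples to little-endian 16-bit PCM
// using Node's built-in Buffer, with no native audio library involved.
function floatTo16BitPcm(samples) {
    const buffer = Buffer.alloc(samples.length * 2); // 2 bytes per sample
    for (let i = 0; i < samples.length; i++) {
        // Clamp to [-1, 1] so out-of-range values cannot overflow int16.
        const clamped = Math.max(-1, Math.min(1, samples[i]));
        buffer.writeInt16LE(Math.floor(clamped * 32767), i * 2);
    }
    return buffer;
}

const pcm = floatTo16BitPcm([0, 0.5, 1, -1, 2]);
console.log(pcm.length);         // 10
console.log(pcm.readInt16LE(4)); // 32767
```

Clamping before scaling matters because `writeInt16LE` throws a range error for values outside the signed 16-bit range.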
Symlink
+1
@@ -0,0 +1 @@
../assets
@@ -0,0 +1,104 @@
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

import { loadTextToSpeech, loadVoiceStyle, timer, writeWavFile } from './helper.js';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

/**
 * Parse command line arguments
 */
function parseArgs() {
    const args = {
        useGpu: false,
        onnxDir: 'assets/onnx',
        totalStep: 5,
        nTest: 4,
        voiceStyle: ['assets/voice_styles/M1.json'],
        text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
        saveDir: 'results'
    };

    for (let i = 2; i < process.argv.length; i++) {
        const arg = process.argv[i];
        if (arg === '--use-gpu') {
            args.useGpu = true;
        } else if (arg === '--onnx-dir' && i + 1 < process.argv.length) {
            args.onnxDir = process.argv[++i];
        } else if (arg === '--total-step' && i + 1 < process.argv.length) {
            // Explicit radix so the value is always read as decimal
            args.totalStep = parseInt(process.argv[++i], 10);
        } else if (arg === '--n-test' && i + 1 < process.argv.length) {
            args.nTest = parseInt(process.argv[++i], 10);
        } else if (arg === '--voice-style' && i + 1 < process.argv.length) {
            // Multiple style files are separated by commas
            args.voiceStyle = process.argv[++i].split(',');
        } else if (arg === '--text' && i + 1 < process.argv.length) {
            // Multiple texts are separated by pipes
            args.text = process.argv[++i].split('|');
        } else if (arg === '--save-dir' && i + 1 < process.argv.length) {
            args.saveDir = process.argv[++i];
        }
    }

    return args;
}

/**
 * Main inference function
 */
async function main() {
    console.log('=== TTS Inference with ONNX Runtime (Node.js) ===\n');

    // --- 1. Parse arguments --- //
    const args = parseArgs();
    const totalStep = args.totalStep;
    const nTest = args.nTest;
    const saveDir = args.saveDir;
    const voiceStylePaths = args.voiceStyle.map(p => path.resolve(__dirname, p));
    const textList = args.text;

    if (voiceStylePaths.length !== textList.length) {
        throw new Error(`Number of voice styles (${voiceStylePaths.length}) must match number of texts (${textList.length})`);
    }

    const bsz = voiceStylePaths.length;

    // --- 2. Load Text to Speech --- //
    const onnxDir = path.resolve(__dirname, args.onnxDir);
    const textToSpeech = await loadTextToSpeech(onnxDir, args.useGpu);

    // --- 3. Load Voice Style --- //
    const style = loadVoiceStyle(voiceStylePaths, true);

    // --- 4. Synthesize speech --- //
    for (let n = 0; n < nTest; n++) {
        console.log(`\n[${n + 1}/${nTest}] Starting synthesis...`);

        const { wav, duration } = await timer('Generating speech from text', async () => {
            return await textToSpeech.call(textList, style, totalStep);
        });

        if (!fs.existsSync(saveDir)) {
            fs.mkdirSync(saveDir, { recursive: true });
        }

        const wavShape = [bsz, wav.length / bsz];
        for (let b = 0; b < bsz; b++) {
            const fname = `${textList[b].substring(0, 20).replace(/[^a-zA-Z0-9]/g, '_')}_${n + 1}.wav`;
            const wavLen = Math.floor(textToSpeech.sampleRate * duration[b]);
            const wavOut = wav.slice(b * wavShape[1], b * wavShape[1] + wavLen);

            const outputPath = path.join(saveDir, fname);
            writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
            console.log(`Saved: ${outputPath}`);
        }
    }

    console.log('\n=== Synthesis completed successfully! ===');
}

// Run main function
main().catch(err => {
    console.error('Error during inference:', err);
    process.exit(1);
});
@@ -0,0 +1,392 @@
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import * as ort from 'onnxruntime-node';

const __filename = fileURLToPath(import.meta.url);

/**
 * Unicode text processor
 */
class UnicodeProcessor {
    constructor(unicodeIndexerJsonPath) {
        this.indexer = JSON.parse(fs.readFileSync(unicodeIndexerJsonPath, 'utf8'));
    }

    _preprocessText(text) {
        // Simple NFKD normalization (JavaScript has normalize built-in)
        return text.normalize('NFKD');
    }

    _textToUnicodeValues(text) {
        return Array.from(text).map(char => char.charCodeAt(0));
    }

    _getTextMask(textIdsLengths) {
        return lengthToMask(textIdsLengths);
    }

    call(textList) {
        const processedTexts = textList.map(t => this._preprocessText(t));
        const textIdsLengths = processedTexts.map(t => t.length);
        const maxLen = Math.max(...textIdsLengths);

        const textIds = [];
        for (let i = 0; i < processedTexts.length; i++) {
            const row = new Array(maxLen).fill(0);
            const unicodeVals = this._textToUnicodeValues(processedTexts[i]);
            for (let j = 0; j < unicodeVals.length; j++) {
                row[j] = this.indexer[unicodeVals[j]];
            }
            textIds.push(row);
        }

        const textMask = this._getTextMask(textIdsLengths);
        return { textIds, textMask };
    }
}

/**
 * Style class
 */
class Style {
    constructor(styleTtlOnnx, styleDpOnnx) {
        this.ttl = styleTtlOnnx;
        this.dp = styleDpOnnx;
    }
}

/**
 * TextToSpeech class
 */
class TextToSpeech {
    constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
        this.cfgs = cfgs;
        this.textProcessor = textProcessor;
        this.dpOrt = dpOrt;
        this.textEncOrt = textEncOrt;
        this.vectorEstOrt = vectorEstOrt;
        this.vocoderOrt = vocoderOrt;
        this.sampleRate = cfgs.ae.sample_rate;
        this.baseChunkSize = cfgs.ae.base_chunk_size;
        this.chunkCompressFactor = cfgs.ttl.chunk_compress_factor;
        this.ldim = cfgs.ttl.latent_dim;
    }

    sampleNoisyLatent(duration) {
        const wavLenMax = Math.max(...duration) * this.sampleRate;
        const wavLengths = duration.map(d => Math.floor(d * this.sampleRate));
        const chunkSize = this.baseChunkSize * this.chunkCompressFactor;
        const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
        const latentDim = this.ldim * this.chunkCompressFactor;

        // Generate random noise
        const noisyLatent = [];
        for (let b = 0; b < duration.length; b++) {
            const batch = [];
            for (let d = 0; d < latentDim; d++) {
                const row = [];
                for (let t = 0; t < latentLen; t++) {
                    // Box-Muller transform for normal distribution
                    // Add epsilon to avoid log(0)
                    const eps = 1e-10;
                    const u1 = Math.max(eps, Math.random());
                    const u2 = Math.random();
                    const randNormal = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
                    row.push(randNormal);
                }
                batch.push(row);
            }
            noisyLatent.push(batch);
        }

        const latentMask = getLatentMask(wavLengths, this.baseChunkSize, this.chunkCompressFactor);

        // Apply mask
        for (let b = 0; b < noisyLatent.length; b++) {
            for (let d = 0; d < noisyLatent[b].length; d++) {
                for (let t = 0; t < noisyLatent[b][d].length; t++) {
                    noisyLatent[b][d][t] *= latentMask[b][0][t];
                }
            }
        }

        return { noisyLatent, latentMask };
    }

    async call(textList, style, totalStep) {
        if (textList.length !== style.ttl.dims[0]) {
            throw new Error('Number of texts must match number of style vectors');
        }
        const bsz = textList.length;
        const { textIds, textMask } = this.textProcessor.call(textList);
        const textIdsShape = [bsz, textIds[0].length];
        const textMaskShape = [bsz, 1, textMask[0][0].length];

        const textMaskTensor = arrayToTensor(textMask, textMaskShape);

        const dpResult = await this.dpOrt.run({
            text_ids: intArrayToTensor(textIds, textIdsShape),
            style_dp: style.dp,
            text_mask: textMaskTensor
        });

        const durOnnx = Array.from(dpResult.duration.data);

        const textEncResult = await this.textEncOrt.run({
            text_ids: intArrayToTensor(textIds, textIdsShape),
            style_ttl: style.ttl,
            text_mask: textMaskTensor
        });

        const textEmbTensor = textEncResult.text_emb;

        let { noisyLatent, latentMask } = this.sampleNoisyLatent(durOnnx);
        const latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
        const latentMaskShape = [bsz, 1, latentMask[0][0].length];

        const latentMaskTensor = arrayToTensor(latentMask, latentMaskShape);

        const totalStepArray = new Array(bsz).fill(totalStep);
        const scalarShape = [bsz];
        const totalStepTensor = arrayToTensor(totalStepArray, scalarShape);

        for (let step = 0; step < totalStep; step++) {
            const currentStepArray = new Array(bsz).fill(step);

            const vectorEstResult = await this.vectorEstOrt.run({
                noisy_latent: arrayToTensor(noisyLatent, latentShape),
                text_emb: textEmbTensor,
                style_ttl: style.ttl,
                text_mask: textMaskTensor,
                latent_mask: latentMaskTensor,
                total_step: totalStepTensor,
                current_step: arrayToTensor(currentStepArray, scalarShape)
            });

            const denoisedLatent = Array.from(vectorEstResult.denoised_latent.data);

            // Update latent with the denoised output
            let idx = 0;
            for (let b = 0; b < noisyLatent.length; b++) {
                for (let d = 0; d < noisyLatent[b].length; d++) {
                    for (let t = 0; t < noisyLatent[b][d].length; t++) {
                        noisyLatent[b][d][t] = denoisedLatent[idx++];
                    }
                }
            }
        }

        const vocoderResult = await this.vocoderOrt.run({
            latent: arrayToTensor(noisyLatent, latentShape)
        });

        const wav = Array.from(vocoderResult.wav_tts.data);
        return { wav, duration: durOnnx };
    }
}

/**
 * Convert lengths to binary mask
 */
function lengthToMask(lengths, maxLen = null) {
    maxLen = maxLen || Math.max(...lengths);
    const mask = [];
    for (let i = 0; i < lengths.length; i++) {
        const row = [];
        for (let j = 0; j < maxLen; j++) {
            row.push(j < lengths[i] ? 1.0 : 0.0);
        }
        mask.push([row]); // [B, 1, maxLen]
    }
    return mask;
}

/**
 * Get latent mask from wav lengths
 */
function getLatentMask(wavLengths, baseChunkSize, chunkCompressFactor) {
    const latentSize = baseChunkSize * chunkCompressFactor;
    const latentLengths = wavLengths.map(len =>
        Math.floor((len + latentSize - 1) / latentSize)
    );
    return lengthToMask(latentLengths);
}

/**
 * Load ONNX model
 */
async function loadOnnx(onnxPath, opts) {
    return await ort.InferenceSession.create(onnxPath, opts);
}

/**
 * Load all ONNX models for TTS
 */
async function loadOnnxAll(onnxDir, opts) {
    const dpPath = path.join(onnxDir, 'duration_predictor.onnx');
    const textEncPath = path.join(onnxDir, 'text_encoder.onnx');
    const vectorEstPath = path.join(onnxDir, 'vector_estimator.onnx');
    const vocoderPath = path.join(onnxDir, 'vocoder.onnx');

    const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = await Promise.all([
        loadOnnx(dpPath, opts),
        loadOnnx(textEncPath, opts),
        loadOnnx(vectorEstPath, opts),
        loadOnnx(vocoderPath, opts)
    ]);

    return { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt };
}

/**
 * Load configuration
 */
function loadCfgs(onnxDir) {
    const cfgPath = path.join(onnxDir, 'tts.json');
    const cfgs = JSON.parse(fs.readFileSync(cfgPath, 'utf8'));
    return cfgs;
}

/**
 * Load text processor
 */
function loadTextProcessor(onnxDir) {
    const unicodeIndexerPath = path.join(onnxDir, 'unicode_indexer.json');
    const textProcessor = new UnicodeProcessor(unicodeIndexerPath);
    return textProcessor;
}

/**
 * Load voice style from JSON file
 */
export function loadVoiceStyle(voiceStylePaths, verbose = false) {
    const bsz = voiceStylePaths.length;

    // Read first file to get dimensions
    const firstStyle = JSON.parse(fs.readFileSync(voiceStylePaths[0], 'utf8'));
    const ttlDims = firstStyle.style_ttl.dims;
    const dpDims = firstStyle.style_dp.dims;

    const ttlDim1 = ttlDims[1];
    const ttlDim2 = ttlDims[2];
    const dpDim1 = dpDims[1];
    const dpDim2 = dpDims[2];

    // Pre-allocate arrays with full batch size
    const ttlSize = bsz * ttlDim1 * ttlDim2;
    const dpSize = bsz * dpDim1 * dpDim2;
    const ttlFlat = new Float32Array(ttlSize);
    const dpFlat = new Float32Array(dpSize);

    // Fill in the data
    for (let i = 0; i < bsz; i++) {
        const voiceStyle = JSON.parse(fs.readFileSync(voiceStylePaths[i], 'utf8'));

        const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
        const ttlOffset = i * ttlDim1 * ttlDim2;
        ttlFlat.set(ttlData, ttlOffset);

        const dpData = voiceStyle.style_dp.data.flat(Infinity);
        const dpOffset = i * dpDim1 * dpDim2;
        dpFlat.set(dpData, dpOffset);
    }

    const ttlStyle = new ort.Tensor('float32', ttlFlat, [bsz, ttlDim1, ttlDim2]);
    const dpStyle = new ort.Tensor('float32', dpFlat, [bsz, dpDim1, dpDim2]);

    if (verbose) {
        console.log(`Loaded ${bsz} voice styles`);
    }

    return new Style(ttlStyle, dpStyle);
}

/**
 * Load text to speech components
 */
export async function loadTextToSpeech(onnxDir, useGpu = false) {
    const opts = {};
    if (useGpu) {
        throw new Error('GPU mode is not supported yet');
    } else {
        console.log('Using CPU for inference');
    }

    const cfgs = loadCfgs(onnxDir);
    const { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt } = await loadOnnxAll(onnxDir, opts);
    const textProcessor = loadTextProcessor(onnxDir);
    const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);

    return textToSpeech;
}

/**
 * Convert 3D array to ONNX tensor
 */
function arrayToTensor(array, dims) {
    // Flatten the array
    const flat = array.flat(Infinity);
    return new ort.Tensor('float32', Float32Array.from(flat), dims);
}

/**
 * Convert 2D int array to ONNX tensor
 */
function intArrayToTensor(array, dims) {
    const flat = array.flat(Infinity);
    return new ort.Tensor('int64', BigInt64Array.from(flat.map(x => BigInt(x))), dims);
}

/**
 * Write WAV file
 */
export function writeWavFile(filename, audioData, sampleRate) {
    const numChannels = 1;
    const bitsPerSample = 16;
    const byteRate = sampleRate * numChannels * bitsPerSample / 8;
    const blockAlign = numChannels * bitsPerSample / 8;
    const dataSize = audioData.length * bitsPerSample / 8;

    const buffer = Buffer.alloc(44 + dataSize);

    // RIFF header
    buffer.write('RIFF', 0);
    buffer.writeUInt32LE(36 + dataSize, 4);
    buffer.write('WAVE', 8);

    // fmt chunk
    buffer.write('fmt ', 12);
    buffer.writeUInt32LE(16, 16); // fmt chunk size
    buffer.writeUInt16LE(1, 20); // audio format (PCM)
    buffer.writeUInt16LE(numChannels, 22);
    buffer.writeUInt32LE(sampleRate, 24);
    buffer.writeUInt32LE(byteRate, 28);
    buffer.writeUInt16LE(blockAlign, 32);
    buffer.writeUInt16LE(bitsPerSample, 34);

    // data chunk
    buffer.write('data', 36);
    buffer.writeUInt32LE(dataSize, 40);

    // Write audio data
    for (let i = 0; i < audioData.length; i++) {
        const sample = Math.max(-1, Math.min(1, audioData[i]));
        const intSample = Math.floor(sample * 32767);
        buffer.writeInt16LE(intSample, 44 + i * 2);
    }

    fs.writeFileSync(filename, buffer);
}

/**
 * Timer utility for measuring execution time
 */
export async function timer(name, fn) {
    const start = Date.now();
    console.log(`${name}...`);
    const result = await fn();
    const elapsed = ((Date.now() - start) / 1000).toFixed(2);
    console.log(` -> ${name} completed in ${elapsed} sec`);
    return result;
}
@@ -0,0 +1,26 @@
{
  "name": "tts-onnx-nodejs",
  "version": "1.0.0",
  "description": "TTS inference using ONNX Runtime for Node.js",
  "main": "example_onnx.js",
  "type": "module",
  "scripts": {
    "start": "node example_onnx.js"
  },
  "keywords": [
    "tts",
    "onnx",
    "speech-synthesis",
    "nodejs"
  ],
  "author": "",
  "license": "MIT",
  "dependencies": {
    "fft.js": "^4.0.3",
    "js-yaml": "^4.1.0",
    "onnxruntime-node": "^1.19.2"
  },
  "engines": {
    "node": ">=16.0.0"
  }
}
@@ -0,0 +1,83 @@
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `example_onnx.py`.

## Installation

This project uses [uv](https://docs.astral.sh/uv/) for fast package management.

### Install uv (if not already installed)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### Install dependencies
```bash
uv sync
```

Or if you prefer using traditional pip with requirements.txt:
```bash
pip install -r requirements.txt
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
uv run example_onnx.py
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4

### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
uv run example_onnx.py \
  --voice-style assets/voice_styles/M1.json assets/voice_styles/F1.json \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```

This will:
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first text
- Use female voice style (F1.json) for the second text
- Process both samples in a single batch

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
uv run example_onnx.py \
  --total-step 10 \
  --voice-style assets/voice_styles/M1.json \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
| `--text` | str+ | (long default text) | Text(s) to synthesize |
| `--save-dir` | str | `results` | Output directory |

## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
@@ -0,0 +1,91 @@
import argparse
import os

import soundfile as sf

from helper import load_text_to_speech, timer, sanitize_filename, load_voice_style


def parse_args():
    parser = argparse.ArgumentParser(description="TTS Inference with ONNX")

    # Device settings
    parser.add_argument(
        "--use-gpu", action="store_true", help="Use GPU for inference (default: CPU)"
    )

    # Model settings
    parser.add_argument(
        "--onnx-dir",
        type=str,
        default="assets/onnx",
        help="Path to ONNX model directory",
    )

    # Synthesis parameters
    parser.add_argument(
        "--total-step", type=int, default=5, help="Number of denoising steps"
    )
    parser.add_argument(
        "--n-test", type=int, default=4, help="Number of times to generate"
    )

    # Input/Output
    parser.add_argument(
        "--voice-style",
        type=str,
        nargs="+",
        default=["assets/voice_styles/M1.json"],
        help="Voice style file path(s). Can specify multiple files for batch processing",
    )
    parser.add_argument(
        "--text",
        type=str,
        nargs="+",
        default=[
            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
        ],
        help="Text(s) to synthesize. Can specify multiple texts for batch processing",
    )
    parser.add_argument(
        "--save-dir", type=str, default="results", help="Output directory"
    )

    return parser.parse_args()


print("=== TTS Inference with ONNX Runtime (Python) ===\n")

# --- 1. Parse arguments --- #
args = parse_args()
total_step = args.total_step
n_test = args.n_test
save_dir = args.save_dir
voice_style_paths = args.voice_style
text_list = args.text

assert len(voice_style_paths) == len(
    text_list
), f"Number of voice styles ({len(voice_style_paths)}) must match number of texts ({len(text_list)})"

bsz = len(voice_style_paths)

# --- 2. Load Text to Speech --- #
text_to_speech = load_text_to_speech(args.onnx_dir, args.use_gpu)

# --- 3. Load Voice Style --- #
style = load_voice_style(voice_style_paths, verbose=True)

# --- 4. Synthesize Speech --- #
for n in range(n_test):
    print(f"\n[{n+1}/{n_test}] Starting synthesis...")
    with timer("Generating speech from text"):
        wav, duration = text_to_speech(text_list, style, total_step)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for b in range(bsz):
        fname = f"{sanitize_filename(text_list[b], 20)}_{n+1}.wav"
        w = wav[b, : int(text_to_speech.sample_rate * duration[b].item())]  # [T_trim]
        sf.write(os.path.join(save_dir, fname), w, text_to_speech.sample_rate)
        print(f"Saved: {save_dir}/{fname}")
print("\n=== Synthesis completed successfully! ===")
+249
@@ -0,0 +1,249 @@
import json
import os
import time
from contextlib import contextmanager
from typing import Optional
from unicodedata import normalize

import numpy as np
import onnxruntime as ort


class UnicodeProcessor:
    def __init__(self, unicode_indexer_path: str):
        with open(unicode_indexer_path, "r") as f:
            self.indexer = json.load(f)

    def _preprocess_text(self, text: str) -> str:
        # TODO: add more preprocessing
        text = normalize("NFKD", text)
        return text

    def _get_text_mask(self, text_ids_lengths: np.ndarray) -> np.ndarray:
        text_mask = length_to_mask(text_ids_lengths)
        return text_mask

    def _text_to_unicode_values(self, text: str) -> np.ndarray:
        unicode_values = np.array(
            [ord(char) for char in text], dtype=np.uint16
        )  # 2 bytes
        return unicode_values

    def __call__(self, text_list: list[str]) -> tuple[np.ndarray, np.ndarray]:
        text_list = [self._preprocess_text(t) for t in text_list]
        text_ids_lengths = np.array([len(text) for text in text_list], dtype=np.int64)
        text_ids = np.zeros((len(text_list), text_ids_lengths.max()), dtype=np.int64)
        for i, text in enumerate(text_list):
            unicode_vals = self._text_to_unicode_values(text)
            text_ids[i, : len(unicode_vals)] = np.array(
                [self.indexer[val] for val in unicode_vals], dtype=np.int64
            )
        text_mask = self._get_text_mask(text_ids_lengths)
        return text_ids, text_mask


class Style:
|
||||
def __init__(self, style_ttl_onnx: np.ndarray, style_dp_onnx: np.ndarray):
|
||||
self.ttl = style_ttl_onnx
|
||||
self.dp = style_dp_onnx
|
||||
|
||||
|
||||
class TextToSpeech:
|
||||
def __init__(
|
||||
self,
|
||||
cfgs: dict,
|
||||
text_processor: UnicodeProcessor,
|
||||
dp_ort: ort.InferenceSession,
|
||||
text_enc_ort: ort.InferenceSession,
|
||||
vector_est_ort: ort.InferenceSession,
|
||||
vocoder_ort: ort.InferenceSession,
|
||||
):
|
||||
self.cfgs = cfgs
|
||||
self.text_processor = text_processor
|
||||
self.dp_ort = dp_ort
|
||||
self.text_enc_ort = text_enc_ort
|
||||
self.vector_est_ort = vector_est_ort
|
||||
self.vocoder_ort = vocoder_ort
|
||||
self.sample_rate = cfgs["ae"]["sample_rate"]
|
||||
self.base_chunk_size = cfgs["ae"]["base_chunk_size"]
|
||||
self.chunk_compress_factor = cfgs["ttl"]["chunk_compress_factor"]
|
||||
self.ldim = cfgs["ttl"]["latent_dim"]
|
||||
|
||||
def sample_noisy_latent(
|
||||
self, duration: np.ndarray
|
||||
) -> tuple[np.ndarray, np.ndarray]:
|
||||
bsz = len(duration)
|
||||
wav_len_max = duration.max() * self.sample_rate
|
||||
wav_lengths = (duration * self.sample_rate).astype(np.int64)
|
||||
chunk_size = self.base_chunk_size * self.chunk_compress_factor
|
||||
latent_len = ((wav_len_max + chunk_size - 1) / chunk_size).astype(np.int32)
|
||||
latent_dim = self.ldim * self.chunk_compress_factor
|
||||
noisy_latent = np.random.randn(bsz, latent_dim, latent_len).astype(np.float32)
|
||||
latent_mask = get_latent_mask(
|
||||
wav_lengths, self.base_chunk_size, self.chunk_compress_factor
|
||||
)
|
||||
noisy_latent = noisy_latent * latent_mask
|
||||
return noisy_latent, latent_mask
|
||||
|
||||
def __call__(
|
||||
self, text_list: list[str], style: Style, total_step: int
|
||||
) -> tuple[np.ndarray, np.ndarray]:
|
||||
assert (
|
||||
len(text_list) == style.ttl.shape[0]
|
||||
), "Number of texts must match number of style vectors"
|
||||
bsz = len(text_list)
|
||||
text_ids, text_mask = self.text_processor(text_list)
|
||||
dur_onnx, *_ = self.dp_ort.run(
|
||||
None, {"text_ids": text_ids, "style_dp": style.dp, "text_mask": text_mask}
|
||||
)
|
||||
text_emb_onnx, *_ = self.text_enc_ort.run(
|
||||
None,
|
||||
{"text_ids": text_ids, "style_ttl": style.ttl, "text_mask": text_mask},
|
||||
)  # note: dur_onnx (from the duration predictor call above) has shape [bsz]
|
||||
xt, latent_mask = self.sample_noisy_latent(dur_onnx)
|
||||
total_step_np = np.array([total_step] * bsz, dtype=np.float32)
|
||||
for step in range(total_step):
|
||||
current_step = np.array([step] * bsz, dtype=np.float32)
|
||||
xt, *_ = self.vector_est_ort.run(
|
||||
None,
|
||||
{
|
||||
"noisy_latent": xt,
|
||||
"text_emb": text_emb_onnx,
|
||||
"style_ttl": style.ttl,
|
||||
"text_mask": text_mask,
|
||||
"latent_mask": latent_mask,
|
||||
"current_step": current_step,
|
||||
"total_step": total_step_np,
|
||||
},
|
||||
)
|
||||
wav, *_ = self.vocoder_ort.run(None, {"latent": xt})
|
||||
return wav, dur_onnx
|
||||
|
||||
|
||||
def length_to_mask(lengths: np.ndarray, max_len: Optional[int] = None) -> np.ndarray:
|
||||
"""
|
||||
Convert lengths to binary mask.
|
||||
|
||||
Args:
|
||||
lengths: (B,)
|
||||
max_len: int
|
||||
|
||||
Returns:
|
||||
mask: (B, 1, max_len)
|
||||
"""
|
||||
max_len = max_len or lengths.max()
|
||||
ids = np.arange(0, max_len)
|
||||
mask = (ids < np.expand_dims(lengths, axis=1)).astype(np.float32)
|
||||
return mask.reshape(-1, 1, max_len)
|
||||
|
||||
|
||||
def get_latent_mask(
|
||||
wav_lengths: np.ndarray, base_chunk_size: int, chunk_compress_factor: int
|
||||
) -> np.ndarray:
|
||||
latent_size = base_chunk_size * chunk_compress_factor
|
||||
latent_lengths = (wav_lengths + latent_size - 1) // latent_size
|
||||
latent_mask = length_to_mask(latent_lengths)
|
||||
return latent_mask
|
||||
|
||||
|
||||
def load_onnx(
|
||||
onnx_path: str, opts: ort.SessionOptions, providers: list[str]
|
||||
) -> ort.InferenceSession:
|
||||
return ort.InferenceSession(onnx_path, sess_options=opts, providers=providers)
|
||||
|
||||
|
||||
def load_onnx_all(
|
||||
onnx_dir: str, opts: ort.SessionOptions, providers: list[str]
|
||||
) -> tuple[
|
||||
ort.InferenceSession,
|
||||
ort.InferenceSession,
|
||||
ort.InferenceSession,
|
||||
ort.InferenceSession,
|
||||
]:
|
||||
dp_onnx_path = os.path.join(onnx_dir, "duration_predictor.onnx")
|
||||
text_enc_onnx_path = os.path.join(onnx_dir, "text_encoder.onnx")
|
||||
vector_est_onnx_path = os.path.join(onnx_dir, "vector_estimator.onnx")
|
||||
vocoder_onnx_path = os.path.join(onnx_dir, "vocoder.onnx")
|
||||
|
||||
dp_ort = load_onnx(dp_onnx_path, opts, providers)
|
||||
text_enc_ort = load_onnx(text_enc_onnx_path, opts, providers)
|
||||
vector_est_ort = load_onnx(vector_est_onnx_path, opts, providers)
|
||||
vocoder_ort = load_onnx(vocoder_onnx_path, opts, providers)
|
||||
return dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
|
||||
|
||||
|
||||
def load_cfgs(onnx_dir: str) -> dict:
|
||||
cfg_path = os.path.join(onnx_dir, "tts.json")
|
||||
with open(cfg_path, "r") as f:
|
||||
cfgs = json.load(f)
|
||||
return cfgs
|
||||
|
||||
|
||||
def load_text_processor(onnx_dir: str) -> UnicodeProcessor:
|
||||
unicode_indexer_path = os.path.join(onnx_dir, "unicode_indexer.json")
|
||||
text_processor = UnicodeProcessor(unicode_indexer_path)
|
||||
return text_processor
|
||||
|
||||
|
||||
def load_text_to_speech(onnx_dir: str, use_gpu: bool = False) -> TextToSpeech:
|
||||
opts = ort.SessionOptions()
|
||||
if use_gpu:
|
||||
raise NotImplementedError("GPU mode is not fully tested")
|
||||
else:
|
||||
providers = ["CPUExecutionProvider"]
|
||||
print("Using CPU for inference")
|
||||
cfgs = load_cfgs(onnx_dir)
|
||||
dp_ort, text_enc_ort, vector_est_ort, vocoder_ort = load_onnx_all(
|
||||
onnx_dir, opts, providers
|
||||
)
|
||||
text_processor = load_text_processor(onnx_dir)
|
||||
return TextToSpeech(
|
||||
cfgs, text_processor, dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
|
||||
)
|
||||
|
||||
|
||||
def load_voice_style(voice_style_paths: list[str], verbose: bool = False) -> Style:
|
||||
bsz = len(voice_style_paths)
|
||||
|
||||
# Read first file to get dimensions
|
||||
with open(voice_style_paths[0], "r") as f:
|
||||
first_style = json.load(f)
|
||||
ttl_dims = first_style["style_ttl"]["dims"]
|
||||
dp_dims = first_style["style_dp"]["dims"]
|
||||
|
||||
# Pre-allocate arrays with full batch size
|
||||
ttl_style = np.zeros([bsz, ttl_dims[1], ttl_dims[2]], dtype=np.float32)
|
||||
dp_style = np.zeros([bsz, dp_dims[1], dp_dims[2]], dtype=np.float32)
|
||||
|
||||
# Fill in the data
|
||||
for i, voice_style_path in enumerate(voice_style_paths):
|
||||
with open(voice_style_path, "r") as f:
|
||||
voice_style = json.load(f)
|
||||
|
||||
ttl_data = np.array(
|
||||
voice_style["style_ttl"]["data"], dtype=np.float32
|
||||
).flatten()
|
||||
ttl_style[i] = ttl_data.reshape(ttl_dims[1], ttl_dims[2])
|
||||
|
||||
dp_data = np.array(voice_style["style_dp"]["data"], dtype=np.float32).flatten()
|
||||
dp_style[i] = dp_data.reshape(dp_dims[1], dp_dims[2])
|
||||
|
||||
if verbose:
|
||||
print(f"Loaded {bsz} voice styles")
|
||||
return Style(ttl_style, dp_style)
|
||||
|
||||
|
||||
@contextmanager
|
||||
def timer(name: str):
|
||||
start = time.time()
|
||||
print(f"{name}...")
|
||||
yield
|
||||
print(f" -> {name} completed in {time.time() - start:.2f} sec")
|
||||
|
||||
|
||||
def sanitize_filename(text: str, max_len: int) -> str:
|
||||
"""Sanitize filename by replacing non-alphanumeric characters with underscores"""
|
||||
import re
|
||||
|
||||
prefix = text[:max_len]
|
||||
return re.sub(r"[^a-zA-Z0-9]", "_", prefix)
|
||||
@@ -0,0 +1,20 @@
|
||||
[project]
|
||||
name = "tts-onnx"
|
||||
version = "1.0.0"
|
||||
description = "TTS ONNX Inference"
|
||||
requires-python = ">=3.10"
|
||||
dependencies = [
|
||||
"onnxruntime==1.23.1",
|
||||
"numpy>=1.26.0",
|
||||
"soundfile>=0.12.1",
|
||||
"librosa>=0.10.0",
|
||||
"PyYAML>=6.0",
|
||||
]
|
||||
|
||||
[tool.setuptools]
|
||||
py-modules = []
|
||||
|
||||
[build-system]
|
||||
requires = ["setuptools"]
|
||||
build-backend = "setuptools.build_meta"
|
||||
|
||||
@@ -0,0 +1,5 @@
|
||||
onnxruntime==1.23.1
|
||||
numpy>=1.26.0
|
||||
soundfile>=0.12.1
|
||||
librosa>=0.10.0
|
||||
PyYAML>=6.0
|
||||
Generated
+1142
File diff suppressed because it is too large
@@ -0,0 +1,21 @@
|
||||
# Rust build artifacts
|
||||
/target/
|
||||
Cargo.lock
|
||||
|
||||
# Output directory
|
||||
/results/
|
||||
|
||||
# IDE
|
||||
.vscode/
|
||||
.idea/
|
||||
*.swp
|
||||
*.swo
|
||||
*~
|
||||
|
||||
# OS
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
|
||||
# Debug
|
||||
*.pdb
|
||||
|
||||
@@ -0,0 +1,41 @@
|
||||
[package]
|
||||
name = "supertonic-tts"
|
||||
version = "0.1.0"
|
||||
edition = "2021"
|
||||
|
||||
[dependencies]
|
||||
# ONNX Runtime
|
||||
ort = "2.0.0-rc.7"
|
||||
|
||||
# Array processing (like NumPy)
|
||||
ndarray = { version = "0.16", features = ["rayon"] }
|
||||
rand = "0.8"
|
||||
rand_distr = "0.4"
|
||||
|
||||
# Parallel processing
|
||||
rayon = "1.10"
|
||||
|
||||
# Audio processing
|
||||
hound = "3.5"
|
||||
rustfft = "6.2"
|
||||
|
||||
# JSON serialization
|
||||
serde = { version = "1.0", features = ["derive"] }
|
||||
serde_json = "1.0"
|
||||
|
||||
# CLI argument parsing
|
||||
clap = { version = "4.5", features = ["derive"] }
|
||||
|
||||
# Error handling
|
||||
anyhow = "1.0"
|
||||
|
||||
# Unicode normalization
|
||||
unicode-normalization = "0.1"
|
||||
|
||||
# System calls
|
||||
libc = "0.2"
|
||||
|
||||
[[bin]]
|
||||
name = "example_onnx"
|
||||
path = "src/example_onnx.rs"
|
||||
|
||||
+101
@@ -0,0 +1,101 @@
|
||||
# TTS ONNX Inference Examples
|
||||
|
||||
This guide provides examples for running TTS inference using Rust.
|
||||
|
||||
## Installation
|
||||
|
||||
This project uses [Cargo](https://doc.rust-lang.org/cargo/) for package management.
|
||||
|
||||
### Install Rust (if not already installed)
|
||||
```bash
|
||||
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
|
||||
```
|
||||
|
||||
### Build the project
|
||||
```bash
|
||||
cargo build --release
|
||||
```
|
||||
|
||||
## Basic Usage
|
||||
|
||||
You can run the inference in two ways:
|
||||
1. **Using cargo run** (builds if needed, then runs)
|
||||
2. **Direct binary execution** (faster if already built)
|
||||
|
||||
### Example 1: Default Inference
|
||||
Run inference with default settings:
|
||||
```bash
# Using cargo run
cargo run --release --bin example_onnx

# Or directly execute the built binary (faster)
./target/release/example_onnx
```
|
||||
|
||||
This will use:
|
||||
- Voice style: `assets/voice_styles/M1.json`
|
||||
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
|
||||
- Output directory: `results/`
|
||||
- Total steps: 5
|
||||
- Number of generations: 4
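
Going by the example source, each generation is written into the output directory under a name built from the first 20 characters of the text, sanitized, plus a run counter. The following is a minimal sketch of that naming scheme, not the project's actual code: `output_paths` is a hypothetical helper introduced here for illustration, and this version of `sanitize_filename` truncates by characters, which is an assumption made so that non-ASCII text is handled safely.

```rust
use std::path::PathBuf;

/// Keep at most `max_len` characters and replace everything that is not
/// ASCII alphanumeric with an underscore.
fn sanitize_filename(text: &str, max_len: usize) -> String {
    text.chars()
        .take(max_len)
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect()
}

/// Hypothetical helper: list the paths that `n_test` generations of a
/// single text would be written to inside `save_dir`.
fn output_paths(save_dir: &str, text: &str, n_test: usize) -> Vec<PathBuf> {
    (1..=n_test)
        .map(|n| {
            let fname = format!("{}_{}.wav", sanitize_filename(text, 20), n);
            PathBuf::from(save_dir).join(fname)
        })
        .collect()
}
```

Under these assumptions, the default settings would yield four files in `results/`, one per generation, all sharing the same sanitized prefix.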
|
||||
|
||||
### Example 2: Batch Inference
|
||||
Process multiple voice styles and texts at once:
|
||||
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."

# Or using the binary directly
./target/release/example_onnx \
  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```
|
||||
|
||||
This will:
|
||||
- Generate speech for 2 different voice-text pairs
|
||||
- Use male voice (M1.json) for the first text
|
||||
- Use female voice (F1.json) for the second text
|
||||
- Process both samples in a single batch
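
The pairing above relies on the two flags being split on different delimiters: a comma for `--voice-style` and a vertical bar for `--text`. The sketch below illustrates that split-and-pair step using only the standard library. It is an assumption-laden illustration, not the example's real argument handling (which is delegated to clap), and `pair_batch` is a hypothetical name.

```rust
/// Split the raw flag values and pair each voice style with its text.
/// Returns an error when the two lists differ in length, mirroring the
/// check the example performs before synthesis.
fn pair_batch(voice_styles: &str, texts: &str) -> Result<Vec<(String, String)>, String> {
    let styles: Vec<&str> = voice_styles.split(',').collect();
    let texts: Vec<&str> = texts.split('|').collect();

    if styles.len() != texts.len() {
        return Err(format!(
            "Number of voice styles ({}) must match number of texts ({})",
            styles.len(),
            texts.len()
        ));
    }

    // The i-th style is used for the i-th text.
    Ok(styles
        .iter()
        .zip(texts.iter())
        .map(|(s, t)| (s.to_string(), t.to_string()))
        .collect())
}
```

One consequence of this design is that a text containing a literal vertical bar would be split into two entries, so such text would need to be passed some other way.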
|
||||
|
||||
### Example 3: High Quality Inference
|
||||
Increase denoising steps for better quality:
|
||||
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
  --total-step 10 \
  --voice-style assets/voice_styles/M1.json \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."

# Or using the binary directly
./target/release/example_onnx \
  --total-step 10 \
  --voice-style assets/voice_styles/M1.json \
  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
|
||||
|
||||
This will:
|
||||
- Use 10 denoising steps instead of the default 5
|
||||
- Produce higher quality output at the cost of slower inference
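
The speed cost comes from the shape of the denoising loop: the latent is passed through the vector estimator once per step, so the number of model invocations grows linearly with `--total-step`. The sketch below shows only that loop structure. The `estimate` closure is a hypothetical stand-in for the ONNX vector estimator, and the flat `Vec<f32>` latent is a simplification of the real 3-D tensor.

```rust
/// Run `total_step` refinement passes over a latent.
/// `estimate` receives the current latent, the current step index, and the
/// total step count, and returns the refined latent for the next pass.
fn denoise<F>(mut latent: Vec<f32>, total_step: usize, mut estimate: F) -> Vec<f32>
where
    F: FnMut(&[f32], f32, f32) -> Vec<f32>,
{
    for step in 0..total_step {
        // One model invocation per step.
        latent = estimate(&latent, step as f32, total_step as f32);
    }
    latent
}
```

With this structure, doubling the step count from 5 to 10 doubles the number of estimator calls, while the duration predictor, text encoder, and vocoder still run once each.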
|
||||
|
||||
## Available Arguments
|
||||
|
||||
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | `false` | Use GPU for inference (not supported yet; inference runs on CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, separated by a vertical bar |
| `--save-dir` | str | `results` | Output directory |
|
||||
|
||||
## Notes
|
||||
|
||||
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
|
||||
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
|
||||
- **GPU Support**: GPU mode is not supported yet
|
||||
- **Known Issues**: On some platforms (especially macOS), there might be a mutex cleanup warning during exit. This is a known ONNX Runtime issue and doesn't affect functionality. The implementation uses `libc::_exit()` and `mem::forget()` to bypass this issue.
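
To make the workaround in the last note more concrete, the sketch below shows what `mem::forget()` does: it takes ownership of a value and never runs its destructor. `FakeSession` and the `DROPS` counter are hypothetical stand-ins for an ONNX Runtime session and its cleanup, used only for illustration. The `libc::_exit()` half of the workaround is not shown because it needs the external `libc` crate; it ends the process immediately without running exit handlers.

```rust
use std::mem;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many times `FakeSession::drop` has run.
static DROPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a resource whose cleanup is problematic.
struct FakeSession;

impl Drop for FakeSession {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns the number of destructor runs seen after a normal drop and
/// then after an additional value has been passed to `mem::forget`.
fn drops_observed() -> (usize, usize) {
    let before = DROPS.load(Ordering::SeqCst);

    drop(FakeSession); // destructor runs
    let after_drop = DROPS.load(Ordering::SeqCst) - before;

    mem::forget(FakeSession); // destructor is skipped
    let after_forget = DROPS.load(Ordering::SeqCst) - before;

    (after_drop, after_forget)
}
```

The trade-off is that any cleanup the skipped destructor would have done is left to the operating system when the process ends, which is acceptable for a short-lived command-line tool but would leak resources in a long-running program.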
|
||||
|
||||
|
||||
Symlink
+1
@@ -0,0 +1 @@
|
||||
../assets
|
||||
@@ -0,0 +1,108 @@
|
||||
use anyhow::Result;
|
||||
use clap::Parser;
|
||||
use std::path::PathBuf;
|
||||
use std::fs;
|
||||
use std::mem;
|
||||
|
||||
mod helper;
|
||||
|
||||
use helper::{
|
||||
load_text_to_speech, load_voice_style, timer, write_wav_file, sanitize_filename,
|
||||
};
|
||||
|
||||
#[derive(Parser, Debug)]
|
||||
#[command(name = "TTS ONNX Inference")]
|
||||
#[command(about = "TTS Inference with ONNX Runtime (Rust)", long_about = None)]
|
||||
struct Args {
|
||||
/// Use GPU for inference (default: CPU)
|
||||
#[arg(long, default_value = "false")]
|
||||
use_gpu: bool,
|
||||
|
||||
/// Path to ONNX model directory
|
||||
#[arg(long, default_value = "assets/onnx")]
|
||||
onnx_dir: String,
|
||||
|
||||
/// Number of denoising steps
|
||||
#[arg(long, default_value = "5")]
|
||||
total_step: usize,
|
||||
|
||||
/// Number of times to generate
|
||||
#[arg(long, default_value = "4")]
|
||||
n_test: usize,
|
||||
|
||||
/// Voice style file path(s)
|
||||
#[arg(long, value_delimiter = ',', default_values_t = vec!["assets/voice_styles/M1.json".to_string()])]
|
||||
voice_style: Vec<String>,
|
||||
|
||||
/// Text(s) to synthesize
|
||||
#[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
|
||||
text: Vec<String>,
|
||||
|
||||
/// Output directory
|
||||
#[arg(long, default_value = "results")]
|
||||
save_dir: String,
|
||||
}
|
||||
|
||||
fn main() -> Result<()> {
|
||||
println!("=== TTS Inference with ONNX Runtime (Rust) ===\n");
|
||||
|
||||
// --- 1. Parse arguments --- //
|
||||
let args = Args::parse();
|
||||
let total_step = args.total_step;
|
||||
let n_test = args.n_test;
|
||||
let voice_style_paths = &args.voice_style;
|
||||
let text_list = &args.text;
|
||||
let save_dir = &args.save_dir;
|
||||
|
||||
if voice_style_paths.len() != text_list.len() {
|
||||
anyhow::bail!(
|
||||
"Number of voice styles ({}) must match number of texts ({})",
|
||||
voice_style_paths.len(),
|
||||
text_list.len()
|
||||
);
|
||||
}
|
||||
|
||||
let bsz = voice_style_paths.len();
|
||||
|
||||
// --- 2. Load TTS components --- //
|
||||
let mut text_to_speech = load_text_to_speech(&args.onnx_dir, args.use_gpu)?;
|
||||
|
||||
// --- 3. Load voice styles --- //
|
||||
let style = load_voice_style(voice_style_paths, true)?;
|
||||
|
||||
// --- 4. Synthesize speech --- //
|
||||
fs::create_dir_all(save_dir)?;
|
||||
|
||||
for n in 0..n_test {
|
||||
println!("\n[{}/{}] Starting synthesis...", n + 1, n_test);
|
||||
|
||||
let (wav, duration) = timer("Generating speech from text", || {
|
||||
text_to_speech.call(text_list, &style, total_step)
|
||||
})?;
|
||||
|
||||
// Save outputs
|
||||
let wav_len = wav.len() / bsz;
|
||||
for i in 0..bsz {
|
||||
let fname = format!("{}_{}.wav", sanitize_filename(&text_list[i], 20), n + 1);
|
||||
let actual_len = (text_to_speech.sample_rate as f32 * duration[i]) as usize;
|
||||
|
||||
let wav_start = i * wav_len;
|
||||
let wav_end = wav_start + actual_len.min(wav_len);
|
||||
let wav_slice = &wav[wav_start..wav_end];
|
||||
|
||||
let output_path = PathBuf::from(save_dir).join(&fname);
|
||||
write_wav_file(&output_path, wav_slice, text_to_speech.sample_rate)?;
|
||||
println!("Saved: {}", output_path.display());
|
||||
}
|
||||
}
|
||||
|
||||
println!("\n=== Synthesis completed successfully! ===");
|
||||
|
||||
// Prevent ONNX Runtime sessions from being dropped, which causes mutex cleanup issues
|
||||
mem::forget(text_to_speech);
|
||||
|
||||
// Use _exit to bypass all cleanup handlers and avoid ONNX Runtime mutex issues on macOS
|
||||
unsafe {
|
||||
libc::_exit(0);
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,507 @@
|
||||
// ============================================================================
|
||||
// TTS Helper Module - All utility functions and structures
|
||||
// ============================================================================
|
||||
|
||||
use ndarray::{Array, Array3};
|
||||
use serde::{Deserialize, Serialize};
|
||||
use serde_json;
|
||||
use std::fs::File;
|
||||
use std::io::BufReader;
|
||||
use std::path::Path;
|
||||
use anyhow::{Result, Context};
|
||||
use unicode_normalization::UnicodeNormalization;
|
||||
use hound::{WavWriter, WavSpec, SampleFormat};
|
||||
use rand_distr::{Distribution, Normal};
|
||||
|
||||
// ============================================================================
|
||||
// Configuration Structures
|
||||
// ============================================================================
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Config {
|
||||
pub ae: AEConfig,
|
||||
pub ttl: TTLConfig,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct AEConfig {
|
||||
pub sample_rate: i32,
|
||||
pub base_chunk_size: i32,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct TTLConfig {
|
||||
pub chunk_compress_factor: i32,
|
||||
pub latent_dim: i32,
|
||||
}
|
||||
|
||||
/// Load configuration from JSON file
|
||||
pub fn load_cfgs<P: AsRef<Path>>(onnx_dir: P) -> Result<Config> {
|
||||
let cfg_path = onnx_dir.as_ref().join("tts.json");
|
||||
let file = File::open(cfg_path)?;
|
||||
let reader = BufReader::new(file);
|
||||
let cfgs: Config = serde_json::from_reader(reader)?;
|
||||
Ok(cfgs)
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Voice Style Data Structure
|
||||
// ============================================================================
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct VoiceStyleData {
|
||||
pub style_ttl: StyleComponent,
|
||||
pub style_dp: StyleComponent,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct StyleComponent {
|
||||
pub data: Vec<Vec<Vec<f32>>>,
|
||||
pub dims: Vec<usize>,
|
||||
#[serde(rename = "type")]
|
||||
pub dtype: String,
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Unicode Text Processor
|
||||
// ============================================================================
|
||||
|
||||
pub struct UnicodeProcessor {
|
||||
indexer: Vec<i64>,
|
||||
}
|
||||
|
||||
impl UnicodeProcessor {
|
||||
pub fn new<P: AsRef<Path>>(unicode_indexer_json_path: P) -> Result<Self> {
|
||||
let file = File::open(unicode_indexer_json_path)?;
|
||||
let reader = BufReader::new(file);
|
||||
let indexer: Vec<i64> = serde_json::from_reader(reader)?;
|
||||
Ok(UnicodeProcessor { indexer })
|
||||
}
|
||||
|
||||
pub fn call(&self, text_list: &[String]) -> (Vec<Vec<i64>>, Array3<f32>) {
|
||||
let processed_texts: Vec<String> = text_list
|
||||
.iter()
|
||||
.map(|t| preprocess_text(t))
|
||||
.collect();
|
||||
|
||||
let text_ids_lengths: Vec<usize> = processed_texts
|
||||
.iter()
|
||||
.map(|t| t.chars().count())
|
||||
.collect();
|
||||
|
||||
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
|
||||
|
||||
let mut text_ids = Vec::new();
|
||||
for text in &processed_texts {
|
||||
let mut row = vec![0i64; max_len];
|
||||
let unicode_vals = text_to_unicode_values(text);
|
||||
for (j, &val) in unicode_vals.iter().enumerate() {
|
||||
if val < self.indexer.len() {
|
||||
row[j] = self.indexer[val];
|
||||
} else {
|
||||
row[j] = -1;
|
||||
}
|
||||
}
|
||||
text_ids.push(row);
|
||||
}
|
||||
|
||||
let text_mask = get_text_mask(&text_ids_lengths);
|
||||
|
||||
(text_ids, text_mask)
|
||||
}
|
||||
}
|
||||
|
||||
pub fn preprocess_text(text: &str) -> String {
|
||||
text.nfkd().collect()
|
||||
}
|
||||
|
||||
pub fn text_to_unicode_values(text: &str) -> Vec<usize> {
|
||||
text.chars().map(|c| c as usize).collect()
|
||||
}
|
||||
|
||||
pub fn length_to_mask(lengths: &[usize], max_len: Option<usize>) -> Array3<f32> {
|
||||
let bsz = lengths.len();
|
||||
let max_len = max_len.unwrap_or_else(|| *lengths.iter().max().unwrap_or(&0));
|
||||
|
||||
let mut mask = Array3::<f32>::zeros((bsz, 1, max_len));
|
||||
for (i, &len) in lengths.iter().enumerate() {
|
||||
for j in 0..len.min(max_len) {
|
||||
mask[[i, 0, j]] = 1.0;
|
||||
}
|
||||
}
|
||||
mask
|
||||
}
|
||||
|
||||
pub fn get_text_mask(text_ids_lengths: &[usize]) -> Array3<f32> {
|
||||
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
|
||||
length_to_mask(text_ids_lengths, Some(max_len))
|
||||
}
|
||||
|
||||
/// Sample noisy latent from normal distribution and apply mask
|
||||
pub fn sample_noisy_latent(
|
||||
duration: &[f32],
|
||||
sample_rate: i32,
|
||||
base_chunk_size: i32,
|
||||
chunk_compress: i32,
|
||||
latent_dim: i32,
|
||||
) -> (Array3<f32>, Array3<f32>) {
|
||||
let bsz = duration.len();
|
||||
let max_dur = duration.iter().fold(0.0f32, |a, &b| a.max(b));
|
||||
|
||||
let wav_len_max = (max_dur * sample_rate as f32) as usize;
|
||||
let wav_lengths: Vec<usize> = duration
|
||||
.iter()
|
||||
.map(|&d| (d * sample_rate as f32) as usize)
|
||||
.collect();
|
||||
|
||||
let chunk_size = (base_chunk_size * chunk_compress) as usize;
|
||||
let latent_len = (wav_len_max + chunk_size - 1) / chunk_size;
|
||||
let latent_dim_val = (latent_dim * chunk_compress) as usize;
|
||||
|
||||
let mut noisy_latent = Array3::<f32>::zeros((bsz, latent_dim_val, latent_len));
|
||||
|
||||
let normal = Normal::new(0.0, 1.0).unwrap();
|
||||
let mut rng = rand::thread_rng();
|
||||
|
||||
for b in 0..bsz {
|
||||
for d in 0..latent_dim_val {
|
||||
for t in 0..latent_len {
|
||||
noisy_latent[[b, d, t]] = normal.sample(&mut rng);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let latent_lengths: Vec<usize> = wav_lengths
|
||||
.iter()
|
||||
.map(|&len| (len + chunk_size - 1) / chunk_size)
|
||||
.collect();
|
||||
|
||||
let latent_mask = length_to_mask(&latent_lengths, Some(latent_len));
|
||||
|
||||
// Apply mask
|
||||
for b in 0..bsz {
|
||||
for d in 0..latent_dim_val {
|
||||
for t in 0..latent_len {
|
||||
noisy_latent[[b, d, t]] *= latent_mask[[b, 0, t]];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
(noisy_latent, latent_mask)
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// WAV File I/O
|
||||
// ============================================================================
|
||||
|
||||
pub fn write_wav_file<P: AsRef<Path>>(
|
||||
filename: P,
|
||||
audio_data: &[f32],
|
||||
sample_rate: i32,
|
||||
) -> Result<()> {
|
||||
let spec = WavSpec {
|
||||
channels: 1,
|
||||
sample_rate: sample_rate as u32,
|
||||
bits_per_sample: 16,
|
||||
sample_format: SampleFormat::Int,
|
||||
};
|
||||
|
||||
let mut writer = WavWriter::create(filename, spec)?;
|
||||
|
||||
for &sample in audio_data {
|
||||
let clamped = sample.max(-1.0).min(1.0);
|
||||
let val = (clamped * 32767.0) as i16;
|
||||
writer.write_sample(val)?;
|
||||
}
|
||||
|
||||
writer.finalize()?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Utility Functions
|
||||
// ============================================================================
|
||||
|
||||
pub fn timer<F, T>(name: &str, f: F) -> Result<T>
|
||||
where
|
||||
F: FnOnce() -> Result<T>,
|
||||
{
|
||||
let start = std::time::Instant::now();
|
||||
println!("{}...", name);
|
||||
let result = f()?;
|
||||
let elapsed = start.elapsed().as_secs_f64();
|
||||
println!(" -> {} completed in {:.2} sec", name, elapsed);
|
||||
Ok(result)
|
||||
}
|
||||
|
||||
pub fn sanitize_filename(text: &str, max_len: usize) -> String {
|
||||
    // Truncate by characters rather than bytes: slicing `&text[..max_len]`
    // panics when `max_len` falls inside a multi-byte UTF-8 character.
    let text: String = text.chars().take(max_len).collect();
|
||||
|
||||
text.chars()
|
||||
.map(|c| {
|
||||
if c.is_ascii_alphanumeric() {
|
||||
c
|
||||
} else {
|
||||
'_'
|
||||
}
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// ONNX Runtime Integration
|
||||
// ============================================================================
|
||||
|
||||
use ort::{
|
||||
session::Session,
|
||||
value::Value,
|
||||
};
|
||||
|
||||
pub struct Style {
|
||||
pub ttl: Array3<f32>,
|
||||
pub dp: Array3<f32>,
|
||||
}
|
||||
|
||||
pub struct TextToSpeech {
|
||||
cfgs: Config,
|
||||
text_processor: UnicodeProcessor,
|
||||
dp_ort: Session,
|
||||
text_enc_ort: Session,
|
||||
vector_est_ort: Session,
|
||||
vocoder_ort: Session,
|
||||
pub sample_rate: i32,
|
||||
}
|
||||
|
||||
impl TextToSpeech {
|
||||
pub fn new(
|
||||
cfgs: Config,
|
||||
text_processor: UnicodeProcessor,
|
||||
dp_ort: Session,
|
||||
text_enc_ort: Session,
|
||||
vector_est_ort: Session,
|
||||
vocoder_ort: Session,
|
||||
) -> Self {
|
||||
let sample_rate = cfgs.ae.sample_rate;
|
||||
TextToSpeech {
|
||||
cfgs,
|
||||
text_processor,
|
||||
dp_ort,
|
||||
text_enc_ort,
|
||||
vector_est_ort,
|
||||
vocoder_ort,
|
||||
sample_rate,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn call(
|
||||
&mut self,
|
||||
text_list: &[String],
|
||||
style: &Style,
|
||||
total_step: usize,
|
||||
) -> Result<(Vec<f32>, Vec<f32>)> {
|
||||
let bsz = text_list.len();
|
||||
|
||||
// Process text
|
||||
let (text_ids, text_mask) = self.text_processor.call(text_list);
|
||||
|
||||
let text_ids_array = {
|
||||
let text_ids_shape = (bsz, text_ids[0].len());
|
||||
let mut flat = Vec::new();
|
||||
for row in &text_ids {
|
||||
flat.extend_from_slice(row);
|
||||
}
|
||||
Array::from_shape_vec(text_ids_shape, flat)?
|
||||
};
|
||||
|
||||
let text_ids_value = Value::from_array(text_ids_array)?;
|
||||
let text_mask_value = Value::from_array(text_mask.clone())?;
|
||||
let style_dp_value = Value::from_array(style.dp.clone())?;
|
||||
|
||||
// Predict duration
|
||||
let dp_outputs = self.dp_ort.run(ort::inputs!{
|
||||
"text_ids" => &text_ids_value,
|
||||
"style_dp" => &style_dp_value,
|
||||
"text_mask" => &text_mask_value
|
||||
})?;
|
||||
|
||||
let (_, duration_data) = dp_outputs["duration"].try_extract_tensor::<f32>()?;
|
||||
let duration: Vec<f32> = duration_data.to_vec();
|
||||
|
||||
// Encode text
|
||||
let style_ttl_value = Value::from_array(style.ttl.clone())?;
|
||||
let text_enc_outputs = self.text_enc_ort.run(ort::inputs!{
|
||||
"text_ids" => &text_ids_value,
|
||||
"style_ttl" => &style_ttl_value,
|
||||
"text_mask" => &text_mask_value
|
||||
})?;
|
||||
|
||||
let (text_emb_shape, text_emb_data) = text_enc_outputs["text_emb"].try_extract_tensor::<f32>()?;
|
||||
let text_emb = Array3::from_shape_vec(
|
||||
(text_emb_shape[0] as usize, text_emb_shape[1] as usize, text_emb_shape[2] as usize),
|
||||
text_emb_data.to_vec()
|
||||
)?;
|
||||
|
||||
// Sample noisy latent
|
||||
let (mut xt, latent_mask) = sample_noisy_latent(
|
||||
&duration,
|
||||
self.sample_rate,
|
||||
self.cfgs.ae.base_chunk_size,
|
||||
self.cfgs.ttl.chunk_compress_factor,
|
||||
self.cfgs.ttl.latent_dim,
|
||||
);
|
||||
|
||||
// Prepare constant arrays
|
||||
let total_step_array = Array::from_elem(bsz, total_step as f32);
|
||||
|
||||
// Denoising loop
|
||||
for step in 0..total_step {
|
||||
let current_step_array = Array::from_elem(bsz, step as f32);
|
||||
|
||||
let xt_value = Value::from_array(xt.clone())?;
|
||||
let text_emb_value = Value::from_array(text_emb.clone())?;
|
||||
let latent_mask_value = Value::from_array(latent_mask.clone())?;
|
||||
let text_mask_value2 = Value::from_array(text_mask.clone())?;
|
||||
let current_step_value = Value::from_array(current_step_array)?;
|
||||
let total_step_value = Value::from_array(total_step_array.clone())?;
|
||||
|
||||
            let vector_est_outputs = self.vector_est_ort.run(ort::inputs!{
                "noisy_latent" => &xt_value,
                "text_emb" => &text_emb_value,
                "style_ttl" => &style_ttl_value,
                "latent_mask" => &latent_mask_value,
                "text_mask" => &text_mask_value2,
                "current_step" => &current_step_value,
                "total_step" => &total_step_value
            })?;
|
||||
|
||||
let (denoised_shape, denoised_data) = vector_est_outputs["denoised_latent"].try_extract_tensor::<f32>()?;
|
||||
xt = Array3::from_shape_vec(
|
||||
(denoised_shape[0] as usize, denoised_shape[1] as usize, denoised_shape[2] as usize),
|
||||
denoised_data.to_vec()
|
||||
)?;
|
||||
}
|
||||
|
||||
// Generate waveform
|
||||
let final_latent_value = Value::from_array(xt)?;
|
||||
let vocoder_outputs = self.vocoder_ort.run(ort::inputs!{
|
||||
"latent" => &final_latent_value
|
||||
})?;
|
||||
|
||||
let (_, wav_data) = vocoder_outputs["wav_tts"].try_extract_tensor::<f32>()?;
|
||||
let wav: Vec<f32> = wav_data.to_vec();
|
||||
|
||||
Ok((wav, duration))
|
||||
}
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// Component Loading Functions
|
||||
// ============================================================================
|
||||
|
||||
/// Load voice style from JSON files
|
||||
pub fn load_voice_style(voice_style_paths: &[String], verbose: bool) -> Result<Style> {
|
||||
let bsz = voice_style_paths.len();
|
||||
|
||||
// Read first file to get dimensions
|
||||
let first_file = File::open(&voice_style_paths[0])
|
||||
.context("Failed to open voice style file")?;
|
||||
let first_reader = BufReader::new(first_file);
|
||||
let first_data: VoiceStyleData = serde_json::from_reader(first_reader)?;
|
||||
|
||||
let ttl_dims = &first_data.style_ttl.dims;
|
||||
let dp_dims = &first_data.style_dp.dims;
|
||||
|
||||
let ttl_dim1 = ttl_dims[1];
|
||||
let ttl_dim2 = ttl_dims[2];
|
||||
let dp_dim1 = dp_dims[1];
|
||||
let dp_dim2 = dp_dims[2];
|
||||
|
||||
// Pre-allocate arrays with full batch size
|
||||
let ttl_size = bsz * ttl_dim1 * ttl_dim2;
|
||||
let dp_size = bsz * dp_dim1 * dp_dim2;
|
||||
let mut ttl_flat = vec![0.0f32; ttl_size];
|
||||
let mut dp_flat = vec![0.0f32; dp_size];
|
||||
|
||||
// Fill in the data
|
||||
for (i, path) in voice_style_paths.iter().enumerate() {
|
||||
let file = File::open(path).context("Failed to open voice style file")?;
|
||||
let reader = BufReader::new(file);
|
||||
let data: VoiceStyleData = serde_json::from_reader(reader)?;
|
||||
|
||||
// Flatten TTL data
|
||||
let ttl_offset = i * ttl_dim1 * ttl_dim2;
|
||||
let mut idx = 0;
|
||||
for batch in &data.style_ttl.data {
|
||||
for row in batch {
|
||||
for &val in row {
|
||||
ttl_flat[ttl_offset + idx] = val;
|
||||
idx += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Flatten DP data
|
||||
let dp_offset = i * dp_dim1 * dp_dim2;
|
||||
idx = 0;
|
||||
for batch in &data.style_dp.data {
|
||||
for row in batch {
|
||||
for &val in row {
|
||||
dp_flat[dp_offset + idx] = val;
|
||||
idx += 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let ttl_style = Array3::from_shape_vec((bsz, ttl_dim1, ttl_dim2), ttl_flat)?;
|
||||
let dp_style = Array3::from_shape_vec((bsz, dp_dim1, dp_dim2), dp_flat)?;
|
||||
|
||||
if verbose {
|
||||
println!("Loaded {} voice styles\n", bsz);
|
||||
}
|
||||
|
||||
Ok(Style {
|
||||
ttl: ttl_style,
|
||||
dp: dp_style,
|
||||
})
|
||||
}
|
||||
|
||||
/// Load TTS components
|
||||
pub fn load_text_to_speech(onnx_dir: &str, use_gpu: bool) -> Result<TextToSpeech> {
|
||||
if use_gpu {
|
||||
anyhow::bail!("GPU mode is not supported yet");
|
||||
}
|
||||
println!("Using CPU for inference\n");
|
||||
|
||||
let cfgs = load_cfgs(onnx_dir)?;
|
||||
|
||||
let dp_path = format!("{}/duration_predictor.onnx", onnx_dir);
|
||||
let text_enc_path = format!("{}/text_encoder.onnx", onnx_dir);
|
||||
let vector_est_path = format!("{}/vector_estimator.onnx", onnx_dir);
|
||||
let vocoder_path = format!("{}/vocoder.onnx", onnx_dir);
|
||||
|
||||
let dp_ort = Session::builder()?
|
||||
.commit_from_file(&dp_path)?;
|
||||
let text_enc_ort = Session::builder()?
|
||||
.commit_from_file(&text_enc_path)?;
|
||||
let vector_est_ort = Session::builder()?
|
||||
.commit_from_file(&vector_est_path)?;
|
||||
let vocoder_ort = Session::builder()?
|
||||
.commit_from_file(&vocoder_path)?;
|
||||
|
||||
let unicode_indexer_path = format!("{}/unicode_indexer.json", onnx_dir);
|
||||
let text_processor = UnicodeProcessor::new(&unicode_indexer_path)?;
|
||||
|
||||
Ok(TextToSpeech::new(
|
||||
cfgs,
|
||||
text_processor,
|
||||
dp_ort,
|
||||
text_enc_ort,
|
||||
vector_est_ort,
|
||||
vocoder_ort,
|
||||
))
|
||||
}
|
||||
@@ -0,0 +1,15 @@
|
||||
# Swift Package Manager
|
||||
.build/
|
||||
.swiftpm/
|
||||
*.xcodeproj
|
||||
*.xcworkspace
|
||||
|
||||
# Build artifacts
|
||||
example_onnx
|
||||
|
||||
# Results
|
||||
results/*.wav
|
||||
|
||||
# macOS
|
||||
.DS_Store
|
||||
|
||||
@@ -0,0 +1,14 @@
|
||||
{
|
||||
"pins" : [
|
||||
{
|
||||
"identity" : "onnxruntime-swift-package-manager",
|
||||
"kind" : "remoteSourceControl",
|
||||
"location" : "https://github.com/microsoft/onnxruntime-swift-package-manager.git",
|
||||
"state" : {
|
||||
"revision" : "12ce7374c86944e1f68f3a866d10105d8357f074",
|
||||
"version" : "1.20.0"
|
||||
}
|
||||
}
|
||||
],
|
||||
"version" : 2
|
||||
}
|
||||
@@ -0,0 +1,22 @@
|
||||
// swift-tools-version: 5.9
|
||||
import PackageDescription
|
||||
|
||||
let package = Package(
|
||||
name: "Supertonic",
|
||||
platforms: [
|
||||
.macOS(.v13)
|
||||
],
|
||||
dependencies: [
|
||||
.package(url: "https://github.com/microsoft/onnxruntime-swift-package-manager.git", from: "1.16.0"),
|
||||
],
|
||||
targets: [
|
||||
.executableTarget(
|
||||
name: "example_onnx",
|
||||
dependencies: [
|
||||
.product(name: "onnxruntime", package: "onnxruntime-swift-package-manager")
|
||||
],
|
||||
path: "Sources"
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
@@ -0,0 +1,76 @@
|
||||
# TTS ONNX Inference Examples

This guide provides examples for running TTS inference using `example_onnx`.

## Installation

This project uses Swift Package Manager (SPM) for dependency management.

### Prerequisites
- Swift 5.9 or later
- macOS 13.0 or later

### Build the project
```bash
swift build -c release
```

## Basic Usage

### Example 1: Default Inference
Run inference with default settings:
```bash
.build/release/example_onnx
```

This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
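
Output file names are derived from the input text. As a rough sketch of that naming rule (modeled on `sanitizeFilename` in the Swift sources: keep the first 20 characters, replace every character that is not a letter or digit with `_`, then append `_<n>.wav`), the helper below predicts the names for plain ASCII text. The shell function `sanitize_filename` is hypothetical and exists only for this illustration; the Swift code uses Unicode-aware character classes, so non-ASCII text may be named differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper approximating the Swift naming rule for ASCII text.
sanitize_filename() {
  local text="$1" max_len="$2"
  local truncated="${text:0:$max_len}"          # keep the first max_len characters
  printf '%s\n' "${truncated//[^[:alnum:]]/_}"  # replace everything else with "_"
}

text="This morning, I took a walk in the park"
for n in 1 2 3 4; do   # 4 generations by default (--n-test)
  echo "results/$(sanitize_filename "$text" 20)_${n}.wav"
done
# First line printed: results/This_morning__I_took_1.wav
```

Because only the first 20 characters are kept, two texts that share a 20-character prefix would map to the same file names within one run.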

### Example 2: Batch Inference
Process multiple voice styles and texts at once. Voice style paths are separated by commas and texts by `|`; the two lists are paired by position:
```bash
.build/release/example_onnx \
    --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
    --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
```

This will:
- Generate speech for 2 different voice-text pairs
- Use the male voice (`M1.json`) for the first text
- Use the female voice (`F1.json`) for the second text
- Process both samples in a single batch
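
When the voice styles and texts come from a script, the two argument strings can be assembled from arrays. The following is a minimal sketch, assuming the comma and `|` separators used in the command above; `join_by` is a hypothetical helper, and the two short texts are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper: join the remaining arguments with a one-character delimiter.
join_by() {
  local IFS="$1"   # "$*" joins on the first character of IFS
  shift
  printf '%s\n' "$*"
}

styles=(assets/voice_styles/M1.json assets/voice_styles/F1.json)
texts=("First sentence." "Second sentence.")

# The example program rejects mismatched counts, so check before building the command.
if [ "${#styles[@]}" -ne "${#texts[@]}" ]; then
  echo "Error: ${#styles[@]} voice styles but ${#texts[@]} texts" >&2
  exit 1
fi

style_arg="$(join_by , "${styles[@]}")"
text_arg="$(join_by '|' "${texts[@]}")"

# Printed rather than executed, so the sketch works without a release build.
echo ".build/release/example_onnx --voice-style $style_arg --text \"$text_arg\""
```

With this scheme a text that itself contains `|`, or a path that contains a comma, cannot be passed as a single entry.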

### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
.build/release/example_onnx \
    --total-step 10 \
    --voice-style assets/voice_styles/M1.json \
    --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```

This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
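
To hear the trade-off for yourself, one option is to run the same input at several step counts and keep each run in its own output directory. This is only a sketch: the `results/steps_<N>` layout, the `sweep_args` helper, and the `DRY_RUN` switch are assumptions made for illustration, while `--total-step`, `--n-test`, and `--save-dir` are the documented arguments.

```bash
#!/usr/bin/env bash
BIN=".build/release/example_onnx"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands

# Print the arguments for one run; results/steps_<N> is an arbitrary layout.
sweep_args() {
  local steps="$1"
  printf '%s\n' "--total-step $steps --n-test 1 --save-dir results/steps_$steps"
}

for steps in 5 10; do
  # Word splitting of $(sweep_args ...) is intended: no argument contains spaces.
  # shellcheck disable=SC2046
  if [ "$DRY_RUN" = "1" ]; then
    echo "$BIN" $(sweep_args "$steps")
  else
    time "$BIN" $(sweep_args "$steps")
  fi
done
```

Run it with `DRY_RUN=0` after `swift build -c release` to execute the commands and compare the timings and the resulting audio.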

## Available Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Request GPU inference (not supported yet; the program exits with an error) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str | (long default text) | Text(s) to synthesize, separated by the pipe character |
| `--save-dir` | str | `results` | Output directory |

## Notes

- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
@@ -0,0 +1,122 @@
|
||||
import Foundation
|
||||
import OnnxRuntimeBindings
|
||||
|
||||
struct Args {
|
||||
var useGpu: Bool = false
|
||||
var onnxDir: String = "assets/onnx"
|
||||
var totalStep: Int = 5
|
||||
var nTest: Int = 4
|
||||
var voiceStyle: [String] = ["assets/voice_styles/M1.json"]
|
||||
var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
|
||||
var saveDir: String = "results"
|
||||
}
|
||||
|
||||
func parseArgs() -> Args {
|
||||
var args = Args()
|
||||
let arguments = CommandLine.arguments
|
||||
|
||||
var i = 1
|
||||
while i < arguments.count {
|
||||
let arg = arguments[i]
|
||||
|
||||
switch arg {
|
||||
case "--use-gpu":
|
||||
args.useGpu = true
|
||||
case "--onnx-dir":
|
||||
if i + 1 < arguments.count {
|
||||
args.onnxDir = arguments[i + 1]
|
||||
i += 1
|
||||
}
|
||||
case "--total-step":
|
||||
if i + 1 < arguments.count {
|
||||
args.totalStep = Int(arguments[i + 1]) ?? 5
|
||||
i += 1
|
||||
}
|
||||
case "--n-test":
|
||||
if i + 1 < arguments.count {
|
||||
args.nTest = Int(arguments[i + 1]) ?? 4
|
||||
i += 1
|
||||
}
|
||||
case "--voice-style":
|
||||
if i + 1 < arguments.count {
|
||||
args.voiceStyle = arguments[i + 1].components(separatedBy: ",")
|
||||
i += 1
|
||||
}
|
||||
case "--text":
|
||||
if i + 1 < arguments.count {
|
||||
args.text = arguments[i + 1].components(separatedBy: "|")
|
||||
i += 1
|
||||
}
|
||||
case "--save-dir":
|
||||
if i + 1 < arguments.count {
|
||||
args.saveDir = arguments[i + 1]
|
||||
i += 1
|
||||
}
|
||||
default:
|
||||
break
|
||||
}
|
||||
|
||||
i += 1
|
||||
}
|
||||
|
||||
return args
|
||||
}
|
||||
|
||||
@main
|
||||
struct ExampleONNX {
|
||||
static func main() async {
|
||||
print("=== TTS Inference with ONNX Runtime (Swift) ===\n")
|
||||
|
||||
// --- 1. Parse arguments --- //
|
||||
let args = parseArgs()
|
||||
|
||||
guard args.voiceStyle.count == args.text.count else {
|
||||
print("Error: Number of voice styles (\(args.voiceStyle.count)) must match number of texts (\(args.text.count))")
|
||||
exit(1) // Report the argument mismatch to the calling shell as a failure
|
||||
}
|
||||
|
||||
let bsz = args.voiceStyle.count
|
||||
|
||||
do {
|
||||
let env = try ORTEnv(loggingLevel: .warning)
|
||||
|
||||
// --- 2. Load TTS components --- //
|
||||
let textToSpeech = try loadTextToSpeech(args.onnxDir, args.useGpu, env)
|
||||
|
||||
// --- 3. Load voice styles --- //
|
||||
let style = try loadVoiceStyle(args.voiceStyle, verbose: true)
|
||||
|
||||
// --- 4. Synthesize speech --- //
|
||||
// Let a failure to create the output directory reach the catch block below
try FileManager.default.createDirectory(atPath: args.saveDir, withIntermediateDirectories: true)
|
||||
|
||||
for n in 0..<args.nTest {
|
||||
print("\n[\(n + 1)/\(args.nTest)] Starting synthesis...")
|
||||
|
||||
let (wav, duration) = try timer("Generating speech from text") {
|
||||
try textToSpeech.call(args.text, style, args.totalStep)
|
||||
}
|
||||
|
||||
// Save outputs
|
||||
let wavLen = wav.count / bsz
|
||||
for i in 0..<bsz {
|
||||
let fname = "\(sanitizeFilename(args.text[i], maxLen: 20))_\(n + 1).wav"
|
||||
let actualLen = Int(Float(textToSpeech.sampleRate) * duration[i])
|
||||
|
||||
let wavStart = i * wavLen
|
||||
let wavEnd = min(wavStart + actualLen, wavStart + wavLen)
|
||||
let wavOut = Array(wav[wavStart..<wavEnd])
|
||||
|
||||
let outputPath = "\(args.saveDir)/\(fname)"
|
||||
try writeWavFile(outputPath, wavOut, textToSpeech.sampleRate)
|
||||
print("Saved: \(outputPath)")
|
||||
}
|
||||
}
|
||||
|
||||
print("\n=== Synthesis completed successfully! ===")
|
||||
|
||||
} catch {
|
||||
print("Error during inference: \(error)")
|
||||
exit(1)
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,483 @@
|
||||
import Foundation
|
||||
import Accelerate
|
||||
import OnnxRuntimeBindings
|
||||
|
||||
// MARK: - Configuration Structures
|
||||
|
||||
struct Config: Codable {
|
||||
struct AEConfig: Codable {
|
||||
let sample_rate: Int
|
||||
let base_chunk_size: Int
|
||||
}
|
||||
|
||||
struct TTLConfig: Codable {
|
||||
let chunk_compress_factor: Int
|
||||
let latent_dim: Int
|
||||
}
|
||||
|
||||
let ae: AEConfig
|
||||
let ttl: TTLConfig
|
||||
}
|
||||
|
||||
// MARK: - Voice Style Data Structure
|
||||
|
||||
struct VoiceStyleData: Codable {
|
||||
struct StyleComponent: Codable {
|
||||
let data: [[[Float]]]
|
||||
let dims: [Int]
|
||||
let type: String
|
||||
}
|
||||
|
||||
let style_ttl: StyleComponent
|
||||
let style_dp: StyleComponent
|
||||
}
|
||||
|
||||
// MARK: - Unicode Text Processor
|
||||
|
||||
class UnicodeProcessor {
|
||||
let indexer: [Int64]
|
||||
|
||||
init(unicodeIndexerPath: String) throws {
|
||||
let data = try Data(contentsOf: URL(fileURLWithPath: unicodeIndexerPath))
|
||||
self.indexer = try JSONDecoder().decode([Int64].self, from: data)
|
||||
}
|
||||
|
||||
func call(_ textList: [String]) -> (textIds: [[Int64]], textMask: [[[Float]]]) {
|
||||
let processedTexts = textList.map { preprocessText($0) }
|
||||
|
||||
var textIdsLengths = [Int]()
|
||||
for text in processedTexts {
|
||||
// Count Unicode scalars, not Characters: rows below are indexed per scalar
textIdsLengths.append(text.unicodeScalars.count)
|
||||
}
|
||||
|
||||
let maxLen = textIdsLengths.max() ?? 0
|
||||
|
||||
var textIds = [[Int64]]()
|
||||
for text in processedTexts {
|
||||
var row = Array(repeating: Int64(0), count: maxLen)
|
||||
let unicodeValues = Array(text.unicodeScalars.map { Int($0.value) })
|
||||
for (j, val) in unicodeValues.enumerated() {
|
||||
if val < indexer.count {
|
||||
row[j] = indexer[val]
|
||||
} else {
|
||||
row[j] = -1
|
||||
}
|
||||
}
|
||||
textIds.append(row)
|
||||
}
|
||||
|
||||
let textMask = getTextMask(textIdsLengths)
|
||||
return (textIds, textMask)
|
||||
}
|
||||
}
|
||||
|
||||
func preprocessText(_ text: String) -> String {
|
||||
return text.precomposedStringWithCompatibilityMapping
|
||||
}
|
||||
|
||||
func lengthToMask(_ lengths: [Int], maxLen: Int? = nil) -> [[[Float]]] {
|
||||
let actualMaxLen = maxLen ?? (lengths.max() ?? 0)
|
||||
|
||||
var mask = [[[Float]]]()
|
||||
for len in lengths {
|
||||
var row = Array(repeating: Float(0.0), count: actualMaxLen)
|
||||
for j in 0..<min(len, actualMaxLen) {
|
||||
row[j] = 1.0
|
||||
}
|
||||
mask.append([row])
|
||||
}
|
||||
return mask
|
||||
}
|
||||
|
||||
func getTextMask(_ textIdsLengths: [Int]) -> [[[Float]]] {
|
||||
let maxLen = textIdsLengths.max() ?? 0
|
||||
return lengthToMask(textIdsLengths, maxLen: maxLen)
|
||||
}
|
||||
|
||||
func sampleNoisyLatent(duration: [Float], sampleRate: Int, baseChunkSize: Int, chunkCompress: Int, latentDim: Int) -> (noisyLatent: [[[Float]]], latentMask: [[[Float]]]) {
|
||||
let bsz = duration.count
|
||||
let maxDur = duration.max() ?? 0.0
|
||||
|
||||
let wavLenMax = Int(maxDur * Float(sampleRate))
|
||||
var wavLengths = [Int]()
|
||||
for d in duration {
|
||||
wavLengths.append(Int(d * Float(sampleRate)))
|
||||
}
|
||||
|
||||
let chunkSize = baseChunkSize * chunkCompress
|
||||
let latentLen = (wavLenMax + chunkSize - 1) / chunkSize
|
||||
let latentDimVal = latentDim * chunkCompress
|
||||
|
||||
var noisyLatent = [[[Float]]]()
|
||||
for _ in 0..<bsz {
|
||||
var batch = [[Float]]()
|
||||
for _ in 0..<latentDimVal {
|
||||
var row = [Float]()
|
||||
for _ in 0..<latentLen {
|
||||
// Box-Muller transform
|
||||
let u1 = Float.random(in: 0.0001...1.0)
|
||||
let u2 = Float.random(in: 0.0...1.0)
|
||||
let val = sqrt(-2.0 * log(u1)) * cos(2.0 * Float.pi * u2)
|
||||
row.append(val)
|
||||
}
|
||||
batch.append(row)
|
||||
}
|
||||
noisyLatent.append(batch)
|
||||
}
|
||||
|
||||
var latentLengths = [Int]()
|
||||
for len in wavLengths {
|
||||
latentLengths.append((len + chunkSize - 1) / chunkSize)
|
||||
}
|
||||
|
||||
let latentMask = lengthToMask(latentLengths, maxLen: latentLen)
|
||||
|
||||
// Apply mask
|
||||
for b in 0..<bsz {
|
||||
for d in 0..<latentDimVal {
|
||||
for t in 0..<latentLen {
|
||||
noisyLatent[b][d][t] *= latentMask[b][0][t]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return (noisyLatent, latentMask)
|
||||
}
|
||||
|
||||
func getLatentMask(_ wavLengths: [Int64], _ cfgs: Config) -> [[[Float]]] {
|
||||
let baseChunkSize = cfgs.ae.base_chunk_size
|
||||
let chunkCompressFactor = cfgs.ttl.chunk_compress_factor
|
||||
let latentSize = baseChunkSize * chunkCompressFactor
|
||||
|
||||
var latentLengths = [Int]()
|
||||
for len in wavLengths {
|
||||
latentLengths.append((Int(len) + latentSize - 1) / latentSize)
|
||||
}
|
||||
|
||||
let maxLen = latentLengths.max() ?? 0
|
||||
return lengthToMask(latentLengths, maxLen: maxLen)
|
||||
}
|
||||
|
||||
// MARK: - WAV File I/O
|
||||
|
||||
func writeWavFile(_ filename: String, _ audioData: [Float], _ sampleRate: Int) throws {
|
||||
let url = URL(fileURLWithPath: filename)
|
||||
|
||||
// Convert float to int16
|
||||
let int16Data = audioData.map { sample -> Int16 in
|
||||
let clamped = max(-1.0, min(1.0, sample))
|
||||
return Int16(clamped * 32767.0)
|
||||
}
|
||||
|
||||
// Create WAV header
|
||||
let numChannels: UInt16 = 1
|
||||
let bitsPerSample: UInt16 = 16
|
||||
let byteRate = UInt32(sampleRate) * UInt32(numChannels) * UInt32(bitsPerSample) / 8
|
||||
let blockAlign = numChannels * bitsPerSample / 8
|
||||
let dataSize = UInt32(int16Data.count * 2)
|
||||
|
||||
var data = Data()
|
||||
|
||||
// RIFF chunk
|
||||
data.append("RIFF".data(using: .ascii)!)
|
||||
withUnsafeBytes(of: UInt32(36 + dataSize).littleEndian) { data.append(contentsOf: $0) }
|
||||
data.append("WAVE".data(using: .ascii)!)
|
||||
|
||||
// fmt chunk
|
||||
data.append("fmt ".data(using: .ascii)!)
|
||||
withUnsafeBytes(of: UInt32(16).littleEndian) { data.append(contentsOf: $0) }
|
||||
withUnsafeBytes(of: UInt16(1).littleEndian) { data.append(contentsOf: $0) } // PCM
|
||||
withUnsafeBytes(of: numChannels.littleEndian) { data.append(contentsOf: $0) }
|
||||
withUnsafeBytes(of: UInt32(sampleRate).littleEndian) { data.append(contentsOf: $0) }
|
||||
withUnsafeBytes(of: byteRate.littleEndian) { data.append(contentsOf: $0) }
|
||||
withUnsafeBytes(of: blockAlign.littleEndian) { data.append(contentsOf: $0) }
|
||||
withUnsafeBytes(of: bitsPerSample.littleEndian) { data.append(contentsOf: $0) }
|
||||
|
||||
// data chunk
|
||||
data.append("data".data(using: .ascii)!)
|
||||
withUnsafeBytes(of: dataSize.littleEndian) { data.append(contentsOf: $0) }
|
||||
|
||||
// audio data
|
||||
int16Data.withUnsafeBytes { data.append(contentsOf: $0) }
|
||||
|
||||
try data.write(to: url)
|
||||
}
|
||||
|
||||
// MARK: - Utility Functions
|
||||
|
||||
func timer<T>(_ name: String, _ f: () throws -> T) rethrows -> T {
|
||||
let start = Date()
|
||||
print("\(name)...")
|
||||
let result = try f()
|
||||
let elapsed = Date().timeIntervalSince(start)
|
||||
print(String(format: " -> %@ completed in %.2f sec", name, elapsed))
|
||||
return result
|
||||
}
|
||||
|
||||
func sanitizeFilename(_ text: String, maxLen: Int) -> String {
|
||||
let truncated = text.count > maxLen ? String(text.prefix(maxLen)) : text
|
||||
return truncated.map { char in
|
||||
if char.isLetter || char.isNumber {
|
||||
return char
|
||||
} else {
|
||||
return Character("_")
|
||||
}
|
||||
}.map(String.init).joined()
|
||||
}
|
||||
|
||||
func loadCfgs(_ onnxDir: String) throws -> Config {
|
||||
let cfgPath = "\(onnxDir)/tts.json"
|
||||
let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath))
|
||||
let config = try JSONDecoder().decode(Config.self, from: data)
|
||||
return config
|
||||
}
|
||||
|
||||
// MARK: - ONNX Runtime Integration
|
||||
|
||||
struct Style {
|
||||
let ttl: ORTValue
|
||||
let dp: ORTValue
|
||||
}
|
||||
|
||||
class TextToSpeech {
|
||||
let cfgs: Config
|
||||
let textProcessor: UnicodeProcessor
|
||||
let dpOrt: ORTSession
|
||||
let textEncOrt: ORTSession
|
||||
let vectorEstOrt: ORTSession
|
||||
let vocoderOrt: ORTSession
|
||||
let sampleRate: Int
|
||||
|
||||
init(cfgs: Config, textProcessor: UnicodeProcessor,
|
||||
dpOrt: ORTSession, textEncOrt: ORTSession,
|
||||
vectorEstOrt: ORTSession, vocoderOrt: ORTSession) {
|
||||
self.cfgs = cfgs
|
||||
self.textProcessor = textProcessor
|
||||
self.dpOrt = dpOrt
|
||||
self.textEncOrt = textEncOrt
|
||||
self.vectorEstOrt = vectorEstOrt
|
||||
self.vocoderOrt = vocoderOrt
|
||||
self.sampleRate = cfgs.ae.sample_rate
|
||||
}
|
||||
|
||||
func call(_ textList: [String], _ style: Style, _ totalStep: Int) throws -> (wav: [Float], duration: [Float]) {
|
||||
let bsz = textList.count
|
||||
|
||||
// Process text
|
||||
let (textIds, textMask) = textProcessor.call(textList)
|
||||
|
||||
// Flatten text IDs
|
||||
let textIdsFlat = textIds.flatMap { $0 }
|
||||
let textIdsShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: textIds[0].count)]
|
||||
let textIdsValue = try ORTValue(tensorData: NSMutableData(bytes: textIdsFlat, length: textIdsFlat.count * MemoryLayout<Int64>.size),
|
||||
elementType: .int64,
|
||||
shape: textIdsShape)
|
||||
|
||||
// Flatten text mask
|
||||
let textMaskFlat = textMask.flatMap { $0.flatMap { $0 } }
|
||||
let textMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: textMask[0][0].count)]
|
||||
let textMaskValue = try ORTValue(tensorData: NSMutableData(bytes: textMaskFlat, length: textMaskFlat.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: textMaskShape)
|
||||
|
||||
// Predict duration
|
||||
let dpOutputs = try dpOrt.run(withInputs: ["text_ids": textIdsValue, "style_dp": style.dp, "text_mask": textMaskValue],
|
||||
outputNames: ["duration"],
|
||||
runOptions: nil)
|
||||
|
||||
let durationData = try dpOutputs["duration"]!.tensorData() as Data
|
||||
let duration = durationData.withUnsafeBytes { ptr in
|
||||
Array(ptr.bindMemory(to: Float.self))
|
||||
}
|
||||
|
||||
// Encode text
|
||||
let textEncOutputs = try textEncOrt.run(withInputs: ["text_ids": textIdsValue, "style_ttl": style.ttl, "text_mask": textMaskValue],
|
||||
outputNames: ["text_emb"],
|
||||
runOptions: nil)
|
||||
|
||||
let textEmbValue = textEncOutputs["text_emb"]!
|
||||
|
||||
// Sample noisy latent
|
||||
var (xt, latentMask) = sampleNoisyLatent(duration: duration, sampleRate: sampleRate,
|
||||
baseChunkSize: cfgs.ae.base_chunk_size,
|
||||
chunkCompress: cfgs.ttl.chunk_compress_factor,
|
||||
latentDim: cfgs.ttl.latent_dim)
|
||||
|
||||
// Prepare constant arrays
|
||||
let totalStepArray = Array(repeating: Float(totalStep), count: bsz)
|
||||
let totalStepValue = try ORTValue(tensorData: NSMutableData(bytes: totalStepArray, length: totalStepArray.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: [NSNumber(value: bsz)])
|
||||
|
||||
// Denoising loop
|
||||
for step in 0..<totalStep {
|
||||
let currentStepArray = Array(repeating: Float(step), count: bsz)
|
||||
let currentStepValue = try ORTValue(tensorData: NSMutableData(bytes: currentStepArray, length: currentStepArray.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: [NSNumber(value: bsz)])
|
||||
|
||||
// Flatten xt
|
||||
let xtFlat = xt.flatMap { $0.flatMap { $0 } }
|
||||
let xtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
|
||||
let xtValue = try ORTValue(tensorData: NSMutableData(bytes: xtFlat, length: xtFlat.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: xtShape)
|
||||
|
||||
// Flatten latent mask
|
||||
let latentMaskFlat = latentMask.flatMap { $0.flatMap { $0 } }
|
||||
let latentMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: latentMask[0][0].count)]
|
||||
let latentMaskValue = try ORTValue(tensorData: NSMutableData(bytes: latentMaskFlat, length: latentMaskFlat.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: latentMaskShape)
|
||||
|
||||
let vectorEstOutputs = try vectorEstOrt.run(withInputs: [
|
||||
"noisy_latent": xtValue,
|
||||
"text_emb": textEmbValue,
|
||||
"style_ttl": style.ttl,
|
||||
"latent_mask": latentMaskValue,
|
||||
"text_mask": textMaskValue,
|
||||
"current_step": currentStepValue,
|
||||
"total_step": totalStepValue
|
||||
], outputNames: ["denoised_latent"], runOptions: nil)
|
||||
|
||||
let denoisedData = try vectorEstOutputs["denoised_latent"]!.tensorData() as Data
|
||||
let denoisedFlat = denoisedData.withUnsafeBytes { ptr in
|
||||
Array(ptr.bindMemory(to: Float.self))
|
||||
}
|
||||
|
||||
// Reshape to 3D
|
||||
let latentDimVal = xt[0].count
|
||||
let latentLen = xt[0][0].count
|
||||
xt = []
|
||||
var idx = 0
|
||||
for _ in 0..<bsz {
|
||||
var batch = [[Float]]()
|
||||
for _ in 0..<latentDimVal {
|
||||
var row = [Float]()
|
||||
for _ in 0..<latentLen {
|
||||
row.append(denoisedFlat[idx])
|
||||
idx += 1
|
||||
}
|
||||
batch.append(row)
|
||||
}
|
||||
xt.append(batch)
|
||||
}
|
||||
}
|
||||
|
||||
// Generate waveform
|
||||
let finalXtFlat = xt.flatMap { $0.flatMap { $0 } }
|
||||
let finalXtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
|
||||
let finalXtValue = try ORTValue(tensorData: NSMutableData(bytes: finalXtFlat, length: finalXtFlat.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: finalXtShape)
|
||||
|
||||
let vocoderOutputs = try vocoderOrt.run(withInputs: ["latent": finalXtValue],
|
||||
outputNames: ["wav_tts"],
|
||||
runOptions: nil)
|
||||
|
||||
let wavData = try vocoderOutputs["wav_tts"]!.tensorData() as Data
|
||||
let wav = wavData.withUnsafeBytes { ptr in
|
||||
Array(ptr.bindMemory(to: Float.self))
|
||||
}
|
||||
|
||||
return (wav, duration)
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - Component Loading Functions
|
||||
|
||||
func loadVoiceStyle(_ voiceStylePaths: [String], verbose: Bool) throws -> Style {
|
||||
let bsz = voiceStylePaths.count
|
||||
|
||||
// Read first file to get dimensions
|
||||
let firstData = try Data(contentsOf: URL(fileURLWithPath: voiceStylePaths[0]))
|
||||
let firstStyle = try JSONDecoder().decode(VoiceStyleData.self, from: firstData)
|
||||
|
||||
let ttlDims = firstStyle.style_ttl.dims
|
||||
let dpDims = firstStyle.style_dp.dims
|
||||
|
||||
let ttlDim1 = ttlDims[1]
|
||||
let ttlDim2 = ttlDims[2]
|
||||
let dpDim1 = dpDims[1]
|
||||
let dpDim2 = dpDims[2]
|
||||
|
||||
// Pre-allocate arrays with full batch size
|
||||
let ttlSize = bsz * ttlDim1 * ttlDim2
|
||||
let dpSize = bsz * dpDim1 * dpDim2
|
||||
var ttlFlat = [Float](repeating: 0.0, count: ttlSize)
|
||||
var dpFlat = [Float](repeating: 0.0, count: dpSize)
|
||||
|
||||
// Fill in the data
|
||||
for (i, path) in voiceStylePaths.enumerated() {
|
||||
let data = try Data(contentsOf: URL(fileURLWithPath: path))
|
||||
let voiceStyle = try JSONDecoder().decode(VoiceStyleData.self, from: data)
|
||||
|
||||
// Flatten TTL data
|
||||
let ttlOffset = i * ttlDim1 * ttlDim2
|
||||
var idx = 0
|
||||
for batch in voiceStyle.style_ttl.data {
|
||||
for row in batch {
|
||||
for val in row {
|
||||
ttlFlat[ttlOffset + idx] = val
|
||||
idx += 1
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Flatten DP data
|
||||
let dpOffset = i * dpDim1 * dpDim2
|
||||
idx = 0
|
||||
for batch in voiceStyle.style_dp.data {
|
||||
for row in batch {
|
||||
for val in row {
|
||||
dpFlat[dpOffset + idx] = val
|
||||
idx += 1
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let ttlShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: ttlDim1), NSNumber(value: ttlDim2)]
|
||||
let dpShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: dpDim1), NSNumber(value: dpDim2)]
|
||||
|
||||
let ttlValue = try ORTValue(tensorData: NSMutableData(bytes: &ttlFlat, length: ttlFlat.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: ttlShape)
|
||||
let dpValue = try ORTValue(tensorData: NSMutableData(bytes: &dpFlat, length: dpFlat.count * MemoryLayout<Float>.size),
|
||||
elementType: .float,
|
||||
shape: dpShape)
|
||||
|
||||
if verbose {
|
||||
print("Loaded \(bsz) voice styles\n")
|
||||
}
|
||||
|
||||
return Style(ttl: ttlValue, dp: dpValue)
|
||||
}
|
||||
|
||||
func loadTextToSpeech(_ onnxDir: String, _ useGpu: Bool, _ env: ORTEnv) throws -> TextToSpeech {
|
||||
if useGpu {
|
||||
throw NSError(domain: "TTS", code: 1, userInfo: [NSLocalizedDescriptionKey: "GPU mode is not supported yet"])
|
||||
}
|
||||
print("Using CPU for inference\n")
|
||||
|
||||
let cfgs = try loadCfgs(onnxDir)
|
||||
|
||||
let sessionOptions = try ORTSessionOptions()
|
||||
|
||||
let dpPath = "\(onnxDir)/duration_predictor.onnx"
|
||||
let textEncPath = "\(onnxDir)/text_encoder.onnx"
|
||||
let vectorEstPath = "\(onnxDir)/vector_estimator.onnx"
|
||||
let vocoderPath = "\(onnxDir)/vocoder.onnx"
|
||||
|
||||
let dpOrt = try ORTSession(env: env, modelPath: dpPath, sessionOptions: sessionOptions)
|
||||
let textEncOrt = try ORTSession(env: env, modelPath: textEncPath, sessionOptions: sessionOptions)
|
||||
let vectorEstOrt = try ORTSession(env: env, modelPath: vectorEstPath, sessionOptions: sessionOptions)
|
||||
let vocoderOrt = try ORTSession(env: env, modelPath: vocoderPath, sessionOptions: sessionOptions)
|
||||
|
||||
let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json"
|
||||
let textProcessor = try UnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath)
|
||||
|
||||
return TextToSpeech(cfgs: cfgs, textProcessor: textProcessor,
|
||||
dpOrt: dpOrt, textEncOrt: textEncOrt,
|
||||
vectorEstOrt: vectorEstOrt, vocoderOrt: vocoderOrt)
|
||||
}
|
||||
Symlink
+1
@@ -0,0 +1 @@
|
||||
../assets
|
||||
Executable
+248
@@ -0,0 +1,248 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Supertonic - Test All Language Implementations
|
||||
# This script runs inference tests for all supported languages except web
|
||||
|
||||
set -eo pipefail # Exit on error; a pipeline fails if any stage fails
|
||||
|
||||
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
echo "=================================="
|
||||
echo "Supertonic - Testing All Examples"
|
||||
echo "=================================="
|
||||
echo ""
|
||||
|
||||
# Ask user to select test mode
|
||||
echo "Select test mode:"
|
||||
echo " 1) Default inference only"
|
||||
echo " 2) Batch inference only"
|
||||
echo " 3) Both default and batch inference"
|
||||
echo -e "Enter your choice (1/2/3) [default: 1]: \c"
|
||||
read -r test_mode
|
||||
test_mode=${test_mode:-1}
|
||||
|
||||
case $test_mode in
|
||||
1)
|
||||
TEST_DEFAULT=true
|
||||
TEST_BATCH=false
|
||||
echo "Running default inference tests only"
|
||||
;;
|
||||
2)
|
||||
TEST_DEFAULT=false
|
||||
TEST_BATCH=true
|
||||
echo "Running batch inference tests only"
|
||||
;;
|
||||
3)
|
||||
TEST_DEFAULT=true
|
||||
TEST_BATCH=true
|
||||
echo "Running both default and batch inference tests"
|
||||
;;
|
||||
*)
|
||||
echo "Invalid choice. Using default inference only."
|
||||
TEST_DEFAULT=true
|
||||
TEST_BATCH=false
|
||||
;;
|
||||
esac
|
||||
echo ""
|
||||
|
||||
# Batch inference test data - base variables
|
||||
BATCH_VOICE_STYLE_1="assets/voice_styles/M1.json"
|
||||
BATCH_VOICE_STYLE_2="assets/voice_styles/F1.json"
|
||||
BATCH_TEXT_1="The sun sets behind the mountains, painting the sky in shades of pink and orange."
|
||||
BATCH_TEXT_2="The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
|
||||
|
||||
# Ask if user wants to clean results folders
|
||||
echo -e "Do you want to clean all results folders before running tests? (y/N): \c"
|
||||
read -r response
|
||||
if [[ "$response" =~ ^[Yy]$ ]]; then
|
||||
echo ""
|
||||
echo "Cleaning results folders..."
|
||||
|
||||
# List of result directories
|
||||
declare -a RESULT_DIRS=(
|
||||
"py/results"
|
||||
"nodejs/results"
|
||||
"go/results"
|
||||
"rust/results"
|
||||
"csharp/results"
|
||||
"java/results"
|
||||
"swift/results"
|
||||
"cpp/build/results"
|
||||
)
|
||||
|
||||
for dir in "${RESULT_DIRS[@]}"; do
|
||||
if [ -d "$SCRIPT_DIR/$dir" ]; then
|
||||
echo " - Cleaning $dir"
|
||||
rm -rf "$SCRIPT_DIR/$dir"/*
|
||||
fi
|
||||
done
|
||||
|
||||
echo "Results folders cleaned!"
|
||||
echo ""
|
||||
fi
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Track results
|
||||
declare -a PASSED=()
|
||||
declare -a FAILED=()
|
||||
|
||||
# Helper function to run tests
|
||||
run_test() {
|
||||
local name=$1
|
||||
local dir=$2
|
||||
shift 2
|
||||
local cmd="$*"
|
||||
|
||||
echo -e "${BLUE}[$name]${NC} Running inference..."
|
||||
cd "$SCRIPT_DIR/$dir"
|
||||
|
||||
# Run command and prefix each output line with the language name
|
||||
if eval "$cmd" 2>&1 | sed "s/^/[$name] /"; then
|
||||
echo -e "${GREEN}[$name]${NC} ✓ Success"
|
||||
PASSED+=("$name")
|
||||
else
|
||||
echo -e "${RED}[$name]${NC} ✗ Failed"
|
||||
FAILED+=("$name")
|
||||
fi
|
||||
echo ""
|
||||
cd "$SCRIPT_DIR"
|
||||
}
|
||||
|
||||
# ====================================
|
||||
# Python
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing Python...${NC}"
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "Python (default)" "py" "uv run example_onnx.py"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "Python (batch)" "py" "uv run example_onnx.py --voice-style $BATCH_VOICE_STYLE_1 $BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# JavaScript (Node.js)
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing JavaScript (Node.js)...${NC}"
|
||||
echo "Installing Node.js dependencies..."
|
||||
cd nodejs && npm install --silent && cd ..
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# Go
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing Go...${NC}"
|
||||
echo "Cleaning Go cache..."
|
||||
cd go && go clean && cd ..
|
||||
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "Go (default)" "go" "go run example_onnx.go helper.go"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "Go (batch)" "go" "go run example_onnx.go helper.go --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# Rust
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing Rust...${NC}"
|
||||
echo "Cleaning Rust build artifacts..."
|
||||
cd rust && cargo clean && cd ..
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "Rust (default)" "rust" "cargo run --release"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "Rust (batch)" "rust" "cargo run --release -- --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# C#
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing C#...${NC}"
|
||||
echo "Cleaning C# build artifacts..."
(cd csharp && dotnet clean)
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "C# (default)" "csharp" "dotnet run --configuration Release"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "C# (batch)" "csharp" "dotnet run --configuration Release -- --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# Java
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing Java...${NC}"
|
||||
echo "Building Java project..."
|
||||
(cd java && mvn clean install -q)  # subshell: the working directory is unchanged even if the build fails
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "Java (default)" "java" "mvn exec:java -q"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "Java (batch)" "java" "mvn exec:java -q -Dexec.args='--voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\"'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# Swift
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing Swift...${NC}"
|
||||
echo "Building Swift project..."
|
||||
(cd swift && swift build -c release)  # subshell: the working directory is unchanged even if the build fails
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "Swift (default)" "swift" ".build/release/example_onnx"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "Swift (batch)" "swift" ".build/release/example_onnx --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# C++
|
||||
# ====================================
|
||||
echo -e "${YELLOW}Testing C++...${NC}"
|
||||
echo "Building C++ project..."
|
||||
(mkdir -p cpp/build && cd cpp/build && cmake .. && make)  # subshell: the working directory is unchanged even if the build fails
|
||||
if [ "$TEST_DEFAULT" = true ]; then
|
||||
run_test "C++ (default)" "cpp/build" "./example_onnx"
|
||||
fi
|
||||
if [ "$TEST_BATCH" = true ]; then
|
||||
run_test "C++ (batch)" "cpp/build" "./example_onnx --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
|
||||
fi
|
||||
|
||||
# ====================================
|
||||
# Summary
|
||||
# ====================================
|
||||
echo "=================================="
|
||||
echo "Test Summary"
|
||||
echo "=================================="
|
||||
echo ""
|
||||
|
||||
if [ ${#PASSED[@]} -gt 0 ]; then
|
||||
echo -e "${GREEN}Passed (${#PASSED[@]}):${NC}"
|
||||
for lang in "${PASSED[@]}"; do
|
||||
echo -e " ${GREEN}✓${NC} $lang"
|
||||
done
|
||||
echo ""
|
||||
fi
|
||||
|
||||
if [ ${#FAILED[@]} -gt 0 ]; then
|
||||
echo -e "${RED}Failed (${#FAILED[@]}):${NC}"
|
||||
for lang in "${FAILED[@]}"; do
|
||||
echo -e " ${RED}✗${NC} $lang"
|
||||
done
|
||||
echo ""
|
||||
exit 1
|
||||
else
|
||||
echo -e "${GREEN}All tests passed! 🎉${NC}"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
@@ -0,0 +1,4 @@
|
||||
node_modules/
|
||||
dist/
|
||||
.DS_Store
|
||||
*.log
|
||||
@@ -0,0 +1,98 @@
|
||||
# Supertonic Web Example

This example demonstrates how to use Supertonic in a web browser using ONNX Runtime Web.

## Features

- 🌐 Runs entirely in the browser (no server required for inference)
- 🚀 WebGPU support with automatic fallback to WebAssembly
- ⚡ Pre-extracted voice styles for instant generation
- 🎨 Modern, responsive UI
- 🎭 Multiple voice style presets (2 Male, 2 Female)
- 💾 Download generated audio as WAV files
- 📊 Detailed generation statistics (audio length, generation time)
- ⏱️ Real-time progress tracking

## Requirements

- Node.js (for development server)
- Modern web browser (Chrome, Edge, Firefox, Safari)

## Installation

1. Install dependencies:

```bash
npm install
```

## Running the Demo

Start the development server:

```bash
npm run dev
```

This will start a local development server (usually at http://localhost:3000) and open the demo in your browser.

## Usage

1. **Wait for Models to Load**: The app will automatically load models and the default voice style (M1)
2. **Select Voice Style**: Choose from available voice presets
   - **Male 1 (M1)**: Default male voice
   - **Male 2 (M2)**: Alternative male voice
   - **Female 1 (F1)**: Default female voice
   - **Female 2 (F2)**: Alternative female voice
3. **Enter Text**: Type or paste the text you want to convert to speech
4. **Adjust Settings** (optional):
   - **Total Steps**: More steps = better quality but slower (default: 5)
5. **Generate Speech**: Click the "Generate Speech" button
6. **View Results**:
   - See the full input text
   - View audio length and generation time statistics
   - Play the generated audio in the browser
   - Download as WAV file

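
The audio length and generation time reported in step 6 can be combined into a single real-time factor, which makes runs with different **Total Steps** values easier to compare. The following is a minimal sketch; `generationStats` and its parameter names are hypothetical and not part of the demo.

```javascript
// Hypothetical helper (not part of the demo): summarize one generation run.
// audioSeconds: length of the generated audio.
// generationSeconds: wall-clock time spent generating it.
function generationStats(audioSeconds, generationSeconds) {
    if (!(audioSeconds > 0) || !(generationSeconds >= 0)) {
        throw new RangeError('audioSeconds must be > 0 and generationSeconds must be >= 0');
    }
    // Real-time factor: values below 1 mean synthesis ran faster than playback.
    const realTimeFactor = generationSeconds / audioSeconds;
    return {
        audioSeconds,
        generationSeconds,
        realTimeFactor,
        fasterThanRealTime: realTimeFactor < 1
    };
}

// Example: 8 seconds of audio generated in 2 seconds.
const stats = generationStats(8.0, 2.0);
console.log(stats.realTimeFactor.toFixed(2)); // "0.25"
```
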
## Technical Details

### Browser Compatibility

This demo uses:
- **ONNX Runtime Web**: For running models in the browser
- **Web Audio API**: For playing generated audio
- **Vite**: For development and bundling

## Notes

- The ONNX models must be accessible at `assets/onnx/` relative to the web root
- Voice style JSON files must be accessible at `assets/voice_styles/` relative to the web root
- Pre-extracted voice styles enable instant generation without audio processing
- Four voice style presets are provided (M1, M2, F1, F2)

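
A voice style file can be sanity-checked before it is turned into tensors. The sketch below assumes the layout that `helper.js` reads (`style_ttl` and `style_dp`, each with `dims` and nested `data`) and that the product of `dims` equals the number of values; `checkVoiceStyle` is a hypothetical helper, not part of the demo.

```javascript
// Sketch: validate a parsed voice style object.
// Assumed layout: { style_ttl: { dims, data }, style_dp: { dims, data } }.
function checkVoiceStyle(voiceStyle) {
    const problems = [];
    for (const key of ['style_ttl', 'style_dp']) {
        const entry = voiceStyle[key];
        if (!entry || !Array.isArray(entry.dims) || !Array.isArray(entry.data)) {
            problems.push(`${key}: missing dims or data`);
            continue;
        }
        // The number of values implied by dims should match the flattened data.
        const expected = entry.dims.reduce((product, dim) => product * dim, 1);
        const actual = entry.data.flat(Infinity).length;
        if (actual !== expected) {
            problems.push(`${key}: dims imply ${expected} values, found ${actual}`);
        }
    }
    return problems;
}

// Tiny made-up style, for illustration only (real files are much larger).
const exampleStyle = {
    style_ttl: { dims: [1, 2, 2], data: [[[1, 2], [3, 4]]] },
    style_dp: { dims: [1, 1, 2], data: [[[5, 6]]] }
};
console.log(checkVoiceStyle(exampleStyle)); // an empty array: no problems found
```
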
## Troubleshooting

### Models not loading
- Check browser console for errors
- Ensure `assets/onnx/` path is correct and models are accessible
- Check CORS settings if serving from a different domain

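
To narrow down which file is failing, it can help to list every asset the demo requests and probe each one. This is a sketch under assumptions: the file names are the ones referenced in `helper.js` and `index.html`, while `expectedAssetUrls` and `findMissingAssets` are hypothetical helpers. Some static servers may not answer `HEAD` requests, in which case a `GET` would be needed instead.

```javascript
// Sketch: list the files the demo requests so each one can be checked.
function expectedAssetUrls(onnxDir = 'assets/onnx', styleDir = 'assets/voice_styles') {
    const onnxFiles = [
        'tts.json',
        'unicode_indexer.json',
        'duration_predictor.onnx',
        'text_encoder.onnx',
        'vector_estimator.onnx',
        'vocoder.onnx'
    ];
    const styleFiles = ['M1.json', 'M2.json', 'F1.json', 'F2.json'];
    return [
        ...onnxFiles.map(name => `${onnxDir}/${name}`),
        ...styleFiles.map(name => `${styleDir}/${name}`)
    ];
}

// Report every asset that does not answer with a 2xx status.
// fetchFn defaults to the global fetch; pass a stub to try this outside a browser.
async function findMissingAssets(urls, fetchFn = fetch) {
    const missing = [];
    for (const url of urls) {
        try {
            const response = await fetchFn(url, { method: 'HEAD' });
            if (!response.ok) missing.push(url);
        } catch (networkError) {
            missing.push(url);
        }
    }
    return missing;
}
```

In the browser console, `findMissingAssets(expectedAssetUrls()).then(console.log)` would print the URLs that could not be reached.
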
### WebGPU not available
- WebGPU is only available in recent Chrome/Edge browsers (version 113+)
- The app will automatically fall back to WebAssembly if WebGPU is not available
- Check the backend badge to see which execution provider is being used

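
The demo discovers the backend by trying WebGPU first and retrying with WebAssembly on failure. As a complementary sketch, WebGPU support can also be probed up front through `navigator.gpu`. `preferredExecutionProviders` is a hypothetical helper; the presence of `navigator.gpu` does not guarantee that a usable adapter exists, so the explicit retry remains useful.

```javascript
// Sketch: choose an execution provider order from a navigator-like object.
// `nav.gpu` is the WebGPU entry point exposed by supporting browsers.
function preferredExecutionProviders(nav) {
    const hasWebGPU = Boolean(nav && nav.gpu);
    return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// In a browser, pass the real `navigator`; plain objects are used here for illustration.
console.log(preferredExecutionProviders({ gpu: {} })); // ['webgpu', 'wasm']
console.log(preferredExecutionProviders({}));          // ['wasm']
```

The returned array has the same shape as the `executionProviders` option that the demo passes to `loadTextToSpeech`.
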
### Out of memory errors
- Try shorter text inputs
- Reduce the number of denoising steps (the **Total Steps** setting)
- Use a browser with more available memory
- Close other tabs to free up memory

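
One way to keep inputs short is to split long text into chunks and synthesize them one at a time. The sketch below is an assumption rather than something the demo does: `splitIntoChunks` is a hypothetical helper, the 200-character default is arbitrary, and the sentence detection is a rough heuristic.

```javascript
// Sketch: split long input into chunks of at most maxChars characters,
// preferring sentence boundaries.
function splitIntoChunks(text, maxChars = 200) {
    // Treat ., ! or ? followed by whitespace as the end of a sentence.
    const sentences = text.trim().split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
    const chunks = [];
    let current = '';
    for (const sentence of sentences) {
        // Start a new chunk when adding this sentence would exceed the limit.
        if (current && current.length + 1 + sentence.length > maxChars) {
            chunks.push(current);
            current = '';
        }
        current = current ? `${current} ${sentence}` : sentence;
        // A single sentence longer than maxChars is hard-split.
        while (current.length > maxChars) {
            chunks.push(current.slice(0, maxChars));
            current = current.slice(maxChars);
        }
    }
    if (current) chunks.push(current);
    return chunks;
}

console.log(splitIntoChunks('One. Two. Three.', 10)); // ['One. Two.', 'Three.']
```

Each chunk could then be passed to `textToSpeech.call([chunk], ...)` in turn and the resulting samples concatenated before writing the WAV file.
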
### Audio quality issues
- Try different voice style presets
- Increase the number of denoising steps (the **Total Steps** setting) for better quality

### Slow generation
- If using WebAssembly, try a browser that supports WebGPU
- Ensure no other heavy processes are running
- Consider using fewer denoising steps for faster (but lower quality) results
Symlink
+1
@@ -0,0 +1 @@
|
||||
../assets
|
||||
+396
@@ -0,0 +1,396 @@
|
||||
import * as ort from 'onnxruntime-web';
|
||||
|
||||
/**
|
||||
* Unicode Text Processor
|
||||
*/
|
||||
export class UnicodeProcessor {
|
||||
constructor(indexer) {
|
||||
this.indexer = indexer;
|
||||
}
|
||||
|
||||
call(textList) {
|
||||
        // Split each normalized text into code points so that characters outside
        // the BMP (surrogate pairs) count as one symbol instead of two.
        const processedTexts = textList.map(text => Array.from(this.preprocessText(text)));

        const textIdsLengths = processedTexts.map(chars => chars.length);
        const maxLen = Math.max(...textIdsLengths);

        const textIds = processedTexts.map(chars => {
            const row = new Array(maxLen).fill(0);
            for (let j = 0; j < chars.length; j++) {
                const codePoint = chars[j].codePointAt(0);
                // Code points missing from the indexer map to -1
                row[j] = (codePoint < this.indexer.length) ? this.indexer[codePoint] : -1;
|
||||
}
|
||||
return row;
|
||||
});
|
||||
|
||||
const textMask = this.getTextMask(textIdsLengths);
|
||||
return { textIds, textMask };
|
||||
}
|
||||
|
||||
preprocessText(text) {
|
||||
return text.normalize('NFKC');
|
||||
}
|
||||
|
||||
getTextMask(textIdsLengths) {
|
||||
const maxLen = Math.max(...textIdsLengths);
|
||||
return this.lengthToMask(textIdsLengths, maxLen);
|
||||
}
|
||||
|
||||
lengthToMask(lengths, maxLen = null) {
|
||||
const actualMaxLen = maxLen || Math.max(...lengths);
|
||||
return lengths.map(len => {
|
||||
const row = new Array(actualMaxLen).fill(0.0);
|
||||
for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
|
||||
row[j] = 1.0;
|
||||
}
|
||||
return [row];
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Style class to hold TTL and DP tensors
|
||||
*/
|
||||
export class Style {
|
||||
constructor(ttlTensor, dpTensor) {
|
||||
this.ttl = ttlTensor;
|
||||
this.dp = dpTensor;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Text-to-Speech class
|
||||
*/
|
||||
export class TextToSpeech {
|
||||
constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
|
||||
this.cfgs = cfgs;
|
||||
this.textProcessor = textProcessor;
|
||||
this.dpOrt = dpOrt;
|
||||
this.textEncOrt = textEncOrt;
|
||||
this.vectorEstOrt = vectorEstOrt;
|
||||
this.vocoderOrt = vocoderOrt;
|
||||
this.sampleRate = cfgs.ae.sample_rate;
|
||||
}
|
||||
|
||||
async call(textList, style, totalStep, progressCallback = null) {
|
||||
const bsz = textList.length;
|
||||
|
||||
// Process text
|
||||
const { textIds, textMask } = this.textProcessor.call(textList);
|
||||
|
||||
const textIdsFlat = new BigInt64Array(textIds.flat().map(x => BigInt(x)));
|
||||
const textIdsShape = [bsz, textIds[0].length];
|
||||
const textIdsTensor = new ort.Tensor('int64', textIdsFlat, textIdsShape);
|
||||
|
||||
const textMaskFlat = new Float32Array(textMask.flat(2));
|
||||
const textMaskShape = [bsz, 1, textMask[0][0].length];
|
||||
const textMaskTensor = new ort.Tensor('float32', textMaskFlat, textMaskShape);
|
||||
|
||||
// Predict duration
|
||||
const dpOutputs = await this.dpOrt.run({
|
||||
text_ids: textIdsTensor,
|
||||
style_dp: style.dp,
|
||||
text_mask: textMaskTensor
|
||||
});
|
||||
const duration = Array.from(dpOutputs.duration.data);
|
||||
|
||||
// Encode text
|
||||
const textEncOutputs = await this.textEncOrt.run({
|
||||
text_ids: textIdsTensor,
|
||||
style_ttl: style.ttl,
|
||||
text_mask: textMaskTensor
|
||||
});
|
||||
const textEmb = textEncOutputs.text_emb;
|
||||
|
||||
// Sample noisy latent
|
||||
let { xt, latentMask } = this.sampleNoisyLatent(
|
||||
duration,
|
||||
this.sampleRate,
|
||||
this.cfgs.ae.base_chunk_size,
|
||||
this.cfgs.ttl.chunk_compress_factor,
|
||||
this.cfgs.ttl.latent_dim
|
||||
);
|
||||
|
||||
const latentMaskFlat = new Float32Array(latentMask.flat(2));
|
||||
const latentMaskShape = [bsz, 1, latentMask[0][0].length];
|
||||
const latentMaskTensor = new ort.Tensor('float32', latentMaskFlat, latentMaskShape);
|
||||
|
||||
// Prepare constant arrays
|
||||
const totalStepArray = new Float32Array(bsz).fill(totalStep);
|
||||
const totalStepTensor = new ort.Tensor('float32', totalStepArray, [bsz]);
|
||||
|
||||
// Denoising loop
|
||||
for (let step = 0; step < totalStep; step++) {
|
||||
if (progressCallback) {
|
||||
progressCallback(step + 1, totalStep);
|
||||
}
|
||||
|
||||
const currentStepArray = new Float32Array(bsz).fill(step);
|
||||
const currentStepTensor = new ort.Tensor('float32', currentStepArray, [bsz]);
|
||||
|
||||
const xtFlat = new Float32Array(xt.flat(2));
|
||||
const xtShape = [bsz, xt[0].length, xt[0][0].length];
|
||||
const xtTensor = new ort.Tensor('float32', xtFlat, xtShape);
|
||||
|
||||
const vectorEstOutputs = await this.vectorEstOrt.run({
|
||||
noisy_latent: xtTensor,
|
||||
text_emb: textEmb,
|
||||
style_ttl: style.ttl,
|
||||
latent_mask: latentMaskTensor,
|
||||
text_mask: textMaskTensor,
|
||||
current_step: currentStepTensor,
|
||||
total_step: totalStepTensor
|
||||
});
|
||||
|
||||
const denoised = Array.from(vectorEstOutputs.denoised_latent.data);
|
||||
|
||||
// Reshape to 3D
|
||||
const latentDim = xt[0].length;
|
||||
const latentLen = xt[0][0].length;
|
||||
xt = [];
|
||||
let idx = 0;
|
||||
for (let b = 0; b < bsz; b++) {
|
||||
const batch = [];
|
||||
for (let d = 0; d < latentDim; d++) {
|
||||
const row = [];
|
||||
for (let t = 0; t < latentLen; t++) {
|
||||
row.push(denoised[idx++]);
|
||||
}
|
||||
batch.push(row);
|
||||
}
|
||||
xt.push(batch);
|
||||
}
|
||||
}
|
||||
|
||||
// Generate waveform
|
||||
const finalXtFlat = new Float32Array(xt.flat(2));
|
||||
const finalXtShape = [bsz, xt[0].length, xt[0][0].length];
|
||||
const finalXtTensor = new ort.Tensor('float32', finalXtFlat, finalXtShape);
|
||||
|
||||
const vocoderOutputs = await this.vocoderOrt.run({
|
||||
latent: finalXtTensor
|
||||
});
|
||||
|
||||
const wav = Array.from(vocoderOutputs.wav_tts.data);
|
||||
|
||||
return { wav, duration };
|
||||
}
|
||||
|
||||
sampleNoisyLatent(duration, sampleRate, baseChunkSize, chunkCompress, latentDim) {
|
||||
const bsz = duration.length;
|
||||
const maxDur = Math.max(...duration);
|
||||
|
||||
const wavLenMax = Math.floor(maxDur * sampleRate);
|
||||
const wavLengths = duration.map(d => Math.floor(d * sampleRate));
|
||||
|
||||
const chunkSize = baseChunkSize * chunkCompress;
|
||||
const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
|
||||
const latentDimVal = latentDim * chunkCompress;
|
||||
|
||||
const xt = [];
|
||||
for (let b = 0; b < bsz; b++) {
|
||||
const batch = [];
|
||||
for (let d = 0; d < latentDimVal; d++) {
|
||||
const row = [];
|
||||
for (let t = 0; t < latentLen; t++) {
|
||||
// Box-Muller transform
|
||||
const u1 = Math.max(0.0001, Math.random());
|
||||
const u2 = Math.random();
|
||||
const val = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
|
||||
row.push(val);
|
||||
}
|
||||
batch.push(row);
|
||||
}
|
||||
xt.push(batch);
|
||||
}
|
||||
|
||||
const latentLengths = wavLengths.map(len => Math.floor((len + chunkSize - 1) / chunkSize));
|
||||
const latentMask = this.lengthToMask(latentLengths, latentLen);
|
||||
|
||||
// Apply mask
|
||||
for (let b = 0; b < bsz; b++) {
|
||||
for (let d = 0; d < latentDimVal; d++) {
|
||||
for (let t = 0; t < latentLen; t++) {
|
||||
xt[b][d][t] *= latentMask[b][0][t];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return { xt, latentMask };
|
||||
}
|
||||
|
||||
lengthToMask(lengths, maxLen = null) {
|
||||
const actualMaxLen = maxLen || Math.max(...lengths);
|
||||
return lengths.map(len => {
|
||||
const row = new Array(actualMaxLen).fill(0.0);
|
||||
for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
|
||||
row[j] = 1.0;
|
||||
}
|
||||
return [row];
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Load voice style from JSON files
|
||||
*/
|
||||
export async function loadVoiceStyle(voiceStylePaths, verbose = false) {
|
||||
const bsz = voiceStylePaths.length;
|
||||
|
||||
// Read first file to get dimensions
|
||||
const firstResponse = await fetch(voiceStylePaths[0]);
|
||||
const firstStyle = await firstResponse.json();
|
||||
|
||||
const ttlDims = firstStyle.style_ttl.dims;
|
||||
const dpDims = firstStyle.style_dp.dims;
|
||||
|
||||
const ttlDim1 = ttlDims[1];
|
||||
const ttlDim2 = ttlDims[2];
|
||||
const dpDim1 = dpDims[1];
|
||||
const dpDim2 = dpDims[2];
|
||||
|
||||
// Pre-allocate arrays with full batch size
|
||||
const ttlSize = bsz * ttlDim1 * ttlDim2;
|
||||
const dpSize = bsz * dpDim1 * dpDim2;
|
||||
const ttlFlat = new Float32Array(ttlSize);
|
||||
const dpFlat = new Float32Array(dpSize);
|
||||
|
||||
// Fill in the data
|
||||
for (let i = 0; i < bsz; i++) {
|
||||
const response = await fetch(voiceStylePaths[i]);
|
||||
const voiceStyle = await response.json();
|
||||
|
||||
// Flatten TTL data
|
||||
const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
|
||||
const ttlOffset = i * ttlDim1 * ttlDim2;
|
||||
ttlFlat.set(ttlData, ttlOffset);
|
||||
|
||||
// Flatten DP data
|
||||
const dpData = voiceStyle.style_dp.data.flat(Infinity);
|
||||
const dpOffset = i * dpDim1 * dpDim2;
|
||||
dpFlat.set(dpData, dpOffset);
|
||||
}
|
||||
|
||||
const ttlShape = [bsz, ttlDim1, ttlDim2];
|
||||
const dpShape = [bsz, dpDim1, dpDim2];
|
||||
|
||||
const ttlTensor = new ort.Tensor('float32', ttlFlat, ttlShape);
|
||||
const dpTensor = new ort.Tensor('float32', dpFlat, dpShape);
|
||||
|
||||
if (verbose) {
|
||||
console.log(`Loaded ${bsz} voice styles`);
|
||||
}
|
||||
|
||||
return new Style(ttlTensor, dpTensor);
|
||||
}
|
||||
|
||||
/**
|
||||
* Load configuration from JSON
|
||||
*/
|
||||
export async function loadCfgs(onnxDir) {
|
||||
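    const response = await fetch(`${onnxDir}/tts.json`);
    if (!response.ok) {
        // Surface a missing or misplaced config file instead of a JSON parse error
        throw new Error(`Failed to load ${onnxDir}/tts.json: HTTP ${response.status}`);
    }
    return response.json();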
|
||||
}
|
||||
|
||||
/**
|
||||
* Load text processor
|
||||
*/
|
||||
export async function loadTextProcessor(onnxDir) {
|
||||
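    const response = await fetch(`${onnxDir}/unicode_indexer.json`);
    if (!response.ok) {
        // Surface a missing or misplaced indexer file instead of a JSON parse error
        throw new Error(`Failed to load ${onnxDir}/unicode_indexer.json: HTTP ${response.status}`);
    }
    const indexer = await response.json();
    return new UnicodeProcessor(indexer);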
|
||||
}
|
||||
|
||||
/**
|
||||
* Load ONNX model
|
||||
*/
|
||||
export async function loadOnnx(onnxPath, options) {
|
||||
const session = await ort.InferenceSession.create(onnxPath, options);
|
||||
return session;
|
||||
}
|
||||
|
||||
/**
|
||||
* Load all TTS components
|
||||
*/
|
||||
export async function loadTextToSpeech(onnxDir, sessionOptions = {}, progressCallback = null) {
|
||||
console.log('Using WebAssembly/WebGPU for inference');
|
||||
|
||||
const cfgs = await loadCfgs(onnxDir);
|
||||
|
||||
const dpPath = `${onnxDir}/duration_predictor.onnx`;
|
||||
const textEncPath = `${onnxDir}/text_encoder.onnx`;
|
||||
const vectorEstPath = `${onnxDir}/vector_estimator.onnx`;
|
||||
const vocoderPath = `${onnxDir}/vocoder.onnx`;
|
||||
|
||||
const modelPaths = [
|
||||
{ name: 'Duration Predictor', path: dpPath },
|
||||
{ name: 'Text Encoder', path: textEncPath },
|
||||
{ name: 'Vector Estimator', path: vectorEstPath },
|
||||
{ name: 'Vocoder', path: vocoderPath }
|
||||
];
|
||||
|
||||
const sessions = [];
|
||||
for (let i = 0; i < modelPaths.length; i++) {
|
||||
if (progressCallback) {
|
||||
progressCallback(modelPaths[i].name, i + 1, modelPaths.length);
|
||||
}
|
||||
const session = await loadOnnx(modelPaths[i].path, sessionOptions);
|
||||
sessions.push(session);
|
||||
}
|
||||
|
||||
const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = sessions;
|
||||
|
||||
const textProcessor = await loadTextProcessor(onnxDir);
|
||||
const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
|
||||
|
||||
return { textToSpeech, cfgs };
|
||||
}
|
||||
|
||||
/**
|
||||
* Write WAV file to ArrayBuffer
|
||||
*/
|
||||
export function writeWavFile(audioData, sampleRate) {
|
||||
const numChannels = 1;
|
||||
const bitsPerSample = 16;
|
||||
const byteRate = sampleRate * numChannels * bitsPerSample / 8;
|
||||
const blockAlign = numChannels * bitsPerSample / 8;
|
||||
const dataSize = audioData.length * 2;
|
||||
|
||||
// Create ArrayBuffer
|
||||
const buffer = new ArrayBuffer(44 + dataSize);
|
||||
const view = new DataView(buffer);
|
||||
|
||||
// Write WAV header
|
||||
const writeString = (offset, string) => {
|
||||
for (let i = 0; i < string.length; i++) {
|
||||
view.setUint8(offset + i, string.charCodeAt(i));
|
||||
}
|
||||
};
|
||||
|
||||
writeString(0, 'RIFF');
|
||||
view.setUint32(4, 36 + dataSize, true);
|
||||
writeString(8, 'WAVE');
|
||||
writeString(12, 'fmt ');
|
||||
view.setUint32(16, 16, true);
|
||||
view.setUint16(20, 1, true); // PCM
|
||||
view.setUint16(22, numChannels, true);
|
||||
view.setUint32(24, sampleRate, true);
|
||||
view.setUint32(28, byteRate, true);
|
||||
view.setUint16(32, blockAlign, true);
|
||||
view.setUint16(34, bitsPerSample, true);
|
||||
writeString(36, 'data');
|
||||
view.setUint32(40, dataSize, true);
|
||||
|
||||
// Write audio data
|
||||
    // Write samples as little-endian 16-bit PCM through the DataView,
    // so the result does not depend on the host byte order.
    for (let i = 0; i < audioData.length; i++) {
        const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
        view.setInt16(44 + i * 2, Math.floor(clamped * 32767), true);
    }
|
||||
|
||||
return buffer;
|
||||
}
|
||||
@@ -0,0 +1,72 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Supertonic - Web Demo</title>
|
||||
<link rel="stylesheet" href="/style.css">
|
||||
</head>
|
||||
<body>
|
||||
<div class="container">
|
||||
<h1>🎤 Supertonic</h1>
|
||||
<p class="subtitle">Text-to-Speech with ONNX Runtime Web</p>
|
||||
|
||||
<div id="statusBox" class="status-box">
|
||||
<div class="status-text-wrapper">
|
||||
<div id="statusText">ℹ️ <strong>Loading models...</strong>
|
||||
Please wait...</div>
|
||||
</div>
|
||||
<div id="backendBadge" class="backend-badge">WebAssembly</div>
|
||||
</div>
|
||||
|
||||
<div class="main-content">
|
||||
<div class="left-panel">
|
||||
<div class="section">
|
||||
<div class="ref-audio-label">
|
||||
<label for="voiceStyleSelect">Voice Style: </label>
|
||||
<span id="voiceStyleInfo"
|
||||
class="ref-audio-info">Loading...</span>
|
||||
</div>
|
||||
<select id="voiceStyleSelect">
|
||||
<option value="assets/voice_styles/M1.json">Male 1 (M1)</option>
|
||||
<option value="assets/voice_styles/M2.json">Male 2 (M2)</option>
|
||||
<option value="assets/voice_styles/F1.json">Female 1 (F1)</option>
|
||||
<option value="assets/voice_styles/F2.json">Female 2 (F2)</option>
|
||||
</select>
|
||||
</div>
|
||||
|
||||
<div class="section">
|
||||
<label for="text">Text to Synthesize:</label>
|
||||
<textarea id="text"
|
||||
placeholder="Enter the text you want to convert to speech...">This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.</textarea>
|
||||
</div>
|
||||
|
||||
<div class="params-grid">
|
||||
<div class="section">
|
||||
<label for="totalStep">Total Steps (higher = better
|
||||
quality):</label>
|
||||
<input type="number" id="totalStep" value="5"
|
||||
min="1" max="50">
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<button id="generateBtn">Generate Speech</button>
|
||||
|
||||
<div id="error" class="error"></div>
|
||||
</div>
|
||||
|
||||
<div class="right-panel">
|
||||
<div id="results" class="results">
|
||||
<div class="results-placeholder">
|
||||
<div class="results-placeholder-icon">🎤</div>
|
||||
<p>Generated speech will appear here</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script type="module" src="/main.js"></script>
|
||||
</body>
|
||||
</html>
|
||||
+285
@@ -0,0 +1,285 @@
|
||||
import {
|
||||
loadTextToSpeech,
|
||||
loadVoiceStyle,
|
||||
writeWavFile
|
||||
} from './helper.js';
|
||||
|
||||
// Configuration
|
||||
const DEFAULT_VOICE_STYLE_PATH = 'assets/voice_styles/M1.json';
|
||||
|
||||
// Helper function to extract filename from path
|
||||
function getFilenameFromPath(path) {
|
||||
return path.split('/').pop();
|
||||
}
|
||||
|
||||
// Global state
|
||||
let textToSpeech = null;
|
||||
let cfgs = null;
|
||||
|
||||
// Pre-computed style
|
||||
let currentStyle = null;
|
||||
let currentStylePath = DEFAULT_VOICE_STYLE_PATH;
|
||||
|
||||
// UI Elements
|
||||
const textInput = document.getElementById('text');
|
||||
const voiceStyleSelect = document.getElementById('voiceStyleSelect');
|
||||
const voiceStyleInfo = document.getElementById('voiceStyleInfo');
|
||||
const totalStepInput = document.getElementById('totalStep');
|
||||
const generateBtn = document.getElementById('generateBtn');
|
||||
const statusBox = document.getElementById('statusBox');
|
||||
const statusText = document.getElementById('statusText');
|
||||
const backendBadge = document.getElementById('backendBadge');
|
||||
const resultsContainer = document.getElementById('results');
|
||||
const errorBox = document.getElementById('error');
|
||||
|
||||
function showStatus(message, type = 'info') {
|
||||
statusText.innerHTML = message;
|
||||
statusBox.className = 'status-box';
|
||||
if (type === 'success') {
|
||||
statusBox.classList.add('success');
|
||||
} else if (type === 'error') {
|
||||
statusBox.classList.add('error');
|
||||
}
|
||||
}
|
||||
|
||||
function showError(message) {
|
||||
errorBox.textContent = message;
|
||||
errorBox.classList.add('active');
|
||||
}
|
||||
|
||||
function hideError() {
|
||||
errorBox.classList.remove('active');
|
||||
}
|
||||
|
||||
function showBackendBadge() {
|
||||
backendBadge.classList.add('visible');
|
||||
}
|
||||
|
||||
// Load voice style from JSON
|
||||
async function loadStyleFromJSON(stylePath) {
|
||||
try {
|
||||
const style = await loadVoiceStyle([stylePath], true);
|
||||
return style;
|
||||
} catch (error) {
|
||||
console.error('Error loading voice style:', error);
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
// Load models on page load
|
||||
async function initializeModels() {
|
||||
try {
|
||||
showStatus('ℹ️ <strong>Loading configuration...</strong>');
|
||||
|
||||
const basePath = 'assets/onnx';
|
||||
|
||||
// Try WebGPU first, fallback to WASM
|
||||
let executionProvider = 'wasm';
|
||||
try {
|
||||
const result = await loadTextToSpeech(basePath, {
|
||||
executionProviders: ['webgpu'],
|
||||
graphOptimizationLevel: 'all'
|
||||
}, (modelName, current, total) => {
|
||||
showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
|
||||
});
|
||||
|
||||
textToSpeech = result.textToSpeech;
|
||||
cfgs = result.cfgs;
|
||||
|
||||
executionProvider = 'webgpu';
|
||||
backendBadge.textContent = 'WebGPU';
|
||||
backendBadge.style.background = '#4caf50';
|
||||
} catch (webgpuError) {
|
||||
console.warn('WebGPU initialization failed, falling back to WebAssembly:', webgpuError);
|
||||
|
||||
const result = await loadTextToSpeech(basePath, {
|
||||
executionProviders: ['wasm'],
|
||||
graphOptimizationLevel: 'all'
|
||||
}, (modelName, current, total) => {
|
||||
showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
|
||||
});
|
||||
|
||||
textToSpeech = result.textToSpeech;
|
||||
cfgs = result.cfgs;
|
||||
}
|
||||
|
||||
showStatus('ℹ️ <strong>Loading default voice style...</strong>');
|
||||
|
||||
// Load default voice style
|
||||
currentStyle = await loadStyleFromJSON(currentStylePath);
|
||||
voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
|
||||
|
||||
showStatus(`✅ <strong>Models loaded!</strong> Using ${executionProvider.toUpperCase()}. You can now generate speech.`, 'success');
|
||||
showBackendBadge();
|
||||
|
||||
generateBtn.disabled = false;
|
||||
|
||||
} catch (error) {
|
||||
console.error('Error loading models:', error);
|
||||
showStatus(`❌ <strong>Error loading models:</strong> ${error.message}`, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// Handle voice style selection
|
||||
voiceStyleSelect.addEventListener('change', async (e) => {
|
||||
const selectedValue = e.target.value;
|
||||
|
||||
if (!selectedValue) return;
|
||||
|
||||
try {
|
||||
generateBtn.disabled = true;
|
||||
showStatus(`ℹ️ <strong>Loading voice style...</strong>`, 'info');
|
||||
|
||||
currentStylePath = selectedValue;
|
||||
currentStyle = await loadStyleFromJSON(currentStylePath);
|
||||
voiceStyleInfo.textContent = getFilenameFromPath(currentStylePath);
|
||||
|
||||
showStatus(`✅ <strong>Voice style loaded:</strong> ${getFilenameFromPath(currentStylePath)}`, 'success');
|
||||
generateBtn.disabled = false;
|
||||
} catch (error) {
|
||||
showError(`Error loading voice style: ${error.message}`);
|
||||
|
||||
// Restore default style
|
||||
currentStylePath = DEFAULT_VOICE_STYLE_PATH;
|
||||
voiceStyleSelect.value = currentStylePath;
|
||||
try {
|
||||
currentStyle = await loadStyleFromJSON(currentStylePath);
|
||||
voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
|
||||
} catch (styleError) {
|
||||
console.error('Error restoring default style:', styleError);
|
||||
}
|
||||
|
||||
generateBtn.disabled = false;
|
||||
}
|
||||
});
|
||||
|
||||
// Main synthesis function
|
||||
async function generateSpeech() {
|
||||
const text = textInput.value.trim();
|
||||
if (!text) {
|
||||
showError('Please enter some text to synthesize.');
|
||||
return;
|
||||
}
|
||||
|
||||
if (!textToSpeech || !cfgs) {
|
||||
showError('Models are still loading. Please wait.');
|
||||
return;
|
||||
}
|
||||
|
||||
if (!currentStyle) {
|
||||
showError('Voice style is not ready. Please wait.');
|
||||
return;
|
||||
}
|
||||
|
||||
const startTime = Date.now();
|
||||
|
||||
try {
|
||||
generateBtn.disabled = true;
|
||||
hideError();
|
||||
|
||||
// Clear results and show placeholder
|
||||
resultsContainer.innerHTML = `
|
||||
<div class="results-placeholder generating">
|
||||
<div class="results-placeholder-icon">⏳</div>
|
||||
<p>Generating speech...</p>
|
||||
</div>
|
||||
`;
|
||||
|
||||
const totalStep = Math.max(1, parseInt(totalStepInput.value, 10) || 5); // fall back to the default of 5 on invalid input
|
||||
const textList = [text];
|
||||
|
||||
showStatus('ℹ️ <strong>Generating speech from text...</strong>');
|
||||
const tic = Date.now();
|
||||
|
||||
const { wav, duration } = await textToSpeech.call(
|
||||
textList,
|
||||
currentStyle,
|
||||
totalStep,
|
||||
(step, total) => {
|
||||
showStatus(`ℹ️ <strong>Denoising (${step}/${total})...</strong>`);
|
||||
}
|
||||
);
|
||||
|
||||
const toc = Date.now();
|
||||
console.log(`Text-to-speech synthesis: ${((toc - tic) / 1000).toFixed(2)}s`);
|
||||
|
||||
showStatus('ℹ️ <strong>Creating audio file...</strong>');
|
||||
const wavLen = Math.floor(textToSpeech.sampleRate * duration[0]);
|
||||
const wavOut = wav.slice(0, wavLen);
|
||||
|
||||
// Create WAV file
|
||||
const wavBuffer = writeWavFile(wavOut, textToSpeech.sampleRate);
|
||||
const blob = new Blob([wavBuffer], { type: 'audio/wav' });
|
||||
const url = URL.createObjectURL(blob);
|
||||
|
||||
// Calculate total time and audio duration
|
||||
const endTime = Date.now();
|
||||
const totalTimeSec = ((endTime - startTime) / 1000).toFixed(2);
|
||||
const audioDurationSec = duration[0].toFixed(2);
|
||||
|
||||
// Display result with full text
|
||||
resultsContainer.innerHTML = `
|
||||
<div class="result-item">
|
||||
<div class="result-text-container">
|
||||
<div class="result-text-label">Input Text</div>
|
||||
<div class="result-text">${text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')}</div>
|
||||
</div>
|
||||
<div class="result-info">
|
||||
<div class="info-item">
|
||||
<span>📊 Audio Length</span>
|
||||
<strong>${audioDurationSec}s</strong>
|
||||
</div>
|
||||
<div class="info-item">
|
||||
<span>⏱️ Generation Time</span>
|
||||
<strong>${totalTimeSec}s</strong>
|
||||
</div>
|
||||
</div>
|
||||
<div class="result-player">
|
||||
<audio controls>
|
||||
<source src="${url}" type="audio/wav">
|
||||
</audio>
|
||||
</div>
|
||||
<div class="result-actions">
|
||||
<button onclick="downloadAudio('${url}', 'synthesized_speech.wav')">
|
||||
<span>⬇️</span>
|
||||
<span>Download WAV</span>
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
|
||||
showStatus('✅ <strong>Speech synthesis completed successfully!</strong>', 'success');
|
||||
|
||||
    } catch (error) {
        console.error('Error during synthesis:', error);
        showStatus(`❌ <strong>Error during synthesis:</strong> ${error.message}`, 'error');
        showError(`Error during synthesis: ${error.message}`);

        // Restore placeholder
        resultsContainer.innerHTML = `
            <div class="results-placeholder">
                <div class="results-placeholder-icon">🎤</div>
                <p>Generated speech will appear here</p>
            </div>
        `;
    } finally {
        generateBtn.disabled = false;
    }
}

// Download handler (make it global so it can be called from onclick)
window.downloadAudio = function(url, filename) {
    const a = document.createElement('a');
    a.href = url;
    a.download = filename;
    a.click();
};

// Attach generate function to button
generateBtn.addEventListener('click', generateSpeech);

// Initialize on load
window.addEventListener('load', async () => {
    generateBtn.disabled = true;
    await initializeModels();
});
@@ -0,0 +1,21 @@
{
  "name": "tts-onnx-web",
  "version": "1.0.0",
  "description": "TTS inference using ONNX Runtime for Web Browser",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "preview": "vite preview"
  },
  "keywords": ["tts", "onnx", "speech-synthesis", "web"],
  "author": "",
  "license": "MIT",
  "dependencies": {
    "fft.js": "^4.0.3",
    "onnxruntime-web": "^1.17.0"
  },
  "devDependencies": {
    "vite": "^5.0.0"
  }
}
@@ -0,0 +1,453 @@
* {
    margin: 0;
    padding: 0;
    box-sizing: border-box;
}

body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    min-height: 100vh;
    display: flex;
    justify-content: center;
    align-items: center;
    padding: 20px;
}

.container {
    background: white;
    border-radius: 20px;
    padding: 40px;
    max-width: 1400px;
    width: 100%;
    box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
}

.main-content {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 40px;
    margin-top: 30px;
    align-items: start;
}

.left-panel {
    display: flex;
    flex-direction: column;
}

.right-panel {
    display: flex;
    flex-direction: column;
    height: 100%;
}

@media (max-width: 1024px) {
    .main-content {
        grid-template-columns: 1fr;
    }
}

h1 {
    color: #333;
    margin-bottom: 10px;
    font-size: 2em;
}

.subtitle {
    color: #666;
    margin-bottom: 30px;
    font-size: 1.1em;
}

.section {
    margin-bottom: 25px;
}

label {
    display: block;
    font-weight: 600;
    color: #333;
    margin-bottom: 8px;
    font-size: 0.95em;
}

input[type="file"],
textarea,
input[type="number"] {
    width: 100%;
    padding: 12px;
    border: 2px solid #e0e0e0;
    border-radius: 8px;
    font-size: 1em;
    transition: border-color 0.3s;
}

input[type="file"]:focus,
textarea:focus,
input[type="number"]:focus {
    outline: none;
    border-color: #667eea;
}

textarea {
    resize: vertical;
    min-height: 100px;
    font-family: inherit;
}

.params-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 15px;
}

button {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    border: none;
    padding: 15px 30px;
    font-size: 1.1em;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    width: 100%;
    transition: transform 0.2s, box-shadow 0.2s;
}

button:hover:not(:disabled) {
    transform: translateY(-2px);
    box-shadow: 0 5px 20px rgba(102, 126, 234, 0.4);
}

button:disabled {
    opacity: 0.6;
    cursor: not-allowed;
}

.status-box {
    background: #e3f2fd;
    border-left: 4px solid #2196f3;
    padding: 15px;
    margin-bottom: 10px;
    border-radius: 4px;
    font-size: 0.9em;
    color: #1565c0;
    transition: all 0.3s ease;
    display: flex;
    justify-content: space-between;
    align-items: center;
    flex-wrap: wrap;
    gap: 15px;
    min-height: 50px;
}

.status-box.success {
    background: #e8f5e9;
    border-left-color: #4caf50;
    color: #2e7d32;
}

.status-box.error {
    background: #ffebee;
    border-left-color: #f44336;
    color: #c62828;
}

.status-text-wrapper {
    flex: 1;
    min-width: 200px;
}

.backend-badge {
    display: inline-block;
    visibility: hidden;
    padding: 6px 12px;
    background: #ff9800;
    color: white;
    border-radius: 12px;
    font-size: 0.85em;
    font-weight: 600;
    margin-left: 10px;
    white-space: nowrap;
}

.backend-badge.visible {
    visibility: visible;
}

.ref-audio-info {
    color: #4caf50;
    font-weight: 700;
    font-size: 0.95em;
}

.ref-audio-label {
    margin-bottom: 8px;
}

.ref-audio-label label {
    display: inline;
    margin-bottom: 0;
}

.results {
    flex: 1;
    display: flex;
    flex-direction: column;
}

.result-item {
    background: white;
    border-radius: 16px;
    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
    overflow: hidden;
    transition: box-shadow 0.3s ease;
    display: flex;
    flex-direction: column;
    flex: 1;
}

.result-item:hover {
    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}

.result-item h3 {
    color: #667eea;
    margin-bottom: 15px;
    font-size: 1.2em;
}

.result-text-container {
    padding: 20px;
    background: linear-gradient(135deg, #f8f9ff 0%, #ffffff 100%);
    border-bottom: 1px solid #e8ecf5;
    flex: 1;
    display: flex;
    flex-direction: column;
    overflow: hidden;
}

.result-text-label {
    font-size: 0.75em;
    text-transform: uppercase;
    letter-spacing: 0.5px;
    color: #667eea;
    font-weight: 600;
    margin-bottom: 8px;
}

.result-text {
    color: #333;
    line-height: 1.7;
    font-size: 0.95em;
    word-wrap: break-word;
    white-space: pre-wrap;
    overflow-y: auto;
    padding-right: 8px;
    flex: 1;
}

.result-text::-webkit-scrollbar {
    width: 6px;
}

.result-text::-webkit-scrollbar-track {
    background: #f0f0f0;
    border-radius: 3px;
}

.result-text::-webkit-scrollbar-thumb {
    background: #c0c0c0;
    border-radius: 3px;
}

.result-text::-webkit-scrollbar-thumb:hover {
    background: #a0a0a0;
}

.result-info {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 0;
    background: #fafbff;
}

.info-item {
    padding: 16px 20px;
    display: flex;
    align-items: center;
    gap: 8px;
    font-size: 0.9em;
    color: #666;
    border-bottom: 1px solid #e8ecf5;
}

.info-item:nth-child(1) {
    border-right: 1px solid #e8ecf5;
}

.info-item strong {
    color: #333;
    font-size: 1.1em;
    font-weight: 600;
    margin-left: auto;
}

.result-player {
    padding: 20px;
    background: white;
}

.result-item audio {
    width: 100%;
    height: 48px;
    outline: none;
}

.result-item audio:focus {
    outline: 2px solid #667eea;
    outline-offset: 2px;
    border-radius: 4px;
}

.result-actions {
    padding: 16px 20px 20px;
    background: white;
}

.result-item button {
    width: 100%;
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    border: none;
    padding: 12px 24px;
    font-size: 0.95em;
    font-weight: 600;
    border-radius: 8px;
    cursor: pointer;
    transition: all 0.3s ease;
    display: flex;
    align-items: center;
    justify-content: center;
    gap: 8px;
}

.result-item button:hover {
    transform: translateY(-2px);
    box-shadow: 0 4px 16px rgba(102, 126, 234, 0.3);
}

.result-item button:active {
    transform: translateY(0);
}

@media (max-width: 640px) {
    .result-info {
        grid-template-columns: 1fr;
    }

    .info-item:nth-child(1) {
        border-right: none;
    }
}

audio {
    width: 100%;
    margin-top: 10px;
}

.error {
    background: #fee;
    color: #c00;
    padding: 15px;
    border-radius: 8px;
    margin-top: 20px;
    display: none;
}

.error.active {
    display: block;
}

.warning-box {
    background: #fff3cd;
    color: #856404;
    padding: 12px 15px;
    border-radius: 8px;
    margin-top: 10px;
    border-left: 4px solid #ffc107;
    font-size: 0.9em;
    display: none;
    line-height: 1.5;
}

.warning-box.active {
    display: block;
}

.warning-box::before {
    content: "⚠️ ";
    margin-right: 5px;
}

.results-placeholder {
    background: white;
    border-radius: 16px;
    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
    padding: 60px 40px;
    text-align: center;
    color: #999;
    transition: all 0.3s ease;
    display: flex;
    flex-direction: column;
    justify-content: center;
    align-items: center;
    flex: 1;
    min-height: 400px;
}

.results-placeholder:hover {
    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}

.results-placeholder-icon {
    font-size: 4em;
    margin-bottom: 20px;
    opacity: 0.6;
    animation: float 3s ease-in-out infinite;
}

.results-placeholder.generating .results-placeholder-icon {
    animation: spin 2s linear infinite;
}

@keyframes float {
    0%, 100% {
        transform: translateY(0px);
    }
    50% {
        transform: translateY(-10px);
    }
}

@keyframes spin {
    0% {
        transform: rotate(0deg);
    }
    100% {
        transform: rotate(360deg);
    }
}

.results-placeholder p {
    font-size: 1.05em;
    color: #888;
    font-weight: 500;
    margin: 0;
}

.hidden {
    display: none;
}
@@ -0,0 +1,14 @@
import { defineConfig } from 'vite';

export default defineConfig({
    server: {
        port: 3000,
        open: true
    },
    build: {
        target: 'esnext'
    },
    optimizeDeps: {
        exclude: ['onnxruntime-web']
    }
});