init

2026-07-03 14:08:32 +02:00 · 2025-11-19 01:18:16 +09:00
commit d31536d9fc
74 changed files with 10681 additions and 0 deletions
@@ -0,0 +1 @@
+assets/onnx/*.onnx filter=lfs diff=lfs merge=lfs -text
@@ -0,0 +1,61 @@
+assets/*
+assets/.git
+assets/.gitignore
+assets/.gitattributes
+
+*.onnx
+onnx
+
+# Output files
+results
+
+# Python
+__pycache__
+*.py[cod]
+*$py.class
+*.so
+.Python
+
+# Virtual environments
+.venv
+venv/
+ENV/
+env/
+
+# Node.js
+node_modules/
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+package-lock.json
+
+# Swift
+.build/
+.swiftpm/
+*.xcodeproj
+*.xcworkspace
+xcuserdata/
+DerivedData/
+
+# Distribution / packaging
+build/
+dist/
+*.egg-info/
+.eggs/
+
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+.tox/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Supertone Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,393 @@
+# Supertonic — Lightning Fast, On-Device TTS
+
+[![Demo](https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow)](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
+[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-blue)](https://huggingface.co/Supertone/supertonic)
+
+<p align="center">
+  <img src="img/Supertonic_IMG_v02_4x.webp" alt="Supertonic Banner">
+</p>
+
+**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
+
+> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic)
+
+### Table of Contents
+
+- [Why Supertonic?](#why-supertonic)
+- [Language Support](#language-support)
+- [Getting Started](#getting-started)
+- [Performance](#performance)
+- [Citation](#citation)
+- [License](#license)
+
+## Why Supertonic?
+
+- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
+- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
+- **📱 On-Device Capable**: **Complete privacy** and **no network latency**, since all processing happens locally on your device
+- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
+- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
+- **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends
+
+
+## Language Support
+
+We provide ready-to-use TTS inference examples across multiple ecosystems:
+
+| Language/Platform | Path | Description |
+|-------------------|------|-------------|
+| [**Python**](py/) | `py/` | ONNX Runtime inference |
+| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
+| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
+| [**Java**](java/) | `java/` | Cross-platform JVM |
+| [**C++**](cpp/) | `cpp/` | High-performance C++ |
+| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
+| [**Go**](go/) | `go/` | Go implementation |
+| [**Swift**](swift/) | `swift/` | macOS applications |
+| [**iOS**](ios/) | `ios/` | Native iOS apps |
+| [**Rust**](rust/) | `rust/` | Memory-safe systems |
+
+> For detailed usage instructions, please refer to the README.md in each language directory.
+
+## Getting Started
+
+First, clone the repository:
+
+```bash
+git clone https://github.com/supertone-inc/supertonic.git
+cd supertonic
+```
+
+### Prerequisites
+
+Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:
+
+> **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
+> - macOS: `brew install git-lfs && git lfs install`
+> - Generic: see `https://git-lfs.com` for installers
+
+```bash
+git clone https://huggingface.co/Supertone/supertonic assets
+```
+
+### Quick Start
+
+**Python Example** ([Details](py/))
+```bash
+cd py
+uv sync
+uv run example_onnx.py
+```
+
+**Node.js Example** ([Details](nodejs/))
+```bash
+cd nodejs
+npm install
+npm start
+```
+
+**Browser Example** ([Details](web/))
+```bash
+cd web
+npm install
+npm run dev
+```
+
+**Java Example** ([Details](java/))
+```bash
+cd java
+mvn clean install
+mvn exec:java
+```
+
+**C++ Example** ([Details](cpp/))
+```bash
+cd cpp
+mkdir build && cd build
+cmake .. && cmake --build . --config Release
+./example_onnx
+```
+
+**C# Example** ([Details](csharp/))
+```bash
+cd csharp
+dotnet restore
+dotnet run
+```
+
+**Go Example** ([Details](go/))
+```bash
+cd go
+go mod download
+go run example_onnx.go helper.go
+```
+
+**Swift Example** ([Details](swift/))
+```bash
+cd swift
+swift build -c release
+.build/release/example_onnx
+```
+
+**Rust Example** ([Details](rust/))
+```bash
+cd rust
+cargo build --release
+./target/release/example_onnx
+```
+
+**iOS Example** ([Details](ios/))
+```bash
+cd ios/ExampleiOSApp
+xcodegen generate
+open ExampleiOSApp.xcodeproj
+```
+- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
+- Choose your iPhone as run destination → Build & Run
+
+
+### Technical Details
+
+- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
+- **Browser Support**: onnxruntime-web for client-side inference
+- **Batch Processing**: Supports batch inference for improved throughput
+- **Audio Output**: Outputs 16-bit WAV files
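
The conversion to 16-bit audio happens inside a helper (`writeWavFile`) whose body is not shown here, so the following is only a sketch of one common approach, not the repository's implementation. It assumes the synthesized samples are floats nominally in the range [-1, 1]; the function name `floatToPcm16` and the scale factor 32767 are illustrative choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Convert float samples (assumed nominally in [-1, 1]) to signed 16-bit PCM.
// Out-of-range values are clamped first so they saturate instead of wrapping.
std::vector<std::int16_t> floatToPcm16(const std::vector<float>& samples) {
    std::vector<std::int16_t> pcm;
    pcm.reserve(samples.size());
    for (float s : samples) {
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        pcm.push_back(static_cast<std::int16_t>(std::lround(clamped * 32767.0f)));
    }
    return pcm;
}
```

Clamping before scaling is the important design choice: without it, a sample slightly above 1.0 would overflow the 16-bit range.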
+
+## Performance
+
+We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
+
+**Metrics:**
+- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
+- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
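
As a rough illustration of the two metric definitions above, the helpers below compute them from a measured synthesis time. The function names are hypothetical and not part of the repository; how the timings were actually collected for the benchmark is not described in this section.

```cpp
#include <cmath>
#include <cstddef>

// Throughput: number of input characters divided by the synthesis time.
double charsPerSecond(std::size_t num_chars, double synth_seconds) {
    return static_cast<double>(num_chars) / synth_seconds;
}

// Real-time factor: synthesis time divided by the duration of the audio produced.
double realTimeFactor(double synth_seconds, double audio_seconds) {
    return synth_seconds / audio_seconds;
}

// Speed relative to real time is the reciprocal of the RTF.
double speedupOverRealTime(double rtf) {
    return 1.0 / rtf;
}
```

For instance, an RTF of 0.006 corresponds to about 167× real time, since 1 / 0.006 ≈ 166.7.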
+
+### Characters per Second
+| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+|--------|-----------------|----------------|-----------------|
+| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
+| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
+| **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
+| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
+| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
+| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
+| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
+| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
+| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |
+
+> **Notes:**  
+> `API` = Cloud-based API services (measured from Seoul)  
+> `Open` = Open-source models  
+> Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX  
+> Supertonic (RTX4090): Tested with PyTorch model  
+> Kokoro: Tested on M4 Pro CPU with ONNX  
+> NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
+
+### Real-time Factor
+
+| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+|--------|-----------------|----------------|-----------------|
+| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
+| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
+| **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
+| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
+| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
+| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
+| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
+| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
+| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
+
+<details>
+<summary><b>Additional Performance Data (5-step inference)</b></summary>
+
+<br>
+
+**Characters per Second (5-step)**
+
+| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+|--------|-----------------|----------------|-----------------|
+| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
+| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
+| **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |
+
+**Real-time Factor (5-step)**
+
+| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+|--------|-----------------|----------------|-----------------|
+| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
+| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
+| **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |
+
+</details>
+
+### Natural Text Handling
+
+Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
+
+> 🎧 **Browse the audio samples more easily**: Check out our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#text-handling) for easier access to all audio examples
+
+**Overview of Test Cases:**
+
+| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini |
+|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|
+| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ |
+| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ |
+| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ |
+| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ |
+
+<details>
+<summary><b>Example 1: Financial Expression</b></summary>
+
+<br>
+
+**Text:**
+> "The startup secured **$5.2M** in venture capital, a huge leap from their initial **$450K** seed round."
+
+**Challenges:**
+- Decimal point in currency ($5.2M should be read as "five point two million")
+- Abbreviated magnitude units (M for million, K for thousand)
+- Currency symbol ($) that needs to be properly pronounced as "dollars"
+
+**Audio Samples:**
+
+| System | Result | Audio Sample |
+|--------|--------|--------------|
+| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1eancUOhiSXCVoTu9ddh4S-OcVQaWrPV-/view?usp=sharing) |
+| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1-r2scv7XQ1crIDu6QOh3eqVl445W6ap_/view?usp=sharing) |
+| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1MFDXMjfmsAVOqwPx7iveS0KUJtZvcwxB/view?usp=sharing) |
+| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1dEHpNzfMUucFTJPQK0k4RcFZvPwQTt09/view?usp=sharing) |
+
+</details>
+
+<details>
+<summary><b>Example 2: Time and Date</b></summary>
+
+<br>
+
+**Text:**
+> "The train delay was announced at **4:45 PM** on **Wed, Apr 3, 2024** due to track maintenance."
+
+**Challenges:**
+- Time expression with PM notation (4:45 PM)
+- Abbreviated weekday (Wed)
+- Abbreviated month (Apr)
+- Full date format (Apr 3, 2024)
+
+**Audio Samples:**
+
+| System | Result | Audio Sample |
+|--------|--------|--------------|
+| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1ehkZU8eiizBenG2DgR5tzBGQBvHS0Uaj/view?usp=sharing) |
+| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ta3r6jFyebmA-sT44l8EaEQcMLVmuOEr/view?usp=sharing) |
+| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1sskmem9AzHAQ3Hv8DRSZoqX_pye-CXuU/view?usp=sharing) |
+| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1zx9X8oMsLMXW0Zx_SURoqjju-By2yh_n/view?usp=sharing) |
+
+</details>
+
+<details>
+<summary><b>Example 3: Phone Number</b></summary>
+
+<br>
+
+**Text:**
+> "You can reach the hotel front desk at **(212) 555-0142 ext. 402** anytime."
+
+**Challenges:**
+- Area code in parentheses that should be read as separate digits
+- Phone number with hyphen separator (555-0142)
+- Abbreviated extension notation (ext.)
+- Extension number (402)
+
+**Audio Samples:**
+
+| System | Result | Audio Sample |
+|--------|--------|--------------|
+| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1z-e5iTsihryMR8ll1-N1YXkB2CIJYJ6F/view?usp=sharing) |
+| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1HAzVXFTZfZm0VEK2laSpsMTxzufcuaxA/view?usp=sharing) |
+| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/15tjfAmb3GbjP_kmvD7zSdIWkhtAaCPOg/view?usp=sharing) |
+| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1BCL8n7yligUZyso970ud7Gf5NWb1OhKD/view?usp=sharing) |
+
+</details>
+
+<details>
+<summary><b>Example 4: Technical Unit</b></summary>
+
+<br>
+
+**Text:**
+> "Our drone battery lasts **2.3h** when flying at **30kph** with full camera payload."
+
+**Challenges:**
+- Decimal time duration with abbreviation (2.3h = two point three hours)
+- Speed unit with abbreviation (30kph = thirty kilometers per hour)
+- Technical abbreviations (h for hours, kph for kilometers per hour)
+- Technical/engineering context requiring proper pronunciation
+
+**Audio Samples:**
+
+| System | Result | Audio Sample |
+|--------|--------|--------------|
+| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1kvOBvswFkLfmr8hGplH0V2XiMxy1shYf/view?usp=sharing) |
+| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1_SzfjWJe5YEd0t3R7DztkYhHcI_av48p/view?usp=sharing) |
+| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1P5BSilj5xFPTV2Xz6yW5jitKZohO9o-6/view?usp=sharing) |
+| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1GU82SnWC50OvC8CZNjhxvNZFKQb7I9_Y/view?usp=sharing) |
+
+</details>
+
+> **Note:** These samples demonstrate how each system handles text normalization and pronunciation of complex expressions **without requiring pre-processing or phonetic annotations**.
+
+## Citation
+
+The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
+
+### SupertonicTTS: Main Architecture
+
+This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
+
+```bibtex
+@article{kim2025supertonic,
+  title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
+  author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
+  journal={arXiv preprint arXiv:2503.23108},
+  year={2025},
+  url={https://arxiv.org/abs/2503.23108}
+}
+```
+
+### Length-Aware RoPE: Text-Speech Alignment
+
+This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
+
+```bibtex
+@article{kim2025larope,
+  title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
+  author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
+  journal={arXiv preprint arXiv:2509.11084},
+  year={2025},
+  url={https://arxiv.org/abs/2509.11084}
+}
+```
+
+### Self-Purifying Flow Matching: Training with Noisy Labels
+
+This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
+
+```bibtex
+@article{kim2025spfm,
+  title={Training Flow Matching Models with Reliable Labels via Self-Purification},
+  author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
+  journal={arXiv preprint arXiv:2509.19091},
+  year={2025},
+  url={https://arxiv.org/abs/2509.19091}
+}
+```
+
+## License
+
+This project’s sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.
+
+The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic/blob/main/LICENSE) file for details.
+
+This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
+
+Copyright (c) 2025 Supertone Inc.
+
@@ -0,0 +1,122 @@
+cmake_minimum_required(VERSION 3.15)
+project(Supertonic_CPP)
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+
+# Default to a Release build when no build type is specified
+if(NOT CMAKE_BUILD_TYPE)
+    set(CMAKE_BUILD_TYPE Release)
+endif()
+
+# Add optimization flags
+set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
+set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
+
+# Find required packages
+find_package(PkgConfig REQUIRED)
+find_package(OpenMP)
+
+# ONNX Runtime - Try multiple methods
+# Method 1: Try to find via CMake config
+find_package(onnxruntime QUIET CONFIG)
+
+if(NOT onnxruntime_FOUND)
+    # Method 2: Try pkg-config
+    pkg_check_modules(ONNXRUNTIME QUIET libonnxruntime)
+    
+    if(ONNXRUNTIME_FOUND)
+        set(ONNXRUNTIME_INCLUDE_DIR ${ONNXRUNTIME_INCLUDE_DIRS})
+        set(ONNXRUNTIME_LIB ${ONNXRUNTIME_LIBRARIES})
+    else()
+        # Method 3: Manual search in common locations
+        find_path(ONNXRUNTIME_INCLUDE_DIR 
+            NAMES onnxruntime_cxx_api.h
+            PATHS
+                /usr/local/include
+                /opt/homebrew/include
+                /usr/include
+                ${CMAKE_PREFIX_PATH}/include
+            PATH_SUFFIXES onnxruntime
+        )
+        
+        find_library(ONNXRUNTIME_LIB
+            NAMES onnxruntime libonnxruntime
+            PATHS
+                /usr/local/lib
+                /opt/homebrew/lib
+                /usr/lib
+                ${CMAKE_PREFIX_PATH}/lib
+        )
+    endif()
+    
+    if(NOT ONNXRUNTIME_INCLUDE_DIR OR NOT ONNXRUNTIME_LIB)
+        message(FATAL_ERROR "ONNX Runtime not found. Please install it:\n"
+                            "  macOS: brew install onnxruntime\n"
+                            "  Ubuntu: See README.md for installation instructions")
+    endif()
+    
+    message(STATUS "Found ONNX Runtime:")
+    message(STATUS "  Include: ${ONNXRUNTIME_INCLUDE_DIR}")
+    message(STATUS "  Library: ${ONNXRUNTIME_LIB}")
+endif()
+
+# nlohmann/json
+find_package(nlohmann_json REQUIRED)
+
+# Include directories
+if(NOT onnxruntime_FOUND)
+    include_directories(${ONNXRUNTIME_INCLUDE_DIR})
+endif()
+
+# Helper library
+add_library(tts_helper STATIC
+    helper.cpp
+    helper.h
+)
+
+if(onnxruntime_FOUND)
+    target_link_libraries(tts_helper
+        onnxruntime::onnxruntime
+        nlohmann_json::nlohmann_json
+    )
+else()
+    target_include_directories(tts_helper PUBLIC ${ONNXRUNTIME_INCLUDE_DIR})
+    target_link_libraries(tts_helper
+        ${ONNXRUNTIME_LIB}
+        nlohmann_json::nlohmann_json
+    )
+endif()
+
+# Enable OpenMP if available
+if(OpenMP_CXX_FOUND)
+    target_link_libraries(tts_helper OpenMP::OpenMP_CXX)
+    message(STATUS "OpenMP enabled for parallel processing")
+else()
+    message(WARNING "OpenMP not found - parallel processing will be disabled")
+endif()
+
+# Example executable
+add_executable(example_onnx
+    example_onnx.cpp
+)
+
+if(onnxruntime_FOUND)
+    target_link_libraries(example_onnx
+        tts_helper
+        onnxruntime::onnxruntime
+        nlohmann_json::nlohmann_json
+    )
+else()
+    target_link_libraries(example_onnx
+        tts_helper
+        ${ONNXRUNTIME_LIB}
+        nlohmann_json::nlohmann_json
+    )
+endif()
+
+# Installation
+install(TARGETS example_onnx DESTINATION bin)
+install(TARGETS tts_helper DESTINATION lib)
+install(FILES helper.h DESTINATION include)
+
@@ -0,0 +1,101 @@
+# Supertonic C++ Implementation
+
+High-performance text-to-speech inference using ONNX Runtime.
+
+## Requirements
+
+- C++17 compiler, CMake 3.15+
+- Libraries: ONNX Runtime, nlohmann/json
+
+## Installation
+
+**Ubuntu/Debian:**
+> ⚠️ **Note:** Installation instructions not yet verified.
+
+```bash
+sudo apt-get install -y cmake g++ nlohmann-json3-dev
+wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-1.16.3.tgz
+tar -xzf onnxruntime-linux-x64-1.16.3.tgz
+sudo cp -r onnxruntime-linux-x64-1.16.3/include/* /usr/local/include/
+sudo cp -r onnxruntime-linux-x64-1.16.3/lib/* /usr/local/lib/
+sudo ldconfig
+```
+
+**macOS:**
+```bash
+brew install cmake nlohmann-json onnxruntime
+```
+
+**Windows (vcpkg):**
+> ⚠️ **Note:** Installation instructions not yet verified.
+
+```powershell
+vcpkg install nlohmann-json:x64-windows onnxruntime:x64-windows
+vcpkg integrate install
+```
+
+## Building
+
+```bash
+cd cpp && mkdir build && cd build
+cmake .. && cmake --build . --config Release
+./example_onnx
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+./example_onnx
+```
+
+This will use:
+- Voice style: `../assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+./example_onnx \
+  --voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice style (M1.json) for the first text
+- Use female voice style (F1.json) for the second text
+- Process both samples in a single batch
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+./example_onnx \
+  --total-step 10 \
+  --voice-style ../assets/voice_styles/M1.json \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
+| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
+| `--save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,109 @@
+#include "helper.h"
+#include <iostream>
+#include <filesystem>
+#include <algorithm>
+#include <string>
+#include <vector>
+
+namespace fs = std::filesystem;
+
+struct Args {
+    std::string onnx_dir = "../assets/onnx";
+    int total_step = 5;
+    int n_test = 4;
+    std::vector<std::string> voice_style = {"../assets/voice_styles/M1.json"};
+    std::vector<std::string> text = {
+        "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+    };
+    std::string save_dir = "results";
+};
+
+auto splitString = [](const std::string& str, char delim) {
+    std::vector<std::string> result;
+    size_t start = 0, pos;
+    while ((pos = str.find(delim, start)) != std::string::npos) {
+        result.push_back(str.substr(start, pos - start));
+        start = pos + 1;
+    }
+    result.push_back(str.substr(start));
+    return result;
+};
+
+Args parseArgs(int argc, char* argv[]) {
+    Args args;
+    for (int i = 1; i < argc; i++) {
+        std::string arg = argv[i];
+        if (arg == "--onnx-dir" && i + 1 < argc) args.onnx_dir = argv[++i];
+        else if (arg == "--total-step" && i + 1 < argc) args.total_step = std::stoi(argv[++i]);
+        else if (arg == "--n-test" && i + 1 < argc) args.n_test = std::stoi(argv[++i]);
+        else if (arg == "--voice-style" && i + 1 < argc) args.voice_style = splitString(argv[++i], ',');
+        else if (arg == "--text" && i + 1 < argc) args.text = splitString(argv[++i], '|');
+        else if (arg == "--save-dir" && i + 1 < argc) args.save_dir = argv[++i];
+    }
+    return args;
+}
+
+int main(int argc, char* argv[]) {
+    std::cout << "=== TTS Inference with ONNX Runtime (C++) ===\n\n";
+    
+    // --- 1. Parse arguments --- //
+    Args args = parseArgs(argc, argv);
+    int total_step = args.total_step;
+    int n_test = args.n_test;
+    std::string save_dir = args.save_dir;
+    std::vector<std::string> voice_style_paths = args.voice_style;
+    std::vector<std::string> text_list = args.text;
+    
+    if (voice_style_paths.size() != text_list.size()) {
+        std::cerr << "Error: Number of voice styles (" << voice_style_paths.size() 
+                  << ") must match number of texts (" << text_list.size() << ")\n";
+        return 1;
+    }
+    
+    int bsz = voice_style_paths.size();
+    
+    // --- 2. Load Text to Speech --- //
+    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "TTS");
+    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(
+        OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault
+    );
+    
+    auto text_to_speech = loadTextToSpeech(env, args.onnx_dir, false);
+    std::cout << std::endl;
+    
+    // --- 3. Load Voice Style --- //
+    auto style = loadVoiceStyle(voice_style_paths, true);
+    
+    // --- 4. Synthesize speech --- //
+    fs::create_directories(save_dir);
+    
+    for (int n = 0; n < n_test; n++) {
+        std::cout << "\n[" << (n + 1) << "/" << n_test << "] Starting synthesis...\n";
+        
+        auto result = timer("Generating speech from text", [&]() {
+            return text_to_speech->call(memory_info, text_list, style, total_step);
+        });
+        
+        int sample_rate = text_to_speech->getSampleRate();
+        int wav_shape_1 = result.wav.size() / bsz;
+        
+        for (int b = 0; b < bsz; b++) {
+            std::string fname = sanitizeFilename(text_list[b], 20) + "_" + std::to_string(n + 1) + ".wav";
+            int wav_len = std::min(wav_shape_1, static_cast<int>(sample_rate * result.duration[b]));
+            
+            std::vector<float> wav_out(
+                result.wav.begin() + b * wav_shape_1,
+                result.wav.begin() + b * wav_shape_1 + wav_len
+            );
+            
+            std::string output_path = save_dir + "/" + fname;
+            writeWavFile(output_path, wav_out, sample_rate);
+            std::cout << "Saved: " << output_path << "\n";
+        }
+        
+        clearTensorBuffers();
+    }
+    
+    std::cout << "\n=== Synthesis completed successfully! ===\n";
+    return 0;
+}
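
The inner loop of `main` above cuts each sample out of a flat batch buffer and trims it to its predicted duration. A minimal standalone sketch of that slicing is shown below; `sliceSample` is a hypothetical helper, and the clamp to the row length is a defensive assumption added here. It also assumes a non-negative duration and a non-zero batch size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy sample `b` out of a flat [batch, samples_per_item] buffer, keeping at most
// sample_rate * duration_seconds samples and never reading past the end of the row.
std::vector<float> sliceSample(const std::vector<float>& flat_wav, std::size_t batch_size,
                               std::size_t b, int sample_rate, float duration_seconds) {
    const std::size_t row_len = flat_wav.size() / batch_size;
    const std::size_t wanted = static_cast<std::size_t>(sample_rate * duration_seconds);
    const std::size_t keep = std::min(wanted, row_len);
    const auto first = flat_wav.begin() + static_cast<std::ptrdiff_t>(b * row_len);
    return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(keep));
}
```

Shorter samples in a batch are padded up to the longest one, which is why each row is trimmed by its own duration rather than copied whole.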
@@ -0,0 +1,714 @@
+#include "helper.h"
+#include <fstream>
+#include <iostream>
+#include <cmath>
+#include <algorithm>
+#include <random>
+#include <sstream>
+#include <nlohmann/json.hpp>
+
+using json = nlohmann::json;
+
+// Global tensor buffers for memory management
+static std::vector<std::vector<float>> g_tensor_buffers_float;
+static std::vector<std::vector<int64_t>> g_tensor_buffers_int64;
+
+void clearTensorBuffers() {
+    g_tensor_buffers_float.clear();
+    g_tensor_buffers_int64.clear();
+}
+
+// ============================================================================
+// UnicodeProcessor implementation
+// ============================================================================
+
+UnicodeProcessor::UnicodeProcessor(const std::string& unicode_indexer_json_path) {
+    indexer_ = loadJsonInt64(unicode_indexer_json_path);
+}
+
+std::string UnicodeProcessor::preprocessText(const std::string& text) {
+    // Placeholder: the C++ standard library has no built-in Unicode normalization
+    // (such as the NFKD form intended here), so the text is returned unchanged.
+    // TODO: add proper Unicode normalization
+    return text;
+}
+
+std::vector<uint16_t> UnicodeProcessor::textToUnicodeValues(const std::string& text) {
+    std::vector<uint16_t> unicode_values;
+    for (char c : text) {  // NOTE: iterates bytes, not code points; the two coincide only for ASCII (assuming UTF-8 input)
+        unicode_values.push_back(static_cast<uint16_t>(static_cast<unsigned char>(c)));
+    }
+    return unicode_values;
+}
+
+std::vector<std::vector<std::vector<float>>> UnicodeProcessor::getTextMask(
+    const std::vector<int64_t>& text_ids_lengths
+) {
+    return lengthToMask(text_ids_lengths);
+}
+
+void UnicodeProcessor::call(
+    const std::vector<std::string>& text_list,
+    std::vector<std::vector<int64_t>>& text_ids,
+    std::vector<std::vector<std::vector<float>>>& text_mask
+) {
+    std::vector<std::string> processed_texts;
+    for (const auto& text : text_list) {
+        processed_texts.push_back(preprocessText(text));
+    }
+    
+    std::vector<int64_t> text_ids_lengths;
+    for (const auto& text : processed_texts) {
+        text_ids_lengths.push_back(static_cast<int64_t>(text.length()));
+    }
+    
+    int64_t max_len = *std::max_element(text_ids_lengths.begin(), text_ids_lengths.end());
+    
+    text_ids.resize(text_list.size());
+    for (size_t i = 0; i < processed_texts.size(); i++) {
+        text_ids[i].resize(max_len, 0);
+        auto unicode_vals = textToUnicodeValues(processed_texts[i]);
+        for (size_t j = 0; j < unicode_vals.size(); j++) {
+            if (unicode_vals[j] < indexer_.size()) {
+                text_ids[i][j] = indexer_[unicode_vals[j]];
+            }
+        }
+    }
+    
+    text_mask = getTextMask(text_ids_lengths);
+}
+
+// ============================================================================
+// Style implementation
+// ============================================================================
+
+Style::Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
+             const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape)
+    : ttl_data_(ttl_data), ttl_shape_(ttl_shape), dp_data_(dp_data), dp_shape_(dp_shape) {}
+
+// ============================================================================
+// TextToSpeech implementation
+// ============================================================================
+
+TextToSpeech::TextToSpeech(
+    const Config& cfgs,
+    UnicodeProcessor* text_processor,
+    Ort::Session* dp_ort,
+    Ort::Session* text_enc_ort,
+    Ort::Session* vector_est_ort,
+    Ort::Session* vocoder_ort
+) : cfgs_(cfgs),
+    text_processor_(text_processor),
+    dp_ort_(dp_ort),
+    text_enc_ort_(text_enc_ort),
+    vector_est_ort_(vector_est_ort),
+    vocoder_ort_(vocoder_ort) {
+    
+    sample_rate_ = cfgs.ae.sample_rate;
+    base_chunk_size_ = cfgs.ae.base_chunk_size;
+    chunk_compress_factor_ = cfgs.ttl.chunk_compress_factor;
+    ldim_ = cfgs.ttl.latent_dim;
+}
+
+void TextToSpeech::sampleNoisyLatent(
+    const std::vector<float>& duration,
+    std::vector<std::vector<std::vector<float>>>& noisy_latent,
+    std::vector<std::vector<std::vector<float>>>& latent_mask
+) {
+    int bsz = static_cast<int>(duration.size());
+    
+    // Convert durations (seconds) to waveform lengths (samples)
+    std::vector<int64_t> wav_lengths;
+    for (float d : duration) {
+        wav_lengths.push_back(static_cast<int64_t>(d * sample_rate_));
+    }
+    int64_t wav_len_max = *std::max_element(wav_lengths.begin(), wav_lengths.end());
+    
+    // Ceil-divide in integer arithmetic so latent_len matches the mask from getLatentMask()
+    int chunk_size = base_chunk_size_ * chunk_compress_factor_;
+    int latent_len = static_cast<int>((wav_len_max + chunk_size - 1) / chunk_size);
+    int latent_dim = ldim_ * chunk_compress_factor_;
+    
+    // Generate random noise with normal distribution
+    std::random_device rd;
+    std::mt19937 gen(rd());
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+    
+    noisy_latent.resize(bsz);
+    for (int b = 0; b < bsz; b++) {
+        noisy_latent[b].resize(latent_dim);
+        for (int d = 0; d < latent_dim; d++) {
+            noisy_latent[b][d].resize(latent_len);
+            for (int t = 0; t < latent_len; t++) {
+                noisy_latent[b][d][t] = dist(gen);
+            }
+        }
+    }
+    
+    latent_mask = getLatentMask(wav_lengths, base_chunk_size_, chunk_compress_factor_);
+    
+    // Apply mask
+    for (int b = 0; b < bsz; b++) {
+        for (int d = 0; d < latent_dim; d++) {
+            for (size_t t = 0; t < noisy_latent[b][d].size(); t++) {
+                noisy_latent[b][d][t] *= latent_mask[b][0][t];
+            }
+        }
+    }
+}
+
+TextToSpeech::SynthesisResult TextToSpeech::call(
+    Ort::MemoryInfo& memory_info,
+    const std::vector<std::string>& text_list,
+    const Style& style,
+    int total_step
+) {
+    int bsz = text_list.size();
+    
+    if (bsz != style.getTtlShape()[0]) {
+        throw std::runtime_error("Number of texts must match number of style vectors");
+    }
+    
+    // Process text
+    std::vector<std::vector<int64_t>> text_ids;
+    std::vector<std::vector<std::vector<float>>> text_mask;
+    text_processor_->call(text_list, text_ids, text_mask);
+    
+    std::vector<int64_t> text_ids_shape = {bsz, static_cast<int64_t>(text_ids[0].size())};
+    std::vector<int64_t> text_mask_shape = {bsz, 1, static_cast<int64_t>(text_mask[0][0].size())};
+    
+    auto text_ids_tensor = intArrayToTensor(memory_info, text_ids, text_ids_shape);
+    auto text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
+    
+    // Create style tensors
+    auto style_ttl_tensor = Ort::Value::CreateTensor<float>(
+        memory_info,
+        const_cast<float*>(style.getTtlData().data()),
+        style.getTtlData().size(),
+        style.getTtlShape().data(),
+        style.getTtlShape().size()
+    );
+    
+    auto style_dp_tensor = Ort::Value::CreateTensor<float>(
+        memory_info,
+        const_cast<float*>(style.getDpData().data()),
+        style.getDpData().size(),
+        style.getDpShape().data(),
+        style.getDpShape().size()
+    );
+    
+    // Run duration predictor
+    const char* dp_input_names[] = {"text_ids", "style_dp", "text_mask"};
+    const char* dp_output_names[] = {"duration"};
+    std::vector<Ort::Value> dp_inputs;
+    dp_inputs.push_back(std::move(text_ids_tensor));
+    dp_inputs.push_back(std::move(style_dp_tensor));
+    dp_inputs.push_back(std::move(text_mask_tensor));
+    
+    auto dp_outputs = dp_ort_->Run(
+        Ort::RunOptions{nullptr},
+        dp_input_names, dp_inputs.data(), dp_inputs.size(),
+        dp_output_names, 1
+    );
+    
+    auto* dur_data = dp_outputs[0].GetTensorMutableData<float>();
+    std::vector<float> duration(dur_data, dur_data + bsz);
+    
+    // Create new tensors for text encoder (previous ones were moved)
+    text_ids_tensor = intArrayToTensor(memory_info, text_ids, text_ids_shape);
+    text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
+    style_ttl_tensor = Ort::Value::CreateTensor<float>(
+        memory_info,
+        const_cast<float*>(style.getTtlData().data()),
+        style.getTtlData().size(),
+        style.getTtlShape().data(),
+        style.getTtlShape().size()
+    );
+    
+    // Run text encoder
+    const char* text_enc_input_names[] = {"text_ids", "style_ttl", "text_mask"};
+    const char* text_enc_output_names[] = {"text_emb"};
+    std::vector<Ort::Value> text_enc_inputs;
+    text_enc_inputs.push_back(std::move(text_ids_tensor));
+    text_enc_inputs.push_back(std::move(style_ttl_tensor));
+    text_enc_inputs.push_back(std::move(text_mask_tensor));
+    
+    auto text_enc_outputs = text_enc_ort_->Run(
+        Ort::RunOptions{nullptr},
+        text_enc_input_names, text_enc_inputs.data(), text_enc_inputs.size(),
+        text_enc_output_names, 1
+    );
+    
+    // Sample noisy latent
+    std::vector<std::vector<std::vector<float>>> xt, latent_mask;
+    sampleNoisyLatent(duration, xt, latent_mask);
+    
+    std::vector<int64_t> latent_shape = {
+        bsz,
+        static_cast<int64_t>(xt[0].size()),
+        static_cast<int64_t>(xt[0][0].size())
+    };
+    std::vector<int64_t> latent_mask_shape = {
+        bsz, 1,
+        static_cast<int64_t>(latent_mask[0][0].size())
+    };
+    
+    // Per-batch total step count; a fresh tensor is created from it in each denoising iteration
+    std::vector<float> total_step_vec(bsz, static_cast<float>(total_step));
+    
+    // Store text_emb data to reuse across iterations
+    auto text_emb_info = text_enc_outputs[0].GetTensorTypeAndShapeInfo();
+    size_t text_emb_size = text_emb_info.GetElementCount();
+    auto* text_emb_data = text_enc_outputs[0].GetTensorMutableData<float>();
+    std::vector<float> text_emb_vec(text_emb_data, text_emb_data + text_emb_size);
+    auto text_emb_shape = text_emb_info.GetShape();
+    
+    // Iterative denoising
+    for (int step = 0; step < total_step; step++) {
+        std::vector<float> current_step_vec(bsz, static_cast<float>(step));
+        
+        text_mask_tensor = arrayToTensor(memory_info, text_mask, text_mask_shape);
+        auto latent_mask_tensor = arrayToTensor(memory_info, latent_mask, latent_mask_shape);
+        auto noisy_latent_tensor = arrayToTensor(memory_info, xt, latent_shape);
+        style_ttl_tensor = Ort::Value::CreateTensor<float>(
+            memory_info,
+            const_cast<float*>(style.getTtlData().data()),
+            style.getTtlData().size(),
+            style.getTtlShape().data(),
+            style.getTtlShape().size()
+        );
+        
+        auto text_emb_tensor = Ort::Value::CreateTensor<float>(
+            memory_info,
+            text_emb_vec.data(),
+            text_emb_vec.size(),
+            text_emb_shape.data(),
+            text_emb_shape.size()
+        );
+        
+        auto current_step_tensor = Ort::Value::CreateTensor<float>(
+            memory_info,
+            current_step_vec.data(),
+            current_step_vec.size(),
+            std::vector<int64_t>{bsz}.data(),
+            1
+        );
+        
+        const char* vector_est_input_names[] = {
+            "noisy_latent", "text_emb", "style_ttl", "text_mask", "latent_mask", "total_step", "current_step"
+        };
+        const char* vector_est_output_names[] = {"denoised_latent"};
+        
+        std::vector<Ort::Value> vector_est_inputs;
+        vector_est_inputs.push_back(std::move(noisy_latent_tensor));
+        vector_est_inputs.push_back(std::move(text_emb_tensor));
+        vector_est_inputs.push_back(std::move(style_ttl_tensor));
+        vector_est_inputs.push_back(std::move(text_mask_tensor));
+        vector_est_inputs.push_back(std::move(latent_mask_tensor));
+        
+        // Create a new total_step tensor for each iteration
+        auto total_step_tensor_iter = Ort::Value::CreateTensor<float>(
+            memory_info,
+            total_step_vec.data(),
+            total_step_vec.size(),
+            std::vector<int64_t>{bsz}.data(),
+            1
+        );
+        vector_est_inputs.push_back(std::move(total_step_tensor_iter));
+        vector_est_inputs.push_back(std::move(current_step_tensor));
+        
+        auto vector_est_outputs = vector_est_ort_->Run(
+            Ort::RunOptions{nullptr},
+            vector_est_input_names, vector_est_inputs.data(), vector_est_inputs.size(),
+            vector_est_output_names, 1
+        );
+        
+        // Update xt with denoised output
+        auto* denoised_data = vector_est_outputs[0].GetTensorMutableData<float>();
+        size_t idx = 0;
+        for (int b = 0; b < bsz; b++) {
+            for (size_t d = 0; d < xt[b].size(); d++) {
+                for (size_t t = 0; t < xt[b][d].size(); t++) {
+                    xt[b][d][t] = denoised_data[idx++];
+                }
+            }
+        }
+    }
+    
+    // Run vocoder
+    auto latent_tensor = arrayToTensor(memory_info, xt, latent_shape);
+    const char* vocoder_input_names[] = {"latent"};
+    const char* vocoder_output_names[] = {"wav_tts"};
+    std::vector<Ort::Value> vocoder_inputs;
+    vocoder_inputs.push_back(std::move(latent_tensor));
+    
+    auto vocoder_outputs = vocoder_ort_->Run(
+        Ort::RunOptions{nullptr},
+        vocoder_input_names, vocoder_inputs.data(), vocoder_inputs.size(),
+        vocoder_output_names, 1
+    );
+    
+    auto wav_info = vocoder_outputs[0].GetTensorTypeAndShapeInfo();
+    size_t wav_size = wav_info.GetElementCount();
+    auto* wav_data = vocoder_outputs[0].GetTensorMutableData<float>();
+    
+    SynthesisResult result;
+    result.wav.assign(wav_data, wav_data + wav_size);
+    result.duration = duration;
+    
+    return result;
+}
+
+// ============================================================================
+// Utility functions
+// ============================================================================
+
+std::vector<std::vector<std::vector<float>>> lengthToMask(
+    const std::vector<int64_t>& lengths, int max_len
+) {
+    if (max_len == -1) {
+        max_len = *std::max_element(lengths.begin(), lengths.end());
+    }
+    
+    std::vector<std::vector<std::vector<float>>> mask;
+    for (auto len : lengths) {
+        std::vector<std::vector<float>> batch_mask(1);
+        batch_mask[0].resize(max_len);
+        for (int i = 0; i < max_len; i++) {
+            batch_mask[0][i] = (i < len) ? 1.0f : 0.0f;
+        }
+        mask.push_back(batch_mask);
+    }
+    return mask;
+}
+
+std::vector<std::vector<std::vector<float>>> getLatentMask(
+    const std::vector<int64_t>& wav_lengths,
+    int base_chunk_size,
+    int chunk_compress_factor
+) {
+    int latent_size = base_chunk_size * chunk_compress_factor;
+    std::vector<int64_t> latent_lengths;
+    for (auto len : wav_lengths) {
+        latent_lengths.push_back((len + latent_size - 1) / latent_size);
+    }
+    return lengthToMask(latent_lengths);
+}
+
+// ============================================================================
+// ONNX model loading
+// ============================================================================
+
+std::unique_ptr<Ort::Session> loadOnnx(
+    Ort::Env& env,
+    const std::string& onnx_path,
+    const Ort::SessionOptions& opts
+) {
+    return std::make_unique<Ort::Session>(env, onnx_path.c_str(), opts);
+}
+
+OnnxModels loadOnnxAll(
+    Ort::Env& env,
+    const std::string& onnx_dir,
+    const Ort::SessionOptions& opts
+) {
+    OnnxModels models;
+    models.dp = loadOnnx(env, onnx_dir + "/duration_predictor.onnx", opts);
+    models.text_enc = loadOnnx(env, onnx_dir + "/text_encoder.onnx", opts);
+    models.vector_est = loadOnnx(env, onnx_dir + "/vector_estimator.onnx", opts);
+    models.vocoder = loadOnnx(env, onnx_dir + "/vocoder.onnx", opts);
+    return models;
+}
+
+// ============================================================================
+// Configuration and processor loading
+// ============================================================================
+
+Config loadCfgs(const std::string& onnx_dir) {
+    std::string cfg_path = onnx_dir + "/tts.json";
+    std::ifstream file(cfg_path);
+    if (!file.is_open()) {
+        throw std::runtime_error("Failed to open config file: " + cfg_path);
+    }
+    
+    json j;
+    file >> j;
+    
+    Config cfg;
+    cfg.ae.sample_rate = j["ae"]["sample_rate"];
+    cfg.ae.base_chunk_size = j["ae"]["base_chunk_size"];
+    cfg.ttl.chunk_compress_factor = j["ttl"]["chunk_compress_factor"];
+    cfg.ttl.latent_dim = j["ttl"]["latent_dim"];
+    
+    return cfg;
+}
+
+std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir) {
+    std::string unicode_indexer_path = onnx_dir + "/unicode_indexer.json";
+    return std::make_unique<UnicodeProcessor>(unicode_indexer_path);
+}
+
+// ============================================================================
+// Voice style loading
+// ============================================================================
+
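+Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose) {
+    if (voice_style_paths.empty()) {
+        throw std::runtime_error("At least one voice style path is required");
+    }
+    int bsz = static_cast<int>(voice_style_paths.size());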
+    
+    // Read first file to get dimensions
+    std::ifstream first_file(voice_style_paths[0]);
+    if (!first_file.is_open()) {
+        throw std::runtime_error("Failed to open voice style file: " + voice_style_paths[0]);
+    }
+    json first_json;
+    first_file >> first_json;
+    
+    auto ttl_dims = first_json["style_ttl"]["dims"].get<std::vector<int64_t>>();
+    auto dp_dims = first_json["style_dp"]["dims"].get<std::vector<int64_t>>();
+    
+    int64_t ttl_dim1 = ttl_dims[1];
+    int64_t ttl_dim2 = ttl_dims[2];
+    int64_t dp_dim1 = dp_dims[1];
+    int64_t dp_dim2 = dp_dims[2];
+    
+    // Pre-allocate arrays with full batch size
+    size_t ttl_size = bsz * ttl_dim1 * ttl_dim2;
+    size_t dp_size = bsz * dp_dim1 * dp_dim2;
+    std::vector<float> ttl_flat(ttl_size);
+    std::vector<float> dp_flat(dp_size);
+    
+    // Fill in the data
+    for (int i = 0; i < bsz; i++) {
+        std::ifstream file(voice_style_paths[i]);
+        if (!file.is_open()) {
+            throw std::runtime_error("Failed to open voice style file: " + voice_style_paths[i]);
+        }
+        
+        json j;
+        file >> j;
+        
+        // Flatten data
+        auto ttl_data_nested = j["style_ttl"]["data"].get<std::vector<std::vector<std::vector<float>>>>();
+        std::vector<float> ttl_data;
+        for (const auto& batch : ttl_data_nested) {
+            for (const auto& row : batch) {
+                ttl_data.insert(ttl_data.end(), row.begin(), row.end());
+            }
+        }
+        
+        auto dp_data_nested = j["style_dp"]["data"].get<std::vector<std::vector<std::vector<float>>>>();
+        std::vector<float> dp_data;
+        for (const auto& batch : dp_data_nested) {
+            for (const auto& row : batch) {
+                dp_data.insert(dp_data.end(), row.begin(), row.end());
+            }
+        }
+        
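+        // Every file must match the dimensions read from the first file,
+        // otherwise the copies below would write past the pre-allocated arrays
+        if (ttl_data.size() != static_cast<size_t>(ttl_dim1 * ttl_dim2) ||
+            dp_data.size() != static_cast<size_t>(dp_dim1 * dp_dim2)) {
+            throw std::runtime_error("Voice style dimension mismatch: " + voice_style_paths[i]);
+        }
+        
+        // Copy to pre-allocated array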
+        size_t ttl_offset = i * ttl_dim1 * ttl_dim2;
+        std::copy(ttl_data.begin(), ttl_data.end(), ttl_flat.begin() + ttl_offset);
+        
+        size_t dp_offset = i * dp_dim1 * dp_dim2;
+        std::copy(dp_data.begin(), dp_data.end(), dp_flat.begin() + dp_offset);
+    }
+    
+    std::vector<int64_t> ttl_shape = {bsz, ttl_dim1, ttl_dim2};
+    std::vector<int64_t> dp_shape = {bsz, dp_dim1, dp_dim2};
+    
+    if (verbose) {
+        std::cout << "Loaded " << bsz << " voice styles" << std::endl;
+    }
+    
+    return Style(ttl_flat, ttl_shape, dp_flat, dp_shape);
+}
+
+// ============================================================================
+// TextToSpeech loading
+// ============================================================================
+
+std::unique_ptr<TextToSpeech> loadTextToSpeech(
+    Ort::Env& env,
+    const std::string& onnx_dir,
+    bool use_gpu
+) {
+    Ort::SessionOptions opts;
+    if (use_gpu) {
+        throw std::runtime_error("GPU mode is not supported yet");
+    } else {
+        std::cout << "Using CPU for inference" << std::endl;
+    }
+    
+    auto cfgs = loadCfgs(onnx_dir);
+    auto models = loadOnnxAll(env, onnx_dir, opts);
+    auto text_processor = loadTextProcessor(onnx_dir);
+    
+    // TextToSpeech stores non-owning raw pointers; ownership stays with the statics below
+    auto tts = std::make_unique<TextToSpeech>(
+        cfgs,
+        text_processor.get(),
+        models.dp.get(),
+        models.text_enc.get(),
+        models.vector_est.get(),
+        models.vocoder.get()
+    );
+    
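+    // Keep the models and processor alive in function-local statics.
+    // NOTE: calling loadTextToSpeech() again replaces these objects and leaves any
+    // previously returned TextToSpeech with dangling pointers.
+    // (In production, you'd want better lifetime management)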
+    static OnnxModels static_models;
+    static std::unique_ptr<UnicodeProcessor> static_text_processor;
+    static_models = std::move(models);
+    static_text_processor = std::move(text_processor);
+    
+    return tts;
+}
+
+// ============================================================================
+// WAV file writing
+// ============================================================================
+
+void writeWavFile(
+    const std::string& filename,
+    const std::vector<float>& audio_data,
+    int sample_rate
+) {
+    std::ofstream file(filename, std::ios::binary);
+    if (!file.is_open()) {
+        throw std::runtime_error("Failed to open file for writing: " + filename);
+    }
+    
+    // Header fields are written as raw bytes, so this assumes a little-endian host
+    int num_channels = 1;
+    int bits_per_sample = 16;
+    int32_t byte_rate = sample_rate * num_channels * bits_per_sample / 8;
+    int block_align = num_channels * bits_per_sample / 8;
+    int32_t data_size = static_cast<int32_t>(audio_data.size() * bits_per_sample / 8);
+    
+    // RIFF header
+    file.write("RIFF", 4);
+    int32_t chunk_size = 36 + data_size;
+    file.write(reinterpret_cast<char*>(&chunk_size), 4);
+    file.write("WAVE", 4);
+    
+    // fmt chunk
+    file.write("fmt ", 4);
+    int32_t fmt_chunk_size = 16;
+    file.write(reinterpret_cast<char*>(&fmt_chunk_size), 4);
+    int16_t audio_format = 1; // PCM
+    file.write(reinterpret_cast<char*>(&audio_format), 2);
+    int16_t num_channels_16 = num_channels;
+    file.write(reinterpret_cast<char*>(&num_channels_16), 2);
+    file.write(reinterpret_cast<char*>(&sample_rate), 4);
+    file.write(reinterpret_cast<char*>(&byte_rate), 4);
+    int16_t block_align_16 = block_align;
+    file.write(reinterpret_cast<char*>(&block_align_16), 2);
+    int16_t bits_per_sample_16 = bits_per_sample;
+    file.write(reinterpret_cast<char*>(&bits_per_sample_16), 2);
+    
+    // data chunk
+    file.write("data", 4);
+    file.write(reinterpret_cast<char*>(&data_size), 4);
+    
+    // Write audio data
+    for (float sample : audio_data) {
+        float clamped = std::max(-1.0f, std::min(1.0f, sample));
+        int16_t int_sample = static_cast<int16_t>(clamped * 32767);
+        file.write(reinterpret_cast<char*>(&int_sample), 2);
+    }
+}
+
+// ============================================================================
+// Tensor conversion utilities
+// ============================================================================
+
+Ort::Value arrayToTensor(
+    Ort::MemoryInfo& memory_info,
+    const std::vector<std::vector<std::vector<float>>>& array,
+    const std::vector<int64_t>& dims
+) {
+    // Flatten the array
+    std::vector<float> flat;
+    for (const auto& batch : array) {
+        for (const auto& row : batch) {
+            for (float val : row) {
+                flat.push_back(val);
+            }
+        }
+    }
+    
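+    // The tensor does not own its data, so park the buffer in a global list to keep it
+    // alive; the list grows with every call until it is cleared (see clearTensorBuffers())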
+    g_tensor_buffers_float.push_back(std::move(flat));
+    auto& buffer = g_tensor_buffers_float.back();
+    
+    return Ort::Value::CreateTensor<float>(
+        memory_info,
+        buffer.data(),
+        buffer.size(),
+        dims.data(),
+        dims.size()
+    );
+}
+
+Ort::Value intArrayToTensor(
+    Ort::MemoryInfo& memory_info,
+    const std::vector<std::vector<int64_t>>& array,
+    const std::vector<int64_t>& dims
+) {
+    // Flatten the array
+    std::vector<int64_t> flat;
+    for (const auto& row : array) {
+        for (int64_t val : row) {
+            flat.push_back(val);
+        }
+    }
+    
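+    // The tensor does not own its data, so park the buffer in a global list to keep it
+    // alive; the list grows with every call until it is cleared (see clearTensorBuffers())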
+    g_tensor_buffers_int64.push_back(std::move(flat));
+    auto& buffer = g_tensor_buffers_int64.back();
+    
+    return Ort::Value::CreateTensor<int64_t>(
+        memory_info,
+        buffer.data(),
+        buffer.size(),
+        dims.data(),
+        dims.size()
+    );
+}
+
+// ============================================================================
+// JSON loading helpers
+// ============================================================================
+
+std::vector<int64_t> loadJsonInt64(const std::string& file_path) {
+    std::ifstream file(file_path);
+    if (!file.is_open()) {
+        throw std::runtime_error("Failed to open file: " + file_path);
+    }
+    
+    json j;
+    file >> j;
+    
+    return j.get<std::vector<int64_t>>();
+}
+
+// ============================================================================
+// Sanitize filename
+// ============================================================================
+
+std::string sanitizeFilename(const std::string& text, int max_len) {
+    std::string result;
+    int count = 0;
+    for (char c : text) {
+        if (count >= max_len) break;
+        if (std::isalnum(static_cast<unsigned char>(c))) {
+            result += c;
+        } else {
+            result += '_';
+        }
+        count++;
+    }
+    return result;
+}
@@ -0,0 +1,202 @@
+#pragma once
+
+#include <cstdint>
+#include <string>
+#include <vector>
+#include <memory>
+#include <iostream>
+#include <iomanip>
+#include <chrono>
+#include <onnxruntime_cxx_api.h>
+
+/**
+ * Configuration structure
+ */
+struct Config {
+    struct AEConfig {
+        int sample_rate;
+        int base_chunk_size;
+    } ae;
+    
+    struct TTLConfig {
+        int chunk_compress_factor;
+        int latent_dim;
+    } ttl;
+};
+
+/**
+ * Unicode text processor
+ */
+class UnicodeProcessor {
+public:
+    explicit UnicodeProcessor(const std::string& unicode_indexer_json_path);
+
+    // Process text list to text IDs and mask
+    void call(
+        const std::vector<std::string>& text_list,
+        std::vector<std::vector<int64_t>>& text_ids,
+        std::vector<std::vector<std::vector<float>>>& text_mask
+    );
+
+private:
+    std::vector<int64_t> indexer_;
+    
+    std::string preprocessText(const std::string& text);
+    std::vector<uint16_t> textToUnicodeValues(const std::string& text);
+    std::vector<std::vector<std::vector<float>>> getTextMask(
+        const std::vector<int64_t>& text_ids_lengths
+    );
+};
+
+/**
+ * Voice style: flattened style_ttl and style_dp tensors together with their shapes
+ */
+class Style {
+public:
+    Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
+          const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape);
+    
+    const std::vector<float>& getTtlData() const { return ttl_data_; }
+    const std::vector<float>& getDpData() const { return dp_data_; }
+    const std::vector<int64_t>& getTtlShape() const { return ttl_shape_; }
+    const std::vector<int64_t>& getDpShape() const { return dp_shape_; }
+
+private:
+    std::vector<float> ttl_data_;
+    std::vector<float> dp_data_;
+    std::vector<int64_t> ttl_shape_;
+    std::vector<int64_t> dp_shape_;
+};
+
+/**
+ * Text-to-speech pipeline: duration predictor, text encoder,
+ * vector estimator (iterative denoising), and vocoder
+ */
+class TextToSpeech {
+public:
+    // All pointers are non-owning: the processor and sessions must outlive this object
+    TextToSpeech(
+        const Config& cfgs,
+        UnicodeProcessor* text_processor,
+        Ort::Session* dp_ort,
+        Ort::Session* text_enc_ort,
+        Ort::Session* vector_est_ort,
+        Ort::Session* vocoder_ort
+    );
+    
+    struct SynthesisResult {
+        std::vector<float> wav;
+        std::vector<float> duration;
+    };
+    
+    SynthesisResult call(
+        Ort::MemoryInfo& memory_info,
+        const std::vector<std::string>& text_list,
+        const Style& style,
+        int total_step
+    );
+    
+    int getSampleRate() const { return sample_rate_; }
+
+private:
+    Config cfgs_;
+    UnicodeProcessor* text_processor_;
+    Ort::Session* dp_ort_;
+    Ort::Session* text_enc_ort_;
+    Ort::Session* vector_est_ort_;
+    Ort::Session* vocoder_ort_;
+    int sample_rate_;
+    int base_chunk_size_;
+    int chunk_compress_factor_;
+    int ldim_;
+    
+    void sampleNoisyLatent(
+        const std::vector<float>& duration,
+        std::vector<std::vector<std::vector<float>>>& noisy_latent,
+        std::vector<std::vector<std::vector<float>>>& latent_mask
+    );
+};
+
+// Utility functions
+std::vector<std::vector<std::vector<float>>> lengthToMask(
+    const std::vector<int64_t>& lengths, int max_len = -1
+);
+
+std::vector<std::vector<std::vector<float>>> getLatentMask(
+    const std::vector<int64_t>& wav_lengths,
+    int base_chunk_size,
+    int chunk_compress_factor
+);
+
+// ONNX model loading
+struct OnnxModels {
+    std::unique_ptr<Ort::Session> dp;
+    std::unique_ptr<Ort::Session> text_enc;
+    std::unique_ptr<Ort::Session> vector_est;
+    std::unique_ptr<Ort::Session> vocoder;
+};
+
+std::unique_ptr<Ort::Session> loadOnnx(
+    Ort::Env& env,
+    const std::string& onnx_path,
+    const Ort::SessionOptions& opts
+);
+
+OnnxModels loadOnnxAll(
+    Ort::Env& env,
+    const std::string& onnx_dir,
+    const Ort::SessionOptions& opts
+);
+
+// Configuration and processor loading
+Config loadCfgs(const std::string& onnx_dir);
+
+std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir);
+
+// Voice style loading
+Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose = false);
+
+// TextToSpeech loading
+std::unique_ptr<TextToSpeech> loadTextToSpeech(
+    Ort::Env& env,
+    const std::string& onnx_dir,
+    bool use_gpu = false
+);
+
+// WAV file writing
+void writeWavFile(
+    const std::string& filename,
+    const std::vector<float>& audio_data,
+    int sample_rate
+);
+
+// Tensor conversion utilities
+void clearTensorBuffers();
+
+Ort::Value arrayToTensor(
+    Ort::MemoryInfo& memory_info,
+    const std::vector<std::vector<std::vector<float>>>& array,
+    const std::vector<int64_t>& dims
+);
+
+Ort::Value intArrayToTensor(
+    Ort::MemoryInfo& memory_info,
+    const std::vector<std::vector<int64_t>>& array,
+    const std::vector<int64_t>& dims
+);
+
+// JSON loading helpers
+std::vector<int64_t> loadJsonInt64(const std::string& file_path);
+
+// Timer utility (func must return a value; void callables are not supported)
+template<typename Func>
+auto timer(const std::string& name, Func&& func) -> decltype(func()) {
+    auto start = std::chrono::high_resolution_clock::now();
+    std::cout << name << "..." << std::endl;
+    auto result = func();
+    auto end = std::chrono::high_resolution_clock::now();
+    std::chrono::duration<double> elapsed = end - start;
+    std::cout << "  -> " << name << " completed in " 
+              << std::fixed << std::setprecision(2) << elapsed.count() << " sec" << std::endl;
+    return result;
+}
+
+// Sanitize filename
+std::string sanitizeFilename(const std::string& text, int max_len);
@@ -0,0 +1,41 @@
+# Build results
+bin/
+obj/
+[Dd]ebug/
+[Rr]elease/
+x64/
+x86/
+[Aa]rm/
+[Aa]rm64/
+bld/
+[Bb]in/
+[Oo]bj/
+[Ll]og/
+
+# Visual Studio files
+.vs/
+*.suo
+*.user
+*.userosscache
+*.sln.docstates
+*.userprefs
+
+# Rider
+.idea/
+*.sln.iml
+
+# User-specific files
+*.rsuser
+*.suo
+*.user
+*.userosscache
+*.sln.docstates
+
+# Output directory
+results/*.wav
+
+# OS files
+.DS_Store
+Thumbs.db
+
+
@@ -0,0 +1,118 @@
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Linq;
+
+namespace Supertonic
+{
+    class Program
+    {
+        class Args
+        {
+            public bool UseGpu { get; set; } = false;
+            public string OnnxDir { get; set; } = "assets/onnx";
+            public int TotalStep { get; set; } = 5;
+            public int NTest { get; set; } = 4;
+            public List<string> VoiceStyle { get; set; } = new List<string> { "assets/voice_styles/M1.json" };
+            public List<string> Text { get; set; } = new List<string> 
+            { 
+                "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen." 
+            };
+            public string SaveDir { get; set; } = "results";
+        }
+
+        static Args ParseArgs(string[] args)
+        {
+            var result = new Args();
+            
+            for (int i = 0; i < args.Length; i++)
+            {
+                switch (args[i])
+                {
+                    case "--use-gpu":
+                        result.UseGpu = true;
+                        break;
+                    case "--onnx-dir" when i + 1 < args.Length:
+                        result.OnnxDir = args[++i];
+                        break;
+                    case "--total-step" when i + 1 < args.Length:
+                        result.TotalStep = int.Parse(args[++i]);
+                        break;
+                    case "--n-test" when i + 1 < args.Length:
+                        result.NTest = int.Parse(args[++i]);
+                        break;
+                    case "--voice-style" when i + 1 < args.Length:
+                        result.VoiceStyle = args[++i].Split(',').ToList();
+                        break;
+                    case "--text" when i + 1 < args.Length:
+                        result.Text = args[++i].Split('|').ToList();
+                        break;
+                    case "--save-dir" when i + 1 < args.Length:
+                        result.SaveDir = args[++i];
+                        break;
+                }
+            }
+            
+            return result;
+        }
+
+        static void Main(string[] args)
+        {
+            Console.WriteLine("=== TTS Inference with ONNX Runtime (C#) ===\n");
+
+            // --- 1. Parse arguments --- //
+            var parsedArgs = ParseArgs(args);
+            int totalStep = parsedArgs.TotalStep;
+            int nTest = parsedArgs.NTest;
+            string saveDir = parsedArgs.SaveDir;
+            var voiceStylePaths = parsedArgs.VoiceStyle;
+            var textList = parsedArgs.Text;
+
+            if (voiceStylePaths.Count != textList.Count)
+            {
+                throw new ArgumentException(
+                    $"Number of voice styles ({voiceStylePaths.Count}) must match number of texts ({textList.Count})");
+            }
+
+            int bsz = voiceStylePaths.Count;
+
+            // --- 2. Load Text to Speech --- //
+            var textToSpeech = Helper.LoadTextToSpeech(parsedArgs.OnnxDir, parsedArgs.UseGpu);
+            Console.WriteLine();
+
+            // --- 3. Load Voice Style --- //
+            var style = Helper.LoadVoiceStyle(voiceStylePaths, verbose: true);
+
+            // --- 4. Synthesize speech --- //
+            for (int n = 0; n < nTest; n++)
+            {
+                Console.WriteLine($"\n[{n + 1}/{nTest}] Starting synthesis...");
+                
+                var (wav, duration) = Helper.Timer("Generating speech from text", () => 
+                    textToSpeech.Call(textList, style, totalStep)
+                );
+
+                if (!Directory.Exists(saveDir))
+                {
+                    Directory.CreateDirectory(saveDir);
+                }
+
+                for (int b = 0; b < bsz; b++)
+                {
+                    string fname = $"{Helper.SanitizeFilename(textList[b], 20)}_{n + 1}.wav";
+                    
+                    int wavLen = Math.Min((int)(textToSpeech.SampleRate * duration[b]), wav.Length / bsz);
+                    var wavOut = new float[wavLen];
+                    Array.Copy(wav, b * (wav.Length / bsz), wavOut, 0, wavLen);
+
+                    string outputPath = Path.Combine(saveDir, fname);
+                    Helper.WriteWavFile(outputPath, wavOut, textToSpeech.SampleRate);
+                    Console.WriteLine($"Saved: {outputPath}");
+                }
+            }
+
+            Console.WriteLine("\n=== Synthesis completed successfully! ===");
+        }
+    }
+}
+
@@ -0,0 +1,612 @@
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Linq;
+using System.Text;
+using System.Text.Json;
+using Microsoft.ML.OnnxRuntime;
+using Microsoft.ML.OnnxRuntime.Tensors;
+
+namespace Supertonic
+{
+    // ============================================================================
+    // Configuration classes
+    // ============================================================================
+
+    public class Config
+    {
+        public AEConfig AE { get; set; } = null!;
+        public TTLConfig TTL { get; set; } = null!;
+
+        public class AEConfig
+        {
+            public int SampleRate { get; set; }
+            public int BaseChunkSize { get; set; }
+        }
+
+        public class TTLConfig
+        {
+            public int ChunkCompressFactor { get; set; }
+            public int LatentDim { get; set; }
+        }
+    }
+
+    // ============================================================================
+    // Style class
+    // ============================================================================
+
+    public class Style
+    {
+        public float[] Ttl { get; set; }
+        public long[] TtlShape { get; set; }
+        public float[] Dp { get; set; }
+        public long[] DpShape { get; set; }
+
+        public Style(float[] ttl, long[] ttlShape, float[] dp, long[] dpShape)
+        {
+            Ttl = ttl;
+            TtlShape = ttlShape;
+            Dp = dp;
+            DpShape = dpShape;
+        }
+    }
+
+    // ============================================================================
+    // Unicode text processor
+    // ============================================================================
+
+    public class UnicodeProcessor
+    {
+        private readonly Dictionary<int, long> _indexer;
+
+        public UnicodeProcessor(string unicodeIndexerPath)
+        {
+            var json = File.ReadAllText(unicodeIndexerPath);
+            var indexerArray = JsonSerializer.Deserialize<long[]>(json) ?? throw new InvalidDataException($"Failed to parse Unicode indexer: {unicodeIndexerPath}");
+            _indexer = new Dictionary<int, long>();
+            for (int i = 0; i < indexerArray.Length; i++)
+            {
+                _indexer[i] = indexerArray[i];
+            }
+        }
+
+        private string PreprocessText(string text)
+        {
+            // NFKD compatibility decomposition via the built-in string.Normalize
+            return text.Normalize(NormalizationForm.FormKD);
+        }
+
+        private int[] TextToUnicodeValues(string text)
+        {
+            return text.Select(c => (int)c).ToArray();
+        }
+
+        private float[][][] GetTextMask(long[] textIdsLengths)
+        {
+            return Helper.LengthToMask(textIdsLengths);
+        }
+
+        public (long[][] textIds, float[][][] textMask) Call(List<string> textList)
+        {
+            var processedTexts = textList.Select(t => PreprocessText(t)).ToList();
+            var textIdsLengths = processedTexts.Select(t => (long)t.Length).ToArray();
+            long maxLen = textIdsLengths.Max();
+
+            var textIds = new long[textList.Count][];
+            for (int i = 0; i < processedTexts.Count; i++)
+            {
+                textIds[i] = new long[maxLen];
+                var unicodeVals = TextToUnicodeValues(processedTexts[i]);
+                for (int j = 0; j < unicodeVals.Length; j++)
+                {
+                    if (_indexer.TryGetValue(unicodeVals[j], out long val))
+                    {
+                        textIds[i][j] = val;
+                    }
+                }
+            }
+
+            var textMask = GetTextMask(textIdsLengths);
+            return (textIds, textMask);
+        }
+    }
+
+    // ============================================================================
+    // TextToSpeech class
+    // ============================================================================
+
+    public class TextToSpeech
+    {
+        private readonly Config _cfgs;
+        private readonly UnicodeProcessor _textProcessor;
+        private readonly InferenceSession _dpOrt;
+        private readonly InferenceSession _textEncOrt;
+        private readonly InferenceSession _vectorEstOrt;
+        private readonly InferenceSession _vocoderOrt;
+        public readonly int SampleRate;
+        private readonly int _baseChunkSize;
+        private readonly int _chunkCompressFactor;
+        private readonly int _ldim;
+
+        public TextToSpeech(
+            Config cfgs,
+            UnicodeProcessor textProcessor,
+            InferenceSession dpOrt,
+            InferenceSession textEncOrt,
+            InferenceSession vectorEstOrt,
+            InferenceSession vocoderOrt)
+        {
+            _cfgs = cfgs;
+            _textProcessor = textProcessor;
+            _dpOrt = dpOrt;
+            _textEncOrt = textEncOrt;
+            _vectorEstOrt = vectorEstOrt;
+            _vocoderOrt = vocoderOrt;
+            SampleRate = cfgs.AE.SampleRate;
+            _baseChunkSize = cfgs.AE.BaseChunkSize;
+            _chunkCompressFactor = cfgs.TTL.ChunkCompressFactor;
+            _ldim = cfgs.TTL.LatentDim;
+        }
+
+        private (float[][][] noisyLatent, float[][][] latentMask) SampleNoisyLatent(float[] duration)
+        {
+            int bsz = duration.Length;
+            var wavLengths = duration.Select(d => (long)(d * SampleRate)).ToArray();
+            long wavLenMax = wavLengths.Max();
+            int chunkSize = _baseChunkSize * _chunkCompressFactor;
+            int latentLen = (int)((wavLenMax + chunkSize - 1) / chunkSize); // integer ceil, same as GetLatentMask
+            int latentDim = _ldim * _chunkCompressFactor;
+
+            // Generate random noise
+            var random = new Random();
+            var noisyLatent = new float[bsz][][];
+            for (int b = 0; b < bsz; b++)
+            {
+                noisyLatent[b] = new float[latentDim][];
+                for (int d = 0; d < latentDim; d++)
+                {
+                    noisyLatent[b][d] = new float[latentLen];
+                    for (int t = 0; t < latentLen; t++)
+                    {
+                        // Box-Muller transform for normal distribution
+                        double u1 = 1.0 - random.NextDouble();
+                        double u2 = 1.0 - random.NextDouble();
+                        noisyLatent[b][d][t] = (float)(Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2));
+                    }
+                }
+            }
+
+            var latentMask = Helper.GetLatentMask(wavLengths, _baseChunkSize, _chunkCompressFactor);
+
+            // Apply mask
+            for (int b = 0; b < bsz; b++)
+            {
+                for (int d = 0; d < latentDim; d++)
+                {
+                    for (int t = 0; t < latentLen; t++)
+                    {
+                        noisyLatent[b][d][t] *= latentMask[b][0][t];
+                    }
+                }
+            }
+
+            return (noisyLatent, latentMask);
+        }
+
+        public (float[] wav, float[] duration) Call(List<string> textList, Style style, int totalStep)
+        {
+            int bsz = textList.Count;
+            if (bsz != style.TtlShape[0])
+            {
+                throw new ArgumentException("Number of texts must match number of style vectors");
+            }
+
+            // Process text
+            var (textIds, textMask) = _textProcessor.Call(textList);
+            var textIdsShape = new long[] { bsz, textIds[0].Length };
+            var textMaskShape = new long[] { bsz, 1, textMask[0][0].Length };
+
+            var textIdsTensor = Helper.IntArrayToTensor(textIds, textIdsShape);
+            var textMaskTensor = Helper.ArrayToTensor(textMask, textMaskShape);
+
+            var styleTtlTensor = new DenseTensor<float>(style.Ttl, style.TtlShape.Select(x => (int)x).ToArray());
+            var styleDpTensor = new DenseTensor<float>(style.Dp, style.DpShape.Select(x => (int)x).ToArray());
+
+            // Run duration predictor
+            var dpInputs = new List<NamedOnnxValue>
+            {
+                NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
+                NamedOnnxValue.CreateFromTensor("style_dp", styleDpTensor),
+                NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
+            };
+            using var dpOutputs = _dpOrt.Run(dpInputs);
+            var durOnnx = dpOutputs.First(o => o.Name == "duration").AsTensor<float>().ToArray();
+
+            // Run text encoder
+            var textEncInputs = new List<NamedOnnxValue>
+            {
+                NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
+                NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
+                NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
+            };
+            using var textEncOutputs = _textEncOrt.Run(textEncInputs);
+            var textEmbTensor = textEncOutputs.First(o => o.Name == "text_emb").AsTensor<float>();
+
+            // Sample noisy latent
+            var (xt, latentMask) = SampleNoisyLatent(durOnnx);
+            var latentShape = new long[] { bsz, xt[0].Length, xt[0][0].Length };
+            var latentMaskShape = new long[] { bsz, 1, latentMask[0][0].Length };
+
+            var totalStepArray = Enumerable.Repeat((float)totalStep, bsz).ToArray();
+
+            // Iterative denoising
+            for (int step = 0; step < totalStep; step++)
+            {
+                var currentStepArray = Enumerable.Repeat((float)step, bsz).ToArray();
+
+                var vectorEstInputs = new List<NamedOnnxValue>
+                {
+                    NamedOnnxValue.CreateFromTensor("noisy_latent", Helper.ArrayToTensor(xt, latentShape)),
+                    NamedOnnxValue.CreateFromTensor("text_emb", textEmbTensor),
+                    NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
+                    NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor),
+                    NamedOnnxValue.CreateFromTensor("latent_mask", Helper.ArrayToTensor(latentMask, latentMaskShape)),
+                    NamedOnnxValue.CreateFromTensor("total_step", new DenseTensor<float>(totalStepArray, new int[] { bsz })),
+                    NamedOnnxValue.CreateFromTensor("current_step", new DenseTensor<float>(currentStepArray, new int[] { bsz }))
+                };
+
+                using var vectorEstOutputs = _vectorEstOrt.Run(vectorEstInputs);
+                var denoisedLatent = vectorEstOutputs.First(o => o.Name == "denoised_latent").AsTensor<float>();
+
+                // Update xt
+                int idx = 0;
+                for (int b = 0; b < bsz; b++)
+                {
+                    for (int d = 0; d < xt[b].Length; d++)
+                    {
+                        for (int t = 0; t < xt[b][d].Length; t++)
+                        {
+                            xt[b][d][t] = denoisedLatent.GetValue(idx++);
+                        }
+                    }
+                }
+            }
+
+            // Run vocoder
+            var vocoderInputs = new List<NamedOnnxValue>
+            {
+                NamedOnnxValue.CreateFromTensor("latent", Helper.ArrayToTensor(xt, latentShape))
+            };
+            using var vocoderOutputs = _vocoderOrt.Run(vocoderInputs);
+            var wavTensor = vocoderOutputs.First(o => o.Name == "wav_tts").AsTensor<float>();
+
+            return (wavTensor.ToArray(), durOnnx);
+        }
+    }
+
+    // ============================================================================
+    // Helper class with utility functions
+    // ============================================================================
+
+    public static class Helper
+    {
+        // ============================================================================
+        // Utility functions
+        // ============================================================================
+
+        public static float[][][] LengthToMask(long[] lengths, long maxLen = -1)
+        {
+            if (maxLen == -1)
+            {
+                maxLen = lengths.Max();
+            }
+
+            var mask = new float[lengths.Length][][];
+            for (int i = 0; i < lengths.Length; i++)
+            {
+                mask[i] = new float[1][];
+                mask[i][0] = new float[maxLen];
+                for (int j = 0; j < maxLen; j++)
+                {
+                    mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
+                }
+            }
+            return mask;
+        }
+
+        public static float[][][] GetLatentMask(long[] wavLengths, int baseChunkSize, int chunkCompressFactor)
+        {
+            int latentSize = baseChunkSize * chunkCompressFactor;
+            var latentLengths = wavLengths.Select(len => (len + latentSize - 1) / latentSize).ToArray();
+            return LengthToMask(latentLengths);
+        }
+
+        // ============================================================================
+        // ONNX model loading
+        // ============================================================================
+
+        public static InferenceSession LoadOnnx(string onnxPath, SessionOptions opts)
+        {
+            return new InferenceSession(onnxPath, opts);
+        }
+
+        public static (InferenceSession dp, InferenceSession textEnc, InferenceSession vectorEst, InferenceSession vocoder) 
+            LoadOnnxAll(string onnxDir, SessionOptions opts)
+        {
+            var dpPath = Path.Combine(onnxDir, "duration_predictor.onnx");
+            var textEncPath = Path.Combine(onnxDir, "text_encoder.onnx");
+            var vectorEstPath = Path.Combine(onnxDir, "vector_estimator.onnx");
+            var vocoderPath = Path.Combine(onnxDir, "vocoder.onnx");
+
+            return (
+                LoadOnnx(dpPath, opts),
+                LoadOnnx(textEncPath, opts),
+                LoadOnnx(vectorEstPath, opts),
+                LoadOnnx(vocoderPath, opts)
+            );
+        }
+
+        // ============================================================================
+        // Configuration loading
+        // ============================================================================
+
+        public static Config LoadCfgs(string onnxDir)
+        {
+            var cfgPath = Path.Combine(onnxDir, "tts.json");
+            var json = File.ReadAllText(cfgPath);
+            
+            using var doc = JsonDocument.Parse(json);
+            var root = doc.RootElement;
+            
+            return new Config
+            {
+                AE = new Config.AEConfig
+                {
+                    SampleRate = root.GetProperty("ae").GetProperty("sample_rate").GetInt32(),
+                    BaseChunkSize = root.GetProperty("ae").GetProperty("base_chunk_size").GetInt32()
+                },
+                TTL = new Config.TTLConfig
+                {
+                    ChunkCompressFactor = root.GetProperty("ttl").GetProperty("chunk_compress_factor").GetInt32(),
+                    LatentDim = root.GetProperty("ttl").GetProperty("latent_dim").GetInt32()
+                }
+            };
+        }
+
+        public static UnicodeProcessor LoadTextProcessor(string onnxDir)
+        {
+            var unicodeIndexerPath = Path.Combine(onnxDir, "unicode_indexer.json");
+            return new UnicodeProcessor(unicodeIndexerPath);
+        }
+
+        // ============================================================================
+        // Voice style loading
+        // ============================================================================
+
+        public static Style LoadVoiceStyle(List<string> voiceStylePaths, bool verbose = false)
+        {
+            int bsz = voiceStylePaths.Count;
+            
+            // Read first file to get dimensions
+            var firstJson = File.ReadAllText(voiceStylePaths[0]);
+            using var firstDoc = JsonDocument.Parse(firstJson);
+            var firstRoot = firstDoc.RootElement;
+            
+            var ttlDims = ParseInt64Array(firstRoot.GetProperty("style_ttl").GetProperty("dims"));
+            var dpDims = ParseInt64Array(firstRoot.GetProperty("style_dp").GetProperty("dims"));
+            
+            long ttlDim1 = ttlDims[1];
+            long ttlDim2 = ttlDims[2];
+            long dpDim1 = dpDims[1];
+            long dpDim2 = dpDims[2];
+            
+            // Pre-allocate arrays with full batch size
+            int ttlSize = (int)(bsz * ttlDim1 * ttlDim2);
+            int dpSize = (int)(bsz * dpDim1 * dpDim2);
+            var ttlFlat = new float[ttlSize];
+            var dpFlat = new float[dpSize];
+            
+            // Fill in the data
+            for (int i = 0; i < bsz; i++)
+            {
+                var json = File.ReadAllText(voiceStylePaths[i]);
+                using var doc = JsonDocument.Parse(json);
+                var root = doc.RootElement;
+
+                // Flatten data
+                var ttlData3D = ParseFloat3DArray(root.GetProperty("style_ttl").GetProperty("data"));
+                var ttlDataFlat = new List<float>();
+                foreach (var batch in ttlData3D)
+                {
+                    foreach (var row in batch)
+                    {
+                        ttlDataFlat.AddRange(row);
+                    }
+                }
+
+                var dpData3D = ParseFloat3DArray(root.GetProperty("style_dp").GetProperty("data"));
+                var dpDataFlat = new List<float>();
+                foreach (var batch in dpData3D)
+                {
+                    foreach (var row in batch)
+                    {
+                        dpDataFlat.AddRange(row);
+                    }
+                }
+
+                // Copy to pre-allocated array
+                int ttlOffset = (int)(i * ttlDim1 * ttlDim2);
+                ttlDataFlat.CopyTo(ttlFlat, ttlOffset);
+                
+                int dpOffset = (int)(i * dpDim1 * dpDim2);
+                dpDataFlat.CopyTo(dpFlat, dpOffset);
+            }
+
+            var ttlShape = new long[] { bsz, ttlDim1, ttlDim2 };
+            var dpShape = new long[] { bsz, dpDim1, dpDim2 };
+
+            if (verbose)
+            {
+                Console.WriteLine($"Loaded {bsz} voice styles");
+            }
+
+            return new Style(ttlFlat, ttlShape, dpFlat, dpShape);
+        }
+
+        private static float[][][] ParseFloat3DArray(JsonElement element)
+        {
+            var result = new List<float[][]>();
+            foreach (var batch in element.EnumerateArray())
+            {
+                var batch2D = new List<float[]>();
+                foreach (var row in batch.EnumerateArray())
+                {
+                    var rowData = new List<float>();
+                    foreach (var val in row.EnumerateArray())
+                    {
+                        rowData.Add(val.GetSingle());
+                    }
+                    batch2D.Add(rowData.ToArray());
+                }
+                result.Add(batch2D.ToArray());
+            }
+            return result.ToArray();
+        }
+
+        private static long[] ParseInt64Array(JsonElement element)
+        {
+            var result = new List<long>();
+            foreach (var val in element.EnumerateArray())
+            {
+                result.Add(val.GetInt64());
+            }
+            return result.ToArray();
+        }
+
+        // ============================================================================
+        // TextToSpeech loading
+        // ============================================================================
+
+        public static TextToSpeech LoadTextToSpeech(string onnxDir, bool useGpu = false)
+        {
+            var opts = new SessionOptions();
+            if (useGpu)
+            {
+                throw new NotImplementedException("GPU mode is not supported yet");
+            }
+            else
+            {
+                Console.WriteLine("Using CPU for inference");
+            }
+
+            var cfgs = LoadCfgs(onnxDir);
+            var (dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) = LoadOnnxAll(onnxDir, opts);
+            var textProcessor = LoadTextProcessor(onnxDir);
+
+            return new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
+        }
+
+        // ============================================================================
+        // WAV file writing
+        // ============================================================================
+
+        public static void WriteWavFile(string filename, float[] audioData, int sampleRate)
+        {
+            using var writer = new BinaryWriter(File.Open(filename, FileMode.Create));
+
+            int numChannels = 1;
+            int bitsPerSample = 16;
+            int byteRate = sampleRate * numChannels * bitsPerSample / 8;
+            short blockAlign = (short)(numChannels * bitsPerSample / 8);
+            int dataSize = audioData.Length * bitsPerSample / 8;
+
+            // RIFF header
+            writer.Write(Encoding.ASCII.GetBytes("RIFF"));
+            writer.Write(36 + dataSize);
+            writer.Write(Encoding.ASCII.GetBytes("WAVE"));
+
+            // fmt chunk
+            writer.Write(Encoding.ASCII.GetBytes("fmt "));
+            writer.Write(16); // fmt chunk size
+            writer.Write((short)1); // audio format (PCM)
+            writer.Write((short)numChannels);
+            writer.Write(sampleRate);
+            writer.Write(byteRate);
+            writer.Write(blockAlign);
+            writer.Write((short)bitsPerSample);
+
+            // data chunk
+            writer.Write(Encoding.ASCII.GetBytes("data"));
+            writer.Write(dataSize);
+
+            // Write audio data
+            foreach (var sample in audioData)
+            {
+                float clamped = Math.Max(-1.0f, Math.Min(1.0f, sample));
+                short intSample = (short)(clamped * 32767);
+                writer.Write(intSample);
+            }
+        }
+
+        // ============================================================================
+        // Tensor conversion utilities
+        // ============================================================================
+
+        public static DenseTensor<float> ArrayToTensor(float[][][] array, long[] dims)
+        {
+            var flat = new List<float>();
+            foreach (var batch in array)
+            {
+                foreach (var row in batch)
+                {
+                    flat.AddRange(row);
+                }
+            }
+            return new DenseTensor<float>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
+        }
+
+        public static DenseTensor<long> IntArrayToTensor(long[][] array, long[] dims)
+        {
+            var flat = new List<long>();
+            foreach (var row in array)
+            {
+                flat.AddRange(row);
+            }
+            return new DenseTensor<long>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
+        }
+
+        // ============================================================================
+        // Timer utility
+        // ============================================================================
+
+        public static T Timer<T>(string name, Func<T> func)
+        {
+            var stopwatch = System.Diagnostics.Stopwatch.StartNew();
+            Console.WriteLine($"{name}...");
+            var result = func();
+            var elapsed = stopwatch.Elapsed.TotalSeconds;
+            Console.WriteLine($"  -> {name} completed in {elapsed:F2} sec");
+            return result;
+        }
+
+        public static string SanitizeFilename(string text, int maxLen)
+        {
+            var result = new StringBuilder();
+            int count = 0;
+            foreach (char c in text)
+            {
+                if (count >= maxLen) break;
+                if (char.IsLetterOrDigit(c))
+                {
+                    result.Append(c);
+                }
+                else
+                {
+                    result.Append('_');
+                }
+                count++;
+            }
+            return result.ToString();
+        }
+    }
+}
@@ -0,0 +1,99 @@
+# TTS ONNX Inference Examples
+
+This guide provides examples for running TTS inference using `ExampleONNX.cs`.
+
+## Installation
+
+### Prerequisites
+- .NET 9.0 SDK or later ([download the .NET SDK](https://dotnet.microsoft.com/download))
+
+### Install dependencies
+```bash
+dotnet restore
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+dotnet run
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+dotnet run -- \
+  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice style (M1.json) for the first text
+- Use female voice style (F1.json) for the second text
+- Process both samples in a single batch
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+dotnet run -- \
+  --total-step 10 \
+  --voice-style assets/voice_styles/M1.json \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher-quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--use-gpu` | flag | false | Use GPU for inference (not supported yet) |
+| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
+| `--text` | str+ | (long default text) | Text(s) to synthesize (pipe-separated: `|`) |
+| `--save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
+
+## Building the Project
+
+### Build for Release
+```bash
+dotnet build -c Release
+```
+
+### Run the compiled executable
+```bash
+./bin/Release/net9.0/Supertonic
+```
+
+## Project Structure
+
+```
+csharp/
+├── ExampleONNX.cs        # Main inference script
+├── Helper.cs             # Helper functions and classes
+├── Supertonic.csproj     # Project configuration
+├── README.md             # This file
+└── results/              # Output directory (created automatically)
+```
+
+
@@ -0,0 +1,17 @@
+<Project Sdk="Microsoft.NET.Sdk">
+
+  <PropertyGroup>
+    <OutputType>Exe</OutputType>
+    <TargetFramework>net9.0</TargetFramework>
+    <LangVersion>13.0</LangVersion>
+    <Nullable>enable</Nullable>
+  </PropertyGroup>
+
+  <ItemGroup>
+    <PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.20.1" />
+    <PackageReference Include="System.Text.Json" Version="9.0.1" />
+  </ItemGroup>
+
+</Project>
+
+
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,17 @@
+# Binaries
+tts_example
+example_onnx
+*.exe
+
+# Go build artifacts
+*.o
+*.a
+*.so
+
+# Results
+results/
+
+# Go workspace
+go.work
+go.work.sum
+
@@ -0,0 +1,128 @@
+# TTS ONNX Inference Examples
+
+This guide provides examples for running TTS inference using `example_onnx.go`.
+
+## Installation
+
+This project uses Go modules for dependency management.
+
+### Prerequisites
+
+1. Install Go 1.21 or later from [https://golang.org/dl/](https://golang.org/dl/)
+2. Install ONNX Runtime C library:
+
+**macOS (via Homebrew):**
+```bash
+brew install onnxruntime
+```
+
+**Linux:**
+```bash
+# Download ONNX Runtime from GitHub releases
+wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.0/onnxruntime-linux-x64-1.16.0.tgz
+tar -xzf onnxruntime-linux-x64-1.16.0.tgz
+sudo cp onnxruntime-linux-x64-1.16.0/lib/* /usr/local/lib/
+sudo cp -r onnxruntime-linux-x64-1.16.0/include/* /usr/local/include/
+sudo ldconfig
+```
+
+### Install Go dependencies
+
+```bash
+go mod download
+```
+
+### Configure ONNX Runtime Library Path (Optional)
+
+If the ONNX Runtime library is not in a standard location, set the `ONNXRUNTIME_LIB_PATH` environment variable:
+
+**Automatic Detection (Recommended):**
+
+```bash
+# macOS
+export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
+
+# Linux
+export ONNXRUNTIME_LIB_PATH=$(find /usr/local/lib /usr/lib -name "libonnxruntime.so*" 2>/dev/null | head -n 1)
+```
+
+**Manual Configuration:**
+
+```bash
+export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.so  # Linux
+# or
+export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib  # macOS
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+go run example_onnx.go helper.go
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+go run example_onnx.go helper.go \
+  -voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
+  -text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice (M1.json) for the first text
+- Use female voice (F1.json) for the second text
+- Process both samples in a single batch
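
The pairing rule for batch input (comma-separated voice styles, pipe-separated texts, equal counts) can be sketched as follows. This is an illustrative sketch only: `splitTrim`, `pair`, and `pairInputs` are hypothetical names that are not part of the example program.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTrim splits s on sep and trims whitespace around each item.
func splitTrim(s, sep string) []string {
	if s == "" {
		return nil
	}
	parts := strings.Split(s, sep)
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// pair couples one voice style file with the text it should speak.
type pair struct{ Style, Text string }

// pairInputs matches the i-th voice style with the i-th text and
// rejects inputs whose counts differ.
func pairInputs(voiceStyles, texts string) ([]pair, error) {
	styles := splitTrim(voiceStyles, ",")
	lines := splitTrim(texts, "|")
	if len(styles) != len(lines) {
		return nil, fmt.Errorf("number of voice styles (%d) must match number of texts (%d)",
			len(styles), len(lines))
	}
	pairs := make([]pair, len(styles))
	for i := range styles {
		pairs[i] = pair{Style: styles[i], Text: lines[i]}
	}
	return pairs, nil
}

func main() {
	pairs, err := pairInputs(
		"assets/voice_styles/M1.json,assets/voice_styles/F1.json",
		"First sentence.|Second sentence.",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, p := range pairs {
		fmt.Printf("sample %d: %s -> %q\n", i, p.Style, p.Text)
	}
}
```

Because `|` separates texts, a text that itself contains a pipe character would be split into two entries under this scheme.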
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+go run example_onnx.go helper.go \
+  -total-step 10 \
+  -voice-style "assets/voice_styles/M1.json" \
+  -text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `-use-gpu` | flag | false | Use GPU for inference (not supported yet; setting it returns an error) |
+| `-onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `-total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `-n-test` | int | 4 | Number of times to generate each sample |
+| `-voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
+| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
+| `-save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of `-voice-style` files must match the number of `-text` entries
+- **Quality vs Speed**: Higher `-total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
+
+## Building a Binary
+
+To build a standalone executable:
+```bash
+go build -o tts_example example_onnx.go helper.go
+```
+
+Then run it:
+```bash
+./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
+```
+
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,144 @@
+package main
+
+import (
+	"flag"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strings"
+
+	ort "github.com/yalue/onnxruntime_go"
+)
+
+// Args holds command line arguments
+type Args struct {
+	useGPU     bool
+	onnxDir    string
+	totalStep  int
+	nTest      int
+	voiceStyle []string
+	text       []string
+	saveDir    string
+}
+
+func parseArgs() *Args {
+	args := &Args{}
+
+	flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
+	flag.StringVar(&args.onnxDir, "onnx-dir", "assets/onnx", "Path to ONNX model directory")
+	flag.IntVar(&args.totalStep, "total-step", 5, "Number of denoising steps")
+	flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
+	flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")
+
+	var voiceStyleStr, textStr string
+	flag.StringVar(&voiceStyleStr, "voice-style", "assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
+	flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")
+
+	flag.Parse()
+
+	// Parse comma-separated voice-style
+	if voiceStyleStr != "" {
+		args.voiceStyle = strings.Split(voiceStyleStr, ",")
+		for i := range args.voiceStyle {
+			args.voiceStyle[i] = strings.TrimSpace(args.voiceStyle[i])
+		}
+	}
+
+	// Parse pipe-separated text
+	if textStr != "" {
+		args.text = strings.Split(textStr, "|")
+		for i := range args.text {
+			args.text[i] = strings.TrimSpace(args.text[i])
+		}
+	}
+
+	return args
+}
+
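+func main() {
+	// Println already appends a newline; print the blank line separately
+	// so that go vet does not flag a redundant "\n".
+	fmt.Println("=== TTS Inference with ONNX Runtime (Go) ===")
+	fmt.Println()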
+
+	// --- 1. Parse arguments --- //
+	args := parseArgs()
+	totalStep := args.totalStep
+	nTest := args.nTest
+	saveDir := args.saveDir
+	voiceStylePaths := args.voiceStyle
+	textList := args.text
+
+	if len(voiceStylePaths) != len(textList) {
+		fmt.Printf("Error: Number of voice styles (%d) must match number of texts (%d)\n",
+			len(voiceStylePaths), len(textList))
+		os.Exit(1)
+	}
+
+	bsz := len(voiceStylePaths)
+
+	// Initialize ONNX Runtime
+	if err := InitializeONNXRuntime(); err != nil {
+		fmt.Printf("Error initializing ONNX Runtime: %v\n", err)
+		os.Exit(1)
+	}
+	defer ort.DestroyEnvironment()
+
+	// --- 2. Load config --- //
+	cfg, err := LoadCfgs(args.onnxDir)
+	if err != nil {
+		fmt.Printf("Error loading config: %v\n", err)
+		os.Exit(1)
+	}
+
+	// --- 3. Load TTS components --- //
+	textToSpeech, err := LoadTextToSpeech(args.onnxDir, args.useGPU, cfg)
+	if err != nil {
+		fmt.Printf("Error loading TTS components: %v\n", err)
+		os.Exit(1)
+	}
+	defer textToSpeech.Destroy()
+
+	// --- 4. Load voice styles --- //
+	style, err := LoadVoiceStyle(voiceStylePaths, true)
+	if err != nil {
+		fmt.Printf("Error loading voice styles: %v\n", err)
+		os.Exit(1)
+	}
+	defer style.Destroy()
+
+	// --- 5. Synthesize speech --- //
+	if err := os.MkdirAll(saveDir, 0755); err != nil {
+		fmt.Printf("Error creating save directory: %v\n", err)
+		os.Exit(1)
+	}
+
+	for n := 0; n < nTest; n++ {
+		fmt.Printf("\n[%d/%d] Starting synthesis...\n", n+1, nTest)
+
+		var wav []float32
+		var duration []float32
+		Timer("Generating speech from text", func() interface{} {
+			w, d, err := textToSpeech.Call(textList, style, totalStep)
+			if err != nil {
+				fmt.Printf("Error generating speech: %v\n", err)
+				os.Exit(1)
+			}
+			wav = w
+			duration = d
+			return nil
+		})
+
+		// Save outputs
+		for i := 0; i < bsz; i++ {
+			fname := fmt.Sprintf("%s_%d.wav", sanitizeFilename(textList[i], 20), n+1)
+			wavOut := extractWavSegment(wav, duration[i], textToSpeech.SampleRate, i, bsz)
+			
+			outputPath := filepath.Join(saveDir, fname)
+			if err := writeWavFile(outputPath, wavOut, textToSpeech.SampleRate); err != nil {
+				fmt.Printf("Error writing wav file: %v\n", err)
+				continue
+			}
+			fmt.Printf("Saved: %s\n", outputPath)
+		}
+	}
+
+	fmt.Println("\n=== Synthesis completed successfully! ===")
+}
@@ -0,0 +1,12 @@
+module supertonic-tts
+
+go 1.21
+
+require (
+	github.com/go-audio/audio v1.0.0
+	github.com/go-audio/wav v1.1.0
+	github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12
+	github.com/yalue/onnxruntime_go v1.11.0
+)
+
+require github.com/go-audio/riff v1.0.0 // indirect
@@ -0,0 +1,10 @@
+github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
+github.com/go-audio/audio v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
+github.com/go-audio/riff v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
+github.com/go-audio/riff v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
+github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
+github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
+github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12 h1:dd7vnTDfjtwCETZDrRe+GPYNLA1jBtbZeyfyE8eZCyk=
+github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12/go.mod h1:i/KKcxEWEO8Yyl11DYafRPKOPVYTrhxiTRigjtEEXZU=
+github.com/yalue/onnxruntime_go v1.11.0 h1:aKH4yPIbqfcB3SfnQWq/WxzLelkyolntHnffL3eMBHY=
+github.com/yalue/onnxruntime_go v1.11.0/go.mod h1:b4X26A8pekNb1ACJ58wAXgNKeUCGEAQ9dmACut9Sm/4=
@@ -0,0 +1,734 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"math"
+	"math/rand"
+	"os"
+	"path/filepath"
+	"time"
+
+	"github.com/go-audio/audio"
+	"github.com/go-audio/wav"
+	ort "github.com/yalue/onnxruntime_go"
+)
+
+// Config structures
+type SpecProcessorConfig struct {
+	NFFT      int     `json:"n_fft"`
+	WinLength int     `json:"win_length"`
+	HopLength int     `json:"hop_length"`
+	NMels     int     `json:"n_mels"`
+	Eps       float64 `json:"eps"`
+	NormMean  float64 `json:"norm_mean"`
+	NormStd   float64 `json:"norm_std"`
+}
+
+type EncoderConfig struct {
+	SpecProcessor SpecProcessorConfig `json:"spec_processor"`
+}
+
+type AEConfig struct {
+	SampleRate    int           `json:"sample_rate"`
+	BaseChunkSize int           `json:"base_chunk_size"`
+	Encoder       EncoderConfig `json:"encoder"`
+}
+
+type StyleTokenLayerConfig struct {
+	NStyle        int `json:"n_style"`
+	StyleValueDim int `json:"style_value_dim"`
+}
+
+type StyleEncoderConfig struct {
+	StyleTokenLayer StyleTokenLayerConfig `json:"style_token_layer"`
+}
+
+type ProjOutConfig struct {
+	Idim int `json:"idim"`
+	Odim int `json:"odim"`
+}
+
+type TextEncoderConfig struct {
+	ProjOut ProjOutConfig `json:"proj_out"`
+}
+
+type TTLConfig struct {
+	ChunkCompressFactor int                `json:"chunk_compress_factor"`
+	LatentDim           int                `json:"latent_dim"`
+	StyleEncoder        StyleEncoderConfig `json:"style_encoder"`
+	TextEncoder         TextEncoderConfig  `json:"text_encoder"`
+}
+
+type DPStyleEncoderConfig struct {
+	StyleTokenLayer StyleTokenLayerConfig `json:"style_token_layer"`
+}
+
+type DPConfig struct {
+	LatentDim           int                  `json:"latent_dim"`
+	ChunkCompressFactor int                  `json:"chunk_compress_factor"`
+	StyleEncoder        DPStyleEncoderConfig `json:"style_encoder"`
+}
+
+type Config struct {
+	AE  AEConfig  `json:"ae"`
+	TTL TTLConfig `json:"ttl"`
+	DP  DPConfig  `json:"dp"`
+}
+
+// VoiceStyleData holds voice style JSON structure
+type VoiceStyleData struct {
+	StyleTTL struct {
+		Data [][][]float64 `json:"data"`
+		Dims []int64       `json:"dims"`
+		Type string        `json:"type"`
+	} `json:"style_ttl"`
+	StyleDP struct {
+		Data [][][]float64 `json:"data"`
+		Dims []int64       `json:"dims"`
+		Type string        `json:"type"`
+	} `json:"style_dp"`
+}
+
+// UnicodeProcessor for text processing
+type UnicodeProcessor struct {
+	indexer []int64
+}
+
+// NewUnicodeProcessor creates a new UnicodeProcessor
+func NewUnicodeProcessor(unicodeIndexerPath string) (*UnicodeProcessor, error) {
+	indexer, err := loadJSONInt64(unicodeIndexerPath)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load unicode indexer: %w", err)
+	}
+
+	return &UnicodeProcessor{indexer: indexer}, nil
+}
+
+// Call processes text list to text IDs and mask
+func (up *UnicodeProcessor) Call(textList []string) ([][]int64, [][][]float64) {
+	// Preprocess texts
+	processedTexts := make([]string, len(textList))
+	for i, text := range textList {
+		processedTexts[i] = preprocessText(text)
+	}
+
+	// Get text lengths
+	textLengths := make([]int64, len(processedTexts))
+	maxLen := 0
+	for i, text := range processedTexts {
+		textLengths[i] = int64(len([]rune(text)))
+		if int(textLengths[i]) > maxLen {
+			maxLen = int(textLengths[i])
+		}
+	}
+
+	// Create text IDs
+	textIDs := make([][]int64, len(processedTexts))
+	for i, text := range processedTexts {
+		row := make([]int64, maxLen)
+		runes := []rune(text)
+		for j, r := range runes {
+			unicodeVal := int(r)
+			if unicodeVal < len(up.indexer) {
+				row[j] = up.indexer[unicodeVal]
+			} else {
+				row[j] = -1
+			}
+		}
+		textIDs[i] = row
+	}
+
+	// Create text mask
+	textMask := lengthToMask(textLengths, maxLen)
+
+	return textIDs, textMask
+}
+
+// Utility functions
+func preprocessText(text string) string {
+	// No normalization is applied: the text is returned unchanged.
+	// The Go standard library has no Unicode normalization; for NFKD
+	// normalization, use golang.org/x/text/unicode/norm.
+	return text
+}
+
+func lengthToMask(lengths []int64, maxLen int) [][][]float64 {
+	bsz := len(lengths)
+	mask := make([][][]float64, bsz)
+
+	for i := 0; i < bsz; i++ {
+		row := make([]float64, maxLen)
+		for j := 0; j < maxLen; j++ {
+			if int64(j) < lengths[i] {
+				row[j] = 1.0
+			} else {
+				row[j] = 0.0
+			}
+		}
+		mask[i] = [][]float64{row}
+	}
+
+	return mask
+}
+
+func getTextMask(textLengths []int64, maxLen int) [][][]float64 {
+	return lengthToMask(textLengths, maxLen)
+}
+
+func getLatentMask(wavLengths []int64, cfg Config) [][][]float64 {
+	baseChunkSize := int64(cfg.AE.BaseChunkSize)
+	chunkCompressFactor := int64(cfg.TTL.ChunkCompressFactor)
+	latentSize := baseChunkSize * chunkCompressFactor
+
+	latentLengths := make([]int64, len(wavLengths))
+	maxLen := int64(0)
+	for i, wavLen := range wavLengths {
+		latentLengths[i] = (wavLen + latentSize - 1) / latentSize
+		if latentLengths[i] > maxLen {
+			maxLen = latentLengths[i]
+		}
+	}
+
+	return lengthToMask(latentLengths, int(maxLen))
+}
+
+func writeWavFile(filename string, audioData []float64, sampleRate int) error {
+	file, err := os.Create(filename)
+	if err != nil {
+		return err
+	}
+	defer file.Close()
+
+	// Convert float64 to int
+	intData := make([]int, len(audioData))
+	for i, sample := range audioData {
+		// Clamp to [-1, 1] and convert to 16-bit int
+		clamped := math.Max(-1.0, math.Min(1.0, sample))
+		intData[i] = int(clamped * 32767)
+	}
+
+	encoder := wav.NewEncoder(file, sampleRate, 16, 1, 1)
+	buf := &audio.IntBuffer{
+		Data:           intData,
+		Format:         &audio.Format{SampleRate: sampleRate, NumChannels: 1},
+		SourceBitDepth: 16,
+	}
+
+	if err := encoder.Write(buf); err != nil {
+		return err
+	}
+
+	return encoder.Close()
+}
+
+// Style holds style tensors
+type Style struct {
+	TtlTensor *ort.Tensor[float32]
+	DpTensor  *ort.Tensor[float32]
+}
+
+func (s *Style) Destroy() {
+	if s.TtlTensor != nil {
+		s.TtlTensor.Destroy()
+	}
+	if s.DpTensor != nil {
+		s.DpTensor.Destroy()
+	}
+}
+
+// LoadVoiceStyle loads voice style from JSON files
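+func LoadVoiceStyle(voiceStylePaths []string, verbose bool) (*Style, error) {
+	bsz := len(voiceStylePaths)
+	if bsz == 0 {
+		// Guard against indexing voiceStylePaths[0] on an empty list
+		return nil, fmt.Errorf("no voice style paths provided")
+	}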
+
+	// Read first file to get dimensions
+	firstData, err := os.ReadFile(voiceStylePaths[0])
+	if err != nil {
+		return nil, fmt.Errorf("failed to read voice style file: %w", err)
+	}
+
+	var firstStyle VoiceStyleData
+	if err := json.Unmarshal(firstData, &firstStyle); err != nil {
+		return nil, fmt.Errorf("failed to parse voice style JSON: %w", err)
+	}
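+
+	ttlDims := firstStyle.StyleTTL.Dims
+	dpDims := firstStyle.StyleDP.Dims
+	if len(ttlDims) < 3 || len(dpDims) < 3 {
+		return nil, fmt.Errorf("voice style dims must have 3 entries, got %d (style_ttl) and %d (style_dp)",
+			len(ttlDims), len(dpDims))
+	}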
+
+	ttlDim1 := ttlDims[1]
+	ttlDim2 := ttlDims[2]
+	dpDim1 := dpDims[1]
+	dpDim2 := dpDims[2]
+
+	// Pre-allocate arrays with full batch size
+	ttlSize := int(int64(bsz) * ttlDim1 * ttlDim2)
+	dpSize := int(int64(bsz) * dpDim1 * dpDim2)
+	ttlFlat := make([]float32, ttlSize)
+	dpFlat := make([]float32, dpSize)
+
+	// Fill in the data
+	for i := 0; i < bsz; i++ {
+		data, err := os.ReadFile(voiceStylePaths[i])
+		if err != nil {
+			return nil, fmt.Errorf("failed to read voice style file: %w", err)
+		}
+
+		var voiceStyle VoiceStyleData
+		if err := json.Unmarshal(data, &voiceStyle); err != nil {
+			return nil, fmt.Errorf("failed to parse voice style JSON: %w", err)
+		}
+
+		// Flatten TTL data
+		ttlOffset := int(int64(i) * ttlDim1 * ttlDim2)
+		idx := 0
+		for _, batch := range voiceStyle.StyleTTL.Data {
+			for _, row := range batch {
+				for _, val := range row {
+					ttlFlat[ttlOffset+idx] = float32(val)
+					idx++
+				}
+			}
+		}
+
+		// Flatten DP data
+		dpOffset := int(int64(i) * dpDim1 * dpDim2)
+		idx = 0
+		for _, batch := range voiceStyle.StyleDP.Data {
+			for _, row := range batch {
+				for _, val := range row {
+					dpFlat[dpOffset+idx] = float32(val)
+					idx++
+				}
+			}
+		}
+	}
+
+	ttlShape := []int64{int64(bsz), ttlDim1, ttlDim2}
+	dpShape := []int64{int64(bsz), dpDim1, dpDim2}
+
+	ttlTensor, err := ort.NewTensor(ttlShape, ttlFlat)
+	if err != nil {
+		return nil, fmt.Errorf("failed to create TTL tensor: %w", err)
+	}
+
+	dpTensor, err := ort.NewTensor(dpShape, dpFlat)
+	if err != nil {
+		ttlTensor.Destroy()
+		return nil, fmt.Errorf("failed to create DP tensor: %w", err)
+	}
+
+	if verbose {
+		fmt.Printf("Loaded %d voice styles\n\n", bsz)
+	}
+
+	return &Style{
+		TtlTensor: ttlTensor,
+		DpTensor:  dpTensor,
+	}, nil
+}
+
+// TextToSpeech generates speech from text
+type TextToSpeech struct {
+	cfg           Config
+	textProcessor *UnicodeProcessor
+	dpOrt         *ort.DynamicAdvancedSession
+	textEncOrt    *ort.DynamicAdvancedSession
+	vectorEstOrt  *ort.DynamicAdvancedSession
+	vocoderOrt    *ort.DynamicAdvancedSession
+	SampleRate    int
+	baseChunkSize int
+	chunkCompress int
+	ldim          int
+}
+
+func (tts *TextToSpeech) sampleNoisyLatent(durOnnx []float32) ([][][]float64, [][][]float64) {
+	bsz := len(durOnnx)
+	maxDur := float64(0)
+	for _, d := range durOnnx {
+		if float64(d) > maxDur {
+			maxDur = float64(d)
+		}
+	}
+
+	wavLenMax := maxDur * float64(tts.SampleRate)
+	wavLengths := make([]int64, bsz)
+	for i, d := range durOnnx {
+		wavLengths[i] = int64(float64(d) * float64(tts.SampleRate))
+	}
+
+	chunkSize := tts.baseChunkSize * tts.chunkCompress
+	latentLen := int((wavLenMax + float64(chunkSize) - 1) / float64(chunkSize))
+	latentDim := tts.ldim * tts.chunkCompress
+
+	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
+	noisyLatent := make([][][]float64, bsz)
+	for b := 0; b < bsz; b++ {
+		batch := make([][]float64, latentDim)
+		for d := 0; d < latentDim; d++ {
+			row := make([]float64, latentLen)
+			for t := 0; t < latentLen; t++ {
+				// Box-Muller transform for normal distribution
+				// Add epsilon to avoid log(0)
+				const eps = 1e-10
+				u1 := math.Max(eps, rng.Float64())
+				u2 := rng.Float64()
+				row[t] = math.Sqrt(-2.0*math.Log(u1)) * math.Cos(2.0*math.Pi*u2)
+			}
+			batch[d] = row
+		}
+		noisyLatent[b] = batch
+	}
+
+	latentMask := getLatentMask(wavLengths, tts.cfg)
+
+	// Apply mask
+	for b := 0; b < bsz; b++ {
+		for d := 0; d < latentDim; d++ {
+			for t := 0; t < latentLen; t++ {
+				noisyLatent[b][d][t] *= latentMask[b][0][t]
+			}
+		}
+	}
+
+	return noisyLatent, latentMask
+}
+
+func (tts *TextToSpeech) Call(textList []string, style *Style, totalStep int) ([]float32, []float32, error) {
+	bsz := len(textList)
+
+	// Process text
+	textIDs, textMask := tts.textProcessor.Call(textList)
+	textIDsShape := []int64{int64(bsz), int64(len(textIDs[0]))}
+	textMaskShape := []int64{int64(bsz), 1, int64(len(textMask[0][0]))}
+
+	textIDsTensor := IntArrayToTensor(textIDs, textIDsShape)
+	defer textIDsTensor.Destroy()
+	textMaskTensor := ArrayToTensor(textMask, textMaskShape)
+	defer textMaskTensor.Destroy()
+
+	// Predict duration
+	dpOutputs := []ort.Value{nil}
+	err := tts.dpOrt.Run(
+		[]ort.Value{textIDsTensor, style.DpTensor, textMaskTensor},
+		dpOutputs,
+	)
+	if err != nil {
+		return nil, nil, fmt.Errorf("failed to run duration predictor: %w", err)
+	}
+	durTensor := dpOutputs[0].(*ort.Tensor[float32])
+	defer durTensor.Destroy()
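+	// Copy the data out: the slice returned by GetData may reference memory
+	// owned by ONNX Runtime, which is released by the deferred Destroy above.
+	durOnnx := append([]float32(nil), durTensor.GetData()...)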
+
+	// Encode text
+	textIDsTensor2 := IntArrayToTensor(textIDs, textIDsShape)
+	defer textIDsTensor2.Destroy()
+	textEncOutputs := []ort.Value{nil}
+	err = tts.textEncOrt.Run(
+		[]ort.Value{textIDsTensor2, style.TtlTensor, textMaskTensor},
+		textEncOutputs,
+	)
+	if err != nil {
+		return nil, nil, fmt.Errorf("failed to run text encoder: %w", err)
+	}
+	textEmbTensor := textEncOutputs[0].(*ort.Tensor[float32])
+	defer textEmbTensor.Destroy()
+
+	// Sample noisy latent
+	xt, latentMask := tts.sampleNoisyLatent(durOnnx)
+	latentShape := []int64{int64(bsz), int64(len(xt[0])), int64(len(xt[0][0]))}
+	latentMaskShape := []int64{int64(bsz), 1, int64(len(latentMask[0][0]))}
+
+	// Prepare constant arrays
+	totalStepArray := make([]float32, bsz)
+	for b := 0; b < bsz; b++ {
+		totalStepArray[b] = float32(totalStep)
+	}
+	scalarShape := []int64{int64(bsz)}
+
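+	totalStepTensor, err := ort.NewTensor(scalarShape, totalStepArray)
+	if err != nil {
+		return nil, nil, fmt.Errorf("failed to create total step tensor: %w", err)
+	}
+	defer totalStepTensor.Destroy()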
+
+	// Denoising loop
+	for step := 0; step < totalStep; step++ {
+		currentStepArray := make([]float32, bsz)
+		for b := 0; b < bsz; b++ {
+			currentStepArray[b] = float32(step)
+		}
+
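+		currentStepTensor, err := ort.NewTensor(scalarShape, currentStepArray)
+		if err != nil {
+			return nil, nil, fmt.Errorf("failed to create current step tensor: %w", err)
+		}
+		noisyLatentTensor := ArrayToTensor(xt, latentShape)
+		latentMaskTensor := ArrayToTensor(latentMask, latentMaskShape)
+		textMaskTensor2 := ArrayToTensor(textMask, textMaskShape)
+
+		vectorEstOutputs := []ort.Value{nil}
+		err = tts.vectorEstOrt.Run(
+			[]ort.Value{noisyLatentTensor, textEmbTensor, style.TtlTensor, latentMaskTensor, textMaskTensor2,
+				currentStepTensor, totalStepTensor},
+			vectorEstOutputs,
+		)
+		if err != nil {
+			// Release this step's tensors before returning so they are not leaked
+			noisyLatentTensor.Destroy()
+			latentMaskTensor.Destroy()
+			textMaskTensor2.Destroy()
+			currentStepTensor.Destroy()
+			return nil, nil, fmt.Errorf("failed to run vector estimator: %w", err)
+		}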
+
+		denoisedTensor := vectorEstOutputs[0].(*ort.Tensor[float32])
+		denoisedData := denoisedTensor.GetData()
+
+		// Update latent
+		idx := 0
+		for b := 0; b < bsz; b++ {
+			for d := 0; d < len(xt[b]); d++ {
+				for t := 0; t < len(xt[b][d]); t++ {
+					xt[b][d][t] = float64(denoisedData[idx])
+					idx++
+				}
+			}
+		}
+
+		noisyLatentTensor.Destroy()
+		latentMaskTensor.Destroy()
+		textMaskTensor2.Destroy()
+		currentStepTensor.Destroy()
+		denoisedTensor.Destroy()
+	}
+
+	// Generate waveform
+	finalLatentTensor := ArrayToTensor(xt, latentShape)
+	defer finalLatentTensor.Destroy()
+
+	vocoderOutputs := []ort.Value{nil}
+	err = tts.vocoderOrt.Run(
+		[]ort.Value{finalLatentTensor},
+		vocoderOutputs,
+	)
+	if err != nil {
+		return nil, nil, fmt.Errorf("failed to run vocoder: %w", err)
+	}
+
+	wavBatchTensor := vocoderOutputs[0].(*ort.Tensor[float32])
+	defer wavBatchTensor.Destroy()
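+	// Copy the samples out: the slice returned by GetData may reference memory
+	// owned by ONNX Runtime, which is released by the deferred Destroy above.
+	wav := append([]float32(nil), wavBatchTensor.GetData()...)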
+
+	return wav, durOnnx, nil
+}
+
+func (tts *TextToSpeech) Destroy() {
+	if tts.dpOrt != nil {
+		tts.dpOrt.Destroy()
+	}
+	if tts.textEncOrt != nil {
+		tts.textEncOrt.Destroy()
+	}
+	if tts.vectorEstOrt != nil {
+		tts.vectorEstOrt.Destroy()
+	}
+	if tts.vocoderOrt != nil {
+		tts.vocoderOrt.Destroy()
+	}
+}
+
+// LoadTextToSpeech loads TTS components
+func LoadTextToSpeech(onnxDir string, useGPU bool, cfg Config) (*TextToSpeech, error) {
+	if useGPU {
+		return nil, fmt.Errorf("GPU mode is not supported yet")
+	}
+	// Println already appends a newline; print the blank line separately
+	fmt.Println("Using CPU for inference")
+	fmt.Println()
+
+	// Load models
+	dpPath := filepath.Join(onnxDir, "duration_predictor.onnx")
+	textEncPath := filepath.Join(onnxDir, "text_encoder.onnx")
+	vectorEstPath := filepath.Join(onnxDir, "vector_estimator.onnx")
+	vocoderPath := filepath.Join(onnxDir, "vocoder.onnx")
+
+	dpOrt, err := ort.NewDynamicAdvancedSession(dpPath, []string{"text_ids", "style_dp", "text_mask"},
+		[]string{"duration"}, nil)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load duration predictor: %w", err)
+	}
+
+	textEncOrt, err := ort.NewDynamicAdvancedSession(textEncPath, []string{"text_ids", "style_ttl", "text_mask"},
+		[]string{"text_emb"}, nil)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load text encoder: %w", err)
+	}
+
+	vectorEstOrt, err := ort.NewDynamicAdvancedSession(vectorEstPath,
+		[]string{"noisy_latent", "text_emb", "style_ttl", "latent_mask", "text_mask", "current_step", "total_step"},
+		[]string{"denoised_latent"}, nil)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load vector estimator: %w", err)
+	}
+
+	vocoderOrt, err := ort.NewDynamicAdvancedSession(vocoderPath, []string{"latent"},
+		[]string{"wav_tts"}, nil)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load vocoder: %w", err)
+	}
+
+	// Load text processor
+	unicodeIndexerPath := filepath.Join(onnxDir, "unicode_indexer.json")
+	textProcessor, err := NewUnicodeProcessor(unicodeIndexerPath)
+	if err != nil {
+		return nil, err
+	}
+
+	textToSpeech := &TextToSpeech{
+		cfg:           cfg,
+		textProcessor: textProcessor,
+		dpOrt:         dpOrt,
+		textEncOrt:    textEncOrt,
+		vectorEstOrt:  vectorEstOrt,
+		vocoderOrt:    vocoderOrt,
+		SampleRate:    cfg.AE.SampleRate,
+		baseChunkSize: cfg.AE.BaseChunkSize,
+		chunkCompress: cfg.TTL.ChunkCompressFactor,
+		ldim:          cfg.TTL.LatentDim,
+	}
+
+	return textToSpeech, nil
+}
+
+// InitializeONNXRuntime initializes ONNX Runtime environment
+func InitializeONNXRuntime() error {
+	libPath := os.Getenv("ONNXRUNTIME_LIB_PATH")
+	if libPath == "" {
+		libPath = "/usr/local/lib/libonnxruntime.so"
+		if _, err := os.Stat("/usr/local/lib/libonnxruntime.dylib"); err == nil {
+			libPath = "/usr/local/lib/libonnxruntime.dylib"
+		} else if _, err := os.Stat("/usr/lib/libonnxruntime.so"); err == nil {
+			libPath = "/usr/lib/libonnxruntime.so"
+		}
+	}
+	ort.SetSharedLibraryPath(libPath)
+
+	if err := ort.InitializeEnvironment(); err != nil {
+		return fmt.Errorf("failed to initialize ONNX Runtime: %w\nHint: Set ONNXRUNTIME_LIB_PATH environment variable", err)
+	}
+	return nil
+}
+
+// sanitizeFilename creates a safe filename from text
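+func sanitizeFilename(text string, maxLen int) string {
+	// Truncate by runes rather than bytes so multi-byte characters are not split
+	runes := []rune(text)
+	if len(runes) > maxLen {
+		runes = runes[:maxLen]
+	}
+
+	// Keep ASCII letters and digits; replace everything else with '_'
+	result := make([]rune, 0, len(runes))
+	for _, r := range runes {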
+		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
+			result = append(result, r)
+		} else {
+			result = append(result, '_')
+		}
+	}
+	return string(result)
+}
+
+// extractWavSegment extracts a single audio segment from batch output
+func extractWavSegment(wav []float32, duration float32, sampleRate int, index int, batchSize int) []float64 {
+	wavLen := int(float64(sampleRate) * float64(duration))
+	wavPerBatch := len(wav) / batchSize
+
+	// Clamp the segment to this sample's slot so it never reads into the next sample
+	if wavLen > wavPerBatch {
+		wavLen = wavPerBatch
+	}
+	wavStart := index * wavPerBatch
+
+	wavOut := make([]float64, wavLen)
+	for j := range wavOut {
+		wavOut[j] = float64(wav[wavStart+j])
+	}
+
+	return wavOut
+}
+
+// Timer measures execution time
+func Timer(name string, fn func() interface{}) interface{} {
+	start := time.Now()
+	fmt.Printf("%s...\n", name)
+	result := fn()
+	elapsed := time.Since(start).Seconds()
+	fmt.Printf("  -> %s completed in %.2f sec\n", name, elapsed)
+	return result
+}
+
+// LoadCfgs loads configuration from JSON file
+func LoadCfgs(onnxDir string) (Config, error) {
+	cfgPath := filepath.Join(onnxDir, "tts.json")
+	data, err := os.ReadFile(cfgPath)
+	if err != nil {
+		return Config{}, err
+	}
+
+	var cfg Config
+	if err := json.Unmarshal(data, &cfg); err != nil {
+		return Config{}, err
+	}
+
+	return cfg, nil
+}
+
+// JSON loading helpers
+func loadJSONInt64(filePath string) ([]int64, error) {
+	data, err := os.ReadFile(filePath)
+	if err != nil {
+		return nil, err
+	}
+
+	var result []int64
+	if err := json.Unmarshal(data, &result); err != nil {
+		return nil, err
+	}
+
+	return result, nil
+}
+
+// Tensor conversion utilities
+func ArrayToTensor(array [][][]float64, shape []int64) *ort.Tensor[float32] {
+	// Flatten array
+	totalSize := int64(1)
+	for _, dim := range shape {
+		totalSize *= dim
+	}
+
+	flat := make([]float32, totalSize)
+	idx := 0
+	for b := 0; b < len(array); b++ {
+		for d := 0; d < len(array[b]); d++ {
+			for t := 0; t < len(array[b][d]); t++ {
+				flat[idx] = float32(array[b][d][t])
+				idx++
+			}
+		}
+	}
+
+	tensor, err := ort.NewTensor(shape, flat)
+	if err != nil {
+		panic(err)
+	}
+
+	return tensor
+}
+
+func IntArrayToTensor(array [][]int64, shape []int64) *ort.Tensor[int64] {
+	// Flatten array
+	totalSize := int64(1)
+	for _, dim := range shape {
+		totalSize *= dim
+	}
+
+	flat := make([]int64, totalSize)
+	idx := 0
+	for b := 0; b < len(array); b++ {
+		for t := 0; t < len(array[b]); t++ {
+			flat[idx] = array[b][t]
+			idx++
+		}
+	}
+
+	tensor, err := ort.NewTensor(shape, flat)
+	if err != nil {
+		panic(err)
+	}
+
+	return tensor
+}
@@ -0,0 +1,250 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg id="_레이어_2" data-name="레이어 2" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 1920 1080">
+  <defs>
+    <style>
+      .cls-1, .cls-2 {
+        fill: none;
+      }
+
+      .cls-3 {
+        fill: #227cff;
+      }
+
+      .cls-4 {
+        fill: #ff0;
+      }
+
+      .cls-2 {
+        stroke: #0a0a0a;
+        stroke-miterlimit: 10;
+        stroke-width: 1.72px;
+      }
+
+      .cls-5 {
+        fill: #f2f2f2;
+      }
+
+      .cls-6 {
+        fill: #0a0a0a;
+      }
+
+      .cls-7 {
+        clip-path: url(#clippath);
+      }
+    </style>
+    <clipPath id="clippath">
+      <rect class="cls-1" x="181.43" width="1626.55" height="1080"/>
+    </clipPath>
+  </defs>
+  <g id="Work">
+    <g>
+      <rect class="cls-5" width="1920" height="1080"/>
+      <g>
+        <circle class="cls-3" cx="1679.4" cy="880.43" r="59.81"/>
+        <path class="cls-6" d="M1713.35,805.14c-5.14,12.55-10.93,23.87-18.55,37.52l14.39-2.21c-12.75,23.11-27.5,45.05-44.09,65.57l14.46-2.22c-18.8,23.34-39.84,44.76-62.85,63.95,37.17-24.18,63.26-49.33,92.3-75.66l-16.71,2.56c24.03-21.85,46.85-44.98,68.39-69.3l-22.1,3.39c10.89-12.33,20.91-24.28,30.11-35.66l-55.34,12.05Z"/>
+      </g>
+      <g class="cls-7">
+        <path class="cls-4" d="M1036.15-20.65c-38.28,93.45-81.37,177.77-138.14,279.42l107.16-16.43c-94.95,172.09-204.79,335.46-328.29,488.29l107.68-16.51c-139.96,173.79-296.71,333.29-468.05,476.23,276.77-180.08,471.07-367.33,687.29-563.39l-124.4,19.07c178.91-162.71,348.9-334.99,509.26-516.04l-164.57,25.23c81.12-91.85,155.67-180.81,224.18-265.57l-412.13,89.69Z"/>
+      </g>
+      <g>
+        <path class="cls-6" d="M157.78,462.19h7.78v52.03h28.69v7.36h-36.47v-59.39Z"/>
+        <path class="cls-6" d="M204.45,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM200.77,479.75h7.19v41.82h-7.19v-41.82Z"/>
+        <path class="cls-6" d="M237.74,539.9c-9.03,0-16.06-3.85-19.66-8.53l5.02-5.02c3.35,4.27,8.2,7.03,14.64,7.03,7.11,0,13.97-4.43,13.97-14.47v-4.77c-2.84,4.18-8.37,7.36-14.56,7.36-11.63,0-20.49-9.37-20.49-21.33s8.87-21.25,20.49-21.25c6.19,0,11.71,3.09,14.56,7.28v-6.44h7.19v39.4c0,13.97-9.2,20.74-21.16,20.74ZM238.16,514.8c8.37,0,14.14-6.19,14.14-14.64s-5.77-14.64-14.14-14.64-14.14,6.27-14.14,14.64,5.77,14.64,14.14,14.64Z"/>
+        <path class="cls-6" d="M270.78,458.84h7.19v27.35c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-62.74Z"/>
+        <path class="cls-6" d="M334.1,521.99c-7.28,0-12.8-4.1-12.8-12.8v-22.84h-8.87v-6.61h8.87v-11.63h7.19v11.63h12.05v6.61h-12.05v21.92c0,5.35,2.43,7.11,6.94,7.11,1.76,0,3.76-.33,5.1-.84v6.44c-1.76.59-3.85,1-6.44,1Z"/>
+        <path class="cls-6" d="M348.07,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
+        <path class="cls-6" d="M399.35,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM395.67,479.75h7.19v41.82h-7.19v-41.82Z"/>
+        <path class="cls-6" d="M414.82,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
+        <path class="cls-6" d="M480.23,539.9c-9.03,0-16.06-3.85-19.66-8.53l5.02-5.02c3.35,4.27,8.2,7.03,14.64,7.03,7.11,0,13.97-4.43,13.97-14.47v-4.77c-2.84,4.18-8.37,7.36-14.56,7.36-11.63,0-20.49-9.37-20.49-21.33s8.87-21.25,20.49-21.25c6.19,0,11.71,3.09,14.56,7.28v-6.44h7.19v39.4c0,13.97-9.2,20.74-21.16,20.74ZM480.65,514.8c8.37,0,14.14-6.19,14.14-14.64s-5.77-14.64-14.14-14.64-14.14,6.27-14.14,14.64,5.77,14.64,14.14,14.64Z"/>
+        <path class="cls-6" d="M511.93,494.56h21.33v7.19h-21.33v-7.19Z"/>
+        <path class="cls-6" d="M544.97,462.19h34.55v7.36h-26.77v17.98h21.16v7.36h-21.16v26.68h-7.78v-59.39Z"/>
+        <path class="cls-6" d="M600.68,478.92c6.27,0,11.96,3.26,14.72,7.28v-6.44h7.19v41.82h-7.19v-6.44c-2.76,4.02-8.45,7.28-14.72,7.28-11.71,0-20.49-9.79-20.49-21.75s8.78-21.75,20.49-21.75ZM601.77,485.52c-8.45,0-14.3,6.69-14.3,15.14s5.86,15.14,14.3,15.14,14.22-6.69,14.22-15.14-5.77-15.14-14.22-15.14Z"/>
+        <path class="cls-6" d="M647.11,522.41c-7.44,0-13.89-3.18-16.39-9.7l5.86-3.26c1.5,4.27,6.02,6.61,10.62,6.61,4.01,0,7.28-2.01,7.28-5.6,0-3.01-1.84-4.85-7.28-6.52l-4.18-1.34c-6.86-2.01-10.46-6.36-10.46-12.21.08-7.19,6.36-11.46,14.39-11.46,6.19,0,11.04,2.59,13.8,7.28l-5.35,3.68c-1.84-2.68-4.68-4.77-8.7-4.77-3.51,0-6.94,1.92-6.94,5.02,0,2.51,1.34,4.6,5.94,6.02l4.6,1.42c7.19,2.17,11.38,5.94,11.38,12.3,0,8.03-6.19,12.55-14.55,12.55Z"/>
+        <path class="cls-6" d="M686.25,521.99c-7.28,0-12.8-4.1-12.8-12.8v-22.84h-8.87v-6.61h8.87v-11.63h7.19v11.63h12.04v6.61h-12.04v21.92c0,5.35,2.43,7.11,6.94,7.11,1.76,0,3.76-.33,5.1-.84v6.44c-1.76.59-3.85,1-6.44,1Z"/>
+        <path class="cls-6" d="M700.56,533.54h-4.77l5.6-11.96c-2.01-1-3.43-3.09-3.43-5.6,0-3.35,2.76-6.27,6.19-6.27s6.19,2.93,6.19,6.27c0,1.42-.5,2.68-1.17,3.76l-8.62,13.8Z"/>
+        <path class="cls-6" d="M766.14,522.58c-17.06,0-30.7-13.47-30.7-30.7s13.63-30.7,30.7-30.7,30.7,13.47,30.7,30.7-13.55,30.7-30.7,30.7ZM766.14,515.22c13.05,0,22.92-10.29,22.92-23.34s-9.87-23.34-22.92-23.34-22.84,10.29-22.84,23.34,9.87,23.34,22.84,23.34Z"/>
+        <path class="cls-6" d="M805.7,479.75h7.19v6.44c2.84-4.94,7.86-7.28,13.3-7.28,9.37,0,15.81,6.52,15.81,16.9v25.76h-7.11v-24.68c0-7.03-4.01-11.38-9.87-11.38-6.78,0-12.13,5.52-12.13,15.31v20.74h-7.19v-41.82Z"/>
+        <path class="cls-6" d="M851.96,494.56h21.33v7.19h-21.33v-7.19Z"/>
+        <path class="cls-6" d="M885,462.19h17.06c18.24,0,31.62,12.71,31.62,29.7s-13.38,29.7-31.62,29.7h-17.06v-59.39ZM902.06,514.3c14.3,0,23.76-9.54,23.76-22.42s-9.45-22.42-23.76-22.42h-9.29v44.84h9.29Z"/>
+        <path class="cls-6" d="M961.03,478.92c10.96,0,19.83,7.61,19.91,21,0,.75,0,1.25-.08,2.17h-34.3c.25,7.86,6.19,13.72,14.47,13.72,6.44,0,10.46-2.84,13.05-7.28l5.69,3.93c-3.76,6.11-10.12,9.95-18.82,9.95-12.97,0-21.67-9.45-21.67-21.75s9.03-21.75,21.75-21.75ZM947.07,496.23h26.52c-1-6.86-6.44-10.96-12.8-10.96s-12.38,4.02-13.72,10.96Z"/>
+        <path class="cls-6" d="M982.53,479.75h8.03l14.3,31.62,14.3-31.62h8.03l-19.32,41.82h-6.02l-19.32-41.82Z"/>
+        <path class="cls-6" d="M1036.48,468.54c-2.84,0-5.19-2.26-5.19-5.1s2.34-5.1,5.19-5.1,5.02,2.34,5.02,5.1-2.17,5.1-5.02,5.1ZM1032.8,479.75h7.19v41.82h-7.19v-41.82Z"/>
+        <path class="cls-6" d="M1070.61,522.41c-12.63,0-21.92-9.62-21.92-21.75s9.28-21.75,21.92-21.75c8.62,0,15.64,4.52,19.32,11.21l-6.27,3.51c-2.34-4.77-7.03-8.03-13.05-8.03-8.7,0-14.64,6.69-14.64,15.06s5.94,15.06,14.64,15.06c6.02,0,10.71-3.26,13.05-8.03l6.27,3.51c-3.68,6.69-10.71,11.21-19.32,11.21Z"/>
+        <path class="cls-6" d="M1116.03,478.92c10.96,0,19.83,7.61,19.91,21,0,.75,0,1.25-.08,2.17h-34.3c.25,7.86,6.19,13.72,14.47,13.72,6.44,0,10.46-2.84,13.05-7.28l5.69,3.93c-3.76,6.11-10.12,9.95-18.82,9.95-12.97,0-21.67-9.45-21.67-21.75s9.03-21.75,21.75-21.75ZM1102.06,496.23h26.52c-1-6.86-6.44-10.96-12.8-10.96s-12.38,4.02-13.72,10.96Z"/>
+        <path class="cls-6" d="M1173.33,469.55h-18.49v-7.36h44.58v7.36h-18.49v52.03h-7.61v-52.03Z"/>
+        <path class="cls-6" d="M1219.08,469.55h-18.49v-7.36h44.58v7.36h-18.49v52.03h-7.61v-52.03Z"/>
+        <path class="cls-6" d="M1253.71,507.19c3.18,5.02,8.03,8.11,14.72,8.11,6.11,0,10.96-3.35,10.96-8.87,0-4.77-3.01-7.95-8.53-10.04l-7.86-3.01c-9.29-3.43-13.47-8.28-13.47-16.4,0-9.7,7.95-15.81,18.57-15.81,7.44,0,13.63,3.35,17.32,8.11l-5.69,5.02c-3.01-3.6-6.69-5.86-11.79-5.86-5.86,0-10.71,3.26-10.71,8.2s2.93,7.28,8.95,9.54l7.19,2.76c8.78,3.35,13.8,8.36,13.8,17.15,0,10.12-7.86,16.48-18.9,16.48-9.79,0-17.73-4.6-20.83-10.71l6.27-4.68Z"/>
+        <path class="cls-6" d="M1299.64,522.25c-3.43,0-6.19-2.84-6.19-6.27s2.76-6.27,6.19-6.27,6.19,2.93,6.19,6.27-2.76,6.27-6.19,6.27Z"/>
+      </g>
+      <g>
+        <g>
+          <path class="cls-6" d="M175.59,793.41c0,5.19-3.89,9.12-9.89,9.12h-5.03v10.5h-3.77v-28.77h8.79c6,0,9.89,3.97,9.89,9.16ZM171.86,793.41c0-3.2-2.19-5.63-6.16-5.63h-5.03v11.27h5.03c3.97,0,6.16-2.43,6.16-5.63Z"/>
+          <path class="cls-6" d="M186.24,792.36c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM186.77,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+          <path class="cls-6" d="M201.8,792.76h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
+          <path class="cls-6" d="M222.67,792.36c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM223.2,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+          <path class="cls-6" d="M238.23,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M281.43,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM274.66,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+          <path class="cls-6" d="M301.98,813.23c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
+          <path class="cls-6" d="M316.61,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM309.84,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+          <path class="cls-6" d="M329.57,792.76h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
+          <path class="cls-6" d="M348.5,813.43c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
+        </g>
+        <g>
+          <path class="cls-6" d="M642.94,793.41c0,5.19-3.89,9.12-9.89,9.12h-5.03v10.5h-3.77v-28.77h8.79c6,0,9.89,3.97,9.89,9.16ZM639.22,793.41c0-3.2-2.19-5.63-6.16-5.63h-5.03v11.27h5.03c3.97,0,6.16-2.43,6.16-5.63Z"/>
+          <path class="cls-6" d="M645.58,792.76h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M667.05,813.43c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM667.05,810.19c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
+          <path class="cls-6" d="M689.63,821.9c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM689.83,809.74c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+          <path class="cls-6" d="M704.82,792.76h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M725.69,792.36c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM726.22,795.56c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+          <path class="cls-6" d="M741.25,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M775.49,792.76h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M811.52,787.33c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM809.73,792.76h3.49v20.26h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M818.2,792.76h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M849.08,821.9c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM849.29,809.74c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+          <path class="cls-6" d="M623.73,824.36h3.48v30.4h-3.48v-30.4Z"/>
+          <path class="cls-6" d="M640.55,834.09c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM641.08,837.29c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+          <path class="cls-6" d="M656.11,834.49h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+          <path class="cls-6" d="M686.99,863.63c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM687.19,851.47c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+          <path class="cls-6" d="M701.9,834.49h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
+          <path class="cls-6" d="M732.38,834.09c3.04,0,5.8,1.58,7.13,3.53v-3.12h3.49v20.26h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM732.9,837.29c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+          <path class="cls-6" d="M756.57,863.63c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM756.77,851.47c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+          <path class="cls-6" d="M780.72,834.09c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM773.95,842.48h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+          <path class="cls-6" d="M799.81,855.16c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
+        </g>
+        <g>
+          <path class="cls-6" d="M1102.25,813.51c-8.27,0-14.87-6.52-14.87-14.87s6.61-14.87,14.87-14.87,14.87,6.52,14.87,14.87-6.56,14.87-14.87,14.87ZM1102.25,809.94c6.32,0,11.1-4.98,11.1-11.31s-4.78-11.31-11.1-11.31-11.06,4.98-11.06,11.31,4.78,11.31,11.06,11.31Z"/>
+          <path class="cls-6" d="M1120.61,792.76h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M1142.21,799.93h10.33v3.49h-10.33v-3.49Z"/>
+          <path class="cls-6" d="M1157.4,784.25h8.27c8.83,0,15.32,6.16,15.32,14.39s-6.48,14.39-15.32,14.39h-8.27v-28.77ZM1165.67,809.5c6.93,0,11.51-4.62,11.51-10.86s-4.58-10.86-11.51-10.86h-4.5v21.72h4.5Z"/>
+          <path class="cls-6" d="M1193.43,792.36c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM1186.66,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+          <path class="cls-6" d="M1203.03,792.76h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
+          <path class="cls-6" d="M1228.36,787.33c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM1226.57,792.76h3.49v20.26h-3.49v-20.26Z"/>
+          <path class="cls-6" d="M1244.08,813.43c-6.12,0-10.62-4.66-10.62-10.54s4.5-10.54,10.62-10.54c4.17,0,7.58,2.19,9.36,5.43l-3.04,1.7c-1.13-2.31-3.4-3.89-6.32-3.89-4.21,0-7.09,3.24-7.09,7.29s2.88,7.29,7.09,7.29c2.92,0,5.19-1.58,6.32-3.89l3.04,1.7c-1.78,3.24-5.19,5.43-9.36,5.43Z"/>
+          <path class="cls-6" d="M1265.27,792.36c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM1258.5,800.74h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+        </g>
+        <g>
+          <path class="cls-6" d="M1097.68,909.11h-11.46l5.86-7.11h12.8v59.39h-7.19v-52.28Z"/>
+          <path class="cls-6" d="M1135.41,962.23c-12.21,0-21-9.12-21-22.33v-16.56c0-13.13,8.78-22.33,21-22.33s21.08,9.2,21.08,22.33v16.56c0,13.22-8.87,22.33-21.08,22.33ZM1135.41,955.2c8.11,0,13.89-5.69,13.89-15.31v-16.56c0-9.62-5.77-15.31-13.89-15.31s-13.8,5.69-13.8,15.31v16.56c0,9.62,5.77,15.31,13.8,15.31Z"/>
+          <path class="cls-6" d="M1184.17,962.23c-12.21,0-21-9.12-21-22.33v-16.56c0-13.13,8.78-22.33,21-22.33s21.08,9.2,21.08,22.33v16.56c0,13.22-8.87,22.33-21.08,22.33ZM1184.17,955.2c8.11,0,13.89-5.69,13.89-15.31v-16.56c0-9.62-5.77-15.31-13.89-15.31s-13.8,5.69-13.8,15.31v16.56c0,9.62,5.77,15.31,13.8,15.31Z"/>
+          <path class="cls-6" d="M1226.08,934.29c-9.12,0-16.48-7.44-16.48-16.48s7.36-16.48,16.48-16.48,16.48,7.36,16.48,16.48-7.45,16.48-16.48,16.48ZM1226.08,927.76c5.44,0,9.87-4.6,9.87-9.95s-4.43-9.95-9.87-9.95-9.87,4.6-9.87,9.95,4.35,9.95,9.87,9.95ZM1263.3,902h7.78l-37.64,59.39h-7.78l37.64-59.39ZM1270.66,962.06c-9.12,0-16.48-7.44-16.48-16.48s7.36-16.48,16.48-16.48,16.48,7.36,16.48,16.48-7.44,16.48-16.48,16.48ZM1270.66,955.53c5.44,0,9.87-4.6,9.87-9.95s-4.43-9.95-9.87-9.95-9.87,4.6-9.87,9.95,4.35,9.95,9.87,9.95Z"/>
+        </g>
+        <g>
+          <path class="cls-6" d="M644.38,962.23c-11.63,0-19.99-7.86-19.99-18.65,0-7.11,3.51-12.63,9.45-15.56-3.85-2.59-6.19-6.78-6.19-11.71,0-8.87,7.19-15.31,16.73-15.31s16.65,6.44,16.65,15.31c0,4.94-2.26,9.12-6.02,11.71,5.86,2.93,9.29,8.45,9.29,15.56,0,10.79-8.11,18.65-19.91,18.65ZM644.38,955.37c7.95,0,12.63-5.27,12.63-11.79s-4.68-11.88-12.63-11.88-12.71,5.19-12.71,11.88,4.94,11.79,12.71,11.79ZM644.38,925.25c6.02,0,9.54-4.18,9.54-8.78,0-4.94-3.6-8.7-9.54-8.7s-9.62,3.76-9.62,8.7c0,4.6,3.76,8.78,9.62,8.78Z"/>
+          <path class="cls-6" d="M668.05,933.95h14.47v-14.39h6.53v14.39h14.39v6.44h-14.39v14.39h-6.53v-14.39h-14.47v-6.44Z"/>
+        </g>
+        <g>
+          <path class="cls-6" d="M176.26,962.23c-11.54,0-20.16-8.78-20.16-19.57,0-6.27,3.01-10.79,6.69-15.14l21.67-25.51h9.12l-18.15,21.5c.84-.08,1.59-.17,2.34-.17,9.95,0,18.57,8.78,18.57,19.16,0,10.96-8.53,19.74-20.08,19.74ZM176.26,955.28c7.19,0,12.71-5.77,12.71-12.8s-5.52-12.8-12.71-12.8-12.8,5.77-12.8,12.8,5.52,12.8,12.8,12.8Z"/>
+          <path class="cls-6" d="M219.68,962.23c-11.54,0-20.16-8.78-20.16-19.57,0-6.27,3.01-10.79,6.69-15.14l21.67-25.51h9.12l-18.15,21.5c.84-.08,1.59-.17,2.34-.17,9.95,0,18.57,8.78,18.57,19.16,0,10.96-8.53,19.74-20.08,19.74ZM219.68,955.28c7.19,0,12.71-5.77,12.71-12.8s-5.52-12.8-12.71-12.8-12.8,5.77-12.8,12.8,5.52,12.8,12.8,12.8Z"/>
+          <path class="cls-6" d="M255.56,915.22v46.17h-7.78v-59.39h7.11l22.08,29.61,22.08-29.61h7.03v59.39h-7.7v-46.17l-21.41,28.78-21.41-28.78Z"/>
+        </g>
+        <line class="cls-2" x1="157.77" y1="878.71" x2="407.74" y2="878.71"/>
+        <line class="cls-2" x1="621.9" y1="878.71" x2="871.87" y2="878.71"/>
+        <line class="cls-2" x1="1086.03" y1="878.71" x2="1336" y2="878.71"/>
+      </g>
+      <g>
+        <path class="cls-6" d="M158.3,582.04h3.77v28.77h-3.77v-28.77Z"/>
+        <path class="cls-6" d="M167.46,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M198.74,611.22c-6.12,0-10.62-4.66-10.62-10.54s4.5-10.54,10.62-10.54c4.17,0,7.58,2.19,9.36,5.43l-3.04,1.7c-1.13-2.31-3.4-3.89-6.32-3.89-4.21,0-7.09,3.24-7.09,7.29s2.88,7.29,7.09,7.29c2.92,0,5.19-1.58,6.32-3.89l3.04,1.7c-1.78,3.24-5.19,5.43-9.36,5.43Z"/>
+        <path class="cls-6" d="M210.98,590.55h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M232.42,590.15c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM225.65,598.53h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+        <path class="cls-6" d="M253.73,590.15c3.04,0,5.79,1.58,7.13,3.53v-13.25h3.48v30.4h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM254.26,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M271.08,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM269.29,590.55h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M281.25,610.81h-3.49v-30.4h3.49v13.25c1.34-1.95,4.09-3.53,7.13-3.53,5.63,0,9.93,4.74,9.93,10.54s-4.3,10.54-9.93,10.54c-3.04,0-5.8-1.58-7.13-3.53v3.12ZM287.85,593.35c-4.09,0-6.89,3.24-6.89,7.34s2.8,7.34,6.89,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M301.67,580.42h3.49v30.4h-3.49v-30.4Z"/>
+        <path class="cls-6" d="M311.64,619.28l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
+        <path class="cls-6" d="M337.94,580.42h3.48v30.4h-3.48v-30.4Z"/>
+        <path class="cls-6" d="M348.19,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM346.41,590.55h3.48v20.26h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M363.51,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM363.71,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+        <path class="cls-6" d="M378.71,580.42h3.48v13.25c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-30.4Z"/>
+        <path class="cls-6" d="M408.57,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
+        <path class="cls-6" d="M426.73,596.22l-5.03,14.59h-3.08l-6.89-20.26h3.65l4.9,14.79,5.07-14.79h2.67l5.07,14.79,4.9-14.79h3.69l-6.89,20.26h-3.08l-4.99-14.59Z"/>
+        <path class="cls-6" d="M452.7,590.15c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM445.93,598.53h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+        <path class="cls-6" d="M467.45,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM465.67,590.55h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M482.77,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.45,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.49v19.09c0,6.77-4.46,10.05-10.25,10.05ZM482.97,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+        <path class="cls-6" d="M497.96,580.42h3.48v13.25c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-30.4Z"/>
+        <path class="cls-6" d="M527.83,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
+        <path class="cls-6" d="M550.28,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM550.81,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M565.84,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M596.43,590.15c3.04,0,5.8,1.58,7.13,3.53v-13.25h3.49v30.4h-3.49v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.25-10.54,9.93-10.54ZM596.96,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M623.62,610.81h-3.48v-30.4h3.48v13.25c1.34-1.95,4.09-3.53,7.13-3.53,5.63,0,9.93,4.74,9.93,10.54s-4.3,10.54-9.93,10.54c-3.04,0-5.79-1.58-7.13-3.53v3.12ZM630.23,593.35c-4.09,0-6.89,3.24-6.89,7.34s2.8,7.34,6.89,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M644.05,580.42h3.49v30.4h-3.49v-30.4Z"/>
+        <path class="cls-6" d="M660.87,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM661.39,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M674.76,608.06l11.63-14.31h-11.47v-3.2h15.93v2.8l-11.55,14.27h11.92v3.2h-16.45v-2.76Z"/>
+        <path class="cls-6" d="M696.2,585.12c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM694.42,590.55h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M702.89,590.55h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M733.76,619.69c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM733.97,607.53c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+        <path class="cls-6" d="M748.96,580.42h3.49v30.4h-3.49v-30.4Z"/>
+        <path class="cls-6" d="M758.93,619.28l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
+        <path class="cls-6" d="M786.4,593.75h-4.3v-3.2h4.3v-4.09c0-4.17,2.67-6.28,6.2-6.28,1.22,0,2.27.2,3.16.53v3.08c-.65-.24-1.7-.41-2.51-.41-2.23,0-3.36.89-3.36,3.44v3.73h5.88v3.2h-5.88v17.06h-3.48v-17.06Z"/>
+        <path class="cls-6" d="M805.78,590.15c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM806.3,593.35c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M827.45,611.22c-3.61,0-6.73-1.54-7.94-4.7l2.84-1.58c.73,2.07,2.92,3.2,5.15,3.2,1.95,0,3.53-.97,3.53-2.72,0-1.46-.89-2.35-3.53-3.16l-2.03-.65c-3.32-.97-5.07-3.08-5.07-5.92.04-3.49,3.08-5.55,6.97-5.55,3,0,5.35,1.26,6.69,3.53l-2.59,1.78c-.89-1.3-2.27-2.31-4.21-2.31-1.7,0-3.36.93-3.36,2.43,0,1.22.65,2.23,2.88,2.92l2.23.69c3.49,1.05,5.51,2.88,5.51,5.96,0,3.89-3,6.08-7.05,6.08Z"/>
+        <path class="cls-6" d="M845.61,611.02c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.17,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
+        <path class="cls-6" d="M851.73,616.61h-2.31l2.72-5.8c-.97-.49-1.66-1.5-1.66-2.72,0-1.62,1.34-3.04,3-3.04s3,1.42,3,3.04c0,.69-.24,1.3-.57,1.82l-4.17,6.69Z"/>
+        <path class="cls-6" d="M157.78,639.18h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M170.7,639.18h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
+        <path class="cls-6" d="M192.83,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M215.07,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M239.1,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM237.32,639.18h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M245.79,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M276.67,668.32c-4.38,0-7.78-1.86-9.52-4.13l2.43-2.43c1.62,2.07,3.97,3.4,7.09,3.4,3.44,0,6.77-2.15,6.77-7.01v-2.31c-1.38,2.03-4.05,3.57-7.05,3.57-5.63,0-9.93-4.54-9.93-10.33s4.3-10.29,9.93-10.29c3,0,5.67,1.5,7.05,3.53v-3.12h3.48v19.09c0,6.77-4.46,10.05-10.25,10.05ZM276.87,656.16c4.05,0,6.85-3,6.85-7.09s-2.8-7.09-6.85-7.09-6.85,3.04-6.85,7.09,2.8,7.09,6.85,7.09Z"/>
+        <path class="cls-6" d="M300.01,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M330.61,638.77c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM331.13,641.98c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M354.03,659.65c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.17,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
+        <path class="cls-6" d="M361.77,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM359.98,639.18h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M365.41,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
+        <path class="cls-6" d="M397.47,638.77c5.31,0,9.6,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM390.7,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+        <path class="cls-6" d="M410.43,629.05h3.49v30.4h-3.49v-30.4Z"/>
+        <path class="cls-6" d="M420.4,667.91l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
+        <path class="cls-6" d="M448.49,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM446.7,639.18h3.48v20.26h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M455.17,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M486.13,667.91l4.42-9.52-8.88-19.21h3.85l6.97,15.36,6.93-15.36h3.89l-13.29,28.73h-3.89Z"/>
+        <path class="cls-6" d="M513.73,659.85c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM513.73,656.61c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
+        <path class="cls-6" d="M527.38,639.18h3.49v11.92c0,3.4,1.78,5.55,4.58,5.55,3.16,0,5.63-2.76,5.63-7.42v-10.05h3.49v20.26h-3.49v-3.12c-1.34,2.35-3.69,3.53-6.24,3.53-4.46,0-7.46-3.2-7.46-8.23v-12.44Z"/>
+        <path class="cls-6" d="M549.51,639.18h3.48v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.91-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M578.81,638.77c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM572.04,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+        <path class="cls-6" d="M591.78,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M610.54,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
+        <path class="cls-6" d="M635.86,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM634.08,639.18h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M642.55,639.18h3.49v3.93c.81-2.59,3.32-4.13,5.67-4.13.53,0,1.01.04,1.58.16v3.61c-.65-.24-1.22-.32-1.9-.32-2.55,0-5.35,2.23-5.35,6.97v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M664.03,659.85c-6.12,0-10.62-4.7-10.62-10.54s4.5-10.54,10.62-10.54,10.58,4.7,10.58,10.54-4.5,10.54-10.58,10.54ZM664.03,656.61c4.21,0,7.01-3.24,7.01-7.29s-2.8-7.29-7.01-7.29-7.05,3.24-7.05,7.29,2.84,7.29,7.05,7.29Z"/>
+        <path class="cls-6" d="M677.97,639.18h3.49v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.45v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M700.21,639.18h3.49v3.12c1.17-2.15,3.4-3.53,6.12-3.53,3.04,0,5.23,1.66,6.24,4.34,1.09-2.67,3.65-4.34,6.65-4.34,4.42,0,7.09,3.28,7.09,8.23v12.44h-3.49v-11.92c0-3.28-1.38-5.55-4.01-5.55-3.28,0-5.55,2.84-5.55,7.42v10.05h-3.48v-11.92c0-3.28-1.34-5.55-3.97-5.55-3.28,0-5.59,2.84-5.59,7.42v10.05h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M743.41,638.77c5.31,0,9.61,3.69,9.65,10.17,0,.36,0,.61-.04,1.05h-16.62c.12,3.81,3,6.65,7.01,6.65,3.12,0,5.07-1.38,6.32-3.53l2.76,1.9c-1.82,2.96-4.9,4.82-9.12,4.82-6.28,0-10.5-4.58-10.5-10.54s4.38-10.54,10.54-10.54ZM736.64,647.16h12.85c-.49-3.32-3.12-5.31-6.2-5.31s-6,1.95-6.65,5.31Z"/>
+        <path class="cls-6" d="M756.38,639.18h3.48v3.12c1.38-2.39,3.81-3.53,6.44-3.53,4.54,0,7.66,3.16,7.66,8.19v12.48h-3.44v-11.96c0-3.4-1.95-5.51-4.78-5.51-3.28,0-5.88,2.67-5.88,7.42v10.05h-3.48v-20.26Z"/>
+        <path class="cls-6" d="M786.24,659.65c-3.53,0-6.2-1.99-6.2-6.2v-11.06h-4.3v-3.2h4.3v-5.63h3.49v5.63h5.84v3.2h-5.84v10.62c0,2.59,1.18,3.44,3.36,3.44.85,0,1.82-.16,2.47-.41v3.12c-.85.28-1.86.49-3.12.49Z"/>
+        <path class="cls-6" d="M796.42,639.18h3.89l6.93,15.32,6.93-15.32h3.89l-9.36,20.26h-2.92l-9.36-20.26Z"/>
+        <path class="cls-6" d="M821.74,633.75c-1.38,0-2.51-1.09-2.51-2.47s1.13-2.47,2.51-2.47,2.43,1.13,2.43,2.47-1.05,2.47-2.43,2.47ZM819.96,639.18h3.49v20.26h-3.49v-20.26Z"/>
+        <path class="cls-6" d="M836.78,638.77c3.04,0,5.79,1.58,7.13,3.53v-3.12h3.48v20.26h-3.48v-3.12c-1.34,1.95-4.09,3.53-7.13,3.53-5.67,0-9.93-4.74-9.93-10.54s4.26-10.54,9.93-10.54ZM837.3,641.98c-4.09,0-6.93,3.24-6.93,7.34s2.84,7.34,6.93,7.34,6.89-3.24,6.89-7.34-2.8-7.34-6.89-7.34Z"/>
+        <path class="cls-6" d="M873.9,659.93c-8.27,0-14.87-6.52-14.87-14.87s6.61-14.87,14.87-14.87,14.87,6.52,14.87,14.87-6.56,14.87-14.87,14.87ZM873.9,656.36c6.32,0,11.1-4.98,11.1-11.31s-4.78-11.31-11.1-11.31-11.06,4.98-11.06,11.31,4.78,11.31,11.06,11.31Z"/>
+        <path class="cls-6" d="M914.1,659.44l-17.51-22.41v22.41h-3.77v-28.77h3.28l17.51,22.41v-22.41h3.73v28.77h-3.24Z"/>
+        <path class="cls-6" d="M944.65,659.44l-17.51-22.41v22.41h-3.77v-28.77h3.28l17.51,22.41v-22.41h3.73v28.77h-3.24Z"/>
+        <path class="cls-6" d="M963.7,647.89l-8.79,11.55h-4.62l11.1-14.55-10.66-14.23h4.5l8.47,11.23,8.47-11.23h4.5l-10.62,14.23,11.06,14.55h-4.58l-8.83-11.55Z"/>
+        <path class="cls-6" d="M980.75,659.77c-1.66,0-3-1.38-3-3.04s1.34-3.04,3-3.04,3,1.42,3,3.04-1.34,3.04-3,3.04Z"/>
+      </g>
+      <g>
+        <path class="cls-3" d="M312.56,274.03h-18c0-5.18-4.04-10.02-9.4-10.07l-104.25.02c-11.33,2.12-11.32,17.43-.38,20.06l102.54.02c38.59,2.04,38.78,52.82,2.1,56.56-33.79-1.6-69.45,2.06-103.02,0-11.72-.72-21.95-7.64-26.04-18.75-1.18-3.2-1.28-6.01-1.77-9.32h18c.28,4.87,3.95,9.62,8.98,10.07l104.25-.02c12.15-2.24,11.75-18.22-.05-20.47l-105.04-.03c-34.49-3.76-34.33-52.52,0-56.14l107.07.13c13.99,1.94,24.86,13.7,25.02,27.94Z"/>
+        <polygon class="cls-3" points="845.46 245.97 845.46 263.98 708.99 263.98 708.99 284.08 807.58 284.08 808.2 284.71 808.2 302.08 708.99 302.08 708.99 322.6 845.46 322.6 845.46 340.61 690.15 340.61 690.15 245.97 845.46 245.97"/>
+        <path class="cls-3" d="M979.42,302.08l40.19,38.52h-24.07l-41.02-38.52h-66.35v38.52h-18.42v-94.63h127.89c2.7,0,8.65,2.55,11.1,3.97,18.71,10.84,17.89,38.72-.98,48.86-1.77.95-7.92,3.28-9.7,3.28h-18.63ZM888.16,284.08h108.63c.19,0,2.17-.96,2.59-1.18,5.89-3.09,6.76-10.78,2.58-15.72-.77-.9-3.71-3.2-4.75-3.2h-109.05v20.1Z"/>
+        <path class="cls-3" d="M1223.38,246.09l104.97-.13c13.07,1.48,23.94,11.35,25.72,24.52-2.05,25.65,10.21,65.79-26.58,70.12l-101.32.02c-16.42-1.48-26.57-12.82-27.42-29.1-.57-10.96-.6-25.52,0-36.47.83-15.17,9.22-26.52,24.62-28.97ZM1225.46,264.08c-3.86,1.07-7.28,3.93-7.85,8.06,1.21,13.06-1.63,29.14.04,41.84.54,4.08,4.71,8.46,8.95,8.64l100.07-.02c4.88-.89,8.58-4.39,9.01-9.41,1.12-12.93-.84-27.49-.05-40.6-.75-4.36-4.01-8.14-8.53-8.64l-101.64.12Z"/>
+        <path class="cls-3" d="M537.36,302.08v38.52h-18.42v-94.63h127.89c2.29,0,8.64,2.6,10.8,3.85,19.96,11.5,17.66,41.29-3.33,50.11-1.6.67-5.94,2.15-7.48,2.15h-109.47ZM537.36,284.08h108.21c2.55,0,6.68-3.98,7.46-6.35,1.45-4.38-.08-9.62-3.93-12.25-.44-.3-2.83-1.49-3.11-1.49h-108.63v20.1Z"/>
+        <path class="cls-3" d="M1763.91,245.97v18h-128.72c-.58,0-3.75,1.77-4.41,2.29-2.94,2.33-3.53,5.62-3.78,9.2-.69,9.89-.63,24.87,0,34.79.15,2.35.5,5.46,1.76,7.45.92,1.44,4.77,4.88,6.42,4.88h128.72v18h-129.14c-14.54,0-25.37-14.67-26.18-28.25-.67-11.18-.67-26.96,0-38.14.74-12.49,10.69-28.25,24.51-28.25h130.82Z"/>
+        <polygon class="cls-3" points="1517.76 323.86 1517.76 245.97 1536.6 245.97 1536.6 340.61 1510.02 340.61 1399.71 262.72 1399.71 340.61 1381.29 340.61 1381.29 245.97 1409.55 245.97 1517.76 323.86"/>
+        <path class="cls-3" d="M355.26,245.97v69.3c0,2.61,5.06,7.86,8.15,7.34l101.74-.02c2.83.06,8.16-4.19,8.16-6.91v-69.72h18.42v71.39c0,11-15.16,23.94-26.14,23.26-33.52-1.59-68.88,2.04-102.18,0-10.81-.66-20.71-6.87-24.79-17.08-.41-1.02-1.77-4.96-1.77-5.76v-71.81h18.42Z"/>
+        <polygon class="cls-3" points="1188.31 245.97 1188.31 263.35 1187.68 263.98 1118.4 263.98 1118.4 340.61 1099.98 340.61 1099.98 263.98 1030.49 263.98 1030.49 245.97 1188.31 245.97"/>
+        <rect class="cls-3" x="1563.39" y="245.97" width="18.42" height="94.63"/>
+      </g>
+      <g>
+        <path class="cls-6" d="M156.1,179.65h54.1v-54.1h-54.1v54.1ZM183.15,129.76c5.25,0,10.08,1.79,13.94,4.77l-13.03,13.03,4.13,4.13,13.03-13.03c2.98,3.86,4.77,8.68,4.77,13.94,0,12.62-10.23,22.84-22.84,22.84s-22.84-10.23-22.84-22.84,10.23-22.84,22.84-22.84Z"/>
+        <path class="cls-6" d="M279.9,132.95c0,8.93,0,17.85,0,26.78,0,3-1.02,5.56-3.39,7.47-2,1.62-4.32,2.26-6.87,2.03-3.3-.31-5.79-1.89-7.41-4.8-.85-1.52-1.09-3.18-1.09-4.9,0-8.82,0-17.63,0-26.45v-.79h-4.64v.59c0,8.98,0,17.96,0,26.94,0,.86.03,1.74.18,2.58.55,3.28,2.09,6.04,4.68,8.15,3.29,2.68,7.09,3.66,11.28,3.08,3.43-.48,6.33-2.02,8.62-4.62,2.3-2.61,3.28-5.7,3.28-9.14v-27.6h-4.65v.67Z"/>
+        <path class="cls-6" d="M244.82,151.28c-1.95-1-4.04-1.33-6.2-1.37-1.41-.03-2.71-.42-3.92-1.17-2.75-1.7-3.99-5.22-2.86-8.25,1.07-2.87,3.65-4.63,6.95-4.71,3.71-.09,6.89,2.1,7.55,5.82.07.39.09.8.13,1.21h4.6c.03-3.38-1.3-6.12-3.67-8.34-2.21-2.06-4.9-3.07-7.93-3.15-4.59-.11-8.33,1.55-10.88,5.43-1.96,2.98-2.26,6.25-1.21,9.63.68,2.16,1.97,3.92,3.77,5.32,2.02,1.57,4.29,2.47,6.85,2.64,1.48.1,2.98.1,4.38.73,4.68,2.1,5.63,7.4,3.21,10.99-1.52,2.24-3.77,3.2-6.42,3.32-3.59.16-6.69-1.91-7.78-5.25-.23-.71-.29-1.47-.43-2.2h-4.55c-.13,4.12,2.51,8.5,6.32,10.4,3.97,1.98,8,2.08,12.03.2,4.76-2.22,7.47-7.32,6.61-12.5-.67-4.02-2.94-6.89-6.54-8.75Z"/>
+        <path class="cls-6" d="M310.07,135.16c-2.37-2.08-5.22-2.88-8.31-2.9-2.36-.01-4.72,0-7.09,0h0s-4.64,0-4.64,0v40.59h.03v.02h4.6v-.02h0v-18.01h.69c2.1,0,4.2,0,6.3,0,1.61,0,3.2-.18,4.72-.76,3.81-1.47,6.44-4.06,7.31-8.12.91-4.26-.33-7.89-3.62-10.78ZM302.79,150.34c-2.59.1-5.2.05-7.79.06-.12,0-.2-.04-.29-.04v-13.68h.18c2.8,0,5.61-.09,8.4.12,3.27.24,6.14,3.08,6.07,6.88-.06,3.69-2.97,6.52-6.56,6.66Z"/>
+        <path class="cls-6" d="M361.78,154.05c3.82-1.47,6.44-4.06,7.31-8.12.91-4.26-.33-7.89-3.62-10.78-2.37-2.08-5.22-2.88-8.31-2.9-3.7-.02-7.41,0-11.11,0h-.62c0,13.56.03,27.08.03,40.6h4.6v-18.03h.69c1.09,0,2.17,0,3.26,0l12.03,18.06h5.39l-12.13-18.21c.84-.11,1.67-.3,2.48-.61ZM350.4,150.4c-.12,0-.2-.04-.29-.04v-13.68h.18c2.8,0,5.61-.09,8.4.12,3.27.24,6.13,3.08,6.07,6.88-.06,3.69-2.97,6.52-6.56,6.66-2.6.1-5.2.05-7.79.06Z"/>
+        <polygon class="cls-6" points="469.03 165.24 447.05 132.34 442.53 132.34 442.53 172.84 447.05 172.84 447.05 139.95 469.03 172.84 473.55 172.84 473.55 132.34 469.03 132.34 469.03 165.24"/>
+        <path class="cls-6" d="M318.45,132.28h0v40.55h0c7.18,0,14.32,0,21.45,0h0s.05,0,.05,0v-4.44h-.05c-5.63,0-11.22,0-16.8,0v-14.17h16.2v-4.45h-16.2v-13.05h16.79s.05,0,.05,0v-4.44h-21.5Z"/>
+        <path class="cls-6" d="M397.79,136.72v-4.43h0s-25.96-.01-25.96-.01h0v4.47h10.65v36.08h4.65v-36.1h10.66v-.02Z"/>
+        <path class="cls-6" d="M479.04,132.28h0v40.55h0c6.99,0,13.94,0,20.89,0h.61v-4.44h-.05c-5.63,0-11.22,0-16.8,0v-14.17h16.2v-4.45h-16.2v-13.05h16.79s.05,0,.05,0v-4.44h-21.5Z"/>
+        <path class="cls-6" d="M417.64,131.69c-11.5,0-20.85,9.35-20.85,20.85s9.35,20.85,20.85,20.85,20.85-9.35,20.85-20.85-9.35-20.85-20.85-20.85ZM417.64,168.88c-9.01,0-16.34-7.33-16.34-16.34s7.33-16.34,16.34-16.34,16.34,7.33,16.34,16.34-7.33,16.34-16.34,16.34Z"/>
+      </g>
+    </g>
+  </g>
+</svg>
@@ -0,0 +1,10 @@
+import SwiftUI
+
+@main
+struct ExampleiOSApp: App {
+    var body: some Scene {
+        WindowGroup {
+            ContentView()
+        }
+    }
+}
@@ -0,0 +1,30 @@
+import Foundation
+import AVFoundation
+
+final class AudioPlayer: NSObject, AVAudioPlayerDelegate {
+    private var player: AVAudioPlayer?
+    private var onFinish: (() -> Void)?
+
+    func play(url: URL, onFinish: (() -> Void)? = nil) {
+        self.onFinish = onFinish
+        do {
+            let data = try Data(contentsOf: url)
+            let player = try AVAudioPlayer(data: data)
+            player.delegate = self
+            player.prepareToPlay()
+            player.play()
+            self.player = player
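+        } catch {
+            print("Audio play error: \(error)")
+            // Playback never started: notify the caller so it can reset its playing state
+            self.onFinish?()
+            self.onFinish = nil
+        }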
+    }
+
+    func stop() {
+        player?.stop()
+        player = nil
+    }
+
+    func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
+        onFinish?()
+    }
+}
@@ -0,0 +1,98 @@
+import SwiftUI
+
+struct ContentView: View {
+    @StateObject private var vm = TTSViewModel()
+
+    var body: some View {
+        ZStack {
+            LinearGradient(gradient: Gradient(colors: [Color(.systemBackground), Color(.secondarySystemBackground)]), startPoint: .topLeading, endPoint: .bottomTrailing)
+                .ignoresSafeArea()
+
+            VStack(spacing: 20) {
+                Spacer()
+
+                VStack(spacing: 12) {
+                    Text("SupertonicTTS iOS Demo")
+                        .font(.title2.weight(.semibold))
+                        .foregroundColor(.primary)
+
+                    ZStack(alignment: .topLeading) {
+                        if vm.text.isEmpty {
+                            Text("Type text to synthesize")
+                                .foregroundColor(.secondary)
+                                .padding(.horizontal, 14)
+                                .padding(.vertical, 12)
+                        }
+                        TextEditor(text: $vm.text)
+                            .frame(minHeight: 120, maxHeight: 180)
+                            .padding(8)
+                            .background(Color(.secondarySystemBackground))
+                            .cornerRadius(12)
+                            .overlay(
+                                RoundedRectangle(cornerRadius: 12)
+                                    .stroke(Color.secondary.opacity(0.3), lineWidth: 1)
+                            )
+                    }
+                    .padding(.horizontal)
+
+                    HStack(spacing: 12) {
+                        Text("NFE")
+                            .font(.subheadline)
+                            .foregroundColor(.secondary)
+                        Slider(value: $vm.nfe, in: 2...15, step: 1)
+                        Text("\(Int(vm.nfe))")
+                            .font(.subheadline.monospacedDigit())
+                            .frame(width: 36)
+                    }
+                    .padding(.horizontal)
+
+                    Picker("Voice", selection: $vm.voice) {
+                        Text("M").tag(TTSService.Voice.male)
+                        Text("F").tag(TTSService.Voice.female)
+                    }
+                    .pickerStyle(SegmentedPickerStyle())
+                    .padding(.horizontal)
+                }
+
+                HStack(spacing: 16) {
+                    Button(action: { vm.generate() }) {
+                        Label(vm.isGenerating ? "Generating..." : "Generate", systemImage: vm.isGenerating ? "hourglass" : "wand.and.stars")
+                        .labelStyle(.titleAndIcon)
+                    }
+                    .buttonStyle(.borderedProminent)
+                    .tint(.accentColor)
+                    .disabled(vm.isGenerating)
+
+                    Button(action: { vm.togglePlay() }) {
+                        Label(vm.isPlaying ? "Stop" : "Play", systemImage: vm.isPlaying ? "stop.fill" : "play.fill")
+                    }
+                    .buttonStyle(.bordered)
+                    .disabled(vm.audioURL == nil)
+                }
+
+                if let rtf = vm.rtfText {
+                    Text(rtf)
+                        .font(.footnote.monospacedDigit())
+                        .foregroundColor(.secondary)
+                        .padding(.top, 2)
+                }
+
+                if let error = vm.errorMessage {
+                    Text(error)
+                        .foregroundColor(.red)
+                        .font(.footnote)
+                        .multilineTextAlignment(.center)
+                        .padding(.horizontal)
+                }
+
+                Spacer()
+            }
+        }
+        .onAppear { vm.startup() }
+    }
+}
+
+#Preview {
+    ContentView()
+}
@@ -0,0 +1,29 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+	<key>CFBundleDevelopmentRegion</key>
+	<string>en</string>
+	<key>CFBundleExecutable</key>
+	<string>$(EXECUTABLE_NAME)</string>
+	<key>CFBundleIdentifier</key>
+	<string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
+	<key>CFBundleInfoDictionaryVersion</key>
+	<string>6.0</string>
+	<key>CFBundleName</key>
+	<string>ExampleiOSApp</string>
+	<key>CFBundlePackageType</key>
+	<string>APPL</string>
+	<key>CFBundleShortVersionString</key>
+	<string>1.0</string>
+	<key>CFBundleVersion</key>
+	<string>1</string>
+	<key>UILaunchScreen</key>
+	<dict/>
+	<key>UIApplicationSceneManifest</key>
+	<dict>
+		<key>UIApplicationSupportsMultipleScenes</key>
+		<false/>
+	</dict>
+</dict>
+</plist>
@@ -0,0 +1,140 @@
+import Foundation
+import OnnxRuntimeBindings
+
+final class TTSService {
+    enum Voice { case male, female }
+
+    struct Settings {
+        var nTest: Int = 1
+    }
+
+    struct SynthesisResult {
+        let url: URL
+        let elapsedSeconds: Double
+        let audioSeconds: Double
+        var rtf: Double { elapsedSeconds / max(audioSeconds, 1e-6) }
+    }
+
+    private let env: ORTEnv
+    private let textToSpeech: TextToSpeech
+    private let bundleOnnxDir: String
+    private let sampleRate: Int
+
+    // Cached style per voice (precomputed at startup or on first use)
+    private var cachedStyle: [Voice: Style] = [:]
+
+    init() throws {
+        bundleOnnxDir = try Self.locateOnnxDirInBundle()
+        env = try ORTEnv(loggingLevel: .warning)
+        textToSpeech = try loadTextToSpeech(bundleOnnxDir, false, env)
+        sampleRate = textToSpeech.sampleRate
+    }
+
+    // Public warmup: precompute styles and run a quick generation to warm models
+    func warmup(nfe: Int = 1) async {
+        do { try precomputeStyle(for: .male) } catch { print("Warmup style (M) error: \(error)") }
+        do { try precomputeStyle(for: .female) } catch { print("Warmup style (F) error: \(error)") }
+        // Run a tiny synthesis to JIT/warm up kernels; discard file
+        do {
+            let res = try await synthesize(text: "Warm up", nfe: max(1, nfe), voice: .male)
+            try? FileManager.default.removeItem(at: res.url)
+        } catch {
+            print("Warmup synth error: \(error)")
+        }
+    }
+
+    func synthesize(text: String, nfe: Int, voice: Voice, settings: Settings = Settings()) async throws -> SynthesisResult {
+        let tic = Date()
+
+        // 1) Get or compute style for the selected voice
+        let style = try getStyle(voice: voice)
+
+        // 2) Synthesize via packed TextToSpeech component
+        let (wav, duration) = try textToSpeech.call([text], style, nfe)
+        let audioSeconds = Double(duration[0])
+        let wavLenSample = min(Int(Double(sampleRate) * audioSeconds), wav.count)
+        let wavOut = Array(wav[0..<wavLenSample])
+
+        let tmpURL = FileManager.default.temporaryDirectory.appendingPathComponent("supertonic_tts_\(UUID().uuidString).wav")
+        try writeWavFile(tmpURL.path, wavOut, sampleRate)
+
+        let elapsed = Date().timeIntervalSince(tic)
+        return SynthesisResult(url: tmpURL, elapsedSeconds: elapsed, audioSeconds: audioSeconds)
+    }
+
+    // MARK: - Style helpers
+    private func precomputeStyle(for voice: Voice) throws {
+        if cachedStyle[voice] != nil { return }
+        let styleURL = try Self.locateVoiceStyleURL(voice: voice)
+        let style = try loadVoiceStyle([styleURL.path], verbose: false)
+        cachedStyle[voice] = style
+    }
+
+    private func getStyle(voice: Voice) throws -> Style {
+        if let style = cachedStyle[voice] { return style }
+        try precomputeStyle(for: voice)
+        return cachedStyle[voice]!
+    }
+
+    // MARK: - Resource location helpers
+    private static func locateOnnxDirInBundle() throws -> String {
+        let bundle = Bundle.main
+        let fm = FileManager.default
+
+        func dirHasRequiredFiles(_ dir: URL) -> Bool {
+            let required = [
+                "tts.json",
+                "duration_predictor.onnx",
+                "text_encoder.onnx",
+                "vector_estimator.onnx",
+                "vocoder.onnx"
+            ]
+            return required.allSatisfy { fm.fileExists(atPath: dir.appendingPathComponent($0).path) }
+        }
+
+        var candidates: [URL] = []
+        if let dir = bundle.resourceURL?.appendingPathComponent("onnx", isDirectory: true) { candidates.append(dir) }
+        if let dir = bundle.resourceURL?.appendingPathComponent("assets/onnx", isDirectory: true) { candidates.append(dir) }
+        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "onnx") { candidates.append(url.deletingLastPathComponent()) }
+        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "assets/onnx") { candidates.append(url.deletingLastPathComponent()) }
+        if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: nil) { candidates.append(url.deletingLastPathComponent()) }
+        if let root = bundle.resourceURL { candidates.append(root) }
+
+        for dir in candidates {
+            if dirHasRequiredFiles(dir) { return dir.path }
+        }
+        throw NSError(
+            domain: "TTS",
+            code: -100,
+            userInfo: [NSLocalizedDescriptionKey: "Could not find the onnx directory in the bundle. Please make sure the onnx folder (as a folder reference) is included in Copy Bundle Resources in Xcode."]
+        )
+    }
+
+    private static func locateVoiceStyleURL(voice: Voice) throws -> URL {
+        // Prefer M1/F1 defaults; search common subdirectories
+        let fileName = (voice == .male) ? "M1" : "F1"
+        let bundle = Bundle.main
+        let candidates: [URL?] = [
+            bundle.url(forResource: fileName, withExtension: "json", subdirectory: "voice_styles"),
+            bundle.url(forResource: fileName, withExtension: "json", subdirectory: "assets/voice_styles"),
+            bundle.url(forResource: fileName, withExtension: "json", subdirectory: nil)
+        ]
+        for url in candidates {
+            if let url = url { return url }
+        }
+        // Fallback: scan folders if needed
+        if let folder1 = bundle.resourceURL?.appendingPathComponent("voice_styles", isDirectory: true) {
+            let file = folder1.appendingPathComponent("\(fileName).json")
+            if FileManager.default.fileExists(atPath: file.path) { return file }
+        }
+        if let folder2 = bundle.resourceURL?.appendingPathComponent("assets/voice_styles", isDirectory: true) {
+            let file = folder2.appendingPathComponent("\(fileName).json")
+            if FileManager.default.fileExists(atPath: file.path) { return file }
+        }
+        throw NSError(
+            domain: "TTS",
+            code: -102,
+            userInfo: [NSLocalizedDescriptionKey: "Could not find the voice style JSON (\(fileName).json) in the bundle. Ensure voice_styles folder is included in Copy Bundle Resources."]
+        )
+    }
+}
@@ -0,0 +1,76 @@
+import Foundation
+import AVFoundation
+
+@MainActor
+final class TTSViewModel: ObservableObject {
+    @Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+    @Published var nfe: Double = 5
+    @Published var voice: TTSService.Voice = .male
+    @Published var isGenerating: Bool = false
+    @Published var isPlaying: Bool = false
+    @Published var errorMessage: String?
+    @Published var audioURL: URL?
+
+    @Published var elapsedSeconds: Double?
+    @Published var audioSeconds: Double?
+
+    private var service: TTSService?
+    private var player = AudioPlayer()
+
+    var rtfText: String? {
+        guard let e = elapsedSeconds, let a = audioSeconds, a > 0 else { return nil }
+        let rtf = e / a
+        return String(format: "RTF %.2fx · %.2fs / %.2fs", rtf, e, a)
+    }
+
+    func startup() {
+        do {
+            service = try TTSService()
+            Task { await self.service?.warmup(nfe: 5) }
+        } catch {
+            errorMessage = "Failed to init TTS: \(error.localizedDescription)"
+        }
+    }
+
+    func generate() {
+        guard let service = service else { return }
+        isGenerating = true
+        errorMessage = nil
+        audioURL = nil
+        elapsedSeconds = nil
+        audioSeconds = nil
+        Task {
+            do {
+                let result = try await service.synthesize(text: text, nfe: Int(nfe), voice: voice)
+                await MainActor.run {
+                    self.audioURL = result.url
+                    self.elapsedSeconds = result.elapsedSeconds
+                    self.audioSeconds = result.audioSeconds
+                    self.isGenerating = false
+                }
+                self.play(url: result.url)
+            } catch {
+                await MainActor.run {
+                    self.errorMessage = error.localizedDescription
+                    self.isGenerating = false
+                }
+            }
+        }
+    }
+
+    func togglePlay() {
+        if isPlaying {
+            player.stop()
+            isPlaying = false
+        } else if let url = audioURL {
+            play(url: url)
+        }
+    }
+
+    private func play(url: URL) {
+        player.play(url: url) { [weak self] in
+            DispatchQueue.main.async { self?.isPlaying = false }
+        }
+        isPlaying = true
+    }
+}
@@ -0,0 +1,29 @@
+name: ExampleiOSApp
+options:
+  minimumXcodeGenVersion: 2.37.0
+packages:
+  onnxruntime:
+    url: https://github.com/microsoft/onnxruntime-swift-package-manager.git
+    from: 1.16.0
+targets:
+  ExampleiOSApp:
+    type: application
+    platform: iOS
+    deploymentTarget: "15.0"
+    sources:
+      - path: .
+      - path: ../../swift/Sources/Helper.swift
+        type: file
+    resources:
+      - path: onnx
+        type: folder
+      - path: voice_styles
+        type: folder
+    settings:
+      base:
+        PRODUCT_BUNDLE_IDENTIFIER: com.supertonic.ExampleiOSApp
+        SWIFT_VERSION: 5.9
+        INFOPLIST_FILE: Info.plist
+    dependencies:
+      - package: onnxruntime
+        product: onnxruntime
@@ -0,0 +1,59 @@
+# Supertonic iOS Example App
+
+A minimal iOS demo that runs Supertonic (ONNX Runtime) on-device. The app shows:
+- Multiline text input
+- NFE (denoising steps) slider
+- Voice toggle (M/F)
+- Generate & Play buttons
+- RTF display (Elapsed / Audio seconds)
+
+All ONNX models/configs are reused from `Supertonic/assets/onnx`, and voice style JSON files from `Supertonic/assets/voice_styles`.
+
+## Prerequisites
+- macOS 13+, Xcode 15+
+- Swift 5.9+
+- iOS 15+ device (recommended)
+- Homebrew, XcodeGen
+
+Install tools (if needed):
+```bash
+brew install xcodegen
+```
+
+## Quick Start (zero-click in Xcode)
+0) Prepare assets next to the iOS target (one-time)
+```bash
+cd ios/ExampleiOSApp
+mkdir -p onnx voice_styles
+rsync -a ../../assets/onnx/ onnx/
+rsync -a ../../assets/voice_styles/ voice_styles/
+```
+
+1) Generate the Xcode project with XcodeGen
+```bash
+xcodegen generate
+open ExampleiOSApp.xcodeproj
+```
+
+2) Open in Xcode and select your iPhone as the run destination
+- Targets → ExampleiOSApp → Signing & Capabilities: Select your Team
+- iOS Deployment Target: 15.0+
+
+3) Build & Run on device
+- Type text → adjust NFE/Voice → Tap Generate → Audio plays automatically
+- An RTF line shows like: `RTF 0.30x · 3.04s / 10.11s`
+
+## What's included (generated project)
+- SwiftUI app files: `App.swift`, `ContentView.swift`, `TTSViewModel.swift`, `AudioPlayer.swift`
+- Runtime wrapper: `TTSService.swift` (includes TTS inference logic)
+- Resources (local, vendored in `ios/ExampleiOSApp/onnx` and `ios/ExampleiOSApp/voice_styles` after step 0)
+
+These references are defined in `project.yml` and added to the app bundle by XcodeGen.
+
+## App Controls
+- **Text**: Multiline `TextEditor`
+- **NFE**: Denoising steps (default 5)
+- **Voice**: M/F toggle (loads the `M1` or `F1` voice style)
+- **Generate**: Runs end-to-end synthesis
+- **Play/Stop**: Controls playback of the last output
+- **RTF**: Real-time factor, shown as elapsed synthesis seconds / audio seconds (below 1.0 means faster than real time)
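+
+As a rough sketch of how the RTF line is derived, the snippet below mirrors `SynthesisResult.rtf` in `TTSService.swift` and the format string in `TTSViewModel.swift`. The free function `realTimeFactor` is a hypothetical name used only for illustration and is not part of the app.
+
+```swift
+import Foundation
+
+// Real-time factor: synthesis time divided by the duration of the generated audio.
+// The small lower bound avoids division by zero for empty audio, as in TTSService.swift.
+// NOTE: `realTimeFactor` is an illustrative helper, not an API of the example app.
+func realTimeFactor(elapsedSeconds: Double, audioSeconds: Double) -> Double {
+    elapsedSeconds / max(audioSeconds, 1e-6)
+}
+
+let elapsed = 3.04   // seconds spent synthesizing
+let audio = 10.11    // seconds of audio produced
+let rtf = realTimeFactor(elapsedSeconds: elapsed, audioSeconds: audio)
+
+// Same layout as the line shown in the app
+print(String(format: "RTF %.2fx · %.2fs / %.2fs", rtf, elapsed, audio))
+// prints "RTF 0.30x · 3.04s / 10.11s"
+```
+
+An RTF below 1.0 means synthesis finished faster than the audio takes to play back.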
@@ -0,0 +1,35 @@
+# Maven
+target/
+pom.xml.tag
+pom.xml.releaseBackup
+pom.xml.versionsBackup
+pom.xml.next
+release.properties
+dependency-reduced-pom.xml
+buildNumber.properties
+.mvn/timing.properties
+.mvn/wrapper/maven-wrapper.jar
+
+# Compiled class files
+*.class
+
+# IntelliJ IDEA
+.idea/
+*.iml
+*.iws
+*.ipr
+
+# Eclipse
+.classpath
+.project
+.settings/
+
+# VS Code
+.vscode/
+
+# Results
+results/*.wav
+
+# Mac
+.DS_Store
+
@@ -0,0 +1,141 @@
+import ai.onnxruntime.*;
+
+import java.io.File;
+import java.util.*;
+
+/**
+ * TTS Inference Example with ONNX Runtime (Java)
+ */
+public class ExampleONNX {
+    
+    /**
+     * Command line arguments
+     */
+    static class Args {
+        boolean useGpu = false;
+        String onnxDir = "assets/onnx";
+        int totalStep = 5;
+        int nTest = 4;
+        List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
+        List<String> text = Arrays.asList(
+            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+        );
+        String saveDir = "results";
+    }
+    
+    /**
+     * Parse command line arguments
+     */
+    private static Args parseArgs(String[] args) {
+        Args result = new Args();
+        
+        for (int i = 0; i < args.length; i++) {
+            switch (args[i]) {
+                case "--use-gpu":
+                    result.useGpu = true;
+                    break;
+                case "--onnx-dir":
+                    if (i + 1 < args.length) result.onnxDir = args[++i];
+                    break;
+                case "--total-step":
+                    if (i + 1 < args.length) result.totalStep = Integer.parseInt(args[++i]);
+                    break;
+                case "--n-test":
+                    if (i + 1 < args.length) result.nTest = Integer.parseInt(args[++i]);
+                    break;
+                case "--voice-style":
+                    if (i + 1 < args.length) {
+                        result.voiceStyle = Arrays.asList(args[++i].split(","));
+                    }
+                    break;
+                case "--text":
+                    if (i + 1 < args.length) {
+                        result.text = Arrays.asList(args[++i].split("\\|"));
+                    }
+                    break;
+                case "--save-dir":
+                    if (i + 1 < args.length) result.saveDir = args[++i];
+                    break;
+            }
+        }
+        
+        return result;
+    }
+    
+    /**
+     * Main inference function
+     */
+    public static void main(String[] args) {
+        try {
+            System.out.println("=== TTS Inference with ONNX Runtime (Java) ===\n");
+            
+            // --- 1. Parse arguments --- //
+            Args parsedArgs = parseArgs(args);
+            int totalStep = parsedArgs.totalStep;
+            int nTest = parsedArgs.nTest;
+            String saveDir = parsedArgs.saveDir;
+            List<String> voiceStylePaths = parsedArgs.voiceStyle;
+            List<String> textList = parsedArgs.text;
+            
+            if (voiceStylePaths.size() != textList.size()) {
+                throw new RuntimeException("Number of voice styles (" + voiceStylePaths.size() + 
+                    ") must match number of texts (" + textList.size() + ")");
+            }
+            
+            int bsz = voiceStylePaths.size();
+            OrtEnvironment env = OrtEnvironment.getEnvironment();
+            
+            // --- 2. Load TTS components --- //
+            TextToSpeech textToSpeech = Helper.loadTextToSpeech(parsedArgs.onnxDir, parsedArgs.useGpu, env);
+            
+            // --- 3. Load voice styles --- //
+            Style style = Helper.loadVoiceStyle(voiceStylePaths, true, env);
+            
+            // --- 4. Synthesize speech --- //
+            File saveDirFile = new File(saveDir);
+            if (!saveDirFile.exists()) {
+                saveDirFile.mkdirs();
+            }
+            
+            for (int n = 0; n < nTest; n++) {
+                System.out.println("\n[" + (n + 1) + "/" + nTest + "] Starting synthesis...");
+                
+                TTSResult ttsResult = Helper.timer("Generating speech from text", () -> {
+                    try {
+                        return textToSpeech.call(textList, style, totalStep, env);
+                    } catch (Exception e) {
+                        throw new RuntimeException(e);
+                    }
+                });
+                
+                float[] wav = ttsResult.wav;
+                float[] duration = ttsResult.duration;
+                
+                // Save outputs
+                int wavLen = wav.length / bsz;
+                for (int i = 0; i < bsz; i++) {
+                    String fname = Helper.sanitizeFilename(textList.get(i), 20) + "_" + (n + 1) + ".wav";
+                    // Clamp to the per-item buffer length (as the Swift example does) so no zero padding is written
+                    int actualLen = Math.min((int) (textToSpeech.sampleRate * duration[i]), wavLen);
+                    
+                    float[] wavOut = new float[actualLen];
+                    System.arraycopy(wav, i * wavLen, wavOut, 0, actualLen);
+                    
+                    String outputPath = saveDir + "/" + fname;
+                    Helper.writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
+                    System.out.println("Saved: " + outputPath);
+                }
+            }
+            
+            // Clean up
+            style.close();
+            textToSpeech.close();
+            
+            System.out.println("\n=== Synthesis completed successfully! ===");
+            
+        } catch (Exception e) {
+            System.err.println("Error during inference: " + e.getMessage());
+            e.printStackTrace();
+            System.exit(1);
+        }
+    }
+}
@@ -0,0 +1,597 @@
+import ai.onnxruntime.*;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+import javax.sound.sampled.AudioFileFormat;
+import javax.sound.sampled.AudioFormat;
+import javax.sound.sampled.AudioInputStream;
+import javax.sound.sampled.AudioSystem;
+import java.io.*;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.LongBuffer;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.text.Normalizer;
+import java.util.*;
+
+/**
+ * Configuration classes
+ */
+class Config {
+    static class AEConfig {
+        int sampleRate;
+        int baseChunkSize;
+    }
+    
+    static class TTLConfig {
+        int chunkCompressFactor;
+        int latentDim;
+    }
+    
+    AEConfig ae;
+    TTLConfig ttl;
+}
+
+/**
+ * Voice Style Data from JSON
+ */
+class VoiceStyleData {
+    static class StyleData {
+        float[][][] data;
+        long[] dims;
+        String type;
+    }
+    
+    StyleData styleTtl;
+    StyleData styleDp;
+}
+
+/**
+ * Unicode text processor
+ */
+class UnicodeProcessor {
+    private long[] indexer;
+    
+    public UnicodeProcessor(String unicodeIndexerJsonPath) throws IOException {
+        this.indexer = Helper.loadJsonLongArray(unicodeIndexerJsonPath);
+    }
+    
+    public TextProcessResult call(List<String> textList) {
+        List<String> processedTexts = new ArrayList<>();
+        for (String text : textList) {
+            processedTexts.add(preprocessText(text));
+        }
+        
+        // Convert to code points first so lengths and IDs agree even when the
+        // text contains characters outside the BMP (surrogate pairs in UTF-16)
+        int[][] unicodeVals = new int[processedTexts.size()][];
+        int[] textIdsLengths = new int[processedTexts.size()];
+        int maxLen = 0;
+        for (int i = 0; i < processedTexts.size(); i++) {
+            unicodeVals[i] = textToUnicodeValues(processedTexts.get(i));
+            textIdsLengths[i] = unicodeVals[i].length;
+            maxLen = Math.max(maxLen, textIdsLengths[i]);
+        }
+        
+        long[][] textIds = new long[processedTexts.size()][maxLen];
+        for (int i = 0; i < processedTexts.size(); i++) {
+            for (int j = 0; j < unicodeVals[i].length; j++) {
+                textIds[i][j] = indexer[unicodeVals[i][j]];
+            }
+        }
+        
+        float[][][] textMask = getTextMask(textIdsLengths);
+        return new TextProcessResult(textIds, textMask);
+    }
+    
+    private String preprocessText(String text) {
+        return Normalizer.normalize(text, Normalizer.Form.NFKD);
+    }
+    
+    private int[] textToUnicodeValues(String text) {
+        // One entry per Unicode code point, not per UTF-16 code unit
+        return text.codePoints().toArray();
+    }
+    
+    private float[][][] getTextMask(int[] lengths) {
+        int bsz = lengths.length;
+        int maxLen = 0;
+        for (int len : lengths) {
+            maxLen = Math.max(maxLen, len);
+        }
+        
+        float[][][] mask = new float[bsz][1][maxLen];
+        for (int i = 0; i < bsz; i++) {
+            for (int j = 0; j < maxLen; j++) {
+                mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
+            }
+        }
+        return mask;
+    }
+    
+    static class TextProcessResult {
+        long[][] textIds;
+        float[][][] textMask;
+        
+        TextProcessResult(long[][] textIds, float[][][] textMask) {
+            this.textIds = textIds;
+            this.textMask = textMask;
+        }
+    }
+}
+
+/**
+ * Text-to-Speech inference class
+ */
+class TextToSpeech {
+    private Config config;
+    private UnicodeProcessor textProcessor;
+    private OrtSession dpSession;
+    private OrtSession textEncSession;
+    private OrtSession vectorEstSession;
+    private OrtSession vocoderSession;
+    public int sampleRate;
+    private int baseChunkSize;
+    private int chunkCompress;
+    private int ldim;
+    
+    public TextToSpeech(Config config, UnicodeProcessor textProcessor,
+                       OrtSession dpSession, OrtSession textEncSession,
+                       OrtSession vectorEstSession, OrtSession vocoderSession) {
+        this.config = config;
+        this.textProcessor = textProcessor;
+        this.dpSession = dpSession;
+        this.textEncSession = textEncSession;
+        this.vectorEstSession = vectorEstSession;
+        this.vocoderSession = vocoderSession;
+        this.sampleRate = config.ae.sampleRate;
+        this.baseChunkSize = config.ae.baseChunkSize;
+        this.chunkCompress = config.ttl.chunkCompressFactor;
+        this.ldim = config.ttl.latentDim;
+    }
+    
+    public TTSResult call(List<String> textList, Style style, int totalStep, OrtEnvironment env) 
+            throws OrtException {
+        int bsz = textList.size();
+        
+        // Process text
+        UnicodeProcessor.TextProcessResult textResult = textProcessor.call(textList);
+        long[][] textIds = textResult.textIds;
+        float[][][] textMask = textResult.textMask;
+        
+        // Create tensors
+        OnnxTensor textIdsTensor = Helper.createLongTensor(textIds, env);
+        OnnxTensor textMaskTensor = Helper.createFloatTensor(textMask, env);
+        
+        // Predict duration
+        Map<String, OnnxTensor> dpInputs = new HashMap<>();
+        dpInputs.put("text_ids", textIdsTensor);
+        dpInputs.put("style_dp", style.dpTensor);
+        dpInputs.put("text_mask", textMaskTensor);
+        
+        OrtSession.Result dpResult = dpSession.run(dpInputs);
+        Object dpValue = dpResult.get(0).getValue();
+        float[] duration;
+        if (dpValue instanceof float[][]) {
+            duration = ((float[][]) dpValue)[0];
+        } else {
+            duration = (float[]) dpValue;
+        }
+        
+        // Encode text
+        Map<String, OnnxTensor> textEncInputs = new HashMap<>();
+        textEncInputs.put("text_ids", textIdsTensor);
+        textEncInputs.put("style_ttl", style.ttlTensor);
+        textEncInputs.put("text_mask", textMaskTensor);
+        
+        OrtSession.Result textEncResult = textEncSession.run(textEncInputs);
+        OnnxTensor textEmbTensor = (OnnxTensor) textEncResult.get(0);
+        
+        // Sample noisy latent
+        NoisyLatentResult noisyLatentResult = sampleNoisyLatent(duration);
+        float[][][] xt = noisyLatentResult.noisyLatent;
+        float[][][] latentMask = noisyLatentResult.latentMask;
+        
+        // Prepare constant tensors
+        float[] totalStepArray = new float[bsz];
+        Arrays.fill(totalStepArray, (float) totalStep);
+        OnnxTensor totalStepTensor = OnnxTensor.createTensor(env, totalStepArray);
+        
+        // Denoising loop
+        // The masks do not change between steps, so the latent mask tensor is built
+        // once and textMaskTensor is reused instead of being re-created per step.
+        OnnxTensor latentMaskTensor = Helper.createFloatTensor(latentMask, env);
+        for (int step = 0; step < totalStep; step++) {
+            float[] currentStepArray = new float[bsz];
+            Arrays.fill(currentStepArray, (float) step);
+            OnnxTensor currentStepTensor = OnnxTensor.createTensor(env, currentStepArray);
+            OnnxTensor noisyLatentTensor = Helper.createFloatTensor(xt, env);
+            
+            Map<String, OnnxTensor> vectorEstInputs = new HashMap<>();
+            vectorEstInputs.put("noisy_latent", noisyLatentTensor);
+            vectorEstInputs.put("text_emb", textEmbTensor);
+            vectorEstInputs.put("style_ttl", style.ttlTensor);
+            vectorEstInputs.put("latent_mask", latentMaskTensor);
+            vectorEstInputs.put("text_mask", textMaskTensor);
+            vectorEstInputs.put("current_step", currentStepTensor);
+            vectorEstInputs.put("total_step", totalStepTensor);
+            
+            OrtSession.Result vectorEstResult = vectorEstSession.run(vectorEstInputs);
+            // getValue() copies the output into a Java array, so the result can be closed below
+            float[][][] denoised = (float[][][]) vectorEstResult.get(0).getValue();
+            
+            // Update latent
+            xt = denoised;
+            
+            // Clean up per-step tensors
+            currentStepTensor.close();
+            noisyLatentTensor.close();
+            vectorEstResult.close();
+        }
+        latentMaskTensor.close();
+        
+        // Generate waveform
+        OnnxTensor finalLatentTensor = Helper.createFloatTensor(xt, env);
+        Map<String, OnnxTensor> vocoderInputs = new HashMap<>();
+        vocoderInputs.put("latent", finalLatentTensor);
+        
+        OrtSession.Result vocoderResult = vocoderSession.run(vocoderInputs);
+        float[][] wavBatch = (float[][]) vocoderResult.get(0).getValue();
+        float[] wav = wavBatch[0];
+        
+        // Clean up
+        textIdsTensor.close();
+        textMaskTensor.close();
+        dpResult.close();
+        textEncResult.close();
+        totalStepTensor.close();
+        finalLatentTensor.close();
+        vocoderResult.close();
+        
+        return new TTSResult(wav, duration);
+    }
+    
+    private NoisyLatentResult sampleNoisyLatent(float[] duration) {
+        int bsz = duration.length;
+        float maxDur = 0;
+        for (float d : duration) {
+            maxDur = Math.max(maxDur, d);
+        }
+        
+        long wavLenMax = (long) (maxDur * sampleRate);
+        long[] wavLengths = new long[bsz];
+        for (int i = 0; i < bsz; i++) {
+            wavLengths[i] = (long) (duration[i] * sampleRate);
+        }
+        
+        int chunkSize = baseChunkSize * chunkCompress;
+        int latentLen = (int) ((wavLenMax + chunkSize - 1) / chunkSize);
+        int latentDim = ldim * chunkCompress;
+        
+        Random rng = new Random();
+        float[][][] noisyLatent = new float[bsz][latentDim][latentLen];
+        for (int b = 0; b < bsz; b++) {
+            for (int d = 0; d < latentDim; d++) {
+                for (int t = 0; t < latentLen; t++) {
+                    // Box-Muller transform
+                    double u1 = Math.max(1e-10, rng.nextDouble());
+                    double u2 = rng.nextDouble();
+                    noisyLatent[b][d][t] = (float) (Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2));
+                }
+            }
+        }
+        
+        float[][][] latentMask = Helper.getLatentMask(wavLengths, config);
+        
+        // Apply mask
+        for (int b = 0; b < bsz; b++) {
+            for (int d = 0; d < latentDim; d++) {
+                for (int t = 0; t < latentLen; t++) {
+                    noisyLatent[b][d][t] *= latentMask[b][0][t];
+                }
+            }
+        }
+        
+        return new NoisyLatentResult(noisyLatent, latentMask);
+    }
+    
+    public void close() throws OrtException {
+        if (dpSession != null) dpSession.close();
+        if (textEncSession != null) textEncSession.close();
+        if (vectorEstSession != null) vectorEstSession.close();
+        if (vocoderSession != null) vocoderSession.close();
+    }
+}
+
+/**
+ * Style holder class
+ */
+class Style {
+    OnnxTensor ttlTensor;
+    OnnxTensor dpTensor;
+    
+    Style(OnnxTensor ttlTensor, OnnxTensor dpTensor) {
+        this.ttlTensor = ttlTensor;
+        this.dpTensor = dpTensor;
+    }
+    
+    public void close() throws OrtException {
+        if (ttlTensor != null) ttlTensor.close();
+        if (dpTensor != null) dpTensor.close();
+    }
+}
+
+/**
+ * TTS result holder
+ */
+class TTSResult {
+    float[] wav;
+    float[] duration;
+    
+    TTSResult(float[] wav, float[] duration) {
+        this.wav = wav;
+        this.duration = duration;
+    }
+}
+
+/**
+ * Noisy latent result holder
+ */
+class NoisyLatentResult {
+    float[][][] noisyLatent;
+    float[][][] latentMask;
+    
+    NoisyLatentResult(float[][][] noisyLatent, float[][][] latentMask) {
+        this.noisyLatent = noisyLatent;
+        this.latentMask = latentMask;
+    }
+}
+
+/**
+ * Helper utility class
+ */
+public class Helper {
+    
+    /**
+     * Load voice style from JSON files
+     */
+    public static Style loadVoiceStyle(List<String> voiceStylePaths, boolean verbose, OrtEnvironment env) 
+            throws IOException, OrtException {
+        int bsz = voiceStylePaths.size();
+        
+        // Read first file to get dimensions
+        ObjectMapper mapper = new ObjectMapper();
+        JsonNode firstRoot = mapper.readTree(new File(voiceStylePaths.get(0)));
+        
+        long[] ttlDims = new long[3];
+        for (int i = 0; i < 3; i++) {
+            ttlDims[i] = firstRoot.get("style_ttl").get("dims").get(i).asLong();
+        }
+        long[] dpDims = new long[3];
+        for (int i = 0; i < 3; i++) {
+            dpDims[i] = firstRoot.get("style_dp").get("dims").get(i).asLong();
+        }
+        
+        long ttlDim1 = ttlDims[1];
+        long ttlDim2 = ttlDims[2];
+        long dpDim1 = dpDims[1];
+        long dpDim2 = dpDims[2];
+        
+        // Pre-allocate arrays with full batch size
+        int ttlSize = (int) (bsz * ttlDim1 * ttlDim2);
+        int dpSize = (int) (bsz * dpDim1 * dpDim2);
+        float[] ttlFlat = new float[ttlSize];
+        float[] dpFlat = new float[dpSize];
+        
+        // Fill in the data
+        for (int i = 0; i < bsz; i++) {
+            JsonNode root = mapper.readTree(new File(voiceStylePaths.get(i)));
+            
+            // Flatten TTL data
+            int ttlOffset = (int) (i * ttlDim1 * ttlDim2);
+            int idx = 0;
+            JsonNode ttlData = root.get("style_ttl").get("data");
+            for (JsonNode batch : ttlData) {
+                for (JsonNode row : batch) {
+                    for (JsonNode val : row) {
+                        ttlFlat[ttlOffset + idx++] = (float) val.asDouble();
+                    }
+                }
+            }
+            
+            // Flatten DP data
+            int dpOffset = (int) (i * dpDim1 * dpDim2);
+            idx = 0;
+            JsonNode dpData = root.get("style_dp").get("data");
+            for (JsonNode batch : dpData) {
+                for (JsonNode row : batch) {
+                    for (JsonNode val : row) {
+                        dpFlat[dpOffset + idx++] = (float) val.asDouble();
+                    }
+                }
+            }
+        }
+        
+        long[] ttlShape = {bsz, ttlDim1, ttlDim2};
+        long[] dpShape = {bsz, dpDim1, dpDim2};
+        
+        OnnxTensor ttlTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(ttlFlat), ttlShape);
+        OnnxTensor dpTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(dpFlat), dpShape);
+        
+        if (verbose) {
+            System.out.println("Loaded " + bsz + " voice styles\n");
+        }
+        
+        return new Style(ttlTensor, dpTensor);
+    }
+    
+    /**
+     * Load TTS components
+     */
+    public static TextToSpeech loadTextToSpeech(String onnxDir, boolean useGpu, OrtEnvironment env) 
+            throws IOException, OrtException {
+        if (useGpu) {
+            throw new RuntimeException("GPU mode is not supported yet");
+        }
+        System.out.println("Using CPU for inference\n");
+        
+        // Load config
+        Config config = loadCfgs(onnxDir);
+        
+        // Create session options
+        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
+        
+        // Load models
+        OrtSession dpSession = env.createSession(onnxDir + "/duration_predictor.onnx", opts);
+        OrtSession textEncSession = env.createSession(onnxDir + "/text_encoder.onnx", opts);
+        OrtSession vectorEstSession = env.createSession(onnxDir + "/vector_estimator.onnx", opts);
+        OrtSession vocoderSession = env.createSession(onnxDir + "/vocoder.onnx", opts);
+        
+        // Load text processor
+        UnicodeProcessor textProcessor = new UnicodeProcessor(onnxDir + "/unicode_indexer.json");
+        
+        return new TextToSpeech(config, textProcessor, dpSession, textEncSession, vectorEstSession, vocoderSession);
+    }
+    
+    /**
+     * Load configuration from JSON
+     */
+    public static Config loadCfgs(String onnxDir) throws IOException {
+        ObjectMapper mapper = new ObjectMapper();
+        JsonNode root = mapper.readTree(new File(onnxDir + "/tts.json"));
+        
+        Config config = new Config();
+        config.ae = new Config.AEConfig();
+        config.ae.sampleRate = root.get("ae").get("sample_rate").asInt();
+        config.ae.baseChunkSize = root.get("ae").get("base_chunk_size").asInt();
+        
+        config.ttl = new Config.TTLConfig();
+        config.ttl.chunkCompressFactor = root.get("ttl").get("chunk_compress_factor").asInt();
+        config.ttl.latentDim = root.get("ttl").get("latent_dim").asInt();
+        
+        return config;
+    }
+    
+    /**
+     * Get latent mask from wav lengths
+     */
+    public static float[][][] getLatentMask(long[] wavLengths, Config config) {
+        long baseChunkSize = config.ae.baseChunkSize;
+        long chunkCompressFactor = config.ttl.chunkCompressFactor;
+        long latentSize = baseChunkSize * chunkCompressFactor;
+        
+        long[] latentLengths = new long[wavLengths.length];
+        long maxLen = 0;
+        for (int i = 0; i < wavLengths.length; i++) {
+            latentLengths[i] = (wavLengths[i] + latentSize - 1) / latentSize;
+            maxLen = Math.max(maxLen, latentLengths[i]);
+        }
+        
+        float[][][] mask = new float[wavLengths.length][1][(int) maxLen];
+        for (int i = 0; i < wavLengths.length; i++) {
+            for (int j = 0; j < maxLen; j++) {
+                mask[i][0][j] = j < latentLengths[i] ? 1.0f : 0.0f;
+            }
+        }
+        return mask;
+    }
+    
+    /**
+     * Write WAV file
+     */
+    public static void writeWavFile(String filename, float[] audioData, int sampleRate) throws IOException {
+        // Convert float to byte array
+        byte[] bytes = new byte[audioData.length * 2];
+        ByteBuffer buffer = ByteBuffer.wrap(bytes);
+        buffer.order(ByteOrder.LITTLE_ENDIAN);
+        
+        for (float sample : audioData) {
+            short val = (short) Math.max(-32768, Math.min(32767, sample * 32767));
+            buffer.putShort(val);
+        }
+        
+        ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
+        AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
+        // Stream length is in sample frames: one 16-bit mono sample per frame
+        try (AudioInputStream ais = new AudioInputStream(bais, format, audioData.length)) {
+            AudioSystem.write(ais, AudioFileFormat.Type.WAVE, new File(filename));
+        }
+    }
+    
+    /**
+     * Sanitize filename
+     */
+    public static String sanitizeFilename(String text, int maxLen) {
+        if (text.length() > maxLen) {
+            text = text.substring(0, maxLen);
+        }
+        return text.replaceAll("[^a-zA-Z0-9]", "_");
+    }
+    
+    /**
+     * Timer utility
+     */
+    public static <T> T timer(String name, java.util.function.Supplier<T> fn) {
+        // nanoTime is monotonic, so the measurement is unaffected by wall-clock changes
+        long start = System.nanoTime();
+        System.out.println(name + "...");
+        T result = fn.get();
+        double elapsedSec = (System.nanoTime() - start) / 1e9;
+        System.out.printf("  -> %s completed in %.2f sec\n", name, elapsedSec);
+        return result;
+    }
+    
+    /**
+     * Create float tensor from 3D array
+     */
+    public static OnnxTensor createFloatTensor(float[][][] array, OrtEnvironment env) throws OrtException {
+        int dim0 = array.length;
+        int dim1 = array[0].length;
+        int dim2 = array[0][0].length;
+        
+        float[] flat = new float[dim0 * dim1 * dim2];
+        int idx = 0;
+        for (int i = 0; i < dim0; i++) {
+            for (int j = 0; j < dim1; j++) {
+                for (int k = 0; k < dim2; k++) {
+                    flat[idx++] = array[i][j][k];
+                }
+            }
+        }
+        
+        long[] shape = {dim0, dim1, dim2};
+        return OnnxTensor.createTensor(env, FloatBuffer.wrap(flat), shape);
+    }
+    
+    /**
+     * Create long tensor from 2D array
+     */
+    public static OnnxTensor createLongTensor(long[][] array, OrtEnvironment env) throws OrtException {
+        int dim0 = array.length;
+        int dim1 = array[0].length;
+        
+        long[] flat = new long[dim0 * dim1];
+        int idx = 0;
+        for (int i = 0; i < dim0; i++) {
+            for (int j = 0; j < dim1; j++) {
+                flat[idx++] = array[i][j];
+            }
+        }
+        
+        long[] shape = {dim0, dim1};
+        return OnnxTensor.createTensor(env, LongBuffer.wrap(flat), shape);
+    }
+    
+    /**
+     * Load JSON long array
+     */
+    public static long[] loadJsonLongArray(String filePath) throws IOException {
+        ObjectMapper mapper = new ObjectMapper();
+        JsonNode root = mapper.readTree(new File(filePath));
+        
+        long[] result = new long[root.size()];
+        for (int i = 0; i < root.size(); i++) {
+            result[i] = root.get(i).asLong();
+        }
+        return result;
+    }
+}
+
@@ -0,0 +1,97 @@
+# TTS ONNX Inference Examples
+
+This guide provides examples for running TTS inference using `ExampleONNX.java`.
+
+## Installation
+
+This project uses [Maven](https://maven.apache.org/) for dependency management.
+
+### Prerequisites
+
+- Java 11 or higher
+- Maven 3.6 or higher
+
+### Install dependencies
+
+```bash
+mvn clean install
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+mvn exec:java
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+mvn exec:java -Dexec.args="--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant.'"
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice (M1.json) for the first text
+- Use female voice (F1.json) for the second text
+- Process both samples in a single batch
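
Processing both samples "in a single batch" means texts of different lengths have to share one rectangular tensor. The sketch below illustrates the padding-and-mask idea that `UnicodeProcessor` in `Helper.java` follows. `BatchPaddingSketch`, `padIds`, and `mask` are hypothetical names used only for illustration, and raw code points stand in for the IDs that the real code looks up in `unicode_indexer.json`.

```java
import java.util.Arrays;

class BatchPaddingSketch {
    /** Pads each text's code points to the longest text in the batch (0 marks padding). */
    static long[][] padIds(String[] texts) {
        int[][] codePoints = new int[texts.length][];
        int maxLen = 0;
        for (int i = 0; i < texts.length; i++) {
            codePoints[i] = texts[i].codePoints().toArray();
            maxLen = Math.max(maxLen, codePoints[i].length);
        }
        long[][] ids = new long[texts.length][maxLen];
        for (int i = 0; i < texts.length; i++) {
            for (int j = 0; j < codePoints[i].length; j++) {
                // Assumption: the real pipeline maps each code point through an indexer table
                ids[i][j] = codePoints[i][j];
            }
        }
        return ids;
    }

    /** Builds a [batch][1][maxLen] mask: 1 where a real character sits, 0 where padding sits. */
    static float[][][] mask(String[] texts) {
        int[] lengths = new int[texts.length];
        int maxLen = 0;
        for (int i = 0; i < texts.length; i++) {
            lengths[i] = texts[i].codePointCount(0, texts[i].length());
            maxLen = Math.max(maxLen, lengths[i]);
        }
        float[][][] m = new float[texts.length][1][maxLen];
        for (int i = 0; i < texts.length; i++) {
            for (int j = 0; j < lengths[i]; j++) {
                m[i][0][j] = 1.0f;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        String[] texts = {"Hi", "Hello"};
        System.out.println(Arrays.deepToString(padIds(texts)));
        System.out.println(Arrays.deepToString(mask(texts)));
    }
}
```

The mask lets the models ignore padded positions, so the shorter text is not affected by sharing a batch with the longer one.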
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+mvn exec:java -Dexec.args="--total-step 10 --voice-style assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
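
The cost of extra steps follows from the shape of the denoising loop in `Helper.java`: the vector estimator runs once per step, and each output becomes the next input. The sketch below shows only that control flow. `DenoisingSketch` and the `VectorEstimator` interface are hypothetical stand-ins for the ONNX session, not part of the project.

```java
import java.util.Arrays;

class DenoisingSketch {
    /** Hypothetical stand-in for one call to the vector estimator model. */
    interface VectorEstimator {
        float[] estimate(float[] noisyLatent, int currentStep, int totalStep);
    }

    /** Runs totalStep refinement passes, feeding each result back in as the next input. */
    static float[] denoise(float[] initialLatent, int totalStep, VectorEstimator estimator) {
        float[] xt = Arrays.copyOf(initialLatent, initialLatent.length);
        for (int step = 0; step < totalStep; step++) {
            xt = estimator.estimate(xt, step, totalStep);
        }
        return xt;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Toy estimator: halves every value and counts how often it is invoked
        VectorEstimator toy = (latent, step, total) -> {
            calls[0]++;
            float[] out = new float[latent.length];
            for (int i = 0; i < latent.length; i++) {
                out[i] = latent[i] * 0.5f;
            }
            return out;
        };
        denoise(new float[] {1.0f, -2.0f}, 10, toy);
        System.out.println("estimator calls: " + calls[0]);
    }
}
```

Because the number of model calls equals `--total-step`, the time spent in this stage grows roughly linearly with the step count.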
+
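+**Note**: The `mvn exec:java` examples wrap the text in single quotes inside `-Dexec.args`, so an apostrophe in the text would end the quoted string early. Escape the apostrophe, or build the fat JAR (see below) and run it directly: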
+```bash
+java -jar target/tts-example.jar --total-step 10 --text "Text with apostrophe's here"
+```
+
+## Building a Fat JAR
+
+To create a standalone JAR with all dependencies:
+```bash
+mvn clean package
+```
+
+Then run it directly:
+```bash
+java -jar target/tts-example.jar
+```
+
+Or with arguments:
+```bash
+java -jar target/tts-example.jar --total-step 10 --text "Your custom text here"
+```
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; the loader throws an error) |
+| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
+| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
+| `--save-dir` | str | `results` | Output directory |
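
To make the table concrete, here is a minimal sketch of a parser that applies these defaults. It is an illustration under assumptions, not the parser in `ExampleONNX.java`: the class name `CliArgsSketch` and its fields are hypothetical, and unknown options are silently ignored.

```java
import java.util.Arrays;
import java.util.List;

class CliArgsSketch {
    static final String DEFAULT_TEXT =
        "This morning, I took a walk in the park, and the sound of the birds and the breeze "
        + "was so pleasant that I stopped for a long time just to listen.";

    // Defaults taken from the arguments table
    boolean useGpu = false;
    String onnxDir = "assets/onnx";
    int totalStep = 5;
    int nTest = 4;
    List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
    List<String> text = Arrays.asList(DEFAULT_TEXT);
    String saveDir = "results";

    static CliArgsSketch parse(String[] argv) {
        CliArgsSketch a = new CliArgsSketch();
        for (int i = 0; i < argv.length; i++) {
            String arg = argv[i];
            if (arg.equals("--use-gpu")) {
                a.useGpu = true; // flag: takes no value
                continue;
            }
            if (i + 1 >= argv.length) {
                break; // every other option needs a value after it
            }
            switch (arg) {
                case "--onnx-dir": a.onnxDir = argv[++i]; break;
                case "--total-step": a.totalStep = Integer.parseInt(argv[++i]); break;
                case "--n-test": a.nTest = Integer.parseInt(argv[++i]); break;
                case "--voice-style": a.voiceStyle = Arrays.asList(argv[++i].split(",")); break;
                // "|" is a regex metacharacter, so it is escaped for String.split
                case "--text": a.text = Arrays.asList(argv[++i].split("\\|")); break;
                case "--save-dir": a.saveDir = argv[++i]; break;
                default: break; // unknown options are ignored in this sketch
            }
        }
        return a;
    }

    public static void main(String[] args) {
        CliArgsSketch a = parse(args);
        if (a.voiceStyle.size() != a.text.size()) {
            throw new IllegalArgumentException("Number of voice styles must match number of texts");
        }
        System.out.println("steps=" + a.totalStep + ", batch size=" + a.text.size());
    }
}
```

The size check in `main` mirrors the batch rule in the notes: each voice style is paired with the text at the same position.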
+
+## Notes
+
+- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
+- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference
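
Pre-extracted style files can be batched cheaply because each one holds fixed-shape arrays. As a rough sketch of what `Helper.loadVoiceStyle` does with them, the snippet below stacks one `[1][rows][cols]` array per voice into a single row-major buffer with shape `{bsz, rows, cols}`. `StyleStackSketch` and `stack` are hypothetical names, and JSON parsing is left out so the example needs only the standard library.

```java
import java.util.Arrays;
import java.util.List;

class StyleStackSketch {
    /** Copies each voice's style array into its own contiguous slice of one flat buffer. */
    static float[] stack(List<float[][][]> styles, int rows, int cols) {
        float[] flat = new float[styles.size() * rows * cols];
        for (int i = 0; i < styles.size(); i++) {
            int idx = i * rows * cols; // start of the slice owned by voice i
            for (float[][] batch : styles.get(i)) {
                for (float[] row : batch) {
                    for (float v : row) {
                        flat[idx++] = v;
                    }
                }
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        // Toy 1x2x2 styles standing in for the contents of two style files
        float[][][] first = {{{1f, 2f}, {3f, 4f}}};
        float[][][] second = {{{5f, 6f}, {7f, 8f}}};
        float[] flat = stack(Arrays.asList(first, second), 2, 2);
        long[] shape = {2, 2, 2};
        System.out.println(Arrays.toString(flat) + " with shape " + Arrays.toString(shape));
    }
}
```

A flat buffer plus an explicit shape is the form that `OnnxTensor.createTensor(env, FloatBuffer, long[])` accepts, which is why the loader flattens before creating the tensor.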
+
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,110 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
+                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+
+    <groupId>ai.supertonic</groupId>
+    <artifactId>tts-onnx-java</artifactId>
+    <version>1.0.0</version>
+    <packaging>jar</packaging>
+
+    <name>TTS ONNX Java Example</name>
+    <description>Text-to-Speech inference using ONNX Runtime in Java</description>
+
+    <properties>
+        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+        <maven.compiler.source>11</maven.compiler.source>
+        <maven.compiler.target>11</maven.compiler.target>
+        <onnxruntime.version>1.23.1</onnxruntime.version>
+        <jackson.version>2.15.2</jackson.version>
+    </properties>
+
+    <dependencies>
+        <!-- ONNX Runtime -->
+        <dependency>
+            <groupId>com.microsoft.onnxruntime</groupId>
+            <artifactId>onnxruntime</artifactId>
+            <version>${onnxruntime.version}</version>
+        </dependency>
+
+        <!-- Jackson for JSON parsing -->
+        <dependency>
+            <groupId>com.fasterxml.jackson.core</groupId>
+            <artifactId>jackson-databind</artifactId>
+            <version>${jackson.version}</version>
+        </dependency>
+
+        <!-- JTransforms for Fast FFT -->
+        <dependency>
+            <groupId>com.github.wendykierp</groupId>
+            <artifactId>JTransforms</artifactId>
+            <version>3.1</version>
+        </dependency>
+    </dependencies>
+
+    <build>
+        <sourceDirectory>.</sourceDirectory>
+        <plugins>
+            <!-- Maven Compiler Plugin -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-compiler-plugin</artifactId>
+                <version>3.11.0</version>
+                <configuration>
+                    <source>11</source>
+                    <target>11</target>
+                </configuration>
+            </plugin>
+
+            <!-- Maven Exec Plugin for running the example -->
+            <plugin>
+                <groupId>org.codehaus.mojo</groupId>
+                <artifactId>exec-maven-plugin</artifactId>
+                <version>3.1.0</version>
+                <configuration>
+                    <mainClass>ExampleONNX</mainClass>
+                </configuration>
+            </plugin>
+
+            <!-- Maven Jar Plugin -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-jar-plugin</artifactId>
+                <version>3.3.0</version>
+                <configuration>
+                    <archive>
+                        <manifest>
+                            <mainClass>ExampleONNX</mainClass>
+                        </manifest>
+                    </archive>
+                </configuration>
+            </plugin>
+
+            <!-- Maven Shade Plugin for creating fat JAR -->
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-shade-plugin</artifactId>
+                <version>3.5.0</version>
+                <executions>
+                    <execution>
+                        <phase>package</phase>
+                        <goals>
+                            <goal>shade</goal>
+                        </goals>
+                        <configuration>
+                            <transformers>
+                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
+                                    <mainClass>ExampleONNX</mainClass>
+                                </transformer>
+                            </transformers>
+                            <finalName>tts-example</finalName>
+                        </configuration>
+                    </execution>
+                </executions>
+            </plugin>
+        </plugins>
+    </build>
+</project>
+
@@ -0,0 +1,102 @@
+# TTS ONNX Node.js Implementation
+
+Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech from text.
+
+## Requirements
+
+- Node.js v16 or higher
+- npm or yarn
+
+## Installation
+
+```bash
+cd nodejs
+npm install
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+npm start
+```
+
+Or:
+```bash
+node example_onnx.js
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+node example_onnx.js \
+  --voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice style (M1.json) for the first text
+- Use female voice style (F1.json) for the second text
+- Process both samples in a single batch
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+node example_onnx.js \
+  --total-step 10 \
+  --voice-style "assets/voice_styles/M1.json" \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
+| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
+| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
+| `--save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
+
+## Architecture
+
+- `helper.js`: Node.js port of Python's `helper.py`
+  - `Preprocessor`: Audio preprocessing (STFT, Mel Spectrogram)
+  - `UnicodeProcessor`: Text preprocessing
+  - Utility functions (mask generation, tensor conversion, etc.)
+
+- `example_onnx.js`: Main inference script
+  - ONNX model loading
+  - TTS inference pipeline execution
+  - WAV file saving
+
+- `package.json`: Node.js project configuration and dependencies
+
+## Implementation Notes
+
+1. **Pure Node.js WAV Processing**: Writes WAV files without external native libraries. Outputs 16-bit PCM format.
+
+2. **Memory Usage**: Node.js may consume significant memory when processing large arrays.
+
+3. **Performance**: The mel spectrogram extraction (Step 1-1) is currently slower than Python's Librosa, which uses highly optimized C extensions. WASM-based FFT libraries or native addons could reduce this bottleneck.
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,104 @@
+import fs from 'fs';
+import path from 'path';
+import { fileURLToPath } from 'url';
+
+import { loadTextToSpeech, loadVoiceStyle, timer, writeWavFile } from './helper.js';
+
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = path.dirname(__filename);
+
+/**
+ * Parse command line arguments
+ */
+function parseArgs() {
+    const args = {
+        useGpu: false,
+        onnxDir: 'assets/onnx',
+        totalStep: 5,
+        nTest: 4,
+        voiceStyle: ['assets/voice_styles/M1.json'],
+        text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
+        saveDir: 'results'
+    };
+
+    for (let i = 2; i < process.argv.length; i++) {
+        const arg = process.argv[i];
+        if (arg === '--use-gpu') {
+            args.useGpu = true;
+        } else if (arg === '--onnx-dir' && i + 1 < process.argv.length) {
+            args.onnxDir = process.argv[++i];
+        } else if (arg === '--total-step' && i + 1 < process.argv.length) {
+            // Pass an explicit radix so the value is always read as decimal
+            args.totalStep = parseInt(process.argv[++i], 10);
+        } else if (arg === '--n-test' && i + 1 < process.argv.length) {
+            args.nTest = parseInt(process.argv[++i], 10);
+        } else if (arg === '--voice-style' && i + 1 < process.argv.length) {
+            args.voiceStyle = process.argv[++i].split(',');
+        } else if (arg === '--text' && i + 1 < process.argv.length) {
+            args.text = process.argv[++i].split('|');
+        } else if (arg === '--save-dir' && i + 1 < process.argv.length) {
+            args.saveDir = process.argv[++i];
+        }
+    }
+
+    return args;
+}
+
+/**
+ * Main inference function
+ */
+async function main() {
+    console.log('=== TTS Inference with ONNX Runtime (Node.js) ===\n');
+
+    // --- 1. Parse arguments --- //
+    const args = parseArgs();
+    const totalStep = args.totalStep;
+    const nTest = args.nTest;
+    const saveDir = args.saveDir;
+    const voiceStylePaths = args.voiceStyle.map(p => path.resolve(__dirname, p));
+    const textList = args.text;
+
+    if (voiceStylePaths.length !== textList.length) {
+        throw new Error(`Number of voice styles (${voiceStylePaths.length}) must match number of texts (${textList.length})`);
+    }
+
+    const bsz = voiceStylePaths.length;
+
+    // --- 2. Load Text to Speech --- //
+    const onnxDir = path.resolve(__dirname, args.onnxDir);
+    const textToSpeech = await loadTextToSpeech(onnxDir, args.useGpu);
+
+    // --- 3. Load Voice Style --- //
+    const style = loadVoiceStyle(voiceStylePaths, true);
+
+    // --- 4. Synthesize speech --- //
+    for (let n = 0; n < nTest; n++) {
+        console.log(`\n[${n + 1}/${nTest}] Starting synthesis...`);
+        
+        const { wav, duration } = await timer('Generating speech from text', async () => {
+            return await textToSpeech.call(textList, style, totalStep);
+        });
+        
+        if (!fs.existsSync(saveDir)) {
+            fs.mkdirSync(saveDir, { recursive: true });
+        }
+
+        const wavShape = [bsz, wav.length / bsz];
+        for (let b = 0; b < bsz; b++) {
+            const fname = `${textList[b].substring(0, 20).replace(/[^a-zA-Z0-9]/g, '_')}_${n + 1}.wav`;
+            const wavLen = Math.floor(textToSpeech.sampleRate * duration[b]);
+            const wavOut = wav.slice(b * wavShape[1], b * wavShape[1] + wavLen);
+            
+            const outputPath = path.join(saveDir, fname);
+            writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
+            console.log(`Saved: ${outputPath}`);
+        }
+    }
+
+    console.log('\n=== Synthesis completed successfully! ===');
+}
+
+// Run main function
+main().catch(err => {
+    console.error('Error during inference:', err);
+    process.exit(1);
+});
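The save loop in `example_onnx.js` above treats the vocoder output as one flat array holding `bsz` equally sized waveforms and trims each one to `sampleRate * duration[b]` samples. A minimal Python sketch of that slicing follows. `split_batched_wav` is a hypothetical name, and the clamp to the per-item length is an assumption added here for safety rather than something the script does.

```python
def split_batched_wav(wav_flat, durations, sample_rate):
    """Split a flat [bsz * T] waveform into per-item lists trimmed by duration."""
    bsz = len(durations)
    if bsz == 0 or len(wav_flat) % bsz != 0:
        raise ValueError("wav_flat length must be a positive multiple of the batch size")

    samples_per_item = len(wav_flat) // bsz
    trimmed = []
    for b, duration in enumerate(durations):
        # Number of valid samples for this item, never past its own slot.
        wav_len = min(int(sample_rate * duration), samples_per_item)
        start = b * samples_per_item
        trimmed.append(wav_flat[start:start + wav_len])
    return trimmed
```

Padding at the end of shorter items is dropped, so each returned list contains only the samples covered by its predicted duration.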
@@ -0,0 +1,392 @@
+import fs from 'fs';
+import path from 'path';
+import { fileURLToPath } from 'url';
+import * as ort from 'onnxruntime-node';
+
+const __filename = fileURLToPath(import.meta.url);
+
+/**
+ * Unicode text processor
+ */
+class UnicodeProcessor {
+    constructor(unicodeIndexerJsonPath) {
+        this.indexer = JSON.parse(fs.readFileSync(unicodeIndexerJsonPath, 'utf8'));
+    }
+
+    _preprocessText(text) {
+        // Simple NFKD normalization (JavaScript has normalize built-in)
+        return text.normalize('NFKD');
+    }
+
+    _textToUnicodeValues(text) {
+        return Array.from(text).map(char => char.charCodeAt(0));
+    }
+
+    _getTextMask(textIdsLengths) {
+        return lengthToMask(textIdsLengths);
+    }
+
+    call(textList) {
+        const processedTexts = textList.map(t => this._preprocessText(t));
+        // Convert once so lengths, mask and IDs all count the same units (code points)
+        const unicodeValsList = processedTexts.map(t => this._textToUnicodeValues(t));
+        const textIdsLengths = unicodeValsList.map(vals => vals.length);
+        const maxLen = Math.max(...textIdsLengths);
+        
+        const textIds = [];
+        for (let i = 0; i < unicodeValsList.length; i++) {
+            const row = new Array(maxLen).fill(0);
+            const unicodeVals = unicodeValsList[i];
+            for (let j = 0; j < unicodeVals.length; j++) {
+                row[j] = this.indexer[unicodeVals[j]];
+            }
+            textIds.push(row);
+        }
+        
+        const textMask = this._getTextMask(textIdsLengths);
+        return { textIds, textMask };
+    }
+}
+
+/**
+ * Style class
+ */
+class Style {
+    constructor(styleTtlOnnx, styleDpOnnx) {
+        this.ttl = styleTtlOnnx;
+        this.dp = styleDpOnnx;
+    }
+}
+
+/**
+ * TextToSpeech class
+ */
+class TextToSpeech {
+    constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
+        this.cfgs = cfgs;
+        this.textProcessor = textProcessor;
+        this.dpOrt = dpOrt;
+        this.textEncOrt = textEncOrt;
+        this.vectorEstOrt = vectorEstOrt;
+        this.vocoderOrt = vocoderOrt;
+        this.sampleRate = cfgs.ae.sample_rate;
+        this.baseChunkSize = cfgs.ae.base_chunk_size;
+        this.chunkCompressFactor = cfgs.ttl.chunk_compress_factor;
+        this.ldim = cfgs.ttl.latent_dim;
+    }
+
+    sampleNoisyLatent(duration) {
+        const wavLenMax = Math.max(...duration) * this.sampleRate;
+        const wavLengths = duration.map(d => Math.floor(d * this.sampleRate));
+        const chunkSize = this.baseChunkSize * this.chunkCompressFactor;
+        const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
+        const latentDim = this.ldim * this.chunkCompressFactor;
+
+        // Generate random noise
+        const noisyLatent = [];
+        for (let b = 0; b < duration.length; b++) {
+            const batch = [];
+            for (let d = 0; d < latentDim; d++) {
+                const row = [];
+                for (let t = 0; t < latentLen; t++) {
+                    // Box-Muller transform for normal distribution
+                    // Add epsilon to avoid log(0)
+                    const eps = 1e-10;
+                    const u1 = Math.max(eps, Math.random());
+                    const u2 = Math.random();
+                    const randNormal = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
+                    row.push(randNormal);
+                }
+                batch.push(row);
+            }
+            noisyLatent.push(batch);
+        }
+
+        const latentMask = getLatentMask(wavLengths, this.baseChunkSize, this.chunkCompressFactor);
+        
+        // Apply mask
+        for (let b = 0; b < noisyLatent.length; b++) {
+            for (let d = 0; d < noisyLatent[b].length; d++) {
+                for (let t = 0; t < noisyLatent[b][d].length; t++) {
+                    noisyLatent[b][d][t] *= latentMask[b][0][t];
+                }
+            }
+        }
+
+        return { noisyLatent, latentMask };
+    }
+
+    async call(textList, style, totalStep) {
+        if (textList.length !== style.ttl.dims[0]) {
+            throw new Error('Number of texts must match number of style vectors');
+        }
+        const bsz = textList.length;
+        const { textIds, textMask } = this.textProcessor.call(textList);
+        const textIdsShape = [bsz, textIds[0].length];
+        const textMaskShape = [bsz, 1, textMask[0][0].length];
+        
+        const textMaskTensor = arrayToTensor(textMask, textMaskShape);
+        
+        const dpResult = await this.dpOrt.run({
+            text_ids: intArrayToTensor(textIds, textIdsShape),
+            style_dp: style.dp,
+            text_mask: textMaskTensor
+        });
+        
+        const durOnnx = Array.from(dpResult.duration.data);
+        
+        const textEncResult = await this.textEncOrt.run({
+            text_ids: intArrayToTensor(textIds, textIdsShape),
+            style_ttl: style.ttl,
+            text_mask: textMaskTensor
+        });
+        
+        const textEmbTensor = textEncResult.text_emb;
+
+        let { noisyLatent, latentMask } = this.sampleNoisyLatent(durOnnx);
+        const latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
+        const latentMaskShape = [bsz, 1, latentMask[0][0].length];
+        
+        const latentMaskTensor = arrayToTensor(latentMask, latentMaskShape);
+        
+        const totalStepArray = new Array(bsz).fill(totalStep);
+        const scalarShape = [bsz];
+        const totalStepTensor = arrayToTensor(totalStepArray, scalarShape);
+
+        for (let step = 0; step < totalStep; step++) {
+            const currentStepArray = new Array(bsz).fill(step);
+
+            const vectorEstResult = await this.vectorEstOrt.run({
+                noisy_latent: arrayToTensor(noisyLatent, latentShape),
+                text_emb: textEmbTensor,
+                style_ttl: style.ttl,
+                text_mask: textMaskTensor,
+                latent_mask: latentMaskTensor,
+                total_step: totalStepTensor,
+                current_step: arrayToTensor(currentStepArray, scalarShape)
+            });
+
+            const denoisedLatent = Array.from(vectorEstResult.denoised_latent.data);
+
+            // Update latent with the denoised output
+            let idx = 0;
+            for (let b = 0; b < noisyLatent.length; b++) {
+                for (let d = 0; d < noisyLatent[b].length; d++) {
+                    for (let t = 0; t < noisyLatent[b][d].length; t++) {
+                        noisyLatent[b][d][t] = denoisedLatent[idx++];
+                    }
+                }
+            }
+        }
+
+        const vocoderResult = await this.vocoderOrt.run({
+            latent: arrayToTensor(noisyLatent, latentShape)
+        });
+
+        const wav = Array.from(vocoderResult.wav_tts.data);
+        return { wav, duration: durOnnx };
+    }
+}
+
+/**
+ * Convert lengths to binary mask
+ */
+function lengthToMask(lengths, maxLen = null) {
+    maxLen = maxLen || Math.max(...lengths);
+    const mask = [];
+    for (let i = 0; i < lengths.length; i++) {
+        const row = [];
+        for (let j = 0; j < maxLen; j++) {
+            row.push(j < lengths[i] ? 1.0 : 0.0);
+        }
+        mask.push([row]); // [B, 1, maxLen]
+    }
+    return mask;
+}
+
+/**
+ * Get latent mask from wav lengths
+ */
+function getLatentMask(wavLengths, baseChunkSize, chunkCompressFactor) {
+    const latentSize = baseChunkSize * chunkCompressFactor;
+    const latentLengths = wavLengths.map(len => 
+        Math.floor((len + latentSize - 1) / latentSize)
+    );
+    return lengthToMask(latentLengths);
+}
+
+/**
+ * Load ONNX model
+ */
+async function loadOnnx(onnxPath, opts) {
+    return await ort.InferenceSession.create(onnxPath, opts);
+}
+
+/**
+ * Load all ONNX models for TTS
+ */
+async function loadOnnxAll(onnxDir, opts) {
+    const dpPath = path.join(onnxDir, 'duration_predictor.onnx');
+    const textEncPath = path.join(onnxDir, 'text_encoder.onnx');
+    const vectorEstPath = path.join(onnxDir, 'vector_estimator.onnx');
+    const vocoderPath = path.join(onnxDir, 'vocoder.onnx');
+
+    const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = await Promise.all([
+        loadOnnx(dpPath, opts),
+        loadOnnx(textEncPath, opts),
+        loadOnnx(vectorEstPath, opts),
+        loadOnnx(vocoderPath, opts)
+    ]);
+
+    return { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt };
+}
+
+/**
+ * Load configuration
+ */
+function loadCfgs(onnxDir) {
+    const cfgPath = path.join(onnxDir, 'tts.json');
+    const cfgs = JSON.parse(fs.readFileSync(cfgPath, 'utf8'));
+    return cfgs;
+}
+
+/**
+ * Load text processor
+ */
+function loadTextProcessor(onnxDir) {
+    const unicodeIndexerPath = path.join(onnxDir, 'unicode_indexer.json');
+    const textProcessor = new UnicodeProcessor(unicodeIndexerPath);
+    return textProcessor;
+}
+
+/**
+ * Load voice style from JSON file
+ */
+export function loadVoiceStyle(voiceStylePaths, verbose = false) {
+    const bsz = voiceStylePaths.length;
+    
+    // Read first file to get dimensions
+    const firstStyle = JSON.parse(fs.readFileSync(voiceStylePaths[0], 'utf8'));
+    const ttlDims = firstStyle.style_ttl.dims;
+    const dpDims = firstStyle.style_dp.dims;
+    
+    const ttlDim1 = ttlDims[1];
+    const ttlDim2 = ttlDims[2];
+    const dpDim1 = dpDims[1];
+    const dpDim2 = dpDims[2];
+    
+    // Pre-allocate arrays with full batch size
+    const ttlSize = bsz * ttlDim1 * ttlDim2;
+    const dpSize = bsz * dpDim1 * dpDim2;
+    const ttlFlat = new Float32Array(ttlSize);
+    const dpFlat = new Float32Array(dpSize);
+    
+    // Fill in the data
+    for (let i = 0; i < bsz; i++) {
+        const voiceStyle = JSON.parse(fs.readFileSync(voiceStylePaths[i], 'utf8'));
+        
+        const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
+        const ttlOffset = i * ttlDim1 * ttlDim2;
+        ttlFlat.set(ttlData, ttlOffset);
+        
+        const dpData = voiceStyle.style_dp.data.flat(Infinity);
+        const dpOffset = i * dpDim1 * dpDim2;
+        dpFlat.set(dpData, dpOffset);
+    }
+    
+    const ttlStyle = new ort.Tensor('float32', ttlFlat, [bsz, ttlDim1, ttlDim2]);
+    const dpStyle = new ort.Tensor('float32', dpFlat, [bsz, dpDim1, dpDim2]);
+    
+    if (verbose) {
+        console.log(`Loaded ${bsz} voice styles`);
+    }
+    
+    return new Style(ttlStyle, dpStyle);
+}
+
+/**
+ * Load text to speech components
+ */
+export async function loadTextToSpeech(onnxDir, useGpu = false) {
+    const opts = {};
+    if (useGpu) {
+        throw new Error('GPU mode is not supported yet');
+    } else {
+        console.log('Using CPU for inference');
+    }
+    
+    const cfgs = loadCfgs(onnxDir);
+    const { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt } = await loadOnnxAll(onnxDir, opts);
+    const textProcessor = loadTextProcessor(onnxDir);
+    const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
+    
+    return textToSpeech;
+}
+
+/**
+ * Convert 3D array to ONNX tensor
+ */
+function arrayToTensor(array, dims) {
+    // Flatten the array
+    const flat = array.flat(Infinity);
+    return new ort.Tensor('float32', Float32Array.from(flat), dims);
+}
+
+/**
+ * Convert 2D int array to ONNX tensor
+ */
+function intArrayToTensor(array, dims) {
+    const flat = array.flat(Infinity);
+    return new ort.Tensor('int64', BigInt64Array.from(flat.map(x => BigInt(x))), dims);
+}
+
+/**
+ * Write WAV file
+ */
+export function writeWavFile(filename, audioData, sampleRate) {
+    const numChannels = 1;
+    const bitsPerSample = 16;
+    const byteRate = sampleRate * numChannels * bitsPerSample / 8;
+    const blockAlign = numChannels * bitsPerSample / 8;
+    const dataSize = audioData.length * bitsPerSample / 8;
+
+    const buffer = Buffer.alloc(44 + dataSize);
+    
+    // RIFF header
+    buffer.write('RIFF', 0);
+    buffer.writeUInt32LE(36 + dataSize, 4);
+    buffer.write('WAVE', 8);
+    
+    // fmt chunk
+    buffer.write('fmt ', 12);
+    buffer.writeUInt32LE(16, 16); // fmt chunk size
+    buffer.writeUInt16LE(1, 20); // audio format (PCM)
+    buffer.writeUInt16LE(numChannels, 22);
+    buffer.writeUInt32LE(sampleRate, 24);
+    buffer.writeUInt32LE(byteRate, 28);
+    buffer.writeUInt16LE(blockAlign, 32);
+    buffer.writeUInt16LE(bitsPerSample, 34);
+    
+    // data chunk
+    buffer.write('data', 36);
+    buffer.writeUInt32LE(dataSize, 40);
+    
+    // Write audio data
+    for (let i = 0; i < audioData.length; i++) {
+        const sample = Math.max(-1, Math.min(1, audioData[i]));
+        const intSample = Math.round(sample * 32767); // round to nearest to avoid a negative bias
+        buffer.writeInt16LE(intSample, 44 + i * 2);
+    }
+    
+    fs.writeFileSync(filename, buffer);
+}
+
+/**
+ * Timer utility for measuring execution time
+ */
+export async function timer(name, fn) {
+    const start = Date.now();
+    console.log(`${name}...`);
+    const result = await fn();
+    const elapsed = ((Date.now() - start) / 1000).toFixed(2);
+    console.log(`  -> ${name} completed in ${elapsed} sec`);
+    return result;
+}
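`lengthToMask` and `getLatentMask` in `helper.js` above turn per-item waveform lengths into `[B, 1, T]` masks, using a ceiling division by `baseChunkSize * chunkCompressFactor`. The same arithmetic can be sketched in plain Python. The function names mirror the helpers but this is an illustrative rewrite, and the chunk values in the note below are made up; the real ones come from `tts.json`.

```python
def length_to_mask(lengths, max_len=None):
    """Return a [B, 1, max_len] nested list with 1.0 for valid positions."""
    if max_len is None:
        max_len = max(lengths)
    return [[[1.0 if j < n else 0.0 for j in range(max_len)]] for n in lengths]


def latent_lengths(wav_lengths, base_chunk_size, chunk_compress_factor):
    """Ceiling-divide waveform lengths by the number of samples per latent frame."""
    latent_size = base_chunk_size * chunk_compress_factor
    return [(n + latent_size - 1) // latent_size for n in wav_lengths]


def get_latent_mask(wav_lengths, base_chunk_size, chunk_compress_factor):
    """Combine both steps: waveform lengths -> latent lengths -> mask."""
    return length_to_mask(
        latent_lengths(wav_lengths, base_chunk_size, chunk_compress_factor)
    )
```

With a hypothetical chunk size of 512 and a compression factor of 6, one latent frame covers 3072 samples, so a 3073-sample waveform needs two frames while a 3072-sample waveform needs one.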
@@ -0,0 +1,26 @@
+{
+  "name": "tts-onnx-nodejs",
+  "version": "1.0.0",
+  "description": "TTS inference using ONNX Runtime for Node.js",
+  "main": "example_onnx.js",
+  "type": "module",
+  "scripts": {
+    "start": "node example_onnx.js"
+  },
+  "keywords": [
+    "tts",
+    "onnx",
+    "speech-synthesis",
+    "nodejs"
+  ],
+  "author": "",
+  "license": "MIT",
+  "dependencies": {
+    "fft.js": "^4.0.3",
+    "js-yaml": "^4.1.0",
+    "onnxruntime-node": "^1.19.2"
+  },
+  "engines": {
+    "node": ">=16.0.0"
+  }
+}
@@ -0,0 +1,83 @@
+# TTS ONNX Inference Examples
+
+This guide provides examples for running TTS inference using `example_onnx.py`.
+
+## Installation
+
+This project uses [uv](https://docs.astral.sh/uv/) for fast package management.
+
+### Install uv (if not already installed)
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+```
+
+### Install dependencies
+```bash
+uv sync
+```
+
+Or if you prefer using traditional pip with requirements.txt:
+```bash
+pip install -r requirements.txt
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+uv run example_onnx.py
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+uv run example_onnx.py \
+  --voice-style assets/voice_styles/M1.json assets/voice_styles/F1.json \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice style (M1.json) for the first text
+- Use female voice style (F1.json) for the second text
+- Process both samples in a single batch
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+uv run example_onnx.py \
+  --total-step 10 \
+  --voice-style assets/voice_styles/M1.json \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; currently raises an error) |
+| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
+| `--text` | str+ | (long default text) | Text(s) to synthesize |
+| `--save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
+
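The batch rule in the notes above (one `--voice-style` per `--text`) can be sketched with `argparse`, which collects space-separated values into a list when `nargs="+"` is set. This is a sketch rather than the script's own code: `build_parser` and `pair_inputs` are hypothetical names, the default text is a placeholder, and raising `ValueError` is a choice made here.

```python
import argparse


def build_parser():
    """Parser with the two list-valued options used for batch inference."""
    parser = argparse.ArgumentParser(description="Batch pairing sketch")
    parser.add_argument(
        "--voice-style", nargs="+", default=["assets/voice_styles/M1.json"]
    )
    parser.add_argument("--text", nargs="+", default=["Placeholder text."])
    return parser


def pair_inputs(voice_styles, texts):
    """Pair the i-th voice style with the i-th text, rejecting mismatched counts."""
    if len(voice_styles) != len(texts):
        raise ValueError(
            f"Number of voice styles ({len(voice_styles)}) "
            f"must match number of texts ({len(texts)})"
        )
    return list(zip(voice_styles, texts))


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--voice-style", "M1.json", "F1.json",
     "--text", "First sentence.", "Second sentence."]
)
pairs = pair_inputs(args.voice_style, args.text)
```

Each text must be quoted on the command line so the shell passes it as a single value; otherwise every word would count as a separate `--text` entry and the counts would no longer match.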
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,91 @@
+import argparse
+import os
+
+import soundfile as sf
+
+from helper import load_text_to_speech, timer, sanitize_filename, load_voice_style
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="TTS Inference with ONNX")
+
+    # Device settings
+    parser.add_argument(
+        "--use-gpu", action="store_true", help="Use GPU for inference (default: CPU)"
+    )
+
+    # Model settings
+    parser.add_argument(
+        "--onnx-dir",
+        type=str,
+        default="assets/onnx",
+        help="Path to ONNX model directory",
+    )
+
+    # Synthesis parameters
+    parser.add_argument(
+        "--total-step", type=int, default=5, help="Number of denoising steps"
+    )
+    parser.add_argument(
+        "--n-test", type=int, default=4, help="Number of times to generate"
+    )
+
+    # Input/Output
+    parser.add_argument(
+        "--voice-style",
+        type=str,
+        nargs="+",
+        default=["assets/voice_styles/M1.json"],
+        help="Voice style file path(s). Can specify multiple files for batch processing",
+    )
+    parser.add_argument(
+        "--text",
+        type=str,
+        nargs="+",
+        default=[
+            "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+        ],
+        help="Text(s) to synthesize. Can specify multiple texts for batch processing",
+    )
+    parser.add_argument(
+        "--save-dir", type=str, default="results", help="Output directory"
+    )
+
+    return parser.parse_args()
+
+
+print("=== TTS Inference with ONNX Runtime (Python) ===\n")
+
+# --- 1. Parse arguments --- #
+args = parse_args()
+total_step = args.total_step
+n_test = args.n_test
+save_dir = args.save_dir
+voice_style_paths = args.voice_style
+text_list = args.text
+
+assert len(voice_style_paths) == len(
+    text_list
+), f"Number of voice styles ({len(voice_style_paths)}) must match number of texts ({len(text_list)})"
+
+bsz = len(voice_style_paths)
+
+# --- 2. Load Text to Speech --- #
+text_to_speech = load_text_to_speech(args.onnx_dir, args.use_gpu)
+
+# --- 3. Load Voice Style --- #
+style = load_voice_style(voice_style_paths, verbose=True)
+
+# --- 4. Synthesize Speech --- #
+for n in range(n_test):
+    print(f"\n[{n+1}/{n_test}] Starting synthesis...")
+    with timer("Generating speech from text"):
+        wav, duration = text_to_speech(text_list, style, total_step)
+    os.makedirs(save_dir, exist_ok=True)
+    for b in range(bsz):
+        fname = f"{sanitize_filename(text_list[b], 20)}_{n+1}.wav"
+        w = wav[b, : int(text_to_speech.sample_rate * duration[b].item())]  # [T_trim]
+        sf.write(os.path.join(save_dir, fname), w, text_to_speech.sample_rate)
+        print(f"Saved: {save_dir}/{fname}")
+print("\n=== Synthesis completed successfully! ===")
@@ -0,0 +1,249 @@
+import json
+import os
+import time
+from contextlib import contextmanager
+from typing import Optional
+from unicodedata import normalize
+
+import numpy as np
+import onnxruntime as ort
+
+
+class UnicodeProcessor:
+    def __init__(self, unicode_indexer_path: str):
+        with open(unicode_indexer_path, "r") as f:
+            self.indexer = json.load(f)
+
+    def _preprocess_text(self, text: str) -> str:
+        # TODO: add more preprocessing
+        text = normalize("NFKD", text)
+        return text
+
+    def _get_text_mask(self, text_ids_lengths: np.ndarray) -> np.ndarray:
+        text_mask = length_to_mask(text_ids_lengths)
+        return text_mask
+
+    def _text_to_unicode_values(self, text: str) -> np.ndarray:
+        unicode_values = np.array(
+            [ord(char) for char in text], dtype=np.uint16
+        )  # 2 bytes
+        return unicode_values
+
+    def __call__(self, text_list: list[str]) -> tuple[np.ndarray, np.ndarray]:
+        text_list = [self._preprocess_text(t) for t in text_list]
+        text_ids_lengths = np.array([len(text) for text in text_list], dtype=np.int64)
+        text_ids = np.zeros((len(text_list), text_ids_lengths.max()), dtype=np.int64)
+        for i, text in enumerate(text_list):
+            unicode_vals = self._text_to_unicode_values(text)
+            text_ids[i, : len(unicode_vals)] = np.array(
+                [self.indexer[val] for val in unicode_vals], dtype=np.int64
+            )
+        text_mask = self._get_text_mask(text_ids_lengths)
+        return text_ids, text_mask
+
+
+class Style:
+    def __init__(self, style_ttl_onnx: np.ndarray, style_dp_onnx: np.ndarray):
+        self.ttl = style_ttl_onnx
+        self.dp = style_dp_onnx
+
+
+class TextToSpeech:
+    def __init__(
+        self,
+        cfgs: dict,
+        text_processor: UnicodeProcessor,
+        dp_ort: ort.InferenceSession,
+        text_enc_ort: ort.InferenceSession,
+        vector_est_ort: ort.InferenceSession,
+        vocoder_ort: ort.InferenceSession,
+    ):
+        self.cfgs = cfgs
+        self.text_processor = text_processor
+        self.dp_ort = dp_ort
+        self.text_enc_ort = text_enc_ort
+        self.vector_est_ort = vector_est_ort
+        self.vocoder_ort = vocoder_ort
+        self.sample_rate = cfgs["ae"]["sample_rate"]
+        self.base_chunk_size = cfgs["ae"]["base_chunk_size"]
+        self.chunk_compress_factor = cfgs["ttl"]["chunk_compress_factor"]
+        self.ldim = cfgs["ttl"]["latent_dim"]
+
+    def sample_noisy_latent(
+        self, duration: np.ndarray
+    ) -> tuple[np.ndarray, np.ndarray]:
+        bsz = len(duration)
+        wav_len_max = duration.max() * self.sample_rate
+        wav_lengths = (duration * self.sample_rate).astype(np.int64)
+        chunk_size = self.base_chunk_size * self.chunk_compress_factor
+        latent_len = ((wav_len_max + chunk_size - 1) / chunk_size).astype(np.int32)
+        latent_dim = self.ldim * self.chunk_compress_factor
+        noisy_latent = np.random.randn(bsz, latent_dim, latent_len).astype(np.float32)
+        latent_mask = get_latent_mask(
+            wav_lengths, self.base_chunk_size, self.chunk_compress_factor
+        )
+        noisy_latent = noisy_latent * latent_mask
+        return noisy_latent, latent_mask
+
+    def __call__(
+        self, text_list: list[str], style: Style, total_step: int
+    ) -> tuple[np.ndarray, np.ndarray]:
+        assert (
+            len(text_list) == style.ttl.shape[0]
+        ), "Number of texts must match number of style vectors"
+        bsz = len(text_list)
+        text_ids, text_mask = self.text_processor(text_list)
+        dur_onnx, *_ = self.dp_ort.run(
+            None, {"text_ids": text_ids, "style_dp": style.dp, "text_mask": text_mask}
+        )  # dur_onnx: [bsz]
+        text_emb_onnx, *_ = self.text_enc_ort.run(
+            None,
+            {"text_ids": text_ids, "style_ttl": style.ttl, "text_mask": text_mask},
+        )
+        xt, latent_mask = self.sample_noisy_latent(dur_onnx)
+        total_step_np = np.array([total_step] * bsz, dtype=np.float32)
+        for step in range(total_step):
+            current_step = np.array([step] * bsz, dtype=np.float32)
+            xt, *_ = self.vector_est_ort.run(
+                None,
+                {
+                    "noisy_latent": xt,
+                    "text_emb": text_emb_onnx,
+                    "style_ttl": style.ttl,
+                    "text_mask": text_mask,
+                    "latent_mask": latent_mask,
+                    "current_step": current_step,
+                    "total_step": total_step_np,
+                },
+            )
+        wav, *_ = self.vocoder_ort.run(None, {"latent": xt})
+        return wav, dur_onnx
+
+
+def length_to_mask(lengths: np.ndarray, max_len: Optional[int] = None) -> np.ndarray:
+    """
+    Convert lengths to binary mask.
+
+    Args:
+        lengths: (B,)
+        max_len: int
+
+    Returns:
+        mask: (B, 1, max_len)
+    """
+    max_len = max_len or lengths.max()
+    ids = np.arange(0, max_len)
+    mask = (ids < np.expand_dims(lengths, axis=1)).astype(np.float32)
+    return mask.reshape(-1, 1, max_len)
+
+
+def get_latent_mask(
+    wav_lengths: np.ndarray, base_chunk_size: int, chunk_compress_factor: int
+) -> np.ndarray:
+    latent_size = base_chunk_size * chunk_compress_factor
+    latent_lengths = (wav_lengths + latent_size - 1) // latent_size
+    latent_mask = length_to_mask(latent_lengths)
+    return latent_mask
+
+
+def load_onnx(
+    onnx_path: str, opts: ort.SessionOptions, providers: list[str]
+) -> ort.InferenceSession:
+    return ort.InferenceSession(onnx_path, sess_options=opts, providers=providers)
+
+
+def load_onnx_all(
+    onnx_dir: str, opts: ort.SessionOptions, providers: list[str]
+) -> tuple[
+    ort.InferenceSession,
+    ort.InferenceSession,
+    ort.InferenceSession,
+    ort.InferenceSession,
+]:
+    dp_onnx_path = os.path.join(onnx_dir, "duration_predictor.onnx")
+    text_enc_onnx_path = os.path.join(onnx_dir, "text_encoder.onnx")
+    vector_est_onnx_path = os.path.join(onnx_dir, "vector_estimator.onnx")
+    vocoder_onnx_path = os.path.join(onnx_dir, "vocoder.onnx")
+
+    dp_ort = load_onnx(dp_onnx_path, opts, providers)
+    text_enc_ort = load_onnx(text_enc_onnx_path, opts, providers)
+    vector_est_ort = load_onnx(vector_est_onnx_path, opts, providers)
+    vocoder_ort = load_onnx(vocoder_onnx_path, opts, providers)
+    return dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
+
+
+def load_cfgs(onnx_dir: str) -> dict:
+    cfg_path = os.path.join(onnx_dir, "tts.json")
+    with open(cfg_path, "r") as f:
+        cfgs = json.load(f)
+    return cfgs
+
+
+def load_text_processor(onnx_dir: str) -> UnicodeProcessor:
+    unicode_indexer_path = os.path.join(onnx_dir, "unicode_indexer.json")
+    text_processor = UnicodeProcessor(unicode_indexer_path)
+    return text_processor
+
+
+def load_text_to_speech(onnx_dir: str, use_gpu: bool = False) -> TextToSpeech:
+    opts = ort.SessionOptions()
+    if use_gpu:
+        raise NotImplementedError("GPU mode is not supported yet")
+    else:
+        providers = ["CPUExecutionProvider"]
+        print("Using CPU for inference")
+    cfgs = load_cfgs(onnx_dir)
+    dp_ort, text_enc_ort, vector_est_ort, vocoder_ort = load_onnx_all(
+        onnx_dir, opts, providers
+    )
+    text_processor = load_text_processor(onnx_dir)
+    return TextToSpeech(
+        cfgs, text_processor, dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
+    )
+
+
+def load_voice_style(voice_style_paths: list[str], verbose: bool = False) -> Style:
+    bsz = len(voice_style_paths)
+
+    # Read first file to get dimensions
+    with open(voice_style_paths[0], "r") as f:
+        first_style = json.load(f)
+    ttl_dims = first_style["style_ttl"]["dims"]
+    dp_dims = first_style["style_dp"]["dims"]
+
+    # Pre-allocate arrays with full batch size
+    ttl_style = np.zeros([bsz, ttl_dims[1], ttl_dims[2]], dtype=np.float32)
+    dp_style = np.zeros([bsz, dp_dims[1], dp_dims[2]], dtype=np.float32)
+
+    # Fill in the data
+    for i, voice_style_path in enumerate(voice_style_paths):
+        with open(voice_style_path, "r") as f:
+            voice_style = json.load(f)
+
+        ttl_data = np.array(
+            voice_style["style_ttl"]["data"], dtype=np.float32
+        ).flatten()
+        ttl_style[i] = ttl_data.reshape(ttl_dims[1], ttl_dims[2])
+
+        dp_data = np.array(voice_style["style_dp"]["data"], dtype=np.float32).flatten()
+        dp_style[i] = dp_data.reshape(dp_dims[1], dp_dims[2])
+
+    if verbose:
+        print(f"Loaded {bsz} voice styles")
+    return Style(ttl_style, dp_style)
+
+
+@contextmanager
+def timer(name: str):
+    start = time.time()
+    print(f"{name}...")
+    yield
+    print(f"  -> {name} completed in {time.time() - start:.2f} sec")
+
+
+def sanitize_filename(text: str, max_len: int) -> str:
+    """Sanitize filename by replacing non-alphanumeric characters with underscores"""
+    import re
+
+    prefix = text[:max_len]
+    return re.sub(r"[^a-zA-Z0-9]", "_", prefix)
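`UnicodeProcessor.__call__` in `helper.py` above normalizes each text with NFKD, maps every character through the indexer, and zero-pads the result to the longest text in the batch. A standard-library sketch of that flow follows. `encode_texts` is a hypothetical name and the identity `toy_indexer` is made up for illustration; the real table is loaded from `unicode_indexer.json`, whose contents are not shown here.

```python
from unicodedata import normalize


def encode_texts(texts, indexer):
    """Return (padded ID rows, true lengths) for a batch of texts."""
    normalized = [normalize("NFKD", t) for t in texts]
    lengths = [len(t) for t in normalized]
    max_len = max(lengths)

    text_ids = []
    for text in normalized:
        # Look up each code point, then right-pad with zeros.
        row = [indexer[ord(ch)] for ch in text]
        row += [0] * (max_len - len(row))
        text_ids.append(row)
    return text_ids, lengths


# Hypothetical identity table covering the Basic Multilingual Plane.
toy_indexer = list(range(0x10000))
ids, lengths = encode_texts(["hi", "\u00e9!"], toy_indexer)
```

NFKD splits `é` into `e` plus a combining accent, so the second text has length 3 and the first row is padded with one zero. The returned lengths are what a mask helper such as `length_to_mask` would consume.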
@@ -0,0 +1,20 @@
+[project]
+name = "tts-onnx"
+version = "1.0.0"
+description = "TTS ONNX Inference"
+requires-python = ">=3.10"
+dependencies = [
+    "onnxruntime==1.23.1",
+    "numpy>=1.26.0",
+    "soundfile>=0.12.1",
+    "librosa>=0.10.0",
+    "PyYAML>=6.0",
+]
+
+[tool.setuptools]
+py-modules = []
+
+[build-system]
+requires = ["setuptools"]
+build-backend = "setuptools.build_meta"
+
@@ -0,0 +1,5 @@
+onnxruntime==1.23.1
+numpy>=1.26.0
+soundfile>=0.12.1
+librosa>=0.10.0
+PyYAML>=6.0
@@ -0,0 +1,21 @@
+# Rust build artifacts
+/target/
+Cargo.lock
+
+# Output directory
+/results/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Debug
+*.pdb
+
@@ -0,0 +1,41 @@
+[package]
+name = "supertonic-tts"
+version = "0.1.0"
+edition = "2021"
+
+[dependencies]
+# ONNX Runtime
+ort = "2.0.0-rc.7"
+
+# Array processing (like NumPy)
+ndarray = { version = "0.16", features = ["rayon"] }
+rand = "0.8"
+rand_distr = "0.4"
+
+# Parallel processing
+rayon = "1.10"
+
+# Audio processing
+hound = "3.5"
+rustfft = "6.2"
+
+# JSON serialization
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1.0"
+
+# CLI argument parsing
+clap = { version = "4.5", features = ["derive"] }
+
+# Error handling
+anyhow = "1.0"
+
+# Unicode normalization
+unicode-normalization = "0.1"
+
+# System calls
+libc = "0.2"
+
+[[bin]]
+name = "example_onnx"
+path = "src/example_onnx.rs"
+
@@ -0,0 +1,101 @@
+# TTS ONNX Inference Examples
+
+This guide provides examples for running TTS inference using Rust.
+
+## Installation
+
+This project uses [Cargo](https://doc.rust-lang.org/cargo/) for package management.
+
+### Install Rust (if not already installed)
+```bash
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+```
+
+### Build the project
+```bash
+cargo build --release
+```
+
+## Basic Usage
+
+You can run the inference in two ways:
+1. **Using cargo run** (builds if needed, then runs)
+2. **Direct binary execution** (faster if already built)
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+# Using cargo run
+cargo run --release --bin example_onnx
+
+# Or directly execute the built binary (faster)
+./target/release/example_onnx
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+# Using cargo run
+cargo run --release --bin example_onnx -- \
+  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+
+# Or using the binary directly
+./target/release/example_onnx \
+  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice (M1.json) for the first text
+- Use female voice (F1.json) for the second text
+- Process both samples in a single batch
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+# Using cargo run
+cargo run --release --bin example_onnx -- \
+  --total-step 10 \
+  --voice-style assets/voice_styles/M1.json \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+
+# Or using the binary directly
+./target/release/example_onnx \
+  --total-step 10 \
+  --voice-style assets/voice_styles/M1.json \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--use-gpu` | flag | False | Request GPU inference (not supported yet; the program exits with an error) |
+| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
+| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
+| `--save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
+- **Known Issues**: On some platforms (especially macOS), there might be a mutex cleanup warning during exit. This is a known ONNX Runtime issue and doesn't affect functionality. The implementation uses `libc::_exit()` and `mem::forget()` to bypass this issue.
+
+
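The batch rule in the notes above can be sketched as follows. This is a minimal, standard-library-only illustration and not the project's code: `pair_batch` is a hypothetical helper, and it assumes the comma and pipe delimiters shown in the batch example for `--voice-style` and `--text`.

```rust
/// Hypothetical helper: split the two raw argument strings and pair them up.
/// Assumes ',' separates voice style paths and '|' separates texts.
fn pair_batch(voice_styles: &str, texts: &str) -> Result<Vec<(String, String)>, String> {
    let styles: Vec<&str> = voice_styles.split(',').collect();
    let texts: Vec<&str> = texts.split('|').collect();

    // The batch is only valid when every text has exactly one voice style.
    if styles.len() != texts.len() {
        return Err(format!(
            "Number of voice styles ({}) must match number of texts ({})",
            styles.len(),
            texts.len()
        ));
    }

    Ok(styles
        .iter()
        .zip(texts.iter())
        .map(|(s, t)| (s.to_string(), t.to_string()))
        .collect())
}

fn main() {
    let batch = pair_batch(
        "assets/voice_styles/M1.json,assets/voice_styles/F1.json",
        "First sentence.|Second sentence.",
    );
    match batch {
        Ok(pairs) => {
            for (style, text) in &pairs {
                println!("{style} -> {text}");
            }
        }
        Err(e) => eprintln!("{e}"),
    }
}
```

Because only `|` splits the text argument, commas inside a sentence stay part of that sentence; a literal `|` inside a text would, under this assumption, start a new batch entry.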
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,108 @@
+use anyhow::Result;
+use clap::Parser;
+use std::path::PathBuf;
+use std::fs;
+use std::mem;
+
+mod helper;
+
+use helper::{
+    load_text_to_speech, load_voice_style, timer, write_wav_file, sanitize_filename,
+};
+
+#[derive(Parser, Debug)]
+#[command(name = "TTS ONNX Inference")]
+#[command(about = "TTS Inference with ONNX Runtime (Rust)", long_about = None)]
+struct Args {
+    /// Use GPU for inference (default: CPU)
+    #[arg(long, default_value = "false")]
+    use_gpu: bool,
+
+    /// Path to ONNX model directory
+    #[arg(long, default_value = "assets/onnx")]
+    onnx_dir: String,
+
+    /// Number of denoising steps
+    #[arg(long, default_value = "5")]
+    total_step: usize,
+
+    /// Number of times to generate
+    #[arg(long, default_value = "4")]
+    n_test: usize,
+
+    /// Voice style file path(s)
+    #[arg(long, value_delimiter = ',', default_values_t = vec!["assets/voice_styles/M1.json".to_string()])]
+    voice_style: Vec<String>,
+
+    /// Text(s) to synthesize
+    #[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
+    text: Vec<String>,
+
+    /// Output directory
+    #[arg(long, default_value = "results")]
+    save_dir: String,
+}
+
+fn main() -> Result<()> {
+    println!("=== TTS Inference with ONNX Runtime (Rust) ===\n");
+
+    // --- 1. Parse arguments --- //
+    let args = Args::parse();
+    let total_step = args.total_step;
+    let n_test = args.n_test;
+    let voice_style_paths = &args.voice_style;
+    let text_list = &args.text;
+    let save_dir = &args.save_dir;
+
+    if voice_style_paths.len() != text_list.len() {
+        anyhow::bail!(
+            "Number of voice styles ({}) must match number of texts ({})",
+            voice_style_paths.len(),
+            text_list.len()
+        );
+    }
+
+    let bsz = voice_style_paths.len();
+
+    // --- 2. Load TTS components --- //
+    let mut text_to_speech = load_text_to_speech(&args.onnx_dir, args.use_gpu)?;
+
+    // --- 3. Load voice styles --- //
+    let style = load_voice_style(voice_style_paths, true)?;
+
+    // --- 4. Synthesize speech --- //
+    fs::create_dir_all(save_dir)?;
+
+    for n in 0..n_test {
+        println!("\n[{}/{}] Starting synthesis...", n + 1, n_test);
+
+        let (wav, duration) = timer("Generating speech from text", || {
+            text_to_speech.call(text_list, &style, total_step)
+        })?;
+
+        // Save outputs
+        let wav_len = wav.len() / bsz;
+        for i in 0..bsz {
+            let fname = format!("{}_{}.wav", sanitize_filename(&text_list[i], 20), n + 1);
+            let actual_len = (text_to_speech.sample_rate as f32 * duration[i]) as usize;
+
+            let wav_start = i * wav_len;
+            let wav_end = wav_start + actual_len.min(wav_len);
+            let wav_slice = &wav[wav_start..wav_end];
+
+            let output_path = PathBuf::from(save_dir).join(&fname);
+            write_wav_file(&output_path, wav_slice, text_to_speech.sample_rate)?;
+            println!("Saved: {}", output_path.display());
+        }
+    }
+
+    println!("\n=== Synthesis completed successfully! ===");
+    
+    // Prevent ONNX Runtime sessions from being dropped, which causes mutex cleanup issues
+    mem::forget(text_to_speech);
+    
+    // Use _exit to bypass all cleanup handlers and avoid ONNX Runtime mutex issues on macOS
+    unsafe {
+        libc::_exit(0);
+    }
+}
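The save loop above cuts one flat output buffer into per-item clips. Below is a minimal sketch of that arithmetic using only the standard library. `sample_range` is a hypothetical helper (not part of the repository), and the sample rate and durations in `main` are made-up values chosen for illustration.

```rust
/// Hypothetical helper mirroring the slicing arithmetic in the save loop:
/// return the (start, end) sample range for item `i` of a flat batch buffer.
fn sample_range(
    total_len: usize,
    bsz: usize,
    i: usize,
    sample_rate: i32,
    duration_sec: f32,
) -> (usize, usize) {
    // Every item occupies the same padded length in the flat buffer.
    let per_item = total_len / bsz;
    // Predicted length in samples; the cast truncates toward zero.
    let actual = (sample_rate as f32 * duration_sec) as usize;
    let start = i * per_item;
    // Never read past this item's padded region.
    let end = start + actual.min(per_item);
    (start, end)
}

fn main() {
    // Assumed values for illustration only: 2 items, 3 s of padded audio each.
    let sample_rate = 44_100;
    let wav = vec![0.0f32; 2 * 3 * 44_100];
    let durations = [1.5f32, 3.0];

    for (i, &d) in durations.iter().enumerate() {
        let (start, end) = sample_range(wav.len(), durations.len(), i, sample_rate, d);
        let slice = &wav[start..end];
        println!("item {i}: samples {start}..{end} ({} samples)", slice.len());
    }
}
```

Clamping with `min` matters when a predicted duration is longer than the padded region: the clip is cut short rather than spilling into the next item's samples.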
@@ -0,0 +1,507 @@
+// ============================================================================
+// TTS Helper Module - All utility functions and structures
+// ============================================================================
+
+use ndarray::{Array, Array3};
+use serde::{Deserialize, Serialize};
+use serde_json;
+use std::fs::File;
+use std::io::BufReader;
+use std::path::Path;
+use anyhow::{Result, Context};
+use unicode_normalization::UnicodeNormalization;
+use hound::{WavWriter, WavSpec, SampleFormat};
+use rand_distr::{Distribution, Normal};
+
+// ============================================================================
+// Configuration Structures
+// ============================================================================
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct Config {
+    pub ae: AEConfig,
+    pub ttl: TTLConfig,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct AEConfig {
+    pub sample_rate: i32,
+    pub base_chunk_size: i32,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct TTLConfig {
+    pub chunk_compress_factor: i32,
+    pub latent_dim: i32,
+}
+
+/// Load configuration from JSON file
+pub fn load_cfgs<P: AsRef<Path>>(onnx_dir: P) -> Result<Config> {
+    let cfg_path = onnx_dir.as_ref().join("tts.json");
+    let file = File::open(cfg_path)?;
+    let reader = BufReader::new(file);
+    let cfgs: Config = serde_json::from_reader(reader)?;
+    Ok(cfgs)
+}
+
+// ============================================================================
+// Voice Style Data Structure
+// ============================================================================
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct VoiceStyleData {
+    pub style_ttl: StyleComponent,
+    pub style_dp: StyleComponent,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct StyleComponent {
+    pub data: Vec<Vec<Vec<f32>>>,
+    pub dims: Vec<usize>,
+    #[serde(rename = "type")]
+    pub dtype: String,
+}
+
+// ============================================================================
+// Unicode Text Processor
+// ============================================================================
+
+pub struct UnicodeProcessor {
+    indexer: Vec<i64>,
+}
+
+impl UnicodeProcessor {
+    pub fn new<P: AsRef<Path>>(unicode_indexer_json_path: P) -> Result<Self> {
+        let file = File::open(unicode_indexer_json_path)?;
+        let reader = BufReader::new(file);
+        let indexer: Vec<i64> = serde_json::from_reader(reader)?;
+        Ok(UnicodeProcessor { indexer })
+    }
+
+    pub fn call(&self, text_list: &[String]) -> (Vec<Vec<i64>>, Array3<f32>) {
+        let processed_texts: Vec<String> = text_list
+            .iter()
+            .map(|t| preprocess_text(t))
+            .collect();
+
+        let text_ids_lengths: Vec<usize> = processed_texts
+            .iter()
+            .map(|t| t.chars().count())
+            .collect();
+
+        let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
+
+        let mut text_ids = Vec::new();
+        for text in &processed_texts {
+            let mut row = vec![0i64; max_len];
+            let unicode_vals = text_to_unicode_values(text);
+            for (j, &val) in unicode_vals.iter().enumerate() {
+                if val < self.indexer.len() {
+                    row[j] = self.indexer[val];
+                } else {
+                    row[j] = -1;
+                }
+            }
+            text_ids.push(row);
+        }
+
+        let text_mask = get_text_mask(&text_ids_lengths);
+
+        (text_ids, text_mask)
+    }
+}
+
+pub fn preprocess_text(text: &str) -> String {
+    text.nfkd().collect()
+}
+
+pub fn text_to_unicode_values(text: &str) -> Vec<usize> {
+    text.chars().map(|c| c as usize).collect()
+}
+
+pub fn length_to_mask(lengths: &[usize], max_len: Option<usize>) -> Array3<f32> {
+    let bsz = lengths.len();
+    let max_len = max_len.unwrap_or_else(|| *lengths.iter().max().unwrap_or(&0));
+
+    let mut mask = Array3::<f32>::zeros((bsz, 1, max_len));
+    for (i, &len) in lengths.iter().enumerate() {
+        for j in 0..len.min(max_len) {
+            mask[[i, 0, j]] = 1.0;
+        }
+    }
+    mask
+}
+
+pub fn get_text_mask(text_ids_lengths: &[usize]) -> Array3<f32> {
+    let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
+    length_to_mask(text_ids_lengths, Some(max_len))
+}
+
+/// Sample noisy latent from normal distribution and apply mask
+pub fn sample_noisy_latent(
+    duration: &[f32],
+    sample_rate: i32,
+    base_chunk_size: i32,
+    chunk_compress: i32,
+    latent_dim: i32,
+) -> (Array3<f32>, Array3<f32>) {
+    let bsz = duration.len();
+    let max_dur = duration.iter().fold(0.0f32, |a, &b| a.max(b));
+
+    let wav_len_max = (max_dur * sample_rate as f32) as usize;
+    let wav_lengths: Vec<usize> = duration
+        .iter()
+        .map(|&d| (d * sample_rate as f32) as usize)
+        .collect();
+
+    let chunk_size = (base_chunk_size * chunk_compress) as usize;
+    let latent_len = (wav_len_max + chunk_size - 1) / chunk_size;
+    let latent_dim_val = (latent_dim * chunk_compress) as usize;
+
+    let mut noisy_latent = Array3::<f32>::zeros((bsz, latent_dim_val, latent_len));
+
+    let normal = Normal::new(0.0, 1.0).unwrap();
+    let mut rng = rand::thread_rng();
+
+    for b in 0..bsz {
+        for d in 0..latent_dim_val {
+            for t in 0..latent_len {
+                noisy_latent[[b, d, t]] = normal.sample(&mut rng);
+            }
+        }
+    }
+
+    let latent_lengths: Vec<usize> = wav_lengths
+        .iter()
+        .map(|&len| (len + chunk_size - 1) / chunk_size)
+        .collect();
+
+    let latent_mask = length_to_mask(&latent_lengths, Some(latent_len));
+
+    // Apply mask
+    for b in 0..bsz {
+        for d in 0..latent_dim_val {
+            for t in 0..latent_len {
+                noisy_latent[[b, d, t]] *= latent_mask[[b, 0, t]];
+            }
+        }
+    }
+
+    (noisy_latent, latent_mask)
+}
+
+// ============================================================================
+// WAV File I/O
+// ============================================================================
+
+pub fn write_wav_file<P: AsRef<Path>>(
+    filename: P,
+    audio_data: &[f32],
+    sample_rate: i32,
+) -> Result<()> {
+    let spec = WavSpec {
+        channels: 1,
+        sample_rate: sample_rate as u32,
+        bits_per_sample: 16,
+        sample_format: SampleFormat::Int,
+    };
+
+    let mut writer = WavWriter::create(filename, spec)?;
+
+    for &sample in audio_data {
+        let clamped = sample.max(-1.0).min(1.0);
+        let val = (clamped * 32767.0) as i16;
+        writer.write_sample(val)?;
+    }
+
+    writer.finalize()?;
+    Ok(())
+}
+
+// ============================================================================
+// Utility Functions
+// ============================================================================
+
+pub fn timer<F, T>(name: &str, f: F) -> Result<T>
+where
+    F: FnOnce() -> Result<T>,
+{
+    let start = std::time::Instant::now();
+    println!("{}...", name);
+    let result = f()?;
+    let elapsed = start.elapsed().as_secs_f64();
+    println!("  -> {} completed in {:.2} sec", name, elapsed);
+    Ok(result)
+}
+
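+/// Build a file-name-safe string from the first `max_len` characters of `text`,
+/// replacing every non-ASCII-alphanumeric character with an underscore.
+pub fn sanitize_filename(text: &str, max_len: usize) -> String {
+    // Truncate by characters rather than bytes: slicing `&text[..max_len]`
+    // panics when `max_len` falls inside a multi-byte UTF-8 character,
+    // which any non-ASCII input text can trigger.
+    text.chars()
+        .take(max_len)
+        .map(|c| {
+            if c.is_ascii_alphanumeric() {
+                c
+            } else {
+                '_'
+            }
+        })
+        .collect()
+}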
+
+// ============================================================================
+// ONNX Runtime Integration
+// ============================================================================
+
+use ort::{
+    session::Session,
+    value::Value,
+};
+
+pub struct Style {
+    pub ttl: Array3<f32>,
+    pub dp: Array3<f32>,
+}
+
+pub struct TextToSpeech {
+    cfgs: Config,
+    text_processor: UnicodeProcessor,
+    dp_ort: Session,
+    text_enc_ort: Session,
+    vector_est_ort: Session,
+    vocoder_ort: Session,
+    pub sample_rate: i32,
+}
+
+impl TextToSpeech {
+    pub fn new(
+        cfgs: Config,
+        text_processor: UnicodeProcessor,
+        dp_ort: Session,
+        text_enc_ort: Session,
+        vector_est_ort: Session,
+        vocoder_ort: Session,
+    ) -> Self {
+        let sample_rate = cfgs.ae.sample_rate;
+        TextToSpeech {
+            cfgs,
+            text_processor,
+            dp_ort,
+            text_enc_ort,
+            vector_est_ort,
+            vocoder_ort,
+            sample_rate,
+        }
+    }
+
+    pub fn call(
+        &mut self,
+        text_list: &[String],
+        style: &Style,
+        total_step: usize,
+    ) -> Result<(Vec<f32>, Vec<f32>)> {
+        let bsz = text_list.len();
+
+        // Process text
+        let (text_ids, text_mask) = self.text_processor.call(text_list);
+        
+        let text_ids_array = {
+            let text_ids_shape = (bsz, text_ids[0].len());
+            let mut flat = Vec::new();
+            for row in &text_ids {
+                flat.extend_from_slice(row);
+            }
+            Array::from_shape_vec(text_ids_shape, flat)?
+        };
+
+        let text_ids_value = Value::from_array(text_ids_array)?;
+        let text_mask_value = Value::from_array(text_mask.clone())?;
+        let style_dp_value = Value::from_array(style.dp.clone())?;
+
+        // Predict duration
+        let dp_outputs = self.dp_ort.run(ort::inputs!{
+            "text_ids" => &text_ids_value,
+            "style_dp" => &style_dp_value,
+            "text_mask" => &text_mask_value
+        })?;
+
+        let (_, duration_data) = dp_outputs["duration"].try_extract_tensor::<f32>()?;
+        let duration: Vec<f32> = duration_data.to_vec();
+
+        // Encode text
+        let style_ttl_value = Value::from_array(style.ttl.clone())?;
+        let text_enc_outputs = self.text_enc_ort.run(ort::inputs!{
+            "text_ids" => &text_ids_value,
+            "style_ttl" => &style_ttl_value,
+            "text_mask" => &text_mask_value
+        })?;
+
+        let (text_emb_shape, text_emb_data) = text_enc_outputs["text_emb"].try_extract_tensor::<f32>()?;
+        let text_emb = Array3::from_shape_vec(
+            (text_emb_shape[0] as usize, text_emb_shape[1] as usize, text_emb_shape[2] as usize),
+            text_emb_data.to_vec()
+        )?;
+
+        // Sample noisy latent
+        let (mut xt, latent_mask) = sample_noisy_latent(
+            &duration,
+            self.sample_rate,
+            self.cfgs.ae.base_chunk_size,
+            self.cfgs.ttl.chunk_compress_factor,
+            self.cfgs.ttl.latent_dim,
+        );
+
+        // Prepare constant arrays
+        let total_step_array = Array::from_elem(bsz, total_step as f32);
+
+        // Denoising loop
+        for step in 0..total_step {
+            let current_step_array = Array::from_elem(bsz, step as f32);
+
+            let xt_value = Value::from_array(xt.clone())?;
+            let text_emb_value = Value::from_array(text_emb.clone())?;
+            let latent_mask_value = Value::from_array(latent_mask.clone())?;
+            let text_mask_value2 = Value::from_array(text_mask.clone())?;
+            let current_step_value = Value::from_array(current_step_array)?;
+            let total_step_value = Value::from_array(total_step_array.clone())?;
+
+            let vector_est_outputs = self.vector_est_ort.run(ort::inputs!{
+                "noisy_latent" => &xt_value,
+                "text_emb" => &text_emb_value,
+                "style_ttl" => &style_ttl_value,
+                "latent_mask" => &latent_mask_value,
+                "text_mask" => &text_mask_value2,
+                "current_step" => &current_step_value,
+                "total_step" => &total_step_value
+            })?;
+
+            let (denoised_shape, denoised_data) = vector_est_outputs["denoised_latent"].try_extract_tensor::<f32>()?;
+            xt = Array3::from_shape_vec(
+                (denoised_shape[0] as usize, denoised_shape[1] as usize, denoised_shape[2] as usize),
+                denoised_data.to_vec()
+            )?;
+        }
+
+        // Generate waveform
+        let final_latent_value = Value::from_array(xt)?;
+        let vocoder_outputs = self.vocoder_ort.run(ort::inputs!{
+            "latent" => &final_latent_value
+        })?;
+
+        let (_, wav_data) = vocoder_outputs["wav_tts"].try_extract_tensor::<f32>()?;
+        let wav: Vec<f32> = wav_data.to_vec();
+
+        Ok((wav, duration))
+    }
+}
+
+// ============================================================================
+// Component Loading Functions
+// ============================================================================
+
+/// Load voice style from JSON files
+pub fn load_voice_style(voice_style_paths: &[String], verbose: bool) -> Result<Style> {
+    let bsz = voice_style_paths.len();
+
+    // Read first file to get dimensions
+    let first_file = File::open(&voice_style_paths[0])
+        .context("Failed to open voice style file")?;
+    let first_reader = BufReader::new(first_file);
+    let first_data: VoiceStyleData = serde_json::from_reader(first_reader)?;
+
+    let ttl_dims = &first_data.style_ttl.dims;
+    let dp_dims = &first_data.style_dp.dims;
+
+    let ttl_dim1 = ttl_dims[1];
+    let ttl_dim2 = ttl_dims[2];
+    let dp_dim1 = dp_dims[1];
+    let dp_dim2 = dp_dims[2];
+
+    // Pre-allocate arrays with full batch size
+    let ttl_size = bsz * ttl_dim1 * ttl_dim2;
+    let dp_size = bsz * dp_dim1 * dp_dim2;
+    let mut ttl_flat = vec![0.0f32; ttl_size];
+    let mut dp_flat = vec![0.0f32; dp_size];
+
+    // Fill in the data
+    for (i, path) in voice_style_paths.iter().enumerate() {
+        let file = File::open(path).context("Failed to open voice style file")?;
+        let reader = BufReader::new(file);
+        let data: VoiceStyleData = serde_json::from_reader(reader)?;
+
+        // Flatten TTL data
+        let ttl_offset = i * ttl_dim1 * ttl_dim2;
+        let mut idx = 0;
+        for batch in &data.style_ttl.data {
+            for row in batch {
+                for &val in row {
+                    ttl_flat[ttl_offset + idx] = val;
+                    idx += 1;
+                }
+            }
+        }
+
+        // Flatten DP data
+        let dp_offset = i * dp_dim1 * dp_dim2;
+        idx = 0;
+        for batch in &data.style_dp.data {
+            for row in batch {
+                for &val in row {
+                    dp_flat[dp_offset + idx] = val;
+                    idx += 1;
+                }
+            }
+        }
+    }
+
+    let ttl_style = Array3::from_shape_vec((bsz, ttl_dim1, ttl_dim2), ttl_flat)?;
+    let dp_style = Array3::from_shape_vec((bsz, dp_dim1, dp_dim2), dp_flat)?;
+
+    if verbose {
+        println!("Loaded {} voice styles\n", bsz);
+    }
+
+    Ok(Style {
+        ttl: ttl_style,
+        dp: dp_style,
+    })
+}
+
+/// Load TTS components
+pub fn load_text_to_speech(onnx_dir: &str, use_gpu: bool) -> Result<TextToSpeech> {
+    if use_gpu {
+        anyhow::bail!("GPU mode is not supported yet");
+    }
+    println!("Using CPU for inference\n");
+
+    let cfgs = load_cfgs(onnx_dir)?;
+
+    let dp_path = format!("{}/duration_predictor.onnx", onnx_dir);
+    let text_enc_path = format!("{}/text_encoder.onnx", onnx_dir);
+    let vector_est_path = format!("{}/vector_estimator.onnx", onnx_dir);
+    let vocoder_path = format!("{}/vocoder.onnx", onnx_dir);
+
+    let dp_ort = Session::builder()?
+        .commit_from_file(&dp_path)?;
+    let text_enc_ort = Session::builder()?
+        .commit_from_file(&text_enc_path)?;
+    let vector_est_ort = Session::builder()?
+        .commit_from_file(&vector_est_path)?;
+    let vocoder_ort = Session::builder()?
+        .commit_from_file(&vocoder_path)?;
+
+    let unicode_indexer_path = format!("{}/unicode_indexer.json", onnx_dir);
+    let text_processor = UnicodeProcessor::new(&unicode_indexer_path)?;
+
+    Ok(TextToSpeech::new(
+        cfgs,
+        text_processor,
+        dp_ort,
+        text_enc_ort,
+        vector_est_ort,
+        vocoder_ort,
+    ))
+}
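`sample_noisy_latent` converts each predicted duration to a latent frame count by ceiling division and then zeroes the padded frames with a mask. The sketch below reproduces that arithmetic with plain `Vec`s instead of `ndarray`. `ceil_div`, `latent_lengths`, and `mask_rows` are hypothetical names, and the sample rate and chunk size are illustrative values, not the ones shipped in `tts.json`.

```rust
/// Ceiling division, as used when converting sample counts to latent frames.
fn ceil_div(n: usize, d: usize) -> usize {
    (n + d - 1) / d
}

/// Hypothetical helper: latent frame count for each duration (in seconds).
fn latent_lengths(durations: &[f32], sample_rate: i32, chunk_size: usize) -> Vec<usize> {
    durations
        .iter()
        .map(|&d| ceil_div((d * sample_rate as f32) as usize, chunk_size))
        .collect()
}

/// Build a [batch][frame] mask: 1.0 for valid frames, 0.0 for padding.
fn mask_rows(lengths: &[usize]) -> Vec<Vec<f32>> {
    let max_len = lengths.iter().copied().max().unwrap_or(0);
    lengths
        .iter()
        .map(|&len| (0..max_len).map(|t| if t < len { 1.0 } else { 0.0 }).collect())
        .collect()
}

fn main() {
    // Assumed values for illustration only; the real ones come from tts.json.
    let sample_rate = 1_000;
    let chunk_size = 300; // stands in for base_chunk_size * chunk_compress_factor
    let lengths = latent_lengths(&[1.0, 0.5], sample_rate, chunk_size);
    let mask = mask_rows(&lengths);
    println!("lengths = {lengths:?}");
    println!("mask = {mask:?}");
}
```

Rounding up keeps the final partial chunk of audio, so a 1000-sample clip with a 300-sample chunk occupies four latent frames, not three.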
@@ -0,0 +1,15 @@
+# Swift Package Manager
+.build/
+.swiftpm/
+*.xcodeproj
+*.xcworkspace
+
+# Build artifacts
+example_onnx
+
+# Results
+results/*.wav
+
+# macOS
+.DS_Store
+
@@ -0,0 +1,14 @@
+{
+  "pins" : [
+    {
+      "identity" : "onnxruntime-swift-package-manager",
+      "kind" : "remoteSourceControl",
+      "location" : "https://github.com/microsoft/onnxruntime-swift-package-manager.git",
+      "state" : {
+        "revision" : "12ce7374c86944e1f68f3a866d10105d8357f074",
+        "version" : "1.20.0"
+      }
+    }
+  ],
+  "version" : 2
+}
@@ -0,0 +1,22 @@
+// swift-tools-version: 5.9
+import PackageDescription
+
+let package = Package(
+    name: "Supertonic",
+    platforms: [
+        .macOS(.v13)
+    ],
+    dependencies: [
+        .package(url: "https://github.com/microsoft/onnxruntime-swift-package-manager.git", from: "1.16.0"),
+    ],
+    targets: [
+        .executableTarget(
+            name: "example_onnx",
+            dependencies: [
+                .product(name: "onnxruntime", package: "onnxruntime-swift-package-manager")
+            ],
+            path: "Sources"
+        )
+    ]
+)
+
@@ -0,0 +1,76 @@
+# TTS ONNX Inference Examples
+
+This guide provides examples for running TTS inference using `example_onnx`.
+
+## Installation
+
+This project uses Swift Package Manager (SPM) for dependency management.
+
+### Prerequisites
+- Swift 5.9 or later
+- macOS 13.0 or later
+
+### Build the project
+```bash
+swift build -c release
+```
+
+## Basic Usage
+
+### Example 1: Default Inference
+Run inference with default settings:
+```bash
+.build/release/example_onnx
+```
+
+This will use:
+- Voice style: `assets/voice_styles/M1.json`
+- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
+- Output directory: `results/`
+- Total steps: 5
+- Number of generations: 4
+
+### Example 2: Batch Inference
+Process multiple voice styles and texts at once:
+```bash
+.build/release/example_onnx \
+  --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
+  --text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+```
+
+This will:
+- Generate speech for 2 different voice-text pairs
+- Use male voice (M1.json) for the first text
+- Use female voice (F1.json) for the second text
+- Process both samples in a single batch
+
+### Example 3: High Quality Inference
+Increase denoising steps for better quality:
+```bash
+.build/release/example_onnx \
+  --total-step 10 \
+  --voice-style assets/voice_styles/M1.json \
+  --text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
+```
+
+This will:
+- Use 10 denoising steps instead of the default 5
+- Produce higher quality output at the cost of slower inference
+
+## Available Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
+| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
+| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
+| `--n-test` | int | 4 | Number of times to generate each sample |
+| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
+| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
+| `--save-dir` | str | `results` | Output directory |
+
+## Notes
+
+- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
+- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
+- **GPU Support**: GPU mode is not supported yet
@@ -0,0 +1,122 @@
+import Foundation
+import OnnxRuntimeBindings
+
+struct Args {
+    var useGpu: Bool = false
+    var onnxDir: String = "assets/onnx"
+    var totalStep: Int = 5
+    var nTest: Int = 4
+    var voiceStyle: [String] = ["assets/voice_styles/M1.json"]
+    var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
+    var saveDir: String = "results"
+}
+
+func parseArgs() -> Args {
+    var args = Args()
+    let arguments = CommandLine.arguments
+    
+    var i = 1
+    while i < arguments.count {
+        let arg = arguments[i]
+        
+        switch arg {
+        case "--use-gpu":
+            args.useGpu = true
+        case "--onnx-dir":
+            if i + 1 < arguments.count {
+                args.onnxDir = arguments[i + 1]
+                i += 1
+            }
+        case "--total-step":
+            if i + 1 < arguments.count {
+                args.totalStep = Int(arguments[i + 1]) ?? 5
+                i += 1
+            }
+        case "--n-test":
+            if i + 1 < arguments.count {
+                args.nTest = Int(arguments[i + 1]) ?? 4
+                i += 1
+            }
+        case "--voice-style":
+            if i + 1 < arguments.count {
+                args.voiceStyle = arguments[i + 1].components(separatedBy: ",")
+                i += 1
+            }
+        case "--text":
+            if i + 1 < arguments.count {
+                args.text = arguments[i + 1].components(separatedBy: "|")
+                i += 1
+            }
+        case "--save-dir":
+            if i + 1 < arguments.count {
+                args.saveDir = arguments[i + 1]
+                i += 1
+            }
+        default:
+            break
+        }
+        
+        i += 1
+    }
+    
+    return args
+}
+
+@main
+struct ExampleONNX {
+    static func main() async {
+        print("=== TTS Inference with ONNX Runtime (Swift) ===\n")
+        
+        // --- 1. Parse arguments --- //
+        let args = parseArgs()
+        
+        guard args.voiceStyle.count == args.text.count else {
+            print("Error: Number of voice styles (\(args.voiceStyle.count)) must match number of texts (\(args.text.count))")
+            exit(1)
+        }
+        
+        let bsz = args.voiceStyle.count
+        
+        do {
+            let env = try ORTEnv(loggingLevel: .warning)
+            
+            // --- 2. Load TTS components --- //
+            let textToSpeech = try loadTextToSpeech(args.onnxDir, args.useGpu, env)
+            
+            // --- 3. Load voice styles --- //
+            let style = try loadVoiceStyle(args.voiceStyle, verbose: true)
+            
+            // --- 4. Synthesize speech --- //
+            try FileManager.default.createDirectory(atPath: args.saveDir, withIntermediateDirectories: true)
+            
+            for n in 0..<args.nTest {
+                print("\n[\(n + 1)/\(args.nTest)] Starting synthesis...")
+                
+                let (wav, duration) = try timer("Generating speech from text") {
+                    try textToSpeech.call(args.text, style, args.totalStep)
+                }
+                
+                // Save outputs
+                let wavLen = wav.count / bsz
+                for i in 0..<bsz {
+                    let fname = "\(sanitizeFilename(args.text[i], maxLen: 20))_\(n + 1).wav"
+                    let actualLen = Int(Float(textToSpeech.sampleRate) * duration[i])
+                    
+                    let wavStart = i * wavLen
+                    let wavEnd = min(wavStart + actualLen, wavStart + wavLen)
+                    let wavOut = Array(wav[wavStart..<wavEnd])
+                    
+                    let outputPath = "\(args.saveDir)/\(fname)"
+                    try writeWavFile(outputPath, wavOut, textToSpeech.sampleRate)
+                    print("Saved: \(outputPath)")
+                }
+            }
+            
+            print("\n=== Synthesis completed successfully! ===")
+            
+        } catch {
+            print("Error during inference: \(error)")
+            exit(1)
+        }
+    }
+}
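The save loop in `main` above relies on the vocoder returning one flat buffer in which every batch item occupies an equal-length slot, padded past its predicted duration. The following is a minimal JavaScript sketch of that trimming step under the same layout assumption; `splitBatch` is a hypothetical helper name used for illustration and does not exist in the repository.

```javascript
// Hypothetical helper: split a flat batch waveform into per-item clips.
// Assumes `wav` holds `durations.length` equal-length slots back to back.
function splitBatch(wav, durations, sampleRate) {
    const bsz = durations.length;
    const slotLen = Math.floor(wav.length / bsz);
    return durations.map((seconds, i) => {
        const start = i * slotLen;
        // Keep only the predicted duration, never reading past the slot.
        const wanted = Math.floor(sampleRate * seconds);
        const end = Math.min(start + wanted, start + slotLen);
        return wav.slice(start, end);
    });
}

// Two items in slots of 4 samples each, at a toy sample rate of 2 Hz.
const clips = splitBatch([1, 2, 3, 4, 5, 6, 7, 8], [1.0, 5.0], 2);
console.log(clips); // the first clip keeps 2 samples, the second is capped at its 4-sample slot
```

Capping at the slot boundary matters when a predicted duration slightly exceeds what the vocoder actually produced for the batch.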
@@ -0,0 +1,483 @@
+import Foundation
+import Accelerate
+import OnnxRuntimeBindings
+
+// MARK: - Configuration Structures
+
+struct Config: Codable {
+    struct AEConfig: Codable {
+        let sample_rate: Int
+        let base_chunk_size: Int
+    }
+    
+    struct TTLConfig: Codable {
+        let chunk_compress_factor: Int
+        let latent_dim: Int
+    }
+    
+    let ae: AEConfig
+    let ttl: TTLConfig
+}
+
+// MARK: - Voice Style Data Structure
+
+struct VoiceStyleData: Codable {
+    struct StyleComponent: Codable {
+        let data: [[[Float]]]
+        let dims: [Int]
+        let type: String
+    }
+    
+    let style_ttl: StyleComponent
+    let style_dp: StyleComponent
+}
+
+// MARK: - Unicode Text Processor
+
+class UnicodeProcessor {
+    let indexer: [Int64]
+    
+    init(unicodeIndexerPath: String) throws {
+        let data = try Data(contentsOf: URL(fileURLWithPath: unicodeIndexerPath))
+        self.indexer = try JSONDecoder().decode([Int64].self, from: data)
+    }
+    
+    func call(_ textList: [String]) -> (textIds: [[Int64]], textMask: [[[Float]]]) {
+        let processedTexts = textList.map { preprocessText($0) }
+        
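+        // Count Unicode scalars, not Characters: the rows below are filled per scalar,
+        // so a grapheme-cluster count could be too small and index past the row.
+        var textIdsLengths = [Int]()
+        for text in processedTexts {
+            textIdsLengths.append(text.unicodeScalars.count)
+        }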
+        
+        let maxLen = textIdsLengths.max() ?? 0
+        
+        var textIds = [[Int64]]()
+        for text in processedTexts {
+            var row = Array(repeating: Int64(0), count: maxLen)
+            let unicodeValues = Array(text.unicodeScalars.map { Int($0.value) })
+            for (j, val) in unicodeValues.enumerated() {
+                if val < indexer.count {
+                    row[j] = indexer[val]
+                } else {
+                    row[j] = -1
+                }
+            }
+            textIds.append(row)
+        }
+        
+        let textMask = getTextMask(textIdsLengths)
+        return (textIds, textMask)
+    }
+}
+
+func preprocessText(_ text: String) -> String {
+    return text.precomposedStringWithCompatibilityMapping
+}
+
+func lengthToMask(_ lengths: [Int], maxLen: Int? = nil) -> [[[Float]]] {
+    let actualMaxLen = maxLen ?? (lengths.max() ?? 0)
+    
+    var mask = [[[Float]]]()
+    for len in lengths {
+        var row = Array(repeating: Float(0.0), count: actualMaxLen)
+        for j in 0..<min(len, actualMaxLen) {
+            row[j] = 1.0
+        }
+        mask.append([row])
+    }
+    return mask
+}
+
+func getTextMask(_ textIdsLengths: [Int]) -> [[[Float]]] {
+    let maxLen = textIdsLengths.max() ?? 0
+    return lengthToMask(textIdsLengths, maxLen: maxLen)
+}
+
+func sampleNoisyLatent(duration: [Float], sampleRate: Int, baseChunkSize: Int, chunkCompress: Int, latentDim: Int) -> (noisyLatent: [[[Float]]], latentMask: [[[Float]]]) {
+    let bsz = duration.count
+    let maxDur = duration.max() ?? 0.0
+    
+    let wavLenMax = Int(maxDur * Float(sampleRate))
+    var wavLengths = [Int]()
+    for d in duration {
+        wavLengths.append(Int(d * Float(sampleRate)))
+    }
+    
+    let chunkSize = baseChunkSize * chunkCompress
+    let latentLen = (wavLenMax + chunkSize - 1) / chunkSize
+    let latentDimVal = latentDim * chunkCompress
+    
+    var noisyLatent = [[[Float]]]()
+    for _ in 0..<bsz {
+        var batch = [[Float]]()
+        for _ in 0..<latentDimVal {
+            var row = [Float]()
+            for _ in 0..<latentLen {
+                // Box-Muller transform
+                let u1 = Float.random(in: 0.0001...1.0)
+                let u2 = Float.random(in: 0.0...1.0)
+                let val = sqrt(-2.0 * log(u1)) * cos(2.0 * Float.pi * u2)
+                row.append(val)
+            }
+            batch.append(row)
+        }
+        noisyLatent.append(batch)
+    }
+    
+    var latentLengths = [Int]()
+    for len in wavLengths {
+        latentLengths.append((len + chunkSize - 1) / chunkSize)
+    }
+    
+    let latentMask = lengthToMask(latentLengths, maxLen: latentLen)
+    
+    // Apply mask
+    for b in 0..<bsz {
+        for d in 0..<latentDimVal {
+            for t in 0..<latentLen {
+                noisyLatent[b][d][t] *= latentMask[b][0][t]
+            }
+        }
+    }
+    
+    return (noisyLatent, latentMask)
+}
+
+func getLatentMask(_ wavLengths: [Int64], _ cfgs: Config) -> [[[Float]]] {
+    let baseChunkSize = cfgs.ae.base_chunk_size
+    let chunkCompressFactor = cfgs.ttl.chunk_compress_factor
+    let latentSize = baseChunkSize * chunkCompressFactor
+    
+    var latentLengths = [Int]()
+    for len in wavLengths {
+        latentLengths.append((Int(len) + latentSize - 1) / latentSize)
+    }
+    
+    let maxLen = latentLengths.max() ?? 0
+    return lengthToMask(latentLengths, maxLen: maxLen)
+}
+
+// MARK: - WAV File I/O
+
+func writeWavFile(_ filename: String, _ audioData: [Float], _ sampleRate: Int) throws {
+    let url = URL(fileURLWithPath: filename)
+    
+    // Convert float to int16
+    let int16Data = audioData.map { sample -> Int16 in
+        let clamped = max(-1.0, min(1.0, sample))
+        return Int16(clamped * 32767.0)
+    }
+    
+    // Create WAV header
+    let numChannels: UInt16 = 1
+    let bitsPerSample: UInt16 = 16
+    let byteRate = UInt32(sampleRate) * UInt32(numChannels) * UInt32(bitsPerSample) / 8
+    let blockAlign = numChannels * bitsPerSample / 8
+    let dataSize = UInt32(int16Data.count * 2)
+    
+    var data = Data()
+    
+    // RIFF chunk
+    data.append("RIFF".data(using: .ascii)!)
+    withUnsafeBytes(of: UInt32(36 + dataSize).littleEndian) { data.append(contentsOf: $0) }
+    data.append("WAVE".data(using: .ascii)!)
+    
+    // fmt chunk
+    data.append("fmt ".data(using: .ascii)!)
+    withUnsafeBytes(of: UInt32(16).littleEndian) { data.append(contentsOf: $0) }
+    withUnsafeBytes(of: UInt16(1).littleEndian) { data.append(contentsOf: $0) } // PCM
+    withUnsafeBytes(of: numChannels.littleEndian) { data.append(contentsOf: $0) }
+    withUnsafeBytes(of: UInt32(sampleRate).littleEndian) { data.append(contentsOf: $0) }
+    withUnsafeBytes(of: byteRate.littleEndian) { data.append(contentsOf: $0) }
+    withUnsafeBytes(of: blockAlign.littleEndian) { data.append(contentsOf: $0) }
+    withUnsafeBytes(of: bitsPerSample.littleEndian) { data.append(contentsOf: $0) }
+    
+    // data chunk
+    data.append("data".data(using: .ascii)!)
+    withUnsafeBytes(of: dataSize.littleEndian) { data.append(contentsOf: $0) }
+    
+    // audio data
+    int16Data.withUnsafeBytes { data.append(contentsOf: $0) }
+    
+    try data.write(to: url)
+}
+
+// MARK: - Utility Functions
+
+func timer<T>(_ name: String, _ f: () throws -> T) rethrows -> T {
+    let start = Date()
+    print("\(name)...")
+    let result = try f()
+    let elapsed = Date().timeIntervalSince(start)
+    print(String(format: "  -> %@ completed in %.2f sec", name, elapsed))
+    return result
+}
+
+func sanitizeFilename(_ text: String, maxLen: Int) -> String {
+    let truncated = text.count > maxLen ? String(text.prefix(maxLen)) : text
+    return truncated.map { char in
+        if char.isLetter || char.isNumber {
+            return char
+        } else {
+            return Character("_")
+        }
+    }.map(String.init).joined()
+}
+
+func loadCfgs(_ onnxDir: String) throws -> Config {
+    let cfgPath = "\(onnxDir)/tts.json"
+    let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath))
+    let config = try JSONDecoder().decode(Config.self, from: data)
+    return config
+}
+
+// MARK: - ONNX Runtime Integration
+
+struct Style {
+    let ttl: ORTValue
+    let dp: ORTValue
+}
+
+class TextToSpeech {
+    let cfgs: Config
+    let textProcessor: UnicodeProcessor
+    let dpOrt: ORTSession
+    let textEncOrt: ORTSession
+    let vectorEstOrt: ORTSession
+    let vocoderOrt: ORTSession
+    let sampleRate: Int
+    
+    init(cfgs: Config, textProcessor: UnicodeProcessor,
+         dpOrt: ORTSession, textEncOrt: ORTSession,
+         vectorEstOrt: ORTSession, vocoderOrt: ORTSession) {
+        self.cfgs = cfgs
+        self.textProcessor = textProcessor
+        self.dpOrt = dpOrt
+        self.textEncOrt = textEncOrt
+        self.vectorEstOrt = vectorEstOrt
+        self.vocoderOrt = vocoderOrt
+        self.sampleRate = cfgs.ae.sample_rate
+    }
+    
+    func call(_ textList: [String], _ style: Style, _ totalStep: Int) throws -> (wav: [Float], duration: [Float]) {
+        let bsz = textList.count
+        
+        // Process text
+        let (textIds, textMask) = textProcessor.call(textList)
+        
+        // Flatten text IDs
+        let textIdsFlat = textIds.flatMap { $0 }
+        let textIdsShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: textIds[0].count)]
+        let textIdsValue = try ORTValue(tensorData: NSMutableData(bytes: textIdsFlat, length: textIdsFlat.count * MemoryLayout<Int64>.size),
+                                        elementType: .int64,
+                                        shape: textIdsShape)
+        
+        // Flatten text mask
+        let textMaskFlat = textMask.flatMap { $0.flatMap { $0 } }
+        let textMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: textMask[0][0].count)]
+        let textMaskValue = try ORTValue(tensorData: NSMutableData(bytes: textMaskFlat, length: textMaskFlat.count * MemoryLayout<Float>.size),
+                                         elementType: .float,
+                                         shape: textMaskShape)
+        
+        // Predict duration
+        let dpOutputs = try dpOrt.run(withInputs: ["text_ids": textIdsValue, "style_dp": style.dp, "text_mask": textMaskValue],
+                                      outputNames: ["duration"],
+                                      runOptions: nil)
+        
+        let durationData = try dpOutputs["duration"]!.tensorData() as Data
+        let duration = durationData.withUnsafeBytes { ptr in
+            Array(ptr.bindMemory(to: Float.self))
+        }
+        
+        // Encode text
+        let textEncOutputs = try textEncOrt.run(withInputs: ["text_ids": textIdsValue, "style_ttl": style.ttl, "text_mask": textMaskValue],
+                                                outputNames: ["text_emb"],
+                                                runOptions: nil)
+        
+        let textEmbValue = textEncOutputs["text_emb"]!
+        
+        // Sample noisy latent
+        var (xt, latentMask) = sampleNoisyLatent(duration: duration, sampleRate: sampleRate,
+                                                  baseChunkSize: cfgs.ae.base_chunk_size,
+                                                  chunkCompress: cfgs.ttl.chunk_compress_factor,
+                                                  latentDim: cfgs.ttl.latent_dim)
+        
+        // Prepare constant arrays
+        let totalStepArray = Array(repeating: Float(totalStep), count: bsz)
+        let totalStepValue = try ORTValue(tensorData: NSMutableData(bytes: totalStepArray, length: totalStepArray.count * MemoryLayout<Float>.size),
+                                          elementType: .float,
+                                          shape: [NSNumber(value: bsz)])
+        
+        // Denoising loop
+        for step in 0..<totalStep {
+            let currentStepArray = Array(repeating: Float(step), count: bsz)
+            let currentStepValue = try ORTValue(tensorData: NSMutableData(bytes: currentStepArray, length: currentStepArray.count * MemoryLayout<Float>.size),
+                                                elementType: .float,
+                                                shape: [NSNumber(value: bsz)])
+            
+            // Flatten xt
+            let xtFlat = xt.flatMap { $0.flatMap { $0 } }
+            let xtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
+            let xtValue = try ORTValue(tensorData: NSMutableData(bytes: xtFlat, length: xtFlat.count * MemoryLayout<Float>.size),
+                                       elementType: .float,
+                                       shape: xtShape)
+            
+            // Flatten latent mask
+            let latentMaskFlat = latentMask.flatMap { $0.flatMap { $0 } }
+            let latentMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: latentMask[0][0].count)]
+            let latentMaskValue = try ORTValue(tensorData: NSMutableData(bytes: latentMaskFlat, length: latentMaskFlat.count * MemoryLayout<Float>.size),
+                                               elementType: .float,
+                                               shape: latentMaskShape)
+            
+            let vectorEstOutputs = try vectorEstOrt.run(withInputs: [
+                "noisy_latent": xtValue,
+                "text_emb": textEmbValue,
+                "style_ttl": style.ttl,
+                "latent_mask": latentMaskValue,
+                "text_mask": textMaskValue,
+                "current_step": currentStepValue,
+                "total_step": totalStepValue
+            ], outputNames: ["denoised_latent"], runOptions: nil)
+            
+            let denoisedData = try vectorEstOutputs["denoised_latent"]!.tensorData() as Data
+            let denoisedFlat = denoisedData.withUnsafeBytes { ptr in
+                Array(ptr.bindMemory(to: Float.self))
+            }
+            
+            // Reshape to 3D
+            let latentDimVal = xt[0].count
+            let latentLen = xt[0][0].count
+            xt = []
+            var idx = 0
+            for _ in 0..<bsz {
+                var batch = [[Float]]()
+                for _ in 0..<latentDimVal {
+                    var row = [Float]()
+                    for _ in 0..<latentLen {
+                        row.append(denoisedFlat[idx])
+                        idx += 1
+                    }
+                    batch.append(row)
+                }
+                xt.append(batch)
+            }
+        }
+        
+        // Generate waveform
+        let finalXtFlat = xt.flatMap { $0.flatMap { $0 } }
+        let finalXtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
+        let finalXtValue = try ORTValue(tensorData: NSMutableData(bytes: finalXtFlat, length: finalXtFlat.count * MemoryLayout<Float>.size),
+                                        elementType: .float,
+                                        shape: finalXtShape)
+        
+        let vocoderOutputs = try vocoderOrt.run(withInputs: ["latent": finalXtValue],
+                                                outputNames: ["wav_tts"],
+                                                runOptions: nil)
+        
+        let wavData = try vocoderOutputs["wav_tts"]!.tensorData() as Data
+        let wav = wavData.withUnsafeBytes { ptr in
+            Array(ptr.bindMemory(to: Float.self))
+        }
+        
+        return (wav, duration)
+    }
+}
+
+// MARK: - Component Loading Functions
+
+func loadVoiceStyle(_ voiceStylePaths: [String], verbose: Bool) throws -> Style {
+    let bsz = voiceStylePaths.count
+    
+    // Read first file to get dimensions
+    let firstData = try Data(contentsOf: URL(fileURLWithPath: voiceStylePaths[0]))
+    let firstStyle = try JSONDecoder().decode(VoiceStyleData.self, from: firstData)
+    
+    let ttlDims = firstStyle.style_ttl.dims
+    let dpDims = firstStyle.style_dp.dims
+    
+    let ttlDim1 = ttlDims[1]
+    let ttlDim2 = ttlDims[2]
+    let dpDim1 = dpDims[1]
+    let dpDim2 = dpDims[2]
+    
+    // Pre-allocate arrays with full batch size
+    let ttlSize = bsz * ttlDim1 * ttlDim2
+    let dpSize = bsz * dpDim1 * dpDim2
+    var ttlFlat = [Float](repeating: 0.0, count: ttlSize)
+    var dpFlat = [Float](repeating: 0.0, count: dpSize)
+    
+    // Fill in the data
+    for (i, path) in voiceStylePaths.enumerated() {
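+        let data = try Data(contentsOf: URL(fileURLWithPath: path))
+        let voiceStyle = try JSONDecoder().decode(VoiceStyleData.self, from: data)
+        // Every style in a batch must share the first file's dimensions; otherwise the
+        // flattened writes below would run past the pre-allocated buffers.
+        guard voiceStyle.style_ttl.dims == ttlDims, voiceStyle.style_dp.dims == dpDims else {
+            throw NSError(domain: "TTS", code: 2, userInfo: [NSLocalizedDescriptionKey: "Voice style dimensions differ in \(path)"])
+        }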
+        
+        // Flatten TTL data
+        let ttlOffset = i * ttlDim1 * ttlDim2
+        var idx = 0
+        for batch in voiceStyle.style_ttl.data {
+            for row in batch {
+                for val in row {
+                    ttlFlat[ttlOffset + idx] = val
+                    idx += 1
+                }
+            }
+        }
+        
+        // Flatten DP data
+        let dpOffset = i * dpDim1 * dpDim2
+        idx = 0
+        for batch in voiceStyle.style_dp.data {
+            for row in batch {
+                for val in row {
+                    dpFlat[dpOffset + idx] = val
+                    idx += 1
+                }
+            }
+        }
+    }
+    
+    let ttlShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: ttlDim1), NSNumber(value: ttlDim2)]
+    let dpShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: dpDim1), NSNumber(value: dpDim2)]
+    
+    let ttlValue = try ORTValue(tensorData: NSMutableData(bytes: &ttlFlat, length: ttlFlat.count * MemoryLayout<Float>.size),
+                                elementType: .float,
+                                shape: ttlShape)
+    let dpValue = try ORTValue(tensorData: NSMutableData(bytes: &dpFlat, length: dpFlat.count * MemoryLayout<Float>.size),
+                               elementType: .float,
+                               shape: dpShape)
+    
+    if verbose {
+        print("Loaded \(bsz) voice styles\n")
+    }
+    
+    return Style(ttl: ttlValue, dp: dpValue)
+}
+
+func loadTextToSpeech(_ onnxDir: String, _ useGpu: Bool, _ env: ORTEnv) throws -> TextToSpeech {
+    if useGpu {
+        throw NSError(domain: "TTS", code: 1, userInfo: [NSLocalizedDescriptionKey: "GPU mode is not supported yet"])
+    }
+    print("Using CPU for inference\n")
+    
+    let cfgs = try loadCfgs(onnxDir)
+    
+    let sessionOptions = try ORTSessionOptions()
+    
+    let dpPath = "\(onnxDir)/duration_predictor.onnx"
+    let textEncPath = "\(onnxDir)/text_encoder.onnx"
+    let vectorEstPath = "\(onnxDir)/vector_estimator.onnx"
+    let vocoderPath = "\(onnxDir)/vocoder.onnx"
+    
+    let dpOrt = try ORTSession(env: env, modelPath: dpPath, sessionOptions: sessionOptions)
+    let textEncOrt = try ORTSession(env: env, modelPath: textEncPath, sessionOptions: sessionOptions)
+    let vectorEstOrt = try ORTSession(env: env, modelPath: vectorEstPath, sessionOptions: sessionOptions)
+    let vocoderOrt = try ORTSession(env: env, modelPath: vocoderPath, sessionOptions: sessionOptions)
+    
+    let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json"
+    let textProcessor = try UnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath)
+    
+    return TextToSpeech(cfgs: cfgs, textProcessor: textProcessor,
+                       dpOrt: dpOrt, textEncOrt: textEncOrt,
+                       vectorEstOrt: vectorEstOrt, vocoderOrt: vocoderOrt)
+}
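The mask and latent-length arithmetic in `lengthToMask` and `sampleNoisyLatent` above is easy to check in isolation. The sketch below mirrors it in JavaScript. The numbers in the usage lines (sample rate 44100, base chunk size 512, compress factor 6) are made-up illustration values, not the values shipped in `tts.json`.

```javascript
// Ceil-divide waveform lengths (in samples) into latent frames.
function latentLengths(durations, sampleRate, baseChunkSize, chunkCompress) {
    const chunkSize = baseChunkSize * chunkCompress;
    return durations.map(d => {
        const wavLen = Math.floor(d * sampleRate);
        return Math.floor((wavLen + chunkSize - 1) / chunkSize);
    });
}

// Build a [batch][1][maxLen] mask: 1.0 for valid frames, 0.0 for padding.
function lengthToMask(lengths, maxLen = Math.max(...lengths)) {
    return lengths.map(len => {
        const row = new Array(maxLen).fill(0.0);
        for (let j = 0; j < Math.min(len, maxLen); j++) row[j] = 1.0;
        return [row];
    });
}

// Illustrative values only: 0.25 s and 0.5 s of audio in one batch.
const lengths = latentLengths([0.25, 0.5], 44100, 512, 6);
const mask = lengthToMask(lengths);
console.log(lengths, mask);
```

The shorter item is padded to the longest item's frame count, and the mask marks which frames are real so the padded tail can be zeroed.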
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,248 @@
+#!/bin/bash
+
+# Supertonic - Test All Language Implementations
+# This script runs inference tests for all supported languages except web
+
+set -e  # Exit on error
+
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+cd "$SCRIPT_DIR"
+
+echo "=================================="
+echo "Supertonic - Testing All Examples"
+echo "=================================="
+echo ""
+
+# Ask user to select test mode
+echo "Select test mode:"
+echo "  1) Default inference only"
+echo "  2) Batch inference only"
+echo "  3) Both default and batch inference"
+echo -e "Enter your choice (1/2/3) [default: 1]: \c"
+read -r test_mode
+test_mode=${test_mode:-1}
+
+case $test_mode in
+    1)
+        TEST_DEFAULT=true
+        TEST_BATCH=false
+        echo "Running default inference tests only"
+        ;;
+    2)
+        TEST_DEFAULT=false
+        TEST_BATCH=true
+        echo "Running batch inference tests only"
+        ;;
+    3)
+        TEST_DEFAULT=true
+        TEST_BATCH=true
+        echo "Running both default and batch inference tests"
+        ;;
+    *)
+        echo "Invalid choice. Using default inference only."
+        TEST_DEFAULT=true
+        TEST_BATCH=false
+        ;;
+esac
+echo ""
+
+# Batch inference test data - base variables
+BATCH_VOICE_STYLE_1="assets/voice_styles/M1.json"
+BATCH_VOICE_STYLE_2="assets/voice_styles/F1.json"
+BATCH_TEXT_1="The sun sets behind the mountains, painting the sky in shades of pink and orange."
+BATCH_TEXT_2="The weather is beautiful and sunny outside. A gentle breeze makes the air feel fresh and pleasant."
+
+# Ask if user wants to clean results folders
+echo -e "Do you want to clean all results folders before running tests? (y/N): \c"
+read -r response
+if [[ "$response" =~ ^[Yy]$ ]]; then
+    echo ""
+    echo "Cleaning results folders..."
+    
+    # List of result directories
+    declare -a RESULT_DIRS=(
+        "py/results"
+        "nodejs/results"
+        "go/results"
+        "rust/results"
+        "csharp/results"
+        "java/results"
+        "swift/results"
+        "cpp/build/results"
+    )
+    
+    for dir in "${RESULT_DIRS[@]}"; do
+        if [ -d "$SCRIPT_DIR/$dir" ]; then
+            echo "  - Cleaning $dir"
+            rm -rf "$SCRIPT_DIR/$dir"/*
+        fi
+    done
+    
+    echo "Results folders cleaned!"
+    echo ""
+fi
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Track results
+declare -a PASSED=()
+declare -a FAILED=()
+
+# Helper function to run tests
+run_test() {
+    local name=$1
+    local dir=$2
+    shift 2
+    local cmd="$*"
+    
+    echo -e "${BLUE}[$name]${NC} Running inference..."
+    cd "$SCRIPT_DIR/$dir"
+    
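+    # Run command and prefix each output line with the language name.
+    # pipefail (scoped to a subshell) makes the pipeline report the command's
+    # exit status; without it only sed's status is checked and failures pass.
+    if (set -o pipefail; eval "$cmd" 2>&1 | sed "s/^/[$name] /"); then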
+        echo -e "${GREEN}[$name]${NC} ✓ Success"
+        PASSED+=("$name")
+    else
+        echo -e "${RED}[$name]${NC} ✗ Failed"
+        FAILED+=("$name")
+    fi
+    echo ""
+    cd "$SCRIPT_DIR"
+}
+
+# ====================================
+# Python
+# ====================================
+echo -e "${YELLOW}Testing Python...${NC}"
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "Python (default)" "py" "uv run example_onnx.py"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "Python (batch)" "py" "uv run example_onnx.py --voice-style $BATCH_VOICE_STYLE_1 $BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# JavaScript (Node.js)
+# ====================================
+echo -e "${YELLOW}Testing JavaScript (Node.js)...${NC}"
+echo "Installing Node.js dependencies..."
+cd nodejs && npm install --silent && cd ..
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# Go
+# ====================================
+echo -e "${YELLOW}Testing Go...${NC}"
+echo "Cleaning Go cache..."
+cd go && go clean && cd ..
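+# Honor a pre-set ONNXRUNTIME_LIB_PATH; otherwise fall back to the Homebrew (macOS) location
+export ONNXRUNTIME_LIB_PATH="${ONNXRUNTIME_LIB_PATH:-$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib}"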
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "Go (default)" "go" "go run example_onnx.go helper.go"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "Go (batch)" "go" "go run example_onnx.go helper.go --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# Rust
+# ====================================
+echo -e "${YELLOW}Testing Rust...${NC}"
+echo "Cleaning Rust build artifacts..."
+cd rust && cargo clean && cd ..
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "Rust (default)" "rust" "cargo run --release"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "Rust (batch)" "rust" "cargo run --release -- --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# C#
+# ====================================
+echo -e "${YELLOW}Testing C#...${NC}"
+echo "Cleaning C# build artifacts..."
+cd csharp && dotnet clean && cd ..
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "C# (default)" "csharp" "dotnet run --configuration Release"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "C# (batch)" "csharp" "dotnet run --configuration Release -- --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# Java
+# ====================================
+echo -e "${YELLOW}Testing Java...${NC}"
+echo "Building Java project..."
+cd java && mvn clean install -q && cd ..
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "Java (default)" "java" "mvn exec:java -q"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "Java (batch)" "java" "mvn exec:java -q -Dexec.args='--voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\"'"
+fi
+
+# ====================================
+# Swift
+# ====================================
+echo -e "${YELLOW}Testing Swift...${NC}"
+echo "Building Swift project..."
+cd swift && swift build -c release && cd ..
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "Swift (default)" "swift" ".build/release/example_onnx"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "Swift (batch)" "swift" ".build/release/example_onnx --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# C++
+# ====================================
+echo -e "${YELLOW}Testing C++...${NC}"
+echo "Building C++ project..."
+cd cpp && mkdir -p build && cd build && cmake .. && make && cd ../..
+if [ "$TEST_DEFAULT" = true ]; then
+    run_test "C++ (default)" "cpp/build" "./example_onnx"
+fi
+if [ "$TEST_BATCH" = true ]; then
+    run_test "C++ (batch)" "cpp/build" "./example_onnx --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2'"
+fi
+
+# ====================================
+# Summary
+# ====================================
+echo "=================================="
+echo "Test Summary"
+echo "=================================="
+echo ""
+
+if [ ${#PASSED[@]} -gt 0 ]; then
+    echo -e "${GREEN}Passed (${#PASSED[@]}):${NC}"
+    for lang in "${PASSED[@]}"; do
+        echo -e "  ${GREEN}✓${NC} $lang"
+    done
+    echo ""
+fi
+
+if [ ${#FAILED[@]} -gt 0 ]; then
+    echo -e "${RED}Failed (${#FAILED[@]}):${NC}"
+    for lang in "${FAILED[@]}"; do
+        echo -e "  ${RED}✗${NC} $lang"
+    done
+    echo ""
+    exit 1
+else
+    echo -e "${GREEN}All tests passed! 🎉${NC}"
+    exit 0
+fi
+
@@ -0,0 +1,4 @@
+node_modules/
+dist/
+.DS_Store
+*.log
@@ -0,0 +1,98 @@
+# Supertonic Web Example
+
+This example demonstrates how to use Supertonic in a web browser using ONNX Runtime Web.
+
+## Features
+
+- 🌐 Runs entirely in the browser (no server required for inference)
+- 🚀 WebGPU support with automatic fallback to WebAssembly
+- ⚡ Pre-extracted voice styles for instant generation
+- 🎨 Modern, responsive UI
+- 🎭 Multiple voice style presets (2 Male, 2 Female)
+- 💾 Download generated audio as WAV files
+- 📊 Detailed generation statistics (audio length, generation time)
+- ⏱️ Real-time progress tracking
+
+## Requirements
+
+- Node.js (for development server)
+- Modern web browser (Chrome, Edge, Firefox, Safari)
+
+## Installation
+
+1. Install dependencies:
+
+```bash
+npm install
+```
+
+## Running the Demo
+
+Start the development server:
+
+```bash
+npm run dev
+```
+
+This will start a local development server (usually at http://localhost:3000) and open the demo in your browser.
+
+## Usage
+
+1. **Wait for Models to Load**: The app will automatically load models and the default voice style (M1)
+2. **Select Voice Style**: Choose from available voice presets
+   - **Male 1 (M1)**: Default male voice
+   - **Male 2 (M2)**: Alternative male voice
+   - **Female 1 (F1)**: Default female voice
+   - **Female 2 (F2)**: Alternative female voice
+3. **Enter Text**: Type or paste the text you want to convert to speech
+4. **Adjust Settings** (optional):
+   - **Total Steps**: More steps = better quality but slower (default: 5)
+5. **Generate Speech**: Click the "Generate Speech" button
+6. **View Results**: 
+   - See the full input text
+   - View audio length and generation time statistics
+   - Play the generated audio in the browser
+   - Download as WAV file
+
+## Technical Details
+
+### Browser Compatibility
+
+This demo uses:
+- **ONNX Runtime Web**: For running models in the browser
+- **Web Audio API**: For playing generated audio
+- **Vite**: For development and bundling
+
+## Notes
+
+- The ONNX models must be accessible at `assets/onnx/` relative to the web root
+- Voice style JSON files must be accessible at `assets/voice_styles/` relative to the web root
+- Pre-extracted voice styles enable instant generation without audio processing
+- Four voice style presets are provided (M1, M2, F1, F2)
+
+## Troubleshooting
+
+### Models not loading
+- Check browser console for errors
+- Ensure `assets/onnx/` path is correct and models are accessible
+- Check CORS settings if serving from a different domain
+
+### WebGPU not available
+- WebGPU is only available in recent Chrome/Edge browsers (version 113+)
+- The app will automatically fall back to WebAssembly if WebGPU is not available
+- Check the backend badge to see which execution provider is being used
+
+### Out of memory errors
+- Try shorter text inputs
+- Reduce the number of denoising steps (the **Total Steps** setting)
+- Use a browser with more available memory
+- Close other tabs to free up memory
+
+### Audio quality issues
+- Try different voice style presets
+- Increase the number of denoising steps (the **Total Steps** setting) for better quality
+
+### Slow generation
+- If using WebAssembly, try a browser that supports WebGPU
+- Ensure no other heavy processes are running
+- Consider using fewer denoising steps for faster (but lower quality) results
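The WAV download listed under Features comes down to prefixing the generated samples with a 44-byte RIFF header. Below is a minimal sketch that follows the same header layout as the Swift `writeWavFile` helper (mono, 16-bit PCM); `encodeWav` is a hypothetical name, and the web demo itself may implement this differently.

```javascript
// Encode mono float samples in [-1, 1] as a 16-bit PCM WAV file (ArrayBuffer).
function encodeWav(samples, sampleRate) {
    const bytesPerSample = 2;
    const dataSize = samples.length * bytesPerSample;
    const buffer = new ArrayBuffer(44 + dataSize);
    const view = new DataView(buffer);
    const writeAscii = (offset, text) => {
        for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
    };

    writeAscii(0, 'RIFF');
    view.setUint32(4, 36 + dataSize, true);    // remaining file size, little-endian
    writeAscii(8, 'WAVE');
    writeAscii(12, 'fmt ');
    view.setUint32(16, 16, true);              // fmt chunk size
    view.setUint16(20, 1, true);               // audio format: PCM
    view.setUint16(22, 1, true);               // channels: mono
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
    view.setUint16(32, bytesPerSample, true);  // block align
    view.setUint16(34, 16, true);              // bits per sample
    writeAscii(36, 'data');
    view.setUint32(40, dataSize, true);

    samples.forEach((sample, i) => {
        // Clamp out-of-range samples before scaling to the int16 range.
        const clamped = Math.max(-1, Math.min(1, sample));
        view.setInt16(44 + i * bytesPerSample, Math.trunc(clamped * 32767), true);
    });
    return buffer;
}

const wavBuffer = encodeWav([0, 0.5, -1, 2], 44100);
console.log(wavBuffer.byteLength); // 44-byte header plus 4 samples of 2 bytes each
```

In a browser, the resulting buffer could be wrapped in `new Blob([wavBuffer], { type: 'audio/wav' })` and offered for download through an object URL.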
@@ -0,0 +1 @@
+../assets
@@ -0,0 +1,396 @@
+import * as ort from 'onnxruntime-web';
+
+/**
+ * Unicode Text Processor
+ */
+export class UnicodeProcessor {
+    constructor(indexer) {
+        this.indexer = indexer;
+    }
+
+    call(textList) {
+        // Split into code points (not UTF-16 code units) so characters outside the
+        // Basic Multilingual Plane map to one ID instead of two surrogate halves.
+        const processedTexts = textList.map(text => Array.from(this.preprocessText(text)));
+        
+        const textIdsLengths = processedTexts.map(chars => chars.length);
+        const maxLen = Math.max(...textIdsLengths);
+        
+        const textIds = processedTexts.map(chars => {
+            const row = new Array(maxLen).fill(0);
+            for (let j = 0; j < chars.length; j++) {
+                const codePoint = chars[j].codePointAt(0);
+                row[j] = (codePoint < this.indexer.length) ? this.indexer[codePoint] : -1;
+            }
+            return row;
+        });
+        
+        const textMask = this.getTextMask(textIdsLengths);
+        return { textIds, textMask };
+    }
+
+    preprocessText(text) {
+        return text.normalize('NFKC');
+    }
+
+    getTextMask(textIdsLengths) {
+        const maxLen = Math.max(...textIdsLengths);
+        return this.lengthToMask(textIdsLengths, maxLen);
+    }
+
+    lengthToMask(lengths, maxLen = null) {
+        const actualMaxLen = maxLen || Math.max(...lengths);
+        return lengths.map(len => {
+            const row = new Array(actualMaxLen).fill(0.0);
+            for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
+                row[j] = 1.0;
+            }
+            return [row];
+        });
+    }
+}
+
+/**
+ * Style class to hold TTL and DP tensors
+ */
+export class Style {
+    constructor(ttlTensor, dpTensor) {
+        this.ttl = ttlTensor;
+        this.dp = dpTensor;
+    }
+}
+
+/**
+ * Text-to-Speech class
+ */
+export class TextToSpeech {
+    constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
+        this.cfgs = cfgs;
+        this.textProcessor = textProcessor;
+        this.dpOrt = dpOrt;
+        this.textEncOrt = textEncOrt;
+        this.vectorEstOrt = vectorEstOrt;
+        this.vocoderOrt = vocoderOrt;
+        this.sampleRate = cfgs.ae.sample_rate;
+    }
+
+    async call(textList, style, totalStep, progressCallback = null) {
+        const bsz = textList.length;
+        
+        // Process text
+        const { textIds, textMask } = this.textProcessor.call(textList);
+        
+        const textIdsFlat = new BigInt64Array(textIds.flat().map(x => BigInt(x)));
+        const textIdsShape = [bsz, textIds[0].length];
+        const textIdsTensor = new ort.Tensor('int64', textIdsFlat, textIdsShape);
+        
+        const textMaskFlat = new Float32Array(textMask.flat(2));
+        const textMaskShape = [bsz, 1, textMask[0][0].length];
+        const textMaskTensor = new ort.Tensor('float32', textMaskFlat, textMaskShape);
+        
+        // Predict duration
+        const dpOutputs = await this.dpOrt.run({
+            text_ids: textIdsTensor,
+            style_dp: style.dp,
+            text_mask: textMaskTensor
+        });
+        const duration = Array.from(dpOutputs.duration.data);
+        
+        // Encode text
+        const textEncOutputs = await this.textEncOrt.run({
+            text_ids: textIdsTensor,
+            style_ttl: style.ttl,
+            text_mask: textMaskTensor
+        });
+        const textEmb = textEncOutputs.text_emb;
+        
+        // Sample noisy latent
+        let { xt, latentMask } = this.sampleNoisyLatent(
+            duration,
+            this.sampleRate,
+            this.cfgs.ae.base_chunk_size,
+            this.cfgs.ttl.chunk_compress_factor,
+            this.cfgs.ttl.latent_dim
+        );
+        
+        const latentMaskFlat = new Float32Array(latentMask.flat(2));
+        const latentMaskShape = [bsz, 1, latentMask[0][0].length];
+        const latentMaskTensor = new ort.Tensor('float32', latentMaskFlat, latentMaskShape);
+        
+        // Prepare constant arrays
+        const totalStepArray = new Float32Array(bsz).fill(totalStep);
+        const totalStepTensor = new ort.Tensor('float32', totalStepArray, [bsz]);
+        
+        // Denoising loop
+        for (let step = 0; step < totalStep; step++) {
+            if (progressCallback) {
+                progressCallback(step + 1, totalStep);
+            }
+            
+            const currentStepArray = new Float32Array(bsz).fill(step);
+            const currentStepTensor = new ort.Tensor('float32', currentStepArray, [bsz]);
+            
+            const xtFlat = new Float32Array(xt.flat(2));
+            const xtShape = [bsz, xt[0].length, xt[0][0].length];
+            const xtTensor = new ort.Tensor('float32', xtFlat, xtShape);
+            
+            const vectorEstOutputs = await this.vectorEstOrt.run({
+                noisy_latent: xtTensor,
+                text_emb: textEmb,
+                style_ttl: style.ttl,
+                latent_mask: latentMaskTensor,
+                text_mask: textMaskTensor,
+                current_step: currentStepTensor,
+                total_step: totalStepTensor
+            });
+            
+            const denoised = Array.from(vectorEstOutputs.denoised_latent.data);
+            
+            // Reshape to 3D
+            const latentDim = xt[0].length;
+            const latentLen = xt[0][0].length;
+            xt = [];
+            let idx = 0;
+            for (let b = 0; b < bsz; b++) {
+                const batch = [];
+                for (let d = 0; d < latentDim; d++) {
+                    const row = [];
+                    for (let t = 0; t < latentLen; t++) {
+                        row.push(denoised[idx++]);
+                    }
+                    batch.push(row);
+                }
+                xt.push(batch);
+            }
+        }
+        
+        // Generate waveform
+        const finalXtFlat = new Float32Array(xt.flat(2));
+        const finalXtShape = [bsz, xt[0].length, xt[0][0].length];
+        const finalXtTensor = new ort.Tensor('float32', finalXtFlat, finalXtShape);
+        
+        const vocoderOutputs = await this.vocoderOrt.run({
+            latent: finalXtTensor
+        });
+        
+        const wav = Array.from(vocoderOutputs.wav_tts.data);
+        
+        return { wav, duration };
+    }
+
+    sampleNoisyLatent(duration, sampleRate, baseChunkSize, chunkCompress, latentDim) {
+        const bsz = duration.length;
+        const maxDur = Math.max(...duration);
+        
+        const wavLenMax = Math.floor(maxDur * sampleRate);
+        const wavLengths = duration.map(d => Math.floor(d * sampleRate));
+        
+        const chunkSize = baseChunkSize * chunkCompress;
+        const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
+        const latentDimVal = latentDim * chunkCompress;
+        
+        const xt = [];
+        for (let b = 0; b < bsz; b++) {
+            const batch = [];
+            for (let d = 0; d < latentDimVal; d++) {
+                const row = [];
+                for (let t = 0; t < latentLen; t++) {
+                    // Box-Muller transform
+                    const u1 = 1.0 - Math.random(); // in (0, 1], keeps Math.log(u1) finite without clipping the tails
+                    const u2 = Math.random();
+                    const val = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
+                    row.push(val);
+                }
+                batch.push(row);
+            }
+            xt.push(batch);
+        }
+        
+        const latentLengths = wavLengths.map(len => Math.floor((len + chunkSize - 1) / chunkSize));
+        const latentMask = this.lengthToMask(latentLengths, latentLen);
+        
+        // Apply mask
+        for (let b = 0; b < bsz; b++) {
+            for (let d = 0; d < latentDimVal; d++) {
+                for (let t = 0; t < latentLen; t++) {
+                    xt[b][d][t] *= latentMask[b][0][t];
+                }
+            }
+        }
+        
+        return { xt, latentMask };
+    }
+
+    lengthToMask(lengths, maxLen = null) {
+        const actualMaxLen = maxLen || Math.max(...lengths);
+        return lengths.map(len => {
+            const row = new Array(actualMaxLen).fill(0.0);
+            for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
+                row[j] = 1.0;
+            }
+            return [row];
+        });
+    }
+}
+
+/**
+ * Load voice style from JSON files
+ */
+export async function loadVoiceStyle(voiceStylePaths, verbose = false) {
+    const bsz = voiceStylePaths.length;
+    
+    // Read first file to get dimensions
+    const firstResponse = await fetch(voiceStylePaths[0]);
+    const firstStyle = await firstResponse.json();
+    
+    const ttlDims = firstStyle.style_ttl.dims;
+    const dpDims = firstStyle.style_dp.dims;
+    
+    const ttlDim1 = ttlDims[1];
+    const ttlDim2 = ttlDims[2];
+    const dpDim1 = dpDims[1];
+    const dpDim2 = dpDims[2];
+    
+    // Pre-allocate arrays with full batch size
+    const ttlSize = bsz * ttlDim1 * ttlDim2;
+    const dpSize = bsz * dpDim1 * dpDim2;
+    const ttlFlat = new Float32Array(ttlSize);
+    const dpFlat = new Float32Array(dpSize);
+    
+    // Fill in the data
+    for (let i = 0; i < bsz; i++) {
+        const response = await fetch(voiceStylePaths[i]);
+        const voiceStyle = await response.json();
+        
+        // Flatten TTL data
+        const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
+        const ttlOffset = i * ttlDim1 * ttlDim2;
+        ttlFlat.set(ttlData, ttlOffset);
+        
+        // Flatten DP data
+        const dpData = voiceStyle.style_dp.data.flat(Infinity);
+        const dpOffset = i * dpDim1 * dpDim2;
+        dpFlat.set(dpData, dpOffset);
+    }
+    
+    const ttlShape = [bsz, ttlDim1, ttlDim2];
+    const dpShape = [bsz, dpDim1, dpDim2];
+    
+    const ttlTensor = new ort.Tensor('float32', ttlFlat, ttlShape);
+    const dpTensor = new ort.Tensor('float32', dpFlat, dpShape);
+    
+    if (verbose) {
+        console.log(`Loaded ${bsz} voice styles`);
+    }
+    
+    return new Style(ttlTensor, dpTensor);
+}
+
+/**
+ * Load configuration from JSON
+ */
+export async function loadCfgs(onnxDir) {
+    const response = await fetch(`${onnxDir}/tts.json`);
+    const cfgs = await response.json();
+    return cfgs;
+}
+
+/**
+ * Load text processor
+ */
+export async function loadTextProcessor(onnxDir) {
+    const response = await fetch(`${onnxDir}/unicode_indexer.json`);
+    const indexer = await response.json();
+    return new UnicodeProcessor(indexer);
+}
+
+/**
+ * Load ONNX model
+ */
+export async function loadOnnx(onnxPath, options) {
+    const session = await ort.InferenceSession.create(onnxPath, options);
+    return session;
+}
+
+/**
+ * Load all TTS components
+ */
+export async function loadTextToSpeech(onnxDir, sessionOptions = {}, progressCallback = null) {
+    console.log('Execution providers:', sessionOptions.executionProviders ?? 'default');
+    
+    const cfgs = await loadCfgs(onnxDir);
+    
+    const dpPath = `${onnxDir}/duration_predictor.onnx`;
+    const textEncPath = `${onnxDir}/text_encoder.onnx`;
+    const vectorEstPath = `${onnxDir}/vector_estimator.onnx`;
+    const vocoderPath = `${onnxDir}/vocoder.onnx`;
+    
+    const modelPaths = [
+        { name: 'Duration Predictor', path: dpPath },
+        { name: 'Text Encoder', path: textEncPath },
+        { name: 'Vector Estimator', path: vectorEstPath },
+        { name: 'Vocoder', path: vocoderPath }
+    ];
+    
+    const sessions = [];
+    for (let i = 0; i < modelPaths.length; i++) {
+        if (progressCallback) {
+            progressCallback(modelPaths[i].name, i + 1, modelPaths.length);
+        }
+        const session = await loadOnnx(modelPaths[i].path, sessionOptions);
+        sessions.push(session);
+    }
+    
+    const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = sessions;
+    
+    const textProcessor = await loadTextProcessor(onnxDir);
+    const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
+    
+    return { textToSpeech, cfgs };
+}
+
+/**
+ * Write WAV file to ArrayBuffer
+ */
+export function writeWavFile(audioData, sampleRate) {
+    const numChannels = 1;
+    const bitsPerSample = 16;
+    const byteRate = sampleRate * numChannels * bitsPerSample / 8;
+    const blockAlign = numChannels * bitsPerSample / 8;
+    const dataSize = audioData.length * 2;
+    
+    // Create ArrayBuffer
+    const buffer = new ArrayBuffer(44 + dataSize);
+    const view = new DataView(buffer);
+    
+    // Write WAV header
+    const writeString = (offset, string) => {
+        for (let i = 0; i < string.length; i++) {
+            view.setUint8(offset + i, string.charCodeAt(i));
+        }
+    };
+    
+    writeString(0, 'RIFF');
+    view.setUint32(4, 36 + dataSize, true);
+    writeString(8, 'WAVE');
+    writeString(12, 'fmt ');
+    view.setUint32(16, 16, true);
+    view.setUint16(20, 1, true); // PCM
+    view.setUint16(22, numChannels, true);
+    view.setUint32(24, sampleRate, true);
+    view.setUint32(28, byteRate, true);
+    view.setUint16(32, blockAlign, true);
+    view.setUint16(34, bitsPerSample, true);
+    writeString(36, 'data');
+    view.setUint32(40, dataSize, true);
+    
+    // Write audio data
+    const int16Data = new Int16Array(audioData.length);
+    for (let i = 0; i < audioData.length; i++) {
+        const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
+        int16Data[i] = Math.round(clamped * 32767);
+    }
+    
+    const dataView = new Uint8Array(buffer, 44);
+    dataView.set(new Uint8Array(int16Data.buffer));
+    
+    return buffer;
+}
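To check that a buffer produced by `writeWavFile` carries the intended header, the 44-byte layout written above can be read back field by field. This is a sketch under stated assumptions: `readWavHeader` is a hypothetical helper that is not part of `helper.js`, and it only understands the canonical 44-byte PCM header used here (no extra chunks).

```javascript
// Hypothetical helper (not part of helper.js): decode the canonical 44-byte
// PCM WAV header at the same offsets that writeWavFile writes.
function readWavHeader(buffer) {
    const view = new DataView(buffer);

    // Read a fixed-length ASCII tag such as 'RIFF' or 'WAVE'.
    const readString = (offset, length) => {
        let s = '';
        for (let i = 0; i < length; i++) {
            s += String.fromCharCode(view.getUint8(offset + i));
        }
        return s;
    };

    // All multi-byte fields in a RIFF/WAVE file are little-endian.
    return {
        riff: readString(0, 4),
        wave: readString(8, 4),
        audioFormat: view.getUint16(20, true),
        numChannels: view.getUint16(22, true),
        sampleRate: view.getUint32(24, true),
        byteRate: view.getUint32(28, true),
        blockAlign: view.getUint16(32, true),
        bitsPerSample: view.getUint16(34, true),
        dataSize: view.getUint32(40, true)
    };
}
```

For the mono 16-bit output written above, `byteRate` should equal `sampleRate * 2` and `dataSize` should equal twice the number of samples.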
@@ -0,0 +1,72 @@
+<!DOCTYPE html>
+<html lang="en">
+    <head>
+        <meta charset="UTF-8">
+        <meta name="viewport" content="width=device-width, initial-scale=1.0">
+        <title>Supertonic - Web Demo</title>
+        <link rel="stylesheet" href="/style.css">
+    </head>
+    <body>
+        <div class="container">
+            <h1>🎤 Supertonic</h1>
+            <p class="subtitle">Text-to-Speech with ONNX Runtime Web</p>
+
+            <div id="statusBox" class="status-box">
+                <div class="status-text-wrapper">
+                    <div id="statusText">ℹ️ <strong>Loading models...</strong>
+                        Please wait...</div>
+                </div>
+                <div id="backendBadge" class="backend-badge">WebAssembly</div>
+            </div>
+
+            <div class="main-content">
+                <div class="left-panel">
+                    <div class="section">
+                        <div class="ref-audio-label">
+                            <label for="voiceStyleSelect">Voice Style: </label>
+                            <span id="voiceStyleInfo"
+                                class="ref-audio-info">Loading...</span>
+                        </div>
+                        <select id="voiceStyleSelect">
+                            <option value="assets/voice_styles/M1.json">Male 1 (M1)</option>
+                            <option value="assets/voice_styles/M2.json">Male 2 (M2)</option>
+                            <option value="assets/voice_styles/F1.json">Female 1 (F1)</option>
+                            <option value="assets/voice_styles/F2.json">Female 2 (F2)</option>
+                        </select>
+                    </div>
+
+                    <div class="section">
+                        <label for="text">Text to Synthesize:</label>
+                        <textarea id="text"
+                            placeholder="Enter the text you want to convert to speech...">This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.</textarea>
+                    </div>
+
+                    <div class="params-grid">
+                        <div class="section">
+                            <label for="totalStep">Total Steps (higher = better
+                                quality):</label>
+                            <input type="number" id="totalStep" value="5"
+                                min="1" max="50">
+                        </div>
+
+                    </div>
+
+                    <button id="generateBtn">Generate Speech</button>
+
+                    <div id="error" class="error"></div>
+                </div>
+
+                <div class="right-panel">
+                    <div id="results" class="results">
+                        <div class="results-placeholder">
+                            <div class="results-placeholder-icon">🎤</div>
+                            <p>Generated speech will appear here</p>
+                        </div>
+                    </div>
+                </div>
+            </div>
+        </div>
+
+        <script type="module" src="/main.js"></script>
+    </body>
+</html>
@@ -0,0 +1,285 @@
+import {
+    loadTextToSpeech,
+    loadVoiceStyle,
+    writeWavFile
+} from './helper.js';
+
+// Configuration
+const DEFAULT_VOICE_STYLE_PATH = 'assets/voice_styles/M1.json';
+
+// Helper function to extract filename from path
+function getFilenameFromPath(path) {
+    return path.split('/').pop();
+}
+
+// Global state
+let textToSpeech = null;
+let cfgs = null;
+
+// Pre-computed style
+let currentStyle = null;
+let currentStylePath = DEFAULT_VOICE_STYLE_PATH;
+
+// UI Elements
+const textInput = document.getElementById('text');
+const voiceStyleSelect = document.getElementById('voiceStyleSelect');
+const voiceStyleInfo = document.getElementById('voiceStyleInfo');
+const totalStepInput = document.getElementById('totalStep');
+const generateBtn = document.getElementById('generateBtn');
+const statusBox = document.getElementById('statusBox');
+const statusText = document.getElementById('statusText');
+const backendBadge = document.getElementById('backendBadge');
+const resultsContainer = document.getElementById('results');
+const errorBox = document.getElementById('error');
+
+function showStatus(message, type = 'info') {
+    statusText.innerHTML = message;
+    statusBox.className = 'status-box';
+    if (type === 'success') {
+        statusBox.classList.add('success');
+    } else if (type === 'error') {
+        statusBox.classList.add('error');
+    }
+}
+
+function showError(message) {
+    errorBox.textContent = message;
+    errorBox.classList.add('active');
+}
+
+function hideError() {
+    errorBox.classList.remove('active');
+}
+
+function showBackendBadge() {
+    backendBadge.classList.add('visible');
+}
+
+// Load voice style from JSON
+async function loadStyleFromJSON(stylePath) {
+    try {
+        const style = await loadVoiceStyle([stylePath], true);
+        return style;
+    } catch (error) {
+        console.error('Error loading voice style:', error);
+        throw error;
+    }
+}
+
+// Load models on page load
+async function initializeModels() {
+    try {
+        showStatus('ℹ️ <strong>Loading configuration...</strong>');
+        
+        const basePath = 'assets/onnx';
+        
+        // Try WebGPU first, fallback to WASM
+        let executionProvider = 'wasm';
+        try {
+            const result = await loadTextToSpeech(basePath, {
+                executionProviders: ['webgpu'],
+                graphOptimizationLevel: 'all'
+            }, (modelName, current, total) => {
+                showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
+            });
+            
+            textToSpeech = result.textToSpeech;
+            cfgs = result.cfgs;
+            
+            executionProvider = 'webgpu';
+            backendBadge.textContent = 'WebGPU';
+            backendBadge.style.background = '#4caf50';
+        } catch (webgpuError) {
+            console.warn('WebGPU initialization failed, falling back to WebAssembly:', webgpuError);
+            
+            const result = await loadTextToSpeech(basePath, {
+                executionProviders: ['wasm'],
+                graphOptimizationLevel: 'all'
+            }, (modelName, current, total) => {
+                showStatus(`ℹ️ <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
+            });
+            
+            textToSpeech = result.textToSpeech;
+            cfgs = result.cfgs;
+        }
+        
+        showStatus('ℹ️ <strong>Loading default voice style...</strong>');
+        
+        // Load default voice style
+        currentStyle = await loadStyleFromJSON(currentStylePath);
+        voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
+        
+        showStatus(`✅ <strong>Models loaded!</strong> Using ${executionProvider.toUpperCase()}. You can now generate speech.`, 'success');
+        showBackendBadge();
+        
+        generateBtn.disabled = false;
+        
+    } catch (error) {
+        console.error('Error loading models:', error);
+        showStatus(`❌ <strong>Error loading models:</strong> ${error.message}`, 'error');
+    }
+}
+
+// Handle voice style selection
+voiceStyleSelect.addEventListener('change', async (e) => {
+    const selectedValue = e.target.value;
+    
+    if (!selectedValue) return;
+    
+    try {
+        generateBtn.disabled = true;
+        showStatus(`ℹ️ <strong>Loading voice style...</strong>`, 'info');
+        
+        currentStylePath = selectedValue;
+        currentStyle = await loadStyleFromJSON(currentStylePath);
+        voiceStyleInfo.textContent = getFilenameFromPath(currentStylePath);
+        
+        showStatus(`✅ <strong>Voice style loaded:</strong> ${getFilenameFromPath(currentStylePath)}`, 'success');
+        generateBtn.disabled = false;
+    } catch (error) {
+        showError(`Error loading voice style: ${error.message}`);
+        
+        // Restore default style
+        currentStylePath = DEFAULT_VOICE_STYLE_PATH;
+        voiceStyleSelect.value = currentStylePath;
+        try {
+            currentStyle = await loadStyleFromJSON(currentStylePath);
+            voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
+        } catch (styleError) {
+            console.error('Error restoring default style:', styleError);
+        }
+        
+        generateBtn.disabled = false;
+    }
+});
+
+// Main synthesis function
+async function generateSpeech() {
+    const text = textInput.value.trim();
+    if (!text) {
+        showError('Please enter some text to synthesize.');
+        return;
+    }
+    
+    if (!textToSpeech || !cfgs) {
+        showError('Models are still loading. Please wait.');
+        return;
+    }
+    
+    if (!currentStyle) {
+        showError('Voice style is not ready. Please wait.');
+        return;
+    }
+    
+    const startTime = Date.now();
+    
+    try {
+        generateBtn.disabled = true;
+        hideError();
+        
+        // Clear results and show placeholder
+        resultsContainer.innerHTML = `
+            <div class="results-placeholder generating">
+                <div class="results-placeholder-icon">⏳</div>
+                <p>Generating speech...</p>
+            </div>
+        `;
+        
+        const totalStep = Math.min(50, Math.max(1, parseInt(totalStepInput.value, 10) || 5)); // clamp to the input's 1-50 range; default 5
+        const textList = [text];
+        
+        showStatus('ℹ️ <strong>Generating speech from text...</strong>');
+        const tic = Date.now();
+        
+        const { wav, duration } = await textToSpeech.call(
+            textList, 
+            currentStyle, 
+            totalStep, 
+            (step, total) => {
+                showStatus(`ℹ️ <strong>Denoising (${step}/${total})...</strong>`);
+            }
+        );
+        
+        const toc = Date.now();
+        console.log(`Text-to-speech synthesis: ${((toc - tic) / 1000).toFixed(2)}s`);
+        
+        showStatus('ℹ️ <strong>Creating audio file...</strong>');
+        const wavLen = Math.floor(textToSpeech.sampleRate * duration[0]);
+        const wavOut = wav.slice(0, wavLen);
+        
+        // Create WAV file
+        const wavBuffer = writeWavFile(wavOut, textToSpeech.sampleRate);
+        const blob = new Blob([wavBuffer], { type: 'audio/wav' });
+        const url = URL.createObjectURL(blob);
+        
+        // Calculate total time and audio duration
+        const endTime = Date.now();
+        const totalTimeSec = ((endTime - startTime) / 1000).toFixed(2);
+        const audioDurationSec = duration[0].toFixed(2);
+        
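+        // Display result with full text (HTML-escaped so markup in the input is shown literally)
+        resultsContainer.innerHTML = `
+            <div class="result-item">
+                <div class="result-text-container">
+                    <div class="result-text-label">Input Text</div>
+                    <div class="result-text">${text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')}</div>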
+                </div>
+                <div class="result-info">
+                    <div class="info-item">
+                        <span>📊 Audio Length</span>
+                        <strong>${audioDurationSec}s</strong>
+                    </div>
+                    <div class="info-item">
+                        <span>⏱️ Generation Time</span>
+                        <strong>${totalTimeSec}s</strong>
+                    </div>
+                </div>
+                <div class="result-player">
+                    <audio controls>
+                        <source src="${url}" type="audio/wav">
+                    </audio>
+                </div>
+                <div class="result-actions">
+                    <button onclick="downloadAudio('${url}', 'synthesized_speech.wav')">
+                        <span>⬇️</span>
+                        <span>Download WAV</span>
+                    </button>
+                </div>
+            </div>
+        `;
+        
+        showStatus('✅ <strong>Speech synthesis completed successfully!</strong>', 'success');
+        
+    } catch (error) {
+        console.error('Error during synthesis:', error);
+        showStatus(`❌ <strong>Error during synthesis:</strong> ${error.message}`, 'error');
+        showError(`Error during synthesis: ${error.message}`);
+        
+        // Restore placeholder
+        resultsContainer.innerHTML = `
+            <div class="results-placeholder">
+                <div class="results-placeholder-icon">🎤</div>
+                <p>Generated speech will appear here</p>
+            </div>
+        `;
+    } finally {
+        generateBtn.disabled = false;
+    }
+}
+
+// Download handler (make it global so it can be called from onclick)
+window.downloadAudio = function(url, filename) {
+    const a = document.createElement('a');
+    a.href = url;
+    a.download = filename;
+    a.click();
+};
+
+// Attach generate function to button
+generateBtn.addEventListener('click', generateSpeech);
+
+// Initialize on load
+window.addEventListener('load', async () => {
+    generateBtn.disabled = true;
+    await initializeModels();
+});
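`generateSpeech` synthesizes a single text, so it trims `wav` using `duration[0]` only. If `textToSpeech.call` were given several texts, the flattened `wav` array would need to be split per item. A minimal sketch, assuming the vocoder output is a row-major `[batch, samples]` tensor so that each item occupies an equal contiguous slice; `splitBatchWav` is a hypothetical helper, not part of this repo.

```javascript
// Hypothetical helper (not part of main.js): split a flattened batch waveform
// into per-item arrays, trimming each one to its predicted duration.
// Assumes `wav` is a row-major [bsz, samplesPerItem] layout.
function splitBatchWav(wav, bsz, durations, sampleRate) {
    const samplesPerItem = Math.floor(wav.length / bsz);
    return durations.map((dur, b) => {
        const start = b * samplesPerItem;
        // Same trimming rule as generateSpeech, capped at the padded length.
        const len = Math.min(samplesPerItem, Math.floor(sampleRate * dur));
        return wav.slice(start, start + len);
    });
}
```

Each returned array could then be passed to `writeWavFile` together with `textToSpeech.sampleRate`, in the same way the single-item path does.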
@@ -0,0 +1,21 @@
+{
+  "name": "tts-onnx-web",
+  "version": "1.0.0",
+  "description": "TTS inference using ONNX Runtime for Web Browser",
+  "type": "module",
+  "scripts": {
+    "dev": "vite",
+    "build": "vite build",
+    "preview": "vite preview"
+  },
+  "keywords": ["tts", "onnx", "speech-synthesis", "web"],
+  "author": "",
+  "license": "MIT",
+  "dependencies": {
+    "fft.js": "^4.0.3",
+    "onnxruntime-web": "^1.17.0"
+  },
+  "devDependencies": {
+    "vite": "^5.0.0"
+  }
+}
@@ -0,0 +1,453 @@
+* {
+    margin: 0;
+    padding: 0;
+    box-sizing: border-box;
+}
+
+body {
+    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
+    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+    min-height: 100vh;
+    display: flex;
+    justify-content: center;
+    align-items: center;
+    padding: 20px;
+}
+
+.container {
+    background: white;
+    border-radius: 20px;
+    padding: 40px;
+    max-width: 1400px;
+    width: 100%;
+    box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
+}
+
+.main-content {
+    display: grid;
+    grid-template-columns: 1fr 1fr;
+    gap: 40px;
+    margin-top: 30px;
+    align-items: start;
+}
+
+.left-panel {
+    display: flex;
+    flex-direction: column;
+}
+
+.right-panel {
+    display: flex;
+    flex-direction: column;
+    height: 100%;
+}
+
+@media (max-width: 1024px) {
+    .main-content {
+        grid-template-columns: 1fr;
+    }
+}
+
+h1 {
+    color: #333;
+    margin-bottom: 10px;
+    font-size: 2em;
+}
+
+.subtitle {
+    color: #666;
+    margin-bottom: 30px;
+    font-size: 1.1em;
+}
+
+.section {
+    margin-bottom: 25px;
+}
+
+label {
+    display: block;
+    font-weight: 600;
+    color: #333;
+    margin-bottom: 8px;
+    font-size: 0.95em;
+}
+
+input[type="file"],
+textarea,
+input[type="number"] {
+    width: 100%;
+    padding: 12px;
+    border: 2px solid #e0e0e0;
+    border-radius: 8px;
+    font-size: 1em;
+    transition: border-color 0.3s;
+}
+
+input[type="file"]:focus,
+textarea:focus,
+input[type="number"]:focus {
+    outline: none;
+    border-color: #667eea;
+}
+
+textarea {
+    resize: vertical;
+    min-height: 100px;
+    font-family: inherit;
+}
+
+.params-grid {
+    display: grid;
+    grid-template-columns: 1fr 1fr;
+    gap: 15px;
+}
+
+button {
+    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+    color: white;
+    border: none;
+    padding: 15px 30px;
+    font-size: 1.1em;
+    font-weight: 600;
+    border-radius: 8px;
+    cursor: pointer;
+    width: 100%;
+    transition: transform 0.2s, box-shadow 0.2s;
+}
+
+button:hover:not(:disabled) {
+    transform: translateY(-2px);
+    box-shadow: 0 5px 20px rgba(102, 126, 234, 0.4);
+}
+
+button:disabled {
+    opacity: 0.6;
+    cursor: not-allowed;
+}
+
+.status-box {
+    background: #e3f2fd;
+    border-left: 4px solid #2196f3;
+    padding: 15px;
+    margin-bottom: 10px;
+    border-radius: 4px;
+    font-size: 0.9em;
+    color: #1565c0;
+    transition: all 0.3s ease;
+    display: flex;
+    justify-content: space-between;
+    align-items: center;
+    flex-wrap: wrap;
+    gap: 15px;
+    min-height: 50px;
+}
+
+.status-box.success {
+    background: #e8f5e9;
+    border-left-color: #4caf50;
+    color: #2e7d32;
+}
+
+.status-box.error {
+    background: #ffebee;
+    border-left-color: #f44336;
+    color: #c62828;
+}
+
+.status-text-wrapper {
+    flex: 1;
+    min-width: 200px;
+}
+
+.backend-badge {
+    display: inline-block;
+    visibility: hidden;
+    padding: 6px 12px;
+    background: #ff9800;
+    color: white;
+    border-radius: 12px;
+    font-size: 0.85em;
+    font-weight: 600;
+    margin-left: 10px;
+    white-space: nowrap;
+}
+
+.backend-badge.visible {
+    visibility: visible;
+}
+
+.ref-audio-info {
+    color: #4caf50;
+    font-weight: 700;
+    font-size: 0.95em;
+}
+
+.ref-audio-label {
+    margin-bottom: 8px;
+}
+
+.ref-audio-label label {
+    display: inline;
+    margin-bottom: 0;
+}
+
+
+.results {
+    flex: 1;
+    display: flex;
+    flex-direction: column;
+}
+
+.result-item {
+    background: white;
+    border-radius: 16px;
+    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
+    overflow: hidden;
+    transition: box-shadow 0.3s ease;
+    display: flex;
+    flex-direction: column;
+    flex: 1;
+}
+
+.result-item:hover {
+    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
+}
+
+.result-item h3 {
+    color: #667eea;
+    margin-bottom: 15px;
+    font-size: 1.2em;
+}
+
+.result-text-container {
+    padding: 20px;
+    background: linear-gradient(135deg, #f8f9ff 0%, #ffffff 100%);
+    border-bottom: 1px solid #e8ecf5;
+    flex: 1;
+    display: flex;
+    flex-direction: column;
+    overflow: hidden;
+}
+
+.result-text-label {
+    font-size: 0.75em;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+    color: #667eea;
+    font-weight: 600;
+    margin-bottom: 8px;
+}
+
+.result-text {
+    color: #333;
+    line-height: 1.7;
+    font-size: 0.95em;
+    word-wrap: break-word;
+    white-space: pre-wrap;
+    overflow-y: auto;
+    padding-right: 8px;
+    flex: 1;
+}
+
+.result-text::-webkit-scrollbar {
+    width: 6px;
+}
+
+.result-text::-webkit-scrollbar-track {
+    background: #f0f0f0;
+    border-radius: 3px;
+}
+
+.result-text::-webkit-scrollbar-thumb {
+    background: #c0c0c0;
+    border-radius: 3px;
+}
+
+.result-text::-webkit-scrollbar-thumb:hover {
+    background: #a0a0a0;
+}
+
+.result-info {
+    display: grid;
+    grid-template-columns: 1fr 1fr;
+    gap: 0;
+    background: #fafbff;
+}
+
+.info-item {
+    padding: 16px 20px;
+    display: flex;
+    align-items: center;
+    gap: 8px;
+    font-size: 0.9em;
+    color: #666;
+    border-bottom: 1px solid #e8ecf5;
+}
+
+.info-item:nth-child(1) {
+    border-right: 1px solid #e8ecf5;
+}
+
+.info-item strong {
+    color: #333;
+    font-size: 1.1em;
+    font-weight: 600;
+    margin-left: auto;
+}
+
+.result-player {
+    padding: 20px;
+    background: white;
+}
+
+.result-item audio {
+    width: 100%;
+    height: 48px;
+    outline: none;
+}
+
+.result-item audio:focus {
+    outline: 2px solid #667eea;
+    outline-offset: 2px;
+    border-radius: 4px;
+}
+
+.result-actions {
+    padding: 16px 20px 20px;
+    background: white;
+}
+
+.result-item button {
+    width: 100%;
+    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+    color: white;
+    border: none;
+    padding: 12px 24px;
+    font-size: 0.95em;
+    font-weight: 600;
+    border-radius: 8px;
+    cursor: pointer;
+    transition: all 0.3s ease;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    gap: 8px;
+}
+
+.result-item button:hover {
+    transform: translateY(-2px);
+    box-shadow: 0 4px 16px rgba(102, 126, 234, 0.3);
+}
+
+.result-item button:active {
+    transform: translateY(0);
+}
+
+@media (max-width: 640px) {
+    .result-info {
+        grid-template-columns: 1fr;
+    }
+    
+    .info-item:nth-child(1) {
+        border-right: none;
+    }
+}
+
+audio {
+    width: 100%;
+    margin-top: 10px;
+}
+
+.error {
+    background: #fee;
+    color: #c00;
+    padding: 15px;
+    border-radius: 8px;
+    margin-top: 20px;
+    display: none;
+}
+
+.error.active {
+    display: block;
+}
+
+.warning-box {
+    background: #fff3cd;
+    color: #856404;
+    padding: 12px 15px;
+    border-radius: 8px;
+    margin-top: 10px;
+    border-left: 4px solid #ffc107;
+    font-size: 0.9em;
+    display: none;
+    line-height: 1.5;
+}
+
+.warning-box.active {
+    display: block;
+}
+
+.warning-box::before {
+    content: "⚠️ ";
+    margin-right: 5px;
+}
+
+.results-placeholder {
+    background: white;
+    border-radius: 16px;
+    box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
+    padding: 60px 40px;
+    text-align: center;
+    color: #999;
+    transition: all 0.3s ease;
+    display: flex;
+    flex-direction: column;
+    justify-content: center;
+    align-items: center;
+    flex: 1;
+    min-height: 400px;
+}
+
+.results-placeholder:hover {
+    box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
+}
+
+.results-placeholder-icon {
+    font-size: 4em;
+    margin-bottom: 20px;
+    opacity: 0.6;
+    animation: float 3s ease-in-out infinite;
+}
+
+.results-placeholder.generating .results-placeholder-icon {
+    animation: spin 2s linear infinite;
+}
+
+@keyframes float {
+    0%, 100% {
+        transform: translateY(0px);
+    }
+    50% {
+        transform: translateY(-10px);
+    }
+}
+
+@keyframes spin {
+    0% {
+        transform: rotate(0deg);
+    }
+    100% {
+        transform: rotate(360deg);
+    }
+}
+
+.results-placeholder p {
+    font-size: 1.05em;
+    color: #888;
+    font-weight: 500;
+    margin: 0;
+}
+
+.hidden {
+    display: none;
+}
@@ -0,0 +1,14 @@
+import { defineConfig } from 'vite';
+
+export default defineConfig({
+  server: {
+    port: 3000,
+    open: true
+  },
+  build: {
+    target: 'esnext'
+  },
+  optimizeDeps: {
+    exclude: ['onnxruntime-web']
+  }
+});