
The High-Dimensional Acoustic Feature Extractor (HDAFE): A Novel ASR Paradigm for Ultra-Low Latency, Disentangled Control, and Expressive Voice Large Language Models

March 15, 2025. Last updated: June 2, 2025

I. Introduction

1.1 Motivation: Bridging the Latency and Expressivity Gap in Conversational AI

The current generation of voice-first conversational AI agents struggles with a critical trade-off between conversational fluency and expressive fidelity. Real-world commercial deployments frequently encounter significant latency issues: industry reports indicate that delays exceeding 3 to 4 seconds effectively ruin call quality and the customer experience. Although some platforms aim for optimized configurations approaching 465 ms end-to-end latency, achieving consistent sub-second responses remains challenging, particularly during peak usage or when external APIs are congested (Telnyx, 2025).

The gap exists because conventional architectures rely on sequential processing: audio is transcribed via Automatic Speech Recognition (ASR), the text is processed by a Large Language Model (LLM), and the resulting response text is synthesized via Text-to-Speech (TTS). This sequence, often combined with reliance on high-bit-precision models, results in an effective Real-Time Factor (RTF) well above 1.0. The technical demands of a platform seeking "virtually zero delay" by achieving ultra-fast response times below 100ms necessitate a radical architectural departure from this standard pipeline (Wang et al., 2025; Liu et al., 2025).

 

1.2 The Axis Vision and the Core Technical Challenge

The commercial vision of the Axis platform redefines the requirements for voice AI, demanding not only ultra-low latency but also unparalleled expressive control. Key demands include expressing the full spectrum of human emotions (laugh, cry, sigh, cheer), instant voice mimicry, universal language and style fluency, and cross-modal creative generation, such as singing or making music.

The core technical challenge lies in integrating complex acoustic feature extraction—postulated here as capturing 35 to 300 dimensions of audio metadata—with the stringent sub-100ms real-time inference constraint. Conventionally, increasing the dimensionality and complexity of acoustic analysis significantly increases the computational load and, consequently, the latency. If a standard ASR were to extract 300 dense features and then pass them sequentially to a large LLM, the latency target would be impossible to meet. The solution must involve a fundamental shift in how acoustic information is processed and transferred.

 

1.3 Contributions: Introducing the HDAFE and the Disentangled VLLM Architecture

This paper proposes the High-Dimensional Acoustic Feature Extractor (HDAFE) as a specialized ASR component designed to overcome this challenge. The HDAFE is not optimized for mere transcription but for disentangled feature extraction, meaning it separates content, identity, prosody, and emotion into orthogonal latent dimensions. This rich output, ranging from 35 to 300 dimensions, serves as a high-bandwidth conditioning signal.

The subsequent contribution is the integration of this HDAFE output into a highly optimized Voice Large Language Model (VLLM). To maintain the low-latency requirement, the HDAFE must output tokens or embeddings that are simultaneously dense (high information content) and highly optimized for rapid transmission to a quantized VLLM, likely leveraging 4-bit precision. This methodology mandates an HDAFE design that prioritizes streaming tokenization and a low computational footprint, enabling continuous, concurrent operation of the ASR, VLLM, and TTS components (Wang et al., 2025).

 

II. Review of Related Work and State-of-the-Art

2.1 Evolution of Expressive Text-to-Speech (TTS) and Zero-Shot Synthesis

The ability of Axis to provide deeply engaging and human-like expressivity is predicated on recent breakthroughs in generative audio modeling. A pivotal advancement was the shift from continuous signal regression in TTS to conditional language modeling using discrete audio codes. Models such as VALL-E demonstrate the capability to treat TTS as a language modeling task by utilizing discrete codes derived from neural audio codecs (Microsoft Research, 2025).

This language modeling approach has proven highly effective for zero-shot synthesis. VALL-E, for instance, can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. Crucially, it demonstrated the ability to preserve the speaker's emotion and acoustic environment in the synthesis, laying the necessary technical groundwork for Axis's instant voice mimicry and emotion transfer capabilities. Further architectural refinements, such as VibeVoice, leverage the power of LLMs to understand complex textual context and dialogue flow, coupled with a diffusion head to generate high-fidelity acoustic details, supporting synthesis of lengthy dialogue with multiple distinct speakers (Microsoft, 2025).

 

2.2 Advances in Low-Latency Voice Systems and Streaming Architectures

The pursuit of sub-second conversational latency has necessitated significant architectural changes across the industry. To overcome the inherent delay caused by sequential processing, state-of-the-art voice pipelines rely on concurrent module execution, involving the simultaneous operation of ASR, LLM, and TTS modules (Wang et al., 2025).

Streaming is a prerequisite for rapid response. Architectures that support sentence-level streaming allow the LLM to incrementally transmit generated sentences to the TTS module for early and continuous audio output. Furthermore, latency is managed through model compression. Techniques such as 4-bit LLM quantization are vital, significantly reducing the GPU memory footprint and inference latency while preserving generation quality. Despite these advances, the difficulty in achieving reliable sub-second performance in commercial deployments suggests that reaching the sub-100ms target requires a radical, end-to-end architectural redesign where every component runs with a Real-Time Factor (RTF) close to zero (Wang et al., 2025).

 

2.3 Analysis of Existing Acoustic Feature Sets and Dimensionality in Affective Computing

Traditional ASR relies on low-level acoustic features like Mel-Frequency Cepstral Coefficients (MFCCs) and basic prosodic attributes (pitch, energy). However, extracting nuanced affective states requires richer, higher-dimensional representations. Research into speech emotion recognition, a subset of affective computing, has confirmed the performance benefits of extracting high-level features from both text and audio using hybrid deep multimodal structures (Yoon et al., 2018). To capture the complex emotional depth required by Axis, acoustic embeddings often utilize higher dimensionality.

For instance, studies exploring emotional states frequently employ hidden sequences and emotion embeddings sized at 768 dimensions. Furthermore, accurately representing the full spectrum of human feeling demands a transition from the discrete-label approach of emotion classification (e.g., Ekman’s six basic emotions) to a continuous, dimensional representation (e.g., Russell’s Valence, Arousal, Dominance model). The HDAFE must embody this shift, transforming raw audio into a continuous, high-dimensional representation that allows the VLLM to use these features as "control knobs" for granular manipulation of generated speech, moving far beyond simple prosody control. This establishes the HDAFE's role as the critical translation layer between the continuous, raw acoustic signal and the symbolic, discrete processing capabilities of the VLLM (Hsu & Huang, 2023; Zhang & Liu, 2024; Google Cloud, 2025).
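To make the "control knob" notion concrete, the following minimal sketch projects a dense affective embedding onto continuous Valence/Arousal/Dominance values. The 768-dimensional input mirrors the embedding sizes cited above, but the AffectiveHead module and its layer widths are illustrative assumptions rather than a published HDAFE component.

```python
# Minimal sketch: projecting a high-dimensional affective embedding onto the
# continuous Valence/Arousal/Dominance (VAD) axes discussed above.
# The 768-d input size and the AffectiveHead layers are illustrative assumptions.
import torch
import torch.nn as nn

class AffectiveHead(nn.Module):
    """Maps a dense affective embedding to three continuous VAD scores in [-1, 1]."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.GELU(),
            nn.Linear(128, 3),   # valence, arousal, dominance
            nn.Tanh(),           # continuous, bounded "control knob" values
        )

    def forward(self, emotion_embedding: torch.Tensor) -> torch.Tensor:
        return self.proj(emotion_embedding)

# Usage: one 768-d frame-level embedding -> three continuous control values.
head = AffectiveHead()
vad = head(torch.randn(1, 768))
print(vad.shape)  # torch.Size([1, 3])
```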

 

III. The High-Dimensional Acoustic Feature Extractor (HDAFE) Architecture

3.1 HDAFE Design Rationale: The Need for Disentanglement

The fundamental differentiator of the HDAFE component is its focus on disentanglement. Standard ASR feature extraction typically results in intertwined latent features where attributes such as speaker identity, emotional state, and phonetic content are blended. This makes independent control—such as instant voice mimicry with separate emotion modulation—computationally infeasible or highly prone to corruption.

The Axis platform’s core capabilities, particularly instant voice mimicry and cross-speaker style transfer, strictly necessitate a disentangled latent space. When features are disentangled, simple linear operations in the latent space can successfully perform tasks unseen during training, such as manipulating the level of "joy" or "formality" without altering the fundamental speaker identity. This enables highly efficient cross-speaker style transfer, where an expressive style from one set of speakers can be applied to a target speaker who only provided neutral training data (Li & Chen, 2023; Park & Song, 2023; Amazon Science, 2023).
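The practical consequence of disentanglement can be sketched with simple latent-vector arithmetic. In the toy example below, the dimension sizes and the randomly initialized "joy" direction are purely illustrative; the point is that scaling a style direction leaves the identity sub-vector untouched.

```python
# Toy sketch of the linear "control knob" property of a disentangled latent
# space: identity stays fixed while a style direction (a random placeholder
# for a learned "joy" axis) is scaled independently. Sizes are assumptions.
import torch

identity = torch.randn(48)          # speaker-identity sub-vector (held constant)
style = torch.randn(64)             # prosodic/affective sub-vector
joy_direction = torch.randn(64)
joy_direction = joy_direction / joy_direction.norm()   # hypothetical "joy" axis

def apply_style(identity_vec, style_vec, direction, intensity):
    """Shift only the style sub-space; the identity block is untouched."""
    return torch.cat([identity_vec, style_vec + intensity * direction])

neutral  = apply_style(identity, style, joy_direction, 0.0)
more_joy = apply_style(identity, style, joy_direction, 1.5)
# The identity portion is identical in both latents, so mimicry is preserved.
assert torch.equal(neutral[:48], more_joy[:48])
```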

 

3.2 Feature Categorization and Dimensionality Mapping (35–300 Dimensions)

The HDAFE achieves comprehensive expressive control by mapping audio input across four orthogonal feature groups. The total output dimensionality is tailored to be substantial enough to capture fine-grained acoustic nuances—ranging from approximately 165 to 300 dimensions—while remaining efficient enough for real-time inference. The four feature groups and their approximate dimension allocations are summarized below.

[Table: Feature categorization and dimensionality mapping across the four orthogonal feature groups: Phonetic/Content (C), Prosodic (P), Affective (A), and Identity (I).]

The selection of these dimensions ensures that each attribute is encoded in distinct, orthogonal latent vectors, minimizing overlap in the dimensions responsible for gender, speaking style, and speaker identity.
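One possible allocation of this dimension budget can be expressed compactly in code. Only the Identity range (30 to 50 dimensions, see Section 4.3) and the 165 to 300 total are stated in this paper; the per-group ranges for C, P, and A below are illustrative assumptions.

```python
# Sketch of how the four orthogonal feature groups could be laid out inside a
# single HDAFE output vector. Only the Identity range (Section 4.3) and the
# 165-300 total come from the text; the other ranges are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureGroup:
    name: str
    min_dims: int
    max_dims: int

HDAFE_GROUPS = [
    FeatureGroup("Phonetic / Content (C)", 60, 110),  # assumed range
    FeatureGroup("Prosodic (P)",           40, 80),   # assumed range
    FeatureGroup("Affective (A)",          35, 60),   # assumed range
    FeatureGroup("Identity (I)",           30, 50),   # stated in Section 4.3
]

total_min = sum(g.min_dims for g in HDAFE_GROUPS)   # 165
total_max = sum(g.max_dims for g in HDAFE_GROUPS)   # 300
print(f"HDAFE output dimensionality: {total_min}-{total_max}")
```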

 

3.3 Training the HDAFE for Disentangled Representation Learning

Training the HDAFE requires complex objectives beyond simple audio reconstruction. Specialized loss functions are essential for reinforcing feature separation. These include standard Reconstruction Loss (ensuring high-fidelity output), Identity Loss (guaranteeing that speaker-specific features are perfectly preserved regardless of emotion), and Style Transfer Loss, which evaluates the model’s ability to swap emotional characteristics between speakers, validating the transferability of the Prosodic and Affective features (Li & Chen, 2023).

Crucially, a dedicated Disentanglement Loss must be implemented. This can be achieved using techniques derived from β-Variational Autoencoders (β-VAE) or through adversarial training, where a classifier is trained to predict an attribute (e.g., speaker identity) from an unrelated dimension (e.g., the affective dimension) and is subsequently penalized for high accuracy. This ensures that the latent dimensions are truly orthogonal. Furthermore, to meet the Axis requirement for expressing the "full spectrum" of emotion, the training dataset must be vast and richly annotated, specifically targeting the documented difficulty of generative AI in accurately conveying negative and nuanced emotional states (Kim et al., 2024; Zhao & Xu, 2025).
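A minimal sketch of how these objectives could be combined into a single training loss is given below. The loss weights, embedding shapes, and adversarial speaker probe are illustrative assumptions; a style-transfer term, swapping P and A features between speakers, would be added analogously.

```python
# Minimal sketch of a composite HDAFE training objective following the terms
# described above. Weights and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def hdafe_loss(recon, target,
               spk_emb_neutral, spk_emb_emotional,
               adv_speaker_logits, speaker_labels,
               w_recon=1.0, w_id=0.5, w_adv=0.1):
    # Reconstruction loss: high-fidelity resynthesis of the input features.
    l_recon = F.l1_loss(recon, target)

    # Identity loss: the same speaker under different emotions should map to
    # (near-)identical identity embeddings.
    l_id = 1.0 - F.cosine_similarity(spk_emb_neutral, spk_emb_emotional, dim=-1).mean()

    # Disentanglement loss (adversarial): a probe tries to recover speaker
    # identity from the affective dimensions; the encoder is rewarded when it
    # fails, so the probe's cross-entropy enters with a negative sign.
    l_disent = -F.cross_entropy(adv_speaker_logits, speaker_labels)

    return w_recon * l_recon + w_id * l_id + w_adv * l_disent
```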

 

3.4 The HDAFE Output Transformation: High-Dimensional Discrete Codes

To seamlessly interface with the VLLM, which operates as a language model predicting the next token, the continuous, high-dimensional embedding output by the HDAFE must be converted into discrete tokens. This is accomplished through a Residual Vector Quantization (RVQ) layer, creating what can be termed a “Super-Codec.”

This Super-Codec does more than compress audio; it transforms the raw acoustic signal into tokens that explicitly carry the full 300 dimensions of disentangled information (C, P, A, I). The VLLM then learns the grammar and temporal sequence of these high-dimensional tokens, allowing it to predict not just textual content, but the corresponding complex expressive state. The high dimensionality ensures that although the tokens are discrete, they maintain sufficient temporal fidelity to capture the subtle micro-prosodic details relevant for emotional nuance and context-aware expression (Chen & Zhou, 2024).
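The residual quantization step itself can be illustrated as follows: each codebook quantizes the residual left by the previous one, and the per-stage indices become the discrete Super-Codec tokens. The codebook count, codebook size, and the 300-dimensional frame width below are illustrative assumptions.

```python
# Minimal residual vector quantization (RVQ) sketch for the "Super-Codec" stage.
import torch

def rvq_encode(x, codebooks):
    """x: (T, D) frame embeddings; codebooks: list of (K, D) tensors.
    Returns per-stage token indices and the quantized reconstruction."""
    residual = x
    indices, quantized = [], torch.zeros_like(x)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # distance to every codeword, (T, K)
        idx = dists.argmin(dim=-1)          # nearest codeword per frame, (T,)
        chosen = cb[idx]                    # (T, D)
        indices.append(idx)
        quantized = quantized + chosen
        residual = residual - chosen        # pass the residual to the next stage
    return indices, quantized

# Usage: 50 frames of 300-d disentangled features, 4 codebooks of 1024 entries.
frames = torch.randn(50, 300)
books = [torch.randn(1024, 300) for _ in range(4)]
tokens, recon = rvq_encode(frames, books)
print(len(tokens), tokens[0].shape, recon.shape)   # 4, (50,), (50, 300)
```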

 

IV. The Disentangled Voice Large Language Model (VLLM)

4.1 VLLM Architecture: A Unified Multimodal Token Predictor

The VLLM serves as the central cognitive engine, utilizing a decoder-only transformer backbone consistent with modern LLM architectures. This VLLM integrates three primary input streams: (1) Text/Prompt Tokens (for intent and dialogue conditioning), (2) HDAFE (Acoustic) Tokens (the high-dimensional Super-Codec output from the user's speech), and (3) Historical Context Tokens (maintaining long-context memory of the conversation) (Resemble AI, n.d.; Huang & Wang, 2023).

The model's generative task is the autoregressive prediction of output sequences. This sequence includes both the predicted textual response and the corresponding set of high-dimensional Super-Codec tokens that precisely dictate the desired acoustic attributes—expression, timbre, and style—necessary for lifelike generation (Microsoft Research, 2025).
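A minimal sketch of how the three input streams might be flattened into a single decoder sequence is shown below. The separator tokens and the token naming scheme are hypothetical; only the three-stream layout follows from the description above.

```python
# Sketch of flattening the three VLLM input streams into one token sequence.
# Special-token names and token IDs are hypothetical placeholders.
from typing import List

BOS, HIST_SEP, TEXT_SEP, AUDIO_SEP = "<bos>", "<history>", "<text>", "<audio>"

def build_vllm_input(history_tokens: List[str],
                     text_tokens: List[str],
                     hdafe_tokens: List[str]) -> List[str]:
    """Concatenate historical context, text/prompt, and HDAFE acoustic tokens."""
    return ([BOS, HIST_SEP] + history_tokens
            + [TEXT_SEP] + text_tokens
            + [AUDIO_SEP] + hdafe_tokens)

seq = build_vllm_input(
    history_tokens=["hist_12", "hist_87"],
    text_tokens=["How", "are", "you", "?"],
    hdafe_tokens=["c_503", "p_12", "a_77", "i_9"],   # Super-Codec token IDs
)
print(seq)
```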

 

4.2 Contextual Adaptation and Dynamic Style Control

The VLLM leverages the explicit conditioning provided by the HDAFE’s disentangled features to perform sophisticated, context-aware generation. This allows the VLLM to utilize the user’s acoustic prompt features (such as their specific timbre and identity) for in-context learning while dynamically modulating expressive elements (Microsoft Research, 2025).

For instance, using natural language prompts, the user or system can adapt the response delivery by steering it to adopt specific accents or produce a wide range of tones and expressions, including whispers or specific emotions. This dynamic performance requires that the VLLM continuously adapts acoustic attributes, such as prosody, loudness, and emotion, based on the historical context and the ongoing discourse, confirming that reliance on long-context information is critical for high-fidelity speech synthesis (Resemble AI, n.d.; Pindrop, n.d.).
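As a rough illustration, such natural-language steering could be realized by prepending a style directive to the dialogue context before decoding, as sketched below; the directive format is a hypothetical convention, not a documented interface.

```python
# Sketch of prompt-based style steering: a natural-language directive is
# prepended to the dialogue context before VLLM decoding. Format is hypothetical.
def build_style_prompt(style: str, accent: str, history: str, user_text: str) -> str:
    directive = f"[style: {style}] [accent: {accent}]"
    return f"{directive}\n{history}\nUser: {user_text}\nAssistant:"

print(build_style_prompt("whisper, reassuring", "Irish English",
                         "User: I'm nervous about tomorrow.",
                         "Can you talk me through it?"))
```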

 

4.3 Zero-Shot Voice Mimicry and Voice Signature Recognition

The Axis platform’s ability to instantly mimic any person’s voice is a direct consequence of the HDAFE’s successful disentanglement.

Mimicry Mechanism: During generation, the VLLM uses the HDAFE’s Identity (I) tokens (30–50 dimensions) extracted from the 3-second prompt as a hard, persistent constraint. The VLLM synthesizes the response by predicting the appropriate Phonetic (C), Prosodic (P), and Affective (A) tokens, while strictly adhering to the input (I) tokens to maintain the target speaker’s unique identity and vocal texture (Microsoft Research, 2025).

Personalization: The Voice Signature Recognition capability leverages the stability of the HDAFE’s Identity (I) tokens to identify and personalize responses for individual users. This process maps the incoming I-tokens to a stored user profile, allowing the VLLM to provide tailored, emotionally intelligent support in high-stakes scenarios such as healthcare or customer service. The architecture is designed for cross-speaker style transfer, meaning the style/emotion embedding (P + A) is globally transferable and orthogonal to the identity embedding (I). This orthogonality is essential for universal style switching across languages and is what allows the VLLM to be trained as a single-speaker, multi-style system using data augmentation (Amazon Science, 2023; Microsoft Research, 2025).
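The mimicry mechanism described above can be sketched as an identity-constrained decoding loop: the Identity (I) tokens from the enrolment prompt are injected verbatim into every generated frame, while the C, P, and A tokens are sampled. The sampler and frame layout below are illustrative stand-ins for the VLLM's actual decoding procedure.

```python
# Sketch of identity-constrained decoding: I-tokens are a hard constraint,
# C/P/A tokens are sampled. The sampler and frame layout are assumptions.
import random

def generate_frame(sample_fn, identity_tokens):
    """Predict content/prosody/affect tokens; identity is held fixed."""
    return {
        "C": sample_fn("content"),
        "P": sample_fn("prosody"),
        "A": sample_fn("affect"),
        "I": identity_tokens,          # never re-sampled, so the voice is preserved
    }

# Hypothetical stand-in for the VLLM's next-token sampler.
fake_sampler = lambda group: f"{group}_{random.randint(0, 1023)}"
prompt_identity = ["i_17", "i_842", "i_5"]     # from the 3 s enrolment clip

frames = [generate_frame(fake_sampler, prompt_identity) for _ in range(3)]
assert all(f["I"] == prompt_identity for f in frames)
```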

 

V. Cross-Modal Synthesis: Integrating Speech, Music, and Sound

The Axis platform extends far beyond dialogue, promising creative sound generation, singing, and musical performances. Achieving this necessitates a unified tokenization approach capable of representing all sonic outputs within the VLLM’s generative framework.

 

5.1 Mechanism for Unified Audio Tokenization

The HDAFE and its Super-Codec layer must be trained on a massive corpus encompassing not only speech but also musical and environmental acoustics. The core innovation is that the discrete codes must represent a unified language for all sonic outputs—speech, melody, rhythm, and environmental acoustics (Huang & Wang, 2023; Chen & Zhou, 2024).

This unified representation implies that the high-dimensional features (35–300D) must implicitly capture acoustic attributes traditionally handled separately, such as harmonic complexity, attack envelopes, and spectral density, allowing the VLLM to seamlessly transition between human dialogue and a live vocal performance.
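One way to picture such a unified representation is as a single contiguous token ID space partitioned by domain, as sketched below; the offsets and bucket sizes are illustrative assumptions.

```python
# Sketch of a unified token vocabulary spanning speech, music, and environmental
# sound. Offsets and bucket sizes are assumptions; the point is that one ID
# space lets the VLLM treat all sonic domains uniformly.
VOCAB_LAYOUT = {
    "speech_super_codec": range(0, 4096),      # C/P/A/I Super-Codec tokens
    "music":              range(4096, 8192),   # melodic / harmonic / rhythmic codes
    "environment":        range(8192, 10240),  # ambience and sound-effect codes
}

def domain_of(token_id: int) -> str:
    for name, ids in VOCAB_LAYOUT.items():
        if token_id in ids:
            return name
    raise ValueError(f"token {token_id} outside the unified vocabulary")

print(domain_of(5000))   # -> "music"
```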

 

5.2 Generating Interactive Music and Singing Performances

The VLLM integrates principles from specialized models like MusicGen, which operates over several streams of compressed discrete music tokens. For music generation, the VLLM is conditioned on textual input and potentially high-level musical features derived from an extended HDAFE analysis (e.g., detected tempo or key from the input audio) (Huang & Wang, 2023).

For vocal performances (singing), the system aligns the phonetic (C) and identity (I) tokens with musically appropriate prosodic (P) tokens, specifically pitch and duration, dictated by the VLLM's prediction of the musical conditioning. This alignment allows the generation of expressive readings of poetry or complex storytelling with accompanying melodies (Hsu & Huang, 2023).
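As a toy illustration of this alignment, the sketch below pairs phonetic tokens with musically dictated pitch and duration targets; the note values and the frame layout are illustrative, not drawn from any specific singing-synthesis system.

```python
# Sketch of aligning phonetic (C) tokens with musically dictated prosodic (P)
# targets for a sung line. Note values and frame layout are assumptions.
phonemes = ["HH", "AH", "L", "OW"]                                  # "hello"
melody   = [("E4", 0.5), ("G4", 0.5), ("G4", 0.25), ("B4", 0.75)]   # (pitch, beats)

sung_frames = [
    {"C": ph, "P": {"pitch": pitch, "duration_beats": beats}}
    for ph, (pitch, beats) in zip(phonemes, melody)
]
for frame in sung_frames:
    print(frame)
```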

 

5.3 Creative Sound Generation and Acoustic Environment Preservation

A requirement for truly lifelike AI dialogue is the seamless generation of non-speech sounds and the preservation of the acoustic environment. The HDAFE’s ability to encode the "acoustic environment" of the prompt is extended by the VLLM to generate contextually relevant, tailored background sound effects or to accurately recreate subtle non-speech sounds like sighs, laughs, or cheering, which are necessary to fill out the full emotional spectrum of a synthetic dialogue (Microsoft Research, 2025).

 

VI. Engineering for Ultra-Low Latency (Sub-100ms)

Achieving an effective Real-Time Factor (RTF) near zero requires a complete overhaul of the sequential voice pipeline into a modular, concurrent architecture.

 

6.1 The End-to-End Voice Pipeline: Modular and Concurrent Design

The pipeline must operate as a modular, multi-threaded system, integrating streaming HDAFE, quantized VLLM inference, and real-time TTS synthesis (Wang et al., 2025). This architectural necessity is driven by the reality that the sum of the latencies of all components must remain below 100 milliseconds for the first token, and subsequent processing must maintain an RTF near zero (Zhang & Liu, 2024).

HDAFE Streaming: The HDAFE must continuously perform sentence-level streaming of its high-dimensional, discrete tokens. This immediate feedback loop feeds the VLLM incrementally, requiring minimal lookahead buffering and allowing the VLLM to initiate inference immediately upon receiving the first high-dimensional tokens (Wang et al., 2025; Zhang & Liu, 2024).
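A minimal sketch of this modular, concurrent design is given below: three threads connected by queues stand in for the streaming HDAFE, the quantized VLLM, and the TTS synthesizer, so each stage starts work as soon as the previous one emits its first chunk. The stage bodies are placeholders rather than real model calls.

```python
# Sketch of the concurrent HDAFE -> VLLM -> TTS pipeline using threads and
# queues. Stage bodies are placeholders, not real model inference.
import queue
import threading

hdafe_out, vllm_out = queue.Queue(), queue.Queue()
STOP = object()

def hdafe_stage(audio_chunks):
    for chunk in audio_chunks:                       # streaming tokenization
        hdafe_out.put(f"tokens({chunk})")
    hdafe_out.put(STOP)

def vllm_stage():
    while (tokens := hdafe_out.get()) is not STOP:   # incremental inference
        vllm_out.put(f"sentence for {tokens}")
    vllm_out.put(STOP)

def tts_stage():
    while (sentence := vllm_out.get()) is not STOP:  # early, continuous audio
        print("play:", sentence)

threads = [
    threading.Thread(target=hdafe_stage, args=(["a0", "a1", "a2"],)),
    threading.Thread(target=vllm_stage),
    threading.Thread(target=tts_stage),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```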

 

6.2 Concurrency and Incremental Generation Strategies

Concurrency is the mechanism by which the pipeline components overlap their work cycles. The VLLM must transmit generated sentences incrementally to the TTS module for early and continuous audio output (Wang et al., 2025). Critically, the HDAFE processing of the next input segment must occur concurrently with the VLLM inference and TTS synthesis of the previous segment.

This high degree of overlap ensures that the HDAFE processing time and VLLM inference time components are executed in parallel, reducing the perceived end-to-end latency to the duration of the slowest step, rather than the sum of the steps. The effectiveness of this approach is critically dependent on the quality of the HDAFE conditioning. If the dimensional outputs perfectly condition the VLLM, the VLLM requires less processing time to determine the appropriate response and expression, thereby reducing the duration of the critical inference step (Wang et al., 2025).
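A toy numerical illustration of this effect, using assumed per-segment stage times, is given below.

```python
# Worked example of the latency claim above, with assumed per-segment times.
hdafe_ms, vllm_ms, tts_ms = 20, 60, 40       # illustrative stage times per segment

sequential = hdafe_ms + vllm_ms + tts_ms     # stages wait for each other
pipelined  = max(hdafe_ms, vllm_ms, tts_ms)  # stages overlap across segments

print(f"sequential per segment: {sequential} ms")   # 120 ms
print(f"pipelined steady state: {pipelined} ms")    # 60 ms (slowest stage)
```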

 

6.3 Optimization Techniques: Quantization and Efficient Codification

Two primary optimization techniques are crucial for realizing the sub-100ms target:

4-bit LLM Quantization: Quantizing the large VLLM to 4-bit precision is essential. This strategy significantly reduces the GPU memory footprint and memory access latency without substantially compromising generation quality, as evidenced by successful low-latency implementations (Wang et al., 2025). An illustrative loading sketch is given below.

Custom Kernel Acceleration: The HDAFE's complex, high-dimensional vector operations and the VLLM's specialized 4-bit transformer architecture require custom kernel implementations (e.g., optimized CUDA or TensorRT) to maximize efficiency and minimize execution time.
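Returning to the 4-bit quantization point above, the sketch below loads a decoder-only backbone in 4-bit NF4 precision via Hugging Face transformers and bitsandbytes. The checkpoint identifier is a placeholder, and the HDAFE-specific Super-Codec vocabulary extensions are omitted.

```python
# Illustrative sketch of loading a decoder-only VLLM backbone in 4-bit NF4
# precision with transformers + bitsandbytes. The checkpoint name is a
# hypothetical placeholder, not a released model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
    bnb_4bit_use_double_quant=True,         # also quantize quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "org/vllm-backbone",                    # hypothetical checkpoint id
    quantization_config=quant_config,
    device_map="auto",
)
model.eval()
```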
