Agentic ASR

Multi-turn Dialogue for ASR Error Correction

Abstract

Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction—an essential component of human communication—has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), the ASRU 2019 code-switching test set, and MCA named-entity test sets. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability.

Framework Overview

Figure: Our proposed agentic framework for interactive ASR error correction.

Chinese ASR Error Correction

ID: TEST_NET_Y0000000000_-KTKHdZ2fb8_S00043
Rounds: 2
Ground Truth: 吴克儿待会儿去趟监控室
(English: "Wu Ke'er, make a trip to the monitoring room in a moment.")
Loop 0 - Initial ASR
吴克,带我去趟办公室。
(English: "Wu Ke, take me to the office.")
👤 Human Agent Think (Round 1)
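1. Difference localization: the name "吴克儿" (Wu Ke'er) was truncated to "吴克" (Wu Ke), losing the erhua suffix "儿"; "待会儿" (in a moment) was misheard as "带我" (take me); "监控室" (monitoring room) was misheard as "办公室" (office).
2. Strategy selection: Strategy A (name correction) + Strategy C (semantic-structure correction).
3. Utterance construction: first negate the wrong name and point out the "儿" suffix, then negate the wrong verb and restate the correct instruction in full.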
🗣️ Human Correction (Round 1)
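不是吴克,是吴克儿,克后面有个儿字。要纠正的是待会儿去监控室,不是带我去办公室
(English: "It is not Wu Ke, it is Wu Ke'er; there is a 儿 character after 克. The correction is 'go to the monitoring room in a moment', not 'take me to the office'.")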
🔊 ASR on Correction (Round 1)
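不是吴克,是吴克儿,克后面有个儿字。要纠正的是,待会儿去监控室,不是带我去办公室。
(English: "It is not Wu Ke, it is Wu Ke'er; there is a 儿 character after 克. The correction is, 'go to the monitoring room in a moment', not 'take me to the office'.")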
Loop 1 - Partial Correction
吴克儿,待会儿去监控室。
(English: "Wu Ke'er, go to the monitoring room in a moment.")
👤 Human Agent Think (Round 2)
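1. Difference localization: the Loop 1 result "吴克儿,待会儿去监控室" is still missing the character "趟" from "去趟" (make a trip to).
2. Strategy selection: Strategy C (semantic supplementation).
3. Utterance construction: supply the missing character "趟".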
🗣️ Human Correction (Round 2)
是去趟监控室,不是去监控室,有个趟字
(English: "It is 去趟监控室, not 去监控室; there is a 趟 character.")
🔊 ASR on Correction (Round 2)
是去趟监控室,不是去监控室,有个趟字。
(English: "It is 去趟监控室, not 去监控室; there is a 趟 character.")
Loop 2 - Corrected
吴克儿待会儿去趟监控室
(English: "Wu Ke'er, make a trip to the monitoring room in a moment.")
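
The two-round trace above can be read as a simple loop: compare the current on-screen text with the reference, produce a correction utterance, let the ASR agent edit the text, and stop once the two agree. The following is a minimal sketch of that control flow only. The names `run_correction_loop`, `user_agent`, `asr_agent`, and `is_equivalent` are hypothetical and do not come from the paper; the exact-match check is a stand-in for the LLM-as-a-Judge decision, and the spoken-correction and re-recognition step is collapsed into a single callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Record of one interactive session (hypothetical structure)."""
    hypotheses: List[str] = field(default_factory=list)   # Loop 0, Loop 1, ...
    corrections: List[str] = field(default_factory=list)  # recognized correction utterances


def run_correction_loop(
    initial_hypothesis: str,
    ground_truth: str,
    user_agent: Callable[[str, str], str],      # (current text, reference) -> correction utterance
    asr_agent: Callable[[str, str], str],       # (current text, correction) -> revised text
    is_equivalent: Callable[[str, str], bool],  # stand-in for the semantic judge
    max_rounds: int = 3,
) -> Trace:
    trace = Trace(hypotheses=[initial_hypothesis])
    current = initial_hypothesis
    for _ in range(max_rounds):
        if is_equivalent(current, ground_truth):
            break  # nothing left to correct
        correction = user_agent(current, ground_truth)
        current = asr_agent(current, correction)
        trace.corrections.append(correction)
        trace.hypotheses.append(current)
    return trace


# Replay the Chinese example above with scripted agents (no model calls).
scripted_corrections = iter([
    "不是吴克,是吴克儿,克后面有个儿字。要纠正的是,待会儿去监控室,不是带我去办公室。",
    "是去趟监控室,不是去监控室,有个趟字。",
])
scripted_edits = iter([
    "吴克儿,待会儿去监控室。",
    "吴克儿待会儿去趟监控室",
])

trace = run_correction_loop(
    initial_hypothesis="吴克,带我去趟办公室。",
    ground_truth="吴克儿待会儿去趟监控室",
    user_agent=lambda hyp, ref: next(scripted_corrections),
    asr_agent=lambda hyp, corr: next(scripted_edits),
    is_equivalent=lambda hyp, ref: hyp == ref,  # assumption: exact match instead of an LLM judge
)
print(len(trace.corrections), trace.hypotheses[-1])
```

With exact matching as the stop condition, the replay needs both rounds, because the Loop 1 text still differs from the reference by the character "趟" and by punctuation. A semantic judge might reasonably stop earlier or later; that choice is outside this sketch.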

English ASR Error Correction

ID: YOU1000000143_S0000123
Rounds: 1
Ground Truth: SO , AFTER GETTING OBAMA AND BIDEN ELECTED , I FELT THIS POWER TRIP .
Loop 0 - Initial ASR
So after getting Obama Biden elected, I felt this power shift.
👤 Human Agent Think
1. Error identification: GT's "TRIP" was incorrectly recognized as "SHIFT", which is a homophone confusion error
2. Strategy selection: A strategy (precise word/spelling correction), requiring full spelling provision
3. Communication construction: Directly provide correct spelling T-R-I-P and explain semantics (feeling of power inflation)
🗣️ Human Correction
It is not "power shift", it is "power trip", T R I P. A feeling of inflated power.
🔊 ASR on Correction
It is not power shift, it is power trip, T R I P. A feeling of inflated power.
🤖 ASR Agent Think
1. Original Text Anchor: Analyze the core components of the current screen text.
   - The word "shift" in "power shift" is the key point for correction
2. Clue Extraction: Identify spelling features provided by the user.
   - User explicitly provides the spelling "T-R-I-P"
   - Semantic explanation: "a feeling of inflated power/authority"
3. Modification Decision: Perform precise replacement of the keyword.
   - Replace "shift" with "trip"
   - Keep all other words in the sentence unchanged
Loop 1 - Corrected
So after getting Obama Biden elected, I felt this power trip.
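
To make the abstract's point about WER concrete, the sketch below scores the Loop 0 and Loop 1 outputs of this example against the ground truth. It is an illustration under stated assumptions, not the paper's scoring pipeline: the normalization (uppercasing and stripping punctuation) and the function names `normalize`, `edit_distance`, and `wer` are choices made here for the example.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Uppercase and drop punctuation (assumed normalization, not the paper's)."""
    return re.sub(r"[^A-Z0-9' ]+", " ", text.upper()).split()


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference)
    return edit_distance(ref, normalize(hypothesis)) / len(ref)


GT = "SO , AFTER GETTING OBAMA AND BIDEN ELECTED , I FELT THIS POWER TRIP ."
LOOP0 = "So after getting Obama Biden elected, I felt this power shift."
LOOP1 = "So after getting Obama Biden elected, I felt this power trip."

print(f"Loop 0 WER: {wer(GT, LOOP0):.3f}")
print(f"Loop 1 WER: {wer(GT, LOOP1):.3f}")
```

Under this normalization the reference has 12 words. Loop 0 has two errors (the dropped "AND" and "SHIFT" for "TRIP"), and Loop 1 still has one (the dropped "AND"). WER weights the two errors identically, although only the second changes what the speaker meant, which is the kind of gap a semantic judge is intended to capture.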

Code-switch ASR Error Correction

ID: 00117
Rounds: 1
Ground Truth: 史上最厉害的 ATTENDEE 马上要出现了
(English: "The most impressive ATTENDEE in history is about to appear.")
Loop 0 - Initial ASR
史上最厉害的Attendant马上就要出现了。
(English: "The most impressive attendant in history is about to appear.")
👤 Human Agent Think
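1. Difference localization: "ATTENDEE" in the ground truth was misrecognized as "Attendant", a confusion between two English words.
2. Strategy selection: Strategy B (cross-lingual / mixed-language rewriting); both the spelling and a semantic explanation are needed.
3. Utterance construction: emphasize the difference in the English spelling and explain the meaning of the term.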
🗣️ Human Correction
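不是Attendant,是英文单词ATTENDEE,A T T E N D E E,意思是参加者
(English: "It is not Attendant, it is the English word ATTENDEE, A T T E N D E E, meaning participant.")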
🔊 ASR on Correction
不是 attendant,是英文单词 attendee,A T T E N D E E。
(English: "It is not attendant, it is the English word attendee, A T T E N D E E.")
🤖 ASR Agent Think
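1. Original Text Anchor: Analyze the core components of the current screen text.
   - Chinese part: "史上最厉害的" (the most impressive in history), "马上就要出现了" (is about to appear)
   - English part: Attendant (to be corrected)
2. Clue Extraction: Identify the spelling and semantic features provided by the user.
   - User explicitly provides the full spelling "A T T E N D E E"
   - Chinese gloss: "参加者、出席的人" (participant, a person who attends)
   - The original ASR misrecognized the word as attendant (a server) rather than attendee (a participant)
3. Modification Decision: Perform precise replacement of the entity name.
   - Replace attendant with attendee
   - Keep the Chinese part completely unchanged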
Loop 1 - Corrected
史上最厉害的Attendee马上要出现了
(English: "The most impressive Attendee in history is about to appear.")
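
In both the English and the code-switch examples, the decisive clue is a word spelled out letter by letter ("T R I P", "A T T E N D E E"). In the demo this clue extraction is done by the LLM-driven ASR agent; the rule-based sketch below only illustrates the idea. The helpers `extract_spelled_word` and `replace_word` are hypothetical, and the word to be replaced is assumed to be known already.

```python
import re
from typing import Optional

# Three or more single ASCII letters separated by spaces or hyphens.
# Lookarounds are used instead of \b because Python treats CJK characters as word characters.
SPELLED = re.compile(r"(?<![A-Za-z])(?:[A-Za-z][ -]){2,}[A-Za-z](?![A-Za-z])")


def extract_spelled_word(utterance: str) -> Optional[str]:
    """Return the first letter-by-letter spelling in the utterance, joined and lowercased."""
    match = SPELLED.search(utterance)
    if match is None:
        return None
    return re.sub(r"[^A-Za-z]", "", match.group(0)).lower()


def replace_word(hypothesis: str, wrong: str, right: str) -> str:
    """Replace `wrong` with `right`, keeping the capitalization of the original token."""
    def keep_case(m: "re.Match[str]") -> str:
        return right.capitalize() if m.group(0)[0].isupper() else right
    return re.sub(re.escape(wrong), keep_case, hypothesis, flags=re.IGNORECASE)


en_correction = "It is not power shift, it is power trip, T R I P. A feeling of inflated power."
zh_correction = "不是 attendant,是英文单词 attendee,A T T E N D E E。"

en_fixed = replace_word(
    "So after getting Obama Biden elected, I felt this power shift.",
    "shift",
    extract_spelled_word(en_correction),
)
zh_fixed = replace_word(
    "史上最厉害的Attendant马上就要出现了。",
    "attendant",
    extract_spelled_word(zh_correction),
)
print(en_fixed)
print(zh_fixed)
```

The substring replacement is deliberately naive: it would also rewrite "shift" inside a longer word such as "shifted", and it leaves the rest of the sentence untouched, so it will not reproduce every detail of the agent's output shown above.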