Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

About

Conventional ASR systems use frame-level phoneme posteriors to conduct forced alignment (FA) and provide timestamps, while end-to-end ASR systems, especially AED-based ones, lack this ability. This paper proposes to perform timestamp prediction (TP) during recognition by utilizing the continuous integrate-and-fire (CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the fire-place bias issue of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift (AAS) and diarization error rate (DER) are adopted to measure timestamp quality, and we compare these metrics between the proposed system and a conventional hybrid forced-alignment system. Experimental results on a test set with manually marked timestamps show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing AAS and DER by 66.7% and 82.1%, respectively. Compared to Kaldi forced alignment trained with the same data, the optimized CIF timestamps achieve a 12.3% relative AAS reduction.
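To make the idea of reading timestamps off CIF concrete, the following is a minimal sketch of the generic integrate-and-fire step, not the paper's implementation. It assumes per-frame weights in [0, 1] and a firing threshold of 1.0; the function names and the 60 ms frame shift are hypothetical choices for illustration. The fire-delay, silence insertion, and scaled-CIF steps mentioned in the abstract are not modeled here.

```python
from typing import List, Tuple


def cif_fire_frames(alphas: List[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which the integrated weight reaches the threshold.

    Assumes every weight lies in [0, threshold], so at most one fire per frame.
    """
    fires = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha
        if acc >= threshold:
            fires.append(t)
            acc -= threshold  # carry the remainder over to the next token
    return fires


def fires_to_timestamps(
    fires: List[int], frame_shift_ms: float = 60.0
) -> List[Tuple[float, float]]:
    """Turn fire frames into (start_ms, end_ms) spans, one per token.

    Assumption: each token starts where the previous one ended. The frame
    shift is a hypothetical value, not taken from the paper.
    """
    spans = []
    start = 0.0
    for f in fires:
        end = f * frame_shift_ms
        spans.append((start, end))
        start = end
    return spans


# Toy weights chosen so that the accumulator crosses 1.0 three times.
alphas = [0.25, 0.5, 0.5, 0.25, 0.5, 0.5, 0.5]
fires = cif_fire_frames(alphas)
spans = fires_to_timestamps(fires)
print(fires)
print(spans)
```

In this toy setup every token boundary is placed exactly at a fire frame, which is the kind of raw estimate that the post-processing strategies described in the abstract are meant to correct.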

Xian Shi, Yanni Chen, Shiliang Zhang, Zhijie Yan • 2023

Related benchmarks

Task	Dataset	Result
Forced Alignment	MFA-Labeled Raw (test)	AAS Latency (Avg): 161.1	8
Forced Alignment	GTSinger-Speech-ZH	AAS: 61.98	5
Forced Alignment	Human-Labeled (test)	Avg. RTF: 0.0079	4
Forced Alignment	MFA-Labeled Concat-300s (test)	AAS (Avg) [ms]: 1.74e+3	4
Forced Alignment	MFA-labeled Long-form (test)	Average Alignment Value: 1.74e+3	4
Forced Alignment	human-labeled Chinese datasets (Mixed-300s)	AAS: 410.8	3
Forced Alignment	human-labeled Chinese datasets (Raw)	AAS: 49.9	3
Forced Alignment	human-labeled Chinese datasets (Raw-Noisy)	AAS: 0.533	3
Forced Alignment	human-labeled Chinese datasets (Mixed-60s)	AAS: 51.1	3
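
For readers unfamiliar with the AAS metric reported above, here is a small sketch of one plausible way to compute it. It assumes AAS is the mean absolute shift over all token start and end boundaries between predicted and reference timestamps, in milliseconds; that reading, the function name, and the sample values are assumptions for illustration and are not taken from the paper or from these benchmarks.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_ms, end_ms) of one token


def accumulated_averaging_shift(pred: List[Span], ref: List[Span]) -> float:
    """Mean absolute boundary shift between predicted and reference spans.

    Assumed definition: sum |start shift| + |end shift| over all tokens,
    divided by the number of boundaries (2 per token).
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must contain the same number of tokens")
    if not pred:
        raise ValueError("at least one token is required")
    total = 0.0
    for (p_start, p_end), (r_start, r_end) in zip(pred, ref):
        total += abs(p_start - r_start) + abs(p_end - r_end)
    return total / (2 * len(pred))


# Made-up spans for three tokens, in milliseconds.
pred = [(0.0, 120.0), (120.0, 240.0), (240.0, 360.0)]
ref = [(0.0, 100.0), (100.0, 250.0), (250.0, 360.0)]
aas = accumulated_averaging_shift(pred, ref)
print(aas)
```

Under this reading, lower values mean predicted boundaries sit closer to the reference ones, which matches how the abstract treats AAS reductions as improvements.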

