Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model
About
Conventional ASR systems use frame-level phoneme posteriors to perform forced alignment (FA) and provide timestamps, while end-to-end ASR systems, especially AED-based ones, lack this ability. This paper proposes to perform timestamp prediction (TP) during recognition by utilizing the continuous integrate-and-fire (CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the fire-place bias issue of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift (AAS) and diarization error rate (DER) are adopted to measure the quality of timestamps, and we compare these metrics between the proposed system and a conventional hybrid forced-alignment system. Experimental results on a test set with manually marked timestamps show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing AAS and DER by 66.7% and 82.1% respectively. Compared with Kaldi forced alignment trained on the same data, the optimized CIF timestamps achieve a 12.3% relative AAS reduction.
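To make the idea of reading timestamps off CIF concrete, the following is a minimal sketch of the generic integrate-and-fire step: per-frame weights are accumulated, and a token boundary is emitted at each frame where the accumulated weight reaches a threshold. It is not the paper's implementation. The function names, the threshold of 1.0, and the 60 ms frame shift are assumptions made for illustration, and the fire-delay, silence insertion, and scaled-CIF steps described above are not included.

```python
from typing import List


def cif_fire_frames(alphas: List[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which the accumulated weight fires.

    `alphas` are per-frame CIF weights; `threshold` is an assumed firing
    threshold (1.0 here, chosen for illustration).
    """
    fires: List[int] = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha
        # Fire once per threshold crossing and carry the remainder forward.
        while acc >= threshold:
            fires.append(t)
            acc -= threshold
    return fires


def frames_to_ms(frames: List[int], frame_shift_ms: int = 60) -> List[int]:
    """Convert frame indices to milliseconds (the frame shift is an assumed value)."""
    return [f * frame_shift_ms for f in frames]


# Hypothetical weights for seven encoder frames.
alphas = [0.25, 0.5, 0.5, 0.25, 0.75, 0.5, 0.25]
fire_frames = cif_fire_frames(alphas)
print(fire_frames)                # [2, 4, 6]
print(frames_to_ms(fire_frames))  # [120, 240, 360]
```

Each fire index marks where one token's accumulation completes, so consecutive fire positions can serve as raw token boundaries. Raw boundaries of this kind are the ones subject to the fire-place bias that the post-processing strategies above are meant to correct.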
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Forced Alignment | MFA-Labeled Raw (test) | AAS Latency (Avg) | 161.1 | 8 |
| Forced Alignment | Human-Labeled (test) | Avg. RTF | 0.0079 | 4 |
| Forced Alignment | MFA-Labeled Concat-300s (test) | AAS (Avg) [ms] | 1.74e+3 | 4 |
| Forced Alignment | MFA-labeled Long-form (test) | Average Alignment Value | 1.74e+3 | 4 |
| Forced Alignment | human-labeled Chinese datasets (Mixed-300s) | AAS | 410.8 | 3 |
| Forced Alignment | human-labeled Chinese datasets (Raw) | AAS | 49.9 | 3 |
| Forced Alignment | human-labeled Chinese datasets (Raw-Noisy) | AAS | 0.533 | 3 |
| Forced Alignment | human-labeled Chinese datasets (Mixed-60s) | AAS | 51.1 | 3 |