| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Tokenization | AI4Bharat Sangraha Total Indic Corpus | Token Count (M): 6,623 | 3 |
| Tokenization | AI4Bharat Sangraha Telugu | Token Count (M): 599 | 3 |
| Tokenization | AI4Bharat Sangraha Tamil | Token Count (M): 684 | 3 |
| Tokenization | AI4Bharat Sangraha Punjabi | Token Count (M): 210 | 3 |
| Tokenization | AI4Bharat Sangraha Odia | Token Count (M): 228 | 3 |
| Tokenization | AI4Bharat Sangraha Marathi | Token Count (M): 529 | 3 |
| Tokenization | AI4Bharat Sangraha Malayalam | Token Count (M): 518 | 3 |
| Tokenization | AI4Bharat Sangraha Kannada | Token Count (M): 313 | 3 |
| Tokenization | AI4Bharat Sangraha Hindi | Token Count (M): 1,231 | 3 |
| Tokenization | AI4Bharat Sangraha Gujarati | Token Count (M): 605 | 3 |
| Tokenization | AI4Bharat Sangraha Bengali | Token Count (M): 1,638 | 3 |
| Tokenization | AI4Bharat Sangraha Assamese | Token Count (M): 100 | 3 |