Indic languages, Indian Institute of Science’s AI and Robotics Technology Park (IISc-ARTPARK) plans to open-source 16,000 hours of spontaneous speech from 80 districts as part of Project Vaani in collaboration with American technology company Google.
IISc-ARTPARK is curating datasets of 150,000 hours of natural speech and text from around one million people across 773 districts of India. The first phase of the project, launched at the end of 2022, is nearing completion.
The second phase will cover 160 districts. Each district has a target of 200 hours of speech from about 1,000 people. So far, IISc has received voice data in 58 language variants or dialects from 80 districts, and this data will be open-sourced.
“Bhashini, for the first time, helped create the digital data for low resource languages in its effort to build AI models for low resource languages,” Amitabh Nag, chief executive of Bhashini, told ET.
“The Vaani project with Google takes it a step further to ensure that we cover the length and breadth of the country. The datasets would be open-sourced for the use of startups to build applications and AI models.”
The idea is to use this dataset as base data or training data for speech-to-text AI