Google DeepMind's India unit is working on an Indic language AI project called Morni (Multimodal Representation for India), which aims to cover 125 Indian languages and dialects in order to build inclusive and equitable Indic AI.
"So India has 22 scheduled languages, which are viewed as official languages. But in our work, we are targeting over 100 Indian languages, because we find that there are 60 Indian languages which have over a million speakers and over 125 languages that have over a lakh speakers each," said Manish Gupta, Director of Google DeepMind, Google India. He was speaking at the Global Fintech Fest in Mumbai on Thursday.
He explained that 73 of these 125 languages had no digital data corpus at all. Even for a language like Hindi, which is now spoken by close to 10% of the world's population, the share of Hindi text on the internet is 0.1%.
Google's research lab addressed the challenge of sourcing data for these languages by launching Project Vaani, a collaboration between Google, the Indian Institute of Science (IISc), and ARTPARK (Artificial Intelligence & Robotics Technology Park).
The project has completed its first phase, creating an open-source database of over 14,000 hours of speech data across 58 languages, collected from 80,000 speakers in 80 districts, Gupta said.
First announced in December 2022, Project Vaani aims to collect and transcribe 154,000 hours of open-source anonymised speech data from all 773 districts of India. Gupta said they are now in the middle of phase two that