AI4Bharat, the research lab at IIT Madras, on Thursday announced the release of its first large language model, Airavata, trained on Hindi datasets.
Airavata (Sanskrit word for ‘elephant’) was created by fine-tuning SarvamAI’s foundational Hindi model, OpenHathi, on diverse Hindi instruction-tuning datasets to make it better suited for assistive tasks, the research wing said in a blog post.
“We release Airavata, an open-source instruction tuned model for Hindi that shows encouraging performance on a wide range of tasks compared to other open-source models,” it said.
“This is a first step towards building high-quality open-source LLMs for Indian languages that encompass large pre-training datasets, diverse instruction tuning datasets and high-quality models.”
SarvamAI’s OpenHathi is an extension of Meta’s Llama2-7B model and boasts GPT-3.5-like performance for Indic languages.
“Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages,” it said.
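For readers who want to try the model, a minimal sketch of loading and prompting it with the Hugging Face transformers library is shown below. The checkpoint name "ai4bharat/Airavata" and the plain-text prompt format are assumptions for illustration, not details confirmed by the announcement.

```python
# Minimal sketch: prompting Airavata via Hugging Face transformers.
# NOTE: the repo id "ai4bharat/Airavata" is an assumed checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai4bharat/Airavata"  # assumption, not confirmed by the article
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A simple Hindi instruction: "What is the capital of India?"
prompt = "भारत की राजधानी क्या है?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```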
Given the scarcity of Hindi training data, AI4Bharat developed its model by translating well-constructed English instruction-tuning datasets into Hindi. Along with the model, AI4Bharat has also shared these instruction-tuning datasets to enable further research on Indic LLMs.
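The sketch below illustrates the dataset-translation idea described above: take an English instruction–response pair and machine-translate both fields into Hindi. The announcement does not specify the exact translation tooling AI4Bharat used; the publicly available Helsinki-NLP/opus-mt-en-hi model is used here only as a stand-in English-to-Hindi translator.

```python
# Minimal sketch of building Hindi instruction data by translating an
# English instruction-tuning example. The translation model is a stand-in,
# not the tooling AI4Bharat actually used.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

# Hypothetical English instruction-tuning record (illustrative only).
example = {
    "instruction": "Summarise the following paragraph in two sentences.",
    "response": "The paragraph describes a new open-source Hindi model.",
}

# Translate each field into Hindi to form the corresponding Hindi record.
hindi_example = {
    field: translator(text, max_length=512)[0]["translation_text"]
    for field, text in example.items()
}
print(hindi_example)
```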
“We rely on human-curated, license-friendly instruction-tuned datasets to build ‘Airavata’. We do not use data generated from proprietary models like