Ministry of Education

Introducing Ganga-1B: The First Pre-trained Hindi Model by Lingo Research Group at IIT Gandhinagar


Empowering Hindi Language Technology

Posted On: 08 JUL 2024 4:19PM by PIB Ahmedabad

The Lingo Research Group at IIT Gandhinagar proudly presents Ganga-1B, a breakthrough in language models. Named after the longest river flowing through the Hindi-speaking region of India, Ganga-1B is the first pre-trained Hindi model developed by an academic research lab in India.

Ganga-1B is the first milestone of Project Unity, an initiative that aims to celebrate and harness India's rich linguistic diversity by creating a comprehensive resource for the country's major languages. The initiative strives to achieve state-of-the-art performance in understanding and generating text in Indian languages. The Ganga-1B model was trained on an extensive monolingual Hindi language dataset.

The Ganga-1B model has been meticulously trained on a large dataset of public domain web-crawled Hindi language data. This includes news articles, web documents, books, government publications, educational materials, and quality-filtered social media conversations. Native Indian speakers have further curated the dataset to ensure high quality. Impressively, Ganga-1B outperforms existing open-source models supporting Indian languages, even those with up to 7 billion parameters.

Key Features:

  • Developed by: Lingo Research Group at IIT Gandhinagar
  • Model Type: Autoregressive Language Model
  • Languages: Bilingual (Primary: Hindi [hi], Secondary: English [en])
  • License: Apache 2.0

Technical Specifications:

  • Precision: Float32
  • Context Length: 2,048
  • Learning Rate: 4e-4
  • Optimizer: AdamW
  • LR Scheduler: Cosine

Model Architecture and Objective: Ganga-1B is a decoder-only transformer model with the following specifications:

  • Layers: 16
  • Attention Heads: 32
  • Embedding Dimension: 2,048
  • Vocabulary Size: 30,000
  • Sliding Window: 512
  • Intermediate Dimension: 7,168

The team took nearly 1.5 years to develop the Ganga-1B model using open-source data from various websites. Ganga-1B is open source and was downloaded by over 600 people within 48 hours of the announcement. The research team is also working on models for other languages, including Tamil, Telugu, Marathi, Gujarati and Urdu, and is exploring the use of AI in e-governance for regional languages. To support school students and teachers, the team is working on an education-focused LLM. Looking to develop a chatbot in Hindi? Why wait? Take advantage of this free model today.

AP/GP/JD


(Release ID: 2031553)