BHASHABENCH V1

A COMPREHENSIVE BENCHMARK FOR THE QUADRANT OF INDIC DOMAINS

ABSTRACT

The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law and International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India’s diverse knowledge domains, enabling assessment of models’ ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.

BHASHABENCH

The primary motivation behind BhashaBench V1 is to comprehensively assess domain-specific knowledge and reasoning capabilities of large language models within India’s diverse and culturally rich knowledge ecosystems. Unlike existing benchmarks focusing on general or Western-centric domains, our benchmark evaluates specialized Indian knowledge systems requiring deep cultural understanding and contextual awareness. BhashaBench V1 adheres to seven core design principles: (1) Critical Indian Domains: Encompasses Agriculture, Legal systems, Finance, and Ayurveda with fine-grained subfields. (2) Diverse Task Formats: Includes multiple-choice, assertion-reasoning, fill-in-the-blanks, and comprehension tasks. (3) India-Specific Reasoning: Evaluates domain-specific reasoning incorporating cultural contexts and regional practices. (4) Bilingual Framework: Supports English and Hindi evaluation while maintaining cultural authenticity. (5) Authentic Sources: Questions curated from government examinations and professional certifications. (6) Difficulty Assessment: Categorized into Easy, Medium, and Hard levels. (7) Cultural Authenticity: Prioritizes traditional knowledge systems including Ayurvedic principles. This framework spans 90+ subdomains covering 500+ topics, enabling comprehensive evaluation of model capabilities in India-centric contexts.

DATA COLLECTION

The data collection process for BhashaBench V1 follows a systematic approach similar to AGIEVAL (Zhong et al., 2023), focusing on authentic examination materials from national and state-level assessments. We systematically gathered publicly available question papers from official online examination portals, which host previously released papers that are manually curated by subject matter experts, ensuring accurate topic tagging, language annotation, and validated answer keys. Our comprehensive collection encompasses over 40 different examination types across multiple categories: national competitive exams, domain-specific degree examinations, professional certification tests, and state-level civil services examinations. Regional state examinations proved particularly valuable as they incorporate state-specific topics, local knowledge systems, and cultural practices often overlooked in national assessments. These examinations are typically taken by individuals seeking higher education opportunities or career advancement, ensuring questions reflect practical, real-world knowledge requirements. The final dataset comprises 74,166 carefully curated question–answer pairs spanning four core domains, with 52,494 questions in English (70.8%) and 21,672 questions in Hindi (29.2%), reflecting practical usage patterns in Indian educational and professional contexts. This approach ensures BhashaBench V1 captures the nuanced intersection between language, culture, and domain expertise essential for effective model deployment in Indian contexts.
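The reported language split follows directly from the question counts; as a quick sanity check, a minimal sketch:

```python
# Verify the reported language split of BhashaBench V1 from the raw counts.
english, hindi = 52_494, 21_672
total = english + hindi

pct_en = round(100 * english / total, 1)
pct_hi = round(100 * hindi / total, 1)

print(total, pct_en, pct_hi)  # 74166 70.8 29.2
```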

Data Processing and Analysis

The BhashaBench V1 dataset was built by extracting structured question-answer pairs from PDF examination papers while carefully preserving linguistic and cultural authenticity. A robust OCR pipeline powered by Surya OCR ensured high-quality text extraction across Indic languages, followed by a custom GPT-based system that structured the content into standardized JSON question-answer pairs. Extensive cleaning removed noisy and duplicate data, verified language integrity, and organized questions into six formats, with missing subdomains automatically classified. Rigorous expert validation further guaranteed accuracy, natural language flow, and cultural relevance. The final dataset contains 74,166 questions across four major domains (Agriculture, Finance, Ayurveda, and Legal) and 91 subdomains, with English (70.8%) and Hindi (29.2%) coverage. Agriculture emphasizes agronomy, Finance highlights quantitative problem solving, Ayurveda captures traditional medicine knowledge, and Legal spans both core jurisprudence and emerging areas like Cyber Law. With over 90% MCQs and a balanced distribution of difficulty levels, BhashaBench V1 stands as a high-quality, domain-rich bilingual benchmark, reflecting substantial technical and validation efforts to ensure both authenticity and reliability.
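The structuring and deduplication steps described above can be sketched as follows. This is an illustrative sketch only: the record schema, field names, and normalization rule are our own assumptions, not the benchmark's actual pipeline or format.

```python
import hashlib
import json
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase for duplicate detection
    (leaves Indic-script characters untouched)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe(records):
    """Drop exact-duplicate questions by hashing their normalized text."""
    seen, kept = set(), []
    for rec in records:
        key = hashlib.sha256(normalize(rec["question"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

# Hypothetical record in a standardized QA schema (field names illustrative).
sample = {
    "question": "Which doctrine governs riparian water rights?",
    "options": ["A", "B", "C", "D"],
    "answer": "A",
    "language": "English",
    "domain": "Legal",
    "subdomain": "Property Law",
    "difficulty": "Medium",
}

cleaned = dedupe([sample, dict(sample)])  # duplicate collapses to one record
print(json.dumps(cleaned, ensure_ascii=False)[:60])
```

`ensure_ascii=False` matters for a bilingual corpus: it keeps Devanagari text readable in the emitted JSON rather than escaping it to `\uXXXX` sequences.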

RESULTS AND DISCUSSIONS

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
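For illustration, a result file might be assembled as below. The field names and layout here are purely our assumption, since the expected submission schema is not specified in this section; the accuracy numbers are placeholders.

```python
import json

# Hypothetical layout of a leaderboard result file; the actual schema
# expected by the maintainers may differ.
result = {
    "model": "my-model-7b",
    "overall_accuracy": 62.31,
    "per_domain": {"Agriculture": 58.4, "Legal": 71.2,
                   "Finance": 60.9, "Ayurveda": 55.7},
    "per_language": {"English": 66.0, "Hindi": 53.8},
}

with open("results_my_model.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)
```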


LIMITATIONS AND BIASES

In this paper, we introduce BhashaBench V1, providing a comprehensive evaluation of LLMs on India-centric knowledge systems and exploring model capabilities across critical Indian domains. However, there are several limitations to acknowledge. (1) Language Coverage Limitations: Although BhashaBench V1 supports English and Hindi, covering a significant portion of India’s population, India has 22 official languages and hundreds of regional dialects. Our current evaluation cannot capture the full linguistic diversity of Indian knowledge systems, particularly regional variations in agricultural practices, legal terminologies, and traditional medicine nomenclature that exist in languages like Tamil, Telugu, Bengali, and others. Future iterations will expand to include additional Indian languages to enhance coverage. (2) Domain Scope Limitations: While we cover four fundamental domains (Agriculture, Legal, Finance, and Ayurveda) representing core areas of Indian society, our assessment cannot encompass the entire breadth of India-specific knowledge systems. Areas such as traditional crafts, regional governance systems, indigenous engineering practices, and other vernacular knowledge traditions remain unexplored for future expansion. Our content spans from grassroots practical knowledge to professional examination standards, ensuring broad applicability across different expertise levels. (3) Evaluation Methodology Limitations: Our evaluation primarily uses structured question formats derived from authentic government and professional examinations. While this ensures real-world relevance and practical applicability, it may not fully capture all forms of contextual reasoning required in complex domain applications.
The main biases in BhashaBench V1 can be categorized into three aspects: (1) Source Material Bias: Despite comprehensive curation from diverse authentic sources spanning grassroots to professional levels, certain regional practices and emerging contemporary developments may be underrepresented. (2) Language Resource Bias: The benchmark reflects the inherent resource disparity between English and Hindi, where Hindi content, while substantial, represents a relatively lower-resource context compared to English. (3) Examination Framework Bias: Our reliance on established examination systems, while ensuring authenticity, may introduce institutional perspectives present in the original assessment frameworks. However, our extensive coverage across 90+ subdomains and 500+ topics from diverse sources significantly mitigates this bias. Despite these limitations, the benchmark yields clear performance distinctions between models across domains and languages, evidenced by substantial score variations from 34.28% to 76.49%, demonstrating BhashaBench V1’s effectiveness in distinguishing LLM capabilities while presenting meaningful challenges even for top-performing models in India-specific contexts.

TOWARDS BROADER IMPACT

Societal Impact. BhashaBench V1 is anticipated to play a transformative role in bridging the digital divide for India-centric knowledge systems. LLMs trained and evaluated with BhashaBench V1 can significantly enhance accessibility to critical domain expertise across agriculture, legal services, finance, and traditional medicine, particularly benefiting underserved rural and semi-urban populations. In agriculture, improved LLM capabilities can democratize access to expert crop advisory, pest management, and sustainable farming practices, potentially impacting the livelihoods of over 40 million farmers dependent on agricultural activities. In the legal domain, enhanced models can assist with legal document comprehension, procedural guidance, and basic legal literacy, addressing the substantial access-to-justice challenges faced by millions in India’s complex legal system. For healthcare, particularly Ayurveda, better model performance can support practitioners and patients in understanding traditional treatment protocols and medicinal formulations, preserving and disseminating indigenous medical knowledge. In finance, improved model capabilities can enhance financial literacy and support the growing digital payment ecosystem processing billions of transactions annually. However, we acknowledge potential risks including over-reliance on automated systems for critical decisions, potential displacement of traditional knowledge practitioners, and the risk of perpetuating biases present in examination-based evaluation systems. The benchmark’s focus on professional examination standards, while ensuring quality, may inadvertently favor formal educational backgrounds over experiential knowledge.

Ethics Statement. We ensure strict adherence to applicable laws and ethical guidelines throughout our data collection, curation, and usage processes. All question-answer pairs are sourced exclusively from publicly available government and professional examination papers, respecting intellectual property rights and ensuring no unauthorized reproduction of copyrighted materials. Our curation process involved diverse teams to minimize cultural and regional biases, though we acknowledge the inherent limitations of our current English and Hindi coverage. The dataset contains no personally identifiable information, offensive content, or culturally insensitive material. All content has been thoroughly verified for authenticity and accuracy through multiple validation rounds involving domain experts. BhashaBench V1 is intended solely for academic research and educational purposes to advance inclusive AI development for Indian contexts. Any commercial use, misuse for harmful applications, or deployment without appropriate safeguards is strictly prohibited. We strongly urge all users to employ this resource responsibly, ensuring that any models developed or evaluated using BhashaBench V1 are deployed with appropriate human oversight, particularly in critical domains affecting public welfare, and with transparent disclosure of model limitations to end users.

ACKNOWLEDGEMENTS

We would like to thank BharatGen for their generous support towards building this comprehensive benchmarking initiative for Indian languages and knowledge systems. We are also immensely grateful to our colleagues from the BharatGen team for their motivation and meticulous efforts in conducting manual validation and data sourcing. We acknowledge the significant contributions of BharatGen’s consortium members, including IIT Bombay and IIM Indore, for their expertise in data curation, validation, and domain-specific guidance that made this benchmark possible.