Story 1: NCERT Content Ingestion & RAG Pipeline

Overview

Field | Value
Story ID | NGE-16-1
Story Points | 21
Sprint | Sprint 19-20
Language | Python

User Story

As a System
I want to ingest NCERT content into a searchable knowledge base
So that AI features can reference accurate curriculum content

Technical Stack

Component | Technology
PDF Extraction | PyMuPDF, pdfplumber
Embeddings | Cohere embed-multilingual-v3.0
Vector DB | Qdrant
Text Splitter | LangChain RecursiveCharacterTextSplitter

Content Pipeline

NCERT PDFs → PDF Parser → Text Cleaner → Chunk Splitter → Embeddings → Qdrant
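
The stages above can be sketched end to end with the standard library alone. This is a minimal illustration, not the project's implementation: `clean_text`, `split_chunks`, and `run_pipeline` are hypothetical names, the splitter is a simplified stand-in for LangChain's recursive splitter, and the `embed` and `upsert` callables stand in for the Cohere and Qdrant clients. PDF parsing is assumed to have already produced `(page_number, text)` pairs.

```python
import re
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse extra whitespace."""
    text = re.sub(r"-\n(?=\w)", "", raw)      # "com-\npounds" -> "compounds"
    text = re.sub(r"[ \t]+", " ", text)       # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)    # keep at most one blank line
    return text.strip()


def split_chunks(
    text: str,
    max_chars: int = 200,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: List[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = f"{current}{sep}{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_chars:
                current = piece
            else:
                # Piece is still too long: retry with the finer separators.
                chunks.extend(split_chunks(piece, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to fixed-width cuts.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def run_pipeline(
    pages: Iterable[Tuple[int, str]],
    embed: Callable[[str], List[float]],
    upsert: Callable[[List[float], Dict[str, object]], None],
    max_chars: int = 200,
) -> int:
    """Clean, chunk, embed, and store each page; returns the number of chunks stored."""
    count = 0
    for page_number, raw in pages:
        for chunk in split_chunks(clean_text(raw), max_chars):
            upsert(embed(chunk), {"page_number": page_number, "text": chunk})
            count += 1
    return count
```

Passing the embedding and storage steps in as callables keeps the sketch testable without network access. In a real run they would wrap the embedding API and the vector store's upsert call.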

Metadata Structure

{
  "board": "CBSE",
  "class": 10,
  "subject": "Science",
  "chapter": 4,
  "chapter_name": "Carbon and Its Compounds",
  "learning_outcomes": ["LO-1", "LO-2"],
  "bloom_level": "Understanding",
  "page_number": 56
}
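
One way to keep this payload consistent is to validate it through a typed record before it is stored. The sketch below is an assumption-laden illustration: `ChunkMetadata` and its methods are hypothetical names, and the allowed `bloom_level` values are assumed to follow the revised Bloom's taxonomy, since the source shows only "Understanding".

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List

# Assumed vocabulary (revised Bloom's taxonomy); only "Understanding" appears in the source.
BLOOM_LEVELS = (
    "Remembering", "Understanding", "Applying",
    "Analyzing", "Evaluating", "Creating",
)


@dataclass(frozen=True)
class ChunkMetadata:
    board: str
    class_: int  # serialized as "class"; renamed here because "class" is a Python keyword
    subject: str
    chapter: int
    chapter_name: str
    learning_outcomes: List[str]
    bloom_level: str
    page_number: int

    def __post_init__(self) -> None:
        if self.bloom_level not in BLOOM_LEVELS:
            raise ValueError(f"unknown bloom_level: {self.bloom_level!r}")
        if self.chapter < 1 or self.page_number < 1:
            raise ValueError("chapter and page_number must be positive")

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "ChunkMetadata":
        """Build a record from the JSON-shaped dict shown above."""
        data = dict(payload)
        data["class_"] = data.pop("class")
        return cls(**data)

    def to_payload(self) -> Dict[str, Any]:
        """Return the JSON-shaped dict, restoring the "class" key."""
        data = asdict(self)
        data["class"] = data.pop("class_")
        return data
```

Round-tripping a payload through `from_payload` and `to_payload` should return an equal dict, which makes the record easy to check in unit tests.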

Coverage (Phase 1)

Board | Classes | Subjects
CBSE/NCERT | 6-12 | Science, Maths, SST, English
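
The Phase 1 scope could be enforced with a small guard before a document enters the pipeline. This is only a sketch: `PHASE1_COVERAGE` and `in_phase1_scope` are hypothetical names, and the board key is assumed to be "CBSE" to match the metadata example.

```python
# Phase 1 scope from the table above; the "CBSE" key is an assumption.
PHASE1_COVERAGE = {
    "CBSE": {
        "classes": range(6, 13),  # classes 6 through 12 inclusive
        "subjects": {"Science", "Maths", "SST", "English"},
    },
}


def in_phase1_scope(board: str, class_: int, subject: str) -> bool:
    """Return True when the board, class, and subject all fall within Phase 1."""
    scope = PHASE1_COVERAGE.get(board)
    if scope is None:
        return False
    return class_ in scope["classes"] and subject in scope["subjects"]
```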