Story 1: NCERT Content Ingestion & RAG Pipeline
Overview
| Field | Value |
|---|---|
| Story ID | NGE-16-1 |
| Story Points | 21 |
| Sprint | Sprint 19-20 |
| Language | Python |
User Story
As a System
I want to ingest NCERT content into a searchable knowledge base
So that AI features can reference accurate curriculum content
Technical Stack
| Component | Technology |
|---|---|
| PDF Extraction | PyMuPDF, pdfplumber |
| Embeddings | Cohere embed-multilingual-v3.0 |
| Vector DB | Qdrant |
| Text Splitter | LangChain RecursiveCharacterTextSplitter |
Content Pipeline
NCERT PDFs → PDF Parser → Text Cleaner → Chunk Splitter → Embeddings → Qdrant
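The stages above can be sketched end to end with the standard library alone. This is a minimal illustration under stated assumptions, not the story's implementation: `clean_text`, `split_chunks`, and `run_pipeline` are hypothetical names, the splitter is a plain fixed-window splitter with overlap standing in for the LangChain splitter, and `embed` and `store` are injected callables standing in for the Cohere and Qdrant clients. PDF parsing is assumed to have already produced one string per page.

```python
import re
from typing import Callable, Iterable

Vector = list[float]


def clean_text(raw: str) -> str:
    """Normalize text extracted from one PDF page."""
    text = raw.replace("\u00ad", "")              # drop soft hyphens
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def split_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows with overlap.

    Assumption: a simplified stand-in for a recursive, separator-aware splitter.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def run_pipeline(
    pages: Iterable[str],
    embed: Callable[[list[str]], list[Vector]],         # hypothetical: wraps the embedding API
    store: Callable[[list[str], list[Vector]], None],   # hypothetical: wraps the vector DB upsert
) -> int:
    """Clean, chunk, embed, and store page texts; returns the number of chunks."""
    chunks: list[str] = []
    for page in pages:
        chunks.extend(split_chunks(clean_text(page)))
    if chunks:
        store(chunks, embed(chunks))
    return len(chunks)
```

Injecting `embed` and `store` keeps the cleaning and chunking logic testable without network access; the real clients would be wired in at the call site.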
Metadata Structure
{
  "board": "CBSE",
  "class": 10,
  "subject": "Science",
  "chapter": 4,
  "chapter_name": "Carbon and Its Compounds",
  "learning_outcomes": ["LO-1", "LO-2"],
  "bloom_level": "Understanding",
  "page_number": 56
}
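One way to keep this payload consistent across chunks is a small typed record that validates fields before they reach the vector store. The sketch below is illustrative only: `ChunkMetadata`, `to_payload`, and `BLOOM_LEVELS` are hypothetical names, and the list of allowed Bloom levels (gerund forms, matching "Understanding" above) is an assumption rather than something the story specifies.

```python
from dataclasses import dataclass

# Assumption: gerund forms of the six revised Bloom's taxonomy levels.
BLOOM_LEVELS = (
    "Remembering", "Understanding", "Applying",
    "Analyzing", "Evaluating", "Creating",
)


@dataclass(frozen=True)
class ChunkMetadata:
    board: str
    class_: int                      # "class" is a reserved word in Python
    subject: str
    chapter: int
    chapter_name: str
    learning_outcomes: tuple[str, ...]
    bloom_level: str
    page_number: int

    def __post_init__(self) -> None:
        if self.bloom_level not in BLOOM_LEVELS:
            raise ValueError(f"unknown bloom_level: {self.bloom_level!r}")
        if self.chapter < 1 or self.page_number < 1:
            raise ValueError("chapter and page_number must be positive")

    def to_payload(self) -> dict:
        """Return a dict using the field names shown in the JSON structure."""
        return {
            "board": self.board,
            "class": self.class_,
            "subject": self.subject,
            "chapter": self.chapter,
            "chapter_name": self.chapter_name,
            "learning_outcomes": list(self.learning_outcomes),
            "bloom_level": self.bloom_level,
            "page_number": self.page_number,
        }
```

Calling `to_payload()` on a record built from the example values yields a dict with the same keys as the JSON above, including the `class` key that cannot be used directly as a Python attribute name.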
Coverage (Phase 1)
| Board | Classes | Subjects |
|---|---|---|
| CBSE/NCERT | 6-12 | Science, Maths, SST, English |
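The Phase 1 scope could be encoded as a small lookup so that ingestion skips out-of-scope material early. This is a sketch under assumptions: `PHASE1_COVERAGE` and `in_phase1_scope` are hypothetical names, the board key is assumed to be `"CBSE"` (as in the metadata example), and subject labels are taken verbatim from the table.

```python
# Assumption: board key and subject labels mirror the coverage table as written.
PHASE1_COVERAGE = {
    "CBSE": {
        "classes": range(6, 13),  # classes 6 through 12 inclusive
        "subjects": {"Science", "Maths", "SST", "English"},
    },
}


def in_phase1_scope(board: str, class_: int, subject: str) -> bool:
    """Return True if the board, class, and subject fall inside Phase 1 coverage."""
    entry = PHASE1_COVERAGE.get(board)
    if entry is None:
        return False
    return class_ in entry["classes"] and subject in entry["subjects"]
```

For example, `in_phase1_scope("CBSE", 10, "Science")` is true, while a class 5 book or an unlisted subject is rejected.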