RAG on the Indian Constitution: Exploring Two Document Loaders in LangChain
Recently, I have been playing with RAG systems and decided to apply RAG to the Indian Constitution. RAG stands for Retrieval-Augmented Generation, a technique in generative AI that supplies an LLM with information from a knowledge base outside its training data. You can read more about it in this article.
I first downloaded the Indian Constitution from the official government website in PDF format. It is 404 pages long and contains 448 articles (395 in the original, with 100 amendments since the first release). Some pages carry a footer with amendment details for the relevant article or law.
The accuracy of RAG depends heavily on how well-formatted the document is and how you split it into chunks. PDF is a complex format that can contain images, tables, headers, footers, etc. This particular PDF is mostly plain text, with hardly any images or tables.
Using a Simple PDF structure
I first tried loading the constitution using the PyPDFLoader document loader from the LangChain library. Here is how it is done:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFLoader("constitution.pdf")
pages = loader.load()
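Now the PDF is loaded into the pages variable, which needs to be divided into chunks. I tried different chunk sizes such as 500, 800, and 1500 with different chunk_overlap values and settled on a chunk_size of 2000 with a chunk_overlap of 100. Note that with length_function=len these sizes are counted in characters, not tokens.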
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sizes are measured in characters because length_function=len
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,  # the default separators are plain strings, not regexes
)
# Split every loaded page into overlapping chunks
chunks = text_splitter.split_documents(pages)
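To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that class also tries to break on separators such as paragraph and line breaks). The function name split_with_overlap and the sample text are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample text: 5000 characters of repeating digits
text = "".join(str(i % 10) for i in range(5000))
chunks = split_with_overlap(text, chunk_size=2000, chunk_overlap=100)
print([len(c) for c in chunks])  # [2000, 2000, 1200]
```

With chunk_size=2000 and chunk_overlap=100, each new chunk starts 1900 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next.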
I selected Alibaba-NLP/gte-large-en-v1.5 as the embedding model, based on its MTEB Leaderboard ranking and model size, and generated the vector embeddings on the Kaggle platform.
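Once every chunk has an embedding, retrieval comes down to comparing the query's embedding against the chunk embeddings and keeping the closest ones. The sketch below shows that step with cosine similarity in plain Python. The tiny 3-dimensional vectors, the function names, and the choice of cosine similarity are assumptions for illustration. The real model produces much longer vectors, and a vector store would normally handle this search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for three chunks (hypothetical values)
chunk_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vector = [1.0, 0.05, 0.0]
print(top_k(query_vector, chunk_vectors, k=2))  # [0, 1]
```

The text of the top-ranked chunks is what gets passed to the LLM as context alongside the question.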
Here are some responses generated with these embeddings:
Query:
Summarize the article 22
Response:
Article 22 of the Indian Constitution deals with protection against arrest and detention in certain cases.
No person who is arrested shall be detained in custody without being informed, as soon as may be, of the grounds for such arrest.
The right to consult and to be defended by a legal practitioner of his choice must also be ensured upon arrest.
Query:
What does Part III of the Indian Constitution guarantee?
Response:
Part III of the Indian Constitution guarantees Fundamental Rights to all citizens.
These include rights such as equality before the law, freedom of speech and expression, right to form associations, protection against arrest without warrant, freedom from slavery and forced labor, and rights related to employment and appointment under the State.
It also includes provisions on personal liberty, right to move the High Court for enforcement of Fundamental Rights, right against detention in custody beyond a specified period without trial, and prohibition of torture or inhuman treatment while in custody.
Using Markdown Formatted structure
Markdown formatting works well with LLMs because its structured elements, like headings, lists, scientific formulas, and tables, help maintain context and improve the LLM's understanding of the document structure. Moreover, many LLMs are fine-tuned to generate responses in Markdown format.
I found a very good open-source tool called Marker, which converts PDFs into Markdown with high accuracy and uses fewer resources than Nougat from Meta. I couldn't run the conversion in Google Colab because of the limited RAM. On Kaggle, it went well.
For loading the Markdown file, I kept it simple by using a plain TextLoader (I read somewhere that UnstructuredMarkdownLoader strips the Markdown characters; I was too lazy to test it myself, so comment if I am wrong 😑) and, for splitting, RecursiveCharacterTextSplitter like before.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("constitution.md")
pages = loader.load()
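One reason Markdown output is attractive is that its headings give natural split points. The pipeline here still uses RecursiveCharacterTextSplitter, so the following is only a sketch of an alternative idea, not something evaluated in this post: cutting a Markdown string into sections at heading lines, using just the standard library. The function name split_by_headings and the sample text are hypothetical. LangChain also ships a MarkdownHeaderTextSplitter that may be worth trying for this.

```python
import re

# A heading line is 1 to 6 '#' characters, whitespace, then the heading text
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Group the lines of a Markdown string under the heading that precedes them."""
    sections = []
    current = {"heading": "", "body": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one
            if current["heading"] or current["body"]:
                sections.append(current)
            current = {"heading": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)
    if current["heading"] or current["body"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["body"]).strip()}
        for s in sections
    ]


# Placeholder document, not real constitution text
sample = "# Part X\n## Article A\nText of article A.\n## Article B\nText of article B."
sections = split_by_headings(sample)
for section in sections:
    print(section["heading"], "->", repr(section["text"]))
```

Each section keeps its heading next to its text, so a chunk about one article is less likely to be mixed with the neighbouring one.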
Here are some responses generated with these embeddings:
Query:
Summarize the article 22
Response:
Article 22 of the Indian Constitution provides certain rights and protections for individuals who are arrested or detained.
It states that any person who is arrested shall be produced before a Magistrate within twenty-four hours, unless the police have obtained the order of a Magistrate permitting them to detain the individual further.
The article also prohibits compelling anyone to be in custody beyond the period fixed by law without bringing such person up for trial or production before the court, except where the authority has been granted under specific provisions laid out in sub-clauses (a) and (b).
It guarantees that no one shall be detained in prison unless they are informed of their charge and have access to legal representation.
The article further ensures protection against unlawful detention based on certain conditions like being a minor or pregnant woman, among others.
Query:
What does Part III of the Indian Constitution guarantee?
Response:
Part III of the Indian Constitution guarantees Fundamental Rights to its citizens.
These rights encompass various aspects such as:

1. **Right to Equality**
   - Equality before law and prohibition of discrimination based on religion, race, caste, sex, place of birth.

2. **Freedom**
   - Freedom of speech and expression.
   - Protection against arrest without proper legal process.
   - Right to life and personal liberty.

3. **Right against Exploitation**
   - Prohibition of traffic in human beings and forced labor.
   - Prohibition on employment of children in factories.

4. **Freedom of Religion**
   - Freedom of conscience and free profession, practice, and propagation of religion.

5. **Cultural and Educational Rights**
   - Protection of interests of minorities.
   - Right of minorities to establish and administer educational institutions.

These rights form the cornerstone of individual liberties and protections enshrined in India's Constitution.
Conclusion
Oh yeah, I forgot to mention that all the source code is available on GitHub. Go check it out if you want. Overall, RAG on the Indian Constitution with open-source tools worked well. Of course, there is plenty left to improve, like evaluating the RAG pipeline with different frameworks and trying out other document loaders and text splitters.
I look forward to your feedback in the comment section. I will keep updating my progress.
Thanks for reading until now.
#KeepLearning