Building an AI Chatbot for the Coal Mining Industry with PETALS and Pinecone
I recently worked on an exciting project to create an AI chatbot that can answer questions about laws, regulations, news, and other information relevant to the coal mining industry in India. The goal was to make it easy for anyone to access accurate, up-to-date information on this topic 24/7 through a simple chat interface on WhatsApp or a website.
To build the chatbot, I used a combination of cutting-edge AI technologies:
- PETALS: A system that lets you run and fine-tune large language models like Llama 2 (70B parameters) over a distributed swarm, without needing expensive compute resources yourself
- Pinecone: A vector database that enables fast similarity search over embeddings
- Langchain: A framework for building applications with LLMs through composability
- PEFT: Parameter-Efficient Fine-Tuning methods to adapt the LLM to the coal mining domain
Here’s a high-level overview of the approach:
- Collect a dataset of relevant laws, acts, rules, regulations, news articles, and other content related to coal mining in India
- Fine-tune the open-source Llama 2 LLM on this dataset using PETALS and PEFT techniques to give it deep domain knowledge
- Convert the dataset into embeddings using a sentence transformer model and store them in Pinecone for fast retrieval
- When the user asks a question, convert it to an embedding, do a similarity search on Pinecone to find the most relevant context
- Pass the question and retrieved context to the fine-tuned LLM to generate an accurate final answer
- Expose the chatbot via both a website and WhatsApp for easy access
Let’s dive into each of these steps in more detail.
Data Collection
The first challenge was gathering a comprehensive, high-quality dataset covering all the key information someone might want to ask about coal mining in India - laws, regulations, safety guidelines, industry news, etc.
I used web scraping to collect this data from various government and industry websites. The scraping script would periodically check these sites for any new or updated content to keep the chatbot’s knowledge current.
Some of the key data sources included:
- Ministry of Coal website
- Directorate General of Mines Safety
- Coal Controller’s Organisation
- Press Information Bureau - Coal Ministry
- Major coal company websites and annual reports
- Mining industry journals and news sites
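The original scraping code isn't part of this post, but the link-discovery step it relies on can be sketched with the standard library alone. This is a minimal, hedged version - the page markup and the `extract_links` helper are illustrative, not what the project actually used:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In the real pipeline you would fetch each page (e.g. with `urllib.request` or `requests`), extract its links like this, and compare them against a set of already-seen URLs so the periodic re-crawl only processes new or updated content.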
The raw scraped data was then cleaned, formatted and saved for the next steps of fine-tuning and embedding.
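Part of that formatting step is splitting long documents into chunks small enough to embed and retrieve individually. Here's a minimal sketch of word-based chunking with overlap - the sizes are illustrative assumptions, not the project's actual settings:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks for embedding.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

In practice you'd often chunk on sentence or section boundaries rather than raw word counts, but the overlap idea carries over either way.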
Fine-Tuning the Language Model
With the coal mining dataset in hand, the next step was to teach the language model to understand this domain. I started with the open-source Llama 2 model, which has an impressive 70 billion parameters and broad general knowledge.
To adapt it to the coal mining domain, I used the PETALS service and PEFT (Parameter-Efficient Fine-Tuning) techniques. PETALS allows you to fine-tune large models like Llama without needing expensive GPU hardware yourself. It works by distributing the model over a swarm of contributed compute.
Here’s how you can load a PETALS model in Python:
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM
model_name = "petals-team/StableBeluga2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
model = model.cuda()
For fine-tuning, I used the PEFT method of prompt-tuning. This keeps the core LLM weights frozen and adds a small number of trainable “soft prompt” tokens to steer the model to the desired task.
Prompt-tuning is memory-efficient and avoids interfering with the base model used by others. Here’s how to set it up:
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, tuning_mode='deep_ptune', pre_seq_len=3)
model = model.cuda()
I then fine-tuned the model on the coal mining dataset, using a standard language modeling loss. The trainable prompts learn to steer the frozen LLM to generate text in the desired style and domain.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for i in range(num_epochs):
    for batch in dataloader:
        input_ids = batch["input_ids"].cuda()
        labels = batch["labels"].cuda()
        loss = model(input_ids=input_ids, labels=labels).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
After fine-tuning, the trained soft prompts steer the model toward the content and style of the coal mining corpus, while its general language understanding and generation capabilities stay intact. We can probe its knowledge with some test queries:
inputs = tokenizer("The Coal Mines Provident Fund and Miscellaneous Provisions Act was enacted in", return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))
Output:
The Coal Mines Provident Fund and Miscellaneous Provisions Act was enacted in 1948 to provide for the framing of a Provident Fund Scheme
Embedding and Vector Search
Fine-tuning gives the model deep knowledge of the coal mining domain. But to make it more efficient at question-answering, I used a retrieval-augmented approach with vector embeddings and a similarity search engine. The idea is to:
- Convert each chunk of the coal mining corpus into a vector embedding
- Store the embeddings in a vector database that allows fast similarity search
- At inference time, convert the user's question to an embedding and retrieve the most similar passages
- Feed both the question and retrieved passages to the model to generate a final answer
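Before wiring up the real services, the retrieval half of this loop can be illustrated with a toy in-memory version. The vectors and passages below are made up purely to show the mechanics of cosine-similarity ranking:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, corpus_vecs, corpus_texts, top_k=2):
    """Return the top_k corpus passages most similar to the query vector."""
    scored = sorted(
        zip(corpus_texts, corpus_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]
```

A vector database like Pinecone does essentially this, but with approximate nearest-neighbor indexes so the search stays fast over millions of embeddings.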
This retrieval step helps the model home in on the most relevant information and reduces hallucination. To generate embeddings, I used a SentenceTransformer model, which maps text to a semantic vector space where similar meanings are close together.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = model.encode(corpus_chunks, convert_to_tensor=True)
I then stored the embeddings in the Pinecone vector database which is optimized for fast similarity search at scale. Each embedding is tagged with the original text passage and other metadata.
import pinecone

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

index_name = "coal-mining-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric="cosine")
index = pinecone.Index(index_name)

for i, emb in enumerate(corpus_embeddings):
    meta = {
        "text": corpus_chunks[i],
        "source": sources[i]
    }
    index.upsert([(str(i), emb.cpu().numpy().tolist(), meta)])
To retrieve relevant passages for a given question, I convert the question to an embedding with the same SentenceTransformer model, then query Pinecone for the most similar document embeddings.
query_embedding = model.encode(query, convert_to_tensor=True)
results = index.query(query_embedding.cpu().numpy().tolist(), top_k=5, include_metadata=True)
contexts = [r["metadata"]["text"] for r in results["matches"]]
Generating Answers
With the fine-tuned model and retrieved contexts, I can now generate the final answer. I prompt the model with the original question and the top retrieved passages, separated by delimiters.
I use the Langchain library to manage the prompt templates and interaction with the model:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import Petals
prompt_template = """
Answer the question based on the context provided. If the context does not contain enough information to answer, say that you do not have enough information.
Context:
{context}
Question: {question}
Answer:
"""
prompt = PromptTemplate(
input_variables=["context", "question"],
template=prompt_template
)
llm = Petals(model_name="petals-team/StableBeluga2-coal-mining")
qa_chain = LLMChain(llm=llm, prompt=prompt)
def generate_answer(query):
    query_embedding = model.encode(query, convert_to_tensor=True)
    results = index.query(query_embedding.cpu().numpy().tolist(), top_k=5, include_metadata=True)
    contexts = "\n\n".join([r["metadata"]["text"] for r in results["matches"]])
    return qa_chain({"context": contexts, "question": query})
The model attends over the retrieved contexts and generates a final answer based on the most salient information. If the contexts are not sufficient to answer well, it is prompted to say so.
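The post doesn't show the website or WhatsApp layer, but a bare-bones HTTP endpoint wrapping the answer function can be sketched with only the standard library. The `/ask` route and the stubbed `generate_answer` below are placeholders standing in for the real pipeline:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate_answer(query):
    """Stub standing in for the retrieval + LLM pipeline described above."""
    return f"(answer for: {query})"


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"question": "..."}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        answer = generate_answer(payload.get("question", ""))
        body = json.dumps({"answer": answer}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def serve(port=8000):
    HTTPServer(("127.0.0.1", port), ChatHandler).serve_forever()
```

A production deployment would add authentication, rate limiting, and - for WhatsApp - an adapter for a messaging provider's webhook format, but the request/response shape stays the same.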
Wrap Up
So that was the process of building my coal mining chatbot! Honestly, it was quite the journey - tons of learning, a few roadblocks, but a great end result.
If you’re thinking of taking on an AI project yourself, my advice would be: start small, be ready to iterate a lot, and don’t be afraid to dive into new open-source tools and techniques - not every famous framework is built for the problem you’re working on. The field is moving so fast that there’s always something new to learn. Feel free to reach out if you have any questions, and keep an eye out for future posts on my other AI adventures!