Using Langchain and Gemini to classify news articles
A quick tutorial on how to integrate the Langchain framework when working with Language Models
By Carlos A. Toruño P.
March 3, 2024
My dear three readers, I know I have been off for a whole month now and I feel terrible. Things have been really crazy at work lately. However, I found some time this weekend and I wanted to continue my series on using AI to set up a gathering and classification system that can massively track and organize events that provide insights on the state of the Rule of Law in a country. So far, I have talked about how to gather news articles using a News API and how to classify them using AI, a topic that I left open back in January.
In my previous blog post, I briefly mentioned Langchain and why it is a must-know framework if you are planning on working with Large Language Models. Retaking that conversation, Langchain is an open-source framework that facilitates the integration of generative AI models into your own applications. You can see it as a toolkit that provides easy and fast solutions to many of the usual tasks that programmers face when dealing with language models, with a whole set of features that make your life easier. In this post, I’m just going to explain the basic features and how to integrate this amazing tool into our news classification exercise.
How does it work?
Langchain was built following a modular architecture. This means that you can easily call and use different components depending on your needs. The framework provides separate libraries for selectors, model wrappers, prompt management, output parsers, text splitting, API interaction, among many others. You can think of Langchain as a Swiss Army Knife that comes with multiple blades and tools that you can use independently, depending on how you plan to interact with the language model.
At the same time, Langchain also allows you to streamline the process by constructing “chains”. These chains are sequences of steps that process information, pass it to a language model, and ultimately generate an output. By defining these sequences or “chains” in advance, you end up with an assembly line that is ready for use.
For example, going back to our news classification exercise, we needed to customize a prompt template, send a call to a language model, and then parse the received output. These three steps can be easily streamlined using Langchain. As I said before, in this post I will only touch on some basic functionalities. However, if you would like a more in-depth explanation of Langchain, I suggest you watch this video from Rabbitmetrics:
Additionally, I would also suggest you read the official documentation publicly available on this website.
Required libraries
At this point, it is more than clear that we need to install Langchain in order to use it. We can install the official Python release by running the following line in our terminal:
pip install langchain
For this exercise, we will be using Google’s Gemini model, which was released in December 2023. Therefore, we will also need to install the official Python Software Development Kit from Google and the Google AI module from Langchain. Following the official documentation from both developers, we can install these by running the following lines in our terminal:
pip install -q -U google-generativeai
pip install --upgrade --quiet langchain-google-genai pillow
Once we have all the required libraries installed, we can proceed to import the modules we will be using in this tutorial. Most of these modules and libraries are already known to my three usual readers. Therefore, I’m just going to highlight three:
- The ChatPromptTemplate module allows us to manage prompt templates.
- The ChatGoogleGenerativeAI wrapper allows us to send calls to (i.e., invoke) the Gemini language model in a standardized fashion.
- The BaseOutputParser class allows us to easily parse outputs received from language models.
import json
import time
import pandas as pd
from langchain.schema import BaseOutputParser
from langchain.prompts.chat import ChatPromptTemplate
from langchain_google_genai import (
    ChatGoogleGenerativeAI,
    HarmBlockThreshold,
    HarmCategory
)
from google.generativeai.types import BlockedPromptException
Now that we have all of our libraries and modules, the next important step is to load our API key from Google AI Studio. As I always highlight, you have to be super-extra-intensively careful when managing your API keys. NEVER display API keys in your scripts… unless… NO!! NEVER!! The most common way to load API keys is through environment variables. For this, I usually use the Python dotenv library:
import os
from dotenv import load_dotenv
# Loading API KEY from environment
load_dotenv()
GoogleAI_key = os.getenv("googleAI_API_key")
os.environ['GOOGLE_API_KEY'] = GoogleAI_key
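For reference, the .env file that dotenv reads would look something like the following sketch (the variable name matches the one we retrieve above; the value is, of course, a placeholder):

# Content of the .env file — keep it out of version control (e.g., add it to .gitignore)
googleAI_API_key=your-api-key-goes-here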
Reading and exploring our data
The data we will be working with for this tutorial is a dataset of 203 news articles for which we have 4 variables:
- Article ID
- Headline text
- Summary
- Full content text
Let’s read the data and take a quick look at the first 15 articles in our set:
master_data = pd.read_parquet("master-data.parquet.gzip")
master_data.head(15)
 | article_id | title_eng | desc_eng | content_eng |
---|---|---|---|---|
0 | 5b3033b7646c124e2a893f135d6b6718 | Wikland's illustrations traveled to Latvia for... | [caption id="attachment_412290" align="alignno... | On Saturday, an exhibition of Ilon Wikland's i... |
1 | c55891953899b72b8fc9e4132727cec0 | Goals by Nova Englund and Linnea Helmin were n... | Surahammar was defeated in the meeting with Ha... | Surahammar was defeated in the meeting with Ha... |
2 | e7d7afc056b2aa73061834edf93aeaa7 | Leader of the Qassam Brigades: The Phantom of ... | Mohammed Deif is the leader of the military wi... | Mohammed Deif is the leader of the military wi... |
3 | aa4e2287a042765adaed62e66c38cc6b | Fierce criticism of the best tennis player in ... | The world's top-ranked tennis player, Arina Sa... | The best-ranked tennis player in the world, Be... |
4 | d07d8deeb60237fb1a3c12e1b07ab50c | Education Minister: Hubig: German skills in sc... | Education Minister Stefanie Hubig (SPD) wants ... | Hubig announced a precise analysis of the data... |
5 | 7abd533276ada8c99a4611a95fa6056b | The “Wild West” of Nantes in May: serial shoot... | Around ten episodes of gunfire left one dead a... | Le Figaro Nantes Bloody month of May in Nantes... |
6 | 2e51e03074e70136dc47cae1790bd721 | End of the Bundeswehr mission in Afghanistan: ... | The current federal government wanted to use “... | The current federal government wanted to use “... |
7 | c6697f8d32ddc66c6c48f6040e196466 | Cars and heavy vehicles, the European Parliame... | BRUXELLES. Less polluting cars and vans, the E... | BRUXELLES. Less polluting cars and vans, the E... |
8 | d1adbdd16c6f589859b3f0f4bb1bb7ed | Danko overshadowed the Voice. The nominations ... | As for the relations between Smer, Hlas and SN... | The leader of Hlas Peter Pellegrini presented ... |
9 | df8c3e7629128fe0ecabe64837cd54de | Iva Ančić: We expect a spectacle and an excell... | We see that the awareness of gaming itself has... | Reboot Infogamer powered by A1 is coming back ... |
10 | 6623a54bdb6452ede407e47779ae0f28 | An analysis by Ulrich Reitz - Ban the AfD? An ... | The Federal Minister of the Interior and her h... | Comments Email Share More Twitter Print Feedba... |
11 | fd6a409ea84add5d803fee8e2877d071 | Now it's coming: the green light for the first... | At its most recent meeting, the Homburg city c... | Now here it comes, the bike zone in the Hombur... |
12 | d88ae6a3ec59ddc8dd2df71d32a2cbe1 | Municipalities: District warns of fraud when d... | The Vorpommern-Greifswald district warns of in... | The Vorpommern-Greifswald district warns of in... |
13 | 1bf6ebbd3bad47afe77b0967f19b2a48 | That's why King Matthias shut down his uncle | Contrary to expectations, Mátyás turned out to... | Since Mátyás was a minor when László Hunyadi w... |
14 | f852ba76ef4574a0064c812b215d4ce0 | A PFAS ban? What does this mean for buyers of ... | Alarming news is coming from the backrooms of ... | Several environmental protection associations ... |
Loading the prompts
Today, we will be performing the same exercise that we did in our previous blog post. As a summary, we will be doing a classification exercise in which we ask the Gemini model to read a news article and classify it into two groups: (i) articles that are related to our Rule of Law, Justice, and Governance framework, and (ii) those that are unrelated. Once we have identified which articles are related to the Rule of Law, Justice, and Governance, we ask Gemini to provide a score telling us how closely related the article is to each one of the eight pillars of our framework: Constraints on Government Powers, Absence of Corruption, Open Government, Fundamental Freedoms, Order and Security, Regulatory Enforcement, Civil Justice, and Criminal Justice. For that reason, we will be referring to each one of these classification rounds as stage 1 and stage 2, respectively. For each one of these stages, we will be passing a context prompt and an instructions prompt. You can go over these prompts by clicking on the URLs below:
- Stage 1 - Context Prompt Template
- Stage 1 - Instructions Prompt Template
- Stage 2 - Context Prompt Template
- Stage 2 - Instructions Prompt Template
We proceed to load these plain text files as Python objects:
def read_prompt(file_path):
    with open(file_path, 'r', encoding = "utf-8") as f:
        text = f.read()
    return text
context_stage_1 = read_prompt("context_stage_1.txt")
instructions_stage_1 = read_prompt("instructions_stage_1.txt")
context_stage_2 = read_prompt("context_stage_2.txt")
instructions_stage_2 = read_prompt("instructions_stage_2.txt")
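As a quick sanity check that the files were read correctly, we can peek at the first few characters of one of them (the slice length here is arbitrary):

print(context_stage_1[:250])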
You can open these prompt templates and see how they try to provide accurate context and instructions to the model. Similarly, they provide a very extensive explanation of our theoretical framework so that the model output fits our needs as best as possible. Our goal is to pass this context every time we ask Gemini to read an article, which is why we treat these as templates. If you open any of the instructions prompt templates, you will see that they include the following chunk of text:
Now, given the following news article:
News title: {headline}
News summary: {summary}
News body: {body}
Every time we send a news article to Gemini, we have to replace the {headline}, {summary}, and {body} parts of the template with the actual headline, summary, and content that we have in our master_data. It is very important that the .txt files we are reading contain the “replaceable” parts within curly brackets in order for the prompt managing tools from Langchain to work as expected. In my previous post, we were doing this using the format() method for strings in Python. However, Langchain provides a similar tool for managing and customizing prompts through the ChatPromptTemplate module. We can define our context template as a System Role message and our instructions template as a Human Role message using the from_messages() method. To understand how role management works in text generation models, you can check this page from OpenAI’s official documentation. For our stage 1 exercise, we can define the prompt template as follows:
stage_1_prompt = ChatPromptTemplate.from_messages([
    ("system", context_stage_1),
    ("human", instructions_stage_1),
])
This way, Langchain will understand that there are parts that will need to be replaced in the prompt text before passing it to the model. We will tell Langchain how to replace these values when invoking the model. For now, it is fine just having a final prompt with the roles properly assigned. This is the first step in our “chain”.
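If you want to verify what the final prompt looks like before sending anything to the model, you can fill in the placeholders yourself using the format_messages() method. A quick sketch with made-up values:

# Previewing the filled-in prompt with hypothetical values (no API call involved)
preview = stage_1_prompt.format_messages(
    headline = "Example headline",
    summary  = "Example summary",
    body     = "Example body text"
)
print(preview[0].content[:200])   # First 200 characters of the System Role message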
Once we have our prompt template defined, we can think about our second step: sending the customized prompt to Gemini. For this, Langchain offers a wide set of wrappers that make it super easy to send calls to a large variety of Large Language Models. In this exercise, we will be using the ChatGoogleGenerativeAI wrapper to send our calls:
ChatGoogleGenerativeAI(model = "gemini-pro",
                       temperature = 0.2,
                       safety_settings = safety_settings,
                       convert_system_message_to_human = True)
For our calls, we specify that we would like to use the gemini-pro model with a temperature parameter of 0.2. The temperature parameter controls the randomness or creativity of the output. A low temperature will prioritize the most likely next words in the prediction, while a high temperature will consider “less likely” options. Given that we want the model to prioritize factual accuracy, we pass a low temperature. Moreover, given that Gemini does not support the “System Role” in its syntax, we activate the convert_system_message_to_human parameter.
By default, Gemini comes with some medium-high safety settings that could block a prompt from being answered by the model, so we would like to reduce how strict these settings are. For this, we define a new set of safety settings as follows (note that this dictionary has to be defined before instantiating the wrapper shown above):
safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE
}
However, the model could still block some of our news articles for undefined reasons that we cannot control. This will need to be taken into account when classifying the news articles. But with that, we have completed the second step in our “chain”.
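Before assembling the full chain, we can test the wrapper on its own. Here is a minimal sketch, assuming the GOOGLE_API_KEY environment variable is loaded as shown above and safety_settings is already defined (the question is just a placeholder):

# Invoking the model directly, outside of any chain
llm = ChatGoogleGenerativeAI(model = "gemini-pro",
                             temperature = 0.2,
                             safety_settings = safety_settings,
                             convert_system_message_to_human = True)
answer = llm.invoke("In one sentence, what is the Rule of Law?")
print(answer.content)   # The wrapper returns a message object; .content holds the text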
Finally, we have asked the model to:
Use the following JSON format to answer:
{{
rule_of_law_related: [Model Answer]
}}
By giving the model this instruction, we can expect its answer to follow a JSON structure. However, the output will still be a string. Therefore, we need to parse this string into a Python dictionary. For this, we subclass BaseOutputParser as follows:
class JSONOutputParser(BaseOutputParser):
    def parse(self, text: str):
        """
        Parse the output of an LLM call to a valid JSON format.
        """
        return json.loads(text.replace('```json', '').replace('```', ''), strict=False)
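As a quick sanity check, this is how the parser would behave on a made-up answer wrapped in a Markdown code fence, which is how Gemini usually returns JSON:

# Hypothetical raw output from the model
raw_answer = '```json\n{"rule_of_law_related": "Yes"}\n```'
parsed = JSONOutputParser().parse(raw_answer)
print(parsed["rule_of_law_related"])   # Yes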
Having defined the output parser, we can say that we have successfully defined the three steps required by our exercise. We can assemble all three steps into a single “chain” by using the | (pipe) operator provided by Langchain as follows:
chain_gemini = stage_1_prompt | ChatGoogleGenerativeAI(model = "gemini-pro",
                                                       temperature = 0.0,
                                                       safety_settings = safety_settings,
                                                       convert_system_message_to_human = True) | JSONOutputParser()
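To make sure everything is wired correctly, we could test the chain on a single, made-up article before processing the whole dataset. A quick hypothetical test:

# One-off test of the full chain with placeholder values
test_response = chain_gemini.invoke({
    "headline": "Parliament passes judicial reform",
    "summary" : "The new law changes how judges are appointed.",
    "body"    : "The bill was approved after a heated debate in parliament..."
})
print(test_response)   # Expected: something like {'rule_of_law_related': 'Yes'}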
If we were to manually classify one article at a time, this would be a valid solution. However, given that we want to pass all 203 news articles at once, we need to define a proper function, which we are going to call classify_article(). This function will take a headline, a summary, and the full body content, and it will classify or rate the news article according to the stage of the process we are calling it for. Stay with me, the following chunk of code might be a little bit rough, so I suggest you read the comments along the code.
def classify_article(headline, summary, body, stage_1 = True, relation = None):
    """
    This function takes a headline, a summary, and the content of a news article, and it sends a call to Google's Gemini
    to classify the article. There are two different classifications: Stage 1 and Stage 2. If stage_1 is set to TRUE, then
    the call to the model will try to answer the following question: Is this news article related or unrelated to the Rule of Law?
    If stage_1 is set to FALSE, then the call to the model will try to rate how closely related the news article is to each
    one of the eight pillars of the Rule of Law.
    """

    # Defining the prompt according to which stage we are calling the function for
    if stage_1 == True:
        system_prompt = context_stage_1
        human_prompt = instructions_stage_1
    else:
        system_prompt = context_stage_2
        human_prompt = instructions_stage_2

    # Setting up the Prompt Template
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", human_prompt),
    ])

    # Defining our chain
    chain_gemini = chat_prompt | ChatGoogleGenerativeAI(model = "gemini-pro",
                                                        temperature = 0.0,
                                                        safety_settings = safety_settings,
                                                        convert_system_message_to_human = True) | JSONOutputParser()

    # For Stage 2, we don't want to pass articles that were already classified as "UNRELATED", so we pre-define the outcome
    if stage_1 == False and relation != "Yes":
        outcome = "Unrelated"

    else:
        try:
            llm_response = chain_gemini.invoke({
                "headline": headline,
                "summary" : summary,
                "body"    : body,
            })
            status = True
            time.sleep(1)   # We need to slow down the calls, given that the Gemini API has a limit of 60 calls per minute

        # The API can still block some of our prompts for undefined reasons. Sadly, we can't do anything about it, so we
        # pre-define the outcome
        except BlockedPromptException:
            print("BLOCKED")
            status = False

        # We use the STATUS variable to assign an outcome to our call, depending on whether our prompt was blocked or not
        # and on the stage we are calling the function for
        if status == True:
            if stage_1 == True:
                outcome = llm_response["rule_of_law_related"]
            else:
                outcome = json.dumps(llm_response["pillars_relation"])
        else:
            outcome = "Blocked Prompt"

    return outcome
Once we have this awesome function defined, we can proceed to apply it to the whole data frame of news articles using the apply() method:
# Stage 1 of the classification
master_data["gemini_stage_1"] = master_data.apply(lambda row: classify_article(row["title_eng"],
                                                                               row["desc_eng"],
                                                                               row["content_eng"],
                                                                               stage_1 = True), axis = 1)
master_data["gemini_stage_1"].value_counts()
gemini_stage_1
Unrelated 155
Yes 47
Blocked Prompt 1
Name: count, dtype: int64
We can see that, after reading all 203 news articles, the model classified 155 as “Unrelated” and 47 as “Related” to the Rule of Law, Justice, and Governance framework that we passed, with one prompt blocked. This means that, for the second stage, we will only pass the 47 related articles to see how closely related they are to each of the eight pillars in our framework. We proceed in the same manner as we did for Stage 1, but this time we pass the outcome from Stage 1 as the relation parameter, so the function knows which articles to send to the model and which ones to skip.
# Stage 2 of the classification
master_data["gemini_stage_2"] = master_data.apply(lambda row: classify_article(row["title_eng"],
                                                                               row["desc_eng"],
                                                                               row["content_eng"],
                                                                               relation = row["gemini_stage_1"],
                                                                               stage_1 = False), axis = 1)
For each one of the 47 news articles that were classified as “RELATED” to the Rule of Law, the model has assigned a score from zero to ten rating how closely related the article is to each one of the eight pillars of our framework. Let’s take a look at one specific example:
print(master_data["gemini_stage_2"][5])
[
{"1. Constraints on Government Powers": 8},
{"2. Absence of Corruption": 7},
{"3. Open Government": 6},
{"4. Fundamental Rights": 7},
{"5. Security": 9},
{"6. Regulatory Enforcement and Enabling Business Environment": 5},
{"7. Civil Justice": 4},
{"8. Criminal Justice": 9}
]
As we can see, we have achieved our goal. However, having this huge string in our data is not practical. Therefore, what we can do is define a threshold: if the assigned rating is equal to or above this threshold, we can firmly say that the article IS related to a given pillar; otherwise, we label the news article as UNRELATED to that specific pillar. Following that logic, we can create eight binary variables that will be equal to one if the article meets or surpasses the threshold, and equal to zero otherwise. Let’s write a function that follows this logic.
import ast

def extract_score(string, pillar, t = 7):
    """
    This function extracts scores from a string and returns a binary value that is equal to 1 if the score is greater
    than or equal to a specific threshold, and equal to zero otherwise.
    """
    try:
        scores_dicts = ast.literal_eval(string)
        ratings = [v for x in scores_dicts for _, v in x.items()]
        keys = [k for x in scores_dicts for k, _ in x.items()]
        pattern = str(pillar) + ". "
        idx = next((index for index, element in enumerate(keys) if pattern in element), None)
        if idx is not None:
            score = ratings[idx]
        else:
            score = 0
        if score >= t:
            return 1
        else:
            return 0

    # Strings such as "Unrelated" or "Blocked Prompt" cannot be parsed as literals, so we return a zero for them
    except (ValueError, SyntaxError):
        return 0
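A quick test with a made-up score string shows the expected behavior under the default threshold of 7:

# Hypothetical string mimicking the stage 2 output
example = '[{"1. Constraints on Government Powers": 8}, {"2. Absence of Corruption": 3}]'
print(extract_score(example, 1))   # Returns 1, because 8 >= 7
print(extract_score(example, 2))   # Returns 0, because 3 < 7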
Easy peasy, right?
We now proceed to apply the function to create the new set of binary variables:
for i in range(1, 9):
    var_name = "Gemini_pillar_" + str(i)
    master_data[var_name] = master_data["gemini_stage_2"].apply(lambda x: extract_score(x, i))
master_data[master_data["gemini_stage_1"] == "Yes"].head(10)
 | article_id | title_eng | desc_eng | content_eng | gemini_stage_1 | gemini_stage_2 | Gemini_pillar_1 | Gemini_pillar_2 | Gemini_pillar_3 | Gemini_pillar_4 | Gemini_pillar_5 | Gemini_pillar_6 | Gemini_pillar_7 | Gemini_pillar_8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 7abd533276ada8c99a4611a95fa6056b | The “Wild West” of Nantes in May: serial shoot... | Around ten episodes of gunfire left one dead a... | Le Figaro Nantes Bloody month of May in Nantes... | Yes | [{"1. Constraints on Government Powers": 8}, {... | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
6 | 2e51e03074e70136dc47cae1790bd721 | End of the Bundeswehr mission in Afghanistan: ... | The current federal government wanted to use “... | The current federal government wanted to use “... | Yes | [{"1. Constraints on Government Powers": 8}, {... | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
8 | d1adbdd16c6f589859b3f0f4bb1bb7ed | Danko overshadowed the Voice. The nominations ... | As for the relations between Smer, Hlas and SN... | The leader of Hlas Peter Pellegrini presented ... | Yes | [{"1. Constraints on Government Powers": 7}, {... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
10 | 6623a54bdb6452ede407e47779ae0f28 | An analysis by Ulrich Reitz - Ban the AfD? An ... | The Federal Minister of the Interior and her h... | Comments Email Share More Twitter Print Feedba... | Yes | [{"1. Constraints on Government Powers": 9}, {... | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
13 | 1bf6ebbd3bad47afe77b0967f19b2a48 | That's why King Matthias shut down his uncle | Contrary to expectations, Mátyás turned out to... | Since Mátyás was a minor when László Hunyadi w... | Yes | [{"1. Constraints on Government Powers": 10}, ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
16 | 9563949a0c0bfff0105048ddcec42c63 | A man killed his wife and then committed suici... | A man killed his wife and then tried to kill h... | On the morning of October 16, 2023, a 66-year-... | Yes | [{"1. Constraints on Government Powers": 0}, {... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
19 | 2c886c2fc437d97bcf6efc783c6135e8 | German-Polish border: Federal police start new... | The federal police are gradually enforcing the... | The federal police are gradually enforcing the... | Yes | [{"1. Constraints on Government Powers": 8}, {... | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
23 | fa07a7d9886fbb61bf4ed851f7a537aa | Controversial decision: Naming the curator for... | A heated argument is raging over the naming of... | A heated argument has broken out in the Turkis... | Yes | [{"1. Constraints on Government Powers": 8}, {... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
41 | 6e3b6913c88b05f4e55572fe2302f72e | An elderly woman got to know Pasi, and a large... | The district court of Varsinais Suomen believe... | A woman in her seventies slipped in the yard o... | Yes | [{"1. Constraints on Government Powers": 8}, {... | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
45 | f5f879f7c5fa7c97cbc66ed65a974cfc | TAP. Leader of the Liberal Initiative accuses ... | Rui Rocha highlighted that, during the commiss... | This Monday, the president of IL accused the p... | Yes | [{"1. Constraints on Government Powers": 8}, {... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
master_data.iloc[:,6:].apply(sum)
Gemini_pillar_1 39
Gemini_pillar_2 19
Gemini_pillar_3 8
Gemini_pillar_4 28
Gemini_pillar_5 12
Gemini_pillar_6 3
Gemini_pillar_7 10
Gemini_pillar_8 20
dtype: int64
According to our results, 39 articles were classified as related to Pillar 1, while only three were classified as related to Pillar 6. There might be better ways to do what I just did (if you happen to know one, just email me), but this is how I am proceeding with the classification stage in our exercise. And that is it, my dear readers.
Bis bald und viel Spaß!!