Using Langchain and Gemini to classify news articles

A quick tutorial on how to integrate the Langchain framework when working with Language Models

By Carlos A. Toruño P.

March 3, 2024

My dear three readers, I know I have been off for a whole month now and I feel terrible. Things have been really crazy at work lately. However, I found some time this weekend and I wanted to continue my series on using AI to set up a gathering and classification system that can massively track and organize events that provide insights into the state of the Rule of Law in a country. Until now, I have talked about how to gather news articles using a News API and how to classify them using AI, a topic that I left open back in January.

In my previous blog post, I briefly mentioned Langchain and why it is a must-know framework if you are planning on working with Large Language Models. Picking up that conversation, Langchain is an open-source framework that facilitates the integration of generative AI models into your own applications. You can see it as a toolkit that provides easy and fast solutions to many of the usual tasks that programmers face when dealing with language models. Langchain offers a whole set of features that make your life easier. In this post, I’m just going to explain the basic features and how to integrate this amazing tool into our news classification exercise.

How does it work?

Langchain was built following a modular architecture. This means that you can easily call and use different components depending on your needs. The framework provides separate libraries for selectors, model wrappers, prompt managing, output parsers, text splitting, API interaction, among many others. You can think of Langchain as a Swiss Army Knife that comes with multiple blades and tools that you can use independently, depending on how you are planning to interact with the language model.

At the same time, Langchain also allows you to streamline the process by constructing “chains”. These chains are sequences of steps that process information, pass it to a language model, and ultimately generate an output. By programming these sequences or “chains”, you can pre-program the process in order to have an assembly line ready for use.
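To make the chaining idea concrete, here is a minimal toy sketch that pipes two plain Python steps together using Langchain’s RunnableLambda wrapper; no language model is involved, and the step names are made up purely for illustration:

from langchain_core.runnables import RunnableLambda

# Each step is wrapped as a "runnable"; the | operator pipes one into the next
to_upper  = RunnableLambda(lambda text: text.upper())
add_label = RunnableLambda(lambda text: f"HEADLINE: {text}")

toy_chain = to_upper | add_label
print(toy_chain.invoke("news article goes here"))
# HEADLINE: NEWS ARTICLE GOES HERE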

For example, going back to our news classification exercise, we needed to customize a prompt template, send a call to a language model, and then parse the received output. These three steps can be easily streamlined using Langchain. As I said before, in this post I will only cover some basic functionality. However, if you would like a more in-depth explanation of Langchain, I suggest you watch this video from Rabbitmetrics:

Additionally, I would also suggest reading the official documentation, publicly available on this website.

Required libraries

At this point, it is more than clear that we need to install Langchain in order to use it. We can proceed by installing the official Python release by running the following line in our terminal:

pip install langchain

For this exercise, we will be using Google’s Gemini model, which was released in December 2023. Therefore, we will also need to install the official Python Software Development Kit from Google and the Google AI module from Langchain. Following the official documentation from both developers, we can install these by running the following lines in our terminal:

pip install -q -U google-generativeai
pip install --upgrade --quiet  langchain-google-genai pillow

Once we have all the required libraries installed, we can proceed to import the modules we will be using in this tutorial. Most of these modules and libraries are already known by my three usual readers. Therefore, I’m just going to highlight three:

  • The ChatPromptTemplate is a module that allows us to manage prompt templates.
  • The ChatGoogleGenerativeAI is a wrapper that allows you to send calls to, or invoke, the Gemini language model in a standardized fashion.
  • The BaseOutputParser is a module that allows us to easily parse outputs received from language models.

import json
import time
import pandas as pd
from langchain.schema import BaseOutputParser
from langchain.prompts.chat import ChatPromptTemplate
from langchain_google_genai import (
    ChatGoogleGenerativeAI,
    HarmBlockThreshold,
    HarmCategory
)
from google.generativeai.types import BlockedPromptException

Now that we have all of our libraries and modules, the next important step is to load our API key from Google AI Studio. As I always highlight, you have to be super-extra-intensively careful when managing your API keys. NEVER display API keys in your scripts… unless… NO!! NEVER!! The most common way to load API keys is through environment variables. For this, I usually use the Python dotenv library:

import os
from dotenv import load_dotenv

# Loading API KEY from environment
load_dotenv()
GoogleAI_key = os.getenv("googleAI_API_key")
os.environ['GOOGLE_API_KEY'] = GoogleAI_key
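For reference, the .env file sitting next to the script would contain a single line like the one below; the variable name googleAI_API_key is just the one I chose, and any name works as long as it matches what you pass to os.getenv():

# Contents of the .env file (illustrative placeholder, not a real key)
googleAI_API_key=your-api-key-goes-here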

Reading and exploring our data

The data we will be working with for this tutorial is a dataset of 203 news articles for which we have 4 variables:

  • Article ID
  • Headline text
  • Summary
  • Full content text

Let’s read the data and take a quick look at the first 15 articles in our set:

master_data = pd.read_parquet("master-data.parquet.gzip")
master_data.head(15)

article_id title_eng desc_eng content_eng
0 5b3033b7646c124e2a893f135d6b6718 Wikland's illustrations traveled to Latvia for... [caption id="attachment_412290" align="alignno... On Saturday, an exhibition of Ilon Wikland's i...
1 c55891953899b72b8fc9e4132727cec0 Goals by Nova Englund and Linnea Helmin were n... Surahammar was defeated in the meeting with Ha... Surahammar was defeated in the meeting with Ha...
2 e7d7afc056b2aa73061834edf93aeaa7 Leader of the Qassam Brigades: The Phantom of ... Mohammed Deif is the leader of the military wi... Mohammed Deif is the leader of the military wi...
3 aa4e2287a042765adaed62e66c38cc6b Fierce criticism of the best tennis player in ... The world's top-ranked tennis player, Arina Sa... The best-ranked tennis player in the world, Be...
4 d07d8deeb60237fb1a3c12e1b07ab50c Education Minister: Hubig: German skills in sc... Education Minister Stefanie Hubig (SPD) wants ... Hubig announced a precise analysis of the data...
5 7abd533276ada8c99a4611a95fa6056b The “Wild West” of Nantes in May: serial shoot... Around ten episodes of gunfire left one dead a... Le Figaro Nantes Bloody month of May in Nantes...
6 2e51e03074e70136dc47cae1790bd721 End of the Bundeswehr mission in Afghanistan: ... The current federal government wanted to use “... The current federal government wanted to use “...
7 c6697f8d32ddc66c6c48f6040e196466 Cars and heavy vehicles, the European Parliame... BRUXELLES. Less polluting cars and vans, the E... BRUXELLES. Less polluting cars and vans, the E...
8 d1adbdd16c6f589859b3f0f4bb1bb7ed Danko overshadowed the Voice. The nominations ... As for the relations between Smer, Hlas and SN... The leader of Hlas Peter Pellegrini presented ...
9 df8c3e7629128fe0ecabe64837cd54de Iva Ančić: We expect a spectacle and an excell... We see that the awareness of gaming itself has... Reboot Infogamer powered by A1 is coming back ...
10 6623a54bdb6452ede407e47779ae0f28 An analysis by Ulrich Reitz - Ban the AfD? An ... The Federal Minister of the Interior and her h... Comments Email Share More Twitter Print Feedba...
11 fd6a409ea84add5d803fee8e2877d071 Now it's coming: the green light for the first... At its most recent meeting, the Homburg city c... Now here it comes, the bike zone in the Hombur...
12 d88ae6a3ec59ddc8dd2df71d32a2cbe1 Municipalities: District warns of fraud when d... The Vorpommern-Greifswald district warns of in... The Vorpommern-Greifswald district warns of in...
13 1bf6ebbd3bad47afe77b0967f19b2a48 That's why King Matthias shut down his uncle Contrary to expectations, Mátyás turned out to... Since Mátyás was a minor when László Hunyadi w...
14 f852ba76ef4574a0064c812b215d4ce0 A PFAS ban? What does this mean for buyers of ... Alarming news is coming from the backrooms of ... Several environmental protection associations ...

Loading the prompts

Today, we will be performing the same exercise that we did in our previous blog post. As a summary, we will be doing a classification exercise in which we ask the Gemini model to read a news article and classify it into two groups: (i) articles that are related to our Rule of Law, Justice, and Governance framework, and (ii) those that are unrelated. Once we have identified which articles are related to the Rule of Law, Justice, and Governance, we ask Gemini to provide a score telling us how closely related the article is to each one of the eight pillars of our framework: Constraints on Government Powers, Absence of Corruption, Open Government, Fundamental Freedoms, Order and Security, Regulatory Enforcement, Civil Justice, and Criminal Justice. For that reason, we will be referring to these classification rounds as Stage 1 and Stage 2, respectively. For each of these stages, we will be passing a context prompt and an instructions prompt. You can go over these prompts by clicking on the URLs below:

We proceed to load these plain text files as Python objects:

def read_prompt(file_path):
    with open(file_path, 'r', encoding = "utf-8") as f:
        text = f.read()
    return text

context_stage_1      = read_prompt("context_stage_1.txt")
instructions_stage_1 = read_prompt("instructions_stage_1.txt")
context_stage_2      = read_prompt("context_stage_2.txt")
instructions_stage_2 = read_prompt("instructions_stage_2.txt")

You can open these prompt templates and see how they try to provide accurate context and instructions to the model. Similarly, they provide a very extensive explanation of our theoretical framework so that the model output fits our needs as best as possible. Our target is to pass this context every time we ask Gemini to read an article, which is why we treat these as templates. If you open any of the instruction prompt templates, you will see that they include the following chunk of text:

Now, given the following news article:
News title: {headline}
News summary: {summary}
News body: {body}

Every time we send a news article to Gemini, we have to replace the {headline}, {summary}, and {body} parts of the template with the actual headline, summary, and content that we have in our master_data. It is very important that the .txt files we are reading contain the “replaceable” parts within curly brackets in order for the prompt managing tools from Langchain to work as expected. In my previous post, we were doing this using the format() method for strings in Python. However, Langchain provides a similar tool for managing and customizing prompts through the ChatPromptTemplate module. We can define our context template as a System Role message and our instructions template as a Human Role message using the from_messages() method. To understand how role management works in text generation models, you can check this page from OpenAI’s official documentation. For our Stage 1 exercise, we could define the prompt template as follows:

stage_1_prompt = ChatPromptTemplate.from_messages([
                ("system", context_stage_1),
                ("human", instructions_stage_1),
            ])

This way, Langchain will understand that there are parts that will need to be replaced in the prompt text before passing it to the model. We will tell Langchain how to replace these values when invoking the model. For now, it is fine just having a final prompt with the roles properly assigned. This is the first step in our “chain”.
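If you are curious about what this replacement looks like before any model gets involved, you can render the template yourself with the format_messages() method; the placeholder values below are made up:

# Rendering the template manually with dummy values (illustrative only)
example_messages = stage_1_prompt.format_messages(
    headline = "Example headline",
    summary  = "Example summary",
    body     = "Example body text"
)
print(example_messages[0].type)   # system
print(example_messages[1].type)   # human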

Once we have our prompt template defined, we can think about our second step, which is sending the customized prompt to Gemini. For this, Langchain offers a wide set of wrappers that make it super easy to send calls to a large variety of Large Language Models. In this exercise, we will be using the ChatGoogleGenerativeAI wrapper to send our calls:

ChatGoogleGenerativeAI(model = "gemini-pro",
                       temperature     = 0.2, 
                       safety_settings = safety_settings,
                       convert_system_message_to_human = True)

For our calls, we are specifying that we would like to use the gemini-pro model with a temperature parameter of 0.2. The temperature parameter is used to control the randomness or creativity of the output. A low temperature will prioritize the most likely next words in its prediction, while a high temperature will consider “less likely” options in the prediction. Given that we want the model to work under “factual accuracy”, we pass a low temperature parameter. Moreover, given that Gemini does not support the “System Role” in its syntax, we activate the convert_system_message_to_human parameter.

Given that, by default, Gemini comes with some medium-high safety settings that could block a prompt from being answered by the model, we would like to reduce how strict these settings are. For this, we define a new set of safety settings as follows:

safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE
}

However, the model could still block some of our news articles for undefined reasons that we cannot control. This will need to be taken into account when classifying the news articles. But for now, we have completed the second step in our “chain”.

Finally, we have asked the model to:

Use the following JSON format to answer:
{{
    rule_of_law_related: [Model Answer]
}}

By giving the model this instruction, we can expect its answer to follow a JSON structure. However, the output will still be a string. Therefore, we need to parse this string into a Python dictionary. For this, we will subclass BaseOutputParser as follows:

class JSONOutputParser(BaseOutputParser):
    def parse(self, text: str):
        """
        Parse the output of an LLM call to a valid JSON format.
        """
        return json.loads(text.replace('```json', '').replace('```', ''), strict=False)
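We can quickly sanity-check the parser on the kind of fenced answer Gemini tends to return; the sample string below is made up:

# The raw model output usually comes wrapped in a markdown code fence
raw_answer = '```json\n{"rule_of_law_related": "Yes"}\n```'
parsed = JSONOutputParser().parse(raw_answer)
print(parsed["rule_of_law_related"])   # Yes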

Having defined the output parser, we can say that we have successfully defined the three steps required by our exercise. We can assemble all three steps into a single “chain” by using the | operator provided by Langchain as follows:

chain_gemini = stage_1_prompt | ChatGoogleGenerativeAI(model = "gemini-pro",
                                                       temperature     = 0.0,
                                                       safety_settings = safety_settings,
                                                       convert_system_message_to_human = True) | JSONOutputParser()
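With the chain assembled, classifying a single article by hand is just one call to the invoke() method; the printed result below is only an example of what Stage 1 might return:

# Classifying the first article in our data frame (illustrative output)
single_result = chain_gemini.invoke({
    "headline": master_data["title_eng"][0],
    "summary" : master_data["desc_eng"][0],
    "body"    : master_data["content_eng"][0],
})
print(single_result)   # e.g. {'rule_of_law_related': 'Unrelated'}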

If we were to manually classify one article at a time, this would be a valid solution. However, given that we want to pass all 203 news articles at once, we need to define a proper function that we are going to call classify_article(). This function will take a headline, a summary, and the full body content, and it will classify or rate the news article according to the stage of the process we are calling it for. Stay with me, the following chunk of code might be a little bit rough, so I suggest you read the comments along the code.

def classify_article(headline, summary, body, stage_1 = True, relation = None):
    """
    This function takes a headline, a summary, and the content of a news article and it sends a call to Google's Gemini
    to classify the article. There are two different classifications: Stage 1 and Stage 2. If stage_1 is set to TRUE, then
    the call to the model will try to answer the following question: Is this news article related or unrelated to the Rule of Law?
    If stage_1 is set to FALSE, then the call to the model will try to rate how closely related the news article is to each
    one of the eight pillars of the Rule of Law.
    """

    # Defining the prompt according to the stage we are calling the function for
    if stage_1 == True:
        system_prompt = context_stage_1
        human_prompt  = instructions_stage_1
    else:
        system_prompt = context_stage_2
        human_prompt  = instructions_stage_2

    # Setting up the Prompt Template
    chat_prompt = ChatPromptTemplate.from_messages([
                    ("system", system_prompt),
                    ("human", human_prompt),
                ])

    # Defining our chain
    chain_gemini = chat_prompt | ChatGoogleGenerativeAI(model = "gemini-pro",
                                                        temperature     = 0.0, 
                                                        safety_settings = safety_settings,
                                                        convert_system_message_to_human = True) | JSONOutputParser()
    
    # For Stage 2, we don't want to pass articles that were already classified as "UNRELATED", so we pre-define the outcome
    if stage_1 == False and relation != "Yes":
        outcome = "Unrelated"

    else:
        try: 
            llm_response = chain_gemini.invoke({
                "headline": headline,
                "summary" : summary,
                "body"    : body,
            })
            status = True
            time.sleep(1)   # We need to slow down the calls, given that the Gemini API has a limit of 60 calls per minute

        # The API can still block some of our prompts for undefined reasons. Sadly, we can't do anything about it, so we
        # predefine the outcome
        except BlockedPromptException:
            print("BLOCKED")
            status = False
                
        # We use the STATUS variable to throw an outcome to our call depending on whether our prompt was blocked or not
        # and on the stage we are calling the function for
        if status == True:
            if stage_1 == True:
                outcome = llm_response["rule_of_law_related"]

            else:
                outcome = json.dumps(llm_response["pillars_relation"])
        else:
            outcome = "Blocked Prompt"

    return outcome
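Before unleashing it on the entire data frame, you could double-check that everything is wired up correctly by trying the function on a single row; the output shown is just an example:

# Testing the function on the first article of our data (illustrative output)
test_outcome = classify_article(master_data["title_eng"][0],
                                master_data["desc_eng"][0],
                                master_data["content_eng"][0],
                                stage_1 = True)
print(test_outcome)   # e.g. "Unrelated"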

Once we have this awesome function defined, we can proceed to apply it to the whole data frame of news articles using the apply() method:

# Stage 1 of the classification
master_data["gemini_stage_1"] = master_data.apply(lambda row: classify_article(row["title_eng"], 
                                                                               row["desc_eng"], 
                                                                               row["content_eng"], 
                                                                               stage_1 = True), axis = 1)
master_data["gemini_stage_1"].value_counts()
gemini_stage_1
Unrelated         155
Yes                47
Blocked Prompt      1
Name: count, dtype: int64

We can see that, after reading all 203 news articles, the model classified 155 as “Unrelated” and 47 as “Related” to the Rule of Law, Justice, and Governance framework that we passed, while one prompt was blocked. This means that, for the second stage, we will only pass 47 articles to see how closely related they are to each of the eight pillars in our framework. We proceed in the same manner as we did for Stage 1, but this time, we pass the outcome from Stage 1 as the relation parameter, so the function knows which articles to send to the model and which ones not to.

# Stage 2 of the classification
master_data["gemini_stage_2"] = master_data.apply(lambda row: classify_article(row["title_eng"], 
                                                                               row["desc_eng"], 
                                                                               row["content_eng"], 
                                                                               relation = row["gemini_stage_1"],
                                                                               stage_1  = False), axis = 1)

For each one of the 47 news articles that were classified as “RELATED” to the Rule of Law, the model has assigned a score from zero to ten rating how closely related the article is to each one of the eight pillars of our framework. Let’s take a look at one specific example:

print(master_data["gemini_stage_2"][5])
[
  {"1. Constraints on Government Powers": 8}, 
  {"2. Absence of Corruption": 7}, 
  {"3. Open Government": 6}, 
  {"4. Fundamental Rights": 7}, 
  {"5. Security": 9}, 
  {"6. Regulatory Enforcement and Enabling Business Environment": 5}, 
  {"7. Civil Justice": 4}, 
  {"8. Criminal Justice": 9}
]

As we can see, we have achieved our goal. However, having this huge string in our data is not practical. Therefore, what we can do is define a threshold: if the assigned rating is equal to or above this threshold, then we can firmly say that the article IS related to that pillar. Otherwise, we label the news article as UNRELATED to that specific pillar. Following that logic, we can create eight binary variables that will be equal to one if the article’s rating meets or exceeds the threshold, and equal to zero otherwise. Let’s write a function that follows this logic.

import ast

def extract_score(string, pillar, t = 7):
    """
    This function extracts scores from a string and returns a binary value that is equal to 1 if the score is greater
    than or equal to a specific threshold, and 0 otherwise.
    """
    try:
        scores_dicts = ast.literal_eval(string)
        ratings = [v for x in scores_dicts for _,v in x.items()]
        keys    = [k for x in scores_dicts for k,_ in x.items()]
        pattern = str(pillar) + ". "
        idx     = next((index for index, element in enumerate(keys) if pattern in element), None)

        if idx is not None:
            score = ratings[idx]
        else:
            score = 0

        if score >= t:
            return 1
        else:
            return 0

    # "Unrelated" and "Blocked Prompt" outcomes cannot be parsed as literals, so we treat them as zeros
    except (ValueError, SyntaxError):
        return 0
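Before applying it to the whole data frame, we can sanity-check the function on a made-up score string:

# A made-up Stage 2 string with only two pillars, tested against the default threshold of 7
sample = '[{"1. Constraints on Government Powers": 8}, {"2. Absence of Corruption": 3}]'
print(extract_score(sample, 1))        # 1, since 8 >= 7
print(extract_score(sample, 2))        # 0, since 3 < 7
print(extract_score("Unrelated", 5))   # 0, unparsable strings count as unrelated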

Easy peasy, right?

We now proceed to apply the function to create the new set of binary variables:

for i in range(1, 9):
    var_name     = "Gemini_pillar_" + str(i)
    master_data[var_name] = master_data["gemini_stage_2"].apply(lambda x: extract_score(x, i))
master_data[master_data["gemini_stage_1"] == "Yes"].head(10)

article_id title_eng desc_eng content_eng gemini_stage_1 gemini_stage_2 Gemini_pillar_1 Gemini_pillar_2 Gemini_pillar_3 Gemini_pillar_4 Gemini_pillar_5 Gemini_pillar_6 Gemini_pillar_7 Gemini_pillar_8
5 7abd533276ada8c99a4611a95fa6056b The “Wild West” of Nantes in May: serial shoot... Around ten episodes of gunfire left one dead a... Le Figaro Nantes Bloody month of May in Nantes... Yes [{"1. Constraints on Government Powers": 8}, {... 1 1 0 1 1 0 0 1
6 2e51e03074e70136dc47cae1790bd721 End of the Bundeswehr mission in Afghanistan: ... The current federal government wanted to use “... The current federal government wanted to use “... Yes [{"1. Constraints on Government Powers": 8}, {... 1 1 0 1 0 0 0 0
8 d1adbdd16c6f589859b3f0f4bb1bb7ed Danko overshadowed the Voice. The nominations ... As for the relations between Smer, Hlas and SN... The leader of Hlas Peter Pellegrini presented ... Yes [{"1. Constraints on Government Powers": 7}, {... 1 0 0 1 0 0 0 0
10 6623a54bdb6452ede407e47779ae0f28 An analysis by Ulrich Reitz - Ban the AfD? An ... The Federal Minister of the Interior and her h... Comments Email Share More Twitter Print Feedba... Yes [{"1. Constraints on Government Powers": 9}, {... 1 1 0 1 0 0 0 0
13 1bf6ebbd3bad47afe77b0967f19b2a48 That's why King Matthias shut down his uncle Contrary to expectations, Mátyás turned out to... Since Mátyás was a minor when László Hunyadi w... Yes [{"1. Constraints on Government Powers": 10}, ... 1 0 0 0 0 0 0 1
16 9563949a0c0bfff0105048ddcec42c63 A man killed his wife and then committed suici... A man killed his wife and then tried to kill h... On the morning of October 16, 2023, a 66-year-... Yes [{"1. Constraints on Government Powers": 0}, {... 0 0 0 0 0 0 0 1
19 2c886c2fc437d97bcf6efc783c6135e8 German-Polish border: Federal police start new... The federal police are gradually enforcing the... The federal police are gradually enforcing the... Yes [{"1. Constraints on Government Powers": 8}, {... 1 1 0 0 1 1 0 0
23 fa07a7d9886fbb61bf4ed851f7a537aa Controversial decision: Naming the curator for... A heated argument is raging over the naming of... A heated argument has broken out in the Turkis... Yes [{"1. Constraints on Government Powers": 8}, {... 1 0 1 1 0 0 0 0
41 6e3b6913c88b05f4e55572fe2302f72e An elderly woman got to know Pasi, and a large... The district court of Varsinais Suomen believe... A woman in her seventies slipped in the yard o... Yes [{"1. Constraints on Government Powers": 8}, {... 1 1 0 1 0 0 1 1
45 f5f879f7c5fa7c97cbc66ed65a974cfc TAP. Leader of the Liberal Initiative accuses ... Rui Rocha highlighted that, during the commiss... This Monday, the president of IL accused the p... Yes [{"1. Constraints on Government Powers": 8}, {... 1 1 1 1 0 0 0 0

master_data.iloc[:,6:].apply(sum)
Gemini_pillar_1    39
Gemini_pillar_2    19
Gemini_pillar_3     8
Gemini_pillar_4    28
Gemini_pillar_5    12
Gemini_pillar_6     3
Gemini_pillar_7    10
Gemini_pillar_8    20
dtype: int64

According to our results, 39 articles were classified as related to Pillar 1, while only three were classified as related to Pillar 6. There might be better ways to do what I just did (if you happen to know one, just email me), but this is how I am proceeding with the classification stage in our exercise. And that is it, my dear readers.

Bis bald und viel Spaß!!
