Using GPT and Gemini to classify news articles

A step-by-step guide on how to use the OpenAI and GoogleAI official APIs

By Carlos A. Toruño P.

January 28, 2024

In my previous blog post, I talked about Large Language Models (LLMs) and how you can use them to perform specific tasks, as long as you pay special attention to how you phrase your instructions to the model. However, we did all of this using ChatGPT, which, technically speaking, is an app, not an LLM. If you want to incorporate the power of LLMs into your own programming workflow, you will have to access them through their official APIs. In this blog post, I will go step by step over how to incorporate GPT and Gemini models into your own framework in order to solve specific tasks.

The task at hand is fairly simple. First, we want to classify news articles into two groups: those related to the Rule of Law, and those unrelated to it. Second, for those articles that are related to the Rule of Law, we want to see how related they are to each one of the pillars of the Rule of Law. If you are a bit lost or feel unfamiliar with the theoretical framework, don’t worry… so is the model. Therefore, I will be providing context along the road.

Let’s see the data we will be working with.

Loading the data

I will be working with a subset of news articles from European newspapers that we downloaded using a News API. For more information on how we were able to get this data, you can check this blog post. We begin by reading our data using the pandas Python library as follows:

import pandas as pd

# Read the compressed parquet file into a data frame and preview it
master = pd.read_parquet("data_subset.parquet.gzip")
master.head(5)

country journal asspillar language article_id title link keywords creator video_url ... pubDate image_url source_id source_priority category language_id compiler title_trans description_trans content_trans
2302 [austria] https://www.vn.at/ order_and_security german 8535537712990c01388421e4aa6c8247 Kein Ende des Kriegs in Israel in Sicht https://epaper.vn.at/titelblatt/2023/10/09/kei... [Titelblatt] [importuser] None ... 2023-10-09 20:45:15 None vn 1083701.0 [top] de carlos No end to the war in Israel in sight Jerusalem The EU is suspending all payments to... Israel orders total blockade of the Gaza Strip...
2307 [austria] https://www.vn.at/ order_and_security german c09c8b1761eed395062e595869472551 Bewährungsprobe für Markus Söder https://epaper.vn.at/politik/2023/10/06/bewaeh... [Politik] [importuser] None ... 2023-10-06 20:49:05 None vn 1083701.0 [politics] de carlos Test for Markus Söder Election in Bavaria on Sunday with some open q... The old Prime Minister will also be the new on...
2313 [austria] https://www.vn.at/ order_and_security german b1783608728593e7d811d95e1b4f263a Bilder des Tages https://epaper.vn.at/politik/2023/10/02/bilder... [Politik] [importuser] None ... 2023-10-02 20:44:43 None vn 1083701.0 [politics] de carlos Pictures of the day Translation through API failed. Reason: expect... The 2018 Nobel Peace Prize winner Denis Mukweg...
2326 [austria] https://www.vn.at/ order_and_security german ed6711a119adb88769cda28ddd6c9b43 Diebe stehlen 20 Tonnen Äpfel von Obstplantage https://epaper.vn.at/welt/2023/09/24/diebe-ste... [Welt] [importuser] None ... 2023-09-24 20:41:08 None vn 1083701.0 [top] de carlos Thieves steal 20 tons of apples from orchard Dieterskirchen thieves stole around 20 tons of... Dieterskirchen thieves stole around 20 tons of...
2328 [austria] https://www.vn.at/ order_and_security german baa36e05329fb8381a1d2c10f85b1134 Politik in Kürze https://epaper.vn.at/politik/2023/09/21/politi... [Politik] [importuser] None ... 2023-09-21 20:49:39 None vn 1083701.0 [politics] de carlos Politics in brief hangzhou Syrian President Bashar al-Assad is v... The Syrian ruler traveled to Hangzhou. Sana/AF...

The data we will be using is a small subset of 100 news articles taken from a much bigger data file that contains information on more than 32,000 news articles from European newspapers. As you might have noticed, the data is stored as a parquet file. For huge quantities of information, Apache Parquet is a more efficient format than JSON or CSV in terms of storage, memory usage, and reading speed. If you want to learn more about the advantages of human non-readable formats, I suggest you read this article from Towards Data Science.
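As a quick illustration, this is all it takes to write the same data frame to both formats and compare their sizes on disk (a minimal sketch; it assumes pandas with the pyarrow engine installed, and the file names are just for illustration):

import os

# Write the same data frame in both formats (file names are illustrative)
master.to_csv("data_subset.csv", index = False)
master.to_parquet("data_subset.parquet.gzip", compression = "gzip")

# Compare their sizes on disk, in kilobytes
print(os.path.getsize("data_subset.csv") / 1024)
print(os.path.getsize("data_subset.parquet.gzip") / 1024)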

For the purpose of this blog post, I will only work with 5 specific news articles from the above subset and save them as a pandas data frame called extract. Let’s take a look at the wonderful winners of this lotto:

idx = [2307, 2357, 2383, 2439, 2951]
extract = master.loc[idx, ["country", "journal", "article_id", "link", "title_trans", "description_trans", "content_trans"]]
extract

country journal article_id link title_trans description_trans content_trans
2307 [austria] https://www.vn.at/ c09c8b1761eed395062e595869472551 https://epaper.vn.at/politik/2023/10/06/bewaeh... Test for Markus Söder Election in Bavaria on Sunday with some open q... The old Prime Minister will also be the new on...
2357 [austria] https://www.salzburg24.at/ edfa1c27e299213b2e2bf4325361a0ba https://www.salzburg24.at/news/oesterreich/dre... Three dead in fire in LK Mödling: investigatio... After three patients died in a fire at the Möd... 0 Published: 18. October 2023 3:33 p.m. After ...
2383 [austria] https://www.salzburg24.at/ 27abe7fe74f897c0192e5e59188a9d1f https://www.salzburg24.at/news/welt/klima-prot... Climate protest in The Hague: 2,400 arrests In the Dutch city of The Hague, police broke u... 0 Wij verafschuwen het geweld the word was use...
2439 [austria] https://www.salzburg24.at/ 80cea065a92d0e20c0b260a612fd1b87 https://www.salzburg24.at/sport/fussball/oeste... Failed qualifying dress rehearsal for Austria Austria's national soccer team failed in the d... 0 Published: 07. September 2023 10:38 p.m. Aus...
2951 [belgium] https://www.lesoir.be/ 8739dea5ef18c8f95991f1b063804e1f https://www.lesoir.be/520042/article/2023-06-1... Tax reform in Belgium, an emergency for sixty ... The federal government is trying to reform tax... The big tax reform, everyone wants it, but no ...

Writing the instructions

As I mentioned before, we are going to divide the task at hand into two stages. In the first stage, we will classify articles according to their relation to the Rule of Law. For this, we will use a contextual prompt as follows:

You are an assistant with knowledge and subject-matter expertise on Rule of Law, justice, governance, global politics, social sciences, and related fields in the European Union. Your task is to carefully read a news article and determine whether it is related to the definitions of Rule of Law, Justice, and Governance that I will give you. To successfully perform this task, you should carefully read the definitions that I will provide, and use the knowledge of global politics, law, and social sciences that you have.

Additionally, we will place the supporting information (the key concepts), the news article, and the set of instructions in a separate prompt that we are going to call instructions_stage_1. First, I will provide some key concepts to the model so it doesn’t base its answers entirely on the information that was used during its training. This gives us some control over the predicted tokens the model uses to provide an answer.

Key macro concepts

Here are the definitions of Rule of Law, Justice, and Governance:

The term Rule of Law refers to …

We define Justice as …

Finally, we define Governance as …

In the same prompt, after passing the key concepts, I will pass the headline, summary, and full content of the news article:

Now, given the following news article:

News title: {headline}

News summary: {summary}

News body: {body}

Finally, I will provide some specific instructions telling the model what to do with the information I just passed and I will ask the model to structure its answer following a specific JSON format:

Please analyze the news article and its context, and answer the following question:

  1. Based on the definitions that I just provided above, is this news article narrating events related to the Rule of Law, Justice, or Governance?

Use the following JSON format to answer:

{{

rule_of_law_related: answer to question number 1. If the news article is not related to the Rule of Law, Justice, or Governance, answer with “Unrelated”; otherwise, answer with “Yes”.

}}

I will make use of Markdown syntax such as headers (#) and lists to structure and pass the information to the model. You can check the FULL instruction prompt that I’m using for the first stage here.
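To give you an idea of how these pieces fit together, here is a heavily condensed sketch of what the template string could look like (the real prompt is much longer; note that the literal JSON braces are doubled so that the format() method only fills the {headline}, {summary}, and {body} placeholders):

instructions_stage_1 = """
# Key concepts
Here are the definitions of Rule of Law, Justice, and Governance: ...

# News article
News title: {headline}
News summary: {summary}
News body: {body}

# Instructions
Please analyze the news article and answer the question using the
following JSON format:
{{
    "rule_of_law_related": "Yes" or "Unrelated"
}}
"""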

Accessing LLMs through their API endpoints

Now that we have loaded the data and have written (and tested) our prompts, we can proceed to use an LLM to perform the task for us. To be able to send requests to an API, we need an API key. In this guide, we will be using the GPT model from OpenAI and the Gemini model from GoogleAI. Therefore, in order to follow this guide, you will have to create an account and an API key with each of these two providers. Right now, January 2024, accessing the Gemini Pro model through its official API is free and open to the public, given that Google is introducing the product to the market.

In a previous post, I talked about how to manage your API keys through environment variables. Just in case, I leave you the video explaining how to do this using the dotenv Python library.
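Just as a reference, the .env file is nothing more than a plain-text file sitting at the root of your project, with one key per line (the values below are, of course, placeholders):

openai_key="sk-..."
GOOGLE_API_KEY="AIza..."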

We start by loading our API keys as follows:

import os
from dotenv import load_dotenv

# Loading API KEY from environment
load_dotenv()
OpenAI_key   = os.getenv("openai_key")
GoogleAI_key = os.getenv("GOOGLE_API_KEY") 

Once we have our API keys loaded as environment variables, we can use the official OpenAI Python library to send calls and access their models. OpenAI has different model families depending on the capability you are interested in: DALL·E is a family of models focused on image generation, Whisper is a family of models focused on speech recognition, and GPT is their signature family focused on text generation. For a task such as text classification, what we need is a text model. Therefore, we will be using the GPT-4-Turbo model to classify our news articles.

First, let’s focus only on the news article about the elections in Bavaria, “Test for Markus Söder”, with row index 2307 in our extract data frame:

headline = extract.loc[2307, "title_trans"]
summary  = extract.loc[2307, "description_trans"]
body     = extract.loc[2307, "content_trans"]

We pass the headline, summary, and content of the news article to the instructions_stage_1 prompt using the format() method:

instructions_stage_1_2307 = instructions_stage_1.format(headline = headline, 
                                                        summary  = summary, 
                                                        body     = body)

We will use the Chat Completions endpoint to pass the information and classify the article at hand as RELATED or UNRELATED to our definition of the Rule of Law. The Chat Completions API takes a list of messages as input and generates an output. However, each of these messages needs to be assigned a specific role.

According to the official OpenAI documentation, there are three roles available: “system”, “user”, and “assistant”. The system message sets the general behavior of the model across the conversation. The user messages provide requests or comments for the assistant to respond to. Assistant messages store previous assistant responses, but you can also pass example responses to signal a desired behavior for the model.
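For illustration purposes only, a message list exercising the three roles could look like this (the contents are placeholders, not the actual prompts we will use):

messages = [
    # Sets the general behavior of the model
    {"role": "system",    "content": "You are an assistant that classifies news articles."},
    # A previous exchange passed as an example of the desired behavior
    {"role": "user",      "content": "Classify the following article: ..."},
    {"role": "assistant", "content": '{"rule_of_law_related": "Yes"}'},
    # The actual request we want answered
    {"role": "user",      "content": "Classify the following article: ..."}
]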

Having this in mind, we will pass our context_stage_1 prompt as the system message, and our instructions_stage_1 prompt as a user message. This information is passed to the API wrapper as individual Python dictionaries within a list.

We will also require the model to provide its answer in JSON format by setting the response_format argument to {"type": "json_object"}.

from openai import OpenAI

client = OpenAI(api_key = OpenAI_key)

completion = client.chat.completions.create(
    model = "gpt-4-0125-preview",
    messages = [
        {"role": "system", "content": context_stage_1},
        {"role": "user",   "content": instructions_stage_1_2307}
    ],
    response_format = {"type": "json_object"},
    temperature = 0.2
)

We can take a look at the answer provided by the language model by printing the choices returned in the response. By default, the response object will contain one single choice, but you can modify this by adding the respective arguments to the call (see the sketch after the output below). As we can observe, the model has classified the news article as RELATED to the Rule of Law, which, to be honest, makes sense because the article is talking about elections in Bavaria, Germany.

If you are wondering about the temperature parameter, it is a numeric value (ranging from 0 to 2 in the Chat Completions API) that signals how deterministic or random the assistant’s answer should be. Lower values make the answer more deterministic and focused, while higher values increase the randomness of the output.

print(completion.choices[0].message.content)
{
    "rule_of_law_related": "Yes"
}
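In case you want the model to produce several candidate answers to compare, the call accepts an n argument (a sketch, assuming the same client and prompts as above; note that more choices consume more output tokens):

completion = client.chat.completions.create(
    model = "gpt-4-0125-preview",
    messages = [
        {"role": "system", "content": context_stage_1},
        {"role": "user",   "content": instructions_stage_1_2307}
    ],
    response_format = {"type": "json_object"},
    temperature = 0.2,
    n = 3  # Generate three choices instead of one
)
for choice in completion.choices:
    print(choice.message.content)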

We have successfully completed the first stage. Now that we know that the article is related to the Rule of Law, we can proceed to the second stage of our task: rating from zero to ten how related this article is to each one of the pillars of the Rule of Law. For this, we will make some adjustments to our prompts.

For example, we modify our context prompt to incorporate the new objective:

You are an assistant with knowledge and expertise in global politics, social sciences, rule of law, and related fields. Your task is to assist me in classifying news articles according to which pillar of the Rule of Law they belong to. To successfully accomplish this task, you will have to carefully read a news article and the definitions of each pillar that I will give you, as well as use the knowledge of global politics, social sciences, and law that you have. Once you have read the news article, you will proceed to determine the extent to which the events described in the news article are related to each pillar.

For our instructions prompt, we will be using a rather large text that spans over 4,000 words; you can check the full text of the prompt here. Having these new inputs, we can send a new request in the same way we did before:

# Introducing the news article into the instruction prompt
instructions_stage_2_2307 = instructions_stage_2.format(headline = headline, 
                                                        summary  = summary, 
                                                        body     = body)

# Making a request to the GPT Chat Completions API
completion = client.chat.completions.create(
    model = "gpt-4-0125-preview",
    messages = [
        {"role": "system", "content": context_stage_2},
        {"role": "user",   "content": instructions_stage_2_2307}
    ],
    response_format = {"type": "json_object"},
    temperature = 0.2
)

# Printing the output
print(completion.choices[0].message.content)
{
    "pillars_relation": [
        {"1. Constraints on Government Powers": 8},
        {"2. Absence of Corruption": 5},
        {"3. Open Government": 7},
        {"4. Fundamental Rights": 2},
        {"5. Order and Security": 1},
        {"6. Regulatory Enforcement and Enabling Business Environment": 1},
        {"7. Civil Justice": 1},
        {"8. Criminal Justice": 1}
    ]
}

Beautiful! Subarashī!

We can see that the model thinks that the news article is highly related to pillar 1 “Constraints on Government Powers” and pillar 3 “Open Government”; somewhat related to pillar 2 “Absence of Corruption”; and largely unrelated to all the other pillars. Honestly… that was amazing, because those are quite similar to the ratings that I (allegedly an expert on the topic) would give to the article. The model only needed some context and instructions to start generating text. Behind the curtains, the model is just constructing sentences by predicting the most probable next token, basing its predictions on the inputs that you just passed. Again, isn’t that amazing?

Let’s continue with our journey. We can do the exact same task using another large language model that was just released by Google in December 2023: Gemini Pro. Unlike GPT-4-Turbo, Google has granted free access to Gemini Pro’s capabilities through its official API. In order to use this model, we will need to adjust a few things:

  • First, the Google API only accepts two roles in the list of messages: “user” and “model”, so we will have to turn the context prompt into a user message.
  • Second, the API only accepts multi-turn conversations. That means that, for our specific case in which we pass two user messages, we also need to provide a model answer to the first message. A short text like “Sure, I can assist you in classifying news articles according to the pillars of the Rule of Law.” will be enough (see the snippet after this list).
  • Third, we need to set up some safety settings to avoid having our calls rejected. Please check the official Python documentation for more information.
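For completeness, the acknowledgement messages used in the snippets below (model_answer_stage_1 and model_answer_stage_2) can be defined as simple strings; the exact wording is up to you:

# Illustrative acknowledgements for the multi-turn history
model_answer_stage_1 = ("Sure, I can assist you in determining whether news articles "
                        "are related to the Rule of Law, Justice, and Governance.")
model_answer_stage_2 = ("Sure, I can assist you in classifying news articles "
                        "according to the pillars of the Rule of Law.")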

Let’s begin by authenticating and setting up a channel through the Generative Model endpoint as follows:

import google.generativeai as genai

# Authenticating our API key
genai.configure(api_key = GoogleAI_key)

# Set up the model config
generation_config = {
  "temperature": 0.2,
  "top_p": 1,
  "top_k": 1,
  "max_output_tokens": 1000,
}

# Safety presettings
safety_settings = [
  {
    "category": "HARM_CATEGORY_HARASSMENT",
    "threshold": "BLOCK_ONLY_HIGH"
  },
  {
    "category": "HARM_CATEGORY_HATE_SPEECH",
    "threshold": "BLOCK_ONLY_HIGH"
  },
  {
    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "threshold": "BLOCK_ONLY_HIGH"
  },
  {
    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
    "threshold": "BLOCK_ONLY_HIGH"
  },
]

# Set-up a model
model = genai.GenerativeModel(model_name        = "gemini-pro",
                              generation_config = generation_config,
                              safety_settings   = safety_settings)

Once we have set up a model, we can start a multi-turn instance to provide the information contained in context_stage_1 and instructions_stage_1. We do this by making use of the start_chat() method. In the same manner as we did when interacting with the GPT model, we need to send the context prompt and the model’s acknowledgement as part of a history of messages. This time, however, we send the instructions as a new message in the conversation and wait for the response.

Just imagine that you open your messaging app and see a prior conversation with the Gemini model in which you have already asked for a specific task. Then, you just send the instructions prompt as a new message to that conversation. Really easy, right?

# Start an instance
instance = model.start_chat(history = [
  {
    "role": "user",
    "parts": [context_stage_1]
  },
  {
    "role": "model",
    "parts": [model_answer_stage_1]
  }
])

# Sending instructions
instance.send_message(instructions_stage_1_2307)

# Previewing answer
print(instance.last.text)
```
{
    "rule_of_law_related": "Yes"
}
```

As we can observe, Gemini Pro also thinks that this news article is related to the Rule of Law. Still amazing, don’t get me wrong. Let’s now try the second stage as well and compare the ratings assigned by GPT and Gemini to the same news article:

# Start an instance
instance = model.start_chat(history = [
  {
    "role": "user",
    "parts": [context_stage_2]
  },
  {
    "role": "model",
    "parts": [model_answer_stage_2]
  }
])

# Sending instructions
instance.send_message(instructions_stage_2_2307)

# Previewing answer
print(instance.last.text)
```
{
    "pillars_relation": [
        {
            "1. Constraints on Government Powers": 7
        },
        {
            "2. Absence of Corruption": 5
        },
        {
            "3. Open Government": 4
        },
        {
            "4. Fundamental Rights": 6
        },
        {
            "5. Security": 3
        },
        {
            "6. Regulatory Enforcement and Enabling Business Environment": 2
        },
        {
            "7. Civil Justice": 1
        },
        {
            "8. Criminal Justice": 1
        }
    ]
}
```

As we can observe, the ratings are somewhat similar between GPT and Gemini. However, Gemini scored pillar 4 “Fundamental Rights” higher than GPT, while also giving a lower score to pillar 3 “Open Government”. Both answers are acceptable, and I can see potential reasons behind the differences in the scores, but that goes beyond the scope of this post.

We can extract the individual scores for each pillar by parsing the string content into a Python dictionary and extracting the dictionary values using a list comprehension. This way we can keep track of the scores in a more data-focused way for our news articles.

import json

# The model's answer comes wrapped in triple backticks, so we slice them
# off before parsing the JSON string into a Python dictionary
json_content  = json.loads(instance.last.text[3:-3])
pillar_scores = [list(x.values())[0] for x in json_content["pillars_relation"]]
pillar_scores
[7, 5, 4, 6, 3, 2, 1, 1]
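A word of caution: the [3:-3] slice assumes the reply is wrapped in exactly three backticks on each side. If you want something slightly more defensive, you could strip the fences explicitly (a sketch, assuming Python 3.9+ for removeprefix/removesuffix, and that the model sometimes labels the opening fence as json):

# Strip markdown fences before parsing, regardless of an optional "json" label
raw = instance.last.text.strip()
raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```")
json_content = json.loads(raw)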

Setting up a data workflow

By now, you are able to use the OpenAI and GoogleAI APIs to access their models and generate text outputs. However, we have been targeting a single news article, which gives us no advantage over using the official apps such as ChatGPT or Bard. The main purpose of accessing the models through their APIs is to be able to process large amounts of information without depending on a user interface; in other words, to access the models programmatically. Instead of running the code individually for each article, the API allows us to set up a workflow to process a whole data file.

For example, in our case, we will define a single function that will perform the task automatically for us. The function will work as follows:

  • First, it will extract the relevant information for each news article (headline, summary, and content) and format an instruction prompt for that specific article.
  • Then, it will send the instructions to the model through their respective API.
  • Finally, it will parse and process the string output sent by the model and store it as new variables in our data frame.

We will use the Gemini Pro API for this example, as follows:

def classify_article(row, stage):
    """
    A function that takes a row as an input, formats a prompt, sends 
    a conversation request to the GeminiPro API and returns the answer 
    from the model.
    """
    if stage == 1:
        instprompt = instructions_stage_1
        conprompt  = context_stage_1
        ansprompt  = model_answer_stage_1
    if stage == 2:
        instprompt = instructions_stage_2
        conprompt  = context_stage_2
        ansprompt  = model_answer_stage_2

    # Formatting prompt
    prompt = instprompt.format(headline = row["title_trans"], 
                               summary  = row["description_trans"], 
                               body     = row["content_trans"])
    
    # Start an instance
    instance = model.start_chat(history = [
    {
        "role": "user",
        "parts": [conprompt]
    },
    {
        "role": "model",
        "parts": [ansprompt]
    }
    ])
    
    # Sending instructions
    instance.send_message(prompt)

    # Parsing results
    out = json.loads(instance.last.text[3:-3])
    if stage == 1:
        val = list(out.values())[0]
        return val
    
    if stage == 2:
        if row["stage_1"] == "Yes":
            pillar_scores = [list(x.values())[0] for x in out["pillars_relation"]]
        
        else:
            pillar_scores = [0,0,0,0,0,0,0,0]

        return pillar_scores

Once we have the function defined, we can apply it row by row across a whole data frame using pandas’ apply() method. We will test it on the extract data frame with our 5 news articles.

extract["stage_1"] = extract.apply(lambda row: classify_article(row, stage=1), axis = 1)
extract["stage_2"] = extract.apply(lambda row: classify_article(row, stage=2), axis = 1)
extract.loc[:,["title_trans", "description_trans", "content_trans", "stage_1", "stage_2"]]

title_trans description_trans content_trans stage_1 stage_2
2307 Test for Markus Söder Election in Bavaria on Sunday with some open q... The old Prime Minister will also be the new on... Yes [7, 5, 4, 6, 2, 3, 2, 2]
2357 Three dead in fire in LK Mödling: investigatio... After three patients died in a fire at the Möd... 0 Published: 18. October 2023 3:33 p.m. After ... Yes [3, 2, 1, 5, 7, 1, 2, 8]
2383 Climate protest in The Hague: 2,400 arrests In the Dutch city of The Hague, police broke u... 0 Wij verafschuwen het geweld the word was use... Yes [7, 0, 0, 8, 7, 0, 0, 9]
2439 Failed qualifying dress rehearsal for Austria Austria's national soccer team failed in the d... 0 Published: 07. September 2023 10:38 p.m. Aus... Not related to Rule of Law [0, 0, 0, 0, 0, 0, 0, 0]
2951 Tax reform in Belgium, an emergency for sixty ... The federal government is trying to reform tax... The big tax reform, everyone wants it, but no ... Not related to Rule of Law [0, 0, 0, 0, 0, 0, 0, 0]

Sehr schön!
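Before wrapping up, one small tip: if you later need one column per pillar instead of a list, the stage_2 scores can be expanded into separate columns (a quick sketch; the column names are illustrative):

# Expand the lists of scores into one column per pillar
pillar_cols = [f"pillar_{i}" for i in range(1, 9)]
extract[pillar_cols] = pd.DataFrame(extract["stage_2"].tolist(), index = extract.index)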

You can now try processing and generating data with Large Language Models by extending these examples to your own data projects. As I mentioned in my previous post, there are hundreds of LLMs out there for you to play with, and some of them have their own Python library. In this situation, where setting up workflows with different providers can easily become chaotic, you don’t need to worry: the Langchain framework comes in handy.

Langchain is a framework for developing applications powered by language models. It provides tools for connecting your application with different models, managing prompt templates, and parsing outputs, among many other features. In a future blog post, I will elaborate a bit more on the basics of this framework and how it can facilitate many of the steps we implemented in this example. Until then, farewell, my dear three readers.

Finally, I would like to thank Pablo Gonzalez for his valuable mentorship and support in this project.
