Sports Article Generation with HuggingFace’s GPT-2 language generation models

Natural Language Processing (NLP) has extended the applications of Machine Learning to an entirely new dimension. Who would have thought that statistical models could be applied to the text we see every day to generate business insights and predictions? The past two years in particular have brought a surge of interest that NLP had never witnessed before.

Among the methodologies that have recently emerged in NLP, language generation models have attracted tremendous interest. The auto-regressive design of OpenAI's GPT-2 model enables it to generate new sequences of text that closely resemble what a human mind would produce. These transformer-based neural networks show real promise in producing lengthy passages of text that are convincingly human.

In this post, we look at how HuggingFace's GPT-2 language generation models can be used to write/generate sports articles. To cater to this computationally intensive task, we will use a GPU instance from the spell.ml MLOps platform.

Getting started with Spell

As discussed above, language generation models are computationally expensive, and ordinary CPU-only machines cannot handle them in reasonable time. To tackle this, we make use of Spell's GPU-backed Jupyter notebooks. Spell is a powerful MLOps platform for machine learning and deep learning. It takes care of the infrastructure, enabling developers and enterprises to focus solely on the easy, fast and organized execution of their machine learning models.

Getting set up with Spell is easy. Just visit https://spell.ml/ and create a new account. Every new user at Spell gets $10 of free usage credit.

In this exercise, we will be using the Sports Article dataset from UCI's Machine Learning Repository. It contains 1000 text files, each holding one sports article. To upload the text files, log in to Spell from your terminal or command prompt, navigate to the parent folder of the unzipped files, and run spell upload. This uploads the text files to the Resources section of Spell. Learn more about uploading files to Spell Resources here.

Next, let's create our Jupyter workspace. To open a Jupyter notebook, log in to the Spell web console and click on Workspaces > Create Workspace. Give the new workspace a name of your choice, and click on Continue.

In the next screen, Spell gives you multiple options to define the environment you want to run your code in. For example, under Machine Type, you can choose from a variety of CPU and GPU options according to your use and budget. Further, you can choose the framework, the environment variables, and the libraries to install before your Jupyter is set up. 

For this project, let's choose V100 under Machine Type and Notebooks under Jupyter.

In the next screen, click on Start Server. Once the server is up, we get a Jupyter environment just like the one on our local machines. Click on New > Python3.

Show me the code

Let’s install transformers from HuggingFace and load the GPT-2 model. 

# install transformers from source along with a compatible TensorFlow
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q tensorflow==2.1

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# load the pretrained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

These two objects let us use the pretrained GPT-2 model as-is. With GPT-2, and transformers in general, we specify an input string/phrase that essentially becomes the beginning of the article; the model then predicts the next words so that they stay coherent with the initial string.

Let's try to understand the different decoding algorithms and how GPT-2 is able to come up with the most human-like passages of text.

Let's start with greedy search, one of the simplest word-prediction methods. Given the initial string, the algorithm greedily picks the most probable next word. Using the new string (the initial string plus the predicted word), it predicts the following word, and the process repeats until we have the desired number of words. The drawback of this method is that words start repeating after a few lines of text, because greedy search misses high-probability words hidden behind low-probability ones.
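
As a quick illustration, here is what greedy decoding looks like with the model and tokenizer we just loaded (a minimal sketch; the prompt and max_length are illustrative choices, not prescribed values). With do_sample left at its default of False and no beams, generate() picks the single most probable next word at every step:

# greedy search: with do_sample=False (the default) and num_beams=1,
# generate() always takes the most probable next word
input_ids = tokenizer.encode('Manchester City agree deal to sell Leroy Sane', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Running this repeatedly gives the same text every time, and longer outputs quickly start looping over the same phrases.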

Beam search mitigates this by keeping a predefined number of hypotheses alive at each step and eventually choosing the hypothesis with the highest overall probability. After several experiments, however, it was found that beam search still suffers from repetitiveness.
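
With the same input_ids, beam search is a small change to the generate() call (a sketch; num_beams=5 and the no_repeat_ngram_size band-aid are illustrative settings, not values from this article):

# beam search: keep 5 candidate continuations at every step and return
# the one with the highest overall probability; forbidding repeated
# 2-grams is a common patch for the repetition problem
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))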

One of the best approaches to language generation is sampling. Sampling-based generation is not deterministic, since it randomly picks the next word according to the model's conditional probability distribution. It was observed, however, that because of this randomness, sampling sometimes produces passages of text that do not sound human.

A trick used to solve this problem is to sharpen the distribution over the next word given the previous i words. While sharpening, we still draw random samples, but we increase the likelihood of high-probability words getting picked and decrease the likelihood of low-probability words getting picked.
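
In the transformers library this sharpening is exposed through the temperature argument of generate(), where values below 1.0 sharpen the distribution (a minimal sketch; the parameter values are illustrative):

# pure sampling with a sharpened distribution: top_k=0 switches off the
# top-k filter so we sample from the full vocabulary, and temperature < 1.0
# boosts likely words while suppressing unlikely ones
tf.random.set_seed(0)
sharp_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)
print(tokenizer.decode(sharp_output[0], skip_special_tokens=True))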

Another refinement was the introduction of Top-K sampling, where only the K most likely next words are kept and the probability mass is redistributed among them. This simple yet powerful concept was incorporated in the GPT-2 model and is one of the reasons for its success. Yet another addition was nucleus (Top-p) sampling. Instead of sampling only from the K most likely words, the model samples from the smallest possible set of words whose cumulative probability exceeds a predefined threshold p. This ensures that the size of the candidate set can dynamically grow and shrink according to the next word's probability distribution.
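
To see nucleus sampling in isolation before we combine it with Top-K below (a sketch; top_k=0 turns the top-k filter off so only the top_p threshold applies, and 0.92 is an illustrative value):

# nucleus (top-p) sampling: sample only from the smallest set of words
# whose cumulative probability exceeds 0.92; the candidate set grows and
# shrinks with the shape of the next-word distribution
tf.random.set_seed(0)
nucleus_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    top_p=0.92
)
print(tokenizer.decode(nucleus_output[0], skip_special_tokens=True))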

Having understood its inner workings at a high level, let's dive into the working and performance of the GPT-2 model. Note that at this point we are still using the GPT-2 model as-is, not the sports data we downloaded. We'll use that data when we fine-tune the model in the next section.

The code below demonstrates the sampling technique incorporated by GPT-2. input_ids refers to the initial string given to the model. Setting do_sample to True tells the model to use sampling. max_length corresponds to the desired length of the article, while top_k and top_p correspond to the K words and the probability p respectively. Finally, we set num_return_sequences to 2, which generates two different passages from the same initial string and gives us the option to choose the output we like more.

# initial string
input_ids = tokenizer.encode('Manchester City agree deal to sell Leroy Sane', return_tensors='tf')

# set seed to reproduce results. Feel free to change the seed to get different results
tf.random.set_seed(0)

# sample with top_k = 50, top_p = 0.45 and num_return_sequences = 2
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=50,
    top_p=0.45,
    num_return_sequences=2
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Here’s the output:

0: Manchester City agree deal to sell Leroy Sane

The Gunners are ready to sign Leroy Sane, who has been on loan at Tottenham for the past three seasons, from Chelsea.

The 21-year-old, who is in his first season at the club, has scored five goals in 14 games for the Blues this season.

The former Arsenal and Chelsea striker has been a target for Chelsea since he joined from Southampton in January 2013.

The deal is


1: Manchester City agree deal to sell Leroy Sane

Manchester City have agreed a £30million deal to sell Leroy Sane to Manchester United for £30million.

The move was confirmed by City sources.

Sane, 24, has scored nine goals in 20 Premier League appearances for the club since joining from Manchester United in January 2014.

He has scored seven goals in 16 Premier League appearances for United since joining from Manchester United in January 2014.

Having looked at the output, we can say that GPT-2 has been able to put together a cohesive piece of text. However, the factual statements it generates lack accuracy, and the passage as a whole does not read like a real sports article. To address these issues, we train and fine-tune GPT-2 specifically on sports articles instead of using it as-is, using the Sports Article dataset we uploaded earlier.

Our first job is to collate the articles into a single text file. To do that, start a new Jupyter notebook, click on the Files tab, and then click on Add mount. Go to uploads and select the folder you uploaded earlier.

Come back to the notebook and execute the code below to read the text files and collate them.

import os
from os import listdir
from os.path import isfile, join

mypath = 'Raw data/'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

# read every .txt file and concatenate the articles into one string
article_list = ''
for fname in onlyfiles:
    if fname.endswith('.txt'):
        try:
            with open(join(mypath, fname), 'r') as file:
                data = file.read()
            article_list = article_list + '\n' + data
        except (OSError, UnicodeDecodeError):
            # skip files that cannot be read or decoded
            pass

Once this is done, we use the printable constant from the string module to filter out all characters that are not printable ASCII.

import string
article_list_str = ''.join(filter(lambda x: x in string.printable, article_list))

Once the data is in the desired format, let's move on to building the model. We install the transformers package and import the required packages.

!pip install transformers

import logging
import os
import pickle
import random

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer, PreTrainedTokenizer

logger = logging.getLogger(__name__)

Next, we define a class, SportsData, that tokenizes the collated sports text and splits it into fixed-size blocks of token IDs for fine-tuning.

class SportsData(Dataset):
    def __init__(
        self,
        tokenizer: PreTrainedTokenizer,
        #file_path: str,
        block_size=512,
        overwrite_cache=False,
    ):
        #assert os.path.isfile(file_path)

        block_size = block_size - (
            tokenizer.max_len - tokenizer.max_len_single_sentence
        )

        # change if args are added at later point
        cached_features_file = os.path.join(
           "gpt2" + "_" + str(block_size) + "_file.txt" 
        )

        if os.path.exists(cached_features_file) and not overwrite_cache:
            logger.info(
                f"Loading features from your cached file {cached_features_file}"
            )
            with open(cached_features_file, "rb") as cache:
                self.examples = pickle.load(cache)
                logger.debug("Loaded examples from cache")
        else:
            logger.info(f"Creating features from file")

            self.examples = []

            text = article_list_str
            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

            for i in range(0, len(tokenized_text) - block_size + 1, block_size):
                self.examples.append(
                    tokenizer.build_inputs_with_special_tokens(
                        tokenized_text[i : i + block_size]
                    )
                )

            logger.info(f"Saving features into cached file {cached_features_file}")
            with open(cached_features_file, "wb") as cache:
                
                pickle.dump(self.examples, cache, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)

Finally, we train the custom model and save it.

# use a GPU if one is available
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
model = model.to(device)

dataset = SportsData(tokenizer=tokenizer)
article_loader = DataLoader(dataset, batch_size=1, shuffle=True)

BATCH_SIZE = 1
EPOCHS = 1
LEARNING_RATE = 0.0002
WARMUP_STEPS = 5000

from transformers import AdamW, get_linear_schedule_with_warmup

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)
script_count = 0
sum_loss = 0.0
batch_count = 0

for epoch in range(EPOCHS):
    print(f"EPOCH {epoch} started" + '=' * 30)
    for idx, script in enumerate(article_loader):
        outputs = model(script.to(device), labels=script.to(device))
        loss, logits = outputs[:2]
        loss.backward()
        sum_loss = sum_loss + loss.detach().data
                       
        script_count = script_count + 1
        if script_count == BATCH_SIZE:
            script_count = 0    
            batch_count += 1
            optimizer.step()
            scheduler.step() 
            optimizer.zero_grad()
            model.zero_grad()
            
        if batch_count == 200:
            model.eval()
            print(f"sum loss {sum_loss}")
            sample_outputs = model.generate(
                                    bos_token_id=random.randint(1,30000),
                                    do_sample=True,   
                                    top_k=50, 
                                    max_length = 1000,
                                    top_p=0.95, 
                                    num_return_sequences=1
                                )

            print("Output:\n" + 100 * '-')
            for i, sample_output in enumerate(sample_outputs):
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
            
            batch_count = 0
            sum_loss = 0.0
            model.train()

# save the fine-tuned model, its configuration, and the vocabulary
output_dir = 'Raw data/'

from transformers import WEIGHTS_NAME, CONFIG_NAME
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

torch.save(model.state_dict(), output_model_file)
model.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)

Now that the fine-tuned model is ready, we load it and test it against the same input string we used previously.

model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)

input_ids = tokenizer.encode('Manchester City agree deal to sell Leroy Sane', return_tensors='pt')

sample_outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,
    max_length=100,
    top_k=50,
    top_p=0.85,
    num_return_sequences=1
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
      print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Here's the output:

Output:
----------------------------------------------------------------------------------------------------
0: Manchester City agree deal to sell Leroy Sane to Liverpool

Leroy Sane was among three players who were sold this summer and Liverpool boss Brendan Rodgers admitted he felt the need to replace the former Manchester City winger.

"We sold four players last year and I know I had to get another player in to improve our squad," Rodgers told Sky Sports News HQ.

"We had to sell players and a few of them we did but it was Leroy Sane.

We observe that this passage not only sounds human, but also flows much better. The manager quotes especially give the passage a realistic feel and make it read very much like an actual hand-written article.

Note that the facts are not completely correct, but they can be corrected quickly. What matters is that we have automated the writing itself: just run the code, edit whatever factual discrepancies remain, and voila, you're done. Increasing max_length would give us even lengthier pieces of text.
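
For instance, a sketch under the same setup as the call above (the value 500 is illustrative):

# same call as above, but asking for a much longer article
long_outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,
    max_length=500,
    top_k=50,
    top_p=0.85,
    num_return_sequences=1
)
print(tokenizer.decode(long_outputs[0], skip_special_tokens=True))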

Ready to Get Started?

Create an account in minutes or connect with our team to learn how Spell can accelerate your business.