NLP from Scratch with PyTorch, fastai, and HuggingFace
A technical NLP tutorial using a variety of libraries to show the different levels/layers of common NLP pipelines
- 0. Introduction
- 1. Looking at the Data [Pandas]
- 2. Tokenization and Numericalization [PyTorch]
- 3. Dataset & DataLoaders [PyTorch & fastai]
- 4. Model [PyTorch]
- 5. Training/Fitting [fastai]
- 6. Using a Language Model via AWD-LSTM [fastai]
- 7. Using a Language Model via DistilBERT [HuggingFace & PyTorch & fastai]
- 8. Conclusion
Welcome! In this blog post/notebook, we'll be looking at NLP with 3 different methods:
- From Scratch/Ground-Up, with PyTorch
- FastAI Language Model (AWD-LSTM)
- HuggingFace Transformers (DistilBERT)
All 3 methods will utilize fastai to help keep things organized and to assist with training the models, given the library's ease of use through its lovely layered API!
For this notebook, we'll be looking at the Amazon Reviews Polarity dataset! The task is to predict whether a review is of positive or negative sentiment. The original Amazon Reviews dataset contains review scores ranging from 1-5. This polarity dataset combines review scores 1-2 into the negative class, 4-5 into the positive class, and ignores/drops review scores of 3!
from fastai.text.all import *
import pandas as pd
path = untar_data(URLs.AMAZON_REVIEWS_POLARITY)
path
path.ls()
Let's go ahead and take a look at our two DataFrames: `train_df` and `valid_df`.
train_df = pd.read_csv(path/'train.csv', names=['label', 'title', 'text'], nrows=40000)
valid_df = pd.read_csv(path/'test.csv', names=['label', 'title', 'text'], nrows=2000)
train_df.head()
len(train_df), len(valid_df)
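As a quick sanity check (my own addition, not in the original notebook), we can confirm that only the two polarity labels are present:
# Labels should only be 1 (negative) and 2 (positive); 3-star reviews were dropped upstream
train_df['label'].value_counts()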
We now want to first tokenize our inputs, then numericalize them using a vocab. Quick recap of these terms (a tiny toy example follows below):
- Tokenization = The process of converting an input string into "pieces"
  - These pieces can be whole words, subwords, or even characters
- Numericalization = The process of converting a token into a numeric representation (e.g. token -> number)
  - This is done through the use (and creation of) a vocab
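Here's a tiny toy example of both steps, using a made-up vocab purely for illustration:
# Toy illustration only (made-up vocab, not the one we build below)
toy_vocab = {'<unk>': 0, 'this': 1, 'movie': 2, 'was': 3, 'great': 4}
toy_tokens = 'this movie was great'.split()                  # tokenization (split on spaces)
toy_numbers = [toy_vocab.get(tok, 0) for tok in toy_tokens]  # numericalization -> [1, 2, 3, 4]
toy_numbers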
There are many fancy tokenizers out there, but since we're first doing things from scratch, we'll go ahead and use a simple `basic_english` tokenizer from `torchtext` and split on spaces.
sample_text = train_df['text'][0]
sample_text
import torch
import torchtext
from torchtext.data import get_tokenizer
tokenizer = get_tokenizer("basic_english")
`L` is basically Python's `list`, but it has some convenient properties, such as displaying the number of elements; additionally, it doesn't spam your screen with output if the list is too long!
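For example (a quick illustration of my own, not from the original post):
from fastcore.foundation import L  # already available via the fastai.text.all import above
L(range(1000))  # shows the element count and truncates the display, e.g. (#1000) [0,1,2,...]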
tokens = L(tokenizer(sample_text))
tokens
Notice how this simple tokenizer keeps a token like `Dudeeeee` as-is. Other tokenizers can use rules to better handle splitting of big words through subword tokenization, and can handle numbers like prices as well. This would help with optimizing the vocab's embedding table as well as reducing the number of `<unk>` tokens.
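For comparison, here's roughly how a subword tokenizer (the DistilBERT one we use later in this post) would handle text like this; the exact splits depend on that tokenizer's learned vocab, so treat this as a sketch:
# Sketch: subword tokenization for comparison (uses the transformers library from later in this post)
from transformers import AutoTokenizer
subword_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
subword_tok.tokenize("Dudeeeee this game costs $9.99")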
Next we'll need to check how many tokens there are in our dataset, and keep the frequent ones as part of our vocab.
from collections import Counter
token_counter = Counter()
for sample_text in train_df['text']:
tokens = tokenizer(sample_text)
token_counter.update(tokens)
token_counter.most_common(n=25)
token_counter.most_common()[-25:]
len(token_counter)
token_counter['well@@'], token_counter['well']
A subword tokenizer might produce a token such as `well@@` instead of `well`; since our simple tokenizer keeps whole words, `well@@` never shows up in the counter.
Now that we have our token frequency counter, we can go ahead and make our vocab!
sorted_counter = dict(token_counter.most_common())
# Create vocab containing tokens with a minimum frequency of 20
my_vocab = torchtext.vocab.vocab(sorted_counter, min_freq=20)
# Add the unknown token, and use this by default for unknown words
unk_token = '<unk>'
my_vocab.insert_token(unk_token, 0)
my_vocab.set_default_index(0)
# Add the pad token
pad_token = '<pad>'
my_vocab.insert_token(pad_token, 1)
# Show vocab size, and examples of tokens
len(my_vocab.get_itos()), my_vocab.get_itos()[:25]
Note that we set `<unk>` as our default token for tokens that are out of our vocab!
We also created the vocab with a `min_freq` argument. This ensures that the vocab only includes high-frequency tokens; we wouldn't want to include tokens that only occur once or rarely. This brought our vocab count down from 75,889 to 7,591, a ~90% reduction!
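As a small check (my own addition), the default index means any out-of-vocab token maps straight to `<unk>`:
# '<unk>' lives at index 0, so unknown tokens numericalize to 0
my_vocab['the'], my_vocab['definitely-not-a-real-token']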
Rather than starting from scratch, we can preload GloVe embeddings into our vocabulary!
glove = torchtext.vocab.GloVe(name = '6B', dim = 100)
glove.vectors.shape
Since we're using GloVe vectors for transfer learning (by preloading our embedding), let's take a look at how many tokens can be successfully transferred from GloVe into our own vocab. Each token will have an embedding (vector) of size 100, resulting in an embedding matrix of size 7591x100.
my_vocab.vectors = glove.get_vecs_by_tokens(my_vocab.get_itos())
my_vocab.vectors.shape
By default, tokens that aren't able to transfer from GloVe into our own dataset get initialized with a vector of 0's. We can use this to count how many tokens were successfully preloaded!
tot_transferred = 0
for v in my_vocab.vectors:
    if not v.equal(torch.zeros(100)):
        tot_transferred += 1

tot_transferred, len(my_vocab)
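Just as an aside, the same count can be computed in a vectorized way (an alternative of my own, not from the original):
# Rows that aren't all zeros were successfully transferred from GloVe
int((my_vocab.vectors.abs().sum(dim=1) != 0).sum())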
my_vocab.get_itos()[3], my_vocab.vectors[3]
my_vocab.get_itos()[6555], my_vocab.vectors[6555]
Index 3 corresponds to the word `the`, and it has non-zero values in its vector since we were able to preload it from GloVe! Index 6555 corresponds to the word `eargels`, and it's all zeros, which means it wasn't part of GloVe's vocab and therefore couldn't transfer its embedding to our vocab!
Most of our vocab's tokens did get preloaded from GloVe, but apparently `stardust` wasn't one of them :O
For the tokens that didn't transfer, we can replace their all-zero vectors using `torch.randn` to create some diversity between the different token embeddings that weren't preloaded with GloVe!
for i in range(my_vocab.vectors.shape[0]):
    if my_vocab.vectors[i].equal(torch.zeros(100)):
        my_vocab.vectors[i] = torch.randn(100)
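As an aside, here's a vectorized alternative to the loop above (run it instead of the loop); scaling the random vectors by GloVe's overall standard deviation is my own assumption, just to keep them on a similar scale to the preloaded rows:
# Alternative to the loop above (run instead of it). Scaling by GloVe's std is an assumption,
# intended to keep the random rows on a similar scale to the transferred GloVe rows.
missing = my_vocab.vectors.abs().sum(dim=1) == 0
my_vocab.vectors[missing] = torch.randn(int(missing.sum()), 100) * glove.vectors.std()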
Now let's use our vocab to numericalize our tokens!
sample_text = train_df['text'][0]
sample_text
tokens = L(tokenizer(sample_text))
tokens
We can use our vocab to convert each token to its numeric representation one-by-one using a list comprehension!
numericalized_tokens = [my_vocab[token] for token in tokens]
numericalized_tokens = torch.tensor(numericalized_tokens)
numericalized_tokens
' '.join([my_vocab.get_itos()[num] for num in numericalized_tokens])
numericalized_tokens.shape
Notice that some tokens show up as `0` in their numericalized form! This corresponds to the `<unk>` token.
This example has 81 tokens, but other examples may have more or fewer. It's a good idea to cap the number of tokens and pad shorter sequences up to a fixed length. This is needed in order to batch our samples together, as samples within a batch can't vary in size!
max_tokens = 128

numericalized_tokens = [my_vocab[token] for token in tokens]

# Pad with the <pad> token (index 1), or truncate to max_tokens
if len(numericalized_tokens) < max_tokens:
    numericalized_tokens += [1] * (max_tokens - len(numericalized_tokens))
else:
    numericalized_tokens = numericalized_tokens[:max_tokens]

numericalized_tokens = torch.tensor(numericalized_tokens)
numericalized_tokens
Now that we have everything we need to tokenize and numericalize our input, let's go ahead and make a simple Dataset class
from torch import nn

class Simple_Dataset(torch.utils.data.Dataset):
    def __init__(self, df, vocab, max_tokens):
        self.df = df
        self.vocab = vocab
        self.max_length = max_tokens
        self.tokenizer = get_tokenizer("basic_english")
        # label 1 is negative sentiment and label 2 is positive sentiment
        self.label_map = {1: 0, 2: 1}

    def __len__(self):
        return len(self.df)

    def decode(self, numericalized_tokens):
        return ' '.join([self.vocab.get_itos()[num] for num in numericalized_tokens])

    def __getitem__(self, index):
        label, title, text = self.df.iloc[index]
        label = self.label_map[label]
        label = torch.tensor(label)

        # Tokenize and numericalize with the instance's own tokenizer and vocab
        tokens = self.tokenizer(text)
        numericalized_tokens = [self.vocab[token] for token in tokens]

        # Pad with the <pad> token (index 1), or truncate to max_length
        if len(numericalized_tokens) < self.max_length:
            numericalized_tokens += [1] * (self.max_length - len(numericalized_tokens))
        else:
            numericalized_tokens = numericalized_tokens[:self.max_length]

        numericalized_tokens = torch.tensor(numericalized_tokens)
        return numericalized_tokens, label
train_dataset = Simple_Dataset(train_df, vocab=my_vocab, max_tokens=128)
valid_dataset = Simple_Dataset(valid_df, vocab=my_vocab, max_tokens=128)
len(train_dataset), len(valid_dataset)
tokens, label = train_dataset[0]
tokens, label
train_dataset.decode(tokens)
We can now create our fastai `DataLoaders` (`dls`) to use for later!
train_dl = DataLoader(train_dataset, bs=32, shuffle=True)
valid_dl = DataLoader(valid_dataset, bs=32)
dls = DataLoaders(train_dl, valid_dl)
dls
Now to the model creation section! Our PyTorch model will contain the following layers/components:
- Embedding Layer: converts numericalized tokens into their embedding representation
- LSTM: processes the sequence of embeddings
- Head: Takes final feature vector of LSTM for classification prediction
class Model(nn.Module):
    def __init__(self, vocab, num_classes):
        super().__init__()
        vocab_size, emb_size = vocab.vectors.shape
        self.emb = nn.Embedding(vocab_size, emb_size, _weight=vocab.vectors)
        self.lstm = nn.LSTM(input_size=emb_size, hidden_size=64, batch_first=True, num_layers=2)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, batch_data):
        token_embs = self.emb(batch_data)
        outputs, (h_n, c_n) = self.lstm(token_embs)

        # Assuming a batch size of 32, h_n will have a shape of:
        # shape = 2, 32, 64  (num_layers, batch, hidden_size)
        last_hidden_state = h_n

        # shape = 32, 2, 64
        last_hidden_state = last_hidden_state.permute(1, 0, 2)

        # shape = 32, 128
        last_hidden_state = last_hidden_state.flatten(start_dim=1)

        logits = self.head(last_hidden_state)
        return logits
Note: if you don't pass `_weight=vocab.vectors` to the PyTorch `nn.Embedding()` constructor, you'll simply initialize your embedding with random numbers & nothing will transfer from GloVe! Feel free to comment out that bit and see how the performance drops due to the lack of transfer learning!
model = Model(my_vocab, num_classes=2)
model
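If you want to experiment, you could also freeze the pretrained embedding so the GloVe rows stay fixed during training (my own optional variation, not something done in this post):
# Optional (not done in this post): freeze the embedding layer so the GloVe rows stay fixed
# model.emb.weight.requires_grad_(False)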
Let's double check that some of our embeddings were successfully loaded from the domain-overlapping tokens from GloVe. Below is our preloaded embedding matrix!
embedding_matrix = list(model.emb.parameters())[0]
embedding_matrix
Index 3 corresponds to `'the'`:
my_vocab.vectors[3].equal(embedding_matrix[3])
total_params = 0
for p in model.parameters():
    total_params += p.numel()
total_params
Now let's go ahead and make sure we can do a forward pass through our model. Our loss function will be `CrossEntropyLoss`, as this is a classification task.
batched_data, batched_labels = train_dl.one_batch()
print(batched_data.shape, batched_labels.shape)
with torch.no_grad():
    logits = model(batched_data)
logits.shape
loss_func = nn.CrossEntropyLoss()
loss = loss_func(logits, batched_labels)
loss
Sweet! Time to use fastai to contain our `dls`, `model`, and `metrics` & assist with training using certain best practices!
learn = Learner(dls, model, loss_func=nn.CrossEntropyLoss(), metrics=[accuracy])
learn
learn.lr_find()
learn.fit_one_cycle(5, lr_max=3e-3)
Nice, but this model may have started overfitting near the end of training as it doesn't have any dropout, weight-decay, or other forms of regularization!
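If you wanted to experiment with regularization here (my own suggestions, not part of the run above), fastai's `Learner` accepts weight decay via its `wd` argument, and `nn.LSTM` supports dropout between its layers when `num_layers > 1`:
# Possible regularization tweaks (suggestions only, not what was run above):
# learn = Learner(dls, model, loss_func=nn.CrossEntropyLoss(), metrics=[accuracy], wd=0.01)
# ...and inside the Model class definition:
# self.lstm = nn.LSTM(input_size=emb_size, hidden_size=64, batch_first=True, num_layers=2, dropout=0.3)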
Using a pretrained language model for downstream tasks is also a popular and efficient technique! Fine-tuning the language model first is even better, as shown in chapter 10 of fastbook.
Here's a quick example of training a model on this dataset using fastai!
First we'll need to create our vocab as we did before!
fastai_vocab = make_vocab(token_counter)
To continue using the same subset DataFrames, we'll combine both `train_df` and `valid_df` into `combined_df`, then let fastai split at index 40k by using the `splitter` argument in the DataBlock API!
combined_df = pd.concat([train_df, valid_df])
combined_df.head()
len(combined_df)
amazon_polarity = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=128, vocab=fastai_vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('label'),
    splitter=IndexSplitter(range(40000, 42000))
)
# Passing a custom DataFrame in and splitting by the index!
dls = amazon_polarity.dataloaders(combined_df, bs=32)
len(dls.train_ds), len(dls.valid_ds)
len(dls.train), len(dls.valid)
dls.train.show_batch()
Notice fastai's special tokens (e.g. `xxbos` -> beginning of a text, and `xxmaj` -> next word is capitalized). More can be found in chapter 10 of fastbook.
learn = text_classifier_learner(dls, AWD_LSTM, metrics=[accuracy])
learn.lr_find()
learn.fine_tune(5, base_lr=3e-3)
We can load up a tokenizer and transformer from HuggingFace's Transformers API and train them using fastai! We'll be using DistilBERT as it's smaller and faster than the original BERT.
"DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than
bert-base-uncased
, runs 60% faster while preserving over 95% of BERT’s performances as measured on the GLUE language understanding benchmark."
from transformers import AutoTokenizer, AutoModelForSequenceClassification
hf_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
hf_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
sample_text = train_df['text'][0]
sample_text
tokenizer_outputs = hf_tokenizer(sample_text, return_tensors="pt")
tokenizer_outputs
tokenizer_outputs['input_ids'].shape, tokenizer_outputs['attention_mask'].shape
The tokenizer outputs `input_ids` (the numericalization of the tokens) and `attention_mask` (which manually lets you control attention on specific tokens, e.g. to ignore padding). DistilBERT will take both of these as input!
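To see the `attention_mask` in action, here's a small demo of my own (not from the original) that pads two texts of different lengths; the mask is 1 for real tokens and 0 for padding positions:
# Small demo: padding a batch of two texts; attention_mask marks real tokens (1) vs padding (0)
demo = hf_tokenizer(["short review", "a noticeably longer review with quite a few more tokens in it"],
                    padding=True, return_tensors="pt")
demo['input_ids'].shape, demo['attention_mask']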
Let's go ahead and put this into another dataset class:
class HF_Dataset(torch.utils.data.Dataset):
    def __init__(self, df, hf_tokenizer):
        self.df = df
        self.hf_tokenizer = hf_tokenizer
        # label 1 is negative sentiment and label 2 is positive sentiment
        self.label_map = {1: 0, 2: 1}

    def __len__(self):
        return len(self.df)

    def decode(self, token_ids):
        # Decode each token id separately so subword pieces (e.g. ##ing) stay visible
        return ' '.join([self.hf_tokenizer.decode(x) for x in token_ids])

    def decode_to_original(self, token_ids):
        return self.hf_tokenizer.decode(token_ids.squeeze())

    def __getitem__(self, index):
        label, title, text = self.df.iloc[index]
        label = self.label_map[label]
        label = torch.tensor(label)

        # Tokenize, pad/truncate to 512 tokens, and drop the extra batch dimension
        tokenizer_output = self.hf_tokenizer(text, return_tensors="pt", padding='max_length',
                                             truncation=True, max_length=512)
        tokenizer_output['input_ids'].squeeze_()
        tokenizer_output['attention_mask'].squeeze_()

        return tokenizer_output, label
train_dataset = HF_Dataset(train_df, hf_tokenizer)
valid_dataset = HF_Dataset(valid_df, hf_tokenizer)
len(train_dataset), len(valid_dataset)
tokenizer_outputs, label = train_dataset[0]
tokenizer_outputs.keys(), label
train_dataset.decode(tokenizer_outputs['input_ids'])[:500]
Notice the subword tokenization in the decoded output (e.g. `keyboarding` -> `keyboard ##ing`, as well as `fresher` -> `fresh ##er`).
Here's the original input (tokens decoded, but without the subword tokenization showing):
train_dataset.decode_to_original(tokenizer_outputs['input_ids'])[:500]
train_dl = DataLoader(train_dataset, bs=16, shuffle=True)
valid_dl = DataLoader(valid_dataset, bs=16)
dls = DataLoaders(train_dl, valid_dl)
Let's make sure that items within the `tokenizer_outputs` dictionary can get batched together properly:
batched_data, batched_labels = train_dl.one_batch()
batched_data.keys(), batched_data['input_ids'].shape, batched_labels.shape
To allow this model to be trained by fastai, we need to ensure that the model simply takes a single input and returns the logits. We can create a small class to handle the intermediate stuff (like unpacking the tokenizer outputs via `**tokenizer_outputs`, and extracting the logits from the model output via `.logits`). Here's an example of a forward pass using HF's tokenizer and model:
hf_model(**batched_data)
class HF_Model(nn.Module):
    def __init__(self, hf_model):
        super().__init__()
        self.hf_model = hf_model

    def forward(self, tokenizer_outputs):
        model_output = self.hf_model(**tokenizer_outputs)
        return model_output.logits
model = HF_Model(hf_model)
With the same data, here's an example of a forward pass with our small wrapper over the `hf_model`:
logits = model(batched_data)
logits
Notice that we now get back just the logits (extracted via `.logits` inside the wrapper). This allows for easy compatibility with fastai's `Learner` class.
We have everything we need to finetune this model now!
# (doesn't automatically place model + data on gpu otherwise)
learn = Learner(dls, model.cuda(), loss_func=nn.CrossEntropyLoss(), metrics=[accuracy])
learn.lr_find()
learn.fit_one_cycle(3, 1e-4)
- We can take models written in pure PyTorch, or take existing models from elsewhere (e.g. HuggingFace), and train them with ease within fastai.
- NLP has lots of variation in terms of tokenization methods. In my personal opinion*, libraries like fastai & HuggingFace make the NLP data processing pipeline much easier/faster to get up and running!
- Each method has its pros and cons:
  - DistilBERT may have performed better, but this model has a much larger number of parameters and possibly a much larger vocabulary! (A quick parameter-count check follows below.)
  - The runtime of DistilBERT was also much longer: ~16-minute epochs (for a distilled transformer) vs ~2-minute epochs (for a recurrent model).
  - For some tasks it may be better to use a transformer, but for many common NLP tasks, a simpler/smaller model is sometimes good enough!
- Even though transformers process their input sequences in parallel (through positional encoding + self-attention), they can be slower than simple recurrent networks due to their large model sizes.
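As a quick check of that parameter-count claim (my own addition, assuming the earlier objects are still in memory), you could compare the two models like this:
# Quick comparison (assumes my_vocab and hf_model are still defined from earlier cells)
scratch_params = sum(p.numel() for p in Model(my_vocab, num_classes=2).parameters())
distilbert_params = sum(p.numel() for p in hf_model.parameters())
scratch_params, distilbert_params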
Thanks for reading!!! 🙂