Homework 2: Word Embeddings

In this homework, you will be experimenting with word embeddings. Clone the skeleton code to begin.

Part 1: GloVe Embeddings Exercise

We’ll be using GloVe embeddings for this exercise, which you can download here. Feel free to change it later on to get better accuracy.

GloVe stands for global vectors. It is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

For example, the embeddings for king, man, queen, and woman approximately satisfy king - man + woman ≈ queen: the vector closest to the left-hand side is typically the one for queen.
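A minimal sketch of that arithmetic, using made-up 3-dimensional vectors rather than real GloVe vectors. The names toy_vectors and cosine are hypothetical, and the numbers were chosen by hand so that the analogy holds:

```python
import numpy as np

# Hypothetical toy vectors (NOT real GloVe values); real vectors have 50+ dimensions.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.0, 1.0]),
    "apple": np.array([-0.5, 0.2, -0.3]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Look for the nearest word, excluding the three words used in the query.
candidates = [w for w in toy_vectors if w not in {"king", "man", "woman"}]
nearest = max(candidates, key=lambda w: cosine(target, toy_vectors[w]))
print(nearest)
```

With real embeddings the match is only approximate, which is why the nearest neighbor is used instead of exact equality.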

For this exercise, all you need to know is that GloVe embeddings represent words as vectors learned from a large corpus of text. The vectors are dense: each word maps to a fixed-size vector of real numbers, most of them nonzero, unlike a sparse one-hot encoding. They are learned so that words that appear in similar contexts have similar vectors, which lets the vectors capture semantic meaning.

For the specific GloVe embeddings we have provided, the embedded vectors have dimension 50. This means that each word is represented by a vector in a 50-dimensional vector space.

GloVe Embeddings Interface

Install the Python package gensim, which will act as an interface for the GloVe embeddings.

The following code is provided to you in embeddings.py to initialize the GloVe model:

from gensim.models import KeyedVectors

EMBEDDING_MODEL_PATH = ...
EMBEDDING_MODEL_NAME = "glove.6B.50d.txt"

class WordEmbeddings:
    def __init__(self):
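        # Raw GloVe .txt files lack the word2vec header line; no_header=True (gensim >= 4.0)
        # handles that. Drop it if your file was already converted to word2vec format.
        self.model = KeyedVectors.load_word2vec_format(
            EMBEDDING_MODEL_PATH + EMBEDDING_MODEL_NAME, binary=False, no_header=True
        )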

KeyedVectors is a data structure that allows querying of vectors keyed by lookup tokens (e.g., strings).

You can check if a word is a valid key in KeyedVectors by using in.

You can access the value associated with a key just like with a typical Python dictionary: vector = self.model["word"] returns an ndarray representing the embedded vector.

Check the documentation to find all methods that KeyedVectors supports.

Task

Implement the embed function in the WordEmbeddings class in embeddings.py. embed takes in a list of documents and returns the average word embedding for each document using a pre-trained GloVe model.

...
def embed(self, documents: list[str]) -> np.ndarray

For the examples below, we should expect these shapes:

word_embedding_model = WordEmbeddings()
emb = word_embedding_model.embed(["I like goats"])
print(emb.shape)  # expecting (1, 50)
emb = word_embedding_model.embed(["I like goats", "I hate pizza"])
print(emb.shape)  # expecting (2, 50)

Hint: Consider that a “document” is just a string of space separated words. How can we obtain the word embeddings for each word in a document and then average them out?
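One way to picture the hint, as a sketch on toy data rather than the required implementation. Here toy_lookup is a hypothetical stand-in for self.model: a plain dict that, like KeyedVectors, supports in and [] lookups. The helper name average_embedding, the lowercasing, and the zero vector for documents with no known words are all assumptions you may choose to handle differently:

```python
import numpy as np

# Hypothetical stand-in for self.model (2-dimensional instead of 50).
toy_lookup = {
    "i": np.array([1.0, 0.0]),
    "like": np.array([0.0, 1.0]),
    "goats": np.array([1.0, 1.0]),
}
DIM = 2

def average_embedding(document, lookup, dim):
    """Average the vectors of the known words in a space-separated document."""
    # Assumption: keys are lowercase, so lowercase the text before lookup.
    vectors = [lookup[w] for w in document.lower().split() if w in lookup]
    if not vectors:
        # Assumption: fall back to a zero vector when no word is in the vocabulary.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

docs = ["I like goats", "I hate pizza"]
emb = np.stack([average_embedding(d, toy_lookup, DIM) for d in docs])
print(emb.shape)  # (2, 2)
```

Skipping unknown words keeps every document vector the same size regardless of document length.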

Concept Checks

What are the pros of this embedding approach (averaging word vectors to embed an entire document)?
What are the cons of this embedding approach?

Part 2: MLP Sentiment Classifier

Download the dataset tweets.csv which contains data on various tweets to airlines, the sentiment of the tweets, and various other specifics about the tweets.

You are to build an MLP (Multi Layer Perceptron) to predict airline tweet sentiments.

Questions to ask yourself: What does an MLP take as input? What data do I have? What sort of outputs am I seeking?

Data Handling

We will work through implementing the prepare_data() function in main.py.

Begin by using Pandas to read the dataset into a dataframe. Look at some examples of the data that you have.

Which features may be useful for us in predicting airline sentiment?

For this exercise, we will only use text.

First, drop all columns except text and airline_sentiment.

Next, let’s convert airline sentiment to a numeric value to train on: negative = 0, neutral = 1, positive = 2. Why should we do this?
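The column selection and label conversion might look like the sketch below. It uses a tiny hand-made dataframe instead of tweets.csv; the retweet_count column and the new label column name are assumptions for illustration only:

```python
import pandas as pd

# Tiny hypothetical dataframe standing in for the real dataset.
df = pd.DataFrame({
    "text": ["great flight", "it was ok", "lost my bag"],
    "airline_sentiment": ["positive", "neutral", "negative"],
    "retweet_count": [0, 1, 2],  # hypothetical extra column to be dropped
})

# Keep only the two columns we need.
df = df[["text", "airline_sentiment"]]

# Map each sentiment string to its integer class id.
label_map = {"negative": 0, "neutral": 1, "positive": 2}
df["label"] = df["airline_sentiment"].map(label_map)
print(df)
```

Any sentiment string missing from label_map would become NaN, so it is worth checking df["label"].isna().any() on the real data.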

Next, we ideally want to take a piece of text (a string) as input, pass it into our MLP, and have it predict a sentiment in {0, 1, 2}. But simply passing a string into our MLP will not work. Why?

Once you realize the answer to this question, transform the text column so that each string is converted into a form that your MLP can actually take as input. Hint: You have already written a function to convert strings into their mathematical representations.

Training and Test Splits

Why must we separate our data into training and test splits? Would it be erroneous to simply train on our entire dataset?

Use sklearn’s train_test_split function to perform a train/test split. Set the random_state to 42 so the results are reproducible.

from sklearn.model_selection import train_test_split
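A minimal usage sketch on placeholder arrays. The test_size of 0.2 is an assumption, since the assignment only fixes random_state:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels: 10 samples with 2 features each.
X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

# Hold out 20% of the samples; random_state=42 makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_val.shape)  # (8, 2) (2, 2)
```

train_test_split also accepts a dataframe directly and then returns two dataframes.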

Torch Dataset and DataLoader

Now that our data is cleaned and converted into numeric form that our MLP can take as input, let’s wrap it with PyTorch’s Dataset and DataLoader classes for training.

Specifically, use TensorDataset to convert both your training and validation splits into PyTorch datasets. (The test split from train_test_split serves as the validation set in this exercise.)

Hint: How can you extract the embedding and sentiment columns from the train and test dataframes to pass into TensorDataset? You ultimately want something like:

dataset = TensorDataset(x_tensor, labels_tensor)

Finally, use a DataLoader to create train and validation dataloaders. Check the documentation for dataloaders. You want to specify at least batch_size and shuffle.
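The wrapping step can be sketched as follows, with random placeholder arrays standing in for your embeddings and labels. The batch size of 4 is an arbitrary choice for the example:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder data: 8 documents, each a 50-dimensional embedding.
rng = np.random.default_rng(0)
embeddings = rng.random((8, 50)).astype(np.float32)
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])

# If the embeddings live in a dataframe column of arrays, something like
# np.stack(df["embedding"].to_list()) would produce the 2-D array first.
x_tensor = torch.from_numpy(embeddings)          # float32 features
labels_tensor = torch.from_numpy(labels).long()  # int64 class ids for CrossEntropyLoss

dataset = TensorDataset(x_tensor, labels_tensor)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([4, 50]) torch.Size([4])
```

Shuffling is usually enabled for the training loader and disabled for the validation loader.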

The MLP

Create an MLP! Ideally you make it easy to change the number of layers and the size of each layer.

Note: you actually don’t need a softmax layer if you’re using Cross Entropy Loss, since PyTorch’s nn.CrossEntropyLoss applies a log-softmax to the raw scores (logits) internally.

Hint: use torch.nn.Linear to create a linear layer. Why?

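Hint: you need an activation function after each hidden linear layer. Why?

Hint: conceptually, a softmax turns the output layer’s scores into class probabilities. Why? (As noted above, Cross Entropy Loss applies it for you, so the model itself should output raw scores.)

Hint: use torch.nn.Sequential to create a sequential model that’s easy to execute in forward.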
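One possible shape for such a model, offered as a sketch rather than the expected solution. The class name MLP, the default hidden sizes, and the choice of ReLU are all assumptions:

```python
import torch
from torch import nn

class MLP(nn.Module):
    """Configurable MLP: hidden_dims controls the number and width of hidden layers."""

    def __init__(self, input_dim=50, hidden_dims=(64, 32), num_classes=3):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            # Each hidden layer is a linear map followed by a nonlinearity.
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        # Output layer produces raw logits; no softmax here.
        layers.append(nn.Linear(prev, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
logits = model(torch.randn(4, 50))
print(logits.shape)  # torch.Size([4, 3])
```

Changing hidden_dims, for example to (128,) or (128, 64, 32), changes the depth and width without touching forward.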

Train One Epoch

Implement the function:

def train_one_epoch(model, loss_fn, train_loader, optimizer)
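A sketch of what one epoch could look like, run here on random placeholder data with a small stand-in model. Returning the average loss and accuracy is an assumption; the assignment does not specify a return value:

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

def train_one_epoch(model, loss_fn, train_loader, optimizer):
    model.train()  # enable training behavior (dropout, batch norm updates)
    total_loss, correct, seen = 0.0, 0, 0
    for x, y in train_loader:
        optimizer.zero_grad()      # clear gradients from the previous step
        logits = model(x)
        loss = loss_fn(logits, y)
        loss.backward()            # backpropagate
        optimizer.step()           # update the weights
        total_loss += loss.item() * x.size(0)
        correct += (logits.argmax(dim=1) == y).sum().item()
        seen += x.size(0)
    # Assumption: report the mean loss per sample and the accuracy.
    return total_loss / seen, correct / seen

# Placeholder model and data, only to exercise the function.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(50, 16), nn.ReLU(), nn.Linear(16, 3))
toy_data = TensorDataset(torch.randn(12, 50), torch.randint(0, 3, (12,)))
toy_loader = DataLoader(toy_data, batch_size=4, shuffle=True)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)

avg_loss, accuracy = train_one_epoch(
    toy_model, nn.CrossEntropyLoss(), toy_loader, toy_optimizer
)
print(avg_loss, accuracy)
```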

Validation

Implement a validation function:

def validate(model, loss_fn, val_loader)
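Validation follows the same pattern without any weight updates. The sketch below again uses placeholder data, and its return values (average loss and accuracy) are an assumption:

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

def validate(model, loss_fn, val_loader):
    model.eval()  # disable dropout and use running batch norm statistics
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for x, y in val_loader:
            logits = model(x)
            total_loss += loss_fn(logits, y).item() * x.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            seen += x.size(0)
    # Assumption: report the mean loss per sample and the accuracy.
    return total_loss / seen, correct / seen

# Placeholder model and data, only to exercise the function.
torch.manual_seed(0)
val_model = nn.Sequential(nn.Linear(50, 16), nn.ReLU(), nn.Linear(16, 3))
val_data = TensorDataset(torch.randn(10, 50), torch.randint(0, 3, (10,)))
val_loader = DataLoader(val_data, batch_size=4, shuffle=False)

weights_before = val_model[0].weight.clone()
val_loss, val_accuracy = validate(val_model, nn.CrossEntropyLoss(), val_loader)
print(val_loss, val_accuracy)
```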

Full Training Loop

Finally, implement the full training loop:

def train(model, train_loader, val_loader, epochs)

Then, define and train your model!

Logging

Plot your train and validation loss on the same plot using plt.plot. Make sure to label your lines.

Then, plot your train and validation accuracy on another figure. Make sure to label your lines.

Finally, save your plots for submission to Gradescope.
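A sketch of the loss plot, using made-up numbers in place of the values your training loop records. The file name loss_plot.png and the Agg backend (which renders to a file without opening a window) are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # render to files; no display needed
import matplotlib.pyplot as plt

# Hypothetical per-epoch losses; replace with the values from your training loop.
train_losses = [1.05, 0.90, 0.82, 0.78]
val_losses = [1.02, 0.93, 0.88, 0.87]
epochs = range(1, len(train_losses) + 1)

fig = plt.figure()
plt.plot(epochs, train_losses, label="train loss")
plt.plot(epochs, val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()  # shows the line labels
plt.savefig("loss_plot.png")
```

The accuracy plot follows the same pattern on a second figure.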

Evaluating

Write a short script to evaluate the model on a piece of text you write.

Remember to set the model to eval mode, and to convert the text to an embedding.

The result should be one of the strings “negative”, “neutral”, or “positive”.

Implement this:

def evaluate(trained_model, sample_text):
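A sketch of the overall flow. Here toy_embed is a hypothetical stand-in for WordEmbeddings().embed that returns random vectors, and the untrained linear layer stands in for your trained MLP, so the printed prediction is meaningless; only the steps matter:

```python
import numpy as np
import torch
from torch import nn

LABELS = ["negative", "neutral", "positive"]  # index matches the class id

def toy_embed(documents):
    # Hypothetical stand-in for WordEmbeddings().embed:
    # returns an array of shape (len(documents), 50).
    rng = np.random.default_rng(0)
    return rng.random((len(documents), 50)).astype(np.float32)

def evaluate(trained_model, sample_text):
    trained_model.eval()                            # eval mode
    x = torch.from_numpy(toy_embed([sample_text]))  # text -> (1, 50) tensor
    with torch.no_grad():
        logits = trained_model(x)
    class_id = logits.argmax(dim=1).item()          # most likely class
    return LABELS[class_id]

stand_in_model = nn.Linear(50, 3)  # placeholder for a trained MLP
prediction = evaluate(stand_in_model, "I like goats")
print(prediction)
```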

Optional Tasks:

Write a short script to evaluate the model on the validation set, and print out the tweets that were misclassified.
A confusion matrix shows which classes the model tends to confuse with one another. Write a few lines of code to plot a confusion matrix for the validation set using from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay.
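For the confusion matrix, a minimal sketch with hand-written labels in place of real model predictions:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hypothetical true labels and predictions for six validation tweets.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["negative", "neutral", "positive"]
)
disp.plot()
```

Off-diagonal entries count the mistakes: the entry in row i, column j is the number of class-i tweets predicted as class j.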

Results and Submission

The dataset is not perfect, but your model should be able to reach at least 70% accuracy on the validation set. We do not have a held-out test set for this exercise.

Credit

For full credit, you must have a model that achieves at least 70% accuracy on the validation set.

All functions and classes mentioned above should be implemented as instructed, including:

WordEmbeddings: def embed(self, documents: list[str]):
def prepare_data():
the MLP model
def train_one_epoch(model, loss_fn, train_loader, optimizer):
def validate(model, loss_fn, val_loader):
def train(model, train_loader, val_loader, epochs):
def evaluate(trained_model, sample_text):

Additionally, you should download these plots and submit them for full credit:

Train and Validation Loss Plot
Train and Validation Accuracy Plot

Experimentation

To achieve a higher accuracy, feel free to change the way you featurize. Some ideas for you to try:

Try larger models (wider layers, more layers)

Play around with activation functions, dropout, batch norm

Remove stop words (the, a, in, is, etc.) (hint: take a look at nltk)

Use a different way of featurizing. For example, instead of taking the average you could concatenate the min, mean, and max embeddings. Remember that the embeddings for all documents must have the same dimension, so you cannot just concatenate the embeddings of each word. (unless you’re training an LSTM :eyes:)

Use TF-IDF to take a weighted average of the embeddings. (haven’t tried it, but wouldn’t it be interesting)

Use a different pre-trained model. I haven’t tried it but you could try: word2vec, fastText, BERT, Universal Sentence Encoder, CLIP, OpenAI embedding models.

Deal with the class imbalance - you can try oversampling or undersampling the data. You can also try using a weighted loss function.
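As one illustration of the weighted loss idea, the sketch below derives class weights from label counts with the common "balanced" heuristic, n_samples / (n_classes * count). The labels are made up, and this heuristic is only one of several reasonable weightings:

```python
import torch
from torch import nn

# Hypothetical imbalanced training labels: six negative, two neutral, two positive.
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

counts = torch.bincount(labels, minlength=3).float()  # samples per class
weights = counts.sum() / (len(counts) * counts)       # rarer classes get larger weights
print(weights)

# Mistakes on rare classes now contribute more to the loss.
loss_fn = nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(torch.randn(10, 3), labels)
print(loss.item())
```

Compute the weights from the training split only, so that the validation data does not influence training.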