CS6220 Unsupervised Data Mining
HW4 Pytorch: Classification, Autoencoders, Word Embedding, Image Features, LSTM
Make sure you check the syllabus for the due date.
Try to be concise in both your answers and in your code.
DATASET : SpamBase: emails (54-feature vectors) classified as spam/nospam
DATASET : 20 NewsGroups : news articles
DATASET : MNIST : 28x28 digit B/W images
DATASET : FASHION : 28x28 B/W images
https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist
PROBLEM 1: Setup a tensor library [Optional, no credit]
A) Setup your favorite tensor-based library for deep learning, such as PyTorch or TensorFlow, and familiarize yourself with its basic usage.
If using PyTorch, you can test if it is installed properly with (in Python):
>>> import torch
>>> print(torch.__version__)
# 1.13.1
B) Train a simple feed-forward neural network on the MNIST dataset with 80/20 train and test splits and report results
First you need to design the architecture of the network. An example architecture for MNIST might look something like:
- Two 2D-Convolutional layers
- One Dropout layer
- Two Linear layers with ReLU activation
- Final Linear layer activated w/ softmax
- SGD optimizer w/ learning rate of 0.01 and Nesterov momentum
- Mini-batch size of 64
You should be able to achieve an accuracy of about 95% or higher, but your mileage may vary. Using the example architecture above, we were able to achieve a 95% test accuracy with a small number of epochs.
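A minimal sketch of such a network in PyTorch; the layer widths, dropout rate, 3 epochs, and the 80/20 split via random_split are choices of this sketch, not requirements:

import torch
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# example CNN roughly following the architecture above
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),              # 28x28 -> 14x14
    nn.Dropout(0.25),
    nn.Flatten(),
    nn.Linear(32 * 14 * 14, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),            # softmax is folded into CrossEntropyLoss below
)

# 80/20 train/test split of MNIST
mnist = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
n_train = int(0.8 * len(mnist))
train_set, test_set = random_split(mnist, [n_train, len(mnist) - n_train])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
# evaluate accuracy on test_set afterwards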
PROBLEM 2 : NNet supervised classification with tuned word vectors
Train a neural network on a sizeable subset of 20NG (say, at least 5 categories)
Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and do some basic preprocessing, e.g.
## read the dataset, tokenize and pad
from gensim.utils import simple_preprocess
import torch

tokens = list()
for text in ng_text:
    tokens.append(simple_preprocess(text))

# doc2ind maps a tokenized document to a fixed-length (padded) list of GloVe vector indices
ng_vector_idx = torch.LongTensor([doc2ind(doc) for doc in tokens])
where `ng_vector_idx` is a `torch.tensor` of integers representing the indices of the GloVe vectors from above, and `doc2ind` is a function you need.
Note that you should not form the matrix of word embeddings explicitly; simply store the vector indices representing the words
in the text (see `torch.nn.Embedding` for more details)
Parameterize an embedding layer for GloVe. With PyTorch, this looks something like:
from torch import nn
glove_emb = nn.Embedding.from_pretrained(< GloVe vectors for the 20NG vocabulary here >)
glove_emb.weight.requires_grad = False
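A hedged sketch of how the pieces above could be built; the file path, the word2idx lookup, the PAD_IDX/MAX_LEN padding scheme, and this particular doc2ind are all assumptions of the sketch, not part of the assignment:

import torch
from torch import nn

# build a vocabulary and weight matrix from the downloaded GloVe file
word2idx, vectors = {}, []
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(" ")
        word2idx[parts[0]] = i
        vectors.append([float(v) for v in parts[1:]])

# append an all-zero row to use as a padding vector
glove_weights = torch.cat([torch.FloatTensor(vectors), torch.zeros(1, 100)])
PAD_IDX = glove_weights.shape[0] - 1
MAX_LEN = 200  # assumed fixed document length

def doc2ind(doc_tokens):
    # keep tokens present in the GloVe vocabulary, then truncate/pad to MAX_LEN
    idx = [word2idx[w] for w in doc_tokens if w in word2idx][:MAX_LEN]
    return idx + [PAD_IDX] * (MAX_LEN - len(idx))

glove_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True, padding_idx=PAD_IDX)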
Construct a neural network using the embedding layer. You're free to design the rest of the architecture.
For example, in PyTorch, the architecture code might look something like:
model = nn.Sequential(
glove_emb,
...
nn.Linear(..., num_classes),
nn.Softmax(dim=1)
)
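For instance, one hedged way to complete this sketch is to flatten the embedded token sequence before the linear layers; the hidden width of 256, MAX_LEN from the sketch above, and num_classes = 5 are assumptions:

num_classes = 5  # e.g. 5 of the 20 newsgroups
model = nn.Sequential(
    glove_emb,                      # (batch, MAX_LEN) -> (batch, MAX_LEN, 100)
    nn.Flatten(),                   # -> (batch, MAX_LEN * 100)
    nn.Linear(MAX_LEN * 100, 256),
    nn.ReLU(),
    nn.Linear(256, num_classes),
    nn.Softmax(dim=1),              # drop this and train on raw logits if using nn.CrossEntropyLoss
)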
It's possible to get a test set accuracy around 63%.
Fine-tune them on 20NG by making your embedding layer trainable, i.e. by unfreezing the weights. After a sufficient amount of training, plot a 2d projection of the resulting embeddings colored by class using your choice of reduction (PCA, MDS, tSNE, etc.). Is there any perceptible difference between the embeddings before and after tuning?
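One hedged way to do the unfreezing and the projection, interpreting "embeddings colored by class" as per-document mean embeddings colored by document label (ng_labels is an assumed array of numeric class labels):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# unfreeze the GloVe layer and continue training the same model for a few more epochs
glove_emb.weight.requires_grad = True
# ... retrain here ...

# project per-document embeddings (mean of each document's word vectors) to 2D
with torch.no_grad():
    doc_emb = glove_emb(ng_vector_idx).mean(dim=1)   # (num_docs, 100)
proj = PCA(n_components=2).fit_transform(doc_emb.numpy())
plt.scatter(proj[:, 0], proj[:, 1], c=ng_labels, s=5, cmap="tab10")
plt.show()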
You can follow a tutorial such as
https://czarrar.github.io/Gensim-Word2Vec/
https://github.com/ashutoshsingh0223/mittens
PROBLEM 3 [Optional, no credit]: Fine-tune word vectors on your own text
You can pick your own text to fine-tune word vectors on, if it is reasonable in size and very domain-specific (compared to general English). Suggestions:
- Alice in Wonderland
- Sonnets
- specific categories (labels) from the 20NG or Reuters datasets
- use your favorite specific text (like a book, or project)
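As a hedged sketch, one way to train domain-specific vectors with gensim on such a text; the file name alice.txt and the hyperparameters are assumptions, and this trains Word2Vec vectors from scratch as in the Gensim-Word2Vec tutorial linked above, whereas the mittens link covers fine-tuning GloVe itself:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# read the chosen text and tokenize it line by line
with open("alice.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f if line.strip()]

# train word vectors on the small, domain-specific corpus
w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2, epochs=20)

# inspect neighbors to see the domain-specific usage of a word
print(w2v.wv.most_similar("alice", topn=10))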
PROBLEM 4 [Optional, no credit]: Autoencoders
For each of the datasets MNIST, 20NG (required) and SPAMBASE, FASHION (optional), train an autoencoder in PyTorch with a desired hidden layer size (try K=5, 10, 20, 100, 200) - what is the smallest K that works?
Load the data with a DataLoader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
Construct an Autoencoder with the following architecture (see the sketch after this list):
- Two linear layers (the encoder) with in_features matching the input dimension and out_features matching K
- Two linear layers (the decoder) with in_features matching K and out_features matching the input dimension
- Define a forward pass with ReLU activations
- Code a training loop with 10 epochs
- Define the loss and optimizer (Adam)
- Train the model
- use the GPU if available
- use mean-squared error loss
- create a model from the Autoencoder class and load it to the specified device, either GPU or CPU
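A minimal sketch for MNIST under these specs; the intermediate width of 128, K=20, the batch size, and the use of torchvision to load MNIST are assumptions of this sketch:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class Autoencoder(nn.Module):
    def __init__(self, in_dim=28 * 28, k=20, hidden=128):
        super().__init__()
        # encoder: input -> hidden -> K
        self.enc1 = nn.Linear(in_dim, hidden)
        self.enc2 = nn.Linear(hidden, k)
        # decoder: K -> hidden -> input
        self.dec1 = nn.Linear(k, hidden)
        self.dec2 = nn.Linear(hidden, in_dim)

    def forward(self, x):
        x = torch.relu(self.enc1(x))
        x = torch.relu(self.enc2(x))
        x = torch.relu(self.dec1(x))
        return self.dec2(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Autoencoder(k=20).to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

mnist = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(mnist, batch_size=128, shuffle=True)

for epoch in range(10):
    for x, _ in loader:                       # labels are ignored by the autoencoder
        x = x.view(x.size(0), -1).to(device)  # flatten 28x28 images to 784-d vectors
        optimizer.zero_grad()
        loss = loss_fn(model(x), x)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")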
Verify the obtained re-encoding of data (the new feature representation) in several ways:
- repeat a classification train/test task, or a clustering task
- examine the new pairwise distances dist(i,j) against the old distances obtained with original features (sample 100 pairs of related words)
- examine the top-20 neighbors (by new distance) set overlap with old neighbors, per datapoint
- for images, rebuild the image from the output layer and display it
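For example, a hedged sketch of the neighbor-overlap check on a 500-point sample, reusing the model and mnist objects from the sketch above (the sample size and the use of only the encoder half are assumptions):

# X: original flattened features; Z: encoded K-dimensional features
with torch.no_grad():
    X = mnist.data[:500].view(500, -1).float() / 255.0
    Z = torch.relu(model.enc2(torch.relu(model.enc1(X.to(device))))).cpu()

d_old = torch.cdist(X, X)   # pairwise distances in the original space
d_new = torch.cdist(Z, Z)   # pairwise distances in the encoded space

overlaps = []
for i in range(X.size(0)):
    old_nn = set(d_old[i].topk(21, largest=False).indices.tolist()) - {i}  # top-20 old neighbors
    new_nn = set(d_new[i].topk(21, largest=False).indices.tolist()) - {i}  # top-20 new neighbors
    overlaps.append(len(old_nn & new_nn) / 20)
print("mean top-20 neighbor overlap:", sum(overlaps) / len(overlaps))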
PROBLEM 5 [Optional, no credit]: Image Feature Extraction
Run a Convolutional Neural Network in PyTorch to extract image features. In practice the network usually does both the feature extraction and the supervised task (classification) in one pipeline.
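One hedged approach is to reuse a pretrained torchvision model and keep its penultimate activations as image features; the choice of resnet18 and the 3x224x224 input shape are assumptions of this sketch:

import torch
from torch import nn
from torchvision import models

# pretrained ImageNet CNN; on older torchvision, pass pretrained=True instead of weights=
cnn = models.resnet18(weights="DEFAULT")
cnn.eval()

# drop the final classification layer, keeping the 512-d penultimate features
feature_extractor = nn.Sequential(*list(cnn.children())[:-1])

# grayscale 28x28 images would first need to be resized to 224x224 and repeated to
# 3 channels (e.g. with torchvision.transforms); a random batch stands in here
with torch.no_grad():
    batch = torch.randn(8, 3, 224, 224)
    feats = feature_extractor(batch).flatten(1)   # -> (8, 512) feature vectors
print(feats.shape)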
PROBLEM 6 [Optional, no credit]: LSTM for text
Run a Recurrent Neural Network/LSTM in PyTorch to model word dependencies/order in text. Can be used for translation, next-word prediction, event detection, etc.
LSTM article
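A minimal hedged sketch of an LSTM next-word predictor in PyTorch (vocab_size, the layer sizes, and the dummy batch are assumptions; data preparation is omitted):

import torch
from torch import nn

vocab_size, emb_dim, hidden_dim = 10_000, 100, 256

class NextWordLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                  # x: (batch, seq_len) of word indices
        h, _ = self.lstm(self.emb(x))      # h: (batch, seq_len, hidden_dim)
        return self.out(h)                 # logits over the vocabulary at every position

model = NextWordLSTM()
x = torch.randint(0, vocab_size, (4, 20))   # dummy batch of index sequences
logits = model(x)                           # (4, 20, vocab_size)
# train with cross-entropy against the sequence shifted by one position
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, vocab_size), x[:, 1:].reshape(-1))
print(loss.item())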