CS6220 Unsupervised Data Mining
HW4 PyTorch: Classification, Autoencoders, Word Embedding, Image Features, LSTM
Make sure you check the syllabus for the due date. Please
use the notations adopted in class, even if the problem is stated
in the book using a different notation.
We are not looking for very long answers (if you find yourself
writing more than one or two pages of typed text per problem, you
are probably on the wrong track). Try to be concise; also keep in
mind that good ideas and explanations matter more than exact
details.
- Write all the answers in this IPython notebook. Once you are finished, generate a PDF via File -> Print -> Save as PDF and upload it to Gradescope.
- **Important:** check your PDF before you turn it in to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes
terminate the PDF creation routine early.
- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right.
One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.
DATASET : SpamBase: emails (54-feature vectors) classified as spam/non-spam
DATASET : 20 NewsGroups: news articles
DATASET : MNIST: 28x28 digit B/W images
DATASET : FASHION: 28x28 B/W images
https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist
PROBLEM 1: PyTorch Setup
The complete instructions for installing PyTorch can be found at: https://colab.research.google.com/github/omarsar/pytorch_notebooks/blob/master/pytorch_quick_start.ipynb
We will use Google Colab for this homework: start a new notebook and install PyTorch, following the instructions on the page above.
- Test that PyTorch is installed properly:
$ python
# quick sanity check
>>> import torch
>>> print(torch.__version__)
1.12.1+cu113
This completes the development environment setup.
PROBLEM 2: NNet supervised classification
Prelims:
- Check if CUDA is available and set the device to cuda, else to cpu.
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
- Set hyperparameters:
n_epochs = 3            # number of passes over the training set
batch_size_train = 64   # mini-batch size for training
batch_size_test = 1000  # batch size for evaluation
learning_rate = 0.01
momentum = 0.5          # used by SGD; ignored by Adam
log_interval = 10       # print a log line every 10 batches
random_seed = 1

torch.backends.cudnn.enabled = False  # disable cuDNN for reproducibility
torch.manual_seed(random_seed)        # fix the seed for reproducibility
A) For the MNIST dataset, train a simple neural network in supervised mode with train and test splits, and report results.
Load the data with a DataLoader (a loading sketch follows): https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
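A minimal loading sketch using torchvision's built-in MNIST dataset; the `./data` root and the standard MNIST normalization constants are conventional choices, not requirements:

```python
import torch
from torchvision import datasets, transforms

# standard MNIST mean/std normalization
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transform),
    batch_size=batch_size_train, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True, transform=transform),
    batch_size=batch_size_test, shuffle=False)
```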
Construct a neural network with the following architecture (a minimal sketch follows this list):
- Two 2D-convolutional layers followed by a dropout layer, then two linear layers: the first with input dimension matching the number of input features, and the last with output dimension matching the number of classes.
- Define a forward pass with max-pooling, ReLU, dropout, and softmax.
- Code a train loop with the number of epochs set to 3.
- Define the loss and optimizer (Adam).
- Train the model.
- Test the trained model under torch.no_grad().
- You should get a test set accuracy above 95%.
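One possible implementation, as referenced above. This is a sketch, not the only valid architecture: the 10/20 channel counts and the 50-unit hidden layer are illustrative, and log-softmax with NLL loss is the numerically stable equivalent of softmax plus cross-entropy. It assumes the `train_loader`/`test_loader` from the loading step and the hyperparameters and `device` from the prelims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)   # 20 channels * 4 * 4 spatial after two pools
        self.fc2 = nn.Linear(50, 10)    # 10 digit classes

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net().to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(n_epochs):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print(f'epoch {epoch}  batch {batch_idx}  loss {loss.item():.4f}')

# evaluation under no_grad, as required above
model.eval()
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        correct += (model(data).argmax(dim=1) == target).sum().item()
print(f'test accuracy: {correct / len(test_loader.dataset):.4f}')
```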
B) For classification on 20NG
Upload the dataset to Google Drive and mount the drive in Colab:
from google.colab import drive
drive.mount('/content/drive')
Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and read them (a reading sketch follows).
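A sketch for reading the 100-d GloVe file into a dictionary; the path assumes the file was placed on the mounted drive, so adjust it to your own location:

```python
import numpy as np

embeddings_index = {}
# the path below is an assumption -- adjust to wherever you placed the file
with open('/content/drive/MyDrive/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        embeddings_index[parts[0]] = np.asarray(parts[1:], dtype='float32')
print(f'loaded {len(embeddings_index)} word vectors')
```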
Read the dataset, then tokenize and pad:
from gensim.utils import simple_preprocess

tokens = []
for text in texts:
    tokens.append(simple_preprocess(text))
Split and load the data with a DataLoader (a sketch follows): https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
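One way to index, pad, split, and wrap the tokenized texts; the vocabulary construction, the fixed length of 200, and the 80/20 split are illustrative assumptions, and `labels` is assumed to hold the integer 20NG class labels:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

max_len = 200  # illustrative fixed sequence length

# word -> index map; index 0 is reserved for padding
vocab = {'<pad>': 0}
for doc in tokens:
    for w in doc:
        if w not in vocab:
            vocab[w] = len(vocab)

def encode(doc):
    # truncate to max_len, then pad with zeros
    ids = [vocab[w] for w in doc[:max_len]]
    return ids + [0] * (max_len - len(ids))

X = torch.tensor([encode(doc) for doc in tokens], dtype=torch.long)
y = torch.tensor(labels, dtype=torch.long)  # `labels`: integer classes (assumed)

# random 80/20 split (the seed was fixed in the prelims)
perm = torch.randperm(X.size(0))
split = int(0.8 * X.size(0))
train_idx, test_idx = perm[:split], perm[split:]
train_loader = DataLoader(TensorDataset(X[train_idx], y[train_idx]),
                          batch_size=batch_size_train, shuffle=True)
test_loader = DataLoader(TensorDataset(X[test_idx], y[test_idx]),
                         batch_size=batch_size_test)
```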
Construct a neural network with the following architecture (a sketch follows this list):
- (1D-convolutional layer followed by MaxPool1d) x3, followed by two linear layers: the first with input dimension matching the number of input features, and the last with output dimension matching the number of classes.
- Define a forward pass with embed, transpose, (ReLU, pool) x3, and softmax.
- Code a train loop with the number of epochs set to 3.
- Define the loss and optimizer (default SGD; you could try Adam).
- Train the model.
- Test the trained model under torch.no_grad().
- You should get a test set accuracy around 63%.
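A sketch of the text CNN described above. The channel width of 128, the 64-unit hidden layer, and global average pooling before the linear layers are illustrative choices; the train/test loop has the same shape as the MNIST one, with `optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)` as the default optimizer:

```python
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, n_classes=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # optionally copy the GloVe vectors into self.embed.weight here
        self.conv1 = nn.Conv1d(embed_dim, 128, kernel_size=5)
        self.conv2 = nn.Conv1d(128, 128, kernel_size=5)
        self.conv3 = nn.Conv1d(128, 128, kernel_size=5)
        self.pool = nn.MaxPool1d(2)
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.embed(x)           # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)       # (batch, embed_dim, seq_len), as Conv1d expects
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.mean(dim=2)           # global average over remaining positions
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)
```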
C) Extra Credit. Run these classifications for MNIST using a GPU
PROBLEM 3 [OPTIONAL no credit]: Autoencoders
For each of the datasets MNIST and 20NG (required), and SPAMBASE and FASHION (optional), run an autoencoder in PyTorch with a desired hidden layer size (try K = 5, 10, 20, 100, 200). What is the smallest K that works?
Load the data with a DataLoader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
Construct an Autoencoder with the following architecture (a sketch follows this list):
- Two linear layers, with in-features matching the input dimension and out-features shrinking to K.
- Two linear layers, with in-features starting at K and out-features growing back to the input dimension.
- Define a forward pass with ReLU.
- Code a train loop with the number of epochs set to 10.
- Define the loss and optimizer (Adam).
- Train the model:
- use the GPU if available
- use mean-squared-error loss
- create a model from the Autoencoder class and load it onto the specified device, either GPU or CPU
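A sketch of one such autoencoder; the 128-unit intermediate width and K = 20 are illustrative, and 784 = 28x28 applies to MNIST/FASHION (the input dimension changes per dataset):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim, k):
        super().__init__()
        # encoder: in_dim -> 128 -> K (128 is an illustrative intermediate width)
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, k), nn.ReLU(),
        )
        # decoder: K -> 128 -> in_dim
        self.decoder = nn.Sequential(
            nn.Linear(k, 128), nn.ReLU(),
            nn.Linear(128, in_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Autoencoder(in_dim=28 * 28, k=20).to(device)  # 784 for MNIST/FASHION
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for x, _ in train_loader:                 # labels are ignored
        x = x.view(x.size(0), -1).to(device)  # flatten images to vectors
        optimizer.zero_grad()
        loss = criterion(model(x), x)         # reconstruct the input
        loss.backward()
        optimizer.step()
```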
Verify the obtained re-encoding of the data (the new feature representation) in several ways:
- repeat a classification train/test task, or a clustering task
- examine the new pairwise distances dist(i,j) against the old distances obtained with the original features (sample 100 pairs of related words)
- examine the per-datapoint overlap between the top-20 neighbors by the new distance and the old neighbors (see the sketch after this list)
- for images, rebuild the image from the output layer and draw it for visual inspection
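For the neighbor-overlap check, a possible sketch, assuming `X_old` is the original feature matrix as a float tensor (subsample it if the dataset is large, since `torch.cdist` is quadratic) and `model` is the trained autoencoder from above:

```python
import torch

def topk_neighbors(X, k=20):
    # pairwise Euclidean distances; mask the diagonal so a point is not its own neighbor
    d = torch.cdist(X, X)
    d.fill_diagonal_(float('inf'))
    return d.topk(k, largest=False).indices

with torch.no_grad():
    codes = model.encoder(X_old.to(device)).cpu()  # new K-dim representation

old_nn = topk_neighbors(X_old)
new_nn = topk_neighbors(codes)
overlap = [len(set(old_nn[i].tolist()) & set(new_nn[i].tolist()))
           for i in range(X_old.size(0))]
print(f'mean top-20 neighbor overlap: {sum(overlap) / len(overlap):.2f} / 20')
```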
PROBLEM 4 [OPTIONAL no credit]: Word Vectors Fine Tuning (subject to change)
Download general-purpose word embeddings and load them, for example:
GloVe or
Google word2vec pretrained
Starting from the downloaded word vectors, fine-tune them on the specific dataset by running a few iterations of a word2vec library such as gensim, mittens, or TensorFlow.
You can follow a tutorial such as
https://czarrar.github.io/Gensim-Word2Vec/
https://github.com/ashutoshsingh0223/mittens
You can pick your own text to fine-tune on, if it is reasonable in size and very domain-specific (compared to general English). Suggestions:
- Alice in Wonderland
- Sonnets
- specific categories (labels) from the 20NG or Reuters datasets
- your favorite specific text (like a book, or a project)
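A fine-tuning sketch with the gensim 4 API. The `alice.txt` path is a placeholder for your chosen text, the `glove-wiki-gigaword-100` download name is one of gensim-data's pretrained sets, and seeding the model by copying pretrained vectors for shared vocabulary words is one simple approach:

```python
import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# sentences: list of token lists from your chosen text ('alice.txt' is a placeholder)
sentences = [simple_preprocess(line) for line in open('alice.txt', encoding='utf-8')]

pretrained = api.load('glove-wiki-gigaword-100')   # KeyedVectors, 100-d

model = Word2Vec(vector_size=100, min_count=2)
model.build_vocab(sentences)
# seed the model with pretrained vectors for words present in both vocabularies
for word in model.wv.index_to_key:
    if word in pretrained:
        model.wv[word] = pretrained[word].copy()

# a few fine-tuning iterations on the domain-specific corpus
model.train(sentences, total_examples=model.corpus_count, epochs=5)
```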
PROBLEM 5 [OPTIONAL no credit]: Image Feature Extraction
Run a Convolutional Neural Network in PyTorch to extract image features. In practice, the network usually does both the feature extraction and the supervised task (classification) in one pipeline.
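A sketch of the extraction-only variant, using a pretrained torchvision ResNet-18 (requires torchvision >= 0.13 for the weights API) with its classification head replaced by an identity, so the forward pass yields the 512-d penultimate features; the backbone choice is illustrative, and `img` is assumed to be a PIL image:

```python
import torch
import torch.nn as nn
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
backbone = models.resnet18(weights=weights)
backbone.fc = nn.Identity()        # drop the classification head
backbone.eval()

preprocess = weights.transforms()  # matching ImageNet preprocessing

with torch.no_grad():
    feats = backbone(preprocess(img).unsqueeze(0))  # `img`: a PIL image (assumed)
print(feats.shape)  # torch.Size([1, 512])
```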
PROBLEM 6 [OPTIONAL no credit]: LSTM for text
Run a Recurrent Neural Network/LSTM in PyTorch to model word dependencies/order in text. This can be used for translation, next-word prediction, event detection, etc.
LSTM article
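A minimal sketch of such a model for text classification; the embedding and hidden sizes are illustrative, and the final linear layer reads the last hidden state:

```python
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_classes=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                 # x: (batch, seq_len) word indices
        emb = self.embed(x)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)      # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])           # logits from the final hidden state
```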