CS6220 Unsupervised Data Mining

HW4 Pytorch: Classification, Autoencoders, Word Embedding, Image Features, LSTM

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.

- Write all the answers in this IPython notebook. Once you are finished, generate a PDF via (File -> Print -> Save as PDF) and upload it to Gradescope.
- **Important:** check your PDF before you turn it in to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes terminate the PDF creation routine early.
- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right. One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.

DATASET : SpamBase: emails (54-feature vectors) classified as spam/nospam

DATASET : 20 NewsGroups : news articles

DATASET : MNIST : 28x28 digit B/W images

DATASET : FASHION : 28x28 B/W images

https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist

PROBLEM 1: Pytorch Setup

The complete instructions for installing PyTorch can be found at: https://colab.research.google.com/github/omarsar/pytorch_notebooks/blob/master/pytorch_quick_start.ipynb
We will use Google Colab for this homework. Start a new notebook and install PyTorch, following the instructions on the page above.

- Test if pytorch is installed properly
$ python
# Following is testing code
>>> import torch
>>> print(torch.__version__)
1.12.1+cu113

We set up the development environment.

PROBLEM 2 : NNet supervised classification

Prelims :

Check if CUDA is available and set the device to cuda, else to cpu:
    if torch.cuda.is_available():
        device = torch.device('cuda')
    else:
        device = torch.device('cpu')
Set hyperparameters:
    n_epochs = 3
    batch_size_train = 64
    batch_size_test = 1000
    learning_rate = 0.01
    momentum = 0.5
    log_interval = 10
    random_seed = 1
    torch.backends.cudnn.enabled = False
    torch.manual_seed(random_seed)

A) For MNIST dataset, train a simple neural network in supervised mode with train and test splits and report results

Load the data with dataloader https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
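
As a rough illustration of the loading step, the sketch below wraps a training and a test dataset in `DataLoader` objects using the batch sizes from the hyperparameters. The helper name `make_loaders` and the random stand-in tensors are assumptions made so the example runs offline; in the notebook you would pass the real MNIST datasets instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size_train = 64
batch_size_test = 1000


def make_loaders(train_ds, test_ds):
    """Wrap two datasets in DataLoaders; only the training set is shuffled."""
    train_loader = DataLoader(train_ds, batch_size=batch_size_train, shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=batch_size_test, shuffle=False)
    return train_loader, test_loader


# Stand-in data with MNIST's shape (1x28x28 images, labels 0-9) so the sketch
# runs without a download. In the notebook, replace these with
# torchvision.datasets.MNIST(root, train=True or False, download=True,
#                            transform=torchvision.transforms.ToTensor()).
torch.manual_seed(1)
fake_train = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
fake_test = TensorDataset(torch.rand(100, 1, 28, 28), torch.randint(0, 10, (100,)))

train_loader, test_loader = make_loaders(fake_train, fake_test)
images, labels = next(iter(train_loader))
print(images.shape, labels.shape)
```

Each iteration over a loader yields one `(images, labels)` batch, which is the unit the training loop consumes.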

Construct a neural network with the following architecture:

Two 2D convolutional layers followed by a dropout layer and two linear layers. The first linear layer's in dimension must match the number of features it receives as input, and the last layer's out dimension must match the number of classes.
Define a forward pass with max-pooling, ReLU, dropout, and softmax.
Code a training loop that runs for 3 epochs.
Define the loss and the optimizer (Adam).
Train the model.
Test the trained model under torch.no_grad().
Your test-set accuracy should exceed 95%.
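
The steps above can be sketched as follows. This is one possible layout, not a reference solution: the channel counts (10, 20), kernel size 5, and the hidden width 50 are assumptions, as are the helper names `train_one_epoch` and `evaluate`. The sketch returns `log_softmax` and pairs it with `nll_loss`, which is a common numerically stable way to combine softmax with a negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)       # 20 channels * 4 * 4 after pooling
        self.fc2 = nn.Linear(50, n_classes)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))                   # 28 -> 24 -> 12
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))  # 12 -> 8 -> 4
        x = x.flatten(1)                                             # (batch, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        return F.log_softmax(self.fc2(x), dim=1)


def train_one_epoch(model, loader, optimizer, device):
    """Run one pass over the loader, updating the model in place."""
    model.train()
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()


def evaluate(model, loader, device):
    """Return the fraction of correctly classified examples."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            pred = model(data).argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.numel()
    return correct / total


# Shape check on a random batch of MNIST-sized inputs.
model = Net()
out = model(torch.randn(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 10])
```

In the notebook, you would move the model to `device`, create `torch.optim.Adam(model.parameters(), lr=learning_rate)`, call `train_one_epoch` once per epoch for `n_epochs`, and then report `evaluate` on the test loader.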

B) For classification on 20NG

Upload the dataset to Google Drive and mount the drive in Colab:
    from google.colab import drive
    drive.mount('/content/drive')

Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and read them.
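
Reading the embeddings could look like the sketch below. It assumes the usual GloVe text layout of one word per line followed by its space-separated vector components; the function name `load_glove` and the three-dimensional sample lines are made up for illustration.

```python
import numpy as np


def load_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ... vd") into a word -> vector dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Tiny in-memory sample so the sketch runs without the download.
# In the notebook, pass an open file handle for glove.6B.100d.txt instead,
# e.g. `with open(path, encoding="utf-8") as f: glove = load_glove(f)`,
# where `path` points to wherever you stored the file.
sample = ["the 0.1 0.2 0.3", "cat -0.5 0.0 0.25"]
glove = load_glove(sample)
print(glove["cat"])
```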

Read the dataset, tokenize, and pad:
    from gensim.utils import simple_preprocess
    tokens = list()
    for text in texts:
        tokens.append(simple_preprocess(text))
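
The padding step is not spelled out, so here is a minimal sketch of one way to turn token lists into fixed-length id sequences. The reserved ids (0 for padding, 1 for unknown words), the helper names, and the vocabulary cap are assumptions; the short hand-written `tokens` list stands in for the output of `simple_preprocess`.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary words (assumed convention)


def build_vocab(token_lists, max_words=20000):
    """Map the most frequent words to ids starting at 2."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(token_lists, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with PAD."""
    padded = []
    for toks in token_lists:
        ids = [vocab.get(t, UNK) for t in toks[:max_len]]
        padded.append(ids + [PAD] * (max_len - len(ids)))
    return padded


tokens = [["the", "cat", "sat"], ["the", "dog"]]
vocab = build_vocab(tokens)
encoded = encode_and_pad(tokens, vocab, max_len=4)
print(encoded)
```

The resulting list of equal-length id lists can be wrapped in `torch.tensor(...)` and fed to an embedding layer.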

Split and Load the data with dataloader https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

Construct a neural network with the following architecture:

(A 1D convolutional layer followed by a MaxPool1D) x3, followed by two linear layers. The first linear layer's in dimension must match the number of features it receives as input, and the last layer's out dimension must match the number of classes.
Define a forward pass with embedding, transpose, (ReLU, pool) x3, and softmax.
Code a training loop that runs for 3 epochs.
Define the loss and the optimizer (default is SGD; you could also try Adam).
Train the model.
Test the trained model under torch.no_grad().
You should get test-set accuracy of around 63%.
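
A possible shape for this text classifier is sketched below. The filter count, kernel size 5, and pool size 5 are assumptions, and the third pooling step uses `AdaptiveMaxPool1d(1)` (a max-pool over whatever length remains) so that the linear layers do not depend on the padded sequence length; a fixed `MaxPool1d` would also work if you compute the flattened size by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, n_filters=128, n_classes=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv1 = nn.Conv1d(embed_dim, n_filters, kernel_size=5)
        self.conv2 = nn.Conv1d(n_filters, n_filters, kernel_size=5)
        self.conv3 = nn.Conv1d(n_filters, n_filters, kernel_size=5)
        self.pool = nn.MaxPool1d(5)
        self.global_pool = nn.AdaptiveMaxPool1d(1)
        self.fc1 = nn.Linear(n_filters, n_filters)
        self.fc2 = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)                    # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                        # (batch, embed_dim, seq_len)
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.global_pool(F.relu(self.conv3(x)))  # (batch, n_filters, 1)
        x = x.squeeze(2)                             # (batch, n_filters)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)


# Shape check on random token ids; sequences need roughly 150+ tokens
# for three kernel-5 convolutions with two pool-5 steps in between.
torch.manual_seed(1)
model = TextCNN(vocab_size=500)
batch = torch.randint(0, 500, (2, 200))
out = model(batch)
print(out.shape)  # torch.Size([2, 20])
```

To start from GloVe, you could copy a pre-built embedding matrix (rows ordered by your vocabulary ids) into `model.embed.weight` before training; how that matrix is built depends on your vocabulary and is left out here.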

C) Extra Credit. Run these classifications for MNIST using a GPU

PROBLEM 3 [OPTIONAL no credit]: Autoencoders

For each of the datasets MNIST and 20NG (required) and SPAMBASE and FASHION (optional), train an autoencoder in PyTorch with a chosen hidden layer size K (try K = 5, 10, 20, 100, 200). What is the smallest K that works?

Load the data with dataloader https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

Construct an Autoencoder with the following architecture :

Two linear layers with in features matching the dimensions of input and out features matching the size of K
Two linear layers with in features matching K and size of out features matching the size of input dimensions.
Define a forward pass with relu
Code a train loop with number of epochs as 10.
Define loss and Optimizer (Adam)
Train the model

Use the GPU if available.
Use mean-squared error loss.
Create a model from the Autoencoder class and load it onto the specified device, either GPU or CPU.
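
Putting these pieces together, a minimal sketch of the autoencoder might look like this. The intermediate width `hidden_dim=128`, the learning rate, and the random stand-in batch are assumptions; the output layer is left linear here, which is one choice among several.

```python
import torch
import torch.nn as nn


class Autoencoder(nn.Module):
    def __init__(self, input_dim, k, hidden_dim=128):
        super().__init__()
        # Two linear layers down to the K-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, k), nn.ReLU(),
        )
        # Two linear layers back up to the input dimension.
        self.decoder = nn.Sequential(
            nn.Linear(k, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1)
model = Autoencoder(input_dim=784, k=20).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Stand-in batch of flattened 28x28 images; in the notebook, iterate over
# the DataLoader instead and flatten each batch with images.flatten(1).
x = torch.rand(64, 784, device=device)
for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), x)   # reconstruct the input itself
    loss.backward()
    optimizer.step()

codes = model.encoder(x)            # the new K-dimensional representation
print(codes.shape)
```

The `codes` tensor is the re-encoding that the verification steps refer to.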

Verify the obtained re-encoding of data (the new feature representation) in several ways:

repeat a classification train/test task, or a clustering task
examine the new pairwise distances dist(i,j) against the old distances obtained with the original features (sample 100 pairs of related words)
examine the top-20 neighbors (by new distance) and their set overlap with the old neighbors, per datapoint
for images, rebuild the image from the output layer and draw it to look at it
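
Two of these checks, the distance comparison and the neighbor overlap, can be sketched with NumPy as below. The function names are hypothetical, Euclidean distance is an assumption, and the pairs are sampled at random rather than chosen as related words.

```python
import numpy as np


def pairwise_dist(X):
    """Euclidean distance matrix between all rows of X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


def neighbor_overlap(X_old, X_new, k=20):
    """Per datapoint, fraction of its k nearest neighbors shared by both representations."""
    d_old, d_new = pairwise_dist(X_old), pairwise_dist(X_new)
    np.fill_diagonal(d_old, np.inf)  # a point is not its own neighbor
    np.fill_diagonal(d_new, np.inf)
    nn_old = np.argsort(d_old, axis=1)[:, :k]
    nn_new = np.argsort(d_new, axis=1)[:, :k]
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_old, nn_new)])


def distance_correlation(X_old, X_new, n_pairs=100, seed=0):
    """Pearson correlation between old and new distances over sampled pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_old), n_pairs)
    j = rng.integers(0, len(X_old), n_pairs)
    d_old = np.linalg.norm(X_old[i] - X_old[j], axis=1)
    d_new = np.linalg.norm(X_new[i] - X_new[j], axis=1)
    return np.corrcoef(d_old, d_new)[0, 1]


# Stand-in features: in the notebook, X_old would be the original features
# and X_new the encoder output for the same datapoints.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(50, 10))
X_new = rng.normal(size=(50, 3))
print(neighbor_overlap(X_old, X_new, k=5).mean())
print(distance_correlation(X_old, X_new))
```

An overlap near 1 and a correlation near 1 both suggest the encoding preserves the original neighborhood structure; values near chance suggest K is too small.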

PROBLEM 4 [OPTIONAL no credit]: Word Vectors Fine Tuning (subject to change)

Download general word embeddings and load them, for example Glove or Google word2vec pretrained

Starting with the downloaded word vectors, fine-tune them on the specific dataset by running a few iterations of a word-vector library such as gensim, mittens, or TensorFlow.

You can follow a tutorial such as
https://czarrar.github.io/Gensim-Word2Vec/
https://github.com/ashutoshsingh0223/mittens

You can pick your own text to fine-tune on, if it's reasonable in size and very domain-specific (compared to general English). Suggestions:
- Alice in Wonderland
- Sonnets
- specific categories (labels) from the 20NG or Reuters datasets
- use your favorite specific text (like a book, or project)

PROBLEM 5 [OPTIONAL no credit]: Image Feature Extraction

Run a Convolutional Neural Network in pytorch to extract image features. In practice the network usually does both the feature extraction and the supervised task (classification) in one pipeline.

PROBLEM 6 [OPTIONAL no credit]: LSTM for text

Run a Recurrent Neural Network/LSTM in PyTorch to model word dependencies/order in text. It can be used for translation, next-word prediction, event detection, etc.
LSTM article
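
For next-word prediction, a minimal LSTM sketch could look like the following. The class name, layer sizes, and the random token ids are all assumptions for illustration; a real run would use tokenized text and a full training loop.

```python
import torch
import torch.nn as nn


class NextWordLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq_len, hidden_dim)
        return self.out(h)          # logits over the vocabulary at each position


torch.manual_seed(1)
vocab_size = 50
model = NextWordLSTM(vocab_size)

# Random token ids stand in for real text: 4 sequences of 12 tokens.
seq = torch.randint(0, vocab_size, (4, 12))
logits = model(seq[:, :-1])         # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), seq[:, 1:].reshape(-1)
)
print(logits.shape, loss.item())
```

The inputs and targets are the same sequence shifted by one position, which is what lets the model learn word order.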