CS6220 Unsupervised Data Mining
HW4 PyTorch: Classification, Autoencoders, Word Embedding, Image Features, LSTM
Make sure you check the syllabus for the due date.
Try to be concise in both your answers and in your code.
DATASET : SpamBase : emails (54-feature vectors) classified as spam/not-spam
DATASET : 20 NewsGroups : news articles
DATASET : MNIST : 28x28 digit B/W images
DATASET : FASHION : 28x28 B/W images
https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist
PROBLEM 1: Setup a tensor library [Optional, no credit]
A) Setup your favorite tensor-based library for deep learning, such as PyTorch or TensorFlow, and familiarize yourself with its basic usage.
If using PyTorch, you can test if it is installed properly with (in Python):
>>> import torch
>>> print(torch.__version__)
# 1.13.1
B) Train a simple feed-forward neural network on the MNIST dataset with an 80/20 train/test split and report results.
First you need to design the architecture of the network. An example architecture for MNIST might look something like:
- Two 2D-Convolutional layers
- One Dropout layer
- Two Linear layers with ReLU activation
- Final Linear layer activated w/ softmax
- SGD optimizer w/ learning rate of 0.01 and Nesterov momentum
- Mini-batch size of 64
You should be able to achieve an accuracy of about 95% or higher, but your mileage may vary. Using the example architecture above, we were able to achieve a 95% test accuracy with a small number of epochs.
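For reference, here is a minimal PyTorch sketch of the example architecture above; the channel counts, hidden widths, and dropout rate are illustrative assumptions, not a required design. Note that nn.LogSoftmax paired with nn.NLLLoss is the numerically stable way to realize the final softmax:
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # first 2D convolution, 28x28 input
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second 2D convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Dropout(0.25),                             # one dropout layer
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),        # two linear layers w/ ReLU
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),                            # final linear layer
    nn.LogSoftmax(dim=1),                         # pair with nn.NLLLoss
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)  # Nesterov momentum
loss_fn = nn.NLLLoss()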
PROBLEM 2 : NNet supervised classification with tuned word vectors
Train a neural network on a sizeable subset of 20NG (say, at least 5 categories)
Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and do some basic simplification, e.g.
## read the dataset, tokenize, and pad
from gensim.utils import simple_preprocess
import torch

# tokenize each raw document in ng_text
tokens = [simple_preprocess(text) for text in ng_text]
# map each tokenized document to a padded list of GloVe vector indices
ng_vector_idx = torch.LongTensor([doc2ind(doc) for doc in tokens])
where `ng_vector_idx` is a `torch.LongTensor` holding the indices of the GloVe vectors from above, and `doc2ind` is a function you need to write.
Note that you should not form the matrix of word embeddings explicitly; simply store the vector indices representing the words in the text (see `torch.nn.Embedding` for more details).
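As a concrete starting point, here is a sketch of building the GloVe vocabulary and a `doc2ind` function; the file path, the fixed length MAX_LEN, and the extra all-zero padding row are assumptions you may change:
import torch

vocab, vectors = {}, []
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vocab[parts[0]] = len(vectors)            # word -> row index
        vectors.append([float(x) for x in parts[1:]])
vectors.append([0.0] * 100)                       # extra all-zero row used for padding
PAD = len(vectors) - 1

MAX_LEN = 200                                     # assumed fixed document length
def doc2ind(doc):
    # map one tokenized document to a padded list of GloVe row indices
    idx = [vocab[w] for w in doc if w in vocab][:MAX_LEN]
    return idx + [PAD] * (MAX_LEN - len(idx))

glove_vectors = torch.tensor(vectors, dtype=torch.float32)  # pass to nn.Embedding.from_pretrained below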
Parameterize an embedding layer for GloVe. In PyTorch, this looks something like:
from torch import nn
glove_emb = nn.Embedding.from_pretrained(< glove vectors from NG tags here >)
glove_emb.weight.requires_grad = False
Construct a neural network using the embedding layer. You're free to design the rest of the architecture.
For example, in PyTorch, the architecture code might look something like:
model = nn.Sequential(
    glove_emb,
    ...
    nn.Linear(..., num_classes),
    nn.Softmax(dim=1)
)
It's possible to get a test set accuracy around 63%.
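As one concrete way to fill in the `...` above (an illustrative sketch, not a required design), you could mean-pool the embedded token vectors before a linear classifier:
from torch import nn

class MeanPoolClassifier(nn.Module):
    def __init__(self, emb, num_classes):
        super().__init__()
        self.emb = emb                        # the frozen GloVe embedding layer
        self.fc = nn.Linear(emb.embedding_dim, num_classes)

    def forward(self, idx):                   # idx: (batch, MAX_LEN) LongTensor
        x = self.emb(idx).mean(dim=1)         # average the word vectors per document
        return self.fc(x)                     # logits; train with nn.CrossEntropyLoss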
Fine-tune the word vectors on the 20NG dataset by running a few iterations of a word2vec library such as gensim, mittens, or TensorFlow.
You can follow a tutorial such as
https://czarrar.github.io/Gensim-Word2Vec/
https://github.com/ashutoshsingh0223/mittens
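A minimal gensim sketch (assuming gensim 4.x and the `tokens` list from the snippet above; seeding the model with the GloVe vectors, as the mittens tutorial does, is omitted here):
from gensim.models import Word2Vec

w2v = Word2Vec(vector_size=100, window=5, min_count=5)
w2v.build_vocab(tokens)                       # tokenized 20NG documents
w2v.train(tokens, total_examples=w2v.corpus_count, epochs=5)
tuned_vectors = w2v.wv                        # the tuned KeyedVectors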
PROBLEM 3 [Optional, no credit]: Fine-tuning word vectors on your own text
You can pick your own text on which to fine-tune word vectors, if it is reasonable in size and very domain-specific (compared to general English). Suggestions:
- Alice in Wonderland
- Sonnets
- specific categories (labels) from the 20NG or Reuters datasets
- use your favorite specific text (like a book, or project)
PROBLEM 4 [Optional, no credit]: Autoencoders
For each of the datasets MNIST, 20NG (required) and SPAMBASE, FASHION (optional), train an autoencoder in PyTorch with a desired hidden layer size (try K = 5, 10, 20, 100, 200). What is the smallest K that works?
Load the data with a DataLoader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
Construct an Autoencoder with the following architecture (a code sketch follows this list):
- Two linear layers, with in_features matching the input dimension and out_features matching K (the encoder)
- Two linear layers, with in_features matching K and out_features matching the input dimension (the decoder)
- Define a forward pass with ReLU activations
- Code a training loop with the number of epochs set to 10
- Define the loss (mean-squared error) and the optimizer (Adam)
- Create a model from the Autoencoder class and load it onto the specified device: GPU if available, otherwise CPU
- Train the model
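A minimal sketch of this autoencoder, assuming flattened MNIST-sized inputs (784 features) and a DataLoader `train_loader` from the step above; the intermediate width of 128 is an assumption:
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, k=20):
        super().__init__()
        self.encoder = nn.Sequential(          # two linear layers: in_dim -> K
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, k), nn.ReLU())
        self.decoder = nn.Sequential(          # two linear layers: K -> in_dim
            nn.Linear(k, 128), nn.ReLU(),
            nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU if available
model = Autoencoder().to(device)
loss_fn = nn.MSELoss()                                   # mean-squared error
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(10):                                  # 10 epochs
    for x, _ in train_loader:
        x = x.view(x.size(0), -1).to(device)             # flatten each image
        loss = loss_fn(model(x), x)                      # reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()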
Verify the obtained re-encoding of the data (the new feature representation) in several ways:
- repeat a classification train/test task, or a clustering task
- examine the new pairwise distances dist(i,j) against the old distances obtained with the original features (sample 100 pairs of related datapoints)
- examine the overlap between each datapoint's top 20 neighbors by the new distance and its old neighbors (see the sketch after this list)
- for images, rebuild the image from the output layer and display it
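A sketch of the top-20 neighbor overlap check, assuming `old_X` holds the original features and `new_X` the K-dimensional codes from the encoder, both as (n, d) numpy arrays:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top20(X):
    # indices of each point's 20 nearest neighbors (excluding itself)
    _, idx = NearestNeighbors(n_neighbors=21).fit(X).kneighbors(X)
    return idx[:, 1:]

old_nn, new_nn = top20(old_X), top20(new_X)
overlap = np.mean([len(set(a) & set(b)) / 20.0
                   for a, b in zip(old_nn, new_nn)])
print(f"mean top-20 neighbor set overlap: {overlap:.2f}")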
PROBLEM 5 [Optional, no credit]: Image Feature Extraction
Run a Convolutional Neural Network in PyTorch to extract image features. In practice, the network usually does both the feature extraction and the supervised task (classification) in one pipeline.
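One common pattern is to chop the classifier head off a trained CNN and keep the penultimate activations as features; here is a sketch using torchvision's pretrained ResNet-18 (assuming torchvision >= 0.13 for the weights argument):
import torch
from torchvision import models

resnet = models.resnet18(weights="IMAGENET1K_V1").eval()
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer

with torch.no_grad():
    batch = torch.randn(8, 3, 224, 224)       # stand-in for a batch of real images
    feats = extractor(batch).flatten(1)       # (8, 512) feature vectors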
PROBLEM 6 [Optional, no credit]: LSTM for text
Run a Recurrent Neural Network/LSTM in PyTorch to model word dependencies/order in text. This can be used for translation, next-word prediction, event detection, etc.
LSTM article
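A minimal next-word prediction sketch; the vocabulary size and layer widths below are assumptions:
import torch
from torch import nn

class NextWordLSTM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, idx):                   # idx: (batch, seq_len) token indices
        h, _ = self.lstm(self.emb(idx))       # (batch, seq_len, hidden)
        return self.out(h)                    # next-word logits at each position

model = NextWordLSTM()
logits = model(torch.randint(0, 10000, (4, 12)))  # dummy batch: 4 sequences of 12 tokens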