CS6220 Unsupervised Data Mining

HW4 PyTorch: Classification, Autoencoders, Word Embedding, Image Features, LSTM

Make sure you check the syllabus for the due date.

Try to be concise in both your answers and in your code.


DATASET: SpamBase: emails (54-feature vectors) classified as spam/non-spam

DATASET: 20 NewsGroups: news articles

DATASET: MNIST: 28x28 B/W digit images

DATASET: FASHION: 28x28 B/W images of clothing items

https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist

PROBLEM 1: Set up a tensor library [Optional, no credit]

A) Set up your favorite tensor-based library for deep learning, such as PyTorch or TensorFlow, and familiarize yourself with its basic usage. If using PyTorch, you can test whether it is installed properly with (in Python):

>>> import torch
>>> print(torch.__version__)
# 1.13.1
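
For a quick look at basic usage, a few tensor operations (outputs shown as comments; the values are illustrative):

>>> x = torch.randn(3, 4, requires_grad=True)   # random 3x4 tensor, tracked by autograd
>>> y = torch.ones(4, 2)
>>> z = x @ y                                   # matrix multiplication
>>> print(z.shape)
# torch.Size([3, 2])
>>> z.sum().backward()                          # backpropagate a scalar through the graph
>>> print(x.grad.shape)
# torch.Size([3, 4])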

B) Train a simple feed-forward neural network on the MNIST dataset with an 80/20 train/test split, and report your results.

First you need to design the architecture of the network. A small fully-connected network (one or two hidden layers) is enough for MNIST; a sketch is given below. You should be able to achieve a test accuracy of about 95% or higher within a small number of epochs, but your mileage may vary.
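
A minimal sketch of such a network, assuming `torchvision` is used to load MNIST; the layer sizes, optimizer, and epoch count are illustrative choices, not requirements:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, transforms

    # load MNIST and split 80/20 into train and test sets
    data = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
    n_train = int(0.8 * len(data))
    train_set, test_set = random_split(data, [n_train, len(data) - n_train])
    train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=256)

    # simple feed-forward classifier: 784 -> 256 -> 10
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                          nn.Linear(256, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

    # report accuracy on the held-out 20%
    with torch.no_grad():
        correct = sum((model(x).argmax(dim=1) == y).sum().item() for x, y in test_loader)
    print(f"test accuracy: {correct / len(test_set):.3f}")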

PROBLEM 2 : NNet supervised classification with tuned word vectors

Train a neural network on a sizeable subset of 20NG (say, at least 5 categories).

  • Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and do some basic simplification, e.g.

  • ## read the dataset, tokenize and pad
    from gensim.utils import simple_preprocess
    import torch

    # tokenize each document into a list of lowercase word tokens
    tokens = list()
    for text in ng_text:
        tokens.append(simple_preprocess(text))

    # map each token list to a fixed-length list of GloVe vocabulary indices
    ng_vector_idx = torch.LongTensor([doc2ind(doc) for doc in tokens])

    where `ng_vector_idx` is a `torch.LongTensor` of integers giving the indices of the GloVe vectors above, and `doc2ind` is a function you need to write. Note that you should not form the matrix of word embeddings explicitly; store only the vector indices representing the words in each document (see `torch.nn.Embedding` for details). A sketch of one way to build the vocabulary and `doc2ind` follows.
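
    One possible construction, assuming the `glove.6B.100d.txt` file from the link above; `MAX_LEN`, `PAD_IDX`, and the file path are illustrative choices, not requirements:

    import numpy as np
    import torch

    MAX_LEN = 200          # truncate/pad every document to this many tokens
    PAD_IDX = 0            # reserve index 0 for padding

    # read GloVe: each line is a word followed by its 100-d vector
    words, vectors = [], []
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))

    # index 0 is an all-zero padding vector; real words start at index 1
    word2idx = {w: i + 1 for i, w in enumerate(words)}
    glove_matrix = torch.tensor(np.vstack([np.zeros(100, dtype=np.float32)] + vectors))

    def doc2ind(doc_tokens):
        """Map a list of tokens to MAX_LEN GloVe indices (unknown words dropped)."""
        idx = [word2idx[w] for w in doc_tokens if w in word2idx][:MAX_LEN]
        return idx + [PAD_IDX] * (MAX_LEN - len(idx))

    The resulting `glove_matrix` is one candidate for the tensor passed to `nn.Embedding.from_pretrained` in the next step.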

  • Parameterize an embedding layer for GloVe. With PyTorch, this looks something like:

    from torch import nn
    glove_emb = nn.Embedding.from_pretrained(< glove vectors from NG tags here >)
    glove_emb.weight.requires_grad = False


  • Construct a neural network using the embedding layer. You're free to design the architecture of the network after that. For example, in PyTorch, the architecture code might look something similar to the snippet below (a fuller sketch follows it):

    model = nn.Sequential(
       glove_emb,
       ...
       nn.Linear(..., num_classes),
       nn.Softmax(dim=1)
    )
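
    One way to fill in the middle, purely as an illustration (reusing the imports and `glove_emb` from above): average the GloVe vectors over each document and classify with one hidden layer; the layer sizes are arbitrary.

    class MeanPoolClassifier(nn.Module):
        def __init__(self, emb, hidden=128, num_classes=5):
            super().__init__()
            self.emb = emb                            # (frozen) GloVe embedding layer
            self.fc1 = nn.Linear(emb.embedding_dim, hidden)
            self.fc2 = nn.Linear(hidden, num_classes)

        def forward(self, idx):                       # idx: (batch, MAX_LEN) word indices
            x = self.emb(idx).mean(dim=1)             # average word vectors per document
            return self.fc2(torch.relu(self.fc1(x)))  # raw logits

    model = MeanPoolClassifier(glove_emb)

    Note that this variant returns raw logits, which is what `nn.CrossEntropyLoss` expects (it applies log-softmax internally); if you keep the final `nn.Softmax` from the snippet above, take its log (or use `nn.LogSoftmax`) and train with `nn.NLLLoss` instead.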

  • It's possible to get a test set accuracy around 63%.



  • Fine-tune the embeddings on 20NG by making your embedding layer trainable, i.e. by unfreezing its weights. After a sufficient amount of training, plot a 2D projection of the resulting embeddings colored by class, using a dimensionality reduction of your choice (PCA, MDS, t-SNE, etc.). Is there any perceptible difference between the embeddings before and after tuning? (A sketch of the unfreezing and projection steps follows the tutorial links below.)
  • You can follow a tutorial such as
    https://czarrar.github.io/Gensim-Word2Vec/
    https://github.com/ashutoshsingh0223/mittens
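
    A sketch of the unfreezing and projection steps, reusing `glove_emb` from above; PCA is used here, but any reduction works:

    # unfreeze the embedding layer and continue training so the vectors get tuned
    glove_emb.weight.requires_grad = True
    # ... re-run the training loop ...

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    # project the tuned vectors to 2D; coloring each word by the category in
    # which it occurs most often is left to you
    vecs = glove_emb.weight.detach().cpu().numpy()
    proj = PCA(n_components=2).fit_transform(vecs)
    plt.scatter(proj[:, 0], proj[:, 1], s=2)
    plt.show()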

PROBLEM 3 [Optional, no credit]: Fine-tune word vectors on your own text

You can pick your own text for fine-tuning word vectors, as long as it is reasonable in size and very domain-specific (compared to general English). Suggestions:
- Alice in Wonderland
- Sonnets
- specific categories (labels) from the 20NG or Reuters datasets
- your favorite specific text (like a book, or a project)
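
A minimal sketch that trains word vectors from scratch on a domain text with gensim (fine-tuning pretrained GloVe vectors instead requires a tool such as mittens, linked above); the file name is hypothetical:

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # each element of `corpus` is one tokenized line of the chosen text
    with open("alice.txt", encoding="utf-8") as f:
        corpus = [simple_preprocess(line) for line in f if line.strip()]

    # train 100-d word vectors on the domain text (gensim 4.x API)
    w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2, epochs=20)
    print(w2v.wv.most_similar("alice", topn=5))   # sanity-check the learned vectors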

PROBLEM 4 [Optional, no credit]: Autoencoders

For each of the datasets MNIST and 20NG (required) and SPAMBASE and FASHION (optional), train an autoencoder in PyTorch with a chosen hidden (bottleneck) layer size K (try K = 5, 10, 20, 100, 200). What is the smallest K that works?
  • Load the data with a DataLoader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
  • Construct an autoencoder with a bottleneck layer of size K (a sketch is given below).
  • Verify the obtained re-encoding of the data (the new feature representation) in several ways.
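
A minimal sketch for MNIST, assuming `torchvision` is used to load the data; the encoder/decoder widths and the value of K are illustrative:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    K = 20                                       # bottleneck size to experiment with
    data = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=128, shuffle=True)

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                            nn.Linear(256, K))
    decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(),
                            nn.Linear(256, 784), nn.Sigmoid())
    autoenc = nn.Sequential(encoder, decoder)

    opt = torch.optim.Adam(autoenc.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(10):
        for x, _ in loader:                      # labels are ignored
            opt.zero_grad()
            loss = loss_fn(autoenc(x), x.view(x.size(0), -1))
            loss.backward()
            opt.step()

    # the re-encoding (new feature representation) of a batch is simply encoder(x)
    codes = encoder(next(iter(loader))[0])       # shape (128, K)

One simple check of the re-encoding is to compare reconstructions to their inputs visually, or to train a classifier on `codes` and compare its accuracy to one trained on the raw pixels.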


PROBLEM 5 [Optional, no credit]: Image Feature Extraction

Run a Convolutional Neural Network in PyTorch to extract image features. In practice, the network usually performs both the feature extraction and the supervised task (classification) in one pipeline; a small sketch follows.
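
A minimal sketch for MNIST/FASHION; the output of the `features` block can be taken as the extracted image features (filter counts and kernel sizes are illustrative):

    from torch import nn

    # convolutional feature extractor: (batch, 1, 28, 28) -> (batch, 32*7*7)
    features = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
    )

    # classification head on top of the extracted features
    model = nn.Sequential(features, nn.Linear(32 * 7 * 7, 10))

Train the full `model` with `nn.CrossEntropyLoss` as in Problem 1; after training, `features(x)` gives the image-feature representation of a batch `x`.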

PROBLEM 6 [Optional, no credit]: LSTM for text

Run a Recurrent Neural Network/LSTM in PyTorch to model word dependencies/order in text. It can be used for translation, next-word prediction, event detection, etc.
See the LSTM article.
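
A minimal next-word-prediction sketch; the vocabulary size, embedding width, and hidden size are illustrative:

    import torch
    from torch import nn

    class NextWordLSTM(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, idx):                # idx: (batch, seq_len) word indices
            h, _ = self.lstm(self.emb(idx))    # h: (batch, seq_len, hidden)
            return self.out(h)                 # logits over the vocabulary at each position

    # train so that the output at position t predicts the word at position t+1
    model = NextWordLSTM(vocab_size=10000)
    x = torch.randint(0, 10000, (32, 20))      # a dummy batch: 32 sequences of 20 word indices
    logits = model(x[:, :-1])                  # predict from the first 19 positions
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), x[:, 1:].reshape(-1))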