CS6220 Unsupervised Data Mining

HW4 Pytorch: Classification, Autoencoders, Word Embedding, Image Features, LSTM

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.

- Write all the answers in this ipython notebook. Once you are finished, generate a PDF via (File -> Print -> Save as PDF) and upload it to Gradescope.
- **Important:** check your PDF before you turn it in to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes terminate the PDF creation routine early.
- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right. One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.


DATASET : SpamBase : emails (54-feature vectors) classified as spam/nospam

DATASET : 20 NewsGroups : news articles

DATASET : MNIST : 28x28 digit B/W images

DATASET : FASHION : 28x28 B/W images

https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/
https://www.kaggle.com/zalando-research/fashionmnist

PROBLEM 1: Pytorch Setup

The complete instructions for installing pytorch can be found at: https://colab.research.google.com/github/omarsar/pytorch_notebooks/blob/master/pytorch_quick_start.ipynb
We will use Google Colab for this homework. Start a new notebook and install pytorch, following the instructions on the webpage listed above.

- Test if pytorch is installed properly
$ python
# Following is testing code
>>> import torch
>>> print(torch.__version__)
1.12.1+cu113

We set up the development environment.

PROBLEM 2 : NNet supervised classification

Prelims :

A) For the MNIST dataset, train a simple neural network in supervised mode with train and test splits, and report results.
  • Load the data with a dataloader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
  • Construct a neural network with the following architecture :

B) For classification on 20NG
  • Upload the dataset to Google Drive and mount the drive in Colab:
      from google.colab import drive
      drive.mount('/content/drive')
  • Download GloVe embeddings from https://www.kaggle.com/datasets/rtatman/glove-global-vectors-for-word-representation?select=glove.6B.100d.txt and read them.
  • Read the dataset, tokenize and pad:
      from gensim.utils import simple_preprocess
      tokens = list()
      for text in texts:
          tokens.append(simple_preprocess(text))
  • Split and load the data with a dataloader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
  • Construct a neural network with the following architecture :

C) Extra Credit. Run these classifications for MNIST using a GPU.
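
As a starting point for parts A and C, here is a minimal sketch of a supervised training loop with a dataloader that runs on the GPU when Colab provides one. Everything specific in it is an assumption: the two-layer network is a hypothetical stand-in (not the architecture required by this assignment), the data are random tensors shaped like MNIST so the cell runs without a download, and the helper names `train_one_epoch` and `evaluate` are made up for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Use the GPU when one is available (part C), otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ASSUMPTION: random stand-in data with MNIST's shape (1x28x28, 10 classes).
# For the homework, swap in the real MNIST dataset instead.
X = torch.rand(512, 1, 28, 28)
y = torch.randint(0, 10, (512,))
train_loader = DataLoader(TensorDataset(X[:384], y[:384]), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(X[384:], y[384:]), batch_size=64)

# ASSUMPTION: a hypothetical architecture, not the one the assignment specifies.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def train_one_epoch(loader):
    """Run one pass over the loader and return the mean training loss."""
    model.train()
    total = 0.0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        total += loss.item() * len(xb)
    return total / len(loader.dataset)


def evaluate(loader):
    """Return the fraction of correctly classified examples."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
    return correct / len(loader.dataset)


for epoch in range(3):
    train_loss = train_one_epoch(train_loader)
test_accuracy = evaluate(test_loader)
print(f"train loss {train_loss:.3f}, test accuracy {test_accuracy:.3f}")
```

Because the stand-in labels are random, the test accuracy here should sit near chance; the point is only the structure (dataloaders, `.to(device)`, train/eval modes), which should carry over once you plug in the real data and the required architecture.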

PROBLEM 3 [OPTIONAL no credit]: Autoencoders

For each of the datasets MNIST, 20NG (required) and SPAMBASE, FASHION (optional), run an autoencoder in pytorch with a chosen hidden layer size (try K = 5, 10, 20, 100, 200). What is the smallest K that works?
  • Load the data with dataloader https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
  • Construct an Autoencoder with the following architecture :
  • Verify the obtained re-encoding of data (the new feature representation) in several ways:
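
A minimal sketch of the autoencoder setup in this problem, under stated assumptions: the single-hidden-layer `Autoencoder` class is hypothetical (it is not the architecture this assignment specifies), the input is random data with 784 features standing in for flattened MNIST, and K = 20 is just one of the sizes to try. It only shows how to train on reconstruction error and pull out the K-dimensional re-encoding; how you then verify that representation is up to the problem statement.

```python
import torch
from torch import nn

torch.manual_seed(0)


class Autoencoder(nn.Module):
    """Hypothetical one-hidden-layer autoencoder with a K-dimensional code."""

    def __init__(self, n_features, k):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, k), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(k, n_features), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))


# ASSUMPTION: random stand-in for flattened 28x28 images scaled to [0, 1].
X = torch.rand(256, 784)
K = 20

model = Autoencoder(n_features=784, k=K)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# The target is the input itself: the network learns to reconstruct X.
losses = []
for step in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# The new feature representation is the encoder output.
with torch.no_grad():
    codes = model.encoder(X)
    reconstruction = model(X)
print(codes.shape, reconstruction.shape)
```

To compare hidden sizes, you could rerun the same loop for each K and record the final reconstruction loss.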


PROBLEM 4 [OPTIONAL no credit]: Word Vectors Fine Tuning (subject to change)

  • Download general word embeddings and load them, for example GloVe or Google's pretrained word2vec
  • Starting with the downloaded word vectors, fine tune them on the specific dataset by running a few iterations of a word-vector library such as gensim, mittens, or TensorFlow
  • You can follow a tutorial such as
    https://czarrar.github.io/Gensim-Word2Vec/
    https://github.com/ashutoshsingh0223/mittens

You can pick your own text to fine tune on, if it is reasonable in size and very domain-specific (compared to general English). Suggestions:
- Alice in Wonderland
- Sonnets
- specific categories (labels) from the 20NG or Reuters datasets
- use your favorite specific text (like a book, or project)
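
For the "download and load" step, here is a small sketch of reading word vectors in the GloVe text format (one token per line, followed by its space-separated vector values) and lining them up with your own vocabulary. The helper names `load_glove` and `build_embedding_matrix` are hypothetical, and the three-dimensional vectors are made-up toy values so the cell runs without the real file.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines into a dict: word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i is the vector for vocab[i]; words missing from GloVe get zeros."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]


# ASSUMPTION: toy stand-in for a real file such as glove.6B.100d.txt.
toy_file = io.StringIO(
    "alice 0.1 0.2 0.3\n"
    "rabbit 0.4 0.5 0.6\n"
)
glove = load_glove(toy_file)

# Vocabulary taken from the text you want to fine tune on (toy example).
vocab = ["alice", "queen", "rabbit"]
matrix = build_embedding_matrix(vocab, glove, dim=3)
print(matrix)
```

With the real file you would pass an open file handle instead, for example `with open(path, encoding="utf-8") as f: glove = load_glove(f)`. Words in your text that are missing from the pretrained vectors (like "queen" in the toy data) are the ones fine tuning has to learn from scratch, so it is worth counting them first.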

PROBLEM 5 [OPTIONAL no credit]: Image Feature Extraction

Run a Convolutional Neural Network in pytorch to extract image features. In practice the network usually does both the feature extraction and the supervised task (classification) in one pipeline.
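
One way to see the "feature extraction plus classification in one pipeline" idea is to keep the convolutional part and the classifier as separate submodules, so you can read the features out on their own. The sketch below assumes 28x28 single-channel inputs (as in MNIST or FASHION); the `SmallCNN` class and its layer sizes are hypothetical choices, not a required architecture.

```python
import torch
from torch import nn


class SmallCNN(nn.Module):
    """Hypothetical CNN: a feature extractor followed by a linear classifier."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x28x28 -> 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 8x14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # -> 16x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 16x7x7
            nn.Flatten(),                                # -> 784 features
        )
        self.classifier = nn.Linear(16 * 7 * 7, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x))


model = SmallCNN()

# ASSUMPTION: a random batch standing in for four 28x28 grayscale images.
images = torch.rand(4, 1, 28, 28)
with torch.no_grad():
    feats = model.features(images)  # extracted image features
    logits = model(images)          # class scores from the full pipeline
print(feats.shape, logits.shape)
```

After training the whole model on the classification task, `model.features` can be applied to new images to get their feature vectors.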

PROBLEM 6 [OPTIONAL no credit]: LSTM for text

Run a Recurrent Neural Network/LSTM in pytorch to model word dependencies/order in text. This can be used for translation, next-word prediction, event detection, etc.
LSTM article
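
As an illustration of the next-word prediction use case, here is a minimal LSTM sketch. It assumes the text has already been turned into integer token ids with id 0 reserved for padding; the `NextWordLSTM` class, the vocabulary size of 50, and the layer sizes are all hypothetical, and the batch is random ids rather than real text.

```python
import torch
from torch import nn

torch.manual_seed(0)


class NextWordLSTM(nn.Module):
    """Hypothetical model: embed tokens, run an LSTM, score the next token."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden_states)  # (batch, seq_len, vocab_size)


vocab_size = 50
model = NextWordLSTM(vocab_size)

# ASSUMPTION: random token ids standing in for 8 tokenized sentences of length 12.
tokens = torch.randint(1, vocab_size, (8, 12))

# Predict token t+1 from tokens up to t: inputs drop the last token,
# targets drop the first.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(logits.shape, loss.item())
```

Training then follows the usual pattern: call `loss.backward()` and step an optimizer over batches of real token sequences.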