Your semester project
Each student will do a semester project for the course. It must involve developing software or using an existing AI application. Your project should focus on machine learning or natural language processing. (A project related to uncertainty is an option, but it would be difficult.)
Everyone will have to deal with one challenge: you will be working on your project before the relevant topics have been covered in class. So part of your effort, and you will get credit for this, is to learn the relevant aspects of the area you're working on as you go along. The major applications I have listed for the two topics have many tutorial documents and examples to guide you.
You will develop your project in three stages (see the Schedule for the due dates):
- In the first stage, V1, you will sketch three project possibilities, about 200 words each and with two or more references (total) to the literature, such as tutorial material.
- In the second stage, V2, you should be well along. Your code should be functioning at some level or you should be beginning to get some results using your chosen application. This document should be about 1,000 words long plus figures and tables and should include three or more references to the literature.
- The third and last stage is your Final Project, due Wednesday, March 13th. It should be about 1,500 words long plus figures and tables and should include four or more references to the literature.
In the week after you hand in V1, and again in the week after you hand in V2, I will schedule one-on-one meetings with each of you to discuss your project and give you advice to help you create a good project.
Google docs and the format of your project writeup
I have found that Google docs are very useful for developing your projects. Sharing is straightforward - share your doc with me via my course Google account, cs4100sp11@gmail.com. One advantage of Google docs is that you can see my comments the moment I write them - no need to wait to get a hardcopy handed back. You can hand in a hardcopy of your project, but only if it has such complex mathematics that doing it in Google docs would be tedious. I have used LaTeX or LaTeX-based apps to produce equations or equation images which can be inserted into a Google doc.
Guidelines for content and formatting:
- Your doc should begin with an informative title, then your name, a date, the course title and semester, and an indication of which project version it is.
- Sections should be given informative titles.
- Sentences and paragraphs should not be too long and paragraphs should be separated by a blank line, for readability.
- In choosing your references, you must avoid informal articles from the web, because they are so variable in quality. Instead, try to find books, or scholarly articles using Google Scholar or a similar source, or authoritative tutorials. Snell library has a growing collection of e-books and e-journals. I have put a number of useful books on Reserve for the course, books that can help you with your project. See the list here.
- Your references must be complete and properly formatted. Your best guide is to look at the Bibliography in the textbook, which contains a wide variety of types of items in its 1,800 citations. You can also check sites such as http://www2.liu.edu/cwis/cwp/library/workshop/citapa.htm
Project topic suggestions
Whatever you choose to do, you should explore tutorials, papers, books, etc. I also urge you to join mailing lists for your topic or system so you can look for answers to questions and ask questions yourself. For both topics below, you should get started quickly. Do not wait for me to get to the related material in the textbook. Some of it comes near the end of the course. I am here to help you get started.
Machine Learning
The UC Irvine Machine Learning Repository has many sample datasets for you to use. You can experiment with a variety of machine learning algorithms applied to some dataset you choose. The most popular datasets include the famous Iris set, wine, breast cancer, poker hand, car evaluation, and forest fires. You can experiment with the two basic forms of machine learning, supervised and unsupervised. Your guide in all this is the data mining book by Witten (about the Weka system) which is on reserve in the library. See the Resources page.
Typical projects - almost all should use the Weka system. You should get a thorough understanding of the statistics and performance measures and visualizations that Weka provides.
- Study the theory behind a technique, e.g., boosting, and use it in the Weka system on a variety of datasets.
- Delete some samples from a dataset to see how performance degrades and how well the trained system does on the original dataset.
- Compare two different learning algorithms on a variety of datasets to see which algorithm does the best for each dataset.
- Write your own learner, e.g., for decision lists, and evaluate its performance.
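To make the sample-deletion idea above concrete, here is a minimal sketch in plain Python rather than Weka. Everything in it is an assumption for illustration only: the toy two-class dataset stands in for a real repository dataset, the 1-nearest-neighbor learner stands in for whatever algorithm you choose, and the function names are hypothetical. It trains on a shrinking random subset and scores each trained learner against the full original dataset.

```python
import random


def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point` (1-NN)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label


def accuracy(train, test):
    """Fraction of `test` examples that a 1-NN learner built on `train` labels correctly."""
    correct = sum(
        1 for features, label in test
        if nearest_neighbor_predict(train, features) == label
    )
    return correct / len(test)


def make_toy_dataset(n_per_class, rng):
    """Two Gaussian blobs in 2-D; a hypothetical stand-in for a real dataset."""
    data = []
    for label, center in (("a", (0.0, 0.0)), ("b", (3.0, 3.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    return data


def deletion_experiment(data, keep_fractions, seed=0):
    """Train on a random subset of `data`, then test on the whole original set."""
    rng = random.Random(seed)
    results = {}
    for fraction in keep_fractions:
        n_keep = max(1, int(len(data) * fraction))
        reduced = rng.sample(data, n_keep)  # the samples that survive deletion
        results[fraction] = accuracy(reduced, data)
    return results


if __name__ == "__main__":
    full = make_toy_dataset(100, random.Random(42))
    fractions = [1.0, 0.5, 0.25, 0.1, 0.05]
    for fraction, acc in deletion_experiment(full, fractions).items():
        print(f"kept {fraction:>4.0%} of samples -> accuracy on original set {acc:.3f}")
```

In a real project you would repeat each deletion level with several random seeds and report the average, since a single random subset can be lucky or unlucky.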
Natural Language
A major site that has free corpora drawn from a variety of topic areas is the American National Corpus (ANC). The Open ANC corpora are what you want.
Many text sites advertise "free books" but most come with strings attached. Project Gutenberg is a legitimate site that has over 33,000 high-quality books. You can download them as plain text, suitable for natural language work.
There are many standard analyses you can do which can be the basis for a good project. They include:
- Sentence boundary detection.
- Entity extraction, e.g., DNA, USA, New York City, John Smith, General Electric, etc.
- n-grams and other frequency analyses, especially comparing texts from two different topic areas
- Word morphology
- Part-of-speech tagging
- Finding opinions and sentiment - how does text reflect people's positive or negative feelings about something?
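As one example of the analyses listed above, the n-gram comparison can be sketched with nothing but the Python standard library. This is a rough illustration, not a substitute for GATE or the Natural Language Toolkit: the regular-expression tokenizer is a deliberate simplification, the function names are hypothetical, and the two short sample strings are made up in place of real corpus texts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequencies(counts):
    """Convert raw counts to proportions so texts of different length compare fairly."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def distinctive_ngrams(text_a, text_b, n=2, top=5):
    """n-grams whose relative frequency in text_a most exceeds that in text_b."""
    freq_a = relative_frequencies(ngram_counts(tokenize(text_a), n))
    freq_b = relative_frequencies(ngram_counts(tokenize(text_b), n))
    differences = {gram: freq - freq_b.get(gram, 0.0) for gram, freq in freq_a.items()}
    # Sort by largest difference first; break ties alphabetically for stable output.
    return sorted(differences, key=lambda gram: (-differences[gram], gram))[:top]


if __name__ == "__main__":
    # Made-up sample sentences standing in for texts from two topic areas.
    cooking = "Stir the sauce and taste the sauce. Then stir the sauce again."
    sailing = "Trim the sail and watch the wind. Then trim the sail again."
    for gram in distinctive_ngrams(cooking, sailing):
        print(" ".join(gram))
```

With real texts, for instance two Project Gutenberg books, you would also want to decide how to treat punctuation, capitalization, and very common words before comparing frequencies.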
Demonstrating the use of GATE or the Natural Language Toolkit for some of the above problems will make a good project.
The Stanford Natural Language site has many software tools.
Some systems furnish APIs in addition to running out of the box. This will allow you to learn about AI programming, hands-on.
Some projects can use a mix of natural language processing and machine learning, e.g., separating sports stories from national news stories from international news stories.
Another project might use machine learning for uncertainty problems, using Bayesian approaches.
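The news-sorting idea and the Bayes idea can be combined in one small program. The sketch below is a multinomial naive Bayes text classifier with add-one smoothing, written in plain Python. It is only an illustration under stated assumptions: the class name, the category labels, and the six training headlines are all invented, and a real project would train on hundreds of actual stories and hold some out for testing.

```python
import math
import re
from collections import Counter, defaultdict

# Invented headlines used purely as toy training data.
TOY_HEADLINES = [
    ("The team won the championship game in overtime", "sports"),
    ("Star pitcher signs record contract with the club", "sports"),
    ("Congress passes the federal budget bill", "national"),
    ("Governor announces statewide election reforms", "national"),
    ("United Nations summit opens in Geneva", "international"),
    ("Foreign ministers sign treaty after border talks", "international"),
]


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts, with additive smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            tokens = self.tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        # Words never seen in training are skipped; they carry no evidence.
        tokens = [t for t in self.tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.smoothing * len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + self.smoothing) / denominator)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    classifier = NaiveBayesTextClassifier()
    classifier.train(TOY_HEADLINES)
    for headline in ("The club won the game", "Ministers sign a border treaty"):
        print(headline, "->", classifier.classify(headline))
```

Working in log probabilities avoids numeric underflow on long documents, and the smoothing term keeps a single unseen word from driving a category's probability to zero.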