CSG120 Artificial Intelligence - Spring 2008
Your Project Choices
Professor Futrelle
College of Computer and Information Sciences, Northeastern U., Boston, MA
Version of 28 December 2007
How we will proceed:
You are to choose your project from one of the many systems/areas below.
In the first lecture, I will go over the list below briefly. I'll then proceed with the two overview lectures outlining all topics covered in our textbook, "AIMA". In a required email to me between the first and second class, you must indicate which topics you might be interested in. Shortly after the second and last overview lecture, in another required email to me, you should narrow your focus, ideally to a single project topic.
You should then start working on it immediately. You'll need to get reading right away, starting with our textbook and then moving on to some of the many books on reserve. In some cases you'll be dealing with a system that you should download and install immediately, to make sure it will work on your platform.
The biggest mistakes by students in past course projects have been: 1. Not getting right to work, and not diving in and finding out as much as possible as early as possible, and 2. Not discussing questions about the project with me in person, by email, or during my IM office hours.
You should also be sure to read this introduction to your Project work.
The topics below cover a wide range of important AI areas. Some areas that you might expect to see are missing, for various reasons. They are not firmly excluded, but you would need special permission from me to work on them. Examples include: neural nets (too much a black box), genetic algorithms (just one of many search procedures), expert systems (usually not very deep AI involved), fuzzy systems (equivalent solutions can be built using non-fuzzy techniques), and games (board games are search-based, and player games usually involve a lot of physics and graphics). These approaches are given little space in our course textbook, which is another indication of their relative value. The textbooks I have seen that emphasize topics such as genetic algorithms and fuzzy systems typically ignore the large number of excellent standard AI approaches, and are misleading in that regard.
I have chosen topics that are mainstream AI and that I'm rather familiar with through my own research, systems, and publications. This means that I can be maximally helpful as your work proceeds. My research focus over the years has been on knowledge extraction from the biomedical literature, from the text and the figures. These are broad topics that have given me a chance to work with many different aspects of AI.
A project based on the Semantic Web applied to the Health Sciences
Health sciences and semantic web:
http://www.w3.org/2001/sw/hcls/notes/kb/
This health sciences / semantic web activity is centered in Cambridge, MA, with teleconferences held regularly with researchers around the world.
The health sciences are one of the largest, most important, and most complex content-based systems in our culture. Getting experience with them and with the Semantic Web can be quite useful. I am on this group's very active mailing list, so I have many references to the group's work.
Projects that use Biomed Central papers
The majority of knowledge available on the web is in a so-called "unstructured" format, typically text. Given the billions of pages of text out there, there is obviously an enormous amount of knowledge that can be mined, analyzed, learned from, and exploited for retrieval and to build more structured data/knowledge-bases.
An important source of full-text research papers in the biomedical domain is the journals of the publisher Biomed Central (BMC). They have published nearly 30,000 papers, all open access, which means that you can freely download the full text and figures to use in your projects. You might want to choose a project that involves working directly with me on some of my AI-based research. A number of students have done that in various courses I have taught. The two projects immediately below are examples.
- Analysis of sentence structures in BMC papers.
This project is one of discovery.
If you study the history of science, as I do, you'll learn that many great paths pursued in science began by simply collecting lots of samples (plants, animals, rocks, spectral lines, etc.) and then gradually making sense of them - work that sometimes lasted over many centuries. This is the classic and critically important "discovery" approach that began all good and important science.
In practically all work on "text mining", researchers start with a preconceived and limited set of sentence structures and "mine" only them.
This project will go back to the fundamental discovery paradigm.
In the project, you'll go through many hundreds of randomly chosen sentences from BMC and build simple descriptions of their structure and content, with no initial bias.
This will show us what scientists actually say in their papers, not just a subset chosen in advance. You do not need to understand the biology involved, since you'll be describing the "surface structure" of the sentences. I can guide you toward gradually seeing the greater order in what you discover. This can easily lead to a published paper.
- Text-figure relations.
A full understanding of any published figure in a paper requires an interplay between the graphical elements and the text describing the figure, both in the caption and in the paper proper. This project would attempt to discover the distribution and cross-references of knowledge in the full text-figure structures.
I am currently working with a PhD student on just this topic, but we are developing and using computational linguistics tools for the task, to augment our manual inspections.
This could also lead to a published paper.
- I have developed two useful search tools for BMC papers and captions. They are trivial web forms, but quite useful. One is a figure caption search and the other is a full text search.
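For the sentence-structure project above, a first pass might be sketched in Python along the following lines. This is only an illustration under stated assumptions: the sentence splitter is a naive regular expression rather than an NLTK or GATE component, the descriptor fields (n_tokens, first_word, has_number, has_parenthetical) are hypothetical names chosen here, and the sample text is invented rather than taken from a BMC paper.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (an assumption, not a real NLP tool): break after
    '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def describe(sentence):
    """Record a few surface features; the field names are hypothetical."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return {
        "n_tokens": len(tokens),
        "first_word": tokens[0].lower(),
        "has_number": any(t.isdigit() for t in tokens),
        "has_parenthetical": "(" in sentence and ")" in sentence,
    }


def sample_descriptions(text, k, seed=0):
    """Pick k sentences at random (fixed seed so the sample is repeatable)
    and describe each one, with no filtering by sentence type."""
    sentences = split_sentences(text)
    rng = random.Random(seed)
    chosen = rng.sample(sentences, min(k, len(sentences)))
    return [(s, describe(s)) for s in chosen]


# Invented example text, standing in for the body of a downloaded paper.
text = (
    "We measured expression in 12 samples. "
    "The results (Figure 2) show a clear trend. "
    "These data suggest that the gene is regulated. "
    "Is the effect dose dependent?"
)

for sentence, features in sample_descriptions(text, k=3):
    print(features, "|", sentence)
```

Sampling before describing is the point of the design: the descriptions are built for whatever sentences turn up, not for a list of patterns chosen in advance. In a real project the descriptors would be far richer and would be revised as the collection grows.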
Projects using major AI-related tools
Each of the AI tools listed below is a downloadable application that you can use on your own computer. Most are Java applications, so they should run without problems on Windows, Mac, or Linux. NLTK is Python-based.
If you decide to work with one of the systems below, you should join the mailing lists for it, or at least locate the mailing list archives and use them to answer various questions you might have. Each system typically has its own documentation, including FAQs, tutorials, lectures, etc.
- GATE - A comprehensive natural language system
GATE is a powerful system, in development since 1995, that has been adapted for a variety of problems by research groups around the world. From the GATE manual:
GATE is an infrastructure for developing and deploying software components that process human language. GATE helps scientists and developers in three ways:
- By specifying an architecture, or organizational structure, for language processing software;
- By providing a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications;
- By providing a development environment built on top of the framework made up of convenient graphical tools for developing components.
- NLTK - Natural Language Toolkit
"NLTK is a suite of Python modules organized in a shallow hierarchy. The hierarchy consists of core modules, which define the basic data types, and task modules, which each carry out a particular NLP task, e.g., parsing, tokenizing, and corpus-reading. The documentation describes the core modules: the first of these is the Token class, which maps a given unit of text to its properties, such as text and part of speech. Chunking and parsing also add properties to a text unit. In general, NLP tasks carried out by NLTK involve adding and modifying properties of Token objects. NLTK also includes modules for context-free grammars and probability calculations (for statistical NLP). The NLTK distribution includes the Brown corpus, a 5 percent sample of the Penn Treebank corpus, and 13 other corpora."
NLTK has a straightforward API, and the Python modules can be extended as needed. I did notice that Mac OS X 10.5.1, Leopard, includes the current production release of Python, v2.5.1.
- Protégé - An ontology development system
Here is an excerpt from one of the many introductions to the Protégé system: "This tutorial gives an introduction to Protege-Frames, an extensible, platform-independent environment for creating and editing ontologies and knowledge bases which allows users to begin designing an ontology quickly and intuitively. You will learn how to create, modify, and save a Protege-Frames project. You will create a project called "tutorial," which contains a few of the classes and slots from the newspaper example that accompanies the Protege-Frames installation. Once you have read the tutorial, you will be ready to explore Protege-Frames on your own."
It is important to note that the knowledge representation strategies in Protégé are closely related to the Semantic Web. So if you're interested in the Semantic Web, the health sciences project and Protégé are excellent ways to get started.
- Weka - Containing many machine learning algorithms with visualization tools
From chapter 8 of the Weka-based data mining book: "... a system called Weka, developed at the University of Waikato in New Zealand. "Weka" stands for the Waikato Environment for Knowledge Analysis. (Also, the weka, pronounced to rhyme with Mecca, is a flightless bird with an inquisitive nature found only on the islands of New Zealand.) The system is written in Java, an object-oriented programming language that is widely available for all major computer platforms, and Weka has been tested under Linux, Windows, and Macintosh operating systems. Java allows us to provide a uniform interface to many different learning algorithms, along with methods for pre- and postprocessing and for evaluating the result of learning schemes on any given dataset."
Almost any project you do on machine learning should involve both supervised and unsupervised techniques applied to the same data sets.
- Prover9 - A theorem-proving program
Prover9 is an automated theorem prover for first-order and equational logic, and Mace4 searches for finite models and counterexamples. Prover9 is the successor of the Otter prover.
Working with these systems is not for the faint of heart. But if you feel comfortable with logic and other advanced topics, you may find it interesting. Two students in my previous courses used it successfully. There is a web demo and you can download a copy. It installed instantly on my Mac, OS X 10.5.
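To make the Weka recommendation above concrete (supervised and unsupervised techniques applied to the same data), here is a small sketch in plain Python. It does not use Weka, which is a Java system; the nearest-centroid classifier and the k-means clustering are stand-ins chosen for brevity, the function names are hypothetical, and the six data points and their labels are made up.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def fit_centroids(points, labels):
    """Supervised step: use the labels to compute one centroid per class."""
    return {
        c: mean([p for p, l in zip(points, labels) if l == c])
        for c in sorted(set(labels))
    }


def predict(point, centroids):
    """Assign the class whose centroid is closest."""
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, iterations=10):
    """Unsupervised step: k-means (Lloyd's algorithm), ignoring labels.
    Seeding with evenly spaced points is an arbitrary choice for this sketch."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return [nearest(p, centers) for p in points]


# Invented toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

centroids = fit_centroids(points, labels)
clusters = kmeans(points, k=2)

print(predict((1.1, 0.9), centroids))  # class predicted for a new point
for cluster_id in sorted(set(clusters)):
    found = sorted({l for l, c in zip(labels, clusters) if c == cluster_id})
    print(cluster_id, found)  # which true labels landed in each cluster
```

Running both on the same data lets you ask whether the structure the clustering finds without labels agrees with the classes the supervised method was told about. On real data the two will often disagree, and explaining why is a good part of a project.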
Projects based on advanced exercises/material in books on reserve
You'll need to access these books from the Reserves, or better, purchase your own copies for any serious project. There are many books on Reserve for the course. Below are a few relevant ones for possible projects.
- AIMA - our textbook, Artificial Intelligence: A Modern Approach
The book states that some of the 385 exercises in the book are suitable for full course projects. Some of them also require investigations of the literature - something you must do for any project you propose and carry through.
On the AIMA website there is a huge collection of links to many other sites and resources. There are also substantial amounts of AI-related code, available in a number of languages.
- Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp (Norvig)
This text has a large number of exercises marked simple, medium, hard, and difficult. Those marked difficult can take days to complete and can be part of a substantial project on the topic of the chapter in which they appear. Lisp is a marvelous way to gain insight into AI, since the simple and integrated data/function syntax allows you to get right to the content without a lot of workarounds. Read the reviews on Amazon to discover just what an excellent book it is.
- Programming Collective Intelligence (Segaran)
This book, subtitled "Building Smart Web 2.0 Applications", is primarily about machine learning. It can suggest some substantial topics, as well as data from the Web that can be used. But you have to make sure that any project here is really AI and not just the implementation of some algorithm. The book Data Mining (Witten and Frank) is also focused on machine learning and is based on the Weka system described earlier.