This page describes two excellent sources of preprints and papers, arXiv.org, immediately below, and the NEC ResearchIndex, further down the page. Below there are a few examples of interesting and/or useful papers. There are many more where they came from.
The primary repository for preprints in computational linguistics is the site http://www.arXiv.org, or more specifically, the Computing Research Repository section of that site. Or you can go straight to the page of recent additions in Computation and Language, which is updated constantly.
Here are some recent papers I've found interesting for my work in BioNLP; they may also be of general interest.
A Decision Tree of Bigrams is an Accurate Predictor of Word Sense
by Ted Pedersen
Abstract: This paper presents a corpus-based approach to word sense
disambiguation where a decision tree assigns a sense to an ambiguous
word based on the bigrams that occur nearby. This approach is evaluated
using the sense-tagged corpora from the 1998 SENSEVAL word sense
disambiguation exercise. It is more accurate than the average results
reported for 30 of 36 words, and is more accurate than the best results
for 19 of 36 words.
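To make the idea concrete, here is a small sketch of the bigram-feature setup the abstract describes. It is not Pedersen's implementation: a depth-1 "decision stump" (pick the single most predictive nearby bigram) stands in for the full decision tree, and the window size and tie-breaking are my own illustrative choices.

```python
# Sketch: disambiguate a word by the bigrams that occur near it.
# A depth-1 decision stump stands in for the paper's decision tree.
from collections import Counter, defaultdict

def nearby_bigrams(tokens, target, window=2):
    """Bigram features within `window` tokens of each occurrence of target."""
    feats = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            span = tokens[lo:hi]
            feats.update(zip(span, span[1:]))
    return feats

def train_stump(examples, target):
    """Pick the single bigram whose presence best predicts a sense label."""
    by_bigram = defaultdict(Counter)          # bigram -> sense counts
    overall = Counter()
    for tokens, sense in examples:
        overall[sense] += 1
        for bg in nearby_bigrams(tokens, target):
            by_bigram[bg][sense] += 1
    default = overall.most_common(1)[0][0]    # fallback: most frequent sense
    best = max(by_bigram, key=lambda bg: max(by_bigram[bg].values()))
    return best, by_bigram[best].most_common(1)[0][0], default

def predict(stump, tokens, target):
    bigram, sense_if_present, default = stump
    return sense_if_present if bigram in nearby_bigrams(tokens, target) else default
```

With three toy training sentences for "bank", the stump latches onto a discriminative bigram such as ("the", "bank") and falls back to the majority sense otherwise; the real decision tree would keep splitting on further bigrams.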
I find the memory-based approaches (MBLP) particularly appropriate for the biology literature, which contains many "standard" phrases repeated over and over. An excellent introduction to the field can be found in Walter Daelemans' introduction to a special issue on the topic: Memory-Based Language Processing. Introduction to the Special Issue. In: Journal of Experimental and Theoretical AI (JETAI), 11:3, 1999. (Preprint). Here is a cached PDF version of his posted Postscript file.
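The core of memory-based learning is simple to sketch: store every training instance verbatim and classify a new instance by its most similar stored neighbor. The minimal version below uses a plain feature-overlap similarity; real systems such as TiMBL add feature weighting and k > 1 neighbors, so treat this strictly as an illustration of the idea.

```python
# Minimal memory-based (nearest-neighbor) classifier: no abstraction at
# training time, all work deferred to similarity lookup at classification.
def overlap(a, b):
    """Count matching feature values between two equal-length instances."""
    return sum(1 for x, y in zip(a, b) if x == y)

class MemoryBasedClassifier:
    def __init__(self):
        self.memory = []                      # (features, label), kept verbatim

    def learn(self, features, label):
        self.memory.append((tuple(features), label))

    def classify(self, features):
        # Return the label of the stored instance with greatest overlap.
        best = max(self.memory, key=lambda inst: overlap(inst[0], features))
        return best[1]
```

For example, with instances built from (previous word, word, next word) contexts, a new context that shares two of three features with a stored one inherits its label, which is exactly why repeated "standard" phrasing in the biology literature suits this family of methods.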
Here's an example of the MBLP approach, applied to shallow parsing.
(Shallow parsing does not attempt to decide on all attachments, conjunctive
structures and other larger structural aspects of sentences.)
Memory-Based Shallow Parsing by
Walter Daelemans, Sabine Buchholz, Jorn Veenstra
Abstract: We present a memory-based learning (MBL) approach to
shallow parsing in which POS tagging, chunking, and identification of
syntactic relations are formulated as memory-based modules. The
experiments reported in this paper show competitive results; the F-values
for the Wall Street Journal (WSJ) treebank are 93.8% for NP chunking,
94.7% for VP chunking, 77.1% for subject detection and 79.0% for object
detection.
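The chunking subtask mentioned in the abstract is usually cast as one IOB tag per token (B-NP begins a noun-phrase chunk, I-NP continues it, O is outside any chunk). This small helper, a sketch rather than the authors' code, recovers chunk spans from such a tag sequence:

```python
# Recover NP chunks from a per-token IOB tag sequence.
def iob_to_chunks(tokens, tags):
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-NP":                 # a new chunk starts; flush any open one
            if current:
                chunks.append(current)
            current = [tok]
        elif tag == "I-NP" and current:   # continue the open chunk
            current.append(tok)
        else:                             # O (or stray I-NP) closes any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```

Evaluating a chunker then reduces to comparing recovered spans against gold spans, which is where F-values like the 93.8% NP-chunking figure come from.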
Another example of shallow parsing is
A Learning Approach to Shallow Parsing
by Marcia Muñoz, Vasin Punyakanok, Dan Roth, Dav Zimak
from Proceedings of EMNLP-VLC'99, pages 168-178.
Abstract: A SNoW based learning approach to shallow parsing tasks
is presented and studied experimentally. The approach learns to identify
syntactic patterns by combining simple predictors to produce a coherent
inference. Two instantiations of this approach are studied and
experimental results for Noun-Phrases (NP) and Subject-Verb (SV) phrases
that compare favorably with the best published results are presented. In
doing that, we compare two ways of modeling the problem of learning to
recognize patterns and suggest that shallow parsing patterns are better
learned using open/close predictors than using inside/outside
predictors.
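The open/close idea the abstract favors can be pictured as follows: instead of tagging each token inside/outside a phrase, one predictor proposes positions where a phrase may open ("[") and another where it may close ("]"), and the two streams are then paired into spans. The sketch below assumes the open and close positions are already given by some pair of classifiers and uses a deliberately simplified greedy pairing, so it illustrates only the representation, not the paper's inference procedure.

```python
# Pair predicted open positions with predicted close positions into spans.
def pair_brackets(opens, closes):
    """Greedily pair each open with the nearest unused close at or after it."""
    phrases, used = [], set()
    for o in sorted(opens):
        for c in sorted(closes):
            if c >= o and c not in used:
                phrases.append((o, c))    # phrase spans tokens o..c inclusive
                used.add(c)
                break
    return phrases
```

One attraction of this representation is that the two predictors each solve a simpler, more local problem than a single inside/outside tagger, and the pairing step can enforce global consistency.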
Finding sentence boundaries is important. A frequently cited paper on the
topic is this one:
A Maximum Entropy Approach to Identifying Sentence Boundaries
by Jeffrey C. Reynar and Adwait Ratnaparkhi
which appeared in the 5th ANLP Conference, 1997
Abstract:
We present a trainable model for identifying sentence boundaries in raw
text. Given a corpus annotated with sentence boundaries, our model
learns to classify each occurrence of ., ?, and ! as either a valid or
invalid sentence boundary. The training procedure requires no
hand-crafted rules, lexica, part-of-speech tags, or domain-specific
information. The model can therefore be trained easily on any genre of
English, and should be trainable on any other Roman-alphabet language.
Performance is comparable to or better than the performance of similar
systems, but we emphasize the simplicity of retraining for new domains.
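The setup the abstract describes can be sketched in a few lines: treat every occurrence of ".", "?", or "!" as a candidate boundary and represent it by simple contextual features. In the paper those features feed a trained maximum-entropy model; in this illustration a hand-set heuristic decision (and a crude "short previous token" proxy for abbreviations) stands in for the learned weights.

```python
# Candidate extraction and toy features for sentence-boundary detection.
import re

def candidates(text):
    """Yield (index, features) for each potential sentence boundary."""
    for m in re.finditer(r"[.?!]", text):
        before = text[:m.start()].split()
        prev_tok = before[-1] if before else ""
        next_ch = text[m.end():].lstrip()[:1]
        yield m.start(), {
            "prev_short": len(prev_tok) <= 2,     # crude abbreviation proxy
            "next_capitalized": next_ch.isupper(),
            "end_of_text": next_ch == "",
        }

def is_boundary(feats):
    # Hand-set stand-in for the trained maxent decision.
    return (feats["next_capitalized"] or feats["end_of_text"]) \
        and not feats["prev_short"]
```

On "Dr. Smith arrived. He sat down." this correctly rejects the period in "Dr." and accepts the other two; the point of the paper is that a maxent model learns such feature interactions from an annotated corpus instead of relying on hand-crafted rules like these.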
The NEC ResearchIndex is a rich index of the Computer Science literature, with extensive automatically updated lists of citing documents, full-text search, downloadable full documents when available, etc. For an introduction to the site's many features see this page. To start searching for papers, go to this page.
An example of a useful entry in the ResearchIndex is Eric Brill's
highly cited 1995 paper on part-of-speech tagging. He also has a chapter
in Dale's recent book: Brill, E. 2000. Part-of-Speech Tagging, pp.
403-414. In R. Dale, H. Moisl, and H. Somers (eds.), Handbook of Natural
Language Processing. Marcel Dekker, New York. (a book which is mentioned
on this site's Literature page)
Abstract of the 1995 paper:
Recently, there has been a rebirth of empiricism in the field of natural language
processing. Manual encoding of linguistic information is being challenged by automated
corpus-based learning as a method of providing a natural language processing system with
linguistic knowledge. Although corpus-based approaches have been successful in many
different areas of natural language processing, it is often the case that these methods capture
the linguistic information they are modelling indirectly in large opaque tables of statistics.
This can make it difficult to analyze, understand and improve the ability of these
approaches to model underlying linguistic behavior. In this paper we will describe a simple
rule-based approach to automated learning of linguistic knowledge. This approach has
been shown for a number of tasks to capture information in a clearer and more direct
fashion without a compromise in performance. We present a detailed case study of this
learning method applied to part of speech tagging.
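Brill's transformation-based, error-driven learning is easy to sketch, which is part of its appeal. Start from a baseline tagging (e.g. each word's most frequent tag), then greedily adopt whichever rewrite rule most reduces the remaining errors. The version below is an illustration only, with the rule template restricted to "change tag X to Y when the previous tag is Z"; Brill's tagger uses a richer set of templates.

```python
# Toy transformation-based learner with a single rule template:
#   (from_tag, to_tag, prev_tag): change from_tag to to_tag after prev_tag.
def apply_rule(tags, rule):
    frm, to, prev = rule
    # Contexts are read from the pre-rule snapshot of the tag sequence.
    return [to if i > 0 and t == frm and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(gold, baseline, max_rules=3):
    tags, rules = list(baseline), []
    for _ in range(max_rules):
        # Candidate rules are generated directly from current mistakes.
        cands = {(tags[i], gold[i], tags[i - 1])
                 for i in range(1, len(tags)) if tags[i] != gold[i]}
        if not cands:
            break
        best = min(cands, key=lambda r: errors(apply_rule(tags, r), gold))
        if errors(apply_rule(tags, best), gold) >= errors(tags, gold):
            break                       # no rule yields a net improvement
        tags = apply_rule(tags, best)
        rules.append(best)
    return rules, tags
```

Given gold tags ["DT", "NN", "VBD"] for "the can rusted" and a baseline that tags "can" as a modal ("MD"), the learner recovers the readable rule ("MD", "NN", "DT"): retag a modal as a noun when it follows a determiner. That human-inspectable rule list is exactly the "clearer and more direct" representation the abstract contrasts with opaque tables of statistics.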