NTCIR Project

Overview
The NTCIR Workshop is a series of evaluation workshops designed to enhance research in Information Access (IA) technologies, including information retrieval, question answering, text summarization, extraction, etc. It was co-sponsored by the Japan Society for Promotion of Science (JSPS), as part of the JSPS “Research for Future” Program, and the National Center for Science Information Systems (NACSIS) since 1997; by JSPS and the Research Center for Information Resources at the National Institute of Informatics (RCIR/NII) in FY 2000; and by the MEXT Grant-in-Aid for Scientific Research on Priority Areas of “Informatics” (#13224087) and RCIR/NII in and after FY 2001.

The aims are:

  1. to encourage research in Information Access technologies by providing large-scale test collections reusable for experiments and a common evaluation infrastructure allowing cross-system comparisons
  2. to provide a forum for research groups interested in cross-system comparison and exchanging research ideas in an informal atmosphere
  3. to investigate evaluation methods of Information Access techniques and methods for constructing a large-scale data set reusable for experiments.

An evaluation workshop usually provides test collections (data sets usable for experiments) and unified evaluation procedures for experiment results. Each participating group conducts research and experiments using the common data provided by the NTCIR organizer with various approaches. The importance of reusable large-scale standard test collections in IA research has been widely recognized and an evaluation workshop is now recognized as a new style of active research project that facilitates research by providing the data and a forum for research idea exchange and technology transfer.

  • Author: K. L. Kwok
  • Citation: K. L. Kwok. Comparing Representations in Chinese Information Retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1997.
  • Data Set: 170 MB TREC-5 Chinese corpus of news articles, and 28 queries that are long and rich in wordings
  • References:
    • “Dictionaries are surprisingly expensive to build and maintain and bi-gram is surprisingly effective for Chinese.” - Walter Underwood of Netflix

Abstract

Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The retrieval collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, and 28 queries that are long and rich in wordings. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but is as good as short-word indexing in precision, and about 5% better in relevants retrieved. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach.
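The precision figure quoted in the abstract is an average over queries of non-interpolated average precision. As a minimal sketch of that measure (the function names and the toy document IDs in the usage note are illustrative assumptions, not taken from the paper):

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for a single ranked list.

    Precision is sampled at the rank of every relevant document that was
    retrieved; relevant documents that were never retrieved contribute 0.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average_precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking d1, d2, d3, d4 with d1 and d3 relevant scores (1/1 + 2/3) / 2, roughly 0.83.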

Introduction

While information retrieval (IR) in English has over thirty years of history, IR in Chinese is relatively recent. It is well-known that written Chinese consists of strings of ideographs separated by punctuation signs. An ideograph (or character) can function as a word with meaning(s), or it can act as an alphabet forming a ‘short-word’ with one or more adjacent characters and having more specific meaning. Short-words can be strung together to form ‘long-words’ that have more complex and precise meaning. Words with little content are stopwords and usually manifest as single ideograph or short-words. Determining the boundaries of single or multi-character words in a string, a process called segmentation, is difficult because no delimiter or white space is used in the text and one has to rely on context, see for example [ChKi92, JiCh95, WuTs95, NiBR96, PoCr96, SSGC96, SuSH97].

Since it has been known in English IR that retrieval using words can provide a sound basis from which other methodologies can improve on, it appears that successful segmentation of Chinese documents and queries into words for representation to diminish ambiguity may be an important initial step for Chinese IR. From a linguistics point of view, segmentation usually means determining the longest words with precise meaning in a character string. The corresponding English constructs for these long-words are often noun phrases. These long-words may be unambiguous and pleasing to read, but for retrieval they lead to the disadvantage of having to determine partial matching values when a query short-word matches part of a document long-word or vice versa. For IR purposes, where we are working at the content term level, it appears that using short-words for representation would be adequate as a first step.

Because text segmentation is not straightforward and the process itself can have ambiguous outcomes, several previous attempts make use of single characters or all bigrams (adjacent overlapping character pairs) as representation [Chie95,LiLy96,BGHR9x,BuSM9x] in Chinese and in Japanese [FuCr93,OgIw95]. These approaches are simple and efficient, provide exhaustive indexing, and do not rely on having a segmentation procedure nor large dictionaries. Thus, it would be interesting to compare the effectiveness of retrieval among these three types of representation. IR concerns the detection of relevant documents, not answering questions. Effective IR can be obtained using indexing features with appropriate statistical properties. Good indexing features may not necessarily be good linguistic constructs that are needed for human comprehension. Thus, if 1-gram or bigrams can provide comparable effectiveness, segmentation may not be as important as thought to be for IR….
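The 1-gram and bigram representations discussed above need no dictionary at all. A minimal sketch, assuming punctuation and white space delimit the character runs (the function name and sample sentence are illustrative, not from the paper):

```python
import re


def char_ngrams(text, n):
    """Overlapping character n-grams, never crossing punctuation or spaces.

    The text is first split into runs of word characters; Python's Unicode
    \\W class treats CJK punctuation such as '，' and '。' as separators.
    Runs shorter than n yield no n-grams in this sketch.
    """
    grams = []
    for run in re.split(r"\W+", text):
        for i in range(len(run) - n + 1):
            grams.append(run[i:i + n])
    return grams
```

Calling `char_ngrams("中国经济发展，很快。", 2)` gives the overlapping pairs 中国, 国经, 经济, 济发, 发展, 很快, which shows both the appeal (every two-character word is covered) and the cost (spurious pairs such as 国经 also enter the index).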

Conclusion

Within the TREC-5 Chinese retrieval environment using our PIRCS engine and where the queries are long and rich, bigram indexing is effective and performs as well as short-word indexing. The latter approach is however more efficient, dealing with an indexing term space that is about one third smaller….

  • Author: Paul McNamee

Pro: Asian Languages (1999)

  • Information Processing and Management 35(4) was devoted to IR in Asian Languages
    • Many Asian languages lack explicit word boundaries
  • Korean
    • Lee et al., KRIST Collection (13K docs)
      • 2-grams outperform words, decompounding cited
  • Chinese
    • Nie and Ren, TREC 5/6 Chinese Collection (165K docs)
      • 2-grams (0.4161 avg. prec.) comparable to words (0.4300)
      • Combination of both is best (0.4796)
  • Japanese
    • Ogawa and Matsuda, BMIR-J2 (5K docs)
      • M-grams (unigrams and bigrams) comparable to words

Reference: N-grams and Morpheme Analysis in IR

Previous: Why You Should Use N-grams for Multilingual Information Retrieval

Abstract

The purpose of this research is to identify scalable approaches that can handle large amount of training data such as several years of news articles, and automatically assign predefined category to Chinese free text documents. Our approach consists of the following processes: (i) term extraction, (ii) term selection, and (iii) document classification. The approach first builds a recently developed SB-tree to identify all repeated substrings, called patterns, from the text. We then proceed to identify possible boundary of terms appearing in the identified patterns. After terms are extracted from the training articles, we run term selection algorithms to select the most significant terms and to reduce the number of terms to an acceptable level. The selected terms are used by the classifier to assign a predefined category to each text document. Our current experiment uses CNA one year news as training data, which consists of 73,420 articles and is far more than previous related research. In the experiment, we implement and compare four term selection methods, the odds ratio method, the mutual information method, the information gain method and the χ²-test method, when they are combined with the naive Bayes classifier….

Introduction

Text categorization is the problem of automatically assigning predefined categories to free text documents, and is gaining more and more importance as the amount of text data available on World Wide Web grows dramatically. A well classified text database will be very helpful for a user to identify interesting data from the huge collection of texts. There are many studies about the text categorization as well as web-page classification [11, 3, 7, 8, 21, 25, 26, 18, 6, 5, 2, 10]. While there are a great number of researches on automatic text categorization for English texts, text categorization for Asian languages such as Chinese, Japanese, Korean and Thai has not been studied seriously until recently [17, 29, 1].

It is well known that written Asian language consists of strings of ideographs separated by punctuation signs. An ideograph (or character) can function as a word with meaning(s), or it can act as an alphabet to form a “word” with one or more adjacent characters. Determining the boundaries of single or multi-character words in a string, a process called segmentation [4], is very difficult because no delimiter or white space is used in the text and one has to rely on the context. Because text segmentation is not straightforward, 1-grams, 2-grams and n-grams have been used as indexing terms to represent documents in Asian languages. Among them, the 1-gram-based approach is the simplest: it uses single characters as indexing terms, and should be good for recall in information retrieval (IR) because it guarantees that if there are correct word matches between queries and documents, there will be 1-gram matches. However, single characters (1-grams) are ambiguous in meaning, which results in low precision in IR. A number of studies have proposed using n-grams, instead of 1-grams, as indexing terms. An n-gram is a sequence of n contiguous characters in the text. The 1-gram-based approaches [23] simply use every single character as a single term, the 2-gram-based approaches use every 2 contiguous characters as indexing terms, and the general n-gram-based approaches use all 1-grams, 2-grams, 3-grams, …, n-grams as indexing terms. Although 2-grams and n-grams perform similarly well, as indicated in our experiment, in this research we take n-grams, 1 < n < 10, as indexing terms because n-grams can catch the concept of a document. Notice that the possible number of n-grams in Chinese is extremely large, and furthermore many of them are meaningless and non-informative for text categorization. The major challenge is to develop an approach that can reduce the number of n-grams to an acceptable level while maintaining similar categorization accuracy.
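To make the size of the n-gram space concrete, the sketch below enumerates every n-gram with 1 < n < 10 and keeps only those that repeat. It uses a plain in-memory counter as a stand-in; it is not the SB-tree the authors build, and the function name, the `min_count` threshold and the sample strings are assumptions for illustration.

```python
from collections import Counter


def repeated_ngrams(docs, min_n=2, max_n=9, min_count=2):
    """Count all character n-grams (min_n..max_n) and keep the repeated ones.

    A Counter replaces the paper's SB-tree here, so this only suits small
    inputs; punctuation handling is deliberately left out of the sketch.
    """
    counts = Counter()
    for doc in docs:
        for n in range(min_n, max_n + 1):
            for i in range(len(doc) - n + 1):
                counts[doc[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

On the two strings "股票市场上涨" and "股票市场下跌" this keeps patterns such as 股票, 市场 and 股票市场 and drops one-off substrings such as 上涨; nested repeats like 票市 survive as well, which is the redundancy a later term-selection step has to remove.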

The purpose of this research is to identify scalable approaches that can handle large amount of training data such as several years of news articles, and automatically assign predefined category to Chinese free text documents….

Conclusions and Further Remarks

In this paper, we sketch an implementation of approaches that can handle large amount of training data such as several years of news articles, and automatically assign predefined category to Chinese free text documents. We implement a SB-tree-based approach to extract terms from the original text data, and develop a simple approach to remove redundant substrings. We also compare four term selection methods, the odds ratio method, the mutual information method, the information gain method and the χ²-test method, and use the naive Bayes classifier to evaluate their performance. Among the four feature selection methods, the χ²-test achieves the best performance.
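The χ²-test named above is commonly computed from a 2×2 term/category contingency table. The sketch below uses that standard formulation; whether the paper uses exactly this variant is not stated in the excerpt, and the function names and toy data are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_terms(docs, labels, category, k):
    """Return the k terms with the highest chi-square score for a category.

    docs is a list of term sets; labels is the parallel list of categories.
    Ties are broken alphabetically so the output is deterministic.
    """
    vocabulary = set().union(*docs) if docs else set()
    scored = []
    for term in vocabulary:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == category)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != category)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == category)
        d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y != category)
        scored.append((-chi_square(a, b, c, d), term))
    return [term for _, term in sorted(scored)[:k]]
```

Note that the statistic is symmetric: a term that occurs only outside the category scores as high as one that occurs only inside it, so both count as informative.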

Abstract
The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the area of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, especially for proper nouns, and the lack of a standardized orthography, especially in Japanese. This paper summarizes some of the major linguistic issues in the development of NLP applications that are dependent on lexical resources, and discusses the central role such resources should play in enhancing the accuracy of NLP tools.

1 Introduction
Developers of CJK NLP tools face various challenges, some of the major ones being:

  1. Identifying and processing the large number of orthographic variants in Japanese, and alternate character forms in CJK languages.
  2. The lack of easily available comprehensive lexical resources, especially lexical databases, comparable to the major European languages.
  3. The accurate conversion between Simplified and Traditional Chinese (Halpern and Kerman 1999).
  4. The morphological complexity of Japanese and Korean.
  5. Accurate word segmentation (Emerson 2000 and Yu et al. 2000) and disambiguating ambiguous segmentation strings (ASS) (Zhou and Yu 1994).
  6. The difficulty of lexeme-based retrieval and CJK CLIR (Goto et al. 2001).
  7. Chinese and Japanese proper nouns, which are very numerous, are difficult to detect without a lexicon.
  8. Automatic recognition of terms and their variants (Jacquemin 2001).

The various attempts to tackle these tasks by statistical and algorithmic methods (Kwok 1997) have had only limited success. An important motivation for such methodology has been the poor availability and high cost of acquiring and maintaining large-scale lexical databases.

This paper discusses how a lexicon-driven approach exploiting large-scale lexical databases can offer reliable solutions to some of the principal issues, based on over a decade of experience in building such databases for NLP applications.

2 Named Entity Extraction
Named Entity Recognition (NER) is useful in NLP applications such as question answering, machine translation and information extraction. A major difficulty in NER, and a strong motivation for using tools based on probabilistic methods, is that the compilation and maintenance of large entity databases is time consuming and expensive. The number of personal names and their variants (e.g. over a hundred ways to spell Mohammed) is probably in the billions. The number of place names is also large, though they are relatively stable compared with the names of organizations and products, which change frequently.

A small number of organizations, including The CJK Dictionary Institute (CJKI), maintain databases of millions of proper nouns, but even such comprehensive databases cannot be kept fully up-to-date as countless new names are created daily….

6 The Role of Lexical Databases
Because of the irregular orthography of CJK languages, procedures such as orthographic normalization cannot be based on statistical and probabilistic methods (e.g. bigramming) alone, not to speak of pure algorithmic methods. Many attempts have been made along these lines, as for example Brill (2001) and Goto et al. (2001), with some claiming performance equivalent to lexicon-driven methods, while Kwok (1997) reports good results with only a small lexicon and simple segmentor.

Emerson (2000) and others have reported that a robust morphological analyzer capable of processing lexemes, rather than bigrams or n-grams, must be supported by a large-scale computational lexicon. This experience is shared by many of the world’s major portals and MT developers, who make extensive use of lexical databases.

Unlike in the past, disk storage is no longer a major issue. Many researchers and developers, such as Prof. Franz Guenthner of the University of Munich, have come to realize that “language is in the data,” and “the data is in the dictionary,” even to the point of compiling full-form dictionaries with millions of entries rather than rely on statistical methods, such as Meaningful Machines who use a full form dictionary containing millions of entries in developing a human quality Spanish-to-English MT system.

CJKI, which specializes in CJK and Arabic computational lexicography, is engaged in an ongoing research and development effort to compile CJK and Arabic lexical databases (currently about seven million entries), with special emphasis on proper nouns, orthographic normalization, and C2C. These resources are being subjected to heavy industrial use under real-world conditions, and the feedback thereof is being used to further expand these databases and to enhance the effectiveness of the NLP tools based on them.

7 Conclusions
Performing such tasks as orthographic normalization and named entity extraction accurately is beyond the ability of statistical methods alone, not to speak of C2C conversion and morphological analysis. However, the small-scale lexical resources currently used by many NLP tools are inadequate to these tasks. Because of the irregular orthography of the CJK writing systems, lexical databases fine-tuned to the needs of NLP applications are required. The building of large-scale lexicons based on corpora consisting of even billions of words has come of age. Since lexicon-driven techniques have proven their effectiveness, there is no need to overly rely on probabilistic methods. Comprehensive, up-to-date lexical resources are the key to achieving major enhancements in NLP technology.

Conclusions (monolingual)

  • Words or bigrams seem to present the same performance for JA & ZH (words slightly better for ZH)
  • Bigrams seem the best approach for KR (to verify!)
  • Processing space / time

  • Authors: James Mayfield and Paul McNamee

Abstract

The complexity of human language makes accessing multilingual information difficult. Most cross-language retrieval systems attempt to address linguistic variation through language-specific techniques and resources. In contrast, APL has developed a multilingual information retrieval system, the Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT), which incorporates five language-neutral techniques: character n-gram tokenization, a proprietary term similarity measure, a language model document similarity metric, pre-translation query expansion, and exploitation of parallel corpora. Through extensive empirical evaluation on multiple internationally developed test sets we have demonstrated that the knowledge-light, language-neutral approach used in HAIRCUT can achieve state-of-the-art retrieval performance. In this article we discuss the key techniques used by HAIRCUT and report on experiments verifying the efficacy of these methods.

Conclusion

Through participation in the TREC, CLEF, and NTCIR evaluations, retrieval performance was investigated using document collections in Arabic, Chinese, Dutch, English, Finnish, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Swedish. We have found overwhelming support for the contention that high performance is possible without dependence on language-specific approaches. The HAIRCUT system has been developed using five language-neutral techniques: n-gram tokenization, affinity sets, a language model similarity metric, pre-translation query expansion, and exploitation of parallel collections. These techniques are effective across a wide range of languages as evidenced by HAIRCUT’s consistently high performance in international evaluations. We believe that the techniques described here can help intelligence analysts handle future crises—whatever the language requirements.
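Two of the language-neutral techniques named above, character n-gram tokenization and a language-model similarity metric, can be combined in a few lines. The sketch below is a generic query-likelihood ranker with linear-interpolation smoothing; the excerpt does not give HAIRCUT's actual formula or parameters, so the n-gram length of 4, the smoothing weight `lam` and all function names are assumptions.

```python
import math
from collections import Counter


def ngrams(text, n=4):
    """Lower-cased, whitespace-normalised character n-grams (spaces kept)."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def rank(query, docs, n=4, lam=0.5):
    """Rank document indices by smoothed log query likelihood.

    Each query n-gram t contributes log(lam * P(t|doc) + (1 - lam) * P(t|collection)).
    N-grams unseen in the whole collection are skipped for every document.
    """
    doc_counts = [Counter(ngrams(d, n)) for d in docs]
    collection = Counter()
    for counts in doc_counts:
        collection.update(counts)
    collection_total = sum(collection.values())

    scores = []
    for idx, counts in enumerate(doc_counts):
        total = sum(counts.values())
        score = 0.0
        for t in ngrams(query, n):
            p_doc = counts[t] / total if total else 0.0
            p_coll = collection[t] / collection_total if collection_total else 0.0
            p = lam * p_doc + (1 - lam) * p_coll
            if p > 0:
                score += math.log(p)
        scores.append((-score, idx))
    return [idx for _, idx in sorted(scores)]
```

Because the tokenizer never consults a word list or stemmer, the same code runs unchanged on any language that can be represented as a character string, which is the point of the knowledge-light approach.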

Reference: The HAIRCUT Information Retrieval System

  • Authors: Gemma Boleda, Toni Badia, Sabine Schulte im Walde
  • Email: (gemma.boleda|toni.badia)@upf.edu, schulte@CoLi.Uni-SB.DE
  • Citation: Gemma Boleda, Toni Badia and Sabine Schulte im Walde. Morphology vs. Syntax in Adjective Class Acquisition. Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition; 77–86. 2005.

Abstract
This paper discusses the role of morphological and syntactic information in the automatic acquisition of semantic classes for Catalan adjectives, using decision trees as a tool for exploratory data analysis. We show that a simple mapping from the derivational type to the semantic class achieves 70.1% accuracy; syntactic function reaches a slightly higher accuracy of 73.5%. Although the accuracy scores are quite similar with the two resulting classifications, the kinds of mistakes are qualitatively very different. Morphology can be used as a baseline classification, and syntax can be used as a clue when there are mismatches between morphology and semantics.

Introduction
This paper fits into a broader effort addressing the automatic acquisition of semantic classes for Catalan adjectives. So far, no established standard of such semantic classes is available in theoretical or empirical linguistic research. Our aim is to reach a classification that is empirically adequate and theoretically sound, and we use computational techniques as a means to explore large amounts of data which would be impossible to explore by hand to help us define and characterise the classification….

Conclusion and future work
In this paper, we have presented and discussed the role of two sources of evidence for the automatic classification of adjectives into ontological semantic classes: morphology and syntax. Both levels provide relevant information, as indicated by their respective accuracy results (70.1% for morphology, 73.8% for syntax), both well above a majority baseline (46.8%). Morphology fails in cases of noncompositional meaning, when the relationship to the deriving word has been lost, cases that syntax tends to correctly classify. In contrast, syntax systematically confuses event and basic adjectives due to the lack of a sufficiently distinct syntactic profile of the event class. Therefore, the default morphology-semantics mapping handles these cases better.

Not surprisingly, the best classifier is obtained combining both kinds of information (74.7%), although it is not even 1% better than the syntactic classifier. More research is needed to achieve better ways of combining both levels of description.

We can summarise our results as indicating that morphology can give a reliable initial hypothesis with respect to the semantic class of an adjective, which syntax can refine in cases of noncompositional meaning, particularly for object adjectives. Therefore, morphology can be used as a baseline in future classification experiments….

  • Authors: Helmut Berger and Dieter Merkl

Abstract

This paper gives an analysis of multi-class e-mail categorization performance, comparing a character n-gram document representation against a word-frequency based representation. Furthermore the impact of using available e-mail specific meta-information on classification performance is explored and the findings are presented.

Introduction

The task of automatically sorting documents of a document collection into categories from a predefined set, is referred to as Text Categorization. Text categorization is applicable in a variety of domains: document genre identification, authorship attribution, survey coding to name but a few. One particular application is categorizing e-mail messages into legitimate and spam messages, i.e. spam filtering. Androutsopoulos et al. compare in [1] a Naive Bayes classifier against an Instance-Based classifier to categorize e-mail messages into spam and legitimate messages, and conclude that these learning-based classifiers clearly outperform simple anti-spam keyword approaches. However, sometimes it is desired to classify e-mail messages in more than two categories. Consider, for example an e-mail routing application, which automatically sorts incoming messages according to their content and routes them to receivers that are responsible for a particular topic. The study presented herein compares the performance of different text classification algorithms in such a multi-class setting.

By nature, e-mail messages are short documents containing misspellings, special characters and abbreviations. This entails an additional challenge for text classifiers to cope with “noisy” input data. To classify e-mail in the presence of noise, a method used for language identification is adapted in order to statistically describe e-mail messages. Specifically, character-based n-gram frequency profiles, as proposed in [2], are used as features which represent each particular e-mail message. The comparison of the performance of categorization algorithms using character-based n-gram frequencies as elements of feature vectors with respect to multiple classes is described. The assumption is, that applying text categorization on character-based n-gram frequencies will outperform word-based frequency representations of e-mails. In [3] a related approach aims at authorship attribution and topic detection. They evaluate the performance of a Naive Bayes classifier combined with n-gram language models. The authors mention, that the character-based approach has better classification results than the word-based approach for topic detection in newsgroups. Their interpretation is that the character-based approach captures regularities that the word-based approach is missing in this particular application.
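A character n-gram frequency representation of the kind described above can be sketched as follows. The n-gram lengths (1 to 3), the underscore padding, the profile size and the function names are assumptions for illustration; the excerpt does not state the settings actually used in the study.

```python
from collections import Counter


def ngram_counts(text, n_values=(1, 2, 3)):
    """Count character n-grams, with '_' marking word boundaries."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in n_values:
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def build_profile(texts, size=300):
    """The `size` most frequent n-grams over a corpus, most frequent first."""
    total = Counter()
    for text in texts:
        total.update(ngram_counts(text))
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def to_vector(text, profile):
    """Relative frequencies of the profile n-grams in one message."""
    counts = ngram_counts(text)
    total = sum(counts[g] for g in profile) or 1
    return [counts[g] / total for g in profile]
```

The resulting vectors can be fed to any standard classifier. Misspelled or abbreviated words still share most of their n-grams with the intended word, which is the motivation for trying this representation on noisy e-mail text.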

Besides the content contained in the body of an e-mail message, the e-mail header holds useful data that has impact on the classification task. This study explores the influence of header information on classification performance thoroughly. Two different representations of each e-mail message were generated: one that contains all data of an e-mail message and a second, which consists of textual data found in the e-mail body. The impact on classification results when header information is discarded is shown….

Conclusion

In this paper, the results of three text categorization algorithms are described in a multi-class categorization setting. The algorithms are applied to character n-gram frequency statistics and a word frequency based document representation. A corpus consisting of multi-lingual e-mail messages which were manually split into multiple classes was used. Furthermore, the impact of e-mail meta-information on classification performance was assessed.

The assumption, that a document representation based on character n-gram frequency statistics boosts categorization performance in a “noisy” domain such as e-mail filtering, could not be verified….

Reference: A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

  • Authors: Fuchun Peng, Fangfang Feng, Andrew McCallum

Abstract
Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the integration of domain knowledge in the form of multiple lexicons of characters and words. We also present a probabilistic new word detection method, which further improves performance. Our system is evaluated on four datasets used in a recent comprehensive Chinese word segmentation competition. State-of-the-art performance is obtained.

Introduction
Unlike English and other western languages, many Asian languages such as Chinese, Japanese, and Thai, do not delimit words by white-space. Word segmentation is therefore a key precursor for language processing tasks in these languages. For Chinese, there has been significant research on finding word boundaries in unsegmented sequences (see (Sproat and Shih, 2002) for a review). Unfortunately, building a Chinese word segmentation system is complicated by the fact that there is no standard definition of word boundaries in Chinese.

Approaches to Chinese segmentation fall roughly into two categories: heuristic dictionary-based methods and statistical machine learning methods. In dictionary-based methods, a predefined dictionary is used along with hand-generated rules for segmenting input sequence (Wu, 1999). However these approaches have been limited by the impossibility of creating a lexicon that includes all possible Chinese words and by the lack of robust statistical inference in the rules. Machine learning approaches are more desirable and have been successful in both unsupervised learning (Peng and Schuurmans, 2001) and supervised learning (Teahan et al., 2000).
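As an illustration of the dictionary-based family, the sketch below implements forward maximum matching, a common textbook heuristic. It is not the specific method of Wu (1999), and the tiny lexicon in the usage note is hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation by longest dictionary match.

    At each position the longest lexicon entry (up to max_len characters)
    is taken; if nothing matches, a single character is emitted.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens
```

With the lexicon {中国, 经济, 发展, 中国经济}, the string "中国经济发展很快" is split into 中国经济 / 发展 / 很 / 快. The out-of-lexicon word 很快 falls apart into single characters, which illustrates the coverage limitation described above.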

Many current approaches suffer from either lack of exact inference over sequences or difficulty in incorporating domain knowledge effectively into segmentation. Domain knowledge is either not used, used in a limited way, or used in a complicated way spread across different components. For example, the N-gram generative language modeling based approach of Teahan et al (2000) does not use domain knowledge. Gao et al (2003) uses class-based language for word segmentation where some word category information can be incorporated. Zhang et al (2003) use a hierarchical hidden Markov Model to incorporate lexical knowledge. A recent advance in this area is Xue (2003), in which the author uses a sliding-window maximum entropy classifier to tag Chinese characters into one of four position tags, and then convert these tags into a segmentation using rules. Maximum entropy models give tremendous flexibility to incorporate arbitrary features. However, a traditional maximum entropy tagger, as used in Xue (2003), labels characters without considering dependencies among the predicted segmentation labels that are inherent in the state transitions of finite-state sequence models.
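The tag-then-convert scheme mentioned above can be illustrated with the conversion step alone. The tag names B (begin), M (middle), E (end) and S (single-character word) are an assumed labelling for the four positions, and the recovery rule for malformed tag sequences is a choice made for this sketch rather than something taken from Xue (2003).

```python
def tags_to_words(chars, tags):
    """Turn per-character position tags into a list of words.

    A word is closed after an E or S tag. If a B or S tag arrives while a
    word is still open (a malformed sequence), the open word is closed first.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

For "中国经济发展很快" tagged B E B E B E S S, the result is 中国 / 经济 / 发展 / 很 / 快. Because each tag is predicted independently in a plain classifier, sequences such as B followed directly by B can occur, which is the label-dependency problem the paragraph raises.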

Linear-chain conditional random fields (CRFs) (Lafferty et al., 2001) are models that address both issues above….

Conclusions
The contribution of this paper is three-fold. First, we apply CRFs to Chinese word segmentation and find that they achieve state-of-the-art performance. Second, we propose a probabilistic new word detection method that is integrated in segmentation, and show it to improve segmentation performance. Third, as far as we are aware, this is the first work to comprehensively evaluate on the four benchmark datasets, making a solid baseline for future research on Chinese word segmentation.

Reference: Chinese Segmentation and New Word Detection using Conditional Random Fields
