Novosibirsk State University Journal of Information Technologies
Scientific Journal

ISSN 2410-0420 (Online), ISSN 1818-7900 (Print)

Clustering of text documents based on composite key terms
Vladimir Borisovich Barakhnin, Dmitry Aleksandrovich Tkachev

Institute of Computational Technologies SB RAS
Novosibirsk State University

UDC code: 340.11(3)

The classical approach to coordinate indexing of texts for subsequent clustering relies on analysis tools built on a thesaurus of the subject area under consideration. However, processing texts on rather narrow topics requires very detailed thesauri, which do not exist (or at least are not widely available) for every subject field. An approach based on extracting key phrases without a priori constraints is much more universal, but it raises the problem of selecting the key terms. The purpose of this article is to demonstrate the practical advantages of clustering documents by key phrases over the very popular clustering based on one-word key terms alone. The key phrases are extracted with publicly available software tools that do not require significant computational resources.
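The contrast between one-word and composite key terms can be illustrated with a minimal sketch. It is not the authors' method or tooling: adjacent word pairs stand in for key phrases, Jaccard overlap stands in for the similarity measure a clustering step would use, and the stopword list, function names, and sample documents are all hypothetical.

```python
import re

# Assumed toy stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def single_terms(text):
    """One-word key terms: the set of remaining tokens."""
    return set(tokens(text))

def composite_terms(text):
    """Adjacent token pairs, a crude stand-in for extracted key phrases."""
    t = tokens(text)
    return set(zip(t, t[1:]))

def jaccard(a, b):
    """Share of terms common to both sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical documents: d1 and d2 share a topic, d3 only shares words.
docs = {
    "d1": "neural network training for image recognition",
    "d2": "image recognition with a neural network",
    "d3": "social network analysis and image of the brand",
}

for x, y in [("d1", "d2"), ("d1", "d3")]:
    s = jaccard(single_terms(docs[x]), single_terms(docs[y]))
    c = jaccard(composite_terms(docs[x]), composite_terms(docs[y]))
    print(f"{x}-{y}: single={s:.2f} composite={c:.2f}")
```

In this toy data the unrelated pair d1 and d3 still overlaps on the single words "network" and "image", while their composite terms share nothing; the related pair d1 and d2 keeps a nonzero overlap under both representations.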

Key Words
composite key terms, coordinate indexing, clustering text documents

Barakhnin V. B., Tkachev D. A. Clustering of text documents based on composite key terms // Vestnik NSU Series: Information Technologies. - 2010. - Volume 08, Issue No 2. - P. 5-14. - ISSN 1818-7900. (in Russian).

Main title: Vestnik NSU Series: Information Technologies, Volume 08, Issue No 2 (2010).
Parallel title: Novosibirsk State University Journal of Information Technologies Volume 08, Issue No 2 (2010).

Key title: Vestnik Novosibirskogo gosudarstvennogo universiteta. Seriâ: Informacionnye tehnologii
Abbreviated key title: Vestn. Novosib. Gos. Univ., Ser.: Inf. Tehnol.
Variant title: Vestnik NGU. Seriâ: Informacionnye tehnologii

Year of Publication: 2010
ISSN: 1818-7900 (Print), ISSN 2410-0420 (Online)
Publisher: Novosibirsk State University Press
