Novosibirsk State University Journal of Information Technologies
Scientific Journal

ISSN 2410-0420 (Online), ISSN 1818-7900 (Print)

Volume 08, Issue No 2 (2010)

Clustering of text documents based on composite key terms
Vladimir Borisovich Barakhnin, Dmitry Aleksandrovich Tkachev

Institute of Computational Technologies SB RAS
Novosibirsk State University

UDC code: 340.11(3)

The classical approach to coordinate indexing of texts for subsequent clustering relies on analysis tools based on a thesaurus of the subject area in question. Processing texts on rather narrow topics, however, requires very detailed thesauri, which do not exist (or at least are not widely available) for every subject field. An approach based on extracting key phrases without a priori constraints is much more universal, but it raises the problem of selecting key terms. The purpose of this article is to demonstrate the practical advantages of clustering documents by key phrases over the very popular clustering based on one-word key terms alone. The key phrases are extracted with publicly available software tools that do not require significant computational resources.
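
The contrast between one-word and composite key terms can be illustrated with a minimal sketch. It is not the authors' method: word bigrams stand in for composite key terms, Jaccard overlap stands in for whatever similarity measure a clustering procedure would use, and the helper names `terms` and `jaccard` and the three sample texts are hypothetical.

```python
def terms(text, n):
    """Return the set of n-word terms (word n-grams) found in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample texts: A and C share a topic, B does not.
doc_a = "operating system kernel design"
doc_b = "design of solar system models"
doc_c = "operating system scheduler design"

# One-word terms: A and B look related because they share "system" and "design".
one_word_ab = jaccard(terms(doc_a, 1), terms(doc_b, 1))
one_word_ac = jaccard(terms(doc_a, 1), terms(doc_c, 1))

# Two-word terms (assumed stand-in for composite key terms):
# A and B share nothing, while A and C still share "operating system".
two_word_ab = jaccard(terms(doc_a, 2), terms(doc_b, 2))
two_word_ac = jaccard(terms(doc_a, 2), terms(doc_c, 2))

print(one_word_ab, one_word_ac)
print(two_word_ab, two_word_ac)
```

In this toy setting the one-word representation gives the unrelated pair a nonzero similarity, while the two-word representation separates it completely and keeps the related pair connected. Real key-phrase extraction would involve linguistic filtering that this sketch omits.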

Key Words
composite key terms, coordinate indexing, clustering text documents

How to cite:
Barakhnin V. B., Tkachev D. A. Clustering of text documents based on composite key terms // Vestnik NSU Series: Information Technologies. - 2010. - Volume 08, Issue No 2. - P. 5-14. - ISSN 1818-7900. (in Russian).

Full Text in Russian

Available in PDF

Publication information
Main title: Vestnik NSU Series: Information Technologies, Volume 08, Issue No 2 (2010).
Parallel title: Novosibirsk State University Journal of Information Technologies Volume 08, Issue No 2 (2010).

Key title: Vestnik Novosibirskogo gosudarstvennogo universiteta. Seriâ: Informacionnye tehnologii
Abbreviated key title: Vestn. Novosib. Gos. Univ., Ser.: Inf. Tehnol.
Variant title: Vestnik NGU. Seriâ: Informacionnye tehnologii

Year of Publication: 2010
ISSN: 1818-7900 (Print), ISSN 2410-0420 (Online)
Publisher: Novosibirsk State University Press

© 2006-2017, Novosibirsk State University.