Duplicate detection in heterogeneous bibliographic sources
D. N. Rubtcov, V. B. Barakhnin

Querying multiple heterogeneous bibliographic sources gives rise to the problem of duplicate records. This paper analyzes the problems that arise when detecting a fuzzy match between two records. Existing methods and algorithms for duplicate elimination are considered, in particular the approaches to defining and computing a string similarity function. Taking into account the requirements of a concrete task, the modernization of the information system «Mathematicians of SB RAS», a solution was implemented that uses the longest common subsequence as the string similarity function. The proposed method was tested on three SB RAS databases: the database of publications of the journal «Computational Technologies», the database of publications of employees of the Institute of Computational Technologies SB RAS, and the database of publications of «Web-resources of the mathematical content». The method proved highly efficient in testing and was applied in the information system «Mathematicians of SB RAS» and in the integrated system of remote access to heterogeneous bibliographic resources, which is currently under development.
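To illustrate the idea of a string similarity function built on the longest common subsequence (LCS), the following is a minimal sketch, not the authors' implementation. The normalization (lowercasing, collapsing whitespace), the division by the length of the longer string, and the 0.9 threshold are all assumptions chosen for the example; the names `lcs_length`, `lcs_similarity`, and `is_duplicate` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Characters match: extend the LCS of both prefixes by one.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: keep the better of dropping one character from either side.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: LCS length over the longer string's length.

    Assumption: records are compared case-insensitively with whitespace collapsed.
    """
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two records as duplicates if similarity reaches the (assumed) threshold."""
    return lcs_similarity(a, b) >= threshold
```

With these assumptions, two spellings of the same title that differ by a single letter or by spacing would score close to 1 and be flagged as duplicates, while unrelated titles would score well below the threshold. In practice the threshold would need tuning against real records.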

Publication information
Main title: Vestnik NSU Series: Information Technologies, Volume 07, Issue No 3 (2009).
Parallel title: Novosibirsk State University Journal of Information Technologies, Volume 07, Issue No 3 (2009).

Key title: Vestnik Novosibirskogo gosudarstvennogo universiteta. Seriâ: Informacionnye tehnologii
Abbreviated key title: Vestn. Novosib. Gos. Univ., Ser.: Inf. Tehnol.
Variant title: Vestnik NGU. Seriâ: Informacionnye tehnologii

Year of Publication: 2009
ISSN: 1818-7900 (Print), 2410-0420 (Online)
Publisher: Novosibirsk State University Press
