Implementation of a Noise Filter for Grouping in Bibliographic Databases using Latent Semantic Indexing

  • Murilo Marques Armelin Gomes, Department of Computer Science, Federal University of Uberlândia, MG 38400-902, Brazil
  • William Ferreira dos Anjos, Department of Computer Science, Federal University of Uberlândia, MG 38400-902, Brazil
  • Arun Kumar Jaiswal, Postgraduate Program in Bioinformatics, Institute of Biological Sciences, Federal University of Minas Gerais (UFMG), Belo Horizonte 31270-901, MG, Brazil
  • Sandeep Tiwari, Postgraduate Program in Bioinformatics, Institute of Biological Sciences, Federal University of Minas Gerais (UFMG), Belo Horizonte 31270-901, MG, Brazil
  • Preetam Ghosh, Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
  • Debmalya Barh, Department of Genetics, Ecology and Evolution, Institute of Biological Sciences, Federal University of Minas Gerais (UFMG), Belo Horizonte 31270-901, MG, Brazil
  • Vasco Azevedo, Department of Genetics, Ecology and Evolution, Institute of Biological Sciences, Federal University of Minas Gerais (UFMG), Belo Horizonte 31270-901, MG, Brazil
  • Anderson Santos, Department of Computer Science, Federal University of Uberlândia, MG 38400-902, Brazil
Keywords: SVD, LSI, Grouping, Dimensionality Reduction

Abstract

Clustering algorithms can assist scientific research by grouping documents into themes from which information can be extracted more easily. However, clusters often contain documents that are irrelevant to the topic of interest, which reduces the quality of the information. For a bibliographic database, this loss of quality can be managed by treating noise in the semantic space that represents the relations between the grouped documents. In this work, we support the hypothesis that Latent Semantic Indexing (LSI) is an efficient instrument for reducing noise and improving cluster quality. Using a database of 90 scientific publications from different areas, we pre-processed the documents with LSI and grouped them using six clustering algorithms. The results improved significantly compared with our initial results, which were obtained without LSI-based pre-processing. Among the best-performing algorithms, CMeans achieved the highest average gain, approximately 25%, followed by K-Means and SKmeans, with 17% each; PAM, with 16.5%; and EM, with 15%. We conclude that Latent Semantic Indexing is a helpful tool for noise reduction, and we recommend its use to significantly improve the cluster quality of bibliographic databases.
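
As an illustration of the LSI pre-processing step described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes NumPy is available; the toy term-document matrix, the term labels, and the helper names `lsi_project` and `cosine` are hypothetical and chosen only for this example. The sketch truncates the singular value decomposition (SVD) of the matrix to rank k and compares documents in the reduced space, which is the representation a clustering algorithm would then receive.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Project documents into a rank-k latent space (one row per document)."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; smaller ones are treated as noise.
    return (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-document counts: rows are terms, columns are documents.
A = np.array([
    [2, 3, 0, 0],  # "gene"
    [1, 2, 0, 0],  # "protein"
    [0, 0, 3, 2],  # "cluster"
    [0, 0, 2, 2],  # "kmeans"
    [1, 0, 0, 1],  # a term shared across both topics (noise)
], dtype=float)

docs_k = lsi_project(A, k=2)

# Documents 0 and 1 share a topic; documents 0 and 2 do not.
same_topic = cosine(docs_k[0], docs_k[1])
cross_topic = cosine(docs_k[0], docs_k[2])
print(same_topic, cross_topic)
```

In this sketch the rows of `docs_k` would be passed to a clustering algorithm in place of the raw term counts. The choice of k is an assumption here; in practice it is tuned to the collection.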

Published
2023-02-01