Indian Journal of Science and Technology
Year: 2016, Volume: 9, Issue: 24, Pages: 1-12
Kyung Mi Lee and Keon Myung Lee*
Background/Objectives: Similarity search is a fundamental task in many domains. For large volumes of data, it is sometimes preferable to obtain approximate results instead of pursuing exact results that require long computation times. Methods/Statistical Analysis: The proposed method expresses a query using a fuzzy ball for the numerical data space and a conjunction of ground or general terms for categorical domains. It exploits a dual locality-sensitive hashing technique that constructs separate locality-sensitive hashing tables for the numerical and categorical data spaces. It determines buckets for a query by considering the relative distances of the numerical part of the query to the subspace boundaries and using concept hierarchies for the categorical attributes. It may also use a Bloom filter to select the candidate data set to be examined from among the determined buckets. Findings: The proposed approximate search technique was applied to two data sets. The portions of data to be examined were 0.02712% and 0.00084% for the first and the second data set. The average numbers of numerical buckets examined were 1.42 and 2.53 for the data sets, respectively. The figures mean that the proposed method significantly reduces the number of candidates to be considered and hence saves greatly computation cost. In addition, the proposed method is unique that it can be applied to data set with both numerical and categorical attributes. Improvements/Applications: The proposed locality-sensitive hashing method can be applied to approximate search tasks for large volume of data containing both numerical and categorical attributes.
Keywords: Locality-Sensitive Hashing, Nearest Neighbor Search, Similarity Search, Space Partitioning
Subscribe now for latest articles and news.