Efficient Search for Data with Numerical and Categorical Attributes Based on Dual Locality-Sensitive Hashing

Kyung Mi Lee and Keon Myung Lee

doi:10.17485/ijst/2016/v9i24/96110

Article

Efficient Search for Data with Numerical and Categorical Attributes Based on Dual Locality-Sensitive Hashing

VIEWS 707
PDF 225

Abstract
Full-Text HTML
Full-Text PDF
How to Cite

Indian Journal of Science and Technology

DOI: 10.17485/ijst/2016/v9i24/96110

Year: 2016, Volume: 9, Issue: 24, Pages: 1-12

Original Article

Efficient Search for Data with Numerical and Categorical Attributes Based on Dual Locality-Sensitive Hashing

Kyung Mi Lee and Keon Myung Lee^*

Department of Computer Science, [email protected],
[email protected]
*Author for Correspondence
Keon Myung Lee
Department of Computer Science
Email:[email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Background/Objectives: Similarity search is a fundamental task in many domains. For large volumes of data, it is sometimes preferable to obtain approximate results instead of pursuing exact results that require long computation times. Methods/Statistical Analysis: The proposed method expresses a query using a fuzzy ball for the numerical data space and a conjunction of ground or general terms for categorical domains. It exploits a dual locality-sensitive hashing technique that constructs separate locality-sensitive hashing tables for the numerical and categorical data spaces. It determines buckets for a query by considering the relative distances of the numerical part of the query to the subspace boundaries and using concept hierarchies for the categorical attributes. It may also use a Bloom filter to select the candidate data set to be examined from among the determined buckets. Findings: The proposed approximate search technique was applied to two data sets. The portions of data to be examined were 0.02712% and 0.00084% for the first and the second data set. The average numbers of numerical buckets examined were 1.42 and 2.53 for the data sets, respectively. The figures mean that the proposed method significantly reduces the number of candidates to be considered and hence saves greatly computation cost. In addition, the proposed method is unique that it can be applied to data set with both numerical and categorical attributes. Improvements/Applications: The proposed locality-sensitive hashing method can be applied to approximate search tasks for large volume of data containing both numerical and categorical attributes.
Keywords: Locality-Sensitive Hashing, Nearest Neighbor Search, Similarity Search, Space Partitioning