Etd

Scalable Similarity Search over Big Data Series

Public

The massive amounts of large-scale high-dimensional data series continuously generated and collected by applications warrant scalable processing systems. At the core of these processing systems, indexing techniques play a critical role in speeding up similarity queries – with the later representing a critical capability leveraged by most analytics-based applications. State-of-the-art indexes for data series, e.g., iSAX-based indexes, show good adaptability and applicability in centralized systems. However, for big data distributed environments, these indexes do not scale well. They tend to suffer from a poor proximity preservation of the data series. This can lead to severe accuracy degradation for approximate similarity queries. The main theme of my dissertation is thus to design, develop and evaluate efficient scalable indexing techniques to support distributed similarity search queries over data series databases. I focus on one very common class of queries, namely, approximate kNN queries. To achieve my goal, I build the proposed techniques on modern big data infrastructures such as Hadoop and Spark. My dissertation focuses on the following three research themes: 1) Scalability of Data Series Object Number to Billion-level. This project aims to scale to billions of data series objects, e.g., 100s of data points in a given data series, to support similarity query. The state-of-the-art techniques suffer from having long traversal paths, large initial cardinality, high matching overhead, and poor maintenance of the proximity relationships among the data series objects because of the small adopted fan-out (binary) which leads to severe accuracy degradation for approximate similarity queries. This project designs a novel iSAX index tree based on a new word-level variable cardinality which ensures compact structure, efficient search and comparison, and good preservation of the similarity relationships to improve the efficiency of the index construction and optimize query processing. 2) Leveraging Correlation-Aware Optimizations. This project aims to leverage the correlation between the data series and the (often simple) partitioning attribute(s) in distributed repositories exist to build secondary index. The prevalent approach is to construct indexes directly on the data series objects which suffers from very high construction time and storage costs due to the inherent complexity of indexing these high dimensional data series. Several critical challenges exist, including the high dimensionality of the data, uncertainty of correlation, and no proper measure for assessing correlation strength. This project proposes a new infrastructure to discover, assess, and exploit such correlation for query optimization. A coarse-grain partition-level index structure over the correlation is designed to tackle these challenges including pattern-level mapping, exception handling strategies for soft correlations, and a new entropy-based measure for assessing the correlations' strength and judging their potential effectiveness. 3) Boosting Approximate Similarity Search Accuracy. This project aims to improve the similarity preservation of representation to improve the query performance dramatically. The horizontal strip approximation used by the state-of-the-art techniques is a naturally lossy process in the signature conversion and the indices prefer to the fewer horizontal strips for a compact structure which would worse the performance. This project abandons the lossy horizontal strip approximation and propose a novel dual-representation based on pivot-permutation-prefix technique to represent data series with less information loss and design a corresponding index to localize the similar data series in group and partition levels. In addition, we propose a density-aware query processing to automatically select partitions to improve quality of candidates for final evaluation. Experimental studies including performance evaluation are conducted on various benchmark and real datasets to evaluate both the effectiveness and efficiency of my proposed approaches.

Creator
Contributors
Degree
Unit
Publisher
Identifier
  • etd-63216
Keyword
Advisor
Committee
Defense date
Year
  • 2022
Date created
  • 2022-04-22
Resource type
Rights statement

Relations

In Collection:

Items

Items

Permanent link to this page: https://digital.wpi.edu/show/4j03d274x