14/09/2020

LOAD: LSH-Based \(\ell _0\)-Sampling over Stream Data with Near-Duplicates

Dingzhu Lurong, Yanlong Wen, Jiangwei Zhang, Xiaojie Yuan

Keywords: stream data, data sampling, \(\ell _0\)-sampling

Abstract: The massive volume of today's stream data makes real-time analysis nearly impossible. To cope with this volume, previous works typically use sampling to extract representatives and then analyze the sampled dataset. In this paper, we propose LOAD, a Locality-Sensitive Hashing (LSH) based \(\ell _0\)-sampling algorithm over stream data. Instead of using the same diameter for all dimensions, LOAD uses dimension-specific diameters, which can fit the distribution of groups better. As a result, LOAD identifies representatives more accurately. To facilitate real-time analysis, we further optimize LOAD by applying LSH: since nearby items are hashed into the same bucket with high probability, distinguishing the representatives becomes much faster. Extensive experiments show that LOAD is not only more accurate than other state-of-the-art algorithms, but also faster by an order of magnitude.
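
To make the idea in the abstract concrete, here is a minimal Python sketch of sampling one representative per group of near-duplicates, using a grid whose cell width differs per dimension. It is an illustration under stated assumptions, not the LOAD algorithm from the paper: the names `bucket_key` and `GridL0Sampler`, the toy stream, and the diameter values are all hypothetical, and a plain grid quantization stands in for the LSH scheme.

```python
import math
import random


def bucket_key(point, diameters):
    """Map a point to a grid cell; each dimension has its own cell width."""
    return tuple(math.floor(x / d) for x, d in zip(point, diameters))


class GridL0Sampler:
    """Keeps one representative per occupied grid cell (hypothetical helper).

    Sampling uniformly from the stored representatives approximates sampling
    uniformly from the distinct groups, regardless of how many near-duplicates
    each group received.
    """

    def __init__(self, diameters, seed=None):
        self.diameters = diameters
        self.representatives = {}  # cell key -> first point seen in that cell
        self.rng = random.Random(seed)

    def insert(self, point):
        # A dictionary lookup on the cell key replaces a scan over all
        # representatives, which is where hashing saves time.
        key = bucket_key(point, self.diameters)
        self.representatives.setdefault(key, point)

    def sample(self):
        # Each group is equally likely, no matter how many items it holds.
        return self.rng.choice(list(self.representatives.values()))


# Toy stream: two groups of near-duplicates with very different scales
# in the first and second dimension (assumed values for illustration).
stream = [(0.10, 10.0), (0.12, 11.0), (0.11, 10.5), (0.90, 50.0), (0.91, 52.0)]
sampler = GridL0Sampler(diameters=(0.25, 20.0), seed=0)
for p in stream:
    sampler.insert(p)
```

In this sketch a single shared diameter would be either too coarse for the first dimension or too fine for the second, which is the motivation for per-dimension widths. A fixed grid also splits near-duplicates that straddle a cell boundary, so a practical scheme would need something more robust, such as randomized offsets or several hash tables.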

Talk and the respective paper are published at the ECML PKDD 2020 virtual conference.
