MinSearch: An efficient algorithm for similarity search under edit distance

23/08/2020

Haoyu Zhang, Qin Zhang

Keywords: edit distance, top-k query, similarity search

Abstract: We study a fundamental problem in data analytics: similarity search under edit distance (edit similarity search for short). In this problem we build an index on a set of n strings S = {s1, ..., sn}, with the goal of answering two types of queries: (1) the threshold query: given a query string t and a threshold K, output all si ∈ S such that the edit distance between si and t is at most K; (2) the top-k query: given a query string t, output the k strings in S that are closest to t in terms of edit distance. Edit similarity search has numerous applications in bioinformatics, databases, data mining, information retrieval, etc., and has been studied extensively in the literature. In this paper we propose a novel algorithm for edit similarity search named MinSearch. The algorithm is randomized, and we show mathematically that it outputs the correct answer with high probability for both types of queries. We have conducted an extensive set of experiments on MinSearch and compared it with the best existing algorithms for edit similarity search. Our experiments show that MinSearch has a clear advantage in query time (often by orders of magnitude) over the best previous algorithms, and that it is always among the best of all competitors in indexing time and space usage. Finally, MinSearch achieves perfect accuracy for both types of queries on all datasets that we have tested.
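
To make the two query types concrete, here is a minimal brute-force sketch in Python. It is not MinSearch and builds no index: it scans every string and computes the exact edit distance with the standard dynamic program. The function names (edit_distance, threshold_query, top_k_query) and the sample strings are illustrative assumptions, not taken from the paper.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as the row
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # drop ca
                            curr[j - 1] + 1,      # drop cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def threshold_query(strings, t, K):
    """Return every string whose edit distance to t is at most K (linear scan)."""
    return [s for s in strings if edit_distance(s, t) <= K]


def top_k_query(strings, t, k):
    """Return the k strings closest to t in edit distance (linear scan)."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(s, t))


if __name__ == "__main__":
    S = ["kitten", "sitting", "mitten", "bitten", "sunday"]  # hypothetical data
    print(threshold_query(S, "kitten", 1))
    print(top_k_query(S, "kitten", 3))
```

Each query here costs one full distance computation per string in S, which is the cost an index for edit similarity search aims to avoid.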

The video of this talk cannot be embedded. You can watch it here:

https://dl.acm.org/doi/10.1145/3394486.3403099#sec-supp

The talk and the corresponding paper were published at the KDD 2020 virtual conference.

Similar Papers

22/06/2020

Does preprocessing help in fast sequence comparisons?

Elazar Goldenberg, Aviad Rubinstein, Barna Saha

Keywords: edit distance, approximation algorithms, preprocessing

14:30

16/11/2020

Seq2Edits: Sequence Transduction Using Span-level Edit Operations

Felix Stahlberg, Shankar Kumar

Keywords: sequence editing, natural tasks, nlp tasks, text normalization

9:56

06/12/2020

Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment

Govinda Kamath, Tavor Baharav, Ilan Shomorony

3:19

19/04/2021

Expanding, retrieving and infilling: Diversifying cross-domain question generation with flexible templates

Xiaojing Yu, Anxiao Jiang

11:40

12/07/2020

A Chance-Constrained Generative Framework for Sequence Optimization

Xianggen Liu, Jian Peng, Qiang Liu, Sen Song

Keywords: Deep Learning - Generative Models and Autoencoders

12:40

03/05/2021

Filtered Inner Product Projection for Crosslingual Embedding Alignment

Vin Sachidananda, Ziyi Yang, Chenguang Zhu

Keywords: multilingual representations, natural language processing, word embeddings

5:22

19/10/2020

A comparison of top-k threshold estimation techniques for disjunctive query processing

Antonio Mallia, Michal Siedlaczek, Mengyang Sun, Torsten Suel

Keywords: top-k document retrieval, query processing, threshold estimation

7:40

16/11/2020

Wasserstein Distance Regularized Sequence Representation for Text Matching in Asymmetrical Domains

Weijie Yu, Chen Xu, Jun Xu, Liang Pang, Xiaopeng Gao, Xiaozhao Wang, Ji-Rong Wen

Keywords: real-world practices, text matching, matching models, match method

11:43

06/12/2020

Finding the Homology of Decision Boundaries with Active Learning

Weizhi Li, Gautam Dasarathy, Karthi Natesan Ramamurthy, Visar Berisha

Keywords: Algorithms -> AutoML; Applications -> Fairness, Accountability, and Transparency; Optimization -> Stochastic Optimization, Algorithms -> Classification

3:27

19/10/2020

Intent-driven similarity in e-commerce listings

Gilad Fuchs, Yoni Acriche, Idan Hasson, Pavel Petrov

Keywords: machine learning, e-commerce, sentence similarity

9:57

06/12/2020

Diversity-Guided Multi-Objective Bayesian Optimization With Batch Evaluations

Mina Konakovic Lukovic, Yunsheng Tian, Wojciech Matusik

3:22

03/08/2020

High Dimensional Discrete Integration over the Hypergrid

Raj Kumar Maity, Arya Mazumdar, Soumyabrata Pal

8:46

19/08/2021

Uncertainty-Aware Few-Shot Image Classification

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen, Shih-Fu Chang

Keywords: Machine Learning, Classification, Cost-Sensitive Learning, Recognition, Uncertainty Representations

8:07

04/07/2020

Fact-based Text Editing

Hayate Iso, Chao Qiao, Hang Li

Keywords: Fact-based Editing, text task, text editing, automatically dataset

12:41

06/12/2021

Robustness between the worst and average case

Leslie Rice, Anna Bair, Huan Zhang, J. Zico Kolter

Keywords: machine learning, robustness, adversarial robustness and security, generative model

10:46

04/07/2020

Parallel Sentence Mining by Constrained Decoding

Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, Faheem Kirefu

Keywords: Parallel Mining, decoding, Constrained Decoding, neural translation

6:22

08/07/2020

Space-efficient Query Evaluation over Probabilistic Event Streams

Rajeev Alur, Yu Chen, Kishor Jothimurugan, Sanjeev Khanna

Keywords: Query processing over streams, Streaming algorithms, Probabilistic streams

22:51

06/12/2021

Greedy Approximation Algorithms for Active Sequential Hypothesis Testing

Kyra Gan, Su Jia, Andrew Li

Keywords: active learning

14:03

08/07/2020

Approximate Nearest Neighbor for Curves --- Simple, Efficient, and Deterministic

Arnold Filtser, Omrit Filtser, Matthew Katz

Keywords: polygonal curves, Fréchet distance, dynamic time warping, approximation algorithms, (asymmetric) approximate nearest neighbor, range counting

19:55

26/08/2020

'Bring Your Own Greedy'+Max: Near-Optimal 1/2-Approximations for Submodular Knapsack

Grigory Yaroslavtsev, Samson Zhou, Dmitrii Avdiukhin

13:14

18/07/2021

Randomized Algorithms for Submodular Function Maximization with a $k$-System Constraint

Shuang Cui, Kai Han, Tianshuai Zhu, Jing Tang, Benwei Wu, He Huang

Keywords: Optimization

4:48

08/12/2020

A Deep Metric Learning Method for Biomedical Passage Retrieval

Andrés Rosso-Mateus, Fabio A. González, Manuel Montes-y-Gómez

14:58

26/04/2020

A Probabilistic Formulation of Unsupervised Text Style Transfer

Junxian He, Xinyi Wang, Graham Neubig, Taylor Berg-Kirkpatrick

Keywords: unsupervised text style transfer, deep latent sequence model

5:02

23/08/2020

On sampling top-k recommendation evaluation

Dong Li, Ruoming Jin, Jing Gao, Zhi Liu

Keywords: recommender systems, hit ratio, evaluation metric, top-k, recall

14:42

06/12/2020

Extrapolation Towards Imaginary 0-Nearest Neighbour and Its Improved Convergence Rate

Akifumi Okuno, Hidetoshi Shimodaira

3:14

02/02/2021

Guiding Non-Autoregressive Neural Machine Translation Decoding with Reordering Information

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou

14:24

26/04/2020

Learning-Augmented Data Stream Algorithms

Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, David P. Woodruff

Keywords: streaming algorithms, heavy hitters, F_p moment, distinct elements, cascaded norms

3:55

19/08/2021

Heuristic Search for Approximating One Matrix in Terms of Another Matrix

Guihong Wan, Haim Schweitzer

Keywords: Data Mining, Feature Extraction, Selection and Dimensionality Reduction, Heuristic Search, Dimensionality Reduction, Learning Sparse Models

13:00

22/06/2020

Top-𝑘-convolution and the quest for near-linear output-sensitive subset sum

Karl Bringmann, Vasileios Nakos

Keywords: Subset Sum, pseudopolynomial, output-sensitive, convolution, restricted sumset

25:48

19/08/2021

Diversity in Kemeny Rank Aggregation: A Parameterized Approach

Emmanuel Arrighi, Henning Fernau, Daniel Lokshtanov, Mateus de Oliveira Oliveira, Petra Wolf

Keywords: Agent-based and Multi-agent Systems, Computational Social Choice, Voting

15:00

02/02/2021

Multi-Objective Submodular Maximization by Regret Ratio Minimization with Theoretical Guarantee

Chao Feng, Chao Qian

15:19

02/02/2021

Robust Model Compression Using Deep Hypotheses

Omri Armstrong, Ran Gilad-Bachrach

17:26

06/12/2021

Neural Distance Embeddings for Biological Sequences

Gabriele Corso, Zhitao Ying, Michal Pándy, Petar Veličković, Jure Leskovec, Pietro Liò

Keywords: machine learning, clustering

14:55

12/07/2020

Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data with RACE

Benjamin Coleman, Anshumali Shrivastava, Richard Baraniuk

Keywords: General Machine Learning Techniques

15:20

12/07/2020

Recovery of sparse signals from a mixture of linear samples

Arya Mazumdar, Soumyabrata Pal

Keywords: Optimization - General

15:44

06/12/2021

Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals

Lang Liu, Krishna Pillutla, Sean Welleck, Sewoong Oh, Yejin Choi, Zaid Harchaoui

Keywords: theory, vision, generative model, language

8:52

06/12/2021

Manifold Topology Divergence: a Framework for Comparing Data Manifolds.

Serguei Barannikov, Ilya Trofimov, Grigorii Sotnikov, Ekaterina Trimbach, Alexander Korotin, Alexander Filippov, Evgeny Burnaev

Keywords: generative model

15:01

06/12/2020

Benchmarking Deep Inverse Models over time, and the Neural-Adjoint method

Ben Ren, Willie Padilla, Jordan Malof

3:17

06/12/2021

Contextual Similarity Aggregation with Self-attention for Visual Re-ranking

Jianbo Ouyang, Hui Wu, Min Wang, Wengang Zhou, Houqiang Li

Keywords: robustness, transformers

6:34

05/12/2020

Massively multilingual document alignment with cross-lingual sentence-mover’s distance

Ahmed El-Kishky, Francisco Guzmán

14:59