03/08/2020

Robust $k$-means++

Amit Deshpande, Praneeth Kacham, Rameshwar Pratap


Abstract: A good seeding, or initialization of cluster centers, for the $k$-means method is important from both theoretical and practical standpoints. The $k$-means objective is inherently non-robust and sensitive to outliers. A popular seeding such as $k$-means++ [3], which in the worst case is more likely to pick outliers, may compound this drawback and degrade the quality of clustering on noisy data. For any $0 < \delta \leq 1$, we show that using a mixture of $D^{2}$ sampling [3] and uniform sampling, we can pick $O(k/\delta)$ candidate centers with the following guarantee: they contain some $k$ centers that give an $O(1)$-approximation to the optimal robust $k$-means solution while discarding at most $\delta n$ more points than the outliers discarded by the optimal solution. That is, if the optimal solution discards its farthest $\beta n$ points as outliers, our solution discards its farthest $(\beta + \delta) n$ points as outliers. The constant factor in our $O(1)$-approximation does not depend on $\delta$. This improves on previous results for $k$-means with outliers based on LP relaxation and rounding [7] and on local search [17]. The $O(k/\delta)$-sized subset can be found in time $O(ndk)$. Our \emph{robust} $k$-means++ is also easily amenable to scalable, faster, parallel implementations of $k$-means++ [5]. Our empirical results compare the above \emph{robust} variant of $k$-means++ with the usual $k$-means++, uniform random seeding, threshold $k$-means++ [6], and local search on real-world and synthetic data.
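The seeding idea described in the abstract, drawing each candidate center from a mixture of $D^{2}$ sampling and uniform sampling, can be illustrated with a short sketch. The snippet below is a minimal illustration and not the authors' reference implementation: the helper name `robust_kmeanspp_seeding`, the equal 0.5 mixing weight between the two sampling modes, and the choice of `ceil(k / delta)` candidates are assumptions made for this example.

```python
import numpy as np

def robust_kmeanspp_seeding(X, k, delta, seed=None):
    """Sketch of a robust k-means++ style seeding.

    Each candidate center is drawn either by D^2 sampling
    (probability proportional to squared distance to the current
    centers, as in standard k-means++) or uniformly at random,
    so that far-away outliers cannot dominate the selection.
    Returns O(k/delta) candidate centers.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    num_candidates = int(np.ceil(k / delta))  # O(k/delta) candidates (assumed count)

    # First center: a uniformly random point.
    centers = [X[rng.integers(n)]]
    # Squared distance of every point to its nearest chosen center.
    dist_sq = np.sum((X - centers[0]) ** 2, axis=1)

    for _ in range(num_candidates - 1):
        if rng.random() < 0.5:  # mixing weight is an assumption for the sketch
            # D^2 sampling step.
            probs = dist_sq / dist_sq.sum()
            idx = rng.choice(n, p=probs)
        else:
            # Uniform sampling step, limiting the influence of outliers.
            idx = rng.integers(n)
        centers.append(X[idx])
        dist_sq = np.minimum(dist_sq, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)

if __name__ == "__main__":
    # Toy usage: 200 inlier points plus a few far-away outliers.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(200, 2)),
                   rng.normal(loc=50.0, size=(5, 2))])
    candidates = robust_kmeanspp_seeding(X, k=3, delta=0.1, seed=0)
    print(candidates.shape)  # roughly (k/delta, 2) candidate centers
```

In this sketch, the uniform-sampling branch is what distinguishes the seeding from plain $D^{2}$ sampling: even if a handful of outliers carry most of the squared-distance mass, a constant fraction of candidates is still drawn uniformly from the data.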

