Abstract:
Given a collection of n points in ℝd, the goal of the (k,z)-clustering problem is to find a subset of k “centers” that minimizes the sum of the z-th powers of the Euclidean distance of each point to the closest center. Special cases of the (k,z)-clustering problem include the k-median and k-means problems. Our main result is a unified two-stage importance sampling framework that constructs an ε-coreset for the (k,z)-clustering problem. Compared to the results for (k,z)-clustering in [Feldman and Langberg, STOC 2011], our framework saves a ε2 d factor in the coreset size. Compared to the results for (k,z)-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a poly(k) factor in the coreset size and avoids the exp(k/ε) term in the construction time. Specifically, our coreset for k-median (z=1) has size Õ(ε−4 k) which, when compared to the result in [Sohler and Woodruff, STOC 2018], saves a k factor in the coreset size. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape fitting problems: subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of Ω(k· min2z/20,d ) for a 0.01-coreset for (k,z)-clustering, which has a linear dependence of size on k and an exponential dependence on z that matches our algorithmic results.