Abstract:
A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic "linkage" algorithms (single, average, or complete). However, these methods exhibit a quite prohibitive running time of $\Theta(n^2)$.
In this paper, we provide a new algorithm which takes as input a set of points $P$ in $R^d$, and for every $c\ge 1$, runs in time $n^{1+O(1/c^2)}$ to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, we have $\Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\Delta^*$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\max_{u,v \in P} \Delta^*(u,v)/||u-v||_2$."
We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $\epsilon>0$,
no algorithm with running time $n^{2-\epsilon}$
can distinguish between inputs that admit
isometric embedding and inputs that can incur a
distortion of 3/2 in L∞ -metric.
Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.