06/12/2020

Optimal Prediction of the Number of Unseen Species with Multiplicity

Yi Hao, Ping Li

Abstract: Based on a sample of size $n$, we consider estimating the number of symbols that appear at least $\mu$ times in an independent sample of size $a \cdot n$, where $a$ is a given parameter. This formulation includes, as a special case, the well-known problem of inferring the number of unseen species, introduced by [Fisher et al.] in 1943 and studied by many others. Of considerable interest in this line of work is the largest $a$ for which the quantity can be accurately predicted. We completely resolve this problem by determining the limit of estimation to be $a \approx (\log n)/\mu$, with lower and upper bounds that match up to constant factors. For the particular case of $\mu = 1$, this implies the recent result by [Orlitsky et al.] on the unseen species problem. Experimental evaluations show that the proposed estimator performs exceptionally well in practice. Furthermore, the estimator is a simple linear combination of the symbols' empirical counts, and is hence computable in linear time.
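
To make the quantity in the abstract concrete, the following is a minimal Python sketch, not the authors' estimator. It computes the target (the number of distinct symbols appearing at least $\mu$ times in a sample) and shows the general shape of an estimator that sums a coefficient over each symbol's empirical count. The function names `count_at_least` and `linear_estimate` and the coefficient function `coeff` are hypothetical; the abstract does not state the paper's actual coefficients, so none are assumed here.

```python
from collections import Counter


def count_at_least(sample, mu):
    """Number of distinct symbols that appear at least `mu` times in `sample`.

    Applied to the independent sample of size a*n, this is the quantity
    the abstract sets out to predict.
    """
    return sum(1 for c in Counter(sample).values() if c >= mu)


def linear_estimate(sample, coeff):
    """Sum `coeff(count)` over the distinct symbols of `sample`.

    `coeff` is a hypothetical, caller-supplied function of a symbol's
    empirical count. One pass to count plus one pass over the distinct
    symbols, so the cost is linear in the sample size.
    """
    return sum(coeff(c) for c in Counter(sample).values())


# Toy data (illustrative only): 'a' appears 2 times, 'b' 3 times, 'c' and 'd' once.
observed = list("aabbbcd")
target_on_observed = count_at_least(observed, 2)
# With an indicator coefficient, the linear form reduces to the same count.
indicator_estimate = linear_estimate(observed, lambda c: 1 if c >= 2 else 0)
```

The indicator coefficient is used only to show that the linear form generalizes the plain count; predicting the value for a larger, unseen sample requires coefficients chosen as in the paper.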

The talk and the accompanying paper were published at the NeurIPS 2020 virtual conference.
