Abstract:
Data valuation is a non-trivial challenge in use cases such as collaborative data sharing and data markets. The value of data is often tied to the learning performance (e.g., validation accuracy) of a model trained on the data. This intuitive methodology tightly couples data valuation to validation, which may be undesirable: a validation set may not be available in practice, and data providers may find it difficult to agree on the choice of the validation set. A separate but practical issue is data replication: given the value of some data points, a dishonest data provider may replicate them to exploit the valuation for a higher reward or payment. We observe that the diversity of the data points is an inherent property of the dataset that is independent of validation. We formalize diversity via the volume of the data matrix (the determinant of its left Gram matrix), which allows us to formally connect the diversity of data to learning performance without requiring validation. Furthermore, following the intuition that copying the same data points does not increase the diversity of the data, we propose a robust volume with theoretical replication robustness guarantees. We perform extensive experiments to demonstrate its consistency and practical advantages over existing baselines, and show that our method is model- and task-agnostic and flexibly adaptable to various neural networks.
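As a minimal sketch of the quantity the abstract describes, the snippet below computes the volume of a data matrix as the (square root of the) determinant of its left Gram matrix, and illustrates the replication issue that motivates the robust variant: duplicating rows inflates the plain volume. The exact definition (square root of the Gram determinant) is an assumption inferred from the abstract, not taken verbatim from it.

```python
import numpy as np

def volume(X: np.ndarray) -> float:
    # Volume of an n x d data matrix X (n >= d), defined here as
    # sqrt(det(X^T X)), where X^T X is the left Gram matrix.
    # (Assumed formulation; the abstract only names the Gram determinant.)
    return float(np.sqrt(np.linalg.det(X.T @ X)))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))  # 10 points in 3 dimensions

# Replicating existing rows strictly increases the plain volume,
# which is the exploit that replication robustness must prevent.
X_rep = np.vstack([X, X[:2]])
print(volume(X), volume(X_rep))
```

Each duplicated row adds a positive-semidefinite rank-one term to the Gram matrix, so the determinant (and hence the valuation) grows even though no new diversity was contributed.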