Abstract:
Data profiling is a "set of statistical data analysis activities and processes to determine properties of a given dataset". Historically, most data profiling tasks targeted the data instances themselves. At scale, when a dataset contains millions of tables, its metadata (i.e., titles, attribute names, and types) becomes as abundant as the data instances, and profiling that metadata starts to play a vital role. Here we demonstrate our work on WebLens, an interactive, scalable metadata profiler for large-scale structured data. At its core is a new data structure, the Metadata-profile, coupled with machine learning and deep learning models trained to construct it. A Metadata-profile is a metadata summary of a specific real-world object, collected over millions of data sources. Such profiles significantly simplify access to large-scale structured datasets for both data scientists and end users. Finally, we performed a user study with 20 students and found that the models trained by WebLens significantly outperform these 20 participants on the task of constructing Metadata-profiles for 10 objects from different domains. For demonstration and evaluation, we used a large-scale dataset of 15 million relational English tables from the Web.