A dataset and an approach for identity resolution of 38 million author IDs extracted from 2B git commits

29/06/2020

Tanner Fry, Tapajit Dey, Andrey Karnauch, Audris Mockus

Keywords: Git Commits, Heuristics, Identity Resolution, Machine Learning, Data Sharing

Abstract: Data collected from open source projects provide a means to model large software ecosystems but often suffer from data quality issues. In particular, multiple author identification strings in code commits may belong to a single developer. Many methods have been proposed to address this problem, but they are either heuristics that require manual tweaking or too computationally expensive to perform pairwise comparisons over the 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer across this entire dataset, and we share the list of all author IDs found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found that around 14.8 million of them have an alias; these belong to 5.4 million different developers, with a median of 2 aliases per developer. This dataset can be used to build more accurate models of developer behaviour at the level of the entire OSS ecosystem and to provide a service that rapidly resolves new author IDs.
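The two-step idea in the abstract (block first, then classify pairs only within blocks) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the blocking keys (last name token and email local part), the string-similarity rule standing in for the trained machine learning model, the 0.7 threshold, and the function names `blocking_keys`, `same_developer`, and `resolve`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def parse_author_id(author_id):
    """Split a 'Name <email>' string into lowercased (name, email)."""
    name, _, rest = author_id.partition("<")
    return name.strip().lower(), rest.rstrip("> ").strip().lower()


def blocking_keys(author_id):
    """Hypothetical blocking keys: last name token and email local part."""
    name, email = parse_author_id(author_id)
    keys = set()
    if name:
        keys.add(("last", name.split()[-1]))
    local = email.split("@")[0]
    if local:
        keys.add(("user", local))
    return keys


def same_developer(a, b, threshold=0.7):
    """Stand-in for the trained model: mean of two string similarities."""
    name_a, email_a = parse_author_id(a)
    name_b, email_b = parse_author_id(b)
    name_sim = SequenceMatcher(None, name_a, name_b).ratio()
    user_sim = SequenceMatcher(
        None, email_a.split("@")[0], email_b.split("@")[0]
    ).ratio()
    return (name_sim + user_sim) / 2 >= threshold


def resolve(author_ids):
    """Return groups of author IDs predicted to be one developer."""
    # Step 1: blocking, so only IDs sharing a key are ever compared.
    blocks = defaultdict(set)
    for aid in author_ids:
        for key in blocking_keys(aid):
            blocks[key].add(aid)

    # Union-find keeps alias groups transitive (A~B and B~C imply A~C).
    parent = {aid: aid for aid in author_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: pairwise prediction inside each block only.
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            if same_developer(a, b):
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for aid in author_ids:
        groups[find(aid)].append(aid)
    # Keep only IDs that have at least one alias.
    return [sorted(g) for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    example_ids = [
        "Jane Doe <jane.doe@example.com>",
        "Jane Doe <jdoe@corp.example>",
        "jdoe <jdoe@users.noreply.example>",
        "John Doe <john@example.org>",
    ]
    for group in resolve(example_ids):
        print(group)
```

Blocking is what keeps the comparison count manageable: pairs are scored only inside a block rather than across all 38M IDs. The union-find step is one simple way to turn pairwise predictions into per-developer groups.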

The talk and the respective paper were published at the MSR 2020 virtual conference.

Similar Papers

29/06/2020

Detecting and characterizing bots that commit code

Tapajit Dey, Sara Mousavi, Eduardo Ponce, Tanner Fry, Bogdan Vasilescu, Anna Filippova, Audris Mockus

Keywords: random forest, software engineering, automated commits, bots, ensemble model, social coding platforms

11:21

25/04/2020

ScrAPIr: Making Web Data APIs Accessible to End Users

Tarfah Alrashed, Jumana Almahmoud, Amy Zhang, David Karger

Keywords: web apis, api description language, web scraping

13:20

29/06/2020

A study of potential code borrowing and license violations in Java projects on GitHub

Yaroslav Golubev, Maria Eliseeva, Nikita Povarov, Timofey Bryksin

15:00

02/06/2020

Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach

David Schindler, Benjamin Zapilko, Frank Krüger

27:13

29/06/2020

An empirical study on the impact of deimplicitization on comprehension in programs using application frameworks

Jürgen Cito, Jiasi Shen, Martin Rinard

4:27

06/12/2021

PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair

Zimin Chen, Vincent J Hellendoorn, Pascal Lamblin, Petros Maniatis, Pierre-Antoine Manzagol, Daniel Tarlow, Subhodeep Moitra

Keywords: deep learning, machine learning, transformers, graph learning

5:59

29/06/2020

A complete set of related git repositories identified via community detection approaches based on shared commits

Audris Mockus, Diomidis Spinellis, Zoe Kotti, Gabriel John Dusing

Keywords: forks and clones

5:00

29/06/2020

GitterCom: A dataset of open source developer communications in Gitter

Esteban Parra, Ashley Ellis, Sonia Haiduc

Keywords: social media, datasets, chat, team communication

5:08

06/12/2021

Integrating Tree Path in Transformer for Code Representation

Han Peng, Ge Li, Wenhan Wang, YunFei Zhao, Zhi Jin

Keywords: machine learning, transformers

4:42

12/07/2020

Structural Language Models of Code

Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav

Keywords: Applications - Language, Speech and Dialog

11:57

29/06/2020

Using large-scale anomaly detection on code to improve Kotlin compiler

Timofey Bryksin, Victor Petukhov, Ilya Alexin, Stanislav Prikhodko, Alexey Shpilman, Vladimir Kovalenko, Nikita Povarov

14:33

29/06/2020

Employing contribution and quality metrics for quantifying the software development process

Themistoklis Diamantopoulos, Michail D. Papamichail, Thomas Karanikiotis, Kyriakos C. Chatzidimitriou, Andreas L. Symeonidis

Keywords: code violations, mining software repositories, contribution analysis, DevOps, GitHub issues

4:59

15/06/2020

Semantic code search via equational reasoning

Varot Premtoon, James Koppel, Armando Solar-Lezama

Keywords: equational reasoning, code search

16:29

15/11/2020

Designing Types for R, Empirically

Alexi Turcotte, Aviral Goel, Filip Křikava, Jan Vitek

Keywords: R, dynamic languages, type declarations

16:04

23/06/2021

When Threads Meet Events: Efficient and Precise Static Race Detection with Origins

Bozhen Liu, Peiming Liu, Yanze Li, Chia-Che Tsai, Dilma Da Silva, Jeff Huang

Keywords: Origins, Data Race Detection, Pointer Analysis, Static Analysis

13:31

19/01/2020

Binders by Day, Labels by Night: Effect Instances via Lexically Scoped Handlers

Dariusz Biernacki, Maciej Piróg, Piotr Polesiuk, Filip Sieczkowski

Keywords: effect handlers, logical relations, algebraic effects

21:13

06/12/2021

Environment Generation for Zero-Shot Compositional Reinforcement Learning

Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, Aleksandra Faust

Keywords: reinforcement learning and planning, robustness, graph learning

8:40

29/06/2020

Visualization of methods changeability based on VCS data

Sergey Svitkov, Timofey Bryksin

4:34

15/11/2020

Digging for Fold: Synthesis-Aided API Discovery for Haskell

Michael B. James, Zheng Guo, Ziteng Wang, Shivani Doshi, Hila Peleg, Ranjit Jhala, Nadia Polikarpova

Keywords: Program Synthesis, Type Inference, Human-Computer Interaction

16:01

15/06/2020

SPINFER: Inferring Semantic Patches for the Linux Kernel

Lucas Serrano, Van-Anh Nguyen, Ferdian Thung, Lingxiao Jiang, David Lo, Julia Lawall, Gilles Muller

20:47

23/06/2021

RbSyn: Type- and Effect-Guided Program Synthesis

Sankha Narayan Guria, Jeffrey S. Foster, David Van Horn

Keywords: program synthesis, type and effect systems, Ruby

12:40

19/08/2021

Graph-Augmented Code Summarization in Computational Notebooks

April Wang, Dakuo Wang, Xuye Liu, Lingfei Wu

Keywords: Natural Language Processing, General

9:06

03/05/2021

BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration

Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton, Hanjun Dai

Keywords: Program Synthesis

10:26

29/06/2020

SoftMon: A tool to compare similar open-source software from a performance perspective

Shubhankar Suman Singh, Smruti R. Sarangi

Keywords: Performance debugging, Software comparison, NLP based matching

15:58

02/02/2021

An Automated Engineering Assistant: Learning Parsers for Technical Drawings

Dries Van Daele, Nicholas Decleyre, Herman Dubois, Wannes Meert

17:21

15/11/2020

Unifying Execution of Imperative Generators and Declarative Specifications

Pengyu Nie, Marinela Parovic, Zhiqiang Zang, Sarfraz Khurshid, Aleksandar Milicevic, Milos Gligoric

Keywords: Imperative generators, declarative specifications, Deuterium

14:47

12/07/2020

Adversarial Robustness for Code

Pavol Bielik, Martin Vechev

Keywords: Adversarial Examples

15:43

29/06/2020

An exploratory study to find motives behind cross-platform forks from software heritage dataset

Avijit Bhattacharjee, Sristy Sumana Nath, Shurui Zhou, Debasish Chakroborti, Banani Roy, Chanchal K. Roy, Kevin Schneider

Keywords: OSS, Social coding, Cross-platform forks, Collaboration

4:50

29/06/2020

Multi-language design smells: A backstage perspective

Mouna Abidi, Moses Openja, Foutse Khomh

Keywords: Design smells, Survey, JNI, Multi-language systems

6:16

19/08/2021

Adapting Meta Knowledge with Heterogeneous Information Network for COVID-19 Themed Malicious Repository Detection

Yiyue Qian, Yiming Zhang, Yanfang Ye, Chuxu Zhang

Keywords: Multidisciplinary Topics and Applications, Security and Privacy, Classification, Mining Graphs, Semi Structured Data, Complex Data

13:28

20/08/2020

Kinds are Calling Conventions

Paul Downen, Zena M. Ariola, Simon Peyton Jones, Richard A. Eisenberg

Keywords: representation, type systems, arity, levity, polymorphism

15:00

29/06/2020

Forking without clicking: On how to identify software repository forks

Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli

Keywords: source code, software fork, free software, software evolution, open source, version control system

12:53

17/08/2020

Code replicability in computer graphics

Nicolas Bonneel, David Coeurjolly, Julie Digne, Nicolas Mellado

Keywords: replicability, open source, siggraph, code review, reproducibility

17:30

06/12/2021

RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem

Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph Gonzalez, Ion Stoica

Keywords: reinforcement learning and planning

12:15

29/06/2020

How often do single-statement bugs occur? The ManySStuBs4J dataset

Rafael-Michael Karampatsis, Charles Sutton

6:16

03/05/2021

Learning Structural Edits via Incremental Tree Transformations

Ziyu Yao, Frank F Xu, Pengcheng Yin, Huan Sun, Graham Neubig

Keywords: Representation Learning, Source Code, Incremental Tree Transformations, Edit, Tree-structured Data, Imitation Learning

5:18

29/06/2020

Traceability support for multi-lingual software projects

Yalin Liu, Jinfeng Lin, Jane Cleland-Huang

Keywords: Traceability, Cross-lingual information retrieval, Generalized Vector Space Model

13:23

06/12/2021

Row-clustering of a Point Process-valued Matrix

Lihao Yin, Ganggang Xu, Huiyan Sang, Yongtao Guan

Keywords: machine learning, clustering

14:30

25/04/2020

TRACTUS: Understanding and Supporting Source Code Experimentation in Hypothesis-Driven Data Science

Krishna Subramanian, Johannes Maas, Jan Borchers

Keywords: data science, programming ide, exploratory programming, information visualization, observational study

14:51

25/07/2020

Relevance transformer: Generating concise code snippets with relevance feedback

Carlos Gemmell, Federico Rossetto, Jeffrey Dalton

Keywords: code generation, neural machine translation, code retrieval

8:44