Feature extraction for large-scale text collections

19/10/2020

Luke Gallagher, Antonio Mallia, J. Shane Culpepper, Torsten Suel, B. Barla Cambazoglu

Keywords: clueweb, feature index, feature extraction, feature repository, lambdamart, ltr, learning to rank, feature importance

Abstract: Feature engineering is a fundamental but poorly documented component of Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is growing interest in creating open-access test collections that promote reproducible research. However, there are still few open-source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or about how the extracted features were implemented. Both are crucial to creating a collection and to deploying a search engine in production. As a result, experiments are rarely reproducible with new features or collections, and are of little help to companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework for efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software’s utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can use Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.

The video of this talk cannot be embedded. You can watch it here:

https://dl.acm.org/doi/10.1145/3340531.3412773#sec-supp

The talk and the respective paper are published at the CIKM 2020 virtual conference.

Similar Papers

29/06/2020

Embedding Java classes with Code2vec: Improvements from variable obfuscation

Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay

Keywords: code2vec, machine learning, code obfuscation, source code, neural networks

14:20

12/07/2020

Improving Transformer Optimization Through Better Initialization

Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs

Keywords: Sequential, Network, and Time-Series Modeling

14:52

16/11/2020

Chaining Behaviors from Data with Model-Free Reinforcement Learning

Avi Singh, Albert Yu, Jonathan Yang, Jesse Zhang, Aviral Kumar, Sergey Levine

5:01

12/07/2020

Adversarial Robustness for Code

Pavol Bielik, Martin Vechev

Keywords: Adversarial Examples

15:43

02/02/2021

Deeplite Neutrino™: A BlackBox Framework for Constrained Deep Learning Model Optimization

Anush Sankaran, Olivier Mastropietro, Ehsan Saboori, Yasser Idris, Davis Sawyer, MohammadHossein AskariHemmat, Ghouthi Boukli Hacene

18:29

06/12/2021

Pipeline Combinators for Gradual AutoML

Guillaume Baudart, Martin Hirzel, Kiran Kate, Parikshit Ram, Avi Shinnar, Jason Tsay

Keywords: machine learning

14:25

16/11/2020

SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation

Xingyu Lin, Yufei Wang, Jake Olkin, David Held

5:06

19/10/2020

SemFE: Facilitating ML pipeline development with semantics

Baifan Zhou, Yulia Svetashova, Tim Pychynski, Ildar Baimuratov, Ahmet Soylu, Evgeny Kharlamov

Keywords: semantic technology, discrete manufacturing, machine learning, quality monitoring

9:27

16/11/2020

Learning Dexterous Manipulation from Suboptimal Experts

Rae Jeong, Jost Tobias Springenberg, Jackie Kay, Dan Zheng, Alexandre Galashov, Nicolas Heess, Francesco Nori

5:03

06/12/2021

Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives

Murtaza Dalal, Deepak Pathak, Russ Salakhutdinov

Keywords: optimization, reinforcement learning and planning

10:01

18/07/2021

ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations

Chris Cummins, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle, Hugh Leather

Keywords: Applications, Hardware and Systems

5:01

16/11/2020

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Bryan Chen, Alexander Sax, Francis Lewis, Iro Armeni, Silvio Savarese, Amir Zamir, Jitendra Malik, Lerrel Pinto

5:06

15/11/2020

Precise Inference of Expressive Units of Measurement Types

Tongtong Xiang, Jeff Y. Luo, Werner Dietl

Keywords: Scientific computing, Pluggable type system, Dimensional analysis, Units of measurements, Type inference

13:39

16/11/2020

An Imitation Game for Learning Semantic Parsers from User Interaction

Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, Yu Su

Keywords: bootstrapping, fine-tuning parsers, theoretical analysis, text-to-sql problem

11:49

06/12/2021

Revisiting Deep Learning Models for Tabular Data

Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, Artem Babenko

Keywords: deep learning, transformers

12:14

19/10/2020

Flexible IR pipelines with Capreolus

Andrew Yates, Kevin Martin Jose, Xinyu Zhang, Jimmy Lin

Keywords: neural information retrieval, retrieval pipeline, ad hoc ranking

10:00

16/11/2020

Never Stop Learning: The Effectiveness of Fine-Tuning in Robotic Reinforcement Learning

Ryan Julian, Benjamin Swanson, Gaurav Sukhatme, Sergey Levine, Chelsea Finn, Karol Hausman

5:47

03/05/2021

Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling

Benedikt Boecking, Willie Neiswanger, Eric P Xing, Artur Dubrawski

Keywords: active learning, data programming, data labeling, weak supervision

5:10

14/06/2020

RL-CycleGAN: Reinforcement Learning Aware Simulation-to-Real

Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, Mohi Khansari

Keywords: robotics, sim2real, cyclegan, reinforcement learning, grasping, q-learning

4:55

29/06/2020

RTPTorrent: An open-source dataset for evaluating regression test prioritization

Toni Mattis, Patrick Rein, Falco Dürsch, Robert Hirschfeld

Keywords: Regression Test Prioritization, Dataset, Java, GitHub, TravisCI

14:57

02/02/2021

Physarum Powered Differentiable Linear Programming Layers and Applications

Zihang Meng, Sathya N. Ravi, Vikas Singh

16:57

15/11/2020

Interactive Synthesis of Temporal Specifications from Examples and Natural Language

Ivan Gavran, Eva Darulova, Rupak Majumdar

Keywords: robots, program synthesis, LTL, specification, natural language processing

15:41

15/06/2020

Blended, precise semantic program embeddings

Ke Wang, Zhendong Su

Keywords: Static and Dynamic Program Features, Attention Network, Semantic Program Embedding

15:39

06/12/2020

Unsupervised Translation of Programming Languages

Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, Guillaume Lample

3:17

16/11/2020

Towards General and Autonomous Learning of Core Skills: A Case Study in Locomotion

Roland Hafner, Tim Hertweck, Philipp Kloeppner, Michael Bloesch, Michael Neunert, Markus Wulfmeier, Saran Tunyasuvunakool, Nicolas Heess, Martin Riedmiller

5:24

03/05/2021

BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration

Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, Charles Sutton, Hanjun Dai

Keywords: Program Synthesis

10:26

29/06/2020

An empirical study on the impact of deimplicitization on comprehension in programs using application frameworks

Jürgen Cito, Jiasi Shen, Martin Rinard

4:27

06/12/2021

Continual Learning via Local Module Composition

Oleksiy Ostapenko, Pau Rodriguez, Massimo Caccia, Laurent Charlin

Keywords: continual learning, transfer learning

14:32

18/07/2021

Training Data Subset Selection for Regression with Controlled Generalization Error

Durga S, Rishabh Iyer, Ganesh Ramakrishnan, Abir De

Keywords: Algorithms, Online Learning, Algorithms, Supervised Learning

4:15

06/12/2021

Neural Program Generation Modulo Static Analysis

Rohan Mukherjee, Yeming Wen, Dipak Chaudhari, Thomas Reps, Swarat Chaudhuri, Christopher Jermaine

Keywords: deep learning, transformers, generative model

14:58

06/12/2020

What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation

Vitaly Feldman, Chiyuan Zhang

3:22

25/07/2020

AutoGroup: Automatic feature grouping for modelling explicit high-order feature interactions in CTR prediction

Bin Liu, Niannan Xue, Huifeng Guo, Ruiming Tang, Stefanos Zafeiriou, Xiuqiang He, Zhenguo Li

Keywords: automl, high-order feature interactions, deep learning, recommender systems, neural networks

20:26

15/06/2020

SCAF: A speculation-aware collaborative dependence analysis framework

Sotiris Apostolakis, Ziyang Xu, Zujun Tan, Greg Chan, Simone Campanoni, David I. August

Keywords: speculation, collaboration, dependence analysis

16:16

16/11/2020

Transformer Based Multi-Source Domain Adaptation

Dustin Wright, Isabelle Augenstein

Keywords: unsupervised adaptation, cnns, rnns, domain classifiers

11:30

18/07/2021

Selecting Data Augmentation for Simulating Interventions

Max Ilse, Jakub Tomczak, Patrick Forré

Keywords: Algorithms, Supervised Learning

4:14

06/12/2020

Learning Sparse Prototypes for Text Generation

Junxian He, Taylor Berg-Kirkpatrick, Graham Neubig

3:22

16/11/2020

Robot Action Selection Learning via Layered Dimension Informed Program Synthesis

Jarrett Holtz, Arjun Guha, Joydeep Biswas

5:05

03/05/2021

Concept Learners for Few-Shot Learning

Kaidi Cao, Maria Brbic, Jure Leskovec

Keywords: few-shot learning, meta learning

4:55

15/11/2020

Designing Types for R, Empirically

Alexi Turcotte, Aviral Goel, Filip Křikava, Jan Vitek

Keywords: R, dynamic languages, type declarations

16:04

06/12/2021

Environment Generation for Zero-Shot Compositional Reinforcement Learning

Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, Aleksandra Faust

Keywords: reinforcement learning and planning, robustness, graph learning

8:40