[2004.00584] Deep Entity Matching with Pre-Trained Language Models

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms the previous state-of-the-art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples.
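Casting EM as sequence-pair classification means each pair of data entries is serialized into text and fed to the language model as one input pair. As a rough sketch of this idea (the exact serialization in the paper uses special `[COL]`/`[VAL]` markers per attribute; the helper below is an illustrative assumption, not Ditto's released code):

```python
def serialize(entry: dict) -> str:
    """Flatten one data entry into a token sequence, marking each
    attribute name with [COL] and its value with [VAL]."""
    return " ".join(f"[COL] {attr} [VAL] {val}" for attr, val in entry.items())


def to_sequence_pair(left: dict, right: dict) -> str:
    """Join two serialized entries into a single sequence-pair input,
    using [SEP] as the pair separator (as in BERT-style models)."""
    return f"{serialize(left)} [SEP] {serialize(right)}"


# Example: two product records whose match/no-match label the
# fine-tuned classifier would predict.
a = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
b = {"title": "instant immers spanish dlux 2", "price": "36.11"}
pair_input = to_sequence_pair(a, b)
```

The serialized string would then be tokenized and passed to a fine-tuned sequence-pair classifier (e.g. `AutoModelForSequenceClassification` from the `transformers` library) whose two output classes are match and no-match.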

Date: 2020/10/15 15:52
