[1909.03004] Show Your Work: Improved Reporting of Experimental Results

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields new insights.
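
The expected-validation-performance curve described above can be estimated directly from the empirical distribution of observed trial scores: for a budget of n trials, it is the expected maximum of n i.i.d. draws from that distribution. Below is a minimal sketch in Python of that estimator; the function name `expected_max_performance` and the synthetic scores are illustrative assumptions, not code from the paper's released implementation.

```python
import numpy as np

def expected_max_performance(scores, budgets):
    """Expected validation performance of the best of n hyperparameter
    trials, as a function of budget n, estimated from the empirical
    distribution of N observed validation scores.

    scores: validation scores from N completed hyperparameter trials.
    budgets: iterable of budgets n (number of trials) to evaluate.
    """
    v = np.sort(np.asarray(scores, dtype=float))  # ascending order
    N = len(v)
    # Empirical CDF at each sorted score: P(one draw <= v_(i)) = i / N
    ranks = np.arange(1, N + 1) / N
    curve = []
    for n in budgets:
        # P(max of n draws == v_(i)) = (i/N)^n - ((i-1)/N)^n
        p_max = ranks**n - np.concatenate(([0.0], ranks[:-1])) ** n
        curve.append(float(np.dot(p_max, v)))  # expected maximum
    return curve

# Example: expected best validation accuracy after 1, 5, 10, 50 trials,
# using hypothetical scores from 100 random-search trials.
rng = np.random.default_rng(0)
observed = rng.uniform(0.70, 0.85, size=100)
print(expected_max_performance(observed, [1, 5, 10, 50]))
```

Plotting this curve against n shows the validation score one should expect from the best of n random hyperparameter trials, which is what enables the budget-sensitive model comparisons the abstract describes.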

5 mentions: @nlpnoah, @dallascard, @royschwartz02, @Smerity
Date: 2019/09/09 20:17

Referring Tweets

@Smerity In starting a preliminary write-up on my most recent byte level language modeling work (enwik8) I've really been wishing for many of the metrics highlighted by @allen_ai's recent "Show your work" paper. https://t.co/8YWQEtzwsF https://t.co/VWdvk9Vy6y
@nlpnoah in "Show Your Work," we look at the status quo in experimental reporting in NLP -- it's abysmal -- and propose concrete ways to do better. https://t.co/W5EDGX3tFK to appear at EMNLP, by @JesseDodge, @ssgrn, @dallascard, @royschwartz02, & @nlpnoah
@dallascard I'm excited about this one! In "Show Your Work" (to appear at EMNLP) we advocate for better reporting of experimental results and propose budget-aware evaluation of models. https://t.co/czl6n5jdNr by @JesseDodge, @ssgrn, @dallascard, @royschwartz02, & @nlpnoah (1/5)
@royschwartz02 Current practice in reporting experimental results is incomplete and often inaccurate. To find out how--and what can be done about it--https://t.co/6q3bKbgLwA, #emnlp2019, by @JesseDodge, @ssgrn, @dallascard, @royschwartz02, & @nlpnoah https://t.co/2GNi5yeUq7 https://t.co/9ErwWNSqt5
