B-T simplification of Christiano preference learning - Pastebin.com

a guest, May 16th, 2019

Playing with GPT-2 for various things (mostly poetry: https://www.gwern.net/GPT-2 ), I've been thinking about the potential for preference learning, and I think the original architecture can be simplified & improved.

The motivation for the double-critic architecture is that the data being collected from humans is pairwise, and so one trains the critic to predict comparisons. This outer training loop then has an inner G/agent training loop, etc. The double training loop is necessary to collect ratings from brand-new areas of statespace that the G/agent can newly access, but also, I get the impression, GAN-style, to keep the D/critic from becoming too powerful and saturating the loss.

But just because the input is pairwise doesn't mean that the output must also be pairwise. It could instead be a scalar, with the D/critic performing regression. A Bradley-Terr...

1 mention: @gwern
Date: 2019/05/16 15:47

Referring Tweets

@gwern Another idea: simplify by dropping the comparison part from the NN architecture, which complicates it considerably and makes it harder to use for anything else. A simple Bradley-Terry model can produce real cardinal values to train the critic w/regression: https://t.co/BvumwIauF8
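
To make the tweet's suggestion concrete, here is a minimal sketch, not the author's code, of turning pairwise human comparisons into one scalar score per sample with a Bradley-Terry fit. It assumes comparisons arrive as (winner, loser) pairs of distinct samples and that every sample wins and loses at least once (otherwise the maximum-likelihood score is unbounded). The function name `fit_bradley_terry`, the sample labels, and the choice of the standard iterative minorization-maximization update for the fit are all illustrative assumptions.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry scores from (winner, loser) pairs.

    Returns a dict mapping each item to a log-strength, centered so that
    the scores have mean zero. Hypothetical helper for illustration only.
    """
    wins = defaultdict(float)         # total wins per item
    losses = defaultdict(float)       # total losses per item
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        losses[loser] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    items = sorted(set(wins) | set(losses))
    # Assumption stated in the lead-in: without this the fit diverges.
    if any(wins[i] == 0 or losses[i] == 0 for i in items):
        raise ValueError("each item must win and lose at least once")

    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Strengths are only defined up to scale, so renormalize.
        total = sum(updated.values())
        updated = {i: v * len(items) / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break

    log_strength = {i: math.log(s) for i, s in strength.items()}
    mean = sum(log_strength.values()) / len(log_strength)
    return {i: v - mean for i, v in log_strength.items()}


# Toy data: "a" usually beats "b" and "c"; "b" usually beats "c".
comparisons = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 3 + [("c", "b")]
    + [("a", "c")] * 3 + [("c", "a")]
)
scores = fit_bradley_terry(comparisons)

# Each (sample, score) pair could serve as a regression target for a critic.
regression_targets = sorted(scores.items(), key=lambda kv: -kv[1])
```

The resulting scores are on a log-odds scale: the difference between two scores is the fitted log-odds that one sample is preferred over the other, which is what lets them act as cardinal targets for a regression critic rather than pairwise labels.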