{"id":617,"date":"2026-01-16T06:01:08","date_gmt":"2026-01-16T06:01:08","guid":{"rendered":"https:\/\/blog.adlington.fr\/index.php\/2026\/01\/16\/a-quote-from-boaz-barak-gabriel-wu-jeremy-chen-and-manas-joglekar\/"},"modified":"2026-01-16T06:01:08","modified_gmt":"2026-01-16T06:01:08","slug":"a-quote-from-boaz-barak-gabriel-wu-jeremy-chen-and-manas-joglekar","status":"publish","type":"post","link":"https:\/\/blog.adlington.fr\/index.php\/2026\/01\/16\/a-quote-from-boaz-barak-gabriel-wu-jeremy-chen-and-manas-joglekar\/","title":{"rendered":"A quote from Boaz Barak, Gabriel Wu, Jeremy Chen and Manas Joglekar"},"content":{"rendered":"<blockquote><p>One way to think of confessions is that we are giving the model access to an \u201canonymous tip line\u201d where it can turn itself in by presenting incriminating evidence of misbehavior. But unlike real-world tip lines, if the model acted badly in the original task, it can collect the reward for turning itself in while still keeping the original reward from the bad behavior in the main task. We hypothesize that this form of training will teach models to produce maximally honest confessions.<br \/>\n\u2014 Read on <a href=\"https:\/\/simonwillison.net\/2026\/Jan\/15\/boaz-barak-gabriel-wu-jeremy-chen-and-manas-joglekar\/\">simonwillison.net\/2026\/Jan\/15\/boaz-barak-gabriel-wu-jeremy-chen-and-manas-joglekar\/<\/a><\/p>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>One way to think of confessions is that we are giving the model access to an \u201canonymous tip line\u201d where it can turn itself in by presenting incriminating evidence of misbehavior. But unlike real-world tip lines, if the model acted badly in the original task, it can collect the reward for turning itself in while [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[5,24,23],"class_list":["post-617","post","type-post","status-publish","format-standard","hentry","category-blog","tag-ai","tag-computing","tag-reasoning"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/posts\/617","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/comments?post=617"}],"version-history":[{"count":0,"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/posts\/617\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/media?parent=617"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/categories?post=617"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.adlington.fr\/index.php\/wp-json\/wp\/v2\/tags?post=617"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}