{"id":1757,"date":"2018-03-02T02:10:11","date_gmt":"2018-03-02T02:10:11","guid":{"rendered":"https:\/\/staff.fnwi.uva.nl\/m.derijke\/?p=1757"},"modified":"2018-03-05T22:13:10","modified_gmt":"2018-03-05T22:13:10","slug":"finding-influential-training-samples-for-gradient-boosted-decision-trees","status":"publish","type":"post","link":"https:\/\/staff.fnwi.uva.nl\/m.derijke\/finding-influential-training-samples-for-gradient-boosted-decision-trees\/","title":{"rendered":"Now on arXiv: Finding influential training samples for gradient boosted decision trees"},"content":{"rendered":"<p>Boris Sharchilev, Yury Ustinovsky, Pavel Serdyukov, and I have released a new pre-print on &#8220;finding influential training samples for gradient boosted decision trees&#8221; on arXiv. In the paper we address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model&#8217;s predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency. You can find the paper <a href=\"https:\/\/arxiv.org\/abs\/1802.06640\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Boris Sharchilev, Yury Ustinovsky, Pavel Serdyukov, and I have released a new pre-print on &#8220;finding influential training samples for gradient boosted decision trees&#8221; on arXiv. 