<h1 id="scikit-learn-sprint-salta">scikit-learn Sprint in Salta, Argentina (2022-09-29)</h1>
<p class="notice--info">Blog post cross-posted from the <a href="https://blog.scikit-learn.org/events/salta-sprint/">scikit-learn blog</a>.</p>
<p>In September of 2022, the <a href="https://pythoncientifico.ar/">SciPy Latin America</a> conference took place in Salta, Argentina.
As part of the event, we organized a <a href="https://pythoncientifico.ar/events/sprints/">scikit-learn sprint</a>.
The main idea was to introduce participants to the open source world and help them make their first contribution.
The sprint was held in person.</p>
<p style="text-align: center;"><img src="/images/posts/2022-09-29-scikit-learn-sprint-salta/scipy-la-2022-logo.png" alt="SciPy logo" width="50%" height="50%" /></p>
<h2 id="schedule">Schedule</h2>
<ul>
<li>September 27, 2022 - <strong>Pre-sprint</strong> - 10:00 to 12:00 hs (UTC -3)</li>
<li>September 28, 2022 - <strong>Sprint</strong> - 10:00 to 17:00 hs (UTC -3)</li>
</ul>
<h2 id="repository">Repository</h2>
<p>For more information in Spanish, <a href="https://github.com/jmloyola/sklearn-sprint-argentina-2022">check this repository</a>.
You will find details about the event, instructions to set up the development environment, links with further information and tutorials, and an example git workflow to make a pull request for the project.</p>
<h2 id="photos">Photos</h2>
<figure style="text-align: center;">
<img src="/images/posts/2022-09-29-scikit-learn-sprint-salta/sprint-salta-2022-1.jpg" alt="11 people standing behind some computers and 2 people projected in the screen" max-width="20%" max-height="20%" />
<figcaption>
Group photo of the SciPy Latin America sprint, Salta, Argentina, 2022. Sandra Meneses and Juan Martín Loyola are projected on the screen from a Zoom call. Photo credit: Lucía Torres.
</figcaption>
</figure>
<figure style="text-align: center;">
<img src="/images/posts/2022-09-29-scikit-learn-sprint-salta/sprint-salta-2022-2.jpeg" alt="11 people coding in their computers" max-width="20%" max-height="20%" />
<figcaption>
Participants of the SciPy Latin America sprint working on their computers. Photo credit: Ariel Silvio Norberto Ramos.
</figcaption>
</figure>
<h2 id="acknowledgment">Acknowledgment</h2>
<p>These people made this sprint possible:</p>
<ul>
<li>Ariel Silvio Norberto Ramos, one of the organizers of SciPy Latin America,</li>
<li><a href="https://www.dataumbrella.org/">Data Umbrella</a>, <a href="https://twitter.com/ScipyLA/status/1573710649963724802">one of the community partners of the event</a>, especially Sandra Meneses and Reshama Shaikh,</li>
<li>The mentors who helped run the sprint.</li>
</ul>
<h1 id="pair-programming-vs-code">Pair Programming with Visual Studio Code Live Share (2021-09-23)</h1>
<p class="notice--info">Blog post cross-posted from the <a href="https://blog.dataumbrella.org/pair-programming-with-visual-studio-code-live-share">Data Umbrella blog</a>.</p>
<p>Pair programming is an amazing experience. On the one hand, you and your coding partner can accomplish much more than you would working separately. On the other hand, you can both learn new things about the language or framework you are working with. But, above all, solving problems and coding challenges with somebody else is a lot of fun :).</p>
<p>This document aims to introduce the Visual Studio Code extension <a href="https://visualstudio.microsoft.com/services/live-share/">Live Share</a>, which allows you to collaborate with coding partners in real-time.</p>
<h2 id="live-share">Live Share</h2>
<p>Live Share is an extension for <a href="https://code.visualstudio.com/">Visual Studio Code</a> that allows you to collaboratively edit and debug with others in real-time. It lets you instantly share your current project, edit snippets of code at the same time, or follow someone’s cursor while they program.
These features are extremely helpful when you are pair-programming using the driver-navigator technique <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. While the driver codes, the navigator can observe, check other sources, or suggest changes.</p>
<h2 id="installation">Installation</h2>
<p>To install Live Share follow these steps:</p>
<ol>
<li>If needed, install Visual Studio Code. We used version 1.60.2 for this tutorial; for other versions, some steps might differ.</li>
<li>Install the Python extensions and select the Python interpreter you want to use.</li>
<li>Download and install the Live Share extension for Visual Studio Code. Use the extensions tab in Visual Studio Code (Ctrl+Shift+X) and search for Live Share. You will see three extensions: Live Share, Live Share Audio, and Live Share Extension Pack. Only the first one, Live Share, is necessary. The last one contains the two previous extensions.</li>
<li>Wait for the extension to finish downloading and then reload VS Code when prompted.</li>
<li>Wait for Visual Studio Live Share to finish installing dependencies (you’ll see progress updates in the status bar).</li>
<li>Once complete, you’ll see Live Share appear in your status bar and the activity bar.
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/live_share_status_bar.png" alt="Live Share Status Bar" width="150px" />
<img src="/images/posts/2021-09-23-pair-programming-vs-code/live_share_activity_bar.png" alt="Live Share Activity Bar" width="25px" /></p>
</li>
<li>[Optional] Sign in with GitHub to use Live Share. To do this, go to the “Account” tab in the left panel. If you don’t sign in, you can only join sessions anonymously and, as far as I can tell, you will not be able to start a session. In some web browsers, GitHub authentication can fail; if that happens, try a different browser.</li>
</ol>
<h2 id="how-to-use-it">How to use it</h2>
<p>In what follows, we are going to assume two people <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> are pair-programming, where one takes the role of the driver (in the screenshots represented using the dark theme) and the other is the navigator (in the screenshots represented using the light theme).</p>
<ol>
<li><strong>[driver]</strong> To begin a Live Share session, first make sure that the folder you want to work in is open. Then either:
<ul>
<li>click on “Live Share” on the status bar, or</li>
<li>click on the Live Share icon in the activity bar and then click on “Share”.</li>
</ul>
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/create_session.png" alt="Create Session" width="400px" /></p>
</li>
<li><strong>[driver]</strong> You should see a notification with a message saying your collaboration session is starting.</li>
<li><strong>[driver]</strong> Once the session starts, a link will be copied to your clipboard. Send it to the person you want to collaborate with.</li>
<li><strong>[navigator]</strong> To join the collaboration session, go to the Live Share icon in the activity bar and click “Join”. If you have the link in your clipboard, when you click on “Join” it will automatically join the session. Otherwise, you will have to paste the link in the panel that appears in the center of the screen. Now you should wait for the driver to accept you into the session.
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/join_session.png" alt="Join Session" width="400px" /></p>
</li>
<li><strong>[driver]</strong> Accept the navigator into the collaboration session.
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/accept_session.png" alt="Accept Session" width="300px" /></p>
</li>
<li>You will now see the Live Session details showing the other participants. You can now start working collaboratively :).
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/working_session.png" alt="Working Session" width="200px" /></p>
</li>
<li>You can start following each other while you code. This is extremely useful when discussing code over a voice channel.
<ul>
<li>To begin following your partner click on their name.
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/following_you.png" alt="Following Partner" width="600px" /></p>
</li>
<li>To make your partner follow you click on “Focus Participants”.
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/follow_me.png" alt="Follow Me" width="200px" /></p>
</li>
</ul>
</li>
<li>You can chat with the rest of the participants using the “Session chat” on the Live Share tab.</li>
<li><strong>[driver]</strong> You can also share a terminal using the Live Share tab (in read-only or read/write mode), and you can stop sharing it at any time. In read/write mode, all the participants share the same terminal, so be careful not to step on each other’s commands.</li>
<li>To end the session you have to click on “Stop Collaboration Session”.
<p style="text-align: center;"><img src="/images/posts/2021-09-23-pair-programming-vs-code/end_session.png" alt="End Session" width="200px" /></p>
</li>
</ol>
<h2 id="notes">Notes</h2>
<p>While working collaboratively on Live Share, all the changes will appear as if they had been made by the host of the session (the driver). Thus, remember to appropriately indicate that the work was done as a team in the <a href="https://docs.github.com/en/github/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors">commit message</a> :D.</p>
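<p>For example, GitHub recognizes a <code class="language-plaintext highlighter-rouge">Co-authored-by</code> trailer at the end of the commit message; the subject line, name, and email below are placeholders:</p>
<figure class="highlight"><pre><code class="language-plaintext" data-lang="plaintext">Add tests for the tree structure

Co-authored-by: Partner Name <partner@example.com></code></pre></figure>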
<h2 id="more-information">More information</h2>
<p>For more information or if you encounter a problem while installing or using Live Share, visit these pages:</p>
<ul>
<li><a href="https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare">Visual Studio Code Marketplace: Live Share</a></li>
<li><a href="https://docs.microsoft.com/en-us/visualstudio/liveshare/">Documentation: Visual Studio Live Share</a></li>
</ul>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>If you want to read more about pair-programming and the different techniques you can use, visit this <a href="https://medium.com/@weblab_tech/pair-programming-guide-a76ca43ff389">link</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Note that you can work with more than two people using the Live Share extension. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1 id="gsoc-2019-final-evaluation">GSoC 2019 Final Evaluation (2019-08-22)</h1>
<p>The Google Summer of Code 2019 has come to an end.
It has been an amazing experience!
I would recommend it to anyone that is interested in participating in an open source project.</p>
<p>In what follows, I will present a summary of the work done for the Google Summer of Code project <a href="https://summerofcode.withgoogle.com/projects/#4666396833742848">“Bayesian Additive Regression Trees in PyMC3”</a>. I will go over key points of the project and provide links to commits that show the current state of the implementation. Along the way, I will point out different challenges encountered and future work to be done.</p>
<p>All the code produced during GSoC 2019 is available in <a href="https://github.com/jmloyola/pymc3/tree/add_bart">this GitHub repository</a>.
Also, I have written a series of <a href="https://jmloyola.github.io/tags/#gsoc-2019">blog posts</a> where I talk about the Bayesian Additive Regression Trees (BART) model and its implementation.</p>
<h2 id="project-abstract">Project Abstract</h2>
<p>Bayesian Additive Regression Trees is a Bayesian nonparametric approach to estimating functions using regression trees. A BART model consists of a sum of regression trees with additive normal noise. Regression trees are defined by recursively partitioning the input space and defining a local model in each resulting region in order to approximate some unknown function. BART is a useful and flexible model for capturing interactions and non-linearities, and has proven a useful tool for variable selection.</p>
<p>Bayesian Additive Regression Trees will allow PyMC3 users to perform regressions with a “canned” non-parametric model. By simply calling a method, users will obtain the mean regressor plus the uncertainty estimation in a fully Bayesian way. This can later be used to predict on hold-out data. Furthermore, the implemented BART model will allow experienced users to specify their own priors for the specific problem they are tackling, improving performance substantially.</p>
<h2 id="project-status">Project status</h2>
<p>A lot of work went into the Google Summer of Code.
I read several papers about BART and wrote a lot of code, tests, and documentation.
Sadly, we did not meet all the goals we set at the beginning.
There is still a lot of debugging to do to ensure the model is correct.</p>
<p>The main files for the project are:</p>
<ul>
<li><a href="https://github.com/jmloyola/pymc3/blob/add_bart/pymc3/bart/tree.py">Tree structure</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/blob/add_bart/pymc3/bart/bart.py">BART</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/blob/add_bart/pymc3/bart/exceptions.py">BART exceptions</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/blob/add_bart/pymc3/tests/test_tree_nodes.py">Tests for nodes</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/blob/add_bart/pymc3/tests/test_tree_structure.py">Tests for tree structure</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/blob/add_bart/pymc3/tests/test_bart.py">Tests for BART</a>.</li>
</ul>
<h2 id="things-done">Things done</h2>
<ul>
<li>Community bonding period. During this time I learned more about PyMC and the BART model. I also set up this blog. For more details you can check this blog post: <a href="/posts/2019/06/coding-period-begins">“Coding period begins”</a>.</li>
<li>Understand the BART model. This involved reading many research papers and different implementations. The most relevant papers read were <em>Chipman et al. (2010)</em>, <em>Chipman et al. (1998)</em>, <em>Lakshminarayanan et al. (2015)</em>, <em>Tan et al. (2019)</em> and <em>Kapelner & Bleich (2013)</em>. I studied the code for some BART implementations to complement the information given in the papers, for example, <a href="https://github.com/JakeColtman/bartpy">bartpy</a>, <a href="https://github.com/balajiln/pgbart">pgbart</a> and <a href="https://github.com/kapelner/bartMachine">bartMachine</a>. Finally, some notes I wrote about this process can be seen in this blog:
<ul>
<li><a href="/posts/2019/06/introduction-to-bart">“Introduction to Bayesian Additive Regression Trees”</a>.</li>
<li><a href="/posts/2019/07/posterior-inference-in-bart">“Posterior inference in Bayesian Additive Regression Trees”</a>.</li>
</ul>
</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/0ddb3dc9584f2b6ca5b45d4b6d33d9de317d3e4f">Implement the tree structure</a>. For details on the choosen tree structure implementation, you can check this blog post: <a href="/posts/2019/07/bart-tree-structure">“BART’s tree structure implementation”</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/473593013f3864c5d15bc011cc611dac1ec90550">Implement the visualization of a tree</a>.</li>
<li>Add tests for the <a href="https://github.com/jmloyola/pymc3/commit/0ddb3dc9584f2b6ca5b45d4b6d33d9de317d3e4f">nodes</a> and the <a href="https://github.com/jmloyola/pymc3/commit/473593013f3864c5d15bc011cc611dac1ec90550">tree structure</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/9b8c7d2cccc904ffa51a973d3482bf735271575b">Define two APIs for BART</a>: one for the conjugate model defined in the original paper of <em>Chipman et al. (2010)</em> and a flexible API for user defined priors and likelihoods.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/43e6072deb30225db1d51126b0cff72130934378">Set the priors for BART</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/50fc9440f7a9c62d0f498291d283ccc5f4b6ba1a">Implement the Bayesian backfitting MCMC for posterior inference in BART</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/5ecec56692fc7ea335260cd4aa55a47204ab996e">Implement the tree samplers GrowPrune and ParticleGibbs</a>. Both these approaches are described in <em>Lakshminarayanan et al. (2015)</em>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/b5cb5e80cc1f118b7aea91f122f1fa0bde2f1c25">Add <code class="language-plaintext highlighter-rouge">Particle</code> class to get a clearer code</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/cbcbf9e5b8ddc0578408e2d303e9134bba8dd4c2#diff-be411b5f778aaf7c0eadc4d0456e7752">Add tests for BART</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/d9d8e210ddea0102e6359fb0aeb9579fa5157100">Optimize space of data structure for trees</a>.</li>
<li><a href="https://github.com/jmloyola/pymc3/commit/1360afa371fe2642e074e441bb1850b07a67ffc3">Add documentation to tree structure code</a>.</li>
</ul>
<h2 id="challenges">Challenges</h2>
<p>During GSoC I faced two main challenges. On the one hand, implementing a model described in a series of papers was not as simple as I thought at first. There are many small details that are not explained in the papers but that surface while implementing the model. This entailed examining the implementations accompanying the original papers and reading many more papers to find the answers, a task that proved hard because I am not an expert in the topic. On the other hand, I faced the challenge of merging the code I wrote for this model into the PyMC3 code base. For this I had to dive deep into the code base of PyMC3 and adapt the code I had written. I go into a little more detail in this blog post: <a href="/posts/2019/08/almost-end-coding-period">“The coding period is coming to an end”</a>.</p>
<h2 id="future-work">Future work</h2>
<p>Although GSoC has finished, I plan to keep working on the project. It is not going to be as intense as during GSoC, but I will dedicate a couple of hours a week to getting BART merged into PyMC.</p>
<p>To have an initial version of BART in PyMC we should:</p>
<ul>
<li>Add documentation to the rest of the code.</li>
<li>Ensure the model is correct. Compare our model with other implementations.</li>
<li>Add a Jupyter notebook showing the effect of the BART priors.</li>
<li>Add a Jupyter notebook showing how BART performs.</li>
<li>Finish merging the BART code to PyMC3.</li>
<li>Refactor Tree code. Separate the basic tree structure from the things needed for BART. Add a new class <code class="language-plaintext highlighter-rouge">BARTTree</code> which adds the attributes BART needs.</li>
</ul>
<p>To extend the model we should:</p>
<ul>
<li>Implement another way to calculate variable importance following the paper from <em>Bleich et al. (2014)</em>.</li>
<li>Implement another way to choose the path when the variable has <code class="language-plaintext highlighter-rouge">np.NaN</code>, following the paper from <em>Kapelner & Bleich (2015)</em> (see the sketch after this list). For example:
<ul>
<li>Always to the left.</li>
<li>Always to the right.</li>
<li>To the left or right at random.</li>
<li>Every data point with <code class="language-plaintext highlighter-rouge">np.NaN</code> in that particular variate goes to the left and the rest of the data points to the right. The split node divides the space on the condition <code class="language-plaintext highlighter-rouge">np.isnan(x)</code>.</li>
</ul>
</li>
<li>Implement single program, multiple data (SPMD) parallelization following the paper from <em>Pratola et al. (2014)</em>.</li>
<li>Optimize time for posterior estimation following the paper from <em>He et al. (2019)</em>.</li>
</ul>
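<p>A minimal sketch of such a missing-value rule, assuming a stand-alone helper (the function name and signature are illustrative, not part of the current implementation):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def go_left(x_value, split_value, nan_strategy='left'):
    """Decide the branch for one observation at a quantitative split node.

    nan_strategy handles missing values ('left', 'right' or 'random'),
    mirroring the options listed above (Kapelner & Bleich, 2015).
    """
    if np.isnan(x_value):
        if nan_strategy == 'left':
            return True
        if nan_strategy == 'right':
            return False
        return np.random.random() < 0.5
    return x_value <= split_value</code></pre></figure>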
<h2 id="references">References</h2>
<ol>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. <em>The Annals of Applied Statistics</em>, <em>4</em>(1), 266-298.</li>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. <em>Journal of the American Statistical Association</em>, <em>93</em>(443), 935-948.</li>
<li>Lakshminarayanan, B., Roy, D., & Teh, Y. W. (2015). Particle Gibbs for Bayesian additive regression trees. In <em>Artificial Intelligence and Statistics</em> (pp. 553-561).</li>
<li>Kapelner, A., & Bleich, J. (2013). bartMachine: Machine learning with Bayesian additive regression trees. <em>arXiv preprint arXiv:1312.2171</em>.</li>
<li>Tan, Y. V., & Roy, J. (2019). Bayesian additive regression trees and the General BART model. <em>arXiv preprint arXiv:1901.07504</em>.</li>
<li>Pratola, M. T., Chipman, H. A., Gattiker, J. R., Higdon, D. M., McCulloch, R., & Rust, W. N. (2014). Parallel Bayesian additive regression trees. <em>Journal of Computational and Graphical Statistics</em>, <em>23</em>(3), 830-852.</li>
<li>Bleich, J., Kapelner, A., George, E. I., & Jensen, S. T. (2014). Variable selection for BART: an application to gene regulation. <em>The Annals of Applied Statistics</em>, <em>8</em>(3), 1750-1781.</li>
<li>Kapelner, A., & Bleich, J. (2015). Prediction with missing data via Bayesian additive regression trees. <em>Canadian Journal of Statistics</em>, <em>43</em>(2), 224-239.</li>
<li>He, J., Yalov, S., & Hahn, P. R. (2019, April). XBART: Accelerated Bayesian Additive Regression Trees. In <em>The 22nd International Conference on Artificial Intelligence and Statistics</em> (pp. 1130-1138).</li>
</ol>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>I will like to thanks <strong>Google</strong> and <strong>NumFOCUS</strong> for promoting open-source projects and bringing students closer to open-source communities; <strong>PyMC</strong> for the warm welcome to the community; and <strong>Austin Rochford</strong> and <strong>Osvaldo Martin</strong> for their support as mentors during the development of the project.</p>Juan Martín Loyolajmloyola@unsl.edu.arThe coding period is coming to an end2019-08-01T00:00:00-03:002019-08-01T00:00:00-03:00https://jmloyola.github.io/posts/2019/08/almost-end-coding-period<p>We already passed two thirds of the coding period for GSoC, with only two weeks left.
It has been a tremendous experience and I have learned a lot so far, both in programming and in statistics.
Nevertheless, I still consider myself a newcomer when it comes to Bayesian statistics.
There is still a lot to learn.
A big part of what I learned is thanks to the problems we encountered.</p>
<p>In this post we will reflect on what we have been doing for GSoC and some of the difficulties that arose.</p>
<h2 id="what-have-we-done-until-now">What have we done until now?</h2>
<p>During the coding period, I would say my time was split evenly between understanding the model and implementing it.
With respect to the implementation, I tried to always think ahead about possible errors or user frictions with the API.</p>
<p>I also paid attention to the performance of the implementation, so that this model is competitive with others already existing in PyMC.
For example, while implementing the function <code class="language-plaintext highlighter-rouge">get_available_predictors()</code>, two different approaches came to mind.
One uses <code class="language-plaintext highlighter-rouge">numpy.unique()</code> to get the unique values of a predictor and, if there are at least two of them, adds the predictor to the list of available predictors.
The other loops through the dataset looking at one particular variate at a time; as soon as two different values are observed, the predictor is added to the list and the iteration breaks.
The latter was based on the implementation from <a href="https://github.com/kapelner/bartMachine">bartMachine</a>.
In such cases, I tested both approaches and only kept the fastest one, which for <code class="language-plaintext highlighter-rouge">get_available_predictors()</code> was the second one.</p>
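<p>A minimal sketch of the two approaches, assuming the predictors are the columns of a NumPy array (the function names are illustrative, not the actual PyMC3 code):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def available_predictors_unique(X):
    # Approach 1: a predictor is available if it has at least two unique values.
    return [j for j in range(X.shape[1]) if len(np.unique(X[:, j])) >= 2]

def available_predictors_early_exit(X):
    # Approach 2 (based on bartMachine): stop scanning a column as soon as
    # a second distinct value is found.
    available = []
    for j in range(X.shape[1]):
        first_value = X[0, j]
        for value in X[1:, j]:
            if value != first_value:
                available.append(j)
                break
    return available</code></pre></figure>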
<p>Nevertheless, I didn’t think much about parallelizing the implementation.
For this, we could take advantage of the GPU or multithreaded architectures like Pratola et al. (2014) did.
This could be implemented once the model is stable.
It might be a good project for another Google Summer of Code.</p>
<p>Next is a brief summary of tasks done so far:</p>
<ul>
<li>Understand the Bayesian additive regression trees model. This involved reading many research papers and different implementations.</li>
<li>Implement the tree structure.</li>
<li>Implement the visualization of a tree.</li>
<li>Add tests for the nodes and the tree structure.</li>
<li>Define the API for BART.</li>
<li>Implement the setting of the priors.</li>
<li>Derive and implement the posterior distributions for $\mu_{ij}$ and $\sigma^2$ in BART.</li>
<li>Implement the Bayesian backfitting MCMC for posterior inference in BART.</li>
<li>Update PyMC documentation. While I was reading the documentation of PyMC, I encountered some minor errors. Thus, I made some pull requests to fix them: <a href="https://github.com/pymc-devs/pymc3/pull/3533">#3533</a> and <a href="https://github.com/pymc-devs/pymc3/pull/3537">#3537</a>.</li>
</ul>
<h2 id="difficulties">Difficulties</h2>
<p>There were two issues that were relatively difficult for me, one related to the theory of BART and the other related to the implementation:</p>
<ul>
<li>Posterior distributions for $\mu_{ij}$ and $\sigma^2$ in BART.</li>
<li>PyMC API for BART.</li>
</ul>
<h3 id="posterior-distributions-for-mu_ij-and-sigma2-in-bart">Posterior distributions for $\mu_{ij}$ and $\sigma^2$ in BART</h3>
<p>This project made me realize that to fully understand a model, its ins and outs, one should read the literature and the implementations, and try to implement it oneself.
While trying to implement something, one realizes that there are details that are not explained in the papers.
In this case, this may be due to my ignorance.
Thus, when I started implementing the Bayesian backfitting MCMC algorithm for posterior inference in BART, I found that what seemed simple while reading Chipman et al. (2010) was not so clear when implementing it.</p>
<p>First, in the paper, the authors comment that the draws of $M_j$ from the posterior distribution are <em>simply</em> (:neutral_face: -something similar happened to me with some proofs that were skipped in Bishop’s book for being <em>straightforward</em>-) a set of independent draws of the terminal node $\mu_{ij}$’s from a normal distribution. But, <strong>what parameters should this normal distribution have?</strong>
Something similar happened with the posterior distribution for the $\sigma^2$ parameter.</p>
<p>I was clueless. To find the answer, I studied the code of four BART implementations: <a href="https://github.com/JakeColtman/bartpy">bartpy</a>, <a href="https://github.com/balajiln/pgbart">pgbart</a>, <a href="https://github.com/kapelner/bartMachine">bartMachine</a> and <a href="https://github.com/cran/BayesTree">BayesTree</a>.
This wasn’t easy either, because each implementation used a different parametrization of the distributions, which resulted in different expressions for the posteriors.</p>
<p>We ended up choosing one implementation and following it.
Since <em>bartpy</em> was the clearest implementation, we picked it.
But I wasn’t sure the implementation was correct.
Later, I found a paper by Tan et al. (2019) where they derive these posteriors, confirming the validity of the implementation.</p>
<p>If we use the conjugate priors for $\mu_{ij}\mid T_j$ and $\sigma^2$ as in Chipman et al. (2010), the posterior distributions can be obtained analytically. Tan et al. (2019) derived these two expressions in the following manner.</p>
<p>Let $R_{ij}=(R_{ij1}, \ldots, R_{ij\eta_i})^T$ be a subset from $R_j$ where $\eta_i$ is the number of $R_{ijh}$s allocated to the terminal node with parameter $\mu_{ij}$ and $h$ indexes the subjects allocated to the terminal node with parameter $\mu_{ij}$. Note that $R_{ijh} \mid g(X_{ijh}, T_j, M_j),\sigma \sim \mathcal{N}(\mu_{ij}, \sigma^2)$ and $\mu_{ij} \mid T_j \sim \mathcal{N}(\mu_{\mu}, \sigma_{\mu}^2)$. Then the posterior distribution of $\mu_{ij}$ is given by</p>
<p>\begin{equation}
\begin{split}
p(\mu_{ij} \mid T_j, \sigma, R_j) &\propto p(R_{ij} \mid T_j, \mu_{ij}, \sigma) \, p(\mu_{ij} \mid T_j)\\ &\propto \exp \left[ - \frac{\left( \mu_{ij}-\frac{\sigma_{\mu}^2 \sum_{h} R_{ijh} + \sigma^2 \mu_{\mu}} {\eta_i \sigma_{\mu}^2 + \sigma^2} \right)^2 } {2 \frac{\sigma^2 \sigma_{\mu}^2}{\eta_i \sigma_{\mu}^2 + \sigma^2} } \right]
\end{split}
\end{equation}</p>
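<p>A draw of $\mu_{ij}$ from this posterior is a normal draw whose mean and variance can be read off the expression above. A minimal sketch (the function name and arguments are illustrative):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def draw_leaf_value(residuals, sigma, mu_mu, sigma_mu):
    # residuals: the R_ijh values allocated to this terminal node.
    eta_i = len(residuals)
    posterior_var = (sigma**2 * sigma_mu**2) / (eta_i * sigma_mu**2 + sigma**2)
    posterior_mean = (sigma_mu**2 * np.sum(residuals) + sigma**2 * mu_mu) / (
        eta_i * sigma_mu**2 + sigma**2)
    return np.random.normal(posterior_mean, np.sqrt(posterior_var))</code></pre></figure>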
<p>Finally, let $Y = (Y_1, \ldots, Y_n)^T$ and $k$ index the subjects $k=1,\ldots,n$. With $\sigma^2 \sim \frac{\nu\lambda}{\chi_{\nu}^2}$, or using the Inverse Gamma reparametrization, $\sigma^2 \sim IG(\frac{\nu}{2},\frac{\nu \lambda}{2})$, we obtain the posterior draw of $\sigma$ as follows</p>
<p>\begin{equation}
\begin{split}
p(\sigma^2 \mid \{T_j, M_j\}_{j=1}^m, Y) \propto& \, p(Y \mid \{T_j, M_j\}_{j=1}^m, \sigma) \, p(\sigma^2)\\ =& \, (\sigma^2)^{-(\frac{\nu + n} {2} +1)}\\ & \,\exp \left[ - \frac{\nu \lambda + \sum_{i=1}^n (Y_i - \sum_{j=1}^m g(X_i,T_j,M_j))^2} {2 \sigma^2} \right]
\end{split}
\end{equation}</p>
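<p>This is an inverse gamma distribution with shape $\frac{\nu + n}{2}$ and scale $\frac{\nu \lambda + \sum_{i=1}^n (Y_i - \sum_{j=1}^m g(X_i,T_j,M_j))^2}{2}$. A minimal sketch of the draw (illustrative names; <code class="language-plaintext highlighter-rouge">sum_of_trees</code> stands for the current fit $\sum_{j=1}^m g(X_i,T_j,M_j)$):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def draw_sigma_squared(y, sum_of_trees, nu, lam):
    n = len(y)
    sse = np.sum((y - sum_of_trees) ** 2)
    shape = (nu + n) / 2.0
    scale = (nu * lam + sse) / 2.0
    # If X ~ Gamma(shape, scale=1/b), then 1/X ~ InverseGamma(shape, b).
    return 1.0 / np.random.gamma(shape, 1.0 / scale)</code></pre></figure>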
<h3 id="pymc-api-for-bart">PyMC API for BART</h3>
<p>Or, in other words, how to fit the code for BART into the PyMC code base.
For this, I examined PyMC’s code base in order to deepen my knowledge of its data flow, architecture, and API.
The API design was the topic I discussed most with my mentors, since we wanted little friction for the user.
Because they use PyMC frequently and can spot awkwardly written functions, they provided valuable feedback.</p>
<p>We brainstormed and arrived at a beta version of the API for the model presented in Chipman et al. (2010).
This implementation only allowed a normal likelihood and conjugate priors, and calculated the posterior for the parameters using analytical derivations.</p>
<p>But this is not how PyMC works.
The API should be flexible for the user.
If she wants to use some other likelihood or other priors, she should be able to.
PyMC should take care of everything and arrive at the posterior of the parameters given the data.
I am still struggling with this part, since it requires two things I still can’t figure out how to do:</p>
<ul>
<li>Deduce the likelihood of the model when it is not explicitly given.</li>
<li>Calculate the posterior distribution of the parameters with non-conjugate priors.</li>
</ul>
<p>We have a draft implementation, but there is still a lot of work to do.</p>
<h2 id="references">References</h2>
<ol>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. <em>The Annals of Applied Statistics</em>, <em>4</em>(1), 266-298.</li>
<li>Tan, Y. V., & Roy, J. (2019). Bayesian additive regression trees and the General BART model. <em>arXiv preprint arXiv:1901.07504</em>.</li>
<li>Pratola, M. T., Chipman, H. A., Gattiker, J. R., Higdon, D. M., McCulloch, R., & Rust, W. N. (2014). Parallel Bayesian additive regression trees. <em>Journal of Computational and Graphical Statistics</em>, <em>23</em>(3), 830-852.</li>
</ol>
<h1 id="posterior-inference-in-bart">Posterior inference in Bayesian Additive Regression Trees (2019-07-21)</h1>
<p>In this post we will show how posterior inference in BART is performed. To do this, we first introduce the specification of the likelihood for BART. Then, we show the Bayesian backfitting MCMC algorithm used to sample from the BART posterior. Finally, we briefly describe different approaches to sampling the tree structure.</p>
<h2 id="likelihood-specification-for-bart">Likelihood specification for BART</h2>
<p>As we commented in a previous post, BART is a sum-of-trees model specified by:</p>
\[y=\sum_{j=1}^{m}g(x; T_j, M_j)+\epsilon\text{,}\qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\text{.}\]
<p class="notice--primary">To reduce clutter in the text, from now on, we will write $\{T_j, M_j\}_{j=1}^m$ to refer to $(T_1, M_1), \ldots,(T_m, M_m)$.</p>
<p>Hence, the likelihood for a training instance is</p>
\[\ell\left ( y \mid \{T_j, M_j\}_{j=1}^m, \sigma, x \right ) = \mathcal{N} \left ( y \mid \sum_{j=1}^{m}g(x; T_j, M_j), \sigma \right )\]
<p>and the likelihood for the entire training dataset (note that $Y = (y_1, \ldots, y_n)$) is</p>
\[\ell\left ( Y \mid \{T_j, M_j\}_{j=1}^m, \sigma, X \right ) = \prod_{i=1}^n \ell(y_i \mid \{T_j, M_j\}_{j=1}^m, \sigma, x_i)\text{.}\]
<h2 id="bayesian-backfitting-mcmc-for-posterior-inference-in-bart">Bayesian backfitting MCMC for posterior inference in BART</h2>
<p>Given the likelihood and the prior, the posterior distribution is</p>
\[p \left (\{T_j, M_j\}_{j=1}^m, \sigma \mid Y,X \right ) \propto \ell\left ( Y \mid \{T_j, M_j\}_{j=1}^m, \sigma, X \right ) p \left ( \{T_j, M_j\}_{j=1}^m , \sigma \mid X \right )\text{.}\]
<p>To sample from the BART posterior, Chipman et al. (2010) proposed a Bayesian backfitting MCMC algorithm. At a general level, this algorithm is a Gibbs sampler that loops through the trees, sampling:</p>
<ul>
<li>each tree $T_j$ and associated parameters $M_j$ conditioned on $\sigma$ and the remaining trees and their associated parameters,
$\{ T_{j'}, M_{j'} \}_{j'\neq j}$; and</li>
<li>$\sigma$ conditioned on all the trees and parameters $\{T_j, M_j\}_{j=1}^m$.</li>
</ul>
<p>Let $T_j^{(i)}$, $M_j^{(i)}$, and $\sigma^{(i)}$ denote the values of $T_j$, $M_j$ and $\sigma$ at the $i^{th}$ MCMC iteration, respectively.
Sampling $\sigma$ conditioned on $\{T_j, M_j\}_{j=1}^m$ is straightforward due to conjugacy, i.e., a draw from an inverse gamma distribution.
To sample $T_j, M_j$ conditioned on the other trees $\{ T_{j'}, M_{j'} \}_{j'\neq j}$, we first sample $T_j \mid \{ T_{j'}, M_{j'} \}_{j'\neq j}, \sigma$ and then sample $M_j \mid T_j, \{ T_{j'}, M_{j'} \}_{j'\neq j}, \sigma$. More precisely, we compute the residual</p>
\[R_j = Y - \sum_{j'=1, j'\ne j}^m g(X; T_{j'}, M_{j'})\text{.}\]
<p>Using the residual $R_j^{(i)}$ as the target, sample $T_j^{(i)}$ by proposing local changes to $T_j^{(i-1)}$.
Finally, $M_j$ is sampled from a Gaussian distribution conditioned on $T_j, \{ T_{j'}, M_{j'} \}_{j'\neq j}, \sigma$.
The algorithm is summarized in the following figure.</p>
<figure>
<a href="/images/posts/2019-07-21-posterior-inference-in-bart/backfitting_MCMC.png"><img src="/images/posts/2019-07-21-posterior-inference-in-bart/backfitting_MCMC.png" /></a>
<figcaption>Adapted from Lakshminarayanan et al. (2015).</figcaption>
</figure>
<h2 id="sample-of-the-tree-structure-t_ji-mid-r_ji-sigmai-t_ji-1">Sample of the tree structure, $T_j^{(i)} \mid R_j^{(i)}, \sigma^{(i)}, T_j^{(i-1)}$</h2>
<p>To sample $T_j$, Chipman et al. (2010) use the MCMC algorithm proposed by Chipman et al. (1998). This algorithm, which we refer to as <strong>CGM</strong>, is a Metropolis-within-Gibbs sampler that randomly chooses one of the following four moves:</p>
<ul>
<li><em>grow</em>: randomly chooses a leaf node and splits it further into left and right children,</li>
<li><em>prune</em>: randomly chooses an internal node where both the children are leaf nodes and prunes the two leaf nodes, thereby making the internal node a leaf node,</li>
<li><em>change</em>: changes the decision rule at a randomly chosen internal node,</li>
<li><em>swap</em>: swaps the decision rules at a parent-child pair where both the parent and child are internal nodes.</li>
</ul>
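<p>A minimal sketch of the move selection; the probabilities below (0.25, 0.25, 0.4, 0.1 for grow, prune, change, swap) are those suggested in Chipman et al. (1998), and the helper itself is illustrative:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sample_move(p_moves=(0.25, 0.25, 0.4, 0.1)):
    # Probabilities for grow, prune, change and swap, respectively.
    moves = ['grow', 'prune', 'change', 'swap']
    return np.random.choice(moves, p=p_moves)</code></pre></figure>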
<p>Another approach, introduced by Pratola et al. (2014) to reduce computational time, proposes using only the <em>grow</em> and <em>prune</em> moves; we will call this the <strong>GrowPrune</strong> sampler.</p>
<p>Finally, Lakshminarayanan et al. (2015) proposed a sampler based on the Particle Gibbs algorithm; we will call this <strong>PG</strong>. The PG sampler is implemented using the conditional SMC algorithm (instead of the Metropolis-Hastings sampler). Rather than making local changes to individual trees (as in Chipman et al. (2010)), the PG sampler proposes a complete tree to fit the residual.</p>
<h2 id="references">References</h2>
<ol>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. <em>The Annals of Applied Statistics</em>, <em>4</em>(1), 266-298.</li>
<li>Lakshminarayanan, B., Roy, D., & Teh, Y. W. (2015, February). Particle Gibbs for Bayesian additive regression trees. In <em>Artificial Intelligence and Statistics</em> (pp. 553-561).</li>
<li>Kapelner, A., & Bleich, J. (2013). bartMachine: Machine learning with Bayesian additive regression trees. <em>arXiv preprint arXiv:1312.2171</em>.</li>
<li>Tan, Y. V., & Roy, J. (2019). Bayesian additive regression trees and the General BART model. <em>arXiv preprint arXiv:1901.07504</em>.</li>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. <em>Journal of the American Statistical Association</em>, <em>93</em>(443), 935-948.</li>
<li>Pratola, M. T., Chipman, H. A., Gattiker, J. R., Higdon, D. M., McCulloch, R., & Rust, W. N. (2014). Parallel Bayesian additive regression trees. <em>Journal of Computational and Graphical Statistics</em>, <em>23</em>(3), 830-852.</li>
</ol>
<h1 id="bart-tree-structure">BART’s tree structure implementation (2019-07-05)</h1>
<p>As we commented in the previous <a href="/posts/2019/06/introduction-to-bart">post</a>, BART is a sum-of-trees model where each tree is a decision tree. Thus, to implement BART we first have to implement the tree structure it uses. The tree implementation needs:</p>
<ul>
<li>A data structure to link the nodes. This should allow us to randomly access a node in the tree and to easily add and delete a node.</li>
<li>Two types of nodes that make up the tree: splitting nodes and leaf nodes. The splitting nodes are responsible for the division of the predictor space and hold the logic to traverse the tree given an element $x$; the leaf nodes hold the responses $\mu_{ij}$ of the tree.</li>
<li>Functions to grow and prune the tree.</li>
<li>Checks of correctness of the tree.</li>
</ul>
<h2 id="data-structure-to-link-the-nodes">Data structure to link the nodes</h2>
<p>Two types of data structures were considered:</p>
<ul>
<li>A series of linked nodes, where each node has a link to its left and right child nodes, if they exist. The root node represents the whole tree, since the whole tree can be traversed from it.</li>
<li>A dictionary that represents the nodes stored in breadth-first order, based in the <a href="https://en.wikipedia.org/wiki/Binary_tree#Arrays">array method for storing binary trees</a>.</li>
</ul>
<p>We started coding the linked-nodes structure, inspired by <a href="https://github.com/joowani/binarytree">this implementation of a binary tree</a>. This structure made creating and deleting nodes easy, since we only needed to replace links between nodes. But, early on, we realized that this implementation would not allow us to randomly access a node in the tree. Thus, we dropped it, keeping only the code to represent the tree as a string.</p>
<p>Therefore, we thought of a structure that would explicitly represent the nodes and their positions. Assume we have a complete binary tree, a binary tree in which every level, except possibly the last, is completely filled and all nodes are as far left as possible. If we walk through the tree in breadth-first order and number the nodes from zero to the number of nodes minus one, we can identify every node and its position in the tree structure by this number.</p>
<p><img src="https://jmloyola.github.io/images/posts/2019-07-05-bart-tree-structure/complete_binary_tree.png" alt="complete_binary_tree" class="align-center" /></p>
<p>A complete binary tree is efficiently implemented as an array, where a node at location $i$ has children at indexes $2i + 1$ and $2i + 2$ and a parent at location $\left \lfloor{(i - 1) / 2}\right \rfloor $. Since Python doesn’t have a built-in array structure, we considered two basic structures: <code class="language-plaintext highlighter-rouge">list</code> and <code class="language-plaintext highlighter-rouge">dict</code>.</p>
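<p>The index arithmetic is a couple of one-liners (the helper names are ours, for illustration):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def left_child_index(i):
    return 2 * i + 1

def right_child_index(i):
    return 2 * i + 2

def parent_index(i):
    return (i - 1) // 2  # floor((i - 1) / 2)</code></pre></figure>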
<p>Note that, although the indices come from numbering a complete binary tree, BART does not necessarily construct this type of tree. The only thing we can ensure about the tree structure is that each node has exactly zero or two children. Still, this numbering will prove useful for indexing our structure.</p>
<p>If we tried to implement this structure using a list, we would end up wasting a lot of space, since we would have to create dummy nodes to represent non-existent nodes. Thus, we ended up coding the tree structure as a dictionary, where the keys represent the node positions and the values the nodes themselves.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Tree</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span> <span class="o">=</span> <span class="p">{}</span>
<span class="bp">self</span><span class="p">.</span><span class="n">num_nodes</span> <span class="o">=</span> <span class="mi">0</span>
<span class="bp">self</span><span class="p">.</span><span class="n">idx_leaf_nodes</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">get_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span> <span class="ow">or</span> <span class="n">index</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node index must be a non-negative int'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">index</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node missing at index {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">index</span><span class="p">))</span>
<span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">set_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">node</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span> <span class="ow">or</span> <span class="n">index</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node index must be a non-negative int'</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">SplitNode</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">LeafNode</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node class must be SplitNode or LeafNode'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">index</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node index already exist in tree'</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_nodes</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">index</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Root node must have index zero'</span><span class="p">)</span>
<span class="n">parent_index</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">get_idx_parent_node</span><span class="p">()</span>
<span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_nodes</span> <span class="o">!=</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">parent_index</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Invalid index, node must have a parent node'</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_nodes</span> <span class="o">!=</span> <span class="mi">0</span> <span class="ow">and</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">get_node</span><span class="p">(</span><span class="n">parent_index</span><span class="p">),</span> <span class="n">SplitNode</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Parent node must be of class SplitNode'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">index</span> <span class="o">!=</span> <span class="n">node</span><span class="p">.</span><span class="n">index</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node must have same index as tree index'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="n">node</span>
<span class="bp">self</span><span class="p">.</span><span class="n">num_nodes</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">LeafNode</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">idx_leaf_nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">delete_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span> <span class="ow">or</span> <span class="n">index</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node index must be a non-negative int'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">index</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Node missing at index {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">index</span><span class="p">))</span>
<span class="n">current_node</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_node</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
<span class="n">left_child_idx</span> <span class="o">=</span> <span class="n">current_node</span><span class="p">.</span><span class="n">get_idx_left_child</span><span class="p">()</span>
<span class="n">right_child_idx</span> <span class="o">=</span> <span class="n">current_node</span><span class="p">.</span><span class="n">get_idx_right_child</span><span class="p">()</span>
<span class="k">if</span> <span class="n">left_child_idx</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span> <span class="ow">or</span> <span class="n">right_child_idx</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Invalid removal of node, leaving two orphans nodes'</span><span class="p">)</span>
<span class="k">del</span> <span class="bp">self</span><span class="p">.</span><span class="n">tree_structure</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
<span class="bp">self</span><span class="p">.</span><span class="n">num_nodes</span> <span class="o">-=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">index</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">idx_leaf_nodes</span><span class="p">:</span>
<span class="bp">self</span><span class="p">.</span><span class="n">idx_leaf_nodes</span><span class="p">.</span><span class="n">remove</span><span class="p">(</span><span class="n">index</span><span class="p">)</span></code></pre></figure>
<h2 id="tree-nodes">Tree nodes</h2>
<p>Both splitting and leaf nodes inherit from a base class called <code class="language-plaintext highlighter-rouge">BaseNode</code> which has two attributes: index and depth.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">BaseNode</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">index</span>
<span class="bp">self</span><span class="p">.</span><span class="n">depth</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">floor</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">index</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span></code></pre></figure>
<p>The splitting nodes should maintain the splitting variable and the value to split. Since BART allows for quantitative and qualitative splitting nodes we should make that distinction possible.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">SplitNode</span><span class="p">(</span><span class="n">BaseNode</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">idx_split_variable</span><span class="p">,</span> <span class="n">type_split_variable</span><span class="p">,</span> <span class="n">split_value</span><span class="p">):</span>
<span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">idx_split_variable</span> <span class="o">=</span> <span class="n">idx_split_variable</span>
<span class="bp">self</span><span class="p">.</span><span class="n">type_split_variable</span> <span class="o">=</span> <span class="n">type_split_variable</span>
<span class="bp">self</span><span class="p">.</span><span class="n">split_value</span> <span class="o">=</span> <span class="n">split_value</span>
<span class="bp">self</span><span class="p">.</span><span class="n">operator</span> <span class="o">=</span> <span class="s">'<='</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">type_split_variable</span> <span class="o">==</span> <span class="s">'quantitative'</span> <span class="k">else</span> <span class="s">'in'</span></code></pre></figure>
<p>The leaf nodes only hold the response result of the tree for a particular predictor space.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">LeafNode</span><span class="p">(</span><span class="n">BaseNode</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
<span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">value</span></code></pre></figure>
<h2 id="functions-to-grow-and-prune-the-tree">Functions to grow and prune the tree</h2>
<p>Every tree can only grow from a leaf node. When this happens, the old node is replaced with a splitting node and two leaf nodes. On the other hand, when we prune a tree, we select a prunable node (a splitting node whose two children are leaf nodes), delete its children, and replace the node with a leaf node.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Tree</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">grow_tree</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index_leaf_node</span><span class="p">,</span> <span class="n">new_split_node</span><span class="p">,</span> <span class="n">new_left_node</span><span class="p">,</span> <span class="n">new_right_node</span><span class="p">):</span>
<span class="n">current_node</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_node</span><span class="p">(</span><span class="n">index_leaf_node</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">current_node</span><span class="p">,</span> <span class="n">LeafNode</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'The tree grows from the leaves'</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">new_split_node</span><span class="p">,</span> <span class="n">SplitNode</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'The node that replaces the leaf node must be SplitNode'</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">new_left_node</span><span class="p">,</span> <span class="n">LeafNode</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">new_right_node</span><span class="p">,</span> <span class="n">LeafNode</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'The new leaves must be LeafNode'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">delete_node</span><span class="p">(</span><span class="n">index_leaf_node</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">set_node</span><span class="p">(</span><span class="n">index_leaf_node</span><span class="p">,</span> <span class="n">new_split_node</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">set_node</span><span class="p">(</span><span class="n">new_left_node</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">new_left_node</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">set_node</span><span class="p">(</span><span class="n">new_right_node</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">new_right_node</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">prune_tree</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index_split_node</span><span class="p">,</span> <span class="n">new_leaf_node</span><span class="p">):</span>
<span class="n">current_node</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_node</span><span class="p">(</span><span class="n">index_split_node</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">current_node</span><span class="p">,</span> <span class="n">SplitNode</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">TreeStructureError</span><span class="p">(</span><span class="s">'Only SplitNodes are prunable'</span><span class="p">)</span>
<span class="n">left_child_idx</span> <span class="o">=</span> <span class="n">current_node</span><span class="p">.</span><span class="n">get_idx_left_child</span><span class="p">()</span>
<span class="n">right_child_idx</span> <span class="o">=</span> <span class="n">current_node</span><span class="p">.</span><span class="n">get_idx_right_child</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">delete_node</span><span class="p">(</span><span class="n">left_child_idx</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">delete_node</span><span class="p">(</span><span class="n">right_child_idx</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">delete_node</span><span class="p">(</span><span class="n">index_split_node</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">set_node</span><span class="p">(</span><span class="n">index_split_node</span><span class="p">,</span> <span class="n">new_leaf_node</span><span class="p">)</span></code></pre></figure>
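<p>As a usage sketch, growing and then pruning a tree could look as follows. We assume here a heap-style index convention (the children of node $i$ sit at $2i+1$ and $2i+2$) and a hypothetical <code class="language-plaintext highlighter-rouge">Tree()</code> constructor that starts empty; the actual initialization API may differ:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># Hypothetical setup: a tree whose root (index 0) is a single leaf.
tree = Tree()
tree.set_node(0, LeafNode(index=0, value=0.0))

# Grow: replace the root leaf with a split node and two new leaves.
split = SplitNode(index=0, idx_split_variable=3,
                  type_split_variable='quantitative', split_value=1.5)
left = LeafNode(index=1, value=-0.2)   # assumed index 2*0 + 1
right = LeafNode(index=2, value=0.4)   # assumed index 2*0 + 2
tree.grow_tree(0, split, left, right)

# Prune: collapse the split node and its two leaves back into one leaf.
tree.prune_tree(0, LeafNode(index=0, value=0.1))</code></pre></figure>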
<h2 id="checks-of-correctness-of-the-tree">Checks of correctness of the tree</h2>
<p>Although the user will not be creating trees directly, we want our code to fail as soon as something goes wrong (especially during development), so we added correctness checks to the tree and raised exceptions whenever an invariant was violated. We also wrote tests to ensure that the implementation remains correct after each commit.</p>
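<p>As an illustration of the kind of tests we mean, here is a minimal pytest sketch. The first test only exercises code shown above; the second reuses the hypothetical <code class="language-plaintext highlighter-rouge">Tree()</code> setup from the usage sketch, and the categorical type name is an assumption of ours:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import pytest

def test_split_node_operator():
    # Quantitative split variables compare with '<='.
    quant = SplitNode(index=0, idx_split_variable=2,
                      type_split_variable='quantitative', split_value=3.5)
    assert quant.operator == '<='
    # Any other type (e.g. a categorical variable) uses 'in'.
    qual = SplitNode(index=0, idx_split_variable=1,
                     type_split_variable='qualitative', split_value={0, 2})
    assert qual.operator == 'in'

def test_grow_requires_split_node():
    tree = Tree()
    tree.set_node(0, LeafNode(index=0, value=0.0))
    # Passing a LeafNode where a SplitNode is expected must raise.
    with pytest.raises(TreeStructureError):
        tree.grow_tree(0, LeafNode(index=0, value=0.0),
                       LeafNode(index=1, value=0.0),
                       LeafNode(index=2, value=0.0))</code></pre></figure>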
<p>All the code for the implementation of BART can be viewed in <a href="https://github.com/jmloyola/pymc3/tree/add_bart">this branch of PyMC</a>.</p>Juan Martín Loyolajmloyola@unsl.edu.arIntroduction to Bayesian Additive Regression Trees2019-06-23T00:00:00-03:002019-06-23T00:00:00-03:00https://jmloyola.github.io/posts/2019/06/introduction-to-bart<h1 id="introduction">Introduction</h1>
<p>Bayesian Additive Regression Trees (BART) is a sum-of-trees model for approximating an unknown function $f$. Like other ensemble methods, every tree acts as a weak learner, explaining only part of the result. All these trees are of a particular kind called decision trees. The decision tree is a very interpretable and flexible model, but it is also prone to overfitting. To avoid overfitting, BART uses a regularization prior that forces each tree to explain only a limited subset of the relationships between the covariates and the predictor variable.</p>
<p>The problem BART tackles is making inference about an unknown function $f$ that predicts an output $y$ using a $p$ dimensional vector of inputs $x=(x_1,\ldots,x_p)$ when</p>
\[y=f(x)+\epsilon\text{,}\qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\]
<p>To solve this regression problem, BART approximates $f(x)=E(y \mid x)$ using $f(x)\approx h(x)\equiv \sum_{j=1}^{m}g_j(x)$, where each $g_j$ denotes a regression tree:</p>
\[y=h(x)+\epsilon\text{,}\qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\label{general-sum-of-tree-model}\]
<h1 id="the-bart-model">The BART model</h1>
<p>The BART model consists of two parts: a sum-of-trees model and a regularization prior on the parameters of that model.</p>
<h2 id="a-sum-of-trees-model">A sum-of-trees model</h2>
<p>To elaborate the form of the sum-of-trees model (\ref{general-sum-of-tree-model}), we begin by establishing notation for a single tree model. Let $T$ denote a binary tree consisting of a set of interior node decision rules and a set of terminal nodes, and let $M=\{\mu_1, \mu_2, \ldots, \mu_b\}$ denote a set of parameter values associated with each of the $b$ terminal nodes of $T$. The decision rules are binary splits of the predictor space of the form $\{x \in A\}$ vs $\{x \notin A\}$ where $A$ is a subset of the range of $x$. These are typically based on single components of $x = (x_1, \dots , x_p)$ and are of the form $\{x_i \leq c\}$ vs $\{x_i > c\}$ for continuous $x_i$. Given the way it is constructed, the tree is a full binary tree, that is, each node has exactly zero or two children. Each $x$ value is associated with a single terminal node of $T$ by the sequence of decision rules from top to bottom, and is then assigned the $\mu_i$ value associated with this terminal node. For a given $T$ and $M$, we use $g(x; T, M)$ to denote the function which assigns a $\mu_i \in M$ to $x$.</p>
<p>With this notation, the sum-of-trees model (\ref{general-sum-of-tree-model}) can be more explicitly expressed as:</p>
\[y=\sum_{j=1}^{m}g(x; T_j, M_j)+\epsilon\text{,}\qquad \epsilon \sim \mathcal{N}(0,\sigma^2)\label{well-specified-sum-of-tree-model}\]
<p>where for each binary regression tree $T_j$ and its associated terminal node parameters $M_j$, $g(x; T_j, M_j)$ is the function which assigns $\mu_{ij} \in M_j$ to $x$. Under (\ref{well-specified-sum-of-tree-model}), $E(y \mid x)$ equals the sum of all the terminal node $\mu_{ij}$’s assigned to $x$ by the $g(x; T_j, M_j)$’s.</p>
<p>The following image shows an example of $g(x; T_j, M_j)$:</p>
<p><img src="https://jmloyola.github.io/images/posts/2019-06-23-introduction-to-bart/tree.png" alt="single-tree" class="align-center" /></p>
<p>Each such $\mu_{ij}$ will represent a main effect when $g(x; T_j, M_j)$ depends on only one component of $x$ (i.e., single variable), and will represent an interaction effect when $g(x; T_j, M_j)$ depends on more than one component of $x$ (i.e., more than one variable). Thus, the sum-of-trees model can incorporate both main effects and interaction effects. And because (\ref{well-specified-sum-of-tree-model}) may be based on trees of varying sizes, the interaction effects may be of varying orders. In the special case where every terminal node assignment depends on just a single component of $x$, the sum-of-trees model reduces to a simple additive function, a sum of step functions of the individual components of $x$.</p>
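<p>In code, a single $g(x; T_j, M_j)$ is nothing more than a cascade of decision rules ending in a terminal-node constant. The following toy sketch (the tree and its values are illustrative, not taken from the figure) makes this concrete; note that because the second rule uses a different component of $x$, this particular tree encodes an interaction effect:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def g(x):
    # A hand-coded tree with two interior rules and three terminal nodes,
    # each terminal node holding a constant mu value.
    if x[0] <= 0.5:        # root decision rule on component x_1
        return -1.0        # mu_1
    elif x[1] <= 2.0:      # second rule on component x_2
        return 0.3         # mu_2
    else:
        return 1.7         # mu_3

def h(x, trees):
    # The sum-of-trees model simply adds m such step functions.
    return sum(g_j(x) for g_j in trees)</code></pre></figure>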
<p>With a large number of trees, a sum-of-trees model gains increased representational flexibility, which endows BART with excellent predictive capabilities. This flexibility, however, comes at the cost of a rapidly increasing number of parameters. Indeed, for fixed $m$, each sum-of-trees model (\ref{well-specified-sum-of-tree-model}) is determined by $(T_1, M_1), \ldots, (T_m, M_m)$ and $\sigma$, which includes all the bottom node parameters as well as the tree structures and decision rules.</p>
<h2 id="a-regularization-prior">A regularization prior</h2>
<p>The BART model specification is completed by imposing a prior over all the parameters of the sum-of-trees model, namely, $(T_1, M_1), \ldots, (T_m, M_m)$ and $\sigma$. There exist specifications of this prior that effectively regularize the fit by keeping the individual tree effects from being unduly influential. Without such a regularizing influence, large tree components would overwhelm the rich structure of (\ref{well-specified-sum-of-tree-model}), thereby limiting the advantages of the additive representation both in terms of function approximation and computation.</p>
<p>Chipman et al. proposed a prior formulation in terms of just a few interpretable hyperparameters which govern the priors on $T_j$, $M_j$ and $\sigma$. When domain information is not available, the authors recommend using an <em>empirical Bayes</em> approach and calibrating the prior using the observed variation in $y$, or at least obtaining a range of plausible values and then performing cross-validation to select from these values.</p>
<h3 id="prior-independence-and-symmetry">Prior independence and symmetry</h3>
<p>In order to simplify the specification of the regularization prior, we restrict our attention to priors for which the tree components ($T_j$, $M_j$) are independent of each other and also independent of $\sigma$, and for which the terminal node parameters of every tree are independent.</p>
<p>\begin{equation}
\begin{split}
p((T_1 , M_1), \ldots , (T_m , M_m ), \sigma ) &= \left [ \prod_j p(T_j , M_j) \right ] p(\sigma)\\ &= \left [ \prod_j p(M_j \mid T_j) p(T_j) \right ] p(\sigma)
\end{split}
\end{equation}</p>
<p>and</p>
\[p(M_j \mid T_j) = \prod_i p(\mu_{ij} \mid T_j)\]
<p>where $\mu_{ij} \in M_j$.</p>
<p>Under the independence assumption we only need to specify $p(T_j)$, $p(\mu_{ij} \mid T_j)$ and $p(\sigma)$.</p>
<h3 id="the-t_j-prior">The $T_j$ prior</h3>
<p>The $T_j$ prior, $p(T_j)$, is specified by three aspects:</p>
<ul>
<li>the probability that a node at depth $d=(0, 1, 2, \ldots)$ is nonterminal, given by:
$ \frac{\alpha}{(1 + d)^{\beta}}$ with $\alpha \in (0, 1)$ and $\beta \in \lbrack 0, \infty)$. Node depth is defined as distance from the root. Thus, the root itself has depth $0$, its first child node has depth $1$, etc. This prior controls the tree depth. For a sum-of-trees model with $m$ large, we want the regularization prior to keep the individual tree components small. To do that, we usually use $\alpha=0.95$ and $\beta=2$. Even though this prior puts most probability on tree sizes of $2$ or $3$, trees with many terminal nodes can be grown if the data demands it.</li>
<li>the distribution on the splitting variable assignments at each interior node. Usually, this is the uniform prior on available variables.</li>
<li>the distribution on the splitting rule assignment in each interior node, conditional on the splitting variable. Usually, this is the uniform prior on the discrete set of available splitting values.</li>
</ul>
<h3 id="the-mu_ij-mid-t_j-prior">The $\mu_{ij} \mid T_j$ prior</h3>
<p>For convenience, we first shift and rescale $y$ so that the observed transformed values range from $y_{min} = -0.5$ to $y_{max} = 0.5$, then the prior is</p>
\[\mu_{ij} \sim \mathcal{N}(0, \sigma_{\mu}^2)\]
<p>where $\sigma_{\mu} = \frac{0.5}{k\sqrt{m}}$.</p>
<p>This prior has the effect of shrinking the tree parameters $\mu_{ij}$ toward zero, limiting the effect of the individual tree components by keeping them small. Note that as $k$ and/or $m$ is increased, this prior will become tighter and apply greater shrinkage to the $\mu_{ij}$. Chipman et al. (2010) found that a value of $k$ between $1$ and $3$ yields good results.</p>
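<p>To make the shrinkage concrete: with the defaults $k=2$ and $m=200$ trees, $\sigma_{\mu} = \frac{0.5}{2\sqrt{200}} \approx 0.018$, so a priori each individual tree is expected to account for only a tiny fraction of the variation in the rescaled $y$.</p>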
<h3 id="the-sigma-prior">The $\sigma$ prior</h3>
<p>We use the (scaled) inverse chi-square distribution</p>
\[\sigma^2 \sim \frac{\nu\lambda}{\chi_{\nu}^2}\]
<p>Essentially, we calibrate the prior degrees of freedom $\nu$ and scale $\lambda$ using a <em>rough data-based overestimate</em> $\hat \sigma$ of $\sigma$. Two natural choices for $\hat \sigma$ are:</p>
<ul>
<li>the <em>naive</em> specification, in which we take $\hat \sigma$ to be the sample standard deviation of $y$.</li>
<li>the <em>linear model</em> specification, in which we take $\hat \sigma$ as the residual standard deviation from a least squares linear regression of $y$ on the original $X$.</li>
</ul>
<p>We then pick a value of $\nu$ between $3$ and $10$ to get an appropriate shape, and a value of $\lambda$ so that the $q$th quantile of the prior on $\sigma$ is located at $\hat \sigma$, that is, $P(\sigma < \hat \sigma) = q$. We consider values of $q$ such as $0.75$, $0.90$ or $0.99$ to center the distribution below $\hat \sigma$.</p>
<p>For automatic use, Chipman et al. (2010) recommend the default setting $(\nu, q) = (3, 0.90)$. It is not recommended to choose $\nu < 3$ because it seems to concentrate too much mass on very small $\sigma$ values, which leads to overfitting.</p>
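<p>To make the calibration concrete, here is a small sketch using SciPy (the function and argument names are ours) that solves for $\lambda$ given $\nu$, $q$ and $\hat \sigma$:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from scipy.stats import chi2

def calibrate_lambda(sigma_hat, nu=3, q=0.90):
    """Find lambda such that P(sigma < sigma_hat) = q.

    Under sigma^2 ~ nu * lambda / chi2_nu, the event sigma^2 < sigma_hat^2
    is equivalent to X > nu * lambda / sigma_hat^2 with X ~ chi2(nu), so
    nu * lambda / sigma_hat^2 must equal the (1 - q) quantile of chi2(nu).
    """
    return sigma_hat**2 * chi2.ppf(1 - q, nu) / nu

# Example: the default setting (nu, q) = (3, 0.90) with a naive sigma_hat.
lam = calibrate_lambda(sigma_hat=1.2)</code></pre></figure>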
<h3 id="the-choice-of-m">The choice of $m$</h3>
<p>Instead of being fully Bayesian and estimating the value of $m$, a fast and robust option is to choose $m=200$ and then maybe check whether a couple of other values make any difference. As $m$ is increased, starting with $m = 1$, the predictive performance of BART improves dramatically until at some point it levels off and then begins to very slowly degrade for large values of $m$. Thus, for prediction, it seems only important to avoid choosing $m$ too small.</p>
<h1 id="inference">Inference</h1>
<p>Given the observed data $y$, the Bayesian setup induces a posterior distribution</p>
\[p((T_1, M_1), \ldots, (T_m, M_m), \sigma \mid y)\]
<p>on all the unknowns that determine a sum-of-trees model (\ref{well-specified-sum-of-tree-model}). Although the sheer size of the parameter space precludes exhaustive calculation, Chipman et al. (2010) propose a backfitting MCMC algorithm that can be used to sample from this posterior. On the other hand, Lakshminarayanan et al. (2015) show that Particle Gibbs is a better approach for Bayesian Additive Regression Trees. In a future post, we’ll talk more about this.</p>
<h1 id="results">Results</h1>
<p>The output of a BART model is:</p>
<ul>
<li>a posterior mean estimate of $f(x) = E(y \mid x)$ at any input value $x$</li>
<li>pointwise uncertainty intervals for $f(x)$</li>
<li>variable importance measures. This is done by keeping track of the relative frequency with which $x$ components appear in the sum-of-trees model iterations (see the sketch after this list).</li>
</ul>
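<p>A minimal sketch of this bookkeeping, reusing the <code class="language-plaintext highlighter-rouge">SplitNode</code> class from our implementation; the list-of-lists posterior structure and the <code class="language-plaintext highlighter-rouge">split_nodes</code> attribute are assumptions for illustration:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from collections import Counter

def variable_inclusion(posterior_trees, n_covariates):
    # posterior_trees: one list of m trees per posterior sample, with each
    # tree assumed to expose its interior SplitNode objects via `split_nodes`.
    counts = Counter()
    for sample in posterior_trees:
        for tree in sample:
            for node in tree.split_nodes:
                counts[node.idx_split_variable] += 1
    total = sum(counts.values())
    # Relative frequency with which each covariate is used in a split rule.
    return [counts[i] / total for i in range(n_covariates)]</code></pre></figure>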
<h1 id="references">References</h1>
<ol>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. <em>The Annals of Applied Statistics</em>, <em>4</em>(1), 266-298.</li>
<li>Lakshminarayanan, B., Roy, D., & Teh, Y. W. (2015). Particle Gibbs for Bayesian additive regression trees. In <em>Artificial Intelligence and Statistics</em> (pp. 553-561).</li>
<li>Kapelner, A., & Bleich, J. (2013). bartMachine: Machine learning with Bayesian additive regression trees. <em>arXiv preprint arXiv:1312.2171</em>.</li>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. <em>Journal of the American Statistical Association</em>, <em>93</em>(443), 935-948.</li>
<li>Tan, Y. V., & Roy, J. (2019). Bayesian additive regression trees and the General BART model. <em>arXiv preprint arXiv:1901.07504</em>.</li>
</ol>Juan Martín Loyolajmloyola@unsl.edu.arCoding period begins2019-06-09T00:00:00-03:002019-06-09T00:00:00-03:00https://jmloyola.github.io/posts/2019/06/coding-period-begins<p>At this point, the bonding period has already ended and we are in the middle of the coding period. In this post, I will tell you how things are going.</p>
<h2 id="community-bonding-period">Community bonding period</h2>
<p>The community bonding period is a time for students to learn about the organization’s processes, developer interactions, and codes of conduct, set up their environment, etc. Since I had already contributed to PyMC3, my working environment was set up by then. Thus, I used this time to organize the project, talk to my mentors, and learn more about the community.</p>
<p>In this period, Austin, Osvaldo, and I had an online meeting to introduce ourselves. We talked about the project and coordinated future meetings. Since Osvaldo and I live in the same city, we arranged to meet in person every week. Then, every three weeks or so, Austin, Osvaldo, and I will meet online to discuss questions or problems about the project. I shared a Trello board with my mentors to show them which tasks I am working on.</p>
<p>Furthermore, I was added to the PyMC Slack group. There, I introduced myself to the rest of the developers and received a warm welcome from them. Coincidentally, that week they had organized a journal club to discuss a paper, so they invited me to join in. Colin Carroll presented the paper <a href="https://arxiv.org/abs/1903.03704">NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport</a>. Since I hadn’t had time to read the paper beforehand and my knowledge of Bayesian statistics is not that great, I didn’t understand everything :sweat_smile:. Nonetheless, it was a great experience to hear everyone discuss the topic.</p>
<p>In these weeks, I also set up this blog using a GitHub page template called <a href="https://academicpages.github.io/">academic pages</a>, a fork from <a href="https://mmistakes.github.io/minimal-mistakes/">minimal mistakes</a>.</p>
<p>Finally, I started reading about the main topic of the project, Bayesian Additive Regression Trees (BART).</p>
<h2 id="coding-period">Coding period</h2>
<p>On May 27th, the coding period started. The primary focus for the beginning of this period was to fully understand the Bayesian Additive Regression Trees model. Thus, I read these papers in depth:</p>
<ul>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1998.10473750">Bayesian CART model search</a>. Journal of the American Statistical Association, 93(443), 935-948.</li>
<li>Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). <a href="https://projecteuclid.org/euclid.aoas/1273584455">BART: Bayesian additive regression trees</a>. The Annals of Applied Statistics, 4(1), 266-298.</li>
<li>Lakshminarayanan, B., Roy, D., & Teh, Y. W. (2015, February). <a href="http://proceedings.mlr.press/v38/lakshminarayanan15.pdf">Particle Gibbs for Bayesian additive regression trees</a>. In Artificial Intelligence and Statistics (pp. 553-561).</li>
<li>Tan, Y. V., & Roy, J. (2019). <a href="https://arxiv.org/abs/1901.07504">Bayesian additive regression trees and the General BART model</a>. arXiv preprint arXiv:1901.07504.</li>
</ul>
<p>I also skimmed through papers that went more in-depth with the theoretical analysis of BART and its implementation:</p>
<ul>
<li>Rockova, V., & Saha, E. (2018). <a href="https://arxiv.org/abs/1810.00787">On theory for BART</a>. arXiv preprint arXiv:1810.00787.</li>
<li>Kapelner, A., & Bleich, J. (2013). <a href="https://arxiv.org/abs/1312.2171">bartMachine: Machine learning with Bayesian additive regression trees</a>. arXiv preprint arXiv:1312.2171.</li>
</ul>
<p>After reading these papers, I now have a deeper understanding of BART, but I still lack a clear vision of the inference process. Surely, this will fade as I start implementing the model.</p>
<p>While working on these, a new stable version of PyMC was released, <a href="https://github.com/pymc-devs/pymc3/releases/tag/v3.7">PyMC 3.7</a>. The new release featured code I had contributed in past PRs (<a href="https://github.com/pymc-devs/pymc3/pull/3389">#3389</a>, <a href="https://github.com/pymc-devs/pymc3/pull/3427">#3427</a>), namely the <code class="language-plaintext highlighter-rouge">Data</code> class. This new feature was highlighted in a <a href="https://medium.com/@pymc_devs/pymc-3-7-making-data-a-first-class-citizen-7ed87fe4bcc5?sk=2e984396bd3c540bfdafdc8842becf38">blog post</a> written by Chris Fonnesbeck in which I am directly mentioned :flushed:. I felt super happy because the work I had done was recognized, but at the same time I felt an enormous weight over my shoulders. The new class was going to be used by more people than me, and in many different ways I hadn’t planned for, possibly surfacing new issues. And so it happened. I felt I needed to solve all these problems, but I didn’t have time for them right then. Osvaldo calmed me down and reminded me that this is an open source project and that, although I may not have time right now, another person can step in and fix those problems. This reassured me and helped me focus on GSoC.</p>
<p>Finally, I dug into existing BART implementations in Python like <a href="https://github.com/JakeColtman/bartpy">bartpy</a> and <a href="https://github.com/balajiln/pgbart">pgbart</a> to organize the project and better understand the model. I also checked <a href="https://github.com/scikit-learn/scikit-learn/blob/7b136e92acf49d46251479b75c88cba632de1937/sklearn/tree/_tree.pyx#L504">scikit-learn’s tree implementation</a> for ideas on how to implement the tree.</p>
<p>In the next blog post I will introduce the Bayesian Additive Regression Trees model. Till next time…</p>Juan Martín Loyolajmloyola@unsl.edu.arAccepted to the Google Summer of Code 20192019-05-15T00:00:00-03:002019-05-15T00:00:00-03:00https://jmloyola.github.io/posts/2019/05/gsoc-acceptance<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'Hello World!'</span><span class="p">)</span>
</code></pre></div></div>
<p><em>(As a computer scientist I had to do that :grin:)</em></p>
<p>Last week, Google announced the accepted students for this year’s Google Summer of Code (GSoC). Luckily, I am one of them. I will work on PyMC3 under the NumFOCUS umbrella organization. Austin Rochford and Osvaldo Martin, core developers of PyMC3, will be my mentors. The main objective of the project is to <a href="https://summerofcode.withgoogle.com/projects/#4666396833742848">implement Bayesian Additive Regression Trees in PyMC3</a>.</p>
<p>Bayesian Additive Regression Trees (BART) is a Bayesian nonparametric approach to estimating functions using regression trees. A BART model consists of a sum of regression trees with (homoskedastic) normal additive noise. Regression trees are defined by recursively partitioning the input space and defining a local model in each resulting region in order to approximate some unknown function. BARTs are useful and flexible models that capture interactions and non-linearities, and they have proven to be useful tools for variable selection.</p>
<p>Bayesian Additive Regression Trees will allow PyMC3 users to perform regressions with a “canned” non-parametric model. By simply calling a method, users will obtain the mean regressor plus the uncertainty estimation in a fully Bayesian way. This can later be used to predict on hold-out data. Furthermore, the implemented BART model will allow experienced users to specify their own priors for the specific problem they are tackling, improving performance substantially.</p>
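<p>A purely hypothetical sketch of what such an API could look like; none of these names are final, and <code class="language-plaintext highlighter-rouge">pm.BART</code> does not exist yet (it is exactly what this project will explore):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import pymc3 as pm

with pm.Model() as model:
    # Hypothetical BART random variable over a design matrix X and target Y.
    mu = pm.BART('mu', X, Y, m=200)  # assumed signature, for illustration only
    sigma = pm.HalfNormal('sigma', 1.0)
    y_obs = pm.Normal('y_obs', mu=mu, sd=sigma, observed=Y)
    trace = pm.sample()</code></pre></figure>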
<p>In the next few weeks, Community Bonding Period, I will:</p>
<ul>
<li>deepen my knowledge about BART;</li>
<li>examine PyMC3 code-base in order to deepen my knowledge about its data flow, architecture and application programming interface (API);</li>
<li>communicate with mentors and PyMC3 core developers to clarify doubts about possible API for BART.</li>
</ul>Juan Martín Loyolajmloyola@unsl.edu.ar