# GSoC 2019 Final Evaluation


The Google Summer of Code 2019 has come to an end. It has been an amazing experience! I would recommend it to anyone who is interested in participating in an open source project.

In what follows, I will present a summary of the work done for the Google Summer of Code project “Bayesian Additive Regression Trees in PyMC3”. I will go over key points of the project and provide links to commits that show the current state of the implementation. Along the way, I will point out different challenges encountered and future work to be done.

All the code produced during GSoC 2019 is available in this GitHub repository. I have also written a series of blog posts where I talk about the Bayesian Additive Regression Trees (BART) model and its implementation.

## Project Abstract

Bayesian Additive Regression Trees is a Bayesian nonparametric approach to estimating functions using regression trees. A BART model consists of a sum of regression trees with additive normal noise. Regression trees are defined by recursively partitioning the input space and fitting a local model in each resulting region in order to approximate some unknown function. BART is a useful and flexible model for capturing interactions and non-linearities, and it has proved to be a useful tool for variable selection.
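Concretely, the sum-of-trees model of *Chipman et al. (2010)* can be written as

$$
Y = \sum_{j=1}^{m} g(X; T_j, M_j) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),
$$

where $g(X; T_j, M_j)$ returns the value of the leaf of tree $T_j$ (with leaf parameters $M_j$) into which $X$ falls.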

Bayesian Additive Regression Trees will allow PyMC3 users to perform regressions with a “canned” non-parametric model. By simply calling a method, users will obtain the mean regressor plus the uncertainty estimate in a fully Bayesian way. This can later be used to predict on hold-out data. Furthermore, the implemented BART model will allow experienced users to specify their own priors for the specific problem they are tackling, improving performance substantially.
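To give an idea of what this could look like, here is a minimal sketch of a possible user-facing API; `pm.BART` and its arguments are assumptions of mine here, not the final interface merged into PyMC3:

```python
# Hypothetical sketch of the user-facing API; `pm.BART` and its arguments
# are assumptions, not the final interface merged into PyMC3.
import numpy as np
import pymc3 as pm

X = np.random.uniform(size=(100, 5))
y = X[:, 0] + np.random.normal(scale=0.1, size=100)

with pm.Model() as model:
    sigma = pm.HalfNormal("sigma", 1.0)
    mu = pm.BART("mu", X, y, m=50)  # sum of 50 regression trees
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample()
```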

## Project status

A lot of work went into the Google Summer of Code. I read several papers about BART and wrote a lot of code, tests and documentation. Sadly, we did not meet all the goals we set at the beginning: there is still a lot of debugging to do to ensure the model is correct.

The main files for the project can be found in the GitHub repository linked above.

## Things done

- Community bonding period. During this time I learned more about PyMC and the BART model. I also set up this blog. For more details you can check this blog post: “Coding period begins”.
- Understand the BART model. This involved reading many research papers and studying different implementations. The most relevant papers were *Chipman et al. (2010)*, *Chipman et al. (1998)*, *Lakshminarayanan et al. (2015)*, *Tan et al. (2019)* and *Kapelner & Bleich (2013)*. I studied the code of some BART implementations, for example bartpy, pgbart and bartMachine, to complement the information given in the papers. Some notes I wrote about this process can be found on this blog.
- Implement the tree structure. For details on the chosen tree structure implementation, you can check this blog post: “BART’s tree structure implementation”.
- Implement the visualization of a tree.
- Add tests for the nodes and the tree structure.
- Define two APIs for BART: one for the conjugate model defined in the original paper of *Chipman et al. (2010)*, and a flexible API for user-defined priors and likelihoods.
- Set the priors for BART.
- Implement the Bayesian backfitting MCMC for posterior inference in BART (see the sketch after this list).
- Implement the tree samplers GrowPrune and ParticleGibbs. Both approaches are described in *Lakshminarayanan et al. (2015)*.
- Add a `Particle` class to make the code clearer.
- Add tests for BART.
- Optimize the space used by the tree data structures.
- Add documentation to the tree structure code.
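As a rough illustration of the backfitting step, here is a minimal sketch of one sweep over the $m$ trees; the tree objects, their `predict` method and the `sample_tree` callable are illustrative placeholders, not the actual PyMC3 code:

```python
import numpy as np

def backfitting_sweep(trees, X, y, sigma, sample_tree):
    """One sweep of Bayesian backfitting: update each tree against the
    residuals of all the other trees (Chipman et al., 2010)."""
    fits = np.array([tree.predict(X) for tree in trees])  # shape (m, n)
    total_fit = fits.sum(axis=0)
    for j in range(len(trees)):
        # residuals with the j-th tree removed: R_j = y - sum_{k != j} g(X; T_k)
        residuals = y - (total_fit - fits[j])
        # draw a new j-th tree, e.g. with a grow/prune move or Particle Gibbs
        trees[j] = sample_tree(trees[j], X, residuals, sigma)
        new_fit = trees[j].predict(X)
        total_fit += new_fit - fits[j]  # keep the running sum up to date
        fits[j] = new_fit
    return trees
```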

## Challenges

During GSoC I faced two challenges. On the one hand, implementing a model described across a series of papers was not as simple as I thought at first. There are many small details that are not explained in the papers but show up while implementing the model. This meant examining the implementations accompanying the original papers and reading many more papers in order to find the answers, tasks that proved hard because I am not an expert in the topic. On the other hand, I faced the challenge of merging the code I wrote for this model into the PyMC3 code base. For this I had to dive deep into the PyMC3 code base and adapt the code I had written. I go into a little more detail in this blog post: “The coding period is coming to an end”.

## Future work

Although GSoC has finished, I plan to keep working on the project. It is not going to be as intense as during GSoC, but I will dedicate a couple of hours a week to getting BART merged into PyMC.

To have an initial version of BART in PyMC we should:

- Add documentation to the rest of the code.
- Ensure the model is correct. Compare our model with other implementations.
- Add a Jupyter notebook showing the effect of the BART priors.
- Add a Jupyter notebook showing how BART performs.
- Finish merging the BART code into PyMC3.
- Refactor the tree code: separate the basic tree structure from the things needed for BART, and add a new class `BARTTree` which adds the attributes BART needs (see the sketch after this list).
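A hypothetical sketch of that refactor; the class and attribute names are assumptions about how the split could look, not the actual code:

```python
class Tree:
    """Generic binary tree: holds the nodes and the traversal logic."""

    def __init__(self):
        self.tree_structure = {}  # maps node index -> node object


class BARTTree(Tree):
    """Tree extended with the bookkeeping attributes BART needs."""

    def __init__(self):
        super().__init__()
        self.idx_leaf_nodes = []  # indices of the current leaf nodes
        self.num_observations = 0  # data points routed through this tree
```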

To extend the model we should:

- Implement another way to calculate variable importance, following the paper by *Bleich et al. (2014)*.
- Implement other ways to choose the path at a split node when the variable has `np.NaN` values, following the paper by *Kapelner & Bleich (2015)* (see the sketch after this list). For example:
  - Always go to the left.
  - Always go to the right.
  - Go to the left or right at random.
  - Every data point with `np.NaN` in that particular variate goes to the left and the rest of the data points to the right. The condition by which the split node divides the space is `np.isnan(x)`.
- Single program, multiple data (SPMD) parallelization, following the paper by *Pratola et al. (2014)*.
- Optimize the time for posterior estimation, following the paper by *He et al. (2019)*.
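As an illustration of the last routing rule in the list above, here is a minimal sketch; the function and argument names are mine, not taken from the paper or the code base:

```python
import numpy as np

def go_left(x, idx_split_variable, split_value):
    """Decide the branch for one data point at a split node, sending
    missing values to the left (one of the rules listed above)."""
    value = x[idx_split_variable]
    if np.isnan(value):
        return True  # missing values always take the left branch
    return value <= split_value
```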

## References

- Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. *The Annals of Applied Statistics*, *4*(1), 266-298.
- Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. *Journal of the American Statistical Association*, *93*(443), 935-948.
- Lakshminarayanan, B., Roy, D., & Teh, Y. W. (2015). Particle Gibbs for Bayesian additive regression trees. In *Artificial Intelligence and Statistics* (pp. 553-561).
- Kapelner, A., & Bleich, J. (2013). bartMachine: Machine learning with Bayesian additive regression trees. *arXiv preprint arXiv:1312.2171*.
- Tan, Y. V., & Roy, J. (2019). Bayesian additive regression trees and the General BART model. *arXiv preprint arXiv:1901.07504*.
- Pratola, M. T., Chipman, H. A., Gattiker, J. R., Higdon, D. M., McCulloch, R., & Rust, W. N. (2014). Parallel Bayesian additive regression trees. *Journal of Computational and Graphical Statistics*, *23*(3), 830-852.
- Bleich, J., Kapelner, A., George, E. I., & Jensen, S. T. (2014). Variable selection for BART: an application to gene regulation. *The Annals of Applied Statistics*, *8*(3), 1750-1781.
- Kapelner, A., & Bleich, J. (2015). Prediction with missing data via Bayesian additive regression trees. *Canadian Journal of Statistics*, *43*(2), 224-239.
- He, J., Yalov, S., & Hahn, P. R. (2019, April). XBART: Accelerated Bayesian Additive Regression Trees. In *The 22nd International Conference on Artificial Intelligence and Statistics* (pp. 1130-1138).

## Acknowledgments

I would like to thank **Google** and **NumFOCUS** for promoting open-source projects and bringing students closer to open-source communities; **PyMC** for the warm welcome to the community; and **Austin Rochford** and **Osvaldo Martin** for their support as mentors during the development of the project.