*(he/him)* · Ph.D Student at Arizona State University in Geography

*As mentioned in the final blog post, these two notebooks showcase the user-oriented designs I created during GSoC 2022.*

`formula_example.ipynb`

: a Jupyter notebook demonstrating the syntax and usage of the spatial regression formula syntax in`pysal/spreg`

.`sklearn_example.ipynb`

: a Jupyter notebook demonstrating the usage of the`scikit-learn`

style interface to`pysal/spreg`

.`tdhoffman/spreg`

,`api-dev`

branch: this branch of my forked copy of`spreg`

contains all the deliverables I created over the summer. I did some experimental work in other repositories (see the blog and links at the end of this page), but the vast majority of the work was focused on creating different interfaces for`spreg`

.

*Posted 10 September 2022*

This summer, I implemented two new ways of interfacing with the `spreg`

module of the Python Spatial Analysis Library. In the first half of the summer, I created a Wilkinson formula interface available through the `spreg.from_formula()`

function. Wilkinson formulas are used extensively in R and Python packages like `statsmodels`

that are oriented towards quantitative social scientists and econometricians. These libraries emphasize the inference side of quantitative analyses and value interpretability and expressibility of models. Formulas offer a natural path for achieving these goals: by writing in a mathematical syntax, users can easily fit a model and interpret the results. It’s a clear, direct way of doing statistical modeling that requires very little code on the user end. I extended the Wilkinson formula syntax to admit spatial regression components (spatial lags and spatial errors) and wrapped the formula parser from `formulaic`

in a spatial preprocessor (that handles the spatial components) and dispatcher (that sends the constructed model matrices to the modeling classes in `spreg`

). As a result of my work, users can now fit nonspatial and spatial regressions using a PySAL function. This contribution improves the accessibility and ease of use of `spreg`

’s modeling classes. To see how they work, read the blog entries below or check out the `formula_example.ipynb`

Jupyter notebook linked above.

In the second half of the summer, I designed an alternative way of interfacing with the modeling classes in `spreg`

. In a submodule called `spreg.sklearn`

, I translated the spatial regression modeling classes from `spreg`

into `scikit-learn`

style estimators. `scikit-learn`

is the dominant Python library for statistical and machine learning (anything short of deep learning) and is widely used in the data science community. It’s focused primarily on prediction, which makes it a great complement to the formula interface implemented in the first half of the summer. By conforming `spreg`

’s modeling classes to `scikit-learn`

’s style, the models can now be used in `scikit-learn`

workflows. This construction makes spatial models available to a wide community of modelers who previously may not have included spatial models in their data science workflow. Once merged with the main branch of `spreg`

, it may be useful to reach out to the `scikit-learn`

development team or others in the community to raise awareness about this connection and begin promoting the use of spatial models in the data science community. Finally, to preserve backward-compatibility, the `spreg.sklearn`

submodule is not imported by default when importing `spreg`

– it must be directly imported (e.g., `import spreg.sklearn`

) so as not to pollute the namespace.

I am immensely thankful for the advice and guidance of my mentors throughout this project. Their encyclopedic knowledge of the design and structure of PySAL proved invaluable time and time again while I was building these interfaces. Our discussions about design philosophy and code organization were illuminating for me: I’ve used open source code for years, but prior to this summer had not spent time thinking about its construction and development. Gaining insights into how the library and its modules were structured has made me introspect about the usage of code as an expression of thought. I’m eager to build on these insights in my dissertation and to continue working with the PySAL development team to expand the library!

`scikit-learn`

Structure*Posted 11 August 2022*

Unfortunately, the syntax for the regimes models proved to be too difficult to set up. My mentors and I initially thought that `formulaic`

’s pipe operator would be sufficient to create the groupings needed for a multilevel regimes model, but it turns out that the pipe in `formulaic`

works differently. It’s designed for creating multiple models, like for a two-stage least squares. This finding makes it significantly more difficult to implement the regimes, and it’s not as high of a priority at this point in the project. The pull request for the formula implementation has been submitted and regimes can come at some point in the future.

After receiving confirmation that the `spreg`

code must remain backwards-compatible, my mentors and I have shifted to designing a new package with a sleeker API which could exist in parallel. To do this, we’ve elected to revisit the `scikit-learn`

design pattern and create spatial models which plug directly into it. As a result, the new package doesn’t need the ordinary least squares class (already implemented in `sklearn.linear_model`

), so I’ve gotten to work rewriting the spatial lag and spatial error model classes in the `scikit-learn`

design. They’ll each use the appropriate mixins and base classes for linear models, and should serve to plug and play directly with existing `scikit-learn`

pipelines.

*Posted 4 August 2022*

The `spreg.from_formula()`

function has been steadily getting closer and closer to library-grade code. Per the `spreg`

Feature Enhancement Proposal instructions, I’ve added doc tests, unit tests, and a Jupyter notebook demonstrating the usage and verifying the code’s correctness. I’ve also polished up the code a lot, adding a few more features:

- full support for all the combo models (spatial lag + spatial error), endogenous error models, and heteroskedastic/homoskedastic spatial error formulations
- adding the spatial lag of a covariate now automatically includes the non-lagged covariate (required for SLX models)
- intelligently transfers variable names from provided formula to the selected modeling class, including any transformations

I’m now working on developing formula syntax for the regimes models that exist in `spreg`

. Hopefully, I’ll be able to have that ready by tomorrow’s PySAL developer meeting so I can include it in my presentation. If not, then it will be something I can easily develop in the future now that the base code is prepared. Tomorrow I’ll initiate the bigger review process to get this code added to PySAL!

*Posted 27 July 2022*

This week I was primarily occupied by moving to a new apartment, but I did manage to handle some small issues with the formula parser (see the commits in `tdhoffman/spreg`

). The big news this week is in splitting the scope of this project: in the next week or so, I will move forward with polishing and testing the formula parser for `spreg`

as it exists. I’ll submit that code for formal review and aim to get it accepted in the repository by September.

Separately, I’ll create a new package (name TBD, will add to Ongoing Links) that will serve as a reimagining of the `spreg`

API. My goal with this package is to cleanly reexpress the functionality of `spreg`

in a user-oriented way. Ideally, by the end of the second half of the summer I’ll have a full-fledged novel implementation of global spatial regression models, paving the way for its inclusion in PySAL and broader rethinking of the spatial model class structures.

`from_formula`

*Posted 18 July 2022*

Progress on the `spreg.from_formula()`

function has been steady. I’ve built a parser that accepts a new syntax for spatial lag and spatial error models. This syntax comes in two parts:

- The
`<...>`

operator:- Enclose variables (covariates or dependent) in angle brackets to denote a spatial lag.
- Usage:
`<FIELD>`

adds spatial lag of that field from df to model matrix

- The
`&`

symbol:- Adds spatial error component to a model.
`&`

is not a reserved char in`formulaic`

, so there are no conflicts.

- The parser accepts combinations of
`<...>`

and`&`

:`<FIELD + ... + FIELD> + &`

is the most general possible spatial model available.

Importantly, all terms and operators MUST be space delimited in order for the parser to properly pick up on the tokens. The current design also requires the user to have constructed a weights matrix first, which I think makes sense as the weights functionality is well-documented and external to the actual running of the model.

The ordinary least squares and spatial lag classes have been converted to the new `GenericModel`

template, but have not been tested and almost certainly have some errors. The spatial error class is partway converted and also needs to be tested. Once these are completed, I will convert the combo models (those accepting a mixture of lag and error terms) and build out the dispatcher in `spreg.from_formula`

to select the correct combo class.

*Posted 7 July 2022*

While conceptualizing this project, the twin streams of formula implementation and library standardization seemed quite separate to me. Naturally, a solid base class spec would lead to easier implementations of formulas, but I didn’t see them as being much more related than that. This week, however, the commonalities between these two objectives merged.

First, I began designing a generic base class for PySAL models. I based my design off of the novel framework developed in Guo et al. (2022), which unifies a variety of spatial models under one paradigm (see Table 1 and Figure 1 for more information). The structure is: data, model, objective function, optimization method, and output. Such a structure broadly characterizes nearly every model in PySAL, and would offer a sleek way to standardize the user-facing model classes. Drafts of this idea can be found in my `pysal_base`

repository; in the coming weeks I will be building on these ideas to create minimal working examples to showcase to the PySAL developers at large.

In parallel, I created a demo of the `formulaic`

library (called `formula_test.py`

) in my forked copy of `spreg`

for implementing spatial lag of X models. This demo showcases how to create a model matrix for spatially lagged covariates using native features of the `formulaic`

library. All that remains to implement spatial lag and spatial error models is to add preprocessing of formula strings that recognizes when the dependent variable is being lagged and when a spatial error term is being added. These will be implemented for next week.

Together, these two streams are starting to converge to form a proposal for PySAL’s next-gen model API. Ideally, I’ll equip `spreg`

with a `from_formula`

function that generates and runs models similar to R’s `lm`

function simply given a formula and a dataframe. This external-facing method will connect nicely to updated versions of the modeling classes, which will conform to the new base template.

**References**

Guo, H., Python, A., and Liu, Y. (2022) “A generalized regionalization framework for geographical modelling and its application in spatial regression.” arXiv:2206.09429.

*Posted 1 July 2022*

This week, I accomplished my two goals of implementing `scikit-learn`

-compatible versions of `Moran_Local`

and `GM_Lag`

and starting discussions about converting PySAL’s style. In doing this, I got a better picture of the potential advantages and disadvantages of the paradigm.

One issue I ran into while building out the demo classes was the location of the spatial weights matrix in a `scikit-learn`

style estimator. Putting the matrix in the `__init__`

method of a class implies that the matrix is a parameter rather than data, whereas putting the matrix in the `fit`

method of a class implies that it is data rather than a parameter. Each choice has implications on the format and usage of the class. Personally, I prefer placing the spatial weights matrix in the `__init__`

method as that choice encourages users to tune hyperparameters of the estimator for each spatial domain they are handling. However, this could lead to memory issues if users need to analyze many spatial domains. This dilemma was also noticed by Martin Fleischmann in the `esda`

issue thread.

In `spreg`

, Eli Knaap and Luc Anselin discussed how PySAL and `scikit-learn`

have generally different objectives in their modeling frameworks: the former is typically focused on inference and interpretation, while the latter is centered around prediction. These different use cases affect the design patterns of the two libraries, and as a result it may not make sense to mimic the design pattern from `scikit-learn`

. Of course, Wilkinson formulas were viewed as being useful to users on this thread as they reflect how social scientists think about models, thus making the library more accessible to a wider base of users.

For these reasons, I think it may be more useful to design a set of PySAL-specific base classes. While `scikit-learn`

and multiple dispatch might not be great fits for the library, it still needs consistent standards (whether or not they conform to external rules). Creating a PySAL-specific framework would have several key advantages, like in-house maintainability and customizability. This week, my goals are to begin prototyping Wilkinson formulas (as they are universally popular) and to think of design constraints for a set of PySAL-specific base classes.

*Posted 26 June 2022*

At our first meeting, my mentors and I discussed the options we have for implementing new interfaces to PySAL’s model and exploratory statistic classes. We discussed using multiple dispatch or ducktyping from `scikit-learn`

and chose to begin with ducktyping `scikit-learn`

as multiple dispatch has been deemed non-Pythonic. While developing functionality for Wilkinson formulas is an ancillary goal of this project, we are keeping it in mind as we make interface edits as those structural changes affect the formula implementation.

My first two goals are as follows:

- Choose an unsupervised statistic (
`Moran_Local`

) and a supervised statistic (`GM_Lag`

) and implement them using the`scikit-learn`

paradigm. This will give us two concrete, minimal working examples that we can share to package leads. - Start issues in relevant packages to fill out the cells in the table below. We want to know what must be done for each package to switch to a
`scikit-learn`

interface and to extend it to Wilkinson formulas.

Package | Requirements for ducktyping `scikit-learn` |
Requirements for extending to Wilkinson formulas |
---|---|---|

`spreg` |
||

`gwr` |
||

`esda` |
||

`spopt` |
||

`weights` |
||

`spglm` |
||

… |

Finally, we aim to include automatic attribution tools to all the new interfaces so that users may easily generate the proper citations for the code they are using.

`pysal_base`

, a repository of draft base classes for PySAL.`tdhoffman/spreg`

, my forked copy of`spreg`

. This is a testing ground for the`spreg`

formula dispatcher and new model APIs.`tdhoffman/esda`

, my forked copy of`esda`

. This is a testing ground for base classes in descriptive or unsupervised settings.