GSoC 2022

Summary Links

As mentioned in the final blog post, these two notebooks showcase the user-oriented designs I created during GSoC 2022.

formula_example.ipynb: a Jupyter notebook demonstrating the syntax and usage of the spatial regression formula syntax in pysal/spreg.
sklearn_example.ipynb: a Jupyter notebook demonstrating the usage of the scikit-learn style interface to pysal/spreg.
tdhoffman/spreg, api-dev branch: this branch of my forked copy of spreg contains all the deliverables I created over the summer. I did some experimental work in other repositories (see the blog and links at the end of this page), but the vast majority of the work was focused on creating different interfaces for spreg.

8. Final Blog Post

Posted 10 September 2022

This summer, I implemented two new ways of interfacing with the spreg module of the Python Spatial Analysis Library. In the first half of the summer, I created a Wilkinson formula interface available through the spreg.from_formula() function. Wilkinson formulas are used extensively in R and Python packages like statsmodels that are oriented towards quantitative social scientists and econometricians. These libraries emphasize the inference side of quantitative analyses and value interpretability and expressibility of models. Formulas offer a natural path for achieving these goals: by writing in a mathematical syntax, users can easily fit a model and interpret the results. It’s a clear, direct way of doing statistical modeling that requires very little code on the user end. I extended the Wilkinson formula syntax to admit spatial regression components (spatial lags and spatial errors) and wrapped the formula parser from formulaic in a spatial preprocessor (that handles the spatial components) and dispatcher (that sends the constructed model matrices to the modeling classes in spreg). As a result of my work, users can now fit nonspatial and spatial regressions using a PySAL function. This contribution improves the accessibility and ease of use of spreg’s modeling classes. To see how they work, read the blog entries below or check out the formula_example.ipynb Jupyter notebook linked above.

In the second half of the summer, I designed an alternative way of interfacing with the modeling classes in spreg. In a submodule called spreg.sklearn, I translated the spatial regression modeling classes from spreg into scikit-learn style estimators. scikit-learn is the dominant Python library for statistical and machine learning (anything short of deep learning) and is widely used in the data science community. It’s focused primarily on prediction, which makes it a great complement to the formula interface implemented in the first half of the summer. By conforming spreg’s modeling classes to scikit-learn’s style, the models can now be used in scikit-learn workflows. This construction makes spatial models available to a wide community of modelers who previously may not have included spatial models in their data science workflow. Once merged with the main branch of spreg, it may be useful to reach out to the scikit-learn development team or others in the community to raise awareness about this connection and begin promoting the use of spatial models in the data science community. Finally, to preserve backward-compatibility, the spreg.sklearn submodule is not imported by default when importing spreg – it must be directly imported (e.g., import spreg.sklearn) so as not to pollute the namespace.

I am immensely thankful for the advice and guidance of my mentors throughout this project. Their encyclopedic knowledge of the design and structure of PySAL proved invaluable time and time again while I was building these interfaces. Our discussions about design philosophy and code organization were illuminating for me: I’ve used open source code for years, but prior to this summer had not spent time thinking about its construction and development. Gaining insights into how the library and its modules were structured has made me introspect about the usage of code as an expression of thought. I’m eager to build on these insights in my dissertation and to continue working with the PySAL development team to expand the library!

7. Revisiting the `scikit-learn` Structure

Posted 11 August 2022

Unfortunately, the syntax for the regimes models proved to be too difficult to set up. My mentors and I initially thought that formulaic’s pipe operator would be sufficient to create the groupings needed for a multilevel regimes model, but it turns out that the pipe in formulaic works differently. It’s designed for creating multiple models, like for a two-stage least squares. This finding makes it significantly more difficult to implement the regimes, and it’s not as high of a priority at this point in the project. The pull request for the formula implementation has been submitted and regimes can come at some point in the future.

After receiving confirmation that the spreg code must remain backwards-compatible, my mentors and I have shifted to designing a new package with a sleeker API which could exist in parallel. To do this, we’ve elected to revisit the scikit-learn design pattern and create spatial models which plug directly into it. As a result, the new package doesn’t need the ordinary least squares class (already implemented in sklearn.linear_model), so I’ve gotten to work rewriting the spatial lag and spatial error model classes in the scikit-learn design. They’ll each use the appropriate mixins and base classes for linear models, and should serve to plug and play directly with existing scikit-learn pipelines.

6. Production Ready

Posted 4 August 2022

The spreg.from_formula() function has been steadily getting closer and closer to library-grade code. Per the spreg Feature Enhancement Proposal instructions, I’ve added doc tests, unit tests, and a Jupyter notebook demonstrating the usage and verifying the code’s correctness. I’ve also polished up the code a lot, adding a few more features:

full support for all the combo models (spatial lag + spatial error), endogenous error models, and heteroskedastic/homoskedastic spatial error formulations
adding the spatial lag of a covariate now automatically includes the non-lagged covariate (required for SLX models)
intelligently transfers variable names from provided formula to the selected modeling class, including any transformations

I’m now working on developing formula syntax for the regimes models that exist in spreg. Hopefully, I’ll be able to have that ready by tomorrow’s PySAL developer meeting so I can include it in my presentation. If not, then it will be something I can easily develop in the future now that the base code is prepared. Tomorrow I’ll initiate the bigger review process to get this code added to PySAL!

5. Divergence

Posted 27 July 2022

This week I was primarily occupied by moving to a new apartment, but I did manage to handle some small issues with the formula parser (see the commits in tdhoffman/spreg). The big news this week is in splitting the scope of this project: in the next week or so, I will move forward with polishing and testing the formula parser for spreg as it exists. I’ll submit that code for formal review and aim to get it accepted in the repository by September.

Separately, I’ll create a new package (name TBD, will add to Ongoing Links) that will serve as a reimagining of the spreg API. My goal with this package is to cleanly reexpress the functionality of spreg in a user-oriented way. Ideally, by the end of the second half of the summer I’ll have a full-fledged novel implementation of global spatial regression models, paving the way for its inclusion in PySAL and broader rethinking of the spatial model class structures.

4. Creating `from_formula`

Posted 18 July 2022

Progress on the spreg.from_formula() function has been steady. I’ve built a parser that accepts a new syntax for spatial lag and spatial error models. This syntax comes in two parts:

The <...> operator:
- Enclose variables (covariates or dependent) in angle brackets to denote a spatial lag.
- Usage: <FIELD> adds spatial lag of that field from df to model matrix
The & symbol:
- Adds spatial error component to a model.
- & is not a reserved char in formulaic, so there are no conflicts.
The parser accepts combinations of <...> and &: <FIELD + ... + FIELD> + & is the most general possible spatial model available.

Importantly, all terms and operators MUST be space delimited in order for the parser to properly pick up on the tokens. The current design also requires the user to have constructed a weights matrix first, which I think makes sense as the weights functionality is well-documented and external to the actual running of the model.

The ordinary least squares and spatial lag classes have been converted to the new GenericModel template, but have not been tested and almost certainly have some errors. The spatial error class is partway converted and also needs to be tested. Once these are completed, I will convert the combo models (those accepting a mixture of lag and error terms) and build out the dispatcher in spreg.from_formula to select the correct combo class.

3. Convergence

Posted 7 July 2022

While conceptualizing this project, the twin streams of formula implementation and library standardization seemed quite separate to me. Naturally, a solid base class spec would lead to easier implementations of formulas, but I didn’t see them as being much more related than that. This week, however, the commonalities between these two objectives merged.

First, I began designing a generic base class for PySAL models. I based my design off of the novel framework developed in Guo et al. (2022), which unifies a variety of spatial models under one paradigm (see Table 1 and Figure 1 for more information). The structure is: data, model, objective function, optimization method, and output. Such a structure broadly characterizes nearly every model in PySAL, and would offer a sleek way to standardize the user-facing model classes. Drafts of this idea can be found in my pysal_base repository; in the coming weeks I will be building on these ideas to create minimal working examples to showcase to the PySAL developers at large.

In parallel, I created a demo of the formulaic library (called formula_test.py) in my forked copy of spreg for implementing spatial lag of X models. This demo showcases how to create a model matrix for spatially lagged covariates using native features of the formulaic library. All that remains to implement spatial lag and spatial error models is to add preprocessing of formula strings that recognizes when the dependent variable is being lagged and when a spatial error term is being added. These will be implemented for next week.

Together, these two streams are starting to converge to form a proposal for PySAL’s next-gen model API. Ideally, I’ll equip spreg with a from_formula function that generates and runs models similar to R’s lm function simply given a formula and a dataframe. This external-facing method will connect nicely to updated versions of the modeling classes, which will conform to the new base template.

References

Guo, H., Python, A., and Liu, Y. (2022) “A generalized regionalization framework for geographical modelling and its application in spatial regression.” arXiv:2206.09429.

2. Base Class Rethinking

Posted 1 July 2022

This week, I accomplished my two goals of implementing scikit-learn-compatible versions of Moran_Local and GM_Lag and starting discussions about converting PySAL’s style. In doing this, I got a better picture of the potential advantages and disadvantages of the paradigm.

One issue I ran into while building out the demo classes was the location of the spatial weights matrix in a scikit-learn style estimator. Putting the matrix in the __init__ method of a class implies that the matrix is a parameter rather than data, whereas putting the matrix in the fit method of a class implies that it is data rather than a parameter. Each choice has implications on the format and usage of the class. Personally, I prefer placing the spatial weights matrix in the __init__ method as that choice encourages users to tune hyperparameters of the estimator for each spatial domain they are handling. However, this could lead to memory issues if users need to analyze many spatial domains. This dilemma was also noticed by Martin Fleischmann in the esda issue thread.

In spreg, Eli Knaap and Luc Anselin discussed how PySAL and scikit-learn have generally different objectives in their modeling frameworks: the former is typically focused on inference and interpretation, while the latter is centered around prediction. These different use cases affect the design patterns of the two libraries, and as a result it may not make sense to mimic the design pattern from scikit-learn. Of course, Wilkinson formulas were viewed as being useful to users on this thread as they reflect how social scientists think about models, thus making the library more accessible to a wider base of users.

For these reasons, I think it may be more useful to design a set of PySAL-specific base classes. While scikit-learn and multiple dispatch might not be great fits for the library, it still needs consistent standards (whether or not they conform to external rules). Creating a PySAL-specific framework would have several key advantages, like in-house maintainability and customizability. This week, my goals are to begin prototyping Wilkinson formulas (as they are universally popular) and to think of design constraints for a set of PySAL-specific base classes.

1. Initial Goals

Posted 26 June 2022

At our first meeting, my mentors and I discussed the options we have for implementing new interfaces to PySAL’s model and exploratory statistic classes. We discussed using multiple dispatch or ducktyping from scikit-learn and chose to begin with ducktyping scikit-learn as multiple dispatch has been deemed non-Pythonic. While developing functionality for Wilkinson formulas is an ancillary goal of this project, we are keeping it in mind as we make interface edits as those structural changes affect the formula implementation.

My first two goals are as follows:

Choose an unsupervised statistic (Moran_Local) and a supervised statistic (GM_Lag) and implement them using the scikit-learn paradigm. This will give us two concrete, minimal working examples that we can share to package leads.
Start issues in relevant packages to fill out the cells in the table below. We want to know what must be done for each package to switch to a scikit-learn interface and to extend it to Wilkinson formulas.

Package	Requirements for ducktyping `scikit-learn`	Requirements for extending to Wilkinson formulas
`spreg`
`gwr`
`esda`
`spopt`
`weights`
`spglm`
…

Finally, we aim to include automatic attribution tools to all the new interfaces so that users may easily generate the proper citations for the code they are using.

Tyler Hoffman

Summary Links

8. Final Blog Post

7. Revisiting the `scikit-learn` Structure

6. Production Ready

5. Divergence

4. Creating `from_formula`

3. Convergence

2. Base Class Rethinking

1. Initial Goals

Other Links

Summary Links

8. Final Blog Post

7. Revisiting the scikit-learn Structure

6. Production Ready

5. Divergence

4. Creating from_formula

3. Convergence

2. Base Class Rethinking

1. Initial Goals

Other Links

7. Revisiting the `scikit-learn` Structure

4. Creating `from_formula`