As mentioned in the final blog post, these two notebooks showcase the user-oriented designs I created during GSoC 2022.
formula_example.ipynb: a Jupyter notebook demonstrating the syntax and usage of the spatial regression formula interface in spreg.
sklearn_example.ipynb: a Jupyter notebook demonstrating the usage of the scikit-learn-style interface to spreg.
api-dev branch: this branch of my forked copy of spreg contains all the deliverables I created over the summer. I did some experimental work in other repositories (see the blog and links at the end of this page), but the vast majority of the work was focused on creating different interfaces for spreg.
Posted 10 September 2022
This summer, I implemented two new ways of interfacing with the
spreg module of the Python Spatial Analysis Library. In the first half of the summer, I created a Wilkinson formula interface available through the
spreg.from_formula() function. Wilkinson formulas are used extensively in R and Python packages like
statsmodels that are oriented towards quantitative social scientists and econometricians. These libraries emphasize the inference side of quantitative analyses and value interpretability and expressibility of models. Formulas offer a natural path for achieving these goals: by writing in a mathematical syntax, users can easily fit a model and interpret the results. It’s a clear, direct way of doing statistical modeling that requires very little code on the user end. I extended the Wilkinson formula syntax to admit spatial regression components (spatial lags and spatial errors) and wrapped the formula parser from
formulaic in a spatial preprocessor (that handles the spatial components) and dispatcher (that sends the constructed model matrices to the modeling classes in
spreg). As a result of my work, users can now fit nonspatial and spatial regressions using a PySAL function. This contribution improves the accessibility and ease of use of
spreg’s modeling classes. To see the formula interface in action, read the blog entries below or check out the formula_example.ipynb Jupyter notebook linked above.
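The preprocessing step described above can be sketched in a few lines of pure Python. Everything here is illustrative: the function name, the exact token grammar, and the return shape are my assumptions, not spreg's actual implementation.

```python
import re

def split_spatial_terms(formula):
    """Separate the spatial tokens from a Wilkinson formula string.

    Sketch only: '<A + B>' marks spatial lags of fields A and B, and a
    bare '&' marks a spatial error term; everything else is passed on
    to an ordinary formula parser such as formulaic's.
    """
    lhs, rhs = (side.strip() for side in formula.split("~"))
    # Pull out every angle-bracket group and record the lagged fields.
    lags = []
    for group in re.findall(r"<([^<>]+)>", rhs):
        lags.extend(field.strip() for field in group.split("+"))
    rhs = re.sub(r"<[^<>]+>", "", rhs)
    # A bare '&' term requests a spatial error component.
    terms = [t.strip() for t in rhs.split("+")]
    has_error = "&" in terms
    plain = [t for t in terms if t and t != "&"]
    return lhs + " ~ " + " + ".join(plain), lags, has_error
```

A preprocessor along these lines lets the cleaned formula go to formulaic unchanged, while the spatial components are handled separately and the assembled matrices are dispatched to the modeling classes.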
In the second half of the summer, I designed an alternative way of interfacing with the modeling classes in
spreg. In a submodule called
spreg.sklearn, I translated the spatial regression modeling classes from spreg into scikit-learn-style estimators.
scikit-learn is the dominant Python library for statistical and machine learning (anything short of deep learning) and is widely used in the data science community. It’s focused primarily on prediction, which makes it a great complement to the formula interface implemented in the first half of the summer. By conforming
spreg’s modeling classes to
scikit-learn’s style, the models can now be used in
scikit-learn workflows. This makes spatial models available to a broad community of modelers who previously may not have included them in their data science workflows. Once this work is merged into the main branch of spreg, it may be worth reaching out to the scikit-learn development team or others in the community to raise awareness of the connection and begin promoting the use of spatial models in data science. Finally, to preserve backward compatibility, the
spreg.sklearn submodule is not imported by default when importing
spreg – it must be directly imported (e.g.,
import spreg.sklearn) so as not to pollute the namespace.
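To illustrate the estimator contract involved (this is a minimal sketch, not spreg.sklearn's actual classes), here is a duck-typed spatial model written with numpy alone: an SLX-style regressor that augments X with its spatial lag WX and fits the result by ordinary least squares.

```python
import numpy as np

class SLXRegressor:
    """Minimal scikit-learn-style estimator sketch: an SLX model that
    augments X with its spatial lag WX and fits ordinary least squares.
    Illustrative only -- not spreg's actual implementation."""

    def __init__(self, w):
        # Per the scikit-learn contract, __init__ only stores
        # hyperparameters verbatim; no computation happens here.
        self.w = w

    def _design(self, X):
        X = np.asarray(X, dtype=float)
        # Intercept, original covariates, and their spatial lags.
        return np.column_stack([np.ones(len(X)), X, self.w @ X])

    def fit(self, X, y):
        # Fitted attributes carry a trailing underscore, sklearn-style.
        self.coef_, *_ = np.linalg.lstsq(
            self._design(X), np.asarray(y, dtype=float), rcond=None
        )
        return self  # returning self allows method chaining

    def predict(self, X):
        return self._design(X) @ self.coef_
```

Because it exposes __init__/fit/predict in the scikit-learn style, an object like this can be dropped into scikit-learn utilities such as pipelines, though cross-validation needs extra care since W couples observations across rows.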
I am immensely thankful for the advice and guidance of my mentors throughout this project. Their encyclopedic knowledge of the design and structure of PySAL proved invaluable time and time again while I was building these interfaces. Our discussions about design philosophy and code organization were illuminating for me: I’ve used open source code for years, but prior to this summer had not spent time thinking about its construction and development. Gaining insights into how the library and its modules were structured has made me introspect about the usage of code as an expression of thought. I’m eager to build on these insights in my dissertation and to continue working with the PySAL development team to expand the library!
Posted 11 August 2022
Unfortunately, the syntax for the regimes models proved to be too difficult to set up. My mentors and I initially thought that
formulaic’s pipe operator would be sufficient to create the groupings needed for a multilevel regimes model, but it turns out that the pipe in
formulaic works differently: it is designed for creating multiple models, as in two-stage least squares. This makes implementing regimes significantly more difficult, and it is not a high priority at this point in the project. The pull request for the formula implementation has been submitted, and regimes support can come at some point in the future.
After receiving confirmation that the
spreg code must remain backwards-compatible, my mentors and I have shifted to designing a new package with a sleeker API which could exist in parallel. To do this, we’ve elected to revisit the
scikit-learn design pattern and create spatial models which plug directly into it. As a result, the new package doesn’t need the ordinary least squares class (already implemented in
sklearn.linear_model), so I’ve gotten to work rewriting the spatial lag and spatial error model classes in the
scikit-learn design. They’ll each use the appropriate mixins and base classes for linear models, and should plug directly into existing scikit-learn workflows.
Posted 4 August 2022
The spreg.from_formula() function has been steadily getting closer to library-grade code. Per the
spreg Feature Enhancement Proposal instructions, I’ve added doctests, unit tests, and a Jupyter notebook demonstrating the usage and verifying the code’s correctness. I’ve also polished the code considerably and added a few more features.
I’m now working on developing formula syntax for the regimes models that exist in
spreg. Hopefully, I’ll be able to have that ready by tomorrow’s PySAL developer meeting so I can include it in my presentation. If not, then it will be something I can easily develop in the future now that the base code is prepared. Tomorrow I’ll initiate the bigger review process to get this code added to PySAL!
Posted 27 July 2022
This week I was primarily occupied by moving to a new apartment, but I did manage to handle some small issues with the formula parser (see the commits in
tdhoffman/spreg). The big news this week is in splitting the scope of this project: in the next week or so, I will move forward with polishing and testing the formula parser for
spreg as it exists. I’ll submit that code for formal review and aim to get it accepted in the repository by September.
Separately, I’ll create a new package (name TBD, will add to Ongoing Links) that will serve as a reimagining of the
spreg API. My goal with this package is to cleanly reexpress the functionality of
spreg in a user-oriented way. Ideally, by the end of the second half of the summer I’ll have a full-fledged novel implementation of global spatial regression models, paving the way for its inclusion in PySAL and broader rethinking of the spatial model class structures.
Posted 18 July 2022
Progress on the
spreg.from_formula() function has been steady. I’ve built a parser that accepts a new syntax for spatial lag and spatial error models. This syntax comes in two parts:
<FIELD> adds the spatial lag of that field from the dataframe to the model matrix.
& adds a spatial error term to the model; & is not a reserved character in formulaic, so there are no conflicts.
<FIELD + ... + FIELD> + & is therefore the most general spatial model available.
Importantly, all terms and operators MUST be space delimited in order for the parser to properly pick up on the tokens. The current design also requires the user to have constructed a weights matrix first, which I think makes sense as the weights functionality is well-documented and external to the actual running of the model.
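Concretely, a <FIELD> token resolves to the product of the weights matrix with that column. A small numpy sketch with toy data (the weights here are a hypothetical row-standardized chain of four units, not anything from the notebooks):

```python
import numpy as np

# Toy row-standardized weights for 4 units on a line: each unit's
# neighbors are the units immediately before and after it.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

income = np.array([1.0, 2.0, 4.0, 8.0])

# '<income>' contributes the spatial lag of income: each entry is the
# average of the neighbors' values -> [2.0, 2.5, 5.0, 4.0]
lagged = W @ income
```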
The ordinary least squares and spatial lag classes have been converted to the new
GenericModel template, but have not been tested and almost certainly have some errors. The spatial error class is partway converted and also needs to be tested. Once these are completed, I will convert the combo models (those accepting a mixture of lag and error terms) and build out the dispatcher in
spreg.from_formula to select the correct combo class.
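The dispatcher itself amounts to a small branch on which spatial terms the parser found. A hedged sketch, with placeholder strings standing in for the actual spreg classes:

```python
def dispatch_model(lags, has_error):
    """Pick a model class from the parsed spatial terms.

    The returned names are illustrative stand-ins; the real dispatcher
    would select among spreg's OLS, lag, error, and combo estimators.
    """
    if lags and has_error:
        return "ComboModel"        # both lag and error terms present
    if lags:
        return "SpatialLagModel"   # lag terms only
    if has_error:
        return "SpatialErrorModel" # error term only
    return "OLSModel"              # no spatial terms: plain regression
```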
Posted 7 July 2022
While conceptualizing this project, the twin streams of formula implementation and library standardization seemed quite separate to me. Naturally, a solid base class spec would lead to easier implementations of formulas, but I didn’t see them as being much more related than that. This week, however, the commonalities between these two objectives merged.
First, I began designing a generic base class for PySAL models. I based my design on the novel framework developed in Guo et al. (2022), which unifies a variety of spatial models under one paradigm (see their Table 1 and Figure 1 for more information). The structure is: data, model, objective function, optimization method, and output. Such a structure broadly characterizes nearly every model in PySAL and would offer a sleek way to standardize the user-facing model classes. Drafts of this idea can be found in my
pysal_base repository; in the coming weeks I will be building on these ideas to create minimal working examples to showcase to the PySAL developers at large.
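As a rough illustration of that five-part structure (data, model, objective function, optimization method, output), a base class might look like the following. The names are mine, chosen for this sketch, not the actual pysal_base drafts:

```python
from abc import ABC, abstractmethod

class GenericModel(ABC):
    """Sketch of the five-part structure from Guo et al. (2022):
    data, model, objective function, optimization method, output."""

    def __init__(self, data):
        self.data = data          # 1. data

    @abstractmethod
    def objective(self, params):
        """3. Objective function the model minimizes."""

    @abstractmethod
    def optimize(self):
        """4. Optimization method; returns the fitted parameters."""

    def fit(self):
        # 2./5. Run the model and store the output on the instance.
        self.params_ = self.optimize()
        return self
```

A concrete model then only needs to supply its objective and optimizer, and inherits a uniform fit() entry point.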
In parallel, I created a demo of the
formulaic library (called
formula_test.py) in my forked copy of
spreg for implementing spatial lag of X models. This demo showcases how to create a model matrix for spatially lagged covariates using native features of the
formulaic library. All that remains to implement spatial lag and spatial error models is to add preprocessing of formula strings that recognizes when the dependent variable is being lagged and when a spatial error term is being added. These will be implemented for next week.
Together, these two streams are starting to converge to form a proposal for PySAL’s next-gen model API. Ideally, I’ll equip
spreg with a
from_formula function that generates and runs models similar to R’s
lm function, given just a formula and a dataframe. This external-facing function will connect nicely to updated versions of the modeling classes, which will conform to the new base template.
Guo, H., Python, A., and Liu, Y. (2022) “A generalized regionalization framework for geographical modelling and its application in spatial regression.” arXiv:2206.09429.
Posted 1 July 2022
This week, I accomplished my two goals of implementing
scikit-learn-compatible versions of
GM_Lag and starting discussions about converting PySAL’s style. In doing this, I got a better picture of the potential advantages and disadvantages of the paradigm.
One issue I ran into while building out the demo classes was the location of the spatial weights matrix in a
scikit-learn style estimator. Putting the matrix in the
__init__ method of a class implies that the matrix is a parameter rather than data, whereas putting the matrix in the
fit method of a class implies that it is data rather than a parameter. Each choice has implications on the format and usage of the class. Personally, I prefer placing the spatial weights matrix in the
__init__ method as that choice encourages users to tune hyperparameters of the estimator for each spatial domain they are handling. However, this could lead to memory issues if users need to analyze many spatial domains. This dilemma was also noticed by Martin Fleischmann in the
esda issue thread.
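In code, the two placements look like this (skeleton classes only; the names are illustrative):

```python
# Option A: W as a hyperparameter, fixed when the estimator is built.
class LagModelA:
    def __init__(self, w):
        # W travels with the estimator, encouraging per-domain tuning.
        self.w = w

    def fit(self, X, y):
        ...  # estimation would use self.w here


# Option B: W as data, supplied alongside X and y at fit time.
class LagModelB:
    def __init__(self):
        pass  # no spatial state held on the estimator

    def fit(self, X, y, w):
        ...  # W arrives with the data, so one estimator serves many domains
```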
In the corresponding issue thread for spreg, Eli Knaap and Luc Anselin discussed how PySAL and
scikit-learn have generally different objectives in their modeling frameworks: the former is typically focused on inference and interpretation, while the latter is centered around prediction. These different use cases affect the design patterns of the two libraries, and as a result it may not make sense to mimic the design pattern from
scikit-learn. That said, Wilkinson formulas were viewed on this thread as useful, since they reflect how social scientists think about models and thus make the library accessible to a wider base of users.
For these reasons, I think it may be more useful to design a set of PySAL-specific base classes. While
scikit-learn and multiple dispatch might not be great fits for the library, it still needs consistent standards (whether or not they conform to external rules). Creating a PySAL-specific framework would have several key advantages, like in-house maintainability and customizability. This week, my goals are to begin prototyping Wilkinson formulas (as they are universally popular) and to think of design constraints for a set of PySAL-specific base classes.
Posted 26 June 2022
At our first meeting, my mentors and I discussed the options we have for implementing new interfaces to PySAL’s model and exploratory statistic classes. We discussed using multiple dispatch or ducktyping from
scikit-learn, and chose to begin with the ducktyping approach, as multiple dispatch has been deemed non-Pythonic. While developing functionality for Wilkinson formulas is an ancillary goal of this project, we are keeping it in mind as we make interface changes, since those structural decisions affect the formula implementation.
My first two goals are as follows:
1. Select an unsupervised statistic (Moran_Local) and a supervised statistic (GM_Lag) and implement them using the scikit-learn paradigm. This will give us two concrete, minimal working examples that we can share with package leads.
2. Survey the PySAL packages to determine each one's requirements for ducktyping the scikit-learn interface and for extending it to Wilkinson formulas.
| Package | Requirements for ducktyping | Requirements for extending to Wilkinson formulas |
| --- | --- | --- |
Finally, we aim to include automatic attribution tools to all the new interfaces so that users may easily generate the proper citations for the code they are using.
pysal_base, a repository of draft base classes for PySAL.
tdhoffman/spreg, my forked copy of
spreg. This is a testing ground for the
spreg formula dispatcher and new model APIs.
tdhoffman/esda, my forked copy of
esda. This is a testing ground for base classes in descriptive or unsupervised settings.