Tyler Hoffman


(he/him) · Spatial data scientist, statistician, computational thinker.

Email me: tdhoffman@asu.edu






GSoC 2022 Progress Journal

Kedron Lab @ ASU



7. Revisiting the scikit-learn Structure

Posted 11 August 2022

Unfortunately, the syntax for the regimes models proved too difficult to set up. My mentors and I initially thought that formulaic’s pipe operator would be sufficient to create the groupings needed for a multilevel regimes model, but the pipe in formulaic works differently: it is designed for specifying multiple models at once, as in two-stage least squares. This finding makes the regimes significantly more difficult to implement, and they are not as high a priority at this point in the project. The pull request for the formula implementation has been submitted, and regimes can come at some point in the future.

After receiving confirmation that the spreg code must remain backwards-compatible, my mentors and I have shifted to designing a new package with a sleeker API that can exist in parallel. To do this, we’ve elected to revisit the scikit-learn design pattern and create spatial models that plug directly into it. As a result, the new package doesn’t need an ordinary least squares class (one is already implemented in sklearn.linear_model), so I’ve gotten to work rewriting the spatial lag and spatial error model classes in the scikit-learn design. They’ll each use the appropriate mixins and base classes for linear models, and should plug directly into existing scikit-learn pipelines.

6. Production Ready

Posted 4 August 2022

The spreg.from_formula() function has been steadily getting closer to library-grade code. Per the spreg Feature Enhancement Proposal instructions, I’ve added doc tests, unit tests, and a Jupyter notebook demonstrating the usage and verifying the code’s correctness. I’ve also polished the code considerably, adding a few more features:

I’m now working on developing formula syntax for the regimes models that exist in spreg. Hopefully, I’ll be able to have that ready by tomorrow’s PySAL developer meeting so I can include it in my presentation. If not, then it will be something I can easily develop in the future now that the base code is prepared. Tomorrow I’ll initiate the bigger review process to get this code added to PySAL!

5. Divergence

Posted 27 July 2022

This week I was primarily occupied by moving to a new apartment, but I did manage to handle some small issues with the formula parser (see the commits in tdhoffman/spreg). The big news this week is in splitting the scope of this project: in the next week or so, I will move forward with polishing and testing the formula parser for spreg as it exists. I’ll submit that code for formal review and aim to get it accepted in the repository by September.

Separately, I’ll create a new package (name TBD, will add to Ongoing Links) that will serve as a reimagining of the spreg API. My goal with this package is to cleanly reexpress the functionality of spreg in a user-oriented way. Ideally, by the end of the second half of the summer I’ll have a full-fledged novel implementation of global spatial regression models, paving the way for its inclusion in PySAL and broader rethinking of the spatial model class structures.

4. Creating from_formula

Posted 18 July 2022

Progress on the spreg.from_formula() function has been steady. I’ve built a parser that accepts a new syntax for spatial lag and spatial error models. This syntax comes in two parts:

Importantly, all terms and operators MUST be space delimited in order for the parser to properly pick up on the tokens. The current design also requires the user to have constructed a weights matrix first, which I think makes sense as the weights functionality is well-documented and external to the actual running of the model.
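To illustrate why the space-delimiting requirement matters, here is a minimal sketch of a whitespace-based tokenizer. It is not the parser from spreg: the function names `tokenize` and `split_formula` are hypothetical, and the formula uses only plain Wilkinson-style terms rather than the spatial operators.

```python
def tokenize(formula):
    """Split a formula string into tokens on whitespace only."""
    return formula.split()


def split_formula(formula):
    """Return (response, terms) for a space-delimited formula.

    Hypothetical helper for illustration; not part of spreg.
    """
    tokens = tokenize(formula)
    if "~" not in tokens:
        raise ValueError(
            "no standalone '~' token found; are the operators space delimited?"
        )
    idx = tokens.index("~")
    response = tokens[:idx]
    # Keep everything on the right-hand side that is not a '+' operator.
    terms = [tok for tok in tokens[idx + 1:] if tok != "+"]
    return response, terms


# With spaces, each term and operator becomes its own token.
spaced = split_formula("y ~ x1 + x2")

# Without spaces, the whole formula collapses into a single token,
# so a whitespace-based parser cannot find the '~' operator.
unspaced_tokens = tokenize("y~x1+x2")
```

A tokenizer like this is simple to reason about, at the cost of rejecting compact formulas such as `y~x1+x2`.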

The ordinary least squares and spatial lag classes have been converted to the new GenericModel template, but have not been tested and almost certainly have some errors. The spatial error class is partway converted and also needs to be tested. Once these are completed, I will convert the combo models (those accepting a mixture of lag and error terms) and build out the dispatcher in spreg.from_formula to select the correct combo class.

3. Convergence

Posted 7 July 2022

While conceptualizing this project, the twin streams of formula implementation and library standardization seemed quite separate to me. Naturally, a solid base class spec would lead to easier implementations of formulas, but I didn’t see them as being much more related than that. This week, however, the two objectives began to converge.

First, I began designing a generic base class for PySAL models. I based my design on the novel framework developed in Guo et al. (2022), which unifies a variety of spatial models under one paradigm (see Table 1 and Figure 1 of that paper for more information). The structure is: data, model, objective function, optimization method, and output. Such a structure broadly characterizes nearly every model in PySAL, and would offer a sleek way to standardize the user-facing model classes. Drafts of this idea can be found in my pysal_base repository; in the coming weeks I will be building on these ideas to create minimal working examples to showcase to the PySAL developers at large.
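As a rough illustration of the five-part structure (data, model, objective function, optimization method, output), here is a minimal sketch of what such a template could look like. The class names `BaseSpatialModel` and `ConstantModel` are hypothetical and are not taken from the pysal_base drafts; the toy subclass has no spatial component and exists only to show how the pieces fit together.

```python
from abc import ABC, abstractmethod


class BaseSpatialModel(ABC):
    """Hypothetical template: data -> model -> objective -> optimizer -> output."""

    def __init__(self, y):
        self.y = list(y)  # data

    @abstractmethod
    def predict(self, params):
        """Model: map parameters to fitted values."""

    @abstractmethod
    def objective(self, params):
        """Objective function: score a candidate set of parameters."""

    @abstractmethod
    def optimize(self):
        """Optimization method: return the best parameters."""

    def fit(self):
        # Output: store the results as attributes on the fitted object.
        self.params_ = self.optimize()
        self.loss_ = self.objective(self.params_)
        return self


class ConstantModel(BaseSpatialModel):
    """Toy subclass: predicts a single constant, fit by least squares."""

    def predict(self, params):
        return [params] * len(self.y)

    def objective(self, params):
        fitted = self.predict(params)
        return sum((yi - fi) ** 2 for yi, fi in zip(self.y, fitted))

    def optimize(self):
        # The sample mean minimizes the sum of squared errors in closed form.
        return sum(self.y) / len(self.y)


result = ConstantModel([1.0, 2.0, 3.0]).fit()
```

Under this layout, a new model only has to supply the three abstract pieces, and the user-facing `fit` call stays identical across models.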

In parallel, I created a demo of the formulaic library (called formula_test.py) in my forked copy of spreg for implementing spatial lag of X models. This demo showcases how to create a model matrix for spatially lagged covariates using native features of the formulaic library. All that remains to implement spatial lag and spatial error models is to add preprocessing of formula strings that recognizes when the dependent variable is being lagged and when a spatial error term is being added. These will be implemented for next week.
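For readers unfamiliar with the term, the spatial lag of a covariate x is the product Wx, where W is the spatial weights matrix. The sketch below computes it in plain Python for a hypothetical three-unit neighbor structure and assembles the design matrix of a spatial lag of X model. It does not use formulaic or libpysal, and the helper names are assumptions made for illustration.

```python
def row_standardize(W):
    """Scale each row of a weights matrix to sum to 1 (all-zero rows stay zero)."""
    out = []
    for row in W:
        total = sum(row)
        out.append([w / total if total else 0.0 for w in row])
    return out


def spatial_lag(W, x):
    """Compute Wx: for each unit, the weighted sum of its neighbors' values."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


# Hypothetical binary contiguity for three units arranged in a line: 0 - 1 - 2.
W = row_standardize([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
x = [10.0, 20.0, 60.0]
wx = spatial_lag(W, x)

# Design matrix for a spatial lag of X model: intercept, x, and Wx.
design = [[1.0, xi, wxi] for xi, wxi in zip(x, wx)]
```

With row-standardized weights, each entry of `wx` is the average of x over that unit's neighbors, so the middle unit receives the mean of its two neighbors.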

Together, these two streams are starting to converge to form a proposal for PySAL’s next-gen model API. Ideally, I’ll equip spreg with a from_formula function that generates and runs models similar to R’s lm function simply given a formula and a dataframe. This external-facing method will connect nicely to updated versions of the modeling classes, which will conform to the new base template.


Guo, H., Python, A., and Liu, Y. (2022) “A generalized regionalization framework for geographical modelling and its application in spatial regression.” arXiv:2206.09429.

2. Base Class Rethinking

Posted 1 July 2022

This week, I accomplished my two goals of implementing scikit-learn-compatible versions of Moran_Local and GM_Lag and starting discussions about converting PySAL’s style. In doing this, I got a better picture of the potential advantages and disadvantages of the paradigm.

One issue I ran into while building out the demo classes was the location of the spatial weights matrix in a scikit-learn style estimator. Putting the matrix in the __init__ method of a class implies that the matrix is a parameter rather than data, whereas putting the matrix in the fit method of a class implies that it is data rather than a parameter. Each choice has implications for the format and usage of the class. Personally, I prefer placing the spatial weights matrix in the __init__ method, as that choice encourages users to tune hyperparameters of the estimator for each spatial domain they are handling. However, this could lead to memory issues if users need to analyze many spatial domains, since each domain then needs its own estimator object. This dilemma was also noticed by Martin Fleischmann in the esda issue thread.
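The two placements can be contrasted with a short sketch. These classes only duck-type the scikit-learn estimator interface (they do not import scikit-learn or estimate anything), and the names `LagInInit` and `LagInFit` are hypothetical, not the demo classes themselves.

```python
class LagInInit:
    """Weights matrix treated as a parameter: supplied at construction."""

    def __init__(self, w):
        self.w = w

    def get_params(self, deep=True):
        # scikit-learn estimators report their constructor arguments here.
        return {"w": self.w}

    def fit(self, X, y):
        self.n_obs_ = len(y)
        return self


class LagInFit:
    """Weights matrix treated as data: supplied alongside X and y."""

    def get_params(self, deep=True):
        # No constructor arguments, so there is nothing to report.
        return {}

    def fit(self, X, y, w):
        self.w_ = w  # stored at fit time, like other fitted attributes
        self.n_obs_ = len(y)
        return self


# Two hypothetical spatial domains with different numbers of units.
w_a = [[0, 1], [1, 0]]
w_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X_a, y_a = [[1.0], [2.0]], [1.0, 2.0]
X_b, y_b = [[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0]

# Parameter style: one estimator object per spatial domain.
est_a = LagInInit(w_a).fit(X_a, y_a)
est_b = LagInInit(w_b).fit(X_b, y_b)

# Data style: a single estimator object refit on each domain.
shared = LagInFit()
shared.fit(X_a, y_a, w=w_a)
shared.fit(X_b, y_b, w=w_b)
```

In the first style the weights appear in `get_params()`, so they travel with the estimator's configuration; in the second, one object can be reused across domains, but `fit` takes an extra argument beyond X and y.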

In spreg, Eli Knaap and Luc Anselin discussed how PySAL and scikit-learn have generally different objectives in their modeling frameworks: the former is typically focused on inference and interpretation, while the latter is centered around prediction. These different use cases affect the design patterns of the two libraries, and as a result it may not make sense to mimic the design pattern from scikit-learn. Wilkinson formulas, on the other hand, were viewed on this thread as useful because they reflect how social scientists think about models, making the library accessible to a wider base of users.

For these reasons, I think it may be more useful to design a set of PySAL-specific base classes. While scikit-learn and multiple dispatch might not be great fits for the library, it still needs consistent standards (whether or not they conform to external rules). Creating a PySAL-specific framework would have several key advantages, like in-house maintainability and customizability. This week, my goals are to begin prototyping Wilkinson formulas (as they are universally popular) and to think of design constraints for a set of PySAL-specific base classes.

1. Initial Goals

Posted 26 June 2022

At our first meeting, my mentors and I discussed the options we have for implementing new interfaces to PySAL’s model and exploratory statistic classes. We discussed using multiple dispatch or ducktyping from scikit-learn and chose to begin with ducktyping scikit-learn, as multiple dispatch has been deemed non-Pythonic. While developing functionality for Wilkinson formulas is an ancillary goal of this project, we are keeping it in mind as we make interface edits, because those structural changes affect the formula implementation.

My first two goals are as follows:

  1. Choose an unsupervised statistic (Moran_Local) and a supervised statistic (GM_Lag) and implement them using the scikit-learn paradigm. This will give us two concrete, minimal working examples that we can share with package leads.
  2. Start issues in relevant packages to fill out the cells in the table below. We want to know what must be done for each package to switch to a scikit-learn interface and to extend it to Wilkinson formulas.
Package | Requirements for ducktyping scikit-learn | Requirements for extending to Wilkinson formulas

Finally, we aim to include automatic attribution tools in all the new interfaces so that users may easily generate the proper citations for the code they are using.