2  Data Sets

geosnap was created as a neighborhood analysis package, not a data API. However, many researchers studying neighborhoods need access to a common yet diverse set of data. These data are distributed by a variety of sources and at different geographic levels, making simple data collection and aggregation a major hurdle (and time sink) for neighborhood research (let alone reproducible research). For that reason, geosnap maintains a large database of neighborhood indicators and some helper functions for accessing other databases directly.

from geosnap import io as gio
from geosnap import visualize as gvz
from geosnap import DataStore
import geopandas as gpd
import matplotlib.pyplot as plt

%load_ext jupyter_black
%load_ext watermark
%watermark -a 'eli knaap' -iv
Author: eli knaap

geosnap  : 0.12.1.dev9+g3a1cb0f6de61.d20231212
geopandas: 0.14.1

2.1 The DataStore Class

The DataStore class is a quick way to access a large database of social, economic, and environmental variables tabulated at various geographic levels. If you are bringing your own data, or using geosnap outside of the U.S., feel free to skip this section; the DataStore class is not required to use the package. It is just a quick way to load commonly used data.

Toward that end, many of the datasets used in social science and public policy research in the U.S. are drawn from the same set of resources, like the Census, the Environmental Protection Agency (EPA), the Bureau of Labor Statistics (BLS), or the National Center for Education Statistics (NCES), to name a few. As researchers, we found ourselves writing the same code to download [versions of] the same data repeatedly (at different scales or time frames). That works OK in some cases, but it is also cumbersome, and can be extremely slow for even medium-sized datasets. While there are nice tools like cenpy or pygris, these tools cannot overcome a basic limitation of many government data providers, namely that they use old technology (like FTP) and outdated file formats (like shapefiles).

Thus, rather than repeatedly querying these slow servers, geosnap takes the position that it is preferable to store frequently used datasets in highly performant formats, because they are usually small and fast enough to store on disk. When not on disk, it makes sense to stream these datasets over S3, where they can be read very quickly (and directly). For that reason, geosnap maintains a public S3 bucket, thanks to the Amazon Open Data Registry, and uses quilt to manage data under the hood.

To get started using data, just instantiate the DataStore class:

datasets = DataStore()

In practice, a DataStore is just a thin wrapper around a set of known file paths. It takes the path to a directory where files are (or will be) stored, and loads files from there. This allows you to store datasets in a central location that many users can draw from in a shared user environment.
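To make the "thin wrapper" idea concrete, here is a minimal sketch of a class that resolves dataset names to files inside one directory. The class name `MiniStore`, the `KNOWN_FILES` mapping, the file name, and the `~/.mini_store` default are all hypothetical and chosen for illustration; this is not geosnap's actual implementation.

```python
from pathlib import Path
import tempfile


class MiniStore:
    """Hypothetical stand-in for a DataStore-like wrapper (not geosnap's code)."""

    # Assumed mapping from dataset name to file name; illustrative only.
    KNOWN_FILES = {"tracts_2010": "tracts_2010.parquet"}

    def __init__(self, data_dir=None):
        # Fall back to a per-user default directory when no path is supplied.
        if data_dir is None:
            data_dir = Path.home() / ".mini_store"
        self.data_dir = Path(data_dir)

    def path_for(self, name):
        # Resolve a registered dataset name to its expected location on disk.
        return self.data_dir / self.KNOWN_FILES[name]

    def is_stored(self, name):
        # True once the file has been saved into the store directory.
        return self.path_for(name).exists()


# A shared directory that several users could point their stores at.
shared_dir = Path(tempfile.mkdtemp())
store = MiniStore(shared_dir)
print(store.path_for("tracts_2010"))
```

Because the wrapper only holds a directory and a list of names, pointing several users at the same `shared_dir` is enough to share one copy of the data.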

If you don't specify a file path when instantiating a DataStore, it will default to the user data directory according to platformdirs. The show_data_dir method will print that directory, in case you want to delete the files by hand.

datasets.show_data_dir()

/Users/knaaptime/Library/Application Support/geosnap
'/Users/knaaptime/Library/Application Support/geosnap'

In general, the default data directory is recommended. Using the default path allows geosnap to handle data internally without the need to ever specify a path manually (when storing or reading). If you want to keep data in a shared directory for multiple users, or save it on an external or network storage drive, then you can specify a different path instead. In this case, geosnap expects that the datasets already exist there (i.e., that you have already saved them using the geosnap.io.store_* functions).

Once instantiated, the DataStore class is just a shell that makes datasets available as methods. This pattern makes it easy to tab-complete, so you do not need to remember what is available or how to access it. To list the datasets registered with geosnap, inspect the instance using the builtin dir function.
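As a rough sketch of that inspection step, the snippet below filters the output of `dir` down to public, callable attributes. The `DemoStore` class and the `list_datasets` helper are hypothetical stand-ins so the example runs without geosnap installed; its method names are illustrative, not the actual list of registered datasets.

```python
class DemoStore:
    """Hypothetical stand-in whose public methods represent datasets."""

    def __init__(self):
        self._data_dir = "/tmp/demo"  # private attribute, filtered out below

    def tracts_2010(self):
        ...

    def counties(self):
        ...

    def states(self):
        ...


def list_datasets(store):
    # dir() returns attribute names in sorted order; keep public callables only.
    return [
        name
        for name in dir(store)
        if not name.startswith("_") and callable(getattr(store, name))
    ]


available = list_datasets(DemoStore())
print(available)
```

Applied to a real DataStore instance, the same filter would presumably also return helper methods such as show_data_dir alongside the datasets.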


2.2 Demographic Data

Over the last decade, one of the most useful resources for understanding socioeconomic changes in U.S. neighborhoods over time has been the Longitudinal Tract Database (LTDB), which is used in countless studies of neighborhood change. One of the most recognized benefits of this dataset is that it standardizes boundaries for census geographic units over time, providing a consistent set of units for time-series analysis. An under-appreciated benefit of these data is the ease with which researchers have access to hundreds of useful intermediate variables computed from raw Census data (e.g. population rates by race and age). Unfortunately, the LTDB data is only available for a subset of Census releases (and only at the tract level), so geosnap includes tooling that computes the same variable set as LTDB, and it provides access to those data for every release of the 5-year ACS at both the tract and blockgroup levels (variables permitting). Note: following the Census convention, the 5-year releases are labelled by the terminal year.
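The terminal-year convention is simple arithmetic: a 5-year release labelled with year Y pools surveys from Y-4 through Y. The helper below is a hypothetical illustration of that rule and is not part of geosnap.

```python
def acs_5yr_span(terminal_year):
    """Return the (first, last) survey years pooled in a 5-year ACS release."""
    return terminal_year - 4, terminal_year


# The release labelled 2012 covers surveys collected from 2008 to 2012.
span_2012 = acs_5yr_span(2012)
print(span_2012)
```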

Unlike LTDB or NHGIS, these datasets are created using code that collects raw data from the Census FTP servers, computes intermediate variables according to the codebook, and saves the original geometries, and that code is fully open and reproducible. This means that all assumptions are exposed, and the full data processing pipeline is visible to any user or contributor to review, update, or send corrections. Geosnap’s approach to data provenance reflects our view that open and transparent analysis is always preferable to black boxes, and is a much better way to promote scientific discovery. Further, by formalizing the pipeline using code, the tooling is re-run to generate new datasets each time ACS or decennial Census data is released :)

To load a dataset, like the 2010 tract-level census data, just call it as a method, which returns a geodataframe (some datasets, like blocks or acs have additional arguments). Geometries are available for all of the commonly used administrative units, many of which have multiple time periods available:

  • metropolitan statistical areas (MSAs)
  • states
  • counties
  • tracts
  • blockgroups
  • blocks

There is also a table that maps MSAs to their constituent counties. Note that blockgroups are not exposed directly, but are accessible as part of the ACS and EJSCREEN data.

tracts = datasets.tracts_2010()

If the dataset exists in the DataStore path, it will be loaded from disk, otherwise it will be streamed from S3 (provided an internet connection is available).
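The disk-first, stream-second behavior can be sketched as a small fallback function. This is an illustration of the pattern only: `load_dataset` is a hypothetical name, the remote reader is a plain callable standing in for an S3 read, and a text file stands in for the real data format.

```python
from pathlib import Path
import tempfile


def load_dataset(local_path, read_local, read_remote):
    """Read from disk when the file exists, otherwise use the remote reader."""
    local_path = Path(local_path)
    if local_path.exists():
        return read_local(local_path)
    # Stand-in for streaming from remote storage such as S3.
    return read_remote()


cache_dir = Path(tempfile.mkdtemp())
target = cache_dir / "tracts_2010.txt"

# Nothing is stored yet, so the remote reader is used.
first = load_dataset(target, Path.read_text, lambda: "streamed")

# Once the file exists locally, the same call reads it from disk.
target.write_text("from disk")
second = load_dataset(target, Path.read_text, lambda: "streamed")

print(first, "|", second)
```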

tracts.head()

geoid n_asian_under_15 n_black_under_15 n_hispanic_under_15 n_native_under_15 n_white_under_15 n_persons_under_18 n_asian_over_60 n_black_over_60 n_hispanic_over_60 ... year n_total_housing_units_sample p_nonhisp_white_persons p_white_over_60 p_black_over_60 p_hispanic_over_60 p_native_over_60 p_asian_over_60 p_disabled geometry
0 02020000701 65.0 204.0 440.0 249.0 91.0 1739.0 NaN NaN NaN ... 2010 2190.0 37.785941 NaN NaN NaN NaN NaN NaN POLYGON ((-149.77843 61.22629, -149.77677 61.2...
1 02020001701 465.0 250.0 86.0 249.0 413.0 1941.0 NaN NaN NaN ... 2010 2829.0 50.446112 NaN NaN NaN NaN NaN NaN POLYGON ((-149.77837 61.19525, -149.76341 61.1...
2 02020002811 102.0 0.0 89.0 159.0 639.0 1592.0 NaN NaN NaN ... 2010 2703.0 58.986692 NaN NaN NaN NaN NaN NaN POLYGON ((-149.85668 61.13753, -149.84760 61.1...
3 02050000100 0.0 0.0 11.0 3079.0 45.0 3812.0 NaN NaN NaN ... 2010 2774.0 2.863413 NaN NaN NaN NaN NaN NaN MULTIPOLYGON (((-161.67073 58.56075, -161.6672...
4 02090001700 0.0 0.0 0.0 0.0 309.0 342.0 NaN NaN NaN ... 2010 1253.0 95.342930 NaN NaN NaN NaN NaN NaN POLYGON ((-147.31748 64.69094, -147.30114 64.6...

5 rows × 194 columns

This makes it very fast to go from zero to data visualization:

tracts[tracts.geoid.str.startswith("06")][["median_home_value", "geometry"]].explore(
    tiles="CartoDB Positron",
)
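The startswith("06") filter in the map cell works because census GEOIDs are hierarchical: an 11-digit tract GEOID is a 2-digit state FIPS code, followed by a 3-digit county code and a 6-digit tract code, and 06 is the state code for California. The sketch below uses a hypothetical helper, `split_tract_geoid`, to show that structure in plain Python; the California GEOID is an illustrative value and does not come from the output above.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract codes."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}


# First GEOID from the table above: state 02 is Alaska.
parts = split_tract_geoid("02020000701")

# Selecting one state reduces to a prefix match on the GEOID string.
sample = ["06037101110", "02020000701"]  # first value is illustrative
in_california = [g for g in sample if g.startswith("06")]

print(parts, in_california)
```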