geosnap.analyze.cluster

geosnap.analyze.cluster(gdf, n_clusters=6, method=None, best_model=False, columns=None, verbose=False, temporal_index='year', unit_index='geoid', scaler='std', pooling='fixed', random_state=None, cluster_kwargs=None, model_colname=None, return_model=False)

Create a geodemographic typology by running a cluster analysis on the study area’s neighborhood attributes.

Parameters:
gdf : geopandas.GeoDataFrame, required

long-form GeoDataFrame containing neighborhood attributes

n_clusters : int, required

the number of clusters to model (the default is 6).

method : str in [‘kmeans’, ‘ward’, ‘affinity_propagation’, ‘spectral’, ‘gaussian_mixture’, ‘hdbscan’], required

the clustering algorithm used to identify neighborhood types

best_model : bool, optional

if using a gaussian mixture model, use BIC to choose the best n_clusters (the default is False).

columns : list-like, required

subset of columns on which to apply the clustering

verbose : bool, optional

whether to print warning messages (the default is False).

temporal_index : str, required

which column on the dataframe defines time and/or sequencing of the long-form data (the default is “year”).

unit_index : str, required

which column on the long-form dataframe identifies the stable units over time. In a wide-form dataset, this would be the unique index (the default is “geoid”).

scaler : None or scaler from sklearn.preprocessing, optional

a scikit-learn preprocessing class that will be used to rescale the data. Defaults to sklearn.preprocessing.StandardScaler.

pooling : [“fixed”, “pooled”, “unique”], optional (default=’fixed’)

How to treat temporal data when applying scaling. Options include:

  • fixed : scaling is fixed to each time period

  • pooled : data are pooled across all time periods

  • unique : if scaling, apply the scaler to each time period, then generate clusters unique to each time period.
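The difference between per-period and pooled scaling can be illustrated with a minimal sketch. It uses only the standard library and is not geosnap's implementation. The records and the `zscores` helper are hypothetical, and it assumes the reading that “fixed” standardizes each time period separately while “pooled” standardizes all periods together.

```python
from statistics import mean, pstdev

# Hypothetical long-form records: (year, geoid, attribute value)
rows = [
    (2000, "a", 10.0),
    (2000, "b", 20.0),
    (2010, "a", 30.0),
    (2010, "b", 50.0),
]

def zscores(values):
    """Standardize values to zero mean and unit population variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# "pooled": one mean and standard deviation computed across all years
pooled = zscores([value for _, _, value in rows])

# "fixed": a separate mean and standard deviation for each year
fixed = []
for year in sorted({year for year, _, _ in rows}):
    fixed.extend(zscores([v for y, _, v in rows if y == year]))

print(fixed)
print(pooled)
```

With per-period scaling, each unit is measured relative to its contemporaries, so a general upward drift in the attribute between periods does not by itself change a unit's scaled value. With pooled scaling that drift is preserved.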

cluster_kwargs: dict

additional keyword arguments passed to the clustering instance

model_colname : str

column name for storing cluster labels on the output dataframe. If no name is provided, the column will be named after the clustering method. If there is already a column named after the clustering method, the name will be incremented with a number.
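The naming rule can be pictured with a small sketch. The helper `next_label_column` is hypothetical, and the exact suffix format (for example `kmeans1`) is an assumption, since the description only says that a number is appended.

```python
def next_label_column(existing_columns, method):
    """Return a label column name that does not collide with existing columns.

    Assumed scheme: use the method name, or append the first free integer.
    """
    if method not in existing_columns:
        return method
    suffix = 1
    while f"{method}{suffix}" in existing_columns:
        suffix += 1
    return f"{method}{suffix}"

print(next_label_column(["geoid", "year"], "kmeans"))
print(next_label_column(["geoid", "year", "kmeans"], "kmeans"))
```

Passing an explicit model_colname avoids depending on whatever suffix the library generates.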

return_model: bool

if True, return the clustering model for further inspection (default is False)

Returns:
gdf : geopandas.GeoDataFrame

GeoDataFrame with neighborhood cluster labels appended as a new column (model_colname). If model_colname already exists as a column on the DataFrame, the column name will be incremented.

model : named tuple

A tuple with attributes X, columns, labels, instance, W, which store the input matrix, column labels, cluster labels, fitted model instance, and spatial weights matrix. Returned only if return_model is True.
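A call might look like the following sketch, based only on the signature and descriptions above. The wrapper `build_typology` and the attribute column names are hypothetical, and it assumes that with return_model=True the two documented return values come back together as a pair.

```python
def build_typology(gdf):
    """Cluster a long-form GeoDataFrame of neighborhood attributes.

    Assumes `gdf` has "year" and "geoid" columns plus the hypothetical
    attribute columns listed below.
    """
    # Imported here so the sketch can be defined without geosnap installed.
    from geosnap.analyze import cluster

    labeled, model = cluster(
        gdf,
        n_clusters=6,
        method="kmeans",
        columns=["median_income", "pct_poverty"],  # hypothetical column names
        temporal_index="year",
        unit_index="geoid",
        pooling="fixed",
        random_state=0,
        return_model=True,  # assumed to yield (gdf, model)
    )
    return labeled, model
```

With no model_colname given, the labels would land in a column named after the method, here `kmeans`, and `model.instance` would hold the fitted estimator for further inspection.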