geosnap.analyze.find_k

geosnap.analyze.find_k(gdf, method=None, columns=None, temporal_index='year', unit_index='geoid', scaler='std', pooling='fixed', random_state=None, cluster_kwargs=None, min_k=2, max_k=10, return_table=False)

Brute-force search through cluster fit metrics to determine the optimal number of clusters, k

Parameters:

gdf : geopandas.GeoDataFrame, required

long-form GeoDataFrame containing neighborhood attributes

method : str in ['kmeans', 'ward', 'spectral', 'gaussian_mixture'], required

the clustering algorithm used to identify neighborhood types

columns : list-like, required

subset of columns on which to apply the clustering

temporal_index : str, optional

the column on the dataframe that defines time and/or sequencing of the long-form data. Default is 'year'

unit_index : str, optional

the column on the long-form dataframe that identifies the stable units over time. In a wide-form dataset, this would be the unique index. Default is 'geoid'

scaler : None or scaler from sklearn.preprocessing, optional

a scikit-learn preprocessing class that will be used to rescale the data. Defaults to sklearn.preprocessing.StandardScaler

pooling : ['fixed', 'pooled', 'unique'], optional (default='fixed')

How to treat temporal data when applying scaling. Options include:

fixed : scaling is fixed to each time period
pooled : data are pooled across all time periods
unique : if scaling, apply the scaler to each time period, then generate clusters unique to each time period.
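The difference between per-period and pooled scaling can be sketched with the standard library alone. This is a minimal illustration, not geosnap's implementation: it assumes 'fixed' means each time period is standardized on its own and 'pooled' means all periods are standardized together, and the records, `zscore`, `scale_fixed`, and `scale_pooled` are hypothetical names and toy values invented for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Toy long-form records: (unit, year, value). Values are invented for illustration.
records = [
    ("a", 2000, 10.0), ("b", 2000, 20.0), ("c", 2000, 30.0),
    ("a", 2010, 40.0), ("b", 2010, 50.0), ("c", 2010, 60.0),
]


def zscore(values):
    """Standardize to mean 0 and unit population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def scale_fixed(rows):
    """Assumed 'fixed' behavior: rescale each time period separately."""
    positions_by_year = defaultdict(list)
    for pos, (_, year, _) in enumerate(rows):
        positions_by_year[year].append(pos)
    scaled = [None] * len(rows)
    for positions in positions_by_year.values():
        period_values = [rows[pos][2] for pos in positions]
        for pos, z in zip(positions, zscore(period_values)):
            scaled[pos] = z
    return scaled


def scale_pooled(rows):
    """Assumed 'pooled' behavior: rescale all time periods together."""
    return zscore([value for _, _, value in rows])


fixed = scale_fixed(records)
pooled = scale_pooled(records)

for (unit, year, _), f, p in zip(records, fixed, pooled):
    print(unit, year, round(f, 3), round(p, 3))
```

Under per-period scaling, unit "a" gets the same score in both years because it is the lowest value in each period; under pooled scaling its score rises, because growth between periods is preserved.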

cluster_kwargs : dict, optional

additional keyword arguments passed to the clustering algorithm

min_k : int, optional

minimum number of clusters to test, by default 2

max_k : int, optional

maximum number of clusters to test, by default 10

return_table : bool, optional

if True, return the table of fit metrics for each combination of k and cluster method, by default False

Returns:

pandas.DataFrame

if return_table==False (default), returns a pandas dataframe with a single column that holds the optimal number of clusters according to each fit metric (row index).

if return_table==True, returns a table of fit coefficients for each k between min_k and max_k
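The general idea of a brute-force search over k, and the relation between the full metric table and the single best value, can be sketched in pure Python. This is a hedged illustration under stated assumptions, not geosnap's code: it substitutes a toy 1-D k-means with deterministic farthest-point initialization for the library's clustering methods, and it assumes the silhouette coefficient as the only fit metric, since the text above does not say which metrics are computed. The function names and the sample points are hypothetical.

```python
from statistics import mean


def kmeans_1d(points, k, n_iter=50):
    """Toy 1-D k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = mean(members)
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(abs(p - q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


def find_best_k(points, min_k=2, max_k=10):
    """Score every k in [min_k, max_k]; return the best k and the full table."""
    table = {k: silhouette(points, kmeans_1d(points, k)) for k in range(min_k, max_k + 1)}
    best_k = max(table, key=table.get)  # a higher silhouette is better
    return best_k, table


# Hypothetical sample: three tight, well-separated groups of values.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
best_k, table = find_best_k(points, min_k=2, max_k=4)
print(best_k)
for k, score in table.items():
    print(k, round(score, 3))
```

Here `table` plays the role of the output with return_table=True (one score per k), while `best_k` corresponds to the summary returned by default. With several metrics, the same selection would be repeated per metric, taking the maximum or minimum depending on whether higher or lower values indicate a better fit.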