geosnap.analyze.find_k
- geosnap.analyze.find_k(gdf, method=None, columns=None, temporal_index='year', unit_index='geoid', scaler='std', pooling='fixed', random_state=None, cluster_kwargs=None, min_k=2, max_k=10, return_table=False)
Brute-force search through cluster fit metrics to determine the optimal number of clusters, k
- Parameters:
- gdf : geopandas.GeoDataFrame, required
long-form GeoDataFrame containing neighborhood attributes
- method : str in ['kmeans', 'ward', 'spectral', 'gaussian_mixture'], required
the clustering algorithm used to identify neighborhood types
- columns : list-like, required
subset of columns on which to apply the clustering
- temporal_index : str, optional
which column on the dataframe defines time and/or sequencing of the long-form data. Default is "year"
- unit_index : str, optional
which column on the long-form dataframe identifies the stable units over time. In a wide-form dataset, this would be the unique index. Default is "geoid"
- scaler : None or scaler from sklearn.preprocessing, optional
a scikit-learn preprocessing class that will be used to rescale the data. Defaults to sklearn.preprocessing.StandardScaler
- pooling : ["fixed", "pooled", "unique"], optional (default='fixed')
How to treat temporal data when applying scaling. Options include:
fixed : scaling is fixed to each time period
pooled : data are pooled across all time periods
unique : if scaling, apply the scaler to each time period, then generate clusters unique to each time period.
- cluster_kwargs : dict, optional
additional keyword arguments passed to the clustering algorithm
- min_k : int, optional
minimum number of clusters to test, by default 2
- max_k : int, optional
maximum number of clusters to test, by default 10
- return_table : bool, optional
if True, return the table of fit metrics for each combination of k and cluster method, by default False
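The difference between the "fixed" and "pooled" options for pooling can be illustrated with a minimal, standard-library sketch. It assumes that scaling means z-score standardization with the population standard deviation, as StandardScaler does. The records and the `zscores` helper are hypothetical and are not part of geosnap.

```python
from statistics import mean, pstdev

# Toy long-form records as (year, value) pairs, sorted by year.
# The values are made up for illustration only.
records = [(2000, 1.0), (2000, 3.0), (2010, 11.0), (2010, 13.0)]


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# "pooled": a single mean and standard deviation across all time periods.
pooled = zscores([v for _, v in records])

# "fixed": a separate mean and standard deviation within each time period.
fixed = []
for year in sorted({y for y, _ in records}):
    fixed.extend(zscores([v for y, v in records if y == year]))

print("pooled:", pooled)
print("fixed: ", fixed)
```

With pooled scaling the 2010 values stay far above the 2000 values, while fixed scaling removes the between-period shift so that each year is centered on zero.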
- Returns:
pandas.DataFrame
if return_table==False (default), returns a pandas dataframe with a single column that holds the optimal number of clusters according to each fit metric (row index).
if return_table==True, returns a table of fit coefficients for each k between min_k and max_k
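How the per-k table of fit metrics relates to the one-row-per-metric summary can be sketched in plain Python. This is an illustration under assumptions, not geosnap's implementation. The metric names are ones scikit-learn provides, but this page does not say which metrics find_k computes. The scores are made-up toy values and `best_k_per_metric` is a hypothetical helper.

```python
# Toy fit-metric table: one entry per candidate k, one score per metric.
# All scores are made-up illustrative values, not output from geosnap.
fit_table = {
    2: {"silhouette": 0.41, "calinski_harabasz": 310.0, "davies_bouldin": 1.10},
    3: {"silhouette": 0.47, "calinski_harabasz": 355.0, "davies_bouldin": 0.95},
    4: {"silhouette": 0.44, "calinski_harabasz": 372.0, "davies_bouldin": 0.88},
    5: {"silhouette": 0.39, "calinski_harabasz": 340.0, "davies_bouldin": 0.97},
}

# Whether a larger score means a better fit (assumed metric semantics:
# silhouette and Calinski-Harabasz are maximized, Davies-Bouldin minimized).
HIGHER_IS_BETTER = {
    "silhouette": True,
    "calinski_harabasz": True,
    "davies_bouldin": False,
}


def best_k_per_metric(table, higher_is_better):
    """Scan every candidate k and return {metric: best k}."""
    best = {}
    for metric, maximize in higher_is_better.items():
        scores = {k: row[metric] for k, row in table.items()}
        pick = max if maximize else min
        best[metric] = pick(scores, key=scores.get)
    return best


summary = best_k_per_metric(fit_table, HIGHER_IS_BETTER)
print(summary)
```

Different metrics can disagree on the best k, which is why the summary reports one value per metric. Based on the signature above, an actual call would take the form `find_k(gdf, method="kmeans", columns=cols, min_k=2, max_k=10)`, where `gdf` and `cols` stand in for your own long-form GeoDataFrame and column list.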