geosnap.analyze.cluster
- geosnap.analyze.cluster(gdf, n_clusters=6, method=None, best_model=False, columns=None, verbose=False, temporal_index='year', unit_index='geoid', scaler='std', pooling='fixed', random_state=None, cluster_kwargs=None, model_colname=None, return_model=False)
Create a geodemographic typology by running a cluster analysis on the study area’s neighborhood attributes.
- Parameters:
- gdf: geopandas.GeoDataFrame, required
long-form GeoDataFrame containing neighborhood attributes
- n_clusters: int, required
the number of clusters to model (the default is 6).
- method: str in ['kmeans', 'ward', 'affinity_propagation', 'spectral', 'gaussian_mixture', 'hdbscan'], required
the clustering algorithm used to identify neighborhood types
- best_model: bool, optional
if using a Gaussian mixture model, use BIC to choose the best n_clusters (the default is False).
- columns: list-like, required
subset of columns on which to apply the clustering
- verbose: bool, optional
whether to print warning messages (the default is False).
- temporal_index: str, required
which column on the dataframe defines time and/or sequencing of the long-form data (the default is "year").
- unit_index: str, required
which column on the long-form dataframe identifies the stable units over time. In a wide-form dataset, this would be the unique index (the default is "geoid").
- scaler: None or scaler from sklearn.preprocessing, optional
a scikit-learn preprocessing class that will be used to rescale the data. Defaults to sklearn.preprocessing.StandardScaler
- pooling: ["fixed", "pooled", "unique"], optional (default='fixed')
How to treat temporal data when applying scaling. Options include:
fixed : scaling is fixed to each time period
pooled : data are pooled across all time periods
unique : if scaling, apply the scaler to each time period, then generate clusters unique to each time period.
- cluster_kwargs: dict
additional keyword arguments passed to the clustering instance
- model_colname: str
column name for storing cluster labels on the output dataframe. If no name is provided, the column will be named after the clustering method. If there is already a column named after the clustering method, the name will be incremented with a number.
- return_model: bool
if True, return the clustering model for further inspection (default is False)
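The difference between the "fixed" and "pooled" options can be illustrated with a small standalone sketch. This is not geosnap's implementation: the records, the `zscores` helper, and the two `scale_*` functions are hypothetical, and the standardization mimics what a standard scaler does (zero mean, unit population variance) using only the Python standard library. The "unique" option would scale per period in the same way as "fixed" here, but then fit a separate cluster model for each period.

```python
from statistics import mean, pstdev

# Hypothetical long-form records: (year, geoid, attribute value)
rows = [
    (2000, "a", 10.0),
    (2000, "b", 20.0),
    (2010, "a", 30.0),
    (2010, "b", 50.0),
]


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def scale_fixed(rows):
    """Assumed reading of "fixed": standardize each time period on its own."""
    out = {}
    for year in sorted({r[0] for r in rows}):
        subset = [r for r in rows if r[0] == year]
        scaled = zscores([r[2] for r in subset])
        for (yr, geoid, _), z in zip(subset, scaled):
            out[(yr, geoid)] = z
    return out


def scale_pooled(rows):
    """Assumed reading of "pooled": standardize all periods together."""
    scaled = zscores([r[2] for r in rows])
    return {(yr, geoid): z for (yr, geoid, _), z in zip(rows, scaled)}


fixed = scale_fixed(rows)
pooled = scale_pooled(rows)

for key in sorted(fixed):
    print(key, round(fixed[key], 3), round(pooled[key], 3))
```

With per-period scaling, unit "b" gets the same standardized value in both years, because it sits the same distance above each year's mean. With pooled scaling, the growth between 2000 and 2010 is preserved in the scaled values, so later observations sit higher than earlier ones.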
- Returns:
- gdf: geopandas.GeoDataFrame
GeoDataFrame with neighborhood cluster labels appended as a new column named model_colname. If model_colname already exists as a column on the DataFrame, the new column's name will be incremented.
- model: named tuple
A named tuple with attributes X, columns, labels, instance, and W, which store the input matrix, column names, cluster labels, fitted model instance, and spatial weights matrix, respectively. Returned only if return_model is True.
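The column-naming rule described for model_colname can be sketched as follows. The exact suffix scheme geosnap uses is not stated in this reference, so the helper below (`next_colname`, a hypothetical name) shows only one plausible interpretation: keep the base name when it is free, otherwise append the smallest unused integer.

```python
def next_colname(existing, base):
    """Return `base` if unused, else `base` plus the smallest unused integer.

    Assumption: this suffix scheme is illustrative only and may differ from
    the one geosnap applies internally.
    """
    if base not in existing:
        return base
    n = 1
    while f"{base}{n}" in existing:
        n += 1
    return f"{base}{n}"


# A first run with method='kmeans' and no model_colname keeps the method name.
print(next_colname(["geoid", "year"], "kmeans"))
# A later run on a frame that already has that column gets a numbered name.
print(next_colname(["geoid", "year", "kmeans"], "kmeans"))
```

Passing an explicit model_colname avoids relying on the automatic numbering when several cluster solutions are stored on the same dataframe.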