geosnap.analyze.sequence
- geosnap.analyze.sequence(gdf, cluster_col, seq_clusters=5, subs_mat=None, dist_type=None, indel=None, temporal_index='year', unit_index='geoid')
Pairwise sequence analysis and sequence clustering.
Dynamic programming is used when distances are computed by optimal matching.
- Parameters:
- gdf
geopandas.GeoDataFrame or pandas.DataFrame
Long-form geopandas.GeoDataFrame or pandas.DataFrame containing neighborhood attributes with a column defining neighborhood clusters.
- cluster_col
str or int
Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.
- seq_clusters
int, optional
Number of neighborhood sequence clusters. Sequences are clustered with agglomerative clustering using Ward linkage. Default is 5.
- dist_type
str
Distance type used to compare sequences. One of:
  - “hamming”: Hamming distance (substitution only, with a constant cost of 1), from sklearn.metrics.
  - “markov”: substitution costs are defined from empirical transition probabilities.
  - “interval”: substitution costs are defined by the differences between states, with indel=k-1.
  - “arbitrary”: arbitrary distance for cases without strong theoretical guidance: substitution=0.5, indel=1.
  - “tran”: transition-oriented optimal matching, applied to the sequence of transitions. Based on [Bie11].
- subs_mat
array
(k, k) substitution cost matrix. It should be hollow (0 cost between the same type), symmetric, and non-negative.
- indel
float, optional
Insertion/deletion cost.
- temporal_index
str, optional
Column defining time and/or sequencing of the long-form data. Default is “year”.
- unit_index
str, optional
Column identifying the unique id of spatial units. Default is “geoid”.
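The roles of subs_mat and indel can be illustrated with a minimal sketch of an optimal matching distance computed by dynamic programming. This is not geosnap's implementation: the helper names optimal_matching_distance and is_valid_subs_mat are hypothetical, states are assumed to be encoded as integers 0..k-1, and the costs follow the “arbitrary” setting described above (substitution=0.5, indel=1).

```python
def optimal_matching_distance(seq_a, seq_b, subs_mat, indel):
    """Edit distance between two state sequences via dynamic programming.

    subs_mat[i][j] is the cost of substituting state i with state j;
    indel is the cost of a single insertion or deletion.
    (Hypothetical helper for illustration only.)
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest way to turn seq_a[:i] into seq_b[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel  # delete the first i states
    for j in range(1, m + 1):
        cost[0][j] = j * indel  # insert the first j states
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + indel,  # deletion
                cost[i][j - 1] + indel,  # insertion
                cost[i - 1][j - 1] + subs_mat[seq_a[i - 1]][seq_b[j - 1]],  # substitution
            )
    return cost[n][m]


def is_valid_subs_mat(subs_mat):
    """Check that the matrix is square, hollow, symmetric, and non-negative."""
    k = len(subs_mat)
    if any(len(row) != k for row in subs_mat):
        return False
    return all(
        subs_mat[i][i] == 0
        and subs_mat[i][j] == subs_mat[j][i]
        and subs_mat[i][j] >= 0
        for i in range(k)
        for j in range(k)
    )


# "arbitrary"-style costs: substitution=0.5 between different types, indel=1
k = 3
subs_mat = [[0.0 if i == j else 0.5 for j in range(k)] for i in range(k)]
assert is_valid_subs_mat(subs_mat)

# The sequences differ in one position, so a single substitution is cheapest
print(optimal_matching_distance([0, 1, 2, 2], [0, 2, 2, 2], subs_mat, indel=1.0))  # 0.5
```

With these costs a substitution (0.5) is always cheaper than a deletion followed by an insertion (2.0), so equal-length sequences are aligned position by position.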
- Returns:
- gdf_temp
geopandas.GeoDataFrame or pandas.DataFrame
geopandas.GeoDataFrame or pandas.DataFrame with a new column for sequence labels.
- df_wide
pandas.DataFrame
Wide-form DataFrame with k (k is the number of periods) columns of neighborhood types and 1 column of sequence labels.
- seq_dis_mat
array
(n, n) distance/dissimilarity matrix for each pair of sequences.
Examples
>>> from geosnap.data import Community
>>> from geosnap.analyze import sequence
>>> columbus = Community.from_ltdb(msa_fips="18140")
>>> columbus1 = columbus.cluster(columns=['median_household_income',
...     'p_poverty_rate', 'p_edu_college_greater', 'p_unemployment_rate'],
...     method='ward', n_clusters=6)
>>> gdf = columbus1.gdf
>>> gdf_new, df_wide, seq_hamming = sequence(gdf, cluster_col='ward', dist_type="hamming")
>>> seq_hamming[:5, :5]
array([[0., 3., 4., 5., 5.],
       [3., 0., 3., 3., 3.],
       [4., 3., 0., 2., 2.],
       [5., 3., 2., 0., 0.],
       [5., 3., 2., 0., 0.]])
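To make the shape of the returned distance matrix concrete, the following sketch reshapes a small long-form table into one sequence per unit and computes pairwise Hamming distances as counts of differing positions. It uses only the standard library rather than geosnap, and the records, unit ids, and the hamming helper are hypothetical toy values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical long-form records: (unit id, year, neighborhood type)
records = [
    ("a", 1990, 0), ("a", 2000, 0), ("a", 2010, 1),
    ("b", 1990, 0), ("b", 2000, 1), ("b", 2010, 1),
    ("c", 1990, 2), ("c", 2000, 2), ("c", 2010, 2),
]

# Long to wide: one sequence of neighborhood types per unit, ordered by year
by_unit = defaultdict(dict)
for unit, year, label in records:
    by_unit[unit][year] = label
units = sorted(by_unit)
sequences = [[by_unit[u][y] for y in sorted(by_unit[u])] for u in units]


def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(seq_a, seq_b))


# (n, n) dissimilarity matrix: symmetric, with zeros on the diagonal
seq_dis_mat = [[float(hamming(s, t)) for t in sequences] for s in sequences]
for row in seq_dis_mat:
    print(row)
# [0.0, 1.0, 3.0]
# [1.0, 0.0, 3.0]
# [3.0, 3.0, 0.0]
```

A distance of 0 between two different units, as in the last two rows of the Columbus output, means the units followed identical sequences of neighborhood types.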