geosnap.analyze.sequence

geosnap.analyze.sequence(gdf, cluster_col, seq_clusters=5, subs_mat=None, dist_type=None, indel=None, temporal_index='year', unit_index='geoid')

Pairwise sequence analysis and sequence clustering.

Dynamic programming is used when distances are computed by optimal matching.
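As a rough illustration of what optimal matching by dynamic programming involves, the sketch below computes an edit distance between two state sequences from a substitution cost matrix and an indel cost. It is not geosnap's internal implementation; the helper name `optimal_matching_distance` and the toy sequences are hypothetical, and the costs follow the "arbitrary" setting described under Parameters (substitution=0.5, indel=1).

```python
import numpy as np


def optimal_matching_distance(seq_a, seq_b, subs_mat, indel):
    """Edit distance between two state sequences (hypothetical helper).

    subs_mat[i, j] is the cost of substituting state i with state j;
    indel is the cost of a single insertion or deletion.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i, j] holds the cheapest way to turn seq_a[:i] into seq_b[:j]
    cost = np.zeros((n + 1, m + 1))
    cost[:, 0] = np.arange(n + 1) * indel  # delete every element of seq_a[:i]
    cost[0, :] = np.arange(m + 1) * indel  # insert every element of seq_b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = min(
                cost[i - 1, j] + indel,  # deletion
                cost[i, j - 1] + indel,  # insertion
                cost[i - 1, j - 1] + subs_mat[seq_a[i - 1], seq_b[j - 1]],  # substitution (free on a match)
            )
    return cost[n, m]


# Three neighborhood types with constant substitution cost 0.5 and indel 1
k = 3
subs_mat = 0.5 * (1 - np.eye(k))
d = optimal_matching_distance([0, 1, 2, 2], [0, 2, 2, 2], subs_mat, indel=1.0)
print(d)  # 0.5: a single substitution in the second period
```

The table has (n+1) x (m+1) cells, so each pair of sequences costs O(nm) time; computing all pairs is what produces the (n, n) distance matrix described under Returns.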

Parameters:

gdf : geopandas.GeoDataFrame or pandas.DataFrame
    Long-form geopandas.GeoDataFrame or pandas.DataFrame containing neighborhood attributes, with a column defining neighborhood clusters.
cluster_col : str or int
    Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.
seq_clusters : int, optional
    Number of neighborhood sequence clusters. Sequences are clustered using agglomerative clustering with Ward linkage. Default is 5.
dist_type : str
    “hamming”: Hamming distance (substitution only, at a constant cost of 1), from sklearn.metrics;
    “markov”: empirical transition probabilities are used to define substitution costs;
    “interval”: differences between states are used to define substitution costs, and indel=k-1;
    “arbitrary”: arbitrary costs for use when there is no strong theoretical guidance: substitution=0.5, indel=1;
    “tran”: transition-oriented optimal matching, which compares sequences of transitions. Based on [Bie11].
subs_mat : array
    Substitution cost matrix of shape (k, k). Should be hollow (zero cost between identical types), symmetric, and non-negative.
indel : float, optional
    Insertion/deletion cost.
temporal_index : str, optional
    Column defining the time and/or ordering of the long-form data. Default is “year”.
unit_index : str, optional
    Column identifying the unique id of spatial units. Default is “geoid”.
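The three requirements on a user-supplied substitution cost matrix (hollow, symmetric, non-negative) can be checked before calling the function. The sketch below is a minimal example; `is_valid_subs_mat` is a hypothetical helper, not part of geosnap. The `interval` matrix is one plausible reading of the “interval” option, assuming states are labeled 0..k-1 and the cost is the absolute difference between labels, so that the largest substitution cost equals k-1.

```python
import numpy as np


def is_valid_subs_mat(subs_mat):
    """Return True if the matrix is square, hollow, symmetric and non-negative.

    Hypothetical helper for illustration; not part of geosnap.
    """
    m = np.asarray(subs_mat, dtype=float)
    return bool(
        m.ndim == 2
        and m.shape[0] == m.shape[1]
        and np.all(np.diag(m) == 0)  # hollow: zero cost between identical types
        and np.allclose(m, m.T)      # symmetric
        and np.all(m >= 0)           # non-negative
    )


# Assumed "interval"-style costs for k = 3 states labeled 0, 1, 2
k = 3
states = np.arange(k)
interval = np.abs(states[:, None] - states[None, :])

asymmetric = np.array([[0, 1], [2, 0]])   # fails the symmetry check
not_hollow = np.array([[1, 1], [1, 0]])   # fails the hollow check

print(is_valid_subs_mat(interval))    # True
print(is_valid_subs_mat(asymmetric))  # False
print(is_valid_subs_mat(not_hollow))  # False
```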

Returns:

gdf_temp : geopandas.GeoDataFrame or pandas.DataFrame
    geopandas.GeoDataFrame or pandas.DataFrame with a new column for sequence labels.
df_wide : pandas.DataFrame
    Wide-form DataFrame with k columns of neighborhood types (where k here is the number of periods) and 1 column of sequence labels.
seq_dis_mat : array
    Distance/dissimilarity matrix of shape (n, n), with one entry for each pair of sequences.
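To make the shapes of the wide-form table and the (n, n) matrix concrete, the sketch below pivots a toy long-form table and counts, for each pair of units, the periods in which their neighborhood types differ (a Hamming-style count). The data are made up, the column names follow the defaults above, and this plain pandas/NumPy version is only an illustration, not the code path geosnap uses.

```python
import numpy as np
import pandas as pd

# Toy long-form table: 3 spatial units observed in 3 periods (made-up values)
long_df = pd.DataFrame({
    "geoid": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "year": [1990, 2000, 2010] * 3,
    "ward": [0, 1, 1, 0, 1, 2, 2, 2, 2],
})

# Wide form: one row per unit, one column of neighborhood types per period
wide = long_df.pivot(index="geoid", columns="year", values="ward")

# Pairwise count of periods in which two units have different types
states = wide.to_numpy()
dis_mat = (states[:, None, :] != states[None, :, :]).sum(axis=2).astype(float)

print(wide.shape)  # (3, 3)
print(dis_mat)
# [[0. 1. 3.]
#  [1. 0. 2.]
#  [3. 2. 0.]]
```

The resulting matrix is hollow and symmetric, like the excerpt shown in the Examples section.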

Examples

>>> from geosnap.data import Community
>>> columbus = Community.from_ltdb(msa_fips="18140")
>>> columbus1 = columbus.cluster(columns=['median_household_income',
... 'p_poverty_rate', 'p_edu_college_greater', 'p_unemployment_rate'],
... method='ward', n_clusters=6)
>>> gdf = columbus1.gdf
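>>> from geosnap.analyze import sequence
>>> # cluster_col='ward' assumes the labels from the clustering step above are stored in a "ward" column
>>> gdf_new, df_wide, seq_hamming = sequence(gdf, cluster_col='ward', dist_type="hamming")
>>> seq_hamming[:5, :5]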
array([[0., 3., 4., 5., 5.],
       [3., 0., 3., 3., 3.],
       [4., 3., 0., 2., 2.],
       [5., 3., 2., 0., 0.],
       [5., 3., 2., 0., 0.]])