geosnap.analyze.sequence

geosnap.analyze.sequence(gdf, cluster_col, seq_clusters=5, subs_mat=None, dist_type=None, indel=None, temporal_index='year', unit_index='geoid')

Pairwise sequence analysis and sequence clustering.

Dynamic programming is used when the distance is based on optimal matching.
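The dynamic programming step can be illustrated with a minimal sketch of an optimal matching distance between two state sequences. This is not the library's own implementation: `om_distance` is a hypothetical helper written for illustration, and it assumes states are integer codes 0..k-1 that index a (k,k) substitution cost matrix, with a single insertion/deletion cost.

```python
def om_distance(seq_a, seq_b, subs_mat, indel):
    """Optimal matching distance between two state sequences (illustrative).

    States are assumed to be integer codes 0..k-1 indexing into subs_mat.
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = cheapest cost of turning seq_a[:i] into seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel  # delete the first i states of seq_a
    for j in range(1, m + 1):
        dp[0][j] = j * indel  # insert the first j states of seq_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + indel,  # deletion
                dp[i][j - 1] + indel,  # insertion
                dp[i - 1][j - 1] + subs_mat[seq_a[i - 1]][seq_b[j - 1]],  # substitution
            )
    return dp[n][m]


# Costs in the style of the "arbitrary" option for k = 3 states:
# substitution = 0.5, indel = 1
subs = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
d = om_distance([0, 1, 2, 2], [0, 2, 2, 1], subs, indel=1.0)
```

Here the two sequences differ in two positions, and two substitutions (0.5 each) are cheaper than any alignment that uses an insertion plus a deletion, so `d` is 1.0.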

Parameters:
gdf : geopandas.GeoDataFrame or pandas.DataFrame

Long-form geopandas.GeoDataFrame or pandas.DataFrame containing neighborhood attributes with a column defining neighborhood clusters.

cluster_col : str or int

Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.

seq_clusters : int, optional

Number of neighborhood sequence clusters. The sequences are clustered using agglomerative clustering with Ward linkage. Default is 5.

dist_type : str

Distance type. One of:

“hamming”: Hamming distance (substitution only, with a constant cost of 1), from sklearn.metrics.

“markov”: empirical transition probabilities are used to define the substitution costs.

“interval”: differences between states are used to define the substitution costs, and indel=k-1.

“arbitrary”: arbitrary distance for cases without strong theoretical guidance: substitution=0.5, indel=1.

“tran”: transition-oriented optimal matching, applied to the sequence of transitions. Based on [Bie11].

subs_mat : array

(k,k), substitution cost matrix. Should be hollow (zero cost between identical types), symmetric, and non-negative.

indel : float, optional

Insertion/deletion cost.

temporal_index : str, optional

Column defining the time and/or sequencing of the long-form data. Default is “year”.

unit_index : str, optional

Column identifying the unique id of spatial units. Default is “geoid”.
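The cost rules described for dist_type and the requirements on subs_mat can be made concrete with a small sketch. The helpers below (`interval_costs`, `arbitrary_costs`, `is_valid_subs_mat`) are hypothetical and not part of geosnap. The “interval” version assumes that "differences between states" means the absolute difference between integer state codes, which is an interpretation rather than something stated above.

```python
def interval_costs(k):
    """Assumed "interval" rule: cost |i - j| between states i and j, indel = k - 1."""
    subs = [[float(abs(i - j)) for j in range(k)] for i in range(k)]
    return subs, float(k - 1)


def arbitrary_costs(k):
    """"arbitrary" rule: every substitution costs 0.5, indel = 1."""
    subs = [[0.0 if i == j else 0.5 for j in range(k)] for i in range(k)]
    return subs, 1.0


def is_valid_subs_mat(mat):
    """Check that a square cost matrix is hollow, symmetric, and non-negative."""
    k = len(mat)
    if any(len(row) != k for row in mat):
        return False
    hollow = all(mat[i][i] == 0 for i in range(k))
    symmetric = all(mat[i][j] == mat[j][i] for i in range(k) for j in range(k))
    non_negative = all(value >= 0 for row in mat for value in row)
    return hollow and symmetric and non_negative


subs, indel = interval_costs(3)
# subs is [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]] and indel is 2.0
```

A check like `is_valid_subs_mat` may be useful before passing a hand-built matrix as subs_mat, since an asymmetric or non-hollow matrix would not satisfy the requirements listed for that parameter.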

Returns:
gdf_temp : geopandas.GeoDataFrame or pandas.DataFrame

geopandas.GeoDataFrame or pandas.DataFrame with a new column for sequence labels.

df_wide : pandas.DataFrame

Wide-form DataFrame with k columns of neighborhood types (where k is the number of periods) and 1 column of sequence labels.

seq_dis_mat : array

(n,n), distance/dissimilarity matrix for each pair of sequences.
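To show how the wide-form sequences relate to the (n,n) distance matrix, here is a dependency-free sketch using made-up records. The data, the variable names, and the count-based `hamming` helper are all assumptions for illustration; the function itself works on DataFrames and takes its Hamming distance from sklearn.metrics.

```python
# Hypothetical long-form records: (unit id, year, neighborhood type)
records = [
    ("A", 1990, 0), ("A", 2000, 1), ("A", 2010, 1),
    ("B", 1990, 0), ("B", 2000, 0), ("B", 2010, 1),
    ("C", 1990, 2), ("C", 2000, 2), ("C", 2010, 2),
]

# Wide form: one entry per unit, holding its neighborhood type in each period
years = sorted({year for _, year, _ in records})
units = sorted({unit for unit, _, _ in records})
lookup = {(unit, year): state for unit, year, state in records}
wide = {unit: [lookup[(unit, year)] for year in years] for unit in units}


def hamming(seq_a, seq_b):
    """Number of periods in which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


# (n, n) matrix of pairwise distances between the n unit sequences
seq_dis_mat = [[hamming(wide[u], wide[v]) for v in units] for u in units]
# seq_dis_mat is [[0, 1, 3], [1, 0, 3], [3, 3, 0]]
```

Units A and B differ only in 2000, so their distance is 1, while C shares no state with either and sits at the maximum distance of 3 (one per period).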

Examples

>>> from geosnap.data import Community
>>> columbus = Community.from_ltdb(msa_fips="18140")
>>> columbus1 = columbus.cluster(columns=['median_household_income',
... 'p_poverty_rate', 'p_edu_college_greater', 'p_unemployment_rate'],
... method='ward', n_clusters=6)
>>> gdf = columbus1.gdf
>>> from geosnap.analyze import sequence
>>> gdf_new, df_wide, seq_hamming = sequence(gdf, cluster_col="ward", dist_type="hamming")
>>> seq_hamming[:5, :5]
array([[0., 3., 4., 5., 5.],
       [3., 0., 3., 3., 3.],
       [4., 3., 0., 2., 2.],
       [5., 3., 2., 0., 0.],
       [5., 3., 2., 0., 0.]])