convert_2010_to_2020_bounds

sdc_census10to20.convert_2010_to_2020_bounds

convert_2010_to_2020_bounds(data: DataFrame, *, geoid_col: str = 'geoid', val_col: str = 'value', state_fips: str = '51') -> pd.DataFrame

Redistribute a single year/measure of 2010-vintage values onto 2020 boundaries.

The input frame must contain exactly one row per GEOID, that is, a single year and a single measure. Each 2010 source unit distributes its value to the overlapping 2020 tracts in proportion to the share of the source's area that falls in each overlap (area_part / area10). Because a source's overlaps tile it, these fractions sum to 1 and the total is conserved. This is count-preserving areal interpolation based on the land-area overlaps in the Census relationship file.
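
To make the weighting concrete, here is a minimal sketch of the same arithmetic on a toy crosswalk. The GEOIDs and areas are made up for illustration, and the frame is built by hand rather than by the package's crosswalk builder; only the column names (geoid10, geoid20, area_part, area10) follow the source listing.

```python
import pandas as pd

# Toy crosswalk (made-up areas): 2010 unit "A" splits 60/40 into 2020 units
# "X" and "Y"; 2010 unit "B" falls entirely inside "Y".
crosswalk = pd.DataFrame(
    {
        "geoid10": ["A", "A", "B"],
        "geoid20": ["X", "Y", "Y"],
        "area_part": [6.0, 4.0, 5.0],
        "area10": [10.0, 10.0, 5.0],
    }
)
values = pd.DataFrame({"geoid10": ["A", "B"], "value": [100.0, 50.0]})

joined = crosswalk.merge(values, on="geoid10", how="left")

# Weight each overlap by the share of the *source* area it covers.
joined["weight"] = joined["area_part"] / joined["area10"]
joined["value"] = joined["value"] * joined["weight"]

# Each 2010 source's weights should sum to 1, which is what conserves totals.
weight_sums = joined.groupby("geoid10")["weight"].sum()

# Collect the weighted pieces on the 2020 side.
result = joined.groupby("geoid20", as_index=False)["value"].sum()
print(result)
```

In this toy case "X" receives 60 percent of A's value and "Y" receives the remaining 40 percent plus all of B's, so the sum over 2020 units matches the sum over 2010 units.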

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Input frame with at least geoid_col and val_col. | required |
| geoid_col | str | Name of the GEOID column (default "geoid"). | 'geoid' |
| val_col | str | Name of the value column (default "value"). | 'value' |
| state_fips | str | State FIPS for the block-group crosswalk (default Virginia, "51"). | '51' |
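
Since the function expects one row per GEOID, a long-format table usually has to be sliced first. The following is a minimal sketch of that preparation, assuming a hypothetical input with year and measure columns (those column names and the sample GEOIDs are illustrative, not part of the package). The checks mirror the ones in the source listing.

```python
import pandas as pd

# Hypothetical long-format input: several years per GEOID (illustrative values).
long_df = pd.DataFrame(
    {
        "geoid": ["51001090100", "51001090100", "51001090200", "51001090200"],
        "year": [2015, 2016, 2015, 2016],
        "measure": ["pop", "pop", "pop", "pop"],
        "value": [1200.0, 1250.0, 900.0, None],
    }
)

# Keep a single year and a single measure so each GEOID appears once.
mask = (long_df["year"] == 2015) & (long_df["measure"] == "pop")
one_slice = long_df.loc[mask, ["geoid", "value"]].copy()

# Mirror the function's pre-checks: missing GEOIDs are checked before the
# string cast, because astype(str) would turn NaN into the string "nan".
has_missing_geoid = one_slice["geoid"].isna().any()
one_slice["geoid"] = one_slice["geoid"].astype(str)
has_duplicates = one_slice["geoid"].duplicated().any()
has_missing_values = one_slice["value"].isna().any()

print(has_missing_geoid, has_duplicates, has_missing_values)
```

A frame that passes the first two checks can be handed to the function as data; a True in the third only triggers the missing-value warning.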

Returns:

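| Type | Description |
| --- | --- |
| DataFrame | Two columns: geoid (the 2020 GEOID; always named geoid, regardless of geoid_col) and the column named by val_col (the redistributed values). |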

Source code in packages/sdc-census10to20/src/sdc_census10to20/convert.py
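import warnings

import pandas as pd

# create_crosswalk is provided by this package; its import is not shown in this listing.


def convert_2010_to_2020_bounds(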
    data: pd.DataFrame,
    *,
    geoid_col: str = "geoid",
    val_col: str = "value",
    state_fips: str = "51",
) -> pd.DataFrame:
    """Redistribute a single year/measure of 2010-vintage values onto 2020 boundaries.

    The input frame must contain exactly one row per GEOID (one year, one
    measure). Each 2010 source distributes its value to the overlapping 2020
    tracts by the fraction of the *source* area in each overlap
    (``area_part / area10``); a source's overlaps tile it, so the fractions sum to
    1 and the total is conserved (count-preserving areal interpolation, using the
    Census relationship file's land-area overlaps).

    Parameters
    ----------
    data : pd.DataFrame
        Input frame with at least ``geoid_col`` and ``val_col``.
    geoid_col : str
        Name of the GEOID column (default ``"geoid"``).
    val_col : str
        Name of the value column (default ``"value"``).
    state_fips : str
        State FIPS for the block-group crosswalk (default Virginia, "51").

    Returns
    -------
    pd.DataFrame
        Two columns: ``geoid`` (2020 boundaries) and ``val_col`` (redistributed).
    """
    if data[geoid_col].isna().any():
        raise ValueError("geoids contain missing values")

    data = data.copy()
    data[geoid_col] = data[geoid_col].astype(str)
    geoids = data[geoid_col].unique()

    if len(data[geoid_col]) > len(geoids):
        raise ValueError(
            "geoids are not unique -- data cannot contain more than one entry per geoid. "
            "Please double check that data only spans one year, measure, etc."
        )

    if data[val_col].isna().any():
        warnings.warn(
            "data contains missing values. The value of any new tract that overlaps "
            "with a NULL value will be coerced to NULL. If this is an issue, "
            "we recommend manual insertion of values based on contextual specifications.",
            stacklevel=2,
        )

    data = data[[geoid_col, val_col]].copy()
    data = data.rename(columns={val_col: "value"})

    crosswalk = create_crosswalk(list(geoids), state_fips=state_fips)

    joined = crosswalk.merge(data, left_on="geoid10", right_on=geoid_col, how="left")

    # Areal interpolation that conserves counts: each 2010 source distributes its
    # value to overlapping 2020 tracts by the fraction of the *source* area in the
    # overlap (area_part / area10). A source's overlaps tile it, so the fractions
    # sum to 1 and the source's full value is distributed. type_change does not
    # affect the math -- the geometry in area_part/area10 already encodes same vs
    # split vs moved.
    joined["value"] = joined["value"] * (joined["area_part"] / joined["area10"])
    redistributed = joined.groupby("geoid20", as_index=False)["value"].sum()
    redistributed = redistributed.rename(columns={"geoid20": "geoid", "value": val_col})
    return redistributed
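
The missing-value warning concerns how NaN inputs flow through the final multiply and grouped sum. In pandas, a grouped sum skips NaN by default, so it is worth checking how a missing contribution behaves in your environment. The sketch below uses a hand-built toy frame (not the package's crosswalk) to contrast the default sum with a sum that propagates NaN; the skipna=False variant is shown only for comparison and is not taken from the source listing.

```python
import pandas as pd

# Toy post-merge frame: 2020 unit "Y" receives one known contribution and
# one missing contribution (values are made up).
joined = pd.DataFrame(
    {
        "geoid20": ["X", "Y", "Y"],
        "value": [60.0, 40.0, float("nan")],
    }
)

# Default grouped sum skips NaN, so "Y" keeps the partial total.
default_sum = joined.groupby("geoid20")["value"].sum()

# Summing each group with skipna=False propagates NaN to "Y" instead.
strict_sum = joined.groupby("geoid20")["value"].agg(lambda s: s.sum(skipna=False))

print(default_sum)
print(strict_sum)
```

Comparing the two results on your own data shows whether any 2020 unit was built from an incomplete set of 2010 contributions.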