Assignment 6: Predictive Modeling of Housing Prices in Philadelphia

Due date: Wednesday, 12/6 by the end of the day

Lectures 12B and 13A will cover predictive modeling of housing prices in Philadelphia. We’ll extend that analysis in this section, as outlined in Part 2 below.

Part 2: Modeling Philadelphia’s Housing Prices and Algorithmic Fairness

2.1 Load data from the Office of Property Assessment

Use the requests package to query the CARTO API for single-family property assessment data in Philadelphia for properties that had their last sale during 2022.

Sources:

  • OpenDataPhilly
  • Metadata

import requests
import pandas as pd
import geopandas as gpd
import numpy as np
carto_url = "https://phl.carto.com/api/v2/sql"
where = "sale_date >= '2022-01-01' and sale_date <= '2022-12-31'"
where = where + " and category_code_description IN ('SINGLE FAMILY', 'Single Family')"
# Create the query
query = f"SELECT * FROM opa_properties_public WHERE {where}"

# Make the request (the WHERE clause is already embedded in the SQL query)
params = {"q": query, "format": "geojson"}
response = requests.get(carto_url, params=params)

# Make the GeoDataFrame
sales2022 = gpd.GeoDataFrame.from_features(response.json(), crs="EPSG:4326")
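A quick sanity check is useful here to confirm the request succeeded and to see how many 2022 sales came back; a minimal, optional sketch:

# Optional sanity check (sketch): raise if the request failed, then peek at the result
response.raise_for_status()
print(len(sales2022), "single-family sales from 2022")
sales2022[["sale_date", "sale_price"]].head()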

2.2 Load data for census tracts and neighborhoods

Load various Philadelphia-based regions that we will use in our analysis.

tracts = gpd.read_file('2010_Tracts.geojson')
url = 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
neighbor = gpd.read_file(url).to_crs('EPSG:4326')
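Before joining, it is worth confirming that all three layers share the same CRS (the tracts GeoJSON should already be in EPSG:4326); a minimal check:

# Quick check (sketch): the layers should all report EPSG:4326 before the spatial join
print(sales2022.crs, tracts.crs, neighbor.crs)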

2.3 Spatially join the sales data and neighborhoods/census tracts.

Perform a spatial join, such that each sale has an associated neighborhood and census tract.

Note: After performing the first spatial join, you will need to use the drop() function to remove the index_right column; otherwise an error will be raised on the second spatial join about duplicate columns.

joined = gpd.sjoin(sales2022, tracts, predicate="within", how="left")
joined = gpd.sjoin(joined.drop("index_right", axis=1), neighbor, predicate="within", how="left")
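Because these are left joins, sales that fall outside any tract or neighborhood are kept with missing values; an optional check of how many such sales exist:

# Optional check (sketch): sales that did not match a tract or a neighborhood
print(joined["GEOID10"].isna().sum(), "sales without a census tract")
print(joined["listname"].isna().sum(), "sales without a neighborhood")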

2.4 Train a Random Forest on the sales data

In this step, you should follow the steps outlined in lecture to preprocess and train your model. We’ll extend our analysis with a hyperparameter grid search to find the best model configuration. As you train your model, follow these steps:

Preprocessing requirements

  • Trim the sales data to sales with prices between $3,000 and $1 million.
  • Set up a pipeline that includes both numerical columns and categorical columns.
  • Include a one-hot encoded variable for the neighborhood of the sale, instead of the ZIP code. We don’t want to include multiple location-based categories, since they encode much of the same information.

Training requirements

  • Use a 70/30 training/test split and predict the log of the sale price.
  • Use GridSearchCV to perform a k-fold cross-validation that optimizes at least 2 hyperparameters of the RandomForestRegressor.
  • After fitting your model and finding the optimal hyperparameters, evaluate the score (R-squared) on the test set (the original 30% sample withheld).

Note: You don’t need to include additional features (such as spatial distance features) or perform any extra feature engineering beyond what is required above to receive full credit. Of course, you are always welcome to experiment!

# model
from sklearn.ensemble import RandomForestRegressor

# pipeline
from sklearn.pipeline import make_pipeline

# model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Preprocessing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
# trim data to between $3000 and $1 million
valid = (joined['sale_price'] > 3000) & (joined['sale_price'] < 1e6)
joined = joined.loc[valid]
pd.set_option('display.max_columns', None)
joined.head()
[Output of joined.head(): the first five rows of the joined sales, showing the full set of OPA property columns (sale_date, sale_price, market_value, total_livable_area, etc.) plus the joined census tract fields (GEOID10, NAME10) and neighborhood fields (listname, mapname).]
# categorical columns
cat_cols = ['exterior_condition', 'listname']

# numeric columns
num_cols = ['fireplaces', 'garage_spaces', 'number_of_bathrooms', 'number_of_bedrooms', 'number_stories', 'total_area', 'total_livable_area']

features = cat_cols + num_cols + ['sale_price']
joined_trimmed = joined[features].dropna()
joined_trimmed
[Output: joined_trimmed, the trimmed sales table (17,430 rows × 15 columns as displayed).]

joined_trimmed['sale_price'] = joined_trimmed['sale_price'].astype('int32')
joined_trimmed["log_price"] = np.log(joined_trimmed["sale_price"])
joined_trimmed
[Output: joined_trimmed with the new log_price column (17,461 rows × 14 columns as displayed).]

# 70/30 training / test split 
train_set, test_set = train_test_split(joined_trimmed, test_size=0.3, random_state=42)

# the target labels
y_train = train_set["log_price"]
y_test = test_set["log_price"]

# features
feature_cols = cat_cols + num_cols
X_train = train_set[feature_cols]
X_test = test_set[feature_cols]
# column transformers + pipeline for random forest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

pipe = make_pipeline(
    transformer,
    RandomForestRegressor(n_estimators=100, random_state=42),
)
# Fit the training set
pipe.fit(X_train, y_train);
# Test Score
pipe.score(X_test, y_test)
0.5646447325116657
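Before tuning, the already-imported cross_val_score can give a cross-validated R² for this baseline pipeline on the training set, which is a more stable point of comparison for the grid search than the single test-set score above; an optional sketch:

# Optional (sketch): 3-fold cross-validated R² of the untuned pipeline on the training set
scores = cross_val_score(pipe, X_train, y_train, cv=3)
print(scores.mean())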
# Setup for GridSearchCV

model_step = "randomforestregressor"
param_grid = {
    f"{model_step}__n_estimators": [5, 10, 15, 20, 30, 50, 100, 200],
    f"{model_step}__max_depth": [9, 13, 21, 33, 51],
}

# Create the grid and use 3-fold CV
grid = GridSearchCV(pipe, param_grid, cv=3, verbose=1)

# Run the search
grid.fit(X_train, y_train)
Fitting 3 folds for each of 40 candidates, totalling 120 fits
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['fireplaces',
                                                                          'garage_spaces',
                                                                          'number_of_bathrooms',
                                                                          'number_of_bedrooms',
                                                                          'number_stories',
                                                                          'total_area',
                                                                          'total_livable_area']),
                                                                        ('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['exterior_condition',
                                                                          'listname'])])),
                                       ('randomforestregressor',
                                        RandomForestRegressor(random_state=42))]),
             param_grid={'randomforestregressor__max_depth': [9, 13, 21, 33,
                                                              51],
                         'randomforestregressor__n_estimators': [5, 10, 15, 20,
                                                                 30, 50, 100,
                                                                 200]},
             verbose=1)
# The best estimator
grid.best_estimator_
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['fireplaces',
                                                   'garage_spaces',
                                                   'number_of_bathrooms',
                                                   'number_of_bedrooms',
                                                   'number_stories',
                                                   'total_area',
                                                   'total_livable_area']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['exterior_condition',
                                                   'listname'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=33, n_estimators=200,
                                       random_state=42))])
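The winning hyperparameter combination and its mean cross-validated R² can also be read directly off the fitted grid; a minimal sketch:

# Best hyperparameters and mean cross-validated R² from the search (sketch)
print(grid.best_params_)
print(grid.best_score_)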
# Refit a pipeline using the best hyperparameters found by the grid search
new_pipe = make_pipeline(
    transformer,
    RandomForestRegressor(max_depth=33, n_estimators=200, random_state=42),
)
# Fit the training set
new_pipe.fit(X_train, y_train);
# Test Score
new_pipe.score(X_test, y_test)
0.5675150349542478

2.5 Calculate the percent error of your model predictions for each sale in the test set

Fit your best model and use it to make predictions on the test set.

Note: This should be the percent error in terms of sale price. You’ll need to convert if your model predicted the log of sales price!

# this step is not necessary, just curious what the MAPE is. 
# X_test = test_set[feature_cols]
# y_test = test_set["log_price"].values
# y_test_no_log = np.exp(y_test)

def evaluate_mape(model, X_test, y_test):
    """
    Given a model (trained on log prices) and test features/targets in dollars,
    print out the average absolute error and the accuracy
    (100 minus the mean absolute percentage error).
    """
    # Make the predictions and convert from log(price) back to dollars
    predictions = np.exp(model.predict(X_test))

    # Absolute errors (in dollars)
    abs_errors = abs(predictions - y_test)
    avg_error = np.mean(abs_errors)

    # Mean absolute percentage error
    mape = 100 * np.mean(abs_errors / y_test)

    # Accuracy
    accuracy = 100 - mape

    print("Model Performance")
    print(f"Average Absolute Error: {avg_error:0.4f}")
    print(f"Accuracy = {accuracy:0.2f}%.")

    return accuracy
y_test_no_log = np.exp(y_test)
base_accuracy = evaluate_mape(pipe, X_test, y_test_no_log)
Model Performance
Average Absolute Error: 69104.9786
Accuracy = 51.06%.
# This is the actual percent error calculation part 

def percent_error(model, X_test, y_test):
    """Return the signed percent error of the model's price predictions."""
    # Convert predictions from log(price) back to dollars
    predictions = np.exp(model.predict(X_test))

    # Signed percent error (not the absolute percent error)
    p_errors = (predictions - y_test) / y_test

    return p_errors
p_errors = percent_error(pipe, X_test, y_test_no_log)
pipe.predict(X_test)
array([10.79687142, 12.01202048, 12.44467831, ..., 11.82508308,
       11.50655521, 12.76111153])
y_test
8682     11.245046
45       11.336188
6987     11.580584
16121    12.449019
8673     11.695247
           ...    
2252     12.254863
13927    13.422468
17434    11.695247
12885    11.289782
3753     13.060488
Name: log_price, Length: 5230, dtype: float64
p_errors
7209     0.095504
52      -0.068308
10900    3.547680
22227   -0.448610
18079    8.437723
           ...   
12894   -0.101753
15565   -0.107182
21955   -0.495361
3723    -0.141809
18913    0.245689
Name: log_price, Length: 5229, dtype: float64

2.6 Make a data frame with percent errors and census tract info for each sale in the test set

Create a data frame that has the property geometries, census tract data, and percent errors for all of the sales in the test set.

Notes

  • When using the “train_test_split()” function, the index of the test data frame includes the labels from the original sales data frame
  • You can use this index to slice out the test data from the original sales data frame, which should include the census tract info and geometries
  • Add a new column to this data frame holding the percent error data
  • Make sure to use the percent error and not the absolute percent error
test_df = joined.loc[test_set.index]
test_df['percent_error'] = p_errors
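A quick look confirms each test-set sale now carries its geometry, tract identifiers, and percent error; an optional sketch:

# Quick look (sketch): tract IDs and percent errors for the test-set sales
test_df[["GEOID10", "NAME10", "percent_error"]].head()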

2.8 Plot a map of the median percent error by census tract

  • You’ll want to group your data frame of test sales by the GEOID10 column and take the median of your percent error column
  • Merge the census tract geometries back in and use geopandas to plot.
median_p_errors = test_df.groupby('GEOID10')['percent_error'].median()
median_p_errors = pd.DataFrame(median_p_errors).reset_index()
geo_median = tracts.merge(median_p_errors, on='GEOID10')
geo_median.explore(
    column="percent_error",
    tooltip="percent_error",
    popup=True,
    cmap="coolwarm",
    tiles="CartoDB positron",
    legend=True
    )
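If the interactive map will not render (for example in an untrusted notebook or on GitHub), a static geopandas plot of the same column is a reasonable fallback; an optional sketch:

# Optional static alternative (sketch) to the interactive map above
ax = geo_median.plot(column="percent_error", cmap="coolwarm", legend=True, figsize=(8, 8))
ax.set_axis_off()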

2.9 Compare the percent errors in Qualifying Census Tracts and other tracts

Qualifying Census Tracts (QCTs) are a poverty-based designation that HUD uses to allocate housing tax credits.

  • I’ve included a list of the census tract names that qualify in Philadelphia
  • Add a new column to your dataframe of test set sales that is True/False depending on whether the tract is a QCT
  • Then, group by this new column and calculate the median percent error

You should find that the algorithm’s accuracy is significantly worse in these low-income, qualifying census tracts.

qct = ['5',
 '20',
 '22',
 '28.01',
 '30.01',
 '30.02',
 '31',
 '32',
 '33',
 '36',
 '37.01',
 '37.02',
 '39.01',
 '41.01',
 '41.02',
 '56',
 '60',
 '61',
 '62',
 '63',
 '64',
 '65',
 '66',
 '67',
 '69',
 '70',
 '71.01',
 '71.02',
 '72',
 '73',
 '74',
 '77',
 '78',
 '80',
 '81.01',
 '81.02',
 '82',
 '83.01',
 '83.02',
 '84',
 '85',
 '86.01',
 '86.02',
 '87.01',
 '87.02',
 '88.01',
 '88.02',
 '90',
 '91',
 '92',
 '93',
 '94',
 '95',
 '96',
 '98.01',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '110',
 '111',
 '112',
 '113',
 '119',
 '121',
 '122.01',
 '122.03',
 '131',
 '132',
 '137',
 '138',
 '139',
 '140',
 '141',
 '144',
 '145',
 '146',
 '147',
 '148',
 '149',
 '151.01',
 '151.02',
 '152',
 '153',
 '156',
 '157',
 '161',
 '162',
 '163',
 '164',
 '165',
 '167.01',
 '167.02',
 '168',
 '169.01',
 '169.02',
 '170',
 '171',
 '172.01',
 '172.02',
 '173',
 '174',
 '175',
 '176.01',
 '176.02',
 '177.01',
 '177.02',
 '178',
 '179',
 '180.02',
 '188',
 '190',
 '191',
 '192',
 '195.01',
 '195.02',
 '197',
 '198',
 '199',
 '200',
 '201.01',
 '201.02',
 '202',
 '203',
 '204',
 '205',
 '206',
 '208',
 '239',
 '240',
 '241',
 '242',
 '243',
 '244',
 '245',
 '246',
 '247',
 '249',
 '252',
 '253',
 '265',
 '267',
 '268',
 '271',
 '274.01',
 '274.02',
 '275',
 '276',
 '277',
 '278',
 '279.01',
 '279.02',
 '280',
 '281',
 '282',
 '283',
 '284',
 '285',
 '286',
 '287',
 '288',
 '289.01',
 '289.02',
 '290',
 '291',
 '293',
 '294',
 '298',
 '299',
 '300',
 '301',
 '302',
 '305.01',
 '305.02',
 '309',
 '311.01',
 '312',
 '313',
 '314.01',
 '314.02',
 '316',
 '318',
 '319',
 '321',
 '325',
 '329',
 '330',
 '337.01',
 '345.01',
 '357.01',
 '376',
 '377',
 '380',
 '381',
 '382',
 '383',
 '389',
 '390']
# Flag each tract as a QCT and compare the median percent error across the two groups
# (note: this aggregates the tract-level medians, so each tract is weighted equally)
geo_median['qct'] = geo_median['NAME10'].isin(qct)

geo_median.groupby('qct')['percent_error'].median()
qct
False   -0.059781
True     0.039906
Name: percent_error, dtype: float64
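As a cross-check, the same comparison can be run directly on the test-set sales, weighting every sale equally instead of every tract; this is a sketch, and its numbers will differ slightly from the tract-level grouping above:

# Sale-level variant (sketch): flag each test-set sale by whether its tract is a QCT,
# then take the median percent error within each group
test_df["is_qct"] = test_df["NAME10"].isin(qct)
print(test_df.groupby("is_qct")["percent_error"].median())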