import requests
import pandas as pd
import geopandas as gpd
import numpy as np
Assignment 6: Predictive Modeling of Housing Prices in Philadelphia
Due date: Wednesday, 12/6 by the end of the day
Lectures 12B and 13A will cover predictive modeling of housing prices in Philadelphia. We’ll extend that analysis in this section by:
- Optimizing our hyperparameters during the modeling process using cross-validation and a grid search
- Testing the fairness of our model by calculating the intersection of the model error rate and poverty rate across neighborhoods
Part 2: Modeling Philadelphia’s Housing Prices and Algorithmic Fairness
2.1 Load data from the Office of Property Assessment
Use the requests package to query the CARTO API for single-family property assessment data in Philadelphia for properties that had their last sale during 2022.
Sources: - OpenDataPhilly - Metadata
= "https://phl.carto.com/api/v2/sql"
carto_url = "sale_date >= '2022-01-01' and sale_date <= '2022-12-31'"
where = where + " and category_code_description IN ('SINGLE FAMILY', 'Single Family')" where
# Create the query
= f"SELECT * FROM opa_properties_public WHERE {where}"
query
# Make the request
= {"q": query, "format": "geojson", "where": where}
params = requests.get(carto_url, params=params)
response
# Make the GeoDataFrame
= gpd.GeoDataFrame.from_features(response.json(), crs="EPSG:4326") sales2022
2.2 Load data for census tracts and neighborhoods
Load various Philadelphia-based regions that we will use in our analysis.
- Census tracts can be downloaded from: https://opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson
- Neighborhoods can be downloaded from: https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson
tracts = gpd.read_file('2010_Tracts.geojson')

url = 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
neighbor = gpd.read_file(url).to_crs('EPSG:4326')
2.3 Spatially join the sales data and neighborhoods/census tracts.
Perform a spatial join, such that each sale has an associated neighborhood and census tract.
Note: After performing the first spatial join, you will need to use the drop() function to remove the index_right column; otherwise an error will be raised on the second spatial join about duplicate columns.
joined = gpd.sjoin(sales2022, tracts, predicate="within", how="left")
joined = gpd.sjoin(joined.drop('index_right', axis=1), neighbor, predicate="within", how="left")
2.4 Train a Random Forest on the sales data
In this step, you should follow the steps outlined in lecture to preprocess and train your model. We’ll extend that analysis by doing a hyperparameter grid search to find the best model configuration. As you train your model, follow these steps:

Preprocessing requirements:
- Trim the sales data to those sales with prices between $3,000 and $1 million
- Set up a pipeline that includes both numerical columns and categorical columns
- Include a one-hot encoded variable for the neighborhood of the sale, instead of the ZIP code. We don’t want to include multiple location-based categories, since they encode much of the same information.

Training requirements:
- Use a 70/30 training/test split and predict the log of the sale price.
- Use GridSearchCV to perform a k-fold cross-validation that optimizes at least 2 hyperparameters of the RandomForestRegressor.
- After fitting your model and finding the optimal hyperparameters, evaluate the score (R-squared) on the test set (the original 30% sample withheld).
Note: You don’t need to include additional features (such as spatial distance features) or perform any extra feature engineering beyond what is required above to receive full credit. Of course, you are always welcome to experiment!
# model
from sklearn.ensemble import RandomForestRegressor
# pipeline
from sklearn.pipeline import make_pipeline
# model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# Preprocessing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
# trim data to between $3000 and $1 million
valid = (joined['sale_price'] > 3000) & (joined['sale_price'] < 1e6)
joined = joined.loc[valid]
pd.set_option('display.max_columns', None)
joined.head()
(Output: the first five rows of joined, with all columns displayed — the full set of OPA property attributes (sale_price, sale_date, market_value, building and location fields, etc.) plus the census-tract fields from the spatial join (GEOID10, NAME10, NAMELSAD10, …) and the neighborhood fields (name, listname, mapname, …).)
# categorical columns
cat_cols = ['exterior_condition', 'listname']

# numeric columns
num_cols = ['fireplaces', 'garage_spaces', 'number_of_bathrooms', 'number_of_bedrooms', 'number_stories', 'total_area', 'total_livable_area']

features = cat_cols + num_cols + ['sale_price']

joined_trimmed = joined[features].dropna()
joined_trimmed
exterior_condition | interior_condition | quality_grade | parcel_shape | listname | view_type | year_built | fireplaces | garage_spaces | number_of_bathrooms | number_of_bedrooms | number_stories | total_area | total_livable_area | sale_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 4 | C | E | Brewerytown | I | 1920 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 468.0 | 650.0 | 213000 |
2 | 4 | 3 | C | E | Glenwood | I | 1925 | 0.0 | 0.0 | 1.0 | 4.0 | 2.0 | 1583.0 | 1960.0 | 10000 |
6 | 7 | 4 | C | E | Olney | I | 1925 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1425.0 | 1164.0 | 21000 |
7 | 4 | 4 | C | E | Hunting Park | I | 1942 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 720.0 | 960.0 | 35000 |
9 | 4 | 4 | C | E | Harrowgate | I | 1920 | 0.0 | 0.0 | 1.0 | 3.0 | 1.0 | 602.0 | 840.0 | 39600 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
24472 | 4 | 4 | C | E | Holmesburg | I | 1950 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1147.0 | 1034.0 | 163000 |
24473 | 4 | 4 | C+ | E | Mayfair | H | 1940 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1600.0 | 1360.0 | 139900 |
24474 | 4 | 4 | C | E | Oxford Circle | I | 1950 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 | 2813.0 | 1656.0 | 280000 |
24477 | 4 | 3 | C | E | Mayfair | A | 1950 | 0.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1424.0 | 1292.0 | 280000 |
24479 | 4 | 4 | C | E | Wissinoming | I | 1920 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1109.0 | 1076.0 | 150000 |
17430 rows × 15 columns
joined_trimmed['sale_price'] = pd.Series(joined_trimmed['sale_price'], dtype='int32')
joined_trimmed["log_price"] = np.log(joined_trimmed["sale_price"])
joined_trimmed
exterior_condition | interior_condition | quality_grade | parcel_shape | listname | fireplaces | garage_spaces | number_of_bathrooms | number_of_bedrooms | number_stories | total_area | total_livable_area | sale_price | log_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 4 | C | E | Hunting Park | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 720.0 | 960.0 | 35000 | 10.463103 |
1 | 7 | 4 | C | E | Olney | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1425.0 | 1164.0 | 21000 | 9.952278 |
6 | 4 | 3 | C | E | Glenwood | 0.0 | 0.0 | 1.0 | 4.0 | 2.0 | 1583.0 | 1960.0 | 10000 | 9.210340 |
7 | 4 | 4 | C+ | B | Chestnut Hill | 0.0 | 0.0 | 1.0 | 4.0 | 2.0 | 2457.0 | 2110.0 | 350000 | 12.765688 |
8 | 4 | 4 | C | E | West Oak Lane | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1161.0 | 1120.0 | 70000 | 11.156251 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
24432 | 2 | 2 | C | E | Whitman | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 736.0 | 951.0 | 150000 | 11.918391 |
24434 | 4 | 4 | C+ | E | Mayfair | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1600.0 | 1360.0 | 139900 | 11.848683 |
24436 | 4 | 4 | C | E | Stadium District | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1053.0 | 960.0 | 190000 | 12.154779 |
24440 | 4 | 4 | C | E | Richmond | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 741.0 | 1020.0 | 46000 | 10.736397 |
24442 | 4 | 4 | C | E | Fox Chase | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 9000.0 | 1388.0 | 150000 | 11.918391 |
17461 rows × 14 columns
# 70/30 training / test split
train_set, test_set = train_test_split(joined_trimmed, test_size=0.3, random_state=42)

# the target labels
y_train = train_set["log_price"]
y_test = test_set["log_price"]

# features
feature_cols = cat_cols + num_cols
X_train = train_set[feature_cols]
X_test = test_set[feature_cols]
# column transformers + pipeline for random forest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

pipe = make_pipeline(transformer, RandomForestRegressor(n_estimators=100, random_state=42))

# Fit the training set
pipe.fit(X_train, y_train);
# Test Score
pipe.score(X_test, y_test)
0.5646447325116657
# Setup for GridSearchCV
= "randomforestregressor"
model_step = {
param_grid f"{model_step}__n_estimators": [5, 10, 15, 20, 30, 50, 100, 200],
f"{model_step}__max_depth": [9, 13, 21, 33, 51],
}
# Create the grid and use 3-fold CV
= GridSearchCV(pipe, param_grid, cv=3, verbose=1)
grid
# Run the search
grid.fit(X_train, y_train)
Fitting 3 folds for each of 40 candidates, totalling 120 fits
GridSearchCV(cv=3, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('num', StandardScaler(), ['fireplaces', 'garage_spaces', 'number_of_bathrooms', 'number_of_bedrooms', 'number_stories', 'total_area', 'total_livable_area']), ('cat', OneHotEncoder(handle_unknown='ignore'), ['exterior_condition', 'listname'])])), ('randomforestregressor', RandomForestRegressor(random_state=42))]), param_grid={'randomforestregressor__max_depth': [9, 13, 21, 33, 51], 'randomforestregressor__n_estimators': [5, 10, 15, 20, 30, 50, 100, 200]}, verbose=1)
# The best estimator
grid.best_estimator_
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('num', StandardScaler(), ['fireplaces', 'garage_spaces', 'number_of_bathrooms', 'number_of_bedrooms', 'number_stories', 'total_area', 'total_livable_area']), ('cat', OneHotEncoder(handle_unknown='ignore'), ['exterior_condition', 'listname'])])), ('randomforestregressor', RandomForestRegressor(max_depth=33, n_estimators=200, random_state=42))])
new_pipe = make_pipeline(transformer, RandomForestRegressor(max_depth=33, n_estimators=200, random_state=42))

# Fit the training set
new_pipe.fit(train_set, y_train);

# Test Score
new_pipe.score(test_set, y_test)
0.5675150349542478
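As a cross-check, and assuming GridSearchCV’s default refit=True, the grid object already holds a copy of the pipeline refit on the full training set with the best hyperparameters, so essentially the same test score can be obtained without rebuilding the pipeline by hand; a minimal sketch:

# grid.best_estimator_ is the pipeline refit with the winning parameters (refit=True by default)
grid.best_estimator_.score(X_test, y_test)

# equivalently, GridSearchCV delegates score() to the refit estimator
grid.score(X_test, y_test)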
2.5 Calculate the percent error of your model predictions for each sale in the test set
Fit your best model and use it to make predictions on the test set.
Note: This should be the percent error in terms of sale price. You’ll need to convert if your model predicted the log of sales price!
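For example, if ŷ is the predicted log price and y is the actual sale price, the percent error in price terms is (exp(ŷ) − y) / y; a toy check with made-up numbers:

# toy example (hypothetical numbers): predicted log price of 12.0, actual sale price of $150,000
predicted_price = np.exp(12.0)                       # about 162,755 dollars
pct_error = (predicted_price - 150_000) / 150_000    # about +0.085, i.e. roughly 8.5% over-prediction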
# this step is not necessary, just curious what the MAPE is.
# X_test = test_set[feature_cols]
# y_test = test_set["log_price"].values
# y_test_no_log = np.exp(y_test)
def evaluate_mape(model, X_test, y_test):
    """
    Given a model and test features/targets, print out the
    mean absolute error and accuracy.
    """
    # Make the predictions (in log space)
    predictions = model.predict(X_test)

    # Absolute errors (in dollars), converting predictions back from log space
    abs_errors = abs(np.exp(predictions) - y_test)
    avg_error = np.mean(abs_errors)

    # Mean absolute percentage error
    mape = 100 * np.mean(abs_errors / y_test)

    # Accuracy
    accuracy = 100 - mape

    print("Model Performance")
    print(f"Average Absolute Error: {avg_error:0.4f}")
    print(f"Accuracy = {accuracy:0.2f}%.")

    return accuracy
y_test_no_log = np.exp(y_test)

base_accuracy = evaluate_mape(pipe, X_test, y_test_no_log)
Model Performance
Average Absolute Error: 69104.9786
Accuracy = 51.06%.
# This is the actual percent error calculation part
def percent_error(model, X_test, y_test):
    # Convert log-space predictions back to dollars
    predictions = np.exp(model.predict(X_test))

    # percentage error
    p_errors = (predictions - y_test) / y_test

    return p_errors

p_errors = percent_error(pipe, X_test, y_test_no_log)
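Since the prompt asks for predictions from your best model, the same helper also works with the tuned estimator from the grid search; a hypothetical variant (the p_errors used in the rest of this notebook come from pipe, as above):

# optional: signed percent errors from the grid-search best estimator instead of the baseline pipe
p_errors_best = percent_error(grid.best_estimator_, X_test, y_test_no_log)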
pipe.predict(X_test)
array([10.79687142, 12.01202048, 12.44467831, ..., 11.82508308,
11.50655521, 12.76111153])
y_test
8682 11.245046
45 11.336188
6987 11.580584
16121 12.449019
8673 11.695247
...
2252 12.254863
13927 13.422468
17434 11.695247
12885 11.289782
3753 13.060488
Name: log_price, Length: 5230, dtype: float64
p_errors
7209 0.095504
52 -0.068308
10900 3.547680
22227 -0.448610
18079 8.437723
...
12894 -0.101753
15565 -0.107182
21955 -0.495361
3723 -0.141809
18913 0.245689
Name: log_price, Length: 5229, dtype: float64
2.6 Make a data frame with percent errors and census tract info for each sale in the test set
Create a data frame that has the property geometries, census tract data, and percent errors for all of the sales in the test set.
Notes
- When using the “train_test_split()” function, the index of the test data frame includes the labels from the original sales data frame
- You can use this index to slice out the test data from the original sales data frame, which should include the census tract info and geometries
- Add a new column to this data frame holding the percent error data
- Make sure to use the percent error and not the absolute percent error
test_df = joined.loc[test_set.index]

test_df['percent_error'] = p_errors
2.8 Plot a map of the median percent error by census tract
- You’ll want to group your data frame of test sales by the GEOID10 column and take the median of your percent error column
- Merge the census tract geometries back in and use geopandas to plot.
median_p_errors = test_df.groupby('GEOID10')['percent_error'].median()
median_p_errors = pd.DataFrame(median_p_errors).reset_index()

geo_median = tracts.merge(median_p_errors, on='GEOID10')
geo_median.explore(
    column="percent_error",
    tooltip="percent_error",
    popup=True,
    cmap="coolwarm",
    tiles="CartoDB positron",
    legend=True
)
2.9 Compare the percent errors in Qualifying Census Tracts and other tracts
Qualifying Census Tracts are a poverty designation that HUD uses to allocate housing tax credits
- I’ve included a list of the census tract names that qualify in Philadelphia
- Add a new column to your dataframe of test set sales that is True/False depending on if the tract is a QCT
- Then, group by this new column and calculate the median percent error
You should find that the algorithm’s accuracy is significantly worse in these low-income, qualifying census tracts.
qct = ['5',
'20',
'22',
'28.01',
'30.01',
'30.02',
'31',
'32',
'33',
'36',
'37.01',
'37.02',
'39.01',
'41.01',
'41.02',
'56',
'60',
'61',
'62',
'63',
'64',
'65',
'66',
'67',
'69',
'70',
'71.01',
'71.02',
'72',
'73',
'74',
'77',
'78',
'80',
'81.01',
'81.02',
'82',
'83.01',
'83.02',
'84',
'85',
'86.01',
'86.02',
'87.01',
'87.02',
'88.01',
'88.02',
'90',
'91',
'92',
'93',
'94',
'95',
'96',
'98.01',
'100',
'101',
'102',
'103',
'104',
'105',
'106',
'107',
'108',
'109',
'110',
'111',
'112',
'113',
'119',
'121',
'122.01',
'122.03',
'131',
'132',
'137',
'138',
'139',
'140',
'141',
'144',
'145',
'146',
'147',
'148',
'149',
'151.01',
'151.02',
'152',
'153',
'156',
'157',
'161',
'162',
'163',
'164',
'165',
'167.01',
'167.02',
'168',
'169.01',
'169.02',
'170',
'171',
'172.01',
'172.02',
'173',
'174',
'175',
'176.01',
'176.02',
'177.01',
'177.02',
'178',
'179',
'180.02',
'188',
'190',
'191',
'192',
'195.01',
'195.02',
'197',
'198',
'199',
'200',
'201.01',
'201.02',
'202',
'203',
'204',
'205',
'206',
'208',
'239',
'240',
'241',
'242',
'243',
'244',
'245',
'246',
'247',
'249',
'252',
'253',
'265',
'267',
'268',
'271',
'274.01',
'274.02',
'275',
'276',
'277',
'278',
'279.01',
'279.02',
'280',
'281',
'282',
'283',
'284',
'285',
'286',
'287',
'288',
'289.01',
'289.02',
'290',
'291',
'293',
'294',
'298',
'299',
'300',
'301',
'302',
'305.01',
'305.02',
'309',
'311.01',
'312',
'313',
'314.01',
'314.02',
'316',
'318',
'319',
'321',
'325',
'329',
'330',
'337.01',
'345.01',
'357.01',
'376',
'377',
'380',
'381',
'382',
'383',
'389',
'390']
geo_median['qct'] = geo_median['NAME10'].isin(qct)

geo_median.groupby('qct')['percent_error'].median()
qct
False -0.059781
True 0.039906
Name: percent_error, dtype: float64
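The signed medians above capture the direction of the error: in Qualifying Census Tracts the model tends to predict above the actual sale price, while in other tracts it tends to predict below it. To speak directly to accuracy (the size of the error regardless of sign), one optional follow-up is to compare the median absolute percent error across the two groups; a minimal sketch using the same geo_median frame:

# median absolute percent error by QCT status (magnitude only, ignoring sign)
geo_median.groupby('qct')['percent_error'].apply(lambda s: s.abs().median())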