import requests
import pandas as pd
import geopandas as gpd
import numpy as np
Assignment 6: Predictive Modeling of Housing Prices in Philadelphia
Due date: Wednesday, 12/6 by the end of the day
Lectures 12B and 13A will cover predictive modeling of housing prices in Philadelphia. We’ll extend that analysis in this section by:
- Optimizing our hyperparameters during the modeling process using cross-validation and a grid search
- Testing the fairness of our model by calculating the intersection of the model error rate and poverty rate across neighborhoods
Part 2: Modeling Philadelphia’s Housing Prices and Algorithmic Fairness
2.1 Load data from the Office of Property Assessment
Use the requests package to query the CARTO API for single-family property assessment data in Philadelphia for properties that had their last sale during 2022.
Sources:
- OpenDataPhilly
- Metadata
carto_url = "https://phl.carto.com/api/v2/sql"
where = "sale_date >= '2022-01-01' and sale_date <= '2022-12-31'"
where = where + " and category_code_description IN ('SINGLE FAMILY', 'Single Family')"
# Create the query
query = f"SELECT * FROM opa_properties_public WHERE {where}"
# Make the request
params = {"q": query, "format": "geojson", "where": where}
response = requests.get(carto_url, params=params)
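# Optional sanity check (not part of the original assignment): raise early if the request failed
response.raise_for_status()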
# Make the GeoDataFrame
sales2022 = gpd.GeoDataFrame.from_features(response.json(), crs="EPSG:4326")
2.2 Load data for census tracts and neighborhoods
Load various Philadelphia-based regions that we will use in our analysis.
- Census tracts can be downloaded from: https://opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson
- Neighborhoods can be downloaded from: https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson
tracts = gpd.read_file('2010_Tracts.geojson')
url = 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
neighbor = gpd.read_file(url).to_crs('EPSG:4326')
2.3 Spatially join the sales data and neighborhoods/census tracts
Perform a spatial join, such that each sale has an associated neighborhood and census tract.
Note: After performing the first spatial join, you will need to use the drop() function to remove the index_right column; otherwise, the second spatial join will raise an error about duplicate columns.
joined = gpd.sjoin(sales2022, tracts, predicate="within", how="left")
joined = gpd.sjoin(joined.drop('index_right', axis=1), neighbor, predicate="within", how="left")
2.4 Train a Random Forest on the sales data
In this step, you should follow the steps outlined in lecture to preprocess and train your model. We'll extend that analysis with a hyperparameter grid search to find the best model configuration. As you train your model, follow these steps:
Preprocessing requirements
- Trim the sales data to sales with prices between $3,000 and $1 million
- Set up a pipeline that includes both numerical columns and categorical columns
- Include a one-hot encoded variable for the neighborhood of the sale, instead of ZIP code. We don't want to include multiple location-based categories, since they encode much of the same information.
Training requirements
- Use a 70/30% training/test split and predict the log of the sales price.
- Use GridSearchCV to perform a k-fold cross-validation that optimizes at least 2 hyperparameters of the RandomForestRegressor.
- After fitting your model and finding the optimal hyperparameters, evaluate the score (R-squared) on the test set (the original 30% sample withheld).
Note: You don’t need to include additional features (such as spatial distance features) or perform any extra feature engineering beyond what is required above to receive full credit. Of course, you are always welcome to experiment!
# model
from sklearn.ensemble import RandomForestRegressor
# pipeline
from sklearn.pipeline import make_pipeline
# model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
# Preprocessing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
# trim data to sales between $3,000 and $1 million
valid = (joined['sale_price'] > 3000) & (joined['sale_price'] < 1e6)
joined = joined.loc[valid]
pd.set_option('display.max_columns', None)
joined.head()
[Output: the first five rows of joined — each sale carries the full set of OPA columns (sale_price, sale_date, total_livable_area, exterior_condition, ...) plus the spatially joined census tract fields (GEOID10, NAME10, ...) and neighborhood fields (name, listname, mapname, ...)]
# categorical
cat_cols = ['exterior_condition', 'listname']
 
# numeric
num_cols = ['fireplaces', 'garage_spaces', 'number_of_bathrooms', 'number_of_bedrooms', 'number_stories', 'total_area', 'total_livable_area']
features = cat_cols + num_cols + ['sale_price']
joined_trimmed = joined[features].dropna()
joined_trimmed
| | exterior_condition | interior_condition | quality_grade | parcel_shape | listname | view_type | year_built | fireplaces | garage_spaces | number_of_bathrooms | number_of_bedrooms | number_stories | total_area | total_livable_area | sale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4 | C | E | Brewerytown | I | 1920 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 468.0 | 650.0 | 213000 | 
| 2 | 4 | 3 | C | E | Glenwood | I | 1925 | 0.0 | 0.0 | 1.0 | 4.0 | 2.0 | 1583.0 | 1960.0 | 10000 | 
| 6 | 7 | 4 | C | E | Olney | I | 1925 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1425.0 | 1164.0 | 21000 | 
| 7 | 4 | 4 | C | E | Hunting Park | I | 1942 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 720.0 | 960.0 | 35000 | 
| 9 | 4 | 4 | C | E | Harrowgate | I | 1920 | 0.0 | 0.0 | 1.0 | 3.0 | 1.0 | 602.0 | 840.0 | 39600 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 24472 | 4 | 4 | C | E | Holmesburg | I | 1950 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1147.0 | 1034.0 | 163000 | 
| 24473 | 4 | 4 | C+ | E | Mayfair | H | 1940 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1600.0 | 1360.0 | 139900 | 
| 24474 | 4 | 4 | C | E | Oxford Circle | I | 1950 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 | 2813.0 | 1656.0 | 280000 | 
| 24477 | 4 | 3 | C | E | Mayfair | A | 1950 | 0.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1424.0 | 1292.0 | 280000 | 
| 24479 | 4 | 4 | C | E | Wissinoming | I | 1920 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1109.0 | 1076.0 | 150000 | 
17430 rows × 15 columns
joined_trimmed['sale_price'] = joined_trimmed['sale_price'].astype('int32')
joined_trimmed["log_price"] = np.log(joined_trimmed["sale_price"])
joined_trimmed
| | exterior_condition | interior_condition | quality_grade | parcel_shape | listname | fireplaces | garage_spaces | number_of_bathrooms | number_of_bedrooms | number_stories | total_area | total_livable_area | sale_price | log_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4 | C | E | Hunting Park | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 720.0 | 960.0 | 35000 | 10.463103 | 
| 1 | 7 | 4 | C | E | Olney | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1425.0 | 1164.0 | 21000 | 9.952278 | 
| 6 | 4 | 3 | C | E | Glenwood | 0.0 | 0.0 | 1.0 | 4.0 | 2.0 | 1583.0 | 1960.0 | 10000 | 9.210340 | 
| 7 | 4 | 4 | C+ | B | Chestnut Hill | 0.0 | 0.0 | 1.0 | 4.0 | 2.0 | 2457.0 | 2110.0 | 350000 | 12.765688 | 
| 8 | 4 | 4 | C | E | West Oak Lane | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1161.0 | 1120.0 | 70000 | 11.156251 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 24432 | 2 | 2 | C | E | Whitman | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 736.0 | 951.0 | 150000 | 11.918391 | 
| 24434 | 4 | 4 | C+ | E | Mayfair | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1600.0 | 1360.0 | 139900 | 11.848683 | 
| 24436 | 4 | 4 | C | E | Stadium District | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1053.0 | 960.0 | 190000 | 12.154779 | 
| 24440 | 4 | 4 | C | E | Richmond | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 | 741.0 | 1020.0 | 46000 | 10.736397 | 
| 24442 | 4 | 4 | C | E | Fox Chase | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 9000.0 | 1388.0 | 150000 | 11.918391 | 
17461 rows × 14 columns
# 70/30 training / test split 
train_set, test_set = train_test_split(joined_trimmed, test_size=0.3, random_state=42)
# the target labels
y_train = train_set["log_price"]
y_test = test_set["log_price"]
# features
feature_cols = cat_cols + num_cols
X_train = train_set[feature_cols]
X_test = test_set[feature_cols]
# column transformers + pipeline for random forest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
pipe = make_pipeline(transformer, RandomForestRegressor(n_estimators=100, random_state=42))
# Fit the training set
pipe.fit(X_train, y_train);
# Test Score
pipe.score(X_test, y_test)
0.5646447325116657
# Setup for GridSearchCV
model_step = "randomforestregressor"
param_grid = {
    f"{model_step}__n_estimators": [5, 10, 15, 20, 30, 50, 100, 200],
    f"{model_step}__max_depth": [9, 13, 21, 33, 51],
}
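# The model step's name in the param_grid keys must match the name that
# make_pipeline assigned automatically (the lowercased class name); it can
# be double-checked with:
print(pipe.named_steps.keys())  # dict_keys(['columntransformer', 'randomforestregressor'])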
# Create the grid and use 3-fold CV
grid = GridSearchCV(pipe, param_grid, cv=3, verbose=1)
# Run the search
grid.fit(X_train, y_train)
Fitting 3 folds for each of 40 candidates, totalling 120 fits
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['fireplaces',
                                                                          'garage_spaces',
                                                                          'number_of_bathrooms',
                                                                          'number_of_bedrooms',
                                                                          'number_stories',
                                                                          'total_area',
                                                                          'total_livable_area']),
                                                                        ('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['exterior_condition',
                                                                          'listname'])])),
                                       ('randomforestregressor',
                                        RandomForestRegressor(random_state=42))]),
             param_grid={'randomforestregressor__max_depth': [9, 13, 21, 33,
                                                              51],
                         'randomforestregressor__n_estimators': [5, 10, 15, 20,
                                                                 30, 50, 100,
                                                                 200]},
             verbose=1)
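Before inspecting the full winning pipeline, the chosen hyperparameters and their cross-validated score are available directly on the fitted grid; a minimal sketch using standard GridSearchCV attributes:
# The winning hyperparameter combination and its mean CV score
print(grid.best_params_)
print(f"Best cross-validated R-squared: {grid.best_score_:0.4f}")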
# The best estimator
grid.best_estimator_
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['fireplaces',
                                                   'garage_spaces',
                                                   'number_of_bathrooms',
                                                   'number_of_bedrooms',
                                                   'number_stories',
                                                   'total_area',
                                                   'total_livable_area']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['exterior_condition',
                                                   'listname'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=33, n_estimators=200,
                                       random_state=42))])
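One note before rebuilding the pipeline by hand below: with the default refit=True, GridSearchCV has already refit the best configuration on the full training set, so the held-out score could also be read straight off the grid object. A minimal sketch:
# Equivalent shortcut: score the refit best estimator on the withheld 30%
grid.score(X_test, y_test)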
new_pipe = make_pipeline(transformer, RandomForestRegressor(max_depth=33, n_estimators=200, random_state=42))
# Fit the training set
new_pipe.fit(X_train, y_train);
# Test Score
new_pipe.score(X_test, y_test)
0.5675150349542478
2.5 Calculate the percent error of your model predictions for each sale in the test set
Fit your best model and use it to make predictions on the test set.
Note: This should be the percent error in terms of sale price. You’ll need to convert if your model predicted the log of sales price!
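Concretely, since the model predicts log price, predictions must be exponentiated back to dollars before computing the error; a minimal sketch (model and actual_price are illustrative names, not from the original):
# Illustrative only: undo the log transform, then take the signed percent error
predicted_price = np.exp(model.predict(X_test))
percent_error = (predicted_price - actual_price) / actual_price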
# this step is not necessary, just curious what the MAPE is. 
# X_test = test_set[feature_cols]
# y_test = test_set["log_price"].values
# y_test_no_log = np.exp(y_test)
def evaluate_mape(model, X_test, y_test):
    """
    Given a model and test features/targets, print out the 
    mean absolute error and accuracy
    """
    # Make the predictions
    predictions = model.predict(X_test)
    # Absolute error in dollars (predictions are log price, so exponentiate first)
    abs_errors = abs(np.exp(predictions) - y_test)
    avg_error = np.mean(abs_errors)
    # Mean absolute percentage error
    mape = 100 * np.mean(abs_errors / y_test)
    # Accuracy
    accuracy = 100 - mape
    print("Model Performance")
    print(f"Average Absolute Error: {avg_error:0.4f}")
    print(f"Accuracy = {accuracy:0.2f}%.")
    return accuracy
y_test_no_log = np.exp(y_test)
base_accuracy = evaluate_mape(pipe, X_test, y_test_no_log)
Model Performance
Average Absolute Error: 69104.9786
Accuracy = 51.06%.
# This is the actual percent error calculation part
def percent_error(model, X_test, y_test):
    """
    Return the signed percent error (as a fraction) for each sale in the test set.
    """
    # Convert log-price predictions back to dollars
    predictions = np.exp(model.predict(X_test))
    # Signed (not absolute) percent error, relative to the true sale price
    p_errors = (predictions - y_test) / y_test
    return p_errors
p_errors = percent_error(pipe, X_test, y_test_no_log)
pipe.predict(X_test)
array([10.79687142, 12.01202048, 12.44467831, ..., 11.82508308,
       11.50655521, 12.76111153])
y_test
8682     11.245046
45       11.336188
6987     11.580584
16121    12.449019
8673     11.695247
           ...    
2252     12.254863
13927    13.422468
17434    11.695247
12885    11.289782
3753     13.060488
Name: log_price, Length: 5230, dtype: float64
p_errors
7209     0.095504
52      -0.068308
10900    3.547680
22227   -0.448610
18079    8.437723
           ...   
12894   -0.101753
15565   -0.107182
21955   -0.495361
3723    -0.141809
18913    0.245689
Name: log_price, Length: 5229, dtype: float64
2.6 Make a data frame with percent errors and census tract info for each sale in the test set
Create a data frame that has the property geometries, census tract data, and percent errors for all of the sales in the test set.
Notes
- When using the train_test_split() function, the index of the test data frame keeps the labels from the original sales data frame
- You can use this index to slice out the test data from the original sales data frame, which should include the census tract info and geometries
- Add a new column to this data frame holding the percent error data
- Make sure to use the percent error and not the absolute percent error
# Slice the test rows (with geometries and tract info) out of the original frame
test_df = joined.loc[test_set.index]
# p_errors shares this index, so the assignment aligns row-by-row
test_df['percent_error'] = p_errors
2.8 Plot a map of the median percent error by census tract
- You'll want to group your data frame of test sales by the GEOID10 column and take the median of your percent error column
- Merge the census tract geometries back in and use geopandas to plot.
median_p_errors = test_df.groupby('GEOID10')['percent_error'].median()
median_p_errors = median_p_errors.reset_index()
geo_median = tracts.merge(median_p_errors, on='GEOID10')
geo_median.explore(
    column="percent_error",
    tooltip="percent_error",
    popup=True,
    cmap="coolwarm",
    tiles="CartoDB positron",
    legend=True
    )
2.9 Compare the percent errors in Qualifying Census Tracts and other tracts
Qualifying Census Tracts are a poverty-based designation that HUD uses to allocate housing tax credits.
- I’ve included a list of the census tract names that qualify in Philadelphia
- Add a new column to your dataframe of test set sales that is True/False depending on if the tract is a QCT
- Then, group by this new column and calculate the median percent error
You should find that the algorithm's accuracy is significantly worse in these low-income, qualifying census tracts.
qct = ['5',
 '20',
 '22',
 '28.01',
 '30.01',
 '30.02',
 '31',
 '32',
 '33',
 '36',
 '37.01',
 '37.02',
 '39.01',
 '41.01',
 '41.02',
 '56',
 '60',
 '61',
 '62',
 '63',
 '64',
 '65',
 '66',
 '67',
 '69',
 '70',
 '71.01',
 '71.02',
 '72',
 '73',
 '74',
 '77',
 '78',
 '80',
 '81.01',
 '81.02',
 '82',
 '83.01',
 '83.02',
 '84',
 '85',
 '86.01',
 '86.02',
 '87.01',
 '87.02',
 '88.01',
 '88.02',
 '90',
 '91',
 '92',
 '93',
 '94',
 '95',
 '96',
 '98.01',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '110',
 '111',
 '112',
 '113',
 '119',
 '121',
 '122.01',
 '122.03',
 '131',
 '132',
 '137',
 '138',
 '139',
 '140',
 '141',
 '144',
 '145',
 '146',
 '147',
 '148',
 '149',
 '151.01',
 '151.02',
 '152',
 '153',
 '156',
 '157',
 '161',
 '162',
 '163',
 '164',
 '165',
 '167.01',
 '167.02',
 '168',
 '169.01',
 '169.02',
 '170',
 '171',
 '172.01',
 '172.02',
 '173',
 '174',
 '175',
 '176.01',
 '176.02',
 '177.01',
 '177.02',
 '178',
 '179',
 '180.02',
 '188',
 '190',
 '191',
 '192',
 '195.01',
 '195.02',
 '197',
 '198',
 '199',
 '200',
 '201.01',
 '201.02',
 '202',
 '203',
 '204',
 '205',
 '206',
 '208',
 '239',
 '240',
 '241',
 '242',
 '243',
 '244',
 '245',
 '246',
 '247',
 '249',
 '252',
 '253',
 '265',
 '267',
 '268',
 '271',
 '274.01',
 '274.02',
 '275',
 '276',
 '277',
 '278',
 '279.01',
 '279.02',
 '280',
 '281',
 '282',
 '283',
 '284',
 '285',
 '286',
 '287',
 '288',
 '289.01',
 '289.02',
 '290',
 '291',
 '293',
 '294',
 '298',
 '299',
 '300',
 '301',
 '302',
 '305.01',
 '305.02',
 '309',
 '311.01',
 '312',
 '313',
 '314.01',
 '314.02',
 '316',
 '318',
 '319',
 '321',
 '325',
 '329',
 '330',
 '337.01',
 '345.01',
 '357.01',
 '376',
 '377',
 '380',
 '381',
 '382',
 '383',
 '389',
 '390']
geo_median['qct'] = geo_median['NAME10'].isin(qct)
geo_median.groupby('qct')['percent_error'].median()
qct
False   -0.059781
True     0.039906
Name: percent_error, dtype: float64
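Since positive and negative errors can offset in a signed median, it may also be worth comparing error magnitudes by QCT status; a minimal sketch (the abs_percent_error column is an addition, not part of the original analysis):
# Compare the magnitude of the tract-level median errors, ignoring sign
geo_median['abs_percent_error'] = geo_median['percent_error'].abs()
geo_median.groupby('qct')['abs_percent_error'].median()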