5 - Exploring Yelp Reviews in Philly

In this assignment, we’ll explore restaurant review data available through the Yelp Dataset Challenge. The dataset includes Yelp data for user reviews and business information for many metropolitan areas. I’ve already downloaded this dataset (8 GB total!) and extracted out the data files for reviews and restaurants in Philadelphia. I’ve placed these data files into the data directory in this repository.

This assignment is broken into two parts:

Part 1: Analyzing correlations between restaurant reviews and census data

We’ll explore the relationship between restaurant reviews and the income levels of the restaurant’s surrounding area.

Part 2: Exploring the impact of fast food restaurants

We’ll run a sentiment analysis on reviews of fast food restaurants and estimate income levels in neighborhoods with fast food restaurants. We’ll test how well our sentiment analysis works by comparing the number of stars to the sentiment of reviews.

Background readings - Does sentiment analysis work? - The Geography of Taste: Using Yelp to Study Urban Culture

Code
import pandas as pd
import geopandas as gpd
import altair as alt
import numpy as np

1. Correlating restaurant ratings and income levels

In this part, we’ll use the census API to download household income data and explore how it correlates with restaurant review data.

1.1 Query the Census API

Use the cenpy package to download median household income in the past 12 months by census tract from the 2021 ACS 5-year data set for your county of interest.

You have two options to find the correct variable names: - Search through: https://api.census.gov/data/2021/acs/acs5/variables.html - Initialize an API connection and use the .varslike() function to search for the proper keywords

At the end of this step, you should have a pandas DataFrame holding the income data for all census tracts within the county being analyzed. Feel free to rename your variable from the ACS so it has a more meaningful name!

Caution

Some census tracts won’t have any value because there are not enough households in that tract. The census will use a negative number as a default value for those tracts. You can safely remove those tracts from the analysis!

Code
import cenpy
/Users/annzhang/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/libpysal/cg/alpha_shapes.py:39: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def nb_dist(x, y):
/Users/annzhang/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/libpysal/cg/alpha_shapes.py:165: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def get_faces(triangle):
/Users/annzhang/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/libpysal/cg/alpha_shapes.py:199: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def build_faces(faces, triangles_is, num_triangles, num_faces_single):
/Users/annzhang/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/libpysal/cg/alpha_shapes.py:261: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def nb_mask_faces(mask, faces):
Code
acs = cenpy.remote.APIConnection("ACSDT5Y2021")
Code
matches = acs.varslike(
    pattern="household income",
    by="label",
).sort_index()
Code
pd.set_option('display.max_colwidth', None)
matches.head(10)
label concept predicateType group limit predicateOnly hasGeoCollectionSupport attributes required
B19013A_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER) int B19013A 0 NaN NaN B19013A_001EA,B19013A_001M,B19013A_001MA NaN
B19013B_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (BLACK OR AFRICAN AMERICAN ALONE HOUSEHOLDER) int B19013B 0 NaN NaN B19013B_001EA,B19013B_001M,B19013B_001MA NaN
B19013C_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (AMERICAN INDIAN AND ALASKA NATIVE ALONE HOUSEHOLDER) int B19013C 0 NaN NaN B19013C_001EA,B19013C_001M,B19013C_001MA NaN
B19013D_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (ASIAN ALONE HOUSEHOLDER) int B19013D 0 NaN NaN B19013D_001EA,B19013D_001M,B19013D_001MA NaN
B19013E_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE HOUSEHOLDER) int B19013E 0 NaN NaN B19013E_001EA,B19013E_001M,B19013E_001MA NaN
B19013F_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (SOME OTHER RACE ALONE HOUSEHOLDER) int B19013F 0 NaN NaN B19013F_001EA,B19013F_001M,B19013F_001MA NaN
B19013G_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (TWO OR MORE RACES HOUSEHOLDER) int B19013G 0 NaN NaN B19013G_001EA,B19013G_001M,B19013G_001MA NaN
B19013H_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE, NOT HISPANIC OR LATINO HOUSEHOLDER) int B19013H 0 NaN NaN B19013H_001EA,B19013H_001M,B19013H_001MA NaN
B19013I_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) (HISPANIC OR LATINO HOUSEHOLDER) int B19013I 0 NaN NaN B19013I_001EA,B19013I_001M,B19013I_001MA NaN
B19013_001E Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars) MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS) int B19013 0 NaN NaN B19013_001EA,B19013_001M,B19013_001MA NaN
Code
variables = [
    "NAME",
    "B19013_001E" #median income in the past 12 month
]
Code
philly_county_code = "101"
pa_state_code = "42"

philly_income = acs.query(cols=variables, geo_unit="block group:*", geo_filter={"state": pa_state_code, "county": philly_county_code, "tract": "*"}, apikey='b3abcecc231fa30ccaa18cb5e854c30f1982fe3f')
philly_income.head(10)

philly_income['B19013_001E'] = philly_income['B19013_001E'].replace('-666666666', '0').astype('int32')
Code
philly_income.head()
NAME B19013_001E state county tract block group
0 Block Group 1, Census Tract 1.01, Philadelphia County, Pennsylvania 0 42 101 000101 1
1 Block Group 2, Census Tract 1.01, Philadelphia County, Pennsylvania 0 42 101 000101 2
2 Block Group 3, Census Tract 1.01, Philadelphia County, Pennsylvania 0 42 101 000101 3
3 Block Group 4, Census Tract 1.01, Philadelphia County, Pennsylvania 97210 42 101 000101 4
4 Block Group 5, Census Tract 1.01, Philadelphia County, Pennsylvania 109269 42 101 000101 5

1.2 Download census tracts from the Census and merge the data from part 1.1

  • Download census tracts for the desired geography using the pygris package
  • Merge the downloaded census tracts with the household income DataFrame
Code
import pygris
philly_blocks = pygris.block_groups(state=pa_state_code, county=philly_county_code, year=2021)
Code
philly_merged = philly_blocks.merge(
    philly_income,
    left_on=["STATEFP", "COUNTYFP", "TRACTCE", "BLKGRPCE"],
    right_on=["state", "county", "tract", "block group"],)
Code
philly_merged.head(1)
STATEFP COUNTYFP TRACTCE BLKGRPCE GEOID NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry NAME B19013_001E state county tract block group
0 42 101 989100 2 421019891002 Block Group 2 G5030 S 373653 7060 +40.0373207 -075.0177378 POLYGON ((-75.02195 40.03435, -75.02191 40.03451, -75.02173 40.03482, -75.02151 40.03515, -75.02135 40.03544, -75.02129 40.03562, -75.02130 40.03582, -75.02134 40.03607, -75.02150 40.03638, -75.02163 40.03663, -75.02181 40.03702, -75.02168 40.03709, -75.01967 40.03813, -75.01902 40.03847, -75.01843 40.03877, -75.01783 40.03908, -75.01703 40.03950, -75.01621 40.03991, -75.01535 40.03894, -75.01500 40.03910, -75.01432 40.03945, -75.01254 40.03803, -75.01216 40.03772, -75.01195 40.03755, -75.01131 40.03704, -75.01225 40.03654, -75.01530 40.03485, -75.01790 40.03340, -75.01797 40.03336, -75.01817 40.03350, -75.01862 40.03366, -75.01959 40.03374, -75.01986 40.03376, -75.02000 40.03377, -75.02011 40.03377, -75.02022 40.03378, -75.02043 40.03379, -75.02083 40.03378, -75.02125 40.03374, -75.02142 40.03375, -75.02157 40.03381, -75.02173 40.03392, -75.02180 40.03401, -75.02189 40.03411, -75.02195 40.03422, -75.02195 40.03435)) Block Group 2, Census Tract 9891, Philadelphia County, Pennsylvania 0 42 101 989100 2

1.3 Load the restaurants data

The Yelp dataset includes data for 7,350 restaurants across the city. Load the data from the data/ folder and use the latitude and longitude columns to create a GeoDataFrame after loading the JSON data. Be sure to set the right CRS on when initializing the GeoDataFrame!

Notes

The JSON data is in a “records” format. To load it, you’ll need to pass the following keywords:

  • orient='records'
  • lines=True
Code
data = pd.read_json("data/restaurants_philly.json.gz", orient='records', lines=True)
gpd_Data = gpd.GeoDataFrame(data, geometry=gpd.points_from_xy(data.longitude, data.latitude), crs="EPSG:4326")
Code
gpd_Data.head()
business_id latitude longitude name review_count stars categories geometry
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries POINT (-75.15556 39.95551)
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395)
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322)
3 QdN72BWoyFypdGJhhI5r7g 39.939825 -75.157447 Bar One 65 4.0 Cocktail Bars, Bars, Italian, Nightlife, Restaurants POINT (-75.15745 39.93982)
4 Mjboz24M9NlBeiOJKLEd_Q 40.022466 -75.218314 DeSandro on Main 41 3.0 Pizza, Restaurants, Salad, Soup POINT (-75.21831 40.02247)

1.4 Add tract info for each restaurant

Do a spatial join to identify which census tract each restaurant is within. Make sure each dataframe has the same CRS!

At the end of this step, you should have a new dataframe with a column identifying the tract number for each restaurant.

Code
tract_restaurant = gpd.sjoin(gpd_Data, philly_merged.loc[:,["geometry", "tract", "B19013_001E"]], how="left", op="within")
tract_restaurant = tract_restaurant.rename(columns={'B19013_001E': 'Median Household Income'})
tract_restaurant.head()
/Users/annzhang/mambaforge/envs/musa-550-fall-2023/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3448: FutureWarning: The `op` parameter is deprecated and will be removed in a future release. Please use the `predicate` parameter instead.
  if await self.run_code(code, result, async_=asy):
/var/folders/q3/y0zpvj752qg3_3nvpkx6v2300000gn/T/ipykernel_14881/1177895681.py:1: UserWarning: CRS mismatch between the CRS of left geometries and the CRS of right geometries.
Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: EPSG:4326
Right CRS: EPSG:4269

  tract_restaurant = gpd.sjoin(gpd_Data, philly_merged.loc[:,["geometry", "tract", "B19013_001E"]], how="left", op="within")
business_id latitude longitude name review_count stars categories geometry index_right tract Median Household Income
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, Bakeries POINT (-75.15556 39.95551) 988.0 000200 42308.0
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 658.0 000102 198125.0
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 689.0 001500 107292.0
3 QdN72BWoyFypdGJhhI5r7g 39.939825 -75.157447 Bar One 65 4.0 Cocktail Bars, Bars, Italian, Nightlife, Restaurants POINT (-75.15745 39.93982) 774.0 001800 103125.0
4 Mjboz24M9NlBeiOJKLEd_Q 40.022466 -75.218314 DeSandro on Main 41 3.0 Pizza, Restaurants, Salad, Soup POINT (-75.21831 40.02247) 715.0 021000 86146.0

1.5 Add income data to your restaurant data

Add the income data to your dataframe from the previous step, merging the census data based on the tract that each restaurant is within.

Code
# this step is completed in 1.4

1.6 Make a plot of median household income vs. Yelp stars

Our dataset has the number of stars for each restaurant, rounded to the nearest 0.5 star. In this step, create a line plot that shows the average income value for each stars category (e.g., all restaurants with 1 star, 1.5 stars, 2 stars, etc.)

While their are multiple ways to do this, the seaborn.lineplot() is a great option. This can show the average value in each category as well as 95% uncertainty intervals. Use this function to plot the stars (“x”) vs. average income (“y”) for all of our restaurants, using the dataframe from last step. Be sure to format your figure to make it look nice!

Question: Is there a correlation between a restaurant’s ratings and the income levels of its surrounding neighborhood?

Code
stars = tract_restaurant.loc[:, ['stars','Median Household Income']].groupby(['stars']).median(['Median Household Income']).reset_index()
Code
import seaborn as sns
Code
sns.set_theme()
sns.lineplot(data=stars, x="stars", y="Median Household Income")
<Axes: xlabel='stars', ylabel='Median Household Income'>

As suggested by the graph above, there is an overall trend of higher ratings of restaurants associated with higeher median household income in the tracts that restaurants are located in.