import pandas as pd
import geopandas as gpd
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objects as go
import numpy as np
pd.set_option('display.max_columns', None)
The last part of this project zooms in on the collections of famous museums worldwide, including the MET, MoMA, the British Museum, the V&A, and the Philadelphia Museum of Art (used as an example for web scraping, even though it is not located in New York or London). This part is broken into four sections, starting with an overview of large museums worldwide and then moving into three different ways of acquiring information about museum collections online.
The first group is the MET and MoMA, which have the most well-constructed digital galleries and open-access data hosted on GitHub. With these ready-made datasets, it is easy to take a deep dive into the analysis.
The second group is the British Museum, which does not have a GitHub page but maintains a mature database for the abundant objects in its collection. Its website lets visitors download search results as a CSV, provided the results contain fewer than 20,000 items. Compared with web scraping, this query-and-download system gives visitors a far more detailed table of the search results, even though the download is capped at 20,000 records (to get more than that, the search has to be split into smaller queries and downloaded separately).
The last group is museums that showcase digital collections on their websites but have not yet opened the underlying data to the general public. With their digital galleries, users can scrape basic information about the objects. The two downsides are that 1) scraping many items is slow, and 2) the information acquired is very basic (usually only the artwork title, artist name, year, and geography appear on the search-result page). As you will see in section 2 of this document, the three approaches differ clearly in the level of detail available about each item.
Ultimately, this part of the project advocates for giving large, impactful museums more resources to establish open-access online databases. The public will undoubtedly enjoy them and may generate important insights for museums' future curation.
To offer some context, I have picked some of the most well-known and largest museums in the world. In terms of collection size, the British Museum is unarguably the champion of this group, followed by the Palace Museum in Beijing. Size, reputation, management, and the larger political agenda (e.g., whether it is a national museum or a museum representing important aspects of local culture) are important factors in digital accessibility and in the choice of whether to open up digitized items. Funding from governments and private donors also determines the extent to which a museum can digitize some or all of its items. Many national museums in non-English-speaking countries also prioritize their local languages for descriptions and research (as they should), so access in English varies. For ease of research, I have picked museums in the U.S. and U.K. so that I do not have to translate.
= pd.read_csv("./Final_Data/Museum Collection Numbers.csv").dropna()
world_museums = gpd.read_file("./Final_Data/world_museum_location.geojson")
world_museums_location 'Quantity'] = world_museums['Quantity'].astype(int)
world_museums['Museum'] = world_museums['Museum'].astype(object) world_museums[
= world_museums_location.merge(world_museums, on="Museum")
location
location.explore(="cartodbpositron",
tiles )
= px.bar(world_museums, x='Museum', y='Quantity',
fig ='Size of Collections at World Famous Museums (by Works)',
title=700, width=1000,
height="plotly_white")
template
fig.show()
I was surprised to find out that the MET and MoMA in NYC each have a GitHub account hosting their online databases, which contain detailed information on more than 400,000 and 100,000 items respectively. They might be the only two museums in the world so far to offer this level of transparency for their digitized collections.
https://github.com/MuseumofModernArt
https://github.com/metmuseum/openaccess
This section showcases the kinds of interesting exploration one can do with such open-access data, and thereby advocates for more museums to join these two in the wave of digitization and open access. In the next part (Part II), you will see how the depth and complexity of possible analysis differs greatly depending on the approach used to access the data.
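As a side note, the CSVs can in principle be read straight from the repositories linked above rather than from local copies. The sketch below is only illustrative: the repository name, branch, and file names are assumptions (check the links above), and the Met file is large enough that a local copy, as used in the rest of this notebook, is usually more practical.

# Sketch only: these raw-file URLs are assumptions about the repositories linked above
# (repository name, branch, and file names may differ).
MOMA_ARTWORKS_URL = "https://raw.githubusercontent.com/MuseumofModernArt/collection/main/Artworks.csv"
MOMA_ARTISTS_URL = "https://raw.githubusercontent.com/MuseumofModernArt/collection/main/Artists.csv"

moma_artwork_remote = pd.read_csv(MOMA_ARTWORKS_URL)
moma_artist_remote = pd.read_csv(MOMA_ARTISTS_URL)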
# Load Data
= pd.read_csv("./Final_Data/MetObjects.txt")
met = pd.read_csv("./Final_Data/Moma_Artists.txt")
moma_artist = pd.read_csv("./Final_Data/Moma_artworks.txt") moma_artwork
/var/folders/q3/y0zpvj752qg3_3nvpkx6v2300000gn/T/ipykernel_94597/3272275578.py:2: DtypeWarning:
Columns (5,7,10,11,12,13,14,34,35,36,37,38,39,40,41,42,43,44,45,46) have mixed types. Specify dtype option on import or set low_memory=False.
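This warning is harmless for the analysis here; it only means some Met columns mix types during chunked reading. If one wants to silence it, the warning's own suggestion of low_memory=False (or passing explicit dtypes) works:

# Optional: let pandas read the whole file before inferring column types,
# which avoids the mixed-type warning at the cost of more memory.
met = pd.read_csv("./Final_Data/MetObjects.txt", low_memory=False)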
The first quick glance is at the departmental breakdown. It is apparent that both museums' largest collections are in drawings and paintings, followed by photography. But the two museums take different approaches to organizing their departments: the MET manages its collection and departments by genre (a comprehensive way of distinguishing the temporal, geographical, and thematic features of artworks), fitting the large variety of works it holds; MoMA, on the other hand, organizes by medium, as many other modern art museums do.
met_dept = pd.DataFrame(met.groupby(['Department']).size()).reset_index()
met_dept = met_dept.rename(columns={met_dept.columns[1]: 'Counts'})

moma_dept = pd.DataFrame(moma_artwork.groupby(['Department']).size()).reset_index()
moma_dept = moma_dept.rename(columns={moma_dept.columns[1]: 'Counts'})
fig = go.Figure()

fig.add_trace(
    go.Bar(x=met_dept['Department'],
           y=met_dept['Counts'],
           name='Met Dept',
           marker=dict(color="#E81D2E")))

fig.add_trace(
    go.Bar(x=moma_dept['Department'],
           y=moma_dept['Counts'],
           name='MoMA Dept',
           marker=dict(color="Black")))

# add dropdown to toggle between the two museums
fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=list([
                dict(label="Met",
                     method="update",
                     args=[{"visible": [True, False]},
                           {"title": "Metropolitan Museum of Art Department Breakdown",
                            "annotations": []}]),
                dict(label="MoMA",
                     method="update",
                     args=[{"visible": [False, True]},
                           {"title": "Museum of Modern Art Department Breakdown",
                            "annotations": []}])
            ]))])

fig.update_layout(title_text="Number of Objects Held by Departments at Museums",
                  height=700,
                  template='plotly_white')
fig.show()
After departments, I was curious about the top 30 artists with the most artworks owned by each of the two museums. So I counted their artworks, ranked them, and grouped their work by the departmental categories explored in section 1.1. The results are shown in the bar charts below. As expected, most top artists predominantly produce photography or paintings, which make up the largest collections at both museums.
met_top_artist = pd.DataFrame(met.groupby(['Artist Display Name']).size()).reset_index()
met_top_artist = met_top_artist.rename(columns={met_top_artist.columns[1]: 'Counts'})

met_top_artist_with_Co = met_top_artist[(met_top_artist['Artist Display Name'] != 'Unknown') &
                                        ~met_top_artist['Artist Display Name'].str.contains('Anonymous', case=False) &
                                        (met_top_artist['Artist Display Name'] != 'Unidentified artist')]

met_top_artist = met_top_artist[(met_top_artist['Artist Display Name'] != 'Unknown') &
                                ~met_top_artist['Artist Display Name'].str.contains('Anonymous', case=False) &
                                ~met_top_artist['Artist Display Name'].str.contains('company', case=False) &
                                ~met_top_artist['Artist Display Name'].str.contains('Co.', case=False) &
                                (met_top_artist['Artist Display Name'] != 'Unidentified artist')]

moma_top_artist = pd.DataFrame(moma_artwork.groupby(['Artist']).size()).reset_index()
moma_top_artist = moma_top_artist.rename(columns={moma_top_artist.columns[1]: 'Counts'})
moma_top_artist = moma_top_artist[(moma_top_artist['Artist'] != 'Unknown') &
                                  (moma_top_artist['Artist'] != 'Anonymous') &
                                  ~moma_top_artist['Artist'].str.contains('Unidentified', case=False)]

met_top_30 = met_top_artist.loc[met_top_artist['Counts'].nlargest(30).index]
moma_top_30 = moma_top_artist.loc[moma_top_artist['Counts'].nlargest(30).index]

met_30_artworks = met[met['Artist Display Name'].isin(met_top_30['Artist Display Name'])]
moma_30_artworks = moma_artwork[moma_artwork['Artist'].isin(moma_top_30['Artist'])]

moma_30_work_breakdown = moma_30_artworks.groupby(['Artist', 'Department']).size().reset_index()
moma_30_work_breakdown = moma_30_work_breakdown.rename(columns={moma_30_work_breakdown.columns[2]: 'Counts'})

met_30_work_breakdown = met_30_artworks.groupby(['Artist Display Name', 'Department']).size().reset_index()
met_30_work_breakdown = met_30_work_breakdown.rename(columns={met_30_work_breakdown.columns[2]: 'Counts'})

moma_30_work_pivot = moma_30_work_breakdown.pivot(index='Artist', columns='Department', values="Counts").fillna(0)
met_30_work_pivot = met_30_work_breakdown.pivot(index='Artist Display Name', columns='Department', values="Counts").fillna(0)

order = moma_top_30['Artist'].values
moma_30_work_pivot.index = pd.CategoricalIndex(moma_30_work_pivot.index, categories=order, ordered=True)
moma_30_work_pivot = moma_30_work_pivot.sort_index()

order = met_top_30['Artist Display Name'].values
met_30_work_pivot.index = pd.CategoricalIndex(met_30_work_pivot.index, categories=order, ordered=True)
met_30_work_pivot = met_30_work_pivot.sort_index()
fig = go.Figure()

colors = ['#3A6C8C', '#0F3D3F', '#B3D8EB', '#728F4C', '#CEE0C6', '#242545', '#EF819C', '#F4B8D4']

headings = ['Architecture & Design', 'Architecture & Design - Image Archive', 'Drawings & Prints', 'Film', 'Fluxus Collection', 'Media and Performance', 'Painting & Sculpture', 'Photography']

x_data = np.transpose(moma_30_work_pivot.values)
y_data = moma_30_work_pivot.index.values

for heading, xd, color in zip(headings, x_data, colors):
    fig.add_trace(go.Bar(
        x=xd,
        y=y_data,
        name=heading,
        orientation='h',
        marker=dict(
            color=color,
            line=dict(color='rgb(248, 248, 249)', width=1)
        )
    ))

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Artists whom MoMA Holds Most Works of'
)
fig.show()
fig = go.Figure()

colors = ['#3A6C8C', '#dc596d', '#B3D8EB', '#949EC3', '#8B7099', '#242545', '#EF819C', '#F4B8D4', '#728F4C', '#CEE0C6', '#ffbb93', '#fa958f']

headings = met_30_work_pivot.columns.to_numpy()

x_data = np.transpose(met_30_work_pivot.values)
y_data = met_30_work_pivot.index.values

for heading, xd, color in zip(headings, x_data, colors):
    fig.add_trace(go.Bar(
        x=xd,
        y=y_data,
        name=heading,
        orientation='h',
        marker=dict(
            color=color,
            line=dict(color='rgb(248, 248, 249)', width=1)
        )
    ))

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Artists whom the Met Holds Most Works of'
)
fig.show()
Along the same lines, I analyzed the top nationalities of artists whose works are held at the MET and MoMA. The results show that both museums hold the most artworks by American artists, though MoMA's American holdings are dramatically larger than those of any other nationality. British, French, Japanese, Italian, German, and Dutch artists form the next leading group. One disclaimer is that the analysis uses different data sources: the MET figure counts every occurrence of a nationality in the artwork list, while the MoMA figure comes from its separate artist list. The MET analysis may therefore double-count artists who have multiple works in the collection, whereas MoMA's count reflects unique artists. Despite the potential inaccuracy in absolute values, the comparison is still useful for identifying the leading artist nationalities within each museum.
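For a rough like-for-like comparison, one could first reduce the MET table to unique artist names before counting nationalities. The sketch below is an approximation (one row per display name, still splitting multi-valued nationality strings) and is not part of the analysis that follows, which uses the occurrence-based count.

# Sketch: approximate a unique-artist nationality count for the MET by keeping
# one row per artist display name before splitting the nationality strings.
met_unique_artists = (met.dropna(subset=['Artist Nationality', 'Artist Display Name'])
                         .drop_duplicates(subset=['Artist Display Name']))
unique_counts = (met_unique_artists['Artist Nationality']
                 .str.split(r'[\s,|]', expand=True)
                 .stack()
                 .value_counts())
unique_counts.head(12)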
filtered = met.dropna(subset=['Artist Nationality'])
word_counts = filtered['Artist Nationality'].str.split(r'[\s,|]', expand=True).stack().value_counts()
word_counts = pd.DataFrame(word_counts)

nationality = word_counts.iloc[0:12].reset_index()
nationality = nationality.drop(index=[1, 6])
nationality = nationality.rename(columns={nationality.columns[0]: 'Nationalities', nationality.columns[1]: 'Counts'})

fig = px.bar(x=nationality['Counts'].values, y=nationality['Nationalities'].values,
             orientation='h')

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 10 Nationalities of Artists at the MET'
)
fig.show()
moma_nationality = pd.DataFrame(moma_artist.groupby(['Nationality']).size().reset_index())
moma_nationality = moma_nationality.rename(columns={moma_nationality.columns[1]: 'Counts'})
moma_nationality = moma_nationality.loc[moma_nationality['Counts'].nlargest(10).index]

fig = px.bar(x=moma_nationality['Counts'].values, y=moma_nationality['Nationality'].values,
             orientation='h')

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 10 Nationalities of Artists at MoMA'
)
fig.show()
This section compares three approaches to acquiring data from museums' online galleries and digital collections, using Chinese artworks as an example (a genre whose collections are usually smaller than others and contain more uncertain information, such as unknown artists or dates). It aims to show the different levels of analysis each kind of data supports.
Utilizing the MET's open data, I analyzed the top 30 tags / recurring themes of Chinese artworks. This is the kind of data analytics made possible by large-scale digitization of an in-house collection. It is quite interesting to see how themes emerge once the works are put into a database. The same data can also be used for other types of analysis, as shown in the first part.
met_china_art = met.loc[(met['Culture'] == "China")]
met_china_art.head()
Object Number | Is Highlight | Is Timeline Work | Is Public Domain | Object ID | Gallery Number | Department | AccessionYear | Object Name | Title | Culture | Period | Dynasty | Reign | Portfolio | Constituent ID | Artist Role | Artist Prefix | Artist Display Name | Artist Display Bio | Artist Suffix | Artist Alpha Sort | Artist Nationality | Artist Begin Date | Artist End Date | Artist Gender | Artist ULAN URL | Artist Wikidata URL | Object Date | Object Begin Date | Object End Date | Medium | Dimensions | Credit Line | Geography Type | City | State | County | Country | Region | Subregion | Locale | Locus | Excavation | River | Classification | Rights and Reproduction | Link Resource | Object Wikidata URL | Metadata Date | Repository | Tags | Tags AAT URL | Tags Wikidata URL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6933 | 13.31.15 | False | False | True | 7411 | 774 | The American Wing | 1913.0 | Shaving mug | Shaving Mug | China | NaN | NaN | NaN | NaN | 188 | Maker | E. & W. Bennett Pottery | American, Baltimore, Maryland 1847–1857 | Bennett, E. & W., Pottery | American | 1847 | 1857 | NaN | http://vocab.getty.edu/page/ulan/500524602 | https://www.wikidata.org/wiki/Q98446707 | ca. 1853 | 1850 | 1853 | Mottled brown earthenware | H. 4 3/8 in. (11.1 cm) | Rogers Fund, 1913 | Made in | Baltimore | NaN | NaN | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | http://www.metmuseum.org/art/collection/search... | https://www.wikidata.org/wiki/Q116342297 | NaN | Metropolitan Museum of Art, New York, NY | Men | http://vocab.getty.edu/page/aat/300025928 | https://www.wikidata.org/wiki/Q8441 | ||
6979 | 33.120.164 | False | False | True | 7457 | 774 | The American Wing | 1933.0 | Buckle | Shoe Buckle | China | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ca. 1800 | 1797 | 1800 | Silver | 2 3/8 x 1 3/4 in. (6 x 4.4 cm) | Bequest of Alphonso T. Clearwater, 1933 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | http://www.metmuseum.org/art/collection/search... | https://www.wikidata.org/wiki/Q116341420 | NaN | Metropolitan Museum of Art, New York, NY | NaN | NaN | NaN |
30296 | 96.14.1896 | False | False | True | 35967 | NaN | Asian Art | 1896.0 | Panel | NaN | China | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18th century or earlier | 1650 | 1799 | Paint; on leather | 9 1/4 x 5 3/8 in. (23.5 x 13.7 cm) | Gift of Mr. and Mrs. H. O. Havemeyer, 1896 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Leatherwork | NaN | http://www.metmuseum.org/art/collection/search... | NaN | NaN | Metropolitan Museum of Art, New York, NY | Musical Instruments|Men|Elephants|Flowers | http://vocab.getty.edu/page/aat/300041620|http... | https://www.wikidata.org/wiki/Q34379|https://w... |
30297 | 09.3 | False | False | True | 35968 | NaN | Asian Art | 1909.0 | Pictorial map | 清 佚名 台南地區荷蘭城堡|Forts Zeelandia and Provintia ... | China | NaN | NaN | NaN | NaN | 3750 | Artist | Unidentified artist | Chinese, active 19th century | Unidentified artist | NaN | NaN | NaN | 19th century | 1800 | 1899 | Wall hanging; ink and color on deerskin | Image: 59 1/4 × 80 3/4 in. (150.5 × 205.1 cm)\... | Gift of J. Pierpont Morgan, 1909 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Paintings | NaN | http://www.metmuseum.org/art/collection/search... | https://www.wikidata.org/wiki/Q79003782 | NaN | Metropolitan Museum of Art, New York, NY | Maps|Houses|Cities|Boats|Ships | http://vocab.getty.edu/page/aat/300028094|http... | https://www.wikidata.org/wiki/Q4006|https://ww... | |||||
30298 | 12.37.135 | False | False | False | 35969 | NaN | Asian Art | 1912.0 | Hanging scroll | NaN | China | Qing dynasty (1644–1911) | NaN | NaN | NaN | 1214 | Artist | Jin Zunnian | Chinese, active early 18th century | Jin Zunnian | Chinese | 1700 | 1800 | NaN | NaN | NaN | dated 1732 | 1732 | 1732 | Hanging scroll; ink and color on silk | 67 x 38 in. (170.2 x 96.5 cm) | Rogers Fund, 1912 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Paintings | NaN | http://www.metmuseum.org/art/collection/search... | NaN | NaN | Metropolitan Museum of Art, New York, NY | NaN | NaN | NaN |
met_china_art = met_china_art.dropna(subset=['Tags'])

tags_counts = met_china_art['Tags'].str.split(r'[\s,|]', expand=True).stack().value_counts()
tags_counts = pd.DataFrame(tags_counts).reset_index()
tags_counts = tags_counts.rename(columns={tags_counts.columns[1]: 'Counts'})

tags_counts_met = tags_counts.loc[tags_counts['Counts'].nlargest(30).index]

fig = px.bar(y=tags_counts_met['index'], x=tags_counts_met['Counts'])

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Tags / Themes of Chinese artwork at the MET'
)
fig.show()
A similar depth of analysis into the content and themes of artworks can be conducted on the dataset downloaded from the British Museum. While the British Museum allows downloading all search results (capped at 20,000 items), some other museums, such as the V&A (Victoria and Albert Museum), only allow downloading one page at a time (15 or 50 items), which is not efficient for large-scale analysis. The V&A states that it offers an API, but that is not explored as part of this project.
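As a practical note on the 20,000-item cap: if a search returns more results than that, one workaround is to narrow the query (for example by date range or object type) into several exports and concatenate the CSVs afterwards. A minimal sketch, assuming two hypothetical export files:

# Sketch: combine several capped CSV exports into one table and drop any records
# that appear in more than one export. The part1/part2 file names are hypothetical.
parts = [
    pd.read_csv("./Final_Data/3/British_Museum_Result_part1.csv"),
    pd.read_csv("./Final_Data/3/British_Museum_Result_part2.csv"),
]
bm_combined = pd.concat(parts, ignore_index=True).drop_duplicates()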
In addition to recurring themes, I also took a quick glance at the most-used materials for Chinese artworks. Such a glance might interest lay visitors who do not have much background in Chinese history or art history in general.
= pd.read_csv("./Final_Data/3/British_Museum_Result.csv") BM_Result
= BM_Result.dropna(subset=['Materials'])
BM_Materials = BM_Result.dropna(subset=['Subjects']) BM_Subjects
= BM_Subjects['Subjects'].str.split('[\s,|;]', expand=True).stack().value_counts()
BM_Subjects_counts = pd.DataFrame(BM_Subjects_counts).reset_index()
BM_Subjects_counts = BM_Subjects_counts.rename(columns={BM_Subjects_counts.columns[0]: 'Subjects', BM_Subjects_counts.columns[1]: 'Counts' })
BM_Subjects_counts
= BM_Subjects_counts.loc[BM_Subjects_counts['Counts'].nlargest(34).index]
BM_Subjects_counts = BM_Subjects_counts.drop(index=[0,5,16,17]) BM_Subjects_counts
= px.bar(y=BM_Subjects_counts['Subjects'], x=BM_Subjects_counts['Counts'])
fig
fig.update_layout(=800,
height=1500,
width=dict(autorange="reversed"),
yaxis='stack',
barmode=dict(l=120, r=10, t=140, b=80),
margin=True,
showlegend='plotly_white',
template=True,
autosize='Top 30 Tags / Themes of Chinese artwork at the British Museum'
title
)
fig.show()
BM_Materials_counts = BM_Materials['Materials'].str.split(r'[\s,|;]', expand=True).stack().value_counts()
BM_Materials_counts = pd.DataFrame(BM_Materials_counts).reset_index()
BM_Materials_counts = BM_Materials_counts.rename(columns={BM_Materials_counts.columns[0]: 'Materials', BM_Materials_counts.columns[1]: 'Counts'})

BM_Materials_counts = BM_Materials_counts.loc[BM_Materials_counts['Counts'].nlargest(31).index]
BM_Materials_counts = BM_Materials_counts.drop(index=[1])

fig = px.bar(y=BM_Materials_counts['Materials'], x=BM_Materials_counts['Counts'])

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Materials of Chinese artwork on display at the British Museum'
)
fig.show()
The last approach is to scrape the online galleries directly. The example here is the Philadelphia Museum of Art's Chinese art collection. Compared with the earlier two approaches, the information scraped from the web is far less detailed. Particularly for genres like Chinese art, where many works have an unknown artist or production date, the scraped information is of limited use, since we cannot efficiently scrape full details for many objects. Hence, the potential analyses are limited.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from time import sleep
driver = webdriver.Chrome()
url = "https://philamuseum.org/search/collections?from=0&size=48&filters=%7B%22department%22%3A%5B%22East%20Asian%20Art%22%5D%2C%22place%22%3A%5B%22China%22%5D%7D"
response = driver.get(url)
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')

selector = ".searchcard"
tables = soup.select(selector)
results = []

max_pages = 10

# The base URL we will be using
base_url = "https://philamuseum.org/search/collections?"

# loop over each page of search results
for page_num in range(1, max_pages + 1):
    print(f"Processing page {page_num}...")

    obj_num = (page_num - 1) * 48

    # Update the URL hash for this page number and make the combined URL
    url_hash = f"from={obj_num}&size=48&filters=%7B%22department%22%3A%5B%22East%20Asian%20Art%22%5D%2C%22place%22%3A%5B%22China%22%5D%7D"
    url = base_url + url_hash

    # Go to the page and wait 5 seconds for it to render
    driver.get(url)
    sleep(5)

    # Re-parse the rendered page and select the object cards on it
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    objects = soup.select(selector)
    print("Number of Objects = ", len(objects))

    # loop over each object card on this page
    page_results = []
    for artwork in objects:
        # artwork name
        artwork_name = artwork.select_one(".card-title").text

        # artist, geography, time
        artist_geo_time = artwork.select_one(".card-body").text

        # Save the result
        page_results.append([artwork_name, artist_geo_time])

    # Create a dataframe and save
    col_names = ["artwork_name", "artist_geo_time"]
    df = pd.DataFrame(page_results, columns=col_names)

    results.append(df)

    print("sleeping for 10 seconds between calls")
    sleep(10)

# Finally, concatenate all the results
results = pd.concat(results, axis=0).reset_index(drop=True)
Processing page 1...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 2...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 3...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 4...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 5...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 6...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 7...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 8...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 9...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 10...
Number of Objects = 48
sleeping for 10 seconds between calls
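A side note on robustness: the fixed five-second sleep assumes the result cards have rendered by then. On a re-run, one could instead wait explicitly for the cards to appear. The sketch below reuses the driver, url, and .searchcard selector from above and uses Selenium's WebDriverWait with expected_conditions; the 20-second timeout is an arbitrary assumption.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# A minimal sketch: wait (up to an assumed 20 s) until at least one result card
# is present before parsing, instead of sleeping for a fixed interval.
driver.get(url)
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".searchcard"))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
objects = soup.select(".searchcard")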
results[['Artist', 'Geography', 'Time']] = pd.DataFrame(results['artist_geo_time'].str.split(',').tolist(), index=results.index)

results.head(10)
artwork_name | artist_geo_time | Artist | Geography | Time | |
---|---|---|---|---|---|
0 | Reception Hall | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
1 | Portrait of a Manchu Lady | Mangguli, Chinese (Manchu), 1672 - 1736 | Mangguli | Chinese (Manchu) | 1672 - 1736 |
2 | Jar | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
3 | Covered Cup | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
4 | Cup | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
5 | Teapot | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
6 | Bowl | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
7 | Vase in the form of an Archaic Bronze Vessel | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
8 | Wall Vase (P'ing) | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
9 | Vase (P'ing) | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
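Since the scraped artist_geo_time string is free-form, a row containing an extra comma (for example, a maker name that itself includes a comma) would give the split above more than three columns and make the assignment fail. A slightly more defensive variant, purely as a sketch, caps the split at two commas so the result always has at most three columns; odd formats may still end up with fields in the wrong column, but the code cannot crash.

# Sketch: limit to two splits so the result has at most three columns,
# padding any missing pieces with NaN instead of raising an error.
parts = results['artist_geo_time'].str.split(',', n=2, expand=True)
parts = parts.reindex(columns=range(3))   # pad to exactly three columns if needed
parts.columns = ['Artist', 'Geography', 'Time']
results[['Artist', 'Geography', 'Time']] = parts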