import pandas as pd
import geopandas as gpd
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objects as go
import numpy as np
pd.set_option('display.max_columns', None)
The last part of this project zooms in on the collections of famous museums worldwide, including the MET, MoMA, the British Museum, the V&A, and the Philadelphia Museum of Art (used as an example for web scraping, even though it is not located in New York or London). This part is broken into four sections, starting with an overview of large museums worldwide and then moving into three different ways of acquiring information about museum collections online.
The first group is the MET and MoMA, which have the most well-constructed digital galleries and open-access data hosted on GitHub. With these ready-made datasets, it is easy to take a deep dive into the analysis.
The second group is the British Museum, which does not have a GitHub page but maintains a mature database for the abundant objects in its collection. Its website lets visitors download search results as a CSV, provided the results contain fewer than 20,000 items. Compared with web scraping, this query-and-download system gives visitors a far more detailed table of the search results, even though the download is capped at 20,000 records (to get more than that, the search has to be split into smaller queries and downloaded separately).
The last group is museums that showcase digital collections on their websites but have not yet opened the underlying data to the general public. With their digital galleries, users can scrape basic information about the objects. The two downsides are that 1) scraping many items is slow, and 2) the information acquired is very basic (usually only the artwork title, artist name, year, and geography appear on the search-result page). As you will see in section 2 of this document, the three approaches differ clearly in the level of detail available about each item.
Ultimately, this part of the project advocates for giving large, impactful museums more resources to establish open-access online databases. The public will undoubtedly enjoy them and may generate important insights for museums' future curation.
To offer some context, I have picked some of the most well-known and largest museums in the world. In terms of collection size, the British Museum is unarguably the champion of this group, followed by the Palace Museum in Beijing. Size, reputation, management, and the larger political agenda (e.g., whether it is a national museum or a museum representing important aspects of local culture) are important factors in digital accessibility and in the choice of whether to open up digitized items. Funding from governments and private donors also determines the extent to which a museum can digitize some or all of its items. Many national museums in non-English-speaking countries also prioritize their local languages for descriptions and research (as they should), so access in English varies. For ease of research, I have picked museums in the U.S. and U.K. so that I do not have to translate.
= pd.read_csv("./Final_Data/Museum Collection Numbers.csv").dropna()
world_museums = gpd.read_file("./Final_Data/world_museum_location.geojson")
world_museums_location 'Quantity'] = world_museums['Quantity'].astype(int)
world_museums['Museum'] = world_museums['Museum'].astype(object) world_museums[
= world_museums_location.merge(world_museums, on="Museum")
location
location.explore(="cartodbpositron",
tiles )
= px.bar(world_museums, x='Museum', y='Quantity',
fig ='Size of Collections at World Famous Museums (by Works)',
title=700, width=1000,
height="plotly_white")
template
fig.show()
I was surprised to find out that the MET and MoMA in NYC each have a GitHub account hosting their online databases, which contain detailed information on more than 400,000 and 100,000 items respectively. They might be the only two museums in the world so far to offer this level of transparency for their digitized collections.
https://github.com/MuseumofModernArt
https://github.com/metmuseum/openaccess
This section showcases the kinds of interesting exploration one can do with such open-access data, and thereby advocates for more museums to join these two in the wave of digitization and open access. In the next part (Part II), you will see how the depth and complexity of possible analysis differs greatly depending on the approach used to access the data.
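As a side note, the CSVs can in principle be read straight from the repositories linked above rather than from local copies. The sketch below is only illustrative: the repository name, branch, and file names are assumptions (check the links above), and the Met file is large enough that a local copy, as used in the rest of this notebook, is usually more practical.

# Sketch only: these raw-file URLs are assumptions about the repositories linked above
# (repository name, branch, and file names may differ).
MOMA_ARTWORKS_URL = "https://raw.githubusercontent.com/MuseumofModernArt/collection/main/Artworks.csv"
MOMA_ARTISTS_URL = "https://raw.githubusercontent.com/MuseumofModernArt/collection/main/Artists.csv"

moma_artwork_remote = pd.read_csv(MOMA_ARTWORKS_URL)
moma_artist_remote = pd.read_csv(MOMA_ARTISTS_URL)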
# Load Data
= pd.read_csv("./Final_Data/MetObjects.txt")
met = pd.read_csv("./Final_Data/Moma_Artists.txt")
moma_artist = pd.read_csv("./Final_Data/Moma_artworks.txt") moma_artwork
/var/folders/q3/y0zpvj752qg3_3nvpkx6v2300000gn/T/ipykernel_94597/3272275578.py:2: DtypeWarning:
Columns (5,7,10,11,12,13,14,34,35,36,37,38,39,40,41,42,43,44,45,46) have mixed types. Specify dtype option on import or set low_memory=False.
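This warning is harmless for the analysis here; it only means some Met columns mix types during chunked reading. If one wants to silence it, the warning's own suggestion of low_memory=False (or passing explicit dtypes) works:

# Optional: let pandas read the whole file before inferring column types,
# which avoids the mixed-type warning at the cost of more memory.
met = pd.read_csv("./Final_Data/MetObjects.txt", low_memory=False)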
The first quick glance is at the departmental breakdown. It is apparent that both museums' largest collections are in drawings and paintings, followed by photography. But the two museums take different approaches to organizing their departments: the MET manages its collection and departments by genre (a comprehensive way of distinguishing the temporal, geographical, and thematic features of artworks), fitting the large variety of works it holds; MoMA, on the other hand, organizes by medium, as many other modern art museums do.
met_dept = pd.DataFrame(met.groupby(['Department']).size()).reset_index()
met_dept = met_dept.rename(columns={met_dept.columns[1]: 'Counts'})

moma_dept = pd.DataFrame(moma_artwork.groupby(['Department']).size()).reset_index()
moma_dept = moma_dept.rename(columns={moma_dept.columns[1]: 'Counts'})
fig = go.Figure()

fig.add_trace(
    go.Bar(x=met_dept['Department'],
           y=met_dept['Counts'],
           name='Met Dept',
           marker=dict(color="#E81D2E")))

fig.add_trace(
    go.Bar(x=moma_dept['Department'],
           y=moma_dept['Counts'],
           name='MoMA Dept',
           marker=dict(color="Black")))

# add dropdown to toggle between the two museums
fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=list([
                dict(label="Met",
                     method="update",
                     args=[{"visible": [True, False]},
                           {"title": "Metropolitan Museum of Art Department Breakdown",
                            "annotations": []}]),
                dict(label="MoMA",
                     method="update",
                     args=[{"visible": [False, True]},
                           {"title": "Museum of Modern Art Department Breakdown",
                            "annotations": []}])
            ]))])

fig.update_layout(title_text="Number of Objects Held by Departments at Museums",
                  height=700,
                  template='plotly_white')
fig.show()
After departments, I was curious about the top 30 artists with the most artworks owned by each of the two museums. So I counted their artworks, ranked them, and grouped their work by the departmental categories explored in section 1.1. The results are shown in the bar charts below. As expected, most top artists predominantly produce photography or paintings, which make up the largest collections at both museums.
met_top_artist = pd.DataFrame(met.groupby(['Artist Display Name']).size()).reset_index()
met_top_artist = met_top_artist.rename(columns={met_top_artist.columns[1]: 'Counts'})

met_top_artist_with_Co = met_top_artist[(met_top_artist['Artist Display Name'] != 'Unknown') &
                                        ~met_top_artist['Artist Display Name'].str.contains('Anonymous', case=False) &
                                        (met_top_artist['Artist Display Name'] != 'Unidentified artist')]

met_top_artist = met_top_artist[(met_top_artist['Artist Display Name'] != 'Unknown') &
                                ~met_top_artist['Artist Display Name'].str.contains('Anonymous', case=False) &
                                ~met_top_artist['Artist Display Name'].str.contains('company', case=False) &
                                ~met_top_artist['Artist Display Name'].str.contains('Co.', case=False) &
                                (met_top_artist['Artist Display Name'] != 'Unidentified artist')]

moma_top_artist = pd.DataFrame(moma_artwork.groupby(['Artist']).size()).reset_index()
moma_top_artist = moma_top_artist.rename(columns={moma_top_artist.columns[1]: 'Counts'})
moma_top_artist = moma_top_artist[(moma_top_artist['Artist'] != 'Unknown') &
                                  (moma_top_artist['Artist'] != 'Anonymous') &
                                  ~moma_top_artist['Artist'].str.contains('Unidentified', case=False)]

met_top_30 = met_top_artist.loc[met_top_artist['Counts'].nlargest(30).index]
moma_top_30 = moma_top_artist.loc[moma_top_artist['Counts'].nlargest(30).index]

met_30_artworks = met[met['Artist Display Name'].isin(met_top_30['Artist Display Name'])]
moma_30_artworks = moma_artwork[moma_artwork['Artist'].isin(moma_top_30['Artist'])]

moma_30_work_breakdown = moma_30_artworks.groupby(['Artist', 'Department']).size().reset_index()
moma_30_work_breakdown = moma_30_work_breakdown.rename(columns={moma_30_work_breakdown.columns[2]: 'Counts'})

met_30_work_breakdown = met_30_artworks.groupby(['Artist Display Name', 'Department']).size().reset_index()
met_30_work_breakdown = met_30_work_breakdown.rename(columns={met_30_work_breakdown.columns[2]: 'Counts'})

moma_30_work_pivot = moma_30_work_breakdown.pivot(index='Artist', columns='Department', values="Counts").fillna(0)
met_30_work_pivot = met_30_work_breakdown.pivot(index='Artist Display Name', columns='Department', values="Counts").fillna(0)

order = moma_top_30['Artist'].values
moma_30_work_pivot.index = pd.CategoricalIndex(moma_30_work_pivot.index, categories=order, ordered=True)
moma_30_work_pivot = moma_30_work_pivot.sort_index()

order = met_top_30['Artist Display Name'].values
met_30_work_pivot.index = pd.CategoricalIndex(met_30_work_pivot.index, categories=order, ordered=True)
met_30_work_pivot = met_30_work_pivot.sort_index()
fig = go.Figure()

colors = ['#3A6C8C', '#0F3D3F', '#B3D8EB', '#728F4C', '#CEE0C6', '#242545', '#EF819C', '#F4B8D4']

headings = ['Architecture & Design', 'Architecture & Design - Image Archive', 'Drawings & Prints', 'Film', 'Fluxus Collection', 'Media and Performance', 'Painting & Sculpture', 'Photography']

x_data = np.transpose(moma_30_work_pivot.values)
y_data = moma_30_work_pivot.index.values

for heading, xd, color in zip(headings, x_data, colors):
    fig.add_trace(go.Bar(
        x=xd,
        y=y_data,
        name=heading,
        orientation='h',
        marker=dict(
            color=color,
            line=dict(color='rgb(248, 248, 249)', width=1)
        )
    ))

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Artists whom MoMA Holds Most Works of'
)
fig.show()
fig = go.Figure()

colors = ['#3A6C8C', '#dc596d', '#B3D8EB', '#949EC3', '#8B7099', '#242545', '#EF819C', '#F4B8D4', '#728F4C', '#CEE0C6', '#ffbb93', '#fa958f']

headings = met_30_work_pivot.columns.to_numpy()

x_data = np.transpose(met_30_work_pivot.values)
y_data = met_30_work_pivot.index.values

for heading, xd, color in zip(headings, x_data, colors):
    fig.add_trace(go.Bar(
        x=xd,
        y=y_data,
        name=heading,
        orientation='h',
        marker=dict(
            color=color,
            line=dict(color='rgb(248, 248, 249)', width=1)
        )
    ))

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Artists whom the Met Holds Most Works of'
)
fig.show()
Along the same lines, I analyzed the top nationalities of artists whose works are held at the MET and MoMA. The results show that both museums hold the most artworks by American artists, though MoMA's American holdings are dramatically larger than those of any other nationality. British, French, Japanese, Italian, German, and Dutch artists form the next leading group. One disclaimer is that the analysis uses different data sources: the MET figure counts every occurrence of a nationality in the artwork list, while the MoMA figure comes from its separate artist list. The MET analysis may therefore double-count artists who have multiple works in the collection, whereas MoMA's count reflects unique artists. Despite the potential inaccuracy in absolute values, the comparison is still useful for identifying the leading artist nationalities within each museum.
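For a rough like-for-like comparison, one could first reduce the MET table to unique artist names before counting nationalities. The sketch below is an approximation (one row per display name, still splitting multi-valued nationality strings) and is not part of the analysis that follows, which uses the occurrence-based count.

# Sketch: approximate a unique-artist nationality count for the MET by keeping
# one row per artist display name before splitting the nationality strings.
met_unique_artists = (met.dropna(subset=['Artist Nationality', 'Artist Display Name'])
                         .drop_duplicates(subset=['Artist Display Name']))
unique_counts = (met_unique_artists['Artist Nationality']
                 .str.split(r'[\s,|]', expand=True)
                 .stack()
                 .value_counts())
unique_counts.head(12)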
filtered = met.dropna(subset=['Artist Nationality'])
word_counts = filtered['Artist Nationality'].str.split(r'[\s,|]', expand=True).stack().value_counts()
word_counts = pd.DataFrame(word_counts)

nationality = word_counts.iloc[0:12].reset_index()
nationality = nationality.drop(index=[1, 6])
nationality = nationality.rename(columns={nationality.columns[0]: 'Nationalities', nationality.columns[1]: 'Counts'})

fig = px.bar(x=nationality['Counts'].values, y=nationality['Nationalities'].values,
             orientation='h')

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 10 Nationalities of Artists at the MET'
)
fig.show()
moma_nationality = pd.DataFrame(moma_artist.groupby(['Nationality']).size().reset_index())
moma_nationality = moma_nationality.rename(columns={moma_nationality.columns[1]: 'Counts'})
moma_nationality = moma_nationality.loc[moma_nationality['Counts'].nlargest(10).index]

fig = px.bar(x=moma_nationality['Counts'].values, y=moma_nationality['Nationality'].values,
             orientation='h')

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 10 Nationalities of Artists at MoMA'
)
fig.show()
This section compares three approaches to acquiring data from museums' online galleries and digital collections, using Chinese artworks as an example (a genre whose collections are usually smaller than others and contain more uncertain information, such as unknown artists or dates). It aims to show the different levels of analysis each kind of data supports.
Utilizing the MET's open data, I analyzed the top 30 tags / recurring themes of Chinese artworks. This is the kind of data analytics made possible by large-scale digitization of an in-house collection. It is quite interesting to see how themes emerge once the works are put into a database. The same data can also be used for other types of analysis, as shown in the first part.
met_china_art = met.loc[(met['Culture'] == "China")]
met_china_art.head()
Object Number | Is Highlight | Is Timeline Work | Is Public Domain | Object ID | Gallery Number | Department | AccessionYear | Object Name | Title | Culture | Period | Dynasty | Reign | Portfolio | Constituent ID | Artist Role | Artist Prefix | Artist Display Name | Artist Display Bio | Artist Suffix | Artist Alpha Sort | Artist Nationality | Artist Begin Date | Artist End Date | Artist Gender | Artist ULAN URL | Artist Wikidata URL | Object Date | Object Begin Date | Object End Date | Medium | Dimensions | Credit Line | Geography Type | City | State | County | Country | Region | Subregion | Locale | Locus | Excavation | River | Classification | Rights and Reproduction | Link Resource | Object Wikidata URL | Metadata Date | Repository | Tags | Tags AAT URL | Tags Wikidata URL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6933 | 13.31.15 | False | False | True | 7411 | 774 | The American Wing | 1913.0 | Shaving mug | Shaving Mug | China | NaN | NaN | NaN | NaN | 188 | Maker | E. & W. Bennett Pottery | American, Baltimore, Maryland 1847–1857 | Bennett, E. & W., Pottery | American | 1847 | 1857 | NaN | http://vocab.getty.edu/page/ulan/500524602 | https://www.wikidata.org/wiki/Q98446707 | ca. 1853 | 1850 | 1853 | Mottled brown earthenware | H. 4 3/8 in. (11.1 cm) | Rogers Fund, 1913 | Made in | Baltimore | NaN | NaN | United States | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | http://www.metmuseum.org/art/collection/search... | https://www.wikidata.org/wiki/Q116342297 | NaN | Metropolitan Museum of Art, New York, NY | Men | http://vocab.getty.edu/page/aat/300025928 | https://www.wikidata.org/wiki/Q8441 | ||
6979 | 33.120.164 | False | False | True | 7457 | 774 | The American Wing | 1933.0 | Buckle | Shoe Buckle | China | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ca. 1800 | 1797 | 1800 | Silver | 2 3/8 x 1 3/4 in. (6 x 4.4 cm) | Bequest of Alphonso T. Clearwater, 1933 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | http://www.metmuseum.org/art/collection/search... | https://www.wikidata.org/wiki/Q116341420 | NaN | Metropolitan Museum of Art, New York, NY | NaN | NaN | NaN |
30296 | 96.14.1896 | False | False | True | 35967 | NaN | Asian Art | 1896.0 | Panel | NaN | China | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18th century or earlier | 1650 | 1799 | Paint; on leather | 9 1/4 x 5 3/8 in. (23.5 x 13.7 cm) | Gift of Mr. and Mrs. H. O. Havemeyer, 1896 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Leatherwork | NaN | http://www.metmuseum.org/art/collection/search... | NaN | NaN | Metropolitan Museum of Art, New York, NY | Musical Instruments|Men|Elephants|Flowers | http://vocab.getty.edu/page/aat/300041620|http... | https://www.wikidata.org/wiki/Q34379|https://w... |
30297 | 09.3 | False | False | True | 35968 | NaN | Asian Art | 1909.0 | Pictorial map | 清 佚名 台南地區荷蘭城堡|Forts Zeelandia and Provintia ... | China | NaN | NaN | NaN | NaN | 3750 | Artist | Unidentified artist | Chinese, active 19th century | Unidentified artist | NaN | NaN | NaN | 19th century | 1800 | 1899 | Wall hanging; ink and color on deerskin | Image: 59 1/4 × 80 3/4 in. (150.5 × 205.1 cm)\... | Gift of J. Pierpont Morgan, 1909 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Paintings | NaN | http://www.metmuseum.org/art/collection/search... | https://www.wikidata.org/wiki/Q79003782 | NaN | Metropolitan Museum of Art, New York, NY | Maps|Houses|Cities|Boats|Ships | http://vocab.getty.edu/page/aat/300028094|http... | https://www.wikidata.org/wiki/Q4006|https://ww... | |||||
30298 | 12.37.135 | False | False | False | 35969 | NaN | Asian Art | 1912.0 | Hanging scroll | NaN | China | Qing dynasty (1644–1911) | NaN | NaN | NaN | 1214 | Artist | Jin Zunnian | Chinese, active early 18th century | Jin Zunnian | Chinese | 1700 | 1800 | NaN | NaN | NaN | dated 1732 | 1732 | 1732 | Hanging scroll; ink and color on silk | 67 x 38 in. (170.2 x 96.5 cm) | Rogers Fund, 1912 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Paintings | NaN | http://www.metmuseum.org/art/collection/search... | NaN | NaN | Metropolitan Museum of Art, New York, NY | NaN | NaN | NaN |
met_china_art = met_china_art.dropna(subset=['Tags'])

tags_counts = met_china_art['Tags'].str.split(r'[\s,|]', expand=True).stack().value_counts()
tags_counts = pd.DataFrame(tags_counts).reset_index()
tags_counts = tags_counts.rename(columns={tags_counts.columns[1]: 'Counts'})

tags_counts_met = tags_counts.loc[tags_counts['Counts'].nlargest(30).index]

fig = px.bar(y=tags_counts_met['index'], x=tags_counts_met['Counts'])

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Tags / Themes of Chinese artwork at the MET'
)
fig.show()
A similar depth of analysis into the content and themes of artworks can be conducted on the dataset downloaded from the British Museum. While the British Museum allows downloading all search results (capped at 20,000 items), some other museums, such as the V&A (Victoria and Albert Museum), only allow downloading one page at a time (15 or 50 items), which is not efficient for large-scale analysis. The V&A states that it offers an API, but that is not explored as part of this project.
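As a practical note on the 20,000-item cap: if a search returns more results than that, one workaround is to narrow the query (for example by date range or object type) into several exports and concatenate the CSVs afterwards. A minimal sketch, assuming two hypothetical export files:

# Sketch: combine several capped CSV exports into one table and drop any records
# that appear in more than one export. The part1/part2 file names are hypothetical.
parts = [
    pd.read_csv("./Final_Data/3/British_Museum_Result_part1.csv"),
    pd.read_csv("./Final_Data/3/British_Museum_Result_part2.csv"),
]
bm_combined = pd.concat(parts, ignore_index=True).drop_duplicates()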
In addition to recurring themes, I also took a quick glance at the most-used materials for Chinese artworks. Such a glance might interest lay visitors who do not have much background in Chinese history or art history in general.
= pd.read_csv("./Final_Data/3/British_Museum_Result.csv") BM_Result
= BM_Result.dropna(subset=['Materials'])
BM_Materials = BM_Result.dropna(subset=['Subjects']) BM_Subjects
= BM_Subjects['Subjects'].str.split('[\s,|;]', expand=True).stack().value_counts()
BM_Subjects_counts = pd.DataFrame(BM_Subjects_counts).reset_index()
BM_Subjects_counts = BM_Subjects_counts.rename(columns={BM_Subjects_counts.columns[0]: 'Subjects', BM_Subjects_counts.columns[1]: 'Counts' })
BM_Subjects_counts
= BM_Subjects_counts.loc[BM_Subjects_counts['Counts'].nlargest(34).index]
BM_Subjects_counts = BM_Subjects_counts.drop(index=[0,5,16,17]) BM_Subjects_counts
= px.bar(y=BM_Subjects_counts['Subjects'], x=BM_Subjects_counts['Counts'])
fig
fig.update_layout(=800,
height=1500,
width=dict(autorange="reversed"),
yaxis='stack',
barmode=dict(l=120, r=10, t=140, b=80),
margin=True,
showlegend='plotly_white',
template=True,
autosize='Top 30 Tags / Themes of Chinese artwork at the British Museum'
title
)
fig.show()
BM_Materials_counts = BM_Materials['Materials'].str.split(r'[\s,|;]', expand=True).stack().value_counts()
BM_Materials_counts = pd.DataFrame(BM_Materials_counts).reset_index()
BM_Materials_counts = BM_Materials_counts.rename(columns={BM_Materials_counts.columns[0]: 'Materials', BM_Materials_counts.columns[1]: 'Counts'})

BM_Materials_counts = BM_Materials_counts.loc[BM_Materials_counts['Counts'].nlargest(31).index]
BM_Materials_counts = BM_Materials_counts.drop(index=[1])

fig = px.bar(y=BM_Materials_counts['Materials'], x=BM_Materials_counts['Counts'])

fig.update_layout(
    height=800,
    width=1500,
    yaxis=dict(autorange="reversed"),
    barmode='stack',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=True,
    template='plotly_white',
    autosize=True,
    title='Top 30 Materials of Chinese artwork on display at the British Museum'
)
fig.show()
The last approach is to scrape the online galleries directly. The example here is the Philadelphia Museum of Art's Chinese art collection. Compared with the earlier two approaches, the information scraped from the web is far less detailed. Particularly for genres like Chinese art, where many works have an unknown artist or production date, the scraped information is of limited use, since we cannot efficiently scrape full details for many objects. Hence, the potential analyses are limited.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from time import sleep
driver = webdriver.Chrome()
url = "https://philamuseum.org/search/collections?from=0&size=48&filters=%7B%22department%22%3A%5B%22East%20Asian%20Art%22%5D%2C%22place%22%3A%5B%22China%22%5D%7D"
response = driver.get(url)
html_content = driver.page_source
soup = BeautifulSoup(html_content, 'html.parser')

selector = ".searchcard"
tables = soup.select(selector)
results = []

max_pages = 10

# The base URL we will be using
base_url = "https://philamuseum.org/search/collections?"

# loop over each page of search results
for page_num in range(1, max_pages + 1):
    print(f"Processing page {page_num}...")

    obj_num = (page_num - 1) * 48

    # Update the URL hash for this page number and make the combined URL
    url_hash = f"from={obj_num}&size=48&filters=%7B%22department%22%3A%5B%22East%20Asian%20Art%22%5D%2C%22place%22%3A%5B%22China%22%5D%7D"
    url = base_url + url_hash

    # Go to the page and wait 5 seconds for it to render
    driver.get(url)
    sleep(5)

    # Re-parse the rendered page and select the object cards on it
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    objects = soup.select(selector)
    print("Number of Objects = ", len(objects))

    # loop over each object card on this page
    page_results = []
    for artwork in objects:
        # artwork name
        artwork_name = artwork.select_one(".card-title").text

        # artist, geography, time
        artist_geo_time = artwork.select_one(".card-body").text

        # Save the result
        page_results.append([artwork_name, artist_geo_time])

    # Create a dataframe and save
    col_names = ["artwork_name", "artist_geo_time"]
    df = pd.DataFrame(page_results, columns=col_names)

    results.append(df)

    print("sleeping for 10 seconds between calls")
    sleep(10)

# Finally, concatenate all the results
results = pd.concat(results, axis=0).reset_index(drop=True)
Processing page 1...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 2...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 3...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 4...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 5...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 6...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 7...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 8...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 9...
Number of Objects = 48
sleeping for 10 seconds between calls
Processing page 10...
Number of Objects = 48
sleeping for 10 seconds between calls
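A side note on robustness: the fixed five-second sleep assumes the result cards have rendered by then. On a re-run, one could instead wait explicitly for the cards to appear. The sketch below reuses the driver, url, and .searchcard selector from above and uses Selenium's WebDriverWait with expected_conditions; the 20-second timeout is an arbitrary assumption.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# A minimal sketch: wait (up to an assumed 20 s) until at least one result card
# is present before parsing, instead of sleeping for a fixed interval.
driver.get(url)
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".searchcard"))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
objects = soup.select(".searchcard")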
results[['Artist', 'Geography', 'Time']] = pd.DataFrame(results['artist_geo_time'].str.split(',').tolist(), index=results.index)

results.head(10)
artwork_name | artist_geo_time | Artist | Geography | Time | |
---|---|---|---|---|---|
0 | Reception Hall | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
1 | Portrait of a Manchu Lady | Mangguli, Chinese (Manchu), 1672 - 1736 | Mangguli | Chinese (Manchu) | 1672 - 1736 |
2 | Jar | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
3 | Covered Cup | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
4 | Cup | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
5 | Teapot | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
6 | Bowl | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
7 | Vase in the form of an Archaic Bronze Vessel | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
8 | Wall Vase (P'ing) | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
9 | Vase (P'ing) | Artist/maker unknown, Chinese | Artist/maker unknown | Chinese | None |
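Since the scraped artist_geo_time string is free-form, a row containing an extra comma (for example, a maker name that itself includes a comma) would give the split above more than three columns and make the assignment fail. A slightly more defensive variant, purely as a sketch, caps the split at two commas so the result always has at most three columns; odd formats may still end up with fields in the wrong column, but the code cannot crash.

# Sketch: limit to two splits so the result has at most three columns,
# padding any missing pieces with NaN instead of raising an error.
parts = results['artist_geo_time'].str.split(',', n=2, expand=True)
parts = parts.reindex(columns=range(3))   # pad to exactly three columns if needed
parts.columns = ['Artist', 'Geography', 'Time']
results[['Artist', 'Geography', 'Time']] = parts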