LESSON 6 OF 6

Manipulating the API data

Now that you know how to interpret query results, you can retrieve that data through code. The following example searches for all objects whose titles contain "Cat".

import requests

# Find all instances of titles that include "Cat"
r = requests.get('https://api.harvardartmuseums.org/object?title=cat&apikey=0000-0000-0000-0000')

# Convert data to JSON format
data = r.json()

# Extract the info and records
info = data['info']
records = data['records']

# Print the records
print(records)

Tip: If you do not have an IDE for running programs, try an online Python interpreter. Paste our examples (after changing the API key to your own) and press "Run" to see what kind of data the code produces.


The code above prints the query's record data as one continuous string. Now that we have all of the objects together, we can separate each one in order to work with the data inside individual objects.
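If the continuous string is hard to read, the standard json module can pretty-print it. A small optional sketch, shown here on an abbreviated sample list rather than live API output; you can pass the real records list from the code above in the same way:

```python
import json

# An abbreviated sample of a records list returned by the API
sample_records = [
    {'title': 'The Cat', 'classification': 'Prints'},
    {'title': 'Cat on a Mat', 'classification': 'Drawings'},
]

# indent=2 spreads nested JSON across readable, indented lines
print(json.dumps(sample_records, indent=2))
```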

Let's go through each record using a for loop, which iterates through the list of records one by one. For each record, we'll print both the title and the classification of the object, accessed as record['title'] and record['classification']. Using the same syntax, you can access any other value in the JSON object.

import requests

# Find all instances of titles that include "Cat"
r = requests.get('https://api.harvardartmuseums.org/object?title=cat&apikey=0000-0000-0000-0000')

# Convert data to JSON format
data = r.json()

# Extract the info and records
info = data['info']
records = data['records']

# For each record of objects, print the title and classification
for record in records:
    print(record['title'] + ' --- ' + record['classification'])

Once you run this code, you may notice that it prints only 10 objects, even though there are over 300 objects with the word "Cat" in the title. That is because the code only operates on the first page of records. As a quick fix, you can increase the page size to 100 with the size parameter if you just want a larger sample. But to get the full dataset, you must create a pagination function: you wrap part of the original code in a function that is called again whenever a "next" page is available.

import requests


def pagination(url):
    r = requests.get(url)

    # Convert data to JSON format
    data = r.json()

    # Extract the info and records
    info = data['info']
    records = data['records']

    # For each record of objects, print the title and classification
    for record in records:
        print(record['title'] + ' --- ' + record['classification'])

    # If there is a next page, repeat the pagination function;
    # 'next' is absent on the last page, so the calls end there
    if info.get('next'):
        pagination(info['next'])

# Query to find all objects that have cat in the title
url = "https://api.harvardartmuseums.org/object?title=cat&apikey=0000-0000-0000-0000"

# Perform Pagination function defined above on the query
pagination(url)
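Recursion works fine here, but an equivalent while loop avoids any worry about Python's recursion limit on very deep paging. The sketch below runs offline: fetch is a stand-in for requests.get(url).json(), returning mock pages shaped like the API's info/records structure.

```python
# Mock pages shaped like API responses: each has 'info' and 'records'
pages = {
    'page1': {'info': {'next': 'page2'},
              'records': [{'title': 'Cat', 'classification': 'Prints'}]},
    'page2': {'info': {},  # last page: no 'next' key
              'records': [{'title': 'Black Cat', 'classification': 'Drawings'}]},
}

def fetch(url):
    # Stand-in for requests.get(url).json()
    return pages[url]

titles = []
url = 'page1'
while url:
    data = fetch(url)
    for record in data['records']:
        titles.append(record['title'] + ' --- ' + record['classification'])
    # Advance to the next page, or stop when 'next' is absent
    url = data['info'].get('next')

print(titles)
```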

Now that you know how to retrieve the data, you can use code to manipulate it, answering interesting questions and revealing broader trends. The following example is more advanced but builds on key concepts from the first example: pagination, sending queries, and accessing JSON values in a record. The key difference is that we now add pieces of the records to a Python dictionary and use that dictionary to compare records and their values.

Some background for the example: Harvard Art Museums tracks whenever an object is physically viewed at the Art Study Center, where the public can request to view objects not on display in the galleries. By querying for all recorded study center views, you can parse each view and build a table of the total number of views per object. From there, you can discover which objects are viewed the most.

To do this, first determine your query:

Study Center Views:

https://api.harvardartmuseums.org/activity?type=studycenterviews&apikey=0000-0000-0000-0000

(the base URL, followed by the activity resource, the type filter, and your API key)
Warning: This query matches more than 300,000 records. Because the API will only page through the first 10,000 records of a query, this dataset is too large to retrieve in full. To fix this, we must narrow the dataset by applying more filters.
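You can see how large a result set is before trying to page through it: every response's info block includes a totalrecords field. A sketch on a sample info block (the field name matches the API's responses, but the numbers here are made up):

```python
# A sample info block like the one returned alongside the records
sample_info = {'totalrecords': 300000, 'page': 1}

# Pages past the first 10,000 records are not retrievable,
# so check the total before trying to paginate a huge result set
if sample_info['totalrecords'] > 10000:
    print('Result set too large; add more filters before paginating.')
```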

To get under 10,000 records, let's retrieve only Study Center views from 2020. To do this, we must apply another filter. Even though each record contains the date of the view, the Activity resource does not have a dedicated date filter. To filter by date, we must use Elasticsearch URI Search via the q parameter. Essentially, this lets us build a custom filter on any field available in the record.

Study Center Views in 2020:

https://api.harvardartmuseums.org/activity?type=studycenterviews&q=date:>2020&apikey=0000-0000-0000-0000

(the base URL, the activity resource, the type filter, the q date filter, and your API key)


In this example, the q filter matches on each record's "date" field, retrieving records whose view dates fall after the start of 2020.
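One detail worth knowing: the : and > characters in the q filter should be percent-encoded in a URL. The standard library's urllib.parse.urlencode handles this for you (requests does the same if you pass the dictionary below as its params argument):

```python
from urllib.parse import urlencode

# Query parameters for the 2020 study center views
params = {
    'type': 'studycenterviews',
    'q': 'date:>2020',
    'apikey': '0000-0000-0000-0000',
}

# urlencode percent-encodes the ':' and '>' inside the q filter
url = 'https://api.harvardartmuseums.org/activity?' + urlencode(params)
print(url)
```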

To access this data in Python, you must form a request. The following lines of code will print the JSON records in your terminal:

import requests

# Find all study center views recorded after the start of 2020
r = requests.get('https://api.harvardartmuseums.org/activity?type=studycenterviews&q=date:>2020&apikey=0000-0000-0000-0000')

# Convert data to JSON format
data = r.json()

# Extract the info and records
info = data['info']
records = data['records']

print(records)

Now that the code can retrieve data, it will parse each record, updating a dictionary that counts how many times each object has been viewed. Then it will find the ID of the most-viewed object.

# Create a dictionary mapping each object ID to its view count
views = {}

for record in records:

    # Convert the object ID to a string (used later to build a URL)
    objectid = str(record['objectid'])

    # If the object already has a recorded view, increment by one
    if objectid in views:
        views[objectid] += 1
    # Otherwise, record the object's first view
    else:
        views[objectid] = 1

# Take the dictionary of recorded views and get the ID with the most views
object_id = max(views, key=views.get)
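As an aside, the standard library's collections.Counter expresses the same tally without the if/else branch, and most_common gives you the top entry directly. A sketch on a few mock activity records:

```python
from collections import Counter

# A few mock activity records; real ones carry many more fields
mock_records = [
    {'objectid': 101}, {'objectid': 202}, {'objectid': 101}, {'objectid': 101},
]

# One Counter entry per view; no explicit if/else needed
view_counts = Counter(str(record['objectid']) for record in mock_records)

# most_common(1) returns a list holding the top (object_id, count) pair
top_id, top_count = view_counts.most_common(1)[0]
print(top_id, top_count)
```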

The code then looks up the view count for the most-viewed object in our dictionary. Next, it sends a query for that individual object's data, which gives us access to its title and artist. Finally, it prints a message that pulls the object data together.

# Retrieve the top object's number of views, convert to string
object_views = str(views[object_id])

# Fetch the information of that object
object_url = 'https://api.harvardartmuseums.org/object/' + object_id + '?apikey=0000-0000-0000-0000'
object_info = requests.get(object_url)

# Convert to JSON format
object_data = object_info.json()

# Get title and artist of object
object_title = object_data["title"]
object_artist = object_data["people"][0]["name"]

print('The object that has been viewed the most in 2020 is ' + object_title + ' by ' + object_artist + ' (' + object_id + ')' + ' at ' + object_views + ' views')

Although this code works, it only sees the first 10 records, since the data is split into pages. To access all 8,000+ available records, the code must iterate through each page using a pagination function. To reduce the running time, you can also increase the size of each page with the size parameter.

import requests

# Create a dictionary for all views with their respective object
views = {}


def viewsCounter(url):
    r = requests.get(url)

    # Convert data to JSON format
    data = r.json()

    # Extract the info and records
    info = data['info']
    records = data['records']

    for record in records:

        # Convert the object ID to a string (used later to build a URL)
        objectid = str(record['objectid'])

        # If the object already has a recorded view, increment by one
        if objectid in views:
            views[objectid] += 1
        # Otherwise, record the object's first view
        else:
            views[objectid] = 1

    # If there is a next page, keep paging; the try/except stops the
    # function cleanly if the API refuses to page past 10,000 records
    try:
        if info.get('next'):
            viewsCounter(info['next'])
    except KeyError:
        pass

# Find all recorded views
url = 'https://api.harvardartmuseums.org/activity?type=studycenterviews&q=date:>2020&apikey=0000-0000-0000-0000&size=100'
viewsCounter(url)

# Take the dictionary of recorded views and get the one with the most views
object_id = max(views, key=views.get)

# Return the number of times that object has been seen in 2020, convert to string
object_views = str(views[object_id])

# Fetch the information of that object
object_url = 'https://api.harvardartmuseums.org/object/' + object_id + '?apikey=0000-0000-0000-0000'
object_info = requests.get(object_url)

# Convert to JSON format
object_data = object_info.json()

# Get title and artist of object
object_title = object_data["title"]
object_artist = object_data["people"][0]["name"]

print('The object that has been viewed the most in 2020 is ' + object_title + ' by ' + object_artist + ' (' + object_id + ')' + ' at ' + object_views + ' views')

Congratulations! You now know how to use the Harvard Art Museums API.