Flatten Nested JSON with Pandas
June 09, 2016
I believe the pandas library takes the expression "batteries included" to a whole new level (in a good way). Recent evidence: the pandas.io.json.json_normalize function. It turns an array of nested JSON objects into a flat DataFrame with dotted-namespace column names. It may not seem like much, but I've found it invaluable when working with responses from RESTful APIs.
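Before touching a live API, here's a minimal sketch of what the function does. The records and field names below are invented purely for illustration.

import pandas as pd

# Toy records with a nested "user" object (made up for illustration)
records = [
    {'id': 1, 'text': 'hello', 'user': {'name': 'ada', 'admin': True}},
    {'id': 2, 'text': 'world', 'user': {'name': 'grace', 'admin': False}},
]
flat = pd.io.json.json_normalize(records)
flat.columns.tolist()  # includes 'user.name' and 'user.admin' alongside 'id' and 'text'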
Let me demonstrate. According to the Gitter API docs, the /rooms/:roomid/chatMessages resource has a variety of nested objects and arrays. I'll fetch fifty messages from the jupyter/notebook room using requests and then use pandas to do a bit of counting.
import pandas as pd
import requests
I need an API token to get the messages. I've obscured mine here. You can get your own from the Gitter Developer page and plug it in as the bearer token.
headers = {
'Authorization': 'Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'Content-Type': 'application/json'
}
I also need the Gitter roomid for jupyter/notebook. I looked it up out-of-band and pasted it into the URL below to avoid muddying this post with additional steps.
resp = requests.get('https://api.gitter.im/v1/rooms/554d218a15522ed4b3e02126/chatMessages',
headers=headers)
resp.raise_for_status()
When I look at the first entry in the JSON response, I see that it contains a few keys with array and object values (e.g., fromUser, mentions, meta).
resp.json()[0]
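The output is long, so here's a trimmed sketch of the shape of one entry. The values are placeholders and several fields are omitted; see the Gitter API docs for the full schema.

# Abbreviated sketch of one chatMessages entry (placeholder values, not real data)
{
    'id': '...',                                            # message id
    'text': '...',                                          # message body
    'sent': '...',                                          # ISO 8601 timestamp
    'fromUser': {'username': '...', 'displayName': '...'},  # nested object
    'mentions': [{'screenName': '...', 'userId': '...'}],   # nested array
    'meta': [],                                             # nested array
}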
Here's what happens when I pass the full list of messages to the json_normalize function.
msgs = pd.io.json.json_normalize(resp.json())
Notice how the properties of the fromUser nested object become column headers with a fromUser. prefix.
msgs.dtypes
msgs.head(2)
Also notice how nested arrays are left untouched as rich Python objects stored in columns. For example, here are the first ten values in the mentions column.
msgs.mentions.head(10)
I can leave these lists as column values and apply functions to them. For example, I can compute the frequency of mentions per message easily with the data in this form.
msgs.mentions.apply(lambda mentions: len(mentions)).value_counts()
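As an aside, the same count can be written with the .str accessor, which pandas also applies to list-valued columns; either spelling should produce the same distribution.

# Equivalent count: .str.len() measures the length of each list in the column
msgs.mentions.str.len().value_counts()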
Alternatively, I can apply the json_normalize function to the mentions key in each entry in the original API response to get another DataFrame.
mentions = pd.io.json.json_normalize(resp.json(), record_path='mentions')
mentions.head()
I can compute the distribution of mentions per username more easily with this DataFrame than with the original, for example.
mentions.screenName.value_counts()
I can also apply the json_normalize function to mentions while retaining other metadata from the original response, such as the message id.
mentions_with_id = pd.io.json.json_normalize(resp.json(), record_path='mentions', meta='id',
record_prefix='mentions.')
mentions_with_id.head()
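To make the record_path, meta, and record_prefix combination concrete, here's what it does to a small made-up structure (the names here are invented for illustration, not part of the Gitter response):

# Invented example: flatten the nested 'tags' lists, carrying the parent 'id' along
data = [
    {'id': 'a1', 'tags': [{'name': 'x'}, {'name': 'y'}]},
    {'id': 'a2', 'tags': [{'name': 'z'}]},
]
pd.io.json.json_normalize(data, record_path='tags', meta='id', record_prefix='tags.')
# -> three rows with columns 'tags.name' and 'id'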
With the message id intact, I can merge the mentions_with_id and msgs DataFrames. Here, I do an inner merge to create rows only for messages that have at least one mention.
mention_msgs = pd.merge(mentions_with_id, msgs, how='inner', on='id')
With the merged DataFrame, I can readily compute the distribution of sender-receiver username pairs, for example.
mention_msgs.apply(lambda row: (row['fromUser.username'], row['mentions.screenName']), axis=1).value_counts()
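The same tally can also be written as a groupby, which I find a bit easier to read; either form should yield the same pairs and counts.

# Equivalent tally of (sender, mentioned user) pairs using groupby
mention_msgs.groupby(['fromUser.username', 'mentions.screenName']).size().sort_values(ascending=False)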
See the pandas documentation for complete information about the json_normalize function. You can also download this post as a Jupyter Notebook and run it yourself.