parente.dev

Flatten Nested JSON with Pandas

June 09, 2016

I believe the pandas library takes the expression "batteries included" to a whole new level (in a good way). Recent evidence: the pandas.io.json.json_normalize function. It turns an array of nested JSON objects into a flat DataFrame with dotted-namespace column names. It may not seem like much, but I've found it invaluable when working with responses from RESTful APIs.

Let me demonstrate. According to the Gitter API docs, the /rooms/:roomid/chatMessages resource has a variety of nested objects and arrays. I'll fetch fifty messages from the jupyter/notebook room using requests and then use pandas to do a bit of counting.

In [1]:
import pandas as pd
import requests

I need an API token to get the messages. I've obscured mine here. You can get your own from the Gitter Developer page and plug it in as the bearer token.

In [2]:
headers = {
    'Authorization': 'Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    'Content-Type': 'application/json'
}

I also need the Gitter roomid for jupyter/notebook. I looked it up out-of-band and pasted it into the URL below to avoid muddying this post with additional steps.

In [3]:
resp = requests.get('https://api.gitter.im/v1/rooms/554d218a15522ed4b3e02126/chatMessages', 
                    headers=headers)
In [4]:
resp.raise_for_status()

When I look at the first entry in the JSON response, I see that it contains a few keys with array and object values (e.g., fromUser, mentions, meta, etc.)

In [5]:
resp.json()[0]
Out[5]:
{'fromUser': {'avatarUrlMedium': 'https://avatars0.githubusercontent.com/u/19606573?v=3&s=128',
  'avatarUrlSmall': 'https://avatars0.githubusercontent.com/u/19606573?v=3&s=60',
  'displayName': 'tomVeloso',
  'gv': '3',
  'id': '574810c4c43b8c6019753f42',
  'url': '/tomVeloso',
  'username': 'tomVeloso',
  'v': 1},
 'html': '<span data-link-type="mention" data-screen-name="minrk" class="mention">@minrk</span>  I tried with the above but seam does not work in the sense that the nothing is cleared from the output cell.',
 'id': '5755a171e8163f872c4e6a84',
 'issues': [],
 'mentions': [{'screenName': 'minrk',
   'userId': '529c6cc9ed5ab0b3bf04d9eb',
   'userIds': []}],
 'meta': [],
 'readBy': 22,
 'sent': '2016-06-06T16:14:41.516Z',
 'text': '@minrk  I tried with the above but seam does not work in the sense that the nothing is cleared from the output cell.',
 'unread': False,
 'urls': [],
 'v': 1}

Here's what happens when I pass the full list of messages to the json_normalize function.

In [6]:
msgs = pd.io.json.json_normalize(resp.json())

Notice how the properties of the fromUser nested object become column headers with a fromUser. prefix.

In [7]:
msgs.dtypes
Out[7]:
editedAt                     object
fromUser.avatarUrlMedium     object
fromUser.avatarUrlSmall      object
fromUser.displayName         object
fromUser.gv                  object
fromUser.id                  object
fromUser.url                 object
fromUser.username            object
fromUser.v                  float64
html                         object
id                           object
issues                       object
mentions                     object
meta                         object
readBy                        int64
sent                         object
text                         object
unread                         bool
urls                         object
v                             int64
dtype: object
In [8]:
msgs.head(2)
Out[8]:
editedAt fromUser.avatarUrlMedium fromUser.avatarUrlSmall fromUser.displayName fromUser.gv fromUser.id fromUser.url fromUser.username fromUser.v html id issues mentions meta readBy sent text unread urls v
0 NaN https://avatars0.githubusercontent.com/u/19606... https://avatars0.githubusercontent.com/u/19606... tomVeloso 3 574810c4c43b8c6019753f42 /tomVeloso tomVeloso 1 <span data-link-type="mention" data-screen-nam... 5755a171e8163f872c4e6a84 [] [{'screenName': 'minrk', 'userIds': [], 'userI... [] 22 2016-06-06T16:14:41.516Z @minrk I tried with the above but seam does n... False [] 1
1 NaN https://avatars1.githubusercontent.com/u/15192... https://avatars1.githubusercontent.com/u/15192... Min RK 3 529c6cc9ed5ab0b3bf04d9eb /minrk minrk 12 clear_output is a function, you must call it. 5755b9bf75a601a158b0415d [] [] [] 22 2016-06-06T17:58:23.862Z clear_output is a function, you must call it. False [] 1

Also notice how nested arrays are left untouched as rich Python objects stored in columns. For example, here's the first ten values in the mentions column.

In [9]:
msgs.mentions.head(10)
Out[9]:
0    [{'screenName': 'minrk', 'userIds': [], 'userI...
1                                                   []
2    [{'screenName': 'minrk', 'userIds': [], 'userI...
3    [{'screenName': 'minrk', 'userIds': [], 'userI...
4    [{'screenName': 'epifanio', 'userIds': [], 'us...
5                                                   []
6    [{'screenName': 'sccolbert', 'userIds': [], 'u...
7    [{'screenName': 'jasongrout', 'userIds': [], '...
8    [{'screenName': 'minrk', 'userIds': [], 'userI...
9                                                   []
Name: mentions, dtype: object

I can leave these lists as column values and apply functions to them. For example, I can compute the frequency of mentions per message easily with the data in this form.

In [10]:
msgs.mentions.apply(lambda mentions: len(mentions)).value_counts()
Out[10]:
0    36
1    12
4     1
3     1
Name: mentions, dtype: int64

Alternatively, I can apply the json_normalize function to the mentions key in each entry in the original API response to get another DataFrame.

In [11]:
mentions = pd.io.json.json_normalize(resp.json(), record_path='mentions')
mentions.head()
Out[11]:
announcement group screenName userId userIds
0 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb []
1 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb []
2 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb []
3 NaN NaN all NaN []
4 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb []

I can compute the distribution of mentions per username more easily with this DataFrame than with the original, for example.

In [12]:
mentions.screenName.value_counts()
Out[12]:
minrk          7
jasongrout     3
epifanio       2
all            2
blink1073      1
sccolbert      1
afshin         1
mingsterism    1
ellisonbg      1
Name: screenName, dtype: int64

I can also apply the json_normalize function to mentions while retaining other metadata from the original response, such as the message id.

In [13]:
mentions_with_id = pd.io.json.json_normalize(resp.json(), record_path='mentions', meta='id',
                                     record_prefix='mentions.')
mentions_with_id.head()
Out[13]:
mentions.announcement mentions.group mentions.screenName mentions.userId mentions.userIds id
0 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb [] 5755a171e8163f872c4e6a84
1 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb [] 5755bf7a17856dc5179ac226
2 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb [] 5755d43d92fc7c915f57444d
3 NaN NaN all NaN [] 5755d43d92fc7c915f57444d
4 NaN NaN minrk 529c6cc9ed5ab0b3bf04d9eb [] 5755d43d92fc7c915f57444d

With the message id intact, I can merge the mentions and msgs DataFrames. Here, I do an inner merge to create rows for messages that have at least one mention.

In [14]:
mention_msgs = pd.merge(mentions_with_id, msgs, how='inner', on='id')

With the merged DataFrame, I can readily compute the distribution of sender-receiver username pairs, for example.

In [15]:
mention_msgs.apply(lambda df: (df['fromUser.username'], df['mentions.screenName']), axis=1).value_counts()
Out[15]:
(mingsterism, minrk)       3
(epifanio, minrk)          2
(sccolbert, jasongrout)    1
(epifanio, all)            1
(afshin, jasongrout)       1
(jasongrout, afshin)       1
(jasongrout, ellisonbg)    1
(epifanio, jasongrout)     1
(jasongrout, blink1073)    1
(tomVeloso, minrk)         1
(fperez, minrk)            1
(JamiesHQ, all)            1
(minrk, epifanio)          1
(jasongrout, sccolbert)    1
(jasongrout, epifanio)     1
(minrk, mingsterism)       1
dtype: int64

See the pandas documentation for complete information about the json_normalize function. You can also download this post as a Jupyter Notebook and run it yourself.

Another Read: Kicking the Tires: Bluemix Insights for Weather »

In this notebook, we're going to poke at the Bluemix Insights for Weather service from Python. We'll look at what kinds of queries we can make and do a few basic things with the data. We'll keep a running commentary that can serve as an introductory tutorial for developers who want to go off and build more sophisticated apps and analyses using the service.