Flatten Nested JSON with Pandas
June 09, 2016
I believe the pandas library takes the expression "batteries included" to a whole new level (in a good way). Recent evidence: the pandas.io.json.json_normalize function. It turns an array of nested JSON objects into a flat DataFrame with dotted-namespace column names. It may not seem like much, but I've found it invaluable when working with responses from RESTful APIs.
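Before touching a live API, here's a minimal sketch of what the function does. The records and field names below are invented purely for illustration.

import pandas as pd

# Toy records with a nested "user" object (made up for illustration)
records = [
    {'id': 1, 'text': 'hello', 'user': {'name': 'ada', 'admin': True}},
    {'id': 2, 'text': 'world', 'user': {'name': 'grace', 'admin': False}},
]
flat = pd.io.json.json_normalize(records)
flat.columns.tolist()  # includes 'user.name' and 'user.admin' alongside 'id' and 'text'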
Let me demonstrate. According to the Gitter API docs, the /rooms/:roomid/chatMessages resource has a variety of nested objects and arrays. I'll fetch fifty messages from the jupyter/notebook room using requests and then use pandas to do a bit of counting.
import pandas as pd
import requests
I need an API token to get the messages. I've obscured mine here. You can get your own from the Gitter Developer page and plug it in as the bearer token.
headers = {
'Authorization': 'Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'Content-Type': 'application/json'
}
I also need the Gitter roomid for jupyter/notebook. I looked it up out-of-band and pasted it into the URL below to avoid muddying this post with additional steps.
resp = requests.get('https://api.gitter.im/v1/rooms/554d218a15522ed4b3e02126/chatMessages',
headers=headers)
resp.raise_for_status()
When I look at the first entry in the JSON response, I see that it contains a few keys with array and object values (e.g., fromUser, mentions, meta).
resp.json()[0]
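The output is long, so here's a trimmed sketch of the shape of one entry. The values are placeholders and several fields are omitted; see the Gitter API docs for the full schema.

# Abbreviated sketch of one chatMessages entry (placeholder values, not real data)
{
    'id': '...',                                            # message id
    'text': '...',                                          # message body
    'sent': '...',                                          # ISO 8601 timestamp
    'fromUser': {'username': '...', 'displayName': '...'},  # nested object
    'mentions': [{'screenName': '...', 'userId': '...'}],   # nested array
    'meta': [],                                             # nested array
}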
Here's what happens when I pass the full list of messages to the json_normalize function.
msgs = pd.io.json.json_normalize(resp.json())
Notice how the properties of the fromUser nested object become column headers with a fromUser. prefix.
msgs.dtypes
msgs.head(2)
Also notice how nested arrays are left untouched as rich Python objects stored in columns. For example, here are the first ten values in the mentions column.
msgs.mentions.head(10)
I can leave these lists as column values and apply functions to them. For example, I can compute the frequency of mentions per message easily with the data in this form.
msgs.mentions.apply(lambda mentions: len(mentions)).value_counts()
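As an aside, the same count can be written with the .str accessor, which pandas also applies to list-valued columns; either spelling should produce the same distribution.

# Equivalent count: .str.len() measures the length of each list in the column
msgs.mentions.str.len().value_counts()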
Alternatively, I can apply the json_normalize function to the mentions key in each entry in the original API response to get another DataFrame.
mentions = pd.io.json.json_normalize(resp.json(), record_path='mentions')
mentions.head()
I can compute the distribution of mentions per username more easily with this DataFrame than with the original, for example.
mentions.screenName.value_counts()
I can also apply the json_normalize function to mentions while retaining other metadata from the original response, such as the message id.
mentions_with_id = pd.io.json.json_normalize(resp.json(), record_path='mentions', meta='id',
record_prefix='mentions.')
mentions_with_id.head()
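To make the record_path, meta, and record_prefix combination concrete, here's what it does to a small made-up structure (the names here are invented for illustration, not part of the Gitter response):

# Invented example: flatten the nested 'tags' lists, carrying the parent 'id' along
data = [
    {'id': 'a1', 'tags': [{'name': 'x'}, {'name': 'y'}]},
    {'id': 'a2', 'tags': [{'name': 'z'}]},
]
pd.io.json.json_normalize(data, record_path='tags', meta='id', record_prefix='tags.')
# -> three rows with columns 'tags.name' and 'id'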
With the message id intact, I can merge the mentions_with_id and msgs DataFrames. Here, I do an inner merge to create rows only for messages that have at least one mention.
mention_msgs = pd.merge(mentions_with_id, msgs, how='inner', on='id')
With the merged DataFrame, I can readily compute the distribution of sender-receiver username pairs, for example.
mention_msgs.apply(lambda row: (row['fromUser.username'], row['mentions.screenName']), axis=1).value_counts()
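The same tally can also be written as a groupby, which I find a bit easier to read; either form should yield the same pairs and counts.

# Equivalent tally of (sender, mentioned user) pairs using groupby
mention_msgs.groupby(['fromUser.username', 'mentions.screenName']).size().sort_values(ascending=False)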
See the pandas documentation for complete information about the json_normalize function. You can also download this post as a Jupyter Notebook and run it yourself.