How to flatten a nested json array?

Question

How to flatten a nested json array?

1.4k views Asked by lsr At 07 August 2021 at 20:12

I need to flatten a JSON with different levels of nested JSON arrays in Python

Part of my JSON looks like:

{
  "data": {
    "workbooks": [
      {
        "projectName": "TestProject",
        "name": "wkb1",
        "site": {
          "name": "site1"
        },
        "description": "",
        "createdAt": "2020-12-13T15:38:58Z",
        "updatedAt": "2020-12-13T15:38:59Z",
        "owner": {
          "name": "user1",
          "username": "John"
        },
        "embeddedDatasources": [
          {
            "name": "DS1",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS1",
                "hasExtracts": false,
                "owner": {
                  "username": "user2"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_1",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_2",
                "schema": "schema_2",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_3",
                "schema": "schema_3",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          },
          {
            "name": "DS2",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS2",
                "hasExtracts": false,
                "owner": {
                  "username": "user3"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_4",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          }
        ]
      }
    ]
  }
}

The output should like this

sample output

Tried using json_normalize but couldn't make it work. Currently parsing it by reading the nested arrays using loops and reading values using keys. Looking for a better way of normalizing the JSON

Original Q&A

There are 2 answers

**thenarfer** · Answer 1 · 2021-08-07T23:01:41+00:00

Here's a partial solution:

First save your data in the same directory as the script as a JSON file called data.json.

import json
import pandas as pd
from pandas.io.json import json_normalize

with open('data.json') as json_file:
    json_data = json.load(json_file)

new_data = json_data['data']['workbooks']

result = json_normalize(new_data, ['embeddedDatasources', 'upstreamTables'], ['projectName', 'name', 'createdAt', 'updatedAt', 'owner', 'site'], record_prefix='_')

result

Output:

	_name	_schema	_database.name	_database.connectionType	projectName	name	createdAt	updatedAt	owner	site
0	table_1	schema_1	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}
1	table_2	schema_2	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}
2	table_3	schema_3	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}
3	table_4	schema_1	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}

What next?

I think if you re-structure the data a bit in advance (for example flattening 'database': {'name': 'testdb', 'connectionType': 'redshift'}) you will be able to add more fields to the meta parameter.

As you see in the documentation of json_normalize, the four parameters that are used here are:

data: dict or list of dicts :
- Unserialized JSON objects.
record_path: str or list of str : default None
- Path in each object to list of records. If not passed, data will be assumed to be an array of records.
meta: list of paths (str or list of str) : default None
- Fields to use as metadata for each record in resulting table.
record_prefix: str : default None
- If True, prefix records with dotted (?) path, e.g. foo.bar.field if path to records is [‘foo’, ‘bar’].

**Jagvir** · Answer 2 · 2022-09-28T07:41:00+00:00

tl;dr: Your final output along with detailed steps are mentioned in here

details :

To answer this question you need to have a thorough understanding of pandas.json_normalize. The understanding of json_normalize, record_path, meta, explode and in general json parsing.

import json
import pandas as pd

data = {
    "data":
    {
        "workbooks":
        [
            {
                "projectName": "TestProject",
                "name": "wkb1",
                "site":
                {
                    "name": "site1"
                },
                "description": "",
                "createdAt": "2020-12-13T15:38:58Z",
                "updatedAt": "2020-12-13T15:38:59Z",
                "owner":
                {
                    "name": "user1",
                    "username": "John"
                },
                "embeddedDatasources":
                [
                    {
                        "name": "DS1",
                        "hasExtracts": False,
                        "upstreamDatasources":
                        [
                            {
                                "projectName": "Data Sources",
                                "name": "DS1",
                                "hasExtracts": False,
                                "owner":
                                {
                                    "username": "user2"
                                }
                            }
                        ],
                        "upstreamTables":
                        [
                            {
                                "name": "table_1",
                                "schema": "schema_1",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            },
                            {
                                "name": "table_2",
                                "schema": "schema_2",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            },
                            {
                                "name": "table_3",
                                "schema": "schema_3",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            }
                        ]
                    },
                    {
                        "name": "DS2",
                        "hasExtracts": False,
                        "upstreamDatasources":
                        [
                            {
                                "projectName": "Data Sources",
                                "name": "DS2",
                                "hasExtracts": False,
                                "owner":
                                {
                                    "username": "user3"
                                }
                            }
                        ],
                        "upstreamTables":
                        [
                            {
                                "name": "table_4",
                                "schema": "schema_1",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

First you need to bring it to the dict level.

data_list = data['data']['workbooks']

I did some data massaging by renaming some columns as per requirements.

data_list_pd = pd.DataFrame(data_list)
data_list_pd = data_list_pd.rename(
    columns= {'name':'wkb'},errors='ignore').rename(
    columns= {'createdAt':'wkb_createdDt'},errors='ignore').rename(
    columns= {'updatedAt':'wkb_updatedDt'},errors='ignore').rename(
    columns= {'projectName':'prj'},errors='ignore')
data_list_pd

data_list = json.loads(data_list_pd.to_json(orient="records"))
data_list

Next is where the core of your problem statement lies. You need to flatten the JSON by mentioning the record_path which is esentially the nested dictionary you want to expand along with the meta which is meta data/the remaining columns which you want to display. After that you need to explode on columns which have lists in them. You can achieve it by chaining explode method couple of times.

flattened_dataframe= pd.json_normalize(data_list, 
                  record_path = 'embeddedDatasources',
                  meta = ['prj','wkb','wkb_createdDt', 'wkb_updatedDt',['site','name'],['owner','name'],['owner','username']],
                  errors='ignore').explode('upstreamDatasources').explode('upstreamTables')
flattened_dataframe

You can repeat this process couple of times to reach your final goal/desired result. Since the json_normalize works on JSON/dict files you will have to convert the dataframe into json files after each iteration. You can follow these steps.

flattened_json = json.loads(flattened_dataframe.to_json(orient="records"))

TechQA.

How to flatten a nested json array?

There are 2 answers

Related Questions in PYTHON

Related Questions in JSON

Related Questions in DENORMALIZED

Popular Questions

Trending Questions