I have e-commerce product data in JSON format. This data is frequently updated. I am planning to load this data into BigQuery. Given that the JSON files are huge (a few hundred attributes) and there are many updates, I found that it is not very cost-effective to store the entire data in BigQuery and update it frequently using merge queries.
I am now planning to store the frequently used attributes in individual columns in a BigQuery table. However, I also need to support the use cases when somebody wants to access any other attribute of the JSON for some ad hoc analysis. To address this, can I use the following strategy:
Store the JSON for each product in its own file in a Cloud Storage bucket under a specific directory (prefix). Create an external table over all the files in that directory, so each file's content becomes a row in the external table describing a single product. When a product is updated, I update the BigQuery table and also replace that product's file with a new one.
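For concreteness, this is roughly the shape I have in mind for the native table holding the frequently used attributes (all names here are placeholders, not my real schema):

```sql
-- Hypothetical "hot attributes" table; column names are illustrative only.
CREATE TABLE IF NOT EXISTS `my_project.ecommerce.products` (
  product_id STRING NOT NULL,
  title      STRING,
  brand      STRING,
  price      NUMERIC,
  in_stock   BOOL,
  updated_at TIMESTAMP
);
```

The full JSON for each product would then live as one file per product, e.g. under a path like gs://my-bucket/products/ (path is just an example).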
Is this a reasonable approach? If yes, how can I create the external table from the directory?
There are multiple approaches to loading JSON data into BigQuery, and each one comes with design considerations around cost, complexity, and operations, so a straight yes/no answer really depends on those three factors. Having said that, your approach of creating an external table will work too.

Solution: an all-SQL approach
Here's an external table definition that loads each file's content into a single column (see the documentation). Declaring the format as CSV with a field delimiter that never appears in the data (\x10) and an empty quote character makes each line of a file land in one STRING column:

```sql
CREATE OR REPLACE EXTERNAL TABLE `YOUR_PROJECT.YOUR_DATASET.my_newline_json_arrays` (
  col1 STRING
)
OPTIONS (
  format = 'CSV',
  field_delimiter = '\x10',
  quote = '',
  uris = ['gs://YOUR_BUCKET/my_newline_json_arrays.json']
);
```
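Since you want one row per file across a whole directory, note that the uris option accepts a wildcard. Assuming your files live under a prefix like gs://YOUR_BUCKET/products/ (an assumption, adjust to your layout), a variant such as the following should pick up every product file:

```sql
-- Same single-column trick, but pointed at every JSON file under a prefix.
-- Bucket name and prefix are placeholders.
CREATE OR REPLACE EXTERNAL TABLE `YOUR_PROJECT.YOUR_DATASET.products_raw_json` (
  col1 STRING
)
OPTIONS (
  format = 'CSV',
  field_delimiter = '\x10',
  quote = '',
  uris = ['gs://YOUR_BUCKET/products/*.json']
);
```

As long as each product's JSON is written on a single line, each file yields exactly one row.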
After creating the external table, you can use BigQuery JSON functions to select the attributes you want, and run a BigQuery scheduled query to load the data into your native table. To identify changes to the attributes, you can consider using a table-valued function (TVF), as mentioned in the documentation.
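As a sketch of what that scheduled query could look like (table and attribute names are placeholders; it assumes the raw external table from the previous step and a product_id field inside the JSON), the JSON functions combined with a MERGE would give you something like:

```sql
-- Hypothetical scheduled MERGE: parse the raw JSON rows and upsert the
-- frequently used attributes into the native table. All names are illustrative.
MERGE `YOUR_PROJECT.YOUR_DATASET.products` AS t
USING (
  SELECT
    JSON_VALUE(col1, '$.product_id')                    AS product_id,
    JSON_VALUE(col1, '$.title')                         AS title,
    JSON_VALUE(col1, '$.brand')                         AS brand,
    SAFE_CAST(JSON_VALUE(col1, '$.price') AS NUMERIC)   AS price,
    SAFE_CAST(JSON_VALUE(col1, '$.in_stock') AS BOOL)   AS in_stock,
    CURRENT_TIMESTAMP()                                 AS updated_at
  FROM `YOUR_PROJECT.YOUR_DATASET.products_raw_json`
) AS s
ON t.product_id = s.product_id
WHEN MATCHED THEN
  UPDATE SET title = s.title, brand = s.brand, price = s.price,
             in_stock = s.in_stock, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (product_id, title, brand, price, in_stock, updated_at)
  VALUES (s.product_id, s.title, s.brand, s.price, s.in_stock, s.updated_at);
```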