How to store and aggregate data in ~300m JSON objects efficiently - sql

I have an app where I receive 300m JSON text files (10m daily, retention = 30 days) from a Kafka topic.
The data they contain needs to be aggregated every day based on different properties.
We would like to build it with Apache Spark, using Azure Databricks, because the size of the data will grow, we cannot vertically scale this process any further (it currently runs on a single Postgres server), and we also need something that is cost-effective.
Having this job in Apache Spark is straightforward in theory, but I haven't found any practical advice on how to process JSON objects efficiently.
These are the options as I see them:
Store the data in Postgres and ingest it with the Spark job (SQL) - may be slow to transfer the data
Store the data in Azure Blob Storage in JSON format - we may run up against limits on the number of files that can be stored, and reading that many small files also seems inefficient
Store the JSON data in big chunks, e.g. 100,000 JSON objects per file - it could be slow to delete/reinsert when the data changes
Convert the data to CSV or some binary format with a fixed structure and store it in blob storage in big chunks - changing the format would be a challenge, but it would rarely happen in the future, and CSV/binary is quicker to parse
Any practical advice would be really appreciated. Thanks in advance.

There are multiple factors to consider:
If you read the data daily, I strongly suggest storing it in Parquet format in Databricks. If you do not access it daily, keep it in Azure storage itself (the computation cost will be minimised).
If the JSON data needs to be flattened, do all the data manipulation and write the result into Delta tables, running OPTIMIZE on them.
If a 30-day retention really is mandatory, be cautious with file formats, because the data will keep growing day by day. Otherwise, alter the table properties to set the retention period to 7 or 15 days.
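To make that concrete, here is a minimal PySpark sketch of the JSON-to-Delta path on Databricks; the storage paths, the column names (timestamp, customer_id, amount) and the table names are placeholders, not details from the question.

```python
# Minimal sketch for a Databricks notebook, where `spark` is the provided session.
# Paths, column names and table names below are assumptions for illustration.
from pyspark.sql import functions as F

raw_path = "abfss://events@myaccount.dfs.core.windows.net/json/"          # assumed source
delta_path = "abfss://lake@myaccount.dfs.core.windows.net/events_delta/"  # assumed target

# Parse the raw JSON once and persist it as date-partitioned Delta,
# so later jobs never have to re-read millions of small JSON files.
raw = spark.read.json(raw_path)
(raw
    .withColumn("event_date", F.to_date("timestamp"))   # assumed timestamp field
    .write.format("delta")
    .mode("append")
    .partitionBy("event_date")
    .save(delta_path))

spark.sql(f"CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '{delta_path}'")
spark.sql("OPTIMIZE events")   # compact small files (Databricks Delta)

# The daily aggregation then scans compact columnar data instead of raw JSON.
daily = (spark.table("events")
    .groupBy("event_date", "customer_id")
    .agg(F.count("*").alias("events"), F.sum("amount").alias("total_amount")))
daily.write.format("delta").mode("overwrite").saveAsTable("events_daily_agg")
```

Expiring old data then becomes a partition-level delete (or a table retention setting) rather than touching individual JSON files.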

Related

ETL on S3 : Duplicate rows : how to update old entries?

During my ETL imports, some pre-synchronized entries are supplied multiple times by my source (because the service updated them) and are therefore imported multiple times into AWS. I would like to implement a structure that overwrites an entry if it already exists (something close to a key-value store, for the few rows that are updated twice).
My requirements entail operating on one terabyte of data and working with Glue (or potentially Redshift).
I implemented the solution as follows:
I read the data from my source
I save each entry in a different file by choosing the unique identifier of the content as the file name.
I index my raw data with a Glue crawler that scans new files on S3
I run a Glue job to transform the raw data into an OLAP-friendly format (Parquet).
Is this the right way to proceed?
It seems correct to me personally, even though I have concerns about the large number of separate files in my raw data (one file per entry).
Thank you,
Hugo
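For what it is worth, here is an illustrative sketch (not taken from the question) of expressing the "overwrite an entry if it already exists" step inside the Spark/Glue job itself rather than via one file per entry; the column names entry_id and updated_at and the S3 prefixes are assumptions.

```python
# Illustrative sketch: deduplicate raw entries by key, keeping the latest version.
# entry_id, updated_at and the S3 prefixes are assumed names.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-etl").getOrCreate()

raw = spark.read.json("s3://my-bucket/raw/")        # assumed raw prefix

# Keep only the most recent record per unique identifier.
w = Window.partitionBy("entry_id").orderBy(F.col("updated_at").desc())
latest = (raw
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))

latest.write.mode("overwrite").parquet("s3://my-bucket/olap/entries/")   # assumed target
```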

Use case of using BigQuery or Bigtable for querying aggregate values?

I have a use case for designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the options below would be better for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing the HBase shell on a Compute Engine instance to query the Bigtable data.
Based on my analysis of the links below for this specific use case, I see that Cloud Storage can be queried through BigQuery, and that Bigtable supports CSV imports and querying. The BigQuery quotas also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, which I assume means I could run multiple load jobs if I am loading more than 15 TB.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above use case?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive cost improvements on the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and costs.
Instead of CREATE TABLE you can do imports via the API; those are free (instead of paying the query cost of CREATE TABLE).
15 TB can be handled easily by BigQuery.
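As a short illustration of the free load path, here is a Python sketch using the google-cloud-bigquery client; the project, table, bucket path and clustering columns are placeholders.

```python
# Sketch of a free load job from GCS into a clustered native table.
# Project, dataset, bucket path and column names are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")        # assumed project id
table_id = "my-project.analytics.events"              # assumed destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    clustering_fields=["user_id", "event_date"],      # assumed columns; keeps common queries cheap
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load jobs are free; you pay only for storage and for later queries.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.csv",                   # assumed GCS path, wildcards allowed
    table_id,
    job_config=job_config,
)
load_job.result()   # wait for completion
```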

Optimal Google Cloud Storage for BigQuery

Given a 1-terabyte data set which comes from the sources in a couple of hundred CSV files and divides naturally into two large tables, what's the best way to store the data in Google Cloud Storage? Partitioning by date does not apply, as the data is relatively static and only updated quarterly. Is it best to combine all of the data into two large files and map each to a BigQuery table? Is it better to partition? If so, on what basis? Is there a threshold file size above which BigQuery performance degrades?
Depending on the use case:
To query data => then load it into BigQuery from GCS.
To store the data => leave it in GCS.
Question: "I want to query and have created a table in BigQuery, but with only a subset of the data totaling a few GB. My question is: if I have a TB of data, should I keep it in one giant file in GCS or should I split it up?"
Answer: Just load it all into BigQuery. BigQuery eats TB's for breakfast.
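If you want a feel for the "leave it in GCS but still query it" option before committing to a full load, here is a small sketch using a temporary external table with the google-cloud-bigquery client; the bucket path and the reliance on schema autodetection are assumptions.

```python
# Sketch: query CSVs sitting in GCS directly via a temporary external table.
# Bucket path and schema autodetection are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")        # assumed project id

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/tables/big_table_*.csv"]  # assumed path
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"big_table": external_config}  # temporary external table name
)

query = "SELECT COUNT(*) AS n FROM big_table"
for row in client.query(query, job_config=job_config).result():
    print(row.n)
```

For repeated aggregate queries, loading into native tables (as the answer suggests) is still the cheaper and faster option.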

Multi-Date data Load into BigQuery Partitioned table

I am trying to explore BigQuery's ability to load CSV files (DoubleClick impression data) into a BigQuery partitioned table. My use case includes:
1. Reading daily (nightly load) dumps (CSV) from Google Cloud Storage for my customer's (an ad agency's) 30 different clients into BQ. A daily dump may contain data from the previous day/week. All data should be loaded into the respective daily partition (in BQ) so as to provide daily reporting to individual clients.
2. The purpose here is to build an analytical system that gives the ad agency the ability to run "trends & patterns over time and across clients".
I am new to BQ and thus trying to understand its schema layout.
Should I create a single table with daily partitions (holding data from all 50 clients / 50 daily CSV load files)? Do the partitions need to be created well in advance?
Or should I create 50 different tables (partitioned by date), one per client, so as NOT to run into the data sharing/security concerns of the single-table option?
My customer wants a simple solution with minimal cost.
If you are going to use the Transfer Service (as mentioned in the comment), you don't need to create tables by hand; the Transfer Service will do that for you. It will schedule daily jobs and load the data into the right partitions. Also, if there is a short delay (2-3 days), the Transfer Service will still pick up the data.
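If the Transfer Service is not an option, here is a minimal Python sketch of loading a nightly dump into one specific daily partition; the project, table and bucket names are placeholders, and the "$YYYYMMDD" decorator is the standard way to target a single partition.

```python
# Sketch: load one nightly CSV dump into a single daily partition.
# Project, dataset, table and bucket names are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")        # assumed project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(),    # daily ingestion-time partitioning
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace only that partition
)

# The "$YYYYMMDD" decorator targets one partition, so a late dump for a
# previous day can be reloaded without touching the other days.
destination = "my-project.reporting.impressions$20240101"   # assumed table id
uri = "gs://my-bucket/dumps/2024-01-01/*.csv"               # assumed GCS path

client.load_table_from_uri(uri, destination, job_config=job_config).result()
```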

Data preparation to upload into Redis server

I have a 10 GB .xml file which I want to upload into a Redis server using mass insertion. I need advice on how to convert this .xml data into key-value pairs or another data structure supported by Redis. I am working with the Stack Overflow dumps; for example, take comments.xml.
Data pattern:
<row Id="5" PostId="5" Score="9" Text="this is a super theoretical AI question. An interesting discussion! but out of place..." CreationDate="2014-05-14T00:23:15.437" UserId="34" />
Let's say I want to retrieve all comments made by a particular UserId or on a particular date; how do I do that?
Firstly, how do I prepare this .xml data into a data structure suitable for Redis?
Secondly, how can I upload it into Redis? I am using Redis on Windows. The pipe and cat commands do not seem to work. I have tried using CentOS, but I prefer using Redis on Windows.
Before you choose the proper data structure, you need to understand what types of queries you will make. For example, if you have user-specific data and you need to group different user activities per user and get aggregated results, you will need to use different structures, build indexes, split the data into chunks, and so on.
For relatively large amounts of aggregated data (45 GB) I found sorted sets with ZRANGE usable, because it has better complexity than LRANGE. You can split your data into chunks based on its size, process each ZRANGE individually in threads, and then combine the results.
On top of that structure you can add indexes with lists where you only need to iterate over relatively small amounts of data.
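As one possible shape for the comments data, here is a minimal redis-py sketch that stores each row as a hash and builds per-user and per-date indexes; the key naming scheme (comment:<Id>, user:<UserId>:comments, date:<YYYY-MM-DD>:comments) is an illustrative choice, not a Redis requirement.

```python
# Sketch: stream comments.xml into Redis with per-user and per-date indexes.
# Key names are illustrative assumptions, not a required schema.
import xml.etree.ElementTree as ET
from datetime import datetime
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

pipe = r.pipeline(transaction=False)
count = 0

for _, elem in ET.iterparse("comments.xml", events=("end",)):
    if elem.tag != "row":
        continue
    a = elem.attrib
    created = datetime.fromisoformat(a["CreationDate"])

    # One hash per comment holds the full record.
    pipe.hset(f"comment:{a['Id']}", mapping=a)
    # Secondary indexes: a sorted set per user (scored by time) and a set per day.
    pipe.zadd(f"user:{a['UserId']}:comments", {a["Id"]: created.timestamp()})
    pipe.sadd(f"date:{created.date()}:comments", a["Id"])

    count += 1
    if count % 10_000 == 0:
        pipe.execute()   # flush in chunks to keep memory bounded
    elem.clear()         # free parsed XML nodes as we go

pipe.execute()

# Example queries: all comment ids by user 34, or on a given date.
ids_by_user = r.zrange("user:34:comments", 0, -1)
ids_by_date = r.smembers("date:2014-05-14:comments")
```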