Partition data while writing to delta sink - azure-synapse

In Azure mapping data flows we now have the option to save files in Delta format, but that is only available when we select an inline dataset (without a Databricks subscription). And when the sink dataset is an inline dataset, it does not allow setting partitioning based on any column.
I could write PySpark code to rewrite the Delta table with the required partitioning, but that would incur additional cost.
What are possible workarounds for getting good performance on Delta data?

There was a UI issue that was recently fixed by the engineering team. Until the fix reflects at your end, you could use one of the following workarounds:
Option 1:
Change the sink type to something else, such as a delimited text sink; you should then see the key columns under Key partitioning. Then switch the sink type back to Delta.
Reference : https://learn.microsoft.com/en-us/answers/questions/599075/index.html
Option 2:
You could enable the partitioning at the source end.
In my test, the partitioned data then flowed through the stream and I was able to get partitioned output at the sink as a result.
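If you do eventually fall back to the PySpark route mentioned in the question, a minimal sketch could look like the one below. It assumes a Spark pool with the Delta Lake libraries available; the storage paths and the partition column are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 paths; replace with your own lake locations.
source_path = "abfss://data@mylake.dfs.core.windows.net/delta/sales"
target_path = "abfss://data@mylake.dfs.core.windows.net/delta/sales_partitioned"

# Read the existing Delta table and rewrite it partitioned by the desired column.
df = spark.read.format("delta").load(source_path)

(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("sale_date")
   .save(target_path))

As the question notes, this runs on a Spark pool and therefore incurs extra compute cost, which is why the UI-based workarounds above are preferable once the fix is available.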

Table without date and Primary Key

We have 9M records and need to do the following operations:
Daily we receive the entire file of 9M records, about 150 GB in size.
It is a truncate-and-load in Snowflake: every day we delete the entire 9M records and reload them.
We want to send only an incremental load to Snowflake. Meaning that:
For example, out of 9 million records, only about 0.5 million would change (0.1M inserts, 0.3M deletes, and 0.2M updates). How can we compare the files, extract only the delta file, and load it into Snowflake? How can this be done cost-effectively and quickly with AWS-native tools, landing the output in S3?
P.S. The data doesn't have any date column. It is a pretty old design written in 2012, and we need to optimize it. The file format is fixed width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify them?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle, but you do not have an incremental key in the table to identify the incremental rows. You have two options.
Work with your data owners and put in the effort to identify the incremental key. There needs to be one; people are sometimes too lazy to put in this effort. This is the most optimal option.
If you cannot, then look for a CDC (change data capture) solution such as GoldenGate.
A CDC stage comes by default in DataStage.
Using the CDC stage in combination with the Transformer stage is the best approach to identify new rows, changed rows, and rows for deletion.
You need to identify the column(s) that make a row unique; doing CDC with all columns is not recommended, because a DataStage job with a CDC stage consumes more resources as you add more change columns to the stage.
Work with your BA to identify the column(s) that make a row unique in the data.
I had a similar problem to yours. In my case there was no primary key and no date column to identify the differences, so I used AWS Athena (managed Presto) to calculate the difference between the source and the destination. Below is the process:
Copy the source data to S3.
Create a source table in Athena pointing to the data copied from the source.
Create a destination table in Athena pointing to the destination data.
Now use SQL in Athena to find the difference. As I had neither a primary key nor a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use it to find the difference as well and create a result table with a column that says "insert/update/delete".
This option is AWS-native and cheap as well: Athena charges $5 per TB scanned. Also, with this method, do not forget to write file-rotation scripts to keep your S3 costs down.
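A sketch of running that diff from Python with boto3 is below; the database name, output bucket, and region are hypothetical, and the Athena tables are assumed to already exist as created in the steps above.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# The same EXCEPT query as above, against the hypothetical Athena tables.
DIFF_SQL = """
SELECT * FROM table_destination
EXCEPT
SELECT * FROM table_source
"""

# Start the query; Athena writes the result set as CSV to the S3 output location.
query = athena.start_query_execution(
    QueryString=DIFF_SQL,
    QueryExecutionContext={"Database": "delta_compare"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/diffs/"},
)

# Poll until the query finishes; a production job would add retries or use Step Functions.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

print("Query finished with state:", state)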

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we're eventually going to end up with tables of hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform the joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add a partitioning column and a few clustering columns as well (BigQuery allows up to four clustering columns). You could eventually even use yearly suffixed tables. That way you have several dimensions along which to scan only a limited number of rows for rematerialization.
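As a rough illustration of that layout, here is a sketch using the google-cloud-bigquery client; the project, dataset, table, and column names are all hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Timestamp + raw JSON payload, plus a few columns worth clustering on.
schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),  # the raw JSON kept as a string
]

table = bigquery.Table("my-project.raw.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts"
)
table.clustering_fields = ["event_type", "user_id"]

client.create_table(table)

Queries that filter on the partitioning column and the clustering columns then scan far less data than a full-table rematerialization would.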
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, process them there, and write to BigQuery with the new schema. Such a pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that's rematerialization, where you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
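A hedged sketch of that rematerialization with the same hypothetical names: the query result is written back over the table itself, minus the deprecated column.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.raw.events"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# SELECT * EXCEPT(...) drops the deprecated column; add DISTINCT to also drop duplicate rows.
sql = "SELECT * EXCEPT (legacy_form_field) FROM `my-project.raw.events`"
client.query(sql, job_config=job_config).result()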
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter to the columns you want to insert into the BQ table.
With Dynamic Destinations you can insert the data into the respective tables (if you have events of various types); the destination lets you specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destination write and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine; a minimal sketch of the dynamic write follows below.
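A minimal sketch of that dynamic write with the Beam Python SDK; the project, subscription, and dataset names are hypothetical, and handling of the failed-insert records is left out for brevity.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def route_to_table(row):
    # Choose the destination table from the event type carried in each record.
    return "my-project:raw_events.{}".format(row["event_type"])


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        rows = (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            | "ParseJson" >> beam.Map(json.loads)
        )
        # WriteToBigQuery accepts a callable for `table`, which is the
        # "Dynamic Destination" referred to above: the target table is
        # picked per element at run time.
        rows | "WriteDynamic" >> beam.io.WriteToBigQuery(
            table=route_to_table,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )


if __name__ == "__main__":
    run()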

Streaming data to a specific BigQuery Time Partition

I would like to know if there is any way to stream data to a specific time partition of a BigQuery table. The documentation says that you must use table decorators:
Loading data using partition decorators
Partition decorators enable you to load data into a specific
partition. To adjust for timezones, use a partition decorator to load
data into a partition based on your preferred timezone. For example,
if you are on Pacific Standard Time (PST), load all data generated on
May 1, 2016 PST into the partition for that date by using the
corresponding partition decorator:
[TABLE_NAME]$20160501
Source: https://cloud.google.com/bigquery/docs/partitioned-tables#dealing_with_timezone_issues
And:
Restating data in a partition
To update data in a specific partition, append a partition decorator
to the name of the partitioned table when loading data into the table.
A partition decorator represents a specific date and takes the form:
$YYYYMMDD
Source: https://cloud.google.com/bigquery/docs/creating-partitioned-tables#creating_a_partitioned_table
But if I try to use them when streaming data, I get the following error: Table decorators cannot be used with streaming insert.
Thanks in advance!
Sorry for the inconvenience. We are considering providing support for this in the near future. Please stay tuned for more updates.
Possible workarounds that might work in many cases:
If you have most of the data available (which is sometimes the case when restating data for an old partition), you can use a load job with the partition as the destination.
Another option is to stream to a temporary table and, after the data has been flushed from the streaming buffer, use bq cp.
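A sketch of that second workaround with the google-cloud-bigquery Python client; the table names are hypothetical, and it assumes the streamed rows have already left the streaming buffer before the copy job runs.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Stream into a temporary (undecorated) table first.
errors = client.insert_rows_json(
    "my-project.my_dataset.events_staging",
    [{"event_ts": "2016-05-01T12:00:00", "payload": "restated row"}],
)
if errors:
    raise RuntimeError(errors)

# 2. Later, once the rows have left the streaming buffer, copy them into the
#    target partition by appending the $YYYYMMDD decorator to the table ID
#    (the equivalent of the `bq cp` step mentioned above).
copy_job = client.copy_table(
    "my-project.my_dataset.events_staging",
    "my-project.my_dataset.events$20160501",
    job_config=bigquery.CopyJobConfig(write_disposition="WRITE_APPEND"),
)
copy_job.result()  # wait for the copy job to finish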
This feature was recently released and you can now stream directly into a decorated date partition within the last 30 days historically and 5 days into the future.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
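With that support in place, streaming straight into a decorated partition looks roughly like the sketch below (hypothetical names; it assumes the client passes the $YYYYMMDD decorator through to tabledata.insertAll unchanged).

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Stream rows directly into the 2016-05-01 partition of the target table.
errors = client.insert_rows_json(
    "my-project.my_dataset.events$20160501",
    [{"event_ts": "2016-05-01T12:00:00", "payload": "restated row"}],
)
if errors:
    print("Failed inserts:", errors)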

Does streaming data into BigQuery append to a table or overwrite it if the table already exists?

When streaming data into a BigQuery table, I wonder whether the default is to append the JSON data to the table if the table already exists. The API documentation for tabledata().insertAll() is very brief and doesn't mention parameters like configuration.load.writeDisposition as in a load job.
There are no multiple choices here, so there is no default and nothing to override. Don't forget that BigQuery is a WORM technology (append-only by design). It looks to me as if you were not aware of this, as there is no option like UPDATE for streaming inserts.
You just set the path parameters, the trio of project, dataset, and table ID,
then supply the rows as JSON matching the existing schema, and they will be appended to the table.
To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis.
In case of an error you get a short error code that summarizes it. For help debugging the specific reason value you receive, see troubleshooting errors.
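For example, a sketch with the google-cloud-bigquery Python client, where the row_ids argument supplies the insertId for each row (the table name and IDs are hypothetical, and the table is assumed to exist with a matching schema):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

rows = [
    {"user_id": "u1", "action": "signup"},
    {"user_id": "u2", "action": "login"},
]

# Streaming inserts always append; row_ids maps to the insertId used for the
# best-effort de-duplication described above.
errors = client.insert_rows_json(
    "my-project.app.events",
    rows,
    row_ids=["evt-0001", "evt-0002"],
)
if errors:
    print("Some rows failed:", errors)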
Also worth reading:
Bigquery internalError when streaming data

Dynamic partitioning in google cloud dataflow?

I'm using Dataflow to process files stored in GCS and write to BigQuery tables. Below are my requirements:
input files contain event records, each record pertaining to one eventType;
need to partition records by eventType;
for each eventType, output/write records to a corresponding BigQuery table, one table per eventType;
events in each batch of input files vary.
I'm thinking of applying transforms such as GroupByKey and Partition; however, it seems that I would have to know the number (and types) of events at development time, which is needed to determine the partitions.
Do you have a good idea for doing the partitioning dynamically, meaning that the partitions can be determined at run time?
Why not load everything into a single "raw" BigQuery table and then use the BigQuery API to determine the different event types and export each event type into its own table (e.g., via https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery or an API call)?
If your input format is simple, you can do that without using Dataflow at all, and it will probably be more cost-efficient.
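A sketch of that raw-table-then-split approach with the google-cloud-bigquery Python client; it assumes the raw table carries the event type in a column and that the type values are usable as table names (all names here are hypothetical).

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Discover the event types present in the raw table at run time.
event_types = [
    row.event_type
    for row in client.query(
        "SELECT DISTINCT event_type FROM `my-project.raw.events`"
    ).result()
]

# 2. Materialize one table per event type, named after the type.
for event_type in event_types:
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(
            "my-project.by_type.{}".format(event_type)
        ),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[
            bigquery.ScalarQueryParameter("t", "STRING", event_type)
        ],
    )
    client.query(
        "SELECT * FROM `my-project.raw.events` WHERE event_type = @t",
        job_config=job_config,
    ).result()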