Deleting records from a table with billions of records using Spark or Scala - sql

We have a table in Azure Data Warehouse with 17 billion records. We now need to delete records from this table based on a WHERE condition. We are writing Spark code in Scala in Azure Databricks notebooks.
We searched for different ways to do this in Spark, but every suggestion was to first read the entire table, remove the records from it, and then overwrite the entire table in Data Warehouse. This approach will not work in our case because of the huge number of records in the table.
Can you please suggest how we can achieve this using Spark/Scala?
1) We checked whether we can call a stored procedure from Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) We tried reading the entire table first in order to delete the records, but it goes into a never-ending loop.

Is it possible to create a view with a SELECT clause that matches your requirement, and then use the view instead of the table?
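
One possible reading of the view suggestion is a view that filters out the rows that would otherwise be deleted, so that consumers query the view and the base table is left untouched. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the warehouse; the table `events`, the column `status`, and the filter value are hypothetical, and the exact syntax and permissions in Azure Data Warehouse may differ.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO events (id, status) VALUES (?, ?)",
    [(1, "active"), (2, "obsolete"), (3, "active")],
)

# The view exposes only the rows that should survive the "delete".
conn.execute(
    "CREATE VIEW events_current AS "
    "SELECT id, status FROM events WHERE status <> 'obsolete'"
)

visible_ids = [row[0] for row in conn.execute("SELECT id FROM events_current ORDER BY id")]
base_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Note that this hides rows rather than removing them, so storage is not reclaimed; whether that is acceptable depends on why the records need to go.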

Related

How to write a trigger to insert data from Aurora to Redshift

I have some data in an Aurora MySQL database, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (say TABLE B).
ALSO,
LATEST DAILY LOAD
Have a trigger-like condition so that whenever a new row is inserted into Aurora TABLE A, a trigger updates the columns in Redshift TABLE B after some processing.
What would be the best approach to handle this situation? Please note that this is not a simple read-and-insert case; I also have to perform some processing between the read and the write.
Not sure if you have already solved the issue; if so, please share the details.
We are looking at the following approach:
A cron job will write the daily data batch into S3 (say 1 month or order)
Upon arrival in S3, load that file into Redshift via the COPY command (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html)
We are definitely looking for more ideas and thoughts.

How to do a bulk upsert operation in Snowflake?

I am syncing my MongoDB data to Snowflake on a daily basis using a Node.js script. If a row already exists in Snowflake, I want to replace it with the new data; if the row doesn't exist in Snowflake, I want to insert a new row.
I also want to do this for a lot of data.
Is there any way to do a bulk upsert in Snowflake? If not, what would be the optimal way to achieve this?
The table may have millions of rows and could grow to billions in the future.
This is a typical use case for a MERGE statement. You can see the documentation for MERGE here: https://docs.snowflake.com/en/sql-reference/sql/merge.html
Using a MERGE statement for billions of rows can lead to high-churn tables, so it isn't ideal. It could be better to only append to the table and work out the latest record for each key with a SELECT statement.
You can bulk copy your data into a staging table and then use the MERGE feature in Snowflake.
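
As a rough illustration of the staging-table-plus-MERGE idea, the following sketch builds the text of a MERGE statement from a list of key and value columns. The helper `build_merge_sql` and the table and column names (`customers`, `customers_staging`, `id`, `name`, `email`) are hypothetical; the sketch only assembles the SQL string and assumes the staging table has already been bulk loaded and that the identifiers come from trusted configuration, since they are interpolated without escaping.

```python
def build_merge_sql(target, staging, key_cols, value_cols):
    """Return a MERGE statement that upserts rows from staging into target."""
    # Rows match when every key column is equal.
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    # Matched rows get their non-key columns overwritten from staging.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in value_cols)
    all_cols = list(key_cols) + list(value_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on_clause} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql("customers", "customers_staging", ["id"], ["name", "email"])
```

The resulting string would then be executed through whichever Snowflake client the sync script already uses, once per batch rather than once per row.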

How to insert overwrite a partitioned table in the BigQuery UI?

We can insert data into a specific partition of a partitioned table, but we need to specify the partition value. My requirement is to overwrite all partitions in a table in one query using the UI. Can we perform this operation?
I consulted a BigQuery team member. You cannot write to all partitions in one query.
You can only write to one partition at a time.
As YY has pointed out, you would not be able to do this directly in BigQuery/SQL with one query (you could script something to run N queries). However, if you spun up a Cloud Dataflow pipeline and configured it to have multiple BigQueryIO sinks, with each sink overwriting a partition, this could be one way of doing it in one shot. It would be a straightforward pipeline to spin up and run.
At this time, BigQuery allows updating up to 2000 partitions in a single statement. If you just need to insert data into a partitioned table, you can use the INSERT DML statement to write to up to 2000 partitions in one statement. If you are updating or deleting existing partitions, you can use the UPDATE or DELETE statements respectively. If you are both updating and inserting new data, you can use the MERGE DML statement to achieve this.

SSIS Alternatives to one-by-one update from RecordSet

I'm looking for a way to speed up the following process: I have an SSIS package that loads data from Excel files into SQL Server on a weekly basis. There are 3 fields: Brand, Date, Value.
In the dataflow, I check for existing combinations of Brand+Date, and new combinations go to the table directly, the existing ones go to a RecordSet destination for updates:
The next step is to update the Value of the existing combinations:
As you can see, there are thousands of records to update, and it takes too long. The number of records tends to grow week by week. Please suggest an alternative.
The fastest way will be to do this inside a stored procedure using an ELT (Extract, Load, Transform) approach.
Push all the data from Excel as-is into a table (in theory this is called loading to a staging table). Since you do not seem to be concerned with data validation steps, this table can be a replica of the final destination table's columns.
The next step is to call a stored procedure using an Execute SQL task. Inside this procedure you can put all your business logic. Since this step works with native data manipulation on SQL Server entities, it is the fastest alternative.
As the last part, delete all entries from the staging table.
You can use indexes on the staging table to make the stored procedure part even faster.
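
The steps above can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the SQL dialect is SQLite's rather than T-SQL, and the table names `sales` and `sales_staging`, the index name, and the sample rows are all hypothetical. The point is only the shape of the work: one bulk load, one set-based UPDATE, one set-based INSERT, then a cleanup, instead of one UPDATE per record.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (brand TEXT, sale_date TEXT, value REAL,
                        PRIMARY KEY (brand, sale_date));
    CREATE TABLE sales_staging (brand TEXT, sale_date TEXT, value REAL);
    CREATE INDEX ix_staging ON sales_staging (brand, sale_date);
    INSERT INTO sales VALUES ('A', '2020-01-06', 10.0);
""")

# Step 1: load the weekly file as-is into the staging table.
conn.executemany(
    "INSERT INTO sales_staging VALUES (?, ?, ?)",
    [("A", "2020-01-06", 12.5), ("B", "2020-01-06", 7.0)],
)

# Step 2: a single set-based UPDATE for existing Brand+Date combinations.
conn.execute("""
    UPDATE sales
    SET value = (SELECT s.value FROM sales_staging s
                 WHERE s.brand = sales.brand AND s.sale_date = sales.sale_date)
    WHERE EXISTS (SELECT 1 FROM sales_staging s
                  WHERE s.brand = sales.brand AND s.sale_date = sales.sale_date)
""")

# Step 3: a single set-based INSERT for new combinations.
conn.execute("""
    INSERT INTO sales (brand, sale_date, value)
    SELECT s.brand, s.sale_date, s.value FROM sales_staging s
    WHERE NOT EXISTS (SELECT 1 FROM sales t
                      WHERE t.brand = s.brand AND t.sale_date = s.sale_date)
""")

# Step 4: clear the staging table for the next load.
conn.execute("DELETE FROM sales_staging")

rows = conn.execute(
    "SELECT brand, sale_date, value FROM sales ORDER BY brand"
).fetchall()
staging_left = conn.execute("SELECT COUNT(*) FROM sales_staging").fetchone()[0]
```

In the SSIS package, steps 2 to 4 would live inside the stored procedure called from the Execute SQL task.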

Create a partitioned table in BigQuery

Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Cloud Storage for the year 2016. I stored all the data in one bucket, partitioned by year, month, and date. I want to create a table partitioned by date.
Thanks in advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016 -- you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
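
Because each day needs its own job, it can help to generate the list of destination partitions up front. The sketch below only computes the `table$YYYYMMDD` destination names for a date range; the helper `partition_targets` and the table name `logs` are hypothetical, and submitting the actual query jobs is left to whichever BigQuery client you use.

```python
from datetime import date, timedelta


def partition_targets(table, start, end):
    """Yield (day, 'table$YYYYMMDD') pairs for every day from start to end inclusive."""
    day = start
    while day <= end:
        # The partition decorator is the table name, a '$', and the date as YYYYMMDD.
        yield day, f"{table}${day:%Y%m%d}"
        day += timedelta(days=1)


# Hypothetical table name; one destination per day of 2016.
targets = list(partition_targets("logs", date(2016, 1, 1), date(2016, 12, 31)))
```

Each pair can then drive one query job that filters the source data to `day` and writes to the matching destination.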
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into its own table, named YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them using either Table wildcard functions (Legacy SQL) or a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples
Option 2
You can create a Date-Partitioned Table (just one table - YourLogs) - but you will still need to load each daily file into its respective partition - see Creating and Updating Date-Partitioned Tables
After the table is loaded, you can easily Query Date-Partitioned Tables
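
For Option 1, a Standard SQL wildcard query over the daily tables might look like the string built below. The helper `wildcard_query` and the project and dataset names are hypothetical; `_TABLE_SUFFIX` is the pseudo-column that wildcard tables expose for the part of the table name matched by `*`. This is only a sketch of the query text, with identifiers interpolated as-is.

```python
def wildcard_query(project, dataset, prefix, start_suffix, end_suffix):
    """Build a Standard SQL query over tables named <prefix>YYYYMMDD in a suffix range."""
    return (
        f"SELECT * FROM `{project}.{dataset}.{prefix}*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'"
    )


# Hypothetical project and dataset names; selects the daily tables for May 2016.
query = wildcard_query("my-project", "my_dataset", "YourLogs_", "20160501", "20160531")
```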
Partitions on an External Table are not allowed for now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.