Copy data from one blob storage to another blob storage - azure-storage

My requirement is like this: I have two storage accounts, sa01 and sa02. Say sa01 has 10 files and sa02 also has 10 files at 01:00 AM. Now I upload 4 more files to sa01 at 01:15 AM, and my copy activity runs automatically because I have implemented an event trigger, so it inserts the 4 files into sa02.
Question - it inserts the 4 files but also updates the previous 10 files, so I end up with 14 files stamped 01:15 AM. The requirement says that if 10 files were already uploaded at 01:00 AM, only the 4 latest files should be inserted into sa02.
See the timings in the image: I have just uploaded one file, yet the modified time of all the files has changed.

Azure Data Share is one good way to accomplish this. It is typically used to sync storage with a partner company. But you can sync in your own subscription. There is no code to write. There is a UI and a sync schedule.

You can use a Get Metadata activity to get the lastModified of the destination folder.
In your Copy activity, put dynamic content in the Filter by last modified: start time field, choosing the lastModified output from the Get Metadata activity.
Only files in the source newer than the destination's lastModified will be copied.
The Get Metadata activity costs only tiny fractions of a penny.
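
For anyone who prefers to see the same filter-by-last-modified idea spelled out in code rather than in the Data Factory UI, here is a rough sketch using the azure-storage-blob SDK. It is not the pipeline described above, just an illustration; the connection strings, the container name and the choice of the newest destination lastModified as the cut-off are all assumptions.

    # Illustration of the same filter-by-last-modified idea outside Data Factory,
    # using the azure-storage-blob SDK. Connection strings, the container name
    # and the cut-off choice are placeholders/assumptions.
    from azure.storage.blob import BlobServiceClient

    src = BlobServiceClient.from_connection_string("<sa01-connection-string>")
    dst = BlobServiceClient.from_connection_string("<sa02-connection-string>")
    src_container = src.get_container_client("data")
    dst_container = dst.get_container_client("data")

    # Use the newest lastModified already present in the destination as the cut-off,
    # mirroring what the Get Metadata activity provides in the pipeline.
    existing = list(dst_container.list_blobs())
    cutoff = max(b.last_modified for b in existing) if existing else None

    for blob in src_container.list_blobs():
        if cutoff is None or blob.last_modified > cutoff:
            # start_copy_from_url needs a readable source URL
            # (same account, public container, or a SAS appended to the URL).
            source_url = src_container.get_blob_client(blob.name).url
            dst_container.get_blob_client(blob.name).start_copy_from_url(source_url)

In the actual pipeline the same cut-off simply comes from the Get Metadata output plugged into the Copy activity's Filter by last modified: start time field.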

Related

Looking for a safe way to delete a parquet file from Delta Lake

As the title says, I'm looking for a safe way to delete parquet files from a Delta Lake, "safe" because I don't want to corrupt my Delta Lake.
This is what happened: some days ago some partitions had to be backfilled, but I forgot to delete the contents of one of them prior to this process. A full delete/refill is the last resort I want to try right now.
I know by the last modification date which files I need to delete, but if I delete them directly from S3 I will corrupt the partition/table and create an inconsistency against the manifest.
What would be the "uncorrupting" way to delete those files?
I'm currently working on spark.
Thanks!

Can I copy data table folders in QuestDb to another instance?

I am running QuestDB on a production server which constantly writes data to a table, 24x7. The table is partitioned by day.
I want to copy the data to another instance and update it there incrementally, since the old data never changes. Sometimes the copy works, but sometimes the data gets corrupted, reading from the second instance fails, and I have to retry copying all the table data, which is huge and takes a lot of time.
Is there a way to backup / restore QuestDb while not interrupting continuous data ingestion?
QuestDB appends data in the following sequence:
1. Append to column files inside the partition directory
2. Append to symbol files inside the root table directory
3. Mark the transaction as committed in the _txn file
There is no strict ordering between steps 1 and 2, but step 3 always happens last. To incrementally copy data to another box you should copy in the opposite order:
1. Copy the _txn file first
2. Copy the root symbol files
3. Copy the partition directories
Do this while your slave QuestDB server is down; on start, the table should then have data up to the point when you started copying the _txn file.
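
A minimal sketch of that copy order in Python, assuming the table lives in a directory such as /var/lib/questdb/db/my_table (paths and table name are placeholders) and that the receiving instance is stopped while you copy:

    # Sketch of the copy order described above: the _txn file first, then the
    # files in the root table directory (symbol files, metadata), then the
    # partition directories. Paths and table name are placeholders; run this
    # while the receiving QuestDB instance is stopped.
    import shutil
    from pathlib import Path

    SRC = Path("/var/lib/questdb/db/my_table")   # source table directory (assumed layout)
    DST = Path("/backup/questdb/db/my_table")    # destination table directory

    DST.mkdir(parents=True, exist_ok=True)

    # 1. Copy _txn first, so the destination never claims more committed
    #    transactions than the data copied afterwards can back up.
    shutil.copy2(SRC / "_txn", DST / "_txn")

    # 2. Copy the root table files (symbol files, metadata).
    for item in SRC.iterdir():
        if item.is_file() and item.name != "_txn":
            shutil.copy2(item, DST / item.name)

    # 3. Copy the partition directories (one per day for a daily-partitioned table).
    for item in SRC.iterdir():
        if item.is_dir():
            shutil.copytree(item, DST / item.name, dirs_exist_ok=True)

Copying _txn first is what makes the snapshot consistent: the destination never records more committed transactions than the column and symbol data copied afterwards.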

Not able to update BigQuery table with Transfer from a Storage file

I am not able to update a BigQuery table from a storage file. I have the latest data file and the transfer runs successfully, but it says "8:36:01 AM Detected that no changes will be made to the destination table."
Tried multiple ways.
Please help.
Thanks,
-Srini
You have to wait 1 hour after your file has been updated in Cloud Storage: https://cloud.google.com/bigquery-transfer/docs/cloud-storage-transfer?hl=en_US#minimum_intervals
I had the same error. I created two transfers from GCS to BigQuery, with the write preference set to MIRROR and APPEND respectively. I got the logs below (no error). The GCS file had been uploaded less than one hour before.
MIRROR: Detected that no changes will be made to the destination table. Summary: succeeded 0 jobs, failed 0 jobs.
APPEND: None of the 1 new file(s) found matching "gs://mybucket/myfile" meet the requirement of being at least 60 minutes old. They will be loaded in next run. Summary: succeeded 0 jobs, failed 0 jobs.
Both jobs went through one hour later.
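If you want to verify the 60-minute rule yourself before blaming the transfer, a quick sketch with the google-cloud-storage client might look like this (bucket and object names are placeholders):

    # Quick check of the "at least 60 minutes old" rule before expecting a
    # transfer run to pick a file up. Bucket and object names are placeholders.
    from datetime import datetime, timedelta, timezone
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("mybucket").get_blob("myfile")
    if blob is None:
        raise SystemExit("object not found")

    age = datetime.now(timezone.utc) - blob.updated   # 'updated' is the last modification time
    if age < timedelta(minutes=60):
        print(f"File is only {age} old; the transfer will skip it until the next run.")
    else:
        print("File is old enough to be picked up by the next transfer run.")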

Google Cloud Dataprep - Scan for multiple input csv and create corresponding bigquery tables

I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function then runs to move 1 file from the raw folder into the queue folder if the queue is empty, and does nothing otherwise (a rough sketch of this function is included after these steps).
Dataprep scans the queue folder as per the scheduler. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and then moves 1 file from the raw folder to the queue folder if there is one.
Repeat until all tables are created.
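For illustration only, the raw-to-queue Cloud Function mentioned in the steps above might look roughly like the sketch below; the bucket name, folder prefixes and trigger are assumptions, and error handling is omitted.

    # Rough sketch of the raw -> queue Cloud Function from the steps above:
    # move one file into queue/ only when queue/ is empty. The bucket name,
    # folder prefixes and trigger are assumptions; error handling is omitted.
    from google.cloud import storage

    BUCKET = "my-dataprep-bucket"   # hypothetical bucket name

    def promote_next_file(event, context):
        """Background Cloud Function, e.g. triggered by object changes in the bucket."""
        client = storage.Client()
        bucket = client.bucket(BUCKET)

        # Do nothing if something is already waiting in queue/
        # (ignore a zero-byte "queue/" folder placeholder, if present).
        if any(b.name != "queue/" for b in bucket.list_blobs(prefix="queue/")):
            return

        raw_files = [b for b in bucket.list_blobs(prefix="raw/") if not b.name.endswith("/")]
        if not raw_files:
            return

        blob = raw_files[0]
        new_name = blob.name.replace("raw/", "queue/", 1)
        bucket.copy_blob(blob, bucket, new_name)   # GCS has no rename: copy, then delete
        blob.delete()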
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.

Start stream analytics job back in time with reference data

Setup: a Stream Analytics job. We have stream data in csv format in Azure blob storage and reference data in Azure blob storage. The query is a simple pass-through query with a left join on the reference data (so if the reference doesn't match, the stream data should still be output).
Input path for stream data in the job is
Data/2016/06/08/17/file1.csv
based on the pattern Data/{date}/{time}, i.e. Data/YYYY/MM/dd/HH/.
The last modified date of the file is 2016-06-08 16:30:00, the time we uploaded the file.
Input path for reference data in the job is
Reference/2016/06/08/17/00/ref1.csv
based on the pattern Reference/{date}/{time}/ref1.csv, i.e. Reference/YYYY/MM/dd/HH/mm/ref1.csv.
The last modified date of the file is 2016-06-08 16:30:00, the time we uploaded the file.
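For concreteness, here is a small Python sketch (purely illustrative) of how those two patterns resolve for the custom 17:00 start time used below:

    # Just to make the {date}/{time} patterns concrete: how they expand for the
    # custom start time used when starting the job.
    from datetime import datetime

    start = datetime(2016, 6, 8, 17, 0)

    stream_path = "Data/{:%Y/%m/%d/%H}/".format(start)                    # Data/{date}/{time}
    reference_path = "Reference/{:%Y/%m/%d/%H/%M}/ref1.csv".format(start)

    print(stream_path)      # Data/2016/06/08/17/
    print(reference_path)   # Reference/2016/06/08/17/00/ref1.csv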
All the files are in place. When we start the job with a custom start time of 2016-06-08 17:00:00, we can see input events but there are no output events.
When we remove the reference data left join from the query, the job produces output.
Note: the current timestamp when executing this job is 2016-06-08 19:00:00, so we are going back in time to process data.
What causes this behaviour?
Why can't we go back in time, start the job, and see any output? Basically we would like to stop the job, go back in time, and replay everything as of that time.
What is wrong with reference file timestamp?
I have also included the last modified datetime of both files, wondering whether that plays any role in this behavior.
What are we missing?
Thanks.