Setup: A Stream Analytics job. We have stream data in CSV format in Azure Blob Storage and reference data, also in Azure Blob Storage. The query is a simple pass-through query with a left join on the reference data (so if the reference data doesn't match, the stream data should still be output).
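Roughly, the query has this shape (the input/output aliases and the join key are placeholders, not our real names):
SELECT s.DeviceId, s.Reading, r.RefValue
INTO [output-blob]
FROM [stream-input] s
LEFT OUTER JOIN [reference-input] r ON s.DeviceId = r.DeviceId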
Input path for stream data in the job is
Data/2016/06/08/17/file1.csv
based on the pattern - Data/{date}/{time} - i.e. Data/YYYY/MM/dd/HH/
The last-modified date of the file is 2016-06-08 16:30:00 (the time we uploaded the file).
Input path for reference data in the job is
Reference/2016/06/08/17/00/ref1.csv
based on the pattern - Reference/{date}/{time}/ref1.csv
i.e. Reference/YYYY/MM/dd/HH/mm/ref1.csv
The last-modified date of the file is 2016-06-08 16:30:00 (the time we uploaded the file).
All the files are in place when we start the job with a custom start time of 2016-06-08 17:00:00.
We can see the input events, but there are no output events.
When we remove the reference data left join from the query, the job produces output.
Note: the current timestamp when executing this job is 2016-06-08 19:00:00, so we are going back in time to process data.
What causes this behaviour?
Why can't we go back in time, start the job, and see any output? Basically, we would like to stop the job, go back in time, and replay everything as of that time.
What is wrong with the reference file timestamp?
I have also included the last-modified datetime of both files, wondering whether that plays any role in this behavior.
What are we missing?
Thanks.
My requirement is this: I have two storage accounts, sa01 and sa02. Say sa01 has 10 files and sa02 also has 10 files at 01:00 AM. Now I upload 4 more files to sa01 at 01:15 AM, and my copy activity runs automatically because I have implemented an event trigger, so it inserts the 4 files into sa02.
Question: it inserts the 4 new files but also updates the previous 10 files, so at 01:15 AM I end up with 14 copied files. The requirement is that if 10 files were already uploaded at 01:00 AM, only the 4 latest files should be inserted into sa02.
See the timings in the image: I uploaded just one file, yet the modified time of all the files changed.
Azure Data Share is one good way to accomplish this. It is typically used to sync storage with a partner company, but you can also sync within your own subscription. There is no code to write; there is a UI and a sync schedule.
You can use a Get Metadata activity to get the lastModified of the destination folder.
In your Copy activity, put dynamic content in the Filter by last modified: start time field, choosing the lastModified field output from the Get Metadata activity.
Only files in the source newer than the destination's lastModified will be copied.
The Get Metadata activity costs tiny fractions of a penny.
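For example, assuming the Get Metadata activity is named GetDestMetadata and includes Last modified in its field list, the dynamic content for the start time field would look roughly like this (the activity name is a placeholder):
@activity('GetDestMetadata').output.lastModified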
Problem
I'm attempting to create a BigQuery table from a CSV file in Google Cloud Storage.
I'm explicitly defining the schema for the load job (below) and have set the number of header rows to skip to 1.
Data
$ cat date_formatting_test.csv
id,shipped,name
0,1/10/2019,ryan
1,2/1/2019,blah
2,10/1/2013,asdf
Schema
id:INTEGER,
shipped:DATE,
name:STRING
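For reference, a bq command that reproduces this configuration would look roughly like this (bucket and dataset names are placeholders):
bq load --source_format=CSV --skip_leading_rows=1 mydataset.shipments gs://my-bucket/date_formatting_test.csv id:INTEGER,shipped:DATE,name:STRING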
Error
BigQuery produces the following error:
Error while reading data, error message: Could not parse '1/10/2019' as date for field shipped (position 1) starting at location 17
Questions
I understand that this date isn't in ISO format (2019-01-10), which I'm assuming will work.
However, I'm trying to define a more flexible input configuration whereby BigQuery will correctly load any date that the average American would consider valid.
Is there a way to specify the expected date format(s)?
Is there a separate configuration / setting to allow me to successfully load the provided CSV in with the schema defined as-is?
According to the listed limitations:
When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
So this leaves us with 2 options:
Option 1: ETL
Place new CSV files in Google Cloud Storage
That in turn triggers a Google Cloud Function or Google Cloud Composer job to:
Edit the date column in all the CSV files (see the sketch after this list)
Save the edited files back to Google Cloud Storage
Load the modified CSV files into Google BigQuery
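A minimal sketch of the date-reformatting step, using the sample file from the question (a Cloud Function or Composer task would wrap something equivalent):
awk 'BEGIN { FS = OFS = "," } NR == 1 { print; next } { split($2, d, "/"); $2 = sprintf("%04d-%02d-%02d", d[3], d[1], d[2]); print }' date_formatting_test.csv > date_formatting_test_iso.csv
This rewrites shipped from M/D/YYYY to YYYY-MM-DD while leaving the header and the other columns untouched.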
Option 2: ELT
Load the CSV file as-is to BigQuery (i.e. your schema should be modified to shipped:STRING)
Create a BigQuery view that transforms the shipped field from a string to a recognised date format. Use SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name (a full view statement is sketched below)
Use that view for your analysis
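Putting that together, the view could look roughly like this (dataset, view, and table names are placeholders):
CREATE OR REPLACE VIEW mydataset.shipments_view AS
SELECT id, PARSE_DATE('%m/%d/%Y', shipped) AS shipped, name
FROM mydataset.shipments_raw;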
I'm not sure, from your description, if this is a once-off job or recurring. If it's once-off, I'd go with Option 2 as it requires the least effort. Option 1 requires a bit more effort, and would only be worth it for recurring jobs.
For moving data from a BigQuery (BQ) table that resides in the US, I want to export the table to a Cloud Storage (GCS) bucket in the US, copy it to an EU bucket, and from there import it again.
The problem is that AVRO does not support DATE types, but this is crucial for us, as we are using the new partitioning feature that relies not on ingestion time but on a column in the table itself.
The AVRO files contain the DATE column as a STRING, and therefore a "Field date has changed type from DATE to STRING" error is thrown when trying to load the files via bq load.
There has been a similar question, but it is about timestamps - in my case it absolutely needs to be a DATE as dates don't carry timezone information and timestamps are always interpreted in UTC by BQ.
It works when using NEWLINE_DELIMITED_JSON, but is it possible to make this work with AVRO files?
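For reference, the JSON-based round trip that does work looks roughly like this (bucket, dataset, and schema file names are placeholders):
bq extract --destination_format=NEWLINE_DELIMITED_JSON us_dataset.my_table "gs://us-bucket/my_table/part-*.json"
gsutil -m cp "gs://us-bucket/my_table/part-*.json" gs://eu-bucket/my_table/
bq --location=EU load --source_format=NEWLINE_DELIMITED_JSON eu_dataset.my_table "gs://eu-bucket/my_table/part-*.json" ./schema.json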
As #ElliottBrossard pointed out in the comments, there's a public feature request regarding this where it's possible to sign up for the whitelist.
I have a batch script to load data from a Google Cloud Storage bucket into a BigQuery table. A scheduled SSIS job executes this batch file daily.
bq load -F "\t" --encoding=UTF-8 --replace=true db_name.tbl_name gs://GSCloudBucket/file.txt "column1:string, column2:string, column3:string"
Weirdly, the execution succeeds on some days and fails on others. Here is what I have in the log.
Waiting on bqjob_r790a43a4_00000155a65559c2_1 ... (0s) Current status: RUNNING ......
Waiting on bqjob_r790a43a4_00000155a65559c2_1 ... (7s) Current status: DONE
BigQuery error in load operation: Error processing job: Destination deleted/expired during execution
One possibility is that you have a 1-day (or multiple-day) expiration on that table, either set directly on the table or via the default table expiration on the dataset. In that case, because the actual time of the load varies, you can end up in a situation where the destination table has expired by the time the load runs.
You can use the configuration.load.createDisposition attribute to address this.
And/or you can make sure you have a proper expiration set. For a daily process that would be, say, 26 hours, so you have an extra 2 hours for your SSIS job to complete before the table can expire.
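For example, a 26-hour expiration (93600 seconds) could be set with something like this, using the table name from the question (exact names are placeholders):
bq update --expiration 93600 db_name.tbl_name
or, as a default for every new table in the dataset:
bq update --default_table_expiration 93600 db_name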
I submitted a BigQuery load job; it ran and returned with a successful status, but the data didn't make it into the destination table.
Here was the command that was run:
/usr/local/bin/bq load --nosynchronous_mode --project_id=ardent-course-601 --job_id=logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a --source_format=NEWLINE_DELIMITED_JSON dw_logs_impressions.impressions_20140816 gs://sm-uk-hadoop/queries/logsToBq_transformLogs/impressions/20140816/9307f6e3-0b3a-44ca-8571-7107c399998c/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/impressionsSchema.txt
I checked the status of the job logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a. The input file count and size showed the correct number of input files and the correct total size.
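For reference, the job status can be inspected with something like this (using the job ID above):
bq show --format=prettyjson -j logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a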
Does anyone know why the data didn't make it into the table even though the job is reported as successful?
To rule out a mistake on our side, I ran the load job again to a different destination table, and this time the data made it into the destination table fine.
Thank you.
I experienced this recently with BigQuery in sandbox mode without a billing account.
In this mode the partition expiration is automatically set to 60 days. If you load data into the table where the partitioning column (e.g. a date) is older than 60 days, it won't show up in the table. The load job still succeeds with the correct number of output rows.
This is very surprising, but I've confirmed via the logs that this is indeed the case.
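You can confirm the expiration setting on the table itself with something like this (table name is a placeholder):
bq show --format=prettyjson mydataset.mytable | grep -A 3 timePartitioning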
Unfortunately, the detailed logs for this job, which ran on August 16, are no longer available. We're investigating whether this may have affected other jobs more recently. Please ping this thread if you see this issue again.
We had this issue in our system, and the reason was that the table was set with a partition expiry of 30 days and was partitioned on a timestamp column. Hence, when someone ingested data older than the partition expiry date, the BigQuery load jobs completed successfully in Spark, but we saw no data in the ingestion tables, since it was deleted moments after it was ingested because of the partition expiry.
Please check your BigQuery table's partition expiry parameters and the partition column values of the incoming data. If the values fall outside the partition expiry window, you won't see the data in the BigQuery tables; it will be deleted just after ingestion.
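If you need to keep older partitions, the partition expiration can be raised, e.g. extending it to 180 days (the value is in seconds and the table name is a placeholder):
bq update --time_partitioning_expiration 15552000 mydataset.mytable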