I have an azure stream analytics receiving input from event hub. I then enrich the messages with a blob reference using a left join.
If the join fails, the enrichment fields end up as null as I was expecting, but if the blob does not exist at all, the stream analytics does not even output anything, no errors is thrown and the job continues running (it does not fail and stop).
Any idea on how I can achieve this? Since I see no error I was expecting to have the same results as a miss join.
Sorry for this. This is actually the expected behavior when there is no reference data: if the blob doesn't exist in the specified path, the Stream Analytics job will wait for the blob to become available.
You can detect this issue since a warning will be generated in activity logs.
We updated the doc with this information. Let me know if you have any further question.
Thanks
JS (Azure Stream Analytics)
Related
I have created input using ADLS Gen2 data stream option. I have added path pattern (upto folder which gets continuous data from eventhub). Test connection is successful but when I try to run query or sample data, it fails and error is:
Diagnostics: While sampling data, no data was received from '1' partitions.
Appreciate your help in advance.
Thank you Florian Eiden and Swati for your valuable discussions and suggestions. Posting it as answer to help other community members.
Used Event Hub directly as data streaming input instead of ADLS Gen2
data streaming option (that gets continuous data from Event Hub). This is more efficient option.
I am a newbie in Flink and I am trying to write a simple streaming job with exactly-once semantics that listens from Kafka and writes the data to S3. When I say "Exact once", I mean I don't want to end up to have duplicates, on intermediate failure between writing to S3 and commit the file sink operator. I am using Kafka of version v2.5.0, according to the connector described in this page, I am guessing my use case will end up to have exact once behavior.
Questions:
1) Whether my assumption is correct that my use case will endup to have exact once even though there is any failure occurring in any part of the steps so that I can say my S3 files won't have duplicate records?
2) How Flink handle this exact once with S3? In the documentation it says, it uses multipart upload to get exact once semantics, but my question is, how it is handled internally to achieve exact once semantics? Let's say, the task failed once the S3 multipart get succeeded and before the operator commit process, in this case, once the operator gets restarts will it stream the data again to S3 which was written to S3 already, so will it be a duplicate?
If you read from kafka and then write to S3 with the StreamingDataSink you should indeed be able to get exactly once.
Though it is not specifically about S3, this article gives a nice explanation on how to ensure exactly once in general.
https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
My key takeaway: After a failure we must always be able to see where we stand from the perspective of the sink.
After reading this article I decided to take a shot on building a pipe of data ingestion. Everything works well. I was able to send data to Event Hub, that is ingested by Stream Analytics and sent to Data Lake. But, I have a few questions regarding some things that seem odd to me. I would appreciate if someone more experienced than me is able to answer.
Here is the SQL inside my Stream Analytics
SELECT
*
INTO
[my-data-lake]
FROM
[my-event-hub]
Now, for the questions:
Should I store 100% of my data in a single file, try to split it in multiple files, or try to achieve one-file-per-object? Stream Analytics is storing all the data inside a single file, as a huge JSON array. I tried setting {date} and {time} as variables, but it is still a huge single file everyday.
Is there a way to enforce Stream Analytics to write every entry from Event Hub on its own file? Or maybe limit the size of the file?
Is there a way to set the name of the file from Stream Analytics? If so, is there a way to override a file if a name already exists?
I also noticed the file is available as soon as it is created, and it is written in real time, in a way I can see data truncation inside it when I download/display the file. Also, before it finishes, it is not a valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is being written? Is it smart enough to ignore the last entry, or understand it as an array of objects that is incomplete?
Is it better to store the JSON data as an array or each object in a new line?
Maybe I am taking a bad approach on my issue, but I have a huge dataset in Google Datastore (NoSQL solution from Google). I only have access to the Datastore, with an account with limited permissions. I need to store this data on a Data Lake. So I made an application that streams the data from Datastore to Event Hub, that is ingested by Stream Analytics, who writes down the files inside the Data Lake. It is my first time using the three technologies, but seems to be the best solution. It is my go-to alternative to ETL chaos.
I am sorry for making so much questions. I hope someone helps me out.
Thanks in advance.
I am only going to answer the file aspect:
It is normally better to produce larger files for later processing than many very small files. Given you are using JSON, I would suggest to limit the files to a size that your JSON extractor will be able to manage without running out of memory (if you decide to use a DOM based parser).
I will leave that to an ASA expert.
ditto.
The answer depends here on how ASA writes the JSON. Clients can append to files and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should be only seeing a valid JSON document. If it does not, then you may fail.
That depends on how you plan on processing the data. Note that if you write it as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization and be more "flexible", I would probably get one JSON document per line.
I am searching for a way to make a Google DataFlow job stop ingesting from Pub/Sub when a (specific) exception happens.
The events from Pub/Sub are JSON read via PubsubIO.Read.Bound<TableRow> using TableRowJsonCoder and directly streamed to BigQuery with
BigQueryIO.Write.Bound.
(There is a ParDo inbetween that changes the contents of one field and some custom partitioning by day happening, but that should be irrelevant for this purpose.)
When there are fields in the events/rows ingested from PubSub that are not columns in the destination BigQuery table, the DataFlow job logs IOExceptions at run time claiming it could not insert the rows, but seems to acknowledge these messages and continues running.
What I want to do instead is to stop ingesting messages from Pub/Sub and/or make the Dataflow job crash, so that alerting could be based on the age of oldest unacknowledged message. At the very least I want to make sure that those Pub/Sub messages that failed to be inserted to BigQuery are not ack'ed so that I can fix the problem, restart the Dataflow job and consume those messages again.
I know that one suggested solution for handling faulty input is described here: https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
I am also aware of this PR on Apache Beam that would allow inserting the rows without the offending fields:
https://github.com/apache/beam/pull/1778
However in my case I don't really want to guard from faulty input but rather from programmer errors, i.e. the fact that new fields were added to the JSON messages which are pushed to Pub/Sub, but the corresponding DataFlow job was not updated. So I don't really have faulty data, I rather simply want to crash when a programmer makes the mistake not to deploy a new Dataflow job before changing anything about the message format.
I assume it would be possible to (analogue to the blog post solution) create a custom ParDo that validates each row and throws an exception that isn't caught and leads to a crash.
But ideally, I would just like to have some configuration that does not handle the insert error and logs it but instead just crashes the job or at least stops ingestion.
You could have a ParDo with a DoFn which sits before the BQ write. The DoFn would be responsible to get the output table schema every X mins and would validate that each record that is to be written matches the expected output schema (and throw an exception if it doesn't).
Old Pipeline:
PubSub -> Some Transforms -> BQ Sink
New Pipeline:
PubSub -> Some Transforms -> ParDo(BQ Sink Validator) -> BQ Sink
This has the advantage that once someone fixes the output table schema, the pipeline will recover. You'll want to throw a good error messaging stating whats wrong with the incoming PubSub message.
Alternatively, you could have the BQ Sink Validator instead output messages to a PubSub DLQ (monitoring its size). Operationally you would have to update the table and then re-ingest the DLQ back in as an input. This has the advantage that only bad messages block pipeline execution.
We have a strange issue that happen quite often.
We have a process which getting files from sources and loading it into the GCS. Than, and only if the file uploaded successfully, we try to load it into the BigQuery table and get the error of
"Not found: Uris List of uris (possibly truncated): json: file_name: ...".
After a deep investigation, it all supposed to be fine, and we don't know what had changed. In the time frames, the file in the job exists in the cloud storage, and uploaded into the GCS 2 minutes before BigQuery tried to get it.
There is need to say that we load every file as the whole batch dictionary in the Cloud Storage, like gs://<bucket>/path_to_dir/*. Is that still supported?
Also, the file sizes are kind of small - from few bytes to KB. Is that matter?
job ids for checking:
load_file_8e4e16f737084ba59ce0ba89075241b7 load_file_6c13c25e1fc54a088af40199eb86200d
Known issue with Cloud Storage consistency
As noted by Felipe, this was indeed related to a known issue with Cloud Storage. Google Cloud Storage Incident #16036 is shown to have been resolved since December 20, 2016. This was also being tracked in Issue 738. Though Cloud Storage list operations are eventually consistent, this incident displayed excessive delays in operations returning consistent results.
Handling Cloud Storage inconsistency
Though this was an isolated incident, it is nevertheless a good practice to have some means of handling such inconsistencies. Two such suggestions can be found in comment #10 of the related public issue.
Retry the load job if it failed.
Verify if Cloud Storage results are consistent with expectations
Verify the expected number of files (and total size) was processed by BigQuery. You can get this information out of the Job metadata.
Still getting unexpected results
Should you encounter such an issue again and have the appropriate error handling measures in place, I recommend first consulting the Google Cloud Status Dashboard and BigQuery public issue tracker for existing reports showing similar symptoms. If none exist, file a new issue on the issue tracker.
The solution was to move from Multi Region Bucket(that was set before Region type was enable) to Region.
Since we moved, we never faced this issue.