I have created an input using the ADLS Gen2 data stream option. I have added a path pattern (up to the folder that receives continuous data from Event Hub). The test connection is successful, but when I try to run the query or sample data, it fails with the error:
Diagnostics: While sampling data, no data was received from '1' partitions.
Appreciate your help in advance.
Thank you Florian Eiden and Swati for your valuable discussions and suggestions. Posting this as an answer to help other community members.
I used Event Hub directly as the data streaming input instead of the ADLS Gen2 data streaming option (which receives continuous data from Event Hub). This is the more efficient option.
I have Application Insights configured with a three-month log retention period, and I want to load the logs into Data Lake Storage Gen2 using Data Factory pipelines scheduled daily.
The purpose is to avoid losing data once the retention period passes and to keep it stored for future use, mainly machine learning and reporting.
I am trying to decide which of the many formats available in Data Lake Gen2 to use for storing this data, so if anyone has a similar design, any information or reference to documentation would be greatly appreciated.
In my experience, most of the log files are .log files. If you want to keep the file type and move them to Data Lake Gen2, use the Binary format.
The Binary format lets you move all folders/sub-folders and all their files to the destination as-is.
HTH.
I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics into a streaming DataFrame, which is then cleaned and printed to the console. I need this data to be integrated with existing Python scripts that do all their data operations with pandas. I have considered the following options:
Write streaming data to local memory (e.g. Hive Tables).
Use Spark Structured Streaming ForeachBatch Sink.
I should mention that the data is to be read at a certain interval, and there will be a real-time data dashboard built on this data in the future.
Please advise which would be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save the data to Hive each time before accessing the newly streamed data from your Python scripts, the newly added Hive partitions also need to be refreshed (recovered) each time:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
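For example, here is a minimal PySpark sketch of refreshing partitions before each read; the table name is hypothetical, and the session needs Hive support enabled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Recover any partitions the streaming job has added since the last read
# (hypothetical table name).
spark.sql("MSCK REPAIR TABLE my_streamed_events")

df = spark.table("my_streamed_events")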
Here are some disadvantages of using Hive for the real-time scenario mentioned:
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Using Spark Structured Streaming, on the other hand, looks like the better choice for a near-real-time experience:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
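As an illustration of the second option, here is a minimal PySpark sketch of the foreachBatch approach; the broker address, topic name, and pandas pipeline function are hypothetical, and the spark-sql-kafka connector package must be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-pandas").getOrCreate()

# Streaming DataFrame from Kafka; the value column holds the raw message bytes.
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
             .option("subscribe", "events")                        # hypothetical topic
             .load()
             .selectExpr("CAST(value AS STRING) AS value"))

def process_batch(batch_df, batch_id):
    # Each micro-batch is a normal (non-streaming) DataFrame, so it can be
    # converted to pandas and handed to the existing pandas-based scripts.
    pdf = batch_df.toPandas()
    # existing_pandas_pipeline(pdf)  # hypothetical: your current pandas logic
    print(f"batch {batch_id}: {len(pdf)} rows")

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .trigger(processingTime="5 minutes")  # read at a certain interval
         .start())

query.awaitTermination()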
I have developed a real-time pipeline in Data Fusion to fetch data from Pub/Sub, feed it into GCS, and then into BigQuery. However, after GCS (which is available as a sink), I am not able to feed the data into BigQuery, because GCS is only available as a sink and hence does not expose any output schema. Is there any way I can create a pipeline that takes the data from GCS to BigQuery?
To provide a possible solution: it is not possible to connect a sink to another sink. Based on the question, my guess is that you are trying to connect the GCS sink plugin to the BigQuery sink and have data flow from one sink to the other. That is not possible by design with Data Fusion pipelines.
Instead, you can push data from the Pub/Sub source directly to both the BigQuery and GCS sinks simultaneously, rather than one after the other, with the source connected to both sinks in parallel.
Hope this helps.
After reading this article I decided to take a shot at building a data ingestion pipeline. Everything works well. I was able to send data to Event Hub, which is ingested by Stream Analytics and sent to Data Lake. But I have a few questions about some things that seem odd to me. I would appreciate it if someone more experienced than me could answer them.
Here is the SQL inside my Stream Analytics job:
SELECT
*
INTO
[my-data-lake]
FROM
[my-event-hub]
Now, for the questions:
Should I store 100% of my data in a single file, try to split it into multiple files, or try to achieve one file per object? Stream Analytics is storing all the data inside a single file, as a huge JSON array. I tried setting {date} and {time} as variables, but it is still one huge file every day.
Is there a way to force Stream Analytics to write every entry from Event Hub to its own file? Or maybe to limit the size of the files?
Is there a way to set the name of the file from Stream Analytics? If so, is there a way to overwrite a file if the name already exists?
I also noticed the file is available as soon as it is created and is written in real time, so I can see data truncation in it when I download/display the file. Also, until it finishes, it is not valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is being written? Is it smart enough to ignore the last entry, or does it treat it as an incomplete array of objects?
Is it better to store the JSON data as an array, or with each object on a new line?
Maybe I am taking a bad approach to my issue, but I have a huge dataset in Google Datastore (Google's NoSQL solution). I only have access to the Datastore, with an account with limited permissions. I need to store this data in a Data Lake. So I made an application that streams the data from Datastore to Event Hub, which is ingested by Stream Analytics, which writes the files into the Data Lake. It is my first time using these three technologies, but it seems to be the best solution; it is my alternative to ETL chaos.
I am sorry for asking so many questions. I hope someone can help me out.
Thanks in advance.
I am only going to answer the file aspect:
It is normally better to produce larger files for later processing than many very small files. Given that you are using JSON, I would suggest limiting the files to a size that your JSON extractor can manage without running out of memory (if you decide to use a DOM-based parser).
I will leave that to an ASA expert.
ditto.
The answer here depends on how ASA writes the JSON. Clients can append to files, and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should only see a valid JSON document. If it does not, then your query may fail.
That depends on how you plan on processing the data. Note that if you write it as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization and to be more "flexible", I would probably go with one JSON document per line.
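To illustrate that last point, here is a minimal Python sketch (not ASA-specific; the file names are hypothetical) of why one JSON document per line is easier to consume while the file is still being written:

import json

# Array layout: the whole file must be a complete, "closed" JSON document.
# If the writer is still appending, json.load() fails on the truncated array.
with open("events_array.json") as f:   # hypothetical file name
    events = json.load(f)

# Line-delimited layout: each line is an independent JSON document, so a
# partially written last line can simply be skipped.
events = []
with open("events_lines.json") as f:   # hypothetical file name
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # last line still being written; ignore it for now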
I have a use case:
We have a Java framework that parses real-time data from Kinesis into a Hive table every half an hour.
I need to access this Hive table and do some processing in near real time. An hour of delay is fine, as I don't have permission to access the Kinesis stream.
Once processing is done in Spark (preferably PySpark), I have to create a new Kinesis stream and push the data to it.
I will then use Splunk to pull it in near real time.
The question is: has anyone done Spark streaming from Hive using Python? I have to do a POC first and then the actual work.
Any help will be highly appreciated.
Thanks in advance!!
There are two ways to go ahead with this:
Use spark-streaming to obtain messages directly from Kinesis. That will give you something that is real time.
Once the files drop into your staging area (either your Hive warehouse or some HDFS location), you can pick them up for processing with Spark streaming for files, as in the sketch below.
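For the second option, here is a minimal PySpark sketch using Structured Streaming's file source; the staging directory, file format, and schema are hypothetical and should be adjusted to the real table layout:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("hive-dir-stream").getOrCreate()

# File sources require an explicit schema; adjust to the actual columns.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

# Watch the directory the half-hourly job writes into (hypothetical path).
stream_df = (spark.readStream
             .schema(schema)
             .parquet("/user/hive/warehouse/realtime_events"))

def push_batch(batch_df, batch_id):
    # Placeholder for the downstream step, e.g. writing each micro-batch
    # to the new Kinesis stream with boto3 put_records.
    print(f"batch {batch_id}: {batch_df.count()} rows")

query = (stream_df.writeStream
         .foreachBatch(push_batch)
         .trigger(processingTime="30 minutes")
         .start())

query.awaitTermination()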
Do let us know which approach worked best for you.