Event Hub, Stream Analytics and Data Lake pipeline questions

After reading this article I decided to take a shot at building a data ingestion pipeline. Everything works well. I was able to send data to Event Hub, which is ingested by Stream Analytics and sent to Data Lake. But I have a few questions about some things that seem odd to me. I would appreciate it if someone more experienced than me could answer them.
Here is the SQL inside my Stream Analytics job:
    SELECT *
    INTO [my-data-lake]
    FROM [my-event-hub]
Now, for the questions:
Should I store 100% of my data in a single file, try to split it into multiple files, or try to achieve one file per object? Stream Analytics is storing all the data inside a single file, as one huge JSON array. I tried setting {date} and {time} as variables, but it is still one huge file every day.
Is there a way to force Stream Analytics to write every entry from Event Hub to its own file? Or maybe to limit the size of each file?
Is there a way to set the name of the file from Stream Analytics? If so, is there a way to overwrite a file if the name already exists?
I also noticed that the file is available as soon as it is created, and that it is written in real time, so I can see truncated data in it when I download/display the file. Also, before writing finishes, it is not valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is still being written? Is it smart enough to ignore the last entry, or to recognize it as an incomplete array of objects?
Is it better to store the JSON data as one array, or with each object on its own line?
Maybe I am taking a bad approach to my problem, but I have a huge dataset in Google Datastore (Google's NoSQL solution). I only have access to the Datastore, with an account with limited permissions. I need to store this data in a Data Lake. So I made an application that streams the data from Datastore to Event Hub, which is ingested by Stream Analytics, which writes the files to the Data Lake. It is my first time using these three technologies, but it seems to be the best solution. It is my go-to alternative to ETL chaos.
I am sorry for asking so many questions. I hope someone can help me out.
Thanks in advance.

I am only going to answer the file aspect:
It is normally better to produce larger files for later processing than many very small files. Given that you are using JSON, I would suggest limiting the files to a size that your JSON extractor will be able to manage without running out of memory (if you decide to use a DOM-based parser).
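As a rough illustration of that memory concern, here is a small Python sketch (the file name is made up, and the third-party ijson package is just one possible streaming parser; neither appears in the original answer):

    import json
    import ijson  # third-party streaming JSON parser (pip install ijson)

    def process(event):
        print(event)  # placeholder for whatever you do with each record

    # DOM-style parsing: the whole file has to fit in memory at once.
    with open("events.json", "rb") as f:
        all_events = json.load(f)  # one huge Python list
        for event in all_events:
            process(event)

    # Streaming parsing: one array element at a time, bounded memory.
    with open("events.json", "rb") as f:
        for event in ijson.items(f, "item"):  # "item" = each element of the top-level array
            process(event)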
I will leave that to an ASA expert.
ditto.
The answer here depends on how ASA writes the JSON. Clients can append to files, and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should only ever see a valid JSON document. If it does not, then your query may fail.
That depends on how you plan on processing the data. Note that if you write it as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization, and to be more "flexible", I would probably go with one JSON document per line.
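To make the "one JSON document per line" idea concrete, here is a minimal Python sketch (file and field names are invented for the example):

    import json

    events = [{"device": "sensor-1", "value": 23.5},
              {"device": "sensor-2", "value": 19.0}]

    # Write newline-delimited JSON: one complete document per line.
    with open("events.ndjson", "w") as f:
        for e in events:
            f.write(json.dumps(e) + "\n")

    # Because every line is a complete document, a partially written file
    # can still be read up to the last complete line.
    with open("events.ndjson") as f:
        for line in f:
            record = json.loads(line)
            print(record["device"], record["value"])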

Related

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV).
JOB 2 -- Use the transformed data stored in the delimited file by JOB 1 as the source, then transform it for BigQuery and send it.
I am using a delimited text/CSV file as intermediary data storage to achieve this. Since confidentiality of the data is important and the solution also needs to scale to millions of rows, what should I use as this intermediary source? Will a relational database help? Or are delimited files good enough? Or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I am worried about security while the job runs, although it will run on a secure cloud architecture.
Please share your views on this.
In data warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this gives you the ability to trace data lineage back to the source, lets you reload your final model from the staging point when business rules change, and gives a full picture of the transformation steps the data went through all the way from landing to reporting.
I'd also consider changing your design and keeping the staging layer persistent under its own dataset in BigQuery rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whatever direction you choose, I recommend storing the files you use to load the table rather than deleting them. Sooner or later there will be questions or failures in your final report, and you'll likely need to trace back to the source to investigate.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
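As a sketch of the Cloud Storage --> BigQuery load step, assuming the google-cloud-bigquery Python client and made-up bucket, dataset and table names:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses the environment's default credentials

    # Hypothetical locations -- replace with your own bucket and dataset.
    uri = "gs://my-etl-staging-bucket/transformed/orders.csv"
    table_id = "my-project.staging.orders"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # let BigQuery infer the schema
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print(client.get_table(table_id).num_rows, "rows loaded into", table_id)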
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially allows you to reshape the data (convert it to arrays), filter out columns/rows, and perform the transformation in one single SQL statement.
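A minimal sketch of that ELT idea, again assuming the google-cloud-bigquery client; the dataset, table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # The raw data has already been loaded as-is into a staging table;
    # the transformation happens entirely inside BigQuery.
    transform_sql = """
    CREATE OR REPLACE TABLE `my-project.reporting.orders_clean` AS
    SELECT
      order_id,
      ARRAY_AGG(item) AS items,      -- reshape rows into an array
      SUM(amount)     AS total_amount
    FROM `my-project.staging.orders_raw`
    WHERE status != 'cancelled'      -- filter out unwanted rows
    GROUP BY order_id
    """

    client.query(transform_sql).result()  # runs the transformation in BigQuery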

Hortonworks: HBase, Hive, etc. used for which type of data

I would like to ask if anyone could tell me, or refer me to a web page that describes, all the possibilities for storing data in an Apache Hadoop cluster.
What I would like to know is: which type of data should be stored in which "system"? By type of data I mean, for example:
Live data (realtime)
Historical data
Data which is regularly accessed from an application
...
The question is not limited to HBase or Hive ("system") but covers everything that is available under HDP.
I hope someone can point me in a direction where I can find my answer. Thanks!
I can give you an overview, but the rest you will have to read up on yourself.
Let's begin with the types of data you want to store in HDFS:
Data in motion (which you referred to as real-time data)
So, how can you fetch real-time data? Is it even possible? The answer is no; there will always be a delay. However, we can reduce the downtime and the processing time of the data, and for that we have HDF (Hortonworks Data Flow), which works with data in motion. There are many services providing real-time data streaming; Kafka, NiFi and Storm are examples, and there are many more. These tools are used to process the data. You also need to store the data in such a way that you can fetch it almost instantly (~2 sec); for that we use HBase. HBase stores data in a columnar structure.
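As a rough sketch of the kind of fast key-based lookup HBase is used for here, using the third-party happybase client with a made-up table and row key:

    import happybase  # third-party Thrift-based HBase client

    # Connect to the HBase Thrift server (host name is an assumption).
    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("sensor_readings")

    # Write one row: row key plus column-family:qualifier -> value.
    table.put(b"sensor-42#2019-01-01T12:00", {b"data:value": b"23.5"})

    # Read it back by key -- the "fetch it almost instantly" access pattern.
    row = table.row(b"sensor-42#2019-01-01T12:00")
    print(row[b"data:value"])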
Data at rest (historical data / data stored for future use)
Storing data at rest poses no such issues. HDP (Hortonworks Data Platform) provides the services to ingest, store and process the data. We can even integrate HDF services into HDP (prior to version 2.6), which makes it easier to process data in motion as well. Here we need databases to store large amounts of data, and we have HDFS (Hadoop Distributed File System), which can store any kind of data. But we don't ONLY want to store our data; we want to fetch it quickly when it is required. So how do we plan to do that? By storing our data in a structured form, for which we are provided Hive and HBase. To process data at the terabyte scale we need to run heavy jobs, which is where MapReduce, YARN, Spark and Kubernetes come into the picture.
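And a small sketch of querying data at rest through Hive with Spark (the table name is hypothetical, and a Spark installation with Hive support is assumed):

    from pyspark.sql import SparkSession

    # Spark session with access to the Hive metastore.
    spark = (SparkSession.builder
             .appName("historic-report")
             .enableHiveSupport()
             .getOrCreate())

    # Run a SQL query against a (hypothetical) Hive table of historical events.
    daily_counts = spark.sql("""
        SELECT event_date, COUNT(*) AS events
        FROM warehouse.historic_events
        GROUP BY event_date
        ORDER BY event_date
    """)

    daily_counts.show()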
This is the basic idea of storing and processing data in Hadoop.
The rest you can always read up on the internet.

Incremental updates documentation is not clear enough

I have a database where I need to keep up with changes on Wikidata, and while I was looking for ways to do it, I found these three:
RSS
API Call
Socket.IO
I would like to know if there are other ways, and which one is the best or recommended by Wikidata.
The answer depends on how up to date you need to keep your database.
As up to date as possible
If you need to keep your database as up to date with Wikidata as possible, then you will probably want to use a combination of the solutions that you have found.
Socket.IO will provide you with a stream of what has changed, but will not necessarily give you all of the information that you need.
(Note: there is an IRC stream that would allow you to do the same thing)
Based on the data provided by the stream you can then make calls to the Wikidata API retrieving the new data.
Of course this could result in lots of API calls, so make sure you batch them, and also don't retrieve updates immediately, in case lots of changes occur in a row.
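For example, a batched request to the Wikidata API using Python and the requests library (the entity IDs are just placeholders for IDs collected from the stream):

    import requests

    # IDs gathered from the change stream, fetched in one batched call.
    changed_ids = ["Q42", "Q64", "Q90"]

    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "|".join(changed_ids),  # the API accepts up to 50 IDs per request
            "format": "json",
        },
    )

    for entity_id, entity in response.json()["entities"].items():
        label = entity.get("labels", {}).get("en", {}).get("value")
        print(entity_id, label)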
Daily or Weekly
As well as the three options you have listed above, you also have the database dumps!
https://www.wikidata.org/wiki/Wikidata:Database_download
The JSON & RDF dumps are generally recommended. The JSON dump contains the data exactly as it is stored. These dumps are made weekly.
The XML dumps are not guaranteed to have the same JSON format as the JSON dumps, as they use the internal serialization format. However, daily XML dumps are provided.
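If you go the dump route, the JSON dump can be processed one entity at a time instead of being loaded into memory in one go. A sketch, assuming the latest-all.json.bz2 file and the usual one-entity-per-line layout of that dump:

    import bz2
    import json

    # The dump is one big JSON array, but each entity sits on its own line,
    # so it can be streamed line by line.
    with bz2.open("latest-all.json.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")  # drop the trailing comma after each entity
            if line in ("[", "]", ""):
                continue  # skip the array brackets and blank lines
            entity = json.loads(line)
            print(entity["id"])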

Data handling performance

I'm not sure how best to handle the data I am working with, so I wanted to ask what you would suggest. I'm no expert, so please try to keep it as simple as possible.
I'm writing an IRC bot that maintains a massive list of users (hundreds of thousands). It gives each user points based on the time they spend in IRC. So the data I must manage consists of each user and their points.
I already tried storing the users in individual text files, but that was a bad idea, mostly due to wasting clusters on the HDD. Now I'm considering storing all the users in a single file, but I'm concerned about the efficiency of processing all the users.
Should I load the whole file into an array or just load the users that are online?
I hope you can understand what I am asking. I will try to clarify if needed.
It sounds like this program calls for the information to be stored in a SQL-style database. This load is beyond what I'd want to put into a single text file, but it can easily be handled by a database.
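A minimal sketch of what that could look like with SQLite, which ships with Python (table and column names are just an example; the upsert syntax needs SQLite 3.24+):

    import sqlite3

    conn = sqlite3.connect("irc_points.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS users (
            nick   TEXT PRIMARY KEY,
            points INTEGER NOT NULL DEFAULT 0
        )
    """)

    def add_points(nick, amount):
        # Insert the user if new, otherwise add to their existing points.
        conn.execute("""
            INSERT INTO users (nick, points) VALUES (?, ?)
            ON CONFLICT(nick) DO UPDATE SET points = points + excluded.points
        """, (nick, amount))
        conn.commit()

    add_points("alice", 5)
    add_points("alice", 3)
    print(conn.execute("SELECT points FROM users WHERE nick = ?", ("alice",)).fetchone())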

BLOBs in SQL that store a video file

I am hoping someone can explain how to use BLOBs. I see that BLOBs can be used to store video files. My question is why would a person store a video file in a BLOB in a SQL database? What are the advantages and disadvantages compared to storing pointers to the location of the video file?
A few different reasons.
If you store a pointer to a file on disk (presumably using the BFILE data type), you have to ensure that your database is updated whenever files are moved, renamed, or deleted on disk. It's relatively common when you store data like this that over time your database gets out of sync with the file system and you end up with broken links and orphaned content.
If you store a pointer to a file on disk, you cannot use transactional semantics when you're dealing with multimedia. Since you can't do something like issue a rollback against a file system, you either have to accept that there will be situations where the data on the file system doesn't match the data in the database (e.g. someone uploaded a video to the file system but the transaction that created the author and title in the database failed, or vice versa), or you have to add additional steps to the file upload to simulate transactional semantics (e.g. upload a second <>_done.txt file that just contains the number of bytes in the actual file that was uploaded). That's cumbersome and error-prone and may create usability issues.
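To make the transactional point concrete, here is a sketch using Python with SQLite (file and table names are invented; the same idea applies to BLOBs in Oracle or any other database): the video bytes and the metadata either commit together or roll back together.

    import sqlite3

    conn = sqlite3.connect("media.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS videos (
            id      INTEGER PRIMARY KEY,
            title   TEXT NOT NULL,
            content BLOB NOT NULL
        )
    """)

    with open("lecture01.mp4", "rb") as f:  # hypothetical video file
        video_bytes = f.read()

    # "with conn" wraps the statements in a single transaction: if the metadata
    # insert fails, the video bytes are rolled back too -- no orphaned content.
    with conn:
        conn.execute(
            "INSERT INTO videos (title, content) VALUES (?, ?)",
            ("Lecture 1", sqlite3.Binary(video_bytes)),
        )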
For many applications, having the database serve up the data is the easiest way to provide it to a user. If you want to avoid giving a user a direct FTP URL to your files, because they could use that to bypass some application-level security, the easiest option is to have a database-backed application where, to retrieve the data, the database reads it from the file system and returns it to the middle tier, which then sends the data to the client. If you're going to have to read the data into the database every time it is retrieved, it often makes more sense to just store the data directly in the database and let the database read it from its own data files when the user asks for it.
Finally, databases like Oracle provide additional utilities for working with multimedia data in the database. Oracle interMedia, for example, provides a rich set of objects for interacting with video data stored in the database: you can easily tag where scenes begin or end, tag where various subjects are discussed, record when the video was recorded and who recorded it, etc. And you can integrate that search functionality with searches against all your relational data. Of course, you could write an application on top of the database that did all those things as well, but then you're either writing a lot of code or using another framework in your app. It's often much easier to leverage the database functionality.
Take a read of this: http://www.oracle.com/us/products/database/options/spatial/039950.pdf
(Obviously a biased view, but it does list a few cons, which have now been fixed by the advent of 11g.)