Temperature based data lake - azure-data-lake

I recently came across the term "temperature-based data lake". I searched for information about it but couldn't find anything. Could someone please explain what it means and share any links? Thanks in advance.

When you say temperature-based data lake, I assume you are referring to the hot and cool access tiers.
Hot - Optimized for storing data that is accessed frequently.
Cool - Optimized for storing data that is infrequently accessed and stored for at least 30 days.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
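To make the "temperature" idea concrete, here is a minimal sketch that classifies data by how recently it was accessed. The function name `choose_tier` and the 30-day threshold are assumptions made for illustration only; the right cut-off depends on your own access patterns and costs.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: data untouched for 30 days or more is treated as "cool".
COOL_AFTER = timedelta(days=30)


def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Return "Hot" for recently accessed data, otherwise "Cool"."""
    return "Cool" if now - last_accessed >= COOL_AFTER else "Hot"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(choose_tier(datetime(2024, 1, 30, tzinfo=timezone.utc), now))  # Hot
print(choose_tier(datetime(2023, 11, 1, tzinfo=timezone.utc), now))  # Cool
```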
HTH.

Related

Azure Data Factory - load Application Insights logs to Data Lake Gen 2

I have Application Insights configured with a retention period for logs of three months and I want to load them using Data Factory pipelines, scheduled daily, to a Data Lake Gen 2 storage.
The purpose of doing this is to not lose data after the retention period passes and to have the data stored for future purposes - Machine Learning and Reporting, mainly.
I am trying to decide what format to use for storing this data, from the many formats available in Data Lake Gen 2, so if anyone has a similar design, any information or reference to documentation would be greatly appreciated.
In my experience, most log files are .log files. If you want to keep the file type as it is when moving them to Data Lake Gen 2, use the Binary format.
The Binary format lets you copy a whole folder tree, including all sub-folders and files, to another destination.
HTH.

Hortonworks: Hbase, Hive, etc used for which type of data

I would like to ask if anyone could tell me about, or refer me to, a web page that describes all the options for storing data in an Apache Hadoop cluster.
What I would like to know is: which type of data should be stored in which "system"? By type of data I mean, for example:
Live data (realtime)
Historical data
Data which is regularly accessed from an application
...
The question is not limited to HBase or Hive (the "systems") but covers everything that is available under HDP.
I hope someone can point me in a direction where I can find my answer. Thanks!
I can give you an overview, but you will have to read up on the rest yourself.
Let's begin with the types of data you want to store in HDFS:
Data in motion (which you called real-time data)
So, how can you fetch real-time data? Is it even possible? Strictly speaking, no: there will always be some delay. However, we can reduce the latency and processing time of the data. For that there is HDF (Hortonworks DataFlow), which works with data in motion. Many services provide real-time data streaming, for example Kafka, NiFi and Storm. These tools are used to process the data. You also need to store the data in a way that lets you fetch it in almost no time (~2 sec); for that we use HBase. HBase stores data in a column-oriented structure.
Data at rest (historical data, stored for future use)
Storing data at rest raises no such issues. HDP (Hortonworks Data Platform) provides the services to ingest, store and process the data. We can even integrate HDF services into HDP (prior to version 2.6), which makes it easier to process data in motion as well. Here we need databases that can store a large amount of data. HDFS (Hadoop Distributed File System) can store any kind of data, but we don't ONLY want to store our data; we want to fetch it quickly when it is required. How do we do that? By storing our data in a structured form, for which we have Hive and HBase. To process that amount of data, which runs into terabytes, we need to run heavy jobs, and that is where MapReduce, YARN, Spark and Kubernetes come into the picture.
This is the basic idea of storing and processing data in Hadoop.
You can read up on the rest on the internet.

Best solution for storing / accessing large Integer arrays for a web application

I have a Web Application (Java backend) that processes a large amount of raw data that is uploaded from a hardware platform containing a number of sensors.
Currently the raw data is uploaded and the data is decompressed and stored as a 'text' field in a Postgresql database to allow the users to log in and generate various graphs / charts of the data (using a JS charting library clientside).
Example string...
[45,23,45,32,56,75,34....]
The arrays will typically contain ~300,000 values, but this could be up to 1,000,000 depending on how long the sensors are recording, so the size of the string being stored could be a few hundred kilobytes.
This seems to work fine for now, as there are only ~200 uploads per day, but as I am looking at the scalability of the application and the ability to back up the data, I am looking at alternatives for storing this data.
DynamoDB looked like a great option, as I could carry on storing the upload details in my SQL table and just save a URL endpoint to be called to retrieve the arrays... but then I noticed the item size is limited to 64 KB.
As I am sure there are a million and one ways to do this, I would like to put this out to the SO community to hear what others would recommend, either web services or locally stored, considering performance, scalability, maintainability, etc.
Thanks in advance!
UPDATE:
Just to clarify, the data shown above is just the 'Y' values. As it is time-sampled, the X values are taken as the position in the array, so I don't think storing it as tuples would have any benefit.
If you are looking to store such strings, you probably want to use S3 (1 object containing the array string). In this case you will have "backup" out of the box by enabling bucket versioning.
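As a rough sketch of the "one object per array" idea, the array can be serialized to bytes before it is uploaded. The gzip step and the helper names `encode_samples` and `decode_samples` are my own assumptions, not part of the suggestion above; the sample values are the ones from the question, repeated to get a larger array.

```python
import gzip
import json


def encode_samples(samples: list[int]) -> bytes:
    """Serialize the samples as compact JSON text and gzip-compress it."""
    text = json.dumps(samples, separators=(",", ":"))
    return gzip.compress(text.encode("utf-8"))


def decode_samples(blob: bytes) -> list[int]:
    """Reverse encode_samples: decompress and parse the JSON array."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


samples = [45, 23, 45, 32, 56, 75, 34] * 1000
blob = encode_samples(samples)
assert decode_samples(blob) == samples
print(len(json.dumps(samples)), "bytes as text,", len(blob), "bytes compressed")
```

The resulting bytes could then be passed as the `Body` of an S3 `put_object` call, with the object key stored in the existing SQL table next to the upload details.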
You can try a combination of Couchbase and ElasticSearch. Couchbase is a very fast document-oriented NoSQL database. Several thousand insert operations per second is normal for CB. Item size is limited to 20 MB. "get" throughput is in the tens of thousands of operations per second. There is one disadvantage: you can query data only by id (there are "views", but I think it would be too difficult to adapt them to plotting). ElasticSearch can compensate for this deficiency, as it can perform arbitrary queries very fast. Both Couchbase and ElasticSearch store data as JSON documents.
I have just come across Google Cloud Datastore, which allows me to store single-item strings up to 1 MB (un-indexed). It seems like a good alternative to Dynamo.
Maybe you should use Redis or SSDB; both are designed to store large lists (arrays) of data. The difference between the two is that Redis is memory-only (with disk used for backup), whereas SSDB is disk-based and uses memory as a cache.

What is this vague accusation of RRD data loss about?

I want to use CollectD to gather some statistics (about storage) and have Graphite display them nicely. Apparently this can be done either by
having CollectD store the data as RRD files and pointing Graphite at those, or
using a CollectD plugin to push the data to Graphite's Carbon API, which will store the data in a Whisper database (which is similar to RRD but not compatible).
I think I want to go with RRDs, but I found this statement in the Whisper docs that concerns me:
In many cases (depending on configuration) if an update is made to an
RRD series but is not followed up by another update soon, the original
update will be lost.
Hmmm. That's a bit scary, but the accusation is so vague that I don't know what to make of it. What is the configuration they are talking about, and the situation in which it causes data loss?
My situation is that the metrics data I am gathering will be available in chunks -- periodically I will go get the latest data and make as many entries into the database as there are new samples available. So, for example, I might grab some data and update the database with the values from 3 minutes ago, 2 minutes ago, and 1 minute ago, one right after the other. In fact, I might have dozens of new samples to put in the database at once. Does using RRD this way have anything to do with the Whisper accusation?
NOTE: I do not need to back-fill data; I will always be adding newer data than what has already been stored.
One scenario where I can see this happening is if you have an AVERAGE RRA set up with the xff value set to a low percentage. When the data is consolidated over time, you could get an unknown value and lose all the data that was averaged. If you are using RRD for what it was designed for, and have it set up with the proper type and settings, I wouldn't expect you to run into a problem.
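The consolidation scenario above can be illustrated with a small simulation. This is only a sketch of the idea, not RRDtool's actual code: the function name `consolidate_average` is hypothetical, `None` stands for an unknown sample, and the rule used (the result is unknown once the fraction of unknown samples exceeds the xfiles factor) is my reading of how that factor is described.

```python
from typing import Optional, Sequence


def consolidate_average(
    pdps: Sequence[Optional[float]], xff: float
) -> Optional[float]:
    """Average one consolidation interval; None marks an unknown sample.

    Assumed rule: the result is unknown when the fraction of unknown
    samples in the interval is greater than xff.
    """
    unknown = sum(1 for value in pdps if value is None)
    if unknown / len(pdps) > xff:
        return None
    known = [value for value in pdps if value is not None]
    return sum(known) / len(known) if known else None


interval = [10.0, None, None, 14.0]  # half of the samples are missing
print(consolidate_average(interval, xff=0.5))  # 12.0
print(consolidate_average(interval, xff=0.1))  # None
```

With the tolerant setting the two known samples still produce an average; with the strict setting the whole interval becomes unknown, which is the "lost" data described above.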
I would recommend taking an in-depth look at the RRDtool documentation to answer questions about how RRDs and RRAs handle the data, and the different storage techniques that are available to you.

What are data volumes?

What are data volumes?
How would you define them?
How would you calculate the data volumes for a website?
In SAP, data volumes are the spaces defined in SAP to store data or log information.
Otherwise, the English word volume means amount. A data volume is simply the amount of data in a file or database.
You would calculate the amount of data storage for a website by figuring out how much data comes in per month and multiplying that by the number of months you expect your web site to grow.
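As a small worked example of that calculation, here is a sketch with made-up numbers. The function name `projected_storage_gb` and the optional `current_gb` starting point are assumptions added for illustration.

```python
def projected_storage_gb(
    gb_per_month: float, months: int, current_gb: float = 0.0
) -> float:
    """Estimate storage as current usage plus monthly intake times months."""
    return current_gb + gb_per_month * months


# Hypothetical site taking in 5 GB per month, planned over 24 months.
print(projected_storage_gb(gb_per_month=5.0, months=24))  # 120.0
```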
Most web sites just add disk storage as needed, rather than attempt to predict how much will be needed in the future. If you're Google or Facebook, you just plan to add disk storage space constantly.
This article may help you, but I would still migrate this question to the Webmasters site on Stack Exchange if you are asking how to calculate data volumes for a website.