Hive or HBase for reporting?

I am trying to understand what the best big data solution for reporting purposes would be.
Currently I have narrowed it down to HBase vs. Hive.
The use case: we have hundreds of terabytes of data spread across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance:
Show all users who logged in to the system in the last hour and whose origin is the US.
Show a graph of games ranked from most played to least played.
Out of all users in the system, show the percentage of paying vs. non-paying users.
For a given user, show their entire history: how many games they played, what kinds of games, and their score in each one.
The way I see it, there are 3 solutions:
Store all data in Hadoop and do the queries in Hive. This might work, but I am not sure about the performance. How will it perform when the data reaches 100 TB? Also, having Hadoop as the main database is probably not the best solution, as update operations will be hard to achieve, right?
Store all data in HBase and do the queries using Phoenix. This solution is nice, but HBase is a key/value store. If I join on a key that is not indexed, HBase will do a full scan, which will probably be even worse than Hive. I could put indexes on columns (roughly like the sketch after this list), but that would require indexing almost every column, which I think is not recommended.
Store all data in HBase and do the queries in Hive, which communicates with HBase using its proprietary bridge.
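For context, the kind of per-column Phoenix indexing I'm talking about would look roughly like this (just a sketch using the phoenixdb client; the host, table, and column names are made up):

```python
# Sketch of a Phoenix secondary index plus the report query it would serve.
# Assumes a Phoenix Query Server at the given URL; all names are hypothetical.
import datetime
import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cur = conn.cursor()

# Covered index so that filtering on ORIGIN does not trigger a full table scan.
cur.execute(
    "CREATE INDEX IF NOT EXISTS IDX_ORIGIN ON USER_EVENTS (ORIGIN) "
    "INCLUDE (LOGIN_TS)"
)

# "Users who logged in during the last hour from the US" report.
one_hour_ago = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
cur.execute(
    "SELECT COUNT(*) FROM USER_EVENTS WHERE ORIGIN = ? AND LOGIN_TS > ?",
    ("US", one_hour_ago),
)
print(cur.fetchone())
```

Multiply that by every column the reports filter on and it becomes a lot of indexes to maintain.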

Responses to each of your suggested solutions (based on my personal experience with a similar problem):
1) You should not think of Hive as a regular RDBMS; it is best suited for immutable data, so doing updates through Hive will hammer your cluster.
2) As Paul suggested in the comments, you can use Phoenix to create indexes, but we tried it and it was really slow with the volume of data you describe (we already saw slowness in HBase with ~100 GB of data).
3) Hive over HBase is slower than Phoenix (we tried both, and Phoenix was faster for us).
If you need to do updates, then HBase is the best option you have, and you can use Phoenix with it. Even better: if you can apply the updates in HBase, then dump the data into Parquet and query it with Hive, it will be very fast (see the sketch below).
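For illustration, the "dump to Parquet, query with Hive" step could look something like this in Spark, assuming the HBase rows have already been loaded into a DataFrame (for example via the HBase-Spark connector); all table and path names are made up:

```python
# Sketch: snapshot the (already updated) data out to Parquet and expose it to
# Hive as an external table. Paths and table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hbase-to-parquet")
         .enableHiveSupport()
         .getOrCreate())

# Stand-in for the DataFrame read out of HBase (connector-specific in practice).
events_df = spark.table("staging.events_snapshot")

# Write an immutable, columnar copy, partitioned by day for fast reporting scans.
(events_df.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("hdfs:///warehouse/reporting/events_parquet"))

# Register the Parquet files so Hive/Spark SQL can query them directly.
spark.sql("CREATE DATABASE IF NOT EXISTS reporting")
spark.sql("""
    CREATE TABLE IF NOT EXISTS reporting.events
    USING PARQUET
    LOCATION 'hdfs:///warehouse/reporting/events_parquet'
""")
```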

You can use a lambda architecture: HBase together with a stream-compute tool such as Spark Streaming. You store the data in HBase, and when new data arrives you update both the original data and the reports via the stream job. When a new report is created, you can generate it from a full scan of HBase; after that, the report can be kept up to date by the stream job. You can also run a MapReduce job periodically to correct the stream-computed results.
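A rough sketch of the streaming half, assuming Spark Structured Streaming, Kafka as the event source, and happybase for the HBase writes (topic, table, and column names are all invented):

```python
# Sketch: aggregate incoming events with Spark Structured Streaming and upsert
# the running counts into an HBase report table via foreachBatch + happybase.
import happybase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "user-events")
          .load()
          .selectExpr("CAST(value AS STRING) AS game_id"))

plays_per_game = events.groupBy("game_id").count()

def upsert_to_hbase(batch_df, batch_id):
    # Runs on the driver per micro-batch; fine for a sketch, not for huge batches.
    conn = happybase.Connection("hbase-thrift-host")
    table = conn.table("report_game_plays")
    for row in batch_df.collect():
        table.put(row["game_id"].encode(),
                  {b"r:plays": str(row["count"]).encode()})
    conn.close()

(plays_per_game.writeStream
 .outputMode("update")
 .foreachBatch(upsert_to_hbase)
 .option("checkpointLocation", "hdfs:///checkpoints/report-stream")
 .start()
 .awaitTermination())
```

The periodic batch job would recompute the same counters from a full scan and overwrite these rows to correct any drift in the streaming results.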

The first solution (store all data in Hadoop and query it with Hive) won't allow you to update data; you can only insert into the Hive table. Plain Hive is pretty slow; in my experience it's better to use Hive LLAP or Impala. I've used Impala and it shows pretty good performance, but it can only run one query efficiently at a time. And of course, updating rows isn't possible in Impala either.
The third solution will give you really slow join performance. I've tried Impala with HBase, and joins work extremely slowly.
For guidance on the ratio of data size to cluster size for Impala, see https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_cluster_sizing.html
If you need row updates, you can try Apache Kudu.
Here is the integration guide for Kudu with Impala: https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_kudu.html
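For example, creating and updating a Kudu-backed table through Impala looks roughly like this (using the impyla client; host and table names are placeholders):

```python
# Sketch of the Kudu route: a Kudu-backed table managed through Impala that
# supports row-level upserts and updates, unlike HDFS-backed Hive tables.
from impala.dbapi import connect

conn = connect(host="impalad-host", port=21050)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS user_stats (
        user_id STRING,
        games_played BIGINT,
        total_score BIGINT,
        PRIMARY KEY (user_id)
    )
    PARTITION BY HASH (user_id) PARTITIONS 16
    STORED AS KUDU
""")

# Row-level upserts and in-place updates work directly in Impala SQL.
cur.execute("UPSERT INTO user_stats VALUES ('user-42', 17, 90210)")
cur.execute(
    "UPDATE user_stats SET games_played = games_played + 1 "
    "WHERE user_id = 'user-42'"
)
```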

Related

What tools to use for reporting for Terabyte scale of data in AWS

We have data in the range of 100 terabytes. We need to run pre-defined reports on this data (not much ad-hoc reporting). For the pre-defined reports, the expected turnaround time is tens of seconds; it should not be minutes.
We are all AWS serverless so far. The original data source is DynamoDB. What toolset should we use? Is Redshift Serverless overkill for this? I hear the data should be at least PB scale to justify it.
The 100 terabytes of data are expected to accumulate over the next 5 years.
I have been leaning towards streaming the data to S3 and using Athena for querying, but the unpredictability of query performance is making us anxious.
Is Redshift / Redshift Serverless / Spectrum overkill for this use case?
Any better toolset that I should explore?
Given the light query load but large data volume, with the data stored in S3, I'd go with Redshift Serverless, keeping the bulk of the data in S3 and accessing it via Spectrum. This gives you the full functionality of Redshift without paying for a cluster that's up full time.
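A sketch of what that looks like in practice, using the Redshift Data API against a serverless workgroup (workgroup, IAM role, schema, and table names are all placeholders):

```python
# Sketch of the Spectrum approach with Redshift Serverless: register the S3
# data as an external schema, then run the pre-defined reports against it.
import boto3

rsd = boto3.client("redshift-data")

def run(sql):
    return rsd.execute_statement(
        WorkgroupName="reporting-wg",   # Redshift Serverless workgroup
        Database="dev",
        Sql=sql,
    )

# One-time setup: expose the Glue data catalog over the data in S3.
run("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS events_s3
    FROM DATA CATALOG DATABASE 'events_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# A pre-defined report runs directly against the data sitting in S3.
run("""
    SELECT report_date, COUNT(*) AS events
    FROM events_s3.user_events
    WHERE report_date >= DATEADD(day, -7, CURRENT_DATE)
    GROUP BY report_date
""")
```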
If you can stream data to S3 and your pre-defined reports aggregate over a certain time window (hour/day/week), I would suggest using a database that supports rollups or materialized views so your queries run fast (which is one of your requirements). With rollups or materialized views you are querying a much smaller table, saving money on storage and on compute at query time. At the same time, pay attention to the ease of use and the cost of creating those rollups or materialized views.
Granted, I don't know what your data looks like, but being able to handle things like nested JSON, which is common with DynamoDB, is probably another important requirement. Some systems handle nested JSON out of the box, and some will require ETL tooling to 'flatten' the data.
Here are a few I know that can do rollups and/or materialized views:
Rockset
Druid
Elasticsearch
Clickhouse
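To make the rollup/materialized view idea concrete, here is roughly what an hourly rollup looks like in ClickHouse, one of the systems above (table and column names are invented):

```python
# Sketch: an hourly materialized view in ClickHouse so report queries hit a
# small pre-aggregated table instead of the raw event data.
from clickhouse_driver import Client

client = Client(host="clickhouse-host")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_time DateTime,
        user_id String
    ) ENGINE = MergeTree ORDER BY event_time
""")

# Maintained incrementally as data streams in.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_hourly
    ENGINE = SummingMergeTree ORDER BY hour
    AS SELECT toStartOfHour(event_time) AS hour, count() AS events
    FROM events GROUP BY hour
""")

# The pre-defined report reads the tiny rollup table, not the raw events.
rows = client.execute(
    "SELECT hour, events FROM events_hourly ORDER BY hour DESC LIMIT 24"
)
```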
Good luck!

How can I load data from BigQuery to Spanner?

I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programmatically? We are already using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it there, that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
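A minimal Beam (Dataflow) sketch of that pipeline. The Spanner connector module and its exact API vary by Beam SDK version, and all project, instance, and table names are placeholders, so treat this as an outline rather than a drop-in job:

```python
# Sketch: read the aggregated rows from BigQuery and upsert them into Spanner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.experimental.spannerio import WriteMutation, WriteToSpanner

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

def to_mutation(row):
    # Each BigQuery row arrives as a dict; turn it into a Spanner upsert.
    return WriteMutation.insert_or_update(
        table="daily_aggregates",
        columns=("user_id", "metric", "value"),
        values=[(row["user_id"], row["metric"], row["value"])],
    )

with beam.Pipeline(options=options) as p:
    (p
     | "ReadAggregates" >> beam.io.ReadFromBigQuery(
           query="SELECT user_id, metric, value FROM mydataset.daily_aggregates",
           use_standard_sql=True)
     | "ToMutations" >> beam.Map(to_mutation)
     | "WriteToSpanner" >> WriteToSpanner(
           project_id="my-project",
           instance_id="my-instance",
           database_id="my-database"))
```

A daily job like this can be triggered from Airflow with the Dataflow operators, which addresses the scheduling side of the question.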

Inserting large amount of data into Redshift

I'm looking to build a RESTful API in Go that would be in charge of inserting data sent by multiple mobile apps into an Amazon Redshift cluster, possibly receiving tens of thousands of requests per second.
From what I have read, Redshift gives slow insert speeds.
That's why a few people have advised me to use an intermediate store like DynamoDB or S3, where I'd perform the inserts first and then import the data into Redshift afterwards.
I'm wondering why I would need Redshift at all in that case, since the data would already be stored in a database. Do you think I could proceed differently?
I have also thought of a simpler solution: writing to a queue and progressively inserting the data into Redshift. But I think that could become a problem if the queue keeps growing because the insert speed can't keep up with the incoming data.
Thanks in advance for your help! :-)
Advice like this is normally off-topic for StackOverflow, but...
Amazon Redshift is a massively parallel processing (MPP) database with an SQL interface. It can be used to query TBs and even PBs of data and it can do it very efficiently.
You ask "why would I need to use Redshift" -- the answer is if your querying requirements cannot be satisfied with a traditional database. If you can satisfactorily use a normal database for your queries, then there's no real reason to use Redshift.
However, if your queries need Redshift, then you should continue to use it. The design of Redshift is such that the most efficient way to insert data is to load from Amazon S3 via the COPY command. It is inefficient to insert data via normal INSERT statements unless they are inserting many rows per INSERT statement (eg hundreds or thousands).
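To make that concrete, a typical pattern is to stage micro-batches of files in S3 and issue a COPY per batch, rather than row-by-row INSERTs. A sketch (connection details, bucket, and IAM role are placeholders; psycopg2 is just one convenient client):

```python
# Sketch: bulk-load a staged batch of files from S3 into Redshift with COPY.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    # One COPY per batch of files; Redshift parallelizes the load across slices.
    cur.execute("""
        COPY app_events
        FROM 's3://my-bucket/incoming/2024-06-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto'
        GZIP
    """)
```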
So, some questions to ask:
Do I need the capabilities of Amazon Redshift for my queries, or can a traditional database suffice?
Do I need to load data in real-time, or is it sufficient to load in batches?
If using batches, how often do I need to load the batch? Can I do it hourly or daily, or does it need to be within a few minutes of the data arriving?
You could also consider using Amazon Kinesis Firehose, which can accept a stream of data and insert it into an Amazon Redshift database automatically.
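If you go the Firehose route, the API side stays very simple: each request just puts a record on the delivery stream and Firehose handles the batching into S3/Redshift for you. A sketch (the stream name is a placeholder):

```python
# Sketch: the Go API's equivalent call, shown here with boto3 for brevity.
import json
import boto3

firehose = boto3.client("firehose")

def record_event(event: dict) -> None:
    # Firehose buffers these records and loads them into Redshift in batches.
    firehose.put_record(
        DeliveryStreamName="app-events-to-redshift",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

record_event({"user_id": "u-1", "action": "level_complete", "score": 1200})
```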

Google CloudSQL or BigQuery for Big Data Actively Update Every Second

I'm currently using Google Cloud SQL for my needs.
I'm collecting data from user activities. Every day the number of rows in my table grows by around 9-15 million, and the table is updated every second. The data includes several main parameters like user location (latitude/longitude), timestamp, user activities, conversations, and more.
I need to constantly pull a lot of insights from this activity data, like "how many users between latitude/longitude A and latitude/longitude B have used my app per hour over the last 30 days?".
Because my table gets bigger every day, it's hard to keep the SELECT queries performing well. (I have already added indexes, especially on the most commonly used parameters.)
All my inserts, selects, updates and so on are executed from an API that I wrote in PHP.
So my question is: would I get much better results if I used Google BigQuery for this?
If yes, how can I do it? Isn't BigQuery (forgive me if I'm wrong) designed for static data rather than data that is constantly updated? How can I connect my Cloud SQL data to BigQuery in real time?
Which is better: optimizing my tables in Cloud SQL to speed up the SELECTs, or using BigQuery (if possible)?
I'm also open to other alternatives or suggestions for optimizing my Cloud SQL performance :)
Thank you
Sounds like BigQuery would be far better suited to your use case. I can think of a good solution:
Migrate existing data from CloudSQL to BigQuery.
Stream events directly to BigQuery (using an async queue).
Use time-partitioned tables in BigQuery (see the sketch below).
If you use BigQuery, you don't need to worry about performance or scaling. That's all handled for you by Google.
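A small sketch of points 2 and 3: a day-partitioned events table plus streaming inserts from the API layer (dataset, table, and schema are placeholders, and the async queue in front of this call is omitted):

```python
# Sketch: create a day-partitioned BigQuery table and stream events into it.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.user_events"

schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("lat", "FLOAT"),
    bigquery.SchemaField("lng", "FLOAT"),
    bigquery.SchemaField("event_time", "TIMESTAMP"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_time")
client.create_table(table, exists_ok=True)

# Called (ideally from an async worker) for each batch of incoming events.
errors = client.insert_rows_json(table_id, [
    {"user_id": "u-1", "lat": 52.52, "lng": 13.40,
     "event_time": "2024-06-01T12:00:00Z"},
])
if errors:
    raise RuntimeError(errors)
```

Queries like the latitude/longitude-per-hour report then prune by partition, so they only scan the days they actually need.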

Easiest way to persist Cassandra data to S3 using Spark

I am trying to figure out how best to store and retrieve data between S3 and Cassandra using Spark. I have log data that I store in Cassandra, and I run Spark using DSE to analyse it, which works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store older logs somewhere for at least 6 months, and after some research, S3 with Glacier looks like the most promising solution.

I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3. My problem is this: I can't seem to settle on the right format for saving the Cassandra rows to a file, such that I could one day load the file back into Spark and run an analysis if I have to. I only ever want to run that analysis in Spark; I don't need to persist the data back into Cassandra. JSON seems like the obvious choice, but is there another format I'm not considering? Should I use Spark SQL? Any advice is appreciated before I commit to one format or another.
Apache Parquet is designed for this kind of use case. It is a columnar storage format that provides column compression and some indexing.
It is becoming a de facto standard. Many big data platforms are adopting it or at least providing some support for it.
You can query it efficiently directly in S3 using SparkSQL, Impala or Apache Drill. You can also run EMR jobs against it.
To write data to Parquet from Spark, use DataFrame.saveAsParquetFile (or, in newer Spark versions, DataFrame.write.parquet).
Depending on your specific requirements you may even end up not needing a separate Cassandra instance.
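For reference, the Spark side of that Parquet round trip is only a few calls (shown with the DataFrame writer API; the Spark-Cassandra connector options, bucket paths, and table names are assumptions):

```python
# Sketch: archive day-15 Cassandra logs to Parquet on S3, then read them back
# later for ad-hoc analysis with Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("archive-old-logs")
         .getOrCreate())

# Day-15 logs pulled out of Cassandra via the Spark-Cassandra connector.
old_logs = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="logs", table="events")
            .load()
            .where("log_date = date_sub(current_date(), 15)"))

# Archive to S3 as Parquet, partitioned so a later job can prune by day.
(old_logs.write
 .mode("append")
 .partitionBy("log_date")
 .parquet("s3a://my-log-archive/events/"))

# Months later: load the archive straight back and run Spark SQL on it.
archived = spark.read.parquet("s3a://my-log-archive/events/")
archived.createOrReplaceTempView("archived_events")
spark.sql(
    "SELECT count(*) FROM archived_events WHERE log_date = '2016-01-15'"
).show()
```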
You may also find this post interesting.