How to make a snapshot from hbase to hive quickly? - hive

My company has an HBase cluster of about 30 machines. Every morning I start a Spark/MR job to synchronize data from HBase to Hive so that analysts can run SQL against it.
The problem is: one table in HBase keeps growing; it has more than 1k columns and is now about 15 TB. I want to copy all of its data to Hive in less time. A full scan takes hours; is there any other way?
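For reference, the snapshot route the title alludes to is usually a two-step affair: take a metadata-only snapshot of the table and then export its HFiles with a MapReduce copy, which avoids scanning through the region servers. A minimal sketch, with placeholder table, snapshot, and cluster names:

# In the HBase shell: take a lightweight, metadata-only snapshot of the table
hbase> snapshot 'big_table', 'big_table_snap'

# Export the snapshot's HFiles in parallel with MapReduce (no live scan of region servers)
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot big_table_snap \
  -copy-to hdfs://target-cluster/hbase \
  -mappers 16

The exported snapshot can then be read offline (for example via TableSnapshotInputFormat from Spark or MapReduce) instead of scanning the live table.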

Related

I have to solve for Tableau performance with about 30 million data points

I have a use case where I need structured data (30 million records and 20 columns) in a Tableau-compatible source, and we want the Tableau charts to refresh within 1 second. NOTE: the 30 million records are already aggregated, so we cannot reduce them further.
I was thinking of using a Hive table and creating a Presto connection to build a Tableau extract, but when I did that I saw a latency of about 5-10 seconds.
Can someone help me choose a better source, and possibly a live Tableau connection, to refresh the data faster?

How to physically partition data to avoid shuffle in Spark SQL joins

I have a requirement to join 5 medium-size tables (~80 GB each) with a big input dataset of ~800 GB. All data resides in Hive tables.
I am using Spark SQL 1.6.1 to achieve this.
The join takes 40 minutes to complete with
--num-executors 20 --driver-memory 40g --executor-memory 65g --executor-cores 6. All joins are sort-merge outer joins, and I am also seeing a lot of shuffle.
I bucketed all the Hive tables into the same number of buckets so that matching keys from all tables would land in the same Spark partitions at load time, but it seems Spark does not understand Hive bucketing.
Is there any other way I can physically partition and sort the data in Hive (number of part files) so that Spark knows the partitioning keys while loading the data from Hive and can do the join within the same partitions without shuffling data around? This would avoid additional repartitioning after loading the data from Hive.
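For reference, the bucketed layout described above was declared roughly as follows (hypothetical table, column, and bucket-count values):

CREATE TABLE big_input (
  key1 STRING,
  payload STRING
)
CLUSTERED BY (key1) SORTED BY (key1) INTO 64 BUCKETS
STORED AS ORC;

-- written with bucketing enforced so each bucket lands in its own file
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE big_input SELECT key1, payload FROM staging_big_input;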
First of all, Spark SQL 1.6.1 does not support Hive buckets yet.
So in this case we are left with Spark-level operations to ensure that all tables end up in the same Spark partitions while loading the data. The Spark API provides repartition and sortWithinPartitions to achieve this, e.g.
val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))
In the same way you can generate partitions for the remaining tables and join them on the key that was sorted within partitions.
This makes the join a "shuffle-free" operation, but it comes with a significant computational cost. Caching the DataFrames (you can cache the newly repartitioned ones) performs better if the operation will be repeated. Hope this helps.
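A slightly fuller sketch of that pattern, assuming a second hypothetical DataFrame df2 that shares the join column key1 (Spark 1.6 DataFrame API):

// Repartition both sides on the join key and sort within each partition
val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))
val part2 = df2.repartition(df2("key1")).sortWithinPartitions(df2("key1"))

// Optionally cache the repartitioned frames if they are reused across several joins
part1.cache()
part2.cache()

// Join on the pre-partitioned, pre-sorted key
val joined = part1.join(part2, part1("key1") === part2("key1"), "outer")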

Inserting data to U-SQL tables is taking too long?

Inserting data into U-SQL tables is taking too much time. We are using partitioned tables to recalculate previously processed data. The first insertion took almost 10-12 minutes across three tables with 11, 5 and 1 partitions, with parallelism set to 10. Inserting the same data a second time took almost 4 hours. Currently we are using year-based partitions. We tested insertion and querying without adding partitions and performance was much better. Is this an issue with partitioned tables?
It is very strange that the same job would take that much longer for the same data and script executed with the same degree of parallelism. If you look at the job graph (or the vertex execution information) from within Visual Studio, can you see where the time is being spent?
Note that (coarse-grained) partitions are more of a data life-cycle management feature that allows you to address individual partitions of a table, and not necessarily a performance feature (although partition elimination can help with query performance). But it should not go from minutes to hours with the same script, resources and data.

Bigquery partitioning table performance

I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case, I will optimize my expenses, as I'm never querying more data than I have to. The question is, will I suffer a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (via union)?
To summarize:
More tables = less data scanned
More tables = (?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
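For context, with daily tables a "last 30 days" query is normally written against the shard prefix with a table wildcard function in legacy SQL rather than as a hand-written union; the mydataset.events_ prefix below is a placeholder:

SELECT COUNT(*)
FROM TABLE_DATE_RANGE([mydataset.events_],
                      DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                      CURRENT_TIMESTAMP())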
A lot depends on how you write your queries and how much development costs, but that amount of data doesn't seem like a barrier, so you may be trying to optimize too early.
When you JOIN tables larger than 8 MB, you need to use the EACH modifier, and that query is internally parallelized.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5,000 shards, which allows it to be processed by up to 5,000 workers in parallel. You shouldn't make any assumptions about the size or number of shards in a table. BigQuery will repartition data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day. One recommendation is to write your create/patch script so that it creates tables far into the future when it runs, e.g. I create tables for every day of the next 12 months now. This is better than having a script that creates one table each day. Also make it part of your deploy/provisioning script.
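A minimal sketch of such a pre-creation script, assuming the bq command-line tool and GNU date; the dataset, table prefix, and schema are placeholders:

#!/bin/bash
# Pre-create one day-sharded table per day for the next 365 days
for i in $(seq 0 364); do
  suffix=$(date -d "+${i} days" +%Y%m%d)
  bq mk --table mydataset.events_${suffix} ts:TIMESTAMP,user_id:STRING,payload:STRING
done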
To read more, check out Chapter 11, "Managing Data Stored in BigQuery", from the book.

Slow loading of partitioned Hive table

I'm loading a table in Hive that's partitioned by date. It currently contains about 3 years' worth of records, so circa 1,100 partitions (365 * 3).
I'm loading daily deltas into this table, adding an additional partition per day. I achieve this using dynamic partitioning, as I can't guarantee my source data contains only one day's worth of data (e.g. if I'm recovering from a failure I may have multiple days of data to process).
This is all fine and dandy; however, I've noticed the final step of actually writing the partitions has become very slow. By this I mean the logs show the MapReduce stage completes quickly; it's just very slow on the final step, which seems to scan and open all existing partitions regardless of whether they will be overwritten.
Should I be explicitly creating partitions to avoid this step?
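For reference, the daily load is roughly of this shape (table and column names are placeholders):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- dynamic-partition insert: the partition column (load_date) must come last in the SELECT
INSERT OVERWRITE TABLE target_table PARTITION (load_date)
SELECT id, metric_value, load_date
FROM daily_delta_staging;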
Whether the partitions are dynamic or static typically should not alter the performance drastically. Can you check how many actual files are getting created in each of the partitions? I just want to make sure that the actual writing is not serialized, which it could be if it's writing to only one file. Also check how many mappers and reducers were employed by the job.