Automatic Hive or Cascading for ETL in AWS-EMR - hive

I have a large dataset residing in AWS S3. The data is typically transactional (like call records). I run a sequence of Hive queries that repeatedly aggregate and filter it to produce a couple of final compact files (CSVs with a few million rows at most).
So far with Hive, I have had to run the queries manually, one after another (since some queries occasionally fail due to problems on the AWS side, etc.).
I have processed two months of data this way so far.
For subsequent months, I want to write a workflow that executes the queries one by one and, should a query fail, reruns it. This can't be done by running Hive queries from a bash script (my current approach, at least):
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql (this one might need Table A to be populated before it executes).
Alternatively, I have been looking at Cascading, wondering whether it might be the solution to my problem; it does have Lingual, which might fit the case. I'm not sure, though, how it fits into the AWS ecosystem.
Ideally there would be some Hive query workflow tool for this. Otherwise, what other options do I have in the Hadoop ecosystem?
Edit:
I am looking at Oozie now, though I'm running into a lot of issues setting it up on EMR. :(

You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to perform or retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
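For a sense of what that looks like, here is a rough sketch of a pipeline definition with two chained HiveActivity steps and retries, written as a Python dict mirroring the JSON definition-file format. The field names follow the HiveActivity page linked above, but the objects below are illustrative and incomplete (a real definition also needs a schedule, a fully configured EMR resource, and, per the docs, input/output data nodes); this is not a tested pipeline.

pipeline_definition = {
    "objects": [
        {
            "id": "EmrClusterForHive",
            "name": "EmrClusterForHive",
            "type": "EmrCluster",
            # A real cluster object also needs instance types, a schedule and roles.
        },
        {
            "id": "PopulateTableA",
            "name": "PopulateTableA",
            "type": "HiveActivity",
            "runsOn": {"ref": "EmrClusterForHive"},
            "scriptUri": "s3://mybucket/createAndPopulateTableA.sql",
            "maximumRetries": "3",  # rerun the step if it fails
        },
        {
            "id": "PopulateTableB",
            "name": "PopulateTableB",
            "type": "HiveActivity",
            "runsOn": {"ref": "EmrClusterForHive"},
            "scriptUri": "s3://mybucket/createAndPopulateTableB.sql",
            "dependsOn": {"ref": "PopulateTableA"},  # Table A must be populated first
            "maximumRetries": "3",
        },
    ]
}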

Related

Hive or HBase for reporting?

I am trying to understand what the best big data solution for reporting purposes would be.
Currently I have narrowed it down to HBase vs. Hive.
The use case is that we have hundreds of terabytes of data across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance:
Show all users that logged in to the system in the last hour whose origin is the US.
Show a graph of the games from most played to least played.
Of all users in the system, show the percentage of paying vs. non-paying users.
For a given user, show their entire history: how many games they played, what kind of games, and their score in each and every game.
The way I see it, there are 3 solutions:
Store all data in Hadoop and do the queries in Hive. This might work, but I am not sure about the performance. How will it perform when the data is 100 TB? Also, having Hadoop as the main database is probably not the best solution, as update operations will be hard to achieve, right?
Store all data in HBase and do the queries using Phoenix. This solution is nice, but HBase is a key/value store. If I join on a key that is not indexed, HBase will do a full scan, which will probably be even worse than Hive. I can put indexes on columns, but that would require indexing almost every column, which I think is not the best approach.
Store all data in HBase and do the queries in Hive, which communicates with HBase through its bridge (the HBase storage handler).
Respective responses to your suggested solutions (based on my personal experience with a similar problem):
1) You should not think of Hive as a regular RDBMS; it is best suited for immutable data. So it is like killing your cluster if you try to do updates through Hive.
2) As suggested by Paul in the comments, you can use Phoenix to create indexes, but we tried it and it was really slow with the volume of data you mention (we saw slowness in HBase with ~100 GB of data).
3) Hive with HBase is slower than Phoenix (we tried both, and Phoenix was faster for us).
If you are going to do updates, then HBase is the best option you have, and you can use Phoenix with it. However, if you can make the updates in HBase, dump the data into Parquet, and then query it with Hive, it will be very fast.
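For the Parquet route, here is a minimal PySpark sketch of the "dump and query from Hive" step, assuming the data has already been pulled out of HBase into a DataFrame (the staging path and table names below are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hbase-snapshot-to-parquet")
    .enableHiveSupport()
    .getOrCreate())

# Stand-in for data extracted from HBase (e.g. via an HBase-Spark connector);
# the extraction step itself is outside the scope of this sketch.
events_df = spark.read.json("hdfs:///staging/hbase_export/events")

# Write the snapshot as Parquet and register it in the Hive metastore,
# so the reports can query it with plain HiveQL.
(events_df.write
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("reporting.events_snapshot"))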
You can use a lambda architecture, i.e. HBase along with a stream-compute tool such as Spark Streaming. You store the data in HBase, and when new data comes in, you update both the original data and the reports via the stream computation. When a new report is created, you can generate it from a full scan of HBase; after that, the report can be updated by the stream computation. You can also use a MapReduce job to reconcile the stream-compute results periodically.
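A minimal sketch of the streaming leg of that lambda setup, using Spark Structured Streaming with foreachBatch and the happybase HBase client; the Kafka topic, HBase table and column family below are assumptions, not part of the original answer:

import happybase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report-stream-update").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "game_events")
    .load()
    .selectExpr("CAST(key AS STRING) AS user_id", "CAST(value AS STRING) AS payload"))

def update_hbase(batch_df, batch_id):
    # Collects each micro-batch on the driver for simplicity; a production job
    # would write from the executors instead. Requires the HBase Thrift server.
    conn = happybase.Connection("hbase-thrift-host")
    table = conn.table("report_counters")
    for row in batch_df.collect():
        table.counter_inc(row["user_id"].encode(), b"stats:event_count")
    conn.close()

query = events.writeStream.foreachBatch(update_hbase).start()
query.awaitTermination()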
The first solution (store all data in Hadoop and do the queries in Hive) won't allow you to update data; you can only insert into the Hive table. Plain Hive is pretty slow; in my opinion it's better to use Hive LLAP or Impala. I've used Impala and it shows pretty good performance, but it can efficiently run only one query at a time. And of course, updating rows isn't possible in Impala either.
The third solution will have really slow join performance. I've tried Impala with HBase, and joins are extremely slow.
On the ratio of data size to cluster size for Impala, see https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_cluster_sizing.html
If you need row updates, you can try Apache Kudu.
Here is the integration guide for Kudu with Impala: https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_kudu.html
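For completeness, a small sketch of what row-level updates look like once Impala is backed by Kudu, using the impyla client; the host, table and columns are illustrative:

from impala.dbapi import connect

conn = connect(host="impala-daemon-host", port=21050)
cur = conn.cursor()

# Kudu-backed table: requires a primary key and supports UPDATE/DELETE,
# unlike plain HDFS-backed Impala tables.
cur.execute("""
    CREATE TABLE IF NOT EXISTS user_stats (
        user_id BIGINT,
        games_played INT,
        is_paying BOOLEAN,
        PRIMARY KEY (user_id)
    )
    PARTITION BY HASH (user_id) PARTITIONS 4
    STORED AS KUDU
""")

# Row-level update, which plain Hive/HDFS tables cannot do efficiently.
cur.execute("UPDATE user_stats SET is_paying = true WHERE user_id = 42")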

BigQuery best approach for ETL (external tables and views vs Dataflow)

CSV files get uploaded to an FTP server (for which I don't have SSH access) on a daily basis, and I need to generate weekly data that merges those files with some transformations. That data would go into a history table in BQ and a CSV file in GCS.
My approach goes as follows:
Create a Linux VM and set up a cron job that syncs the files from the FTP server to a GCS bucket (I'm using GCSFS)
Use an external table in BQ for each category of CSV files
Create views with complex queries that transform the data
Use another cron job to create a table with the historic data and also the CSV file on a weekly basis.
My idea is to remove as many intermediate processes as I can and to make the implementation as easy as possible, possibly including Dataflow for the ETL, but I have some questions first:
What's the problem with my approach in terms of efficiency and money?
Is there anything Dataflow can provide that my approach can't?
Any ideas about other approaches?
BTW, I ran into one problem that might be fixable by parsing the CSV files myself rather than using external tables: invalid characters, like the null character. If I parse the files myself I can strip them out, whereas with an external table I just get a parsing error.
Your ETL will probably be simplified by a Google Dataflow batch pipeline job. Upload your files to the GCS bucket. For the transformation, use pipeline transforms to strip null values and invalid characters (or whatever your need is). On the transformed dataset, apply your complex queries, like grouping by key and aggregating (sum or combine); if you need side inputs, Dataflow also provides the ability to merge other datasets into the current one. Finally, the transformed output can be written to BQ, or you can write your own custom sink for the results.
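A minimal Apache Beam (Python SDK) sketch of that shape, reading the daily CSVs from GCS, stripping the bad characters, and writing to BigQuery; the bucket, project, dataset and columns are illustrative:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Strip the null characters the external tables choke on, then split the CSV row.
    clean = line.replace("\x00", "")
    user_id, amount, day = clean.split(",")
    return {"user_id": user_id, "amount": float(amount), "day": day}

# Pass --runner=DataflowRunner, --project, --temp_location, etc. on the command line.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | "ReadCSVs" >> beam.io.ReadFromText("gs://my-bucket/daily/*.csv", skip_header_lines=1)
     | "Parse" >> beam.Map(parse_line)
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:reporting.weekly_history",
           schema="user_id:STRING,amount:FLOAT,day:DATE",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))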
So Dataflow gives your solution a lot of flexibility: you can branch the pipeline and work differently on each branch with the same dataset. Regarding cost, if you run your batch job with three workers (the default), it should not be very costly. Again, if you just want to concentrate on your business logic and not worry about the rest, Google Dataflow is pretty interesting, and it's very powerful if used wisely.
Dataflow helps you keep everything in one place and manage it effectively. Go through its pricing and determine whether it's the best fit for you (your problem is completely solvable with Google Dataflow). Your approach is not bad, but it needs extra maintenance for all those pieces.
Hope this helps.
Here are a few thoughts.
If you are working with a very low volume of data, then your approach may work just fine. If you are working with more data and need several VMs, Dataflow can automatically scale the number of workers your pipeline uses up and down to help it run more efficiently and save costs.
Also, is your Linux VM always running? Or does it only spin up when you run your cron job? A batch Dataflow job only runs when it is needed, which also helps to save on costs.
In Dataflow you could use TextIO to read each line of the file in, and add your custom parsing logic.
You mention that you have a cron job which puts the files into GCS. Dataflow can read from GCS, so it would probably be simplest to keep that process around and have your dataflow job read from GCS. Otherwise you would need to write a custom source to read from your FTP server.
Here are some useful links:
https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling

Easiest way to persist Cassandra data to S3 using Spark

I am trying to figure out how to best store and retrieve data, from S3 to Cassandra, using Spark. I have log data that I store in Cassandra. I run Spark using DSE to perform analysis of the data, and it works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store older logs somewhere for at least 6 months, and after some research, S3 with Glacier looks like the most promising solution. I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3. My problem is this: I can't seem to settle on the right format in which to save the Cassandra rows to a file, such that I could one day potentially load the file back into Spark and run an analysis if I have to. If that day comes, I only want to run the analysis in Spark, not persist the data back into Cassandra. JSON seems to be an obvious choice, but is there any other format that I am not considering? Should I use Spark SQL? Any advice appreciated before I commit to one format or another.
Apache Parquet is designed for this kind of use case. It is a columnar storage format. It provides column compression and some indexing.
It is becoming a de facto standard. Many big data platforms are adopting it or at least providing some support for it.
You can query it efficiently directly in S3 using SparkSQL, Impala or Apache Drill. You can also run EMR jobs against it.
To write data to Parquet using Spark, use DataFrame.saveAsParquetFile (or, in newer Spark versions, DataFrame.write.parquet).
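As a sketch of the daily archival job in PySpark, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, date column and bucket names are illustrative:

from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("archive-old-logs").getOrCreate()

# The 15th day back is the batch to archive out of Cassandra.
cutoff = (date.today() - timedelta(days=15)).isoformat()

logs = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="logging", table="events")
    .load()
    .filter("log_date = '{}'".format(cutoff)))

# Parquet keeps the schema and compresses well, so the files can be loaded
# straight back into Spark later for ad-hoc analysis.
(logs.write
    .mode("append")
    .parquet("s3a://my-log-archive/events/log_date={}".format(cutoff)))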
Depending on your specific requirements you may even end up not needing a separate Cassandra instance.
You may also find this post interesting

Hive queries of external tables stored on Google Cloud Storage extremely slow

I have begun testing the Google Cloud Storage connector for Hadoop, and I am finding it incredibly slow for Hive queries run against it.
It seems a single client must scan the entire file system before starting the job; with tens of thousands of files this takes tens of minutes. Once the job is actually running it performs well.
Is this a configuration issue or just the nature of Hive/GCS? Can something be done to improve performance?
Running CDH 5.3.0-1 in GCE
I wouldn't say it's necessarily a MapReduce vs Hive difference, though there are possible reasons it could be more common to run into this type of slowness using Hive.
It's true that metadata operations like "stat/getFileStatus" have a slower round-trip latency on GCS than local HDFS, on the order of 30-70ms instead of single-digit milliseconds.
However, this doesn't mean it should take more than 10 minutes to start a job on 10,000 files. Best practice is to allow the connector to "batch" requests as much as possible, allowing retrieval of up to 1000 fileInfos in a single round-trip.
The key is that if I have a single directory:
gs://foobar/allmydata/foo-0000.txt
....<lots of files following this pattern>...
gs://foobar/allmydata/foo-9998.txt
gs://foobar/allmydata/foo-9999.txt
If I have my Hive "location" = gs://foobar/allmydata it should actually be very quick, because it will be fetching 1000 files at a time. If I did hadoop fs -ls gs://foobar/allmydata it should come back in <5 seconds.
However, if I have lots of small subdirectories:
gs://foobar/allmydata/dir-0000/foo-0000.txt
....<lots of files following this pattern>...
gs://foobar/allmydata/dir-9998/foo-9998.txt
gs://foobar/allmydata/dir-9999/foo-9999.txt
Then this could go awry. The Hadoop subsystem is a bit naive, so that if you just do hadoop fs -ls -R gs://foobar/allmydata in this case, it will indeed first find the 10000 directories of the form gs://foobar/allmydata/dir-####, and then run a for-loop over them, one-by-one listing the single file under each directory. This for-loop could easily take > 1000 seconds.
This was why we implemented a hook to intercept at least fully-specified glob expressions, released back in May of last year:
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/MbWx1KqY2Q4
7. Implemented new version of globStatus which initially performs a flat listing before performing the recursive glob logic in-memory to dramatically speed up globs with lots of directories; the new behavior is default, but can be disabled by setting fs.gs.glob.flatlist.enable = false.
In this case, if the subdirectory layout was present, the user can opt instead to do hadoop fs -ls gs://foobar/allmydata/dir-*/foo*.txt. Hadoop lets us override a "globStatus", so by using this glob expression, we can correctly intercept the entire listing without letting Hadoop do its naive for-loop. We then batch it up efficiently, such that we'll retrieve all 10,000 fileInfos again in <5 seconds.
This could be a bit more complicated in the case of Hive if it doesn't allow as free usage of glob expressions.
Worst case, if you can move those files into a flat directory structure then Hive should be able to use that flat directory efficiently.
Here's a related JIRA from a couple years ago describing the similar problem for how Hive deals with files in S3, still officially unresolved: https://issues.apache.org/jira/browse/HIVE-951
If it's unclear how/why the Hive client is performing the slow for-loop, you can add log4j.logger.com.google=DEBUG to your log4j.properties and re-run the Hive client to see detailed info about what the GCS connector is doing under the hood.

Related to speed of execution of Job in Amazon Elastic Mapreduce

My task is:
1) Initially, import the data from MS SQL Server into HDFS using Sqoop.
2) Through Hive, process the data and generate the result in one table.
3) The resulting Hive table is then exported back to MS SQL Server.
I want to perform all of this using Amazon Elastic MapReduce.
The data I am importing from MS SQL Server is very large (about 500,000 rows per table, and I have about 30 such tables). For this I have written a Hive task that contains only queries (and each query uses a lot of joins). Because of this, performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible. For that we have decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances, and I still get the same performance as on my local machine.
How many instances should I use in order to improve performance?
Are the instances configured automatically based on how many we use, or do I need to specify something when submitting the JAR for execution? I ask because even with two machines the time is the same.
Also, is there any other way to improve performance, or is it just a matter of increasing the number of instances? Or am I doing something wrong when executing the JAR?
Please guide me through this, as I don't know much about the Amazon servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel that lets you view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy, as described here: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html)
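If you end up launching clusters programmatically instead of through the Ruby client, the Ganglia bootstrap action can be attached at cluster-creation time. A hedged boto3 sketch (the instance types match the question; the AMI version and bootstrap-script path follow the older EMR docs and may have changed):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hive-etl-with-ganglia",
    AmiVersion="2.4.2",  # illustrative; newer clusters specify ReleaseLabel instead
    Instances={
        "MasterInstanceType": "m1.large",
        "SlaveInstanceType": "m1.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "Install Ganglia",
        "ScriptBootstrapAction": {
            # Path from the older EMR docs; verify it against the Ganglia guide linked above.
            "Path": "s3://elasticmapreduce/bootstrap-actions/install-ganglia"
        },
    }],
    # Depending on your account you may also need JobFlowRole / ServiceRole here.
)
print(response["JobFlowId"])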