Hive queries of external tables stored on Google Cloud Storage extremely slow - google-hadoop

I have begun testing the Google Cloud Storage connector for Hadoop, and I am finding it incredibly slow for Hive queries run against it.
It seems a single client must scan the entire file system before starting the job; with tens of thousands of files this takes tens of minutes. Once the job is actually running, it performs well.
Is this a configuration issue or the nature of Hive on GCS? Can something be done to improve performance?
Running CDH 5.3.0-1 in GCE

I wouldn't say it's necessarily a MapReduce vs Hive difference, though there are possible reasons it could be more common to run into this type of slowness using Hive.
It's true that metadata operations like "stat/getFileStatus" have a slower round-trip latency on GCS than local HDFS, on the order of 30-70ms instead of single-digit milliseconds.
However, this doesn't mean it should take tens of minutes to start a job on 10,000 files. Best practice is to let the connector "batch" requests as much as possible, which allows it to retrieve up to 1000 fileInfos in a single round trip.
The key is that if I have a single directory:
gs://foobar/allmydata/foo-0000.txt
....<lots of files following this pattern>...
gs://foobar/allmydata/foo-9998.txt
gs://foobar/allmydata/foo-9999.txt
If I have my Hive "location" = gs://foobar/allmydata it should actually be very quick, because it will be fetching 1000 files at a time. If I did hadoop fs -ls gs://foobar/allmydata it should come back in <5 seconds.
However, if I have lots of small subdirectories:
gs://foobar/allmydata/dir-0000/foo-0000.txt
....<lots of files following this pattern>...
gs://foobar/allmydata/dir-9998/foo-9998.txt
gs://foobar/allmydata/dir-9999/foo-9999.txt
Then this could go awry. The Hadoop subsystem is a bit naive, so that if you just do hadoop fs -ls -R gs://foobar/allmydata in this case, it will indeed first find the 10000 directories of the form gs://foobar/allmydata/dir-####, and then run a for-loop over them, one-by-one listing the single file under each directory. This for-loop could easily take > 1000 seconds.
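To make the difference between the two layouts concrete, here is a rough back-of-the-envelope sketch. It assumes one round trip per page of up to 1000 entries and one round trip per subdirectory listing; the function names are made up for illustration, and the real connector may issue more requests per directory than this lower bound.

```python
import math

def flat_listing_round_trips(file_count, page_size=1000):
    """Round trips to list one flat directory, fetched in pages of page_size."""
    return math.ceil(file_count / page_size)

def per_directory_round_trips(directory_count, page_size=1000):
    """Round trips for a naive recursive listing: the pages needed to list the
    parent, plus (at least) one listing request per subdirectory."""
    return math.ceil(directory_count / page_size) + directory_count

def estimate_ms(round_trips, latency_ms):
    """Total latency in milliseconds if the round trips run one after another."""
    return round_trips * latency_ms

flat = flat_listing_round_trips(10000)       # 10 round trips
nested = per_directory_round_trips(10000)    # 10010 round trips
print(estimate_ms(flat, 70), "ms flat vs", estimate_ms(nested, 70), "ms nested")
```

Even at this optimistic one-request-per-directory estimate the nested layout costs on the order of a thousand times more round trips, which is why the sequential for-loop dominates job start-up time.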
This was why we implemented a hook to intercept at least fully-specified glob expressions, released back in May of last year:
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/MbWx1KqY2Q4
7. Implemented new version of globStatus which initially performs a flat
listing before performing the recursive glob logic in-memory to
dramatically speed up globs with lots of directories; the new behavior is
default, but can disabled by setting fs.gs.glob.flatlist.enable = false.
In this case, if the subdirectory layout was present, the user can opt instead to do hadoop fs -ls gs://foobar/allmydata/dir-*/foo*.txt. Hadoop lets us override a "globStatus", so by using this glob expression, we can correctly intercept the entire listing without letting Hadoop do its naive for-loop. We then batch it up efficiently, such that we'll retrieve all 10,000 fileInfos again in <5 seconds.
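The idea of "one flat listing, then glob in memory" can be sketched as follows. This is only an illustration of the technique, not the connector's actual implementation; the helper names are hypothetical, and it handles just `*` and `?`, neither of which is allowed to cross a `/`.

```python
import re

def glob_to_regex(pattern):
    """Translate a simple glob into a regex where '*' and '?' never match '/'."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append("[^/]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z")

def match_flat_listing(all_paths, pattern):
    """Filter an already-fetched flat listing against the glob, with no
    further round trips to the storage service."""
    regex = glob_to_regex(pattern)
    return [path for path in all_paths if regex.match(path)]

listing = [
    "gs://foobar/allmydata/dir-0000/foo-0000.txt",
    "gs://foobar/allmydata/dir-0001/foo-0001.txt",
    "gs://foobar/allmydata/dir-0001/sub/foo-x.txt",
    "gs://foobar/allmydata/other/bar.txt",
]
matches = match_flat_listing(listing, "gs://foobar/allmydata/dir-*/foo*.txt")
```

Here `listing` stands in for the result of the batched flat listing; all of the per-directory work happens locally on that list.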
This could be a bit more complicated in the case of Hive if it doesn't allow as free usage of glob expressions.
Worst case, if you can move those files into a flat directory structure then Hive should be able to use that flat directory efficiently.
Here's a related JIRA from a couple years ago describing the similar problem for how Hive deals with files in S3, still officially unresolved: https://issues.apache.org/jira/browse/HIVE-951
If it's unclear how/why the Hive client is performing the slow for-loop, you can add log4j.logger.com.google=DEBUG to your log4j.properties and re-run the Hive client to see detailed info about what the GCS connector is doing under the hood.

Related

How to do parallel indexing on files (not on HDFS) in Solr?

I am not able to find a feasible solution so far, here is my env:
Cloudera Solr
1TB data from file system to be indexed
data format is JSON only
I know how to index from the file system, whether a single file or a folder, but how do I do that in parallel? The data is not on HDFS and cannot be put there, which rules out the usual MapReduce or Spark tools.
Has anyone encountered the same need? Thanks.
Your best bet is probably to write an indexer, in a programming language you're familiar with, that takes a slice of the available files, and then run multiple copies of it (or use multiple threads if that's easily available). That lets you submit content in parallel, and from multiple servers if necessary.
Don't use explicit commits in each client - use commitWithin so that you only commit every 60 seconds (or 10 minutes, or .. whatever interval that works for you).
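A minimal sketch of the slicing and of the update URL, assuming each indexer copy is started with its own worker index. The host, port, and collection name are placeholders, and the function names are invented for illustration; `commitWithin` is the Solr update parameter, given in milliseconds.

```python
from urllib.parse import urlencode

def slice_for_worker(files, worker_index, worker_count):
    """Give each worker every worker_count-th file, so slices are disjoint
    and together cover the whole file list."""
    return sorted(files)[worker_index::worker_count]

def update_url(solr_base, collection, commit_within_ms=60000):
    """Build an update URL that asks Solr to commit within the given window
    instead of the client issuing explicit commits."""
    query = urlencode({"commitWithin": commit_within_ms})
    return f"{solr_base}/{collection}/update?{query}"

files = [f"part-{i:04d}.json" for i in range(10)]
slices = [slice_for_worker(files, i, 3) for i in range(3)]
url = update_url("http://localhost:8983/solr", "mycollection")
```

Each copy of the indexer would then post the documents from its own slice to `url` with whatever HTTP client it prefers.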

BigQuery best approach for ETL (external tables and views vs Dataflow)

CSV files get uploaded to an FTP server (for which I don't have SSH access) on a daily basis, and I need to generate weekly data that merges those files with transformations. That data would go into a history table in BQ and a CSV file in GCS.
My approach goes as follows:
Create a Linux VM and set a cron job that syncs the files from the FTP server with a GCS bucket (I'm using GCSFS)
Use an external table in BQ for each category of CSV files
Create views with complex queries that transform the data
Use another cron job to create a table with the historic data and also the CSV file on a weekly basis.
My idea is to remove as many intermediate processes as I can and to keep the implementation as simple as possible, and I am considering Dataflow for the ETL, but I have some questions first:
What's the problem with my approach in terms of efficiency and money?
Is there anything DataFlow can provide that my approach can't?
Any ideas about other approaches?
BTW, I ran into one problem that might be fixable by parsing the CSV files myself rather than using external tables: invalid characters, like the null char. If I parse the files myself I can get rid of them, whereas with an external table they cause a parsing error.
Your ETL will probably be simplified by a Google Dataflow pipeline run as a batch job. Upload your files to the GCS bucket. To transform them, use pipeline transformations to strip null values and invalid characters (or whatever you need). On the transformed dataset, run your complex queries, such as grouping by key and aggregating (sum or combine); if you need side inputs, Dataflow can also merge other datasets into the current one. Finally, the transformed output can be written to BQ, or you can write your own custom implementation for writing the results.
So Dataflow gives your solution a lot of flexibility: you can branch the pipeline and work differently on each branch with the same dataset. Regarding cost, running your batch job with three workers, which is the default, should not be very costly. And if you just want to concentrate on your business logic and not worry about the rest, Dataflow is pretty interesting, and it's very powerful if used wisely.
Dataflow helps you keep everything on a single plate and manage it effectively. Go through its pricing and determine whether it is the best fit for you (your problem is completely solvable with Dataflow). Your approach is not bad, but it needs extra maintenance for all those pieces.
Hope this helps.
Here are a few thoughts.
If you are working with a very low volume of data then your approach may work just fine. If you are working with more data and need several VMs, dataflow can automatically scale up and down the number of workers your pipeline uses to help it run more efficiently and save costs.
Also, is your Linux VM always running? Or does it only spin up when you run your cron job? A batch Dataflow job only runs when it is needed, which also helps to save on costs.
In Dataflow you could use TextIO to read each line of the file in, and add your custom parsing logic.
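As an illustration of what such per-line parsing logic might look like, here is a small standalone sketch in plain Python (not Dataflow SDK code). The function name is hypothetical, and it assumes every record fits on one line, so quoted fields containing newlines are not handled.

```python
import csv

def parse_csv_line(line):
    """Drop null characters, then parse the line as a single CSV record."""
    cleaned = line.replace("\x00", "")
    # csv.reader accepts any iterable of lines; wrap the single line in a list.
    return next(csv.reader([cleaned]))

fields = parse_csv_line('2015-01-01,\x00store-7,"Smith, J.",42')
```

A function like this could be applied to each line the pipeline reads, before any grouping or aggregation.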
You mention that you have a cron job which puts the files into GCS. Dataflow can read from GCS, so it would probably be simplest to keep that process around and have your dataflow job read from GCS. Otherwise you would need to write a custom source to read from your FTP server.
Here are some useful links:
https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling

flink streaming or batch processing

I am tasked with redesigning an existing catalog processor, and the requirements are as follows. I have 5 to 10 vendors (each vendor can have multiple stores) who would provide me with one XML file per store. Basically, 1 products XML file per store, and multiple store files per vendor. The maximum file size can be 500 MB and the minimum 100 MB. The average number of products per file could be 100,000.
Sample xml format could be like this ... ... ...
It doesn't take more than 30 minutes to download the file for each store, and these files are updated once per day or every 3 to 6 hours.
The priority requirement is that the product details are highly unorganized, so these files have to be organized, processed (10+ processes), converted to another common object (JSON), and then stored in Cassandra.
My technology head advised me to design this with Apache Flink and Kafka on top of HDFS, where Flink directly streams the files from the vendor servers and starts processing them while streaming.
My view was that, either way, the files are of finite size and there is not much need to stream them. So I thought of having a standalone scheduler-cum-downloader download the files and load them into HDFS. As soon as the files are loaded into HDFS, I can trigger the Flink processing and store the results in Cassandra.
My question is: knowing that the files are of finite size and finite count irrespective of the number of vendors, is stream processing overkill, or would batch processing become a latency burden later?
The answer depends heavily on the tool you will use. If you go for Flink, I believe that using streaming is fine and won't create problems in the long run. If you write your functions and jobs properly, moving from the DataStream API to the DataSet API would be easy, if needed. Batch here introduces a needless delay and, without further information, doesn't seem the appropriate approach. I believe it would work fine anyway, but it's not clear whether latency is a strict requirement.
That said, I believe Flink in itself is overkill. In this particular use case a more traditional tool like Spark would be a better option in terms of usability, but if you want to invest in Flink, that's totally fine; given the use case, I don't think you will need any particular library that is present in or integrated with Spark but missing from Flink.

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. It is typically transactional data (like calling records). I run a sequence of Hive queries that successively apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to run one query after another manually (as some queries sometimes fail due to problems in AWS, etc.).
I have processed 2 months of data so far by manual means.
But for subsequent months, I want to be able to write a workflow that executes the queries one by one and, should a query fail, reruns it. This can't be done by running the Hive queries from a bash .sh file (my current approach, at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql ( this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading wondering whether it might be the solution to my problem and it does have Lingual, which might fit the case. Not sure though, how it fits into the AWS ecosystem.
Ideally there would be some Hive query workflow tool. Otherwise, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though facing a sh!tload of issues setting up in emr. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to retry or take other actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
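The behaviour being asked for (run the steps in order, and retry a step that fails) can be pictured with a small sketch. This is not Data Pipeline's API, just plain Python illustrating the retry semantics; the function names, the attempt limit, and the injectable `runner` are assumptions made for the example.

```python
import subprocess
import time

def run_with_retries(command, max_attempts=3, delay_seconds=0, runner=subprocess.call):
    """Run one command, retrying until it exits with status 0.
    Returns the attempt number that succeeded; raises after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        if runner(command) == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"giving up after {max_attempts} attempts: {command}")

def run_workflow(commands, **retry_options):
    """Run the commands strictly in order; a later step only starts once
    every earlier step has succeeded."""
    for command in commands:
        run_with_retries(command, **retry_options)

workflow = [
    ["hive", "-f", "s3://mybucket/createAndPopulateTableA.sql"],
    ["hive", "-f", "s3://mybucket/createAndPopulateTableB.sql"],
]
# run_workflow(workflow, max_attempts=3, delay_seconds=60)
```

Because the second script only starts after the first has succeeded, the dependency of Table B on Table A is respected. A managed scheduler adds what this sketch lacks, such as alerting and recovery if the driving machine itself dies.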

Node-Local Map reduce job

I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS, basically because the programs using the data cannot read from HDFS and there is too much of it to copy into HDFS: at least 1TB per node.
So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths for these 4 local directories and read them, using something like file:///var/mydata/..., so that 1 mapper works with each directory, i.e. 16 mappers in total.
However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and exactly the 4 mappers that have been assigned the paths local to that machine. These paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that given splits end up on a given node with a known hostname? If the data were in HDFS I could use a variant of FileInputFormat with isSplitable set to false and Hadoop would take care of it, but as all the data is local this causes issues.
Basically all I want is to be able to crawl local directory structures on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (on the mapper), and reduce the results (in the reduce step) into HDFS for further bulk processing.
I noticed that the InputSplits provide a getLocations function, but I believe that this does not guarantee locality of execution; it only optimises for it. Clearly, if I try to use file:///some_path in each mapper, I need to ensure exact locality, otherwise I may end up reading some directories repeatedly and others not at all.
Any help would be greatly appreciated.
I see there are three ways you can do it.
1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying, as it will be useful for future processing.
2.) You can make use of NLineInputFormat. Create four different files, each listing the URLs of the input files on one of your nodes.
file://192.168.2.3/usr/rags/data/DFile1.xyz
.......
Load these files into HDFS and write your program to read them, access the data using these URLs, and process it. If you use NLineInputFormat with 1 line per split, you will get 16 mappers, each processing an exclusive file. The only issue is that there is a high possibility that the data on one node will be processed on another node; however, there will not be any duplicate processing.
3.) You can further optimize the above method by loading the four URL files separately. While loading each file, you can remove the other three nodes to ensure that the file goes exactly to the node where the data files are locally present. While loading, set the replication to 1 so that the blocks are not replicated. This greatly increases the probability that each map task processes files local to its node.
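As a rough sketch of options 2 and 3, the per-node URL files could be generated like this. The host names, directory names, and function names are hypothetical, and it assumes the same four directory paths exist on every node; the second helper just mimics how a fixed number of lines per split turns into map tasks.

```python
def build_node_manifests(hosts, directories):
    """Return one manifest per node: the list of file:// URLs for the
    directories that live on that node."""
    return {host: [f"file://{host}{d}" for d in directories] for host in hosts}

def lines_per_split(lines, n=1):
    """Mimic an N-lines-per-split input format: each chunk of n lines
    becomes the input of one map task."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

hosts = ["node1", "node2", "node3", "node4"]
directories = ["/var/mydata/d1", "/var/mydata/d2", "/var/mydata/d3", "/var/mydata/d4"]
manifests = build_node_manifests(hosts, directories)
all_lines = [url for host in hosts for url in manifests[host]]
splits = lines_per_split(all_lines, n=1)
```

With one line per split, the 16 URLs yield 16 map tasks, each owning exactly one directory. Where each task actually runs is still up to the scheduler, which is the limitation noted in option 2.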
Cheers
Rags