STORE output to a single CSV? - apache-pig

Currently, when I STORE into HDFS, it creates many part files.
Is there any way to store out to a single CSV file?

You can do this in a few ways:
To set the number of reducers for all Pig operations, you can use the default_parallel property - but this means every single step will use a single reducer, decreasing throughput:
set default_parallel 1;
Prior to calling STORE, if one of the operations you execute is COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), or ORDER BY, then you can use the PARALLEL 1 keyword to denote the use of a single reducer to complete that command:
GROUP a BY grp PARALLEL 1;
See the Pig Cookbook's Parallel Features section for more information.
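For example, a minimal end-to-end sketch (the input path, schema, and output path here are hypothetical) that groups with a single reducer and therefore writes a single comma-delimited part file:

pig -e "
a = LOAD '/data/input' USING PigStorage(',') AS (grp:chararray, val:int);
b = GROUP a BY grp PARALLEL 1;
c = FOREACH b GENERATE group, SUM(a.val);
STORE c INTO '/data/output_single' USING PigStorage(',');"

The STORE target is still a directory, but with one reducer it contains only one part file, which you can then copy out and rename to a .csv.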

You can also use Hadoop's getmerge command to merge all those part-* files.
The fs shortcut shown below is available when you run your Pig scripts from the Grunt shell (and not from Java).
This has an advantage over the previous approach: you can still use several reducers to process your data, so your job may run faster, especially if each reducer outputs only a little data.
grunt> fs -getmerge <Pig output directory> <local file>
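If you are working outside of Grunt, the same merge can be done directly with the Hadoop client (the paths here are placeholders):

hadoop fs -getmerge /user/me/pig_output /tmp/output.csv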

Related

Is there any performance-wise better option for exporting data to local than Redshift unload via s3?

I'm working on a Spring project that needs to export Redshift table data into a single local CSV file. The current approach is to:
Execute Redshift UNLOAD to write data across multiple files to S3 via JDBC
Download said files from S3 to local
Join them together into one single CSV file
UNLOAD (
'SELECT DISTINCT #{#TYPE_ID}
FROM target_audience
WHERE #{#TYPE_ID} is not null
AND #{#TYPE_ID} != \'\'
GROUP BY #{#TYPE_ID}'
)
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_'
credentials 'aws_access_key_id=#{#accesskey};aws_secret_access_key=#{#secretkey}'
DELIMITER AS ',' ESCAPE GZIP ;
The above approach has been fine and all, but I think the overall performance could be improved by, for example, skipping the S3 part and getting data directly from Redshift to local.
After searching through online resources, I found that you can export data from Redshift directly through psql, or perform SELECT queries and move the result data myself. But neither option can top Redshift UNLOAD performance with parallel writing.
So is there any way I can mimic UNLOAD's parallel writing to achieve the same performance without having to go through S3?
You can avoid the need to join files together by using UNLOAD with the PARALLEL OFF parameter. It will output only one file.
This will, however, create multiple files if the filesize exceeds 6.2GB.
See: UNLOAD - Amazon Redshift
It is doubtful that you would get better performance by running psql, but if performance is important for you then you can certainly test the various methods.
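For illustration, here is a sketch of the question's UNLOAD rewritten with PARALLEL OFF, simplified: no GZIP, a hypothetical type_id column standing in for the #{#TYPE_ID} placeholder, and placeholder host, bucket and credentials. It is issued through psql here, but the same statement can equally be sent over the existing JDBC connection:

psql "postgresql://user:password@xxx1.xxxx.us-region-1.redshift.amazonaws.com:5439/dbname?sslmode=require" <<'SQL'
UNLOAD ('SELECT DISTINCT type_id FROM target_audience WHERE type_id IS NOT NULL')
TO 's3://mybucket/myjob/target_audience_'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER AS ','
PARALLEL OFF;
SQL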
We do exactly the same thing as you're trying to do here. In our performance comparison, it turned out to be almost the same, or in some cases even better, for our use case. It is also easier programming- and debugging-wise, as there is practically only one step.
# replace user/password, host, region and dbname appropriately in the given command
psql "postgresql://user:password@xxx1.xxxx.us-region-1.redshift.amazonaws.com:5439/dbname?sslmode=require" -c "select C1,C2 from sch1.tab1" > ABC.csv
This enables us to avoid three steps:
Unload using JDBC
Download the exported data from S3
Decompress the gzip file (we used gzip to save network input/output)
On the other hand, it also saves some cost (S3 storage, though it is negligible).
By the way, from PostgreSQL 9.0 onwards, sslcompression is on by default.

Pig variable storage

Pig uses variables to store the data.
When I load data from HDFS into a variable in Pig, where is the data temporarily stored?
What exactly happens in the background when we load the data into the variable?
Kindly help.
Pig lazily evaluates most expressions. In most cases, it only checks for syntax errors and the like. For example,
a = load 'hdfs://I/Dont/Exist';
won't throw an error unless you use STORE or DUMP or something along those lines that results in the evaluation of a.
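For example, in an interactive Grunt session the failure only surfaces once the plan is actually executed (using the same non-existent path as above):

grunt> a = LOAD 'hdfs://I/Dont/Exist' USING PigStorage(',');
grunt> DUMP a;

The LOAD line is accepted without complaint; the error about the missing input only appears when DUMP forces Pig to build and run the job.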
Similarly, if a file exists and you load it into a relation and perform transformations on it, the file is usually spooled to the /tmp folder and the transformations are then performed. If you look at the messages that appear when you run commands in grunt, you'll notice file paths starting with file:///tmp/xxxxxx_201706171047235; these are the files that store the intermediate data.

Hive query - INSERT OVERWRITE LOCAL DIRECTORY creates multiple files for a single table

I run the following on a Hive table myTable.
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
So, this command generates 2 files 000000_0 and 000001_0 inside the folder out/.
But, I need the contents as a single file. What should I do?
There are multiple files in the directory because every reducer writes one file. If you really need the contents as a single file, run your MapReduce job with only one reducer, which will write to a single file.
However, depending on your data size, running a single reducer might not be a good approach.
Edit: Instead of forcing Hive to run one reduce task and output a single reduce file, it would be better to use hadoop fs operations to merge the outputs into a single file.
For example
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt
A bit late to the game, but I found that using LIMIT large_number, where large_number is bigger than the number of rows in your query, forces Hive to use at least one reducer. For example:
set mapred.reduce.tasks=1; INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000
Worked flawlessly.
Adding a CLUSTER BY clause will also do the job, since it forces a reduce phase; see the sketch below.
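A sketch of that variant, simplified to three of the question's columns and with the reducer count pinned to one so the reduce phase writes a single file:

hive -e "
SET mapred.reduce.tasks=1;
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out'
SELECT NAME, PRODUCT, PRC FROM myTable CLUSTER BY NAME;"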

Running Elastic Mapreduce Hive Queries from an Application

I've run Hive on elastic mapreduce in interactive mode:
./elastic-mapreduce --create --hive-interactive
and in script mode:
./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q
I'd like to have an application (preferably in PHP, R, or Python) on my own server be able to spin up an elastic mapreduce cluster and run several Hive commands while getting their output in a parsable form.
I know that spinning up a cluster can take some time, so my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like the following (somewhat concrete) hypothetical example:
create Hive table customer_orders
run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
wait for result
parse result in PHP
run Hive query "SELECT MAX(id) FROM customer_orders"
wait for result
parse result in PHP
...
Does anyone have any recommendations on how I might do this?
You may use MRJOB. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms.
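As a sketch of what launching such a job looks like (the script name is hypothetical, and AWS credentials are assumed to be configured for mrjob, e.g. in ~/.mrjob.conf):

# launch an mrjob-defined job on EMR and collect its output locally
python mr_daily_counts.py -r emr s3://mybucket/customer_orders/ > daily_counts.tsv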
An alternative is HiPy, an awesome project which should perhaps be enough for all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.
HiPy enables grouping together in a single script of query construction, transform scripts and post-processing. This assists in traceability, documentation and re-usability of scripts. Everything appears in one place and Python comments can be used to document the script.
Hive queries are constructed by composing a handful of Python objects, representing things such as Columns, Tables and Select statements. During this process, HiPy keeps track of the schema of the resulting query output.
Transform scripts can be included in the main body of the Python script. HiPy will take care of providing the code of the script to Hive as well as of serialization and de-serialization of data to/from Python data types. If any of the data columns contain JSON, HiPy takes care of converting that to/from Python data types too.
Check out the Documentation for details!
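If you mainly need cluster control and step submission from your own application, the flow you describe can also be scripted against the current AWS CLI, which has since replaced the elastic-mapreduce client; the release label, bucket names, script paths, IDs and instance settings below are placeholders:

# start a cluster with Hive installed and submit a first Hive script as a step
aws emr create-cluster \
  --name hive-from-app \
  --release-label emr-5.36.0 \
  --applications Name=Hive \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --steps Type=HIVE,Name=DailyCounts,ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/queries/daily_counts.q,-d,OUTPUT=s3://mybucket/results/daily_counts/]

# later, submit further Hive steps to the same cluster and poll until they finish
aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps Type=HIVE,Name=MaxId,ActionOnFailure=CONTINUE,Args=[-f,s3://mybucket/queries/max_id.q,-d,OUTPUT=s3://mybucket/results/max_id/]
aws emr describe-step --cluster-id j-XXXXXXXXXXXX --step-id s-XXXXXXXXXXXX

Each Hive script writes its result to the OUTPUT location on S3, from where your PHP (or R, or Python) code can download and parse it; the same calls are available through the AWS SDKs if you prefer not to shell out.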

Save reducers output directory path to a variable in Hadoop

How do I save the output path of Hadoop reducers to a variable?
This variable will be used by all other MR jobs.
These jobs will be sequential.
All the sequential MR jobs will write their corresponding output to that output directory.
I need their path variable to be updated accordingly.
Take a look at Oozie. It's a Hadoop workflow engine which allows just what you described: each job can take its input from the output of the previous job.
There are other solutions for this, such as the Cascading API:
http://www.concurrentinc.com/products/
http://yahoo.github.com/oozie/releases/2.0.0/#Quick_Start
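If you do not want a full workflow engine, the basic idea can also be sketched with a small driver shell script that keeps the current output path in a variable and feeds it to the next job (the jar names, class names and paths here are hypothetical):

# run the first job and remember where it wrote its output
OUT_DIR=/user/me/pipeline/step1
hadoop jar step1.jar com.example.Step1 /user/me/pipeline/input "$OUT_DIR"

# the next job reads the previous job's output directory as its input
NEXT_OUT=/user/me/pipeline/step2
hadoop jar step2.jar com.example.Step2 "$OUT_DIR" "$NEXT_OUT"
OUT_DIR="$NEXT_OUT"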