Pig ParquetLoader : Column Pruning - apache-pig

I read Parquet files that have a schema of 12 columns.
I do a group-by and a SUM aggregation over a single long column,
then join on another dataset. After the join I take only a single column (the sum) from the Parquet dataset.
But Pig keeps giving me this error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune"
Does the Pig Parquet loader not support column pruning?
If I run with column pruning disabled, the job works.
Pseudocode for what I am trying to achieve:
REGISTER /<path>/parquet*.jar;
res1 = load '<path>' using parquet.pig.ParquetLoader() as (c1:chararray,c2:chararray,c3:int, c4:int, c5:chararray, c6:chararray, c7:chararray, c8:chararray, c9:chararray, c10:chararray, c11:chararray, c12:long);
res2 = group res1 by (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11);
res3 = foreach res2 generate flatten(group) as (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11),SUM(res1.c12) as counts;

Related

Query Hive view with Redshift Spectrum

I'm trying to query a Hive view with Redshift Spectrum but it gives me this error:
SQL Error [500310] [XX000]: [Amazon](500310) Invalid operation: Assert
Details:
-----------------------------------------------
error: Assert
code: 1000
context: loc->length() > 5 && loc->substr(0, 5) == "s3://" -
query: 12103470
location: scan_range_manager.cpp:272
process: padbmaster [pid=1769]
-----------------------------------------------;
Is it possible to query Hive views from Redshift Spectrum? I'm using Hive Metastore (not Glue Data Catalog).
I wanted a view to restrict access to the original table to a limited set of columns and partitions. Also, my original table (Parquet data) has some Map fields, which are a bit complicated to deal with in Redshift, so I wanted something like this to make it easier to query:
CREATE view my_view AS
SELECT event_time, event_properties['user-id'] as user_id, event_properties['product-id'] as product_id, year, month, day
FROM my_events
WHERE event_type = 'my-event' -- partition
I can query the table my_events from Spectrum, but it's a mess because event_properties is a Map field, not a Struct, so I need to more or less explode it into several rows in Redshift.
Thanks
Looking at the error, it seems Spectrum always looks for an S3 path when external tables and views are queried.
That works for external tables, because those always have a location, but a view never has an explicit S3 location.
Error type -> Assert
Error context -> context: loc->length() > 5 && loc->substr(0, 5) == "s3://"
In the case of a Hive view, the location is presumably empty, so
loc->length() will return 0, the whole condition will evaluate to false, and the assertion fails.
The second clause supports this reading:
loc->substr(0, 5) == "s3://"
It expects the location to start with the S3 prefix. "s3://" is 5 characters long, which explains the first clause: a valid location must be longer than the prefix alone.
loc->length() > 5
So it looks like Spectrum does not support Hive views (or, more generally, any object without an explicit S3 path).
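To make the reasoning about the two clauses concrete, here is a minimal Python model of the asserted condition. It only mirrors the expression shown in the error context; the function name and the example locations are hypothetical, and this is not Spectrum's actual code.

```python
def passes_location_check(loc: str) -> bool:
    """Model of: loc->length() > 5 && loc->substr(0, 5) == "s3://"."""
    # Clause 1: the location must be longer than the 5-character prefix.
    # Clause 2: the first 5 characters must be exactly "s3://".
    return len(loc) > 5 and loc[:5] == "s3://"


# Hypothetical locations, for illustration only.
table_location = "s3://example-bucket/my_events/"  # external table with an S3 path
view_location = ""                                 # assumed: a view carries no location

print(passes_location_check(table_location))
print(passes_location_check(view_location))
```

Under that assumption an empty location fails the first clause immediately, which matches the assertion error seen when querying the view.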

Pig Query - Giving inconsistent results in AWS EMR

I am new to Pig. I have written a query that is not working as expected. I am trying to process the Google ngrams dataset provided to me.
I load the data, which is 1 GB:
bigrams = LOAD '$(INPUT)' AS (bigram:chararray, year:int, occurrences:int, books:int);
Then I select a subset which is limited to 2000 entries
limbigrams = LIMIT bigrams 2000;
Then see the dump of the limited data (pasting sample output)
(GB product,2006,1,1)
(GB product,2007,5,5)
(GB wall_NOUN,2007,27,7)
(GB wall_NOUN,2008,35,6)
(GB2 ,_.,1906,1,1)
(GB2 ,_.,1938,1,1)
Now I do a group by on limbigrams
D = GROUP limbigrams BY bigram;
When I see the data dump of D I see an entirely different data set (sample)
(GLABRIO .,1977,3,3),(GLABRIO .,1992,3,3),(GLABRIO .,1997,1,1),(GLABRIO .,2000,6,6),(GLABRIO .,2001,9,1),(GLABRIO .,2002,24,3),(GLABRIO .,2003,3,1)})
(GLASS FILMS,{(GLASS FILMS,1978,1,1),(GLASS FILMS,1976,2,1),(GLASS FILMS,1970,3,3),(GLASS FILMS,1966,7,1),(GLASS FILMS,1962,1,1),(GLASS FILMS,1958,1,1),(GLASS FILMS,1955,1,1),(GLASS FILMS,1899,2,2),(GLASS FILMS,1986,6,3),(GLASS FILMS,1984,1,1),(GLASS FILMS,1980,7,3)})
I am not attaching the entire output, because there is not a single row of overlap between the two outputs (i.e. before and after the group-by), so seeing the full files would not add anything.
Why does this happen?
The dumps are accurate. The GROUP BY operator in Pig creates a single record for each group and puts every record belonging to that group inside a bag. You can indeed see this in the last record of your second dump. The record stands for the group GLASS FILMS and has a bag containing records which have the bigram as GLASS FILMS. You can read more about the GROUP BY operator here: https://www.tutorialspoint.com/apache_pig/apache_pig_group_operator.htm
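As a rough illustration of the shape GROUP BY produces, here is a plain-Python sketch (not Pig) that groups the sample rows from the first dump by their first field. The function name is hypothetical, and a Python list stands in for a Pig bag, which in Pig has no guaranteed order.

```python
from collections import defaultdict


def group_by_first_field(rows):
    """Return {key: [rows with that key]}, mimicking (group, bag) records."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # every row lands in the bag for its key
    return dict(groups)


# Sample rows copied from the dump of limbigrams in the question.
rows = [
    ("GB product", 2006, 1, 1),
    ("GB product", 2007, 5, 5),
    ("GB wall_NOUN", 2007, 27, 7),
    ("GB wall_NOUN", 2008, 35, 6),
    ("GB2 ,_.", 1906, 1, 1),
    ("GB2 ,_.", 1938, 1, 1),
]

grouped = group_by_first_field(rows)
for key, bag in grouped.items():
    print(key, bag)
```

Six input rows become three output records, one per distinct bigram, each carrying all of its original rows.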

Pig calculating avg of delay fails

I have a file of airplane data containing the destination and the delay of each flight (the delay can be a negative or positive number).
A = load ‘flightdelays’ using Pigstorage(‘,’);
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
C = group b all; -- this is failing for cast error, also get an error failed to read data from input file..
D =foreach c generate b.dest, AVG(b.delay);
When I execute this, I get 0 records read from the source file and the MapReduce job fails.
Why is it not able to calculate the AVG?
Check the extension and path of the file. Is your file comma-separated? Also, there are plenty of case issues in your script.
In your load statement, the S in PigStorage is lowercase; it must be PigStorage.
A = load 'flightdelays' using PigStorage(',');
B = foreach A generate $14 as delay:int, $17 as dest:chararray;
Your script refers to relations a, b and c, but none of them exists: you load the data into A, and so on.
First, A and a are treated as different names (relation names in Pig are case sensitive). Second, there is the question of how you combine an aggregate function with a group-by.
In the FOREACH you should generate the grouping attribute together with the aggregate function.
In this script you used GROUP ... ALL, which puts every record into a single group, so B.dest is just a bag of every destination, not a grouping key to report alongside the aggregate.
If you want the AVG() delay per destination, you should group by dest.
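To show the difference between the two groupings, here is a small plain-Python sketch (not Pig) that computes the average delay per destination and, for contrast, the single overall average that grouping everything together would give. The flight rows and the function name are hypothetical, and Pig-specific behaviour such as how AVG treats nulls is not modelled.

```python
from collections import defaultdict


def average_delay_by_dest(rows):
    """rows: iterable of (delay, dest) pairs; returns {dest: mean delay}."""
    totals = defaultdict(lambda: [0, 0])  # dest -> [sum of delays, count]
    for delay, dest in rows:
        totals[dest][0] += delay
        totals[dest][1] += 1
    return {dest: total / count for dest, (total, count) in totals.items()}


# Hypothetical (delay, dest) rows, for illustration only.
flights = [(10, "JFK"), (-4, "JFK"), (30, "ORD")]

per_dest = average_delay_by_dest(flights)                 # like grouping by dest
overall = sum(d for d, _ in flights) / len(flights)       # like grouping all

print(per_dest)
print(overall)
```

Grouping by dest gives one average per destination, while grouping all records gives a single number with no destination attached to it.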

Filter by length of array in Pig

I have data stored in avro format. One of the fields of each record (array_field, say) is an array. Using Pig how do I obtain only the records that have arrays with, for example, length(array_field) >= 2 and then store the results in avro files using the same schema as the original input?
This should be doable with something like the code below:
A = LOAD '$INPUT' USING AvroStorage();
B = FILTER A BY SIZE(array_field) >= 2;
STORE B INTO '$OUTPUT' USING AvroStorage('schema', '<schema_here>');

Pig Optimization on Group by

Let's assume that I have a large data set with the schema layout below:
id,name,city
---------------
100,Ajay,Chennai
101,John,Bangalore
102,Zach,Chennai
103,Deep,Bangalore
....
...
I have two styles of Pig code that give me the same output.
Style 1 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
Style 2 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_each = foreach (group records by city) generate group as city,COUNT(records.id) as emp_cnt;
dump records_each ;
In the second style I nested the GROUP inside the FOREACH. Does the style 2 code run faster than the style 1 code or not?
I would like to reduce the total time taken to complete that Pig job.
Does the style 2 code achieve that, or is there no impact on the total time taken?
If somebody can confirm this, I can run similar code on my cluster with a very large dataset.
Both solutions will have the same cost.
However, if records_grp is not used anywhere else, style 2 lets you skip declaring an intermediate relation, so your script is shorter.