Apache Pig - Removing the pseudo-column added by -tagFile - apache-pig

I have files of the format test_YYYYMM.txt. I am using '-tagFile' and SUBSTRING() to extract the year and month for use in my pig script.
The file name gets added as a pseudo-column at the beginning of the tuple.
Before I do a DUMP I would like to remove that column. Doing a FOREACH ... GENERATE with only the columns I need does not work; it still retains the pseudo-column.
Is there a way to remove this column?
My sample script is as follows:
raw_data = LOAD 'test_201501.txt' USING PigStorage('|', '-tagFile') AS
(col1: chararray, col2: chararray);
data_with_yearmonth = FOREACH raw_data GENERATE
SUBSTRING($0,5,11) as yearmonth,
'TEST_DATA' as test,
col1,
col2;
DUMP data_with_yearmonth;
Expected Output:
201501, TEST_DATA, col1, col2
Current Output:
201501, TEST_DATA, test_YYYYMM.txt, col1, col2

First of all, if col1 and col2 are strings then you should declare them as CHARARRAY in Pig.
Plus, I guess your current output is actually: 201501, TEST_DATA, test_YYYYMM.txt, col1.
Tell me if I'm wrong, but since you used '-tagFile' the first column is the file name, which is why you access it with $0 in your SUBSTRING.
You can try this code:
raw_data = LOAD 'test_201501.txt'
USING PigStorage('|', '-tagFile')
AS (title: CHARARRAY, col1: CHARARRAY, col2: CHARARRAY);
data_with_yearmonth = FOREACH raw_data
GENERATE
SUBSTRING(title, 5, 11) AS yearmonth,
'TEST_DATA' AS test,
col1,
col2;
DUMP data_with_yearmonth;
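Incidentally, the SUBSTRING indices are easy to sanity-check outside Pig; here is a minimal Python sketch of the same slice (the filenames are made up):

```python
# Pig's SUBSTRING(title, 5, 11) takes characters 5 through 10 (stop index
# exclusive), which for a test_YYYYMM.txt name is exactly the YYYYMM part.
filenames = ["test_201501.txt", "test_201502.txt"]  # hypothetical sample names
yearmonths = [name[5:11] for name in filenames]
print(yearmonths)  # ['201501', '201502']
```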

Related

How to send values to automatic sequential rows in an arrayformula on gsheets?

This would be much easier to solve on the 1st data sheet, but I am trying to have grouped data with a weekly quantity broken down into a daily # to make work easier further down the line.
=query({ARRAYFORMULA(if(BinCountData!A3, "Store0",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!A3:B;
ARRAYFORMULA(if(BinCountData!E3:E,"Store1",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!E3:F;
ARRAYFORMULA(IF(BinCountData!I3:I,"Store2",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!I3:J;
ARRAYFORMULA(IF(BinCountData!M3:M,"Store3",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!M3:N;
ARRAYFORMULA(IF(BinCountData!Q3:Q,"Store4",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!Q3:R;
ARRAYFORMULA(IF(BinCountData!U3:U,"Store5",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!U:V;
ARRAYFORMULA(IF(BinCountData!Y3:Y,"Store6",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!Y3:Z;
ARRAYFORMULA(IF(BinCountData!AC3:AC,"Store7",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!AC3:AD;
ARRAYFORMULA(IF(BinCountData!AG3:AG,"Store8",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!AG3:AH;
ARRAYFORMULA(IF(BinCountData!AK3:AK,"Store9",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!AK3:AL;
ARRAYFORMULA(IF(BinCountData!AO3:AO,"Store10",)(BinCountData!B3/7)*Row(1:7)^0), BinCountData!AO3:AP
}, "
Select *
where Col1 <>''
label Col1 'Date', Col2 'Store'
")
What would be the replacement for *ROW(1:7)^0 to make them go sequentially in the same columns?
If you want to distribute it into columns instead of rows you can do:
*COLUMN(A:G)^0
or:
*TRANSPOSE(ROW(1:7)^0)
or:
*SEQUENCE(1, 7, 1, 0)
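The ^0 trick just builds an array of ones to repeat weekly/7; a rough Python sketch of what the row and column variants produce (the weekly value is made up):

```python
weekly = 70  # hypothetical weekly bin count, standing in for BinCountData!B3

# ROW(1:7)^0 is a column of seven 1s, so the product repeats weekly/7 down rows;
# TRANSPOSE(ROW(1:7)^0) or SEQUENCE(1, 7, 1, 0) is a row of seven 1s,
# repeating it across columns instead.
down_rows = [weekly / 7 * 1 for _ in range(7)]  # 7 rows, 1 column
across_cols = [[weekly / 7 * 1] * 7]            # 1 row, 7 columns
print(down_rows)  # [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
```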
update
try:
=ARRAYFORMULA(QUERY(SPLIT(FLATTEN(FLATTEN(
FILTER(BinCountData!A3:AR, MOD(COLUMN(BinCountData!A:AR)-1, 4)=0))+SEQUENCE(1, 7, 0)&"×"&FLATTEN(
FILTER(BinCountData!A1:AR1, MOD(COLUMN(BinCountData!A:AR)-1, 4)=0)&"×"&
FILTER(BinCountData!A3:AR, MOD(COLUMN(BinCountData!A:AR)-2, 4)=0)/7)), "×"),
"where Col3 <> 0
order by Col2
label Col1'Date',Col2'Store',Col3'Bin/Day'
format Col1 'yyyy-mm-dd'"))
demo spreadsheet

SUBSTRING operation does not work in JOIN operation

I have a column col1 in file 1:
00SPY58KHT5
00SPXB2BD0J
00SPXB2DXH6
00SPXDQ02S1
00SPXDY91JI
00SPXFG88L6
00SPXF1AQ4Z
00SPXF5UKS3
00SPXGL9IV6
I have column col2 in file2:
0SPY58KHT5
0SPXB2BD0J
0SPXB2DXH6
0SPXDQ02S1
0SPXDY91JI
0SPXFG88L6
0SPXF1AQ4Z
0SPXF5UKS3
0SPXGL9IV6
As you can see, the first file's values have an extra 0 at the beginning.
I need to do a JOIN between the two files by these columns, so I need to use SUBSTRING like this:
JOIN_FILE1_FILE2 = JOIN FILE1 BY TRIM(SUBSTRING(col1,1,10)), FILE2 BY TRIM(col2);
DUMP JOIN_FILE1_FILE2;
But I get empty result.
Input(s):
Successfully read 914493 records from: "/hdfs/data/adhoc/PR/02/RDO0/GUIDES/GUIDE_CONTRAT_USINE.csv"
Successfully read 102851809 records from: "/hdfs/data/adhoc/PR/02/RDO0/BB0/MGM7X007-2019-09-11.csv"
Output(s):
Successfully stored 0 records in: "hdfs://ha-manny/hdfs/hadoop/pig/tmp/temp964914764/tmp1220183619"
How can I do this join, please?
As a solution, I first GENERATE the data, applying the SUBSTRING function to col1.
Then I did the filtering using TRIM, and finally used CONCAT('0', col1) in another GENERATE to restore the leading zero.
In other words:
DATA1 = FOREACH DATA_SOURCE GENERATE
SUBSTRING(col1,1,10) AS col1;
JOINED_DATA = JOIN DATA1 BY col1, ...
FINAL_DATA = FOREACH JOINED_DATA GENERATE
CONCAT('0',col1) AS col1,
...
And this works without problem.
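For illustration, the same normalize-join-restore flow in plain Python (sample keys taken from the question; the file2 values are made up):

```python
# file1 keys carry an extra leading '0' compared to file2 keys.
file1 = ["00SPY58KHT5", "00SPXB2BD0J"]
file2 = {"0SPY58KHT5": "a", "0SPXB2BD0J": "b"}  # hypothetical joined values

joined = []
for col1 in file1:
    key = col1[1:].strip()               # SUBSTRING + TRIM: drop the extra '0'
    if key in file2:                     # the JOIN on the normalized key
        joined.append(("0" + key, file2[key]))  # CONCAT('0', ...) restores it

print(joined)  # [('00SPY58KHT5', 'a'), ('00SPXB2BD0J', 'b')]
```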

Perl - From Database to data structure

I'm querying a table in a database with SQL like this:
Select col1, col2 from table_name
For reference, col2 will be an integer value, and col1 will be the name of an element. E.G.
FOO, 3
BAR, 10
I want a data structure where the values can be addressed like vars->{valueofcol1} should return the value of col2.
So
$vars->{FOO}
would return 3
Basically I don't know how to return the SQL results into a data structure I can address like this.
You need to fetch each row and build that hashref yourself.
my $vars; # declare the variable for the hash ref outside the loop
my $sth = $dbh->prepare(q{select col1, col2 from table_name});
$sth->execute;
while ( my $res = $sth->fetchrow_hashref ) { # fetch row by row
$vars->{ $res->{col1} } = $res->{col2}; # build up data structure
}
print $vars->{FOO};
__END__
3
You may want to read up on DBI, especially how to fetch stuff.

Reading json files in pig

I have three data types...
1) Base data
2) data_dict_1
3) data_dict_2
Base data is very well formatted JSON.
For example:
{"id1":"foo", "id2":"bar" ,type:"type1"}
{"id1":"foo", "id2":"bar" ,type:"type2"}
data_dict_1
1 foo
2 bar
3 foobar
....
data_dict_2
-1 foo
-2 bar
-3 foobar
... and so on
Now, what I want is: if the data is of type1,
then read id1 from data_dict_1 and id2 from data_dict_2, and assign that integer id.
If the data is of type2, then read id1 from data_dict_2 and id2 from data_dict_1, and assign the corresponding ids.
For example:
{"id1":1, "id2":2 ,type:"type1"}
{"id1":-1, "id2":-2 ,type:"type2"}
And so on..
How do I do this in Pig?
Note: what you have in the example above is not valid JSON; the type key is not quoted.
Assuming Pig 0.10 and up, there's the built-in JsonLoader, which you can pass a schema to and load with:
data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');
and load the dicts
dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);
Then split that based on the type value
SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';
JOIN them appropriately
type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;
type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;
and since the schemas are equal, UNION them together
final_data = UNION type1_joined, type2_joined;
this produces
DUMP final_data;
(foo,bar,type2,-2)
(foo,bar,type1,1)
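The whole SPLIT/JOIN/UNION pipeline amounts to a type-dependent dictionary lookup; a rough Python sketch of the same logic, using the sample values from the question:

```python
records = [
    {"id1": "foo", "id2": "bar", "type": "type1"},
    {"id1": "foo", "id2": "bar", "type": "type2"},
]
dict_1 = {"foo": 1, "bar": 2, "foobar": 3}
dict_2 = {"foo": -1, "bar": -2, "foobar": -3}

# type1 rows join id1 against dict_1; type2 rows join id2 against dict_2,
# mirroring the two JOIN branches before the UNION.
out = []
for rec in records:
    if rec["type"] == "type1":
        out.append((rec["id1"], rec["id2"], rec["type"], dict_1[rec["id1"]]))
    else:
        out.append((rec["id1"], rec["id2"], rec["type"], dict_2[rec["id2"]]))

print(out)  # [('foo', 'bar', 'type1', 1), ('foo', 'bar', 'type2', -2)]
```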

Find similar entries using pig script

I have data as below
1,ref1,1200,USD,CR
2,ref1,1200,USD,DR
3,ref2,2100,USD,DR
4,ref2,800,USD,CR
5,ref2,700,USD,CR
6,ref2,600,USD,CR
I want to group these records where field2 matches, SUM(field3) matches, and field5 is opposite (meaning if the LHS is "CR" then the RHS should be "DR" and vice versa).
How can I achieve this using a Pig script?
You can also do it this way:
data = LOAD 'myData' USING PigStorage(',') AS
(field1: int, field2: chararray,
field3: int, field4: chararray,
field5: chararray) ;
B = FOREACH (GROUP data BY (field2, field5)) GENERATE group.field2, data ;
-- Since in B there will always be two sets of field2 (one for CR and one for DR)
-- grouping by field2 again will get all pairs of CR and DR
-- (where the sums are equal of course)
C = GROUP B BY (field2, SUM(data.field3)) ;
The schema and output at the last step:
C: {group: (field2: chararray,long),B: {(field2: chararray,data: {(field1: int,field2: chararray,field3: int,field4: chararray,field5: chararray)})}}
((ref1,1200),{(ref1,{(1,ref1,1200,USD,CR)}),(ref1,{(2,ref1,1200,USD,DR)})})
((ref2,2100),{(ref2,{(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR)}),(ref2,{(3,ref2,2100,USD,DR)})})
The output is a little unwieldy right now, but this will clear it up:
-- Make sure to look at the schema for C above
D = FOREACH C {
-- B is a bag containing tuples in the form: B: {(field2, data)}
-- What we want is to just extract out the data field and nothing else
-- so we can go over each tuple in the bag and pull out
-- the second element (the data we want).
justbag = FOREACH B GENERATE FLATTEN($1) ;
-- Without FLATTEN the schema for justbag would be:
-- justbag: {(data: (field1, ...))}
-- FLATTEN makes it easier to access the fields by removing data:
-- justbag: {(field1, ...)}
GENERATE justbag ;
}
Into this:
D: {justbag: {(data::field1: int,data::field2: chararray,data::field3: int,data::field4: chararray,data::field5: chararray)}}
({(1,ref1,1200,USD,CR),(2,ref1,1200,USD,DR)})
({(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR),(3,ref2,2100,USD,DR)})
I'm not sure that I understand your requirements, but you could load the data, split it into two sets (FILTER/SPLIT), and then COGROUP, such as:
data = load ... as (field1: int, field2: chararray, field3: int, field4: chararray, field5: chararray);
crs = filter data by field5 == 'CR';
crs_grp = group crs by field2;
crs_agg = foreach crs_grp generate group as field2, SUM(crs.field3) as field3;
drs = filter data by field5 == 'DR';
drs_grp = group drs by field2;
drs_agg = foreach drs_grp generate group as field2, SUM(drs.field3) as field3;
g = COGROUP crs_agg BY (field2, field3), drs_agg BY (field2, field3);
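Either way, the matching logic boils down to pairing a CR group with a DR group on (field2, total); a rough Python sketch over the sample rows from the question:

```python
from collections import defaultdict

rows = [
    (1, "ref1", 1200, "USD", "CR"),
    (2, "ref1", 1200, "USD", "DR"),
    (3, "ref2", 2100, "USD", "DR"),
    (4, "ref2", 800, "USD", "CR"),
    (5, "ref2", 700, "USD", "CR"),
    (6, "ref2", 600, "USD", "CR"),
]

# Group rows by (field2, field5), like GROUP data BY (field2, field5).
groups = defaultdict(list)
for row in rows:
    groups[(row[1], row[4])].append(row)

# Pair each CR group with the DR group sharing the reference and the total.
pairs = []
for (ref, side), crs in groups.items():
    if side != "CR":
        continue
    drs = groups.get((ref, "DR"), [])
    if sum(r[2] for r in crs) == sum(r[2] for r in drs):
        pairs.append((ref, crs + drs))

print([p[0] for p in pairs])  # ['ref1', 'ref2']
```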