Split one file into multiple files using a Pig script - apache-pig

I have a pipe-delimited text file, say abc.txt, which has a different number of columns in different records. The number of columns in a record can be 100, 80, 70, or 60. I need to split abc.txt based on the value of the third column: if the third column has the value "A", the record should go to A.txt; if "B", to B.txt. I need to write a Pig script for this.

abc = LOAD 'abc.txt' USING PigStorage('|');
Assuming the 3rd column is present in all records, SPLIT the relation using positional notation. Positions start from 0, so the third column is $2.
SPLIT abc into a_records if $2 == 'A', b_records if $2 == 'B';
Then store the results. Also note that STORE does not accept a filename as the path; it expects a directory, and the output is written as part files inside it.
STORE a_records into 'A_DIR' using PigStorage('|');
STORE b_records into 'B_DIR' using PigStorage('|');
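Equivalently, if you prefer separate statements, the same split can be expressed with two FILTER operations (a sketch reusing the abc relation loaded above):
a_records = FILTER abc BY $2 == 'A';
b_records = FILTER abc BY $2 == 'B';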

Related

Compare two comma-separated columns

I want to compare two columns, actual_data and pipeline_data, based on the source column, because every source has a different format.
I am trying to derive the result column from a comparison between actual_data and pipeline_data.
I am new to pandas and looking for a way to implement this.
import numpy as np

df['result'] = np.where(
    df['pipeline_data'].str.len() == df['actual_data'].str.len(),
    'Match',
    np.where(
        df['pipeline_data'].str.len() > df['actual_data'].str.len(),
        'Length greater than actual_data',
        'Length shorter than actual_data',
    ),
)
The code above should do what you want.
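For illustration, a minimal runnable sketch with made-up sample data; the column names are taken from the question, the values are invented:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'source': ['s1', 's2', 's3'],
    'actual_data': ['abc', 'abcd', 'ab'],
    'pipeline_data': ['xyz', 'xy', 'xyzw'],
})
df['result'] = np.where(
    df['pipeline_data'].str.len() == df['actual_data'].str.len(),
    'Match',
    np.where(
        df['pipeline_data'].str.len() > df['actual_data'].str.len(),
        'Length greater than actual_data',
        'Length shorter than actual_data',
    ),
)
print(df['result'].tolist())
# ['Match', 'Length shorter than actual_data', 'Length greater than actual_data']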

How to split a column containing two records separately

I have millions of observations in different columns, and one of the columns contains records of two factors together. For instance, 136789: I want to split the first character (1) and the rest (36789) into separate columns for all observations.
The field looks like this
#136789
I want to see like this
#1 36789
You can make use of the sub() function.
For example:
kent$ awk 'BEGIN{x="123456";sub(/^./,"& ",x);print x}'
1 23456
In your code, you need to apply sub() to the relevant column ($x).
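For instance, a minimal sketch; the file name and the assumption that the value sits in the second whitespace-separated column are mine:
awk '{sub(/^./,"& ",$2); print}' yourfile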

AWK: Ignore lines grouped by a unique value, conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to first group lines by a unique value in the first field and accumulate the occurrences of a specific value in another field within each group of lines. If the sum of occurrences doesn't meet a self-defined threshold, the lines in the group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that "because the groups 222 and 444 in the first field don't have at least one P (the threshold can be any desired number) in the third field, the lines corresponding to 222 and 444 are ignored."
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. As a result, a few lines will not end up in the resulting split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field together with a per-group count c, which we use later. If the third field contains a P, set a key in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for that value, print the lines from a.
You mention a threshold number of entries in your question. If by that you mean that there must be N occurrences of "P" in order for the lines to be printed, you could change {p[$1]} to {++p[$1]}, then change if(i in p) to if(p[i]>=N) in the END block.
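For instance, with a threshold of N=2 the one-liner would become (N hard-coded here for illustration):
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{++p[$1]}END{for(i in c)if(p[i]>=2)for(j=1;j<=c[i];++j)print a[i,j]}' file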

How to change a particular column value for a certain number of rows in Pig Latin

I have a file in Pig with, say, 10000 rows. Is there any quick way to change the value of a certain column for, say, the first 1000 rows?
Since some info is missing, I will make a few assumptions and then offer a solution:
by "first 1000 rows" you mean that you can order the records using some column
you wish to change the value of column $1 in the first 1000 records when ordering by column $2
The following code snippet will do what you asked for:
a = LOAD ...
b = RANK a BY $2;
-- RANK prepends the rank as rank_a, shifting the original columns right by one,
-- so a's $1 is now $2. Apply the change (here, tripling) to the first 1000 rows:
c = FOREACH b GENERATE $1, (rank_a <= 1000 ? 3*$2 : $2), $3..;
Alternatively, use the FOREACH and LIMIT operators to achieve the effect.
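A rough sketch of that idea, assuming the same ordering column and example transformation as above; note that you would still have to recombine the modified rows with the untouched ones, which is what makes the RANK approach above simpler:
ordered = ORDER a BY $2;
first_1000 = LIMIT ordered 1000;
changed = FOREACH first_1000 GENERATE $0, 3*$1, $2..;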

PIG - Defining the delimiter used for a bag after a GROUP function

In Pig, I'm loading and grouping two files. I end up with something like this:
A = LOAD 'File1' USING PigStorage('\t');
B = LOAD 'File2' USING PigStorage('\t');
C = COGROUP A BY $0, B BY $0;
STORE C INTO 'Output' USING PigStorage('\t');
Output:
123 {(123,XYZ,456)} {(123,QRS,889,QWER)}
Here the first field is the group key, the first bag is from File1, and the second bag is from File2. These three sections are delimited from each other by whatever delimiter I specified in the PigStorage('\t') clause.
Question: How do I force Pig to delimit the bags by something other than a comma? In my real data, there are commas present and so I need to delimit by tabs instead.
Desired output:
123 {(123\tXYZ\t456)} {(123\tQRS\t889\tQWER)}
This seems to be an open issue in Pig (as of June 2013). See the corresponding JIRA for more details. Until the issue is fixed, the workaround is to change your input data so that the fields themselves contain no commas.
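For example, a workaround sketch that substitutes the in-field commas with a placeholder before grouping; the field names, types, and the ';' placeholder here are assumptions, and REPLACE is a built-in Pig string function:
A = LOAD 'File1' USING PigStorage('\t') AS (k:chararray, v1:chararray, v2:chararray);
A_clean = FOREACH A GENERATE k, REPLACE(v1, ',', ';'), REPLACE(v2, ',', ';');
-- do the same for File2, then COGROUP the cleaned relations as before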