How to remove null values from multiple columns in Pig

My dataset has around 200,000 lines. Using the commands below in Pig I'm trying to remove the null values, but I'm getting the wrong output. What am I missing here? Please help.
I've used
div = FOREACH dataset GENERATE $43 AS A, $44 AS B, ....., $50 AS H;
and I am trying to eliminate the null values to find the count for each variable as well as the total count.
values = FILTER div BY A IS NOT NULL AND B IS NOT NULL AND C IS NOT NULL AND D IS NOT NULL AND E IS NOT NULL AND F IS NOT NULL AND G IS NOT NULL AND H IS NOT NULL;
But I'm still getting the wrong output.
I want the final output to be something like: H 1056, U 4355, W 999, P 1000, Y 2199
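A minimal sketch of one way to get those counts, building on the values relation from the question. It assumes the H/U/W/P/Y codes live in a single column; A is used as a placeholder here, since the question does not say which field actually holds them:
-- group on the column that holds the codes (A is an assumption)
by_val = GROUP values BY A;
counts = FOREACH by_val GENERATE group AS code, COUNT(values) AS cnt;
DUMP counts;
-- total count of rows that survived the null filter
total = FOREACH (GROUP values ALL) GENERATE COUNT(values) AS n;
DUMP total;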

Related

PostgreSQL data transformation - Turn rows into columns

I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique, there are no keys, nothing. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to handle the case where a certain key has no attribute named attr_name1, and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables, but the table has quite a lot of rows, so I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This presumes that (k, p) is unique. If there are duplicates, this will silently pick the single v with the highest value for each (k, p).
As a general rule, this kind of pivoting makes the data harder to process in SQL. It is often done for display purposes, because humans find wide tables easier to read. From a software engineering perspective, however, such formatting should not be done in the data layer; be careful that you don't make your future life harder by doing this.

Calculate percentage in Pig

I have the following requirement.
The test data has the values below, and I need to find the percentage of each of the characters out of the total.
I have tried the query below, but without success.
Ex:
W
H
U
U
H
W
U
W
W
H
W
U
H
H
H
U
W
W
W
H
data = LOAD 'location of test data';
grp = GROUP data BY data.$0; // considering only 1 field in this csv.
result = FOREACH grp GENERATE group, COUNT(data.$0)/SUM(data.$0);
Since the fields are chararrays, I am not able to sum them. Is there an alternative way to do this?
If I use a GROUP ALL followed by COUNT(data.$0), I get the total number of entries.
If I use a GROUP BY the field followed by COUNT(data.$0), I get the individual counts.
Here I need the percentage of each individual count out of the total.
Thanks in advance.
To quote the question: "I need the percentage of each individual count out of the total."
To do this, I believe you need to run a few Pig operations:
1) First, as you said, get the individual counts in one relation:
W 8
H 7
U 5
2) Second, count all the elements, as you mentioned earlier, in another relation:
total 20
3) Then CROSS the relations obtained in steps one and two, so that you have a new relation like this:
W 8 20
H 7 20
U 5 20
4) From this, you can calculate the percentage that you wanted, as shown in the script below.
Update
Below is the Pig script that I came up with.
A = LOAD 'data.txt' AS (c:chararray); -- one character per line
--DUMP A;
B = GROUP A BY c;
C = FOREACH B GENERATE group, COUNT(A) AS cnt; -- individual counts
--DUMP C;
D = GROUP A ALL;
E = FOREACH D GENERATE group, COUNT(A) AS total; -- overall count
DUMP E;
DESCRIBE C;
DESCRIBE E;
F = CROSS C, E; -- attach the total to every per-character row
G = FOREACH F GENERATE $0, $1, $3, ($1 * 100 / $3); -- char, count, total, percentage (integer division)
DESCRIBE G;
DUMP G;
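With the sample data above, G should contain (W,8,20,40), (H,7,20,35) and (U,5,20,25), in some order.
As a side note, on Pig 0.8 and later you can avoid the CROSS entirely by projecting the single-row total as a scalar. A minimal sketch under the same one-character-per-line assumption:
data = LOAD 'data.txt' AS (c:chararray);
total = FOREACH (GROUP data ALL) GENERATE COUNT(data) AS n;
counts = FOREACH (GROUP data BY c) GENERATE group AS c, COUNT(data) AS cnt;
-- total.n is a scalar projection; this is legal because total has exactly one row
pct = FOREACH counts GENERATE c, cnt, 100.0 * cnt / (double) total.n AS percentage;
DUMP pct;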
You have to do that manually, for example with a dummy 0/1 column whose average gives the fraction:
data = FOREACH data GENERATE *, ((B == 'b1') ? 1 : 0) AS dummy_b1;
grouped = GROUP data ALL;
result = FOREACH grouped GENERATE 100 * AVG(data.dummy_b1) AS percentage;

SAS: checking whether a third variable is between the values of two other variables

I have been dealing with this issue that I thought was trivial, but for some reason nothing I have tried has worked so far.
I have a dataset
obs A B C
1 2 6 7
2 3 1 5
3 8 5 9
. . . .
For each observation, I want to compare the value in column A to the values in columns B and C, and assign the value 1 to a variable called within when A falls between them. My goal is to select only observations where the A value lies between the B and C values. I have tried everything, but nothing seems to be working.
Thank you.
Here's how to do it in a data step. Let me know if that works for you.
data new;
set old;
/* SAS allows chained comparisons: B < A < C means (B < A) and (A < C) */
if B < A < C then within = 1;
else delete;
run;

Convert data into a specific format in Apache Pig

I want to convert data into a specific format in Apache Pig so that I can use a reporting tool on top of it.
For example:
10:00,abc
10:00,cde
10:01,abc
10:01,abc
10:02,def
10:03,efg
The output should be in the following format:
abc cde def efg
10:00 1 1 0 0
10:01 2 0 0 0
10:02 0 0 1 0
The main problem here is that a value can occur multiple times in a row, and depending on the different values available in the sample CSV file there can be up to 120 in total.
Any suggestions to tackle this are more than welcome.
Thanks
Gagan
Try something like the following:
A = LOAD 'data' USING PigStorage(',') AS (key:chararray, value:chararray);
B = FOREACH A GENERATE key, ((value == 'abc') ? 1 : 0) AS abc, ((value == 'cde') ? 1 : 0) AS cde, ((value == 'def') ? 1 : 0) AS def, ((value == 'efg') ? 1 : 0) AS efg;
C = GROUP B BY key;
-- SUM the 0/1 indicators; COUNT would count every row in the group, zeros included
D = FOREACH C GENERATE group AS key, SUM(B.abc) AS abc, SUM(B.cde) AS cde, SUM(B.def) AS def, SUM(B.efg) AS efg;
That should get you a count of the occurrences of each particular value for a particular key.
EDIT: I just noticed the "limit 120" part of the question. If you cannot go above 120, add the following (the cast makes both branches of the bincond the same type):
E = FOREACH D GENERATE key, ((abc > 120) ? 'OVER 120' : (chararray)abc) AS abc, ((cde > 120) ? 'OVER 120' : (chararray)cde) AS cde, ((def > 120) ? 'OVER 120' : (chararray)def) AS def, ((efg > 120) ? 'OVER 120' : (chararray)efg) AS efg;
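With the sample input above, D should come out as (10:00,1,1,0,0), (10:01,2,0,0,0), (10:02,0,0,1,0) and (10:03,0,0,0,1); note that the expected output in the question omits the 10:03 row.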

Selecting random tuple from bag

Is it possible to (efficiently) select a random tuple from a bag in pig?
I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection.
One (inefficient) solution is to count the number of tuples in the bag, take a random number within that range, loop through the bag, and stop when the iteration count matches the random number. Does anyone know a faster or better way to do this?
You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select one element with the smallest random number:
inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
ordered_rnds = order rnds by rnd;
one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
generate group as id, one_tuple;
};
dump randoms;
INPUT:
1 a r
1 a t
1 b r
1 b 4
1 e 4
1 h 4
1 k t
2 k k
2 j j
3 a r
3 e l
3 j l
4 a r
4 b t
4 b g
4 h b
4 j d
5 h k
OUTPUT:
(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})
If you run "dump randoms;" multiple times, you should get different results for each run.
Writing a UDF might give you better performance, since you would not need the secondary sort on the random number within the bag.
I needed to do this myself, and surprisingly found that a very simple answer seems to work. To get about 10% of an alias A:
B = FILTER A BY RANDOM() < 0.1;
Note that this samples roughly 10% of all rows; it does not pick exactly one tuple per bag.
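For what it's worth, Pig also ships a built-in SAMPLE operator that does the same thing:
B = SAMPLE A 0.1; -- keep roughly 10% of the rows of A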