Improve efficiency of PIG Script - apache-pig

DATASET:
I have a data set (data.txt) in (ID, Category) format as given below:
01,X
02,Y
03,X
04,Y
05,X
06,X
07,Z
08,Z
09,X
10,Z
Objective:
The objective is to find out which category has the maximum number of IDs, without using a UDF.
One Approach:
I have tried multiple approaches and concluded that this can be achieved with the following set of Pig statements:
A1 = LOAD 'data.txt' USING PigStorage(',') AS (ID:int , Category:chararray);
A2 = DISTINCT A1;
A3 = GROUP A2 BY Category;
A4 = FOREACH A3 GENERATE group AS Category, COUNT(A2.ID) AS Number;
A5 = GROUP A4 ALL;
A6 = FOREACH A5 GENERATE MAX(A4.Number);
A7 = FILTER A4 by Number == A6.$0;
A8 = FOREACH A7 GENERATE Category;
DUMP A8;
Request:
Although these statements give the desired result, I am not convinced of their efficiency.
As I am new to Pig, I am not sure whether there are any built-in functions that can output the values corresponding to the minimum or maximum of a column.
My request is to know whether this can be achieved in fewer steps.
Many thanks.

After grouping, sort the groups by count in descending order and take the topmost record:
A1 = LOAD 'data.txt' USING PigStorage(',') AS (ID:int , Category:chararray);
A2 = DISTINCT A1;
A3 = GROUP A2 BY Category;
A4 = FOREACH A3 GENERATE group AS Category, COUNT(A2.ID) AS Number;
A5 = ORDER A4 BY Number DESC;
A6 = LIMIT A5 1;
A7 = FOREACH A6 GENERATE Category;
DUMP A7;
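For comparison, the same DISTINCT, GROUP, COUNT, ORDER, LIMIT idea can be sketched in pandas; the file name and column names below simply mirror data.txt above, so treat it as an illustration rather than a drop-in replacement.
import pandas as pd

# Sketch only: 'data.txt' and the column names mirror the question above.
df = pd.read_csv('data.txt', names=['ID', 'Category'])
counts = (df.drop_duplicates()                    # DISTINCT
            .groupby('Category')['ID'].count()    # COUNT per category
            .sort_values(ascending=False))        # ORDER BY count DESC
print(counts.index[0])                            # LIMIT 1: the category with the most IDs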

Related

Plotting CDF for ranking distribution

I have a pandas DataFrame that looks like this; it is generated with a groupby and then sorted by number of users to give me the user count for the top X feature combinations.
count_28day  Feature1  Feature2  Feature3
5000         a1        b1        c1
1000         a2        b2        c2
50           a3        b3        c3
I'm trying to plot the CDF of the user distribution. I don't need to know the features; I just want to show the top X feature combinations that give me 90% of total users.
I'm doing this in a very hacky way.
topx = table.count_28day.sort_values(ascending=False).cumsum()/sum(table.count_28day)
ser_cdf = pd.Series(topx.tolist()[1:100], index=pd.Series(range(1,100)))
ser_cdf.plot(drawstyle='steps')
Is there a more elegant way to do this using histogram or ecdf or something?
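One less hacky sketch, assuming the frame is called table as above and matplotlib is available: compute the normalized cumulative sum directly, keep the rank as the index, and read off the smallest X that reaches 90%.
import pandas as pd
import matplotlib.pyplot as plt

# Sketch only: 'table' reproduces the sample frame from the question.
table = pd.DataFrame({
    'count_28day': [5000, 1000, 50],
    'Feature1': ['a1', 'a2', 'a3'],
    'Feature2': ['b1', 'b2', 'b3'],
    'Feature3': ['c1', 'c2', 'c3'],
})

counts = table['count_28day'].sort_values(ascending=False).reset_index(drop=True)
cdf = counts.cumsum() / counts.sum()     # share of users covered by the top k combinations
cdf.index = cdf.index + 1                # rank combinations 1..N instead of 0..N-1

cdf.plot(drawstyle='steps-post')
plt.axhline(0.9, linestyle='--')         # the 90% coverage line
plt.xlabel('top X feature combinations')
plt.ylabel('cumulative share of users')
plt.show()

print((cdf >= 0.9).idxmax())             # smallest X covering at least 90% of users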

Postgresql performing partitioning to find time difference

I am trying to fill column D and column E.
Column A: varchar(64) - unique for each trip
Column B: smallint
Column C: timestamp without time zone (Excel messed it up in the image below, but you can assume this is a timestamp column)
Column D: numeric - need to find out time from origin in minutes
column E: numeric - time to destination in minutes.
Each trip has different intermediate stations, and I am trying to figure out the time elapsed since the origin and the time to the destination:
Cell D2 = C2 - C2 = 0
Cell D3 = C3 - C2
Cell D4 = C4 - C2
Cell E2 = C6 - C2
Cell E3 = C6 - C3
Cell E6 = C6 - C6 = 0
The main issue is that each trip_id has a different number of stations. I can think about using a partition-by on that column but can't figure out how to implement it.
Another sub-question: I am dealing with a very large table (100 million rows). What is the best way PostgreSQL experts implement data modifications? Do you create a sample table from the original data and test everything on the sample before applying the modifications to the original table, or do you use something like BEGIN TRANSACTION on the original data so that you can roll back in case of any error?
PS: Help with question title appreciated.
You don't need to know the number of stops:
with a as (
    select *,
           extract(minutes from c - min(c) over (partition by a)) dd,
           extract(minutes from max(c) over (partition by a) - c) ee
    from td
)
update td
set d = dd, e = ee
from a
where a.a = td.a and a.b = td.b;
http://sqlfiddle.com/#!17/c9112/1
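For comparison, the same per-trip computation can be sketched in pandas, assuming a DataFrame that mirrors td with trip id a, stop number b, and a timestamp column c (the sample rows are made up for illustration):
import pandas as pd

# Sketch only: 'df' mirrors table td; the rows are illustrative.
df = pd.DataFrame({
    'a': ['trip1', 'trip1', 'trip1'],
    'b': [1, 2, 3],
    'c': pd.to_datetime(['2020-01-01 08:00', '2020-01-01 08:12', '2020-01-01 08:30']),
})

# Minutes since the trip's origin (d) and minutes to its destination (e).
df['d'] = (df['c'] - df.groupby('a')['c'].transform('min')).dt.total_seconds() / 60
df['e'] = (df.groupby('a')['c'].transform('max') - df['c']).dt.total_seconds() / 60
print(df)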

Pig min command and order by

I have data in the form: shell,$917.14,$654.23,2013
I have to find the minimum value in columns $1 and $2.
I tried an ORDER BY on these columns in ascending order, but the answer is not coming out correct. Can anyone please help?
Refer to MIN.
A = LOAD 'test1.txt' USING PigStorage(',') as (f1:chararray,f2:float,f3:float,f4:int,f5:int,f6:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MIN(A.f2),MIN(A.f3);
DUMP C;
EDIT 1: The data you are loading has '$' in it. You will either have to clean it up and load it into a float field to apply the MIN function, or load it into a chararray, replace the '$', cast it to float, and then apply the MIN function.
EDIT 2: Here is the solution that handles the '$' in the Pig script instead of removing it from the original data.
Input:
shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635
PigScript
A = LOAD 'test5.txt' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0,(float)$1,(float)$2,(int)$3,(int)$4,(int)$5;
C = GROUP B1 ALL;
D = FOREACH C GENERATE CONCAT('$',(chararray)MIN(B1.$1)),CONCAT('$',(chararray)MIN(B1.$2));
DUMP D;
Output
($817.12,$2105.57)
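For comparison, the same strip-the-'$', cast, take-the-minimum idea can be sketched in plain Python, using the sample input from the answer above:
import csv
import io

# Sketch only: 'data' is the sample input shown above.
data = """shell,$820.48,$11992.70,996,891,1629
shell,$817.12,$2105.57,1087,845,1630
Bharat,$974.48,$5479.10,965,827,1634
Bharat,$943.70,$9162.57,939,895,1635"""

rows = list(csv.reader(io.StringIO(data)))
min_f2 = min(float(r[1].lstrip('$')) for r in rows)
min_f3 = min(float(r[2].lstrip('$')) for r in rows)
print('$%.2f, $%.2f' % (min_f2, min_f3))   # $817.12, $2105.57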

Automatic group creation in R or SQL

I have an R data frame with a column AS_ID, as given below:
AS_ID
A8653654
B7653655
C5653650
C5653650
A8653654
D1658645
D1658645
C5653650
C5653650
D1658645
C5653650
E4568640
F796740
A8653654
F796740
E4568640
I am trying to group similar records as A1, A2, A3, and so on. For example, all records having AS_ID "A8653654" should be grouped as A1 and entered into a new column as given below:
AS_ID AS
A8653654 A1
B7653655 A2
C5653650 A3
C5653650 A3
A8653654 A1
D1658645 A4
D1658645 A4
C5653650 A3
C5653650 A3
D1658645 A4
C5653650 A3
E4568640 A5
F796740 A6
A8653654 A1
F796740 A6
E4568640 A5
I am fine with either R or Oracle code, since I can write SQL code in R too. Any help will be highly appreciated. My data is a bit more dynamic compared to the sample above, so generic code will help more.
If you've read those values into an R data.frame, it's likely they are already of class "factor". If not, you can convert them to a factor. But each factor value is automatically assigned a unique integer ID already. Here's a sample data.frame
dd<-read.table(text=c("A8653654", "B7653655", "C5653650", "C5653650", "A8653654",
"D1658645", "D1658645", "C5653650", "C5653650", "D1658645", "C5653650",
"E4568640", "F796740", "A8653654", "F796740", "E4568640"), col.names="AS_ID")
Observe that
class(dd$AS_ID)
# [1] "factor"
If it were a character vector, you could do
dd$AS_ID <- factor(dd$AS_ID)
To get the unique IDs, just use as.numeric and then paste an 'A' in front:
dd <- cbind(dd, AS=paste0("A",as.numeric(dd$AS_ID)))
and that gives
#> head(dd)
AS_ID AS
1 A8653654 A1
2 B7653655 A2
3 C5653650 A3
4 C5653650 A3
5 A8653654 A1
6 D1658645 A4
You can get a group identifier in Oracle using dense_rank():
select AS_ID, dense_rank() over (order by AS_ID)
from table t;
If you want an 'A' in front, then concatenate it:
select AS_ID, 'A' || dense_rank() over (order by AS_ID)
from table t;
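For comparison, the same one-label-per-distinct-AS_ID idea can be sketched with pandas.factorize; note that factorize numbers values by first appearance rather than sorted order, so the labels can differ from the R/Oracle output even though the grouping is identical.
import pandas as pd

# Sketch only: reproduces the AS_ID column from the question.
ids = pd.Series(['A8653654', 'B7653655', 'C5653650', 'C5653650', 'A8653654',
                 'D1658645', 'D1658645', 'C5653650', 'C5653650', 'D1658645',
                 'C5653650', 'E4568640', 'F796740', 'A8653654', 'F796740',
                 'E4568640'], name='AS_ID')
codes, _ = pd.factorize(ids)                  # one integer code per distinct AS_ID
result = pd.DataFrame({'AS_ID': ids, 'AS': 'A' + pd.Series(codes + 1).astype(str)})
print(result.head())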

Redis and linked hashes

Everyone,
I would like to ask the community for help finding a way to cache our huge plain table, whether by splitting it into multiple hashes or otherwise.
A sample of the table, as an example of the structure:
A1 B1 C1 D1 E1 X1
A1 B1 C1 D1 E1 X2
A7 B5 C2 D1 E2 X3
A8 B1 C1 D1 E2 X4
A1 B6 C3 D2 E2 X5
A1 B1 C1 D2 E1 X6
This is our denormalized data; we don't have any ability to normalize it.
So currently we must perform 'group by' operations to get the required items; for instance, to get all D* values we perform 'data.GroupBy(A1).GroupBy(B1).GroupBy(C1)', and it takes a lot of time.
As a temporary workaround, we cache the results of these grouping operations under composite string keys:
A1 -> 'list of lines beginning with A1'
A1:B1 -> 'list of lines beginning with A1:B1'
A1:B1:C1 -> 'list of lines beginning with A1:B1:C1'
...
The question is: how can this be stored efficiently?
The estimated number of lines in the denormalized data is around 10M records, and since there are 6 columns as in my example, that means about 60M entries in the hash. So I'm looking for an approach to look up values in O(N), if that's possible.
Thanks.
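A minimal sketch of the composite-key workaround using redis-py, assuming a local Redis instance; the 'rows:<prefix>' key scheme and the tuple layout are illustrative assumptions, not a recommendation for the final design.
import redis

# Sketch only: one Redis SET per composite prefix, so every cached line whose
# leading columns match A1, A1:B1, A1:B1:C1, ... comes back from one SMEMBERS call.
r = redis.Redis()

def index_row(row):
    """row is a tuple of column values, e.g. ('A1', 'B1', 'C1', 'D1', 'E1', 'X1')."""
    line = ':'.join(row)
    for depth in range(1, len(row)):
        prefix = ':'.join(row[:depth])
        r.sadd('rows:' + prefix, line)

def rows_with_prefix(prefix):
    """Return every cached line whose leading columns match, e.g. 'A1:B1:C1'."""
    return sorted(m.decode() for m in r.smembers('rows:' + prefix))

index_row(('A1', 'B1', 'C1', 'D1', 'E1', 'X1'))
index_row(('A1', 'B1', 'C1', 'D1', 'E1', 'X2'))
print(rows_with_prefix('A1:B1:C1'))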