How can I find the Maximum Year in a given dataset using PIG? - apache-pig

Suppose I have the following dataset:
Year Temp
1974 48
1974 48
1991 56
1983 89
1993 91
1938 41
1938 56
1941 93
1983 87
I want my final answer to be 93 (pertaining to the year 1941). I am able to find the maximum temperature for each year (e.g. 1941: 93) but unable to find only the overall maximum. Any suggestions are appreciated.
Thanks,

You can solve this problem in two ways.
Option 1: Using GROUP ALL + MAX
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MAX(A.Temp);
DUMP C;
Output:
(93)
Option 2: Using ORDER and LIMIT
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = ORDER A BY Temp DESC;
C = LIMIT B 1;
D = FOREACH C GENERATE Temp;
DUMP D;
Output:
(93)

Related

How to: For each unique id, for each unique version, grab the best score and organize it into a table

Just wanted to preface this by saying that while I do have a basic understanding, I am still fairly new to using BigQuery tables and SQL statements in general.
I am trying to make a new view out of a query that grabs all of the best test scores for each version by each employee:
select emp_id,version,max(score) as score from `project.dataset.table` where type = 'assessment_test' group by version,emp_id order by emp_id
I'd like to take the results of that query and make a new table composed of employee ids, with a column for each version's best score for that row's emp_id. I know that I can manually make a table for each version by including a "where version = a", "where version = b", etc.... and then joining all of the tables at the end, but that doesn't seem like the most elegant solution, plus there are about 20 different versions in total.
Is there a way to programmatically create a column for each unique version or at the very least use my initial query as maybe a subquery and just reference it, something like this:
with a as (
select id,version,max(score) as score
from `project.dataset.table`
where type = 'assessment_test' and version is not null and score is not null and id is not null
group by version,id
order by id),
version_a as (select score from a where version = 'version_a'),
version_b as (select score from a where version = 'version_b'),
version_c as (select score from a where version = 'version_c')
select
a.id as id,
version_a.score as version_a,
version_b.score as version_b,
version_c.score as version_c
from
a,
version_a,
version_b,
version_c
Example Data:
id  version  score
1   a        88
1   b        93
1   c        92
2   a        89
2   b        99
2   c        78
3   a        95
3   b        83
3   c        89
4   a        90
4   b        90
4   c        86
5   a        82
5   b        78
5   c        98
1   a        79
1   b        97
1   c        77
2   a        100
2   b        96
2   c        85
3   a        83
3   b        87
3   c        96
4   a        84
4   b        80
4   c        77
5   a        95
5   b        77
Expected Output:
id  a score  b score  c score
1   88       97       92
2   100      99       85
3   95       87       96
4   90       90       86
5   95       78       98
Thanks in advance and feel free to ask any clarifying questions
Use the below approach:
select * from your_table
pivot (max(score) score for version in ('a', 'b', 'c'))
If applied to the sample data in your question, the output matches the expected output above.
In case the versions are not known in advance, use the below:
execute immediate (select '''
select * from your_table
pivot (max(score) score for version in (''' || string_agg(distinct "'" || version || "'") || "))"
from your_table
)

Compare two Excel files that have a different number of rows using Python Pandas

I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I looked on the website, but I didn't find a solution for my case!
Here is an example :
df1 (old report) :
id qte d1 d2
A 10 23 35
B 43 63 63
C 15 61 62
df2 (new report) :
id qte d1 d2
A 20 23 35
C 15 61 62
E 38 62 16
F 63 20 51
and the results should be:
the modified rows must be in yellow and the modified values in red
the new rows in green
the deleted rows in red
id qte d1 d2
A 20 23 35
C 15 61 62
B 43 63 63
E 38 62 16
F 63 20 51
The code:
import pandas as pd
import numpy as np
df1= pd.read_excel(r'C .....\data novembre.xlsx','Sheet1',na_values=['NA'])
df2= pd.read_excel(r'C.....\data decembre.xlsx','Sheet1',na_values=['NA'])
merged_data=df1.merge(df2, left_on = 'id', right_on = 'id', how = 'outer')
Joining the data, though, is not what I want!
I'm just starting to learn Python so I really need help!
An Excel diff can quickly become a funky beast, but we should be able to do this with some concats and boolean statements.
Assuming your dataframes are called df1 and df2:
df1 = df1.set_index('id')
df2 = df2.set_index('id')
df3 = pd.concat([df1,df2],sort=False)
df3a = df3.stack().groupby(level=[0,1]).unique().unstack(1).copy()
df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new' # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells.
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fufilled by above rules is the same.
print(df3a)
d1 d2 qte status
id
A [23] [35] [10, 20] modified
B [63] [63] [43] deleted
C [61] [62] [15] same
E [62] [16] [38] new
F [20] [51] [63] new
If you don't mind the performance hit of turning all your datatypes to strings, then this could work. I don't recommend it though; use a fact or slowly changing dimension schema to hold such data, and you'll thank yourself in the future.
df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)
d1 d2 qte status
id
A 23 35 10-->20 modified
B 63 63 43 deleted
C 61 62 15 same
E 62 16 38 new
F 20 51 63 new
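If you also need the colour-coding asked for in the question, a minimal sketch building on the status column above could use pandas' Styler. The highlight helper, the colour choices, and the output file name are assumptions, not part of the original answer, and writing styled output requires the openpyxl engine:
def highlight(row):
    # Hypothetical mapping from the status column to a row background colour.
    colours = {'new': 'background-color: lightgreen',
               'deleted': 'background-color: red',
               'modified': 'background-color: yellow'}
    return [colours.get(row['status'], '')] * len(row)

# Same flattened string frame as above, then one colour per row based on its status.
flat = df3a.stack().explode().astype(str).groupby(level=[0, 1]).agg('-->'.join).unstack(1)
flat.style.apply(highlight, axis=1).to_excel('diff_report.xlsx', engine='openpyxl')
Styler.apply with axis=1 receives one row at a time and must return one CSS string per column, so every cell in a row inherits the colour of that row's status.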

How merge several time series data sets to a data frame and then using cross validation on it?

Hi, I have 50 time series datasets in 50 .CSV files, and each of them has 2 columns: data and label, exactly like below. Each of these .CSV files contains more than 800,000 rows of signal records, and none of them is equal to another.
How can I merge these .CSV files into a data frame in order to select training and testing data with cross-validation?
Because I am working with an RNN, the sequence of the data is important.
DATA OF CASE 1:
Data label
23 A
88 A
56 B
87 B
56 c
87 D
17 C
44 B
12 A
----------------
DATA OF CASE 2:
Data label
13 A
98 B
56 B
77 C
49 D
89 c
19 B
23 B
32 A
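A minimal pandas sketch of one way to combine the files while keeping each recording's row order intact. The file names case_1.csv through case_50.csv, the added case column, and the choice of GroupKFold are assumptions, not from the original post:
import pandas as pd
from sklearn.model_selection import GroupKFold

# Read every file and tag each row with the case it came from,
# so the 50 recordings stay distinguishable after concatenation.
frames = []
for i in range(1, 51):
    df = pd.read_csv(f'case_{i}.csv')   # columns: Data, label (assumed file name)
    df['case'] = i
    frames.append(df)
all_data = pd.concat(frames, ignore_index=True)

# Cross-validate on whole cases so each sequence ends up entirely in
# either the training or the test fold, preserving its internal order.
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(all_data, groups=all_data['case']):
    train, test = all_data.iloc[train_idx], all_data.iloc[test_idx]
    # build per-case sequences from train/test and feed them to the RNN here
If the folds must also respect the time order of whole recordings, sklearn's TimeSeriesSplit over the case numbers is an alternative to GroupKFold.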

Find 5 top popular based on sum in Pig Script

I'm trying to find the top 3 most popular locations with the greatest tripCount.
So I need to see the total of tripCount per location and return the greatest n...
My data is as follow:
LocationID tripCount tripDistance
101 40 4.6
203 29 1.3
56 25 9.3
101 17 4.5
66 5 1.1
13 5 0.5
203 10 1.2
558 8 0.5
56 10 5.5
So the result I'm expecting is:
101 57
203 39
56 35
So far my code is:
B = GROUP UNION_DATA BY DOLocationID;
C = FOREACH B {
DA = ORDER UNION_DATA BY passenger_count DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.LocationID), FLATTEN(DB.dropoff_datetime);
}
What am I missing and what do I need to do to get the expected result?
The below piece of code should get you the desired results.
I broke down the statement into simple chunks for better understanding and readability. Also, the alias and code you provided seem incomplete, so I completely rewrote it from scratch.
Input data (columns: LocationID, tripCount, tripDistance):
cat > trip_data.txt
101,40,4.6
203,29,1.3
56,25,9.3
101,17,4.5
66,5,1.1
13,5,0.5
203,10,1.2
558,8,0.5
56,10,5.5
PIG Code:
A = load '/home/ec2-user/trip_data.txt' using PigStorage(',') as (LocationID:int, tripCount:int, tripDistance:double);
describe A;
B = GROUP A BY LocationID;
describe B;
dump B;
C = FOREACH B GENERATE group, SUM(A.tripCount);
describe C;
dump C;
D = ORDER C BY $1 DESC;
describe D;
dump D;
RESULT = LIMIT D 3;
describe RESULT;
dump RESULT;

Percentage calculation from pivot table pandas

I have a set of data which I have already imported from an Excel xlsx file. After that I want to find out the percentage of the total profit from each customer segment. I managed to use pivot_table to summarize the total profit of each customer segment. However, I would also like to know the percentage. How do I do that?
Pivot_table
profit = df.pivot_table(index = ['Customer Segment'], values = ['Profit'], aggfunc=sum)
Result So far
Customer Segment Profit
A a
B b
C c
D d
Maybe adding the percentage column to the pivot table would be an ideal way. But how can I do that?
How about dividing by the total, applied to your pivot result:
profit['percent'] = profit['Profit'] / profit['Profit'].sum()
For example you have this data frame:
Customer Segment Customer Profit
0 A AAA 12
1 B BBB 43
2 C CCC 45
3 D DDD 23
4 D EEE 67
5 C FFF 21
6 B GGG 45
7 A JJJ 67
8 A KKK 32
9 B LLL 13
10 C MMM 43
11 D NNN 13
From the above data frame you want to make a pivot table.
import pandas as pd
import numpy as np
tableframe = pd.pivot_table(df, values='Profit', index=['Customer Segment'], aggfunc=np.sum)
Here is your pivot table:
Profit
Customer Segment
A 111
B 101
C 109
D 103
Now you want to add another column to tableframe and then compute the percentage.
tableframe['percentage'] = ((tableframe.Profit / tableframe.Profit.sum()) * 100)
Here is your final tableframe:
Profit percentage
Customer Segment
A 111 26.179245
B 101 23.820755
C 109 25.707547
D 103 24.292453
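For reference, the same percentages can be computed without pivot_table, using a plain groupby; a small sketch, assuming the same df as above:
# Sum profit per segment, then express each segment as a share of the total.
pct = df.groupby('Customer Segment')['Profit'].sum()
pct = (pct / pct.sum() * 100).rename('percentage')
print(pct)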