Pig order by with rank and join the rank together

I have the following data with the schema (t0:chararray, t1:int)
a0 1
a1 7
b2 9
a2 4
b0 6
I want to order it by t1 and then append the rank to each row:
a0 1 1
a2 4 2
b0 6 3
a1 7 4
b2 9 5
Is there a convenient way to do this in Pig without writing a UDF?

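Pig has a built-in RANK operator. This should be sufficient:
X = rank A by t1 ASC;
Note that RANK prepends the rank as the first field of each tuple, so you need a FOREACH ... GENERATE step afterwards if you want it as the last column. Please see the Pig docs for more details.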
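To make the intended result concrete, here is a minimal plain-Python sketch (not Pig) of what ordering by t1 and attaching a rank produces on the sample data. The names `records`, `ordered`, and `ranked` are made up for this illustration, and the sketch assumes there are no ties in t1, which holds for the sample.

```python
# Sample data from the question: (t0, t1)
records = [("a0", 1), ("a1", 7), ("b2", 9), ("a2", 4), ("b0", 6)]

# Order by t1 ascending, then attach a 1-based rank as the last field.
# Assumption: no duplicate t1 values, so rank is simply the position.
ordered = sorted(records, key=lambda r: r[1])
ranked = [(t0, t1, rank) for rank, (t0, t1) in enumerate(ordered, start=1)]

for row in ranked:
    print(*row)
```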

How to: For each unique id, for each unique version, grab the best score and organize it into a table

Just wanted to preface this by saying that, while I do have a basic understanding, I am still fairly new to BigQuery tables and SQL statements in general.
I am trying to make a new view out of a query that grabs all of the best test scores for each version by each employee:
select emp_id,version,max(score) as score from `project.dataset.table` where type = 'assessment_test' group by version,emp_id order by emp_id
I'd like to take the results of that query and make a new table of employee IDs, with one column per version holding that version's best score for the row's emp_id. I know that I can manually make a table for each version by including a "where version = a", "where version = b", etc., and then joining all of the tables at the end, but that doesn't seem like the most elegant solution, and there are about 20 different versions in total.
Is there a way to programmatically create a column for each unique version, or at the very least use my initial query as a subquery and just reference it, something like this:
with a as (
select id,version,max(score) as score
from `project.dataset.table`
where type = 'assessment_test' and version is not null and score is not null and id is not null
group by version,id
order by id),
version_a as (select score from a where version = 'version_a')
version_b as (select score from a where version = 'version_b')
version_c as (select score from a where version = 'version_c')
select
a.id as id,
version_a.score as version_a,
version_b.score as version_b,
version_c.score as version_c
from
a,
version_a,
version_b,
version_c
Example Data:
id version score
1 a 88
1 b 93
1 c 92
2 a 89
2 b 99
2 c 78
3 a 95
3 b 83
3 c 89
4 a 90
4 b 90
4 c 86
5 a 82
5 b 78
5 c 98
1 a 79
1 b 97
1 c 77
2 a 100
2 b 96
2 c 85
3 a 83
3 b 87
3 c 96
4 a 84
4 b 80
4 c 77
5 a 95
5 b 77
Expected Output:
id  a score  b score  c score
1   88       97       92
2   100      99       85
3   95       87       96
4   90       90       86
5   95       78       98
Thanks in advance and feel free to ask any clarifying questions
Use the approach below:
select * from your_table
pivot (max(score) score for version in ('a', 'b', 'c'))
If applied to the sample data in your question, the output has one row per id with the columns score_a, score_b, and score_c.
If the versions are not known in advance, use the following:
execute immediate (select '''
select * from your_table
pivot (max(score) score for version in (''' || string_agg(distinct "'" || version || "'") || "))"
from your_table
)
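As a rough illustration of what the PIVOT above computes, here is a plain-Python sketch (not BigQuery code) that takes the maximum score per id and version and spreads the versions into columns. The `rows` list is the sample data from the question; `versions`, `best`, and `pivoted` are names invented for this example. The list of versions is derived from the data, which mirrors the idea behind the dynamic `execute immediate` variant.

```python
from collections import defaultdict

# Sample data from the question: (id, version, score)
rows = [
    (1, "a", 88), (1, "b", 93), (1, "c", 92),
    (2, "a", 89), (2, "b", 99), (2, "c", 78),
    (3, "a", 95), (3, "b", 83), (3, "c", 89),
    (4, "a", 90), (4, "b", 90), (4, "c", 86),
    (5, "a", 82), (5, "b", 78), (5, "c", 98),
    (1, "a", 79), (1, "b", 97), (1, "c", 77),
    (2, "a", 100), (2, "b", 96), (2, "c", 85),
    (3, "a", 83), (3, "b", 87), (3, "c", 96),
    (4, "a", 84), (4, "b", 80), (4, "c", 77),
    (5, "a", 95), (5, "b", 77),
]

# Distinct versions become the pivoted columns (discovered from the data).
versions = sorted({version for _, version, _ in rows})

# best[id][version] holds the maximum score seen so far.
best = defaultdict(dict)
for id_, version, score in rows:
    current = best[id_].get(version)
    if current is None or score > current:
        best[id_][version] = score

# One output row per id; a missing (id, version) pair would become None.
pivoted = [(id_, *(best[id_].get(v) for v in versions)) for id_ in sorted(best)]

print(("id", *(f"score_{v}" for v in versions)))
for row in pivoted:
    print(row)
```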

Groupby and smallest on more than one index

I have a data frame as follows
REG LOC DATE SUM
1 A1 19-07-20 10
1 B1 19-07-20 25
1 C1 19-07-20 20
2 A2 19-07-20 25
2 B2 19-07-20 30
2 C3 19-07-20 45
1 A1 20-07-20 15
1 B1 20-07-20 20
1 C1 20-07-20 30
2 A2 20-07-20 10
2 B2 20-07-20 15
2 C3 20-07-20 30
1 A1 21-07-20 25
1 B1 21-07-20 35
1 C1 21-07-20 45
2 A2 21-07-20 20
2 B2 21-07-20 30
2 C3 21-07-20 40
I want to find the LOC values with the 2 smallest SUM values for each region and date combination. For example, for date 19-07-20 and region 1, the smallest are LOC A1 and C1, and for region 2 they are A2 and B2. I am able to do it for one level with the following code, but I am not able to introduce another level into it.
groupby(level=0,group_keys=False).apply(lambda x: x.nsmallest())
How can I do this for 2 levels instead of just one, when I want the n smallest values for each combination?
Thanks
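One possible approach, sketched here under the assumption that REG, LOC, DATE, and SUM are ordinary columns (the question's `level=0` suggests they may actually live in the index, in which case a `reset_index()` first would be needed): sort by SUM, group by both keys, and keep the first n rows of each group. The variable names `n` and `smallest` are made up for this example.

```python
import pandas as pd

# Sample data from the question, assumed to be plain columns.
df = pd.DataFrame({
    "REG": [1, 1, 1, 2, 2, 2] * 3,
    "LOC": ["A1", "B1", "C1", "A2", "B2", "C3"] * 3,
    "DATE": ["19-07-20"] * 6 + ["20-07-20"] * 6 + ["21-07-20"] * 6,
    "SUM": [10, 25, 20, 25, 30, 45,
            15, 20, 30, 10, 15, 30,
            25, 35, 45, 20, 30, 40],
})

n = 2  # how many of the smallest rows to keep per (DATE, REG) group

# After sorting by SUM, the first n rows of each group are its n smallest.
smallest = (
    df.sort_values("SUM")
      .groupby(["DATE", "REG"])
      .head(n)
      .sort_values(["DATE", "REG", "SUM"])
      .reset_index(drop=True)
)
print(smallest)
```

Grouping by a list of keys is what adds the second level; sorting plus `head(n)` keeps all original columns, which a bare `nsmallest` on a single column would not.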

How to merge common indices when creating MultiIndex DataFrame

I have a DataFrame that looks like this:
Method Dataset foo bar
0 A1 B1 10 20
1 A1 B2 10 20
2 A1 B2 10 20
3 A2 B1 10 20
4 A3 B1 10 20
5 A1 B1 10 20
6 A2 B2 10 20
7 A3 B2 10 20
I'd like to use Method and Dataset columns to turn this into a MultiIndex DataFrame. So I tried doing:
df.set_index(["Method", "Dataset"], inplace=True)
df.sort_index(inplace=True)
Which gives:
                foo  bar
Method Dataset
A1     B1        10   20
       B1        10   20
       B2        10   20
       B2        10   20
A2     B1        10   20
       B2        10   20
A3     B1        10   20
       B2        10   20
This is almost what I want but I was expecting to see common values in Dataset index to also be merged under one value, i.e. similar to Method index:
                foo  bar
Method Dataset
A1     B1        10   20
                 10   20
       B2        10   20
                 10   20
A2     B1        10   20
       B2        10   20
A3     B1        10   20
       B2        10   20
How can I achieve that?
(This might not make a big difference to how you'd use a DataFrame but I'm trying to use the to_latex() method which is sensitive to these things)
I suggest you do this at the very end, right before you write the DataFrame with to_latex; otherwise the blanked-out index values can cause issues with further data processing.
We will replace the duplicated entries in the last level with the empty string and reconstruct the entire MultiIndex.
import pandas as pd
import numpy as np

df.index = pd.MultiIndex.from_arrays([
    df.index.get_level_values('Method'),
    np.where(df.index.duplicated(), '', df.index.get_level_values('Dataset'))
], names=['Method', 'Dataset'])
                foo  bar
Method Dataset
A1     B1        10   20
                 10   20
       B2        10   20
                 10   20
A2     B1        10   20
       B2        10   20
A3     B1        10   20
       B2        10   20
If you want to make this a bit more flexible for any number of levels (even just a simple Index), we can use this function, which blanks out the duplicated entries in the last level:
def white_out_index(idx):
    """idx : pd.MultiIndex or pd.Index"""
    i0 = [idx.get_level_values(i) for i in range(idx.nlevels-1)]
    i0.append(np.where(idx.duplicated(), '', idx.get_level_values(-1)))
    return pd.MultiIndex.from_arrays(i0, names=idx.names)

df.index = white_out_index(df.index)

How to get the last non empty value of a hierarchy?

I've got a hierarchy with the appropriate value linked to each level, let's say :
A          100
  A1       NULL
  A2       NULL
B          NULL
  B1       NULL
  B2       1000
    B21    500
    B22    500
  B3       NULL
This hierarchy is materialized in my database as a parent-child hierarchy
Hierarchy Table
------------------------
Id   Code   Parent_Id
1    A      NULL
2    A1     1
3    A2     1
4    B      NULL
5    B1     4
6    B2     4
7    B21    6
8    B22    6
9    B3     4
And here is my fact table :
Fact Table
------------------------
Hierarchy_Id Value
1 100
6 1000
7 500
8 500
My question is: do you have any idea how to get only the last non-empty value of my hierarchy?
I know that there is an MDX function which could do this job, but I'd like to do it another way.
To be clear, the desired output would be :
Fact Table
------------------------
Hierarchy_Id Value
1 100
7 500
8 500
(If necessary, the work of flattening the hierarchy is already done...)
Thank you in advance!
If the codes for your hierarchy are correct, then you can use the information in the codes to determine the depth of the hierarchy. I think you want to filter out any "code" where there is a longer code that starts with it.
In that case:
select f.*
from fact f join
     hierarchy h
     on f.Hierarchy_Id = h.Id
where not exists (select 1
                  from fact f2 join
                       hierarchy h2
                       on f2.Hierarchy_Id = h2.Id
                  where h2.code like concat(h.code, '%') and
                        h2.code <> h.code
                 )
Here I've used the function concat() to create the pattern. In some databases, you might use + or || instead.
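To check the idea against the sample data, here is a small sketch that runs the same query in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for whatever database you use, so the pattern is built with || instead of concat(). The join columns assume the names shown in the question's tables (Id and Hierarchy_Id), and Parent_Id for A2 is assumed to be 1; the query itself never reads Parent_Id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hierarchy (Id INTEGER PRIMARY KEY, Code TEXT, Parent_Id INTEGER);
CREATE TABLE fact (Hierarchy_Id INTEGER, Value INTEGER);

-- Parent_Id of A2 is assumed to be 1; it is not used by the query below.
INSERT INTO hierarchy VALUES
  (1, 'A', NULL), (2, 'A1', 1), (3, 'A2', 1),
  (4, 'B', NULL), (5, 'B1', 4), (6, 'B2', 4),
  (7, 'B21', 6), (8, 'B22', 6), (9, 'B3', 4);

INSERT INTO fact VALUES (1, 100), (6, 1000), (7, 500), (8, 500);
""")

# Keep a fact row only if no other fact row has a longer code
# that starts with this row's code.
rows = conn.execute("""
SELECT f.Hierarchy_Id, f.Value
FROM fact f
JOIN hierarchy h ON f.Hierarchy_Id = h.Id
WHERE NOT EXISTS (
    SELECT 1
    FROM fact f2
    JOIN hierarchy h2 ON f2.Hierarchy_Id = h2.Id
    WHERE h2.Code LIKE h.Code || '%'
      AND h2.Code <> h.Code
)
ORDER BY f.Hierarchy_Id
""").fetchall()

for row in rows:
    print(row)
```

B2 is dropped because B21 and B22 carry values, while A is kept because neither A1 nor A2 has a fact row.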

Using correctly HAVING with group by and COUNT

I am running this query:
SELECT u.id as id,
COUNT(DISTINCT YEAR(TIMESTAMP), WEEK(TIMESTAMP)) cc,
GROUP_CONCAT(DISTINCT YEAR(TIMESTAMP),'-',WEEK(TIMESTAMP)) a
FROM users u
JOIN checkins c
ON c.userid = u.id
GROUP BY userid
HAVING COUNT(cc) = 3
And this produces the following results:
id cc a
05 3 2010-43,2010-47,2010-45
06 2 2010-44,2010-45
13 3 2010-43,2010-45,2010-48
20 3 2010-45,2010-43,2010-47
21 3 2010-43,2010-47,2010-45
22 2 2010-47,2010-48
25 3 2010-48,2010-43,2010-46
27 2 2010-42,2010-47
30 2 2010-48,2010-45
41 3 2010-44,2010-45,2010-47
44 2 2010-42,2010-44
50 2 2010-44,2010-47
52 2 2010-48,2010-47
57 2 2010-43,2010-44
71 3 2010-43,2010-48,2010-47
72 2 2010-43,2010-44
78 3 2010-47,2010-42,2010-43
79 2 2010-45,2010-46
80 2 2010-46,2010-44
87 1 2010-46
97 1 2010-48
108 3 2010-43,2010-47,2010-45
As you can see, the cc column has values of 2, 3, or even 1.
How can that be, when I've specified with HAVING that it should be 3?
MySQL does allow aliases in the HAVING clause. You would need to use:
HAVING cc = 3
not
HAVING COUNT(cc) = 3
to filter the results down to only the rows that have a cc value of 3. I'm actually quite unsure, though, why HAVING COUNT(cc) = 3 would return any results at all.
I'd just like to expand on what was already said about aliases and the HAVING clause.
You have already created the alias cc, which holds the counts you'd like to filter on, so you just need to reference the aliased column in HAVING, like:
HAVING cc = 3
What you tried (COUNT(cc) = 3) would only make sense if you were grouping by the cc column (if that were possible); it would then filter out every cc value that didn't appear exactly 3 times.
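The alias-in-HAVING behaviour can be tried out with a small sketch using Python's sqlite3 module. SQLite stands in for MySQL here (it also accepts a select-list alias in HAVING), strftime('%Y-%W', ...) replaces YEAR()/WEEK(), and the checkins rows and the column name ts are invented for the example rather than taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE checkins (userid INTEGER, ts TEXT);
-- Hypothetical data: user 5 checks in during 3 distinct weeks,
-- user 6 during 2, and user 87 twice within the same week.
INSERT INTO checkins VALUES
  (5, '2010-10-25'), (5, '2010-11-08'), (5, '2010-11-22'),
  (6, '2010-11-01'), (6, '2010-11-08'),
  (87, '2010-11-15'), (87, '2010-11-16');
""")

base = """
SELECT userid AS id,
       COUNT(DISTINCT strftime('%Y-%W', ts)) AS cc
FROM checkins
GROUP BY userid
"""

# Without HAVING: every user with their distinct-week count.
all_counts = conn.execute(base + " ORDER BY userid").fetchall()

# Filtering on the alias keeps only users with exactly 3 distinct weeks.
filtered = conn.execute(base + " HAVING cc = 3 ORDER BY userid").fetchall()

print(all_counts)
print(filtered)
```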