How to merge common indices when creating MultiIndex DataFrame - pandas

I have a DataFrame that looks like this:
Method Dataset foo bar
0 A1 B1 10 20
1 A1 B2 10 20
2 A1 B2 10 20
3 A2 B1 10 20
4 A3 B1 10 20
5 A1 B1 10 20
6 A2 B2 10 20
7 A3 B2 10 20
I'd like to use Method and Dataset columns to turn this into a MultiIndex DataFrame. So I tried doing:
df.set_index(["Method", "Dataset"], inplace=True)
df.sort_index(inplace=True)
Which gives:
                foo  bar
Method Dataset
A1     B1       10   20
       B1       10   20
       B2       10   20
       B2       10   20
A2     B1       10   20
       B2       10   20
A3     B1       10   20
       B2       10   20
This is almost what I want but I was expecting to see common values in Dataset index to also be merged under one value, i.e. similar to Method index:
foo bar
Method Dataset
A1 B1 10 20
10 20
B2 10 20
10 20
A2 B1 10 20
B2 10 20
A3 B1 10 20
B2 10 20
How can I achieve that?
(This might not make a big difference to how you'd use a DataFrame but I'm trying to use the to_latex() method which is sensitive to these things)

I suggest you do this at the very end right before you write the DataFrame to_latex, otherwise you can have issues with data processing.
We will make the duplicated entries in the last level the empty string and reconstruct the entire MultiIndex.
import pandas as pd
import numpy as np

df.index = pd.MultiIndex.from_arrays([
    df.index.get_level_values('Method'),
    np.where(df.index.duplicated(), '', df.index.get_level_values('Dataset'))
], names=['Method', 'Dataset'])
foo bar
Method Dataset
A1 B1 10 20
10 20
B2 10 20
10 20
A2 B1 10 20
B2 10 20
A3 B1 10 20
B2 10 20
If you want to make this a bit more flexible for any number of levels (or even just a simple Index), we can use this function, which blanks out duplicates in the last level:
def white_out_index(idx):
    """idx : pd.MultiIndex or pd.Index"""
    i0 = [idx.get_level_values(i) for i in range(idx.nlevels - 1)]
    i0.append(np.where(idx.duplicated(), '', idx.get_level_values(-1)))
    return pd.MultiIndex.from_arrays(i0, names=idx.names)

df.index = white_out_index(df.index)
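As a self-contained check, the sketch below reconstructs the question's example frame and runs the same function end to end (the data values are copied from the question; the function is restated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Reconstruct the question's example frame
df = pd.DataFrame({
    'Method': ['A1', 'A1', 'A1', 'A2', 'A3', 'A1', 'A2', 'A3'],
    'Dataset': ['B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2', 'B2'],
    'foo': [10] * 8,
    'bar': [20] * 8,
})
df = df.set_index(['Method', 'Dataset']).sort_index()

def white_out_index(idx):
    """Blank out repeated entries in the last level of idx."""
    i0 = [idx.get_level_values(i) for i in range(idx.nlevels - 1)]
    i0.append(np.where(idx.duplicated(), '', idx.get_level_values(-1)))
    return pd.MultiIndex.from_arrays(i0, names=idx.names)

df.index = white_out_index(df.index)
```

After this, `df.to_latex()` renders the index with the duplicated Dataset labels blanked, as in the expected output above.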


Create a new column for table B based on information from table A

I have this problem. I want to create a report that keeps everything in table B, but adds another column from table A (QtyRecv).
Condition: If RunningTotalQtyUsed (from table B) < QtyRecv, take that QtyRecv for the new column.
For example, for item A1, (RunningTotalQtyUsed) 55 < 100 (QtyRecv), -> ExpectedQtyRecv = 100.
But if RunningTotalQtyUsed exceeds QtyRecv, we take the next QtyRecv to cover that used quantity.
For example, 101 > 100, -> ExpectedQtyRecv = 138.
149 (RunningTotalQtyUsed) < (100 + 138) (QtyRecv) -> get 138.
250 < (100 + 138 + 121) -> get 121.
The same logic applies to item A2.
If total QtyRecv = 6 + 4 + 10 = 20, but RunningTotalQtyUsed = 31 -> result should be 99999 to notify an error that QtyRecv can't cover QtyUsed.
Table A:
Item QtyRecv
A1 100
A1 138
A1 121
A2 6
A2 4
A2 10
Table B:
Item RunningTotalQtyUsed
A1 55
A1 101
A1 149
A1 250
A2 1
A2 5
A2 9
A2 19
A2 31
Expected result:
Item RunningTotalQtyUsed ExpectedQtyRecv
A1 55 100
A1 101 138
A1 149 138
A1 250 121
A2 1 6
A2 5 6
A2 9 4
A2 19 10
A2 31 99999
What I tried:
SELECT b.*
FROM tableB b LEFT JOIN tableA a
ON b.item = a.item
item RunningTotalQtyUsed
A1 55
A1 55
A1 55
A1 101
A1 101
A1 101
A1 149
A1 149
A1 149
A1 250
A1 250
A1 250
A2 1
A2 1
A2 1
A2 5
A2 5
A2 5
A2 9
A2 9
A2 9
A2 19
A2 19
A2 19
A2 31
A2 31
A2 31
It doesn't keep the same number of rows as table B. How can I keep all of table B's rows but add the ExpectedQtyRecv from table A? Thank you so much for all the help!
SELECT B_TOTAL.ITEM, B_TOTAL.SUM_RunningTotalQtyUsed, A_TOTAL.SUM_QtyRecv
FROM
(
    SELECT B.ITEM, SUM(B.RunningTotalQtyUsed) AS SUM_RunningTotalQtyUsed
    FROM TABLE_B AS B
    GROUP BY B.ITEM
) B_TOTAL
LEFT JOIN
(
    SELECT A.ITEM, SUM(A.QtyRecv) AS SUM_QtyRecv
    FROM TABLE_A AS A
    GROUP BY A.ITEM
) A_TOTAL ON B_TOTAL.ITEM = A_TOTAL.ITEM
I can't be sure, but maybe you need something like the above?
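The aggregate join above gives per-item totals, not the row-by-row matching the question asks for. Since the rest of this page uses pandas, here is a sketch of the row-level logic in pandas (the table data is copied from the question; the cumulative-sum approach and the `expected` helper are my own, not from either post): for each usage row, find the first receipt whose running total of QtyRecv covers the usage, and fall back to 99999 when nothing does.

```python
import pandas as pd

# Table A and table B from the question
a = pd.DataFrame({'Item': ['A1'] * 3 + ['A2'] * 3,
                  'QtyRecv': [100, 138, 121, 6, 4, 10]})
b = pd.DataFrame({'Item': ['A1'] * 4 + ['A2'] * 5,
                  'RunningTotalQtyUsed': [55, 101, 149, 250, 1, 5, 9, 19, 31]})

# Cumulative received quantity per item gives the coverage thresholds
a['CumRecv'] = a.groupby('Item')['QtyRecv'].cumsum()

def expected(row):
    # First receipt whose cumulative total covers the usage; 99999 if none
    cand = a[(a['Item'] == row['Item']) &
             (a['CumRecv'] >= row['RunningTotalQtyUsed'])]
    return cand['QtyRecv'].iloc[0] if len(cand) else 99999

b['ExpectedQtyRecv'] = b.apply(expected, axis=1)
```

In SQL the same idea can be expressed with a window function computing `SUM(QtyRecv) OVER (PARTITION BY Item ORDER BY ...)` and picking the first covering row, but the exact syntax depends on the database in use.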

Groupby and smallest on more than one index [duplicate]

This question already has answers here:
Keep other columns when doing groupby
(5 answers)
pandas groupby, then sort within groups
(9 answers)
Closed 2 years ago.
I have a data frame as follows
REG LOC DATE SUM
1 A1 19-07-20 10
1 B1 19-07-20 25
1 C1 19-07-20 20
2 A2 19-07-20 25
2 B2 19-07-20 30
2 C3 19-07-20 45
1 A1 20-07-20 15
1 B1 20-07-20 20
1 C1 20-07-20 30
2 A2 20-07-20 10
2 B2 20-07-20 15
2 C3 20-07-20 30
1 A1 21-07-20 25
1 B1 21-07-20 35
1 C1 21-07-20 45
2 A2 21-07-20 20
2 B2 21-07-20 30
2 C3 21-07-20 40
I want to find the LOC with the smallest 2 values of SUM for each region and date combination. For example, for date 19-07-20 and region 1 the smallest are LOC A1 and C1, and for region 2 they are A2 and B2. I am able to do it for one level with the following code, but I'm not able to introduce another level:
df.groupby(level=0, group_keys=False).apply(lambda x: x.nsmallest(2, 'SUM'))
How can I do it for 2 levels, not just one, when I want the n smallest values per combination?
Thanks
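Although this question was closed as a duplicate, the two-level case is a small change: pass both keys to groupby. A sketch (the frame below is reconstructed from the question's data):

```python
import pandas as pd

# Reconstruct the question's frame
df = pd.DataFrame({
    'REG':  [1, 1, 1, 2, 2, 2] * 3,
    'LOC':  ['A1', 'B1', 'C1', 'A2', 'B2', 'C3'] * 3,
    'DATE': ['19-07-20'] * 6 + ['20-07-20'] * 6 + ['21-07-20'] * 6,
    'SUM':  [10, 25, 20, 25, 30, 45,
             15, 20, 30, 10, 15, 30,
             25, 35, 45, 20, 30, 40],
})

# Group on both keys at once; nsmallest(2, 'SUM') keeps the two
# smallest-SUM rows within each (REG, DATE) group
out = (df.groupby(['REG', 'DATE'], group_keys=False)
         .apply(lambda g: g.nsmallest(2, 'SUM')))
```

For 19-07-20 this keeps A1 and C1 in region 1, and A2 and B2 in region 2, matching the example in the question.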

sort_index() for a string index

My dataframe looks like this:
Method Dataset
A1 B2 10 20
B3 10 20
B1 10 20
B1 10 20
A2 B2 10 20
B1 10 20
A3 B9 10 20
B5 10 20
The Dataset index is a string. How can I sort just the second (Dataset) index level using a list like ["B1", "B2", "B3", "B4", "B5"]? I think I'm looking for sort_index() but with a custom ordering.
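One way to do this (a sketch; the frame below is reconstructed from the question, and I've appended 'B9' to the ordering list so that every value present in the index has a position) is `sort_index(key=...)`, available since pandas 1.1. The key function is applied to each MultiIndex level separately, so mapping Dataset values to their position in the list sorts that level by the custom order while Method sorts normally:

```python
import pandas as pd

# Reconstruct the question's frame
df = pd.DataFrame(
    {'foo': [10] * 8, 'bar': [20] * 8},
    index=pd.MultiIndex.from_tuples(
        [('A1', 'B2'), ('A1', 'B3'), ('A1', 'B1'), ('A1', 'B1'),
         ('A2', 'B2'), ('A2', 'B1'), ('A3', 'B9'), ('A3', 'B5')],
        names=['Method', 'Dataset']))

# Custom ordering for the Dataset level; 'B9' appended so every
# value in the index can be mapped to a position
order = ['B1', 'B2', 'B3', 'B4', 'B5', 'B9']

# Map Dataset labels to their rank in `order`; leave other levels as-is
out = df.sort_index(
    key=lambda idx: idx.map(order.index) if idx.name == 'Dataset' else idx)
```

An alternative with the same effect is converting the Dataset level to an ordered Categorical before sorting.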

Python pandas: delete the data in a data frame that the size of data is below a value

I have a data frame called df (this is just an example; the real data is big, so please consider computing speed):
name id text
tom 1 a1
lucy 2 b1
john 3 c1
tick 4 d1
tom 1 a2
lucy 2 b2
john 3 c2
tick 4 d2
tom 1 a3
lucy 2 b3
john 3 c3
tick 4 d3
tom 1 a4
tick 4 d4
tom 1 a5
lucy 2 b5
tick 4 d5
The dataframe can be grouped by name (tom, john, lucy, tick). I want to delete the data in each group (by name) whose size is less than 5. Since the lucy and john groups have fewer than 5 rows, I want to delete that data and get a new df with just the tick and tom rows.
Could you tell me how to do it, please? Thanks!
I think you can use a filter for this. It only takes one line:
import pandas as pd

df = pd.DataFrame({
    'name': ['tom', 'lucy', 'john', 'tick', 'tom', 'lucy', 'john', 'tick',
             'tom', 'lucy', 'john', 'tick', 'tom', 'tick', 'tom', 'lucy', 'tick'],
    'id':   [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 4, 1, 2, 4],
    'text': ['a1', 'b1', 'c1', 'd1', 'a2', 'b2', 'c2', 'd2', 'a3', 'b3',
             'c3', 'd3', 'a4', 'd4', 'a5', 'b5', 'd5'],
})
df.groupby('name').filter(lambda x: len(x) >= 5)
and the output is only Tick and Tom:
id name text
0 1 tom a1
3 4 tick d1
4 1 tom a2
7 4 tick d2
8 1 tom a3
11 4 tick d3
12 1 tom a4
13 4 tick d4
14 1 tom a5
16 4 tick d5
You can use value_counts() and then, if you want, reset the index with reset_index():
s = df.name.value_counts()
print(df[df.name.isin(s[s > 4].index)].reset_index(drop=True))
name id text
0 tom 1 a1
1 tick 4 d1
2 tom 1 a2
3 tick 4 d2
4 tom 1 a3
5 tick 4 d3
6 tom 1 a4
7 tick 4 d4
8 tom 1 a5
9 tick 4 d5
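Since the question asks about computing speed on big data, a vectorized alternative worth sketching is `transform('size')`, which broadcasts each group's row count back to every row so a plain boolean mask can do the filtering, avoiding a Python-level lambda per group (the data is the same example as in the first answer):

```python
import pandas as pd

# Same example data as in the first answer
df = pd.DataFrame({
    'name': ['tom', 'lucy', 'john', 'tick', 'tom', 'lucy', 'john', 'tick',
             'tom', 'lucy', 'john', 'tick', 'tom', 'tick', 'tom', 'lucy', 'tick'],
    'id':   [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 4, 1, 2, 4],
    'text': ['a1', 'b1', 'c1', 'd1', 'a2', 'b2', 'c2', 'd2', 'a3', 'b3',
             'c3', 'd3', 'a4', 'd4', 'a5', 'b5', 'd5'],
})

# transform('size') gives each row its group's size; mask keeps groups >= 5
out = df[df.groupby('name')['name'].transform('size') >= 5]
```

This produces the same tom/tick rows as the filter-based answers.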

pig order by with rank and join the rank together

I have the following data with the schema (t0:chararray, t1:int)
a0 1
a1 7
b2 9
a2 4
b0 6
And I want to order it by t1 and then append a rank:
a0 1 1
a2 4 2
b0 6 3
a1 7 4
b2 9 5
Is there any convenient way without writing UDF in pig?
There is the RANK operation in Pig. This should be sufficient:
X = rank A by t1 ASC;
Please see the Pig docs for more details.