I have this problem. I want to create a report that keeps everything in table B, but adds another column from table A (QtyRecv).
Condition: If RunningTotalQtyUsed (from table B) < QtyRecv, take that QtyRecv for the new column.
For example, for item A1, (RunningTotalQtyUsed) 55 < 100 (QtyRecv), -> ExpectedQtyRecv = 100.
But if RunningTotalQtyUsed exceeds QtyRecv, we take the next QtyRecv to cover that used quantity.
For example, 101 > 100, -> ExpectedQtyRecv = 138.
149 (RunningTotalQtyUsed) < (100 + 138) (QtyRecv) -> get 138.
250 < (100 + 138 + 121) -> get 121.
The same logic applies to item A2.
If total QtyRecv = 6 + 4 + 10 = 20, but RunningTotalQtyUsed = 31 -> result should be 99999 to notify an error that QtyRecv can't cover QtyUsed.
Table A:
Item QtyRecv
A1 100
A1 138
A1 121
A2 6
A2 4
A2 10
Table B:
Item RunningTotalQtyUsed
A1 55
A1 101
A1 149
A1 250
A2 1
A2 5
A2 9
A2 19
A2 31
Expected result:
Item RunningTotalQtyUsed ExpectedQtyRecv
A1 55 100
A1 101 138
A1 149 138
A1 250 121
A2 1 6
A2 5 6
A2 9 4
A2 19 10
A2 31 99999
What I have tried:
SELECT b.*
FROM tableB b LEFT JOIN tableA a
ON b.item = a.item
item RunningTotalQtyUsed
A1 55
A1 55
A1 55
A1 101
A1 101
A1 101
A1 149
A1 149
A1 149
A1 250
A1 250
A1 250
A2 1
A2 1
A2 1
A2 5
A2 5
A2 5
A2 9
A2 9
A2 9
A2 19
A2 19
A2 19
A2 31
A2 31
A2 31
It doesn't keep the same number of rows as table B. How can I keep table B as-is but add the ExpectedQtyRecv from table A? Thank you so much for all the help!
SELECT B_TOTAL.ITEM, B_TOTAL.SUM_RunningTotalQtyUsed, A_TOTAL.SUM_QtyRecv
FROM
(
    SELECT B.ITEM, SUM(B.RunningTotalQtyUsed) AS SUM_RunningTotalQtyUsed
    FROM TABLE_B AS B
    GROUP BY B.ITEM
) B_TOTAL
LEFT JOIN
(
    SELECT A.ITEM, SUM(A.QtyRecv) AS SUM_QtyRecv
    FROM TABLE_A AS A
    GROUP BY A.ITEM
) A_TOTAL ON B_TOTAL.ITEM = A_TOTAL.ITEM
I can't be sure, but maybe you need something like the above? (Note the outer query has to reference the subquery aliases B_TOTAL and A_TOTAL, not B and A.)
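The per-row coverage logic the question describes (find the first receipt whose cumulative QtyRecv covers each RunningTotalQtyUsed) can also be sketched in pandas with merge_asof — a sketch only, with the table and column names taken from the question:

```python
import pandas as pd

table_a = pd.DataFrame({
    "Item": ["A1", "A1", "A1", "A2", "A2", "A2"],
    "QtyRecv": [100, 138, 121, 6, 4, 10],
})
table_b = pd.DataFrame({
    "Item": ["A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2", "A2"],
    "RunningTotalQtyUsed": [55, 101, 149, 250, 1, 5, 9, 19, 31],
})

# Cumulative received quantity per item, in receipt order.
table_a["CumRecv"] = table_a.groupby("Item")["QtyRecv"].cumsum()

# For each used total, find the first receipt whose cumulative quantity
# covers it; merge_asof requires both sides sorted on the join key.
result = pd.merge_asof(
    table_b.sort_values("RunningTotalQtyUsed"),
    table_a.sort_values("CumRecv"),
    left_on="RunningTotalQtyUsed",
    right_on="CumRecv",
    by="Item",
    direction="forward",  # next CumRecv >= RunningTotalQtyUsed
)

# Rows no receipt can cover get the 99999 error marker.
result["ExpectedQtyRecv"] = result["QtyRecv"].fillna(99999).astype(int)
result = (result[["Item", "RunningTotalQtyUsed", "ExpectedQtyRecv"]]
          .sort_values(["Item", "RunningTotalQtyUsed"])
          .reset_index(drop=True))
print(result)
```

This reproduces the expected table, including ExpectedQtyRecv = 99999 for the A2 row where 31 exceeds the total received quantity of 20.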
I have a data frame as follows
REG LOC DATE SUM
1 A1 19-07-20 10
1 B1 19-07-20 25
1 C1 19-07-20 20
2 A2 19-07-20 25
2 B2 19-07-20 30
2 C3 19-07-20 45
1 A1 20-07-20 15
1 B1 20-07-20 20
1 C1 20-07-20 30
2 A2 20-07-20 10
2 B2 20-07-20 15
2 C3 20-07-20 30
1 A1 21-07-20 25
1 B1 21-07-20 35
1 C1 21-07-20 45
2 A2 21-07-20 20
2 B2 21-07-20 30
2 C3 21-07-20 40
I want to find the LOCs with the 2 smallest values of SUM for each region and date combination. For example, for date 19-07-20 and region 1, the smallest are LOC A1 and C1, and for region 2 they are A2 and B2. I am able to do it for one level with the following code, but I am not able to introduce another level:
df.groupby(level=0, group_keys=False).apply(lambda x: x.nsmallest(2, 'SUM'))
How can I do it for two levels, not just one, when I want the n smallest values for each combination?
Thanks
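A sketch of the two-level version: group on both keys at once and apply nsmallest to each group (data reconstructed from the sample above):

```python
import pandas as pd

df = pd.DataFrame({
    "REG": [1, 1, 1, 2, 2, 2] * 3,
    "LOC": ["A1", "B1", "C1", "A2", "B2", "C3"] * 3,
    "DATE": ["19-07-20"] * 6 + ["20-07-20"] * 6 + ["21-07-20"] * 6,
    "SUM": [10, 25, 20, 25, 30, 45,
            15, 20, 30, 10, 15, 30,
            25, 35, 45, 20, 30, 40],
})

# Grouping on the list ["REG", "DATE"] handles both levels at once;
# nsmallest(2, "SUM") keeps the two smallest rows of each group
# with all other columns (LOC etc.) intact.
smallest = (df.groupby(["REG", "DATE"], group_keys=False)
              .apply(lambda g: g.nsmallest(2, "SUM")))
print(smallest)
```

For date 19-07-20 this keeps A1 and C1 in region 1, and A2 and B2 in region 2, matching the expected picks.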
My dataframe looks like this:
Method Dataset
A1 B2 10 20
B3 10 20
B1 10 20
B1 10 20
A2 B2 10 20
B1 10 20
A3 B9 10 20
B5 10 20
The Dataset index is a string. How can I sort just the second (Dataset) index level using a list like ["B1", "B2", "B3", "B4", "B5"]? I think I'm looking for sort_index() but with a custom ordering.
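One way is to make the Dataset level an ordered categorical and sort on that. A sketch, with a stand-in frame shaped like the one above (the column names x and y are assumptions; any Dataset values missing from the custom list, like B9, are appended so they sort last):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("A1", "B2"), ("A1", "B3"), ("A1", "B1"), ("A1", "B1"),
     ("A2", "B2"), ("A2", "B1"), ("A3", "B9"), ("A3", "B5")],
    names=["Method", "Dataset"])
df = pd.DataFrame({"x": [10] * 8, "y": [20] * 8}, index=idx)

order = ["B1", "B2", "B3", "B4", "B5"]
# Append any Dataset values not covered by the custom list so they sort last.
cats = order + [d for d in df.index.get_level_values("Dataset").unique()
                if d not in order]

# Turn Dataset into an ordered categorical column, sort, rebuild the index.
df = (df.reset_index()
        .astype({"Dataset": pd.CategoricalDtype(cats, ordered=True)})
        .sort_values(["Method", "Dataset"])
        .set_index(["Method", "Dataset"]))
print(df)
```

sort_values respects the categorical order, so within each Method the Dataset level follows the custom list rather than plain string order.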
I have a data frame called df (this is just an example; the real data is big, so please consider computing speed):
name id text
tom 1 a1
lucy 2 b1
john 3 c1
tick 4 d1
tom 1 a2
lucy 2 b2
john 3 c2
tick 4 d2
tom 1 a3
lucy 2 b3
john 3 c3
tick 4 d3
tom 1 a4
tick 4 d4
tom 1 a5
lucy 2 b5
tick 4 d5
The dataframe can be grouped by name (tom, john, lucy, tick). I want to delete the data of every group (by name) whose size is less than 5. Since the groups for lucy and john have fewer than 5 rows, I want to drop them and get a new df containing only the tom and tick data.
Could you tell me how to do it, please? Thanks!
I think you can use a filter for this. It would only be one line:
df = pd.DataFrame({'name': ['tom','lucy','john','tick','tom','lucy','john','tick', 'tom', 'lucy','john','tick','tom','tick','tom', 'lucy','tick'], 'id':[1,2,3,4,1,2,3,4,1,2,3,4,1,4,1,2,4],'text':['a1','b1','c1','d1','a2','b2','c2','d2','a3','b3','c3','d3','a4','d4','a5','b5','d5']})
df.groupby('name').filter(lambda x: len(x) >= 5)
and the output is only Tick and Tom:
id name text
0 1 tom a1
3 4 tick d1
4 1 tom a2
7 4 tick d2
8 1 tom a3
11 4 tick d3
12 1 tom a4
13 4 tick d4
14 1 tom a5
16 4 tick d5
You can use value_counts(), then, if you want, reset the index with reset_index():
s = df.name.value_counts()
print(df[df.name.isin(s[s > 4].index)].reset_index(drop=True))
name id text
0 tom 1 a1
1 tick 4 d1
2 tom 1 a2
3 tick 4 d2
4 tom 1 a3
5 tick 4 d3
6 tom 1 a4
7 tick 4 d4
8 tom 1 a5
9 tick 4 d5
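Since the question mentions computing speed: a groupby transform avoids the Python-level lambda entirely, which tends to be faster on big frames. A sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["tom", "lucy", "john", "tick"] * 3
            + ["tom", "tick", "tom", "lucy", "tick"],
    "id": [1, 2, 3, 4] * 3 + [1, 4, 1, 2, 4],
    "text": ["a1", "b1", "c1", "d1", "a2", "b2", "c2", "d2",
             "a3", "b3", "c3", "d3", "a4", "d4", "a5", "b5", "d5"],
})

# transform("size") broadcasts each group's row count back onto every
# row, giving a boolean mask computed in vectorized code.
out = df[df.groupby("name")["name"].transform("size") >= 5]
print(out)
```

The result is the same ten tom and tick rows as the answers above.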
I have the following data with the schema (t0:chararray, t1:int)
a0 1
a1 7
b2 9
a2 4
b0 6
And I want to order it by t1 and then combine that with a rank:
a0 1 1
a2 4 2
b0 6 3
a1 7 4
b2 9 5
Is there any convenient way to do this in Pig without writing a UDF?
There is the RANK operator in Pig. This should be sufficient:
X = rank A by t1 ASC;
Please see the Pig docs for more details.