Trouble with groups and aggregation - pandas

I'm trying to use pandas to select a single result from a group of results, where some column has a minimum value. An example table representing my data frame is:
ID   q  A  B  C  D
------------------
 1  10  1  2  3  4
 1   5  5  6  7  8
 2   1  9  1  2  3
 2   2  8  7  6  5
I would like to group by ID and then select the row with the smallest q in each group; so the second row for ID=1 and the first row for ID=2 would be selected.
So far I can only get the lowest value of each column per group, which is not what I need. Thanks a lot to anybody who can offer some guidance.

This should do what you're asking:
In [10]: df.groupby('ID').apply(lambda x: x.loc[x['q'].idxmin()])
Out[10]:
    ID  q  A  B  C  D
ID
1    1  5  5  6  7  8
2    2  1  9  1  2  3
Apply a function that, for each group, returns the row sitting at the index of the minimum 'q' value.
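If you'd rather avoid apply, here is a sketch of an equivalent approach (assuming you don't care which row wins on ties): idxmin gives the index label of each group's minimal q, and those labels can be passed straight to loc:

# one row per ID: the row whose q is the group minimum
out = df.loc[df.groupby('ID')['q'].idxmin()]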

Related

Does anyone know how to randomize the rows of a pandas dataframe with some constraints?

I have a pandas dataframe with two columns, column A and column B.
What I want to do is randomize the rows of this dataframe so that no equal values in column B sit on adjacent rows.
A B
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 2
6 7 3
7 8 3
What comes to mind is that I can sample one row at a time that satisfies the constraint, i.e., sample one row whose value in column B equals 1, then sample another row whose value in column B equals 2 or 3.
However, this solution requires multiple for loops, especially when the constraints involve more columns than just B.
So, does anyone know a better solution?
It's not really randomization, as the logic is fully deterministic, but ordering the rows by their position within each group satisfies your requirement:
import numpy as np

# interleave the groups: sort rows by their position within their B group
out = df.iloc[np.argsort(df.groupby('B').cumcount().to_numpy())]
Output:
A B
0 1 1
3 4 2
6 7 3
1 2 1
4 5 2
7 8 3
2 3 1
5 6 2
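If you want actual randomness on top of this, here is a sketch (assuming the B groups are reasonably balanced): shuffle the whole frame first so each group's rows land in random slots, then sort primarily by within-group position and secondarily by B, so every "tier" lists the groups in the same order and tier boundaries can't repeat a B value:

import numpy as np

shuffled = df.sample(frac=1, random_state=0)        # random row order
tier = shuffled.groupby('B').cumcount().to_numpy()  # position within group
# np.lexsort's last key is primary: sort by tier, then by B within each tier
out = shuffled.iloc[np.lexsort((shuffled['B'].to_numpy(), tier))]

The B pattern itself stays deterministic (it has to, given the constraint), but which row of each group fills each slot is now random.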

Replace condition with mode in Pandas

I need pandas code for the following data. I need a condition to replace values: if the product name is A, the price should be replaced with the mode price of product A, in every row. At the end, the value for A is 5 in every row.
Product  Price
A        5
A        6
A        7
B        8
B        8
B        4
A        5
A        5
A        5
A        NaN
c        4
D        3
You could create a dictionary whose keys are the values from the Product column and whose values are their respective mode price, and then map it back to your dataframe based on the Product column:
df.assign(Price=df['Product'].map(
    df.groupby('Product')['Price'].agg(pd.Series.mode).to_dict()))
prints:
   Product  Price
0        A      5
1        A      5
2        A      5
3        B      8
4        B      8
5        B      8
6        A      5
7        A      5
8        A      5
9        A      5
10       c      4
11       D      3
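A shorter sketch of the same idea using transform (assuming you are happy taking the first mode when a group has several):

# broadcast each group's first mode back onto every row of the group;
# mode() ignores NaN, so the missing A price is filled in as well
df['Price'] = df.groupby('Product')['Price'].transform(lambda s: s.mode().iloc[0])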

Using groupby() and cut() in pandas

I have a dataframe, and for each group I want to label the values: if a value is less than the group mean, the label is 1, and if it is more than the group mean, the label is 2.
The input dataframe is:
  groups  num1
0      a     2
1      a     5
2      a   NaN
3      b    10
4      b     4
5      b     0
6      b     7
7      c     2
8      c     4
9      c     1
Here the mean values for groups a, b, c are 3.5, 5.25 and 2.33 respectively, and the output dataframe is:
  groups  out
0      a    1
1      a    2
2      a  NaN
3      b    2
4      b    1
5      b    1
6      b    2
7      c    1
8      c    2
9      c    1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance.
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
df['out'] = np.where(
    df['num1'].lt(df.groupby('groups')['num1'].transform('mean')),
    1, 2)
Output (as new column "out" for clarity):
groups num1 out
0 a 2 1
1 a 5 2
2 a 7 2
3 b 10 2
4 b 4 1
5 b 0 1
6 b 7 2
7 c 2 1
8 c 4 2
9 c 1 1
I really want cut
OK, but it's neither nice nor performant:
(df.groupby('groups')['num1']
.transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2]))
)
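As for skipping nulls, here is a minimal sketch, assuming the question's df: NaN compares as False against the mean, so np.where would silently give it label 2; overwrite those rows afterwards:

import numpy as np

mean = df.groupby('groups')['num1'].transform('mean')
df['out'] = np.where(df['num1'] < mean, 1.0, 2.0)  # float so NaN fits
df.loc[df['num1'].isna(), 'out'] = np.nan          # leave null rows unlabelled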

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
I want every unique value in column B to occur at least 3 times. So none of the rows with a B value of 5 is duplicated. The row with a B value of 0 is duplicated twice. And the rest have one of their two rows duplicated at random.
Here is an example of the desired output:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 4 2
11 2 3
12 2 0
13 2 0
14 4 1
Edit:
The row chosen to be duplicated should be selected at random
To pick rows at random, I would use groupby-apply with sample on each group. The x of the lambda is each group of B, so I use repeats - x.shape[0] to find the number of rows to create. Some B groups may already have 3 or more rows, so I use np.clip to force negative values to 0; sampling 0 rows is the same as ignoring the group. Finally, reset_index and concatenate back onto df.
import numpy as np
import pandas as pd

repeats = 3
df1 = (df.groupby('B')
         .apply(lambda x: x.sample(n=int(np.clip(repeats - x.shape[0], 0, None)),
                                   replace=True))
         .reset_index(drop=True))
df_final = pd.concat([df, df1]).reset_index(drop=True)
Out[43]:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 2 0
11 2 0
12 5 1
13 4 2
14 2 3
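An alternative sketch that makes the shortfall explicit (need and extras are hypothetical names): count each B value, clip the deficit at zero, and sample that many extra rows per value with replacement:

import pandas as pd

need = (3 - df['B'].value_counts()).clip(lower=0)
extras = [df[df['B'] == b].sample(n=int(k), replace=True)
          for b, k in need.items() if k > 0]
df_final = pd.concat([df, *extras], ignore_index=True)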

Calculate the total value for each group using a calculated column in Spotfire

I have a problem with a sum calculation over rows using a calculated column in Spotfire.
For example, the raw data is as below; the raw table is ordered by id, and for each type the state sequence is 2, 3, 0.
id type value state
1 1 12 2
2 1 7 3
3 1 10 0
4 2 11 2
5 2 6 3
6 3 9 0
7 3 7 2
8 3 5 3
9 2 9 0
10 1 7 2
11 1 3 3
12 1 2 0
For each cycle of states (2, 3, 0) within a type, I want to sum the values, so the result would be:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
6 3 7 2
7 3 5 3
8 3 9 0 21
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
Note: only the row whose state is 0 gets the sum value. I think the rules are easier to see when we order by type:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
6 3 7 2
7 3 5 3
8 3 9 0 21
Thanks for your time and help!
Here is a solution for you.
1. Insert a calculated column RowId() and name it RowId
2. Insert a calculated column If(Mod([RowId],3)=0,[RowId] / 3,Ceiling([RowId] / 3)) and name it Groups
3. Insert a calculated column Sum([value]) OVER ([Groups]) and name it RunningSum
4. Insert a calculated column If([state]=0,[RunningSum]) and name it OnlyState=0
The only thing that really needs explaining here is #2. With the data sorted as in your example, the last row of each group should have a RowId divisible by 3. We have to do it this way since your type field can span multiple groups. RowId 3, 6, 9, 12, etc. all have a modulus of 0 since they are divisible by 3, which marks the last row of each set; for those we just use RowId / 3, giving groups 1, 2, 3, 4, and so on. For the rows that aren't divisible by 3, we round RowId / 3 up to the next whole number, which lands them in the same group as the last row of their set.
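As a quick illustration of step #2 on the first six rows (values worked out by hand from the formula above):
RowId  Mod([RowId],3)  Groups
1      1               Ceiling(1/3) = 1
2      2               Ceiling(2/3) = 1
3      0               3/3 = 1
4      1               Ceiling(4/3) = 2
5      2               Ceiling(5/3) = 2
6      0               6/3 = 2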
The last calculated column is the only way I know to get ONLY the values you care about. If you applied the [state] = 0 logic anywhere else, you would blank out all the other rows.