I have a dataframe that I'd like to sort on columns time and b, where the sort direction for b is conditional on the value of a: if a == 1, sort b from highest to lowest, and if a == -1, sort from lowest to highest. I would normally do something like df.sort_values(by=['time', 'b']), but that always sorts b from lowest to highest.
import pandas as pd

df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})
time a b
0 0 1 4
1 3 -1 5
2 2 1 1
3 2 1 6
4 1 -1 2
Desired output:
time a b
0 0 1 4
1 1 -1 2
2 2 1 6
3 2 1 1
4 3 -1 5
Multiply a and b and use the product as the sorting key: the sign flip for a == -1 rows means a single descending sort gives b descending where a == 1 and b ascending where a == -1:
df['sort'] = df['a']*df['b']
df.sort_values(by=['time', 'sort'], ascending=[True, False]).drop('sort', axis=1)
Output:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
Alternative: negate the key so a single ascending sort works (note that (1-df['a'])*df['b'] would map every a == 1 row to the same key 0 and lose their ordering):
df['sort'] = -df['a']*df['b']
df.sort_values(by=['time', 'sort']).drop('sort', axis=1)
Pass ascending after creating an additional column for sorting:
out = df.assign(key=df.a*df.b).sort_values(['time', 'key'], ascending=[True, False]).drop(columns='key')
Out[59]:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
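Note that assign adds the key column to a copy, so df itself is left unchanged; the df['sort'] = ... approach above mutates df and leaves the helper column behind unless you also drop it from df itself.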
I would effectively like to replace a single value in a column (based on a criterion) with 3 new columns of values. For example:
Ch   Time
A    1 min
A    1 min
B    2 min
B    2 min

A1   A2   A3
1    2    3
2    3    4
1    1    2

B1   B2   B3
1    2    1
1    1    1
In the first table, wherever Ch == A I would like to pull in all rows from table 2 (so table 2's values appear twice, once per A row), and likewise table 3 for Ch == B:
ColA   ColB   ColC
1      2      3
2      3      4
1      1      2
1      2      3
2      3      4
1      1      2
How can I go about doing this efficiently? Thank you in advance for your help!
I tried replacing the value C; however, I am not sure how to insert the 3 new columns.
IIUC, you can concat and merge:
df1 = pd.DataFrame({'Ch': ['A', 'A', 'B', 'B'], 'Time': ['1 min', '1 min', '2 min', '2 min']})
df2 = pd.DataFrame({'A1': [1, 2, 1], 'A2': [2, 3, 1], 'A3': [3, 4, 2]})
df3 = pd.DataFrame({'B1': [1, 1], 'B2': [2, 1], 'B3': [1, 1]})
d = {'A': df2, 'B': df3}
df1.merge(pd.concat({k: v.set_axis(['ColA', 'ColB', 'ColC'], axis=1)
for k,v in d.items()}
).droplevel(1),
left_on='Ch', right_index=True
)
Output:
Ch Time ColA ColB ColC
0 A 1 min 1 2 3
0 A 1 min 2 3 4
0 A 1 min 1 1 2
1 A 1 min 1 2 3
1 A 1 min 2 3 4
1 A 1 min 1 1 2
2 B 2 min 1 2 1
2 B 2 min 1 1 1
3 B 2 min 1 2 1
3 B 2 min 1 1 1
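A follow-up note: the repeated 0/0/0/1... index labels come from df1, whose index the merge preserves once per matched row; chain .reset_index(drop=True) onto the result if you want a clean RangeIndex.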
I have the following table:
ID   Qty_1   Qty_2
A    1       10
A    2       0
A    3       0
B    3       29
B    2       0
B    1       0
I want to iterate within each ID, subtracting Qty_1 from Qty_2 and carrying the result forward to the next row.
The result would be:
ID   Qty_1   Qty_2
A    1       10
A    2       8
A    3       5
B    3       29
B    2       27
B    1       26
Ideally, I would also like the subtraction to start on the first row whenever a new ID appears, and only then continue down the group:
ID   Qty_1   Qty_2
A    1       9
A    2       7
A    3       4
B    3       26
B    2       24
B    1       23
Either solution is fine! Thank you!
First compute the difference between 'Qty_2' and 'Qty_1' row by row, then group by 'ID' and take the cumulative sum; this reproduces the second desired output:
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(df['Qty_1'])) \
.groupby('ID')['Qty_2'].cumsum()
print(df)
# Output:
ID Qty_1 Qty_2
0 A 1 9
1 A 2 7
2 A 3 4
3 B 3 26
4 B 2 24
5 B 1 23
Setup:
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
'Qty_1': [1, 2, 3, 3, 2, 1],
'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)
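For the first desired output, where each ID's first row keeps its original Qty_2, a minimal sketch (assuming equal IDs are contiguous, as in the example) is to zero out Qty_1 on each group's first row before taking the same cumulative sum:

# treat each ID block's first row as having Qty_1 == 0, then cumsum as above
adj = df['Qty_1'].mask(~df.duplicated('ID'), 0)
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(adj)).groupby('ID')['Qty_2'].cumsum()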
Input:
import numpy
import pandas

numbers = pandas.Series([3,5,8,1], index=["A","B","C","D"])
A 3
B 5
C 8
D 1
Expected output (pandas DataFrame):
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Current (working) solution:
pairwise_mins = pandas.DataFrame(index=numbers.index)
def calculate_mins(series, index):
to_return = numpy.minimum(series, series[index])
return to_return
for col in numbers.index:
pairwise_mins[col] = calculate_mins(numbers, col)
I suspect there must be a better, shorter, vectorized solution. Can anyone help me with it?
Use the outer method that NumPy ufuncs provide, here with numpy.minimum:
n = numbers.to_numpy()
np.minimum.outer(n, n)
array([[3, 3, 3, 1],
[3, 5, 5, 1],
[3, 5, 8, 1],
[1, 1, 1, 1]], dtype=int64)
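To get the labelled DataFrame from the expected output, you can wrap the array with the Series index; a minimal sketch:

import numpy as np
import pandas as pd

n = numbers.to_numpy()
# min(n[i], n[j]) at every (row, column) position, relabelled with the index
pd.DataFrame(np.minimum.outer(n, n), index=numbers.index, columns=numbers.index)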
This can be done by broadcasting:
pd.DataFrame(np.where(numbers.values[:, None] < numbers.values,
                      numbers.values[:, None],
                      numbers.values),
             index=numbers.index,
             columns=numbers.index)
Output:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Use np.broadcast_to and np.clip:
a = numbers.values
pd.DataFrame(np.broadcast_to(a, (a.size,a.size)).T.clip(max=a),
columns=numbers.index,
index=numbers.index)
Out[409]:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
I am trying to convert a list within multiple columns of a pandas DataFrame into separate columns.
Say, I have a dataframe like this:
0 1
0 [1, 2, 3] [4, 5, 6]
1 [1, 2, 3] [4, 5, 6]
2 [1, 2, 3] [4, 5, 6]
And would like to convert it to something like this:
0 1 2 0 1 2
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
I have managed to do this in a loop. However, I would like to do this in fewer lines.
My code snippet so far is as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])
output1 = df[0].apply(pd.Series)
output2 = df[1].apply(pd.Series)
output = pd.concat([output1, output2], axis=1)
If you don't care about the column names you could do:
>>> df.apply(np.hstack, axis=1).apply(pd.Series)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
Using sum to concatenate the lists along each row:
pd.DataFrame(df.sum(axis=1).tolist())
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
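If you want to keep the original duplicated column labels (0 1 2 0 1 2, as in the desired output), a sketch that generalizes the loop from the question to any number of list columns:

# expand each list-valued column into its own frame, then glue them side by side
out = pd.concat([pd.DataFrame(df[c].tolist()) for c in df.columns], axis=1)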
I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').count()
When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').
list(freq)
Out[117]: ['color', 'size']
When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:
freq
Out[114]:
color size
id
1 1 1
2 3 3
3 1 1
4 2 2
I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?
count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a column, use the as_index=False parameter or reset_index():
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 1
1 2 3 3
2 3 1 1
3 4 2 2
So if a column contains NaN values, its counts will differ:
import numpy as np

d = {'id': [1, 2, 3, 4, 2, 2, 4],
'color': ["r","r","b","b","g","g","r"],
'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 0
1 2 3 3
2 3 1 1
3 4 2 2
You can also specify a single column to count:
freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
id color
0 1 1
1 2 3
2 3 1
3 4 2
If you need the counts including NaN values, use size:
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
Thanks Bharath for pointing out another solution with value_counts; the differences are explained here:
freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
id freq
0 2 3
1 4 2
2 3 1
3 1 1
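Since the goal was to filter by frequency, a transform-based sketch (the threshold 2 here is just an example) skips the intermediate freq frame entirely by broadcasting each group's size back onto the original rows:

# keep only rows whose id occurs at least twice
df[df.groupby('id')['id'].transform('size') >= 2]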