Concatenate rows in pandas with conditions and calculations - pandas

If I have a dataframe:
import pandas as pd
myData = {'start': [1, 2, 3, 4, 5],
          'end': [2, 3, 5, 7, 6],
          'number': [1, 2, 7, 9, 7]}
df = pd.DataFrame(myData, columns=['start', 'end', 'number'])
df
And I need to do something like:
result = {'start': [1, 4, 5],
'end': [7,7,6],
'number': [10,9, 7]
}
df = pd.DataFrame(result, columns=['start', 'end', 'number'])
df
The rule: if the difference is less than 1, set start = start (previous row) and end = end (current row), then delete the previous rows.
That is, rows should be merged when the difference between the end of the first row and the start of the second is less than 1: keep the earlier start, sum number, and delete the first row.
Can I do it without iteration?

You can use:
# identify when end - previous_start > 2
# and create a new group
group = df['end'].sub(df['start'].shift()).gt(2).cumsum()
# aggregate
out = df.groupby(group).agg({'start': 'first', 'end': 'last', 'number': 'sum'})
Output:
   start  end  number
0      1    3       3
1      3    5       7
2      4    6      16
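To see how the grouping key in the answer above is built, here is a minimal self-contained sketch that prints the intermediate steps. The names gap and group are only illustrative, and the threshold of 2 is taken from the answer as written; a different merge rule would need a different comparison.

```python
import pandas as pd

df = pd.DataFrame({'start': [1, 2, 3, 4, 5],
                   'end': [2, 3, 5, 7, 6],
                   'number': [1, 2, 7, 9, 7]})

# end of the current row minus start of the previous row (NaN for the first row)
gap = df['end'].sub(df['start'].shift())

# a new group begins wherever the gap exceeds 2; NaN compares as False,
# so the first row always lands in group 0
group = gap.gt(2).cumsum().rename('group')

# one output row per group: first start, last end, summed number
out = df.groupby(group).agg({'start': 'first', 'end': 'last', 'number': 'sum'})

print(gap.tolist())
print(group.tolist())
print(out)
```

Renaming the key to 'group' keeps the resulting index from sharing a name with the 'end' column, which matters if reset_index() is called afterwards.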

Related

Calculate the difference between all rows and a specific row in the dataframe

This is a similar question to this thread.
Let's consider df as:
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
How can you calculate the difference between every row and the row at the Nth index within its group (here, the lowest index of EACH group) for column "B", and put it in column "D"? I want to calculate the mean square displacement for my data, so I need the difference between each value in a column and the value in the first row that appears in its group.
I tried:
df['D'] = df.groupby(["A"])['B'].sub(df.groupby(['A'])["B"].iloc[0])
Group = df.groupby(["A"])
However, calling .sub on a groupby object raises the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
The desired result would look like this:
   A  B  C   D
0  a  2  3   0   *lowest index in group "a"
1  b  5  6   0   *lowest index in group "b"
2  c  8  9   0   *lowest index in group "c"
3  a  0  0  -2
4  a  8  7   6
5  c  2  1  -6
I guess this answer could be enough of a hint for you:
import pandas as pd
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9], ["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns=["A", "B", "C"])
print("df:")
print(df)
print()
# group by the column name (not a one-element list) so the group keys are plain scalars
groupA = df.groupby('A')
print("groupA:")
print(groupA.groups)
print()
print("lowest indices for each group from columnA:")
lowest_indices = dict()
for k, v in groupA.groups.items():
    # v holds the index labels of the group; the first one is the lowest
    lowest_indices[k] = v[0]
print(lowest_indices)
print()
columnB = df['B']
print("columnB:")
print(columnB)
print()
df['D'] = df['B']
# this loop assumes the default RangeIndex, where labels and positions coincide
for i in range(len(df)):
    group_at_i = df['A'].iloc[i]
    lowest_index_of_that = lowest_indices[group_at_i]
    b_element_at_that_index = df['B'].iloc[lowest_index_of_that]
    the_difference = df['B'].iloc[i] - b_element_at_that_index
    df.loc[i, 'D'] = the_difference
print("df:")
print(df)
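The same result can probably be reached without an explicit loop. The sketch below is an alternative to the code above, not a restatement of it: it relies on GroupBy.transform('first'), which broadcasts the first value of each group back onto every row of that group, so it covers only the "first row in each group" case.

```python
import pandas as pd

df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],
                   ["a", 0, 0], ["a", 8, 7], ["c", 2, 1]],
                  columns=["A", "B", "C"])

# first value of B within each group of A, aligned to the original rows
first_b = df.groupby("A")["B"].transform("first")

# difference between each row and the first row of its group
df["D"] = df["B"] - first_b

print(df)
```

Because transform returns a Series aligned with the original index, the subtraction works for any index, not only the default RangeIndex.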

how to change value based on criteria pandas

I have the following problem. I have this df:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
I would like a new column that is 1 for every row of an id if any row with that id has value 1. See the desired output:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1], 'newvalue': [1, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
How can I do it please?
To set 0/1 by a condition (here, at least one 1 per id), use GroupBy.transform with GroupBy.any to build a boolean mask, then cast True/False to 1/0 with astype(int):
df['newvalue'] = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
Alternative:
df['newvalue'] = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
Or, if the column can only contain 0 and 1, simplify the solution by taking the maximum value per group:
df['newvalue'] = df.groupby('id')['value'].transform('max')
print (df)
   id  value  newvalue
0   1      0         1
1   1      1         1
2   2      0         0
3   2      0         0
4   3      1         1
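As a quick check, the three variants above can be run side by side. The second frame, df2, is a made-up example (not from the question) that shows why the max-based shortcut is only safe when the column holds nothing but 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]})

# variant 1: boolean mask per group, then cast to int
by_any = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
# variant 2: ids that have at least one row with value 1
by_isin = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
# variant 3: maximum per group (valid only for 0/1 data)
by_max = df.groupby('id')['value'].transform('max')

print(by_any.tolist(), by_isin.tolist(), by_max.tolist())

# hypothetical data with a value other than 0/1: the variants now disagree
df2 = pd.DataFrame({'id': [1, 1], 'value': [0, 2]})
any2 = df2['value'].eq(1).groupby(df2['id']).transform('any').astype(int)
max2 = df2.groupby('id')['value'].transform('max')
print(any2.tolist(), max2.tolist())
```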

pandas row wise comparison and apply condition

This is my dataframe:
df = pd.DataFrame(
{
"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
"score": [3, 5, 6, 2, 4, 1],
}
)
I want to compare the score of bob_x with bob_y and retain the row with the lower score, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
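The approach above overwrites the name column, so the suffix of the winning row (for example bob_y) is lost. If the original rows should be retained, one possible variant is to group on the base name without modifying the column and select rows with idxmin. The variable names base and keep are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
                   "score": [3, 5, 6, 2, 4, 1]})

# part of the name before the first underscore, kept separate from df
base = df["name"].str.split("_").str[0]

# index label of the lowest score within each base-name group
keep = df.groupby(base)["score"].idxmin()

# select those rows and restore the original row order
result = df.loc[keep].sort_index()

print(result)
```

On ties, idxmin keeps the first occurrence, so the earlier row wins when two scores are equal.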

pandas: most elegant way to pivot table on pattern in name of columns

Given the following DataFrame:
pd.DataFrame({
'x': [0, 1],
'y': [0, 1],
'a_idx': [0, 1],
'a_val': [2, 3],
'b_idx': [4, 5],
'b_val': [6, 7],
})
What is the cleanest way to pivot the DataFrame based on the prefix of the idx and val columns if you have an indeterminate amount of unique prefixes (a, b, ... n), so as to obtain the following DataFrame?
pd.DataFrame({
'x': [0, 1, 0, 1],
'y': [0, 1, 0, 1],
'key': ['a','a','b','b'],
'idx': [0, 1, 4, 5],
'val': [2, 3, 6, 7]
})
I am not very knowledgeable in pandas, so my easiest solution was to go earlier in the data generation process and generate a subset of the result DataFrame for each prefix in SQL, and then concat the result sets into a final DataFrame. I'm curious however if there is a simple way to do this using the API of pandas.DataFrame. Is there such a thing?
Let's try wide_to_long with extras:
(pd.wide_to_long(df, stubnames=['a', 'b'],
                 i=['x', 'y'],
                 j='key',
                 sep='_',
                 suffix='\\w+')
   .unstack('key').stack(level=0).reset_index()
)
Or manually with melt:
out = df.melt(['x', 'y'])
out = (out.join(out['variable'].str.split('_', expand=True))
          .rename(columns={0: 'key'})
          .pivot_table(index=['x', 'y', 'key'], columns=[1], values='value')
          .reset_index()
      )
Output:
key  x  y level_2  idx  val
0    0  0       a    0    2
1    0  0       b    4    6
2    1  1       a    1    3
3    1  1       b    5    7
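If the exact column names and row order from the question are wanted, a plainer alternative is to build one small frame per prefix and concatenate them. This is only a sketch under two assumptions: x and y are the identifier columns, and every prefix has both an _idx and a _val column.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1],
    'y': [0, 1],
    'a_idx': [0, 1],
    'a_val': [2, 3],
    'b_idx': [4, 5],
    'b_val': [6, 7],
})

# discover the prefixes from the *_idx columns (a, b, ... n)
prefixes = sorted(c[:-len('_idx')] for c in df.columns if c.endswith('_idx'))

# one block per prefix: identifier columns plus key, idx and val
parts = [
    df[['x', 'y']].assign(key=p, idx=df[f'{p}_idx'], val=df[f'{p}_val'])
    for p in prefixes
]

out = pd.concat(parts, ignore_index=True)
print(out)
```

This mirrors the per-prefix concat described in the question, only done in pandas instead of SQL.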

Sort a dictionary in a column in pandas

I have a dataframe as shown below.
user_id Recommended_modules Remaining_modules
1 {A:[5,11], B:[4]} {A:2, B:1}
2 {A:[8,4,2], B:[5], C:[6,8]} {A:7, B:1, C:2}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[8,4,2], B:[5,1,2], C:[6]} {A:3, B:4, C:1}
Brief about the dataframe:
In the column Recommended_modules A, B and C are courses and the numbers inside the list are modules.
Key(Remaining_modules) = Course name
value(Remaining_modules) = Number of modules remaining in that course
From the above I would like to reorder the recommended_modules column based on the values in the Remaining_modules as shown below.
Expected Output:
user_id Ordered_Recommended_modules Ordered_Remaining_modules
1 {B:[4], A:[5,11]} {B:1, A:2}
2 {B:[5], C:[6,8], A:[8,4,2]} {B:1, C:2, A:7}
3 {B:[8], A:[2,3,9]} {B:1, A:5}
4 {C:[6], A:[8,4,2], B:[5,1,2]} {C:1, A:3, B:4}
Explanation:
For user_id = 2, Remaining_modules = {A:7, B:1, C:2}, sort like this {B:1, C:2, A:7}
similarly arrange Recommended_modules also in the same order as shown below
{B:[5], C:[6,8], A:[8,4,2]}.
It is possible, but it requires Python 3.6+, where dicts preserve insertion order (guaranteed by the language from 3.7):
def f(x):
    # https://stackoverflow.com/a/613218/2901002
    d1 = {k: v for k, v in sorted(x['Remaining_modules'].items(), key=lambda item: item[1])}
    L = d1.keys()
    # https://stackoverflow.com/a/21773891/2901002
    d2 = {key: x['Recommended_modules'][key] for key in L if key in x['Recommended_modules']}
    x['Remaining_modules'] = d1
    x['Recommended_modules'] = d2
    return x
df = df.apply(f, axis=1)
print (df)
user_id Recommended_modules \
0 1 {'B': [4], 'A': [5, 11]}
1 2 {'B': [5], 'C': [6, 8], 'A': [8, 4, 2]}
2 3 {'B': [8], 'A': [2, 3, 9]}
3 4 {'C': [6], 'A': [8, 4, 2], 'B': [5, 1, 2]}
Remaining_modules
0 {'B': 1, 'A': 2}
1 {'B': 1, 'C': 2, 'A': 7}
2 {'B': 1, 'A': 5}
3 {'C': 1, 'A': 3, 'B': 4}
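For a version that can be run as-is, here is a small sketch that builds the first two rows of the question's frame with real Python dicts and writes the reordered dicts to new columns. The helper name reorder is hypothetical, and the sketch assumes the column values are dict objects rather than strings.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2],
    'Recommended_modules': [{'A': [5, 11], 'B': [4]},
                            {'A': [8, 4, 2], 'B': [5], 'C': [6, 8]}],
    'Remaining_modules': [{'A': 2, 'B': 1},
                          {'A': 7, 'B': 1, 'C': 2}],
})

def reorder(recommended, remaining):
    # course names sorted by remaining modules, fewest first (stable on ties)
    order = sorted(remaining, key=remaining.get)
    ordered_rec = {k: recommended[k] for k in order if k in recommended}
    ordered_rem = {k: remaining[k] for k in order}
    return ordered_rec, ordered_rem

pairs = [reorder(rec, rem)
         for rec, rem in zip(df['Recommended_modules'], df['Remaining_modules'])]

df['Ordered_Recommended_modules'] = [p[0] for p in pairs]
df['Ordered_Remaining_modules'] = [p[1] for p in pairs]

print(df[['user_id', 'Ordered_Recommended_modules', 'Ordered_Remaining_modules']])
```

Iterating over the two columns with zip avoids a row-wise apply and leaves the original columns untouched.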