Apply function on a two dataframe rows

Given a pandas dataframe like this:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
   col1  col2
0     1     4
1     2     5
2     3     6
I would like to do something equivalent to the following using a function, but without passing the whole dataframe "by value" or as a global variable (it could be huge, which would cause a memory error):
i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
    c1 = df.loc[i, 'col1'] + df.loc[index, 'col1']
    c2 = df.loc[i, 'col2'] + df.loc[index, 'col2']
    df.loc[index, 'col1'] = c1
    df.loc[index, 'col2'] = c2
    i = index
   col1  col2
0     1     4
1     3     9
2     6    15
That is, I would like a function that produces the output above:
def my_function(two_rows):
    row1 = two_rows[0]
    row2 = two_rows[1]
    c1 = row1[0] + row2[0]
    c2 = row1[1] + row2[1]
    row2[0] = c1
    row2[1] = c2
    return row2
df.apply(my_function, axis=1)
df
   col1  col2
0     1     4
1     3     9
2     6    15
Is there a way of doing this?

What you've demonstrated is a cumulative sum (cumsum):
df.cumsum()
   col1  col2
0     1     4
1     3     9
2     6    15
To define a function that does this in place with a loop:
Slow, cell by cell
def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            df[j].values[i] += df[j].values[i - 1]
    return df
f(df)
   col1  col2
0     1     4
1     3     9
2     6    15
Compromise between memory and efficiency
def f(df):
    for j in df.columns:
        df[j].values[:] = df[j].values.cumsum()
    return df
f(df)
   col1  col2
0     1     4
1     3     9
2     6    15
Note that you don't need to return df. I chose to for convenience.
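If writing through .values is not possible in your pandas version (for example, when the underlying array is read-only), a column-by-column assignment is one possible alternative. This is only a sketch: the function name cumsum_columns is made up for illustration, and it still allocates one temporary column at a time rather than a copy of the whole dataframe.

```python
import pandas as pd

def cumsum_columns(df):
    # Hypothetical helper: replace each column with its running total.
    # Only one column-sized temporary exists at any moment.
    for col in df.columns:
        df[col] = df[col].cumsum()
    return df

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
result = cumsum_columns(df)
print(result)
```

As with f above, the dataframe passed in is modified, so the return value is only a convenience.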

Related

How to create a new column based on row values in python?

I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like the one below:
Basically, if 25%_coll is 1, then a new column based on the month should be created.
Please help. Thank you.

This should work (do ask if something doesn't make sense):
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]

To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with index columns month and collection_amount.
Set the appropriate collection_amount values to 0.
Build the new column names in column new_cols.
   month  collection_amount  variable  value          new_cols
0      4                100  25%_coll      1  month_4_25%_coll
1      5                  0  25%_coll      0  month_5_25%_coll
2      6                300  25%_coll      1  month_6_25%_coll
3      4                  0  75%_coll      0  month_4_75%_coll
4      5                200  75%_coll      1  month_5_75%_coll
5      6                300  75%_coll      1  month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: either use df = pd.concat([df, df2], axis=1), or use df.merge(df2, ...) to merge on month (in that case, call .reset_index() without drop=True).
Result for the sample dataframe
df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})
is
new_cols  month_4_25%_coll  month_4_75%_coll  month_5_25%_coll  \
0                      100                 0                 0
1                        0                 0                 0
2                        0                 0                 0
new_cols  month_5_75%_coll  month_6_25%_coll  month_6_75%_coll
0                        0                 0                 0
1                      200                 0                 0
2                        0               300               300
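To attach the new columns to the original dataframe, the merge variant mentioned above might look like the following. This is a sketch under the assumption that each month appears only once in df; the names wide and combined are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1],
    "75%_coll": [0, 1, 1],
    "month": [4, 5, 6],
})

# Long format: one row per (month, flag column) pair.
df2 = df.melt(id_vars=["month", "collection_amount"])
# Where the flag is 0, the amount should not be carried over.
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]

# Keep month as a regular column (reset_index without drop=True)
# so that it can serve as the merge key.
wide = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum",
).reset_index()

# Assumption: months are unique in df, so the merge is one-to-one.
combined = df.merge(wide, on="month")
print(combined)
```

Merging on month avoids relying on row positions lining up, which pd.concat with axis=1 would require.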

How can I aggregate strings from many cells into one cell?

Say I have two classes with a handful of students each, and I want to think of the possible pairings in each class. In my original data, I have one line per student.
What's the easiest way in pandas to turn this dataset
   Class Students
0      1        A
1      1        B
2      1        C
3      1        D
4      1        E
5      2        F
6      2        G
7      2        H
into this?
    Class Students
0       1      A,B
1       1      A,C
2       1      A,D
3       1      A,E
4       1      B,C
5       1      B,D
6       1      B,E
7       1      C,D
8       1      C,E
9       1      D,E
10      2      F,G
11      2      F,H
12      2      G,H

Try this:
import itertools
import pandas as pd
cla = [1, 1, 1, 1, 1, 2, 2, 2]
s = ["A", "B", "C", "D", "E", "F", "G", "H"]
df = pd.DataFrame(cla, columns=["Class"])
df['Student'] = s
# Build "X,Y" strings for every pair of students in the list
def create_combos(list_students):
    combos = itertools.combinations(list_students, 2)
    str_students = []
    for i in combos:
        str_students.append(str(i[0]) + "," + str(i[1]))
    return str_students
# Return the class id repeated once per pair, plus the pairs themselves
def iterate_df(class_id):
    df_temp = df.loc[df['Class'] == class_id]
    list_student = list(df_temp['Student'])
    list_combos = create_combos(list_student)
    list_id = [class_id for i in list_combos]
    return list_id, list_combos
# sorted() keeps the classes in a predictable order
list_classes = sorted(set(df['Class']))
new_id = []
new_combos = []
for idx in list_classes:
    tmp_id, tmp_combo = iterate_df(idx)
    new_id += tmp_id
    new_combos += tmp_combo
new_df = pd.DataFrame(new_id, columns=["Class"])
new_df["Student"] = new_combos
print(new_df)
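The same idea can probably be written more compactly with groupby. The following is a sketch rather than a tested drop-in replacement; it uses the column name Students from the question, and the names rows and pairs are chosen here for illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "Class": [1, 1, 1, 1, 1, 2, 2, 2],
    "Students": ["A", "B", "C", "D", "E", "F", "G", "H"],
})

# For each class, emit one (class, "X,Y") row per pair of students.
rows = [
    (class_id, ",".join(pair))
    for class_id, group in df.groupby("Class")
    for pair in combinations(group["Students"], 2)
]
pairs = pd.DataFrame(rows, columns=["Class", "Students"])
print(pairs)
```

Because groupby sorts its keys by default, the classes come out in ascending order, and combinations keeps the students in their original order within each class.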

How to split dict in dataframe to many columns

I'm using a pandas dataframe. How can I split a list of dicts into multiple columns?
This is for a junior data processor. I've already tried many ways.
import pandas as pd
l = [{'a':1,'b':2},{'a':3,'b':4}]
data = [{'key1':'x','key2':'y','value':l}]
df = pd.DataFrame(data)
data1 = {'key1':['x','x'],'key2':['y','y'],'a':[1,3],'b':[2,4]}
df1 = pd.DataFrame(data1)
df1 is what I need.

Comprehension
d1 = df.drop('value', axis=1)
co = d1.columns
d2 = df.value
pd.DataFrame([
    {**dict(zip(co, tup)), **d}
    for tup, D in zip(zip(*map(d1.get, d1)), d2)
    for d in D
])
  key1 key2  a  b
0    x    y  1  2
1    x    y  3  4
Explode
See post on explode
This is a tad different but close
import numpy as np
idx = df.index.repeat(df.value.str.len())
val = np.concatenate(df.value).tolist()
d0 = pd.DataFrame(val)
df.drop('value', axis=1).loc[idx].reset_index(drop=True).join(d0)
  key1 key2  a  b
0    x    y  1  2
1    x    y  3  4
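In pandas versions that provide DataFrame.explode, the second approach can likely be shortened. The sketch below assumes every entry in the value column is a non-empty list of dicts; the names exploded, expanded and result are chosen here for illustration.

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame([{'key1': 'x', 'key2': 'y', 'value': l}])

# One row per dict; the key columns are repeated automatically.
exploded = df.explode('value').reset_index(drop=True)
# Turn the dicts into their own columns.
expanded = pd.DataFrame(exploded['value'].tolist())
result = exploded.drop(columns='value').join(expanded)
print(result)
```

Resetting the index before the join matters here, because explode repeats the original index labels for every element of a list.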

How to do the formulas without splitting the dataframe which had different conditions

I have the following dataframe
import pandas as pd
d = {
    'ID': [1, 2, 3, 4, 5],
    'Price1': [5, 9, 4, 3, 9],
    'Price2': [9, 10, 13, 14, 18],
    'Type': ['A', 'A', 'B', 'C', 'D'],
}
df = pd.DataFrame(data=d)
df
To apply a formula without a condition, I use the following code
df = df.eval(
    'Price = (Price1*Price1)/2'
)
df
How can I apply formulas that differ by condition without splitting the dataframe?
I need a new column called Price_on_type.
The formula differs for each type:
For type A, Price_on_type = Price1+Price2
For type B, Price_on_type = (Price1+Price2)/2
For type C, Price_on_type = Price1
For type D, Price_on_type = Price2
Expected Output:
import pandas as pd
d = {
    'ID': [1, 2, 3, 4, 5],
    'Price1': [5, 9, 4, 3, 9],
    'Price2': [9, 10, 13, 14, 18],
    'Price': [12.5, 40.5, 8.0, 4.5, 40.5],
    'Price_on_type': [14, 19, 8.5, 3, 18],
}
df = pd.DataFrame(data = d)
df

You can use numpy.select:
import numpy as np
masks = [df['Type'] == 'A',
         df['Type'] == 'B',
         df['Type'] == 'C',
         df['Type'] == 'D']
vals = [df.eval('Price1 + Price2'),
        df.eval('(Price1 + Price2) / 2'),
        df['Price1'],
        df['Price2']]
Or:
vals = [df['Price1'] + df['Price2'],
        (df['Price1'] + df['Price2']) / 2,
        df['Price1'],
        df['Price2']]
df['Price_on_type'] = np.select(masks, vals)
print(df)
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
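For reference, here is a self-contained version of the numpy.select approach. It is a sketch that adds one assumption not in the answer above: default=np.nan, so that a Type outside A to D would yield NaN instead of numpy.select's default of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Price1": [5, 9, 4, 3, 9],
    "Price2": [9, 10, 13, 14, 18],
    "Type": ["A", "A", "B", "C", "D"],
})

total = df["Price1"] + df["Price2"]
# One boolean mask per type, in the same order as the value list.
masks = [df["Type"] == t for t in ("A", "B", "C", "D")]
vals = [total, total / 2, df["Price1"], df["Price2"]]

# Assumption: rows with an unknown Type should become NaN rather than 0.
df["Price_on_type"] = np.select(masks, vals, default=np.nan)
print(df)
```

All four formulas are evaluated for every row and the masks then pick one result per row, so no row-by-row Python loop is involved.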

If your data is not too big, you can use apply with a custom function on axis=1:
def Prices(x):
    dict_sw = {
        'A': x.Price1 + x.Price2,
        'B': (x.Price1 + x.Price2)/2,
        'C': x.Price1,
        'D': x.Price2,
    }
    return dict_sw[x.Type]
In [239]: df['Price_on_type'] = df.apply(Prices, axis=1)
In [240]: df
Out[240]:
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0
Or use the trick that True converts to 1 and False to 0:
df['Price_on_type'] = \
    (df.Type == 'A') * (df.Price1 + df.Price2) + \
    (df.Type == 'B') * (df.Price1 + df.Price2)/2 + \
    (df.Type == 'C') * df.Price1 + \
    (df.Type == 'D') * df.Price2
Out[308]:
   ID  Price1  Price2 Type  Price_on_type
0   1       5       9    A           14.0
1   2       9      10    A           19.0
2   3       4      13    B            8.5
3   4       3      14    C            3.0
4   5       9      18    D           18.0

Pandas create row number - but not as an index

I want to create a row number series - but not override my date index.
I can do it with a loop but I think there must be an easier way?
_cnt = []
for i in range(len(df)):
    _cnt.append(i)
df['row'] = _cnt
Thanks.

Probably the easiest way:
df['row'] = range(len(df))
>>> df
          0         1
0  0.444965  0.993382
1  0.001578  0.174628
2  0.663239  0.072992
3  0.664612  0.291361
4  0.486449  0.528354
>>> df['row'] = range(len(df))
>>> df
          0         1  row
0  0.444965  0.993382    0
1  0.001578  0.174628    1
2  0.663239  0.072992    2
3  0.664612  0.291361    3
4  0.486449  0.528354    4
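Since the question mentions a date index, here is a small sketch showing that the index is left untouched. The dates and the value column are made up for illustration, and np.arange is used as an equivalent of range.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a date index, as described in the question.
dates = pd.date_range("2020-01-01", periods=5, freq="D")
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=dates)

# The new column is assigned by position; the date index is not replaced.
df["row"] = np.arange(len(df))
print(df)
```

The row numbers reflect the current row order, so sort the dataframe first if the numbering should follow the dates.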
