Function creation with a pandas DataFrame?

I have the following pandas DataFrame, named table:
   grh        pm_0  age_0
0    1    39054414     74
1    2    34054409     37
2    3  3715955000     65
3    4    19373605     53
4    5       99411     64
5    6    25664143     37
6    7     5161112     77
7    8    41517547     80
8    9  9517054000     72
9   10   538129400     52
I have a loop iterating over it like this:
df2 = df.copy()
for k in range(1, 3):
    for i in range(1, 5):
        df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
    df2 = df2.append(df)
print(df2.head(15))
It works, but I would like to encapsulate it in a function.
I tried something like this, but it doesn't work.
I think I did something wrong.
def sto(scn):
    df4 = df.copy()
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        df4 = df4.append(df)

sto(3)
print(df4)
Traceback (most recent call last):
  File "<stdin>", line 11, in <module>
    print(df4)
NameError: name 'df4' is not defined
Any idea ?

You just need to explicitly declare it as a global object:
df4 = df.copy()

def sto(scn):
    global df4
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        df4 = df4.append(df)
    return df4

sto(3)
Alternatively, you can pass the dataframe to the function:
def sto(scn, df):
    dfx = df.copy()
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        dfx = dfx.append(df)
    return dfx

df4 = sto(3, df)
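One caveat for readers on recent pandas: DataFrame.append was deprecated and then removed in pandas 2.0, so the answers above will fail there. A minimal sketch of the same function using pd.concat instead (the two-row df here is a stand-in for the question's table):

```python
import pandas as pd

df = pd.DataFrame({'grh': [1, 2], 'pm_0': [39054414, 34054409], 'age_0': [74, 37]})

def sto(scn, df):
    # collect one copy per iteration of k and concatenate once at the end;
    # note this still mutates df in place, just like the original loop
    pieces = [df.copy()]
    for k in range(1, scn):
        for i in range(1, 5):
            df["pm_" + str(i)] = df["pm_" + str(i - 1)] / k
        pieces.append(df.copy())
    return pd.concat(pieces, ignore_index=True)

df4 = sto(3, df)
```

Concatenating once at the end is also faster than appending row by row, since each append/concat copies the whole frame.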

You need to pass the df to the function and return it at the end:
def sto(scn, df):
    .....
    .....
    return df

Related

How to return an instance of a dataframe in a loop?

I have several files that I want to use to create unique DataFrames in Python. I created a class that takes each file and generates the DataFrame, but I cannot return the DataFrame output (though I can print it).
For example, this structure works for me:
class myclass:
    def __init__(self, x):
        self.x = x
        df = pd.DataFrame({'A': [2, 3, 4]})
        self.output = df

y = myclass(x=1)
y.output
but this version does not work:
import pandas as pd

class myclass:
    def __init__(self, x):
        self.x = x
        df = pd.DataFrame({'A': [2, 3, 4]})
        self.output = df

for n in range(0, 5):
    y = myclass(x=n)
    y.output
So I tried to dynamically create and assign variables during the loop, but it's not clear to me what's wrong with this:
import pandas as pd

class myclass:
    def __init__(self, x):
        self.x = x
        df = pd.DataFrame({'A': [2, 3, 4]})
        self.output = df

i = 0
for n in range(0, 5):
    i += 1
    var = 'var' + str(i)
    var = myclass(x=n)
    var.output
    print(var)
You need to print var.output:
for n in range(0, 5):
    var = myclass(x=f'var{n+1}')
    print(var.output)
Output:
A
0 2
1 3
2 4
A
0 2
1 3
2 4
A
0 2
1 3
2 4
A
0 2
1 3
2 4
A
0 2
1 3
2 4
if you want to be able to index then use a dictionary:
dfs = {}
for n in range(0, 5):
    dfs[f'var{n+1}'] = myclass(x=f'var{n+1}').output
dfs['var3']
Output:
A
0 2
1 3
2 4

Remove identical values and keep only the different ones

I would like to know if there is a better solution for keeping only the differing values (so they are easy to spot) and removing identical values in some columns.
merged = pd.merge(us_df, gb_df, how='outer', indicator=True)
res = pd.merge(merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1),
               merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1),
               on=us_df.columns.tolist()[0:col_range],
               how='outer',
               suffixes=('_US', '_GB')).fillna(' ')
cols = [col for col in res.columns.tolist() if '_US' in col or '_GB' in col]
sorted_cols = [col for col in res.columns.tolist() if '_US' not in col and '_GB' not in col] + sorted(cols)
I get this table (res):

Id  ages_GB  ages_US  salary_GB  salary_US
 6       45       45         34         67
43       12       11         65         65
So far, I used this iteration:
cols = ['ages_US', 'salary_US', 'ages_GB', 'salary_GB']
for i, row in res.iterrows():
    for us, gb in zip(cols[:len(cols) // 2], cols[len(cols) // 2:]):
        if row[us] == row[gb]:
            res.at[i, us] = res.at[i, gb] = ' '
to get the result (where identical values under the columns in cols are replaced with " ", a space):

Id  ages_GB  ages_US  salary_GB  salary_US
 6                           34         67
43       12       11
Is there another method to get a similar result?
Given your example, I think loc offers a simpler solution, assuming you want to compare two sets of columns.
First, I will recreate a reproducible example of your dataset (I would recommend including one in future questions, as it makes it easier to understand and answer your question; see How to create a Minimal, Reproducible Example).
d = {
    'ages_GB': [45, 12],
    'ages_US': [45, 11],
    'salary_GB': [34, 65],
    'salary_US': [67, 65]
}
df = pd.DataFrame(data=d)
print(df)
Initial DataFrame:

   ages_GB  ages_US  salary_GB  salary_US
0       45       45         34         67
1       12       11         65         65
The simplest solution I can think of is to use loc to just reassign records to "" or NaN where ages_GB == ages_US & salary_GB == salary_US.
df.loc[df.ages_GB == df.ages_US, ['ages_GB', 'ages_US']] = ["", ""]
df.loc[df.salary_GB == df.salary_US, ['salary_GB', 'salary_US']] = ["", ""]
Output

   ages_GB  ages_US  salary_GB  salary_US
0                           34         67
1       12       11
For a generic method, you can groupby on axis=1 using the columns prefixes, and get the duplicated values to use with mask:
prefix = df.columns.str.extract('^([^_]+)', expand=False)
# ['Id', 'ages', 'ages', 'salary', 'salary']
m = df.groupby(prefix, axis=1).transform(lambda s: s.duplicated(keep=False))
out = df.mask(m, '')
Output:

   Id  ages_GB  ages_US  salary_GB  salary_US
0   6                           34         67
1  43       12       11

Intermediate m:

      Id  ages_GB  ages_US  salary_GB  salary_US
0  False     True     True      False      False
1  False    False    False       True       True
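Note that groupby(..., axis=1) is deprecated in recent pandas. Assuming the same toy frame, the same idea can be expressed by transposing first, so that the prefixes group rows instead of columns (a sketch of an equivalent, not the answerer's original code):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [6, 43],
    'ages_GB': [45, 12], 'ages_US': [45, 11],
    'salary_GB': [34, 65], 'salary_US': [67, 65],
})

prefix = df.columns.str.extract('^([^_]+)', expand=False)
# group the transposed rows by column prefix and flag values that are
# duplicated within each GB/US pair, then transpose back and mask them
m = df.T.groupby(prefix).transform(lambda s: s.duplicated(keep=False)).T
out = df.mask(m, '')
```

The result matches the axis=1 version: identical pairs are blanked, differing pairs survive.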

Subsetting two consecutive event occurrences in pandas

I'm trying to get a subset of my data whenever there is a consecutive occurrence of two events, in that order. The events are time-stamped. So every time there are consecutive 2s and then consecutive 3s, I want to subset those rows into a dataframe and append it to a dictionary. The following code does this, but I have to apply it to a very large dataframe of more than 20 million observations, and it is extremely slow using iterrows. How can I make it fast?
df = pd.DataFrame({'Date': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
                            112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122],
                   'Event': [1, 1, 2, 2, 2, 3, 3, 1, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 3, 3, 3]})
dfb = pd.DataFrame(columns=df.columns)
C = {}
f1 = 0
for index, row in df.iterrows():
    if (row['Event'] == 2) & (3 not in dfb['Event'].values):
        dfb = dfb.append(row)
        f1 = 1
    elif (row['Event'] == 3) & (f1 == 1):
        dfb = dfb.append(row)
    elif 3 in dfb['Event'].values:
        f1 = 0
        C[str(dfb.iloc[0, 0])] = dfb
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
        if row['Event'] == 2:
            dfb = dfb.append(row)
            f1 = 1
    else:
        f1 = 0
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
Edit: The desired output is basically a dictionary of the subsets shown in the image: https://i.stack.imgur.com/ClWZs.png
If you want to accelerate this, you should vectorize your code. You could try something like this (df is the same as in your code):
vec = df.copy()
vec['Event_y'] = vec['Event'].shift(1).fillna(0).astype(int)
vec['Same_Flag'] = float('nan')
vec.Same_Flag.loc[(vec['Event_y'] == vec['Event']) & (vec['Event'] != 1)] = 1
vec.dropna(inplace=True)
vec.loc[:, ('Date', 'Event')]
Output is:
    Date  Event
3    104      2
4    105      2
6    107      3
10   111      2
18   119      2
20   121      3
21   122      3
I think that's close to what you need; you can improve on it from there.
I don't understand why dates 104, 105, 107 are not counted.
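For reference, the full grouping logic (runs of 2s immediately followed by runs of 3s, keyed by each block's first Date like the original C dictionary) can also be vectorized with shift and cumsum. This is a sketch of one possible approach, not taken from the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': list(range(101, 123)),
    'Event': [1, 1, 2, 2, 2, 3, 3, 1, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 3, 3, 3],
})

e = df['Event']
is23 = e.isin([2, 3])
# a new block starts where a 2/3 run begins, or where a 2 follows a 3
starts = (is23 & ~is23.shift(fill_value=False)) | ((e == 2) & (e.shift() == 3))
grp = starts.cumsum()

C = {}
for _, sub in df[is23].groupby(grp[is23]):
    # keep only blocks containing both events, i.e. some 2s followed by some 3s
    if sub['Event'].eq(2).any() and sub['Event'].eq(3).any():
        C[str(sub.iloc[0, 0])] = sub
```

Because any 2 that follows a 3 opens a new block, a block that contains both values is guaranteed to be 2s followed by 3s, so no per-row Python loop over the 20 million observations is needed (the final loop only runs once per block).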

How can I increment a level in Pandas MultiIndex?

How can I increment all values in a specific level of a pandas multiindex?
You can create a new index with MultiIndex.from_tuples and assign it:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
df = df.set_index(['A', 'B'])
print(df)
     C  D  E  F
A B
1 4  7  1  5  7
2 5  8  3  3  4
3 6  9  5  6  3
#change multiindex
new_index = list(zip(df.index.get_level_values('A'), df.index.get_level_values('B') + 1))
df.index = pd.MultiIndex.from_tuples(new_index, names = df.index.names)
print (df)
     C  D  E  F
A B
1 5  7  1  5  7
2 6  8  3  3  4
3 7  9  5  6  3
Another possible solution with reset_index and set_index:
df = df.reset_index()
df.B = df.B + 1
df = df.set_index(['A','B'])
print (df)
     C  D  E  F
A B
1 5  7  1  5  7
2 6  8  3  3  4
3 7  9  5  6  3
Solution with DataFrame.assign:
print (df.reset_index().assign(B=lambda x: x.B+1).set_index(['A','B']))
Timings:
In [26]: %timeit (reset_set(df1))
1 loop, best of 3: 144 ms per loop
In [27]: %timeit (assign_method(df3))
10 loops, best of 3: 161 ms per loop
In [28]: %timeit (jul(df2))
1 loop, best of 3: 543 ms per loop
In [29]: %timeit (tuples_method(df))
1 loop, best of 3: 581 ms per loop
Code for timings:
import numpy as np
import pandas as pd

np.random.seed(100)
N = 1000000
df = pd.DataFrame(np.random.randint(10, size=(N, 5)), columns=list('ABCDE'))
print(df)
df = df.set_index(['A', 'B'])
print(df)
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()

def reset_set(df):
    df = df.reset_index()
    df.B = df.B + 1
    return df.set_index(['A', 'B'])

def assign_method(df):
    return df.reset_index().assign(B=lambda x: x.B + 1).set_index(['A', 'B'])

def tuples_method(df):
    new_index = list(zip(df.index.get_level_values('A'), df.index.get_level_values('B') + 1))
    df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
    return df

def jul(df):
    df.index = pd.MultiIndex.from_tuples([(x[0], x[1] + 1) for x in df.index], names=df.index.names)
    return df
Thank you Jeff for another solution:
df.index.set_levels(df.index.levels[1] + 1 , level=1, inplace=True)
print (df)
     C  D  E  F
A B
1 5  7  1  5  7
2 6  8  3  3  4
3 7  9  5  6  3
Here's a slightly different way:
df.index = pd.MultiIndex.from_tuples([(x[0], x[1]+1) for x in df.index], names=df.index.names)
1000 loops, best of 3: 840 µs per loop
For comparison:
new_index = list(zip(df.index.get_level_values('A'),
                     df.index.get_level_values('B') + 1))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
1000 loops, best of 3: 984 µs per loop
The reset_index method is 10 times slower.
It can be as simple as
df.index.set_levels(df.index.levels[0] + 1, 0, inplace=True)
demo:

df = pd.DataFrame(
    dict(A=[2, 3, 4, 5]),
    pd.MultiIndex.from_product([[1, 2], [3, 4]])
)
df

df.index.set_levels(df.index.levels[0] + 1, 0, inplace=True)
df
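One caveat for current readers: the inplace argument of set_levels (and its positional level argument) was deprecated and later removed, so on recent pandas the same demo would assign the result back instead (a sketch assuming pandas 2.x):

```python
import pandas as pd

df = pd.DataFrame(
    dict(A=[2, 3, 4, 5]),
    index=pd.MultiIndex.from_product([[1, 2], [3, 4]])
)
# set_levels now returns a new MultiIndex; assign it back rather than mutating
df.index = df.index.set_levels(df.index.levels[0] + 1, level=0)
```

The data is untouched; only the first index level shifts from (1, 2) to (2, 3).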

Imposing a threshold on values in dataframe in Pandas

I have the following code:
import numpy

t = 12
s = numpy.array(df.Array.tolist())
s[s < t] = 0
thresh = numpy.where(s > 0, s - t, 0)
df['NewArray'] = list(thresh)
While it works, surely there must be a more pandas-like way of doing it.
EDIT:
df.Array.head() looks like this:
0 [0.771511552006, 0.771515476223, 0.77143569165...
1 [3.66720695274, 3.66722560562, 3.66684636758, ...
2 [2.3047433839, 2.30475510675, 2.30451676559, 2...
3 [0.999991522708, 0.999996609066, 0.99989319662...
4 [1.11132718786, 1.11133284052, 0.999679589875,...
Name: Array, dtype: object
IIUC you can simply subtract and use clip_lower:
In [29]: df["NewArray"] = (df["Array"] - 12).clip_lower(0)
In [30]: df
Out[30]:
   Array  NewArray
0     10         0
1     11         0
2     12         0
3     13         1
4     14         2
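Two caveats worth noting: clip_lower was later removed in favour of clip(lower=...), and the question's Array column actually holds lists rather than scalars, so the subtraction has to happen element-wise inside each cell. A sketch covering the list case (the sample values here are made up, not the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Array': [[0.77, 3.67], [2.30, 13.5]]})
t = 12
# subtract the threshold inside each list and clip negatives to zero
df['NewArray'] = df['Array'].apply(
    lambda a: np.clip(np.asarray(a) - t, 0, None).tolist()
)
```

For a plain numeric column, the modern spelling of the answer is simply (df['Array'] - t).clip(lower=0).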