Pandas create row number - but not as an index

I want to create a row number series - but not override my date index.
I can do it with a loop but I think there must be an easier way?
_cnt = []
for i in range(len(df)):
    _cnt.append(i)
df['row'] = _cnt
Thanks.

Probably the easiest way:
df['row'] = range(len(df))
>>> df
0 1
0 0.444965 0.993382
1 0.001578 0.174628
2 0.663239 0.072992
3 0.664612 0.291361
4 0.486449 0.528354
>>> df['row'] = range(len(df))
>>> df
0 1 row
0 0.444965 0.993382 0
1 0.001578 0.174628 1
2 0.663239 0.072992 2
3 0.664612 0.291361 3
4 0.486449 0.528354 4
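
For completeness, a vectorized variant of the same idea (a sketch; assumes numpy is available):
import numpy as np

df['row'] = np.arange(len(df))  # same result without building a Python list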


How to create a new column based on row values in python?

I have data like below:
import pandas as pd

df = pd.DataFrame()
df["collection_amount"] = [100, 200, 300]
df["25%_coll"] = [1, 0, 1]
df["75%_coll"] = [0, 1, 1]
df["month"] = [4, 5, 6]
I want to create an output like below:
basically, if 25%_coll is 1, it should create a new column based on the month.
Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        # one-hot style column: the collection amount at row i, zeros elsewhere
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with id columns month and collection_amount.
Set the collection_amount values to 0 where the melted value is 0.
Build the new column names in column new_cols. The intermediate dataframe df2 then looks like this:
month collection_amount variable value new_cols
0 4 100 25%_coll 1 month_4_25%_coll
1 5 0 25%_coll 0 month_5_25%_coll
2 6 300 25%_coll 1 month_6_25%_coll
3 4 0 75%_coll 0 month_4_75%_coll
4 5 200 75%_coll 1 month_5_75%_coll
5 6 300 75%_coll 1 month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
How the result should be combined with the original dataframe isn't completely clear: either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (in that case call .reset_index() without drop=True in the step above).
Result for the sample dataframe
df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})
is
new_cols month_4_25%_coll month_4_75%_coll month_5_25%_coll \
0 100 0 0
1 0 0 0
2 0 0 0
new_cols month_5_75%_coll month_6_25%_coll month_6_75%_coll
0 0 0 0
1 200 0 0
2 0 300 300
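
A minimal sketch of the combining step mentioned above (assuming df2 has one row per month, in the same order as df):
out = pd.concat([df, df2], axis=1)  # attach the pivoted columns back to the original rows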

Iterate over two columns of a dataframe

I am trying to iterate over two columns of a dataframe ("binS99", "bin3HMax"). Those columns have values from 0 to 4. I would then like to create a new column ("Probability") in the same dataframe ("df_selection"), taking the values from the matrix "prob". The following code gets stuck in a loop. Any ideas on how to solve this? Thank you.
prob = [[0, 0.00103, 0.00103],
        [0, 0.00267, 0.00311],
        [0, 0.00688, 0.01000],
        [0, 0.01777, 0.03218]]

for index, row in df_selection.iterrows():
    a = int(df_selection.loc[index, "binS99"])
    b = int(df_selection.loc[index, "bin3HMax"])
    df_selection.loc[index, "Probability"] = prob[a][b]
I believe you first need to check that the maximum values in the columns fit the dimensions of the nested list, and then use numpy indexing:
import numpy as np
import pandas as pd

df_selection = pd.DataFrame({
    'A': list('abcdef'),
    'binS99': [0, 1, 2, 0, 2, 1],
    'bin3HMax': [1, 2, 1, 0, 1, 0],
})
print(df_selection)
A binS99 bin3HMax
0 a 0 1
1 b 1 2
2 c 2 1
3 d 0 0
4 e 2 1
5 f 1 0
prob = [[0, 0.00103, 0.00103],
        [0, 0.00267, 0.00311],
        [0, 0.00688, 0.01000],
        [0, 0.01777, 0.03218]]

arr_prob = np.array(prob)
print(arr_prob)
[[0. 0.00103 0.00103]
[0. 0.00267 0.00311]
[0. 0.00688 0.01 ]
[0. 0.01777 0.03218]]
a = df_selection['binS99'].to_numpy()
b = df_selection['bin3HMax'].to_numpy()
# pairwise ("fancy") indexing: row i gets arr_prob[a[i], b[i]]
df_selection['Probability'] = arr_prob[a, b]
print(df_selection)
A binS99 bin3HMax Probability
0 a 0 1 0.00103
1 b 1 2 0.00311
2 c 2 1 0.00688
3 d 0 0 0.00000
4 e 2 1 0.00688
5 f 1 0 0.00000
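
The check mentioned at the start could look like this (a sketch, reusing arr_prob from above):
# guard against bin values that would index past the ends of arr_prob
assert df_selection['binS99'].max() < arr_prob.shape[0]
assert df_selection['bin3HMax'].max() < arr_prob.shape[1]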

Group by based on an if statement

I have a df that contains IDs and timestamps.
I want to group by the ID and then apply a condition on the timestamps of the two rows in each group.
Basically, grouping the IDs and an if statement that gives a value of 1 if the first row's timestamp is less than the second's, and 2 if the second row's timestamp is less than the first's.
Updated output below, where the last two values should be 2.
Use to_timedelta to convert the times, then aggregate each group by comparing its first and last value, and finally map the result with numpy.where to assign the new column:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID Code': ['a', 'a', 'b', 'b'],
    'Time Created': ['21:25:27', '21:12:09', '21:12:00', '21:12:40']
})
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = df.groupby('ID Code')['Time Created'].agg(lambda x: x.iat[0] < x.iat[-1])
print(mask)
ID Code
a False
b True
Name: Time Created, dtype: bool
df['new'] = np.where(df['ID Code'].map(mask), 1, 2)
print(df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
Another solution uses transform, which returns the aggregated value (here a boolean mask) aligned to every row:
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = df.groupby('ID Code')['Time Created'].transform(lambda x: x.iat[0] > x.iat[-1])
print(mask)
0 True
1 True
2 False
3 False
Name: Time Created, dtype: bool
df['new'] = np.where(mask, 2, 1)
print(df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
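
An equivalent formulation with two transform calls (a sketch, not part of the original answer):
first = df.groupby('ID Code')['Time Created'].transform('first')
last = df.groupby('ID Code')['Time Created'].transform('last')
df['new'] = np.where(first < last, 1, 2)  # 1 if the group's first timestamp is earlier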

Apply a function to two dataframe rows

Given a pandas dataframe like this:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
col1 col2
0 1 4
1 2 5
2 3 6
I would like to do something equivalent to this using a function, but without passing the whole dataframe "by value" or as a global variable (it could be huge, and then it would give me a memory error):
i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
    c1 = df.iloc[i, 0] + df.iloc[index, 0]
    c2 = df.iloc[i, 1] + df.iloc[index, 1]
    # .ix is gone from modern pandas; .iloc does the positional assignment
    df.iloc[index, 0] = c1
    df.iloc[index, 1] = c2
    i = index
col1 col2
0 1 4
1 3 9
2 6 15
i.e., I would like to have a function which will give me the previous output:
def my_function(two_rows):
    row1 = two_rows[0]
    row2 = two_rows[1]
    c1 = row1[0] + row2[0]
    c2 = row1[1] + row2[1]
    row2[0] = c1
    row2[1] = c2
    return row2

df.apply(my_function, axis=1)
df
col1 col2
0 1 4
1 3 9
2 6 15
Is there a way of doing this?
What you've demonstrated is a cumsum:
df.cumsum()
col1 col2
0 1 4
1 3 9
2 6 15
To define a function as a loop that does this in place (slow, cell by cell):
def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            # add the previous cell into the current one, in place
            df[j].values[i] += df[j].values[i - 1]
    return df

f(df)
col1 col2
0 1 4
1 3 9
2 6 15
Compromise between memory and efficiency:
def f(df):
    for j in df.columns:
        # overwrite the column's buffer with its cumulative sum
        df[j].values[:] = df[j].values.cumsum()
    return df

f(df)
col1 col2
0 1 4
1 3 9
2 6 15
Note that you don't need to return df. I chose to for convenience.
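
A fully vectorized variant of the in-place version (a sketch; assumes all columns share a numeric dtype):
def f(df):
    # overwrite every value at once with the column-wise cumulative sum
    df[:] = df.to_numpy().cumsum(axis=0)
    return df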

Imposing a threshold on values in dataframe in Pandas

I have the following code:
import numpy

t = 12
s = numpy.array(df.Array.tolist())
s[s < t] = 0
thresh = numpy.where(s > 0, s - t, 0)
df['NewArray'] = list(thresh)
While it works, surely there must be a more pandas-like way of doing it.
EDIT:
df.Array.head() looks like this:
0 [0.771511552006, 0.771515476223, 0.77143569165...
1 [3.66720695274, 3.66722560562, 3.66684636758, ...
2 [2.3047433839, 2.30475510675, 2.30451676559, 2...
3 [0.999991522708, 0.999996609066, 0.99989319662...
4 [1.11132718786, 1.11133284052, 0.999679589875,...
Name: Array, dtype: object
IIUC you can simply subtract and use clip_lower:
In [29]: df["NewArray"] = (df["Array"] - 12).clip_lower(0)
In [30]: df
Out[30]:
Array NewArray
0 10 0
1 11 0
2 12 0
3 13 1
4 14 2
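
Note that clip_lower was deprecated in pandas 0.24 and removed in 1.0; on modern pandas the same idea is spelled with clip. A sketch, for the scalar column shown above and for the list-valued Array column from the question's EDIT:
import numpy as np

t = 12
df["NewArray"] = (df["Array"] - t).clip(lower=0)  # scalar column
df["NewArray"] = df["Array"].apply(lambda a: np.clip(np.asarray(a) - t, 0, None))  # list-valued column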