pandas cumsum skip column

I am new to pandas and I can compute a cumulative sum across columns as
df.cumsum(axis=1)
which turns this input:
   y0  y1  y2
0   2   3   4
1   2   2   3
2   0   0   0
3   1   2   3
into:
   y0  y1  y2
0   2   5   9
1   2   4   7
2   0   0   0
3   1   3   6
But is there a way to perform it on only the first 2 columns, i.e. to skip y2?

You need to exclude y2, compute the cumsum, and concat y2 back.
pd.concat([df[['y0', 'y1']].cumsum(axis=1), df['y2']], axis=1)
Output:
y0 y1 y2
0 2 5 4
1 2 4 3
2 0 0 0
3 1 3 3

You can also use .loc to select only the columns you care about.
cols = ['y0', 'y1']
df.loc[:, cols] = df.loc[:, cols].cumsum(axis=1)
Output
y0 y1 y2
0 2 5 4
1 2 4 3
2 0 0 0
3 1 3 3
loc is a flexible way to slice a DataFrame and in general follows the format:
.loc[row_labels, column_labels]
where a : can be used to indicate all rows or all columns.
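As a self-contained sketch, the .loc approach can be run end-to-end on the question's data:

```python
import pandas as pd

# the question's example frame
df = pd.DataFrame({'y0': [2, 2, 0, 1],
                   'y1': [3, 2, 0, 2],
                   'y2': [4, 3, 0, 3]})

# cumsum across only y0 and y1, leaving y2 untouched
cols = ['y0', 'y1']
df.loc[:, cols] = df.loc[:, cols].cumsum(axis=1)
print(df)
```

Note that this modifies df in place, whereas the pd.concat version builds a new frame.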


Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
Given the following DataFrame:
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0) but it doesn't work; it gives me the maximum value of the calculated column (= 10 in the example) for every row.
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
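For reference, here is a self-contained run of the clip approach on the question's data; clip(lower=0) implements max(0, ·) element-wise:

```python
import pandas as pd

df = pd.DataFrame({'A': [7, 5, 4, 9],
                   'B': [12, 4, 8, 2],
                   'C': [2, 4, 2, 3]})

# element-wise max(0, A - B + C)
df['D'] = (df['A'] - df['B'] + df['C']).clip(lower=0)
print(df)
```

clip and np.where stay vectorized, so both scale much better than the row-wise apply version.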

Pandas running sum

I have a pandas dataframe and it is something like this:
x y
1 0
2 1
3 2
4 0 <<<< Reset
5 1
6 2
7 3
8 0 <<<< Reset
9 1
10 2
The x values could be anything, they are not meaningful for this question. The y values increment, and reset and increment again. I need a third column (z) which is a number that represents the groups, so it increments when the y values are reset.
I cannot guarantee that the reset will be to zero, only a value that is less than the previous one, should indicate a reset.
x y z
1 0 0
2 1 0
3 2 0
4 0 1 <<<< Incremented by 1
5 1 1
6 2 1
7 3 1
8 0 2 <<<< Incremented by 1
9 1 2
10 2 2
So, to produce z, I understand what needs to be done; I'm just not familiar with the syntax. My approach would be to first build z as a sparse column of 0s and 1s, where everything is zero except that a 1 appears wherever y[ix] < y[ix-1], indicating that the y counter has been reset. Then a cumulative running sum should be performed on the z column, meaning that z[ix] = sum(z[0], z[1], ..., z[ix]).
I'd appreciate some help with the syntax for assigning column z, if someone has a moment.
Based on your logic:
#general case
df['z'] = df['y'].diff().lt(0).cumsum()
# or equivalently
# df['z'] = df['y'].lt(df['y'].shift()).cumsum()
Output:
x y z
0 1 0 0
1 2 1 0
2 3 2 0
3 4 0 1
4 5 1 1
5 6 2 1
6 7 3 1
7 8 0 2
8 9 1 2
9 10 2 2
Using ne(1) (note this assumes y always increments by exactly 1 between resets):
df.y.diff().ne(1).cumsum().sub(1)
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: y, dtype: int32
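Putting the general-case answer into a runnable sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x': range(1, 11),
                   'y': [0, 1, 2, 0, 1, 2, 3, 0, 1, 2]})

# a reset is any drop in y relative to the previous row;
# diff() is NaN for the first row, and NaN < 0 is False, so z starts at 0
df['z'] = df['y'].diff().lt(0).cumsum()
print(df)
```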

How to use each pair of values in a column and transpose them into rows inside groups (groupby)

I have data on which I already applied a group by user and a sort by time (data.groupby('id').apply(lambda x: x.sort_values('time'))):
user time point_id
1 00:00 1
1 00:01 3
1 00:02 4
1 00:03 2
2 00:00 1
2 00:05 3
2 00:15 1
3 00:00 1
3 01:00 2
3 02:00 3
From that, inside each group, I need to link each pair of consecutive values, transposing them into rows. It should look like this for the example above:
user start_point end_point
1 1 3
1 3 4
1 4 2
2 1 3
2 3 1
3 1 2
3 2 3
My final goal is to get matrix which will show how many links come into each point:
point_id | 1 | 2 | 3 | 4 |
--------------------------------------------
1 0 1 3 0
2 1 0 0 1
3 3 0 0 1
4 0 1 1 0
So this matrix means that one link goes from point 2 to point 1, three links go from point 3 to point 1, etc.
First, you can use shift() to group point_id into rows.
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
print(df)
user start_point end_point
0 1 1 3
1 1 3 4
2 1 4 2
4 2 1 3
5 2 3 1
7 3 1 2
8 3 2 3
Then you can use pd.crosstab to count directed link.
u = pd.crosstab(df.start_point, df.end_point)
print(u)
end_point 1 2 3 4
start_point
1 0 1 2 0
2 0 0 1 0
3 1 0 0 1
4 0 1 0 0
According to your results, what you need is undirected graph counting, so all we need to do is transpose and add.
result = u + u.T
print(result)
end_point 1 2 3 4
start_point
1 0 1 3 0
2 1 0 1 1
3 3 1 0 1
4 0 1 1 0
Final code as follows:
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
u = pd.crosstab(df.start_point, df.end_point)
result = u + u.T
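As a runnable check of those steps, on a frame built from the question's user/point_id columns (the time column is omitted here, since .astype(int) would fail on it):

```python
import pandas as pd

df = pd.DataFrame({'user':     [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'point_id': [1, 3, 4, 2, 1, 3, 1, 1, 2, 3]})

# pair each point with the next point of the same user
pairs = (df.assign(end_point=df['point_id'].shift(-1))
           [df['user'] == df['user'].shift(-1)]
           .rename(columns={'point_id': 'start_point'})
           .astype(int))

u = pd.crosstab(pairs.start_point, pairs.end_point)  # directed counts
result = u + u.T                                     # undirected counts
print(result)
```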
I believe this works for your example, taking df = data.groupby('id').apply(lambda x: x.sort_values('time')) (your starting example):
groups = [(k, df.loc[v, 'point_id'].values) for k, v in df.groupby('user').groups.items()]
res = []
for g in groups:
    res.append([(g[0], i) for i in zip(g[1], g[1][1:])])
df1 = pd.DataFrame([item for sublist in res for item in sublist])
df2 = df1.copy()
df2.iloc[:, -1] = df2.iloc[:, -1].apply(lambda x: (x[1], x[0]))  # df2 swaps around the points
df_ = pd.concat([df1, df2]).sort_values(by=0)
df_['1'], df_['2'] = df_.iloc[:, -1].apply(lambda x: x[0]), df_.iloc[:, -1].apply(lambda x: x[1])
df_ = df_.drop(columns=1)
df_.columns = ['user', 'start_point', 'end_point']  # your intermediate table
df_.pivot_table(index='start_point', columns='end_point', aggfunc='count').fillna(0)
Output:
user
end_point 1 2 3 4
start_point
1 0.0 1.0 3.0 0.0
2 1.0 0.0 1.0 1.0
3 3.0 1.0 0.0 1.0
4 0.0 1.0 1.0 0.0

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their locations in the dataframe, and then store these in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
Num
0 1
1 2
2 3
3 4
4 5
dataframe : df2
0 1 2 3 4
0 9 12 8 6 7
1 11 1 4 10 13
2 5 14 2 0 3
I've tried something similar to this but doesn't work
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
so looking for a result like this
dataframe : df1 updated
Num X Y
0 1 1 1
1 2 2 2
2 3 2 4
3 4 1 2
4 5 2 0
any hints or ideas would be appreciated
Instead of trying to find the value in df2, why not just make df2 a flat dataframe.
df2 = df2.reset_index().melt(id_vars='index')
df2.columns = ['X', 'Y', 'Num']
(the reset_index must come before the melt, otherwise the row position is lost and X would just be a running counter). So now your df2 looks like this:
   X  Y  Num
0  0  0    9
1  1  0   11
2  2  0    5
3  0  1   12
4  1  1    1
5  2  1   14
You can of course sort by Num, and if you just want the values from your list you can further filter df2 (casting to str, since my_list holds strings while df2's values are ints):
df2 = df2[df2.Num.astype(str).isin(my_list)]
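An alternative sketch that keeps both coordinates explicitly uses stack(), whose result is indexed by (row, column) pairs; the int(n) casts are needed because my_list holds strings:

```python
import pandas as pd

my_list = ['1', '2', '3', '4', '5']
df1 = pd.DataFrame(my_list, columns=['Num'])
df2 = pd.DataFrame([[9, 12, 8, 6, 7],
                    [11, 1, 4, 10, 13],
                    [5, 14, 2, 0, 3]])

# value -> (row, column) lookup built from the stacked frame
lookup = {v: rc for rc, v in df2.stack().items()}

df1['X'] = [lookup[int(n)][0] for n in df1['Num']]
df1['Y'] = [lookup[int(n)][1] for n in df1['Num']]
print(df1)
```

This assumes every value in my_list occurs exactly once in df2.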

Apply an element-wise function on a pandas dataframe with index and column values as inputs

I often have this need, and I can't seem to find the way to do it efficiently.
Let's say I have a pandas DataFrame object and I want the value of each element (i,j) to be equal to f(index[i], columns[j]).
Using applymap, value of index and column for each element is lost.
What is the best way to do it?
It depends on what you are trying to do specifically.
Clever hack: using pd.Panel.apply (note that pd.Panel has been removed from modern pandas, so this only works on old versions).
It works because it will iterate over each series along the major and minor axes. Its name will be the tuple we need.
df = pd.DataFrame(index=range(5), columns=range(5))
def f1(x):
    n = x.name
    return n[0] + n[1] ** 2
pd.Panel(dict(A=df)).apply(f1, 0)
0 1 2 3 4
0 0 1 4 9 16
1 1 2 5 10 17
2 2 3 6 11 18
3 3 4 7 12 19
4 4 5 8 13 20
example 1
Here is one such use case and one possible solution for that use case
df = pd.DataFrame(index=range(5), columns=range(5))
f = lambda x: x[0] + x[1]
s = df.stack(dropna=False)
s.loc[:] = s.index.map(f)
s.unstack()
0 1 2 3 4
0 0 1 2 3 4
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
or this will do the same thing
df.stack(dropna=False).to_frame().apply(lambda x: f(x.name), 1).unstack()
example 2
df = pd.DataFrame(index=list('abcd'), columns=list('xyz'))
v = df.values
c = df.columns.values
i = df.index.values
pd.DataFrame(
    (i.repeat(len(c)) + np.tile(c, len(i))).reshape(v.shape),
    i, c
)
   x   y   z
a  ax  ay  az
b  bx  by  bz
c  cx  cy  cz
d  dx  dy  dz
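Since pd.Panel is gone from modern pandas, a plain list comprehension over the labels is a simple, version-proof sketch of the same element-wise f(index[i], columns[j]) idea:

```python
import pandas as pd

df = pd.DataFrame(index=range(5), columns=range(5))

def f(i, j):
    return i + j ** 2  # same function as the Panel example

# build the values directly from the row/column labels
out = pd.DataFrame([[f(i, j) for j in df.columns] for i in df.index],
                   index=df.index, columns=df.columns)
print(out)
```

It is not vectorized, but for label-dependent functions that cannot be expressed with NumPy operations it is often the clearest option.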