How to get the index of each increment in pandas series? - pandas

How do I get the indices of a pandas Series at the positions where the value is incremented?
For example, the input is
A
0 0
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
the output should be: [0, 1, 4, 6, 7]

You can use Series.duplicated and access the index; it should be slightly faster than the alternatives.
df.index[~df.A.duplicated()]
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
If you really want a list, you can do this,
df.index[~df.A.duplicated()].tolist()
# [0, 1, 4, 6, 7]
Note that duplicated (and drop_duplicates) will only work if your Series does not have any decrements.
Alternatively, you can use diff here, and index into df.index, similar to the previous solution:
np.insert(df.index[df.A.diff().gt(0)], 0, 0)
# array([0, 1, 4, 6, 7])
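To illustrate the caveat about decrements, here is a hypothetical series that dips back down: duplicated misses the later increase, while the diff-based mask keeps every upward step (assuming pandas as pd and numpy as np).
s = pd.Series([0, 1, 1, 0, 1, 2], name='A')
s.index[~s.duplicated()]
# Int64Index([0, 1, 5], dtype='int64')   <- the increase at index 4 is lost
np.insert(s.index[s.diff().gt(0)], 0, 0)
# array([0, 1, 4, 5])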

You can also use drop_duplicates:
df.drop_duplicates('A').index.tolist()
[0, 1, 4, 6, 7]

This makes sure the next row is incremented by exactly one (not by two or anything else!):
df[ ((df.A.shift(-1) - df.A) == 1.0)].index.values
The output is a numpy array:
array([2, 5])
Example:
# A = [1, 1, 1, 2, 8, 3, 4, 4]
# the value increases by exactly 1 from index 2 to 3 (1 -> 2) and from index 5 to 6 (3 -> 4)
df = pd.DataFrame({'A': [1, 1, 1, 2, 8, 3, 4, 4]})
df[ ((df.A.shift(-1) - df.A) == 1.0)].index.values
array([2, 5])

Related

Apply custom functions to groupby pandas

I know there are questions/answers about how to use a custom function with groupby in pandas, but my case is slightly different.
My data is
group_col val_col
0 a [1, 2, 34]
1 a [2, 4]
2 b [2, 3, 4, 5]
data = {'group_col': {0: 'a', 1: 'a', 2: 'b'}, 'val_col': {0: [1, 2, 34], 1: [2, 4], 2: [2, 3, 4, 5]}}
df = pd.DataFrame(data)
What I am trying to do is group by group_col, then sum up the lengths of the lists in val_col for each group. My desired output is
a 5
b 4
I wonder if I can do this in pandas?
You can try
df['val_col'].str.len().groupby(df['group_col']).sum()
or
df.groupby('group_col')['val_col'].sum().str.len()
Output:
group_col
a 5
b 4
Name: val_col, dtype: int64
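Equivalently (a small sketch reusing the df built above), you can map each list to its length first and then aggregate per group:
df['val_col'].map(len).groupby(df['group_col']).sum()
# group_col
# a    5
# b    4
# Name: val_col, dtype: int64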

Randomness of np.random.shuffle

I have two arrays (i and j) that are exactly the same. I shuffle them with a specified random seed.
import numpy as np
np.random.seed(42)
i = np.array([0, 1, 2, 3, 4, 5, 6, 7])
j = np.array([0, 1, 2, 3, 4, 5, 6, 7])
np.random.shuffle(i)
np.random.shuffle(j)
print(i, j)
# [1 5 0 7 2 4 3 6] [3 7 0 4 5 2 1 6]
I expected them to be the same after shuffling, but that is not the case.
Do you have any ideas about how to get the same results (like the example below) after shuffling?
# [1 5 0 7 2 4 3 6] [1 5 0 7 2 4 3 6]
Many thanks in advance!
Calling seed() sets the state of a global random number generator. Each call of shuffle continues with the same global random number generator, so the results are different, as they should be. If you want them to be the same, reset the seed before each call of shuffle.
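For example, re-seeding the global generator immediately before each shuffle reproduces the same permutation (a minimal sketch based on the arrays above):
import numpy as np

i = np.array([0, 1, 2, 3, 4, 5, 6, 7])
j = np.array([0, 1, 2, 3, 4, 5, 6, 7])

np.random.seed(42)
np.random.shuffle(i)

np.random.seed(42)   # reset the global state before the second shuffle
np.random.shuffle(j)

print(i, j)
# [1 5 0 7 2 4 3 6] [1 5 0 7 2 4 3 6]
Alternatively, generate a single permutation of the positions with np.random.permutation(len(i)) and index both arrays with it.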

Getting the difference iterated over every row from one dataframe to another in pandas

I have two datasets: df1 and df2, each with a column named 'value' with 10 records. Currently I have:
df = df1.value - df2.value
but this code outputs only 10 rows (as expected). How would one compute the difference for every pair of rows, instead of just between rows with the same index (and get a table of 100 records instead)?
Thanks in advance!
You can use pandas.DataFrame.merge with how='cross' (Cartesian product), then get the column difference with pandas.DataFrame.diff:
#setup
df1 = pd.DataFrame({"value":[7,5,4,8,9]})
df2 = pd.DataFrame({"value":[1,7,9,5,3]})
df2.merge(df1, how="cross", suffixes=['x', '']).diff(axis=1).dropna(axis=1)
Output
value
0 6
1 4
2 3
3 7
4 8
5 0
6 -2
7 -3
8 1
9 2
10 -2
11 -4
12 -5
13 -1
14 0
15 2
16 0
17 -1
18 3
19 4
20 4
21 2
22 1
23 5
24 6
Try this.
ndf = df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_l','_r')).drop('key', axis=1)
ndf['value_l'] - ndf['value_r']
Use an outer subtraction.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"value":[7,5,4,8,9]})
df2 = pd.DataFrame({"value":[1,7,9,5,3]})
np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy())
#array([[ 6, 0, -2, 2, 4],
# [ 4, -2, -4, 0, 2],
# [ 3, -3, -5, -1, 1],
# [ 7, 1, -1, 3, 5],
# [ 8, 2, 0, 4, 6]])
Add a .ravel() if you want the same order as a cross join.
np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy()).ravel('F')
#array([ 6, 4, 3, 7, 8, 0, -2, -3, 1, 2, -2, -4, -5, -1, 0, 2, 0,
# -1, 3, 4, 4, 2, 1, 5, 6])
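If you want to keep track of which pair each difference came from, the same matrix can be wrapped in a labelled DataFrame (a sketch reusing df1 and df2 from above, with rows from df1 and columns from df2):
pd.DataFrame(np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy()),
             index=df1.index, columns=df2.index)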

Intersect a dataframe with a larger one that includes it and remove common rows

I have two dataframes:
df_small = pd.DataFrame(np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]),
columns=['a', 'b', 'c'])
and
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
[31, 4, 5, 6, 75],
[73, 7, 8, 9, 23],
[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Now what I want is to intersect the two and keep only the rows of df_large that do not match any row of df_small, hence the result should be:
df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
[17, 1, 4, 3, 25],
[93, 3, 2, 8, 18]]),
columns=['k', 'a', 'b', 'c', 'd'])
Use DataFrame.merge with indicator=True and a left join; because duplicates would cause problems, first remove them from df_small with DataFrame.drop_duplicates:
m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
df = df_large[m]
print (df)
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
Another solution is very similar, only filtering with query and finally removing the _merge column:
df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
        .query('_merge != "both"')
        .drop('_merge', axis=1))
Use DataFrame.merge:
df_large.merge(df_small,how='outer',indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)
Output:
k a b c d
3 16 2 1 2 13
4 17 1 4 3 25
5 93 3 2 8 18
You can avoid merging and make your code a bit more readable; it is not immediately clear what happens when you merge and drop duplicates.
Indexes and MultiIndexes were made for intersections and other set operations.
common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small)
df_result = (df_large.set_index(common_columns)
             .drop(index=df_small_as_Multiindex)   # drop the common rows
             .reset_index())                       # not needed if the a, b, c columns are meaningful indexes
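If some rows of df_small might not be present in df_large at all, drop will raise a KeyError; passing errors='ignore' (a standard DataFrame.drop option) skips such labels, e.g.:
df_large.set_index(common_columns).drop(index=df_small_as_Multiindex, errors='ignore').reset_index()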

Aggregating a time series in Pandas given a window size

Let's say I have this data:
a = pandas.Series([1,2,3,4,5,6,7,8])
a
Out[313]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
dtype: int64
I would like to aggregate the data by grouping n rows at a time and summing them up. So if n=2, the new series would look like {3, 7, 11, 15}.
try this:
In [39]: a.groupby(a.index//2).sum()
Out[39]:
0 3
1 7
2 11
3 15
dtype: int64
In [41]: a.index//2
Out[41]: Int64Index([0, 0, 1, 1, 2, 2, 3, 3], dtype='int64')
n=3
In [42]: n=3
In [43]: a.groupby(a.index//n).sum()
Out[43]:
0 6
1 15
2 15
dtype: int64
In [44]: a.index//n
Out[44]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2], dtype='int64')
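Note that a.index // n works here because a has the default RangeIndex; for a Series with an arbitrary index, a purely positional key built with np.arange gives the same grouping (a small sketch, assuming numpy is imported as np):
import numpy as np

n = 2
a.groupby(np.arange(len(a)) // n).sum()
# 0     3
# 1     7
# 2    11
# 3    15
# dtype: int64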
You can use a pandas rolling sum and get it like the following:
if n is your interval:
sums = list(a.rolling(n).sum()[n-1::n])

# Optional: handle the leftover rows when len(a) is not divisible by n
rem = len(a) % n
if rem != 0:
    sums.append(a[-rem:].sum())
The first line sums the rows perfectly when the data divides evenly into groups of n; otherwise, you can also append the sum of the remaining rows (depending on your preference).
For example, in the above case with n=3, you may want either {6, 15, 15} or just {6, 15}. The code above gives the former; skipping the optional part gives just {6, 15}.