df.diff() how to compare row A with last row B - pandas

If we have two columns A and B
how to compare row A with last row B
A B diff
0 0.904560 0.208318 0
1 0.679290 0.747496 0
2 0.069841 0.165834 0
3 0.045818 0.907888 0
4 0.485712 0.593785 0
5 0.771665 0.800182 0
6 0.485041 0.024829 0
7 0.897172 0.584406 0
8 0.561953 0.626699 0
9 0.412803 0.900643 0

You can use Pandas' shift function to create a 'lagged' version of column B. Then it's a simple difference between columns.
from io import StringIO
import pandas as pd
raw = '''
A,B
0.904560,0.208318
0.679290,0.747496
0.069841,0.165834
0.045818,0.907888
0.485712,0.593785
0.771665,0.800182
0.485041,0.024829
0.897172,0.584406
0.561953,0.626699
0.412803,0.900643
'''.strip()
df = pd.read_csv(StringIO(raw))
df['B_lag'] = df.B.shift(1)
df['diff'] = df.A - df.B_lag
print(df)
Output looks like
A B B_lag diff
0 0.904560 0.208318 NaN NaN
1 0.679290 0.747496 0.208318 0.470972
2 0.069841 0.165834 0.747496 -0.677655
3 0.045818 0.907888 0.165834 -0.120016
4 0.485712 0.593785 0.907888 -0.422176
5 0.771665 0.800182 0.593785 0.177880
6 0.485041 0.024829 0.800182 -0.315141
7 0.897172 0.584406 0.024829 0.872343
8 0.561953 0.626699 0.584406 -0.022453
9 0.412803 0.900643 0.626699 -0.213896

Related

insert column to df on sequenced location

i have a df like this:
id
month
1
1
1
3
1
4
1
6
i want to transform it become like this:
id
1
2
3
4
5
6
1
1
0
1
1
0
1
ive tried using this code:
ndf = df[['id']].join(pd.get_dummies(
df['month'])).groupby('id').max()
but it shows like this:
id
1
3
4
6
1
1
1
1
1
how can i insert the middle column (2 and 5) even if it's not in the data?
You can use pd.crosstab
instead, then create new columns using pd.RangeIndex based on the min and max month, and finally use DataFrame.reindex (and optionally DataFrame.reset_index afterwards):
import pandas as pd
new_cols = pd.RangeIndex(df['month'].min(), df['month'].max())
res = (
pd.crosstab(df['id'], df['month'])
.reindex(columns=new_cols, fill_value=0)
.reset_index()
)
Output:
>>> res
id 1 2 3 4 5
0 1 1 0 1 1 0

renaming multiple cells below a specific cell with pandas

I am trying to merge two Excel tables, but the rows don't line up because in one column information is split over several rows whereas in the other table it is contained in a single cell.
Is there a way with pandas to rename the cells in Table A so that they line up with the rows in Table B?
df_jobs = pd.read_excel(r"jobs.xlsx", usecols="Jobs")
df_positions = pd.read_excel(r"orders.xlsx", usecols="Orders")
Sample files:
https://drive.google.com/file/d/1PEG3nZc0183Gh-8A2xbIs9kEZIWLzLSa/view?usp=sharing
https://drive.google.com/file/d/1HfQ4q7pjba0TKNJAHBqcGeoqdY3Yr3DB/view?usp=sharing
I suppose your input data looks like:
>>> df1
A i j
0 O-20-003049 NaN NaN
1 1 0.643284 0.834937
2 2 0.056463 0.394168
3 3 0.773379 0.057465
4 4 0.081585 0.178991
5 5 0.667667 0.004370
6 6 0.672313 0.587615
7 O-20-003104 NaN NaN
8 1 0.916426 0.739700
9 O-20-003117 NaN NaN
10 1 0.800776 0.614192
11 2 0.925186 0.980913
12 3 0.503419 0.775606
>>> df2
A x y
0 O-20-003049.01 0.593312 0.666600
1 O-20-003049.02 0.554129 0.435650
2 O-20-003049.03 0.900707 0.623963
3 O-20-003049.04 0.023075 0.445153
4 O-20-003049.05 0.307908 0.503038
5 O-20-003049.06 0.844624 0.710027
6 O-20-003104.01 0.026914 0.091458
7 O-20-003117.01 0.275906 0.398993
8 O-20-003117.02 0.101117 0.691897
9 O-20-003117.03 0.739183 0.213401
We start by renaming the rows in column A:
# create a boolean mask
mask = df1["A"].str.startswith("O-")
# rename all rows
df1["A"] = df1.loc[mask, "A"].reindex(df1.index).ffill() \
+ "." + df1["A"].str.pad(2, fillchar="0")
# remove unwanted rows (where mask==True)
df1 = df1[~mask].reset_index(drop=True)
>>> df1
A i j
1 O-20-003049.01 0.000908 0.078590
2 O-20-003049.02 0.896207 0.406293
3 O-20-003049.03 0.120693 0.722355
4 O-20-003049.04 0.412412 0.447349
5 O-20-003049.05 0.369486 0.872241
6 O-20-003049.06 0.614941 0.907893
8 O-20-003104.01 0.519443 0.800131
10 O-20-003117.01 0.583067 0.760002
11 O-20-003117.02 0.133029 0.389461
12 O-20-003117.03 0.969289 0.397733
Now, we are able to merge data on column A:
>>> pd.merge(df1, df2, on="A")
A i j x y
0 O-20-003049.01 0.643284 0.834937 0.593312 0.666600
1 O-20-003049.02 0.056463 0.394168 0.554129 0.435650
2 O-20-003049.03 0.773379 0.057465 0.900707 0.623963
3 O-20-003049.04 0.081585 0.178991 0.023075 0.445153
4 O-20-003049.05 0.667667 0.004370 0.307908 0.503038
5 O-20-003049.06 0.672313 0.587615 0.844624 0.710027
6 O-20-003104.01 0.916426 0.739700 0.026914 0.091458
7 O-20-003117.01 0.800776 0.614192 0.275906 0.398993
8 O-20-003117.02 0.925186 0.980913 0.101117 0.691897
9 O-20-003117.03 0.503419 0.775606 0.739183 0.213401

Remove rows in pandas df with index values within a range

I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!
get the differences between the current and previous row as a list and pass to loc. Chose to get it as a list so i could return a dataframe as a final output.
ind = [ a for a,b in zip(df.index,df.index[1:]) if b-a > 4]
df.loc[ind]
A B
0 1 1
You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >=4].drop('index', 1)
In [1314]: df
Out[1313]:
A B
0 1 1

How to split a column in a data frame containing only numbers into multiple columns in pandas

I have a .dat file containing the following data:
0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011
Need to count number of zeros and ones in each row
I have tried with Pandas.
Step-1: Read the data file
Step-2: Given a column name
Step-3: Tried to split the values into multiple columns. But could
not succeed
df1=pd.read_csv('data.dat',header=None) df1.head()
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
df1.columns=['kirti']
df1.head()
Kirti
_______________________
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
I need to split the data frame into multiple columns depending upon the 0s and 1s in each row.
the maximum number of columns will be equal to max no of zeros and ones in any of the rows in the data frame.
First create one column DataFrame by parameters names and dtype=str for convert column to strings:
import pandas as pd
temp="""0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename'
df = pd.read_csv(StringIO(temp), header=None, names=['kirti'], dtype=str)
print (df)
kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
And then create new DataFrame by convert values to lists:
df = pd.DataFrame([list(x) for x in df['kirti']])
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0
1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 None None None None
2 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 None
3 0 1 1 1 1 1 1 0 1 0 1 0 0 None None None None None None
4 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 None None None
If your data is in a list of strings, then use the count method:
>> data = ["0001100000101010100", "110101000001111", "101100011001110111", "0111111010100", "1010111111100011"]
>> for i in data:
print(i.count("0"))
13
7
7
5
5
If your data is in a .dat file with whitespace sepparation as you discribed, then I would recommend loading your data as follows:
data = pd.read_csv("data.dat", lineterminator=" ",dtype="str", header=None, names=["Kirti"])
Kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
The lineterminator argument ensures that every entry is in a new row. The dtype argument ensures that it's read as string. Otherwise you will loose leading zeros.
If your data is in a DataFrame, you can use the count method (inspired from here):
>> data["Kirti"].str.count("0")
0 13
1 7
2 7
3 5
4 5
Name: Kirti, dtype: int64

Pandas : Get a column value where another column is the minimum in a sub-grouping [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked
def filter_group(dfg, col):
return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record you can sort, then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')