My first question here! I'm looking for help vectorizing an operation on a pandas DataFrame. The problem reduces to a DataFrame with three columns: one column whose values will be updated, and two columns that each hold an iteration number. The iteration numbers are not the same between the two columns.
For each point where the first iteration column changes value, I'd like to look up the corresponding value of the second iteration column (at that same index), and then fill zero into the update column, but only for the rows from that index onward in which the second iteration column still has that same value. Hopefully this example explains it a bit better:
df = pd.DataFrame()
df['update_col'] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df['iter2'] = [0, 1, 1, 2, 2, 3, 3, 4, 4]
df['iter1'] = [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(df)
# update_col iter2 iter1
0 1 0 0
1 2 1 0
2 3 1 1
3 4 2 1
4 5 2 1
5 6 3 2
6 7 3 2
7 8 4 2
8 9 4 2
So basically, I want to do the following:
1. Reference the iter1 column, and find where it changes (i.e. goes from 0 to 1 or from 1 to 2).
2. Look at the iter2 column at that index.
3. Change the values in update_col to zero for all rows starting from the index in step 2 until iter2 is incremented to a new value.
So the output would look like the following:
# update_col iter2 iter1
0 1 0 0
1 2 1 0
2 0 1 1
3 4 2 1
4 5 2 1
5 0 3 2
6 0 3 2
7 8 4 2
8 9 4 2
I think a properly constructed groupby could be a solution, but I am still a rookie at using it effectively.
I am currently achieving what I want with a complicated for loop, but it makes the run time extremely long for the size and number of dataframes that I have to process. I think another solution could be a map or replace operation, but the complicating caveat is that I don't want to update all of the rows for that value of iter2, only the rows from the index where iter1 changes until the last row with that iter2 value.
Any help or insight is greatly appreciated!
This might not be a huge improvement on the loop you already have defined, but I think it eliminates the need to use a nested loop at least:
import pandas as pd
# creating data frame
df = pd.DataFrame()
df['update_col'] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df['iter2'] = [0, 1, 1, 2, 2, 3, 3, 4, 4]
df['iter1'] = [0, 0, 1, 1, 1, 2, 2, 2, 2]
# computing difference from prev element by creating a shifted col
# and subtracting from the original.
# (you could also use a rolling window function for this)
df['change1'] = df['iter1'] - df['iter1'].shift(1)
df['change2'] = df['iter2'] - df['iter2'].shift(1)
# creating boolean cols to flag if iter1 or iter2 have changed
df['start'] = df['change1'] == 1
df['stop'] = df['change2'] == 1
# list to store result: if True, you update value to 0
res = [False] * len(df['start'])
for i in range(0, len(df['start'])):
    if df['start'][i]:
        #print('start detected')
        res[i] = True
    elif i > 1 and (not df['stop'][i]) and res[i-1]:
        #print('continuation detected')
        res[i] = True
    #print(f'set res[{i}] to ', res[i])
df['update_to_zero'] = res
Which results in this df:
update_col iter2 iter1 change1 change2 start stop update_to_zero
0 1 0 0 NaN NaN False False False
1 2 1 0 0.0 1.0 False True False
2 3 1 1 1.0 0.0 True False True
3 4 2 1 0.0 1.0 False True False
4 5 2 1 0.0 0.0 False False False
5 6 3 2 1.0 1.0 True True True
6 7 3 2 0.0 0.0 False False True
7 8 4 2 0.0 1.0 False True False
8 9 4 2 0.0 0.0 False False False
Hope this helps!
I noticed this today and wanted to ask because I am a little confused about this.
Let's say we have two DataFrames:
df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
A B C
0 3 1 6
1 2 4 0
2 8 8 0
3 8 6 7
4 4 5 0
df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))
C B A
0 3 5 5
1 7 4 6
2 0 7 7
3 6 6 5
4 4 0 6
If we wanted to conditionally replace values in the first df with values from the second, we could do this:
df.loc[df['A'].gt(3)] = df2
I would expect the columns to be aligned and, if any columns were missing, for those values in the first df to be populated with NaN. However, when the above code is run, it replaces the data without taking the column names into account. (It does take the index labels into account, however.)
A B C
0 3 1 6
1 2 4 0
2 0 7 7
3 6 6 5
4 4 0 6
At index 2, instead of [7,7,0] we have [0,7,7].
However, if we pass the names of the columns into the loc statement, without changing the order of the columns in df2, it aligns with the columns.
df.loc[df['A'].gt(3),['A','B','C']] = df2
A B C
0 3 1 6
1 2 4 0
2 7 7 0
3 5 6 6
4 6 0 4
Why does this happen?
Interestingly, loc performs a number of optimizations to improve performance. One of those optimizations is checking the type of the index passed in.
Both Row and Column Indexes Included
When passing both a row index and a column index the __setitem__ function:
def __setitem__(self, key, value):
    if isinstance(key, tuple):
        key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    else:
        key = com.apply_if_callable(key, self.obj)
    indexer = self._get_setitem_indexer(key)
    self._has_valid_setitem_indexer(key)
    iloc = self if self.name == "iloc" else self.obj.iloc
    iloc._setitem_with_indexer(indexer, value, self.name)
Interprets the key as a tuple.
key:
(0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool,
['A', 'B', 'C'])
This is then passed to _get_setitem_indexer to convert to a positional indexer from label-based:
indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
    """
    Convert a potentially-label-based key into a positional indexer.
    """
    if self.name == "loc":
        self._ensure_listlike_indexer(key)
    if self.axis is not None:
        return self._convert_tuple(key, is_setter=True)
    ax = self.obj._get_axis(0)
    if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
        with suppress(TypeError, KeyError, InvalidIndexError):
            # TypeError e.g. passed a bool
            return ax.get_loc(key)
    if isinstance(key, tuple):
        with suppress(IndexingError):
            return self._convert_tuple(key, is_setter=True)
    if isinstance(key, range):
        return list(key)
    try:
        return self._convert_to_indexer(key, axis=0, is_setter=True)
    except TypeError as e:
        # invalid indexer type vs 'other' indexing errors
        if "cannot do" in str(e):
            raise
        elif "unhashable type" in str(e):
            raise
        raise IndexingError(key) from e
This generates a tuple indexer (both rows and columns are converted):
if isinstance(key, tuple):
    with suppress(IndexingError):
        return self._convert_tuple(key, is_setter=True)
returns
(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
Only Row Index Included
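However, when only a row index is passed to loc, the key is not a tuple and, as such, only a single dimension is converted from label to positional. A boolean Series is not a range either, so the key falls through to the final branch:
return self._convert_to_indexer(key, axis=0, is_setter=True)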
returns
[2 3 4]
For this reason, no alignment happens among columns when only a single value is passed to loc, as no parsing is done to align the columns.
That is why an empty slice is often used:
df.loc[df['A'].gt(3), :] = df2
As this is sufficient to align the columns appropriately.
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)
df.loc[df['A'].gt(3), :] = df2
print(df)
Example:
df:
A B C
0 3 6 6
1 0 8 4
2 7 0 0
3 7 1 5
4 7 0 1
df2:
C B A
0 4 6 2
1 1 2 7
2 0 5 0
3 0 4 4
4 3 2 4
df.loc[df['A'].gt(3), :] = df2:
A B C
0 3 6 6
1 0 8 4
2 0 5 0
3 4 4 0 # Aligned as expected
4 4 2 3
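Since the alignment here depends on pandas internals, and may well differ between versions, a more defensive option is to spell out the column order yourself and assign plain values. This is a sketch rather than an official recommendation: it reuses the data from the question, and the variable name mask is mine.

```python
import pandas as pd

# Same values as the two frames shown in the question.
df = pd.DataFrame({"A": [3, 2, 8, 8, 4], "B": [1, 4, 8, 6, 5], "C": [6, 0, 0, 7, 0]})
df2 = pd.DataFrame({"C": [3, 7, 0, 6, 4], "B": [5, 4, 7, 6, 0], "A": [5, 6, 7, 5, 6]})

mask = df["A"].gt(3)

# Reorder df2's columns to match df, keep only the masked rows, then drop the
# labels with to_numpy() so no implicit alignment is involved at all.
df.loc[mask, :] = df2.loc[mask, df.columns].to_numpy()
print(df)
```

Rows 2 to 4 end up as [7, 7, 0], [5, 6, 6] and [6, 0, 4], the same result the question gets when the column list is passed to loc.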
I need to break up a column in a DataFrame that currently holds multiple values per record (someone else's Excel sheet, unfortunately) for a categorical data field that can take multiple values.
As you can see below, the column has 15 category codes, listed in the column header.
Original DataFrame
I want to split the column based on the category codes in the column header ['Pamphlet'], and then map the values collected for each record in the original column to their respective new columns as 1 for checked and 0 for unchecked, instead of the raw value [1,2,4,5].
This is the code to split on the commas between values, but I still need to put the results into the new columns, which I need to set up by splitting the ['Pamphlet'] column by the values in its header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].
df_old['Pamphlets'].str.split(pat=',', n=-1, expand=True)
Shape of desired DataFrame
If I could just get an outline of the best approach, or whether it is even possible to do this within pandas, that would be great. Thanks.
You need to go through your columns one by one and split the headers, then create a new dataframe for each column made up of the split columns, then join all of that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First, let's define the dataframe. (It would be much appreciated if, in future questions, you supplied a reproducible dataframe and any other data.)
data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now, let's build the result manually for the first column. Afterwards we'll turn this next part into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0 2
1 1
2 3
3 2
4 3
5 1
6 3
7 2
8 3
9 1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
1) Mail 2) Email 3) At PAC/TPAC
0 2 2 2
1 1 1 1
2 3 3 3
3 2 2 2
4 3 3 3
5 1 1 1
6 3 3 3
7 2 2 2
8 3 3 3
9 1 1 1
Create a copy to make changes to the values.
df_res = df.copy()
Go through the column headers, get the first number, then filter and apply bool.
for col in df.columns:
    # Parse the full number before ")" so codes of 10 and above also work.
    value = int(col.split(")")[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0
print(df_res)
1) Mail 2) Email 3) At PAC/TPAC
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 1 0
8 0 0 1
9 1 0 0
Now we have split a column into its components and assigned a bool value.
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the leading number, then filter and apply bool.
    for col in df.columns:
        # Parse the full number before ")" so codes of 10 and above also work.
        value = int(col.split(")")[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
1) Mail 2) Email 3) At PAC/TPAC 1) ACC 2) IM 3) PT 4) Smoking, 5) Cessation
0 0 1 0 0 0 0 0 1
1 1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0
4 0 0 1 0 1 0 0 0
5 1 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
7 0 1 0 0 0 0 1 0
8 0 0 1 0 0 1 0 0
9 1 0 0 0 1 0 0 0
Here is the full code...
import pandas as pd

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)

def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]
    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the leading number, then filter and apply bool.
    for col in df.columns:
        # Parse the full number before ")" so codes of 10 and above also work.
        value = int(col.split(")")[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res

for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
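The question also mentions cells that hold several comma-separated codes (e.g. 1,2,4,5). The code above assumes one code per cell, so it would not cover that case. One possible way to handle it is Series.str.get_dummies, sketched below. The shortened header, the "4) Other" label, the sample values and the names labels and codes are all invented for illustration, not taken from the real sheet.

```python
import pandas as pd

# Hypothetical header and data; "4) Other" is a made-up label for illustration.
header = "1) OSA\n2) Nutrition\n3) Activity\n4) Other"
df_old = pd.DataFrame({header: ["1,2", "3", "1, 2,3"]})

# Split the header into cleaned labels and pull out the code before ")".
labels = [part.strip() for part in header.split("\n")]
codes = [label.split(")")[0] for label in labels]

# Remove stray spaces, then build one 0/1 indicator column per code found.
cleaned = df_old[header].str.replace(" ", "", regex=False)
dummies = cleaned.str.get_dummies(sep=",")

# Put the columns in header order; codes never ticked become all-zero columns.
dummies = dummies.reindex(columns=codes, fill_value=0)
dummies.columns = labels
print(dummies)
```

The reindex step matters because get_dummies only creates columns for codes that actually occur, and it sorts them as strings (so "10" would come before "2").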
I have to create an id in a pandas df where a counter resets itself.
My data looks like
counter
0
0
1
1
1
2
0
1
1
My desired output looks like
counter id
0 0
0 0
1 1
1 1
1 1
2 2
0 3
1 4
1 4
I have tried the following, which does not help. Any help will be appreciated.
df['id'] = df.groupby(df.counter.tolist(), sort=False).ngroup()
Check diff and cumsum
df['id'] = df['counter'].diff().ne(0).cumsum() - 1
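To see why this works, it may help to look at the intermediate steps. The sketch below uses shift instead of diff, which should be equivalent for this data, and the name changed is mine:

```python
import pandas as pd

df = pd.DataFrame({"counter": [0, 0, 1, 1, 1, 2, 0, 1, 1]})

# True wherever the value differs from the previous row; the first row is
# compared against NaN, so it always counts as a change.
changed = df["counter"].ne(df["counter"].shift())

# A running count of the changes numbers each consecutive run; subtract 1
# so the ids start at 0.
df["id"] = changed.cumsum() - 1
print(df)
```

Each True in changed starts a new id, so a counter that resets to 0 still gets a fresh id rather than reusing the earlier one.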
Another way of using itertools.groupby
from itertools import groupby
sum([ [y]*len(list(g)) for y,(_,g) in enumerate(groupby(df.counter))],[])
Out[46]: [0, 0, 1, 1, 1, 2, 3, 4, 4]
Let's say I have a dataframe df as below. To obtain the first 2 and last 2 rows in each group I have used groupby.nth:
df = pd.DataFrame({'A': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
'B': [1, 2, 3, 4, 5,6,7,8,1, 2, 3, 4, 5,6,7]}, columns=['A', 'B'])
df.groupby('A').nth([0,1,-2,-1])
Result:
B
A
a 1
a 2
a 7
a 8
b 1
b 2
b 6
b 7
I'm not sure how to obtain the middle 2 rows. For example, in group 'A' there are 8 instances so my middle would be 4, 5 (n/2, n/2+1) and group 'B' my middle rows would be 3, 4 (n/2-0.5, n/2+0.5). Any guidance is appreciated.
sacul's answer is nice. Here I just follow your own idea and define a custom function:
def middle(x):
    if len(x) % 2 == 0:
        return x.iloc[int(len(x) / 2) - 1:int(len(x) / 2) + 1]
    else:
        return x.iloc[int((len(x) / 2 - 0.5)) - 1:int(len(x) / 2 + 0.5)]
pd.concat([middle(y) for _ , y in df.groupby('A')])
Out[25]:
A B
3 a 4
4 a 5
10 b 3
11 b 4
You can use iloc to find the n//2 -1 and n//2 indices for each group (// is floor division):
g = df.groupby('A')
g.apply(lambda x: x['B'].iloc[[len(x)//2-1, len(x)//2]])
A
a 3 4
4 5
b 10 3
11 4
Name: B, dtype: int64
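If apply turns out to be slow on many groups, the same rows can probably be picked with a boolean mask built from cumcount and the group size. This is a sketch using the same n//2 - 1 and n//2 positions as above; the names pos, n and middle_rows are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a"] * 8 + ["b"] * 7,
    "B": [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7],
})

g = df.groupby("A")
pos = g.cumcount()            # 0-based position of each row within its group
n = g["B"].transform("size")  # group size, repeated on every row of the group

# Keep the rows at positions n//2 - 1 and n//2 of each group.
middle_rows = df[(pos == n // 2 - 1) | (pos == n // 2)]
print(middle_rows)
```

This keeps the original index and both columns, so the result lines up with the pd.concat output shown earlier.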
I need to check whether each column in a dataframe is of integer type and, if it is, multiply it by 10.
import numpy as np
import pandas as pd
df = pd.DataFrame(....)
#function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could have your specific check list for include=[... np.int64, ..., etc]
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
0 1 2 3 4 5
0 1 2 3 4.0 5 6
1 1 2 3 4.0 5 6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
0 1 2 3 4 5
0 10 20 30 4.0 5 6
1 10 20 30 4.0 5 6
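As for the error in the question: as far as I can tell, df.apply hands each column to xtimes as a Series, so `for col in x` walks over the cell values and x['GP'] is then treated as a label lookup, which fails. An alternative that avoids comparing dtypes against np.integer is pandas.api.types.is_integer_dtype. The sketch below assumes a small made-up frame; only the school column and the 'GP' value come from the error message, and the age and score columns are invented.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical sample data; "age" and "score" are invented columns.
df = pd.DataFrame({
    "school": ["GP", "MS"],
    "age": [15, 16],
    "score": [1.5, 2.5],
})

# Collect the names of columns whose dtype is any integer type.
int_cols = [c for c in df.columns if is_integer_dtype(df[c])]

# Multiply only those columns; strings and floats are left untouched.
df[int_cols] = df[int_cols] * 10
print(df)
```

Checking the dtype once per column, rather than per value, is also what makes this cheap on large frames.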