Using apply for multiple columns - pandas

I need to create 2 new columns based on 2 existing columns. I am trying to do it with a single apply call instead of 2 separate apply calls.
The initial DataFrame, for example, is as follows:
ID1 ID2
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
Next I try to create 2 new columns using the below method:
def funct(row):
    list1 = row.values
    print(list1[0])
    return row
df[['s1','s2']] = df[['ID1',"ID2"]].apply(lambda row: funct(row))
The issue is that I want to access the values individually, which I am unable to do. Here I tried converting the row into a list, but when I print list1[0] I get
1
11
How can I access 1 and 11 above? How should I index to get the individual values when I send two Series together using apply?
NOTE: funct() just returns the same row for now because I still don't know how to access the values in order to do something with them.

Add the parameter axis=1 to your apply call, so that the function receives one row at a time instead of one column at a time, like this:
import pandas as pd
from io import StringIO
s = """
,ID1,ID2
0,1,11
1,2,12
2,3,13
3,4,14
4,5,15
5,6,16
6,7,17
7,8,18
8,9,19
9,10,20
"""
df = pd.read_csv(StringIO(s),index_col=0)
def funct(row):
    # return row
    # update the answer
    return pd.Series([row.ID1 + 100, row.ID2 + 20])
df[['s1','s2']] = df[['ID1',"ID2"]].apply(funct, axis=1)
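To see why the question's code printed 1 and then 11, it may help to record what apply actually passes to the function. The sketch below is only an illustration: the helper names record_column and record_row are made up here, and the sample frame is rebuilt from the question's data. With the default axis=0 the function receives whole columns, and with axis=1 it receives single rows. The last two lines show that for plain arithmetic like the +100 and +20 used in the answer, ordinary column operations give the same result without apply.

```python
import pandas as pd

df = pd.DataFrame({"ID1": range(1, 11), "ID2": range(11, 21)})

seen_by_column = []  # labels received with the default axis=0
seen_by_row = []     # labels received with axis=1

def record_column(s):
    # s is a whole column here, so s.name is "ID1" or "ID2"
    seen_by_column.append(s.name)
    return s

def record_row(s):
    # s is a single row here, so s.name is that row's index label
    seen_by_row.append(s.name)
    return s

df[["ID1", "ID2"]].apply(record_column)       # default axis=0
df[["ID1", "ID2"]].apply(record_row, axis=1)  # row by row

# For simple arithmetic, column operations avoid apply altogether.
df["s1"] = df["ID1"] + 100
df["s2"] = df["ID2"] + 20
```

Inside a row-wise function, row["ID1"] and row["ID2"] (or row.ID1 and row.ID2) give the individual values.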

Related

Subtract a specific row from a csv using Python

I have two csv files: one containing data, the other one containing a single row with the same columns as the first file. I am trying to subtract the one row from the second file from all the rows from the first file using pandas.
I have tried the following, but to no avail.
df = df.subtract(row, axis=1)
You're looking for the "drop" method. From pandas docs:
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
drop by index:
df.drop([0, 1])
A B C D
2 8 9 10 11
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
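If the goal was instead arithmetic subtraction (taking the values of the single row away from every row of the data), one likely cause of the problem is that the row read from the second file is itself a one-row DataFrame, so subtract aligns on the row index as well as on the columns. The sketch below assumes that reading of the question; the small frames are invented stand-ins for the two CSV files. Selecting the row as a Series lets it broadcast across all rows.

```python
import pandas as pd

# Stand-ins for the two CSV files (invented data).
df = pd.DataFrame({"A": [0, 4, 8], "B": [1, 5, 9]})
row = pd.DataFrame({"A": [1], "B": [2]})  # one-row frame with the same columns

# row.iloc[0] is a Series indexed by column name,
# so it is matched on columns and subtracted from every row.
result = df.sub(row.iloc[0], axis=1)
```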

Replace value in pandas dataframe based on where condition [duplicate]

I have created a dataframe called df with this code:
import numpy as np
import pandas as pd
# initialize data of lists.
data = {'Feature1': [1, 2, -9999999, 4, 5],
        'Age': [20, 21, 19, 18, 34]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
The dataframe looks like this:
Feature1 Age
0 1 20
1 2 21
2 -9999999 19
3 4 18
4 5 34
Every time there is a value of -9999999 in column Feature1, I need to replace it with the corresponding value from column Age. So the output dataframe would look like this:
Feature1 Age
0 1 20
1 2 21
2 19 19
3 4 18
4 5 34
Bear in mind that the actual dataframe that I am using has 200K records (the one I have shown above is just an example).
How do I do that in pandas?
You can use np.where or Series.mask
df['Feature1'] = df['Feature1'].mask(df['Feature1'].eq(-9999999), df['Age'])
# or
df['Feature1'] = np.where(df['Feature1'].eq(-9999999), df['Age'], df['Feature1'])
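As a quick check, the two approaches can be run side by side on the question's sample data. This is a small sketch rather than a benchmark; the names SENTINEL, is_sentinel, via_mask and via_where are introduced here for illustration. It also shows a third variant, assigning through .loc, which only touches the matching rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Feature1": [1, 2, -9999999, 4, 5],
                   "Age": [20, 21, 19, 18, 34]})

SENTINEL = -9999999
is_sentinel = df["Feature1"].eq(SENTINEL)  # boolean mask of rows to replace

# mask: replace values where the condition is True
via_mask = df["Feature1"].mask(is_sentinel, df["Age"])

# np.where: pick Age where True, Feature1 otherwise (returns a NumPy array)
via_where = np.where(is_sentinel, df["Age"], df["Feature1"])

# .loc: overwrite only the matching rows in place
df.loc[is_sentinel, "Feature1"] = df.loc[is_sentinel, "Age"]
```

All three are vectorized, so they should scale to a few hundred thousand rows without a Python-level loop.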

Pandas Groupby and Apply

I am performing a groupby and apply over a dataframe, and it is returning some strange results. I am using pandas 1.3.1.
Here is the code:
ddf = pd.DataFrame({
    "id": [1,1,1,1,2]
})
def do_something(df):
    return "x"
ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x" but when this happens I get this data:
id title
0 1 NaN
1 1 x
2 1 x
3 1 NaN
4 2 NaN
Is this expected?
The result is not strange; it is the expected behavior. apply returns one value per group, and the group labels (here 1 and 2) become the index of the result:
>>> list(ddf.groupby("id"))
[(1,  # the group name (the future index of the grouped result)
     id   # the subset dataframe of group 1
  0   1
  1   1
  2   1
  3   1),
 (2,  # the group name (the future index of the grouped result)
     id   # the subset dataframe of group 2
  4   2)]
Why do some rows get a value at all? Because the group labels (1 and 2) also happen to exist in your dataframe's index, and the assignment aligns the result on that index:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
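The alignment can be made visible with a per-group result whose values differ between groups. The sketch below uses groupby(...).size() in place of do_something so that each group yields a distinct number; the column names aligned and mapped are made up for the illustration. Direct assignment matches the group labels against the row index, while mapping the id column through the per-group Series gives every row its group's value.

```python
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

# One value per group, indexed by the group label: {1: 4, 2: 1}
per_group = ddf.groupby("id").size()

# Direct assignment aligns on the index, so only the rows
# whose index label is 1 or 2 receive a value.
ddf["aligned"] = per_group

# Mapping looks up each row's id value in the per-group Series.
ddf["mapped"] = ddf["id"].map(per_group)
```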
Yes, it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem.
A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function (such as mean, max or sum) to use it. If you run one of them, for example like this:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
After that, do_something is applied to the groups 1 and 2 only, and the result is then aligned with your original df. This is why only index 1 and 2 get x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway, and taking a deeper look at the GroupBy object.
If you need a new column filled from an aggregate function, use GroupBy.transform. You must specify the column to process after the groupby, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign the new column inside the function:
def do_something(x):
    x['title'] = 'x'
    return x
ddf = ddf.groupby("id").apply(do_something)
The explanation of why the original code does not work is in the other answers.

Converting a pandas crosstab into a stacked dataframe (a regular table)

Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you have a stacked dataframe. First we convert it into a crosstab. Now I would like to revert to the original stacked dataframe. I searched for a question that addresses this requirement but could not find one that matches exactly. In case I have missed any, please leave a link to it in the comment section.
I would like to document the best practice here. So, thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach. But one needs to be careful about the "level" that stacking is applied to.
Input: Crosstab:
Label a b c d r
ID
1 0 1 0 0 0
2 1 1 0 1 1
3 1 0 0 0 1
4 1 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
8 1 0 1 0 0
9 0 1 0 0 0
Output: Stacked DataFrame:
ID Label
0 1 b
1 2 a
2 2 b
3 2 d
4 2 r
5 3 a
6 3 r
7 4 a
8 4 d
9 6 a
10 7 c
11 8 a
12 8 c
13 9 b
Step-by-step Explanation:
First, let's make a function that would create our data. Note that it randomly generates the stacked dataframe, and so, the final output may differ from what I have given below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd
# Make stacked dataframe
def _create_df():
    """
    This dataframe will be used to create a crosstab
    """
    B = np.array(list('abracadabra'))
    A = np.arange(len(B))
    AB = list()
    for i in range(20):
        a = np.random.randint(1,10)
        b = np.random.randint(1,10)
        AB += [(a,b)]
    AB = np.unique(np.array(AB), axis=0)
    AB = np.unique(np.array(list(zip(A[AB[:,0]], B[AB[:,1]]))), axis=0)
    AB_df = pd.DataFrame({'ID': AB[:,0], 'Label': AB[:,1]})
    return AB_df
original_stacked_df = _create_df()
# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
What to expect?
You would expect a function that regenerates the stacked dataframe from the crosstab. I will provide my own solution in the answer section. If you can suggest something better, that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:
You can just use stack (drop(columns=0) replaces the positional drop(0,1), which pandas 2.0 no longer accepts):
df[df.astype(bool)].stack().reset_index().drop(columns=0)
The following produces the desired outcome.
def crosstab2stacked(crosstab):
    stacked = crosstab.stack(dropna=True).reset_index()
    stacked = stacked[stacked.replace(0,np.nan)[0].notnull()].drop(columns=[0])
    return stacked
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
# Reconstruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab = crosstab_df)
Check if original == reconstructed:
np.all(original_stacked_df == recon_stacked_df)  # np.alltrue was removed in NumPy 2.0
Output: True
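For a deterministic check that does not depend on random data or on the behavior of stack's dropna argument, the round trip can also be sketched with melt. This is an alternative to the stack-based answers, not the method they use; the small input frame and the names long_df and recon are invented for the example. It assumes each (ID, Label) pair occurs at most once, since a crosstab count above 1 would collapse back to a single row.

```python
import pandas as pd

# Small, fixed stacked frame (invented data, each pair occurs once).
stacked = pd.DataFrame({"ID": [1, 2, 2, 3],
                        "Label": ["b", "a", "b", "a"]})

crosstab = pd.crosstab(stacked["ID"], stacked["Label"])

# Wide to long: one row per (ID, Label) cell with its count.
long_df = crosstab.reset_index().melt(
    id_vars="ID", var_name="Label", value_name="count"
)

# Keep only cells that were present, then restore the original ordering.
recon = (
    long_df[long_df["count"] > 0]
    .drop(columns="count")
    .sort_values(["ID", "Label"])
    .reset_index(drop=True)
)
```

If duplicates are possible, the count column could be used to repeat rows before dropping it.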

How to get the mode of a column in pandas when several values tie for the mode

I have a data frame and I'd like to get the mode of a specific column.
I'm using:
freq_mode = df.mode()['my_col'][0]
However I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because several values tie for the mode.
I need any one of the modes; it doesn't matter which. How can I use any() to get any of the modes that exist?
For me, your code works fine with sample data.
If you need to select the first value of the Series returned by mode, use:
freq_mode = df['my_col'].mode().iat[0]
We can look at the mode of a single column; when several values tie, mode returns all of them:
df=pd.DataFrame({"A":[14,4,5,4,1,5],
                 "B":[5,2,54,3,2,7],
                 "C":[20,20,7,3,8,7],
                 "train_label":[7,7,6,6,6,7]})
X=df['train_label'].mode()
print(X)
DataFrame
A B C train_label
0 14 5 20 7
1 4 2 20 7
2 5 54 7 6
3 4 3 3 6
4 1 2 8 6
5 5 7 7 7
Output
0 6
1 7
dtype: int64
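Since mode returns a Series of all tied values, picking "any one" comes down to taking its first element. The sketch below wraps that in a small helper; first_mode and its default parameter are hypothetical names introduced here, not pandas API. The guard covers the case where the column is empty, in which case mode returns an empty Series and .iat[0] would raise an IndexError.

```python
import pandas as pd

def first_mode(series, default=None):
    """Return one mode of the Series (the first one), or default if there is none."""
    modes = series.mode()  # all tied modes, as a Series
    if modes.empty:
        return default
    return modes.iat[0]

labels = pd.Series([7, 7, 6, 6, 6, 7])
tied = labels.mode().tolist()      # both 6 and 7 occur three times
picked = first_mode(labels)

empty_result = first_mode(pd.Series([], dtype="float64"))
```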