How to apply a function to multiple columns in Pandas [duplicate] - pandas

This question already has answers here:
Selecting multiple columns in a Pandas dataframe
(22 answers)
Closed 4 years ago.
I have a bunch of columns that require cleaning in Pandas. I've written a function that does the cleaning, but I'm not sure how to apply the same function to many columns. Here is what I'm trying:
df["Passengers", "Revenue", "Cost"].apply(convert_dash_comma_into_float)
But I'm getting a KeyError.

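Use double brackets [[]], as @chrisz points out. A single pair of brackets with several names is read as one tuple key, hence the KeyError; a list of names selects a sub-DataFrame.
Here is an MCVE: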
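import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, -1), columns=['A', 'B', 'C'])

def f(x):
    # Replace the even numbers in a column with 0.
    return x.mask(x % 2 == 0, 0)

# apply() calls f once per selected column.
df[['B', 'C']] = df[['B', 'C']].apply(f)
print(df)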
Output
    A   B   C
0   0   1   0
1   3   0   5
2   6   7   0
3   9   0  11
4  12  13   0
5  15   0  17
6  18  19   0
7  21   0  23
8  24  25   0
9  27   0  29
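Applied to the columns named in the question, the same pattern might look like the sketch below. The body of convert_dash_comma_into_float is not shown in the question, so the version here is a hypothetical stand-in (it assumes "-" marks a missing value and "," is a thousands separator), and the sample data is made up.

```python
import pandas as pd

def convert_dash_comma_into_float(col):
    # Hypothetical stand-in for the asker's function:
    # strip thousands separators, turn a lone "-" into NaN, parse as numbers.
    cleaned = col.astype(str).str.replace(",", "", regex=False)
    cleaned = cleaned.where(cleaned != "-")  # a lone "-" becomes NaN
    return pd.to_numeric(cleaned)

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "Passengers": ["1,200", "-", "85"],
    "Revenue": ["3,400.5", "12", "-"],
    "Cost": ["-", "1,000", "7"],
    "Route": ["a", "b", "c"],
})

cols = ["Passengers", "Revenue", "Cost"]
# Note the list inside the brackets: df[cols] is a DataFrame,
# and apply() runs the function on each of its columns.
df[cols] = df[cols].apply(convert_dash_comma_into_float)
print(df)
```

Columns that are not listed in cols (here the made-up "Route") are left untouched.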

How to split pandas data frame by repeating rows? [duplicate]

This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 1 year ago.
I have a DataFrame like this:
1 a 12
2 a 3
3 b 45
4 b 34
5 b 23
and I need to split it into two DataFrames like this:
1 a 12
2 a 3
and
3 b 45
4 b 34
5 b 23
Does anyone know a reasonably quick way?
Try groupby. Iterating over it yields (key, sub-DataFrame) pairs, which you can collect into a dict:
d = {x: y for x, y in df.groupby('col')}
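A minimal runnable sketch of that one-liner on the data from the question. The question does not show column names, so "col" and "val" are assumptions.

```python
import pandas as pd

# Column names "col" and "val" are assumed; the question shows only the values.
df = pd.DataFrame(
    {"col": ["a", "a", "b", "b", "b"], "val": [12, 3, 45, 34, 23]},
    index=[1, 2, 3, 4, 5],
)

# Each group keeps its original rows and index labels.
parts = {key: group for key, group in df.groupby("col")}

print(parts["a"])
print(parts["b"])
```

Each value in parts is an ordinary DataFrame, so parts["a"] and parts["b"] can be used independently afterwards.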

How to create an OD matrix from a pandas Data Frame only with specific columns

I have this data frame, shown in the picture below. I need to create an Origin-Destination matrix in which the row axis holds the date and the values of "From municipality code", the column axis holds the values of "To municipality code", and the cells are filled with the values of the "count" column. How do you get such a matrix from a pandas DataFrame?
result_final.head()
ODMatrix = pd.DataFrame(0, index=list(range(0, 202708)), columns=list(range(0, 202708))).add(
    df.pivot_table(values='count', index="from_municipality_code",
                   columns='to_municipality_code', aggfunc=len),
    fill_value=0).astype('int')
I tried to convert the pandas data frame into numpy array but it did not work.
result_final[['date', 'from_municipality_code','to_municipality_code','count','Lng_x','Lat_x','Lng_y','Lat_y',]].to_numpy()
This is the final matrix I want if this helps to visualize:
You can use the pivot_table method. Here is a working example:
import pandas as pd
import numpy as np
# Some example data
df = pd.DataFrame({"from": np.random.randint(0, 10, (1000,)), "to": np.random.randint(0, 10, (1000,))})
# Remove examples where from == to
df = df.loc[df["from"] != df["to"]].copy()
# The key operation
matrix = (
    df.assign(count=1)
    .pivot_table(index="from", columns="to", values="count", aggfunc="count")
    .fillna(0)
    .astype(int)
)
print(matrix)
to     0   1   2   3   4   5   6   7   8   9
from
0      0  10  14   7   9  14  18   6  11   8
1     11   0  12   7   4  12   9  11   6  13
2      6  14   0  12  13   8   5  15  11  10
3     10   9  12   0  14  10   8  14   9  11
4     10  14  14  11   0   8   4  10  11   4
5     15  10  10  18   8   0  15  15   8  12
6      9   7  10  13  10   8   0  11  12  10
7      9  12   4   6   9   9   8   0   8  12
8      8   8  11  12  15  10  11   4   0   6
9     10  13  11  16  14  18  11   9   4   0
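The example above counts rows. Since the question already has a "count" column and also wants the date on the row axis, a variant closer to the asker's data might look like this sketch. The column names come from the question; the rows are made up, and summing "count" (rather than counting rows) is an assumption about what the matrix should contain.

```python
import pandas as pd

# Made-up rows; only the column names are taken from the question.
result_final = pd.DataFrame({
    "date": ["2020-01-01"] * 3 + ["2020-01-02"] * 2,
    "from_municipality_code": [101, 101, 102, 101, 102],
    "to_municipality_code": [102, 103, 101, 102, 103],
    "count": [5, 2, 7, 1, 4],
})

# Rows: (date, origin); columns: destination; cells: summed "count".
od_matrix = result_final.pivot_table(
    index=["date", "from_municipality_code"],
    columns="to_municipality_code",
    values="count",
    aggfunc="sum",
    fill_value=0,  # origin/destination pairs with no rows become 0
)
print(od_matrix)
```

If a plain array is needed afterwards, od_matrix.to_numpy() returns the cell values without the row and column labels.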

How to get pandas crosstab margins value? [duplicate]

This question already has answers here:
Panda .loc or .iloc to select the columns from a dataset
(2 answers)
How are iloc and loc different?
(6 answers)
Closed 4 years ago.
I have a crosstab DataFrame that looks like this:
Batch    1   2   3   4  All
Fruits
Orange   2   3   4   5   14
Mango    3   2   1   7   13
Grape    2   2   2   2    8
Apple    5   5   8   9   27
All     13  14  18  27   62
The 'All' column and row are generated by the margins parameter of pandas crosstab. How can I get the 'All' values by column, that is 13, 14, 18 and 27?
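One possible approach, in line with the linked .loc/.iloc questions, is to select the 'All' row by label and then drop its 'All' entry. The sketch below is not from the original thread; it rebuilds the table from the question by hand (values copied as printed) instead of calling pd.crosstab on raw data, which the question does not show.

```python
import pandas as pd

# The crosstab from the question, typed in as printed there.
ct = pd.DataFrame(
    {
        1: [2, 3, 2, 5, 13],
        2: [3, 2, 2, 5, 14],
        3: [4, 1, 2, 8, 18],
        4: [5, 7, 2, 9, 27],
        "All": [14, 13, 8, 27, 62],
    },
    index=pd.Index(["Orange", "Mango", "Grape", "Apple", "All"], name="Fruits"),
)
ct.columns.name = "Batch"

# The margins row, without the grand total in the 'All' column.
column_totals = ct.loc["All"].drop("All")
print(column_totals.tolist())

# The margins column, without the grand total in the 'All' row.
row_totals = ct["All"].drop("All")
print(row_totals.tolist())
```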

Setting to first rows of pandas DataFrame

I would like to set a value in some column for the first n rows of a pandas DataFrame.
>>> example = pd.DataFrame({'number':range(10),'name':list('aaabbbcccc')},index=range(20,0,-2)) # nontrivial index
>>> example
   name  number
20    a       0
18    a       1
16    a       2
14    b       3
12    b       4
10    b       5
8     c       6
6     c       7
4     c       8
2     c       9
I would like to set "number" for the first, say, 5 rows to the number 19. What I really want is to set the lowest values of "number" to that value, so I just sort first.
If my index was the trivial one, I could do
example.loc[:5-1,'number'] = 19 # -1 for inclusive indexing
# or
example.ix[:5-1,'number'] = 19
But since it's not, this would produce the following artifact (where all index values up to 4 have been chosen):
>>> example
   name  number
20    a      19
18    a      19
16    a      19
14    b      19
12    b      19
10    b      19
8     c      19
6     c      19
4     c      19
2     c       9
Using .iloc[] would be nice, except that it doesn't accept column names.
example.iloc[:5]['number'] = 19
works but gives a SettingWithCopyWarning.
My current solution is to do:
>>> example.sort_values('number',inplace=True)
>>> example.reset_index(drop=True,inplace=True)
>>> example.ix[:5-1,'number'] = 19
>>> example
  name  number
0    a      19
1    a      19
2    a      19
3    b      19
4    b      19
5    b       5
6    c       6
7    c       7
8    c       8
9    c       9
And since I have to repeat this for several columns, I have to do this a few times and reset the index each time, which also costs me my index (but never mind that).
Does anyone have a better solution?
I would use .iloc, since .loc might yield unexpected results if index labels are repeated:
example.iloc[:5, example.columns.get_loc('number')] = 19
If the index labels are unique, .loc with the first five labels also works:
example.loc[example.index[:5], 'number'] = 19
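A self-contained sketch of the .iloc variant on the question's data, followed by a variant that targets the n lowest values directly without sorting or resetting the index. The nsmallest variant is an addition, not part of the answer above, and it assumes unique index labels. Note also that .ix, used in the question, no longer exists in current pandas.

```python
import pandas as pd

def make_example():
    # Same data as in the question, with its non-trivial index.
    return pd.DataFrame(
        {"number": range(10), "name": list("aaabbbcccc")},
        index=range(20, 0, -2),
    )

n = 5

# Variant 1: purely positional, first n rows of the 'number' column.
example = make_example()
example.iloc[:n, example.columns.get_loc("number")] = 19
print(example)

# Variant 2 (assumes unique index labels): pick the labels of the
# n smallest values, so no sort or index reset is needed.
other = make_example()
lowest = other["number"].nsmallest(n).index
other.loc[lowest, "number"] = 19
print(other)
```

Both variants leave the original index intact and do not trigger a SettingWithCopyWarning, because each assignment is a single indexing operation.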

Pandas Dynamic Index Referencing during Calculation

I have the following data frame
val sum
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 0
6 7 0
I would like to calculate, for each row, the sum of val over that row and the next two rows. I need to do this for very big files. What is the most efficient way? The expected result is
val sum
0 1 6
1 2 9
2 3 12
3 4 15
4 5 18
5 6 13
6 7 7
In general, how can I dynamically reference other rows (via boolean operations) while making assignments?
> pd.rolling_sum(df['val'], window=3).shift(-2)
0 6
1 9
2 12
3 15
4 18
5 NaN
6 NaN
If you want the last values to be "filled in" then you'll need to tack on NaN's to the end of your dataframe.
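pd.rolling_sum has since been removed from pandas; the method form Series.rolling(...).sum() does the same job. The sketch below first reproduces the answer's result and then shows one way (an alternative to padding with NaNs, not the answer's own method) to get the partial sums of the expected result: run the rolling window over the reversed series with min_periods=1.

```python
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 4, 5, 6, 7]})

# Equivalent of the answer's pd.rolling_sum(...).shift(-2):
# the last two rows have no full window and stay NaN.
forward = df["val"].rolling(window=3).sum().shift(-2)
print(forward)

# Reverse, roll with partial windows allowed, reverse back.
# Assignment aligns on the index, so the order is restored.
reversed_sum = df["val"].iloc[::-1].rolling(window=3, min_periods=1).sum()
df["sum"] = reversed_sum.iloc[::-1].astype(int)
print(df)
```

Both versions are vectorized, so they avoid an explicit Python loop over rows.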