Faster way to operate columns with if conditions - pandas

I need to operate on a column with an IF as shown in my code. It takes quite some time to compute; is there a faster, cleaner way to do this?
For reference, the column "coin" has pairs like "ETH_ARS", "DAI_USD" and so on; that's why I split it.
for i in range(merged.shape[0]):
    x = merged["coin"].iloc[i]
    if x.split("_")[1] == "ARS":
        merged["total"].iloc[i] = (
            merged["price"].iloc[i]
            * merged["amount"].iloc[i]
            / merged["valueUSD"].iloc[i]
        )
    else:
        merged["total"].iloc[i] = merged["price"].iloc[i] * merged["amount"].iloc[i]

You can vectorize your code. The trick here is to set valueUSD=1 when the coin column ends with USD. After that, the operation is the same for all rows: total = price * amount / valueUSD.
Set up an MRE (minimal reproducible example):
data = {'coin': ['ETH_ARS', 'DAI_USD'],
        'price': [10, 12],
        'amount': [3, 4],
        'valueUSD': [2, 7]}
df = pd.DataFrame(data)
print(df)
# Output:
      coin  price  amount  valueUSD
0  ETH_ARS     10       3         2
1  DAI_USD     12       4         7  # <- should be set to 1 for division
valueUSD = df['valueUSD'].mask(df['coin'].str.split('_').str[1].eq('USD'), other=1)
df['total'] = df['price'] * df['amount'] / valueUSD
print(df)
# Output:
      coin  price  amount  valueUSD  total
0  ETH_ARS     10       3         2   15.0  # = 10 * 3 / 2
1  DAI_USD     12       4         7   48.0  # = 12 * 4 / 1 (7 -> 1)
Here, mask replaces the valueUSD entries by 1 wherever the coin ends with USD, thanks to other=1 (instead of the default NaN):
>>> valueUSD
0    2
1    1  # 7 -> 1
Name: valueUSD, dtype: int64
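As an equivalent alternative, here is a minimal sketch using numpy.where instead of mask, assuming the same column names as in the MRE above:
import numpy as np
import pandas as pd

# Divide by 1 where the quote currency is USD, otherwise by valueUSD.
df['total'] = df['price'] * df['amount'] / np.where(
    df['coin'].str.endswith('USD'), 1, df['valueUSD'])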


groupby to show same row value from other columns

After grouping by the "Mode" column and taking the max and min of "Indicator", how can I bring the corresponding values from the other columns into the same dataframe, like below:
df = pd.read_csv(r'relative.csv')
Grouped = df.groupby('Mode')['Indicator'].agg(['max', 'min'])
print(Grouped)
(From Google it seems something like a col_value or row_value function might work, but that looks more complicated. Could someone help solve this in an easy way? Thank you.)
You can do it in two steps, using groupby with idxmin() and idxmax():
# Create a df with the min values of 'Indicator', renaming the column 'Value' to 'B'
df_min = df.loc[df.groupby('Mode')['Indicator'].idxmin()].reset_index(drop=True).rename(columns={'Indicator': 'min', 'Value': 'B'})
print(df_min)
#   Mode  min  B
# 0    A    1  6
# 1    B    1  7

# Create a df with the max values of 'Indicator', renaming the column 'Value' to 'A'
df_max = df.loc[df.groupby('Mode')['Indicator'].idxmax()].reset_index(drop=True).rename(columns={'Indicator': 'max', 'Value': 'A'})
print(df_max)
#   Mode  max  A
# 0    A    3  2
# 1    B    4  3

# Merge the dataframes together on 'Mode' (df_min/df_max avoid shadowing the builtins min/max)
result = pd.merge(df_min, df_max)
# Reorder the columns to match the expected output
print(result[['Mode', 'max', 'min', 'A', 'B']])
#   Mode  max  min  A  B
# 0    A    3    1  2  6
# 1    B    4    1  3  7
The logic is unclear: there is no real reason to call the new columns A/B, since the 6/3 values in them do not come from columns named A or B.
I assume you want to achieve:
(df.groupby('Mode')['Indicator'].agg(['idxmax', 'idxmin'])      # row labels of each group's max/min
   .rename(columns={'idxmin': 'min', 'idxmax': 'max'}).stack()  # long form: one row per (Mode, max/min)
   .to_frame('x').merge(df, left_on='x', right_index=True)      # pull the matching original rows back in
   .drop(columns=['x', 'Mode']).unstack()                       # back to wide: one row per Mode
)
Output:
     Indicator     Value
           max min   max min
Mode
A            3   1     2   6
B            4   1     3   7
C           10  10    20  20
Used input:
  Mode  Indicator  Value
0    A          1      6
1    A          2      5
2    A          3      2
3    B          4      3
4    B          3      6
5    B          2      8
6    B          1      7
7    C         10     20
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "Mode": ["A", "A", "A", "B", "B", "B", "B"],
        "Indicator": [1, 2, 3, 4, 3, 2, 1],
        "Value": [6, 5, 2, 3, 6, 8, 7],
    }
)
new_df = df.groupby("Mode")["Indicator"].agg(["max", "min"])
print(new_df)
# Output
      max  min
Mode
A       3    1
B       4    1
Here is one way to do it with product from the Python standard library's itertools module and the pandas at property:
from itertools import product

for row, (col, func) in product(["A", "B"], [("A", "max"), ("B", "min")]):
    new_df.at[row, col] = df.loc[
        (df["Mode"] == row) & (df["Indicator"] == new_df.loc[row, func]), "Value"
    ].values[0]
new_df = new_df.astype(int)
Then:
print(new_df)
# Output
      max  min  A  B
Mode
A       3    1  2  6
B       4    1  3  7
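For comparison, a merge-based sketch under the same column names (Mode/Indicator/Value, using the "Used input" dataframe above); note it assumes the per-group max/min Indicator values are unique:
# Look up the Value sitting next to each group's max/min Indicator
# by merging the aggregates back onto the original rows.
stats = df.groupby('Mode')['Indicator'].agg(['max', 'min']).reset_index()
stats = stats.merge(df.rename(columns={'Value': 'A'}),
                    left_on=['Mode', 'max'],
                    right_on=['Mode', 'Indicator']).drop(columns='Indicator')
stats = stats.merge(df.rename(columns={'Value': 'B'}),
                    left_on=['Mode', 'min'],
                    right_on=['Mode', 'Indicator']).drop(columns='Indicator')
print(stats)  # columns: Mode, max, min, A, B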

subset df by masking between specific rows

I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is that these values can be at different rows, so I can't select fixed positions.
Specifically, I want to remove the rows that fall between ABC xxx and the integer 5. These pairs could fall anywhere in the df, and the spans can be of unequal length.
Note: the string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values, but a mask might work better if I could return all the rows between these two values?
df = pd.DataFrame({
    'Val': ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
     Val
0   None
8      X
9      1
10     2
15     Y
16     1
17     2
If ABC always comes before 5 and they always form (ABC, 5) pairs: get the positions of both values with np.where, zip them, build the index ranges in between, and finally filter with isin, inverting the mask with ~:
# 2 occurrences of (ABC, 5) in the data
df = pd.DataFrame({
    'Val': ['None','ABC','None',1,2,3,4,5,'None','None','None',
            'None','ABC','None',1,2,3,4,5,'None','None','None']
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
     Val
0   None
8   None
9   None
10  None
11  None
19  None
20  None
21  None
a = df.index[df['Val'].str.contains('ABC') == True][0]
b = df.index[df['Val'] == 5][0] + 1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]
Output
     Val
0   None
8      X
9      1
10     2
11   ABC
12     1
13     4
14     5
15     Y
16     1
17     2
If there is more than one 'ABC' and 5, use the version below.
With this you drop everything from the first ABC through the last 5:
a = (df['Val'].str.contains('ABC') == True).idxmax()
b = df['Val'].where(df['Val'] == 5).last_valid_index() + 1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]
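If the spans can repeat an arbitrary number of times, a self-contained sketch with a cumulative counter avoids the index arithmetic altogether (assuming every 'ABC' is eventually followed by a 5):
import pandas as pd

df = pd.DataFrame({
    'Val': ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
starts = df['Val'].astype(str).str.contains('ABC')  # span openers
stops = df['Val'].eq(5)                             # span closers
# A row is inside a span while more spans have opened than closed
# before it; the closing 5 itself still counts as inside.
inside = (starts.cumsum() - stops.cumsum().shift(fill_value=0)) > 0
print(df[~inside])  # leaves rows 0, 8-10 and 15-17, as intended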

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10
import numpy as np
import pandas as pd
df = pd.DataFrame(....)

# function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]

# using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to pick out the integer columns and then multiply (note that including np.number would also catch float columns):
In [1284]: df[df.select_dtypes(include=['int16', 'int32', 'int64']).columns] *= 10
You can adjust the include list to your specific checks, e.g. include=[np.int64, ...].
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df

   0  1  2    3  4  5
0  1  2  3  4.0  5  6
1  1  2  3  4.0  5  6
The dtypes are
df.dtypes
0      int32
1      int16
2      int64
3    float64
4     object
5     object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0     True
1     True
2     True
3    False
4    False
5    False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
    0   1   2    3  4  5
0  10  20  30  4.0  5  6
1  10  20  30  4.0  5  6
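A minimal sketch of an equivalent that avoids the numpy dtype-comparison trick, using pandas' public type-checking helper instead (assuming df is the demo frame above):
import pandas as pd

# Multiply only the integer-typed columns by 10.
int_cols = [c for c in df.columns if pd.api.types.is_integer_dtype(df[c])]
df[int_cols] = df[int_cols] * 10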

How to calculate multiple columns from multiple columns in pandas

I am trying to calculate multiple columns from multiple columns in a pandas dataframe using a function.
The function takes three arguments -a-, -b-, and -c- and returns three calculated values -sum-, -prod- and -quot-. In my pandas dataframe I have three columns -a-, -b- and -c- from which I want to calculate the columns -sum-, -prod- and -quot-.
The mapping that I do works only when I have exactly three rows. I do not know what is going wrong, although I expect it has something to do with selecting the correct axis. Could someone explain what is happening and how I can calculate the values that I would like to have?
Below are the situations that I have tested.
INITIAL VALUES
def sum_prod_quot(a, b, c):
    sum = a + b + c
    prod = a * b * c
    quot = a / b / c
    return (sum, prod, quot)

df = pd.DataFrame({'a': [20, 100, 18],
                   'b': [ 5,  10,  3],
                   'c': [ 2,  10,  6],
                   'd': [ 1,   2,  3]
                   })
df
     a   b   c  d
0   20   5   2  1
1  100  10  10  2
2   18   3   6  3
CALCULATION STEPS
Using exactly three rows
When I calculate three columns from this dataframe using the function, I get:
df['sum'], df['prod'], df['quot'] = \
    list( map(sum_prod_quot, df['a'], df['b'], df['c']))
df
     a   b   c  d    sum     prod   quot
0   20   5   2  1   27.0    120.0   27.0
1  100  10  10  2  200.0  10000.0  324.0
2   18   3   6  3    2.0      1.0    1.0
This is exactly the result that I want to have: The sum-column has the sum of the elements in the columns a,b,c; the prod-column has the product of the elements in the columns a,b,c and the quot-column has the quotients of the elements in the columns a,b,c.
Using more than three rows
When I expand the dataframe with one row, I get an error!
The data frame is defined as:
df = pd.DataFrame({'a': [20, 100, 18, 40],
                   'b': [ 5,  10,  3, 10],
                   'c': [ 2,  10,  6,  4],
                   'd': [ 1,   2,  3,  4]
                   })
df
     a   b   c  d
0   20   5   2  1
1  100  10  10  2
2   18   3   6  3
3   40  10   4  4
The call is
df['sum'], df['prod'], df['quot'] = \
    list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: too many values to unpack (expected 3)
while I would expect an extra row:
df
     a   b   c  d    sum     prod   quot
0   20   5   2  1   27.0    120.0   27.0
1  100  10  10  2  200.0  10000.0  324.0
2   18   3   6  3    2.0      1.0    1.0
3   40  10   4  4   54.0   1600.0    1.0
Using less than three rows
When I reduce the dataframe by one row, I also get an error.
The dataframe is defined as:
df = pd.DataFrame({'a': [20, 100],
                   'b': [ 5,  10],
                   'c': [ 2,  10],
                   'd': [ 1,   2]
                   })
df
     a   b   c  d
0   20   5   2  1
1  100  10  10  2
The call is
df['sum'], df['prod'], df['quot'] = \
    list( map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: need more than 2 values to unpack
while I would expect a row less:
df
     a   b   c  d    sum     prod   quot
0   20   5   2  1   27.0    120.0   27.0
1  100  10  10  2  200.0  10000.0  324.0
QUESTIONS
The questions I have:
1) Why do I get these errors?
2) How do I have to modify the call such that I get the desired data frame?
NOTE
In this link a similar question is asked, but the given answer did not work for me.
The shown result doesn't look correct even for 3 rows. Check the values outside the first row and first column: the product 20 * 5 * 2 is NOT 120, it's 200, and it lands in the sum column of the row below. You need to transpose the per-row tuples into columns before assigning. You can use the following to set the new columns:
df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
For details follow the link
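A minimal sketch of the transposition that zip(*...) performs, using the per-row tuples produced for the question's three-row dataframe:
# map yields one (sum, prod, quot) tuple per row ...
rows = [(27, 200, 2.0), (120, 10000, 1.0), (27, 324, 1.0)]
# ... and zip(*rows) turns the row-tuples into three column-tuples,
# one per new column, regardless of how many rows there are.
sums, prods, quots = zip(*rows)
print(sums)   # (27, 120, 27)
print(prods)  # (200, 10000, 324)
print(quots)  # (2.0, 1.0, 1.0)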

Convert ordered levels to numeric in pandas

I was wondering, is there any function in pandas that allows me to do this?
I have a column with the levels [low, medium, high].
I would like to translate them to [1, 2, 3] to perform linear regression. However, what I am currently doing is df[df['interest_level'] == 'low'] = 1. Is there a better way of doing this?
Thanks.
Use the pd.factorize() method:
df['interest_level'] = pd.factorize(df['interest_level'])[0]
You can also convert the new numerical values to a categorical dtype (this might save a lot of memory):
Sample DataFrame:
In [34]: df = pd.DataFrame({'interest_level':np.random.choice(['medium','high','low'], 10)})
In [35]: df
Out[35]:
interest_level
0 high
1 low
2 medium
3 high
4 low
5 high
6 high
7 low
8 low
9 medium
Solution:
In [36]: df['interest_level'], cats = pd.factorize(df['interest_level'])
In [37]: df['interest_level'] = pd.Categorical(df['interest_level'], categories=np.arange(len(cats)))
In [38]: df
Out[38]:
interest_level
0 0
1 1
2 2
3 0
4 1
5 0
6 0
7 1
8 1
9 2
In [39]: cats # this can be used for the backtracing ...
Out[39]: Index(['high', 'low', 'medium'], dtype='object')
In [40]: df.memory_usage()
Out[40]:
Index 80
interest_level 34 # <---- NOTE: only 34 bytes used for 10 integers
dtype: int64
In [41]: df.dtypes
Out[41]:
interest_level category
dtype: object
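One caveat, shown as a small sketch: factorize assigns codes by order of first appearance, not by the ordinal meaning of the levels, so low/medium/high will not necessarily map to increasing numbers:
import pandas as pd

codes, cats = pd.factorize(pd.Series(['medium', 'high', 'low']))
print(codes)  # [0 1 2] -- order of first appearance, not ordinal order
print(cats)   # Index(['medium', 'high', 'low'], dtype='object')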
You can use map:
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
Sample:
df = pd.DataFrame({'interest_level':['medium','high','low', 'low', 'medium']})
print (df)
interest_level
0 medium
1 high
2 low
3 low
4 medium
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
Another solution is to cast to an ordered Categorical and then use cat.codes (passing categories directly to astype was removed in newer pandas versions, so build a pd.CategoricalDtype first):
categories = ['low', 'medium', 'high']
cat_type = pd.CategoricalDtype(categories=categories, ordered=True)
df['interest_level'] = df['interest_level'].astype(cat_type).cat.codes + 1
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
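As a follow-up sketch: keeping the column as an ordered categorical (before taking cat.codes) also makes the levels directly comparable:
import pandas as pd

cat_type = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
s = pd.Series(['medium', 'high', 'low'], dtype=cat_type)
print(s >= 'medium')
# 0     True
# 1     True
# 2    False
# dtype: bool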