Pandas: Newbie question on compare and (re)calculate fields with pandas

What I need to do is compare two fields in each row of a csv file.
The data looks like this:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
If "price" is equal to "retail_price", the retail_price field must be reduced by a given percentage, e.g. -10%,
so in the example data the first and last lines should be changed to 180 and 179.955.
I'm completely new to pandas, and after reading the "getting started" part I did not find anything I could build upon ...
so any help or hint (just point me in the right direction, I will figure it out myself) is appreciated.
Kind regards!

Use Series.eq to compare both values; where they are equal, multiply retail_price by 0.9, otherwise keep it unchanged, using numpy.where:
import numpy as np

mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(0.9), df['retail_price'])
print (df)
   store            ean   price  retail_price  quantity
0      1   888721396226  200.00       180.000         2
1      1   888721396233  200.00       159.000         2
2      1  2194384654084  299.00       259.000         7
3      1  2194384654091  199.95       179.955         8
Or you can use DataFrame.loc to multiply only the matched rows by 0.9:
mask = df['price'].eq(df['retail_price'])
df.loc[mask, 'retail_price'] *= 0.9
# which works like
df.loc[mask, 'retail_price'] = df.loc[mask, 'retail_price'] * 0.9
EDIT: to filter the rows that do not match the mask (the False values in mask), use:
df2 = df[~mask].copy()
print (df2)
   store            ean  price  retail_price  quantity
1      1   888721396233  200.0         159.0         2
2      1  2194384654084  299.0         259.0         7
print (mask)
0 True
1 False
2 False
3 True
dtype: bool

This is my code:
import pandas as pd
import numpy as np
import sys
# read the percentage from the static value in the file "prozente.txt" and build the multiplier
with open('prozente.txt', 'r') as f:
    prozente = int(f.readline())
mulvalue = 1 - (prozente / 100)
df = pd.read_csv('1.csv', sep=';', header=0,
                 names=['store', 'ean', 'price', 'retail_price', 'quantity'])
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(mulvalue).round(2), df['retail_price'])
df2 = df[~mask].copy()
df.to_csv('output.csv', columns=['store','ean','price','retail_price','quantity'],sep=';', index=False)
print(df)
print(df2)
using this as 1.csv:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
The content of file "prozente.txt" is
25
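As a side note (my addition, not part of the original question or answer), here is a minimal sketch of the same script that also keeps the leading zeros of store and ean by reading those columns as strings; with prozente = 25 (so mulvalue = 0.75), the two matching rows should end up with retail_price 150.0 and 149.96:
import pandas as pd
import numpy as np

# read the percentage (here 25) and build the multiplier (0.75)
with open('prozente.txt', 'r') as f:
    prozente = int(f.readline())
mulvalue = 1 - (prozente / 100)

# dtype=str for store and ean keeps the leading zeros intact
df = pd.read_csv('1.csv', sep=';', header=0,
                 names=['store', 'ean', 'price', 'retail_price', 'quantity'],
                 dtype={'store': str, 'ean': str})

mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(mulvalue).round(2), df['retail_price'])
df.to_csv('output.csv', sep=';', index=False)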

Related

Compare last digit

I have a csv file with columns as below:
Last_Price, Price, Marked
3,2.89,
1.99,2.09,
3.9,3.79,
I created a pandas dataframe named subdf.
I want to compare the last digit of the Last_Price and Price columns. If they are not equal, I assign 'X' to the 'Marked' column; otherwise I leave it blank or NaN.
I have tried:
subdf['Marked'] = np.where([x.strip()[-1] for x in subdf['Last_Price']] ==
                           [x.strip()[-1] for x in subdf['Price']],
                           'X', np.nan)
It says: AttributeError: 'float' object has no attribute 'strip'
I tried the below as well, but it didn't work. It doesn't capture the last digit.
I guess I need a for loop as well.
str(subdf['Price']).strip()[-1]
Here's an example. It assumes that all prices have two digits or fewer after the decimal point, and that for "3" you want to compare the "0" in "3.00", not the last digit "3".
Multiplying by 10 shifts the digits so that only the last decimal digit is left after the decimal point; a MOD (%) division by 1 then strips off everything in front of it.
Please note: I removed the spaces from the csv column headers to allow proper importing of the names.
import pandas as pd
import io
TESTDATA = """Last_Price,Price,Marked
3,2.89,
1.99,2.09,
3.9,3.79,"""
subdf = pd.read_csv(io.StringIO(TESTDATA), sep=",")
subdf.loc[((subdf['Last_Price'] * 10) % 1) !=
          ((subdf['Price'] * 10) % 1), 'Marked'] = 'X'
print(subdf)
The result
Last_Price Price Marked
0 3.00 2.89 X
1 1.99 2.09 NaN
2 3.90 3.79 X
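A small caveat (my addition, not part of the answer above): the modulo trick compares floating-point remainders, which can occasionally be thrown off by rounding. A hedged alternative is to convert the prices to integer cents first and compare the last digit of the cents:
import pandas as pd
import io

TESTDATA = """Last_Price,Price,Marked
3,2.89,
1.99,2.09,
3.9,3.79,"""
subdf = pd.read_csv(io.StringIO(TESTDATA), sep=",")

# convert to integer cents, then % 10 keeps only the last digit
last_digit = (subdf['Last_Price'] * 100).round().astype(int) % 10
price_digit = (subdf['Price'] * 100).round().astype(int) % 10
subdf.loc[last_digit != price_digit, 'Marked'] = 'X'
print(subdf)
This produces the same Marked column as the result above.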
If you really want to compare against the string representation, so that the "9" from "3.9" matches the "9" from "3.79", then read the dataframe in as str:
import pandas as pd
import io
TESTDATA = """Last_Price,Price,Marked
3,2.89,
1.99,2.09,
3.9,3.79,"""
subdf = pd.read_csv(io.StringIO(TESTDATA), sep=",", dtype='str')
subdf.loc[(subdf['Last_Price'].str.slice(-1)) !=
          (subdf['Price'].str.slice(-1)), 'Marked'] = 'X'
print(subdf)
New result
Last_Price Price Marked
0 3 2.89 X
1 1.99 2.09 NaN
2 3.9 3.79 NaN

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a newbie in Python, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing; dropna() drops a row if any value is missing, so the counts only match when you use any.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN

model_data
   a    b    c
0  1  1.0  1.0
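Regarding the hierarchical-indexing part of the question, here is a minimal sketch (my addition; the key names 'other', 'numeric' and 'categorical' are made up to mirror the three source frames). Passing keys= to pd.concat builds a column MultiIndex, so the original columns can be pulled back out of model_data afterwards:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})

# keys= adds a top level to the columns, producing a MultiIndex
model_data_with_na = pd.concat([df1, df2, df3], axis=1,
                               keys=['other', 'numeric', 'categorical'])
model_data = model_data_with_na.dropna()

# extract only the columns that came from the second frame
print(model_data['numeric'])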

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [100, 300, 500],
    'C': list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data and add the column back).
df_new = df.drop('C', axis=1)
df_c = df[['C']]
from sklearn import preprocessing
x = df_new.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_c
However, I am sure there is an easier way of doing this in pandas (given the column names that do not need to be normalised, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to pick out only the numeric columns, normalize them by subtracting the minimum and dividing by the range, and then assign back only the normalized columns:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply other functions to the numeric part of the data frame, you can use df[columns] = df[columns].apply(func).
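For instance, a minimal sketch (my own illustration, not from the answer) that z-score standardizes only the numeric columns this way:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500], 'C': list('abc')})

num_cols = df.select_dtypes(np.number).columns
# standardize each numeric column to zero mean and unit (sample) standard deviation
df[num_cols] = df[num_cols].apply(lambda s: (s - s.mean()) / s.std())
print(df)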

How do I set the cell values in one dataframe based on the values in two other dataframes?

I have three dataframes with the same shape, illustrated by the contrived data below. I want to iterate across df1 and set the value of each cell in the signals dataframe to 1 if the cell value in df1 is greater than the corresponding cell value in df2. Can someone illustrate how to accomplish that?
import pandas as pd
cols = ['ABC', 'DEF', 'GHI']
prices = [[12.22, 14.34, 98.34], [12.52, 15.34, 96.34], [13.12, 14.73, 97.47]]
prices_df1 = [[16.11, 18.12, 19.13], [16.21, 18.22, 19.23], [16.31, 18.32, 19.33]]
prices_df2 = [[12.22, 18.34, 17.34], [17.52, 18.34, 19.34], [13.12, 14.73, 16.47]]
mydates = ['09-15-2018', '09-16-2018', '09-17-2018']
signals = pd.DataFrame(index=mydates, columns=cols, data=0)
df1 = pd.DataFrame(index=mydates, columns=cols, data=prices_df1)
df2 = pd.DataFrame(index=mydates, columns=cols, data=prices_df2)
How do I set the signals dataframe to have a 1 where df1 > df2?
You can use df.where (note that this puts a 1 wherever df1 >= df2; with this data there are no ties, so it matches df1 > df2):
signals = signals.where(df1 < df2).fillna(1).astype(int)
signals
            ABC  DEF  GHI
09-15-2018    1    0    1
09-16-2018    0    0    0
09-17-2018    1    1    1
You could use a fast boolean filter.
# better speed if boolean to integer conversion is separate from
# the boolean comparison
signals = (df1 > df2)
signals = signals.astype(int)
print(signals)
            ABC  DEF  GHI
09-15-2018    1    0    1
09-16-2018    0    0    0
09-17-2018    1    1    1
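A third, equivalent sketch (my addition, not from either answer) builds the signals frame directly with numpy.where, so the zero-filled frame is not needed at all:
import pandas as pd
import numpy as np

cols = ['ABC', 'DEF', 'GHI']
mydates = ['09-15-2018', '09-16-2018', '09-17-2018']
df1 = pd.DataFrame(index=mydates, columns=cols,
                   data=[[16.11, 18.12, 19.13], [16.21, 18.22, 19.23], [16.31, 18.32, 19.33]])
df2 = pd.DataFrame(index=mydates, columns=cols,
                   data=[[12.22, 18.34, 17.34], [17.52, 18.34, 19.34], [13.12, 14.73, 16.47]])

# 1 where df1 is strictly greater than df2, otherwise 0
signals = pd.DataFrame(np.where(df1 > df2, 1, 0), index=df1.index, columns=df1.columns)
print(signals)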

Division between two numbers in a Dataframe

I am trying to calculate a percent change between 2 numbers in one column when a signal from another column is triggered.
The trigger can be found with np.where(), but what I am having trouble with is the percent change. .pct_change does not work because .pct_change(-5) gives 16.03/20.35, and I want it the other way around, 20.35/16.03. See the table below. I have tried returning the array from the index in the np.where and adding it to an .iloc from the 'Close' column, but it says I can't use that array to get an .iloc position. Can anyone help me solve this problem? Thank you.
IdxNum | Close | Signal (1s)
==============================
0 21.45 0
1 21.41 0
2 21.52 0
3 21.71 0
4 20.8 0
5 20.35 0
6 20.44 0
7 16.99 0
8 17.02 0
9 16.69 0
10 16.03 1<< 26.9% <<< 20.35/16.03-1 (df.Close[5]/df.Close[10]-1)
11 15.67 0
12 15.6 0
You can try this code block:
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'IdxNum': range(13),
                   'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
                             16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
                   'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Create a function that calculates the required difference
def cal_diff(row):
    if row['Signal'] == 1:
        signal_index = int(row['IdxNum'])
        row['diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
    return row

# Create a column and apply that difference
df['diff'] = 0
df = df.apply(cal_diff, axis=1)
In case you don't have an IdxNum column, you can use the index to calculate the difference:
# Create DataFrame
df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Calculate the required difference
df['diff'] = 0
signal_index = df[df['Signal'] == 1].index[0]
df.loc[signal_index, 'diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
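As a hedged aside (not part of the answers above), the same number can be computed without apply by shifting Close five rows and keeping the ratio only where Signal is 1:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

# Close from five rows back divided by the current Close, minus 1
ratio = df['Close'].shift(5) / df['Close'] - 1
df['diff'] = np.where(df['Signal'].eq(1), ratio, 0)
print(df[df['Signal'].eq(1)])  # row 10 -> diff ≈ 0.2695, i.e. about 26.9%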