How do I set the cell values in one dataframe based on the values in two other dataframes? - pandas

I have three dataframes with the same shape, illustrated by the contrived data below. I want to iterate across df1 and set the value of each cell in the signals dataframe to 1 if the cell value in df1 is greater than the corresponding cell value in df2. Can someone illustrate how to accomplish that?
import pandas as pd
cols = ['ABC', 'DEF', 'GHI']
prices = [[12.22, 14.34, 98.34], [12.52, 15.34, 96.34], [13.12, 14.73, 97.47]]
prices_df1 = [[16.11, 18.12, 19.13], [16.21, 18.22, 19.23], [16.31, 18.32, 19.33]]
prices_df2 = [[12.22, 18.34, 17.34], [17.52, 18.34, 19.34], [13.12, 14.73, 16.47]]
mydates = ['09-15-2018', '09-16-2018', '09-17-2018']
signals = pd.DataFrame(index=mydates, columns=cols, data=0)
df1 = pd.DataFrame(index=mydates, columns=cols, data=prices_df1)
df2 = pd.DataFrame(index=mydates, columns=cols, data=prices_df2)
How do I set the signals dataframe to have a 1 where df1 > df2?

You can use df.where:
signals = signals.where(df1 < df2).fillna(1).astype(int)
signals
            ABC  DEF  GHI
09-15-2018    1    0    1
09-16-2018    0    0    0
09-17-2018    1    1    1
Note that where keeps the existing 0 wherever df1 < df2 and fills everything else with 1, so this marks 1 wherever df1 >= df2; with no ties in the sample data it matches df1 > df2.

You could use a fast boolean comparison.
# boolean comparison first, then convert the booleans to 0/1 integers
signals = (df1 > df2)
signals = signals.astype(int)
print(signals)
            ABC  DEF  GHI
09-15-2018    1    0    1
09-16-2018    0    0    0
09-17-2018    1    1    1
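As a side note, the same signals frame can be built in one step; a minimal sketch, assuming the df1 and df2 defined in the question:
import numpy as np
import pandas as pd

# elementwise strict comparison (aligned on index and columns), cast booleans to 0/1
signals = df1.gt(df2).astype(int)

# equivalent with numpy.where, rebuilding the DataFrame around the ndarray
signals = pd.DataFrame(np.where(df1 > df2, 1, 0),
                       index=df1.index, columns=df1.columns)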

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data
   a    b    c
0  1  1.0  1.0
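A small additional sketch, assuming the same example: once ok_rows is computed with any, boolean indexing selects the same rows as dropna:
model_data = model_data_with_na[ok_rows]  # keep only rows with no missing values
assert sum(ok_rows) == len(model_data)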

Create new column based on two columns

I have two columns in a dataframe. I want to create a third column such that if the first column > the second column then 1, otherwise 0. As below:
Df
Value1  value 2  Newcolumn
100     101      0
101     97       1
You can compare two columns in a pandas DataFrame and write the result of the comparison to a third column. It can be done easily with this syntax:
conditions = [(condition1), (condition2)]
choices = ["choice1", "choice2"]
df["new_column_name"] = np.select(conditions, choices, default)
conditions are the conditions to check between the two columns
choices are the results to return based on each condition
np.select writes the matching result into the new column
The dataframe is:
import numpy as np
import pandas as pd
#create DataFrame
df = pd.DataFrame({'Value1': [100, 101],
                   'value 2': [101, 97]})
#define conditions
conditions = [df['Value1'] < df['value 2'],
              df['Value1'] > df['value 2']]
#define choices
choices = ['0', '1']
#create new column in DataFrame that displays results of comparisons
df['Newcolumn'] = np.select(conditions, choices, default='Tie')
Final dataframe
print(df)
Output:
   Value1  value 2 Newcolumn
0     100      101         0
1     101       97         1
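If the tie case is not needed, a shorter sketch (assuming the same df as above) is a plain boolean comparison cast to int:
# 1 where Value1 > value 2, otherwise 0 (ties also become 0)
df['Newcolumn'] = (df['Value1'] > df['value 2']).astype(int)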

Pandas: Newbie question on compare and (re)calculate fields with pandas

What I need to do is to compare 2 fields in a row in a csv-file:
Data looks like this:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
In case "price" is equal to "retail_price", the retail_price field must be reduced by a given percentage, e.g. -10%.
So in the example data, the first and last lines should be changed to 180 and 179.955.
I'm completely new to pandas, and after reading the "getting started" part I did not find anything that I could build upon...
So any help or hint (just point me in the right direction, I will figure it out myself) is appreciated.
Kind regards!
Use Series.eq to compare both values; if they are equal, multiply retail_price by 0.9, otherwise leave it unchanged, using numpy.where:
import numpy as np

mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(0.9), df['retail_price'])
print (df)
   store            ean   price  retail_price  quantity
0      1   888721396226  200.00       180.000         2
1      1   888721396233  200.00       159.000         2
2      1  2194384654084  299.00       259.000         7
3      1  2194384654091  199.95       179.955         8
Or you can use DataFrame.loc to multiply only the matched rows by 0.9:
mask = df['price'].eq(df['retail_price'])
df.loc[mask, 'retail_price'] *= 0.9
# working like:
df.loc[mask, 'retail_price'] = df.loc[mask, 'retail_price'] * 0.9
EDIT: to filter the rows that do not match the mask (the False values in mask), use:
df2 = df[~mask].copy()
print (df2)
   store            ean  price  retail_price  quantity
1      1   888721396233  200.0         159.0         2
2      1  2194384654084  299.0         259.0         7
print (mask)
0     True
1    False
2    False
3     True
dtype: bool
This is my code:
import pandas as pd
import numpy as np
import sys
with open('prozente.txt', 'r') as f:  # create the multiplier from the static value in the file "prozente.txt"
    prozente = int(f.readline())
mulvalue = 1-(prozente/100)
df = pd.read_csv('1.csv', sep=';', header=0, names=['store','ean','price','retail_price','quantity'])  # header=0 skips the existing header row so the supplied names replace it
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(mulvalue).round(2), df['retail_price'])
df2 = df[~mask].copy()
df.to_csv('output.csv', columns=['store','ean','price','retail_price','quantity'],sep=';', index=False)
print(df)
print(df2)
using this as 1.csv:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
The content of file "prozente.txt" is
25
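As a quick sanity check of the arithmetic (a sketch based on the code above): with prozente = 25 the multiplier is 0.75, and only the rows where price equals retail_price change:
mulvalue = 1 - 25 / 100         # 0.75
round(200.00 * mulvalue, 2)     # 150.0
round(199.95 * mulvalue, 2)     # 149.96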

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into two lines. If the group size is even, say 2n, then n rows go in one line and the remaining n in the second line. If it is odd, n+1 and n (or n and n+1) will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to group by id
df.groupby('id', sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
    columns='separate_column')
This is a bit of a convoluted approach, but it does the work:
def func(s: pd.Series):
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]

df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
            .stack()
            .reset_index()
            .drop(labels=["level_1"], axis=1)
            .rename(columns={0: "name"})
            .set_index("id"))
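For comparison, a more compact sketch of the same split, assuming the original df from the question: number the rows within each group with cumcount, flag the second half, and join the names per (id, half):
size = df.groupby('id')['name'].transform('size')
second_half = df.groupby('id').cumcount() >= size.add(1) // 2  # first half gets the extra row for odd sizes
out = (df.groupby(['id', second_half])['name']
         .agg(' '.join)
         .reset_index(level=1, drop=True)
         .reset_index())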

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [100, 300, 500],
    'C': list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data, and add the column back).
df_new = df.drop('C', axis=1)
df_c = df[['C']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_c
However, I am sure that there is an easier way of doing this in pandas (given the column names that do not need to be normalised, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to select only the numeric columns, normalize them by subtracting the minimum and dividing by the range, and then assign back only the normalized columns:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print (df)
     A    B  C
0  0.0  0.0  a
1  0.5  0.5  b
2  1.0  1.0  c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
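For example, a minimal sketch of that pattern, assuming the example df above: apply a custom min-max function to the numeric columns only:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500], 'C': list('abc')})

def min_max(col):
    # scale one column to the range [0, 1]
    return (col - col.min()) / (col.max() - col.min())

num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].apply(min_max)
print(df)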