How to quickly normalise data in pandas dataframe? - pandas

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numbered column what I do is as follows (i.e. remove C first, normalise data and add the column).
df_new = df.drop('concept', axis=1)
df_concept = df[['concept']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['concept'] = df_concept
However, I am sure that there is more easy way of doing this in pandas (given the column names that I do not need to normalise, then do the normalisation straightforward).
I am happy to provide more details if needed.

Use DataFrame.select_dtypes for DataFrame with numeric columns and then normalize with division by minimal and maximal values and then assign back only normalized columns:
df1 = df.select_dtypes(np.number)
df[df1.columns]=(df1-df1.min())/(df1.max()-df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c

In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
standardized_numerical_data,
encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
False!
As a newbie in Python, I wonder why this doesn't work? Also, is it better to use hierarchical indexing? Then I can extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks here if every value in a row is missing, and any checks if one of the values is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
a
b
c
0
1
1
1
1
2
nan
2
2
3
nan
nan
model_data
a
b
c
0
1
1
1

Pandas: Newbie question on compare and (re)calculate fields with pandas

What I need to do is to compare 2 fields in a row in a csv-file:
Data looks like this:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
in case that "price" is equal to "retail_price" the field retail_price must be reduced by a given percent-value, e.g. -10%
so in the example data, the first and last line should be changed to 180 and 179,955
I´m completely new to pandas and after reading the "getting started" part I did not find anything that I could set upon ...
so any help or hint (just point me in the direction, I will fiddle it out myself then) is appreciated,
Kind regards!
Use Series.eq for compare both values and if same multiple retail_price by 0.9 else not in numpy.where:
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(0.9), df['retail_price'])
print (df)
store ean price retail_price quantity
0 1 888721396226 200.00 180.000 2
1 1 888721396233 200.00 159.000 2
2 1 2194384654084 299.00 259.000 7
3 1 2194384654091 199.95 179.955 8
Or you can use DataFrame.loc for multiple only matched rows by 0.9:
mask = df['price'].eq(df['retail_price'])
df.loc[mask, 'retail_price'] *= 0.9
#working like
df.loc[mask, 'retail_price'] = df.loc[mask, 'retail_price'] * 0.9
EDIT: for filter rows not matched mask (with Falses in mask) use:
df2 = df[~mask].copy()
print (df2)
store ean price retail_price quantity
1 1 888721396233 200.0 159.0 2
2 1 2194384654084 299.0 259.0 7
print (mask)
0 True
1 False
2 False
3 True
dtype: bool
This ist my code:
import pandas as pd
import numpy as np
import sys
with open('prozente.txt', 'r') as f: #create multiplicator from static value in File "prozente.txt"
prozente = int(f.readline())
mulvalue = 1-(prozente/100)
df = pd.read_csv('1.csv', sep=';', header=1, names=['store','ean','price','retail_price','quantity'])
mask = df['price'].eq(df['retail_price'])
df['retail_price'] = np.where(mask, df['retail_price'].mul(mulvalue).round(2), df['retail_price'])
df2 = df[~mask].copy()
df.to_csv('output.csv', columns=['store','ean','price','retail_price','quantity'],sep=';', index=False)
print(df)
print(df2)
using this as 1.csv:
store;ean;price;retail_price;quantity
001;0888721396226;200;200;2
001;0888721396233;200;159;2
001;2194384654084;299;259;7
001;2194384654091;199.95;199.95;8
The content of file "prozente.txt" is
25

saving dataframe groupby rows to exactly two lines

I got a dataframe and I want to groupby the rows based on a specific column. Number of rows in each group will be at least 4 and at most 50. I want to save one column from the group into two lines. If the groupsize is even, let us say 2n, then n rows in one line and the remaining n in the second line. If it is odd, n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
columns='separate_column')
This is a bit convoluted approach but it does the work;
def func(s: pd.Series):
mid = max(s.shape[0]//2 ,1)
l1 = ' '.join(list(s[:mid]))
l2 = ' '.join(list(s[mid:]))
return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"]= df_new["name"].apply(lambda x: x[0])
df_new["name2"]= df_new["name"].apply(lambda x: x[1])
df = df_new.drop(labels="name", axis=1).stack().reset_index().drop(labels = ["level_1"], axis=1).rename(columns={0:"name"}).set_index("id")

Pandas data frame creation using static data

I have a data set like this : {'IT',[1,20,35,44,51,....,1000]}
I want to convert this into python/pandas data frame. I want to see output in the below format. How to achieve this output.
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
Below way i can write, but this is not efficient way for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
Use list comprehension for tuples and pass to DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
You can use melt
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')

pandas operations over multiple axis

How can I do operations over multiple columns in one go in pandas?
For example, I would like to calculate the df[['a',b']].mean(level=0) or df[['a',b']].kurtosis(level=0) (I need the level=0 as it's a multi indexed dataframe).
But I would like to have one single number and do the calculation over multiple axis in one go. A and B would be merged into one single column (or series).
In numpy this is possible I believe with axis=(0,1), but I'm unsure how this can be achieved in pandas.
Speed is very important, so apply or iterating is not a solution.
The expected result would be as follows:
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.rand(10, 2),
pd.MultiIndex.from_product([list('ab'), range(5)]),
list('AB')
)
df
Out[76]:
A B
a 0 0.444939 0.407554
1 0.460148 0.465239
2 0.462691 0.016545
3 0.850445 0.817744
4 0.777962 0.757983
b 0 0.934829 0.831104
1 0.879891 0.926879
2 0.721535 0.117642
3 0.145906 0.199844
4 0.437564 0.100702
expected result:
df.groupby(level=0).agg(['mean']).mean(axis=1)
Out[78]:
a 0.546125
b 0.529589
dtype: float64
But it needs to be achieved in one single calculation, not in mean of mean, as this will maybe work for the mean, but for other calculations it may not produce the same result as if it was done in one go (for example I'm not sure if the kurtosis of the kurtosis is equal to the kurtosis in one go.)
Consider the sample dataframe df
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.rand(10, 2),
pd.MultiIndex.from_product([list('ab'), range(5)]),
list('AB')
)
df
A B
a 0 0.444939 0.407554
1 0.460148 0.465239
2 0.462691 0.016545
3 0.850445 0.817744
4 0.777962 0.757983
b 0 0.934829 0.831104
1 0.879891 0.926879
2 0.721535 0.117642
3 0.145906 0.199844
4 0.437564 0.100702
Typical Solution
Use groupby and agg
df.groupby(level=0).agg(['mean', pd.Series.kurt])
A B
mean kurt mean kurt
a 0.599237 -2.885262 0.493013 0.018225
b 0.623945 -0.900488 0.435234 -3.105328
Solve Different
pd.concat([
df.mean(level=0),
df.kurt(level=0)
], axis=1, keys=['Mean', 'Kurt']).swaplevel(1, 0, 1).sort_index(1)
A B
Kurt Mean Kurt Mean
a -2.885262 0.599237 0.018225 0.493013
b -0.900488 0.623945 -3.105328 0.435234
This seems to work:
df.stack().mean(level=0)
Out[146]:
a 0.546125
b 0.529589
dtype: float64