I have a dataframe with 20k rows and 100 columns, and I am trying to normalize my data across rows. scikit-learn's MinMaxScaler doesn't let me do this by rows. There is minmax_scale, which allows row normalization, but then I cannot denormalize later; at least, I don't see how. How would you do it?
Instead of relying on sklearn.preprocessing.minmax_scale, you can do the scaling manually and store the min and max vectors, so you can invert the transform later:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 5],
                   'B': [88, 300, 200]})

# Find and store the min and max vectors
min_values = df.min()
max_values = df.max()

normalized_df = (df - min_values) / (max_values - min_values)
denormalized_df = normalized_df * (max_values - min_values) + min_values
df:
   A    B
0  1   88
1  2  300
2  5  200

normalized_df:
      A         B
0  0.00  0.000000
1  0.25  1.000000
2  1.00  0.528302

denormalized_df:
     A      B
0  1.0   88.0
1  2.0  300.0
2  5.0  200.0
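The example above scales each column; for the row-wise scaling asked about, the same idea works with per-row vectors. A minimal sketch, assuming you want each row scaled by its own min and max (the axis=0 in sub/div/mul/add broadcasts the per-row vectors correctly):

# Row-wise variant: per-row min/max, broadcast along the index with axis=0
row_min = df.min(axis=1)
row_max = df.max(axis=1)
normalized_rows = df.sub(row_min, axis=0).div(row_max - row_min, axis=0)
denormalized_rows = normalized_rows.mul(row_max - row_min, axis=0).add(row_min, axis=0)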
I have the dataframe below. My requirement: if both columns in a row are np.nan, make no change; if only one of the columns is empty, fill that NaN with 0. I wrote this code, but why is it not working? Please suggest.
import pandas as pd
import numpy as np

data = {'Age': [np.nan, np.nan, 22, np.nan, 50, 99],
        'Salary': [217, np.nan, 262, 352, 570, np.nan]}
df = pd.DataFrame(data)
print(df)

cond1 = (df['Age'].isnull()) & (df['Salary'].isnull())
if cond1 is False:
    df['Age'] = df['Age'].fillna(0)
    df['Salary'] = df['Salary'].fillna(0)
print(df)
You can do it with update: fill the NaNs only on the rows that are not all-NaN, then write the result back (update aligns on index and columns and overwrites in place):

c = ['Age', 'Salary']
df.update(df.loc[~df[c].isna().all(axis=1), c].fillna(0))
df
Out[341]:
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
# Two-column boolean mask: True where exactly one of the pair is NaN,
# so rows where both are NaN are left untouched
c1 = df['Age'].isna()
c2 = df['Salary'].isna()
df[np.c_[c1 & ~c2, ~c1 & c2]] = 0
df
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
This might be a bit less sophisticated than the other answers, but it worked for me: first save the row(s) where both values are NaN at the same time, then fillna the original df as per normal, then set np.nan back at the saved locations:

tmp = df.loc[df['Age'].isna() & df['Salary'].isna()]
df.fillna(0, inplace=True)
df.loc[tmp.index] = np.nan
Get the rows that are all nulls and use where to exclude them during the fill:
bools = df.isna().all(axis=1)
# where keeps values where the condition is True (the all-NaN rows)
# and takes from the filled frame everywhere else
df.where(bools, df.fillna(0))
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
Your if statement won't work because you need to check each row for True or False. cond1 is a Series and cannot be compared to False that way; the identity check just returns False regardless, and the Series can contain a mixture of True and False values across rows.
An inefficient way would be to traverse the rows:
for row, index in zip(cond1, df.index):
    if not row:
        df.loc[index] = df.loc[index].fillna(0)
Apart from the inefficiency, you are keeping track of index positions yourself; the pandas options abstract that away while staying efficient, since the looping happens in C.
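For comparison, a vectorized sketch of the same idea, reusing cond1 from the question:

# Fill only the rows where not all columns are NaN; all-NaN rows stay untouched
df.loc[~cond1] = df.loc[~cond1].fillna(0)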
I have a dataframe with, let's say, the following entries:
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D float list in the order [moment, stress, strain], based on linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method from pandas, but it is used when you have NaN entries to fill in.
How do I accomplish a similar task in my case?
Thank you
One method is to add a new row with NaN for the unknown columns and the target strain value, then sort by strain and interpolate by index:

# df.append was removed in pandas 2.0; concatenating a one-row frame does the same
df = pd.concat(
    [df, pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])],
    ignore_index=True,
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
moment stress
strain
0.11 0.1200 13.00
0.12 0.2300 14.00
0.45 0.4775 14.75
0.56 0.5600 15.00
To get the values back:
df = df.reset_index()
print(
df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
.to_numpy()
.tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
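If you only need the three values, np.interp on the original three-row frame is a lighter alternative; a minimal sketch, assuming strain is sorted ascending (it is here):

strain_target = 0.45
# np.interp(x, xp, fp) linearly interpolates fp at x over the grid xp
result = [np.interp(strain_target, df["strain"], df["moment"]),
          np.interp(strain_target, df["strain"], df["stress"]),
          strain_target]
# [0.4775, 14.75, 0.45]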
Rather than the mean score displaying as 91.144105, how can I display 91.1 instead?
Rather than the mode score displaying as ([90.0], [77]), how can I display 90 instead?
Code snippet and output:
from scipy import stats
import numpy as np
import pandas as pd

pd.pivot_table(df_inspections_violations, index=['ACTIVITY YEAR', 'FACILITY ZIP'],
               values="SCORE", aggfunc=['mean', 'median', stats.mode])
You can use style.format (see the documentation).
But you'd be better off splitting the mode SCORE column into its value and count parts, so that you can use a dictionary to control each single column. For example:
df = pd.DataFrame({
    'a': np.linspace(0, 1, 7),
    'b': np.linspace(31, 90, 7),
    'c': np.arange(10, 17)
})

df.style.format({
    'a': "{:.2f}",
    'b': "{:.1f}",
    'c': int,
})
Output
a b c
0 0.00 31.0 10
1 0.17 40.8 11
2 0.33 50.7 12
3 0.50 60.5 13
4 0.67 70.3 14
5 0.83 80.2 15
6 1.00 90.0 16
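Applied to the question's pivot, a hedged sketch: extract plain numbers at aggregation time (pandas' own Series.mode standing in for scipy's (mode, count) pair), then round or style:

pt = pd.pivot_table(df_inspections_violations,
                    index=['ACTIVITY YEAR', 'FACILITY ZIP'],
                    values='SCORE',
                    aggfunc=['mean', 'median', lambda s: s.mode().iloc[0]])
pt.round(1)  # mean shows as 91.1; the mode column is already a plain 90.0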
I'm a bit new to Python, so maybe this code could be improved.
I have a txt file with x and y values, separated by some NaN in between.
The data goes from -x to x and then comes back (x to -x), but with somewhat different values of y, say:
x = np.array([-0.02, -0.01, 0, 0.01, 0.02, np.nan, 1, np.nan, 0.02, 0.01, 0, -0.01, -0.02])
And I would like to plot (with matplotlib) up to the first NaN with a certain format, x=1 with another format, and the last set of data with a third format (color, marker, linewidth...).
Of course the data I have is a bit more complex, but I guess is a simple useful approximation.
Any idea?
I'm using pandas as my data manipulation tool
You can create a group label by taking the cumsum of where x is null. Then define a dictionary keyed by that label, with values holding the plotting parameters for each segment. Finally, use groupby to plot each group separately, unpacking its parameters as the plot arguments.
Sample Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.array([-0.02, -0.01, 0, 0.01, 0.02, np.nan, 1, np.nan, 0.02, 0.01, 0, -0.01, -0.02])
df = pd.DataFrame({'x': x})
Code
# Each NaN bumps the cumsum, labelling the runs 0, 1, 2; the NaN rows themselves stay NaN
df['label'] = df.x.isnull().cumsum().where(df.x.notnull())

plot_params = {0: {'lw': 2, 'color': 'red', 'marker': 'o'},
               1: {'lw': 6, 'color': 'black', 'marker': 's'},
               2: {'lw': 9, 'color': 'blue', 'marker': 'x'}}

fig, ax = plt.subplots(figsize=(3, 3))
for label, gp in df.groupby('label'):
    gp.plot(y='x', **plot_params[label], ax=ax, legend=None)
plt.show()
For reference, this is what df looks like after defining the group label:
print(df)
x label
0 -0.02 0.0
1 -0.01 0.0
2 0.00 0.0
3 0.01 0.0
4 0.02 0.0
5 NaN NaN
6 1.00 1.0
7 NaN NaN
8 0.02 2.0
9 0.01 2.0
10 0.00 2.0
11 -0.01 2.0
12 -0.02 2.0
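As an aside, if you only needed gaps rather than per-segment formats, matplotlib already breaks a line at NaN values; a one-line sketch:

plt.plot(df.index, df['x'], marker='o')  # NaNs split the line; one style for all segments
plt.show()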
I am trying to calculate multiple columns from multiple columns in a pandas dataframe using a function.
The function takes three arguments a, b, and c and returns three calculated values sum, prod, and quot. In my pandas dataframe I have three columns a, b, and c from which I want to calculate the columns sum, prod, and quot.
The mapping that I do works only when I have exactly three rows. I do not know what is going wrong, although I expect it has something to do with selecting the correct axis. Could someone explain what is happening and how I can calculate the values I would like to have?
Below are the situations that I have tested.
INITIAL VALUES
def sum_prod_quot(a, b, c):
    sum = a + b + c
    prod = a * b * c
    quot = a / b / c
    return (sum, prod, quot)

df = pd.DataFrame({'a': [20, 100, 18],
                   'b': [5, 10, 3],
                   'c': [2, 10, 6],
                   'd': [1, 2, 3]})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
CALCULATION STEPS
Using exactly three rows
When I calculate three new columns from this dataframe using the function above, I get:
df['sum'], df['prod'], df['quot'] = \
    list(map(sum_prod_quot, df['a'], df['b'], df['c']))
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
This is exactly the result that I want to have: The sum-column has the sum of the elements in the columns a,b,c; the prod-column has the product of the elements in the columns a,b,c and the quot-column has the quotients of the elements in the columns a,b,c.
Using more than three rows
When I expand the dataframe by one row, I get an error.
The data frame is defined as:
df = pd.DataFrame({'a': [20, 100, 18, 40],
                   'b': [5, 10, 3, 10],
                   'c': [2, 10, 6, 4],
                   'd': [1, 2, 3, 4]})
df
a b c d
0 20 5 2 1
1 100 10 10 2
2 18 3 6 3
3 40 10 4 4
The call is
df['sum'], df['prod'], df['quot'] = \
    list(map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: too many values to unpack (expected 3)
while I would expect an extra row:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
2 18 3 6 3 2.0 1.0 1.0
3 40 10 4 4 54.0 1600.0 1.0
Using less than three rows
When I reduce the dataframe by one row, I also get an error.
The dataframe is defined as:
df = pd.DataFrame({'a': [20, 100],
                   'b': [5, 10],
                   'c': [2, 10],
                   'd': [1, 2]})
df
a b c d
0 20 5 2 1
1 100 10 10 2
The call is
df['sum'], df['prod'], df['quot'] = \
    list(map(sum_prod_quot, df['a'], df['b'], df['c']))
The result is
...
list( map(sum_prod_quot, df['a'], df['b'], df['c']))
ValueError: need more than 2 values to unpack
while I would expect a row less:
df
a b c d sum prod quot
0 20 5 2 1 27.0 120.0 27.0
1 100 10 10 2 200.0 10000.0 324.0
QUESTIONS
The questions I have:
1) Why do I get these errors?
2) How do I have to modify the call such that I get the desired data frame?
NOTE
In this link a similar question is asked, but the given answer did not work for me.
The answer doesn't seem correct for 3 rows either. Check the values beyond the first row and first column: the product 20*5*2 is not 120, it is 200, and it lands one row down, in the sum column. map yields one (sum, prod, quot) tuple per row, so you need to transpose that list into one sequence per column before assigning. You can use the following to set the new columns:
df['sum'], df['prod'], df['quot'] = zip(*map(sum_prod_quot, df['a'], df['b'], df['c']))
For details follow the link
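To see why zip(*...) works for any number of rows, a minimal illustration with made-up tuples:

# map yields one (sum, prod, quot) tuple per row; zip(*...) transposes the
# row-tuples into exactly three column-tuples, whatever the row count
rows = [(27, 200, 2.0), (120, 10000, 1.0), (27, 324, 1.0), (54, 1600, 1.0)]
sums, prods, quots = zip(*rows)
print(sums)  # (27, 120, 27, 54)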