Data highly skewed and value range too large - pandas

I'm trying to rescale and normalize my dataset.
My data is highly skewed and the range of values is very large, which is hurting my models' performance.
I've tried RobustScaler() and PowerTransformer(), yet there is no improvement.
Below you can see the boxplot, the KDE plot and the skew() test of my data:
df_test.agg(['skew', 'kurtosis']).transpose()
The data is financial data, so it can take a large range of values (they are not really outliers).

Depending on your data, there are several ways to handle this. There is, however, a function that will help you handle skewed data by doing a preliminary transformation before your normalization effort.
Go to this repo (https://github.com/datamadness/Automatic-skewness-transformation-for-Pandas-DataFrame) and download the files skew_autotransform.py and TEST_skew_autotransform.py. Put them in the same folder as your code and use them as in this example:
import pandas as pd
import numpy as np
# note: load_boston was removed in scikit-learn 1.2, so this example needs an older scikit-learn
from sklearn.datasets import load_boston
from skew_autotransform import skew_autotransform

exampleDF = pd.DataFrame(load_boston()['data'], columns=load_boston()['feature_names'].tolist())
transformedDF = skew_autotransform(exampleDF.copy(deep=True), plot=True, exp=False, threshold=0.5)

print('Original average skewness value was %2.2f' % (np.mean(abs(exampleDF.skew()))))
print('Average skewness after transformation is %2.2f' % (np.mean(abs(transformedDF.skew()))))
It will return several graphs and measures of skewness for each variable, but most importantly a transformed dataframe with the skewed data handled:
Original data:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0
.. ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0
PTRATIO B LSTAT
0 15.3 396.90 4.98
1 17.8 396.90 9.14
2 17.8 392.83 4.03
3 18.7 394.63 2.94
4 18.7 396.90 5.33
.. ... ... ...
501 21.0 391.99 9.67
502 21.0 396.90 9.08
503 21.0 396.90 5.64
504 21.0 393.45 6.48
505 21.0 396.90 7.88
[506 rows x 13 columns]
and the transformed data:
CRIM ZN INDUS CHAS NOX RM AGE \
0 -6.843991 1.708418 2.31 -587728.314092 -0.834416 6.575 201.623543
1 -4.447833 -13.373080 7.07 -587728.314092 -1.092408 6.421 260.624267
2 -4.448936 -13.373080 7.07 -587728.314092 -1.092408 7.185 184.738608
3 -4.194470 -13.373080 2.18 -587728.314092 -1.140400 6.998 125.260171
4 -3.122838 -13.373080 2.18 -587728.314092 -1.140400 7.147 157.195622
.. ... ... ... ... ... ... ...
501 -3.255759 -13.373080 11.93 -587728.314092 -0.726384 6.593 218.025321
502 -3.708638 -13.373080 11.93 -587728.314092 -0.726384 6.120 250.894792
503 -3.297348 -13.373080 11.93 -587728.314092 -0.726384 6.976 315.757117
504 -2.513274 -13.373080 11.93 -587728.314092 -0.726384 6.794 307.850962
505 -3.643173 -13.373080 11.93 -587728.314092 -0.726384 6.030 269.101967
DIS RAD TAX PTRATIO B LSTAT
0 1.264870 0.000000 1.807258 32745.311816 9.053163e+08 1.938257
1 1.418585 0.660260 1.796577 63253.425063 9.053163e+08 2.876983
2 1.418585 0.660260 1.796577 63253.425063 8.717663e+08 1.640387
3 1.571460 1.017528 1.791645 78392.216639 8.864906e+08 1.222396
4 1.571460 1.017528 1.791645 78392.216639 9.053163e+08 2.036925
.. ... ... ... ... ... ...
501 0.846506 0.000000 1.803104 129845.602554 8.649562e+08 2.970889
502 0.776403 0.000000 1.803104 129845.602554 9.053163e+08 2.866089
503 0.728829 0.000000 1.803104 129845.602554 9.053163e+08 2.120221
504 0.814408 0.000000 1.803104 129845.602554 8.768178e+08 2.329393
505 0.855697 0.000000 1.803104 129845.602554 9.053163e+08 2.635552
[506 rows x 13 columns]
After having done this, normalize the data if you need to.
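For example, a minimal sketch of that normalization step (assuming a min-max scaling to [0, 1] is what you want; MinMaxScaler is just one option, not something prescribed above):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalizedDF = pd.DataFrame(scaler.fit_transform(transformedDF),
                            columns=transformedDF.columns,
                            index=transformedDF.index)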
Update
Given the ranges of some of your data, you probably need to do this case by case, by trial and error. There are several normalizers you can use to test different approaches. I'll demonstrate a few of them on an example column:
exampleDF = pd.read_csv("test.csv", sep=",")
exampleDF = pd.DataFrame(exampleDF['LiabilitiesNoncurrent_total'])
exampleDF.describe() gives:
LiabilitiesNoncurrent_total
count 6.000000e+02
mean 8.865754e+08
std 3.501445e+09
min -6.307000e+08
25% 6.179232e+05
50% 1.542650e+07
75% 3.036085e+08
max 5.231900e+10
Sigmoid
Define the following function
def sigmoid(x):
    e = np.exp(1)
    y = 1 / (1 + e**(-x))
    return y
and do
df = sigmoid(exampleDF.LiabilitiesNoncurrent_total)
df = pd.DataFrame(df)
'LiabilitiesNoncurrent_total' had a positive skewness of 8.85; the transformed one has a skewness of -2.81.
Log+1 Normalization
Another approach is to use a logarithmic function and then to normalize.
def normalize(column):
    upper = column.max()
    lower = column.min()
    y = (column - lower) / (upper - lower)
    return y
df = np.log(exampleDF['LiabilitiesNoncurrent_total'] + 1)
df_normalized = normalize(df)
The skewness is reduced by approximately the same amount. Note that log(x + 1) is only defined for x > -1, so negative entries (such as the minimum shown above) will come out as NaN and need separate handling.
I would opt for this last option rather than the sigmoid approach. I also suspect that you can apply this solution to all your features.
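As a rough sketch of applying that idea to every feature (this is an assumption, not part of the original answer: each column is shifted to be strictly positive before the log, reusing the normalize function above on the df_test dataframe of numeric columns):
df_log = df_test.copy()
for col in df_log.columns:
    shifted = df_log[col] - df_log[col].min() + 1  # make the column strictly positive so the log is defined
    df_log[col] = normalize(np.log(shifted))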

Related

Change Value of a Dataframe Column Based on a Filter with specific parameters

I’m looking at this but I have no idea how to formulate it:
Change Value of a Dataframe Column Based on a Filter
I need to change the values in medianIncome: values of 0.4999 or lower should become 0.4999, and values of 15.0001 or higher should become 15.0001.
Here's sample data:
id longitude_x latitude ocean_proximity longitude_y state medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 1 -122.23 37.88 NEAR BAY -122.23 CA 452.603 45.0 131.0 884.0 130.0 323.0 83252.0
1 396 -122.34 37.88 NEAR BAY -122.23 CA 350.004 41.0 930.0 3063.0 926.0 2560.0 17375.0
2 398 -122.29 37.88 NEAR BAY -122.23 CA 216.703 54.0 263.0 1211.0 230.0 525.0 38672.0
3 401 -122.28 37.88 NEAR BAY -122.23 CA 261.303 55.0 333.0 1845.0 335.0 772.0 42614.0
4 424 -122.26 37.88 NEAR BAY -122.23 CA 391.803 53.0 418.0 2553.0 404.0 898.0 62425.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
929044 9476 -123.38 39.37 INLAND -121.24 CA 124.601 20.0 813.0 3947.0 732.0 1902.0 26424.0
929045 9494 -123.75 39.37 INLAND -121.24 CA 151.403 20.0 299.0 1377.0 282.0 830.0 32500.0
929046 10065 -121.03 39.37 INLAND -121.24 CA 85.000 15.0 327.0 1338.0 310.0 1174.0 26341.0
929047 10074 -120.10 39.37 INLAND -121.24 CA 117.301 34.0 411.0 2328.0 373.0 1016.0 45208.0
929048 21558 -121.24 39.37 INLAND -121.24 CA 89.401 18.0 616.0 2787.0 532.0 1387.0 23886.0
It shows:
np.where(df['x'] > 0 & df['y'] < 10, 1, 0)
So I'm at:
np.where(housing['medianIncome'] > 15.0001
And I'm stuck on the rest. I can only use pandas and numpy, and I'm not able to use a lambda.
I'm expecting an outcome that won't give an error. As of yet, I don't have an outcome.
Use Series.clip:
housing = pd.DataFrame({'medianIncome':[20,5,0.07]})
housing['medianIncome'] = housing['medianIncome'].clip(upper=15.0001, lower=0.4999)
print (housing)
medianIncome
0 15.0001
1 5.0000
2 0.4999
An alternative with numpy.select, if you need to set different values based on the conditions:
housing['medianIncome'] = np.select([housing['medianIncome'].lt(0.4999),
                                     housing['medianIncome'].gt(15.0001)],
                                    [0, 1],
                                    default=housing['medianIncome'])
print (housing)
medianIncome
0 1.0
1 5.0
2 0.0

I asked an earlier question about changing a dollar value to float and dividing it; that worked, but it doesn't change the value in the data frame

Here was the original question:
With only being able to import numpy and pandas, I need to do the following: scale medianIncome to express the values in units of $10,000 (example: 150000 becomes 15, 30000 becomes 3, 15000 becomes 1.5, etc.)
Here's the code that works:
(temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float)) / 10000
But when I call the df after, it still shows the original amount instead of the 15 or 1.5. I'm not sure what I'm missing on this.
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
The result is:
id medianIncome
0 1.7250
1 2.1806
2 2.4038
3 2.4597
4 1.8080
Name: medianIncome, Length: 20640, dtype: float64
But then when I call the df with housing_cal, it's back to:
id medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 23 113.903 31.0 543.0 2438.0 481.0 1016.0 17250.0
1 24 99.701 56.0 337.0 1692.0 328.0 856.0 21806.0
2 26 107.500 41.0 123.0 535.0 121.0 317.0 24038.0
3 27 93.803 53.0 244.0 1132.0 241.0 607.0 24597.0
4 28 105.504 52.0 423.0 1899.0 400.0 1104.0 18080.0
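A minimal sketch of what is likely missing (an assumption based on the code shown above, since the result is never assigned back to the dataframe, the column has to be overwritten explicitly):
temp_housing['medianIncome'] = (
    temp_housing['medianIncome'].replace('[(]', '-', regex=True).astype(float) / 10000
)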

Vectorize for loop and return x day high and low

Overview
For each row of a dataframe I want to calculate the x day high and low.
An x day high is higher than the previous x days.
An x day low is lower than the previous x days.
The for loop is explained in further detail in this post
Update:
The answer by @mozway below completes in around 20 seconds on a dataset containing 18k rows. Can this be improved with numpy broadcasting, etc.?
Example
2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.
2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.
See the desired output and test code below; the values are calculated with a for loop in the findHighLow function. How would I vectorize findHighLow, given that the actual dataframe is somewhat larger?
Test data
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': P})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

# rows set to 15 for a minimal example, but the actual dataframe contains around 18000 rows
number_of_rows = 15
df = genMockDataFrame(number_of_rows, 1.1904, 'tttmmm', '19/3/2020', seed=157)
def findHighLow(df):
    df['x_day_high'] = 0
    df['x_day_low'] = 0
    for n in reversed(range(len(df['High']))):
        for i in reversed(range(n)):
            if df['High'][n] > df['High'][i]:
                df['x_day_high'][n] = n - i
            else:
                break
    for n in reversed(range(len(df['Low']))):
        for i in reversed(range(n)):
            if df['Low'][n] < df['Low'][i]:
                df['x_day_low'][n] = n - i
            else:
                break
    return df

df = findHighLow(df)
Desired output should match this:
df[["High","Low","x_day_high","x_day_low"]]
High Low x_day_high x_day_low
date
2020-03-19 1.1937 1.1832 0 0
2020-03-20 1.1879 1.1769 0 1
2020-03-21 1.1767 1.1662 0 2
2020-03-22 1.1721 1.1611 0 3
2020-03-23 1.1819 1.1690 2 0
2020-03-24 1.1928 1.1807 4 0
2020-03-25 1.1939 1.1864 6 0
2020-03-26 1.2141 1.1964 7 0
2020-03-27 1.2144 1.2039 8 0
2020-03-28 1.2099 1.2018 0 1
2020-03-29 1.2033 1.1853 0 4
2020-03-30 1.1887 1.1806 0 6
2020-03-31 1.1972 1.1873 1 0
2020-04-01 1.1997 1.1914 2 0
2020-04-02 1.1924 1.1781 0 9
Here are two solutions. Both produce the desired output, as posted in the question.
The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.
Numba
import numba

@numba.njit
def count_smaller(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0] - 2, -1, -1):
        if arr[i] > current:
            break
        count += 1
    return count

@numba.njit
def count_greater(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0] - 2, -1, -1):
        if arr[i] < current:
            break
        count += 1
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)
Pandas/Numpy
def count_consecutive_true(bool_arr):
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)
This last solution is similar to mozway's. However, it runs faster because it doesn't need to perform a join and uses numpy as much as possible. It also looks arbitrarily far back.
You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply this to each column using agg, and join the output after adding a prefix.
# number of days
N = 8
df.join(df.rolling(f'{N+1}d', min_periods=1)
          .agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum() - 1,
                'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum() - 1,
                })
          .add_prefix(f'{N}_days_')
        )
Output:
Open High Low Close 8_days_High 8_days_Low
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 0.0 0.0
2020-03-20 1.1843 1.1879 1.1769 1.1772 0.0 1.0
2020-03-21 1.1755 1.1767 1.1662 1.1672 0.0 2.0
2020-03-22 1.1686 1.1721 1.1611 1.1721 0.0 3.0
2020-03-23 1.1732 1.1819 1.1690 1.1819 2.0 0.0
2020-03-24 1.1836 1.1928 1.1807 1.1922 4.0 0.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 6.0 0.0
2020-03-26 1.1967 1.2141 1.1964 1.2114 7.0 0.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 7.0 0.0
2020-03-28 1.2080 1.2099 1.2018 1.2041 0.0 1.0
2020-03-29 1.2033 1.2033 1.1853 1.1880 0.0 4.0
2020-03-30 1.1876 1.1887 1.1806 1.1879 0.0 6.0
2020-03-31 1.1921 1.1972 1.1873 1.1939 1.0 0.0
2020-04-01 1.1932 1.1997 1.1914 1.1914 2.0 0.0
2020-04-02 1.1902 1.1924 1.1781 1.1862 0.0 7.0

Some confusion in creating pivot table

I am trying to create a pivot table but I am not getting the result I want, and I can't understand why this is happening.
I have a dataframe like this -
data_channel_is_lifestyle data_channel_is_bus shares
0 0.0 0.0 593
1 0.0 1.0 711
2 0.0 1.0 1500
3 0.0 0.0 1200
4 0.0 0.0 505
The result I am looking for is the names of the columns in the index and the sum of shares in the column, so I did this -
news_copy.pivot_table(index=['data_channel_is_lifestyle','data_channel_is_bus'], values='shares', aggfunc=sum)
but I am getting a result like this -
shares
data_channel_is_lifestyle data_channel_is_bus
0.0 0.0 107709305
1.0 19168370
1.0 0.0 7728777
I don't want these 0's and 1's; I just want the result to be something like this -
shares
data_channel_is_lifestyle 107709305
data_channel_is_bus 19168370
How can I do this?
As you put it, it's just matrix multiplication:
df.filter(like='data').T @ df[['shares']]
Output (for sample data):
shares
data_channel_is_lifestyle 0.0
data_channel_is_bus 2211.0
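If the @ operator looks opaque, the same computation can be read as a column-wise multiply-and-sum; a small sketch on the sample dataframe above:
# multiply each indicator column by shares row-wise, then sum down the rows
df.filter(like='data').mul(df['shares'], axis=0).sum().to_frame('shares')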

Time-series prediction by separating dependent and independent variables

Suppose I have this kind of data:
date pollution dew temp press wnd_dir wnd_spd snow rain
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
I want to apply a neural network for time-series prediction of pollution.
It should be noted that the other variables (dew, temp, press, wnd_dir, wnd_spd, snow, rain) are independent variables of pollution.
If I implement an LSTM as in here, the LSTM learns all the variables and the model can predict all of them.
But it is not necessary to predict all the independent variables; the only requirement is pollution, the dependent variable.
Is there any way to implement an LSTM, or another better architecture, which learns and predicts only the dependent variable while treating the other variables as independent inputs, and gives a much better prediction of pollution?
It seems like the example is predicting only pollution already. If you look at the reframed data:
var1(t-1) var2(t-1) var3(t-1) var4(t-1) var5(t-1) var6(t-1) \
1 0.129779 0.352941 0.245902 0.527273 0.666667 0.002290
2 0.148893 0.367647 0.245902 0.527273 0.666667 0.003811
3 0.159960 0.426471 0.229508 0.545454 0.666667 0.005332
4 0.182093 0.485294 0.229508 0.563637 0.666667 0.008391
5 0.138833 0.485294 0.229508 0.563637 0.666667 0.009912
var7(t-1) var8(t-1) var1(t)
1 0.000000 0.0 0.148893
2 0.000000 0.0 0.159960
3 0.000000 0.0 0.182093
4 0.037037 0.0 0.138833
5 0.074074 0.0 0.109658
var1 seems to be pollution. As you can see, you have the values from the previous step (t-1) for all variables and the value at the current step t only for pollution (var1(t)).
This last variable is what the example feeds as y, as you can see in these lines:
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
So the network should already be predicting only pollution.
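For completeness, a minimal sketch of fitting such a model on that split (assuming Keras and the train_X/train_y/test_X/test_y arrays above; this is an illustration, not the tutorial's exact code, the key point being the single output unit for pollution):
from tensorflow import keras

# LSTM layers expect 3-D input: (samples, timesteps, features)
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

model = keras.Sequential([
    keras.layers.LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])),
    keras.layers.Dense(1),  # single output unit: pollution at time t
])
model.compile(loss='mae', optimizer='adam')
model.fit(train_X, train_y, epochs=50, batch_size=72,
          validation_data=(test_X, test_y), verbose=2)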