Should I log transform my data for linear regression analysis - pandas

I have a dataset of Boston houses with the following features:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 414 entries, 1 to 414
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X2 house age 414 non-null float64
1 X3 distance to the nearest MRT station 414 non-null float64
2 X4 number of convenience stores 414 non-null int64
3 X5 latitude 414 non-null float64
4 X6 longitude 414 non-null float64
5 Y house price of unit area 414 non-null float64
dtypes: float64(5), int64(1)
The standard deviations are:
X2 house age 11.392485
X3 distance to the nearest MRT station 1262.109595
X4 number of convenience stores 2.945562
X5 latitude 0.012410
X6 longitude 0.015347
Y house price of unit area 13.606488
dtype: float64
I tried to calculate the skew of the prices and got a value of 0.599.
After log transforming the data, the skew came out as -0.7064.
The question I'm having is: should I continue to work with the log-transformed dataset, or is the transformation unnecessary? And when should I even consider a log transform in my data analysis?

Whether to use a log transformation depends entirely on what fits your data better. Just evaluate the performance of both models (with and without the log transform) and keep whichever gives the better performance metrics.
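For instance, a minimal sketch of that comparison, assuming the dataframe above is called df (the train/test split and the R2 metric are illustrative choices, not from the original question):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Y house price of unit area'])
y = df['Y house price of unit area']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# model fitted on the raw target
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# model fitted on the log-transformed target; predictions mapped back with exp
# (prices are strictly positive, so a plain log is safe here)
log_pred = np.exp(LinearRegression().fit(X_train, np.log(y_train)).predict(X_test))

print('R2, raw target:', r2_score(y_test, raw_pred))
print('R2, log target:', r2_score(y_test, log_pred))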

Related

Histogram multiple columns, percentage per column, same plot

I have this dataframe (df_histo):
ID SA WE EE
1 NaN 320 23.4
2 234 2.3 NaN
.. .. .. ..
345 1.2 NaN 234
I can plot a correct histogram if I use density=True,
buckets = [0,250,500,1000,1500,2000,2500,3000,3500,4000,4500,5000]
plt.hist(df_histo, bins=buckets, label=df_histo.columns, density=True)
however I want to visualize the data (one histogram) as percentages for each country, i.e. what percentage each bin represents within that country.
The closest attempt has been this (with individual columns it works):
plt.hist(df_histo, bins=buckets, label=df_histo.columns, weights=np.ones(len(df_histo)) / len(df_histo))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
ValueError: weights should have the same shape as x
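One possible fix, sketched here as an illustration rather than taken from the thread: give every column its own weights so that the weights array has the same shape as the data, e.g. one weight per cell equal to 1 over that column's non-null count:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

buckets = [0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]

# one weight per cell: 1 / (that column's non-null count), so each column's
# bars sum to 100% of its own non-null values; NaNs simply fall outside every bin
weights = np.ones(df_histo.shape) / df_histo.notna().sum().to_numpy()

plt.hist(df_histo, bins=buckets, label=df_histo.columns, weights=weights)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.legend()
plt.show()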

Pandas subtract dates to get a surgery patient length of stay

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column.
Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data and must have done it differently from before, because I am now getting negative los values, e.g.:
DISDATE       ADMIDATE      los
2019-12-24    2019-12-08    -43805
2019-05-15    2019-03-26    50
2019-10-11    2019-10-07    4
2019-06-20    2019-06-16    4
2019-04-11    2019-04-08    3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
5 ADMIDATE 5 non-null datetime64[ns]
28 DISDATE 5 non-null datetime64[ns]
I am not sure how to ask the right questions about the problem, or why it only affects some rows. In cleansing the data, some of the DISDATE values had to be populated from another column (also a date column) because they were incomplete, and I wonder if it is these rows that come out negative, due to some retention of the original data somehow, even though printing the new DISDATE looks fine.
Your sample works fine and gives the right output (16 days for the first row).
Can you try this and check whether the problem persists:
import io

data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
                   parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
You should see the rows whose values are not valid dates.
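If bad dates do turn up, one possible follow-up (a sketch, not part of the original answer) is to coerce both columns and recompute los, so that unparseable values become NaT instead of silently wrong dates:
# coerce both columns, then recompute the length of stay
df['ADMIDATE'] = pd.to_datetime(df['ADMIDATE'], errors='coerce')
df['DISDATE'] = pd.to_datetime(df['DISDATE'], errors='coerce')
df['los'] = (df['DISDATE'] - df['ADMIDATE']).dt.days

# inspect anything that still comes out negative
print(df.loc[df['los'] < 0, ['ADMIDATE', 'DISDATE', 'los']])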

Pandas - get values on a graph using quantile

I have this df_players:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TableIndex 739 non-null object
1 PlayerID 739 non-null int64
2 GameWeek 739 non-null int64
3 Date 739 non-null object
4 Points 739 non-null int64
5 Price 739 non-null float64
6 BPS 739 non-null int64
7 SelectedBy 739 non-null int64
8 NetTransfersIn 739 non-null int64
9 MinutesPlayed 739 non-null float64
10 CleanSheet 739 non-null float64
11 Saves 739 non-null float64
12 PlayersBasicID 739 non-null int64
13 PlayerCode 739 non-null object
14 FirstName 739 non-null object
15 WebName 739 non-null object
16 Team 739 non-null object
17 Position 739 non-null object
18 CommentName 739 non-null object
And I'm using this function, which uses quantile() (the value is passed via the variable 'cut'), to plot the distribution of players:
def jointplot(X, Y, week=None, title=None,
              positions=None, height=6,
              xlim=None, ylim=None, cut=0.015,
              color=CB91_Blue, levels=30, bw=0.5, top_rows=100000):
    if positions is None:
        positions = ['GKP', 'DEF', 'MID', 'FWD']
    # Check if week is given as a list
    if week is None:
        week = list(range(max(df_players['GameWeek'])))
    if type(week) != list:
        week = [week]
    df_played = df_players.loc[(df_players['MinutesPlayed'] >= 45)
                               & (df_players['GameWeek'].isin(week))
                               & (df_players['Position'].isin(positions))].head(top_rows)
    if xlim is None:
        xlim = (df_played[X].quantile(cut),
                df_played[X].quantile(1 - cut))
    if ylim is None:
        ylim = (df_played[Y].quantile(cut),
                df_played[Y].quantile(1 - cut))
    sns.jointplot(X, Y, data=df_played,
                  kind="kde", xlim=xlim, ylim=ylim,
                  color=color, n_levels=levels,
                  height=height, bw=bw)
    plt.suptitle(title, fontsize=18)
    plt.show()
call:
jointplot('Price', 'Points', positions=['FWD'],
color=color_list[3], title='Forwards')
this plots a joint KDE (figure not included here), where:
xlim = (4.5, 11.892999999999995)
ylim = (1.0, 13.0)
As far as I understand, these x and y limits let me use the quantile range (cut, 1 - cut) to zoom into an area of data points.
QUESTION
Now I would like to get the 'WebName' of the players within a certain area, like so:
After plotting, I can choose a target area above and define its range, roughly, by passing xlim and ylim:
jointplot('Price', 'Points', positions=['FWD'],
xlim=(5.5, 7.0), ylim=(11.5, 13.0),
color=color_list[3], title='Forwards')
which zooms in on the area marked in red above.
But how can I get the players' names inside that area?
You can just select the portion of the players dataframe based on the bounds in the plot:
# the *_lbound / *_ubound values are the edges of the area of interest,
# e.g. the xlim/ylim passed to jointplot above
selected = df_players[
    (df_players.Points >= points_lbound)
    & (df_players.Points <= points_ubound)
    & (df_players.Price >= price_lbound)
    & (df_players.Price <= price_ubound)
]
The list of names is then selected['WebName'].
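If this is needed repeatedly, a small helper (the name and signature are illustrative, not from the thread) could reuse the same xlim/ylim tuples that were passed to jointplot:
def players_in_area(df, xlim, ylim, x='Price', y='Points'):
    # keep only rows whose (x, y) values fall inside the xlim/ylim rectangle
    area = df[(df[x] >= xlim[0]) & (df[x] <= xlim[1])
              & (df[y] >= ylim[0]) & (df[y] <= ylim[1])]
    return area['WebName'].tolist()

players_in_area(df_players, xlim=(5.5, 7.0), ylim=(11.5, 13.0))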

How to view the unique values for each column if the total number of unique values is less than a specific number in my dataset

I am working on the Heart Disease Prediction data and I want to know the unique values for each column.
First I took the total number of unique values per feature in my data:
framinghamDF.nunique()
output:
male 2
age 39
education 4
currentSmoker 2
cigsPerDay 33
BPMeds 2
prevalentStroke 2
prevalentHyp 2
diabetes 2
totChol 248
sysBP 234
diaBP 146
BMI 1364
heartRate 73
glucose 143
TenYearCHD 2
dtype: int64
Now I took out the unique values of an individual feature:
print(framinghamDF["education"].unique().tolist())
output
[4.0, 2.0, 1.0, 3.0, nan]
but I want to get all the unique values of the features which have fewer than 4 unique values
Filter the index values of the Series with boolean indexing:
s = framinghamDF.nunique()
out = s.index[s < 4].tolist()
#alternative
out = s[s < 4].index.tolist()
Finally, to get all the unique values, use:
d = {x: framinghamDF[x].unique() for x in out}
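Packaged as a small helper (the function name and the threshold default are illustrative, not from the thread), the same idea looks like this:
def low_cardinality_uniques(df, max_unique=4):
    # return {column: unique values} for columns with fewer than max_unique distinct values
    s = df.nunique()
    cols = s.index[s < max_unique].tolist()
    return {c: df[c].unique() for c in cols}

# with the nunique() output above, this picks up the binary columns, e.g.
# male, currentSmoker, BPMeds, prevalentStroke, prevalentHyp, diabetes, TenYearCHD
print(low_cardinality_uniques(framinghamDF))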

How to select rows within a pandas dataframe based on time only when index is date and time

I have a dataframe that looks like this:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2016910 entries, 2009-01-02 04:51:00 to 2012-11-02 20:00:00
Freq: T
Data columns:
X1 2016910 non-null values
X2 2016910 non-null values
X3 2016910 non-null values
X4 2016910 non-null values
X5 2016910 non-null values
dtypes: float64(5)
and I would like to "filter" it by accessing only certain times across the whole range of dates. For example, I'd like to return a dataframe that contains all rows where the time is between 13:00:00 and 14:00:00, but for all of the dates.

I am reading the data from a CSV file and the datetime is one column, but I could just as easily make the input CSV file contain a separate date and time. I tried the separate date and time route and created a MultiIndex, but when I did, I ended up with two index columns: one of them containing the proper date with an incorrect time instead of just a date, and the second one containing an incorrect date and then a correct time, instead of just a time. The input data for my MultiIndex attempt looked like this:
20090102,04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102,05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102,05:44:00,90.1400,90.1400,90.1400,90.1400,100
20090102,05:50:00,90.0500,90.0500,90.0500,90.0500,500
20090102,05:56:00,90.1000,90.1000,90.1000,90.1000,300
20090102,05:57:00,90.1000,90.1000,90.1000,90.1000,200
which I tried to read using this code:
singledf = pd.DataFrame.from_csv("inputfile",header=None,index_col=[0,1],parse_dates=True)
which resulted in a dataframe that looks like this:
singledf.sort()
singledf
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 716244 entries, (<Timestamp: 2009-01-02 00:00:00>, <Timestamp: 2012-11-04 04:51:00>) to (<Timestamp: 2012-11-02 00:00:00>, <Timestamp: 2012-11-04 20:00:00>)
Data columns:
X2 716244 non-null values
X3 716244 non-null values
X4 716244 non-null values
X5 716244 non-null values
X6 716244 non-null values
dtypes: float64(4), int64(1)
Maybe the MultiIndex approach is totally wrong, but it's one thing I tried. It seems like it is stuck on using a datetime object, and wants to force the index columns to have a datetime instead of just a date or a time. My source CSV file for my non-MultiIndex attempt looks like this:
20090102 04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102 05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102 05:44:00,90.1400,90.1400,90.1400,90.1400,100
20090102 05:50:00,90.0500,90.0500,90.0500,90.0500,500
20090102 05:56:00,90.1000,90.1000,90.1000,90.1000,300
I am using pandas 0.9. Any suggestions are appreciated!
A regular DatetimeIndex allows you to use the between_time method.
In [12]: data = """\
20090102,04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102,05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102,05:44:00,90.1400,90.1400,90.1400,90.1400,100
20090102,05:50:00,90.0500,90.0500,90.0500,90.0500,500
20090102,05:56:00,90.1000,90.1000,90.1000,90.1000,300
20090102,05:57:00,90.1000,90.1000,90.1000,90.1000,200
"""
In [13]: singledf = pd.DataFrame.from_csv(StringIO(data), header=None, parse_dates=[[0,1]])
In [14]: singledf
Out[14]:
X2 X3 X4 X5 X6
X0_X1
2009-01-02 04:51:00 89.99 89.99 89.99 89.99 100
2009-01-02 05:36:00 90.01 90.01 90.01 90.01 200
2009-01-02 05:44:00 90.14 90.14 90.14 90.14 100
2009-01-02 05:50:00 90.05 90.05 90.05 90.05 500
2009-01-02 05:56:00 90.10 90.10 90.10 90.10 300
2009-01-02 05:57:00 90.10 90.10 90.10 90.10 200
In [15]: singledf.between_time('5:30:00', '5:45:00')
Out[15]:
X2 X3 X4 X5 X6
X0_X1
2009-01-02 05:36:00 90.01 90.01 90.01 90.01 200
2009-01-02 05:44:00 90.14 90.14 90.14 90.14 100
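As an aside, DataFrame.from_csv has since been removed from pandas; a rough modern equivalent of the same setup (a sketch assuming a current pandas version) would be:
import io
import pandas as pd

data = """\
20090102,04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102,05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102,05:44:00,90.1400,90.1400,90.1400,90.1400,100
"""

df = pd.read_csv(io.StringIO(data), header=None)
# build a DatetimeIndex from the separate date and time columns, then drop them
df.index = pd.to_datetime(df[0].astype(str) + ' ' + df[1], format='%Y%m%d %H:%M:%S')
df = df.drop(columns=[0, 1])

print(df.between_time('05:30', '05:45'))   # same time-of-day filter across all dates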