Pandas subtract dates to get a surgery patient length of stay - pandas

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column.
Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data, and must have done it differently to before, because I am now getting a negative los for some rows, e.g.
DISDATE       ADMIDATE      los
2019-12-24    2019-12-08    -43805
2019-05-15    2019-03-26    50
2019-10-11    2019-10-07    4
2019-06-20    2019-06-16    4
2019-04-11    2019-04-08    3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
5 ADMIDATE 5 non-null datetime64[ns]
28 DISDATE 5 non-null datetime64[ns]
I am not sure how to ask the right questions about this problem, or why it is only affecting some rows. In cleaning the data, some of the DISDATE values had to be populated from another column (also a date column) because they were incomplete, and I wonder if those are the rows coming out negative, due to some retention of the original data somehow, even though printing the new DISDATE looks fine.

Your sample works well with the right output (16 days for the first row).
Can you try this and check whether the problem persists:
import io
data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
                   parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
You should see the rows whose values are not valid dates.
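If that check flags rows, a minimal follow-up sketch (assuming the stray values crept in during your cleaning step) is to re-coerce both columns and recompute los:
import pandas as pd

# Anything that is not a real date becomes NaT instead of surviving as an object/string
df['ADMIDATE'] = pd.to_datetime(df['ADMIDATE'], errors='coerce')
df['DISDATE'] = pd.to_datetime(df['DISDATE'], errors='coerce')

# Recompute length of stay; rows with NaT in either column come out as NaN
df['los'] = (df['DISDATE'] - df['ADMIDATE']).dt.days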

Related

convert csv import via pandas to separate columns

I have a csv file that came into pandas like this:
csv file:
Date,Numbers,Extra, NaN
05/17/2002,15 18 25 33 47,30,
Pandas input:
df = pd.read_csv('/Users/owner/Downloads/file.csv')
#s = Series('05/17/2002', '15 18 25 33 47')
#s.str.partition(' ')
Output:
<bound method NDFrame.head of   Draw Date  Winning Numbers  Extra  NaN
0  05/17/2002   15 18 25 33 47     30   NaN>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 4 columns):
Date       1718 non-null object
Numbers    1718 non-null object
Extra      1718 non-null int64
NaN        815 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.8+ KB
How do I convert the non-null object columns into two columns:
one a date
one a list
It doesn't seem to recognize split, .str, or the headings.
Thanks
I think you want this. It specifies column 0 as a date column, and a converter for column 1:
>>> df = pd.read_csv('file.csv', parse_dates=[0], converters={1: str.split})
>>> df
Date Numbers Extra NaN
0 2002-05-17 [15, 18, 25, 33, 47] 30
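If you then want the numbers as separate columns rather than one list column, a possible follow-up (the n1..n5 column names are just illustrative) is:
# Expand each parsed list into its own integer columns and join them back on
nums = pd.DataFrame(df['Numbers'].tolist(), index=df.index).astype(int)
nums.columns = [f'n{i + 1}' for i in range(nums.shape[1])]
df = df.join(nums)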

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first level goes from 0 to 23 (the hour) and the second level goes from 0 to 55 in steps of 5 (the minute). In other words, we have daily 5-minute-increment data.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output adds a new column called '5min_ret' to the original dataframe, in which each row takes the correct value for its hour/5-minute pair from the smaller dataframe containing 5min_Ret:
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to use merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex:
ES['min'] = ES.index.minute
Now you can merge with your MultiIndex dataframe containing the column '5min_Ret', which I'll call df_multi:
ES = ES.merge(df_multi.reset_index(), left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'], how='left')
Here you merge 'Hour' and 'min' from ES against 'level_0' and 'level_1', the columns created from the MultiIndex of df_multi when you call reset_index, keeping every row of the left dataframe (ES).
You should get a new column in ES named '5min_Ret' with the values you are looking for. You can drop the column 'min' if you don't need it any more with ES = ES.drop('min', axis=1). A fuller sketch is below.
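Putting it together, here is a minimal sketch, assuming the MultiIndex levels of df_multi are unnamed (so reset_index produces level_0 and level_1) and using the 'Hour' column as it appears in ES; it also restores the original Date index, which a plain merge would otherwise discard:
ES['min'] = ES.index.minute                      # minute of the hour from the DatetimeIndex

ES = (ES.reset_index()                           # keep 'Date' as a column through the merge
        .merge(df_multi.reset_index(),
               left_on=['Hour', 'min'],
               right_on=['level_0', 'level_1'],
               how='left')
        .set_index('Date')                       # put the DatetimeIndex back
        .drop(columns=['min', 'level_0', 'level_1']))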

Removing index duplicates in pandas

There is some seemingly inconsistent behaviour when removing duplicates in pandas.
Problem set up: I have a dataframe with three columns and 3330 timeseries observations as shown below:
data.describe()
Mean Buy Sell
count 3330 3330 3330
Checking if the data contains any duplicates, shows there are duplicate indices.
data.index.duplicated().any()
True
How many duplicates are in the data?
data.loc[data.index.duplicated()].count()
Mean 38
Buy 38
Sell 38
The duplicates can be visually inspected too
data[data.index.duplicated()]
Dilemma: Clearly, there are duplicates in the data, and there are 38 of them per column. However, when I use the DataFrame's drop_duplicates(), it seems more data is dropped than expected.
data.drop_duplicates().count()
Mean 3241
Buy 3241
Sell 3241
dtype: int64
data.count() - data.drop_duplicates().count()
Mean 89
Buy 89
Sell 89
Any ideas on the cause of this disparity, or the detail I'm missing, would be appreciated. Note: it is possible to have similar entries of data, but dates should not be duplicated, hence the reasonable way to clean the data is to remove the duplicate days.
The disparity comes from the fact that drop_duplicates() compares row values across all columns and ignores the index, while index.duplicated() looks only at the index, so the two can flag different numbers of rows. If I understand you correctly, you want to keep only the first occurrence (row / record) where there are duplicates in your index?
This will accomplish that.
import pandas as pd
df = pd.DataFrame({'IDX': [1, 2, 2, 2, 3, 4, 5, 5, 6],
                   'Mean': [1, 2, 3, 4, 5, 6, 7, 8, 9]}).set_index('IDX')
df
Mean
IDX
1 1
2 2
2 3
2 4
3 5
4 6
5 7
5 8
6 9
duplicates = df.index.duplicated()
duplicates
array([False, False, True, True, False, False, False, True, False])
keep = duplicates == False
df.loc[keep,:]
Mean
IDX
1 1
2 2
3 5
4 6
5 7
6 9
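The same idea in one line, using the mask directly (keep='first' is the default, shown here for clarity):
# Keep only the first row for each duplicated index value
df = df[~df.index.duplicated(keep='first')]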

date can not be serialized

I am getting an error while trying to save the dataframe as a file.
from fastparquet import write
write('profile_dtl.parq', df)
The error is related to "date" and the error message looks like this...
ValueError: Can't infer object conversion type: 0 1990-01-01
1 1954-01-01
2 1981-11-15
3 1993-01-21
4 1948-01-01
5 1977-01-01
6 1968-04-28
7 1969-01-01
8 1989-01-01
9 1985-01-01
Name: dob, dtype: object
I have checked that the column is "object", just like other columns that can be serialized without any problem. If I remove the "dob" column from the dataframe, then this line works. It also works if the values are date+time.
Are plain dates not accepted by fastparquet?
Try changing dob to datetime64 dtype:
import pandas as pd
dob = pd.Series(['1954-01-01', '1981-11-15', '1993-01-21', '1948-01-01',
                 '1977-01-01', '1968-04-28', '1969-01-01', '1989-01-01',
                 '1985-01-01'], name='dob')
dob
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
Name: dob, dtype: object
Note the dtype that results:
pd.to_datetime(dob)
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
dtype: datetime64[ns]
Using this Series as an index in a DataFrame:
baz = list(range(9))
foo = pd.DataFrame(baz, index=pd.to_datetime(dob), columns=['dob'])
You should be able to save your Parquet file now.
from fastparquet import write
write('foo.parquet', foo)
$ls -l foo.parquet
-rw-r--r-- 1 moi admin 854 Oct 13 16:44 foo.parquet
Your dob Series has object dtype, and you left the object_encoding='infer' argument to fastparquet.write unchanged. So, from the docs:
"The special value 'infer' will cause the type to be guessed from the first ten non-null values."
Fastparquet does not try to infer a date value from what it expects to be one of bytes|utf8|json|bson|bool|int|float.
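Applied to your original dataframe, a minimal sketch (assuming df and the dob column from the question) is to convert the column in place before writing:
import pandas as pd
from fastparquet import write

df['dob'] = pd.to_datetime(df['dob'])   # object -> datetime64[ns], which fastparquet can serialize
write('profile_dtl.parq', df)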

How do I sort column by targeting a specific number within that cell?

I would like to use pandas in Python to sort a specific column by date (more specifically, by the year). However, the year is buried among a bunch of other numbers. How do I target just the two digits that I need?
In the example below, I want to sort this column by the numbers [16,14,15...] rather than considering all the numbers in that row.
3/18/16 11:46
6/19/14 14:58
7/27/15 14:22
8/3/15 12:59
2/20/13 12:33
9/27/16 12:08
7/27/15 14:22
Given a dataframe like this,
date
0 3/18/16
1 6/19/14
2 7/27/15
3 8/3/15
4 2/20/13
5 9/27/16
6 7/27/15
You can convert the date column to datetime format and then sort.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')
The resulting dataframe
date
4 2013-02-20
1 2014-06-19
2 2015-07-27
6 2015-07-27
3 2015-08-03
0 2016-03-18
5 2016-09-27
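If you really do want to order by the year alone rather than the full date, one possible variant (the key argument requires pandas 1.1 or later) is:
# Sort only on the extracted year; mergesort is stable, so ties keep their original order
df = df.sort_values(by='date', key=lambda s: s.dt.year, kind='mergesort')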