date cannot be serialized - pandas

I am getting an error while trying to save the dataframe as a file.
from fastparquet import write
write('profile_dtl.parq', df)
The error is related to "date" and the error message looks like this...
ValueError: Can't infer object conversion type: 0 1990-01-01
1 1954-01-01
2 1981-11-15
3 1993-01-21
4 1948-01-01
5 1977-01-01
6 1968-04-28
7 1969-01-01
8 1989-01-01
9 1985-01-01
Name: dob, dtype: object
I have checked that the column has dtype "object", just like other columns that serialize without any problem. If I remove the "dob" column from the dataframe, this line works. It also works if the values contain date+time.
Are plain dates not accepted by fastparquet?

Try changing dob to datetime64 dtype:
import pandas as pd
dob = pd.Series(['1954-01-01', '1981-11-15', '1993-01-21', '1948-01-01',
'1977-01-01', '1968-04-28', '1969-01-01', '1989-01-01',
'1985-01-01'], name='dob')
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
Name: dob, dtype: object
Note the dtype that results:
pd.to_datetime(dob)
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
Name: dob, dtype: datetime64[ns]
Using this Series as an index in a DataFrame:
baz = list(range(9))
foo = pd.DataFrame(baz, index=pd.to_datetime(dob), columns=['dob'])
You should be able to save your Parquet file now.
from fastparquet import write
write('foo.parquet', foo)
$ ls -l foo.parquet
-rw-r--r-- 1 moi admin 854 Oct 13 16:44 foo.parquet
Your dob Series has object dtype, and you left the default object_encoding='infer' argument to fastparquet.write unchanged. So, from the docs:
"The special value 'infer' will cause the type to be guessed from the first ten non-null values."
Fastparquet does not try to infer a date type: it only guesses among bytes|utf8|json|bson|bool|int|float.
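If you want to keep dob as a regular column rather than making it the index, converting it in place before writing should also work. A minimal sketch, using a hypothetical frame standing in for the original df:

```python
import pandas as pd

# Hypothetical frame standing in for the original df
df = pd.DataFrame({'dob': ['1990-01-01', '1954-01-01', '1981-11-15']})

# Convert the object column to datetime64[ns] so fastparquet can map it
# to a Parquet timestamp instead of trying to guess an object encoding.
df['dob'] = pd.to_datetime(df['dob'])

# write('profile_dtl.parq', df)  # the write should now succeed
```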

Related

Pandas dataframe first value shows up as column name

I am new to pandas. I have a pandas data frame, but the first value (0,0) is being used as an index/name. I want 0.9121 to be the first value; how do I do that?
0 0.2171
1 0.21163
2 0.87221
3 0.432735
4 0.3231
Name: 0.9121, dtype: float64
I would like to have:
0 0.9121
1 0.2171
2 0.21163
3 0.87221
4 0.432735
5 0.3231
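No answer appears in the thread for this one. The symptom usually means the first data row was consumed as a header (for example, pd.read_csv with its default header=0 on a headerless file), so rereading with header=None avoids it. If rereading is not an option, here is a sketch of an after-the-fact repair on a hypothetical Series reproducing the symptom:

```python
import pandas as pd

# Hypothetical Series reproducing the symptom: the first value
# ended up as the Series name instead of as data.
s = pd.Series([0.2171, 0.21163, 0.87221, 0.432735, 0.3231], name=0.9121)

# Prepend the name back as the first value and renumber the index.
fixed = pd.concat([pd.Series([s.name]), s], ignore_index=True)
print(fixed)
```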

Panda astype not converting column to int even when using errors=ignore

I have the following DF
ID
0 1.0
1 555555.0
2 NaN
3 200.0
When I try to convert the ID column to Int64 I got the following error:
Cannot convert non-finite values (NA or inf) to integer
I've used the following code to solve this problem:
df["ID"] = df["ID"].astype('int64', errors='ignore')
Although, when I use the above code my ID column persists with float64 type.
Any tip to solve this problem?
With errors='ignore', astype simply returns the original column unchanged when the conversion fails, which is why it stays float64. Use pandas' nullable Int64 dtype instead of np.int64:
df['ID'] = df['ID'].astype('Int64')
Output:
>>> df
ID
0 1
1 555555
2 <NA>
3 200
>>> df['ID'].dtype
Int64Dtype()
>>> df['ID'] + 10
0 11
1 555565
2 <NA>
3 210
Name: ID, dtype: Int64
>>> print(df.to_csv(index=False))
ID
1
555555
""
200
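A follow-up worth knowing (not from the thread): the nullable dtype is not preserved through a CSV round trip unless you request it again when reading. A sketch, using the CSV text produced above:

```python
import io
import pandas as pd

# CSV as written by to_csv above: the <NA> becomes an empty field ("").
csv_text = 'ID\n1\n555555\n""\n200\n'

# dtype='Int64' keeps the column as a nullable integer on the way back
# in, instead of falling back to float64 for the missing value.
df2 = pd.read_csv(io.StringIO(csv_text), dtype='Int64')
print(df2['ID'].dtype)
```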

Pandas subtract dates to get a surgery patient length of stay

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column.
Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data and must have done it in a different way to previously because I am now getting a negative los, eg.
DISDATE     ADMIDATE    los
2019-12-24  2019-12-08  -43805
2019-05-15  2019-03-26  50
2019-10-11  2019-10-07  4
2019-06-20  2019-06-16  4
2019-04-11  2019-04-08  3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
5   ADMIDATE  5 non-null  datetime64[ns]
28  DISDATE   5 non-null  datetime64[ns]
I am not sure how to ask the right questions about this problem, or why it only affects some rows. In cleaning the data, some of the DISDATEs had to be populated from another column (also a date column) because they were incomplete, and I wonder if those are the negative ones, due to some retention of the original data somehow, even though printing the new DISDATE looks fine.
Your sample works fine here and gives the right output (16 days for the first row).
Can you try the following and check whether the problem persists:
import io
data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
This should show the rows whose values are not valid dates.
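Another way to guard the los calculation itself is to coerce both columns before subtracting, so anything unparseable becomes NaT rather than a nonsense timedelta. A sketch on hypothetical data (the real frame is 600k rows):

```python
import pandas as pd

# Hypothetical frame with one corrupted admission date.
df = pd.DataFrame({
    'ADMIDATE': ['2019-12-08', '2019-03-26', 'not a date'],
    'DISDATE':  ['2019-12-24', '2019-05-15', '2019-10-11'],
})

# Coerce both columns: unparseable values become NaT, so the
# resulting los is NaN instead of a spurious negative number.
admi = pd.to_datetime(df['ADMIDATE'], errors='coerce')
dis = pd.to_datetime(df['DISDATE'], errors='coerce')
df['los'] = (dis - admi).dt.days
print(df['los'])
```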

change multiple date time formats to single format in pandas dataframe

I have a DataFrame with multiple formats as shown below
0 07-04-2021
1 06-03-1991
2 12-10-2020
3 07/04/2021
4 05/12/1996
What I want is to have one format after applying the Pandas function to the entire column so that all the dates are in the format
day/month/year
What I tried is the following
date1 = pd.to_datetime(df['Date_Reported'], errors='coerce', format='%d/%m/%Y')
But it is not working out. Can this be done? Thank you
Try with dayfirst=True:
date1 = pd.to_datetime(df['Date_Reported'], errors='coerce', dayfirst=True)
output of date1:
0 2021-04-07
1 1991-03-06
2 2020-10-12
3 2021-04-07
4 1996-12-05
Name: Date_Reported, dtype: datetime64[ns]
If needed:
date1=date1.dt.strftime('%d/%m/%Y')
output of date1:
0 07/04/2021
1 06/03/1991
2 12/10/2020
3 07/04/2021
4 05/12/1996
Name: Date_Reported, dtype: object

replace any strings with nan in a pandas dataframe

I'm new to pandas and the dataframe-concept. Because of the format of my data (excel-sheets, first row is the name of my data, the second row is the unit) it's a little tricky to handle it in a data frame.
The task is to calculate new data from existing columns, e.g. df['c'] = df['a']**2 + df['b']
I get: TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
This did work, but it is a pain to my hands and eyes:
df['c'] = df['a']
df['c'] = df['a'].tail(len(df['a'])-1)**2 + df['b'].tail(len(df['b'])-1)
df.loc[0,'c'] = 'unit for c'
Is there any way to do this quicker or with less typing?
Thanks already
schamonn
Let's look at the error mentioned first in this post.
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
What this error is saying is that you are trying to raise a string to a power. We can replicate it with the following example:
df = pd.DataFrame({'a':['1','2','3'],'b':[4,5,6]})
df['a']**2
Output last line of stack trace:
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
A simple resolution, if all the values in your a column are numeric representations, is to use pd.to_numeric:
pd.to_numeric(df['a'])**2
Output:
0 1
1 4
2 9
Name: a, dtype: int64
Got non-numeric strings in column a as well?
Use errors='coerce' as a parameter to pd.to_numeric:
df = pd.DataFrame({'a':['a','1','2','3'],'b':[4,5,6,7]})
Use:
pd.to_numeric(df['a'], errors='coerce')**2
Output:
0 NaN
1 1.0
2 4.0
3 9.0
Name: a, dtype: float64
This is how I read in the data:
Data = pd.read_excel(fileName, sheet_name = 'Messung')
In [154]: Data
Out[154]:
T1 T2 Messung Datum
0 °C °C - -
1 12 100 1 2018-12-06 00:00:00
2 15 200 2 2018-12-06 00:00:00
3 20 120 3 2018-12-06 00:00:00
4 10 160 4 2018-12-06 00:00:00
5 12 160 5 2018-12-06 00:00:00
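Given that layout, one way to avoid the repeated tail() calls is to split the units row off once, do all the arithmetic on the numeric part, and reattach it at the end. A sketch on hypothetical data in the same shape (the column name c and the unit label are placeholders):

```python
import io
import pandas as pd

# Hypothetical sheet mimicking the layout: the first data row holds units.
raw = pd.read_csv(io.StringIO(
    "T1,T2,Messung\n°C,°C,-\n12,100,1\n15,200,2\n20,120,3"
))

# Split the units row off, convert the rest to numbers, compute c,
# then put the units row back on top.
units = raw.iloc[[0]].copy()
data = raw.iloc[1:].apply(pd.to_numeric, errors='coerce')
data['c'] = data['T1']**2 + data['T2']
units['c'] = 'unit for c'
result = pd.concat([units, data])
print(result)
```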