I have a binary Excel file with a DATE column containing the value '7/31/2020'.
When I read the file, the DATE value gets converted to a numpy.int64 with value 44043.
Can you tell me how to stop this conversion, or how to get the date exactly as it appears in Excel?
This is my code to read the excel file
>>> df = pd.read_excel('hello.xlsb', engine='pyxlsb')
>>> df[DATE][0]
44043
Apparently the integer value is the number of days since the 0th of January 1900. But the 0th of January doesn't exist, and Excel also (incorrectly) counts 1900 as a leap year, so there is a fudge factor of 2 when counting from January 1, 1900:
>>> import datetime
>>> d = datetime.date(1900, 1, 1) + datetime.timedelta(days=44043 - 2)
>>> d
datetime.date(2020, 7, 31)
>>> d.isoformat()
'2020-07-31'
>>> d.strftime("%m/%d/%Y")
'07/31/2020'
See the strftime docs for other formatting options.
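If you want to convert the whole column at once rather than one value at a time, pd.to_datetime accepts an origin; a minimal sketch, assuming the column is literally named 'DATE':
import pandas as pd

df = pd.read_excel('hello.xlsb', engine='pyxlsb')
# An origin of 1899-12-30 absorbs both the nonexistent "January 0"
# and Excel's phantom 1900 leap day (the fudge factor of 2 above)
df['DATE'] = pd.to_datetime(df['DATE'], unit='D', origin='1899-12-30')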
You could try parsing the column as a date format when reading it in:
df = pd.read_excel('hello.xlsb', engine='pyxlsb', parse_dates=[DATE])
Here DATE is a variable holding the name of the column expected to be in date format.
I have a column of years from the sunspots dataset.
I want to convert the integer 'year' column (e.g. 1992) to datetime format, then find the time delta and eventually compute cumulative total seconds to represent the time index column of a time series.
I am trying to use the following code but I get the error
TypeError: dtype datetime64[ns] cannot be converted to timedelta64[ns]
sunspots_df['year'] = pd.to_timedelta(pd.to_datetime(sunspots_df['year'], format='%Y') ).dt.total_seconds()
pandas.Timedelta "[r]epresents a duration, the difference between two dates or times." So you're trying to get Python to tell you the difference between a particular datetime and...nothing. That's why it's failing.
If it's important that you store your index this way (and there may be better ways), then you need to pick a start datetime and compute the difference to get a timedelta.
For example, this code...
import pandas as pd
df = pd.DataFrame({'year': [1990,1991,1992]})
diff = (pd.to_datetime(df['year'], format='%Y')
        - pd.to_datetime('1990', format='%Y')).dt.total_seconds()
...returns a series whose values are seconds from January 1st, 1990. You'll note that it doesn't invoke pd.to_timedelta(), because it doesn't need to: the result of the subtraction is already a timedelta64[ns] column.
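For reference, printing diff shows the seconds elapsed since January 1st, 1990 for each year (one non-leap year is 365 * 24 * 3600 = 31,536,000 seconds):
>>> diff
0           0.0
1    31536000.0
2    63072000.0
Name: year, dtype: float64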
I have a pandas dataframe with a timestamp field which I have successfully converted to datetime format, and now I want to output just the month and day as a tuple for the first date value in the data frame. It is for a test, and the output must not have leading zeros. I have tried a number of things, but I cannot find an answer without converting the timestamp to a string, which does not work.
This is the format
2021-05-04 14:20:00.426577
df_cleaned['trans_timestamp']=pd.to_datetime(df_cleaned['trans_timestamp']) is as far as I have got with the code.
I have been working on this for days and cannot get output the checker will accept.
Update
If you want to extract month and day from the first record (solution proposed by @FObersteiner):
>>> df['trans_timestamp'].iloc[0].timetuple()[1:3]
(5, 4)
If you want to extract the month and day from every row of your dataframe, use:
# Setup
df = pd.DataFrame({'trans_timestamp': ['2021-05-04 14:20:00.426577']})
df['trans_timestamp'] = pd.to_datetime(df['trans_timestamp'])
# Extract tuple
df['month_day'] = df['trans_timestamp'].apply(lambda x: (x.month, x.day))
print(df)
# Output
trans_timestamp month_day
0 2021-05-04 14:20:00.426577 (5, 4)
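On larger frames, a vectorized alternative that avoids the Python-level apply is to zip the .dt.month and .dt.day accessors; a minimal sketch, reusing the setup above:
# Vectorized: integer month/day values zipped into one tuple per row
df['month_day'] = list(zip(df['trans_timestamp'].dt.month,
                           df['trans_timestamp'].dt.day))
Both accessors return plain integers, so the result has no leading zeros either.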
I have a pandas DataFrame which includes a datetime column, and I want to filter the data frame between the current hour and 10 hours ago. I have tried different ways to do it, but I still cannot handle it: when I work with pandas, the column type is Series and I can't use timedelta to compare it. If I use a for loop to compare the column values as strings to my time interval, it is not efficient.
I want to filter the 'dateTime' column between the current time and 10 hours ago, then filter rows based on 'weeks' > 80.
I have tried the snippets below as well, but they have not worked:
filter_criteria = main_table['dateTime'].sub(today).abs().apply(lambda x: x.hours <= 10)
main_table.loc[filter_criteria]
This returns an error:
TypeError: unsupported operand type(s) for -: 'str' and 'datetime.datetime'
Similarly, this code has the same problem:
main_table.loc[main_table['dateTime'] >= (datetime.datetime.today() - pd.DateOffset(hours=10))]
And:
main_table[(pd.to_datetime('today') - main_table['dateTime'] ).dt.hours.le(10)]
In all of the code above, main_table is the name of my data frame.
How can I filter them?
First you need to make sure that the datatype of your datetime column is correct. You can check it using:
main_table.info()
If it is not datetime (i.e., it shows as object), convert it:
# use proper formatting if this line does not work
main_table['dateTime'] = pd.to_datetime(main_table['dateTime'])
Then you need to compute the datetime ten hours before the current time:
from datetime import datetime, timedelta
date_time_ten_before = datetime.now() - timedelta(hours = 10)
All that remains is to filter the column:
main_table_10 = main_table[main_table['dateTime'] >= date_time_ten_before]
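Putting both conditions from the question together, a minimal sketch assuming the columns are named dateTime and weeks as above:
import pandas as pd
from datetime import datetime, timedelta

# Ensure the column is datetime, then build the ten-hour cutoff
main_table['dateTime'] = pd.to_datetime(main_table['dateTime'])
cutoff = datetime.now() - timedelta(hours=10)

# Rows from the last ten hours whose 'weeks' value exceeds 80
result = main_table[(main_table['dateTime'] >= cutoff) & (main_table['weeks'] > 80)]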
This CSV has dates in ISO 8601 format.
0 2014-01-01T00:00:00.000
1 2014-01-01T00:46:43.000
2 2014-01-01T01:33:26.001
I want to select the rows up until January 2. I'm not sure how to do this. I thought including
parse_dates=True
would allow me to refer to the date/time values directly like this:
sat0 = sat0[sat0['epoch']<2014-01-02T00:00:00.000]
but it's not working.
You can use pd.to_datetime() to convert the target date string column with its appropriate format. In this case, you can use '%Y-%m-%dT%H:%M:%S.%f' as the date format.
Here's the implementation in your case:
import pandas as pd
sat0 = pd.DataFrame({
    'epoch': [
        '2014-01-01T00:00:00.000',
        '2014-01-01T00:46:43.000',
        '2014-03-01T01:33:26.001',
    ]})
date_fmt = '%Y-%m-%dT%H:%M:%S.%f'
sat0['epoch'] = pd.to_datetime(sat0['epoch'], format=date_fmt)
sat0 = sat0[sat0['epoch'] < '2014-01-02T00:00:00.000']
Which outputs
epoch
0 2014-01-01 00:00:00
1 2014-01-01 00:46:43
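Alternatively, because the strings are already ISO 8601, you can let read_csv convert the column up front; a sketch assuming a hypothetical file name sat0.csv and a column header of epoch:
import pandas as pd

# parse_dates converts 'epoch' to datetime64 at read time ('sat0.csv' is a placeholder name)
sat0 = pd.read_csv('sat0.csv', parse_dates=['epoch'])
sat0 = sat0[sat0['epoch'] < '2014-01-02']  # pandas coerces the string for the comparison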
Create a dataframe whose first column is text.
import pandas as pd
values = {'dates': ['2019', '2020', '2021'],
          'price': [11, 12, 13]}
df = pd.DataFrame(values, columns = ['dates','price'])
Check the dtypes:
df.dtypes
dates object
price int64
dtype: object
Convert the dates column to datetime:
df['dates'] = pd.to_datetime(df['dates'], format='%Y')
df
dates price
0 2019-01-01 11
1 2020-01-01 12
2 2021-01-01 13
I want to convert the dates column to a date type displayed in the following format, containing only the year number:
dates price
0 2019 11
1 2020 12
2 2021 13
How can I achieve this?
If you choose the datetime format for your column, it is presumably because you benefit from it. What you see in the column ('2019-01-01') is just the representation of the datetime object. The real question here is: why do you need a datetime object?
Actually, I don't care about datetime type:
Use a string ('2019'), or preferably an integer (2019), which will enable you to perform sorting, calculations, etc.
I need the datetime type but I really want to see only the year:
Use style to format your column while retaining the underlying type:
df.style.format({'dates': lambda t: t.strftime('%Y')})
This will allow you to keep the type while having a clean visual format.
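And if the first option (plain integers) works for you, the .dt.year accessor gets you there in one step; a minimal sketch on the frame built above:
# The column is already datetime64 (converted earlier); extract integer years
df['dates'] = df['dates'].dt.year
print(df)
#    dates  price
# 0   2019     11
# 1   2020     12
# 2   2021     13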