TypeError: <class 'datetime.time'> is not convertible to datetime - pandas

The problem is fairly simple. My objective is to compute the difference in days between two date columns, say A and B.
These are my attempts:
df['daydiff'] = df['A']-df['B']
df['daydiff'] = ((df['A']) - (df['B'])).dt.days
df['daydiff'] = (pd.to_datetime(df['A'])-pd.to_datetime(df['B'])).dt.days
These worked for me before, but for some reason I keep getting this error this time:
TypeError: <class 'datetime.time'> is not convertible to datetime
When I export the df to Excel, the dates look just fine. Any thoughts?

Use pd.Timestamp to handle the awkward differences in your formatted times.
df['A'] = df['A'].apply(pd.Timestamp) # will handle parsing
df['B'] = df['B'].apply(pd.Timestamp) # will handle parsing
df['day_diff'] = (df['A'] - df['B']).dt.days
Of course, if you don't want to change the format of df['A'] and df['B'] within the DataFrame that you are outputting, you can do this in a one-liner.
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
This will give you the days between as an integer.
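For instance, on a toy frame with string dates (the column values here are made up):
import pandas as pd

df = pd.DataFrame({'A': ['2018-03-05', '2018-03-12'], 'B': ['2018-03-01', '2018-03-01']})
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
df['day_diff'].tolist()  # [4, 11]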

When I applied the solution offered by emmet02, I got TypeError: Cannot convert input [00:00:00] of type <class 'datetime.time'> as well. It's basically saying that the DataFrame contains missing timestamp values which are represented as [00:00:00], and this value is rejected by the pandas.Timestamp function.
To address this, apply a suitable missing-value strategy to clean your data set before calling
df.apply(pd.Timestamp)
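For instance (a minimal sketch; the column name and the time(0, 0) placeholder are assumptions based on the error above), you can turn the placeholder times into NaT before converting:
import datetime
import pandas as pd

# Toy column where missing dates were stored as a bare 00:00:00 time
df = pd.DataFrame({'A': [datetime.date(2020, 1, 5), datetime.time(0, 0)]})

# Swap the placeholder times for NaT, then convert the remaining values
df['A'] = df['A'].replace({datetime.time(0, 0): pd.NaT})
df['A'] = df['A'].apply(lambda v: pd.Timestamp(v) if pd.notna(v) else pd.NaT)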

Related

Finding smallest dtype to safely cast an array to

Let's say I want to find the smallest data type I can safely cast this array to, to save it as efficiently as possible. (The expected output is int8.)
arr = np.array([-101,125,6], dtype=np.int64)
The most logical solution seems to be something like
np.min_scalar_type(arr) # dtype('int64')
but that function doesn't work as expected for arrays. It just returns their original data type.
The next thing I tried is this:
np.promote_types(np.min_scalar_type(arr.min()), np.min_scalar_type(arr.max())) # dtype('int16')
but that still doesn't output the smallest possible data type.
What's a good way to achieve this?
Here's a working solution I wrote. It will only work for integers.
def smallest_dtype(arr):
    arr_min = arr.min()
    arr_max = arr.max()
    for dtype_str in ["u1", "i1", "u2", "i2", "u4", "i4", "u8", "i8"]:
        if (arr_min >= np.iinfo(np.dtype(dtype_str)).min) and (arr_max <= np.iinfo(np.dtype(dtype_str)).max):
            return np.dtype(dtype_str)
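For the array from the question this returns dtype('int8'):
import numpy as np

arr = np.array([-101, 125, 6], dtype=np.int64)
smallest_dtype(arr)  # dtype('int8')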
This is close to your initial idea:
np.result_type(np.min_scalar_type(arr.min()), arr.max())
It will take the signed int8 from arr.min() if arr.max() fits inside of it.
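A quick check (value-based promotion rules have shifted across NumPy versions, so treat this as a sketch):
import numpy as np

arr = np.array([-101, 125, 6], dtype=np.int64)
dt = np.result_type(np.min_scalar_type(arr.min()), arr.max())
dt               # dtype('int8')
arr.astype(dt)   # safe downcast for this particular array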

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest a way to see which entries in "excerpts" are causing problems, or, alternatively, simply ignore them (leaving their contribution out of the TF/IDF)?
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts (one row per document, one column per term). The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the TfidfTransformer documentation for sklearn).
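For reference, here is a self-contained sketch of that two-step usage (the sample excerpts are made up); sklearn's TfidfVectorizer combines the same two steps into a single object:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

excerpts = ["I'm trying to work out...", "another excerpt", "and one more"]

count_vect = CountVectorizer()
token_counts = count_vect.fit_transform(excerpts)        # sparse (n_docs, n_terms) count matrix
tf_idf = TfidfTransformer().fit_transform(token_counts)  # same shape, tf-idf weighted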

pandas questions about argmin and timestamp

final_month = pd.Timestamp('2018-02-01')
df_final_month = df[df['week'] >= final_month]
df_final_month.iloc[:, 1:].sum().argmax()
index = df.set_index('week')
index['storeC'].argmin()
The code above is correct; I just don't exactly understand how it works inside. I have some questions:
1. The type of week is datetime. Is the reason for setting final_month as a Timestamp that datetime is almost the same as Timestamp, so they recognise each other in Python?
2. About argmax() and argmin(): for df_final_month.iloc[:, 1:].sum().argmax(), I removed sum() and tried df_final_month.iloc[:, 1:].argmax(), which returns
AttributeError: 'DataFrame' object has no attribute 'argmax'
Why is that? And why doesn't the second snippet need a max() or something before calling argmin()? What is the requirement for using argmin()/argmax()?
Please explain the details of how Python and pandas handle these data; the more detail the better.
Thanks! I am new to Python.
Is Timestamp almost the same as datetime?
Here is a quote from the pandas documentation itself:
Timestamp is the pandas equivalent of python's Datetime and is interchangeable with it in most cases
In fact, if you look at the source code of pandas, you will see that Timestamp actually inherits from datetime. Here is code to check that these statements are true:
import datetime
import pandas as pd

dt = datetime.datetime(2018, 1, 1)
ts = pd.Timestamp('2018-01-01')
dt == ts  # True
isinstance(ts, datetime.datetime)  # True
Why does calling the argmax method on a DataFrame without calling sum first throw an error?
Because the DataFrame object doesn't have an argmax method; only Series does. And sum, in your case, returns a Series instance.
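A minimal sketch of that distinction (the store columns are made up; note that on recent pandas Series.argmax returns a position while idxmax returns the label):
import pandas as pd

df = pd.DataFrame({'storeA': [1, 5], 'storeB': [7, 2]})
totals = df.sum()  # column-wise sums -> a Series indexed by column name
totals.idxmax()    # 'storeB' (label of the largest total)
totals.argmax()    # 1 (its position)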

python read data like -.345D-4

I am wondering how to read values like -.345D+01 with numpy/scipy.
The values are floats written with a D exponent and with the leading zero omitted.
I have tried numpy.loadtxt and got errors like
ValueError: invalid literal for float(): -.345D+01
Many thanks :)
You could write a converter and use the converters keyword. If cols are the indices of the columns where you expect this format:
converters = dict.fromkeys(cols, lambda x: float(x.replace("D", "E")))
np.loadtxt(yourfile, converters=converters)
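A self-contained sketch of the idea with made-up data (recent NumPy passes each field to the converter as str; very old versions passed bytes):
import io
import numpy as np

data = io.StringIO("-.345D+01 .2D-2\n1.5D+00 -3.0D+01")
cols = (0, 1)  # columns expected to carry the D exponent
converters = dict.fromkeys(cols, lambda x: float(x.replace("D", "E")))
np.loadtxt(data, converters=converters)
# array([[-3.45e+00,  2.00e-03],
#        [ 1.50e+00, -3.00e+01]])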

Python turning ints into floats (Postgres database)

What is the best way to avoid this error?
DataError: invalid input syntax for integer: "669068424.0" CONTEXT:
COPY sequence_raw, line 2, column id: "669068424.0"
I created a table using pgAdmin, specifying the data type for each column. I then read the data in with pandas and do some processing. I could explicitly provide a list of columns and call .astype(int) on them, but is that necessary?
I understand that the reason there is a .0 after the integers is that there are NaNs in the data, so the column is turned into floats instead of integers. What is the best way to work around this? I saw in the pandas 0.19 pre-release notes that there is better handling of sparse data; is this covered by any chance?
def process_file(conn, table_name, file_object):
    fake_conn = pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()

df = pd.read_sql_query(sql=query.format(**params), con=engine)
df.to_csv('../raw/temp_sequence.csv', index=False)
df = open('../raw/temp_sequence.csv')
process_file(conn=pg_engine, table_name='sequence_raw', file_object=df)
You can use the float_format parameter of to_csv to specify the format of the floats in the CSV:
df.to_csv('../raw/temp_sequence.csv', index=False, float_format="%d")
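A quick sketch of the effect with made-up ids; alternatively, the nullable Int64 dtype added in pandas 0.24 avoids the float upcast altogether:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [669068424.0, np.nan]})
df.to_csv('temp_sequence.csv', index=False, float_format='%d')
# writes 669068424 for the id; the NaN cell is left empty (the na_rep default)

# Later pandas (0.24+): keep the column integral despite the NaNs
# df['id'] = df['id'].astype('Int64')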