How can I read values like -.345D+01 (Fortran-style D exponents) with numpy/scipy?
The values are floats with the leading zero omitted.
I have tried numpy.loadtxt and got errors like
ValueError: invalid literal for float(): -.345D+01
Many thanks :)
You could write a converter and use the converters keyword. If cols are the indices of the columns where you expect this format:
import numpy as np

converters = dict.fromkeys(cols, lambda x: float(x.replace("D", "E")))
np.loadtxt(yourfile, converters=converters)
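A self-contained sketch of this approach, using an in-memory file and made-up sample values (the two-column layout is an assumption):

```python
import io

import numpy as np

def fortran_float(field):
    """Parse a Fortran-style float such as '-.345D+01'."""
    if isinstance(field, bytes):  # older numpy versions pass bytes to converters
        field = field.decode()
    return float(field.replace("D", "E").replace("d", "e"))

# hypothetical sample data: two columns in D-exponent notation
data = io.StringIO("-.345D+01 1.2D+00\n2.5D-01 -.5D+00")
converters = dict.fromkeys(range(2), fortran_float)
arr = np.loadtxt(data, converters=converters)
```

Using a named function instead of a lambda also lets you handle both bytes and str inputs, which differs across numpy versions.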
My dataframe has two column levels, and I need to convert column level[1] from datetimes into strings, but the column headers contain some NaT values, so my strftime call fails.
df.columns=['a','b','d','d']+[x.strftime('%m-%d-%Y') for x in df.columns.levels[1][:-1]]
This gives me the error:
ValueError: NaTType does not support strftime
Based on discussions of a similar topic, I tried using
[x.datetime.strftime('%m-%d-%Y') for x in df.columns.levels[1][:-1]]
but then I get an error saying
AttributeError: 'Timestamp' object has no attribute 'datetime'
Is there anything that I am missing? Please help.
Thank you!
You can add a condition when converting with:
[x.strftime('%m-%d-%Y') if not pd.isnull(x) else "01-01-1970"
 for x in df.columns.levels[1][:-1]]
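A minimal runnable sketch of this guard, using made-up header values (the real levels come from your MultiIndex):

```python
import pandas as pd

# hypothetical header values, including one NaT
headers = pd.to_datetime(["2021-03-01", "2021-04-15", None])
labels = [h.strftime('%m-%d-%Y') if not pd.isnull(h) else "01-01-1970"
          for h in headers]
# labels == ['03-01-2021', '04-15-2021', '01-01-1970']
```

The pd.isnull check catches NaT before strftime is ever called, which is what the original list comprehension was missing.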
I have a money-values column erroneously stored as strings:
(screenshot: money values shown as strings)
I've tried to parse that column with the usual methods, without success:
astype(float) and pd.to_numeric(df.col, errors='coerce')
In the end, the only thing that works is string manipulation, which, as you can see, is too verbose:
(screenshot: apply/split/cast chain)
So, what happened here? Is there no graceful way to handle this parsing?
PS: using pd.read_csv(path, dtype={'receita': float}) I also got wrong results.
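Since the screenshots aren't reproduced here, the exact string format is unknown; the column name 'receita' suggests Brazilian-formatted numbers, so this sketch assumes made-up values like "R$ 1.234,56" (currency prefix, thousands dot, decimal comma):

```python
import pandas as pd

# made-up sample values; the real format may differ
s = pd.Series(["R$ 1.234,56", "R$ 10,00"])
cleaned = (s.str.replace(r"[R$\s]", "", regex=True)   # drop currency symbol and spaces
             .str.replace(".", "", regex=False)       # drop thousands separator
             .str.replace(",", ".", regex=False))     # decimal comma -> decimal point
values = pd.to_numeric(cleaned, errors="coerce")
```

If the data comes from a file, pd.read_csv's decimal=',' and thousands='.' parameters can often do the same conversion at load time.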
I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF).
I see the problem. I thought that tf_idf_transformer.fit_transform takes as the source argument an array-like of text strings. Instead, I now understand that it takes an (n,2)-array of text strings / token counts. The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform( excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts_token_counts )
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the TfidfTransformer documentation for sklearn).
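As a further simplification, sklearn's TfidfVectorizer combines both steps (CountVectorizer followed by TfidfTransformer) into one object. A sketch with made-up excerpts standing in for the real Series:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical excerpts standing in for the real Pandas Series
excerpts = pd.Series(["I'm trying to work out a problem",
                      "another excerpt from a StackOverflow post"])
vectorizer = TfidfVectorizer()  # tokenizes, counts, and applies TF/IDF in one step
tfidf = vectorizer.fit_transform(excerpts)  # sparse matrix, one row per excerpt
```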
The problem is somewhat simple. My objective is to compute the days difference between two dates, say A and B.
These are my attempts:
df['daydiff'] = df['A']-df['B']
df['daydiff'] = ((df['A']) - (df['B'])).dt.days
df['daydiff'] = (pd.to_datetime(df['A'])-pd.to_datetime(df['B'])).dt.days
These worked for me before, but for some reason I keep getting this error this time:
TypeError: <class 'datetime.time'> is not convertible to datetime
When I export the df to excel, then the date works just fine. Any thoughts?
Use pd.Timestamp to handle the awkward differences in your formatted times.
df['A'] = df['A'].apply(pd.Timestamp) # will handle parsing
df['B'] = df['B'].apply(pd.Timestamp) # will handle parsing
df['day_diff'] = (df['A'] - df['B']).dt.days
Of course, if you don't want to change the format of the df['A'] and df['B'] within the DataFrame that you are outputting, you can do this in a one-liner.
df['day_diff'] = (df['A'].apply(pd.Timestamp) - df['B'].apply(pd.Timestamp)).dt.days
This will give you the days between as an integer.
When I applied the solution offered by emmet02, I got TypeError: Cannot convert input [00:00:00] of type as well. It basically means that the dataframe contains missing timestamp values, which are represented as [00:00:00], and that value is rejected by the pandas.Timestamp function.
To address this, simply apply a suitable missing-value strategy to clean your data set before using
df.apply(pd.Timestamp)
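One possible cleaning step, sketched with a made-up column where missing dates appear as time-only placeholders (the actual representation in your data may differ):

```python
import datetime

import pandas as pd

# hypothetical column: missing dates stored as datetime.time(0, 0) placeholders
col = pd.Series([datetime.time(0, 0), "2021-05-01", "2021-05-03"])
is_placeholder = col.apply(lambda v: isinstance(v, datetime.time))
col = col.mask(is_placeholder)  # replace placeholders with NaN
stamps = col.apply(lambda v: pd.Timestamp(v) if pd.notnull(v) else pd.NaT)
```

After this, subtracting two such columns and taking .dt.days works, with NaN wherever a date was missing.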
What is the best way to avoid this error?
DataError: invalid input syntax for integer: "669068424.0" CONTEXT:
COPY sequence_raw, line 2, column id: "669068424.0"
I created a table using pgadmin which specified the data type for each column. I then read the data in with pandas and do some processing. I could explicitly provide a list of columns and say that they are .astype(int), but is that necessary?
I understand that the .0 after the integers appears because there are NaNs in the data, so the column is turned into floats instead of integers. What is the best way to work around this? I saw in the pandas 0.19 pre-release notes that there is better handling of sparse data; does that cover this by any chance?
def process_file(conn, table_name, file_object):
    fake_conn = pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()

df = pd.read_sql_query(sql=query.format(**params), con=engine)
df.to_csv('../raw/temp_sequence.csv', index=False)
df = open('../raw/temp_sequence.csv')
process_file(conn=pg_engine, table_name='sequence_raw', file_object=df)
You can use the float_format parameter for to_csv to specify the format of the floats in the CSV:
df.to_csv('../raw/temp_sequence.csv', index=False, float_format="%d")
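Alternatively, pandas versions from 0.24 onward (newer than the 0.19 mentioned in the question) offer the nullable Int64 dtype, which keeps missing values without forcing the column to float. A sketch with made-up id values:

```python
import pandas as pd

# made-up id column: the None forces the whole column to float, adding the '.0'
s = pd.Series([669068424.0, None, 12.0])
ids = s.astype("Int64")  # nullable integer dtype (pandas >= 0.24)
```

With this dtype, to_csv writes plain integers and leaves missing cells empty, so no float_format workaround is needed.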