I have a large dataset in Python/pandas, and it's behaving very strangely. I have a column, df['station'], with a mix of numeric and string values, and I have converted it to object dtype. If I check df.dtypes, the column claims to be of dtype object:
station object
However, when I run df['station'].value_counts() on the column, it says the dtype is int64.
I'm unable to access the values that are strings: if I try df[df['station']=='NONE'] or df[df['station']=='S52'], both of which appear in the value_counts() output, nothing shows up.
I tried removing NaNs from the column, then converting to string and running value_counts() again, but it still shows int64.
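For what it's worth, a minimal diagnostic sketch (the station values here are hypothetical) that tallies which Python types are actually stored in the column, since an object column can hide a mix of int and str:
import pandas as pd
# hypothetical reproduction: ints and strings mixed in one object-dtype column
df = pd.DataFrame({'station': [101, 101, 'S52', 'NONE']})
print(df['station'].dtype)                     # object
print(df['station'].map(type).value_counts())  # how many ints vs strs are stored
# forcing a consistent string representation before filtering
df['station'] = df['station'].astype(str)
print(df[df['station'] == 'S52'])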
I have a CSV file where a timestamp column comes in with values in the following format: 2022-05-12T07:09:33.727-07:00
When I try something like:
df['timestamp'] = pd.to_datetime(df['timestamp'])
It seems to fail silently, as the dtype of that column is still object. I am wondering how I can parse such a value.
Also, what strategy keeps the parsing robust to a variety of input time formats?
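A hedged sketch of how such values can be parsed (assuming pandas, and reusing the timestamp column name from above); values carrying a UTC offset like -07:00 tend to stay object dtype unless utc=True is passed:
import pandas as pd
# utc=True collapses the offsets into one tz-aware datetime64 dtype
# errors='coerce' turns unparseable entries into NaT instead of raising
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True, errors='coerce')
# on pandas >= 2.0, format='mixed' or format='ISO8601' can help when several
# input formats occur in the same column:
# df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed', utc=True, errors='coerce')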
My dataframe is frame, and there is a tz column containing time zone information as strings.
clean_tz = frame['tz'].fillna('Missing')
If I know that all elements in the tz column are strings, although some may be empty strings or in an incorrect format, is there a reason to run fillna(), since fillna() only checks for null/NaN values?
Is there a reason to run fillna() if there are no NaNs in the dataframe?
I think not; there is no reason.
As for the empty or malformed strings: no, fillna() won't touch them, since they are not NaN. You can replace empty strings directly:
frame['tz'] = frame['tz'].str.replace(r'^\s*$', 'Missing', regex=True)
It is also possible to replace those values with NaN and then use fillna(), but if no NaNs existed in the data to begin with, that is in my opinion a double replacement: empty strings to NaN, then NaN to something else.
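A small sketch of both variants, assuming frame['tz'] holds plain strings (the sample values are made up):
import numpy as np
import pandas as pd
frame = pd.DataFrame({'tz': ['America/New_York', '', '   ', None]})
# variant 1: fill real NaNs, then overwrite blank/whitespace-only strings
one_step = frame['tz'].fillna('Missing').str.replace(r'^\s*$', 'Missing', regex=True)
# variant 2: turn blanks into NaN first, then let fillna cover both cases
two_step = frame['tz'].replace(r'^\s*$', np.nan, regex=True).fillna('Missing')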
I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object.
I tried casting the entire dataframe that includes "excerpts" to dtype str using pandas.DataFrame.astype(), but the "excerpts" column stubbornly keeps dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest a way to see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving their contribution out of the TF/IDF)?
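One rough way to surface entries that are not plain strings (a hedged sketch; the isinstance check is just one possible diagnostic):
# show any entries that are not Python str objects (e.g. stray NaN floats)
bad = excerpts[~excerpts.map(lambda x: isinstance(x, str))]
print(bad)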
I see the problem. I thought that tf_idf_transformer.fit_transform takes as its source argument an array-like of text strings. Instead, I now understand that it takes a matrix of token counts (such as the output of CountVectorizer), not the raw strings. The correct usage is more like:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# first turn the raw text into token counts, then apply the TF/IDF weighting
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the scikit-learn documentation for TfidfTransformer).
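For what it's worth, the two steps can also be collapsed into one with TfidfVectorizer, which accepts raw text directly (a small sketch, not necessarily what the surrounding code needs):
from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer = CountVectorizer + TfidfTransformer in a single estimator
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(excerpts)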
I am trying to convert a column in a dataframe from float to string. I have tried
df = readtable("data.csv", coltypes = {String, String, String, String, String, Float64, Float64, String});
but it complained:
syntax: { } vector syntax is discontinued
I also have tried
dfB[:serial] = string(dfB[:serial])
but it didn't work either. So, I'd like to know what would be the proper approach to change column data type in Julia.
Thanks.
On your first attempt, Julia tells you what the problem is - you can't make a vector with {}, you need to use []. Also, the name of the keyword argument should be eltypes rather than coltypes.
On the second try, you don't have a float, you have a Vector of floats, so to change the type you need to change the type of all elements. In Julia, elementwise operations on vectors are generalized by the 'dot' syntax, e.g. string.(collect(dfB[:serial])). The collect is currently needed to cast the DataArray to a normal Array first; this will fail if the DataArray contains NAs. IMHO the DataFrames interface is still rather wonky, so expect a few headaches like this at the moment.
I often find myself changing the types of data in columns of my dataframes, converting between datetime and timedelta types, or string and time etc. So I need a way to check which data type each of my columns has.
df.dtypes is fine for numeric types, but for everything else it just shows 'object'. So how can I find out what kind of object a column holds?
You can inspect one of the cells to find the type.
import pandas as pd
#assume some kind of string and int data
records = [["a",1], ["b",2]]
df = pd.DataFrame(records)
df.dtypes
>0 object
>1 int64
>dtype: object
So pandas knows that column 1 is integer storage, but column 0 is shown as object.
df[0].dtype
>dtype('O')
This still shows "Object" storage.
type(df[0][0])
>str
Voila.
Of course, this depends on your exact data structure. If you've got NaNs anywhere in the column then it sometimes plays havoc with the converted type (havoc as in it's not always clear why it ends up as object storage).
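If inspecting a single cell feels too fragile, a follow-up sketch (reusing the df from above) tallies every Python type stored in an object column, NaNs included:
# count the concrete Python types held in column 0
print(df[0].map(type).value_counts())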