How can I change column data type from float to string in Julia? - dataframe

I am trying to get a column in a dataframe form float to string. I have tried
df = readtable("data.csv", coltypes = {String, String, String, String, String, Float64, Float64, String});
but I got complained
syntax: { } vector syntax is discontinued
I also have tried
dfB[:serial] = string(dfB[:serial])
but it didn't work either. So, I'd like to know what would be the proper approach to change column data type in Julia.
thx

On your first attempt, Julia tells you what the problem is - you can't make a vector with {}, you need to use []. Also, the name of the keyword argument should be eltypes rather than coltypes.
On the second try, you don't have a float, you have a Vector of floats. So to change the type you need to change the type of all elements. In Julia, elementwise operations on vectors are generalized by the 'dot' syntax, e.g. string.(collect(dfB[:serial])) . The collect is needed currently to cast the DataArray to a normal Array first – this will fail if the DataArray contains NAs. IMHO the DataFrames interface is still rather wonky, so expect a few headaches like this ATM.

Related

Finding smallest dtype to safely cast an array to

Let's say I want to find the smallest data type I can safely cast this array to, to save it as efficiently as possible. (The expected output is int8.)
arr = np.array([-101,125,6], dtype=np.int64)
The most logical solution seems something like
np.min_scalar_type(arr) # dtype('int64')
but that function doesn't work as expected for arrays. It just returns their original data type.
The next thing I tried is this:
np.promote_types(np.min_scalar_type(arr.min()), np.min_scalar_type(arr.max())) # dtype('int16')
but that still doesn't output the smallest possible data type.
What's a good way to achieve this?
Here's a working solution I wrote. It will only work for integers.
def smallest_dtype(arr):
arr_min = arr.min()
arr_max = arr.max()
for dtype_str in ["u1", "i1", "u2", "i2", "u4", "i4", "u8", "i8"]:
if (arr_min >= np.iinfo(np.dtype(dtype_str)).min) and (arr_max <= np.iinfo(np.dtype(dtype_str)).max):
return np.dtype(dtype_str)
This is close to your initial idea:
np.result_type(np.min_scalar_type(arr.min()), arr.max())
It will take the signed int8 from arr.min() if arr.max() fits inside of it.

Julia automatically inferring columns dataype of DataFrame (equivalent of pd.infer_objects())

Coming from Python, I started using Julia for its speed in a big-data project. When reading data from .xlsx files, the datatype in each column is "any", despite most of the data being integers or floats.
Is there any Julia-way of inferring the datatypes in a DataFrame (like df = infertype.(df))?
This may be difficult in Julia, given the reduced flexibility on dataypes, but any tips on how to accomplish it would be appreciated. Assume, ex-ante, I do not know which column is which, but the types can only be int, float, string or date.
Using DataFrames
Using XLSX
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1")...)
You can just do:
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1"; infer_eltypes=true)...)
Additionally, it is worth knowing that typing in Julia ? before the command shows help that can contain such information:
help?> XLSX.readtable
readtable(filepath, sheet, [columns]; [first_row], [column_labels], [header], [infer_eltypes], [stop_in_empty_row], [stop_in_row_function]) -> data, column_labels
Returns tabular data from a spreadsheet as a tuple (data, column_labels). (...)
(...)
Use infer_eltypes=true to get data as a Vector{Any} of typed vectors. The default value is infer_eltypes=false.
(...)

The best way to parse str to float in pandas

I got a money values columns erroneously displayed as string:
money value as string
I've tried to parse that column applying usual methods, unsuccessfully:
astype(float) and pd.to_numeric(df.col, errors=['coerce'])
Finally, I can only win if I use string manipulation techniques. As you see, too verbose:
apply split casting
So, what's happened here? There's no graceful way to solve this parsing?
ps: using pd.read_csv(path, dtype={'receita': float} I also got wrong

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF).
I see the problem. I thought that tf_idf_transformer.fit_transform takes as the source argument an array-like of text strings. Instead, I now understand that it takes an (n,2)-array of text strings / token counts. The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform( excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts_token_counts )
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the TfidfTransformer documentation for sklearn).

Convert an alphanumeric string to integer format

I need to store an alphanumeric string in an integer column on one of my models.
I have tried:
#result.each do |i|
hex_id = []
i["id"].split(//).each{|c| hex_id.push(c.hex)}
hex_id = hex_id.join
...
Model.create(:origin_id => hex_id)
...
end
When I run this in the console using puts hex_id in place of the create line, it returns the correct values, however the above code results in the origin_id being set to "2147483647" for every instance. An example string input is "t6gnk3pp86gg4sboh5oin5vr40" so that doesn't make any sense to me.
Can anyone tell me what is going wrong here or suggest a better way to store a string like the aforementioned example as a unique integer?
Thanks.
Answering by request form OP
It seems that the hex_id.join operation does not concatenate strings in this case but instead sums or performs binary complement of the hex values. The issue could also be that hex_id is an array of hex-es rather than a string, or char array. Nevertheless, what seems to happen is reaching the maximum positive value for the integer type 2147483647. Still, I was unable to find any documented effects on array.join applied on a hex array, it appears it is not concatenation of the elements.
On the other hand, the desired result 060003008600401100500050040 is too large to be recorded as an integer either. A better approach would be to keep it as a string, or use different algorithm for producing a number form the original string. Perhaps aggregating the hex values by an arithmetic operation will do better than join ?