The best way to parse str to float in pandas

I have a money-value column that is erroneously stored as strings:
[screenshot: money values as strings]
I've tried to parse that column with the usual methods, without success:
astype(float) and pd.to_numeric(df.col, errors=['coerce'])
In the end, the only thing that works is string manipulation, which, as you can see, is far too verbose:
[screenshot: apply/split/cast workaround]
So what has happened here? Is there really no graceful way to handle this parsing?
PS: using pd.read_csv(path, dtype={'receita': float}) I also get the wrong result.
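
A minimal sketch of the usual fix, assuming the strings carry a currency symbol, thousands separators and a decimal comma (the sample data below is invented for illustration): strip the non-numeric characters first, then let pd.to_numeric do the conversion.

import pandas as pd

# Invented sample; the real 'receita' column presumably looks similar (assumption).
df = pd.DataFrame({"receita": ["R$ 1.234,56", "R$ 987,00", None]})

# Keep only digits, the decimal comma and a possible minus sign,
# then swap the comma for a dot so pandas can parse it.
cleaned = (
    df["receita"]
    .str.replace(r"[^\d,\-]", "", regex=True)
    .str.replace(",", ".", regex=False)
)
df["receita"] = pd.to_numeric(cleaned, errors="coerce")
print(df["receita"].dtype)  # float64; rows that still fail become NaN

pd.read_csv can also do this in one pass via its decimal=',' and thousands='.' parameters, which is worth trying before any string surgery.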

Related

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF).
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts, as produced by CountVectorizer. The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform( excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts_token_counts )
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the scikit-learn documentation for TfidfTransformer).
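
For reference, scikit-learn's TfidfVectorizer wraps both steps (counting and TF/IDF weighting) in one object; a minimal sketch, assuming excerpts is the same iterable of raw strings:

from sklearn.feature_extraction.text import TfidfVectorizer

# CountVectorizer + TfidfTransformer rolled into one estimator.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(excerpts)  # excerpts: iterable of raw text strings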

Go application making SQL Query using GROUP_CONCAT on FLOATS returns []uint8 instead of actual []float64

I have a problem using GROUP_CONCAT in a query made by my Go application.
Any idea why a GROUP_CONCAT of FLOATs would look like a []uint8 on the Go side?
I can't seem to properly convert the suckers either.
It's definitely floats; I can see them in the raw query results. But when I run the same query from Go and try to .Scan the result, Go complains that it's a []uint8, not a []float64 (which it actually is). Attempts to convert to floats give me the wrong values (and way too many of them).
For example, at the database, I query and get 2 floats for the column in question, looks like this:
"5650.50, 5455.00"
On the Go side, however, Go sees a []uint8 instead of a []float64. Why does this happen, and how does one work around it to get the actual results?
My problem is that I have to use this SQL with the GROUP_CONCAT: given the nature of the database I am working with, this is the best way to get the information, and, more importantly, the query itself works great and returns the data the function needs. But now I can't read it out because of type issues. I'm no stranger to those, but Go isn't cooperating with me today.
I'd be more than pleased to learn WHY Go is doing it this way, and delighted to learn of a way to deal with it.
Example:
SELECT ID, getDistance(33.1543,-110.4353, Loc.Lat, Loc.Lng) as distance,
GROUP_CONCAT(values) FROM stuff INNER JOIN device on device.ID = stuff.ID WHERE (someConditionsETC) GROUP BY ID ORDER BY ID
The actual result, when interfacing with the actual database (not within my application), is
"5650.00, 5850.50"
It's clearly 2 floats.
The same query, run from Go and .Scan-ned, produces a slice of uint8. If I range through and print those values, I get far more than 2, and they are uint8 (bytes) that look like this:
53,55,56,48,46,48,48
Not sure how Go expects me to handle this.
Solution (stupid simple and not terribly obvious):
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// GROUP_CONCAT hands back one string, which Scan delivers as []uint8 (bytes).
	crazyBytes := []uint8("5760.00,5750.50")
	aString := string(crazyBytes)           // bytes -> string
	strSlice := strings.Split(aString, ",") // string representation of our array (of floats)
	var floatz []float64
	for _, x := range strSlice {
		fmt.Printf("At last, float: %s \r\n", x)
		f, err := strconv.ParseFloat(x, 64)
		if err != nil { fmt.Printf("Error: %s", err) }
		floatz = append(floatz, f)
		fmt.Printf("as float: %s \r\n", strconv.FormatFloat(f, 'f', -1, 64))
	}
}
Yea sure, it's obvious NOW.
GROUP_CONCAT returns a string. So in Go you get a byte array of characters, not a float. The result you posted, 53,55,56,48,46,48,48, translates into the string "5780.00", which does look like one of your values. So you need to either fix your SQL to return floats or use the strings and strconv packages in Go to parse and convert your string into floats. I think the former approach is better, but it is up to you.
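
For the Go side of that, here is a minimal sketch of the second approach, assuming database/sql with the go-sql-driver/mysql driver and an illustrative query (the DSN, table and column names are placeholders): scan the GROUP_CONCAT column into a string, then split and parse it as shown above.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // assumed MySQL driver
)

func main() {
	db, err := sql.Open("mysql", "user:pass@/dbname") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Scanning into a string avoids the []uint8 surprise entirely;
	// database/sql converts the byte slice for us.
	var id int
	var joined string
	row := db.QueryRow("SELECT ID, GROUP_CONCAT(val) FROM stuff GROUP BY ID LIMIT 1")
	if err := row.Scan(&id, &joined); err != nil {
		log.Fatal(err)
	}
	fmt.Println(id, joined) // e.g. 1 5650.50,5455.00
}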

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check whether the input string is valid, i.e. contains the same count of numbers between each pair of brackets, is to use a client-side program rather than plain SQL.
Implementation:
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var s = new List<string> {
            "(0),(5),(12)", "(0,11),(122,33),(4,51)",
            "(0,121,12),(31,4,5),(26,227,38)", "(0,12),(1,2,3),(56,7)"
        };

        var qry = s.Select(a => new
            {
                orig = a,
                // split "(a,b),(c,d)" into the per-bracket groups "a,b", "c,d"
                newst = a.Split(new[] { "),(", "(", ")" },
                                StringSplitOptions.RemoveEmptyEntries)
            })
            .Select(a => new
            {
                orig = a.orig,
                // total value count must divide evenly by the number of groups
                isValid = a.newst
                    .Sum(b => b.Split(new[] { ',' },
                                      StringSplitOptions.RemoveEmptyEntries).Count())
                    % a.newst.Count() == 0
            });

        foreach (var r in qry)
            Console.WriteLine($"{r.orig} {r.isValid}");
    }
}
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: the second Select statement computes the total number of values across all groups modulo the count of groups returned by Split. If the result isn't equal to zero, the input string is invalid.
I strongly believe there's a simpler way to achieve that, but - at this moment - I don't know how ;)
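
For instance, one simpler client-side check (a hedged sketch, not part of the original answer; the class and method names are only illustrative) compares the distinct group sizes directly - a single distinct size means every bracketed group carries the same number of columns:

using System;
using System.Linq;
using System.Text.RegularExpressions;

class Validator
{
    // True when every (...) group in the values string holds the same number of values.
    static bool HasUniformColumns(string input) =>
        Regex.Matches(input, @"\(([^)]*)\)")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value.Split(',').Length)
             .Distinct()
             .Count() == 1;

    static void Main()
    {
        Console.WriteLine(HasUniformColumns("(0,11),(122,33),(4,51)")); // True
        Console.WriteLine(HasUniformColumns("(0,12),(1,2,3),(56,7)"));  // False
    }
}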
Unless you add some more constraints, I don't think you can solve this problem with regular expressions alone.
A regex can't solve every string problem, just as it can't be used to check whether a string of brackets like "((())()(()(())))" is balanced; that's a more complicated (non-regular) matter.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry; I spent a bit of time looking into how we could turn this string into an array and do more work on it in SQL, but the built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend handling this situation differently, as large-scale string computation isn't the best way to go as your database gradually fills up.
A combination of client-side and server-side validation can be used to help prevent bad data (like the rows with mismatched counts) from getting into the database.
If you need to keep those numbers, you could rework your schema to include some metadata you can use in your queries, such as how many numbers there are and whether it all matches up. This information can be computed inexpensively by your server and provided to the database.
Good luck!

Dataframes NAtype to binary Julia

I'm trying to write binary text files from a data frame in Julia using something along the lines of:
for x in RICT["$i"]["Sick"]
    write(f9, convert(Int16, x))
end
and everything works nicely except when it comes to NA values. Missing values seem to be treated as NA, and I know there are different ways of handling such values using the DataFrames package. Does anyone have any experience with these NAtypes? Should I convert the NAtypes to a more conventional type and then write them in? As always, any help is much appreciated.
If you are writing 16-bit integer values, there's no canonical representation of "blank", so you'd have to pick a special 16-bit integer value that represents NA. A common choice for this kind of thing is the smallest representable value – in this case typemin(Int16) == -32768. You can generalize this to other signed integer types.
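
A minimal sketch of that idea in terms of the question's own loop (assuming the old DataArrays-era API where isna tests for NA; newer DataFrames versions use missing/ismissing instead):

const NA_SENTINEL = typemin(Int16)   # -32768 stands in for NA in the binary file

for x in RICT["$i"]["Sick"]
    # write the sentinel for NA entries, the real value otherwise
    write(f9, isna(x) ? NA_SENTINEL : convert(Int16, x))
end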

Convert an alphanumeric string to integer format

I need to store an alphanumeric string in an integer column on one of my models.
I have tried:
#result.each do |i|
  hex_id = []
  i["id"].split(//).each { |c| hex_id.push(c.hex) }
  hex_id = hex_id.join
  ...
  Model.create(:origin_id => hex_id)
  ...
end
When I run this in the console using puts hex_id in place of the create line, it returns the correct values; however, the above code results in origin_id being set to "2147483647" for every instance. An example string input is "t6gnk3pp86gg4sboh5oin5vr40", so that doesn't make any sense to me.
Can anyone tell me what is going wrong here or suggest a better way to store a string like the aforementioned example as a unique integer?
Thanks.
Answering by request from the OP
The hex_id.join operation does concatenate the hex digits into one long digit string (an array of hexes joined into a string). The problem is that the resulting number is far larger than a 32-bit signed integer column can hold, so what appears to happen is the value being clamped to the maximum positive value for that integer type, 2147483647.
On the other hand, the desired result 060003008600401100500050040 is far too large to be stored in a regular integer column anyway. A better approach would be to keep it as a string, or to use a different algorithm for producing a number from the original string. Perhaps aggregating the hex values with an arithmetic operation would do better than join?
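
One such alternative (a hedged sketch, not from the original answer) is to read the lowercase alphanumeric id as a base-36 number; the result is stable and reversible, but it needs a bigint/decimal column, or to stay a string, since it is far larger than 32 bits:

# Interpret the alphanumeric id as a base-36 number.
# Caveats: ids differing only in leading zeros map to the same number,
# and the result overflows a 32-bit integer column, so the column type
# would have to be bigint/decimal (an assumption about the schema).
id_string = "t6gnk3pp86gg4sboh5oin5vr40"
origin_id = id_string.to_i(36)
puts origin_id
# Model.create(:origin_id => origin_id)   # only if origin_id is a big enough column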