Can I assign multiple datatypes to a Pandas column? - pandas

I have a huge amount of data, and any work on it takes a really long time. One of the tips I read for dealing with a large amount of data is to change the datatypes of the columns to either 'int' or 'float' where possible.
I tried to follow this method but I am getting errors because my column contains both float and string values. The error looks like this: "Unable to parse string "3U00" at position 18". Hence my questions:
1) Is there a way I can assign multiple data types to one column, and how can I do that?
2) If I am able to achieve the above, does this decrease my processing time?
Currently, when I type:
dftest.info()
Result:
A_column non-null object
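A minimal sketch of how one might track down the offending values and then downcast, assuming the dftest DataFrame and A_column from above (the small example frame below is a hypothetical stand-in for the real data):

import pandas as pd

# Hypothetical stand-in for the real data; A_column mixes numbers with strings like "3U00".
dftest = pd.DataFrame({"A_column": ["1.5", "2.0", "3U00", None]})

numeric = pd.to_numeric(dftest["A_column"], errors="coerce")    # unparseable values become NaN
print(dftest.loc[numeric.isna() & dftest["A_column"].notna()])  # rows that block the conversion

# Once the offending values are cleaned or dropped, downcasting shrinks memory use:
dftest["A_column"] = pd.to_numeric(dftest["A_column"], errors="coerce", downcast="float")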

Casting a string to float/decimal - big query

Hoping someone can advise me on this. I have two tables in BigQuery: the first is called master, the second is called daily_transfer.
In the master table there is a column named impression_share; its data type is FLOAT and everything works correctly.
My problem, however, is with the daily_transfer table. The idea is that on a daily basis I'll transfer this data into master. The schema and column names are exactly the same in both tables. The problem is that in the daily_transfer table, my float column (impression_share) contains a string value, which is < 0.1.
This string isn't flagged as an issue initially because the table is loaded from a Google Sheet, so the error only shows up when I try to query the data.
In summary, the column type is FLOAT, but a recurring value is a string. I've tried a couple of things. First, replacing '< 0.1' with '0.1', but I get an error that REPLACE can only be used with an expression of string, string, string, which makes sense to me.
So I've tried to cast the column from float to string instead, and then replace the value. When I try to cast, though, I'm getting an error right away:
"Error while reading table: data-studio-reporting.analytics.daily_transfer, error message: Could not convert value to float. Row 3; Col 6."
Column 6 is "impression_share", and the row 3 value is < 0.1.
The query I was trying is:
SELECT
SAFE_CAST(mydata.impression_share AS STRING)
FROM `data-studio-reporting.analytics.daily_transfer` mydata
I just don't know if what I'm trying to do is possible, or whether I would be better off recreating the daily_transfer table and setting column 6 (impression_share) as STRING, to make it easier to replace and then cast before I transfer to the main table?
Any help greatly appreciated!
Thanks,
Mark
Thanks for the help on this. Changing the column type in my daily_transfer table from FLOAT to STRING, then replacing and casting, has worked.
SELECT
mydata.Date,
CAST(REPLACE(mydata.Impression_share, '<', '') AS FLOAT64) AS impression_share_final,
mydata.Available_impressions
FROM `data-studio-reporting.google_analytics.daily_transfer_temp_test` mydata
It's been great for my knowledge to learn this one. Thanks!
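If other unexpected strings could appear in the sheet later, a slightly more defensive variant of the same query (same table and columns as above; this is a sketch, not part of the original answer) could use SAFE_CAST so that unparseable values become NULL instead of failing the whole query:

SELECT
mydata.Date,
SAFE_CAST(REPLACE(mydata.Impression_share, '<', '') AS FLOAT64) AS impression_share_final,
mydata.Available_impressions
FROM `data-studio-reporting.google_analytics.daily_transfer_temp_test` mydata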

"Numeric value '' is not recognized" - what column?

I am trying to insert data from a staging table into the master table. The table has nearly 300 columns and is a mix of data types: VARCHAR, INTEGER, DECIMAL, DATE, etc.
Snowflake gives the unhelpful error message of "Numeric value '' is not recognized"
I have gone through and cut out various parts of the query to try to isolate where it is coming from, but after several hours of cutting out every column, it is still happening.
Does anyone know of a Snowflake diagnostic query (like Redshift has) which can tell me a specific column where the issue is occurring?
Unfortunately not at the point you're at. If you went back to the COPY INTO that loaded the data, you'd be able to use the VALIDATE() function to get better information, down to the record and byte-offset level.
I would query your staging table for just the numeric fields and look for blanks, or you can wrap all of the fields destined for numeric columns in TRY_TO_NUMBER(). A bit tedious, but it might not be too bad if you don't have a lot of numeric fields.
https://docs.snowflake.com/en/sql-reference/functions/try_to_decimal.html
As a note, when you stage, you should try to use the NULL_IF options to get rid of bad characters and/or try to load into the staging table using the actual datatypes, so you can leverage the VALIDATE() function to make sure the data types are correct before loading into Snowflake.
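A minimal sketch of that TRY_TO_NUMBER() check; the staging table and column names here are placeholders, not the actual schema:

-- STG_MASTER and NUM_COL_RAW are hypothetical names for the staging table and a
-- text field that feeds a numeric column in the master table.
SELECT NUM_COL_RAW, COUNT(*) AS occurrences
FROM STG_MASTER
WHERE NUM_COL_RAW IS NOT NULL
AND TRY_TO_NUMBER(NUM_COL_RAW) IS NULL   -- catches '' and any other non-numeric text
GROUP BY NUM_COL_RAW;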
Query your staging table using TRY_TO_NUMBER() and/or TRY_TO_DECIMAL() for the number and decimal fields of the table, then use MINUS to get the difference:
SELECT $1, $2, ... $300 FROM #stage
MINUS
SELECT $1, TRY_TO_NUMBER($2), ... $300 FROM #stage
If any number field has a string that cannot be converted, it will become NULL, and the MINUS should then return the rows that have a problem. Once you have those rows, analyze the columns in the result set for the errors.

BigQuery - Inferring Datatypes of Column Values

What is the best way to determine the datatype of a column value if the data has already been loaded and every column has been classified as the STRING datatype (i.e. the BQ table metadata has "STRING" as the datatype for every column)? I've found a few different methods, but I'm not sure if I'm missing any, or whether any of these is substantially more performant. The result should include statistics at the grain of each value, not just per column.
Using a combination of CASE and SAFE_CAST on the STRING value to sum up all the instances where it successfully cast to X data type (where X is any datatype, like INT64 or DATETIME, with a few lines in the query repeating the SAFE_CAST to cover all potential datatypes); a sketch of this approach follows below
Similar to the above, but using REGEXP_CONTAINS instead of SAFE_CAST on every value and summing up all instances of TRUE (a community UDF also seems to tackle this: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/typeof.sql)
(For the above you can also use COUNTIF(), IF statements, etc.)
Loading the data into a pandas DataFrame and using something like pd.api.types.infer_dtype to infer types automatically, but this adds overhead and more components
Thanks!
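A minimal sketch of the first approach, classifying each value with CASE and SAFE_CAST; the table and column names (my_dataset.my_table, raw_value) are placeholders:

SELECT
CASE
WHEN SAFE_CAST(raw_value AS INT64) IS NOT NULL THEN 'INT64'
WHEN SAFE_CAST(raw_value AS FLOAT64) IS NOT NULL THEN 'FLOAT64'
WHEN SAFE_CAST(raw_value AS DATETIME) IS NOT NULL THEN 'DATETIME'
ELSE 'STRING'
END AS inferred_type,
COUNT(*) AS value_count
FROM `my_dataset.my_table`
GROUP BY inferred_type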

Dynamic type cast in select query

I have totally rewritten my question because of the inaccurate description of the problem!
We have to store a lot of different information about a specific region. For this we need a flexible data structure which does not limit the possibilities for the user.
So we've created a key-value table for this additional data, which is described through a meta table containing the datatype of the value.
We already use this information for queries over our REST API. We then automatically wrap the requested field in a cast.
SQL Fiddle
We return this data together with information from other tables as a JSON object. We convert the corresponding rows from the data table into a JSON object with array_agg and json_object:
...
CASE
WHEN count(prop.name) = 0 THEN '{}'::json
ELSE json_object(array_agg(prop.name), array_agg(prop.value))
END AS data
...
This works very well. The problem we now have is that if we store something like a floating-point number in this field, we get back a string representation of that number:
e.g. 5.231 returns as "5.231"
Now we would like to CAST this number to the right data format during our SELECT statement so the JSON result is correctly formatted. We have all the information we need, so we tried the following:
SELECT
json_object(array_agg(data.name),
-- here I cast the value into the right datatype!
-- results in an error
array_agg(CAST(value AS datatype))) AS data
FROM data
JOIN (
SELECT name, datatype
FROM meta)
AS info
ON info.name = data.name
The error message is the following:
ERROR: type "datatype" does not exist
LINE 3: array_agg(CAST(value AS datatype))) AS data
^
Query failed
PostgreSQL said: type "datatype" does not exist
So is it possible to dynamically cast the text of the datatype column to a PostgreSQL type in order to return a well-formatted JSON object?
First, that's a terrible abuse of SQL, and ought to be avoided in practically all scenarios. If you have a scenario where this is legitimate, you probably already know your RDBMS so intimately, that you're writing custom indexing plugins, and wouldn't even think of asking this question...
If you tell us what you're actually trying to do, there's about a 99.9% chance we can tell you a better way to do it.
Now with that disclaimer aside:
This is not possible without using dynamic SQL. In PostgreSQL you can accomplish this with PL/pgSQL's EXECUTE statement for dynamic commands, which you can read about in the manual.
Note, however, that even using this method, the result for every row fetched in the same query must have the same data type. In other words, you can't expect that row 1 will have a data type of VARCHAR, and row 2 will have INT. That is completely impossible.
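For reference, a minimal PL/pgSQL sketch of that kind of dynamic EXECUTE, reusing the data table from the question; the hard-coded 'float8' stands in for a value read from the meta table, and the result is cast back to text only so a single variable can hold it:

DO $$
DECLARE
  target_type text := 'float8';  -- in practice this would be looked up in the meta table
  result      text;
BEGIN
  -- Build and run the statement dynamically, then cast the result back to text.
  EXECUTE format('SELECT (value::%s)::text FROM data LIMIT 1', target_type)
  INTO result;
  RAISE NOTICE 'cast value: %', result;
END $$;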
The problem you have is that json_object creates an object out of a string array for the keys and another string array for the values. So if you feed your JSON objects into this method, it will always return an error.
So the first problem is that you have to use a JSON or JSONB column for the values, or you can convert the values from string to json with to_json().
The second problem is that you need a different function to build your JSON object, one that takes the keys together with JSON values instead of two string arrays. For this there is a function called json_object_agg.
Then your output should be like the one you expected! Here is the full query:
SELECT
json_object_agg(data.name, to_json(data.value)) AS data
FROM data
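If the numeric values should come out as real JSON numbers rather than quoted strings, one possible variation (assuming the meta table's datatype column holds labels such as 'float' and 'int'; adjust these to the actual labels) is to branch on the datatype when building each value:

SELECT json_object_agg(
data.name,
CASE meta.datatype
WHEN 'float' THEN to_json(data.value::float8)
WHEN 'int' THEN to_json(data.value::int)
ELSE to_json(data.value)
END
) AS data
FROM data
JOIN meta ON meta.name = data.name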

pig set data type of all columns

I'm wondering if there is a way to set the data type of an arbitrary number of items in a tuple. For example, if I create a field using $1.. and I know that all the items will be integers, can I set that? Something like:
.... GENERATE (chararray)$0, (int..)($1..)
I'm passing this tuple to a UDF and want to save time on parsing and converting DataByteArray to Int.
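For comparison, a minimal Pig Latin sketch of the explicit per-field casts the question is hoping to avoid; the relation A, its input path, and the four fields are hypothetical:

-- Hypothetical input with one chararray field followed by integer fields.
A = LOAD 'input_data' AS (f0, f1, f2, f3);
B = FOREACH A GENERATE (chararray)$0, (int)$1, (int)$2, (int)$3;
DUMP B;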