pig set data type of all columns - apache-pig

I'm wondering if there is a way to set the data type of an arbitrary number of items in a tuple. For example, if I create a field using $(1..) and I know that all the items will be integers, can I set that? Something like:
.... GENERATE (chararray)$0 (int..)($1..)
I'm passing this tuple to a UDF and want to save time in parsing and converting DataByteArray to Int.

Related

BigQuery - Inferring Datatypes of Column Values

What is the best way to determine the datatype of a column value if the data has already been loaded and classified as STRING (i.e. the BQ table metadata has "STRING" as the datatype for every column)? I've found a few different methods, but I'm not sure whether I'm missing any, or whether any of these is substantially more performant. The result should include statistics at the grain of each value, not just per column.
1) Using a combination of CASE and SAFE_CAST on the STRING value to count all the instances where it can successfully be cast to data type X (where X is any datatype, like INT64 or DATETIME), with a few lines in the query repeating the SAFE_CAST to cover all potential datatypes.
2) Similar to the above, but using REGEXP_CONTAINS instead of SAFE_CAST on every value and summing up all instances of TRUE (a community UDF also seems to tackle this: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/typeof.sql).
(For the above, you can also use COUNTIF(), IF statements, etc.)
3) Loading the data into a pandas DataFrame and using something like pd.api.types.infer_dtype to infer the type automatically, but this adds overhead and more components.
Thanks!
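For comparison, the counting idea behind the SAFE_CAST approach can be sketched in pandas. This is only an illustration under assumptions: the helper name count_castable, the candidate types, and the %Y-%m-%d date format are made up for the example, and values that fail to parse are simply counted as not castable.

```python
import pandas as pd


def count_castable(values: pd.Series) -> dict:
    """Count how many string values parse as each candidate type (hypothetical helper)."""
    # Values that cannot be parsed as numbers become NaN instead of raising.
    as_number = pd.to_numeric(values, errors="coerce")
    # Treat a value as an integer if it is numeric and has no fractional part.
    is_int = as_number.notna() & (as_number % 1 == 0)
    # Assumed date format for the example; non-matching values become NaT.
    as_date = pd.to_datetime(values, errors="coerce", format="%Y-%m-%d")
    return {
        "total": int(len(values)),
        "int": int(is_int.sum()),
        "float": int(as_number.notna().sum()),
        "date": int(as_date.notna().sum()),
    }


sample = pd.Series(["1", "2.5", "2021-03-04", "abc", "7"])
counts = count_castable(sample)
print(counts)
```

The intermediate boolean Series (for example is_int) keep one flag per value, so per-value results are available as well as the per-column totals.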

How do I select a certain key/value pair from a JSON field inside a SQL table in Snowflake?

I am currently working on building a data warehouse in Snowflake for the business that I work for, and I have encountered some problems. I used to apply the JSON_VALUE function in T-SQL to extract a certain key/value pair from a JSON-formatted field inside my original MSSQL DB.
All the other fields are in regular SQL format, but there is this one field that I really need that is formatted as JSON, and I can't seem to extract the key/value pair that I need.
I'm new to SnowSQL and I can't seem to find a way to extract this within a regular query. Does anyone know a way around my problem?
ID | TYPE | Name (JSON_FORMAT)         | Amount
1  | 5    | {En: "lunch", fr: "diner"} | 10.00
I would like to extract this line (for example) and retrieve only the En: "lunch" part from my JSON-formatted field.
Thank you !
Almost any time you use JSON in Snowflake, it's advisable to use the VARIANT data type. You can use the parse_json function to convert a string into a variant with JSON.
select
parse_json('{En: "lunch", fr: "diner"}') as VARIANT_COLUMN,
VARIANT_COLUMN:En::string as ENGLISH_WORD;
In this sample, the first column converts your JSON into a variant named VARIANT_COLUMN. The second column uses the variant, extracting the "En" property and casting it to a string data type.
You can define columns as variant and store JSON natively. That's going to improve performance and allow parsing using dot notation in SQL.
For anyone else who also stumbles upon this question:
You can also use JSON_EXTRACT_PATH_TEXT. Here is an example that creates a new column called meal:
select json_extract_path_text(Name,'En') as meal from ...

Can I assign multiple datatypes to a Pandas column?

I have a huge amount of data, and any work on this data takes a really long time. One of the tips that I read for dealing with a large amount of data is to change the datatypes of the columns to either 'int' or 'float' if possible.
I tried to follow this method, but I am getting some errors because my column contains both float and string values. The error looks like this: "Unable to parse string "3U00" at position 18". Hence my questions:
1) Is there a way I can assign multiple data types to one column and how can I do that?
2) If I am able to achieve the above does this decrease my processing time?
Currently, when I type:
dftest.info()
Result:
A_column non-null object
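A pandas column has a single dtype, and a column that mixes numbers and strings falls back to object. One possible way around the parse error is sketched below, on the assumption that values which cannot be parsed (such as "3U00") may be treated as missing in a separate numeric column. The sample data and the A_numeric column name are invented for the example.

```python
import pandas as pd

# Hypothetical sample data mirroring the mixed column from the question.
dftest = pd.DataFrame({"A_column": ["1.5", "2", "3U00", "4.25"]})

# errors="coerce" turns unparseable strings into NaN instead of raising.
numeric = pd.to_numeric(dftest["A_column"], errors="coerce")

# Keep the original column and add a float column for fast numeric work.
dftest["A_numeric"] = numeric

# Inspect which original values could not be converted.
bad_rows = dftest.loc[numeric.isna(), "A_column"].tolist()
print(bad_rows)
print(dftest["A_numeric"].dtype)
```

Keeping the original column alongside the numeric one means no information is lost, while numeric operations can run on the float column.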

DynamoDB table to Hive when some columns have several different data types?

Hi, I am trying to create an external table from DynamoDB in Hive and save it on S3 as Parquet files. I encountered a problem with one column whose items have different data types (sometimes string, sometimes number, and sometimes an array of strings/numbers). Because of that, I cannot know what data type that column should be: if I set it to string, items with a number or an array will have a NULL value for that attribute.
Does anyone know how I can create a table that converts all these types to string? Will I have to write a custom SerDe?
I suppose that you are using the storage handler org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler.
If so, take a look at this documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ExternalTableForDDB.html
In particular this section:
Note: The following DynamoDB data types are not supported by the DynamoDBStorageHandler class, so they cannot be used with dynamodb.column.mapping
Map
List
Boolean
Null
So if you have a DynamoDB column with any of the above data types, your Hive column value will always be NULL.

Dynamic type cast in select query

I have totally rewritten my question because of an inaccurate description of the problem!
We have to store a lot of different information about a specific region. For this we need a flexible data structure which does not limit the possibilities for the user.
So we've created a key-value table for this additional data, which is described by a meta table that contains the datatype of the value.
We already use this information for queries over our REST API. We then automatically wrap the requested field in a cast.
We return this data together with information from other tables as a JSON object. We convert the corresponding rows from the data table into a JSON object with array_agg and json_object:
...
CASE
WHEN count(prop.name) = 0 THEN '{}'::json
ELSE json_object(array_agg(prop.name), array_agg(prop.value))
END AS data
...
This works very well. The problem is that if we store something like a floating point number in this field, we get back a string representation of the number:
e.g. 5.231 returns as "5.231"
Now we would like to CAST this number during our select statement into the right data format so the JSON result is correctly formatted. We have all the information we need, so we tried the following:
SELECT
json_object(array_agg(data.name),
-- here I cast the value into the right datatype!
-- results in an error
array_agg(CAST(value AS datatype))) AS data
FROM data
JOIN (
SELECT name, datatype
FROM meta)
AS info
ON info.name = data.name
The error message is as follows:
ERROR: type "datatype" does not exist
LINE 3: array_agg(CAST(value AS datatype))) AS data
^
Query failed
PostgreSQL said: type "datatype" does not exist
So is it possible to dynamically cast the text of the data_type column to a PostgreSQL type to return a well-formatted JSON object?
First, that's a terrible abuse of SQL, and ought to be avoided in practically all scenarios. If you have a scenario where this is legitimate, you probably already know your RDBMS so intimately that you're writing custom indexing plugins, and wouldn't even think of asking this question...
If you tell us what you're actually trying to do, there's about a 99.9% chance we can tell you a better way to do it.
Now with that disclaimer aside:
This is not possible without using dynamic SQL. In PostgreSQL, you can accomplish this with the EXECUTE statement in PL/pgSQL (for example, inside a function or a DO block), which you can read about in the manual.
Note, however, that even using this method, the result for every row fetched in the same query must have the same data type. In other words, you can't expect that row 1 will have a data type of VARCHAR, and row 2 will have INT. That is completely impossible.
The problem you have is that json_object creates an object out of one string array for the keys and another string array for the values, so every value comes back as a JSON string. If you feed JSON values into this function instead, it will always return an error.
So the first problem is that you have to use a JSON or JSONB column for the values, or convert the values to JSON with to_json(). Note that to_json() applied to a text value still produces a JSON string, so the value needs to be stored as (or cast to) a numeric or JSON type for the number to come out unquoted.
The second problem is that you need another function to create your JSON object, because you want to feed it a string for each key and a JSON value for each value. For this there is an aggregate function called json_object_agg.
Then your output should be like the one you expected! Here is the full query:
SELECT
json_object_agg(data.name, to_json(data.value)) AS data
FROM data
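If the values stay stored as text, another option is to do the cast in the application layer after fetching the rows, using the datatype from the meta table. The following is a minimal sketch of that alternative, not part of the answer above: the CASTS mapping, the datatype names, the build_data_object helper, and the sample rows are all hypothetical.

```python
import json

# Hypothetical mapping from datatype names in the meta table to Python casts.
CASTS = {
    "integer": int,
    "float": float,
    "text": str,
    "boolean": lambda v: v.lower() == "true",
}


def build_data_object(rows, meta):
    """Cast each text value according to its declared datatype.

    rows: iterable of (name, value_as_text) pairs from the data table.
    meta: dict mapping name -> datatype name from the meta table.
    """
    result = {}
    for name, value in rows:
        # Fall back to plain text when the datatype is unknown.
        cast = CASTS.get(meta.get(name, "text"), str)
        result[name] = cast(value)
    return result


rows = [("area", "5.231"), ("population", "1200"), ("label", "north")]
meta = {"area": "float", "population": "integer", "label": "text"}
payload = json.dumps(build_data_object(rows, meta))
print(payload)
```

With this approach the number is serialized unquoted, because the JSON encoder sees a Python float rather than a string.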