What happens if I send integers to a BigQuery field "string"? - google-bigquery

One of the columns I send (in my code) to BigQuery contains integers. I added the columns to BigQuery, but I was too quick and defined them as type string.
Will they be converted automatically? Or will the data be totally corrupted (= I cannot trust the resulting strings at all)?

Data shouldn't be automatically converted, as that would defeat the purpose of having a table schema.
What I've seen people do is save a whole JSON line as a string and then process that string inside BigQuery. Other than that, if you try to save values that don't correspond to the field's schema definition, you should see an error thrown.
If you need to change a table schema's definition, you can check this tutorial on updating a table schema.

Actually, BigQuery automatically converted the integers I sent to strings, so my table populates fine.
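If the values ever need to be read back as integers, a cast in BigQuery Standard SQL can recover them. This is only a sketch; my_dataset.my_table and my_string_col are hypothetical names:
-- Recover integers from a STRING column. SAFE_CAST returns NULL instead of
-- raising an error for any value that does not parse as a number.
SELECT
  SAFE_CAST(my_string_col AS INT64) AS my_int_col
FROM
  `my_dataset.my_table`;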

Related

DynamoDB table to Hive when some column have couple of different data types?

Hi, I am trying to create an external table from DynamoDB in Hive and save it on S3 as Parquet files. I ran into a problem with one column whose values have different data types (sometimes a string, sometimes a number, and sometimes an array of strings/numbers). Because of that I cannot know what data type the column should be: if I set it to string, items with a number or an array will have a NULL value for that attribute.
Does anyone know how I can create a table that converts all these types to string? Will I have to write a custom SerDe?
I suppose you are using the storage handler org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler.
If so, take a look at this documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ExternalTableForDDB.html
In particular this section:
Note: The following DynamoDB data types are not supported by the DynamoDBStorageHandler class, so they cannot be used with dynamodb.column.mapping
Map
List
Boolean
Null
So if you have a DynamoDB column with any of the above data types, your Hive column value will always be NULL.
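For reference, a minimal sketch of such an external table, following the pattern in the AWS documentation linked above; the table and attribute names are hypothetical:
-- Hive external table backed by DynamoDB via the DynamoDBStorageHandler.
CREATE EXTERNAL TABLE ddb_region_data (
  id         string,
  mixed_attr string
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "MyDynamoTable",
  "dynamodb.column.mapping" = "id:id,mixed_attr:mixed_attr"
);
-- Items whose mixed_attr is stored in DynamoDB as a Map, List, Boolean or Null
-- come back as NULL here, per the note quoted above.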

Import PostgreSQL dump into SQL Server - data type errors

I have some data that was dumped from a PostgreSQL database (allegedly using pg_dump) and needs to be imported into SQL Server.
While the data types are OK, I am running into an issue where there seems to be a placeholder for NULL: I see a backslash followed by an uppercase N (\N) in many fields. In the data as viewed from within Excel, the left column has a Boolean data type and the right one an integer.
Some of these are supposed to be of the Boolean datatype, and having two characters in there is most certainly not going to fly.
Here's what I tried so far:
Import via a dirty read, keeping whatever data types SSIS decided each field had; to no avail. There were error messages about truncation on all of the Boolean fields.
Creating a table for the data based on the correct data types, though this was more fun... I needed to do the same as in the dirty read, as the source would otherwise not load properly. There was also a need to transform the data into the correct data type for insertion into the destination data source; yet I am getting truncation issues when there most certainly shouldn't be any.
Here is a sample expression in my derived column transformation editor:
(DT_BOOL)REPLACE(observation,"\\N","")
The data type should be Boolean.
Any suggestion would be really helpful!
Thanks!
Since I was unable to circumvent the SSIS rules in order to get my data into my tables without an error, I took the quick-and-dirty approach.
The solution that worked for me was to have the source read each column as if it were a string, and to make every field in the destination table VARCHAR. That destination table is used as a staging table; once the data is in SQL Server, I can manipulate it as needed.
Thank you #cha for your input.
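Once everything is staged as VARCHAR, the cleanup into the tight types could look something like the sketch below; the table and column names are hypothetical, and it assumes pg_dump's text representation ('\N' for NULL, 't'/'f' for booleans):
INSERT INTO dbo.final_data (observation, amount)
SELECT
    CAST(CASE NULLIF(observation, '\N')      -- booleans: 't'/'f' -> 1/0, '\N' -> NULL
             WHEN 't' THEN 1
             WHEN 'f' THEN 0
         END AS BIT),
    CAST(NULLIF(amount, '\N') AS INT)        -- integers: '\N' -> NULL
FROM dbo.staging_raw;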

Dynamic type cast in select query

I have totally rewritten my question because of an inaccurate description of the problem!
We have to store a lot of different information about a specific region. For this we need a flexible data structure that does not limit the possibilities for the user.
So we've created a key-value table for this additional data, which is described by a meta table that contains the data type of the value.
We already use this information for queries over our REST API; we automatically wrap the requested field in a cast.
We return this data together with information from other tables as a JSON object. We convert the corresponding rows from the data table into a JSON object with array_agg and json_object:
...
CASE
WHEN count(prop.name) = 0 THEN '{}'::json
ELSE json_object(array_agg(prop.name), array_agg(prop.value))
END AS data
...
This works very well. The problem we have now is that if we store data like a floating-point number in this field, we get back a string representation of that number:
e.g. 5.231 returns as "5.231"
Now we would like to CAST this number during our SELECT statement into the right data format so the JSON result is correctly formatted. We have all the information we need, so we tried the following:
SELECT
json_object(array_agg(data.name),
-- here I cast the value into the right datatype!
-- results in an error
array_agg(CAST(value AS datatype))) AS data
FROM data
JOIN (
SELECT name, datatype
FROM meta)
AS info
ON info.name = data.name
The error message is following:
ERROR: type "datatype" does not exist
LINE 3: array_agg(CAST(value AS datatype))) AS data
^
Query failed
PostgreSQL said: type "datatype" does not exist
So is it possible to dynamically cast the text of the data_type column to a postgresql type to return a well-formatted JSON object?
First, that's a terrible abuse of SQL and ought to be avoided in practically all scenarios. If you have a scenario where this is legitimate, you probably already know your RDBMS so intimately that you're writing custom indexing plugins, and wouldn't even think of asking this question...
If you tell us what you're actually trying to do, there's about a 99.9% chance we can tell you a better way to do it.
Now with that disclaimer aside:
This is not possible without using dynamic SQL. In PostgreSQL you can accomplish this with EXECUTE inside a PL/pgSQL function, which you can read about in the manual. It basically boils down to building the statement as a string and running it with EXECUTE.
Note, however, that even using this method, the result for every row fetched in the same query must have the same data type. In other words, you can't expect that row 1 will have a data type of VARCHAR, and row 2 will have INT. That is completely impossible.
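For completeness, a minimal PL/pgSQL sketch of that dynamic-SQL route; the function name is hypothetical, and it assumes meta.datatype holds a valid type name from your own (trusted) meta table. Note that the result still has to come back as one fixed type (text here):
CREATE OR REPLACE FUNCTION get_value_casted(p_name text)
RETURNS text
LANGUAGE plpgsql
AS $$
DECLARE
  v_type   text;
  v_result text;
BEGIN
  -- look up the target type in the meta table
  SELECT datatype INTO v_type FROM meta WHERE name = p_name;
  -- build and run the cast dynamically
  EXECUTE format('SELECT (value)::%s::text FROM data WHERE name = $1', v_type)
    INTO v_result
    USING p_name;
  RETURN v_result;
END;
$$;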
The problem you have is that json_object creates an object out of a string array for the keys and another string array for the values, so if you feed your JSON objects into this function, it will always return an error.
So the first problem is that you have to use a JSON or JSONB column for the values, or convert the values from string to JSON with to_json().
The second problem is that you need another function to create your JSON object, because you want to feed it a string array for the keys and an array of JSON values for the values. For this there is a function called json_object_agg.
Then your output should be like the one you expected! Here is the full query:
SELECT
json_object_agg(data.name, to_json(data.value)) AS data
FROM data
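If the goal is for numbers and booleans to come out unquoted in the JSON result, one option is to branch on the stored data type before converting to JSON. This is a sketch that assumes meta.datatype contains type names such as 'numeric' and 'boolean':
SELECT
  json_object_agg(
    d.name,
    CASE m.datatype
      WHEN 'numeric' THEN to_json(d.value::numeric)   -- emits 5.231 instead of "5.231"
      WHEN 'boolean' THEN to_json(d.value::boolean)
      ELSE to_json(d.value)                           -- everything else stays a JSON string
    END
  ) AS data
FROM data d
JOIN meta m ON m.name = d.name;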

Insert data via SSIS package and different datatypes

I have a table with a column column1 nvarchar(50) null. I want to insert it into a 'tighter' table where the column is nvarchar(30) not null. My idea was to insert a Derived Column task between the source and destination tasks with this expression: Replace column1 = (DT_WSTR,30)Column1
I get the "truncation may occur" error and am not allowed to insert the data into the new, tighter table.
Also, I am 100% sure that no values in the column are over 30 characters. Moreover, I cannot change the column data type in the source.
What is the best way to create the ETL process?
JotaBe recommended using a data conversion transformation. Yes, that is another way to achieve the same thing, but it will also error out if truncation occurs. Your way should work (I tried it), provided the input data really is less than 30 characters.
You could modify your derived column expression to
(DT_WSTR,30)Substring([Column1], 1, 30)
Consider changing the truncation error disposition of the Derived Column component within your Data Flow. By default, a truncation will cause the Derived Column component to fail. You can configure the component to ignore or redirect rows which are causing a truncation error.
To do this, open the Derived Column Transformation editor and click the 'Configure Error Output...' button in the bottom-left of the dialog. From here, change the setting in the 'Truncation' column for any applicable columns as required.
Be aware that any data which is truncated for columns ignoring failure will not be reported by SSIS during execution. It sounds like you've already done this, but it's important to be sure you've analysed your data as it currently stands and taken into consideration any possible future changes to the nature of the data before disabling truncation reporting.
To do so you must use a Data Conversion Transformation, which allows you to change the data type from the original nvarchar(50) to the desired nvarchar(30).
You'll get a new column with the required data type.
Of course, you can decide what to do in case of a truncation error by configuring this component.
UPDATE
As there are people who have downvoted this answer, let's add 3 more comments:
this solution is checked and works. Create a table with an nvarchar(50) column and a new table with an nvarchar(30) column, add a data flow that uses a Data Conversion transform, and it works without a glitch. Please check it, I guarantee it. Besides, as the OP states "Also I am 100% sure that no values are over 30 characters in the column", in his case there will be no truncation problems. However, I recommend handling the possible errors, just in case they happen.
from MSDN: "a package can perform the following types of data conversions: ... Set the column length of string data"
from MSDN: "If the length of an output column of string data is shorter than the length of its corresponding input column, the output data is truncated."

T-SQL: How to log erroneous entries during import

I do an import and convert columns to other data types. During the import I check whether I can convert the values stored in the columns into the target data types. If not, I insert a default value; e.g. when I convert a string to int and the string cannot be converted, I insert -1.
At that moment it would be great if I could log the erroneous entries in a separate table: e.g. instead of a parsable string like '1234', 'xze' arrives, so I put -1 into the target table and 'xze' into the log table.
Is this possible with (T-)SQL?
Cheers,
Andreas
This is also very easy to do with SSIS data flow tasks. Any of the data conversion or lookup steps you might try have a "success" output and an "error" output. You simply direct all the error outputs to a "union" transform, and from there into a common error table. The result of all the successes go into your "success" table. You can even add extra details into the error table, to give you clear error messages.
The nice thing about this is that you still get high performance, as entire buffers move through the system. You'll eventually have buffers full of valid data being bulk written to the success table, and small buffers full of errors being written to the error table. When errors happen on a row, that row will simply be moved from one buffer of rows into another.
If you have a staging table, you can filter the good stuff into the final table and do a join back (NOT EXISTS) to find the rubbish for the log table.
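A T-SQL sketch of that staging-table approach; it uses TRY_CAST (SQL Server 2012+) rather than a NOT EXISTS join-back to spot the unparsable values, and the table and column names are hypothetical:
-- Good rows (with -1 as the default for unparsable values) go to the target table.
INSERT INTO dbo.target_table (id, amount)
SELECT id,
       COALESCE(TRY_CAST(amount_raw AS int), -1)
FROM dbo.staging_table;

-- Rows whose value could not be parsed go to the log table.
INSERT INTO dbo.error_log (id, bad_value)
SELECT id, amount_raw
FROM dbo.staging_table
WHERE amount_raw IS NOT NULL
  AND TRY_CAST(amount_raw AS int) IS NULL;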