I have a requirement to convert all blank values to NULL while loading, and all of my target tables are Parquet.
But I am unable to convert those values into NULL.
I am working with a very large amount of data; some tables are around 500 GB and others are in the TB range.
I have tried 'serialization.null.format' = '' but it's not working.
I have also tried CASE and IF statements to convert those values in the data load phase. This method works, but it takes too much time because of the data volume.
Is there a better solution I can use to convert blanks into NULL?
I think a CASE expression can help you in this regard.
CASE WHEN $column <> '' THEN $column ELSE NULL END
If you are fetching this data from a raw format, say JSON, it will take more time than fetching from ORC files.
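A rough sketch of a full load statement using that expression (the table and column names here are hypothetical):
-- Replace empty strings with NULL while inserting from a staging table
-- into the Parquet target.
INSERT OVERWRITE TABLE tgt_table
SELECT
    CASE WHEN col1 <> '' THEN col1 ELSE NULL END AS col1,
    CASE WHEN col2 <> '' THEN col2 ELSE NULL END AS col2
FROM stg_table;
In newer Hive versions, NULLIF(col1, '') is a shorter equivalent of the CASE expression.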
Please follow the next link carefully and it will work directly, without the CASE transformation.
In short: before you use "serialization.null.format", you need to set up an option on Cloudera.
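As a rough sketch of where the property is typically applied (table, column, and path names here are hypothetical): it is usually set on the delimited staging table that reads the raw files, since the Parquet target stores NULLs natively once the rows are written.
-- Hypothetical staging table over the raw delimited files; with this
-- property set, empty fields are read back as NULL.
CREATE EXTERNAL TABLE stg_table (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/stg_table'
TBLPROPERTIES ('serialization.null.format' = '');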
I have a table with a column of type "binary". When I select this column I see the data automatically converted and printed as a string, for example " ¢ZêZ". I want to write the select statement in a way that it is printed as actual zeros and ones e.g. "01001010". Note, I am running this query through a python script and dumping the results to a csv file.
If anybody has an idea how to do this, your help would be really appreciated.
I have found a way to do it, at least on my Vertica DB:
TO_BITSTRING(expression)
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/TO_BITSTRING.htm
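A usage sketch (the table and column names are placeholders):
-- Render the binary column as a string of zeros and ones.
SELECT TO_BITSTRING(binary_col) AS bit_string
FROM my_table;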
I am trying to insert data from a staging table into the master table. The table has nearly 300 columns and is a mix of data types: Varchars, Integers, Decimals, Dates, etc.
Snowflake gives the unhelpful error message of "Numeric value '' is not recognized"
I have gone through and cut out various parts of the query to try and isolate where it is coming from. After several hours and cutting every column, it is still happening.
Does anyone know of a Snowflake diagnostic query (like Redshift has) which can tell me a specific column where the issue is occurring?
Unfortunately not at the point you're at. If you went back to the COPY INTO that loaded the data, you'd be able to use the VALIDATE() function to get better information down to the record and byte-offset level.
I would query your staging table for just the numeric fields and look for blanks, or you can wrap all of the fields destined for numeric columns with try_to_number() functions. A bit tedious, but might not be too bad if you don't have a lot of numbers.
https://docs.snowflake.com/en/sql-reference/functions/try_to_decimal.html
As a note, when you stage, you should try and use the NULL_IF options to get rid of bad characters and/or try to load them into stage using the actual datatypes in your stage table, so you can leverage the VALIDATE() function to make sure the data types are correct before loading into Snowflake.
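A rough sketch of that load pattern (the stage, table, and file-format details here are hypothetical):
-- Treat empty strings (and the literal \N) as NULL while loading.
COPY INTO my_table
FROM @my_stage/data/
FILE_FORMAT = (TYPE = 'CSV' NULL_IF = ('', '\\N'))
ON_ERROR = 'CONTINUE';

-- Then inspect any records the load rejected.
SELECT * FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));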
Query your staging table using try_to_number() and/or try_to_decimal() for the number and decimal fields of the table, and then use MINUS to get the difference:
SELECT $1, $2, ... $300 FROM #stage
MINUS
SELECT $1, TRY_TO_NUMBER($2), ... $300 FROM #stage
If any number field has a string that cannot be converted, it will become NULL, and the MINUS will return the rows that have a problem. Once you get those rows, analyze the columns in the result set for errors.
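To narrow it down to a specific column, a per-column check like this can also help (the column and table names are placeholders):
-- Values in a candidate column that would fail a numeric cast;
-- blank strings and non-numeric text both show up here.
SELECT my_numeric_col
FROM staging_table
WHERE TRY_TO_NUMBER(my_numeric_col) IS NULL
  AND my_numeric_col IS NOT NULL;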
I have two tables in a database in AWS Athena that I want to join.
I want to join them by several columns, one of them being date.
However, in one data set the date string for single-digit months is encoded with a leading zero, for example
"08/31/2018"
while the other data set encodes it as
"8/31/2018"
Is there a way to make them the same format?
I am unsure whether it is easier to add the extra 0 to the strings that lack it or to remove it from the strings that have it.
Based on what I have researched I think I will have to use the CASE and CONCAT functions.
Both of the tables were loaded into the database from a CSV file, and the variables are in the string format.
I have tried changing the values manually in the CSV file, tried running an R script on one of the tables to format the date in the same way, and have also tried re-loading the tables into the database as the same date format.
However, no matter what I do, whenever the data is loaded into the database, even when the columns have the same date type, they always end up with different formats: one with the extra 0 and the other without it.
The last avenue I haven't tried is through a SQL query.
However I am not well versed in Athena and am having a hard time formatting this query.
I know this is rather vague, so please ask me for more information if you need.
If someone could help me start this query I would be grateful.
Thank you for the help.
Here is the expression for parsing the date strings in Athena:
date_parse(table.date_variable,'%m/%d/%Y')
Note that Athena tables are immutable once created, so you cannot rewrite the stored values; you apply the conversion in your queries instead.
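Since the table itself cannot be changed, one option is to expose a normalized version through a view (a sketch, with hypothetical table and column names):
-- View exposing the string column parsed to a proper DATE.
CREATE OR REPLACE VIEW table1_normalized AS
SELECT
    id,
    CAST(date_parse(date_variable, '%m/%d/%Y') AS DATE) AS date_variable
FROM table1;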
You can convert the value to date using date_parse(). So, this should work:
date_parse(t1.datecol, '%m/%d/%Y') = date_parse(t2.datecol, '%m/%d/%Y')
Having said that, you should fix the data model. Store dates as dates not as strings! Then you can use an equality join and that is just better all around.
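A fuller sketch of such a join (the table names and the extra join key id are hypothetical):
-- Parsing both sides lets "08/31/2018" and "8/31/2018" compare as equal dates.
SELECT t1.*, t2.*
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id
 AND date_parse(t1.datecol, '%m/%d/%Y') = date_parse(t2.datecol, '%m/%d/%Y');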
I have a table created in BigQuery that is partitioned by date, and the partition column has the DATE type. Dataprep also has the same column with the same data type. When I try to load the data from Dataprep into the BigQuery table, I get the error "The column datatypes in the dataset must match the destination column datatypes". A screenshot is attached; please go through it and give me a solution.
As the screenshot shows, one column is a TIME, DATETIME, or TIMESTAMP, while the other is a STRING, as indicated by the icons in front of your columns.
You need to make sure that you've chosen the right data type at the dataset level; Dataprep's automatic type inference sometimes gets your data type wrong.
In this thread it is mentioned that you need to convert both types to TIMESTAMP in order to make this work. In my case this did the trick, but it is kind of cumbersome. Hopefully they will support this for simple DATE columns soon.
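If you end up doing the conversion with SQL on the BigQuery side instead of in Dataprep, a minimal sketch (assuming a hypothetical staging table with a STRING column date_string in YYYY-MM-DD form) would be:
-- Parse the STRING column into a DATE so it matches the partitioned column's type.
SELECT
    PARSE_DATE('%Y-%m-%d', date_string) AS event_date
FROM my_dataset.staging_table;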
I have some data which was dumped from a PostgreSQL database (allegedly, using pg_dump) which needs to get imported into SQL Server.
While the data types are OK, I am running into an issue where there seems to be a placeholder for NULL: I see a backslash followed by an uppercase N in many fields. Below is a snippet of the data, as viewed from within Excel. The left column has a Boolean data type, and the right one has an integer data type.
Some of these are supposed to be of the Boolean datatype, and having two characters in there is most certainly not going to fly.
Here's what I tried so far:
Import via dirty read - keeping whatever datatypes SSIS decided each field had; to no avail. There were error messages about truncation on all of the boolean fields.
Creating a table for the data based on the correct data types, though this was more fun... I needed to do the same as in the dirty read, as the source would otherwise not load properly. There was also a need to transform the data into the correct data type for insertion into the destination; yet I am getting truncation issues when there most certainly shouldn't be any.
Here is a sample expression in my derived column transformation editor:
(DT_BOOL)REPLACE(observation,"\\N","")
The data type should be Boolean.
Any suggestion would be really helpful!
Thanks!
Since I was unable to circumvent the SSIS rules in order to get my data into my tables without an error, I took the quick-and-dirty approach.
The solution that worked for me was to have the source read each column as if it were a string and to give every field in the destination table the VARCHAR datatype. This destination table is used as a staging table; once the data is in SQL Server, I can manipulate it as needed.
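For reference, a rough sketch of that follow-up conversion inside SQL Server, assuming a hypothetical staging table stg_import whose VARCHAR columns observation and reading_count hold 't'/'f'/'\N' values and digit strings respectively:
-- Turn the \N placeholders into real NULLs and cast to the target types.
INSERT INTO dbo.target_table (observation, reading_count)
SELECT
    CASE NULLIF(observation, '\N')
        WHEN 't' THEN 1
        WHEN 'f' THEN 0
    END,                                         -- BIT column; NULL where the source was \N
    CAST(NULLIF(reading_count, '\N') AS INT)     -- INT column
FROM dbo.stg_import;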
Thank you #cha for your input.