SQLCODE=-302, SQLSTATE=22001 error while loading Spark dataframes to Relational DB tables - dataframe

I have a dataframe with 100-plus columns, the majority of them StringType. During the load into a DB2 table, I sometimes face DB2 SQL Error: SQLCODE=-302, SQLSTATE=22001. On checking the error code I came to know that it's a length issue: one or more columns in my dataframe (usually StringType columns) might have data which exceeds the length of the VARCHAR or CHAR field it is being loaded into. The main issue is that the error log doesn't specify which column or row causes this, so I have to go through the whole dataframe. Is there any elegant solution available for this?
For troubleshooting, I have tried to get the max length of each column in the dataframe and check whether it exceeds the column length in the database:
from pyspark.sql.functions import col, length

# Max length of one column at a time (here column1)
df.withColumn("length_of_column1", length(col("column1"))) \
  .groupBy() \
  .max("length_of_column1") \
  .show()
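
A sketch of a more general check, assuming the target table's VARCHAR/CHAR lengths are known up front (the db_lengths mapping below is hypothetical and would be filled in from the DB2 catalog): compute the max length of every StringType column in a single aggregation and flag the ones that exceed their target length.

from pyspark.sql.functions import col, length, max as max_
from pyspark.sql.types import StringType

# Hypothetical mapping: dataframe column name -> VARCHAR/CHAR length in the DB2 table
db_lengths = {"column1": 50, "column2": 100}

string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

# Max length of every string column, computed in one pass
max_lengths = df.agg(*[max_(length(col(c))).alias(c) for c in string_cols]).collect()[0].asDict()

for c, max_len in max_lengths.items():
    if c in db_lengths and max_len is not None and max_len > db_lengths[c]:
        print(f"{c}: max length {max_len} exceeds target length {db_lengths[c]}")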

Related

"Numeric value '' is not recognized" - what column?

I am trying to insert data from a staging table into the master table. The table has nearly 300 columns and is a mix of data types: VARCHARs, INTEGERs, DECIMALs, DATEs, etc.
Snowflake gives the unhelpful error message "Numeric value '' is not recognized".
I have gone through and cut out various parts of the query to try to isolate where it is coming from. After several hours and cutting every column, it is still happening.
Does anyone know of a Snowflake diagnostic query (like Redshift has) which can tell me a specific column where the issue is occurring?
Unfortunately not at the point you're at. If you went back to the COPY INTO that loaded the data, you'd be able to use the VALIDATE() function to get better information down to the record and byte-offset level.
I would query your staging table for just the numeric fields and look for blanks, or you can wrap all of the fields destined for numeric columns in try_to_number() functions. A bit tedious, but it might not be too bad if you don't have many numeric fields.
https://docs.snowflake.com/en/sql-reference/functions/try_to_decimal.html
As a note, when you stage, you should try to use the NULL_IF options to get rid of bad characters, and/or try to load into your stage table using the actual data types, so you can leverage the VALIDATE() function to make sure the data types are correct before loading into Snowflake.
Query your staging table using try_to_number() and/or try_to_decimal() for the number and decimal fields, and then use MINUS to get the difference:
SELECT $1, $2, ... $300 FROM #stage
MINUS
SELECT $1, try_to_number($2), ... $300 FROM #stage
If any number field has a string that cannot be converted, it will be NULL, and the MINUS should then return those rows which have a problem. Once you get the rows, try to analyze the columns in the result set for errors.
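For a 300-column table, typing out the wrapped SELECT by hand is painful; a small sketch like the following (plain Python string building; the column count and the positions of the numeric fields are assumptions) can generate both sides of the MINUS:

num_cols = 300
numeric_positions = {2, 17, 45}  # hypothetical 1-based positions of the numeric fields

plain = ", ".join(f"${i}" for i in range(1, num_cols + 1))
wrapped = ", ".join(
    f"try_to_number(${i})" if i in numeric_positions else f"${i}"
    for i in range(1, num_cols + 1)
)

print(f"SELECT {plain} FROM #stage\nMINUS\nSELECT {wrapped} FROM #stage")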

pandas dataframe apply function with ValueError

I have multiple tables to process. One column of each table is date-related, but the date format varies, e.g. 5/26, 05/26, 05/26/2020, 05262020, 5262020. I used
import dateutil.parser

df[date] = df[date].apply(dateutil.parser.parse, dayfirst=dayfirst,
                          yearfirst=yearfirst)
It used to work just fine, but recently some tables have strings like "unknown" or "missing" or other strings in the date column. Then I got an error and it broke the process:
"ValueError: Unknown string format"
How can I handle this and exclude the rows for which I get "ValueError: Unknown string format"?
Thanks.
Figured out a way to handle this: use a regular expression first to exclude those rows, then apply:
import re

df = df[df[date].str.contains(re.compile(r'\d+'))]
df[date] = df[date].apply(dateutil.parser.parse, dayfirst=dayfirst,
                          yearfirst=yearfirst)
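
An alternative sketch that avoids the regex filter entirely, using pandas' own parser: pd.to_datetime with errors="coerce" turns unparseable strings such as "unknown" or "missing" into NaT, and those rows can then be dropped (same df, date, dayfirst and yearfirst as above):

import pandas as pd

# Unparseable strings become NaT instead of raising ValueError
df[date] = pd.to_datetime(df[date], errors="coerce",
                          dayfirst=dayfirst, yearfirst=yearfirst)

# Drop the rows that could not be parsed
df = df.dropna(subset=[date])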

Can I assign multiple datatypes to a Pandas column?

I have a huge amount of data, and any work on this data takes a really long time. One of the tips that I read for dealing with a large amount of data is to change the data types of the columns to either 'int' or 'float' where possible.
I tried to follow this method, but I am getting errors because my column contains both float and string values. The error looks like this: "Unable to parse string "3U00" at position 18". Hence my questions:
1) Is there a way I can assign multiple data types to one column, and how can I do that?
2) If I am able to achieve the above, does this decrease my processing time?
Currently, when I type:
dftest.info()
Result:
A_column non-null object
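
As a rough sketch of the kind of conversion that tip describes (not a definitive answer): pd.to_numeric with errors="coerce" converts what it can and turns strings such as "3U00" into NaN, so the column can be given a numeric dtype. A_column is taken from the info() output above; A_column_num is a made-up name.

import pandas as pd

# Values that parse become floats; strings such as "3U00" become NaN
dftest["A_column_num"] = pd.to_numeric(dftest["A_column"], errors="coerce", downcast="float")

# Rows whose original value was not numeric, in case they need separate handling
non_numeric = dftest[dftest["A_column_num"].isna() & dftest["A_column"].notna()]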

Finding which column caused the PostgreSQL exception in a query

I have a staging table with around 200 columns in Redshift. I first copy data from S3 to this table and then copy data from this table to another table using a large INSERT INTO ... SELECT FROM query. Most of the fields in the staging table are VARCHAR, which I convert to the proper data type in the query.
Some field in the staging table is causing a numeric overflow:
org.postgresql.util.PSQLException: ERROR: Numeric data overflow (addition)
Detail:
-----------------------------------------------
error: Numeric data overflow (addition)
code: 1058
context:
query: 9620240
location: numeric.hpp:112
process: query1_194 [pid=680]
How can I find which field is causing this overflow, so that I can sanitize my input or correct my query?
I use Netezza, which can also use regex functions to grep out rows. Fortunately, Redshift supports regexp as well. Please take a look at
http://docs.aws.amazon.com/redshift/latest/dg/REGEXP_COUNT.html
So the idea in your case is to use the regexp in the WHERE clause; this way you can find which values are exceeding the numeric cast occurring during the insert. The hard part will be finding identifying data that allows you to determine which rows in the physical file are causing the issue. You could create another copy of the data with row numbers in a temporary table, and use that temporary table as your source of analysis. How large is the numeric field you are going into? You may need to do this analysis against more than one column if you have multiple columns being cast to numeric.
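
A rough sketch of that idea driven from Python (psycopg2 here; the staging table name, the candidate columns, and the NUMERIC(18,2) target precision are all assumptions): for each varchar column that gets cast to numeric, use REGEXP_COUNT to flag values that are not purely numeric or whose integer part has more digits than the target column allows.

import psycopg2

# Hypothetical staging table and the varchar columns that are cast to NUMERIC(18,2)
staging_table = "staging"
numeric_columns = ["amount", "fee", "balance"]
max_integer_digits = 16  # NUMERIC(18,2) leaves 16 digits before the decimal point

not_numeric = '[^0-9. -]'                           # any character that can't appear in a number
too_long = '[0-9]{%d,}' % (max_integer_digits + 1)  # a digit run longer than the target allows

conn = psycopg2.connect("host=my-cluster port=5439 dbname=mydb user=me password=secret")
cur = conn.cursor()

for col_name in numeric_columns:
    cur.execute(
        f"SELECT {col_name} FROM {staging_table} "
        f"WHERE {col_name} IS NOT NULL "
        f"AND (REGEXP_COUNT({col_name}, '{not_numeric}') > 0 "
        f"OR REGEXP_COUNT({col_name}, '{too_long}') > 0) "
        f"LIMIT 20"
    )
    for (value,) in cur.fetchall():
        print(f"{col_name}: suspicious value {value!r}")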

Why am I getting a "[SQL0802] Data conversion of data mapping error" exception?

I am not very familiar with iseries/DB2. However, I work on a website that uses it as its primary database.
A new column was recently added to an existing table. When I view it via AS400, I see the following data type:
Type: S
Length: 9
Dec: 2
This tells me it's a numeric field with 7 digits before the decimal point, and 2 digits after the decimal point.
When I query the data with a simple SELECT (SELECT MYCOL FROM MYTABLE), I get back all the records without a problem. However, when I try using a DISTINCT, GROUP BY, or ORDER BY on that same column I get the following exception:
[SQL0802] Data conversion of data mapping error
I've deduced that at least one record has invalid data - what my DBA calls "blanks" or "4 O". How is this possible, though? Shouldn't the database throw an exception when invalid data is added to that column?
Is there any way I can get around this, such as filtering out those bad records in my query?
"4 O" means 0x40 which is the EBCDIC code for a space or blank character and is the default value placed into any new space in a record.
Legacy programs/operations can introduce the decimal data error, for example if the new file was created and filled using the CPYF command with the FMTOPT(*NOCHK) option.
The easiest way to fix it is to write an HLL program (RPG) to read the file and correct the records.
The only solution I could find was to write a script that checks for blank values in the column and then updates them to zero when they are found.
If the file has record format level checking turned off [i.e. LVLCHK(*NO)] or is overridden to that, then an HLL program (e.g. RPG, COBOL, etc.) that was not recompiled with the new record format might write out records with invalid data in this column, especially if the new column is not at the end of the record.
Make sure that all programs that use native I/O to write or update records on this file are recompiled.
I was able to solve this error by force-casting the key columns to integer. I changed the join from this...
FROM DAILYV INNER JOIN BXV ON DAILYV.DAITEM=BXV.BXPACK
...to this...
FROM DAILYV INNER JOIN BXV ON CAST(DAILYV.DAITEM AS INT)=CAST(BXV.BXPACK AS INT)
...and I didn't have to make any corrections to the tables. This is a very old, very messy database with lots of junk in it. I've made many corrections, but it's a work in progress.