pandas dataframe apply function with ValueError

I have multiple tables to process. One column of each table is date-related, but the date format varies, e.g. 5/26, 05/26, 05/26/2020, 05262020, 5262020. I used
df[date] = df[date].apply(dateutil.parser.parse, dayfirst=dayfirst,
                          yearfirst=yearfirst)
It used to work just fine, but recently some tables have strings like "unknown" or "missing" in the date column, and then I get an error that breaks the process:
"ValueError: Unknown string format"
How can I handle this and exclude the rows that raise
"ValueError: Unknown string format"?
Thanks.

I figured out a way to handle this: use a regular expression first to exclude those rows, then apply the parser.
df = df[df["date"].str.contains(r"\d+", na=False)]
df[date] = df[date].apply(dateutil.parser.parse, dayfirst=dayfirst,
                          yearfirst=yearfirst)
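As a self-contained illustration, here is a minimal sketch of this filter-then-parse approach, assuming a column literally named "date" and a small made-up table; dayfirst/yearfirst are shown with their default values.
import dateutil.parser
import pandas as pd

# Made-up table mixing several date layouts with junk strings.
df = pd.DataFrame({"date": ["5/26", "05/26", "05/26/2020", "unknown", "missing"]})

# Keep only rows whose date column contains at least one digit, then parse.
# na=False guards against missing values in the column.
df = df[df["date"].str.contains(r"\d+", na=False)]
df["date"] = df["date"].apply(dateutil.parser.parse, dayfirst=False, yearfirst=False)
print(df)
An alternative (not used in the answer above) is pd.to_datetime(df["date"], errors="coerce"), which turns unparseable strings into NaT so the bad rows can be dropped with dropna() afterwards.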

Related

SQLCODE=-302, SQLSTATE=22001 error while loading Spark dataframes to Relational DB tables

I have a dataframe with 100-plus columns, the majority of them StringType. During the load into a DB2 table, I sometimes face DB2 SQL Error: SQLCODE=-302, SQLSTATE=22001. On checking the error code I learned that it's a length issue: one or more columns in my dataframe (usually StringType columns) may have data that exceeds the length of the VARCHAR or CHAR field it's being loaded into. The main issue is that the error log doesn't specify which column or row causes this, so I have to go through the whole dataframe. Is there any elegant solution available for this?
For troubleshooting I have tried to get the max length of each column in dataframe and check if it exceeds the column length in database
from pyspark.sql.functions import col, length

(df.withColumn("length_of_column1", length(col("column1")))
   .groupBy()
   .max("length_of_column1")
   .show())
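Building on that, a rough sketch (not from the original post) that computes the maximum observed length of every StringType column in a single aggregation, so the widths can be compared against the DB2 VARCHAR/CHAR sizes; the tiny DataFrame below is a stand-in for the one being loaded.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length, max as spark_max
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DataFrame that fails to load into DB2.
df = spark.createDataFrame([("short", "a much longer value", 1)],
                           ["column1", "column2", "column3"])

# Maximum length seen in each StringType column, in one pass.
string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
df.agg(*[spark_max(length(col(c))).alias(c) for c in string_cols]).show(truncate=False)

# Any column whose maximum exceeds the target VARCHAR/CHAR size is a candidate
# for the SQLCODE=-302, SQLSTATE=22001 failure.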

"Numeric value '' is not recognized" - what column?

I am trying to insert data from a staging table into the master table. The table has nearly 300 columns and is a mix of data types: Varchars, Integers, Decimals, Dates, etc.
Snowflake gives the unhelpful error message of "Numeric value '' is not recognized"
I have gone through and cut out various parts of the query to try and isolate where it is coming from. After several hours and cutting every column, it is still happening.
Does anyone know of a Snowflake diagnostic query (like Redshift has) which can tell me a specific column where the issue is occurring?
Unfortunately not at the point you're at. If you went back to the COPY INTO that loaded the data, you'd be able to use the VALIDATE() function to get better information, down to the record and byte-offset level.
I would query your staging table for just the numeric fields and look for blanks, or you can wrap all of the fields destined for numeric columns in try_to_number() functions. A bit tedious, but it might not be too bad if you don't have a lot of numeric fields.
https://docs.snowflake.com/en/sql-reference/functions/try_to_decimal.html
As a note, when you stage, you should try and use the NULL_IF options to get rid of bad characters and/or try to load them into stage using the actual datatypes in your stage table, so you can leverage the VALIDATE() function to make sure the data types are correct before loading into Snowflake.
Query your staging table using try_to_number() and/or try_to_decimal() for the number and decimal fields of the table, then use MINUS to get the difference:
SELECT $1, $2, ... $300 FROM #stage
MINUS
SELECT $1, try_to_number($2), ... $300 FROM #stage
If any numeric field has a string that cannot be converted, it will be NULL, so the MINUS should return the rows that have a problem. Once you have those rows, analyze the columns in the result set for errors.

BigQuery - Inferring Datatypes of Column Values

What is the best way to determine the datatype of a column value if the data has already been loaded and every column has been classified as STRING (i.e. the BQ table metadata has "STRING" as the datatype for every column)? I've found a few different methods, but I'm not sure if I'm missing any, or whether any of these is substantially more performant. The result should include statistics at the grain of each value, not just per column.
Using a combination of CASE and SAFE_CAST on the STRING value to sum up all the instances where it was successfully CAST to X data type (where X is any datatype, like INT64 or DATETIME, with a few lines in the query repeating the SAFE_CAST to cover all potential datatypes)
Similar to the above, but using REGEXP_CONTAINS instead of SAFE_CAST on every value and summing up all instances of TRUE (a community UDF also seems to tackle this: https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/typeof.sql)
(For the above, COUNTIF(), IF statements, etc. can also be used.)
Loading the data into a pandas dataframe and using something like pd.api.types.infer_dtype to infer automatically, though this adds overhead and more components (see the sketch after this question)
Thanks!
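A rough sketch of that pandas route, assuming made-up column names; a plain numeric cast and a fixed %Y-%m-%d date layout stand in for the SAFE_CAST checks.
import pandas as pd

# Made-up table where every column arrived as STRING.
df = pd.DataFrame({
    "a": ["1", "2", "3"],
    "b": ["2020-05-26", "2020-05-27", "not a date"],
})

# Column-level guess based on the raw values (returns "string" for both here).
for name in df.columns:
    print(name, pd.api.types.infer_dtype(df[name]))

# Value-level statistics, mirroring the SAFE_CAST-and-sum idea: count how many
# entries in each column survive a cast to a number or to the assumed date layout.
print(df.apply(lambda s: pd.to_numeric(s, errors="coerce").notna().sum()))
print(df.apply(lambda s: pd.to_datetime(s, format="%Y-%m-%d", errors="coerce").notna().sum()))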

Can I assign multiple datatypes to a Pandas column?

I have a huge amount of data, and any work on it takes a really long time. One of the tips I read for dealing with a large amount of data is to change the datatypes of the columns to either 'int' or 'float' if possible.
I tried to follow this method, but I am getting errors because my column contains both float and string values. The error looks like this: "Unable to parse string "3U00" at position 18". Hence my questions:
1) Is there a way I can assign multiple data types to one column, and how can I do that?
2) If I am able to achieve the above, does this decrease my processing time?
Currently, when I type:
dftest.info()
Result:
A_column non-null object
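Not part of the original question, but a minimal sketch reproducing the situation: a column mixing numeric strings with codes like "3U00" stays as object dtype, a strict cast raises the quoted error, and pd.to_numeric with errors="coerce" converts what it can while marking the rest as NaN.
import pandas as pd

# Made-up column mixing numeric strings with a code such as "3U00".
dftest = pd.DataFrame({"A_column": ["1.5", "2", "3U00", "4.0"]})
dftest.info()  # A_column shows up as non-null object

# Strict conversion raises: ValueError: Unable to parse string "3U00" at position 2
# pd.to_numeric(dftest["A_column"])

# Lenient conversion yields a float64 column and marks "3U00" as NaN.
dftest["A_numeric"] = pd.to_numeric(dftest["A_column"], errors="coerce")
print(dftest.dtypes)
Pandas keeps one dtype per column, so a column that genuinely mixes floats and strings falls back to object; coercing or splitting the values is the usual way to get the memory and speed benefits of a numeric dtype.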

Pandas interprets 'timestamp without timezones' columns as different types

I read a table with pandas:
import pandas as pd
import numpy as np
import psycopg2

con = psycopg2.connect(...)
mframe = pd.read_sql('''select dt_A, dt_B from (...)''', con)
Both columns (dt_A and dt_B) are of type 'timestamp without timezone' in the database. However, they are read as different types by pandas:
mframe.dt_A.dtype, mframe.dt_B.dtype
Yields:
(dtype('O'), dtype('<M8[ns]'))
I was able to force both columns to be recognized as
"<M8[ns]"
using the 'parse_dates' parameter, but I'd like to understand what causes this. As far as I've checked, neither column contains any 'NA's (which was my first suspicion). What could cause them to be interpreted differently?
Update:
I'm using Pandas version 0.15.1; and I can reproduce the problem using both sqlalchemy and psycopg2 connections.
Update 2: running the original query with a small LIMIT works as I expected, that is, both columns have the same dtype "<M8[ns]". Still not sure what kind of entry (something ill-formatted?) is causing this, but I'm satisfied for now.
Update 3: joris got it. See the comments below.
As you noted, it works correctly when limiting to some data (by adding LIMIT 5 to your query), so it probably has to do with some 'incorrect' values in the dates.
To find out what value is causing the problem, you can read in all data (resulting in the object dtype), and then do the conversion manually with:
pd.to_datetime(column, errors='raise')
The errors='raise' will ensure you get an error message indicating which date cannot be converted.
To ensure that the column is converted to datetime64 values, regardless of invalid values, you should specify the column in the parse_dates kwarg.
It seems that when using read_sql_table, the invalid date will be converted to NaT automatically, while read_sql_query will leave the values as datetime.datetime values. I opened an issue for this inconsistency: https://github.com/pydata/pandas/issues/9261
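A short sketch of the diagnostic described above, using a stand-in frame with an out-of-range date as an example of the kind of 'incorrect' value that can leave a column as object dtype; the column names follow the question.
import pandas as pd

# Stand-in for a frame read with read_sql_query: dt_A mixes a valid timestamp
# with an out-of-range one, so pandas leaves the column as object dtype.
mframe = pd.DataFrame({"dt_A": ["2015-01-14 10:00:00", "0001-01-01 00:00:00"]})
print(mframe["dt_A"].dtype)  # object

try:
    pd.to_datetime(mframe["dt_A"], errors="raise")
except ValueError as exc:
    # The exception message points at the value that cannot be converted.
    print("offending value:", exc)

# Passing the columns via parse_dates to read_sql converts them to datetime64
# and turns invalid entries into NaT:
# mframe = pd.read_sql(query, con, parse_dates=["dt_A", "dt_B"])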