DBT - Test that column values match a specific date format

I'm using DBT to transform data from a source table with all STRING fields to a target table with TYPED fields (e.g. DATE, INT, ...).
I would like to ensure (using the dbt test command) that the datatype conversion is possible before launching the dbt run command. For instance, for expected DATE fields (stored as STRING in my source table), an assert must be run on the whole column's values for the test to pass.
The dbt-expectations package has some useful tests like "expect_column_values_to_be_of_type", but this test checks the column's datatype (in the table's structure) instead of checking whether all the column's values match a specific datatype.
Is there a way to avoid writing a custom test and use a native or packaged one instead?
EDIT: a candidate could be "expect_column_values_to_match_regex", but perhaps there is a better one...
Thank you very much for help :)

In the dbt-expectations package, I think that expect_column_values_to_match_regex and expect_column_values_to_not_match_regex are the best fit for your requirement.
Alternatively, you can use dbt_utils.expression_is_true to write your own SQL. Since you also asked about a native approach, that is the way to do it with native SQL.
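For example, a minimal sketch of a schema.yml entry; the model and column names here are hypothetical, and the expression_is_true check assumes BigQuery, where SAFE.PARSE_DATE returns NULL instead of erroring (other warehouses have analogous functions such as TRY_CAST):

version: 2
models:
  - name: stg_source_table
    tests:
      # model-level check: every value must actually parse as a date
      - dbt_utils.expression_is_true:
          expression: "safe.parse_date('%Y-%m-%d', as_of_date) is not null"
    columns:
      - name: as_of_date
        tests:
          # column-level check: every value must look like YYYY-MM-DD
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '\d{4}-\d{2}-\d{2}'

Note that the regex only checks the shape of the value, not that it is a valid calendar date (2019-13-45 would pass), which is why pairing it with a cast-based check like the one above can be worthwhile.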

Data Factory expression substring? Is there a function similar to right?

Please help,
How could I extract 2019-04-02 out of the following string with Azure data flow expression?
ABC_DATASET-2019-04-02T02:10:03.5249248Z.parquet
The first part of the string, received as a ChildItem from a GetMetadata activity, is dynamic. So in this case it is ABC_DATASET that is dynamic.
Kind regards,
D
There are several ways to approach this problem, and they are really dependent on the format of the string value. Each of these approaches uses Derived Column to either create a new column or replace the existing column's value in the Data Flow.
Static format
If the format is always the same, meaning the length of the sections is always the same, then substring is simplest:
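For example, a sketch assuming the incoming column is named FileName (a hypothetical name) and the prefix is always the 12-character 'ABC_DATASET-':

toDate(substring(FileName, 13, 10))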
Applied to the example file name, this parses out 2019-04-02.
Useful reminder: substring and array indexes in Data Flow are 1-based.
Dynamic format
If the format of the base string is dynamic, things get a tad trickier. For this answer, I will assume that the basic format of {variabledata}-{timestamp}.parquet is consistent, so we can use the hyphen as the base delimiter.
Derived Column has support for local variables, which is really useful when solving problems like this one. Let's start by creating a local variable to convert the string into an array based on the hyphen. This will lead to a complication, since the string includes multiple hyphens thanks to the timestamp data, but we'll deal with that below. Inside the Derived Column Expression Builder, select "Locals":
On the right side, click "New" to create a local variable. We'll name it and define it using a split expression:
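A sketch of what that definition might look like, assuming the incoming column is named FileName and the local is called varSections (both names are hypothetical):

split(FileName, '-')

For the example value, this yields ['ABC_DATASET', '2019', '04', '02T02:10:03.5249248Z.parquet'].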
Press "OK" to save the local and go back to the Derived Column. Next, create another local variable for the yyyy portion of the date:
The cool part of this is I am now referencing the local variable array that I created in the previous step. I'll follow this pattern to create a local variable for MM too:
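For example:

:varSections[3]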
I'll do this one more time for the dd portion, but this time I have to do a bit more to get rid of all the extraneous data at the end of the string. Substring again turns out to be a good solution:
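A sketch, taking just the first two characters of the last array element to strip off the timestamp tail:

substring(:varSections[4], 1, 2)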
Now that I have the components I need isolated as variables, we just reconstruct them using string interpolation in the Derived Column:
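A sketch of that final expression, assuming locals named varYear, varMonth, and varDay, and that locals can be referenced inside the interpolation braces (if not, concat(:varYear, '-', :varMonth, '-', :varDay) is an equivalent fallback):

toDate("{:varYear}-{:varMonth}-{:varDay}")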
Back in the data preview, you can verify the results.
Where else to go from here
If these solutions don't address your problem, then you have to get creative. Here are some other functions that may help:
regexSplit
left
right
dropLeft
dropRight

How can I change a date field from String to Date or DateTime?

I am using Google BigQuery and I have a field named 'AsOfDate' which is set as a string datatype. I have a bunch of data in this field, which I really want to set as DateTime or just Date; either is fine. I Googled for a solution, and I thought this would be pretty easy to do, but I can't seem to get the data type updated. I don't want to run a simple select statement; I want to permanently change the schema. Has anyone run into this and figured out how to do this kind of thing? If so, please share your insights. Thanks!
To quote directly from the official documentation: 'Changing a column's data type is not supported by the BigQuery web UI, the command-line tool, or the API.'
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
There are two ways to manually change a column's data type:
Using a SQL query - choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table - choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
You could use either of the approaches above along with the PARSE_DATE() function to transform your string into a date field.
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#parse_date
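For example, a sketch of the SQL-query approach; the dataset, table, and format string here are hypothetical and should be adjusted to your data:

-- Rewrites the table in place, converting the STRING column AsOfDate to DATE.
-- Note: this scans and rewrites the whole table, so query costs apply.
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT
  * EXCEPT (AsOfDate),
  PARSE_DATE('%Y-%m-%d', AsOfDate) AS AsOfDate
FROM mydataset.mytable;

PARSE_DATETIME works the same way if you would rather end up with a DATETIME column.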

Import PostgreSQL dump into SQL Server - data type errors

I have some data which was dumped from a PostgreSQL database (allegedly, using pg_dump) which needs to get imported into SQL Server.
While the data types are OK, I am running into an issue where there seems to be a placeholder for NULL: I see a backslash followed by an uppercase N (\N) in many fields. Below is a snippet of the data, as viewed from within Excel. The left column has a Boolean data type, and the right one has an integer data type.
Some of these are supposed to be of the Boolean datatype, and having two characters in there is most certainly not going to fly.
Here's what I tried so far:
Importing via a dirty read, keeping whatever datatypes SSIS decided each field had; to no avail. There were error messages about truncation on all of the Boolean fields.
Creating a table for the data based on the correct data types. This was more fun: I needed to do the same as in the dirty read, as the source would otherwise not load properly, and then transform the data into the correct data type for insertion into the destination. Yet I am getting truncation issues when there most certainly shouldn't be any.
Here is a sample expression in my derived column transformation editor:
(DT_BOOL)REPLACE(observation,"\\N","")
The data type should be Boolean.
Any suggestion would be really helpful!
Thanks!
Since I was unable to circumvent the SSIS rules in order to get my data into my tables without an error, I took the quick-and-dirty approach.
The solution which worked for me was to have the source read each column as if it were a string, and to give every field in the destination table the datatype VARCHAR. This destination table is used as a staging table; once the data is in SQL Server, I can manipulate it as needed.
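For anyone landing here later, a sketch of what that post-load cleanup could look like; the table and column names are hypothetical, and this assumes pg_dump's text format, where NULL is written as \N and booleans as t/f:

-- Move rows from the all-VARCHAR staging table into the typed table,
-- converting pg_dump's \N marker to real NULLs along the way.
INSERT INTO dbo.observations (is_active, reading_count)
SELECT
    CASE NULLIF(is_active, '\N') WHEN 't' THEN 1 WHEN 'f' THEN 0 END,
    CAST(NULLIF(reading_count, '\N') AS INT)
FROM dbo.observations_staging;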
Thank you #cha for your input.

Insert data via SSIS package and different datatypes

I have a table with a column1 of type nvarchar(50) null. I want to insert this into a more 'tight' table with an nvarchar(30) not null column. My idea was to insert a Derived Column task between the source and destination tasks with this expression: Replace column1 = (DT_WSTR,30)Column1
I get the "truncation may occur error" and I am not allowed to insert the data into the new tighter table.
Also I am 100% sure that no values are over 30 characters in the column. Moreover I do not have the possibility to change the column data type in the source.
What is the best way to create the ETL process?
JotaBe recommended using a data conversion transformation. Yes, that is another way to achieve the same thing, but it will also error out if truncation occurs. Your way should work (I tried it), provided the input data really is less than 30 characters.
You could modify your derived column expression to
(DT_WSTR,30)SUBSTRING([Column1],1,30)
Consider changing the truncation error disposition of the Derived Column component within your Data Flow. By default, a truncation will cause the Derived Column component to fail. You can configure the component to ignore or redirect rows which are causing a truncation error.
To do this, open the Derived Column Transformation editor and click the 'Configure Error Output...' button in the bottom-left of the dialog. From here, change the setting in the 'Truncation' column for any applicable columns as required.
Be aware that any data which is truncated for columns ignoring failure will not be reported by SSIS during execution. It sounds like you've already done this, but it's important to be sure you've analysed your data as it currently stands and taken into consideration any possible future changes to the nature of the data before disabling truncation reporting.
To do so you can use a Data Conversion Transformation, which allows you to change the data type from the original nvarchar(50) to the desired nvarchar(30).
You'll get a new column with the required data type.
Of course, you can decide what to do in case of error: truncation, by configuring this component.
UPDATE
As there are people who have downvoted this answer, let's add 3 more comments:
this solution is checked and works. Create a table with an nvarchar(50) column and a new table with an nvarchar(30) column, add a data flow that uses a Data Conversion transform, and it works without a glitch. Please check it, I guarantee. Besides, as the OP states "Also I am 100% sure that no values are over 30 characters in the column", in his case there will be no truncation problems. However, I recommend handling the possible errors, just in case they happen.
from MSDN: "a package can perform the following types of data conversions: ... Set the column length of string data"
from MSDN: "If the length of an output column of string data is shorter than the length of its corresponding input column, the output data is truncated."

How can I get SSIS Lookup transformation to ignore alphabetical case?

Hopefully this is easy to explain, but I have a lookup transformation in one of my SSIS packages. I am using it to look up the id for an employee record in a dimension table. However, my problem is that some of the source data has employee names in all capitals (ex: CHERRERA) and the comparison data I'm using is all lower case (ex: cherrera).
The lookup is failing for the records that are not 100% case similar (ex: cherrera vs cherrera works fine - cherrera vs CHERRERA fails). Is there a way to make the lookup transformation ignore case on a string/varchar data type?
I don't believe there is a way to make the transformation itself case-insensitive; however, you could modify the SQL statement for your transformation to ensure that the source data matches the case of your comparison data by using the LOWER() string function.
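For example, a sketch of the reference query (table and column names are hypothetical), folding the name to lower case on the reference side:

-- Lower-case the name so the lookup compares like with like.
SELECT employee_id, LOWER(employee_name) AS employee_name
FROM dbo.DimEmployee;

You would then also lower-case the input column (e.g. with a Derived Column expression such as LOWER(employee_name)) so both sides of the lookup match.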
Set the CacheType property of the lookup transformation to Partial or None.
The lookup comparisons will now be done by SQL Server and not by the SSIS lookup component, and will no longer be case sensitive (assuming the database collation is case-insensitive, which is the SQL Server default).
You have to change both the source and the lookup data so that they are in the same case.
Based on this Microsoft Article:
The lookups performed by the Lookup transformation are case sensitive. To avoid lookup failures that are caused by case differences in data, first use the Character Map transformation to convert the data to uppercase or lowercase. Then, include the UPPER or LOWER functions in the SQL statement that generates the reference table.
To read more about the Character Map transformation, follow this link:
Character Map Transformation