Field-specific errors for ETL - SQL

I am creating an ETL process in MS SQL Server and I would like to have errors specific to a particular column of a particular row. For example, the data is initially loaded from Excel files into a table (we'll call it the Initial table) where all columns are varchar(2000), and then I stage the data to another table (the DataTypedTable) that contains more specific data types (datetime, int, etc.) or more tightly constrained varchar lengths. I need to be able to create error messages for a specific field, such as:
"Jan. 13th" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY
These error messages would need to be stored in some way such that, later in the process, an automated process can create reports from them, with each message referencing a specific row and field (someone will need to go back and correct the data in the source system and resubmit the Excel file). So ideally each message would be inserted into a Failures table of some sort and contain the primary key of the failed row, the column name, and the error message.
Question: I am wondering if this can be accomplished with SSIS, or with some open source tool like Talend, and if so, what would be your general approach? Or what hand-coded approach would you take?
A couple of approaches I've thought of using SQL (up until now I have done ETL by hand in SQL procs, but I want to consider other approaches, possibly even C#):
Use a cursor to read through the Initial table, and for each row insert a blank record with only the primary key into the DataTyped table; then use a single update statement for each column, such that if that update fails I can insert a very specific error message for that column into the error messages table.
Insert all the data as-is into the DataTyped table, but have duplicate columns like SubmissionDate and SubmissionDateOld. After the initial insert the *Old columns have data and the rest are blank, and I have a single update for each column that sets SubmissionDate based on SubmissionDateOld.
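For approach 2 I'm picturing something along these lines (table and column names are just illustrative, and TRY_CONVERT assumes SQL Server 2012 or later), where a value that won't convert leaves the typed column NULL and gets logged instead of failing the whole update:

UPDATE dt
SET SubmissionDate = TRY_CONVERT(datetime, dt.SubmissionDateOld, 101)   -- style 101 = MM/DD/YYYY
FROM DataTypedTable dt;

INSERT INTO Failures (RowId, ColumnName, ErrorMessage)
SELECT dt.Id,
       'SubmissionDate',
       '"' + dt.SubmissionDateOld + '" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY.'
FROM DataTypedTable dt
WHERE dt.SubmissionDate IS NULL
  AND dt.SubmissionDateOld IS NOT NULL;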
In addition to suggesting an approach, I'd like to know if you are using that approach or something similar already in the work you do.

I use the approach where I put a conditional split into the data flow. The records which fail the conditions (invalid date, no data in a required field, etc.) are then sent to an exception table which includes the record identifier, the bad data, and the reason it failed. You can then later create a spreadsheet or text file of errors from this information to send back to the group providing the file. Good records, of course, go to the other path and are inserted into the table.
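The exception table itself can stay very simple; a minimal sketch (all names made up) that covers the record identifier, the offending column, the bad data, and the reason:

CREATE TABLE dbo.LoadExceptions (
    ExceptionId   int IDENTITY(1,1) PRIMARY KEY,
    SourceRowId   int NOT NULL,            -- primary key of the failed row in the Initial table
    ColumnName    sysname NOT NULL,        -- which field failed
    BadValue      varchar(2000) NULL,      -- the value as it arrived
    FailureReason varchar(500) NOT NULL,   -- e.g. the "not a valid date format" message
    LoadedAt      datetime NOT NULL DEFAULT GETDATE()
);

The conditional split's exception path then just maps its columns onto this table.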

How about some cleaning/transformation before loading into the staging tables (what you call Initial tables)? Dump the data from Excel to a tab- or comma-separated file and then use a programming language of your choice to do the data cleansing that you have noted. Also, how big is each data load? You can make use of a multi-threaded or multi-process application to handle major loads (like loading a few million rows at a time). During this process any error you encounter can be loaded into the exception table with identifier, error, and comment details. This technique gives you better control during the data cleaning phase.
If the load is not that high and you want to do most of your work in the database (SQL), then you may want to do as much data profiling as possible and get a good understanding of the data variations you can expect. With that you can use the appropriate component (Talend or SSIS) to do the transformation or control the data flow. Also, by using regular expressions you can catch any value that deviates from the rules you have set.
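For the profiling part, a rough T-SQL sketch (column names are invented; TRY_CONVERT needs SQL Server 2012+) that counts how many staged values would fail each planned conversion or constraint:

SELECT
    SUM(CASE WHEN SubmissionDate IS NOT NULL
              AND TRY_CONVERT(datetime, SubmissionDate, 101) IS NULL THEN 1 ELSE 0 END) AS BadSubmissionDates,
    SUM(CASE WHEN Quantity IS NOT NULL
              AND TRY_CONVERT(int, Quantity) IS NULL THEN 1 ELSE 0 END) AS BadQuantities,
    SUM(CASE WHEN LEN(CustomerName) > 100 THEN 1 ELSE 0 END) AS OverlongNames
FROM InitialTable;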

allowLargeResults in Query job in BigQuery

I'm trying to run a Query job in BigQuery and getting the following error:
Response too large to return. Consider setting allowLargeResults to true in your job configuration
I understand that I need to set allowLargeResults to True in my job configuration, but then I also have to supply a destination table field.
I don't want to insert the results of the query into a specific table; I only want to process them locally.
How can I handle this situation?
I don't want to insert the results of the query into a specific table; I only want to process them locally.
I wanted to clarify something, so that you hopefully feel better about using a destination table:
In reality, any query result ends up in some table!
If the result is smaller than 128 MB, BigQuery creates a temporary table on your behalf (in a special dataset whose name starts with an underscore, so it is not visible in the Web UI dataset/table navigator).
This temporary table is available for 24 hours and is used for query caching, or you can even use it yourself; you just need to find out which table was created. You can find this via the API (the destination table, which, as I said above, exists even if you have not set a specific table), or you can find it in the Web UI.
When the result is bigger than 128 MB, you must set a destination table. The only drawback in your case is that you need to make sure you delete this table once you no longer need it, otherwise you will be paying for its storage.
You can do this either by actually deleting the table, manually (in the UI) or programmatically (via the API), or by setting an expiration on the table (via the API).
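If standard SQL DDL is available to you, setting the expiration can also be done with a query rather than the API; a sketch with placeholder names:

-- Placeholder dataset/table; makes the destination table expire in 24 hours
ALTER TABLE mydataset.query_results
SET OPTIONS (
    expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
);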
First of all, if it says the result is too large, then it is probably greater than 128 MB. You need to make sure that your query is accurate and that you really do need to return that much data. People usually make mistakes in their queries, like join explosions, missing time filters to reduce the data, or missing limits.
Once you are convinced the result really is that large, you need to write it to a table, export the table to GCS, download the files, and then deal with them locally.
https://cloud.google.com/bigquery/docs/exporting-data#exportingmultiple
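If your project can use standard SQL, the write-then-export steps can also be collapsed into a single EXPORT DATA statement (the bucket, dataset, and query below are placeholders; the linked docs describe the classic extract-job route):

EXPORT DATA OPTIONS (
    uri = 'gs://my-bucket/result-*.csv',   -- placeholder bucket; the wildcard lets large results split across files
    format = 'CSV',
    overwrite = true,
    header = true
) AS
SELECT *
FROM mydataset.query_results;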

How to validate millions of records?

How do I validate the following scenario?
Scenario 1:
The source file is a flat file which contains millions of records.
All the data from the source file is loaded into a target table in the database.
Now the question is: how do I validate that all the data was loaded into the target table correctly?
Note: we can't use Excel to validate, as there are millions of records.
There are lots of ways one can validate data. Much of it depends on three things:
How much time do you have for validation?
What are your processing capabilities?
Is the data on a QA or Production SQL server?
If you are in QA and have lots of processing power, you can do basic checks:
Were there any warnings or errors during the data load?
Count the total number of items in the database vs. the raw file
Count the total number of null records in the database
Check the total number of columns vs. the raw file
Check the length of the variables. Are they as expected?
Are any character columns unexpectedly truncated?
Are numeric columns out to the correct number of significant digits?
Are dates reasonable? For example, if you expected dates from 2004, do they say 1970?
How many duplicates are there?
Check whether the data in the columns makes sense. A few questions you can ask: are any rows "shifted"? Are numeric variables in numeric columns? Is the key column actually a key? Do the column names make sense? Your check of null records should help detect these things.
Can you manually calculate any columns and compare your calculation to the one in the file?
If you are low on processing power, or are on a production server and do not want to risk degrading performance for other users, you can do many of the above checks with a simple random sample. Take, say, 100,000 rows at a time, or stratify the sample if needed.
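A few of the cheaper checks above, sketched in T-SQL (table and column names are made up; TABLESAMPLE returns an approximate rather than exact sample):

-- Row count to compare against the line count of the flat file
SELECT COUNT(*) AS loaded_rows FROM dbo.TargetTable;

-- Nulls in a column that should always be populated
SELECT COUNT(*) AS null_keys FROM dbo.TargetTable WHERE BusinessKey IS NULL;

-- Duplicates on what should be the key
SELECT BusinessKey, COUNT(*) AS cnt
FROM dbo.TargetTable
GROUP BY BusinessKey
HAVING COUNT(*) > 1;

-- Rough random sample for the more expensive sanity checks
SELECT * FROM dbo.TargetTable TABLESAMPLE (100000 ROWS);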
These are just a few checks you can do. The more comparisons and sanity checks, the better off you are.
Most importantly, communicate these findings and anything that seems strange to the file owner. They should be able to give you additional insight into whether the data load is correct, or whether they even gave you the right file in the first place.
You're loading the data and providing as many reasonable checks as possible. If they're satisfied with the outcome, and you're satisfied with the outcome, you should consider the data valid.
I think the most complete solution would be to export the table back to a second flat file that should be identical to the first, and then write a script that does a line-by-line diff. That way you will be able to see if even a single row is different.
Given that you are migrating millions of rows of data, I'm assuming that running a script overnight is a reasonable price to pay for data integrity.
For quick validation you can just check that the row counts are the same and that there is no obviously bad data, for example a column mapped wrongly or an entire column being null.
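If the original flat file can also be bulk-loaded into a staging table, a variation on the same idea is to do the comparison inside the database with EXCEPT instead of a file-level diff (table names are hypothetical, and both tables need matching column lists):

-- Rows in the source copy that are missing or different in the target
SELECT * FROM dbo.SourceStaging
EXCEPT
SELECT * FROM dbo.TargetTable;

-- And the reverse direction
SELECT * FROM dbo.TargetTable
EXCEPT
SELECT * FROM dbo.SourceStaging;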
I'm no expert on importing from files, but if I had to solve this issue I would follow something like this:
Load the file into a plain TableA with no restrictions, so the import process runs OK.
Create another TableB with all the validations: types, string lengths, FKs.
Create a stored procedure to move the data from TableA to TableB.
Catch any errors and insert them into another table, Errors, where you record the row_id and an err_description.
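A rough sketch of steps 3 and 4 (object names are placeholders), moving rows one at a time so each failure can be logged against its row_id:

CREATE PROCEDURE dbo.MoveValidatedRows
AS
BEGIN
    DECLARE @row_id int;

    DECLARE row_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT row_id FROM dbo.TableA;

    OPEN row_cursor;
    FETCH NEXT FROM row_cursor INTO @row_id;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            -- the constraints on TableB (types, lengths, FKs) reject bad rows here
            INSERT INTO dbo.TableB (row_id, col1, col2)
            SELECT row_id, col1, col2
            FROM dbo.TableA
            WHERE row_id = @row_id;
        END TRY
        BEGIN CATCH
            INSERT INTO dbo.Errors (row_id, err_description)
            VALUES (@row_id, ERROR_MESSAGE());
        END CATCH;

        FETCH NEXT FROM row_cursor INTO @row_id;
    END;

    CLOSE row_cursor;
    DEALLOCATE row_cursor;
END;

Row-by-row processing is slow for millions of rows, so in practice you might first insert everything that passes a set-based check and only fall back to a loop like this for the leftovers.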

When to use internal tables?

So, I have read that using internal tables increases the performance of the program and that we should perform as few operations on DB tables as possible. But I have started working on a project that does not use internal tables at all.
Some details:
It is a scanner that adds or removes products in/from a store. First the primary key is checked (to see if that type of product exists) and then the product is added or removed. We use INSERT INTO and DELETE FROM to add/remove the products directly in the DB table.
I have not asked why they do not use internal tables because I do not have a better solution so far.
Here is what I have so far: insert all products into an internal table and place the deleted products in another internal table.
FORM update.
  MODIFY zop_db_table FROM TABLE gt_table.    " add all new products
  LOOP AT gt_deleted INTO gs_deleted.
    DELETE FROM zop_db_table WHERE index_nr = gs_deleted-index_nr.
  ENDLOOP.                                    " delete products
ENDFORM.
But when can I perform this update?
I could add a 'Save' button to perform the update, but then there is the risk that the user forgets to save large amounts of data, or drops the scanner and shuts it down, or similar situations. So this is clearly not a good solution.
My final question is: Is there a (good) way to implement internal tables in a project like this?
Internal tables should be used for data processing, like lists or arrays in other languages (C#, Java, ...). From a performance and system-load perspective it is preferable to first load all the data you need into an internal table, then process that internal table instead of loading individual records from the database.
But that is mostly true for reporting, which is probably the most common type of custom ABAP program. You often see developers use SELECT ... ENDSELECT statements that in effect loop over a database table, transferring row after row to the report, one at a time. That is extremely slow compared to reading all records at once into an itab and then looping over the itab. More than once I've cut the execution time of a report down to a fraction just by eliminating round trips to the database.
If you have a good reason to read from the database or update records immediately, you should do so. If you can safely delay updates and deletes to a point in time where you can process all of them together, without risking inconsistencies, I'd consider that an improvement. But if there is a good reason (like consistency or data loss) to update immediately, do it.
Update: as @vwegert mentioned regarding the SELECT ... ENDSELECT statement, the statement doesn't actually create individual database queries for each row. The database interface of the application server optimizes the query, transferring rows in bulk to the application server. From there the records are transferred to the ABAP report one by one (because the report only has a work area to store a single row), which has a significant performance impact, especially for queries with large result sets. A SELECT into an internal table can transport all rows directly to the ABAP report (as long as there is enough memory to hold them), since the internal table can hold all those records in the report.

Is it viable to have a SQL table with only one row and one column?

I'm currently working on my first application that uses a database so I'm very new to this. The database has multiple tables that are what you would expect to normally see.
However, I created one table which only has one row and one column, used to keep a count of the total items processed by the program so that it's available to access elsewhere. I can't just use
SELECT COUNT(*) FROM table_name
because I don't actually want to keep the items I am processing in a table.
It seems like a waste to use a table to store one value, so I am wondering if there is a better way to keep track of it.
What is your table storing? It's storing a kind of processing audit. So make it a little more useful: add a column storing the last datetime the data was processed, a column for how long the processing took, and another column which stores the username (or some other identifier) of whoever ran the process. Now add a row for every table that is processed (there's only one now, but there might be more in future). Try to envisage how your processing is going to grow in future.
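A sketch of that kind of processing-audit table (all names are only suggestions):

CREATE TABLE dbo.ProcessingAudit (
    AuditId         int IDENTITY(1,1) PRIMARY KEY,
    TableName       sysname NOT NULL,                       -- which table / process this row describes
    ItemsProcessed  int NOT NULL,                           -- the count you are tracking today
    ProcessedAt     datetime NOT NULL DEFAULT GETDATE(),
    DurationSeconds int NULL,
    ProcessedBy     nvarchar(128) NOT NULL DEFAULT SUSER_SNAME()
);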

Duplicate rows on OLE DB destination

I have quite a complex scenario where the same package can be run in parallel. In some situations both executions can end up trying to insert the same row into the destination, which causes a primary key violation error.
There is currently a lookup that checks the destination table to see if the record exists, so the insert is done on its "no match" output. It doesn't prevent the error, because the lookup cache is loaded at package start; both packages therefore get the same data in it, and if the same row comes in, both of them will consider it a "new" row, so the first one succeeds and the second one fails.
Is there anything that can be done to avoid this scenario? Pretty much ignore the "duplicate rows" error on the OLE DB destination? I can't use MAX ERROR COUNT, because the duplicate row sits in a batch among other rows that were not in the first package and should still be inserted.
The default lookup behaviour is to employ Full Cache mode. As you have observed, during the package validation stage it will pull all the lookup values into a local memory cache and use that, which results in it missing updates to the table.
For your scenario, I would try changing the cache mode to None (Partial is the other option). None indicates that an actual query should be fired off to the target database for every row that passes through. Depending on your data volume, or with a poorly performing query, that can have a not-insignificant impact on the destination. It still won't guarantee that the parallel instance isn't trying to load the exact same record (or that the parallel run hasn't already satisfied its lookup and is ready to write to the target table), but it should improve the situation.
If you cannot control the package executions to prevent the concurrent data flows from firing, then you should look at re-architecting the approach (write to partitions and swap in, use something to lock resources, stage all the data and use a T-SQL merge, etc.).
Just a thought: how about writing the new records to a temp table and merging it into the destination intermittently? This will give you an opportunity to filter out duplicates.
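A sketch of that idea in T-SQL (all names are hypothetical): the data flow lands rows in a staging table, and a MERGE with a HOLDLOCK hint inserts only the keys that are not already in the destination, so two concurrent runs racing on the same key don't both insert it:

MERGE dbo.TargetTable WITH (HOLDLOCK) AS t
USING (SELECT DISTINCT KeyCol, Col1, Col2 FROM dbo.StagingTable) AS s
    ON t.KeyCol = s.KeyCol
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyCol, Col1, Col2)
    VALUES (s.KeyCol, s.Col1, s.Col2);

The DISTINCT only guards against fully duplicated staging rows; if the same key can arrive with different values, you would need to pick one (for example with ROW_NUMBER) before merging.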