How to validate millions of rows of data? - SQL

How to validate the scenario?
Scenario 1:
The source is a flat file which contains millions of rows.
All the data from the source file is loaded into a target table in the database.
Now the question is: how do we validate that all the data was loaded into the target table correctly?
Note: we can't use Excel to validate, as the file has millions of records in it.

There are lots of ways one can validate data. Much of it depends on three things:
How much time do you have for validation?
What are your processing capabilities?
Is the data on a QA or Production SQL server?
If you are in QA and have lots of processing power, you can do basic checks:
Were there any warnings or errors during the data load?
Count the total number of items in the database vs. the raw file
Count the total number of null records in the database
Check the total number of columns vs. the raw file
Check the length of the variables. Are they as expected?
Are any character columns unexpectedly truncated?
Are numeric columns out to the correct number of significant digits?
Are dates reasonable? For example, if you expected dates from 2004, do they say 1970?
How many duplicates are there?
Check if the data in the columns make sense. A few questions you can ask: are any rows "shifted?" Are numeric variables in numeric columns? Is the key column actually a key? Do the column names make sense? Your check of null records should help detect these things.
Can you manually calculate any columns and compare your calculation to the one in the file?
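As a rough sketch of the count-style checks above, assuming a hypothetical SQL Server target table dbo.target_table with a key column id and a character column customer_name (the names are placeholders, not from the original question):
-- Total rows loaded; compare against the line count of the raw file
SELECT COUNT(*) AS total_rows FROM dbo.target_table;
-- Rows where a required column arrived as NULL
SELECT COUNT(*) AS null_names FROM dbo.target_table WHERE customer_name IS NULL;
-- Duplicate keys
SELECT id, COUNT(*) AS copies FROM dbo.target_table GROUP BY id HAVING COUNT(*) > 1;
-- Longest stored value, to spot unexpected truncation
SELECT MAX(LEN(customer_name)) AS max_name_length FROM dbo.target_table;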
If you are low on processing power or are on a production server and do not want to risk degrading performance for other users, you can do many of the above checks with a simple random sample. Take, say, 100,000 rows at a time, or stratify the sample if needed.
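If the target is SQL Server, one way to pull such a sample is TABLESAMPLE; again the table name is a placeholder:
-- Roughly 100,000 rows, chosen at the page level, so results vary between runs
SELECT * FROM dbo.target_table TABLESAMPLE (100000 ROWS);
-- Add REPEATABLE with a seed if you want the same sample each time
SELECT * FROM dbo.target_table TABLESAMPLE (100000 ROWS) REPEATABLE (42);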
These are just a few checks you can do. The more comparisons and sanity checks, the better off you are.
Most importantly, communicate these findings and anything that seems strange to the file owner. They should be able to give you additional insight into whether the data load is correct, or whether they even gave you the right file in the first place.
You're loading the data and providing as many reasonable checks as possible. If they're satisfied with the outcome, and you're satisfied with the outcome, you should consider the data valid.

I think the most complete solution would be to export the table back to a second flat file that should be identical to the first, and then write a script that does a line-by-line diff. You will be able to see if even a single row is different.
Given that you are migrating millions of rows of data, I'm assuming that running a script overnight is a small price to pay for data integrity.
For quick validation you can just check that the row counts are the same and that there is no obviously bad data, for example a column mapped wrong or an entire column that is null.
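A set-based variant of the same idea, assuming you can stage the original file into a table alongside the loaded target (both names below are placeholders and the column lists must match), is to compare the two with EXCEPT:
-- Rows in the staged file that are missing or different in the target
SELECT * FROM dbo.source_stage
EXCEPT
SELECT * FROM dbo.target_table;
-- Rows in the target that are not in the staged file
SELECT * FROM dbo.target_table
EXCEPT
SELECT * FROM dbo.source_stage;
If both queries return zero rows the two sets match; note that EXCEPT compares distinct rows, so pair it with the row-count check to catch duplicates.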

I'm no expert on importing from files, but if I had to solve this issue I would do something like the following:
Load the file into a plain TableA with no restrictions, so the import process runs without errors.
Create another TableB with all the validations: data types, string lengths, foreign keys.
Create a stored procedure to move the data from TableA to TableB, as sketched below.
Include error handling that, on failure, inserts the row_id and an error description into a separate Errors table.
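A minimal T-SQL sketch of that stored procedure, assuming TableA has a row_id column plus placeholder columns col1 and col2; row-by-row processing is slow for millions of rows, but it is what gives you a per-row error record:
CREATE PROCEDURE dbo.MoveTableAToTableB
AS
BEGIN
    DECLARE @row_id INT, @col1 VARCHAR(2000), @col2 VARCHAR(2000);
    DECLARE rows_cur CURSOR FOR SELECT row_id, col1, col2 FROM dbo.TableA;
    OPEN rows_cur;
    FETCH NEXT FROM rows_cur INTO @row_id, @col1, @col2;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            -- TableB carries the real types and constraints, so bad rows fail here
            INSERT INTO dbo.TableB (row_id, col1, col2)
            VALUES (@row_id, @col1, @col2);
        END TRY
        BEGIN CATCH
            -- Record which row failed and why, then keep going
            INSERT INTO dbo.Errors (row_id, err_description)
            VALUES (@row_id, ERROR_MESSAGE());
        END CATCH;
        FETCH NEXT FROM rows_cur INTO @row_id, @col1, @col2;
    END;
    CLOSE rows_cur;
    DEALLOCATE rows_cur;
END;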

Related

Best practice to update bulk data in table used for reporting in SQL

I have created a table for reporting purposes where I am storing data across about 50 columns, and at some time interval my scheduler executes a service which processes other tables and fills up the data in my flat table.
Currently I am deleting and re-inserting the data in that table, but I want to know whether this is good practice, or whether I should check every column in every row, update it if any change is found, and insert a new record if the data does not exist.
FYI, the total number of rows being reinserted is 100k+.
This is a very broad question that can only really be answered with access to your environment and discussion on your personal requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
The information you need to understand to be able to do this are the types of table updates available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to incrementally update as required, you can then look at the why and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows is not large and probably does not need this level of processing just yet, though that assessment will depend on how complex and time-consuming your current process is and how long you have to run it.
It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows lies heavily in the logging and data-movement operations at the data-page level. In addition, you need to bring the old and new data together to do the comparison.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
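A minimal sketch of the replace approach, with placeholder table and column names; if readers must never see an empty table, wrap it in a transaction or load a staging copy and swap it in:
-- Truncate is minimally logged, then reload in one set-based insert
TRUNCATE TABLE dbo.reporting_flat;
INSERT INTO dbo.reporting_flat (col1, col2, col3)
SELECT s.col1, s.col2, s.col3
FROM dbo.source_detail AS s;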
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then an update is going to be faster. Or, if you are adding a new layer of data each day, then just an insert, plus a handful of updates to changed historical values, might be a fine approach.

SSIS does not recognize Indexes?

I have a table with a non-clustered index on a varchar column 'A'.
When I use an ORDER BY A clause, I can see it scans the index and gives me the result in a few seconds.
But when I use the Sort component of SSIS on column 'A', it takes minutes to sort the records.
So I understand that it does not recognize my non-clustered index.
Does anyone have any idea how to use indexes with SSIS, without resorting to queries instead of components?
ORDER BY A is run in the database.
When using a Sort component, the sort is done in the SSIS runtime. Note that the query you use to feed the sort does not have an ORDER BY in it (I assume).
It's done in the runtime because it is data-source agnostic: your source could be Excel or a text file or an in-memory dataset or a Multicast or Pivot or anything.
My advice is to use the database as much as possible.
The only reason to use a Sort in an SSIS package is if your source doesn't support sorting (e.g. a flat file) and you want to do a Merge Join in your package to something else. That is a very rare and specific case.
From my research and my recent work with SSIS, I found out that the only way to use indexes is through the connection to the database. Once you fetch your data into the data flow, all you have are records and data: no indexes!
So for tasks like Merge Join, which need a Sort component in front of them, I tried using a Lookup component with the full cache option instead, caching the whole dataset, and using ORDER BY in the Source component query.
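For the Merge Join case, the usual workaround is to push the sort into the source query (so the database can use the index on A) and then tell SSIS the data is already ordered; a sketch, with placeholder object names:
-- Source component query: let the database do the ordered index scan
SELECT A, B, C
FROM dbo.MyTable
ORDER BY A;
-- Then, in the source's Advanced Editor (Input and Output Properties),
-- set IsSorted = True on the output and SortKeyPosition = 1 on column A,
-- so a downstream Merge Join accepts the input without a Sort component.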
31 Days of SSIS – What The Sorts:
Whether there are one hundred rows or ten million rows, all of the rows have to be consumed by the Sort Transformation before it can return the first row. This potentially places all of the data for the data flow path in memory. And the "potentially" bit is because, if there is enough data, it will spill out of memory.
Until all ten million rows have been received, the data after that point in the Data Flow cannot be processed.
This behavior should be expected if you consider what the transformation needs to do. Before the first row can be sent along, the last row needs to be checked to make sure that it is not the first row.
For small and narrow datasets this is not an issue. But if your datasets are large or wide, you can run into performance issues with packages that have sorts within them. Loading and sorting all of the data in memory can be a serious performance hog.

How to seed database table with phone numbers using SQL?

I have a table in my Postgres database that I'm using to store phone numbers formatted as 12223334444 (as a varchar).
As I know I will be working with US phone numbers only, I thought it would be a good idea to pre-populate the database with all the phone numbers that could be requested. That is, all numbers from 11_111_111_111 through 19_999_999_999.
Right now I'm achieving this by using application code and it takes a VERY long time.
Assuming I have a table named phones and a single column named digits, is there a way to populate the database using SQL?
Thank you!
This is a bad idea.
select 19999999999 - 11111111111;
8888888888
That's about 8.9 billion phone numbers. Don't build tables that big unless you absolutely have to. Tables that big severely affect transaction logs, index size, backup size and time to complete, etc.
But, if you need to generate and load a lot of data like this for PostgreSQL, there are two sensible ways to do it.
Steps depend on whether you generate the data using PostgreSQL or using application code.
For PostgreSQL, in a transaction,
drop all the indexes and constraints,
generate and insert the data
create all the indexes and constraints, and
commit the transaction.
Inserting great amounts of data is a lot faster if you commit, say, 10k rows at a time. Experiment. If you do that, you'll need to adjust the transaction boundaries in those steps above. (Each of those steps becomes one or more transactions.)
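A sketch of the single-transaction version of those steps, with a placeholder index name and the range from the question (split the insert into batches per the note above if you need smaller commits):
BEGIN;
-- Drop secondary structures so the bulk insert is not slowed down
DROP INDEX IF EXISTS phones_digits_idx;
-- Generate and insert the data
INSERT INTO phones (digits)
SELECT g::varchar
FROM generate_series(11111111111, 19999999999) AS g;
-- Rebuild the index once the data is in place
CREATE INDEX phones_digits_idx ON phones (digits);
COMMIT;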
If you go with application code, it's usually fastest to
generate a csv file using application code,
drop all the indexes and constraints,
load the csv file with COPY,
create all the indexes and constraints.
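For the COPY step, a sketch; the file path is hypothetical, must be readable by the server process, and \copy in psql is the client-side alternative:
-- Server-side bulk load from a one-column CSV of phone numbers
COPY phones (digits) FROM '/tmp/phones.csv' WITH (FORMAT csv);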
This is a crazy question, but Postgres has an answer:
SELECT generate_series(11111111111, 19999999999);
This generates a row for every number in the requested range.
INSERT INTO phones (digits)
SELECT g::varchar FROM generate_series(11111111111, 19999999999) AS g;
This will still take a long time, but it will probably be faster than application code.

Field specific errors for ETL

I am creating an ETL process in MS SQL Server and I would like to have errors specific to a particular column of a particular row. For example, the data is initially loaded from Excel files into a table (we'll call it the Initial table) where all columns are varchar(2000), and then I stage the data to another table (the DataTypedTable) that contains more specific data types (datetime, int, etc.) or more tightly constrained varchar lengths. I need to be able to create error messages for a specific field, such as:
"Jan. 13th" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY
These error messages would need to be stored in some way such that, later in the process, an automated process can create reports with the error messages, with each message referencing a specific row and field (someone will need to go back and correct the data in the source system and resubmit the Excel file). So ideally it would be inserted into a Failures table of some sort and contain the primary key of the failed row, the column name, and the error message.
Question: So I am wondering if this can be accomplished with SSIS, or some open source tool like Talend, and if so, what would be your general approach? Or what hand coded approach you would take?
A couple of approaches I've thought of using SQL (up until now I have done ETL by hand in SQL procs, but I want to consider other approaches, possibly even C#):
Use a cursor to read through the Initial table, and for each row insert a blank record with only the primary key into the DataTyped table, then use a single update statement for each column, such that if that update fails I can insert an error message specific to that column into the error messages table.
Insert all the data as is into the DataTyped table, but have duplicate columns like SubmissionDate and SubmissionDateOld. After the initial insert the *Old columns have data, the rest are blank, and I have a single update for each column that sets the SubmissionDate based on the SubmissionDateOld.
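A sketch of that second approach on SQL Server 2012 or later, where TRY_CONVERT is available; the table, column and Failures-table names are placeholders built from the example above:
-- Convert where possible; anything unconvertible stays NULL and gets logged
UPDATE d
SET SubmissionDate = TRY_CONVERT(datetime, d.SubmissionDateOld, 101)
FROM dbo.DataTypedTable AS d;

INSERT INTO dbo.Failures (row_id, column_name, error_message)
SELECT d.row_id,
       'SubmissionDate',
       '"' + d.SubmissionDateOld + '" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY'
FROM dbo.DataTypedTable AS d
WHERE d.SubmissionDateOld IS NOT NULL
  AND TRY_CONVERT(datetime, d.SubmissionDateOld, 101) IS NULL;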
In addition to suggesting an approach, I'd like to know if you are using that approach or something similar already in the work you do.
I use the approach where I put a conditional split into the data flow. The records which fail the conditions (invalid date, no data in a required field, etc.) are then sent to an exception table which includes the record identifier, the bad data, and the reason it failed. You can then later create a spreadsheet or text file of errors from this information to send back to the group providing the file. Good records, of course, go to the other path and are inserted into the table.
How about some cleaning/transformation before loading into the staging tables (what you call the Initial tables)? Dump the data from Excel to a tab- or comma-separated file and then use a programming language of your choice to do the data cleansing you have noted. Also, how big is each data load? You can make use of a multi-threaded or multi-process application to handle major loads (like loading a few million rows at a time). During this process any error you encounter can be loaded into the exception table with identifier, error and comment details. This technique gives you better control during the data-cleansing phase.
If the load is not that high and you want to do most of your work in the database (SQL), then you may want to do as much data profiling as possible and get a good understanding of the data variations you can expect. With that you can use the appropriate component (Talend or SSIS) to do the transformation or control the data flow. Also, by using regular expressions you can catch any value that deviates from the set rules.

How to find number of rows inserted/deleted in MySQL

Is there a way to find out the number of rows inserted/deleted in a table in MySQL? Is this kind of statistics kept somewhere in the database? If not, what would be the best way to implement something to keep track of these statistics?
When I say how many, I mean within a certain period (last 24 hours, or since server was up, or last week etc)
When I need to keep track of deleted things, I just don't delete.
I change a column value that excludes it from normal user results.
If space is an issue, you can set the contents you no longer care about to empty.
For inserts, you can use COUNT().
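A sketch of that pattern in MySQL, assuming a hypothetical orders table with created_at and deleted_at timestamp columns:
-- "Delete" by flagging instead of removing the row
UPDATE orders SET deleted_at = NOW() WHERE id = 42;
-- Rows flagged as deleted in the last 24 hours
SELECT COUNT(*) FROM orders WHERE deleted_at >= NOW() - INTERVAL 1 DAY;
-- Rows inserted in the last 24 hours
SELECT COUNT(*) FROM orders WHERE created_at >= NOW() - INTERVAL 1 DAY;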
The Binary Log contains records of all queries that update or insert data. I don't know if it stores the number of affected rows, however.
There is also a General Query Log, which tracks all queries that were run.
(Information current as of MySQL 5.0. If you're using an older version, YMMV.)
If I want to handle logging my SQL queries, I have two possibilities:
Turning the MySQL log function on
Writing my own 'trace' class
I prefer doing number 2.
Why?
Because it is more controllable. You can easily distinguish between INSERT, DELETE, UPDATE and other queries.
But that is not the only advantage of your own trace class: creating trace files (so-called "logs") makes administrative tasks much easier.
You can structure the trace output, put it into a separate database, store it into some XML or JSON file.
You can order things as you want them to be.