I have a table with a non-clustered index on a varchar column 'A'.
When I use an ORDER BY A clause I can see it scans the index and gives me the result in a few seconds.
But when I use the Sort component of SSIS on column 'A', it takes minutes to sort the records.
So I understand that it does not recognize my non-clustered index.
Does anyone have any idea how to get SSIS to use indexes, other than replacing components with queries?
ORDER BY A is run in the database.
When you use a Sort component, the sort is done in the SSIS runtime. Note that the query you use to feed the sort does not have an ORDER BY in it (I assume).
It's done in the runtime because it is data-source agnostic - your source could be Excel, a text file, an in-memory dataset, a Multicast, a Pivot, or anything else.
My advice is to use the database as much as possible.
The only reason to use a Sort in an SSIS package is if your source doesn't support sorting (e.g. a flat file) and you want to do a Merge Join in your package against something else, which is a very rare and specific case.
From researching and working with SSIS, I found that the only way to use indexes is in the query that connects to the database. Once you fetch your data into the data flow, all you have are rows of data - no indexes!
So for tasks like Merge Join, which normally need a Sort component in front of them, I tried using a Lookup component with the full cache option instead, caching the whole data set, and putting the ORDER BY in the Source component query. See the sketch below.
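For example, a minimal sketch of pushing the sort into the source query (the table and column names here are made up); you would then mark the source output as sorted (the IsSorted and SortKeyPosition properties in the advanced editor) so that a downstream Merge Join accepts it:
-- Sorting in the database lets the engine use the non-clustered index on A
SELECT A, B, C
FROM dbo.MyTable
ORDER BY A;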
31 Days of SSIS – What The Sorts:
Whether there are one hundred rows or ten million rows – all of the rows have to be consumed by the Sort Transformation before it can return the first row. This potentially places all of the data for the data flow path in memory. And the "potentially" bit is because, if there is enough data, it will spill out of memory to disk.
In the image to the right you can see that, until all ten million rows are received, data after that point in the Data Flow cannot be processed.
This behavior should be expected if you consider what the transformation needs to do. Before the first row can be sent along, the last row needs to be checked to make sure that it is not the first row.
For small and narrow datasets this is not an issue, but if your datasets are large or wide you can run into performance issues with packages that have sorts in them. Loading and sorting all of the data in memory can be a serious performance hog.
How to validate the scenario?
Scenario 1:
The source file is a flat file which contains millions of records.
All the data from the source file is loaded into a target table in the database.
Now the question is how to validate that all the data is loaded into the target table correctly?
Note: we can't use xls to validate, as we have millions of records.
There are lots of ways one can validate data. Much of it depends on three things:
How much time do you have for validation?
What are your processing capabilities?
Is the data on a QA or Production SQL server?
If you are in QA and have lots of processing power, you can do basic checks:
Were there any warnings or errors during the data load?
Count the total number of items in the database vs. the raw file
Count the total number of null records in the database
Check the total number of columns vs. the raw file
Check the length of the variables. Are they as expected?
Are any character columns unexpectedly truncated?
Are numeric columns out to the correct number of significant digits?
Are dates reasonable? For example, if you expected dates from 2004, do they say 1970?
How many duplicates are there?
Check if the data in the columns make sense. A few questions you can ask: are any rows "shifted?" Are numeric variables in numeric columns? Is the key column actually a key? Do the column names make sense? Your check of null records should help detect these things.
Can you manually calculate any columns and compare your calculation to the one in the file?
If you are low on processing power or are on a production server and do not want to risk degrading performance for other users, you can do many of the above checks with a simple random sample. Take, say, 100,000 rows at a time, or stratify the sample if needed.
These are just a few checks you can do. The more comparisons and sanity checks, the better off you are.
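A rough T-SQL sketch of a few of these checks (the table and column names are made up here and would need to match your target):
-- Row count to compare against the count reported for the raw file
SELECT COUNT(*) AS loaded_rows FROM dbo.TargetTable;

-- Null counts for critical columns
SELECT
    SUM(CASE WHEN CustomerId IS NULL THEN 1 ELSE 0 END) AS null_customer_ids,
    SUM(CASE WHEN OrderDate IS NULL THEN 1 ELSE 0 END) AS null_order_dates
FROM dbo.TargetTable;

-- Duplicates on the supposed key
SELECT CustomerId, COUNT(*) AS cnt
FROM dbo.TargetTable
GROUP BY CustomerId
HAVING COUNT(*) > 1;

-- Date sanity check
SELECT MIN(OrderDate) AS min_date, MAX(OrderDate) AS max_date
FROM dbo.TargetTable;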
Most importantly, communicate these findings, and anything that seems strange, to the file owner. They should be able to give you additional insight into whether the data load is correct, or whether they even gave you the right file in the first place.
You're loading the data and providing as many reasonable checks as possible. If they're satisfied with the outcome, and you're satisfied with the outcome, you should consider the data valid.
I think the most complete solution would be to export the table back to a second flat file that should be identical to the first, and then write a script that does a line-by-line diff. That way you will see if even a single row is different.
Given that you are migrating millions of rows of data, I'm assuming that running a script overnight is not a huge deal compared to data integrity.
For quick validation you can just check that the row counts are the same and that there is no obviously bad data, for example a column mapped wrong or an entire column being null.
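If you would rather keep the comparison inside the database instead of diffing files, a rough alternative sketch (the file path, table names, and load options are all made up here) is to bulk-load the original file into a staging table with the same layout and compare the sets with EXCEPT:
BULK INSERT dbo.StagingTable
FROM 'C:\loads\source_file.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- Rows in the file that are missing or different in the target
SELECT * FROM dbo.StagingTable
EXCEPT
SELECT * FROM dbo.TargetTable;

-- Rows in the target that are missing or different in the file
SELECT * FROM dbo.TargetTable
EXCEPT
SELECT * FROM dbo.StagingTable;

Both EXCEPT queries coming back empty is roughly equivalent to a clean line-by-line diff (assuming no duplicate rows).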
I'm not an expert on importing from files, but if I had to solve this issue I would do something like this (a sketch follows the list):
Load the file into a plain TableA with no restrictions, so the import process runs OK.
Create another TableB with all the validation: types, string lengths, FKs.
Create a stored procedure to move the data from TableA to TableB.
Include error catching, and insert into another table, Errors, where you record the row_id and an error description.
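A rough T-SQL sketch of that procedure (all object and column names are made up; row-by-row processing is shown only so each failing row can be logged with its own error message):
CREATE TABLE dbo.Errors
(
    row_id          INT,
    err_description NVARCHAR(4000)
);
GO

CREATE PROCEDURE dbo.MoveTableAToTableB
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @row_id INT;
    DECLARE row_cursor CURSOR FAST_FORWARD FOR SELECT row_id FROM dbo.TableA;

    OPEN row_cursor;
    FETCH NEXT FROM row_cursor INTO @row_id;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            -- Fails if the row violates TableB's types, lengths, or foreign keys
            INSERT INTO dbo.TableB (row_id, col1, col2)
            SELECT row_id, col1, col2
            FROM dbo.TableA
            WHERE row_id = @row_id;
        END TRY
        BEGIN CATCH
            INSERT INTO dbo.Errors (row_id, err_description)
            VALUES (@row_id, ERROR_MESSAGE());
        END CATCH;

        FETCH NEXT FROM row_cursor INTO @row_id;
    END;

    CLOSE row_cursor;
    DEALLOCATE row_cursor;
END;

For millions of rows you would probably move the clean rows in one set-based INSERT first and only fall back to row-by-row handling for a batch that fails.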
I have to implement data collection for replay of electrical parameters for hundreds to thousands of devices, with at least 20 parameters to monitor. This amounts to a huge amount of data, as it is essentially a time series. I have to support a resolution of 1 second, so thinking about 1 year: 365*24*60*60*1000 = 31,536,000,000 rows.
I did my research but still have a few questions:
1) As the data will be huge, is it good to keep the data in one table, should the tables be split (the data structure is the same), or should I rely on indexes?
2) Data inserts will also be very frequent, but I can batch them. Still, what is the best way? Writing directly to the same database, or using a temporary database for writes and syncing it across?
3) Does SQL Server have a specific schema recommendation for optimizing selects, updates and inserts on time series? Is there any out-of-the-box help for day averages or other common aggregate functions? I can write my own, but since this is a standard problem there might be some best practices and samples out of the box.
Please let me know; any help is appreciated. Thanks in advance.
1) You probably want to explore the use of partitions. This allows very efficient inserts (a partition switch is a metadata operation if you do the partitioning correctly) and very fast loading, which also covers (2). You may also want to explore columnstore indexes, because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore have a learning curve, but they are very doable. There is lots of code on the internet describing the use of date functions in SQL Server.
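A rough sketch of monthly partitioning plus a clustered columnstore index (object names, boundary dates, and columns are made up; columnstore requires a reasonably recent SQL Server edition):
CREATE PARTITION FUNCTION pfReadingTime (datetime2(0))
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');  -- one boundary per month

CREATE PARTITION SCHEME psReadingTime
AS PARTITION pfReadingTime ALL TO ([PRIMARY]);

CREATE TABLE dbo.DeviceReading
(
    ReadingTime datetime2(0)  NOT NULL,
    DeviceId    int           NOT NULL,
    Parameter1  decimal(18,4) NULL,
    Parameter2  decimal(18,4) NULL
) ON psReadingTime (ReadingTime);

CREATE CLUSTERED COLUMNSTORE INDEX cci_DeviceReading ON dbo.DeviceReading;

-- Day-average style aggregate that columnstore handles well
SELECT DeviceId, CAST(ReadingTime AS date) AS ReadingDay, AVG(Parameter1) AS AvgParameter1
FROM dbo.DeviceReading
GROUP BY DeviceId, CAST(ReadingTime AS date);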
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables, it is still the same amount of data.
Do you ever need to search across devices? If not, you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the PK is (date, device) then fine, but if you can get two readings in the same second you cannot do that. If that is the PK, then load the data in that sort order if you can, even if you have to stage each second and load it. You just cannot afford to fragment a table that big. If you cannot load in sorted order, then use a fill factor of 50%.
If you cannot have a PK, then just use date as the clustered index (but not as a PK) and put a non-clustered index on device.
I have some tables of 3,000,000,000 rows and I have the luxury of loading by the PK with no other indexes. There is no measurable degradation in insert speed from row 1 to row 3,000,000,000.
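A minimal sketch of that layout (table and column names are assumptions):
CREATE TABLE dbo.Readings
(
    ReadingTime datetime2(0)  NOT NULL,
    DeviceId    int           NOT NULL,
    Value       decimal(18,4) NULL
);

-- Clustered on the time column the data arrives in; the 50% fill factor is only
-- for the case where you cannot load in sorted order
CREATE CLUSTERED INDEX cx_Readings_Time
ON dbo.Readings (ReadingTime, DeviceId)
WITH (FILLFACTOR = 50);

CREATE NONCLUSTERED INDEX ix_Readings_Device
ON dbo.Readings (DeviceId);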
I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
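For the simpler case where the whole load gets one timestamp, a rough sketch of the copy query described above (dataset and table names are placeholders, and the column list stands in for your real fields):
SELECT
  col_a,
  col_b,
  CURRENT_TIMESTAMP() AS load_time
FROM [my_dataset.temp_table]
As noted above, you would run this with allow_large_results and a destination table set, then drop the temp table.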
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery, which may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
I'm currently trying to optimize my program. I have a large database which consists of data which are timestamped. The data I need to update is only data for the current day, so I don't want to search the entire database more than once to find only the entries of today. Is there a way to select something and then use it later in several different (MERGE INTO) commands?
I want to select all of today's data, then run a while loop (in Java) over every single entry for today, updating them all. So is this even possible, or do I have to traverse the entire database for each while-loop iteration?
If you are optimizing your program and your data is timestamped, then the first thing you can do is create an index on the timestamp field. This will reduce your query execution time, because your filter criteria relate to that timestamp field.
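A minimal T-SQL-flavoured sketch (table and column names are assumptions; adapt the date functions to your DBMS), with the filter on "today" written as a range so the index can actually be used:
CREATE INDEX ix_Events_EventTime ON dbo.Events (EventTime);

-- Half-open range over today keeps the predicate index-friendly
SELECT Id, Payload
FROM dbo.Events
WHERE EventTime >= CAST(GETDATE() AS date)
  AND EventTime <  DATEADD(day, 1, CAST(GETDATE() AS date));

If the Java loop only issues updates, a single set-based UPDATE or MERGE restricted to that same range would also avoid touching the table once per row.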
Use a proper data caching technology, like memcached in order to minimize database hits for read-heavy, slowly changing data.
How much of a performance benefit is there from selecting only the required fields in a query instead of querying the entire row? For example, if I have a row of 10 fields but only need 5 fields in the display, is it worth querying only those 5? What is the performance benefit of this limitation vs. the risk of having to go back and add fields to the SQL query later if needed?
It's not just the extra data aspect that you need to consider. Selecting all columns will negate the usefulness of covering indexes, since a bookmark lookup into the clustered index (or table) will be required.
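A small sketch of the covering-index point (table, index, and column names are made up):
-- This index covers the query below; a SELECT * here would force a lookup
-- back into the clustered index for the remaining columns
CREATE NONCLUSTERED INDEX ix_Orders_CustomerId
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, TotalAmount);

SELECT CustomerId, OrderDate, TotalAmount
FROM dbo.Orders
WHERE CustomerId = 42;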
It depends on how many rows are selected and how much memory those extra fields consume. It can run much slower if several text/blob fields are present, for example, or if many rows are selected.
How is adding fields later a risk? Modifying queries to fit changing requirements is a natural part of the development process.
The only benefit I know of to explicitly naming your columns in your SELECT statement is that if a column your code uses gets renamed, your SELECT statement will fail before your code does. Even better, if your SELECT statement is within a proc, the proc and the DB script will not compile. This is very handy if you are using tools like VS DB edition to compile/verify DB scripts.
Otherwise the performance difference would be negligible.
The number of fields retrieved is a second order effect on performance relative to the large overhead of the SQL request itself -- going out of process, across the network to another host, and possibly to disk on that host takes many more cycles than shoveling a few extra bytes of data.
Obviously if the extra fields include a megabyte blob the equation is skewed. But my experience is that the transaction overhead is of the same order as, or larger than, the actual data retrieved. I remember vaguely from many years ago that an "empty" NOP TNS request is about 100 bytes on the wire.
If the SQL server is not the same machine you're querying from, then selecting the extra columns transfers more data over the network (which can be a bottleneck), not to mention that it has to read more data from disk and allocate more memory to hold the results.
There's not one thing that would cause a problem by itself, but add things up and they all together cause performance issues. Every little bit helps when you have lots of either queries or data.
The risk I guess would be that you have to add the fields to the query later which possibly means changing code, but then you generally have to add more code to handle extra fields anyway.