I am working on an SSIS solution for a data warehouse, extracting surrogate keys for the corresponding application keys. I am using the SSIS Lookup task, but the problem with this task is that it caches the complete lookup table in memory, and my lookup table is huge: about 20 million records. Can you suggest some ways to optimize this, or alternatives to the Lookup task?
I do not consider a table with 20 million records too huge for a lookup. By filtering, and by selecting only the required columns in the lookup, you can optimize it to use a small amount of memory.
For example, if the lookup needs an int key column and a varchar column of size 10, a record takes about 4 + 10 bytes, and 20 million records come to 20M x (4 + 10) ≈ 280 MB, which cannot be considered too high.
Still, if you want to reduce memory usage, you will have to use joins.
Do a LEFT JOIN with your Lookup data when you bring the data into the SSIS package and then evaluate what you need to.
If the lookup table is in a different source, then you can do a LEFT JOIN in SSIS, but that is going to cache rows as well. I think that the JOIN may be marginally faster than a Lookup.
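As a sketch, the join can be pushed into the source query instead of a Lookup; the table and column names here (FactStaging, DimCustomer, AppKey, CustomerSK) are assumptions, not your actual schema:
SELECT s.AppKey,
       s.LoadDate,
       d.CustomerSK          -- surrogate key resolved by the join
FROM dbo.FactStaging AS s
LEFT JOIN dbo.DimCustomer AS d
    ON d.AppKey = s.AppKey;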
Do you have to scan the whole table? That is, can you specify your lookup as a database view on the table, or even as the results of a SQL query (the "Use results of an SQL query" option)?
Make sure that you pick only the columns you need in the lookup table; do not cache columns which are not needed. Also find some time to take a look at Microsoft's "Project REAL", which uses SSIS in high data-volume applications and discusses best practices.
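For example, the Lookup's query could cache just the two columns it actually needs (again with assumed names, and an optional filter if only current dimension rows are relevant):
SELECT AppKey,        -- application key used for matching
       CustomerSK     -- surrogate key returned to the data flow
FROM dbo.DimCustomer
WHERE IsCurrent = 1;  -- hypothetical filter to shrink the cached set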
I have more than 100 tables in Redshift that I'd like to UNION to create one consolidated table. I can't hardcode this query because the list of tables will grow quite quickly. So I want to be able to achieve a process wherein I'm able to write something like, "UNION all tables where the table name contains 'orders'".
What's the best way to do this in Redshift? I'm open to using third party tools/languages to do this if needed, but if possible to do within Redshift, that would be ideal.
I don't think this can be done inside Redshift; I'll let someone with a bright idea chime in, but I don't think there is a way.
So you will need an external system to compose the query for you. The table names can be found in the Redshift catalogs, and the query can be composed with a templating system like Jinja2, which can loop over a list of tables and build the UNION ALL SQL for you; it runs stand-alone or as a Python library. Alternatively, you can have a process (such as a Lambda) that builds a view over all your tables, and your query just accesses the view.
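As a sketch, the external process could list the matching tables from the catalog and then emit a view of the kind shown below (the schema name, the 'orders' pattern and the individual table names are assumptions):
-- find the tables to include
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name LIKE '%orders%';

-- statement the process would generate and run
CREATE OR REPLACE VIEW public.all_orders AS
SELECT * FROM public.orders_2023_01
UNION ALL
SELECT * FROM public.orders_2023_02;
-- ...one SELECT per matching table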
Now let's talk about why you shouldn't be doing this. First off, Redshift is designed to be efficient on large tables. The storage block size is 1 MB, and tables of fewer than a few million rows can be stored quite inefficiently. A table of 10,000 rows can use less than 1% of its storage space for actual data, so reading these tables carries a high overhead, and if you need to scan hundreds of them you can spend all your time reading barely used blocks. This is inefficient not only in execution time but also in disk storage. You could be heading for big problems on this path.
Also, the Redshift query compiler has limits on segments and parts in the query. Unioning all these tables will hit these limits and fail as you move forward and add tables. Defining a process that will break one day is not likely where you want to be.
I have a table with over 300 million records containing only a key, a source and a hash value. The application's built-in SQL runs a large IN predicate on the hash value to fetch the data. The SQL is performing slowly, so I need suggestions on how to improve its performance. I cannot change the SQL, as it is built into the application. So far I have tried putting an index on the key column and another on the hash column, but that doesn't help much.
Off the top of my head, you could create a table, perhaps a temporary table (which can also accept indices), containing the values in the IN predicate of your current query. Then, a simple inner join of your original query to this table would have the same effect, except it might be substantially faster if DB2 can take advantage of the index. You would need to make sure that the columns involved in the join both have an index.
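A minimal sketch of that idea in DB2, assuming a hypothetical main table big_table(key_col, source_col, hash_val) and that the IN-list values can be loaded into a session temporary table:
-- session-scoped temporary table holding the IN-list values
DECLARE GLOBAL TEMPORARY TABLE SESSION.in_values (
    hash_val VARCHAR(64)
) ON COMMIT PRESERVE ROWS NOT LOGGED;

CREATE INDEX SESSION.ix_in_values ON SESSION.in_values (hash_val);

INSERT INTO SESSION.in_values (hash_val)
VALUES ('hash-1'), ('hash-2');   -- placeholder values

-- join instead of the large IN predicate
SELECT b.*
FROM big_table b
JOIN SESSION.in_values v
    ON v.hash_val = b.hash_val;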
You (or your DBA) might like to try the Design Advisor to advise on new indexes etc.
This can be run from the command-line tool db2advis: https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005144.html
Or from Data Server Manager, which provides a web-based front end for the Query Advisor and Access Path Advisor: https://www-01.ibm.com/support/docview.wss?uid=swg27048195
I want to move multiple SQLite files to PostgreSQL.
The data contained in these files is a monthly time series (one month per *.sqlite file). Each file has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it into one huge table with a new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
I think I would definitely go for one table; just make sure you use sensible indexes.
If you have the space and the resources, go for one table. As other users have appropriately pointed out, databases can handle millions of rows with no problem; well, it depends on the data in them. The row size can make a big difference, for example if you store several VARCHAR(MAX) or VARBINARY(MAX) columns per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier on a single table, and maintenance is easier too from an archival perspective.
But if you never access the data and you need the performance in the primary table, some sort of archive might make sense.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer; it will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b), but...
with new column describing the time period (e.g. 04.2016, 05/2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put an index on that column, and you can probably execute fast queries on it.
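A minimal sketch of that layout in PostgreSQL, with assumed table and column names:
CREATE TABLE measurements (
    measured_at date NOT NULL,   -- a real date instead of a '04.2016' label
    value       numeric
);

CREATE INDEX measurements_measured_at_idx ON measurements (measured_at);

-- pull one month out of the consolidated table
SELECT *
FROM measurements
WHERE measured_at >= DATE '2016-04-01'
  AND measured_at <  DATE '2016-05-01';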
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write, or for the database to execute? An example would help us get a picture of your actual requirements.
I am running a stored procedure to delete data from two tables:
DELETE FROM TESTING_TestResults
FROM TESTING_TestResults
INNER JOIN TESTING_QuickLabDump
    ON TESTING_QuickLabDump.quicklabdumpid = TESTING_TestResults.quicklabdumpid
WHERE TESTING_QuickLabDump.[Specimen ID] = @specimen

DELETE FROM TESTING_QuickLabDump
WHERE [Specimen ID] = @specimen
One table has 60 million rows and the other about 2 million rows.
The procedure takes about 3 seconds to run.
Is there any way I can speed this up? Perhaps using EXISTS?
Meaning IF EXISTS ... THEN DELETE, because the delete should not be occurring every single time.
Something like this:
if @specimen exists in TESTING_QuickLabDump, then do the procedure with the two deletes.
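In T-SQL I imagine that would look roughly like this (assuming the parameter is @specimen):
IF EXISTS (SELECT 1 FROM TESTING_QuickLabDump WHERE [Specimen ID] = @specimen)
BEGIN
    -- the two deletes from above
    DELETE tr
    FROM TESTING_TestResults tr
    INNER JOIN TESTING_QuickLabDump qd
        ON qd.quicklabdumpid = tr.quicklabdumpid
    WHERE qd.[Specimen ID] = @specimen;

    DELETE FROM TESTING_QuickLabDump
    WHERE [Specimen ID] = @specimen;
END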
thank you !!!
Rewriting the query probably won't help speed this up. Use the profiler to find out which parts of the query are slow; for this, make the profiler output the execution plan. Then try adding appropriate indexes. Perhaps one or both tables could use an index over [Specimen ID].
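For example, something along these lines (the index names are arbitrary):
-- filter column on the parent table
CREATE NONCLUSTERED INDEX IX_QuickLabDump_SpecimenID
    ON TESTING_QuickLabDump ([Specimen ID]);

-- join column on the child table
CREATE NONCLUSTERED INDEX IX_TestResults_QuickLabDumpID
    ON TESTING_TestResults (quicklabdumpid);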
For a table with 60 million rows I would definitely look into partitioning the data horizontally and/or vertically. If it's time-sensitive data, then you ought to be able to move old data into a history table. That's usually the first and most obvious thing people do, so I would imagine that if that were a possibility you would have already done it.
If there are many columns then it would definitely benefit you to denormalize the data into multiple tables. If you did this, I would suggest renaming the tables and creating a view of all the partitioned tables named after the original table. Doing that should ensure existing code isn't broken.
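A rough sketch of that view approach; the split tables and the non-key columns here are hypothetical:
EXEC sp_rename 'TESTING_TestResults', 'TESTING_TestResults_Core';
GO
-- view with the original name so existing code keeps working
CREATE VIEW TESTING_TestResults
AS
SELECT c.quicklabdumpid,
       c.ResultValue,          -- hypothetical column kept in the core table
       w.WideComment           -- hypothetical column moved to a second table
FROM TESTING_TestResults_Core c
JOIN TESTING_TestResults_Wide w
    ON w.quicklabdumpid = c.quicklabdumpid;
GO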
If you 'really' want to fine-tune the speed, then you should look into getting a faster hard drive and learning a little about how hard drives work. Whether the data is stored towards the inner or outer section of the disk will slightly affect access speed, for example. And solid-state drives have come a long way, so you might look into getting one of those.
Besides indexing the "obvious" fields, also look at your database schema and check whether you have any FOREIGN KEYs whose ON DELETE CASCADE or SET NULL might be triggered by your delete (unlike Oracle, MS SQL Server will tend to show these in the execution plan). Fortunately, this is usually fairly easy to fix by indexing the child endpoint of the FOREIGN KEY.
Also check if you have any expensive triggers.
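A sketch of that fix, using a hypothetical child table whose foreign key cascades from TESTING_QuickLabDump:
-- index the child endpoint of the cascading FOREIGN KEY
CREATE NONCLUSTERED INDEX IX_SomeChildTable_quicklabdumpid
    ON dbo.SomeChildTable (quicklabdumpid);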
Is there any performance benefit to splitting a large table with roughly 100 columns into 2 separate tables? This would be in terms of inserting, deleting and selecting tasks? I'm using SQL Server 2008.
If one of the fields is a CLOB or BLOB, you anticipate it holding a huge amount of data, you won't need that field very often, and the result set will be transmitted over a long pipe (like server to a web-based client), then I think putting that field in a separate table would be appropriate.
But just returning 100 regular fields probably won't tax your system so much as to justify a separate table and a join.
The only benefit you might see is if a number of columns are only occasionally populated. In that case, putting those into their own table and only adding a row when there is data might make sense in terms of overall row overhead and, depending on the number of rows, overall page count for the table(s). That said, this is one of the reasons sparse columns were introduced in SQL Server 2008.
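For reference, a minimal sparse-column sketch for SQL Server 2008 (table and column names are made up):
CREATE TABLE dbo.Orders
(
    OrderID     int IDENTITY(1,1) PRIMARY KEY,
    OrderDate   date NOT NULL,
    GiftMessage nvarchar(200) SPARSE NULL   -- rarely populated attribute
);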
For the maintenance and other overhead of managing two tables instead of one (especially given that people can act on individual tables if they choose), it's unlikely it would be worth it.
Can you describe what type of entity needs to have over 100 columns? Perhaps the data model is just wrong in the first place.
I would say no as it would take more execution time to join the 2 tables whenever you wanted to do something.
It depends on whether you use these fields at the same time in your application.
This kind of performance improvement is really bad: you make your source code impossible to understand. If you have performance trouble with this table, add something (like a table containing the 15 fields you'll use in a given query, updated via a trigger); don't modify your clean solution.
If you don't have a performance problem, don't do anything; you'll see later!