SSIS performance in this scenario

Can this kind of logic be implemented in SSIS, and is it possible to do it in near-real time?
Users are submitting tables with hundreds of thousands of records and, with the current implementation, waiting up to an hour for the results when the starting table has about 500,000 rows (after STEP 1 and STEP 2 we have millions of records). In the future, the amount of data and the user base may grow drastically.
STEP 1
We have a table (A) of around 500,000 rows with the following main columns: ID, AMOUNT
We also have a table (B) with the prop. steps and the following main columns: ID_A, ID_B, COEF
TABLE A:
ID,AMOUNT
a,1000
b,2000
TABLE B:
ID_A,ID_B,COEF
a,a1,2
a1,a2,2
b,b1,5
We create new records from all 500,000 records in table A, multiplying AMOUNT by COEF (the propagation chains, so newly created records are propagated again, e.g. a → a1 → a2):
OUTPUT TABLE:
ID,AMOUNT
a,1000
a1,2000
a2,4000
b,2000
b1,10000
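For reference, this chained propagation can also be expressed as a single set-based query. The sketch below is only an illustration: the table names (dbo.TableA, dbo.TableB) and the decimal cast are assumptions, but the recursive CTE walks the coefficient chain exactly as in the example above.

-- Minimal sketch of STEP 1, assuming tables named dbo.TableA (ID, AMOUNT)
-- and dbo.TableB (ID_A, ID_B, COEF).
WITH Propagated AS (
    -- Anchor: the original rows of table A.
    SELECT a.ID, CAST(a.AMOUNT AS decimal(18, 2)) AS AMOUNT
    FROM dbo.TableA AS a
    UNION ALL
    -- Recursive step: one new row per propagation step, amount scaled by COEF.
    SELECT b.ID_B, CAST(p.AMOUNT * b.COEF AS decimal(18, 2))
    FROM Propagated AS p
    JOIN dbo.TableB AS b ON b.ID_A = p.ID
)
SELECT ID, AMOUNT
FROM Propagated
OPTION (MAXRECURSION 0); -- allow chains deeper than the default limit of 100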
STEP 2
Following custom logic, we assign the amount of every record calculated in STEP 1 to other items with the following logic:
TABLE A
ID,AMOUNT
a,1000
a1,2000
a2,4000
b,2000
b1,10000
TABLE B
ID,WC,COEF
a,wc1,0.1
a,wc2,1
a,wc3,0.1
a1,wc4,1
a2,wc5,1
b,wc1,1
b1,wc1,1
b1,wc2,1
OUTPUT TABLE:
ID,WC,AMOUNT
a,wc1,100
a,wc2,1000
a,wc3,100
a1,wc4,2000
a2,wc5,4000
b,wc1,2000
b1,wc1,10000
b1,wc2,10000
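STEP 2 by itself is a plain join. A minimal sketch, again with assumed table names (dbo.Step1Output for the STEP 1 result, dbo.Step2Map for the ID/WC/COEF rows):

-- Spread each STEP 1 amount over its WC rows, scaled by COEF.
SELECT m.ID, m.WC, o.AMOUNT * m.COEF AS AMOUNT
FROM dbo.Step1Output AS o
JOIN dbo.Step2Map AS m
    ON m.ID = o.ID;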
The other steps are just joins and arithmetic operations on the tables, and the overall number of records can't be reduced (the tables also carry other metadata fields).

In my personal experience, that kind of logic can be implemented entirely in SSIS.
I would do it in a Script Task or Script Component, for two reasons:
First, if I understood correctly, you need an asynchronous transformation to output more rows than you take as input. Script Components can handle multiple, different outputs.
Second, in the script you can implement all of those calculations in one place, whereas building them from standard components would require a lot of components and the relationships between them. Most importantly, the complexity of the solution stays tied to your algorithmic design, which can be a huge boost to performance and scalability if you achieve a good complexity; and, if I understand correctly again, those two aspects are fundamental here.
There are, though, some professionals who have a bad opinion of "complex" scripts...
The downside of this approach is that you need some ability with .NET and programming, most of your package logic will be concentrated there, and debugging a script can be harder than debugging other components. But once you start using the .NET features of SSIS, there is no turning back.
Usually getting near real time in SSIS is tricky for big data sets, and sometimes you need to integrate other tools (e.g. StreamInsight) to achieve it.

Related

BigQuery table performance loss with TABLE_QUERY

We are seeing a surprising performance hit when querying multiple tables vs. one big one. Scenario:
We have a simple web analytics tool based on BigQuery. We track basic events for individual sites. For about a month, we pumped all data into one big table. Now, we are breaking the data into partitions by SITE and MONTH.
So the big table was simply [events.all]
Now we have, say, [events.events_2014_06_SITEID]
Querying individual tables for a certain group is much faster and processes much less data. But querying our entire dataset is much, much slower, even on simple queries. And our new dataset is only 1 day old, whereas the big table is 30 days old, so it is slower despite querying far less data.
For example:
select count(et) from [events.all] where et='re'
--> completed in 3.2 s, processing 79 MB of data. This table has 21,048,979 rows.
select count(et) from ( TABLE_QUERY(events, 'table_id CONTAINS "events_2014_"') ) where et='re'
--> completed in 44.2 s, processing 1.8 MB of data. Put together, these tables have 492,264 rows.
Why does this happen, and is there any way to resolve this big disparity?

audit table vs. Type 2 Slowly Changing Dimension

In SQL Server 2008+, we'd like to enable tracking of historical changes to a "Customers" table in an operational database.
It's a new table and our app controls all writing to the database, so we don't need evil hacks like triggers. Instead we will build the change tracking into our business object layer, but we need to figure out the right database schema to use.
The number of rows will be under 100,000 and number of changes per record will average 1.5 per year.
There are at least two ways we've been looking at modelling this:
1. As a Type 2 Slowly Changing Dimension table called CustomersHistory, with columns for EffectiveStartDate, EffectiveEndDate (set to NULL for the current version of the customer), and auditing columns like ChangeReason and ChangedByUsername. Then we'd build a Customers view over that table which is filtered to EffectiveEndDate=NULL. Most parts of our app would query using that view, and only parts that need to be history-aware would query the underlying table. For performance, we could materialize the view and/or add a filtered index on EffectiveEndDate=NULL.
2. With a separate audit table. Every change to a Customer record writes once to the Customer table and again to a CustomerHistory audit table.
From a quick review of StackOverflow questions, #2 seems to be much more popular. But is this because most DB apps have to deal with legacy and rogue writers?
Given that we're starting from a blank slate, what are pros and cons of either approach? Which would you recommend?
In general, the issue with SCD Type II is that if the average number of changes to the attribute values is very high, you end up with a very fat dimension table. This growing dimension table, joined with a huge fact table, slows down query performance gradually. It's like slow poisoning: initially you don't see the impact, and when you realize it, it's too late!
Now, I understand that you will create a separate materialized view filtered on EffectiveEndDate = NULL and that it will be used in most of your joins. Additionally, your data volume is comparatively low (100,000 rows). With an average of only 1.5 changes per year, I don't think data volume or query performance are going to be your problem in the near future.
In other words, your table is truly a slowly changing dimension (as opposed to a rapidly changing dimension, where your option #2 would be a better fit). In your case, I would prefer option #1.
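For what it's worth, a minimal sketch of option #1 in SQL Server 2008+, with hypothetical column names; the filtered index covers exactly the EffectiveEndDate IS NULL slice that the Customers view exposes:

CREATE TABLE dbo.CustomersHistory (
    CustomerHistoryID  int IDENTITY(1,1) PRIMARY KEY,
    CustomerID         int            NOT NULL,
    Name               nvarchar(200)  NOT NULL,
    EffectiveStartDate datetime2      NOT NULL,
    EffectiveEndDate   datetime2      NULL,        -- NULL marks the current version
    ChangeReason       nvarchar(200)  NULL,
    ChangedByUsername  nvarchar(100)  NOT NULL
);

-- Filtered index so lookups of the current version never touch historical rows.
CREATE NONCLUSTERED INDEX IX_CustomersHistory_Current
    ON dbo.CustomersHistory (CustomerID)
    WHERE EffectiveEndDate IS NULL;
GO

-- The view most of the application queries.
CREATE VIEW dbo.Customers AS
    SELECT CustomerID, Name, EffectiveStartDate, ChangedByUsername
    FROM dbo.CustomersHistory
    WHERE EffectiveEndDate IS NULL;
GO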

200 column table - 3 million rows - performance

I'm currently working on a project where the client has handed me a database that includes a table with over 200 columns and 3 million rows of data. This is definitely poorly designed, and I'm currently exploring some options. I developed the app on my 2012 MBP with 16 GB of RAM and a 512 GB SSD; the app had to be built with MVC 4, so I set up the development and test environment using Parallels 8 on OS X. As part of the design, I developed an interface for the client to create custom queries against this large table, so I am sending a query string to the controller, which is executed using Dynamic LINQ, and the results are sent to the view as JSON (to populate a Kendo UI grid). On my MBP, queries run through this interface take at most 10 seconds (which I already find too long) to return results to the Kendo UI grid; similarly, when I test queries directly in SQL Server, they never take very long.
However, when I deployed this to the client for testing, these same queries take in excess of 3 minutes. So, long story short, the client will be upgrading the server hardware, but in the meantime they still need to test the app.
My question concerns the design: despite the fact that the table holds 200 columns, each row is unique. More specifically, the current design is:
PK-(GUID) OrganizationID (FK) --- 200 columns (tax fields)
If I redesign this to:
PK (GUID) OrganizationID (FK) FieldID(FK) Input
Field table:
FieldID FieldName
This would turn this table of 3 million rows of data into 600 million rows with only 3 columns. Will I see performance enhancements?
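For concreteness, here is a hypothetical sketch of the two tables that redesign implies; all names (dbo.TaxValues, dbo.TaxFields, the Input length, etc.) are invented for illustration:

-- Lookup table describing the ~200 fields.
CREATE TABLE dbo.TaxFields (
    FieldID   int           NOT NULL PRIMARY KEY,
    FieldName nvarchar(128) NOT NULL
);

-- Narrow value table: one row per (record, field) pair instead of 200 columns.
CREATE TABLE dbo.TaxValues (
    TaxValueID     uniqueidentifier NOT NULL PRIMARY KEY,  -- the PK (GUID)
    OrganizationID int              NOT NULL,              -- FK to the organizations table
    FieldID        int              NOT NULL
        CONSTRAINT FK_TaxValues_TaxFields REFERENCES dbo.TaxFields (FieldID),
    Input          nvarchar(255)    NULL
);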
Any insight would be appreciated - I understand normalization but most of my experience is in programming.
Thanks in advance!
It is very hard to make any judgements without knowing the queries that you are running on the table.
Here are some considerations:
Be sure that the queries are using indexes if they are returning only a handful of rows.
Check that you have enough memory to store the table in memory.
When doing timings, be sure to ignore the first run, because this is just loading the page cache.
For testing purposes, just reduce the size of the table. That should speed things up.
As for your question about normalization: your denormalized structure takes up much less disk space than a normalized structure, because you do not need to repeat the keys for each value. If you are looking for one value on one row, normalization will not help you. You will still need to scan the index to find the row and then load the row. And the row will be on one page, regardless of whether it is normalized or denormalized. In fact, normalization might be worse, because the index will be much larger.
There are some examples of queries where normalizing the data will help. But, in general, you already have a more efficient data structure if you are fetching the data by rows.
You can take a paging approach. There would be two queries: the initial one returns all matching rows, but only the column with unique IDs. This array can be split into pages, say 100 IDs per page. When the user selects a specific page, you pass those 100 IDs to the second query, which this time returns all 200 columns, but only for the requested 100 rows. This way you don't have to return all the columns across all the rows at once, which should yield a significant performance boost.
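A rough sketch of the two queries, with invented names (dbo.TaxRecords, RecordID, the @OrganizationID filter), since the real schema isn't shown:

-- Query 1: return only the unique IDs that match the user's filter;
-- the application splits this list into pages of 100 IDs.
DECLARE @OrganizationID int = 42;  -- example filter parameter
SELECT RecordID
FROM dbo.TaxRecords
WHERE OrganizationID = @OrganizationID
ORDER BY RecordID;

-- Query 2: for the 100 IDs of the requested page, return all 200 columns.
DECLARE @PageOfIds TABLE (RecordID uniqueidentifier PRIMARY KEY);
-- ... the application inserts the 100 IDs of the selected page here ...
SELECT t.*
FROM dbo.TaxRecords AS t
JOIN @PageOfIds AS p
    ON p.RecordID = t.RecordID;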

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to see whether someone here on Stack Overflow has had a similar situation and has a recommendation for either one for a specific reason.
Below is a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
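For reference, the join-based lookup might look roughly like this, assuming table names TableA and TableB and the column names from the sketch above (hyphenated names bracketed):

SELECT a.id,
       a.[some-data],
       MAX(b.[date]) AS last_process_date
FROM TableA AS a
LEFT JOIN TableB AS b
    ON b.[fk-for-A] = a.id
GROUP BY a.id, a.[some-data];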
Thanks
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATEs on a warehouse table; that's why I didn't recommend a CURRENT_IND flag, but it is an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system, where the latest state of a project is stored in the projects table (columns: project_id, description, etc.) and the history of the project is stored in the project_history table (columns: project_id, update_id, description, etc.). Whenever there is a new update to a project, we need to find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and taking MAX(update_id), but the cost would be high considering the number of project updates (a couple of hundred thousand) and the frequency of updates. So we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table in which each row is a parameter and another table that logs each parameter value historically as a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table holds a listing of measures (29K records) and the historical parameter value table has the value for each parameter every hour, so that table currently has 4M rows. At any given point in time there will be a lot more requests for the latest value than for the history, so I do have the latest value stored in the parameter table in addition to it being the last record in the parameter value table. While this may look like duplication of data, from a performance standpoint it makes perfect sense, because:
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, in your case I would most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writes, but a great deal faster for reads.
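If you go that way, a minimal sketch of keeping the duplicated value consistent is to write the history row and the latest-value column in one transaction; the variable types and the last_process_date column name are assumptions based on the schema sketch in the question:

DECLARE @id int = 1,                                 -- assumed key type
        @data nvarchar(100) = N'processing result',  -- placeholder payload
        @processDate datetime2 = SYSUTCDATETIME();

BEGIN TRANSACTION;

    -- History row in table B.
    INSERT INTO TableB ([fk-for-A], data, [date])
    VALUES (@id, @data, @processDate);

    -- Denormalized "latest" value in table A.
    UPDATE TableA
    SET last_process_date = @processDate
    WHERE id = @id;

COMMIT TRANSACTION;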

Tool to generate worst-case data for a given SQL query

I would like to populate some tables with a large amount of data in order to empirically test the performance of an SQL query in the worst case scenario (well, as close to it as possible).
I considered using random values. But this would require manual adjustment to get even close to the worst case. Unconstrained random values are no good for a worst case because they tend mostly to be unique -- in which case an index on a single column should perform about as well as a compound index. On the other hand, random values chosen from too small a set will result in a large fraction of the rows being returned, which is uninteresting because it reflects not so much search performance as listing performance.
I also considered just looking at EXPLAIN PLAN, but this is not empirical, and also the explanation varies, partly depending on the data that you already have, rather than the worst case.
Is there a tool that analyzes a given SQL query (and the db schema and ideally indexes), then generates a large data set (of a given size) that will cause the query to perform as close to worst-case as possible?
Any RDBMS is fine.
I would also be interested in alternative approaches for gaining this level of insight into worst-case behaviour.
Short answer: there is no worst-case scenario, because every case can be made much worse, usually just by adding more data with the same distribution.
Long answer:
I would recommend looking not for the worst-case scenario, but for an "overblown realistic scenario": start from production data, define what you consider a large number of entities (for each table separately), multiply by a factor of two or three, and generate the data by hand from the production data you have.
For example, if your production data has 1,000 car models from 150 car manufacturers and you decide you might need 10,000 models from 300 manufacturers, you would first double the number of records in the referenced table (manufacturers), then generate a "copy" of the existing 1,000 car models to create another 1,000 cars referencing those generated manufacturers, and then generate 4 more cars per existing one, every time copying the existing distribution of values based on case-by-case decisions. This means new unique values in some columns, and simply copied values in others.
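As a rough illustration of that copy-and-inflate step (all table and column names are invented):

-- Create 4 copies of every existing car model, keeping the distribution of
-- manufacturers and years while making the model names unique.
INSERT INTO dbo.CarModels (ModelName, ManufacturerID, ModelYear)
SELECT m.ModelName + '-' + CAST(n.i AS varchar(10)), m.ManufacturerID, m.ModelYear
FROM dbo.CarModels AS m
CROSS JOIN (VALUES (1), (2), (3), (4)) AS n(i);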
Do not forget to regenerate statistics after you are done. Why exactly am I saying this? Because you want to test the best possible query plan given the query, data, and schema, and optimize that.
Rationale: queries are not algorithms. The query optimizer chooses a suitable query plan based not only on the query, but also on information about how big the tables approximately are, index coverage, operator selectivity, and so on. You are not really interested in learning how poorly chosen plans, or plans for an unrealistically populated database, execute. That could even induce you to add ill-chosen indexes, and ill-chosen indexes can make production performance worse. You want to learn and test what happens with the best plan for realistic, albeit large, numbers of rows.
While you could test with 1,000,000 car models, odds are that such production content is science fiction for your specific database schema and queries. However, it would be even less useful to test with the number of car models equaling the number of car manufacturers in your database. While such a distribution might happen to be the worst possible one for your application, you will learn almost nothing from basing your metrics on it.