MySQL summary tables for web app advice - SQL

I have a database where I have the data in a number of tables with relationships, for example:
TABLE Cars (stock)
------------------------
Model   colourid   Doors
------------------------
xyz     0          2
xyz     1          4

TABLE Colour
------------------------
Colourid   Name
------------------------
0          Red
1          Green
I need to produce several regular summaries, for example a summary in the format:
         |  colour          |  Num Doors
Model    |  red green blue  |  2  4  5  6
---------|------------------|------------------
XYZ      |  1   2     3     |  4  5  3  5   <<< numbers in stock
UPDATE - "a car can have an arrangement of doors for example 2 door cars or cars with 4 doors. In the summary it shows the number of cars in stock with each door configuration for a particular model eg there are 4 cars of xyz with 2 doors. Please bare in mind that this is only an example, cars may not be the best example its all i could come up at the time"
Unfortunately, rearranging the tables may make them better for summaries but not for the day-to-day operations.
I can think of several ways to produce these summaries, e.g. multiple SQL queries with the table put together at the presentation level, a SQL-level UNION of multiple queries, views with multiple nested queries, or lastly cron jobs or trigger code that maintain a summary table with the data arranged suitably for summary queries and reporting.
I wonder if anyone could please give me some guidance, since none of these methods seems very efficient, which is made worse in a multi-user environment and where regular summaries may be required.

I think you need a data warehousing solution - basically build a new schema just for reporting purposes and populate its tables periodically.
There can be several update mechanisms for the summary tables:
Background job scheduled to do this periodically. This is best if up-to-date information is not needed.
Update the summary table using triggers on the main transaction tables. This could get somewhat complicated, but it might be warranted if you need up-to-date information.
Update the report tables whenever a report is drawn, just before showing the report. You can use some anchor values to ensure that you are not recalculating the entire report too frequently - only consider the rows added or updated since the last time the report was drawn.
The only problem is that you will need to alter the table every time new values get added to the pivoted columns.
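As a rough sketch of the scheduled-job variant, assuming a reporting table called summary_stock built from the Cars table above (the name and column sizes are placeholders, not taken from the question):

-- Hypothetical reporting table, refreshed by a scheduled job;
-- reports then aggregate or pivot from it cheaply.
CREATE TABLE summary_stock (
  Model    VARCHAR(50) NOT NULL,
  Colourid INT         NOT NULL,
  Doors    INT         NOT NULL,
  InStock  INT         NOT NULL,
  PRIMARY KEY (Model, Colourid, Doors)
);

-- Periodic full rebuild (cron job or MySQL EVENT); triggers on Cars
-- could maintain it incrementally instead if fresher data is needed.
TRUNCATE TABLE summary_stock;
INSERT INTO summary_stock (Model, Colourid, Doors, InStock)
SELECT Model, colourid, Doors, COUNT(*)
FROM Cars
GROUP BY Model, colourid, Doors;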

Just a small variation on Roopesh's answer
Depending on the size of the database, the available server resources, how often you would run these reports, and particularly whether you can afford stale reports, you could do conceptually the same as above but use views instead of real tables.
Here are two links that should get you started
Pivot in MySQL
MySQL Wizardry
Notes:
you don't have to run any DDL (you can even skip CREATE VIEW and use straight dynamic SQL) as compared to having materialized results
the complexity is comparable, but a little lower (adding a new value in the materialized scenario requires 1) ALTER TABLE ADD COLUMN and 2) INSERT; with this approach you only modify the SELECT to analyze one more case, so the effort is basically the same as the INSERT alone)
performance can be much worse if users pull the reports directly from the database many times, but as stated before it also guarantees that the data is fresh
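For reference, the pivot behind those links boils down to conditional aggregation; a minimal sketch against the example tables above (the colour values and view name are assumptions):

-- One output column per pivoted value; adding a new colour only means
-- adding one more SUM(...) expression here, with no ALTER TABLE.
CREATE VIEW v_stock_summary AS
SELECT c.Model,
       SUM(col.Name = 'Red')   AS red,
       SUM(col.Name = 'Green') AS green,
       SUM(col.Name = 'Blue')  AS blue,
       SUM(c.Doors = 2)        AS doors_2,
       SUM(c.Doors = 4)        AS doors_4
FROM Cars c
JOIN Colour col ON col.Colourid = c.colourid
GROUP BY c.Model;

-- Reports read the view directly and always see fresh data:
-- SELECT * FROM v_stock_summary WHERE Model = 'xyz';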


SSIS performance in this scenario

Can this kind of logic be implemented in SSIS, and is it possible to do it in near real time?
Users are submitting tables with hundreds of thousands of records and, with the current implementation, waiting up to 1 hour for the results when the starting table has about 500,000 rows (after STEP 1 and STEP 2 we have millions of records). In the future the amount of data and the user base may grow drastically.
STEP 1
We have a table (A) of around 500,000 rows with the following main columns: ID, AMOUNT
We also have a table (B) with the prop.steps and the following main columns: ID_A, ID_B, COEF
TABLE A:
id amount
a 1000
b 2000
TABLE B:
id_a,id_b,coef
a,a1,2
a1,b2,2
b,b1,5
We are creating new records from all of the 500,000 records that we have in table A, multiplying the AMOUNT by the COEF:
OUTPUT TABLE:
id, amount
a,1000
a1,2000
a2,4000
b,2000
b1,10000
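For illustration only, expressed as set-based SQL (SQL Server-style recursive CTE, since SSIS is in play), step 1 is roughly a recursive expansion. The sample output does not quite match table B, so treat this as a sketch of the intent rather than of the exact rule:

-- Keep the original rows of A and add one derived row per propagation
-- step in B, multiplying the amount by the coefficient at every hop.
WITH expanded AS (
    SELECT id, amount
    FROM A
    UNION ALL
    SELECT b.id_b, e.amount * b.coef
    FROM expanded e
    JOIN B b ON b.id_a = e.id
)
SELECT id, amount
FROM expanded;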
STEP 2
Following custom logic, we assign the amounts of all the records calculated before to some other items, as follows:
TABLE A
ID,AMOUNT
a,1000
a1,2000
a2,4000
b,2000
b1,10000
TABLE B
ID,WC,COEF
a,wc1,0.1
a,wc2,1
a,wc3,0.1
a1,wc4,1
a2,wc5,1
b,wc1,1
b1,wc1,1
b1,wc2,1
OUTPUT TABLE:
ID,WC,AMOUNT
a,wc1,100
a,wc2,1000
a,wc3,100
a1,wc4,2000
a2,wc5,4000
b,wc1,2000
b1,wc1,10000
b1,wc2,10000
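In plain SQL terms this step is essentially a single join between the step-1 output (table A here) and the step-2 table B, shown only to illustrate the calculation:

-- Spread each amount across its work centres, scaled by the coefficient.
SELECT b.id, b.wc, a.amount * b.coef AS amount
FROM A a
JOIN B b ON b.id = a.id;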
The other steps are just joins and arithmetical operations on the tables and the overall number of records can't be reduced (the tables have also other fields with metadata).
In my personal experience that kind of logic can be completely implemented in SSIS.
I would do it in a Script Task or Component for two reasons:
First, if I understood correctly, you need an asynchronous transformation that outputs more rows than it receives, and a script can handle multiple and different outputs.
Second, in a script you can implement all of those calculations in one place, whereas doing it with other components would take a lot of them plus the relationships between them. Most importantly, the complexity stays tied to your own algorithmic design, which can be a huge boost to performance and scalability if you achieve good complexity - and, if I understand you correctly again, those two aspects are fundamental here.
There are, though, some professionals who have a bad opinion of "complex" scripts and...
The downside of this approach is that you need some ability with .NET and programming; also, most of your package logic will be focused there, and script debugging can be more complex than for other components. But once you start using the .NET features of SSIS, there is no turning back.
Usually getting near real time in SSIS is tricky for big data sets, and sometimes you need to integrate other tools (e.g. StreamInsight) to achieve it.

Updating tables in a Greenplum database using gpload

I have a local table XYZ in Greenplum. I am populating that table with data from 5 other tables (table XYZ has a few columns and data from 5 different tables, populated by some join operation).
This is working fine, but the problems I am facing here are:
1> I need my table XYZ to have the most recent data. That is, if any new entry comes into the 5 tables (from which XYZ is populated), my table XYZ should be updated.
2> If any existing record gets modified, then the corresponding data in table XYZ should also be modified.
I have one more table, History_of_XYZ, which contains all the historical data of XYZ. For example: let's say there is one entry for customer ABC saying he lives in the USA, but now ABC has moved to a new country, say Russia. Then my history table will have the data corresponding to the USA entry, and table XYZ will have the most recently updated data, which is the customer living in Russia.
So I am not able to figure out the best way to approach steps 1 and 2.
How can it be done, considering all the data is in a Greenplum database?
I did some research on gpload and other loading options, but I'm not sure how to approach steps 1 and 2.
Any pointers will be helpful. I am pretty new to databases, so setting up the table structures and populating the table was itself a big learning curve for me.
I guess you need to look at interactive ingestion tools like Spring XD; see the topic on streams.
Regards,
Moha.
This is a simple use case for triggers for both 1 and 2. Use INSERT/UPDATE triggers.
Greenplum does not support triggers. To resolve your problem, you need to maintain a last-updated timestamp in all 5 source tables and, based on how frequently the 5 source tables are updated, schedule your program to load (insert/update) the XYZ table. If there are too many deletes and updates every day, then it's better to use a CTAS (CREATE TABLE AS SELECT) operation to reclaim disk space.
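A rough sketch of such a scheduled load, assuming a last_updated column on the sources, a view (or gpload target) called source_join_view holding the 5-table join, a single-row control table etl_control, an id key on XYZ, and History_of_XYZ mirroring XYZ's columns - all of these names and columns are illustrative:

-- Stage only the rows that changed since the previous load.
DROP TABLE IF EXISTS stage_xyz;
CREATE TABLE stage_xyz AS
SELECT id, col1, col2   -- placeholder columns for the real 5-table join
FROM source_join_view
WHERE last_updated > (SELECT last_load FROM etl_control);

-- Keep the old versions for History_of_XYZ before replacing them (point 2).
INSERT INTO History_of_XYZ
SELECT x.* FROM XYZ x JOIN stage_xyz s ON s.id = x.id;

-- Delete-then-insert acts as the upsert into XYZ (points 1 and 2).
DELETE FROM XYZ WHERE id IN (SELECT id FROM stage_xyz);
INSERT INTO XYZ SELECT * FROM stage_xyz;

-- Remember where this load finished.
UPDATE etl_control SET last_load = now();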

Efficient SQL schema and query design for top recent user audit table

I'd like some input on designing the SQL data layer for a service that should store and provide the latest N entries for a specific user. The idea is to track each user (id), the time of an event and then the event id.
The service should only respond with the last X events for each user, and should only contain events that occurred during the last Y days.
The service also needs to scale to large amounts of updates and reads.
I'm considering just a simple table with the fields:
ID | USERID | EVENT | TIMESTAMP
============================================
1 | 1 | created file Z | 2014-03-20
2 | 2 | deleted dir Y | 2014-03-20
3 | 1 | created dir Y | 2014-03-20
But how would you consider solving the temporal requirements? I see two alternatives here:
1) On inserts and/or reads for a user, also remove outdated events and all but the last X events for that user. This affects latency, as you need to perform a select, a delete and an insert on each request, but it keeps the disk usage to a minimum.
2) Let the service filter on query and do the pruning as a separate batch job with some SQL that:
First removes all obsolete events, irrespective of user, based on the timestamp.
Then does some join that removes all but the last X events for each user.
I have looked for design principles regarding these requirements, which seem like fairly common ones, but I haven't yet found a perfect match.
It is at the moment NOT a requirement to query for all users that have performed a specific type of event.
Thanks in advance!
Edit:
The service is meant to scale to millions of requests per hour, so I've been playing around with the idea of denormalizing this for performance reasons. Given that the requirements are set in stone:
10 last events
No events older than 10 days
I'm actually considering a pivoted table like this:
USERID | EV_1 | TS_1 | EV_2 | TS_2 | EV_3 | TS_3 | etc up to 10...
======================================================================
1 | Create | 2014.. | Del x | 2013.. | etc.. | 2013.. |
This way I can probably shift the events with a MERGE with SELECT and I get eviction for "free". Then I only have to purge all records where TS_1 is older than 10 days. I can also filter in my application logic to only show the events that are newer than 10 days after doing the trivial selects.
The caveat is if events come in "out of order". The idea above works if I can always guarantee that the events are ordered from "left to right". I probably have to think a bit more about that one.
Aside from the fact that it is basically a big cut into the relational data model, do you think I'm on the right track here when it comes to prioritizing performance above all?
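To make the shifting idea concrete, here is a rough T-SQL-flavoured sketch (the table user_events(userid, ev_1..ev_10, ts_1..ts_10) and the parameter values are assumptions; the syntax will differ in other databases). New events land in slot 1 and push older ones one slot to the right:

DECLARE @userid INT = 1,
        @event  VARCHAR(100) = 'created file Z',
        @now    DATETIME2 = SYSDATETIME();

MERGE user_events AS t
USING (SELECT @userid AS userid) AS s
    ON t.userid = s.userid
WHEN MATCHED THEN UPDATE SET
    ev_10 = t.ev_9,  ts_10 = t.ts_9,
    ev_9  = t.ev_8,  ts_9  = t.ts_8,
    ev_8  = t.ev_7,  ts_8  = t.ts_7,
    ev_7  = t.ev_6,  ts_7  = t.ts_6,
    ev_6  = t.ev_5,  ts_6  = t.ts_5,
    ev_5  = t.ev_4,  ts_5  = t.ts_4,
    ev_4  = t.ev_3,  ts_4  = t.ts_3,
    ev_3  = t.ev_2,  ts_3  = t.ts_2,
    ev_2  = t.ev_1,  ts_2  = t.ts_1,
    ev_1  = @event,  ts_1  = @now
WHEN NOT MATCHED THEN
    INSERT (userid, ev_1, ts_1) VALUES (s.userid, @event, @now);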
Your table design is good. Consider also the indexes you want to use. In practice, you will need a multi-column index on (userid, timestamp) to quickly answer queries for the last N events of a certain userid. Then you need a single-column index on (timestamp) to efficiently delete old events.
How many events are you planning to store, and how many events are you planning to retrieve per query? I.e., does the size of the table exceed the available RAM? Are you using traditional spinning hard disks or solid-state disks? If the table exceeds the available RAM and you are using traditional HDDs, note that each row returned for the query takes about 5-15 milliseconds due to the slow seek time.
If your system supports batch jobs, I would use a batch job to delete old events instead of deleting old events at each query. The reason is that batch jobs do not slow down the interactive code path, and can perform more work at once provided that you execute the batch job rarely enough.
If your system doesn't support batch jobs, you could use a probabilistic approach to deleting old events, i.e. delete only with 1% probability when events are queried. Alternatively, you could have a helper table into which you store the timestamp of the last purge of old events, check that timestamp, and if it's old enough perform a new delete job and update the timestamp. The helper table should be so small that it will always stay in the cache.
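A sketch of those pieces in generic SQL (the events table name, the 10-day window and the exact interval syntax are placeholders for your own):

-- Indexes: one for the per-user query, one for the cleanup job.
CREATE INDEX ix_events_user_ts ON events (userid, timestamp);
CREATE INDEX ix_events_ts ON events (timestamp);

-- Helper table remembering when old events were last purged.
CREATE TABLE purge_log (last_purge TIMESTAMP NOT NULL);
INSERT INTO purge_log VALUES (CURRENT_TIMESTAMP);

-- Run from a batch job, or from application code when last_purge is old enough.
DELETE FROM events WHERE timestamp < CURRENT_DATE - INTERVAL '10' DAY;
UPDATE purge_log SET last_purge = CURRENT_TIMESTAMP;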
My inclination is not to delete data. I would just store the data in your structure and have an interface (perhaps a view or table functions) that runs a query such as;
select s.*
from simple s
where s.timestamp >= CURRENT_DATE - interval 'n days' and
      s.UserId = $userid
order by s.timestamp desc
fetch first 10 rows only;
(Note: this uses standard syntax because you haven't specified the database, but there is similar functionality in any database.)
For performance, you want an index on simple(UserId, timestamp). This will do most of the work.
If you really want, you can periodically delete older rows. However, keeping all the rows is advantageous for responding to changing requirements ("Oh, we now want 60 days instead of 30 days") or other purposes, such as investigations into user behaviors and changes in events over time.
There are situations that are out of the ordinary where you might want a different approach. For instance, there could be legal restrictions on the amount of time you can hold the data; in that case, use a job that deletes old data and run it every day. Or, if your database technology were an in-memory database, you might want to restrict the size of the table so old data doesn't occupy much memory. Or, if you had really high transaction volumes and lots of users (like millions of users with thousands of events each), you might be more concerned with data volume affecting performance.

audit table vs. Type 2 Slowly Changing Dimension

In SQL Server 2008+, we'd like to enable tracking of historical changes to a "Customers" table in an operational database.
It's a new table and our app controls all writing to the database, so we don't need evil hacks like triggers. Instead we will build the change tracking into our business object layer, but we need to figure out the right database schema to use.
The number of rows will be under 100,000 and number of changes per record will average 1.5 per year.
There are at least two ways we've been looking at modelling this:
As a Type 2 Slowly Changing Dimension table called CustomersHistory, with columns for EffectiveStartDate, EffectiveEndDate (set to NULL for the current version of the customer), and auditing columns like ChangeReason and ChangedByUsername. Then we'd build a Customers view over that table which is filtered to EffectiveEndDate=NULL. Most parts of our app would query using that view, and only parts that need to be history-aware would query the underlying table. For performance, we could materialize the view and/or add a filtered index on EffectiveEndDate=NULL.
With a separate audit table. Every change to a Customer record writes once to the Customer table and again to a CustomerHistory audit table.
From a quick review of StackOverflow questions, #2 seems to be much more popular. But is this because most DB apps have to deal with legacy and rogue writers?
Given that we're starting from a blank slate, what are pros and cons of either approach? Which would you recommend?
In general, the issue with SCD Type II is that if the average number of changes to the attribute values is very high, you end up with a very fat dimension table. This growing dimension table, joined with a huge fact table, slows down query performance gradually. It's like slow poisoning: initially you don't see the impact, and when you realize it, it's too late!
Now, I understand that you will create a separate materialized view filtered on EffectiveEndDate = NULL and that it will be used in most of your joins. Additionally, in your case the data volume is comparatively low (100,000 rows). With an average of only 1.5 changes per year, I don't think data volume or query performance is going to be your problem in the near future.
In other words, your table is truly a slowly changing dimension (as opposed to a rapidly changing dimension - where your option #2 is a better fit). In your case, I will prefer option #1.
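For reference, a minimal sketch of option #1 in SQL Server 2008+ (only a couple of customer columns shown; the object names are illustrative):

-- Type 2 history table; the current version has EffectiveEndDate = NULL.
CREATE TABLE dbo.CustomersHistory (
    CustomerHistoryId  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerId         INT            NOT NULL,
    Name               NVARCHAR(200)  NOT NULL,   -- ...plus the other customer columns
    EffectiveStartDate DATETIME2      NOT NULL,
    EffectiveEndDate   DATETIME2      NULL,
    ChangeReason       NVARCHAR(400)  NULL,
    ChangedByUsername  NVARCHAR(100)  NULL
);
GO
-- Filtered index: at most one current row per customer, and cheap current-row lookups.
CREATE UNIQUE INDEX IX_CustomersHistory_Current
    ON dbo.CustomersHistory (CustomerId)
    WHERE EffectiveEndDate IS NULL;
GO
-- The view most of the application queries.
CREATE VIEW dbo.Customers AS
SELECT CustomerHistoryId, CustomerId, Name,
       EffectiveStartDate, ChangeReason, ChangedByUsername
FROM dbo.CustomersHistory
WHERE EffectiveEndDate IS NULL;
GO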

200 column table - 3 million rows - performance

I'm currently working on a project where the client has handed me a database that includes a table with over 200 columns and 3 million rows of data. This is definitely poorly designed, and I am currently exploring some options. I developed the app on my 2012 MBP with 16 GB of RAM and a 512 GB SSD. I had to develop the app using MVC4, so I set up the development and test environment using Parallels 8 on OS X. As part of the design, I built an interface for the client to create custom queries against this large table with hundreds of columns, so I am sending a query string to the controller, which is passed to dynamic LINQ, and the results are sent to the view as JSON (to populate a Kendo UI grid). On my MBP, when testing queries using the interface I created, it takes at most 10 seconds (which I find too long) to return the results to the Kendo UI grid. Similarly, when I test queries directly in SQL Server, it never takes very long.
However, when I deployed this to the client for testing, these same queries take in excess of 3 minutes. So, long story short, the client will be upgrading the server hardware, but in the meantime they still need to test the app.
My question is this: despite the fact that the table holds 200 columns, each row is unique. More specifically, the design is:
PK (GUID) | OrganizationID (FK) | ...200 columns (tax fields)
If I redesign this to:
PK (GUID) | OrganizationID (FK) | FieldID (FK) | Input
Field table:
FieldID | FieldName
This would turn this 3 million rows of data table into 600 million rows but only 3 columns. Will I see performance enhancements?
Any insight would be appreciated - I understand normalization but most of my experience is in programming.
Thanks in advance!
It is very hard to make any judgements without knowing the queries that you are running on the table.
Here are some considerations:
Be sure that the queries are using indexes if they are returning only a handful of rows.
Check that you have enough memory to store the table in memory.
When doing timings, be sure to ignore the first run, because this is just loading the page cache.
For testing purposes, just reduce the size of the table. That should speed things up.
As for your question about normalization: your denormalized structure takes up much less disk space than a normalized structure, because you do not need to repeat the keys for each value. If you are looking for one value in one row, normalization will not help you. You will still need to scan the index to find the row and then load the row, and the row will be on one page regardless of whether it is normalized or denormalized. In fact, normalization might be worse, because the index will be much larger.
There are some examples of queries where normalizing the data will help. But, in general, you already have a more efficient data structure if you are fetching the data by rows.
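To make that last point concrete (the table and parameter names here are invented for illustration): fetching one organization's data is a single-row seek against the wide table today, but under the proposed key/value design it becomes a 200-row read that still has to be re-pivoted:

-- Current wide design: one seek, one row, all 200 tax fields.
SELECT *
FROM TaxReturns
WHERE OrganizationID = @orgId;

-- Proposed design: 200 rows come back and the application (or a PIVOT)
-- must reassemble them into the same shape.
SELECT f.FieldName, v.Input
FROM TaxReturnValues v
JOIN Field f ON f.FieldID = v.FieldID
WHERE v.OrganizationID = @orgId;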
You can take a paging approach. There will be 2 queries: the initial one will return all rows, but only the column with the unique IDs. This array can be split into pages, say 100 IDs per page. When the user selects a specific page, you pass those 100 IDs to the second query, which this time will return all 200 columns, but only for the requested 100 rows. This way you don't have to return all the columns across all the rows at once, which should yield a significant performance boost.
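A rough T-SQL sketch of the two queries (the table name, the filter and the temp table holding the page's IDs are all placeholders):

-- Query 1: only the key column for every row matching the user's filter.
SELECT ID
FROM BigTaxTable
WHERE OrganizationID = @orgId   -- whatever the custom query builder produced
ORDER BY ID;

-- Query 2: the application sends back the ~100 IDs of the page the user
-- opened (here via a temp table) and fetches the full 200 columns for them.
SELECT t.*
FROM BigTaxTable t
JOIN #PageIds p ON p.ID = t.ID;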