I am tracking clicks over three time periods: the past day, past week and past month.
To do this, I have three tables:
An hourly table, with columns link_id, two other attributes, and hour_1 to hour_24, together with a computed column giving the sum
A weekday table, with columns link_id, two other attributes, and day_1 to day_7, together with a computed column giving the sum
A monthday table, as above, with columns day_1 to day_31
When a click comes in, I store its key attributes like href, description, etc, in other tables, and insert or update the row(s) corresponding to the link_id in each of the above tables.
Each link can have several entries in each of the above hourly/weekday/monthday tables, depending on the two other attributes (e.g. where the user is sitting).
So if a user is Type A and sitting in X, three rows are created or incremented in each of the above tables -- the first row records all clicks on that link over the time period, the second row records all clicks by "Type A people", and the third all clicks by people in X.
I have designed it this way as I didn't want to have to move data around each hour/day/week/month. I just maintain pointers for "current hour" (1-24), "current day" (1-31) and "current weekday" (1-7), and write to the corresponding cells in the tables. When we enter a new period (e.g. "3pm-4pm"), I can just blank out that current column (e.g. hour_15), then start incrementing it for links as they come in. Every so often I can delete old rows which have fallen down to "all zero".
This way I shouldn't ever have to move around column data, which would likely be very expensive for what will potentially be tens of thousands of rows.
I will only be SELECTing either the current day/weekday/hour rows (prior to inserting/updating) or the TOP 20 values from the computed columns based on the attributes (and will likely cache these results for an hour or so).
After the tables populate, UPDATES will far exceed INSERTs as there aren't that many unique hrefs.
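For concreteness, here is a rough sketch of the hourly update I mean (the table, column and parameter names here are purely illustrative, and in practice the increment runs once per matching row):

-- At the start of a new period (e.g. 3pm-4pm), blank out that hour's column:
UPDATE hourly_stats SET hour_15 = 0;

-- Then, as clicks arrive, bump the current column on the matching row(s):
UPDATE hourly_stats
SET hour_15 = hour_15 + 1
WHERE link_id = @link_id
  AND attribute_1 = @attribute_1   -- e.g. user type
  AND attribute_2 = @attribute_2;  -- e.g. location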
Three questions:
Is it OK to combine the three big tables into one big table of monthdays/weekdays/hours? This would give a table with 64 columns, which I worry may be overkill. On the other hand, keeping them separate as they are now triples the number of INSERT/UPDATE statements needed. I don't know enough about SQL Server to know which is best.
Is this approach sensible? Most data sets I've worked with of course have a separate row per item and you would then sort by date -- but when tracking clicks from thousands of users this would give me many hundreds of thousands of rows, which I would have to cull very often, and ordering and summing them would be hideous. Once the tracker is proven, I have plans to roll the click listener out over hundreds of pages, so it needs to scale.
In terms of design, clearly there is some redundancy in having both weekdays and monthdays. However, this was the only way I could think of to maintain a pointer to a column and quickly update it, and use a computed column. If I eliminated the weekdays table, I would need to get an additional computed column on the "monthdays" that summed the previous 7 days -- (e.g. if today is the 21st, then sum day_14, day_15, day_16... day_20). The calculation would have to update every day, which I imagine would be expensive. Hence the additional "weekday" table for a simple static calculation. I value simple and fast calculations more highly than small data storage.
Thanks in advance!
Anytime you see columns with numbers in their names, such as column_1, column_2, column_3... your 'horrible database design' flag should go up. (FYI, here you are breaking 1NF; specifically, you are repeating groups across columns.)
Now, it is possible that such an implementation can be acceptable (or even necessary) in production, but conceptually it is definitely wrong.
As Geert says, conceptually two tables will suffice. If performance is an issue you could denormalize the data for weekly/monthly stats, but I still would not model them as above; I would keep something like:
CREATE TABLE base_stats ( link_id INT, click_time DATETIME )
CREATE TABLE daily_stats ( link_id INT, period DATETIME, clicks INT )
You can always aggregate with
SELECT link_id, COUNT(*) AS clicks, CAST(click_time AS DATE) AS click_day
FROM base_stats
GROUP BY link_id, CAST(click_time AS DATE)
which can be run periodically to fill the daily_stats. If you want to keep it up to date you can implement it in triggers (or if you really must, do it on the application side). You can also denormalize the data on different levels if necessary (by creating more aggregate tables, or by introducing another column in the aggregated data table), but that might be premature optimization.
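If you go the trigger route, a hedged SQL Server sketch (assuming the two tables above, with daily_stats keyed on link_id and period) could look like this:

CREATE TRIGGER trg_base_stats_insert ON base_stats
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- fold the newly inserted clicks into the per-day aggregates
    MERGE daily_stats AS d
    USING (
        SELECT link_id, CAST(click_time AS DATE) AS period, COUNT(*) AS clicks
        FROM inserted
        GROUP BY link_id, CAST(click_time AS DATE)
    ) AS i
    ON d.link_id = i.link_id AND d.period = i.period
    WHEN MATCHED THEN
        UPDATE SET d.clicks = d.clicks + i.clicks
    WHEN NOT MATCHED THEN
        INSERT (link_id, period, clicks) VALUES (i.link_id, i.period, i.clicks);
END;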
The above design is much cleaner for future ad-hoc analysis (will happen with stats). For other benefits see wikipedia on repeating groups.
EDIT:
Even though the solution with two tables (base_stats and daily_stats) is accepted, with the following strategy:
insert each click in base_stats
periodically aggregate the data from base_stats into daily_stats and purge the full detail
it might not be the optimal solution.
Based on discussions and clarification of requirements it seems that the table base_stats is not necessary. The following approach should be also investigated:
CREATE TABLE period_stats ( link_id INT, period DATETIME, ...)
Updates are easy with
UPDATE period_stats
SET clicks = clicks + 1
WHERE period = #dateTime AND link_id = #url AND ...
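Since the very first click of a period has no row to update yet, in practice you would wrap this in an upsert; a minimal SQL Server sketch (extra attribute columns omitted, parameter names are placeholders):

MERGE period_stats AS t
USING (SELECT @period AS period, @link_id AS link_id) AS s
ON t.period = s.period AND t.link_id = s.link_id
WHEN MATCHED THEN
    UPDATE SET t.clicks = t.clicks + 1
WHEN NOT MATCHED THEN
    INSERT (link_id, period, clicks) VALUES (s.link_id, s.period, 1);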
The cost of updating this table, properly indexed, is as efficient as inserting rows into the base table, and it is also easy to use for analysis:
SELECT link_id, SUM(clicks)
FROM period_stats
WHERE period between #dateTime1 AND #dateTime2
GROUP BY ...
Denormalization as you have done in your database can be a good solution for some problems. In your case, however, I would not choose the above solution, mainly because you lose information that you might need later -- maybe you will want to report on half-hour intervals at some point.
So, looking at your description, you could do with only 2 tables: Links (hrefs and descriptions) and clicks on the links (containing the date and time of the click and maybe some other data). The drawback of course is that you have to store hundreds of thousands of records, and querying that amount of data can take a lot of time. If this is the case you might consider storing aggregate data on these 2 tables in separate tables and updating those tables on a regular basis.
That design is really bad. Unreason's proposal is better.
If you want to make it nice and easy, you could as well have a single table with these fields:
timeSlice
clickCount
location
userType
with TimeSlice holding the date and time rounded to the hour.
All the rest can be deduced from that, and you would have only 24 * 365 * #locations * #types records per year.
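A hedged sketch of such a table and its update (names and column sizes are illustrative):

CREATE TABLE click_stats (
    timeSlice  DATETIME    NOT NULL,  -- click time rounded down to the hour
    location   VARCHAR(50) NOT NULL,
    userType   VARCHAR(50) NOT NULL,
    clickCount INT         NOT NULL DEFAULT 0,
    PRIMARY KEY (timeSlice, location, userType)
);

UPDATE click_stats
SET clickCount = clickCount + 1
WHERE timeSlice = @currentHour AND location = @location AND userType = @userType;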
Depending on the configuration and what is feasible, with this table design you could even accumulate values in memory and only update the table once every 10 seconds, or any interval up to an hour, depending on the acceptable risk of losing counts.
Is it possible to check in DB2 how many records a specific table contained on a specific day in the past?
I have a table named 'XYZ' and I would like to check its row count for specific days in the past, e.g. for 10.09.2020, 05.09.2020 and 01.09.2020.
In ordinary SQL, without special provisions, no, you can't!
Depending on your usage scenario, there are several ways to achieve this function. Here are three that I can think of:
If your table has a timestamp field (or you can add one) and you can guarantee there will be no rows deleted: you can just count the rows where the timestamp is smaller than your desired date. Cheap, performance-wise, but deletes may make this impossible.
You could set up a procedure that runs daily and counts your rows to write them to a different table. This can also be rather cheap from a performance point of view, but you will be limited to the specific "snapshot" times you configured beforehand, and you may hit conditions where the count procedure did not run and therefore data is missing.
You could create an audit table and a trigger on the table you are interested in to log every insert and delete operation with a timestamp. This is the most performance-heavy solution, but the only one that will always give you a full picture of the row count at any given time.
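As a sketch of the second option (table and column names are hypothetical; schedule the INSERT once per day with whatever job scheduler you use):

CREATE TABLE XYZ_ROWCOUNT_HISTORY (
    SNAPSHOT_DATE DATE   NOT NULL,
    N_ROWS        BIGINT NOT NULL
);

-- run once per day:
INSERT INTO XYZ_ROWCOUNT_HISTORY (SNAPSHOT_DATE, N_ROWS)
SELECT CURRENT DATE, COUNT(*) FROM XYZ;

-- later, look up the count for a given day:
SELECT N_ROWS FROM XYZ_ROWCOUNT_HISTORY WHERE SNAPSHOT_DATE = '2020-09-10';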
I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save an intermediate state of the table using the time travel feature, or to store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
An approach would be to have 3 tables:
one base table in "append only" mode: only inserts are allowed, and updates are written as full new rows, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a live table where you hold the current data (this is the table you would MERGE into, most probably from the base table).
If you choose partitioning and clustering, you can also benefit a lot from discounted long-term storage pricing and scan less data.
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, e.g. the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest this amount of updated data is, even repeatedly querying the entire table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from this that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish updated rows from non-updated ones), but that you can modify the table's UPDATE statements so that this column is set on newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old it is, it will get values in the two new columns last_updated and last_updated_date - unless both values were already set by a previous update, in which case they will be updated rather than added. If there are several updates to the same row between the update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check is stored in last_checked; where this piece of data is held (table, durable config) is implementation-dependent.
discover whether the current check is the first one today. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
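A hedged sketch of such a check in BigQuery Standard SQL, collapsing the two steps by always including yesterday's partition (the table name is a placeholder, and @last_checked is assumed to be passed in as a query parameter):

SELECT *
FROM `my_project.my_dataset.big_table`
WHERE last_updated_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)  -- today plus yesterday, covering the first check of the day
  AND last_updated > @last_checked;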
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a table scan. And, subject to the 'modest' assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table's UPDATE statements were modified to include the two extra columns. (Such rows will sit in the __NULL__ partition together with rows that were never updated.) But I assume that until the UPDATE statements are changed it is impossible to distinguish updated rows from non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need only one extra column instead of two. And you will need to check which approach works better: partitioning, or clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of avoiding table scans and reducing data usage/costs, clustering alone (with fake partitioning) can be as potent as partitioning.
Note 5
In the comments you mentioned there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have an identical structure to MyData, except that MyDataStaging has an additional timestamp field, "batch_timestamp". This timestamp lets me determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a "batch_timestamp" value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will then always contain only a unique list of rows/values that have changed). The process then upserts/merges from MyDataUpdate into MyData, and the same data is exported and downloaded to be loaded into PostgreSQL. Finally, the staging/update tables are emptied appropriately.
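For reference, the staging-to-update step is essentially a MERGE that keeps only the newest staged version of each row; a hedged sketch with made-up project/dataset/column names (id, value):

MERGE `my_project.my_dataset.MyDataUpdate` AS t
USING (
  -- keep only the latest staged version of each row
  SELECT id, value
  FROM (
    SELECT id, value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY batch_timestamp DESC) AS rn
    FROM `my_project.my_dataset.MyDataStaging`
  )
  WHERE rn = 1
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value);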
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
Is there a correlation between the number of rows/number of columns used and its impact within the (MS)SQL database?
A little more background:
We have to store lots of data from measurement devices. These devices ping a string of data back to us around 100 times a day, and these strings contain roughly 300 fields. Assume we have 100 devices in operation; that means we get 10,000 records back every day. At our back-end we split these data strings and have to put them into the database. When these data strings are fixed, that means we add around 10,000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and adding a new column now and then when it's needed.
From the perspective of ease we'd like to choose the first approach. However, that means we're adding 100 * 100 * 300 = 3,000,000 rows each day. Data has to be stored for a year and a month (395 days), so we end up with around 1.2 billion rows, not counting the expected growth.
Is it smarter, from a performance perspective, to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing the 'horizontal' solution, are there recommendations for adding columns to the table?
I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity Attribute Value models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are
What is so bad about EAV, anyway?
dave’s guide to the eav
My guess is these sensors don't just start sending you extra fields. You have to release new sensors or sensor code for this to happen. That's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you this argument is null and void and you may be stuck with an EAV.
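For what it's worth, a minimal EAV ("vertical") sketch in SQL Server, with purely illustrative names and column sizes:

CREATE TABLE reading (
    reading_id BIGINT IDENTITY(1,1) PRIMARY KEY,
    device_id  INT       NOT NULL,
    reading_ts DATETIME2 NOT NULL
);

CREATE TABLE reading_value (
    reading_id  BIGINT       NOT NULL REFERENCES reading (reading_id),
    field_name  VARCHAR(100) NOT NULL,
    field_value VARCHAR(200) NULL,   -- everything stored as text; typecasting is on you
    PRIMARY KEY (reading_id, field_name)
);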
For the horizontal option you can split tables, putting the frequently used columns in one table and the less used in a second; both tables have the same primary key values so you can link the less-used columns back to the more-used ones. You can also use the RDBMS's built-in partitioning functionality to separate each day's (or week's or month's) data from the rest.
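A hedged sketch of that split (table and column names are just examples):

CREATE TABLE measurement_main (
    measurement_id BIGINT       NOT NULL PRIMARY KEY,
    device_id      INT          NOT NULL,
    measured_at    DATETIME2    NOT NULL,
    temperature    DECIMAL(9,3) NULL      -- a frequently queried field
);

CREATE TABLE measurement_extra (
    measurement_id BIGINT NOT NULL PRIMARY KEY
        REFERENCES measurement_main (measurement_id),
    raw_payload    NVARCHAR(MAX) NULL     -- the rarely queried remainder
);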
Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
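For example (hedged, with illustrative table and column names):

CREATE INDEX IX_readings_device_ts ON readings (device_id, entry_ts);  -- device, or device + date-range lookups
CREATE INDEX IX_readings_ts        ON readings (entry_ts);             -- pure date or date-range lookups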
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is equally split between the types. So you have one table with 100 rows of 33 columns. That's it. It never changes. Linked to that is a table with at least 100 rows of 33 columns, to which maybe several new rows are added each day. Finally, linked to the second table is a table with rows of 33 columns that possibly grows by the full 10K every day.
This minimizes the grow-space required by the online database. The warehouse could then denormalize back to one huge table for ease of querying.
I've been asked to take snapshots of certain tables from the database, so in the future we can have a clear view of the situation for any given day in the past. Let's say that one of such tables looks like this:
GKEY  Time_in           Time_out          Category  Commodity
1001  2014-05-01 10:50  NULL              EXPORT    Apples
1002  2014-05-02 11:23  2014-05-20 12:05  IMPORT    Bananas
1003  2014-05-05 11:23  NULL              STORAGE   NULL
The simplest way to do a snapshot would be creating a copy of the table with an extra column SNAPSHOT_TAKEN (datetime) and populating it with an INSERT statement
INSERT INTO UNITS_snapshot (SNAPSHOT_TAKEN, GKEY,Time_in, Time_out, Category, Commodity)
SELECT getdate() as SNAPSHOT_TAKEN, * FROM UNITS
OK, it works fine, but it would make the destination table quite big pretty soon, especially if I'd like to run this query often. A better solution would be checking for changes between the current live table and the latest snapshot and writing only those down, omitting everything that hasn't been changed.
Is there a simple way to write such a query?
EDIT: Possible solution for the "Forward delta" (assuming no deletes from original table)
INSERT INTO UNITS_snapshot
SELECT getdate() AS snap_date,
       r.*,   -- all columns from the original table
       CASE WHEN b.gkey IS NULL THEN 'I' ELSE 'U' END AS change_type
FROM UNITS r
LEFT OUTER JOIN UNITS_snapshot b ON r.gkey = b.gkey
WHERE (r.time_in <> b.time_in OR r.time_out <> b.time_out
       OR r.category <> b.category OR r.commodity <> b.commodity
       OR b.gkey IS NULL)
  AND (b.snap_date = (SELECT MAX(snap_date) FROM UNITS_snapshot) OR b.snap_date IS NULL)
Assumptions: no row from the original table is ever deleted. Probably every field in the WHERE clause should also be wrapped in COALESCE(xxx, '') to avoid comparing NULL values with set ones.
Both Dan Bracuk and ITroubs have made very good comments.
Solution 1 - Daily snapshot
The first solution you proposed is very simple. You can build the snapshot with a simple query and you can also consult it and rebuild any day's snapshot with a very simple query, by just filtering on the SNAPSHOT_TAKEN column.
If you have just some thousands of records, I'd go with this one, without worrying too much about its growing size.
Solution 2 - Daily snapshot with rolling history
This is basically the same as solution 1, but you keep only some of the snapshots over time... to avoid having the snapshot DB growing indefinitely over time.
The simplest approach is just to save the snapshots of the last N days... maybe a month or two of data. A more sophisticated approach is to keep snapshot with a density that depends on age... so, for example, you could have every day of the last month, plus every sunday of the last 3 months, plus every end-of-month of the last year, etc...
This solution requires you to develop a procedure that deletes the snapshots that are no longer required. It's not as simple as using getdate() within a query, but you obtain a good balance between space and historical information. You just need to work out a snapshot retention strategy that suits your needs.
Solution 3 - Forward row delta
Building any type of delta is a much more complex procedure.
A forward delta is built by storing the initial snapshot (as if all rows had been inserted on that day) and then, on the following snapshots, just storing information about the difference between snapshot(N) and snapshot(N-1). This is done by analyzing each row and just storing the data if the row is new or updated or deleted. If the main table does not change much over time, you can save quite a lot of space, as no info is stored for unchanged rows.
Obviously, to handle deltas, you now need 2 extra columns, not just one:
delta id (your snapshot_taken is fine, if you only want 1 delta per day)
row change type (could be D=deleted, I=inserted, U=updated... or something similar)
The main complexity derives from the necessity to identify rows (usually by primary key) so as to calculate if between 2 snapshots any individual row has been inserted, updated, deleted... or none of the above.
The other complexity comes from reading the snapshot DB and building the latest (or any other) snapshot. This is necessary because, having only row differences in the table, you cannot simply select a day's snapshot by filtering on snapshot_taken.
This is not easy in SQL. For each row you must take into account just the final version... the one with the MAX snapshot_taken that is <= the date of the snapshot you want to build. If it is an insert or update, keep the data for that row; if it is a delete, ignore it.
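A hedged sketch of that rebuild (column names follow the discussion above; @as_of is the snapshot date you want):

SELECT s.*
FROM UNITS_snapshot s
JOIN (
    SELECT gkey, MAX(snapshot_taken) AS last_change
    FROM UNITS_snapshot
    WHERE snapshot_taken <= @as_of
    GROUP BY gkey
) latest
  ON  latest.gkey = s.gkey
  AND latest.last_change = s.snapshot_taken
WHERE s.change_type IN ('I', 'U');   -- a final 'D' means the row no longer existed at @as_of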
To build a delta of snapshot(N), you must first build the latest snapshot (N-1) from the snapshot DB. Then you must compare the two snapshots by primary key or row identity and calculate the change type (I/U/D) and insert the changes in the snapshot DB.
Beware that you cannot delete old snapshot data without consolidating it first. That is because all snapshots are calculated from the oldest initial one plus all the subsequent difference data. If you want to remove a year's worth of old snapshots, you'll have to consolidate the old initial snapshot and all of that year's variations into a new initial snapshot.
Solution 4 - Backward row delta
This is very similar to solution 3, but a bit more complex.
A backward delta is built by storing the final snapshot and then, on the following snapshots, just storing information about the difference between snapshot(N-1) and snapshot(N).
The advantage is that the latest snapshot is always readily available through a simple select on the snapshot DB. You only need to merge the difference data when you want to retrieve an older snapshot. Compare this to the forward delta, where you always need to rebuild the snapshot from the difference data unless you are actually interested in the very first snapshot.
Another advantage (compared to solution 3) is that you can remove older snapshots by just deleting the difference data older than a particular snapshot. You can do this easily because snapshots are calculated from the final snapshot and not from the initial one.
The disadvantage is the obscure logic. Difference data is calculated backwards. Values must be stored on the (U)pdate and (D)elete variations, but are unnecessary on the (I)nsert variations. Going backwards, rows must be ignored if the first variation you find is an (I)nsert. Doable, but a bit trickier.
Solution 5 - Forward and backward column delta
If the main table has many columns, or many long text or varchar columns, and only a bunch of these are updated, then it could make sense to store only column variations instead of row variations.
This is done by using a table with this structure:
delta id (your snapshot_taken is fine, if you only want 1 delta per day)
change type (could be D=deleted, I=inserted, U=updated... or something similar)
column name
value
The difference can be calculated forward or backward, as per row deltas.
I've seen this done, but I really advise against it. There are just too many disadvantages and added complexity.
Value is a text or varchar, and there are typecasting issues to handle if you have numeric, boolean or date/time values... and, if you have a lot of these, it could very well be you won't be saving as much space as you think you are.
Rebuilding any snapshot is hell. Altogether... any operation on this type of table really requires a lot of knowledge of the main table's structure.
I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL, and I'm aware that this particular index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
table log {
int duration
date time
string event
}
I'm doing queries like
SELECT avg(duration) FROM log WHERE event = 'finished';                          -- gets average time to completion
SELECT avg(duration) FROM log WHERE event = 'finished' AND date > $yesterday;    -- average today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way to do this entirely with DB-side optimisations such as indexes.
The performance of calculating an average will always get slower the more records you have, as it always has to use the value from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index on the field that you want the average of generally isn't helpful, as you don't want to do a lookup: you just want to get at all the data as efficiently as possible. Typically you would add the field as an extra (covering) column to an index that is already used by the query.
It depends on what you are doing. If you aren't filtering the data then, beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) that will do things like keeping running sums and averages of the information you wish to examine. It all depends on what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming a sizeable table detail(id, dimA, dimB, dimC, value), if you would like the performance of AVG (or other aggregate functions) to be nearly constant regardless of the number of records, you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA (furthermore, this table could make sense in your design anyway, as it can hold the domain of values available for dimA in detail, along with other attributes related to those domain values; you might/should already have such a table).
This table is only helpful if you analyze by dimA alone; once you need AVG(value) by dimA and dimB it becomes useless. So you need to know by which attributes you will want to do fast analysis. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example let us assume that system predominantly does inserts and only occasionally updates and deletes.
Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that do not update the aggregates on every insert, but, let's say, only on every 100th insert.
This way you can still get accurate aggregates from this table and the details table with
SELECT a.dimA, (SUM(d.value) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM detail d
INNER JOIN dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA
The above query, with proper indexes, would get one row from dimA_agg and fewer than 100 rows from detail - this would perform in near constant time (roughly log_fanout(n)) and would not require an update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example; you should find the optimal value yourself (or even keep it variable, though triggers alone will not be enough in that case).
Handling deletes and updates must fire on each operation, but you can still check whether the id of the record to be deleted or updated is already included in the stats, to avoid unnecessary updates (this will save some I/O).
Note: the analysis is done for a domain with discrete attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views.
Just a guess, but indexes won't help much, since an average must read all the records (in any order). Indexes are useful for finding subsets of rows, but if you have to iterate over all rows with no special ordering, indexes don't help...
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for all records up to Date1, then store that batch's average along with Date1 and the number of records you averaged. The next time you compute, you restrict your query to results in Date1..Date2, add in the new count, and update the last date queried. You then have all the information you need to compute the new average (the old average weighted by its count, combined with the new batch).
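A hedged PostgreSQL sketch of that bookkeeping (the state table and its columns are made up; the log columns follow the queries above):

CREATE TABLE avg_state (
    event     text   PRIMARY KEY,
    total_dur bigint NOT NULL,
    n_rows    bigint NOT NULL,
    last_date date   NOT NULL
);

-- periodically fold newly arrived rows into the running totals
-- (assumes avg_state has been seeded once per event)
UPDATE avg_state s
SET total_dur = s.total_dur + d.sum_dur,
    n_rows    = s.n_rows    + d.cnt,
    last_date = d.max_date
FROM (
    SELECT l.event, SUM(l.duration) AS sum_dur, COUNT(*) AS cnt, MAX(l.date) AS max_date
    FROM log l
    JOIN avg_state st ON st.event = l.event
    WHERE l.date > st.last_date
    GROUP BY l.event
) d
WHERE s.event = d.event;

-- the overall average is then simply:
SELECT event, total_dur::numeric / n_rows AS avg_duration FROM avg_state;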
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.