How many records were counted in specific day - sql

Is it possible to check in DB2 how many records a specific table contained on a specific day in the past?
I have a table named 'XYZ' and I would like to check its row count for specific days, e.g. for 10.09.2020, 05.09.2020 and 01.09.2020.

In ordinary SQL, without special provisions, no, you can't!
Depending on your usage scenario, there are several ways to achieve this. Here are three that I can think of:
If your table has a timestamp field (or you can add one) and you can guarantee that no rows will be deleted: you can just count the rows where the timestamp is earlier than your desired date. This is cheap performance-wise, but deletes make it unreliable.
You could set up a procedure that runs daily, counts your rows and writes the count to a different table. This can also be rather cheap from a performance point of view, but you will be limited to the specific "snapshot" times you configured beforehand, and data will be missing for any day on which the count procedure did not run.
You could create an audit table and a trigger on the table you are interested in to log every insert and delete operation with a timestamp. This is the most performance-heavy solution, but the only one that will always give you a full picture of the row count at any given time.
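The third option (audit table plus triggers) can be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module rather than DB2, so the trigger syntax will differ on a real DB2 system. The names xyz_audit, delta, make_audited_table and row_count_at are made up for the example.

```python
import sqlite3

def make_audited_table(conn):
    """Create a hypothetical table xyz plus an audit log filled by triggers."""
    conn.executescript("""
        CREATE TABLE xyz (id INTEGER PRIMARY KEY, payload TEXT);
        -- One audit row per insert (+1) or delete (-1), stamped with UTC time.
        CREATE TABLE xyz_audit (changed_at TEXT NOT NULL, delta INTEGER NOT NULL);
        CREATE TRIGGER xyz_ins AFTER INSERT ON xyz
        BEGIN
            INSERT INTO xyz_audit VALUES (datetime('now'), 1);
        END;
        CREATE TRIGGER xyz_del AFTER DELETE ON xyz
        BEGIN
            INSERT INTO xyz_audit VALUES (datetime('now'), -1);
        END;
    """)

def row_count_at(conn, moment):
    """Row count of xyz as of 'YYYY-MM-DD HH:MM:SS' (UTC), rebuilt from the log."""
    (count,) = conn.execute(
        "SELECT COALESCE(SUM(delta), 0) FROM xyz_audit WHERE changed_at <= ?",
        (moment,),
    ).fetchone()
    return count
```

Calling row_count_at(conn, "2020-09-10 23:59:59") would then give the count at the end of that day, but only for changes made after the triggers were created.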

Related

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
An approach would be to have 3 tables:
a base table in "append only" mode: inserts are appended, and every update is appended as a full row, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if a special column is kept in the first table)
a live table where you hold the current data (this is the table you would run your MERGE statements against, most probably from the base table).
If you choose partitioning and clustering, you can benefit from the discounted long-term storage price and scan less data.
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, e.g. the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the table all day could be affordable, because the BQ engine will scan one daily partition only.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones) but you can modify the table UPDATE statements so that this column will be added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old this row is, it will acquire two new columns, last_updated and last_updated_date, unless both columns have already been added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between the update checks, then the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked and where this piece of data is held (table, durable config) is implementation dependent.
discover whether the current check is the first check of the day. If so, then additionally search for last_updated_date=yesterday AND last_updated>last_checked.
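The two steps of the check can be sketched as below. This is a conceptual illustration only: it uses Python's sqlite3 instead of BigQuery, so there is no real partition pruning, and the table name big_table, the id column and the function find_updated_ids are assumptions made for the example. It also assumes timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text.

```python
import sqlite3
from datetime import datetime, timedelta

def find_updated_ids(conn, last_checked, now):
    """Return ids of rows updated since last_checked, touching only the
    date buckets (partitions, in BigQuery terms) that can contain them."""
    dates = [now.date().isoformat()]
    if last_checked.date() < now.date():
        # First check of the day: rows updated late yesterday still carry
        # yesterday's last_updated_date, so include that bucket as well.
        dates.append((now.date() - timedelta(days=1)).isoformat())
    placeholders = ",".join("?" for _ in dates)
    rows = conn.execute(
        f"SELECT id FROM big_table "
        f"WHERE last_updated_date IN ({placeholders}) AND last_updated > ? "
        f"ORDER BY id",
        (*dates, last_checked.isoformat(sep=" ", timespec="seconds")),
    ).fetchall()
    return [row[0] for row in rows]
```

After each run the caller would store `now` as the new last_checked value. Like the description above, the sketch assumes checks run at least once a day.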
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a full table scan. And subject to the ‘modest’ assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that had been updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have a structure identical to MyData, except that MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (so MyDataUpdate always contains only a unique list of rows/values that have changed). The process then upserts/merges MyDataUpdate into MyData, and the same rows are exported and downloaded to be loaded into PostgreSQL. Finally, the staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
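The staging flow described above might look roughly like this. It is a rough sketch only: sqlite3 stands in for BigQuery, INSERT OR REPLACE stands in for MERGE, and the snake_case table names, the two-column layout and the function names are assumptions made for illustration.

```python
import sqlite3

def create_tables(conn):
    conn.executescript("""
        CREATE TABLE my_data (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE my_data_update (id INTEGER PRIMARY KEY, value TEXT);
        CREATE TABLE my_data_staging (id INTEGER, value TEXT, batch_timestamp TEXT);
    """)

def process_staging(conn):
    """Collapse staging to the latest version per id, merge it into the main
    table, and return the changed rows for export (e.g. to PostgreSQL)."""
    conn.execute("DELETE FROM my_data_update")
    # Keep only the newest staged version of each row.
    conn.execute("""
        INSERT OR REPLACE INTO my_data_update (id, value)
        SELECT s.id, s.value
        FROM my_data_staging AS s
        WHERE s.batch_timestamp = (SELECT MAX(t.batch_timestamp)
                                   FROM my_data_staging AS t
                                   WHERE t.id = s.id)
    """)
    # Upsert the changed rows into the big table.
    conn.execute("""
        INSERT OR REPLACE INTO my_data (id, value)
        SELECT id, value FROM my_data_update
    """)
    changed = conn.execute(
        "SELECT id, value FROM my_data_update ORDER BY id").fetchall()
    conn.execute("DELETE FROM my_data_staging")
    conn.commit()
    return changed
```

The returned list corresponds to the contents of MyDataUpdate, i.e. the small set of rows that has to be synchronized, so the large table never needs to be scanned for changes.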

Incremental extraction from DB2

What would be the most efficient way to select only rows from DB2 table that are inserted/updated since the last select (or some specified time)? There is no field in the table that would allow us to do this easily.
We are extracting data from the table for purposes of reporting, and now we have to extract the whole table every time, which is causing big performance issues.
I found example on how to select only rows changed in last day:
SELECT * FROM ORDERS
WHERE ROW CHANGE TIMESTAMP FOR ORDERS >
CURRENT TIMESTAMP - 24 HOURS;
But, I am not sure how efficient this would be, since the table is enormous.
Is there some other way to select only rows that are changed, that might be more efficient that this?
I also found solution called ParStream. This seems as something that can speed up demanding queries on the data, but I was unable to find any useful documentation about it.
I propose these options:
You can use Change Data Capture, and this will replay automatically the modifications to another data source.
Normally, a select statement does not guarantee the order of the rows. That means that you cannot use a select without a time reference to retrieve the most recent rows, so you need a time column. You can keep track of the most recent row's time in a global variable, and the next time retrieve the rows with a time later than that variable. If you want to increase performance, you can put the table in append mode, so that the new rows are physically stored together. Keeping an index on this time column could be expensive to maintain, but it will speed up extraction (no table scan) when you need to read the rows.
If your server is DB2 for i, use database journaling. You can extract after images of inserted records by time period or journal entry number from the journal receiver(s). The data entries can then be copied to your target file.
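The time-column option above amounts to keeping a "high-water mark" between extractions. A minimal sketch, using sqlite3 in place of DB2 and assuming a hypothetical orders table with an indexed changed_at column (on DB2 a ROW CHANGE TIMESTAMP column could play that role):

```python
import sqlite3

def create_orders(conn):
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, changed_at TEXT NOT NULL);
        -- The index lets the extraction avoid a full table scan.
        CREATE INDEX orders_changed_at ON orders (changed_at);
    """)

def extract_since(conn, high_water_mark):
    """Return rows changed after high_water_mark plus the new mark to store."""
    rows = conn.execute(
        "SELECT id, changed_at FROM orders WHERE changed_at > ? "
        "ORDER BY changed_at, id",
        (high_water_mark,),
    ).fetchall()
    new_mark = rows[-1][1] if rows else high_water_mark
    return rows, new_mark
```

The caller persists new_mark and passes it to the next run. One caveat of a strict "greater than" comparison: a row committed later with exactly the same timestamp as the stored mark would be missed, so a real implementation may want to overlap slightly and deduplicate.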

How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that a particular index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
table log {
int duration
date time
string event
}
I'm doing queries like
SELECT AVG(duration) FROM log WHERE event = 'finished'; -- gets average time to completion
SELECT AVG(duration) FROM log WHERE event = 'finished' AND date > $yesterday; -- average today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way I can do this entirely by DB-side optimisations such as indices.
Calculating an average will always get slower the more records you have, as it always has to use values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index for the field that you want the average for generally isn't helpful as you don't want to do a lookup, you just want to get to all the data as efficiently as possible. Typically you would add the field as an output field in an index that is already used by the query.
Depends what you are doing? If you aren't filtering the data then beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) which will do things like keeping running sums and averages over the information you wish to examine. It all depends on what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming sizeable table detail(id, dimA, dimB, dimC, value) if you would like to make the performance of AVG (or other aggregate functions) be nearly constant time regardless of number of records you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA. Furthermore, this table could make sense in your design, as it can hold the domain of the values available for dimA in detail (and other attributes related to the domain values); you might/should already have such a table.
This table is only helpful if you will analyze by dimA only; once you need AVG(value) according to dimA and dimB it becomes useless. So, you need to know by which attributes you will want to do fast analysis. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ... which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example let us assume that system predominantly does inserts and only occasionally updates and deletes.
Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that would not fire on every insert but, let's say, on every 100 inserts.
This way you can still get accurate aggregates from this table and the detail table with
SELECT a.dimA,
       (COALESCE(SUM(d.value), 0) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM dimA_agg a
LEFT JOIN detail d ON d.dimA = a.dimA AND d.id > a.LastID  -- LEFT JOIN keeps groups with no rows newer than LastID
GROUP BY a.dimA
The above query with proper indexes would get one row from dimA_agg and fewer than 100 rows from detail. This would perform in near constant time (~log n, with the index fanout as the base of the logarithm) and would not require an update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example, you should find optimal value yourself (or even keep it variable, though triggers only will not be enough in that case).
Maintaining deletes and updates must fire on each operation but you can still inspect if the id of the record to be deleted or updated is in the stats already or not to avoid the unnecessary updates (will save some I/O).
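To make the batched-aggregate idea concrete, here is a small sketch using Python's sqlite3. It is not the trigger-based implementation described above: the periodic roll-up is an explicit function call, and the column names total, row_count and last_id are adapted from Total, Count and LastID for the example. It also assumes that dimA_agg already has one row per dimA value and that the table is insert-only.

```python
import sqlite3

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE detail (id INTEGER PRIMARY KEY, dimA TEXT, value REAL);
        CREATE TABLE dimA_agg (dimA TEXT PRIMARY KEY, total REAL,
                               row_count INTEGER, last_id INTEGER);
    """)

def roll_up(conn):
    """Fold detail rows newer than last_id into the stored aggregates.
    All right-hand sides see the row's values from before the update."""
    conn.execute("""
        UPDATE dimA_agg SET
            total = total + (SELECT COALESCE(SUM(d.value), 0) FROM detail d
                             WHERE d.dimA = dimA_agg.dimA AND d.id > dimA_agg.last_id),
            row_count = row_count + (SELECT COUNT(*) FROM detail d
                             WHERE d.dimA = dimA_agg.dimA AND d.id > dimA_agg.last_id),
            last_id = (SELECT COALESCE(MAX(d.id), dimA_agg.last_id) FROM detail d
                       WHERE d.dimA = dimA_agg.dimA)
    """)

def current_averages(conn):
    """Exact averages: stored aggregates plus the not-yet-rolled-up tail."""
    rows = conn.execute("""
        SELECT a.dimA,
               (MAX(a.total) + COALESCE(SUM(d.value), 0)) * 1.0
                   / (MAX(a.row_count) + COUNT(d.id))
        FROM dimA_agg a
        LEFT JOIN detail d ON d.dimA = a.dimA AND d.id > a.last_id
        GROUP BY a.dimA
    """).fetchall()
    return dict(rows)
```

However often roll_up runs, current_averages stays exact; the roll-up frequency only controls how many detail rows each query has to read.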
Note: The analysis is done for a domain with discrete attributes; when dealing with time series the situation gets more complicated, as you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views.
Just a guess, but indexes won't help much, since the average must read all the records (in any order). Indexes are useful to find subsets of rows, but if you have to iterate over all rows with no special ordering, indexes do not help.
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for records 1 - Date1 then store the average for that batch along with Date1 and the #records you averaged. The next time you compute, you restrict your query to results Date1..Date2, and add the # of records, and update the last date queried. You have all the information you need to compute the new average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.
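The arithmetic behind this incremental approach is small enough to show directly. A sketch in Python, where merge_average is a hypothetical helper and the stored state is the previous average together with its record count:

```python
def merge_average(old_avg, old_count, batch_sum, batch_count):
    """Combine a stored average over old_count records with a new batch
    (given as its sum and record count) without re-reading the old rows."""
    total_count = old_count + batch_count
    if total_count == 0:
        return 0.0, 0
    new_avg = (old_avg * old_count + batch_sum) / total_count
    return new_avg, total_count
```

For example, a stored average of 10.0 over 2 records merged with a batch of 2 records summing to 40.0 gives (10.0 * 2 + 40.0) / 4 = 15.0 over 4 records. The batch sum and count would come from a query restricted to the Date1..Date2 range.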

Combine three tables into one, or too many columns?

I am tracking clicks over three time periods: the past day, past week and past month.
To do this, I have three tables:
An hourly table, with columns link_id, two other attributes, and hour_1 to hour_24, together with a computed column giving the sum
A weekday table, with columns click_id, two other attributes, and day_1 to day_7, together with a computed column giving the sum
A monthday table, as above, with columns day_1 to day_31
When a click comes in, I store its key attributes like href, description, etc, in other tables, and insert or update the row(s) corresponding to the link_id in each of the above tables.
Each link can have several entries in each of the above hourly/weekday/monthday tables, depending on the two other attributes (e.g. where the user is sitting).
So if a user is Type A and sitting in X, three rows are created or added to in the above tables -- the first row records all clicks on that link over the time period, the second row records all clicks by "Type A people", and the third "All clicks by people in X".
I have designed it this way as I didn't want to have to move data around each hour/day/week/month. I just maintain pointers for "current hour" (1-24), "current day" (1-31) and "current weekday" (1-7), and write to the corresponding cells in the tables. When we enter a new period (e.g. "3pm-4pm"), I can just blank out that current column (e.g. hour_15), then start incrementing it for links as they come in. Every so often I can delete old rows which have fallen down to "all zero".
This way I shouldn't ever have to move around column data, which would likely be very expensive for what will potentially be tens of thousands of rows.
I will only be SELECTing either the current day/weekday/hour rows (prior to inserting/updating) or the TOP 20 values from the computed columns based on the attributes (and will likely cache these results for an hour or so).
After the tables populate, UPDATES will far exceed INSERTs as there aren't that many unique hrefs.
Three questions:
Is it OK to combine the three big tables into one big table of monthdays/weekdays/hours? This would give a table with 64 columns, which I'm not sure is overkill. On the other hand, keeping them separate like they are now triples the number of INSERT/UPDATE statements needed. I don't know enough about SQL server to know which is best.
Is this approach sensible? Most data sets I've worked with of course have a separate row per item and you would then sort by date -- but when tracking clicks from thousands of users this would give me many hundreds of thousands of rows, which I would have to cull very often, and ordering and summing them would be hideous. Once the tracker is proven, I have plans to roll the click listener out over hundreds of pages, so it needs to scale.
In terms of design, clearly there is some redundancy in having both weekdays and monthdays. However, this was the only way I could think of to maintain a pointer to a column and quickly update it, and use a computed column. If I eliminated the weekdays table, I would need to get an additional computed column on the "monthdays" that summed the previous 7 days -- (e.g. if today is the 21st, then sum day_14, day_15, day_16... day_20). The calculation would have to update every day, which I imagine would be expensive. Hence the additional "weekday" table for a simple static calculation. I value simple and fast calculations more highly than small data storage.
Thanks in advance!
Anytime you see columns with numbers in their names, such as column_1, column_2, column_3... your 'horrible database design' flag should go up. (FYI, here you are breaking 1NF; specifically, you are repeating groups across columns.)
Now, it is possible that such an implementation can be acceptable (or even necessary) in production, but conceptually it is definitely wrong.
As Geert says, conceptually two tables will suffice. If performance is an issue you could denormalize data for weekly/monthly stats, but I would still not model them as above. I would keep the following:
CREATE TABLE base_stats ( link_id INT, click_time DATETIME )
CREATE TABLE daily_stats ( link_id INT, period DATETIME, clicks INT )
You can always aggregate with
SELECT link_id, count(*) as clicks, DATE(click_time) as day
FROM base_stats
GROUP BY link_id, day
which can be run periodically to fill the daily_stats. If you want to keep it up to date you can implement it in triggers (or if you really must, do it on the application side). You can also denormalize the data on different levels if necessary (by creating more aggregate tables, or by introducing another column in the aggregated data table), but that might be premature optimization.
The above design is much cleaner for future ad-hoc analysis (will happen with stats). For other benefits see wikipedia on repeating groups.
EDIT:
Even though the solution with two tables, base_stats and daily_stats, is accepted, with the following strategy:
insert each click in base_stats
periodically aggregate the data from base_stats into daily_stats and purge the full detail
it might not be the optimal solution.
Based on discussions and clarification of requirements it seems that the table base_stats is not necessary. The following approach should be also investigated:
CREATE TABLE period_stats ( link_id INT, period DATETIME, ...)
Updates are easy with
UPDATE period_stats
SET clicks = clicks + 1
WHERE period = #dateTime AND link_id = #url AND ...
Updating this table, properly indexed, is as efficient as inserting rows into base_stats, and it is also easy to use it for analysis
SELECT link_id, SUM(clicks)
FROM period_stats
WHERE period between #dateTime1 AND #dateTime2
GROUP BY ...
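A runnable sketch of the period_stats approach, using Python's sqlite3 rather than SQL Server. The hourly bucket size, the (link_id, period) key and the helper names are assumptions for illustration; the elided extra columns are left out.

```python
import sqlite3
from datetime import datetime

def create_period_stats(conn):
    conn.execute("""
        CREATE TABLE period_stats (
            link_id INTEGER NOT NULL,
            period  TEXT    NOT NULL,      -- start of the hour, as text
            clicks  INTEGER NOT NULL,
            PRIMARY KEY (link_id, period)  -- also serves as the lookup index
        )
    """)

def hour_bucket(moment):
    """Round a datetime down to the hour: 'YYYY-MM-DD HH:00:00'."""
    return moment.replace(minute=0, second=0, microsecond=0).isoformat(sep=" ")

def record_click(conn, link_id, moment):
    """Increment the counter for this link and hour, creating it if needed."""
    period = hour_bucket(moment)
    cur = conn.execute(
        "UPDATE period_stats SET clicks = clicks + 1 "
        "WHERE link_id = ? AND period = ?", (link_id, period))
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO period_stats (link_id, period, clicks) VALUES (?, ?, 1)",
            (link_id, period))

def clicks_between(conn, start, end):
    """Total clicks per link for the hour buckets from start to end inclusive."""
    rows = conn.execute(
        "SELECT link_id, SUM(clicks) FROM period_stats "
        "WHERE period BETWEEN ? AND ? GROUP BY link_id",
        (hour_bucket(start), hour_bucket(end))).fetchall()
    return dict(rows)
```

Day, week and month totals all come from the same table by widening the range passed to clicks_between, so no numbered columns or separate weekday/monthday tables are needed.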
Denormalization as you have done in your database can be a good solution for some problems. In your case however I would not choose the above solution mainly because you lose information that you might need in the future, maybe you want to report on half-hour intervals in the future.
So looking at your description, you could do with only 2 tables: Links (hrefs and descriptions) and clicks on the links (containing the date and time of the click and maybe some other data). The drawback of course is that you have to store hundreds of thousands of records, and querying this amount of data can take a lot of time. If this is the case you might consider storing aggregate data on these 2 tables in separate tables and updating those tables on a regular basis.
That design is really bad. Unreason's proposal is better.
If you want to make it nice and easy, you could just as well have a single table with four fields:
timeSlice
clickCount
location
userType
with TimeSlice holding the date and time rounded to the hour.
All the rest can be deduced from that, and you would have only
24 * 365 * locations# * types#
records per year.
Always depending on the configuration and feasibility, with this table design, you could eventually accumulate values in memory and only update the table once per 10 sec. or any time length <= 1 hour, depending on acceptable risk

SQL Is it possible to setup a column that will contain a value dependent on another column?

I have a table (A) that lists all bundles created off a machine in a day. It lists the date created and the weight of the bundle. I have an ID column, a date column, and a weight column. I also have a table (B) that holds the details related to that machine for the day. In that table (B), I want a column that lists a sum of weights from the other table (A) that the dates match on. So if the machine runs 30 bundles in a day, I'll have 30 rows in table (A) all dated the same day. In table (B) I'll have 1 row detailing other information about the machine for the day plus the column that holds the total bundle weight created for the day.
Is there a way to make the total column in table (B) automatically adjust itself whenever a row is added to table (A)? Is this possible to do in the table schema itself rather than in an SQL statement each time a bundle is added? If it's not, what sort of SQL statement do I need?
Wes
It would be a mistake to do so unless you have performance problems that require it.
A better approach is to define a view in the database that will aggregate the daily bundles by machine:
CREATE VIEW MachineDailyTotals
(MachineID, RunDate, BundleCount, TotalWeight)
AS SELECT MachineID, RunDate, COUNT(*), SUM(WeightCol)
FROM BundleListTable
GROUP BY MachineID, RunDate
This will allow you to always see the correct, updated total weight per machine per day without imposing any load on the database until you actually look at the data. You can perform a simple OUTER JOIN with the machine table to get information about the machine, including the daily total info, without having to actually store the totals anywhere.
If you need the sum (or other aggregate) in real time, add a trigger on table A for INSERT, UPDATE, DELETE which calculates the sum to be stored in B.
Otherwise, add a daily job which calculates the sums.
Please specify which database you are using.
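Since the database was not specified, the trigger variant can only be sketched generically. The example below uses SQLite through Python's sqlite3; the table names bundles and machine_day and the column names are hypothetical, and trigger syntax differs between vendors.

```python
import sqlite3

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE bundles (id INTEGER PRIMARY KEY, run_date TEXT, weight REAL);
        CREATE TABLE machine_day (run_date TEXT PRIMARY KEY,
                                  total_weight REAL NOT NULL DEFAULT 0);

        CREATE TRIGGER bundles_ins AFTER INSERT ON bundles
        BEGIN
            -- Make sure the daily row exists, then add the new weight.
            INSERT OR IGNORE INTO machine_day (run_date, total_weight)
                VALUES (NEW.run_date, 0);
            UPDATE machine_day SET total_weight = total_weight + NEW.weight
                WHERE run_date = NEW.run_date;
        END;

        CREATE TRIGGER bundles_del AFTER DELETE ON bundles
        BEGIN
            UPDATE machine_day SET total_weight = total_weight - OLD.weight
                WHERE run_date = OLD.run_date;
        END;

        CREATE TRIGGER bundles_upd AFTER UPDATE ON bundles
        BEGIN
            -- Remove the old contribution, then add the new one.
            UPDATE machine_day SET total_weight = total_weight - OLD.weight
                WHERE run_date = OLD.run_date;
            INSERT OR IGNORE INTO machine_day (run_date, total_weight)
                VALUES (NEW.run_date, 0);
            UPDATE machine_day SET total_weight = total_weight + NEW.weight
                WHERE run_date = NEW.run_date;
        END;
    """)

def daily_total(conn, run_date):
    row = conn.execute(
        "SELECT total_weight FROM machine_day WHERE run_date = ?",
        (run_date,)).fetchone()
    return row[0] if row else 0.0
```

With these triggers in place, every insert, update or delete on the bundle table keeps the stored daily total in step, at the cost of extra work on each write.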
Are you sure that you don't want to pull this information dynamically rather than storing it in a separate table? This seems like an indirect violation of Normalization rules in that you'll be storing the same information in two different places. With a dynamic query, you'll always be sure that the derived information will be correct without having to worry about the coding and maintenance of triggers.
Of course, if you are dealing with large amounts of data and query times are becoming an issue, you may want the shortcut of a summary table. But, in general, I'd advise against it.
This can be accomplished via triggers, which are little bits of code that execute whenever a certain action (insert/update/delete) happens on a table. The syntax varies by vendor (MySQL vs. Oracle), but the language is typically the same one you would write a stored procedure in.
If you mention the DB type I can help with the actual syntax