A lot of events of various types are triggered in my application, and I want to count them over time to keep track.
I'm trying to figure out the best way to do this. I will have multiple servers and threads saving events, so it has to work during concurrency.
I cannot have one row per event: the number of events is very large, so the counts have to be aggregated somehow.
So, I'm trying to have a table row per event type and "time interval" like
COLUMN
------
ID
EVENTTYPE
COUNT
FIRSTTIMESTAMP
LASTTIMESTAMP
I first tried to make a solution where a new row was created by the logger:
UPDATE EVENTCOUNTER SET COUNT = COUNT + 1 WHERE LASTTIMESTAMP > CURRENT TIMESTAMP AND EVENTTYPE = ?;
If num rows updated = 0 then insert a new row with new timestamps.
However, to make this work, I would have to lock the entire table so that there is no race condition when multiple threads create new rows.
i.e.
LOCK TABLE EVENTCOUNTER IN EXCLUSIVE MODE;
UPDATE EVENTCOUNTER SET COUNT = COUNT + 1
    WHERE LASTTIMESTAMP > CURRENT TIMESTAMP AND EVENTTYPE = ?;
-- if no rows were updated:
INSERT INTO EVENTCOUNTER (EVENTTYPE, COUNT, FIRSTTIMESTAMP, LASTTIMESTAMP) VALUES (...);
COMMIT;
Will this table lock impact performance by a great deal? Is there a better way to solve my problem without table locks?
Using a DB2 database and Java Client - actually Hibernate if that matters.
I would insert a row for each event and run a script every 24h or so that aggregates the information and puts the aggregated data in a separate table. This is the classical way OLAP (analysis services etc.) works.
Since you have many inputs, consider using a less restrictive lock (here are the transaction locks for DB2). After all, if your aggregated result misses a few entries out of hundreds, it's not that bad.
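The aggregation script itself could then be quite simple; a sketch, assuming the raw events land in an EVENTLOG table with a CREATED_TS column (those names are illustrative, not part of your schema):

INSERT INTO EVENTCOUNTER (EVENTTYPE, COUNT, FIRSTTIMESTAMP, LASTTIMESTAMP)
SELECT EVENTTYPE, COUNT(*), MIN(CREATED_TS), MAX(CREATED_TS)
FROM EVENTLOG
WHERE CREATED_TS >= CURRENT TIMESTAMP - 1 DAY
GROUP BY EVENTTYPE;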
I do not know much about DB2, but in Oracle there are insert/update row-level and statement-level triggers. So, every time a user inserts/updates a row, the trigger fires and does something (a row-level trigger), such as updating some log or another table, as in your case. Not sure if this feature is available in DB2. Just FYI...
Related
We have a table that holds people's applications. As users go through an application, the table is updated. Once the form is submitted, a flag is changed from 1 to 2. We want to run a daily count of submissions and keep track of it. So we would have a separate table with a column for the date and a column for the total submissions for that day. I want that daily count to be incremented each time a row is updated from 1 to 2. I was thinking of using a trigger, something like this:
CREATE TRIGGER daily_submission_count
AFTER UPDATE ON [form_info_table]
FOR EACH ROW
WHEN (new.status = 2)
BEGIN
    UPDATE [daily_count_table]
    SET daily_count = daily_count + 1
    WHERE [date column = today];
END;
Of course, with this approach I would have to initialize the row with another rule that runs at 12:00am and sets the date and the count to 0. I am having trouble finding the right way to write the SQL, or deciding whether there is a better approach to this.
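The midnight initialisation I have in mind would be something along these lines (not sure about the exact syntax; the date column name is a placeholder):

INSERT INTO [daily_count_table] ([date column], daily_count)
VALUES (TRUNC(SYSDATE), 0);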
You could do that but it is a bad idea. Think about that ONE person that updates that FORM_INFO_TABLE and steps away from their desk before ending the transaction (or in more realistic terms, the service stalls or crashes). Now you have an open lock on DAILY_COUNT_TABLE.
Every single other attempt to increment that count is now dead in the water. Your system comes to a halt. Even in the best case situation, where everything is running smoothly, you have effectively reduced your system to a single transaction at a time, because whoever is incrementing the daily count blocks everyone else, even if it is just for a moment.
The first thing I would look at is: how expensive is it to derive the daily count with NO additional database structures, i.e., is it really that expensive to run:
SELECT COUNT(*)
FROM FORM_INFO_TABLE
WHERE STATUS = 2
AND <some appropriate date range>
because if that is fast enough (for the amount of times you need to run it), then you're done.
If that alone is too slow, then the next option might be an index on only the records with a status of 2, e.g.
CREATE INDEX MY_INDEX ON FORM_INFO_TABLE ( CASE WHEN STATUS = 2 THEN DATE_COLUMN END )
because then the only entries in that index are rows that contain a status of 2. An appropriate query to scan just that index is your next option - no extra code needed, only a small transaction overhead on STATUS=2 rows, to get that daily count efficiently.
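For example, a count that repeats the index expression so the optimiser can use that index (DATE_COLUMN and the "today" range are assumptions about your schema):

SELECT COUNT(*)
FROM FORM_INFO_TABLE
WHERE CASE WHEN STATUS = 2 THEN DATE_COLUMN END >= TRUNC(SYSDATE)
AND CASE WHEN STATUS = 2 THEN DATE_COLUMN END < TRUNC(SYSDATE) + 1;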
If you really have some complex structures for which both of the above are not possible, then you could look at a queueing mechanism, where the trigger is along the lines of
CREATE TRIGGER daily_submission_count
AFTER UPDATE ON [form_info_table]
FOR EACH ROW
WHEN (new.status = 2)
BEGIN
    insert into pending_info values (:new.primary_key_col);
    --- (or the same with DBMS_AQ if you prefer)
END;
and then a background task that comes along from time to time to collate those rows and summarise the data appropriately.
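That background task might boil down to something like this (a sketch only; the daily_count_table layout is an assumption):

UPDATE [daily_count_table]
SET daily_count = daily_count + (SELECT COUNT(*) FROM pending_info)
WHERE [date column = today];

DELETE FROM pending_info;  -- in practice, delete only the keys you just counted

COMMIT;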
But it would have to be a pretty special set of circumstances to warrant that complexity.
Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work - or at least I haven't implemented it such that rows are processed only once across all the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in  in_limit        int
   ,in  in_run_group_id varchar(36)
   ,out ot_result table (
         id               bigint
        ,runGroupId       varchar(36)
        ,sourceTableRefId integer
        ,name             nvarchar(22)
        ,location         nvarchar(13)
        ,regionCode       nvarchar(3)
        ,countryCode      nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
         "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"
    )
    select
         in_run_group_id       as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP     as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id      as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
             r.ID             id
            ,r.RUN_GROUP_ID   runGroupId
            ,fp.descriptor_id sourceTableRefId
            ,fp.merch_name    name
            ,fp.Location      location
            ,st.usps          regionCode
            ,'USA'            countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly; it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time where the processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
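Putting the four steps together, a minimal sketch of the flow (the "processed" flag column is an assumption here; it could just as well be the separate tracking table described above):

-- step 1: select and exclusively lock the records to work on
SELECT descriptor_id
FROM acox.searchCriteria
WHERE processed = FALSE   -- not yet processed
LIMIT 100
FOR UPDATE;               -- exclusive row locks, held until COMMIT/ROLLBACK

-- step 2: do the actual processing (call the API, insert into the result table, ...)

-- step 3: mark the locked records as processed
UPDATE acox.searchCriteria
SET processed = TRUE
WHERE descriptor_id IN (...);   -- the ids returned by step 1

-- step 4: commit (or roll back on error)
COMMIT;   -- persists the results and the processed flags, releases the locks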
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.
What I am trying to do
I am developing a web service, which runs in multiple server instances, all accessing the same RDBMS (PostgreSQL). While the database is needed for persistence, it contains very little data, which is why every server instance has a cache of all the data. Further, the application is really simple in that it only ever inserts new rows into rather simple tables and selects that data in a scheduled fashion from all server instances (no updates or changes... only inserts and reads).
The way it is currently implemented
Basically, I have a table which roughly looks like this:
id BIGSERIAL,
creation_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- further data columns...
The server is doing something like this every couple of seconds (pseudocode):
get all rows with creation_timestamp > lastMaxTimestamp
lastMaxTimestamp = max timestamp for all data just retrieved
insert new rows into application cache
The issue I am running into
The application skips certain rows when updating the caches. I analyzed the issue and figured out that the problem is caused in the following way:
one server instance is creating a new row in the context of a transaction. An id for the new row is retrieved from the associated sequence (id=n) and the creation_timestamp (with value ts_1) is set.
another server does the same in the context of a different transaction. The new row in this transaction gets id=n+1 and a creation_timestamp ts_2 (where ts_1 < ts_2).
transaction 2 finishes before transaction 1
one of the servers executes a "select all rows with creation_timestamp > lastMaxTimestamp". It gets row n+1, but not row n. It sets lastMaxTimestamp to ts_2.
transaction 1 completes
some time later the server from step 4 executes "select all rows with creation_timestamp > lastMaxTimestamp" again. But since lastMaxTimestamp=ts_2 and ts_2>ts_1 the row n will never be read on that server.
Note: CURRENT_TIMESTAMP has the same value during a transaction, which is the transaction start time.
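For example, in a psql session:

BEGIN;
SELECT current_timestamp, clock_timestamp();  -- both are close to "now"
-- ... some time passes inside the same transaction ...
SELECT current_timestamp, clock_timestamp();  -- current_timestamp is unchanged,
                                              -- clock_timestamp() has moved on
COMMIT;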
So the application gets inconsistent data into its cache and can't get new rows based on the insertion timestamp OR based on the sequence id. Transaction isolation levels don't really change anything about the situation, since the problem is created in essence by transaction 2 finishing before transaction 1.
My question
Am I missing something? I am thinking there must be a straightforward way to get all new rows of an RDBMS, but I can't come up with a simple solution... at least not with a simple solution that is consistent. Extensive locking (e.g. of tables) wouldn't be acceptable for performance reasons. Simply trying to ensure that all ids from that sequence are seen seems like a) a complicated solution and b) something that can't be done easily, since rollbacks during transactions can happen (which would lead to sequence ids not being used).
Does anyone have a solution?
After a lot of searching, I found the right keywords to google for... "transaction commit timestamp", which leads to all sorts of transaction timestamp tracking and system columns like xmin:
https://dba.stackexchange.com/questions/232273/is-there-way-to-get-transaction-commit-timestamp-in-postgres
This post has some more detailed information:
Questions about Postgres track_commit_timestamp (pg_xact_commit_timestamp)
In short:
you can turn on a PostgreSQL option to track the timestamps of commits and compare those instead of the current_timestamp/clock_timestamp values inside the transaction (see the sketch after this list)
it seems, though, that it is only tracked when a transaction is completed, not when it is committed, which makes the solution not bulletproof. There are also further issues to consider, like transaction id (xmin) rollover, for example
logical decoding / replication is something to look into for a proper solution
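A minimal sketch of the commit-timestamp option (my_table is a placeholder; commit timestamps are only recorded for transactions that run after the setting is enabled):

-- requires a server restart to take effect
ALTER SYSTEM SET track_commit_timestamp = on;

-- afterwards, the commit time of each row's inserting transaction can be read via xmin
SELECT id, pg_xact_commit_timestamp(xmin) AS committed_at
FROM my_table
ORDER BY committed_at;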
Thanks to everyone trying to help me find an answer. I hope this summary is useful to someone in the future.
I am implementing a CQRS pattern where one or more processes are inserting records into the database and one or more processes are pulling them at a different pace.
I'd like consumer processes to poll the database for new records that were inserted since last check, but I'm not sure how to (safely) implement this.
You can assume that rows will not change once they are inserted. It seems it isn't enough for each row to have a unique id and a timestamp indicating when it was inserted.
If I query for records with a timestamp greater than the last row I saw then I run into problems if multiple records were inserted at the same time (having the same timestamp).
If I query for records with an id greater than the last row I saw then I run into problems where concurrent transactions may commit IDs in non-increasing order (e.g. postgreSQL sessions allocate and cache sequence IDs ahead of time to improve performance).
Ideally, I am looking for a DBMS-agnostic solution that lets me consume data as close to real time as possible. Any ideas?
Clarification: Each row should be consumed multiple times, once per consumer. Meaning, just because one consumer processes a row should not prevent other consumers from doing so. Each consumer will do something different with the same data.
Since you have a lot of data coming in and might have multiple records for the last time stamp, you need a way to keep track of the data read. Here are a few different approaches with their pro and cons:
You can wait for all the data for a timestamp to come in. You would do this by not reading rows with the MAX(timestamp), so you get all the data from the table except the most recent timestamp, for which data might still be coming in.
Pro: Simple design
Con: Not real time processing
You can store the ids you have read each time for the last timestamp. When getting the data, you can use a query like (timestamp = lastTimestamp and id not in (set of ids)) or timestamp > lastTimestamp, as sketched below.
Pro: Almost real time
Con: Additional storage required
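A sketch of that query for the second approach (table, column, and parameter names are illustrative):

SELECT id, created_at, payload
FROM events
WHERE (created_at = :lastTimestamp AND id NOT IN (:alreadyReadIds))
   OR created_at > :lastTimestamp
ORDER BY created_at, id;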
If you don't use sharding or similar:
You can use optimistic locking.
For this you can create an order column with a unique index on the records table (the Log). Before each insertion, the producer queries the Log for the greatest order, increments it, and inserts the next record with that order.
If a concurrency exception occurs (i.e. Duplicate entry '12345' for key order) then you retry the entire process (query, increment, insert).
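A sketch of what that can look like, collapsing the query, increment, and insert into one statement (event_log, event_order, and payload are made-up names; a unique index on event_order is assumed):

INSERT INTO event_log (event_order, payload)
SELECT COALESCE(MAX(event_order), 0) + 1, :payload
FROM event_log;
-- a duplicate-key error here means another producer won the race: retry the statement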
If you use sharding or similar:
Then you will need an additional service/table that will generate a new, unique, always-increasing order integer every time it is asked to do so.
This has the disadvantage that there is another piece that must be managed, a single point of failure that must be highly-available.
P.S.
"sharding or similar" means that you can't have unique indexes on the entire table because you use sharding or you write to multiple tables.
you can't rely on timestamps or anything related to physical time, because the system time may be adjusted by an automated service (NTP) or by a human operator.
I have a column in the database which keeps counts of incoming requests, but it is updated from different sources and systems.
And the incoming requests are in thousands per minute.
What is the best way to update this column with the new request count?
The 2 ways at the top of my head are -
Read the current value from the column, increment it by one, and then update it back (all part of a sproc).
The problem I see with this is that every source/system that updates needs to lock this column, which might increase the wait time for reading and updating the column, and will slow down the DB.
Put requests in a queue, and have a job read the queue and update the column, one at a time. This method looks safer, at least to me, but is it too much work just to get a count of incoming requests?
What approach would you typically take in such a high-volume read-and-update scenario?
Thanks
1000s per minute is not "huge". Let's say it's 10k per minute. That leaves 6ms of time per update. For an in-memory row with a simple integer increment and not too many indexes, expect <1ms per update. Works out fine.
So just use
UPDATE T SET Count = Count + 1 WHERE ID = 1234
Put an index on the table and just do:
update table t
    set request_count = request_count + 1
    where <whatever conditions are appropriate>;
Be sure that the conditions in the where clause all refer to indexed columns, so finding the row is as fast as possible.
Without strenuous effort, I would expect the update to be fast enough; you should test to see whether that is true. You could also insert a row into a requests table and do the counting when you query that table. Inserts are faster than updates, because the engine doesn't have to find the row first.
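If you go the insert route, a rough sketch (the requests table and its columns are illustrative):

-- each incoming request just appends a row
INSERT INTO requests (source_system, created_at)
VALUES (:source_system, CURRENT_TIMESTAMP);

-- the count is derived only when it is actually needed
SELECT COUNT(*)
FROM requests
WHERE created_at >= :period_start;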
If this doesn't meet performance goals, then some sort of distributed mechanism may prove successful. I don't see that batching the requests using sequences would be a simple solution. Although the queue is likely to be distributed, you then have the problem that the request counts are out of sync with the actual updates.