I want multiple servers processing data from a single database table

I have a single database table on a relational database. Data will be loaded into it. I then want to have multiple servers processing that data concurrently (I don't want to have only one server running at a time). E.g. each server will:
Query for a fixed number of rows
Do some work for each row retrieved
Update each row to show it has been processed
How do I ensure that each row is only processed once? Note I don't want to pre-assign a row of data to a server; I'm designing for high availability, so the solution should keep running if one or more servers go down.
The solution I've gone for so far is as follows:
The table has three columns: LOCKED_BY (VARCHAR), LOCKED_AT (TIMESTAMP) and PROCESSED (CHAR)
Each server starts by attempting to "pseudo-lock" some rows by doing:
UPDATE THE_TABLE
SET LOCKED_BY = $servername,
    LOCKED_AT = CURRENT_TIMESTAMP
WHERE (LOCKED_BY IS NULL OR CURRENT_TIMESTAMP - LOCKED_AT > $timeout)
AND PROCESSED = 'N'
i.e. try to "pseudo-lock" rows that aren't locked already or where the pseudo-lock has expired. Only do this for unprocessed rows.
More than one server may have attempted this at the same time. The current server needs to query to find out if it was successful in the "pseudo-lock":
SELECT * FROM THE_TABLE
WHERE LOCKED_BY = $server_name
AND PROCESSED = 'N'
If any rows are returned the server can process them.
Once the processing has been done the row is updated
UPDATE THE_TABLE SET PROCESSED = 'Y' WHERE PRIMARYKEYCOL = $pk
Note: the update statement should ideally limit the number of rows updated.
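For example, a sketch of the claim step with a row limit (the limit syntax varies by database; FETCH FIRST is the SQL-standard form, and some databases also offer SELECT ... FOR UPDATE SKIP LOCKED for exactly this kind of work-queue pattern):
UPDATE THE_TABLE
SET LOCKED_BY = $servername,
    LOCKED_AT = CURRENT_TIMESTAMP
WHERE PRIMARYKEYCOL IN (
    -- claim at most 10 rows that are unprocessed and unlocked (or whose lock has expired)
    SELECT PRIMARYKEYCOL
    FROM THE_TABLE
    WHERE (LOCKED_BY IS NULL OR CURRENT_TIMESTAMP - LOCKED_AT > $timeout)
      AND PROCESSED = 'N'
    FETCH FIRST 10 ROWS ONLY
)
AND PROCESSED = 'N'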

If you are open to changing platform then I would suggest moving to a modern, cloud-based solution like Snowflake. This will do what you want but in the background and by default - so you don't need to know what it's doing or how it's doing it (unless you want to).
This may come across as patronising, which is not my intention, but what you are attempting (in the way you are attempting it) is very complex, so if you don't already know how to do it, someone telling you how to do it is not going to give you the skills/experience you need to implement it successfully.

Related

How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 am, as a GCP Cloud Function triggered by Cloud Scheduler, that sends data to a table in BigQuery.
Unfortunately the cron job used to send the data every minute at 12 am, which means the file was uploaded 60 times instead of only once.
The cron timer was * * 3 * * * instead of 00 3 * * *
How can we fix the table?
Note that the transferred data has now been deleted from the source. So far we have depended on getting the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure will replace the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
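If you prefer to do that in SQL as well, the snapshot can be created with a query; the snapshot name and the seven-day expiry here are just examples:
CREATE SNAPSHOT TABLE your_dataset.your_table_snapshot
CLONE your_dataset.your_table
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY))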
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it: if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
prepare the new records you want to insert, which should have some unique, immutable ID field
SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
filter the new records, e.g. in python new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
upload only the records that don't exist yet
This will prevent the sort of issues you have encountered here.
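If you would rather keep that filtering in SQL than in Python, another way to make the load idempotent is to write the new records to a staging table and MERGE them in on the unique ID; a sketch, assuming a staging table your_dataset.new_records with the same schema as your_dataset.your_table:
MERGE your_dataset.your_table t
USING your_dataset.new_records s
ON t.some_unique_id = s.some_unique_id
WHEN NOT MATCHED THEN
  -- only rows whose ID is not already present get inserted, so reruns are harmless
  INSERT ROW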

Unable to implement SCD2

I was working on SCD Type 2 and was unable to fully implement it; some scenarios were not being fulfilled. I did it in IICS. I'm really finding it very, very difficult to cover all possible scenarios. So, below is the flow:
Src
---> Lkp ( on src.id = tgt.id)
---> Expression ( flag = IIF(ISNULL(tgt.surrogatekey), 'Insert', IIF(NOT ISNULL(tgt.surrogatekey) AND MD5(other_non_key_cols) <> tgt.md5, 'Update')) )
----> insert when flag = 'Insert' (works fine)
But on update I pass the updates to 2 target instances of the same target table: in one I insert the update as a new row, and in the other I update tgt_end_date = lkp_start_date for the previously stored ids and active_ind becomes 'N'.
This works, but when I receive new updates containing the same records again (duplicates), or when I simply rerun the mapping, duplicates are inserted into the target table. The changing of end_date also becomes unstable: when I insert multiple changes of the same record it sets all active_flags to 'Y', whereas all should be 'N' except the latest one in every run. Could anyone please help with this, even in SQL if you can interpret it that way.
If I've understood your question correctly, in each run you have multiple records in your source that match a single record in your target.
If this is the case then you need to process your data so that you have a single source record per target record before you put the data through the SCD2 process.
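In SQL terms, a sketch of that pre-processing step (column names are illustrative; it assumes the source carries an id and a last-modified timestamp) would be to keep only the latest source row per id before the SCD2 logic runs:
SELECT *
FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY s.last_modified DESC) AS rn
  FROM src s
) dedup
WHERE rn = 1  -- exactly one (latest) source record per target record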

Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, the application scrapes data from a webpage and stores specific information from this page in a database. The important information in this query is: player name, position, kill points and class.
Player name has every potential to change or remain the same every day
Regarding the position, there can be multiple players sitting in one position
Kill points has the potential to increase or remain the same every day
Class: there are only 2 possibilities a name can be, e.g. A can change to B or remain A (same in reverse), but it cannot be C, D, E or F
The player name can change on any particular day, and the position can also change depending on the kill point increase since the last update, which brings us back to the goal: to search the database day by day, from the current date back as far as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
What is being used as the main reference for the change is the kill points. As the days go on, this number will either stay exactly the same or increase; it can never decrease.
So now onto the implementation of this application.
The first query which runs finds the most recent entry for the player name
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then continue to check the previous days entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, it will then proceed to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, and then cross-reference them with the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So, checking the day ahead: if the character name is not found in the day ahead, then this is considered to be the old player name for this specific character; otherwise, if all 5 of the results are found to be present in the day-ahead searches, then this name is considered to be new to the table.
From the date this application started to run up to today's date, that is over 400 individual queries on the database to achieve one goal.
It is also worth noting that this table grows by 14,400 - 14,500 rows each and every day.
The overall question: is it possible to bring all these queries into fewer calls to the database, reduce queries and improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there's been a change between the current scrape and the last one would guarantee fewer redundant requests to the db (see the sketch after this list).
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc... effectively removing one where clause.
Reduce time between queries - Less incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using TOP() is because you need data ordered in some specific way (most recent, etc...). If you used an in-application data structure that keeps the data ordered and still easily queryable, you could perhaps offload some SQL requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
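As an illustration of the first suggestion (a sketch only, reusing the #parameter placeholders from your queries; the column list is an assumption): before inserting a scraped row, skip the insert when the most recent stored row for that character already has the same values.
INSERT INTO [changes] ([CharacterName], [Territory], [Kills], [Class], [Recorded], [Archived])
SELECT #charname, #territory, #kills, #class, #recorded, 0
WHERE NOT EXISTS (
    -- the latest stored row for this character already matches the scraped values
    SELECT 1 FROM [changes] c
    WHERE c.[CharacterName] = #charname AND c.[Territory] = #territory AND c.[Archived] = 0
      AND c.[Kills] = #kills AND c.[Class] = #class
      AND c.[Recorded] = (SELECT MAX([Recorded]) FROM [changes]
                          WHERE [CharacterName] = #charname AND [Territory] = #territory AND [Archived] = 0)
)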

Auto-incrementing a Firebird field value when using UPDATE OR INSERT INTO

I have been a Delphi programmer for 25 years, but managed to avoid SQL until now. I was a dBase expert back in the day. I am using Firebird 3.0 SuperServer as a service on a Windows server 2012 box. I run a UDP listener service written in Delphi 2007 to receive status info from a software product we publish.
The FB database is fairly simple. I use the user's IP address as the primary key and record reports as they come in. I am currently getting about 150,000 reports a day and they are logged in a text file.
Rather than insert every report into a table, I would like to increment an integer value in a single record with a "running total" of reports received from each IP address. It would save a LOT of data.
The table has fields for IP address (Primary Key), LastSeen (timestamp), and Hits (integer). There are a few other fields but they aren't important.
I use UPDATE OR INSERT INTO when the report is received. If the IP address does not exist, a new row is inserted. If it does exist, then the record is updated.
I would like it to increment the "Hits" field by +1 every time I receive a report. In other words, if "Hits" already = 1, then I want to inc(Hits) on UPDATE to 2. And so on. Basically, the "Hits" field would be a running total of the number of times an IP address sends a report.
Adding 3 million rows a month just so I can get a COUNT for a specific IP address does not seem efficient at all!
Is there a way to do this?
The UPDATE OR INSERT statement is not suitable for this, as you need to specify the values to update or insert, so you will end up with the same value for both the insert and the update. You could address this by creating a before insert trigger that always assigns 1 to the field that holds the count (ignoring the value provided by the statement for the insert), but it is probably better to use MERGE, as it gives you more control over the resulting action.
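A minimal sketch of that trigger alternative (assuming the table is called user_stats with a hits column, matching the MERGE example below):
CREATE TRIGGER user_stats_bi FOR user_stats
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
  -- always start a new row at 1, ignoring whatever hits value the statement supplied
  NEW.hits = 1;
END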
For example, using MERGE:
merge into user_stats
using (
  select '127.0.0.1' as ipaddress, timestamp '2021-05-30 17:38' as lastseen
  from rdb$database
) as src
on user_stats.ipaddress = src.ipaddress
when matched then
  update set
    user_stats.hits = user_stats.hits + 1,
    user_stats.lastseen = maxvalue(user_stats.lastseen, src.lastseen)
when not matched then
  insert (ipaddress, hits, lastseen) values (src.ipaddress, 1, src.lastseen)
However, if you get a lot of updates for the same IP address, and those updates are processed concurrently, this can be rather error-prone due to update conflicts. You can address this by inserting individual hits, and then have a background process to summarize those records into a single record (e.g. daily).
Also keep in mind that having only a single record removes the possibility of doing more analysis later (e.g. distribution of hits, number of hits on day X or at time HH:mm, etc.).
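If you go the route of inserting individual hits and summarizing them later, the summarization step could itself be a scheduled MERGE from the raw table into user_stats; a rough sketch, where raw_hits (ipaddress, seen_at) is a hypothetical staging table:
merge into user_stats
using (
  -- aggregate the raw hits collected since the last run
  select ipaddress, count(*) as cnt, max(seen_at) as lastseen
  from raw_hits
  group by ipaddress
) as src
on user_stats.ipaddress = src.ipaddress
when matched then
  update set
    user_stats.hits = user_stats.hits + src.cnt,
    user_stats.lastseen = src.lastseen
when not matched then
  insert (ipaddress, hits, lastseen) values (src.ipaddress, src.cnt, src.lastseen)
The summarized rows would then be deleted from raw_hits in the same transaction.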

SQL Update query not completed

I made a SQL query to update records in a table. The table has about 15 million records. The update statement is like:
UPDATE temp_conafe
set apoyo = trim(apoyo)
where cve_status like '%APOYO%';
I keep checking the field v$transaction.used_urec to see if the query is rolling forward or backwards, but when the number of records reaches more than 15 million the query starts rolling backwards.
How do I get the update to complete successfully?
I'm not the DBA, just a programmer, but I can't keep developing until that thing updates my records.
It looks as if your transaction is too big. Try to add another limiting clause in the WHERE. If you have an id field you can add something like this:
where cve_status like '%APOYO%'
AND id > 1 AND id < 100000
You need to run it multiple times and change the range accordingly. If this is not an option you have to talk to your DBA and ask them to give you more resources.
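If you have a numeric id column, you can also script the batching so you don't have to edit the range by hand; a rough PL/SQL sketch (the id column and batch size are assumptions, adjust them to your table):
DECLARE
  v_min NUMBER;
  v_max NUMBER;
  c_batch CONSTANT NUMBER := 100000;
BEGIN
  SELECT MIN(id), MAX(id) INTO v_min, v_max FROM temp_conafe;
  WHILE v_min <= v_max LOOP
    UPDATE temp_conafe
       SET apoyo = TRIM(apoyo)
     WHERE cve_status LIKE '%APOYO%'
       AND id BETWEEN v_min AND v_min + c_batch - 1;
    COMMIT;  -- commit each slice so the undo usage stays small
    v_min := v_min + c_batch;
  END LOOP;
END;
/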