I'm coding an application that deals with files, so I have a table containing information about all the files registered in the application.
My "files" table looks like this: ID, Path and LastScanTime.
The algorithm that I use in my application is simple:
Take the oldest row (LastScanTime is the oldest)
Extract the file path
Do some magic on this file (takes exactly 5 minutes)
Update the LastScanTime to the current time (now)
Go to step "1"
So far, the task is pretty simple. I use this SQL statement to get the oldest item:
SELECT TOP 1 * FROM files ORDER BY [LastScanTime] ASC
and at the end of processing the item (to prevent it from being selected again immediately):
UPDATE Files SET [LastScanTime]=GETDATE() WHERE Id=#ItemID
Now I'm going to add some complexity to the algorithm:
Take the 3 oldest rows (by LastScanTime)
For each row, do:
A. Extract the file path
B. Do some magic on this file (takes exactly 5 minutes)
C. Update the LastScanTime to the current time (now)
D. Go to step "1"
The problem I'm facing now is that the whole process runs in parallel (no more serial processing), so changing my SQL statement to the following is not enough:
SELECT TOP 3 * FROM files ORDER BY [LastScanTime] ASC
Why isn't this SQL statement enough?
Let's say I run my code and start processing the first 3 items. A minute later, I want to start processing another 3 items. This SQL statement will retrieve exactly the same "oldest" items that are already being processed.
Possible solution
Implementing a combined SELECT & UPDATE that gets the 3 oldest items and immediately updates their last scan time. Since there is no combined SELECT & UPDATE statement, what happens if another SELECT comes in while the first SELECT is executing? Both statements will get the same results. This is a problem... Another problem is that we mark the item as "scanned recently" before the scan has actually finished. What happens if the scan terminates with an error?
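For illustration, here is roughly the direction I'm considering: a minimal sketch of an atomic claim in SQL Server, assuming a new nullable ScanStartTime column (the column name and the 15-minute reclaim window are assumptions, not part of my current schema):
-- Atomically claim the 3 oldest unclaimed rows; READPAST makes
-- concurrent workers skip rows another worker has already locked.
WITH candidates AS (
    SELECT TOP (3) *
    FROM Files WITH (UPDLOCK, READPAST, ROWLOCK)
    WHERE ScanStartTime IS NULL
       OR ScanStartTime < DATEADD(MINUTE, -15, GETDATE())   -- reclaim stalled scans
    ORDER BY LastScanTime ASC
)
UPDATE candidates
SET ScanStartTime = GETDATE()
OUTPUT inserted.Id, inserted.Path
On success the worker would set LastScanTime = GETDATE() and clear ScanStartTime; on error it would only clear ScanStartTime, so the row becomes eligible again instead of being stuck as "scanned recently".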
I'm looking for tips and tricks to solve this problem. Solutions may add columns as needed.
I'd appreciate your help.
Well, I usually have the habit of keeping two different date fields in the database: one is AddedDate and the other is ModifiedDate.
So the algorithm, in your terms, would be:
Take the oldest row (AddedDate is the oldest)
Extract the file path
Do some processing on this file
Update the ModifiedDate to the current time (now)
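A sketch of the corresponding statements under this scheme (same style as your queries; this assumes new rows get AddedDate set on insert):
SELECT TOP 1 * FROM Files ORDER BY [AddedDate] ASC
-- ... process the file ...
UPDATE Files SET [ModifiedDate] = GETDATE() WHERE Id = #ItemID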
It seems that you are reinventing an event queue in SQL. Standard tools like RabbitMQ or ActiveMQ may solve your problem.
Related
We have a script that should run daily at 12 AM on a GCP Cloud Function with Cloud Scheduler, sending data to a table in BigQuery.
Unfortunately, the cron job was sending the data every minute at 12 AM, which means the file was uploaded 60 times instead of once.
The cron schedule was * * 3 * * * instead of 00 3 * * *.
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we have relied on keeping only the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure will replace the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be to use the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to delete only the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
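If you prefer to script the snapshot as well, BigQuery lets you create it in SQL (a sketch; the snapshot name is illustrative):
CREATE SNAPSHOT TABLE your_dataset.your_table_snapshot
CLONE your_dataset.your_table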
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. no matter how many times you run it, if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could:
prepare the new records you want to insert, which should have some unique, immutable ID field
run SELECT some_unique_id FROM your_dataset.your_table to fetch the existing IDs (old_record_ids)
filter the new records, e.g. in Python: new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
upload only the records that don't exist yet (or let BigQuery do the filtering, as sketched below)
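Alternatively, if your function stages the new records in a temporary table first, BigQuery can do the filtering server-side with a MERGE (a sketch; your_table_staging and some_unique_id are illustrative names, not your actual schema):
MERGE your_dataset.your_table T
USING your_dataset.your_table_staging S
ON T.some_unique_id = S.some_unique_id
WHEN NOT MATCHED THEN
  INSERT ROW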
This will prevent the sort of issues you have encountered here.
I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important fields are: player name, position (there can be multiple players sitting at one specific position), kill points, and class.
The player name may change or remain the same from day to day
Regarding position, multiple players can sit in one position
Kill points can increase or remain the same each day
Class has only 2 possibilities, e.g. A can change to B or remain A (and vice versa), but can never be C, D, E, or F
The player name can change on any given day, and the position can also change depending on the kill point increase since the last update, which brings us back to the goal: search the database day by day, from the current date as far back as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
The main reference for detecting a change is the kill points. As the days go on, this number either stays exactly the same or increases; it can never decrease.
So now, on to the implementation of this application.
The first query finds the most recent entry for the player name:
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then it checks the previous day's entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, it proceeds to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 closest possible matches, which are then cross-referenced against the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
When checking the day ahead: if a candidate's character name is not found in the day ahead, that name is considered to be the old player name for this specific character; if all 5 candidates are found to be present in the day-ahead searches, the name is considered new to the table.
From the date this application started running up to today's date, that is over 400 individual queries against the database to achieve one goal.
It is also worth noting that this table grows by 14,400 - 14,500 rows each and every day.
The overall question: is it possible to combine these into fewer calls to the database, reducing queries and improving performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database retrieval speed is largely determined by how well the database is ordered/normalized and how much data must be searched for each query. Keeping a cache of previously scraped pages, and only storing data when something has changed between the current scrape and the last one, would guarantee fewer redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc., effectively removing one WHERE clause.
Reduce time between queries - Fewer concurrent incoming requests mean less resource contention and faster response times for prior requests.
Use another data structure - The only reason you're using TOP() is that you need data ordered in some specific way (most recent, etc.). If you used an in-memory data structure that keeps the data ordered and still easily queryable, you could offload some SQL requests to that structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
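As a concrete illustration of reducing round trips (a sketch using the column names from the question, not a drop-in replacement): instead of issuing one query per day, you could fetch the whole date range for a territory in a single call and walk it in application code:
SELECT [CharacterName], [Recorded], [Kills], [Class]
FROM [changes]
WHERE [Territory] = #territory
  AND [Archived] = 0
  AND [Recorded] >= '2021-02-22'   -- earliest date the question cares about
ORDER BY [Recorded] DESC
One round trip returns everything the day-by-day loop would have fetched in hundreds of queries, at the cost of pulling more rows at once.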
Due to recent updates to the database, I have run into a weird problem. I have two tables: tVehicleDeal and tVehicleLog. We did a 'migration', meaning we created an app that transfers the data from an old database to a more relational database. This process took a while, but it finished and everything seemed good to go. What happens now is that any time tVehicleDeal is updated, the corresponding information is inserted into tVehicleLog.
The problem that has occurred is this: I ran a script that would update the current deal in tVehicleDeal to match the most recent log in tVehicleLog. I made an error in my script, and not all the current deals in tVehicleDeal were updated properly. As a result, when users updated the active deal in tVehicleDeal, not all the information was inserted into tVehicleLog. I need to find a way to update the newest entry with some fields from the past entries, such as the date it was titled. Some deals have as many as 20 different logs, whereas others may have only 2 or 3.
I have found this link here, but I'm not 100 percent positive it is what I'm looking for. I have tried something similar to it, but I am unable to get anything to work using the examples found on that page. Any other ideas will help greatly!
EDIT:
What I am unable to figure out is how to update a column in tVehicleLog. For example:
In the tVehicleLog table there are 6 results for a particular DealID.
Rows 1 through 4 do not have a titled date, but the 5th row does.
I can't figure out how to update the titled column of the 6th row for that DealID based on the 5th row, which does have the titled date.
The link provided above looked like what I was looking for, but I was unable to get that solution to work.
Based on this line from your question,
I can't figure out how to update the titled column of the 6th row for that DealID based on the 5th row, which does have the titled date.
It seems like this should fix your problem. It is written only to solve this specific scenario. If other scenarios exist that are not exactly like this one, adjustments may have to be made. If I didn't understand your problem, please post further clarification.
-- Copy the TitleDate onto the row you want to fix from another
-- log row for the same deal that actually has a TitleDate.
UPDATE L1
SET TitleDate = L2.TitleDate
FROM tVehicleLog L1
INNER JOIN tVehicleLog L2
    ON L1.DealID = L2.DealID
    AND L2.TitleDate IS NOT NULL
WHERE L1.<PrimaryKeyColumn> = #ThePrimaryKeyColumnOfTheRowYouWantToUpdate
Hello, I need a SQL query statement that gets me rows 'start' to 'finish'.
For example:
A website with many items where page 1 selects only items 1-10, page 2 has 11-20, and so on.
I know how to do this with Microsoft SQL Server and MySQL, but I need an implementation that is platform independent. :/
I have an auto-increment column for IDs, but deleting rows in between will mess up the result when I select via
WHERE ID > number AND ID < othernumber
of course.
Is this possible without fetching the whole database into a ResultSet?
I think your safest bet would be to use the BETWEEN operator. I believe it works across Oracle/MySQL/MSSQL:
WHERE ID BETWEEN number AND othernumber
Concerning your comment "I was just thinking of the case when the first 100 IDs are gone, and I'll have to check further until there is something to fetch": you might want to consider never actually deleting rows from your database, but instead adding a flag like "active" to your tables, so you can avoid the situation you're now trying to handle. The alternative is where you are now: having to find the max and min rows in a filter.
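For completeness, the SQL standard also defines OFFSET ... FETCH for exactly this paging use case (a sketch; this is standard SQL:2008, but support varies by platform and version, so check yours before relying on it):
SELECT *
FROM items   -- hypothetical table name
ORDER BY ID
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY   -- page 2: rows 11-20
Because it counts rows rather than ID values, it is unaffected by gaps left by deleted rows.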
I wrote a SQL query to update records in a table. The table has about 15 million records. The update statement is:
UPDATE temp_conafe
set apoyo = trim(apoyo)
where cve_status like '%APOYO%';
I keep checking the field v$transaction.used_urec to see whether the query is rolling forward or rolling back, but when the number of records reaches more than 15 million, the query starts rolling back.
How do I get the update to complete successfully?
I'm not the DBA, just a programmer, but I can't keep developing until that thing updates my records.
It looks as if your transaction is too big. Try adding another limiting clause in the WHERE. If you have an id field, you can add something like this:
where cve_status like '%APOYO%'
AND id > 1 AND id < 100000
You need to run it multiple times and change the range accordingly. If this is not an option, you have to talk to your DBA and ask them to give you more resources.
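If manually shifting the range is too tedious, the same idea can be wrapped in a PL/SQL loop that commits between batches (a sketch; the batch size is illustrative, and note that committing per batch gives up the all-or-nothing atomicity of a single UPDATE):
BEGIN
  LOOP
    UPDATE temp_conafe
    SET apoyo = TRIM(apoyo)
    WHERE cve_status LIKE '%APOYO%'
      AND apoyo <> TRIM(apoyo)   -- skip rows already fixed so the loop terminates
      AND ROWNUM <= 100000;      -- cap the undo used per transaction
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;                      -- release the undo after each batch
  END LOOP;
  COMMIT;
END;
/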