How to retrieve only updated/new records since the last query in SQL?

How to retrieve only updated/new records since the last query in SQL? - sql

I was asked to design a class for caching SQL query results. Calling the class' query method will query and cache the entire set of results at the first time; afterward, each subsequence query will retrieve only the updated portion, and will merge the result into the cache.
If the class is required to be generic, i.e. NO knowledge about the db and the tables, do you have any idea?
Is it possible, and how to retrieve only updated/new records since the last query?

MySQL (nor any other SQL-based database, as far as I know) does not store the "last update time" or "insert time" of a row anywhere, at least not anywhere you can query (parsing out the binary log is probably not something you're looking to do).
To get "new records" you will need to either
Query for all records with an autoincrement value higher than the last known max
Add a datetime/timestamp to the table that would store the insert/updated time of the record, and query based on that
You could consider having your code create its own table that would store a record every time an UPDATE or INSERT happened on another table, and then add triggers to all those other tables that would populate your table.
But that might be a bit much.

Either your cache class has to be notified of all updates, or you need some kind of expiration.
The only other option is to query the database to check, but then you kind of lose the point of the cache :)
The above is true for a "generic" cache. If you implement a specialized cache for a particular table you might be able to exploit DML patterns.

Related

How to Represent a Single Statistic in SQL Database?

I have an SQL table (SQLite database), Listing, that has a datetime field. For my program, I need to know the most recent time field my program has seen. However, I don't store all listings the program has seen into the database.
So my question is, what is the usual way to store data like "most recently seen object", which is a single record, into a database? Is there something more elegant than making another table that has one record with a datetime field?

First of all, you need to save all records into your database, then you get the max() suggested by #GordonLinoff. What's your problem in save the data? If you hope keep database size small, You can add a trigger which delete oldest rows during new INSERTs.
But I don't understand properly your example. Can you give us some part of code?

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.

NO, NO, NO!! Now repeat after me, I will not do this because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. they use indexes to quickly find what you are after. think phone book how effective is the index? would it be better to have a different book for each last name?
This will not give you anything performance wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. however if you split the table up, it becomes impossible/really really hard to get any info that spans multiple users, like search all users for a certain widget, count of all widgets of a certain type, etc. you need to have every query be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted. Can you delete the rows at a later time, with less database activity. is the delete slow because it is being blocked by other activity. look into those before you break up the table.

No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.

Using an index on userID should result nearly in the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...

Is it reasonable for an application to create database tables
dynamically as a means of partitioning?
No. (smile)

Help updating a column using other columns of the same table

Table: Customer with columns Start_Time and End_Time.
I need to add a new column "Duration" that is End_Time - Start_Time.
However, I need to do this using a trigger or procedure so that immediately after a new record is added to Customer table, the column Duration is updated.

If you are using MS SQL, the ideal answer is probably a computed column.
The less data you actually duplicate, the less opportunity for data inconsistency you will have, therefore the less consistency-ensuring/verification code and fewer maintenance processes will result from your schema.
To set this up, (again, if using MS SQL), just add another column using the designer, and expand the "Computed Column Specification" area. (You can refer to other columns from this same table for this calculation.) Then enter "End_Time - Start_Time". Depending on what you are going to do with this data, may want to use something like DATEDIFF(minute, Start_Time, End_Time) for your formula, instead. It's exactly what this feature is for.
If it is a very expensive calculation (which yours is probably not, from the information you've given) you could configure the results to be "persisted" - that's very much like a trigger but clearer to implement and maintain.
Alternately, you could create a new View that does the same calculation, and "project" this first table through it whenever getting information. But you probably already knew that, thus this answer was born! :)
p.s. I personally recommend avoiding triggers like the plague. They cause extra operations that are often not expected by a developer, maintainer, or admin. This can cause operations to fail, return unexpected extra result sets, or modify rows that perhaps an admin was specifically trying to avoid modifying during an administrative (read: unsupported grin) fix.
p.p.s. In this case I'd also recommend against a stored procedure, for the same maintenance reason as triggers. Although you could restrict security such that the only way to update the table was through a stored procedure, this can fail for many of the same reasons triggers can fail. Best to avoid duplicating the data if you can.
p.p.p.s :) This is not to say stored procedures are bad as a whole. On complex transactional operations or tightly integrated procedural filtering of large related tables in order to return a comparatively small result set they are still often the best choice.

As per shannon, though the the term in oracle is a "Virtual Column"
There were an 11g enhancement. Prior to that, use a view (and that is still a potential answer for 11g).
Do not use a trigger or stored procedure.

Sorting based on calculation with nhibernate - best practice

I need to do paging with the sort order based on a calculation. The calculation is similar to something like reddit's hotness algorithm in that its dependant on time - time since post creation.
I'm wondering what the best practice for this would be. Whether to have this sort as a SQL function, or to run an update once an hour to calculate the whole table.
The table has hundreds of thousands of rows. And I'm using nhibernate, so this could cause problems for the scheduled full calcution.
Any advice?

It most likely will depend a lot on the load on your server. A few assumptions for my answer:
Your calculation is most likely not simple, but will take into account a variety of factors, including time elapsed since post
You are expecting at least reasonable growth in your site, meaning new data will be added to your table.
I would suggest your best bet would be to calculate and store your ranking value, and as Nuno G mentioned retrieve using an ordered clause. As you note there are likely to be some implications, two of which would be:
Scheduling Updates
Ensuring access to the table
As far as scheduling goes you may be able to look at some ways of intelligently recalculating your value. For example, you may be able to identify when a calculation is likely to be altered (for example, if a dependant record is updated you might fire a trigger, adding the ID of your table to a queue for recalculation). You may also do the update in ranges, rather then in the full table.
You will also want to minimise any locking of your table whilst you are recalculating. There are a number of ways to do this, including setting your isolation levels (using MS SQL terminonlogy). If you are really worried you could even perform your calculation externally (eg. in a temp table) and then simply run an update of the values to your main table.
As a final note I would recommend looking into the paging options available to you - if you are talking about thousands of records make sure that your mechanism determines the page you need on the SQL server so that you are not returning the thousands of rows to your application, as this will slow things down for you.

If you can perform the calculation using SQL, try use Hibernate to load the sorted collection by executing a SQLQuery, where your query includes a 'ORDER BY' expression.

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks

You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle

Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.

We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.

Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?

I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.

The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.

Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)

Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows

We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.

In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.

Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.

We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.

The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.

No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.

From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.

Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.

This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.

Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas