SQLite: How to SELECT "most recent record for each user" from single table with composite key?

I'm not a database guru and feel like I'm missing some core SQL knowledge to grok a solution to this problem. Here's the situation as briefly as I can explain it.
Context:
I have a SQLite database table that contains timestamped user event records. The records can be uniquely identified by the combination of timestamp and user ID (i.e., when the event took place and who the event is about). I understand this situation is called a "composite primary key." The table looks something like this (with a bunch of other columns removed, of course):
sqlite> select Last_Updated,User_ID from records limit 4;
Last_Updated User_ID
------------- --------
1434003858430 1
1433882146115 3
1433882837088 3
1433964103500 2
Question: How do I SELECT a result set containing only the most recent record for each user?
Given the above example, what I'd like to get back is a table that looks like this:
Last_Updated User_ID
------------- --------
1434003858430 1
1433882837088 3
1433964103500 2
(Note that the result set only includes user 3's most recent record.)
In reality, I have approximately 2.5 million rows in this table.
Bonus: I've been reading answers about JOINs, de-dupe procedures, and a bunch more, and I've been googling for tutorials/articles in the hopes that I would find what I'm missing. I have extensive programming background so I could de-dupe this dataset in procedural code like I've done a hundred times before, but I'm tired of writing scripts to do what I believe should be possible in SQL. That's what it's for, right?
So, what do you think is missing from my understanding of SQL, conceptually, that I need in order to understand why the solution you've provided to my question actually works? (A reference to a good article that actually explains the theory behind the practice would suffice.) I want to know WHY the solution works, not just that it does.
Many thanks for your time!

You could try this:
select user_id, max(last_updated) as latest
from records
group by user_id
This should give you the latest record per user. I assume you have an index on user_id and last_updated combined.
In the above query, generally speaking, we are asking the database to group the records by user_id. If there is more than one record for user_id 1, they will all be grouped together. From that group, the maximum last_updated is picked for output. Then the next group is sought and the same operation is applied there.
If you have a composite index, sqlite will likely just use the index because the index contains both fields addressed in the query. Indexes are smaller than the table itself, so scanning or seeking is faster.
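For reference, such a composite index could be created like this (the index name here is just illustrative):
CREATE INDEX idx_records_user_updated ON records (User_ID, Last_Updated);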

Well, in true "d'oh!" fashion, right after I ask this question, I find the answer.
For my case, the answer is:
SELECT MAX(Last_Updated),User_ID FROM records GROUP BY User_ID
I was making this more complicated than it needed to be by thinking I needed to use JOINs and stuff. Applying an aggregate function like MAX() is all that's needed to select only those rows whose content matches the function result. That means this statement…
SELECT MAX(Last_Updated),User_ID FROM records
…would therefore return a result set containing only one row: the most recent event.
By adding the GROUP BY clause, however, the result set contains a row for each "group" of results, i.e., for each user. My programmer-brain did not understand that GROUP BY is how we say "for each" in SQL. I think I get it now.
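One caveat worth noting: if you later need the other columns from each most-recent row (not just the timestamp and user ID), SQLite's documented "bare column" behavior will return them from the row holding the MAX(), but most other engines won't. A portable version joins the grouped result back to the table; a minimal sketch against the table above:
SELECT r.*
FROM records r
JOIN (SELECT User_ID, MAX(Last_Updated) AS Last_Updated
      FROM records
      GROUP BY User_ID) latest
  ON latest.User_ID = r.User_ID
 AND latest.Last_Updated = r.Last_Updated;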
Note to self: keep it simple, stupid. :)

Related

Quickly find a record appearing more than twice in a table

The title pretty much says it all: using Oracle SQL, I'd like to get, as quickly as possible, three records that share an ID from a very large table. The rows are not duplicates; they share one ID (rID) but differ in another (mID).
One approach I know I could take (though it would be very slow) would be to load, say, the first 1000 records into a C# program, then execute a COUNT query per ID until I hit one with 3 records and return that ID. I know this is a terrible approach, but it should give an idea of what I want to get out of this.
I've tried using GROUP BY, and this would work, but it would be unacceptably slow. I don't care about the state of the rest of the table; I just need a single ID that has three records. Ideally I'd do something like a GROUP BY that stops after finding the first ID with three or more records and just returns that one. There are over a million records in the table, so efficiency is important.
What you describe translates to:
select the_id
from the_table
group by the_id
having count(*) >= 3
fetch first row only;
This should be as fast as it gets. You can help Oracle by providing an index on the id. That's about it.
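Such an index might simply be (the name is illustrative):
CREATE INDEX idx_the_table_id ON the_table (the_id);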

Store results of SQL Server query for pagination

In my database I have a table with a rather large data set that users can perform searches on. So for the following table structure for the Person table that contains about 250,000 records:
firstName | lastName | age
----------|----------|----
John      | Doe      | 25
John      | Sams     | 15
the users would be able to perform a query that can return about 500 or so results. What I would like to do is allow the user to see his search results 50 at a time using pagination. I've figured out the client-side pagination stuff, but I need somewhere to store the query results so that the pagination uses the results from his unique query and not from a SELECT * statement.
Can anyone provide some guidance on the best way to achieve this? Thanks.
Side note: I've been trying to use temp tables to do this by using the SELECT INTO statements, but I think that might cause some problems if, say, User A performs a search and his results are stored in the temp table then User B performs a search shortly after and User A's search results are overwritten.
In SQL Server the ROW_NUMBER() function is great for pagination, and may be helpful depending on what parameters change between searches. For example, if searches were just for different firstName values, you could use:
;WITH search AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName
    FROM YourTable
)
SELECT *
FROM search
WHERE RN_firstName BETWEEN 51 AND 100
  AND firstName = 'John'
You could add additional ROW_NUMBER() lines, altering the PARTITION BY clause based on which fields are being searched.
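For instance, a lastName search could reuse the same CTE with a second numbering column; a sketch extending the query above:
;WITH search AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName,
           ROW_NUMBER() OVER (PARTITION BY lastName ORDER BY firstName) AS RN_lastName
    FROM YourTable
)
SELECT *
FROM search
WHERE RN_lastName BETWEEN 1 AND 50
  AND lastName = 'Doe'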
Historically, for us, the best way to manage this has been to create a completely new table with a unique name. Then, when you're done, you can schedule the table for deletion.
The table, if practical, simply contains an index id (a simple sequence: 1, 2, 3, 4, 5) and the primary key to the table(s) that are part of the query. Not the entire result set.
Your pagination logic then does something like:
SELECT p.* FROM temp_1234 t, primary_table p
WHERE t.pkey = p.primary_key
AND t.serial_id between 51 and 100
The serial id is your paging index.
So, you end up with something like (note, I'm not a SQL Server guy, so pardon):
CREATE TABLE temp_1234 (
    serial_id serial,
    pkey number
);
INSERT INTO temp_1234
SELECT 0, primary_key FROM primary_table WHERE <criteria> ORDER BY <sort>;
CREATE INDEX i_temp_1234 ON temp_1234(serial_id); -- I think SQL already does this for you
If you can delay creating the index until after the insert, it's faster than creating it first, but the improvement is most likely marginal.
Also, create a tracking table where you insert the table name and the date. You can use this with a reaper process later (late at night) to DROP the day's tables (those more than, say, X hours old).
Full-table operations like these are much cheaper than inserting and deleting rows in a single shared table:
INSERT INTO page_table SELECT 'temp_1234', <sequence>, primary_key...
DELETE FROM page_table WHERE page_id = 'temp_1234';
That's just awful.
First of all, make sure you really need to do this. You're adding significant complexity, so go and measure whether the queries and pagination really hurt, or whether you just "feel like you should". The pagination can be handled with ROW_NUMBER() quite easily.
Assuming you go ahead: once you've got your query, you clearly need to build a cache, so first you need to identify what the key is. It will be the SQL statement or operation identifier (the name of a stored procedure, perhaps) plus the criteria used. If you don't want to share between users, then add the user name or some kind of session ID too (a sketch of such cache tables follows the two cases below).
Now when you do a query, you first look up in this cache table with all the key data, then either:
a) You can't find it, so you run the query and add it to the cache, storing the criteria/keys and either the data or the PK of the data, depending on whether you want a snapshot or real time. Bear in mind that "real time" isn't really real time, because other users could be changing the data under you.
b) You find it, so you retrieve the results (or join the PKs to the underlying tables) and return them.
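For concreteness, the cache tables for this approach might look something like the sketch below; all names are hypothetical, not from the question:
CREATE TABLE SearchCache (
    CacheKey  varchar(450) NOT NULL PRIMARY KEY, -- operation name + criteria (+ session ID if not shared)
    CreatedAt datetime NOT NULL                  -- lets the cleanup process expire old entries
);
CREATE TABLE SearchCacheRow (
    CacheKey  varchar(450) NOT NULL REFERENCES SearchCache (CacheKey),
    RowNum    int NOT NULL,                      -- paging index within this cached result
    PersonPK  int NOT NULL                       -- PK back into the searched table
);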
Of course now you need a background process to go and clean up the cache when it's been hanging around too long.
Like I said - you should really make sure you need to do this before you embark on it. In the example you give I don't think it's worth it.

Storing count in the main table

I have three database tables,
Users ( UserID, ... )
Entries ( EntryID, ... )
Likes ( UserID, EntryID, ... )
My question is easy. Should I use a LikeCount column in the Entries table, or use a SELECT COUNT(*) statement on the Likes table every time I need it? Which one is the better practice?
It's probably a duplicate of this question: storing the count of rows or just count the rows? The given answer to that question is basically "don't use LikeCount; count every time you need it instead." However, it does not give satisfactory answers to the following questions:
What are the bad consequences of storing the count in the table?
How do these two approaches compare in performance if I need to count likes very frequently in my application?
PS: I use SQL Server 2008, if that is important.
What are the bad consequences of storing the count in the table?
There are 2 problems with this approach:
You have to use database triggers or application code to keep the count up to date as the Likes table changes (see the sketch after this list).
If you ever get #1 wrong, you have to deal with the fact that LikeCount might not actually match the number of likes.
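For illustration, a trigger-based version of #1 might look like the minimal T-SQL sketch below. It assumes Entries has a LikeCount column (an assumption, not from the question) and recounts the affected entries rather than incrementing, trading a little speed for simplicity:
CREATE TRIGGER trg_Likes_Count ON Likes
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recount likes for every entry touched by this statement,
    -- using the inserted/deleted pseudo-tables.
    UPDATE e
    SET LikeCount = (SELECT COUNT(*) FROM Likes l WHERE l.EntryID = e.EntryID)
    FROM Entries e
    WHERE e.EntryID IN (SELECT EntryID FROM inserted
                        UNION
                        SELECT EntryID FROM deleted);
END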
How do these two approaches compare in performance if I need to count likes very frequently in my application?
I believe that if you created an index on the Likes table, most database engines would be able to answer a COUNT(*) query very quickly without referencing the actual table. Basically, the database counts the matching index entries instead of reading the rows, which gives it the same information as your LikeCount.
If you are going to write a query like:
SELECT count(*) from Likes where EntryID=45;
Then your index has to be on EntryID.
But, if you are going to write a query like:
SELECT count(*) from Likes where EntryID=45 and deleted=False;
Then your index has to be on (EntryID, deleted).
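For illustration, those indexes might be declared like this (the names are illustrative):
CREATE INDEX IX_Likes_EntryID ON Likes (EntryID);
CREATE INDEX IX_Likes_EntryID_Deleted ON Likes (EntryID, deleted);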

SQL Limit on "WHERE X IN (...)"

I've got some data I'd like to pull off our SQL server.
This old database does not have any primary keys associated with it, so pulling data is like querying an Excel spreadsheet (what it actually originated as years ago).
I need to run reports on this data, though.
Currently, I get a list of distinct serial numbers for a given time period, then pull all of the records for a given serial number. For a 1-month time frame, this can be 1500 to 3000 serial numbers. The serial number field is formatted as char(20), even though the serial numbers are only 15 characters long.
UPDATE:
There are typically 5 to 15 entries in this table per Serial_Number.
There are at most 10 machines writing data to this table, so identical Date_Time values are possible.
This process takes a while, but between different serial numbers in the list, I am able to update the Windows Form with a Progress Bar so management knows something is happening and about how much longer to expect.
I am always trying to make this query run faster.
Now, I am thinking about pulling the data I need using a WHERE clause such as:
SELECT Col1, Col2, Col3
FROM Table1
WHERE Serial_Number IN (
SELECT DISTINCT Serial_Number
FROM Table1
WHERE Date_Time Between #startDate AND #endDate
)
My question is: are there any issues I could run into with this, particularly because we have so many distinct serial numbers during a given time frame?
And, of course, you know someone in Management is going to try running a year's worth of data when they are bored! Then, they are going to try running data since Jesus was born, just because they've got nothing better to do.
Restated question: Is there a limit on the number of items I can pass to the WHERE clause's IN method?
Index Serial_Number and Date_Time in Table1 (with separate indexes, not a single compound index) and this should perform fairly well for you unless the table is really truly ginormous.
You might get a little more speed with one index on Serial_Number and the second on (Date_Time, Serial_Number). That second index covers the sub query, allowing it to be answered from the index alone.
Note: I'm suggesting indexes, not primary keys; plain indexes don't require uniqueness.
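In SQL Server syntax, that pair might look like the following sketch (index names are illustrative):
CREATE INDEX IX_Table1_Serial ON Table1 (Serial_Number);
CREATE INDEX IX_Table1_DateTime_Serial ON Table1 (Date_Time, Serial_Number);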
Well, in the naive case where there are no indexes (which it sounds like is your case) you're going to have to scan over all the rows in Table1 to perform the DISTINCT on Serial_Number anyway. So I'm not sure it's going to help you much.
I would highly recommend the following:
Use the execution plan to determine what's going on in your query, and
Use that information to add some relevant indexes to speed your operations.
Just from what we see here, it sounds like Date_Time would be a good candidate for a clustered index in Table1.
Edit:
To make a nonunique clustered index as I describe above, you can use the following:
CREATE CLUSTERED INDEX IX_Table1_Date_Time
ON Table1 (Date_Time)
(from http://msdn.microsoft.com/en-us/library/aa258260(v=sql.80).aspx)
This will reorder your table such that all rows are sorted in Date_Time order. Further work with the execution plan will help identify other indexes that may greatly help your performance, depending on the exact types of queries you run.
Honestly, I see no benefit to the WHERE clause as it is written.
You use an expensive inner query, but don't do anything meaningful with the results. I don't even see you getting the Serial_Number in the results anywhere. However, based on your question, it does sound like you need it.
I don't see the need for the DISTINCT keyword on Serial_Number, since duplicates inside an IN list don't change the results of the outer query.
What is wrong with doing this?
SELECT Serial_Number, Col1, Col2, Col3
FROM Table1
WHERE Date_Time Between #startDate AND #endDate
This should do the same thing as your original query. But it would eliminate the expensive nested query.
Just put an index on Date_Time and it should work. This would also eliminate the need for the index on Serial_Number.
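That index might be created like this (the name is illustrative):
CREATE INDEX IX_Table1_DateTime ON Table1 (Date_Time);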
Apparently, there is no way to tell what the maximum number of items in a WHERE X IN (...) list can be.
For now, this is the answer.
If, at some later point in time, someone comes along and finds something to the contrary, please post that answer and I will mark it as such.
Thanks,
Joe

What's the most efficient way to check the presence of a row in a table?

Say I want to check if a record in a MySQL table exists. I'd run a query and check the number of rows returned: if 0 rows, do this; otherwise, do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
is a whole new question. Would the server grab all the values and then count them (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need, so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT CASE WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, LIMIT 1 (or TOP 1, depending on the DB) makes a big difference in terms of performance when checking for the existence of a row in large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway if it's of any help.
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say: always specify exactly what you need and leave the rest. The effort isn't all server-side, since bandwidth is also spent returning data that you aren't even going to do anything with.
As for counting, the previous answer will do for your result set, unless you're dealing with a language or driver that reports affected rows; that can sometimes tell you how many rows were returned by the last query. You'll need to look at your interface documentation for how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster as MySQL will have your index in memory, and not have to hit disk. Adding the LIMIT 1 is also a good trick that will speed up the optimizer significantly in early 5.0.x branches and earlier.
Try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extra column for the presence of "Using index". If it's there, then your query is coming straight from the index and is going to be much faster.