Quickly find a record appearing more than twice in a table - sql

The title pretty much says it all: using Oracle SQL, I'd like to get, as quickly as possible, three records that share an ID from a very large table. The rows are not duplicates; they share one ID (rID) but differ in another (mID).
One approach I know I could do (that would be very slow) would be to load the first say 1000 records into a C# program, then execute a COUNT query to count the number of records with each ID until I hit one with 3 records and return that ID. I know this is a terrible approach but should give an idea of what I want to get out of this.
I've tried using GROUP BY, and this would work but would be unacceptably slow. I don't care about the state of the rest of the table; I just need a single ID that has three records. Ideally I'd do something like a GROUP BY that would stop after finding the first ID with three or more records and just return that one. There are over a million records in the table, so efficiency is important.

What you describe translates to:
select the_id
from the_table
group by the_id
having count(*) >= 3
fetch first row only;
This should be as fast as it gets. You can help Oracle by providing an index on the id. That's about it.
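A minimal sketch of the same query shape, run from Python against an in-memory SQLite database so it can be tried without an Oracle instance. The column `mid`, the index name, and the sample rows are made up for illustration, and `LIMIT 1` stands in for Oracle's `fetch first row only`, which SQLite does not accept.

```python
import sqlite3


def first_id_with_at_least(conn, n):
    """Return one the_id that occurs in at least n rows, or None if there is none."""
    row = conn.execute(
        """
        SELECT the_id
        FROM the_table
        GROUP BY the_id
        HAVING COUNT(*) >= ?
        LIMIT 1  -- SQLite stand-in for Oracle's FETCH FIRST ROW ONLY
        """,
        (n,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (the_id INTEGER, mid INTEGER)")
# Hypothetical index on the grouped column, as the answer suggests.
conn.execute("CREATE INDEX idx_the_table_the_id ON the_table (the_id)")
conn.executemany(
    "INSERT INTO the_table (the_id, mid) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31), (3, 32), (4, 40)],
)

print(first_id_with_at_least(conn, 3))
```

Only the ID 3 has three rows in this sample data, so that is the one returned. Whether a real engine stops grouping early depends on its planner and on the index, which this sketch does not measure.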

Related

Postgres Skipping Duplicate Field in Select

I want to select a limited number of items but only keeping ones with a distinct value for a specific field. I have tried using SELECT DISTINCT ON(field) as well as GROUP BY but they are both extremely slow because the table is very large. I assume this is because using DISTINCT will actually sort the table into distinct values before selecting at all.
SELECT DISTINCT ON(parent) id FROM posts WHERE sub = ? LIMIT 25
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all. Similar to selecting a value with a condition, which (without an index) will scan each row and check if it meets the condition before continuing, how can I use not having duplicate fields as a condition?
Another way to think about it is how do I do this:
SELECT DISTINCT ON (parent) post.id FROM
(SELECT id FROM posts WHERE sub = ? ORDER BY id LIMIT 25) AS post
While guaranteeing that there are 25 results. Here the result is very fast, but it will usually have fewer results than required because multiple rows can have the same parent.
The way you're thinking may seem to make sense, but if you think a little deeper, you'll find that it cannot work that way. You want 25 unique results. To give you that, the database first needs to go through the records and find the unique ones, then return the first 25.
What you actually want is for it to go through the records one by one and check, do I already have similar value? If yes, discard it and continue, if no, add it to the results. Now check, do I already have 25 results? If no, continue, if yes, stop and return the results.
This is not a trivial task to do in a query. Your best bet is to do it in a stored procedure with a cursor. That will be much easier as you are in full control of the flow, just follow the steps as per the description above.
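The steps described above can be sketched as a loop over a cursor. This version runs the loop client-side in Python against an in-memory SQLite database rather than in a Postgres stored procedure, so it shows only the control flow, not the server-side implementation the answer recommends. The function name and the sample rows are hypothetical.

```python
import sqlite3


def first_distinct_parents(conn, sub, limit):
    """Walk posts in id order, keep the first post per parent, stop at `limit`."""
    seen_parents = set()
    picked_ids = []
    cursor = conn.execute(
        "SELECT id, parent FROM posts WHERE sub = ? ORDER BY id", (sub,)
    )
    for post_id, parent in cursor:
        if parent in seen_parents:
            continue  # already have a post for this parent: discard it
        seen_parents.add(parent)
        picked_ids.append(post_id)
        if len(picked_ids) == limit:
            break  # enough results: stop reading further rows
    cursor.close()
    return picked_ids


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, parent INTEGER, sub TEXT)")
conn.executemany(
    "INSERT INTO posts (id, parent, sub) VALUES (?, ?, ?)",
    [(1, 100, "a"), (2, 100, "a"), (3, 200, "a"),
     (4, 300, "b"), (5, 200, "a"), (6, 400, "a")],
)

print(first_distinct_parents(conn, "a", 3))
```

Because the loop breaks as soon as the limit is reached, rows after that point are never fetched from the cursor.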
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all.
If you really know that your first 25 results will be found in the first xx records (say first 100), and that's all you care to achieve, then you can use a somewhat dumb query:
SELECT DISTINCT ON (parent) post.id
FROM (SELECT id, parent FROM posts WHERE sub = ? ORDER BY id LIMIT 100) AS post
LIMIT 25
Change the 100 to whatever suits your needs.
When you use distinct on, you should use order by:
SELECT DISTINCT ON (parent) id
FROM posts
WHERE sub = ?
ORDER BY parent
LIMIT 25;
To optimize this query, you want an index on posts(sub, parent, id).

SQLite: How to SELECT "most recent record for each user" from single table with composite key?

I'm not a database guru and feel like I'm missing some core SQL knowledge to grok a solution to this problem. Here's the situation as briefly as I can explain it.
Context:
I have a SQLite database table that contains timestamped user event records. The records can be uniquely identified by the combination of timestamp and user ID (i.e., when the event took place and who the event is about). I understand this situation is called a "composite primary key." The table looks something like this (with a bunch of other columns removed, of course):
sqlite> select Last_Updated,User_ID from records limit 4;
Last_Updated User_ID
------------- --------
1434003858430 1
1433882146115 3
1433882837088 3
1433964103500 2
Question: How do I SELECT a result set containing only the most recent record for each user?
Given the above example, what I'd like to get back is a table that looks like this:
Last_Updated User_ID
------------- --------
1434003858430 1
1433882837088 3
1433964103500 2
(Note that the result set only includes user 3's most recent record.)
In reality, I have approximately 2.5 million rows in this table.
Bonus: I've been reading answers about JOINs, de-dupe procedures, and a bunch more, and I've been googling for tutorials/articles in the hopes that I would find what I'm missing. I have extensive programming background so I could de-dupe this dataset in procedural code like I've done a hundred times before, but I'm tired of writing scripts to do what I believe should be possible in SQL. That's what it's for, right?
So, what do you think is missing from my understanding of SQL, conceptually, that I need in order to understand why the solution you've provided to my question actually works? (A reference to a good article that actually explains the theory behind the practice would suffice.) I want to know WHY the solution actually works, not just that it does.
Many thanks for your time!
You could try this:
select user_id, max(last_updated) as latest
from records
group by user_id
This should give you the latest record per user. I assume you have an index on user_id and last_updated combined.
In the above query, generally speaking, we are asking the database to group records by user_id. If there is more than one record for user_id 1, they will all be grouped together. From that group, the maximum last_updated is picked for output. Then the next group is sought and the same operation is applied there.
If you have a composite index, sqlite will likely just use the index because the index contains both fields addressed in the query. Indexes are smaller than the table itself, so scanning or seeking is faster.
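As a quick check, here is the grouped query run from Python against the four sample rows from the question. The composite primary key on (User_ID, Last_Updated) is an assumption standing in for the combined index mentioned above, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "Last_Updated INTEGER, User_ID INTEGER, "
    "PRIMARY KEY (User_ID, Last_Updated))"  # assumed composite key
)
conn.executemany(
    "INSERT INTO records (Last_Updated, User_ID) VALUES (?, ?)",
    [
        (1434003858430, 1),
        (1433882146115, 3),
        (1433882837088, 3),
        (1433964103500, 2),
    ],
)


def latest_per_user(conn):
    """Return (User_ID, newest Last_Updated) pairs, one per user."""
    return conn.execute(
        "SELECT User_ID, MAX(Last_Updated) AS latest "
        "FROM records GROUP BY User_ID ORDER BY User_ID"
    ).fetchall()


for user_id, latest in latest_per_user(conn):
    print(user_id, latest)
```

User 3 has two rows in the input but only its newer timestamp, 1433882837088, appears in the result.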
Well, in true "d'oh!" fashion, right after I ask this question, I find the answer.
For my case, the answer is:
SELECT MAX(Last_Updated),User_ID FROM records GROUP BY User_ID
I was making this more complicated than it needed to be by thinking I needed to use JOINs and stuff. Applying an aggregate function like MAX() is all that's needed to select only those rows whose content matches the function result. That means this statement…
SELECT MAX(Last_Updated),User_ID FROM records
…would therefore return a result set containing only 1 row, the most recent event.
By adding the GROUP BY clause, however, the result set contains a row for each "group" of results, i.e., for each user. My programmer-brain did not understand that GROUP BY is how we say "for each" in SQL. I think I get it now.
Note to self: keep it simple, stupid. :)

Is it possible to query a table without order columns by page

I have a big table in Oracle that contains more than 100K records. I want to get all of the records and save each row to a file with JDBC.
To make this faster, I want to create 100 threads that read the data from the table concurrently. I will get the total count of the records with a first SQL query, split it into 100 pages, then fetch one page per thread, each with a new connection.
But I have a problem: there is no column that can be used for ordering. There is no sequence column and no accurate timestamp. I can't run the query without an ORDER BY clause, since there is no guarantee it will return the data in the same order every time (per this question).
So is it possible to solve it?
Finally, I used rowid to order:
select * from mytable order by rowid
It seems to work well.
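A rough sketch of how ordering by rowid makes the pages well defined. It uses SQLite's integer `rowid` from Python as a stand-in for Oracle's ROWID pseudocolumn, which is an assumption: the two are implemented differently, and this says nothing about how Oracle rowids behave while the table is being modified. The `payload` column, the page size, and the `LIMIT ... OFFSET` paging syntax are illustrative only.

```python
import sqlite3


def fetch_page(conn, page, page_size):
    """Return one page of rows; the fixed rowid order keeps pages from overlapping."""
    return conn.execute(
        "SELECT rowid, payload FROM mytable ORDER BY rowid LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (payload TEXT)")
conn.executemany(
    "INSERT INTO mytable (payload) VALUES (?)",
    [(f"row-{i}",) for i in range(10)],
)

# First query: total row count, used to work out how many pages there are.
total = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
page_size = 4
page_count = (total + page_size - 1) // page_size  # round up

pages = [fetch_page(conn, p, page_size) for p in range(page_count)]
for number, rows in enumerate(pages):
    print(number, [payload for _, payload in rows])
```

In the multi-threaded version described in the question, each call to `fetch_page` would run on its own connection.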

Storing count in the main table

I have three database tables,
Users ( UserID, ... )
Entries ( EntryID, ... )
Likes ( UserID, EntryID, ... )
My question is easy. Should I use a LikeCount column in the Entries table, or use a SELECT COUNT(*) statement on the Likes table every time I need it? Which one is the better practice?
It's probably a duplicate of this question: storing the count of rows or just count the rows? The answer given there is basically: don't use LikeCount, and count every time you need it instead. However, it does not give satisfactory answers to the following questions:
What are the bad consequences of storing the count in the table?
What is the performance analysis of these two different approaches if I need to count likes very frequently in my application?
PS: I use SQL Server 2008 if it is important
What are the bad consequences of storing the count in the table:
There are 2 problems with this approach:
1. You have to use database triggers or application code to keep the count up to date as the Likes table changes.
2. If you ever get #1 wrong, you have to deal with the fact that the LikeCount might not actually match the number of likes.
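To make the first point concrete, here is a minimal sketch of the trigger approach, run from Python against in-memory SQLite. The trigger names are hypothetical, and SQLite's trigger syntax differs from SQL Server's T-SQL, so treat this as an illustration of the bookkeeping involved rather than code to paste into SQL Server 2008.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Entries (
        EntryID   INTEGER PRIMARY KEY,
        LikeCount INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE Likes (
        UserID  INTEGER,
        EntryID INTEGER,
        PRIMARY KEY (UserID, EntryID)
    );

    -- Every insert into Likes bumps the stored count...
    CREATE TRIGGER likes_after_insert AFTER INSERT ON Likes
    BEGIN
        UPDATE Entries SET LikeCount = LikeCount + 1 WHERE EntryID = NEW.EntryID;
    END;

    -- ...and every delete has to undo it, or the count drifts.
    CREATE TRIGGER likes_after_delete AFTER DELETE ON Likes
    BEGIN
        UPDATE Entries SET LikeCount = LikeCount - 1 WHERE EntryID = OLD.EntryID;
    END;
    """
)

conn.execute("INSERT INTO Entries (EntryID) VALUES (45)")
conn.executemany(
    "INSERT INTO Likes (UserID, EntryID) VALUES (?, ?)",
    [(1, 45), (2, 45), (3, 45)],
)
conn.execute("DELETE FROM Likes WHERE UserID = 2 AND EntryID = 45")

stored = conn.execute(
    "SELECT LikeCount FROM Entries WHERE EntryID = 45"
).fetchone()[0]
counted = conn.execute(
    "SELECT COUNT(*) FROM Likes WHERE EntryID = 45"
).fetchone()[0]
print(stored, counted)
```

The stored and counted values agree here only because both triggers exist; dropping either one is an example of getting #1 wrong.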
What is the performance analysis of these two different approaches if I need to count likes very frequently in my application:
I believe that if you created an index for the Likes table, most database engines will be able to answer a COUNT(*) query very quickly without referencing the actual table. Basically, in the index the database keeps track of how many rows match a given key, which is the same thing as your LikeCount.
If you are going to write a query like:
SELECT count(*) from Likes where EntryID=45;
Then your index has to be on EntryID.
But, if you are going to write a query like:
SELECT count(*) from Likes where EntryID=45 and deleted=False;
Then your index has to be on (EntryID, deleted).
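One way to check whether a count is answered from the index is to look at the query plan. The sketch below does this with SQLite's `EXPLAIN QUERY PLAN` from Python; the index name and sample rows are made up, `deleted` is stored as 0/1, and SQL Server has its own tooling for viewing plans, so this shows the idea only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Likes ("
    "UserID INTEGER, EntryID INTEGER, deleted INTEGER NOT NULL DEFAULT 0)"
)
# Index covering both columns used in the WHERE clause.
conn.execute("CREATE INDEX idx_likes_entry_deleted ON Likes (EntryID, deleted)")
conn.executemany(
    "INSERT INTO Likes (UserID, EntryID, deleted) VALUES (?, ?, ?)",
    [(1, 45, 0), (2, 45, 0), (3, 45, 1), (1, 46, 0)],
)

query = "SELECT COUNT(*) FROM Likes WHERE EntryID = 45 AND deleted = 0"
like_count = conn.execute(query).fetchone()[0]

# The last column of each plan row is the human-readable description.
plan = " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

print(like_count)
print(plan)
```

If the plan text names the index, the count was computed by searching the index rather than scanning the table.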

What's the most efficient way to check the presence of a row in a table?

Say I want to check if a record in a MySQL table exists. I'd run a query, check the number of rows returned. If 0 rows do this, otherwise do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
Is a whole new question. Would the server grab all the values and then count the values (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT CASE WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
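A small sketch of the EXISTS form wrapped in a helper, run from Python against in-memory SQLite instead of MySQL. The table name `items` and the function name are hypothetical (a table literally called `table` would need quoting, since it is a reserved word).

```python
import sqlite3


def row_exists(conn, record_id):
    """Return True if at least one row in items has the given id."""
    (found,) = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM items WHERE id = ?)", (record_id,)
    ).fetchone()
    return bool(found)  # EXISTS yields 1 or 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(5, "five"), (7, "seven")],
)

print(row_exists(conn, 5), row_exists(conn, 6))
```

The query always returns exactly one row holding 1 or 0, so the calling code does not need to count returned rows.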
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, using LIMIT 1 (or TOP 1, depending on the DB) to check for the existence of a row makes a big difference in terms of performance for large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway if it's of any help.
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say: always specify exactly what you need and leave the rest. The cost isn't only server-side effort; bandwidth is also spent returning data that you aren't going to do anything with.
As for counting, the previous answer will do for your result set, unless you're dealing with a language whose database interface reports the number of rows returned by the last query, in which case you may be able to use that instead. You'll need to look at your interface documentation to see how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster as MySQL will have your index in memory, and not have to hit disk. Adding the LIMIT 1 is also a good trick that will speed up the optimizer significantly in early 5.0.x branches and earlier.
Try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extra column for the presence of "Using index". If it's there, then your query is being answered straight from the index, and is going to be much faster.