Semi-Distinct MySQL Query

I have a MySQL table called items that contains thousands of records. Each record has a user_id field and a created (datetime) field.
Trying to put together a query to SELECT 25 rows, passing a string of user ids as a condition and sorted by created DESC.
In some cases, there might be just a few user ids, while in other instances, there may be hundreds.
If the result set is greater than 25, I want to pare it down by eliminating duplicate user_id records. For instance, if there were two records for user_id = 3, only the most recent (according to created datetime) would be included.
In my attempts at a solution, I am having trouble because while, for example, it's easy to get a result set of 100 (allowing duplicate user_id records), or a result set of 16 (using GROUP BY for unique user_id records), it's hard to get 25.
One logical approach, which may not be the correct MySQL approach, is to get the most recent record for each user_id and then, if the result set is less than 25, begin adding a second record for each user_id until the 25-record limit is met (maybe a third, fourth, etc. record for each user_id would be needed).
Can this be accomplished with a MySQL query, or will I need to take a large result set and trim it down to 25 with code?

I don't think what you're trying to accomplish is possible as a SQL query. You want to return exactly 25 rows regardless of how the data naturally groups, whereas SQL wants to return rows based on those groupings.
If you want a purely MySQL-based solution, you may be able to accomplish this with a stored procedure. (Supported in MySQL 5.0.x and later.) However, it might just make more sense to run the query to return all 100+ rows and then trim it programmatically within the application.
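For what it's worth, on MySQL versions with window functions (8.0 and later, which postdate this question), the layered approach the asker describes can be expressed as a single query. A sketch, assuming the items table from the question and a hypothetical list of user ids:
SELECT id, user_id, created
FROM (
    SELECT id, user_id, created,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created DESC) AS rn
    FROM items
    WHERE user_id IN (3, 7, 12) -- hypothetical id list
) AS ranked
ORDER BY rn, created DESC
LIMIT 25;
Every user's newest row (rn = 1) sorts ahead of any user's second-newest row (rn = 2), and so on, so the 25 slots fill in exactly the order the question describes.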

This will get you the most recent record for each user --
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
  ON i1.user_id = i2.user_id AND i2.created > i1.created
WHERE i2.id IS NULL -- no newer row exists for this user
This will get you the most recent two records for each user --
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
  ON i1.user_id = i2.user_id AND i2.created > i1.created
GROUP BY i1.id, i1.user_id, i1.created
HAVING COUNT(i2.id) < 2 -- fewer than two newer rows exist
Try working from there.
You could nicely put this into a stored procedure.

My opinion is to use application logic: what you are trying to implement at the DB level is very much application-layer logic, i.e. filtering the results down to make the search more useful to the end user.
You could implement a stored procedure (personally I would never do such a thing) or just have the application decide which 25 results to keep.

One approach would be to get the most recent item from each user, followed by the most recent items from all users, and limit that. You could construct pathological examples where this probably isn't what you want, but it should be pretty good in general.
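A sketch of that idea in MySQL, assuming the items table from the first question and a hypothetical id list; the pick column keeps each user's newest row ahead of the filler rows, and the GROUP BY folds out rows that both branches return:
SELECT id, user_id, created
FROM (
    SELECT i.id, i.user_id, i.created, 0 AS pick -- newest row per user
    FROM items AS i
    JOIN (
        SELECT user_id, MAX(created) AS created
        FROM items
        WHERE user_id IN (3, 7, 12)
        GROUP BY user_id
    ) AS newest
      ON newest.user_id = i.user_id AND newest.created = i.created
    UNION ALL
    SELECT id, user_id, created, 1 AS pick -- filler: all candidate rows
    FROM items
    WHERE user_id IN (3, 7, 12)
) AS candidates
GROUP BY id, user_id, created
ORDER BY MIN(pick), created DESC
LIMIT 25;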

Unfortunately, there is no easy way :( I had to do something similar when I built a report for my company that would pull up customer disables that were logged in a database. The only problem was that the disconnect is run and logged every 30 minutes, so the rows were not distinct, since the timestamp was different in every disconnect. I solved this problem with subqueries. I don't have the exact code anymore, but I believe this is how I implemented it:
SELECT CORP, HOUSE, CUST,
(
    SELECT TOP 1 t2.hsd
    FROM #TempTable t2
    WHERE t1.corp = t2.corp
      AND t1.house = t2.house
      AND t1.cust = t2.cust
    ORDER BY t2.hsd DESC -- without an ORDER BY, TOP 1 picks an arbitrary row
) AS DisableDate
FROM #TempTable t1
GROUP BY corp, house, cust -- selecting distinct
So, my answer is to eliminate the non-distinct column from the query by using subqueries. There might be an easier way to do it, though. I'm curious to see what others post.
Sorry, I keep editing this; I keep trying to find ways to make it easier to show what I did.

Related

Quickly find a record appearing more than twice in a table

Title pretty much says it all: using Oracle SQL, I'd like to get, as quickly as possible, three records that share an ID from a very large table. The rows are not duplicates; they share one ID (rID) but differ in another (mID).
One approach I know I could do (that would be very slow) would be to load the first, say, 1000 records into a C# program, then execute a COUNT query to count the number of records with each ID until I hit one with 3 records and return that ID. I know this is a terrible approach, but it should give an idea of what I want to get out of this.
I've tried using GROUP BY, and this would work but would be unacceptably slow. I don't care about the state of the rest of the table; I just need a single ID that has three records. Ideally I'd do something like a GROUP BY that would stop after finding the first ID with three or more records and just return that one. There are over a million records in the table, so efficiency is important.
What you describe translates to:
select the_id
from the_table
group by the_id
having count(*) >= 3
fetch first row only;
This should be as fast as it gets. You can help Oracle by providing an index on the id. That's about it.
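For instance (hypothetical names, matching the placeholders in the query above):
CREATE INDEX the_table_id_ix ON the_table (the_id);
With that in place, Oracle can typically answer the GROUP BY with a fast full scan of the index instead of reading the whole table.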

Postgres Skipping Duplicate Field in Select

I want to select a limited number of items but only keeping ones with a distinct value for a specific field. I have tried using SELECT DISTINCT ON(field) as well as GROUP BY but they are both extremely slow because the table is very large. I assume this is because using DISTINCT will actually sort the table into distinct values before selecting at all.
SELECT DISTINCT ON(parent) id FROM posts WHERE sub = ? LIMIT 25
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all. Similar to selecting a value with a condition, which (without an index) will scan each row and check if it meets the condition before continuing, how can I use not having duplicate fields as a condition?
Another way to think about it is how do I do this:
SELECT DISTINCT ON (parent) post.id FROM
(SELECT id, parent FROM posts WHERE sub = ? ORDER BY id LIMIT 25) AS post
While guaranteeing that there are 25 results. Here the result is very fast, but it will usually have fewer results than required because multiple rows can have the same parent. (Note the inner query has to select parent as well, or the outer DISTINCT ON has nothing to reference.)
The way you're thinking may seem to make sense, but if you think a little deeper, you'll find that it cannot work that way. You want 25 unique results. To give you that, it first needs to go through the records and find the unique ones, then return the first 25.
What you actually want is for it to go through the records one by one and check: do I already have a similar value? If yes, discard it and continue; if no, add it to the results. Then check: do I already have 25 results? If no, continue; if yes, stop and return the results.
This is not a trivial task to do in a query. Your best bet is to do it in a stored procedure with a cursor. That will be much easier as you are in full control of the flow, just follow the steps as per the description above.
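A rough PL/pgSQL sketch of that cursor-style approach, assuming the posts(id, parent, sub) table from the question and integer columns; all names here are illustrative:
CREATE OR REPLACE FUNCTION first_distinct_parents(p_sub int, p_limit int)
RETURNS SETOF posts AS $$
DECLARE
    rec  posts%ROWTYPE;
    seen int[] := '{}'; -- parents already returned
BEGIN
    FOR rec IN SELECT * FROM posts WHERE sub = p_sub ORDER BY id LOOP
        IF rec.parent = ANY (seen) THEN
            CONTINUE; -- duplicate parent: discard and keep scanning
        END IF;
        seen := seen || rec.parent;
        RETURN NEXT rec;
        EXIT WHEN array_length(seen, 1) >= p_limit; -- stop at the limit
    END LOOP;
END;
$$ LANGUAGE plpgsql;
-- usage: SELECT id FROM first_distinct_parents(42, 25);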
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all.
If you really know that your first 25 results will be found in the first xx records (say first 100), and that's all you care to achieve, then you can use a somewhat dumb query:
SELECT DISTINCT ON (parent) post.id
FROM (SELECT id, parent FROM posts WHERE sub = ? ORDER BY id LIMIT 100) AS post
LIMIT 25
Change the 100 to whatever suits your needs.
When you use distinct on, you should use order by:
SELECT DISTINCT ON (parent) id
FROM posts
WHERE sub = ?
ORDER BY parent
LIMIT 25;
To optimize this query, you want an index on posts(sub, parent, id).
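That index might look like:
CREATE INDEX posts_sub_parent_id_idx ON posts (sub, parent, id);
It lets Postgres read the sub = ? rows already ordered by parent, so the DISTINCT ON can often be satisfied from the index alone.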

Store results of SQL Server query for pagination

In my database I have a table with a rather large data set that users can perform searches on. So for the following table structure for the Person table that contains about 250,000 records:
firstName | lastName | age
----------|----------|----
John      | Doe      | 25
John      | Sams     | 15
the users would be able to perform a query that can return about 500 or so results. What I would like to do is allow the user to see his search results 50 at a time using pagination. I've figured out the client-side pagination stuff, but I need somewhere to store the query results so that the pagination uses the results from his unique query and not from a SELECT * statement.
Can anyone provide some guidance on the best way to achieve this? Thanks.
Side note: I've been trying to use temp tables to do this by using the SELECT INTO statements, but I think that might cause some problems if, say, User A performs a search and his results are stored in the temp table then User B performs a search shortly after and User A's search results are overwritten.
In SQL Server the ROW_NUMBER() function is great for pagination, and may be helpful depending on what parameters change between searches, for example if searches were just for different firstName values you could use:
;WITH search AS (SELECT *,ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName
FROM YourTable)
SELECT *
FROM search
WHERE RN BETWEEN 51 AND 100
AND firstName = 'John'
You could add additional ROW_NUMBER() lines, altering the PARTITION BY clause based on which fields are being searched.
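For example, to also support searches on lastName (a sketch, reusing the Person table from the question):
;WITH search AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY firstName ORDER BY lastName) AS RN_firstName,
           ROW_NUMBER() OVER (PARTITION BY lastName ORDER BY firstName) AS RN_lastName
    FROM Person
)
SELECT *
FROM search
WHERE lastName = 'Doe'
  AND RN_lastName BETWEEN 1 AND 50 -- first page of the lastName search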
Historically, for us, the best way to manage this is to create a complete new table with a unique name. Then, when you're done, you can schedule the table for deletion.
The table, if practical, simply contains an index id (a simple sequence: 1, 2, 3, 4, 5) and the primary key to the table(s) that are part of the query. Not the entire result set.
Your pagination logic then does something like:
SELECT p.* FROM temp_1234 t, primary_table p
WHERE t.pkey = p.primary_key
AND t.serial_id between 51 and 100
The serial id is your paging index.
So, you end up with something like (note, I'm not a SQL Server guy, so pardon):
CREATE TABLE temp_1234 (
    serial_id int IDENTITY(1,1), -- the paging index
    pkey int
);
INSERT INTO temp_1234 (pkey)
SELECT primary_key FROM primary_table WHERE <criteria> ORDER BY <sort>;
CREATE INDEX i_temp_1234 ON temp_1234 (serial_id); -- I think SQL already does this for you
If you can delay the index, it's faster than creating it first, but it's a marginal improvement most likely.
Also, create a tracking table where you insert the table name and the date. You can use this with a reaper process later (late at night) to DROP the day's tables (those more than, say, X hours old).
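A sketch of what the tracking table and the reaper's query might look like (names are illustrative):
CREATE TABLE results_tables (
    table_name varchar(64) PRIMARY KEY,
    created_at datetime NOT NULL
);
-- the reaper lists stale result tables, DROPs each one, then deletes its row
SELECT table_name
FROM results_tables
WHERE created_at < DATEADD(hour, -12, GETDATE());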
Full-table operations are much cheaper than inserting and deleting rows in a single shared table:
INSERT INTO page_table SELECT 'temp_1234', <sequence>, primary_key...
DELETE FROM page_table WHERE page_id = 'temp_1234';
That's just awful.
First of all, make sure you really need to do this. You're adding significant complexity, so go and measure whether the queries and pagination really hurt or you just "feel like you should". The pagination can be handled with ROW_NUMBER() quite easily.
Assuming you go ahead, once you've got your query, clearly you need to build a cache so first you need to identify what the key is. It will be the SQL statement or operation identifier (name of stored procedure perhaps) and the criteria used. If you don't want to share between users then the user name or some kind of session ID too.
Now when you do a query, you first look up in this table with all the key data then either
a) Can't find it, so you run the query and add to the cache, storing the criteria/keys and either the data or the PK of the data, depending on whether you want a snapshot or real time. Bear in mind that "real time" isn't really real time, because other users could be changing the data under you.
b) Find it, so retrieve the results (or join the stored PKs to the underlying tables) and return them.
Of course now you need a background process to go and clean up the cache when it's been hanging around too long.
Like I said - you should really make sure you need to do this before you embark on it. In the example you give I don't think it's worth it.

Multiple SQL searches vs searching through one returned array

Is it faster to do multiple SQL finds on one table with different conditions or to get all the items from the table in an array and then separate them out that way? I realize I'm not explaining my question that well, so here is an example:
I want to pull records on posts and display them in categories based on when they were posted, say within one year, within one month, one week, etc. The nature of the categories results in lower level categories being entirely contained within upper level ones.
Should I do a SQL find with different conditions for each category, resulting in multiple calls to the database, or should I do one search, returning all of the items and then sort them out from the array? Thanks for your responses, sorry I'm new at this.
Typically I would say that you are going to get better performance by letting your database engine do the sorting work. Each database engine has this functionality and typically it can do it faster than you can.
So I would vote to use the database to get your multiple groups rather than trying to do it yourself in memory.
I typically perform one large SQL query and then break the array up in Ruby to minimize the number and duration of database connections.
This isn't necessarily any faster, and I have never benchmarked it, but fewer reads from the DB hopefully means it will scale further.
Edit: Never mind, I didn't quite understand the question. Just let SQL perform the ordering for you in a convenient fashion and then process the array yourself.
You can probably make it even easier if you let your SELECT statement generate helper columns to say which categories (e.g., based on the date) a record belongs to.
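For example, a single pass that tags each post with its narrowest bucket (a sketch, using the same cutoff dates as the UNION answer below; the application can then expand the nested categories itself):
SELECT p.*,
       CASE
           WHEN p.date > '2009-07-17 00:00:00' THEN 'week'
           WHEN p.date > '2009-06-24 00:00:00' THEN 'month'
           WHEN p.date > '2008-07-24 00:00:00' THEN 'year'
           ELSE 'older'
       END AS bucket -- a 'week' post also belongs to 'month' and 'year'
FROM posts AS p
ORDER BY p.date DESC;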
The simplest, and easiest to understand would be to perform multiple queries for each criteria, and then form each of those result sets into a group. I don't think you want to start traversing result sets and duplicating rows.
If you really want to do it in one query, you could try a UNION query.
SELECT *, 1 AS grp FROM posts WHERE date > '2009-07-24 00:00:00'
UNION ALL
SELECT *, 2 AS grp FROM posts WHERE date > '2009-07-17 00:00:00'
UNION ALL
SELECT *, 3 AS grp FROM posts WHERE date > '2009-06-24 00:00:00'
UNION ALL
SELECT *, 4 AS grp FROM posts WHERE date > '2008-07-24 00:00:00'
ORDER BY grp, date DESC
(GROUP is a reserved word, so the column is named grp here, and the ORDER BY has to apply to the whole UNION rather than to each branch.)
After that you just need to traverse the list once, and filter into new lists by the grp column.
It depends. If you're using OR operators in your procedures, then things could get kind of slow. It would be better at that point to use multiple SQL statements.
But really, you need to analyze the query plans and decide for yourself if it is efficient enough or not. Run real world examples and TEST TEST TEST.
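For instance, in MySQL you can eyeball the plan of each candidate query before committing to one (posts and date as in the answer above):
EXPLAIN SELECT * FROM posts WHERE date > '2009-07-17 00:00:00';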

Using top clause in Access sub report

I'm making a report in Access 2003 that contains a sub report of related records. Within the sub report, I want the top two records only. When I add "TOP 2" to the sub report's query, it seems to select the top two records before it filters on the link fields. How do I get the top two records of only those records that apply to the corresponding link field? Thanks.
The sample query below is supposed to return the two most recent orders for each customer (instead of all orders):
select
    [Order].ID,
    [Order].Customer_ID,
    [Order].PlacementDate
from
    [Order]
where
    [Order].ID in
    (
        select top 2
            RecentOrder.ID
        from
            [Order] as RecentOrder
        where
            RecentOrder.Customer_ID = [Order].Customer_ID
        order by
            RecentOrder.PlacementDate desc
    )
A query like this could be used in your sub-report to avoid using a temporary table.
CAVEAT EMPTOR: I did not test this sample query, and I don't know if it would work for a report running against a Jet database (we don't use Access to store data and we avoid Access reports like the plague :-). But it should work against SQL Server. Note that Order is a reserved word, hence the brackets.
I also don't know how well it would perform in your case. As usual, it depends. :-)
BTW, speaking of performance and hacks: I would not consider the use of a temporary table a hack. At worst, this trick can be considered a more-complicated-than-necessary interface to the report. :-) And using such a temporary table may actually be one of the good ways to improve performance. So don't hurry to write it off. :-)
I've got two suggestions:
1) Pass your master field (on the parent form) to the query as a parameter (you could reference a field on the parent form directly as well)
2) You could fake out rownumbers in Access and limit them to only rownum <= 2. E.g.,
SELECT o1.order_number, o1.order_date,
       (SELECT COUNT(*)
        FROM orders AS o2
        WHERE o2.order_date <= o1.order_date) AS RowNum
FROM orders AS o1
ORDER BY o1.order_date
(from http://groups.google.com/group/microsoft.public.access.queries/msg/ec562cbc51f03b6e?pli=1)
However, this kind of query might return a read-only recordset, so it might not be appropriate if you needed to do the same thing on a Form instead of a Report.