SQL select groups of distinct items in prepared statement?

I have a batch job that I run on a table which I'm sure I could write as a prepared statement. Currently it's all in Java and no doubt less efficient than it could be. For a table like so:
CREATE TABLE thing (
  `tag` varchar(255),
  `document` varchar(255),
  `weight` float
);
I want to create a new table that contains the top N entries for every tag. Currently I do this:
create new table with same schema
select distinct tag
for each tag:
    select * limit N, insert into the new table
This requires executing a query to get the distinct tags, then selecting the top N items for that tag and inserting them... all very inefficient.
Is there a stored procedure (or even a simple query) that I could use to do this? If dialect is important, I'm using MySQL.
(And yes, I do have my indexes sorted!)
Cheers
Joe

I haven't done this in a while (spoiled by CTEs in SQL Server), and I'm assuming that your data is ordered by weight; try
SELECT tag, document, weight
FROM thing
WHERE (SELECT COUNT(*)
       FROM thing AS t
       WHERE t.tag = thing.tag AND t.weight < thing.weight) < N;
I think that will do it.
EDIT: corrected error in code; need < N, not <= N.
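For instance, to populate the new table in one statement, here is a sketch assuming N = 10 and a hypothetical target table thing_top with the same schema:
INSERT INTO thing_top (tag, document, weight)
SELECT tag, document, weight
FROM thing
WHERE (SELECT COUNT(*)
       FROM thing AS t
       WHERE t.tag = thing.tag AND t.weight < thing.weight) < 10;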

If you were using SQL Server, I would suggest using the ROW_NUMBER function, partitioned by tag, and selecting rows where row_number <= N. (In other words, order and number the rows within each tag according to their position in the tag group, then pick the top N rows from each group.) I found an article about simulating the ROW_NUMBER function in MySQL here:
http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/
See if this helps you out!
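For what it's worth, MySQL 8.0 and later have native window functions, so the ROW_NUMBER approach can now be written directly; a sketch, again assuming N = 10, a hypothetical target table thing_top, and that "top" means largest weight:
INSERT INTO thing_top (tag, document, weight)
SELECT tag, document, weight
FROM (SELECT tag, document, weight,
             ROW_NUMBER() OVER (PARTITION BY tag ORDER BY weight DESC) AS rn
      FROM thing) ranked
WHERE rn <= 10;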

Related

SQL for getting each category's data in MariaDB

I need to fetch 4 random values from each category. What is the correct SQL syntax for MariaDB? I have attached an image of the table structure.
Should I write a stored procedure, or can I do it with basic SQL syntax?
You can do that with a SQL statement if you only have a few rows:
SELECT id, question, ... FROM x1 ORDER BY rand() LIMIT 1
This works fine if you have only a few rows; as soon as you have thousands of rows, the overhead of sorting becomes significant, because you have to sort all rows to get just one.
A trickier but better solution would be:
SELECT id, question FROM x1 JOIN (SELECT CEIL(RAND() * (SELECT MAX(id) FROM x1)) AS id) AS r USING (id);
Running EXPLAIN on both SELECTs will show you the difference...
If you need random values for different categories, combine the SELECTs via UNION and add a WHERE clause, as sketched below.
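A sketch of that approach, assuming a hypothetical category_id column and two categories with ids 1 and 2 (the parentheses are required so each SELECT gets its own ORDER BY and LIMIT):
(SELECT id, question FROM x1 WHERE category_id = 1 ORDER BY RAND() LIMIT 4)
UNION ALL
(SELECT id, question FROM x1 WHERE category_id = 2 ORDER BY RAND() LIMIT 4);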
http://mysql.rjweb.org/doc.php/groupwise_max#top_n_in_each_group
But then use ORDER BY category, RAND(). (Your category corresponds to the blog post's province.)
Notice how it uses @variables to do the counting.
If you have MariaDB 10.2, then use one of its window functions.
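A sketch of that, again assuming a hypothetical category_id column; note that the window's ORDER BY RAND() still shuffles every row, so test it against your data size:
SELECT id, question
FROM (SELECT id, question,
             ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY RAND()) AS rn
      FROM x1) t
WHERE rn <= 4;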
SELECT column FROM table WHERE category_id = XXX
ORDER BY RAND()
LIMIT 4
Do it for all categories.

What exactly is the/this data statement in SAS doing? PostgreSQL equivalent?

I'm converting a SAS script to Python for a PostgreSQL environment. In a few places I've found a data statement in SAS, which looks something like this (in multiple scripts):
data dups;
set picc;
by btn wtn resp_ji;
if not (first.resp_ji and last.resp_ji);
run;
Obviously datasets aren't the same in python or SQL environments, and I'm having trouble determining what this specific statement is doing. To be clear, there are a number of scripts being converted which create a dataset in this manner with this same name. So my expectation would be that most of these would be overwritten over and over.
I'm also unclear as to what the postgres equivalent to the condition in the data dups statement would be.
Is there an obvious PostgreSQL statement that would work in its place? Something like this?:
CREATE TABLE dups AS
SELECT btn, wtn, resp_ji
WHERE /*some condition that matches the condition in the data statement*/
Does the
by btn wtn resp_ji;
statement mean which columns are copied over, or is that the equivalent of an ORDER BY clause in PostgreSQL?
Thanks.
The statement is using what's called 'by group processing'. Before the step can run, it requires that the data is sorted by btn wtn resp_ji.
The first.resp_ji piece is checking to see if it's the first time it's seen the current value of resp_ji within the current btn/wtn combination. Likewise the last.resp_ji piece is checking if it's the final time that it will see the current value of resp_ji within the current btn/wtn combination.
Combining it all together the statement:
if not (first.resp_ji and last.resp_ji);
is saying: if the current value of resp_ji occurs multiple times for the current combination of btn/wtn, then keep the record, otherwise discard it. When used like that, the if statement implicitly keeps or discards the record.
To do the equivalent in SQL, you could do something like:
1. Find all records to discard.
2. Discard those records from the original dataset.
So...
create table rows_to_discard as
select btn, wtn, resp_ji, count(*) as freq
from mytable
group by btn, wtn, resp_ji
having count(*) = 1;

create table want as
select a.*
from mytable a
left join rows_to_discard b on b.btn = a.btn
                           and b.wtn = a.wtn
                           and b.resp_ji = a.resp_ji
where b.btn is null;
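As a one-statement variant of the above, a windowed count collapses the two steps in PostgreSQL; a sketch under the same column assumptions (keeping only the three by-columns for brevity):
CREATE TABLE want AS
SELECT btn, wtn, resp_ji
FROM (SELECT btn, wtn, resp_ji,
             COUNT(*) OVER (PARTITION BY btn, wtn, resp_ji) AS freq
      FROM mytable) t
WHERE freq > 1;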
EDIT : I should mention that there is no simple SQL equivalent. It may be possible by numbering rows in subqueries, and then building logic on top of that but it'd be ugh-ly. It may also depend on the specific flavour of SQL being used.
As someone who learned SAS before PostgreSQL, I found the following much more similar to SAS first./last. logic:
--first.
select distinct on (resp_ji) * from <table> order by resp_ji;
--last.
select distinct on (resp_ji) * from <table> order by resp_ji desc;
A way to detect duplicates (when no extra differentiating field is available) is to use the ctid as a tie-breaker:
CREATE TABLE dups AS
SELECT *
FROM picc p
WHERE EXISTS (SELECT *
              FROM picc x
              WHERE x.btn = p.btn
                AND x.wtn = p.wtn
                AND x.resp_ji = p.resp_ji
                AND x.ctid <> p.ctid);

efficiently find subset of records as well as total count

I'm writing a function in ColdFusion that returns the first couple of records that match the user's input, as well as the total count of matching records in the entire database. The function will be used to feed an autocomplete, so speed/efficiency are its top concerns. For example, if the function receives input "bl", it might return {sampleMatches:["blue", "blade", "blunt"], totalMatches:5000}
I attempted to do this in a single query for speed purposes, and ended up with something that looked like this:
select record, count(*) over ()
from table
where criteria like :criteria
and rownum <= :desiredCount
The problem with this solution is that count(*) over () always returns the value of :desiredCount. I saw a similar question to mine here, but my app will not have permissions to create a temp table. So is there a way to solve my problem in one query? Is there a better way to solve it? Thanks!
I'm writing this off the top of my head, so you should definitely time it, but I believe that the following CTE:
only requires you to write the conditions once,
only returns the number of records you specify,
has the correct total count added to each record,
and is evaluated only once.
SQL Statement
WITH q AS (
    SELECT record
    FROM table
    WHERE criteria like :criteria
)
SELECT q1.*, q2.*
FROM q q1
CROSS JOIN (
    SELECT COUNT(*) FROM q
) q2
WHERE rownum <= :desiredCount
A nested subquery should return the results you want
select record, cnt
from (select record, count(*) over () cnt
      from table
      where criteria like :criteria)
where rownum <= :desiredCount
This will, however, force Oracle to completely process the query in order to generate the accurate count. This seems unlikely to be what you want if you're trying to do an autocomplete particularly when Oracle may decide that it would be more efficient to do a table scan on table if :criteria is just b since that predicate isn't selective enough. Are you really sure that you need a completely accurate count of the number of results? Are you sure that your table is small enough/ your system is fast enough/ your predicates are selective enough for that to be a requirement that you could realistically meet? Would it be possible to return a less-expensive (but less-accurate) estimate of the number of rows? Or to limit the count to something smaller (say, 100) and have the UI display something like "and 100+ more results"?
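As a sketch of that last idea (assuming Oracle and a hypothetical cap of 100), you can stop counting once the cap is reached, so the database never has to process the full result set:
SELECT COUNT(*) AS capped_count
FROM (SELECT 1
      FROM table
      WHERE criteria LIKE :criteria
        AND rownum <= 100);
-- if capped_count = 100, the UI can display "100+ results"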

counting rows in select clause with DB2

I would like to query a DB2 table and get all the results of a query in addition to all of the rows returned by the select statement in a separate column.
E.g., if the table contains columns 'id' and 'user_id' and has 100 rows, each row of the result would appear in this format: (id) | (user_id) | 100.
I do not wish to use a GROUP BY clause in the query. (Just in case you are confused about what I am asking.) Also, I could not find an example here: http://mysite.verizon.net/Graeme_Birchall/cookbook/DB2V97CK.PDF.
Also, if there is a more efficient way of getting both of these results (values + count), I would welcome any ideas. My environment uses Zend Framework 1.x, which does not have an ODBC adapter for DB2. (See issue http://framework.zend.com/issues/browse/ZF-905.)
If I understand what you are asking for, then the answer should be
select t.*, g.tally
from mytable t,
(select count(*) as tally
from mytable
) as g;
If this is not what you want, then please give an actual example of desired output, supposing there are 3 to 5 records, so that we can see exactly what you want.
You would use window/analytic functions for this:
select t.*, count(*) over() as NumRows
from table t;
This will work for whatever kind of query you have.
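For the example in the question (columns id and user_id, 100 rows), a sketch assuming a hypothetical table name mytable:
SELECT id, user_id, COUNT(*) OVER () AS total_rows
FROM mytable;
-- each result row looks like: (id) | (user_id) | 100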

Delete all but top n from database table in SQL

What's the best way to delete all rows from a table in sql but to keep n number of rows on the top?
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Edit:
Chris brings up a good point about the performance hit, since the TOP 10 query would be run for each row. If this is a one-time thing it may not be a big deal, but if it is a common operation it deserves a closer look.
I would select the ID column(s) of the rows you want to keep into a temp table or table variable, then delete all the rows that do not exist in the temp table. The syntax mentioned by another user:
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
has a potential problem: the "SELECT TOP 10" query would be executed for each row in the table, which could be a huge performance hit. You want to avoid making the same query over and over again.
This syntax should work, based what you listed as your original SQL statement:
create table #nuke (NukeID int)

insert into #nuke (NukeID)
select top 1000 id from article

delete article
where not exists (select 1 from #nuke where NukeID = article.id)

drop table #nuke
For future reference, for those of us who don't use MS SQL:
In PostgreSQL use ORDER BY and LIMIT instead of TOP.
DELETE FROM table
WHERE id NOT IN (SELECT id FROM table ORDER BY id LIMIT n);
MySQL -- well...
Error -- This version of MySQL does not yet support 'LIMIT &
IN/ALL/ANY/SOME subquery'
Not yet I guess.
Here is how I did it. This method is faster and simpler:
Delete all but top n from database table in MS SQL using OFFSET command
WITH CTE AS
(
    SELECT ID
    FROM dbo.TableName
    ORDER BY ID DESC
    OFFSET 11 ROWS
)
DELETE CTE;
Replace ID with the column by which you want to sort.
Replace the number after OFFSET with the number of rows you want to keep.
Choose DESC or ASC, whichever suits your case.
I think using a virtual table would be much better than an IN-clause or temp table.
DELETE Product
FROM Product
LEFT OUTER JOIN (
    SELECT TOP 10 Product.id
    FROM Product
) TopProducts ON Product.id = TopProducts.id
WHERE TopProducts.id IS NULL
This really is going to be language specific, but I would likely use something like the following for SQL server.
DECLARE @n int;
SELECT @n = COUNT(*) FROM dTABLE;
DELETE TOP (@n - 10) FROM dTABLE;
if you don't care about the exact number of rows, there is always
DELETE TOP (90) PERCENT FROM dTABLE;
I don't know about other flavors but MySQL DELETE allows LIMIT.
If you could order things so that the n rows you want to keep are at the bottom, then you could do a DELETE FROM table LIMIT tablecount-n, as sketched below.
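A sketch of that, assuming a hypothetical id column, 1000 total rows, and n = 10 rows to keep (MySQL's DELETE accepts ORDER BY and LIMIT, but LIMIT needs a literal, so compute tablecount - n in the application or a prepared statement):
DELETE FROM table
ORDER BY id ASC
LIMIT 990; -- tablecount - n = 1000 - 10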
Edit
Oooo. I think I like Cory Foy's answer better, assuming it works in your case. My way feels a little clunky by comparison.
I would solve it using the technique below. The example expects an article table with an id on each row.
Delete article where id not in (select top 1000 id from article)
Edit: Too slow to answer my own question ...
Refactored?
Delete a From Table a Inner Join (
    Select Top ((Select Count(tableID) From Table) - 10) tableID
    From Table Order By tableID Desc
) b On b.tableID = a.tableID
edit: tried them both in the query analyzer, the current answer is fastest (damn order by...)
A better way would be to insert the rows you DO want into another table, drop the original table, and then rename the new table so it has the same name as the old table.
I've got a trick to avoid executing the TOP expression for every row. We can combine TOP with MAX to get the MaxId we want to keep. Then we just delete everything greater than MaxId.
-- Declare a variable to hold the highest id we want to keep.
DECLARE @MaxId AS int = (
    SELECT MAX(temp.ID)
    FROM (SELECT TOP 10 ID FROM table ORDER BY ID ASC) temp
)

-- Delete anything greater than MaxId. If MaxId is null, there is nothing to delete.
IF @MaxId IS NOT NULL
    DELETE FROM table WHERE ID > @MaxId
Note: It is important to use ORDER BY when declaring MaxId to ensure proper results are queried.