Transform arbitrary SQL SELECT TOP(x) to a SELECT COUNT(*)? - sql

I want to be able to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform that query into a query that counts how many rows would have been returned without the TOP(X) (i.e. SELECT COUNT(*)). Remember I am asking about an arbitrary query with any number of joins, where clauses, group by's etc.
Is there a way to do this?
edited to show syntax with Shannon's solution:
i.e.
`SELECT TOP(X) [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols] ORDER BY [cols]`
becomes
`SELECT COUNT(*) FROM
(SELECT [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols]) t`

Inline view:
select count(*)
from (...slightly transformed query...) t
To get the slightly transformed query:
If the select clause contains any columns without names, such as select ... avg(x) ... then do one of 1) Alias the column, such as avg(x) as AvgX, 2) Remove the column, but make sure at least one column is left, or my favorite 3) Just make the select clause select 1 as C
Remove TOP from select clause.
Remove order by clause.
EDIT 1 Fixed by adding aliases for the inline view and dealing with unnamed columns in select clause.
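The steps above can be sketched end to end. This is a minimal illustration that uses Python's built-in sqlite3 module rather than SQL Server, so LIMIT stands in for TOP(X); the orders table and its columns are made-up names for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('a', 10), ('a', 20), ('b', 5), ('c', 7), ('c', 1);
""")

# Original query: SQLite's LIMIT 2 plays the role of SQL Server's TOP(2).
limited = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 2
"""

# Transformed query: LIMIT/TOP removed, ORDER BY removed,
# select list replaced by a named constant (option 3 above),
# and the whole thing wrapped in an aliased inline view.
counted = """
    SELECT COUNT(*) FROM (
        SELECT 1 AS c
        FROM orders
        GROUP BY customer
    ) t
"""

limited_rows = conn.execute(limited).fetchall()
total_rows = conn.execute(counted).fetchone()[0]
print(len(limited_rows), total_rows)
```

The limited query returns two rows, while the wrapped query reports how many groups exist without the limit.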
EDIT 2 But what about the performance? Doesn't this require the DB to run the big query that I wanted to avoid in the first place with TOP(X)?
Not necessarily. It may be the case for some queries that this count will do more work than the TOP(x) would. And it may be the case that, for a particular query, you could make the equivalent count faster by making additional changes to remove work that is not needed for the final count. But those simplifications cannot be included in a general method to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform that query into a query that counts how many rows would have been returned without the TOP(X).
And in some cases, the query optimizer may optimize away enough work that the DB does not have to run the big query.
For example, here is a test table and data, using SQL Server 2005:
create table t (PK int identity(1, 1) primary key,
u int not null unique,
string VARCHAR(2000))
insert into t (u, string)
select top 100000 row_number() over (order by s1.id) , replace(space(2000), ' ', 'x')
from sysobjects s1,
sysobjects s2,
sysobjects s3,
sysobjects s4,
sysobjects s5,
sysobjects s6,
sysobjects s7
The non-clustered index on column u will be much smaller than the clustered index on column PK.
Then set up SSMS to show the actual execution plan for:
select PK, U, String from t
select count(*) from t
The first select does a clustered index scan, because it needs to return data out of the leaves. The second query does an index scan on the smaller non-clustered index created for the unique constraint on U.
Applying the transform of the first query we get:
select count(*)
from (select PK, U, String from t) t
Running that and looking at the plan, the index on U is used again, exact same plan as select count(*) from t. The leaves are not visited to find the values for String on every row.

Related

postgres jsonb_object_keys distinct or group by extremely slow

Database version is PostgreSQL 11.16
My table has 424868 records with a JSON field. When I do:
SELECT jsonb_object_keys(raw_json) FROM table;
It returns result for me within a second. So, I need to remove duplicate keys, but when I do:
SELECT DISTINCT jsonb_object_keys(raw_json) FROM table;
My database CPU goes up to 100% and it takes 15 minutes to get the result. I tried a solution with group by:
select array_agg(json_keys),id from (
select jsonb_object_keys(raw_json) as json_keys, id from table) a group by a.id
Same result.
For debugging I did this:
select count(*) from (SELECT jsonb_object_keys(raw_json) as k from table) test
and it returns me 41633935 keys

SQL CTE for a numbers table is fast for static value, but slow for table value

I'm trying to output the values of a table multiple times, based on a column in that table.
I tried to use CTE to make a numbers table on the fly:
WITH cte AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY (select 0)) AS i
FROM
sys.columns c1 CROSS JOIN sys.columns c2 CROSS JOIN sys.columns c3
)
select *
from myTable, cte
WHERE i <= myTable.timesToRepeatColumn
and myTable.id = '209386'
This SQL seems to take forever to run, so it seems to be trying to run the entire CTE before joining.
If I replace myTable.timesToRepeatColumn with a static value (say 10000), the query returns virtually instantly. So it seems to be doing the where i <= before fully cross-joining the CTE's table.
How can I tell SQL to do the where statement first like it does with a static number?
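You can use a recursive CTE to achieve your goal. List the columns explicitly (SELECT * in the recursive part would repeat the level column and break the UNION ALL), and stop at level 1 so that exactly timesToRepeatColumn rows come back:
WITH cte AS (
SELECT
id
, timesToRepeatColumn -- add any other columns you need here
, timesToRepeatColumn AS level
FROM
myTable
WHERE myTable.id = '209386'
UNION ALL
SELECT
id
, timesToRepeatColumn
, level - 1 AS level
FROM
cte
WHERE
level > 1
)
SELECT * FROM cte
OPTION (MAXRECURSION 0) -- SQL Server stops at 100 recursion levels by default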
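A minimal sketch of the recursive-CTE idea, run against SQLite through Python's sqlite3 module rather than SQL Server (so it uses WITH RECURSIVE and needs no MAXRECURSION hint). The table contents are invented for the demo, and the CTE counts down from timesToRepeatColumn to 1 so that each source row is emitted exactly that many times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (id TEXT PRIMARY KEY, timesToRepeatColumn INTEGER);
    INSERT INTO myTable VALUES ('209386', 4), ('other', 2);
""")

rows = conn.execute("""
    WITH RECURSIVE cte(id, timesToRepeatColumn, level) AS (
        -- anchor: the one row we want to repeat, starting at its repeat count
        SELECT id, timesToRepeatColumn, timesToRepeatColumn
        FROM myTable
        WHERE id = '209386'
        UNION ALL
        -- recursive step: emit the row again with level decreased by one
        SELECT id, timesToRepeatColumn, level - 1
        FROM cte
        WHERE level > 1
    )
    SELECT id, level FROM cte ORDER BY level
""").fetchall()

print(rows)
```

Because the recursion is driven by the row's own value, the database never has to build a large numbers table first.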
CTEs in SQL Server are not necessarily run 'independently'. SQL (in SQL Server, etc) is declarative, which means you tell it what you want, not how to do it.
If the query optimiser determines that it can do it better by doing something differently, it will.
A good example is
IF EXISTS(SELECT * FROM test) PRINT 'X';
IF (SELECT COUNT(*) FROM test) > 0 PRINT 'Y';
IF (SELECT COUNT(*) FROM test) > 1 PRINT 'Z';
If it was doing what you told it to do, the query plans for the second and third would basically be the same. However, when you run it, the query plans for the first and second are the same; the third differs.
When you hard-code the value (e.g., 10,000), the query optimiser can use that hardcoded value to determine what to do. In this case, it probably determines it doesn't need to run the full CTE, just run it until you get 10,000 rows.
However, if you use a value that can vary (e.g., myTable.timesToRepeatColumn), then the query optimiser often makes a query plan that would work for any value. As such, it makes a query plan that is not fantastic for your situation, probably creating the full CTE in memory before using it. If sys.columns has 100 rows, that's 100^3 = 1,000,000 rows it creates. If it has 1000, that's 1000^3, i.e., 1,000,000,000. Likely you have more than 1000 rows.

How to find duplicate rows in Hive?

I want to find duplicate rows in one of my Hive tables, and I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- this will give the total row count
The second query, below, gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800 rows are duplicate. But this query doesn't tell which rows are duplicated.
My second approach to find duplicate is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
The above query should list the rows that are duplicated and how many times each one is duplicated, but it returns zero rows, which would mean there are no duplicate rows in that table.
So I would like to know:
If my first approach is correct - if yes then how do I find which rows are duplicated
Why second approach is not providing list of rows which are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get a list of the duplicated rows.
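As a small runnable illustration of the group-by-every-column pattern, here is a sketch using Python's sqlite3 module instead of Hive. The payload column and the sample rows are assumptions made up for the demo; the GROUP BY ... HAVING shape is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (primary_key1 INTEGER, primary_key2 INTEGER, payload TEXT);
    INSERT INTO mytable VALUES
        (1, 1, 'x'), (1, 1, 'x'),
        (1, 2, 'y'),
        (2, 1, 'z'), (2, 1, 'z'), (2, 1, 'z');
""")

# Group by every column; any group with more than one member is a duplicated row.
duplicates = conn.execute("""
    SELECT primary_key1, primary_key2, payload, COUNT(*) AS copies
    FROM mytable
    GROUP BY primary_key1, primary_key2, payload
    HAVING COUNT(*) > 1
    ORDER BY primary_key1, primary_key2
""").fetchall()

print(duplicates)
```

Each result row carries the duplicated values plus the number of copies found.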
The analytic window function row_number() is quite useful here: it can expose duplicates based on the columns specified in the partition by clause. A simple inline view and an exists clause will then pinpoint which records in the original table belong to those duplicate sets. In some databases (such as Teradata), you can forgo the inline view by using a QUALIFY clause.
SQL1 & SQL2 can be combined. SQL2: If you want to deal with NULLs rather than simply dismiss them, then a coalesce and a concatenation might be better. In Hive, use concat() (the + operator is arithmetic), and put a delimiter between the parts so that ('ab', 'c') and ('a', 'bc') do not collapse into the same value:
SELECT count(1) , count(distinct concat(coalesce(keypart1 ,''), '|', coalesce(keypart2 ,'')) )
FROM srcTable s
3) Finds all records, not just the > 1 records. This provides all context data as well as the keys so it can be useful when analyzing why you have dups and not just the keys.
select * from srcTable s
where exists
( select 1 from (
SELECT
keypart1,
keypart2,
row_number() over( partition by keypart1, keypart2 ) seq
FROM srcTable t
-- WHERE (whatever additional filtering you want)
) t
where seq > 1
AND t.keypart1 = s.keypart1
AND t.keypart2 = s.keypart2
)
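The row_number() plus exists pattern can be tried out with a short sketch. It runs on SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or later), not on Hive; the note column, the sample rows, and the ORDER BY inside the window are assumptions added for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE srcTable (keypart1 TEXT, keypart2 TEXT, note TEXT);
    INSERT INTO srcTable VALUES
        ('a', 'x', 'first'), ('a', 'x', 'second'),
        ('b', 'y', 'only'),
        ('c', 'z', 'one'), ('c', 'z', 'two');
""")

# Inner view numbers the rows within each key; seq > 1 exists only for duplicated keys.
# The outer EXISTS then returns every row (with full context) that shares such a key.
rows = conn.execute("""
    SELECT s.keypart1, s.keypart2, s.note
    FROM srcTable s
    WHERE EXISTS (
        SELECT 1 FROM (
            SELECT keypart1, keypart2,
                   ROW_NUMBER() OVER (PARTITION BY keypart1, keypart2 ORDER BY note) AS seq
            FROM srcTable
        ) t
        WHERE t.seq > 1
          AND t.keypart1 = s.keypart1
          AND t.keypart2 = s.keypart2
    )
    ORDER BY s.keypart1, s.note
""").fetchall()

print(rows)
```

The ('b', 'y') row is left out because its key occurs only once; both copies of every duplicated key are returned.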
Suppose you want to get the duplicate rows based on a particular column, ID here. The query below will give you all the IDs that are duplicated in a Hive table. (Leave ID unquoted or use backticks; Hive reads "ID" in double quotes as a string literal.)
SELECT ID
FROM mytable
GROUP BY ID
HAVING count(ID) > 1

SQLServer SQL query with a row counter

I have a SQL query, that returns a set of rows:
SELECT id, name FROM users WHERE [group] = 2
I need to also include a column that has an incrementing integer value, so the first row needs to have a 1 in the counter column, the second a 2, the third a 3 etc
The query shown here is just a simplified example, in reality the query could be arbitrarily complex, with several joins and nested queries.
I know this could be achieved using a temporary table with an autonumber field, but is there a way of doing it within the query itself ?
For starters, something along the lines of:
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
However, it's important to note that the ROW_NUMBER() OVER (ORDER BY ...) construct only determines the values of Row_Counter, it doesn't guarantee the ordering of the results.
Unless the SELECT itself has an explicit ORDER BY clause, the results could be returned in any order, dependent on how SQL Server decides to optimise the query. (See this article for more info.)
The only way to guarantee that the results will always be returned in Row_Counter order is to apply exactly the same ordering to both the SELECT and the ROW_NUMBER():
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
ORDER BY my_order_column -- exact copy of the ordering used for Row_Counter
The above pattern will always return results in the correct order and works well for simple queries, but what about an "arbitrarily complex" query with perhaps dozens of expressions in the ORDER BY clause? In those situations I prefer something like this instead:
SELECT t.*
FROM
(
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY ...) AS Row_Counter -- complex ordering
FROM my_table
) AS t
ORDER BY t.Row_Counter
Using a nested query means that there's no need to duplicate the complicated ORDER BY clause, which means less clutter and easier maintenance. The outer ORDER BY t.Row_Counter also makes the intent of the query much clearer to your fellow developers.
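A minimal sketch of the nested-query pattern, run on SQLite through Python's sqlite3 module (SQLite 3.25 or later for ROW_NUMBER) rather than SQL Server. The users data is invented, and the filter column is called grp here because group is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, grp INTEGER);
    INSERT INTO users VALUES (1, 'carol', 2), (2, 'alice', 2), (3, 'bob', 1), (4, 'dave', 2);
""")

# The inner query assigns the counter; the outer query only has to sort by it,
# so the ordering expression is written once.
rows = conn.execute("""
    SELECT t.name, t.row_counter
    FROM (
        SELECT name, ROW_NUMBER() OVER (ORDER BY name) AS row_counter
        FROM users
        WHERE grp = 2
    ) AS t
    ORDER BY t.row_counter
""").fetchall()

print(rows)
```

Only the rows that pass the filter are numbered, and the outer ORDER BY guarantees they come back in counter order.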
In SQL Server 2005 and up, you can use the ROW_NUMBER() function, which has options for the sort order and the groups over which the counts are done (and reset).
The simplest way is to use a variable as a row counter. However, it takes two actual SQL commands: one to set the variable, and then the query, as follows:
SET @n=0;
SELECT @n:=@n+1, a.* FROM tablename a
Your query can be as complex as you like, with joins etc. I usually make this a stored procedure. You can have all kinds of fun with the variable, even use it to calculate against field values. The key is the := operator. Note that := is MySQL user-variable syntax; SQL Server does not accept it.
Here's a different approach.
If you have several tables of data that are not joinable, or for some reason you don't want to count all the rows at the same time but still want them to be part of the same row count, you can create a table that does the job for you.
Example:
create table #test (
rowcounter int identity,
invoicenumber varchar(30)
)
insert into #test(invoicenumber) select [column] from [Table1]
insert into #test(invoicenumber) select [column] from [Table2]
insert into #test(invoicenumber) select [column] from [Table3]
select * from #test
drop table #test

Returning more than one value from a sql statement

I was looking at sql inner queries (bit like the sql equivalent of a C# anon method), and was wondering, can I return more than one value from a query?
For example, return the number of rows in a table as one output value, and also, as another output value, return the distinct number of rows?
Also, how does distinct work? Is this based on whether one field may be the same as another (thus classified as "distinct")?
I am using Sql Server 2005. Would there be a performance penalty if I return one value from one query, rather than two from one query?
Thanks
You could do your first question by doing this:
SELECT
COUNT(field1),
COUNT(DISTINCT field2)
FROM table
(For the first field you could do * if needed to count null values.)
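To make the difference between the two counts concrete, here is a small sketch using Python's sqlite3 module; the table name tbl and its rows are made up for the demo. COUNT(*) counts every row, while COUNT(DISTINCT field2) ignores NULLs and repeated values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (field1 INTEGER, field2 TEXT);
    INSERT INTO tbl VALUES (1, 'a'), (2, 'a'), (3, 'b'), (4, NULL);
""")

# One statement, one result row, two values.
total, distinct_vals = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT field2) FROM tbl"
).fetchone()

print(total, distinct_vals)
```

Both values arrive in a single round trip as two columns of the same row.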
Distinct means the definition of the word. It eliminates duplicate returned rows.
Returning 2 values instead of 1 would depend on what the values were, if they were indexed or not and other undetermined possible variables.
If you mean subqueries within the select list, then no, each one can only return a single value. If you want more than one value, you will have to use the subquery as a join.
If the inner query is inline in the SELECT, you may struggle to select multiple values. However, it is often possible to JOIN to a sub-query instead; that way, the sub-query can be named and you can get multiple results
SELECT a.Foo, a.Bar, x.[Count], x.[Avg]
FROM a
INNER JOIN (SELECT Something, COUNT(1) AS [Count], AVG(SomeValue) AS [Avg]
FROM b -- b and SomeValue are placeholder names
GROUP BY Something) x
ON x.Something = a.Something
Which might help.
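A runnable sketch of joining to an aggregated sub-query, using Python's sqlite3 module. The tables a and b, the val column, and the data are hypothetical; the point is only that the named sub-query hands back several values per key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (Something INTEGER, Foo TEXT);
    CREATE TABLE b (Something INTEGER, val INTEGER);
    INSERT INTO a VALUES (1, 'foo1'), (2, 'foo2');
    INSERT INTO b VALUES (1, 10), (1, 20), (2, 5);
""")

# The sub-query x returns two aggregates per key; the join exposes both to the outer select.
rows = conn.execute("""
    SELECT a.Foo, x.cnt, x.average
    FROM a
    INNER JOIN (
        SELECT Something, COUNT(*) AS cnt, AVG(val) AS average
        FROM b
        GROUP BY Something
    ) x ON x.Something = a.Something
    ORDER BY a.Foo
""").fetchall()

print(rows)
```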
DISTINCT does what it says. IIRC, you can SELECT COUNT(DISTINCT Foo) etc to query distinct data.
You can return multiple results in 3 ways (off the top of my head):
By having a select with multiple values, e.g. select col1, col2, col3
With multiple queries, e.g. select 1; select '2'; select colA. You would get to them in a data reader by calling .NextResult()
Using output parameters: declare the parameters before executing the query, then get the values from them afterwards, e.g. set @param1 = '2' in the SQL, then string myparam2 = sqlCommand.Parameters["@param1"].Value.ToString() in the C#
Distinct filters the resulting rows to be unique.
Inner queries in the form:
SELECT * FROM tbl WHERE fld in (SELECT fld2 FROM tbl2 WHERE tbl.fld = tbl2.fld2)
cannot return multiple columns. When you need multiple columns from a secondary query, you usually need to do an inner join on the other query.
To get both counts as one row (count(distinct *) is not valid syntax, so the distinct rows are counted through a derived table):
SELECT count(*), (SELECT count(*) FROM (SELECT DISTINCT * FROM tbl) d) FROM tbl
will return a dataset with one row containing two columns. Column 1 is the total number of rows in the table. Column 2 counts only distinct rows.
Distinct means the returned dataset will not have any duplicate rows. Distinct can only appear once usually directly after the select. Thus a query such as:
SELECT distinct a, b, c FROM table
might have this result:
a1 b1 c1
a1 b1 c2
a1 b2 c2
a1 b3 c2
Note that values are duplicated across the whole result set but each row is unique.
I'm not sure what your last question means. You should return from a query all the data relevant to the query. As for faster, only benchmarking can tell you which approach is faster.