postgres jsonb_object_keys distinct or group by extremely slow - sql

Database version is PostgreSQL 11.16.
My table has 424868 records with a JSON field. When I do:
SELECT jsonb_object_keys(raw_json) FROM table;
it returns results within a second. So I need to remove the duplicate keys, but when I do:
SELECT DISTINCT jsonb_object_keys(raw_json) FROM table;
my database CPU increases to 100% and it takes 15 minutes to get the result. I tried a solution with GROUP BY:
select array_agg(json_keys),id from (
select jsonb_object_keys(raw_json) as json_keys, id from table) a group by a.id
Same result.
For debugging I did this:
select count(*) from (SELECT jsonb_object_keys(raw_json) as k from table) test
and it returns 41633935 keys.
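For reference, a rewrite that is sometimes suggested for this pattern moves the set-returning function into the FROM clause and deduplicates with GROUP BY, which gives the planner more freedom in how it aggregates; whether it helps depends on the plan you get. A sketch (my_table stands in for the real table name):
SELECT k
FROM my_table
CROSS JOIN LATERAL jsonb_object_keys(raw_json) AS k
GROUP BY k;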

Related

Fetch the number of rows that can be returned by a select query

I'm trying to fetch data and show it in a table with pagination, so I use LIMIT and OFFSET for that, but I also need to show the number of rows that the query could return in total. Is there any way to get that?
I tried:
resultset.last() and getRow()
select count(*) from (query) myNewTable;
In both cases I get the correct answer, but is this the right way to do it? Performance is a concern.
We can get the limited records using the code below.
First, we need to set how many records we want, like below:
var limit = 10;
After that, send this limit to the statement below:
WITH
Temp AS (
    SELECT
        ROW_NUMBER() OVER ( ORDER BY primaryKey DESC ) AS RowNumber,
        *
    FROM
        myNewTable
),
Temp2 AS (
    SELECT COUNT(*) AS TotalCount FROM Temp
)
SELECT TOP (:limit) * FROM Temp, Temp2 WHERE RowNumber > :offset ORDER BY RowNumber
This runs in MSSQL; MySQL has no TOP, so there you would use LIMIT instead (and you need MySQL 8+ for the CTE and window function).
There is no easy way of doing this.
1. As you found out, it usually boils down to executing 2 queries:
Executing SELECT with limit and offset in order to fetch the data that you need.
Executing a COUNT(*) in order to count the total number of rows.
This approach might work for tables that don't have a lot of rows, or when you filter the data (in the COUNT and SELECT queries) on a column that is indexed.
2. If your table is large, but the data that you need to show represents a smaller percentage of the data in the table, and the data shares a common trait (for example, the data on all of your pages was created on a single day), you can use partitioning. Executing COUNT and SELECT on a single partition will be much faster than executing them on the whole table.
3. You can create another table which will store the value of the COUNT query.
For example, let's say that your big_table table looks like this:
id | user_id | timestamp_column | text_column | another_text_column
Now, your SELECT query looks like this:
SELECT * FROM big_table WHERE user_id = 4 ORDER BY timestamp_column LIMIT 20 OFFSET 20;
And your count query:
SELECT COUNT(*) FROM table WHERE user_id = 4;
You could create a count_table that will have the following format:
user_id | count
Once you fill this table with the current data in the system, you will create a trigger which will update this table on every insert or update of the big_table.
This way, the count query will be really fast, because it will be executed on the count_table, for example:
SELECT count FROM count_table WHERE user_id = 4
The drawback of this approach is that inserts into the big_table will be slower, since the trigger will fire and update the count_table on every insert.
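A minimal sketch of the trigger for approach 3, in PostgreSQL syntax (it handles inserts only, it assumes count_table has user_id as its primary key, and all names are illustrative):
CREATE OR REPLACE FUNCTION maintain_count() RETURNS trigger AS $$
BEGIN
    -- Bump the per-user counter for every new row in big_table.
    INSERT INTO count_table (user_id, count)
    VALUES (NEW.user_id, 1)
    ON CONFLICT (user_id) DO UPDATE SET count = count_table.count + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_table_count
AFTER INSERT ON big_table
FOR EACH ROW EXECUTE PROCEDURE maintain_count();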
These are the approaches you can try, but in the end it all depends on the size and type of your data.

How to find duplicate rows in Hive?

I want to find duplicate rows in one of my Hive tables, for which I was given two approaches.
The first approach is to use the following two queries:
select count(*) from mytable; -- this will give the total row count
The second query, below, gives the count of distinct rows:
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my tables the total row count derived using the first query is 3500, and the second query gives 2700. So it tells us that 3500 - 2700 = 800 rows are duplicates. But this query doesn't tell me which rows are duplicated.
My second approach to find duplicates is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
The above query should list the rows which are duplicated and how many times each row is duplicated, but it returns zero rows, which would mean there are no duplicate rows in that table.
So I would like to know:
Is my first approach correct? If yes, how do I find which rows are duplicated?
Why is the second approach not providing the list of rows which are duplicated?
Is there any other way to find the duplicates?
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to
ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get the list of duplicated rows.
The analytic window function row_number() is quite useful and can provide the duplicates based upon the elements specified in the partition by clause. A simple in-line view and exists clause will then pinpoint which records in the original table correspond to these duplicates. In some databases (like Teradata) you can forgo the inline view using a QUALIFY clause.
SQL1 & SQL2 can be combined. SQL2: If you want to deal with NULLs and not simply dismiss them, then a coalesce and concatenation might be better in the count:
SELECT count(1), count(distinct concat(coalesce(keypart1, ''), coalesce(keypart2, '')))
FROM srcTable s
3) Finds all the records, not just those beyond the first occurrence. This provides all the context data as well as the keys, so it can be useful when analyzing why you have dups and not just which keys are involved.
select * from srcTable s
where exists
  ( select 1 from (
        SELECT
           keypart1,
           keypart2,
           row_number() over ( partition by keypart1, keypart2 ) seq
        FROM srcTable t
        -- WHERE (whatever additional filtering you want)
    ) t
    where seq > 1
      AND t.keypart1 = s.keypart1
      AND t.keypart2 = s.keypart2
  )
Suppose you want to get duplicate rows based on a particular column, id here. The query below will give you all the ids which are duplicated in the table in Hive:
SELECT id
FROM mytable
GROUP BY id
HAVING count(id) > 1
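If you also need the full rows rather than just the ids, one option is to join those ids back to the table; a sketch, assuming the table is mytable and the key column is id:
SELECT t.*
FROM mytable t
JOIN (SELECT id
      FROM mytable
      GROUP BY id
      HAVING count(*) > 1) dup
  ON t.id = dup.id;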

How can I get the total result count, and a given subset ('page' of results) with the same SQL Query with Oracle

I would like to display a table of results. The data is sourced from a SQL query on an Oracle database. I would like to show the results one page (say, 10 records) at a time, minimising the actual data being sent to the front-end.
At the same time, I would like to show the total number of possible results (say, showing 1-10 of 123), and to allow for pagination (say, to calculate that 10 per page, 123 results, therefore 13 pages).
I can get the total number of results with a single count query.
SELECT count(*) AS NUM_RESULTS FROM ... etc.
and I can get the desired subset with another query
SELECT * FROM ... etc. WHERE ? <= ROWNUM AND ROWNUM < ?
But, is there a way to get all the relevant details in one single query?
Update
Actually, the above query using ROWNUM seems to work for 0 - 10, but not for 10 - 20, so how can I do that too?
ROWNUM is a bit tricky to use.
The ROWNUM pseudocolumn always starts with 1 for the first result that actually gets fetched. If you filter for ROWNUM>10, you will never fetch any result and therefore will not get any.
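For example, a filter like the following can never be satisfied: the first candidate row is assigned ROWNUM = 1, fails the filter, and is discarded, so ROWNUM never advances past 1 (a one-line sketch):
select * from mytable where rownum > 10;  -- always returns zero rows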
If you want to use it for paging (not that you really should), it requires nested subqueries:
select * from
(select rownum n, x.* from
(select * from mytable order by name) x
)
where n between 3 and 5;
Note that you need another nested subquery to get the order by right; if you put the order by one level higher
select * from
(select rownum n, x.* from mytable x order by name)
where n between 3 and 5;
it will pick 3 random(*) rows and sort them, but that is usually not what you want.
(*) not really random, but probably not what you expect.
See http://use-the-index-luke.com/sql/partial-results/window-functions for more efficient ways to implement pagination.
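For reference, the window-function approach from that link can return the page and the total count in a single statement; a minimal sketch (the ordering column name and the bind variables are assumptions):
select *
from (
  select x.*,
         row_number() over (order by name) rn,
         count(*) over () total_rows  -- total result count, repeated on every row
  from mytable x
)
where rn between :first_row and :last_row
order by rn;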
You can use an inner join on your table and fetch the number of results in a subquery. An example query is as follows:
SELECT E.emp_name, E.emp_age, E.emp_sal, T.emp_count
FROM EMP E
INNER JOIN (SELECT emp_name, COUNT(*) AS emp_count
            FROM EMP GROUP BY emp_name) T
ON E.emp_name = T.emp_name WHERE E.emp_age < 35;
Not sure exactly what you're after based on your question's wording, but it seems like you want to select all records with a row number between two values and, in an adjacent field of each record, see the total count of records. If so, you can select everything from your table and join a subquery containing the COUNT value as a field, joining on 1=1 (i.e. everywhere) to tack that field onto every record. Example:
SELECT *
FROM table_name LEFT JOIN (SELECT COUNT(*) AS NUM_RESULTS FROM table_name) t ON 1=1
WHERE ? <= ROWNUM AND ROWNUM < ?

SQL Delete low counts

I have a table with this data:
Id Qty
-- ---
A 1
A 2
A 3
B 112
B 125
B 109
But I'm supposed to only have the max values for each id. Max value for A is 3 and for B is 125. How can I isolate (and delete) the other values?
The final table should look like this :
Id Qty
-- ---
A 3
B 125
Running MySQL 4.1
Oh wait, I've got a simpler solution:
I'll select all the max values (group by id), export the data, flush the table, and reimport only the max values.
CREATE TABLE tabletemp LIKE table;
INSERT INTO tabletemp SELECT id,MAX(qty) FROM table GROUP BY id;
DROP TABLE table;
RENAME TABLE tabletemp TO table;
Thanks to all!
Try this in SQL Server:
delete o from tbl o
left outer join
   (select max(qty) anz, id
    from tbl i
    group by i.id) k on o.id = k.id and k.anz = o.qty
where k.id is null
Revision 2 for MySQL... Can anyone check this one?:
delete from tbl
where concat(id, qty) not in
(select concat(id, anz) from (select max(qty) anz, id
from tbl i
group by i.id) x)
Explanation:
Since I was not supposed to use joins (see the comments about MySQL's support for joins combined with delete/update/insert), I moved the subquery into an IN(a,b,c) clause.
Inside an IN clause I can use a subquery, but that query is only allowed to return one field. So in order to filter out all elements that are not the maximum, I need to concat both fields into a single one so I can return it inside the IN clause. Basically, the query inside the IN returns only the biggest ID+QTY per id. To compare against the main table, I also need a concat on the outside, so the data for both fields match.
Basically the IN clause contains:
("A3","B125")
Disclaimer: The above query is "evil!" since it uses a function (concat) on the fields it compares against. This will render any index on those fields almost useless. You should never formulate a query that way if it runs on a regular basis. I only wanted to bend it so it works on MySQL.
Example of this "bad construct" (get all orders from the last 2 weeks):
select ... from orders where orderday + interval 14 day > now()
You should always do:
select ... from orders where orderday > now() - interval 14 day
The difference is subtle: version 2 only has to do the math once and is able to use the index, while version 1 has to do the math for every single row in the orders table, and you can forget about index usage.
I'd try this:
delete from T
where exists (
select * from T as T2
where T2.Id = T.Id
and T2.Qty > T.Qty
);
For those who might have a similar question in the future: this might be supported some day, and it already is in SQL Server 2005 and later.
It doesn't require a join, and it has advantages over the use of a temporary table if the table has dependencies.
with Tranked(Id, Qty, rk) as (
    select Id, Qty,
           rank() over (
               partition by Id
               order by Qty desc
           )
    from T
)
delete from Tranked
where rk > 1;
You'll have to go via another table; among other things, what makes a single delete statement quite impossible here in MySQL is that you can't delete from a table and use the same table in a subquery.
BEGIN;
create temporary table tmp_del select id, max(qty) as qty from the_tbl group by id;
delete the_tbl from the_tbl, tmp_del where
the_tbl.id = tmp_del.id and the_tbl.qty < tmp_del.qty;
drop table tmp_del;
COMMIT;
MySQL 4.0 and later supports a simple multi-table syntax for DELETE:
DELETE t1 FROM MyTable t1 JOIN MyTable t2 ON t1.id = t2.id AND t1.qty < t2.qty;
This produces a join of each row with a given id to all other rows with the same id, and deletes only the row with the lesser qty in each pairing. After this is all done, the row with the greatest qty per group of id is left not deleted.
If you only have one row with a given id, it still works because a single row is naturally the one with the greatest value.
FWIW, I just tried my solution using MySQL 5.0.75 on a Macbook Pro 2.40GHz. I inserted 1 million rows of synthetic data, with different numbers of rows per "group":
2 rows per id completes in 26.78 sec.
5 rows per id completes in 43.18 sec.
10 rows per id completes in 1 min 3.77 sec.
100 rows per id completes in 6 min 46.60 sec.
1000 rows per id didn't complete before I terminated it.

Transform arbitrary SQL SELECT TOP(x) to a SELECT COUNT(*)?

I want to be able to take any arbitrary SELECT TOP(X) query that would normally return a large number of rows (without the X limit) and transform that query into a query that counts how many rows would have been returned without the TOP(X) (i.e. SELECT COUNT(*)). Remember I am asking about an arbitrary query with any number of joins, where clauses, group by's etc.
Is there a way to do this?
edited to show syntax with Shannon's solution:
i.e.
`SELECT TOP(X) [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols] ORDER BY [cols]`
becomes
`SELECT COUNT(*) FROM
(SELECT [colnames] FROM [tables with joins]
WHERE [constraints] GROUP BY [cols]) t`
Inline view:
select count(*)
from (...slightly transformed query...) t
The ...slightly transformed query... is the original with these changes:
If the select clause contains any columns without names, such as select ... avg(x) ..., then do one of: 1) alias the column, such as avg(x) as AvgX; 2) remove the column, but make sure at least one column is left; or, my favorite, 3) just make the select clause select 1 as C.
Remove TOP from the select clause.
Remove the order by clause.
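For example, applying these steps (with option 3 for the unnamed column) to a made-up query:
-- original query
SELECT TOP(10) dept, AVG(salary)
FROM emp
GROUP BY dept
ORDER BY AVG(salary) DESC
-- transformed count query: TOP and ORDER BY removed, select list replaced
SELECT COUNT(*)
FROM (SELECT 1 AS C
      FROM emp
      GROUP BY dept) t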
EDIT 1: Fixed by adding aliases for the inline view and dealing with unnamed columns in the select clause.
EDIT 2: But what about the performance? Doesn't this require the DB to run the big query that I wanted to avoid in the first place with TOP(X)?
Not necessarily. For some queries the count will do more work than the TOP(X) would. And for a particular query you may be able to make the equivalent count faster by making additional changes to remove work that is not needed for the final count. But those simplifications cannot be part of a general method that takes any arbitrary SELECT TOP(X) query and transforms it into a query that counts how many rows would have been returned without the TOP(X).
And in some cases, the query optimizer may optimize things away so that the DB does not run the big query at all.
For example, a test table & data, using SQL Server 2005:
create table t (PK int identity(1, 1) primary key,
u int not null unique,
string VARCHAR(2000))
insert into t (u, string)
select top 100000 row_number() over (order by s1.id) , replace(space(2000), ' ', 'x')
from sysobjects s1,
sysobjects s2,
sysobjects s3,
sysobjects s4,
sysobjects s5,
sysobjects s6,
sysobjects s7
The non-clustered index on column u will be much smaller than the clustered index on column PK.
Then set up SSMS to show the actual execution plan for:
select PK, U, String from t
select count(*) from t
The first select does a clustered index scan, because it needs to return data out of the leaf pages. The second query does an index scan on the smaller non-clustered index created for the unique constraint on U.
Applying the transform of the first query we get:
select count(*)
from (select PK, U, String from t) t
Running that and looking at the plan, the index on U is used again, the exact same plan as select count(*) from t. The leaf pages are not visited to find the values of String for every row.