How to COUNT DISTINCT on more than one column - sql

I have the following table.
group_id p_id version value
1 1 1 10
1 1 2 11
1 1 2 12
1 2 3 13
2 1 2 14
2 1 3 15
2 1 2 16
I would like to count how many records there are for each group_id, and how many distinct p_id + version combinations there are for each group_id. I have the following query:
SELECT "group_id",count(*) , count(distinct "p_id","version")
FROM tbl
group by "group_id"
Apparently it's not going to work, as Oracle gives me an error on COUNT:
ORA-00909: invalid number of arguments
I know this can be done with a subquery. However, is there a simpler way to get the same result? Performance is important to me, as we have more than 500 million records in the table.

I don't know if it's the best way, but I normally concatenate the two values, using a delimiter to enforce "distinctness", so they become one expression, which Oracle can handle with COUNT DISTINCT:
SELECT "group_id",count(*) , count(distinct "p_id" || '-' || "version")
FROM tbl
group by "group_id"
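As a rough check of the concatenation approach, here is a minimal sketch that runs the same query against the sample data. It uses Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example is runnable), and the column names are unquoted for brevity.

```python
import sqlite3

# In-memory stand-in for the question's table. SQLite is used only because
# it ships with Python; the original question targets Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (group_id INT, p_id INT, version INT, value INT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 10), (1, 1, 2, 11), (1, 1, 2, 12), (1, 2, 3, 13),
     (2, 1, 2, 14), (2, 1, 3, 15), (2, 1, 2, 16)],
)

# Concatenate the two columns with a delimiter so COUNT(DISTINCT ...)
# sees one expression per (p_id, version) pair.
rows = conn.execute(
    """
    SELECT group_id,
           COUNT(*),
           COUNT(DISTINCT p_id || '-' || version)
    FROM tbl
    GROUP BY group_id
    ORDER BY group_id
    """
).fetchall()

print(rows)  # [(1, 4, 3), (2, 3, 2)]
```

The delimiter matters mostly for text columns: without it, the pairs ('ab', 'c') and ('a', 'bc') would both concatenate to 'abc' and be counted as one. Pick a delimiter that cannot occur in the data.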

Related

Get previous value from column A when column B is not null in Hive

I have a table tableA below
ID number Estimate Client
---- ------
1 3 8 A
1 NULL 10 Null
1 5 11 A
1 NULL 19 Null
2 NULL 20 Null
2 2 70 A
.......
I would like to select the Estimate value from the previous row whenever the number column is not null. For instance, when number = 3, then pre_estimate = NULL; when number = 5, then pre_estimate = 10; and when number = 2, then pre_estimate = 20.
The query below does not seem to return the correct answer in Hive. What should be correct way to do it?
select lag(Estimate, 1) OVER (partition by ID) as prev_estimate
from tableA
where number is not null
Consider the table with following structure:
number - int
estimate - int
order_column - int
order_column is taken as a column on which you want to sort your table rows.
Data in table:
number estimate order_column
3 8 1
NULL 10 2
5 11 3
NULL 19 4
NULL 20 5
2 70 6
I used the following query and got the result you have mentioned.
SELECT * FROM (SELECT number, estimate, lag(estimate,1) over(order by order_column) as prev_estimate from tableA) tbl where tbl.number is not null;
I didn't see a reason to partition by ID, so I left ID out of the table. If the previous row should only ever come from the same ID, adding partition by ID to the over clause should do it.
You were getting wrong results because the where clause in the main query is applied before the window function: it keeps only the rows where number is not null, and lag is then computed over that reduced set. You need to compute lag over all the rows first, and only then select the rows where number is not null.
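To make the ordering issue concrete, here is a small sketch that runs both variants on the sample data above. It uses Python's sqlite3 module instead of Hive (an assumption for runnability; window functions need SQLite 3.25 or later), and the variable names filtered_first and lag_first are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (number INT, estimate INT, order_column INT)")
conn.executemany(
    "INSERT INTO tableA VALUES (?, ?, ?)",
    [(3, 8, 1), (None, 10, 2), (5, 11, 3),
     (None, 19, 4), (None, 20, 5), (2, 70, 6)],
)

# Variant 1: WHERE runs before the window function, so LAG only sees
# the rows that survived the filter.
filtered_first = conn.execute(
    """
    SELECT number, estimate,
           LAG(estimate, 1) OVER (ORDER BY order_column) AS prev_estimate
    FROM tableA
    WHERE number IS NOT NULL
    ORDER BY order_column
    """
).fetchall()

# Variant 2: compute LAG over every row in a subquery, then filter.
lag_first = conn.execute(
    """
    SELECT number, estimate, prev_estimate
    FROM (SELECT number, estimate, order_column,
                 LAG(estimate, 1) OVER (ORDER BY order_column) AS prev_estimate
          FROM tableA) tbl
    WHERE number IS NOT NULL
    ORDER BY order_column
    """
).fetchall()

print(filtered_first)  # [(3, 8, None), (5, 11, 8), (2, 70, 11)]
print(lag_first)       # [(3, 8, None), (5, 11, 10), (2, 70, 20)]
```

Only the second variant reproduces the values asked for in the question (NULL, 10 and 20).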

SQL: Selecting rows from non unique column values once partitioned by another column

Using SQL here. Trying to select all rows where the column value is unique within that specific partition.
Have tried:
select *
from dataTable
where value in ( select value
from dataTable
group by tv_id, value
having count(*) > 1)
but it returns the full table. I think the issue is that the values for many of the tv_ids are identical and overlap.
What I have:
tv_id value
1 1
1 2
1 2
1 3
2 1
2 1
2 2
2 3
2 4
3 1
3 1
3 2
What I want:
tv_id value
1 2
1 2
2 1
2 1
3 1
3 1
I have a bunch of tv_ids and essentially, I only want the rows where the value is not unique within each tv_id.
Ex: I don't want tv_id, value: 3, 2 because it is the only combination in the data.
Thanks in advance!
Maybe something like this does the trick
Oracle Option
I include this Oracle version first because it makes it easier to see what the query is doing.
select tv_id, value
from dataTable
where (tv_id, value) in (
select tv_id, value
from dataTable
group by tv_id, value
having count(1) > 1
)
SQL
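This is a standard SQL version that should work with almost any database engine. The selected columns are qualified with d1 because both sides of the join expose tv_id and value:
select d1.tv_id, d1.value
from dataTable d1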
join (
select tv_id, value
from dataTable
group by tv_id, value
having count(1) > 1
) d2
on d1.tv_id=d2.tv_id
and d1.value=d2.value
You need to query the same table twice because group by collapses each (tv_id, value) combination into a single row, so the grouped subquery on its own cannot return the duplicated rows shown in your expected output.
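Here is a quick sketch of the join version run against the sample data from the question, using Python's sqlite3 module as a convenient test bed (an assumption; any engine should behave the same for this query). The ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataTable (tv_id INT, value INT)")
conn.executemany(
    "INSERT INTO dataTable VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 2), (1, 3),
     (2, 1), (2, 1), (2, 2), (2, 3), (2, 4),
     (3, 1), (3, 1), (3, 2)],
)

# The subquery finds (tv_id, value) pairs that occur more than once;
# joining back to the base table recovers every original row of each pair.
rows = conn.execute(
    """
    SELECT d1.tv_id, d1.value
    FROM dataTable d1
    JOIN (SELECT tv_id, value
          FROM dataTable
          GROUP BY tv_id, value
          HAVING COUNT(*) > 1) d2
      ON d1.tv_id = d2.tv_id
     AND d1.value = d2.value
    ORDER BY d1.tv_id, d1.value
    """
).fetchall()

print(rows)  # [(1, 2), (1, 2), (2, 1), (2, 1), (3, 1), (3, 1)]
```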

Compare column entry to every other entry in the same column

I have a Column of values in SQLite.
value
-----
1
2
3
4
5
For each value I would like to know how many of the other values are larger and display the result. E.g. For value 1 there are 4 entries that have higher values.
value | Count
-------------
1 | 4
2 | 3
3 | 2
4 | 1
5 | 0
I have tried nested select statements and using the Count(*) function but I do not seem to be able to extract the correct levels. Any suggestions would be much appreciated.
Many Thanks
You can do this with a correlated subquery in SQLite:
select value,
(select count(*) from t t2 where t2.value > t.value) as "count"
from t;
In most other databases, you would use a ranking function such as rank() or dense_rank(). SQLite did not support these when this answer was written; window functions were added in SQLite 3.25.0.
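The correlated subquery can be tried directly from Python's sqlite3 module. This is a minimal sketch that assumes the table is called t, as in the answer, and adds an ORDER BY so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,), (5,)])

# For each outer row, the inner query counts the rows whose value is
# strictly larger than the outer row's value.
rows = conn.execute(
    """
    SELECT value,
           (SELECT COUNT(*) FROM t t2 WHERE t2.value > t.value) AS "count"
    FROM t
    ORDER BY value
    """
).fetchall()

print(rows)  # [(1, 4), (2, 3), (3, 2), (4, 1), (5, 0)]
```

The inner query runs once per outer row, so on large tables an index on value is likely to help.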

How to compare two rows in postgresql?

I have a table tab that contains:
item identifier quantity methodid
10 1 20 2
10 1 30 3
11 1 10 3
11 1 12.5 3
11 2 20 5
12 1 20 1
12 1 30 1
I need to write a function that checks whether there is a duplicate methodid for an item and identifier.
In the above example, item 11 identifier 1 has two rows with methodid 3, which means it is duplicated; item 12 identifier 1 has duplicated rows as well.
I don't need to do anything to the data just to identify this situation.
I don't need to find where and what was duplicated... just tell there is duplication.
The only information I have is the identifier
CREATE OR REPLACE FUNCTION func(identifier integer)
RETURNS integer AS
$BODY$
declare
errorcode int;
begin
if _____________ then
errorcode =1;
raise exception 'there is duplication in this identifier';
END IF;
continue work
return 0;
exception
when raise_exception then
return errorcode;
end;
$BODY$
LANGUAGE plpgsql VOLATILE
In the blank spot I want to put a query that checks for duplications.
How do I write a query that performs the check?
The structure of the function can be changed, but I need some way to know when to raise the exception.
To check whether any rows are duplicated based on selected columns, you could group by those columns and count the occurrences.
So in your case you could do:
SELECT 1 FROM tab GROUP BY item, identifier, methodid HAVING COUNT(*) > 1;
To incorporate this into your functions you could just check if it exists:
if EXISTS (SELECT 1 ...) then
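As a sketch of the same EXISTS check outside PL/pgSQL, the snippet below runs it from Python against an in-memory SQLite copy of the sample table. The helper name has_duplicate_method is hypothetical, and the WHERE identifier = ? filter is an added assumption, based on the question saying the identifier is the only input available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (item INT, identifier INT, quantity REAL, methodid INT)"
)
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [(10, 1, 20, 2), (10, 1, 30, 3), (11, 1, 10, 3), (11, 1, 12.5, 3),
     (11, 2, 20, 5), (12, 1, 20, 1), (12, 1, 30, 1)],
)


def has_duplicate_method(conn, identifier):
    """Return True if any (item, methodid) pair repeats for this identifier.

    Hypothetical helper for illustration; not part of the original answer.
    """
    row = conn.execute(
        """
        SELECT EXISTS (
            SELECT 1
            FROM tab
            WHERE identifier = ?
            GROUP BY item, identifier, methodid
            HAVING COUNT(*) > 1
        )
        """,
        (identifier,),
    ).fetchone()
    return bool(row[0])


print(has_duplicate_method(conn, 1))  # True  (item 11 / methodid 3, item 12 / methodid 1)
print(has_duplicate_method(conn, 2))  # False
```

In the PL/pgSQL function, the same subquery would sit inside if EXISTS (...) then, with the function's identifier argument in place of the ? placeholder.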
Use group by:
select item, identifier, methodid, count(*)
from tab
group by item, identifier, methodid
having count(*) > 1
Here, having count(*) > 1 restricts the output to the combinations that occur more than once.
Try the following; it may give you the result set you need.
First, generate a row number within each (item, identifier, methodid) group, using this query:
select *,ROW_NUMBER() over (partition by item,identifier,methodid order by item) as RowID
from tab;
Then you will get the result like below.
Item Identifier quantity methodid RowID
10 1 20 2 1
10 1 30 3 1
11 1 10 3 1
11 1 12.5 3 2
11 2 20 5 1
12 1 20 1 1
12 1 30 1 2
12 1 40 2 1
In this result set, any row with a RowID greater than 1 is a duplicate, so the following query returns rows only when duplication exists:
select * from (
select *,ROW_NUMBER() over (partition by item,identifier,methodid order by item) as rowid
from tab) as p
where p.rowid > 1
Thanks.
select *
from (select item, identifier, quantity, methodid,
row_number() over (partition by item, identifier, methodid) as rn
from tab) t
where rn > 1
Every row with rn higher than 1 is a duplicated row.

Distinct count mismatch

This is my initial table structure.
MEMBER_ID ITEM_ID ACCOUNT
1 3 A
1 4 A
2 1 B
3 4 B
4 4 B
5 4 A
6 2 A
When I want the distinct number of members I do
Select COUNT(DISTINCT MEMBER_ID) FROM TABLE A
I get 6, the expected answer
When I do
SELECT COUNT(DISTINCT MEMBER_ID),ACCOUNT FROM TABLE A GROUP BY 2
I get something like A=4 and B=3. What do you think is the disconnect here?
Thanks
I find the results highly unlikely. You would, however, get 4 and 3 if the data were slightly different:
MEMBER_ID ITEM_ID ACCOUNT
1 3 A
1 4 B
2 1 B
3 4 B
4 4 A
5 4 A
6 2 A
With the group by, MEMBER_ID = 1 would be counted twice -- once for A and once for B. My guess is that something like this is happening for your real problem. COUNT(DISTINCT) is not additive. So, when you break it in apart using a group by, the sum of the values is not (necessarily) the sum for all the data. This differs from MIN(), MAX(), COUNT(*), and SUM(). However, AVG() is also not additive (although it is easily recalculated).