I need to output report of users by their age, gender, education, income, etc from our database. However, there are about 40 variables. It seems just silly to group by each variable one bye one but I'm not aware of other ways and I don't know how to write UDF to solve it yet. I'd appreciate your help.
It's not that complicated but it does come up a lot in daily work. My work environment is Hive/Impala.
We cannot implement 'Group By' task on input rows in UDF , UDAF or UDTF.
UDF takes in a single input row and output a single output row.
UDAF just does Aggregations on one column, but not by Grouping rows.
UDTF transforms a single input row to multiple output rows.
Only possible solution is to write multiple Queries and Combine them using UNION ALL and display/insert into table
Sample Query:
SELECT *
FROM
(
SELECT COUNT(column1),column1 FROM table GROUP BY column1
UNION ALL
SELECT COUNT(column2),column2 FROM table GROUP BY column2
UNION ALL
SELECT COUNT(column3),column3 FROM table GROUP BY column3
) s
I have a database of items that stores each available variety of a given item on a separate row. Just going by the item names, the database records look like this:
10316-00-B
10316-00-M
10316-02-B
10316-03-B
10316-04-B
10316-23-B
10316-88-B
...where "10316" is the actual item, and then everything after that represents a specific variety type. I would like to write a query that would return only one record for item "10316" (or whichever), along with the information associated with that item in all the other table columns, and ignore the rest. So I would get the record for item "10316-00-B", but none of the others. This process would repeat for all other items returned by my query.
I'm not picky about which item of a given group is returned; the top one would do. I just to see one listing each for item 10316, 10317, 10318 and onward, instead of each item and variation.
I've tried DISTINCT left(columnName, 5), but that doesn't filter the records as I need. Does anyone know how to do this? I'm using Sql Server 2008 R2.
UPDATE: I'm trying implement the solution provided by #mo2 -- it appears this one is most likely to return all data for a given item, which is what I want. Unfortunately I'm getting the error "No column name was specified for column 1 of 't'", and I'm not sure how to correct this. My version of his query looks like this:
SELECT * FROM (
select left(item, 5), col1, col2, ... , row_number()
over (partition by LEFT(item, 5) order by item) rn
from Table
) t
where rn = 1
FINAL UPDATE: I marked the answer from #kbball as the solution, but really credit should go to #mo2 as well. The answer from #kbball helped me make sense of the answer from #mo2.
Distinct would work if you were only looking for the one column alone. But when you add the remaining columns you need, it is thrown off. You can try this:
select * from (
select
left(col1, 5) col1,
col2,
col3,
col4,
row_number() over (partition by left(col1, 5) order by col1) rn
from your_table
) t
where rn = 1
you can change the order by to a different column such as a date/time column if you want to get the first or last record for each item.
This one is a little simpler:
SELECT DISTINCT sub.item
FROM (SELECT LEFT(columnName,5) AS item FROM Table) sub
You'll want to have the specific variety type as a separate field for each item, just as a general technique for database design:
10316 | 00 | B
Then you can query using:
SELECT DISTINCT(item)
FROM itemlist
WHERE itemnum=10316;
Hope this helps
I have a scenario for a type2 table where I have to remove duplicates on total row level.
Lets consider below example as the data in table.
A|B|C|D|E
100|12-01-2016|2|3|4
100|13-01-2016|3|4|5
100|14-01-2016|2|3|4
100|15-01-2016|5|6|7
100|16-01-2016|5|6|7
If you consider A as key column, you know that last 2 rows are duplicates.
Generally to find duplicates, we use group by function.
select A,C,D,E,count(1)
from table
group by A,C,D,E
having count(*)>1
for this output would be 100|2|3|4 as duplicate and also 100|5|6|7.
However, only 100|5|6|7 is only duplicate as per type 2 and not 100|2|3|4 because this value has come back in 3rd run and not soon after 1st load.
If I add date field into group by 100|5|6|7 will not be considered as duplicate, but in reality it is.
Trying to figure out duplicates as explained above.
Duplicates should only be 100|5|6|7 and not 100|2|3|4.
can someone please help out with SQL for the same.
Regards
Raghav
Use row_number analytical function to get rid of duplicates.
delete from
(
select a,b,c,d,e,row_number() over (partition by a,b,c,d,e) as rownumb
from table
) as a
where rownumb > 1
if you want to see all duplicated rows, you need join table with your group by query or filter table using group query as subquery.
wITH CTE AS (select a, B, C,D,E, count(*)
from TABLE
group by 1,2,3,4,5
having count(*)>1)
sELECT * FROM cte
WHERE B <> B + 1
Try this query and see if it works. In case you are getting any errors then let me know.
I am assuming that your column B is in the Date format if not then cast it to date
If you can see the duplicate then just replace select * to delete
I want to get unique value form table. But all values should be unique.
So suggest how to get.
SELECT DISTINCT ProCode
, id,SubCat
,SmlImgPath
,RupPrice
,ActualPrice
,ProName
FROM product
WHERE ProCode='FZ10003-EBA';
(one day I may be able to post comments!)
SQLFiddle to show normal, distinct and returning a single row
SELECT DISTINCT works fine but it doesn't work the way you want it to work. From the data you posted in the comment under Klas' answer, it's clear you're expecting a single result when there are data differences somewhere in the columns. For example
/Products/CELEBRITY/KANGANA
is completely DISTINCT from
/Products/SALWAR
What you appear to be looking for cannot work with DISTINCT nor can it work with GROUP BY. Basically the only way two (or three, or ten, or 100) rows will become ONE row is if the data in ALL SEVEN COLUMNS in your SELECT are IDENTICAL.
Take a step back and think about what it is, exactly, you're trying to achieve here.
Are you saying that you want one record only? This is called aggregation. In case there are more records then one (three in your example), you would have to decide for each column, which value to show.
Which SubCat, which SmlImgPath, etc. do you want to see in your result line? The maximum value? The minimum? Or the string 'various'? An example:
SELECT
ProCode
, CASE WHEN MIN(id) <> MAX(id) THEN 'various' ELSE MIN(id) END
, MIN(SubCat)
, MAX(SmlImgPath)
, AVG(RupPrice)
, AVG(ActualPrice)
, MAX(ProName)
FROM product
WHERE ProCode='FZ10003-EBA'
GROUP BY ProCode;
DISTINCT refers to all selected columns, so the answer is that your SELECT already does that.
EDIT:
It seems your problem isn't related to DISTINCT. What you want is to get a single row when your search returns multiple rows.
If you don't care which row you get then you can use:
MS SQL Server syntax:
SELECT TOP 1 ProCode
, id,SubCat
,SmlImgPath
,RupPrice
,ActualPrice
,ProName
FROM product
WHERE ProCode='FZ10003-EBA';
MYSQL syntax:
SELECT ProCode
, id,SubCat
,SmlImgPath
,RupPrice
,ActualPrice
,ProName
FROM product
WHERE ProCode='FZ10003-EBA'
LIMIT 1;
Oracle syntax:
SELECT ProCode
, id,SubCat
,SmlImgPath
,RupPrice
,ActualPrice
,ProName
FROM product
WHERE ProCode='FZ10003-EBA'
AND rownum <= 1;
Is there a better way of doing a query like this:
SELECT COUNT(*)
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
FROM DocumentOutputItems) AS internalQuery
I need to count the number of distinct items from this table but the distinct is over two columns.
My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)
If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.
Once it is persisted, provided the column is deterministic and you are using "sane" database settings, it can be indexed and / or statistics can be created on it.
I believe a distinct count of the computed column would be equivalent to your query.
Edit: Altered from the less-than-reliable checksum-only query
I've discovered a way to do this (in SQL Server 2005) that works pretty well for me and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable
SELECT COUNT(DISTINCT (CHECKSUM(DocumentId,DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId),REVERSE(DocumentSessionId)) )
FROM DocumentOutPutItems
What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique permutations why not try it?
It certainly works as you might expect in Oracle.
SQL> select distinct deptno, job from emp
2 order by deptno, job
3 /
DEPTNO JOB
---------- ---------
10 CLERK
10 MANAGER
10 PRESIDENT
20 ANALYST
20 CLERK
20 MANAGER
30 CLERK
30 MANAGER
30 SALESMAN
9 rows selected.
SQL> select count(*) from (
2 select distinct deptno, job from emp
3 )
4 /
COUNT(*)
----------
9
SQL>
edit
I went down a blind alley with analytics but the answer was depressingly obvious...
SQL> select count(distinct concat(deptno,job)) from emp
2 /
COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
9
SQL>
edit 2
Given the following data the concatenating solution provided above will miscount:
col1 col2
---- ----
A AA
AA A
So we to include a separator...
select col1 + '*' + col2 from t23
/
Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.
SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;
In MySQL you can do the same thing without the concatenation step as follows:
SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;
This feature is mentioned in the MySQL documentation:
http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct
How about something like:
select count(*)
from
(select count(*) cnt
from DocumentOutputItems
group by DocumentId, DocumentSessionId) t1
Probably just does the same as you are already though but it avoids the DISTINCT.
Some SQL databases can work with a tuple expression so you can just do:
SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId))
FROM DocumentOutputItems;
If your database doesn't support this, it can be simulated as per #oncel-umut-turer's suggestion of CHECKSUM or other scalar function providing good uniqueness e.g.
COUNT(DISTINCT CONCAT(DocumentId, ':', DocumentSessionId)).
MySQL specifically supports COUNT(DISTINCT expr, expr, ...) which is non-SQL standard syntax. It also notes In standard SQL, you would have to do a concatenation of all expressions inside COUNT(DISTINCT ...).
A related use of tuples is performing IN queries such as:
SELECT * FROM DocumentOutputItems
WHERE (DocumentId, DocumentSessionId) in (('a', '1'), ('b', '2'));
Here's a shorter version without the subselect:
SELECT COUNT(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems
It works fine in MySQL, and I think that the optimizer has an easier time understanding this one.
Edit: Apparently I misread MSSQL and MySQL - sorry about that, but maybe it helps anyway.
I have used this approach and it has worked for me.
SELECT COUNT(DISTINCT DocumentID || DocumentSessionId)
FROM DocumentOutputItems
For my case, it provides correct result.
There's nothing wrong with your query, but you could also do it this way:
WITH internalQuery (Amount)
AS
(
SELECT (0)
FROM DocumentOutputItems
GROUP BY DocumentId, DocumentSessionId
)
SELECT COUNT(*) AS NumberOfDistinctRows
FROM internalQuery
If you're working with datatypes of fixed length, you can cast to binary to do this very easily and very quickly. Assuming DocumentId and DocumentSessionId are both ints, and are therefore 4 bytes long...
SELECT COUNT(DISTINCT CAST(DocumentId as binary(4)) + CAST(DocumentSessionId as binary(4)))
FROM DocumentOutputItems
My specific problem required me to divide a SUM by the COUNT of the distinct combination of various foreign keys and a date field, grouping by another foreign key and occasionally filtering by certain values or keys. The table is very large, and using a sub-query dramatically increased the query time. And due to the complexity, statistics simply wasn't a viable option. The CHECKSUM solution was also far too slow in its conversion, particularly as a result of the various data types, and I couldn't risk its unreliability.
However, using the above solution had virtually no increase on the query time (comparing with using simply the SUM), and should be completely reliable! It should be able to help others in a similar situation so I'm posting it here.
if you had only one field to "DISTINCT", you could use:
SELECT COUNT(DISTINCT DocumentId)
FROM DocumentOutputItems
and that does return the same query plan as the original, as tested with SET SHOWPLAN_ALL ON. However you are using two fields so you could try something crazy like:
SELECT COUNT(DISTINCT convert(varchar(15),DocumentId)+'|~|'+convert(varchar(15), DocumentSessionId))
FROM DocumentOutputItems
but you'll have issues if NULLs are involved. I'd just stick with the original query.
Hope this works i am writing on prima vista
SELECT COUNT(*)
FROM DocumentOutputItems
GROUP BY DocumentId, DocumentSessionId
I wish MS SQL could also do something like COUNT(DISTINCT A, B). But it can't.
At first JayTee's answer seemed like a solution to me bu after some tests CHECKSUM() failed to create unique values. A quick example is, both CHECKSUM(31,467,519) and CHECKSUM(69,1120,823) gives the same answer which is 55.
Then I made some research and found that Microsoft does NOT recommend using CHECKSUM for change detection purposes. In some forums some suggested using
SELECT COUNT(DISTINCT CHECKSUM(value1, value2, ..., valueN) + CHECKSUM(valueN, value(N-1), ..., value1))
but this is also not conforting.
You can use HASHBYTES() function as suggested in TSQL CHECKSUM conundrum. However this also has a small chance of not returning unique results.
I would suggest using
SELECT COUNT(DISTINCT CAST(DocumentId AS VARCHAR)+'-'+CAST(DocumentSessionId AS VARCHAR)) FROM DocumentOutputItems
I found this when I Googled for my own issue, found that if you count DISTINCT objects, you get the correct number returned (I'm using MySQL)
SELECT COUNT(DISTINCT DocumentID) AS Count1,
COUNT(DISTINCT DocumentSessionId) AS Count2
FROM DocumentOutputItems
How about this,
Select DocumentId, DocumentSessionId, count(*) as c
from DocumentOutputItems
group by DocumentId, DocumentSessionId;
This will get us the count of all possible combinations of DocumentId, and DocumentSessionId
It works for me. In oracle:
SELECT SUM(DECODE(COUNT(*),1,1,1))
FROM DocumentOutputItems GROUP BY DocumentId, DocumentSessionId;
In jpql:
SELECT SUM(CASE WHEN COUNT(i)=1 THEN 1 ELSE 1 END)
FROM DocumentOutputItems i GROUP BY i.DocumentId, i.DocumentSessionId;
I had a similar question but the query I had was a sub-query with the comparison data in the main query. something like:
Select code, id, title, name
(select count(distinct col1) from mytable where code = a.code and length(title) >0)
from mytable a
group by code, id, title, name
--needs distinct over col2 as well as col1
ignoring the complexities of this, I realized I couldn't get the value of a.code into the subquery with the double sub query described in the original question
Select count(1) from (select distinct col1, col2 from mytable where code = a.code...)
--this doesn't work because the sub-query doesn't know what "a" is
So eventually I figured out I could cheat, and combine the columns:
Select count(distinct(col1 || col2)) from mytable where code = a.code...
This is what ended up working
This code uses distinct on 2 parameters and provides count of number of rows specific to those distinct values row count. It worked for me in MySQL like a charm.
select DISTINCT DocumentId as i, DocumentSessionId as s , count(*)
from DocumentOutputItems
group by i ,s;
You can just use the Count Function Twice.
In this case, it would be:
SELECT COUNT (DISTINCT DocumentId), COUNT (DISTINCT DocumentSessionId)
FROM DocumentOutputItems