count distinct issue in Hive


I'm trying to count the number of unique occurrences of each element in one column of a Hive table, relative to other columns.
I tried the query below, but I get this error: Expression not in GROUP BY key custom
SELECT custom, dist_pt, dt, art, COUNT(DISTINCT art) OVER (PARTITION BY custom, dist_pt) as nb_art FROM Tab ;

Remove DISTINCT from your COUNT() and add "GROUP BY art" at the end of your query. You need to segment, or group by, art in order to count how many records have each unique value of art.
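If the aim is to keep every row of Tab and attach the number of distinct art values per (custom, dist_pt), one portable alternative to the windowed COUNT(DISTINCT ...) is to aggregate in a subquery and join the result back. This is only a sketch and is a different approach from the suggestion above; the aliases t and c are made up for illustration, and it assumes custom and dist_pt contain no NULLs (NULL keys would not match in the join).

```sql
-- Count distinct art values once per (custom, dist_pt) group,
-- then join the counts back onto the detail rows.
SELECT t.custom, t.dist_pt, t.dt, t.art, c.nb_art
FROM Tab t
JOIN (
    SELECT custom, dist_pt, COUNT(DISTINCT art) AS nb_art
    FROM Tab
    GROUP BY custom, dist_pt
) c
  ON c.custom = t.custom
 AND c.dist_pt = t.dist_pt;
```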

Related

Confused with the Group By function in SQL

Q1: After using GROUP BY, why does the query output at most one row per group? Does this mean that HAVING is supposed to filter whole groups rather than the records within each group?
Q2: I want to find the records in each group whose age is greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!

You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, MTAvg.AvgAge
FROM Mytable mt
INNER JOIN (
    SELECT mtavgs.country,
           AVG(mtavgs.age) AS AvgAge
    FROM Mytable mtavgs
    GROUP BY mtavgs.country
) MTAvg
    ON MTAvg.country = mt.country
   AND mt.age > MTAvg.AvgAge
GROUP BY always returns one row per unique combination of values in the listed GROUP BY columns (unless a HAVING clause removes that row). The subquery in this example (alias: MTAvg) produces a single row per country. Its results filter the rows of the main table through the conditions in the INNER JOIN clause, and the calculated average age is also returned in the output.
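To make the answer to Q1 concrete: HAVING is evaluated after grouping, so it can only keep or discard whole groups. A small sketch against the Mytable from the question; the threshold 30 is an arbitrary value chosen for illustration.

```sql
-- One output row per country; HAVING then drops entire groups
-- whose average age is not above the (arbitrary) threshold.
SELECT country, AVG(age) AS AvgAge
FROM Mytable
GROUP BY country
HAVING AVG(age) > 30;
```

Comparing individual rows against their group's average needs the per-group value available on each row, which is what the join to the per-country subquery provides.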

GROUP BY is a clause that works together with aggregate functions such as AVG or COUNT; it is not itself an aggregate function. For further reading, see: SQL Group By tutorial
It collapses all the rows of a group into one row. In your example, it would collapse all the rows with the same country together.
I'm not sure exactly what your query needs to be to solve your problem, but I would look into window functions in SQL. I believe you first need a window function to find the average age in each group; then you can write a query to return the rows you need.

Depending on your DBMS type and version, you may be able to use a window function that calculates the average per country and makes it available on every row. Once that data is present as a derived table, you can simply use a WHERE clause to keep the rows whose age is greater than the calculated average for their country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge

exclude one column from grouping in sql query

I run an SQL query to calculate the mean and count after grouping by combinations of themes and country. This works fine.
create table georisk as (select themes, country , AVG(value) as mean, count(themes) as count from mytable group by themes, country order by themes, suppliers_country)
However, now I want to add an additional column to my table holding the value of max(date_t) without grouping by anything, so that the same single value is added to all the rows. If I do this:
create table georisk as (select themes, country , AVG(value) as mean, count(themes) as count, max(date_t) as last_included_date from mytable group by themes, country order by themes, suppliers_country)
max(date_t) is also computed per group. How can I get a single overall maximum within the same query?

I think you want this...
select
themes,
country,
AVG(value) as mean,
count(themes) as count,
max(date_t) as last_included_date,
max(max(date_t)) over () as very_last_include_date
from
mytable
group by
themes,
country
order by
themes,
country -- Note, you had a typo here ; suppliers_country
The GROUP BY is evaluated before the SELECT, then the aggregates are evaluated, then the window function is evaluated.
MAX(
MAX(date_t) -- normal aggregate
)
OVER () -- window function across whole result set's values of `MAX(date_t)`
Normally a window function has a PARTITION BY; leaving OVER () empty means 'no partition' and therefore 'the whole result set'.
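For comparison, the same overall maximum can be obtained without a window function by using an uncorrelated scalar subquery. This is a sketch of an alternative, not part of the answer above; it scans mytable a second time, and it assumes the database allows a scalar subquery in the SELECT list of a grouped query (most do).

```sql
SELECT themes,
       country,
       AVG(value)    AS mean,
       COUNT(themes) AS count,
       -- Evaluated once over the whole table, independent of the grouping.
       (SELECT MAX(date_t) FROM mytable) AS very_last_include_date
FROM mytable
GROUP BY themes, country
ORDER BY themes, country;
```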

Why does MAX statement require a Group By?

I understand why the first query needs a GROUP BY, as it doesn't know which date to apply the sum to, but I don't understand why this is the case with the second query. The value that ultimately is the max amount is already contained in the table - it is not calculated like SUM is. thank you
-- First Query
select
sum(OrderSales),OrderDates
From Orders
-- Second Query
select
max(FilmOscarWins),FilmName
From tblFilm

It is not the SUM and MAX that require the GROUP BY, it is the unaggregated column.
If you just write this, you will get a single row, for the maximum value of the FilmOscarWins column across the whole table:
select
max(FilmOscarWins)
From
tblFilm
If the most Oscars any film won was 12, that one row will say 12. But there could be multiple films, all of which won 12 Oscars, so if we ask for the FilmName alongside that 12, there is no single answer.
By adding the Group By, we fundamentally change the query: instead of returning one number for the whole table, it will return one row for each group - which in this case, means one row for each film.
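A sketch of the grouped form of the second query, to show what that change means in practice; the alias MaxOscarWins is made up for illustration.

```sql
-- One row per distinct FilmName, each with that film's own maximum.
-- This lists every film, not only the one(s) with the most wins.
SELECT FilmName, MAX(FilmOscarWins) AS MaxOscarWins
FROM tblFilm
GROUP BY FilmName;
```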
If you do want to get a list of all those films which had the maximum 12 Oscars, you have to do something more complicated, such as using a sub-query to first find that single number (12) and then find all the rows matching it:
select
FilmOscarWins,
FilmName
From
tblFilm
Where FilmOscarWins = (
select
max(FilmOscarWins)
From
tblFilm
)

If you want the film with the most Oscar wins, then use select top:
select top (1) f.*
From tblFilm f
order by FilmOscarWins desc;
In an aggregation query, the select columns need to be consistent with the group by columns -- the unaggregated columns in the select must match the group by.
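Note that TOP (1) returns a single row even when several films share the maximum. Assuming SQL Server (which the TOP syntax suggests, though the question does not say so), WITH TIES keeps every row tied for first place; a minimal sketch:

```sql
-- Returns all films tied for the highest FilmOscarWins.
SELECT TOP (1) WITH TIES f.*
FROM tblFilm f
ORDER BY f.FilmOscarWins DESC;
```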

SQL to find best row in group based on multiple columns?

Let's say I have an Oracle table with measurements in different categories:
CREATE TABLE measurements (
category CHAR(8),
value NUMBER,
error NUMBER,
created DATE
)
Now I want to find the "best" row in each category, where "best" is defined like this:
It has the lowest error.
If there are multiple measurements with the same error, the one that was created most recently is considered to be the best.
This is a variation of the greatest N per group problem, but including two columns instead of one. How can I express this in SQL?

Use ROW_NUMBER:
WITH cte AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY category ORDER BY error, created DESC) rn
FROM measurements m
)
SELECT category, value, error, created
FROM cte
WHERE rn = 1;
For a brief explanation, the PARTITION BY clause instructs the DB to generate a separate row number for each group of records in the same category. The ORDER BY clause places those records with the smallest error first. Should two or more records in the same category be tied with the lowest error, then the next sorting level would place the record with the most recent creation date first.
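Since the question is about Oracle, an alternative sketch uses Oracle's KEEP (DENSE_RANK FIRST ...) aggregate clause with the same two-level ordering. This is not the approach of the answer above, and the best_* aliases are made up for illustration. If several rows tie on both error and created, this version returns the largest value among them, whereas ROW_NUMBER picks one of the tied rows arbitrarily.

```sql
SELECT category,
       -- value taken from the row(s) ranked first by (error, created DESC)
       MAX(value) KEEP (DENSE_RANK FIRST ORDER BY error, created DESC) AS best_value,
       MIN(error) AS best_error,
       MAX(created) KEEP (DENSE_RANK FIRST ORDER BY error, created DESC) AS best_created
FROM measurements
GROUP BY category;
```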

The alias name RANK() function is not recognized in the where clause with DISTINCT columns

I have two tables with the columns (customer, position, product, sales_cycle, call_count, cntry_cd, owner_cd, cr8), and I am facing the challenges described below. Please help me fix them.
My requirement
I have two tables, test.table1 and test.table2.
I need to insert values into "test.table2" by selecting from "test.table1", but I am getting duplicates while loading data into "test.table2".
Both tables have 8 columns in total, but while loading I need to take the row with the highest rank of the column "call_count" for each unique combination of these columns (customer, position, product, sales_cycle).
The query I tried
select
distinct (customer, position, product ,sales_cycle),
rank () over (order by call_count desc) rnk,
cntry_cd,
owner_cd,
cr8
from test.table1
where rnk=1
I am facing a few challenges with the above query (the database I am using is Redshift):
1. I can't apply DISTINCT to only a few columns.
2. The alias name "rnk" is not recognized in the WHERE clause.
Please help me fix this. Thanks.

You can't use a column alias on the same level where it's introduced. You need to wrap the query in a derived table. The DISTINCT as shown is also unnecessary if you use rank().
select customer, position, product, sales_cycle,
cntry_cd, owner_cd, cr8
from (
select customer, position, product, sales_cycle,
cntry_cd, owner_cd, cr8,
rank () over (order by call_count desc) rnk
from test.table1
) t
where rnk=1;
The derived table adds no overhead to the processing time. In this case it is merely syntactic sugar to allow you to reference the column alias.
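The query above ranks across the whole table, so only rows sharing the overall highest call_count get rnk = 1. If the intent is one row per (customer, position, product, sales_cycle) combination, as the question's requirement seems to say, a hedged variant adds a PARTITION BY and uses ROW_NUMBER so that ties on call_count cannot produce duplicates. Which of several tied rows is kept is arbitrary in this sketch; the alias rn is made up for illustration.

```sql
SELECT customer, position, product, sales_cycle,
       cntry_cd, owner_cd, cr8
FROM (
    SELECT customer, position, product, sales_cycle,
           cntry_cd, owner_cd, cr8,
           -- Number the rows within each combination, highest call_count first.
           ROW_NUMBER() OVER (
               PARTITION BY customer, position, product, sales_cycle
               ORDER BY call_count DESC
           ) AS rn
    FROM test.table1
) t
WHERE rn = 1;
```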

We Keep Coding
