Query does not find column, but suggests the same column, in Hive SQL

I have the following query in SQL:
select midquery.account, midquery.name, midquery.label, midquery.labelfrequency
from(
-- Count the appearance of each label.
select count(*) as labelfrequency, account, name, label
from(
select account, name, label from myTable
) innerquery
group by account, name, label
) midquery
-- Select most frequent values only.
where rank() over
(partition by midquery.account, midquery.name
order by midquery.labelfrequency desc) = 1
The idea is to find the most frequent label per name-account set. When I run this query, I get the following error:
Error while compiling statement: FAILED: SemanticException [Error 10002]: Line 12:74 Invalid column reference 'labelfrequency': (possible column names are: labelfrequency, account, name, label)
I don't quite understand why the interpreter does not find the column labelfrequency but can suggest it. Have you got any suggestions on how to tackle this issue?
Edit: if I move the rank() to the select part, I get results.
select midquery.account, midquery.name, midquery.label, midquery.labelfrequency,
rank() over (partition by midquery.account, midquery.name
order by midquery.labelfrequency desc)
from(
-- Count the appearance of each label.
select count(*) as labelfrequency, account, name, label
from(
select account, name, label from myTable
) innerquery
group by account, name, label
) midquery

Window functions are simply not allowed in the WHERE clause. There are good reasons for this, but you can think of it as just another rule of SQL -- similar to column aliases defined in the SELECT not being recognized in the WHERE clause.
(The real reason is the difficulty of specifying how the window function would operate when there are multiple filtering conditions. It is (almost?) impossible to come up with a coherent set of rules.)
Having said that, you can simplify your query:
select t.account, t.name, t.label, t.labelfrequency
from (select count(*) as labelfrequency, account, name, label,
rank() over (partition by account, name
order by count(*) desc
) as seqnum
from myTable t
group by account, name, label
) t
where seqnum = 1;
That is, window functions and aggregation functions can be combined: the window function is evaluated after the GROUP BY, so it can order by count(*). And you don't need a subquery to select only a handful of columns.
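To make the rule concrete, here is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in for Hive (an assumption; window functions need SQLite 3.25 or later, and the error text differs from Hive's). The sample rows are made up. It shows that a window function in WHERE is rejected, while ranking next to the aggregate and filtering one level up works.

```python
import sqlite3

# Hypothetical sample data standing in for myTable (account, name, label).
rows = [
    ("acc1", "alice", "red"),
    ("acc1", "alice", "red"),
    ("acc1", "alice", "blue"),
    ("acc2", "bob", "green"),
    ("acc2", "bob", "green"),
    ("acc2", "bob", "blue"),
    ("acc2", "bob", "blue"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (account TEXT, name TEXT, label TEXT)")
conn.executemany("INSERT INTO myTable VALUES (?, ?, ?)", rows)

# A window function placed directly in WHERE is rejected by the engine.
bad_query = """
    SELECT account, name, label
    FROM myTable
    WHERE rank() OVER (PARTITION BY account, name ORDER BY label) = 1
"""
try:
    conn.execute(bad_query)
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

# Compute the rank alongside the aggregate, then filter in the outer query.
good_query = """
    SELECT account, name, label, labelfrequency
    FROM (
        SELECT account, name, label,
               count(*) AS labelfrequency,
               rank() OVER (PARTITION BY account, name
                            ORDER BY count(*) DESC) AS seqnum
        FROM myTable
        GROUP BY account, name, label
    ) AS t
    WHERE seqnum = 1
    ORDER BY account, name, label
"""
most_frequent = conn.execute(good_query).fetchall()

print(where_rejected)
print(most_frequent)
```

Because rank() gives tied labels the same rank, the bob group in this sample returns both of its equally frequent labels; row_number() would keep only one of them.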

Related

Invalid group by expression error when using any_value with max and window function in Snowflake

I was given a query and I am attempting to modify it in order to get the most recent version of each COMP_ID. The original query:
SELECT
ANY_VALUE(DATA_INDEX)::string AS DATA_INDEX,
COMP_ID::string AS COMP_ID,
ANY_VALUE(ACCOUNT_ID)::string AS ACCOUNT_ID,
ANY_VALUE(COMP_VERSION)::string AS COMP_VERSION,
ANY_VALUE(NAME)::string AS NAME,
ANY_VALUE(DESCRIPTION)::string AS DESCRIPTION,
MAX(OBJECT_DICT:"startshape-type")[0]::string AS STARTSHAPE_TYPE,
MAX(OBJECT_DICT:"startshape-connector-type")[0]::string AS STARTSHAPE_CONNECTOR_TYPE ,
MAX(OBJECT_DICT:"startshape-action-type")[0]::string AS STATSHAPE_ACTION_TYPE,
MAX(OBJECT_DICT:"overrides-enabled")[0]::string AS OVERRIDES_ENABLED
FROM COMP_DATA
GROUP BY COMP_ID
ORDER BY COMP_ID;
I then attempted to use a window function to grab only the highest version for each comp_id.
This is the modified query:
SELECT
ANY_VALUE(DATA_INDEX)::string AS DATA_INDEX,
COMP_ID::string AS COMP_ID,
ANY_VALUE(ACCOUNT_ID)::string AS ACCOUNT_ID,
ANY_VALUE(COMP_VERSION)::string AS COMP_VERSION,
ANY_VALUE(NAME)::string AS NAME,
ANY_VALUE(DESCRIPTION)::string AS DESCRIPTION,
MAX(OBJECT_DICT:"startshape-type")[0]::string AS STARTSHAPE_TYPE,
MAX(OBJECT_DICT:"startshape-connector-type")[0]::string AS STARTSHAPE_CONNECTOR_TYPE ,
MAX(OBJECT_DICT:"startshape-action-type")[0]::string AS STATSHAPE_ACTION_TYPE,
MAX(OBJECT_DICT:"overrides-enabled")[0]::string AS OVERRIDES_ENABLED,
ROW_NUMBER() OVER (PARTITION BY COMP_ID ORDER BY COMP_VERSION DESC) AS ROW_NUM
FROM COMP_DATA
QUALIFY 1 = ROW_NUM;
When attempting to compile, the error below is given:
SQL compilation error: [COMP_DATA.COMP_ID] is not a valid group by expression
I had originally thought the issue was the ANY_VALUE on COMP_VERSION, but after removing the ANY_VALUE the same error was given. The only way I found to not get an error was removing the 4 MAX fields and all of the ANY_VALUE()'s, as shown below:
SELECT
DATA_INDEX::string AS DATA_INDEX,
COMP_ID::string AS COMP_ID,
ACCOUNT_ID::string AS ACCOUNT_ID,
COMP_VERSION::string AS COMP_VERSION,
NAME::string AS NAME,
DESCRIPTION::string AS DESCRIPTION,
ROW_NUMBER() OVER (PARTITION BY COMP_ID ORDER BY COMP_VERSION DESC) AS ROW_NUM
FROM COMP_DATA
QUALIFY 1 = ROW_NUM;
Of course this is not at all sufficient since I need the 4 max fields.
I have also tried creating the table with the max fields and from that new table using the window function to select the highest COMP_VERSION of each COMP_ID, but the same error was given.
When you added your QUALIFY clause, you dropped the GROUP BY clause from your SQL. Aggregate functions like MAX require that every selected expression either be an aggregate function or appear in a GROUP BY clause.
So if you only want the best row per group, as you note, your aggregate functions need to be explicitly windowed. Thus:
SELECT
data_index::string AS data_index,
comp_id::string AS comp_id,
account_id::string AS account_id,
comp_version::string AS comp_version,
name::string AS name,
description::string AS description,
MAX(object_dict:"startshape-type")OVER(PARTITION BY comp_id)[0]::string AS startshape_type,
MAX(object_dict:"startshape-connector-type")OVER (PARTITION BY comp_id)[0]::string AS startshape_connector_type ,
MAX(object_dict:"startshape-action-type")OVER (PARTITION BY comp_id)[0]::string AS statshape_action_type,
MAX(object_dict:"overrides-enabled")OVER(PARTITION BY comp_id)[0]::string AS overrides_enabled
FROM COMP_DATA
QUALIFY 1 = ROW_NUMBER() OVER (PARTITION BY comp_id ORDER BY comp_version DESC);
There is a small chance you will need to add a set of parentheses around those MAX calls, like
(MAX(object_dict:"overrides-enabled")OVER(PARTITION BY comp_id))[0]::string AS overrides_enabled,
But I suspect it will work out of the box. I also assumed you don't want the row number in the output, so I pushed it into the QUALIFY (it would always be 1).
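The idea of windowing the aggregates can be tried outside Snowflake as well. The sketch below is an approximation under stated assumptions: it uses SQLite through Python's sqlite3 module, which has no QUALIFY and no OBJECT_DICT path syntax, so the filter moves to an outer query and a plain startshape_type column (hypothetical, like the sample rows) stands in for the extracted value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comp_data ("
    "comp_id TEXT, comp_version INTEGER, name TEXT, startshape_type TEXT)"
)
# Hypothetical rows: the newest version of c1 has no startshape_type of its own.
conn.executemany(
    "INSERT INTO comp_data VALUES (?, ?, ?, ?)",
    [
        ("c1", 1, "first draft", "listen"),
        ("c1", 2, "second draft", None),
        ("c2", 1, "only version", "connector"),
    ],
)

# MAX(...) OVER (PARTITION BY comp_id) looks at every version of a component,
# while ROW_NUMBER() picks the newest version; no GROUP BY is involved.
query = """
    SELECT comp_id, comp_version, name, startshape_type
    FROM (
        SELECT comp_id, comp_version, name,
               MAX(startshape_type) OVER (PARTITION BY comp_id) AS startshape_type,
               ROW_NUMBER() OVER (PARTITION BY comp_id
                                  ORDER BY comp_version DESC) AS row_num
        FROM comp_data
    ) AS ranked
    WHERE row_num = 1
    ORDER BY comp_id
"""
latest = conn.execute(query).fetchall()
print(latest)
```

The newest row of c1 still reports 'listen' because the windowed MAX sees the whole partition before the row filter is applied.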

Select one random row by group (Oracle 10g)

This post is similar to this thread in that I have multiple observations per group. However, I want to randomly select only one of them. I am also working on Oracle 10g.
There are multiple rows per person_id in table df. I want to order each group of person_ids by dbms_random.value() and select the first observation from each group. To do so, I tried:
select
person_id, purchase_date
from
df
where
row_number() over (partition by person_id order by dbms_random.value()) = 1
The query returns:
ORA-30483: window functions are not allowed here
30483. 00000 - "window functions are not allowed here"
*Cause: Window functions are allowed only in the SELECT list of a query. And, window function cannot be an argument to another window or group function.
Use a subquery:
select person_id, purchase_date
from (select df.*,
row_number() over (partition by person_id order by dbms_random.value()) as seqnum
from df
) df
where seqnum = 1;
One option would be to use a WITH ... AS clause:
WITH t AS
(
SELECT df.*,
ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY dbms_random.value()) AS rn
FROM df
)
SELECT person_id, purchase_date
FROM t
WHERE rn = 1
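The CTE pattern above can be checked with a small, self-contained sketch. It assumes SQLite through Python's sqlite3 module rather than Oracle, so random() replaces dbms_random.value(), and the purchase rows are invented for illustration.

```python
import sqlite3

# Hypothetical purchases: several rows per person_id.
purchases = [
    (1, "2020-01-01"),
    (1, "2020-02-01"),
    (1, "2020-03-01"),
    (2, "2021-05-05"),
    (2, "2021-06-06"),
    (3, "2022-07-07"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (person_id INTEGER, purchase_date TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", purchases)

# Number the rows of each person in random order, then keep row 1 per person.
query = """
    WITH t AS (
        SELECT person_id, purchase_date,
               ROW_NUMBER() OVER (PARTITION BY person_id
                                  ORDER BY random()) AS rn
        FROM df
    )
    SELECT person_id, purchase_date
    FROM t
    WHERE rn = 1
    ORDER BY person_id
"""
picked = conn.execute(query).fetchall()
print(picked)
```

Each run returns exactly one row per person_id, but which purchase_date appears can change from run to run.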
Aggregate queries (using GROUP BY and aggregate functions) can be considerably faster than equivalent analytic functions that do the same job. So, if you have a lot of data to process, or if the data is not excessively large but you must run this query often, you may want a more efficient query that uses aggregation instead of analytic functions.
Here is one possible approach:
select person_id,
max(purchase_date) keep (dense_rank first order by dbms_random.value())
as random_purchase_date
from df
group by person_id
;

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads to solve my question. I have three columns in a table, columns being name, customer, and location. I'd like to add an additional column determining which location is most frequent, based off name and customer (first two columns).
I have included a photo of an example where, for name Jane and customer BEC, my created column would be "Texas", as that has 2 occurrences as opposed to one for California. Would there be any way to implement this?
If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
(select t2.location
from table1 t2
where t2.name = t.name
group by name, location
order by count(*) desc
fetch first 1 row only
) as most_frequent_location
from table1 t ;
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
max(location) keep (dense_rank first order by location_count desc) over (partition by name) most_frequent_location
from (select t.*,
count(*) over (partition by name, customer, location) as location_count
from table1 t
) t;
Here is a db<>fiddle.
Both of these versions put 'Texas' in all four rows. However, each can be tweaked with minimal effort to put 'California' in the row for ARC.
In Oracle, you can use the aggregate function stats_mode() to compute the most frequently occurring value in a group.
Unfortunately it is not implemented as a window function. So one option uses an aggregate subquery, and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
select name, customer, stats_mode(location) top_location
from mytable
group by name, customer
) s on s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
t.*,
(
select stats_mode(t1.location)
from mytable t1
where t1.name = t.name and t1.customer = t.customer
) top_location
from mytable t
This is more a question about understanding the concepts of a relational database. If you want that information, you would not store it in an additional column. It is data calculated over multiple columns, so why would you store it in the table itself? It is complex to code, and it would also be very expensive for the database (imagine all the rows you would have to recalculate that value for if someone inserted a million rows).
Instead, you can do one of the following:
Calculate it at runtime, as shown in the other answers.
If you want to make it more persistent, embed one of those queries in a view.
If you want to physically store the information, use a materialized view.
There is plenty of documentation on those three options in the official Oracle documentation.
Your first step is to construct a query that determines the most frequent location, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful on its own, but the count can be used inside row_number(), which numbers the rows within each partition. In the query below, I'm ordering by count(*) in descending order so that the most frequent occurrence gets the value 1.
Note that row_number() assigns 1 to exactly one row per partition, so if two locations tie, only one of them is picked.
So, now we have
select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location most_freq_location
from table1 tab
inner join
(select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1
group by Name, Customer, Location) tb_
on tb_.Name = tab.Name
and tb_.Customer = tab.Customer
and freq_name_cust = 1
You can see how it all works in this Fiddle where I deliberately inserted rows with the same frequency for California and Texas for one of the customers for illustration purposes.
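For readers without access to the fiddle, the join-back step can be sketched as follows. This assumes SQLite through Python's sqlite3 module instead of Oracle, and the four rows are a guess at the example in the question. The extra tie-break on location in the ORDER BY is an addition that makes the choice deterministic when two locations are equally frequent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT, customer TEXT, location TEXT)")
# Hypothetical rows modelled on the question: Jane/BEC is mostly in Texas.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "Texas"),
        ("Jane", "BEC", "California"),
        ("Jane", "ARC", "California"),
    ],
)

# Rank locations per (name, customer) by frequency, keep rank 1,
# and join that single row back onto every original row.
query = """
    SELECT tab.name, tab.customer, tab.location,
           freq.location AS most_freq_location
    FROM table1 AS tab
    INNER JOIN (
        SELECT name, customer, location,
               ROW_NUMBER() OVER (PARTITION BY name, customer
                                  ORDER BY count(*) DESC, location)
                   AS freq_name_cust
        FROM table1
        GROUP BY name, customer, location
    ) AS freq
      ON freq.name = tab.name
     AND freq.customer = tab.customer
     AND freq.freq_name_cust = 1
    ORDER BY tab.customer, tab.location
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Here the ARC row gets 'California' and the three BEC rows get 'Texas', since the partition includes customer.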

How to work around the "Correlated subqueries that reference other tables" error without using a join

I am trying to work with the public BigQuery dataset bigquery-public-data.austin_crime.crime. My goal is to get an output with three columns: the description (of the crime), the count of crimes with that description, and the top district for that particular description.
I am able to get the first two columns with this query.
select
a.description,
count(*) as district_count
from `bigquery-public-data.austin_crime.crime` a
group by description order by district_count desc
I was hoping to get everything in one query, so I tried to add a third column showing the top district for each description (crime) with the code below:
select
a.description,
count(*) as district_count,
(
select district from
( select
district, rank() over(order by COUNT(*) desc) as rank
FROM `bigquery-public-data.austin_crime.crime`
where description = a.description
group by district
) where rank = 1
) as top_District
from `bigquery-public-data.austin_crime.crime` a
group by description
order by district_count desc
The error I am getting is: "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
I think I can do that with joins. Does someone have a better solution, possibly one that does not use a join?
Below is for BigQuery Standard SQL
#standardSQL
SELECT description,
ANY_VALUE(district_count) AS district_count,
STRING_AGG(district ORDER BY cnt DESC LIMIT 1) AS top_district
FROM (
SELECT description, district,
COUNT(1) OVER(PARTITION BY description) AS district_count,
COUNT(1) OVER(PARTITION BY description, district) AS cnt
FROM `bigquery-public-data.austin_crime.crime`
)
GROUP BY description
-- ORDER BY district_count DESC

error using "order by" in a select statement (error: column is not contained in "either an aggregate function or the GROUP BY clause")

I have a table below and want to count the number of consecutive occurrences each letter appears. The code to reproduce the table I am using is listed for those helping to save time.
CREATE TABLE table1 (id integer, names varchar(50));
INSERT INTO table1 VALUES (1,'A');
INSERT INTO table1 VALUES (2,'A');
INSERT INTO table1 VALUES (3,'B');
INSERT INTO table1 VALUES (4,'B');
INSERT INTO table1 VALUES (5,'B');
INSERT INTO table1 VALUES (6,'B');
INSERT INTO table1 VALUES (7,'C');
INSERT INTO table1 VALUES (8,'B');
INSERT INTO table1 VALUES (9,'B');
select * from table1;
I found code already written to accomplish this online, which I've tested and can confirm it runs successfully. It's shown here.
select names, count(*) as count
from (select id, names, (row_number() over (order by id) - row_number() over (partition by names order by id)) as grp
from table1
) as temp
group by grp, names
I am trying to add in the ORDER BY clause at the end, like so:
select names, count(*) as count
from (select id, names, (row_number() over (order by id) - row_number() over (partition by names order by id)) as grp
from table1
) as temp
group by grp, names
order by id -- added this here, but it creates an error.
but kept getting the error "Column "temp.id" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause." However, I am able to order by "names." What is the difference here?
Also, why can't I add in the "order by id" in the subquery? If I run this subquery on its own (see below), then the "order by id" is fine, but all together it cannot run. Why is this?
select names, count(*) as count
from (select id, names, (row_number() over (order by id) - row_number() over (partition by names order by id)) as grp
from table1
order by id -- added this in here, but it creates an error.
) as temp
group by grp, names
order by names
A select statement returns rows in an arbitrary order -- unless it has an order by. This is an extension of the fact that SQL operates on unordered sets.
Your select has no order by, so you should not assume the data would come back in any particular order. To get the results ordered by id, add order by id to the select.
kept getting the error "Column "temp.id" is invalid in the ORDER BY
clause because it is not contained in either an aggregate function or
the GROUP BY clause." However, I am able to order by "names." What is
the difference here?
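SQL evaluates the clauses of a query in a defined logical order. If your query has a GROUP BY (which yours does), the grouping is done before the ORDER BY. After grouping, each output row stands for a whole group, so the only things SQL has left are the grouped columns and aggregates over the group, and those are the only things that can be used in the ORDER BY clause. You can order by names because it is in the GROUP BY; id is not, and a single group contains several id values.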
As an example, think of houses in a street. If you did a query on houses, returning colour & count, you might get something like Red 2, White 10, Green 3. But asking to sort that by address number makes no sense, because that information is not in data we've returned. In your case you are returning names, count, and you used grp in the group by clause, so those are the only things you can use to sort the final data, because they are all you have, and all that makes sense.
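Following that reasoning, one way to get the runs back in their original sequence is to order by an aggregate of id, such as min(id), which has a single value per group. The sketch below is an illustration rather than part of the original answers. It assumes SQLite through Python's sqlite3 module; note that SQLite is more permissive than SQL Server and may not raise the same error for a bare ORDER BY id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, names VARCHAR(50))")
# Same rows as in the question.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "B"),
     (6, "B"), (7, "C"), (8, "B"), (9, "B")],
)

# grp is constant within each run of consecutive equal names.
# min(id) is an aggregate, so it is a legal sort key after GROUP BY
# and places each run at the position of its first row.
query = """
    SELECT names, count(*) AS run_length
    FROM (
        SELECT id, names,
               row_number() OVER (ORDER BY id)
               - row_number() OVER (PARTITION BY names ORDER BY id) AS grp
        FROM table1
    ) AS temp
    GROUP BY grp, names
    ORDER BY min(id)
"""
runs = conn.execute(query).fetchall()
print(runs)  # [('A', 2), ('B', 4), ('C', 1), ('B', 2)]
```

Using max(id) instead would sort by the last row of each run, which gives the same sequence here because runs do not overlap.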
Also, why can't I add in the "order by id" in the subquery? If I run
this subquery on its own (see below), then the "order by id" is fine,
but all together it cannot run. Why is this?
When you have a subquery, the results are used as if they were a table. You can join on it, or query from it like you are, but the point is that the order of that table has no effect on anything else. The entry order of the underlying table is no guarantee that your query will come out in that order, unless you use an order by clause on the outer query. And because you are doing a group by, that order means nothing anyway. Because the order of the subquery has no effect, SQL won't let you put it in.