BigQuery - running count and split functions together

I am trying to run a count on the result of the split functions. The query below shows an example:
select a.name,
count(if(split(b.name,",")='test',null,1)) > 0 hasTest,
from (select * from (select 'test,this' as name) a left join (select '2' as name) b on
a.name=b.name)
This query yields an error: SELECT clause has mix of aggregations 'hasTest' and fields 'a.name' without GROUP BY clause
If I change the hasTest column to be an integer instead of boolean, so that:
count(if(split(b.name,",")='test',null,1))
The query succeeds.
For some reason, BigQuery can evaluate the count function on its own (it works on a nested element created in place, so no GROUP BY clause is required), but it cannot do the same when the count is wrapped in a boolean comparison.

I think it's just an unclear error message.
The problem here seems to be with the data type of null. BigQuery needs you to define the data type of a null. The default null data type is boolean. If you don't define it, there is a mix of data types in the same field.

Related

Athena subquery not working - incorrect table name

I'm trying to write an athena query that uses a subquery in the where clause because I want to place a restriction on an array-type field. I don't want to do a cross join unnest since I don't want to flatten each row.
Example query:
SELECT
foo.some_scalar_field,
foo.some_other_scalar_field
FROM "fake_db"."table_name"
WHERE EXISTS (SELECT NULL FROM foo.some_array_field as T(item) WHERE item = "THING")
When I do this, I get an error saying that "foo" isn't a valid table name. I'm fairly new to Athena, but a SQL engine I used previously supported this kind of query. Is this not possible in Athena?
Edit:
I went ahead and added an unnest:
SELECT
foo.some_scalar_field,
foo.some_other_scalar_field
FROM "fake_db"."table_name"
WHERE EXISTS (SELECT NULL FROM UNNEST(foo.some_array_field) as T(item) WHERE item = "THING")
Now I get an error that the correlated subquery here is not supported. It seems like Athena doesn't support correlated subqueries?

There is no need for unnest and correlated subqueries; just use array functions. Something along these lines (note the table alias foo, and single quotes for the string literal, since double quotes denote identifiers in Athena):
SELECT
foo.some_scalar_field,
foo.some_other_scalar_field
FROM "fake_db"."table_name" AS foo
WHERE any_match(foo.some_array_field, item -> item = 'THING')

What happens when you use DISTINCT * in COUNT() in SQL?

I've just learned about the COUNT() function, and how it is possible to get the number of rows in a table by passing * as the argument.
SELECT COUNT(*) FROM table;
I've also learned that we can get the number of distinct rows of a column in a table by using DISTINCT.
SELECT COUNT(DISTINCT column) FROM table;
I've noticed that the following returns nothing.
SELECT COUNT(DISTINCT *) FROM table;
Why is this?
I suppose the root of my issue is that I don't fully understand what the COUNT() function does with * as the argument. My resource says that the COUNT() function takes a column as an argument and counts how many non-NULL rows there are. So say we have a table with a column in which some rows are NULL and some are not. If COUNT(column) doesn't count the NULL rows, what happens differently in COUNT(*) so that all the rows are counted? And by extension, what happens during COUNT(DISTINCT *)?

This would be a syntax error in most databases. If it were allowed, it would probably be equivalent to:
select count(*)
from (select distinct * from t) t
However, NULL values might throw it off.
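The difference between these counts can be checked on a small table. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in engine; the table t, its columns a and b, and the sample rows are hypothetical. Other databases may treat NULLs or the COUNT(DISTINCT *) syntax differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
# Hypothetical rows: one exact duplicate, one row with a NULL, one all-NULL row.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, None), (None, None)],
)


def scalar(sql):
    """Run a query that returns one value and return that value."""
    return conn.execute(sql).fetchone()[0]


count_star = scalar("SELECT COUNT(*) FROM t")                  # counts every row
count_b = scalar("SELECT COUNT(b) FROM t")                     # skips rows where b IS NULL
count_distinct_b = scalar("SELECT COUNT(DISTINCT b) FROM t")   # distinct non-NULL values of b
# The rewrite suggested above: count the rows of a SELECT DISTINCT * subquery.
count_distinct_rows = scalar(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM t) AS d"
)

# Check whether this SQLite build accepts COUNT(DISTINCT *) at all.
try:
    conn.execute("SELECT COUNT(DISTINCT *) FROM t")
    distinct_star_rejected = False
except sqlite3.OperationalError:
    distinct_star_rejected = True

print(count_star, count_b, count_distinct_b, count_distinct_rows)
print("COUNT(DISTINCT *) rejected:", distinct_star_rejected)
```

Here the subquery form counts the rows that contain NULLs, while COUNT(DISTINCT b) skips them, which is one way NULL values can make the two approaches disagree.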

SQL returning 0 for count(), but returning multiple rows with simple SELECT

Sorry if the phrasing of my question was not very clear.
I am running this simple query below
SELECT count(cg)
FROM all_data
WHERE cg is null
and am getting 0 as the result. When I run this query
SELECT cg
FROM all_data
WHERE cg is null
I get a bunch of records that fit the criteria. There are very obviously many records that have a cg value of null, but they are not counted by the count() query.
Is there a reason for this? Am I doing something wrong?
Thanks for any help

Aggregates (COUNT(), SUM() etc.) ignore NULL values.
Use COUNT(*) to count all rows matching your condition.
SELECT COUNT(*)
FROM all_data
WHERE cg IS NULL
Further reading - Count Function (Microsoft Access SQL):
The Count function does not count records that have Null fields unless expr is the asterisk (*) wildcard character. If you use an asterisk, Count calculates the total number of records, including those that contain Null fields. Count(*) is considerably faster than Count([Column Name]).

If you want to count the number of NULL values, use the following query:
SELECT
SUM(CASE WHEN CG IS NULL THEN 1 END) AMOUNT_CG
FROM all_data
Otherwise, follow the advice in the answer above.

According to the SQL Reference Manual section on Aggregate Functions:
All aggregate functions except COUNT(*) and GROUPING ignore nulls. You can use the NVL function in the argument to an aggregate function to substitute a value for a null. COUNT never returns null, but returns either a number or zero. For all the remaining aggregate functions, if the data set contains no rows, or contains only rows with nulls as arguments to the aggregate function, then the function returns null.
So, from the information above, we can conclude that the solution is to use count(*) instead of count(cg).
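The behaviour described in these answers can be reproduced with a few rows. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module) as a stand-in for the asker's database; the id column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_data (id INTEGER, cg TEXT)")
# Hypothetical data: two rows with a NULL cg, one with a value.
conn.executemany(
    "INSERT INTO all_data VALUES (?, ?)",
    [(1, None), (2, None), (3, "x")],
)


def scalar(sql):
    """Run a query that returns one value and return that value."""
    return conn.execute(sql).fetchone()[0]


# COUNT(cg) only counts non-NULL cg values, and the WHERE clause keeps only NULLs.
count_cg = scalar("SELECT COUNT(cg) FROM all_data WHERE cg IS NULL")
# COUNT(*) counts rows, regardless of what the columns contain.
count_star = scalar("SELECT COUNT(*) FROM all_data WHERE cg IS NULL")
# The conditional SUM counts NULLs without needing a WHERE clause.
sum_case = scalar("SELECT SUM(CASE WHEN cg IS NULL THEN 1 END) FROM all_data")
# When no row matches, SUM has only NULL inputs and returns NULL, not 0.
sum_case_no_match = scalar(
    "SELECT SUM(CASE WHEN cg = 'missing' THEN 1 END) FROM all_data"
)

print(count_cg, count_star, sum_case, sum_case_no_match)
```

One design point worth noting: the SUM(CASE ...) form yields NULL rather than 0 when nothing matches, so COUNT(*) with a WHERE clause is the simpler choice when a plain number is needed.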

PostgreSQL Where count condition

I have the following query in PostgreSQL:
SELECT
COUNT(a.log_id) AS overall_count
FROM
"Log" as a,
"License" as b
WHERE
a.license_id=7
AND
a.license_id=b.license_id
AND
b.limit_call > overall_count
GROUP BY
a.license_id;
Why do I get this error:
ERROR: column "overall_count" does not exist
My table structure:
License(license_id, license_name, limit_call, create_date, expire_date)
Log(log_id, license_id, log, call_date)
I want to check if a license has reached the limit for calls in a specific month.

SELECT a.license_id, a.limit_call
, count(b.license_id) AS overall_count
FROM "License" a
LEFT JOIN "Log" b USING (license_id)
WHERE a.license_id = 7
GROUP BY a.license_id -- , a.limit_call -- add in old versions
HAVING a.limit_call > count(b.license_id)
Since Postgres 9.1 the primary key covers all columns of a table in the GROUP BY clause. In older versions you'd have to add a.limit_call to the GROUP BY list. The release notes for 9.1:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause
Further reading:
Why can't I exclude dependent columns from `GROUP BY` when I aggregate by a key?
The condition you had in the WHERE clause has to move to the HAVING clause since it refers to the result of an aggregate function (after WHERE has been applied). And you cannot refer to output columns (column aliases) in the HAVING clause, where you can only reference input columns. So you have to repeat the expression. The manual:
An output column's name can be used to refer to the column's value in
ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING
clauses; there you must write out the expression instead.
I reversed the order of tables in the FROM clause and cleaned up the syntax a bit to make it less confusing. USING is just a notational convenience here.
I used LEFT JOIN instead of JOIN, so you do not exclude licenses without any logs at all.
Only non-null values are counted by count(). Since you want to count related entries in table "Log" it is safer and slightly cheaper to use count(b.license_id). This column is used in the join, so we don't have to bother whether the column can be null or not.
count(*) is even shorter and slightly faster. If you don't mind getting a count of 1 for a license that has no rows at all in "Log" (the LEFT JOIN still produces one row for it), use that.
Aside: I would advise not to use mixed case identifiers in Postgres if possible. Very error prone.
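The difference between count(b.license_id) and count(*) after a LEFT JOIN is easy to see on sample data. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module) rather than Postgres; the rows are hypothetical, the tables are reduced to the columns the query needs, the WHERE a.license_id = 7 filter is dropped so that both licenses appear, and the join uses ON instead of USING purely to keep the sketch engine-neutral.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "License" (license_id INTEGER PRIMARY KEY, limit_call INTEGER);
    CREATE TABLE "Log" (log_id INTEGER PRIMARY KEY, license_id INTEGER);
    -- Hypothetical data: license 7 has two log rows, license 8 has none.
    INSERT INTO "License" VALUES (7, 3), (8, 5);
    INSERT INTO "Log" VALUES (1, 7), (2, 7);
""")

# The aggregate expression is repeated in HAVING instead of using the alias.
QUERY = """
    SELECT a.license_id, a.limit_call, {counter} AS overall_count
    FROM "License" a
    LEFT JOIN "Log" b ON b.license_id = a.license_id
    GROUP BY a.license_id, a.limit_call
    HAVING a.limit_call > {counter}
    ORDER BY a.license_id
"""

# Counts only joined rows where the "Log" side is present (non-NULL).
by_column = conn.execute(QUERY.format(counter="count(b.license_id)")).fetchall()
# Counts every joined row, including the NULL-extended row for license 8.
by_star = conn.execute(QUERY.format(counter="count(*)")).fetchall()

print(by_column)
print(by_star)
```

License 8 has no log entries, so count(b.license_id) reports 0 for it while count(*) reports 1.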

Pure conditional count(*):
SELECT COUNT(*) FILTER(where a.myfield > 0) AS my_count
FROM "Log" as a
GROUP BY a.license_id
This way you:
get 0 for groups where the condition is never met
can add as many count(*) columns as you need
To filter out the groups that do not satisfy a condition:
NOTE: you cannot use HAVING b.limit_call > ..., unless you group by limit_call. But you can use an aggregate function to map the many limit_call values in the group to a single value. For example, in your case, you can use MAX:
SELECT COUNT(a.log_id) AS overall_count
FROM "Log" as a
JOIN "License" b ON(a.license_id=b.license_id)
GROUP BY a.license_id
HAVING MAX(b.limit_call) > COUNT(a.log_id)
Don't worry about repeating the COUNT(a.log_id) expression in the first and last lines; Postgres will optimize it.
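A conditional count can be tried out as follows. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module) instead of Postgres; the myfield column and the rows are hypothetical. Because the FILTER clause on aggregates needs SQLite 3.30 or newer, the sketch computes the result with the portable COUNT(CASE WHEN ... END) form first and only runs the FILTER form when the library version allows it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Log" (log_id INTEGER PRIMARY KEY, license_id INTEGER, myfield INTEGER);
    -- Hypothetical data: license 7 has two rows with myfield > 0, license 8 has none.
    INSERT INTO "Log" VALUES (1, 7, 5), (2, 7, 0), (3, 7, 9), (4, 8, 0);
""")

# Portable form: CASE yields NULL when the condition fails, and COUNT skips NULLs.
case_counts = conn.execute("""
    SELECT a.license_id, COUNT(CASE WHEN a.myfield > 0 THEN 1 END) AS my_count
    FROM "Log" AS a
    GROUP BY a.license_id
    ORDER BY a.license_id
""").fetchall()

# FILTER form, run only if the bundled SQLite is recent enough to support it.
filter_counts = None
if sqlite3.sqlite_version_info >= (3, 30, 0):
    filter_counts = conn.execute("""
        SELECT a.license_id, COUNT(*) FILTER (WHERE a.myfield > 0) AS my_count
        FROM "Log" AS a
        GROUP BY a.license_id
        ORDER BY a.license_id
    """).fetchall()

print(case_counts)
print(filter_counts)
```

Both forms keep license 8 in the output with a count of 0, which a WHERE a.myfield > 0 filter would not do.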

The WHERE clause doesn't recognize your column alias, and furthermore, you're trying to filter on an aggregate, which can only be done after aggregation. Try:
SELECT
COUNT(a.log_id) AS overall_count
FROM
"Log" as a,
"License" as b
WHERE
a.license_id=7
AND
a.license_id=b.license_id
GROUP BY
a.license_id, b.limit_call
HAVING b.limit_call > count(a.log_id);
The having clause is similar to the where clause, except that it deals with columns after an aggregation, whereas the where clause works on columns before an aggregation.
Also, is there a reason why your table names are enclosed in double quotes?

What is the difference between HAVING and WHERE in SQL?

What is the difference between HAVING and WHERE in an SQL SELECT statement?
EDIT: I have marked Steven's answer as the correct one as it contained the key bit of information on the link:
When GROUP BY is not used, HAVING behaves like a WHERE clause
The situation in which I had seen HAVING did not have a GROUP BY, and that is where my confusion started. Of course, until you know this you can't specify it in the question.

HAVING: is used to check conditions after the aggregation takes place.
WHERE: is used to check conditions before the aggregation takes place.
This code:
select City, CNT=Count(1)
From Address
Where State = 'MA'
Group By City
Gives you a table of all cities in MA and the number of addresses in each city.
This code:
select City, CNT=Count(1)
From Address
Where State = 'MA'
Group By City
Having Count(1)>5
Gives you a table of cities in MA with more than 5 addresses and the number of addresses in each city.
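The two queries above can be run against a handful of rows to see where each filter applies. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module); the rows are hypothetical, the threshold is lowered from 5 to 1 to keep the data small, and COUNT(1) AS CNT replaces the CNT=Count(1) spelling, which is specific to SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Address (City TEXT, State TEXT)")
# Hypothetical data: two MA cities and one NY city.
conn.executemany(
    "INSERT INTO Address VALUES (?, ?)",
    [
        ("Boston", "MA"), ("Boston", "MA"), ("Boston", "MA"),
        ("Salem", "MA"),
        ("Albany", "NY"), ("Albany", "NY"),
    ],
)

# WHERE removes non-MA rows before any grouping happens.
where_only = conn.execute("""
    SELECT City, COUNT(1) AS CNT
    FROM Address
    WHERE State = 'MA'
    GROUP BY City
    ORDER BY City
""").fetchall()

# HAVING then removes whole groups based on the aggregated count.
where_and_having = conn.execute("""
    SELECT City, COUNT(1) AS CNT
    FROM Address
    WHERE State = 'MA'
    GROUP BY City
    HAVING COUNT(1) > 1
    ORDER BY City
""").fetchall()

# Without the WHERE, the NY group survives as well.
having_only = conn.execute("""
    SELECT City, COUNT(1) AS CNT
    FROM Address
    GROUP BY City
    HAVING COUNT(1) > 1
    ORDER BY City
""").fetchall()

print(where_only)
print(where_and_having)
print(having_only)
```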

HAVING specifies a search condition for a
group or an aggregate function used in SELECT statement.
Source

Number one difference for me: if HAVING was removed from the SQL language then life would go on more or less as before. Certainly, a minority of queries would need to be rewritten using a derived table, CTE, etc., but they would arguably be easier to understand and maintain as a result. Maybe vendors' optimizer code would need to be rewritten to account for this, again an opportunity for improvement within the industry.
Now consider for a moment removing WHERE from the language. This time the majority of queries in existence would need to be rewritten without an obvious alternative construct. Coders would have to get creative, e.g. inner join to a table known to contain exactly one row (e.g. DUAL in Oracle) using the ON clause to simulate the prior WHERE clause. Such constructions would be contrived; it would be obvious that something was missing from the language, and the situation would be worse as a result.
TL;DR we could lose HAVING tomorrow and things would be no worse, possibly better, but the same cannot be said of WHERE.
From the answers here, it seems that many folk don't realize that a HAVING clause may be used without a GROUP BY clause. In this case, the HAVING clause is applied to the entire table expression and requires that only constants appear in the SELECT clause. Typically the HAVING clause will involve aggregates.
This is more useful than it sounds. For example, consider this query to test whether the name column is unique for all values in T:
SELECT 1 AS result
FROM T
HAVING COUNT( DISTINCT name ) = COUNT( name );
There are only two possible results: if the HAVING clause is true then the result will be a single row containing the value 1, otherwise the result will be the empty set.

The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions.
Check out this w3schools link for more information
Syntax:
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name
HAVING aggregate_function(column_name) operator value
A query such as this:
SELECT column_name, COUNT( column_name ) AS column_name_tally
FROM table_name
WHERE column_name < 3
GROUP
BY column_name
HAVING COUNT( column_name ) >= 3;
...may be rewritten using a derived table (and omitting the HAVING) like this:
SELECT column_name, column_name_tally
FROM (
SELECT column_name, COUNT(column_name) AS column_name_tally
FROM table_name
WHERE column_name < 3
GROUP
BY column_name
) pointless_range_variable_required_here
WHERE column_name_tally >= 3;
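That the two forms return the same rows can be checked directly. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module); the integer values loaded into table_name are hypothetical, and the derived table is given the shorter alias d.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (column_name INTEGER)")
# Hypothetical data: 1 appears three times, 2 twice, 5 four times.
conn.executemany(
    "INSERT INTO table_name VALUES (?)",
    [(v,) for v in (1, 1, 1, 2, 2, 5, 5, 5, 5)],
)

# Version with HAVING: filter rows, group, then filter groups.
with_having = conn.execute("""
    SELECT column_name, COUNT(column_name) AS column_name_tally
    FROM table_name
    WHERE column_name < 3
    GROUP BY column_name
    HAVING COUNT(column_name) >= 3
""").fetchall()

# Version with a derived table: the outer WHERE sees the tally as a plain column.
with_derived_table = conn.execute("""
    SELECT column_name, column_name_tally
    FROM (
        SELECT column_name, COUNT(column_name) AS column_name_tally
        FROM table_name
        WHERE column_name < 3
        GROUP BY column_name
    ) AS d
    WHERE column_name_tally >= 3
""").fetchall()

print(with_having)
print(with_derived_table)
```

The value 5 is removed by the inner WHERE even though its tally is the largest, and the value 2 is removed by the tally condition in both versions.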

The difference between the two is in the relationship to the GROUP BY clause:
WHERE comes before GROUP BY; SQL evaluates the WHERE clause before it groups records.
HAVING comes after GROUP BY; SQL evaluates HAVING after it groups records.
References
SQLite SELECT Statement Syntax/Railroad Diagram
Informix SELECT Statement Syntax/Railroad Diagram

HAVING is used when you filter on an aggregate in a grouped query, i.e. one that uses GROUP BY.
SELECT edc_country, COUNT(*)
FROM Ed_Centers
GROUP BY edc_country
HAVING COUNT(*) > 1
ORDER BY edc_country;

WHERE is applied as a limitation on the set returned by SQL; it uses SQL's built-in set operations and indexes and therefore is the fastest way to filter result sets. Always use WHERE whenever possible.
HAVING is necessary for some aggregate filters. It filters the query AFTER SQL has retrieved, assembled, and sorted the results. Therefore, it is much slower than WHERE and should be avoided except in those situations that require it.
SQL Server will let you get away with using HAVING even when WHERE would be much faster. Don't do it.

The WHERE clause does not work with aggregate functions, which means you should not write a query like this
(bonus is the table name):
SELECT name
FROM bonus
GROUP BY name
WHERE sum(salary) > 200
Without a GROUP BY clause, HAVING works much like WHERE. With GROUP BY, you have to use HAVING instead of WHERE for the aggregate condition:
SELECT name
FROM bonus
GROUP BY name
HAVING sum(salary) > 200

Difference between the WHERE and HAVING clauses:
The main difference is that WHERE filters individual rows, while HAVING filters groups of rows after aggregation.
Why do we need the HAVING clause?
Aggregate functions are computed over groups of rows, so their results do not exist yet when WHERE is evaluated and cannot be used there. Therefore, conditions on aggregate functions go in the HAVING clause.

One way to think of it is that the having clause is an additional filter to the where clause.
A WHERE clause is used to filter records from a result. The filter occurs before any groupings are made. A HAVING clause is used to filter values from a group.

In an aggregate query (any query where an aggregate function is used), predicates in a WHERE clause are evaluated before the aggregated intermediate result set is generated.
Predicates in a HAVING clause are applied to the aggregate result set after it has been generated. That's why predicate conditions on aggregate values must be placed in the HAVING clause, not in the WHERE clause, and why some dialects (MySQL, for example) let you use aliases defined in the SELECT clause in a HAVING clause but not in a WHERE clause. PostgreSQL, by contrast, requires the expression to be written out in both.

I had a problem and found out another difference between WHERE and HAVING. It does not act the same way on indexed columns.
WHERE my_indexed_row = 123 will show rows and automatically perform a "ORDER ASC" on other indexed rows.
HAVING my_indexed_row = 123 shows everything from the oldest "inserted" row to the newest one, no ordering.

When GROUP BY is not used, the WHERE and HAVING clauses are essentially equivalent.
However, when GROUP BY is used:
The WHERE clause is used to filter records from a result. The
filtering occurs before any groupings are made.
The HAVING clause is used to filter values from a group (i.e., to
check conditions after aggregation into groups has been performed).
Resource from Here

From here.
the SQL standard requires that HAVING
must reference only columns in the
GROUP BY clause or columns used in
aggregate functions
as opposed to the WHERE clause which is applied to database rows

While working on a project, I had this same question. As stated above, HAVING checks a condition on the grouped query result, whereas WHERE checks a condition on each row while the query runs, before grouping.
Let me give an example to illustrate this. Suppose you have a database table like this.
usertable{ int userid, date datefield, int dailyincome }
Suppose, the following rows are in table:
1, 2011-05-20, 100
1, 2011-05-21, 50
1, 2011-05-30, 10
2, 2011-05-30, 10
2, 2011-05-20, 20
Now, we want to get the userids and sum(dailyincome) whose sum(dailyincome)>100
If we write:
SELECT userid, sum(dailyincome) FROM usertable WHERE
sum(dailyincome)>100 GROUP BY userid
This will be an error. The correct query would be:
SELECT userid, sum(dailyincome) FROM usertable GROUP BY userid HAVING
sum(dailyincome)>100
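The example above can be run as written. This is a minimal sketch, assuming SQLite (via Python's sqlite3 module) as the engine and storing datefield as text; the table and rows are the ones listed in the answer. The exact error text for the first query varies by database, so the sketch only records that an error occurred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usertable (userid INTEGER, datefield TEXT, dailyincome INTEGER)"
)
# The five rows from the example above.
conn.executemany(
    "INSERT INTO usertable VALUES (?, ?, ?)",
    [
        (1, "2011-05-20", 100),
        (1, "2011-05-21", 50),
        (1, "2011-05-30", 10),
        (2, "2011-05-30", 10),
        (2, "2011-05-20", 20),
    ],
)

# An aggregate inside WHERE is rejected: the sums do not exist yet at that stage.
try:
    conn.execute("""
        SELECT userid, sum(dailyincome) FROM usertable
        WHERE sum(dailyincome) > 100 GROUP BY userid
    """)
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# The same condition in HAVING is evaluated once per group, after the sums exist.
having_rows = conn.execute("""
    SELECT userid, sum(dailyincome) FROM usertable
    GROUP BY userid HAVING sum(dailyincome) > 100
""").fetchall()

print("WHERE version failed with:", where_error)
print(having_rows)
```

User 1 totals 160 and user 2 totals 30, so only user 1 passes the HAVING condition.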

WHERE clause is used for comparing values in the base table, whereas the HAVING clause can be used for filtering the results of aggregate functions in the result set of the query

I use HAVING for constraining a query based on the results of an aggregate function, e.g. select SOMETHING, count(SOMETHING) from blahblahblah group by SOMETHING having count(SOMETHING)>0
