How exactly is the value of count(*) determined in BigQuery?

How exactly is the value of count(*) determined in BigQuery? - google-bigquery

I am joining a table of about 70000 rows with a slightly bigger second table through inner join each. Now count(a.business_column) and count(*) give different results. The former correctly reports back ~70000, while the latter gives ~200000. But this only happens when I select count(*) alone, when I select them together they give the same result (~70000). How is this possible?
select
count(*)
/*,count(a.business_column)*/
from table_a a
inner join each table_b b
on b.key_column = a.business_column

UPDATE: For a step by step explanation on how this works, see BigQuery flattens when using field with same name as repeated field instead.
To answer the title question: COUNT(*) in BigQuery is always accurate.
The caveat is that in SQL COUNT(*) and COUNT(column) have semantically different meanings - and the sample query can be interpreted in different ways.
See: http://www.xaprb.com/blog/2009/04/08/the-dangerous-subtleties-of-left-join-and-count-in-sql/
There they have this sample query:
select user.userid, count(email.subject)
from user
inner join email on user.userid = email.userid
group by user.userid;
That query turns out to be ambigous, and the article author changes it for a more explicit one, adding this comment:
But what if that’s not what the author of the query meant? There’s no
way to really know. There are several possible intended meanings for
the query, and there are several different ways to write the query to
express those meanings more clearly. But the original query is
ambiguous, for a few reasons. And everyone who reads this query
afterwards will end up guessing what the original author meant. “I
think I can safely change this to…”
UPDATE: For a step by step explanation on how this works, see BigQuery flattens when using field with same name as repeated field instead.

COUNT(*) counts most repeated field in your query, if you want to count full records - use COUNT(0).

Related

Working of subquery in SQL Oracle

I was trying to understand how nested or nested subqueries work in Oracle when dealing with SQL. So lets take an example where I have 2 tables, one where I hold all student information and one where I hold all the grades each student has received. Now I'm trying to find all students that received at least one 'A' grade form the grades table. I can do a simple join and get the output for this. But the problem is if a student has received an 'A' grade twice, his ID shows up twice. Now I know I can use the DISTINCT word to solve my problem. But I wanted to do this using nested queries and so this is what I typed ->
select id from students where id in (select id from grades);
Now this query returns an output with no duplicates. I'm trying to get my head around this and how this nested query works in detail. What does the "where in" part also do? Really confused.

While it's not universally true that a distinct is a bad thing, it is often misused -- and I think your example is a good one where there is a better way.
In this case, I think your best bet is a semi-join. Here is a rough example:
select s.*
from students s
where exists (
select null
from grades g
where
s.student_id = g.student_id and
g.grade = 'A'
)
Oracle does a pretty nice job of executing a subquery into a semi-join in the background, when it makes sense, but other DBMSs definitely benefit from this construct.

Actually the way it works is a join and a distinct - but Oracle is smart, it does it efficiently. It does what you would do: it takes the first student_id from the first table, and it tries to match it against rows in the second table. But, since you don't need the whole join, it will stop as soon as it finds a match - then it moves on to the second row in the first table.
I assume you meant the subquery to be select id from grades where grade = 'A', right?

the specified field 'Match ID' could refer to more than one table listed in the FROM clause of your sql statement

I am trying to give the number of matches that a player did not play but listed in the team. If he hasnt played his Minutes per game will be zero which i have implemented. PLease PLease Help!!!
SELECT tblGameResults.MatchID, Player_ID_Number, Minutes_Per_Game
FROM
tblGameResults
INNER JOIN
tblPlayerStatistics ON tblGameResults.MatchID = tblPlayerStatistics.MatchID
WHERE Minutes_Per_Game = 0
GROUP BY
tblGameResults.MatchID

Although your query looks conceptually like it would work, here are a few pointers (I am making some guesses on the tables where Minutes_Per_Game and Player_ID_Number come from):
Give aliases to all of your table names
SELECT game.MatchID, stats.Player_ID_Number, stats.Minutes_Per_Game
FROM
tblGameResults as game
INNER JOIN
tblPlayerStatistics as stats ON game.MatchID = stats.MatchID
WHERE stats.Minutes_Per_Game = 0
GROUP BY
game.MatchID
This should help you identify where your problems are, or it will help the database know where to pull columns from. Additionally, you are selecting three columns. You should probably group by all of them unless you want to do some aggregate functions on the last two in your select statement.

In an EXISTS can my JOIN ON use a value from the original select

I have an order system. Users with can be attached to different orders as a type of different user. They can download documents associated with an order. Documents are only given to certain types of users on the order. I'm having trouble writing the query to check a user's permission to view a document and select the info about the document.
I have the following tables and (applicable) fields:
Docs: DocNo, FileNo
DocAccess: DocNo, UserTypeWithAccess
FileUsers: FileNo, UserType, UserNo
I have the following query:
SELECT Docs.*
FROM Docs
WHERE DocNo = 1000
AND EXISTS (
SELECT * FROM DocAccess
LEFT JOIN FileUsers
ON FileUsers.UserType = DocAccess.UserTypeWithAccess
AND FileUsers.FileNo = Docs.FileNo /* Errors here */
WHERE DocAccess.UserNo = 2000 )
The trouble is that in the Exists Select, it does not recognize Docs (at Docs.FileNo) as a valid table. If I move the second on argument to the where clause it works, but I would rather limit the initial join rather than filter them out after the fact.
I can get around this a couple ways, but this seems like it would be best. Anything I'm missing here? Or is it simply not allowed?

I think this is a limitation of your database engine. In most databases, docs would be in scope for the entire subquery -- including both the where and in clauses.
However, you do not need to worry about where you put the particular clause. SQL is a descriptive language, not a procedural language. The purpose of SQL is to describe the output. The SQL engine, parser, and compiler should be choosing the most optimal execution path. Not always true. But, move the condition to the where clause and don't worry about it.

I am not clear why do you need to join with FileUsers at all in your subquery?
What is the purpose and idea of the query (in plain English)?
In any case, if you do need to join with FileUsers then I suggest to use the inner join and move second filter to the WHERE condition. I don't think you can use it in JOIN condition in subquery - at least I've never seen it used this way before. I believe you can only correlate through WHERE clause.

You have to use aliases to get this working:
SELECT
doc.*
FROM
Docs doc
WHERE
doc.DocNo = 1000
AND EXISTS (
SELECT
*
FROM
DocAccess acc
LEFT OUTER JOIN
FileUsers usr
ON
usr.UserType = acc.UserTypeWithAccess
AND usr.FileNo = doc.FileNo
WHERE
acc.UserNo = 2000
)
This also makes it more clear which table each field belongs to (think about using the same table twice or more in the same query with different aliases).
If you would only like to limit the output to one row you can use TOP 1:
SELECT TOP 1
doc.*
FROM
Docs doc
INNER JOIN
FileUsers usr
ON
usr.FileNo = doc.FileNo
INNER JOIN
DocAccess acc
ON
acc.UserTypeWithAccess = usr.UserType
WHERE
doc.DocNo = 1000
AND acc.UserNo = 2000
Of course the second query works a bit different than the first one (both JOINS are INNER). Depeding on your data model you might even leave the TOP 1 out of that query.

Select Statement with Distinct returning multiple rows and need only first result

I having a challenge with my query returning multiple results.
SELECT DISTINCT gpph.id, gpph.cname, gc2a.assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
A record, 5679599, has 2 'Primary Images' and is returning 2 results for that id but I only need the first result back. Is there any way to do this IN the current query? Do I need to write multiple queries?
I need some direction on how to constrain the results to only 1 result on Primary Image. I have looked at a ton of similar questions but most typically are just requiring the guidance of adding 'distinct' to the beginning of their query rather than on the where clause.
Edit: This problem is created by a user inputting 2 Primary Images on one record in the database. My business requirements only state to take the first result.
Any help would be awesome!

Given the choice is arbitary which to return, we can just use an aggregate on the value. This then needs a group by clause, which eliminates the need for the distinct.
SELECT gpph.id, gpph.cname, max(gc2a.assetfilename), gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
GROUP BY gpph.id, gpph.cname, gpph.alternateURL
In this instance, using max(gc2a.assetfilename) is going to give you the alphabetically highest value in the event of there being more than one record. It's not the ideal choice, some kind of timestamp knowing the order of the records might be more helpful, since then the meaning of the word 'first' could make more sense.

Replace distinct to group by :
SELECT MAX(gpph.id), gpph.cname, gc2a.assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
AND gpph.id = MAX(gpph.id)
GROUP BY gpph.cname, gc2a.assetfilename, gpph.alternateURL

Why is selecting specified columns, and all, wrong in Oracle SQL?

Say I have a select statement that goes..
select * from animals
That gives a a query result of all the columns in the table.
Now, if the 42nd column of the table animals is is_parent, and I want to return that in my results, just after gender, so I can see it more easily. But I also want all the other columns.
select is_parent, * from animals
This returns ORA-00936: missing expression.
The same statement will work fine in Sybase, and I know that you need to add a table alias to the animals table to get it to work ( select is_parent, a.* from animals ani), but why must Oracle need a table alias to be able to work out the select?

Actually, it's easy to solve the original problem. You just have to qualify the *.
select is_parent, animals.* from animals;
should work just fine. Aliases for the table names also work.

There is no merit in doing this in production code. We should explicitly name the columns we want rather than using the SELECT * construct.
As for ad hoc querying, get yourself an IDE - SQL Developer, TOAD, PL/SQL Developer, etc - which allows us to manipulate queries and result sets without needing extensions to SQL.

Good question, I've often wondered this myself but have then accepted it as one of those things...
Similar problem is this:
sql>select geometrie.SDO_GTYPE from ngg_basiscomponent
ORA-00904: "GEOMETRIE"."SDO_GTYPE": invalid identifier
where geometrie is a column of type mdsys.sdo_geometry.
Add an alias and the thing works.
sql>select a.geometrie.SDO_GTYPE from ngg_basiscomponent a;

Lots of good answers so far on why select * shouldn't be used and they're all perfectly correct. However, don't think any of them answer the original question on why the particular syntax fails.
Sadly, I think the reason is... "because it doesn't".
I don't think it's anything to do with single-table vs. multi-table queries:
This works fine:
select *
from
person p inner join user u on u.person_id = p.person_id
But this fails:
select p.person_id, *
from
person p inner join user u on u.person_id = p.person_id
While this works:
select p.person_id, p.*, u.*
from
person p inner join user u on u.person_id = p.person_id
It might be some historical compatibility thing with 20-year old legacy code.
Another for the "buy why!!!" bucket, along with why can't you group by an alias?

The use case for the alias.* format is as follows
select parent.*, child.col
from parent join child on parent.parent_id = child.parent_id
That is, selecting all the columns from one table in a join, plus (optionally) one or more columns from other tables.
The fact that you can use it to select the same column twice is just a side-effect. There is no real point to selecting the same column twice and I don't think laziness is a real justification.

Select * in the real world is only dangerous when referring to columns by index number after retrieval rather than by name, the bigger problem is inefficiency when not all columns are required in the resultset (network traffic, cpu and memory load).
Of course if you're adding columns from other tables (as is the case in this example it can be dangerous as these tables may over time have columns with matching names, select *, x in that case would fail if a column x is added to the table that previously didn't have it.

why must Oracle need a table alias to be able to work out the select
Teradata is requiring the same. As both are quite old (maybe better call it mature :-) DBMSes this might be historical reasons.
My usual explanation is: an unqualified * means everything/all columns and the parser/optimizer is simply confused because you request more than everything.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How exactly is the value of count(*) determined in BigQuery? - google-bigquery

COUNT(*) counts most repeated field in your query, if you want to count full records - use COUNT(0).

Related

Working of subquery in SQL Oracle

the specified field 'Match ID' could refer to more than one table listed in the FROM clause of your sql statement

In an EXISTS can my JOIN ON use a value from the original select

Select Statement with Distinct returning multiple rows and need only first result

Why is selecting specified columns, and all, wrong in Oracle SQL?

Categories

Resources