Duplicate column names in the result are not supported in BigQuery - google-bigquery

I'm trying to select some columns in BQ and getting a complaint about duplicate IDs:
Duplicate column names in the result are not supported. Found duplicate(s): id
The query I'm using is:
SELECT
billing_account_id,service.id,service.description,sku.id
FROM `billing-management-edab.billing_dataset.gcp_billing_export_v1_blah_blah_blah`
Why are service.id and sku.id considered duplicates? And how can I get around that in my query?

Give aliases to the two id columns:
SELECT
billing_account_id,
service.id AS service_id,
service.description,
sku.id AS sku_id
FROM `billing-management-edab.billing_dataset.gcp_billing_export_v1_blah_blah_blah`
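In the standard billing export schema, service and sku are STRUCT (RECORD) columns rather than tables. When you select a struct field such as service.id without an alias, BigQuery names the output column after the field alone, so both columns come out as id, and a result cannot contain two columns with the same name. Giving each of the two id columns in your SELECT clause a distinct alias resolves the error.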
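To see the naming collision outside BigQuery, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here: the tables service and sku and their sample rows are hypothetical and simply play the role of whatever service and sku refer to in the real query. The point is only that an unaliased service.id and sku.id both produce an output column called id, while aliases give them distinct names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in tables; in BigQuery these would be STRUCT columns.
conn.executescript("""
    CREATE TABLE service (id TEXT, description TEXT);
    CREATE TABLE sku (id TEXT, service_id TEXT);
    INSERT INTO service VALUES ('svc-1', 'Compute');
    INSERT INTO sku VALUES ('sku-9', 'svc-1');
""")

def output_columns(sql):
    """Return the result column names that a query produces."""
    cur = conn.execute(sql)
    return [col[0] for col in cur.description]

# Without aliases, both columns are named after the last path component.
unaliased = output_columns(
    "SELECT service.id, sku.id "
    "FROM service JOIN sku ON sku.service_id = service.id"
)

# With aliases, every output column has a unique name.
aliased = output_columns(
    "SELECT service.id AS service_id, sku.id AS sku_id "
    "FROM service JOIN sku ON sku.service_id = service.id"
)

print(unaliased)
print(aliased)
```

SQLite tolerates the duplicate names, whereas BigQuery rejects the query, which is why the aliases are required there.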

Related

How to adapt a GROUP BY across multiple tables?

I'm trying to optimize a SQL query that uses a GROUP BY across multiple tables. Essentially, I have multiple tables which all contain a PID column, and the output I need is a record of every PID in all of the tables as well as a count of how many records across all of those tables contain that PID. When trying to use GROUP BY PID, I get a "column ambiguously defined" error if using multiple tables. Here is an example of the code I am using to retrieve the proper data from one table (can ignore the where clause):
select pid, count(*)
from table1
where vendor_id in(1,2)
and delay_code <=23
and age between 18 and 49 and sex = 'M'
group by pid
Essentially, I want to do this across a group of tables (i.e. table1, table2, table3 etc), but can't figure out how to do so without getting a "column ambiguously defined" error.
You need to say which table's pid column you are referencing. You can do that either by qualifying the column with the table name or by using an alias. Aliases are required when you have multiple references to the same table.
Specify table:
SELECT table1.pid, COUNT(*)
FROM table1
GROUP BY table1.pid
Use alias:
SELECT t1.pid, COUNT(*)
FROM table1 AS t1
GROUP BY t1.pid
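For the original goal (one count per PID across several tables), one possible approach is to stack the tables with UNION ALL and group the combined rows, which avoids the ambiguity altogether because only one pid column is visible to the outer query. The sketch below uses Python's sqlite3 module with made-up sample rows; the derived-table name all_pids is hypothetical, and any per-table WHERE filters would go inside each branch of the union.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up sample data: three tables that each contain a pid column.
conn.executescript("""
    CREATE TABLE table1 (pid INTEGER);
    CREATE TABLE table2 (pid INTEGER);
    CREATE TABLE table3 (pid INTEGER);
    INSERT INTO table1 VALUES (1), (2), (2);
    INSERT INTO table2 VALUES (2), (3);
    INSERT INTO table3 VALUES (1), (3), (3);
""")

# UNION ALL keeps duplicates, so COUNT(*) counts every record in every table.
rows = conn.execute("""
    SELECT pid, COUNT(*) AS n
    FROM (
        SELECT pid FROM table1
        UNION ALL
        SELECT pid FROM table2
        UNION ALL
        SELECT pid FROM table3
    ) AS all_pids
    GROUP BY pid
    ORDER BY pid
""").fetchall()

print(rows)
```

UNION ALL is used rather than UNION because UNION would remove duplicate pid values before they could be counted.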

Get min/max value by key - different approaches

I have a table with two columns namely ID and KEY (let key here be an integer) such as
ID KEY
ABC 6
DEF 1
GHI 12
TASK: Get the ID of the MAX key
Solution 1:
Select Top(1) ID
from TABLE
order by KEY desc
Solution 2:
Select ID
from TABLE
where ID = MAX(ID)
EDIT: The query was invalid. This is what I meant:
Select ID
from TABLE
where KEY = (select max(KEY) from TABLE)
Is one of these solutions categorically better than the other? What are the advantages/disadvantages of each solution?
EDIT:
Assume there is no index.
Case 1 - large table
Case 2 - small table
Background:
I am doing code review and I have found both solutions multiple times in different context - sometimes with indices, sometimes without, sometimes for large tables, sometimes for small.
The two queries are different (after your edits fixing the second one).
The first necessarily returns a single row.
The second returns all matching rows.
The first returns a row even when every KEY is NULL.
The second returns no rows in that case, because a comparison with NULL is never true.
You should use the logic that does what you want.
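The differences listed above can be checked with a small sketch. It uses Python's sqlite3 module, so LIMIT 1 stands in for SQL Server's TOP(1); the table name items and column name key_val are hypothetical replacements for TABLE and KEY, chosen only to avoid reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT, key_val INTEGER)")

# Solution 1 (LIMIT 1 plays the role of TOP(1)).
TOP_ONE = "SELECT id FROM items ORDER BY key_val DESC LIMIT 1"
# Solution 2 (corrected version with the subquery).
MAX_MATCH = (
    "SELECT id FROM items "
    "WHERE key_val = (SELECT MAX(key_val) FROM items) "
    "ORDER BY id"
)

def load(rows):
    """Replace the table contents with the given (id, key_val) rows."""
    conn.execute("DELETE FROM items")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)

# Case A: two rows tie for the maximum key.
load([("ABC", 12), ("DEF", 1), ("GHI", 12)])
tie_top = conn.execute(TOP_ONE).fetchall()    # exactly one row, arbitrary winner
tie_max = conn.execute(MAX_MATCH).fetchall()  # every row with the maximum key

# Case B: every key is NULL.
load([("ABC", None), ("DEF", None)])
null_top = conn.execute(TOP_ONE).fetchall()    # still one row
null_max = conn.execute(MAX_MATCH).fetchall()  # no rows: key_val = NULL is not true

print(len(tie_top), tie_max, len(null_top), null_max)
```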
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
Solution 1 will be the best. A subquery in a where clause will be less optimal.
There really are lots of design techniques to look at for performance which I am not going to go into with this answer. I found this article yesterday which gave me more perspective https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/
In Solution 1, the order by clause will just sort your query result.
Query execution order:
FROM, ON, OUTER, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, TOP
You can use the following query:
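-- A window function alias cannot be filtered in HAVING, so rank in a derived table and filter outside it
Select ID
from (
Select ID,
RANK() OVER (ORDER BY KEY DESC) AS KeyRank
from table1
) AS ranked
where KeyRank = 1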
Solution 1 will work, but the original Solution 2 throws an error like the one below:
Msg 147, Level 15, State 1, Line 22 An aggregate may not appear in the
WHERE clause unless it is in a subquery contained in a HAVING clause
or a select list, and the column being aggregated is an outer
reference.
You can go with query 1.
You cannot use the original query 2, because an aggregate function cannot be used directly in a WHERE clause like that. If you want to combine a WHERE clause with an aggregate function, put the aggregate in a subquery, as below:
Select id from test where key in (select max(key) from test);
For reference, a version using only an aggregate function and a HAVING clause (it hard-codes 12, the maximum key in the sample data):
Select ID ,max(key)
from test
group by ID,key
having (key) >= 12
order by 1

Duplicate values SQL (MS Access)

I need to find duplicate records across 2 or more fields. But using this does not work in Access:
SELECT assay.depth_from, assay.au_gt
FROM assay
GROUP BY depth_from, au_gt
HAVING count(*) >1;
Am I missing something? It does match up with various answers here so not sure what.
I just get records with duplicate depth_from values, but the au_gt values are not duplicated. In fact, not even all of the depth_from values are duplicated.
I see two things to tidy up in your SQL. First, you don't need the assay. prefix before your field names, since you select from a single table, and using it in SELECT but not in GROUP BY makes the references inconsistent. If you do use assay. in your SELECT clause, use it in GROUP BY as well. Second, include count(*) in the SELECT list so that you can see how many times each depth_from/au_gt combination occurs. Try this:
SELECT depth_from, au_gt, count(*)
FROM assay
GROUP BY depth_from, au_gt
HAVING count(*) >1;
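As a sanity check on the grouping logic itself, here is a minimal sketch run against SQLite through Python's sqlite3 module rather than Access, with made-up assay rows. It shows that grouping on both fields reports only the combinations where depth_from and au_gt repeat together; a row that shares only depth_from with another row is not reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up sample data for illustration only.
conn.executescript("""
    CREATE TABLE assay (depth_from REAL, au_gt REAL);
    INSERT INTO assay VALUES
        (10.0, 0.5), (10.0, 0.5),               -- true duplicate pair
        (10.0, 0.7),                            -- same depth, different grade
        (20.0, 0.5),                            -- unique
        (30.0, 1.2), (30.0, 1.2), (30.0, 1.2);  -- triplicate
""")

duplicates = conn.execute("""
    SELECT depth_from, au_gt, COUNT(*) AS n
    FROM assay
    GROUP BY depth_from, au_gt
    HAVING COUNT(*) > 1
    ORDER BY depth_from
""").fetchall()

print(duplicates)
```

If a query like this returns unexpected rows in Access, comparing the n column against the raw data is a quick way to see which combinations the engine considers equal.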

Why does the number of rows increase in a SELECT statement with INNER JOIN when a second column is selected?

I am writing some queries with self-joins in SQL Server. When I have only one column in the SELECT clause, the query returns a certain number of rows. When I add another column, from the second instance of the table, to the SELECT clause, the results increase by 1000 rows!
How is this possible?
Thanks.
EDIT:
I have a subquery in the FROM clause, which is also a self-join on the same table.
How is this possible?
The only thing I can think of is that you have SELECT DISTINCT, and the additional column makes some rows distinct that were not distinct before.
For example, I would expect the second query below to return many more rows than the first:
SELECT DISTINCT First_name From Table
vs
SELECT DISTINCT First_name, Last_name From Table
But if we had the actual SQL, then something else might come to mind.
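The DISTINCT effect described in this answer is easy to reproduce. The sketch below uses Python's sqlite3 module; the table name people and its rows are hypothetical and stand in for Table in the two queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: three people share a first name.
conn.executescript("""
    CREATE TABLE people (first_name TEXT, last_name TEXT);
    INSERT INTO people VALUES
        ('Ann', 'Lee'),
        ('Ann', 'Kim'),
        ('Ann', 'Lee'),
        ('Bob', 'Ray');
""")

# DISTINCT over one column collapses all the Anns into a single row.
one_col = conn.execute(
    "SELECT DISTINCT first_name FROM people"
).fetchall()

# DISTINCT over two columns only collapses rows that match on both.
two_cols = conn.execute(
    "SELECT DISTINCT first_name, last_name FROM people"
).fetchall()

print(len(one_col), len(two_cols))
```

Adding last_name to the select list raises the row count from 2 to 3 here, without any change to the FROM or WHERE clauses.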

DISTINCT pulling duplicate column values

The following query is pulling duplicate site_ids even though I'm using DISTINCT, and I can't figure out why...
SELECT
DISTINCT site_id,
deal_woot.*,
site.woot_off,
site.name AS site_name
FROM deal_woot
INNER JOIN site ON site.id = site_id
WHERE site_id IN (2, 3, 4, 5, 6)
ORDER BY deal_woot.id DESC LIMIT 5
DISTINCT looks at the entire record, not just the column directly after it. To accomplish what you want, you'll need to use GROUP BY:
Non-working code:
SELECT
site_id,
deal_woot.*,
site.woot_off,
site.name AS site_name
FROM deal_woot
INNER JOIN site ON site.id = site_id
WHERE site_id IN (2, 3, 4, 5, 6)
GROUP BY site_id
Why doesn't it work? If you GROUP BY a column, you should use an aggregate function (such as MIN or MAX) on the rest of the columns. Otherwise, if there are multiple site.woot_off values for a given site_id, it's not clear to SQL which of those values you want to SELECT.
You will probably have to expand deal_woot.* to list each of its fields.
Side note: If you're using MySQL with the ONLY_FULL_GROUP_BY SQL mode disabled (it is enabled by default since MySQL 5.7.5), it's not technically necessary to specify an aggregate function for the remaining columns. If you don't specify one for a column, MySQL picks a value from one of the rows in the group, and which row it picks is not guaranteed.
Your query is returning DISTINCT rows, it is not just looking at site_id. In other words, if any of the columns are different, a new row is returned from this query.
This makes sense, because if you actually do have differences, what should the server return as values for deal_woot.*? If you want to do this, you need to specify it, perhaps by getting the distinct site_ids and then getting LIMIT 1 of the other values in a subquery with an appropriate ORDER BY clause.
You are selecting distinct values from one table only. When you join with the other table, it will pull all rows that match each of your distinct values from the other table, causing duplicate ids.
If you want to select site info and a single row from deal_woot table with the same site_id, you need to use a different query. For example,
SELECT site.id, deal_woot.*, site.woot_off, site.name
FROM site
INNER JOIN
(SELECT site_id, MAX(id) as id FROM deal_woot
WHERE site_id IN (2,3,4,5,6) GROUP BY site_id) X
ON (X.site_id = site.id)
INNER JOIN deal_woot ON (deal_woot.id = X.id)
WHERE site.id IN (2,3,4,5,6);
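This query should work regardless of SQL dialect or database vendor. For MySQL, you can just add GROUP BY site_id to your original query, since it lets you use GROUP BY without aggregate functions (provided the ONLY_FULL_GROUP_BY mode is off); which deal_woot row you get for each site is then not guaranteed.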
** I assume that deal_woot.id and site.id are primary keys for deal_woot and site tables respectively.
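To illustrate the "latest row per site" query above, here is a sketch that runs the same join pattern in SQLite through Python's sqlite3 module. The column title and all sample rows are made up for illustration, and the select list names individual deal_woot columns instead of deal_woot.* so the output is easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up sample data; the title column is hypothetical.
conn.executescript("""
    CREATE TABLE site (id INTEGER PRIMARY KEY, name TEXT, woot_off INTEGER);
    CREATE TABLE deal_woot (id INTEGER PRIMARY KEY, site_id INTEGER, title TEXT);
    INSERT INTO site VALUES (2, 'woot', 0), (3, 'shirt', 1);
    INSERT INTO deal_woot VALUES (1, 2, 'old'), (2, 2, 'new'), (3, 3, 'only');
""")

# The derived table X finds the highest deal id per site; joining back to
# deal_woot on that id fetches exactly one full row for each site.
latest = conn.execute("""
    SELECT site.id, deal_woot.id, deal_woot.title, site.name
    FROM site
    INNER JOIN
        (SELECT site_id, MAX(id) AS id FROM deal_woot
         WHERE site_id IN (2, 3, 4, 5, 6) GROUP BY site_id) X
        ON (X.site_id = site.id)
    INNER JOIN deal_woot ON (deal_woot.id = X.id)
    WHERE site.id IN (2, 3, 4, 5, 6)
    ORDER BY site.id
""").fetchall()

print(latest)
```

Site 2 has two deals, but only the one with the highest id appears, so each site_id shows up once.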