Cannot group by timestamp_trunc - sql

I have this query in standard SQL:
select timestamp_trunc(endTime, MONTH), count(1)
from `simple_table`
group by timestamp_trunc(endTime, MONTH);
Which returns the following error:
SELECT list expression references column endTime which is neither grouped nor aggregated at [1:24]
However, the following code:
select timestamp_trunc(endTime, MONTH)
from `simple_table`
limit 10
works perfectly. Is there some hidden detail about BigQuery's GROUP BY handling that I am missing?

Just do as below:
select timestamp_trunc(endTime, MONTH), count(1)
from `simple_table`
group by 1
or
select timestamp_trunc(endTime, MONTH) as m, count(1)
from `simple_table`
group by m
I think the problem is not with using functions/expressions in GROUP BY, but rather that the engine does not recognize that the expression for the field in the SELECT list and the expression in the GROUP BY are the same. They are treated as different, so the engine thinks the endTime field is "orphaned" (neither aggregated nor grouped by).
For example, the query below will work (of course it is not what you need, but it proves that GROUP BY accepts expressions):
select count(1)
from `simple_table`
group by timestamp_trunc(endTime, MONTH)
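Another way to see it: if you precompute the expression in a subquery, GROUP BY only ever sees a plain column, so the ambiguity goes away. A sketch along the lines of the aliased version above (untested):
select month, count(1)
from (
select timestamp_trunc(endTime, MONTH) as month
from `simple_table`
)
group by month;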

Related

How to aggregate rows on BigQuery

I need to group the different years in my dataset so that I can see the total number of login_log_id values each year has (BigQuery).
SELECT login_log_id,
DATE(login_time) as login_date,
EXTRACT(YEAR FROM login_time) as login_year,
TIME(login_time) as login_time,
FROM `steel-time-347714.flex.logs`
GROUP BY login_log_id
I want to do a GROUP BY so that I can see the total number of login_log_id values generated in different years.
My columns are login_log_id, login_time
I am getting the following error:
SELECT list expression references column login_time which is neither grouped nor aggregated at [2:6]
The error is because every column you refer to in the SELECT needs to be aggregated or be in the GROUP BY.
If you want the total logins by year, you can do:
SELECT
EXTRACT(YEAR FROM login_time) as login_year,
COUNT(1) as total_logins,
COUNT(DISTINCT login_log_id) as total_unique_logins
FROM `steel-time-347714.flex.logs`
GROUP BY login_year
But if you want the total by login_log_id and year:
SELECT
login_log_id,
EXTRACT(YEAR FROM login_time) as login_year,
COUNT(1) as total_logins
FROM `steel-time-347714.flex.logs`
GROUP BY login_log_id, login_year

How to reference fields from table created in sub-query's of large JOIN

I am writing a large query with many JOINs (shortened in the example here) and I am trying to reference values from other sub-queries but can't figure out how.
This is my example query:
DROP TABLE IF EXISTS breakdown;
CREATE TEMP TABLE breakdown AS
SELECT * FROM
(
SELECT COUNT(DISTINCT s_id) AS before, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) < date_trunc('sec',time) GROUP BY day
)
JOIN
(
SELECT ROUND(before * 100.0 / total, 1) AS Percent_1, day
FROM breakdown
GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) = date_trunc('sec',time) GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) > date_trunc('sec',time) GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS total, date_trunc('day', earliest) AS day
FROM first
GROUP BY 2
) USING (day)
ORDER BY day;
SELECT * FROM breakdown ORDER BY day;
The last subquery gives me the total, and for each of the previous subqueries I want to get the percentage as well.
I found the code for getting the percentage (second JOIN) but I don't know how to reference the values from the other tables.
E.g. for the percentage from the first query I want to take the COUNT from the first subquery, which I renamed before, and divide it by the COUNT from the last subquery, which I renamed total (if there is an easier way to get the percentage for each of the sub-queries, please let me know). But I can't find how to reference them. I tried adding AS x to the end of each subquery and referencing by that (x.total), as well as referencing via the parent table (breakdown.total), but neither worked.
How can I do this without changing the query too much, as it is long and has a lot of sub-queries?
This is what my table looks like; I would like to add a percentage for each column.
Using Redshift BTW.
Thanks
I'm a little confused by all that is going on, as you drop the table breakdown and then, in the second subquery of the CREATE TABLE, you reference breakdown. I suspect that there are some issues in the provided sample of SQL. Please update if there are issues.
For a number of these subqueries it looks like you are using a subquery where a conditional expression will do. In Redshift you don't want to scan the same table over and over if you can prevent it. For example, if we look at the 3rd and 4th subqueries, you can replace them with one query. In these simple cases I also like to use the DECODE() statement rather than CASE, since it is more readable.
(
SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day
FROM table_a
WHERE date_trunc('sec',earliest) = date_trunc('sec',time)
GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day
FROM table_a
WHERE date_trunc('sec',earliest) > date_trunc('sec',time)
GROUP BY day
)
Becomes:
(
SELECT COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id, NULL)) AS equal,
COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id, NULL)) AS after,
date_trunc('day', time) AS day
FROM table_a
GROUP BY day
)
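If you ever need the same logic on an engine without DECODE(), a roughly equivalent (untested) CASE version, using the same columns as above, would be:
(
SELECT COUNT(DISTINCT CASE WHEN date_trunc('sec',earliest) = date_trunc('sec',time) THEN s_id END) AS equal,
COUNT(DISTINCT CASE WHEN date_trunc('sec',earliest) > date_trunc('sec',time) THEN s_id END) AS after,
date_trunc('day', time) AS day
FROM table_a
GROUP BY day
)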
Read each table once (if at all possible) and calculate the desired results. Then you will have all your values in one layer of query and can reference these new values. This will be faster (especially on Redshift).
=============================
Expanding based on a comment made by the poster.
It appears that using DECODE() and referencing derived columns in a single query can produce what you want. I don't have your data so I cannot test this but here is what I'd want to move to:
SELECT
date_trunc('day', time) AS day,  -- keep the day column so the temp table can still be ordered by it
COUNT(DISTINCT DECODE(date_trunc('sec',earliest) < date_trunc('sec',time), true, s_id)) AS before,
COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id)) AS equal,
COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id)) AS after,
COUNT(DISTINCT s_id) AS total,
ROUND(before * 100.0 / total, 1) AS Percent_1  -- lateral reference to the aliases defined above
FROM table_a
GROUP BY date_trunc('day', time);
This should be a complete replacement for the SELECT currently inside your CREATE TEMP TABLE. However, I don't have sample data so this is untested.

Is there a way to count how many strings in a specific column are seen for the 1st time?

The values in column 2 sometimes repeat because some clients make several transactions at different times (a client can make a transaction in the 1st month and then again the next year).
Is there a way for me to count, per month via a GROUP BY, how many IDs are completely new (never seen before)?
Please let me know if you need more context.
Thanks!
A simple way is two levels of aggregation. The inner level gets the first date for each customer. The outer summarizes by year and month:
select year(min_date), month(min_date), count(*) as num_firsts
from (select customerid, min(date) as min_date
from t
group by customerid
) c
group by year(min_date), month(min_date)
order by year(min_date), month(min_date);
Note that date/time functions depend on the database you are using, so the syntax for getting the year/month from a date may differ in your database.
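For example, in PostgreSQL or BigQuery the same idea could be written with EXTRACT (an untested sketch, reusing the table and column names from the query above):
select extract(year from min_date) as yr, extract(month from min_date) as mon, count(*) as num_firsts
from (select customerid, min(date) as min_date
from t
group by customerid
) c
group by yr, mon
order by yr, mon;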
You can do the following, which assigns a rank to each transaction within a particular customer_id (rank 1 therefore means it is the first order for that customer_id).
This is placed in an inline view, and the inline view is then queried to give you the month and the count of customer_id for that month, ONLY where the rank = 1.
I have tested this on Oracle and it works as expected.
SELECT DISTINCT
EXTRACT(MONTH FROM date_of_transaction) AS month,
COUNT(customer_id)
FROM
(
SELECT
date_of_transaction,
customer_id,
RANK() OVER(PARTITION BY customer_id
ORDER BY
date_of_transaction ASC
) AS rank
FROM
table_1
)
WHERE
rank = 1
GROUP BY
EXTRACT(MONTH FROM date_of_transaction)
ORDER BY
EXTRACT(MONTH FROM date_of_transaction) ASC;
First you should associate every ID with the year and month in which it is completely new, then count while grouping by year and month:
SELECT count(*) as new_customers, extract(year from t1.date) as year,
extract(month from t1.date) as month FROM table t1
WHERE not exists (SELECT 1 FROM table t2 WHERE t1.id = t2.id AND t2.date < t1.date)
GROUP BY year, month;
Your results will contain the new customer count, year, and month.

BigQuery Error code: Window ORDER BY expression references column start_date which is neither grouped nor aggregated at

I am using BigQuery and I can't figure out why I get an error message like this:
Window ORDER BY expression references column start_date which is neither grouped nor aggregated at [4:73]
Here is my code:
SELECT EXTRACT(WEEK FROM start_date) as week, count(start_date) as count,
RANK() OVER (PARTITION BY start_station_name ORDER BY EXTRACT(WEEK FROM start_date))
from `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY EXTRACT(WEEK FROM start_date), start_station_name
I thought I had grouped the week, as seen in the last line. So what can cause this error message to keep popping up?
This is a parsing issue in BigQuery, which you can work around with an aggregation function. Your query has another issue as well: start_station_name is in the GROUP BY but not in the SELECT list.
SELECT EXTRACT(WEEK FROM start_date) as week, start_station_name, count(start_date) as count,
RANK() OVER (PARTITION BY start_station_name ORDER BY MIN(EXTRACT(WEEK FROM start_date)))
from `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY 1, 2;
The MIN() really serves no purpose other than letting BigQuery parse the query. Because the expression is part of the GROUP BY, there is only one value for the MIN() to consider.
This is a bug in the BigQuery parsing, because it does not recognize that the expression is the same as the expression in the GROUP BY. Happily, it is easy to work around.
Try like below, using a CTE:
with cte as
(
SELECT *, EXTRACT(WEEK FROM start_date) as week
from `bigquery-public-data.london_bicycles.cycle_hire`
) select week,count(start_date) as count,
RANK() OVER (PARTITION BY start_station_name ORDER BY week)
from cte group by week,start_station_name
In the query you need to make sure that you put ORDER BY only on values which you are selecting.
The problem with your query is that you are doing ORDER BY EXTRACT(WEEK FROM start_date). Rather than doing this, you should write ORDER BY week, because you are already selecting week.

error: column "month" does not exist in PG query

My PG query:
SELECT "Tracks"."PageId",
date_trunc("month", "Tracks"."createdAt") AS month,
count(*)
FROM "Tracks"
WHERE "Tracks"."PageId" IN (29,30,31)
GROUP BY month, "Tracks"."PageId"
and my schema:
id, createdAt, updatedAt, PageId
A bit confused as to why I'm receiving this error!
You can't use an alias in the where or group by clause here. You need to repeat the expression:
SELECT "Tracks"."PageId",
date_trunc('month', "Tracks"."createdAt") AS month,
count(*)
FROM "Tracks"
WHERE "Tracks"."PageId" IN (29,30,31)
GROUP BY date_trunc('month', "Tracks"."createdAt"), "Tracks"."PageId";
Note that the first parameter for date_trunc() is a varchar value, so you need to put that in single quotes, not double quotes.
If you don't want to repeat the expression you can put that into a derived table:
select "PageId", month, count(*)
from (
SELECT "Tracks"."PageId",
date_trunc('month', "Tracks"."createdAt") AS month
FROM "Tracks"
WHERE "Tracks"."PageId" IN (29,30,31)
) t
group by month, "PageId";
Unrelated, but: you should really avoid quoted identifiers. They are much more trouble than they are worth.
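For comparison, a purely hypothetical sketch of what the query could look like if the table had been created with unquoted (lower-case) identifiers; tracks, created_at and page_id are made-up names, not your actual schema:
-- hypothetical schema created without quoted identifiers:
-- CREATE TABLE tracks (id serial, created_at timestamptz, updated_at timestamptz, page_id int);
SELECT page_id,
date_trunc('month', created_at) AS month,
count(*)
FROM tracks
WHERE page_id IN (29, 30, 31)
GROUP BY date_trunc('month', created_at), page_id;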