Pivot data in BigQuery SQL? [duplicate] - sql

This question already has answers here:
How to Pivot table in BigQuery
(7 answers)
Closed 2 years ago.
I am working with BigQuery. I have two tables:
organisations:
org_code STRING
name STRING
spending:
org STRING
month DATE
quantity INTEGER
code STRING
And then quite a complicated query to get results by each organisation, by month:
SELECT
organisations.org_code AS org,
num.month AS month,
(num.quantity / denom.quantity) AS ratio_quantity
FROM (
SELECT
org_code, name
FROM
[mytable.organisations]) AS organisations
LEFT OUTER JOIN EACH (
SELECT
org,
month,
SUM(quantity) AS quantity
FROM
[mytable.spending]
GROUP BY
org,
month) AS denom
ON
denom.org = organisations.org_code
LEFT OUTER JOIN EACH (
SELECT
org,
month,
SUM(quantity) AS quantity
FROM
[mytable.spending]
WHERE
code LIKE 'XXXX%'
GROUP BY
org,
month) AS num
ON
denom.month = num.month
AND denom.org = num.org
ORDER BY org, month
My final results look like this, with a row per org/month combination:
org,month,ratio_quantity
A81001,2015-10-01 00:00:00 UTC,28
A82001,2015-11-01 00:00:00 UTC,43
A82002,2015-10-01 00:00:00 UTC,16
Now I would like to pivot the results to look like this, with one row per month, and one column per organisation:
month,items.A81001,items.A82002...
2015-10-01 00:00:00 UTC,28,16
2015-11-01 00:00:00 UTC,43,...
Is this possible in the same BigQuery call? Or should I create a new table and pivot it from there? Or should I just do the reshaping in Python?
UPDATE: There are about 500,000 results, fyi.

Q. Is this possible in the same BigQuery call? Or should I create a new
table and pivot it from there?
In general, you can use that “complicated query” as a subquery and apply extra logic to its result.
So it is definitely doable. But the code can quickly become hard to manage, so consider writing this result into a new table and pivoting from there.
If you stick with doing the pivot in BigQuery (as you described in your question), see the link below for a detailed introduction to implementing a pivot in BigQuery.
How to scale Pivoting in BigQuery?
Please note: BigQuery has a limit of 10K columns per table, so you are limited to 10K organizations.
You can also see the simplified examples below (if the one above is too complex/verbose):
How to transpose rows to columns with large amount of the data in BigQuery/SQL?
How to create dummy variable columns for thousands of categories in Google BigQuery?
Pivot Repeated fields in BigQuery
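One common way to pivot in SQL is conditional aggregation: one aggregate expression per output column, generated from the distinct pivot values. Below is a minimal sketch of that idea, not the method from the linked answers. It runs against an in-memory SQLite database as a stand-in for BigQuery, and the table name results and the items_ column prefix are hypothetical; the sample rows are the three from the question.

```python
import sqlite3

# Sample rows shaped like the question's result set (org, month, ratio_quantity).
rows = [
    ("A81001", "2015-10-01", 28),
    ("A82001", "2015-11-01", 43),
    ("A82002", "2015-10-01", 16),
]

conn = sqlite3.connect(":memory:")
# "results" is a hypothetical table holding the output of the original query.
conn.execute("CREATE TABLE results (org TEXT, month TEXT, ratio_quantity INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

# Step 1: find the distinct organisations; each one becomes an output column.
orgs = [r[0] for r in conn.execute("SELECT DISTINCT org FROM results ORDER BY org")]

# Org codes are inlined into the SQL text, so reject anything unexpected.
if not all(o.isalnum() for o in orgs):
    raise ValueError("unexpected characters in org code")

# Step 2: generate one conditional aggregate per organisation.
columns = ",\n  ".join(
    f"MAX(CASE WHEN org = '{o}' THEN ratio_quantity END) AS items_{o}" for o in orgs
)
pivot_sql = f"SELECT month,\n  {columns}\nFROM results\nGROUP BY month\nORDER BY month"

# Step 3: run the generated query; one row per month, one column per org.
pivoted = conn.execute(pivot_sql).fetchall()
for row in pivoted:
    print(row)
```

Organisations with no value for a month come back as NULL (None in Python). The generated statement grows with the number of organisations, which is where the column limit mentioned above comes into play.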
Q. Or should I just do the reshaping in Python?
If the above does not work for you, pivoting on the client is always an option, but then you need to consider client-side limitations.
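For the client-side route, here is a minimal sketch in plain Python, assuming the query results have already been fetched as (org, month, ratio_quantity) tuples. The pivot function is a hypothetical helper, not part of any BigQuery client library.

```python
from collections import defaultdict


def pivot(rows):
    """Turn (org, month, value) rows into a header plus one row per month."""
    by_month = defaultdict(dict)
    orgs = set()
    for org, month, value in rows:
        by_month[month][org] = value
        orgs.add(org)
    header = ["month"] + sorted(orgs)
    table = [
        # Missing org/month combinations become None.
        [month] + [by_month[month].get(org) for org in header[1:]]
        for month in sorted(by_month)
    ]
    return header, table


# The three sample rows from the question.
rows = [
    ("A81001", "2015-10-01", 28),
    ("A82001", "2015-11-01", 43),
    ("A82002", "2015-10-01", 16),
]
header, table = pivot(rows)
print(header)
for line in table:
    print(line)
```

Everything is held in memory, so the client-side limit here is simply how many rows and columns the machine can hold.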
Hope this helped!

Related

What is the underlying purpose of a group by statement in SQL? [duplicate]

This question already has answers here:
Is SQL GROUP BY a design flaw? [closed]
(9 answers)
Is it really necessary to have GROUP BY in the SQL standard
(3 answers)
Closed 12 months ago.
Lately I have been dealing with extremely wide queries that perform a lot of transforms on data, and I am annoyed by having to maintain wide group by statements. This has me wondering,
why do they exist?
For example
select
company,
sum(owing) as owing
from
receivables
group by
company
Given this statement, it seems to me that the group by is implied.
There is an aggregate function
The only field not part of an aggregation is company.
Therefore, I would expect that a query engine could determine that company should be the thing grouped on.
select
company,
sum(owing) as owing
from
receivables
My general assumption is always that something like this exists for a reason, I just don't understand the reason, but ... I don't understand the reason.
What is the scenario that makes the existence of group by necessary?
Update
Based on comments, a point regarding multi-table queries making it less obvious to the engine. Also, a point regarding multiple non-aggregate fields.
select
c.name as company,
t.curr as currency,
sum(t.amt) as owing
from
company c
inner join transactions t on c.id = t.comp_id
having
sum(t.amt) < 0
This (more realistic) version of the original query uses two tables. It is still unclear to me why the engine would not know to group on company and currency, as they are the non-aggregated fields.
An example from Oracle, which supports nested aggregate functions:
Assume that you have a table of cube (die) roll results.
The following query shows the distribution of throws.
select result
,count(*) as count
from cube_roll
group by result
RESULT  COUNT
1       11
2       23
3       12
4       23
5       15
6       16
The following query shows us the maximum count for the results.
Please note that result does not appear in the SELECT clause.
select max(count(*)) as max_count
from cube_roll
group by result
MAX_COUNT
23
Please note that result cannot be added to the SELECT clause.
select result -- invalid reference
,max(count(*)) as max_count
from cube_roll
group by result
ORA-00937: not a single-group group function
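The same point can be sketched outside Oracle: the grouping column does not have to appear in the SELECT list, so the engine cannot derive the grouping from the non-aggregated fields alone. The snippet below is an illustration using an in-memory SQLite database, loaded with the distribution shown above. SQLite does not support nested aggregates, so max(count(*)) is written as a subquery here; that is a substitution, not the Oracle syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cube_roll (result INTEGER)")

# Same distribution as the table above: result -> number of throws.
distribution = {1: 11, 2: 23, 3: 12, 4: 23, 5: 15, 6: 16}
for result, count in distribution.items():
    conn.executemany("INSERT INTO cube_roll VALUES (?)", [(result,)] * count)

# Without GROUP BY, the aggregate collapses the whole table into one row.
total = conn.execute("SELECT COUNT(*) FROM cube_roll").fetchall()

# With GROUP BY result, the identical SELECT list yields one row per result,
# even though result itself is never selected.
per_result = conn.execute(
    "SELECT COUNT(*) FROM cube_roll GROUP BY result ORDER BY result"
).fetchall()

# SQLite stand-in for Oracle's max(count(*)): aggregate over a grouped subquery.
max_count = conn.execute(
    "SELECT MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM cube_roll GROUP BY result)"
).fetchone()[0]

print(total, per_result, max_count)
```

The first two queries have the same SELECT list and differ only in the GROUP BY clause, yet they return different results, so the clause carries information the engine could not infer.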

Cohort retention with SQL BigQuery

I am trying to create a retention table like the following using SQL in BigQuery, but with MONTHLY cohorts.
I have the following columns in my dataset. I am only using one table, and its name is 'curious-furnace-341507.TEST.Test_Dataset_-_Orders':
order_date   order_id   customer_id
2020-01-02   12345      6789
I do not need the new user column, and the data runs through June 2020. Ideally I would have a cohort month column listing the January to June cohorts, with 5 periods across.
I have tried many different things and keep getting errors in BigQuery, so I think I am approaching it all wrong. The online queries I am working from seem to use dates rather than months, which is also causing some confusion; I think I need to truncate my date column to months in the query?
Does anyone have a go-to query that will work in BigQuery for a retention table or can help me approach this? Thanks!
This may help you:
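-- The first order date per customer defines the cohort.
WITH cohorts AS (
SELECT
customer_id,
MIN(DATE(order_date)) AS cohort_date
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders`
GROUP BY 1)
SELECT
FORMAT_DATE("%Y-%m", c.cohort_date) AS cohort_mth,
t.customer_id AS cust_id,
-- Number of month boundaries between the first order and this order.
DATE_DIFF(t.order_date, c.cohort_date, MONTH) AS order_period
-- Table names take backticks in BigQuery standard SQL, not single quotes.
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders` t
JOIN cohorts c ON t.customer_id = c.customer_id
WHERE c.cohort_date >= '2020-01-01'
AND DATE_DIFF(t.order_date, c.cohort_date, MONTH) <= 5
-- Collapse repeat orders: one row per customer per period.
GROUP BY 1, 2, 3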
I typically do pivots and % calcs in Excel/Sheets, so this gives you just the input data you need for that.
NOTE:
This returns one row per customer per period, so counting rows gives the number of unique customers who ordered in period X (repeat orders within a period are ignored).
It also includes period 0 (orders placed in cohort_mth itself), which you may wish to keep or exclude.
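If you would rather do the final pivot in code than in a spreadsheet, the counting step can be sketched in plain Python. This is an illustration under assumptions: retention and month_index are hypothetical helpers, the sample orders are invented (only customer 6789's first order comes from the question), and the period is computed as a difference of calendar months, which is intended to mirror DATE_DIFF(..., MONTH).

```python
from collections import defaultdict
from datetime import date


def month_index(d):
    """Map a date to a running month number so months can be subtracted."""
    return d.year * 12 + d.month


def retention(orders, max_period=5):
    """orders: list of (customer_id, order_date). Returns {cohort: [counts]}."""
    # The first order per customer defines the cohort.
    first_order = {}
    for cust, d in orders:
        if cust not in first_order or d < first_order[cust]:
            first_order[cust] = d

    # Collect distinct customers per (cohort month, period).
    customers = defaultdict(set)
    for cust, d in orders:
        cohort = first_order[cust]
        period = month_index(d) - month_index(cohort)
        if period <= max_period:
            customers[(cohort.strftime("%Y-%m"), period)].add(cust)

    # Pivot: one row per cohort, one column per period 0..max_period.
    matrix = defaultdict(lambda: [0] * (max_period + 1))
    for (cohort, period), custs in customers.items():
        matrix[cohort][period] = len(custs)
    return dict(matrix)


# Invented sample orders for illustration.
orders = [
    (6789, date(2020, 1, 2)),
    (6789, date(2020, 1, 20)),  # repeat order in period 0, counted once
    (6789, date(2020, 3, 5)),
    (1111, date(2020, 1, 15)),
    (1111, date(2020, 2, 1)),
    (2222, date(2020, 2, 10)),
    (2222, date(2020, 6, 30)),
]
matrix = retention(orders)
for cohort in sorted(matrix):
    print(cohort, matrix[cohort])
```

Dividing each row by its period 0 value would give the retention percentages.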

SQL Query - Calculating Previous Year Sales

I don't know anything about SQL. I currently have a query that gives me this. Sales for some products by channel/etc (please note this is a very simplified version, there's more fields) by week/period/year:
Basically what I would need is to add a column that gives me the sales for prior year. Basically, transform the table as below. In Excel it would be a simple sumifs that would just sum the same exact criteria aside from the year which would be the previous year.
Is it possible to do this within SQL? The dataset is too large to do it within Excel.
I think you just want lag():
select t.*,
lag(sales) over (partition by channel, product, weekno order by yearno) as prev_sales
from t;
If I understand the data, then periodno is redundant with weekno.
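One thing to check with lag(): it returns the previous row in the partition, which is the previous year only when no year is missing. The sketch below contrasts it with a self-join on yearno - 1, using an in-memory SQLite database (window functions need SQLite 3.25 or later). The column names follow the answer above; the table contents are invented, with 2021 left out on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (channel TEXT, product TEXT, weekno INTEGER, "
    "yearno INTEGER, sales INTEGER)"
)
# Invented sample data; note that 2021 is missing.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        ("web", "P1", 1, 2019, 100),
        ("web", "P1", 1, 2020, 120),
        ("web", "P1", 1, 2022, 150),
    ],
)

# lag(): previous row within the same channel/product/week.
lag_rows = conn.execute(
    """
    SELECT yearno, sales,
           LAG(sales) OVER (
               PARTITION BY channel, product, weekno ORDER BY yearno
           ) AS prev_sales
    FROM t
    ORDER BY yearno
    """
).fetchall()

# Self-join: strictly the row whose year is exactly one less.
join_rows = conn.execute(
    """
    SELECT cur.yearno, cur.sales, prev.sales AS prev_sales
    FROM t AS cur
    LEFT JOIN t AS prev
      ON prev.channel = cur.channel
     AND prev.product = cur.product
     AND prev.weekno = cur.weekno
     AND prev.yearno = cur.yearno - 1
    ORDER BY cur.yearno
    """
).fetchall()

print(lag_rows)
print(join_rows)
```

For 2022 the two approaches disagree: lag() picks up the 2020 figure, while the self-join returns NULL because there is no 2021 row. Which behaviour you want depends on the data.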

Matrix to show all months of the year

I have a Web Service that I am getting data from (Not mine and I can't change it)
The data is in the following format
<DealMetrics>
<DealId>1</DealId>
<DealName>ABC</DealName>
<FundAbbreviation>ABC</FundAbbreviation>
<NavYear>2012</NavYear>
<NavMonth>January</NavMonth>
<Nav>123</Nav>
.
.
.
</DealMetrics>
I have a matrix that displays this performance data and works fine - I have a row group on the year and a column group on the month. The problem occurs if a deal has only been running for a short time - I want to display all months regardless of whether we have any data - So for example if a deal started in September and we're in December I'd want the column headings to display for Jan - Aug as well as the data that is returned in the web service.
Any ideas?
Regards
Andy
Might not be the most elegant way of doing it (I'm a relative newbie to SQL!) but I would create a table which is effectively a calendar of all the months within the period you are interested in (using a recursive CTE) and then just join the results of the query from the web service onto that?
Assuming your metrics end up in a SQL Server table with the following columns:
-- Metrics table:
DealId
DealName
FundAbbreviation
NavYear
NavMonth
Nav
Create a calendar table to hold all years and months you could possibly be interested in. For instructions on how to create a calendar table, check out this post, for example: How to create a Calender table for 100 years in Sql
Then, when you have your calendar table, use it in an outer join with your metrics table, so that you get all your years and months from the calendar table, and the metric data (if any). The query in your dataset could for example look like this:
SELECT DealId,
DealName,
FundAbbreviation,
ISNULL(NavYear, Months.Year) AS NavYear,
ISNULL(NavMonth, Months.Month) AS NavMonth,
Nav
FROM
(SELECT DISTINCT Year, Month, MonthNumber FROM Calendar) AS Months
LEFT JOIN Metrics ON Months.Year = Metrics.NavYear AND Months.Month = Metrics.NavMonth
-- Order by the month number; ordering by month name would sort alphabetically.
ORDER BY Months.Year, Months.MonthNumber
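The calendar-plus-outer-join idea can be tried end to end with a recursive CTE, as the first answer suggests. The sketch below uses an in-memory SQLite database rather than SQL Server. To keep it short, NavMonth is stored as a month number (the web service returns month names), the year is fixed at 2012, and the deal data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified Metrics table; NavMonth is numeric here, unlike the feed.
conn.execute(
    "CREATE TABLE Metrics (DealId INTEGER, NavYear INTEGER, NavMonth INTEGER, Nav REAL)"
)
# Invented deal that started in September 2012.
conn.executemany(
    "INSERT INTO Metrics VALUES (?, ?, ?, ?)",
    [(1, 2012, 9, 123.0), (1, 2012, 10, 125.0), (1, 2012, 11, 124.0), (1, 2012, 12, 130.0)],
)

rows = conn.execute(
    """
    WITH RECURSIVE Months(MonthNumber) AS (
        SELECT 1
        UNION ALL
        SELECT MonthNumber + 1 FROM Months WHERE MonthNumber < 12
    )
    SELECT 2012 AS NavYear, Months.MonthNumber, Metrics.Nav
    FROM Months
    LEFT JOIN Metrics
      ON Metrics.NavYear = 2012
     AND Metrics.NavMonth = Months.MonthNumber
    ORDER BY Months.MonthNumber
    """
).fetchall()

# Twelve rows come back; months without data carry Nav = None.
for row in rows:
    print(row)
```

Because the calendar side drives the join, January to August appear with an empty Nav, which is what the matrix needs in order to render those column headings.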

Access SQL group by name and then set values in horizontal from month1 to month12

I have a table with (name, money, date), and I would like to get (name, money for month 1, money for month 2, and so on up to month 12).
How to do it?
I know how to extract the month from the date;
First query:
name, iif(month(date) = 1, money, 0) AS m1, and so on up to m12
Second query:
name, sum(m1) AS mo1, and so on up to mo12
group by name
Limitations: only one insert per month and the query must have a year filter that selects ONLY 1 YEAR.
You might want to build a pivot table, using the month function to generate the month value for each of your dates. You can then use this month value as a column in your pivot table.
Be careful: values for the same month in different years will be aggregated together unless you explicitly filter your data for a specific year.
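The two-query approach from the question can also be collapsed into a single grouped query with one conditional sum per month and a year filter. Here is a minimal sketch using an in-memory SQLite database as a stand-in for Access, so it uses CASE and strftime where Access would use IIf and Month. The table name payments and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "payments" is a hypothetical name for the (name, money, date) table.
conn.execute("CREATE TABLE payments (name TEXT, money INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("ann", 10, "2015-01-15"),
        ("ann", 5, "2015-01-20"),
        ("ann", 7, "2015-03-02"),
        ("bob", 4, "2015-12-31"),
        ("ann", 99, "2014-01-10"),  # different year, must be filtered out
    ],
)

# One conditional sum per month, named mo1..mo12 as in the question.
month_columns = ", ".join(
    f"SUM(CASE WHEN CAST(strftime('%m', date) AS INTEGER) = {m} "
    f"THEN money ELSE 0 END) AS mo{m}"
    for m in range(1, 13)
)
sql = f"""
    SELECT name, {month_columns}
    FROM payments
    WHERE strftime('%Y', date) = ?
    GROUP BY name
    ORDER BY name
"""
# The year filter keeps months from different years from being mixed.
result = conn.execute(sql, ("2015",)).fetchall()
for row in result:
    print(row)
```

Summing inside the same query also lifts the one-insert-per-month limitation, since several rows in the same month are simply added up.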
What you want is called a crosstab query in MS Access parlance (and PIVOT in bigger systems).
Here's Allen Browne's nice write-up with a lot of attention to detail.