BigQuery - how to aggregate data based on conditions - sql

I have a simple table like the following, with product, price, cost and category columns. price and cost can be null.
This table is updated from time to time. I want a daily summary of its content grouped by category, showing for each category how many products have no price, how many have a price, and how many have a price that is higher than the cost, so the result table would look like the following:
I think I can run a query every day by setting up a scheduled query in BigQuery, so that three rows of data are appended to the result table every day.
But the problem is, how can I get those three rows? I know I can use GROUP BY, but how do I get the counts with conditions such as not null, larger than, etc.?

You seem to want window functions:
select t.*,
countif(price is null) over (partition by date) as products_no_price,
countif(price <= cost) over (partition by date) as products_price_lower_than_cost
from t;
You can run this code on a table that has a date column. In fact, you don't need to store the last two columns.
If you want to insert the first table into the second, then there is no date and you can simply use:
select t.*,
countif(price is null) over () as products_no_price,
countif(price <= cost) over () as products_price_lower_than_cost
from t;
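The question asks for one row per category rather than one row per product, so a grouped variant may be closer to what is wanted. The following is a minimal sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module as a stand-in for BigQuery, the sample rows are invented, and SUM(CASE WHEN ...) replaces COUNTIF because SQLite has no COUNTIF. The column name products_with_price and products_price_above_cost are illustrative names, not from the original post.

```python
import sqlite3

# In-memory database with made-up sample rows (assumption, not the asker's data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (product TEXT, price REAL, cost REAL, category TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("a", None, 1.0, "toys"),
        ("b", 5.0, 3.0, "toys"),
        ("c", 2.0, 3.0, "toys"),
        ("d", 4.0, None, "books"),
        ("e", None, None, "books"),
    ],
)

# Conditional aggregation: each CASE yields 1 when the condition holds, else 0.
# A comparison with NULL is not true, so such rows fall into the ELSE branch.
summary_sql = """
SELECT
    category,
    SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END)     AS products_no_price,
    SUM(CASE WHEN price IS NOT NULL THEN 1 ELSE 0 END) AS products_with_price,
    SUM(CASE WHEN price > cost THEN 1 ELSE 0 END)      AS products_price_above_cost
FROM t
GROUP BY category
ORDER BY category
"""
summary = conn.execute(summary_sql).fetchall()
for row in summary:
    print(row)
```

In BigQuery the same counts could presumably be written as countif(price is null), countif(price is not null) and countif(price > cost) with a plain GROUP BY category.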

Related

Delete duplicates using dense rank

I have a sales data table with cust_ids and their transaction dates.
I want to create a table that stores, for every customer, their cust_id, their last purchased date (on the basis of transaction dates) and the count of times they have purchased.
I wrote this code:
SELECT
cust_xref_id, txn_ts,
DENSE_RANK() OVER (PARTITION BY cust_xref_id ORDER BY CAST(txn_ts as timestamp) DESC) AS rank,
COUNT(txn_ts)
FROM
sales_data_table
But I understand that the above code would give an output like this (attached example picture)
How do I modify the code to get an output like this:
I am a beginner in SQL queries and would really appreciate any help! :)
This would be an aggregation query, which changes the table key from (customer_id, date) to (customer_id):
SELECT
cust_xref_id,
MAX(txn_ts) as last_purchase_date,
COUNT(txn_ts) as count_purchase_dates
FROM
sales_data_table
GROUP BY
cust_xref_id
You are looking for the last purchase date and the count of distinct transaction dates (if a person buys twice on the same date, it should be counted once).
Although you mentioned that you want the count of dates, your sample data shows that you want the count of distinct dates: customer 284214 transacted 9 times, but DISTINCT gives 7.
So, here is the SQL you can use to get your result:
SELECT
cust_xref_id,
MAX(txn_ts) as last_purchase_date,
COUNT(distinct txn_ts) as count_purchase_dates -- note: DISTINCT counts each date only once
FROM sales_data_table
GROUP BY 1
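To see how the two answers differ, the sketch below computes COUNT and COUNT(DISTINCT) side by side. It is only an illustration: it uses SQLite through Python's sqlite3 module, stores txn_ts as ISO date text, and the four sample rows and the count_purchases alias are invented.

```python
import sqlite3

# In-memory table with invented rows; customer 1 buys twice on the same date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data_table (cust_xref_id INTEGER, txn_ts TEXT)")
conn.executemany(
    "INSERT INTO sales_data_table VALUES (?, ?)",
    [
        (1, "2021-01-05"),
        (1, "2021-01-05"),
        (1, "2021-02-10"),
        (2, "2021-03-01"),
    ],
)

# One row per customer: latest date, number of transactions, number of distinct dates.
rows = conn.execute(
    """
    SELECT cust_xref_id,
           MAX(txn_ts)            AS last_purchase_date,
           COUNT(txn_ts)          AS count_purchases,
           COUNT(DISTINCT txn_ts) AS count_purchase_dates
    FROM sales_data_table
    GROUP BY cust_xref_id
    ORDER BY cust_xref_id
    """
).fetchall()
for row in rows:
    print(row)
```

Customer 1 has three transactions but only two distinct dates, which is the same kind of gap as the 9 versus 7 noted for customer 284214.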

INSERT INTO two columns from a SELECT query

I have a table called VIEWS with Id, Day, Month, name of video, name of browser... but I'm interested only in Id, Day and Month.
The ID can be duplicate because the user (ID) can watch a video multiple days in multiple months.
This is the query for the minimum date and the maximum date.
SELECT ID, CONCAT(MIN(DAY), '/', MIN(MONTH)) AS MIN_DATE,
CONCAT(MAX(DAY), '/', MAX(MONTH)) AS MAX_DATE
FROM Views
GROUP BY ID
I want to insert the two columns from this select (MIN_DATE and MAX_DATE) into two new columns using INSERT INTO.
What would the INSERT INTO query look like?
To do what you are trying to do (there are some issues with your solution, please read my comments below), first you need to add the new columns to the table.
ALTER TABLE Views ADD MIN_DATE VARCHAR(10)
ALTER TABLE Views ADD MAX_DATE VARCHAR(10)
Then you need to UPDATE your new columns (not INSERT, because you don't want new rows). Determine the min/max for each ID, then join the result back to the table to be able to update each row. You can't update directly from a GROUP BY as rows are grouped and lose their original row.
;WITH MinMax AS
(
SELECT
ID,
CONCAT(MIN(V.DAY), '/', MIN(V.MONTH)) AS MIN_DATE,
CONCAT(MAX(V.DAY), '/', MAX(V.MONTH)) AS MAX_DATE
FROM
Views AS V
GROUP BY
ID
)
UPDATE V SET
MIN_DATE = M.MIN_DATE,
MAX_DATE = M.MAX_DATE
FROM
MinMax AS M
INNER JOIN Views AS V ON M.ID = V.ID
The problems that I see with this design are:
Storing aggregated columns: you usually want to do this only for performance reasons (which I believe is not the case here), as querying the aggregated (grouped) rows is faster because there are fewer rows to read. The problem is that you will have to update the grouped values each time one of the original rows is updated, which adds extra processing time. Another option would be to update the aggregated values periodically, but you will have to accept that for a period of time the grouped values do not really represent the tracking table.
Keeping aggregated columns on the same table as the data they are aggregating: this is a normalization problem. Updating or inserting a row will trigger an update of all rows with the same ID, as the min/max values might have changed. Also, the min/max values will be repeated on all rows that belong to the same ID, which wastes space. If you have to save aggregated data, save it on a different table, which still has the problems I listed in the previous point.
Using a text data type to store dates: you always want to work with dates using a proper DATETIME data type. This will not only let you use date functions like DATEADD or DATEDIFF, but also save space (varchars that store dates need more bytes than DATETIME). I don't see the year part in your query; it should be considered when computing a min/max (this might depend on what you are storing in this table).
Computing the min/max incorrectly: If you have the following rows:
ID DAY MONTH
1 5 1
1 3 2
The current result of your query would be 3/1 as MIN_DATE and 5/2 as MAX_DATE, which I believe is not what you are trying to find. The lowest here should be the 5th of January and the highest the 3rd of February. This is a consequence of storing date parts as independent values and not the whole date as a DATETIME.
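The two-row example can be reproduced with the sketch below. It is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module rather than SQL Server, so `||` stands in for CONCAT, and the combined key MONTH * 100 + DAY is a hypothetical workaround that only orders dates within a single year. A real date column remains the better fix.

```python
import sqlite3

# The two rows from the example above: 5 January and 3 February for ID 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Views (ID INTEGER, DAY INTEGER, MONTH INTEGER)")
conn.executemany("INSERT INTO Views VALUES (?, ?, ?)", [(1, 5, 1), (1, 3, 2)])

# Min/max taken separately on DAY and MONTH mixes parts of different rows.
separate = conn.execute(
    """
    SELECT ID,
           MIN(DAY) || '/' || MIN(MONTH) AS MIN_DATE,
           MAX(DAY) || '/' || MAX(MONTH) AS MAX_DATE
    FROM Views
    GROUP BY ID
    """
).fetchall()
print(separate)

# Hypothetical workaround: compare month and day together as one sortable number.
_, min_key, max_key = conn.execute(
    """
    SELECT ID,
           MIN(MONTH * 100 + DAY) AS min_key,
           MAX(MONTH * 100 + DAY) AS max_key
    FROM Views
    GROUP BY ID
    """
).fetchone()

# Split the key back into day/month for display.
min_date = f"{min_key % 100}/{min_key // 100}"
max_date = f"{max_key % 100}/{max_key // 100}"
print(min_date, max_date)
```

The first query returns 3/1 and 5/2, neither of which is a date that exists in the table, while the combined key gives 5/1 and 3/2.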
What you usually want to do for this scenario is to group directly on the query that needs the data grouped, so you will do the GROUP BY on the SELECT that needs the min/max. Having an index by ID would make the grouping very fast. Thus, you save the storage space you would use to keep the aggregated values and also the result is always the real grouped result at the time that you are querying.
It would be something like the following:
;WITH MinMax AS
(
SELECT
ID,
CONCAT(MIN(V.DAY), '/', MIN(V.MONTH)) AS MIN_DATE, -- Date problem (varchar + min/max computed separately)
CONCAT(MAX(V.DAY), '/', MAX(V.MONTH)) AS MAX_DATE -- Date problem (varchar + min/max computed separately)
FROM
Views AS V
GROUP BY
ID
)
SELECT
V.*,
M.MIN_DATE,
M.MAX_DATE
FROM
MinMax AS M
INNER JOIN Views AS V ON M.ID = V.ID

SQL How to add a column to a table that's the sum of a category's quantity

I have a data table shaped like below:
What I'm looking to do with SQL is add a column that will be the sum for the total category by month without removing any rows. For Example,
My goal is to take this category data and do some calculations with it like dividing it by the Qty and seeing how it changes over time.
What I've tried to do is use GROUP BY the category and date but that ends up with me losing the Item level data which I want to compare the Category level data to.
I also tried doing something like this
SELECT
Item, Category, Date, Qty, (SELECT sum(QTY) from TABLE)
FROM TABLE
but that only gives the sum of the QTY for the whole column not split out by Month/Year and Category.
Does anyone know what might help? I'm relatively new to using SQL so I hope I explained my question properly.
Use window functions:
select t.*,
sum(qty) over (partition by category, date) as category_sum
from t;
This assumes that date is really just the month and year. If it is the exact date, you need to extract the month and year from it.
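If the column holds exact dates, the month can be extracted inside the PARTITION BY. The sketch below is one way to do that, assuming SQLite (3.25 or later, for window functions) through Python's sqlite3 module, dates stored as ISO text, and invented sample rows. strftime is SQLite's function; other engines have their own way to truncate a date to the month.

```python
import sqlite3

# Invented item-level rows with exact dates stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (item TEXT, category TEXT, date TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("pen", "office", "2021-01-03", 2),
        ("ink", "office", "2021-01-20", 3),
        ("pen", "office", "2021-02-01", 4),
        ("ball", "sport", "2021-01-15", 5),
    ],
)

# The window sum keeps every row and adds the category total for that row's month.
rows = conn.execute(
    """
    SELECT item, category, date, qty,
           SUM(qty) OVER (
               PARTITION BY category, strftime('%Y-%m', date)
           ) AS category_sum
    FROM t
    ORDER BY category, date
    """
).fetchall()
for row in rows:
    print(row)
```

Both January rows in the office category carry the same category_sum, so a per-item share such as qty divided by category_sum can be computed on each row.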

Averaging Grouped Data in Single SQL Statement Using Multiple Group Bys

I want to see the average cost of an item. First I use SUM with GROUP BY on the manufacturing order and item to see how much each item costs per manufacturing order (using WHERE clauses to exclude specific steps in the process). Then I want to average those sums to see how much the item costs on average across that set. Can I do this easily in one statement instead of creating a temp table?
You need an intermediate result if you first want to sum the cost of an item per manufacturing order and then average those totals per item, but it can be a derived table (subquery) in the same statement rather than a temp table. I hope I understood your problem statement correctly.
SELECT item, AVG(cost) FROM
(SELECT item, manufacture_order, SUM(COST) cost
FROM manufacture_order_tab
GROUP BY item, manufacture_order) tab1
GROUP BY item;
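The nested query above can be tried out with the sketch below, which also shows how it differs from a plain AVG over the raw rows. It is an illustration under assumptions: SQLite through Python's sqlite3 module, an invented step column, and invented cost rows.

```python
import sqlite3

# Invented cost rows: each manufacturing order has one row per process step.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE manufacture_order_tab "
    "(item TEXT, manufacture_order INTEGER, step TEXT, cost REAL)"
)
conn.executemany(
    "INSERT INTO manufacture_order_tab VALUES (?, ?, ?, ?)",
    [
        ("widget", 1, "cut", 10.0),
        ("widget", 1, "paint", 5.0),
        ("widget", 2, "cut", 12.0),
        ("widget", 2, "paint", 13.0),
        ("gadget", 3, "cut", 7.0),
    ],
)

# Inner query: total cost per item and order. Outer query: average of those totals.
avg_of_sums = conn.execute(
    """
    SELECT item, AVG(cost) AS avg_order_cost
    FROM (
        SELECT item, manufacture_order, SUM(cost) AS cost
        FROM manufacture_order_tab
        GROUP BY item, manufacture_order
    ) AS tab1
    GROUP BY item
    ORDER BY item
    """
).fetchall()
print(avg_of_sums)

# For comparison: averaging the raw rows gives the average cost per step, not per order.
avg_of_rows = conn.execute(
    "SELECT item, AVG(cost) FROM manufacture_order_tab GROUP BY item ORDER BY item"
).fetchall()
print(avg_of_rows)
```

For the widget, the two orders total 15 and 25, so the average per order is 20, whereas averaging the four step rows directly gives 10.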
try this
SELECT AVG(Cost), SUM(COST)
FROM your_table
GROUP BY your_column

SQL SUM function with added

Total novice here with a SQL SUM function question. So, the SUM function itself works as I expected it to:
select ID, sum(amount)
from table1
group by ID
There are several records for each ID and my goal is to summarize each ID on one row where the next column would give me the summarized amount of column AMOUNT.
This works fine, however I also need to filter based on certain criteria on the summarized amount field, i.e. only look for results where the summarized amount is bigger than, smaller than, or between certain numbers.
This is the part I'm struggling with, as I can't seem to use column AMOUNT, as this messes up summarizing results.
Column name for summarized results is shown as "00002", however using this in the between or > / < clause does not work either. Tried this:
select ID, sum(amount)
from table1
where 00002 > 1000
group by ID
There is no error message, just a blank result, even though there are plenty of summarized results with values over 1000.
Unfortunately I am not sure which engine the database runs on, however it should be some IBM-based product.
The WHERE clause will filter individual rows that don't match the condition before aggregating them.
If you want to filter after aggregation, you need to use the HAVING clause.
HAVING will apply the filter to the results after being grouped.
select ID, sum(amount)
from table1
group by ID
having sum(amount) > 1000
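The difference between the two clauses shows up clearly on a small table. The sketch below is an illustration only: it uses SQLite through Python's sqlite3 module instead of the unknown IBM engine, and the rows are invented so that WHERE and HAVING return different IDs.

```python
import sqlite3

# Invented rows: ID 1 sums to 1300, ID 2 sums to 700, ID 3 sums to 200.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (ID INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, 600), (1, 700), (2, 1500), (2, -800), (3, 200)],
)

# HAVING filters the groups after the sums have been computed.
having_rows = conn.execute(
    "SELECT ID, SUM(amount) FROM table1 GROUP BY ID HAVING SUM(amount) > 1000"
).fetchall()
print(having_rows)

# WHERE filters individual rows first, so only the single 1500 row is summed.
where_rows = conn.execute(
    "SELECT ID, SUM(amount) FROM table1 WHERE amount > 1000 GROUP BY ID"
).fetchall()
print(where_rows)

# A range condition on the summarized amount works the same way.
between_rows = conn.execute(
    "SELECT ID, SUM(amount) FROM table1 GROUP BY ID "
    "HAVING SUM(amount) BETWEEN 500 AND 1000"
).fetchall()
print(between_rows)
```

Here HAVING keeps ID 1 because its total exceeds 1000, while WHERE keeps ID 2 because one of its individual rows does, even though its total is only 700.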