Weighted Average in BigQuery - google-bigquery

I am using imported GA data to calculate the average product position on a page. Currently I do this by averaging the item position by SKU. Is there a way to calculate this as a weighted average within my query? A product could display 10 times in position 1 and once at position 10, and I wouldn't want the average to be 5.
Here is my query so far:
SELECT hits.product.productSKU AS SKU, AVG(hits.product.productListPosition) AS Average_Position
FROM (TABLE_DATE_RANGE([***.ga_sessions_], TIMESTAMP('2016-04-24'), TIMESTAMP('2016-04-30')))
GROUP BY SKU
ORDER BY Average_Position ASC

I tested this query and it worked here:
SELECT
  sku,
  nom / den AS avg_position
FROM (
  SELECT
    sku,
    SUM(position * freq) AS nom,
    SUM(freq) AS den
  FROM (
    SELECT
      prods.productsku AS sku,
      prods.productlistposition AS position,
      COUNT(prods.productlistposition) AS freq
    FROM
      `project_id.dataset_id.ga_sessions_*`,
      UNNEST(hits) AS hits,
      UNNEST(hits.product) AS prods
    WHERE
      PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2016-04-24') AND TIMESTAMP('2016-04-30')
      AND prods.productlistposition > 0
    GROUP BY sku, position
  )
  GROUP BY sku
)
Note that I used standard SQL, which is the recommended dialect. If you must use legacy SQL, adapting this query should be straightforward (provided you don't have to use the FLATTEN operator).
You said you want to consider the positions on a given page; you can do that as well by adding this condition to the innermost WHERE clause:
AND hits.page.pagepath = 'your page url'
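The arithmetic the query above performs can be sketched in Python; the SKU and position data below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical impressions: one (sku, position) tuple per time a product
# appeared in a product list, mirroring the UNNESTed hits in the query above.
observations = [("A", 1)] * 10 + [("A", 10)]

# Inner query: frequency per (sku, position) -- the GROUP BY sku, position.
freq = Counter(observations)

# Outer queries: weighted average = SUM(position * freq) / SUM(freq) per sku.
def weighted_avg(sku):
    rows = [(pos, n) for (s, pos), n in freq.items() if s == sku]
    return sum(pos * n for pos, n in rows) / sum(n for _, n in rows)

print(weighted_avg("A"))  # (1*10 + 10*1) / 11 ≈ 1.82, not the naive 5.5
```

Note that a plain AVG over the raw per-impression rows would give the same number; the explicit SUM(position * freq) / SUM(freq) form is what keeps the average correct after the inner GROUP BY has collapsed the duplicate impressions.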

Related

Is there a way to calculate the average number of times an event happens when all data is stored as string?

I am working in BigQuery and using SQL to calculate the average number of ads viewed per user based on their engagement level (levels range from 1 - 5). I previously calculated the average number of days users were active based on their engagement level, but when I calculate the average number of ads viewed by engagement level, the query fails. My guess is that it fails because the ads-viewed value is stored as a string.
Is there a way to average the number of times 'ad viewed' occurs in a list of events, based on engagement?
I tried changing the original code I used where I extracted 'Average Days' to extract 'Ads Viewed' but that does not work.
I tried average(count(if(ads.viewed,1,0))), but that won't work either. I can't figure out what I am doing wrong.
I also checked this post (SQL average of string values) but this doesn't seem to apply.
SELECT
engagement_level,
COUNT(event="ADSVIEWED") AS AverageAds
I have also tried:
SELECT
engagement_level,
AVG(IF(event="ADSVIEWED",1,0)) AS AverageAds
But that doesn't work either.
It should put out a table of the engagement level with the corresponding average. For 'Average Days' it worked out to be Engagement Level: Average Days (1: 2.45, 2: 3.21, 3: 4.67, etc.). But it doesn't work for the ads_viewed event.
If I understand correctly, you can do this without a subquery:
SELECT engagement_level,
COUNTIF(event = 'ADSVIEWED') / COUNT(DISTINCT user_id) as avg_per_user
FROM t
GROUP BY engagement_level;
This counts the number of events and divides by the number of users. If you only want to count users who have the event:
SELECT engagement_level,
COUNT(*) / COUNT(DISTINCT user_id) as avg_per_user
FROM t
WHERE event = 'ADSVIEWED'
GROUP BY engagement_level;
... to calculate the average number of ads viewed per user based on their engagement level ...
Below is for BigQuery Standard SQL
#standardSQL
SELECT engagement_level, AVG(Ads) AverageAds FROM (
SELECT engagement_level, user_id, COUNTIF(event = 'ADSVIEWED') Ads
FROM `project.dataset.table`
GROUP BY engagement_level, user_id
)
GROUP BY engagement_level
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 user_id, 1 engagement_level, 'ADSVIEWED' event UNION ALL
SELECT 1, 1, 'a' UNION ALL
SELECT 1, 1, 'ADSVIEWED' UNION ALL
SELECT 2, 1, 'b' UNION ALL
SELECT 2, 1, 'ADSVIEWED'
)
SELECT engagement_level, AVG(Ads) AverageAds FROM (
SELECT engagement_level, user_id, COUNTIF(event = 'ADSVIEWED') Ads
FROM `project.dataset.table`
GROUP BY engagement_level, user_id
)
GROUP BY engagement_level
with result
Row engagement_level AverageAds
1 1 1.5
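To see where the 1.5 comes from, the same two-step computation can be sketched in Python against the dummy rows from the WITH clause (user_id, engagement_level, event):

```python
# Dummy events mirroring the WITH clause above: (user_id, engagement_level, event).
rows = [
    (1, 1, 'ADSVIEWED'),
    (1, 1, 'a'),
    (1, 1, 'ADSVIEWED'),
    (2, 1, 'b'),
    (2, 1, 'ADSVIEWED'),
]

level = 1
# Inner query: COUNTIF(event = 'ADSVIEWED') per user at this engagement level.
per_user = {}
for user, lvl, event in rows:
    if lvl == level:
        per_user.setdefault(user, 0)
        if event == 'ADSVIEWED':
            per_user[user] += 1

# Outer query: AVG over the per-user counts: (2 + 1) / 2 users = 1.5.
avg_ads = sum(per_user.values()) / len(per_user)
print(avg_ads)  # 1.5
```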

average query ORA-00936 error

SQL> SELECT consignmentNo, VoyageNo, Weight
2 (SELECT (AVG(WEIGHT) FROM consignment), AS AVERAGE,
3 WHERE Weight = 650,
4 FROM consignment;
(SELECT (AVG(WEIGHT) FROM consignment), AS AVERAGE,
*
ERROR at line 2:
ORA-00936: missing expression
I want the average weight for a particular ship, listing the consignments for that ship as well, but I am unable to identify the error.
Are you simply looking for group by?
SELECT VoyageNo, AVG(Weight)
FROM consignment
GROUP BY VoyageNo;
If you want the average along with the detailed information, you want a window function:
SELECT c.*, AVG(Weight) OVER (PARTITION BY VoyageNo)
FROM consignment c;
This assumes that VoyageNo is what you mean by ship.
You seem to want:
SELECT consignmentNo, VoyageNo, Weight, avg.AVERAGE
FROM consignment CROSS JOIN
(SELECT AVG(WEIGHT) AS AVERAGE FROM consignment) avg
WHERE Weight = 650;
You have an extra , in your query (before AS AVERAGE), an extra ( before AVG, and a stray , before FROM. Also, FROM and WHERE are not in the right order. Try this:
SELECT consignmentNo, VoyageNo, Weight,
(SELECT AVG(WEIGHT) FROM consignment) AS AVERAGE
FROM consignment
WHERE Weight = 650;
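What the scalar-subquery / CROSS JOIN versions compute can be sketched in Python; the consignment rows below are made up for illustration:

```python
# Hypothetical rows: (consignmentNo, voyageNo, weight).
consignments = [
    (1, 'V1', 650),
    (2, 'V1', 700),
    (3, 'V2', 650),
]

# The scalar subquery: a single overall average over the whole table.
overall_avg = sum(w for _, _, w in consignments) / len(consignments)

# The CROSS JOIN attaches that one value to every filtered detail row.
result = [(no, v, w, overall_avg)
          for no, v, w in consignments if w == 650]
print(result)
```

The AVG(Weight) OVER (PARTITION BY VoyageNo) variant differs only in computing one average per voyage instead of one for the whole table.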

how to perform multiple aggregations on a single SQL query

I have a table with Three columns:
GEOID, ParcelID, and PurchaseDate.
The PKs are GEOID and ParcelID which is formatted as such:
GEOID PARCELID PURCHASEDATE
12345 AB123 1/2/1932
12345 sfw123 2/5/2012
12345 fdf323 4/2/2015
12346 dfefej 2/31/2022 <-New GEOID
What I need is an aggregation based on GEOID.
I need to count the number of ParcelIDs from last month PER GEOID
and I need to provide a percentage of that GEOID of all total sold last month.
I need to produce three columns:
GEOID Nbr_Parcels_Sold Percent_of_total
For each GEOID, I need to know how many parcels sold last month, and from that number, what percentage of all sales it represents.
For example: if there were 20 parcels sold last month, and 4 of them were sold from GEOID 12345, then the output would be:
GEOID Nbr_Parcels_Sold Perc_Total
12345 4 .2 (or 20%)
I am having issues with the dual aggregation. The concern is that the table in question has over 8 million records.
If there is a SQL warrior out there who has seen this issue before, any wisdom would be greatly appreciated.
Thanks.
Hopefully you are using SQL Server 2005 or a later version, in which case you can take advantage of windowed aggregation. In this case, windowed aggregation will allow you to get the total sale count alongside the counts per GEOID and use the total in calculations. Basically, the following query returns just the counts:
SELECT
GEOID,
Nbr_Parcels_Sold = COUNT(*),
Total_Parcels_Sold = SUM(COUNT(*)) OVER ()
FROM
dbo.atable
GROUP BY
GEOID
;
The COUNT(*) call gives you counts per GEOID, according to the GROUP BY clause. Now, the SUM(...) OVER expression gives you the grand total count in the same row as the detail count. It is the empty OVER clause that tells the SUM function to add up the results of COUNT(*) across the entire result set. You can use that result in calculations just like the result of any other function (or any expression in general).
The above query simply returns the total value. As you actually want not the value itself but a percentage from it for each GEOID, you can just put the SUM(...) OVER call into an expression:
SELECT
GEOID,
Nbr_Parcels_Sold = COUNT(*),
Percent_of_total = COUNT(*) * 100 / SUM(COUNT(*)) OVER ()
FROM
dbo.atable
GROUP BY
GEOID
;
The above will give you integer percentages (truncated). If you want more precision or a different representation, remember to cast either the divisor or the dividend (optionally both) to a non-integer numeric type, since SQL Server always performs integral division when both operands are integers.
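The integer-division pitfall is easy to reproduce; Python's // operator mimics what SQL Server does with two INT operands (the counts are hypothetical):

```python
sold, total = 1, 3  # hypothetical parcel counts

# Both operands integral: the fraction is truncated, as in SQL Server.
truncated = sold * 100 // total
print(truncated)  # 33, not 33.33...

# Promote one operand to a non-integer type first, as advised above.
exact = 100.0 * sold / total
print(round(exact, 2))  # 33.33
```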
How about using a subquery to compute the total:
WITH data AS
(
SELECT *
FROM [Table]
WHERE
YEAR(PURCHASEDATE) * 100 + MONTH(PURCHASEDATE) = 201505
)
SELECT
GEOID,
COUNT(*) AS Nbr_Parcels_Sold,
CONVERT(decimal(18,8), COUNT(*)) /
(SELECT COUNT(*) FROM data) AS Perc_Total
FROM
data t
GROUP BY
GEOID
EDIT
To update another table with the result, put the UPDATE under the WITH clause:
WITH data AS
(
SELECT *
FROM [Table]
WHERE
YEAR(PURCHASEDATE) * 100 + MONTH(PURCHASEDATE) = 201505
)
UPDATE target SET
Nbr_Parcels_Sold = source.Nbr_Parcels_Sold,
Perc_Total = source.Perc_Total
FROM
[AnotherTable] target
INNER JOIN
(
SELECT
GEOID,
COUNT(*) AS Nbr_Parcels_Sold,
CONVERT(decimal(18,8), COUNT(*)) /
(SELECT COUNT(*) FROM data) AS Perc_Total
FROM
data t
GROUP BY
GEOID
) source ON target.GEOID = source.GEOID
Try the following. It grabs the total sales into a variable then uses it in the subsequent query:
DECLARE @pMonthStartDate DATETIME
DECLARE @MonthEndDate DATETIME
DECLARE @TotalPurchaseCount INT
SET @pMonthStartDate = <EnterFirstDayOfAMonth>
SET @MonthEndDate = DATEADD(MONTH, 1, @pMonthStartDate)
SELECT
@TotalPurchaseCount = COUNT(*)
FROM
GEOIDs
WHERE
PurchaseDate BETWEEN @pMonthStartDate
AND @MonthEndDate
SELECT
GEOID,
COUNT(PARCELID) AS Nbr_Parcels_Sold,
CAST(COUNT(PARCELID) AS FLOAT) / CAST(@TotalPurchaseCount AS FLOAT) * 100.0 AS Perc_Total
FROM
GEOIDs
WHERE
PurchaseDate BETWEEN @pMonthStartDate
AND @MonthEndDate
GROUP BY
GEOID
I'm guessing your table name is GEOIDs. Change the value of @pMonthStartDate to suit yourself. If your PKs are as you say, then this will be a quick query.

FIFO Implementation in Inventory using SQL

This is basically an inventory project which tracks the "Stock In" and "Stock Out" of items through Purchase and sales respectively.
The inventory system follows FIFO Method (the items which are first purchased are always sold first). For example:
If we purchased Item A in months January, February and March
When a customer comes we give away items purchased during January
only when the January items are over we starts giving away February items and so on
So I have to show here the total stock in my hand and the split up so that I can see the total cost incurred.
Actual table data:
The result set I need to obtain:
My client insists that I should not use a cursor; is there another way of doing this?
As some comments already said, a CTE can solve this:
with cte as (
select item, wh, stock_in, stock_out, price, value
, row_number() over (partition by item, wh order by item, wh) as rank
from myTable)
select a.item, a.wh
, a.stock_in - coalesce(b.stock_out, 0) stock
, a.price
, a.value - coalesce(b.value, 0) value
from cte a
left join cte b on a.item = b.item and a.wh = b.wh and a.rank = b.rank - 1
where a.stock_in - coalesce(b.stock_out, 0) > 0
Note that the second "Item B" has an inconsistent price (the IN price is 25, the OUT price is 35).
SQL 2008 fiddle
Just for fun: with SQL Server 2012 and the introduction of the LEAD and LAG functions, the same thing is possible in a somewhat easier way:
with cte as (
select item, wh, stock_in
, coalesce(LEAD(stock_out)
OVER (partition by item, wh order by item, wh), 0) stock_out
, price, value
, coalesce(LEAD(value)
OVER (partition by item, wh order by item, wh), 0) value_out
from myTable)
select item
, wh
, (stock_in - stock_out) stock
, price
, (value - value_out) value
from cte
where (stock_in - stock_out) > 0
SQL2012 fiddle
Update
ATTENTION: to use the two queries above, the data must already be in the correct order.
To have details with more than one row per day, you need something reliable to order rows that share the same date, such as a datetime column with a time component, an auto-incrementing ID, or something along those lines; the queries already written can't be used as-is, because they rely on the physical position of the data.
A better idea is to split the data into IN and OUT rows, order them by item, wh and date, and apply a rank to both sets, like this:
SELECT d_in.item
, d_in.wh
, d_in.stock_in - coalesce(d_out.stock_out, 0) stock
, d_in.price
, d_in.value - coalesce(d_out.value, 0) value
FROM (SELECT item, wh, stock_in, price, value
, rank = row_number() OVER
(PARTITION BY item, wh ORDER BY item, wh, date)
FROM myTable
WHERE stock_out = 0) d_in
LEFT JOIN
(SELECT item, wh, stock_out, price, value
, rank = row_number() OVER
(PARTITION BY item, wh ORDER BY item, wh, date)
FROM myTable
WHERE stock_in = 0) d_out
ON d_in.item = d_out.item AND d_in.wh = d_out.wh
AND d_in.rank = d_out.rank
WHERE d_in.stock_in - coalesce(d_out.stock_out, 0) > 0
SQLFiddle
But this query is NOT completely reliable: the order of rows within the same ordering group is not stable.
I haven't changed the queries to recalculate the price when the IN price differs from the OUT price.
If cursors aren't an option, a SQLCLR stored procedure might be. This way you could read the raw data into .NET objects, manipulate and sort it using C# or VB.NET, and return the resulting data as the procedure's output. Not only will this give you what you want, it may even turn out to be much easier than trying to do the same in pure T-SQL, depending on your programming background.
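For comparison, the FIFO depletion itself is simple to express procedurally. Here is a minimal Python sketch (the movement rows are hypothetical) of the layer-by-layer consumption the queries above are approximating:

```python
from collections import deque

# Hypothetical movements for one item/warehouse, in chronological order:
# positive qty = stock in (with its unit price), negative qty = stock out.
movements = [(100, 20.0), (50, 25.0), (-120, None)]

layers = deque()  # FIFO queue of [qty_remaining, unit_price]
for qty, price in movements:
    if qty > 0:
        layers.append([qty, price])
    else:
        out = -qty
        while out > 0:
            head = layers[0]           # oldest purchase is consumed first
            take = min(out, head[0])
            head[0] -= take
            out -= take
            if head[0] == 0:
                layers.popleft()       # layer fully consumed, drop it

# Remaining stock split per purchase layer, with cost incurred.
stock = [(q, p, q * p) for q, p in layers]
print(stock)  # [(30, 25.0, 750.0)] -- 30 units left from the second purchase
```

This is essentially what the SQLCLR suggestion amounts to: fetch the movement rows, run logic like this in .NET, and return the layer breakdown as the result set.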

MDX - Top X Sales People by Total Sales for Each Date

I'm trying to do this, but with MDX in my cube:
select
*
from
(
select
Date, SalesPerson, TotalSales, row_number() over(partition by Date order by TotalSales desc) as Num
from SalesFact as ms
) as x
where
Num < 5
order by
Date, SalesPerson, Num desc
Let's say I have a cube with these dimensions:
Date (Year, Month, Date) - date is always 1st of month
SalesPerson
The fact table has three columns - Date, SalesPerson, TotalSales - ie, the amount that person sold in that month.
I want, for each month, to see the top 5 sales people, and each of their TotalSales. The top 5 sales people can be different from one month to the next.
I am able to get the results for one month, using a query that looks like this:
select
[Measures].[TotalSales] on columns,
(
subset
(
order
(
[SalesPerson].children,
[Measures].[TotalSales],
bdesc
),
0,
5
)
) on rows
from
Hypercube
where
(
[Date].[Date].&[2009-03-01T00:00:00]
)
What I'm after is a query that puts Date and SalesPerson on rows, and TotalSales on columns.
I want to see over time each month, and for each month, the top 5 sales people, and how much they sold.
When I try to do it this way, it doesn't seem to filter / group the sales people by each date (get top 5 for each date). The values returned are all over the place and include very low and null values. Notably, the SalesPerson list is the same for each date, even though TotalSales varies a lot.
select
[Measures].[TotalSales] on columns,
(
[Date].[Hierarchy].[Date].members,
subset
(
order
(
[SalesPerson].children,
[Measures].[TotalSales],
bdesc
),
0,
5
)
) on rows
from
Hypercube
It seems that everything inside "subset" needs to be filtered by the current [Date].[Hierarchy].[Date], but using CurrentMember gives a crossjoin / axis error:
select
[Measures].[TotalSales] on columns,
(
[Date].[Hierarchy].[Date].members,
subset
(
order
(
([SalesPerson].children, [Date].[Hierarchy].CurrentMember),
[Measures].[TotalSales],
bdesc
),
0,
5
)
) on rows
from
Hypercube
Error: Executing the query ... Query (3, 2) The Hierarchy hierarchy
is used more than once in the Crossjoin function.
Execution complete
I've tried several variations of the last query with no luck.
Hopefully the answers will be helpful to others new to MDX as well.
I eventually found out how to do what I was looking for. The solution revolved around using the Generate function; starting with the basic example on MSDN and modifying the dimensions and measure to the ones in my cube got me going in the right direction.
From http://msdn.microsoft.com/en-us/library/ms145526.aspx
Is there a better way?
Also, be wary of trying to refactor sets into the with block. This seems to change when the set is evaluated / change its scope and will change the results.
with
set
Dates as
{
[Date].[Hierarchy].[Date].&[2009-02-01T00:00:00],
[Date].[Hierarchy].[Date].&[2009-03-01T00:00:00],
[Date].[Hierarchy].[Date].&[2009-04-01T00:00:00]
}
select
Measures.[TotalSales]
on columns,
generate
(
Dates,
topcount
(
[Date].Hierarchy.CurrentMember
*
[SalesPerson].Children,
5,
Measures.[TotalSales]
)
)
on rows
from
Hypercube
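What Generate does here, conceptually, is run TOPCOUNT once per member of the Dates set; the same per-group top-N logic can be sketched in Python with hypothetical fact rows:

```python
from collections import defaultdict

# Hypothetical fact rows: (month, salesperson, total_sales).
facts = [
    ('2009-02', 'Ann', 90), ('2009-02', 'Bob', 120), ('2009-02', 'Cy', 40),
    ('2009-03', 'Ann', 30), ('2009-03', 'Bob', 10), ('2009-03', 'Cy', 200),
]

by_month = defaultdict(list)
for month, person, sales in facts:
    by_month[month].append((person, sales))

# For each month independently, take the top N by sales -- the TOPCOUNT
# that Generate applies with each date as the current member.
TOP_N = 2
result = {m: sorted(rows, key=lambda r: r[1], reverse=True)[:TOP_N]
          for m, rows in by_month.items()}

print(result['2009-03'])  # [('Cy', 200), ('Ann', 30)]
```

The key point, as in the MDX, is that the ranking is re-evaluated inside each month rather than once over the whole fact set, so the top sellers can differ month to month.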