KQL aggregation function product - kql

I have the following table:
productName  feature  probability_for_feature
A            w        0.2
A            z        0.8
B            w        0.2
B            z        0.8
B            x        0.3
For each productName, I want the product of the probabilities of its features. For example, productName A has feature w with probability 0.2 and feature z with probability 0.8, so the product for A is 0.2 * 0.8 = 0.16.
So for the table above I would get:
productName  features  probability_for_feature
A            [w,z]     0.16
B            [w,z,x]   0.048
Or just:
productName  probability
A            0.16
B            0.048
I couldn't find a product or multiply aggregation function and would appreciate some help.
Thank you!

Here's an option, using a cumulative sum and leveraging the fact that log(x1) + log(x2) + … + log(xN) == log(x1 * x2 * … * xN) (this requires all values to be strictly positive):
datatable(productName:string, feature:string, probability_for_feature:double)
[
'A', 'w', 0.2,
'A', 'z', 0.8,
'B', 'w', 0.2,
'B', 'z', 0.8,
'B', 'x', 0.3,
]
| order by productName asc
| extend l = log10(probability_for_feature), rn = row_number()
| extend cumsum = row_cumsum(l, productName != prev(productName))
| summarize arg_max(rn, *), features = make_list(feature) by productName
| project productName, features, product = exp10(cumsum)
productName  features           product
A            ["w", "z"]         0.16
B            ["w", "z", "x"]    0.048
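The identity behind this query can be checked outside Kusto. The following is a minimal Python sketch, not part of the original answer; the names rows, log_sums and products are illustrative. It sums log10 of each probability per productName and raises 10 to that sum, which only works when every value is strictly positive.

```python
import math
from collections import defaultdict

# Same sample rows as the datatable above: (productName, feature, probability).
rows = [
    ("A", "w", 0.2),
    ("A", "z", 0.8),
    ("B", "w", 0.2),
    ("B", "z", 0.8),
    ("B", "x", 0.3),
]

# Sum the base-10 logarithms per product name (requires values > 0).
log_sums = defaultdict(float)
for name, _feature, probability in rows:
    log_sums[name] += math.log10(probability)

# 10 ** (sum of logs) equals the product of the original values,
# up to floating-point rounding.
products = {name: 10 ** total for name, total in log_sums.items()}

for name, value in sorted(products.items()):
    print(name, round(value, 6))
```

The results agree with the table above only up to floating-point rounding, so compare them with a tolerance rather than with exact equality.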

To aggregate with an arbitrary function, you can use the scan operator.
See the cumulative sum example in the scan operator documentation.
Example with multiplication:
datatable (productName: string, feature: string, probability_for_feature: double)
[
'A', 'w', 0.2,
'A', 'z', 0.8,
'B', 'w', 0.2,
'B', 'z', 0.8,
'B', 'x', 0.3,
]
| sort by productName asc
| partition by productName
(
// for every productName scan all rows
scan declare (probability: double= 1.0) with
(
// multiply probability for every row and return last result
step s1 output=last: true => probability = probability_for_feature * s1.probability;
)
)
| project productName, probability
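Carrying state from row to row within each partition corresponds to a fold (reduce) per group. As a rough Python analogue, and not a description of how scan is implemented, the hypothetical helper fold_by_key below applies any two-argument function across the values of each key, starting from an initial state that plays the role of the declared default of 1.0.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter, mul

# Same sample rows as above: (productName, feature, probability).
rows = [
    ("A", "w", 0.2),
    ("A", "z", 0.8),
    ("B", "w", 0.2),
    ("B", "z", 0.8),
    ("B", "x", 0.3),
]


def fold_by_key(rows, func, initial):
    """Fold the third field of each row with func, separately per first field."""
    # groupby only groups adjacent rows, so sort by the key first.
    ordered = sorted(rows, key=itemgetter(0))
    return {
        key: reduce(func, (row[2] for row in group), initial)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


# Multiplication as the folding function, starting from 1.0.
probabilities = fold_by_key(rows, mul, 1.0)
print(probabilities)
```

Swapping mul for another two-argument function (for example max, with a suitable initial value) gives a different aggregate without changing the helper.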

Related

presto sql query for getting the fill rate of the table

I want a generic query that returns the fill rate of every column in a table. The query should work regardless of the number of columns. I have to implement this in Presto SQL. I have searched for a method, but nothing seems to work.
Input
A     B     C     D
1     null  null  1
2     2     3     4
null  null  null  5
Output
A     B     C     D
0.66  0.33  0.33  1.0
Explanation:
Column A has 3 rows with 2 non-null values, so 2/3.
Columns B and C each have 2 null values and 1 non-null value, so 1/3.
Column D has no null values, so 3/3.
Thanks in advance.
AFAIK Presto/Trino does not provide dynamic query execution capabilities (i.e. something like EXEC in T-SQL), so the only option (unless you are ready to go down the user-defined function road) is to write a query that enumerates all the needed columns. If you are running the query from a client in another language, you can build it dynamically using the information in information_schema.columns:
with dataset(A, B, C, D) as (
values (1, null, null, 1),
(2, 2, 3, 4),
(Null, Null, Null, 5)
)
select 1.00 * count_if(a is not null) / count(*) a,
1.00 * count_if(b is not null) / count(*) b,
1.00 * count_if(c is not null) / count(*) c,
1.00 * count_if(d is not null) / count(*) d
from dataset;
Output:
a     b     c     d
0.67  0.33  0.33  1.00
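Building the query from a client could look roughly like the Python sketch below. It is only an illustration under assumptions: build_fill_rate_query and quote_ident are hypothetical helpers, and the column list is hard-coded here, whereas in practice it would come from a query against information_schema.columns run through whatever Presto/Trino client you use.

```python
def quote_ident(name):
    """Quote an identifier with double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_fill_rate_query(table, columns):
    """Return a query that computes the non-null ratio of every listed column.

    `table` is inserted as given, so it must come from a trusted source.
    """
    if not columns:
        raise ValueError("at least one column is required")
    select_list = ",\n       ".join(
        f"1.00 * count_if({quote_ident(col)} is not null) / count(*) as {quote_ident(col)}"
        for col in columns
    )
    return f"select {select_list}\nfrom {table}"


# Assumption: these names were fetched beforehand, e.g. with
#   select column_name from information_schema.columns
#   where table_schema = 'my_schema' and table_name = 'my_table'
# (my_schema and my_table are placeholders).
columns = ["a", "b", "c", "d"]
query = build_fill_rate_query("dataset", columns)
print(query)
```

The generated text has the same shape as the hand-written query above, with one count_if expression per column.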

How do you allocate an amount in blank to the rest proportionately?

I have a table with a list of markets and the corresponding amounts for those markets:
Market   Amount
A        10
B        30
C        50
D        10
(blank)  10
I would like the $10 in the blank market to be allocated to the other markets proportionately, based on their amounts excluding the blank market (e.g. amount(A) / sum(A+B+C+D)).
The desired output is:
Market
Amount
A
11
B
33
C
55
D
11
I think I can query it using multiple CTEs, but wanted to see if it's possible to allocate using as few CTEs as possible or not using CTE at all.
So with this CTE just for data:
with data(market, amount) as (
select * from values
('A', 10),
('B', 30),
('C', 50),
('D', 10),
(null, 10)
)
we can:
select d.*
,sum(iff(d.market is null, d.amount,null)) over() as to_spread
,sum(iff(d.market is not null, d.amount,null)) over() as total
,div0(d.amount, total) as part
,part * to_spread as bump
,d.amount + bump as result
from data as d
qualify market is not null
to get:
MARKET  AMOUNT  TO_SPREAD  TOTAL  PART  BUMP  RESULT
A       10      10         100    0.1   1     11
B       30      10         100    0.3   3     33
C       50      10         100    0.5   5     55
D       10      10         100    0.1   1     11
We can then fold a few of those steps up:
select d.*
,d.amount + div0(d.amount, sum(iff(d.market is not null, d.amount,null)) over()) * sum(iff(d.market is null, d.amount,null)) over() as result
from data as d
qualify market is not null
MARKET  AMOUNT  RESULT
A       10      11
B       30      33
C       50      55
D       10      11
Note that these results are fixed-point numbers, so truncation in the division can lose part of the amount being spread. The lost remainder could be distributed fairly, but that might require a second pass.
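To make the arithmetic of the query explicit, here is a small Python sketch of the same allocation. It is not part of the original answer; allocate_blank is a hypothetical helper, and it uses exact fractions so that no part of the amount is lost to truncation.

```python
from fractions import Fraction

# Same data as the CTE above; None stands for the blank market.
rows = [("A", 10), ("B", 30), ("C", 50), ("D", 10), (None, 10)]


def allocate_blank(rows):
    """Spread the blank-market amount over named markets in proportion to their amounts."""
    to_spread = sum(amount for market, amount in rows if market is None)
    total = sum(amount for market, amount in rows if market is not None)
    result = {}
    for market, amount in rows:
        if market is None:
            continue
        # Mirrors div0: a zero total yields a zero share instead of an error.
        part = Fraction(amount, total) if total else Fraction(0)
        result[market] = amount + part * to_spread
    return result


allocated = allocate_blank(rows)
for market, amount in allocated.items():
    print(market, amount)
```

Because the shares are exact, the allocated amounts always add up to the original grand total; rounding them to a fixed number of decimals is where a second pass for the leftover would come in.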

Combinations in Pandas Python (more than 2 unique)

I have a dataframe where each row has a particular activity of a user:
UserID Purchased
A Laptop
A Food
A Car
B Laptop
B Food
C Food
D Car
Now I want to find all the unique combinations of purchased products and number of unique users against each combination. My data set has around 8 different products so doing it manually is very time consuming. I want end result to be something like:
Number of products  Products         Unique count of Users
1                   Food             1
1                   Car              1
2                   Laptop,Food      1
3                   Car,Laptop,Food  1
import pandas as pd

# updated sample data
d = {'UserID': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C', 6: 'D', 7: 'C'},
'Purchased': {0: 'Laptop',
1: 'Food',
2: 'Car',
3: 'Laptop',
4: 'Food',
5: 'Food',
6: 'Car',
7: 'Laptop'}}
df = pd.DataFrame(d)
# groupby user id and combine the purchases to a tuple
new_df = df.groupby('UserID').agg(tuple)
# list comprehension to sort your grouped purchases
new_df['Purchased'] = [tuple(sorted(x)) for x in new_df['Purchased']]
# groupby purchases and get the count, which is the number of users for each combination
final_df = new_df.reset_index().groupby('Purchased').agg('count').reset_index()
# get the length of each tuple, which is the number of products in the combination
final_df['num_of_prod'] = final_df['Purchased'].apply(len)
# rename the columns
final_df = final_df.rename(columns={'UserID': 'user_count'})
             Purchased  user_count  num_of_prod
0               (Car,)           1            1
1  (Car, Food, Laptop)           1            3
2       (Food, Laptop)           2            2
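The same grouping logic can be written with only the Python standard library, which may help to see what each pandas step does. This is a sketch on the updated sample data, with illustrative names (purchases, baskets, combo_counts, summary): collect each user's products into a set, turn the set into a sorted tuple, then count how many users share each tuple.

```python
from collections import Counter, defaultdict

# Updated sample data from above as (UserID, Purchased) pairs.
purchases = [
    ("A", "Laptop"), ("A", "Food"), ("A", "Car"),
    ("B", "Laptop"), ("B", "Food"),
    ("C", "Food"), ("D", "Car"), ("C", "Laptop"),
]

# Step 1: one set of products per user (duplicates collapse automatically).
baskets = defaultdict(set)
for user, product in purchases:
    baskets[user].add(product)

# Step 2: sort each set into a tuple so equal combinations compare equal,
# then count the users per combination.
combo_counts = Counter(tuple(sorted(items)) for items in baskets.values())

# Step 3: (number of products, combination, number of users), sorted.
summary = sorted((len(combo), combo, users) for combo, users in combo_counts.items())

for num_products, combo, users in summary:
    print(num_products, ",".join(combo), users)
```

Unlike the pandas version, this collapses repeated purchases of the same product by one user, which matters only if the data contains such duplicates.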

Populate column based on row values BigQuery Standard SQL

I have a table, let's say:
Name     A    B    C    D
-------  ---  ---  ---  ---
alpha    0    1    0    0.6
beta     0.6  0    0    0.1
gama     0    0    0    0.4
Now I want to populate two columns (Result and Class) based on the values of A, B, C and D.
The condition is: if the value in any of the fields (A, B, C, D) is > .5, then the Result column should contain "F", otherwise "P". In addition, Class should list the columns whose values are > .5 (for example "A,D").
For better understanding, here is the result I want:
Name     A    B    C    D    Result  Class
-------  ---  ---  ---  ---  ------  -----
alpha    0    1    0    0.6  F       B,D
beta     0.6  0    0    0.1  F       A
gama     0    0    0    0.4  P       NULL
I am new to BigQuery and need help. What would be a workaround?
This is what I have done so far:
SELECT *, CASE WHEN (A > .5 OR B > .5 OR C > .5 OR D >.5)
THEN 'F'
ELSE 'P' END AS Result AND Class....//here i am stuck
FROM table1
Actually, I have no idea how to build this exact script. I was able to achieve the first part, populating the Result column with "F" and "P", but could not get the Class column to list the column names.
Since you are analysing each column, I assume you do not have a large number of columns. Therefore, I created a simple JavaScript User Defined Function (UDF) that checks each row's values and returns the names of the columns for which the condition is met.
I used the provided sample data to test the query below.
#javaScript UDF
CREATE TEMP FUNCTION class(A FLOAT64, B FLOAT64, C FLOAT64, D FLOAT64)
RETURNS String
LANGUAGE js AS """
var class_array=[];
if(A > 0.5){class_array.push("A");}
if(B > 0.5){class_array.push("B");}
if(C > 0.5){class_array.push("C");}
if(D > 0.5){class_array.push("D");}
return class_array;
""";
#sample data
WITH data as (
SELECT "alpha" as Name, 0 as A, 1 as B, 0 as C, 0.6 as D UNION ALL
SELECT "beta", 0.6, 0, 0, 0.1 UNION ALL
SELECT "gama", 0, 0, 0, 0.4
)
Select name, A,B,C,D,
CASE WHEN (A > .5 OR B > .5 OR C > .5 OR D >.5) THEN "F" ELSE "P" END AS Result,
IF(class(A,B,C,D) is null , null, class(A,B,C,D)) as Class from data
And the output:
Row  name   A    B  C  D    Result  Class
1    alpha  0    1  0  0.6  F       B,D
2    beta   0.6  0  0  0.1  F       A
3    gama   0    0  0  0.4  P
As shown in the UDF, each row's values are checked and, if the condition is met, the column's name is added to an array of strings. Note that the JS UDF is declared to return a String, not an array, so the array is automatically converted to a comma-separated string.
Lastly, I should point out that it is not possible to retrieve the column name within a query in this context. You can, however, retrieve it in other scenarios using INFORMATION_SCHEMA.
Below is for BigQuery Standard SQL.
A JavaScript UDF helps in many cases, but it should be avoided if the problem can be solved with plain SQL, as in the example below:
#standardSQL
SELECT *,
( SELECT IF(LOGICAL_OR(val > 0.5), 'F', 'P')
FROM UNNEST([A,B,C,D]) val
) AS Result,
( SELECT STRING_AGG(['A','B','C','D'][OFFSET(pos)])
FROM UNNEST([A,B,C,D]) val WITH OFFSET pos
WHERE val > 0.5
) AS Class
FROM `project.dataset.table`
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'alpha' name, 0 A, 1 B, 0 C, 0.6 D UNION ALL
SELECT 'beta', 0.6, 0, 0, 0.1 UNION ALL
SELECT 'gamma', 0, 0, 0, 0.4
)
SELECT *,
( SELECT IF(LOGICAL_OR(val > 0.5), 'F', 'P')
FROM UNNEST([A,B,C,D]) val
) AS Result,
( SELECT STRING_AGG(['A','B','C','D'][OFFSET(pos)])
FROM UNNEST([A,B,C,D]) val WITH OFFSET pos
WHERE val > 0.5
) AS Class
FROM `project.dataset.table`
with output:
Row  name   A    B  C  D    Result  Class
1    alpha  0.0  1  0  0.6  F       B,D
2    beta   0.6  0  0  0.1  F       A
3    gamma  0.0  0  0  0.4  P       null
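The per-row logic of the two correlated subqueries can be paraphrased in Python. This is an illustrative sketch only; classify is a hypothetical helper, and the column names and the 0.5 threshold are taken from the question. It pairs each value with its column name, keeps the names whose value exceeds the threshold, and derives Result and Class from that list.

```python
def classify(row, columns=("A", "B", "C", "D"), threshold=0.5):
    """Return (Result, Class) for one row given as a dict of column values."""
    # Equivalent of UNNEST([A,B,C,D]) WITH OFFSET plus WHERE val > 0.5.
    over = [name for name in columns if row[name] > threshold]
    result = "F" if over else "P"
    # STRING_AGG over zero rows yields NULL, modelled here as None.
    class_names = ",".join(over) if over else None
    return result, class_names


table = [
    {"name": "alpha", "A": 0, "B": 1, "C": 0, "D": 0.6},
    {"name": "beta", "A": 0.6, "B": 0, "C": 0, "D": 0.1},
    {"name": "gamma", "A": 0, "B": 0, "C": 0, "D": 0.4},
]

results = {row["name"]: classify(row) for row in table}
for name, (result, class_names) in results.items():
    print(name, result, class_names)
```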

In SQL, how to calculate standard deviation with a row excluded from the calculation?

I have a number of rows with ID, Group and Value columns. I'd like to calculate the standard deviation of the Value column for each Group. For each row, the calculation has to exclude that row from its group, and the result is then assigned to the row. How can I achieve this?
Many thanks.
EDIT: I am using MS SQL Server 2008 R2
EDIT2:
Suppose we have a table
ID  Group  Value
1   A      2.5
2   A      4.1
3   B      3.8
4   B      11.2
5   B      15.4
6   C      0.8
7   C      7.1
8   C      1.0
9   B      5.2
10  A      6.9
The expected output is
ID  Group  Value  Std (pseudo values)
1   A      2.5    xxx
2   A      4.1    xxx
3   B      3.8    xxx
4   B      11.2   xxx
5   B      15.4   xxx
6   C      0.8    xxx
7   C      7.1    xxx
8   C      1.0    xxx
9   B      5.2    xxx
10  A      6.9    xxx
The standard deviation of a group is assigned to each of its rows in the Std column. But to ensure independence, the row itself is excluded from the calculation, i.e. std_x1 = STD(x2, x3, x4, ...).
To calculate STD under the assumption std_x1 = STD(x2, x3, x4, ...) and std_x2 = STD(x3, x4, x5, ...), you can use this query (window frames require SQL Server 2012 or later, so it will not run on 2008 R2):
SELECT t.ID,
t.[Group],
t.Value,
STDEV(t.Value) OVER (PARTITION BY t.[Group] ORDER BY t.ID ASC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS std
FROM tbl t
To calculate STD under the assumption std_x1 = STD(x2, x3, x4, ...) and std_x2 = STD(x1, x3, x4, x5, ...), you can use this query:
SELECT t.ID,
t.[Group],
t.Value,
(SELECT STDEV(t1.value)
FROM tbl t1
WHERE t1.[Group] = t.[Group]
AND t1.ID <> t.ID) AS Std
FROM tbl t
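For checking the numbers, the second (leave-one-out) variant can be reproduced in Python. This is a sketch under assumptions: leave_one_out_stdev is a hypothetical helper, and statistics.stdev is used because it computes the sample standard deviation, like STDEV in SQL Server. When fewer than two other rows remain in a group, the sketch returns None, where the SQL query would return NULL.

```python
from collections import defaultdict
from statistics import stdev

# Sample table from the question: (ID, Group, Value).
rows = [
    (1, "A", 2.5), (2, "A", 4.1), (3, "B", 3.8), (4, "B", 11.2),
    (5, "B", 15.4), (6, "C", 0.8), (7, "C", 7.1), (8, "C", 1.0),
    (9, "B", 5.2), (10, "A", 6.9),
]


def leave_one_out_stdev(rows):
    """Map each ID to the sample std dev of the other values in its group."""
    groups = defaultdict(list)
    for row_id, group, value in rows:
        groups[group].append((row_id, value))

    result = {}
    for row_id, group, _value in rows:
        # Same filter as the correlated subquery: same group, different ID.
        others = [v for other_id, v in groups[group] if other_id != row_id]
        result[row_id] = stdev(others) if len(others) >= 2 else None
    return result


loo = leave_one_out_stdev(rows)
for row_id, group, value in rows:
    print(row_id, group, value, loo[row_id])
```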