min over one dimension followed by max over another dimension - sql

I have an SQL table that looks like this:
i j x
0 0 0.5
0 1 1.0
0 2 1.5
1 0 1.4
1 1 1.3
1 2 1.2
and so on. I would like to take the average over the j dimension followed by the minimum over the i dimension. In this case, taking the average over the j dimension produces the following:
i x
0 1.0
1 1.3
Taking the minimum over the i dimension then produces the value 1.0, which is the final result. Is there an efficient way to perform a query like the one in this example, i.e., a query in which a sequence of dimension reduction operations is performed in a specified order?
Note that if we reverse the order of operations, the intermediate result is
j x
0 0.5
1 1.0
2 1.2
Taking the average over the j dimension produces a final result of 0.9. Thus, the order of operations is important.
Phillip
http://phillipmfeldman.org

You can do it with a subquery, of course:
SELECT MIN(avg_over_j) FROM (
SELECT i, AVG(x) AS avg_over_j
FROM TheTable
GROUP BY i
) AS t -- most engines require an alias for the derived table
But this isn't APL or the J language; there are no "dimension reduction operations".
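For comparison, the same chained reductions written in pandas (a sketch outside the thread, assuming a frame with columns i, j, x as in the question):

import pandas as pd

df = pd.DataFrame({
    "i": [0, 0, 0, 1, 1, 1],
    "j": [0, 1, 2, 0, 1, 2],
    "x": [0.5, 1.0, 1.5, 1.4, 1.3, 1.2],
})

# Average over j (i.e. within each i), then min over i.
result = df.groupby("i")["x"].mean().min()
print(result)  # 1.0

Reversing the chain, df.groupby("j")["x"].min().mean(), gives 0.9, matching the order sensitivity the question describes.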

Snowflake CEIL Function - round up to next 0.1 kilometer

I have a column containing measurement values in meters.
I want to round them up (ceil) to the next 100 m and return the result as a km value.
The special thing is: if the original value is already a "round" number (a 100 m increment), it should still be ceiled up to the next 100 m increment (see line 3 in the example below).
Example:
meter_value kilometer_value
1111 1.2
111 0.2
1000 1.1
I think I can get the first two lines by doing:
ceil(meter_value/1000,1) as kilometer_value
The solution I thought of to fix the edge case in line three is to just add 1 meter always:
ceil((meter_value+1)/1000,1) as kilometer_value
It seems a bit clumsy; is there a better way/alternative function to achieve this?
You can check to see if it's divisible by 100 and only add one if it is:
ceil(((meter_value + iff(meter_value % 100 = 0, 1, 0))/1000), 1)
This avoids the problem with always adding 1: if decimal parts are allowed, a value like 999.5 would be pushed over the boundary and rounded to 1.1 km instead of the correct 1.0 km.
Greg's answer is good; simpler to read, to me, would be to:
divide by 100
floor
add 1
ceil
divide by 10
select
column1 as meter_value
,ceil(((meter_value + iff(meter_value % 100 = 0, 1, 0))/1000), 1) as greg
,ceil(floor(meter_value/100)+1)/10 as simeon
from values
(1111)
,(111)
,(1000)
,(1)
,(0)
;
METER_VALUE  GREG  SIMEON
1,111        1.2   1.2
111          0.2   0.2
1,000        1.1   1.1
1            0.1   0.1
0            0.1   0.1
Do we want to mention negative values? I mean, it's a distance, so it's a directionless magnitude, right?
Anyway, with negative values the +1 in both our methods forces the boundary case to be wrong.
Actually: once you have floored, and you add 1 or 0.1 depending on whether you divide by 100 or by 1000 first, you don't need to ceil at all.
Thus two short forms can be:
,ceil(floor(meter_value/100)+1)/10 as version_a
,(floor(meter_value/100)+1)/10 as version_b
,floor(meter_value/1000,1)+0.1 as version_c
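A quick Python check of the variants against the sample inputs (a sketch outside the thread; math.ceil and math.floor stand in for Snowflake's CEIL(x, 1) and FLOOR(x, 1) by working at a shifted scale):

import math

for m in [1111, 111, 1000, 1, 0]:
    add_one = 1 if m % 100 == 0 else 0            # Greg's guard for exact 100 m values
    greg = math.ceil((m + add_one) / 100) / 10    # ceil((m + guard)/1000, 1)
    version_b = (math.floor(m / 100) + 1) / 10    # floor, add 1, divide by 10
    version_c = round(math.floor(m / 100) / 10 + 0.1, 1)  # floor(m/1000, 1) + 0.1, rounded to hide float noise
    print(m, greg, version_b, version_c)

All three agree with the table above on these inputs; as the comments note, none of them handles negative values at the boundary.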

How can I match a sequence of numbers with the specific columns based on their names in BigQuery, with the SQL query result grouped by this sequence?

I am working with two tables in BigQuery and I need to achieve the following result:
ret_d  ad_arpu  pur_arpu
d1     x        x
d2     y        y
dx     etc      etc
A bit of clarification: ARPU is average revenue per user, which in this case consists of revenue from purchases (pur_arpu) and from ads (ad_arpu).
I need to return pur_arpu and ad_arpu next to the corresponding day on which the customer was retained, i.e. retention day #x (ret_dx). ARPU is revenue divided by the number of customers over a certain period of time; I need ARPU per user per retained day.
In the table I am working with I have the following relevant columns:
C_id  ret_d1  ret_d2  ret_dx  ad_rev_d1  ad_rev_d2  ad_rev_dx  pur_rev_d1  pur_rev_d2  pur_rev_dx
id1   0 OR 1  0 OR 1  0 OR 1  x          x          x          x           x           x
id2   0 OR 1  0 OR 1  0 OR 1  y          y          y          y           y           y
idx   0 OR 1  0 OR 1  0 OR 1  etc        etc        etc        etc         etc         etc
c_id - customer id,
ret_dx - retained day #x; 1 stands for the user being retained on that day,
ad_rev_dx - revenue from ads on dx,
pur_rev_dx - revenue from purchases on dx.
My idea for the SQL query was to return a sequence of numbers, ad_arpu, and pur_arpu from a subquery. So, something along these lines:
SELECT
["d1", "d2", "dx"] AS ret_d,
ad_arpu AS ad_arpu,
pur_arpu AS pur_arpu
FROM
(SELECT
SUM(ad_rev_d1)/SUM(ret_d1) AS ad_arpu_d1,
SUM(ad_rev_d2)/SUM(ret_d2) AS ad_arpu_d2,
SUM(ad_rev_dx)/SUM(ret_dx) AS ad_arpu_dx,
SUM(pur_rev_d1)/SUM(ret_d1) AS pur_arpu_d1,
SUM(pur_rev_d2)/SUM(ret_d2) AS pur_arpu_d2,
SUM(pur_rev_dx)/SUM(ret_dx) AS pur_arpu_dx
--somehow match ad_arpu_dx and pur_arpu_dx to the corresponding days from the ret_d sequence to get ad_arpu and pur_arpu
FROM tablex)
But so far my efforts have been getting me nowhere :( I'm missing how to match the ARPU numbers to the day sequence numbers.
I really appreciate any tips that could help with this!
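The matching step is essentially an unpivot: compute one aggregate per day, then turn the day label into a row key. Here is a sketch of that reshaping outside the thread, in pandas with hypothetical data shaped like tablex (in BigQuery itself, the UNPIVOT operator or a UNION ALL of one SELECT per day achieves the same effect):

import pandas as pd

# Hypothetical miniature of tablex: one row per customer.
tablex = pd.DataFrame({
    "ret_d1":     [1, 0, 1],
    "ret_d2":     [1, 1, 0],
    "ad_rev_d1":  [0.5, 0.0, 0.3],
    "ad_rev_d2":  [0.2, 0.4, 0.0],
    "pur_rev_d1": [1.0, 0.0, 0.0],
    "pur_rev_d2": [0.0, 2.0, 0.0],
})

rows = []
for d in ["d1", "d2"]:                       # one output row per retention day
    retained = tablex[f"ret_{d}"].sum()      # number of users retained on that day
    rows.append({"ret_d": d,
                 "ad_arpu":  tablex[f"ad_rev_{d}"].sum() / retained,
                 "pur_arpu": tablex[f"pur_rev_{d}"].sum() / retained})
print(pd.DataFrame(rows))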

Distribute numbers as evenly as possible

This seems to be a 2 step problem I'm trying to solve.
Let's say we have N records, and we are trying to distribute them as evenly as possible into K groups.
The second problem: each group in K can only accept at most M records.
For example, if we have 5 records, and 3 groups, then we would distribute 2 into Group K1, 2 into Group K2 and 1 record into Group K3. However, if say in group 1, it only accepts at most 1 record. Then the arrangement would need to be 1 into Group K1, 2 into Group K2, and 2 into Group K3.
I'm not necessarily after the solution, but what algorithm might I need to use to solve this? Apparently for the distribution I need to use a greedy algorithm? But the second step seems to be a bit more complicated.
Edit:
The example I'm looking at is:
Number of records: 23
Groups: 10
Max records for each group
G1 = 4
G2 = 1
G3 = 0
G4 = 5
G5 = 0
G6 = 0
G7 = 2
G8 = 4
G9 = 2
G10 = 2
If N=12 and K=3 then, in the normal situation, you just split it as V = 12/3 = 4 for each group. But since you have the M limitation, and for example K3 can only accept 1, the distribution can become 6-5-1, which is not evenly distributed.
So I guess you need to sort K based on the M limitation; for the example above the group order becomes K3-K1-K2.
Then, if the distributed value V is bigger than the accepted amount M for that group, you need to take the remainder and distribute it again over the remaining groups (K3 = 1, so the remaining 4 - 1 = 3 must be distributed to K1 and K2).
The implementation might be complicated; I hope you can find a simpler solution for this.
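A minimal Python sketch of that approach (my own illustration, not from the thread; groups without a limit can be given a very large capacity):

def distribute(n, limits):
    """Spread n records over groups, visiting the tightest limits first."""
    order = sorted(range(len(limits)), key=lambda g: limits[g])
    alloc = [0] * len(limits)
    remaining = n
    for pos, g in enumerate(order):
        groups_left = len(order) - pos
        fair_share = -(-remaining // groups_left)  # ceiling division
        alloc[g] = min(fair_share, limits[g])      # cap at the group's limit
        remaining -= alloc[g]
    return alloc, remaining  # remaining > 0 means the limits cannot hold all n

print(distribute(5, [1, 5, 5]))   # ([1, 2, 2], 0), matching the question's example
print(distribute(23, [4, 1, 0, 5, 0, 0, 2, 4, 2, 2]))
# ([4, 1, 0, 5, 0, 0, 2, 4, 2, 2], 3): the limits in the edit sum to 20,
# so 3 of the 23 records cannot be placed by any algorithm.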
From what I understood, you need to separate out all groups that allow only a fixed number of values first, and then equally distribute the records among the remaining groups. Let's take an example: say we have 15 records which need to be distributed among 5 groups (G1, G2, G3, G4 and G5). Also let's assume that G2 and G4 allow max records of 2 and 4 respectively. Now the algorithm should go like this:
Get the average (ceiling integer) of records based on the number of groups (in this example we get 3).
Add up all max-allowed record counts which are smaller than our average (in this example it's only G2, whose max limit of 2 is less than our average, so the number comes to 2).
Now subtract the number from step 2 from the total records, and also subtract the number of groups involved in step 2 from the total groups (remaining total records: 13, remaining total groups: 4).
Get the new average (ceiling integer) using the remaining records and groups (new average: 4).
Allot that new average to each remaining group except the last one (here three groups get 4 each).
Get the leftover, 13 - 3×4 = 1, and allot that number to the last group.
Now what we finally will have here:
G1(No limit): 4
G2(Limit 2): 2
G3(No limit): 4
G4(Limit 4): 4
G5(No limit): 1
Let me know if you think that this algo might fail for some scenarios.
Formula to get ceiling integer average
floor((#total_records + #total_groups-1) / #total_groups)
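That expression is the standard integer trick for a ceiling average; a quick Python check (a sketch):

n, k = 13, 4                 # remaining records and groups from the example
print((n + k - 1) // k)      # 4, i.e. ceil(13 / 4) without floating point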

Calculate diff() between selected rows

I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
time bit
index
0 0.24 0
1 0.245 0
2 0.47 1
3 0.471 1
4 0.479 0
5 0.58 1
... ... ...
I want to select those rows where the time difference is, let's say, < 0.01 s, but only the differences between rows with bit 1 and bit 0. So in the above example I would select only rows 3 and 4 (or either one of them). I thought I would calculate the diff() of the time column, but I need to somehow also select on the 0/1 bit.
Coming from the future to answer this one. You can write a function that finds the indices of the rows that meet the condition and returns the row pairs accordingly:
def filter_(df, threshold=0.01):
    # Rows where the time gap to the previous row is below the threshold
    # and the bit flipped between those rows.
    indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
    # Keep each matching row together with the row that precedes it.
    return df.loc[indices.union(indices - 1)]

print(filter_(df, threshold=0.01))
Output:
time bit
3 0.471 1
4 0.479 0
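For reference, a self-contained setup that reproduces the frame from the question (values copied from the example above):

import pandas as pd

df = pd.DataFrame(
    {"time": [0.24, 0.245, 0.47, 0.471, 0.479, 0.58],
     "bit": [0, 0, 1, 1, 0, 1]})
df.index.name = "index"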

Calculate statistical distributions from a column in a datatable

I don't know if there is a specific method in VB.Net to calculate the statistical distribution from an array of values, like the Frequency() formula in Excel. If not, what is the easiest and fastest way of doing the same thing?
For example, I have a DataTable with my values in a column called "Cement Deviation":
Cement Deviation
0
14
11
2
6
1
16
14
5
21
The bands in which I want to know the frequency of these values are:
From minValue To -50 by Step of 10
From -50 To -10 by Step of 5
From -10 To -5 by Step of 1
From -5 To -1 by Step of 0.5
From -1 To -0.5 by Step of 0.1
From -0.5 To -0.1 by Step of 0.05
From -0.1 To 0.1 by Step of 0.01
From 0.1 To 0.5 by Step of 0.05
From 0.5 To 1 by Step of 0.1
From 1 To 5 by Step of 0.5
From 5 To 10 by Step of 1
From 10 To 50 by Step of 5
From 50 To maxValue by Step of 10
Can someone help me with it?
Thanks
I don't know how you calculate it, since my experience with statistical distributions is limited and you haven't mentioned the way you want to calculate it.
However, this does at least compile:
Dim stat(2) As Integer ' three counters (indices 0 to 2); only the first two are used here
For Each row As DataRow In gridView.Rows
    Dim cementDeviation = row.Field(Of Int32)("Cement Deviation")
    Select Case cementDeviation
        Case 0 To 10 ' a value of exactly 10 lands here, since the first matching Case wins
            stat(0) += 1
        Case 10 To 20
            stat(1) += 1
    End Select
Next
In general there's nothing wrong with looping the DataRows to calculate the values. But you should set Option Strict to On; then your code would not compile, because row("Cement Deviation") is an Object, not an Integer. The good thing is that you are forced to use the correct types, which prevents nasty runtime errors.
Edit: Here is an example of how you could use dynamic ranges and count each class with LINQ. I have used a DataTable to store the min and max values, but you could also use a different in-memory collection like a List(Of CustomClass), or even better, the database.
You could also simply loop over the table, but you wanted to see a different approach. I like LINQ since it can reduce complexity and increase readability:
The range table with sample data:
Dim rangeTable = New DataTable()
rangeTable.Columns.Add("Min", GetType(Int32))
rangeTable.Columns.Add("Max", GetType(Int32))
For i = 0 To 90 Step 10
    rangeTable.Rows.Add(i, i + 10)
Next
A single LINQ query to calculate all occurrences for every range, even ordered descending:
Dim stats =
    From rangeRow In rangeTable
    Let min = rangeRow.Field(Of Int32)("Min")
    Let max = rangeRow.Field(Of Int32)("Max")
    Select StatsInfo = New With {
        .Min = min, .Max = max,
        .Count = (From devRow In devTable
                  Let cementDeviation = devRow.Field(Of Int32)("Cement Deviation")
                  Where cementDeviation >= min AndAlso cementDeviation < max).Count() ' half-open [min, max) so boundary values are counted once
    }
    Order By StatsInfo.Count Descending
Output the result:
For Each stat In stats
Console.WriteLine("Min: {0} Max: {1} Count: {2}", stat.Min, stat.Max, stat.Count)
Next
Note that I've renamed your DataTable to devTable, since gridView is not a good name for it.
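As a footnote: the variable-width bands from the question are just a list of bin edges, so the Frequency()-style counting can be sanity-checked with a short script. A sketch in Python/NumPy (my own illustration, not part of the thread; the open-ended bands below -50 and above 50 are omitted):

import numpy as np

values = np.array([0, 14, 11, 2, 6, 1, 16, 14, 5, 21])  # sample data from the question

# (start, stop, step) segments copied from the question's band list.
segments = [(-50, -10, 5), (-10, -5, 1), (-5, -1, 0.5), (-1, -0.5, 0.1),
            (-0.5, -0.1, 0.05), (-0.1, 0.1, 0.01), (0.1, 0.5, 0.05),
            (0.5, 1, 0.1), (1, 5, 0.5), (5, 10, 1), (10, 50, 5)]
edges = np.concatenate([np.arange(start, stop, step) for start, stop, step in segments])
edges = np.append(edges, 50.0)  # close the last band

counts, _ = np.histogram(values, bins=edges)  # counts per band [lo, hi)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    if c > 0:
        print(f"[{lo:g}, {hi:g}): {c}")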