I have one table that I need to get some metrics from.
For example I have the following table:
meas_count  skippings  links  extra
10          8          4.2    some
10          9          5.8    some
10          9          5.8    some_2
11          8          4.2    some
11          8          5.8    some
11          9          5.9    some
I need to get a view of an existing table in the following form for further work:
meas_count  skippings  links_min  links_max
10          8          0          4
10          8          4          5
10          8          5          6
10          9          0          4
10          9          4          5
10          9          5          6
11          8          0          4
11          8          4          5
11          8          5          6
11          9          0          4
11          9          4          5
11          9          5          6
At the moment I have two queries whose results I need to combine to get the result above.
First query:
SELECT meas_count, skippings FROM current_stats GROUP BY meas_count, skippings
Creates the following:
meas_count  skippings
10          8
10          9
11          8
11          9
Second query:
SELECT
LAG(rounded) OVER (ORDER BY rounded) as links_min,
rounded as links_max FROM
(SELECT * FROM
(SELECT ROUND(links, 1) as rounded FROM current_stats)
GROUP BY rounded ORDER BY rounded)
Creates the following:
links_min  links_max
NULL       4
4          5
5          6
I need something like the Cartesian product of the two result sets (every row of the first paired with every row of the second).
What query should I execute to get a result in the form shown above?
I also have an additional question: is the execution of the second query slowed down by the nested SELECTs?
You can do that by doing an INNER JOIN on the two subqueries without specifying a join condition. That gives you every combination of the two sets of rows, i.e. their Cartesian product.
SELECT * FROM
(
    SELECT meas_count, skippings
    FROM current_stats
    GROUP BY meas_count, skippings
) AS one
INNER JOIN
(
    SELECT LAG(rounded) OVER (ORDER BY rounded) AS links_min,
           rounded AS links_max
    FROM (SELECT * FROM
             (SELECT ROUND(links, 1) AS rounded FROM current_stats)
          GROUP BY rounded
          ORDER BY rounded)
) AS two;
As for performance, that's really only an issue if there is a better way to do it. Nested SELECTs do add some work, but the query optimizers in today's SQL engines are pretty good at determining what you MEANT from what you SAID, so the nesting itself is unlikely to be the bottleneck.
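For what it's worth, some engines reject an INNER JOIN without an ON clause and expect every derived table to carry an alias. In those dialects the same Cartesian product can be spelled out as an explicit CROSS JOIN; the following is just a sketch of the same idea against current_stats, with the GROUP BY rounded replaced by the equivalent DISTINCT:

SELECT one.meas_count,
       one.skippings,
       two.links_min,
       two.links_max
FROM (SELECT meas_count, skippings
      FROM current_stats
      GROUP BY meas_count, skippings) AS one
CROSS JOIN
     (SELECT LAG(rounded) OVER (ORDER BY rounded) AS links_min,
             rounded AS links_max
      FROM (SELECT DISTINCT ROUND(links, 1) AS rounded
            FROM current_stats) AS r) AS two;

CROSS JOIN also makes the intent explicit: every row of "one" is paired with every row of "two".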
I'm trying to prepare my data to create a burndown visual. As you can see below, the Rate column isn't simply A - B, because it carries forward the previous value when B is NULL.
I've tried some CASE statements using LAG and SUM, but to no avail.
Some direction on the CASE statement, or a better solution, would be ideal.
For example, this is how my data looks:
ID  A   B
1   20  NULL
2   20  3
3   20  NULL
4   20  7
5   20  NULL
6   20  NULL
7   20  NULL
8   20  5
9   20  7
And I want a Rate column that looks like this:
ID  A   B     Rate
1   20  NULL  20
2   20  3     17
3   20  NULL  17
4   20  7     10
5   20  NULL  10
6   20  NULL  10
7   20  NULL  10
8   20  5     5
9   20  7     -2
Thanks to @Larnu for the guidance.
Here is the solution for the general case, when your data is partitioned by some group ID and ordered by some date or row ID.
SELECT
    GROUP_ID,
    ROW_ID,
    COL_A,
    COL_B,
    COL_A - SUM(ISNULL(COL_B, 0)) OVER (PARTITION BY GROUP_ID ORDER BY ROW_ID
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Rate
FROM your_table
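Applied directly to the sample data above (which has no grouping column), the same running-sum idea might look like the sketch below; the table name burndown is just a placeholder:

SELECT
    ID,
    A,
    B,
    -- A minus everything recorded in B so far; the running sum does not change on NULL rows,
    -- so the previous Rate carries forward automatically
    A - SUM(ISNULL(B, 0)) OVER (ORDER BY ID
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Rate
FROM burndown
ORDER BY ID;

For the rows above this yields 20, 17, 17, 10, 10, 10, 10, 5, -2, matching the expected Rate column.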
I tried a linear regression with BigQuery.
For that I used this test data:
nr1 nr2 x
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
11 11 11
12 12 12
With the following query I created a model:
CREATE MODEL `regression_model_9`
OPTIONS
(model_type='linear_reg',
input_label_cols=['x']) AS
SELECT
nr1,
nr2,
x
FROM
`reg_test`
After that I evaluated the model and wanted to make a prediction, as described here:
https://cloud.google.com/bigquery/docs/bigqueryml-analyst-start
So what do I have to do to get a prediction for 13?
With the following query I get "Query returned zero records":
SELECT
x
FROM
ML.PREDICT(MODEL `regression_model_9`,
(
SELECT
x,
nr1,
nr2
FROM
`reg_test`
where nr1=13
))
To get a prediction for 13: your query returns zero records because the input you pass to ML.PREDICT is SELECT ... FROM reg_test WHERE nr1=13, and reg_test has no row with nr1 = 13, so there is nothing to score. Supply the new feature values yourself instead of selecting them from the training table:
#standardSQL
SELECT *
FROM ML.PREDICT(MODEL `yourproject.yourdataset.regression_model_9`,
(SELECT 13 nr1, 13 nr2))
with a result something like this:
Row predicted_x nr1 nr2
1 12.999999982559942 13 13
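If you want predictions for several new values at once, you can build the input rows inline instead of reading them from the training table. A sketch, assuming the same two input features nr1 and nr2:

#standardSQL
SELECT predicted_x, nr1, nr2
FROM ML.PREDICT(MODEL `yourproject.yourdataset.regression_model_9`,
  (SELECT nr AS nr1, nr AS nr2
   FROM UNNEST([13, 14, 15]) AS nr))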
Hopefully the title is explicit enough.
I have a table looking like this:
classes id value
a 1 10
a 2 15
a 3 12
b 1 5
b 2 9
b 3 7
c 1 6
c 2 14
c 3 6
and here is what I would like :
classes id value cumsum
a 1 10 10
a 2 15 25
a 3 12 37
b 1 5 5
b 2 9 14
b 3 7 21
c 1 6 6
c 2 14 20
c 3 6 26
I've seen this solution, and I've already applied it successfully to cases where I don't have multiple classes:
id value cumsum
1 10 10
2 15 25
3 12 37
It was reasonably fast, even with datasets of size equivalent to the one I'm currently working on.
However, when I apply the exact same code to the dataset I'm working on now (which looks like the first table in this question, i.e. with multiple classes), without subsetting it by a, b, c, it seems to take ages (it's been running for 4 hours now; the dataset is 40,000 rows).
Any idea whether there is an issue with the code from the linked answer when used in this context? I have trouble wrapping my head around the triangular join, but I have the feeling the join may blow up in size as the number of rows increases, slowing the whole thing down, and that this may be made worse by having multiple "classes" over which to compute the cumulative sums.
Is there any way this could be done faster? I'm using SQL in R through the sqldf package. A solution in either R code (with or without a common external package) or SQL code will do.
Thanks
In SQL, you can do a cumulative sum using the ANSI standard sum() over () functionality:
select classes, id, value,
       sum(value) over (partition by classes order by id) as cumsum
from t;
Or you can use by from the base package:
df$cumsum <- unlist(by(df$value, df$classes, cumsum))
# classes id value cumsum
#1 a 1 10 10
#2 a 2 15 25
#3 a 3 12 37
#4 b 1 5 5
#5 b 2 9 14
#6 b 3 7 21
#7 c 1 6 6
#8 c 2 14 20
#9 c 3 6 26
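As for why the approach you linked takes so long: the classic way to get a cumulative sum without window functions is a triangular self-join, something like the sketch below (written against your classes/id/value columns; t is the table name used above). Every row is matched with all earlier rows in its class, so the work grows roughly with the square of the rows per class, which is why 40,000 rows can run for hours, while sum() over () makes a single ordered pass instead:

select t1.classes, t1.id, t1.value,
       sum(t2.value) as cumsum
from t t1
join t t2
  on  t2.classes = t1.classes
  and t2.id <= t1.id
group by t1.classes, t1.id, t1.value;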
I need to create a new transport ID based on the cumulative sum of the volume being transported. Let's say that originally everything was transported in truck A with a capacity of 25. Now I want to assign these items to shipments with truck B (capacity 15).
The only real constraint is that the amount shipped cannot exceed capacity.
I can't post a picture because of the restrictions... but the overall setup would be like this:
Old Trans # Volume New Trans # Cumulative Volume for Trans
1 1
1 9
1 3
1 7
1 4
2 9
2 10
3 8
3 5
3 9
4 4
4 6
4 8
5 9
5 1
5 5
5 8
6 3
6 4
6 3
6 4
6 4
6 7
7 7
7 10
7 4
8 10
8 6
8 7
9 4
9 9
9 6
10 7
10 4
10 1
10 1
10 5
10 2
11 9
11 3
11 9
12 8
12 5
12 9
13 9
Expected output: with a capacity of 15, the first three entries (1 + 9 + 3 = 13) fit together, but adding the fourth (7) would exceed 15, so they get a new shipment ID of 1; the next two entries (7 + 4 = 11) get a new shipment ID of 2; and so on. I've tried everything that I know (excluding VBA): INDEX/LOOKUP/IF functions. My VBA skills are very limited, though. Any tips? Thanks!
I think I see what you're trying to do here. You can do it with an IF formula, inserting a helper column to keep track of the running volume.
In columns C and D, enter these formulas in row 3 and copy them down (change 15 to whatever you want your new volume capacity to be):
Column C: =IF(B3+C2<15,B3+C2,B3)
Column D: =IF(B3+C2<15,D2,D2+1)
And for the cells C2 and D2:
C2: = B2
D2: = A2
Is this what you're looking to do?
A simple formula can be written that 'floats' the range totals for each successive load ID.
In the following, I've typed 25 and 15 in D1:E1 and used a custom number format of I\D 0. This way the column is labelled while the cell can still be referenced as a true numeric load limit. You can hard-code the limits into the formula by overwriting D$1 if you prefer, but then you won't have a one-size-fits-all formula that can be copied right for alternate load limits, as in my example.
The formula in D2 is,
=IF(ROW()=2, 1, (SUM(INDEX($B:$B, MATCH(D1, D1:D$1, 0)):$B2)>D$1)+ D1)
Fill right to E2 then down as necessary.
I have a dataset like the one below, but longer. I want to make sure I am picking the fleet_id rows with the best StarDriver value overall, but I want to return only two results for each supplier_id and a maximum of 20 rows in total.
(I'm sorry, I didn't work out how to copy the data below with proper formatting; I couldn't find it in the toolbar and Google results were about copying data. I'd also be grateful if someone would point out how.)
fleet_id supplier_id Ratings Driver Punctuality Car StarDriver
19442 151 10 5 5 5 5
19634 151 11 5 5 5 5
19437 151 12 5 5 5 5
12832 10 14 5 4.92857142857143 5 4.97619047619048
12217 111 10 5 5 4.9 4.96666666666667
21135 158 19 5 4.89473684210526 5 4.96491228070175
19436 151 14 4.85714285714286 5 5 4.95238095238095
12239 111 12 4.91666666666667 5 4.91666666666667 4.94444444444445
10520 92 12 4.91666666666667 5 4.91666666666667 4.94444444444445
19997 151 12 5 5 4.83333333333333 4.94444444444444
To limit the results to the top 2 for each supplier, use row_number(). It numbers the rows within each supplier_id, and you keep only two per supplier with where seqnum <= 2.
The rest of the query just takes the 20 best of those rows by StarDriver:
select top 20 t.*
from (select t.*,
             row_number() over (partition by supplier_id order by StarDriver desc) as seqnum
      from your_table t
     ) t
where seqnum <= 2
order by StarDriver desc;
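TOP is SQL Server/Sybase syntax; if you're on an engine that uses LIMIT instead (MySQL 8+, PostgreSQL, SQLite), the same idea might look like this sketch, where your_table is again a placeholder for your actual table name:

select *
from (select t.*,
             row_number() over (partition by supplier_id order by StarDriver desc) as seqnum
      from your_table t
     ) t
where seqnum <= 2          -- at most two fleets per supplier
order by StarDriver desc
limit 20;                  -- then keep the best 20 overall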