What is the difference between "dom_content_loaded.histogram.bin.start" and ".end" in Google's BigQuery?

I need to build a histogram of DOMContentLoaded times for a webpage. When I queried BigQuery, I noticed that apart from density there are two more attributes (start and end). In my head there should be only one attribute, since the DOMContentLoaded event fires once, when the DOM has loaded.
Can anyone help clarify the difference between .start and .end? These attributes always differ by 100 milliseconds (if start = X ms, then end = X+100 ms). See the example query posted below.
I cannot understand what these attributes represent exactly:
dom_content_loaded.histogram.bin.START
AND
dom_content_loaded.histogram.bin.END
Q: Which one of them represents the time that the DOMContentLoaded event is fired in a user's browser?
SELECT
bin.START AS start,
bin.END AS endd
FROM
`chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE
origin = 'https://www.google.com'
Output:
Row |start | end
1 0 100
2 100 200
3 200 300
4 300 400
[...]

The query below illustrates the meaning of bin.start, bin.end and bin.density.
Run the following SELECT statement:
SELECT
origin,
effective_connection_type.name type_name,
form_factor.name factor_name,
bin.start AS bin_start,
bin.end AS bin_end,
bin.density AS bin_density
FROM `chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE origin = 'https://www.google.com'
You will get 1550 rows in the result.
Below are the first 5 rows:
Row origin type_name factor_name bin_start bin_end bin_density
1 https://www.google.com 4G phone 0 100 0.01065
2 https://www.google.com 4G phone 100 200 0.01065
3 https://www.google.com 4G phone 200 300 0.02705
4 https://www.google.com 4G phone 300 400 0.02705
5 https://www.google.com 4G phone 400 500 0.0225
You can read them as:
For phones on a 4G connection, DOMContentLoaded fired within the first 100 milliseconds for 1.065% of page loads, between 100 and 200 milliseconds for 1.065%, between 200 and 300 milliseconds for 2.705%, and so on.
To summarize: for each origin, connection type and form factor you get a histogram represented as a repeated record, with the start and end of each bin along with a density, which is the fraction of user experiences that fell into that bin.
Note: if you add up the dom_content_loaded densities across all dimensions for a single origin, you will get 1 (or a value very close to 1 due to approximations).
For example
SELECT SUM(bin.density) AS total_density
FROM `chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE origin = 'https://www.google.com'
returns
Row total_density
1 0.9995999999999978
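Put differently, neither start nor end is a single event time: each bin is a range of load times, and density is the share of loads that fell in that range. As a rough sketch of how the bins can be used, the Python below sums the densities of the first five rows shown above to estimate the share of 4G phone loads that fired DOMContentLoaded within 300 ms. The helper name share_below is hypothetical, and treating a bin as fully below the threshold once its end is at or below it is an assumption about how the bin edges should be read.

```python
# First five bins copied from the result rows above (4G, phone).
bins = [
    {"start": 0, "end": 100, "density": 0.01065},
    {"start": 100, "end": 200, "density": 0.01065},
    {"start": 200, "end": 300, "density": 0.02705},
    {"start": 300, "end": 400, "density": 0.02705},
    {"start": 400, "end": 500, "density": 0.0225},
]


def share_below(bins, threshold_ms):
    """Sum the densities of every bin that ends at or before threshold_ms.

    Hypothetical helper: assumes a bin counts as 'below' the threshold
    only when its whole range lies at or under it.
    """
    return sum(b["density"] for b in bins if b["end"] <= threshold_ms)


share_300 = share_below(bins, 300)
print(f"{share_300:.5f}")  # share of these loads under 300 ms
```

Note that this figure is a share of all loads for the origin, not only of 4G phone loads, because the densities add up to 1 across every dimension combined.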
Hope this helped

Related

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments that have different resolutions because of each instrument's sampling rate. For a given time span, one of the sets has >10k entries while the other has ~2.5k. They capture data over the same time interval, however, and I want to plot them on top of each other even though their resolutions differ. The minimum and maximum x of both sets are the same; one of them simply has more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake. I was trying to plot both datasets using the time data from the first instrument. Of course each set needs to be plotted against its own time data; I had passed the first instrument's time data to the second plot by mistake.

Calculate percentage between two values

I have two columns that hold numbers. I am trying to calculate the percentage difference between them and show the result in another column, but the results seem to be wrong.
This is the code in question.
SELECT
GenPar.ParameterValue AS ClaimType,
COUNT(Submitted.ClaimNumber) AS SubmittedClaims,
COUNT(ApprovalProvision.ClaimNumber) AS ApprovedClaims,
COUNT(Declined.ClaimNumber) AS DeclinedClaims,
COUNT(Pending.ClaimNumber) AS PendingClaims,
ISNULL(SUM(SubmittedSum.SumInsured),0) AS TotalSubmittedSumInsured,
ISNULL(SUM(ApprovedSum.SumInsured),0) AS TotalApprovedSumInsured,
ISNULL(SUM(RejectedSum.SumInsured),0) AS TotalRejectedSumInsured,
ISNULL(SUM(PendingSum.SumInsured),0) AS TotalPendingSumInsured,
--This column is to show the diff in %
CASE WHEN COUNT(Submitted.ClaimNumber) <> 0 AND COUNT(ApprovalProvision.ClaimNumber) <> 0
THEN (COUNT(ApprovalProvision.ClaimNumber),0) - (COUNT(Submitted.ClaimNumber),0)
/COUNT(Submitted.ClaimNumber) * 100
ELSE 0
END
What I need is to show the difference in % between the columns SubmittedClaims and ApprovedClaims. Either column, or both, may or may not contain zeroes.
So it's: COUNT(Submitted.ClaimNumber) - COUNT(ApprovalProvision.ClaimNumber) / COUNT(Submitted.ClaimNumber) * 100 as far as I know.
I have tried this and an example of what it does is it takes 1 and 117 and returns 17 when the difference between 1 and 117 is a decrease of 99.15%. Another example is 2 and 100. This simply returns 0 whereas the difference is a decrease of 98%.
CASE WHEN COUNT(Submitted.ClaimNumber) <> 0 AND COUNT(ApprovalProvision.ClaimNumber) <> 0
THEN (COUNT(ApprovalProvision.ClaimNumber),0) - (COUNT(Submitted.ClaimNumber),0)
/COUNT(Submitted.ClaimNumber) * 100
ELSE 0
END
I've checked this link and this seems to be what I am doing.
Percentage difference between two values
I've also tried this code:
NULLIF(COUNT(Submitted.ClaimNumber),0) - NULLIF(COUNT(ApprovalProvision.ClaimNumber),0)
/ NULLIF(COUNT(Submitted.ClaimNumber),0) * 100
and this takes for example 2 and 100 and returns -4998 when the real difference is a decrease of 98%.
For completion, Submitted.ClaimNumber is this portion of code:
LEFT OUTER JOIN (SELECT * FROM Company.Schema.ClaimMain WHERE CurrentStatus=10)Submitted
ON Submitted.ClaimNumber = ClaimMain.ClaimNumber
ApprovalProvision.ClaimNumber is this:
LEFT OUTER JOIN (SELECT * FROM Company.Schema.ClaimMain WHERE CurrentStatus=15)ApprovalProvision
ON ApprovalProvision.ClaimNumber = ClaimMain.ClaimNumber
Ideally, this column would also deal with 0's. So if one value is 0 and the other is X, the result should return 0 since a percentage can't be calculated if original number is 0. If the original value is X and the new value is 0, I should show a decrease of 100%.
This will occur across all columns but there is no need to flood the page with the rest of the columns since all calculations will occur in the same manner.
Anybody see what I'm doing wrong?
I'm not familiar with the (x,0) syntax you are using,
but I see that you have
(COUNT(ApprovalProvision.ClaimNumber),0) - (COUNT(Submitted.ClaimNumber),0)
/COUNT(Submitted.ClaimNumber) * 100
Shouldn't it be:
( COUNT(ApprovalProvision.ClaimNumber) - COUNT(Submitted.ClaimNumber) )
/COUNT(Submitted.ClaimNumber) * 100
As written, it looks like it computes COUNT(ApprovalProvision.ClaimNumber) - 100, since COUNT(Submitted.ClaimNumber) divided by itself is 1, and 1 times 100 is 100.
The 4900 figure actually sounds right. Let's take the following example: you have 2 apples, and then you're given 98 more and end up with 100 apples.
An increase of 98% would have meant from 2 apples, you would have 3.96 apples.
An increase of 100% means from 2 apples you end with 4 apples. An increase of 1000% means from 2 apples you end with 22 apples. So 4000% means you end with 82 apples. 5000% means from 2 apples, you reach 102 apples.
(100-2)/2*100 = 98/2*100 = 49*100 = 4900, so there is a 4900% increase in the number of apples if you start with 2 apples and reach 100.
Now if you had flipped the 2 and 100, say starting with 100, now you have 2,
(2-100)/100*100 = -98, so a -98% change of apples, or a 98% decrease.
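The arithmetic above, together with the zero handling the question asks for, can be sketched in Python as follows. The function name percent_change is hypothetical, and the zero rule (return 0 when the original value is 0) is taken from the question rather than from any standard definition. One point worth checking in the T-SQL version: COUNT returns an integer and SQL Server truncates integer division, so the SQL expression would probably need a decimal operand (for example multiplying by 100.0 before dividing) to give the same numbers.

```python
def percent_change(submitted, approved):
    """Percentage change from `submitted` (original) to `approved` (new).

    Returns 0 when the original value is 0, as the question requests,
    because a percentage of zero is undefined.
    """
    if submitted == 0:
        return 0.0
    # Parenthesise the difference before dividing by the original value.
    return (approved - submitted) / submitted * 100


print(percent_change(2, 100))   # start with 2, end with 100
print(percent_change(100, 2))   # start with 100, end with 2
print(percent_change(5, 0))     # everything lost
print(percent_change(0, 5))     # original is zero
```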
Hope this solves your problem.

Daemon to monitor query and send mail conditionally in SQL Server

I've been melting my brains over a peculiar request: execute every two minutes a certain query and if it returns rows, send an e-mail with these. This was already done and delivered, so far so good. The result set of query is like this:
+----+---------------------+
| ID | last_update |
+----+---------------------|
| 21 | 2011-07-20 13:03:21 |
| 32 | 2011-07-20 13:04:31 |
| 43 | 2011-07-20 13:05:27 |
| 54 | 2011-07-20 13:06:41 |
+----+---------------------|
The trouble starts when the user asks me to modify the solution so that, e.g., the first time ID 21 is caught being more than 5 minutes old, the e-mail is sent to a particular set of recipients; the second time, when ID 21 is between 5 and 10 minutes old, another set of recipients is chosen. So far it's OK. The gotcha for me is from the third time onwards: the e-mails are now sent every half hour, instead of every five minutes.
How should I keep track of the status of Mr. ID = 43 ? How would I know if he has already received an e-mail, two or three? And how to ensure that from the third e-mail onwards, the mails are sent each half-hour, instead of the usual 5 minutes?
I get the impression that you think this can be solved with a simple mathematical formula. And it probably can be, as long as your system is reliable.
Every thirty minutes can be seen as 360 degrees, or 2 pi radians, on a harmonic function graph. That makes 1 minute = 12 degrees. Let's take cosine, for instance:
f(x) = cos(x)
f(x) = cos(elapsedMinutes * 12 degrees)
Here elapsedMinutes is the time since the first 30-minute update was due to go out. That due time is last_update plus a constant number of minutes.
Since you have a two-minute window of error, it is time to transmit the 30-minute update when f(x) is greater than the value it takes one minute before or after the scheduled update, which is cos(1 * 12 degrees) = 0.9781476007338056379285667478696.
Bringing it all together, it's time to send a thirty minute update if this SQL expression is true:
COS(RADIANS( 12 * DATEDIFF(minute,
DATEADD(minute, constantNumberOfMinutesBetweenSecondAndThirdUpdate, last_update),
CURRENT_TIMESTAMP))) > 0.9781476007338056379285667478696
If you need a wider window than exactly two minutes, just lower this number slightly.
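To see how the cosine test behaves, here is a small Python sketch of the same check. The function name and the default threshold of 0.97 are assumptions made for illustration: 0.97 sits slightly below cos(12 degrees), so whole-minute offsets of exactly one minute either side of the schedule still pass, which is the "lower this number slightly" adjustment mentioned above.

```python
import math


def thirty_minute_update_due(elapsed_minutes, threshold=0.97):
    """Return True when elapsed_minutes is within about a minute of a
    multiple of 30.

    elapsed_minutes: whole minutes since the first 30-minute update was due.
    threshold: assumed cut-off, a little below cos(12 degrees) ~= 0.978.
    """
    # 30 minutes map to a full 360-degree cycle, so 1 minute is 12 degrees.
    return math.cos(math.radians(12 * elapsed_minutes)) > threshold


for minutes in (0, 15, 28, 29, 30, 31, 32, 60):
    print(minutes, thirty_minute_update_due(minutes))
```

The check only says whether the current moment is close to a 30-minute boundary; it does not record how many e-mails have already been sent for a given ID.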

SQL Query Recalculating Running Totals

I'm taking a set of transactions and amounts, and I want to create a new amount column, with the following logic --
Check a running total of (new) amounts thus far.
If adding this amount to the previous total would bring the total to less than zero, the new amount field should be zero. Otherwise, it should be equal to the old amount.
Here's an example of what I'm looking for --
Item  Record  Old Amount  New Amount  Running Total
1     1       100         100         100
1     2       -100        -100        0
1     3       -200        0           0
1     4       500         500         500
1     5       -300        -300        200
1     6       300         300         500
My running total starts at zero.
My first amount is 100, and that doesn't take the total < 0, so I pass it through and set the new amount to 100.
My second amount is -100, and that doesn't take my running total of 100 to < 0, so I set the new amount to -100.
My third amount is -200. That would take the running total of 0 to -200, < 0. Thus, I set the new amount to 0.
My fourth amount is 500. It gets passed through.
My fifth amount is -300. That would take the running total of 500 to 200, which is still >= 0. It gets passed through.
My sixth amount is 300. It gets passed through, leaving me with a final amount total of 500.
The difficult part is in cases like record five here. In order to know that it won't take the final running total below zero, you need to have already calculated the new total for record 3.
I think you can do this by setting up common table expressions in order to make a recursive query, but I've foundered on how exactly to create that. If possible, I'd like to avoid cursors.
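For reference, the rule described above is straightforward to state procedurally. The Python sketch below is only meant to pin down the expected output for the example data, not to suggest a SQL approach; the function name clamp_amounts is hypothetical.

```python
def clamp_amounts(old_amounts):
    """Return (new_amount, running_total) pairs for each old amount.

    An amount is replaced by 0 when adding it would push the running
    total of the new amounts below zero; otherwise it passes through.
    """
    running_total = 0
    rows = []
    for amount in old_amounts:
        new_amount = amount if running_total + amount >= 0 else 0
        running_total += new_amount
        rows.append((new_amount, running_total))
    return rows


result = clamp_amounts([100, -100, -200, 500, -300, 300])
for new_amount, running_total in result:
    print(new_amount, running_total)
```

Each row depends on the running total of the already adjusted amounts, which is the dependency that makes record five hard to compute in a single set-based pass.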
This calls for a window function solution with a wrapping CASE statement.
Look up LAG.

How to find SUM(field) in a condition, i.e. "select * from table where sum(field) < 150"

I have to retrieve only the records whose summed size is <= 150.
I have a table like the one below:
userid size
1 70
2 100
3 50
4 25
5 120
6 90
The output should be ...
userid size
1 70
3 50
4 25
For example, if we add 70,50,25 we get 145 which is <=150.
How would I write a query to accomplish this?
Here's a query which will produce the above results:
SELECT * FROM `users` u
WHERE (SELECT SUM(size) FROM `users` WHERE size <= u.size) <= 150
ORDER BY userid
However, the problem you describe, selecting the users that most closely fit into a given size, is a bin packing problem. This is an NP-hard problem and won't be easily solved with ANSI SQL. The query above happens to return the expected result, but it simply starts with the smallest item and keeps adding items until the bin is full.
A more effective general bin packing heuristic is to start with the largest item and continue to add smaller ones as they fit. This algorithm would select users 5 and 4.
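Both orderings described above can be sketched in a few lines of Python using the table from the question. The function name greedy_fill and the dictionary layout are assumptions made for illustration; this is a simple greedy pass, not an optimal bin packing solver.

```python
# userid -> size, copied from the question's table.
users = {1: 70, 2: 100, 3: 50, 4: 25, 5: 120, 6: 90}


def greedy_fill(sizes, limit, largest_first=False):
    """Pick users in size order, keeping each one that still fits.

    Returns (sorted list of chosen userids, total size).
    """
    chosen, total = [], 0
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=largest_first)
    for userid, size in ordered:
        if total + size <= limit:
            chosen.append(userid)
            total += size
    return sorted(chosen), total


print(greedy_fill(users, 150))                      # smallest first
print(greedy_fill(users, 150, largest_first=True))  # largest first
```

With this data both orderings happen to reach a total of 145, but they pick different users, which shows that the "right" answer depends on what is being optimized.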
What you're looking for is a greedy algorithm. You can't really do this with one SQL statement.
It's similar to the subset sum problem. You are definitely going to be into exponential time ...
There are several ways to solve subset sum in time exponential in N. The most naïve algorithm would be to cycle through all subsets of N numbers and, for every one of them, check if the subset sums to the right number. The running time is of order O(2^N*N), since there are 2^N subsets and, to check each subset, we need to sum at most N elements.
Unless you can constrain the problem to smaller subsets.
According to your definition as it stands you could get any of these tables:
userid  size    userid  size
1       70      2       100
userid  size    userid  size
3       50      4       25
userid  size    userid  size
5       120     6       90
userid  size    userid  size
1       70      2       100
3       50      3       50
userid  size    userid  size
1       70      2       100
4       25      4       25
userid  size    userid  size
1       70      4       25
3       50      6       90
4       25
userid  size    userid  size
4       25      3       50
5       120     6       90
SQL sucks at guessing. Do you mean to say you want the most users whose total size is under a certain limit? You'll need to create a temp table of all the combinations of users, then select the ones whose total size is less than the limit, then select the one with the most users, and possibly the lowest user ID or something. Either way, it won't be fast due to the first step.
But do you want to maximize the number of results, minimize it, or do you simply not care? The first two cases are constrained optimization, for which there should be a solution using SQL; the latter (as mentioned above) requires a greedy strategy.
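The enumerate-then-filter idea can be sketched outside SQL to show what it would return for the sample table. The Python below assumes "most users" is the goal and breaks ties by the lowest user IDs; the function name largest_group_under is hypothetical, and the approach is exponential in the number of users, so it only suits small tables.

```python
from itertools import combinations

# userid -> size, copied from the question's table.
users = {1: 70, 2: 100, 3: 50, 4: 25, 5: 120, 6: 90}


def largest_group_under(sizes, limit):
    """Return the largest combination of userids whose sizes sum to <= limit.

    Tries the biggest group sizes first; ties go to the combination with
    the lowest userids (tuple ordering). Returns () if nothing fits.
    """
    userids = sorted(sizes)
    for count in range(len(userids), 0, -1):
        fitting = [
            combo
            for combo in combinations(userids, count)
            if sum(sizes[u] for u in combo) <= limit
        ]
        if fitting:
            return min(fitting)
    return ()


best = largest_group_under(users, 150)
print(best, sum(users[u] for u in best))
```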