Percentile vs percentile_approx in SparkSQL - hive

I am using percentile in one of my sql's but due to large data skewness, the query is failing with out of memory error. Even after increasing memory, the query is failing. I found this percentile_approx function which works on my dataset but there are differences between percentile and percentile_approx.
Can some kind soul please explain the differences?
My dataset contains dollar amounts up to 2 decimal points. The number of rows will vary from month to month but it's usually over 300 million rows.
Thanks

Related

Can someone explain this PromQL query to me?

I'm new to promQL and I am using it to create grafana dashboard to visualize various API metrics like throughput, latency etc.
For measuring latency I came across these queries being used together. Can someone explain how are they working
histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
Also I want to write a query which will show me number of API calls with latency greater than 4sec. Can someone please help me there as well?
The provided queries are designed to return 99th and 95th percentiles for the http_request_duration_seconds{path="..."} metric of histogram type over requests received during the last 2 minutes (see 2m in square brackets).
Unfortunately the provided queries have some issues:
They use irate() function for calculating the per-second increase rate of every bucket defined in http_request_duration_seconds histogram. This function isn't recommended to use in general case, because it tends to return jumpy results on repeated queries - see this article for details. So it is better to use rate or increase instead when calculating histogram_quantile.
They multiply the calculated irate() by 30. This has no any effect on query results, since histogram_quantile() normalizes the provided per-bucket values.
So it is recommended to use the following query instead:
histogram_quantile(0.99,
sum(
increase(http_request_duration_seconds_bucket{path="..."}[2m])
) by (le)
)
This query works in the following way:
Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} time series selector on the selected time range on the graph. These time series represent histogram buckets for the http_request_duration_seconds histogram. Each such bucket contains a counter, which counts the number of requests with duration not exceeding the value specified in the le label.
Prometheus calculates the increase over the last 2 minutes per each selected time series, e.g. how many requests hit every bucket during the last 2 minutes.
Prometheus calculates per-le sums over bucket values calculated at step 2 - see sum() function docs for details.
Prometheus calculates the estimated 99th percentile for the bucket results returned at step 3 by executing histogram_quantile function. The error of the estimation depends on the number of buckets and the le values. More buckets with better le distribution usually give lower error for the estimated percentile.

SQL percentile_cont vs SPSS frequencies percentiles

I am trying to figure out how percentiles are calculated by SQL's percentile_cont function and by SPSS in FREQUENCIES. I want to compare them and understand why they get different results.
I have tried looking this up myself, but finding a source for this information is difficult. If you have an explanation for why they differ, can you please share where I can read about that myself?
The percentile formula used in FREQUENCIES in SPSS Statistics is a weighted average method aimed at p(N+1), where p is the percentile expressed as a proportion (0-1 range) and N is the number of cases or records. Ignoring complications associated with weighted data, particularly non-integer weights, you order the data values in ascending order and if p(N+1) is an integer, you take the value of the p(N+1)th ordered case. If p(N+1) is between the integers associated with the ordinal positions of two numbers, you linearly interpolate between the values according to the fractional value of p(N+1).
This general formula is a commonly-used one, denoted method 4 in SAS and method 6 in a well-known November 1996 article in The American Statistician by Hyndman & Fan (Vol. 50, No. 4, pp. 361-365) that is the basis for the nine definitions used in the quantile function in R. There is one special point about the method in FREQUENCIES, which is that while other implementations of this method will set any percentile where p(N+1)>N to the value of the Nth case, in SPSS Statistics the value is given as missing instead.
The method used in SQL's percentile_cont appears to be method 7 in the list of nine from Hyndman & Fan, which aims at 1+(N-1)p. The EXAMINE procedure in SPSS Statistics offers the method used in FREQUENCIES (as the HAVERAGE method) and four additional methods. None of these match the method in SQL's percentile_cont.
Formulas for the statistics in FREQUENCIES, EXAMINE, and other procedures in SPSS Statistics are available in the IBM SPSS Statistics Algorithms manual, a pdf of which is freely downloadable.

Latest date from a date list >180 days in past from a given date in same list

I have a "Appeared date" column A and next to it i have a ">180" date column B. There is also "CONCAT" column C and a "ATTR" column D.
What i want to do is find out the latest date 180 or more from past, and write it in ">180" column, for each date in "Appeared Date" column, where the Concat column values are same.
The Date in >180 column should be more than 180 days from "Appeared date" column in the past, but should also be an earliest date found only from the "Appeared date" column.
Based on this i would like to check if a particular product had "ATTR" = 'NEW' >180 earlier also i.e. was it launched 180 days or more ago and appearing again recently?
Is there an excel formula which can get the nearest dates (>180) picked from the Appeared date and show it in the ">180" column?
Will it involve a mix of SMALL(), FREQUENCY(), MATCH(), INDEX() etc?
Or a VBA procedure is required?
To do this efficiently with formulas, you can use something called Range Slicing to reduce the size of the arrays to be processed, by efficiently truncating them so that they contain just the subset of those 3,000 to 50,000 rows that could possibly hold the correct answer, and THEN doing the actual equality check. (As apposed to your MAX/Array approach, which does computationally expensive array operations on all the rows, even though most of the rows have no relationship with the current row that you seek an answer for).
Here's my approach. First, here's my table layout:
...and here's my formulas:
180: =[#Appeared]-180
Start: =MATCH([#CONCAT],[CONCAT],0)
End: =MATCH([#CONCAT],[CONCAT],1)
LastRow: =MATCH(1,--(OFFSET([Appeared],[#Start],,[#End]-[#Start])>[#180]),0)+[#Start]-1
LastItem: =INDEX([Appeared],[#LastRow])
LastDate > 180: =IF([#Appeared]-[#LastItem]>180,[#LastItem],"")
Days: =IFERROR([#Appeared]-[#[LastDate > 180]],"")
Even with this small data set, my approach is around twice as fast as your MAX approach. And as the size of the data grows, your approach is going to get exponentially slower, as more and more processing power is wasted on crunching rows that can't possibly contain the answer. Whereas mine will get slower in a linear fashion. We're probably talking a difference of minutes, or perhaps even an hour or so at the extremes.
Note that while you could do my approach with a single mega-formula, you would be wise not to: it won't be anywhere near as efficient. splitting your mega-formulas into separate cells is a good idea in any case because it may help speed up calculation due to something called multithreading. Here’s what Diego Oppenheimer, a former program manager for Microsoft Excel, had to say on the subject back in 2005 :
Multithreading enables Excel to spot formulas that can be calculated concurrently, and then run those formulas on multiple processors simultaneously. The net effect is that a given spreadsheet finishes calculating in less time, improving Excel’s overall calculation performance. Excel can take advantage of as many processors (or cores, which to Excel appear as processors) as there are on a machine—when Excel loads a workbook, it asks the operating system how many processors are available, and it creates a thread for each processor. In general, the more processors, the better the performance improvement.
Diego went on to outline how spreadsheet design has a direct impact on any performance increase:
A spreadsheet that has a lot of completely independent calculations should see enormous benefit. People who care about performance can tweak their spreadsheets to take advantage of this capability.
The bottom line: Splitting formulas into separate cells increases the chances of calculating formulas in parallel, as further outlined by Excel MVP and calculation expert Charles Williams at the following links:
Decision Models: Excel Calculation Process
Excel 2010 Performance: Performance and Limit Improvements
I think i found the answer. Earlier i was using the MIN function, though incorrectly, as the dates in the array formula (when you select and hit F9 key) were coming in descending order. So i finally used the MAX function to find the earliest date which was more than 180 in the past.
=IF(MAX(IF(--(A2-$A$2:$A$33>=180)*(--(C2=$C$2:$C$33))*(--
($D$2:$D$33="NEW")),$A$2:$A$33))=0,"",MAX(IF(--(A2-$A$2:$A$33>=180)*(--
(C2=$C$2:$C$33))*(--($D$2:$D$33="NEW")),$A$2:$A$33)))
Check the revised Sample.xlsx which is self-explanatory. I have added the Attr='NEW' criteria in the formula for the final workaround, to find if there were any new items that came 180 days or earlier.
Though still an ADO query alternative may be required to process the large amounts of data.

SDK2 query for counting: which is more efficient?

I have an app that is displaying metrics about defects in a project.
I have the option of making one query that returns all the defects, and from that I can break out about four different metrics (How many defects escaped QA in 90 days, 180 days, and then the same metrics again but only counting sev1/sev2 defects).
I could make four queries and limit the results to one so that I just get a count for each. Or I could make one query that encompass them all (all defects that escaped QA in 180 days) and then count up the difference.
I'm figuring worst case, the number of defects that escaped QA in the last six months will generally be less than 100, certainly less 500 worst case.
Which would you do-- four queryies with one result each, or one single query that on average might return 50, perhaps worst case 500?
And I guess the key question is-- where are the inflections points? Perhaps I have more metrics tomorrow (who knows, 8?) and a different average defect counts. Is there a rule of thumb I could use to help choose which approach?
Well I would probably make the series of four queries and use the result count. If you are expecting 500 defects that will end up being three queries each with 200 defects anyways.
The solution where you do each individual query and use the total result count would be safe with even a very large amount of defects. Plus I usually find it to be a bad plan to think that I know the data sets that an App will be dealing with. Most of my Apps end up living much longer and being used on larger datasets than I intended.
The max page size is 200, so it sounds like you'd be requesting between 1 and 3 pages to get all the data vs. 4 queries with a page size of 1 and using the TotalResultCount...
You'd definitely have less aggregation code to write if you use the multi query approach (letting the server do the counting for you based on your supplied filters).
I'd guess the 4 independent queries might be faster but it would be interesting to hear back your experimental results...

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86.400 seconds
1 month = 2.592.000 seconds
Around 1000 values to keep track of every seconds.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool - it provides a round robin database, or circular buffer, for time series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function, for example (sum, min, max, avg) for a given period, 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been agregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data saving approach and instead of saving 'raw' data as values I would save 5-20 minutes of data in an array (Memory, BL side), compress that array using LZ based algorithm and then store the data in the database as binary data. Also, it would be nice to save Max/Min/Avg/etc.. info for that binary chunk.
When you want to process the data you can process the data chunk after chunk and by that you keep a low memory profile for your application. this approach is a little more complex but very scalable in terms of memory/processing.
hope this helps.
Is the problem the database schema?
1 second to many trends obviously first shows you a separate table with a seconds-table foreign key. Alternatively, if the "many trend values" is represented by the columns and not rows you can always append the columns to the seconds table and incur null values.
Have you tried that? Was performance poor?