How can we calculate percentile in druid while querying? - sql

I'm trying to query the 10th, 25th, and 75th percentiles for each row in Druid from an integer column value. I came across some solutions ( http://druid.io/docs/latest/development/extensions-core/datasketches-quantiles.html ) but I'm not sure how they can be implemented. Can somebody explain it in simpler terms?

If you're using Druid SQL, it's pretty easy (once you load the DataSketches extension). For example, you can use
SELECT APPROX_QUANTILE_DS(myNum, .25) FROM myData
to get the 25th percentile of myNum. (If you search for 'approx_quantile_ds' on this doc page, there's also a link to a known issue.)
For native queries, I'm not sure offhand, but maybe this will help: https://www.druidforum.org/t/quantile-calculation/4929
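To get several percentiles at once with the SQL approach above, one option is to put one APPROX_QUANTILE_DS call per fraction in the same SELECT. The following is a minimal sketch that only builds the statement and the JSON body you would POST to Druid's SQL HTTP endpoint (/druid/v2/sql); it does not contact a server. The helper name build_percentile_sql and the p10/p25/p75 aliases are made up for illustration, and myData/myNum are the placeholder names from the answer.

```python
import json


def build_percentile_sql(table, column, fractions):
    """Build one Druid SQL statement with an APPROX_QUANTILE_DS call per fraction.

    Hypothetical helper: the output aliases (p10, p25, ...) are illustrative.
    """
    select_list = ", ".join(
        f'APPROX_QUANTILE_DS("{column}", {fraction}) AS p{round(fraction * 100)}'
        for fraction in fractions
    )
    return f'SELECT {select_list} FROM "{table}"'


# 10th, 25th and 75th percentiles, as asked in the question.
sql = build_percentile_sql("myData", "myNum", [0.10, 0.25, 0.75])

# JSON body for a POST to the SQL endpoint; sending it is left out on purpose.
payload = json.dumps({"query": sql})

print(sql)
```

If you need the percentiles per group rather than over the whole table, adding the grouping column to the SELECT list and a GROUP BY clause should work the same way as in any other aggregate query.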

Related

Derived Table Error: "The multi-part identifier could not be bound"

I'm having trouble getting the results I would like from the query I've built. The overall goal I'm trying to accomplish is to get the first odometer reading of the month and the last odometer reading of the month for a specific vehicle. I would then like to subtract the two to get total miles driven for that month. I figured a derived table with window functions would best help to accomplish this goal (see example SQL below).
SELECT
VEHICLE_ID2_FW
FROM
(SELECT
VEHICLE_ID2_FW,
LOCATION_CODE_FW,
MIN(ODOMETER_FW) OVER(PARTITION BY YEAR(DATE_FW), MONTH(DATE_FW)) AS MIN_ODO,
MAX(ODOMETER_FW) OVER(PARTITION BY YEAR(DATE_FW), MONTH(DATE_FW)) AS MAX_ODO
FROM
GPS_TRIPS_FW) AS G
I keep running into an issue where the derived table's query, by itself, runs and
works. However, when I bracket it in the FROM clause, it throws the error
The multi-part identifier could not be bound
Hoping that I could get some help figuring this out and maybe finding an overall better way to accomplish my goal. Thank you!
Odometers only increase (well, that should be true). So just use aggregation:
select VEHICLE_ID2_FW, year(date_fw), month(date_fw),
min(ODOMETER_FW), max(ODOMETER_FW),
max(ODOMETER_FW) - min(ODOMETER_FW) as miles_driven_in_month
from GPS_TRIPS_FW
group by VEHICLE_ID2_FW, year(date_fw), month(date_fw);
This answers the question that you asked. I don't think it solves your problem, though, because the miles driven per month will not add up to the total miles driven. The issue is the miles driven between the last record at the end of one month and the first record at the beginning of the next month.
If this is an issue, ask another question. Provide sample data, desired results, and an appropriate database tag.
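To make the month-boundary gap concrete, here is a small sketch that runs the same aggregation on made-up readings. It uses Python's built-in sqlite3 module purely so the example is runnable, so strftime stands in for the YEAR() and MONTH() functions of the original query. The sample rows and the assumed column types are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GPS_TRIPS_FW ("
    "VEHICLE_ID2_FW TEXT, DATE_FW TEXT, ODOMETER_FW INTEGER)"
)
# Invented readings: 60 miles are driven between Jan 28 and Feb 2.
conn.executemany(
    "INSERT INTO GPS_TRIPS_FW VALUES (?, ?, ?)",
    [
        ("V1", "2020-01-05", 1000),
        ("V1", "2020-01-28", 1100),
        ("V1", "2020-02-02", 1160),
        ("V1", "2020-02-20", 1200),
    ],
)

# Same idea as the answer: max minus min odometer per vehicle and month.
monthly = conn.execute(
    """
    SELECT VEHICLE_ID2_FW,
           strftime('%Y', DATE_FW) AS yr,
           strftime('%m', DATE_FW) AS mon,
           MAX(ODOMETER_FW) - MIN(ODOMETER_FW) AS miles_driven_in_month
    FROM GPS_TRIPS_FW
    GROUP BY VEHICLE_ID2_FW, yr, mon
    ORDER BY yr, mon
    """
).fetchall()

# Total distance over the whole period, ignoring month boundaries.
overall = conn.execute(
    "SELECT MAX(ODOMETER_FW) - MIN(ODOMETER_FW) FROM GPS_TRIPS_FW"
).fetchone()[0]

# Miles that fall between months and are counted in neither.
missing = overall - sum(row[3] for row in monthly)
print(monthly, overall, missing)
```

With these readings the monthly figures are 100 and 40, while the vehicle actually covered 200 miles, so 60 miles are attributed to no month.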

How to get accurate percentiles for big data in PIG?

I'm trying to find percentiles in Pig, and for that I'm using the datafu-1.2.0.jar package, but my answers are not matching. I found out that all the data is pushed to a single reducer per key, so I think that's why it is not working for me (the file is 10.5 GB). Is there a way to solve this problem without having to partition my data? That is, I want to find percentiles over the entire data set using GROUP ALL.

Using 'HINTS' in sql query

I am sorry if I sound silly asking, but I haven't been using SQL hints long and I am going over some chapter review work for school. I am having trouble getting my head wrapped around them.
For instance, one question I did in Oracle on a test database I had made was "Show the top 10% of the daily total number of auctions." My answer (which worked) was:
SELECT DAYOFWEEK, DAILY_TOTAL
FROM (
SELECT T.DAYOFWEEK,
SUM(AF.TOTAL_NUM_OF_AUCTIONS) AS DAILY_TOTAL,
CUME_DIST() OVER (ORDER BY SUM(AF.TOTAL_NUM_OF_AUCTIONS) ASC) AS Percentile
FROM TIME_DIM T, AUCT_FACT AF
WHERE AF.TIME_ID = T.TIME_ID
GROUP BY T.DAYOFWEEK)
WHERE Percentile > .9
ORDER BY Percentile DESC;
The problem I have now is that the exercise asks me to achieve this output with a different query. I asked my teacher, who said they mean to use hints. I looked over the notes I have on hints, and they don't explain thoroughly enough how to optimise this query with hints or how to do it in a simpler manner.
Any help would really be appreciated
=) thanks guys!
Hints are options you include in your query to direct the cost-based optimizer, for example to tell it which indexes to use.
It looks like daily total is something you can implement a summary index on.

Efficient way to compute accumulating value in sqlite3

I have an sqlite3 table that tells when I gain/lose points in a game. Sample/query result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start w/ 1000 points,
here's one way to do it.
SELECT DISTINCT(time), (SELECT
1000+SUM(p2) FROM events e WHERE p1='barrycarter' AND action='points'
AND e.time <= e2.time) AS points FROM events e2 WHERE p1='barrycarter'
AND action='points' ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has @variables so you can do things like:
SELECT time, @tot := @tot+points ...
but I'm using sqlite3 and the above isn't ANSI standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any
single SELECT query on "http://ccgames.db.94y.info/". I want to give
them useful access to my data, but not to the point of allowing
scripting or allowing multiple queries with state. So I need a single
SQL query that can do accumulation. See also:
Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not solvable using SQLite alone, or it will be terribly slow as you have found. The query you have written is a triangular cross join that will not scale, or rather, will scale badly.
The most efficient way to tackle the problem is through the program that is making use of SQLite, e.g. if you were using Web SQL in HTML5, you can easily accumulate in JavaScript.
There is a discussion about this problem in the sqlite mailing list.
Your 2 options are:
1. Iterate through all the rows with a cursor and calculate the running sum on the client.
2. Store sums instead of, or as well as, storing points (if you only store sums, you can get the points by computing sum(n) - sum(n-1), which is fast).
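The first option, accumulating on the client while iterating a cursor, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module and an in-memory database; the table layout (columns time, p1, action, p2, with p2 stored as an integer) is an assumption inferred from the query in the question, and only the first four sample rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, inferred from the columns used in the question's query.
conn.execute("CREATE TABLE events (time INTEGER, p1 TEXT, action TEXT, p2 INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, 'barrycarter', 'points', ?)",
    [
        (1280622305, -22),
        (1280625580, -9),
        (1280627919, 20),
        (1280688964, 21),
    ],
)

# One ordered pass over the rows; no correlated subquery is needed.
cursor = conn.execute(
    "SELECT time, p2 FROM events "
    "WHERE p1 = ? AND action = 'points' ORDER BY time",
    ("barrycarter",),
)

running = []
total = 1000  # starting balance stated in the question
for time, delta in cursor:
    total += delta
    running.append((time, total))

for time, points in running:
    print(time, points)
```

This reads each row once, so the work grows linearly with the number of rows instead of quadratically. It does not satisfy the single-SELECT constraint from the question's edit; for that case, note that SQLite releases from 3.25 onward support window functions such as SUM(...) OVER (ORDER BY ...), which were not available when this answer was written.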

SOLR - Boost function (bf) to increase score of documents whose date is closest to NOW

I have a solr instance containing documents which have a 'startTime' field ranging from last month to a year from now. I'd like to add a boost query/function to boost the scores of documents whose startTime field is close to the current time.
So far I have seen a lot of examples which use rord to add boosts to documents that are newer, but I have never seen an example of something like this.
Can anyone tell me how to do it please?
Thanks
If you're on Solr 1.4+, then you have access to the "ms" function in function queries, and the standard, textbook approach to boosting by recency is:
recip(ms(NOW,startTime),3.16e-11,1,1)
ms gives the number of milliseconds between its two arguments. The expression as a whole boosts scores by 1 for docs dated now, by 1/2 for docs dated 1 year ago, by 1/3 for docs dated 2 years ago, etc. (See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting, as Sean Timm pointed out.)
In your case you have docs dated in the future. For those, ms(NOW,startTime) is negative, so the function above misbehaves: the denominator shrinks toward zero, which inflates the boost, and it turns negative for dates more than about a year out. You probably want to throw in an absolute value, like this:
recip(abs(ms(NOW,startTime)),3.16e-11,1,1)
abs(ms(NOW,startTime)) will give the # of milliseconds between startTime and now, guaranteed to be nonnegative.
That would be a good starting place. If you want, you can then tweak the 3.16e-11 if it's too aggressive or not aggressive enough.
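To see what the constant does before tuning it, the arithmetic can be reproduced outside Solr. The sketch below mirrors the definition recip(x,m,a,b) = a / (m*x + b) in plain Python; it is not Solr code, and the names recency_boost and MS_PER_YEAR are made up for illustration.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # about 3.15e10 milliseconds


def recip(x, m, a, b):
    """Plain-Python mirror of Solr's recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(ms_from_now, m=3.16e-11):
    """Boost for a document whose startTime is ms_from_now away from NOW.

    Hypothetical helper: abs() makes past and future dates symmetric,
    as in recip(abs(ms(NOW,startTime)),3.16e-11,1,1).
    """
    return recip(abs(ms_from_now), m, 1, 1)


# Boost at 0, 1 and 2 years in the past, and 1 year in the future.
for years in (0, 1, 2, -1):
    print(years, round(recency_boost(years * MS_PER_YEAR), 3))

# Without abs(), one year in the future makes the denominator nearly zero.
print(recip(-MS_PER_YEAR, 3.16e-11, 1, 1))
```

Since 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, the boost halves at a distance of one year. A larger m makes the boost fall off faster, and a smaller m makes it fall off more slowly.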
Tangentially, the ms function will only work on fields based on the TrieDate class, not the classic Date and LegacyDate classes. If your schema.xml was based on the example one for Solr 1.4, then your date field is probably already in the correct format.
You can do date math in Solr 1.4.
http://wiki.apache.org/solr/FunctionQuery#Date_Boosting