Distribution of table in time - sql

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...

Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
SELECT #rownum:=#rownum+1 AS rownum, e.*
FROM (SELECT #rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;

Something like this came to my mind
select #rownum:=#rownum+1 rownum, entries.*
from (select #rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at my hand but maybe this will help ...

As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.

#Michal
For whatever reason, your example only works when the where #recnum uses a less than operator. I think when the where filters out a row, the rownum doesn't get incremented, and it can't match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.

Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?
AVG
STDDEV_POP
VARIANCE
TO_DAYS

select timefield
from entries
where rand() = .01 --will return 1% of rows adjust as needed.
Not a mysql expert so I'm not sure how rand() operates in this environment.

For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/

Related

A trigger to create populate a table based on other table

Ok, since it seems that my last two questions (this one and this one) only lead to confussion, I will try to explain the FULL problem here, so it might be a long post.
I'm trying to create a database for a trading system. The database has 2 main tables. One is table "Ticks" and the other is "Candles". As shown in the figure, each table has its own attributes..
Candles, bars or ohlc are the same thing.
The way a candle is seen in a chart is like this:
Candles are just a way to representate aggregated data, nothing more.
There are many ways to aggregate ticks in order to create one candle. In this post, I'm asking for a particular way that is creating one candle every 500 ticks. So, if the ticks table has 1000 ticks, I can create only 2 candles. If it has 500 ticks, I can create 1 candle. If it has 5000 ticks, I can create 10 candles. If there are 5001 ticks I still have only 10 candles, because I'm missing the other 499 ticks in order to create the 11th candle.
Actually, I'm storing all the ticks using a python script and creating (and therefore, inserting in the candles table) candles with another python script. This is a real time process.
Both scripts run in a while True: loop. No, I can't (read shouldn't) stop the scripts because the market is opened 24 hours - 5 days a week.
What I'm trying to do is to get rid of the python script that creates and stores the candles in the candles table. Why? Because I think that it will improve performance. Instead of doing multiple queries to know the amount of ticks that are available to create a new candle, I think that a trigger could handle it in a more efficient way (please, if I'm mistaken correct me).
I don't know how to actually solve it, but what I'm trying is to do this (thanks to #GordonLinoff for helping me in previous questions):
do $$
begin
with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
Using this, I get this error:
ERROR: syntax error at or near "case"
LINE 33: case 500<(select * from total_ticks)
^
SQL state: 42601
Character: 945
As you can see, there is no select after the CETs. If I put:
select case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
I get this error:
ERROR: subquery must return only one column
LINE 31: (select * from candles)
^
QUERY: with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
select case 1000>(select * from total_ticks)
when true then
(select * from candles)
end
CONTEXT: PL/pgSQL function inline_code_block line 4 at SQL statement
SQL state: 42601
So honestly, I don't know how to do it correctly. It doesn't has to be with the actual code I provide here, but the desired output looks as follows:
-----------------------------------------------------------------------------------
| date | open | high | low | close | ask |
|2020-05-01 20:39:27.603452| 1.0976 | 1.09766 | 1.09732 | 1.09762 | 1.09776 |
This would be the output when there is enough ticks to create only 1 candle. If there is enough to create two of them, then there should be 2 rows.
So, at the end of the day, what I have in mind is that the trigger should check constantly if there is enough data to create a candle and if it is, then create it.
Is this a good idea or I should stick to the python script?
Can this be achieved with my approach?
What I'm doing wrong?
What should I do and how should I manage this situation?
I really hope that the question now is complete and there is no missing information.
All comments and advices are appreciated.
Thanks!
EDIT: Since this is a real time process, in one second there could be 499 ticks in the database and in the next second there could be 503 ticks. This means that 4 ticks arrived in 1 second.
Being a database guy, my approach would be to use triggers in the database.
Create a third table candle_in_the_making that contains the data from the ticks that have not yet been aggregated to a candles entry.
Create an INSERT trigger on the ticks table (doesn't matter if BEFORE or AFTER) that does the following:
For every tick inserted, add a row to candle_in_the_making.
If the row count reaches 500, compute and insert a new candles row and TRUNCATE candle_in_the_making.
This is simple if ticks are inserted only in a single thread.
If ticks are inserted concurrently, you have to find a way to prevent two threads from inserting the 500th tick in candle_in_the_making at the same time (so that you end up with 501 entries). I can think of two ways to do that in the database:
Have an extra table c_i_m_count that contains only a single number, which is the number of rows in candle_in_the_making. Before you insert into candle_in_the_making, you run the atomic
UPDATE c_i_m_count SET counter = counter + 1 RETURNING counter;
This locks the row, so that any two INSERTs into counter_in_the_making are effectively serialized.
Use advisory locks to serialize the inserting threads. In particular, a transaction level exclusive lock as taken by pg_advisory_xact_lock would be indicated.

Access 2013 - Query not returning correct Number of Results

I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) <3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned as there are two different RoundID's that are tied for the second lowest PlayedTo. For what I am trying to calculate it doesn't matter which of these two it returns as I just need to know what the number is but for reporting I ideally need to know the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records but it will need to run on about 1000 records when the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select
x1.PlayerID, x1.RoundID
, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and (x2.PlayedTo < x1.PlayedTo
or x2.PlayedTo = x1.PlayedTo
and x2.DatePlayed >= x1.DatePlayed
)
) <3
order by PlayerID, PlayedTo, RoundID;
For performance create an index supporting the join condition. Something like:
create index P_7to8Calcs__PlayerID_RoundID on P_7to8Calcs(PlayerId, PlayedTo);
Note: I used your SQLFiddle as I do not have Acess available here.
Edit: In case the index does not improve performance enough, you might want to try the following query using window functions (which avoids nested sub-query). It works in your SQLFiddle but I am not sure if this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.

How can I modify this SQL query to exclude all results except from the previous two hours?

We're currently using SQL Express on SQL Server 2005 and want to set up an automated ftp file transfer every two hours to our client. We want to be able to send them bi-hourly uploads without duplicates throughout the day. Is this possible to do by modifying this existing query?
Use Sweet
select distinct d.AccountCode, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
from CSR_CallDetail d, CSR_FileListing f
where d.CallId = f.CallId
and f.ProcessedFileName like '%mp3'
and f.CallStartDateTime between convert(varchar(10),getdate()-1,101) and convert(varchar(10),getdate(),101)
and d.AccountCode > '740000'
and f.AccountCode > '740000'
and not exists (select 1 from( select processedfilename from csr_filelisting) p
where f.compressedfilename = p.processedfilename)
Here's the updated query
Use Sweet
select distinct d.AccountCode, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
from CSR_CallDetail d, CSR_FileListing f
where d.CallId = f.CallId
and f.ProcessedFileName like '%mp3'
and DATEDiff(hh, f.callstartdatetime, GETDATE ()) <=2
and d.AccountCode > '740000'
and f.AccountCode > '740000'
and not exists (select 1 from( select processedfilename from csr_filelisting) p where f.compressedfilename = p.processedfilename)
Let's say the query you posted returns desired result. If so, we need a date (and time) the records have been saved. All you need is to add condition:
AND DATEDIFF(hh, date_of_record, GETDATE()) <=2
I assume in your case it will be:
AND DATEDIFF(hh, f.CallStartDateTime , GETDATE()) <=2
You won't be able to rely on timestamps to get guaranteed exact sequential nonoverlapping sets of anything. You'll always be up against a race condition. What you should do is add a bit column somewhere that will mean you've already processed that row, and set it appropriately at the time of processing. Use transactions and isolation levels to ensure that no one is updating it while you're working on it (a brief moment, one hopes).

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level since window functions are applied after the WHERE clause. So you need at least one CTE of sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
#Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs #Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

How to write an SQL query that retrieves high scores over a recent subset of scores -- see explaination

Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "last 5 lessons" is a limiting condition, rather than a requirement, so if they completely only 1 lesson, but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per-user, I think I can solve the bulk of the problem with a couple ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per-user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >90 condition, I think I can transform the score into an integer so that it's "1" if >= 90, and "0" if < 90. I can then introduce a clause like "group by Username having SUM(Score)=COUNT(Score).", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.
You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT in (
SELECT DISTINCT Username
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you desribed since it is not clear how to order the lessons otherwise: The timestamp of the first attempt of the first question of a lesson? Or the timestamp of the first attempt of any question of a lesson? Or simply the first(or the most recent?) timestamp of any result of any question of a lesson?
The innermost Select adds all the AttemptNumber and LessonRank as provided by you.
The next Select retains only the results which would disqualify a user to be in the final list - all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost Select, we can select all the users which are not in the exclusion list. Basically all the other users which have answered any question.
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM
(
SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
Group BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score <= 90 most-recent 5 lessons
which strikes me as a little easier to grab with a NOT EXISTS subquery.
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
Select username
from table t inner join
(select top 5 username, lessonNumber
from table
order by timestamp desc) l
on t.username = l.username and t.lessonNumber = l.lessonNumber
from table
where score >= 90