I am trying to find the top 10 mentions (#xxxxx) in my Twitter data. I have created the initial table twitter.full_text_ts and loaded it with my data:
create table twitter.full_text_ts as
select id, cast(concat(substr(ts,1,10), ' ', substr(ts,12,8)) as timestamp) as ts, lat, lon, tweet
from full_text;
I've been able to extract the mentions from the tweets using this query (the extracted value is aliased as patterns):
select id, ts, regexp_extract(lower(tweet), '(.*)#user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts
order by patterns desc
limit 50;
Executing this gives me:
USER_a3ed4b5a 2010-03-07 03:46:23 fffed220
USER_dc8cfa6f 2010-03-05 18:28:39 fffdabf9
USER_dc8cfa6f 2010-03-05 18:32:55 fffdabf9
USER_915e3f8c 2010-03-07 03:39:09 fffdabf9
and so on...
You can see that fffed220 etc. are the extracted patterns.
Now what I would like to do is count the number of times each of these mentions (patterns) occurs and output the top 10. For example, fffdabf9 occurs 20 times, fffxxxx occurs 17 times, and so on.
The most readable way to do this is to save your first query into a temporary table, then do a GROUP BY on the temp table:
create table tmp as
select id, ts,
       regexp_extract(lower(tweet), '(.*)#user_(\\S{8})([:| ])(.*)', 2) as patterns
from twitter.full_text_ts; -- your first query, minus the order by/limit, so all rows are counted

select patterns, count(*) as n_mentions
from tmp
group by patterns
order by n_mentions desc
limit 10;
Alternatively, you can do it in one statement with a common table expression (again dropping the order by/limit from the exploratory query):
with mentions as
(select id, ts,
        regexp_extract(lower(tweet), '(.*)#user_(\\S{8})([:| ])(.*)', 2) as patterns
 from twitter.full_text_ts)
select patterns, count(*) as n_mentions
from mentions
group by patterns
order by n_mentions desc
limit 10;
I have X million records in a table TABLE_A and want to process these records one by one.
How can I divide the population equally among 10 instances of the same PL/SQL script to process in parallel?
See the query below:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM
(
SELECT CUSTOMER_ID, NAME, HOME_TELNO, DEPT_NAME, ROWNUM AS RNUM
FROM TABLE_A ORDER BY CUSTOMER_ID ASC
) CBR
WHERE CBR.RNUM < :sqli_end_rownum AND CBR.RNUM >= :sqli_start_rownum ;
The values are incremented on each iteration of the loop: in the next iteration, sqli_start_rownum becomes sqli_end_rownum.
This query is taking a long time to run. Does someone have a better way to do it?
You could look into DBMS_PARALLEL_EXECUTE:
http://docs.oracle.com/cd/E11882_01/appdev.112/e40758/d_parallel_ex.htm#ARPLS67331
For example:
https://oracle-base.com/articles/11g/dbms_parallel_execute_11gR2
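As a rough sketch of what that approach looks like (the task name and the my_proc processing procedure here are hypothetical placeholders; see the links above for complete examples):
DECLARE
  l_task VARCHAR2(30) := 'process_table_a'; -- hypothetical task name
BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => l_task);

  -- split TABLE_A into rowid chunks of roughly 10000 rows each
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => l_task,
    table_owner => USER,
    table_name  => 'TABLE_A',
    by_row      => TRUE,
    chunk_size  => 10000);

  -- run the processing procedure against each chunk, 10 jobs in parallel;
  -- :start_id and :end_id are bound to each chunk's rowid range automatically
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
    task_name      => l_task,
    sql_stmt       => 'BEGIN my_proc(:start_id, :end_id); END;',
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 10);

  DBMS_PARALLEL_EXECUTE.DROP_TASK(task_name => l_task);
END;
/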
The poor man's version of this is basically to run a query to generate ranges of rowids. You can then access the rows in the table within a given range.
Step 1: create the number of "buckets" you want to divide the table into and get a range of rowids for each bucket. Here's an 8-bucket example:
select bucket_num, min(rid) as start_rowid, max(rid) as end_rowid, count(*)
from (select rowid rid
, ntile(8) over (order by rowid) as bucket_num
from table_a
)
group by bucket_num
order by bucket_num;
You'd get an output that looks like this (I'm using 12c - rowids may look different in 11g):
BUCKET_NUM START_ROWID END_ROWID COUNT(*)
1 AABetTAAIAAB8GCAAA AABetTAAIAAB8u5AAl 82792
2 AABetTAAIAAB8u5AAm AABetTAAIAAB9RrABi 82792
3 AABetTAAIAAB9RrABj AABetTAAIAAB96vAAU 82792
4 AABetTAAIAAB96vAAV AABetTAAIAAB+gKAAs 82792
5 AABetTAAIAAB+gKAAt AABetTAAIAAB+/vABv 82792
6 AABetTAAIAAB+/vABw AABetTAAIAAB/hbAB1 82791
7 AABetTAAIAAB/hbAB2 AABetTAAIAACARDABf 82791
8 AABetTAAIAACARDABg AABetTAAIAACBGnABq 82791
(The sum of the counts will be the total number of rows in the table at the time of the query.)
Step 2: grab a set of rows from the table for a given range:
SELECT <whatever you need>
FROM <table>
WHERE rowid BETWEEN 'AABetTAAIAAB8GCAAA' and 'AABetTAAIAAB8u5AAl'
...
Step 3: repeat Step 2 for each of the ranges.
So instead of this:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM
(
SELECT CUSTOMER_ID, NAME, HOME_TELNO, DEPT_NAME, ROWNUM AS RNUM
FROM TABLE_A ORDER BY CUSTOMER_ID ASC
) CBR
WHERE CBR.RNUM < :sqli_end_rownum AND CBR.RNUM >= :sqli_start_rownum ;
you'll just have this:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM table_a
WHERE rowid BETWEEN :start_rowid and :end_rowid
You can use this to run the same job in parallel, but you'll need a separate session for each run (e.g. multiple SQL*Plus sessions). You can also use something like DBMS_JOB/DBMS_SCHEDULER to launch background jobs.
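For example, a minimal DBMS_SCHEDULER sketch for one bucket (the job name and my_proc are hypothetical placeholders):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name   => 'process_bucket_1', -- hypothetical job name
    job_type   => 'PLSQL_BLOCK',
    job_action => 'BEGIN my_proc(''AABetTAAIAAB8GCAAA'', ''AABetTAAIAAB8u5AAl''); END;',
    enabled    => TRUE); -- runs once in the background immediately
END;
/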
(Note: always be aware of whether your table is being updated between the time the buckets are calculated and the time you access the rows, as you can miss rows.)
I'm trying to export rows from one database to Excel and I'm limited to 65000 rows at a shot. That tells me I'm working with an Access database but I'm not sure since this is a 3rd party application (MRI Netsource) with limited query ability. I've tried the options posted at this solution (Is there a way to split the results of a select query into two equal halfs?) but neither of them work -- in fact, they end up duplicating results rather than cutting them in half.
One possibly related issue is that this table does not have a unique ID field. Each record's unique ID can be dynamically formed by the concatenation of several text fields.
This produces 91934 results:
SELECT * from note
This produces 122731 results:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY notedate) AS rn FROM note
) T1
WHERE rn % 2 = 1
EDIT: Likewise, this produces 91934 results, half of them with a tile_nr value of 1, the other half with a value of 2:
SELECT *, NTILE(2) OVER (ORDER BY notedate) AS tile_nr FROM note
However, this produces 122778 results, all of which have a tile_nr value of 1:
SELECT bldgid, leasid, notedate, ref1, ref2, tile_nr
FROM (
SELECT *, NTILE(2) OVER (ORDER BY notedate) AS tile_nr FROM note) x
WHERE x.tile_nr = 1
I know that I could just use a COUNT to get the exact number of records, run one query using TOP 65000 ORDER BY notedate, and then another using TOP 26934 ORDER BY notedate DESC, for example, but as this dataset changes a lot I'd prefer some way to automate this and save time.
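For what it's worth, here is a sketch of that two-query approach. The extra ORDER BY columns are my assumption: since notedate alone is not unique, adding the fields that form the record's unique ID makes the sort deterministic, and a non-deterministic sort on notedate ties may be exactly why the splitting attempts above returned overlapping results:
-- first batch
SELECT TOP 65000 *
FROM note
ORDER BY notedate, bldgid, leasid, ref1, ref2;

-- second batch (26934 = 91934 - 65000, using the counts above)
SELECT TOP 26934 *
FROM note
ORDER BY notedate DESC, bldgid DESC, leasid DESC, ref1 DESC, ref2 DESC;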
I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) < 3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned, as there are two different RoundIDs that are tied for the second-lowest PlayedTo. For what I am trying to calculate it doesn't matter which of the two it returns, as I just need to know the number, but for reporting I ideally need the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records, but it will need to run on about 1000 records once the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select
x1.PlayerID, x1.RoundID
, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and (x2.PlayedTo < x1.PlayedTo
or x2.PlayedTo = x1.PlayedTo
and x2.DatePlayed >= x1.DatePlayed
)
) < 3
order by PlayerID, PlayedTo, RoundID;
For performance, create an index supporting the correlated subquery. Something like:
create index P_7to8Calcs__PlayerID_PlayedTo on P_7to8Calcs(PlayerID, PlayedTo);
Note: I used your SQLFiddle as I do not have Access available here.
Edit: In case the index does not improve performance enough, you might want to try the following query using window functions, which avoids the nested subquery. It works in your SQLFiddle, but I am not sure whether this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.
Given a table of responses with columns:
Username, LessonNumber, QuestionNumber, Response, Score, Timestamp
How would I run a query that returns which users got a score of 90 or better on their first attempt at every question in their last 5 lessons? "Last 5 lessons" is a limiting condition, rather than a requirement, so if they completed only 1 lesson but got all of their first attempts for each question right, then they should be included in the results. We just don't want to look back farther than 5 lessons.
About the data: Users may be on different lessons. Some users may have not yet completed five lessons (may only be on lesson 3 for example). Each lesson has a different number of questions. Users have different lesson paths, so they may skip some lesson numbers or even complete lessons out of sequence.
Since this seems to be a problem of transforming temporally non-uniform/discontinuous values into uniform/contiguous values per user, I think I can solve the bulk of the problem with a couple of ranking function calls. The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions completed is variable per user.
So far...
As a starting point or hint at what may need to happen, I've transformed Timestamp into an "AttemptNumber" for each question, by using "row_number() over (partition by Username,LessonNumber,QuestionNumber order by Timestamp) as AttemptNumber".
I'm also trying to transform LessonNumber from an absolute value into a contiguous ranked value for individual users. I could use "dense_rank() over (partition by Username order by LessonNumber desc) as LessonRank", but that assumes the order lessons are completed corresponds with the order of LessonNumber, which is unfortunately not always the case. However, let's assume that this is the case, since I do have a way of producing such a number through a couple of joins, so I can use the dense_rank transform described to select the "last 5 completed lessons" (i.e. LessonRank <= 5).
For the >= 90 condition, I think I can transform the score into an integer so that it's 1 if >= 90 and 0 if < 90. I can then introduce a clause like "group by Username having SUM(Score) = COUNT(Score)", which will select only those users with all scores equal to 1.
Any solutions or suggestions would be appreciated.
You kind of gave away the solution:
SELECT DISTINCT Username
FROM Results
WHERE Username NOT in (
SELECT DISTINCT Username
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1 and Score < 90
)
Concerning the LessonRank, I used exactly what you described, since it is not clear how to order the lessons otherwise: by the timestamp of the first attempt at the first question of a lesson? By the timestamp of the first attempt at any question of a lesson? Or simply by the first (or the most recent?) timestamp of any result for any question of a lesson?
The innermost SELECT adds the AttemptNumber and LessonRank columns, as provided by you.
The next SELECT retains only the results which would disqualify a user from the final list: all first attempts with an insufficient score in the last 5 lessons. We end up with a list of users we do not want to display in the final result.
Therefore, in the outermost SELECT, we can select all the users which are not in that exclusion list; basically, all the other users which have answered any question.
EDIT: As so often, second try should be better...
One more EDIT:
Here's a version including your remarks in the comments.
SELECT Username
FROM
(
SELECT Username, CASE WHEN Score >= 90 THEN 1 ELSE 0 END AS QuestionScoredWell
FROM (
SELECT
r.Username,r.LessonNumber, r.QuestionNumber, r.Score, r.Timestamp
, row_number() over (partition by r.Username,r.LessonNumber,r.QuestionNumber order by r.Timestamp) as AttemptNumber
, dense_rank() over (partition by r.Username order by r.LessonNumber desc) AS LessonRank
FROM Results r
) as f
WHERE LessonRank <= 5 and AttemptNumber = 1
) as ff
GROUP BY Username
HAVING MIN(QuestionScoredWell) = 1
I used a Having clause with a MIN expression on the calculated QuestionScoredWell value.
When comparing the execution plans for both queries, this query is actually faster. Not sure though whether this is partially due to the low number of data rows in my table.
Random suggestions:
1
The conditional specification of scoring above 90 for "first attempt at every question in their last 5 lessons" is also tricky, because the number of questions is variable per-user.
is equivalent to
There exists no first attempt with a score < 90 in the most recent 5 lessons
which strikes me as a little easier to grasp with a NOT EXISTS subquery.
2
First attempt is the same as where timestamp = (select min(timestamp) ... )
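Putting 1 and 2 together, a sketch (column and table names taken from the question and the answer above; like that answer, this assumes a higher LessonNumber means a more recent lesson):
SELECT DISTINCT r.Username
FROM Results r
WHERE NOT EXISTS (
    SELECT 1
    FROM Results f
    WHERE f.Username = r.Username
      AND f.Score < 90
      -- first attempt: the earliest response to this question
      AND f.Timestamp = (SELECT MIN(t.Timestamp)
                         FROM Results t
                         WHERE t.Username = f.Username
                           AND t.LessonNumber = f.LessonNumber
                           AND t.QuestionNumber = f.QuestionNumber)
      -- within the user's most recent 5 lessons
      AND (SELECT COUNT(DISTINCT n.LessonNumber)
           FROM Results n
           WHERE n.Username = f.Username
             AND n.LessonNumber > f.LessonNumber) < 5
)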
You need to identify the top 5 lessons per user first, using the timestamp to prioritize lessons, then you can limit by score. Try:
Select t.username
from table t inner join
    (select top 5 username, lessonNumber
     from table
     order by timestamp desc) l
  on t.username = l.username and t.lessonNumber = l.lessonNumber
where t.score >= 90
I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
    SELECT @rownum := @rownum + 1 AS rownum, e.*
    FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
Something like this came to my mind:
select @rownum := @rownum + 1 as rownum, entries.*
from (select @rownum := 0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at hand, but maybe this will help...
As far as visualization goes, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user, choose a bucket interval, COUNT within the buckets, and show the counts on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or, if you don't like the way you have to repeat yourself, or if you are playing with different buckets and want to analyze across many users in 3-D (measure on z against x = uid, y = bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
@Michal
For whatever reason, your example only works when the WHERE clause uses a less-than operator on @rownum. I think that when the WHERE clause filters out a row, rownum doesn't get incremented, so it can never match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?
AVG
STDDEV_POP
VARIANCE
TO_DAYS
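For instance, a sketch of that idea (assuming MySQL's TO_DAYS and the entries table from the question):
select avg(to_days(timefield))        as mean_day,
       stddev_pop(to_days(timefield)) as stddev_day,
       variance(to_days(timefield))   as variance_day
from entries
where uid = ?;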
select timefield
from entries
where rand() <= .01 -- will return roughly 1% of rows; adjust as needed
Not a MySQL expert, so I'm not sure how rand() operates in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/