Cumulative average number of records created for specific day of week or date range - sql

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue: I don't have a whole lot of experience with MySQL when it comes to cumulative sums and running averages. I've thrown together the following query, which kind of makes sense to me, but it just keeps locking up the command console. It takes forever to execute, and there are only 80k records in the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS t1.day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.date_time_entry)) = t1.day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY t1.day_of_week;
Halps? Please, don't make me cut myself again. :'(

How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
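A minimal sketch of that companion table and the backfill step, assuming the VMS_LOGS table and datetime_entry column from the question (the companion table's name and exact types are made up here):
CREATE TABLE vms_logs_weeks (
    id          INT PRIMARY KEY,   -- same id as the matching VMS_LOGS row
    week_number INT NOT NULL,
    day_of_week INT NOT NULL
);
-- Generate the "missing" records from the master table; re-runnable because
-- rows that already exist are skipped via the primary key.
INSERT IGNORE INTO vms_logs_weeks (id, week_number, day_of_week)
SELECT l.id, WEEK(l.datetime_entry), DAYOFWEEK(l.datetime_entry)
FROM VMS_LOGS l;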
Then, your report can be something along the lines of:
select
    DayOfWeek
  , count(*)/count(distinct(WeekNumber)) as Average
from
    MyCompanionTable
group by
    DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
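A hedged sketch of that pre-summarized variant (again, the table name and types are assumptions): keep one row per calendar day, refresh it for closed days, and pull today's rows from the master table only when the report runs.
CREATE TABLE vms_logs_daily (
    log_date    DATE PRIMARY KEY,
    day_of_week INT NOT NULL,
    records     INT NOT NULL
);
-- Refresh the summary for closed days; today's rows stay in VMS_LOGS only.
INSERT INTO vms_logs_daily (log_date, day_of_week, records)
SELECT DATE(datetime_entry), DAYOFWEEK(datetime_entry), COUNT(*)
FROM VMS_LOGS
WHERE DATE(datetime_entry) < CURDATE()
GROUP BY DATE(datetime_entry), DAYOFWEEK(datetime_entry)
ON DUPLICATE KEY UPDATE records = VALUES(records);
The weekday averages are then AVG(records) grouped by day_of_week over this table, with today's count from VMS_LOGS added in if the current day matters.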

I rewrote your query as:
SELECT x.day_of_week,
       AVG(x.daily_count) AS average_records_created
FROM (SELECT DAYOFWEEK(t.datetime_entry) AS day_of_week,
             COUNT(*) AS daily_count
      FROM VMS_LOGS t
      GROUP BY DATE(t.datetime_entry), DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
The inner query counts the records for each calendar day; the outer query then averages those per-day counts for each day of the week.

The reason your query takes so long is the correlated inner select: for each of the ~80k rows in the outer query it re-scans the table, so you're essentially doing 80,000 × 80,000 ≈ 6.4 billion row comparisons. With a query like this, your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed, or the user logs in and checks the report later.
Even with the optimization written by OMG Ponies (below), a large table still has to be scanned in full every time the report runs, so the scheduled-report approach can still be worth it:
SELECT x.day_of_week,
       AVG(x.daily_count) AS average_records_created
FROM (SELECT DAYOFWEEK(t.datetime_entry) AS day_of_week,
             COUNT(*) AS daily_count
      FROM VMS_LOGS t
      GROUP BY DATE(t.datetime_entry), DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week

Related

Redshift SQL - Skipped sequence

I'm working on applicant pipeline data and need to get a count of applicants who made it to each phase of the pipeline/funnel. If an applicant skips a phase, I need to count them in that phase anyway. Here's an example of how the data might look for one applicant:
Stage name | Entered on
Application Review | 9/7/2018
Recruiter Screen | 9/10/2018
Phone Interview | blank
Interview | 9/17/2018
Interview 2 | 9/20/2018
Offer | blank
this is what the table looks like:
CREATE TABLE application_stages (
application_id bigint,
stage_id bigint,
entered_on timestamp without time zone,
exited_on timestamp without time zone,
stage_name character varying
);
In this example, I want to count Application Review through Interview 2 (including the skipped/blank Phone Interview phase), but not the Offer. How would I write the above in SQL? (Data is stored in Amazon Redshift. Using SQL workbench to query.)
Also, please let me know if there is anything else I can add to my question to make the issue/solution clearer.
You can hardcode the stages of the pipeline in an event_list table like this:
id | stage_name
1 | first stage
2 | second stage
3 | third stage
4 | fourth stage
UPD: The deeper the stage is in the funnel, the higher its ID. This way you can compare them: the third stage is deeper than the second because 3 > 2. So if you need to find people who reached the 2nd stage, that includes people who have events with id = 2 or events with id > 2, i.e. events deeper in the funnel.
If the second stage was missed and the third stage is recorded for some person, you can still count that person as having "reached the second stage" by joining your event data to this table on stage_name and counting the records with id >= 2, like
select count(distinct user_id)
from event_data t1
join event_list t2
using (stage_name)
where t2.id>=2
Alternatively, you can left join your event table to event_list and fill the gaps using a window function such as lag, which returns the value of a neighbouring row (i.e. assigning the timestamp of the first stage to the second stage in the case above).
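A minimal sketch of that join-and-fill approach, assuming the hardcoded event_list(id, stage_name) table above and the application_stages table from the question; like the follow-up query below, it uses lead(...) ignore nulls so a blank stage is only filled when a later stage actually has a timestamp:
select s.application_id,
       e.stage_name,
       coalesce(s.entered_on,
                lead(s.entered_on, 1) ignore nulls
                  over (partition by s.application_id
                        order by e.id)) as entered_on_filled
from application_stages s
left join event_list e using (stage_name);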
Here is the SQL I ended up with. Thanks for the ideas, @AlexYes!
select stage_name,
       application_stages.application_id,
       entered_on,
       case when entered_on is NULL
            then lead(entered_on, 1) ignore nulls
                 over (partition by application_stages.application_id
                       order by case stage_name
                                    when 'Application Review' then 1
                                    when 'Recruiter Screen'   then 2
                                    when 'Phone Interview'    then 3
                                    when 'Interview'          then 4
                                    when 'Interview 2'        then 5
                                    when 'Offer'              then 6
                                    when 'Hired'              then 7
                                end)
            else entered_on
       end as for_count,
       exited_on
from application_stages
I realize that the above SQL doesn't give me the counts but I am doing the counts in Tableau. Happy to have the format above in case I need to do other calculations on the new "for_count" field.

User-defined pseudocolumn in Oracle

I have a large dataset in an oracle database that is currently accessed from Java one item at a time. For example if a user is trying to do a bulk get of 50 items it will process them sequentially, calling a stored procedure for each one. I am now trying to implement a bulk get, but am having some difficulty due to the way the user can pass in a range query:
An example table:
prim_key | identifier | start | end
----------+--------------+---------+-------
1 | aaa | 1 | 3
2 | aaa | 3 | 7
3 | bbb | 1 | 5
The way it works is that if you have a query like (id='aaa' and pos=1) it will find prim_key = 1, but if you query (id='aaa' and pos=2) it won't find anything. If you do (id='aaa' and pos=-2) then it will again find prim_key=1 because the stored proc converts the -2 into a range scan equivalent to start<=2 and end>2.
(Extra context: the start/end are actually dates and this querying mechanism allows efficient "latest as of date" queries as opposed to doing something like select prim_key,
start from myTable
where start = (select max(start) from myTable where start <= 2))
This is all fine and works correctly for single gets, but now I'm trying to do bulk gets so that we can speed up the batch considerably. The first attempt was to multithread the individual calls, but it put too much stress on the database to be doing so many parallel queries on the same table. To solve this I've been trying to create a query like
select prim_key
from myTable
where (identifier='aaa' and start=3)
or (identifier='aaa' and start<=2 and end>2)
building this up from the list of input parameters ('aaa',3 ; 'bbb',-2), which works well and produces an explain plan using all of the indexes I would expect.
My Problem: I need to know what the input parameters were that retrieved that row in order to do further processing and return the relevant prim_key. I need to use something like a pseudocolumn that I can define myself:
select prim_key, PSEUDO
from myTable
where (identifier='aaa' and start=3 and PSEUDO='a3')
or (identifier='aaa' and start<=2 and end>2 and PSEUDO='a-2')
but I can't find any way to return a value from the where clause, and I think subqueries would lose the indexing efficiencies gained by doing it all in one select.
Try something like:
select
prim_key,
case when start = 3 then 'a3' else 'a-2' end pseudo
from
your_table
where
...
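To make that concrete, here is a hedged sketch of the full pattern: each OR branch in the WHERE clause gets a matching WHEN branch that tags the row with the input pair that produced it. The column names start_pos/end_pos stand in for the question's start/end, which are reserved words in Oracle:
select prim_key,
       case
         when identifier = 'aaa' and start_pos = 3                  then 'a3'
         when identifier = 'aaa' and start_pos <= 2 and end_pos > 2 then 'a-2'
       end as pseudo
from myTable
where (identifier = 'aaa' and start_pos = 3)
   or (identifier = 'aaa' and start_pos <= 2 and end_pos > 2);
If a row happens to satisfy more than one branch, the CASE returns only the first matching tag, so either the further processing has to tolerate that or the branches have to be made mutually exclusive.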

Is this SELECT and ORDER BY query the most efficient way I could have done it?

In my journey to learn SQL, I'm writing various queries on an old database of mine, but getting into more complex things, I want to make sure I'm not over engineering this. I have a table Agent, with different agents offering different prices for cities. Multiple agents can serve the same city, each with different prices. I wanted to run a query which would return the total cost of hiring all of the agents for any given city, ordered by the most expensive.
WITH orderedPrices AS (
    SELECT SUM(agtFMPrice) OVER (PARTITION BY agtCity) AS IX
    FROM Agent
)
SELECT IX
FROM orderedPrices
ORDER BY IX DESC
I found that without the CTE orderedPrices it wouldn't let me order by the prices (I assume because it's a window function, or whatever they're called). Did I do this in the best way I could have, or could it be simplified?
Also, if you're feeling particularly bored, go ahead and give me a new assignment/query to do on this table. I could use the practice.
What you have written in English doesn't seem to quite match what you have written in SQL.
English:
- One record per City
- One field per record, showing the total cost of all associated agents
SQL:
- One record per Agent
- One field per record, showing the total cost of all agents in the same city
AgentID | agtCity | agtFMPrice
---------+---------+------------
1 | 1 | 10
2 | 1 | 20
3 | 2 | 30
4 | 2 | 10
5 | 2 | 25
Results of SQL version     Results of English version
------------------------   ----------------------------
30                         30
30                         65
65
65
65
If you want the English version, I'd do this...
SELECT
agtCity,
SUM(agtFMPrice) AS IX
FROM
Agent
GROUP BY
agtCity
ORDER BY
SUM(agtFMPrice) DESC
To assist performance, the table could (should?) also have an Index on (agtCity)
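For example (the index name and the inclusion of agtFMPrice are assumptions, not from the answer):
CREATE INDEX IX_Agent_City ON Agent (agtCity, agtFMPrice);
On most engines, adding agtFMPrice makes the index covering for this query, so the grouping and summing can be served from the index without touching the table.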

Postgresql (Rails 3) merge rows on column (same table)

First, I've been using MySQL forever and am now upgrading to PostgreSQL. The SQL syntax is much stricter and some behavior is different, hence my question.
I've been searching around for how to merge rows in a PostgreSQL query on a table such as
id | name | amount
0 | foo | 12
1 | bar | 10
2 | bar | 13
3 | foo | 20
and get
name | amount
foo | 32
bar | 23
The closest I've found is Merge duplicate records into 1 records with the same table and table fields
sql returning duplicates of 'name':
scope :tallied, lambda { group(:name, :amount).select("charges.name AS name,
SUM(charges.amount) AS amount,
COUNT(*) AS tally").order("name, amount desc") }
What I need is
scope :tallied, lambda { group(:name, :amount).select("DISTINCT ON(charges.name) charges.name AS name,
SUM(charges.amount) AS amount,
COUNT(*) AS tally").order("name, amount desc") }
except that, rather than returning the first row for a given name, it should return a merge of all rows with that name (amounts added together)
In mysql, appending .group(:name) (not needing the initial group) to the select would work as expected.
This seems like an everyday sort of task that should be easy. What would be a simple way of doing this? Please point me in the right direction.
P.S. I'm trying to learn here (so are others), don't just throw sql in my face, please explain it.
I've no idea what RoR is doing in the background, but I'm guessing that group(:name, :amount) will run a query that groups by name, amount. The one you're looking for is group by name:
select name, sum(amount) as amount, count(*) as tally
from charges
group by name
If you append amount to the group by clause, the query will do just that -- i.e. count(*) would return the number of times each amount appears per name, and the sum() would return that number times that amount.
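To make that concrete with the sample rows from the question, grouping by both columns would merge nothing at all:
select name, sum(amount) as amount, count(*) as tally
from charges
group by name, amount;
-- name | amount | tally   (one row per (name, amount) pair, nothing merged)
-- foo  |     12 |     1
-- foo  |     20 |     1
-- bar  |     10 |     1
-- bar  |     13 |     1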

Pulling items out of a DB with weighted chance

Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?
The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?
One way I can think of is to create another column in the table that holds a rolling sum of your weights. To pull a record, generate a random number between 0 and the total of all your weights, and take the first row whose rolling sum exceeds that number (a sketch of this follows the links below).
For example, if you had four rows with the following weights:
+---+--------+------------+
|row| weight | rollingsum |
+---+--------+------------+
| a | 3 | 3 |
| b | 3 | 6 |
| c | 4 | 10 |
| d | 1 | 11 |
+---+--------+------------+
Then choose a random number n with 0 <= n < 11, and return row a if 0 <= n < 3, row b if 3 <= n < 6, and so on.
Here are some links on generating rolling sums:
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html
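A minimal sketch of the pick itself, assuming MySQL 8+ window functions and a hypothetical items(id, weight) table (neither is given in the answer):
-- Draw once: a random point between 0 and the total weight.
SET @r = RAND() * (SELECT SUM(weight) FROM items);
-- Take the first row whose rolling sum exceeds that point.
SELECT t.id
FROM (SELECT id,
             SUM(weight) OVER (ORDER BY id) AS rolling_sum
      FROM items) t
WHERE t.rolling_sum > @r
ORDER BY t.rolling_sum
LIMIT 1;
With the weights above, row c wins whenever 6 <= @r < 10, i.e. about 4 times in 11.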
I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.
I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:
RowSource
---------
RowID
UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier
You could write a query like this (SQL Server specific):
SELECT TOP 100 rs.RowId, urp.FrequencyMultiplier
FROM RowSource rs
LEFT JOIN UserRowProbability urp ON rs.RowId = urp.RowId
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()
This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
Start with 3 tables: users, data, and user-data. User-data records which rows should be preferred for each user.
Then create one view based on the data rows that are preferred by the user.
Create a second view that has the non-preferred data.
Create a third view which is a union of the first two. The union should select more rows from the preferred data.
Then finally select random rows from the third view.
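A rough sketch of that idea collapsed into a single query rather than three views (MySQL syntax; the data(id, ...) and user_data(user_id, data_id) names and the example user id 42 are all assumptions). The repeated preferred branch in the UNION ALL is what tilts the draw toward those rows:
SELECT pool.*
FROM (
    SELECT d.* FROM data d
    JOIN user_data ud ON ud.data_id = d.id AND ud.user_id = 42
    UNION ALL
    SELECT d.* FROM data d            -- preferred rows repeated to raise their weight
    JOIN user_data ud ON ud.data_id = d.id AND ud.user_id = 42
    UNION ALL
    SELECT d.* FROM data d
    WHERE NOT EXISTS (SELECT 1 FROM user_data ud
                      WHERE ud.data_id = d.id AND ud.user_id = 42)
) pool
ORDER BY RAND()
LIMIT 10;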