Unable to divide the counts of two separate lists in SQL, keeps returning 1

I have one list of events. One event name is creating an account and another is creating an account with Facebook. I am trying to see what percentage of accounts created use Facebook.
The code below gives me an accurate count of the number of Facebook accounts and total accounts, but when I try to divide the two numbers it just returns 1.
I am very new to SQL, and have spent hours trying to figure out why it is doing that to no avail.
with
fb_act as (
select *
from raw_event
where name = 'onboard_fb_success'
and event_ts::date >= current_date - 30
),
total_act as (
select *
from raw_event
where name ='create_account'
and event_ts::date >= current_date - 30
)
select count(fb_act)/count(total_act), total_act.event_ts::date as day
from total_act, fb_act
group by day
order by day
I expect the output to be about ~.3, but the actual output is always exactly 1.

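Conditional aggregation is a much simpler way to write the query. Your version cross joins the two CTEs, so every joined row contains both an fb_act row and a total_act row; both counts then equal the number of joined rows and the ratio is always 1. You appear to be using Postgres, so something like this (the cast to numeric avoids integer division, which would otherwise truncate the ratio to 0):
select re.event_ts::date as day,
(sum((name = 'onboard_fb_success')::int)::numeric /
nullif(sum((name = 'create_account')::int), 0)
) as ratio
from raw_event re
where re.event_ts::date >= current_date - 30
group by re.event_ts::date
order by day;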
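To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. This is an assumption for illustration only: the question uses Postgres, and the sample rows are invented. It shows that in the cross join both counts see every joined row, while conditional aggregation counts each event once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_event (name TEXT, event_ts TEXT)")
# Hypothetical sample data: 10 account creations and 3 Facebook sign-ups.
rows = ([("create_account", "2024-01-01")] * 10
        + [("onboard_fb_success", "2024-01-01")] * 3)
conn.executemany("INSERT INTO raw_event VALUES (?, ?)", rows)

# Cross join: both counts see all 10 * 3 = 30 joined rows, so the ratio is 1.
cross = conn.execute("""
    SELECT COUNT(f.name), COUNT(t.name)
    FROM raw_event t, raw_event f
    WHERE t.name = 'create_account'
      AND f.name = 'onboard_fb_success'
""").fetchone()

# Conditional aggregation: a single pass in which each row is counted once.
# Multiplying by 1.0 forces floating-point division.
ratio = conn.execute("""
    SELECT 1.0 * SUM(name = 'onboard_fb_success')
           / NULLIF(SUM(name = 'create_account'), 0)
    FROM raw_event
""").fetchone()[0]

print(cross)  # (30, 30)
print(ratio)  # 0.3
```

NULLIF guards against dividing by zero on days with no create_account events; the ratio is NULL for such days instead of raising an error.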

Related

POSTGRES DATE_TRUNC should return 0 for intervals that have no data

I am trying to do time-series-like reporting using the Postgres DATE_TRUNC function. It works and I get the expected output, but when a specific interval has no records, that interval is skipped. My expected output includes the interval with 0 as the count. Below is the query I have right now. What should I change to get the intervals that have no data? Thanks in advance.
SELECT date_trunc('days', sent_at), count('*')
FROM (select * from invoice
WHERE supplier = 'ABC' and sent_at BETWEEN '2021-12-01' AND '2022-07-31') as inv
GROUP BY date_trunc('days', sent_at)
ORDER BY date_trunc('days', sent_at);
Expected: As you can see below, the current output now shows 02/12 and then 07/12, it has skipped dates in the middle, but for me, it should also show 03/12, 04/12, 05/12 with count as 0
Current output
It doesn't seem like you have those dates in your data, in which case you need to generate them. Also, casting your timestamp to date instead of using date_trunc() gets rid of the zeroed time part in the output.
SELECT dates::date, count(*) filter (where sent_at is not null)
FROM (
select *
from invoice a
right join generate_series( '2021-12-01'::date,
'2021-12-31'::date,
'1 day'::interval ) as b(dates)
on sent_at::date=b.dates) as inv
GROUP BY 1
ORDER BY 1;
Here's a working example. Also, please try to improve your question according to @nbk's comment.
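The same gap-filling idea can be sketched outside Postgres. The example below is an illustration under assumptions: it uses Python's sqlite3 module, made-up invoice rows, and a recursive CTE in place of generate_series. The supplier filter sits in the join condition rather than in WHERE, so days without invoices are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (supplier TEXT, sent_at TEXT)")
# Hypothetical rows: two invoices on 2 Dec, one on 5 Dec, nothing in between.
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?)",
    [("ABC", "2021-12-02 09:00:00"),
     ("ABC", "2021-12-02 15:30:00"),
     ("ABC", "2021-12-05 11:00:00")],
)

daily = conn.execute("""
    WITH RECURSIVE days(day) AS (
        SELECT '2021-12-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < '2021-12-05'
    )
    SELECT d.day, COUNT(i.sent_at)   -- COUNT skips the NULLs of empty days
    FROM days d
    LEFT JOIN invoice i
           ON date(i.sent_at) = d.day
          AND i.supplier = 'ABC'
    GROUP BY d.day
    ORDER BY d.day
""").fetchall()

for day, n in daily:
    print(day, n)
```

Counting i.sent_at instead of * is what turns an unmatched day into 0 rather than 1.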

SQL: Average value per day

I have a table called 'tweets'. It includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh/mm/ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The 'classified' column contains the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking to find the average classified scores per ticker-string. My question is therefore: could someone show me how to adjust this script so that I can calculate the average classified scores per ticker-string per day?
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
(SELECT DATE(created_at) AS date,
tweet_id,
-- text between '{TICKER|' and the next '}'
SUBSTR(processed_text,
LOCATE('{TICKER|', processed_text) + 8,
LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
- LOCATE('{TICKER|', processed_text) - 8) AS ticker
FROM tweets
WHERE processed_text LIKE '%{TICKER|%') t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
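For the average the question asks about, one possible sketch is to swap COUNT for AVG inside the same conditional pattern and to generate the per-ticker columns from a list. Everything below is an assumption for illustration: it runs on SQLite through Python's sqlite3 module rather than MySQL, the sample tweets are invented, and the ticker list is hypothetical. AVG ignores the NULLs produced by a CASE with no ELSE, so only matching tweets enter each average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tweets (
    tweet_id INTEGER, created_at TEXT, classified INTEGER, processed_text TEXT)""")
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (1, "2013-05-01 10:00:00", 1, "buy {TICKER|IBM} now"),
    (2, "2013-05-01 11:00:00", -1, "sell {TICKER|IBM}"),
    (3, "2013-05-01 12:00:00", 1, "{TICKER|GOOG} up"),
    (4, "2013-05-02 09:00:00", 1, "{TICKER|IBM} and {TICKER|GOOG}"),
])

# Hypothetical ticker list; the names become column aliases, so they must be
# trusted identifiers. The LIKE patterns are passed as bound parameters.
tickers = ["IBM", "GOOG"]
cols = ",\n".join(
    f"COALESCE(AVG(CASE WHEN processed_text LIKE ? THEN classified END), 0) AS {t}"
    for t in tickers
)
sql = f"SELECT date(created_at) AS day,\n{cols}\nFROM tweets\nGROUP BY day\nORDER BY day"
params = ["%{TICKER|" + t + "}%" for t in tickers]

rows = conn.execute(sql, params).fetchall()
for row in rows:
    print(row)
```

Note that COALESCE(..., 0) makes "no tweets that day" indistinguishable from a true average of 0, and that days with no tweets at all still need a join against a dates table, as in the counting query.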

Help me build a SQL select statement

SQL isn't my greatest strength and I need some help building a select statement.
Basically, this is my requirement. The table stores a list of names and a timestamp of when the name was entered in the table. Names may be entered multiple times during a week, but only once a day.
I want the select query to return names that were entered anytime in the past 7 days, but not today.
To get a list of names entered today, this is the statement I have:
Select * from table where Date(timestamp) = Date(now())
And to get a list of names entered in the past 7 days, not including today:
Select * from table where (Date(now())- Date(timestamp) < 7) and (date(timestamp) != date(now()))
If the first query returns a set of results, say A, and the second query returns B, how can I get B - A?
Try this if you're working with SQL Server:
SELECT * FROM Table
WHERE Timestamp BETWEEN
dateadd(day,datediff(day,0,getdate()),-7)
AND dateadd(day,datediff(day,0,getdate()),0)
This ensures that the timestamp is between 00:00 7 days ago, and 00:00 today. Today's entries with time greater than 00:00 will not be included.
In plain English, you want records from your second query where the name is not in your first query. In SQL:
Select *
from table
where (Date(now())- Date(timestamp) < 7)
and (date(timestamp) != date(now()))
and name not in (Select name
from table
where Date(timestamp) = Date(now())
)
Use NOT IN, like
select pk from B where pk not in (select pk from A)
or you can do something like
Select * from table where (Date(now())- Date(timestamp) < 7) and (Date(now())- Date(timestamp) >= 1)
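The set difference B - A can also be written directly with EXCEPT. The sketch below is an illustration under assumptions: it uses SQLite through Python's sqlite3 module, a hypothetical table named entries, invented rows, and a fixed "today" so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (name TEXT, ts TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?)", [
    ("alice", "2024-03-10 08:00:00"),  # entered today
    ("alice", "2024-03-08 08:00:00"),  # and also two days ago
    ("bob",   "2024-03-07 12:00:00"),  # past week only
    ("carol", "2024-03-01 12:00:00"),  # older than 7 days
])
today = "2024-03-10"  # fixed instead of date('now') for a stable example

names = [row[0] for row in conn.execute("""
    -- B: names entered 1 to 6 days ago
    SELECT name FROM entries
    WHERE date(ts) >= date(:today, '-6 days') AND date(ts) < :today
    EXCEPT
    -- A: names entered today
    SELECT name FROM entries
    WHERE date(ts) = :today
""", {"today": today})]

print(names)  # ['bob']
```

EXCEPT also removes duplicates, so a name entered on several days of the past week appears once.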

How to obtain the result by using only 1 query?

I have a table containing IP address, timestamp and browser columns. I need to find the percentage usage of each browser within the past week. How do I do it in a single query using nesting? No, it is not a homework question. I just can't seem to figure it out.
Using two inline views. One for the counts and one for the total.
Select
(bCounts.browser_counts * 100.0 / total.total) percentage,
bCounts.browser
FROM
(
Select
Count(timestamp) browser_counts,
browser
From
table
Where
timestamp > '12/1/2010'
Group by
browser) bCounts,
(SELECT COUNT(timestamp) total From table WHERE timestamp > '12/1/2010') total
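A runnable variant of the same idea, as a sketch under assumptions: SQLite via Python's sqlite3 module, a hypothetical table named hits, invented rows, and ISO-formatted dates so that string comparison orders them correctly. The total is computed by a scalar subquery instead of a second inline view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (ip TEXT, ts TEXT, browser TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("10.0.0.1", "2010-12-02", "Firefox"),
    ("10.0.0.2", "2010-12-03", "Firefox"),
    ("10.0.0.3", "2010-12-03", "Chrome"),
    ("10.0.0.4", "2010-12-04", "IE"),
    ("10.0.0.5", "2010-11-01", "IE"),  # outside the window, ignored
])

shares = conn.execute("""
    SELECT browser,
           100.0 * COUNT(*)
             / (SELECT COUNT(*) FROM hits WHERE ts > :since) AS percentage
    FROM hits
    WHERE ts > :since
    GROUP BY browser
    ORDER BY browser
""", {"since": "2010-12-01"}).fetchall()

for browser, pct in shares:
    print(browser, pct)
```

The same filter has to appear in both the outer query and the subquery; otherwise the percentages no longer add up to 100.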

Sqlite3: Need to Cartesian On date

I have a table which is a list of games that have been played, in a sqlite3 database. The field "datetime" is the datetime when the game ended. The field "duration" is the number of seconds the game lasted. I want to know what percent of the past 24 hours had at least 5 games running simultaneously. I figured out how to tell how many games were running at a given time:
select count(*)
from games
where strftime('%s',datetime)+0 >= 1257173442 and
strftime('%s',datetime)-duration <= 1257173442
If I had a table that was simply a list of every second (or every 30 seconds or something) I could do an intentional Cartesian product like this:
select count(*)
from (
select count(*) as concurrent, d.second
from games g, date d
where strftime('%s',datetime)+0 >= d.second and
strftime('%s',datetime)-duration <= d.second and
d.second >= strftime('%s','now') - 24*60*60 and
d.second <= strftime('%s','now')
group by d.second) x
where concurrent >=5
Is there a way to create this date table on the fly? Or that I can get a similar effect to this without having to actually create a new table that is simply a list of all the seconds this week?
Thanks
First, I can't think of a way to approach your problem by creating a table on the fly or without the aid of an extra table. Sorry.
My suggestion is for you to rely on a static Numbers table.
Create a fixed table with the format:
CREATE TABLE Numbers (
number INTEGER PRIMARY KEY
);
Populate it with the number of seconds in 24h (24*60*60 = 86400). I would use any scripting language to do that, running this insert statement 86400 times:
insert into numbers default values;
Now the Numbers table has the numbers 1 through 86400. Your query will then be modified to be:
select count(*)
from (
select count(*) as concurrent, strftime('%s','now') - 86401 + n.number second
from games g, numbers n
where strftime('%s',datetime)+0 >= strftime('%s','now') - 86401 + n.number and
strftime('%s',datetime)-duration <= strftime('%s','now') - 86401 + n.number
group by second) x
where concurrent >=5
Without a procedural language in the mix, that is the best you'll be able to do, I think.
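As a possible alternative to a stored Numbers table, SQLite versions from 3.8.3 onward support recursive common table expressions, which can generate the list of seconds inside the query. The sketch below is an illustration under assumptions: it runs through Python's sqlite3 module, treats datetime as an integer column of epoch seconds, uses invented games, and scans a small fixed window (0 to 500) instead of the last 24 hours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (datetime INTEGER, duration INTEGER)")
# Hypothetical data: (end time, duration). Five games overlap from second
# 100 through second 160; a sixth game runs alone later.
conn.executemany("INSERT INTO games VALUES (?, ?)", [
    (200, 100), (180, 100), (170, 100), (165, 100), (160, 100), (400, 50),
])

window = {"start": 0, "stop": 500}
busy_seconds = conn.execute("""
    WITH RECURSIVE seconds(s) AS (
        SELECT :start
        UNION ALL
        SELECT s + 1 FROM seconds WHERE s < :stop
    )
    SELECT COUNT(*) FROM (
        SELECT s
        FROM seconds
        JOIN games g
          ON g.datetime >= s                -- game has not ended yet
         AND g.datetime - g.duration <= s   -- game has already started
        GROUP BY s
        HAVING COUNT(*) >= 5
    )
""", window).fetchone()[0]

total_seconds = window["stop"] - window["start"] + 1
print(busy_seconds, 100.0 * busy_seconds / total_seconds)
```

With the inclusive comparisons taken from the question, the sampled seconds 100 through 160 all count, which gives 61. For the real case the bounds would be strftime('%s','now') - 86400 and strftime('%s','now').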
Great question!
Here's a query that I think gives you what you want without using a separate table. Note this is untested (so probably contains errors) and I've assumed datetime is an int column with # of seconds to avoid a ton of strftime's.
select sum(concurrent_period) from (
select min(end_table.datetime - begin_table.begin_time) as concurrent_period
from (
select g1.datetime, g1.num_end, count(*) as concurrent
from (
select datetime, count(*) as num_end
from games group by datetime
) g1, games g2
where g2.datetime >= g1.datetime and
g2.datetime-g2.duration < g1.datetime and
g1.datetime >= strftime('%s','now') - 24*60*60 and
g1.datetime <= strftime('%s','now')+0
group by g1.datetime, g1.num_end
) end_table, (
select g3.begin_time, g3.num_begin, count(*) as concurrent
from (
select datetime-duration as begin_time,
count(*) as num_begin
from games group by datetime-duration
) g3, games g4
where g4.datetime >= g3.begin_time and
g4.datetime-g4.duration < g3.begin_time and
g3.begin_time >= strftime('%s','now') - 24*60*60 and
g3.begin_time <= strftime('%s','now')+0
group by g3.begin_time, g3.num_begin
) begin_table
where end_table.datetime > begin_table.begin_time
and begin_table.concurrent < 5
and begin_table.concurrent+begin_table.num_begin >= 5
and end_table.concurrent >= 5
and end_table.concurrent-end_table.num_end < 5
group by begin_table.begin_time
) aah
The basic idea is to make two tables: one with the # of concurrent games at the begin time of each game, and one with the # of concurrent games at the end time. Then join the tables together and only take rows at "critical points" where # of concurrent games crosses 5. For each critical begin time, take the critical end time that happened soonest and that hopefully gives all the periods where at least 5 games were running concurrently.
Hope that's not too convoluted to be helpful!
Kevin rather beat me to the punchline there (+1), but I'll post this variation as it differs at least a little.
The key ideas are
Map the data into a stream of events with attributes time and 'polarity' (= start or end of game)
Keep a running total of how many games are open at the time of each event
(this is done by forming a self-join on the event stream)
Find the event times where the number of games (as Kevin says) transitions up to 5, or down to 4
A little trick: add up all the down-to-4 times and take away the up-to-5s - the order is not important
The result is the number of seconds spent with 5 or more games open
I don't have SQLite, so I've been testing with MySQL, and I've not bothered to limit the time window, to preserve some sanity. Shouldn't be difficult to revise.
Also, and more importantly, I've not considered what to do if games are open at the beginning or end of the period!
Something tells me there's a big simplification to be had here, but I've not spotted it yet.
SELECT SUM( event_time )
FROM (
SELECT -ga.event_type * ga.event_time AS event_time,
SUM( ga.event_type * gb.event_type ) event_type
FROM
( SELECT UNIX_TIMESTAMP( g1.endtime - g1.duration ) AS event_time
, 1 event_type
FROM games g1
UNION
SELECT UNIX_TIMESTAMP( g1.endtime )
, -1
FROM games g1 ) AS ga,
( SELECT UNIX_TIMESTAMP( g1.endtime - g1.duration ) AS event_time
, 1 event_type
FROM games g1
UNION
SELECT UNIX_TIMESTAMP( g1.endtime )
, -1
FROM games g1 ) AS gb
WHERE
ga.event_time >= gb.event_time
GROUP BY ga.event_time
HAVING SUM( ga.event_type * gb.event_type ) IN ( -4, 5 )
) AS gr
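The event-stream idea described above can be sketched procedurally, which may make the running total easier to follow. This is an illustration, not a translation of the SQL: the function name, the (start, end) tuple format and the sample games are all assumptions, and intervals are treated as half-open (a game no longer counts at its end second).

```python
def seconds_with_at_least(games, threshold=5):
    """Return the number of seconds during which at least `threshold`
    games were open. `games` is an iterable of (start, end) pairs in
    seconds, with end exclusive."""
    events = []
    for start, end in games:
        events.append((start, 1))   # a game opens
        events.append((end, -1))    # a game closes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so closings are processed before openings.
    events.sort()

    open_games = 0
    busy = 0
    busy_since = None  # time at which the count last rose to the threshold
    for t, delta in events:
        open_games += delta
        if open_games >= threshold and busy_since is None:
            busy_since = t
        elif open_games < threshold and busy_since is not None:
            busy += t - busy_since
            busy_since = None
    return busy


# Hypothetical games: five overlap between second 100 and second 160.
sample = [(100, 200), (80, 180), (70, 170), (65, 165), (60, 160), (350, 400)]
print(seconds_with_at_least(sample))  # 60
```

In this sample the count rises to 5 at second 100 and falls to 4 at second 160, so the result is 160 - 100, the same "sum of down-to-4 times minus sum of up-to-5 times" arithmetic as in the query.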
Why don't you trim the date and keep only the time? If you filter your data for a given date, every time is unique. This way you only need a table with the numbers from 1 to 86400 (or fewer if you take bigger intervals). You could create two columns, "from" and "to", to define the intervals.
I'm not familiar with SQLite functions but according to the manual you have to use the strftime function with this format: HH:MM:SS.