BigQuery / SQL: finding "new" data in a date range - sql

I have a pretty big event log with columns:
id, timestamp, text, user_id
The text field contains a variety of things, like:
Road: This is the road name
City: This is the city name
Type: This is a type
etc..
I would like to get the result to the following:
Given a start and end date, how many **new** users used a road (that haven't before) grouped by road.
I've got various parts of this working fine (like the total number of users, the grouping, the date range and so on). The SQL for getting the new users is eluding me though; I have tried solutions like SELECT AS STRUCT on subqueries, amongst other things.
Ultimately, I'd love to see a result like:
road, total_users, new_users
Any help would be much appreciated.

If I understand correctly, you want something like this:
select road, countif(seqnum = 1) as new_users, count(distinct user_id) as num_users
from (select l.*,
             -- seqnum = 1 marks a user's first ever event on a given road
             row_number() over (partition by l.user_id, l.road order by l.timestamp) as seqnum
      from log l
      where l.type = 'Road'
     ) l
where timestamp >= @timestamp1 and timestamp < @timestamp2
group by road;
This assumes that you have a column that specifies the type (i.e. "road") and another column with the name of the road (i.e. "Champs-Elysees").
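To make the row-numbering idea concrete, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. The table layout (user_id, road, ts), the function name and the sample rows are assumptions made up for illustration, and SUM(seqnum = 1) stands in for BigQuery's COUNTIF, which SQLite does not have. The key point is that the numbering happens in the subquery over the whole log, and the date filter is applied only afterwards.

```python
import sqlite3

def new_users_by_road(rows, start, end):
    """rows: (user_id, road, ts) tuples; ts are ISO-8601 strings (assumed layout)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE log (user_id INTEGER, road TEXT, ts TEXT)")
    con.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    sql = """
        SELECT road,
               COUNT(DISTINCT user_id) AS total_users,
               -- seqnum = 1 marks a user's first ever event on that road
               SUM(seqnum = 1) AS new_users
        FROM (SELECT l.*,
                     ROW_NUMBER() OVER (PARTITION BY user_id, road
                                        ORDER BY ts) AS seqnum
              FROM log l)          -- numbered over the full history
        WHERE ts >= ? AND ts < ?   -- date range applied after numbering
        GROUP BY road
        ORDER BY road
    """
    result = con.execute(sql, (start, end)).fetchall()
    con.close()
    return result

# Made-up sample data
sample = [
    (1, "A1", "2023-01-05"),  # user 1 first used A1 before the range
    (1, "A1", "2023-02-03"),
    (2, "A1", "2023-02-04"),  # user 2 is new on A1 inside the range
    (2, "B7", "2023-02-10"),
    (2, "B7", "2023-02-11"),
]
print(new_users_by_road(sample, "2023-02-01", "2023-03-01"))
```

If the date filter were moved inside the subquery, every user active in the range would get seqnum = 1 and would be counted as new, which is why the filter sits in the outer query.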

Related

Attribute sales to email campaigns based on time window

Hopefully someone is able to help me with the following. I have two tables, one containing events which are "clicked emails" and one containing "conversions" (=orders). I would like to write attribution logic in BigQuery SQL to match the conversions to the clicked emails based on the following logic:
the matching ID is a "person_id"
conversions that happened within a 5 day (120h) timeframe after the clicked email count to that mail
if the customer clicked on two different emails within a five day window, the email clicked most recent before the conversion gets attributed the revenue
Note: the clicks table also contains clicks which do not have a conversion, and the conversions table also contains conversions not related to emails.
Desired end result: a table containing a count of all attributed orders and a count of all clicks, by date and campaign name.
I figured I would need a left join on person_id that brings in only the conversions that might be related to a click. However, I then need to define the window within which conversions are counted (the 5 days). Maybe I could include this in the WHERE clause? After that, in case the count of conversions is > 1, I need to take into account only the conversion where the "date diff" is smallest.
Here is how far I got:
SELECT
c.person_id,
c.campaign_name,
c.datetime,
s.processed_at,
c.email,
s.order_number,
SUM(s.total_price) AS revenue,
COUNT(DISTINCT s.order_number) AS orders
FROM
`klaviyo_de_clicks` c
LEFT JOIN
`klaviyo_de_conversions` s
ON
c.person_id = s.person_id
GROUP BY
1,2,3,4,5,6
Thanks for your help!
You can get the emails just before the conversion using union all and window functions.
The idea is to put all the data into a single "result set" with the clicks and conversions separated. Then use window functions to get the previous click information for each conversion.
Your sample code has a bunch of columns not mentioned in your question. But this is the structure of the code:
with cc as (
      select person_id, date_time, conversion_id, null as click_id
      from conversions c
      union all
      select person_id, date_time, null, click_id
      from clicks c
     )
select cc.*,
       -- true when the most recent click is at most 5 days before the conversion
       (click_date_time >= datetime_add(date_time, interval -5 day)) as is_click_conversion
from (select cc.*,
             max(case when click_id is not null then date_time end) over (partition by person_id order by date_time) as click_date_time,
             last_value(click_id ignore nulls) over (partition by person_id order by date_time) as prev_click_id
      from cc
     ) cc
where conversion_id is not null;
If you need additional columns, you can use join to bring them in.
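As a rough illustration of the union-all-plus-window idea, the sketch below runs a variant of it on SQLite from Python. SQLite does not support IGNORE NULLS, so instead of last_value(click_id ignore nulls) this version computes only the latest click time with the running max() and then joins back to the clicks table to fetch the click id; that join is a swapped-in technique, not the answer's exact query. The table layouts, function name and sample rows are assumptions, and it presumes a person never has two clicks with the same timestamp.

```python
import sqlite3

def attribute_conversions(clicks, conversions):
    """clicks: (click_id, person_id, date_time); conversions: (conversion_id, person_id, date_time).
    date_time values are 'YYYY-MM-DD HH:MM:SS' strings (assumed format)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE clicks (click_id INTEGER, person_id INTEGER, date_time TEXT);
        CREATE TABLE conversions (conversion_id INTEGER, person_id INTEGER, date_time TEXT);
    """)
    con.executemany("INSERT INTO clicks VALUES (?, ?, ?)", clicks)
    con.executemany("INSERT INTO conversions VALUES (?, ?, ?)", conversions)
    sql = """
        WITH cc AS (
            SELECT person_id, date_time, conversion_id, NULL AS click_id FROM conversions
            UNION ALL
            SELECT person_id, date_time, NULL, click_id FROM clicks
        ),
        marked AS (
            -- running max: time of the latest click at or before each row
            SELECT cc.*,
                   MAX(CASE WHEN click_id IS NOT NULL THEN date_time END)
                       OVER (PARTITION BY person_id ORDER BY date_time) AS click_date_time
            FROM cc
        )
        SELECT m.conversion_id, k.click_id
        FROM marked m
        LEFT JOIN clicks k
               ON k.person_id = m.person_id
              AND k.date_time = m.click_date_time
              -- keep the click only if it falls inside the 5-day window
              AND k.date_time >= datetime(m.date_time, '-5 days')
        WHERE m.conversion_id IS NOT NULL
        ORDER BY m.conversion_id
    """
    result = con.execute(sql).fetchall()
    con.close()
    return result

# Made-up sample data
sample_clicks = [
    (10, 1, "2023-03-01 09:00:00"),
    (11, 1, "2023-03-03 09:00:00"),  # most recent click before conversion 100
    (12, 2, "2023-03-01 09:00:00"),  # more than 5 days before conversion 101
]
sample_conversions = [
    (100, 1, "2023-03-04 12:00:00"),
    (101, 2, "2023-03-10 09:00:00"),
    (102, 3, "2023-03-02 10:00:00"),  # person with no clicks at all
]
print(attribute_conversions(sample_clicks, sample_conversions))
```

Conversions that come back with a NULL click id are the unattributed ones; counting the non-NULL rows per click date and campaign would then give the attributed-order counts asked for in the question.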

selecting percentage of time on/off for groupings by id

I'm looking to summarize some information into a kind of report, and the crux of it is similar to the following problem. I'm looking for the approach in any sql-like language.
consider a schema containing the following:
id - int, on - bool, time - datetime
This table is basically a log that specifies when a thing of id changes state between 'on' and 'off'.
What I want is a table with the percentage of time 'on' for each id seen. So a result might look like this
id, percent 'on'
1, 50
2, 45
3, 67
I would expect the overall time to be
now - (time first seen in the log)
Programmatically, I understand how to do this. For each id, I just want to add up all of the segments of time for which the item was 'on' and express this as a percentage of the total time. I'm not quite seeing how to do this in SQL, however.
You can use lead() and some date/time arithmetic (which varies by database).
In pseudo-code this looks like:
select id,
       sum(case when status = on then coalesce(next_datetime, current_datetime) - datetime end) / (current_datetime - min(datetime))
from (select t.*,
             lead(datetime) over (partition by id order by datetime) as next_datetime
      from t
     ) t
group by id;
Date/time functions vary by database, so this is just to give an idea of what to do.
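For one concrete dialect, here is a small sketch of the lead() approach on SQLite, driven from Python. It assumes the columns are named is_on and ts (the question's on is a reserved word), that timestamps are 'YYYY-MM-DD HH:MM:SS' strings, and that "now" is passed in as a parameter so the result is reproducible; julianday() supplies the date arithmetic, returning days as a float.

```python
import sqlite3

def percent_on(rows, now):
    """rows: (id, is_on, ts) state-change log entries; returns [(id, percent_on)]."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, is_on INTEGER, ts TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    sql = """
        SELECT id,
               100.0 * SUM(CASE WHEN is_on
                                -- segment runs until the next change, or until now
                                THEN julianday(COALESCE(next_ts, :now)) - julianday(ts)
                                ELSE 0 END)
                     / (julianday(:now) - julianday(MIN(ts))) AS pct
        FROM (SELECT t.*,
                     LEAD(ts) OVER (PARTITION BY id ORDER BY ts) AS next_ts
              FROM t)
        GROUP BY id
        ORDER BY id
    """
    result = [(i, round(p, 2)) for i, p in con.execute(sql, {"now": now})]
    con.close()
    return result

# Made-up sample data
sample = [
    (1, 1, "2023-01-01 00:00:00"),  # on for 12 h
    (1, 0, "2023-01-01 12:00:00"),  # off until "now"
    (2, 1, "2023-01-01 00:00:00"),  # on for 6 h
    (2, 0, "2023-01-01 06:00:00"),
    (2, 1, "2023-01-01 12:00:00"),  # on again for the last 12 h
]
print(percent_on(sample, "2023-01-02 00:00:00"))
```

An id whose only log entry is at exactly "now" would divide by zero here (SQLite returns NULL in that case), so a real report would probably need to decide how to treat such ids.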

SQL query with summed statistical data, grouped by date

I'm trying to wrap my head around a problem with making a query for a statistical overview of a system.
The table I want to pull data from is called 'Event', and holds the following columns (among others, only the necessary is posted):
date (as timestamp)
positionId (as number)
eventType (as string)
Another table that is most likely necessary is 'Location', which, among others, holds the following columns:
id (as number)
clinic (as boolean)
What I want is a sum of events in different conditions, grouped by days. The user can give an input over the range of days wanted, which means the output should only show a line per day inside the given limits. The columns should be the following:
date: a date, grouping the data by days
deliverySum: A sum of entries for the given day, where eventType is 'sightingDelivered', and the Location with id=positionId has clinic=true
pickupSum: Same as deliverySum, but eventType is 'sightingPickup'
rejectedSum: A sum over events for the day, where the positionId is 4000
acceptedSum: Same as rejectedSum, but positionId is 3000
So, one line should show the sums for the given day over the different criteria.
I'm fairly well read in SQL, but my experience is quite low, which led me to ask here.
Any help would be appreciated.
SQL Server has neither timestamps nor booleans, so I'll answer this for MySQL.
select date(e.date),
       -- in MySQL a boolean expression evaluates to 1 or 0, so sum() counts the matching rows
       sum( e.eventType = 'sightingDelivered' and l.clinic) as deliverySum,
       sum( e.eventType = 'sightingPickup' and l.clinic) as pickupSum,
       sum( e.positionId = 4000 ) as rejectedSum,
       sum( e.positionId = 3000 ) as acceptedSum
from event e left join
     location l
     on e.positionId = l.id
where e.date >= $date1 and e.date < $date2 + interval 1 day
group by date(e.date);
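The same conditional-sum pattern can be tried out without a MySQL server. The sketch below ports it to SQLite via Python, since SQLite also evaluates comparisons to 1 or 0. The table contents are invented, clinic is stored as 0/1, and date(:last, '+1 day') replaces MySQL's interval arithmetic. One detail worth noting is that an event whose positionId has no Location row yields NULL for l.clinic, so the clinic-dependent sums are wrapped in COALESCE here to report 0 rather than NULL.

```python
import sqlite3

def daily_sums(events, locations, first_day, last_day):
    """events: (date, positionId, eventType); locations: (id, clinic as 0/1)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE event (date TEXT, positionId INTEGER, eventType TEXT);
        CREATE TABLE location (id INTEGER, clinic INTEGER);
    """)
    con.executemany("INSERT INTO event VALUES (?, ?, ?)", events)
    con.executemany("INSERT INTO location VALUES (?, ?)", locations)
    sql = """
        SELECT date(e.date) AS day,
               COALESCE(SUM(e.eventType = 'sightingDelivered' AND l.clinic), 0) AS deliverySum,
               COALESCE(SUM(e.eventType = 'sightingPickup' AND l.clinic), 0) AS pickupSum,
               SUM(e.positionId = 4000) AS rejectedSum,
               SUM(e.positionId = 3000) AS acceptedSum
        FROM event e
        LEFT JOIN location l ON e.positionId = l.id
        -- half-open range so that the whole of last_day is included
        WHERE e.date >= :first AND e.date < date(:last, '+1 day')
        GROUP BY day
        ORDER BY day
    """
    result = con.execute(sql, {"first": first_day, "last": last_day}).fetchall()
    con.close()
    return result

# Made-up sample data
sample_locations = [(1, 1), (2, 0), (3000, 0), (4000, 0)]
sample_events = [
    ("2023-05-01 08:00:00", 1, "sightingDelivered"),   # clinic delivery
    ("2023-05-01 09:00:00", 2, "sightingDelivered"),   # not a clinic
    ("2023-05-01 10:00:00", 1, "sightingPickup"),      # clinic pickup
    ("2023-05-01 11:00:00", 4000, "scan"),             # rejected
    ("2023-05-02 08:00:00", 3000, "scan"),             # accepted
    ("2023-05-02 09:00:00", 3000, "scan"),             # accepted
    ("2023-05-03 09:00:00", 1, "sightingDelivered"),   # outside the range
]
print(daily_sums(sample_events, sample_locations, "2023-05-01", "2023-05-02"))
```

Days with no events at all do not appear in the output; if a row per day is required, the query would have to start from a calendar of dates and left join the events onto it.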

Get Max(date) or latest date with 2 conditions or group by or subquery

I only have basic SQL skills. I'm working in SQL in Navicat. I've looked through the threads of people who were also trying to get the latest date, but have not yet been able to apply it to my situation.
I am trying to get the latest date for each name, for each chemical. I think of it this way: "Within each chemical, look at data for each name, choose the most recent one."
I have tried using max(date(date)) but it needs to be nested or subqueried within chemical.
I also tried ranking by date(date) DESC, then using LIMIT 1. But I was not able to nest this within chemical either.
When I try to write it as a subquery, I keep getting an error on the ( . I've switched it up so that I am beginning the subquery a number of different ways, but the error returns near that area always.
Here is what the data looks like:
(screenshot of the data, not reproduced here)
Here is one of my failed queries:
SELECT
WELL_NAME,
CHEMICAL,
RESULT,
APPROX_LAT,
APPROX_LONG,
DATE
FROM
data_all
ORDER BY
CHEMICAL ASC,
date( date ) DESC (
SELECT
WELL_NAME,
CHEMICAL,
APPROX_LAT,
APPROX_LONG,
DATE
FROM
data_all
WHERE
WELL_NAME = WELL_NAME
AND CHEMICAL = CHEMICAL
AND APPROX_LAT = APPROX_LAT
AND APPROX_LONG = APPROX_LONG,
LIMIT 2
)
If someone does have a response, it would be great if it is in as lay language as possible. I've only had one coding class. Thanks very much.
Maybe something like this?
SELECT WELL_NAME, CHEMICAL, MAX(DATE)
FROM data_all
GROUP BY WELL_NAME, CHEMICAL
If you want all information, then use the ANSI-standard ROW_NUMBER():
SELECT da.*
FROM (SELECT da.*,
             -- seqnum = 1 is the most recent row for each chemical / well combination
             ROW_NUMBER() OVER (PARTITION BY chemical, well_name ORDER BY date DESC) as seqnum
      FROM data_all da
     ) da
WHERE seqnum = 1;
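To see what the ROW_NUMBER() version returns, here is a small sketch that runs it on an in-memory SQLite database from Python. The column subset, the function name and the sample values are made up for illustration, and the dates are assumed to be stored as ISO strings so that sorting them as text gives chronological order.

```python
import sqlite3

def latest_results(rows):
    """rows: (well_name, chemical, result, date) tuples; returns the newest row per chemical and well."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE data_all (well_name TEXT, chemical TEXT, result REAL, date TEXT)"
    )
    con.executemany("INSERT INTO data_all VALUES (?, ?, ?, ?)", rows)
    sql = """
        SELECT well_name, chemical, result, date
        FROM (SELECT da.*,
                     -- number each well's rows within a chemical, newest first
                     ROW_NUMBER() OVER (PARTITION BY chemical, well_name
                                        ORDER BY date DESC) AS seqnum
              FROM data_all da)
        WHERE seqnum = 1
        ORDER BY chemical, well_name
    """
    result = con.execute(sql).fetchall()
    con.close()
    return result

# Made-up sample data
sample = [
    ("Well A", "Arsenic", 1.2, "2019-01-10"),
    ("Well A", "Arsenic", 0.9, "2020-06-01"),  # newest Arsenic row for Well A
    ("Well A", "Nitrate", 4.0, "2018-03-15"),
    ("Well B", "Arsenic", 2.5, "2017-11-30"),
]
for row in latest_results(sample):
    print(row)
```

Unlike the GROUP BY / MAX version, this keeps the whole row, so the result value that belongs to the latest date comes along with it.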

SQL: Average value per day

I have a table called ‘tweets’. The table 'tweets' includes (amongst others) the columns 'tweet_id', 'created at' (dd/mm/yyyy hh/mm/ss), ‘classified’ and 'processed text'. Within the ‘processed text’ column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of ‘classified’ per ticker-string per day. The column ‘classified’ contains the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking to find the average classified scores per ticker-string. My question is therefore: Could someone show me how to adjust this script in such a way that I can calculate the average classified scores per ticker-string per day?
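SELECT d.date, t.ticker,
       COUNT(DISTINCT t.tweet_id) AS tweets,
       COALESCE(AVG(t.classified), 0) AS avg_classified
FROM dates d
LEFT JOIN
    (SELECT DATE(created_at) AS date,
            -- extract the symbol between '{TICKER|' and the following '}'
            SUBSTR(processed_text,
                   LOCATE('{TICKER|', processed_text) + 8,
                   LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                       - LOCATE('{TICKER|', processed_text) - 8) AS ticker,
            tweet_id,
            classified
     FROM tweets
     WHERE processed_text LIKE '%{TICKER|%') t
ON d.date = t.date
GROUP BY d.date, t.ticker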
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
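When the list of tickers is known ahead of time, the pivot can be produced with one conditional AVG per ticker. The sketch below generates those columns from a Python list and runs the query on SQLite; it is only an illustration of the idea, not MySQL code. It assumes a dates table with one row per day, created_at stored as 'YYYY-MM-DD HH:MM:SS' (not the dd/mm/yyyy format mentioned in the question), and purely alphanumeric ticker symbols, which are checked before being placed into the SQL text.

```python
import sqlite3

def avg_classified_pivot(tweets, dates, tickers):
    """tweets: (tweet_id, created_at, classified, processed_text); dates: list of 'YYYY-MM-DD'."""
    for t in tickers:
        if not t.isalnum():  # symbols end up in the SQL text, so keep them strictly alphanumeric
            raise ValueError(f"unexpected ticker symbol: {t!r}")
    # One conditional average per ticker; days without matching tweets fall back to 0
    cols = ",\n               ".join(
        f"COALESCE(AVG(CASE WHEN t.processed_text LIKE '%{{TICKER|{t}}}%' "
        f'THEN t.classified END), 0) AS "{t}"'
        for t in tickers
    )
    sql = f"""
        SELECT d.date,
               {cols}
        FROM dates d
        LEFT JOIN tweets t ON date(t.created_at) = d.date
        GROUP BY d.date
        ORDER BY d.date
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dates (date TEXT);
        CREATE TABLE tweets (tweet_id INTEGER, created_at TEXT,
                             classified INTEGER, processed_text TEXT);
    """)
    con.executemany("INSERT INTO dates VALUES (?)", [(d,) for d in dates])
    con.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", tweets)
    result = con.execute(sql).fetchall()
    con.close()
    return result

# Made-up sample data
sample_dates = ["2014-05-01", "2014-05-02", "2014-05-03"]
sample_tweets = [
    (1, "2014-05-01 10:00:00", 1, "buy {TICKER|IBM} now"),
    (2, "2014-05-01 11:00:00", -1, "{TICKER|IBM} drops"),
    (3, "2014-05-01 12:00:00", 1, "{TICKER|GOOG} up"),
    (4, "2014-05-02 09:00:00", 1, "{TICKER|IBM} and {TICKER|GOOG}"),
]
for row in avg_classified_pivot(sample_tweets, sample_dates, ["IBM", "GOOG"]):
    print(row)
```

Because the LIKE test is applied per ticker, a tweet that mentions several tickers contributes to each of their averages. Reporting 0 for empty days follows the question's request, although it cannot be told apart from a genuinely neutral average.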