How to calculate two sums on the same dataset with Apache Flink

I have a simple stream of data of this form:
id | name | eventType  | eventTime
---|------|------------|-----------------------------------------
1  | A    | PLAY       | (ts of when the client fired the event)
1  | B    | IMPRESSION |
2  | A    | CLICK      |
The end goal is to calculate the count of events of type CLICK divided by the count of events of type IMPRESSION, grouped by ID and NAME, for a tumbling window of 60 seconds.
In pure SQL it would look like:
SELECT d.id, d.name, d.impressionCount, d.clickCount,
       d.clickCount / d.impressionCount * 100.0
FROM (
    SELECT i.id, i.name, count(*) AS clickCount, c.impressionCount
    FROM events AS i
    LEFT JOIN (
        SELECT id, name, count(*) AS impressionCount
        FROM events
        WHERE event_type = 'IMPRESSION'
        GROUP BY id, name
    ) AS c
    ON i.id = c.id AND i.name = c.name
    WHERE event_type = 'CLICK'
    GROUP BY i.id, i.name
) AS d
So I first need to create a column with the number of clicks and a column with the number of impressions, and then use that table to do the division.
My question is: what is the best way to do this with the Flink APIs? I have attempted this:
Table clickCountTable = eventsTable
    .where("eventType = 'CLICK'")
    .window(Tumble.over("1.minute").on("eventTime").as("minuteWindow"))
    .groupBy("id, name, minuteWindow")
    .select("concat(concat(id,'_'), name) as id, eventType.count as clickCount, minuteWindow.rowtime as minute");
and the same for the impressions, and then I join the two tables. But I do not get the right result, and I'm not sure this is the best way to achieve what I want using a tumbling window.
EDIT:
This is how I transform the stream into tables:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
[.....]
DataStream<EventWithCount> eventStreamWithTime = eventStream
    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<EventWithCount>() {
        @Override
        public long extractAscendingTimestamp(EventWithCount element) {
            try {
                // parse the textual event time and return it as epoch milliseconds
                DateFormat df1 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSSSS");
                Date parsedDate = df1.parse(element.eventTime);
                return parsedDate.getTime();
            } catch (Exception e) {
                throw new RuntimeException(e.getMessage());
            }
        }
    });
Table eventsTable = tEnv.fromDataStream(eventStreamWithTime, "id, name, eventType, eventTime.rowtime");
tEnv.registerTable("Events", eventsTable);

Your Table API query to count the CLICK events by id and name per minute looks good.
Table clickCountTable = eventsTable
    .where("eventType = 'CLICK'")
    .window(Tumble.over("1.minute").on("eventTime").as("minuteWindow"))
    .groupBy("id, name, minuteWindow")
    .select("concat(concat(id,'_'), name) as clickId, eventType.count as clickCount, minuteWindow.rowtime as clickMin");
Do the same for IMPRESSION:
Table impressionCountTable = eventsTable
    .where("eventType = 'IMPRESSION'")
    .window(Tumble.over("1.minute").on("eventTime").as("minuteWindow"))
    .groupBy("id, name, minuteWindow")
    .select("concat(concat(id,'_'), name) as impId, eventType.count as impCount, minuteWindow.rowtime as impMin");
Finally, you have to join both tables:
Table result = impressionCountTable
    .leftOuterJoin(clickCountTable, "impId = clickId && impMin = clickMin")
    .select("impId as id, impMin as minute, clickCount / impCount as ratio")
Note the join condition impMin = clickMin. This will turn the join into a time-windowed join with a minimal window size of 1 millisecond (ms is the granularity of time in Flink SQL).
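For comparison, here is a minimal sketch of the same logic in Flink SQL, assuming the Events table registered in the question and a Flink version that supports the TUMBLE group-window functions (the integer division mirrors the Table API version; cast if you need a fractional ratio):
SELECT i.id, i.name, i.wEnd AS minute,
       c.clickCount / i.impCount AS ratio
FROM (
    -- impressions per id/name and 1-minute tumbling window
    SELECT id, name, COUNT(*) AS impCount,
           TUMBLE_ROWTIME(eventTime, INTERVAL '1' MINUTE) AS wEnd
    FROM Events
    WHERE eventType = 'IMPRESSION'
    GROUP BY id, name, TUMBLE(eventTime, INTERVAL '1' MINUTE)
) AS i
LEFT JOIN (
    -- clicks per id/name and 1-minute tumbling window
    SELECT id, name, COUNT(*) AS clickCount,
           TUMBLE_ROWTIME(eventTime, INTERVAL '1' MINUTE) AS wEnd
    FROM Events
    WHERE eventType = 'CLICK'
    GROUP BY id, name, TUMBLE(eventTime, INTERVAL '1' MINUTE)
) AS c
ON i.id = c.id AND i.name = c.name AND i.wEnd = c.wEnd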
You said that the query did not behave as you expected. Can you be more specific about your expected and actual results?

Related

How to find combination of intersection from many tables?

I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count the daily active users that come from each combination of these channels. Each channel has its own table and tracks its respective users.
The tables look like the following:
channel A
date user_id
2020-08-01 A
2020-08-01 B
2020-08-01 C
channel B
date user_id
2020-08-01 C
2020-08-01 D
2020-08-01 G
channel C
date user_id
2020-08-01 A
2020-08-01 C
2020-08-01 F
I want to know the following combinations
Only visit channel A
Only visit channel A & B
Only visit channel B & C
Only visit channel B
etc.
However, when there are a lot of channels (I have around 8), there are a lot of combinations. What I've done is roughly as simple as this (this one includes channel A):
SELECT
    a.date,
    COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
    COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
    ...
FROM a
LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1
but this is extremely tedious when there are 8 channels in total (28 variations for combinations of 2, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this? I was thinking of using FULL OUTER JOIN but can't seem to get a grasp of it. Answers really appreciated.
I would approach this with union all and two levels of aggregation:
select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
      from ((select distinct date, user_id, 'a' as channel from a) union all
            (select distinct date, user_id, 'b' as channel from b) union all
            (select distinct date, user_id, 'c' as channel from c)
           ) abc
      group by date, user_id
     ) c
group by date, channels;
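Applied to the sample data in the question, this would yield one row per observed combination, a sketch of the expected output (string_agg joins with a comma by default):
date        channels  num_users
2020-08-01  a         1
2020-08-01  a,b,c     1
2020-08-01  a,c       1
2020-08-01  b         2
2020-08-01  c         1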
However, when there are a lot of channels (I have around 8 channels) the combination is a lot
extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this?
Below is for BigQuery Standard SQL and addresses exactly the above aspect of the OP's concerns.
#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
var combine = function(a) {
  var fn = function(n, src, got, all) {
    if (n == 0) {
      if (got.length > 0) {
        all[all.length] = got;
      }
      return;
    }
    for (var j = 0; j < src.length; j++) {
      fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
    }
    return;
  }
  var all = [];
  for (var i = 1; i < a.length; i++) {
    fn(i, a, [], all);
  }
  all.push(a);
  return all;
}
return combine(a)
''';
with users as (
  select distinct date, user_id, 'A' channel from channel_A union all
  select distinct date, user_id, 'B' from channel_B union all
  select distinct date, user_id, 'C' from channel_C
), visits as (
  select date, user_id,
    string_agg(channel, ' & ' order by channel) combination
  from users
  group by date, user_id
), channels as (
  select channel, cast(row_number() over(order by channel) as string) channel_num
  from (select distinct channel from users)
), combinations as (
  select string_agg(channel, ' & ' order by channel_num) combination
  from unnest(generate_combinations(generate_array(1, (select count(1) from channels)))) as items,
       unnest(split(items)) as channel_num
  join channels using(channel_num)
  group by items
)
select date,
  combination as channels_visited_only,
  count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination
select date,
combination as channels_visited_only,
count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination
Applied to the sample data from your question, the output is one row per date and channel combination (e.g. "A", "A & C", "B & C") with the number of distinct users who visited exactly that combination.
Some explanations to help with using the above:
The users CTE simply unions all the tables and adds a channel column to distinguish which table each row came from.
The visits CTE extracts the list of all visited channels for each user-date combination.
The channels CTE simply prepares the list of channels and assigns each a number for later use.
The combinations CTE uses a JS UDF to generate all combinations of the channels' numbers and then joins them back to channels to generate the channel combinations.
The final SELECT statement simply looks for those users whose list of visited channels matches a channel combination generated in the previous step.
Some recommendations for further streamlining the above code:
Assuming your channel table names follow the channel_* pattern,
you can use the wildcard tables feature in the users CTE, and instead of
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
you can use something like below, so just one line instead of as many lines as channels you have:
select distinct date, user_id, _TABLE_SUFFIX as channel from channel_*
I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
E.g.
"only visit channel A" is (A EXCEPT B) EXCEPT C,
"visit channels A & B" is A INTERSECT B,
etc.
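A concrete sketch of that idea, assuming the tables from the question; note that BigQuery requires the DISTINCT (or ALL) keyword on its set operators:
-- users who, on a given date, visited channel A only
(SELECT user_id FROM channel_A WHERE date = '2020-08-01')
EXCEPT DISTINCT
(SELECT user_id FROM channel_B WHERE date = '2020-08-01')
EXCEPT DISTINCT
(SELECT user_id FROM channel_C WHERE date = '2020-08-01')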
I am thinking full join and aggregation:
select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from (select 'a' channel_a, a.* from channel_a a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c

Get top 1 row for every ID

There are a few posts about this, but I can't make it work...
I just want to select one row per ID, something like ROW_NUMBER() OVER (PARTITION BY ...) in Oracle, but in Access.
Thanks.
SELECT a.*
FROM DATA as a
WHERE a.a_sku = (SELECT top 1 b.a_sku
                 FROM DATA as b
                 WHERE a.a_sku = b.a_sku)
but I get the same table DATA out of it.
Sample of table DATA
https://ibb.co/X4492fY
You should try the below query:
SELECT a.*
FROM DATA as a
WHERE a.Active = (SELECT MAX(b.Active)
                  FROM DATA as b
                  WHERE a.a_sku = b.a_sku)
If you don't care which record within each group of records with matching a_sku values is returned, you can use the First or Last functions, e.g.:
select t.a_sku, first(t.field2), first(t.field3), ..., first(t.fieldN)
from data t
group by t.a_sku
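Another Access-compatible sketch, assuming you want the row with the highest Active value per a_sku (column names taken from the queries above), is a NOT EXISTS anti-join:
SELECT a.*
FROM DATA as a
WHERE NOT EXISTS (SELECT *
                  FROM DATA as b
                  WHERE b.a_sku = a.a_sku
                    AND b.Active > a.Active)
Note that ties on Active would still return more than one row per a_sku.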

Suggest the most optimized way using Hive or Pig

Problem Statement
Assume there is one text file of logs. Below are the fields in the file.
Log File
userID
productID
action
Where Action would be one of these –
Browse, Click, AddToCart, Purchase, LogOut
Select users who performed the AddToCart action but did not perform the Purchase action.
Sample data:
('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')
Can anyone suggest how to get this info using Hive or Pig with optimized performance?
This is possible using sum() or an analytic sum(), depending on the exact requirements, in a single table scan. What if a user added two products to the cart but purchased only one?
For User+Product:
select userID, productID
from
(
    select userID,
           productID,
           sum(case when action = 'AddToCart' then 1 else 0 end) addToCart_cnt,
           sum(case when action = 'Purchase' then 1 else 0 end) Purchase_cnt
    from table
    group by userID, productID
) s
where addToCart_cnt > 0 and Purchase_cnt = 0
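A sketch of the analytic variant mentioned above, assuming the same table; the window sums tag every row with the per-user-per-product counts, so a distinct is needed afterwards:
select distinct userID, productID
from
(
    select userID,
           productID,
           sum(case when action = 'AddToCart' then 1 else 0 end)
               over (partition by userID, productID) as addToCart_cnt,
           sum(case when action = 'Purchase' then 1 else 0 end)
               over (partition by userID, productID) as purchase_cnt
    from table
) s
where addToCart_cnt > 0 and purchase_cnt = 0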
Hive: Use not in
select * from table
where action = 'AddToCart' and
      userID not in (select distinct userID from table where action = 'Purchase')
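Since NOT IN subqueries are restricted in older Hive versions (and NOT IN misbehaves when the subquery returns NULLs), a left-join anti-join sketch of the same logic, assuming the same table, may be safer:
select a.*
from table a
left join (select distinct userID from table where action = 'Purchase') p
       on a.userID = p.userID
where a.action = 'AddToCart'
  and p.userID is null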
Pig: filter the IDs by action, do a left join, and check that the id is null.
-- Note: assuming the first 3 columns are int. You will have to figure out the loading without the quotes.
A = LOAD '\path\file.txt' USING PigStorage(',') AS (userID:int, b:int, c:int, action:chararray);
B = FILTER A BY (action == 'AddToCart');
C = FILTER A BY (action == 'Purchase');
D = JOIN B BY userID LEFT OUTER, C BY userID;
E = FILTER D BY C::userID is null;
DUMP E;

High performance TSQL to retrieve data

I have two tables with the below structure:
Person(ID, Name, ...)
Action(ID, FirstPersonId, SecondPersonId, Date)
I want to retrieve, for each person, the number of actions in which that person appears as the second person that occurred after that person's last action as the first person.
Current query:
Select Result.Id,
       (Select Count(*)
        From Action
        Where SecondPersonId = Result.Id
          And Date > Result.LastAction)
From
    (Select ID,
            (Select Top 1 Date
             From Action
             Where Action.FirstPersonId = Person.Id
             Order By Date Desc) as LastAction
     From Person) As Result
This query has bad performance and I need a much better one.
with lastActionPerson as -- last action for every first person
(
    select FirstPersonId, max([Date]) as LastActionDate
    from Action
    group by FirstPersonId
)
select a.SecondPersonId, count(*)
from lastActionPerson lap
join Action a
  on a.SecondPersonId = lap.FirstPersonId -- be the second person
 and a.[Date] > lap.LastActionDate
-- you could continue with a right join to the Person table to show the persons without actions
group by a.SecondPersonId
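A sketch of supporting indexes, assuming the table and column names above; the first serves the max([Date]) per FirstPersonId, the second the Date-range probe per SecondPersonId:
CREATE INDEX IX_Action_FirstPerson ON Action (FirstPersonId, [Date]);
CREATE INDEX IX_Action_SecondPerson ON Action (SecondPersonId, [Date]);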

Optimizing a troublesome query

I'm generating a PDF via PHP from 2 MySQL tables; the PDF contains a table. On larger tables the script is eating up a lot of memory and is starting to become a problem.
My first table contains "inspections." There are many rows per day. It has a many-to-one relationship with the user table.
Table "inspections"
id (int)
area (varchar) - is one of 8 "areas" ie: Concrete, Soils, Earthwork
inspection_date (int) - unix timestamp
inspection_agent_1 (int) - a user id
inspection_agent_2 (int) - a user id
inspection_agent_3 (int) - a user id
The second table holds the users' info. All I need is to join the name to each "inspection_agent_x":
id
name
The final table that is going to be in the PDF needs to organize the data:
by day
by user, listing every "area" that the user "inspected" on that day
For example:
1/18/2011 | Concrete | Soils | Earthwork
----------|----------|-------|----------
Jon Doe   |    X     |       |
Jane Doe  |    X     |   X   |
And so on for each day. Right now I'm just doing a simple join on the names and then organizing everything on the code end. I know I'm leaving a lot on the table as far as the queries go; I just can't think of a way to do it.
Thanks for any and all help.
Select U.name
    , user_inspections.inspection_date
    , Min( Case When user_inspections.area = 'Concrete' Then 'X' End ) As Concrete
    , Min( Case When user_inspections.area = 'Soils' Then 'X' End ) As Soils
    , Min( Case When user_inspections.area = 'Earthwork' Then 'X' End ) As Earthwork
From users As U
Join (
    Select area, inspection_date, inspection_agent_1 As user_id
    From inspections
    Union All
    Select area, inspection_date, inspection_agent_2 As user_id
    From inspections
    Union All
    Select area, inspection_date, inspection_agent_3 As user_id
    From inspections
) As user_inspections
    On user_inspections.user_id = U.id
Group By U.name, user_inspections.inspection_date
This is effectively a static crosstab, which means you need to know, at design time, all the areas that should be output by the query.
One of the reasons this query is problematic is that your schema is not normalized. Your inspections table should look like:
Create Table inspections
(
id int...
, area varchar...
, inspection_date date ...
, inspection_agent int References Users ( Id )
)
That would avoid the inner Union All query needed to get the output you want.
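With that normalized schema, a sketch of the same crosstab (same Min/Case technique as above, no Union All; column names as in the Create Table sketch) might look like:
Select U.name
    , I.inspection_date
    , Min( Case When I.area = 'Concrete' Then 'X' End ) As Concrete
    , Min( Case When I.area = 'Soils' Then 'X' End ) As Soils
    , Min( Case When I.area = 'Earthwork' Then 'X' End ) As Earthwork
From users As U
Join inspections As I
    On I.inspection_agent = U.id
Group By U.name, I.inspection_date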
I would go like this:
select i.*, u1.name, u2.name, u3.name
from inspections i
left join users u1 on (i.inspection_agent_1 = u1.id)
left join users u2 on (i.inspection_agent_2 = u2.id)
left join users u3 on (i.inspection_agent_3 = u3.id)
order by i.inspection_date asc;
Then select the distinct area names and remember them, or fetch them from an area table if you have one:
select distinct area from inspections;
Then it's just foreach:
$day = "";
foreach($inspection in $inspections)
{
if($day == "" || $inspection["inspection_date"] != $day)
{
//start new row with date here
}
//start standard row with user name
}
It isn't clear whether you have to display all users each time (even if some of them did no inspections that day); if you do, you should fetch the users once, loop over $users, and search for each user in the $inspection row.