I have a scenario with tables (in a proprietary datastore) that have thousands of columns. Before being exported for querying, the tables are transformed to narrow format (http://en.wikipedia.org/wiki/Wide_and_Narrow_Data).
I am developing a query executor. The input to this query executor is the narrow tables, not the original tables. I want to perform joins on two similar narrow tables, but cannot figure out the general logic behind it.
For example, let's say we have two tables R and S in the original (wide) format.
Table R
C1 C2 C3 R1 R2 R3
5 6 7 1234 4552 12532
5 6 8 4512 21523 434
15 16 17 1254 1212 3576
Table S
C1 C2 C3 S1 S2 S3
5 6 7 5412 35112 3512
5 6 8 125393 1523 6749
15 16 17 74397 4311 1153
C1, C2, C3 are the common columns between the tables.
The narrow table for table R is
C1 C2 C3 Key Value
5 6 7 R1 1234
R2 4552
R3 12532
5 6 8 R1 4512
R2 21523
R3 434
15 16 17 R1 1254
R2 1212
R3 3576
The narrow table for table S is
C1 C2 C3 Key Value
5 6 7 S1 5412
S2 35112
S3 3512
5 6 8 S1 125393
S2 1523
S3 6749
15 16 17 S1 74397
S2 4311
S3 1153
Now when I join the original tables R and S (on C1, C2, and C3) I get the result
C1 C2 C3 R1 R2 R3 S1 S2 S3
5 6 7 1234 4552 12532 5412 35112 3512
5 6 8 4512 21523 434 125393 1523 6749
15 16 17 1254 1212 3576 74397 4311 1153
Its narrow format is
C1 C2 C3 Key Value
5 6 7 R1 1234
R2 4552
R3 12532
S1 5412
S2 35112
S3 3512
5 6 8 R1 4512
R2 21523
R3 434
S1 125393
S2 1523
S3 6749
15 16 17 R1 1254
R2 1212
R3 3576
S1 74397
S2 4311
S3 1153
How can I get the above table by just joining the narrow tables (on the common columns) that I got as input?
If you use a normal tabular join (natural join, outer join, etc.) between the two narrow tables, you get an exploded table, because each key in table R gets multiplied with all the keys in table S.
I am not using SQL, Postgres, or any database system. I am looking for an answer in terms of algorithms or relational-algebra expressions.
You're looking for the set union operator: A ∪ B is defined as the set of all tuples that appear in A, in B, or in both, provided the two relations have the same schema. The narrow tables all have the same schema (id, key, value), so they're perfectly union-compatible.
And I have proof:
Suppose we have relations A(id, val_1, val_2, …, val_n) and B(id, val_{n+1}, …, val_{n+m}). We will also need a relation holding our variable names, V(variable) = {('val_1'), ('val_2'), …, ('val_{n+m}')}. The narrow-format equivalent of A is A'(id, variable, value), which we can construct like this:

A' = ⋃_{i=1..n} ( ρ_{value/val_i}( π_{id, val_i}(A) ) × σ_{variable='val_i'}(V) )

That is, for each value we project A to (id, val_i), rename val_i to "value", and attach the variable name (by taking the cross product with the single matching tuple of V); then we take the union of all these relations. Let us also construct B'(id, variable, value) in a similar fashion.
The natural join can be defined using only primitives:

A ⋈ B = π_{id, val_1, …, val_{n+m}}( σ_{A.id = B.id}( A × B ) )
Therefore we can construct (A ⋈ B)' like this (having combined the projections):

(A ⋈ B)' = ⋃_{i=1..n+m} ( ρ_{value/val_i}( π_{id, val_i}( σ_{A.id = B.id}(A × B) ) ) × σ_{variable='val_i'}(V) )
Let's apply the projection earlier. Each term of the union only keeps id and a single val_i, so each term depends on the columns of A alone or of B alone. And since a val_i can appear in A or in B but not both, one factor of each remaining cross product is empty half of the time; dropping those empty terms and reordering, the expression reduces to

⋃_{i=1..n} ( ρ_{value/val_i}( π_{id, val_i}(A) ) × σ_{variable='val_i'}(V) ) ∪ ⋃_{i=n+1..n+m} ( ρ_{value/val_i}( π_{id, val_i}(B) ) × σ_{variable='val_i'}(V) )

which is exactly A' ∪ B'.
So we have shown that (A ⋈ B)' = A' ∪ B': the narrow format of the joined tables is the union of the narrow-format tables.
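The identity can be sanity-checked on the example tables. Here is a minimal Python sketch (the data structures and function name are my own choice, not part of the question's system):

```python
# Sanity check that narrow(R join S) == narrow(R) union narrow(S)
# on the example data, using sets of (C1, C2, C3, Key, Value) tuples.

def narrow(wide):
    """Melt {id_tuple: {col: value}} into a set of (id..., key, value) tuples."""
    return {ids + (k, v) for ids, cols in wide.items() for k, v in cols.items()}

R = {(5, 6, 7):    {"R1": 1234, "R2": 4552,  "R3": 12532},
     (5, 6, 8):    {"R1": 4512, "R2": 21523, "R3": 434},
     (15, 16, 17): {"R1": 1254, "R2": 1212,  "R3": 3576}}
S = {(5, 6, 7):    {"S1": 5412,   "S2": 35112, "S3": 3512},
     (5, 6, 8):    {"S1": 125393, "S2": 1523,  "S3": 6749},
     (15, 16, 17): {"S1": 74397,  "S2": 4311,  "S3": 1153}}

# Natural join of the wide tables on (C1, C2, C3): merge value columns per id.
joined = {ids: {**R[ids], **S[ids]} for ids in R.keys() & S.keys()}

# The narrow form of the join is exactly the union of the narrow inputs.
assert narrow(joined) == narrow(R) | narrow(S)
print(len(narrow(joined)))  # 18 rows, matching the example result
```

No key explosion occurs because union, unlike a tabular join, never pairs an R-key row with an S-key row.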
Related
Currently, I am doing an ETL task on record data for a process mining task. The goal is to build a "Directly Follows (DF)" matrix from the record data. This is the flow:
I have a record (event) data, for example:
ID ev_ID Act Complete
1 1 A 2020-01-13 11:46
2 1 B 2020-01-13 11:50
3 1 C 2020-01-13 11:55
4 1 D 2020-01-13 12:50
5 1 E 2020-01-13 12:52
6 2 A 2020-01-06 09:13
7 2 B 2020-01-06 09:15
8 2 C 2020-01-06 11:46
9 2 D 2020-01-06 11:46
10 3 A 2020-01-06 08:11
11 3 C 2020-01-06 08:10
12 3 B 2020-01-06 09:46
13 3 D 2020-01-06 11:23
14 3 E 2020-01-06 16:05
As I mentioned above, I want to create a DF matrix that shows the "directly follows" relation (see here). However, I want to present the output as a table, not a matrix.
The (desired) output:
From To Frequency
A A 0
A B 3
A C 1
… … …
D E 2
… … …
E E 0
The idea is to calculate the frequency of the "directly follows" relation for each pair of activities per ev_ID. For example:
ev_1 has the trace [A, B, C, D, E], so it contributes the directly-follows pairs AB, BC, CD, and DE.
Summing these pairs over all ev_IDs gives the directly-follows frequency for each pair of activities.
My question:
Can anyone suggest how to produce this output with a SQL query?
I am currently doing the task in PostgreSQL.
Any help is appreciated. Thank you very much.
I tried it myself, but the result does not seem 100% correct.
This is my code:
with ev_data as (
select
ID as eid,
ev_ID as ci,
Act as ea,
Complete as ec
from
table_name
),
A0 as (
select
eid,
ci::int,
row_number() over (partition by ci order by ci, ec) as idx,
ea as act1,
ea as act2
from
ev_data
),
A1 as (
select
L1.ci as ci1,
L1.idx as idx1,
L1.act1 as afrom,
L2.ci as ci2,
L2.idx as idx2,
L2.act2 as ato
from A0 as L1
join A0 as L2
on L1.ci = L2.ci
and L2.idx = L1.idx + 1
)
select
afrom,
ato,
count(*) as count
from A1
group by afrom, ato
order by afrom
Let me assume that your goal is the tabular output shown above. You have two issues:
Getting the adjacent counts.
Generating the rows with 0 values.
Neither is really difficult. The first uses lead() and aggregation. The second uses a cross join:
select a_f.act_from, a_t.act_to,
count(t.id)
from (select distinct act as act_from from table_name
) a_f cross join
(select distinct act as act_to from table_name
) a_t left join
(select t.*,
lead(act) over (partition by ev_id order by complete) as next_act
from table_name t
) t
on t.act = a_f.act_from and
t.next_act = a_t.act_to
group by a_f.act_from, a_t.act_to;
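The cross join + lead() approach can be tried end-to-end against the sample data. Below is a sketch run through SQLite instead of PostgreSQL (window functions need SQLite ≥ 3.25; the table and column names are lowercased versions of the question's):

```python
import sqlite3

# Load the question's event log into an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("create table table_name (id int, ev_id int, act text, complete text)")
rows = [
    (1, 1, "A", "2020-01-13 11:46"), (2, 1, "B", "2020-01-13 11:50"),
    (3, 1, "C", "2020-01-13 11:55"), (4, 1, "D", "2020-01-13 12:50"),
    (5, 1, "E", "2020-01-13 12:52"), (6, 2, "A", "2020-01-06 09:13"),
    (7, 2, "B", "2020-01-06 09:15"), (8, 2, "C", "2020-01-06 11:46"),
    (9, 2, "D", "2020-01-06 11:46"), (10, 3, "A", "2020-01-06 08:11"),
    (11, 3, "C", "2020-01-06 08:10"), (12, 3, "B", "2020-01-06 09:46"),
    (13, 3, "D", "2020-01-06 11:23"), (14, 3, "E", "2020-01-06 16:05"),
]
con.executemany("insert into table_name values (?, ?, ?, ?)", rows)

# All (from, to) activity pairs via cross join; adjacent pairs via lead();
# count(t.id) ignores NULLs, so unmatched pairs get frequency 0.
sql = """
select a_f.act_from, a_t.act_to, count(t.id) as frequency
from (select distinct act as act_from from table_name) a_f
cross join (select distinct act as act_to from table_name) a_t
left join (select t.*,
                  lead(act) over (partition by ev_id order by complete) as next_act
           from table_name t) t
  on t.act = a_f.act_from and t.next_act = a_t.act_to
group by a_f.act_from, a_t.act_to
order by a_f.act_from, a_t.act_to
"""
freq = {(f, t): n for f, t, n in con.execute(sql)}
print(freq[("A", "B")], freq[("D", "E")], freq[("E", "E")])  # 3 2 0
```

This matches the desired output (A→B three times, D→E twice, E→E zero) and produces all 5 × 5 = 25 pairs.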
route_number source_id latitude_value longitude_value no_of_stores
r1 676 28.15085 32.66055 23
r2 715 28.2160253 32.5214831 23
r3 345 28.2123115 32.537211 22
r4 150 28.23009 32.50323 23
r5 534 28.0949248 32.8075467 21
r6 1789 28.2204214 32.5035782 22
r7 647 28.21548 32.50238 23
r8 667 28.21132 32.51481 22
r9 2242 28.2389 32.5 19
r10 797 28.161657 32.8416816 20
r11 1097 28.1792849 32.8255522 19
r12 591 28.2513623 32.7638247 22
r13 1091 28.251208 32.7808329 21
r14 1267 28.2102213 32.8129836 21
r15 1016 28.1654648 32.8350845 19
r16 785 28.0786012 32.9513468 4
r17 1072 28.1701673 32.8382309 1
Shown above is the dataframe I am dealing with.
As you can see, the no_of_stores per route_number differs.
mean(no_of_stores) ≈ 19 in this case.
What I am looking for is this: depending on the geo-locations (latitude and longitude values) of my source_id, I want to combine multiple routes that lie close to each other into one, such that the no_of_stores in each new group is roughly equal.
The condition that routes lie close to each other can be dropped; simply merging the routes with fewer stores into one would also work.
That is, take the routes which lie close to each other (and whose no_of_stores is less than mean(no_of_stores)) and combine them into one bigger route, such that the no_of_stores of each new route is about the mean of the no_of_stores column, which in this case is around 19.
Final output expected something like this: (not actual)
route_number new_route_no
r1 A1 # since it already has more stores than the mean
r2 A2
r3 A3
r4 A4
....................
r9 A9 #(19 stores)
r17 A9 # (1 store), total 20
....................
r11 A11
r16 A11
r15 A15 #19 stores , since it cannot be combined further,keep as it is
I have tried using pandas groupby and aggregate methods, but could not find a way to transform this dataframe.
Any leads would be helpful.
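One possible starting point for the simplified rule (the question allows ignoring geography) is a greedy pack: routes already at or above the mean keep their own group, and the smaller routes are packed together until a group reaches the mean. This is my own sketch, not a known answer; the function and label names are mine:

```python
import pandas as pd

def group_routes(df: pd.DataFrame) -> pd.DataFrame:
    """Greedy grouping: big routes stay alone, small routes are packed
    together until the group's store count reaches the column mean."""
    mean_stores = df["no_of_stores"].mean()
    groups = {}                      # route_number -> new_route_no
    next_label = 1
    open_label, open_total = None, 0

    big = df[df["no_of_stores"] >= mean_stores]
    small = df[df["no_of_stores"] < mean_stores].sort_values(
        "no_of_stores", ascending=False)

    for route in big["route_number"]:        # big routes keep their own group
        groups[route] = f"A{next_label}"
        next_label += 1
    for route, n in zip(small["route_number"], small["no_of_stores"]):
        if open_label is None:               # start a new combined route
            open_label, open_total = f"A{next_label}", 0
            next_label += 1
        groups[route] = open_label
        open_total += n
        if open_total >= mean_stores:        # group is full, close it
            open_label = None
    return df.assign(new_route_no=df["route_number"].map(groups))

demo = pd.DataFrame({"route_number": ["r1", "r2", "r3", "r4"],
                     "no_of_stores": [23, 10, 5, 2]})
out = group_routes(demo)
```

Extending this to respect geography would mean sorting or clustering the small routes by latitude/longitude (e.g. nearest-neighbour) before packing, instead of sorting by store count.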
The code as written below returns the appropriate customers, lockers, units, and balances. However, when I add the commented-out code, it repeats each customer's data for each club, even though each customer can only be a member of one club.
USE db1
GO
SELECT [db1].[dbo].[Customer].[CustomerNumber] AS 'Customer No.'
    -- ,A.ClubID AS 'Club ID No.'
    ,CONCAT(SI.Locker, '-', SI.Frequency) AS Locker
    ,SI.Unit AS Unit
    --,[db2].[dbo].[vueClub].Club_aka AS Club
    ,[db1].[dbo].[Customer_Balance].[CurrentBalance]
FROM [db1].[dbo].[Customer_Balance]
JOIN [db1].[dbo].[Customer]
    ON [db1].[dbo].[Customer_Balance].POSCusNo = Customer.CustomerNumber
JOIN [SQLSrv01].[db3].[dbo].[md_Table_1] AS D
    ON D.Contract_no = [db1].[dbo].[Customer_Balance].POSCusNo
JOIN [SQLSrv01].[db2].[dbo].[vueSoldLockers] AS SI
    ON SI.CustomerID = [db1].[dbo].[Customer].CustomerID
--JOIN [db2].[dbo].[vueClub] AS A
--    ON [db1].[dbo].[Customer].SiteID = A.SiteID
WHERE [db1].[dbo].[Customer_Balance].StatusCode = '1234'
ORDER BY Customer.CustomerNumber ASC
So if I run it as is I get:
Customer No. Locker Unit Current Balance
1 315 A1 456.00
2 316 A3 1204.70
3 317 B2 335.60
4 318 B4 1500.30
But if I include the commented-out code I get:
Customer No. Club ID No Locker Unit Club Current Balance
1 4 315 A1 Tigers 456.00
1 3 315 A1 Lions 456.00
2 4 316 A3 Tigers 1204.70
2 3 316 A3 Lions 1204.70
3 4 317 B2 Tigers 335.60
3 3 317 B2 Lions 335.60
4 4 318 B4 Tigers 1500.30
4 3 318 B4 Lions 1500.30
Is it because I don't have the JOIN set up properly?
Customer No. Club ID No Locker Unit Club Current Balance
1 4 315 A1 Tigers 456.00
1 3 315 A1 Lions 456.00
You are joining Customer to vueClub on SiteID. It looks like the site customer 1 is in has two clubs (3 and 4), so every customer in that site matches both club rows.
I need to merge two data sets. Each data set contains a sequential observation number. The first data set contains only the first observation. The second data set contains all subsequent observations. Not all subjects have the same number of observations.
The problem is as follows. There are two different types of subject. The type is contained only in the first data set. When I merge the two data sets together, the type is missing on all observations but the first for each subject. Please see my example below.
I would like to know how to do this with both SQL and a DATA step. My real data sets are not large, so processing efficiency is not a major concern.
I have tried using RETAIN, but as the second data set doesn't contain the TYPE variable, there is no value to retain. Regarding SQL, it seems like UNION should work, and there are countless examples of UNION on the internet, but they all involve a single variable. I need to union the Observation variable by ID while retaining the Amount and assigning the Type.
Example
data set1;
input ID $
Observation
Type $
Amount
;
datalines;
002 1 A 15
026 1 A 30
031 1 B 7
028 1 B 10
036 1 A 22
;
run;
data set2;
input ID $
Observation
Amount
;
datalines;
002 2 11
002 3 35
002 4 13
002 5 12
026 2 21
026 3 12
026 4 40
031 2 11
028 2 27
036 2 10
036 3 15
036 4 16
036 5 12
036 6 20
;
run;
proc sort data = set1;
by ID
Observation
;
run;
proc sort data = set2;
by ID
Observation
;
run;
data merged;
merge set1
set2
;
by ID
Observation
;
run;
This gives
ID Observation Type Amount
002 1 A 15
002 2 11
002 3 35
002 4 13
002 5 12
026 1 A 30
026 2 21
026 3 12
026 4 40
028 1 B 10
028 2 27
031 1 B 7
031 2 11
036 1 A 22
036 2 10
036 3 15
036 4 16
036 5 12
036 6 20
However, what I need is
ID Observation Type Amount
002 1 A 15
002 2 A 11
002 3 A 35
002 4 A 13
002 5 A 12
026 1 A 30
026 2 A 21
026 3 A 12
026 4 A 40
028 1 B 10
028 2 B 27
031 1 B 7
031 2 B 11
036 1 A 22
036 2 A 10
036 3 A 15
036 4 A 16
036 5 A 12
036 6 A 20
I'm sure there are other ways to do it, but this is how I'd do it.
First, stack the data keeping only the common fields.
data new;
set set1 (drop = TYPE) set2;
run;
Then merge the type field back over.
proc sql;
create table new2 as select
a.*,
b.TYPE
from new a
left join set1 b
on a.id=b.id;
quit;
Proc SQL:
proc sql;
create table want as
select coalesce(a.id,b.id) as id,observation,type,amount from (select * from set1(drop=type) union
select * from set2) a left join set1 (keep=id type) b
on a.id=b.id;
quit;
The DATA step method is straightforward: just use SET with BY to interleave the records. You need to create a NEW variable to retain the values. If you want, you can drop the old one and rename the new one to have its name.
data want ;
set set1 set2 ;
by id ;
if first.id then new_type=type;
retain new_type;
run;
For SQL, use the method that @JJFord3 posted: first union the common fields and then merge on the TYPE flag. You can combine these into a single statement.
proc sql;
create table want as
select a.*,b.type
from
(select id,observation,amount from set1
union
select id,observation,amount from set2
) a
left join set1 b
on a.id = b.id
order by 1,2
;
quit;
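For readers outside SAS, the carry-forward logic itself is small in any language. A Python sketch of the same idea (data hard-coded from the example; this is my own illustration, not the author's code):

```python
# Look up each ID's Type from the first-observation data set (set1)
# and stamp it on every row of the interleaved result.
set1 = [("002", 1, "A", 15), ("026", 1, "A", 30), ("031", 1, "B", 7),
        ("028", 1, "B", 10), ("036", 1, "A", 22)]
set2 = [("002", 2, 11), ("002", 3, 35), ("002", 4, 13), ("002", 5, 12),
        ("026", 2, 21), ("026", 3, 12), ("026", 4, 40), ("031", 2, 11),
        ("028", 2, 27), ("036", 2, 10), ("036", 3, 15), ("036", 4, 16),
        ("036", 5, 12), ("036", 6, 20)]

# Equivalent of the "if first.id then new_type=type; retain new_type;" step:
# the Type is known only on observation 1, so build an ID -> Type lookup.
type_by_id = {id_: type_ for id_, _, type_, _ in set1}
merged = sorted(
    [(id_, obs, type_, amt) for id_, obs, type_, amt in set1] +
    [(id_, obs, type_by_id[id_], amt) for id_, obs, amt in set2])
print(merged[:3])
```

The first rows come out as ("002", 1, "A", 15), ("002", 2, "A", 11), …, matching the desired output table.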
Assume I have 4 assignments with assignmentIDs A1, A2, A3, A4 in an assignment table, plus the following table:
GroupID GraderName assignmentID
1 TA1 A1
2 TA2 A2
3 TA1 A4
4 TA1 A3
5 TA1 A1
6 TA2 A4
7 TA3 A3
8 TA3 A2
9 TA3 A1
10 TA2 A1
11 TA1 A2
Report the name of each grader that marked at least one group for every assignment.
From my table, it should report TA1.
TA2 didn't mark any A3 and TA3 didn't mark any A4, so ignore them.
Can someone suggest an approach using relational algebra operators such as cross join, natural join, self join, etc.?
From the relational algebra perspective I would suggest finding all assignments first: π_assignmentID(assignment). Then, to answer your question, you should use division, ÷; i.e., the query should be
π_{GraderName,assignmentID}(grading) ÷ π_assignmentID(assignment)
where grading is the grader table shown above. If for whatever reason you do not like using division, you can always replace it by a repeated difference.