Joining data from two sources using bigquery - google-bigquery

Can anyone please check whether the code below is correct? In cte_1, I'm taking all dimensions and metrics from t1 except value1, value2, value3. In cte_2, I'm finding the unique row number for t2. In cte_3, I'm taking all distinct dimensions and metrics using a join on two keys, Date and Ad. In cte_4, I'm taking the values for only row number 1. I'm getting sum(value1), sum(value2), sum(value3) correct, but sum(value4) is incorrect.
WITH cte_1 AS
(SELECT *except(value1, value2, value3) FROM t1 where Date >"2020-02-16" and Publisher ="fb")
-- Find unique row number from t2--
,cte_2 as(
SELECT ROW_NUMBER() OVER(ORDER BY Date) distinct_row_number, * FROM t2
,cte_3 as
(SELECT cte_2.*,cte_1.*except(Date) FROM cte_2 join cte_1
on cte_2.Date = cte_1. Date
and cte_2.Ad= cte_1.Ad))
,cte_4 AS (
(SELECT *
FROM
(
SELECT *,
row_number() OVER (PARTITION BY distinct_row_number ORDER BY Date) as rn
FROM cte_3 ) T
where rn = 1 ))
select sum(value1),sum(value2),sum(value3),sum(value4) from cte_4
Please see the sample table below:

Whilst your data does not seem consistent with the query you shared, since it lacks the field named Ad and other fields have different names (Date vs. ReportDate), I was able to identify some issues and propose improvements.
First, within your temp table cte_1 you are only applying a filter in the WHERE clause; you could apply that filter directly in the FROM clause of your last step, for example:
SELECT * FROM (SELECT field1, field2, field3 FROM t1 WHERE Date > DATE(2020, 2, 16))
Second, in cte_2, you need to select all the columns you will need from the table t2. Otherwise, your table will contain only the row number, and it will be impossible to join it with other tables because it carries no other information. Thus, if you need the row number, select it together with the other columns, which must include your join key(s) if you will perform any join later. The syntax would be as follows:
SELECT field1, field2, ROW_NUMBER() OVER(ORDER BY Date) AS distinct_row_number FROM t2
Third, in cte_3, I assume you want to perform an INNER JOIN. You need to make sure that the join keys are present in both tables, in your case Date and Ad, which I could not find within your data. Furthermore, you cannot have duplicated column names when joining two tables and selecting all the columns. For example, in your case you have Brand, value1, value2 and value3 in both tables, which will cause an error. You need to specify where these fields should come from, either by selecting them one by one or by using the EXCEPT modifier.
Finally, cte_4 and your final SELECT could be combined into one step. Basically, you are selecting only one row of data per partition, ordered by Date, and then summing the fields value1, value2 and value3 individually. Moreover, you are not selecting any identifier alongside the sums, which means your result will contain only the totals. In general, when performing an aggregation such as SUM(), the key column(s) are selected as well. This whole step could have been performed in one statement, using only the data from t2:
SELECT ReportDate, Brand, SUM(value1) AS sum_1, SUM(value2) AS sum_2, SUM(value3) AS sum_3, SUM(value4) AS sum_4
FROM (SELECT t2.*, ROW_NUMBER() OVER(PARTITION BY Date ORDER BY Date) AS rn FROM t2)
WHERE rn = 1
GROUP BY ReportDate, Brand
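To show the dedup-then-aggregate pattern concretely, here is a small runnable sketch using SQLite (which supports window functions since 3.25) as a stand-in for BigQuery; the table layout and values are made up, and it partitions by ReportDate since that is the date column in the sample data:

```python
import sqlite3  # SQLite >= 3.25 supports window functions

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t2 (ReportDate TEXT, Brand TEXT, value1 INT);
INSERT INTO t2 VALUES
  ('2020-02-17', 'A', 10),
  ('2020-02-17', 'A', 10),  -- exact duplicate of the row above
  ('2020-02-18', 'A', 5);
""")

# Number the rows per date, keep only rn = 1, then aggregate
rows = conn.execute("""
SELECT ReportDate, Brand, SUM(value1) AS sum_1
FROM (SELECT t2.*,
             ROW_NUMBER() OVER (PARTITION BY ReportDate ORDER BY ReportDate) AS rn
      FROM t2)
WHERE rn = 1
GROUP BY ReportDate, Brand
ORDER BY ReportDate
""").fetchall()
print(rows)  # [('2020-02-17', 'A', 10), ('2020-02-18', 'A', 5)]
```

The duplicate row for 2020-02-17 is dropped before the SUM, which is why the totals are not doubled.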
UPDATE:
With your explanation in the comment section, I was able to create a more specific query. The fields ReportDate, Brand, Portfolio, Campaign and value1, value2, value3 come from t2, whilst value4 comes from t1. The sum is computed over the rows where the row number equals 1. For this reason, tables t1 and t2 are joined before ROW_NUMBER() is applied. Finally, in the last SELECT statement rn is not selected, and the data is aggregated by ReportDate, Brand, Portfolio and Campaign.
WITH cte_1 AS (
SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
t2.value1, t2.value2, t2.value3, t1.value4
FROM t2 LEFT JOIN t1 on t2.ReportDate = t1.ReportDate and t1.placement=t2.Ad
),
cte_2 AS(
SELECT *, ROW_NUMBER() OVER(PARTITION BY ReportDate ORDER BY ReportDate) as rn FROM cte_1
)
SELECT ReportDate, Brand, Portfolio, Campaign, SUM(value1) as sum1, SUM(value2) as sum2, SUM(value3) as sum3,
SUM(value4) as sum4
FROM cte_2
WHERE rn=1
GROUP BY 1,2,3,4

Related

SQL: Extracting row segment pertaining to maximum value of a column for each unique entity in some other column

I have ‘Input Table’ as shown in attached snapshot.
Using SQL, I am intending to build an ‘Output Table’ where:
‘MaxDays’: should show the maximum value of ‘Days’ for a given ‘ID’
‘Type_MaxDays’: is the corresponding value of ‘Type’ pertaining to the maximum ‘Days’ identified for ‘MaxDays’
‘TotalUniqueType’: Counts all the unique ‘Type’ for any given ID
For example, for ID=878, Days=90 is the maximum of (63, 90, 33, 48) and it corresponds to Type=A. Hence, in output table, Max_Days= 90 and Type_MaxDays= A. Since ID=878 has total 4 unique 'Type' (ie.. A, B, C, D) so TotalUniqueType=4.
Finding the ‘TotalUniqueType’ seems straightforward, however coming from a python/pandas background, I am not able to figure out how to retrieve ‘MaxDays’ and ‘Type_MaxDays’ using SQL. Please advise.
I would recommend window functions and aggregation:
select id,
max(days) as maxdays,
max(case when seqnum = 1 then type end) as type_at_maxdays,
count(distinct type)
from (select t.*,
row_number() over (partition by id order by days desc) as seqnum
from t
) t
group by id;
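As a sanity check, the same pattern can be run against the ID=878 example from the question, using SQLite as a stand-in (the values match the worked example; the table name t is from the answer above):

```python
import sqlite3  # SQLite >= 3.25 supports window functions

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (ID INT, Type TEXT, Days INT);
INSERT INTO t VALUES (878,'B',63),(878,'A',90),(878,'C',33),(878,'D',48);
""")

# seqnum = 1 marks the row with the most days for each id
rows = conn.execute("""
SELECT id,
       MAX(days) AS maxdays,
       MAX(CASE WHEN seqnum = 1 THEN type END) AS type_at_maxdays,
       COUNT(DISTINCT type) AS totaluniquetype
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY days DESC) AS seqnum
      FROM t) t
GROUP BY id
""").fetchall()
print(rows)  # [(878, 90, 'A', 4)]
```

The conditional MAX(CASE WHEN seqnum = 1 …) is what carries Type over from the max-days row into the aggregated result.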
One option would be to join to a subquery which finds, for each ID, the maximum number of days along with the distinct type count. Then, also select the type of the row having the maximum number of days.
SELECT
t1.ID,
t1.Days AS MaxDays,
t1.Type AS Type_MaxDays,
t2.TypeCnt AS TotalUniqueType
FROM yourTable t1
INNER JOIN
(
SELECT ID, MAX(Days) AS MaxDays, COUNT(DISTINCT Type) AS TypeCnt
FROM yourTable
GROUP BY ID
) t2
ON t1.ID = t2.ID AND t1.Days = t2.MaxDays;
Demo

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads to solve my question. I have three columns in a table, columns being name, customer, and location. I'd like to add an additional column determining which location is most frequent, based off name and customer (first two columns).
I have included a photo of an example where, for name Jane and customer BEC, my created column would be "Texas", as that has 2 occurrences as opposed to one for California. Would there be any way to implement this?
If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
(select t2.location
from table1 t2
where t2.name = t.name
group by name, location
order by count(*) desc
fetch first 1 row only
) as most_frequent_location
from table1 t ;
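Here is a runnable sketch of that correlated-subquery version, using SQLite as a stand-in; SQLite uses LIMIT 1 where Oracle and standard SQL use FETCH FIRST 1 ROW ONLY, and the sample rows are invented:

```python
import sqlite3  # SQLite >= 3.25 supports window functions

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (Name TEXT, Customer TEXT, Location TEXT);
INSERT INTO table1 VALUES
  ('Jane', 'BEC', 'Texas'),
  ('Jane', 'BEC', 'Texas'),
  ('Jane', 'BEC', 'California'),
  ('John', 'ARC', 'California');
""")

# The correlated subquery counts locations per name and keeps the top one
rows = conn.execute("""
SELECT t.Name, t.Customer, t.Location,
       (SELECT t2.Location
        FROM table1 t2
        WHERE t2.Name = t.Name
        GROUP BY t2.Name, t2.Location
        ORDER BY COUNT(*) DESC
        LIMIT 1) AS most_frequent_location
FROM table1 t
""").fetchall()
# every 'Jane' row gets 'Texas' (2 occurrences beat 1 for California)
```

Because the subquery is correlated on name only, all rows for the same name share one most_frequent_location, regardless of customer.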
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
max(location) keep (dense_rank first order by location_count desc) over (partition by name) most_frequent_location
from (select t.*,
count(*) over (partition by name, customer, location) as location_count
from table1 t
) t;
Here is a db<>fiddle.
Both of these versions put 'Texas' in all four rows. However, each can be tweaked with minimal effort to put 'California' in the row for ARC.
In Oracle, you can use the aggregate function stats_mode() to compute the most frequently occurring value in a group.
Unfortunately it is not implemented as a window function. So one option uses an aggregate subquery, and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
select name, customer, stats_mode(location) top_location
from mytable
group by name, customer
) s on s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
t.*,
(
select stats_mode(t1.location)
from mytable t1
where t1.name = t.name and t1.customer = t.customer
) top_location
from mytable t
This is more a question about understanding the concepts of a relational database. If you want that information, you would not put it in an additional column. It is data calculated over multiple columns; why would you store that in the table itself? It is complex to code, and it would also be very expensive for the database (imagine all the rows for which that value would have to be recalculated if someone inserted a million rows).
Instead you can do one of the following
Calculate it at runtime, as shown in the other answers
if you want to make it more persistent, you could embed the query above in a view
if you want to physically store the info, you could use a materialized view
Plenty of documentation on those 3 options in the official oracle documentation
Your first step is to construct a query that determines the most frequent location, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful, but the logic can be used in row_number(), which gives you a unique id for each row returned. In the query below, I'm ordering by count(*) in descending order so that the most frequent occurrence has the value 1.
Note that row_number() assigns '1' to only one row per partition, even when rows tie on the ordering.
So, now we have
select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location most_freq_location
from table1 tab
inner join
(select Name, Customer, Location, row_number() over (partition by Name, Customer order by count(*) desc) freq_name_cust
from table1
group by Name, Customer, Location) tb_
on tb_.Name = tab.Name
and tb_.Customer = tab.Customer
and freq_name_cust = 1
You can see how it all works in this Fiddle where I deliberately inserted rows with the same frequency for California and Texas for one of the customers for illustration purposes.

Replacing null info if matching id in select statement

I am pulling from multiple tables and have three union queries. I want information that is missing from one row to be filled in from another row, while still keeping a separate set of values for each month.
There are three rows for each id, and essentially I want any information that is missing from a row to be copied from a row with a matching id, while still keeping the three rows separate, due to some pivoted monthly data that I want to keep separate for each row.
You can use analytic functions:
select max(name) over (partition by id) as name,
max(zip) over (partition by id) as zip,
max(type1) over (partition by id) as type1,
id, type, "201907", "201906", "201905"
from t;
You can use the following query:
SELECT
T2.MAX_NAME,
T2.MAX_ZIP,
T2.MAX_TYPE,
T1.ID,
T1.TYP1,
T1."201907",
T1."201906",
T1."201905"
FROM
TAB T1
JOIN (
SELECT
MAX(NAME) MAX_NAME,
MAX(ZIP) MAX_ZIP,
MAX(TYPE) MAX_TYPE,
ID
FROM
TAB
GROUP BY
ID
) T2 ON ( T1.ID = T2.ID );
db<>fiddle demo
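To illustrate the fill-in behaviour of the analytic version, here is a minimal runnable sketch with SQLite; the column names (name, zip) follow the answers above, and the rows are invented, with the known values spread across three rows of the same id:

```python
import sqlite3  # SQLite >= 3.25 supports window functions

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, name TEXT, zip TEXT);
INSERT INTO t VALUES
  (1, 'Ann', NULL),
  (1, NULL, '30301'),
  (1, NULL, NULL);
""")

# MAX() ignores NULLs, so each row picks up the non-NULL value from its id group
rows = conn.execute("""
SELECT id,
       MAX(name) OVER (PARTITION BY id) AS name,
       MAX(zip)  OVER (PARTITION BY id) AS zip
FROM t
""").fetchall()
print(rows)  # [(1, 'Ann', '30301'), (1, 'Ann', '30301'), (1, 'Ann', '30301')]
```

All three rows survive (unlike a GROUP BY), which is what lets the pivoted monthly columns stay separate per row.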

SQL query for filtering duplicate rows of a column by the minimum DateTime of those corresponding rows

I have a SQL database table, "Helium_Test_Data", that has multiple entries based on the KeyID column (the KeyID represents a single tested part ). I need to query the entries and only show one entry per KeyID (part) based on the earliest creation date-time (format example is 2018-12-29 08:22:11.123). This is because the same part was tested several times but the first reading is the one I need to use. Here is the query currently tried:
SELECT mt.*
FROM Helium_Test_Data mt
INNER JOIN
(
SELECT
KeyID,
MIN(DateTime) AS DateTime
FROM Helium_Test_Data
WHERE PSNo='11166565'
GROUP BY KeyID
) t ON mt.KeyID = t.KeyID AND mt.DateTime = t.DateTime
WHERE PSNo='11167197'
AND (mt.DateTime > '2018-12-29 07:00')
AND (mt.DateTime < '2018-12-29 18:00') AND OK=1
ORDER BY KeyId,DateTime
It returns only the rows that have no duplicate KeyID present in the table whereas I need one row per every single KeyID (duplicate or not). And for the duplicate ones, I need the earliest date.
Thanks in advance for the help.
Use the row_number() window function, which most DBMSs support:
select * from
(
select *,row_number() over(partition by KeyID order by DateTime) rn
from Helium_Test_Data
) t where t.rn=1
Or you could use a correlated subquery:
select t1.* from Helium_Test_Data t1
where t1.DateTime= (select min(DateTime)
from Helium_Test_Data t2
where t2.KeyID=t1.KeyID
)
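The two variants can be checked against each other with a quick SQLite sketch (the timestamps and readings are invented; the real table has more columns):

```python
import sqlite3  # SQLite >= 3.25 supports window functions

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Helium_Test_Data (KeyID INT, DateTime TEXT, Reading REAL);
INSERT INTO Helium_Test_Data VALUES
  (1, '2018-12-29 08:22:11', 0.5),
  (1, '2018-12-29 09:10:00', 0.7),  -- later retest of part 1
  (2, '2018-12-29 08:30:00', 0.9);
""")

# Variant 1: row_number() numbers each part's tests by time; keep rn = 1
rn_rows = conn.execute("""
SELECT KeyID, DateTime, Reading FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY KeyID ORDER BY DateTime) AS rn
  FROM Helium_Test_Data
) t WHERE t.rn = 1
""").fetchall()

# Variant 2: correlated subquery keeps rows matching the per-KeyID minimum
sub_rows = conn.execute("""
SELECT t1.* FROM Helium_Test_Data t1
WHERE t1.DateTime = (SELECT MIN(DateTime)
                     FROM Helium_Test_Data t2
                     WHERE t2.KeyID = t1.KeyID)
""").fetchall()

assert sorted(rn_rows) == sorted(sub_rows)  # both keep the earliest row per KeyID
```

Note that the correlated-subquery variant can return two rows for a KeyID if its minimum DateTime is duplicated exactly, whereas row_number() always keeps exactly one.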

In SQL, how do I create new column of values for each distinct values of another column?

Something like this: SQL How to create a value for a new column based on the count of an existing column by groups?
But I have more than two distinct values. I have a variable number n of distinct values, so I don't always know how many different counts I have.
And then in the original table, I want each row '3', '4', etc. to have the count i.e. all the rows with the '3' would have the same count, all the rows with '4' would have the same count, etc.
edit: Also how would I split the count via different dates i.e. '2017-07-19' for each distinct values?
edit2: Here is how I did it, but now I need to split it via different dates.
edit3: This is how I split via dates.
#standardSQL
SELECT * FROM
(SELECT * FROM table1) main
LEFT JOIN (SELECT event_date, value, COUNT(value) AS count
FROM table1
GROUP BY event_date, value) sub ON main.value=sub.value
AND sub.event_date=SAFE_CAST(main.event_time AS DATE)
edit4: I wish PARTITION BY were documented better somewhere. Nothing detailed seems to be widely written about it for BigQuery.
#standardSQL
SELECT
*,
COUNT(*) OVER (PARTITION BY event_date, value) AS cnt
FROM table1;
The query that you give would be better written using window functions:
SELECT t1.*, COUNT(*) OVER (PARTITION BY value) as cnt
FROM table1 t1;
I am not sure if this answers your question.
If you have another column that you want to count as well, you can use conditional aggregation:
SELECT t1.*,
COUNT(*) OVER (PARTITION BY value) as cnt,
SUM(CASE WHEN datecol = '2017-07-19' THEN 1 ELSE 0 END) OVER (PARTITION BY value) as cnt_20170719
FROM table1 t1;
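A runnable sketch of that combined query, using SQLite and made-up values for value and datecol:

```python
import sqlite3  # SQLite >= 3.25 supports window functions

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (value TEXT, datecol TEXT);
INSERT INTO table1 VALUES
  ('3', '2017-07-19'),
  ('3', '2017-07-20'),
  ('4', '2017-07-19');
""")

# cnt counts all rows per value; the conditional SUM counts only one date
rows = conn.execute("""
SELECT t1.*,
       COUNT(*) OVER (PARTITION BY value) AS cnt,
       SUM(CASE WHEN datecol = '2017-07-19' THEN 1 ELSE 0 END)
         OVER (PARTITION BY value) AS cnt_20170719
FROM table1 t1
""").fetchall()
# each '3' row carries cnt=2 and cnt_20170719=1; the '4' row carries 1 and 1
```

Unlike the LEFT JOIN version in edit3, no self-join is needed: the window functions attach the per-value counts to every row directly.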