SQL: Extracting row segment pertaining to maximum value of a column for each unique entity in some other column - sql

I have an ‘Input Table’ as shown in the attached snapshot.
Using SQL, I intend to build an ‘Output Table’ where:
‘MaxDays’: should show the maximum value of ‘Days’ for a given ‘ID’
‘Type_MaxDays’: is the value of ‘Type’ on the row holding that maximum ‘Days’
‘TotalUniqueType’: counts all the unique values of ‘Type’ for any given ID
For example, for ID=878, Days=90 is the maximum of (63, 90, 33, 48) and it corresponds to Type=A. Hence, in the output table, MaxDays = 90 and Type_MaxDays = A. Since ID=878 has 4 unique values of 'Type' in total (i.e. A, B, C, D), TotalUniqueType = 4.
Finding ‘TotalUniqueType’ seems straightforward; however, coming from a python/pandas background, I am not able to figure out how to retrieve ‘MaxDays’ and ‘Type_MaxDays’ using SQL. Please advise.
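Since the snapshot is not attached here, the ID=878 example reconstructed as tables (only the 90 ↔ A pairing is stated in the question; the other Type/Days pairings are illustrative):

Input Table                 Output Table
ID  | Type | Days           ID  | MaxDays | Type_MaxDays | TotalUniqueType
878 | A    | 90             878 | 90      | A            | 4
878 | B    | 63
878 | C    | 33
878 | D    | 48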

I would recommend window functions and aggregation:
select id,
       max(days) as maxdays,
       max(case when seqnum = 1 then type end) as type_at_maxdays,
       count(distinct type) as totaluniquetype
from (select t.*,
             row_number() over (partition by id order by days desc) as seqnum
      from t
     ) t
group by id;

One option would be to join to a subquery which finds, for each ID, the maximum number of days along with the distinct type count. Then, also select the type of the row having the maximum number of days.
SELECT
    t1.ID,
    t1.Days AS MaxDays,
    t1.Type AS Type_MaxDays,
    t2.TypeCnt AS TotalUniqueType
FROM yourTable t1
INNER JOIN
(
    SELECT ID, MAX(Days) AS MaxDays, COUNT(DISTINCT Type) AS TypeCnt
    FROM yourTable
    GROUP BY ID
) t2
    ON t1.ID = t2.ID AND t1.Days = t2.MaxDays;

Starting each SQL record from a specific point

I am trying to figure out how to start each member's records from a specific point in SQL. I have created a data set to represent what I would like.
This is the starting data set.
However, I want to get a new view, with a defined starting point.
So each member's records start from ID 33 onwards, ordered by Member and Date. Basically, I want every record from ID 33 onwards and the corresponding date for it.
If your ids are in order, you can use:
select t.*
from t
where id >= 33
order by member, date;
If the ids are not in order, one method is a correlated subquery:
select t.*
from t
where date >= (select min(t2.date) from t t2 where t2.member = t.member and t2.id = 33);
And finally, a window function approach is:
select t.*
from (select t.*,
             min(case when id = 33 then date end) over (partition by member) as date_33
      from t
     ) t
where date >= date_33;

Joining data from two sources using bigquery

Can anyone please check whether the code below is correct? In cte_1, I'm taking all dimensions and metrics from t1 except value1, value2, value3. In cte_2, I'm finding the unique row number for t2. In cte_3, I'm taking all distinct dimensions and metrics using a join on two keys, Date and Ad. In cte_4, I'm taking the values for only row number 1. I'm getting sum(value1), sum(value2), sum(value3) correct, but sum(value4) is incorrect.
WITH cte_1 AS (
  SELECT * EXCEPT(value1, value2, value3)
  FROM t1
  WHERE Date > "2020-02-16" AND Publisher = "fb"
),
-- Find unique row number from t2
cte_2 AS (
  SELECT ROW_NUMBER() OVER(ORDER BY Date) distinct_row_number, *
  FROM t2
),
cte_3 AS (
  SELECT cte_2.*, cte_1.* EXCEPT(Date)
  FROM cte_2
  JOIN cte_1
    ON cte_2.Date = cte_1.Date
   AND cte_2.Ad = cte_1.Ad
),
cte_4 AS (
  SELECT *
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY distinct_row_number ORDER BY Date) AS rn
    FROM cte_3
  ) T
  WHERE rn = 1
)
SELECT SUM(value1), SUM(value2), SUM(value3), SUM(value4) FROM cte_4
Please see the sample table below:
Whilst your data does not seem to match the query you shared, since it lacks the field named Ad and other fields have different names (such as Date vs. ReportDate), I was able to identify some issues and propose improvements.
First, within your temp table cte_1 you are only applying a filter in the WHERE clause; you could instead apply it within the FROM statement of your last step, such as:
SELECT * FROM (SELECT field1, field2, field3 FROM t1 WHERE Date > DATE(2020, 02, 16))
Second, in cte_2, you need to select all the columns you will need from the table t2. Otherwise, your table will have only the row number and it won't be possible to join it with other tables, since it would not provide any other information. Thus, if you need the row number, select it together with the other columns, which must include your join key(s) if you will perform any join later. The syntax would be as follows:
SELECT field1, field2, ROW_NUMBER() OVER(ORDER BY Date) AS distinct_row_number FROM t2
Third, in cte_3, I assume you want to perform an INNER JOIN. Thus, you need to make sure that the join keys are present in both tables, in your case Date and Ad, which I could not find within your data. Furthermore, you cannot have duplicated column names when joining two tables and selecting all the columns. For example, in your case you have Brand, value1, value2 and value3 in both tables, which will cause an error. Thus, you need to specify where these fields should come from, either by selecting them one by one or by using an EXCEPT clause.
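For instance, a minimal sketch of cte_3 that drops the duplicated columns from one side (the EXCEPT list is illustrative; it must name whichever columns actually appear in both tables):
SELECT cte_2.*, cte_1.* EXCEPT(Date, Ad, Brand)
FROM cte_2
JOIN cte_1
  ON cte_2.Date = cte_1.Date
 AND cte_2.Ad = cte_1.Ad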
Finally, cte_4 and your final select could be combined into one step. Basically, you are selecting only one row of data per partition, ordered by Date, and then summing the fields value1, value2 and value3 individually. Moreover, you are not selecting any identifier alongside the sums, which means that your result will contain only the final totals. In general, when performing an aggregation such as SUM(), the key column(s) are selected as well. Lastly, this could have been performed in one step as follows, using only the data from t2:
SELECT ReportDate, Brand,
       SUM(value1) AS sum_1, SUM(value2) AS sum_2,
       SUM(value3) AS sum_3, SUM(value4) AS sum_4
FROM (
    SELECT t2.*, ROW_NUMBER() OVER(PARTITION BY ReportDate ORDER BY ReportDate) AS rn
    FROM t2
)
WHERE rn = 1
GROUP BY ReportDate, Brand
UPDATE:
With your explanation in the comment section, I was able to create a more specific query. The fields ReportDate, Brand, Portfolio, Campaign and value1, value2, value3 are from t2, whilst value4 is from t1. The sum is made over the rows where the row number equals 1. For this reason, the tables t1 and t2 are joined before using ROW_NUMBER(). Finally, in the last SELECT statement rn is not selected and the data is aggregated based on ReportDate, Brand, Portfolio and Campaign.
WITH cte_1 AS (
    SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
           t2.value1, t2.value2, t2.value3, t1.value4
    FROM t2
    LEFT JOIN t1
        ON t2.ReportDate = t1.ReportDate AND t1.placement = t2.Ad
),
cte_2 AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY ReportDate ORDER BY ReportDate) AS rn
    FROM cte_1
)
SELECT ReportDate, Brand, Portfolio, Campaign,
       SUM(value1) AS sum1, SUM(value2) AS sum2, SUM(value3) AS sum3, SUM(value4) AS sum4
FROM cte_2
WHERE rn = 1
GROUP BY 1, 2, 3, 4

How do we find frequency of one column based off two other columns in SQL?

I'm relatively new to working with SQL and wasn't able to find any past threads that solve my question. I have three columns in a table: name, customer, and location. I'd like to add an additional column determining which location is most frequent, based on name and customer (the first two columns).
I have included a photo of an example where the created column for name=Jane, customer=BEC would be "Texas", as that has 2 occurrences as opposed to one for California. Would there be any way to implement this?
If you want 'Texas' on all four rows:
select t.Name, t.Customer, t.Location,
       (select t2.location
        from table1 t2
        where t2.name = t.name
        group by name, location
        order by count(*) desc
        fetch first 1 row only
       ) as most_frequent_location
from table1 t;
You can also do this with analytic functions:
select t.Name, t.Customer, t.Location,
       max(location) keep (dense_rank first order by location_count desc)
           over (partition by name) as most_frequent_location
from (select t.*,
             count(*) over (partition by name, customer, location) as location_count
      from table1 t
     ) t;
Both of these versions put 'Texas' in all four rows. However, each can be tweaked with minimal effort to put 'California' in the row for ARC.
In Oracle, you can use the aggregate function stats_mode() to compute the most frequently occurring value in a group.
Unfortunately it is not implemented as a window function, so one option uses an aggregate subquery and then a join with the original table:
select t.*, s.top_location
from mytable t
inner join (
    select name, customer, stats_mode(location) as top_location
    from mytable
    group by name, customer
) s on s.name = t.name and s.customer = t.customer
You could also use a correlated subquery:
select
    t.*,
    (
        select stats_mode(t1.location)
        from mytable t1
        where t1.name = t.name and t1.customer = t.customer
    ) as top_location
from mytable t
This is more a question about understanding the concepts of a relational database. If you want that information, you would not put it in an additional column. It is calculated data over multiple columns, so why would you store it in the table itself? It is complex to code, and it would also be very expensive for the database (imagine all the rows you would have to recalculate that value for if someone inserted a million rows).
Instead you can do one of the following:
Calculate it at runtime, as shown in the other answers
If you want to make it more persistent, you could embed the query above in a view
If you want to physically store the info, you could use a materialized view (both sketched below)
Plenty of documentation on those 3 options can be found in the official Oracle documentation.
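As a rough sketch of the last two options, reusing the stats_mode() aggregation from above (the object names are illustrative):
-- a regular view: recomputed every time it is queried
CREATE VIEW most_frequent_locations AS
SELECT name, customer, STATS_MODE(location) AS top_location
FROM mytable
GROUP BY name, customer;
-- a materialized view: physically stores the result
CREATE MATERIALIZED VIEW most_frequent_locations_mv AS
SELECT name, customer, STATS_MODE(location) AS top_location
FROM mytable
GROUP BY name, customer;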
Your first step is to construct a query that determines the most frequent location, which is as simple as:
select Name, Customer, Location, count(*)
from table1
group by Name, Customer, Location
This isn't immediately useful, but the logic can be used in row_number(), which gives you a unique id for each row returned. In the query below, I'm ordering by count(*) in descending order so that the most frequent occurrence has the value 1.
Note that row_number() assigns '1' to only one row.
So, now we have
select Name, Customer, Location,
       row_number() over (partition by Name, Customer order by count(*) desc) as freq_name_cust
from table1 tb_
group by Name, Customer, Location
The final step puts it all together:
select tab.*, tb_.Location as most_freq_location
from table1 tab
inner join
    (select Name, Customer, Location,
            row_number() over (partition by Name, Customer order by count(*) desc) as freq_name_cust
     from table1
     group by Name, Customer, Location) tb_
    on tb_.Name = tab.Name
    and tb_.Customer = tab.Customer
    and freq_name_cust = 1
For illustration purposes, I deliberately inserted test rows with the same frequency for California and Texas for one of the customers, so you can see how ties behave.

SQL - Summarize column with maximum date value and other fields

I have a table with the following fields:
Id|Date|Name
---------------
A|2019-04-24|"VALUE1"
A|2019-04-23|"VALUE2"
A|2019-06-11|"VALUE3"
A|2019-06-12|"VALUE4"
B|2019-05-21|"VALUE5"
B|2019-05-22|"VALUE6"
B|2019-03-13|"VALUE7"
C|2019-01-03|"VALUE8"
I would like to get one line per Id having the info of the maximum date line. This would be the output:
Id|Date|Name
---------------
A|2019-06-12|"VALUE4"
B|2019-05-22|"VALUE6"
C|2019-01-03|"VALUE8"
Using a GROUP BY, I have managed to get the Id and the MAX Date, but not the value associated with that date.
What I am working on now is to inner join that table with the input one, joining on date and id, but I am not able to join on two fields.
Is there any way to bring the value field related to the max date into the GROUP BY result?
Otherwise, how could I join those two tables on two different fields?
Any suggestion?
Thank you so much!!
You can use a correlated subquery :
select t.*
from table t
where t.date = (select max(t1.date) from table t1 where t1.id = t.id);
However, most DBMSs support analytic functions, so you can use:
select t.*
from (select t.*, row_number() over (partition by t.id order by t.date desc) as seq
from table t
) t
where seq = 1;
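For completeness, the two-field join the question asks about would look like this (a sketch, using 'mytable' as a placeholder for your table name; the join ties each row back to its group via both id and the max date):
select t.*
from mytable t
inner join (select id, max(date) as max_date
            from mytable
            group by id
           ) m
    on t.id = m.id
   and t.date = m.max_date;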

Query historized data

To describe my query problem, the following data is helpful:
A single table contains the columns ID (int), VAL (varchar) and ORD (int)
The values of VAL may change over time by which older items identified by ID won't get updated but appended. The last valid item for ID is identified by the highest ORD value (increases over time).
T0, T1 and T2 are points in time where data got entered.
How do I get to the result set in an efficient manner?
A solution must not involve materialized views etc. and should be expressible in a single SQL query. Using PostgreSQL 9.3.
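Since the T0/T1/T2 illustration is not included here, a minimal invented example of the setup (column names taken from the question and the answers below; the values are made up):
sysid | id | val | ord
    1 |  1 | A   |   0    (entered at T0)
    2 |  2 | B   |   0    (entered at T0)
    3 |  1 | A2  |   1    (entered at T1: ID 1 changed)
    4 |  2 | B2  |   2    (entered at T2: ID 2 changed)
The result set should then contain the rows with sysid 3 and 4, i.e. the highest ord per id.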
The correct way to select a groupwise maximum in Postgres is using DISTINCT ON:
SELECT DISTINCT ON (id) sysid, id, val, ord
FROM my_table
ORDER BY id, ord DESC;
You want all records for which no newer record exists:
select *
from mytable
where not exists
(
    select *
    from mytable newer
    where newer.id = mytable.id
      and newer.ord > mytable.ord
)
order by id;
You can do the same with row numbers. Give the latest entry per ID the number 1 and keep these:
select sysid, id, val, ord
from
(
    select
        sysid, id, val, ord,
        row_number() over (partition by id order by ord desc) as rn
    from mytable
) numbered
where rn = 1
order by id;
Left join the table (A) against itself (B) on the condition that B is more recent than A. Pick only the rows where B does not exist (i.e. A is the most recent row).
SELECT last_value.*
FROM my_table AS last_value
LEFT JOIN my_table
    ON my_table.id = last_value.id
    AND my_table.ord > last_value.ord
WHERE my_table.id IS NULL;