I have the below two tables:
Table 1
┌──────────┬────────────┬───────────────┐
│ account1 │ Fruit_name │ First_harvest │
├──────────┼────────────┼───────────────┤
│ 567      │ Apple      │ 201805        │
│ 432      │ Mango      │ 201809        │
│ 567      │ Apple      │ 201836        │
└──────────┴────────────┴───────────────┘
Table 2
┌──────────┬────────────┬──────────────┬───────────────┬──────────────┬─────────────┐
│ account1 │ Fruit_name │ Current_Farm │ Previous_Farm │ FirstHarvest │ LastHarvest │
├──────────┼────────────┼──────────────┼───────────────┼──────────────┼─────────────┤
│ 567      │ Apple      │ APFarm       │ AppleYard     │ 201801       │ 201810      │
│ 567      │ Apple      │ APFarm       │ FruitFarm     │ 201805       │ 201830      │
│ 567      │ Apple      │ APFarm       │ FruitMarket   │ 201831       │ 999999      │
│ 567      │ Royal Gala │ APFarm       │ GrocerWorld   │ 201815       │ 999999      │
└──────────┴────────────┴──────────────┴───────────────┴──────────────┴─────────────┘
My code:
SELECT DISTINCT a.account1,a.fruit_name,Max(a.first_harvest) first_harvest,b.current_farm,b.previous_farm,b.firstharvest,b.lastharvest
FROM fruit_harvest_data a
INNER JOIN fruit_farm_data b
ON a.account1 = b.account1
AND CASE WHEN b.fruit_name = 'Apple' THEN 'Royal Gala'
ELSE b.fruit_name END =
CASE WHEN a.fruit_name = 'Apple' THEN 'Royal Gala'
ELSE a.fruit_name END
WHERE a.first_harvest BETWEEN b.firstharvest AND b.lastharvest
GROUP BY a.account1,a.fruit_name,b.current_farm,b.previous_farm,b.firstharvest,b.lastharvest
HAVING Max(a.first_harvest) >= 201801
Result:
┌──────────┬────────────┬───────────────┬──────────────┬───────────────┬──────────────┬─────────────┐
│ account1 │ Fruit_name │ First_harvest │ Current_Farm │ Previous_Farm │ FirstHarvest │ LastHarvest │
├──────────┼────────────┼───────────────┼──────────────┼───────────────┼──────────────┼─────────────┤
│ 567      │ Apple      │ 201836        │ APFarm       │ FruitMarket   │ 201831       │ 999999      │
│ 567      │ Royal Gala │ 201836        │ APFarm       │ GrocerWorld   │ 201815       │ 999999      │
└──────────┴────────────┴───────────────┴──────────────┴───────────────┴──────────────┴─────────────┘
Request:
I get duplicate data because of the way this is stored. Is there a way to show only Royal Gala when an account1 has both Apple and Royal Gala?
Please note: an account1 (e.g. 567) can have multiple fruits, such as Apple, Royal Gala, Mango and Orange, but the result should contain only Royal Gala when the account has both Apple and Royal Gala.
I think the below should work:
select distinct T.* from
(SELECT DISTINCT a.account1,
case when a.fruit_name='Apple' or a.fruit_name='Royal Gala' then
'Apple' else a.fruit_name end as fruit_name ,Max(a.first_harvest) first_harvest,b.current_farm,b.previous_farm,b.firstharvest,b.lastharvest
FROM fruit_harvest_data a
INNER JOIN fruit_farm_data b
ON a.account1 = b.account1
AND CASE WHEN b.fruit_name = 'Apple' THEN 'Royal Gala'
ELSE b.fruit_name END =
CASE WHEN a.fruit_name = 'Apple' THEN 'Royal Gala'
ELSE a.fruit_name END
WHERE a.first_harvest BETWEEN b.firstharvest AND b.lastharvest
GROUP BY a.account1,a.fruit_name,b.current_farm,b.previous_farm,b.firstharvest,b.lastharvest
HAVING Max(a.first_harvest) >= 201801
) as T
It is still unclear what you want in your result set; a more complete desired result would help. But to answer the question of how to do it:
Since you have mentioned that Apple/Gala is an example, I would create a new table to contain these pairs:
create table replace_list(oldfruit varchar(20), newfruit varchar(20))
insert into replace_list values ('Apple','Royal Gala')
Then in your query add this:
left join replace_list r on r.oldfruit=b.fruit_name
left join fruit_farm_data n on n.account1=a.account1 and n.fruit_name=r.newfruit
and in your where clause, check that either the fruit name does not have a replacement (r.oldfruit is null), or it does have a replacement but the account does not have that replacement fruit (n.fruit_name is null):
where r.oldfruit is null or n.fruit_name is null
The rest of the query you can work out for yourself.
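A runnable sketch of this replace-list filter, using SQLite via Python so it is easy to test. The table is reduced to the two columns the filter touches, and the extra accounts (111, 432) are made-up illustration data:

```python
import sqlite3

# Sketch of the replace_list filter over a simplified, assumed schema:
# only the two columns the filter needs; accounts 111 and 432 are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fruit_farm_data(account1 INT, fruit_name TEXT);
INSERT INTO fruit_farm_data VALUES
  (567, 'Apple'), (567, 'Royal Gala'), (432, 'Mango'), (111, 'Apple');
CREATE TABLE replace_list(oldfruit TEXT, newfruit TEXT);
INSERT INTO replace_list VALUES ('Apple', 'Royal Gala');
""")

# Keep a row unless its fruit has a replacement AND the same account
# also holds the replacement fruit -- exactly the WHERE clause above.
rows = con.execute("""
SELECT b.account1, b.fruit_name
FROM fruit_farm_data b
LEFT JOIN replace_list r ON r.oldfruit = b.fruit_name
LEFT JOIN fruit_farm_data n
       ON n.account1 = b.account1 AND n.fruit_name = r.newfruit
WHERE r.oldfruit IS NULL OR n.fruit_name IS NULL
ORDER BY b.account1, b.fruit_name
""").fetchall()
print(rows)  # account 567 keeps only Royal Gala; lone Apples survive
```

Account 567 holds both fruits, so its Apple row is filtered out; accounts that hold only Apple keep it.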
Let us say that I have a table with user_id of type Int32 and login_time as a DateTime in UTC. user_id is not unique, so SELECT user_id, login_time FROM some_table; gives the following result:
┌─user_id─┬──login_time─┐
│       1 │  2021-03-01 │
│       1 │  2021-03-01 │
│       1 │  2021-03-02 │
│       2 │  2021-03-02 │
│       2 │  2021-03-03 │
└─────────┴─────────────┘
If I run SELECT COUNT(*) as count, toDate(login_time) as l FROM some_table GROUP BY l, I get the following result:
┌─count───┬──login_time─┐
│       2 │  2021-03-01 │
│       2 │  2021-03-02 │
│       1 │  2021-03-03 │
└─────────┴─────────────┘
I would like to reformat the result to show COUNT on a weekly level, instead of every day, as I currently do.
My result for the above example could look something like this:
┌──count──┬──year─┬──month──┬─week ordinal┐
│       5 │  2021 │      03 │           1 │
│       0 │  2021 │      03 │           2 │
│       0 │  2021 │      03 │           3 │
│       0 │  2021 │      03 │           4 │
└─────────┴───────┴─────────┴─────────────┘
I have gone through the documentation and found some interesting functions, but did not manage to make them solve my problem.
I have never worked with ClickHouse before and am not very experienced with SQL, which is why I am asking for help here.
Try this query:
select count() count, toYear(start_of_month) year, toMonth(start_of_month) month,
toWeek(start_of_week) - toWeek(start_of_month) + 1 AS "week ordinal"
from (
select *, toStartOfMonth(login_time) start_of_month,
toStartOfWeek(login_time) start_of_week
from (
/* emulate test dataset */
select data.1 user_id, toDate(data.2) login_time
from (
select arrayJoin([
(1, '2021-02-27'),
(1, '2021-02-28'),
(1, '2021-03-01'),
(1, '2021-03-01'),
(1, '2021-03-02'),
(2, '2021-03-02'),
(2, '2021-03-03'),
(2, '2021-03-08'),
(2, '2021-03-16'),
(2, '2021-04-01')]) data)
)
)
group by start_of_month, start_of_week
order by start_of_month, start_of_week
/*
┌─count─┬─year─┬─month─┬─week ordinal─┐
│ 1 │ 2021 │ 2 │ 4 │
│ 1 │ 2021 │ 2 │ 5 │
│ 5 │ 2021 │ 3 │ 1 │
│ 1 │ 2021 │ 3 │ 2 │
│ 1 │ 2021 │ 3 │ 3 │
│ 1 │ 2021 │ 4 │ 1 │
└───────┴──────┴───────┴──────────────┘
*/
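The week-ordinal arithmetic in the query above (1 + week-of-date minus week-of-month-start) can be checked in plain Python. Note this sketch uses ISO Monday-based weeks, while ClickHouse's toWeek/toStartOfWeek default to a different week mode, so it is an approximation of the idea rather than an exact port:

```python
from datetime import date, timedelta

def week_ordinal_in_month(d: date) -> int:
    """1 + (week containing d) - (week containing the 1st of d's month),
    mirroring toWeek(start_of_week) - toWeek(start_of_month) + 1 above.
    Weeks start on Monday here (ISO); ClickHouse's toWeek default differs."""
    first = d.replace(day=1)
    start_of_week = d - timedelta(days=d.weekday())        # Monday of d's week
    first_week = first - timedelta(days=first.weekday())   # Monday of the week containing the 1st
    return (start_of_week - first_week).days // 7 + 1

print(week_ordinal_in_month(date(2021, 3, 1)))   # 1 (2021-03-01 is a Monday)
print(week_ordinal_in_month(date(2021, 3, 8)))   # 2
print(week_ordinal_in_month(date(2021, 3, 16)))  # 3
```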
How can I make sure that with this join I'll only receive the sum of results and not the product?
I have a project entity which contains two one-to-many relations, disposals and supplies.
With the following query:
SELECT *
FROM projects
JOIN disposals disposal on projects.project_id = disposal.disposal_project_refer
WHERE (projects.project_name = 'Höngg')
I get following result:
project_id,project_name,disposal_id,depository_refer,material_refer,disposal_date,disposal_measurement,disposal_project_refer
1,Test,1,1,1,2020-08-12 15:24:49.913248,123,1
1,Test,2,1,2,2020-08-12 15:24:49.913248,123,1
1,Test,7,2,1,2020-08-12 15:24:49.913248,123,1
1,Test,10,3,4,2020-08-12 15:24:49.913248,123,1
The same number of rows is returned by the same query for supplies.
type Project struct {
ProjectID uint `gorm:"primary_key" json:"ProjectID"`
ProjectName string `json:"ProjectName"`
Disposals []Disposal `gorm:"ForeignKey:disposal_project_refer"`
Supplies []Supply `gorm:"ForeignKey:supply_project_refer"`
}
If I query both tables, I would like to receive the sum of both single queries. Currently I receive 16 results (4 supply results multiplied by 4 disposal results).
The combined query:
SELECT *
FROM projects
JOIN disposals disposal ON projects.project_id = disposal.disposal_project_refer
JOIN supplies supply ON projects.project_id = supply.supply_project_refer
WHERE (projects.project_name = 'Höngg');
I have tried to achieve my goal with union queries, but I was not successful. What else should I try?
This is your case (simplified):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select * from a join b on (a.x=b.x) join c on (b.x=c.x);
┌───┬───┬───┬────┬───┬─────┐
│ x │ y │ x │ z  │ x │  t  │
├───┼───┼───┼────┼───┼─────┤
│ 1 │ 1 │ 1 │ 11 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 11 │ 1 │ 222 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 111 │
│ 1 │ 1 │ 1 │ 22 │ 1 │ 222 │
└───┴───┴───┴────┴───┴─────┘
It produces a cartesian join because the join value is the same in all tables. You need some additional condition for joining your data. For example (tests for various cases):
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬────┬────┬───┬─────┐
│ x │ y │ rn │ x │ z  │ rn │ x │  t  │
├───┼───┼────┼───┼────┼────┼───┼─────┤
│ 1 │ 1 │  1 │ 1 │ 11 │  1 │ 1 │ 111 │
│ 1 │ 1 │  2 │ 1 │ 22 │  2 │ 1 │ 222 │
└───┴───┴────┴───┴────┴────┴───┴─────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬────┬───┬─────┬──────┬──────┬──────┐
│ x │ y │ rn │ x │  z  │  rn  │  x   │  t   │
├───┼───┼────┼───┼─────┼──────┼──────┼──────┤
│ 1 │ 1 │  1 │ 1 │  11 │    1 │    1 │  111 │
│ 1 │ 1 │  2 │ 1 │  22 │    2 │    1 │  222 │
│ 1 │ 1 │  3 │ 1 │  33 │ ░░░░ │ ░░░░ │ ░░░░ │
└───┴───┴────┴───┴─────┴──────┴──────┴──────┘
# with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22)), c(x,t) as (values(1,111),(1,222),(1,333))
select *
from a
cross join lateral (
select *
from (select row_number() over() as rn, * from b where b.x=a.x) as b
full join (select row_number() over() as rn, * from c where c.x=a.x) as c on (b.rn=c.rn)
) as bc;
┌───┬───┬──────┬──────┬──────┬────┬───┬─────┐
│ x │ y │  rn  │  x   │  z   │ rn │ x │  t  │
├───┼───┼──────┼──────┼──────┼────┼───┼─────┤
│ 1 │ 1 │    1 │    1 │   11 │  1 │ 1 │ 111 │
│ 1 │ 1 │    2 │    1 │   22 │  2 │ 1 │ 222 │
│ 1 │ 1 │ ░░░░ │ ░░░░ │ ░░░░ │  3 │ 1 │ 333 │
└───┴───┴──────┴──────┴──────┴────┴───┴─────┘
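The FULL JOIN on row_number() above pairs the two child row sets purely by position. Done in application code instead, the same pairing is Python's zip_longest, with None standing in for the SQL NULLs shown as ░:

```python
from itertools import zip_longest

# Positional pairing of the two child lists for x = 1,
# mirroring the full join on row_number() in the query above.
disposals = [11, 22, 33]   # the b.z values for x = 1
supplies  = [111, 222]     # the c.t values for x = 1

paired = list(zip_longest(disposals, supplies))
print(paired)  # [(11, 111), (22, 222), (33, None)]
```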
Note that there is no obvious relation between disposals and supplies (b and c in my example), so the order in which they are paired could be arbitrary. In my view, the better solution for this task is to aggregate the data from those tables, for example using JSON:
with a(x,y) as (values(1,1)), b(x,z) as (values(1,11),(1,22),(1,33)), c(x,t) as (values(1,111),(1,222))
select
*,
(select json_agg(to_json(b.*)) from b where a.x=b.x) as b,
(select json_agg(to_json(c.*)) from c where a.x=c.x) as c
from a;
┌───┬───┬──────────────────────────────────────────────────┬────────────────────────────────────┐
│ x │ y │                        b                         │                 c                  │
├───┼───┼──────────────────────────────────────────────────┼────────────────────────────────────┤
│ 1 │ 1 │ [{"x":1,"z":11}, {"x":1,"z":22}, {"x":1,"z":33}] │ [{"x":1,"t":111}, {"x":1,"t":222}] │
└───┴───┴──────────────────────────────────────────────────┴────────────────────────────────────┘
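The same aggregation idea can be sketched in SQLite via Python, with json_group_array and json_object standing in for Postgres's json_agg and to_json:

```python
import sqlite3, json

# JSON aggregation of the child tables, one row per parent; the a/b/c
# schema matches the example above (SQLite port of the Postgres query).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a(x INT, y INT);  INSERT INTO a VALUES (1, 1);
CREATE TABLE b(x INT, z INT);  INSERT INTO b VALUES (1, 11), (1, 22), (1, 33);
CREATE TABLE c(x INT, t INT);  INSERT INTO c VALUES (1, 111), (1, 222);
""")
row = con.execute("""
SELECT a.x, a.y,
       (SELECT json_group_array(json_object('x', b.x, 'z', b.z))
          FROM b WHERE b.x = a.x) AS b,
       (SELECT json_group_array(json_object('x', c.x, 't', c.t))
          FROM c WHERE c.x = a.x) AS c
FROM a
""").fetchone()
print(row[0], row[1], json.loads(row[2]), json.loads(row[3]))
```

Each parent row now carries its disposals and supplies as two independent arrays, so no cross product can occur.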
Hi, back with another problem. I have a table with several columns, two of which are latitude and longitude, and another is the crime type. What I need to do is work out how many crimes were committed within a given number of metres of a certain point.
Specifically, I need the number of crimes that took place within 250 m, 500 m and 1 km of the point E: 307998 m, N: 188746 m.
Help would be appreciated, or even just a push in the right direction.
Thanks
What an interesting question. The following may help.
You can use Pythagoras's theorem to calculate the distance from a point ([100,100] in this case) to any incident, then count the rows where this distance is less than a threshold and the type matches.
# select * from test;
┌─────┬─────┬──────┐
│ x │ y │ type │
├─────┼─────┼──────┤
│ 100 │ 100 │    1 │
│ 104 │ 100 │    1 │
│ 110 │ 100 │    1 │
│ 110 │ 102 │    1 │
│  50 │ 102 │    2 │
│  50 │ 150 │    2 │
│  50 │ 152 │    3 │
│ 150 │ 152 │    1 │
│  40 │ 152 │    1 │
│ 150 │ 150 │    2 │
└─────┴─────┴──────┘
(10 rows)
select count(*) from test where sqrt((x-100)*(x-100)+(y-100)*(y-100))<30 and type = 1;
┌───────┐
│ count │
├───────┤
│     4 │
└───────┘
(1 row)
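Applied to the question's easting/northing point (E 307998 m, N 188746 m), the same calculation looks like this in plain Python; the incident coordinates below are made up for illustration:

```python
import math

# Count incidents within a radius of the question's point.
point = (307998, 188746)
crimes = [
    (308100, 188800, 'burglary'),   # ~115 m away
    (308300, 188500, 'burglary'),   # ~390 m away
    (307000, 188746, 'assault'),    # ~998 m away
    (309000, 190000, 'burglary'),   # ~1605 m away
]

def within(radius):
    px, py = point
    return sum(1 for x, y, _ in crimes
               if math.hypot(x - px, y - py) <= radius)

for r in (250, 500, 1000):
    print(r, within(r))
```

In SQL you can also skip the square root by comparing the squared distance with the squared radius, e.g. (x-307998)*(x-307998) + (y-188746)*(y-188746) < 250*250.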
I have three tables in Google BigQuery:
t1) ID1, ID2
t2) ID1, Keywords (500,000 rows)
t3) ID2, Keywords (3 million rows)
The observations of ID1 have been matched/linked with observations in ID2, each observation has a number of keywords.
I want to know about the overlap in keywords between the matched ID1's and ID2's.
t1
┌──────┬──────┐
│ ID1  │ ID2  │
├──────┼──────┤
│ 1    │ A    │
│ 1    │ B    │
│ 1    │ C    │
│ 1    │ D    │
│ 2    │ E    │
│ 2    │ F    │
│ 2    │ G    │
│ 2    │ H    │
│ 3    │ I    │
│ 3    │ J    │
│ 3    │ K    │
│ 3    │ L    │
│ 4    │ M    │
│ 4    │ N    │
│ 4    │ O    │
│ 4    │ P    │
└──────┴──────┘
t2
┌──────────────────────┐
│       TABLE 2        │
├─────────┬────────────┤
│   ID1   │  KEYWORD   │
├─────────┼────────────┤
│    1    │ KEYWORD 1  │
│    1    │ KEYWORD 2  │
│    1    │ KEYWORD 3  │
│    1    │ KEYWORD 4  │
│    2    │ KEYWORD 2  │
│    2    │ KEYWORD 3  │
│    2    │ KEYWORD 6  │
│    2    │ KEYWORD 8  │
│    3    │ KEYWORD 10 │
│    3    │ KEYWORD 64 │
│    3    │ KEYWORD 42 │
│    3    │ KEYWORD 39 │
│    4    │ KEYWORD 18 │
│    4    │ KEYWORD 33 │
│    4    │ KEYWORD 52 │
│    4    │ KEYWORD 24 │
└─────────┴────────────┘
t3
┌───────────────────────┐
│        TABLE 3        │
├─────────┬─────────────┤
│   ID2   │   KEYWORD   │
├─────────┼─────────────┤
│    A    │ KEYWORD 1   │
│    A    │ KEYWORD 2   │
│    A    │ KEYWORD 54  │
│    A    │ KEYWORD 34  │
│    B    │ KEYWORD 32  │
│    B    │ KEYWORD 876 │
│    B    │ KEYWORD 632 │
│    B    │ KEYWORD 2   │
│    K    │ KEYWORD 53  │
│    K    │ KEYWORD 43  │
│    K    │ KEYWORD 10  │
│    K    │ KEYWORD 64  │
│    P    │ KEYWORD 56  │
│    P    │ KEYWORD 44  │
│    P    │ KEYWORD 322 │
│    P    │ KEYWORD 99  │
└─────────┴─────────────┘
As the tables show, ID1 1 is matched to ID2 A. Both have KEYWORD 1 and KEYWORD 2, so a total of 2 keywords overlap between the matched observations, which in this case (as ID1 1 has 4 keywords in total) is a 50% overlap.
I am looking to make the following table, where every row in t1 gets additional columns MATCH COUNT and MATCH PERCENTAGE.
┌───────────────────────────────────────────────┐
│                    RESULT                     │
├────────┬─────┬─────────────┬──────────────────┤
│  ID1   │ ID2 │ MATCH COUNT │ MATCH PERCENTAGE │
├────────┼─────┼─────────────┼──────────────────┤
│ 1      │ A   │ 2           │ 50%              │
│ 1      │ B   │ 1           │ 25%              │
│ (...)  │(...)│ (...)       │ (...)            │
│ 3      │ K   │ 2           │ 50%              │
│ 4      │ P   │ 0           │ 0%               │
└────────┴─────┴─────────────┴──────────────────┘
I know it is good etiquette to show what I have already tried, but honestly this one is way over my head and I don't even know where to start. I am hoping somebody can point me in the right direction.
You can do this using join and group by:
select t1.id1, t2.id2,
count(t3.keyword) as num_matches,
count(t3.keyword) / count(*) as proportion_matches
from t1 left join
t2
on t1.id1 = t2.id1 left join
t3
on t1.id2 = t3.id2 and
t2.keyword = t3.keyword
group by t1.id1, t2.id2;
This assumes that the keywords are unique for each id.
I think this is a solution:
select Id1, Id2, Sum(Match) Match, Sum(Match) / Sum(Total) as Perc
from (
select t2.Id1, t2.Id2, Decode(t1.Keyword, t3.Keyword, 1, 0) Match, 1 Total
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t2.Id2 = t3.Id2)
)
group by Id1, Id2
if you don't have the Decode function you can use case:
case when t1.Keyword = t3.Keyword then 1 else 0 end
Easier:
select t1.Id1, t1.Id2, Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) Match, Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) / Count(1) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
BigQuery has the function CountIf, so you can also use:
select t1.Id1, t1.Id2, CountIf(t2.Keyword = t3.Keyword) Match, CountIf(t2.Keyword = t3.Keyword) / Count(1) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
Below is for BigQuery Standard SQL
#standardSQL
SELECT t1.id1, t1.id2,
COUNTIF(t2.keyword = t3.keyword) match_count,
COUNTIF(t2.keyword = t3.keyword) / COUNT(DISTINCT t2.keyword) match_percentage
FROM t2 CROSS JOIN t3
JOIN t1 ON t1.id1 = t2.id1 AND t1.id2 = t3.id2
GROUP BY t1.id1, t1.id2
-- ORDER BY t1.id1, t1.id2
with the result as below:
Row id1 id2 match_count match_percentage
1 1 A 2 0.5
2 1 B 1 0.25
3 3 K 2 0.5
4 4 P 0 0.0
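The COUNTIF query above can be checked with a quick SQLite port via Python: SUM(t2.keyword = t3.keyword) plays the role of COUNTIF, and the data is the subset of t1/t2/t3 shown in the question:

```python
import sqlite3

# SQLite port of the BigQuery COUNTIF query, over the question's sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1(id1 INT, id2 TEXT);
INSERT INTO t1 VALUES (1,'A'),(1,'B'),(3,'K'),(4,'P');
CREATE TABLE t2(id1 INT, keyword TEXT);
INSERT INTO t2 VALUES
  (1,'KEYWORD 1'),(1,'KEYWORD 2'),(1,'KEYWORD 3'),(1,'KEYWORD 4'),
  (3,'KEYWORD 10'),(3,'KEYWORD 64'),(3,'KEYWORD 42'),(3,'KEYWORD 39'),
  (4,'KEYWORD 18'),(4,'KEYWORD 33'),(4,'KEYWORD 52'),(4,'KEYWORD 24');
CREATE TABLE t3(id2 TEXT, keyword TEXT);
INSERT INTO t3 VALUES
  ('A','KEYWORD 1'),('A','KEYWORD 2'),('A','KEYWORD 54'),('A','KEYWORD 34'),
  ('B','KEYWORD 32'),('B','KEYWORD 876'),('B','KEYWORD 632'),('B','KEYWORD 2'),
  ('K','KEYWORD 53'),('K','KEYWORD 43'),('K','KEYWORD 10'),('K','KEYWORD 64'),
  ('P','KEYWORD 56'),('P','KEYWORD 44'),('P','KEYWORD 322'),('P','KEYWORD 99');
""")
rows = con.execute("""
SELECT t1.id1, t1.id2,
       SUM(t2.keyword = t3.keyword)                           AS match_count,
       SUM(t2.keyword = t3.keyword) * 1.0
         / COUNT(DISTINCT t2.keyword)                         AS match_percentage
FROM t2 CROSS JOIN t3
JOIN t1 ON t1.id1 = t2.id1 AND t1.id2 = t3.id2
GROUP BY t1.id1, t1.id2
ORDER BY t1.id1, t1.id2
""").fetchall()
print(rows)  # [(1, 'A', 2, 0.5), (1, 'B', 1, 0.25), (3, 'K', 2, 0.5), (4, 'P', 0, 0.0)]
```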
I have a table that I query, and I would like to get the percent of the total of a row, not of a column.
My query:
SELECT
SUM(CASE WHEN type = '0' THEN duration END) AS 'type0',
SUM(CASE WHEN type = '1' THEN duration END) AS 'type1',
SUM(CASE WHEN type = '2' THEN duration END) AS 'type2'
FROM table1
GROUP BY range
ORDER BY range
What I get this:
┌──────────────────┬──────────────────┬──────────────────┐
│ type0 │ type1 │ type2 │
├──────────────────┼──────────────────┼──────────────────┤
│ 59989.3049204680 │ 25232.1543858130 │ 24831.1788671015 │
│  3306.3676530180 │  1705.9501506120 │  2657.4211752480 │
│   352.0299258450 │  1692.4885264580 │  1805.3495437180 │
│    37.4959716400 │  1584.6392620720 │  1343.1338054350 │
│     8.6286011400 │  1392.7870618600 │  1042.1155937090 │
│     9.4098509860 │  1269.7669830510 │   970.8922643280 │
│     7.6270751800 │  1163.2768018390 │   836.8802361650 │
│     2.9459229000 │   873.3172769110 │   464.6357979220 │
│     3.2543335080 │   695.5214343770 │   380.5008553400 │
│     5.4269405200 │  3120.0459350020 │  3603.2397332800 │
└──────────────────┴──────────────────┴──────────────────┘
What I'm trying to get:
┌────────┬────────┬────────┐
│ type0 │ type1 │ type2 │
├────────┼────────┼────────┤
│ 54,51% │ 22,93% │ 22,56% │
│ 43,11% │ 22,24% │ 34,65% │
│  9,14% │ 43,96% │ 46,89% │
│  1,26% │ 53,44% │ 45,30% │
│  0,35% │ 57,00% │ 42,65% │
│  0,42% │ 56,43% │ 43,15% │
│  0,38% │ 57,94% │ 41,68% │
│  0,22% │ 65,13% │ 34,65% │
│  0,30% │ 64,44% │ 35,26% │
│  0,08% │ 46,37% │ 53,55% │
└────────┴────────┴────────┘
I know how to get the percent of a column total:
CAST(100 * SUM(duration) / SUM(SUM(duration)) OVER () AS DECIMAL(5, 2))
as a window function, but I don't know the trick to do the same across a row.
Just do the math in your SELECT clause:
SELECT
SUM(CASE WHEN type = 0 THEN duration END)/SUM(CASE WHEN type in (0,1,2) THEN duration END) as type1_perc,
SUM(CASE WHEN type = 1 THEN duration END)/SUM(CASE WHEN type in (0,1,2) THEN duration END) as type2_perc,
SUM(CASE WHEN type = 2 THEN duration END)/SUM(CASE WHEN type in (0,1,2) THEN duration END) as type3_perc
FROM table1
GROUP BY range
ORDER BY range
Here is an option using temp tables.
SELECT
SUM(CASE WHEN type = '0' THEN duration END) AS 'type0',
SUM(CASE WHEN type = '1' THEN duration END) AS 'type1',
SUM(CASE WHEN type = '2' THEN duration END) AS 'type2'
INTO #TEMP
FROM table1
GROUP BY range
select type0/(type0+type1+type2), type1/(type0+type1+type2), type2/(type0+type1+type2)
from #TEMP
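A runnable sketch of the row-wise version in SQLite via Python: each type's sum is divided by the row's own total, so the three columns add up to 100% per range. The column is named range_id here because range is a reserved word in some dialects, the sample durations are made up, and every row is assumed to have type 0, 1 or 2 (otherwise use the SUM over CASE WHEN type IN (0,1,2) form from the first answer as the denominator):

```python
import sqlite3

# Row-wise percentages: denominator is the row's own total, not a column total.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1(range_id INT, type INT, duration REAL);
INSERT INTO table1 VALUES
  (1, 0, 60), (1, 1, 25), (1, 2, 15),
  (2, 0, 40), (2, 1, 20), (2, 2, 40);
""")
rows = con.execute("""
SELECT range_id,
       ROUND(100.0 * SUM(CASE WHEN type = 0 THEN duration END) / SUM(duration), 2),
       ROUND(100.0 * SUM(CASE WHEN type = 1 THEN duration END) / SUM(duration), 2),
       ROUND(100.0 * SUM(CASE WHEN type = 2 THEN duration END) / SUM(duration), 2)
FROM table1
GROUP BY range_id
ORDER BY range_id
""").fetchall()
print(rows)  # [(1, 60.0, 25.0, 15.0), (2, 40.0, 20.0, 40.0)]
```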