Remove Duplicate Records in Hive

I want to create a table that indicates medical providers that are linked by common members. For example, if I go to prov 1 and prov 2, then prov 1 and prov 2 will be linked because I visited both.
I have a table where each record indicates a member visiting a provider on a specific date. The table contains millions of members and thousands of provs. Below is a small example of the table:
member prov date
1 1 1/1/15
1 2 1/2/15
2 16 1/12/14
2 5 1/1/16
I am trying to create a table where each record indicates two distinct providers being linked by a common member. For example:
member prov1 prov2 date1 date2
1 1 2 1/1/15 1/2/15
2 16 5 1/12/14 1/1/16
I am trying to use an inner join on the same table, but it is returning duplicate records. I thought the distinct clause would fix this, but it does not seem to get the job done. My query is shown below:
select distinct a.member, a.prov, b.prov, a.date, b.date
from table1 as a
inner join table1 as b
on a.member=b.member
This query returns distinct records, but some of them contain the same information with the providers swapped. For example:
a.member a.prov b.prov a.date b.date
1 1 2 1/1/15 1/2/15
1 2 1 1/2/15 1/1/15
Above we see that the records are distinct, but they describe the same information. Below is what I want the query to return:
a.member a.prov b.prov a.date b.date
1 1 2 1/1/15 1/2/15
How can I alter the above query so that it returns only distinct information? I don't want one record per member. I want one record for each distinct provider pairing per member.

One option is to use conditional aggregation with a subquery using row_number:
select member,
max(case when rn = 1 then prov end) prov1,
max(case when rn = 2 then prov end) prov2,
max(case when rn = 1 then date end) date1,
max(case when rn = 2 then date end) date2
from (select member,
prov,
date,
row_number() over (partition by member order by prov, date) rn
from table1) t
group by member
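Note that the conditional aggregation above returns one row per member, built from that member's first two rows. If a member can visit more than two providers, another option is to keep the self-join from the question and add an ordering condition such as a.prov < b.prov, so that each unordered pair appears only once. Below is a minimal sketch of that idea. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example can run anywhere) with the sample rows from the question.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (member INTEGER, prov INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, 1, "1/1/15"), (1, 2, "1/2/15"), (2, 16, "1/12/14"), (2, 5, "1/1/16")],
)

# a.prov < b.prov keeps exactly one ordering of each provider pair and
# also drops rows where a provider is paired with itself.
pairs = conn.execute(
    """
    SELECT DISTINCT a.member AS member, a.prov AS prov1, b.prov AS prov2,
           a.date AS date1, b.date AS date2
    FROM table1 AS a
    INNER JOIN table1 AS b
      ON a.member = b.member
     AND a.prov < b.prov
    ORDER BY member, prov1, prov2
    """
).fetchall()

for row in pairs:
    print(row)
```

With this condition the smaller provider id always lands in prov1, so member 2 comes back as the pair (5, 16) rather than (16, 5). If a member visits the same pair of providers on several dates, one row per date combination is still returned.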

Related

I need a 3 table join

This is my parent table acc_detial -
ACC_DETIAL example -
acc_id
1
2
3
Now I have 3 tables:
ORDER
EMAIL
REPORT
Each table contains 100 rows, and acc_id is a foreign key referencing ACC_DETIAL.
In the ORDER table I have the columns ACC_ID and QUANTITY. I want the count of ACC_ID and the sum of QUANTITY.
ORDER table example:
acc_id quantity date
1 2 2022/01/22
2 5 2022/01/23
1 10 2022/01/25
3 1 2022/01/25
In the EMAIL table I have a column named ACC_ID, and I want the count of ACC_ID.
EMAIL table example:
acc_id mail date
1 5 2022/01/22
2 10 2022/01/22
1 7 2022/01/23
1 7 2022/01/24
2 10 2022/01/25
In the REPORT table I have the columns ACC_ID and TYPE, and I want the count of ACC_ID and of TYPE. Note that the TYPE column has only two possible values:
positive
negative
I want the count of each, i.e. the count of positive and the count of negative values in the TYPE column.
REPORT table example:
acc_id type date
1 positive 2022/01/22
2 negative 2022/01/22
1 negative 2022/01/23
2 positive 2022/01/26
2 positive 2022/01/27
I need to get all of this in a single query, either as a raw query or with SQLAlchemy. Is that possible, or do I need to write a separate query for each table?
Result -
Result based on the above example -
acc_id total_Order_acc_id total_Order_quantity total_Email_acc_id total_Report_acc_id total_postitive_report total_negative_report
1 2 12 3 2 1 1
2 1 5 2 3 2 1
3 1 1 Null Null Null Null

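You need to aggregate each table first and then join the results, as follows (the query assumes the order table is named ORDERS, since ORDER is a reserved word):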
SELECT ADL.acc_id,
ORD.ord_cnt AS total_Order_acc_id,
ORD.tot_quantity AS total_Order_quantity,
EML.eml_cnt AS total_Email_acc_id,
RPT.rpt_cnt AS total_Report_acc_id,
RPT.pcnt AS total_postitive_report,
RPT.ncnt AS total_negative_report
FROM ACC_DETIAL ADL LEFT JOIN
(
SELECT acc_id,
SUM(quantity) AS tot_quantity,
COUNT(*) AS ord_cnt
FROM ORDERS
GROUP BY acc_id
) ORD
ON ADL.acc_id = ORD.acc_id
LEFT JOIN
(
SELECT acc_id, COUNT(*) AS eml_cnt
FROM EMAIL
GROUP BY acc_id
) EML
ON ADL.acc_id = EML.acc_id
LEFT JOIN
(
SELECT acc_id,
COUNT(*) AS rpt_cnt,
COUNT(*) FILTER (WHERE type='positive') AS pcnt,
COUNT(*) FILTER (WHERE type='negative') AS ncnt
FROM REPORT
GROUP BY acc_id
) RPT
ON ADL.acc_id = RPT.acc_id
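The reason for aggregating before joining is that joining the raw tables first multiplies rows: for acc_id 1 in the sample data, 2 orders, 3 emails and 2 reports would produce 12 joined rows and inflate every count. The sketch below checks the aggregate-then-join pattern against the sample data. It uses Python's sqlite3 module as a stand-in database (an assumption), names the order table orders, and replaces COUNT(*) FILTER with SUM(CASE ...) so that it also runs on engines without FILTER support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE acc_detial (acc_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (acc_id INTEGER, quantity INTEGER);
    CREATE TABLE email (acc_id INTEGER, mail INTEGER);
    CREATE TABLE report (acc_id INTEGER, type TEXT);
    INSERT INTO acc_detial VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1, 2), (2, 5), (1, 10), (3, 1);
    INSERT INTO email VALUES (1, 5), (2, 10), (1, 7), (1, 7), (2, 10);
    INSERT INTO report VALUES
        (1, 'positive'), (2, 'negative'), (1, 'negative'),
        (2, 'positive'), (2, 'positive');
    """
)

# Each subquery collapses its table to one row per acc_id before the join,
# so the LEFT JOINs cannot multiply rows.
rows = conn.execute(
    """
    SELECT adl.acc_id, ord.ord_cnt, ord.tot_quantity,
           eml.eml_cnt, rpt.rpt_cnt, rpt.pcnt, rpt.ncnt
    FROM acc_detial AS adl
    LEFT JOIN (SELECT acc_id, COUNT(*) AS ord_cnt, SUM(quantity) AS tot_quantity
               FROM orders GROUP BY acc_id) AS ord ON adl.acc_id = ord.acc_id
    LEFT JOIN (SELECT acc_id, COUNT(*) AS eml_cnt
               FROM email GROUP BY acc_id) AS eml ON adl.acc_id = eml.acc_id
    LEFT JOIN (SELECT acc_id, COUNT(*) AS rpt_cnt,
                      SUM(CASE WHEN type = 'positive' THEN 1 ELSE 0 END) AS pcnt,
                      SUM(CASE WHEN type = 'negative' THEN 1 ELSE 0 END) AS ncnt
               FROM report GROUP BY acc_id) AS rpt ON adl.acc_id = rpt.acc_id
    ORDER BY adl.acc_id
    """
).fetchall()

for row in rows:
    print(row)
```

Accounts with no rows in a child table, such as acc_id 3 for EMAIL and REPORT, come back with NULL (None in Python) in those columns because of the LEFT JOINs.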

Sample (MySQL syntax; returns one row per acc_id and report type):
Select
`order`.`acc_id`,
report_email_select.`type`,
report_email_select.report_count,
report_email_select.email_count,
SUM(`quantity`) as quantity_sum
FROM
`order`
Left JOIN(
Select
report_select.`acc_id`,
report_select.`type`,
report_select.report_count,
COUNT(*) as email_count
from
(
SELECT
report.`acc_id`,
report.`type`,
COUNT(*) as report_count
FROM
`report`
WHERE
1
GROUP BY
report.`acc_id`,
report.`type`
) AS report_select
INNER JOIN email ON email.acc_id = report_select.acc_id
GROUP BY
report_select.`acc_id`,
report_select.`type`
) AS report_email_select ON `order`.acc_id = report_email_select.acc_id
GROUP BY
`order`.`acc_id`,
report_email_select.`type`;

Getting count of last records of 2 columns SQL

I am looking for a solution to the scenario below.
My table, energy_readings, is structured like this:
equipment_id meter_id readings reading_date
1 1 100 01/01/2022
1 1 200 02/01/2022
1 1 null 03/01/2022
1 2 100 01/01/2022
1 2 null 04/01/2022
2 1 null 04/01/2022
2 1 399 05/01/2022
2 2 null 02/01/2022
From this, I want to count the nulls among the last records of each equipment_id and meter_id pair. Only a null in the last record (latest date) of a pair should be counted.
Example: the last reading for equipment 1 and meter 1 is null, so it is counted. The last reading (latest date) for equipment 1 and meter 2 is also null, so it is counted too. Equipment 2 and meter 1 has a null, but it is not the last record (latest date), so it is not counted.
The result should therefore be:
equipment_id Count
1 2
2 1
Hope I was clear with the question.
Thank you!

You can use a CTE as shown below. The LatestRecord CTE finds the latest reading_date for each equipment_id and meter_id pair. Joining it back to the table picks out those latest rows, and the WHERE clause keeps only the ones with a null reading. The count is grouped by equipment_id to match the expected result.
;WITH LatestRecord AS (
SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
FROM energy_readings
GROUP BY equipment_id, meter_id
)
SELECT er.equipment_id, COUNT(1) AS [Count]
FROM energy_readings er
JOIN LatestRecord lr
ON lr.equipment_id = er.equipment_id
AND lr.meter_id = er.meter_id
AND lr.reading_date = er.reading_date
WHERE er.readings IS NULL
GROUP BY er.equipment_id
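The latest-record CTE approach can be checked against the sample data with a short script. This is a minimal sketch that uses Python's sqlite3 module as a stand-in database (an assumption), stores the dates as ISO strings (yyyy-mm-dd) so that MAX orders them correctly as text, and groups the final count by equipment_id as in the expected result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE energy_readings "
    "(equipment_id INTEGER, meter_id INTEGER, readings INTEGER, reading_date TEXT)"
)
conn.executemany(
    "INSERT INTO energy_readings VALUES (?, ?, ?, ?)",
    [
        (1, 1, 100, "2022-01-01"),
        (1, 1, 200, "2022-01-02"),
        (1, 1, None, "2022-01-03"),
        (1, 2, 100, "2022-01-01"),
        (1, 2, None, "2022-01-04"),
        (2, 1, None, "2022-01-04"),
        (2, 1, 399, "2022-01-05"),
        (2, 2, None, "2022-01-02"),
    ],
)

# Step 1 (CTE): latest reading_date per (equipment_id, meter_id) pair.
# Step 2: join back to fetch those rows and keep only the null readings.
null_counts = conn.execute(
    """
    WITH latest AS (
        SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
        FROM energy_readings
        GROUP BY equipment_id, meter_id
    )
    SELECT er.equipment_id, COUNT(*) AS null_count
    FROM energy_readings AS er
    JOIN latest AS lr
      ON lr.equipment_id = er.equipment_id
     AND lr.meter_id = er.meter_id
     AND lr.reading_date = er.reading_date
    WHERE er.readings IS NULL
    GROUP BY er.equipment_id
    ORDER BY er.equipment_id
    """
).fetchall()

print(null_counts)
```

An equipment_id whose latest readings are all non-null does not appear in the output at all, because its rows are filtered out before grouping.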

with records as (
select equipment_id, meter_id, reading_date, readings,
RANK() OVER (PARTITION BY equipment_id, meter_id
ORDER BY reading_date DESC) AS rnk
from energy_readings
)
select equipment_id, count(*) AS null_count
from records
where rnk = 1 and readings IS NULL
group by equipment_id
Explanation:
records ranks the rows of each equipment_id and meter_id pair by reading_date, newest first, so the latest reading gets rank 1.
The outer query keeps only the rank 1 rows whose readings value is null.
Grouping by equipment_id then counts those rows per equipment.

Select rows with max date from table

I have the table below and need the result shown under "Needed result". I am trying to select the rows with the max date, grouped by profile_id and ordered by id. The result table must include the id column. I tried this query:
SELECT MAX(charges.id) as id,
"charges"."profile_id", MAX(failed_at) AS failed_at
FROM "charges"
GROUP BY "charges"."profile_id"
ORDER BY "charges"."id" ASC
and got this error:
ERROR: column "charges.id" must appear in the GROUP BY clause or be used in an aggregate function
Example table
id profile_id failed_at
1 1 01.01.2021
2 1 01.02.2021
3 1 01.03.2021
4 2 01.06.2021
5 2 01.05.2021
6 2 01.04.2021
Needed result
id profile_id failed_at
3 1 01.03.2021
4 2 01.06.2021

SELECT charges.*
FROM charges
INNER JOIN
(
SELECT
profile_id,
MAX(charges.failed_at) AS MaxFailed_at
FROM charges
GROUP BY profile_id
) AS xQ ON charges.profile_id = xQ.profile_id AND charges.failed_at = xQ.MaxFailed_at
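The join above works because the subquery computes one maximum failed_at per profile_id, and the join then pulls back the full row, including id, that carries that date. A minimal sketch against the sample data follows. It uses Python's sqlite3 module in place of PostgreSQL and stores the dates as ISO strings, assuming the original values are day.month.year; both choices are assumptions made so the example runs and sorts correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges (id INTEGER PRIMARY KEY, profile_id INTEGER, failed_at TEXT)"
)
conn.executemany(
    "INSERT INTO charges VALUES (?, ?, ?)",
    [
        (1, 1, "2021-01-01"),
        (2, 1, "2021-02-01"),
        (3, 1, "2021-03-01"),
        (4, 2, "2021-06-01"),
        (5, 2, "2021-05-01"),
        (6, 2, "2021-04-01"),
    ],
)

# The subquery finds the latest failed_at per profile; the join recovers
# the whole matching row, so id is available without aggregating it.
latest_rows = conn.execute(
    """
    SELECT charges.id, charges.profile_id, charges.failed_at
    FROM charges
    INNER JOIN (
        SELECT profile_id, MAX(failed_at) AS max_failed_at
        FROM charges
        GROUP BY profile_id
    ) AS xq
      ON charges.profile_id = xq.profile_id
     AND charges.failed_at = xq.max_failed_at
    ORDER BY charges.id
    """
).fetchall()

print(latest_rows)
```

If two rows of the same profile share the maximum failed_at, this pattern returns both of them.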

Postgresql query to filter latest data based on 2 columns

Table Structure First
users table
id
1
2
3
sites table
id
1
2
site_memberships table
site_id user_id created_on
1 1 1
1 1 2
1 1 3
2 1 1
2 1 2
1 2 2
1 2 3
Assume that the higher the created_on number, the more recent the record.
Expected Output
site_id user_id created_on
1 1 3
2 1 2
1 2 3
Expected output: I need latest record for each user for each site membership.
Tried the following query, but this does not seem to work.
select * from users inner join
(
SELECT ROW_NUMBER () OVER (
PARTITION BY sm.user_id,
sm.created_on
), sm.*
from site_memberships sm
inner join sites s on sm.site_id=s.id
) site_memberships
ON site_memberships.user_id = users.user_id where row_number=1

I think you have overcomplicated the problem you want to solve.
You seem to want aggregation:
select site_id, user_id, max(created_on)
from site_memberships sm
group by site_id, user_id;
If you had additional columns that you wanted, you could use distinct on instead:
select distinct on (site_id, user_id) sm.*
from site_memberships sm
order by site_id, user_id, created_on desc;
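distinct on is specific to PostgreSQL. On engines without it, the same "latest row per group" result can be sketched with ROW_NUMBER(), which is also close to what the query in the question was attempting: the window needs to be partitioned by site_id and user_id and ordered by created_on descending. The example below runs that variant with Python's sqlite3 module (an assumption; window functions need SQLite 3.25 or newer) on the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE site_memberships (site_id INTEGER, user_id INTEGER, created_on INTEGER)"
)
conn.executemany(
    "INSERT INTO site_memberships VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 2), (1, 1, 3), (2, 1, 1), (2, 1, 2), (1, 2, 2), (1, 2, 3)],
)

# Number the rows of each (site_id, user_id) group from newest to oldest,
# then keep only the newest row (rn = 1) of every group.
latest = conn.execute(
    """
    SELECT site_id, user_id, created_on
    FROM (
        SELECT sm.*,
               ROW_NUMBER() OVER (
                   PARTITION BY site_id, user_id
                   ORDER BY created_on DESC
               ) AS rn
        FROM site_memberships AS sm
    ) AS ranked
    WHERE rn = 1
    ORDER BY user_id, site_id
    """
).fetchall()

print(latest)
```

Because the full row is kept in the inner query, any additional columns of site_memberships can be selected in the outer query as well.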

Creating a Rank Column with Repeated Indexes

I want to output the following table:
User | Country | RANK
------------------------------
1 US 3
1 US 3
1 NZ 2
1 NZ 2
1 NZ 2
1 JP 1
2 US 2
2 US 2
2 US 2
2 CA 1
What I have is the 'User' and 'Country' columns and want to create the RANK column.
I tried to use the function rank() like
rank() over (partition by User, Country order by ct desc) where ct is just the time of the event since epoch but instead of giving some repeated numbers like 33 222 1, it ranks inside the partition, giving me 12 123 1.
I also tried row_number() with no success.
If I use rank() over (partition by User order by country desc) it works, but how can I guarantee that it also ranks by ct?
Any clues on how to do that?

You are quite vague about the schema of your data. But assuming you have data that looks like this:
User Country Unix_time(epoch)
1 US 1437888888
1 NZ 1437666666
2 US 1437777777
2 NZ 1435555555
I think this will work but I can't test as I don't have hive on my laptop.
select c.*, b.rank
from my_table c
left outer join
(select user
, country
, rank() over (partition by user order by unix_time desc) as rank
from
(select user, country, max(unix_time) as unix_time
from my_table group by user, country
) a
) b
on c.user=b.user and c.country=b.country
;
Basically I am selecting the maximum value for the time stamp associated with each user and country. This can then be ranked and joined to the original dataset.
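The aggregate, rank, then join-back idea can be tried without a Hive installation. The sketch below uses Python's sqlite3 module as a stand-in (an assumption; window functions need SQLite 3.25 or newer). The column names user_id and country_rank are chosen here to avoid keyword clashes, and the extra US row for user 1 is hypothetical, added only to show that repeated rows of a country share one rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (user_id INTEGER, country TEXT, unix_time INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "US", 1437888888),
        (1, "US", 1437000000),  # hypothetical extra event, not in the original sample
        (1, "NZ", 1437666666),
        (2, "US", 1437777777),
        (2, "NZ", 1435555555),
    ],
)

# Inner query a: latest event time per (user, country).
# Query b: rank each user's countries by that latest time, newest first.
# Outer join: attach the country-level rank to every original row.
ranked = conn.execute(
    """
    SELECT c.user_id, c.country, c.unix_time, b.country_rank
    FROM my_table AS c
    LEFT JOIN (
        SELECT user_id, country,
               RANK() OVER (PARTITION BY user_id ORDER BY unix_time DESC) AS country_rank
        FROM (
            SELECT user_id, country, MAX(unix_time) AS unix_time
            FROM my_table
            GROUP BY user_id, country
        ) AS a
    ) AS b
      ON c.user_id = b.user_id AND c.country = b.country
    ORDER BY c.user_id, b.country_rank, c.unix_time DESC
    """
).fetchall()

for row in ranked:
    print(row)
```

Both US rows of user 1 receive the same rank because the rank is computed once per user and country, on the aggregated rows, before it is joined back to the individual events.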
