How to convert a MySQL query to Hive

I have this table:
CREATE TABLE ip_logs (
`ip_address` VARCHAR(11),
`start_date` VARCHAR(11),
`end_date` VARCHAR(11),
`loc_id` INTEGER
);
INSERT INTO ip_logs
(`ip_address`,`start_date`,`end_date`, `loc_id`)
VALUES
('120.0.53.21','2020-01-03','2020-01-09', '5'),
('198.5.273.2','2020-01-10','2020-01-14', '4'),
('198.5.273.2','2020-01-10','2020-01-14', '4'),
('198.5.273.2','2020-01-10','2020-01-14', '4'),
('100.36.33.1','2020-02-01','2020-02-02', '4'),
('100.36.33.1','2020-02-01','2020-02-02', '4'),
('100.36.33.1','2020-02-01','2020-02-02', '4'),
('198.0.47.33','2020-02-22','2020-02-24', '2'),
('122.8.0.11', '2020-02-25','2020-02-30','4'),
('198.0.47.33','2020-03-10','2020-03-17', '2'),
('198.0.47.33','2020-03-10','2020-03-17', '2'),
('122.8.0.11', '2020-03-18','2020-03-23','4'),
('198.5.273.2','2020-03-04','2020-03-09', '3'),
('106.25.12.2','2020-03-24','2020-03-30', '1');
I use this query to select the most frequent ip address:
select (
select ip_address
from ip_logs t2
where t2.loc_id = t1.loc_id
group by ip_address
order by count(*) desc
limit 1)
from ip_logs t1
group by loc_id
This works in MySQL 8.0, but it does not work in Hive. I get this error:
cannot recognize input near 'select' 'ip_address' 'from' in expression specification
Expected output is:
loc_id | ip_address
5 120.0.53.21
4 198.5.273.2
2 198.0.47.33
3 198.5.273.2
1 106.25.12.2

You can try using the row_number() window function:
select * from
(
select ip_address,loc_id,count(*) as frequency,
row_number() over(partition by loc_id order by count(*) desc) as rn
from ip_logs group by ip_address,loc_id
)A where rn=1
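As a rough way to try the row_number() idea without a Hive cluster, here is a minimal sketch that runs the same pattern through Python's built-in sqlite3 module. SQLite standing in for Hive is an assumption (it needs SQLite 3.25 or later for window functions), and the count is computed in an inner query first, which is a slightly more conservative shape of the same technique. The names `counts`, `ranked` and `top_ip_by_loc` are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ip_logs (ip_address TEXT, start_date TEXT, end_date TEXT, loc_id INTEGER)"
)
rows = [
    ("120.0.53.21", "2020-01-03", "2020-01-09", 5),
    ("198.5.273.2", "2020-01-10", "2020-01-14", 4),
    ("198.5.273.2", "2020-01-10", "2020-01-14", 4),
    ("198.5.273.2", "2020-01-10", "2020-01-14", 4),
    ("100.36.33.1", "2020-02-01", "2020-02-02", 4),
    ("100.36.33.1", "2020-02-01", "2020-02-02", 4),
    ("100.36.33.1", "2020-02-01", "2020-02-02", 4),
    ("198.0.47.33", "2020-02-22", "2020-02-24", 2),
    ("122.8.0.11", "2020-02-25", "2020-02-30", 4),
    ("198.0.47.33", "2020-03-10", "2020-03-17", 2),
    ("198.0.47.33", "2020-03-10", "2020-03-17", 2),
    ("122.8.0.11", "2020-03-18", "2020-03-23", 4),
    ("198.5.273.2", "2020-03-04", "2020-03-09", 3),
    ("106.25.12.2", "2020-03-24", "2020-03-30", 1),
]
conn.executemany("INSERT INTO ip_logs VALUES (?, ?, ?, ?)", rows)

query = """
SELECT loc_id, ip_address
FROM (
    -- Rank each address within its location by how often it occurs.
    SELECT loc_id, ip_address,
           ROW_NUMBER() OVER (PARTITION BY loc_id ORDER BY frequency DESC) AS rn
    FROM (
        -- One row per (location, address) with its occurrence count.
        SELECT loc_id, ip_address, COUNT(*) AS frequency
        FROM ip_logs
        GROUP BY loc_id, ip_address
    ) counts
) ranked
WHERE rn = 1
"""
top_ip_by_loc = dict(conn.execute(query).fetchall())
for loc_id, ip in sorted(top_ip_by_loc.items()):
    print(loc_id, ip)
```

Note that in the sample data loc_id 4 has two addresses with three rows each, so which of them gets rn = 1 is arbitrary unless a tie-breaker column is added to the ORDER BY.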

Related

I need a query to get output as mentioned below

I have the data in the table as below.
cntrct_number status_cd registration_date
123 A 23-03-19
123 A 06-06-19
123 S 10-06-21
123 S 11-06-21
123 S 12-06-21
123 A 13-06-21
123 S 14-06-21
123 S 15-06-21
Now I want the two minimum dates where status_cd = 'S'.
The query should give the output below:
123 S 11-06-21
123 S 14-06-21
That is, each time the status changes, the query should return the first row immediately after the change.
You can use WHERE to filter the result by status, then order it by date, and finally limit the output to 2 rows:
select * from table where status_cd = 'S' order by registration_date limit 2;
What we're looking for is a qualification in a window function.
Qualification 1: the status code is S.
Qualification 2: the previous status code is not S.
create or replace transient table T1(cntrct_number int, status_cd string, registration_date date);
insert into T1 (cntrct_number, status_cd, registration_date) values
(123, 'A', to_date('23-03-19', 'DD-MM-YY')),
(123, 'A', to_date('06-06-19', 'DD-MM-YY')),
(123, 'S', to_date('10-06-21', 'DD-MM-YY')),
(123, 'S', to_date('11-06-21', 'DD-MM-YY')),
(123, 'S', to_date('12-06-21', 'DD-MM-YY')),
(123, 'A', to_date('13-06-21', 'DD-MM-YY')),
(123, 'S', to_date('14-06-21', 'DD-MM-YY')),
(123, 'S', to_date('15-06-21', 'DD-MM-YY'));
select cntrct_number
,status_cd
,registration_date
from T1
qualify STATUS_CD = 'S'
and lag(STATUS_CD) over (partition by cntrct_number order by registration_date) <> 'S'
;
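The two qualifications above can be tried out locally with a small sketch using Python's sqlite3 module. SQLite has no QUALIFY clause, so this version (an assumption-laden stand-in, not the original query) computes the previous status in a subquery and filters in an outer WHERE. The alias `prev_status`, the subquery name `with_prev`, and the ISO date strings are illustrative choices.

```python
import sqlite3

# In-memory SQLite database; needs SQLite 3.25+ for LAG() (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (cntrct_number INTEGER, status_cd TEXT, registration_date TEXT)"
)
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        (123, "A", "2019-03-23"),
        (123, "A", "2019-06-06"),
        (123, "S", "2021-06-10"),
        (123, "S", "2021-06-11"),
        (123, "S", "2021-06-12"),
        (123, "A", "2021-06-13"),
        (123, "S", "2021-06-14"),
        (123, "S", "2021-06-15"),
    ],
)

query = """
SELECT cntrct_number, status_cd, registration_date
FROM (
    -- Attach the status of the previous row (by date) within each contract.
    SELECT t1.*,
           LAG(status_cd) OVER (
               PARTITION BY cntrct_number ORDER BY registration_date
           ) AS prev_status
    FROM t1
) with_prev
WHERE status_cd = 'S'        -- qualification 1
  AND prev_status <> 'S'     -- qualification 2
ORDER BY registration_date
"""
first_s_rows = conn.execute(query).fetchall()
for row in first_s_rows:
    print(row)
```

On this sample data the rows that satisfy both qualifications are 2021-06-10 and 2021-06-14. A contract whose very first row is already 'S' has no previous row, so prev_status is NULL and the `<>` comparison filters it out; if such rows should be kept, the condition would need an explicit NULL check.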

Group by absorb NULL unless it's the only value

I'm trying to group by a primary column and a secondary column. I want to ignore NULL in the secondary column unless it's the only value.
CREATE TABLE #tempx1 ( Id INT, [Foo] VARCHAR(10), OtherKeyId INT );
INSERT INTO #tempx1 ([Id],[Foo],[OtherKeyId]) VALUES
(1, 'A', NULL),
(2, 'B', NULL),
(3, 'B', 1),
(4, 'C', NULL),
(5, 'C', 1),
(6, 'C', 2);
I'm trying to get output like
Foo OtherKeyId
A NULL
B 1
C 1
C 2
This question is similar, but takes the MAX of the column I want, so it ignores other non-NULL values and won't work.
I tried to work out something based on this question, but I don't quite understand what that query does and can't get my output to work
-- Doesn't include Foo='A', creates duplicates for 'B' and 'C'
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [Foo] ORDER BY [OtherKeyId]) rn1
FROM #tempx1
)
SELECT c1.[Foo], c1.[OtherKeyId], c1.rn1
FROM cte c1
INNER JOIN cte c2 ON c2.[OtherKeyId] = c1.[OtherKeyId] AND c2.rn1 = c1.rn1
This is for a modern SQL Server: Microsoft SQL Server 2019
You can use a GROUP BY with a HAVING clause like the one below:
SELECT [Foo],[OtherKeyId]
FROM #tempx1 t
GROUP BY [Foo],[OtherKeyId]
HAVING SUM(CASE WHEN [OtherKeyId] IS NULL THEN 0 END) IS NULL
OR ( SELECT COUNT(*) FROM #tempx1 WHERE [Foo] = t.[Foo] ) = 1
Hmmm . . . I think you want filtering:
select t.*
from #tempx1 t
where t.otherkeyid is not null or
not exists (select 1
from #tempx1 t2
where t2.foo = t.foo and t2.otherkeyid is not null
);
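To check the filtering approach against the sample rows, here is a minimal sketch using Python's sqlite3 module in place of SQL Server (an assumption). The temp table is renamed `tempx1` without the `#` prefix because SQLite does not use that syntax, and an ORDER BY is added only to make the printed result stable.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tempx1 (Id INTEGER, Foo TEXT, OtherKeyId INTEGER)")
conn.executemany(
    "INSERT INTO tempx1 VALUES (?, ?, ?)",
    [
        (1, "A", None),
        (2, "B", None),
        (3, "B", 1),
        (4, "C", None),
        (5, "C", 1),
        (6, "C", 2),
    ],
)

query = """
SELECT t.Foo, t.OtherKeyId
FROM tempx1 t
WHERE t.OtherKeyId IS NOT NULL
   -- Keep a NULL row only when its Foo has no non-NULL OtherKeyId at all.
   OR NOT EXISTS (
        SELECT 1
        FROM tempx1 t2
        WHERE t2.Foo = t.Foo AND t2.OtherKeyId IS NOT NULL
   )
ORDER BY t.Foo, t.OtherKeyId
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

This selects individual rows rather than groups, so if the real data can contain repeated (Foo, OtherKeyId) pairs, a DISTINCT or GROUP BY on top would probably be needed.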
My actual problem is a bit more complicated than presented here. I ended up using the idea from Barbaros Özhan's solution to count the number of items. This ends up with two inner queries on the data set, each with a different GROUP BY. I'm able to get the results I need on my real dataset using a query like the following:
SELECT
a.[Foo],
b.[OtherKeyId]
FROM (
SELECT
[Foo],
COUNT([OtherKeyId]) [C]
FROM #tempx1 t
GROUP BY [Foo]
) a
JOIN (
SELECT
[Foo],
[OtherKeyId]
FROM #tempx1 t
GROUP BY [Foo], [OtherKeyId]
) b ON b.[Foo] = a.[Foo]
WHERE
(b.[OtherKeyId] IS NULL AND a.[C] = 0)
OR (b.[OtherKeyId] IS NOT NULL AND a.[C] > 0)

Find the users having more than two elements and one of those elements must be A

I want to extract the users having more than two elements and one of those elements must be A.
This is my table:
CREATE TABLE #myTable(
ID_element nvarchar(30),
Element nvarchar(10),
ID_client nvarchar(20)
)
This is the data of my table:
INSERT INTO #myTable VALUES
(13 ,'A', 1),(14 ,'B', 1),(15 ,NULL, 1),(16 ,NULL, 1),
(17 ,NULL, 1),(18 ,NULL, 1),(19 ,NULL, 1),(7, 'A', 2),
(8, 'B', 2),(9, 'C', 2),(10 ,'D', 2),(11 ,'F', 2),
(12 ,'G', 2),(1, 'A', 3),(2, 'B', 3),(3, 'C', 3),
(4, 'D', 3),(5, 'F', 3),(6, 'G', 3),(20 ,'Z', 4),
(22 ,'R', 4),(23 ,'D', 4),(24 ,'F', 5),(25 ,'G', 5),
(21 ,'x', 5)
And this is my query:
Select Distinct ID_client
from #myTable
Group by ID_client
Having Count(Element) > 2
Add a CROSS APPLY to your query that keeps only the ID_client values that have element A:
SELECT m.ID_client
FROM #myTable m
CROSS APPLY (
SELECT ID_client
FROM #myTable
WHERE ID_client = m.ID_client
AND Element = 'A'
) s
GROUP BY m.ID_client
HAVING COUNT(DISTINCT m.Element) > 2
Output:
ID_client
2
3
I think this is what you are looking for:
SELECT * FROM
(SELECT *, RANK() OVER (PARTITION BY element ORDER by id_client) AS grouped FROM #myTable) t
WHERE grouped > 1
AND Element = 'A'
ORDER by t.element
which brings back
ID_element Element ID_client grouped
7 A 2 2
1 A 3 3
You can select the ID_client values which have an 'A' as an Element and join your table with the result of that:
SELECT m.ID_Client
FROM #myTable AS m
JOIN (
SELECT a.ID_Client FROM #myTable AS a
WHERE a.Element = 'A') AS filteredClients
ON m.ID_client = filteredClients.ID_client
GROUP BY m.ID_client
HAVING COUNT(m.Element) > 2
Outputs:
ID_Client
2
3
However, this is not necessarily the best way to do it: When should I use Cross Apply over Inner Join?
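For anyone who wants to experiment with the join-based version, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server (an assumption; the table is named `myTable` without the `#`). One deliberate difference from the query above: the subquery uses DISTINCT so that a client with several 'A' rows is not multiplied by the join, which would otherwise inflate the count.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (ID_element TEXT, Element TEXT, ID_client TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [
        ("13", "A", "1"), ("14", "B", "1"), ("15", None, "1"), ("16", None, "1"),
        ("17", None, "1"), ("18", None, "1"), ("19", None, "1"), ("7", "A", "2"),
        ("8", "B", "2"), ("9", "C", "2"), ("10", "D", "2"), ("11", "F", "2"),
        ("12", "G", "2"), ("1", "A", "3"), ("2", "B", "3"), ("3", "C", "3"),
        ("4", "D", "3"), ("5", "F", "3"), ("6", "G", "3"), ("20", "Z", "4"),
        ("22", "R", "4"), ("23", "D", "4"), ("24", "F", "5"), ("25", "G", "5"),
        ("21", "x", "5"),
    ],
)

query = """
SELECT m.ID_client
FROM myTable AS m
JOIN (
    -- Clients that have at least one 'A'; DISTINCT avoids duplicate join rows.
    SELECT DISTINCT ID_client
    FROM myTable
    WHERE Element = 'A'
) AS filteredClients
  ON m.ID_client = filteredClients.ID_client
GROUP BY m.ID_client
HAVING COUNT(m.Element) > 2   -- COUNT(column) ignores NULL elements
ORDER BY m.ID_client
"""
clients = [row[0] for row in conn.execute(query)]
print(clients)
```

Client 1 has an 'A' but only two non-NULL elements, so it is excluded; clients 4 and 5 have three elements each but no 'A'.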

SQL optimization for large data_sets

I have a nasty SQL performance issue. I need to execute a statement like:
SELECT *
FROM (SELECT /*+ FIRST_ROWS(26) */
a.*, ROWNUM rnum
FROM (SELECT *
FROM t1
WHERE t1_col1 = 'val1'
AND g_dom in ('1', '2', '3')
AND g_context IN ('3', '4', '5', '6')
AND i_col = 1
AND f_col in ('1', '2', '3', '4')
AND e_g IN (SELECT e_g
FROM t2
WHERE t2_col1 = 'val1'
AND g_context IN ('3', '4', '5', '6')
AND val like 'some val%')
ORDER BY order_id DESC) a)
WHERE rnum > 0;
Basically, we have table t1 (our data table) and t2 (our support values). There are 1 million records in t1 and 10 million in t2. Column g_context narrows our data sets, but val still matches something like 500k records. We need 25 rows ordered by order_id.
Is there any way to tell the inner statement
SELECT e_g FROM t2 WHERE t2_col1='val1' AND g_context IN('3','4','5','6' ) AND val like 'some val%'
to fetch only the 25 records that match the outer statement's criteria?
Why not move rownum and the hint into the inner query like so:
SELECT /*+ FIRST_ROWS(26) */ t1.*,row_number() over (order by order_id desc) rn
FROM t1
WHERE t1_col1 = 'val1'
AND g_dom in ('1', '2', '3')
AND g_context IN ('3', '4', '5', '6')
AND i_col = 1
AND f_col in ('1', '2', '3', '4')
AND e_g IN (SELECT e_g
FROM t2
WHERE t2_col1 = 'val1'
AND g_context IN ('3', '4', '5', '6')
AND val like 'some val%')
ORDER BY order_id DESC
To me, this extra subselect with the hint and rownum does not seem to make any sense.
And the where-clause should be "WHERE rnum < 26", shouldn't it?

SQL - ALL, Including all values

I have two tables:
create table xyz
(campaign_id varchar(10)
,account_number varchar)
Insert into xyz
values ( 'A', '1'), ('A', '5'), ('A', '7'), ('A', '9'), ('A', '10'),
( 'B', '2'), ('B', '3'),
( 'C', '1'), ('C', '2'), ('C', '3'), ('C', '5'), ('C', '13'), ('C', '15'),
('D', '2'), ('D', '9'), ('D', '10')
create table abc
(account_number varchar)
insert into abc
values ('1'), ('2'), ('3'), ('5')
Now, I want to write a query where all the four account_number 1, 2, 3, 5 are included in a Campaign_id.
The answer is C.
[My aim is to find the Campaign Code that includes account_number 1, 2, 3 & 5. This condition is only satisfied by campaign code C.]
I tried using IN and ALL, but they don't work. Could you please help?
I think what you are after is an inner join. I'm not sure from your question which way around you want your data. However, this should give you a good clue how to proceed and what keywords to look for in the documentation to go further.
SELECT a.*
FROM xyz a
INNER JOIN abc b ON b.account_number = a.account_number;
EDIT:
It seems I misunderstood the original question, sorry. To get what you want, you can just do:
SELECT campaign_id
FROM xyz
WHERE account_number IN ('1', '2', '3', '5')
GROUP BY campaign_id
HAVING COUNT(DISTINCT account_number) = 4;
This is called relational division if you want to investigate further.
SELECT campaign_id
FROM (
SELECT campaign_id, COUNT(*) AS c, total_accounts
FROM xyz
JOIN abc ON xyz.account_number = abc.account_number
CROSS JOIN (SELECT COUNT(*) AS total_accounts
FROM abc) AS x
GROUP BY campaign_id
HAVING c = total_accounts) AS subq
select xyz.campaign_id
from xyz
join abc
on xyz.account_number = abc.account_number
group by xyz.campaign_id
having count(xyz.campaign_id) =
(select count(account_number) from abc);
Caution: this is a T-SQL implementation.
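The relational-division idea from these answers can be sanity-checked with a short sketch in Python's sqlite3 module (SQLite as the engine is an assumption, and the variable name `campaigns` is illustrative). It joins xyz to abc and compares each campaign's matched accounts with the total number of accounts in abc; COUNT(DISTINCT ...) is used here so that a repeated account in xyz would not be counted twice.

```python
import sqlite3

# In-memory SQLite database used only to exercise the query (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (campaign_id TEXT, account_number TEXT)")
conn.execute("CREATE TABLE abc (account_number TEXT)")
conn.executemany(
    "INSERT INTO xyz VALUES (?, ?)",
    [
        ("A", "1"), ("A", "5"), ("A", "7"), ("A", "9"), ("A", "10"),
        ("B", "2"), ("B", "3"),
        ("C", "1"), ("C", "2"), ("C", "3"), ("C", "5"), ("C", "13"), ("C", "15"),
        ("D", "2"), ("D", "9"), ("D", "10"),
    ],
)
conn.executemany("INSERT INTO abc VALUES (?)", [("1",), ("2",), ("3",), ("5",)])

query = """
SELECT xyz.campaign_id
FROM xyz
JOIN abc ON xyz.account_number = abc.account_number
GROUP BY xyz.campaign_id
-- A campaign qualifies only if it matched every account listed in abc.
HAVING COUNT(DISTINCT xyz.account_number) = (SELECT COUNT(*) FROM abc)
ORDER BY xyz.campaign_id
"""
campaigns = [row[0] for row in conn.execute(query)]
print(campaigns)
```

Campaigns A and B each match only two of the four accounts and D matches one, so only C passes the HAVING test.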