Impala SQL with Multiple Count Distinct - Help Needed

We have been trying for the last several weeks to find a solution to an Impala SQL query problem.
We would appreciate any guidance or advice on this situation.
Below is the table for our requirement (image also attached).
We need to report counts for each endpoint_type.
In our case these are NAS and Remote Site. We need the following counts.
Count of devices - We need the distinct count of devices for each endpoint_type, i.e. the number of unique devices.
Values Expected
NAS - 6
Remote Site - 4
Count of Shares - This field is distinct combination of (device,datacenter,share)
Values Expected
NAS - 6
Remote Site - 5
Count of folders - This field is distinct combination of (device,datacenter,share,folder_name) where folder_type=regular
Values Expected
NAS - 3
Remote Site - 2
Count of Open Folders - This field is distinct combination of (device,datacenter,share,folder_name) where folder_type=regular and open_status = Open
Values Expected
NAS - 2
Remote Site - 1
Count of Remediated Folders - This field is distinct combination of (device,datacenter,share,folder_name) where folder_type=regular and open_status = Open and remediated = YES
Values Expected
NAS - 2
Remote Site - 0
Since Impala throws an error for multiple count distincts over different column combinations, we are not able to do this in a single query.
We even tried running multiple queries and joining them, but the results were not correct.
When we get a count, we also need to show the records that make up that count.
We appreciate your help.
Thanks,
Sudi
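
One possible direction, sketched here under several assumptions: give each metric its own derived table that first takes SELECT DISTINCT over the relevant column combination (with its filter) and then counts rows, so that no single query block needs more than one distinct aggregate. The inner SELECT DISTINCT also lists the records behind each count. The table name endpoints, the sample rows, and the helper endpoint_counts are hypothetical (the real table is only available as an image), only three of the five metrics are shown, and the SQL was run against SQLite through Python, not against Impala.

```python
import sqlite3

# Hypothetical sample rows; column order:
# endpoint_type, device, datacenter, share, folder_name, folder_type, open_status, remediated
ROWS = [
    ("NAS", "d1", "dc1", "s1", "f1", "regular", "Open", "YES"),
    ("NAS", "d1", "dc1", "s1", "f1", "regular", "Open", "YES"),  # exact duplicate row
    ("NAS", "d1", "dc1", "s2", "f2", "regular", "Closed", "NO"),
    ("NAS", "d2", "dc1", "s1", "f3", "system", "Open", "NO"),
    ("Remote Site", "d3", "dc2", "s1", "f1", "regular", "Closed", "NO"),
]

# One derived table per metric: de-duplicate first, then COUNT(*).
COUNTS_SQL = """
SELECT t.endpoint_type,
       COALESCE(d.n, 0) AS devices,
       COALESCE(s.n, 0) AS shares,
       COALESCE(o.n, 0) AS open_folders
FROM (SELECT DISTINCT endpoint_type FROM endpoints) AS t
LEFT JOIN (
    SELECT endpoint_type, COUNT(*) AS n
    FROM (SELECT DISTINCT endpoint_type, device FROM endpoints) AS x
    GROUP BY endpoint_type
) AS d ON d.endpoint_type = t.endpoint_type
LEFT JOIN (
    SELECT endpoint_type, COUNT(*) AS n
    FROM (SELECT DISTINCT endpoint_type, device, datacenter, share
          FROM endpoints) AS x
    GROUP BY endpoint_type
) AS s ON s.endpoint_type = t.endpoint_type
LEFT JOIN (
    SELECT endpoint_type, COUNT(*) AS n
    FROM (SELECT DISTINCT endpoint_type, device, datacenter, share, folder_name
          FROM endpoints
          WHERE folder_type = 'regular' AND open_status = 'Open') AS x
    GROUP BY endpoint_type
) AS o ON o.endpoint_type = t.endpoint_type
ORDER BY t.endpoint_type
"""


def endpoint_counts(rows):
    """Load rows into an in-memory table and return one tuple per endpoint_type."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE endpoints (endpoint_type TEXT, device TEXT, datacenter TEXT, "
        "share TEXT, folder_name TEXT, folder_type TEXT, open_status TEXT, remediated TEXT)"
    )
    con.executemany("INSERT INTO endpoints VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return con.execute(COUNTS_SQL).fetchall()


print(endpoint_counts(ROWS))
```

The LEFT JOINs from the list of endpoint types, together with COALESCE, are there so that a type with no matching rows still reports 0 instead of disappearing from the result. The remaining metrics would follow the same pattern with their own WHERE filters.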

Related

Removing SQL Rows from Query if two rows have an identical ID but differences in the columns

I'm currently stuck on a SQL issue (mainly because I can't find a way to google it, and my SQL skills do not suffice to solve it myself).
I'm working on a system where documents are edited. When the editing process is finished, users mark the document as solved. In the MSSQL database, the corresponding row is not updated; instead, a new row is inserted. Thus, every document that has been processed has (or rather, should have) multiple rows in the DB.
See the following situation:
ID  ID2  AnotherCondition  Steps  Process  Solved
1   1    yes               Three  ATAT     AF
2   2    yes               One    ATAT     FR
2   3    yes               One    ATAT     EG
2   4    yes               One    ATAT     AF
3   5    no                One    ABAT     AF
4   6    yes               One    ATAT     FR
5   7    no                One    AVAT     EG
6   8    yes               Two    SATT     FR
6   9    yes               Two    SATT     EG
6   10   yes               Two    SATT     AF
I need to select the rows which have not been processed yet. A "processed" document has "FR" in the "Solved" column. Sadly, other versions of the document exist in the DB, with other codes in the "Solved" column.
Now: if there is a row which has "FR" in the "Solved" column, I need to remove every row with the same ID from my SELECT statement as well. Is this doable?
To achieve this, I have to remove the rows with the IDs 2, 4 (which has only a single row, I guess because the system sadly isn't too reliable) and 6 from my SELECT statement. Is this possible in general?
What I could do is filter out the duplicates afterwards, in python/js/whatever. But I am curious whether I can "remove" these rows directly in the SQL statement as well.
To rephrase it once more: how can I write a SELECT statement which returns only (in this example) the rows with the IDs 1, 3 and 5?

If you need to delete all rows for every id that has no row with "Solved = 'no'", you can use a DELETE statement whose subquery excludes all "id" values that have at least one "Solved = 'no'" row.
DELETE FROM tab
WHERE id NOT IN (SELECT id FROM tab WHERE Solved1 = 'no');
Edit. If you need to use a SELECT statement, you can simply reverse the condition in the subquery:
SELECT *
FROM tab
WHERE id NOT IN (SELECT id FROM tab WHERE Solved1 = 'yes');

I'm not sure I understand your question correctly:
...every document that has been processed has [...] multiple rows in the DB
I need to find out which documents have not been processed yet
So it seems you need to find unique documents with no versions, this could be done using a GROUP BY with a HAVING clause:
SELECT
Id
FROM dbo.TableName
GROUP BY Id
HAVING COUNT(*) = 1
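
For the goal as stated in the question (drop every ID that has at least one row with "FR" in the "Solved" column, leaving IDs 1, 3 and 5), an anti-join with NOT EXISTS is one way to write it. This is only a sketch: the table name documents and the helper unprocessed_rows are made up for illustration, the rows are copied from the question's table, and the query is run on SQLite through Python instead of MSSQL, although the statement itself uses only standard SQL.

```python
import sqlite3

# Rows copied from the question: (ID, ID2, AnotherCondition, Steps, Process, Solved)
DOCUMENTS = [
    (1, 1, "yes", "Three", "ATAT", "AF"),
    (2, 2, "yes", "One", "ATAT", "FR"),
    (2, 3, "yes", "One", "ATAT", "EG"),
    (2, 4, "yes", "One", "ATAT", "AF"),
    (3, 5, "no", "One", "ABAT", "AF"),
    (4, 6, "yes", "One", "ATAT", "FR"),
    (5, 7, "no", "One", "AVAT", "EG"),
    (6, 8, "yes", "Two", "SATT", "FR"),
    (6, 9, "yes", "Two", "SATT", "EG"),
    (6, 10, "yes", "Two", "SATT", "AF"),
]

# Keep a row only if no row with the same id carries Solved = 'FR'.
UNPROCESSED_SQL = """
SELECT d.id, d.id2, d.solved
FROM documents AS d
WHERE NOT EXISTS (
    SELECT 1
    FROM documents AS f
    WHERE f.id = d.id
      AND f.solved = 'FR'
)
ORDER BY d.id2
"""


def unprocessed_rows(rows):
    """Return (id, id2, solved) for rows whose document was never marked 'FR'."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE documents (id INTEGER, id2 INTEGER, another_condition TEXT, "
        "steps TEXT, process TEXT, solved TEXT)"
    )
    con.executemany("INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?)", rows)
    return con.execute(UNPROCESSED_SQL).fetchall()


print(unprocessed_rows(DOCUMENTS))
```

NOT EXISTS is used here instead of NOT IN because NOT IN returns no rows at all if the subquery ever yields a NULL id, which is easy to overlook.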

How to check changes in column values?

I need to try to check some device IDs for work. These are values (15 characters, random string of numbers+letters) that mostly remain constant for users. However, every now and then these deviceIDs will change. And I'm trying to detect when they do change. Is there a way to write this kind of a dynamic query with SQL? Say, perhaps with a CASE statement?
user  device           date
1     23127dssds1272d  10-11
1     23127dssds1272d  10-11
1     23127dssds1272d  10-12
1     23127dssds1272d  10-12
1     04623jqdnq3000x  10-12

Count distinct device by id having count > 1?

Consider the approach below:
select *
from your_table
where true
qualify device != lag(device, 1, '') over(partition by user order by date)
Applied to the sample data in your question, the output shows that on 10-11 the first 'change' (the initial assignment) happened for user=1, and then on 10-12 the device changed.
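
On engines without QUALIFY, the same LAG comparison can go into a subquery and be filtered in the outer WHERE. The sketch below runs that variant on SQLite through Python (window functions need SQLite 3.25 or later). The table name device_log, the renamed columns user_id and day, and the helper device_changes are illustrative. The seq column is an added assumption: the sample has several rows with the same date, so ordering by date alone would leave the LAG result undefined for those ties.

```python
import sqlite3

# Sample rows from the question plus a hypothetical seq column as a tie-breaker:
# (seq, user_id, device, day)
DEVICE_LOG = [
    (1, 1, "23127dssds1272d", "10-11"),
    (2, 1, "23127dssds1272d", "10-11"),
    (3, 1, "23127dssds1272d", "10-12"),
    (4, 1, "23127dssds1272d", "10-12"),
    (5, 1, "04623jqdnq3000x", "10-12"),
]

# Compare each row's device with the previous row's device for the same user.
# The default '' makes the very first row of a user count as a change.
CHANGES_SQL = """
SELECT user_id, device, day
FROM (
    SELECT seq, user_id, device, day,
           LAG(device, 1, '') OVER (
               PARTITION BY user_id ORDER BY day, seq
           ) AS prev_device
    FROM device_log
) AS with_prev
WHERE device != prev_device
ORDER BY seq
"""


def device_changes(rows):
    """Return the rows at which a user's device differs from the previous row."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE device_log (seq INTEGER, user_id INTEGER, device TEXT, day TEXT)"
    )
    con.executemany("INSERT INTO device_log VALUES (?, ?, ?, ?)", rows)
    return con.execute(CHANGES_SQL).fetchall()


print(device_changes(DEVICE_LOG))
```

To report only real changes and skip the initial assignment, the default argument of LAG could be dropped and the filter extended with prev_device IS NOT NULL.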

Join multiple tables in Microsoft SQL Server where there is only one line match from table 1 and multiple lines from table 2 and 3

I am stuck on something which I have never used in my 10 years of SQL. I thought it would be useful if there was some way of doing this. Firstly, I am running SQL Server Express (latest free version) on Windows. To talk to the database I am using SSMS.
There are three tables/queries.
1 table (A) has one data value I want to pull through.
2 tables (B)/(C) have multiple values.
Column common to all tables is CAMPAIGN NAME
Column common to (B)/(C) is PRODUCT NAME
This is an example of the data:
OUTPUT GOAL
I have tried the following:
UNION ALL (but this does not help when I want to calculate AMOUNT - MARKETING - TOTAL INVESTMENT).
I tried PARTITION (but I simply could not get it to work).
If I use joins, it brings through a head count / total investment and marketing cost per product, which when using SUM brings through the incorrect values for head count / total investment and marketing cost vs total amount, quantity.
I tried splitting the costs based on Quantity / Total Quantity or Amount / Total Amount, but the cost associated with the product is not correct or directly relating to the product this way.
Am I trying to do something impossible, or is there a way to do this in SQL?

The following comes pretty close to what you want:
select . . . -- select the columns you want here
from a
join b
on b.campaign_name = a.campaign_name
join c
on c.campaign_name = b.campaign_name and
c.product_name = b.product_name;
This produces a result set with a separate row for each campaign/product.

Find duplicates in Select statement after an if check

I am working on a project that keeps a track of repaired cell phones.
In the select statement, I would like to find the duplicate IMEI numbers and check whether the AddedDate values of the duplicates are less than 30 days apart. In other words, the select should list all the phones, including those with duplicated IMEI numbers, if the AddedDate values are more than 30 days apart.
I hope I described it clearly enough. Thank you.
Additional notes:
I have tried it by including groupBy under a sub-select which did find the duplicates, but I wasn't able to implement an if condition. Instead, I was going to place all duplicates into a dynamic table and then use a select statement against this table. Before doing so, I thought of posting my question here.
For example DB_Phones has the following rows
ID - AddedDate - IMEI
1 - 01.10.2012 - 123456789012345
2 - 15.10.2012 - 987654321012345
3 - 20.10.2012 - 123456789012345
Based on the table above, I would like to list only the second row (ID# 2) because the last duplicate (ID# 3) wasn't added 30 days after the row with the ID# 1. If rows were as below:
ID - AddedDate - IMEI
1 - 01.10.2012 - 123456789012345
2 - 15.10.2012 - 987654321012345
3 - 20.10.2012 - 123456789012345
4 - 21.11.2012 - 123456789012345
Then the second and fourth row should be returned. I need to return just one of the duplicates (last one) if the 30 day condition is met.
I hope it makes more sense now. Thanks again.

A guess at what you're after:
SELECT
r.*,
(SELECT COUNT(*) FROM Repairs r2 WHERE r.IMEI = r2.IMEI
AND r.ID != r2.ID) as NumberOfAllDuplicates,
(SELECT COUNT(*) FROM Repairs r2 WHERE r.IMEI = r2.IMEI
AND ABS(DATEDIFF(day, r.AddedDate, r2.AddedDate)) < 30
AND r.ID != r2.ID) as NumberOfNearDuplicates
FROM
Repairs r
This depends on having an ID field, and everything existing in one table. With the correlated sub queries, it may not be very fast on long data.
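
Reading the two examples in the question literally, the rule seems to be: per IMEI, look only at the latest row, and return it if the IMEI has no earlier row or if the previous row is more than 30 days older. The sketch below implements that reading with window functions. It is an interpretation, not a confirmed requirement, and it runs on SQLite through Python instead of SQL Server. The dates are rewritten as ISO strings so that SQLite's julianday() can parse them, and the helper name reportable_ids is made up.

```python
import sqlite3

# Rows from the question's second example: (ID, AddedDate as ISO text, IMEI)
REPAIRS = [
    (1, "2012-10-01", "123456789012345"),
    (2, "2012-10-15", "987654321012345"),
    (3, "2012-10-20", "123456789012345"),
    (4, "2012-11-21", "123456789012345"),
]

# rn = 1 marks the latest row per IMEI; prev_date is the row just before it.
LATEST_SQL = """
SELECT id
FROM (
    SELECT id, added_date,
           LAG(added_date) OVER (
               PARTITION BY imei ORDER BY added_date, id
           ) AS prev_date,
           ROW_NUMBER() OVER (
               PARTITION BY imei ORDER BY added_date DESC, id DESC
           ) AS rn
    FROM repairs
) AS ranked
WHERE rn = 1
  AND (prev_date IS NULL
       OR julianday(added_date) - julianday(prev_date) > 30)
ORDER BY id
"""


def reportable_ids(rows):
    """Return IDs of the latest row per IMEI that is not a 'near' duplicate."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE repairs (id INTEGER, added_date TEXT, imei TEXT)")
    con.executemany("INSERT INTO repairs VALUES (?, ?, ?)", rows)
    return [row[0] for row in con.execute(LATEST_SQL)]


print(reportable_ids(REPAIRS[:3]))  # first example from the question
print(reportable_ids(REPAIRS))      # second example from the question
```

On SQL Server the day difference would be written with DATEDIFF(day, prev_date, added_date), as in the answer above, in place of the julianday() subtraction.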

Looking for an SQL statement which groups by type

First, I was pretty lost trying to give this question a correct title.
I'm working on a system which allows me to find specific networking devices. A network device (called "system" in my example) has a number of ports, where each port can have a specific configuration. An example would be: Return all devices which have at least 2 ports of type 100BASE-TX and at least 1 port of 1000BASE-TX.
Here's my example table which is named "ports":
system port type
1 1 10BASE-T
1 1 100BASE-TX
1 1 1000BASE-TX
1 2 10BASE-T
1 2 100BASE-TX
1 2 1000BASE-TX
1 3 10BASE-T
1 3 100BASE-TX
1 3 1000BASE-TX
2 1 100BASE-TX
2 2 100BASE-TX
2 3 100BASE-TX
Column descriptions:
"system" is the ID of the system which contains the ports
"port" is the ID of the port
"type" is the type which that single port can have
I'm pretty lost here, and I don't ask for a complete query, maybe some hints are enough for me to figure out the rest. I already tried to join the table with itself to retrieve all possible port combinations, but from that point I was lost again.
Here's my pseudo-code:
SELECT system FROM ports WHERE (number-of-possible-100base-tx-ports >= 2 AND number-of-possible-1000base-tx-ports >= 1)
Here's my expected result:
system
1
It is important to know that a port can be either of one or another type. Basically I want the user to ask: "List all devices which support 2 100BASE-TX ports and at least 1 1000BASE-TX port at the same time". For example, the following pseudo-sql should not return any results:
SELECT system FROM ports WHERE (number-of-possible-100base-tx-ports >= 2 AND number-of-possible-1000base-tx-ports >= 2)
This query shouldn't return any result since no device has more than three ports overall.
EDIT
Here's another pseudo-SQL which represent the question better:
SELECT system FROM ports WHERE (at-least-1-type = 1000BASE-TX AND at-least-2-other-types = 100BASE-TX) AND portid-from-type-1000BASE-TX <> portid-from-type-100BASE-TX
EDIT #2
After one night, I realized that it might not be possible using plain SQL. What I need would be an intermediate table containing all possible configurations per system, and I believe that table would be quite huge. Given the example table above, I would already have 27 different combinations for system 1; regular networking devices have 12, 24 or 48 ports and storing all combinations in a database wouldn't be very efficient. I have to think of a programmatic way to solve this problem.
Thanks in advance!
Timo

I've had a bash at this using SQLite and this query seems to work ok for the limited test data I've tested it against.
select sys as system from (
select a.sys, count(distinct a.port) as want_a, count(distinct b.port) as want_b
from test a left join test b
on a.sys=b.sys and a.port<>b.port and a.type<>b.type
where
a.type='$type_a'
and (b.type='$type_b' or b.type is null)
and a.sys in (
select sys from test group by sys having count(distinct port) >= $want_a+$want_b
)
group by a.sys
having want_a >= $want_a and want_b >= $want_b
) z;
Where $want_a is the count of ports for $type_a and $want_b is the count for $type_b. So your initial query up there has want_a=2, type_a='100BASE-TX', want_b=1, type_b='1000BASE-TX'.
In the gist, the first file is mysql.sh (test driver script, ./mysql.sh < test.txt), the second is test.txt (test data), the third is gotest.sh (sqlite3 driver script, ./gotest.sh < test.txt), and the fourth is the SQL. All tests PASS in mysql and sqlite3, so that's promising.

I think you need to clarify this requirement a bit further:
Return all devices which have at least
2 ports of type 100BASE-TX and at
least 1 port of 1000BASE-TX.
Since a single port can have more than one type, would this device satisfy the query or not?
port type
1 100BASE-TX
1 1000BASE-TX
2 100BASE-TX
Taking your requirement literally, I think this device qualifies, but I suspect what you really want is a device which can support 2 100BASE-TX and 1 1000BASE-TX connections at the same time, so would need to have at least 3 ports.

The answer here is to normalize the data.
Table 1:
System ID - key
Description
Table 2:
Port Type ID - key
Description
Table 3:
System ID - Key
Port ID - Key
Port Type
Select count(*), port_type from table_1 a, table_3 c
where a.system_id=c.system_id
group by port_type
having count(*) > 2
I hope that gets you close.

Well, first I might question this table structure... but from your description of the system you're working on, it may be difficult to change...
I would suggest a query with two sub-queries like this:
SELECT *
FROM
(
SELECT COUNT(port) AS Port1Count, system
FROM ports
WHERE type = '100BASE-TX'
GROUP BY system
HAVING COUNT(port) >= 2
) AS A
INNER JOIN
(
SELECT COUNT(port) AS Port2Count, system
FROM ports
WHERE type = '1000BASE-TX'
GROUP BY system
HAVING COUNT(port) >= 1
) AS B ON A.system = B.system
This may not be EXACT depending on the flavor of SQL and I may have some syntax wrong here or there since I didn't actually try building your table. Hope it helps!
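
For exactly two requested types, the "at the same time" requirement does not seem to need a table of all combinations. Each port can serve at most one request, so by Hall's marriage condition a system qualifies if it has enough ports capable of type A, enough capable of type B, and enough ports capable of A or B to cover both requests together. The sketch below expresses those three checks as conditional distinct counts in a single GROUP BY. It runs on SQLite through Python, uses system_id as a stand-in for the question's "system" column, assumes the two types are different, and the helper matching_systems is made up. With three or more requested types, every subset of types would need its own condition, so this is not a general solution.

```python
import sqlite3

# The question's "ports" table: (system_id, port, type)
PORTS = [
    (1, port, port_type)
    for port in (1, 2, 3)
    for port_type in ("10BASE-T", "100BASE-TX", "1000BASE-TX")
] + [(2, port, "100BASE-TX") for port in (1, 2, 3)]

# Three checks per system: enough A-capable ports, enough B-capable ports,
# and enough ports capable of A or B to serve both requests simultaneously.
MATCH_SQL = """
SELECT system_id
FROM ports
GROUP BY system_id
HAVING COUNT(DISTINCT CASE WHEN type = :type_a THEN port END) >= :want_a
   AND COUNT(DISTINCT CASE WHEN type = :type_b THEN port END) >= :want_b
   AND COUNT(DISTINCT CASE WHEN type IN (:type_a, :type_b) THEN port END)
       >= :want_a + :want_b
ORDER BY system_id
"""


def matching_systems(rows, type_a, want_a, type_b, want_b):
    """Return system ids that can serve want_a + want_b ports at the same time."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ports (system_id INTEGER, port INTEGER, type TEXT)")
    con.executemany("INSERT INTO ports VALUES (?, ?, ?)", rows)
    params = {"type_a": type_a, "want_a": want_a, "type_b": type_b, "want_b": want_b}
    return [row[0] for row in con.execute(MATCH_SQL, params)]


print(matching_systems(PORTS, "100BASE-TX", 2, "1000BASE-TX", 1))
print(matching_systems(PORTS, "100BASE-TX", 2, "1000BASE-TX", 2))
```

The first call corresponds to the question's example that should return system 1, and the second to the example that should return nothing because no device has more than three ports.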
