SQL groupby having count distinct - sql

I've got a postgres database that contains a table with IP, User, and time fields. I need a query to give me the complete set of all IPs that have only a single user active on them over a defined time period (i.e. I need to filter out IPs with multiple or no users, and should only have one row per IP). The user field contains some null values, that I can filter out. I'm using Pandas' read_sql() method to get a dataframe directly.
I can get the full dataframe of data from the defined time period easily with:
SELECT ip, user FROM table WHERE user IS NOT NULL AND time >= start AND time <= end
I can then take this data and wrangle the information I need out of it easily using pandas with groupby and filter operations. However, I would like to be able to get what I need using a single SQL query. Unfortunately, my SQL chops ain't too hot. My first attempt below isn't great; the dataframe I end up with isn't the same as when I create the dataframe manually using the original query above and some pandas wrangling.
SELECT DISTINCT ip, user FROM table WHERE user IS NOT NULL AND ip IN (SELECT ip FROM table WHERE user IS NOT NULL AND time >= start AND time <= end GROUP BY ip HAVING COUNT(DISTINCT user) = 1)
Can anyone point me in the right direction here? Thanks.
edit: I neglected to mention that there are multiple entries for each user/ip combination. The source is network authentication traffic, and users authenticate on IPs very frequently.
Sample table head:
---------------------------------
ip | user | time
---------------------------------
172.18.0.0 | jbloggs | 1531987000
172.18.0.0 | jbloggs | 1531987100
172.18.0.1 | jsmith | 1531987200
172.18.0.1 | jbloggs | 1531987300
172.18.0.2 | odin | 1531987400
If I were to query this example table for the time range 1531987000 to 1531987400 I would like the following output:
---------------------
ip | user
--------------------
172.18.0.0 | jbloggs
172.18.0.2 | odin

This should work
SELECT ip
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
HAVING COUNT(ip) = 1
Explanation:
SELECT ip FROM table WHERE user IS NOT NULL AND time >= start AND time <= end - filtering out the nulls and time periods
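...GROUP BY ip HAVING COUNT(ip) = 1 - If an ip has multiple users, the count (number of rows with that ip) would be greater than 1. Note that this only holds if each ip/user pair appears once in the time range; an ip whose single user appears on several rows would be filtered out as well.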

If by "single user" you mean that there could be multiple rows with only one user, then:
SELECT ip
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
HAVING MIN(user) = MAX(user) AND COUNT(user) = COUNT(*);

I have figured out a query that gets me what I want:
SELECT DISTINCT ip, user
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end AND ip IN
(SELECT ip FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip HAVING COUNT(DISTINCT user) = 1)
Explanation:
The inner select gets me all IPs that have only one user across the specified time range. I then need to select the distinct ip/user pairs from the main table where the IPs are in the nested select.
It seems messy that I have to do the same filtering (of time range and non-null user fields) twice though, is there a better way to do this?
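One way to avoid repeating the filter is to drop the IN subquery altogether: once a group is known to contain exactly one distinct user, MIN(user) simply returns that user. Below is a minimal sketch of that idea, run against the sample rows from the question using Python's built-in sqlite3 module instead of Postgres. The table name auth_log is made up for the example, and the column is written as "user" in quotes because user is a reserved word in PostgreSQL.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
# "auth_log" is a hypothetical table name used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE auth_log (ip TEXT, "user" TEXT, time INTEGER)')
conn.executemany(
    "INSERT INTO auth_log VALUES (?, ?, ?)",
    [
        ("172.18.0.0", "jbloggs", 1531987000),
        ("172.18.0.0", "jbloggs", 1531987100),
        ("172.18.0.1", "jsmith", 1531987200),
        ("172.18.0.1", "jbloggs", 1531987300),
        ("172.18.0.2", "odin", 1531987400),
    ],
)

# Filter once, group by ip, keep groups with exactly one distinct user.
# MIN("user") is safe here because every surviving group has a single user.
query = """
    SELECT ip, MIN("user") AS "user"
    FROM auth_log
    WHERE "user" IS NOT NULL AND time >= ? AND time <= ?
    GROUP BY ip
    HAVING COUNT(DISTINCT "user") = 1
    ORDER BY ip
"""
single_user_ips = conn.execute(query, (1531987000, 1531987400)).fetchall()
print(single_user_ips)
```

The query text itself could be handed to read_sql() as well, with the placeholder style adjusted for whichever driver is in use.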

Related

Select specific grouped element

Hi, I would like to ask for help because I cannot find a solution.
For example I have data like that:
number | date | user
10 | 2022-07-01 | A
15 | 2022-07-08 | A
9 | 2022-07-10 | A
Right now I need to get the number for a user where the date is the newest one.
In this case I need to get the value 9.
Of course I have many different users; this is only to illustrate the issue.
Now I would like to select all unique users, each with the number whose date is the newest one.
Is it possible to do it in one query?
I like to teach these problems by breaking it down into smaller problems.
First, write a query that tells you which is the "newer one" for each user.
SELECT user, max(date) as newer_one
FROM tbl
GROUP BY user
Next, you can join that result back to the original data. My favorite style is called a CTE which makes things readable and easier to debug. Like so,
WITH newest AS (
SELECT user, max(date) as newer_one
FROM tbl
GROUP BY user
)
SELECT original.*
FROM tbl AS original
INNER JOIN newest
ON original.user = newest.user
AND original.date = newest.newer_one
Some RDBMSs don't support CTEs, but you can inline the same query as a derived table in the FROM clause. That is sometimes harder to read, but it will work basically anywhere.
SELECT original.*
FROM tbl AS original
INNER JOIN
(
SELECT user, max(date) as newer_one
FROM tbl
GROUP BY user
) as newest
ON original.user = newest.user
AND original.date = newest.newer_one
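To see the join-back approach in action, here is a small sketch that runs the CTE version on the sample rows from the question using Python's built-in sqlite3 module. The rows for user B are invented for illustration, and "user" is quoted only to stay safe on engines where it is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE tbl (number INTEGER, date TEXT, "user" TEXT)')
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (10, "2022-07-01", "A"),
        (15, "2022-07-08", "A"),
        (9, "2022-07-10", "A"),
        # User B is hypothetical, added to show more than one group.
        (7, "2022-07-02", "B"),
        (4, "2022-07-03", "B"),
    ],
)

# Step 1 (CTE): newest date per user. Step 2: join back to fetch the number.
# ISO-formatted date strings sort correctly as plain text.
query = """
    WITH newest AS (
        SELECT "user", MAX(date) AS newer_one
        FROM tbl
        GROUP BY "user"
    )
    SELECT original."user", original.number
    FROM tbl AS original
    INNER JOIN newest
        ON original."user" = newest."user"
       AND original.date = newest.newer_one
    ORDER BY original."user"
"""
latest = conn.execute(query).fetchall()
print(latest)
```

If two rows for the same user share the newest date, both come back; whether that is wanted depends on the data.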

How can I generate a list while ignoring records that have a date that does not fit into the range I am looking for?

I am using Microsoft SQL Server, I currently have a table with records for accounts. These master accounts can have several sub accounts linked to them. For example, master account XXX can have sub account XXXA and XXXB and... XXXN and so on and so forth.
These sub accounts can be opened and added to the master account XXX across time, so at different points in time. When a new sub account is opened, it also opens a new master account. From that point on, other sub accounts can be added to that master account number.
I have a column with the account opening dates. These dates are linked to when the sub accounts are opened.
I am trying to generate a list of master accounts (not sub accounts), that were opened between 2018-11-01 and 2019-02-15. However, I only want to include new MASTER ACCOUNTS, therefore ignoring any master accounts that have an account opening date prior to 2018-11-01.
The issue I am having is master accounts that are showing up in my generated list because they have sub accounts that have been added to them during the date ranges I am looking for.
I have tried using the MIN function inside HAVING on my dates. I have checked other Stack Overflow threads for a solution as well.
SELECT master_accounts, accountopendate, accountclosedate
FROM accounts
GROUP BY master_accounts, accountopendate, accountclosedate
HAVING MIN(accountopendate) BETWEEN '2018-11-01' AND '2019-02-15';
It gave me a list of the master accounts, however upon doing some QA, I find some master accounts in the list, that have been opened prior to 2018-11-01.
I would like a list of master accounts with the oldest account opening date being 2018-11-01, ignoring all the master accounts with account opening dates prior to 2018-11-01.
EXPECTED RESULT:
+-----------------+-----------------+------------------+
| master_accounts | accountopendate | accountclosedate |
+-----------------+-----------------+------------------+
| XXX | 2018-11-01 | NULL |
| ZZZ | 2018-12-01 | NULL |
| YYY | 2019-02-01 | NULL |
+-----------------+-----------------+------------------+
This should work, assuming the earliest opening date is always going to include the master account number.
First, isolate the account numbers and the initial opening date, then join that result set to your base table. I used a CTE, but a sub-query would accomplish the same thing.
Using a CTE:
WITH masterOpen AS
(
SELECT
master_accounts
,MIN(accountopendate) AS openDate
FROM
dbo.accounts
GROUP BY
master_accounts
)
SELECT
a.master_accounts
,a.accountopendate
,a.accountclosedate
FROM
dbo.accounts AS a
JOIN
masterOpen AS mo
ON
mo.master_accounts = a.master_accounts
AND
mo.openDate = a.accountopendate
AND
mo.openDate >= '2018-11-01'
AND
mo.openDate <= '2019-02-15';
Sub-query instead:
SELECT
a.master_accounts
,a.accountopendate
,a.accountclosedate
FROM
(
SELECT
master_accounts
,MIN(accountopendate) AS openDate
FROM
dbo.accounts
GROUP BY
master_accounts
) AS mo
JOIN
dbo.accounts AS a
ON
mo.master_accounts = a.master_accounts
AND
mo.openDate = a.accountopendate
AND
mo.openDate >= '2018-11-01'
AND
mo.openDate <= '2019-02-15';
The date parameters could also be broken out into a WHERE clause if you prefer, but with an INNER JOIN it will yield the same results. For current versions of the SQL engine, it's more a matter of preference than performance.
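For what it's worth, the original HAVING MIN(...) attempt fails mainly because accountopendate is part of the GROUP BY, so every group holds a single date. A variant that groups by master account alone can be sketched as follows, here through Python's built-in sqlite3 module rather than SQL Server. All rows are invented for the example, and this version returns only the first opening date, not accountclosedate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts "
    "(master_accounts TEXT, accountopendate TEXT, accountclosedate TEXT)"
)
# Hypothetical rows: one row per sub account, tagged with its master account.
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        ("XXX", "2018-11-01", None),
        ("XXX", "2019-01-10", None),  # later sub account on a new master
        ("OLD", "2017-05-01", None),  # master opened before the range
        ("OLD", "2018-12-15", None),  # sub account added inside the range
        ("ZZZ", "2018-12-01", None),
        ("YYY", "2019-02-01", None),
        ("LATE", "2019-03-01", None),  # opened after the range
    ],
)

# Group by master account only, so MIN() sees every sub account's date.
query = """
    SELECT master_accounts, MIN(accountopendate) AS first_open
    FROM accounts
    GROUP BY master_accounts
    HAVING MIN(accountopendate) BETWEEN '2018-11-01' AND '2019-02-15'
    ORDER BY first_open
"""
new_masters = conn.execute(query).fetchall()
print(new_masters)
```

The OLD master is dropped even though one of its sub accounts falls inside the range, which is the behaviour the question asks for.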
Why not just use a filter?
SELECT distinct master_accounts, accountopendate, accountclosedate
FROM accounts where accountopendate>='2018-11-01' AND accountopendate<='2019-02-15'

SQL statement that retrieves formulas for two different dates in db

All,
I have three tables in total. The first table, 'rollup1', contains the number of views and number of clicks for a campaign, as well as a one-up number for the day field (the largest number in the column represents the current date). A second table, 'rollup2', contains the earnings for the campaign. It also contains the same one-up number for the day field. The third table, 'campaigns', contains the ID/names for the campaigns. campaigns.id = rollup1.id = rollup2.id and rollup1.day = rollup2.day
I want to generate an SQL query that lists the campaign id, name, specific calculated value from yesterday, and specific calculated value from today. The specific calculated value I'm looking for is (earnings/clicks)*1000.
The results will look like:
id | name | yesterday | today
a | Campaign1 | $0.05 | $0.010
I think I can use case statements, but I can't seem to get it correct. Here's what I have so far. It calculates the formula for yesterday, but not the one for today. I need these to be side by side.
select campaigns.id, campaigns.name, rollup1.views,rollup1.clicks,rollup2.costs,round((rollup2.costs/rollup1.views)*1000,2) as yesterday
from campaigns,rollup1,rollup2
where campaigns.id = rollup1.campaign_id and campaigns.id = rollup2.campaign_id
and rollup1.dayperiod = rollup2.dayperiod
and rollup1.dayperiod = (SELECT (max(rollup1.dayperiod) -1) FROM rollup1)
Thanks for any help you can provide.
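One possible way to get both values side by side is conditional aggregation: select the last two day periods, then use one CASE expression per output column. The sketch below runs that idea through Python's built-in sqlite3 module. The column names follow the query above, the sample figures are invented, and it assumes the intended formula is (costs / clicks) * 1000 as described in the text (the query above divides by views instead).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE campaigns (id TEXT, name TEXT);
    CREATE TABLE rollup1 (campaign_id TEXT, dayperiod INTEGER,
                          views INTEGER, clicks INTEGER);
    CREATE TABLE rollup2 (campaign_id TEXT, dayperiod INTEGER, costs REAL);
    -- Invented sample data: dayperiod 3 is "today", 2 is "yesterday".
    INSERT INTO campaigns VALUES ('a', 'Campaign1');
    INSERT INTO rollup1 VALUES ('a', 1, 900, 50), ('a', 2, 1000, 100),
                               ('a', 3, 1200, 200);
    INSERT INTO rollup2 VALUES ('a', 1, 1.0), ('a', 2, 0.5), ('a', 3, 2.0);
    """
)

# Each CASE yields a value only for its own day; MAX() collapses the two
# rows per campaign into a single row with both columns filled.
query = """
    SELECT c.id, c.name,
           MAX(CASE WHEN r1.dayperiod = m.today - 1
                    THEN ROUND((r2.costs / r1.clicks) * 1000, 2) END) AS yesterday,
           MAX(CASE WHEN r1.dayperiod = m.today
                    THEN ROUND((r2.costs / r1.clicks) * 1000, 2) END) AS today
    FROM campaigns AS c
    JOIN rollup1 AS r1 ON r1.campaign_id = c.id
    JOIN rollup2 AS r2 ON r2.campaign_id = c.id
                      AND r2.dayperiod = r1.dayperiod
    CROSS JOIN (SELECT MAX(dayperiod) AS today FROM rollup1) AS m
    WHERE r1.dayperiod >= m.today - 1
    GROUP BY c.id, c.name
"""
report = conn.execute(query).fetchall()
print(report)
```

A campaign with no row for one of the two days simply gets NULL in that column.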

Is this possible with an SQL query?

Sorry for the generic title of the question, but I didn't know how else to put it.. So here goes:
I have a single table that holds the following information:
computerName | userName | date | logOn | startUp
| | | |
ID_000000001 | NULL | 2012-08-14 08:00:00.000 | NULL | 1
ID_000000001 | NULL | 2012-08-15 09:00:00.000 | NULL | 0
ID_000000003 | user02 | 2012-08-15 19:00:00.000 | 1 | NULL
ID_000000004 | user02 | 2012-08-16 20:00:00.000 | 0 | NULL
computername and username are self-explanatory I suppose
logOn is 1 when the user logged on at the machine and 0 when he logged off.
startUp is 1 when the machine was turned on and 0 when it got turned off.
the other entry is always NULL respectively, since we can't log in and start up at the exact same time.
Now my task is: find out which computers have been turned on for the least amount of time over the last month (or any given period, but for now let's say one month). Is this even possible with SQL? Careful: I don't need to know how many times a PC was turned on, but how many hours/minutes each computer was turned on over the given timespan.
There's two little problems as well:
We cannot say that the first entry of each computer is a 1 in the startUp column since the script that logs those events was installed recently and thus maybe a computer was already running when it started logging.
We cannot assume that if we order by date and only show the startUp column the entries will all be alternating 1's and 0's, because if the computer is forced to shut down (by pulling the plug, for example) there won't be a log entry for the shutdown and there could be two 1's in a row.
EDIT: userName is of course NULL when startUp has a value, since turning on/shutting down doesn't show which user did that.
In a stored procedure, with cursors and fetch loops.
And you use a temptable to store by computername the uptime.
I give you the main plan, I'll let you see for the details in the TSQL guide.
Another link: a good example with TSQL Cursor.
DECLARE @total_hour_by_computername int
declare @computer_name varchar(255)
declare @current_date datetime
declare @previousDate datetime
declare @RowNum int
--Now in your ComputerNameList cursor, you have all different computernames:
declare ComputerNameList cursor for
select DISTINCT computername from myTable
-- We open the cursor
OPEN ComputerNameList
--You begin your foreach computername loop :
FETCH NEXT FROM ComputerNameList
INTO @computer_name
set @RowNum = 0
WHILE @@FETCH_STATUS = 0
BEGIN
SET @total_hour_by_computername=0;
--This query selects all startup 1 dates by computername order by date.
select @current_date=date from myTable where startup = 1 and computername = @computer_name order by date
--You use a 2nd loop on the dates that were sent you back:
--This query gives you the previous date
select TOP(1) @previousDate=date from myTable
where computername = @computer_name and date < @current_date and startup is not null
order by date DESC
--If it comes null, you can decide not to take it into account.
--Else
SET @total_hour_by_computername=@total_hour_by_computername+datediff(hour, @previousDate, @current_date);
--Once all dates have been parsed, you insert into your temptable the results for this computername
INSERT INTO TEMPTABLE(Computername,uptime) VALUES (@computer_name,@total_hour_by_computername)
--End of the @computer_name loop
FETCH NEXT FROM ComputerNameList
INTO @computer_name
END
CLOSE ComputerNameList
DEALLOCATE ComputerNameList
After that you only need a select on your temp table to determine which one of the computers has been up the most time.
You could group by computer, and use where to filter for startups in a particular month:
select computerName
, count(*)
from YourTable
where '2012-08-01' <= [date] and [date] < '2012-09-01'
and startup = 1
group by
computerName
order by
count(*) desc
As RoadWarrior pointed out, an accurate report is not possible when shutdown messages are dropped. But here is an attempt to generate something useful. I'm going to assume the table name is computers:
SELECT c1.computerName,
timediff(MIN(c2.date), c1.date) as upTime
FROM computers as c1, computers as c2
WHERE c1.computerName=c2.computerName
AND c1.startUp=1 AND c2.startUp=0
AND c2.date >= c1.date
GROUP BY c1.computerName, c1.date
ORDER BY c1.date;
This will generate a list of all the periods a computer was on. To generate your requested report you can use the above query as a subquery:
SELECT
c3.computerName,
SEC_TO_TIME(SUM(TIME_TO_SEC(c3.upTime))) AS totalUpTime
FROM
(SELECT c1.computerName,
timediff(MIN(c2.date), c1.date) AS upTime
FROM computers AS c1, computers AS c2
WHERE c1.computerName=c2.computerName
AND c1.startUp=1 AND c2.startUp=0
AND c2.date >= c1.date
GROUP BY c1.computerName, c1.date
ORDER BY c1.date
) AS c3
GROUP BY c3.computerName
ORDER BY totalUpTime;
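The pairing idea (match every startup with the earliest later shutdown on the same computer, then sum the intervals) can be tried out with Python's built-in sqlite3 module. This is only a sketch: SQLite has no timediff(), so the intervals are computed in seconds with strftime('%s', ...), the rows are invented, and the data here alternates cleanly between startup and shutdown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE computers "
    "(computerName TEXT, userName TEXT, date TEXT, logOn INTEGER, startUp INTEGER)"
)
# Invented events: startUp = 1 is power on, startUp = 0 is power off.
conn.executemany(
    "INSERT INTO computers VALUES (?, NULL, ?, NULL, ?)",
    [
        ("PC1", "2012-08-14 08:00:00", 1),
        ("PC1", "2012-08-14 18:00:00", 0),  # 10 hours
        ("PC1", "2012-08-15 09:00:00", 1),
        ("PC1", "2012-08-15 12:00:00", 0),  # 3 hours
        ("PC2", "2012-08-14 07:00:00", 1),
        ("PC2", "2012-08-14 09:00:00", 0),  # 2 hours
    ],
)

# Inner query: one row per startup with the seconds until the next shutdown.
# Outer query: total seconds per computer, least-used computer first.
query = """
    SELECT computerName, SUM(up_seconds) AS total_up_seconds
    FROM (
        SELECT c1.computerName,
               c1.date,
               MIN(CAST(strftime('%s', c2.date) AS INTEGER))
                 - CAST(strftime('%s', c1.date) AS INTEGER) AS up_seconds
        FROM computers AS c1
        JOIN computers AS c2
          ON c1.computerName = c2.computerName
         AND c1.startUp = 1 AND c2.startUp = 0
         AND c2.date >= c1.date
        GROUP BY c1.computerName, c1.date
    ) AS periods
    GROUP BY computerName
    ORDER BY total_up_seconds
"""
uptime = conn.execute(query).fetchall()
print(uptime)
```

With a missing shutdown entry, two consecutive startups would both be paired with the same later shutdown, so the totals would be overstated in that case.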
Try this query (replace table_name with the name of your table):
SELECT computerName, SUM(startUp) AS startupTimes
FROM table_name
GROUP BY computerName
ORDER BY startupTimes
This will output the number of times each computer has been started. To get just the first row (the computer that has the least amount of startups) you can append LIMIT 1 to the query.
If (per your last paragraph) you aren't recording all shutdown events, then you don't have the information available to generate a report showing the amount of time each computer has been switched on. Because you aren't recording all instances of computer shutdown, it doesn't matter what SQL query you use.
FWIW, this schema isn't 3NF. A more common approach would be to have a single column recording each event, for example:
ComputerId:UserId:EventId:EventDate
The first three columns are each a foreign key into another table where the details are stored. Although even with this schema, the UserID would be null for startup/shutdown events.

Cumulative average number of records created for specific day of week or date range

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue, I don't have a whole lot of experience with MySql when it comes to collating cumulative sum and running averages. I've thrown together the following query which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute and there are only 80k records within the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS t1.day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.date_time_entry)) = t1.day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY t1.day_of_week;
Halps? Please, don't make me cut myself again. :'(
How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DATE(t.datetime_entry), DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
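The two-level aggregation (count per calendar day first, then average those counts per day of week) can be checked on a small sample. The sketch below uses Python's built-in sqlite3 module instead of MySQL, so DAYOFWEEK() is replaced by strftime('%w', ...), which numbers days from 0 (Sunday) to 6 rather than 1 to 7. The table name logs and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs "
    "(id INTEGER PRIMARY KEY, action TEXT, date_created TEXT)"
)

# Invented sample: two Wednesdays (3 and 1 records) and
# two Thursdays (2 and 4 records) in June 2007.
records_per_day = {
    "2007-06-20": 3,  # Wednesday
    "2007-06-21": 2,  # Thursday
    "2007-06-27": 1,  # Wednesday
    "2007-06-28": 4,  # Thursday
}
rows = [
    ("merp", f"{day} 09:{minute:02d}:00")
    for day, count in records_per_day.items()
    for minute in range(count)
]
conn.executemany("INSERT INTO logs (action, date_created) VALUES (?, ?)", rows)

# Inner query: records per calendar day. Outer query: average per weekday.
query = """
    SELECT day_of_week, AVG(cnt) AS average_records_created
    FROM (
        SELECT strftime('%w', date_created) AS day_of_week,
               COUNT(*) AS cnt
        FROM logs
        GROUP BY DATE(date_created), strftime('%w', date_created)
    ) AS per_day
    GROUP BY day_of_week
    ORDER BY day_of_week
"""
averages = conn.execute(query).fetchall()
print(averages)
```

Days with no records at all do not appear in the inner result, so they are not counted as zeros in the average.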
The reason why your query takes so long is your inner select: it runs once per row, so you are essentially running 6,400,000,000 queries (80,000 x 80,000). With a query like this your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed, or the user logs in and checks the report afterwards.
Even with the optimization written by OMG Ponies (above) you are still looking at around the same number of queries.