Effective way to separate a group into individual records - sql

I'm grouping some records by their proximity in time. Here's what I do right now (timestamps in unixtime).
First off, I do a subselect to grab the records that are of interest to me:
(SELECT timestamp AS target_time FROM table WHERE something = cool) AS subselect
Then I want to look at the records that are close in time to those:
SELECT id FROM table, subselect WHERE ABS(target_time - timestamp) < 1800
But here is where I hit my problem. I only want the records where the time difference between the records around the target_time is > 20 mins. To do this, I group by the target_time and add a HAVING clause:
SELECT id FROM table, subselect WHERE ABS(target_time - timestamp) < 1800
GROUP BY target_time HAVING MAX(timestamp) - MIN(timestamp) > 1200
This is great, and all the records I don't like are gone, but now I only have the first id of each group, when I really want all of the ids. I could use GROUP_CONCAT, but that gives me a big mess I can't run any more queries on. What I would really like is to get all of the ids returned from all of these groups. Do I need another SELECT statement? Or is there just a better way to structure what I've got?
Thank you,
A SQL noob.

See if I have your problem correct:
For a given row in a table, you want to know the set of rows for similar records if the range of timestamps for those records is greater than 20 minutes. You want to do this for all ids in the table.
If you simply want a list of ids which fulfil this criterion, it is fairly straightforward:
given a table like:
create table foo (id bigint(4), section VARCHAR(2), modification datetime);
you can do:
select id, foo.section, min_max.min_modification, min_max.max_modification,
       abs(min_max.min_modification - min_max.max_modification) as diff
from foo,
     (select section, max(modification) max_modification, min(modification) min_modification
      from foo as inner_foo
      group by section) as min_max
where foo.section = min_max.section
  and abs(min_max.min_modification - min_max.max_modification) > 1800;
You're doing a subselect based on the 'similar rows' criteria (in this case the column section) to get the minimum and maximum timestamps for that section. This min and max applies to all ids in that section. Hence, for section 'A', you will have a list of ids, same for section 'B'.

My assumption is you want an output that looks like:
id1, timestamp1, fieldA, fieldB
id1, timestamp2, fieldA, fieldB
id2, timestamp3, fieldA, fieldB
id2, timestamp4, fieldA, fieldB
id3, timestamp5, fieldA, fieldB
id3, timestamp6, fieldA, fieldB
but the timestamp for these records is BETWEEN 1200 and 1800 seconds away from a "target_time" where something = cool?
SELECT data.id, data.timestamp, data.fieldA, data.fieldB, ..., data.fieldX
FROM events
JOIN data
WHERE events.something = cool_event -- Gives the 'target_time' of cool_event
AND ABS(events.timestamp - data.timestamp) BETWEEN 1200 AND 1800 -- gives data records 'near' target time, but at least 20 minutes away.
If the 'data' and 'events' tables are the SAME table, just use table aliases; you can join a table to itself, aka a 'SELF-JOIN':
SELECT data.id, data.timestamp, data.fieldA, data.fieldB, ..., data.fieldX
FROM events AS target, events AS data
WHERE target.something = cool_event -- gives the 'target_time' of cool_event
AND ABS(target.timestamp - data.timestamp) BETWEEN 1200 AND 1800 -- gives data records 'near' target time, but at least 20 minutes away.
This sounds about right, and needs no GROUP BY or aggregates.
You can order the resulting data if necessary.
-- J Jorgenson --

Related

How to create a new table that only keeps rows with more than 5 data records under the same id in BigQuery

I have a table like this:
Id   Date         Steps   Distance
1    2016-06-01   1000    1
There are over 1000 records and 50 Ids in this table; most Ids have about 20 records, and some Ids only have 1 or 2 records, which I think are useless.
I want to create a table that excludes those ids with less than 5 records.
I wrote this code to find the ids that I want to exclude:
SELECT
  Id,
  COUNT(Id) AS num_id
FROM `table`
GROUP BY Id
ORDER BY num_id
Since there are only two Ids I need to exclude, I use a WHERE clause:
CREATE TABLE `` AS
SELECT *
FROM ``
WHERE Id <> 2320127002
  AND Id <> 7007744171
Although I can get the result I want, I think there are better ways to solve this kind of problem. For example, if there are over 20 ids with less than 5 records in this table, what shall I do? Thank you.
Consider this:
CREATE TABLE `filtered_table` AS
SELECT *
FROM `table`
WHERE TRUE
QUALIFY COUNT(*) OVER (PARTITION BY Id) >= 5
Note: You can remove WHERE TRUE if it runs successfully without it.

How to aggregate data stored column-wise in a matrix table

I have a table; ellipses (...) represent multiple columns of a similar type:
TABLE: diagnosis_info
COLUMNS: visit_id,
patient_diagnosis_code_1 ...
patient_diagnosis_code_100 -- char(100) with a value of '0' or '1'
How do I find the most common diagnosis_code? There are 101 columns including the visit_id. The table is like a matrix table of 0s and 1s. How do I write something that can dynamically account for all the columns and count all the rows where the value is 1?
What I would normally do is not feasible, as there are too many columns:
SELECT COUNT(patient_diagnosis_code_1), COUNT(patient_diagnosis_code_2), ... FROM diagnosis_info WHERE patient_diagnosis_code_1 = '1' AND patient_diagnosis_code_2 = '1' AND ...
Then, even if I typed all that out, how would I select which column had the highest count of values = 1? The table is more column-oriented than row-oriented.
Unfortunately your data design is bad from the start. Instead it could be as simple as:
patient_id, visit_id, diagnosis_code
where a patient with 1 diagnostic code would have 1 row, a patient with 100 diagnostic codes would have 100 rows, and so on. At any given time you could transpose this into the format you presented (what is called a pivot or cross tab). Also, in some databases, for example PostgreSQL, you could put all those diagnostic codes into an array field, and then it would look like:
patient_id, visit_id, diagnosis_code (data type -bool or int- array)
Now you need the reverse of that, which is called an unpivot. Some databases, SQL Server for example, have an UNPIVOT operator.
Without knowing what your backend is, you could do that with ugly SQL like:
select code, pdc
from
(
select 1 as code, count(*) as pdc
from myTable where patient_diagnosis_code_1=1
union
select 2 as code, count(*) as pdc
from myTable where patient_diagnosis_code_2=1
union
...
select 100 as code, count(*) as pdc
from myTable where patient_diagnosis_code_100=1
) tmp
order by pdc desc, code;
PS: This returns all the codes with their frequencies, ordered from most to least. You could add LIMIT 1 to get just the max (beware of ties, in case more than one code matches the max).

SQL Server Sum multiple rows into one - no temp table

I would like to see the most concise way to do what is outlined in this SO question: Sum values from multiple rows into one row
that is, combine multiple rows while summing a column.
But how do you then delete the duplicates? In other words, I have data like this:
Person   Value
--------------
1        10
1        20
2        15
And I want to sum the values for any duplicates (on the Person col) into a single row and get rid of the other duplicates on the Person value. So my output would be:
Person   Value
--------------
1        30
2        15
And I would like to do this without using a temp table. I think I'll need to use OVER (PARTITION BY ...), but I'm just not sure. I'm just trying to challenge myself by not doing it the temp-table way. Working with SQL Server 2008 R2.
Simply put, give me a concise statement getting from my input to my output in the same table. If my table name is People, then a select * from People before the operation returns the first set above, and a select * from People after the operation returns the second set.
Not sure why you want to avoid a temp table, but here's one way to do it (though IMHO this is overkill):
UPDATE MyTable SET VALUE = (SELECT SUM(Value) FROM MyTable MT WHERE MT.Person = MyTable.Person);
WITH DUP_TABLE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Person ORDER BY Person) AS ROW_NO
    FROM MyTable
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
The first query updates every duplicate person to the summary value. The second query removes the duplicate persons.
Demo: http://sqlfiddle.com/#!3/db7aa/11
All you're asking for is a simple SUM() aggregate function and a GROUP BY:
SELECT Person, SUM(Value)
FROM myTable
GROUP BY Person
SUM() by itself would sum up all the values in a column, but when you add a secondary column and GROUP BY it, SQL returns the distinct values of the secondary column and performs the aggregate function within each of those groups.

Last id value in a table. SQL Server

Is there a way to know the last (nth) id in a table, without scanning it completely? (Just go to the end of the table and get the id value.)
table
id          fieldvalue
1           2323
2           4645
3           556
...         ...
100000000   1232
So for example, here n = 100000000 (100 million).
--------------EDIT-----
So which one of the queries proposed would be more efficient?
SELECT MAX(id) FROM <tablename>
Assuming ID is the IDENTITY for the table, you could use SELECT IDENT_CURRENT('TABLE NAME').
See here for more info.
One thing to note about this approach: If you have INSERTs that fail but increment the IDENTITY counter, then you will get back a result that is higher than the result returned by SELECT MAX(id) FROM <tablename>
You can also use system tables to get all last values from all identity columns in system:
select
    OBJECT_NAME(object_id) + '.' + name as col_name,
    last_value
from sys.identity_columns
order by last_value desc
In the case where table1 rows are inserted first, and rows in table2 then depend on ids from table1, you can use a SELECT:
INSERT INTO `table2` (`some_id`, `some_value`)
VALUES ((SELECT some_id
FROM `table1`
WHERE `other_key_1` = 'xxx'
AND `other_key_2` = 'yyy'),
'some value abc abc 123 123 ...');
Of course, this can work only if there are other identifiers that can uniquely identify rows from table1.
First of all, you want to access the table in DESCENDING order by ID.
Then you would select the TOP N records.
At this point, you want the last record of the set, which hopefully is obvious. Assuming the id field is indexed, this would at most retrieve the last N records of the table, and would most likely be optimized into a single record fetch.
SELECT IDENT_CURRENT('Your Table Name') gives the last id of your table.

MySQL querying with a dynamic range?

Given the table snippet:
id | name | age
I am trying to form a query that will return 10 people within a certain age range. However, if there are not enough people in that range, I want to extend the range until I can find 10 people.
For instance, if I only find 5 people in a range of 30-40, I would find 5 others in a 25-45 range.
In addition, I would like the query to be able to use ORDER BY RAND() or similar, in order to get different results each time.
Is this going beyond what MySQL can handle? Will I have to put some of this logic in the application instead?
UPDATED for performance:
My original solution worked but required a table scan. Am's solution is a good one and doesn't require a table scan, but its hard-coded ranges won't work when the only matches are far outliers. Plus, it requires de-duping records. Combining both solutions gets you the best of both worlds, provided you have an index on age. (If you don't have an index on age, all solutions will require a table scan.)
The combined solution first picks only the rows which might qualify (the desired range, plus the 10 rows over and 10 rows under that range), and then uses my original logic to rank the results. Caveat: I don't have enough sample data present to verify that MySQL's optimizer is indeed smart enough to avoid a table scan here-- MySQL might not be smart enough to weave those three UNIONs together without a scan.
[just updated again to fix 2 embarrassing SQL typos: DESC where DESC shouldn't have been!]
SELECT * FROM
(
SELECT id, name, age,
CASE WHEN age BETWEEN 25 and 35 THEN RAND() ELSE ABS(age-30) END as distance
FROM
(
SELECT * FROM (SELECT * FROM Person WHERE age > 35 ORDER BY age LIMIT 10) u1
UNION
SELECT * FROM (SELECT * FROM Person WHERE age < 25 ORDER BY age DESC LIMIT 10) u2
UNION
SELECT * FROM (SELECT * FROM Person WHERE age BETWEEN 25 and 35) u3
) p2
ORDER BY distance
LIMIT 10
) p ORDER BY RAND() ;
Original Solution:
I'd approach it this way:
first, compute how far each record is from the center of the desired age range, and order the results by that distance. For all results inside the range, treat the distance as a random number between zero and one. This ensures that records inside the range will be selected in a random order, while records outside the range, if needed, will be selected in order closest to the desired range.
trim the number of records in that distance-ordered resultset to 10 records
randomize order of the resulting records
Like this:
CREATE TABLE Person (id int AUTO_INCREMENT PRIMARY KEY, name varchar(50) NOT NULL, age int NOT NULL);
INSERT INTO Person (name, age) VALUES ("Joe Smith", 26);
INSERT INTO Person (name, age) VALUES ("Frank Johnson", 32);
INSERT INTO Person (name, age) VALUES ("Sue Jones", 24);
INSERT INTO Person (name, age) VALUES ("Ella Frederick", 44);
SELECT * FROM
(
SELECT id, name, age,
CASE WHEN age BETWEEN 25 and 35 THEN RAND() ELSE ABS(age-30) END as distance
FROM Person
ORDER BY distance
LIMIT 10
) p ORDER BY RAND() ;
Note that I'm assuming that, if there are not enough records inside the range, the records you want to append are the ones closest to that range. If this assumption is incorrect, please add more details to the question.
re: performance, this requires a scan through the table, so won't be fast-- I'm working on a scan-less solution now...
I would do something like this:
select * from (
SELECT * FROM (select * from ppl_table where age>30 and age<40 order by rand() limit 10) as Momo1
union
SELECT * FROM (select * from ppl_table where age>25 and age<45 order by rand() limit 20) as Momo2
) as FinalMomo
limit 10
Basically, this selects 10 users from the first group and then more from the second group.
If the first group doesn't add up to 10, there will be more from the second group.
The reason we select 20 from the second group is that UNION will remove the duplicate values, and you want to have at least 10 users in the final result.
Edit
I added the AS aliases to the inner SELECTs, and made the ORDER BY separate in the inner SELECTs, since MySQL doesn't like ORDER BY with UNION.