Trouble getting newest record from grouped records - sql

I'm pretty new to Access so I'm sure this is something simple. I'm not sure I even have the best subject.
I have an Owner and a Names table that contain data like this:
Owner              Names
TMKFK    NID  ...  NIDFK  Last   ModDate
7721011  45        45     Smith  1/18/15
7721011  137       137    Jones  2/1/15
7721012  45        45     Smith  1/18/15
I am trying to query them so that I get the TMKFK for the latest-timestamped record in the Names table. This is used for a lookup from a form, so if I look up Smi* I expect to get 7721012.
After a bunch of looking around on this site and elsewhere, and looking at PARTITION BY ... OVER, I concluded the answer had to be a subquery, but I can't quite figure out what to put where. This is where I got stuck:
SELECT Owner.TMKFK
FROM Owner INNER JOIN Names ON Owner.NID = Names.NIDFK
WHERE (Owner.TMKFK=7721011 OR Owner.TMKFK=7721012)
AND Names.Last Like "Smith"
AND Names.ModDate=(SELECT Max(Names.ModDate) FROM Names)
GROUP BY Owner.TMKFK, Names.Last, Names.M;
This fails because the subquery returns the Max date from the entire table, not just from the records with the same TMKFK. A HAVING clause doesn't seem to make a difference, and re-ordering the fields in the GROUP BY didn't make a difference either.

The subquery to get the max date would need to be restricted to the owner in question. Something along these lines:
SELECT Owner.TMKFK
FROM Owner INNER JOIN Names ON Owner.NID = Names.NIDFK
WHERE (Owner.TMKFK=7721011 Or Owner.TMKFK=7721012)
AND Names.Last Like 'Smith%'
AND Names.ModDate=(SELECT Max(Names.ModDate)
                   FROM Names
                   WHERE NIDFK = Owner.NID)
I don't think you need the GROUP BY. I don't know the Access syntax, but LIKE usually implies wildcards like % and the string should be single-quoted. And if you want case-insensitive searching:
AND UPPER(Names.Last) LIKE UPPER('Smith%')
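For what it's worth: in Access's default (ANSI-89) query mode the LIKE wildcard is * rather than %, so the equivalent predicate there would be:
AND Names.Last Like "Smith*"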


Access/SQL Select Query - Return "Most Like" Value Only

We have a chargeback process in an Access DB where departments must approve the expenses entered by another department. We only want a single 'default' approver, but because of the way the data has been set up, the query we currently use to fill in the approver returns multiple results.
In the tUserSec table, for example, we have two columns: Name (UserIDX) and UserCode.
User1 - 550*
User2 - 55003*
The idea here being that User1 is the Director and so is a 'catchall' for everything in this department, while User2 is a Manager and is specifically assigned to a narrower division. Departments are always 7 characters total.
Say the Department is 5500309; the idea is that User2 should populate as the approver, since their code most closely matches the Department ID. However, using the "Like" criteria returns both users, and the form appears to select one of the two at random, with no rhyme or reason that I can determine. It always selects User1 for 5500309 but always selects User2 for 5500301, despite there being no further delineation - but ideally User1 shouldn't be populating at all unless no one else matches more closely.
Below is a simplified version of the SQL, I cut out some other stuff that muddies the situation:
SELECT TDepts.Dept, TDepts.DDescr, tUserSec.UserIDX
FROM tUserSec, TDepts
WHERE (((TDepts.Dept) Like [usercode] & "*"));
How can I change this up so that I only pull in the UserID who is most like the usercode? I tried to figure out a way to pull in the UserID based on the length or max of the usercode, etc., but I wasn't able to find a way that worked. It's a safe assumption that if two users have usercodes that are "like" the department, the longest usercode is the one we want.
(This is my first question on here and I struggled with how to best explain this issue. Please be gentle :) )
First, I have to say that the main problem here is that a developer thought they would be clever and built a lot of logic into the department and user IDs. Hiding this sort of information within a column is a big source of headaches in general (as you're just starting to see).
I don't develop with Access, so I'm not certain of the syntax, but hopefully you'll get the general idea. Please let me know if the syntax needs to be tweaked for future users who find this question:
SELECT
    D.Dept,
    D.DDescr,
    U.UserIDX
FROM
    TDepts D
    LEFT OUTER JOIN
    (
        SELECT
            SQ_D.Dept,
            MAX(LEN(SQ_U.usercode)) AS max_len_usercode
        FROM
            TDepts SQ_D
            INNER JOIN tUserSec SQ_U ON SQ_D.Dept LIKE SQ_U.usercode & "*"
        GROUP BY
            SQ_D.Dept
    ) SQ ON SQ.Dept = D.Dept
    LEFT OUTER JOIN tUserSec U ON
        D.Dept LIKE U.usercode & "*" AND
        LEN(U.usercode) = SQ.max_len_usercode
The query gets a list of all of the departments along with the length of the longest usercode that matches each department. Then it uses that length to determine which user is "most like" the department.
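If the non-equi joins prove awkward in Access, an alternative sketch (untested, using the question's table and column names) picks the longest matching usercode with a correlated TOP 1 subquery; the extra ORDER BY on UserIDX is there because Access's TOP keeps ties:
SELECT D.Dept, D.DDescr,
       (SELECT TOP 1 U.UserIDX
        FROM tUserSec AS U
        WHERE D.Dept LIKE U.usercode & "*"
        ORDER BY LEN(U.usercode) DESC, U.UserIDX) AS Approver
FROM TDepts AS D;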

Selecting only such groups that contain certain value

First of all, even though this SQL: How do you select only groups that do not contain a certain value? thread is almost identical to my problem, it doesn't fully dispel my confusion about the problem.
Let's have a table "Contacts" like this one:
+------------+-----------+
| Department | FirstName |
+------------+-----------+
| 100        | Thomas    |
| 200        | Peter     |
| 100        | Jerry     |
+------------+-----------+
First, I want to group the rows by department number and show the number of rows in each displayed group. This, I believe, can be easily done by the following query:
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
GROUP BY Department
This outputs two groups: the first with department 100 containing two rows, the second with 200 containing only one row.
But then, I want to extend the query to exclude any group that doesn't contain certain value in certain column (e.g. Thomas in FirstName). Here are my questions:
1) Reading the above-mentioned thread I was able to come up with this, which seems to work correctly:
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
WHERE Department IN (SELECT Department FROM Contacts WHERE FirstName = "Thomas")
GROUP BY Department
Q: How does this work? I understand the "WHERE Department IN" part, but then I'd expect a value; instead another nested query is included, which doesn't make much sense to me, as I'm only a beginner with SQL.
2) By accident I was able to come up with another query that also seems to work, but feels weird and I also don't understand its workings.
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
GROUP BY Department
HAVING NOT SUM(FirstName = "Thomas") = 0
Q: How does this work? And why doesn't the alteration HAVING SUM(FirstName = "Thomas") > 0 work?
3) Q: Is there any simple and correct way to do this using the HAVING clause?
I expected that a simple "HAVING FirstName='Thomas'" after the GROUP BY would do the trick, as it seems to follow common language, but it does not.
Note that I want whole groups to be chosen by the query, so "WHERE FirstName='Thomas'" isn't a solution for my problem, as it excludes all the rows that don't satisfy the condition before the grouping takes place (at least the way I understand it).
Q: How does this work? I understand the "WHERE Department IN" part, but then I'd expect a value; instead another nested query is included, which doesn't make much sense to me, as I'm only a beginner with SQL.
The nested query returns values which are used to match against Department.
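For the sample data, the subquery returns the single value 100 (the department that contains Thomas), so the outer query effectively becomes:
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
WHERE Department IN (100)
GROUP BY Department
which is just an ordinary grouped count over the rows that survived the filter.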
2) By accident I was able to come up with another query that also seems to work, but feels weird and I also don't understand its workings.
HAVING NOT SUM(FirstName = "Thomas") = 0
"Feels weird" because, well, it is. This is not a place for the SUM function.
EDIT: Why does this work?
The expression FirstName = "Thomas" is evaluated per row as true or false (a Boolean expression). In Access, True converts to -1 and False to 0 (zero), so SUM totals those values: a group with two matching rows sums to -2, and a group with no matches sums to 0. Zero (still) means false and any non-zero value means true, which is exactly what NOT SUM(...) = 0 tests. It is also why the alteration HAVING SUM(FirstName = "Thomas") > 0 returns nothing: whenever there is a match, the sum is negative, never positive.
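If you would rather have a HAVING version that avoids the Boolean arithmetic entirely, one sketch uses Access's IIf function to count the matching rows per group explicitly:
SELECT Department, Count(*) As "Rows_in_group"
FROM Contacts
GROUP BY Department
HAVING SUM(IIf(FirstName = "Thomas", 1, 0)) > 0
Each row contributes 1 when it matches and 0 when it doesn't, so the sum is a plain non-negative match count and the > 0 comparison behaves as expected.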
EDIT: I think what could be more helpful to you is consideration of when to use WHERE and when to use HAVING (instead of the Boolean magic taking place).
From this answer:
WHERE clause introduces a condition on individual rows; HAVING clause introduces a condition on aggregations, i.e. results of selection where a single result, such as count, average, min, max, or sum, has been produced from multiple rows.
WHERE was appropriate for your example because first you want to "only return rows WHERE Department IN (100)" and then you want to "group those rows by Department" and get a COUNT of how many rows had been selected.

Google Bigquery use of substr, never returns back results

I have a table which contains two sets of data, with rows like this:
Type         | Name                                   | Id
-------------|----------------------------------------|----
PackagedDrug | Pseudoephedrine HCl Oral Tablet 120 MG | 110
PackagedDrug | Pseudoephedrine HCl Oral Tablet 60 MG  | 111
DrugName     | Pseudoephedrine HCl                    | 112
What I want to do is join PackagedDrug with DrugName concepts, i.e. get all Ids of Type PackagedDrug whose Name matches the Name of a DrugName row. If I hardcode the Name for DrugName in the following query, it runs instantaneously, but if I take out the hardcoding it just keeps on running. Could you please suggest suitable ways to speed up this query?
SELECT a.MSC_ID MSC_id, a.MSC_CONcept_type, a.concept_id, a.concept_name, b.concept_name
FROM
(
    SELECT MSC_id, MSC_CONcept_type, concept_id, concept_name
    FROM [ClientAlerts.MSC_Concepts]
    WHERE MSC_CONcept_type IN ('MediSpan.Concepts.PackagedDrug')
) a
CROSS JOIN
(
    SELECT MSC_CONcept_type, concept_id, concept_name, LENGTH(concept_name) len
    FROM [ClientAlerts.MSC_Concepts]
    WHERE MSC_CONcept_type IN ('MediSpan.Concepts.NamebasedClassification.DrugName')
    -- AND concept_name IN ('Pseudoephedrine HCl')
) b
WHERE SUBSTR(a.concept_name, 1, b.len) + ' ' = b.concept_name
Thanks,
Savita
This has nothing to do with BigQuery itself. When you hardcode the value, the second subquery is filtered down to a single row before the join, so very little comparison work has to happen.
If you don't use the hardcoded value, the CROSS JOIN has to compare every row from your first subquery with every row from your second, which is vastly more work. Honestly, given the use case as you describe it, I can't think of a way to do this faster.
But one question does come to mind: why do you have a "type" column? It seems like it should be two different tables instead.
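To illustrate that idea: if the rows were split by type into two tables (the table names below are hypothetical), the intent of the join becomes explicit, although the prefix comparison itself remains a non-equi condition; in legacy BigQuery syntax:
SELECT p.concept_id AS package_id, d.concept_id AS drug_id
FROM [ClientAlerts.PackagedDrugs] p
CROSS JOIN [ClientAlerts.DrugNames] d
WHERE LEFT(p.concept_name, LENGTH(d.concept_name) + 1) = CONCAT(d.concept_name, ' ')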

SQL query to find records with specific prefix

I'm writing SQL queries and getting tripped up by wanting to solve everything with loops instead of set operations. For example, here are two tables (lists, really - one column each); idPrefix is a subset of idFull. I want to select every full ID that has a prefix I'm interested in; that is, every row in idFull which has a corresponding entry in idPrefix.
idPrefix.ID   idFull.ID
-----------   ---------
12            8
15            12
300           12-1-1
              12-1-2
              15
              15-1
              300
Desired result would be everything in idFull except the value 8. Super-easy with a for each loop, but I'm just not conceptualizing it as a set operation. I've tried a few variations on the below; everything seems to return all of one table. I'm not sure if my issue is with how I'm doing joins, or how I'm using LIKE.
SELECT f.ID
FROM idPrefix AS p
JOIN idFull AS f
ON f.ID LIKE (p.ID + '%')
Details:
Values are varchars, prefixes can be any length but do not contain the delimiter '-'.
This question seems similar, but more complex; this one only uses one table.
Answer doesn't need to be fast/optimized/whatever.
Using SQL Server 2008, but am more interested in conceptual understanding than a flavor-specific query.
Aaaaand I'm coming back to both real coding & SO after ~3 years, so sorry if I'm rusty on any etiquette.
Thanks!
You can join the full table to the prefix table with a LIKE:
SELECT f.ID
FROM idFull AS f
INNER JOIN idPrefix AS p ON f.ID LIKE p.ID + '%'
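One caveat about the plain prefix match: LIKE p.ID + '%' also matches IDs that merely begin with the same characters, so prefix 12 would match a hypothetical full ID 123-1. Since prefixes never contain the delimiter '-', a stricter sketch is:
SELECT f.ID
FROM idFull AS f
INNER JOIN idPrefix AS p
ON f.ID = p.ID OR f.ID LIKE p.ID + '-%'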

How to optimize group by in table with huge number of records

I have a Person table with a huge number of records (about 16 million), and a requirement to find all persons with the same last name, the same first letter of the first name, and the same birth year; in other words, I want to show presumed duplicate persons in the UI so users can analyze them and decide whether they are the same person or not.
Here is the query I wrote:
SELECT *
FROM Person
INNER JOIN
(
    SELECT SUBSTRING(firstName, 1, 1) firstNameF, lastName, YEAR(birthDate) birthYear
    FROM Person
    GROUP BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)
    HAVING COUNT(*) > 1
) AS dupPersons
    ON SUBSTRING(Person.firstName, 1, 1) = dupPersons.firstNameF
    AND Person.lastName = dupPersons.lastName
    AND YEAR(Person.birthDate) = dupPersons.birthYear
ORDER BY Person.lastName, Person.firstName
but as I am not a SQL expert, I want to know: is this a good way to do it, and is there a more optimized way?
EDIT
Note that I can cut the data, which might help with optimization; for example, if I cut the data in two, it could return these groups of persons:
Johan Smith |
Jane Smith  | have same last name and first-name initial
Jack Smith  |
Mark Tween  | have same last name and first-name initial
Mac Tween   |
If the performance using a GROUP BY is not adequate, you could try using an INNER JOIN:
SELECT *
FROM Person p1
INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1)
AND p2.LastName = p1.LastName
AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
p1.LastName, p1.FirstName
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that:
1. The query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc.
2. The context that the query operates within - DB specifics, task specifics, etc.
Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.
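For illustration only, that stylistic rewrite might look like this sketch (same logic, not an optimization; the CTE and alias names are mine):
WITH DupKeys AS
(
    SELECT LEFT(FirstName, 1) AS FirstInitial, LastName, YEAR(BirthDate) AS BirthYear
    FROM Person
    GROUP BY LEFT(FirstName, 1), LastName, YEAR(BirthDate)
    HAVING COUNT(*) > 1
)
SELECT P.*
FROM Person AS P
INNER JOIN DupKeys AS D
    ON LEFT(P.FirstName, 1) = D.FirstInitial
    AND P.LastName = D.LastName
    AND YEAR(P.BirthDate) = D.BirthYear
ORDER BY P.LastName, P.FirstName;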
As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.
Things I'd want to know:
How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?
Example scenario: If this was a run-on-command feature that will be in my app indefinitely, it will get run weekly, with 10% or fewer records expected to be duplicates, with the ability to change the DB how I'd like, and if the duplicate-matching criteria are firm (not fluctuating) and I want to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate), and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might go in a different direction entirely.
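As a concrete sketch of that last scenario (the column and index names here are my own invention):
ALTER TABLE Person ADD BirthYear AS YEAR(BirthDate) PERSISTED;
CREATE INDEX IX_Person_LastName_BirthYear_FirstName
    ON Person (LastName ASC, BirthYear ASC, FirstName ASC);
With those in place, the grouping key no longer has to compute YEAR(BirthDate) per row, and the index supports both the grouping and the final sort.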
You can try something like this and see the difference on the execution plans, or benchmark the results on performance:
;WITH DupPersons AS
(
    SELECT *,
           COUNT(1) OVER (PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) AS Quant
    FROM Person
)
SELECT *
FROM DupPersons
WHERE Quant > 1
Of course, it would also help to know your table definition and the indexes you have created. I think it might help to add a computed column with the year of the birthdate and create an index on it, and the same with the first letter of the firstname.