Understanding an SQL Query - sql

I'm new to SQL and I've been racking my brain trying to figure out exactly what a query I was given at work to modify is doing. I believe it's using an alias, but I'm not sure why, because it only refers to one table. I think it's a fairly simple one; I just don't get it.
select [CUSTOMERS].Prefix,
       [CUSTOMERS].NAME,
       [CUSTOMERS].Address,
       [CUSTOMERS].[START_DATE],
       [CUSTOMERS].[END_DATE]
from [my_Company].[CUSTOMERS]
where [CUSTOMERS].[START_DATE] =
      (select max(a.[START_DATE])
       from [my_company].[CUSTOMERS] a
       where a.Prefix = [CUSTOMERS].Prefix
         and a.Address = [CUSTOMERS].ADDRESS
         and coalesce(a.Name, 'Go-Figure') =
             coalesce([CUSTOMERS].a.Name, 'Go-Figure'))

Here's a shot at it in English...
It looks like the intent is to get a list of customer names, addresses, start dates.
But the table is expected to contain more than one row with the same customer name and address, and the author wants only the row with the most recent start date.
Fine Points:
If a customer has the same name and address and prefix as another customer, the one with the most recent start date appears.
If a customer's name is missing (NULL), the placeholder 'Go-Figure' is used in its place. So two rows with missing names will match each other, and the one with the most recent start date will be returned. A row with a missing name will not match another row that has a name, so both of those rows will be returned.
Any row that has no start date will be excluded from results.
This does not look like a query from a real business application. Maybe it's just a conceptual prototype. It is full of problems in most real world situations. Matching names and addresses with simple equality just doesn't work well in the real world, unless the names and addresses are already cleaned and de-duplicated by some other process.
Regarding the use of alias: Yes. The sub-query uses a as an alias for the my_Company.CUSTOMERS table.
I believe there is an error on the last line.
[CUSTOMERS].a.Name
is not a valid reference. It was probably meant to be
[CUSTOMERS].Name
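To see the behaviour described above on a few rows, here is a minimal sketch that runs the corrected query through Python's built-in sqlite3 module. The sample rows are made up, the schema prefix is dropped because SQLite has no [my_Company] schema, and the outer table gets a hypothetical alias c so that both references to the same table can be told apart.

```python
import sqlite3

# In-memory database with invented sample rows (assumption: dates stored as ISO text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS ("
    "Prefix TEXT, Name TEXT, Address TEXT, START_DATE TEXT, END_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?, ?)",
    [
        ("Mr", "Smith", "1 Main St", "2019-01-01", "2019-12-31"),
        ("Mr", "Smith", "1 Main St", "2020-01-01", "2020-12-31"),
        ("Ms", None, "2 Oak Ave", "2018-05-01", "2018-12-31"),
        ("Ms", None, "2 Oak Ave", "2021-05-01", None),
        ("Ms", "Jones", "2 Oak Ave", "2017-03-01", "2017-09-30"),
        ("Dr", "Lee", "3 Elm Rd", None, None),  # no start date at all
    ],
)

# Same shape as the original query, with the last line referring to c.Name.
LATEST_PER_CUSTOMER = """
SELECT c.Prefix, c.Name, c.Address, c.START_DATE, c.END_DATE
FROM CUSTOMERS AS c
WHERE c.START_DATE = (
    SELECT MAX(a.START_DATE)
    FROM CUSTOMERS AS a
    WHERE a.Prefix = c.Prefix
      AND a.Address = c.Address
      AND COALESCE(a.Name, 'Go-Figure') = COALESCE(c.Name, 'Go-Figure')
)
ORDER BY c.Prefix, c.START_DATE
"""

latest_rows = conn.execute(LATEST_PER_CUSTOMER).fetchall()
for row in latest_rows:
    print(row)
```

With this data the two nameless rows at 2 Oak Ave collapse to the newer one, the named row at the same address is kept separately, and the row without a start date drops out because a NULL never compares equal to anything.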

I assume it selects the customer records from the [CUSTOMERS] table that have the most recent [CUSTOMERS].[START_DATE].

@Joshp gave a good answer, although I have seen these kinds of queries, and worse, in all kinds of real applications.
See if the query below gives you the same result though. The queries would not be equivalent in general, but I suspect they are the same with the data you've got. I believe the only assumption I'm making is that the ranges between start and end dates never intersect or overlap, which implies that the max start and max end dates are always together in the same row.
select
c.Prefix, c.NAME, c.Address,
max(c.START_DATE) as Start_Date,
max(c.END_DATE) as End_Date
from my_Company.CUSTOMERS as c
group by c.Prefix, c.NAME, c.Address
You'll notice the alias is a nice shorthand that keeps the query readable. Of course, when a query refers to a single table only once, aliases aren't strictly necessary at all.

Related

Compare names of two columns

Hello everyone. Here is my question. In the first table, each person's name is written in two languages, in two separate columns. In the second table there is only one name column, so each name is written in either the first or the second language.
How can I compare these two tables? Does my code work?
... t.datebirth=p.datebirth and (t.name=p.name1 or t.name=p.name2)
As I understood your question with the limited information you provided: yes, it works. It checks whether any of the two names in table p is equal to the name in table t.
You can simplify the logic with in:
t.datebirth = p.datebirth and t.name in (p.name1, p.name2)
This might not be a very efficient approach though. Depending on your use case, you might also want to consider two left joins, each joining on one of the names, and additional conditional logic in the rest of the query. But that cannot be assessed without a more detailed description of your use case.
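As a quick check that the in form really is the same condition as the two or branches, here is a small sketch using Python's sqlite3 module. The table names t and p and the columns come from the question; the id column and all sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# p holds each name in two languages; t holds a single name column.
conn.execute("CREATE TABLE p (id INTEGER, name1 TEXT, name2 TEXT, datebirth TEXT)")
conn.execute("CREATE TABLE t (name TEXT, datebirth TEXT)")
conn.executemany(
    "INSERT INTO p VALUES (?, ?, ?, ?)",
    [(1, "John", "Johann", "1990-01-01"), (2, "Peter", "Pyotr", "1985-06-15")],
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("Johann", "1990-01-01"),  # matches p.id 1 through name2
        ("Peter", "1985-06-15"),   # matches p.id 2 through name1
        ("Peter", "1999-09-09"),   # right name, wrong birth date
        ("Olga", "1990-01-01"),    # right birth date, wrong name
    ],
)

WITH_OR = """
SELECT t.name, p.id FROM t JOIN p
  ON t.datebirth = p.datebirth AND (t.name = p.name1 OR t.name = p.name2)
ORDER BY p.id
"""
WITH_IN = """
SELECT t.name, p.id FROM t JOIN p
  ON t.datebirth = p.datebirth AND t.name IN (p.name1, p.name2)
ORDER BY p.id
"""

matches_or = conn.execute(WITH_OR).fetchall()
matches_in = conn.execute(WITH_IN).fetchall()
print(matches_in)
```

Both queries return the same two matched rows here; whether either form performs well on a large table is a separate question that this sketch does not address.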

Datasource Establishment in Tableau and 170,000 records

I have two EXCEL datasources. 175,000 rows. I'm trying to set up a join (Add New Join Clause) using the INNER option between the two datasources. The left datasource includes certain member id #s. Unfortunately, the right datasource's member id #s are within a large field called member Desc. Something like below,
Datasource Left
Member ID #
ALL89098
Datasource Right
Member Desc
YTRNNN TO=ALL89098_KIA TO BE OR NOT OR
POALL89098 JOE
So, as you can see from the above, I need to deal with two scenarios. The member id sits inside Member Desc either right after a TO= (scenario 1) or anywhere else in the text, as in scenario 2 (POALL89098).
If I can't get this done in Tableau to establish the Join between these two columns from different datasources, since I have both of these datasources loaded into SQL Server DB, I can run SQL statements in SQL since they are in two different Tables within SQL Server DB as well.
I'm trying to use the CONTAINS function in Tableau as shown below, but it runs very slowly. This is Tableau Desktop on a machine with 16 GB of RAM.
IF CONTAINS([Member Desc], [Member id #]) THEN
[Member id #]
ELSE
"NOT FOUND"
END
Thanks so much for your time.
So, is there a way to use REGEXP within IF and ELSE or CASE statements?
You can create a join calculation. The highlighted dropdown shows where this can be found:
As long as the format of the Member ID in [Member Desc] has some pattern, it can be extracted with Regex. As you mention in your question, one way the ID may present itself is after a "TO=" and it looks like it ends before a "_". The following regex calculated field will pull the string between the two:
REGEXP_EXTRACT([Member Desc], "TO=([^_]+)_")
The result should properly join the two datasources:
The above is an outline which I hope sets you on the right path. I realize that there may be a few different ways in which the [Member ID] presents itself, so I won't be able to nail down the exact regex, but if there is any pattern at all then the approach above should work (e.g. even if the only pattern is that [Member ID] is three letters followed by four numbers, or that it always starts with an A and ends with something else).
Regex should also perform better than a contains() function, but do be aware that the function does need to search through every string in every row to make the join.
Edit in response to comment:
To add multiple conditions, try the following method:
IF LEN(REGEXP_EXTRACT([Member Desc], "FROM=([^,]+),")) > 0
THEN REGEXP_EXTRACT([Member Desc], "FROM=([^,]+),")
ELSEIF LEN(REGEXP_EXTRACT([Member Desc], "TO=([^,]+),")) > 0
THEN REGEXP_EXTRACT([Member Desc], "TO=([^,]+),")
ELSEIF [...Put as many of these as might match your pattern]
THEN [...Put as many of these as might match your pattern]
END
Essentially the calculation goes down the list and tries each possibility. I changed yours a little to look at the length (LEN()) of the returned value, which should compare fairly quickly, as it is an integer. As soon as the calculation finds a match in one of the ELSEIF branches, it stops working through the list, so it's important to put the most likely match at the top. The result of the calculated field should be a member ID. If there is no match, there really isn't a need for an ELSE statement, because the Inner Join will exclude it automatically.
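Outside Tableau, the same "try each pattern in order, most likely first" idea can be prototyped with Python's re module before committing to a join calculation. This is only a sketch: the pattern list, the helper name extract_member_id and the fallback rule of three capital letters followed by five digits are assumptions based on the single sample ID ALL89098, not a known specification of the data.

```python
import re

# Ordered from most to least likely; each pattern captures the ID in group 1.
# Assumption: IDs look like ALL89098 (three capital letters, five digits).
MEMBER_ID_PATTERNS = [
    re.compile(r"TO=([^_]+)_"),      # ID between "TO=" and the next underscore
    re.compile(r"FROM=([^,]+),"),    # ID between "FROM=" and the next comma
    re.compile(r"([A-Z]{3}\d{5})"),  # fallback: ID embedded anywhere in the text
]


def extract_member_id(member_desc):
    """Return the first ID found by the ordered patterns, or None if none match."""
    for pattern in MEMBER_ID_PATTERNS:
        match = pattern.search(member_desc)
        if match:
            return match.group(1)
    return None


left_ids = {"ALL89098"}
right_descs = [
    "YTRNNN TO=ALL89098_KIA TO BE OR NOT OR",
    "POALL89098 JOE",
    "nothing useful in this row",
]

# Emulate the inner join: keep only descriptions whose extracted ID is on the left.
joined = [
    (extract_member_id(desc), desc)
    for desc in right_descs
    if extract_member_id(desc) in left_ids
]
for member_id, desc in joined:
    print(member_id, "<-", desc)
```

Rows where no pattern matches return None and simply fall out of the join, which mirrors the point above that no ELSE branch is needed.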
Edit in response to comment:
Thank you. I see your recommendations.
I think you are going to have to find a way to strip out the member ID from the member desc in SQL. There should be some pattern to Member ID.
For instance, is it always 3 letters followed by 5 numbers, or something similar?
If you can come up with a pattern, then you can use SQL and some combination of Substring, Charindex, and/or Like '%Text%' or a regex pattern to strip out the actual member ID in the SQL Server table as its own field before bringing it into Tableau.

Select Distinct for First Part of column values, but last part (ticket number) needs to be wildcard

So I have a database of emails sent and received by our ticket system, Cherwell, version 9.3.2. It uses Microsoft SQL as a backend, we're on version 2012. I'm interested in doing cleanup on old or irrelevant emails. For instance emails 3+ years old, or notices sent to technicians saying they have a new task, or notices we send out that really have no value in retaining in full email stored in the database, as Cherwell also creates rows of plaintext for most of these emails. The table related to mail, TrebuchetMail is this size: 193,883.156 MB.
I'm wondering if it would improve overall performance to reduce this table, as nearly every type of record in Cherwell would access this table. Granted it would only be those rows relevant to the specific record.
Okay so my question: Subject is a column that stores the subject of the email. I have a few types of Subjects identified for removal, one example is this:
--165765
select count(*)
FROM [cherwell].[dbo].[TrebuchetMail]
where subject like 'You have an unacknowledged Task%';
After the 'You have an unacknowledged Task' part of the subject comes a number, the individual Task object's ID number. So doing a select distinct treats all 165765 rows as distinct, because they are. Can you use a wildcard with select distinct to group together subjects that are similar but not exactly the same? Is there another function I could use rather than distinct? I realize the values actually are distinct, but surely this problem has come up before. What I'm after is a "select distinct Subject" style query that would group together the rows where Subject is like 'You have an unacknowledged Task%', and likewise the rows where Subject is like 'Ticket #%Created'. Would I always need some criteria? If so, maybe this is pointless, because I'm going to have to look at the full results to come up with the criteria for the select distinct query anyway.
My goal is to identify different Subjects that could be targeted for archival/removal.
I found a 2013 thread that was a similar question, but it had to do with dates. The asker wanted to group together rows from a log that grouped together the days, disregarding the time aspect of the log. I didn't quite understand how I could translate that to work for my situation. I'd be very grateful for an explanation if that would work for me.
I know this might not be the answer you are looking for, since it is low-tech and not based on a formula. But since it is most likely a one-time action, why not export the table and sort it by the subject field? All the irrelevant records would be grouped together and could be easily deleted.
After this action simply re-import the table to the database. Of course this only works nicely on a flat database, not on a highly linked up one.
At the same time you would have a backup in case something goes wrong.
What you may want to do (bearing in mind I'm not a SQL expert) is add an expression to your query that stands in as a new column: a version of the subject with the digits removed.
Something like,
Select RecID, Replace(Replace(Replace(Subject, '1', ''), '2', ''), '3', '') As CustomColumn
From TrebuchetMail
and so on, nesting one Replace per digit until you have stripped out the numbers 0-9 anywhere they appear in the subject line.
You can then potentially go distinct based on this, I believe.
I'm sure there's a more elegant way of doing this with a regex as well; I'm just too much of a novice for it.
Not sure how it works out in practice.
Note: in SQL the syntax is Replace(expression, 'text to be replaced', 'text to replace with'), rather than the Subject.Replace(...) method style used in VB/C#.
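Here is a rough sketch of that digit-stripping idea, run against SQLite through Python's sqlite3 module rather than SQL Server. The sample subjects and the helper name digit_stripping_expr are invented; the nested replace() calls are built in a loop only to avoid typing ten of them by hand.

```python
import sqlite3


def digit_stripping_expr(column):
    """Build nested replace(...) calls that remove the digits 0-9 from a column."""
    expr = column
    for digit in "0123456789":
        expr = f"replace({expr}, '{digit}', '')"
    return expr


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TrebuchetMail (RecID INTEGER PRIMARY KEY, Subject TEXT)")
conn.executemany(
    "INSERT INTO TrebuchetMail (Subject) VALUES (?)",
    [
        ("You have an unacknowledged Task 101",),
        ("You have an unacknowledged Task 102",),
        ("You have an unacknowledged Task 103",),
        ("Ticket #5001 Created",),
        ("Ticket #5002 Created",),
        ("Lunch menu",),
    ],
)

pattern_expr = digit_stripping_expr("Subject")
subject_patterns = conn.execute(
    f"""
    SELECT {pattern_expr} AS SubjectPattern, COUNT(*) AS Messages
    FROM TrebuchetMail
    GROUP BY {pattern_expr}
    ORDER BY Messages DESC, SubjectPattern
    """
).fetchall()
for pattern, count in subject_patterns:
    print(repr(pattern), count)
```

Grouping with a count, instead of a plain distinct, also shows how many rows each subject pattern accounts for, which seems useful when deciding what to archive first. Subjects that legitimately differ only by a number would be merged too, so the output is a starting point for review rather than a delete list.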

Query to Find Adjacent Date Records

There exists in my database a page_history table; the idea is that whenever a record in the page table is changed, that record's old values are stored in the history table.
My job now is to find occasions in which a record was changed, and retrieve the pre- and post-conditions of that change. Specifically, I want to know when a page changed groups, and what groups were involved in the change. The query I have below can find these instances, but with the use of the min function, I can only get back the values that match between the two records:
select page_id,
original_group,
min(created2) change_date
from (select h.page_id,
h.group_id original_group,
i.group_id new_group,
h.created_dttm created1,
i.created_dttm created2
from page_history h,
page_history i
where h.page_id = i.page_id
and h.created_dttm < i.created_dttm
and h.group_id != i.group_id)
group by page_id, original_group, created1
order by page_id
When I try to get, say, any details of the second record, like new_group, I'm hit with an ORA-00979: not a GROUP BY expression error. I don't want to group by new_group, though, because that's going to destroy the logic (I think it would find records displaying times a page changed from a group to another group, regardless of any changes to other groups in between).
My question, then, is how can I modify this query, or go about writing a new one, that achieves a similar end, but with the added availability of columns that do not match between the two records? In essence, how can I find that min record without sacrificing all the other columns I'm not trying to compare? I don't exactly need a complete answer, any suggestions that point me in the right direction would be appreciated.
I use PL/SQL Developer, and it looks like version 11.2.0.2.0 of Oracle.
EDIT: I have found a solution. It's not pretty, and I'd still like to see some alternatives, but if helping me out would threaten to explode your brain, I would advise relocating to an easier question.
Without seeing your table structure it's hard to re-write the query but when you have a min function used like that it invariably seems better to put it into a separate sub select to get what you want and then compare the result of that.
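One way to read that suggestion: move the min into a correlated sub-select that pins down the very next history row for the same page, then join to that row so all of its columns are available without a GROUP BY. The sketch below tries this with Python's sqlite3 module on invented data. It assumes that comparing each history row with the row immediately after it is what is wanted, which is slightly different from the original query, and the column names are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_history (page_id INTEGER, group_id INTEGER, created_dttm TEXT)"
)
conn.executemany(
    "INSERT INTO page_history VALUES (?, ?, ?)",
    [
        (1, 10, "2020-01-01"),
        (1, 10, "2020-02-01"),  # edited, but still in group 10
        (1, 20, "2020-03-01"),  # moved from group 10 to 20
        (1, 10, "2020-04-01"),  # moved back from 20 to 10
        (2, 30, "2020-01-15"),
        (2, 30, "2020-02-15"),  # page 2 never changes group
    ],
)

GROUP_CHANGES = """
SELECT h.page_id,
       h.group_id     AS original_group,
       i.group_id     AS new_group,
       i.created_dttm AS change_date
FROM page_history h
JOIN page_history i
  ON i.page_id = h.page_id
 AND i.created_dttm = (
        -- the min lives here: the next timestamp after h for the same page
        SELECT MIN(n.created_dttm)
        FROM page_history n
        WHERE n.page_id = h.page_id
          AND n.created_dttm > h.created_dttm
     )
WHERE h.group_id <> i.group_id
ORDER BY h.page_id, change_date
"""

group_changes = conn.execute(GROUP_CHANGES).fetchall()
for change in group_changes:
    print(change)
```

Because the join lands on exactly one following row per history row, any other column of i can be added to the select list. This assumes timestamps are unique per page; ties would produce duplicate pairs.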

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this would give me an error, because when it reaches "X" it will try to use a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it, I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
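The three steps above can be mimicked in plain Python to show why the order matters for the NAMES example from the question. This is a toy model, not how any engine is implemented: cut_last_three is a hypothetical stand-in for substring(NAME, 1, len(NAME) - 3) that fails on a negative length, and eager_plan imitates a plan that computes the expression before filtering.

```python
def cut_last_three(name):
    """Stand-in for SUBSTRING(NAME, 1, LEN(NAME) - 3); fails on a negative length."""
    length = len(name) - 3
    if length < 0:
        raise ValueError(f"invalid length {length} for {name!r}")
    return name[:length]


names = [(1, "Peter"), (2, "X")]  # NAMES: ID, NAME
long_names = [(1, 5)]             # LONG_NAMES: ID, NAME_LENGTH


def logical_order():
    # Step 1: build the working table from the FROM clause (the inner join).
    working = [
        (name_id, name, name_length)
        for name_id, name in names
        for long_id, name_length in long_names
        if name_id == long_id
    ]
    # Step 2: apply the WHERE clause (the example query has none, so keep every row).
    filtered = list(working)
    # Step 3: evaluate the SELECT expressions against the remaining rows only.
    return [cut_last_three(name) for _, name, _ in filtered]


def eager_plan():
    # A reordered plan: compute the expression for every NAMES row, then join.
    computed = {name_id: cut_last_three(name) for name_id, name in names}
    return [computed[long_id] for long_id, _ in long_names]


def guarded(name):
    """CASE WHEN style guard: only call the function when it is safe to do so."""
    return cut_last_three(name) if len(name) > 3 else None


print(logical_order())
try:
    eager_plan()
except ValueError as error:
    print("eager plan failed:", error)
print([guarded(name) for _, name in names])
```

In the logical order the function never sees "X", while the eager variant fails on it. The guarded version is safe under either ordering, which matches the "case when" idea raised in the question.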
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution-time statistics. If you are using SQL Server Management Studio, the toolbar above your query editor lets you look at the estimated execution plan, which shows how the engine intends to run your query to gain some speed. So if you have just used your NAMES table and it is still in the buffer, the engine might first evaluate expressions over that data, and only then join it with the other table.