Datasource Establishment in Tableau and 170,000 records - sql

I have two Excel datasources with 175,000 rows. I'm trying to set up a join (Add New Join Clause) using the INNER option between the two datasources. The left datasource includes certain member ID #s. Unfortunately, the right datasource's member ID #s are embedded in a large field called Member Desc, something like below:
Datasource Left
Member ID #
ALL89098
Datasource Right
Member Desc
YTRNNN TO=ALL89098_KIA TO BE OR NOT OR
POALL89098 JOE
So, as you can see above, I need to deal with two scenarios: the member ID may appear in Member Desc after a TO=, or it may appear anywhere, as in scenario 2 (POALL89098).
If I can't establish the join between these two columns in Tableau, I can run SQL statements instead, since both datasources are also loaded as two separate tables in a SQL Server DB.
I'm trying to use the CONTAINS function in Tableau as shown below, but it runs very slowly. This is Tableau Desktop only, with 16 GB of RAM.
IF CONTAINS([Member Desc], [Member id #]) THEN
[Member id #]
ELSE
"NOT FOUND"
END
Thanks so much for your time.
So, is there a way to have the REGEXP within IF and ELSE or CASE statements?

You can create a join calculation. The highlighted dropdown shows where this can be found:
As long as the format of the Member ID in [Member Desc] has some pattern, it can be extracted with Regex. As you mention in your question, one way the ID may present itself is after a "TO=" and it looks like it ends before a "_". The following regex calculated field will pull the string between the two:
REGEXP_EXTRACT([Member Desc],"TO=([^_]+)_")
The result should properly join the two datasources:
The above is an outline which I hope sets you on the right path. I realize that there may be a few different ways in which the [Member ID] presents itself, so I won't be able to nail down the exact Regex, but if there is any pattern at all then the format above should work (i.e., even if the only pattern is that [Member ID] is three letters followed by four numbers, or it always starts with an A and ends with something else, etc.).
Regex should also perform better than a contains() function, but do be aware that the function does need to search through every string in every row to make the join.
Edit in response to comment:
To add multiple conditions, try the following method:
IF LEN(REGEXP_EXTRACT([Member Desc],"FROM=([^,]+),")) > 0
THEN REGEXP_EXTRACT([Member Desc],"FROM=([^,]+),")
ELSEIF LEN(REGEXP_EXTRACT([Member Desc],"TO=([^,]+),")) > 0
THEN REGEXP_EXTRACT([Member Desc],"TO=([^,]+),")
ELSEIF [...Put as many of these as might match your pattern]
THEN [...Put as many of these as might match your pattern]
END
Essentially the calculation goes down the list and tries each possibility. I changed yours a little to look at the length (LEN()) of the returned value, which should compare fairly quickly, as it is an integer. As this calculation iterates through each ELSEIF and finds a match, it will stop iterating through the list, so it's important to put the most likely match at the top. The result of the calculated field should be a member ID. If there is no match, there really isn't a need for an ELSE statement because the Inner Join will exclude it automatically.
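To try extraction patterns quickly outside Tableau, the cascade above can be sketched in Python. This is only a sketch under stated assumptions: Python's re module is not the regex engine Tableau uses, the function name extract_member_id is made up for illustration, and both patterns (an ID after "TO=", or three uppercase letters followed by five digits) are guesses based on the single sample ID ALL89098 in the question.

```python
import re

# Patterns are tried in order, so the most likely one goes first.
# Both are assumptions inferred from the sample ID "ALL89098".
PATTERNS = [
    re.compile(r"TO=([^_\s]+)"),     # ID that follows "TO=" and ends at "_" or whitespace
    re.compile(r"([A-Z]{3}\d{5})"),  # fallback: three letters then five digits, anywhere
]


def extract_member_id(member_desc):
    """Return the first member ID found, or None when no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(member_desc)
        if match:
            return match.group(1)
    return None


samples = [
    "YTRNNN TO=ALL89098_KIA TO BE OR NOT OR",
    "POALL89098 JOE",
    "NO ID HERE",
]
ids = [extract_member_id(desc) for desc in samples]
```

Rows for which the function returns None correspond to the rows an inner join would drop.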
Edit in response to comment:
Thank you. I see your recommendations.

I think you are going to have to find a way to strip out the member ID from the member desc in SQL. There should be some pattern to Member ID.
For instance, is it always 3 letters followed by 5 numbers, or something similar?
If you can come up with a pattern, then you can use SQL and some combination of Substring, Charindex, and/or Like %Text% or a regex pattern to strip out the actual member ID in the SQL Server table as its own field before bringing it into Tableau.

Related

Filter a query using the CONCAT function or similar

I have a query that is filtered on a list of order numbers. The actual field for the order number is 9 characters long (char). However, occasionally the system that the end users get their order numbers from will add an extra 0 or a single alpha character to the beginning of the order number. I am trying to account for that using the existing SQL and although it is running, it takes exponentially longer (and sometimes won't even run).
Is the approach I am taking below the best way to account for these differences?
order number field example:
066005485,
066005612
example of what may be entered and I need to account for:
0066005485,
A066005612
Here is what I have tried that seems to not work or at least be EXTREMELY slow:
SELECT S.order_no AS 'contract_no',
S.SIZE_INDEX AS 'technical_index',
S.open_qty AS 'contract_open_qty',
S.order_qty AS 'contract_order_qty',
E.excess,
(S.order_qty - E.excess) AS 'new_contract_size_qty'
FROM EXCESS E
JOIN SIM S ON RIGHT(E.GPS_CONTRACT_NUMBER,9) = S.order_no AND E.[AFS TECH INDEX] = S.size_index
WHERE S.order_no IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697')
OR CONCAT('0',S.order_no) IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697')
ORDER BY S.order_no,
S.size_index
Any thoughts on something that may work better or I am missing?
I can't do anything about the nasty join that requires the RIGHT function. If you have any influence over the database designers, it could be fruitful to either have that key (E.GPS_CONTRACT_NUMBER) cleaned up before it is put into the table, or get them to add another field where RIGHT(E.GPS_CONTRACT_NUMBER,9) has already been computed and an index can be created.
But there is definitely something you can do to remove the concat function calculation and take advantage of any index on S.order_no. I noticed your Where clause looks like order_no IN listofvals OR Concat('0', order_no) IN samelistofvals . So instead of adding a zero onto order_no remove a zero from everything in the IN list.
Where order_no IN ('0066003816','0066003817','0066005485','0066005612','0066005390','0066005616','0066005617','A066005969','A066005970','0066005952','0066005798','0066006673','0066005802','0066006196','0066006197','0066006199','0066006205','0066006697',
'066003816','066003817','066005485','066005612','066005390','066005616','066005617','066005952','066005798','066006673','066005802','066006196','066006197','066006199','066006205','066006697')
Notice that the IN-list is on two lines and the second line is just the first repeated with the leading 0 removed and any entry beginning with "A" removed entirely. This simplifies the Where clause and allows use of indexes, if any exist.
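If the list of entered order numbers changes often, the second line of the IN-list could be generated rather than typed by hand. The following Python sketch assumes, as in the answer above, that only a leading "0" on a 10-character value is padding and that "A"-prefixed values are left alone; the function names are hypothetical. Since it builds SQL text by string concatenation, it is only suitable for trusted input.

```python
def expand_order_numbers(entered, key_length=9):
    """Return the entered values plus zero-stripped variants, without duplicates."""
    expanded = list(dict.fromkeys(entered))  # keep input order, drop duplicates
    for value in entered:
        # Only a leading "0" on an over-long value is treated as padding.
        if len(value) == key_length + 1 and value.startswith("0"):
            stripped = value[1:]
            if stripped not in expanded:
                expanded.append(stripped)
    return expanded


def to_in_list(values):
    """Format values as the body of a SQL IN (...) list."""
    return ", ".join(f"'{v}'" for v in values)


entered = ["0066003816", "A066005969", "066005612"]
expanded = expand_order_numbers(entered)
in_list = to_in_list(expanded)
```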
If the efficiency problem is in the WHERE clause (not considering the JOIN operation), in order to improve the situation, you can try using the "pseudo-regex" pattern matching way with LIKE:
WHERE
S.order_no LIKE '[A0]06600%'
OR
S.order_no LIKE '06600%'
Warning: this pattern will also match strings that end with other numbers (e.g. 8648).
Does it work for you?

Access Unmatched or similar query where a column does not contain or is not like another column

I want to design a query that basically does a mass amount of "Not Like "*x*", except all of the things I would not like the query to contain are in another column.
I know I can do this one at a time by just using the criteria and specifying "Not like "*x*", but I have no idea how to do a not like for a whole column of data.
So, the long version is that I have a bunch of cameras hosted on several different servers on a corporate network. Each of these cameras is on the same subnet, and everything but the last octet of the IP address matches the server. Now, I have already created a field in a query that trims off the last octet of my IP, so I now basically have a pre-made IP range of where the cameras could possibly be. However, I do not have an exact inventory of each of the cameras, and there's not really a fast way to do this.
I have a list of issues that I'm working on and I've noticed some of the cameras coming up in the list of issues (basically a table that includes a bunch of IP addresses). I'd like to remove all possible instances of the cameras from appearing in the report.
I've seen designs where people have been able to compare like columns, but I want to do the opposite. I want to generate a query where it does not contain anything like what's in the camera column.
For the sake of this, I'll call the query where I have the camera ranges Camera Ranges and the field Camera Range.
Is there a way I can accomplish this?
I'm open to designing a query or even changing up my table to make it easier to do the query.
Similar to the answer I provided here, rather than using a negative selection in which you are testing whether the value held by a record is not like any record in another dataset, the easier approach is to match those which are like the dataset and return those records with no match.
To accomplish this, you can use a left join coupled with an is null condition in the where clause, for example:
select
MainData.*
from
MainData left join ExclusionData on
MainData.TargetField like ExclusionData.Pattern
where
ExclusionData.Pattern is null
Or, if the pattern field does not already contain wildcard operators:
select
MainData.*
from
MainData left join ExclusionData on
MainData.TargetField like '*' & ExclusionData.Pattern & '*'
where
ExclusionData.Pattern is null
Note that MS Access will not be able to represent such calculated joins in the Query Designer, but the JET database engine used by MS Access will still be able to interpret and execute the valid SQL.
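The anti-join pattern above can be tried on a small dataset. The sketch below uses Python's built-in sqlite3 module as a stand-in for Access, so the wildcard is % rather than * and concatenation is || rather than &; the table contents are made-up sample data, with trimmed camera ranges as the patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE MainData (TargetField TEXT);
    CREATE TABLE ExclusionData (Pattern TEXT);
    INSERT INTO MainData VALUES ('10.1.2.15'), ('10.1.3.20'), ('10.9.9.9');
    INSERT INTO ExclusionData VALUES ('10.1.2.'), ('10.1.3.');
    """
)

# Left join on LIKE, then keep only the rows that found no matching pattern.
rows = conn.execute(
    """
    SELECT MainData.TargetField
    FROM MainData
    LEFT JOIN ExclusionData
        ON MainData.TargetField LIKE '%' || ExclusionData.Pattern || '%'
    WHERE ExclusionData.Pattern IS NULL
    """
).fetchall()
remaining = [row[0] for row in rows]
```

Only the address that matches none of the camera ranges is left in remaining.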

Understanding an SQL Query

I'm new to SQL and I've been racking my brain trying to figure out exactly what a query I received at work to modify is doing. I believe it's using an alias, but I'm not sure why, because it only refers to one table. I think it's a fairly simple one; I just don't get it.
select [CUSTOMERS].Prefix,
[CUSTOMERS].NAME,
[CUSTOMERS].Address,
[CUSTOMERS].[START_DATE],
[CUSTOMERS].[END_DATE] from [my_Company].[CUSTOMERS]
where [CUSTOMERS].[START_DATE] =
(select max(a.[START_DATE])
from [my_company].[CUSTOMERS] a
where a.Prefix = [CUSTOMERS].Prefix
and a.Address = [CUSTOMERS].ADDRESS
and coalesce(a.Name, 'Go-Figure') =
coalesce([CUSTOMERS].a.Name, 'Go-Figure'))
Here's a shot at it in English...
It looks like the intent is to get a list of customer names, addresses, start dates.
But the table is expected to contain more than one row with the same customer name and address, and the author wants only the row with the most recent start date.
Fine Points:
If a customer has the same name and address and prefix as another customer, the one with the most recent start date appears.
If a customer is missing the name, 'Go-Figure' is used. So two rows with missing names will match, and the one with the most recent start date will be returned. A row with a missing name will not match another row that has a name. Both rows will be returned.
Any row that has no start date will be excluded from results.
This does not look like a query from a real business application. Maybe it's just a conceptual prototype. It is full of problems in most real world situations. Matching names and addresses with simple equality just doesn't work well in the real world, unless the names and addresses are already cleaned and de-duplicated by some other process.
Regarding the use of alias: Yes. The sub-query uses a as an alias for the my_Company.CUSTOMERS table.
I believe there is an error on the last line.
[CUSTOMERS].a.Name
is not a valid reference. It was probably meant to be
[CUSTOMERS].Name
I assume it selects the customer records from table [CUSTOMERS] with the most recent [CUSTOMERS].[START_DATE].
@Joshp gave a good answer, although I have seen these kinds of queries and worse in all kinds of real applications.
See if the query below gives you the same result though. The queries would not be equivalent in general, but I suspect they are the same with the data you've got. I believe the only assumption I'm making is that the ranges between start and end dates never intersect or overlap, which implies that max start and max end are always together in the same row.
select
c.Prefix, c.NAME, c.Address,
max(c.START_DATE) as Start_Date,
max(c.END_DATE) as End_Date
from my_Company.CUSTOMERS as c
group by c.Prefix, c.NAME, c.Address
You'll notice the alias is a nice shorthand that keeps the query readable. Of course when there's only a single table they aren't strictly necessary at all.
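To see how the correlated subquery and the GROUP BY version compare, both can be run against a tiny table. The sketch below is an illustration only: it uses Python's sqlite3 module rather than SQL Server, drops the my_Company schema prefix, uses the corrected reference to the outer Name column, and the rows are invented so that date ranges never overlap (the assumption stated above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS (Prefix TEXT, NAME TEXT, Address TEXT, "
    "START_DATE TEXT, END_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?, ?)",
    [
        ("Mr", "Ann", "1 Main", "2020-01-01", "2020-12-31"),
        ("Mr", "Ann", "1 Main", "2021-01-01", "2021-12-31"),
        ("Ms", None, "2 Oak", "2019-01-01", "2019-06-30"),  # missing name
        ("Ms", None, "2 Oak", "2019-07-01", "2019-12-31"),  # missing name
    ],
)

# Original shape: keep the row whose start date is the latest within its group.
correlated = conn.execute(
    """
    SELECT c.Prefix, c.NAME, c.Address, c.START_DATE, c.END_DATE
    FROM CUSTOMERS c
    WHERE c.START_DATE = (
        SELECT MAX(a.START_DATE)
        FROM CUSTOMERS a
        WHERE a.Prefix = c.Prefix
          AND a.Address = c.Address
          AND COALESCE(a.NAME, 'Go-Figure') = COALESCE(c.NAME, 'Go-Figure')
    )
    ORDER BY c.Prefix, c.Address
    """
).fetchall()

# Simplified shape: one row per group with the maximum dates.
grouped = conn.execute(
    """
    SELECT c.Prefix, c.NAME, c.Address, MAX(c.START_DATE), MAX(c.END_DATE)
    FROM CUSTOMERS c
    GROUP BY c.Prefix, c.NAME, c.Address
    ORDER BY c.Prefix, c.Address
    """
).fetchall()
```

With this sample data the two result lists are identical; with overlapping date ranges they could differ.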

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is being called on data that should have been filtered out by the WHERE clause. It appears that the substring call is applied before the filtering happens, so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try to use a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
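The conceptual order described above can be modelled in a few lines of Python using the NAMES and LONG_NAMES data from the question. This is a toy model, not a description of how any engine executes a query: trim_last_three is a hypothetical helper that mimics SUBSTRING rejecting a negative length, and the last step shows the "case when" style guard the question mentions as a defensive option.

```python
names = [{"ID": 1, "NAME": "Peter"}, {"ID": 2, "NAME": "X"}]
long_names = [{"ID": 1, "NAME_LENGTH": 5}]


def trim_last_three(name):
    """Mimic SUBSTRING(NAME, 1, LEN(NAME) - 3), failing on a negative length."""
    length = len(name) - 3
    if length < 0:
        raise ValueError("invalid length parameter")
    return name[:length]


# Step 1: FROM, including the inner join on ID.
working = [n for n in names for ln in long_names if n["ID"] == ln["ID"]]

# Step 2: WHERE (the vendor query has no WHERE clause, so nothing is removed).

# Step 3: SELECT, evaluated only against the rows that survived steps 1 and 2.
result = [trim_last_three(row["NAME"]) for row in working]

# Defensive variant: guard the expression itself, like CASE WHEN in the SELECT,
# so it is safe even if it is evaluated for every row of NAMES.
guarded = [
    trim_last_three(n["NAME"]) if len(n["NAME"]) >= 3 else None
    for n in names
]
```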
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called a query execution plan. It is based on query optimization rules, indexes, temporary buffers, and execution-time statistics. If you are using SQL Server Management Studio, the toolbar above the query editor lets you display the estimated execution plan, which shows how your query will be rearranged to gain some speed. So if you have just used your NAMES table and it is still in the buffer, the engine might first process that data and only then join it with the other table.

How best to sum multiple boolean values via SQL?

I have a table that contains, among other things, about 30 columns of boolean flags that denote particular attributes. I'd like to return them, sorted by frequency, as a recordset along with their column names, like so:
Attribute Count
attrib9 43
attrib13 27
attrib19 21
etc.
My efforts thus far can achieve something similar, but I can only get the attributes in columns using conditional SUMs, like this:
SELECT SUM(IIF(a.attribIndex=-1,1,0)), SUM(IIF(a.attribWorkflow =-1,1,0))...
Plus, the query is already getting a bit unwieldy with all 30 SUM/IIFs and won't handle any changes in the number of attributes without manual intervention.
The first six characters of the attribute columns are the same (attrib) and unique in the table, is it possible to use wildcards in column names to pick up all the applicable columns?
Also, can I pivot the results to give me a sorted two-column recordset?
I'm using Access 2003 and the query will eventually be via ADODB from Excel.
This depends on whether or not you have the attribute names anywhere in data. If you do, then birdlips' answer will do the trick. However, if the names are only column names, you've got a bit more work to do, and I'm afraid you can't do it with simple SQL.
No, you can't use wildcards in column names in SQL. You'll need procedural code to do this (i.e., a VB Module in Access; you could do it within a Stored Procedure if you were on SQL Server). Use this code to build the SQL.
It won't be pretty. I think you'll need to do it one attribute at a time: select a string whose value is that attribute name and the count-where-True, then either A) run that and store the result in a new row in a scratch table, or B) append all those selects together with "Union" between them before running the batch.
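Option B above (one SELECT per attribute, joined with UNION) can be sketched as follows. This is an illustration in Python against SQLite, not Access VB: the table name items and its rows are invented, column names are discovered with SQLite's PRAGMA table_info, and CASE is used where the Access query would use IIF. True is stored as -1, as in the question.

```python
import sqlite3


def build_flag_count_sql(table, columns):
    """Build one SELECT per flag column and glue them together with UNION ALL."""
    selects = [
        f"SELECT '{col}' AS Attribute, "
        f"SUM(CASE WHEN {col} = -1 THEN 1 ELSE 0 END) AS Cnt FROM {table}"
        for col in columns
    ]
    return " UNION ALL ".join(selects) + " ORDER BY Cnt DESC"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER, attribIndex INTEGER, attribWorkflow INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, -1, 0), (2, -1, -1), (3, -1, 0)],
)

# Pick up every column whose name starts with the shared 'attrib' prefix.
flag_columns = [
    row[1]
    for row in conn.execute("PRAGMA table_info(items)")
    if row[1].startswith("attrib")
]
counts = conn.execute(build_flag_count_sql("items", flag_columns)).fetchall()
```

Because the column list is read from the table at run time, adding or removing an attribute column needs no change to the query-building code.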
My Access VB is more than a bit rusty, so I don't trust myself to give you anything like executable code....
Just a simple count and group by should do it
Select attribute_name
, count(*)
from attribute_table
group by attribute_name
To answer your comment use Analytic Functions for that:
Select attribute_table.*
, count(*) over(partition by attribute_name) cnt
from attribute_table
In Access, Cross Tab queries (the traditional tool for transposing datasets) need at least 3 numeric/date fields to work. However since the output is to Excel, have you considered just outputting the data to a hidden sheet then using a pivot table?