Best way to filter data by criteria and display a description using SQL in MS Access?

I have a table in MS Access containing results from a survey, and I have a lookup table of Risk IDs and descriptions of the sort of risk indicated by the survey results.
What I've tried so far is selecting distinct entries from my survey table and adding a new field to my query for the Risk Code, whose number will depend on criteria that I determine and which I will then use to look up the risk.
My table for the survey looks like so:
Name | Location | Days spent eating IceCream | Icecream eating location
John Smith | London | 30 | Hull
My Risk ID table looks like so:
RiskID | RiskBool | Description
1 | Yes | At risk - This person eats too much icecream
2 | Yes | Risk - This person does not eat enough icecream
3 | No | Sensible amount of icecream eaten
4 | Yes | It is illegal to eat icecream in Hull
And my query looks (something) like this in Access design view:
Name | Location | Risk Code | RiskID | Description
I want to write SQL to set the Risk Code to 1, 2, 3, 4 (up to 15 in my real case), and then I will tell it to only display the person and the description where the Risk ID and Risk Code match. I haven't written this yet.
What is the best way to achieve this?
I see two possibilities:
1. Set up 15 queries, one for each Risk ID, add the descriptions to those, and then join those 15 sets of results together. This is what I know how to do, but it could end up quite messy.
2. Set up some 'check' using if statements, and then somehow set the Risk Code field for that entry.
My current SQL looks like this, but it doesn't make any checks yet; I'm worried the if statement will be very, very long.
SELECT DISTINCT
[At Risk Employee List].Employee AS Name,
[At Risk Employee List].[DaysIceCream] AS [Days spent eating Icecream],
[At Risk Employee List].[Base Location],
[RiskCode] AS [Risk Code], -- <-- is this where the check would need to go?
RiskDescLookup.RiskBoolean,
RiskDescLookup.RiskExplanation
FROM RiskDescLookup,
[Survey Raw Data]
INNER JOIN
[At Risk Employee List]
ON
[Survey Raw Data].ResID = [At Risk Employee List].[Staff ID]
GROUP BY
[At Risk Employee List].Employee,
[At Risk Employee List].[DaysIceCream],
[At Risk Employee List].[Base Location],
RiskDescLookup.RiskID,
[RiskCode], -- <-- is this where the check would need to go?
RiskDescLookup.RiskBoolean,
RiskDescLookup.RiskExplanation
I imagine the check done by if statements would be very long and look something like this (in pseudocode):
if ([At Risk Employee List].[Base Location] = "Hull") then [RiskCode] = 4 ...; else if (DaysIceCream > 42) then ...
Is that the best way to do this? Do I even need to have a Risk Code?
I'm a bit lost as to how to produce this 'check' in the best possible way.
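For illustration, this is roughly what I imagine the calculated Risk Code column could look like in Access SQL using the Switch function instead of nested IIfs (the thresholds here are invented, not my real criteria):
SELECT
[At Risk Employee List].Employee AS Name,
[At Risk Employee List].[Base Location],
Switch(
    [At Risk Employee List].[Base Location] = 'Hull', 4,
    [At Risk Employee List].[DaysIceCream] > 42, 1,
    [At Risk Employee List].[DaysIceCream] < 10, 2,
    True, 3
) AS [Risk Code]
FROM [At Risk Employee List];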

I am not entirely certain of your intent, but from what you've posted and the follow-up comments, it appears that joining the Risk Code to the Risk ID is relatively simple once you have the Risk Code identified for each survey result.
The real issue, it seems, is how to encapsulate the logic that identifies the Risk Code for each survey result. I would suggest "calculating" the risk code value for each survey result outside your query, and then joining to those results before finally joining to the Risk ID.
For example, I might add a third table to the design, SurveyRisk, that contains Name and Risk Code.
Use whatever criteria and logic you need to identify the risk for each survey response and enter these values into the SurveyRisk table. Then you can simply join Survey to SurveyRisk to Risk to summarize your results.
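A minimal sketch of that final join in Access SQL, assuming hypothetical table names Survey, SurveyRisk and Risk, with SurveyRisk holding Name and RiskCode and the Risk lookup keyed on RiskID:
SELECT s.Name, r.Description
FROM (Survey AS s
INNER JOIN SurveyRisk AS sr ON s.Name = sr.Name)
INNER JOIN Risk AS r ON sr.RiskCode = r.RiskID;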
Feel free to clarify where I'm misunderstanding what you are trying to accomplish and I'll edit my post accordingly.

The best way to do this is to use a lookup table that emulates the structure of your data.
Add a row for every 'case', and in MS Access link the corresponding fields together.
Here are a few of the links:
Then alter the SQL to pair up any options that need to go together. For instance, each of the checks I make is duplicated for two separate locations.
Here is an example:
FROM RiskDescLookupReg
INNER JOIN ([Survey Raw Data]
INNER JOIN [At Risk Employee List]
ON [Survey Raw Data].ResID=[At Risk Employee List].[Staff ID])
ON (RiskDescLookupReg.RegTravelChoice=[Survey Raw Data].RegTravelChoice)
And (RiskDescLookupReg.MonthChoice2=[Survey Raw Data].MonthChoice2
And RiskDescLookupReg.PercentageTimeChoice2=[Survey Raw Data].PercentageTimeChoice2
And RiskDescLookupReg.LimitedDurationChoice2=[Survey Raw Data].LimitedDurationChoice2
And RiskDescLookupReg.TemporaryPurposeChoice2=[Survey Raw Data].TemporaryPurposeChoice2)
Or (
RiskDescLookupReg.MonthChoice1=[Survey Raw Data].MonthChoice1
And RiskDescLookupReg.PercentageTimeChoice1=[Survey Raw Data].PercentageTimeChoice1
And RiskDescLookupReg.LimitedDurationChoice1=[Survey Raw Data].LimitedDurationChoice1
And RiskDescLookupReg.TemporaryPurposeChoice1=[Survey Raw Data].TemporaryPurposeChoice1)
Note how there are two blocks, one for each location. If I only had one location of interest, I could drop the last block.
If you get duplicates because of the way your lookup table is arranged, you need to wrap the parts from the lookup table in LAST and the parts from the survey in FIRST. Here is an example:
SELECT
[At Risk Employee List].Number,
FIRST([At Risk Employee List].Employee) AS Name,
FIRST([At Risk Employee List].[Base Location]) AS BaseLocation,
LAST(RiskDescLookupReg.RiskBool) AS RiskBool,
LAST(RiskDescLookupReg.RiskDesc) AS RiskDesc,
Using LAST ensures that if someone would come up as both at risk and not at risk, only the last at-risk case is displayed (those entries come later in the lookup table). This is counter to the fact that when duplicates are displayed, the at-risk ones come first.

Related

SQL different null values in different rows

I have a quick question regarding writing a SQL query to obtain a complete entry from two or more entries where the data is missing in different columns.
This is the example, suppose I have this table:
Client Id | Name | Email
1234 | John | (null)
1244 | (null) | john#example.com
Would it be possible to write a query that would return the following?
Client Id | Name | Email
1234 | John | john#example.com
I am finding this particularly hard because these are 2 entries in the same table.
I apologize if this is trivial; I am still studying SQL and learning. I tried looking online, but I couldn't phrase the question properly, I suppose, so I couldn't find the answer I was after.
Many thanks in advance for the help!
Yes, but actually no.
It is possible to write a query that works with your example data, but only under the assumption that the first part of the email is always equal to the name.
SELECT clients.id, clients.name, bclients.email
FROM clients
JOIN clients bclients
  ON upper(clients.name) = upper(substring(bclients.email from 0 for position('#' in bclients.email)));
db<>fiddle
Explanation:
We join the table onto itself, to get the information into one row.
For this we first find the position of the '#' in the email, then take the substring from the start of the string for the number of characters up to the '#' (the result of position).
To avoid case problems, the name and the substring are both cast to uppercase for comparison (lowercase would work just as well).
The design is flawed
How can a client have multiple IDs and different pieces of information about the same person at the same time?
I think you want to split the table between clients and users, so that a user can have multiple clients.
I recommend reading up on database normalization, as it provides the necessary knowledge for successful database design.
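A rough sketch of that split, with illustrative names only (none of these are taken from your schema):
-- One row per person; the attributes that describe the person live here
CREATE TABLE users (
    user_id integer PRIMARY KEY,
    name    text,
    email   text
);
-- One row per client record, each pointing at the user it belongs to
CREATE TABLE clients (
    client_id integer PRIMARY KEY,
    user_id   integer REFERENCES users (user_id)
);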

Multiple query vs Single (multiple has many joins)

Recently I stumbled on this situation. Doing both queries might be "light" in my situation; I just want to know which is better when it comes to a big dataset, better overall (performance, speed, etc.).
Currently I do a single query over two 1:N (has-many) relationships and reduce/transform the data in the application.
Transformed/reduced, it looks like this:
[
'field' => 'value',
'hasMany-1' => [],
'hasMany-2' => []
]
I'm actually tempted to just do separate queries, as that eliminates the pain of reducing the result if I ever have more than 2 has-many relations, and it is quite a bit more readable. The current code works, though, so maybe I'll just do it that way next time.
Is the compromise worth it? Again, in my situation it might be very "light", as I only have a few rows (< 100) and the structure is not complex, since it is still at an early stage.
But I'm asking for when I stumble upon this next time and the dataset grows larger.
** EDIT **
So the has-many relationships I'm talking about are: a customer has many phones and pets.
My current query returns me this result (simplified):
customer_id | pet_name | phone
1 | john | 1234
1 | john | 5678
2 | jane | 1357
2 | jane | 2468
2 | joe | 1357
2 | joe | 2468
I think my query is fine. It seems logical for some rows to repeat because the other field has different value.
In general, you should issue a single query and let the optimizer do the work for you. At the very least, this saves multiple round-trips to the database and query compilation.
There are cases where multiple queries can have better performance, but I think it is better to start with a single query.
You have a particular issue regarding joins along multiple many-to-many dimensions. There is no need to do the joins "generally" and then "reduce" the results. There are more efficient methods.
I would suggest that you ask another question. Provide sample data, desired results, and an explanation of the logic you are attempting. You may be able to learn a more efficient way to write a single query.
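For illustration only, one common way to avoid the pets x phones cross product in a single statement is to aggregate each child table on its own before joining; a sketch assuming PostgreSQL and the customer/pet/phones tables used in the other answer:
SELECT c.customer_id,
       p.pet_names,
       ph.phone_numbers
FROM customer c
LEFT JOIN (
    SELECT customer_id, array_agg(pet_name) AS pet_names
    FROM pet
    GROUP BY customer_id
) p ON p.customer_id = c.customer_id
LEFT JOIN (
    SELECT customer_id, array_agg(phone) AS phone_numbers
    FROM phones
    GROUP BY customer_id
) ph ON ph.customer_id = c.customer_id;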
You did not describe your table structures, so I will assume a few things.
If you want pets and phones in one row, do:
select c.customer_id, c.name,
array_to_string((array_agg(p.pet_name)),',') pet_names,
array_to_string((array_agg(ph.phone)),',') phones
from customer c, pet p, phones ph
where p.customer_id=1
and p.customer_id=c.customer_id
and ph.customer_id=c.customer_id
group by c.customer_id, c.name
If you want one row per pet_name with all possible phone numbers:
select c.customer_id, c.name,
p.pet_name,
array_to_string((array_agg(ph.phone)),',') phones
from customer c, pet p, phones ph
where p.customer_id=1
and p.customer_id=c.customer_id
and ph.customer_id=c.customer_id
group by c.customer_id, c.name, p.pet_name
If we talk about performance, it will be faster to do 2 separate queries for pets and phones by customer_id. But until you have millions of rows it is not that important.
Of course you should have indexes on customer_id.

Summing n numerical variables by grouping level specific to each

I am working through a GROUP BY problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but drawn from the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: for each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: How can I sum employment (or sales) by county by year as in the example table above, given that county as a grouping variable changes sometimes by the year and specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question, this sounds relatively straightforward. While I normally prefer normalized data to work with, I don't see that normalizing things beforehand buys you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate the list of columns to sum up. The rest is a straight group-by-and-join process as far as I can tell.
If the variable changes by name, the best thing to do in my experience is to put together a location view based on that union and join against it. This lets you hide the complexity from your main queries and, as long as you don't also join the underlying tables, it should perform quite well.
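A rough sketch of that approach, assuming (purely for illustration) that the county table has one column per year named county1990, county1991, and so on, keyed on dunsnumber like the emp table:
-- Unpivot the per-year county columns into long format: (dunsnumber, yr, county)
CREATE VIEW location_by_year AS
SELECT dunsnumber, 1990 AS yr, county1990 AS county FROM county
UNION ALL
SELECT dunsnumber, 1991, county1991 FROM county
UNION ALL
SELECT dunsnumber, 1992, county1992 FROM county;
-- ...and so on for the remaining years

-- Sum employment by the county the business was in for that particular year
SELECT l.county,
       SUM(CASE WHEN l.yr = 1990 THEN e.emp1990 END) AS emp1990,
       SUM(CASE WHEN l.yr = 1991 THEN e.emp1991 END) AS emp1991,
       SUM(CASE WHEN l.yr = 1992 THEN e.emp1992 END) AS emp1992
FROM location_by_year l
JOIN emp e ON e.dunsnumber = l.dunsnumber
GROUP BY l.county;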

How should you separate dimension tables from fact tables if you are not building a data warehouse?

I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse the categorization I use in this post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will allow for available values in the application when managing an employee and for management of these values when adding/deleting departments and such. For instance, the Locations table may look like,
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find the use of start/end dates perfectly easy for people to understand, allowing Agent and Location fact tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth noting:
1. If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO AND INCLUDING 30th August?
2. Dealing with overlapping date periods can give messy queries, and more importantly, slow queries.
The first one seems simply a matter of convention, but it can have certain implications when dealing with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO AND INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider instead that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact you don't need to know whether the date is DAY bound or can include a time component; you just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
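A small sketch of that convention, reusing the EmployeeId/LocationId/EffectiveDate/EndDate columns from the question (the table is called EmployeeLocations here for concreteness) and a hypothetical AgentDailyData event table:
SELECT el.EmployeeId, el.LocationId, d.EventTimeStamp
FROM EmployeeLocations el
JOIN AgentDailyData d
  ON  d.EmployeeId = el.EmployeeId
  AND d.EventTimeStamp >= el.EffectiveDate                    -- inclusive start
  AND d.EventTimeStamp <  COALESCE(el.EndDate, '9999-12-31')  -- exclusive end; NULL means "still current"
;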
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
TableA
INNER JOIN
TableB
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index, but it could give a Massive range of dates. And then, for each of those records, the ExclusiveFrom dates could be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But instead we've limited the number of rows that can be returned by the search on TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade off; using Disk Space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc, I actually found it Very powerful, adaptable, and maintainable. I actually saw people making fewer coding mistakes, writing code in less time, the code ending up running faster, and being much more able to adapt to clients' changing needs.
The only two downsides were:
1. More disk space taken (but trivial compared to the size of fact tables)
2. Inserts and updates to this mapping were slower
The actual slowdown for inserts and updates only really mattered once, where this model was being used to represent a constantly changing process net; the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity; if you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit flag and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value, or no records if a value isn't currently set. Have foreign keys tying back to the employee and to the location.
Create a column in the employee table that points to the currently active row in the history. When you need to get the employee's location, you join to the history table based on this ID. When you need all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word History goes, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps the old items around. As such, you can name it something like EmployeeLocations.
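A minimal sketch of that layout (table and column names are illustrative; CurrentEmployeeLocationId is the pointer column described above):
CREATE TABLE EmployeeLocations (
    EmployeeLocationId int PRIMARY KEY,
    EmployeeId         int NOT NULL REFERENCES Employees (EmployeeId),
    LocationId         int NOT NULL REFERENCES Locations (LocationId),
    EffectiveDate      date NOT NULL
);

-- Employees.CurrentEmployeeLocationId points at the row that is currently active,
-- so fetching the current location needs no date comparison at all:
SELECT e.EmployeeId, l.LocationName
FROM Employees e
JOIN EmployeeLocations el ON el.EmployeeLocationId = e.CurrentEmployeeLocationId
JOIN Locations l ON l.LocationId = el.LocationId;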

SQL Query with multiple values in one column

I've been beating my head on the desk trying to figure this one out. I have a table that stores job information and the reasons for a job not being completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending that stores what those values actually represent: 01 = Not Enough Info, 02 = Not Enough Time, 03 = Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pending info, and pending reasons. I can break out the first value but can't figure out how to do the second. What my limited skills have produced so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the aforementioned SQL gymnastics to do something like:
where find(',' || fld || ',', ',02,') > 0
assuming your SQL dialect has a string search function (find in this case, but I think it's charindex for SQL Server).
This ensures every sub-column begins and ends with a comma (comma plus field plus comma) and looks for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
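A rough sketch of such a trigger in SQL Server syntax, assuming the two new columns are called reason1 and reason2 and that at most two comma-separated values are ever stored:
CREATE TRIGGER trg_jobid_split_pending
ON jobid
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy the first value into reason1 and the second (if any) into reason2
    UPDATE j
    SET reason1 = CASE WHEN CHARINDEX(',', i.pendinginfo) > 0
                       THEN LEFT(i.pendinginfo, CHARINDEX(',', i.pendinginfo) - 1)
                       ELSE i.pendinginfo END,
        reason2 = CASE WHEN CHARINDEX(',', i.pendinginfo) > 0
                       THEN SUBSTRING(i.pendinginfo, CHARINDEX(',', i.pendinginfo) + 1, 10)
                       ELSE NULL END
    FROM jobid j
    JOIN inserted i ON i.Job_Number = j.Job_Number;
END;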
This means that the cost of splitting is only paid on row insert/update, not on every single select, amortising that cost efficiently.
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach, with a support table like this:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOINs or subqueries.
If you need retro-compatibility in your software, you can add a view to achieve this.
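For example, a query against the three tables above might look like this (illustrative only):
SELECT j.jobID, j.userID, p.pendingText
FROM JOBS j
JOIN JOB_PENDING jp ON jp.jobID = j.jobID
JOIN PENDING p ON p.pendingID = jp.pendingID
ORDER BY j.jobID, p.pendingID;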
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple event types specified.
All I do is write two procedures in my site code, not SQL code.
One procedure converts the table field (eventTypeIds) value, such as "3,4,15,6", into a ViewState array so I can use it anywhere in code.
The other procedure does the opposite: it collects whatever options the user checked and converts them back into the comma-separated string that gets stored.
If changing the schema is an option (which it probably should be), shouldn't you implement a many-to-many relationship here, so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
SELECT JOB_NUMBER, USER_ASSIGNED, SUBSTRING(PENDING_INFO, Numbers.N, CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs, then extract the appropriate part of the string.
While I agree with the DBA perspective of not storing multiple values in a single field, it is doable, as shown below, and practical for application logic, with some performance trade-offs. Let's say you have 10,000 user groups, each having on average 1,000 members. You may want a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,'), each number being a memberID, all separated by commas. Also add a comma at the start and end of the data. Then your select would use LIKE '%,someID,%'.
If you cannot change your data ('01,02,03' or similar), and say you want rows containing 01, you can still use select ... WHERE pendinginfo LIKE '01,%' OR pendinginfo LIKE '%,01' OR pendinginfo LIKE '%,01,%', which will ensure a match at the start, end or inside, while avoiding similar numbers (e.g. 101).