I've got a dataset with lots of 'projects' from various suppliers, each one containing hundreds of different "objectives" from a master list, with outcomes either "achieved" or "unrealised".
Some of these objectives are also listed in a second column, "external_objective", which comes from a second source.
I want to create a list of unique 'projects' from a single supplier, 'Walmart', where every "objective" marked as "achieved" is present in both "objective" and "external_objective". The presence of "unrealised" objectives in a project doesn't matter, but I want to exclude any project that has "achieved" objectives which are not present in both "objective" and "external_objective".
project   objective   external_objective   status
b12345    abcdef      abcdef               achieved
c23456    abcdeg                           achieved
d23456    bbcdvg                           unrealised
b12345    ghfjds                           achieved
d23456    fghjka      fghjka               achieved
So I would want to select project 'd23456' from this list, but not 'b12345' or 'c23456'
What I have so far is below and I'm pretty sure it's wrong.
SELECT DISTINCT Project
FROM dataset
WHERE supplier="walmart" AND
status = "achieved" AND objectives IS NOT NULL AND external_objectives IS NOT NULL
GROUP BY project
The condition to be satisfied, for each project, is that the number of records where "objective = external_objective" equals the total number of retrieved (achieved) records for that project.
You can use BigQuery's COUNTIF to apply that conditional aggregation alongside the plain COUNT aggregate.
SELECT project
FROM dataset
WHERE supplier = 'walmart' AND status = 'achieved'
GROUP BY project
HAVING COUNTIF(objective = external_objective) = COUNT(project)
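Note that COUNTIF only counts rows where the condition evaluates to TRUE, so rows with a missing external_objective (where the comparison yields NULL) are correctly treated as non-matches. If you ever need the same logic on an engine without COUNTIF, a portable sketch of the equivalent conditional aggregation, assuming the same table and column names, would be:
SELECT project
FROM dataset
WHERE supplier = 'walmart' AND status = 'achieved'
GROUP BY project
-- every achieved row for the project must match between objective and external_objective
HAVING SUM(CASE WHEN objective = external_objective THEN 1 ELSE 0 END) = COUNT(*)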
Good evening. Could someone please help me with the following? I am trying to join two tables. The first is wbr_global.gl_ap_details. This stores historic GL information. The second table, sandbox.utr_fixed_mapping, is where account mapping is stored. For example, an account number 60820 is mapped as Employee Relation. The first table needs the mapping from the second table linked on the account number. The output I am getting is not right and way too big. Any help would be appreciated!
select sandbox.utr_fixed_mapping_na.new_mapping_1,sum(wbr_global.gl_ap_details.amount)
from wbr_global.gl_ap_details
LEFT JOIN sandbox.utr_fixed_mapping_na ON wbr_global.gl_ap_details.account_number = sandbox.utr_fixed_mapping_na.account_number
Where gl_ap_details.cost_center = '1172'
and gl_ap_details.period_name = 'JUL-21'
and gl_ap_details.ledger_name = 'Amazon.com, Inc.'
Group by 1;
I tried adding the cast function but after 5000 seconds of the query running I canceled it.
The query itself appears OK, but needs minor changes. Learn to use table "aliases" so you don't have to keep typing the long database.table.column names all over. Additionally, SQL is easier to read that way anyhow.
Notice the aliases "gl" and "fm" declared after the table names; these aliases are then used to qualify the columns. Easier to read, would you agree?
I have also added the GL account number to the query, as described below it.
select
gl.account_number,
fm.new_mapping_1,
sum(gl.amount)
from
wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
ON gl.account_number = fm.account_number
Where
gl.cost_center = '1172'
and gl.period_name = 'JUL-21'
and gl.ledger_name = 'Amazon.com, Inc.'
Group by
gl.account_number,
fm.new_mapping_1
Now, as for your query returning null: this just means that there are records within the gl_ap_details table with an account number that is not found in the utr_fixed_mapping_na table. So, to see WHICH GL account numbers do NOT exist in the mapping, I have added the account number to the query. It's possible there are MULTIPLE records in gl_ap_details that are not found in the mapping table, so you may get something like:
GLAccount Description SumOfAmount
glaccount1 null $someAmount
glaccount37 null $someAmount
glaccount49 null $someAmount
glaccount72 Depreciation $someAmount
glaccount87 Real Estate $someAmount
glaccount92 Building $someAmount
glaccount99 Salaries $someAmount
I obviously made up the GL accounts just to show the purpose. You may have multiple rows where a single null's total amount is actually masking how many different GL account numbers were NOT found.
Once you find which are missing, you can check / confirm they SHOULD be in the mapping table.
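If it helps, a quick way to list only the unmapped account numbers is to keep the LEFT JOIN and filter for rows where the mapping side came back null (a sketch reusing the same tables, aliases and filters as above):
select
   gl.account_number,
   count(*) as UnmappedRows,
   sum(gl.amount) as UnmappedAmount
from
   wbr_global.gl_ap_details gl
      LEFT JOIN sandbox.utr_fixed_mapping_na fm
         ON gl.account_number = fm.account_number
where
   gl.cost_center = '1172'
   and gl.period_name = 'JUL-21'
   and gl.ledger_name = 'Amazon.com, Inc.'
   and fm.account_number is null -- no match found in the mapping table
group by
   gl.account_number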
FEEDBACK.
Since you now realize some account numbers are missing, let's consider a Cartesian result. If there are multiple entries in the mapping table for the same G/L account number, the join will produce a Cartesian result, thus bloating your numbers. To clarify, let's say your mapping table has
Mapping file.
GL Descr1 NewMapping
1 test Salaries
1 testView Buildings
1 Another Depreciation
And your GL_AP_Details has
GL Amount
1 $100
Your total for the query would come out to $300, because the query joins the AP Details GL #1 to EACH of the entries in the mapping file, thus bloating the amount. You could also add a COUNT(*) as NumberOfEntries to the query to see how many transactions it THINKS it is processing. Is there some "unique ID" in the GL_AP_Details table? If so, you could also do a count of DISTINCT ID values. If the two counts differ (the distinct count is lower than the number of entries), I think THAT is your culprit.
select
fm.new_mapping_1,
sum(gl.amount),
count(*) as NumberOfEntries,
count( distinct gl.UniqueIdField ) as DistinctTransactions
from
wbr_global.gl_ap_details gl
LEFT JOIN sandbox.utr_fixed_mapping_na fm
ON gl.account_number = fm.account_number
Where
gl.cost_center = '1172'
and gl.period_name = 'JUL-21'
and gl.ledger_name = 'Amazon.com, Inc.'
Group by
fm.new_mapping_1
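You could also check the mapping table directly for duplicated account numbers; any account number returned by the following sketch will multiply the GL amounts in the join:
select
   fm.account_number,
   count(*) as NumberOfMappingRows
from
   sandbox.utr_fixed_mapping_na fm
group by
   fm.account_number
having
   count(*) > 1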
Might you also need to limit the mapping table for a specific prophecy or mec view?
If you "think" that the result of an aggregate is wrong, then the easiest way to verify this is to select the individual rows that correlate to 1 record in the aggregate output and inspect the records, looking for duplications.
For instance, pick 'Building Management':
SELECT fixed.new_mapping_1,details.amount,*
FROM wbr_global.gl_ap_details details
LEFT JOIN sandbox.utr_fixed_mapping_na fixed ON details.account_number = fixed.account_number
WHERE details.cost_center = '1172'
AND details.period_name = 'JUL-21'
AND details.ledger_name = 'Amazon.com, Inc.'
AND fixed.new_mapping_1 = 'Building Management'
Notice that we tack a ,* onto the end of the projection; this shows you everything the query has access to. Look for repeating sections of data that you were not expecting, then, depending on which table they originate from, you might add additional criteria to the JOIN or to the WHERE, or you might need to group by additional columns.
This type of issue is really hard to comment on in a forum like this because it is highly specific to your schema, and the data contained within it, making solutions highly subjective to criteria you are not likely to publish online.
Generally, if you think a calculation is wrong, you need to compute it manually to verify. The advice above helps you inspect the data your query is using; either construct your own query or use other tools to build the data set that lets you manually compute the correct values, then work them back into, or replace, your original query.
The speed issues are out of scope here; we could comment on the poor schema design, but I suspect you don't have a choice. In the utr_fixed_mapping_na table you should make account_number the same column type as the source data, or add a new column that holds the data in the original type; then you can set up indexes on the columns to improve the speed of the join.
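As a rough illustration only (the column type and whether your platform supports these statements and secondary indexes are assumptions, not something stated in the post), the idea looks like this:
-- Sketch: add a typed copy of the account number and index both sides of the join
ALTER TABLE sandbox.utr_fixed_mapping_na ADD COLUMN account_number_num NUMERIC;

UPDATE sandbox.utr_fixed_mapping_na
SET account_number_num = CAST(account_number AS NUMERIC);

CREATE INDEX idx_fixed_mapping_acct ON sandbox.utr_fixed_mapping_na (account_number_num);
CREATE INDEX idx_gl_ap_details_acct ON wbr_global.gl_ap_details (account_number);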
Schema
Question: List all paying customers with users who had 4 or 5 activities during the week of February 15, 2021; also include how many of the activities sent were paid, organic and/or app store. (i.e. include a column for each of the three source types).
My attempt so far:
SELECT source_type, COUNT(*)
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
I would like to get a second opinion on it. I didn't include the accounts table because I don't believe that I need it for this query, but I could be wrong.
Have you tried to run this? It doesn't satisfy the brief on FOUR counts:
List all the ... customers (that match criteria)
There is no customer information included in the results at all, so this is an outright fail.
paying customers
This is the top level criteria, only customers that are not free should be included in the results.
Criteria: users who had 4 or 5 activities
There has been no attempt to evaluate this user criteria in the query, and the results do not provide enough information to deduce it.
There is further ambiguity in this requirement: does it mean the results should only include accounts that have individual users with 4 or 5 activities each, or simply that the account should have 4 or 5 activities overall?
If this is a test question (it is clearly contrived; if it is not, please ask for help on how to design a better schema), then the use of the term User is usually very specific and suggests that you need to group by, or otherwise make specific use of, this facet in your query.
Bonus: (i.e. include a column for each of the three source types).
This is the only element that was attempted, as the data is grouped by source_type but the information cannot be correlated back to any specific user or customer.
Next time please include example data and the expected outcome with your post. In preparing the data for this post you would have come across these issues yourself and may have been inspired to ask a different question, or through the process of writing the post up you may have resolved the issue yourself.
Without further clarification we can still start to evolve this query; a good place to start is to set aside the criteria and focus on the format of the output. The requirement mentions the following output requirements:
List Customers
Include a column for each of the source types.
Firstly, even though you don't think you need it, the request clearly states that Customer is an important facet in the output, and in your schema the account table holds the customer information; including information from the account table makes the data readable by humans.
This is a standard PIVOT-style response, then: we want a row for each customer, presenting a count that aggregates each of the values for source_type. Most RDBMS support some variant of a PIVOT operator or function, but we can achieve the same thing with simple CASE expressions that conditionally put a value into projected columns matching the values we want to aggregate; GROUP BY then evaluates the aggregation, in this case a COUNT. (A sketch using the T-SQL PIVOT operator itself follows the example result below.)
The following syntax is for MS SQL, however you can achieve something similar easily enough in other RDBMS.
OP please tag this question with your preferred database engine...
NOTE: there is NO filtering in this query... yet
SELECT accounts.company_id
, accounts.company_name
, paid = COUNT(st_paid)
, organic = COUNT(st_organic)
, app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
GROUP BY accounts.company_id, accounts.company_name
This results in the following shape of result:
company_id   company_name   paid   organic   app_store
apl01        apples         4      8         0
ora01        oranges        6      12        0
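For completeness, since a PIVOT operator was mentioned above, the same shape can be produced with T-SQL's PIVOT syntax. This is only a sketch: PIVOT needs a value column separate from the spreading column, and the activities.id key used here is an assumption, since the schema image does not confirm its name.
SELECT company_id, company_name, [paid], [organic], [app store] AS app_store
FROM (
    -- assumes activities has an id column; substitute your actual key
    SELECT accounts.company_id, accounts.company_name,
           activities.source_type, activities.id
    FROM activities
    INNER JOIN accounts ON activities.company_id = accounts.company_id
) AS src
PIVOT (
    COUNT(id) FOR source_type IN ([paid], [organic], [app store])
) AS pvt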
Criteria
When you are happy with the shape of the results and that all the relevant information is available, it is time to apply the criteria to filter this data.
From the requirement, the following criteria can be identified:
paying customers
The spec doesn't mention paying specifically, but it does include a note that free customers have current_mrr = 0.
Now aren't we glad we did join on the account table :)
users who had 4 or 5 activities
This is very specific about explicitly 4 or 5 activities, no more, no less.
For the sake of simplicity, let's assume that the user facet of this requirement is not important and that it is simply a reference to all users on an account, not just users who have individually logged 4 or 5 activities on their own; the stricter reading would require more demo data than I care to manufacture right now to prove (a sketch for it follows the full query below).
during the week of February 15, 2021.
This one was correctly identified in the original post, but we need to call it out just the same.
OP has used Monday to Friday of that week; there is no mention that weeks start on a Monday or end on a Friday, but we'll go along with it, since it's only the syntax we need to explore today.
In the real world the actual values specified in the criteria should be parameterised, mainly because you don't want to manually re-construct the entire query every time, but also to sanitise input and prevent SQL injection attacks.
Even though it seems overkill for this post, using parameters even in simple queries helps to identify the variable elements, so I will use parameters for the 2nd criteria to demonstrate the concept.
DECLARE @from DateTime = '2021-02-15' -- Date in ISO format
DECLARE @to DateTime = DateAdd(d, 5, @from) -- midnight after Friday 2021-02-19, so BETWEEN captures all of Friday
/* NOTE: requirement only mentioned the start date, not the end,
   so your code should also only rely on the single fixed start date */
SELECT accounts.company_id, accounts.company_name
, paid = COUNT(st_paid), organic = COUNT(st_organic), app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
WHERE -- paid accounts = exclude 'free' accounts
accounts.current_mrr > 0
-- Date range filter
  AND activity_time BETWEEN @from AND @to
GROUP BY accounts.company_id, accounts.company_name
-- The fun bit, we use HAVING to apply a filter AFTER the grouping is evaluated
-- Wording was explicitly 4 OR 5, not BETWEEN so we use IN for that
HAVING COUNT(source_type) IN (4,5)
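If the stricter reading of "users who had 4 or 5 activities" applies (individual users, not the account as a whole), a sketch along the following lines would work. It assumes activities carries a user_id column, which the post does not confirm, and reuses the @from/@to parameters declared above.
SELECT DISTINCT q.company_id, q.company_name
FROM (
    -- one row per (account, user) with exactly 4 or 5 activities in the window
    SELECT accounts.company_id, accounts.company_name, activities.user_id
    FROM activities
    INNER JOIN accounts ON activities.company_id = accounts.company_id
    WHERE accounts.current_mrr > 0
      AND activities.activity_time BETWEEN @from AND @to
    GROUP BY accounts.company_id, accounts.company_name, activities.user_id
    HAVING COUNT(*) IN (4, 5)
) AS q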
I believe you are missing some information there.
Without more information on the tables, I can only guess that you also have a customer table. I am going to assume there is a customer_id key that links both tables.
I would take your query and do something like:
SELECT customer_id,
       SUM(num_operations) AS total,
       MAX(CASE WHEN source_type = 'app' THEN num_operations END) AS app_totals,
       MAX(CASE WHEN source_type = 'paid' THEN num_operations END) AS paid_totals,
       MAX(CASE WHEN source_type = 'organic' THEN num_operations END) AS organic_totals
FROM (
    SELECT customer_id, source_type, COUNT(*) AS num_operations
    FROM activities
    WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
    GROUP BY customer_id, source_type
) tb1
GROUP BY customer_id
This is the most generic case I can think of, but it does not scale very well. If you get new source types, you need to modify the query, and the structure of the output table also changes. Depending on the SQL engine you are using (e.g. MySQL vs Microsoft SQL Server) you could also use a pivot function.
The previous query is a little rough, but it will give you a general idea. You can add ELSE branches to the CASE expressions to zero the fields when they have no values, and join with the customer table if you want only active customers, etc.
I'm in need of some assistance. I have searched and not found what I'm looking for. I have an assignment for school that requires me to use SQL. I have a query that pulls some columns from two tables:
SELECT Course.CourseNo, Course.CrHrs, Sections.Yr, Sections.Term, Sections.Location
FROM Course
INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term="spring";
I need to add a Totals row at the bottom to count the CourseNo and Sum the CrHrs. It has to be done through SQL query design as I need to paste the code. I know it can be done with the datasheet view but she will not accept that. Any advice?
To accomplish this, you can union your query together with an aggregation query. It's not clear from your question which columns you are trying to get "Totals" from, but here's an example of what I mean using your query and getting counts of each (kind of a useless example, but you should be able to apply it to what you are doing):
SELECT
[Course].[CourseNo]
, [Course].[CrHrs]
, [Sections].[Yr]
, [Sections].[Term]
, [Sections].[Location]
FROM
[Course]
INNER JOIN [Sections] ON [Course].[CourseNo] = [Sections].[CourseNo]
WHERE [Sections].[Term] = "spring"
UNION ALL
SELECT
"TOTALS"
, SUM([Course].[CrHrs])
, count([Sections].[Yr])
, Count([Sections].[Term])
, Count([Sections].[Location])
FROM
[Course]
INNER JOIN [Sections] ON [Course].[CourseNo] = [Sections].[CourseNo]
WHERE [Sections].[Term] = "spring"
You can prepare your "total" query separately, and then output both query results together with "UNION".
It might look like:
SELECT Course.CourseNo, Course.CrHrs, Sections.Yr, Sections.Term, Sections.Location
FROM Course
INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term="spring"
UNION
SELECT "Total", SUM(Course.CrHrs), SUM(Sections.Yr), SUM(Sections.Term), SUM(Sections.Location)
FROM Course
INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term="spring";
Whilst you can certainly union the aggregated totals query to the end of your original query, in my opinion this would be really bad practice and would be undesirable for any real-world application.
Consider that the resulting query could no longer be used for any meaningful analysis of the data: if displayed in a datagrid, the user would not be able to sort the data without the totals row being interspersed amongst the rest of the data; the user could no longer use the built-in Totals option to perform their own aggregate operation, and the insertion of a row only identifiable by the term totals could even conflict with other data within the set.
Instead, I would suggest displaying the totals within an entirely separate form control, using a separate query such as the following (based on your own example):
SELECT Count(Course.CourseNo) as Courses, Sum(Course.CrHrs) as Hours
FROM Course INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term = "spring";
However, since CourseNo and CrHrs are fields within your Course table and not within your Sections table, the above may yield multiples of the desired result, with the number of hours multiplied by the number of corresponding records in the Sections table.
If this is the case, the following may be more suitable:
SELECT Count(Course.CourseNo) as Courses, Sum(Course.CrHrs) as Hours
FROM
Course INNER JOIN
(SELECT DISTINCT s.CourseNo FROM Sections s WHERE s.Term = "spring") q
ON Course.CourseNo = q.CourseNo
My report shows only the latest diagnosis per patient based on their date_of_diagnosis - all other records are suppressed:
I summarize by diagnosis and age group in a crosstab. Crosstabs evaluate before printing, so any attempts to suppress, share variables, or summarize happen after the crosstab populates. This means Total in Each Age Group is correct, because each patient only has one age - but if a patient has more than one diagnosis, even if they're suppressed, they get counted multiple times:
I absolutely must use a crosstab for this due to the large number of diagnoses and age groups involved. How can I get the crosstab to ignore suppressed records? Or if I need to use a custom SQL Command table, how can I rewrite the existing SQL to ignore obsolete records?
Crystal's auto-generated SQL (through ODBC):
SELECT "Codes"."diagnosis_code",
"Codes"."diagnosis_value",
"Codes"."PATID",
"Codes"."FACILITY",
"Codes"."EPISODE_NUMBER",
"Record"."date_of_diagnosis"
FROM "SYSTEM"."Codes" "Codes",
"SYSTEM"."Entry" "Entry",
"SYSTEM"."Record" "Record"
WHERE "Codes"."DiagnosisEntry"="Entry"."ID" AND
"Codes"."EPISODE_NUMBER"="Entry"."EPISODE_NUMBER" AND
"Codes"."FACILITY"="Entry"."FACILITY" AND
"Codes"."PATID"="Entry"."PATID" AND
"Entry"."DiagnosisRecord"="Record"."ID" AND
"Entry"."EPISODE_NUMBER"="Record"."EPISODE_NUMBER" AND
"Entry"."FACILITY"="Record"."FACILITY" AND
"Entry"."PATID"="Record"."PATID"
You need only the latest diagnosis among a set of diagnoses. So I would suggest:
SELECT "Codes"."PATID",
"Codes"."diagnosis_code",
"Codes"."diagnosis_value",
"Codes"."FACILITY",
"Codes"."EPISODE_NUMBER",
"Record"."date_of_diagnosis"
FROM "SYSTEM"."Codes" "Codes",
"SYSTEM"."Entry" "Entry",
"SYSTEM"."Record" "Record"
WHERE "Codes"."DiagnosisEntry"="Entry"."ID" AND
"Codes"."EPISODE_NUMBER"="Entry"."EPISODE_NUMBER" AND
"Codes"."FACILITY"="Entry"."FACILITY" AND
"Codes"."PATID"="Entry"."PATID" AND
"Entry"."DiagnosisRecord"="Record"."ID" AND
"Entry"."EPISODE_NUMBER"="Record"."EPISODE_NUMBER" AND
"Entry"."FACILITY"="Record"."FACILITY" AND
"Entry"."PATID"="Record"."PATID"
AND "Entry"."date_of_diagnosis" = (SELECT MAX("date_of_diagnosis") FROM
"DiagonsisRecord" "A" WHERE "A"."DiagnosisRecord"="Entry"."DiagnosisRecord" )
This should get the maximum Date_of_Diagnosis for each patient and pass the filter parameter to get the last diagnosis of that patient.
Building off of Muffaddal Shakir's answer, I was able to write this query to perform the correct filter:
SELECT "Codes"."PATID",
"Codes"."diagnosis_code",
"Codes"."diagnosis_value",
"Codes"."FACILITY",
"Codes"."EPISODE_NUMBER",
"Record"."date_of_diagnosis"
FROM "SYSTEM"."codes" "Codes",
"SYSTEM"."entry" "Entry",
"SYSTEM"."record" "Record"
WHERE "Codes"."DiagnosisEntry"="Entry"."ID" AND
"Codes"."EPISODE_NUMBER"="Entry"."EPISODE_NUMBER" AND
"Codes"."FACILITY"="Entry"."FACILITY" AND
"Codes"."PATID"="Entry"."PATID" AND
"Entry"."DiagnosisRecord"="Record"."ID" AND
"Entry"."EPISODE_NUMBER"="Record"."EPISODE_NUMBER" AND
"Entry"."FACILITY"="Record"."FACILITY" AND
"Entry"."PATID"="Record"."PATID"
AND "Record"."date_of_diagnosis" = (
SELECT MAX("Record2"."date_of_diagnosis")
FROM "SYSTEM"."entry" "Entry2",
"SYSTEM"."record" "Record2"
WHERE "Entry2"."DiagnosisRecord"="Record2"."ID" AND
"Entry2"."EPISODE_NUMBER"="Record2"."EPISODE_NUMBER" AND
"Entry2"."FACILITY"="Record2"."FACILITY" AND
"Entry2"."PATID"="Record2"."PATID" AND
"Record"."PATID"="Record2"."PATID"
)
The key differences being:
The subquery uses unique aliases from the main query.
The last line, "Record"."PATID"="Record2"."PATID" - without this, the query only pulls back one diagnosis (the latest one in the whole system). With it, the subquery is correlated per patient, so it checks for the latest diagnosis per person.
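As an aside, if the database behind the ODBC connection supports window functions (an assumption; the post doesn't say which engine it is), the same "latest row per patient" filter can be expressed with ROW_NUMBER instead of a correlated subquery:
SELECT "PATID", "diagnosis_code", "diagnosis_value",
       "FACILITY", "EPISODE_NUMBER", "date_of_diagnosis"
FROM (
    SELECT "Codes"."PATID",
           "Codes"."diagnosis_code",
           "Codes"."diagnosis_value",
           "Codes"."FACILITY",
           "Codes"."EPISODE_NUMBER",
           "Record"."date_of_diagnosis",
           ROW_NUMBER() OVER (PARTITION BY "Codes"."PATID"
                              ORDER BY "Record"."date_of_diagnosis" DESC) AS rn
    FROM "SYSTEM"."codes" "Codes"
    JOIN "SYSTEM"."entry" "Entry"
      ON "Codes"."DiagnosisEntry" = "Entry"."ID"
     AND "Codes"."EPISODE_NUMBER" = "Entry"."EPISODE_NUMBER"
     AND "Codes"."FACILITY" = "Entry"."FACILITY"
     AND "Codes"."PATID" = "Entry"."PATID"
    JOIN "SYSTEM"."record" "Record"
      ON "Entry"."DiagnosisRecord" = "Record"."ID"
     AND "Entry"."EPISODE_NUMBER" = "Record"."EPISODE_NUMBER"
     AND "Entry"."FACILITY" = "Record"."FACILITY"
     AND "Entry"."PATID" = "Record"."PATID"
) t
WHERE rn = 1
Note that ROW_NUMBER keeps exactly one row per patient even when two diagnoses share the same date, whereas the MAX() approach above keeps every row that ties for the latest date.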
I have 3 tables.
One contains Profiles as described below:
ID NM
==============
1 Profile A
2 Profile B
The second contains assignments:
ID NM
==============
1 Assignment A
2 Assignment B
My third contains FID's for both and allows you to prioritize them like so:
ID P_FID A_FID PRIORITY
========================
1 1 2 1
2 1 1 2
My problem is populating the third table via a continuous form so the end user has the ability to input priorities. Basically, there is a combo box that lets the user select the appropriate profile. If there are no entries in the third table, it should show you all of the assignments so you can input priorities. If there are already records in that table it should retrieve those values so you can update the priorities.
The following query works great as long as the third table is empty. Once the user inputs priorities and tries to switch to a different profile, it doesn't return any records unless it is the selected profile.
SELECT tblProfileForAssignments.PROFILE_FID,
tblAssignments.NM,
tblProfileForAssignments.PRIORITY
FROM tblAssignments
LEFT JOIN tblProfileForAssignments ON tblAssignments.ID = tblProfileForAssignments.ASSGNMNT_FID
WHERE (tblProfileForAssignments.PROFILE_FID = Forms!frmProfileAssignments!cmboProfile)
OR (tblProfileForAssignments.PROFILE_FID IS NULL);
Can this be done in a single query, utilizing a union I would think, or should I just revert to VBA to figure this out? Like I said, it works great as long as the third table is empty or the user only works on the first profile they select; beyond that it fails. Does this make sense?
Turning it into a subquery might give you what you need:
SELECT PRIORITIES.PROFILE_FID, tblAssignments.NM,
PRIORITIES.PRIORITY
FROM tblAssignments LEFT JOIN
(SELECT ASSGNMNT_FID, PROFILE_FID, PRIORITY
FROM tblProfileForAssignments
WHERE PROFILE_FID = [Forms]![frmProfileAssignments]![cmboProfile]) PRIORITIES
ON tblAssignments.ID = PRIORITIES.ASSGNMNT_FID
This should return all assignment names, along with any priorities already entered for the specified profile. The query in your example stops returning rows as soon as assignments exist for any profile, if the currently selected profile has no assignments of its own.
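Since the question asked about a union specifically, an equivalent written that way might look like the following sketch (Access SQL, same tables and form reference as above): the first branch returns the priorities already entered for the selected profile, and the second returns the remaining assignments with a Null priority.
SELECT tblProfileForAssignments.PROFILE_FID,
       tblAssignments.NM,
       tblProfileForAssignments.PRIORITY
FROM tblAssignments
INNER JOIN tblProfileForAssignments
    ON tblAssignments.ID = tblProfileForAssignments.ASSGNMNT_FID
WHERE tblProfileForAssignments.PROFILE_FID = [Forms]![frmProfileAssignments]![cmboProfile]
UNION ALL
SELECT Null, tblAssignments.NM, Null
FROM tblAssignments
WHERE NOT EXISTS (
    SELECT 1 FROM tblProfileForAssignments
    WHERE tblProfileForAssignments.ASSGNMNT_FID = tblAssignments.ID
      AND tblProfileForAssignments.PROFILE_FID = [Forms]![frmProfileAssignments]![cmboProfile]
);
Bear in mind that union queries are read-only in Access, so this variant is only suitable for display; to let the user type priorities in, the form still needs an updatable recordset behind it.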