MS Access COUNT with INNER JOIN - sql

Stumped on a problem that involves a MS ACCESS DB and two tables in an attempt to get two different record results by way of count.
Both tables I am working with have a primary key field as well as a field that holds a 1 or a 0 to mark if the record has been marked deletion.
The problem is that I am not able to get the total count and the difference from this query, I can only retrieve the count of all records not just of one that is not marked for deletion.
Example Category has two content records associated to it but one has a 1 in the delete field the other has 0. I am trying to get data.
I've attempted to use SUM but I receive errors from MS ACCESS when doing so. This is the current querystring I have been working with below.
To clarify, the reason for this is so that I can then get the two results and handle the difference on the client side once I get the data back from the recordset.
One category may have more than one content and some of the content associated with each category may have some records flagged for deletion and others may not.
Below is the SQL I'm working with.
Count(CONTENT.contentId) as cntDifference SUM(CASE WHEN cntDifference = 1
then 1 else 0) has also not proven to be successful.
SELECT CATEGORY.categoryId, CATEGORY.categoryTitle, CATEGORY.categoryDate,
Count(CONTENT.contentId) AS cntTotal, Last(CONTENT.contentDate) AS cntDate,
CATEGORY.isDeleted AS catDel
FROM CATEGORY
LEFT JOIN CONTENT ON CATEGORY.categoryId = CONTENT.categoryId
GROUP BY CATEGORY.categoryId, CATEGORY.categoryTitle,
CATEGORY.categoryDate, CATEGORY.userLevel, CATEGORY.isDeleted HAVING
(((CATEGORY.isDeleted)=0))
ORDER BY CATEGORY.categoryTitle

HAVING (((CATEGORY.isDeleted)=0))
filters out the deleted records so your non-deleted count is the same as the all records count.
To get counts of deleted/non-deleted remove it and use this:
SUM(IIF(CATEGORY.isDeleted=0,1,0)) AS CountOfNonDeleted
for non-deleted
and
SUM(IIF(CATEGORY.isDeleted=1,1,0)) AS CountOfDeleted
for deleted. You can also use these expressions to get difference between total, deleted and non-deleted counts.

Related

Need help wrapping head around joins

I have a database of a service that helps people sell things. If they fail a delivery of a sale, they get penalised. I am trying to extract the number of active listings each user had when a particular penalty was applied.
I have the equivalent to the following tables(and relevant fields):
user (id)
listing (id, user_id, status)
transaction (listing_id, seller_id)
listing_history (id, listing_status, date_created)
penalty (id, transaction_id, user_id, date_created)
The listing_history table saves an entry every time a listing is modified, saving a record of what the new state of the listing is.
My goal is to end with a result table with the field: penalty_id, and number of active listings the penalised user had when the penalty was applied.
So far I have the following:
SELECT s1.penalty_id,
COUNT(s1.record_id) 'active_listings'
FROM (
SELECT penalty.id AS 'penalty_id',
listing_history.id AS 'record_id',
FROM user
JOIN penalty ON penalty.user_id = user.id
JOIN transaction ON transaction.id = penalty.transaction_id
JOIN listing_history ON listing_history.listing_id = listing.id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
) s1
GROUP BY s1.penalty_id
Status = 0 means that the listing is active (or that the listing was active at the time the record was created). I got results similar to what I expected, but I fear I may be missing something or may be doing the JOINs wrong. Would this have your approval? (apart from the obvious non-use of aliases, for clarity problems).
UPDATE - As the comments on this answer indicate that changing the table structure isn't an option, here are more details on some queries you could use with the existing structure.
Note that I made a couple changes to the query before even modifying the logic.
As viki888 pointed out, there was a problem reference to listing.id; I've replaced it.
There was no real need for a subquery in the original query; I've simplified it out.
So the original query is rewritten as
SELECT penalty.id AS 'penalty_id'
, COUNT(listing_history.id) 'active_listings'
FROM user
JOIN penalty
ON penalty.user_id = user.id
JOIN transaction
ON transaction.id = penalty.transaction_id
JOIN listing_history
ON listing_history.listing_id = transaction.listing_id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
GROUP BY penalty.id
Now the most natural way, in my opinion, to write the corrected timeline constraint is with a NOT EXISTS condition that filters out all but the most recent listing_history record for a given id. This does require thinking about some edge cases:
Could two listing history records have the same create date? If so, how do you decide which happened first?
If a listing history record is created on the same day as the penalty, which is treated as happening first?
If the created_date is really a timestamp, then this may not matter much (if at all); if it's really a date, it might be a bigger issue. Since your original query required that the listing history be created before the penalty, I'll continue in that style; but it's still ambiguous how to handle the case where two history records with matching status have the same date. You may need to adjust the date comparisons to get the desired behavior.
SELECT penalty.id AS 'penalty_id'
, COUNT(DISTINCT listing_history.id) 'active_listings'
FROM user
JOIN penalty
ON penalty.user_id = user.id
JOIN transaction
ON transaction.id = penalty.transaction_id
JOIN listing_history
ON listing_history.listing_id = transaction.listing_id
WHERE listing_history.date_created < penalty.date_created
AND listing_history.status = 0
AND NOT EXISTS (SELECT 1
FROM listing_history h2
WHERE listing_history.date_created < h2.date_created
AND h2.date_created < penalty.date_created
AND h2.id = listing_history.id)
GROUP BY penalty.id
Note that I switched from COUNT(...) to COUNT(DISTINCT ...); this helps with some edge cases where two active records for the same listing might be counted.
If you change the date comparisons to use <= instead of < - or, equivalently, if you use BETWEEN to combine the date comparisons - then you'd want to add AND h2.status != 0 (or AND h2.status <> 0, depending on your database) to the subquery so that two concurrent ACTIVE records don't cancel each other out.
There are several equivalent ways to write this, and unfortunately its the kind of query that doesn't always cooperate with a database query optimizer so some trial and error may be necessary to make it run well with large data volumes. Hopefully that gives enough insight into the intended logic that you could work out some equivalents if need be. You could consider using NOT IN instead of NOT EXISTS; or you could use an outer join to a second instance of LISTING_HISTORY... There are probably others I'm not thinking of off hand.
I don't know that we're in a position to sign off on a general statement that the query is, or is not, "correct". If there's a specific question about whether a query will include/exclude a record in a specific situation (or why it does/doesn't, or how to modify it so it won't/will), those might get more complete answers.
I can say that there are a couple likely issues:
The only glaring logic issue has to do with timeline management, which is something that causes a lot of trouble with SQL. The issue is, while your query demonstrates that the listing was active at some point before the penalty creation date, it doesn't demonstrate that the listing was still active on the penalty creation date. Consider
PENALTY
id transaction date
1 10 2016-02-01
TRANSACTION
id listing_id
10 100
LISTING_HISTORY
listing_id status date
100 0 2016-01-01
100 1 2016-01-15
The joins would create a single record, and the count for penalty 1 would include listing 100 even though its status had changed to something other than 0 before the penalty was created.
This is hard - but not impossible - to fix with your existing table structure. You could add a NOT EXISTS condition looking for another LISTING_HISTORY record matching the ID with a date between the first LISTING_HISTORY date and the PENALTY date, for one.
It would be more efficient to add an end date to the LISTING_HISTORY date, but that may not be so easy depending on how the data is maintained.
The second potential issue is the COUNT(RECORD_ID). This may not do what you mean - what COUNT(x) may intuitively seem like it should do, is what COUNT(DISTINCT RECORD_ID) actually does. As written, if the join produces two matches with the same LISTING_HISTORY.ID value - i.e. the listing became active at two different times before the penalty - the listing would be counted twice.

Oracle SQL: Show Individual Count Even if There are Duplicates

this is my first question on here, so please forgive me if I break any rules.
Here is what I need to know:
How do I create an Oracle SQL query that will display a unique count of something even if there are duplicates in the results?
Example: a customer table has a list of purchases made by various customers. The table lists the customer ID, name, category of purchase (ie Hardware, Tools, Seasonal) ect. The outcome of the query needs to show each customer id, customer name and the category of the purchase, and a count of the individual customer. SO customer ID 1 for John Smith has made a purchase in each department. If I do a count of the customer, he will appear three times as he has made three purchases, but I also need a column to count the customer only once. The count in the other rows returned for the other departments should show a 0 or Null.
I normally achieve this by pulling everything and exporting to excel. I add a column that uses an IF formula on the ID to only show a 1 on the first occurrence of the customer IE: IF(A3=A2,0,1) (if a3 is the same as A2, show a 0, if it's not the same as A2 then show a 1). This will give me a unique count of customers for one part of the report and will still show me how many purchase the customer made in another part of the report.
I want to do this directly in the SQL query as I have a large set of data this needs to be done on, and adding any formulas in excel will make the sheet huge. This will also make it easier to host the query results in ACCESS so excel can pull it from there.
I have tried to find a solution to this for a while, but any searching on Google will usually return results on how to remove duplicates form a table or how to count the duplicates in a table.
I am sorry if this is long question, but I wanted to be through so I do not waste anyone's time on back an fourth comments (I have seen this many times on here and else where when the OP asks a very cryptic question and expects everyone to understand them without further expiation).
Using distinct can be used in a count to only count the unique values of a field.
SELECT
cust.customer_id, cust.customer_name, p.category,
count(distinct p.department_id) as total_departments,
count(*) as total_purchases
FROM customers cust
LEFT JOIN purchase_table p on (cust.customer_id = p.customer_id)
GROUP BY cust.customer_id, cust.customer_name, p.category
ORDER BY cust.customer_id;
Such method is not limited to the Oracle RDBMS.

Full outer join joining together every record multiple times

Query below:
select
cu.course_id as 'bb_course_id',
cu.user_id as 'bb_user_id',
cu.role as 'bb_role',
cu.available_ind as 'bb_available_ind',
CASE cu.row_status WHEN 0 THEN 'ENABLED' ELSE 'DISABLED' END AS 'bb_row_status',
eff.course_id as 'registrar_course_id',
eff.user_id as 'registrar_user_id',
eff.role as 'registrar_role',
eff.available_ind as 'registrar_available_ind',
CASE eff.row_status WHEN 'DISABLE' THEN 'DISABLED' END as 'registrar_row_status'
into enrollments_comparison_temp
from narrowed_users_enrollments cu
full outer join enrollments_feed_file eff on cu.course_id = eff.course_id
Quick background: I'm taking the data from a replicated table and selecting it into narrowed_users_enrollments based on some criteria. In a script I'm taking a text feed file, with enrollment data, and inserting it into enrollments_feed_file. The purpose is to compare the most recent enrollment data with enrollments already in the database.
However the issue is that joining these tables results in about 160,000 rows when I'm really only expecting about 22,000. The point of doing this comparison is so that I can look for nulled values on either side of the join. For example, if the table on the right contains a null, then disable the enrollment record. If the table on the left contains a null, then add this student's enrollment.
I know it's a little off because I'm not using PKs or FKs. This is what is selected into the table:
Here's a screenshot showing a select * from the enrollments table on the left and a feed file on the right.
http://i.imgur.com/0ZPZ9HS.png
Here's a screenshot showing the newly created table from the full outer join.
http://i.imgur.com/89ssAkS.png
As you can see even though there there's only one matching enrollment(the matching jmartinez12 columns), there's 4 extra rows created for the same record on the left for the enrollments on the right. What I'm trying to get is for it to be 5 rows, with the first being how it is in the screenshot(matching pre-existing enrollment and enrollment in the feed file), BUT, the next 4 rows with the bb_* columns should be NULL up to the registrar_course_id.
Am I overlooking something simple here? I've tried a select distinct and I've added a where clause specifying when the course_ids are equal however that ensures that I won't get null rows which I need. I have also joined the tables on the user_id however the results are still the same.
One quick suggestion is to add the DISTNCT clause. If the records you are setting are complete duplicates that may cut it down to what you are expecting.
The fix was to also join on:
ON cu.course_id = eff.course_id AND cu.user_id = eff.user_id

Update values in each row based on foreign_key value

Downloads table:
id (primary key)
user_id
item_id
created_at
updated_at
The user_id and item_id in this case are both incorrect, however, they're properly stored in the users and items table, respectively (import_id for in each table). Here's what I'm trying to script:
downloads.each do |download|
user = User.find_by_import_id(download.user_id)
item = item.find_by_import_id(download.item_id)
if user && item
download.update_attributes(:user_id => user.id, :item.id => item.id)
end
end
So,
look up the user and item based on
their respective "import_id"'s. Then
update those values in the download record
This takes forever with a ton of rows. Anyway to do this in SQL?
If I understand you correctly, you simply need to add two sub-querys in your SELECT statement to lookup the correct IDs. For example:
SELECT id,
(SELECT correct_id FROM User WHERE import_id=user_id) AS UserID,
(SELECT correct_id FROM Item WHERE import_id=item_id) AS ItemID,
created_at,
updated_at
FROM Downloads
This will translate your incorrect user_ids to whatever ID you want to come from the User table and it will do the same for your item_ids. The information coming from SQL will now be correct.
If, however, you want to update the tables with the correct information, you could write this like so:
UPDATE Downloads
SET user_id = User.user_id,
item_id = Item.item_id
FROM Downloads
INNER JOIN User ON Downloads.user_id = User.import_id
INNER JOIN Item ON Downloads.item_id = Item.import_id
WHERE ...
Make sure to put something in the WHERE clause so you don't update every record in the Downloads table (unless that is the plan). I rewrote the above statement to be a bit more optimized since the original version had two SELECT statements per row, which is a bit intense.
Edit:
Since this is PostgreSQL, you can't have the table name in both the UPDATE and the FROM section. Instead, the tables in the FROM section are joined to the table being updated. Here is a quote about this from the PostgreSQL website:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the fromlist, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
http://www.postgresql.org/docs/8.1/static/sql-update.html
With this in mind, here is an example that I think should work (can't test it, sorry):
UPDATE Downloads
SET user_id = User.user_id,
item_id = Item.item_id
FROM User, Item
WHERE Downloads.user_id = User.import_id AND
Downloads.item_id = Item.import_id
That is the basic idea. Don't forget you will still need to add extra criteria to the WHERE section to limit the rows that are updated.
i'm totally guessing from your question, but you have some kind of lookup table that will match an import user_id with the real user_id, and similarly from items. i.e. the assumption is your line of code:
User.find_by_import_id(download.user_id)
hits the database to do the lookup. the import_users / import_items tables are just the names i've given to the lookup tables to do this.
UPDATE downloads
SET downloads.user_id = users.user_id
, downloads.item_id = items.items_id
FROM downloads
INNER JOIN import_users ON downloads.user_id = import_users.import_user_id
INNER JOIN import_items ON downloads.item_id = import_items.import_item_id
Either way (lookup is in DB, or it's derived from code), would it not just be easier to insert the information correctly in the first place? this would mean you can't have any FK's on your table since sometimes they point to one table, and others they point to another. seems a bit odd.

How do I remove "duplicate" rows from a view?

I have a view which was working fine when I was joining my main table:
LEFT OUTER JOIN OFFICE ON CLIENT.CASE_OFFICE = OFFICE.TABLE_CODE.
However I needed to add the following join:
LEFT OUTER JOIN OFFICE_MIS ON CLIENT.REFERRAL_OFFICE = OFFICE_MIS.TABLE_CODE
Although I added DISTINCT, I still get a "duplicate" row. I say "duplicate" because the second row has a different value.
However, if I change the LEFT OUTER to an INNER JOIN, I lose all the rows for the clients who have these "duplicate" rows.
What am I doing wrong? How can I remove these "duplicate" rows from my view?
Note:
This question is not applicable in this instance:
How can I remove duplicate rows?
DISTINCT won't help you if the rows have any columns that are different. Obviously, one of the tables you are joining to has multiple rows for a single row in another table. To get one row back, you have to eliminate the other multiple rows in the table you are joining to.
The easiest way to do this is to enhance your where clause or JOIN restriction to only join to the single record you would like. Usually this requires determining a rule which will always select the 'correct' entry from the other table.
Let us assume you have a simple problem such as this:
Person: Jane
Pets: Cat, Dog
If you create a simple join here, you would receive two records for Jane:
Jane|Cat
Jane|Dog
This is completely correct if the point of your view is to list all of the combinations of people and pets. However, if your view was instead supposed to list people with pets, or list people and display one of their pets, you hit the problem you have now. For this, you need a rule.
SELECT Person.Name, Pets.Name
FROM Person
LEFT JOIN Pets pets1 ON pets1.PersonID = Person.ID
WHERE 0 = (SELECT COUNT(pets2.ID)
FROM Pets pets2
WHERE pets2.PersonID = pets1.PersonID
AND pets2.ID < pets1.ID);
What this does is apply a rule to restrict the Pets record in the join to to the Pet with the lowest ID (first in the Pets table). The WHERE clause essentially says "where there are no pets belonging to the same person with a lower ID value).
This would yield a one record result:
Jane|Cat
The rule you'll need to apply to your view will depend on the data in the columns you have, and which of the 'multiple' records should be displayed in the column. However, that will wind up hiding some data, which may not be what you want. For example, the above rule hides the fact that Jane has a Dog. It makes it appear as if Jane only has a Cat, when this is not correct.
You may need to rethink the contents of your view, and what you are trying to accomplish with your view, if you are starting to filter out valid data.
So you added a left outer join that is matching two rows? OFFICE_MIS.TABLE_CODE is not unique in that table I presume? you need to restrict that join to only grab one row. It depends on which row you are looking for, but you can do something like this...
LEFT OUTER JOIN OFFICE_MIS ON
OFFICE_MIS.ID = /* whatever the primary key is? */
(select top 1 om2.ID
from OFFICE_MIS om2
where CLIENT.REFERRAL_OFFICE = om2.TABLE_CODE
order by om2.ID /* change the order to fit your needs */)
If the secondd row has one different value than it is not really duplicate and should be included.
Instead of using DISTINCT, you could use a GROUP BY.
Group by all the fields that you want to be returned as unique values.
Use MIN/MAX/AVG or any other function to give you one result for fields that could return multiple values.
Example:
SELECT Office.Field1, Client.Field1, MIN(Office.Field1), MIN(Client.Field2)
FROM YourQuery
GROUP BY Office.Field1, Client.Field1
You could try using Distinct Top 1 but as Hunter pointed out, if there is if even one column is different then it should either be included or if you don't care about or need the column you should probably remove it. Any other suggestions would probably require more specific info.
EDIT: When using Distinct Top 1 you need to have an appropriate group by statement. You would really be using the Top 1 part. The Distinct is in there because if there is a tie for Top 1 you'll get an error without having some way to avoid a tie. The two most common ways I've seen are adding Distinct to Top 1 or you could add a column to the query that is unique so that sql would have a way to choose which record to pick in what would otherwise be a tie.