SQL counting number of rows - sql

I am looking for a way to search for a certain number of rows as a quality check. For example, we have tables that have a certain set of results that are needed.
Here is a quick table for an example:
ID: Name: Result: Reportable:
ONE A 10 X
TWO B 12 X
THREE C 1
FOUR D 18 X
FOUR(redo) D 11 X
So we are looking to double check results as there are people who accidentally report results multiple times (as in the case with ID FOUR). We have used having counts but we need the numbers to be specific and need a query to verify that number is satisfied.
In the table above we only want IDs ONE, TWO, and FOUR, however we have 4 results (one extra). Currently we have our check showing the count needed (ie 3) and the current result count (4) to show the mismatch but want a query to easily only show the result needed. We would need the redo result most of the time so we have set it so we take the latest date, but it doesn't help filter how many rows or results. I apologize if anything is confusing and I am not able to share the SQL query that we have currently. It's my first time posting so if I need to clarify anything please let me know as this seems to be very complicated. Thank you for your time.
EDIT: The details
We have one table (Table A) letting us know which results are reportable. The ones that are reportable go into another table (Table B). We have had issues in which people have made too many results reportable which overpopulates the Table B. Our old query had a count in Table B, but due to mistakes in people placing multiple reportables, samples which had many redos seem to be finished as they were all placed and met the count in Table B.
So now by using the Table A that helps tell us how many are Reportable, we want this to double check that the samples are indeed ready.

As I understand the question, you want ids that have multiple reportables. Assuming you really mean name, then:
select name
from t
where reportable = 'X'
group by name
having count(*) >= 2;

Related

Create and define functions to apply to different segments/filters that exist (Instead of doing one very long SQL query)

I am trying to come up with some arithmetic calculations for some survey data. I want to do these calculations for a number of segments and want to figure out how to do it without writing numerous SELECT statements.
This is what I have so far:
FACT table. This tables holds survey data at a respondent level - for example, if a survey had 10 questions, this table will have 11 columns: a column to identify the respondent_ID and 10 other columns to identify the responses to those questions.
DIMENSION table. This table segments we want to view the survey data by at a respondent level - for example, if we want to view survey responses by membership_status and age_bracket, this table will have 3 columns: a column to identify the respondent_ID, and two columns to identify membership_status and age_bracket.
OUTPUT.
I want to get aggregate calculations to summarizes the responses to the survey overall and to each question. I also want to be able to get this information for all possible segments that exist in the DIMENSIONS table.
I can do the query below, however I'll need to do this for every segment:
SELECT
COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IN ('8', '9', '10') THEN f.RESPONDENT_ID END))*1.0 / COUNT(DISTINCT(CASE WHEN f.QUESTION_1 IS NOT NULL THEN f.RESPONDENT_ID END))*1.0 AS CSAT_1
FROM FACT f
JOIN DIMENSION d ON f.RESPONDENT_ID = d.RESPONDENT_ID
WHERE d.MEMBERSHIP_STATUS = 'ACTIVE'
The calculation above gives us something called a top 3 box. That is just one calculation, I will need to do many of them. Additionally, ever calculation will need to be done for each segment. In order to get a calculation for nonactive members, I would need to run another query and set d.MEMBERSHIP_STATUS = 'INACTIVE' and I would need to run another query with no filter, to get the overall calculation.
Is there a way I could store all my arithmetic calculations needed in my output as a function (maybe in a temp table or something) - my thought is that it'll be better to set the functions somewhere, and then when I need to calculate the output, I would some how call the function to do all the calculations I need, and give me all the calculations for every segment I have?
I can't fully envision how to get there, or if this is even a good solution, so guidance and detailed SQL code would be extremely helpful.Examples please!

T-SQL query for SQL Server 2008 : how to query X # of rows where X is a value in a query while matching on another column

Summary:
I have a list of work items that I am attempting to assign to a list of workers. Each working is allowed to only have a max of 100 work items assigned to them. Each work item specifies the user that should work it (associated as an owner).
For example:
Jim works a total of 5 accounts each with multiple work items. In total jim has 50 items to work already assigned to him. I am allowed to assign only 50 more.
My plight/goal:
I am using a temp table and a select statement to get the # of items each owner has currently assigned to them and I calculate the available slots for new items and store the values in new column. I need to be able to select from the items table where the owner matches my list of owners and their available items(in the temp table), only retrieving the number of rows for each user equal to the number of available slots per user - query would return only 50 rows for jim even though there may be 200 matching the criteria while sam may get 0 rows because he has no available slots while there are 30 items for him to work in the items table.
I realize I may be approaching this problem wrong. I want to avoid using a cursor.
Edit: Adding some example code
SELECT
nUserID_Owner
, CASE
WHEN COUNT(c.nWorkID) >= 100 THEN 0
ELSE 100 - COUNT(c.nWorkID)
END
,COUNT(c.nWorkID)
FROM tblAccounts cic
LEFT JOIN tblWorkItems c
ON c.sAccountNumber = cic.sAccountNumber
AND c.nUserID_WorkAssignedTo = cic.nUserID_Owner
AND c.nTeamID_WorkAssignedTo = cic.nTeamID_Owner
WHERE cic.nUserID_Collector IS NOT NULL
AND nUserID_CurrentOwner = 5288
AND c.bCompleted = 0
GROUP BY nUserID_Owner
This provides output vaulues of 5288, 50, 50 (in Jim's scenario)
It took longer than I wanted it to but I found a solution.
I did use a sub-query as suggested above to produce the work items with a unique row count by user.
I used PARTITION BY to produce a unique row count for each worker and included in my HAVING clause that the row number must be < the count of available slots. I'd post the code but it's beyond the char limit and I'd also have a lot of things to change to anon the system properly.
Originally I was approaching the problem incorrectly focusing on limiting the results rather than thinking about creating the necessary data to relate the result sets.

MS SQL 2000 - How to efficiently walk through a set of previous records and process them in groups. Large table

I'd like to consult one thing. I have table in DB. It has 2 columns and looks like this:
Name...bilance
Jane...+3
Jane...-5
Jane...0
Jane...-8
Jane...-2
Paul...-1
Paul...2
Paul....9
Paul...1
...
I have to walk through this table and if I find record with different "name" (than was on previous row) I process all rows with the previous "name". (If I step on the first Paul row I process all Jane rows)
The processing goes like this:
Now I work only with Jane records and walk through them one by one. On each record I stop and compare it with all previous Jane rows one by one.
The task is to sumarize "bilance" column (in the scope of actual person) if they have different signs
Summary:
I loop through this table in 3 levels paralelly (nested loops)
1st level = search for changes of "name" column
2nd level = if change was found, get all rows with previous "name" and walk through them
3rd level = on each row stop and walk through all previous rows with current "name"
Can this be solved only using CURSOR and FETCHING, or is there some smoother solution?
My real table has 30 000 rows and 1500 people and If I do the logic in PHP, it takes long minutes and than timeouts. So I would like to rewrite it to MS SQL 2000 (no other DB is allowed). Are cursors fast solution or is it better to use something else?
Thank you for your opinions.
UPDATE:
There are lots of questions about my "summarization". Problem is a little bit more difficult than I explained. I simplified it just to describe my algorithm.
Each row of my table contains much more columns. The most important is month. That's why there are more rows for each person. Each is for different month.
"Bilances" are "working overtimes" and "arrear hours" of workers. And I need to sumarize + and - bilances to neutralize them using values from previous months. I want to have as many zeroes as possible. All the table must stay as it is, just bilances must be changed to zeroes.
Example:
Row (Jane -5) will be summarized with row (Jane +3). Instead of 3 I will get 0 and instead of -5 I will get -2. Because I used this -5 to reduce +3.
Next row (Jane 0) won't be affected
Next row (Jane -8) can not be used, because all previous bilances are negative
etc.
You can sum all the values per name using a single SQL statement:
select
name,
sum(bilance) as bilance_sum
from
my_table
group by
name
order by
name
On the face of it, it sounds like this should do what you want:
select Name, sum(bilance)
from table
group by Name
order by Name
If not, you might need to elaborate on how the Names are sorted and what you mean by "summarize".
I'm not sure what you mean by this line... "The task is to sumarize "bilance" column (in the scope of actual person) if they have different signs".
But, it may be possible to use a group by query to get a lot of what you need.
select name, case when bilance < 0 then 'negative' when bilance >= 0 then 'positive', count(*)
from table
group by name, bilance
That might not be perfect syntax for the case statement, but it should get you really close.

Group by run when there is no run number in data (was Show how changing the length of a production run affects time-to-build)

It would seem that there is a much simpler way to state the problem. Please see Edit 2, following the sample table.
I have a number of different products on a production line. I have the date that each product entered production. Each product has two identifiers: item number and serial number I have the total number of labour hours for each product by item number and by serial number (i.e. I can tell you how many hours went into each object that was manufactured and what the average build time is for each kind of object).
I want to determine how (if) varying the length of production runs affects the average time it takes to build a product (item number). A production run is the sequential production of multiple serial numbers for a single item number. We have historical records going back several years with production runs varying in length from 1 to 30.
I think to achieve this, I need to be able to assign 'run id'. To me, that means building a query that sorts by start date and calculates a new unique value at each change in item number. If I knew how to do that, I could solve the rest of the problem on my own.
So that suggests a series of related questions:
Am I thinking about this the right way?
If I am on the right track, how do I generate those run id values? Calculate and store is an option, although I have a (misguided?) preference for direct queries. I know exactly how I would generate the run numbers in Excel, but I have a (misguided?) preference to do this in the database.
If I'm not on the right track, where might I find that track? :)
Edit:
Table structure (simplified) with sample data:
AutoID Item Serial StartDate Hours RunID (proposed calculation)
1 Legend 1234 2010-06-06 10 1
3 Legend 1235 2010-06-07 9 1
2 Legend 1237 2010-06-08 8 1
4 Apex 1236 2010-06-09 12 2
5 Apex 1240 2010-06-10 11 2
6 Legend 1239 2010-06-11 10 3
7 Legend 1238 2010-06-12 8 3
I have shown that start date, serial, and autoID are mutually unrelated. I have shown the expectation that labour goes down as the run length increases (but this is a 'fact' only via received wisdom, not data analysis). I have shown what I envision as the heart of the solution, that being a RunID that reflects sequential builds of a single item. I know that if I could get that runID, I could group by run to get counts, averages, totals, max, min, etc. In addition, I could do something like hours/ to get percentage change from the start of the run. At that point I could graph the trends associated with different run lengths either globally across all items or on a per item basis. (At least I think I could do all that. I might have to muck about a bit, but I think I could get it done.)
Edit 2: This problem would appear to be: how do I get the 'starting' member (earliest start date) of each run when I don't already have a runID? (The runID shown in the sample table does not exist and I was originally suggesting that being able to calculate runID was a potentially viable solution.)
AutoID Item
1 Legend
4 Apex
6 Legend
I'm assuming that having learned how to find the first member of each run that I would then be able to use what I've learned to find the last member of each run and then use those two results to get all other members of each run.
Edit 3: my version of a query that uses the AutoID of the first item in a run as the RunID for all units in a run. This was built entirely from samples and direction provided by Simon, who has the accepted answer. Using this as the basis for grouping by run, I can produce a variety of run statistics.
SELECT first_product_of_run.AutoID AS runID, run_sibling.AutoID AS itemID, run_sibling.Item, run_sibling.Serial, run_sibling.StartDate, run_sibling.Hours
FROM (SELECT first_of_run.AutoID, first_of_run.Item, first_of_run.Serial, first_of_run.StartDate, first_of_run.Hours
FROM dbo.production AS first_of_run LEFT OUTER JOIN
dbo.production AS earlier_in_run ON first_of_run.AutoID - 1 = earlier_in_run.AutoID AND
first_of_run.Item = earlier_in_run.Item
WHERE (earlier_in_run.AutoID IS NULL)) AS first_product_of_run LEFT OUTER JOIN
dbo.production AS run_sibling ON first_product_of_run.Item = run_sibling.Item AND first_product_of_run.AutoID run_sibling.AutoID AND
first_product_of_run.StartDate product_between.Item AND
first_product_of_run.StartDate
Could you describe your table structure some more? If the "date that each product entered production" is a full time stamp, or if there is a sequential identifier across products, you can write queries to identify the first and last products of a run. From that, you can assign IDs to or calculate the length of the runs.
Edit:
Once you've identified 1,4, and 6 as the start of a run, you can use this query to find the other IDs in the run:
select first_product_of_run.AutoID, run_sibling.AutoID
from first_product_of_run
left join production run_sibling on first_product_of_run.Item = run_sibling.Item
and first_product_of_run.AutoID <> run_sibling.AutoID
and first_product_of_run.StartDate < run_sibling.StartDate
left join production product_between on first_product_of_run.Item <> product_between.Item
and first_product_of_run.StartDate < product_between.StartDate
and product_between.StartDate < run_sibling.StartDate
where product_between.AutoID is null
first_product_of_run can be a temp table, table variable, or sub-query that you used to find the start of a run. The key is the where product_between.AutoID is null. That restricts the results to only pairs where no different items were produced between them.
Edit 2, here's how to get the first of each run:
select first_of_run.AutoID
from
(
select product.AutoID, product.Item, MAX(previous_product.StartDate) as PreviousDate
from production product
left join production previous_product on product.AutoID <> previous_product.AutoID
and product.StartDate > previous_product.StartDate
group by product.AutoID, product.Item
) first_of_run
left join production earlier_in_run
on first_of_run.PreviousDate = earlier_in_run.StartDate
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
It's not pretty, and will break if StartDate is not unique. The query could be simplified by adding a sequential and unique identifier with no gaps. In fact, that step will probably be necessary if StartDate is not unique. Here's how it would look:
select first_of_run.AutoID
from production first_of_run
left join production earlier_in_run
on (first_of_run.Sequence - 1) = earlier_in_run.Sequence
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
Using outer joins to find where things aren't still twists my brain, but it's a very powerful technique.

What is an unbounded query?

Is an unbounded query a query without a WHERE param = value statement?
Apologies for the simplicity of this one.
An unbounded query is one where the search criteria is not particularly specific, and is thus likely to return a very large result set. A query without a WHERE clause would certainly fall into this category, but let's consider for a moment some other possibilities. Let's say we have tables as follows:
CREATE TABLE SALES_DATA
(ID_SALES_DATA NUMBER PRIMARY KEY,
TRANSACTION_DATE DATE NOT NULL
LOCATION NUMBER NOT NULL,
TOTAL_SALE_AMOUNT NUMBER NOT NULL,
...etc...);
CREATE TABLE LOCATION
(LOCATION NUMBER PRIMARY KEY,
DISTRICT NUMBER NOT NULL,
...etc...);
Suppose that we want to pull in a specific transaction, and we know the ID of the sale:
SELECT * FROM SALES_DATA WHERE ID_SALES_DATA = <whatever>
In this case the query is bounded, and we can guarantee it's going to pull in either one or zero rows.
Another example of a bounded query, but with a large result set would be the one produced when the director of district 23 says "I want to see the total sales for each store in my district for every day last year", which would be something like
SELECT LOCATION, TRUNC(TRANSACTION_DATE), SUM(TOTAL_SALE_AMOUNT)
FROM SALES_DATA S,
LOCATION L
WHERE S.TRANSACTION_DATE BETWEEN '01-JAN-2009' AND '31-DEC-2009' AND
L.LOCATION = S.LOCATION AND
L.DISTRICT = 23
GROUP BY LOCATION,
TRUNC(TRANSACTION_DATE)
ORDER BY LOCATION,
TRUNC(TRANSACTION_DATE)
In this case the query should return 365 (or fewer, if stores are not open every day) rows for each store in district 23. If there's 25 stores in the district it'll return 9125 rows or fewer.
On the other hand, let's say our VP of Sales wants some data. He/she/it isn't quite certain what's wanted, but he/she/it is pretty sure that whatever it is happened in the first six months of the year...not quite sure about which year...and not sure about the location, either - probably in district 23 (he/she/it has had a running feud with the individual who runs district 23 for the past 6 years, ever since that golf tournament where...well, never mind...but if a problem can be hung on the door of district 23's director so be it!)...and of course he/she/it wants all the details, and have it on his/her/its desk toot sweet! And thus we get a query that looks something like
SELECT L.DISTRICT, S.LOCATION, S.TRANSACTION_DATE,
S.something, S.something_else, S.some_more_stuff
FROM SALES_DATA S,
LOCATIONS L
WHERE EXTRACT(MONTH FROM S.TRANSACTION_DATE) <= 6 AND
L.LOCATION = S.LOCATION
ORDER BY L.DISTRICT,
S.LOCATION
This is an example of an unbounded query. How many rows will it return? Good question - that depends on how business conditions were, how many location were open, how many days there were in February, etc.
Put more simply, if you can look at a query and have a pretty good idea of how many rows it's going to return (even though that number might be relatively large) the query is bounded. If you can't, it's unbounded.
Share and enjoy.
http://hibernatingrhinos.com/Products/EFProf/learn#UnboundedResultSet
An unbounded result set is where a query is performed and does not explicitly limit the number of returned results from a query. Usually, this means that the application assumes that a query will always return only a few records. That works well in development and in testing, but it is a time bomb waiting to explode in production.
The query may suddenly start returning thousands upon thousands of rows, and in some cases, it may return millions of rows. This leads to more load on the database server, the application server, and the network. In many cases, it can grind the entire system to a halt, usually ending with the application servers crashing with out of memory errors.
Here is one example of a query that will trigger the unbounded result set warning:
var query = from post in blogDataContext.Posts
where post.Category == "Performance"
select post;
If the performance category has many posts, we are going to load all of them, which is probably not what was intended. This can be fixed fairly easily by using pagination by utilizing the Take() method:
var query = (from post in blogDataContext.Posts
where post.Category == "Performance"
select post)
.Take(15);
Now we are assured that we only need to handle a predictable, small result set, and if we need to work with all of them, we can page through the records as needed. Paging is implemented using the Skip() method, which instructs Entity Framework to skip (at the database level) N number of records before taking the next page.
But there is another common occurrence of the unbounded result set problem from directly traversing the object graph, as in the following example:
var post = postRepository.Get(id);
foreach (var comment in post.Comments)
{
// do something interesting with the comment
}
Here, again, we are loading the entire set without regard for how big the result set may be. Entity Framework does not provide a good way of paging through a collection when traversing the object graph. It is recommended that you would issue a separate and explicit query for the contents of the collection, which will allow you to page through that collection without loading too much data into memory.