Oracle SQL: Grouping and allocating data based on 3 criteria

Quite a long-winded question:
As a hypothetical situation, I am trying to split a table of data between two companies: OPM and MON.
|NAME |ACCOUNT |BALANCE |COMPANY
_______________________________________________
|SMITH |11111 |100 |
|SMITH |11111 |150 |
|HUNTER |11121 |200 |
|HUNTER |11131 |250 |
|LITTLE |11141 |300 |
|RIDLEY |11151 |300 |
|RIDLEY |11151 |100 |
|ARMSTRONG |11161 |150 |
|ARMSTRONG |11171 |150 |
|HENRY |11181 |100 |
There are several scenarios in the customer data:
1. The customer has two accounts with the same account number but different balances.
2. The customer has two accounts with different account numbers and different balances.
3. The customer has one account, with one account number and one balance.
I need to write logic in SQL or PL/SQL that allocates the data to one of the two companies while following these rules:
A customer, regardless of how many accounts, must be allocated to the same company.
The value of accounts must be roughly equal.
The volume of accounts must be roughly equal.
I accept the limitations of the data I have provided, but the logic must extrapolate to larger datasets. What is the best method to achieve this?

What you are trying to do is a bin-packing problem, and this is generally hard. However, you simply state that the two groups need to be approximately equal. So, I would suggest adding up the balances for each customer and taking a stratified sample:
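select name, balance,
       (case when mod(seqnum, 2) = 0 then 'Company1' else 'Company2' end) as company
from (select name, sum(balance) as balance,
             -- order by the aggregate; a bare "balance" here would be the ungrouped column
             row_number() over (order by sum(balance)) as seqnum
      from t  -- t stands for your table
      group by name
     ) n;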
Note: this is an approximate approach. It alternates customers in order of total balance, so half the customers land in each group and the groups should have similar total balances. There are many cases where this will not produce an optimal solution (such as when one "name" has a very large balance compared to everyone else, or when the number of accounts per customer varies widely), but it might be good enough.
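As a rough illustration of the alternating assignment described above, here is a minimal Python sketch run against the sample rows from the question. The variable names and the tie-break on name are assumptions made for the example; they are not part of the original query.

```python
from collections import defaultdict

# Sample rows from the question: (name, account, balance).
rows = [
    ("SMITH", 11111, 100), ("SMITH", 11111, 150),
    ("HUNTER", 11121, 200), ("HUNTER", 11131, 250),
    ("LITTLE", 11141, 300),
    ("RIDLEY", 11151, 300), ("RIDLEY", 11151, 100),
    ("ARMSTRONG", 11161, 150), ("ARMSTRONG", 11171, 150),
    ("HENRY", 11181, 100),
]

# Step 1: total balance per customer (the GROUP BY name).
balance_by_name = defaultdict(int)
for name, _account, balance in rows:
    balance_by_name[name] += balance

# Step 2: number customers by ascending total balance (the ROW_NUMBER).
# Ties are broken by name so the result is deterministic (an assumption).
ordered = sorted(balance_by_name.items(), key=lambda item: (item[1], item[0]))

# Step 3: even sequence numbers go to Company1, odd ones to Company2.
allocation = {}
totals = defaultdict(int)
for seqnum, (name, balance) in enumerate(ordered, start=1):
    company = "Company1" if seqnum % 2 == 0 else "Company2"
    allocation[name] = company
    totals[company] += balance

print(allocation)
print(dict(totals))
```

On this sample the totals come out as 1000 against 800: every customer stays with one company, but the balance between the two is only approximate.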

Related

Ignore NULLs in GROUP BY? Or a better way to combine row that fill gaps in data?

Due to partial duplicates in some of my database, after some LEFT JOINs I wind up with several (but not all) rows where I have partial data, along with NULLs. For a unique user, one row may have a ZIP code, and another row may have the STATE of that same user.
Let me show you an example:
|email |state |zip |
|-----------------|------|------|
|unique#email.com |NULL |40502 |
|unique#email.com |KY |NULL |
|other#email.com |FL |34744 |
|other#email.com |FL |34744 |
|third#email.com |OH |NULL |
Rows that are full duplicates (such as other#email.com in my example) are easy enough to clean up with a GROUP BY clause, and some people, like third#email.com in my example, have NULLs, which is OK. But for unique#email.com I have the state in one row and the zip in another. What is the best way to combine those two rows into one?
A desired result would be:
|email |state |zip |
|-----------------|------|------|
|unique#email.com |KY |40502 |
|other#email.com |FL |34744 |
|third#email.com |OH |NULL |
For the data you have provided, you can use aggregation:
select email, max(state) as state, max(zip) as zip
from t
group by email;
That said, you can probably fix this in the query used to generate the data. Also, if you want multiple rows for a given email in the result set, then you should ask a new question with a clearer example of data.
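The aggregation works because MAX() skips NULLs and only returns NULL when every value in the group is NULL. A small sketch using Python's built-in sqlite3 module and the sample rows from the question (the in-memory database and the TEXT column types are assumptions for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (email TEXT, state TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("unique#email.com", None, "40502"),
        ("unique#email.com", "KY", None),
        ("other#email.com", "FL", "34744"),
        ("other#email.com", "FL", "34744"),
        ("third#email.com", "OH", None),
    ],
)

# MAX() ignores NULLs, so each group keeps its one non-NULL value per column.
merged = conn.execute(
    "SELECT email, MAX(state) AS state, MAX(zip) AS zip "
    "FROM t GROUP BY email ORDER BY email"
).fetchall()

for row in merged:
    print(row)
```

Note that if one email had two different non-NULL states, MAX() would silently pick the alphabetically larger one, so this only suits data where the rows fill each other's gaps.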

How to create two JOIN-tables so that I can compare attributes within?

I am taking a database course in which we have AirBnB listings and need to write some SQL queries against the relational model we made from the data, but I am struggling with one in particular:
I have two tables of interest, Billing and Amenities. The first has the id and price of each listing; the second has id and wifi (let's say, to simplify, that wifi equals 1 if there is Wifi and 0 otherwise). Both have other attributes that we don't really care about here.
So the query is, "What is the difference in the average price of listings with and without Wifi?"
My idea was to build two JOIN-tables, one with the listings that have wifi and the other with those that don't, and compare them:
SELECT avg(B.price - A.price) as averagePrice
FROM (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 0
) A, (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 1) B
WHERE A.id = B.id;
Obviously this doesn't work... I am pretty sure there is a far easier solution, though. What am I missing?
(And by the way, is there a way to compute the absolute value of the price difference?)
I hope that I was clear enough. Thank you for your time!
Edit: As mentioned in the comments, I forgot to say that both tables have id as their primary key, so there is one row per listing.
Just use conditional aggregation:
SELECT AVG(CASE WHEN a.wifi = 0 THEN b.price END) as avg_no_wifi,
AVG(CASE WHEN a.wifi = 1 THEN b.price END) as avg_wifi
FROM Billing b JOIN
Amenities a
ON b.id = a.id
WHERE a.wifi IN (0, 1);
Subtract one expression from the other (with -) if you want the difference instead of the two separate values.
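To make the conditional aggregation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The four listings and their prices are made up for illustration, and the ABS() call addresses the asker's side question about the absolute difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Billing (id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE Amenities (id INTEGER PRIMARY KEY, wifi INTEGER);
    -- Hypothetical sample data: listings 1 and 2 have wifi, 3 and 4 do not.
    INSERT INTO Billing VALUES (1, 100), (2, 120), (3, 80), (4, 60);
    INSERT INTO Amenities VALUES (1, 1), (2, 1), (3, 0), (4, 0);
""")

# A CASE without ELSE yields NULL for non-matching rows, and AVG skips NULLs,
# so each AVG only sees the prices of its own group.
avg_wifi, avg_no_wifi, abs_diff = conn.execute("""
    SELECT AVG(CASE WHEN a.wifi = 1 THEN b.price END) AS avg_wifi,
           AVG(CASE WHEN a.wifi = 0 THEN b.price END) AS avg_no_wifi,
           ABS(AVG(CASE WHEN a.wifi = 1 THEN b.price END)
             - AVG(CASE WHEN a.wifi = 0 THEN b.price END)) AS abs_diff
    FROM Billing b
    JOIN Amenities a ON b.id = a.id
    WHERE a.wifi IN (0, 1)
""").fetchone()

print(avg_wifi, avg_no_wifi, abs_diff)
```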
Let's assume we're working with data like the following (problems with your data model are noted below):
Billing
+------------+---------+
| listing_id | price |
+------------+---------+
| 1 | 1500.00 |
| 2 | 1700.00 |
| 3 | 1800.00 |
| 4 | 1900.00 |
+------------+---------+
Amenities
+------------+------+
| listing_id | wifi |
+------------+------+
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
+------------+------+
Notice that I changed "id" to "listing_id" to make it clear what it was (using "id" as an attribute name is problematic anyways). Also, note that one listing doesn't have an entry in the Amenities table. Depending on your data, that may or may not be a concern (again, refer to the bottom for a discussion of your data model).
Based on this data, your averages should be as follows:
Listings with wifi (1 and 2) average $1600.
Listings without wifi (just 3) average $1800.
So the difference would be $200.
To achieve this result in SQL, it may be helpful to first get the average cost per amenity (whether wifi is offered). This would be obtained with the following query:
SELECT
Amenities.wifi AS has_wifi,
AVG(Billing.price) AS avg_cost
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
which gives you the following results:
+----------+-----------------------+
| has_wifi | avg_cost |
+----------+-----------------------+
| 0 | 1800.0000000000000000 |
| 1 | 1600.0000000000000000 |
+----------+-----------------------+
So far so good. So now we need to calculate the difference between these 2 rows. There are a number of different ways to do this, but one is to use a CASE expression to make one of the values negative, and then simply take the SUM of the result (note that I'm using a CTE, but you can also use a sub-query):
WITH
avg_by_wifi(has_wifi, avg_cost) AS
(
SELECT Amenities.wifi, AVG(Billing.price)
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
)
SELECT
ABS(SUM
(
CASE
WHEN has_wifi = 1 THEN avg_cost
ELSE -1 * avg_cost
END
))
FROM avg_by_wifi
which gives us the expected value of 200.
Now regarding your data model:
If both your Billing and Amenities table only have 1 row for each listing, it makes sense to combine them into 1 table. For example: Listings(listing_id, price, wifi)
However, this is still problematic, because you probably have a bunch of other amenities you want to model (pool, sauna, etc.) So you might want to model a many-to-many relationship between listings and amenities using an intermediate table:
Listings(listing_id, price)
Amenities(amenity_id, amenity_name)
ListingsAmenities(listing_id, amenity_id)
This way, you could list multiple amenities for a given listing without having to add additional columns. It also becomes easy to store additional information about an amenity: What's the wifi password? How deep is the pool? etc.
Of course, using this model makes your original query (difference in average cost of listings by wifi) a bit trickier, but definitely still doable.
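One possible way to write the wifi comparison against the many-to-many model is sketched below with Python's built-in sqlite3 module. The sample rows, the 'wifi' and 'pool' amenity names, and the decision to treat a listing with no wifi row as "no wifi" are all assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Listings (listing_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE Amenities (amenity_id INTEGER PRIMARY KEY, amenity_name TEXT);
    CREATE TABLE ListingsAmenities (
        listing_id INTEGER REFERENCES Listings(listing_id),
        amenity_id INTEGER REFERENCES Amenities(amenity_id),
        PRIMARY KEY (listing_id, amenity_id)
    );
    -- Hypothetical data: listings 1 and 2 have wifi, 3 has only a pool, 4 has nothing.
    INSERT INTO Listings VALUES (1, 1500), (2, 1700), (3, 1800), (4, 1900);
    INSERT INTO Amenities VALUES (1, 'wifi'), (2, 'pool');
    INSERT INTO ListingsAmenities VALUES (1, 1), (2, 1), (2, 2), (3, 2);
""")

query = """
    WITH flagged AS (
        -- One row per listing, flagged 1 if a wifi link row exists, else 0.
        SELECT l.price,
               EXISTS (
                   SELECT 1
                   FROM ListingsAmenities la
                   JOIN Amenities a ON a.amenity_id = la.amenity_id
                   WHERE la.listing_id = l.listing_id
                     AND a.amenity_name = 'wifi'
               ) AS has_wifi
        FROM Listings l
    )
    SELECT ABS(AVG(CASE WHEN has_wifi = 1 THEN price END)
             - AVG(CASE WHEN has_wifi = 0 THEN price END))
    FROM flagged
"""
(difference,) = conn.execute(query).fetchone()
print(difference)
```

Unlike the inner join used earlier, this version counts listing 4 (which has no amenity rows at all) among the listings without wifi, so the two approaches can give different answers on incomplete data.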

Summing n numerical variables by grouping level specific to each

I am working through a group by problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: For each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: How can I sum employment (or sales) by county by year as in the example table above, given that county as a grouping variable changes sometimes by the year and specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question, this sounds relatively straightforward. While I normally prefer to work with normalized data, I don't see that normalizing things beforehand will buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or the information schema to generate the list of columns to sum up. The rest is a straight join and group by process, as far as I can tell.
If the variable changes by name, the best thing to do in my experience is to put together a location view based on that union and join against it. This lets you hide the complexity from your main queries and, as long as you don't also join the underlying tables, should perform quite well.
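One way to read the union idea above is to unpivot each year into rows, pairing that year's county with that year's employment, and then group by county and year. The sketch below uses Python's built-in sqlite3 module with two years only. The county table layout (county1990, county1991) is an assumption based on the question's remark that the other tables carry the same year postfix, and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (dunsnumber TEXT PRIMARY KEY, emp1990 INTEGER, emp1991 INTEGER);
    CREATE TABLE county (dunsnumber TEXT PRIMARY KEY, county1990 TEXT, county1991 TEXT);
    -- Hypothetical rows: business 'a' moves from county A to county B in 1991.
    INSERT INTO emp VALUES ('a', 12, 32), ('b', NULL, 2), ('c', 1, 1);
    INSERT INTO county VALUES ('a', 'A', 'B'), ('b', NULL, 'B'), ('c', 'A', 'A');
""")

years = [1990, 1991]

# Build one SELECT per year and glue them with UNION ALL, giving a long
# (year, county, emp) relation where county is the one valid for that year.
long_sql = " UNION ALL ".join(
    f"SELECT {year} AS year, c.county{year} AS county, e.emp{year} AS emp "
    f"FROM emp e JOIN county c ON c.dunsnumber = e.dunsnumber"
    for year in years
)

totals = conn.execute(
    f"SELECT county, year, SUM(emp) FROM ({long_sql}) "
    f"WHERE county IS NOT NULL "
    f"GROUP BY county, year ORDER BY county, year"
).fetchall()

for row in totals:
    print(row)
```

The long result can be pivoted back to one column per year afterwards if the wide layout is really needed. In PostgreSQL the same UNION ALL could be wrapped in a view so the main queries stay short.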

Items getting double-counted in SQL Server, dependent counting logic not working right

I am counting the number of RFIs (requests for info) from various agencies. Some of these agencies are also part of a task force (committee). Currently this SQL combines the agencies and task forces into one list and counts the RFIs for each. The problem is, if the RFI belongs to a task force (which is also assigned to an agency), I only want it to count for the task force and not for the agency. However, if the agency does not have a task force assigned to the RFI, I want it to still count for the agency. The RFIs are linked to various agencies through a _LinkEnd table, but that logic works just fine. Here is the logic thus far:
SELECT t.Submitting_Agency, COUNT(DISTINCT t.Count) AS RFICount
FROM (
SELECT RFI_.Submitting_Agency, RFI_.Unique_ID, _LinkEnd.EntityType_ID1, _LinkEnd.Link_ID as Count
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630'
UNION ALL
SELECT RFI_.Task_Force__Initiative AS Submitting_Agency, RFI_.Unique_ID, _LinkEnd.EntityType_ID1, _LinkEnd.Link_ID as Count
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630' AND RFI_.Task_Force__Initiative IS NOT NULL) t
GROUP BY t.Submitting_Agency
How can I get it to only count an RFI one time, even though the two fields are combined? For instance, here are sample records from the RFI_ table:
---------------------------------------------------------------------------
| Unique_ID | Submitting_Agency | Task_Force__Initiative | Date_Submitted |
---------------------------------------------------------------------------
| 1 | Social Service | Flood Relief TF | 2011-05-08 |
---------------------------------------------------------------------------
| 2 | Faith-Based Init. | Homeless Shelter Min. | 2011-06-08 |
---------------------------------------------------------------------------
| 3 | Psychology Group | | 2011-05-04 |
---------------------------------------------------------------------------
| 4 | Attorneys at Law | | 2011-05-05 |
---------------------------------------------------------------------------
| 5 | Social Service | | 2011-05-10 |
---------------------------------------------------------------------------
So assuming only one link existed to one RFI for each of these, the count should be as follows:
Social Service 1
Faith-Based Init. 0
Psychology Group 1
Attorneys at Law 1
Flood Relief TF 1
Homeless Shelter Min. 1
Note that if both an agency and a task force are in one record, then the task force gets the count, not the agency. But it is possible for the agency to have a record without a task force, in which case the agency gets the count. How could I get this to work in this fashion so that RFIs are not double-counted? As it stands both the agency and the task force get counted, which I do not want to happen. The task force always gets the count, unless that field is blank, then the agency gets it.
I guess a simple COALESCE() would do the trick?
SELECT COALESCE(Task_Force__Initiative, Submitting_Agency), COUNT(DISTINCT _LinkEnd.Link_ID) AS RFICount
FROM RFI_
JOIN _LinkEnd ON RFI_.Unique_ID=_LinkEnd.Entity_ID1
WHERE _LinkEnd.Link_ID LIKE 'CAS%' AND RFI_.Date_Submitted BETWEEN '20110430' AND '20110630'
GROUP BY COALESCE(Task_Force__Initiative, Submitting_Agency);
Rather than:
SELECT t.Submitting_Agency ...
Try
SELECT
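CASE
-- Use IS NULL: a simple "CASE x WHEN NULL" never matches, because x = NULL is never true
WHEN t.[Task_Force__Initiative] IS NULL THEN -- Or whatever test identifies an "empty" value
t.[Submitting_Agency]
ELSE
t.[Task_Force__Initiative]
END ...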
and then GROUP BY the same.
http://msdn.microsoft.com/en-us/library/ms181765.aspx
The result will be that your count will aggregate from the proper specified grouping point, rather than from the single agency column.
EDIT: From your example it looks like you don't use NULL for the empty field but maybe a blank string? In that case you'll want to replace the NULL in the CASE above with the proper "blank" value. If it is NULL then you can COALESCE as suggested in the other answer.
EDIT: Based on what I think your schema is... and your WHERE criteria
SELECT
COALESCE(RFI_.[Task_Force__Initiative], RFI_.[Submitting_Agency]),
COUNT(*)
FROM
RFI_
JOIN _LinkEnd
ON RFI_.[Unique_ID]=_LinkEnd.[Entity_ID1]
WHERE
_LinkEnd.[Link_ID] LIKE 'CAS%'
AND RFI_.[Date_Submitted] BETWEEN '20110430' AND '20110630'
GROUP BY
COALESCE(RFI_.[Task_Force__Initiative], RFI_.[Submitting_Agency])
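The sketch below checks the COALESCE approach against the sample records from the question, using Python's built-in sqlite3 module rather than SQL Server. Since it is unclear whether the empty task force fields are NULL or blank strings, the example stores one of each and wraps the column in NULLIF(..., '') so both cases fall back to the agency. The Link_ID values and the ISO date strings are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RFI_ (
        Unique_ID INTEGER PRIMARY KEY,
        Submitting_Agency TEXT,
        Task_Force__Initiative TEXT,
        Date_Submitted TEXT
    );
    CREATE TABLE _LinkEnd (Link_ID TEXT, Entity_ID1 INTEGER);
    INSERT INTO RFI_ VALUES
        (1, 'Social Service',    'Flood Relief TF',       '2011-05-08'),
        (2, 'Faith-Based Init.', 'Homeless Shelter Min.', '2011-06-08'),
        (3, 'Psychology Group',  NULL,                    '2011-05-04'),
        (4, 'Attorneys at Law',  '',                      '2011-05-05'),
        (5, 'Social Service',    NULL,                    '2011-05-10');
    -- Hypothetical links: exactly one per RFI.
    INSERT INTO _LinkEnd VALUES
        ('CAS1', 1), ('CAS2', 2), ('CAS3', 3), ('CAS4', 4), ('CAS5', 5);
""")

# NULLIF turns '' into NULL, then COALESCE falls back to the agency.
rows = conn.execute("""
    SELECT COALESCE(NULLIF(r.Task_Force__Initiative, ''), r.Submitting_Agency) AS counted_for,
           COUNT(DISTINCT l.Link_ID) AS RFICount
    FROM RFI_ r
    JOIN _LinkEnd l ON r.Unique_ID = l.Entity_ID1
    WHERE l.Link_ID LIKE 'CAS%'
      AND r.Date_Submitted BETWEEN '2011-04-30' AND '2011-06-30'
    GROUP BY COALESCE(NULLIF(r.Task_Force__Initiative, ''), r.Submitting_Agency)
""").fetchall()

counts = dict(rows)
print(counts)
```

An agency whose only RFI went to a task force (Faith-Based Init. here) does not appear in the result at all. To show it with a count of 0, the result would need to be joined back to a list of all agencies.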

Database table for grades

I'm trying to define a table to store student grades for an online report card. I can't decide how to do it, though.
The grades are given by subject, in trimestral periods. Every trimester has an average grade, the total missed classes, and a "recovering grade" (I don't know the right term in English, but it's an extra test you take to try to raise your grade if you're below the average). I also have to store the year average and the final "recovering grade". Basically, it's like this:
|1st Trimester |2nd Trimester |3rd Trimester
Subj. |Avg. |Mis. |Rec |Avg. |Mis. |Rec |Avg. |Mis. |Rec |Year Avg. |Final Rec.
Math |5.33 |1 |4 |8.0 |0 |7.0 |2 |6.5 |7.0
Sci. |5.33 |1 |4 |8.0 |0 |7.0 |2 |6.5 |7.0
I could store this information in a single DB row, with each row like this:
1tAverage | 1tMissedClasses | 1tRecoveringGrade | 2tAverage | 2tMissedClasses | 2tRecoveringGrade
And so on, but I figured this would be a pain to maintain if the school ever decides to grade by bimester or some other period (as it did up until 3 years ago).
I could also generalize the table fields and use a tinyint to flag which trimester the grades belong to, or whether they're the year finals.
But this one would require a lot of subqueries to write the report card, which is also a pain to maintain.
Which of the two is better, or is there some other way?
Thanks
You could try structuring it like this with your tables. I didn't have all the information so I made some guesses at what you might need or do with it all.
TimePeriods:
ID(INT)
PeriodTimeStart(DateTime)
PeriodTimeEnd(DateTime)
Name(VARCHAR(50))
Students:
ID(INT)
FirstName(VARCHAR(60))
LastName(VARCHAR(60))
Birthday(DateTime)
[any other relevant student fields, such as contact info, etc.]
Grading:
ID(INT)
StudentID(INT)
GradeValue(float)
TimePeriodID(INT)
IsRecoveringGrade(boolean)
MissedClasses:
ID(INT)
StudentID(INT)
ClassID(INT)
TimePeriodID(INT)
DateMissed(DateTime)
Classes:
ID(INT)
ClassName (VARCHAR(50))
ClassDescription (TEXT)
I think the best solution is to store one row per period. So you'd have a table like:
grades
------
studentID
periodNumber
averageGrade
missedClasses
recoveringGrade
So if it's 2 semesters, you'd have periods 1 and 2. I'd suggest using period 0 to mean "overall for the year".
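With one row per period, the report card can be produced in a single grouped query, without subqueries. The sketch below uses Python's built-in sqlite3 module. The subject column and the sample grades are assumptions added for the example, and the year average is computed from the period rows instead of being stored as period 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grades (
        studentID INTEGER,
        subject TEXT,            -- hypothetical column, not in the answer's table
        periodNumber INTEGER,
        averageGrade REAL,
        missedClasses INTEGER,
        recoveringGrade REAL,
        PRIMARY KEY (studentID, subject, periodNumber)
    );
    -- Made-up grades for one student over three trimesters.
    INSERT INTO grades VALUES
        (1, 'Math', 1, 6.0, 1, 4.0), (1, 'Math', 2, 8.0, 0, NULL), (1, 'Math', 3, 7.0, 2, NULL),
        (1, 'Sci',  1, 5.0, 0, 6.0), (1, 'Sci',  2, 9.0, 0, NULL), (1, 'Sci',  3, 7.0, 1, NULL);
""")

# Each MAX(CASE ...) picks the value of one period, turning rows into columns.
report = conn.execute("""
    SELECT subject,
           MAX(CASE WHEN periodNumber = 1 THEN averageGrade END) AS t1_avg,
           MAX(CASE WHEN periodNumber = 2 THEN averageGrade END) AS t2_avg,
           MAX(CASE WHEN periodNumber = 3 THEN averageGrade END) AS t3_avg,
           AVG(averageGrade) AS year_avg,
           SUM(missedClasses) AS total_missed
    FROM grades
    WHERE studentID = 1
    GROUP BY subject
    ORDER BY subject
""").fetchall()

for row in report:
    print(row)
```

If the school switches to bimesters, only the number of MAX(CASE ...) columns in the report query changes; the table itself stays the same.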
It's better to have a second table representing trimester, and have a foreign key reference to the trimester from the grades table (and store individual grades in the grades table). Then do the averages, missed classes, etc using SQL functions SUM and AVG.
This comes to mind.
(But seriously, err on the side of too many tables, not too few. Handruin has the best solution I see so far).