Optimal solution for interview question - sql

Recently in a job interview, I was given the following problem.
Say I have the following table
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
a | 15.00 | 1
b | 30.00 | 1
c | 20.00 | 1
d | 25.00 | 1
where widget_name is holds the name of the widget, widget_costs is the price of a widget, and in stock is a constant of 1.
Now for my business insurance I have a certain deductible. I am looking to find a sql statement that will tell me every widget and it's price exceeds the deductible. So if my dedudctible is $50.00 the above would just return
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
a | 15.00 | 1
d | 25.00 | 1
Since widgets b and c where used to meet the deductible
The closest I could get is the following
SELECT
*
FROM (
SELECT
widget_name,
widget_price
FROM interview.tbl_widgets
minus
SELECT widget_name,widget_price
FROM (
SELECT
widget_name,
widget_price,
50 - sum(widget_price) over (ORDER BY widget_price ROWS between unbounded preceding and current row) as running_total
FROM interview.tbl_widgets
)
where running_total >= 0
)
;
Which gives me
widget_Name | widget_Costs | In_Stock
---------------------------------------------------------
c | 20.00 | 1
d | 25.00 | 1
because it uses a and b to meet the majority of the deductible
I was hoping someone might be able to show me the correct answer
EDIT: I understood the interview question to be asking this. Given a table of widgets and their prices and given a dollar amount, substract as many of the widgets you can up to the dollar amount and return those widgets and their prices that remain

I'll put an answer up, just in case it's easier than it looks, but if the idea is just to return any widget that costs more than the deductible then you'd do something like this:
Select
Widget_Name, Widget_Cost, In_Stock
From
Widgets
Where
Widget_Cost > 50 -- SubSelect for variable deductibles?
For your sample data my query returns no rows.

I believe I understand your question, but I'm not 100%. Here is what I'm assuming you mean:
Your deductible is say, $50. To meet the deductible you have you "use" two items. (Is this always two? How high can it go? Can it be just one? What if they don't total exactly $50, there is a lot of missing information). You then want to return the widgets that aren't being used towards deductible. I have the following.
CREATE TABLE #test
(
widget_name char(1),
widget_cost money
)
INSERT INTO #test (widget_name, widget_cost)
SELECT 'a', 15.00 UNION ALL
SELECT 'b', 30.00 UNION ALL
SELECT 'c', 20.00 UNION ALL
SELECT 'd', 25.00
SELECT * FROM #test t1
WHERE t1.widget_name NOT IN (
SELECT t1.widget_name FROM #test t1
CROSS JOIN #test t2
WHERE t1.widget_cost + t2.widget_cost = 50 AND t1.widget_name != t2.widget_name)
Which returns
widget_name widget_cost
----------- ---------------------
a 15.00
d 25.00

This looks like a Bin Packing problem these are really hard to solve especially with SQL.
If you search on SO for Bin Packing + SQL, you'll find how to find Sum(field) in condition ie “select * from table where sum(field) < 150” Which is basically the same problem except you want to add a NOT IN to it.
I couldn't get the accepted answer by brianegge to work but what he wrote about it in general was interesting
..the problem you
describe of wanting the selection of
users which would most closely fit
into a given size, is a bin packing
problem. This is an NP-Hard problem,
and won't be easily solved with ANSI
SQL. However, the above seems to
return the right result, but in fact
it simply starts with the smallest
item, and continues to add items until
the bin is full.
A general, more effective bin packing
algorithm would is to start with the
largest item and continue to add
smaller ones as they fit. This
algorithm would select users 5 and 4.
So with this advice you could write a cursor to loop over the table to do just this (it just wouldn't be pretty).
Aaron Alton gives a nice link to a series of articles that attempts to solve the Bin Packing problem with sql but basically concludes that its probably best to use a cursor to do it.

Related

How to create two JOIN-tables so that I can compare attributes within?

I take a Database course in which we have listings of AirBnBs and need to be able to do some SQL queries in the Relationship-Model we made from the data, but I struggle with one in particular :
I have two tables that we are interested in, Billing and Amenities. The first one have the id and price of listings, the second have id and wifi (let's say, to simplify, that it equals 1 if there is Wifi, 0 otherwise). Both have other attributes that we don't really care about here.
So the query is, "What is the difference in the average price of listings with and without Wifi ?"
My idea was to build to JOIN-tables, one with listings that have wifi, the other without, and compare them easily :
SELECT avg(B.price - A.price) as averagePrice
FROM (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 0
) A, (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 1) B
WHERE A.id = B.id;
Obviously this doesn't work... I am pretty sure that there is a far easier solution to it tho, what do I miss ?
(And by the way, is there a way to compute the absolute between the difference of price ?)
I hope that I was clear enough, thank you for your time !
Edit : As mentionned in the comments, forgot to say that, but both tables have idas their primary key, so that there is one row per listing.
Just use conditional aggregation:
SELECT AVG(CASE WHEN a.wifi = 0 THEN b.price END) as avg_no_wifi,
AVG(CASE WHEN a.wifi = 1 THEN b.price END) as avg_wifi
FROM Billing b JOIN
Amenities a
ON b.id = a.id
WHERE a.wifi IN (0, 1);
You can use a - if you want the difference instead of the specific values.
Let's assume we're working with data like the following (problems with your data model are noted below):
Billing
+------------+---------+
| listing_id | price |
+------------+---------+
| 1 | 1500.00 |
| 2 | 1700.00 |
| 3 | 1800.00 |
| 4 | 1900.00 |
+------------+---------+
Amenities
+------------+------+
| listing_id | wifi |
+------------+------+
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
+------------+------+
Notice that I changed "id" to "listing_id" to make it clear what it was (using "id" as an attribute name is problematic anyways). Also, note that one listing doesn't have an entry in the Amenities table. Depending on your data, that may or may not be a concern (again, refer to the bottom for a discussion of your data model).
Based on this data, your averages should be as follows:
Listings with wifi average $1600 (Listings 1 and 2)
Listings without wifi (just 3) average 1800).
So the difference would be $200.
To achieve this result in SQL, it may be helpful to first get the average cost per amenity (whether wifi is offered). This would be obtained with the following query:
SELECT
Amenities.wifi AS has_wifi,
AVG(Billing.price) AS avg_cost
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
which gives you the following results:
+----------+-----------------------+
| has_wifi | avg_cost |
+----------+-----------------------+
| 0 | 1800.0000000000000000 |
| 1 | 1600.0000000000000000 |
+----------+-----------------------+
So far so good. So now we need to calculate the difference between these 2 rows. There are a number of different ways to do this, but one is to use a CASE expression to make one of the values negative, and then simply take the SUM of the result (note that I'm using a CTE, but you can also use a sub-query):
WITH
avg_by_wifi(has_wifi, avg_cost) AS
(
SELECT Amenities.wifi, AVG(Billing.price)
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
)
SELECT
ABS(SUM
(
CASE
WHEN has_wifi = 1 THEN avg_cost
ELSE -1 * avg_cost
END
))
FROM avg_by_wifi
which gives us the expected value of 200.
Now regarding your data model:
If both your Billing and Amenities table only have 1 row for each listing, it makes sense to combine them into 1 table. For example: Listings(listing_id, price, wifi)
However, this is still problematic, because you probably have a bunch of other amenities you want to model (pool, sauna, etc.) So you might want to model a many-to-many relationship between listings and amenities using an intermediate table:
Listings(listing_id, price)
Amenities(amenity_id, amenity_name)
ListingsAmenities(listing_id, amenity_id)
This way, you could list multiple amenities for a given listing without having to add additional columns. It also becomes easy to store additional information about an amenity: What's the wifi password? How deep is the pool? etc.
Of course, using this model makes your original query (difference in average cost of listings by wifi) a bit tricker, but definitely still doable.

The best way to keep count data in postgres

I need to create a statistic for some aggragete date splitted by days.
For example:
select
(select count(*) from bananas) as bananas_count,
(select count(*) from apples) as apples_count,
(select count(*) from bananas where color = 'yellow') as yellow_bananas_count;
obviously I will get:
bananas_count | apples_count | yellow_bananas_count
--------------+------------------+ ---------------------
123| 321 | 15
but I need to get that data grouped by day, we need to know how many banaras we had yesterday.
The first thought which I got is create aview, but in that case i will not be able split by dates ( or I don't know how to do it).
I need a performance-wise database sided implementation of this task.

SQL: SUM of MAX values WHERE date1 <= date2 returns "wrong" results

Hi stackoverflow users
I'm having a bit of a problem trying to combine SUM, MAX and WHERE in one query and after an intense Google search (my search engine skills usually don't fail me) you are my last hope to understand and fix the following issue.
My goal is to count people in a certain period of time and because a person can visit more than once in said period, I'm using MAX. Due to the fact that I'm defining people as male (m) or female (f) using a string (for statistic purposes), CHAR_LENGTH returns the numbers I'm in need of.
SELECT SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id")
So far, so good. But now, as stated before, I'd like to only count the guests which visited in a certain time interval (for statistic purposes as well).
SELECT "statistic"."id", SUM(max_pers) AS "People"
FROM (
SELECT "guests"."id", MAX(CHAR_LENGTH("guests"."gender")) AS "max_pers"
FROM "guests"
GROUP BY "guests"."id"),
"statistic", "guests"
WHERE ( "guests"."arrival" <= "statistic"."from" AND "guests"."departure" >= "statistic"."to")
GROUP BY "statistic"."id"
This query returns the following, x = desired result:
x * (x+1)
So if the result should be 3, it's 12. If it should be 5, it's 30 etc.
I probably could solve this algebraic but I'd rather understand what I'm doing wrong and learn from it.
Thanks in advance and I'm certainly going to answer all further questions.
PS: I'm using LibreOffice Base.
EDIT: An example
guests table:
ID | arrival | departure | gender |
10 | 1.1.14 | 10.1.14 | mf |
10 | 15.1.14 | 17.1.14 | m |
11 | 5.1.14 | 6.1.14 | m |
12 | 10.2.14 | 24.2.14 | f |
13 | 27.2.14 | 28.2.14 | mmmmmf |
statistic table:
ID | from | to | name |
1 | 1.1.14 | 31.1.14 |January | expected result: 3
2 | 1.2.14 | 28.2.14 |February| expected result: 7
MAX(...) is the wrong function: You want COUNT(DISTINCT ...).
Add proper join syntax, simplify (and remove unnecessary quotes) and this should work:
SELECT s.id, COUNT(DISTINCT g.id) AS People
FROM statistic s
LEFT JOIN guests g ON g.arrival <= s."from" AND g.departure >= s."too"
GROUP BY s.id
Note: Using LEFT join means you'll get a result of zero for statistics ids that have no guests. If you would rather no row at all, remove the LEFT keyword.
You have a very strange data structure. In any case, I think you want:
SELECT s.id, sum(numpersons) AS People
FROM (select g.id, max(char_length(g.gender)) as numpersons
from guests g join
statistic s
on g.arrival <= s."from" AND g.departure >= s."too"
group by g.id
) g join
GROUP BY s.id;
Thanks for all your inputs. I wasn't familiar with JOIN but it was necessary to solve my problem.
Since my databank is designed in german, I made quite the big mistake while translating it and I'm sorry if this caused confusion.
Selecting guests.id and later on grouping by guests.id wouldn't make any sense since the id is unique. What I actually wanted to do is select and group the guests.adr_id which links a visiting guest to an adress databank.
The correct solution to my problem is the following code:
SELECT statname, SUM (numpers) FROM (
SELECT statistic.name AS statname, guests.adr_id, MAX( CHAR_LENGTH( guests.gender ) ) AS numpers
FROM guests
JOIN statistics ON (guests.arrival <= statistics.too AND guests.departure >= statistics.from )
GROUP BY guests.adr_id, statistic.name )
GROUP BY statname
I also noted that my database structure is a mess but I created it learning by doing and haven't found any time to rewrite it yet. Next time posting, I'll try better.

Find Count Distinct in CakePHP

This is a simple query problem. But it seems that I can't get it right :(
I just started using cakephp last week. I'm still playing around exploring. Anyway, here's my problem.
This is the relationship in Model: Product has many Stock. Stock belongs to Product.
This is the sample STOCKS table:
ID | Product Name | Transaction
------------------------------------
1 | Astringent | Purchase
2 | Glutathione | Sales
3 | Glutathione | Sales
I would like to get the number of transaction per product from the STOCKS table.
This is the output I would like with distinct product name:
Transaction | Astringent | Glutathione
--------------------------------------------
purchase | 1 | 0
sales | 0 | 2
Here is a sql query- for exactly what you requested.
SELECT transaction,
SUM(
CASE
WHEN product_name = 'Astringent' THEN 1
ELSE 0
END) AS 'Astringent',
SUM(
CASE
WHEN product_name = 'Glutathione' THEN 1
ELSE 0
END) AS 'Glutathione'
FROM stock
GROUP BY transaction;
However, are you looking for a pivot table i.e., an output with a variable number of columns based on a random number of products? If so, your solution is lot more complicated as a MySQL query, and potentially less complicated as a MySQL and PHP solution. In either case, I think you should share more of your database schema / CakePHP model along with some controller information. In order to render a view with this table is not overly difficult as long as you want Cakephp to output this format.

Need a Complex SQL Query

I need to make a rather complex query, and I need help bad. Below is an example I made.
Basically, I need a query that will return one row for each case_id where the type is support, status start, and date meaning the very first one created (so that in the example below, only the 2/1/2009 John's case gets returned, not the 3/1/2009). The search needs to be dynamic to the point of being able to return all similar rows with different case_id's etc from a table with thousands of rows.
There's more after that but I don't know all the details yet, and I think I can figure it out if you guys (an gals) can help me out here. :)
ID | Case_ID | Name | Date | Status | Type
48 | 450 | John | 6/1/2009 | Fixed | Support
47 | 450 | John | 4/1/2009 | Moved | Support
46 | 451 | Sarah | 3/1/2009 | |
45 | 432 | John | 3/1/2009 | Fixed | Critical
44 | 450 | John | 3/1/2009 | Start | Support
42 | 450 | John | 2/1/2009 | Start | Support
41 | 440 | Ben | 2/1/2009 | |
40 | 432 | John | 1/1/2009 | Start | Critical
...
Thanks a bunch!
Edit:
To answer some people's questions, I'm using SQL Server 2005. And the date is just plain date, not string.
Ok so now I got further in the problem. I ended up with Bliek's solution which worked like a charm. But now I ran into the problem that sometimes the status never starts, as it's solved immediately. I need to include this in as well. But only for a certain time period.
I imagine I'm going to have to check for the case table referenced by FK Case_ID here. So I'd need a way to check for each Case_ID created in the CaseTable within the past month, and then run a search for these in the same table and same manner as posted above, returning only the first result as before. How can I use the other table like that?
As usual I'll try to find the answer myself while waiting, thanks again!
Edit 2:
Seems this is the answer. I don't have access to the full DB yet so I can't fully test it, but it seems to be working with the dummy tables I created, to continue from Bliek's code's WHERE clause:
WHERE RowNumber = 1 AND Case_ID IN (SELECT Case_ID FROM CaseTable
WHERE (Date BETWEEN '2007/11/1' AND '2007/11/30'))
The date's screwed again but you get the idea I'm sure. Thanks for the help everyone! I'll get back if there're more problems, but I think with this info I can improvise my way through most of the SQL problems I currently have to deal with. :)
Maybe something like:
select Case_ID, Name, MIN(date), Status, Type
from table
where type = 'Support'
and status = 'Start'
group by Case_ID, Name, Status, Type
EDIT: You haven't provided a lot of details about what you really want, so I'd suggest that you read all the answers and choose one that suits your problem best. So far I'd say that Tomalak's answer is closest to what you're looking for...
SELECT
c.ID,
c.Case_ID,
c.Name,
c.Date,
c.Status,
c.Type
FROM
CaseTable c
WHERE
c.Type = 'Support'
AND c.Status = 'Start'
AND c.Date = (
SELECT MIN(Date)
FROM CaseTable
WHERE Case_ID = c.Case_ID AND Type = c.Type AND Status = c.Status)
/* GROUP BY only needed when for a given Case_ID several rows
exist that fulfill the WHERE clause */
GROUP BY
c.ID,
c.Case_ID,
c.Name,
c.Date,
c.Status,
c.Type
This query benefits greatly from indexes on the Case_ID, Date, Status and Type columns.
Added value though the fact that the filter on Support and Status only needs to be set in one place.
As an alternative to the GROUP BY clause, you can do SELECT DISTINCT, which would increase readability (this may or may not affect overall performance, I suggest you measure both variants against each other). If you are sure that for no Case_ID in your table two rows exist that have the same Date, you won't need GROUP BY or SELECT DISTINCT at all.
In SQL Server 2005 and beyond I would use Common Table Expressions (CTE). This offers lots of possibilities like so:
With ResultTable (RowNumber
,ID
,Case_ID
,Name
,Date
,Status
,Type)
AS
(
SELECT Row_Number() OVER (PARTITION BY Case_ID
ORDER BY Date ASC)
,ID
,Case_ID
,Name
,Date
,Status
,Type
FROM CaseTable
WHERE Type = 'Support'
AND Status = 'Start'
)
SELECT ID
,Case_ID
,Name
,Date
,Status
,Type
FROM ResultTable
WHERE RowNumber = 1
Don't apologize for your date formatting, it makes more sense that way.
SELECT ID, Case_ID, Name, MIN(Date), Status, Type
FROM caseTable
WHERE Type = 'Support'
AND status = 'Start'
GROUP BY ID, Case_ID, Name, Status, Type