Creating new table from data of other tables - sql

I'm very new to SQL and I hope someone can help me with some SQL syntax. I have a database with these tables and fields,
DATA: data_id, person_id, attribute_id, date, value
PERSONS: person_id, parent_id, name
ATTRIBUTES: attribute_id, attribute_type
attribute_type can be "Height" or "Weight"
Question 1
Give a person's "Name", I would like to return a table of "Weight" measurements for each children. Ie: if John has 3 children names Alice, Bob and Carol, then I want a table like this
| date | Alice | Bob | Carol |
I know how to get a long list of children's weights like this:
select d.date,
d.value
from data d,
persons child,
persons parent,
attributes a
where parent.name='John'
and child.parent_id = parent.person_id
and d.attribute_id = a.attribute_id
and a.attribute_type = "Weight';
but I don't know how to create a new table that looks like:
| date | Child 1 name | Child 2 name | ... | Child N name |
Question 2
Also, I would like to select the attributes to be between a certain range.
Question 3
What happens if the dates are not consistent across the children? For example, suppose Alice is 3 years older than Bob, then there's no data for Bob during the first 3 years of Alice's life. How does the database handle this if we request all the data?

1) It might not be so easy. MS SQL Server can PIVOT a table on an axis, but dumping the resultset to an array and sorting there (assuming this is tied to some sort of program) might be the simpler way right now if you're new to SQL.
If you can manage to do it in SQL it still won't be enough info to create a new table, just return the data you'd use to fill it in, so some sort of external manipulation will probably be required. But you can probably just use INSERT INTO [new table] SELECT [...] to fill that new table from your select query, at least.
2) You can join on attributes for each unique attribute:
SELECT [...] FROM data AS d
JOIN persons AS p ON d.person_id = p.person_id
JOIN attributes AS weight ON p.attribute_id = weight.attribute_id
HAVING weight.attribute_type = 'Weight'
JOIN attributes AS height ON p.attribute_id = height.attribute_id
HAVING height.attribute_type = 'Height'
[...]
(The way you're joining in the original query is just shorthand for [INNER] JOIN .. ON, same thing except you'll need the HAVING clause in there)
3) It depends on the type of JOIN you use to match parent/child relationships, and any dates you're filtering on in the WHERE, if I'm reading that right (entirely possible I'm not). I'm not sure quite what you're looking for, or what kind of database you're using, so no good answer. If you're new enough to SQL that you don't know the different kinds of JOINs and what they can do, it's very worthwhile to learn them - they put the R in RDBMS.

when you do a select, you need to specify the exact columns you want. In other words you can't return the Nth child's name. Ie this isn't possible:
1/2/2010 | Child_1_name | Child_2_name | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name | Child_2_name
Each record needs to have the same amount of columns. So you might be able to make a select that does this:
1/2/2010 | Child_1_name
1/2/2010 | Child_2_name
1/2/2010 | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name
1/4/2010 | Child_2_name
And then in a report remap it to how you want it displayed

Related

Strict Match Many to One on Lookup Table

This has been driving me and my team up the wall. I cannot compose a query that will strict match a single record that has a specific permutation of look ups.
We have a single lookup table
room_member_lookup:
room | member
---------------
A | Michael
A | Josh
A | Kyle
B | Kyle
B | Monica
C | Michael
I need to match a room with an exact list of members but everything else I've tried on stack overflow will still match room A even if I ask for a room with ONLY Josh and Kyle
I've tried queries like
SELECT room FROM room_member_lookup
WHERE member IN (Josh, Michael)
GROUP BY room
HAVING COUNT(1) = 2
However this will still return room A even though that has 3 members I need a exact member permutation and that matches the room even not partials.
SELECT room
FROM room_member_lookup a
WHERE member IN ('Monica', 'Kyle')
-- Make sure that the room 'a' has exactly two members
and (select count(*)
from room_member_lookup b
where a.room=b.room)=2
GROUP BY room
-- and both members are in that room
HAVING COUNT(1) = 2
Depending on the SQL dialect, one can build a dynamic table (CTE or select .. union all) to hold the member set (Monica and Kyle, for example), and then look for set equivalence using MINUS/EXCEPT sql operators.

How to create two JOIN-tables so that I can compare attributes within?

I take a Database course in which we have listings of AirBnBs and need to be able to do some SQL queries in the Relationship-Model we made from the data, but I struggle with one in particular :
I have two tables that we are interested in, Billing and Amenities. The first one have the id and price of listings, the second have id and wifi (let's say, to simplify, that it equals 1 if there is Wifi, 0 otherwise). Both have other attributes that we don't really care about here.
So the query is, "What is the difference in the average price of listings with and without Wifi ?"
My idea was to build to JOIN-tables, one with listings that have wifi, the other without, and compare them easily :
SELECT avg(B.price - A.price) as averagePrice
FROM (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 0
) A, (
SELECT Billing.price, Billing.id
FROM Billing
INNER JOIN Amenities
ON Billing.id = Amenities.id
WHERE Amenities.wifi = 1) B
WHERE A.id = B.id;
Obviously this doesn't work... I am pretty sure that there is a far easier solution to it tho, what do I miss ?
(And by the way, is there a way to compute the absolute between the difference of price ?)
I hope that I was clear enough, thank you for your time !
Edit : As mentionned in the comments, forgot to say that, but both tables have idas their primary key, so that there is one row per listing.
Just use conditional aggregation:
SELECT AVG(CASE WHEN a.wifi = 0 THEN b.price END) as avg_no_wifi,
AVG(CASE WHEN a.wifi = 1 THEN b.price END) as avg_wifi
FROM Billing b JOIN
Amenities a
ON b.id = a.id
WHERE a.wifi IN (0, 1);
You can use a - if you want the difference instead of the specific values.
Let's assume we're working with data like the following (problems with your data model are noted below):
Billing
+------------+---------+
| listing_id | price |
+------------+---------+
| 1 | 1500.00 |
| 2 | 1700.00 |
| 3 | 1800.00 |
| 4 | 1900.00 |
+------------+---------+
Amenities
+------------+------+
| listing_id | wifi |
+------------+------+
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
+------------+------+
Notice that I changed "id" to "listing_id" to make it clear what it was (using "id" as an attribute name is problematic anyways). Also, note that one listing doesn't have an entry in the Amenities table. Depending on your data, that may or may not be a concern (again, refer to the bottom for a discussion of your data model).
Based on this data, your averages should be as follows:
Listings with wifi average $1600 (Listings 1 and 2)
Listings without wifi (just 3) average 1800).
So the difference would be $200.
To achieve this result in SQL, it may be helpful to first get the average cost per amenity (whether wifi is offered). This would be obtained with the following query:
SELECT
Amenities.wifi AS has_wifi,
AVG(Billing.price) AS avg_cost
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
which gives you the following results:
+----------+-----------------------+
| has_wifi | avg_cost |
+----------+-----------------------+
| 0 | 1800.0000000000000000 |
| 1 | 1600.0000000000000000 |
+----------+-----------------------+
So far so good. So now we need to calculate the difference between these 2 rows. There are a number of different ways to do this, but one is to use a CASE expression to make one of the values negative, and then simply take the SUM of the result (note that I'm using a CTE, but you can also use a sub-query):
WITH
avg_by_wifi(has_wifi, avg_cost) AS
(
SELECT Amenities.wifi, AVG(Billing.price)
FROM Billing
INNER JOIN Amenities ON
Amenities.listing_id = Billing.listing_id
GROUP BY Amenities.wifi
)
SELECT
ABS(SUM
(
CASE
WHEN has_wifi = 1 THEN avg_cost
ELSE -1 * avg_cost
END
))
FROM avg_by_wifi
which gives us the expected value of 200.
Now regarding your data model:
If both your Billing and Amenities table only have 1 row for each listing, it makes sense to combine them into 1 table. For example: Listings(listing_id, price, wifi)
However, this is still problematic, because you probably have a bunch of other amenities you want to model (pool, sauna, etc.) So you might want to model a many-to-many relationship between listings and amenities using an intermediate table:
Listings(listing_id, price)
Amenities(amenity_id, amenity_name)
ListingsAmenities(listing_id, amenity_id)
This way, you could list multiple amenities for a given listing without having to add additional columns. It also becomes easy to store additional information about an amenity: What's the wifi password? How deep is the pool? etc.
Of course, using this model makes your original query (difference in average cost of listings by wifi) a bit tricker, but definitely still doable.

Querying array of text in postgres

I have an array type I want to store in Postgres. One of the major use cases I have is to see if any of the records has an array which has a string in it.
eg.
| A | ["NY", "Paris", "Milan"] |
| B | ["Paris", "NY"] |
| C | [] |
| D | ["Milan"] |
Does there exist a row with Paris in the array? Which rows have Milan in the array? and so on.
I have 2 options on how to store the column. I can either make it a type text[] or convert it into a json as {"cities": ["NY", "Paris", "Milan"]} and then store as a JSONB field
However, I am not sure what would allow the fastest querying for the use case I have. Is there any one obviously better way of doing this? Am I tying myself down in any way by choosing one over the other? If I choose one over the other then how can I query the DB?
As you seem to be storing simple lists of values, I would recommend to use datataype Array over JSON, which better fits more complex cases (nested datastructures, associative arrays, ...).
To check for the value of an element at any position in the array, you can use array function ANY().
Here is a query that will return all records where the array stored in column cities contains 'Paris' :
SELECT t.* FROM mytable t WHERE 'Paris' = ANY(t.cities);
Yields :
id cities
---------------------------
A ["NY","Paris","Milan"]
B ["Paris","NY"]
Demo on DB Fiddle
For more information :
Postgres Arrays Documentation
Postgres Arrays Tutorial
I've noticed it is better to query JSONB, if it is a simple key-value store.
As in for instance you want to store arbitrary info on a row that your not sure what the columns(keys) would be.
info = {"a":"apple", "b":"ball"}
For use cases like yours, it would be better if you could design the db with simple tables so you could use JOINS and Indexes to your advantage.
You could restructure the tables like :
Location
id | name
----------
1 | Paris
2 | NY
3 | Milan
Other Table (with foreign key on location table)
user | location_id
--------------------
A | 1
A | 3
B | 2
Using these set of tables it would be easy to query all users with location paris using JOINS.

Check if a value exists in the child-parent tree

I'm creating a simple directory listing page where you can specify what kind of thing you want to list in the directory e.g. a person or a company.
Each user has an UserTypeID and there is a dbo.UserType lookup table. The dbo.UserType lookup table is like this:
UserTypeID | UserTypeParentID | Name
1 NULL Person
2 NULL Company
3 2 IT
4 3 Accounting Software
In the dbo.Users table we have records like this:
UserID | UserTypeID | Name
1 1 Jenny Smith
2 1 Malcolm Brown
3 2 Wall Mart
4 3 Microsoft
5 4 Sage
My SQL (so far) is very simple: (excuse the pseudo-code style)
DECLARE #UserTypeID int
SELECT
*
FROM
dbo.Users u
INNER JOIN
dbo.UserType ut
WHERE
ut.UserTypeID = #UserTypeID
The problem is here is that when people want to search for companies they will enter in '2' as the UserTypeID. But both Microsoft and Sage won't show up because their UserTypeIDs are 3 and 4 respectively. But its the final UserTypeParentID which tells me that they're both Companies.
How could I rewrite the SQL to ask it to return to return records where the UserTypeID = #UserTypeID or where its final UserTypeParentID is also equal to #UserTypeID. Or am I going about this the wrong way?
Schema Change
I would suggest you to break it down this schema a little bit more, to make your queries and life simpler, with this current schema you will end up writing a recursive query every time you want to get simplest data from your Users table, and trust me you dont want to do this to yourself.
I would break down this schema of these tables as follow:
dbo.Users
UserID | UserName
1 | Jenny
2 | Microsoft
3 | Sage
dbo.UserTypes_Type
TypeID | TypeName
1 | Person
2 | IT
3 | Compnay
4 | Accounting Software
dbo.UserTypes
UserID | TypeID
1 | 1
2 | 2
2 | 3
3 | 2
3 | 3
3 | 4
You say that you are "creating" this - excellent because you have the opportunity to reconsider your whole approach.
Dealing with hierarchical data in a relational database is problematic because it is not designed for it - the model you choose to represent it will have a huge impact on the performance and ease of construction of your queries.
You have opted for an Adjacently List model which is great for inserts (and deletes) but a bugger for selects because the query has to effectively reconstruct the hierarchy path. By the way an Adjacency List is the model almost everyone goes for on their first attempt.
Everything is a trade off so you should decide what queries will be most common - selects (and updates) or inserts (and deletes). See this question for starters. Also, since SQL Server 2008, there is a native HeirachyID datatype (see this) which may be of assistance.
Of course, you could store your data in an XML file (in SQL Server or not) which is designed for hierarchical data.

Summing n numerical variables by grouping level specific to each

I am working through a group by problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: For each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: How can I sum employment (or sales) by county by year as in the example table above, given that county as a grouping variable changes sometimes by the year and specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question this sounds relatively straight forward. While I normally prefer normalized data to work with, I don't see that normalizing things beforehand will buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate a list of columns to sum up. the rest is a straight group by and join process as far as I can tell.
if the variable changes by name, the best thing to do in my experience is to put together a location view based on that union and join against it. This lets you hide the complexity from your main queries and as long as you don't also join the underlying tables should perform quite well.