SQL - Combining all children in one row

I am trying to move data from a database into a pandas DataFrame. The data is spread across multiple tables that I want to combine.
I'm using SQLAlchemy with a parent/children relationship.
I'm trying to understand how I'd do this in SQL before attempting it in SQLAlchemy.
I am using SQLite as the database.
parent_table
ID | Name | Class
1 | Joe | Paladin
2 | Ron | Mage
3 | Sara | Knight
child1
ID | distance | finished | parent_id
1 | 2 miles | yes | 1
2 | 3 miles | yes | 1
3 | 1 miles | yes | 1
4 | 10 miles | no | 2
child2
ID | Weight | height | parent_id
1 | 5 lbs | 5'3 | 1
2 | 10 lbs | 5'5 | 2
I want to write a query where the result would be everything for Joe (id: 1) on a row.
1 | Joe | Paladin | 2 miles | yes | 3 miles | yes | 1 miles | yes | 5lbs | 5'3
2 | Ron | Mage | 10 miles | no | None | None | None | None | 10lbs | 5'5
3 | Sara | Knight | None | None | None | None | None | None | None | None
I'm guessing I need to do a join, but I'm confused by the fact that Ron has fewer child1 entries.
How do I construct a table that has as many columns as needed and fills out the empty ones as None when some of the rows in parent_table don't have as many children?

Simply query each table separately and use UNION to combine the results:
SELECT Name,Class FROM parent_table WHERE ID = 1
UNION
SELECT distance,finished FROM child1 WHERE parent_id = 1
UNION
SELECT weight,height FROM child2 WHERE parent_id =1
This way you avoid the problem for Ron or anyone else who does not have a record in one of the tables.
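To see the shape of the result, here is a minimal sketch that runs the UNION query against an in-memory SQLite database built from the sample rows in the question. The Python wrapper and the variable names are mine, for illustration only. Note that UNION stacks the three result sets as rows of a single two-column result; it does not widen one row.

```python
import sqlite3

# In-memory database populated with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent_table (ID INTEGER PRIMARY KEY, Name TEXT, Class TEXT);
CREATE TABLE child1 (ID INTEGER PRIMARY KEY, distance TEXT, finished TEXT, parent_id INTEGER);
CREATE TABLE child2 (ID INTEGER PRIMARY KEY, Weight TEXT, height TEXT, parent_id INTEGER);
INSERT INTO parent_table VALUES (1, 'Joe', 'Paladin'), (2, 'Ron', 'Mage'), (3, 'Sara', 'Knight');
INSERT INTO child1 VALUES (1, '2 miles', 'yes', 1), (2, '3 miles', 'yes', 1),
                          (3, '1 miles', 'yes', 1), (4, '10 miles', 'no', 2);
INSERT INTO child2 VALUES (1, '5 lbs', '5''3', 1), (2, '10 lbs', '5''5', 2);
""")

# The UNION query from the answer, unchanged apart from layout.
union_rows = conn.execute("""
    SELECT Name, Class FROM parent_table WHERE ID = 1
    UNION
    SELECT distance, finished FROM child1 WHERE parent_id = 1
    UNION
    SELECT weight, height FROM child2 WHERE parent_id = 1
""").fetchall()

# Each source row becomes its own two-column result row.
for row in union_rows:
    print(row)
```

For Joe this yields five two-column rows (one from parent_table, three from child1, one from child2), so the caller still has to work out which row came from which table.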

You can't have "as many columns as needed" because the number of child rows is variable and a query can't return a variable number of columns. If you can settle on a fixed number of children (say 2), you can do:
CREATE TABLE
"some_table"
AS
SELECT
"parent_table"."ID",
"parent_table"."Name",
"parent_table"."Class",
"child1_1"."finished" AS "2_miles",
"child1_2"."finished" AS "3_miles"
FROM
"parent_table",
"child1" AS "child1_1",
"child1" AS "child1_2"
WHERE
"child1_1"."parent_id"="parent_table"."id" AND
"child1_2"."parent_id"="parent_table"."id" AND
"child1_1"."distance"='2 miles' AND
"child1_2"."distance"='3 miles'
You can add columns from child2 in the same manner. The child subkeys (i.e. the data in child1.distance) will need to become column names. But for variable one-to-many relations, you need multiple tables. That's basically what the relational concept is all about.
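The query above uses inner joins, so a parent that lacks one of the named child rows (Ron, Sara) drops out of the result. A minimal sketch of the same fixed-column idea with LEFT JOINs, which keeps every parent and fills the gaps with NULL (None in Python), might look like this. It runs against in-memory SQLite with the question's sample rows; the table aliases and the Python wrapper are my own, and joining child2 directly assumes at most one child2 row per parent, as in the sample.

```python
import sqlite3

# In-memory database populated with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent_table (ID INTEGER PRIMARY KEY, Name TEXT, Class TEXT);
CREATE TABLE child1 (ID INTEGER PRIMARY KEY, distance TEXT, finished TEXT, parent_id INTEGER);
CREATE TABLE child2 (ID INTEGER PRIMARY KEY, Weight TEXT, height TEXT, parent_id INTEGER);
INSERT INTO parent_table VALUES (1, 'Joe', 'Paladin'), (2, 'Ron', 'Mage'), (3, 'Sara', 'Knight');
INSERT INTO child1 VALUES (1, '2 miles', 'yes', 1), (2, '3 miles', 'yes', 1),
                          (3, '1 miles', 'yes', 1), (4, '10 miles', 'no', 2);
INSERT INTO child2 VALUES (1, '5 lbs', '5''3', 1), (2, '10 lbs', '5''5', 2);
""")

# One LEFT JOIN per fixed output slot; the slot condition lives in the ON
# clause so that a missing child yields NULL instead of removing the parent.
rows = conn.execute("""
    SELECT p.ID, p.Name, p.Class,
           c1.finished AS "2_miles",
           c2.finished AS "3_miles",
           w.Weight, w.height
    FROM parent_table AS p
    LEFT JOIN child1 AS c1 ON c1.parent_id = p.ID AND c1.distance = '2 miles'
    LEFT JOIN child1 AS c2 ON c2.parent_id = p.ID AND c2.distance = '3 miles'
    LEFT JOIN child2 AS w  ON w.parent_id = p.ID
    ORDER BY p.ID
""").fetchall()

for row in rows:
    print(row)
```

The number of columns is still fixed by the query text, so this only helps when the set of child "slots" is known in advance.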
For data analysis (which seems to be what you are trying to do) you will also need two datasets (like tables), because the 2 measurements (sample sets), i.e. distances and weights, are not correlated; you can obtain them as 2 tables. Think of what a "sample" is (the result of a measurement). It can't be "entity 1 completed 2 miles and 4 lbs" because "2 miles and 4 lbs" is not a measurable event. So you have 2 distinct samples: "entity 1 completed 2 miles" and "entity 1 completed 4 lbs". (Or are the data in child2 1-to-1 properties of the entity in parent_table? You should describe in more detail what the data means and what you're trying to achieve.)

Related

Best database design to find relationships between two persons

I want to find relationships between two persons using a database. For example, I have a database like this:
Person:
Id| Name
1 | Edvard
2 | Ivan
3 | Molly
4 | Julian
5 | Emily
6 | Katarina
Relationship:
Id| Type
1 | Parent
2 | Husband\Wife
3 | ex-Husband\ex-Wife
Relationships:
Id| Person_1_Id | Person_2_Id | Relation_Id
1 | 1 | 3 | 2
2 | 3 | 4 | 3
3 | 3 | 2 | 1
4 | 4 | 2 | 1
5 | 1 | 6 | 3
6 | 1 | 5 | 1
7 | 6 | 5 | 1
What is the best way to find the relationship between Person-2 and Person-5? This example is not large, but what if there were 5 families, or 10000? I think that if there are too many families, it becomes necessary to introduce the concept of depth. Would it be better to change the database design? Is it possible to model it as trees or graphs? Are there other ideas on how to solve this problem differently?
As soon as you get above a handful of nodes and a few relationships between them, this becomes a very complex problem: there are whole branches of maths based around this type of challenge and how long it takes to compute a result.
For any non-trivial set of nodes/relationships you are going to need to look at deploying a graph database, e.g. Neo4j.
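For a data set as small as the example, the chain between two people can be found with a plain breadth-first search over the Relationships rows. The following is a minimal sketch, not a replacement for a graph database: it assumes every relationship can be walked in both directions, and the function and variable names are hypothetical.

```python
from collections import deque

# (Person_1_Id, Person_2_Id, Relation_Id) rows from the Relationships table.
RELATIONSHIPS = [
    (1, 3, 2), (3, 4, 3), (3, 2, 1), (4, 2, 1),
    (1, 6, 3), (1, 5, 1), (6, 5, 1),
]


def shortest_chain(start, goal, rows):
    """Return the shortest list of person ids linking start to goal, or None."""
    # Build an undirected adjacency list from the relationship rows.
    neighbours = {}
    for a, b, _relation in rows:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {start: None}  # person -> person we reached them from
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            # Walk the back-pointers to rebuild the chain.
            chain = []
            while person is not None:
                chain.append(person)
                person = previous[person]
            return chain[::-1]
        for other in neighbours.get(person, []):
            if other not in previous:
                previous[other] = person
                queue.append(other)
    return None


print(shortest_chain(2, 5, RELATIONSHIPS))  # [2, 3, 1, 5]
```

With the sample data, Person-2 (Ivan) reaches Person-5 (Emily) through Molly and Edvard. The length of the returned chain is one way to express the "depth" mentioned in the question.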

Element with the most votes for each combination of attributes

In my schema, a user can vote for different monsters that have different powers (e.g. lightning, fire) and different bodies.
Body is a polymorphic association, as it can be from different types of animals.
Here's the relevant pieces of the schema:
votes:
monster_id
power_id
body_id #polymorphic association
body_type #polymorphic association
For every combination of power and body with representation on the votes table, I want to find out the monsters that got the most votes.
A specific example:
--------------------------------------------------
| votes |
--------------------------------------------------
| monster_id| power_id | body_id | body_type |
--------------------------------------------------
| 1 | 1 | 1 | Body::Mammal |
| 2 | 1 | 1 | Body::Mammal |
| 2 | 1 | 1 | Body::Mammal |
| 11 | 2 | 11 | Body::Reptile |
| 11 | 2 | 11 | Body::Reptile |
| 22 | 2 | 11 | Body::Reptile |
--------------------------------------------------
Results I would like:
- ["For the combination (power_id: 1, body_id: 1, body_type: Body::Mammal), the monster with most votes is monster_id: 2",
"For the combination (power_id: 2, body_id: 11, body_type: Body::Reptile), the monster with most votes is monster_id: 11",
...]
I am using Rails 6 and Postgres, so I have the option to use ActiveRecord, for which I have a slight preference, but I realize this likely needs raw SQL.
I understand the answer is very likely an extension of the one given in this question, where they do a similar thing with fewer attributes, but I can't seem to add the extra complexity needed to accommodate the increased number of columns in play.
sql: select most voted items from each user
If I follow you correctly, you can use distinct on and aggregation:
select distinct on (body_id, power_id, body_type)
body_id, power_id, body_type, monster_id, count(*) cnt_votes
from votes
group by body_id, power_id, body_type, monster_id
order by body_id, power_id, body_type, count(*) desc
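DISTINCT ON is specific to PostgreSQL. As a rough illustration of the same "top row per group" idea, here is a sketch that uses ROW_NUMBER() instead and runs against in-memory SQLite (window functions need SQLite 3.25 or later) with the sample votes from the question. The CTE names are hypothetical, and breaking ties by the lowest monster_id is my assumption; the question does not say how ties should be handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes (monster_id INTEGER, power_id INTEGER, "
    "body_id INTEGER, body_type TEXT)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO votes VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "Body::Mammal"),
        (2, 1, 1, "Body::Mammal"),
        (2, 1, 1, "Body::Mammal"),
        (11, 2, 11, "Body::Reptile"),
        (11, 2, 11, "Body::Reptile"),
        (22, 2, 11, "Body::Reptile"),
    ],
)

winners = conn.execute("""
    WITH counts AS (
        -- Votes per monster within each (power, body) combination.
        SELECT power_id, body_id, body_type, monster_id, COUNT(*) AS cnt_votes
        FROM votes
        GROUP BY power_id, body_id, body_type, monster_id
    ),
    ranked AS (
        -- Rank monsters inside each combination, most votes first.
        SELECT counts.*,
               ROW_NUMBER() OVER (
                   PARTITION BY power_id, body_id, body_type
                   ORDER BY cnt_votes DESC, monster_id
               ) AS rn
        FROM counts
    )
    SELECT power_id, body_id, body_type, monster_id, cnt_votes
    FROM ranked
    WHERE rn = 1
    ORDER BY power_id, body_id, body_type
""").fetchall()

for row in winners:
    print(row)
```

On the sample data this returns monster 2 for (power 1, body 1, Body::Mammal) and monster 11 for (power 2, body 11, Body::Reptile), each with 2 votes.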

Correct Database Design / Relationship

Below I have shown a basic example of my proposed database tables.
I have two questions:
Categories "Engineering", "Client" and "Vendor" will have exactly the same "Disciplines", "DocType1" and "DocType2", does this mean I have to enter these 3 times over in the "Classification" table, or is there a better way? Bear in mind there is the "Vendor" category that is also covered in the classification table.
In the "Documents" table I have shown "category_id" and "classification_id". I'm not sure if this will depend on the answer to the first question, but is "category_id" necessary, or should I just be using a JOIN to allow me to filter the category based on the classification_id?
Thank you in advance.
Table: Category
id | name
---|-------------
1 | Engineering
2 | Client
3 | Vendor
4 | Commercial
Table: Discipline
id | name
---|-------------
1 | Electrical
2 | Instrumentation
3 | Proposals
Table: DocType1
id | name
---|-------------
1 | Specifications
2 | Drawings
3 | Lists
4 | Tendering
Table: Classification
id | category_id | discipline_id | doctype1_id | doctype2
---|-------------|---------------|-------------|----------
1 | 1 | 1 | 2 | 00
2 | 1 | 1 | 2 | 01
3 | 2 | 1 | 2 | 00
4 | 4 | 3 | 4 | 00
Table: Documents
id | title | doc_number | category_id | classification_id
---|-----------------|------------|-------------|-------------------
1 | Electrical Spec | 0001 | 1 | 1
2 | Electrical Spec | 0002 | 2 | 3
3 | Quotation | 0003 | 3 | 4
From what you've provided, it looks like we have three simple lookup tables: category, discipline, and doctype1. The part that's not intuitively obvious to me and may also be causing confusion on your end, is that the last two tables are both serving as cross-references of the lookup tables. The classification table in particular seems like it might be out of place. If there are only certain combinations of category, discipline, and doctype that would ever be valid, then the classification table makes sense and the right thing to do would be to look up that valid combination by way of the classification ID from the document table. If this is not the case, then you would probably just want to reference the category, discipline, and document type directly from the document table.
In your example, the need to make this distinction is highlighted by the fact that the document table has a reference to the classification table and a reference to the category table. However, the row that is looked up in the classification table also references a category ID. This is not only redundant but also opens the door to conflicting category IDs.
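The risk of conflicting category IDs can be checked directly against the sample rows. Below is a minimal sketch using in-memory SQLite; the column types and variable names are my assumptions, and only the two tables involved are created. It lists documents whose own category_id disagrees with the category of their classification, then shows the category being derived through a JOIN instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Classification (
    id INTEGER PRIMARY KEY, category_id INTEGER, discipline_id INTEGER,
    doctype1_id INTEGER, doctype2 TEXT
);
CREATE TABLE Documents (
    id INTEGER PRIMARY KEY, title TEXT, doc_number TEXT,
    category_id INTEGER, classification_id INTEGER
);
-- Sample rows from the question.
INSERT INTO Classification VALUES
    (1, 1, 1, 2, '00'), (2, 1, 1, 2, '01'), (3, 2, 1, 2, '00'), (4, 4, 3, 4, '00');
INSERT INTO Documents VALUES
    (1, 'Electrical Spec', '0001', 1, 1),
    (2, 'Electrical Spec', '0002', 2, 3),
    (3, 'Quotation', '0003', 3, 4);
""")

# Documents whose stored category differs from their classification's category.
conflicts = conn.execute("""
    SELECT d.id, d.category_id, c.category_id
    FROM Documents AS d
    JOIN Classification AS c ON c.id = d.classification_id
    WHERE d.category_id <> c.category_id
""").fetchall()
print(conflicts)  # [(3, 3, 4)]

# Category derived through the classification, without Documents.category_id.
derived = conn.execute("""
    SELECT d.id, c.category_id
    FROM Documents AS d
    JOIN Classification AS c ON c.id = d.classification_id
    ORDER BY d.id
""").fetchall()
print(derived)  # [(1, 1), (2, 2), (3, 4)]
```

In the sample data as posted, document 3 is stored under category 3 (Vendor) while its classification row points to category 4 (Commercial). Whether that is a typo in the question or intentional, it is the kind of mismatch the redundant column allows.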
I hope this helps.

What is the most efficient way to store a variable number of columns in SQL Server?

What is the most efficient way to store a variable number of columns in MS SQL?
I have a requirement to store a large number (several million) of records in a Microsoft SQL Server database (via C#). Most columns are standard, but certain groups of users will need to add their own custom columns and record data in them.
The data in each custom column field will not be large, but the number of records with a certain set of custom columns will be in the millions.
I do not know ahead of time what these columns might be (in terms of name or datatype), but I'll need to pull reports based on these columns as efficiently as possible.
What is the most efficient way of storing the new varying columns and data?
1. Entity-Attribute-Value model?
Cons: Efficiency if there's a large number of custom columns (= large number of rows)?
2. An extra table "CustomColumns"?
Storing columnName, Data, Datatype each time an entry has a custom column, for each column.
Cons: A table with a large number of records, perhaps not the most efficient storage.
3. Serialise the extra columns for each record into a single field.
Cons: Lookup efficiency, and complicated stored procedures when running reports based on a custom field.
4. Any other?
Edit: I think I may be confusing options (1) and (2). What I actually meant is: is the following the best approach?
Entity (User Groups)
id | name | description
-- | ---- | ------------
1 | user group 1 | user group 1
2 | user group 2 | user group 2
Attribute
id | name | type | entityids
-- | ---- | ---- | ---------
1 | att1 | string | 1,2
2 | att2 | int | 2
3 | att3 | string | 1
4 | att4 | numeric | 2
5 | att5 | string | 1
(entityids: what is the best way to do this for 2 user groups using the same attribute?)
Value
id | entityId| attributeId | value
-- | --------| ----------- | -----
1 | 1 | 1 | a
2 | 1 | 2 | 1
3 | 1 | 3 | b
4 | 1 | 3 | c
5 | 1 | 3 | d
6 | 1 | 3 | 75
7 | 1 | 5 | Inches

Relative incremental ID by reference field

I have a table to store reservations for certain events; relevant part of it is:
class Reservation(models.Model):
    # Django creates an auto-increment field "id" by default
    event = models.ForeignKey(Event)
    # Some other reservation-specific fields..
    first_name = models.CharField(max_length=255)
Now, I wish to retrieve the sequential ID of a given reservation relative to reservations for the same event.
Disclaimer: Of course, we assume reservations are never deleted, or their relative position might change.
Example:
+----+-------+------------+--------+
| ID | Event | First name | Rel.ID |
+----+-------+------------+--------+
| 1 | 1 | AAA | 1 |
| 2 | 1 | BBB | 2 |
| 3 | 2 | CCC | 1 |
| 4 | 2 | DDD | 2 |
| 5 | 1 | EEE | 3 |
| 6 | 3 | FFF | 1 |
| 7 | 1 | GGG | 4 |
| 8 | 1 | HHH | 5 |
+----+-------+------------+--------+
The last column is the "Relative ID", that is, a sequential number, with no gaps, for all reservations of the same event.
Now, what's the best way to accomplish this without having to manually calculate the relative id for each import (I don't like that)? I'm using PostgreSQL as the underlying database, but I'd prefer to stick with the Django abstraction layer in order to keep this portable (i.e. no database-specific solutions, such as triggers etc.).
Filtering using Reservation.objects.filter(event_id = some_event_id) should suffice. This will give you a QuerySet that should have the same ordering each time. Or am I missing something in your question?
I hate always being the one who answers his own questions, but I solved it using this:
from django.db.models import Q

class Reservation(models.Model):
    # ...
    def relative_id(self):
        # Own id minus the number of earlier reservations that belong to other events
        return self.id - Reservation.objects.filter(id__lt=self.id).filter(~Q(event=self.event)).count()
Assuming records from reservations are never deleted, we can safely assume the "relative id" is the incremental id - (count of reservations before this one not belonging to same event).
I've tried to think of drawbacks, but I haven't found any.
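The formula can be checked against the example table without a database. Here is a minimal sketch in plain Python (no Django), with hypothetical helper names. It also shows an equivalent count of same-event reservations up to and including the current one. That variant does not depend on the ids being gapless, which the subtraction implicitly assumes: even without deletions, a rolled-back insert can leave a gap in a PostgreSQL sequence.

```python
# (id, event_id) pairs from the example table.
RESERVATIONS = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 1), (6, 3), (7, 1), (8, 1)]


def relative_id_by_subtraction(res_id, event_id, rows):
    """id minus the number of earlier reservations for other events."""
    others_before = sum(1 for rid, ev in rows if rid < res_id and ev != event_id)
    return res_id - others_before


def relative_id_by_count(res_id, event_id, rows):
    """Number of reservations for the same event with id <= this one."""
    return sum(1 for rid, ev in rows if rid <= res_id and ev == event_id)


for rid, ev in RESERVATIONS:
    print(
        rid,
        ev,
        relative_id_by_subtraction(rid, ev, RESERVATIONS),
        relative_id_by_count(rid, ev, RESERVATIONS),
    )
```

Both helpers reproduce the Rel.ID column of the example (1, 2, 1, 2, 3, 1, 4, 5). In Django, the counting variant would be roughly Reservation.objects.filter(event=self.event, id__lte=self.id).count(), which is worth considering if gaps in the id sequence are possible.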