Let's say I want to find out movie sales per year by genre in an OLAP cube. The data structure would look like this:
{
"Year": 2020,
"Title": "Spider-Man: Into the Spider-Verse",
"Revenue": 102000,
"Genres": ["Action", "Sci-Fi"]
}
What would be the proper way to model this? Would I un-nest the genres so that the genre itself is multiplied by the sales? For example, the fact table would look like:
+------------+
| Movie Fact |
+------------+
| Year |
| Title | (1 record for the above data)
| Revenue |
+------------+
Or would it look like:
+------------+
| Movie Fact |
+------------+
| Year |
| Title | (TWO records for the above data)
| Revenue |
| Genre |
+------------+
Why would it be one way over the other?
You need a many-to-many bridge table to handle multiple genres for the same movie. Your fact table keeps one row per movie, so it will obviously hold more than one movie. The M2M table holds only the dimensional relations between each movie and each of its genres.
Dimensions:
Date (all years of all movies, or dates in YYYYMMDD format as the key, plus additional columns like Year, Calendar Date MM/DD/YYYY etc.; key Date_SK)
Genre (just a flat key-value list like 1 - Action, 2 - Sci-Fi, 3 - Drama etc.)
Bridge table (+ separate measure group for this!):
Movies_Genres (Movie_SK, Genre_SK). It will contain two rows for this movie. Row-1: 1,1; Row-2: 1,2
Fact table:
Movies (Movie_SK, Title, Date_SK, Revenue)
Joins between all tables are:
Date <-- Movies
Movies_Genres <-- Movies
Genre <-- Movies_Genres
On the Dimension Usage tab you relate the Movies fact table to the Genre dimension with a "Many-to-many" relationship through the Movies_Genres measure group, which is already connected to the Genre dimension.
This is the only way to avoid duplication. The server aggregates revenue under each genre, but the total counts each movie once, so if a single movie is selected the total equals the revenue shown under each of its genres.
Performance may suffer when the bridge table and the dimensions have more than 1M rows each.
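To make the double-counting point concrete, here is a minimal relational sketch of the same bridge-table layout, using Python's built-in sqlite3 rather than an actual OLAP server. The lowercase table names and the Date_SK value 20200101 are placeholders for illustration only; a cube's many-to-many relationship does the equivalent work internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genre (genre_sk INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies (movie_sk INTEGER PRIMARY KEY, title TEXT,
                     date_sk INTEGER, revenue INTEGER);
CREATE TABLE movies_genres (movie_sk INTEGER, genre_sk INTEGER);

INSERT INTO genre VALUES (1, 'Action'), (2, 'Sci-Fi');
-- date_sk is a placeholder key, not taken from the question
INSERT INTO movies VALUES (1, 'Spider-Man: Into the Spider-Verse', 20200101, 102000);
INSERT INTO movies_genres VALUES (1, 1), (1, 2);
""")

# Revenue per genre: the movie's full revenue shows up under each of its genres.
by_genre = conn.execute("""
    SELECT g.name, SUM(m.revenue)
    FROM movies m
    JOIN movies_genres mg ON mg.movie_sk = m.movie_sk
    JOIN genre g ON g.genre_sk = mg.genre_sk
    GROUP BY g.name
    ORDER BY g.name
""").fetchall()

# Grand total: read from the fact table alone, so each movie counts once.
total = conn.execute("SELECT SUM(revenue) FROM movies").fetchone()[0]

print(by_genre)  # [('Action', 102000), ('Sci-Fi', 102000)]
print(total)     # 102000
```

Had the genre been stored in the fact table (two rows for this movie), summing Revenue over the fact table would return 204000 instead.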
I have 3 tables of purchases in the following format:
date | company_id | apple_txn_amt
date | company_id | orange_txn_amt
date | company_id | pear_txn_amt
There are multiple purchases/sales daily for many companies. I'm trying to join and group so there is only 1 date per company along with total fruit balance:
date | company_id | total_apple_balance | total_orange_balance | total_pear_balance
I have built a query for a similar case earlier, and used 2 joins. But this was for only one company's data so I was only joining on date=date for each table. Process for each table was: gather buys, sells, union those two, union to a new table with generate_series() to insert 0s for days missing, calculate daily delta, and group by day to have a running total. Then something like:
SELECT
apple.day,
apple.total,
orange.total,
pear.total,
(apple.total + orange.total + pear.total) AS total_fruit
FROM apple
JOIN orange ON orange.date = apple.date
JOIN pear ON pear.date = apple.date
ORDER BY day
It's like I need to JOIN ON date and company id but from what I can tell this isn't possible.
Should I approach this in a different way?
Sure, you can add company_id to the join conditions like this:
SELECT
apple.day,
apple.total,
orange.total,
pear.total,
(apple.total + orange.total + pear.total) AS total_fruit
FROM apple
JOIN orange ON orange.date = apple.date AND orange.company_id = apple.company_id
JOIN pear ON pear.date = apple.date AND pear.company_id = apple.company_id
ORDER BY apple.day
But unless circumstances require it, the design of your database isn't right.
You would not have 3 tables; you would have only one, with the fruit type as another column to differentiate the rows.
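A minimal sketch of that single-table design, run through Python's sqlite3 so it is self-contained. The table name fruit_txn, the column names, and the sample rows are made up for illustration. It produces daily totals per company; a running balance would need an additional cumulative step on top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table for all purchases; the fruit column replaces the three tables.
CREATE TABLE fruit_txn (date TEXT, company_id INTEGER, fruit TEXT, txn_amt INTEGER);
INSERT INTO fruit_txn VALUES
  ('2021-01-01', 1, 'apple',   5),
  ('2021-01-01', 1, 'apple',  -2),
  ('2021-01-01', 1, 'orange',  4),
  ('2021-01-01', 2, 'pear',    7),
  ('2021-01-02', 1, 'pear',    1);
""")

# Conditional aggregation: one row per (date, company), one column per fruit.
rows = conn.execute("""
    SELECT date, company_id,
           SUM(CASE WHEN fruit = 'apple'  THEN txn_amt ELSE 0 END) AS apple_total,
           SUM(CASE WHEN fruit = 'orange' THEN txn_amt ELSE 0 END) AS orange_total,
           SUM(CASE WHEN fruit = 'pear'   THEN txn_amt ELSE 0 END) AS pear_total,
           SUM(txn_amt) AS total_fruit
    FROM fruit_txn
    GROUP BY date, company_id
    ORDER BY date, company_id
""").fetchall()

for row in rows:
    print(row)
```

With one table there is nothing to join, so a company that traded only pears on a given day still gets a row, which an inner join across three tables would drop.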
I have two tables
TicketsForSale
ticket_id (PK)
type
category
Transactions
transaction_id (PK)
ticket_id (FK)
I want to get the transactions per type of tickets. This is what I've tried:
SELECT ticketsforsale.type
, COUNT(transactions.ticket_id)
FROM ticketsforsale
INNER JOIN transactions ON ticketsforsale.ticket_id = transactions.ticket_id
GROUP BY ticketsforsale.type
What I hope for as a result is something like this
{
Sports 5
Theater 7
Cruise 8
Cinema 10
}
But instead I get the following :
{ Theater 2
Cruise 1
Sports 1
Sports 2
Cruise 3
Cinema 5
}
The numbers aren't accurate, just used for demonstration.
(The category column is listing the specific show you attend by "purchasing" the ticket. E.G If the type is "Sports", the category could be Basketball or Football or Volleyball etc. etc. ) I just thought that this column could somehow be the issue here, but maybe I'm wrong.
Try this:
select distinct type
, encode(type::bytea,'hex') hex_type
from TicketsForSale order by 1;
You'll probably find that you have multiple type values that appear identical but have different hexadecimal representations (for example, trailing whitespace or a different character encoding). Fix those discrepancies, and then you should be good to go.
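As an illustration of how this shows up, the sketch below reproduces the symptom with a trailing space, using Python's sqlite3. SQLite's hex() stands in for the PostgreSQL encode(type::bytea, 'hex') call above, and the sample rows are invented. Grouping on TRIM(type) is only one possible workaround and assumes the discrepancy is whitespace; cleaning the stored values is the real fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ticketsforsale (ticket_id INTEGER PRIMARY KEY, type TEXT, category TEXT);
CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, ticket_id INTEGER);
INSERT INTO ticketsforsale VALUES
  (1, 'Sports',  'Basketball'),
  (2, 'Sports ', 'Football'),   -- note the trailing space
  (3, 'Cinema',  'Drama');
INSERT INTO transactions VALUES (1, 1), (2, 2), (3, 2), (4, 3);
""")

# Diagnosis: values that look identical on screen have different hex dumps.
variants = conn.execute(
    "SELECT DISTINCT type, hex(type) FROM ticketsforsale ORDER BY 1"
).fetchall()
for value, hex_value in variants:
    print(repr(value), hex_value)

# Workaround for whitespace-only differences: group on the trimmed value.
counts = conn.execute("""
    SELECT TRIM(t.type) AS type, COUNT(tr.ticket_id)
    FROM ticketsforsale t
    JOIN transactions tr ON tr.ticket_id = t.ticket_id
    GROUP BY TRIM(t.type)
    ORDER BY 1
""").fetchall()
print(counts)  # [('Cinema', 1), ('Sports', 3)]
```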
In a Rails application I have two tables associated as many-to-many: volumes and tracks.
volumes is a collection of radio show episodes. Each episode has multiple tracks. But each track can appear on many shows.
volumes:
id | title
-----+-----------------------
23 | Oldstyle
24 | How to stop worrying
tracks:
id | artist | title
------+----------------------+------------------------
4764 | John Lennon | Mind Games
4765 | George Harrison | All Those Years Ago
4766 | Paul McCartney | Here Today
What I need is to aggregate tracks by artist and show all volumes that contain that artist's tracks. Something like Artist.select(name: 'Beatles').volumes.
The problem is that I don't have an Artist model, and I don't want to create one, because I believe it would make manual data cleanup much harder.
For example, I can query data I need like:
SELECT DISTINCT t.artist, v.number
FROM tracks t
JOIN volumes_tracks vt ON vt.track_id = t.id
JOIN volumes v ON v.id = vt.volumes_id
GROUP BY artist, number;
But is it possible to wrap it in a model-like data structure for easy access?
You can map each result returned by the sql query to a Struct, which may give you the ease of access that you're looking for. Something like this:
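Artist = Struct.new(:artist, :volume_id)
result = ActiveRecord::Base.connection.execute(sql)
artists = result.map { |r| Artist.new(r['artist'], r['id']) }
first_artist = artists[0].artist
(The sql variable represents your SQL query string. Rows are keyed by output column name without the table alias, so this assumes the query selects t.artist and v.id, and an adapter such as PostgreSQL whose execute result yields hash-like rows.)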
I am working through a group by problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: For each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: How can I sum employment (or sales) by county by year as in the example table above, given that county as a grouping variable changes sometimes by the year and specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question, this sounds relatively straightforward. While I normally prefer to work with normalized data, I don't see that normalizing things beforehand will buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate the list of columns to sum up. The rest is a straight group by and join process as far as I can tell.
If the county changes from year to year, the best thing to do in my experience is to put together a location view that unions the per-year values, and join against it. This lets you hide the complexity from your main queries, and as long as you don't also join the underlying tables it should perform quite well.
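One way the location-view idea could look, sketched with Python's sqlite3 and cut down to two years. The county table layout (county1990, county1991) is an assumption based on the "year postfix" description in the question, the county values are invented, and emp_long is a hypothetical helper view; in PostgreSQL the same UNION ALL views would apply, with the column list generated from the information schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (dunsnumber TEXT PRIMARY KEY, emp1990 INTEGER, emp1991 INTEGER);
CREATE TABLE county (dunsnumber TEXT PRIMARY KEY, county1990 TEXT, county1991 TEXT);
INSERT INTO emp VALUES ('a', 12, 32), ('b', NULL, 2), ('d', 40, 86);
-- establishment 'd' moves from county A to county B in 1991
INSERT INTO county VALUES ('a', 'A', 'A'), ('b', 'B', 'B'), ('d', 'A', 'B');

-- Location view: one row per establishment per year.
CREATE VIEW location AS
  SELECT dunsnumber, 1990 AS year, county1990 AS county FROM county
  UNION ALL
  SELECT dunsnumber, 1991, county1991 FROM county;

-- Same unpivot for employment (hypothetical helper view).
CREATE VIEW emp_long AS
  SELECT dunsnumber, 1990 AS year, emp1990 AS emp FROM emp
  UNION ALL
  SELECT dunsnumber, 1991, emp1991 FROM emp;
""")

# Join on establishment AND year, then pivot back to one column per year.
rows = conn.execute("""
    SELECT l.county,
           SUM(CASE WHEN l.year = 1990 THEN e.emp END) AS emp1990,
           SUM(CASE WHEN l.year = 1991 THEN e.emp END) AS emp1991
    FROM location l
    JOIN emp_long e ON e.dunsnumber = l.dunsnumber AND e.year = l.year
    GROUP BY l.county
    ORDER BY l.county
""").fetchall()

print(rows)  # [('A', 52, 32), ('B', None, 88)]
```

Each year's employment is attributed to the county the establishment was in that year, which a single join on dunsnumber alone cannot express.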
I'm very new to SQL and I hope someone can help me with some SQL syntax. I have a database with these tables and fields,
DATA: data_id, person_id, attribute_id, date, value
PERSONS: person_id, parent_id, name
ATTRIBUTES: attribute_id, attribute_type
attribute_type can be "Height" or "Weight"
Question 1
Given a person's "Name", I would like to return a table of "Weight" measurements for each child. I.e., if John has 3 children named Alice, Bob and Carol, then I want a table like this:
| date | Alice | Bob | Carol |
I know how to get a long list of children's weights like this:
select d.date,
d.value
from data d,
persons child,
persons parent,
attributes a
where parent.name = 'John'
and child.parent_id = parent.person_id
and d.person_id = child.person_id
and d.attribute_id = a.attribute_id
and a.attribute_type = 'Weight';
but I don't know how to create a new table that looks like:
| date | Child 1 name | Child 2 name | ... | Child N name |
Question 2
Also, I would like to select the attributes to be between a certain range.
Question 3
What happens if the dates are not consistent across the children? For example, suppose Alice is 3 years older than Bob, then there's no data for Bob during the first 3 years of Alice's life. How does the database handle this if we request all the data?
1) It might not be so easy. MS SQL Server can PIVOT a table on an axis, but dumping the resultset to an array and sorting there (assuming this is tied to some sort of program) might be the simpler way right now if you're new to SQL.
If you can manage to do it in SQL it still won't be enough info to create a new table, just return the data you'd use to fill it in, so some sort of external manipulation will probably be required. But you can probably just use INSERT INTO [new table] SELECT [...] to fill that new table from your select query, at least.
2) You can join on attributes once for each attribute type, putting the type condition in the ON clause:
SELECT [...] FROM data AS d
JOIN persons AS p ON d.person_id = p.person_id
LEFT JOIN attributes AS weight ON d.attribute_id = weight.attribute_id
AND weight.attribute_type = 'Weight'
LEFT JOIN attributes AS height ON d.attribute_id = height.attribute_id
AND height.attribute_type = 'Height'
[...]
(The way you're joining in the original query is just shorthand for [INNER] JOIN .. ON; same thing, except the attribute_type condition moves into the ON clause. HAVING is not the right tool here, since it only filters groups after GROUP BY. A range on the measurement itself goes in WHERE, e.g. d.value BETWEEN [low] AND [high].)
3) It depends on the type of JOIN you use to match parent/child relationships, and any dates you're filtering on in the WHERE, if I'm reading that right (entirely possible I'm not). I'm not sure quite what you're looking for, or what kind of database you're using, so no good answer. If you're new enough to SQL that you don't know the different kinds of JOINs and what they can do, it's very worthwhile to learn them - they put the R in RDBMS.
When you do a select, you need to specify the exact columns you want. In other words, you can't return the Nth child's name. I.e., this isn't possible:
1/2/2010 | Child_1_name | Child_2_name | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name | Child_2_name
Each record needs to have the same number of columns. So you might be able to make a select that does this:
1/2/2010 | Child_1_name
1/2/2010 | Child_2_name
1/2/2010 | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name
1/4/2010 | Child_2_name
And then, in a report, remap it to how you want it displayed.
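The remapping step could be sketched as below, in plain Python. The sample rows and the pivot helper are hypothetical, standing in for whatever the long-format query returns as (date, child name, value). Dates where a child has no measurement simply come out as None, which also covers the case of children of different ages.

```python
from collections import defaultdict

# Long-format rows as the query might return them: (date, child name, weight).
long_rows = [
    ("2010-01-02", "Alice", 30.0),
    ("2010-01-02", "Bob", 25.5),
    ("2010-01-03", "Alice", 30.2),
    ("2010-01-04", "Alice", 30.1),
    ("2010-01-04", "Carol", 18.4),
]


def pivot(rows):
    """Turn (date, name, value) rows into a header plus one row per date."""
    children = sorted({name for _, name, _ in rows})
    by_date = defaultdict(dict)
    for date, name, value in rows:
        by_date[date][name] = value
    header = ["date"] + children
    # Missing measurements become None so every row has the same width.
    table = [
        [date] + [by_date[date].get(name) for name in children]
        for date in sorted(by_date)
    ]
    return header, table


header, table = pivot(long_rows)
print(header)
for row in table:
    print(row)
```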