Recently stumbled on this situation. Both approaches might be "light" in my situation; I just want to know which is better overall (performance, speed, etc.) when it comes to a big dataset.
Currently I do a single query across two 1:N (has-many) relationships and reduce/transform the data in the application.
The transformed/reduced result looks like this:
[
'field' => 'value',
'hasMany-1' => [],
'hasMany-2' => []
]
I'm actually somewhat tempted to just do separate queries, since that eliminates the pain of reducing when there are more than 2 has-many relations, and it's quite a bit more readable. But the code currently works, so maybe I'll just do that next time.
Is the compromise worth it? Again, in my situation it might be very "light", as I only have a few rows (< 100) and the structure is not complex, since it's still at an early stage.
But I'm asking in case I stumble upon this again once the dataset grows larger.
** EDIT **
So the has-many relationships I'm talking about are: a customer has many phones and pets.
My current query returns me this result (simplified):
customer_id | pet_name | phone
1 | john | 1234
1 | john | 5678
2 | jane | 1357
2 | jane | 2468
2 | joe | 1357
2 | joe | 2468
I think my query is fine. It seems logical for some rows to repeat because the other field has a different value.
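For reference, a query shaped like this would produce exactly that result (a sketch; table and column names are assumed):

select c.customer_id, p.pet_name, ph.phone
from customer c
join pet p on p.customer_id = c.customer_id
join phones ph on ph.customer_id = c.customer_id;
-- every pet row pairs with every phone row for the same customer,
-- so the output has (pets x phones) rows per customer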
In general, you should issue a single query and let the optimizer do the work for you. At the very least, this saves multiple round-trips to the database and query compilation.
There are cases where multiple queries can have better performance, but I think it is better to start with a single query.
You have a particular issue regarding joins along multiple many-to-many dimensions. There is no need to do the joins "generally" and then "reduce" the results. There are more efficient methods.
I would suggest that you ask another question. Provide sample data, desired results, and an explanation of the logic you are attempting. You may be able to learn a more efficient way to write a single query.
You did not describe your table structures, so I am assuming a few things.
If you want to have pets and phones as one row do:
select c.customer_id, c.name,
       -- distinct: the pet x phone cross join repeats each pet once per phone (and vice versa)
       array_to_string(array_agg(distinct p.pet_name), ',') pet_names,
       array_to_string(array_agg(distinct ph.phone), ',') phones
from customer c, pet p, phones ph
where p.customer_id=1
  and p.customer_id=c.customer_id
  and ph.customer_id=c.customer_id
group by c.customer_id, c.name
If you want to have a row per pet_name with all possible phone numbers:
select c.customer_id, c.name,
       p.pet_name,
       array_to_string(array_agg(ph.phone), ',') phones
from customer c, pet p, phones ph
where p.customer_id=1
  and p.customer_id=c.customer_id
  and ph.customer_id=c.customer_id
group by c.customer_id, c.name, p.pet_name
If we talk about performance, it will be faster to do 2 queries against pets and phones separately by customer_id. But until you have millions of rows, it is not so important.
Of course you should have indexes on customer_id.
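For reference, the two-query version described above would look something like this (same assumed table names; index names are arbitrary):

-- one query per child table, using the same customer filter
select pet_name from pet where customer_id = 1;
select phone from phones where customer_id = 1;

-- supporting indexes
create index pet_customer_idx on pet (customer_id);
create index phones_customer_idx on phones (customer_id);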
I know this question has been asked and answered. I understand the problem, I understand the underlying cause, and I understand the solution. What I DON'T understand is how to implement the solution.
I'll try to be detailed....
Background: Each material is being grouped on WellID (I work in oil and gas) and SandType, which is my primary key in each table; these come from 2 lookup tables, one for each.
I have 3 tables that store material (sand) weights at 3 different stages in the job process: basically, the weight from the engineer's DESIGN, what was DELIVERED, and what is in INVENTORY.
I know that the join is messed up and is adding the total for each row in each table, sometimes doubling or tripling it.
I am grouping on WellID and SandID.
Now, I don't want someone to do the work for me. I just don't know how or where in Access to restrict it to what I want, or whether modifying the SQL is the proper way to write the code. My current workaround is 3 separate sum queries, one for each table, but that is going to get inefficient and adds steps.
My whole database purpose and all subsequent reports hinge on math with these 3 numbers, so my showstopper here is putting the fat lady on stage, and it's about to become a deal breaker at the end of the line!
I need some advice, direction, criticism, wisdom, witty euphemisms or a new job!
The 3 tables look as follows
Design:
T_DESIGN
DesignID WellID Sand_ID Weight_DES Time_DES
89 201 1 100 4/21/2014 6:46:02 AM
98 201 2 100 4/21/2014 7:01:22 AM
86 201 4 100 4/21/2014 6:28:01 AM
93 228 5 100 4/21/2014 6:53:34 AM
91 228 1 100 4/21/2014 6:51:23 AM
92 228 1 100 4/21/2014 6:53:30 AM
Delivered:
T_BOL
BOLID WellID_BOL SandID_BOL Weight_BOL
279 201 1 100
280 201 1 100
281 228 2 5
282 228 1 10
283 228 9 100
Inventory:
T_BIN
StrapID WellID_BIN SandID_BIN Weight_BIN
11 201 1 100
13 228 1 10
14 228 1 0
17 228 1 103
19 201 1 50
The Query Results:
Test Query99
WellID SandID Sum Of Weight_DES Sum Of Weight_BOL Sum Of Weight_BIN
201 1 400 400 300
228 1 600 60 226
SQL:
SELECT DISTINCTROW L_WELL.WellID, L_SAND.SandID,
Sum(T_DESIGN.Weight_DES) AS [Sum Of Weight_DES],
Sum(T_BOL.Weight_BOL) AS [Sum Of Weight_BOL],
Sum(T_BIN.Weight_BIN) AS [Sum Of Weight_BIN]
FROM ((L_SAND INNER JOIN
(L_WELL INNER JOIN T_DESIGN ON L_WELL.[WellID] = T_DESIGN.[WellID_DES])
ON L_SAND.SandID = T_DESIGN.[SandID_DES])
INNER JOIN T_BIN
ON (L_WELL.WellID = T_BIN.WellID_BIN)
AND (L_SAND.SandID = T_BIN.SandID_BIN))
INNER JOIN T_BOL
ON (L_WELL.WellID = T_BOL.WellID_BOL) AND (L_SAND.SandID = T_BOL.SandID_BOL)
GROUP BY L_WELL.WellID, L_SAND.SandID;
The two lookup tables are for Well Names and Sand Types. (The well list has been abbreviated due to size.)
L_Well:
WellID WellName_WELL
3 AAGVIK 1-35H
4 AARON 1-22
5 ACHILLES 5301 41-12B
6 ACKLINS 6092 12-18H
7 ADDY 5992 43-21 #1H
8 AERABELLE 5502 43-7T
9 AGNES 1-13H
10 AL 5493 44-23B
11 ALDER 6092 43-8H
12 AMELIA FEDERAL 5201 41-11B
13 AMERADA STATE 1-16X
14 ANDERSMADSON 5201 41-13H
15 ANDERSON 1-13H
16 ANDERSON 7-18H
17 ANDRE 5501 13-4H
18 ANDRE 5501 14-5 3B
19 ANDRE SHEPHERD 5501 14-7 1T
Sand Lookup:
L_SAND
SandID SandType_Sand
1 100 Mesh
2 20/40 EP
3 20/40 RC
4 20/40 W
5 30/50 Ceramic
6 30/50 EP
7 30/50 RC
8 40/70 EP
9 40/70 W
10 NA See Notes
Querying and Joining Aggregation Data through an MS Access Database
I noticed your concern for pointers on how to implement some of the theory behind your aggregation queries. While SQL queries are good power-tools to get to the core of a difficult analysis problem, it might also be useful to show some of the steps on how to bring things together using the built-in design tools of MS Access.
This solution was developed on MS Access 2010.
Comments on Previous Solutions
@xQbert had a solid start with the following SQL statement. The subquery approach could be visualized as individual query objects created in Access:
FROM
    ((SELECT WellID, Sand_ID, Sum(Weight_DES) AS sumWeightDES
      FROM T_DESIGN
      GROUP BY WellID, Sand_ID) AS A
    INNER JOIN
     (SELECT WellID_BOL, Sum(Weight_BOL) AS SUMWEIGHTBOL
      FROM T_BOL
      GROUP BY WellID_BOL) AS B
     ON A.WellID = B.WellID_BOL)
    INNER JOIN
     (SELECT WellID_BIN, Sum(Weight_BIN) AS SumWeightBin
      FROM T_BIN
      GROUP BY WellID_BIN) AS C
     ON C.WellID_BIN = B.WellID_BOL
Depending on the actual rules of the business data, the following assumptions made in this query may not necessarily be true:
Will the tables of T_DESIGN, T_BOL and T_BIN be populated at the same time? The sample data has mixed values, i.e., there are WellID and SandID combinations which do not have values for all three of these categories.
INNER type joins assume all three tables have records for each dimension value (Well-Sand combination)
@Frazz improved on the query design by suggesting that whatever is selected as the "base" joining table (T_DESIGN in this case) must be populated with all the relevant dimensional values (WellID and SandID combinations).
SELECT
    WellID_DES AS WellID,
    SandID_DES AS SandID,
    SUM(Weight_DES) AS Weight_DES,
    (SELECT SUM(Weight_BOL) FROM T_BOL WHERE T_BOL.WellID_BOL=d.WellID_DES
     AND T_BOL.SandID_BOL=d.SandID_DES) AS Weight_BOL,
    (SELECT SUM(Weight_BIN) FROM T_BIN WHERE T_BIN.WellID_BIN=d.WellID_DES
     AND T_BIN.SandID_BIN=d.SandID_DES) AS Weight_BIN
FROM T_DESIGN AS d
GROUP BY WellID_DES, SandID_DES;
This was an improvement, because now all joins originate from a single point. If a key-value does not exist in either T_BOL or T_BIN, results will still come back and the entire record of the query will not be lost.
Again, it may be possible that there are no T_DESIGN records matching to values stored in the other tables.
Building Aggregation Sub Query Objects
The presented data does not suggest that there is any direct interaction between the data in each of the three tables aside from lining up their results in the end for presentation based on a common key-value pair (WellID and SandID). Since we are using Access, there is a chance to do these calculations separately.
This query was designed using the "summarizing" (totals) feature of the Access query design tool, pointed at the T_DESIGN table.
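As a rough sketch, the SQL behind that summary query (here called QSUB_DES, with column names taken from the T_DESIGN sample above) would be:

SELECT WellID, Sand_ID, Sum(Weight_DES) AS SumOfWeight_DES
FROM T_DESIGN
GROUP BY WellID, Sand_ID;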
Making Dimension Table Through a Cartesian Product
There are mixed opinions out there about cartesian products, but they do actually have a purpose.
Most of the concern is that a runaway cartesian product query will make millions and millions of nonsensical data values. In this query, it's specifically designed to simulate a real business condition.
The Case for a Cartesian Product
Picking from the sample data provided:
Some of the sand types ("20/40 EP", "30/50 Ceramic", "40/70 EP", and "30/50 RC") are moved between their respective wells, but are these sand types found at these wells consistently throughout the year?
Without an anchoring dimension for the key-values, Wells would not be found anywhere in the database via querying. It's not that they do not exist... it's just that there is no recorded data (i.e., Sand Type Weights delivered) for them.
A Reference Dimension Query Product
A dimension query is simple to produce. By referencing the two sources of keys, L_WELL and L_SAND (both lookup/dimension tables), without identifying a join condition, all the different combinations of the two key-values (WellID and SandID) are made:
The shortcut in SQL looks like this:
SELECT L_WELL.WellID, L_SAND.SandID, L_WELL.WellName_WELL, L_SAND.SandType_Sand
FROM L_SAND, L_WELL;
The result is one row for every combination of WellID and SandID.
Instead of using any of the operational data tables (T_DESIGN, T_BOL, or T_BIN) as the source for a static dimension such as a list of oil wells or a catalog of sand types, that data has been predetermined, and it can even be transferred to a real table since it probably will not change much once created.
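In Access, that transfer can be done with a make-table query; as a sketch (the target table name WELL_SAND_DIM is made up):

SELECT L_WELL.WellID, L_SAND.SandID, L_WELL.WellName_WELL, L_SAND.SandType_Sand
INTO WELL_SAND_DIM
FROM L_SAND, L_WELL;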
Correlating Sub Query Results from Different Sources
After repeating the process and creating the summary queries for the other two sources (T_BOL and T_BIN), you can finally arrange the results through a simple query and join process.
The actual JOIN operations are between the dimension table/query: QSUB_WELL_SAND and all three of the summary queries: QSUB_DES, QSUB_BOL, and QSUB_BIN.
I have chosen to implement LEFT OUTER joins. If you are not sure of the difference between the various "outer" joins, this is the choice made through the Access Query Design dialogue.
QSUB_WELL_SAND is defined as our anchor dimension. It will always have at least as many records as any of the summary queries. An OUTER JOIN is defined to KEEP all reference dimension records... and all summary table query results, regardless of whether there is a match between the two query results.
QSUB_WEIGHTS: The Query to Combine All Sub Query Results
The final output query LEFT JOINs the dimension query QSUB_WELL_SAND to each of the three summary queries and lines their totals up side by side.
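As a sketch in SQL, assuming QSUB_BOL and QSUB_BIN are built the same way as QSUB_DES above (with analogous column names), the combined query would be roughly:

SELECT ws.WellID, ws.SandID,
       des.SumOfWeight_DES, bol.SumOfWeight_BOL, bin.SumOfWeight_BIN
FROM ((QSUB_WELL_SAND AS ws
    LEFT JOIN QSUB_DES AS des
      ON ws.WellID = des.WellID AND ws.SandID = des.Sand_ID)
    LEFT JOIN QSUB_BOL AS bol
      ON ws.WellID = bol.WellID_BOL AND ws.SandID = bol.SandID_BOL)
    LEFT JOIN QSUB_BIN AS bin
      ON ws.WellID = bin.WellID_BIN AND ws.SandID = bin.SandID_BIN;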
Conclusions and Clean Up: Some Closing Thoughts
With respect to the join to the dimension query, there is a lot of empty space where there are no records or data to report on. This is where a cleverly placed filter or query criterion can shrink the output to exactly what you care to look at most; I added additional criteria at the end of my query to do just that.
My data was based on what was supplied by the OP, except where the IDs assigned to the Well Type attribute did not match the sample data; in those cases I assigned values of my own.
Access supports a different style of database operations. Step-wise queries can be developed to hold pre-processed, special sets of data that can be reintroduced to the other data tables and query results to develop complex query criteria.
All this being said, programming in SQL can be just as rewarding. Be sure to explore the differences between the results and the capabilities you can tap into by using one approach (SQL coding), the other (Access design wizards), or both. There's definitely a lot of room to grow and discover new capabilities from just the example provided here.
Hopefully I haven't stolen all the fun from developing a solution for your situation. I read into your comment about "building more on top" as the harbinger of more fun to come, so I don't feel so bad...! Happy Developing!
Without understanding L_SAND and L_WELL, this is the best I could come up with.
Use subselects to get the sums first, so you don't compound the data issues on the joins.
SELECT A.WellID, A.Sand_ID, A.sumWeightDES, B.WellID_BOL, B.SUMWEIGHTBOL,
       C.WellID_BIN, C.SumWeightBin
FROM
    ((SELECT WellID, Sand_ID, Sum(Weight_DES) AS sumWeightDES
      FROM T_DESIGN
      GROUP BY WellID, Sand_ID) AS A
    INNER JOIN
     (SELECT WellID_BOL, Sum(Weight_BOL) AS SUMWEIGHTBOL
      FROM T_BOL
      GROUP BY WellID_BOL) AS B
     ON A.WellID = B.WellID_BOL)
    INNER JOIN
     (SELECT WellID_BIN, Sum(Weight_BIN) AS SumWeightBin
      FROM T_BIN
      GROUP BY WellID_BIN) AS C
     ON C.WellID_BIN = B.WellID_BOL
I would simplify it by excluding L_WELL and L_SAND. If you are just interested in IDs, those joins really shouldn't be necessary. If all the other 3 tables have the WellID and SandID columns, then pick the one that is sure to have all combos.
Supposing it's the Design table, then:
SELECT
WellID_DES AS WellID,
SandID_DES AS SandID,
SUM(Weight_DES) AS Weight_DES,
(SELECT SUM(Weight_BOL) FROM T_BOL WHERE T_BOL.WellID_BOL=d.WellID_DES AND T_BOL.SandID_BOL=d.SandID_DES) AS Weight_BOL,
(SELECT SUM(Weight_BIN) FROM T_BIN WHERE T_BIN.WellID_BIN=d.WellID_DES AND T_BIN.SandID_BIN=d.SandID_DES) AS Weight_BIN
FROM T_DESIGN AS d
GROUP BY WellID_DES, SandID_DES;
... and make sure all your tables have an index on WellID and SandID.
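For example (a sketch; the index names are arbitrary):

CREATE INDEX idx_des_well_sand ON T_DESIGN (WellID_DES, SandID_DES);
CREATE INDEX idx_bol_well_sand ON T_BOL (WellID_BOL, SandID_BOL);
CREATE INDEX idx_bin_well_sand ON T_BIN (WellID_BIN, SandID_BIN);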
Just to be clear: I don't think it's a good idea to start the join from the lookup tables, or from their cartesian product. You can always left join them to fetch descriptions and other data. But the main query should be the one with all the combos of WellID and SandID... or if not all, at least the most. Things get difficult if none of the 3 tables (DESIGN, BOL and BIN) has all combos. In that case (and I'd say only in that case) you might as well start with the cartesian product of the two lookup tables. You could also do a UNION, but I doubt that would be more efficient.
I am working through a group by problem and could use some direction at this point. I want to summarize a number of variables by a grouping level which is different (but the same domain of values) for each of the variables to be summed. In pseudo-pseudo code, this is my issue: For each empYEAR variable (there are 20 or so employment-by-year variables in wide format), I want to sum it by the county in which the business was located in that particular year.
The data is a bunch of tables representing business establishments over a 20-year period from Dun & Bradstreet/NETS.
More details on the database, which is a number of flat files, all with the same primary key.
The primary key is DUNSNUMBER, which is present in several tables. There are tables detailing, for each year:
employment
county
sales
credit rating (and others)
all organized as follows (this table shows employment, but the other variables are similarly structured, with a year postfix).
dunsnumber|emp1990 |emp1991|emp1992|... |emp2011|
a | 12 |32 |31 |... | 35 |
b | |2 |3 |... | 5 |
c | 1 |1 | |... | |
d | 40 |86 |104 |... | 350 |
...
I would ultimately like to have a table that is structured like this:
county |emp1990|emp1991|emp1992|...|emp2011|sales1990|sales1991|sales1992|sales2011|...
A
B
C
...
My main challenge right now is this: how can I sum employment (or sales) by county by year, as in the example table above, given that county as a grouping variable sometimes changes by year and is specified in another table?
It seems like something that would be fairly straightforward to do in, say, R with a long data format, but there are millions of records, so I prefer to keep the initial processing in postgres.
As I understand your question, this sounds relatively straightforward. While I normally prefer normalized data to work with, I don't see that normalizing things beforehand will buy you anything specific here.
It seems to me you want something relatively simple like:
SELECT sum(emp1990), sum(emp1991), ....
FROM county c
JOIN emp e ON c.dunsnumber = e.dunsnumber
JOIN sales s ON c.dunsnumber = s.dunsnumber
JOIN ....
GROUP BY c.name, c.state;
I don't see a simpler way of doing this. Very likely you could query the system catalogs or information schema to generate the list of columns to sum up. The rest is a straight group-by and join process as far as I can tell.
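For example, on a reasonably recent Postgres, something along these lines could generate the sum list (a sketch; the table name emp is assumed from above):

SELECT string_agg('sum(' || column_name || ') AS ' || column_name, ', ')
FROM information_schema.columns
WHERE table_name = 'emp'
  AND column_name LIKE 'emp%';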
If the variable changes by name, the best thing to do in my experience is to put together a location view based on that union and join against it (see the sketch below). This lets you hide the complexity from your main queries, and as long as you don't also join the underlying tables, it should perform quite well.
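A sketch of that union/view idea, with hypothetical table and column names (emp as above, plus a county table laid out the same wide way):

-- unpivot the wide employment table into long format
CREATE VIEW emp_long AS
SELECT dunsnumber, 1990 AS yr, emp1990 AS emp FROM emp
UNION ALL
SELECT dunsnumber, 1991, emp1991 FROM emp
-- ... one branch per year ...
UNION ALL
SELECT dunsnumber, 2011, emp2011 FROM emp;

-- build county_long the same way, then group on the year-specific county
SELECT c.county, e.yr, sum(e.emp) AS total_emp
FROM emp_long e
JOIN county_long c ON c.dunsnumber = e.dunsnumber AND c.yr = e.yr
GROUP BY c.county, e.yr;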
I have a table in MS Access containing results from a survey, and I have a lookup table of Risk IDs and descriptions of the sort of risk, based on the survey results.
What I've tried so far is selecting distinct entries from my survey table and inputting a new field into my query for the Risk Code, whose number will depend on criteria that I determine, which I will then use to look up the risk.
My table for the survey looks like so:
Name | Location | Days spent eating IceCream | Icecream eating location
John Smith | London | 30 | Hull
My Risk ID table looks like so:
RiskID | RiskBool | Description
1 | Yes | At risk - This person eats too much icecream
2 | Yes | Risk - This person does not eat enough icecream
3 | No | Sensible amount of icecream eaten
4 | Yes | It is illegal to eat icecream in Hull
And my query looks (something) like this in Access design view:
Name | Location | Risk Code | RiskID | Description
I want to write SQL to change the Risk Code to 1, 2, 3, 4 (up to 15 in my real case), and then I will tell it to only display the person and the description where the Risk ID and Code match. I haven't written this yet.
What is the best way to achieve this?
I see two possibilities:
Set up 15 queries, one for each risk ID, add the descriptions to those, and then join those 15 sets of results together. This is what I know how to do, but it could end up quite messy.
Set up some 'check' using if statements, and then somehow set the Risk Code field for that entry.
My current SQL looks like this, but it doesn't make any checks yet; I'm worried the if statement will be very, very long.
SELECT DISTINCT
[At Risk Employee List].Employee AS Name,
[At Risk Employee List].[DaysIceCream] AS [Days spent eating Icecream],
[At Risk Employee List].[Base Location],
[RiskCode] AS [Risk Code], <----is this where the check would need to go?
RiskDescLookup.RiskBoolean,
RiskDescLookup.RiskExplanation
FROM RiskDescLookup,
[Survey Raw Data]
INNER JOIN
[At Risk Employee List]
ON
[Survey Raw Data].ResID = [At Risk Employee List].[Staff ID]
GROUP BY
[At Risk Employee List].Employee,
[At Risk Employee List].[DaysIceCream],
[At Risk Employee List].[Base Location],
RiskDescLookup.RiskID,
[RiskCode] AS [Risk Code], <----is this where the check would need to go?
RiskDescLookup.RiskBoolean,
RiskDescLookup.RiskExplanation
I imagine the check done by if statements would be very long and look something like this (in pseudocode):
if ( [At Risk Employee List].[Base Location] = Hull, then [RiskCode]=4...., else if (DaysIceCream>42) then....
Is that the best way to do this? Do I even need to have a Risk Code?
I'm a bit lost as to how to produce this 'check' in the best possible way.
I am not entirely certain of your intent, but from what you've posted and the follow-up comments, it would appear that the process of joining the Risk Code to the Risk ID is relatively simple once you have the Risk Code identified for each survey result.
The real issue, it seems, is how to encapsulate the logic that identifies the Risk Code for each survey result. I would suggest "calculating" the risk code value for each survey result externally to your query, and then joining to those results before finally joining to the Risk ID.
For example, I might add a third table to the design, SurveyRisk, that contains Name and Risk Code.
Use whatever criteria and logic you need to identify the risk for each survey response, and enter these values into the SurveyRisk table. Then you can simply join Survey to SurveyRisk to Risk to summarize your results.
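A sketch of that join chain (table and column names here are placeholders for whatever your actual design uses):

SELECT s.Name, r.Description
FROM (Survey AS s
INNER JOIN SurveyRisk AS sr ON s.Name = sr.Name)
INNER JOIN Risk AS r ON sr.RiskCode = r.RiskID;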
Feel free to clarify where I'm misunderstanding what you are trying to accomplish and I'll edit my post accordingly.
The best way to do this is to use a lookup table that emulates the structure of your data.
Add a row for every 'case', and in MS Access link the corresponding fields together.
Then alter the SQL to pair up any options that need to go together. For instance, each of the checks I make is duplicated for two separate locations.
Here is an example:
FROM RiskDescLookupReg
INNER JOIN ([Survey Raw Data]
INNER JOIN [At Risk Employee List]
ON [Survey Raw Data].ResID=[At Risk Employee List].[Staff ID])
ON (RiskDescLookupReg.RegTravelChoice=[Survey Raw Data].RegTravelChoice)
And (RiskDescLookupReg.MonthChoice2=[Survey Raw Data].MonthChoice2
And RiskDescLookupReg.PercentageTimeChoice2=[Survey Raw Data].PercentageTimeChoice2
And RiskDescLookupReg.LimitedDurationChoice2=[Survey Raw Data].LimitedDurationChoice2
And RiskDescLookupReg.TemporaryPurposeChoice2=[Survey Raw Data].TemporaryPurposeChoice2)
Or (
RiskDescLookupReg.MonthChoice1=[Survey Raw Data].MonthChoice1
And RiskDescLookupReg.PercentageTimeChoice1=[Survey Raw Data].PercentageTimeChoice1
And RiskDescLookupReg.LimitedDurationChoice1=[Survey Raw Data].LimitedDurationChoice1
And RiskDescLookupReg.TemporaryPurposeChoice1=[Survey Raw Data].TemporaryPurposeChoice1)
Note how there are two blocks for each location. If I only had one location of interest, I could drop the last block.
If you get duplicates because of the way your lookup table is arranged, you need to wrap the parts that come from the lookup table in LAST, and the parts from the survey in FIRST. Here is an example:
SELECT
[At Risk Employee List].Number,
FIRST([At Risk Employee List].Employee) AS Name,
FIRST([At Risk Employee List].[Base Location]) AS BaseLocation,
LAST(RiskDescLookupReg.RiskBool) AS RiskBool,
LAST(RiskDescLookupReg.RiskDesc) AS RiskDesc,
The use of LAST ensures that if someone would come up as both at risk and not at risk, only the LAST at-risk case is displayed (those entries come later in the field). This is counter to the fact that when duplicates are displayed, the at-risk ones come first.
I have 2 tables, a purchases table and a users table. Records in the purchases table looks like this:
purchase_id | product_ids | customer_id
---------------------------------------
1 | (99)(34)(2) | 3
2 | (45)(3)(74) | 75
Users table looks like this:
user_id | email | password
----------------------------------------
3 | joeShmoe@gmail.com | password
75 | nolaHue@aol.com | password
To get the purchase history of a user I use a query like this:
mysql_query(" SELECT * FROM purchases WHERE customer_id = '$users_id' ");
The problem is: what will happen when tens of thousands of records are inserted into the purchases table? I feel like this will take a performance toll.
So I was thinking about storing the purchases in an additional field directly in the user's row:
user_id | email | password | purchases
------------------------------------------------------
1 | joeShmoe@gmail.com | password | (99)(34)(2)
2 | nolaHue@aol.com | password | (45)(3)(74)
And when I query the user's table for things like username, etc. I can just as easily grab their purchase history using that one query.
Is this a good idea, will it help better performance or will the benefit be insignificant and not worth making the database look messier?
I really want to know what the pros do in these situations. For example, how does Amazon query its database for a user's purchase history when they have millions of customers? How come their queries don't take hours?
EDIT
Ok, so I guess keeping them separate is the way to go. Now the question is a design one:
Should I keep using the "purchases" table I illustrated earlier? In that design I am separating the product ids of each purchase using parentheses, and using them as the delimiter to tell the ids apart when extracting them via PHP.
Or should I instead be storing each product id separately in the "purchases" table, so it looks like this?
purchase_id | product_ids | customer_id
---------------------------------------
1 | 99 | 3
1 | 34 | 3
1 | 2 | 3
2 | 45 | 75
2 | 3 | 75
2 | 74 | 75
Nope, this is a very, very, very bad idea.
You're breaking first normal form because you don't know how to page through a large data set.
Amazon and Yahoo! and Google bring back (potentially) millions of records - but they only display them to you in chunks of 10 or 25 or 50 at a time.
They're also smart about guessing or calculating which ones are most likely to be of interest to you - they show you those first.
Which purchases in my history am I most likely to be interested in? The most recent ones, of course.
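In MySQL, that chunked display is just an ORDER BY with a LIMIT; a sketch against the purchases table above:

-- page 1: the 25 most recent purchases for customer 3
SELECT *
FROM purchases
WHERE customer_id = 3
ORDER BY purchase_id DESC
LIMIT 25 OFFSET 0;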
You should consider building these into your design before you violate relational database fundamentals.
Your database already looks messy, since you are storing multiple product_ids in a single field instead of creating an "association" table like this:
_____product_purchases____
purchase_id | product_id |
--------------------------
1 | 99 |
1 | 34 |
1 | 2 |
You can still fetch it in one query:
SELECT * FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id)
WHERE p.customer_id = $user_id
But this also gives you more possibilities, like finding out how many of product #99 were bought, getting a list of all customers that purchased product #34, etc.
And of course, don't forget about indexes, which will make all of this much faster.
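For example (a sketch; the index names are arbitrary):

-- how many times product #99 was bought
SELECT COUNT(*) FROM product_purchases WHERE product_id = 99;

-- all customers who purchased product #34
SELECT DISTINCT p.customer_id
FROM purchases p
JOIN product_purchases pp USING (purchase_id)
WHERE pp.product_id = 34;

-- indexes that support these lookups
CREATE INDEX idx_pp_product ON product_purchases (product_id);
CREATE INDEX idx_purch_customer ON purchases (customer_id);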
By doing this with your schema, you would break the entity-relationship design of your database.
You might want to look into Memcached, NoSQL, and Redis.
These are all tools that will help you improve query performance, mostly by storing data in RAM.
For example: run the query once and store the result in Memcache; if the user refreshes the page, you get the data from Memcache, not from MySQL, which avoids querying your database a second time.
Hope this helps.
First off, tens of thousands of records is nothing. Unless you're running on a teensy weensy machine with limited RAM and hard drive space, a database won't even blink at 100,000 records.
As for storing purchase details in the users table... what happens if a user makes more than one purchase?
MySQL is hugely capable, and don't let the fact that it's free convince you otherwise. Keeping the two tables separate is probably best, not only because it keeps the db more normalized, but because the extra indices will speed queries. A 10,000-record database is relatively small compared to multi-hundred-million-record health record databases.
As far as Amazon and Google go, they hire hundreds of developers to write specialized query languages for their specific application needs - not something developers like us have the resources to fund.
There are similar questions to this, but I don't think anyone has asked this particular question.
Scenario:
Customer - Order (where Order has a CustomerID) - OrderPart - Part
I want a query that returns a customer with all its orders and each order with its parts.
Now I have two main choices:
Use a nested loop (which produces separate queries)
Use data loading options (which produces a single query join)
The question:
Most advice and examples on ORMs suggest using option 2, and I can see why. However, option 2 will potentially send back a huge amount of duplicated data, e.g.:
Option 1 results (3 queries):
ID Name Country
1 Customer1 UK
ID Name
1 Order1
2 Order2
ID Name
1 Part1
2 Part2
3 Part3
Option 2 results (1 query):
ID Name Country ID Name ID Name
1 Customer1 UK 1 Order1 1 Part1
1 Customer1 UK 1 Order1 2 Part2
1 Customer1 UK 1 Order1 3 Part3
1 Customer1 UK 2 Order2 1 Part1
1 Customer1 UK 2 Order2 2 Part2
Option 1 sends back 13 fields with 3 queries. Option 2 sends back 35 fields in 1 query. Now imagine the Customer table has 30 fields and Orders have more complex sub-joins; the data duplication can quickly become huge.
What impact on overall performance do the following things have:
Overhead of making a database connection
Time taken to send data (potentially across network if on different server)
Bandwidth
Is option 2 always the best choice, option 1 the best choice or does it depend on the situation? If it depends, what criteria should you use to determine? Are any ORMs clever enough to work it out for themselves?
Overhead of making a database connection
Very little if they are on the same subnet, which they usually are. If they're not then this is still not a huge overhead and can be overcome with caching, which most ORMs have (NHibernate has 1st and 2nd level caching).
Time taken to send data (potentially across network if on different server)
For SELECT N+1 this will obviously take longer, as the select statement (which might be up to 1 KB) has to be sent each time, and a new connection has to be grabbed from the pool. Chatty versus chunky used to be an argument around 2002-2003, but now it really doesn't make a huge difference unless this is a really big application, in which case you will probably want a more experienced (or better paid) pundit giving his views - i.e., a consultant.
I would favour joins, however, as databases have been optimised for this usage over their 10 or more years of development. If performance is really slow, a View or Stored Procedure can sort it out.
By the way, SELECT N+1 is probably the most common performance problem people experience with NHibernate when they first start using it (me included), and it is something that actually takes tweaking to sort out. This is because NHibernate is to ORMs what C++ is to languages.
Bandwidth
An extra SELECT statement for every Customer will eventually build up to however many Customer objects * Orders there are. So for a large system this might be noticeable - but as I mentioned, ORMs usually have caching mechanisms in place to negate this problem. The number of SELECT statements also isn't going to be that huge, considering:
You're on the same network as the SQL server most of the time
The increased traffic amounts to maybe an extra 0.5-50 KB of bandwidth. Think how fast that is on most servers.
A great deal of this is going to depend on the amount of data you are going through.
The join, while returning more fields, is going to run markedly faster (as a rule) than the Option 1 set of queries.
From my personal experience, slowdowns are almost always at that level - the actual running of the query - not in the sheer amount of data being passed along whatever pipe you have.