This is a classic problem that I've never managed to solve in a way I was happy with. What would be considered an OO-elegant and DB-scalable approach to it?
Employee
- Name, Phone, Meta, etc
Company
- Meta, etc
- Employee[]
CompanyRepository (RDBMS)
- Company GetById(int)
- Company[] GetAll()
Approach #1:
'GetById' selects all from 'tCompany' and left joins 'tEmployee'. The SQL SELECT yields 12 rows. Returns a single company populated with 12 employees.
'GetAll' uses the same SELECT as above but returns 12,000,000 rows. Through creative loops and logic it returns 1,000,000 companies, each with 12 employees.
Approach #2:
'GetById' ... same as above
'GetAll' selects all from 'tCompany' but nothing from 'tEmployee'. The SQL SELECT yields 1,000,000 rows. Returns 1,000,000 companies, but each with a null 'Employees' property.
Approach #3:
... split the domain into a 'SimpleCompany' containing only meta and a 'ComplexCompany' that inherits from 'SimpleCompany' but has an 'Employees' property. 'GetById' returns a 'ComplexCompany' and 'GetAll' returns a 'SimpleCompany' array.
... each smells bad for different reasons.
What is the business reason for getting all companies (12,000,000 rows)? I would not advise keeping all 12,000,000 rows in memory at a time.
Maybe you should use pagination: select a limited set of companies at a time, then iterate from one page to the next until no rows are returned.
public Company[] GetAllByPageNumber(int pageNumber, int pageSize)
The downside here is that companies should not be inserted or deleted while you are iterating.
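On the SQL side, a minimal sketch of the paged query behind such a method might look like this (assuming SQL Server-style OFFSET/FETCH, a zero-based page number, and an illustrative 'CompanyId' key column):
SELECT CompanyId, Name, Meta
FROM tCompany
ORDER BY CompanyId
OFFSET (@pageNumber * @pageSize) ROWS
FETCH NEXT @pageSize ROWS ONLY;
The employees for that page can then be loaded with a second query restricted to the returned company ids, which keeps both result sets small.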
I'm working in a large Access database (Access 2010) and am trying to return records where two locations are different.
In my case, I have a large number of birds that have been observed on multiple dates and potentially on different islands. Each bird has a unique BirdID and also an actual physical identifier (unfortunately that may have changed over time). [I'm going to try addressing the changing physical identifier issue later]. I currently want to query individual birds where one or more of their observations is different than the "IslandAlpha" (the first island where they were observed). Something along the lines of a criteria for BirdID: WHERE IslandID [doesn't equal] IslandAlpha.
I then need a separate query to find where all observations DO equal where they were first observed. So where IslandID = IslandAlpha
I'm new to Access, so let me know if you need more information on how my tables/relationships are set up! Thanks in advance.
Assuming the following tables:
Birds table in which all individual birds have records with a unique BirdID and IslandAlpha.
Sightings table in which individual sightings are recorded, including IslandID.
Your first query would look something like this:
SELECT *
FROM Birds
INNER JOIN Sightings ON Birds.BirdID=Sightings.BirdID
WHERE Sightings.IslandID <> Birds.IslandAlpha
Your second query would be the same, but with = instead of <> in the WHERE clause.
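Spelled out, a sketch of that second query (reusing the same table names):
SELECT *
FROM Birds
INNER JOIN Sightings ON Birds.BirdID=Sightings.BirdID
WHERE Sightings.IslandID = Birds.IslandAlpha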
Please provide us information about the tables and columns you are using.
I will presume you are asking this question because a simple join of tables and filtering where IslandAlpha <> ObsLoc is not possible, since IslandAlpha is derived from the first observation record for each bird. Pulling the first observation record for each bird requires a nested query. You need a unique record identifier in Observations - an autonumber should serve. Assuming there is an observation date/time field, consider:
SELECT * FROM Observations WHERE ObsID IN
(SELECT TOP 1 ObsID FROM Observations AS Dupe
WHERE Dupe.ObsBirdID = Observations.ObsBirdID ORDER BY Dupe.ObsDateTime);
Now use that query for subsequent queries.
SELECT * FROM Observations
INNER JOIN Query1 ON Observations.ObsBirdID = Query1.ObsBirdID
WHERE Observations.ObsLocID <> Query1.ObsLocID;
I'm very new to SQL and am currently working with joins for the first time in my life. What I am trying to figure out right now is the difference between two queries.
Query 1:
SELECT name
FROM actor
JOIN casting ON id = actorid
where (SELECT COUNT(ord) FROM casting join actor on actorid = actor.id AND ord=1) >= 30
GROUP BY name
Query 2:
SELECT name
FROM actor
JOIN casting ON id = actorid
AND (SELECT COUNT(ord) FROM casting WHERE actorid = actor.id AND ord=1)>=30
GROUP BY name
So I would think that doing
FROM casting join actor on actorid = actor.id
in the subquery is the same as
FROM casting WHERE actorid = actor.id.
But apparently it is not. Could anyone help me out and explain why?
Edit: If anyone is wondering: The queries are based on question 13 from http://sqlzoo.net/wiki/More_JOIN_operations
Actually, the part that really looks like a "where" clause is only what comes after the keyword ON. We sometimes come across queries that perform some data filtering directly at this stage, but its actual purpose is to specify the criteria used to match rows between the two tables.
A "join" is a very common operation that consists of associating the rows of two distinct tables according to a common criterion. For example, if you have, on one side, a table containing a client list in which each client has a unique client number, and, on the other side, an orders table in which each order contains the client's number, then you may want to "resolve" the client number in each order into the client's name, address, and so on.
Before SQL-92 (26 years ago), the only way to achieve this was to write something like this:
SELECT client.name,
client.address,
orders.product,
orders.totalprice
FROM client,orders
WHERE orders.clientNumber = client.clientNumber
AND orders.totalprice > 100.00
Selecting something from two (or more) tables induces a "cartesian product", which consists of associating every row of the first set with every row of the second one. This means that if your first table contains 3 rows and the second one 8 rows, the resulting set would be 24 rows. Out of these, you use the WHERE clause to exclude basically everything and retain only the rows in which the client number is the same on both sides.
We can see that the size of the resulting set before filtering grows multiplicatively as soon as the tables involved contain more than a few rows (which is always the case), and it gets even worse when more than two tables are involved. Also, on the programmer's side, it rapidly becomes rather unreadable.
Therefore, if a join is what you actually want, you can now tell the server about it explicitly and specify the join criteria up front, which avoids unnecessarily large intermediate sets, while still letting you filter the results with WHERE if needed.
SELECT client.name,
client.address,
orders.product,
orders.totalprice
FROM client
JOIN orders
ON orders.clientNumber = client.clientNumber
WHERE orders.totalprice > 100.00
It becomes critical when performing multiple JOINs in a single query, especially when mixing INNER and OUTER joins.
In the 2nd query, the nested query is correlated: it takes actor.id from its outer query and counts only that actor's rows. In the 1st query, the nested query counts rows for all actors instead of only the one in the current outer row.
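To see the difference in isolation, compare these two counts (a sketch against the same actor/casting tables; the alias is mine):
-- Uncorrelated: a single fixed number, the count of all starring roles in casting.
SELECT COUNT(ord)
FROM casting
WHERE ord=1;
-- Correlated: re-evaluated per actor, counting only that actor's starring roles.
SELECT actor.name,
       (SELECT COUNT(ord) FROM casting WHERE casting.actorid = actor.id AND casting.ord=1) AS starring_roles
FROM actor;
The first query's subquery behaves like the uncorrelated count, so every actor ends up compared against the same overall total.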
We have a situation where we need results from 4 different tables combined into one list and paginated with OFFSET/FETCH.
We want to select records from tables a, b, c & d, order them by CreatedDatetime and then OFFSET X, FETCH Y. The tables are quite big (in terms of number of rows), and it sounds horrible to just UNION ALL everything and then paginate, because that would probably mean materializing the whole combined list of records and then taking the paginated part.
The problem is that none of the tables can be taken as the reference for extracting a start/end datetime window, because the final result might or might not contain records from any of the tables. For example, it might contain records from any combination of tables a; a/b; a/b/c; a/b/c/d; b; b/c; ..., and we need a fixed number of rows to be returned (a page size of, for example, 20).
Any ideas on how to most effectively approach this?
UPDATE
Based on the question from @HABO:
There are unfortunately no special clues like that about the queries. We are showing user activities in the system. There are different kinds of them (the tables we select over). The query pops up data for an administrator who views the activities. How the administrator will look at the data may vary drastically: some users will have thousands of activities in the last few hours and the admin will want to see them all; in other cases, a user will have 3 actions in a day and the admin will see just the first page of data.
PS. These are not pure log tables, as the activities act as state machines over time, each having its own states, which we also look for in these queries.
If you know the page size (e.g. 100), then you can simply write 4 TOP 100 queries (ORDER BY the create date) and then do a UNION ALL on the results.
That way, even if all of the first 100 records come from 1 table, you are covered.
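A minimal sketch of what that first-page query might look like (assuming SQL Server; the table names a..d and the columns Id/CreatedDatetime are illustrative):
SELECT TOP (100) *
FROM (
    SELECT * FROM (SELECT TOP (100) 'a' AS src, Id, CreatedDatetime FROM a ORDER BY CreatedDatetime DESC) AS pa
    UNION ALL
    SELECT * FROM (SELECT TOP (100) 'b' AS src, Id, CreatedDatetime FROM b ORDER BY CreatedDatetime DESC) AS pb
    UNION ALL
    SELECT * FROM (SELECT TOP (100) 'c' AS src, Id, CreatedDatetime FROM c ORDER BY CreatedDatetime DESC) AS pc
    UNION ALL
    SELECT * FROM (SELECT TOP (100) 'd' AS src, Id, CreatedDatetime FROM d ORDER BY CreatedDatetime DESC) AS pd
) AS pooled
ORDER BY CreatedDatetime DESC;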
For subsequent paging queries you'll need to record the last displayed row from each table and use it as your high-water mark for the next fetch - (SELECT TOP 100 ... FROM TableA WHERE RowID > @HighWater).
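For example, one branch of the next-page query could look like this (a sketch keyed on CreatedDatetime rather than RowID, since the pages are ordered by create date; @HighWaterA, the last CreatedDatetime shown from table a, is an assumed variable name):
SELECT * FROM (
    SELECT TOP (100) 'a' AS src, Id, CreatedDatetime
    FROM a
    WHERE CreatedDatetime < @HighWaterA
    ORDER BY CreatedDatetime DESC
) AS pa
-- UNION ALL the equivalent branches for b, c and d, then take TOP (100) of the pooled set as before.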
Should be fairly efficient...
This is where a cache comes in useful. You can either cache the result of the query in your application layer and do the paging there if it is not too large, or cache the results of the query in a table (or temp table) if it is large.
There will be filters, I suppose. From what you say, those may vary a lot, so in the worst case every column can be a filter.
My suggestion is to use 5 views: one for each table, and a final one that unions them. Just make sure all filter columns map back to the physical tables as directly as possible.
Finally, select from the master view and fetch, but be careful with the ORDER BY clause. Make sure the ORDER BY produces a unique combination of values, otherwise you might have cases where a row changes pages on a simple refresh. If there is a user-defined ORDER BY, force-append some key columns at the end.
How to safely ensure the ORDER BY has distinct values for a 100% safe FETCH/OFFSET:
In the 4 views, create a new column with a simple constant number as its value, e.g. 1, 2, 3, 4 AS [TableSource].
Make sure you select the PK of each table. If you don't have one, you have to create one in the views, probably using ROW_NUMBER or NEWID, as [Pk] for example.
Finally, when selecting from the master view, ORDER BY CreateDate, Pk, TableSource. This way you are 100% sure that, within the same set of data, any row will be placed at exactly the same position, resulting in correct paging.
Example of safely isolating a page of 30 rows ordered by CreateDate:
SELECT * FROM (
    SELECT src, id, ROW_NUMBER() OVER (ORDER BY dt DESC, src, id) AS rn FROM (
        SELECT 1 AS src, id, dt FROM table1 /*WHERE x=y*/ UNION ALL
        SELECT 2 AS src, id, dt FROM table2 /*WHERE x=y*/ UNION ALL
        SELECT 3 AS src, id, dt FROM table3 /*WHERE x=y*/ UNION ALL
        SELECT 4 AS src, id, dt FROM table4 /*WHERE x=y*/
    ) AS alltables
) AS data
WHERE data.rn BETWEEN 3001 AND 3030
The problem: We want to remove misspelled addresses from our database, but we have too many to do by hand. So instead, I have a function, FN, that returns true if two addresses appear very similar (indicating a possible misspelling). A simple check would be to do something like...
select *
from
address adr1
join address adr2
on FN(adr1, adr2)
But, this basically does a cross join and compares rows. This is impossible to do due to how large our table is (> 1 million rows). But, I can limit it to looking at only addresses near each other. For example, addresses within the same city. So, I tried doing a count of addresses like that by doing...
select count(1)
from
address adr1
join address adr2
on adr1.zip = adr2.zip
and adr1.city = adr2.city
--Don't want to compare to self
and adr1.ID <> adr2.ID
The problem is that this takes too long to run (I've waited and it still hasn't finished). I suspect that Oracle has a much better way of handling these kinds of things for large numbers of rows, but I just don't know it.
So how should a person go about joining an extremely large table to itself when there are ways to limit what is being joined (such as only looking within the same zip code)?
P.S. Do trillions of records count as big data or should I remove the tag?
Edit1: Zip and City are already indexed.
Edit 2: Zip and City both have large numbers of null values (200,000+). This may affect how the index is used in the join.
Explain plan:
SELECT STATEMENT ALL_ROWS  Cost: 35,301  Bytes: 42  Cardinality: 1
  4 SORT AGGREGATE  Bytes: 42  Cardinality: 1
    3 HASH JOIN  Cost: 35,301  Bytes: 2,195,769,492  Cardinality: 52,280,226
      1 TABLE ACCESS FULL TABLE SCHEMA.ADDRESS  Cost: 15,677  Bytes: 21,388,962  Cardinality: 1,018,522
      2 TABLE ACCESS FULL TABLE SCHEMA.ADDRESS  Cost: 15,677  Bytes: 21,388,962  Cardinality: 1,018,522
Edit 3: I've tried counting the number of rows I'll be looking at in a different way.
select
sum(cnt * (cnt - 1))
from
(
select
count(1) as CNT
from schema.address adr1
group by adr1.zip, adr1.city
)
This returned ~45 billion different pairings in less than 10 seconds. I'm not sure my function can handle more than 100k comparisons a second, whereas roughly a million per second would be needed to get through 45 billion in under 12 hours.
1) Build an index on fields ZIP and CITY.
2) To get duplicates (this is what you do in the second case), use GROUP BY:
SELECT ZIP, CITY, COUNT(*) FROM ADDRESS GROUP BY ZIP, CITY HAVING COUNT(*) > 1
I've got some good news, and some bad news.
The good news is that your existing query is likely to have closer to 5 billion rows, than 45 billion rows.
The bad news is that this is because it won't try to match up any of the 200,000 records that have null zip or null city values - Oracle (and all other RDBMSs I know) won't join NULL values to other NULL values; see here for an example. You can get round this using a coalesce as part of the join criteria, but I suggest handling null city/zip records separately instead.
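For illustration, the coalesce workaround might look roughly like this (a sketch; the '~' sentinel value is mine, and handling the null city/zip rows in a separate pass is still preferable):
select count(1)
from
address adr1
join address adr2
on COALESCE(adr1.zip, '~') = COALESCE(adr2.zip, '~')
and COALESCE(adr1.city, '~') = COALESCE(adr2.city, '~')
and adr1.ID <> adr2.ID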
Assuming that your function handles addresses symmetrically (so that FN(addr1,addr2) returns the same result as FN(addr2,addr1)), you can further halve the number of combinations by changing adr1.ID <> adr2.ID to adr1.ID < adr2.ID in your existing query. If you don't already have a suitable index, I suggest adding one on zip, city and id (in that order).
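Applied to the count query above, that gives something like this (a sketch, still ignoring the null zip/city rows):
select count(1)
from
address adr1
join address adr2
on adr1.zip = adr2.zip
and adr1.city = adr2.city
--Each pair is considered once instead of twice
and adr1.ID < adr2.ID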
A different approach would be to encode each address with a postal-authority ID code, if such a thing exists for the addresses/country in question. This means that rather than comparing addresses to one another, you put all the effort into parsing and decoding each address in the first place. We use this approach and store the ID in each row, which means we can join later very precisely and quickly.
If you cannot use postal IDs (and by this I mean unique IDs for each delivery address assigned by the post office), then consider geocoding each address and then joining on geographically nearby addresses. Geocoding might also apply if the addresses aren't purely postal addresses.
I'm also quite interested in what FN() does for the addresses. Have you seen http://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/? It's not related to your question, but it's good reading if you are new to address handling.
I'm curious about how I could go about getting the data I need out of a "circle" of tables.
I have 5 tables (and a few supporting ones): 3 entities joined by junction tables. So the general model is like this:
Cards have many Budgets, and Accounts have many Budgets, and Accounts have many Cards.
So my relationships make a circle through the junction tables, from Card to Budget to Account and back to Card. This structure worked all fine and dandy until today, when I tried to construct a query using all 5 tables and noticed that I know of no way to avoid ambiguous joins with this structure in place. I'm thinking it might have been a better idea to create AccountBudget and CardBudget tables, but since they would both define exactly the same type of data, one table seemed more efficient.
The information I'm trying to get is basically the total budget limit for all cards of a certain type, and the total budget limit for all accounts of that same type. Am I just looking at this problem wrong?
// Card Budget_Card Budget Budget_Account Account
// ------- --------- -------- -------------- ---------
// cardId------\ budgetId<---------budgetId------>budgetId -----accountId--(to Card)->
// accountId --->cardId limit accountId<------/ typeId
// (etc) typeId (etc)
// (typeId in Budget is either 1 for an account budget or 2 for a card budget.)
As you can see, it's a circle. What I'm trying to accomplish is to return one row with two columns: the sum of Budget.limit for the record in Account where typeId = 1, and the sum of Budget.limit for all rows in Card belonging to Accounts of the same type.
As per the suggestion, I can in fact get the data I need from a union, but it's no use to me if the data is not in two separate columns:
SELECT DISTINCTROW Sum(Budget.limit) AS SumOfLimit
FROM (Account RIGHT JOIN Card ON Account.accountId = Card.accountId)
RIGHT JOIN (Budget LEFT JOIN Budget_Card ON Budget.budgetID = Budget_Card.budgetId) ON Card.cardId = Budget_Card.cardId
GROUP BY Budget.typeId, Budget.quarterId, Account.typeId
HAVING (((Budget.typeId)=2) AND ((Budget.quarterId)=[#quarterId]) AND ((Account.typeId)=[#accountType]))
UNION SELECT DISTINCTROW Sum(Budget.limit) AS SumOfLimit
FROM Budget LEFT JOIN (Account RIGHT JOIN Budget_Account ON Account.accountId = Budget_Account.accountId) ON Budget.budgetID = Budget_Account.budgetId
GROUP BY Budget.typeId, Budget.quarterId, Account.typeId
HAVING (((Budget.typeId)=1) AND ((Budget.quarterId)=[#quarterId]) AND ((Account.typeId)=[#accountType]));
So, if I understand you correctly, you've made separate column headers with the same name, and your data gets combined because the information needs to be kept separate? If this is the case, I would suggest changing the column headers as you've proposed, or linking two queries together. Querying on the same tagged name will combine the results, so if you want to keep something distinct, it's always a good idea to create separate names for the column headers.
Here is an explanation of using SQL to query multiple tables: http://www.techrepublic.com/article/sql-basics-query-multiple-tables/1050307
First make the query for the Cards, then UNION it with the query for the Accounts.
It would be easier to relate cards to accounts and then only have budgets for accounts; however, I don't know if that would work with your schema.
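If the two totals really must come back as two columns of a single row rather than as a UNION of rows, one rough sketch (not taken from the answers above) is to save the card query and the account query separately and cross join their one-row results; qryCardLimits and qryAccountLimits are assumed names for those saved queries, each returning a single SumOfLimit:
SELECT c.SumOfLimit AS CardLimitTotal, a.SumOfLimit AS AccountLimitTotal
FROM qryCardLimits AS c, qryAccountLimits AS a;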