The problem: we want to remove misspelled addresses from our database, but there are far too many to check by hand. So instead, I have a function, FN, that returns true if two addresses appear very similar (indicating a possible misspelling). A simple check would be to do something like...
select *
from
address adr1
join address adr2
on FN(adr1, adr2)
But this basically does a cross join and compares every pair of rows, which is infeasible given how large our table is (over 1 million rows). However, I can limit it to looking only at addresses near each other, for example addresses within the same city. So I tried counting how many such comparisons there would be by doing...
select count(1)
from
address adr1
join address adr2
on adr1.zip = adr2.zip
and adr1.city = adr2.city
--Don't want to compare to self
and adr1.ID <> adr2.ID
The problem is that this takes too long to run (I've waited and it still hasn't finished). I suspect Oracle has a much better way to handle this type of thing for large numbers of rows, but I just don't know it.
So how should a person go about joining an extremely large table to itself when there are ways to limit what is being joined (such as only looking within the same zip code)?
P.S. Do trillions of records count as big data or should I remove the tag?
Edit1: Zip and City are already indexed.
Edit2: Zip and City both have large numbers of null values (200,000+). This may affect how the index is used in the join.
Explain plan:
SELECT STATEMENT ALL_ROWS  Cost: 35,301  Bytes: 42  Cardinality: 1
  4 SORT AGGREGATE  Bytes: 42  Cardinality: 1
    3 HASH JOIN  Cost: 35,301  Bytes: 2,195,769,492  Cardinality: 52,280,226
      1 TABLE ACCESS FULL TABLE SCHEMA.ADDRESS  Cost: 15,677  Bytes: 21,388,962  Cardinality: 1,018,522
      2 TABLE ACCESS FULL TABLE SCHEMA.ADDRESS  Cost: 15,677  Bytes: 21,388,962  Cardinality: 1,018,522
Edit3: I've tried counting the number of row pairs I'll be looking at in a different way.
select
sum(cnt * (cnt - 1))
from
(
select
count(1) as CNT
from schema.address adr1
group by adr1.zip, adr1.city
)
This returned ~45 billion different pairings in less than 10 seconds. Getting through that many comparisons in under 12 hours would require roughly a million pairs a second, and I'm not sure my function can handle even 100k rows a second.
1) Build an index on fields ZIP and CITY
2) To get duplicates (this is what you are doing in your second query), use GROUP BY:
SELECT ZIP, CITY, COUNT(*) FROM ADDRESS GROUP BY ZIP, CITY HAVING COUNT(*) > 1
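If you then want the actual candidate rows in those groups (to feed into FN, for example), one way - just a sketch, not tested against your schema - is to join that aggregate back to the address table:
SELECT adr.*
FROM ADDRESS adr
JOIN (
  SELECT ZIP, CITY
  FROM ADDRESS
  GROUP BY ZIP, CITY
  HAVING COUNT(*) > 1
) dup
ON adr.ZIP = dup.ZIP
AND adr.CITY = dup.CITY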
I've got some good news, and some bad news.
The good news is that your existing query is likely to return closer to 5 billion rows than 45 billion.
The bad news is that this is because it won't try to match up any of the 200,000 records that have null zip or null city values - Oracle (and all other RDBMSs I know) won't join NULL values to other NULL values; see here for an example. You can get round this using a coalesce as part of the join criteria, but I suggest handling null city/zip records separately instead.
Assuming that your function handles addresses symmetrically (so that FN(addr1,addr2) returns the same result as FN(addr2,addr1)), you can further halve the number of combinations by changing adr1.ID <> adr2.ID to adr1.ID < adr2.ID in your existing query. If you don't already have a suitable index, I suggest adding one on zip, city and id (in that order).
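Putting those two suggestions together, the counting query with the halved pairings and a supporting index might look like this (a sketch; the index name is just illustrative):
create index address_zip_city_id on address (zip, city, id);

select count(1)
from
address adr1
join address adr2
on adr1.zip = adr2.zip
and adr1.city = adr2.city
--Only count each unordered pair once
and adr1.ID < adr2.ID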
A different approach would be to encode each address with a postal authority ID code, if one exists for the addresses/country in question. This means that rather than comparing addresses to each other, you put all the effort into parsing and decoding each address in the first place. We use this approach and store the ID in each row, which means we can join later very precisely and quickly.
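As a rough sketch (POSTAL_ID here is a hypothetical column holding that stored postal authority ID, not something in your current table), the later join then becomes simple and selective:
select adr1.ID, adr2.ID
from
address adr1
join address adr2
on adr1.postal_id = adr2.postal_id
and adr1.ID < adr2.ID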
If you cannot use a postal ID (by which I mean a unique ID for each delivery address, assigned by the post office), then consider geocoding each address and joining on geographically nearby addresses. Geocoding might also apply if the addresses aren't purely postal addresses.
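A rough sketch of the geocoding variant, assuming hypothetical LAT/LON columns that have already been populated: rounding groups addresses into coarse tiles, so pairs that straddle a tile boundary are missed, and a real version would also compare neighbouring tiles.
select adr1.ID, adr2.ID
from
address adr1
join address adr2
on round(adr1.lat, 2) = round(adr2.lat, 2)
and round(adr1.lon, 2) = round(adr2.lon, 2)
and adr1.ID < adr2.ID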
I'm also quite interested in what FN() does with the addresses. Have you seen http://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/? It's not related to your question, but it's good reading if you are new to address handling.
Given a fairly stereotypical scenario with an item table referenced by an images table holding multiple images for the same item, I'm trying to figure out how to retrieve a specific number of items, while collecting all of the image rows.
The setup is trivial, and looks like:
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    ...
    ...
)

CREATE TABLE images (
    item_id INTEGER REFERENCES items(id),
    url TEXT,
    ...
    ...
)
LIMIT 20 clamps the maximum number of rows in the entire result set, and so it can cut off part-way through an item's set of image rows.
I'm currently doing two queries, but besides being far from ideal architecturally, it's proving quite awkward to coordinate in practice (which makes sense, since I'm surely doing it wrong). I can't find any info on how to coordinate LIMIT with LEFT JOINs, so I thought I'd ask. Thanks!
NB: similar questions do exist, but they ask how to do things like retrieving the first (say) 5 images for each item. I'm looking to retrieve (say) 10 items and get all the images for each.
A query like:
SELECT * FROM items ORDER BY ? LIMIT 10
will return 10 rows (at most) from items.
You need to provide the column(s) for the ORDER BY clause if you have some specific sorting condition in mind.
If not, then remove the ORDER BY clause, but in that case nothing guarantees which rows you will get.
So all you have to do is LEFT join the above query to images:
SELECT it.*, im.*
FROM (SELECT * FROM items ORDER BY ? LIMIT 10) it
LEFT JOIN images im
ON im.item_id = it.id
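If you also want each item's image rows to come back together, you can add an outer ORDER BY (assuming id is the key you care about):
SELECT it.*, im.*
FROM (SELECT * FROM items ORDER BY id LIMIT 10) it
LEFT JOIN images im
ON im.item_id = it.id
ORDER BY it.id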
I'm working in a large Access database (Access 2010) and am trying to return records where two locations are different.
In my case, I have a large number of birds that have been observed on multiple dates and potentially on different islands. Each bird has a unique BirdID and also an actual physical identifier (which, unfortunately, may have changed over time). [I'm going to try addressing the changing physical identifier issue later.] I currently want to query individual birds where one or more of their observations is different from the "IslandAlpha" (the first island where they were observed). Something along the lines of a criterion for BirdID: WHERE IslandID [doesn't equal] IslandAlpha.
I then need a separate query to find where the observations DO match where the birds were first observed, i.e. where IslandID = IslandAlpha.
I'm new to Access, so let me know if you need more information on how my tables/relationships are set up! Thanks in advance.
Assuming the following tables:
Birds table in which all individual birds have records with a unique BirdID and IslandAlpha.
Sightings table in which individual sightings are recorded, including IslandID.
Your first query would look something like this:
SELECT *
FROM Birds
INNER JOIN Sightings ON Birds.BirdID=Sightings.BirdID
WHERE Sightings.IslandID <> Birds.IslandAlpha
Your second query would be the same but with = instead of <> in the WHERE clause.
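For completeness, that second query spelled out:
SELECT *
FROM Birds
INNER JOIN Sightings ON Birds.BirdID=Sightings.BirdID
WHERE Sightings.IslandID = Birds.IslandAlpha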
Please provide us with information about the tables and columns you are using.
I will presume you are asking this question because a simple join and filter on IslandAlpha <> ObsLoc is not possible, since IslandAlpha is derived from the first observation record for each bird. Pulling the first observation record for each bird requires a nested query. You need a unique record identifier in Observations - an autonumber should serve. Assuming there is an observation date/time field, consider:
SELECT * FROM Observations WHERE ObsID IN
(SELECT TOP 1 ObsID FROM Observations AS Dupe
WHERE Dupe.ObsBirdID = Observations.ObsBirdID ORDER BY Dupe.ObsDateTime);
Now use that query for subsequent queries.
SELECT * FROM Observations
INNER JOIN Query1 ON Observations.ObsBirdID = Query1.ObsBirdID
WHERE Observations.ObsLocID <> Query1.ObsLocID;
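And the matching-island version, for the second part of your question, is the same join with = instead of <>:
SELECT * FROM Observations
INNER JOIN Query1 ON Observations.ObsBirdID = Query1.ObsBirdID
WHERE Observations.ObsLocID = Query1.ObsLocID;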
I'm very new to SQL and currently working with joins for the first time in my life. What I am trying to figure out right now is the difference between two queries.
Query 1:
SELECT name
FROM actor
JOIN casting ON id = actorid
where (SELECT COUNT(ord) FROM casting join actor on actorid = actor.id AND ord=1) >= 30
GROUP BY name
Query 2:
SELECT name
FROM actor
JOIN casting ON id = actorid
AND (SELECT COUNT(ord) FROM casting WHERE actorid = actor.id AND ord=1) >= 30
GROUP BY name
So I would think that doing
FROM casting join actor on actorid = actor.id
in the subquery is the same as
FROM casting WHERE actorid = actor.id.
But apparently it is not. Could anyone help me out and explain why?
Edit: If anyone is wondering: The queries are based on question 13 from http://sqlzoo.net/wiki/More_JOIN_operations
Actually, the part that really looks like a "where" statement is only what's after the keyword ON. We sometimes come across queries that perform some data filtering directly at this stage, but its actual purpose is to specify the criteria used to match the rows of the two tables.
A "join" is a very common operation that consists of associating the rows of two distinct tables according to a common criteria. For example, if you have, on one side, a table containing a client list in which each of them has a unique client number, and on a other side a order list table in which each order contains the client's number, then you may want to "resolve" the number of the latter table into its name, address, and so on.
Before SQL92 (26 years ago), the only way to achieve this was to write something like this:
SELECT client.name,
client.address,
orders.product,
orders.totalprice
FROM client,orders
WHERE orders.clientNumber = client.clientNumber
AND orders.totalprice > 100.00
Selecting something from two (or more) tables induces a "cartesian product", which actually consists of associating every row from the first set with every row of the second one. This means that if your first table contains 3 rows and the second one 8 rows, the resulting set would contain 24 rows. And out of these, you use the WHERE clause to exclude basically everything and retain only the rows in which the client number is the same on both sides.
We understand that the size of the resulting set before filtering grows multiplicatively as soon as the tables involved contain more than a few rows (which is always the case), and it gets even worse if you involve more than two tables. Also, on the programmer's side, it rapidly becomes rather unreadable.
Therefore, if a join is what you actually want to do, you can now tell the server about it explicitly and specify the join criteria up front, which avoids needlessly building huge intermediate sets, while still letting you filter the results with WHERE if needed:
SELECT client.name,
client.address,
orders.product,
orders.totalprice
FROM client
JOIN orders
ON orders.clientNumber = client.clientNumber
WHERE orders.totalprice > 100.00
This becomes critical when performing multiple JOINs in a single query, especially when mixing INNER and OUTER joins.
In the 2nd query your nested query takes the actor.id from its root query and only counts the results from that. In the 1st query your nested query counts results from all actors instead of only the specified one.
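So a version of the first query that behaves like the second would correlate the subquery to the outer actor in the same way (a sketch based on the SQLZoo schema):
SELECT name
FROM actor
JOIN casting ON id = actorid
WHERE (SELECT COUNT(ord) FROM casting WHERE actorid = actor.id AND ord=1) >= 30
GROUP BY name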
I have an ETL process which takes values from an input table (a key-value table in which each row has a field ID) and turns them into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability. You can add hundreds of fields and the processing time should be comparable (longer, but comparable) to fewer fields. It is also easier to add new fields.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 1 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 2 and sfv.StudentId = s.StudentId and sfv.Day = d.Day) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value).
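On SQL Server 2014 that index could be created along these lines (the name is just illustrative; putting Value in an INCLUDE clause instead of the key would cover the lookups equally well):
CREATE INDEX IX_StudentFieldValues_StudentId_Day_FieldId
ON StudentFieldValues (StudentId, Day, FieldId, Value);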
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query. It doesn't scale well that way. Instead, query it as a set of rows, as it is stored in the database, and fetch the result set back into your application. There you write code to read the data row by row, and apply the "fields" to the fields of an object or a hashmap or something.
I think there may be some trial and error here to see what works, but here are some things you can try:
Disable indexes and re-enable them after the data load is complete.
Disable any triggers that don't need to run during data-load scenarios.
The above was taken from an MSDN post where someone was doing something similar to what you are.
Think about updating the denormalized table based only on changed records, if that is possible. Limiting the result set would be much more efficient.
You could try a more threaded, iterative approach in code (C#, VB, etc.) to build this table by student, so you aren't doing the X number of joins all at one time.
I have a table that has 14,091 rows (2 columns, let's say first name, last name). I then have a calendar table that has 553 rows of just dates (first of each month). I do a cross join in order to get every combination of first name, last name, & first of month because this is my requirement. This takes just over a minute.
Is there anything I can do to make this faster, or can a cross join never get any faster, as I suspect?
People Table
first_name varchar2(100)
last_name varchar2(1000)
Dates Table
dt DateTime
select a.first_name, a.last_name, b.dt
from people a, dates b
It will be slow as it is making all possible combinations: 14,091 * 553. It is not going to be fast unless you either add an index or use an inner join instead.
Yeah. Takes over a minute. Let's get this clear: you talk of 14,091 * 553 rows - that is 7,792,323. Rounded, that is 7.8 million rows. And you are loading them into a data table (which is not known for performance).
Want to see slow? Put them into a grid. THEN you see slow.
The requirements make no sense in a table. None. Absolutely none.
And no, there is no way to speed up loading 7.8 million rows into a data structure that is not meant to hold that amount of data.