I know how to do a left join, but when I add two more to the query, it gets a bit weird. Here is the task: the original table, list20192, has 93 columns and 85,353 rows. The end user isn't happy that it has no descriptive fields such as titles or descriptions. One field, naics, is a six-digit code, and they want an industry title to go along with it. To find the title that goes with naics code 551114, one has to look in the indcodes table, which has the following structure:
state char(2)
codetype char(2)
code char(6)
codetitle varchar(115)
The GEOG table has this structure:
state char(2)
areatype char(2)
area char(6)
areaname varchar(254)
areadesc varchar(254)
I am using the following query in an effort to attach the descriptive counterparts of those fields to the end of list20192. The end result should be 96 columns and 85,353 rows. The query runs, but it produces 6,800,277 rows, which is far too many.
Select list20192.*, indcodes.codetitle, sizeclas.sizedesc, geog.areaname
from dbo.list20192
left join dbo.indcodes on list20192.naics = indcodes.code
left join dbo.sizeclas on list20192.sizeclass = sizeclas.sizeclass
left join dbo.geog on list20192.area = geog.area
where list20192.year = '2019' and list20192.qtr = '2'
The end result should look something like this:
naics    codetitle   sizeclass  sizedesc              area    areaname
541114   Management  22         400 to 499 employees  000025  Pershing
Any ideas how I would adjust this query so it doesn't return so many results? Looking at the results, it appears I am getting a row for every matching area value in geog regardless of state. I am in state 32, and my original table list20192 only deals with state 32. Across states, many area values are identical: for instance, area 000003 in Nevada is Clark, while 000003 in South Dakota is Beadle County.
The proliferation of rows is almost certainly caused by multiple rows in the joined tables matching each requested join-key value. A join produces an output row for every matching combination: for example, 2 matching rows on one side and 5 on the other produce 2 * 5 = 10 result rows.
One easy way to find out might be to add JOIN clauses one at a time.
Without having any idea, of course, what your data looks like, I'd suspect that geog table. (The other two just look to me like lookup tables.)
The DISTINCT keyword can filter out duplicates, but note that it can be expensive. More likely, you need to be more specific in how you join one or more of these tables...
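Given your note that area values repeat across states, the likely fix is to make state part of the join keys rather than relying on the code columns alone. A sketch (indcodes and geog both carry state per your structures; sizeclas isn't shown, and you may also need a codetype filter on indcodes, so treat those as assumptions):
SELECT list20192.*, indcodes.codetitle, sizeclas.sizedesc, geog.areaname
FROM dbo.list20192
LEFT JOIN dbo.indcodes
    ON list20192.naics = indcodes.code
    AND indcodes.state = '32'   -- restrict the lookup to your state
LEFT JOIN dbo.sizeclas
    ON list20192.sizeclass = sizeclas.sizeclass
LEFT JOIN dbo.geog
    ON list20192.area = geog.area
    AND geog.state = '32'       -- area codes repeat across states
WHERE list20192.year = '2019' AND list20192.qtr = '2'
If list20192 carries its own state column, joining on that instead of the literal '32' would be more robust.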
Related
I'm working in a large Access database (Access 2010) and am trying to return records where two locations are different.
In my case, I have a large number of birds that have been observed on multiple dates and potentially on different islands. Each bird has a unique BirdID and also an actual physical identifier (unfortunately, that may have changed over time). [I'm going to try addressing the changing physical identifier issue later.] For now, I want to query individual birds where one or more of their observations differs from their "IslandAlpha" (the first island where they were observed). Something along the lines of a criterion for BirdID: WHERE IslandID [doesn't equal] IslandAlpha.
I then need a separate query to find where all observations DO equal where they were first observed. So where IslandID = IslandAlpha
I'm new to Access, so let me know if you need more information on how my tables/relationships are set up! Thanks in advance.
Assuming the following tables:
Birds table in which all individual birds have records with a unique BirdID and IslandAlpha.
Sightings table in which individual sightings are recorded, including IslandID.
Your first query would look something like this:
SELECT *
FROM Birds
INNER JOIN Sightings ON Birds.BirdID=Sightings.BirdID
WHERE Sightings.IslandID <> Birds.IslandAlpha
Your second query would be the same but with = instead of <> in the WHERE clause.
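Spelled out, it might look like this:
SELECT *
FROM Birds
INNER JOIN Sightings ON Birds.BirdID = Sightings.BirdID
WHERE Sightings.IslandID = Birds.IslandAlpha
Note this returns the sightings that match; if you need only birds for which every sighting matches, you would additionally exclude any BirdID that appears in the first query's results.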
Please provide us with information about the tables and columns you are using.
I will presume you are asking this question because a simple join with a filter of IslandAlpha <> ObsLoc is not possible, since IslandAlpha is derived from the first observation record for each bird. Pulling the first observation record for each bird requires a nested query. You need a unique record identifier in Observations; an autonumber should serve. Assuming there is an observation date/time field, consider:
SELECT * FROM Observations WHERE ObsID IN
(SELECT TOP 1 ObsID FROM Observations AS Dupe
WHERE Dupe.ObsBirdID = Observations.ObsBirdID
ORDER BY Dupe.ObsDateTime, Dupe.ObsID);
Now use that query for subsequent queries.
SELECT * FROM Observations
INNER JOIN Query1 ON Observations.ObsBirdID = Query1.ObsBirdID
WHERE Observations.ObsLocID <> Query1.ObsLocID;
I have two tables. The first one contains laboratory result header records, one for each order. It has about 10 million rows in it that contain one of about 6,000 unique ProcedureIDs...
OrderID
ResultID
ProcedureID
ProcedureName
OrderDate
ResultDate
PatientID
ProviderID
The second table contains the detailed result record(s) for each order in the first table. It has about 80 million rows and contains about 28,000 child components that are associated with the 6,000 procedure IDs from the first table.
ResultComponentID
ResultID (foreign key to first table)
ComponentID
ComponentName
ResultValueType
ResultValue
ResultUnits
ResultingLab
I have a subset (n=135) of procedure IDs for which I need a list of associated child component IDs. Here is a simple example...
Table 1
1000|1|CBC|Complete Blood Count|8/1/2019 08:00:00|8/2/2019 09:27:00|9999|8888
1001|2|CA|Calcium|8/1/2019 08:01:00|8/2/2019 09:28:00|9999|8888
Table 2
2543|1|RBC|Red Blood Cell Count|NM|60|Million/uL|OurLab
2544|1|PLT|Platelet Count|NM|60|Thou/cmm|OurLab
2545|2|RBC|Red Blood Cell Count|NM|60|Million/uL|OurLab
2546|2|CA|Calcium|NM|40|g/dl|OurLab
In this example, if CBC was in my subset and CA wasn't, I would expect two rows back...
CBC|Complete Blood Count|RBC|Red Blood Cell Count
CBC|Complete Blood Count|PLT|Platelet Count
Even if I had two million CBCs in the DB, I only need one set of CBC parent/child rows.
If I were using a scripting tool, I would use a for each loop to iterate through the subset and grab the top 1 of each ProcedureID and use it to get the associated component children.
If I really wanted to go crazy with this, I would not assume that CBC only had two components, as some labs might send us two and some might send us seven.
Any advice on how to get the list of parent/child associations?
For the simple query, sometimes there is no way around writing out all 135 IDs, unless you can find a neat way to get that subset out of a query or store it in a temp table (sketched below).
For the uniqueness requirement, just add a GROUP BY:
Select t1.ProcedureId, t1.ProcedureName, t2.ComponentId, t2.ComponentName
from Table1 t1
join Table2 t2 on t2.ResultId = t1.ResultId
where t1.ProcedureId in (
'CBC',
'etc' -- ... 135 entries
)
group by t1.ProcedureId, t1.ProcedureName, t2.ComponentId, t2.ComponentName
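If maintaining the inline list gets unwieldy, a temp-table variant might look like this (a sketch: #ProcSubset is a hypothetical temp table, and the name columns match the four-column output in the question):
CREATE TABLE #ProcSubset (ProcedureId VARCHAR(50) PRIMARY KEY);
INSERT INTO #ProcSubset (ProcedureId) VALUES ('CBC');  -- load all 135 IDs here

SELECT t1.ProcedureId, t1.ProcedureName, t2.ComponentId, t2.ComponentName
FROM Table1 t1
JOIN Table2 t2 ON t2.ResultId = t1.ResultId
JOIN #ProcSubset s ON s.ProcedureId = t1.ProcedureId
GROUP BY t1.ProcedureId, t1.ProcedureName, t2.ComponentId, t2.ComponentName;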
I have two database tables:
Cities with columns:
Country_Code | City_Code | City_Name
Countries with columns:
Country_Code | Country_Name
Based on a few characters entered by the user, the query checks the City_Name column and returns results to populate a city autocomplete box. The result needs to have the city code, city name, country code, and country name, hence the need for a join.
The query I am using is
SELECT TOP 10
ci.Country_Code, ci.City_Code, ci.City_Name, co.Country_Name
FROM
Cities ci
LEFT OUTER JOIN
Countries co ON ci.Country_Code = co.Country_Code
WHERE
ci.City_Name LIKE '#CityName'
ORDER BY
ci.City_Name
The results I get are correct, but the query takes a long time to complete. From what I understand, the join of the two tables is produced first, then the WHERE clause kicks in to keep only the matching rows, which are ordered by City_Name, and the top 10 are returned.
My question is: is there a way to speed up the query? Can the WHERE clause be evaluated first and the join performed only afterward, better still only on the top 10 results? I tried putting my WHERE condition in the ON clause, but that gave wrong results.
EDIT : #CityName contains 2-3 chars entered by the user and then a '%'.
I'd suggest starting by adding a clustered index on Countries.Country_Code (also making it the primary key of the Countries table, if it isn't already). A clustered index sorts the table, which speeds up the lookups performed during the join.
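In T-SQL that might be (a sketch, assuming Country_Code is not already the key; the constraint name is a placeholder):
ALTER TABLE Countries
ADD CONSTRAINT PK_Countries PRIMARY KEY CLUSTERED (Country_Code);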
This appears to be your query:
SELECT TOP 10 ci.Country_Code, ci.City_Code, ci.City_Name, co.Country_Name
FROM Cities ci LEFT OUTER JOIN
Countries co
ON ci.Country_Code = co.Country_Code
WHERE ci.City_Name LIKE #CityName
ORDER BY ci.City_Name ;
Quotes should not be needed around #CityName.
I don't understand the LEFT JOIN. It suggests that there are cities without a valid Country_Code -- and that seems unlikely.
Assuming #CityName does not start with a wildcard (as suggested by your question), then this can make use of an index. I would suggest the following indexes:
cities(city_name, country_code)
countries(country_code, country_name)
The second is not needed if country_code is a primary key.
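As DDL, something like (index names are placeholders):
CREATE INDEX IX_Cities_CityName_CountryCode ON Cities (City_Name, Country_Code);
CREATE INDEX IX_Countries_CountryCode_Name ON Countries (Country_Code, Country_Name);
With the first index, a LIKE 'abc%' predicate can seek on City_Name, and Country_Code is carried along for the join.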
I have an ETL process which takes values from an input table, a key-value table in which each row has a field ID, and turns it into a more denormalized table where each row has all the values. Specifically, this is the input table:
StudentFieldValues (
FieldId INT NOT NULL,
StudentId INT NOT NULL,
Day DATE NOT NULL,
Value FLOAT NULL
)
FieldId is a foreign key from table Field, Day is a foreign key from table Days. The PK is the first 3 fields. There are currently 188 distinct fields. The output table is along the lines of:
StudentDays (
StudentId INT NOT NULL,
Day DATE NOT NULL,
NumberOfClasses FLOAT NULL,
MinutesLateToSchool FLOAT NULL,
... -- the rest of the 188 fields
)
The PK is the first 2 fields.
Currently the query that populates the output table does a self join with StudentFieldValues 188 times, one for each field. Each join equates StudentId and Day and takes a different FieldId. Specifically:
SELECT Students.StudentId, Days.Day,
StudentFieldValues1.Value NumberOfClasses,
StudentFieldValues2.Value MinutesLateToSchool,
...
INTO StudentDays
FROM Students
CROSS JOIN Days
LEFT OUTER JOIN StudentFieldValues StudentFieldValues1
ON Students.StudentId=StudentFieldValues1.StudentId AND
Days.Day=StudentFieldValues1.Day AND
StudentFieldValues1.FieldId=1
LEFT OUTER JOIN StudentFieldValues StudentFieldValues2
ON Students.StudentId=StudentFieldValues2.StudentId AND
Days.Day=StudentFieldValues2.Day AND
StudentFieldValues2.FieldId=2
... -- 188 joins with StudentFieldValues table, one for each FieldId
I'm worried that this system isn't going to scale as more days, students and fields (especially fields) are added to the system. Already there are 188 joins and I keep reading that if you have a query with that number of joins you're doing something wrong. So I'm basically asking: Is this something that's gonna blow up in my face soon? Is there a better way to achieve what I'm trying to do? It's important to note that this query is minimally logged and that's something that wouldn't have been possible if I was adding the fields one after the other.
More details:
MS SQL Server 2014, 2x XEON E5 2690v2 (20 cores, 40 threads total), 128GB RAM. Windows 2008R2.
352 million rows in the input table, 18 million rows in the output table - both expected to increase over time.
Query takes 20 minutes and I'm very happy with that, but performance degrades as I add more fields.
Think about doing this using conditional aggregation:
SELECT s.StudentId, d.Day,
max(case when sfv.FieldId = 1 then sfv.Value end) as NumberOfClasses,
max(case when sfv.FieldId = 2 then sfv.Value end) as MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d LEFT OUTER JOIN
StudentFieldValues sfv
ON s.StudentId = sfv.StudentId AND
d.Day = sfv.Day
GROUP BY s.StudentId, d.Day;
This has the advantage of easy scalability: you can add hundreds of fields, and the processing time should remain comparable (longer, but comparable). It also makes adding new fields easier.
EDIT:
A faster version of this query would use subqueries instead of aggregation:
SELECT s.StudentId, d.Day,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 1 AND sfv.StudentId = s.StudentId AND sfv.Day = d.Day) AS NumberOfClasses,
(SELECT TOP 1 sfv.Value FROM StudentFieldValues sfv WHERE sfv.FieldId = 2 AND sfv.StudentId = s.StudentId AND sfv.Day = d.Day) AS MinutesLateToSchool,
...
INTO StudentDays
FROM Students s CROSS JOIN
Days d;
For performance, you want a composite index on StudentFieldValues(StudentId, day, FieldId, Value).
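For example (a sketch; keying the three lookup columns and carrying Value at the leaf level is the usual SQL Server shape for this):
CREATE INDEX IX_StudentFieldValues_Lookup
ON StudentFieldValues (StudentId, Day, FieldId)
INCLUDE (Value);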
Yes, this is going to blow up. You have your definitions of "normalized" and "denormalized" backwards. The Field/Value table design is not a relational design. It's a variation of the entity-attribute-value design, which has all sorts of problems.
I recommend you do not try to pivot the data in an SQL query; it doesn't scale well that way. Instead, query it as the set of rows it is stored as in the database, fetch the result set back into your application, and write code there that reads the data row by row and applies the "fields" to fields of an object or a hashmap or something.
I think there may be some trial and error here to see what works, but here are some things you can try:
Disable indexes and re-enable them after the data load is complete (see the sketch after this list).
Disable any triggers that don't need to run during data-load scenarios.
The above was taken from an MSDN post where someone was doing something similar to what you are.
Think about updating the denormalized table based only on changed records, if that is possible. Limiting the result set would be much more efficient.
You could try a more threaded, iterative approach in code (C#, VB, etc.) to build the table student by student, so you aren't doing all of the joins at one time.
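For the first two points, the T-SQL would be roughly as follows (a sketch; the index and table names are placeholders, and note that you should not disable the clustered index, or the table becomes unreadable until rebuilt):
ALTER INDEX IX_StudentDays_SomeNonclustered ON StudentDays DISABLE;
ALTER TABLE StudentDays DISABLE TRIGGER ALL;
-- ... run the data load ...
ALTER TABLE StudentDays ENABLE TRIGGER ALL;
ALTER INDEX ALL ON StudentDays REBUILD;  -- rebuilding re-enables disabled indexes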
Query below:
select
cu.course_id as 'bb_course_id',
cu.user_id as 'bb_user_id',
cu.role as 'bb_role',
cu.available_ind as 'bb_available_ind',
CASE cu.row_status WHEN 0 THEN 'ENABLED' ELSE 'DISABLED' END AS 'bb_row_status',
eff.course_id as 'registrar_course_id',
eff.user_id as 'registrar_user_id',
eff.role as 'registrar_role',
eff.available_ind as 'registrar_available_ind',
CASE eff.row_status WHEN 'DISABLE' THEN 'DISABLED' END as 'registrar_row_status'
into enrollments_comparison_temp
from narrowed_users_enrollments cu
full outer join enrollments_feed_file eff on cu.course_id = eff.course_id
Quick background: I'm taking the data from a replicated table and selecting it into narrowed_users_enrollments based on some criteria. In a script I'm taking a text feed file, with enrollment data, and inserting it into enrollments_feed_file. The purpose is to compare the most recent enrollment data with enrollments already in the database.
However, the issue is that joining these tables results in about 160,000 rows when I'm really only expecting about 22,000. The point of the comparison is to look for NULL values on either side of the join: if the table on the right contains a NULL, then disable the enrollment record; if the table on the left contains a NULL, then add the student's enrollment.
I know it's a little off because I'm not using PKs or FKs. This is what is selected into the table:
Here's a screenshot showing a select * from the enrollments table on the left and a feed file on the right.
http://i.imgur.com/0ZPZ9HS.png
Here's a screenshot showing the newly created table from the full outer join.
http://i.imgur.com/89ssAkS.png
As you can see, even though there's only one matching enrollment (the matching jmartinez12 columns), 4 extra rows are created pairing the same record on the left with the enrollments on the right. What I want is 5 rows: the first as in the screenshot (the pre-existing enrollment matched with the enrollment in the feed file), but with the next 4 rows having the bb_* columns NULL up to registrar_course_id.
Am I overlooking something simple here? I've tried SELECT DISTINCT, and I've added a WHERE clause requiring the course_ids to be equal, but that prevents the NULL rows that I need. I have also joined the tables on user_id, but the results are still the same.
One quick suggestion is to add the DISTINCT keyword. If the records you are selecting are complete duplicates, that may cut the results down to what you are expecting.
The fix was to also join on:
ON cu.course_id = eff.course_id AND cu.user_id = eff.user_id
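Applied to the original query, only the join changes:
from narrowed_users_enrollments cu
full outer join enrollments_feed_file eff
    on cu.course_id = eff.course_id
    and cu.user_id = eff.user_id
Joining on the composite key means each enrollment on the left pairs with at most one row on the right, while unmatched rows on either side still come through with NULLs, which is exactly what the comparison needs.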