Trimming amount of many-to-many row combinations in bridge table - sql

In SQL Server I have a bridge table to handle the many-to-many relationship between a Person and the Areas they belong to. Initially the Person table is loaded with a reference to the Areas each person belongs to, based on their specific address. However, since many addresses point to the same single area or the same set of areas, I want to trim/consolidate the possible combinations into fewer rows in the bridge table (from 10 million+ rows down to about 10k).
So first I want to generate the new bridge_area table and then populate a new person table with the new Key_Bridge value as per below.

Before doing this, a few things to consider:
Have you changed your processes so this doesn't happen again? When adding new records, you can detect that you already have a mapping to the relevant areas and not create a new duplicate one.
Do you have a plan for how you are going to drop the existing tables and use the new ones? If there are existing foreign keys and indexes, you'll need to account for all of that.
I'm sure there are many different ways of doing this, but one method I think will work:
SELECT key_bridge, CAST(key_area.value AS int) key_area
INTO bridge_area2
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY areas) key_bridge, areas
    FROM
    (
        SELECT DISTINCT STRING_AGG(key_area, ',') WITHIN GROUP (ORDER BY key_area) areas
        FROM bridge_area
        GROUP BY key_address
    ) a
) b CROSS APPLY STRING_SPLIT(areas, ',') key_area
The innermost query uses STRING_AGG to group the key_areas for each key_address together. For example, key_address=13 has areas="100,105". Since the areas are aggregated in sorted order, any key_address that has the exact same set of key_areas produces an identical areas string, so the DISTINCT limits the result to the minimum number of sets we need in the new bridge_area table. (It's important to use a delimiter here that can't exist in your data, but since your key_areas are numeric, a simple comma will do.)
ROW_NUMBER() is used to generate a new key_bridge value for each set of areas we care about, and the CROSS APPLY with STRING_SPLIT is used to separate the areas (e.g. "100,105") back into separate rows. Note the CAST to convert back to a numeric format.
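To make the idea concrete outside the database, here is a minimal Python sketch of the same consolidation. It is not the T-SQL above: a sorted tuple stands in for the ordered STRING_AGG string, and the sample rows and the function name consolidate are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample of the old bridge table: (key_address, key_area) pairs.
bridge_area = [(13, 100), (13, 105), (14, 105), (14, 100), (15, 200)]


def consolidate(rows):
    """Return (new_bridge_rows, address_to_bridge), one key per distinct area set."""
    areas_by_address = defaultdict(set)
    for key_address, key_area in rows:
        areas_by_address[key_address].add(key_area)

    # Sorting each set makes equal sets compare equal, like the ordered STRING_AGG.
    distinct_sets = sorted({tuple(sorted(a)) for a in areas_by_address.values()})
    # Number the distinct sets from 1, the role ROW_NUMBER() plays in the query.
    key_for_set = {areas: i for i, areas in enumerate(distinct_sets, start=1)}

    # Expand each set back into one row per area, like CROSS APPLY STRING_SPLIT.
    new_rows = [(key, area) for areas, key in key_for_set.items() for area in areas]
    address_to_bridge = {
        addr: key_for_set[tuple(sorted(a))] for addr, a in areas_by_address.items()
    }
    return new_rows, address_to_bridge


new_bridge, address_to_bridge = consolidate(bridge_area)
```

In this sample, addresses 13 and 14 share the area set {100, 105}, so they end up with the same new key.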
Then you can create your new person table from that:
SELECT key_person,
    (SELECT key_bridge
     FROM bridge_area2 b2
     GROUP BY key_bridge
     HAVING STRING_AGG(b2.key_area, ',') WITHIN GROUP (ORDER BY b2.key_area)
          = STRING_AGG(b1.key_area, ',') WITHIN GROUP (ORDER BY b1.key_area)) key_bridge
INTO person2
FROM person INNER JOIN bridge_area b1
    ON person.key_address = b1.key_address
GROUP BY key_person
This is similar to the previous query. For each key_person, it figures out the ordered STRING_AGG of all the relevant key_areas in the current bridge_area table and then finds the same ordered grouping of key_areas in the new bridge_area2 table.
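The matching step can be sketched the same way in Python. This is only an illustration under assumed sample data: the helper signature and the three small lists are hypothetical, and a dictionary lookup replaces the correlated subquery.

```python
# Hypothetical sample data standing in for the three tables.
bridge_area = [(13, 100), (13, 105), (14, 100), (14, 105), (15, 200)]  # (key_address, key_area)
bridge_area2 = [(1, 100), (1, 105), (2, 200)]                          # (key_bridge, key_area)
person = [(1, 13), (2, 14), (3, 15)]                                   # (key_person, key_address)


def signature(rows, key):
    """Sorted tuple of areas for one key, the analogue of the ordered STRING_AGG."""
    return tuple(sorted(area for k, area in rows if k == key))


# Index the new bridge by its area signature so each person needs one lookup.
bridge_by_signature = {
    signature(bridge_area2, key): key for key in {k for k, _ in bridge_area2}
}

# For each person, find the new key whose area set equals the old one.
person2 = [
    (key_person, bridge_by_signature[signature(bridge_area, key_address)])
    for key_person, key_address in person
]
```

Persons 1 and 2 live at different addresses but get the same key_bridge because their area sets are identical.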
You can see this working in this Fiddle.
You now have the new tables you want and can rename them after dropping the old ones (with the caveats listed above about indexes, foreign keys, etc.)

Related

(Hive) SQL retrieving data from a column that has 1 to N relationship in another column

How can I retrieve rows where a BID comes up multiple times under an AID?
You can see the sample below: the AID and BID columns are under the PrimaryID, and BIDs are under AIDs. I want to come up with an output that only takes records where BIDs have a 1-to-many relationship with records in the AID column. Example output below.
I provided a small sample of data; in reality I am trying to retrieve 20+ columns and joining 4 tables. I have unique PrimaryIDs and under those I have multiple unique AIDs, but under these AIDs I can have multiple non-unique BIDs that can come up repeatedly under different AIDs.
Hive supports window functions. A window function can associate every row in a group with an attribute of the group, and COUNT() is one of the supported functions. In your case you can use it and select the rows for which that count is greater than 1.
In the PARTITION BY clause you specify which columns define the group, the same way that you would in the more familiar GROUP BY clause.
Something like this:
SELECT * FROM
(
    SELECT *,
        COUNT(*) OVER (PARTITION BY primaryID, AID) counts
    FROM mytable
) x
WHERE counts > 1
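The query shape can be tried locally without a Hive cluster. The sketch below runs it against an in-memory SQLite database from Python; SQLite (3.25 or newer) is swapped in purely for illustration, and the table contents are invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (primaryID INTEGER, AID TEXT, BID TEXT)")
# Hypothetical rows: only (1, 'A1') has more than one BID under it.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, "A1", "B1"), (1, "A1", "B2"), (1, "A2", "B3"), (2, "A3", "B4")],
)

rows = conn.execute(
    """
    SELECT primaryID, AID, BID, counts
    FROM (
        -- Attach the size of each (primaryID, AID) group to every row in it.
        SELECT *, COUNT(*) OVER (PARTITION BY primaryID, AID) AS counts
        FROM mytable
    ) x
    WHERE counts > 1
    ORDER BY primaryID, AID, BID
    """
).fetchall()
```

Changing the PARTITION BY columns changes what counts as a group, so pick the columns whose repetition you want to detect.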

SQL - Append counter to recurring value in query output

I am in the process of creating an organizational chart for my company, and to create the chart, the data must have a unique role identifier and a unique 'reports to role' identifier for each line. Unfortunately my data is not playing ball and it is out of my scope to change the source.
I have two source tables, simplified in the image below. It is important to note a couple of things in the data.
An employee's manager in the query needs to come from the [EmpData] table. The 'ReportsTo' field is only in the [Role] table to be used when a role is vacant.
Any number of employees can hold the same role, but for simplicity let's assume that there will only ever be one person in the 'Reports to' role.
Using this sample data, my query is as follows:
/* Join Role table with employee data table. */
/* Right join so roles with more than one employee will generate a row each. */
SELECT [Role].RoleId As PositionId
,[EmpData].ReportsToRole As ReportsToPosition
,[Role].RoleTitle
,[Empdata].EmployeeName
FROM [Role]
RIGHT JOIN [EmpData] ON [Role].RoleId=[EmpData].[Role]
UNION
/* Output all roles that do not have a holder, 'VACANT' in employee name. */
SELECT [Role].RoleId
,[Role].ReportsToRole
,[Role].RoleTitle
,'VACANT'
FROM [Role]
WHERE [Role].RoleID NOT IN (SELECT RoleID from [empdata])
This almost creates the intended output, but each operator role has 'OPER' in the PositionId column.
For the charting software to work, each position must have a unique identifier.
Any thoughts on how to achieve this outcome? I'm specifically chasing the appended -01, -02, -03 etc. highlighted yellow in the Desired Query Output.
If you are using T-SQL, you should look into using the ROW_NUMBER function with a PARTITION BY clause and combining the result with your existing column.
Specifically, you would add a column to your select: ROW_NUMBER() OVER (PARTITION BY PositionId ORDER BY ReportsToPosition, EmployeeName) AS SeqNum
I would add that to your first query, and then, in your second, I would do something like SELECT PositionId + CASE SeqNum WHEN 1 THEN '' ELSE '-' + CAST(SeqNum AS VARCHAR(100)) END, ...
There are multiple ways to do this, but this will leave out the individual ones that don't need a "-1" and only add it to the rest. The major difference between this and your scheme is that it doesn't contain the "0" pad on the left, which is easy to add, nor would the first "OPER" be "OPER-1"; it would simply be "OPER", but this can also be worked around.
Hopefully this gets you what you need!
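As a rough illustration of the numbering idea, the sketch below runs a variant of it against an in-memory SQLite database from Python. It is not the T-SQL from the answer: SQLite concatenates with || instead of +, uses printf for the zero padding, and a COUNT(*) window is added so that every row of a shared position gets a suffix (-01, -02, ...) while single-holder positions stay bare. The positions table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (PositionId TEXT, EmployeeName TEXT)")
# Hypothetical query output: one manager and three operators sharing 'OPER'.
conn.executemany(
    "INSERT INTO positions VALUES (?, ?)",
    [("MGR", "Alice"), ("OPER", "Bob"), ("OPER", "Carol"), ("OPER", "Dave")],
)

rows = conn.execute(
    """
    SELECT PositionId
           || CASE WHEN Total = 1 THEN ''
                   ELSE '-' || printf('%02d', SeqNum)
              END AS UniquePositionId,
           EmployeeName
    FROM (
        SELECT PositionId, EmployeeName,
               -- Sequence number within each position.
               ROW_NUMBER() OVER (PARTITION BY PositionId ORDER BY EmployeeName) AS SeqNum,
               -- Number of holders of the position, to skip the suffix for singletons.
               COUNT(*) OVER (PARTITION BY PositionId) AS Total
        FROM positions
    ) numbered
    ORDER BY PositionId, SeqNum
    """
).fetchall()
```

The ORDER BY inside the window decides which employee gets which number, so it should use columns that give a stable order.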

Get multiple rows as comma separated string column AND map values to temp table from junction table

I've seen several questions about how to pull together multiple rows into a single comma-separated column with t-sql. I'm trying to map those examples to my own case in which I also need to bring columns from two tables together referencing a junction table. I can't make it work. Here's the query I have:
WITH usersCSV (userEmails, siteID)
AS (SELECT usersSites.siteID, STUFF(
(SELECT ', ' + users.email
FROM users
WHERE usersSites.userID = users.id
FOR XML PATH ('')
GROUP BY usersSites.userID
), 1, 2, '')
FROM usersSites
GROUP BY usersSites.siteID
)
SELECT * FROM usersCSV
The hard stuff here is based on this answer. I've added the WITH which, as I understand it, creates a sort of temporary table (I'm sure it's actually more complicated than that, but humor me.) to hold the values. Obviously, I don't need this just to select the values, but I'm going to be joining this with another table later. (The whole of what I need to do is still more complicated than what I'm trying to do here.)
So, I'm creating a temporary table named usersCSV with two columns which I'm filling by selecting the siteID column from my usersSites table (which is the junction table between users and sites) and selecting ', ' + users.email from my users table which should give me the email address preceded by a comma and space. Then, I chop the first two characters off that using STUFF and group the whole thing by usersSites.siteID.
This query gives me an error identifying line 5 as the problem area:
Column 'usersSites.userID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Why should this matter since the column in question is actually in the WHERE rather than the SELECT as is stated in the error? How can I fix it? I need only the users with an ID that matches an ID in the junction table. I've got tons of users that aren't mapped in that table and have no need to select them.
tl;dr- I need a temp table with distinct sites in one column and a comma-separated list of the email addresses of related users in the other. These two pieces of data come from other tables and will be put together using a junction table on the primary keys of those two tables. Hope that makes sense.
select distinct us.siteId,
    STUFF((select ', ' + u.email
           from users u
           join usersSites us2 on us2.userId = u.id
           where us2.siteId = us.siteId
           for xml path('')), 1, 2, '')
from usersSites us
SQL Fiddle
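The shape of the result (one row per site, with a comma-separated list of the mapped users' emails) can be checked quickly in Python with an in-memory SQLite database. This is a sketch with a different technique: SQLite's GROUP_CONCAT aggregate replaces the STUFF / FOR XML PATH idiom, and the sample rows are invented. Table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE usersSites (userID INTEGER, siteID INTEGER);
    -- User 3 is deliberately not mapped to any site.
    INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com'), (3, 'c@example.com');
    INSERT INTO usersSites VALUES (1, 10), (2, 10), (2, 20);
    """
)

rows = conn.execute(
    """
    SELECT us.siteID, GROUP_CONCAT(u.email, ', ') AS userEmails
    FROM usersSites us
    JOIN users u ON u.id = us.userID
    GROUP BY us.siteID
    ORDER BY us.siteID
    """
).fetchall()

# Split the lists again so the check does not depend on concatenation order.
emails_by_site = {site: sorted(emails.split(", ")) for site, emails in rows}
```

Because the join starts from the junction table, users without a mapping never appear in the output.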

Where are Cartesian Joins used in real life?

Can someone please give examples of such a join in any SQL database?
Just a random example: you have a table of cities with columns Id, Lat, Lon, Name, and you want to show the user a table of distances from each city to every other city. You would write something like:
SELECT c1.Name, c2.Name, SQRT( (c1.Lat - c2.Lat) * (c1.Lat - c2.Lat) + (c1.Lon - c2.Lon)*(c1.Lon - c2.Lon))
FROM City c1, City c2
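A small runnable version of this distance table, sketched in Python against an in-memory SQLite database. The two cities are made-up sample rows, and sqrt is registered from Python because not every SQLite build includes the math functions. The result is a plain Euclidean distance on the Lat/Lon numbers, as in the query above, not a geographic distance.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register sqrt explicitly: not every SQLite build ships the math functions.
conn.create_function("sqrt", 1, math.sqrt)
conn.execute("CREATE TABLE City (Id INTEGER, Lat REAL, Lon REAL, Name TEXT)")
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?, ?)",
    [(1, 0.0, 0.0, "Origin"), (2, 3.0, 4.0, "Away")],
)

# The self cross join pairs every city with every city, itself included.
rows = conn.execute(
    """
    SELECT c1.Name, c2.Name,
           sqrt((c1.Lat - c2.Lat) * (c1.Lat - c2.Lat)
              + (c1.Lon - c2.Lon) * (c1.Lon - c2.Lon)) AS Distance
    FROM City c1 CROSS JOIN City c2
    ORDER BY c1.Id, c2.Id
    """
).fetchall()
```

With N cities the result has N * N rows, which is exactly the full matrix the example asks for.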
Here are two examples:
To create multiple copies of an invoice or other document you can populate a temporary table with names of the copies, then cartesian join that table to the actual invoice records. The result set will contain one record for each copy of the invoice, including the "name" of the copy to print in a bar at the top or bottom of the page or as a watermark. Using this technique the program can provide the user with checkboxes letting them choose what copies to print, or even allow them to print "special copies" in which the user inputs the copy name.
CREATE TEMP TABLE tDocCopies (CopyName TEXT(20))
INSERT INTO tDocCopies (CopyName) VALUES ('Customer Copy')
INSERT INTO tDocCopies (CopyName) VALUES ('Office Copy')
...
INSERT INTO tDocCopies (CopyName) VALUES ('File Copy')
SELECT * FROM InvoiceInfo, tDocCopies WHERE InvoiceDate = TODAY()
To create a calendar matrix, with one record per person per day, cartesian join the people table to another table containing all days in a week, month, or year.
SELECT People.PeopleID, People.Name, CalDates.CalDate
FROM People, CalDates
I've noticed this being done to try to deliberately slow down the system either to perform a stress test or an excuse for missing development deliverables.
Usually, to generate a superset for the reports.
In PostgreSQL:
SELECT COALESCE(SUM(sales), 0)
FROM generate_series(1, 12) month
CROSS JOIN
department d
LEFT JOIN
sales s
ON s.department = d.id
AND s.month = month
GROUP BY
d.id, month
This is the only time in my life that I've found a legitimate use for a Cartesian product.
At the last company I worked at, there was a report that was requested on a quarterly basis to determine what FAQs were used at each geographic region for a national website we worked on.
Our database described geographic regions (markets) by a tuple (4, x), where 4 represented a level number in a hierarchy, and x represented a unique marketId.
Each FAQ is identified by an FaqId, and each association to an FAQ is defined by the composite key marketId tuple and FaqId. The associations are set through an admin application, but given that there are 1000 FAQs in the system and 120 markets, it was a hassle to set initial associations whenever a new FAQ was created. So, we created a default market selection, and overrode a marketId tuple of (-1,-1) to represent this.
Back to the report - the report needed to show every FAQ question/answer and the markets that displayed this FAQ in a 2D matrix (we used an Excel spreadsheet). I found that the easiest way to associate each FAQ to each market in the default market selection case was with this query, unioning the exploded result with all other direct FAQ-market associations.
The Faq2LevelDefault table holds all of the markets that are defined as being in the default selection (I believe it was just a list of marketIds).
SELECT FaqId, fld.LevelId, 1 [Exists]
FROM Faq2Levels fl
CROSS JOIN Faq2LevelDefault fld
WHERE fl.LevelId=-1 and fl.LevelNumber=-1 and fld.LevelNumber=4
UNION
SELECT Faqid, LevelId, 1 [Exists] from Faq2Levels WHERE LevelNumber=4
You might want to create a report using all of the possible combinations from two lookup tables, in order to create a report with a value for every possible result.
Consider bug tracking: you've got one table for severity and another for priority and you want to show the counts for each combination. You might end up with something like this:
-- count a column from errors (not *) so combinations with no errors show 0
select severity_name, priority_name, count(e.severity_id)
from (select severity_id, severity_name,
             priority_id, priority_name
      from severity, priority) sp
left outer join
     errors e
  on e.severity_id = sp.severity_id
 and e.priority_id = sp.priority_id
group by severity_name, priority_name
In this case, the cartesian join between severity and priority provides a master list that you can create the later outer join against.
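Here is a minimal sketch of that report in Python with an in-memory SQLite database. The table contents and the error_id column are assumptions made for the example. It counts a column from the outer-joined errors table, so combinations without any errors come out as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE severity (severity_id INTEGER, severity_name TEXT);
    CREATE TABLE priority (priority_id INTEGER, priority_name TEXT);
    CREATE TABLE errors (error_id INTEGER, severity_id INTEGER, priority_id INTEGER);
    INSERT INTO severity VALUES (1, 'minor'), (2, 'major');
    INSERT INTO priority VALUES (1, 'low'), (2, 'high');
    INSERT INTO errors VALUES (1, 1, 1), (2, 1, 1), (3, 2, 2);
    """
)

rows = conn.execute(
    """
    SELECT sp.severity_name, sp.priority_name, COUNT(e.error_id) AS error_count
    FROM (
        -- Master list: every severity paired with every priority.
        SELECT severity_id, severity_name, priority_id, priority_name
        FROM severity CROSS JOIN priority
    ) sp
    LEFT OUTER JOIN errors e
           ON e.severity_id = sp.severity_id
          AND e.priority_id = sp.priority_id
    GROUP BY sp.severity_id, sp.severity_name, sp.priority_id, sp.priority_name
    ORDER BY sp.severity_id, sp.priority_id
    """
).fetchall()
```

Without the cross join, the two combinations that have no errors would be missing from the report instead of showing a zero.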
When running a query for each date in a given range. For example, for a website, you might want to know for each day, how many users were active in the last N days. You could run a query for each day in a loop, but it's simplest to keep all the logic in the same query, and in some cases the DB can optimize the Cartesian join away.
To create a list of related words in text mining, using similarity functions, e.g. Edit Distance

How can I compare two tables and delete on matching fields (not matching records)

Scenario: A sampling survey needs to be performed on a membership of 20,000 individuals. The survey sample size is 3,500 of the 20,000 members. All members are in table tblMember. The same survey was performed the previous year, and the members who were surveyed are in tblSurvey08. Membership data can change over the year (e.g. new email address, etc.) but the MemberID stays the same.
How do I remove the MemberIDs/records contained in tblSurvey08 from tblMember to create a new table of potential members to be surveyed (let's call it tblPotentialSurvey09)? Again, the record for an individual member may not match between the two tables, but the MemberID field will remain constant.
I am fairly new at this stuff and I seem to be having a problem Googling a solution. I could use the EXCEPT operator, but the records for individual members are not necessarily the same from one table to the next; only the MemberID is the same.
Thanks
SELECT
    m.*  -- replace with an explicit column list
FROM
    tblMember m
LEFT JOIN
    tblSurvey08 s08
    ON m.MemberID = s08.MemberID
WHERE
    s08.MemberID IS NULL
This gives you only the members who were not in the 08 survey. The join is often more efficient than a NOT IN construct, and unlike NOT IN it still returns rows when tblSurvey08 contains a NULL MemberID.
A new table is not such a great idea, since you are duplicating data. A view with the above query would be a better choice.
I apologize in advance if I didn't understand your question but I think this is what you're asking for. You can use the insert into statement.
insert into tblPotentialSurvey09
select your_criteria from tblMember where tblMember.MemberId not in (
select MemberId from tblSurvey08
)
First of all, I wouldn't create a new table just for selecting potential members. Instead, I would create a new true/false (1/0) field telling if they are eligible.
However, if you'd still want to copy data to the new table, here's how you can do it:
INSERT INTO tblPotentialSurvey09 (MemberID)
SELECT MemberID
FROM tblMember m
WHERE NOT EXISTS (SELECT 1 FROM tblSurvey08 s WHERE s.MemberID = m.MemberID)
If you just want to create a new field as I suggested, a similar query would do the job.
An outer join should do:
select m_09.MemberID
from tblMember m_09
left outer join tblSurvey08 m_08
    on m_09.MemberID = m_08.MemberID
where m_08.MemberID is null
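The three approaches from the answers (LEFT JOIN ... IS NULL, NOT EXISTS, NOT IN) can be compared side by side. The sketch below does so in Python against an in-memory SQLite database with invented sample rows. The NULL MemberID in tblSurvey08 is a deliberate assumption, added to show how NOT IN behaves differently from the other two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblMember (MemberID INTEGER, Email TEXT);
    CREATE TABLE tblSurvey08 (MemberID INTEGER, Email TEXT);
    -- Member 1 changed email since 2008; only MemberID links the two rows.
    INSERT INTO tblMember VALUES (1, 'new1@example.com'), (2, 'b@example.com'), (3, 'c@example.com');
    INSERT INTO tblSurvey08 VALUES (1, 'old1@example.com'), (NULL, 'unknown@example.com');
    """
)

left_join = conn.execute(
    """
    SELECT m.MemberID
    FROM tblMember m
    LEFT JOIN tblSurvey08 s ON m.MemberID = s.MemberID
    WHERE s.MemberID IS NULL
    ORDER BY m.MemberID
    """
).fetchall()

not_exists = conn.execute(
    """
    SELECT m.MemberID
    FROM tblMember m
    WHERE NOT EXISTS (SELECT 1 FROM tblSurvey08 s WHERE s.MemberID = m.MemberID)
    ORDER BY m.MemberID
    """
).fetchall()

# With a NULL in the subquery, NOT IN is never true, so no rows come back.
not_in = conn.execute(
    """
    SELECT MemberID
    FROM tblMember
    WHERE MemberID NOT IN (SELECT MemberID FROM tblSurvey08)
    """
).fetchall()
```

If MemberID is declared NOT NULL in tblSurvey08, all three forms return the same members, and the choice comes down to readability and how the database optimizes each one.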