How to GROUP_CONCAT with a 3-table JOIN for genealogy

I can't work out how to get the following outcome. I thought GROUP_CONCAT might do it, but I am also joining 3 tables, and I am unclear on the correct syntax or whether this is even the best approach.
Generic table layout:
Table Users: user_id | first | last
Table Orgs: org_id | org_name
Table Relationship: user_id | org_id | start_year | end_year
The relationship table can have MANY entries associated with a specific user_id.
I need the User columns: id, first, last. I'd like to group the org data into 1 concatenated, delimited field (maybe a double GROUP_CONCAT is needed?) consisting of the org_id, org_name, start_year & end_year for all records in the relationship table that match the user_id. I'm hoping for an output like this:
Each '|' represents a new column/piece of data.
If there was only 1 org_id associated with the user_id, the output would be (similar) to:
user_id | first | last | org_id-org_name-start_year-end_year
If there were more than 1 org found/associated with that user_id, the output would have more concatenated/delimited data in the same column:
user_id | first | last | org_id-org_name-start_year-end_year^org_id-org_name-start_year-end_year^org_id-org_name-start_year-end_year
(Notice the '-' delimiter between values and the '^' delimiter between new 'org-grouped' data.)
When I grab that data, I can then just break it up (on the backend/PHP side of things) into an array or whatever.
I'm not sure how I can GROUP_CONCAT (if that is even the best approach here?) while I have to JOIN on 3 separate tables.
This is not my REAL query. (I'm not sure if I should post it, as I do not want to cause any confusion as it does NOT match my dummy table/column names.)
I just wanted to show my attempt that gets me 3 individual rows, (using my JOINS) but no GROUP_CONCAT stuff:
SELECT genealogy_users.imis_id, genealogy_users.full_name,
genealogy_users.member_email, genealogy_orgs.org_id,
genealogy_orgs.org_name, genealogy_relations.user_id,
genealogy_relations.relation_type, genealogy_relations.start_year,
genealogy_relations.end_year
FROM genealogy_users
INNER JOIN genealogy_relations ON genealogy_users.imis_id = genealogy_relations.user_id
INNER JOIN genealogy_orgs ON genealogy_relations.org_id = genealogy_orgs.org_id
WHERE genealogy_users.imis_id = '00003';
UPDATE:
Well, I seem to have fudged my way through it, but I'm not sure how legit this is.
It's -ALMOST- there. I believe I still need a JOIN or something, since genealogy_orgs.org_id = '84864' is hardcoded and it should NOT be.
SELECT genealogy_users.*,
(SELECT GROUP_CONCAT(org_id,'-',
(SELECT org_name FROM genealogy_orgs WHERE genealogy_orgs.org_id = '84864'),
'-',start_year,'-',end_year,'^')
FROM genealogy_relations WHERE genealogy_relations.user_id = genealogy_users.imis_id
) AS alumni_list
FROM genealogy_users
WHERE genealogy_users.imis_id = '00003';
UPDATE 2:
My final attempt, which I think is getting me what I need. (But it's late, and I'll check back tomorrow and look at things more closely.)
SELECT genealogy_users.imis_id, genealogy_users.full_name,
genealogy_users.member_email, genealogy_orgs.org_id,
genealogy_orgs.org_name, genealogy_relations.user_id,
genealogy_relations.relation_type, genealogy_relations.start_year,
genealogy_relations.end_year,
(SELECT GROUP_CONCAT(org_id,'-',org_name,'-',start_year,'-',end_year,'^')
FROM genealogy_relations
WHERE genealogy_relations.user_id = genealogy_users.imis_id
) AS alumni_list
FROM genealogy_users
INNER JOIN genealogy_relations ON genealogy_users.imis_id = genealogy_relations.user_id
INNER JOIN genealogy_orgs ON genealogy_relations.org_id = genealogy_orgs.org_id
WHERE genealogy_users.imis_id = '00003';
Is there anything to make note of in the above attempt? Or is there a better approach? Hopefully something easily readable so it makes sense?
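One possible approach, sketched with the dummy table names from the top of the question: join the 3 tables once and use GROUP BY on the user columns, so the aggregate collects every org row for that user without a correlated subquery. The sketch runs against an in-memory SQLite database through Python only so it is self-contained; the sample rows are made up. SQLite spells it group_concat(value, separator), while in MySQL the same aggregate would be written GROUP_CONCAT(CONCAT_WS('-', o.org_id, o.org_name, r.start_year, r.end_year) SEPARATOR '^').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT PRIMARY KEY, first TEXT, last TEXT);
CREATE TABLE orgs (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE relationship (user_id TEXT, org_id INTEGER,
                           start_year INTEGER, end_year INTEGER);
-- made-up sample data
INSERT INTO users VALUES ('00003', 'Ann', 'Lee');
INSERT INTO orgs VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO relationship VALUES ('00003', 1, 2001, 2003),
                                ('00003', 2, 2005, 2009);
""")

# '-' joins the values of one org entry, '^' separates the entries.
query = """
SELECT u.user_id, u.first, u.last,
       group_concat(o.org_id || '-' || o.org_name || '-' ||
                    r.start_year || '-' || r.end_year, '^') AS org_list
FROM users AS u
JOIN relationship AS r ON r.user_id = u.user_id
JOIN orgs AS o ON o.org_id = r.org_id
WHERE u.user_id = ?
GROUP BY u.user_id, u.first, u.last
"""
row = conn.execute(query, ("00003",)).fetchone()

# SQLite does not guarantee the order inside group_concat, so sort after splitting.
org_entries = sorted(row[3].split("^"))
print(row[:3], org_entries)
```

Two things to watch for in MySQL: GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), and an org_name that contains '-' or '^' would break the split on the PHP side.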

Related

SQL One to many relationships, duplicate/results being returned as unique rows

I have a database in which some of the tables have one-to-many relationships. How do I stop each related record from being returned as its own row?
For instance, I have an initiative table, and an initiative can have many funding requirements. When I perform an inner join I get the results, but the rows appear duplicated, one per value from the funding table.
The results should look like this:
Rows 3, 4 and 5 should be a single row listing the funding required:
Description | Acad_priority_1 | Acad_priority_2 | beginning_fiscal_year
Develop...  | false           | true            | 2018/2019
            |                 |                 | 2018/2019
            |                 |                 | 2019/2020
Can you please steer me in the right direction or show me how the SQL should be structured to achieve this?
SQL:
SELECT plan_master.plan_id,
plan_master.date_submitted,
plan_master.filename,
initiative_master.plan_id,
initiative_master.NAME,
initiative_master.acad_priority_1,
funding.initiative_id,
funding.beginning_fiscal_year
FROM plan_master
JOIN initiative_master
ON plan_master.plan_id = initiative_master.plan_id
JOIN funding
ON initiative_master.initiative_id = funding.initiative_id
ORDER BY Filename
|plan_id|date_submitted|filename|plan_id|NAME|acad_priority_1|initiative_id|beginning_fiscal_year|
|16F44FFE-5434-4E52-9D9A-F45C0A49D8E2|2018-12-03|1.txt|16F44FFE-5434-4E52-9D9A-F45C0A49D8E2|Space Utilization framework|false|8CCE0311-0E3C-467D-B675-04817A473056|2018/2019
|16F44FFE-5434-4E52-9D9A-F45C0A49D8E2|2018-12-03|1.txt|16F44FFE-5434-4E52-9D9A-F45C0A49D8E2|Space Utilization framework|false|8CCE0311-0E3C-467D-B675-04817A473056|2019/2020
|16F44FFE-5434-4E52-9D9A-F45C0A49D8E2|2018-12-03|1.txt|16F44FFE-5434-4E52-9D9A-F45C0A49D8E2|Space Utilization framework|false|8CCE0311-0E3C-467D-B675-04817A473056|2020/2021
The 2 beginning fiscal year values cannot be combined into a single row unless you want to concatenate them with commas (or another separator), or write a function to show the value as a range, for example 2018-2020. You can however get rid of the 4th record using distinct or using the below mentioned over/partition by clauses.
If you don't mind, can you run the following query and provide the results that will help me identify the duplication issue:
SELECT plan_master.plan_id,
plan_master.date_submitted,
plan_master.filename,
plan_master.department,
plan_master.last_name,
plan_master.first_name,
plan_master.email,
plan_master.mission_statement,
plan_master.vision_statement,
plan_master.goals_objectives,
initiative_master.plan_id,
initiative_master.NAME,
initiative_master.description,
initiative_master.acad_priority_1,
initiative_master.acad_priority_2,
initiative_master.acad_priority_3,
initiative_master.acad_priority_4,
initiative_master.acad_priority_5,
initiative_master.acad_priority_6,
initiative_master.operational_sustainability,
initiative_master.people_plan,
funding.initiative_id,
funding.beginning_fiscal_year
FROM plan_master
JOIN initiative_master
ON plan_master.plan_id = initiative_master.plan_id
JOIN funding
ON initiative_master.initiative_id = funding.initiative_id
ORDER BY Filename
Once you get to the cause, you can either use a better join clause (multiple conditions), add a where clause, or use the OVER clause in conjunction with the PARTITION BY clause to filter the data based on a ROW_NUMBER().
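The comma-separated option mentioned at the start of this answer can be sketched as follows. It uses an in-memory SQLite database through Python so it runs on its own; the two tables are cut down to the columns needed and the sample rows are made up. The first query shows one row per funding record, the second collapses them with GROUP BY. group_concat is the SQLite spelling; on SQL Server 2017 or later the equivalent aggregate would be STRING_AGG(f.beginning_fiscal_year, ', '), assuming that is the engine in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE initiative_master (initiative_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE funding (initiative_id INTEGER, beginning_fiscal_year TEXT);
-- made-up sample data: one initiative with three funding rows
INSERT INTO initiative_master VALUES (1, 'Space Utilization framework');
INSERT INTO funding VALUES (1, '2018/2019'), (1, '2019/2020'), (1, '2020/2021');
""")

# Plain one-to-many join: the initiative is repeated once per funding row.
joined_rows = conn.execute("""
SELECT i.name, f.beginning_fiscal_year
FROM initiative_master AS i
JOIN funding AS f ON f.initiative_id = i.initiative_id
""").fetchall()

# Grouping by the initiative collapses the funding rows into one string.
grouped_rows = conn.execute("""
SELECT i.name, group_concat(f.beginning_fiscal_year, ', ') AS fiscal_years
FROM initiative_master AS i
JOIN funding AS f ON f.initiative_id = i.initiative_id
GROUP BY i.initiative_id, i.name
""").fetchall()

# Order inside group_concat is not guaranteed in SQLite, so sort after splitting.
fiscal_years = sorted(grouped_rows[0][1].split(", "))
print(len(joined_rows), len(grouped_rows), fiscal_years)
```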
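A join with no restricting condition returns a Cartesian product: if one table has 2 rows and another has 3, there will be 6 rows of data. With a one-to-many join condition, each parent row is still repeated once for every matching child row, so you need to apply DISTINCT to the data. You can do this: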
SELECT plan.date_submitted,
plan.filename,
plan.department,
plan.last_name,
plan.first_name,
plan.email,
plan.mission_statement,
plan.vision_statement,
plan.goals_objectives,
initiative.Name,
initiative.description,
initiative.acad_priority_1,
initiative.acad_priority_2,
initiative.acad_priority_3,
initiative.acad_priority_4,
initiative.acad_priority_5,
initiative.acad_priority_6
FROM plan_master as plan
inner join (select distinct init.plan_id, init.NAME,
init.description,
init.acad_priority_1,
init.acad_priority_2,
init.acad_priority_3,
init.acad_priority_4,
init.acad_priority_5,
init.acad_priority_6,
init.operational_sustainability,
init.people_plan,
funding.beginning_fiscal_year from initiative_master as init
join funding on funding.initiative_id = init.initiative_id ) as initiative
ON plan.plan_id = initiative.plan_id
ORDER BY Filename

Count of how many times id occurs in table SQL regexp

Hi I have a redshift table of articles that has a field on it that can contain many accounts. So there is a one to many relationship between articles to accounts.
However I want to create a new view where it lists the partner id's in one column and in another column a count of how many times the partner id appears in the articles table.
I've attempted to do this using regex and created a new redshift view, but am getting weird results where it doesn't always build properly. So one day it will say a partner appears 15 times, then the next 17, then the next 15, when the partner id count hasn't actually changed.
Any help would be greatly appreciated.
SELECT partner_id,
COUNT(DISTINCT id)
FROM (SELECT id,
partner_ids,
SPLIT_PART(partner_ids,',',i) partner_id
FROM positron_articles a
LEFT JOIN util.seq_0_to_500 s
ON s.i < regexp_count (partner_ids,',') + 2
OR s.i = 1
WHERE i > 0
AND regexp_count (partner_ids,',') = 0
ORDER BY id)
GROUP BY 1;
Let's start with some of the more obvious things and see if we can start to glean other information.
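First, GROUP BY 1 on your outer query is positional. It is valid in Redshift, but writing GROUP BY partner_id is clearer and does not break if the select list is reordered.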
Next you don't need an order by in your INNER query and the database engine will probably do a better job optimizing performance without it so remove ORDER BY id.
If you want your final results to be ordered then add an ORDER BY partner_id or similar clause after your group by of your OUTER query.
It looks like there are also problems with how you are splitting a partnerid from partnerids but I am not positive about that because I need to understand your view and the data it provides to know how that affects your record count for partnerid.
Next your LEFT JOIN statement on the util.seq_0_to_500 I am pretty sure you can drop off the s.i = 1 as the first condition will satisfy that as well because 2 is greater than 1. However your left join really acts more like an inner join because you then exclude any non matches from positron_articles that don't have a s.i > 0.
Oddly then your entire join and inner query gets kind of discarded because you only want articles that have no commas in their partnerids: regexp_count (partner_ids,',') = 0
I would suggest posting the code for your util.seq_0_to_500, and if you have a partner table let us know about that as well, because you can probably get your answer a lot more easily with that additional table, depending on how regexp_count works. I suspect regexp_count(partner_ids, partner_id), for example regexp_count('12345,678', '1234'), will return greater than 0, at which point you have no choice but to split the delimited strings into another table before counting, or build a new matching function.
If regexp_count only matches exact values between commas and you have a partner table, your query could be as easy as this:
SELECT
p.partner_id
,COUNT(a.id) AS ArticlesAppearedIn
FROM
positron_articles a
LEFT JOIN PARTNERTABLE p
ON regexp_count(a.partner_ids, p.partner_id) > 0
GROUP BY
p.partner_id
I will actually correct myself, as I just thought of a way to join a partner table without regexp_count. So if you have a partner table this might work for you. If not, you will need to split strings. It basically tests whether the partner_id is the entire partner_ids value, or sits at the beginning, in the middle, or at the end of it. If one of those conditions is met then the record is returned.
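SELECT
p.partner_id
,COUNT(a.id) AS ArticlesAppearedIn
FROM
PARTNERTABLE p
INNER JOIN positron_articles a
ON
(
CASE
-- partner_id is the whole list
WHEN a.partner_ids = CAST(p.partner_id AS VARCHAR(100)) THEN 1
-- at the beginning (|| is the string concatenation operator in Redshift)
WHEN a.partner_ids LIKE CAST(p.partner_id AS VARCHAR(100)) || ',%' THEN 1
-- in the middle
WHEN a.partner_ids LIKE '%,' || CAST(p.partner_id AS VARCHAR(100)) || ',%' THEN 1
-- at the end
WHEN a.partner_ids LIKE '%,' || CAST(p.partner_id AS VARCHAR(100)) THEN 1
ELSE 0
END
) = 1
GROUP BY
p.partner_id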
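The four cases in the query above can be folded into a single test by wrapping both sides in commas, so that every id in the list is delimited on both sides. Below is a minimal sketch of that variation. It runs on an in-memory SQLite database through Python only to stay self-contained; the partners table and the sample rows are hypothetical, and the column names follow the question (partner_ids, partner_id). The || operator and LIKE behave the same way in Redshift for this pattern, though that is worth verifying against your own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE partners (partner_id TEXT PRIMARY KEY);          -- hypothetical
CREATE TABLE articles (id INTEGER PRIMARY KEY, partner_ids TEXT);
-- made-up sample data; note '1234' is a prefix of '12345'
INSERT INTO partners VALUES ('1234'), ('12345'), ('678');
INSERT INTO articles VALUES (1, '12345,678'), (2, '1234'),
                            (3, '678,1234,9'), (4, NULL);
""")

# ',12345,678,' contains ',678,' but not ',1234,', so prefixes do not match.
query = """
SELECT p.partner_id, COUNT(a.id) AS articles_appeared_in
FROM partners AS p
LEFT JOIN articles AS a
  ON ',' || a.partner_ids || ',' LIKE '%,' || p.partner_id || ',%'
GROUP BY p.partner_id
ORDER BY p.partner_id
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

The LEFT JOIN keeps partners that appear in no article (with a count of 0). This assumes partner ids never contain the LIKE wildcards % or _.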

Ways to Clean-up messy records in sql

I have the following sql data:
ID Company Name Customer Address 1 City State Zip Date
0108500 AAA Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046
0108500 AAA.Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046
1802600 AAA Test Company Ban, Adj.~Gorge PO Box 83 MouLaurel CA 153 09JS0025
1210600 AAA Test Company Biwel~Brce 97kehst ve Jacn CA 153 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered as one company.
Since their data is messy I'm thinking either to do this:
Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?
In this case, the AAA Test and AAA.Test will be AAA Test Company.
OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?
If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.
Thank you in advance.
You could use a string matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate people's names that have been typed in differently. It can take a while, but it does work well for the fuzzy match you're looking for.
Something like a self join? || is ANSI SQL concat, some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas and dots etc.
Use a case-insensitive collation. SOUNDEX can also be used.
I think most database servers support full-text search, and if so there are some functions related to full-text search that support proximity.
For example, there is a NEAR function in SQL Server; here is its documentation: https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows
/** Populate dupstr column for fuzzy match **/
update vendor v
-- nested REPLACE strips dots and spaces; a regex pattern of '.' would match every character
set v.dupstr = left(upper(replace(replace(v.companyname,'.',''),' ','')),6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
--dupes to clean up
exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr
and v.id != v1.id )
and
( --keeper has longest name
length(v.companyname) =
( select max(length(v2.companyname)) from vendor v2
where v.dupstr = v2.dupstr
)
or
--keeper has latest record (assuming ID is sequential)
v.id =
( select max(v3.id) from vendor v3
where v.dupstr = v3.dupstr
)
)
;
The above SQL can be refined to add "dupe" status to other records , or you can do a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
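The first two stages above can be tried out on a few rows before touching the real table. The sketch below is only an illustration under some assumptions: it uses an in-memory SQLite database through Python, a cut-down vendor table with made-up ids, and substr in place of left because SQLite has no LEFT function. It builds the 6-character dupstr key and then lists, for each key shared by more than one row, the longest company name as the candidate to keep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendor (id INTEGER PRIMARY KEY, companyname TEXT, dupstr TEXT);
-- made-up ids; names taken from the question plus one unrelated company
INSERT INTO vendor (id, companyname) VALUES
  (1, 'AAA Test'), (2, 'AAA.Test'), (3, 'AAA Test Company'), (4, 'Zed Corp');

-- Stage 1: upper case, strip dots, commas and spaces, keep the first 6 characters.
UPDATE vendor
SET dupstr = substr(upper(replace(replace(replace(
                 companyname, '.', ''), ',', ''), ' ', '')), 1, 6);
""")

# Stage 2: for keys shared by several rows, pick the longest name as the keeper.
keepers = conn.execute("""
SELECT v.dupstr, v.companyname
FROM vendor AS v
WHERE (SELECT count(*) FROM vendor AS v3 WHERE v3.dupstr = v.dupstr) > 1
  AND length(v.companyname) =
      (SELECT max(length(v2.companyname)) FROM vendor AS v2
       WHERE v2.dupstr = v.dupstr)
""").fetchall()

keys = dict(conn.execute("SELECT companyname, dupstr FROM vendor").fetchall())
print(keys, keepers)
```

A 6-character prefix is a blunt key: unrelated companies that share their first 6 letters will be grouped together, so the result is best treated as a review list rather than applied blindly.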
You can use an SQL query with SOUNDEX or DIFFERENCE
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0 - 4 (4 = almost the same, 0 = totally different)
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

how to get ID of field based on the Max of another field in same table

I have a table named ae_types. It contains three fields that are relevant to my question:
aetId This is Auto Increment and is the Primary Key
aetProposalType Text field 5 characters long
aetDaysToWait Byte data type
aetProposalType and aetDaysToWait are in a unique key so that I am guaranteed that there will never be two aetProposalTypes with the same aetDaysToWait.
The result that I am looking for is the aetId of the row with the largest aetDaysToWait for each aetProposalType.
Below is the query that I have come up with to accomplish this, but it seems to me like it is possibly unnecessarily complicated and not very beautiful.
SELECT ae_types.aetId AS lastEmailId, ae_types.aetProposalType
FROM ae_types INNER JOIN
(SELECT ae_types.aetProposalType, Max(ae_types.aetDaysToWait) AS MaxOfaetDaysToWait
FROM ae_types GROUP BY ae_types.aetProposalType) AS ae_maxDaysToWaitByProposalType
ON (ae_types.aetDaysToWait = ae_maxDaysToWaitByProposalType.MaxOfaetDaysToWait)
AND (ae_types.aetProposalType = ae_maxDaysToWaitByProposalType.aetProposalType);
What are some alternative solutions and why would they be better?
PS If you have any questions please ask and I will be happy to attempt to provide the answer.
That's the way I'd do it too.
select a.aetId, a.aetProposalType, a.aetDaysToWait
from ae_types a
inner join (select aetProposalType, max(aetDaysToWait) as MaxDays
from ae_types
group by aetProposalType) sq
on a.aetProposalType = sq.aetProposalType
and a.aetDaysToWait = sq.MaxDays
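For comparison, one common alternative is an anti-join with NOT EXISTS: keep a row only if no other row of the same aetProposalType has a larger aetDaysToWait. It is not necessarily better than the join on the grouped subquery, it just expresses the same rule without an aggregate. The sketch below runs on an in-memory SQLite database through Python with made-up rows; the SELECT itself is plain SQL that should carry over to other engines, though that is an assumption to check on yours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ae_types (
    aetId INTEGER PRIMARY KEY AUTOINCREMENT,
    aetProposalType TEXT,
    aetDaysToWait INTEGER,
    UNIQUE (aetProposalType, aetDaysToWait)
);
-- made-up sample data; aetId is assigned 1..5 in insertion order
INSERT INTO ae_types (aetProposalType, aetDaysToWait) VALUES
  ('A', 5), ('A', 30), ('B', 10), ('A', 14), ('B', 3);
""")

# Keep a row only when no row of the same type waits longer.
query = """
SELECT a.aetId AS lastEmailId, a.aetProposalType
FROM ae_types AS a
WHERE NOT EXISTS (
    SELECT 1
    FROM ae_types AS b
    WHERE b.aetProposalType = a.aetProposalType
      AND b.aetDaysToWait > a.aetDaysToWait
)
ORDER BY a.aetProposalType
"""
last_emails = conn.execute(query).fetchall()
print(last_emails)
```

Because (aetProposalType, aetDaysToWait) is unique, exactly one row per type survives the NOT EXISTS test.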

complex MySQL Order by not working

Here is the select statement I'm using. The problem happens with the sorting. When it is like below, it only sorts by t2.userdb_user_first_name, doesn't matter if I put that first or second. When I remove that, it sorts just fine by the displayorder field value pair. So I know that part is working, but somehow the combination of the two causes the first_name to override it. What I want is for the records to be sorted by displayorder first, and then first_name within that.
SELECT t1.userdb_id
FROM default_en_userdbelements as t1
INNER JOIN default_en_userdb AS t2 ON t1.userdb_id = t2.userdb_id
WHERE t1.userdbelements_field_name = 'newproject'
AND t1.userdbelements_field_value = 'no'
AND t2.userdb_user_first_name!='Default'
ORDER BY
(t1.userdbelements_field_name = 'displayorder' AND t1.userdbelements_field_value),
t2.userdb_user_first_name;
Edit: here is what I want to accomplish. I want to list the users (that are not new projects) from the userdb table, along with the details about the users that is stored in userdbelements. And I want that to be sorted first by userdbelements.displayorder, then by userdb.first_name. I hope that makes sense? Thanks for the really quick help!
Edit: Sorry for disappearing, here is some sample data
userdbelements
userdbelements_id userdbelements_field_name userdbelements_field_value userdb_id
647 heat 1
648 displayorder 1 - Sponsored 1
645 condofees 1
userdb
userdb_id userdb_user_name userdb_emailaddress userdb_user_first_name userdb_user_last_name
10 harbourlights info#harbourlightscondosminium.ca Harbourlights 1237 Northshore Blvd, Burlington
11 harbourview info#harbourviewcondominium.ca Harbourview 415 Locust Street, Burlington
12 thebalmoral info#thebalmoralcondominium.ca The Balmoral 2075 & 2085 Amherst Heights Drive, Burlington
Your ORDER BY expression does not do what you expect.
ORDER BY (t1.userdbelements_field_name = 'displayorder' AND t1.userdbelements_field_value)
MySQL accepts an expression here, but the WHERE clause has already limited t1.userdbelements_field_name to 'newproject', so the comparison with 'displayorder' is false on every row.
The expression is therefore the same constant for every row, and only t2.userdb_user_first_name ends up affecting the order.
Could you please modify your question to state exactly what it is that you are trying to accomplish in your order by clause?
From what I understand, you'd have to join to default_en_userdbelements solely for the displayorder value. However, I suspect there's something wrong with your query and that it probably returns duplicate values for userdb_id.
Perhaps you should say what you're trying to actually do, not explain the way you're trying to do it.
SELECT t1.userdb_id
FROM default_en_userdbelements AS t1
JOIN default_en_userdb AS t2 ON t1.userdb_id = t2.userdb_id
JOIN default_en_userdbelements AS o ON (o.userdb_id, o.userdbelements_field_name)
= (t1.userdb_id, 'displayorder')
WHERE t1.userdbelements_field_name = 'newproject'
AND t1.userdbelements_field_value = 'no'
AND t2.userdb_user_first_name != 'Default'
ORDER BY o.userdbelements_field_value,
t2.userdb_user_first_name
You could do something like this:
ORDER BY
(CASE WHEN t1.userdbelements_field_name = 'displayorder'
THEN t1.userdbelements_field_value
ELSE $some_large_number
END),
t2.userdb_user_first_name;
It sorts using the value of t1.userdbelements_field_value when t1.userdbelements_field_name = 'displayorder', but you have to supply some other value of the same type to apply for the ELSE.