Check for similarity in string using Postgres

I cannot disclose the actual data, so I am using an example. I have two tables. The first is a dictionary table that holds the IDs for the titles. The second is new data coming into the database, which doesn't have IDs yet. I need to set the ID for each new row by checking whether the dictionary table already contains something similar. If it does not, I need to add the new value to the dictionary, get a new ID for it, and use that ID for the new data. The Expected ID column in the second table shows what I expect the IDs to be updated to.
ID  Title
--  ------------------------
1   Aliens
2   The Hunger Games
3   John Wick
4   Alien vs Predator

ID    Title                                 Expected ID
----  ------------------------------------  --------------------
null  The Hunger Games: Mockingjay Part I   2
null  The Hunger Games: Mockingjay Part II  2
null  John Wick (2014)                      3
null  John Wick Chapter 2                   3
null  Alien                                 1
null  Aliens                                1
null  Alien 3                               1
null  Lord of the Rings                     5 (New ID generated)

This seems like a task for pg_trgm. It provides the % operator, which returns true if two strings are "close enough". You can tune what "close enough" means by changing pg_trgm.similarity_threshold. You can create an index (GiST or GIN, using the gist_trgm_ops or gin_trgm_ops operator class) to accelerate this operation. The higher similarity_threshold is set, the more acceleration you are likely to achieve.
If I take your first table as foo1 and all the distinct titles from your desired output as foo2, then this query gives reasonable results:
select *, foo1.title <-> foo2.title as distance
from foo2
left join foo1 on foo1.title % foo2.title;
title | id | title | distance
--------------------------------------+--------+-------------------+-----------
The Hunger Games: Mockingjay Part I | 2 | The Hunger Games | 0.5142857
The Hunger Games: Mockingjay Part II | 2 | The Hunger Games | 0.5277778
John Wick (2014) | 3 | John Wick | 0.3333333
John Wick Chapter 2 | 3 | John Wick | 0.5
Alien | 1 | Aliens | 0.375
Alien | 4 | Alien vs Predator | 0.6666666
Aliens | 1 | Aliens | 0
Alien 3 | 1 | Aliens | 0.5
Alien 3 | 4 | Alien vs Predator | 0.7
Lord of the Rings | (null) | (null) | (null)
If you want a single output row for each foo2 row, showing only the "best" match, then you would use a LEFT JOIN LATERAL:
select *, a.title <-> foo2.title as distance
from foo2
left join lateral (
    select *
    from foo1
    where foo1.title % foo2.title
    order by foo1.title <-> foo2.title
    limit 1
) a on true;
title | id | title | distance
--------------------------------------+--------+------------------+-----------
The Hunger Games: Mockingjay Part I | 2 | The Hunger Games | 0.5142857
The Hunger Games: Mockingjay Part II | 2 | The Hunger Games | 0.5277778
John Wick (2014) | 3 | John Wick | 0.3333333
John Wick Chapter 2 | 3 | John Wick | 0.5
Alien | 1 | Aliens | 0.375
Aliens | 1 | Aliens | 0
Alien 3 | 1 | Aliens | 0.5
Lord of the Rings | (null) | (null) | (null)
How to replace the NULL in the "id" column (where there are no close-enough matches) with a newly generated id is a separate question, and you should ask separate questions separately.
For any realistically sized datasets, it is unlikely you would be able to just blindly accept whatever a query like the above produces, at least not if you want high quality results. Rather, you could have the computer generate something like the above as recommendations, and then offer them (in a convenient interface) for a human to ratify, reject, or investigate further.
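
To get a feel for the numbers in the distance column, the trigram similarity can be approximated outside the database. The following is a minimal Python sketch, not pg_trgm itself: it assumes pg_trgm's documented behaviour of lowercasing, splitting on non-alphanumeric characters, and padding each word with two leading spaces and one trailing space. The names trigrams, similarity, best_match and dictionary are made up for this illustration, and 0.3 is used as the threshold because it is the default value of pg_trgm.similarity_threshold.

```python
import re

def trigrams(text):
    """Approximate pg_trgm's trigram set for a string (assumption: words are
    runs of ASCII letters/digits, padded with two spaces before, one after)."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(title, dictionary, threshold=0.3):
    """Return (id, known_title, distance) for the closest dictionary entry,
    or None if nothing reaches the threshold (like the NULL rows above)."""
    scored = [(similarity(title, known), id_, known)
              for id_, known in dictionary.items()]
    if not scored:
        return None
    score, id_, known = max(scored)
    if score < threshold:
        return None
    return id_, known, 1 - score

# Hypothetical in-memory copy of the dictionary table from the question.
dictionary = {1: "Aliens", 2: "The Hunger Games", 3: "John Wick", 4: "Alien vs Predator"}

for title in ["Alien", "Alien 3", "John Wick (2014)", "Lord of the Rings"]:
    print(title, "->", best_match(title, dictionary))
```

A title for which best_match returns None is a candidate for a new dictionary entry, subject to the human review suggested above.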

Related

One-to-one vs Many-to-many with duplicate entries

Say I have this very simple table with duplicate entries. Is the relationship between the A and B columns one-to-one or many-to-many?
A | B | C
1 | 2 | x
1 | 2 | y
Undoubtedly a simple question, but I can't find confirmation for this corner case... Thanks in advance!
EDIT: Changed the content of the table to avoid sticking to the math definition.

As I said in the comment, this is a one-to-many relation. To clarify, let's take a look at this example (I normalized your table into the tables below):
Suppose you have a table called continent, like this:
id | title
---|--------
1  | Asia
2  | Europe
3  | America
Now we have another table called country, like this:
id | title    | continent_Id
---|----------|--------------
1  | Norway   | 2
2  | Germany  | 2
3  | Canada   | 3
4  | Japan    | 1
We also have a state table with this structure:
id | stateTitle | country_Id
---|------------|--------------
1  | Munich     | 2
2  | Berlin     | 2
3  | Toronto    | 3
4  | Tokyo      | 4
5  | Osaka      | 4
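
The one-to-many shape of these tables can be checked with a small query. The sketch below is only an illustration: it loads the continent and country rows from the example into an in-memory SQLite database (the choice of SQLite and the variable name countries_per_continent are assumptions) and counts how many countries point at each continent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE continent (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE country (
    id INTEGER PRIMARY KEY,
    title TEXT,
    continent_Id INTEGER REFERENCES continent(id)
);
INSERT INTO continent VALUES (1, 'Asia'), (2, 'Europe'), (3, 'America');
INSERT INTO country VALUES
    (1, 'Norway', 2), (2, 'Germany', 2), (3, 'Canada', 3), (4, 'Japan', 1);
""")

# One continent row can be referenced by many country rows,
# while each country row references exactly one continent.
countries_per_continent = conn.execute("""
    SELECT c.title, COUNT(k.id)
    FROM continent AS c
    LEFT JOIN country AS k ON k.continent_Id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

print(countries_per_continent)
```

A count above 1 on the continent side, with each country holding a single continent_Id, is what makes the relation one-to-many.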

How can I sort a table by two columns in sisense?

I use Sisense Version: 20.21.6.10054 on Windows.
I need to sort a table widget in sisense by two columns, first by name, and second by number of behavior that person demonstrates.
The result should look like this:
id  first_name  last_name  behavior_NO  behavior_link
1   Ben         Smith      1            behavior_1
1   Ben         Smith      2            behavior_2
1   Ben         Smith      3            behavior_3
2   Sam         Johns      1            behavior_1
2   Sam         Johns      2            behavior_2
3   Martha      Star       1            behavior_1
3   Martha      Star       2            behavior_2
3   Martha      Star       3            behavior_3
3   Martha      Star       4            behavior_4
Now, when I sort by last_name, behavior_NO is not sorted in the correct order. Instead the table looks like this:
id  first_name  last_name  behavior_NO  behavior_link
1   Ben         Smith      1            behavior_1
1   Ben         Smith      3            behavior_3
1   Ben         Smith      2            behavior_2
2   Sam         Johns      2            behavior_2
2   Sam         Johns      1            behavior_1
3   Martha      Star       4            behavior_4
3   Martha      Star       2            behavior_2
3   Martha      Star       1            behavior_1
3   Martha      Star       3            behavior_3
Sisense does not allow sorting a table by two columns.
I tried to pivot the table, but one column contains hyperlinks, and in a pivot the hyperlinks are displayed as text (<a href="https://https://stackoverflow.com/ ) rather than as links.
Can anyone advise on how to solve this, either by sorting the table by two columns or by inserting a hyperlink in a pivot?
Thanks in advance.

Maybe you have already found a better way than the following, but yesterday I had a request to compute a rank and also order by three columns. I needed to order first by Target, then by Rank, and then by Sales, so that the pivot table would look like this:
Sales_Person | Target | Sales | Rank
Joe | 100% | 12 | 1
Chris | 100% | 12 | 1
Maria | 98% | 11 | 2
Peter | 97% | 10 | 3
Because the Sisense front end does not allow sorting by two or more columns, there is a built-in function called "ORDERING" for this.
You will find the function under "Other Functions" at the following link:
Function References Sisense
The only disadvantage is that using this function creates an additional column for the ordering, so in the end I obtained the following results:
Sales_Person | Target | Sales | Rank | Ordering
Joe | 100% | 12 | 1 | 0
Chris | 100% | 12 | 1 | 1
Maria | 98% | 11 | 2 | 2
Peter | 97% | 10 | 3 | 3
Also, keep in mind that all the columns involved should be dimensions.
By the way, the version I have is Sisense L2022.4.0.222.

SQL - Combining all children in one row

I am trying to move data from a database into a pandas data frame. I have data in multiple tables that I want to combine.
I'm using SQLAlchemy with a parent/children relationship between the tables.
I'm trying to understand how I'd do this in SQL before attempting it in SQLAlchemy.
I am using SQLite as the database.
parent_table
ID | Name | Class
1 | Joe | Paladin
2 | Ron | Mage
3 | Sara | Knight
child1
ID | distance | finished | parent_id
1 | 2 miles | yes | 1
2 | 3 miles | yes | 1
3 | 1 miles | yes | 1
4 | 10 miles | no | 2
child2
ID | Weight | height | parent_id
1 | 5 lbs | 5'3 | 1
2 | 10 lbs | 5'5 | 2
I want to write a query whose result has everything for each parent, e.g. Joe (id: 1), on one row:
1 | Joe | Paladin | 2 miles | yes | 3 miles | yes | 1 miles | yes | 5lbs | 5'3
2 | Ron | Mage | 10 miles | no | None | None | None | None | 10lbs | 5'5
3 | Sara | Knight | None | None | None | None | None | None | None | None
I'm guessing I need to do a join, but I'm confused by the fact that Ron has fewer child1 entries.
How do I construct a table that has as many columns as needed and fills the empty ones with None when some of the rows in parent_table don't have as many children?

Simply query each table separately and use UNION to combine the results:
SELECT Name, Class FROM parent_table WHERE ID = 1
UNION
SELECT distance, finished FROM child1 WHERE parent_id = 1
UNION
SELECT Weight, height FROM child2 WHERE parent_id = 1
This way you avoid the problem for Ron or anyone else who has no record in one of the tables. Note that the result has one row per matching record rather than a single wide row.

You can't have "as many columns as needed", because the number of child rows is variable and you can't have a variable number of columns. If you can settle on a fixed number of children (say 2), you can do:
CREATE TABLE
    "some_table"
AS
SELECT
    "parent_table"."ID",
    "parent_table"."Name",
    "parent_table"."Class",
    "child1_1"."finished" AS "2_miles",
    "child1_2"."finished" AS "3_miles"
FROM
    "parent_table",
    "child1" AS "child1_1",
    "child1" AS "child1_2"
WHERE
    "child1_1"."parent_id" = "parent_table"."id" AND
    "child1_2"."parent_id" = "parent_table"."id" AND
    "child1_1"."distance" = '2 miles' AND
    "child1_2"."distance" = '3 miles'
You can add columns from child2 in the same manner. The child subkeys (e.g. the data in child1.distance) will need to become column names. But for variable one-to-many relations, you need multiple tables. That is basically what the relational concept is all about.
For data analysis (which is what you seem to be doing) you will also need two datasets (such as tables), because the two measurements (sample sets), i.e. distances and weights, are not correlated. Think of what a "sample" is: the result of a measurement. It can't be "entity 1 completed 2 miles and 4 lbs", because "2 miles and 4 lbs" is not a measurable event. So you have two distinct samples: "entity 1 completed 2 miles" and "entity 1 completed 4 lbs". (Or are the data in child2 one-to-one properties of the entity in parent_table? You should explain the meaning of the data and what you're trying to achieve in more detail.)
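
If a single text column holding all child1 entries is acceptable instead of a variable number of columns, one possible alternative is to aggregate the children per parent. This is a sketch under that assumption, not what either answer above proposes: it uses SQLite's group_concat in a correlated subquery plus a LEFT JOIN for child2, so parents without children still appear with None. The column alias runs and the variable rows are made up for the example, and the order of items inside the concatenated string is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent_table (ID INTEGER PRIMARY KEY, Name TEXT, Class TEXT);
CREATE TABLE child1 (ID INTEGER PRIMARY KEY, distance TEXT, finished TEXT, parent_id INTEGER);
CREATE TABLE child2 (ID INTEGER PRIMARY KEY, Weight TEXT, height TEXT, parent_id INTEGER);
""")
conn.executemany("INSERT INTO parent_table VALUES (?, ?, ?)",
                 [(1, "Joe", "Paladin"), (2, "Ron", "Mage"), (3, "Sara", "Knight")])
conn.executemany("INSERT INTO child1 VALUES (?, ?, ?, ?)",
                 [(1, "2 miles", "yes", 1), (2, "3 miles", "yes", 1),
                  (3, "1 miles", "yes", 1), (4, "10 miles", "no", 2)])
conn.executemany("INSERT INTO child2 VALUES (?, ?, ?, ?)",
                 [(1, "5 lbs", "5'3", 1), (2, "10 lbs", "5'5", 2)])

# One row per parent: child1 rows are collapsed into a single string,
# child2 (at most one row per parent in the sample data) is left-joined.
rows = conn.execute("""
    SELECT p.ID, p.Name, p.Class,
           (SELECT group_concat(c.distance || ' (' || c.finished || ')', ', ')
              FROM child1 AS c
             WHERE c.parent_id = p.ID) AS runs,
           c2.Weight, c2.height
    FROM parent_table AS p
    LEFT JOIN child2 AS c2 ON c2.parent_id = p.ID
    ORDER BY p.ID
""").fetchall()

for row in rows:
    print(row)
```

Aggregating child1 in a subquery rather than joining both child tables directly avoids multiplying rows when a parent has several entries in each.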

Recursive SQL that gets the first instance of a value up a hierarchy

I have to do this in SQL.
I have a table called 'locations'. It contains a list of locations ranging from houses, to streets, to cities all the way up to continents.
locationId | name | desiredValue
1 | Wimbledon |
2 | Peckham |
3 | London |
4 | UK |
5 | France | 123
6 | Europe | 456
7 | Australia |
8 | Paris |
I have a second table called 'links' which contains the link of locations, and their relation
Location1 | Location2 | Linktype
3 | 1 | 5
3 | 2 | 5
4 | 3 | 5
6 | 4 | 5
5 | 8 | 5
Linktype 5 indicates that location2 is situated 'in' location1. In the example above, locationId 1 (Wimbledon) is located 'in' locationId 3 (London). LocationId 3 (London) is located 'in' locationId 4 (UK), and so on.
The linktype just describes this 'in' relationship - the link table contains other relations as well which are not pertinent to this question, I just mention it in case it needs to be in a where clause.
For a given location, I want to get the first instance in its location hierarchy that has a 'desiredValue'
For example:
if I was interested in Peckham, I'd like to see that Peckham has no value, that London has no value, that UK has no value but that Europe does (456).
If I was interested in London, I'd see that it has no value, nor does the UK, but that Europe does (456)
If I was interested in Europe, I'd see that it has a value (456)
If I was interested in Paris, I'd see that it has no value, but France does (123)
I know I should probably be using recursive CTEs for this, but I'm stumped. Any help would be gratefully received!
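
One way the recursive CTE could look is sketched below. It is an illustration under assumptions, not a tested answer for any particular database: it runs the query against an in-memory SQLite copy of the sample data, treats the empty desiredValue cells as NULL, and uses a made-up helper name first_value. The CTE starts at the given location, follows linktype 5 links from Location2 up to Location1, stops climbing once a value is found, and returns the nearest row that has one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (locationId INTEGER PRIMARY KEY, name TEXT, desiredValue INTEGER);
CREATE TABLE links (Location1 INTEGER, Location2 INTEGER, Linktype INTEGER);
INSERT INTO locations VALUES
    (1, 'Wimbledon', NULL), (2, 'Peckham', NULL), (3, 'London', NULL),
    (4, 'UK', NULL), (5, 'France', 123), (6, 'Europe', 456),
    (7, 'Australia', NULL), (8, 'Paris', NULL);
INSERT INTO links VALUES (3, 1, 5), (3, 2, 5), (4, 3, 5), (6, 4, 5), (5, 8, 5);
""")

QUERY = """
WITH RECURSIVE chain(locationId, name, desiredValue, depth) AS (
    -- anchor: the location we are interested in
    SELECT locationId, name, desiredValue, 0
    FROM locations
    WHERE name = ?
    UNION ALL
    -- step: move to the containing location while no value has been found
    SELECT parent.locationId, parent.name, parent.desiredValue, chain.depth + 1
    FROM chain
    JOIN links ON links.Location2 = chain.locationId AND links.Linktype = 5
    JOIN locations AS parent ON parent.locationId = links.Location1
    WHERE chain.desiredValue IS NULL
)
SELECT name, desiredValue
FROM chain
WHERE desiredValue IS NOT NULL
ORDER BY depth
LIMIT 1
"""

def first_value(connection, location_name):
    """Return (name, desiredValue) of the nearest ancestor with a value, or None."""
    return connection.execute(QUERY, (location_name,)).fetchone()

for place in ["Peckham", "London", "Europe", "Paris", "Australia"]:
    print(place, "->", first_value(conn, place))
```

The same WITH RECURSIVE shape should carry over to other databases that support recursive CTEs, though the exact syntax may need adjusting.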

Generate multiple rows for a binary number field?

Example data rows:
| ID | First Name | Last Name | Federal Race Code |
| 101 | Bob | Miller | 01010 |
| 102 | Daniel | Smith | 00011 |
The "Federal Race Code" field contains binary data, and each "1" is used to determine if a particular check box is set on a particular web form. E.g., the first bit is American Indian, second bit is Asian, third bit is African American, fourth is Pacific Islander, and the fifth is White.
I need to generate a separate row for each bit that is set to "1". So, given the example above, I need to generate output that looks like this:
| ID | First Name | Last Name | Mapped Race Name |
| 101 | Bob | Miller | Asian |
| 101 | Bob | Miller | African American |
| 102 | Daniel | Smith | Pacific Islander |
| 102 | Daniel | Smith | White |
Any tips or ideas on how to go about this?

You can do it with either five queries combined with UNION or one UNPIVOT clause.
In either case you should start by splitting that binary logic into five columns (this assumes federal_race_code is stored as an integer bit mask):
SELECT *,
       CASE WHEN federal_race_code & 16 = 16 THEN 1 ELSE 0 END AS NativeAmerican,
       ..
       CASE WHEN federal_race_code & 1 = 1 THEN 1 ELSE 0 END AS White
FROM myTable
Then UNION:
SELECT *, 'Native American' AS Race
FROM (<subquery>)
WHERE NativeAmerican = 1
UNION
...
UNION
SELECT *, 'White' AS Race
FROM (<subquery>)
WHERE White = 1
If you are on Oracle or SQL Server, use a CTE for the subquery.
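
If the expansion can happen outside the database, the same idea is short to express in application code. The sketch below assumes the code is stored as a five-character string of 0s and 1s, as it appears in the question, with the first character meaning American Indian and the last meaning White. The names RACE_NAMES, expand_rows, people and expanded are made up for the illustration.

```python
# Position i of the code string corresponds to RACE_NAMES[i] (order taken
# from the question: first bit American Indian ... fifth bit White).
RACE_NAMES = ["American Indian", "Asian", "African American",
              "Pacific Islander", "White"]

def expand_rows(rows):
    """Yield one output row per '1' character in each row's race code."""
    for person_id, first_name, last_name, code in rows:
        for bit, race_name in zip(code, RACE_NAMES):
            if bit == "1":
                yield (person_id, first_name, last_name, race_name)

people = [
    (101, "Bob", "Miller", "01010"),
    (102, "Daniel", "Smith", "00011"),
]

expanded = list(expand_rows(people))
for row in expanded:
    print(row)
```

Note that mapping the bits in the order the question lists them yields Asian and Pacific Islander for 01010, whereas the expected output in the question shows Asian and African American for that row, so the intended bit order is worth double-checking.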
