SQL query for parsing a body of text to extract a string from a list - sql

I need help on an SQL query to perform the following:
I have a table with a list of possible string values for products.
I have a second table with free form text in which this product may be mentioned. Is there any way an SQL query can extract the string if it is present in the 1st table?
I read on another SO post about CHARINDEX and SUBSTRING. Will this be efficient in this scenario? How can I apply this in my use case?
An example for my scenario is this:
My table, PRODUCTS, has the following format,
Product
XXX
YYY
ZZZ
DDD
The other table has a column containing a large amount of text in which the product will be mentioned, like:
Record Number | User Review
1 | I like XXX for its versatility but YYY is better.
2 | XXX is a horrible product. DO not buy.
3 | YYY and DDD are best in class. Many do not know how to use it.
Now I want to extract the product names using a query in this manner.
Record Number | Product in Review
1 | XXX
1 | YYY
2 | XXX
3 | YYY
3 | DDD
Thank you in advance for your time and help.

This should work but it will be slow on big tables:
Select p.id, f.id, p.name from product p
Inner Join freeform f on f.text like '%'+p.name+'%'
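To make the join above concrete, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. SQLite is only a stand-in here (it concatenates with `||`, where SQL Server uses `+`), and the table and column names `products`, `reviews`, `record_number` and `user_review` are assumptions taken from the question's sample data, not a real schema.

```python
import sqlite3


def products_in_reviews(conn):
    # Join each review to every product whose name occurs in the review text.
    # The leading '%' wildcard prevents index use, so this scans every pair.
    return conn.execute(
        """
        SELECT r.record_number, p.product
        FROM reviews r
        JOIN products p ON r.user_review LIKE '%' || p.product || '%'
        ORDER BY r.record_number, p.rowid
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product TEXT)")
conn.execute("CREATE TABLE reviews (record_number INTEGER, user_review TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("XXX",), ("YYY",), ("ZZZ",), ("DDD",)],
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        (1, "I like XXX for its versatility but YYY is better."),
        (2, "XXX is a horrible product. DO not buy."),
        (3, "YYY and DDD are best in class. Many do not know how to use it."),
    ],
)

matches = products_in_reviews(conn)
for row in matches:
    print(row)
```

Note that a plain substring match will also hit product names embedded in longer words, so real data may need word-boundary handling on top of this.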

Related

Using OpenRefine to create a mapping table from two other tables

I have the following use case that OpenRefine seems to be a good candidate to solve. I have an existing, "dirty" product table in my database that looks like this:
id name
51 Product A
52 product-a
53 product B
54 productb
55 produtc
56 productc
I have a new, "clean" product table that looks like this:
id name
1 Product A
2 Product B
3 Product C
I'd like to use OpenRefine's clustering to generate a mapping file, to help me map products from the old table to the new table:
id name old_id
1 Product A 51
1 Product A 52
2 Product B 53
2 Product B 54
3 Product C 55
3 Product C 56
But I can't quite get OpenRefine to do what I want. Any advice for how to achieve this?
As was already pointed out, there is no direct way to achieve this, but with the help of support tables and the cross function, you can get the desired result:
you take the column "name" from the dirty table and clean table, and combine them. Don't worry about the ids at this point.
import them into OpenRefine (e.g. as project "product names")
duplicate the column "name" (the only column so far) and name the new one "name_new".
Cluster the column "name_new" and replace all of the old names with the correct new ones. Some manual adjustments might be required at this point.
Your result should now look like this:
name name_new
Product A Product A
product-a Product A
product B Product B
productb Product B
produtc Product C
productc Product C
Product A Product A
Product B Product B
Product C Product C
import the dirty table as "products" and the clean table as "products clean".
in the project "products" transform the column "name" using
value.cross("product names","name").cells["name_new"].value[0]
rename the column "id" to "old_id"
add a new column based on "name" using
value.cross("products clean","name").cells["id"].value[0]
and save it as "id". The table "products" has now the desired structure.
I hope this helps.
Clustering function is limited to a single column to find similar strings within that column.
OpenRefine doesn't yet have string similarity functions across two or more tables or projects (fuzzy joins) of the kind your use case needs, so you would have to use other tools for this. A common tool that I've seen folks use for fuzzy joining, and express satisfaction with, is MS Power BI (Desktop is free but has limits on relationships and exporting; the Pro version is only $10 a month and you can cancel anytime). If you want something completely free, a few R packages do this, one of which is fuzzyjoin: https://www.rdocumentation.org/packages/fuzzyjoin/versions/0.1.4
In OpenRefine, we totally want to allow Fuzzy Joins across Projects/datasets in the future and it's on our issue list, but we just haven't had the funding to implement this along with tons of other features we know users would like to see.
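For readers who want to see what such a fuzzy join amounts to, here is a rough sketch in Python using only the standard library's `difflib`. This is not how OpenRefine, Power BI or fuzzyjoin work internally; the helper names `normalize` and `fuzzy_map` and the 0.85 similarity cutoff are assumptions made for this illustration, and the result would still need a manual review just like the clustering step above.

```python
import difflib
import re


def normalize(name):
    # Lowercase and drop everything except letters and digits,
    # so "product-a" and "Product A" share the key "producta".
    return re.sub(r"[^a-z0-9]", "", name.lower())


def fuzzy_map(dirty, clean, cutoff=0.85):
    # dirty and clean are lists of (id, name) tuples.
    # Returns (clean_id, clean_name, old_id) for every dirty row that
    # has a sufficiently similar clean name.
    by_key = {normalize(name): (cid, name) for cid, name in clean}
    mapping = []
    for old_id, name in dirty:
        hits = difflib.get_close_matches(
            normalize(name), list(by_key), n=1, cutoff=cutoff
        )
        if hits:
            clean_id, clean_name = by_key[hits[0]]
            mapping.append((clean_id, clean_name, old_id))
    return mapping


dirty_products = [
    (51, "Product A"), (52, "product-a"), (53, "product B"),
    (54, "productb"), (55, "produtc"), (56, "productc"),
]
clean_products = [(1, "Product A"), (2, "Product B"), (3, "Product C")]

mapping = fuzzy_map(dirty_products, clean_products)
for row in mapping:
    print(row)
```

Rows with no match above the cutoff are simply left out, which makes it easy to spot the ones that need manual attention.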

Transposing a field into fields

I have a query that produces a 2 field result: Email and Interest.
The result is millions of records. But there are about 100 distinct Interests.
I would like to run the query to produce a result that is 101 fields wide like this:
Email | Books | Cats | Dogs | ETC
Where the metric is the count of each.
With my knowledge of SQL thus far I'd have to use CASE WHEN. But I'd have to write 100 lines of code.
Is there a better way?
You could use the PIVOT statement, but it sounds like Teradata does not support that. PIVOT would require typing in all the column names as well, so I don't think you can avoid that.
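One way around typing 100 CASE WHEN lines by hand is to generate them from the distinct interests. The sketch below does that in Python and runs the generated query against an in-memory SQLite database as a stand-in; the table name `email_interest` and the helper `build_pivot_sql` are assumptions for illustration, and the generated SQL would need checking against your own dialect.

```python
import sqlite3


def build_pivot_sql(interests):
    # Build one SUM(CASE ...) column per interest value.
    columns = []
    for interest in interests:
        literal = interest.replace("'", "''")  # escape for a string literal
        alias = interest.replace('"', '""')    # escape for a quoted identifier
        columns.append(
            f"SUM(CASE WHEN interest = '{literal}' THEN 1 ELSE 0 END) AS \"{alias}\""
        )
    return (
        "SELECT email,\n  " + ",\n  ".join(columns)
        + "\nFROM email_interest\nGROUP BY email\nORDER BY email"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_interest (email TEXT, interest TEXT)")
conn.executemany(
    "INSERT INTO email_interest VALUES (?, ?)",
    [
        ("a@example.com", "Books"), ("a@example.com", "Books"),
        ("a@example.com", "Cats"), ("b@example.com", "Dogs"),
    ],
)

# Read the distinct interests first, then generate the wide query from them.
interests = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT interest FROM email_interest ORDER BY interest"
    )
]
pivot_sql = build_pivot_sql(interests)
cursor = conn.execute(pivot_sql)
header = [col[0] for col in cursor.description]
pivot_rows = cursor.fetchall()
print(header)
for row in pivot_rows:
    print(row)
```

The generated statement is still long, but it is produced once by a script instead of being maintained by hand.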

Exclude entire row based on values from another query

I am using MS Access and I have a rather complex situation.
I have Respondents who are linked to varying numbers of different Companies via 2 connecting tables. I want to be able to create a list of distinct customers which excludes any customer associated with Company X.
Here is a pic of the relationships that are involved with the query.
And here is an example of what I'm trying to achieve.
RespondentRef | Respondent Name
8 Joe Bloggs
.
RespondentRef | GroupRef
8 2
.
GroupRef | CompanyRef
2 10
.
CompanyRef | CompanyName
10 Ball of String
I want a query where I enter in 'Ball of String' for the company name, and then it produces a list of all the Respondents (taken from Tbl_Respondent) which completely excludes Respondent 8 (as he is linked to CompanyName: Ball of String).
Tbl_Respondent
RespondentRef | Respondent Name
... ...
7 Bob Carlyle
9 Anton Boyle
I have tried many combinations of subqueries with <> and NOT EXISTS and NOT IN and nothing seems to work. I suspect the way these tables are linked may have something to do with it.
Any help you could offer would be very much appreciated. If you have any questions let me know. (I have made best efforts, but please accept my apologies for any formatting conventions or etiquette faux-pas I may have committed.)
Thank you very much.
EDIT:
My formatted version of Frazz's code is still resulting in a syntax error. Any help would be appreciated.
SELECT *
FROM Tbl_Respondent
WHERE RespondentRef NOT IN (
SELECT tbl_Group_Details_Respondents.RespondentRef
FROM tbl_Group_Details_Respondents
JOIN tbl_Group_Details ON tbl_Group_Details.GroupReference = tbl_Group_Details_Respondents.GroupReference
JOIN tbl_Company_Details ON tbl_Company_Details.CompanyReference = tbl_Group_Details.CompanyReference
WHERE tbl_Company_Details.CompanyName = "Ball of String"
)
This should do what you need (Access requires INNER JOIN rather than a bare JOIN, and parentheses around the first join when chaining several):
SELECT *
FROM Tbl_Respondent
WHERE RespondentRef NOT IN (
SELECT gdr.RespondentRef
FROM (Tbl_Group_Details_Respondent AS gdr
INNER JOIN Tbl_Group_Details AS gd ON gd.GroupRef = gdr.GroupRef)
INNER JOIN Tbl_Company_Details AS cd ON cd.CompanyRef = gd.CompanyRef
WHERE cd.CompanyName = 'Ball of String'
)
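To check the exclusion logic independently of Access syntax, here is a small sketch that runs the same NOT IN pattern against an in-memory SQLite database from Python. The table names follow the answer above, the second company "Other Co" and group 3 are made-up rows for the demo, and SQLite accepts a plain JOIN where Access would not.

```python
import sqlite3


def respondents_excluding(conn, company_name):
    # Return every respondent who is NOT linked, through any group,
    # to the named company.
    return conn.execute(
        """
        SELECT RespondentRef, RespondentName
        FROM Tbl_Respondent
        WHERE RespondentRef NOT IN (
            SELECT gdr.RespondentRef
            FROM Tbl_Group_Details_Respondent gdr
            JOIN Tbl_Group_Details gd ON gd.GroupRef = gdr.GroupRef
            JOIN Tbl_Company_Details cd ON cd.CompanyRef = gd.CompanyRef
            WHERE cd.CompanyName = ?
        )
        ORDER BY RespondentRef
        """,
        (company_name,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Tbl_Respondent (RespondentRef INTEGER, RespondentName TEXT);
    CREATE TABLE Tbl_Group_Details_Respondent (RespondentRef INTEGER, GroupRef INTEGER);
    CREATE TABLE Tbl_Group_Details (GroupRef INTEGER, CompanyRef INTEGER);
    CREATE TABLE Tbl_Company_Details (CompanyRef INTEGER, CompanyName TEXT);
    INSERT INTO Tbl_Respondent VALUES (7, 'Bob Carlyle'), (8, 'Joe Bloggs'), (9, 'Anton Boyle');
    INSERT INTO Tbl_Group_Details_Respondent VALUES (8, 2), (9, 3);
    INSERT INTO Tbl_Group_Details VALUES (2, 10), (3, 11);
    INSERT INTO Tbl_Company_Details VALUES (10, 'Ball of String'), (11, 'Other Co');
    """
)

remaining = respondents_excluding(conn, "Ball of String")
print(remaining)
```

One thing to watch with NOT IN: if the subquery can return a NULL RespondentRef, the outer query returns no rows at all, so NOT EXISTS may be the safer form on messy data.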

Merge Multiple Data from Rows/Records into One Row w/ Comma Separated Fields

If I were to query our ORDERS table, I might enter the following:
SELECT * FROM ORDERS
WHERE ITEM_NAME = 'Fancy Pants'
In the results for this query, I might get the following:
----------------------------------------------------------------------
ORDER_ID WAIST First_Name Email
----------------------------------------------------------------------
001 32 Jason j-diddy[at]some-thing.com
005 28 Pip pirrip[at]british-mail.com
007 28 HAL9000 olhal[at]hot-mail.com
Now, I'm also wanting to pull information from a different table:
SELECT * FROM PRODUCTS
WHERE ITEM_NAME = 'Fancy Pants'
------------------------------------------
PRODUCT_ID Product Prod_Desc
------------------------------------------
008 Fancy Pants Really fancy.
In the end, however, I'm actually wanting to condense these records into one row via SQL query:
-----------------------------------------------------------------------------
PRODUCT ORDER_Merged First_Name_Merged Email_Merged
-----------------------------------------------------------------------------
Fancy Pants 001,005,007 Jason,Pip,Hal9000 j-di[...].com, pirrip[...].com
Anyway, that's how it would look. What I can't figure out is what that "merge" query would look like.
My searches here unfortunately keep leading me to results for PHP. I have found a couple of results re: merging into CSV rows via SQL but I don't think they'll work in my scenario.
Any insight would, as always, be greatly appreciated.
UPDATE:
Ah, turns out the STUFF and FOR XML functions were exactly what I needed. Thanks all!!
Select
A.name,
stuff((
select ',' + B.address
from Addresses B
WHERE A.id=B.name_id
for xml path('')),1,1,'')
From Names A
This is an excellent article on various approaches to group concatenation, with the pros and cons of each.
http://www.simple-talk.com/sql/t-sql-programming/concatenating-row-values-in-transact-sql/
Personally however, I like the Coalesce approach as I demonstrate here:
https://dba.stackexchange.com/a/2615/1607
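For comparison, here is what the merged row from the question looks like when built with an aggregate function instead of STUFF and FOR XML. The sketch uses SQLite's `group_concat` from Python purely as a runnable stand-in (SQL Server 2017 and later offer STRING_AGG for the same job); the column layout of `orders` and `products` is an assumption based on the question's sample output.

```python
import sqlite3


def merged_orders(conn, item_name):
    # Collapse all orders for one product into a single row of
    # comma-separated lists. SQLite does not guarantee the order of
    # concatenated values unless the input is explicitly ordered.
    return conn.execute(
        """
        SELECT p.product,
               group_concat(o.order_id, ',')   AS order_merged,
               group_concat(o.first_name, ',') AS first_name_merged
        FROM products p
        JOIN orders o ON o.item_name = p.product
        WHERE p.product = ?
        GROUP BY p.product
        """,
        (item_name,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (product_id TEXT, product TEXT);
    CREATE TABLE orders (order_id TEXT, item_name TEXT, first_name TEXT);
    INSERT INTO products VALUES ('008', 'Fancy Pants');
    INSERT INTO orders VALUES
        ('001', 'Fancy Pants', 'Jason'),
        ('003', 'Plain Socks', 'Estella'),
        ('005', 'Fancy Pants', 'Pip'),
        ('007', 'Fancy Pants', 'HAL9000');
    """
)

merged = merged_orders(conn, "Fancy Pants")
print(merged)
```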

SQL Query with multiple values in one column

I've been beating my head on the desk trying to figure this one out. I have a table that stores job information, and reasons for a job not being completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending, that stores what those values actually represent. 01=Not enough info, 02=Not enough time, 03=Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reason. I can break out the first value, but can't figure out how to do the second. What my limited skills have so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find ( ',' || fld || ',', ',02,' ) > 0
assuming your SQL dialect has a string search function (find in this case, but I think charindex for SQLServer).
This will ensure all sub-columns begin and end with a comma (comma plus field plus comma) and look for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
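Here is a small runnable sketch of that comma-wrapping trick, using SQLite's `instr` from Python in place of the generic `find` (SQL Server would use CHARINDEX with the arguments swapped). The sample rows, including the decoy values 102 and 020, are made up for the demo.

```python
import sqlite3


def jobs_with_reason(conn, reason):
    # Wrap both the stored list and the wanted value in commas so that
    # '02' matches only a whole item, never part of '102' or '020'.
    rows = conn.execute(
        """
        SELECT job_number
        FROM jobid
        WHERE instr(',' || pendinginfo || ',', ',' || ? || ',') > 0
        ORDER BY job_number
        """,
        (reason,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobid (job_number INTEGER, pendinginfo TEXT)")
conn.executemany(
    "INSERT INTO jobid VALUES (?, ?)",
    [(1, "01,02"), (2, "02"), (3, "102,03"), (4, "03,020")],
)

jobs_02 = jobs_with_reason(conn, "02")
print(jobs_02)
```

Because the expression is computed per row, no index can help, which is the performance cost this answer warns about.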
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is only on row insert/update, not on every single select, amortising that cost efficiently.
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
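The trigger workaround described above might look roughly like this. It is a sketch in SQLite (driven from Python) rather than the asker's actual database, the column names `reason1` and `reason2` and the trigger names are hypothetical, and it assumes every reason code is exactly two characters with at most two codes per row.

```python
import sqlite3

# reason1 and reason2 are the two new, hypothetical columns; the triggers
# refill them whenever pendinginfo is inserted or changed.
SCHEMA = """
CREATE TABLE jobid (
    job_number  INTEGER PRIMARY KEY,
    pendinginfo TEXT,
    reason1     TEXT,
    reason2     TEXT
);
CREATE TRIGGER jobid_split_insert AFTER INSERT ON jobid
BEGIN
    UPDATE jobid
    SET reason1 = NULLIF(substr(NEW.pendinginfo, 1, 2), ''),
        reason2 = NULLIF(substr(NEW.pendinginfo, 4, 2), '')
    WHERE job_number = NEW.job_number;
END;
CREATE TRIGGER jobid_split_update AFTER UPDATE OF pendinginfo ON jobid
BEGIN
    UPDATE jobid
    SET reason1 = NULLIF(substr(NEW.pendinginfo, 1, 2), ''),
        reason2 = NULLIF(substr(NEW.pendinginfo, 4, 2), '')
    WHERE job_number = NEW.job_number;
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# The application keeps writing only the original comma-separated column.
conn.execute("INSERT INTO jobid (job_number, pendinginfo) VALUES (1, '01,02')")
conn.execute("INSERT INTO jobid (job_number, pendinginfo) VALUES (2, '03')")
conn.execute("UPDATE jobid SET pendinginfo = '02,03' WHERE job_number = 2")

split_rows = conn.execute(
    "SELECT job_number, reason1, reason2 FROM jobid ORDER BY job_number"
).fetchall()

# Selects can now compare plain columns instead of parsing the string.
jobs_pending_02 = [
    row[0]
    for row in conn.execute(
        "SELECT job_number FROM jobid "
        "WHERE reason1 = '02' OR reason2 = '02' ORDER BY job_number"
    )
]
print(split_rows, jobs_pending_02)
```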
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach with a support table, like this:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOINs or subqueries.
If you need backward compatibility in your software, you can add a view to achieve this.
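As an illustration of how simple the query becomes with the three tables above, here is a sketch using an in-memory SQLite database from Python. The table names `jobs`, `pending` and `job_pending` mirror the layout shown; the row 03 = Waiting Review comes from the question rather than the tables above, so treat it as an assumption.

```python
import sqlite3


def pending_reasons(conn):
    # One output row per (job, reason) pair, resolved through the
    # JOB_PENDING bridging table with ordinary joins.
    return conn.execute(
        """
        SELECT j.jobID, j.userID, p.pendingID, p.pendingText
        FROM jobs j
        JOIN job_pending jp ON jp.jobID = j.jobID
        JOIN pending p ON p.pendingID = jp.pendingID
        ORDER BY j.jobID, p.pendingID
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (jobID INTEGER PRIMARY KEY, userID TEXT);
    CREATE TABLE pending (pendingID TEXT PRIMARY KEY, pendingText TEXT);
    CREATE TABLE job_pending (jobID INTEGER, pendingID TEXT);
    INSERT INTO jobs VALUES (1, 'user13'), (2, 'user32'), (3, 'user44');
    INSERT INTO pending VALUES
        ('01', 'Not Enough Info'),
        ('02', 'Not Enough Time'),
        ('03', 'Waiting Review');
    INSERT INTO job_pending VALUES
        (1, '01'), (1, '02'), (2, '01'), (3, '03'), (3, '01');
    """
)

reasons = pending_reasons(conn)
for row in reasons:
    print(row)
```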
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write 2 procedures in my site code, not SQL code.
One procedure converts the table field (eventTypeIds) value like "3,4,15,6" into a ViewState array, so I can use it anywhere in code.
The other procedure does the opposite: it collects any options you checked and converts it in
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
-- Generates 1..N from the rows of JobId, so JobId needs at least as many
-- rows as the longest PendingInfo has characters
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
-- One row per comma-separated item: N marks the position where an item starts
SELECT Job_Number, User_Assigned, SUBSTRING(PendingInfo, Numbers.N, CHARINDEX(',', PendingInfo + ',', Numbers.N) - Numbers.N) AS Pending_Num
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PendingInfo) + 1
AND SUBSTRING(',' + PendingInfo, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.Pending_Num = Pending.Pending_Num
The basic idea is that you have to multiply each row as many times as there are Pending_Nums in it. Then, extract the appropriate part of the string.
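If you want to experiment with that "one row per item" idea without a SQL Server instance, here is a sketch that does the split in SQLite from Python. It swaps the numbers-table technique for a recursive CTE, because SQLite's `instr` has no start-position argument; the table and column names follow the question, and the sample rows are made up.

```python
import sqlite3

# Recursive split: each step peels the first item off "rest" until
# nothing is left, producing one row per comma-separated value.
SPLIT_SQL = """
WITH RECURSIVE split(job_number, user_assigned, item, rest) AS (
    SELECT job_number, user_assigned, '', pendinginfo || ','
    FROM jobid
    UNION ALL
    SELECT job_number,
           user_assigned,
           substr(rest, 1, instr(rest, ',') - 1),
           substr(rest, instr(rest, ',') + 1)
    FROM split
    WHERE rest <> ''
)
SELECT s.job_number, s.user_assigned, s.item, p.pendingword
FROM split s
JOIN pending p ON p.pending_num = s.item
ORDER BY s.job_number, s.item
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobid (job_number INTEGER, user_assigned TEXT, pendinginfo TEXT);
    CREATE TABLE pending (pending_num TEXT, pendingword TEXT);
    INSERT INTO jobid VALUES (1, 'user1', '01,02'), (2, 'user2', '03');
    INSERT INTO pending VALUES
        ('01', 'Not Enough Info'),
        ('02', 'Not Enough Time'),
        ('03', 'Waiting Review');
    """
)

split_result = conn.execute(SPLIT_SQL).fetchall()
for row in split_result:
    print(row)
```

The seed row carries an empty item, which the join to `pending` discards because no reason code is empty.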
While I agree with the DBA perspective of not storing multiple values in a single field, it is doable, as below, and can be practical for application logic and for some performance issues. Let's say you have 10000 user groups, each having 1000 members on average. You may want to have a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,') each number being a memberID, all separated with a comma. Add also a comma at the start and end of the data. Then your select would use like '%,someID,%'.
If you cannot change your data ('01,02,03' or similar) and, let's say, you want rows containing 01, you can still use " select ... WHERE col = '01' OR col LIKE '01,%' OR col LIKE '%,01' OR col LIKE '%,01,%' " (col being your column name), which will ensure it matches whether the value stands alone or sits at the start, end or inside, while avoiding similar numbers (e.g. 101).