Splitting pairs from Column Data in SQL Server

Splitting pairs from Column Data in SQL Server - sql

I have a solution in Java for a simply defined problem, but I want to improve the time needed to execute the data handling. The problem is to take a series of words held in a column on a relational database and split the words into pairs which are then insert into a dictionary of pairs. The pairs themselves relate to a product identified by partid.
Hence the Part table has
PartID (int), PartDesc (nvarchar)
and the dictionary has
DictID (int), WordPair (nvarchar).
The logic is therefore:
insert into DictPair (wordpair, partid)
select wordpairs, partid from Part
A wordpair is defined as two adjacent words and hence words will be repeated, eg
red car with 4 wheel drive
will pair to
{red, car},{car, with}, {with,4}, {4, wheel}, {wheel, drive}
Hence the final dictionary for say partid 45 will have (partid, dictionarypair):
45, red car
45, car with
45, with 4
45, 4 wheel
45, wheel drive
This is used in product classification and hence word order matters (but pair order does not matter).
Has anyone any ideas on how to solve this? I was thinking in terms of stored procedures, and using some kind of parsing. I want the entire solution to be implemented in SQL for efficiency reasons.

Basically, find a split() function on the web that returns the position of a word in a string.
Then do:
select s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
from parts p outer apply
dbo.split(p.partDesc, ' ') as s(word, pos);
This will put NULL for the last pair, which you don't seem to want. So:
insert into DictPair (wordpair, partid)
select word + ' ' nextword, partid,
from (select p.*, s.word, lead(s.word) over (partition by p.partId order by s.pos) as nextword
from parts p outer apply
dbo.split(p.partDesc, ' ') as s(word, pos)
)
where nextword is not null;
Here are some split functions, provided by Googling "SQL Server split". And another. And from StackOverflow. And there are many more

Related

Identify the last occurance of a subsring in a string, where the substring is from a table. Teradata

I have the following problem:
I need to identify the last occurrence of any sub-string given in table A, and return that given value in return in the select statement of another statement. This is a bit convoluted, but here is the code:
SELECT TRIM(COUNTRY_CODE)
FROM (
SELECT TOP 1 POSITION( PHRASE IN MY_STRING) AS PHRASE_LOCATION, CODE
FROM REFERENCE_TABLE -- Where the country list is located
WHERE PHRASE_LOCATION > 0 -- To return NULL if there is no matches
ORDER BY 1 DESC -- To get the last one
) t1
This works when run by it self, but i have large problems getting it to work as part of another queries' select. I need "MY_STRING" to come from a higher level in the nested select three. The reasons for this is how the system is designed on a higher level.
In other words i need the following:
PHRASE is coming from a table that have a phrases and a code associated
MY_STRING is used in the higher level select and i need to associate a code with it, based on the last occurring phrase
Number of different phrases > 400 so no hard coding :(
Number of different "MY_STRING" > 1 000 000 / day
So far i tried what you can see above, but due to the constraints of the system, i cannot be to creative.
Example Phrases: "New York", "London", "Oslo"
Example Codes: "US", "UK, "NO"
Example Strings: "London House, Something street, New York"; "Some street x, 0120, OSL0".
Desired Outcomes: "US"; "NO"

This will result in a product join, i.e. use a lot of CPU:
SELECT MY_STRING
-- using INSTR searching the last occurance instead of POSITION if the same PHRASE might occur multiple times
-- INSTR is case sensitive -> must use LOWER
,Instr(Lower(MY_STRING), Lower(PHRASE), -1, 1) AS PHRASE_LOCATION
,CODE
,PHRASE
FROM table_with_MY_STRING
LEFT JOIN REFERENCE_TABLE -- to return NULL if no match
ON PHRASE_LOCATION > 0
QUALIFY
Row_Number() -- return last match
Over (PARTITION BY MY_STRING
ORDER BY PHRASE_LOCATION DESC) = 1
If this is not efficient enough another possible solution might utilize STRTOK_SPLIT_TO_TABLE/REGEXP_SPLIT_TO_TABLE: split the address into parts and then join those parts to PHRASE.

Search a column for values LIKE in another column

I searched but couldn't find what I was looking for, maybe I'm not looking for the right terms though.
I have a colum for SKUs and a Keyword column, the SKUs are formatted AA 12345, and the Keywords are just long lists of words, what I need to do is find any records where the numbers in the SKU match any part of the Keywords, I'm just not sure how to do this. For example I'd like to remove the AA so that I'm looking for %12345% anywhere inside of the value of keywords, but I need to do it for every record.
I've tried a few variations of:
SELECT *, Code AS C
FROM Prod
WHERE Keywords LIKE '%C%';
but I get errors on all of them. Can someone help?
Thank you.
EDIT: Okay, sorry about that, the question wasn't the clearest. I'll try to clarify;
The SKU column has values that have a 2 letter prefix in front of a varying amount of numbers such as, AA 12345 or UN 98767865
The Keywords columns are full of information, but also include the SKU values, the problem here is that some of the keyword columns contain the SKU values of products that have entirely different records
I'm trying to find what columns contain the value of different records.
I hope that's more understandable.
EDIT EDIT: Here is some actual sample data
Code: AD 56409429
Keywords: 56409429, 409249, AD 56409429, AD-56409429, Advance 56409429, Nilfisk 56409429, Nilfisk Advance 56409429, spx56409429, 56409429M, 56409429G, 56409429H, ADV56409429, KNT56409429, Kent 56409429, AA 12345
Code: AA 12345
Keywords: AA 12345, 12345, Brush
I need to find all the records where an Errant Code value has found it's way into the Keywords, such as the first case above, so I need a query that would only return the first example
I'm really sorry my explanation is confusing, it's perhaps an extension of how confused I am trying to figure out how to do it. Imagine me sitting there with the site owner who added thousands of these extra sku numbers to their keywords and having them ask me to then remove them :/

Assuming all of your SKU values are in exactly the same format you can remove the 'AA' part using SUBSTRING and then use the result in the LIKE statement:
SELECT * FROM Prod WHERE Keywords LIKE '%' + SUBSTRING(Code, 3,5) + '%'
Seeing as your SKU codes can be variable length the SUBSTRING statement above will have to changed to:
SELECT * FROM Prod WHERE Keywords LIKE '%' + SUBSTRING(Code, 3, LEN(Code)) + '%'
This will remove the first 3 characters from your SKU code regardless of the number of digits it contains afterwards.
It is not entirely clear from your question whether or not the Keywords are in the format AA 12345 or just 12345 but assuming they are and are comma separated. Then you can find all records where the code is in the keywords but there are OTHER keywords also by using this statement:
SELECT *
FROM Prod
WHERE Keywords LIKE '%' + SUBSTRING(Code, 3, LEN(Code)) + '%'
AND Keywords <> SUBSTRING(Code, 3, LEN(Code))
This statement basically says find me all records where SKU code is somewhere in the Keywords BUT also must not exactly match the Keywords contents, i.e. there must be other keywords in the data.
Ok based on your last revisions I think this will work - or at least get you along the road (I am assuming your Product table has a primary key of Id). Also this is most likely horribly inefficient but seeing as it sounds as if this is a one off tidy up it may not matter too much as long as it works (at least that is what I am hoping).
SELECT DISTINCT P.Id
FROM PROD P
INNER JOIN
(
-- Get all unique SKU codes from Prod table
SELECT DISTINCT SUBSTRING(CODE, 3, LEN(CODE)) as Code FROM Prod
) C ON P.Keywords LIKE '%' + C.Code + '%'
AND SUBSTRING(P.Code, 3, LEN(P.Code)) <> C.Code
The above statement joins a unique list of SKU codes (with the letter prefix removed) with every matching record via the join on the Keyword column. Note: This will result in duplicate product records being returned. Additionally the result-set is filtered so as to only return matching records where the SKU Code of the original Product record does not match a SKU code contained in the keywords column.
The distinct then returns only a unique list of Product Id's that have a erroneous SKU code in the Keyword column (they have may have multiples).

Stuff() seems better suited here.... I would do this:
SELECT *
FROM Prod WHERE
Keywords LIKE '%' + STUFF(SKU,1,3,'') + '%'
This will work for both AA 12345 and UN 98767865 -- it replace the first 3 characters with blank.

looping through a numeric range for secondary record ID

So, I figure I could probably come up with some wacky solution, but i figure i might as well ask up front.
each user can have many orders.
each desk can have many orders.
each order has maximum 3 items in it.
trying to set things up so a user can create an order and the order auto generates a reference number and each item has a reference letter. reference number is 0-99 and loops back around to 0 once it hits 99, so orders throughout the day are easy to reference for the desks.
So, user places an order for desk #2 of 3 items:
78A: red stapler
78B: pencils
78C: a kangaroo foot
not sure if this would be done in the program logic or done at the SQL level somehow.
was thinking something like neworder = order.last + 1 and somehow tying that into a range on order create. pretty fuzzy on specifics.

Without knowing the answer to my comment above, I will assume you want to have the full audit stored, rather than wiping historic records; as such the 78A 78B 78C type orders are just a display format.
If you have a single Order table (containing your OrderId, UserId, DeskId, times and any other top-level stuff) and an OrderItem table (containing your OrderItemId, OrderId, LineItemId -- showing 1,2 or 3 for your first and optional second and third line items in the order, and ProductId) and a Product table (ProductId, Name, Description)
then this is quite simple (thankfully) using the modulo operator, which gives the remainder of a division, allowing you in this case to count in groups of 3 and 100 (or any other number you wish).
Just do something like the following:
(you will want to join the items into a single column, I have just kept them distinct so that you can see how they work)
Obviously join/query/filter on user, desk and product tables as appropriate
select
o.OrderId,
o.UserId,
o.DeskId
o.OrderId%100 + 1 as OrderNumber,
case when LineItem%3 = 1 then 'A'
when LineItem%3 = 2 then 'B'
when LineItem%3 = 0 then 'C'
end as ItemLetter,
oi.ProductId
from tb_Order o inner join tb_OrderItem oi on o.OrderId=oi.OrderId
Alternatively, you can add the itemLetter (A,B,C) and/or the OrderNumber (1-100) as computed (and persisted) columns on the tables themselves, so that they are calculated once when inserted, rather than recalculating/formatting when they are selected.
This sort-of breaks some best practice that you store the raw data in the DB and you format on retrieval; but if you are not going to update the data and you are going to select the data for more than you are going to write the data; then I would break this rule and calculate your formatting at insert time

Weighted Keyword Search

Hello: I want to do a "weighted search" on product that are tagged with keywords.
(So: not fulltext search, but n-to-m-relation). So here it is:
Table 'product':
sku - the primary key
name
Table 'keywords':
kid - keyword idea
keyword_de - German language String (e.g. 'Hund','Katze','Maus')
keyword_en - English language String (e.g. 'Dog','Cat','Mouse')
Table 'product_keyword' (the cross-table)
sku \__ combined primary key
kid /
What I want is to get a score for all products that at least "contain" one relevant keyword. If I search for ('Dog','Elephant','Maus') I want that
Dog credits a score of 1.003,
Elephant of 1.002
Maus of 1.001
So least important search term starts at 1.001, everything else 0.001++. That way, a lower score limit of 3.0 would equal "AND" query (all three keywords must be found), a lower score limit of 1.0 would equal an "OR". Anything in between something more or less matching. In particular by sorting according to this score, most relevant search results would be first (regardless of lower limit)...
I guess I will have to do something with
IF( keyword1 == 'dog', 1.001, 0) + IF...
maybe inside a SUM() and probably with a GROUP BY at the end of a JOIN over the cross table, eh? But I am fairly clueless how to tackle this.
What would be feasible, is to get the keyword id's from the keywords beforehand. That's a cheap query. So the keywords table can be left ignored and it's all about the other of the cross and product table...
I have PHP at hand to automatically prepare a fairly lengthy PHP statement, but I would like to avoid further multiple SQL statements. In particular since I will limit the query outcome (most often to "LIMIT 0, 20") for paging mode results, so looping a very large number of in between results through a script would be no good...
DANKESCHÖN, if you can help me on this :-)

I think a lot of this is in the Lucene engine (http://lucene.apache.org/java/docs/index.html), which is available for PHP in the Zend Framework: http://framework.zend.com/manual/en/zend.search.lucene.html.
EDIT:
If you want to do the weighted thing you are talking about, I guess you could use something like this:
select p.sku, sum(case k.keyword_en when 'Dog' then 1001 when 'Cat' then 1002 when 'Mouse' then 1003 else 0 end) as totalscore
from products p
left join product_keyword pk on p.sku = pk.sku
inner join keywords k on k.kid = pk.kid
where k.keyword_en in ('Dog', 'Cat', 'Mouse')
group by p.sku
(Edit 2: forgot the group by clause.)

Optimizing a strange MySQL Query

Hoping someone can help with this. I have a query that pulls data from a PHP application and turns it into a view for use in a Ruby on Rails application. The PHP app's table is an E-A-V style table, with the following business rules:
Given fields: First Name, Last Name, Email Address, Phone Number and Mobile Phone Carrier:
Each property has two custom fields defined: one being required, one being not required. Clients can use either one, and different clients use different ones based on their own rules (e.g. Client A may not care about First and Last Name, but client B might)
The RoR app must treat each "pair" of properties as only a single property.
Now, here is the query. The problem is it runs beautifully with around 11,000 records. However, the real database has over 40,000 and the query is extremely slow, taking roughly 125 seconds to run which is totally unacceptable from a business perspective. It's absolutely required that we pull this data, and we need to interface with the existing system.
The UserID part is to fake out a Rails-esque foreign key which relates to a Rails table. I'm a SQL Server guy, not a MySQL guy, so maybe someone can point out how to improve this query? They (the business) demand that it be sped up but I'm not sure how to since the various group_concat and ifnull calls are required due to the fact that I need every field for every client and then have to combine the data.
select `ls`.`subscriberid` AS `id`,left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) AS `user_id`,
ifnull(min((case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end)),_utf8'') AS `first_name`,
ifnull(min((case when (`s`.`fieldid` in (3,36)) then `s`.`data` else NULL end)),_utf8'') AS `last_name`,
ifnull(`ls`.`emailaddress`,_utf8'') AS `email_address`,
ifnull(group_concat((case when (`s`.`fieldid` = 81) then `s`.`data` when (`s`.`fieldid` = 154) then `s`.`data` else NULL end) separator ''),_utf8'') AS `mobile_phone`,
ifnull(group_concat((case when (`s`.`fieldid` = 100) then `s`.`data` else NULL end) separator ','),_utf8'') AS `sms_only`,
ifnull(group_concat((case when (`s`.`fieldid` = 34) then `s`.`data` else NULL end) separator ','),_utf8'') AS `mobile_carrier`
from ((`list_subscribers` `ls`
join `lists` `l` on((`ls`.`listid` = `l`.`listid`)))
left join `subscribers_data` `s` on((`ls`.`subscriberid` = `s`.`subscriberid`)))
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
group by `ls`.`subscriberid`,`l`.`name`,`ls`.`emailaddress`
EDIT
I removed the regexp and that sped the query up to about 20 seconds, instead of nearly 120 seconds. If I could remove the group by then it would be faster, but I cannot as removing this causes it to duplicate rows with blank data for each field, instead of aggregating them. For instance:
With group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John Doe jdoe#example.com 5551234567 0 Sprint
Without group by
id user_id first_name last_name email_address mobile_phone sms_only mobile_carrier
1 1 John jdoe#xample.com
1 1 Doe jdoe#example.com
1 1 jdoe#example.com
1 1 jdoe#example.com 5551234567
And so on. What we need is the first result.
EDIT #2
The query still seems to take a long time, but earlier today it was running in only about 20 seconds on the production database. Without changing a thing, the same query is now once again taking over 60 seconds. This is still unacceptable.. any other ideas on how to improve this?

That is, without a doubt, the second most hideous SQL query I have ever laid my eyes on :-)
My advice is to trade storage requirements for speed. This is a common trick used when you find your queries have a lot of per-row functions (ifnull, case and so forth). These per-row functions never scale very well as the table becomes larger.
Create new fields in the table which will hold the values you want to extract and then calculate those values on insert/update (with a trigger) rather than select. This doesn't technically break 3NF since the triggers guarantee data consistency between columns.
The vast majority of database tables are read far more often than they're written so this will amortise the cost of the calculation across many selects. In addition, just about every reported problem with databases is one of speed, not storage.
An example of what I mean. You can replace:
case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end
with:
`s`.`data_2_35`
in your query if your insert/update trigger simply sets the data_2_35 column to data or NULL depending on the value of fieldid. Then you index data_2_35 and, voila, instant speed improvement at the cost of a little storage.
This trick can be done to the five case clauses, the left/regexp bit and the "naked" ifnull function as well (the ifnull functions containing min and group_concat may be harder to do).

The problem is most likely the WHERE condition:
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
This looks like complex string comparison, so no index can be used, which results in a full table scan, possibly for every row in the result set. I am not a MySQL expert, but if you can simplify this into more simple column comparisons it will probably run much faster.

The first thing that jumps out at me as the source of all the trouble:
The PHP app's table is an E-A-V style table...
Trying to convert data in EAV format into conventional relational format on the fly using SQL is bound to be awkward and inefficient. So don't try to smash it into a conventional column-per-attribute format. The following query returns multiple rows per subscriber, one row per EAV attribute:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154)
WHERE SUBSTRING_INDEX(l.name, _utf8'_', 1) REGEXP _utf8'[[:digit:]]+'
This eliminates the GROUP BY which is not optimized well in MySQL -- it usually incurs a temporary table which kills performance.
id user_id email_address fieldid data
1 1 jdoe#example.com 2 John
1 1 jdoe#example.com 3 Doe
1 1 jdoe#example.com 81 5551234567
But you'll have to sort out the EAV attributes in application code. That is, you can't seamlessly use ActiveRecord in this case. Sorry about that, but that's one of the disadvantages of using a non-relational design like EAV.
The next thing that I notice is the killer string manipulation (even after I've simplified it with SUBSTRING_INDEX()). When you're picking substrings out of a column, this says you me that you've overloaded one column with two distinct pieces of information. One is the name and the other is some kind of list-type attribute that you would use to filter the query. Store one piece of information in each column.
You should add a column for this attribute, and index it. Then the WHERE clause can utilize the index:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154)
WHERE l.list_name_contains_digits = 1;
Also, you should always analyze an SQL query with EXPLAIN if it's important for them to have good performance. There's an analogous feature in MS SQL Server, so you should be accustomed to the concept, but the MySQL terminology may be different.
You'll have to read the documentation to learn how to interpret the EXPLAIN report in MySQL, there's too much info to describe here.
Re your additional info: Yes, I understand you can't do away with the EAV table structure. Can you create an additional table? Then you can load the EAV data into it:
CREATE TABLE subscriber_mirror (
subscriberid INT PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
first_name2 VARCHAR(100),
last_name2 VARCHAR(100),
mobile_phone VARCHAR(100),
sms_only VARCHAR(100),
mobile_carrier VARCHAR(100)
);
INSERT INTO subscriber_mirror (subscriberid)
SELECT DISTINCT subscriberid FROM list_subscribers;
UPDATE subscriber_data s JOIN subscriber_mirror m USING (subscriberid)
SET m.first_name = IF(s.fieldid = 2, s.data, m.first_name),
m.last_name = IF(s.fieldid = 3, s.data, m.last_name),
m.first_name2 = IF(s.fieldid = 35, s.data, m.first_name2),
m.last_name2 = IF(s.fieldid = 36, s.data, m.last_name2),
m.mobile_phone = IF(s.fieldid = 81, s.data, m.mobile_phone),
m.sms_only = IF(s.fieldid = 100, s.data, m.sms_only),
m.mobile_carrer = IF(s.fieldid = 34, s.data, m.mobile_carrier);
This will take a while, but you only need to do it when you get a new data update from the vendor. Subsequently you can query subscriber_mirror in a much more conventional SQL query:
SELECT ls.subscriberid AS id, l.name+0 AS user_id,
COALESCE(s.first_name, s.first_name2) AS first_name,
COALESCE(s.last_name, s.last_name2) AS last_name,
COALESCE(ls.email_address, '') AS email_address),
COALESCE(s.mobile_phone, '') AS mobile_phone,
COALESCE(s.sms_only, '') AS sms_only,
COALESCE(s.mobile_carrier, '') AS mobile_carrier
FROM lists l JOIN list_subscribers USING (listid)
JOIN subscriber_mirror s USING (subscriberid)
WHERE l.name+0 > 0
As for the userid that's embedded in the l.name column, if the digits are the leading characters in the column value, MySQL allows you to convert to an integer value much more easily:
An expression like '123_bill'+0 yields an integer value of 123. An expression like 'bill_123'+0 has no digits at the beginning, so it yields an integer value of 0.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas