Select normalized strings

I have a table of company names that appear to have been entered through a free-text box. As a result, many companies have 3-5 variant entries, such as A Good Company, A Good Company LLC, AA Good Company, etc.
I know that if I were looking for one company I could use LIKE with % wildcards to get all the variations, but I would like to insert them into a new company table with just one row per company, so that I can use it as a reference table going forward. Is there a way to do this within SQL, or in an outside application for that matter?

Related

Linking two separate sets of data codes without a common identifier

I have two large sets of data. Both are structured coding systems used to categorize groups of people based on their occupation. The two sets have no common identifier. Besides a column containing a unique identifier, each table has a description for that identifier, but although the descriptions may refer to similar things, they are not identical.
How do I create a table that connects the two sets of data without having to go back and manually work out the connection between the two identifiers? I am not sure whether this can be done in Access or SQL. If there is a way to do this, I would like to know what software is out there.
Here's some example data:
Table 1:
Z Identifier  DescriptionA
162000        Pharmacist
3123566       Electronic Repairman
143246        Banker
8444455       Doctor

Table 2:
Q Identifier  DescriptionB
XX134556      COPY/PRINT/SCAN EQUIP
666Q1224      DRUGS
722WWYZ       Financial Svc
8456435T      Medical Services
15666PP       Health Services

Desired Output:
Table 3:
Z Identifier  DescriptionA      Q Identifier  DescriptionB
162000        Pharmacist        666Q1224      DRUGS
3123566       Electr Repairman  XX134556      COPY/PRINT/SCAN EQUIP
143246        Banker            722WWYZ       Financial Svc
8444455       Doctor            8456435T      Medical Services

Conventional tools that you are used to (like Access, Excel, and SQL) can only go so far with comparing the meaning and usage of words.
In other words (forgive the pun), to do this you need some sort of natural language processing toolkit (NLPT). You also need some programming knowledge, because I don't think any front-end interface exists that can produce the output you want from the input you listed just by filling out some forms.
So with that in mind (I'll assume you know how to program and can pick up an NLPT in a language of your choice), you need to do the following:
1. Put your two datasets in some tables.
2. Manipulate DescriptionA and DescriptionB into something meaningful to the NLPT you are using. It won't like a string such as "COPY/PRINT/SCAN EQUIP"; it will want the slashes removed and the words separated.
3. Compare DescriptionA with DescriptionB in a permutation-style manner using a path_similarity type of function in the library. For example, path_similarity('animal.definition1', 'dog.definition1') should return a high value, say .60, while path_similarity('animal.definition1', 'book.definition1') should return a low value, like .10.
4. If the path_similarity is above a certain value (up to you to decide), join the two items together and append them as a single row to a results table, removing them from their respective tables. Continue until no DescriptionA remains whose similarity to some DescriptionB exceeds the threshold. Then do something else with the rows that are left in Table 1 and Table 2.
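The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `RELATED` and `toy_similarity` are hypothetical stand-ins for a real NLP library's similarity function (they are hand-written so the example runs without any external data), the threshold of 0.2 is arbitrary, and the loop pairs the highest-scoring candidates first rather than in arbitrary order.

```python
import re

# Hypothetical stand-in for a WordNet-style lookup: words we treat as related.
RELATED = {
    "pharmacist": {"drugs"},
    "repairman": {"equip"},
    "banker": {"financial"},
    "doctor": {"medical"},
}

def normalize(description: str) -> set[str]:
    """Lowercase the text and split it on anything that is not a letter."""
    return {word for word in re.split(r"[^a-z]+", description.lower()) if word}

def toy_similarity(words_a: set[str], words_b: set[str]) -> float:
    """Share of words_b found in words_a or among its related words."""
    expanded = set(words_a)
    for word in words_a:
        expanded |= RELATED.get(word, set())
    return len(expanded & words_b) / len(words_b) if words_b else 0.0

def greedy_match(table1, table2, similarity, threshold):
    """Pair rows best score first; each row is used at most once."""
    scored = sorted(
        ((similarity(normalize(d1), normalize(d2)), id1, id2)
         for id1, d1 in table1.items()
         for id2, d2 in table2.items()),
        reverse=True,
    )
    matched, used1, used2 = [], set(), set()
    for score, id1, id2 in scored:
        if score < threshold:
            break  # everything after this point scores even lower
        if id1 in used1 or id2 in used2:
            continue  # one side is already paired
        matched.append((id1, id2, score))
        used1.add(id1)
        used2.add(id2)
    left1 = [i for i in table1 if i not in used1]
    left2 = [i for i in table2 if i not in used2]
    return matched, left1, left2

table1 = {"162000": "Pharmacist", "3123566": "Electronic Repairman",
          "143246": "Banker", "8444455": "Doctor"}
table2 = {"XX134556": "COPY/PRINT/SCAN EQUIP", "666Q1224": "DRUGS",
          "722WWYZ": "Financial Svc", "8456435T": "Medical Services",
          "15666PP": "Health Services"}

matched, left1, left2 = greedy_match(table1, table2, toy_similarity, 0.2)
for id1, id2, score in matched:
    print(id1, table1[id1], id2, table2[id2], round(score, 2))
print("unmatched:", left1, left2)
```

Swapping `toy_similarity` for a real library function only changes the `similarity` argument; the rows that stay unmatched still need manual review.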
This should all be fairly easy to do programmatically. You may find you are not getting proper matches in some places with this method, because the pairs are compared in no particular order and the first pair over the threshold wins. Because of that, you may want an algorithm other than plain permutations, perhaps one that looks at the statistics of the path_similarity of every piece of your data to every other piece and acts accordingly.
Additionally, you may want to allow more than two items to be grouped. For example, "lumberjack", "tree cutter", and "tree chopper" make more sense grouped in one row with two additional columns than with one of them thrown out and left without a pair. None of the problems I listed in this paragraph are new, and you can search around the internet for ways to solve them. Best of luck!

Having an order with multiple requests

I want to build software for an auto shop, where they keep track of the cars they have and what each car needs (engine and other parts, for example), but I don't know how to design the database to accept multiple items at once.
Example: on one visit to the auto shop, a car needs:
left frontal door
tires
oil change
filters
How do I add all of this to the database in one go (with prices included), so that I can view it afterwards and print a bill that shows everything? My main priority is being able to insert everything in one go and in one table.

Hard to tell without any idea about your DB structure. Let's assume the DB isn't built yet, and that you don't want to decrease parts stock or keep track of which exact part (I mean with serial number, etc.) was used. You want it quite simple: just a table recording that a car got some parts.
In this case I would use a table looking like this: id|date|car_id|parts_used
where parts_used is a string containing parts and prices with separators, for example "left frontal door=500+tires=100+oil=10", and then split the string when reading from the DB.
I'm not sure it's what you want, but your question isn't very precise :)
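For what it's worth, the "split the string when reading" step might look something like the sketch below. The function name `parse_parts` is made up for illustration, and it assumes prices are whole numbers and that no part name contains "+" or "=" (either character inside a name would break the format).

```python
def parse_parts(parts_used: str) -> dict[str, int]:
    """Turn 'name=price+name=price' into a {name: price} dictionary."""
    items = {}
    for entry in parts_used.split("+"):
        name, price = entry.split("=")
        items[name.strip()] = int(price)
    return items

bill = parse_parts("left frontal door=500+tires=100+oil=10")
total = sum(bill.values())

# Print a simple bill: one line per item, then the total.
for name, price in bill.items():
    print(f"{name}: {price}")
print(f"total: {total}")
```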

Using LIKE in SQL Server to identify strings

I am writing a program that performs operations on a database of Football matches and data. One of the issues that I have is that my source data does not have consistent naming of each Team. So Leyton Orient could appear as L Orient. Most of the time this team is listed as L Orient. So I need to find the closest match to a team name when it does not appear in the database team name list exactly as it appears in the data that I am importing. Currently in my database I have a table 'Team' with a data sample as follows:
TeamID  TeamName     TeamLocation
1       Arsenal      England
2       Aston Villa  England
3       L Orient     England
If the name 'Leyton Orient' appears in the data being imported I need to match this to L Orient and get the TeamID 3. My question is, can I use the LIKE function to achieve this in a case where the team name is longer than the name in the database?
I have figured out that if I had 'Leyton Orient' in the table and was importing 'L Orient' I could locate the correct entry with:
SELECT TeamName FROM Team WHERE TeamName LIKE '%l%orient%';
But can I do it the other way around? Also, I could have an example like Manchester United and I want to import Man Utd. I could find this by putting a % sign between every character like this:
SELECT TeamName FROM Team WHERE TeamName LIKE '%M%a%n%U%t%d%';
But is there a better way?
Finally, and this might be better put in another question, I would like not to have to search for the correct team when the way a team is named is repeated, i.e. I would like to store alternative spellings/aliases for teams in order to find the correct team entry quickly. Can anybody advise on how I might approach this? Thanks

The solution you are looking for is full-text search. It will require your DBA to create a full-text index, but once that is in place you can perform much more powerful searches than simple character pattern matching.
As the others have suggested, you could also just have an alias table that contains all known forms of each team name and look names up there. Depending on how your search works, that may well be the path of least resistance.

Regarding the last part of the question (storing alternative spellings/aliases for teams so that the correct team entry can be found quickly):
I would personally have a team table and a team alias table, with each alias row linked to its team through a foreign key.
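A minimal sketch of the alias-table idea, run against an in-memory SQLite database so it can be tried safely. The `TeamAlias` table, its column names, and the `find_team_id` helper are assumptions made for illustration; the question targets SQL Server, where the DDL would differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team (
        TeamID       INTEGER PRIMARY KEY,
        TeamName     TEXT NOT NULL,
        TeamLocation TEXT
    );
    -- Every known spelling, including the canonical one, points at a team.
    CREATE TABLE TeamAlias (
        Alias  TEXT COLLATE NOCASE PRIMARY KEY,
        TeamID INTEGER NOT NULL REFERENCES Team(TeamID)
    );
    INSERT INTO Team VALUES
        (1, 'Arsenal', 'England'),
        (2, 'Aston Villa', 'England'),
        (3, 'L Orient', 'England');
    INSERT INTO TeamAlias VALUES
        ('Arsenal', 1),
        ('Aston Villa', 2),
        ('L Orient', 3),
        ('Leyton Orient', 3);
""")

def find_team_id(name: str):
    """Return the TeamID for an imported name, or None if it is unknown."""
    row = conn.execute(
        "SELECT TeamID FROM TeamAlias WHERE Alias = ?", (name.strip(),)
    ).fetchone()
    return row[0] if row else None

print(find_team_id("Leyton Orient"))
print(find_team_id("Man Utd"))
```

Names that come back as `None` can be resolved by hand once and then inserted into the alias table, so the same spelling never needs a manual search again.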

I believe the best way to prevent this is to have the list of team names displayed in a dropdown list. This also lets you drop validation of the team name: users can only choose one of the set team names, which makes things much easier for you in the database, because you can then look for the exact team name as it appears there, i.e.:
SELECT TeamName FROM Team WHERE TeamName = [dropdownlist_name];

SQL IN statement "inclusiveness"

I'm not a programmer, but trying to learn. I'm a nurse, and need to pull data for medical referral tracking from a database. I have a piece of GUI software which builds JOIN queries for me to pull things from the database. One of the operators I can use in the drop-down is "IN." The referral documentation is stored in the table as codes made up of one to three letters. For example, the code for a completed dental referral is CDF, and the code for a dental referral is D.
I want to build a report to allow other nurses to pull all their outstanding referrals, so I'll want to pull "D" but not "CDF"
If I use IN as the operator, and set my parameters to 'S','D','BP' {etc} will that also pull the records which have the other, longer codes which contain those same letters? (like CDF, CSR, CBP)
I don't want to test it because I only have access to the production database, and I don't want to hose up actual patient records. Thanks in advance for any help!

Assuming that the column that holds the referral code holds one and only one code per record (which is what it sounds like) the query should function as you want and will not attempt to match substrings.
In any event, there's no danger that a query in the form IN ('S', 'D', 'BP') will match substrings. To perform substring matches in SQL you have to use the LIKE operator.
The situation in which this will not work is if the referral code column holds multiple codes separated by commas. This is an all-too-common mistake in designing databases but if the product you're using is commercial rather than home-grown, I think it's very unlikely to be the case. If it is, searching it is much more difficult.
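The difference between IN and LIKE can be checked without touching production data, for example on a throwaway in-memory SQLite database. The table and column names below (`Referral`, `Code`) are made up for the sketch; only the codes come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Referral (ReferralID INTEGER PRIMARY KEY, Code TEXT)")
conn.executemany(
    "INSERT INTO Referral (Code) VALUES (?)",
    [(code,) for code in ["D", "CDF", "S", "CSR", "BP", "CBP"]],
)

# IN compares whole values, so only exact codes come back.
in_rows = [row[0] for row in conn.execute(
    "SELECT Code FROM Referral WHERE Code IN ('S', 'D', 'BP') ORDER BY Code"
)]

# LIKE with wildcards matches substrings, so 'CDF' is returned as well.
like_rows = [row[0] for row in conn.execute(
    "SELECT Code FROM Referral WHERE Code LIKE '%D%' ORDER BY Code"
)]

print(in_rows)
print(like_rows)
```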

How to design a database table structure for storing and retrieving search statistics?

I'm developing a website with a custom search function and I want to collect statistics on what the users search for.
It is not a full text search of the website content, but rather a search for companies with search modes like:
by company name
by area code
by provided services
...
How to design the database for storing statistics about the searches?
What information is most relevant and how should I query for them?

Well, it's dependent on how the different search modes work, but generally I would say that a table with 3 columns would work:
SearchType SearchValue Count
Whenever someone does a search, say they search for "Company Name: Initech", first query to see if there are any rows in the table with SearchType = "Company Name" (or whatever enum/id value you've given this search type) and SearchValue = "Initech". If there is already a row for this, UPDATE the row by incrementing the Count column. If there is not already a row for this search, insert a new one with a Count of 1.
By doing this, you'll have a fair amount of flexibility for querying it later. You can figure out what the most popular searches for each type are:
... WHERE SearchType = 'Some Search Type' ORDER BY Count DESC
You can figure out the most popular search types:
... GROUP BY SearchType ORDER BY SUM(Count) DESC
Etc.
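As a rough sketch of the approach described above, here it is on an in-memory SQLite database. The table name `SearchStats` and the helper `record_search` are assumptions for illustration; the column is written as "Count" in quotes only to keep it clearly distinct from the COUNT function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SearchStats (
        SearchType  TEXT NOT NULL,
        SearchValue TEXT NOT NULL,
        "Count"     INTEGER NOT NULL,
        PRIMARY KEY (SearchType, SearchValue)
    )
""")

def record_search(search_type: str, search_value: str) -> None:
    """Increment the counter for this search, or create it with a count of 1."""
    key = (search_type, search_value)
    exists = conn.execute(
        "SELECT 1 FROM SearchStats WHERE SearchType = ? AND SearchValue = ?", key
    ).fetchone()
    if exists:
        conn.execute(
            'UPDATE SearchStats SET "Count" = "Count" + 1 '
            "WHERE SearchType = ? AND SearchValue = ?", key
        )
    else:
        conn.execute("INSERT INTO SearchStats VALUES (?, ?, 1)", key)

for search in [("Company Name", "Initech")] * 3 + [
    ("Company Name", "Initrode"), ("Area Code", "512"), ("Area Code", "512"),
]:
    record_search(*search)

# Most popular searches within one search type.
top_companies = [row[0] for row in conn.execute(
    "SELECT SearchValue FROM SearchStats "
    'WHERE SearchType = ? ORDER BY "Count" DESC', ("Company Name",)
)]

# Most popular search types overall.
top_types = [row[0] for row in conn.execute(
    "SELECT SearchType FROM SearchStats "
    'GROUP BY SearchType ORDER BY SUM("Count") DESC'
)]

print(top_companies)
print(top_types)
```

With several concurrent writers, the separate SELECT and UPDATE can race; a single upsert statement, where the database supports one, would avoid that.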

This is a pretty general question but here's what I would do:
Option 1
If you want to strictly separate all three search types, then create a table for each. For company name, you could simply store the CompanyID (assuming your website is maintaining a list of companies) and a search count. For area code, store the area code and a search count. If the area code doesn't exist, insert it. Provided services is most dependent on your setup. The most general way would be to store key words and a search count, again inserting if not already there.
Optionally, you could store search date information as well. As an example, you'd have a table with Provided Services Keyword and a unique ID. You'd have another table with an FK to that ID and a SearchDate. That way you could make sense of the data over time while minimizing storage.
Option 2
Treat all searches the same. One table with a Keyword column and a count column, incorporating SearchDate if needed.

You may want to check this:
http://www.microsoft.com/sqlserver/2005/en/us/express-starter-schemas.aspx
