Linking two seperate sets of data codes without a common identifier - sql

I have two large sets of data. Both sets are a form of structured coding system,and is used to categorize groups of people based on their occupation. The two sets of data have no common identifier. Besides a column that contains a unique identifier each table has a description for said identifier, but although they may be describing similar things the descriptions are not identical.
How do I create a table, that connects the two sets of data, without having to go back and manually try to figure out how to make the connection between the two identifiers. I am not sure if this can be done on Access or SQL. If there is a way to do this, I would like to know what software is maybe out there.
Here's some example data:
Table 1:
Z Identifier DescriptionA
162000 Pharmacist
3123566 Electronic Repairman
143246 Banker
8444455 Doctor
Table 2:
Q Identifier DescriptionB
XX134556 COPY/PRINT/SCAN EQUIP
666Q1224 DRUGS
722WWYZ Financial Svc
8456435T Medical Services
15666PP Health Services
Desired Output:
Table 3:
Z Identifier DescriptionA Q Identifier DescriptionB
162000 Pharmacist 666Q1224 DRUGS
3123566 Electr Repairman XX134556 COPY/PRINT/SCAN EQUIP
143246 Banker 722WWYZ Financial Svc
8444455 Doctor 8456435T Medical Services

Table 1:
Z Identifier DescriptionA
162000 Pharmacist
3123566 Electronic Repairman
143246 Banker
8444455 Doctor
Table 2:
Q Identifier DescriptionB
XX134556 COPY/PRINT/SCAN EQUIP
666Q1224 DRUGS
722WWYZ Financial Svc
8456435T Medical Services
15666PP Health Services
Output:
Z Identifier DescriptionA Q Identifier DescriptionB
162000 Pharmacist 666Q1224 DRUGS
3123566 Electr Repairman XX134556 COPY/PRINT/SCAN EQUIP
143246 Banker 722WWYZ Financial Svc
8444455 Doctor 8456435T Medical Services

Conventional tools that you are used to (like Access, Excel, and SQL) can only go so far with comparing the meaning and usage of words.
In other words (forgive the pun), in order to do this, you need some sort of natural language processing toolkit (NLPT). Along with that, you also need some knowledge of how to program, because I don't think there exists front-end interfaces that can give you the output you want given only the input you listed by just filling out some forms.
So with that in mind, in order to solve your problem (I'll assume you know how to program and can pick up a NLPT in a language of your choice), you need to do the following:
Put your two datasets in some tables.
Manipulate DescriptionA and DescriptionB to be something meaningful to the NLPT you are using. They won't like a string such as "COPY/PRINT/SCAN/ EQUIP". They'll want the slashes removed and the words separated.
Compare DescriptionA with DescriptionB in a permutation-style manner by using a path_similarity type of function in the library. For example path_similarity('animal.definition1', 'dog.definition1') should return a high value, say .60, while path_similarity('animal.definition1', 'book.definition1') should return a low value, like .10.
If the path_similarity is above a certain value (up for you to decide), join the two items together and append them as a single row to a results table, while removing them from their respective tables. Continue doing this until the list is exhausted of DescriptionA greater than a certain similarity to a DescriptionB. Then do something else with the rows that are left in Table 1 and Table 2.
This should all be fairly easy to do programmatically. You may find you are not getting proper matches in some places with this method because you are randomly choosing two words to compare. Because of that, you may want to find another algorithm other than just permutations, perhaps one that looks at the statistics of the path_similarity of every piece of your data to every other piece and acts more appropriately.
Additionally, you may want to allow more than two words to be paired up. For example; "lumberjack", "tree cutter", and "tree chopper" make more sense to be grouped in one row with an additional two columns created than to throw one of them out who will likely be left without a pair. All of the problems I just listed in this paragraph, I'm sure are not new problems and you can search around the internet in order to solve them. Best of luck!

Related

What is the industry standard way to store country / state / city in a database of web APP?

For country and state, there are ISO numbers. With City, there is not.
Method 1:
Store in one column:
[Country ISO]-[State ISO]-[City Name]
Method 2:
Store in 3 separate columns.
Also, how to handle city names if there is no unique identifier?
First and foremost, three separate columns to keep your data. If you want to create a unique identifier, the easiest way would be giving a random 3-10 digit code depending on the size of your data set. However, I would suggest concatenating [country-code]-[state-code]-[code] if you have a small data set and if you want human readability to a certain point. code can be several things. Here are some ideas:
of course a random id or even a database row id
licence plate number/code if there is for a city
phone area code of the city or the code of the center
same logic may apply to zip codes
combination of latitude and longitude of the city center up to certain degree
Here are also more references that can be used:
ISO 3166 is a country codes. In there you can find codes for states or cities depending on the country.
As mentioned IATA has both Airport and City codes list but they are hard to obtain.
UN Location list is a good mention but it can be difficult to gather the levels of differentiation, like the airport code or city code or a borough code can be on the same list, but eventually the UN/LOCODE must be unique. (Airport codes are used for ICAO, similar to IATA but not the same)
there are several data sets out there like OpenTravelData or GeoNames that can be used for start but may require digging and converting. They provide unique codes for locations. And many others can be found.
Bonus:
I would suggest checking Schema.org's City Schema and other Place Schemas for a conscious setup.

Merge two CSV and collate data

I have two CSV files, the first like so:
Book1:
ID,TITLE,SUBJECT
0001,BLAH,OIL
0002,BLAH,HAMSTER
0003,BLAH,HAMSTER
0004,BLAH,PLANETS
0005,BLAH,JELLO
0006,BLAH,OIL
0007,BLAH,HAMSTER
0008,BLAH,JELLO
0009,BLAH,JELLO
0010,BLAH,HAMSTER
0011,BLAH,OIL
0012,BLAH,OIL
0013,BLAH,OIL
0014,BLAH,JELLO
0015,BLAH,JELLO
0016,BLAH,HAMSTER
0017,BLAH,PLANETS
0018,BLAH,PLANETS
0019,BLAH,HAMSTER
0020,BLAH,HAMSTER
And then a second CSV with items associated with the first list, with ID being the common attribute between the two.
Book2:
ID,ITEM
0001,PURSE
0001,STEAM
0001,SEASHELL
0002,TRUMPET
0002,TRAMPOLINE
0003,PURSE
0003,DOLPHIN
0003,ENVELOPE
0004,SEASHELL
0004,SERPENT
0004,TRUMPET
0005,CAR
0005,NOODLE
0006,CANNONBALL
0006,NOODLE
0006,ORANGE
0006,SEASHELL
0007,CREAM
0007,CANNONBALL
0007,GUM
0008,SERPENT
0008,NOODLE
0008,CAR
0009,CANNONBALL
0009,SERPENT
0009,GRAPE
0010,SERPENT
0010,CAR
0010,TAPE
0011,CANNONBALL
0011,GRAPE
0012,ORANGE
0012,GUM
0012,SEASHELL
0013,NOODLE
0013,CAR
0014,STICK
0014,ORANGE
0015,GUN
0015,GRAPE
0015,STICK
0016,BASEBALL
0016,SEASHELL
0017,CANNONBALL
0017,ORANGE
0017,TRUMPET
0018,GUM
0018,STICK
0018,GRAPE
0018,CAR
0019,CANNONBALL
0019,TRUMPET
0019,ORANGE
0020,TRUMPET
0020,CHERRY
0020,ORANGE
0020,GUM
The real datasets are millions of records, so I'm sorry in advance for my simple example.
The problem I need to solve is getting the data merged and collated in a way where I can see which item groupings most commonly appear together on the same ID. (e.g. GRAPE,GUM,SEASHELL appear together 340 times, ORANGE and STICK 89 times, etc...)
Then I need to see if there is any change/deviation to the general results in common appearance when grouped by SUBJECT.
Tools I'm familiar with are Excel and SQL, but I also have PowerBI and Alteryx at my disposal.
Full disclosure: Not homework, or work, but a volunteer project, thus my unfamiliarity with this kind of data manipulation.
Thanks in advance.
An Alteryx solution:
Drag the two .csv files onto your canvas (seen as book1.csv and book2.csv in my picture; Alteryx will create "Input" tools for you.
Drag a "Join" tool on and connect the two .csv files to its inputs; select "ID" as the join field; unselect the "Right_ID" as output since it's merely a duplicate of "ID"
Drag a "Summary" tool on and connect the Join tool's output to the Summary tool's input; select all three of the outputs and add as a "group by"... then add the ID column with a "count"
Drag a browse tool on and connect the summary's output to the browse tool's input.
run the workflow
After all that, click on the browse tool and you should see what is seen in my screenshot: (which is showing just the first ten rows of output):
+1 for taking on a volunteer project - I think anyone who knows data can have a big impact in support of their favourite group or cause.
I would just pull the 2 files into Power BI as 2 separate tables (Get Data / From File). Create a relationship between the 2 tables based on ID (it might get auto-generated). It should be one to many.
Then I would add a Calculated Column to the Book1 table to Concatenate the related ITEM values, eg.
Items =
CALCULATE (
CONCATENATEX (
DISTINCT ( 'Book2'[ITEM] ),
'Book2'[ITEM],
", ",
'Book2'[ITEM], ASC
)
)
Now you can use that Items field in visuals (e.g. a Table), along with Count of ID to get the frequency.
Adding Subject to a copy of the table (e.g. to the Columns well of a Matrix) will produce your grouped scenario, or you could add a Subject Slicer.
As you will be comparing subsets of varying size, I would change Count of ID to Show value as - % of grand total.
Little different solution using Alteryx.
With this dataset, there are very few repeating 3 or 4 item groups. You can do the two item affinity analysis and get a probability of 3 or 4 item groups, or you can count the 3 and 4 item groups individually. I believe what you want is the latter as your probability of getting grapes with oranges may be altered by whether you have bananas in the cart or not.
Anyway, I did not join in the subject until after finding all of my combinations. I found all the combinations by taking the Cartesian join of two, then three, then four of the original set. I then removed all duplicates by ensuring items were always in alphabetical order in each row. I then counted occurrences of each combination. More joins can be added in the same pattern to count groups of 5,6,7...
Once you have the counts of occurrences, then I would join back with the subjects and perform this analysis on each group and compare to the overall results.
I'm supposed to disclose that I work for Alteryx.
first of all if you are using windows
just navigate to the directory which contains the CSV and write the following command:
copy pattern newfileName.csv
#example
copy *.csv merged.csv
now you created one csv file, the file is too large now you can't process it once, depending on your programming language you can use appropriate way, for python you can use generators to process line by line, or pandas you can read chunk by chunk it will be easy.
I hope this help you.

Select normalized strings

I have a table which contains company names which appear to have been a free text box entry. As such there ends up being lots of companies with 3-5 entries such as A Good Company, A Good Company LLC, AA Good Company etc.
I know if I was looking for one company I could use like (%) to get all the variations, but I would like to insert them into a new company table with just one row for all options so that I can use that as a reference table going forward. Is there a way to do this within SQL, or in an outside application for that matter?

having an order with multiple requests

i want to do a autoshop software... where they keep up the cars they have and what they need(engine and other parts for example) but i dont know how to do the database to accept multiple items at once
example:
a car needs on one visit to the auto shop:
left frontal door
tires
oil change
filters
how to i add this in one go to the database(with prices included) so that i can see it all after and print a bill wheer it shows all... but my main priority is being able to insert all in one go and in one table
Hard to tell without any idea about your db strucutre. Lets assume db isn't constructed yet, you don't want to decrease parts stock or keep any track of wich exact part (i mean with serial etc.) was used. You want it quite simple, just a table with a car bought some parts.
In this case i woulde use a table looking like this : id|date|car_id|parts_used
where parts_used is a string containing parts and prices with separators. For example : "left frontal door=500+tires=100+oil=10" and then split the string when reading db.
I'm not sure it's what you want but your question isn't quite precise :)

SQL IN statement "inclusiveness"

I'm not a programmer, but trying to learn. I'm a nurse, and need to pull data for medical referral tracking from a database. I have a piece of GUI software which builds JOIN queries for me to pull things from the database. One of the operators I can use in the drop-down is "IN." The referral documentation is stored in the table as codes made up of one to three letters. For example, the code for a completed dental referral is CDF, and the code for a dental referral is D.
I want to build a report to allow other nurses to pull all their outstanding referrals, so I'll want to pull "D" but not "CDF"
If I use IN as the operator, and set my parameters to 'S','D','BP' {etc} will that also pull the records which have the other, longer codes which contain those same letters? (like CDF, CSR, CBP)
I don't want to test it because I only have access to the production database, and I don't want to hose up actual patient records. Thanks in advance for any help!
Assuming that the column that holds the referral code holds one and only one code per record (which is what it sounds like) the query should function as you want and will not attempt to match substrings.
In any event, there's no danger that a query in the form IN ('S', 'D', 'BP') will match substrings. To perform substring matches in SQL you have to use the LIKE operator.
The situation in which this will not work is if the referral code column holds multiple codes separated by commas. This is an all-too-common mistake in designing databases but if the product you're using is commercial rather than home-grown, I think it's very unlikely to be the case. If it is, searching it is much more difficult.