I would like to ask for some suggestions, because I've been working on this for a week.
It's basically a data cleanup program.
I have an Excel file which contains thousands of company names, and I have a database which contains the correct names of those companies.
What I want is to read the Excel file (which I've already done) and compare each company in it against the values I have in the database. For example:
Data in Excel
Hewlett-Packard, Costa Rica
Hewlett-Packard (HP)
Hewlett-Packard Singapore (Private) Limited
Data in Database
Hewlett-Packard
It should auto-detect that those three values in the Excel file are Hewlett-Packard, because the Excel file is free-form text. I want to correct everything that is entered in it by finding the most similar value in my database; for example, if Hewlett-Packard is spelled wrong, it should automatically tell me that it's Hewlett-Packard. Any ideas?
It's like autocomplete, but with thinking: autocomplete that decides the correct value.
I'm doing it in VB.Net, by the way. I've been researching fuzzy search algorithms, Levenshtein distance, and so on, but I still don't understand how I can use them.
See my blog, Solving the right problem, which is somewhat similar. You're probably better off doing a simple match and outputting any failures to a text file, which you then manually edit. It's drudgery, but it'll get the job done. When you start talking about Levenshtein distance and fuzzy search, you're turning a simple, if dull, task into a research project.
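In VB.Net, that simple pass might look something like the sketch below, assuming the database names and Excel names are already loaded into lists; the variable names and the failures.txt path are made up for illustration.

' Minimal sketch: exact (case-insensitive) matching, with misses logged for manual review.
Imports System.IO

Module SimpleMatch
    Sub MatchNames(dbNames As List(Of String), excelNames As List(Of String))
        ' Case-insensitive set of the known-good names from the database.
        Dim known As New HashSet(Of String)(dbNames, StringComparer.OrdinalIgnoreCase)
        Dim failures As New List(Of String)

        For Each name In excelNames
            If Not known.Contains(name.Trim()) Then
                failures.Add(name)   ' no exact match; leave it for a human
            End If
        Next

        ' Dump the misses to a text file to be fixed by hand.
        File.WriteAllLines("failures.txt", failures)
    End Sub
End Module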
If your database contains only "thousands" (rather than millions) of names, then one thing you can do is load all the names into a list, and sort them. Then sort the names in the Excel file. Then go through the two lists (a standard merge-type algorithm). For example, you might have in your database:
Hasbro
Hewlett Packard
Home Depot
and in your Excel file:
Grainger
Halliburton
Hewlet Packard, Costa Rica
Hewlett Packard (HP)
Humana
Using the merge algorithm, you'd be comparing "Hewlet Packard, Costa Rica" against "Hewlett Packard", and you might even output that as the suggested replacement. That would probably constitute the majority of your errors.
In any case, I strongly recommend using the computer to identify the mismatches, and then manually resolve them. That's usually the fastest way to solve this type of problem.
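A rough sketch of that merge-style pass, assuming both lists are already loaded and sorted case-insensitively; the "candidates" printed are simply the database names on either side of where the walk stops, so they are hints rather than guaranteed corrections:

' Sketch of the merge-style comparison: both lists sorted, then walked in parallel.
Sub MergeCompare(dbNames As List(Of String), excelNames As List(Of String))
    dbNames.Sort(StringComparer.OrdinalIgnoreCase)
    excelNames.Sort(StringComparer.OrdinalIgnoreCase)

    Dim i As Integer = 0   ' index into excelNames
    Dim j As Integer = 0   ' index into dbNames

    While i < excelNames.Count AndAlso j < dbNames.Count
        Dim cmp As Integer = String.Compare(excelNames(i), dbNames(j), StringComparison.OrdinalIgnoreCase)
        If cmp = 0 Then
            i += 1          ' exact match, nothing to report
        ElseIf cmp < 0 Then
            ' The Excel name sorts before the current database name: no exact match.
            Dim before As String = dbNames(Math.Max(j - 1, 0))
            Console.WriteLine($"No match: '{excelNames(i)}' (candidates: '{before}', '{dbNames(j)}')")
            i += 1
        Else
            j += 1          ' advance through the database list
        End If
    End While
End Sub

Any Excel names left over once the database list runs out are unmatched as well; a real pass would report those too.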
I have around 300k rows of unstructured data, as shown in the screenshot below. I'm trying to use Google Refine (OpenRefine) to clean this up, but I can't find a proper way to do it; I'm new to the tool. Any help would be greatly appreciated. Also, the tool is quite slow with 300k records: whenever I try something, it takes a long time to process and produce output.
Or please suggest any other open-source tools and techniques to do this.
As Owen said in the comments, your question is probably too broad to receive an acceptable answer; we can only give you a general procedure to follow.
In OpenRefine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions, and for that you need to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", nor the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it's an organisation name).
On the basis of your fifteen lines, however, there seem to be some clear patterns. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in OpenRefine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns you find (for example, any number sequence at the end that contains more than four digits and a hyphen in the middle).
Feel free to check out the Open Refine documentation in case of doubt.
At my job, we have several rental properties that we manage. Each one of these properties may go by different names; for example, a property may be called Amber Gateway, Platinum Gateway, The Gateway, etc. We have maybe 500-600 Excel workbooks floating around with different types of information in them, and I might be asked to pull information from various ones.
The lack of a consistent naming methodology prevents me from using a standard Index/Match function to look up data. I'm not sure if this is the best solution, but this has been my stab at solving the problem.
I've created a worksheet that has a list of all the property names in Column A. Any associated names are listed to the right on the same row in Column B, Column C, and so on. Just for simplicity, say there are only 5 properties and all my data is in A1:E5. Then say the property I'm interested in is in F1 and the property list I want to "match" it up against is in G1:G5. So my data would look something like this:
River Stream Creek Brook Rivulet
Apple Fruit
Rock Boulder Stone Slab
Candy Dessert Sweets
Forest Trees
Given the word 'boulder' and the following list:
Candy
Fruit
Creek
Slab
Forest
my goal is to return the list position of the synonym 'slab' - in this case, 4.
I think I can use the array formula below in place of the MATCH function to accomplish this:
{=SUMPRODUCT(--(INDEX(A1:E5,SUMPRODUCT(--(A1:E5=F1)*ROW(A1:E5)),)=G1:G5)*IF(G1:G5<>"",MATCH(G1:G5,G1:G5,0)))}
Now this formula is a bit unwieldy and I was hoping to translate it into a UDF to make it easier to work with. I'm unfamiliar with VBA though, and after doing a bit of searching, I realized that VBA logic works quite differently than Excel Formula logic. Specifically, I don't think I can use = to force my lookup grid into TRUE/FALSE values in VBA like I do in the SUMPRODUCT functions. Do I have to learn VBA in order to implement this as a UDF or is there another solution? In practice, my lookup grid (A1:E5) will be in an external workbook.
If my attempt is completely off the mark, I'm open to other solutions. I know the MATCH function supports wildcards, but that wouldn't work in the case of dramatically different names, so I was hoping for something more comprehensive.
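For what it's worth, the lookup that formula performs boils down to two steps: find the row of the synonym grid that contains the search term, then return the 1-based position of whichever entry in the lookup list also appears in that row. A rough sketch of that logic, written as plain .NET code rather than a working UDF (an actual UDF would need the same steps in VBA); synonymGrid and lookupList are illustrative stand-ins for A1:E5 and G1:G5:

Function SynonymMatch(searchTerm As String,
                      synonymGrid As List(Of List(Of String)),
                      lookupList As List(Of String)) As Integer
    For Each row In synonymGrid
        If ContainsName(row, searchTerm) Then
            ' Found the row of synonyms; now find which lookup-list entry it contains.
            For pos As Integer = 0 To lookupList.Count - 1
                If ContainsName(row, lookupList(pos)) Then
                    Return pos + 1      ' 1-based, like MATCH
                End If
            Next
        End If
    Next
    Return 0                            ' no synonym found
End Function

Function ContainsName(row As List(Of String), name As String) As Boolean
    For Each entry In row
        If String.Equals(entry, name, StringComparison.OrdinalIgnoreCase) Then Return True
    Next
    Return False
End Function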
This is my first time asking a question on here, so please let me know if this belongs in a different area or there's any matter of etiquette I'm overlooking.
I am working on huge Excel sheets from different sources about the same thing. The way the sources report and write down the information differs; for example, one writes the location as "Khurais" whereas another writes it as "Khorais".
Since both of these files contain important information, I would like to combine them into one Excel sheet so that I can deal with them more easily. If you have any suggestions or tools that you think would be helpful, please share them here.
P.S. The words in the Excel sheet are transliterations of Arabic words.
You could use Levenshtein distance to determine whether two words are "close" to each other, and match based on that.
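For reference, a compact Levenshtein distance function looks like this (shown in VB.Net; the same logic ports directly to a VBA function for use inside Excel). Two names can then be treated as "close" when the distance is small; the exact threshold is something you'd tune.

Function Levenshtein(a As String, b As String) As Integer
    ' Classic dynamic-programming edit distance, case-insensitive.
    Dim d(a.Length, b.Length) As Integer

    For i As Integer = 0 To a.Length
        d(i, 0) = i
    Next
    For j As Integer = 0 To b.Length
        d(0, j) = j
    Next

    For i As Integer = 1 To a.Length
        For j As Integer = 1 To b.Length
            Dim cost As Integer = If(Char.ToLowerInvariant(a(i - 1)) = Char.ToLowerInvariant(b(j - 1)), 0, 1)
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return d(a.Length, b.Length)
End Function

For example, Levenshtein("Khurais", "Khorais") is 1, so a cutoff of one or two edits would treat those two spellings as the same location.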
You could use FuzzyLookup, a macro that allows you to do approximate matching. It worked really well for me in the past and is actually really well documented.
You can find it here: https://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html, including examples of how to use it.
Hope that helps!
P.S. Obviously you can also use it strictly within VBA (not using worksheet functions).
The Double Metaphone algorithm springs to mind. It attempts to convert strings into phonetic representations. For example, "Folly" and "Pholee" should have the same phonetic code.
If you could generate these codes, you could then match your records based on them, instead of the strings.
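A sketch of how that matching step might look, assuming you already have a Double Metaphone implementation available; GetDoubleMetaphone below is a hypothetical placeholder for it (for instance, the VBA code from the article linked next, ported to .NET):

Function MatchByPhoneticCode(dbNames As List(Of String),
                             excelNames As List(Of String)) As Dictionary(Of String, String)
    ' Index the clean names by phonetic code. (If two clean names share a code,
    ' the later one wins here; a real version would keep a list per code.)
    Dim byCode As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase)
    For Each name In dbNames
        byCode(GetDoubleMetaphone(name)) = name   ' GetDoubleMetaphone is a hypothetical helper
    Next

    ' Look up each messy name by its code; names with no code match simply drop out here.
    Dim matches As New Dictionary(Of String, String)
    For Each name In excelNames
        Dim clean As String = Nothing
        If byCode.TryGetValue(GetDoubleMetaphone(name), clean) Then
            matches(name) = clean
        End If
    Next
    Return matches
End Function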
Here's an article that explains, along with sample VBA code:
https://bytes.com/topic/access/insights/965241-fuzzy-string-matching-double-metaphone-algorithm
Hope that inspires you :)
I was bored and looking at old code that runs like molasses on a cold day. I found a group of tables in our accounting system - each with 500,000 records of ~20 data points - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .NET ETL project grabs all the records, does a bunch of substrings on each record to set an object's corresponding attributes, then sends the object to be merged with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there were a way to use the TextFieldParser to parse a static string instead of a file/stream, that might be a valid approach. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the stored proc. http://www.dotnetperls.com/textfieldparser does show that TextFieldParser is quite a bit faster than Split, which I assume is comparable to the string manipulation our project is currently doing with Substring. So there may be something to that idea.
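As it happens, TextFieldParser can read from an in-memory string: its constructor accepts any TextReader, so wrapping the concatenated value in a StringReader works. A fixed-width sketch, with made-up field widths for illustration:

Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module FixedWidthDemo
    Sub ParseRecord(record As String)
        Using parser As New TextFieldParser(New StringReader(record))
            parser.TextFieldType = FieldType.FixedWidth
            parser.SetFieldWidths(10, 8, 12, -1)   ' -1 means the last field takes the rest

            While Not parser.EndOfData
                Dim fields As String() = parser.ReadFields()
                ' ... map fields(0), fields(1), ... onto the object sent to the stored proc
            End While
        End Using
    End Sub
End Module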
Or perhaps the whole old project should be dumped for a shiny new SSIS project. Would it also have to write the records to a flat file before importing into SQL, or can it import directly from the table?
Thank you in advance!
I have a table called Animals. I pull data from this table to populate another system.
I get Excel data with lists of animals that need to go in the Animals table.
The Excel data will also have other identifiers, like Breed, Color, Age, Favorite Toy, Veterinarian, etc.
These identifiers will change with each new Excel file. Some may repeat, others are brand new.
Because the fields change, and I never know what new fields will come with each new Excel file, my Animals table only has Animal Id and Animal Name.
I've created a Values table to hold all the other identifier fields. That table is structured like this:
AnimalId
Value
FieldId
DataFileId
And then I have a Fields table that holds the key to each FieldId in the Values table.
I do this because the alternative is to keep a big table with fields that may not even be used each time I need to add data. A big table with a lot of null columns.
I'm not sure my way is a good way either. It can seem overly complex.
But, assuming it is a good way, what is the best way to get this Excel data into my Values table? The list of animals is easy to add to my Animals table. But for each identifier (Breed, Color, etc.) I have to copy or import the values and then update the table to assign a matching FieldId (or create a new FieldId in the Fields table if it doesn't exist yet).
It's a huge pain to load new data if there are a lot of identifiers. I'm really struggling and could use a better system.
Any advice, help, or just pointing me in a better direction would be really appreciated.
Thanks.
Depending on your client (e.g., I use SequelPro on a Mac), you might be able to import CSVs. This is generally pretty shaky, but you can also export your Excel document as a CSV... how convenient.
However, this doesn't really help with your database structure. Granted, using foreign keys is a good idea, but importing that data unobtrusively (and easily) is something that will likely need to be done a row at a time.
However, you could try modifying something like this to suit your needs, by first exporting your Excel document as a CSV, removing the header row (the first one), and then using regular expressions on it to change it into a big chunk of SQL. For example:
Your CSV:
myval1.1,myval1.2,myval1.3,myval1.4
myval2.1,myval2.2,myval2.3,myval2.4
...
At which point, you could do something like:
myCsvText.replace(/^(.+),(.+),(.+),(.+)$/mg, 'INSERT INTO table_name (col1, col2, col3, col4) VALUES ($1, $2, $3, $4);')
where you know the number of columns, their names, and how their values are organized (via the regular expression & replacement).
Might be a good place to start.
Your table looks OK. Since you have a variable number of fields, it seems logical to expand vertically, although you might want to make it easier on yourself by changing FieldId and DataFileId into FieldName and DataFileName, unless you will use them in a lot of other tables too.
Getting data from Excel into SQL Server is unfortunately not as easy as you would expect from two Microsoft products interacting with each other. There are several routes that I know of that you can take:
Work with CSV files instead of Excel files. Excel can edit CSV files just as easily as Excel files, but CSV is an infinitely more reliable datasource when it comes to importing. You don't get problems with different file formats for different Excel versions, Excel having to be installed on the computer that will run the script or quirks with automatic datatype recognition. A CSV can be read with the BCP commandline tool, the BULK INSERT command or with SSIS. Then use stored procedures to convert the data from a horizontal bulk of columns into a pure vertical format.
Use SSIS to read the data directly from the Excel file(s). It is possible to make a package that loops over several Excel files. A drawback is that the column format and the sheet name of the Excel file have to be known beforehand, so a different template (with a separate loop) has to be made each time a new Excel format arrives. There are third-party SSIS components that claim to be more flexible, but I haven't tested them yet.
Write a Visual C# program or PowerShell script that grabs the Excel file, extracts the data and writes it into your SQL table. Visual C# is a pretty easy language with powerful interfaces into Office and SQL Server. I don't know how big the learning curve is to get started, but once you do, it will be a pretty easy program to write. I have also heard good things about PowerShell.
Create an Excel macro that uses VBA code to open other Excel files, loop through their data and write the results either to a predefined sheet or to disk as CSV. Once everything is in a standard format, it will be easy to import the data using one of the above methods.
Since I have had headaches with 1) and 2) before, I would advise either 3) or 4). Because of my greater experience with VBA than with Visual C# or PowerShell, I'd go for 4) if I were in a hurry, but I think 3) is the better investment for the long term (see the sketch below).
(You could also get adventurous and use another scripting language, such as Python, as I once did because Python is cool; unfortunately, Python offers pretty slow and limited interfaces to SQL Server and Excel.)
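For what it's worth, here is a rough sketch of route 3 in VB.Net (a close cousin of both C# and VBA): read a CSV exported from the Excel file and push each identifier column into the Values table a row at a time. The table and column names come from the question; the connection string, file path and the fieldIds lookup (header name to FieldId) are illustrative assumptions, and the sketch assumes the first CSV column already carries the AnimalId (in practice you'd look it up from the Animals table by name).

Imports System.Data.SqlClient
Imports Microsoft.VisualBasic.FileIO

Module LoadValues
    Sub ImportCsv(csvPath As String, dataFileId As Integer, fieldIds As Dictionary(Of String, Integer))
        Using conn As New SqlConnection("Server=.;Database=Animals;Integrated Security=true"),
              parser As New TextFieldParser(csvPath)
            conn.Open()
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")

            ' First row: AnimalId followed by the identifier columns (Breed, Color, ...).
            Dim headers As String() = parser.ReadFields()

            While Not parser.EndOfData
                Dim row As String() = parser.ReadFields()
                Dim animalId As Integer = Integer.Parse(row(0))

                ' One Values row per identifier column on this line.
                For col As Integer = 1 To headers.Length - 1
                    Using cmd As New SqlCommand(
                        "INSERT INTO [Values] (AnimalId, [Value], FieldId, DataFileId) " &
                        "VALUES (@animal, @value, @field, @file)", conn)
                        cmd.Parameters.AddWithValue("@animal", animalId)
                        cmd.Parameters.AddWithValue("@value", row(col))
                        cmd.Parameters.AddWithValue("@field", fieldIds(headers(col)))
                        cmd.Parameters.AddWithValue("@file", dataFileId)
                        cmd.ExecuteNonQuery()
                    End Using
                Next
            End While
        End Using
    End Sub
End Module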
Good luck!