OpenRefine: cell.cross for similar but not identical values

I have two datasets:
one dataset has names of countries, but dirty ones, like:
Gaule Cisalpine (province romaine)
Gaule belgique
Gaule , Histoire
Gaule
etc.
the second dataset has two columns, the names of countries (clean) and a code, like:
Gaule | 1DDF
Is there a way to use cell.cross with value.contains()? I tried to use reconcile-csv, but it didn't work properly (it only matches exact values).

I haven't been able to think of a great way of doing this, but given that the substring you want to match between the two files is always the first thing in the 'messy' string, one approach that might work in OpenRefine is to create a 'match' column in each project for the cross matching.
In the 'clean' project use 'Add column based on this column' on the 'Country name' column, and in the GREL transform use:
value.fingerprint()
The 'fingerprint' transformation is the same as the one used when clustering with the key collision/fingerprint method; basically I'm just using it here to get rid of any minor differences between country names (like upper/lower case or special characters).
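For example (this is my understanding of the fingerprint keyer, so treat the exact outputs as approximate): both "Gaule" and "GAULE " come out as "gaule", and "Gaule , Histoire" comes out as "gaule histoire", because the keyer trims, lowercases, strips punctuation and accents, and sorts the remaining words.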
Then in the 'messy' project, create a new column based on the dirty 'name of country' column, again using 'Add column based on this column', but in this case use a GREL transform something like:
value.split(/[\s,-\.\(\)]/)[0].fingerprint()
The first part of this, value.split(/[\s,-\.\(\)]/), splits the string into individual words (using space, comma, hyphen, full stop, or an opening or closing bracket as a separator). Then the [0] takes the first string (so the first word in the cell), and fingerprint() is applied as before.
Now you have a column in each project whose cells should match on exact content, and you can use these to do the lookup between the two projects.
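For the lookup itself, a cell.cross expression on the new 'match' column in the messy project should work - something like the following, assuming the clean project is called 'Clean countries' and its code column is called 'Code' (substitute your real project and column names):
cell.cross("Clean countries", "match")[0].cells["Code"].value
cell.cross returns an array of matching rows, so the [0] takes the first match; rows with no match will produce an error, which you can handle with the 'On error' option in the 'Add column based on this column' dialog.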
This isn't going to be completely ideal - for example, it won't work for country names which consist of multiple words. However, you could add some additional key columns to the 'messy' project which use the first 2, 3, 4, etc. words rather than just the first one as given here, e.g.:
filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,2).join(" ").fingerprint()
filter(value.split(/[\s,-\.\(\)]/),v,isNonBlank(v)).get(0,3).join(" ").fingerprint()
etc. (I've done a bit more work here to make sure blank entries are ignored - it's the get() command that's the key bit for getting the different numbers of words). For example, on 'Gaule Cisalpine (province romaine)' the split-and-filter step gives ["Gaule", "Cisalpine", "province", "romaine"], get(0,2) keeps the first two words, and after join() and fingerprint() you get a key that would match a clean entry of 'Gaule Cisalpine'.
I'm guessing that most country names are only a few words long, so only a few extra columns would be needed.
I've not been able to come up with a better approach so far. I'll post some more here if I come up with anything else. You could also try asking on the OpenRefine forum https://groups.google.com/forum/#!forum/openrefine

Related

Combine multiple Excel files with similar names

I have a somewhat general question about combining multiple Excel files together. Normally, I would use pd.read_excel to read the files and then concat to join them. However, I have some cases where the field names are not exactly the same, but similar. For example,
one sheet would have fields like: Apple, Orange, Size, Id
another sheet would be: Apples, orange, Sizes, #
I have used the rename columns function, but with this I have to check and compare every name in each file. I wonder if there's any way to combine them without going through all the field names. Any thoughts? THANKS!
Define what it means for two strings to be the same, then you can do the renaming automatically (you'll also need to determine what the "canonical" form of the string is - the name that you'll actually use in the final dataframe). The problem is pretty general, so you'll have to decide based on the sort of column names that you're willing to consider the same, but one simple thing might be to use a function like this:
def compare_columns(col1: str, col2: str) -> bool:
    return col1.lower() == col2.lower()
Here you'd be saying that any two columns with the same name up to differences in case are considered equal. You'd probably want to define the canonical form for a column to be all lowercase letters.
Actually, now that I think about it, since you'll need a canonical form for a column name anyway, the easiest approach would probably be, instead of comparing names, to just convert all names to canonical form and then merge like usual. In the example here, you'd rename all columns of all your dataframes to be their lowercase versions, then they'll merge correctly.
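A minimal sketch of that idea, assuming the files load fine with pd.read_excel and that lowercasing plus a naive trailing-'s' strip is an acceptable canonical form (the file names and the rules here are placeholders to adjust):

import pandas as pd

def canonical(name: str) -> str:
    # Assumed canonical form: trimmed, lowercased, naive singular.
    # Deliberately crude - 'status' would become 'statu' - so tune
    # the rules to the differences you actually see in your headers.
    name = name.strip().lower()
    if name.endswith("s"):
        name = name[:-1]
    return name

# Placeholder paths; df.rename accepts a function applied to each column name.
frames = [pd.read_excel(f) for f in ["sheet1.xlsx", "sheet2.xlsx"]]
frames = [df.rename(columns=canonical) for df in frames]
combined = pd.concat(frames, ignore_index=True)

With the example headers, Apple/Apples, Orange/orange, and Size/Sizes all collapse to the same name; Id vs. # is not a spelling variant, so it would still need an explicit rename.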
The hard part will be deciding what transforms to apply to each name to get it into canonical form. Any transformation you do has a risk of combining data that wasn't meant to be combined (even just changing the case), so you'll need to decide for yourself what's reasonable to change based on what you expect from your column names.
As @ako said, you could also do this with something like Levenshtein distance, but I think that will be trickier than just determining a set of transforms to apply to each column name. With Levenshtein or similar, you'll need to decide which name to rename to, but you'll also have to track all the names that map to that name and compute the distance to the closest member of that group when deciding whether a new name maps to that canonical name (e.g. say you have "Apple", "Aple", and "Ale" and are merging names with an edit distance of 1 or less. "Apple" and "Aple" should be merged, as should "Aple" and "Ale". "Apple" and "Ale" normally shouldn't be (their distance is 2), but because they both merge with "Aple", they now also merge with each other).
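To make that concrete, here's a rough, self-contained sketch (a textbook edit-distance function plus a greedy grouping rule - the any-member check is exactly what produces the Apple/Aple/Ale chain described above):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def group_names(names, max_dist=1):
    # A name joins a group if it is within max_dist of ANY member,
    # so groups can chain together via intermediate spellings.
    groups = []
    for name in names:
        for group in groups:
            if any(levenshtein(name, member) <= max_dist for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

print(group_names(["Apple", "Aple", "Ale"]))  # [['Apple', 'Aple', 'Ale']]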
You could also look into autocorrect to try to convert things like "Aple" to "Apple" without needing "Ale" to also merge in; I'm sure there's some library for doing autocorrect in Python. Additionally, there are NLP tools that will help you if you want to do stemming to try to merge things like "Apples" and "Apple".
But it'll all be tricky. Lowercasing things probably works, though =)

SQL - How do I find the exact string in a list of strings that can be part of another string in the list?

The question sounds confusing, but I have a column of data that is divided up by the delimiter |~*~|. I am trying to find data based on the strings between these delimiters. The string doesn't start with the delimiter, but it does end with it.
e.g. Product Developer|~*~|Technician|~*~|
The issue I've run into is the following:
Product Developer|~*~|Technician|~*~|Lead Product Developer|~*~|
If I search with WHERE T.COLUMN LIKE '%Lead Product Developer%' it is fine, I get that row; but if I search with WHERE T.COLUMN LIKE '%Product Developer%' then I get rows containing both Lead Product Developer and Product Developer, since the one is part of the other. How can I avoid this and only get exactly the string I am looking for?
Here is a snippet of what I have:
SELECT
    T.COLUMN
FROM TABLE1 T
WHERE T.COLUMN LIKE '%Product Developer%'
First, your data structure is wrong, wrong, wrong. The correct SQL way of representing this relationship is with another table, with one row per entity and list item. In addition, Oracle offers JSON structures and nested tables. So, there is no shortage of ways to do this right.
Sometimes, we are stuck with other people's really bad design decisions. In this case, you can use LIKE in a clever way:
WHERE '|~*~|' || T.COLUMN LIKE '%|~*~|Product Developer|~*~|%'
Because column ends with a delimiter, you don't need to add it to the end again, only to the beginning.
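Putting that together with the snippet from the question:

SELECT
    T.COLUMN
FROM TABLE1 T
WHERE '|~*~|' || T.COLUMN LIKE '%|~*~|Product Developer|~*~|%'

Both delimiters around the wanted value now have to be present, so a search for 'Product Developer' no longer matches 'Lead Product Developer'.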

SQL: LIKE and Contains — Different results

I am using the MS SQL Express function CONTAINS to select data. However, when I selected the same data with the LIKE operator, I realised that the CONTAINS function was missing a few rows.
I rebuilt the indexes, but it didn't help.
SQL: brs.SearchText LIKE '%aprilis%' and CONTAINS(brs.SearchText, '*aprilis*')
The CONTAINS function missed rows like:
22-28.aprīlis
[1.aprīlis]
Sīraprīlis
PS: if I search directly with CONTAINS(brs.SearchText, '*22-28.aprīlis*'), then it finds them.
CONTAINS is functionality based on the full-text index. It supports words, phrases, and prefix matches on words, but not suffix matches. So you can match words that start with 'aprilis', but not words that end with it or contain it arbitrarily in the middle. You might be able to take advantage of a thesaurus for these terms.
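For example, a prefix term has to be double-quoted inside the search condition (using the question's column with a hypothetical table name, and noting that whether the accented 'aprīlis' matches 'apr*' depends on the catalog's accent-sensitivity setting):

-- Prefix search works: matches words that BEGIN with 'apr'
SELECT * FROM BookRecords brs WHERE CONTAINS(brs.SearchText, '"apr*"');

-- There is no suffix/infix form, so '%aprilis%'-style matching
-- has to fall back to LIKE:
SELECT * FROM BookRecords brs WHERE brs.SearchText LIKE '%aprilis%';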
This is explained in more detail in the documentation.

SQL Server Text Searching

I have a business requirement where we need to do some crazy name matching against records stored in the database, and I was wondering if there is any easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it were the other way round, I could have used a wildcard, but in this case the name in the database is shorter than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons. The way they store their data internally just isn't fit for really messy matching. Actually even a superpowerful dedicated search engine like Google, that has a LOT of messy matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot down the feature request as "technically impossible", or implement a rule set for which messy matches are tried - for your example, you could easily hard-code that multiple searches are executed when capitalized words are entered, shortening them to a single letter. No idea if that's a solution to your problem though.
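One cheap rule of that sort is to turn the LIKE around and use the stored (shorter) name as the pattern. A sketch with hypothetical table and column names:

-- Matches when the stored name is a prefix of what the user typed,
-- e.g. stored 'Austin K' vs. search text 'Austin Kierland'.
-- (Breaks if stored names contain LIKE metacharacters such as % or _.)
SELECT p.*
FROM People p
WHERE 'Austin Kierland' LIKE p.Name + '%';

In practice you'd parameterize the search text, e.g. WHERE @searchText LIKE p.Name + '%'.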
You can do a normal search using the LIKE operator, which determines whether a character string matches a specified pattern. The problem you will run into is the probability of returning multiple records or incorrect people. I've had a similar requirement myself for a business app, and the best solution to the issue is to require other qualifying values rather than just a name. If you do a partial name search without other qualifying data, you are certainly going to come across false positive matches and/or multiple records. In my case I built a web service that checks eligibility, allowing a text search for first and last name but also adding date of birth, primary person SSN, and gender, which ensured the matching person was indeed the person intended. If, as in your situation, name is the only search criterion, my recommendation to the business would be that we cannot perform the search until qualifying data is entered into the database; otherwise there is no accurate way to query the results they are looking for.

Wildcards in database

Anyone have any pointers on how I can store wildcards in a database and then see which row(s) a string matches? Can it be done?
e.g.
DB contains a table like:
Name      | Multiplier
john3136* | 10
fred*     | 0.5
So john3136 should get 10 times his regular pay. fred3136 would get half his regular pay.
harry3136 probably crashes the app, since there is no matching data ;-)
The code needs to do something like:
foreach(Employee e in all_employees) {
SELECT Multiplier FROM PayScales WHERE
//??? e.Name matches the PayScales.Name wildcard
}
Thanks!
Edit
This is a real-world issue: I've got a parameter file that contains wildcards. The code currently iterates through employees and, for each one, iterates through the param file looking for a match - you can see why I'd like to "databaserize" it ;-)
Wildcards are optional. The row could have said "john3136" to only match one employee. (The real app isn't actually employees, so it does make sense even if it looks like overkill in this simple example)
One option open: I do know all the employee names before I start, so I could iterate through them and effectively expand the wildcards into a temporary table (so if I have john3136* in the starting table, it might expand to john3136, john31366, etc. based on the list of employees). I was hoping to find a better way than this, since it requires more maintenance (e.g. if we add functionality to add an employee, we need to maintain the expanded wildcards table).
SELECT * FROM payscales
WHERE e.Name LIKE regexp_replace(name, E'^\\*|\\*$', '%', 'g');
I don't know which database you're using. The above query works on PostgreSQL; it just replaces your leading and trailing wildcards with %, which is the LIKE wildcard.
If no wildcard is present, it must match the full string.
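To make the rewriting concrete, here is what that regexp_replace produces for a few stored patterns (assuming wildcards only appear at the start or end of a pattern, which is all the regex handles):

-- 'john3136*'  ->  'john3136%'   (matches john3136, john31366, ...)
-- '*3136'      ->  '%3136'       (matches anything ending in 3136)
-- 'fred3136'   ->  'fred3136'    (no wildcard, so only an exact match)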