Using OpenRefine to create a mapping table from two other tables

I have the following use case that OpenRefine seems to be a good candidate to solve. I have an existing, "dirty" product table in my database that looks like this:
id name
51 Product A
52 product-a
53 product B
54 productb
55 produtc
56 productc
I have a new, "clean" product table that looks like this:
id name
1 Product A
2 Product B
3 Product C
I'd like to use OpenRefine's clustering to generate a mapping file, to help me map products from the old table to the new table:
id name old_id
1 Product A 51
1 Product A 52
2 Product B 53
2 Product B 54
3 Product C 55
3 Product C 56
But I can't quite get OpenRefine to do what I want. Any advice for how to achieve this?

As was already pointed out, there is no direct way to achieve this, but with the help of support tables and the cross function you can get the desired result:
Take the column "name" from the dirty table and from the clean table, and combine them into a single column. Don't worry about the ids at this point.
Import the combined list into OpenRefine (e.g. as project "product names").
Duplicate the column "name" (the only column so far) and name the new one "name_new".
Cluster the column "name_new" and replace all of the old names with the correct new ones. Some manual adjustments might be required at this point.
Your result should now look like this:
name name_new
Product A Product A
product-a Product A
product B Product B
productb Product B
produtc Product C
productc Product C
Product A Product A
Product B Product B
Product C Product C
Import the dirty table as project "products" and the clean table as project "products clean".
In the project "products", transform the column "name" using
value.cross("product names","name").cells["name_new"].value[0]
Rename the column "id" to "old_id".
Add a new column based on "name" using
value.cross("products clean","name").cells["id"].value[0]
and save it as "id". The table "products" now has the desired structure.
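To make the two cross lookups easier to follow, here is a minimal Python sketch of the same logic outside OpenRefine. It is only an illustration: the dictionaries stand in for the projects "product names" and "products clean", and the variable names are my own, not anything OpenRefine defines.

```python
# Stand-in for the "product names" project: name -> name_new after clustering.
name_to_new = {
    "Product A": "Product A", "product-a": "Product A",
    "product B": "Product B", "productb": "Product B",
    "produtc": "Product C", "productc": "Product C",
    "Product B": "Product B", "Product C": "Product C",
}

# Stand-in for the "products clean" project: name -> id.
clean_id_by_name = {"Product A": 1, "Product B": 2, "Product C": 3}

# Stand-in for the "products" project (the dirty table).
products = [
    {"id": 51, "name": "Product A"}, {"id": 52, "name": "product-a"},
    {"id": 53, "name": "product B"}, {"id": 54, "name": "productb"},
    {"id": 55, "name": "produtc"}, {"id": 56, "name": "productc"},
]

mapping = []
for row in products:
    # First cross: replace the dirty name with the clustered name.
    new_name = name_to_new[row["name"]]
    # Second cross: look up the clean id; the old id is kept as "old_id".
    mapping.append({"id": clean_id_by_name[new_name],
                    "name": new_name,
                    "old_id": row["id"]})

for row in mapping:
    print(row["id"], row["name"], row["old_id"])
```

Each cross call is an exact-match lookup, which is why the clustering step has to run first.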
I hope this helps.

The clustering function is limited to a single column: it finds similar strings within that column.
OpenRefine doesn't yet have string similarity functions across two or more tables or projects (fuzzy joins) in the way that your use case requires, so you would have to use other tools for this. A common tool that I've seen folks use for fuzzy joining, and express satisfaction with, is MS Power BI (Desktop is free but has limits on relationships and exporting; the Pro version is only $10 a month and can be canceled anytime). If you want something completely free, a few R packages do this, one of which is https://www.rdocumentation.org/packages/fuzzyjoin/versions/0.1.4
In OpenRefine, we totally want to allow fuzzy joins across projects/datasets in the future, and it's on our issue list, but we just haven't had the funding to implement this along with tons of other features we know users would like to see.
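For a sense of what a fuzzy join does, here is a rough sketch in Python using only the standard library's difflib. This is not OpenRefine functionality and not the fuzzyjoin package; the normalize helper and the 0.8 cutoff are assumptions chosen for the sample data in the question, and real data would likely need tuning and manual review.

```python
import difflib
import re

def normalize(name):
    """Hypothetical helper: lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

dirty = [(51, "Product A"), (52, "product-a"), (53, "product B"),
         (54, "productb"), (55, "produtc"), (56, "productc")]
clean = [(1, "Product A"), (2, "Product B"), (3, "Product C")]

# Index the clean table by its normalized name.
clean_by_key = {normalize(name): (cid, name) for cid, name in clean}

mapping = []
for old_id, old_name in dirty:
    # Pick the single most similar clean key, if any is similar enough.
    matches = difflib.get_close_matches(
        normalize(old_name), list(clean_by_key), n=1, cutoff=0.8)
    if matches:
        new_id, new_name = clean_by_key[matches[0]]
        mapping.append((new_id, new_name, old_id))

for row in mapping:
    print(*row)
```

Rows with no match above the cutoff are simply left out of the mapping here, so they can be reviewed by hand.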

Related

How do you 'join' multiple SQL data sets side by side (that don't link to each other)?

How would I go about joining results from multiple SQL queries so that they are side by side (but unrelated)?
The reason I am thinking of this is so that I can run 1 query in Google Big Query and it will return 1 single table which I can import into Excel and do some charts.
e.g. Query 1 looks at dataset TableA and returns:
**Metric:** Sales
**Value:** 3,402
And then Query 2 looks at dataset TableB and returns:
**Name:** John
**DOB:** 13 March
They would both use different tables and different filters, etc.
What would I do to make it look like:
---Sales----------John----
---3,402-------13 March----
Or alternatively:
-----Sales--------3,402-----
-----John-------13 March----
Or is there a totally different way to do this?
I can see the use case for the above. I've used something similar to create a single table from multiple tables with different metrics to query in Data Studio, so that filters apply to all data in the dataset, for example. However, in that case the data did share some dimensions, which made it worthwhile.
If you are going to put those together with no relationship between the tables, I'd have 4 columns, with a TYPE column describing the data in each row to make filtering easier.
Type | Sales | Name | DOB
Use UNION ALL to put the rows together so you have something like
"Sales" | 3402 | null | null
"Customer Details" | null | John | 13 March
However, like the others said, make sure you have a good reason to do that; otherwise you're just creating a bigger table to query for no reason.
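As a quick way to try the UNION ALL layout, here is a small sketch that runs it against an in-memory SQLite database from Python. The table and column names (table_a, table_b, metric, value, name, dob) are made up for the illustration, and BigQuery syntax may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (metric TEXT, value INTEGER);
CREATE TABLE table_b (name TEXT, dob TEXT);
INSERT INTO table_a VALUES ('Sales', 3402);
INSERT INTO table_b VALUES ('John', '13 March');
""")

# Each SELECT fills its own columns and pads the others with NULL,
# so both result sets have the same four columns.
rows = conn.execute("""
SELECT 'Sales' AS type, value AS sales, NULL AS name, NULL AS dob
FROM table_a
UNION ALL
SELECT 'Customer Details', NULL, name, dob
FROM table_b
""").fetchall()

for row in rows:
    print(row)
```

The type column is what lets you filter the combined result back into its two halves later.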

SQL query for parsing a body of text to extract a string from a list

I need help on an SQL query to perform the following:
I have a table with a list of possible string values for products.
I have a second table with free form text in which this product may be mentioned. Is there any way an SQL query can extract the string if it is present in the 1st table?
I read on another SO post about CHARINDEX and SUBSTRING. Will this be efficient in this scenario? How can I apply this in my use case?
An example for my scenario is this:
My table, PRODUCTS, has the following format,
Product
XXX
YYY
ZZZ
DDD
The other table has a column in which there is large amount of text in which this product will be mentioned. Like:
Record Number User Review
1 I like XXX for its versatility but YYY is better.
2 XXX is a horrible product. DO not buy.
3 YYY and DDD are best in class. Many do not know how to use it.
Now I want to extract the product names using a query in this manner.
Record Number Product in Review
1 XXX
1 YYY
2 XXX
3 YYY
3 DDD
Thank you in advance for your time and help.
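This should work, but it will be slow on big tables, because a LIKE pattern with a leading wildcard cannot use an ordinary index (the + concatenation is SQL Server syntax):
Select p.id, f.id, p.name from product p
Inner Join freeform f on f.text like '%' + p.name + '%'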
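Here is a small sketch of the same LIKE join using the sample data from the question, run against an in-memory SQLite database from Python. The table and column names (products, reviews, record_number, user_review) are assumptions, and SQLite concatenates with || rather than +.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product TEXT);
CREATE TABLE reviews (record_number INTEGER, user_review TEXT);
INSERT INTO products VALUES ('XXX'), ('YYY'), ('ZZZ'), ('DDD');
INSERT INTO reviews VALUES
    (1, 'I like XXX for its versatility but YYY is better.'),
    (2, 'XXX is a horrible product. DO not buy.'),
    (3, 'YYY and DDD are best in class. Many do not know how to use it.');
""")

# One output row per (review, product) pair where the product name
# occurs somewhere in the review text.
rows = conn.execute("""
SELECT r.record_number, p.product
FROM reviews r
INNER JOIN products p ON r.user_review LIKE '%' || p.product || '%'
ORDER BY r.record_number
""").fetchall()

for record_number, product in rows:
    print(record_number, product)
```

Note that this is a plain substring match, so a product name that is part of a longer word would also be reported.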

in theory: would this be possible: SQL query for efficient use of build parts

I have a couple of tables, which I have simplified to the extreme for the purpose of this example.
Table 1, 'Units'
ID UnitName PartNumber
1 UnitX 1
2 UnitX 1
3 UnitX 2
4 UnitX 3
5 UnitX 3
6 UnitY 1
7 UnitY 2
8 UnitY 3
Table 2, 'Parts'
ID PartName Quantity
1 Screw 2
2 Wire 1
3 Ducttape 1
I would like to query these tables to find out which of these units are possible to build and, if so, which one should ideally be built first to make efficient use of the parts.
Now the question is: can this be done using SQL, or is a background application required or more efficient?
So in this example it is easy, because only one unit (UnitY) can be built. But I guess you get the idea. (I'm not looking for a shake and bake answer, just your thoughts on this.)
Thanks
As you present it, SQL is an efficient way to do this. As you describe it, the PartNumber column of the Units table is a foreign key to the ID column of the Parts table, so a simple outer join, or selecting the units whose PartNumber is NOT IN the Parts table, would give you the units that cannot be built.
However, if your database schema consists of many non-normalised tables, or is very complex, lacks indexes, or has other "bad" properties, it could be worth examining whether dedicated application code is faster. I really doubt it for this particular case, though; the query seems trivial.
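One possible reading of the sample tables is that each row in Units is one required piece of a part and Quantity is the stock on hand. Under that assumption (it is mine, not something the question states outright), the buildable units can be found with a grouped query. The sketch below runs it on an in-memory SQLite database from Python and does not attempt the "which one first" part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Parts (ID INTEGER PRIMARY KEY, PartName TEXT, Quantity INTEGER);
CREATE TABLE Units (ID INTEGER PRIMARY KEY, UnitName TEXT, PartNumber INTEGER);
INSERT INTO Parts VALUES (1, 'Screw', 2), (2, 'Wire', 1), (3, 'Ducttape', 1);
INSERT INTO Units VALUES
    (1, 'UnitX', 1), (2, 'UnitX', 1), (3, 'UnitX', 2), (4, 'UnitX', 3),
    (5, 'UnitX', 3), (6, 'UnitY', 1), (7, 'UnitY', 2), (8, 'UnitY', 3);
""")

# Inner query: how many pieces of each part a unit needs vs. what is in stock.
# Outer query: keep units with no missing part and no shortfall.
query = """
SELECT UnitName
FROM (
    SELECT u.UnitName, u.PartNumber,
           COUNT(*) AS needed, MIN(p.Quantity) AS available
    FROM Units u
    LEFT JOIN Parts p ON p.ID = u.PartNumber
    GROUP BY u.UnitName, u.PartNumber
) AS requirements
GROUP BY UnitName
HAVING SUM(CASE WHEN available IS NULL OR needed > available
                THEN 1 ELSE 0 END) = 0
"""
buildable = [row[0] for row in conn.execute(query)]
print(buildable)
```

The LEFT JOIN keeps parts that are missing from the Parts table, so they count as a shortfall instead of silently disappearing.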

Minimum price selection

Situation looks like this:
We have product 'A123' and we have to remember the lowest price for it.
Prices for one product come from an arbitrary number of shops, and there is no way to tell when shop x will send us its price for 'A123'.
So I had SQL table with columns:
product_number
price
shop (from which shop this price comes)
An SQL function for updating product price looks like this (this is SQL pseudo-code, syntax doesn't matter):
function update_product(in_shop, in_product_number, in_price)
    select price, shop into productRow from products where product_number = in_product_number;
    if found then
        if (productRow.price > in_price) or (productRow.price < in_price and productRow.shop = in_shop) then
            update row with new price and new shop
        end if;
    else
        insert new product that we didn't have before
    end if;
The (productRow.price < in_price and productRow.shop = in_shop) condition is there to handle a situation like this:
In products table we have
A123 22.5 amazon
then comes info from amazon again:
A123 25 amazon
Thanks to the above condition we update the price to the higher value, which is correct behavior.
But the algorithm fails in this situation: again we have a row in the products table:
A123 22.5 amazon
then comes info from merlin
A123 23 merlin (we don't update because price is higher)
then comes info from amazon
A123 35 amazon
and we update table and now we have:
A123 35 amazon
but this is wrong, because merlin sent a lower price for that product earlier.
Any idea how to avoid this situation?
The only way you are going to solve your problem is to keep track of the price per shop and then return only the lowest current price. So, for example, you would keep a table like the one you already have, but with one row per product and shop, and select from it with something like:
select min(price)
from products
where product_number = :my_product
Personally, I would also keep a timestamp for each price update, so that you can tell when you received it.
To make this work you should maintain a table that contains the following:
Product
Supplier
LatestPrice
Then identify the current best supplier by querying that table. You can do this either when requested or whenever the table is updated; either way you simplify the problem at the price of a slightly more complex schema and queries.
Additional (following comment):
OK, this is going to mean that you need to store more data, but you don't have a huge amount of choice. The data is required to solve the problem, so you either: a) have to update prices from all suppliers concurrently and then choose the best price from that snapshot, or b) store the prices as you get them and pick the best price from the data you've got. The former implies a fairly hefty overhead in terms of fetching and processing data, whereas the latter is basically a fairly modest storage problem and something any decent database will cope with easily.
Basically, the problem is that you only store the lowest price from 1 vendor. You have to keep records of prices of all vendors, and use a selection query to select the minimum.
For example, If you have:
A123 22.5 Amazon
and you got:
A123 23 Merlin
You have to insert it, even if it is with higher price, because it's a different vendor. So you'll have:
A123 22.5 Amazon
A123 23 Merlin
When you get the new price from Amazon, for example: 25, you just update it. So you'll get:
A123 25 Amazon
A123 23 Merlin
Then select the lowest price: Merlin's, in this case.
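The per-vendor approach can be tried out with a small sketch like the following, which uses an in-memory SQLite database from Python. The table name prices and the helper functions are hypothetical; the point is only that the primary key on (product_number, shop) keeps exactly one current price per shop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE prices (
    product_number TEXT,
    shop TEXT,
    price REAL,
    PRIMARY KEY (product_number, shop)
)
""")

def update_price(shop, product_number, price):
    # Insert the shop's price, or overwrite that shop's previous price.
    conn.execute(
        "INSERT OR REPLACE INTO prices (product_number, shop, price) "
        "VALUES (?, ?, ?)",
        (product_number, shop, price))

def lowest_price(product_number):
    # Return (price, shop) for the cheapest current offer, or None.
    return conn.execute(
        "SELECT price, shop FROM prices WHERE product_number = ? "
        "ORDER BY price LIMIT 1",
        (product_number,)).fetchone()

# The failing sequence from the question.
update_price("amazon", "A123", 22.5)
update_price("merlin", "A123", 23)
update_price("amazon", "A123", 35)

best = lowest_price("A123")
print(best)
```

After the third update the lowest price comes from merlin, which is the behavior the original single-row design could not produce.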

Representing ecommerce products and variations cleanly in the database

I have an ecommerce store that I am building. I am using Rails/ActiveRecord, but that really isn't necessary to answer this question (however, if you are familiar with those things, please feel free to answer in terms of Rails/AR).
One of the store's requirements is that it needs to represent two types of products:
Simple products - these are products that just have one option, such as a band's CD. It has a basic price, and quantity.
Products with variation - these are products that have multiple options, such as a t-shirt that has 3 sizes and 3 colors. Each combination of size and color would have its own price and quantity.
I have done this kind of thing in the past, and done the following:
Have a products table, which has the main information for the product (title, etc).
Have a variants table, which holds the price and quantity information for each type of variant. A Product has_many Variants.
For simple products, they would just have one associated Variant.
Are there better ways I could be doing this?
I worked on an e-commerce product a few years ago, and we did it the way you described. But we added one more layer to handle multiple attributes on the same product (size and color, like you said). We tracked each attribute separately, and we had a "SKUs" table that listed each attribute combination that was allowed for each product. Something like this:
attr_id attr_name
1 Size
2 Color
sku_id prod_id attr_id attr_val
1 1 1 Small
1 1 2 Blue
2 1 1 Small
2 1 2 Red
3 1 1 Large
3 1 2 Red
Later, we added inventory tracking and other features, and we tied them to the sku IDs so that we could track each one separately.
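Roughly, the two tables above could be set up and queried as in the sketch below, which uses an in-memory SQLite database from Python. The table names attributes and sku_attributes are my own guesses for illustration, since only the column names appear above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (attr_id INTEGER PRIMARY KEY, attr_name TEXT);
CREATE TABLE sku_attributes (
    sku_id INTEGER,
    prod_id INTEGER,
    attr_id INTEGER REFERENCES attributes(attr_id),
    attr_val TEXT,
    PRIMARY KEY (sku_id, attr_id)
);
INSERT INTO attributes VALUES (1, 'Size'), (2, 'Color');
INSERT INTO sku_attributes VALUES
    (1, 1, 1, 'Small'), (1, 1, 2, 'Blue'),
    (2, 1, 1, 'Small'), (2, 1, 2, 'Red'),
    (3, 1, 1, 'Large'), (3, 1, 2, 'Red');
""")

# Collect the allowed attribute combinations for product 1, one dict per SKU.
skus = {}
for sku_id, attr_name, attr_val in conn.execute("""
    SELECT s.sku_id, a.attr_name, s.attr_val
    FROM sku_attributes s
    JOIN attributes a ON a.attr_id = s.attr_id
    WHERE s.prod_id = 1
    ORDER BY s.sku_id, a.attr_id
"""):
    skus.setdefault(sku_id, {})[attr_name] = attr_val

for sku_id, combo in skus.items():
    print(sku_id, combo)
```

Price, quantity and inventory rows would then hang off sku_id, so a simple product is just a product with a single SKU.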
Your way seems pretty flexible. It would be similar to my first cut.