Keep newest duplicate row depending on multiple Columns - openrefine

I seem to have a workflow problem with Open Refine (Google Refine 2.5 [r2407]) to do sophisticated duplicate row cleaning. All I have found so far is how to delete duplicate rows based on a single column.
My aim is to delete duplicate rows based on multiple columns, at best, in a specific hierarchy.
Example
Given the following dummy data in Refine
+----+---------+---------+--------+------------+------+-----------------------------------+
| id | timeAgo | title | author | date | val1 | [After Refine, keep Record] |
+----+---------+---------+--------+------------+------+-----------------------------------+
| 1 | 10 | Faust | Mr. A | 2014-01-15 | 10 | ->B, older entry |
| 2 | 11 | Faust | Mr. A | 2014-01-21 | 10 | A (because of Date) |
| 3 | 8 | Faust | Mr. A | 2014-01-15 | 10 | B |
| 4 | 8 | RedHead | Mr. B | 2014-01-21 | 34 | ->D, older entry |
| 5 | 7 | RedHead | Mr. B | 2014-01-21 | 34 | ->D, same time Ago, but lower ID |
| 6 | 7 | RedHead | Mr. A | 2014-01-01 | 13 | C (because of author, date, val1) |
| 7 | 7 | RedHead | Mr. B | 2014-01-21 | 34 | D |
+----+---------+---------+--------+------------+------+-----------------------------------+
I want to kill the duplicate rows based on following logic. If
title && auther && date && val1 are the same, than
keep the newest (least timeAgo) row, if there are multiple, than
keep the one with the highest id
The Result would be:
+---------+----+---------+---------+--------+------------+------+
| Refined | id | timeAgo | title | author | date | val1 |
+---------+----+---------+---------+--------+------------+------+
| A | 2 | 10 | Faust | Mr. A | 2014-01-21 | 10 |
| B | 3 | 8 | Faust | Mr. A | 2014-01-15 | 10 |
| C | 6 | 7 | RedHead | Mr. A | 2014-01-01 | 13 |
| D | 7 | 7 | RedHead | Mr. B | 2014-01-21 | 34 |
+---------+----+---------+---------+--------+------------+------+
Easy Approach?
If there is no other solution, I thankfully take a scripting/GREL one.
But could it be done by Refines famous workflow "recording" to achieve above logic, so it could be extracted and applied to other same format datasets?
My motivation behind this is to enable employees to work more thoughtfully with data (beyond excel) but without confronting them right away with a full blown scripting language.

That sounds like a straightforward sorting problem.
Sort the records by title, author, time ago, and ID
Re-order rows permanently (IMPORTANT - it won't work if you forget this step)
Blank down on Title & Author
Move those two columns to the two left most positions
Join multivalued cells on remaining columns
Transform all columns from step 5 using value.split(',')[0] to extract the first value (which should be the value for the record you want if you sorted them in the right order

Related

Transform table from sequential identifier to real with attributes

I changed a but the context, but it's basically the same issue.
Imagine we are in a never-ending tunnel, shaped like a circle. We split every section of the circle, from 1 to 10 and we'll call each section slot (sl). There are 2 groups (gr) of living things walking in the tunnel. Each group has 2 bands, where each has a name and global hitpoints (hp). Every group is walking forward (although the bands might change order). If a group is at slot #10 and moves forward, he will be at slot #1. We snapshot their information every day. All the data gathered is stored in a table with this structure:
+----------+----------------+------------------+----------------+----------------+------------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+--+
| day_id | | gr_1_sl_1_id | | gr_1_sl_1_name | | gr_1_sl_1_hp | | gr_1_sl_2_id | | gr_1_sl_2_name | | gr_1_sl_2_hp | | gr_2_sl_1_id | | gr_2_sl_1_name | | gr_2_sl_1_hp | | gr_2_sl_2_id | | gr_2_sl_2_name | | gr_2_sl_2_hp | |
+----------+----------------+------------------+----------------+----------------+------------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+--+
| 1 | 3 | orc | 100 | 4 | goblin | 10 | 10 | human | 50 | 1 | dwarf | 25 | |
| 2 | 6 | goblin | 7 | 7 | orc | 76 | 2 | human | 60 | 3 | dwarf | 28 | |
+----------+----------------+------------------+----------------+----------------+------------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+--+
As you can see, the columns are structured in a sequential way, while the data shows what is the actual value. What I want is to have the information shaped this way instead:
+---------+-------+-------+-----------+---------+
| id_game | gr_id | sl_id | band_name | band_hp |
+---------+-------+-------+-----------+---------+
| 1 | 1 | 3 | orc | 100 |
| 1 | 1 | 4 | goblin | 10 |
| 1 | 2 | 10 | human | 50 |
| 1 | 2 | 1 | dwarf | 25 |
| 2 | 1 | 6 | goblin | 7 |
| 2 | 1 | 7 | orc | 76 |
| 2 | 2 | 2 | human | 60 |
| 2 | 2 | 3 | dwarf | 28 |
+---------+-------+-------+-----------+---------+
I have this information in power bi, although I can create views in sql server if need be. I have tried many things, closest thing I got was unpivoting and parsing the original columns to get day_id, gr_id, sl_id, attributes and values. In attributes and values, it's basically name and hp with their corresponding value (I changed hp into string), but then I'm stocked, I'm not sure what to do next.
Anyone has any ideas ? Keep in mind that I oversimplified the problem; there are more groups, more slots, more bands and more statistics (i.e. attack and defense rating, etc.)
You seem to want to unpivot the table. In SQL Server, I recommend using apply:
select t.day_id, v.*
form t cross apply
(values (1, 1, gr_1_sl_1_id, gr_1_sl_1_name, gr_1_sl_1_hp),
(1, 2, gr_1_sl_2_id, gr_1_sl_2_name, gr_1_sl_2_hp),
(2, 1, gr_2_sl_1_id, gr_1_sl_1_name, gr_2_sl_1_hp),
(2, 2, gr_2_sl_2_id, gr_1_sl_2_name, gr_2_sl_2_hp)
) v(id_game, gr_id, sl_id, band_name, band_hp);
In other databases, you can do something similar with union all.

How to add conditional count based on mutiple columns

I'm trying to summarise a T-SQL output that looks a little like this:
+---------+---------+-----+-------+
| perf_no | section | row | seat |
+---------+---------+-----+-------+
| 7128 | 6 | A | 4 |
| 7128 | 6 | A | 5 |
| 7128 | 6 | A | 7 |
| 7128 | 6 | A | 9 |
| 7128 | 6 | A | 28 |
| 7129 | 6 | A | 29 |
| 7129 | 6 | A | 8 |
| 7129 | 6 | A | 9 |
| 7129 | 8 | A | 6 |
| 7129 | 8 | B | 3 |
| 7129 | 8 | B | 4 |
+---------+---------+-----+-------+
Comparing one row to the row(s) below, if the perf_no, section, and row values are the same, and the difference between the seat values is 1, then I want to consider them a group, and count the number of rows in that group.
To give you a real world example, these are seats in a theatre! I'm trying to summarise what seats are available.
Using the table above to illustrate:
rows 1 & 2 show that seats 4 & 5 in section 6, row 8 for performance 7128 are available. So that's 2 seats together
row 3 shows that 7 in sectino 6, row 8 for performance 7128 is available on its own. So that's a single seat (1)
rows 5 & 6 have the same section and row, and the seats are consecutive, but you can see the performance is different. So that's a single seat too.
So the output for the table above would look a little like...
(I've left in the spaces just so visually you can see the groupings more easily - obviously the final version will have none)
+---------+---------+----------+-------+
| perf_no | section | seat_row | total |
+---------+---------+----------+-------+
| 7128 | 6 | A | 2 |
| | | | |
| 7128 | 6 | A | 1 |
| 7128 | 6 | A | 1 |
| 7128 | 6 | A | 1 |
| 7129 | 6 | A | 1 |
| 7129 | 6 | A | 2 |
| | | | |
| 7129 | 6 | A | 1 |
| 7129 | 8 | B | 2 |
+---------+---------+----------+-------+
I've been trying to use some conditional case statements to not much avail. Any assistance very gratefully received!
This is a type of gaps-and-islands problem. You can generate a grouping by subtracting a sequence (generated by row_number()) from the seat:
select perf_no, section, row, count(*) as num_seats,
min(seat) as first_seat, max(seat) as last_seat
from (select t.*,
row_number() over (partition by perf_no, section, row order by seat) as seqnum
from t
) t
group by perf_no, section, row, (seat - seqnum);

Looking up parent item based on a bill of materials

I'm trying to figure out how to put together a SQL statement that will let me find an end-item in our database based on its bill of materials. I guess you could say this is like a reverse BOM lookup question.
My table structure is pretty simple.
-End-item table
-Component table
-Linking table to tie together multiple components to an end item record.
The data I have is just the component list, and I want to find the end item. Since every bill of material is unique it has to match the bill of materials perfectly ie exact number of components and exact matches to the component SKU numbers. In some cases 2 end-items might use all the same components, but one of them just uses an extra part or two that makes the end-item SKU number different, so it has to account for that. That is, again, it has to match the BOM perfectly.
If not an outright answer, could someone at least steer me on the correct path to finding one?
------ UPDATE ----------
Table structure would be something like this.
ManufacturedPart
,--------------------,
| ID | PART_NUM |
|--------------------|
| 1 | V3175-01 |
| 2 | V3367-01 |
| 3 | V3988-01 |
| 4 | V3175-CV |
`--------------------`
Component
,--------------------,
| ID | COMP_NUM |
|--------------------|
| 1 | V3175 |
| 2 | V3367 |
| 3 | V3369 |
| 4 | V3114 |
| 5 | V3370 |
| 6 | V4060 |
| 7 | V3550 |
| 8 | V3988 |
`--------------------`
ManufacturedComponent
,-------------------------------------------------,
| ID | MANUFACTURED_PART_ID | COMPONENT_ID |
|-------------------------------------------------|
| 1 | 1 | 1 |
| 2 | 1 | 4 |
| 3 | 1 | 6 |
| 4 | 2 | 2 |
| 5 | 2 | 3 |
| 6 | 2 | 5 |
| 7 | 2 | 7 |
| 8 | 3 | 1 |
| 9 | 3 | 8 |
| 10 | 4 | 1 |
| 11 | 4 | 4 |
`-------------------------------------------------`
Assuming I have only the COMP_NUMs (component numbers) to search with I want to match back to the ManufacturedPart that contains that exact list of components.
So some examples: If I have components V3175, V3114, and V4060, it should match back to V3175-01 manufactured part. But, if I only have components V3175 and V3114 it should match back to V3175-CV manufactured part. If I have components V3367, V3369, V3370, and V3550 it should match back to manufactured part V3367-01.
I have no SQL written at all yet as I'm unsure of how to break the problem down..

Developing SCV using SQL

I am trying to identify all related records using IDs from two different systems.
I have seen solutions that matches SourceA to SourceB and back to SourceA but obviously this will not pick up everything.
The below table shows that 1-A is seemingly unrelated to 4-C, however when we pair them up we can see that all of the below records are related and the latest ID combination is 4-C.
| SystemA_ID | SystemB_ID | Date | PrimaryA | PrimaryB |
| 1 | A | 1/1/2016 | 4 | C |
| 2 | A | 2/1/2016 | 4 | C |
| 2 | B | 3/1/2016 | 4 | C |
| 3 | B | 4/1/2016 | 4 | C |
| 3 | C | 5/1/2016 | 4 | C |
| 4 | C | 6/1/2016 | 4 | C |
What I need is to populate the PrimaryA and PrimaryB columns with 4 and 'C' respectively.
I was thinking of doing a double loop similar to the solution described here
However, I could not get it working and also there might be a better solution.

How to generate this report?

I'm trying to set up a report based on several tables.
I have a table Actual that looks like this:
+--------+------+
| status | date |
+--------+------+
| 5 | 7/10 |
| 8 | 7/9 |
| 8 | 7/11 |
| 5 | 7/18 |
+--------+------+
Table Targets looks like this:
+--------+-------------+--------+------------+
| status | weekEndDate | target | cumulative |
+--------+-------------+--------+------------+
| 5 | 7/12 | 4 | 45 |
| 5 | 7/19 | 5 | 50 |
| 8 | 7/12 | 4 | 45 |
| 8 | 7/19 | 5 | 50 |
+--------+-------------+--------+------------+
Grouping the Actual records by which Targets.weekEndDate they fall under, I have the following aggregate query GroupActual:
+-------------+------------+--------------+--------+------------+
| weekEndDate | status | weeklyTarget | actual | cumulative |
+-------------+------------+--------------+--------+------------+
| 7/12 | 5 | 4 | 1 | 45 |
| 7/12 | 8 | 4 | 2 | 41 |
| 7/19 | 5 | 5 | 1 | 50 |
| 7/19 | 8 | 4 | | 45 |
+-------------+------------+--------------+--------+------------+
I'm trying to create this report:
+--------+------------+------+------+
| status | category | 7/12 | 7/19 | ...etc for every weekEndDate entry in Targets
+--------+------------+------+------+
| 5 | actual | 1 | 1 |
| 5 | target | 4 | 5 |
| 5 | cumulative | 45 | 50 |
+--------+------------+------+------+
| 8 | actual | 2 | |
| 8 | target | 4 | 5 |
| 8 | cumulative | 45 | 50 |
+--------------+------+------+------+
I can use a crosstab query to make the date columns, but I'm not sure how to have rows for "actual", "target", and "cumulative". They aren't values in the same table, which means (I think) that a crosstab query won't be useful for this breakdown. Should I try to change GroupActual so that it puts the data in the shape I'm looking for? Kind of confused as to where to go next with this...
EDIT: I've made some headway on the crosstabs as per PowerUser's solution, but I'm having trouble with the one for Target. I modified the wizard's generated SQL in an attempt to get what I want but it's not working out. I used a version of GroupActual that only has the weekEndDate,status, and weeklyTarget columns; here's the SQL:
TRANSFORM weeklyTarget
SELECT status
FROM TargetStatus_forCrosstab_Target
GROUP BY status,weeklyTarget
PIVOT Format([weekEndDate],"Short Date");
You're almost there. The problem is that you can't do this all in a single crosstab. You need to make 3 crosstabs (one for 'actual', one for 'target', and one for 'cumulative'), then make a Union query to combine them all.
Additional Tip: In your individual crosstabs, add a Sort column. Your 'actual' crosstab will have a Sort value of 1, 'Target' will have a Sort value of 2, and 'Cumulative' will have 3. That way, when you union them together, you can get them all in the right order.