Transpose a variable number of rows into columns in OpenRefine

I have an xml file containing records from a library catalogue. I have imported it into OpenRefine but all the values are in one column. I want to transpose it so each field in the record has its own column. However, this is complicated by the fact that a) each field is optional so does not exist in all records and b) many fields are repeatable so can appear multiple times in each record. Here's a simplified example of what the data looks like:
| RecordID | Tag  | Data            |
|----------|------|-----------------|
| 1        | 040a | CaABCD          |
| 1        | 245a | Go fish         |
| 1        | 245a | A guide to fish |
| 1        | 246i | Fish series     |
| 1        | 260a | Fishing friends |
| 2        | 040a | CaABDC          |
| 2        | 245a | Happy trails    |
| 2        | 246i | Hiking series   |
| 2        | 260i | The happy hiker |
| 2        | 500a | Notes           |
I have read the Q&A here (Openrefine - Transpose rows into columns based on text), but the problem with that solution is that if I concatenate all the values together, I have no way to be sure which field they belong to anymore, as my data is much more complicated than the data in that question (my actual data has 25+ fields and many thousands of records).
I was able to get closer using Google Sheets and making a pivot table with a calculated field (as in PivotTable to show values, not sum of values - see the answer at the very bottom). However, I still don't know how to handle the repeating fields. In the pivot table the multiple values are there but only the first displays (double-clicking on an individual cell brings up a details table which lists all the values), so when I copy-paste the table I lose the additional values. I would like to concatenate them but I cannot see a way to do so within the pivot table.
Can you think of any other way I could do this, in OpenRefine or another tool? Thanks!

The classic way to fix this in OpenRefine is to use "Transpose -> Columnize by key value". But this feature is poorly documented and can cause headaches even for OpenRefine developers. In your case, repeated fields will be problematic, so here is a possible solution.
1° Go to the "Tag" column and click on "Transpose -> Columnize by key value", using "Tag" as the key column, "Data" as the value column, and "RecordID" as the note column (don't forget the "Note column (optional)").
The result will look like this (my dataset is not exactly the same as yours; I modified a value to do some testing).
2° In the new column "RecordID : 040a", click on "Edit column -> Move column to beginning".
3° If you want to merge the repeated fields, go to each column that contains them and click on "Edit cells -> Join multi-valued cells", choosing a separator, for example "|".
The end result will look like this.
To get rid of the unnecessary columns, click on "Export -> Custom tabular exporter" and deselect the columns whose names start with "RecordID".
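If the OpenRefine route ever gets unwieldy, the same reshaping can also be done in plain SQL once the key/value rows are loaded into a database. A minimal sketch in SQLite flavour, assuming the example rows above sit in a hypothetical table called records; note that you need one line per tag, so with 25+ fields the query gets long:

```sql
-- Pivot key/value rows into one row per record.
-- GROUP_CONCAT ignores the NULLs produced by non-matching CASE branches,
-- so repeated tags end up concatenated with "|", mirroring "Join multi-valued cells".
SELECT
    RecordID,
    GROUP_CONCAT(CASE WHEN Tag = '040a' THEN Data END, '|') AS "040a",
    GROUP_CONCAT(CASE WHEN Tag = '245a' THEN Data END, '|') AS "245a",
    GROUP_CONCAT(CASE WHEN Tag = '246i' THEN Data END, '|') AS "246i",
    GROUP_CONCAT(CASE WHEN Tag = '260a' THEN Data END, '|') AS "260a",
    GROUP_CONCAT(CASE WHEN Tag = '260i' THEN Data END, '|') AS "260i",
    GROUP_CONCAT(CASE WHEN Tag = '500a' THEN Data END, '|') AS "500a"
FROM records
GROUP BY RecordID;
```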

OpenRefine also has a native MARC importer which might be something worth trying if you need to work with MARC data in the future. MARCEdit also has some specific OpenRefine support built in.

Related

Transpose/Pivot Excel file in Pentaho (using multiple files)

I've been having some trouble with the following situation: There's an Excel file I need to use which has the information in the following format:
ColumnA  | ColumnB
---------|----------------
Name     | John
Business | Pentaho
Address  | Evergreen 123
Job type | Food processing
NameBoss | Boss lv1
Phone    | 555-NoPhone
Mail     | thisATmail
What I need to do is turn each value in column A into its own column, ending up with 7 different columns, each holding the corresponding value from column B. Additionally, the integration is reading the filename as an extra output field:
SELECT
'${FILES_ROOT}/proyectos/BUSINESS_NAME/B_NAME_OPER/archivos_fuente/NÓMINA BAC - ' ||nombre_empresa||'.xlsx' as nombre_archivo
--, nombre_empresa
FROM "public".maestro_empresa
The transformation I have for the Excel file looks like this:
As can be seen in the Fields tab of the transformation, I added each column manually, since the data in the Excel file does not have headers.
With this done, I am not sure how to proceed from here in order to get the transposed data I need. What can I do?
The end result I am looking for is something like this:
Name | Business | Address       | Job type        | NameBoss | Phone       | Mail       | excel_name
-----|----------|---------------|-----------------|----------|-------------|------------|---------------
John | Pentaho  | Evergreen 123 | Food processing | Boss lv1 | 555-NoPhone | thisAtMail | ExcelName.xlsx
With the 'Row denormaliser' step, you can do this easily. First take input from the Excel file, then use the 'Row denormaliser' step. You can see a sample HERE.
Note: remove the 'Id' column from my sample if you always expect to get one line.
If your ColumnA values are dynamic / not fixed, you can use THIS Metadata Injection sample (where you need to take the same Excel input twice, but don't need to specify the column names). Please run the transformation "MetaDataInjectionPV.ktr".
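For readers more comfortable with SQL than with Pentaho steps, here is a rough illustration of what the Row Denormaliser is doing, assuming the ColumnA/ColumnB pairs were loaded into a hypothetical table excel_rows together with the filename field from the query above:

```sql
-- Each distinct ColumnA value becomes its own output column (value taken from ColumnB),
-- and the filename is carried along as the extra excel_name field.
SELECT
    MAX(CASE WHEN ColumnA = 'Name'     THEN ColumnB END) AS Name,
    MAX(CASE WHEN ColumnA = 'Business' THEN ColumnB END) AS Business,
    MAX(CASE WHEN ColumnA = 'Address'  THEN ColumnB END) AS Address,
    MAX(CASE WHEN ColumnA = 'Job type' THEN ColumnB END) AS "Job type",
    MAX(CASE WHEN ColumnA = 'NameBoss' THEN ColumnB END) AS NameBoss,
    MAX(CASE WHEN ColumnA = 'Phone'    THEN ColumnB END) AS Phone,
    MAX(CASE WHEN ColumnA = 'Mail'     THEN ColumnB END) AS Mail,
    nombre_archivo                                        AS excel_name
FROM excel_rows
GROUP BY nombre_archivo;
```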

Get row number when using "MAXIFS" function

I am using MAXIFS (or similar) to identify the wanted line in a table, but I do not need the max value, I need data from an adjacent column. Example:
=MAXIFS(TableComments1[CommentDate];TableComments1[T.Number];TableView1[#Number])
Basically, in this example I am searching for lines matching "Number" with the latest date. But in the next step I need the row number of that date, to enable the use of INDEX and return the appropriate column (TableComments1[Comment]).
I tried different approaches - no success.
PS: performance is also important here.
UPDATE: example lookup table "TableComments1":
T.Number | Comment | CommentDate
==============+==============+===========
SCTASK0073347 | correction | 22/07/2018
SCTASK0073347 | update 11 | 25/07/2018
SCTASK0073347 | update 2 | 21/07/2018
PS: sorting "CommentDate" is not an option here.
After days of dabbling and finally posting the above question, I found a solution myself. Not sure it is the best, but performance seems okay.
Be aware: a simpler solution is possible by sorting the table on "CommentDate". That could not be guaranteed and was not desired in this use case, based on the question input.
Recap: in table TableView1 we want to add the most recent comment for each "Number", looked up from TableComments1, which contains the comment history:
I got the idea from another post to use a helper column for a combination of 2 criteria. New table layout:
T.Number | Comment | CommentDate | Helper1
==============+==============+=============+===================
SCTASK0073347 | correction | 22/07/2018 | 43303SCTASK0073347
SCTASK0073347 | find this! | 25/07/2018 | 43306SCTASK0073347
SCTASK0073347 | update 2 | 21/07/2018 | 43302SCTASK0073347
TASK9999 | comment | 25/07/2018 | 43306TASK9999
Formula breakdown
The formula for the Helper column just concatenates 2 columns:
=[#CommentDate]&[#[T.Number]]
Let's say we want: SCTASK0073347
Note: in the helper column we have value "43306SCTASK0073347";
where "43306" is the numerical representation of date "25/07/2018".
This will search for a match of "Number" and return the most recent "CommentDate":
=MAXIFS(TableComments1[CommentDate];TableComments1[T.Number];TableView1[#Number])
Returning "25/07/2018". Lets abbreviate the above to <<MostRecentDate>> for readability in next step(s).
This step, will search for a combination of above formula <<MostRecentDate>> & "Number" in the Helper column:
=MATCH(<<MostRecentDate>>&TableView1[#Number];TableComments1[Helper1];0)
...returning the row number (2) matching the helper column value "43306SCTASK0073347".
From this point forward we use MATCH (now returning the wanted row) and INDEX, in the style a VLOOKUP would:
=INDEX(TableComments1[Comment];MATCH(<<MostRecentDate>>&TableView1[#Number];TableComments1[Helper1];0))
...returning the wanted column with desired comment "find this!".
Full/final formula, which includes the IFNA function to blank out lookups with no comments:
=IFNA(INDEX(TableComments1[Comment];MATCH(MAXIFS(TableComments1[CommentDate];TableComments1[T.Number];TableView1[#Number])&TableView1[#Number];TableComments1[Helper1];0));"")

sqlite variable and unknown number of entries in column

I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients          | salad_cost |
|------------|------------|----------------------------|------------|
| apple      | fruity     | apple                      | cheap      |
| unlikely   | meaty      | sausages, chorizo          | expensive  |
| normal     | standard   | leaves, cucumber, tomatoes | mid        |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
- Just enter a single, comma-separated string and split it at run-time. Seems hacky, and I couldn't search by salad_ingredients!
- Have another table for each salad, such as "apple_ingredients", which could have a varying number of rows, one per ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
- Have a separate salad_ingredients table, where each row is a salad_name and there is an arbitrary number of ingredient fields, say 10, so you could have up to 10 ingredients. Again, seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
- id
- salad_name
- salad_type
- salad_cost

table ingredients
- id
- name

table salad_ingredients
- id
- id_salad
- id_ingredients
where id_salad is the corresponding id from salads and id_ingredients is the corresponding id from ingredients.
Using proper joins you can get (SELECT) and filter (WHERE) all the values you need, as in the sketch below.
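For concreteness, a minimal sketch of that schema and a sample query in SQLite syntax (column names taken from the answer above; the constraints are just one reasonable choice):

```sql
CREATE TABLE salads (
    id          INTEGER PRIMARY KEY,
    salad_name  TEXT NOT NULL,
    salad_type  TEXT,
    salad_cost  TEXT
);

CREATE TABLE ingredients (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL
);

CREATE TABLE salad_ingredients (
    id             INTEGER PRIMARY KEY,
    id_salad       INTEGER NOT NULL REFERENCES salads(id),
    id_ingredients INTEGER NOT NULL REFERENCES ingredients(id)
);

-- Example: find every salad that contains cucumber.
SELECT s.salad_name, s.salad_cost
FROM salads s
JOIN salad_ingredients si ON si.id_salad = s.id
JOIN ingredients i        ON i.id = si.id_ingredients
WHERE i.name = 'cucumber';
```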

How to flatten a one-to-many relationship

While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables that look like
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME | EVENT_DATE
---------------------------------------
1 | 1 | ACC_APPLIED | 2014-01-01
2 | 1 | ACC_OPENED | 2014-01-02
3 | 1 | ACC_CLOSED | 2014-01-02
There is a one-to-many relationship between master and the events table. Since there is a limited number of event names, I am proposing that we denormalize this structure into something that looks like
ID | CUST_NAME | CUST_EMAIL | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID |ACC_CLOSE_DATE_ID
-----------------------------------------------------------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM | 20140101 | 20140102 | 20140103
The DATE_ID columns refer to entries inside the time dimension table.
First question : Is this a good idea ? What are the other alternatives to this scheme ?
Second question : How do I implement this using Talend Open Studio? I figured out a way in which I moved the data for each event name into its own temporary table along with cust_id using the tMap component, and later linked them together using another tMap. Is there another way to do this in Talend?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yeah, this is a good idea; this is called an accumulating snapshot fact. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
Not sure how to do this in Talend (I don't know the tool), but it would be quite easy to implement in SQL using a CASE or PIVOT statement, as sketched below.
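A minimal sketch of that CASE-based pivot, using the table and column names from the question (mapping each date onto its DATE_ID in the time dimension would be an extra lookup on top of this):

```sql
SELECT
    m.ID,
    m.CUST_NAME,
    m.CUST_EMAIL,
    MAX(CASE WHEN e.EVENT_NAME = 'ACC_APPLIED' THEN e.EVENT_DATE END) AS ACC_APP_DATE,
    MAX(CASE WHEN e.EVENT_NAME = 'ACC_OPENED'  THEN e.EVENT_DATE END) AS ACC_OPEN_DATE,
    MAX(CASE WHEN e.EVENT_NAME = 'ACC_CLOSED'  THEN e.EVENT_DATE END) AS ACC_CLOSE_DATE
FROM master m
LEFT JOIN events e ON e.CUST_ID = m.ID
GROUP BY m.ID, m.CUST_NAME, m.CUST_EMAIL;
```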
Regarding only your first question: it's certainly a good idea, unless there is any possibility of the same person applying-opening-closing their account more than once AND you want to keep all this information in their history (so UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse, so denormalizing will certainly be a good choice in this case. The following article almost fits perfectly to clear the air over such scenarios:
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/

Storing a COUNT of values in a table

I have a table with data along the (massively simplified) lines of:
User | Value
-----|------
UsrA | 100
UsrA | 102
UsrB | 100
UsrA | 100
UsrB | 101
and, for reasons far too obscure to go into, I need to store the COUNT of each value in a table for future retrieval, ending up with something like
User | Value100Count | Value101Count | Value102Count
-----|---------------|---------------|--------------
UsrA | 2 | 0 | 1
UsrB | 1 | 1 | 0
However, there could be up to 255 different Values - meaning potentially 255 different ValueXCount columns. I know this is a horrible way to do things, but is there an easy way to get the data into a format that can be easily INSERTed into the destination table? Is there a better way to store the COUNT of values per user (unfortunately I do need to store this information; grabbing it from the source table each time isn't an option)?
The whole thing isn't very pretty, but you know that. Rather than your table with 255 columns, I'd consider setting up another table with:
User | Value | CountOfValue
And set a primary key over User and Value.
You could then insert the counts for the given user/value combos into the CountOfValue field.
As I said, the design is horrible and it feels like you would be better off starting from scratch, normalizing and doing counts live.
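A minimal sketch of that approach in generic SQL (SourceTable is a stand-in for your real table, and User/Value are quoted since they are often reserved words):

```sql
CREATE TABLE UserValueCounts (
    "User"       VARCHAR(50) NOT NULL,
    "Value"      INT         NOT NULL,
    CountOfValue INT         NOT NULL,
    PRIMARY KEY ("User", "Value")
);

-- (Re)populate the counts from the source data.
INSERT INTO UserValueCounts ("User", "Value", CountOfValue)
SELECT "User", "Value", COUNT(*)
FROM SourceTable
GROUP BY "User", "Value";
```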
Check out indexed views. You can have the table maintained automatically, with integrity, and as a bonus it can get used in queries that already do COUNT(*) on that data.
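For example, on SQL Server (where indexed views live) the materialized counts might look like the sketch below; dbo.UserValues is a hypothetical name for the source table:

```sql
-- SCHEMABINDING and COUNT_BIG(*) are required for an indexed view.
CREATE VIEW dbo.vUserValueCounts
WITH SCHEMABINDING
AS
SELECT [User], [Value], COUNT_BIG(*) AS CountOfValue
FROM dbo.UserValues
GROUP BY [User], [Value];
GO

-- The unique clustered index is what makes SQL Server materialize and
-- automatically maintain the view as the base table changes.
CREATE UNIQUE CLUSTERED INDEX IX_vUserValueCounts
    ON dbo.vUserValueCounts ([User], [Value]);
```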