How to split all multi-valued columns into several columns at once in OpenRefine using GREL

So, I have imported a complex XML file into OpenRefine and merged all rows of each record into one row using a GREL formula in "All => Transform". Now I have around 50 columns, each containing multiple values per cell separated by "|", and I want to split them into separate columns.
See here for two example columns
I could apply "Edit column => Split into several columns" to each column, but that would mean repeating the same step over and over again. I am pretty sure this can be done via "All => Transform" using GREL, but I haven't found a solution yet.
Please help me!

My question has already been answered insofar as OpenRefine is (at the moment) not able to split several columns at once. A workaround is to "Extract..." the JSON code for a one-column split from "Undo/Redo" and paste it into "Apply..." as many times as there are columns, replacing the "columnName" in every pasted copy of the JSON with the one needed.
See https://groups.google.com/g/openrefine/c/BMoK35CCXYo
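The copy-and-replace step of this workaround can be scripted. Here is a minimal sketch in Python, assuming the extracted one-column split is a JSON object with a "columnName" field, as described above. The template below is only a placeholder: its other keys and values are assumptions and should be replaced with the JSON your own OpenRefine version extracts. The helper name build_split_operations and the column names are hypothetical.

```python
import copy
import json

# Placeholder template: paste the operation that "Undo/Redo => Extract..."
# produced for ONE column split. The keys other than "columnName" are
# assumptions and may differ in your OpenRefine version.
TEMPLATE = {
    "op": "core/column-split",
    "columnName": "PLACEHOLDER",
    "mode": "separator",
    "separator": "|",
    "regex": False,
}


def build_split_operations(template, column_names):
    """Return one copy of the template per column, with columnName replaced."""
    operations = []
    for name in column_names:
        op = copy.deepcopy(template)  # never mutate the shared template
        op["columnName"] = name
        operations.append(op)
    return operations


if __name__ == "__main__":
    columns = ["author", "subject", "keyword"]  # hypothetical column names
    # The printed JSON array is what would be pasted into "Apply...".
    print(json.dumps(build_split_operations(TEMPLATE, columns), indent=2))
```

Any other field in the extracted JSON that mentions the column name (for instance a description, if present) is left untouched by this sketch.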

Related

Is there a quick way to subset columns in PANDAS?

I am trying to set up a PANDAS project that I can use to compare Excel and csv files over time and return the differences. Currently I load the Excel/csv files into pandas and assign them a "version" column, because in my last step I want the program to create a file containing only what has changed in the "new" version file, so that I do not have to update the entire database, only the data points that have changed.
import pandas as pd
old = pd.read_excel('landdata20201122.xlsx')
new = pd.read_excel('landdata20210105.xlsx')
old['version'] = "old"
new['version'] = "new"
I merge the sheets into one and then drop duplicate rows based on all the columns in the original files. I have to pass a subset because, if the program also looks at my added version column, no row will be seen as a duplicate. The statement is listed below:
df2 = df1.drop_duplicates(subset=["UWI", "Current DOI Partners", "Encumbrances", "Lease Expiry Date", "Mineral Leases", "Operator", "Attached Land Rights", "Surface Leases"])
df2.shape
I am wondering if there is a quicker way to subset the data. The way I currently have it set up, I have to list each column title. Some of my sheets have 100+ columns, so it is a lot of work when I only want to exclude 1 column. Is there a way to populate all the column titles and remove the ones I do not want looked at? Or is there a way to enter the columns I DO NOT want compared in the drop_duplicates command, instead of entering all the columns except one?
If I can just list the columns I do not want to compare, I will be able to use the same script for much more of the data that I am working with as I will not have to edit the drop_duplicates statement each time I compare sheets.
Any help is appreciated, thank you in advance!

If I've understood correctly:
Store the headers in a list.
Remove the names you don't want by hand.
Pass the list as the subset argument of drop_duplicates().
If there are more columns to remove than to keep, add the wanted columns to the list by hand instead.
With a list, you won't need to write them out every time.
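How to iterate over a list:
# Avoid naming the variable "list", which would shadow the built-in type.
columns = ['first', 'second', 'third']
for name in columns:
    print(name)
# Output (one item per line): first, second, third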
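The steps above could be sketched as follows. This is a minimal example, assuming a small made-up DataFrame in place of the merged sheets and assuming "version" is the only column to ignore, as in the question. The sample values are illustrative only.

```python
import pandas as pd

# Made-up sample data standing in for the merged old/new sheets.
df1 = pd.DataFrame(
    {
        "UWI": ["A1", "A1", "B2", "B2"],
        "Operator": ["X", "X", "Y", "Z"],
        "version": ["old", "new", "old", "new"],
    }
)

# Columns that should NOT take part in the duplicate comparison.
ignore = ["version"]

# Build the subset from the headers instead of typing every name.
subset = [col for col in df1.columns if col not in ignore]

# Same call as in the question, but with the generated subset.
df2 = df1.drop_duplicates(subset=subset)
print(df2.shape)
```

Because only the ignore list has to be maintained, the same script should work for sheets with different or many more columns.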

Reading xlsx into R and creating new header

I'm a newbie to R.
I read in an Excel file for a survey, but I started reading observations from the 3rd row of the Excel file, because the survey download creates two header rows: the first row contains the question text (for all questions), and the second row contains the multiple-choice options (each option gets its own column, except the first option, which sits in the second row of the same column as its question in the first row).
So now, my dataframe starts with Row 3.
But now I need to create custom variable names - ie. new variable names for each column before I manipulate further. I'm looking for tips on how to best accomplish this.
What I am thinking:
Create an Excel file with the variable names, and then use this as the header. I'm not quite sure which code I would use to do this.
Code the names as an empty dataframe, and then somehow merge this so the empty dataframe column names are the column names for the file I imported.
I would appreciate some suggestions on how best to do this!

Is there a way in Excel to filter out duplicates even if the items occur in a different order within cells? (VBA solutions are also welcome)

What I am trying to do is some sort of smart "remove duplicates" in Excel.
I have a list of 200+ cells, and each cell in the list potentially contains multiple items separated by a semicolon (;). So, imagine I have a cell containing items (a,f,g) and another cell containing items (g,a,f).
Those cells are duplicates since they contain exactly the same items, but in a different order. The order is of no importance to me, however.
Is there a way that Excel could recognize such cells as duplicates?
Many thanks in advance for your suggestions :)

If there is only one column, the solution is simple.
Split the cells into columns ("Text to Columns") using ";" as the delimiter.
Sort each row; note that you have to sort row-wise (left to right), not column-wise.
Concatenate the sorted items back into a single cell.
Use the highlight duplicates option.
In my opinion, this should take less time than going the VBA way.
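The same idea (split, sort, re-join, then compare) can also be sketched outside Excel. Here is a minimal Python sketch, assuming the cell values have been exported to a list of strings. The function names and sample values are hypothetical.

```python
def normalize(cell, sep=";"):
    """Split a cell on sep, trim and sort the items, and join them again."""
    items = [item.strip() for item in cell.split(sep) if item.strip()]
    return sep.join(sorted(items))


def drop_order_insensitive_duplicates(cells, sep=";"):
    """Keep the first cell of each group that contains the same items."""
    seen = set()
    unique = []
    for cell in cells:
        key = normalize(cell, sep)  # order-independent form of the cell
        if key not in seen:
            seen.add(key)
            unique.append(cell)
    return unique


if __name__ == "__main__":
    cells = ["a;f;g", "g;a;f", "a;b"]
    print(drop_order_insensitive_duplicates(cells))
```

Sorting the items gives every cell a canonical form, so two cells with the same items in a different order end up with the same key.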

How to Stack a range of values (from multiple tables in another sheet) into a single column

I'm working on a quarterly report that Auto-generates all fields.
I could really use some help building a formula that pulls values from the first column ([T6-TOC]) of three separate tables (ROVH_Jan, ROVH_Feb, ROVH_MAR) that exist in another worksheet (RVH 1825). I need the three ranges of values to stack in a single column, but I do not want to eliminate duplicate values.
I've tried using an =INDEX formula and VBA, but I can't get the syntax right.
Any suggestions?
These are sources I've viewed, but they didn't solve my problem.
https://superuser.com/questions/445410/pull-row-of-data-from-one-place-in-spreadsheet-to-another
http://forum.chandoo.org/threads/merge-stack-multiple-named-ranges-across-multiple-worksheets-in-a-master-sheet.11074/
Excel - Combine multiple columns into one column
http://www.mrexcel.com/forum/excel-questions/610527-how-do-i-stack-data-multiple-columns-into-one-column.html

Something like this should work for you:
=IF(ROW(A1)<=ROWS(ROVH_Jan),INDEX(ROVH_Jan[T6-TOC],ROW(A1)),IF(ROW(A1)<=ROWS(ROVH_Jan)+ROWS(ROVH_Feb),INDEX(ROVH_Feb[T6-TOC],ROW(A1)-ROWS(ROVH_Jan)),IF(ROW(A1)<=ROWS(ROVH_Jan)+ROWS(ROVH_Feb)+ROWS(ROVH_MAR),INDEX(ROVH_MAR[T6-TOC],ROW(A1)-ROWS(ROVH_Jan)-ROWS(ROVH_Feb)),"")))
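To make the logic of the nested IF easier to follow: the formula is meant to be filled down a column, and ROW(A1) acts as a running counter that is mapped first to ROVH_Jan, then to ROVH_Feb, then to ROVH_MAR, and to an empty string once all three are exhausted. Here is a minimal Python sketch of that index arithmetic, assuming the three table columns are plain lists with made-up values. The function name stacked_value is hypothetical.

```python
def stacked_value(row, tables):
    """Return the value for 1-based output row `row`, or "" past the end."""
    offset = 0  # number of rows contributed by the tables already skipped
    for table in tables:
        if row <= offset + len(table):
            # Same as INDEX(table, ROW(A1) - rows of the earlier tables).
            return table[row - offset - 1]
        offset += len(table)
    return ""


if __name__ == "__main__":
    jan = [10, 20]
    feb = [20, 30, 40]
    mar = [50]
    total = len(jan) + len(feb) + len(mar)
    # One extra row shows the empty-string fallback.
    print([stacked_value(r, [jan, feb, mar]) for r in range(1, total + 2)])
```

Duplicate values are kept, because each output row is looked up purely by position.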

Build a Column by applying a formula to an existing column (like in excel)

I am new to the community. Hopefully this was not answered already.
I am trying to add a column to a DataFrame that contains a formula based on the previous columns. For example, build a series of stock returns based on the stock's closing prices.
I know how to build a column by doing exactly the same thing to all elements of another column, but not how to use a column's elements in a formula to create another.
Thanks for your help.
