How to transpose columns when they encode multiple "records"? - openrefine

I have a spreadsheet I have imported into OpenRefine. The creator encoded groups of information (records) in columns. I need to bring each of those groups of columns into its own row, along with all the relevant columns.
Using a simplified example, how would I go from this:
id foo1 foo2 foo3 bar1 bar2 bar3
1 4 6 a 7 9 b
2 5 5 a 8 8 b
3 6 4 a 9 7 b
To this:
id foobar1 foobar2 foobar3
1 4 6 a
1 7 9 b
2 5 5 a
2 8 8 b
3 6 4 a
3 9 7 b
I've been trying to think of a way forward with intermediate columns, but there are 6 groups of 5 columns and I'm currently stuck.
I found a solution. The steps are:
Concat each group of columns into a single column (FOO_CONCAT, BAR_CONCAT)
Delete the now unneeded columns (foo1..3, bar1..3)
Transpose your CONCAT columns into a single column, no prefix, ignoring blanks, filling down other columns
Now FOO_CONCATs and BAR_CONCATs are all in the same column
Split that column into several columns...(using the separator you used in step 1)
Rename columns
Strip out prefixes (I had foo1:4, bar2:8, etc for clarity)
Transform to numbers (Edit cells -> Common Transforms -> toNumber)
Now you're ready to transpose, facet, etc.

I think this is essentially the same as the solution you describe, but possibly with some shortcuts to avoid all the steps.
Given the example data you post I would:
On the "id" column select Edit column->Add column based on this column from the menu
Make new column name "foobar"
Use the GREL forEach(row.columnNames,cn,if(cn.startsWith("foo"),cells[cn].value,null)).join("|")+"~"+forEach(row.columnNames,cn,if(cn.startsWith("bar"),cells[cn].value,null)).join("|")
Once new "foobar" column exists, on this column use menu option Edit cells->Split multi-valued cells using the "~" character (as used in the GREL above)
Then, also on the "foobar" column, use menu option Edit column->Split into several columns, using the "|" character as in the GREL above
Finally on ID column use menu Edit cells->Fill down
This should result in the output you describe - if you don't need the original columns at this point you can either remove them, or (sometimes quicker) export the first X columns that have the reconfigured data using the custom tabular exporter, and then import that data into a new project.
You can modify the GREL to deal with the exact column groupings you have. In my example I've used the column naming to group the values, but if that isn't the reality of the data you are dealing with you can use GREL like:
forEach(row.columnNames.slice(1,4),cn,cells[cn].value).join("|")+"~"+forEach(row.columnNames.slice(4,8),cn,cells[cn].value).join("|")
Which uses the 'slice' function to select certain columns rather than using some aspect of the column name to select them.
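For comparison outside OpenRefine, the same reshaping can be sketched in plain Python. This is only an illustration, assuming the column groups are known in advance; the function name `unpivot_groups` and the output names `foobar1` to `foobar3` are made up here to match the example in the question.

```python
def unpivot_groups(rows, id_col, groups, out_cols):
    """Turn each group of columns in a row into its own output row.

    rows     -- list of dicts, one per input row
    id_col   -- name of the column repeated on every output row
    groups   -- list of column-name lists, one list per "record"
    out_cols -- names for the output columns (same length as each group)
    """
    result = []
    for row in rows:
        for group in groups:
            new_row = {id_col: row[id_col]}
            # Map the n-th column of the group onto the n-th output column.
            for out_name, src_name in zip(out_cols, group):
                new_row[out_name] = row[src_name]
            result.append(new_row)
    return result


rows = [
    {"id": 1, "foo1": 4, "foo2": 6, "foo3": "a", "bar1": 7, "bar2": 9, "bar3": "b"},
    {"id": 2, "foo1": 5, "foo2": 5, "foo3": "a", "bar1": 8, "bar2": 8, "bar3": "b"},
    {"id": 3, "foo1": 6, "foo2": 4, "foo3": "a", "bar1": 9, "bar2": 7, "bar3": "b"},
]
groups = [["foo1", "foo2", "foo3"], ["bar1", "bar2", "bar3"]]
long_rows = unpivot_groups(rows, "id", groups, ["foobar1", "foobar2", "foobar3"])
```

With 6 groups of 5 columns, only the `groups` and `out_cols` lists would need to change.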

Related

Storing a large array into a table with 10,000 columns in SQLite

I want to be able to store some 100x100 matrices onto a table within my database (covariance matrices). A first good step for me would be to flatten the matrix and store the matrix structure (among other things) into a parent table.
However, creating such a table would require to make a table with about 10,000 or so columns. Writing so many field names would make my SQL code extraordinarily large, and I wouldn't know where to start if I want to query for that matrix.
Is there a neat way to specify such a table in SQL? Is there a neat way for me to set or get a particular (set of) matrix (matrices) from my database using such a table? Is there a better way?
I am using Sqlite for my databases.
Any table with a large number of columns of the same type can be rotated.
For example if you have a table A like this:
row col1 col2 col3 ...
1 1 2 3
2 11 12 13
You can simply rotate it into a table with 3 columns
row col value
1 1 1
1 2 2
1 3 3
2 1 11
2 2 12
2 3 13
so instead of writing big sql like
select col1, col2, col3 ...... from A where row = 2
you write sql like
select value from A where row = 2 order by col
The result set was originally horizontal and is now vertical: it has been rotated and is easy to handle.
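A minimal sketch of this rotated layout, using Python's built-in `sqlite3` module. The table name `matrix_cell`, the column names `matrix_id`, `row_idx` and `col_idx`, and the helper functions are assumptions chosen for this illustration; indexes are 0-based here.

```python
import sqlite3


def store_matrix(conn, matrix_id, matrix):
    """Insert one row per matrix cell: (matrix_id, row_idx, col_idx, value)."""
    cells = [
        (matrix_id, r, c, value)
        for r, line in enumerate(matrix)
        for c, value in enumerate(line)
    ]
    conn.executemany(
        "INSERT INTO matrix_cell (matrix_id, row_idx, col_idx, value) "
        "VALUES (?, ?, ?, ?)",
        cells,
    )


def load_matrix(conn, matrix_id, n_rows, n_cols):
    """Rebuild a nested list from the stored cells."""
    matrix = [[None] * n_cols for _ in range(n_rows)]
    query = "SELECT row_idx, col_idx, value FROM matrix_cell WHERE matrix_id = ?"
    for r, c, value in conn.execute(query, (matrix_id,)):
        matrix[r][c] = value
    return matrix


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matrix_cell ("
    " matrix_id INTEGER, row_idx INTEGER, col_idx INTEGER, value REAL,"
    " PRIMARY KEY (matrix_id, row_idx, col_idx))"
)

example = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
store_matrix(conn, 1, example)
restored = load_matrix(conn, 1, n_rows=2, n_cols=3)

# One matrix row, read back vertically and ordered by column index.
second_row = [
    value
    for (value,) in conn.execute(
        "SELECT value FROM matrix_cell "
        "WHERE matrix_id = 1 AND row_idx = 1 ORDER BY col_idx"
    )
]
```

A 100x100 matrix becomes 10,000 rows in this table instead of 10,000 columns, and the `matrix_id` column can point to a parent table holding the matrix dimensions.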

How to (1) condense into one row after certain number of rows; (2) How to assign field names

Using Pentaho PDI 8.3.
After REST calls with quite complex data structures, I was able to extract data with a row for each data element in a REST result. E.g.:
DataCenterClusterAbstract
1
UK1
Datacenter (auto generated)
Company
29
0
39
15
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
2
UK1_Murex
Datacenter (auto generated)
Company
0
0
0
0
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
3
UK1_UNIX
Notice that there are 8 data elements that are spread out into separate rows. I would like to condense these 8 data elements into one row each iteration in Pentaho. Is this possible? And assign field names?
Row flattener
Condense the 8 data elements, which arrive as separate rows, into one row. These 8 data elements repeat for each record.
(1) Add a Row flattener step.
(2) Assign field names for the incoming rows: if you have 10 data attributes in rows, specify a field name for each of them.
(3) In the Table output step, use a space as the separator.
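The idea behind the flattening step can be sketched in Python: consume the incoming values in fixed-size chunks and pair each chunk with a list of field names. This is not Pentaho code. The function `flatten_rows`, the field names and the shortened three-field sample are assumptions for illustration only.

```python
def flatten_rows(values, field_names):
    """Group a flat list of values into records of len(field_names) each."""
    size = len(field_names)
    if len(values) % size != 0:
        raise ValueError("number of values is not a multiple of the record size")
    return [
        dict(zip(field_names, values[start:start + size]))
        for start in range(0, len(values), size)
    ]


# Shortened sample: only the first three elements of each record are used.
incoming = [
    "DataCenterClusterAbstract", "1", "UK1",
    "DataCenterClusterAbstract", "2", "UK1_Murex",
]
field_names = ["type", "id", "name"]  # hypothetical field names
records = flatten_rows(incoming, field_names)
```

The number of field names supplied determines how many incoming rows are folded into each output row, so it has to match the real number of elements per record.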

Split column in hive

I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe '|' character. Then I want to group up the 2 adjacent values and separate them into separate rows.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|") which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to use the lateral view to split the mapper array into separate rows, but it separates all the values, whereas I want to separate them by group.
Expected:
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
Actual
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
I would suggest you to split your pairs split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |.
Update:
If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
Update 2:
OK, the best way is to split on the second | as follows
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]
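To check what the expected result looks like independently of the regex, the pairing logic can be sketched in Python: split on every `|`, then regroup the parts two at a time. The function name `split_pairs` is hypothetical and this is not Hive code.

```python
def split_pairs(mapper, sep="|"):
    """Split a delimited string and regroup the parts into adjacent pairs."""
    parts = mapper.split(sep)
    return [parts[i:i + 2] for i in range(0, len(parts), 2)]


table = [
    (1, "a|0.1|b|0.2"),
    (2, "c|0.2|d|0.3|e|0.6"),
    (3, "f|0.6"),
]

# Equivalent of "explode": one output row per (id, pair).
exploded = [
    (row_id, pair)
    for row_id, mapper in table
    for pair in split_pairs(mapper)
]
```

Because the grouping is positional, it does not depend on whether the keys are letters or digits.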

SPSS: How can I copy values from a variable (column) and paste it below the other one using syntax?

[SPSS] How can I copy values from a variable (column) and paste it below the other one by syntax?
I need to merge 10 columns and I can't do this only by copy and paste.
I have this (screenshot: https://i.imgur.com/I5DFV.jpg):
var1 var2
1 3 6
2 4 7
3 5 8
4
5
.
.
.
and I want this:
newvar
1 3
2 4
3 5
4 6
5 7
6 8
If you want to create new lines (so you get two lines with one variable instead of one line with two variables), you can use varstocases like this:
varstocases /make NewVar from Var1 Var2/index=originVar(NewVar).
this will get both the old variables into the new one, and create an additional variable called originVar which will contain the name of the original variable that each number in NewVar came from.
ADDITION:
If your file was originally sorted by a specific variable (or variables), you can now just sort again by your original variable and by originVar. If you don't have a variable that preserves the original order, just create one before restructuring:
compute OrigOrder=$casenum.
restructure....
sort cases by OrigOrder originVar./* or by originVar OrigOrder.
Your example may imply that you already have empty lines to which you want to copy values from previous lines. This is a different situation; you can do it this way:
compute NewVar=Var1.
if missing(NewVar) NewVar=lag(Var2).
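The restructure-then-sort approach can be mimicked in Python to see which order the values end up in. This is a sketch of the idea rather than of SPSS itself: the function `vars_to_cases` is hypothetical, and skipping missing values is an assumption made for this example.

```python
def vars_to_cases(cases, var_names):
    """Emit one output row per (case, variable), remembering where it came from."""
    long_rows = []
    for order, case in enumerate(cases, start=1):
        for name in var_names:
            value = case.get(name)
            if value is not None:  # assumption: missing values are skipped
                long_rows.append(
                    {"OrigOrder": order, "originVar": name, "NewVar": value}
                )
    return long_rows


cases = [
    {"Var1": 3, "Var2": 6},
    {"Var1": 4, "Var2": 7},
    {"Var1": 5, "Var2": 8},
]
long_rows = vars_to_cases(cases, ["Var1", "Var2"])

# Sort by source variable first, then by original case order,
# so that all Var1 values come before all Var2 values.
long_rows.sort(key=lambda row: (row["originVar"], row["OrigOrder"]))
new_var = [row["NewVar"] for row in long_rows]
```

Sorting by `OrigOrder` first and `originVar` second would instead interleave the two variables case by case.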

Excel: one column has duplicates of each value, I need to take averages of the corresponding two values from the other columns

Example:
column A column B
A 1
A 2
B 2
B 2
C 1
C 1
I would somehow like to get the following result:
column A column B
A 1.5
B 2
C 1
(which are averages of 1 and 2, 2 and 2 and 1 and 1)
How do I achieve that?
Thanks
If you're using Excel 2007 or above, you can also use the shorter AVERAGEIF function:
=AVERAGEIF($A$1:$A$6,D1,$B$1:$B$6)
Less typing, easier to read.
In D1:D3, type A, B, C. Then in E1, put this formula
=SUMIF($A$1:$A$6,D1,$B$1:$B$6)/COUNTIF($A$1:$A$6,D1)
and fill down to E3. If you want to replace the existing data, copy E1:E3 and paste-special-values over itself. Then delete A:C.
Alternatively, you can add headers to your data, say "Letter" and "Number". Then create a Pivot Table from your data. Put Letter in the rows section and Number in the Data section. Change your Data section from SUM to AVERAGE and you'll get the same result.
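The computation that all three approaches perform (sum per key divided by count per key) can be sketched in Python for a quick sanity check of the expected numbers. The function name `average_by_key` is made up for this example.

```python
from collections import defaultdict


def average_by_key(pairs):
    """Return {key: mean of values} for a list of (key, value) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


data = [("A", 1), ("A", 2), ("B", 2), ("B", 2), ("C", 1), ("C", 1)]
averages = average_by_key(data)
```

This mirrors the SUMIF/COUNTIF formula directly: `totals` plays the role of SUMIF and `counts` the role of COUNTIF.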