How to facet multiple columns in Google Refine - openrefine

I have a data set with 30 columns and multiple rows (some cells have no data). I would like to be able to facet the columns in groups.
1 2 3 4...
Row1 A B C D
Row2 E A D F
Row3 Q A B H
Given the above data, I would like the facet to return the number of instances of each value across a group of columns. For the first three columns I need the facet to return:
A - 3
B - 2
C - 1
D - 1
E - 1
Q - 1
I have tried to combine columns when I loaded the data but the individual data was grouped as well. This is not the desired outcome. For example:
ABC - 1
EAD - 1
QAB - 1
Thanks in advance.

I can't think of a more efficient way to do this off the top of my head, but you can do a custom facet with something like:
[ cells["1"].value, cells["2"].value, cells["3"].value ]
where "1", "2", and "3" are the names of your columns. If your column names are single words, like "V1", "V2", "V3", and so on, you can also change the custom facet to something like:
[ cells.V1.value, cells.V2.value, cells.V3.value ]
With a lot of columns, this solution might be somewhat tedious though...
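To check the expected counts outside OpenRefine, the same idea (pool the values from a chosen group of columns, then count each value) can be sketched in Python. The rows and the column names "1" to "4" come from the sample data in the question; the variable names are illustrative only, and this is not meant to describe how OpenRefine computes facets internally.

```python
from collections import Counter

# Sample rows from the question; column names "1".."4" are assumed.
rows = [
    {"1": "A", "2": "B", "3": "C", "4": "D"},
    {"1": "E", "2": "A", "3": "D", "4": "F"},
    {"1": "Q", "2": "A", "3": "B", "4": "H"},
]

group = ["1", "2", "3"]  # the columns to facet together

# Pool every non-empty value from the chosen columns and count it.
counts = Counter(
    row[col] for row in rows for col in group if row.get(col) not in (None, "")
)

for value, n in sorted(counts.items()):
    print(f"{value} - {n}")
```

Changing the `group` list is all that is needed to count over a different set of columns.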

Have you tried transposing all your columns into a single column and faceting on that 'master column'?
When transposing, prepend the column name so you know which column each value came from. Then you can split the master column into 'source column' and 'data'.
You can find JSON code for transposing a large number of columns here: http://googlerefine.blogspot.ca/2011/09/json-code-to-transpose-important-number.html
It should work for your project with only a few edits.
Hope it helps!
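As a rough illustration of what the transposed 'master column' looks like, here is a small Python sketch. It uses the sample rows from the question with assumed column names "1" to "4", and only mimics the shape of the result; it is not the JSON operation from the linked post.

```python
from collections import Counter

# Sample rows from the question; column names "1".."4" are assumed.
rows = [
    {"1": "A", "2": "B", "3": "C", "4": "D"},
    {"1": "E", "2": "A", "3": "D", "4": "F"},
    {"1": "Q", "2": "A", "3": "B", "4": "H"},
]

# "Transpose" every cell into one long list of (source column, data) pairs,
# skipping blank cells.
master = [
    (col, value)
    for row in rows
    for col, value in row.items()
    if value not in (None, "")
]

# Facet on the data while restricting to a group of source columns.
wanted = {"1", "2", "3"}
facet = Counter(value for col, value in master if col in wanted)
print(facet.most_common())
```

Keeping the source column next to each value is what lets you pick any group of columns after the fact.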

Related

Pandas - data per row instead of all in one cell

I have problems getting the data in separate rows. At the moment all my data per column is in one cell. I really would appreciate your support!
the column header is "Dealer" and it is showing one value below like this:
|Dealer|
|:---- |
|['Automobiles', 'Garage Benz', 'Cencini SA']|
I would like to get three rows out of this:
|Row|Dealer|
|:---- |:---- |
|1|'Automobiles'|
|2|'Garage Benz'|
|3|'Cencini SA'|
|4|....|
|5|....|
|...|...|
what would be the easiest way to achieve this?
Thanks for your support, as I am totally new to pandas!
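The easiest way is to put your data into a dict, with the column name as the key and the list of values as the value:
import pandas as pd
x = {'Dealer': ['Automobiles', 'Garage Benz', 'Cencini SA']}
Then pass it to the DataFrame constructor:
x = pd.DataFrame(x)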
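If the DataFrame already exists and the single cell really holds a Python list (an assumption; if it holds the string "['Automobiles', ...]" it would first need parsing, for example with ast.literal_eval), a possible alternative is DataFrame.explode, available in pandas 0.25 and later. A minimal sketch:

```python
import pandas as pd

# One row whose "Dealer" cell contains a whole list, as in the question.
df = pd.DataFrame({"Dealer": [["Automobiles", "Garage Benz", "Cencini SA"]]})

# explode() gives each list element its own row;
# reset_index(drop=True) renumbers the rows from 0.
out = df.explode("Dealer").reset_index(drop=True)
print(out)
```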

Calculating the difference between values based on their date

I have a dataframe that looks like this, where the "Date" is set as the index
A B C D E
Date
1999-01-01 1 2 3 4 5
1999-01-02 1 2 3 4 5
1999-01-03 1 2 3 4 5
1999-01-04 1 2 3 4 5
I'm trying to compare the percent difference between two pairs of dates. I think I can do the first bit:
start_1 = "1999-01-02"
end_1 = "1999-01-03"
start_2 = "1999-01-03"
end_2 = "1999-01-04"
Obs_1 = df.loc[end_1] / df.loc[start_1] -1
Obs_2 = df.loc[end_2] / df.loc[start_2] -1
The output I get from - eg Obs_1 looks like this:
A 0.011197
B 0.007933
C 0.012850
D 0.016678
E 0.007330
dtype: float64
I'm looking to build some correlations between Obs_1 and Obs_2. I think I need to create a new dataframe with the labels A-E as one column (or as the index), and then the data series from Obs_1 and Obs_2 as adjacent columns.
But I'm struggling! I can't 'see' what Obs_1 and Obs_2 'are' - have I created a list? A series? How can I tell? What would be the best way of combining the two into a single dataframe...say df_1.
I'm sure the answer is staring me in the face but I'm going mental trying to figure it out...and because I'm not quite sure what Obs_1 and Obs_2 'are', it's hard to search the SO archive to help me.
Thanks in advance
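One possible way to approach this, sketched under assumptions: selecting a single row with df.loc[label] normally yields a pandas Series, type() will confirm it, and pd.concat can place two Series side by side. The numbers below are invented for illustration (the question's real data is not shown), and the index is assumed to hold plain date strings.

```python
import pandas as pd

# Made-up values purely for illustration.
df = pd.DataFrame(
    {"A": [100.0, 101.0, 103.0, 102.0], "B": [50.0, 50.5, 50.0, 51.0]},
    index=["1999-01-01", "1999-01-02", "1999-01-03", "1999-01-04"],
)
df.index.name = "Date"

obs_1 = df.loc["1999-01-03"] / df.loc["1999-01-02"] - 1
obs_2 = df.loc["1999-01-04"] / df.loc["1999-01-03"] - 1

print(type(obs_1))  # shows what kind of object obs_1 is

# Place the two Series side by side, aligned on the labels A, B, ...
df_1 = pd.concat([obs_1, obs_2], axis=1, keys=["Obs_1", "Obs_2"])
print(df_1)
```

From there, df_1.corr() would give the correlation between the two columns.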

How to transpose columns when they encode multiple "records"?

I have a spreadsheet I have imported into OpenRefine. The creator encoded groups of information (records) in columns. I need to bring each of those groups of columns into its own row, along with all the relevant columns.
Using a simplified example, how would I go from this:
id foo1 foo2 foo3 bar1 bar2 bar3
1 4 6 a 7 9 b
2 5 5 a 8 8 b
3 6 4 a 9 7 b
To this:
id foobar1 foobar2 foobar3
1 4 6 a
1 7 9 b
2 5 5 a
2 8 8 b
3 6 4 a
3 9 7 b
I've been trying to think of a way forward with intermediate columns, but there are 6 groups of 5 columns and I'm currently stuck.
I found a solution. The steps are:
1. Concat each group of columns into a single column (FOO_CONCAT, BAR_CONCAT)
2. Delete the now unneeded columns (foo1..3, bar1..3)
3. Transpose your CONCAT columns into a single column, no prefix, ignoring blanks, filling down other columns. Now FOO_CONCATs and BAR_CONCATs are all in the same column
4. Split that column into several columns (using the separator you used in step 1)
5. Rename columns
6. Strip out prefixes (I had foo1:4, bar2:8, etc. for clarity)
7. Transform to numbers (Edit cells -> Common Transforms -> toNumber)
Now you're ready to transpose, facet, etc.
I think this is essentially the same as the solution you describe, but possibly with some shortcuts to avoid all the steps.
Given the example data you post I would:
On the "Id" column select Edit column->Add column based on this column from the menu
Make new column name "foobar"
Use the GREL forEach(row.columnNames,cn,if(cn.startsWith("foo"),cells[cn].value,null)).join("|")+"~"+forEach(row.columnNames,cn,if(cn.startsWith("bar"),cells[cn].value,null)).join("|")
Once new "foobar" column exists, on this column use menu option Edit cells->Split multi-valued cells using the "~" character (as used in the GREL above)
Then, also on the "foobar" column, use menu option Edit columns->Split into several columns, using the "|" character as in the GREL above
Finally on ID column use menu Edit cells->Fill down
This should result in the output you describe - if you don't need the original columns at this point you can either remove them, or (sometimes quicker) export the first X columns that have the reconfigured data using the custom tabular exporter, and then import that data into a new project.
You can modify the GREL to deal with the exact column groupings you have. In my example I've used the column naming to group the values, but if that isn't the reality of the data you are dealing with you can use GREL like:
forEach(row.columnNames.slice(1,4),cn,cells[cn].value).join("|")+"~"+forEach(row.columnNames.slice(4,8),cn,cells[cn].value).join("|")
Which uses the 'slice' function to select certain columns rather than using some aspect of the column name to select them.
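To make the target shape concrete, the position-based grouping can be sketched in plain Python. This is only an illustration of the reshaping logic on the example data, not something that runs inside OpenRefine; group_size and the variable names are assumptions.

```python
# Example rows from the question, in column order:
# id, foo1, foo2, foo3, bar1, bar2, bar3
rows = [
    [1, 4, 6, "a", 7, 9, "b"],
    [2, 5, 5, "a", 8, 8, "b"],
    [3, 6, 4, "a", 9, 7, "b"],
]

group_size = 3  # number of columns per encoded record

long_rows = []
for row in rows:
    row_id, values = row[0], row[1:]
    # Cut the remaining cells into consecutive chunks of group_size
    # and emit one output row per chunk, repeating the id.
    for start in range(0, len(values), group_size):
        long_rows.append([row_id, *values[start:start + group_size]])

for r in long_rows:
    print(r)
```

For the real data described in the question (6 groups of 5 columns), group_size would be 5.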

How can I compare two sets of data having two columns in excel? Picture below will elaborate

Below are two sets of data. Each has two columns. I want matching data from the two sets to line up side by side.
This is a manual solution with formulas and sorting.
Imagine the following data in columns A to E:
Enter the following formulas into columns G to K
Column G: =IFERROR(IF(VLOOKUP(D:D,A:B,2,FALSE)=E:E,1,2),3)
Column H: =IF(G:G<3,D:D,"")
Column I: =IFERROR(VLOOKUP(H:H,A:B,2,FALSE),"")
Column J: =D:D
Column K: =IFERROR(VLOOKUP(J:J,D:E,2,FALSE),"")
Column G (the "sort by" column) now shows:
1 if part and quantity matched
2 if only part matched
3 if nothing matched
So if you now select the data in A3:K10 and sort by column G (the "sort by" column), it will result in this:
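The logic of the column G formula can be sketched in Python. The part names and quantities below are hypothetical, since the original screenshots are not available, and the sketch assumes each part appears only once in the first set (VLOOKUP with FALSE would return the first match).

```python
# Hypothetical data standing in for the missing screenshots.
set_1 = {"P1": 10, "P2": 5, "P3": 7}       # columns A:B, part -> quantity
set_2 = [("P2", 5), ("P4", 1), ("P1", 8)]  # columns D:E, (part, quantity)

def sort_key(part, qty):
    """1 = part and quantity match, 2 = only part matches, 3 = no match."""
    if part not in set_1:
        return 3
    return 1 if set_1[part] == qty else 2

# Attach the key to each row of the second set, then sort by it,
# which mirrors sorting the sheet by column G.
ranked = sorted(((sort_key(p, q), p, q) for p, q in set_2), key=lambda t: t[0])

for key, part, qty in ranked:
    print(key, part, qty)
```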

Excel: one column has duplicates of each value, I need to take averages of the corresponding two values from the other columns

Example:
column A column B
A 1
A 2
B 2
B 2
C 1
C 1
I would somehow like to get the following result:
column A column B
A 1.5
B 2
C 1
(which are averages of 1 and 2, 2 and 2 and 1 and 1)
How do I achieve that?
Thanks
If you're using Excel 2007 or above, you can also use the shorter AVERAGEIF function:
=AVERAGEIF($A$1:$A$6,D1,$B$1:$B$6)
Less typing, easier to read.
In D1:D3, type A, B, C. Then in E1, put this formula
=SUMIF($A$1:$A$6,D1,$B$1:$B$6)/COUNTIF($A$1:$A$6,D1)
and fill down to E3. If you want to replace the existing data, copy E1:E3 and paste-special-values over itself. Then delete A:C.
Alternatively, you can add headers to your data, say "Letter" and "Number". Then create a Pivot Table from your data. Put Letter in the rows section and Number in the Data section. Change your Data section from SUM to AVERAGE and you'll get the same result.
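For cross-checking the spreadsheet result, the same per-letter average can be sketched in Python using only the standard library. The data is the example from the question; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (column A, column B) pairs from the example.
data = [("A", 1), ("A", 2), ("B", 2), ("B", 2), ("C", 1), ("C", 1)]

# Collect the numbers belonging to each letter.
groups = defaultdict(list)
for letter, number in data:
    groups[letter].append(number)

# Average each group, like AVERAGEIF does per criterion.
averages = {letter: mean(numbers) for letter, numbers in groups.items()}
print(averages)
```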