OpenRefine: create a new column by looking up values from a pair of columns

I have a table in OpenRefine with columns A, B, and C like this (D is the column I want to create):
A | B | C | D
---|---|---|---
a | 1 | b | 2
b | 2 | |
c | 3 | a | 1
d | 4 | c | 3
I want to create a column D by fetching the values from B corresponding to those in C, using A as an index. Hope that makes sense? I'm not having much luck figuring out how to do this in GREL.

You can use the 'cross' function to look up values across the project. Cross is usually used to look up values in a different OpenRefine project/file, but actually it works the same if you point it back at the same project you are already in.
So - from Col C, you can use "Add new column based on this column" with the GREL:
cell.cross("Your project name","Col A")
You'll get back an array of 'rows' - and if the same value appears in Column A multiple times you could get multiple rows.
To extract a value from the array you can use something like:
forEach(cell.cross("Your project name","Col A"),r,r.cells["Col B"].value).join("|")
The final 'join' is necessary to convert the array into a string, which is required in order to store the result (arrays can't be stored directly).
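To make the lookup logic concrete, here is a minimal Python sketch of what the self-referencing cross does with the sample table from the question. It is not GREL and does not call OpenRefine; the `rows` list and the `index` dictionary are illustrative names only.

```python
# Sample rows from the question (column D is what we want to compute).
rows = [
    {"A": "a", "B": "1", "C": "b"},
    {"A": "b", "B": "2", "C": ""},
    {"A": "c", "B": "3", "C": "a"},
    {"A": "d", "B": "4", "C": "c"},
]

# Index every B value by its A key; the same key may appear in several rows.
index = {}
for row in rows:
    index.setdefault(row["A"], []).append(row["B"])

# For each row, look up C in the index and join the matches with "|",
# mirroring the join at the end of the GREL expression.
for row in rows:
    row["D"] = "|".join(index.get(row["C"], []))

print([row["D"] for row in rows])
```

A blank C simply yields an empty D, and a key that occurs more than once in A would produce several values separated by "|".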


Filtering Columns in PLSQL

I have a table with tons and tons of columns and I'm trying to select only certain columns based on the data the columns contain. The table is part of an application I'm building in Oracle APEX and looks something like this:
|Row Header|Criteria 1|Criteria 2| Criteria 3 | Criteria 4 |Criteria 5 |
|Category | Type A | Type B | Type B | Type A | Type A |
| ID | 2.3 | 2.4 | 2.5 | 3.1 | 3.2 |
| Part A | Yes | Yes | Yes | No | Yes |
| Part B | Yes | No | Yes | Yes | Yes |
| Part C | No | Yes | Yes | Yes | No |
It goes on like this for around 1000 criteria and 100 parts. I need to find a way to select all the columns that are of a specific type into their own table using SQL.
I'd like the result to look like this:
|Row Header|Criteria 4|Criteria 5 |
|Category | Type A | Type A |
| ID | 3.1 | 3.2 |
| Part A | No | Yes |
| Part B | Yes | Yes |
| Part C | Yes | No |
This way I only have the columns showing that are part of the "Type A" Category and have an ID greater than 3.
I've looked into GROUP BY and FILTER functions that SQL has to offer as well as PIVOT and I don't believe these will help me, but I'd be happy to be proven wrong.
In a relational database, columns are meant to be discrete, non-repeating attributes of a thing. Rows are meant to be multiple instances of that thing. Your table is reversed, using columns for what should be rows, and rows for what should be columns. Another factor is that Oracle limits you to 1000 columns, and you start undergoing severe performance degradation when you exceed 254 columns. Tables simply weren't meant to have hundreds, let alone thousands, of columns. So the first step is to pivot your table like this:
Criteria_No, Cat, ID, PtA, PtB, PtC
---------------------------------------------
Row 1: Criteria 1, Type A, 2.3, Yes, Yes, No
Row 2: Criteria 2, Type B, 2.4, Yes, No, Yes
Row 3: Criteria 3, Type B, 2.5, Yes, Yes, Yes
. . . thousands more
But even then, you mentioned that you have 100s of "parts", so Parts A, B, C aren't the only three - the series continues. If so, it would be a violation of normal form to have such a repeating list in a single row. So you have one more step to fix your design: Break this into three tables.
CRITERIA
Criteria_No, Cat, ID
---------------------------------------------
Row 1: Criteria 1, Type A, 2.3
Row 2: Criteria 2, Type B, 2.4
Row 3: Criteria 3, Type B, 2.5
PARTS
Part, anything-else-about-part
-----------------
Part A, blah
Part B, blah
Part C, blah
. . .
And now the bridge table between them:
CRITERIA_PARTS
Criteria_No, Part
-----------------
1, Part A
1, Part B
1, Part C
2, Part A
2, Part B
. . . and so on
You should also place a foreign key on each of the bridge table columns to point to their respective parent tables to ensure data integrity.
Now you query by joining the tables together in your SQL.
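As a rough illustration of that join, the following Python sketch builds the three tables in an in-memory SQLite database (not Oracle) and selects the "Type A" criteria with an ID greater than 3. Two things here are assumptions rather than part of the answer: the lowercase table and column names, and the convention that a bridge row exists only where the original cell said "Yes".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE criteria (criteria_no INTEGER PRIMARY KEY, cat TEXT, id REAL);
CREATE TABLE parts (part TEXT PRIMARY KEY);
CREATE TABLE criteria_parts (
    criteria_no INTEGER REFERENCES criteria(criteria_no),
    part TEXT REFERENCES parts(part)
);
INSERT INTO criteria VALUES
    (1, 'Type A', 2.3), (2, 'Type B', 2.4), (3, 'Type B', 2.5),
    (4, 'Type A', 3.1), (5, 'Type A', 3.2);
INSERT INTO parts VALUES ('Part A'), ('Part B'), ('Part C');
-- Assumption: a bridge row is stored only for a "Yes" in the original table.
INSERT INTO criteria_parts VALUES
    (1, 'Part A'), (1, 'Part B'),
    (4, 'Part B'), (4, 'Part C'),
    (5, 'Part A'), (5, 'Part B');
""")

# Filter on the criteria attributes, then join to find the matching parts.
result = conn.execute("""
    SELECT c.criteria_no, c.id, cp.part
    FROM criteria c
    JOIN criteria_parts cp ON cp.criteria_no = c.criteria_no
    WHERE c.cat = 'Type A' AND c.id > 3
    ORDER BY c.criteria_no, cp.part
""").fetchall()

for row in result:
    print(row)
```

The filter that was impossible to express over a thousand columns becomes an ordinary WHERE clause once each criteria is a row.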
Updated: you asked how to move data into this new criteria table from your existing one. Use dynamic SQL like this:
BEGIN
  FOR i IN 1..1000
  LOOP
    EXECUTE IMMEDIATE 'INSERT INTO criteria (criteria_no,cat,id) SELECT criteria_'||i||',category,id FROM oldtable';
  END LOOP;
  COMMIT;
END;
But of course set the 1000 to the real number of criteria_n columns.
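The underlying operation is a transposition: each criteria column of the old table becomes one row of the new one. A minimal Python sketch of that idea, using only the first three criteria from the question, might look like this. It is separate from the PL/SQL loop, and the `wide` list layout is an assumption made purely for illustration.

```python
# The "Category" and "ID" rows of the wide table, row header first.
wide = [
    ["Category", "Type A", "Type B", "Type B"],
    ["ID", 2.3, 2.4, 2.5],
]

# Drop the row-header cell from each row, keeping only the criteria values.
category_row, id_row = (row[1:] for row in wide)

# Pair the values column by column: one (criteria_no, cat, id) tuple per criteria.
criteria = [
    (criteria_no, cat, crit_id)
    for criteria_no, (cat, crit_id) in enumerate(zip(category_row, id_row), start=1)
]

print(criteria)
```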

How to create a funnel visual/bar chart in Tableau by creating a calculated field using an existing column in the data source?

In my data source, there's a column called 'Pool'
Within that column, there are 3 distinct values:
| Pool |
| C |
| B |
| C |
| A |
So as you can see, there are 3 distinct values: A, B, and C. I want to create a funnel, or essentially a bar chart, that counts the occurrences of each of those three values in the column. However, I know I can't just place the column itself in the sheet, since I also want a fourth bar that counts all the values as an "All" category.
So eventually having a visual that states (but this is in tabular form to help illustrate what I mean)
All | 20
A | 10
B | 5
C | 5
Please find an indicative answer in fiddle
You could use UNION to combine two results: one that returns the COUNT for each of your values and one that returns the COUNT for all your rows.
(SELECT Pool, COUNT(Pool) AS your_count
FROM your_table
GROUP BY Pool)
UNION
(SELECT 'ALL', COUNT(*) AS your_count
FROM your_table)
ORDER BY your_count DESC
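A quick way to try this out is an in-memory SQLite database driven from Python, as sketched below with the four sample rows from the question. Two deviations from the query above are deliberate: SQLite does not accept parentheses around the individual SELECTs, and UNION ALL is used because the two result sets cannot overlap here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (Pool TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?)",
    [("C",), ("B",), ("C",), ("A",)],
)

# Per-value counts plus one extra row holding the overall count.
rows = conn.execute("""
    SELECT Pool, COUNT(Pool) AS your_count FROM your_table GROUP BY Pool
    UNION ALL
    SELECT 'ALL', COUNT(*) FROM your_table
    ORDER BY your_count DESC
""").fetchall()

counts = dict(rows)
print(counts)
```

The resulting rows could then serve as a data source with one row per bar, including the "ALL" bar.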

SQL: reverse groupby : EDIT

Is there a built-in function in SQL to reverse the order in which GROUP BY works? I am trying to group by a certain key, but I would like the last inserted record returned, not the first inserted record.
Changing the order with ORDER BY does not affect this behaviour.
Thanks in advance!
EDIT:
this is the sample data:
id|value
-----
1 | A
2 | B
3 | B
4 | C
as return i want
1 | A
3 | B
4 | C
not
1 | A
2 | B
4 | C
When using GROUP BY, I don't get the result I want.
The question here is how you identify the last inserted row. From your example, it looks like it is based on id. If id is auto-generated or comes from a sequence, then you can definitely do this:
select max(id),value
from your_table
group by value
Ideally, a table design includes a date column that holds the time each record was inserted, so it is easy to order by that.
Use Max() as your aggregate function for your id:
SELECT max(id), value FROM <table> GROUP BY value;
This will return:
1 | A
3 | B
4 | C
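The same query can be checked against the sample data with a small Python sketch using an in-memory SQLite database. The table name `your_table` is a placeholder, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "B"), (4, "C")],
)

# MAX(id) picks the highest (most recently inserted) id within each group.
latest = conn.execute(
    "SELECT MAX(id), value FROM your_table GROUP BY value ORDER BY value"
).fetchall()

print(latest)
```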
As for Eloquent, I've not used it, but I think it would look like:
$myData = DB::table('yourtable')
->select('value', DB::raw('max(id) as maxid'))
->groupBy('value')
->get();

Transforming a 2 column SQL table into 3 columns, column 3 lagged on 2

Here's my problem: I want to write a query (that goes into a larger query) that takes a table like this:
ID | DATE
A | 1
A | 2
A | 3
B | 1
B | 2
and so on, and transforms it into:
ID | DATE1 | DATE2
A | 1 | 2
A | 2 | 3
A | 3 | NOW
B | 1 | 2
B | 2 | NOW
Where the numbers are dates, and NOW() is always appended to the most recent date. Given free rein I would do this in Python, but unfortunately this goes into a larger query. We're using Sybase's SQL Anywhere 12, I think? I interact with the database using SQuirreL SQL.
I'm very stumped. I thought (SQL query to transform a list of numbers into 2 columns) would help, but I'm afraid I don't know enough to make it work. I was thinking of JOINing the table to itself, but I don't know how to SELECT for only the A-1-2 rows instead of the A-1-3 rows as well, for instance, or how to insert the NOW() value into it. Does anyone have any ideas?
I made an SQL Fiddle to outline a solution for your example. You mentioned dates but used integers, so I chose to do an integer example, but it can be modified. I wrote it in PostgreSQL; in other databases the coalesce() function can be substituted with nvl() or similar. Also, the parameter '0' can be substituted with any value, including now(), but you must then change the data type of the "i" column in the table to be a date as well. Please let me know if you need further help on this.
select a.id, a.i, coalesce(min(b.i),'0') from
test a
left join test b on b.id=a.id and a.i<b.i
group by a.id,a.i
order by a.id, a.i
http://sqlfiddle.com/#!15/f1fba/6
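For anyone without access to the fiddle, here is a minimal Python sketch that runs the same self-join against an in-memory SQLite database with the sample data from the question. SQLite stands in for PostgreSQL or SQL Anywhere here, and the integer 0 plays the role of the NOW placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, i INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)],
)

# For each row, join to all later rows of the same id and keep the earliest one.
# Rows with no later match get the fallback value from COALESCE.
pairs = conn.execute("""
    SELECT a.id, a.i, COALESCE(MIN(b.i), 0)
    FROM test a
    LEFT JOIN test b ON b.id = a.id AND a.i < b.i
    GROUP BY a.id, a.i
    ORDER BY a.id, a.i
""").fetchall()

for row in pairs:
    print(row)
```

The MIN over the joined rows is what prevents the A-1-3 pairing mentioned in the question: of all later values, only the nearest one survives the GROUP BY.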

ADO - how to select column from xls file where two or more columns have the same name?

I have an excel file like this:
| | A | B | C | D |
| 1 | Name 1 | Name 2 | Name 3 | Name 2 |
| 2 | Data | Data | Data | Data |
| 3 | Data | Data | Data | Data |
As you can see, headers of two columns have the same name - Name 2.
My question is, is it possible to tell the ADO engine from which column to select data?
Currently my select looks like this:
SELECT [Name 1], [Name 2] FROM [REPORT7_RAW$] WHERE [Name 1] IS NOT NULL
and ADO picks up the data from the column listed under column B in Excel. In other words, it takes the first column that has the given name. Unfortunately I have two columns with the same name, and I would like to pull the data from column D. Is it possible?
I could not find any way to select a column by its index rather than its name.
You will need to change your connection string so that data header names are not used. The normal connection string would look something like this:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=c:\myFolder\myExcel2007file.xlsx;
Extended Properties="Excel 12.0 Xml;HDR=YES";
You need to change the last bit, HDR=YES, to HDR=NO.
With that type of connection, the columns(fields) then become F1, F2, etc., where F1 = column A, F2 = column B, etc.
This is not ideal, since you are now essentially running the query based on the number of the column rather than the name, but with duplicate column names, this is the only way around that.
Per the comment from @barrowc: This format of the connection string will treat your column names as data. So depending on your query, you may need to include code to filter out the row that contains your column names.
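The effect of HDR=NO can be pictured with a small Python sketch that does not use ADO at all: every row, including the header row, becomes data, and fields are addressed by position as F1, F2, and so on. The cell values such as "r2a" are invented placeholders standing in for the "Data" cells in the question.

```python
# The worksheet as raw rows; with HDR=NO the first row is just another data row.
sheet = [
    ["Name 1", "Name 2", "Name 3", "Name 2"],
    ["r2a", "r2b", "r2c", "r2d"],
    ["r3a", "r3b", "r3c", "r3d"],
]

# Label each cell positionally: F1 = column A, F2 = column B, and so on.
records = [
    {f"F{n}": cell for n, cell in enumerate(row, start=1)}
    for row in sheet
]

# Roughly what selecting F1 and F4 would return, with the header row filtered out
# by comparing F1 against the known header text.
selected = [(r["F1"], r["F4"]) for r in records if r["F1"] != "Name 1"]

print(selected)
```

Here F4 reaches column D directly, which the duplicate "Name 2" header made impossible by name.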