How to match entries between datasets? - openrefine

For example, I have a dataset like this:
| People | ID |
|-------------|-----|
| John Smith |A1234|
| John Doe |A1235|
| Jane Doe |A1236|
| John Smith |A1237|
And I also have another dataset like this:
| People | Company | City | Rank |
|-------------|---------|--------|-------|
| John Smith | XXX |New York| 1 |
| John Doe | YYY |London | 2 |
| Jane Doe | ZZZ |Seoul | 3 |
| John Smith | WWW |Tokyo | 4 |
I want to find the company of each person in the first table, using the information in the second table. Note that a few people share the same name in both tables, so other columns are needed to tell them apart.
Is it necessary to import both tables into one project? In reality I have multiple tables providing possible name/company matches, but they have little in common (i.e. each dataset provides entirely different information) other than that every dataset has name and company columns.

You need to create two separate OpenRefine projects and join them using the cell.cross function. You can also see this tutorial for joining two projects in OpenRefine.
cell.cross performs the equivalent of a database join. You will need a unique identifier common to your two projects for the function to match the records; otherwise, when a key such as "John Smith" occurs more than once, OpenRefine will return the first match.
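The first-match problem can be seen in a small Python sketch. This is only an analogy for a keyed lookup, not OpenRefine's implementation; the `build_index` helper and the variable names are made up for illustration, and the data is the two tables from the question.

```python
from collections import defaultdict

# First dataset: (People, ID)
first = [
    ("John Smith", "A1234"),
    ("John Doe", "A1235"),
    ("Jane Doe", "A1236"),
    ("John Smith", "A1237"),
]

# Second dataset: People, Company, City, Rank
second = [
    {"People": "John Smith", "Company": "XXX", "City": "New York", "Rank": 1},
    {"People": "John Doe", "Company": "YYY", "City": "London", "Rank": 2},
    {"People": "Jane Doe", "Company": "ZZZ", "City": "Seoul", "Rank": 3},
    {"People": "John Smith", "Company": "WWW", "City": "Tokyo", "Rank": 4},
]


def build_index(rows, column):
    """Group rows by the value found in `column` (hypothetical helper)."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index


index = build_index(second, "People")

# Take the first match for each name, as a lookup on a non-unique key would.
company_by_id = {pid: index[name][0]["Company"] for name, pid in first}

# Names that match more than one row cannot be resolved by name alone.
ambiguous = sorted(name for name, rows in index.items() if len(rows) > 1)
```

Both John Smith IDs end up with company XXX here, and `ambiguous` lists the names that need an extra column in the key before the join can be trusted.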

Related

Is there an Access SQL code/query for concatenating first letter plus a unique ID number and insert into a new column? [duplicate]

First of all, I am quite new to SQL and Microsoft Access.
I am setting up a database in Access. My database collects information from four different departments. I store my data through forms. My main table (Business) stores the department using a Combo Box, which saves a number instead of text.
I want to have a column (like CODE ID in the table below) which shows the initial letter of a field (named Department) plus a number.
I.e., in table "Business", I want to display a Code ID that contains the initial of the Department column plus a number code (the department's order number, ascending). I want this to be generated every time I add information.
+===============+=================+=========+
| DEPARTMENT    | PARTNER         | CODE ID |
+===============+=================+=========+
| Data_Analysis | John Doe        | D001    |
+---------------+-----------------+---------+
| Marketing     | Jane Doe        | M001    |
+---------------+-----------------+---------+
| Finance       | Alex Mustermann | F001    |
+---------------+-----------------+---------+
| Operations    | Juan Perez      | O001    |
+---------------+-----------------+---------+
| Finance       | Barack Trump    | F002    |
+---------------+-----------------+---------+
| Finance       | Mark Merkel     | F003    |
+---------------+-----------------+---------+
| Marketing     | Peggy Hilton    | M002    |
+---------------+-----------------+---------+
| Operations    | Max Mustermann  | O002    |
+---------------+-----------------+---------+
| Operations    |                 | OXXX    |
+---------------+-----------------+---------+
The values in column CODE ID are those I would like to have displayed every time I add a new row (new department order). I need this type of code to track the number of orders in each department and to use as a unique code for any inquiries with partners. I don't want to have it as the primary key id.
Thanks in advance!
If you rethink the schema slightly it becomes trivial; instead of storing the letter and the number combined in one column, just keep a running count per department when inserting:
INSERT INTO business(department, name, code) SELECT Forms!Department, Forms!Name, COUNT(*)+1 FROM business WHERE department=Forms!Department
Then when you pull the information out:
SELECT department, name, LEFT(department, 1) & Format(code, "000") AS code_id FROM business
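As a rough illustration of the running-count idea, here is a sketch using Python's built-in sqlite3 module as a stand-in for Access. The `add_order` helper and the `partner` column name are assumptions for the example, and SQLite's `substr`, `printf` and `||` replace Access's `LEFT`, `Format` and `&`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (department TEXT, partner TEXT, code INTEGER)")


def add_order(department, partner):
    """Insert a row whose code is the count of existing rows for that department, plus one."""
    conn.execute(
        "INSERT INTO business (department, partner, code) "
        "SELECT ?, ?, COUNT(*) + 1 FROM business WHERE department = ?",
        (department, partner, department),
    )


orders = [
    ("Data_Analysis", "John Doe"),
    ("Marketing", "Jane Doe"),
    ("Finance", "Alex Mustermann"),
    ("Operations", "Juan Perez"),
    ("Finance", "Barack Trump"),
    ("Finance", "Mark Merkel"),
    ("Marketing", "Peggy Hilton"),
    ("Operations", "Max Mustermann"),
]
for department, partner in orders:
    add_order(department, partner)

# Build the display code only when reading: first letter plus zero-padded counter.
rows = conn.execute(
    "SELECT department, partner, substr(department, 1, 1) || printf('%03d', code) "
    "FROM business ORDER BY rowid"
).fetchall()
code_ids = [row[2] for row in rows]
```

One caveat with this design: a counter derived from `COUNT(*)` can repeat a number if rows are deleted or if two inserts run at the same time, so it may need a uniqueness check in a multi-user database.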

Hide Hierarchy duplication in Powerpivot (Row Labels)

I am reporting on performance of legal cases, from a SQL database of activities. I have a main table of cases, which has a parent/child hierarchy. I am looking for a way to appropriately report on case performance, reporting only once for a parent/child group ('Family').
An example of relevant tables is:
Cases
ID | Client | MatterName | ClaimAmount | ParentID | NumberOfChildren |
1 | Mr. Smith | ABC Ltd | $40,000 | 0 | 2 |
2 | Mr. Smith | Jakob R | $40,000 | 1 | 0 |
3 | Mr. Smith | Jenny R | $40,000 | 1 | 0 |
4 | Mrs Bow | JQ Public | $7,000 | 0 | 0 |
Payments
ID | MatterID | DateReceived | Amount |
1 | 1 | 14/7/15 | $50 |
2 | 3 | 21/7/15 | $100 |
I'd like to be able to report back on a consolidated view that only shows the parent matter, with total received (and a lot of other similar related fact tables) - e.g.
Client | MatterName | ClaimAmount | TotalReceived |
Mr Smith | ABC Ltd | $40,000 | $150 |
Mrs Bow | JQ Public | $7,000 | $0 |
A key problem I'm having is hiding row labels for irrelevant rows (child matters). I believe I need to:
1. Determine whether the current row is a parent group
2. Consolidate all measures for that parent group
3. Filter on that being True? Place all measures inside IF checks?
Any help appreciated
How many levels does your hierarchy have? If it's just 2 levels (parents have children, children cannot be parents), then denormalize your model. You can add a single column for ParentMatterName and use that as the rowfilter in pivots. If there is a reasonable maximum number of levels in your data (we typically look at <=6 as reasonable) then denormalization is probably preferable, and certainly simpler/more performant, than trying to dynamically roll up the child measure values.
Edits to address comment below:
Denormalizing your data structure in this case just means going to the following table structure:
Cases
ID | Client | MatterName | ParentMatterName | ClaimAmount
1 | Mr. Smith | ABC Ltd | ABC Ltd | $40,000
2 | Mr. Smith | Jakob R | ABC Ltd | $0
3 | Mr. Smith | Jenny R | ABC Ltd | $0
4 | Mrs Bow | JQ Public | JQ Public | $7,000
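To make the roll-up concrete, here is a minimal Python sketch of the same denormalization, assuming a two-level hierarchy in which ParentID 0 marks a parent matter. It is not DAX or a PowerPivot feature; the `parent_name` helper and the dictionary layout are illustrative only, and the data comes from the question's Cases and Payments tables.

```python
cases = [
    {"ID": 1, "Client": "Mr. Smith", "MatterName": "ABC Ltd", "ClaimAmount": 40000, "ParentID": 0},
    {"ID": 2, "Client": "Mr. Smith", "MatterName": "Jakob R", "ClaimAmount": 40000, "ParentID": 1},
    {"ID": 3, "Client": "Mr. Smith", "MatterName": "Jenny R", "ClaimAmount": 40000, "ParentID": 1},
    {"ID": 4, "Client": "Mrs Bow", "MatterName": "JQ Public", "ClaimAmount": 7000, "ParentID": 0},
]
payments = [
    {"ID": 1, "MatterID": 1, "Amount": 50},
    {"ID": 2, "MatterID": 3, "Amount": 100},
]

by_id = {case["ID"]: case for case in cases}


def parent_name(case):
    """Return the parent's MatterName; a case with ParentID 0 is its own parent."""
    parent = by_id[case["ParentID"]] if case["ParentID"] else case
    return parent["MatterName"]


# Denormalized rows: each row carries its parent's name, and the claim
# amount is kept only on the parent row so it is not counted three times.
denormalized = [
    {
        "ID": c["ID"],
        "Client": c["Client"],
        "MatterName": c["MatterName"],
        "ParentMatterName": parent_name(c),
        "ClaimAmount": c["ClaimAmount"] if c["ParentID"] == 0 else 0,
    }
    for c in cases
]
parent_of_matter = {row["ID"]: row["ParentMatterName"] for row in denormalized}

# Group everything by ParentMatterName, as the pivot's row label would.
report = {}
for row in denormalized:
    entry = report.setdefault(
        row["ParentMatterName"],
        {"Client": row["Client"], "ClaimAmount": 0, "TotalReceived": 0},
    )
    entry["ClaimAmount"] += row["ClaimAmount"]
for payment in payments:
    report[parent_of_matter[payment["MatterID"]]]["TotalReceived"] += payment["Amount"]
```

Grouping on the parent name means child matters never appear as their own row, while their payments still count toward the family total.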
Regarding nomenclature - Excel is stupid, and so is DAX. Here is the way to think about these things to help minimize confusion - these are important concepts as you move forward in more complex DAX measures and queries.
Here are some absolutely truthful and accurate statements to show how stupid the nomenclature can get:
FILTER() is a table
Pivot table rows are filter context
FILTER() applies additional filter context when used as an argument to CALCULATE()
FILTER() creates row context internally in which to evaluate expressions
FILTER()'s arguments are affected by filter context from pivot table rows
FILTER()'s second argument evaluates an expression evaluated in the pivot table's rowfilter context in the row context of each row in the table in the first argument
And so on. Don't think of a pivot table as anything but filters. You have filters, slicers, rowfilters, columnfilters. Everything in a pivot table is filter context.
Links:
Denormalization in Power Pivot
Denormalizing Dimensions

Database functional dependency for Nullable Columns

Four of the columns in my non-decomposed, non-normalized Job Application table are nullable. For example, my table is:
Name | SSN | Education | City | Job Applied | Post | Job Obtained | Post Obtained
John. | 123 | High School | LA | USPS | MailMan | USPS | MailMan
John. | 123 | High School | LA | Dept. of Agri | Assistant | *null* | *null*
Sam. | 123 | BS | NY | Intel | QA Analyst | Intel | QA Analyst
The first 4 columns are non-nullable, so I can easily determine functional dependencies between them.
The last 4 columns may or may not have values, depending on whether the person has applied for a job and whether he/she got it.
My question is: can I have functional dependencies on nullable columns, with them on either the LHS or the RHS?
The answer should be yes, please see:
http://en.wikipedia.org/wiki/Functional_dependency
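One way to test a candidate dependency against sample rows is sketched below in Python. The `fd_holds` helper is hypothetical, and it assumes that a null (Python `None`) is compared like any ordinary value; that is only one convention, since SQL itself treats a comparison with NULL as unknown.

```python
# A few columns from the question's sample table; None stands for *null*.
rows = [
    {"Name": "John.", "SSN": 123, "Job Applied": "USPS", "Job Obtained": "USPS"},
    {"Name": "John.", "SSN": 123, "Job Applied": "Dept. of Agri", "Job Obtained": None},
    {"Name": "Sam.", "SSN": 123, "Job Applied": "Intel", "Job Obtained": "Intel"},
]


def fd_holds(rows, lhs, rhs):
    """Check lhs -> rhs: rows that agree on every lhs column must agree on every rhs column."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        value = tuple(row[column] for column in rhs)
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Nullable column on the RHS: holds on this sample.
applied_determines_obtained = fd_holds(rows, ["SSN", "Job Applied"], ["Job Obtained"])

# SSN alone does not determine Job Obtained in this sample.
ssn_determines_obtained = fd_holds(rows, ["SSN"], ["Job Obtained"])
```

A check like this can only refute a dependency on the data at hand; whether the dependency really holds is a statement about the business rules, not about one sample.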

How do you merge rows from 2 SQL tables without duplicating rows?

I guess this question is a little basic and I should know more about SQL, but I haven't done much with joins yet, which I guess is the solution here.
What I have is a table of people and a table of the job roles they hold. A person can have multiple jobs, and I wish to have one set of results with a row per person containing their details and their job roles.
Two example tables (people and job_roles) are below so you can understand the question more easily.
People
id | name | email_address | phone_number
1 | paul | paul#example.com | 123456
2 | bob | bob#example.com | 567891
3 | bart | bart#example.com | 987561
job_roles
id | person_id | job_title | department
1 | 1 | secretary | hr
2 | 1 | assistant | media
3 | 2 | manager | IT
4 | 3 | finance clerk | finance
5 | 3 | manager | IT
The goal is to output each person and their roles like this:
Name: paul
Email Address: paul#example.com
Phone: 123456
Job Roles:
Secretary for HR department
Assistant for media department
_______
Name: bob
Email address: bob#example.com
Phone: 567891
Job roles:
Manager for IT department
So how would I get each person's information (from the people table) along with their job details (from the job_roles table) to output like the example above? I guess it would involve merging their jobs and the relevant departments into a jobs column that can be split up for output, but maybe there is a better way. What would the SQL look like?
Thanks
Paul
PS: it would be a MySQL database, if that makes any difference.
It looks like a straightforward join:
SELECT p.*, r.*
FROM People AS p INNER JOIN job_roles AS r ON p.id = r.person_id
ORDER BY p.name;
The remainder of the work is formatting; that's best done by a report package.
Thanks for the quick response, that seems a good start but you get multiple rows per person like (you have to imagine this is a table as you don't seem to be able to format in comments):
id | Name | email_address | phone_number | job_role | department
1 | paul | paul#example.com | 123456 | secretary | HR
1 | paul | paul#example.com | 123456 | assistant | media
2 | bob | bob#example.com | 567891 | manager | IT
I would like one row per person ideally with all their job roles in it if that's possible?
It depends on your DBMS, but most available ones do not support RVAs - relation-valued attributes. What you'd like is to have the job role and department part of the result like a table associated with the user:
+----+------+------------------+--------------+------------------------+
| id | Name | email_address | phone_number | dept_role |
+----+------+------------------+--------------+------------------------+
| | | | | +--------------------+ |
| | | | | | job_role | dept | |
| 1 | paul | paul#example.com | 123456 | | secretary | HR | |
| | | | | | assistant | media | |
| | | | | +--------------------+ |
+----+------+------------------+--------------+------------------------+
| | | | | +--------------------+ |
| | | | | | job_role | dept | |
| 2 | bob | bob#example.com | 567891 | | manager | IT | |
| | | | | +--------------------+ |
+----+------+------------------+--------------+------------------------+
This accurately represents the information you want, but is not usually an option.
So, what happens next depends on your report generation tool. Using the one I'm most familiar with, (Informix ACE, part of Informix SQL, available from IBM for use with the Informix DBMSs), you would simply ensure that the data is sorted and then print the name, email address and phone number in the 'BEFORE GROUP OF id' section of the report, and in the 'ON EVERY ROW' section you would process (print) just the role and department information.
It is often a good idea to separate the report formatting from the data retrieval operations; this is an example of where it is necessary unless your DBMS has unusual features to help with the formatting of selected data.
Oh dear, that sounds very complicated. Is it not something I could run easily against a MySQL database from a PHP page?
The RVA stuff - you're right, that is not for MySQL and PHP.
On the other hand, there are millions of reports (meaning results from queries that are formatted for presentation to a user) that do roughly this. The technical term for them is 'Control-Break Report', but the basic idea is not hard.
You keep a record of the 'id' number you last processed - you can initialize that to -1 or 0.
When the current record has a different id number from the previous number, then you have a new user and you need to start a new set of output lines for the new user and print the name, email address and phone number (and change the last processed id number). When the current record has the same id number, then all you do is process the job role and department information (not the name, email address and phone number). The 'break' occurs when the id number changes. With a single level of control-break, it is not hard; if you have 4 or 5 levels, you have to do more work, and that's why there are reporting packages to handle it.
So, it is not hard - it just requires a little care.
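The control-break logic described above can be sketched in a few lines of Python. The tuple layout and the `control_break_report` name are assumptions for illustration; the rows stand for the joined result set, which must be sorted so that all rows for one person are adjacent.

```python
# Joined rows: (id, name, email_address, phone_number, job_title, department),
# with values copied from the question and sorted by person id.
joined_rows = [
    (1, "paul", "paul#example.com", "123456", "secretary", "hr"),
    (1, "paul", "paul#example.com", "123456", "assistant", "media"),
    (2, "bob", "bob#example.com", "567891", "manager", "IT"),
    (3, "bart", "bart#example.com", "987561", "finance clerk", "finance"),
    (3, "bart", "bart#example.com", "987561", "manager", "IT"),
]


def control_break_report(rows):
    """Emit the person header once per id, then one line per job role."""
    lines = []
    last_id = None  # id of the person processed most recently
    for person_id, name, email, phone, job_title, department in rows:
        if person_id != last_id:
            # The 'break': a new person starts here.
            if last_id is not None:
                lines.append("_______")
            lines.append(f"Name: {name}")
            lines.append(f"Email Address: {email}")
            lines.append(f"Phone: {phone}")
            lines.append("Job Roles:")
            last_id = person_id
        # Every row, new person or not, contributes one role line.
        lines.append(f"{job_title} for {department} department")
    return lines


report_lines = control_break_report(joined_rows)
print("\n".join(report_lines))
```

The same loop translates directly to PHP: keep the last id in a variable and print the header only when the current row's id differs from it.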
RE: "I was hoping SQL could do something clever and join the rows together nicely so I had essentially a jobs column with that persons jobs in it."
You can get fairly close with
SELECT p.id, p.name, p.email_address, p.phone_number,
group_concat(concat(job_title, ' for ', department, ' department') SEPARATOR '\n') AS JobRoles
FROM People AS p
INNER JOIN job_roles AS r ON p.id = r.person_id
GROUP BY p.id, p.name, p.email_address, p.phone_number
ORDER BY p.name;
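To try the one-row-per-person idea without a MySQL server, here is a sketch using Python's built-in sqlite3 module. SQLite is only a stand-in: its `group_concat(expr, separator)` takes the separator as a second argument instead of MySQL's `SEPARATOR` keyword, and `||` replaces `concat()`. The table contents follow the question, trimmed to the columns needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE job_roles (
        id INTEGER PRIMARY KEY, person_id INTEGER, job_title TEXT, department TEXT
    );
    INSERT INTO people VALUES (1, 'paul'), (2, 'bob'), (3, 'bart');
    INSERT INTO job_roles VALUES
        (1, 1, 'secretary', 'hr'),
        (2, 1, 'assistant', 'media'),
        (3, 2, 'manager', 'IT'),
        (4, 3, 'finance clerk', 'finance'),
        (5, 3, 'manager', 'IT');
    """
)

rows = conn.execute(
    """
    SELECT p.name,
           group_concat(r.job_title || ' for ' || r.department || ' department', '; ')
               AS job_roles
    FROM people AS p
    INNER JOIN job_roles AS r ON p.id = r.person_id
    GROUP BY p.id, p.name
    ORDER BY p.name
    """
).fetchall()

# Split the concatenated column back into individual roles for output.
# SQLite does not guarantee the order inside group_concat, so use a set.
roles_by_name = {name: set(roles.split("; ")) for name, roles in rows}
```

The application still has to split the concatenated column, so the separator should be a string that cannot occur in a job title or department name.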
Doing it the way you want would mean the result set could have an unbounded number of columns, which would be very messy. For example, you could left join the jobs table 10 times and get job1, job2, ... job10.
I would do a single join, then use PHP to check if the name ID is the same from 1 row to the next.
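One way might be to left outer join the tables and then load the rows into an array using
$people_array = array();
// mysql_fetch_assoc() was removed in PHP 7; mysqli_fetch_assoc() is its mysqli
// counterpart and expects $extract1 to be a mysqli_result
while ($row1 = mysqli_fetch_assoc($extract1)) {
    $people_array[] = $row1;
}
and then loop through using
// $number_of_roles is the fixed number of joined rows per person
for ($x = 0; $x < count($people_array); ) {
    // person details are taken from the first row of each group
    echo $people_array[$x]['id'];
    echo $people_array[$x]['name'];
    echo $people_array[$x]['email_address'];
    echo $people_array[$x]['phone_number'];
    for ($y = 0; $y < $number_of_roles; $y++) {
        // one joined row per role; the outer index advances as rows are consumed
        echo $people_array[$x]['job_title'];
        echo $people_array[$x]['department'];
        $x++;
    }
}
You might have to play with the query and the loops a bit, but it should do generally what you want. For it to work as above, every person would have to have the same number of roles ($number_of_roles), but you may be able to fill in the blanks in your table.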

UNIQUE - way to have unique rows in table?

I have a problem with unique rows in a db table. Right now it is possible to do this:
id | Name | LastName | City
-------------------------------------
1 | John | Moore | London
2 | John | Moore | London
When I use the UNIQUE attribute on all columns, I get errors inserting a second Moore even though the Name is different :/
How can I use UNIQUE (or maybe INDEX?) to get something like this in my table:
id | Name | LastName | City
-------------------------------------
1 | John | Moore | London
2 | Jake | Moore | London
3 | John | Keen | London
4 | John | Moore | London //this insert should raise an error, because it duplicates row 1
Sorry if the question is easy, but I am a beginner at SQL and have trouble finding a good example of using UNIQUE the way I want :/
Or must I select from the table before inserting a new row and check whether it already exists?
Remove the unique index on each individual column and create a single unique index on both columns together, like this:
CREATE UNIQUE INDEX ixFullName ON yourTable (LastName, Name);
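The effect of the composite index can be checked with a short sketch using Python's built-in sqlite3 module. SQLite stands in for whatever database the question uses, and the `people` table name and the `try_insert` helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, Name TEXT, LastName TEXT, City TEXT)"
)
# One unique index over the pair of columns, not one index per column.
conn.execute("CREATE UNIQUE INDEX ixFullName ON people (LastName, Name)")


def try_insert(name, last_name, city):
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO people (Name, LastName, City) VALUES (?, ?, ?)",
                (name, last_name, city),
            )
        return True
    except sqlite3.IntegrityError:
        return False


attempts = [
    ("John", "Moore", "London"),
    ("Jake", "Moore", "London"),  # same LastName, different Name: allowed
    ("John", "Keen", "London"),   # same Name, different LastName: allowed
    ("John", "Moore", "London"),  # same pair as the first row: rejected
]
results = [try_insert(*attempt) for attempt in attempts]
```

Letting the database reject the duplicate is generally safer than selecting first and inserting afterwards, because a separate check can race with another insert. If two people with the same full name in different cities should both be allowed, City would need to be added to the index as well.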