OpenRefine - reconcile by second or third candidate

With the reconcile service I often come across this problem: the best candidate isn't really correct; the right match is the second or the third candidate (and it even has a better score).
How can I select the correct one in bulk? I've got thousands of records, and I'm stumbling across lots of cases like this. There should be some way that doesn't mean doing it one by one.
For instance, something that says "take the candidate with the best score, no matter what its position is".
Edit: as pintoch says, it could be a bug. In the meantime it's possible to create two numeric facets, one with cell.recon.candidates[1].score and the other with cell.recon.candidates[2].score. Playing with them, it's possible to select the rows where the second or the third candidate has the best score. Each of those still has to be matched one by one, but it's just a question of clicking.

I would say that this behaviour is a bug in the first place: the candidates should be sorted by decreasing score. The reconciliation service API does not specify that services should return their candidates in any particular order, but that is probably unintended.
The quickest solution would be to contact the person running the reconciliation service that you are using and ask them to sort the candidates by decreasing score on their side.
This also suggests improvements in OpenRefine itself: OpenRefine could always sort the results of a reconciliation service by decreasing score. I have opened a ticket about this.
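For what it's worth, the client-side fix is essentially a one-line sort. A minimal sketch in TypeScript, assuming the candidate shape that reconciliation services return (id, name, score, match):

```typescript
// Sketch: normalize a service's response by sorting candidates
// by decreasing score before treating the first one as the "best".
interface Candidate {
  id: string;
  name: string;
  score: number;
  match: boolean;
}

function sortCandidates(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => b.score - a.score);
}
```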
More broadly, I agree that the current ways to match candidates based on specific criteria could be improved (but this might require redesigning important parts of the reconciliation system, which will take time).

Related

Efficient way to get CIO-1?

I was assigned a task where I receive a bunch of employees and I need to get their CIO-1 (meaning their boss's boss's boss... all the way up to the boss whose manager is the CIO). But I don't know if I'm doing it efficiently. This is my algorithm:
For every given employee, perform an API request to Microsoft Graph to get their manager. Then perform another API request to get that manager's manager... and do this until I reach the one whose manager is our CIO. This means that if I were given 500 employees, I would be performing an HTTP request inside a for loop, with another loop inside to walk up the "manager chain."
Is this OK? Would Microsoft Graph cut me off for performing many, many queries in a short amount of time?
This is more of an algorithm question than a Graph question, really. But yes, at the end of the day you do need to query the data.
However, some ways are smarter than others.
Rather than building a loop that queries each initial employee's boss, with another loop inside to get the next boss up, and so on, you could do it a different way.
First, make sure you leverage batch queries to minimize the number of round-trips.
Second, I think it's safe to assume that at some point some sets of employees are going to have the same boss. Rather than querying multiple times who that boss's boss is, make sure you only do it once.
There are a couple of different ways of doing that: maintaining a tree, an index, always applying distinct on your data sets...
Lastly, keep in mind that a loop within a loop has O(n²) complexity, whereas two subsequent loops (one after another) are O(n). So flattening your algorithm will help, and it will also help you build batches; see the sketch below.
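To illustrate the caching idea, here is a minimal TypeScript sketch using plain fetch against the Graph REST endpoints. Token acquisition, batching, and error handling are elided, and the CIO's user id is assumed to be known:

```typescript
// Sketch: walk each employee's manager chain, caching every hop so a
// shared boss is only ever queried once across all 500 employees.
const GRAPH = "https://graph.microsoft.com/v1.0";

const managerCache = new Map<string, string | undefined>(); // userId -> managerId

async function getManagerId(userId: string, token: string): Promise<string | undefined> {
  if (managerCache.has(userId)) return managerCache.get(userId);
  const res = await fetch(`${GRAPH}/users/${userId}/manager?$select=id`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  // A failed lookup typically means the user has no manager (top of the chain).
  const managerId = res.ok ? ((await res.json()) as { id: string }).id : undefined;
  managerCache.set(userId, managerId);
  return managerId;
}

// Returns the "CIO-1": the ancestor whose manager is the CIO.
async function getCioMinusOne(
  userId: string,
  cioId: string,
  token: string,
): Promise<string | undefined> {
  let current: string | undefined = userId;
  while (current !== undefined) {
    const boss = await getManagerId(current, token);
    if (boss === cioId) return current; // current reports directly to the CIO
    current = boss;
  }
  return undefined; // chain ended without reaching the CIO
}
```

With the cache in place, the total number of HTTP calls is bounded by the number of distinct people in the org rather than employees × chain depth.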
On the throttling part, make sure you go through the guidance. Yes, you might experience it; however, there are a couple of ways to optimize.
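When Graph does throttle you, it responds with HTTP 429 and a Retry-After header. A small sketch of honoring it (the retry count is an arbitrary choice for the example):

```typescript
// Sketch: retry a Graph request when throttled (HTTP 429), waiting for
// the number of seconds the Retry-After header asks for.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  maxRetries = 3,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const seconds = Number(res.headers.get("Retry-After") ?? "2");
    await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
  }
}
```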

Data population and manipulation in frontend vs. backend

I have a very general question. For example, I have an employee table which includes name, address, age, gender, dept, etc. When a user wants to see overall employee information, I do not need to extract all the columns from the DB. I want to show overall employee information first by grouping employees' names per dept. Then a user can select one particular employee if he/she is interested in more detail.
To implement this, which approach is better?
1) Two different APIs which produce two different data results by applying different queries. Even though I have to call two different APIs, this seems efficient.
2) One API which produces one data result including all the employee-relevant columns, even though the overview does not need the full employee detail. Once I get this data, I can re-format the employee information by manipulating the already-extracted data on the front-end side according to the different needs.
Normally, which approach do developers take between 1) and 2)? I think 1) makes sense, but the API becomes very specialized, not generalized. Is it desirable to manipulate data obtained from the back-end (RESTful) on the front-end side (Angular 2)?
Which approach is preferred: creating a relatively large number of specialized APIs, or manipulating the data on the front-end after fetching everything once? If there are criteria for choosing, what grounds should I consider?
Is this correct thinking? If someone has some idea about it, could you give some guidance?
That's a very interesting discussion. If you need a highly optimized application, then you should go with the first option. Also, if the employee table has some columns that only a privileged user may access, you can imagine that client-side filtering won't stop a malicious user from seeing them; in that case you must implement the first option as well.
The drawback is that at some point you could have too many APIs doing very similar things, making it confusing and unproductive.
About the second option: as you pointed out, generalized APIs are a good thing, so in order to speed up the development process and keep a cleaner set of APIs, you can afford to waste a little network traffic in return.
You need to balance your needs.
Is it optimization of resources, like speed in terms of database queries and network requests?
Or is it development time? Do you have enough time to implement all these things and optimizations?
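As a concrete illustration of option 1, here is a minimal TypeScript sketch of the two response shapes; the endpoint paths and field names are invented for the example:

```typescript
// Sketch of option 1: two specialized endpoints with purpose-built payloads.

// GET /api/departments/summary -> only what the overview screen needs.
interface DepartmentSummary {
  dept: string;
  employeeNames: string[];
}

// GET /api/employees/:id -> full detail, fetched only when the user drills in.
interface EmployeeDetail {
  id: string;
  name: string;
  address: string;
  age: number;
  gender: string;
  dept: string;
}

// The overview payload stays small, and sensitive or heavy columns never
// leave the back-end unless the detail endpoint is explicitly called.
async function loadOverview(): Promise<DepartmentSummary[]> {
  const res = await fetch("/api/departments/summary");
  return res.json();
}
```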

Distributed PostgreSQL ID collision handling

Let's imagine we have a distributed table with ID, CONTENT and TIMESTAMP columns. The ID is hash(CONTENT), and the CONTENT is deterministic enough to be entered in multiple places in the system shortly after one another.
Let's say a certain real-life event happened, like someone winning the Olympics. That goes into this database as a record that always looks the same except for the timestamp, as each machine observes the event with a slightly different delay.
So, as the machines sync this distributed table, they will wonder: "We already have this exact ID! But it's not an identical row! What should we do?" I want to give them the answer in the form of bool compare(row a, row b) or, preferably, row merge(row a, row b).
Does anyone know how to do this? I can only find 'merge' material related to merging two different tables, while in fact this is the same table, only distributed.
For me this is pretty essential for making my system eventually consistent. I want to leverage PostgreSQL's distributed database mechanics because they are so reliable; I wouldn't want to rewrite them.
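For illustration, a deterministic merge for a table like this could simply keep the earliest observation, so every node converges on the same row no matter the order in which it sees them. A sketch in TypeScript, with the row shape taken from the question:

```typescript
// Sketch: rows share an ID (hash of CONTENT) but differ in TIMESTAMP.
interface Row {
  id: string;        // hash(content)
  content: string;
  timestamp: number; // epoch millis; differs per observing machine
}

// Deterministic merge: keep the earliest observation of the event.
function merge(a: Row, b: Row): Row {
  return a.timestamp <= b.timestamp ? a : b;
}
```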
PostgreSQL has no "distributed database" features. You can't rewrite them or avoid rewriting them because they don't exist, and I'm quite curious about where you got your reliability information from.
The closest thing I can think of is a third-party add-on called Bucardo, which does multi-master replication with conflict resolution.
It's also possible you were thinking of Postgres-XC, but that project is intended to produce a synchronous, consistent, transparent multi-master cluster, so there'd be no conflict resolution in the first place.
There's also Rubyrep; I don't know enough about it to know if it'd fit your needs.
In the future PostgreSQL will support something akin to what you are describing, with logical replication / bi-directional replication, but it's pre-alpha quality for now, and is likely to land in PostgreSQL 9.5 at the soonest.

OOD: order.fill(warehouse) -or- warehouse.fill(order)

Which form is the correct OO design?
"Matter of taste" is a mediocre's easy way out.
Any good reads on the subject?
I want conclusive proof one way or the other.
EDIT: I know which answer is correct (wink!). What I really want is to see any arguments in support of the former form (order.fill(warehouse)).
There is no conclusive proof, and to a certain extent it is a matter of taste. OO is not science; it is art. It also depends on the domain, the overall software structure, etc., so your small example cannot be extrapolated to every OO problem.
However, here is my take based on your information:
Warehouses store things. They don't fill orders. Orders request things. They don't know which warehouse (or warehouses) the things come from. So a dependency in either direction between the two does not feel right.
In the real world, and in the software, something would act as a mediator between the two. #themel suggested the same in a comment on your question, though I prefer something that sounds less like a programming pattern. Perhaps something like:
ShippingPlan plan = shippingPlanner.fill(order).from(warehouses).ship();
However, it is a matter of taste :-)
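To make that one-liner concrete, here is a small TypeScript sketch of the mediator idea; all class and method names are invented for the example:

```typescript
// Sketch: neither Order nor Warehouse knows about the other; a
// ShippingPlanner coordinates them and produces a ShippingPlan.
class Order {
  constructor(public readonly lines: string[]) {}
}

class Warehouse {
  constructor(public readonly name: string) {}
}

class ShippingPlan {
  constructor(public readonly steps: string[]) {}
}

class ShippingPlanner {
  private order?: Order;
  private warehouses: Warehouse[] = [];

  fill(order: Order): this {
    this.order = order;
    return this;
  }

  from(warehouses: Warehouse[]): this {
    this.warehouses = warehouses;
    return this;
  }

  ship(): ShippingPlan {
    // Naive allocation for the sketch: take every line from the first warehouse.
    const source = this.warehouses[0]?.name ?? "unassigned";
    const steps = (this.order?.lines ?? []).map((line) => `${line} from ${source}`);
    return new ShippingPlan(steps);
  }
}

// Usage mirroring the one-liner above:
const plan = new ShippingPlanner()
  .fill(new Order(["widget x2", "gadget x1"]))
  .from([new Warehouse("East")])
  .ship();
```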
In its simplest form, a warehouse is an inventory storage place.
But it would also be correct to view a warehouse as a facility comprising storage space, personnel, shipping docks, etc. If you take that view of a warehouse, then it would be appropriate to say that a warehouse (as a facility) can be charged with filling orders, or in expanded form:
a warehouse facility is capable of assembling a shipment according to a given specification (an order)
The above is a justification (if not a proof) for the warehouse.fill(order) form. Notice that this form is substantially equivalent to SingleShot's and themel's suggestions. The trick is to consolidate the shippingPlanner (an order-fulfillment authority) and the warehouse (an inventory storage space). Simply put, in my example the warehouse is a composition of an order-fulfillment authority and an inventory storage space, whereas in SingleShot's the two are presented separately. This means that if such consolidation is (or becomes) unacceptable (for example, due to the complexity of the parts), the warehouse can be decomposed into those two sub-components.
I cannot come up with a justification for assigning the fill operation to an order object.
hello? warehouse? yes, please take this order and fill it. thank you. -- that I can understand.
hey, order! the warehouse is over there. do your thing and get yourself filled. -- makes no sense to me.

Advice on splitting up a process involving multiple actors into Use Cases

Let's say I am modelling a process that involves a conversation or exchange between two actors. For this example, I'll use something easily understandable:
Supplier creates a price list,
Buyer chooses some items to buy and sends a purchase order,
Supplier receives the purchase order and sends the goods,
Supplier sends an invoice,
Buyer receives the invoice and makes a payment.
Of course each of those steps could in itself be quite complicated. How would you split this up into use cases in your requirements document?
If this process were treated as a single use case, it could fill a book.
Alternatively, making a use case out of each of the above steps would hide some of the essential interaction and flow that should be captured. Would it make sense to have a use case that starts at "Receive a purchase order" and finishes at "Send an invoice", and then another that starts at "Receive an invoice" and ends at "Make a payment"?
Any advice?
The way I usually approach such tasks is to just start creating UML Use Case and high-level Activity diagrams for the process. Don't bother about specifics; just give it your best shot.
Once you have a draft, you will almost immediately see how it could be improved. You can then go on refactoring it: making the use cases smaller, structuring large Activities, and so on. Alternatively, you could lump a couple of Use Cases together if they are too small.
Without knowing the details of your project, I would just go ahead and make each step a separate Use Case; they all seem to be self-contained and could be described without any cross-references. If, while doing so, you find any dependencies, you can always rethink the approach.
Also consider using 'extend' and 'include' relationships for common elements like logging, security, etc.
Yes, there are many possibilities here. Your example could be even more complicated, with the Buyer making multiple partial payments to settle the invoice.
You probably need to create complete workflow use cases. Splitting each of the above steps into its own use case may not prove useful, as some of the steps have pre- and post-conditions that tie them together.
I work on the QuickBooks source code and the number of ways that a transaction can flow through the system is daunting. It is almost impossible for our QA guys to test every combination.