Allow multiple locales as a qualification test in Mechanical Turk

I would like to have a HIT that requires workers to be in the U.S. or Canada. I am using the command line tools, but it seems that all qualifications are treated as required -- I want a Boolean OR. Is this possible?
For example, in the .properties file, I have:
# Worker_NumberHITsApproved > 100
qualification.1:00000000000000000040
qualification.comparator.1:GreaterThan
qualification.value.1:100
# Worker_PercentAssignmentsApproved > 95%
qualification.2:000000000000000000L0
qualification.comparator.2:GreaterThan
qualification.value.2:95
# Worker_Locale
qualification.3:00000000000000000071
qualification.comparator.3:EqualTo
qualification.locale.3:US
All three of these qualifications are required. Suppose I want to require the first two and then a locale of either US or CA. Is there syntax for this?

Mechanical Turk now supports new In and NotIn comparators, which allow multiple locales. Here's an example SOAP request from the documentation that accepts workers from the US, Canada, or the UK:
<QualificationRequirement>
<QualificationTypeId>00000000000000000071</QualificationTypeId>
<Comparator>In</Comparator>
<LocaleValue>US</LocaleValue>
<LocaleValue>CA</LocaleValue>
<LocaleValue>UK</LocaleValue>
</QualificationRequirement>
See also AWS MTurk documentation.

Update: this limitation may have been addressed; see dmcc's answer for more information.
I think you're out of luck. From the API docs (emphasis added):
The Locale Qualification
...snip...
Note
A Worker must meet all of a HIT's Qualification requirements to qualify for the HIT. This means you cannot specify more than one locale Qualification requirement, because a given Worker will only be able to match one of the requirements. There is no way to allow Workers of varying locales to qualify for a single HIT.

TurkPrime.com offers this and many more features through a web interface. You can exclude and include workers, email them, restart surveys, and more.
My colleagues and I have used it for a while and love it.

This question and its answers are fairly old at this point, but I was not able to find a solution specifically for the command line tools. I was fortunately able to figure it out after playing around for some time. In your .properties file:
# user must be located in the US or Canada
qualification.1:00000000000000000071
qualification.comparator.1:In
qualification.locale.1.1:US
qualification.locale.1.2:CA
A fairly modest solution, but one that is not well documented, in my opinion :)
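For completeness, here is what the full .properties block from the question might look like with the locale requirement switched to the In comparator; the only untested assumption is that the locale values are renumbered under qualification 3 in the same way:
# Worker_NumberHITsApproved > 100
qualification.1:00000000000000000040
qualification.comparator.1:GreaterThan
qualification.value.1:100
# Worker_PercentAssignmentsApproved > 95%
qualification.2:000000000000000000L0
qualification.comparator.2:GreaterThan
qualification.value.2:95
# Worker_Locale must be US or CA
qualification.3:00000000000000000071
qualification.comparator.3:In
qualification.locale.3.1:US
qualification.locale.3.2:CA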

Is there a way to escape a keyword in a Gherkin feature or scenario description?

In Gherkin, you can have free-form text that describes a scenario, a feature, etc. These descriptions are not used by, say, a test runner, but are for you to describe important additional information to another human.
The documentation for Gherkin says that these descriptions cannot start a line with one of the keywords, such as Given, When, or Then. Yet, sometimes the best description I could give would start with one of these keywords.
I'm sort of making this up as I go here, but here is an example of what I wish I could do:
Scenario: Many notifications at the same time get combined
When we have a lot of notifications being posted at once, it causes problems
for humans. They can't make sense of that much new information all at once.
So if we are ever in a situation where we are posting lots of
notifications in a short time period, we will take the one with the highest
severity and show it with the other notifications as "child" notifications,
accessible via a link that says, "And N other issues."
Given a notification posted today at 11:03:25
And a notification posted today at 11:03:26
And a notification posted today at 11:03:26
And a notification posted today at 11:03:27
When a notification is posted at 11:03:28
Then the notification list will contain 1 notification
And that notification should contain 4 child notifications
The problem I have is that because my description starts with a When, the tools assume that I've started my specific steps, and they blow up on the next line, which doesn't start with a keyword.
I've considered:
Commenting out the first line or the entire description (commenting the whole description seems more consistent to me), but to me there is a semantic difference between a comment with # and a description.
Rewording the description so it doesn't start with a "When". For example, it could start with "In times where we have a lot of notifications...", but that's less readable, and readability is the point of Gherkin-style specifications.
If it wasn't the first word in the whole description, I might be able to get away with simply wrapping my lines differently so that the "When" starts in the middle of a line instead of the beginning, but in this case, I don't have that option.
Those options just seem like workarounds that feel sub-optimal.
Is there a way to "escape" these keywords to tell the system that some usage of "When" is really still just part of the description and not a keyword? If not, is there some sort of accepted best practice or guideline for how people should handle situations like this?
You could use # at the beginning of the line (it's used for writing comments).
Ex:
# When a notification is posted...
You are misinterpreting the language spec. You can describe a feature and use keywords at the beginning of a line. The example you posted gets interpreted as steps in a scenario because the description appears after the Scenario keyword.
Just as Mr Cas said in his answer, you need comments.
Feature: Given a feature title
When I use keywords up here
Then it is allowed
Scenario: When I use keywords after the title to describe a scenario
# Then I need to use comments
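Applied to the scenario from the question, that might look like this (the whole description after the scenario title goes into comments):
Scenario: Many notifications at the same time get combined
# When we have a lot of notifications being posted at once, it causes problems
# for humans. They can't make sense of that much new information all at once.
# So if we are ever in a situation where we are posting lots of notifications
# in a short time period, we will take the one with the highest severity and
# show it with the other notifications as "child" notifications, accessible
# via a link that says, "And N other issues."
Given a notification posted today at 11:03:25
And a notification posted today at 11:03:26
And a notification posted today at 11:03:26
And a notification posted today at 11:03:27
When a notification is posted at 11:03:28
Then the notification list will contain 1 notification
And that notification should contain 4 child notifications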

Custom model in Apache OpenNLP

I am currently working with custom models that I am training for my own use case. The use case is to classify emails based on whether they are address change requests. If the address change request can be understood from a single sentence, it works fine without issues. But if the address change request needs to be understood from multiple sentences, it does not work.
A few examples are given below.
Example 1: THIS IS WORKING
a) Training file:
Guys I wish to <START:contactupdate> change my address <END> .
My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.
Please confirm once you are done.
Thanks.
b) Testing the model with the sentence below:
String input = "Guys I wish to change my address.My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.Please confirm once you are done. Thanks."; //Working
Example 2: THIS IS NOT WORKING
Let's say the address change request can only be deduced from multiple sentences:
"My old address is no longer valid. Need to update it."
How do I train my model in this scenario? How do I specify the custom tags for the above?
Can you please help? I am stuck.
Many thanks.
What do you mean by "not working"? That the thing you want to retrieve is not retrieved? Or that the training crashes somewhere when the tags are spread out over multiple lines?
In general, the (by default MaxEnt) model that you are training in this procedure tries to detect common features of the thing you are training for. Typically, these are named entities like persons, organisations, and locations. In many languages, these contain typical features (like the prefix Mr./Mrs., the suffix corp., or the morpheme "street", respectively). This can be picked up by the model and applied to new data, leading to the recognition of whatever it is you want to recognise.
The thing you are trying to do, however, is already fairly advanced NLP. The longer the phrase, the larger the possible variation, and the harder it becomes to pick up commonalities. For your use case, people typically use parsing (either constituency or dependency parsing) or other tools more sophisticated than this relatively flat pattern recognition, so you may want to look into those instead.
I don't know how much data you have at your disposal from which you can infer the different ways of expressing the desire to change an address in a customer database. If it is a reasonable amount (i.e. not just a couple of sentences), you may want to manually annotate it, parse the corpus, use machine learning on the parse trees/graphs for the sentences of interest, and go about it that way. As mentioned, this is quite advanced NLP in my opinion, and not something that has an out-of-the-box solution.
If I understand your question correctly, you are trying to categorize emails to find out whether they are address change requests, but the model example looks like one for named entity recognition. In my opinion, it might be better to use the "Document Categorizer" feature of Apache OpenNLP.
You can provide samples of the different possible sentences that should be categorized as an address change; "address_change", "general_inquiry", etc. can be the categories. This way you can add as many samples as you want, with many variations of the sentences. The OpenNLP documentation has an easy, basic tutorial for document categorizer training and usage.
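As a rough sketch of what that could look like with the OpenNLP Java API (roughly version 1.8 or newer; the file name emails.train and the category labels are just assumptions for the example), the training data would have one sample per line in the form <category><whitespace><text>:
// emails.train (hypothetical), one sample per line, category first:
//   address_change My old address is no longer valid. Need to update it.
//   general_inquiry Please confirm my order has been shipped.
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class EmailCategorizer {
    public static void main(String[] args) throws Exception {
        // Read the training samples line by line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("emails.train")),
                StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        // Train a document categorizer (MaxEnt by default).
        DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());

        // Categorize a whole email, tokenized into words.
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        String[] tokens = "My old address is no longer valid . Need to update it ."
                .split("\\s+");
        double[] outcomes = categorizer.categorize(tokens);
        System.out.println(categorizer.getBestCategory(outcomes));
    }
}
Since each training sample is the whole email, wording that is spread across several sentences is handled by the categorizer without needing entity tags.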

Deciding on REST API Path Convention

I am working on the classifieds section of my organization's site, where I need to fetch the cities available to that section. Classified cities are a subset of all cities. So, to fetch the cities, I created the following API path:
api/classified/cities
Then I realized that since classified cities are a subset of cities, the URL could instead be
api/cities/classifiedcities
Which of the above paths should I use according to REST principles?
If you are aiming for REST principles, REST does not really have any rules for what a URI should look like (as long as it identifies a "real" resource!). It does, however, say that URIs should be linked rather than hardcoded into clients. This is why some people say URIs do not matter: they do not matter because clients should "discover" URIs instead of knowing them beforehand.
So why don't we all pick URIs like "/45ttfdfg/34tkfjdldf23wedkdfjsd"? That, again, is really up to personal preference. It is nice if the URI is "readable" by humans. There are some (badly written) tools that assume some structure, and there are some "REST" libraries (server and client ones) that also assume a bunch of things, for example the concept of "subresources" (which also does not come from REST).
To sum up: if you follow REST (you don't have to, of course!), then clients should discover URIs. If that is the case, then it comes down to personal preference and maybe some technical restrictions with libraries/clients. So pick the one you prefer and don't worry about it! URIs can be modified later if needed anyway, since the server has control over them.
Theoretically the subset should follow the parent set (so the first solution doesn't look too good).
A third approach would be to use query parameters:
api/cities?type=classified

Fuzzy search in SQL

I am trying to map information about Linux packages (name + version) to their corresponding CPE strings (see http://nvd.nist.gov/cpe.cfm) in order to be able to automatically find possible vulnerabilities of a system.
There is an XML document provided by NIST which contains all relevant CPEs. I thought about parsing this information into an SQL database so I can quickly search by name and version number. That would be some 70,000 rows.
The problem now is, of course, that there are variations in the spellings of the CPEs and the package names. For example, the CPE for Tomcat 6.0.36 is cpe:/a:apache:tomcat:6.0.36, so you have the name tomcat and the version 6.0.36. Now, the package manager could give you something like tomcat6 for the name and 6.0.36-3 for the version. It's likely that both programs are the same, or at least have the same vulnerabilities. So I need to be able to automatically identify the above-mentioned CPE as the correct one for my tomcat package.
The first thing to do would be some kind of normalization, maybe converting everything to lowercase. But as you can see from the example, that's not enough. I need some kind of fuzzy search. From what I have already found, there are solutions for identifying matches in the case of misspellings. That is not exactly what I need, though: the package names are not misspelled, but may contain additional characters (or be missing some).
The fuzzy search must also be relatively fast, since I need to execute it for multiple hosts, each of which could have a few hundred packages installed, and as I said, the database would have around 70,000 rows. I can introduce a primary lookup which tries to find an exact match first, but since I suspect many packages will not have any corresponding CPE string, that will not reduce the workload too dramatically.
Another constraint is that the solution should work on a non-proprietary database, since I don't have the financial means for anything else.
So, is there anything that matches these requirements? Or can you think of any solution to my problem except some kind of fuzzy searching?
Thanks in advance!
A general comment, first. The CPE nomenclature seems to have evolved organically, often depending on the vendors' (inconsistent) nomenclature. For example, Sun Java has major.minor.point_version. Adobe uses major.minor.point.subpoint. Microsoft operating systems use Service Packs_Language Packs. Some other vendors would use point releases with mostly numbers but occasional letters sprinkled in (e.g., .8, .9, .9R2, .10).
When I worked on the stated problem, I started from their XML files and manipulated them in Excel, splitting on the periods. Then I would sort either numerically (if the parts were all numeric) or as text strings. (Note that letters sprinkled into mostly numeric values cause havoc, and that .10 comes lexically before .8.)
This inconsistency is why third-party software vendors have sprouted like mushrooms after a spring rain. Companies would rather pay the software vendors than untangle this Gordian knot.
If you want a truly fuzzy search, please take a look at this question about using Soundex. Expect to get a lot of false positives.
If your goal is accurately mapping the CPE strings, you should probably think about implementing a lookup table that translates from CPE to a library name.
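Not from either answer, but as an illustration of the normalize-then-fuzzy-match idea from the question, here is a small self-contained Java sketch; the normalization rules and the distance cutoff are just assumptions for the example. (If you keep the data in PostgreSQL, the free pg_trgm and fuzzystrmatch extensions offer similar similarity functions inside the database.)
// Two-stage lookup: normalize the package data first, try an exact match,
// then fall back to ranking CPE product names by Levenshtein distance.
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

public class CpeMatcher {

    // e.g. "tomcat6" -> "tomcat" (hypothetical rule: drop trailing digits)
    static String normalizeName(String name) {
        return name.toLowerCase().replaceAll("\\d+$", "");
    }

    // e.g. "6.0.36-3" -> "6.0.36" (hypothetical rule: drop the distro suffix)
    static String normalizeVersion(String version) {
        return version.toLowerCase().replaceAll("-.*$", "");
    }

    // Plain two-row Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // cpeNames maps a normalized product name (e.g. "tomcat") to its CPE prefix.
    static Optional<String> bestMatch(String packageName, Map<String, String> cpeNames) {
        String normalized = normalizeName(packageName);
        if (cpeNames.containsKey(normalized)) {             // cheap exact lookup first
            return Optional.of(cpeNames.get(normalized));
        }
        return cpeNames.keySet().stream()                   // fuzzy fallback
                .min(Comparator.comparingInt(c -> levenshtein(normalized, c)))
                .filter(c -> levenshtein(normalized, c) <= 2)    // arbitrary cutoff
                .map(cpeNames::get);
    }

    public static void main(String[] args) {
        Map<String, String> cpeNames = Map.of("tomcat", "cpe:/a:apache:tomcat");
        System.out.println(bestMatch("tomcat6", cpeNames)); // Optional[cpe:/a:apache:tomcat]
        System.out.println(normalizeVersion("6.0.36-3"));   // 6.0.36
    }
}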

Wiki Database, is there one?

I was searching the net for something like a wiki database: just like Wikipedia, but one that stores structured content, editable by users. What I was looking for was an online database accessible by everyone, where people can design the schema and the data, with proper versioning of both. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is great potential for something like this. A possible example would be a website with a GUI for querying a MySQL DB where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool, but I haven't found enough time yet. If anyone is interested in spending some time coding on this, please let me know at wikidatabase.org.
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open to all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in and out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value pairs, inheritance, templates, etc.), but you can also interface with the API, write DekiScript, or grab the XML.
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes, it is loosely typed: the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to be implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in Couch DB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.