How to get ids from Freebase given part of a name (from Freebase Offline Dumps)? - api

I asked this question before here. At that time, I was concerned with getting the output using the Google API, which works just fine.
The problem with that approach is that it runs into timeouts and, more importantly, depends on querying a web-based API. I would like to do it offline using the Freebase data dumps. Is there an easy way to go about it?
Thanks

zegrep $'\tns:type\.object\.name\t.*Bush.*' freebase-rdf-<date>.gz | cut -f 1
will give you a list of all MIDs for topics which contain the string "Bush" (from your previous example) in their name.
Extend the regex as needed to include things like aliases, fancier name matching, etc.
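If you would rather do the same filtering in a script, here is a minimal Python sketch of the idea. It assumes the dump is tab-separated with the MID in the first column, the predicate in the second and the name literal in the third, which is what the zegrep pattern and cut -f 1 above imply. The function names and the dump path you pass in are hypothetical.

```python
import gzip

# Predicate spelling copied from the zegrep pattern above; adjust it if your
# dump writes predicates differently (this is an assumption about the format).
NAME_PREDICATE = "ns:type.object.name"


def mids_matching(lines, needle, predicate=NAME_PREDICATE):
    """Yield the first column (the MID) of each name triple whose name contains needle."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3 and parts[1] == predicate and needle in parts[2]:
            yield parts[0]


def mids_from_dump(path, needle):
    """Stream a gzipped dump line by line so the whole file never sits in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        yield from mids_matching(dump, needle)
```

Calling mids_from_dump with the path to your dump and "Bush" should give roughly the same MIDs as the one-liner, except that it only looks at the name column. Adding alias predicates would mean checking parts[1] against a set of predicates instead of a single one.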

Related

Ldap search for objects where attribute X contains multiple values

I would like to know if it is possible to do a search like this:
"give me all objects where description has more than 1 value"
The short answer is no. At least not from a single LDAP Query without somehow parsing the results.
I know of a tool that will provide those results. It has not been updated in a while, but it worked the last time I used it.
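Parsing the results yourself can be quite short. The sketch below assumes you have already fetched the entries with a presence filter such as (description=*) using whatever LDAP client you prefer, and that each entry arrives as a (dn, attributes) pair where attributes maps an attribute name to a list of values. That shape and the function name are assumptions for illustration, not any particular library's API.

```python
def entries_with_multiple_values(entries, attribute):
    """Return the DNs of entries where `attribute` holds more than one value.

    `entries` is an iterable of (dn, attrs) pairs; `attrs` maps attribute
    names to lists of values. Attribute names are compared case-insensitively.
    """
    wanted = attribute.lower()
    matches = []
    for dn, attrs in entries:
        for name, values in attrs.items():
            if name.lower() == wanted and len(values) > 1:
                matches.append(dn)
                break
    return matches
```

The counting happens on the client, so the server still returns every entry that has the attribute at all.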

Clean unstructured place name to a structured format

I have around 300k unstructured records like those in the screenshot below. I'm trying to use Google Refine or OpenRefine to clean them up, but I can't find a proper way to do it, and I'm new to the tool. Any help would be greatly appreciated. The tool is also quite slow with 300k records: anything I try takes a long time to process and produce output.
Alternatively, please suggest any other open-source tools and techniques to do this.
As Owen said in the comments, your question is probably too broad to receive an acceptable answer. We can only give you a general procedure to follow.
In OpenRefine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions, but for that you need to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important but the "US" in "Massy Intertech US" is not, nor the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it is an organisation name).
On the basis of your fifteen lines, however, some clear patterns do emerge. For example, it looks like you'll have to remove tokens (character sequences without spaces) at the end of the string that contain a #. For that, the GREL formula in OpenRefine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any digit sequence at the end that contains more than 4 digits with a single - between them).
Feel free to check out the Open Refine documentation in case of doubt.
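Since you also asked about other open-source options, the same rule carries over to a plain Python script, which may be easier to rerun over 300k rows. This is only a sketch of the "remove a trailing # token when there are more than two words" rule above. The function name is hypothetical, and unlike the second GREL formula it also trims surrounding whitespace.

```python
import re

# Same pattern as the GREL formulas above: a token containing '#' at the end.
TRAILING_HASH_TOKEN = re.compile(r"\b\w+#\w+\b$")


def clean_name(value):
    """Strip a trailing '#' token, but only when the string has more than two words."""
    value = value.strip()
    if len(value.split(" ")) > 2:
        return TRAILING_HASH_TOKEN.sub("", value).rstrip()
    return value
```

You would apply clean_name to each value of the messy column and add further rules for each new pattern you identify.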

In SQL, what is the memory-efficient way of "mapping" 1 ID to multiple IDs?

I'll describe my scenario so you guys understand what type of design pattern I'm looking for.
I'm making an application where I provide someone with a link that is associated with one or more files. For example, someone needs somePowerpoint.ppx, main.cpp and somevid.mp4, and I have a tool that makes kj13h1djdsja213j1hhadad9933932 associated with those 3 files so that I can give someone
mysite.com/getfiles?fid=kj13h1djdsja213j1hhadad9933932
and they'll get a list of those files that they can download individually or all at once.
Since I'm new to SQL, the only way I know of doing that is having my tool use a table like
fid | filename
------------------------------------------------------------------
kj13h1djdsja213j1hhadad9933932 somePowerpoint.ppx
kj13h1djdsja213j1hhadad9933932 main.cpp
kj13h1djdsja213j1hhadad9933932 somevid.mp4
jj133823u22h248884h4h24h01h232 someotherfile.someextension
to go along with the above example. It would be nice if I could do some equivalent of
fid | filename(s)
---------------------------------------------------------------------------
kj13h1djdsja213j1hhadad9933932 somePowerpoint.ppx, main.cpp, somevid.mp4
jj133823u22h248884h4h24h01h232 someotherfile.someextension
but I'm not sure if that's possible or if I should be using some other design pattern altogether.
Any advice?
I believe the question "Concatenate many rows into a single text string?" can help you write a query that generates your condensed format. You would still store the data in SQL as the full list, one row per file, but you could create a view that shows the condensed version using the query from that link.
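To make that concrete, here is a small sketch using Python's built-in sqlite3 module. SQLite is only an assumption here, since the question does not name a database, and the table and view names are made up for illustration. Other engines have their own aggregate for this, so the concatenation function will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (link id, file): this is the table you actually store.
conn.execute(
    "CREATE TABLE link_files ("
    " fid TEXT NOT NULL,"
    " filename TEXT NOT NULL,"
    " PRIMARY KEY (fid, filename))"
)
conn.executemany(
    "INSERT INTO link_files (fid, filename) VALUES (?, ?)",
    [
        ("kj13h1djdsja213j1hhadad9933932", "somePowerpoint.ppx"),
        ("kj13h1djdsja213j1hhadad9933932", "main.cpp"),
        ("kj13h1djdsja213j1hhadad9933932", "somevid.mp4"),
        ("jj133823u22h248884h4h24h01h232", "someotherfile.someextension"),
    ],
)

# The condensed format is just a view over the stored rows.
conn.execute(
    "CREATE VIEW link_files_condensed AS"
    " SELECT fid, group_concat(filename, ', ') AS filenames"
    " FROM link_files GROUP BY fid"
)

# Map each fid to its comma-separated file list.
condensed = dict(conn.execute("SELECT fid, filenames FROM link_files_condensed"))
```

Keeping one row per file means adding or removing a single file is a single-row change, and the view stays correct without any extra work. SQLite does not promise an order for the concatenated names.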

Google Places API - RadarSearch results are confusing

I'm running a query vs the Google Places RadarSearch API and don't entirely understand the results. I'm trying to find nearby Tesco Supermarkets. My query is structured like this:
https://maps.googleapis.com/maps/api/place/radarsearch/xml?location=51.503186,-0.126446&types=store&keyword=tesco&name=tesco&radius=5000&key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I've tried a bunch of variations of the types, keyword and name fields. None of the results are Tesco stores. Am I missing something?
The Google docs show the fields as:
keyword — A term to be matched against all content that Google has indexed for this place, including but not limited to name, type, and address, as well as customer reviews and other third-party content.
name — One or more terms to be matched against the names of places, separated by a space character. Results will be restricted to those containing the passed name values. Note that a place may have additional names associated with it, beyond its listed name. The API will try to match the passed name value against all of these names. As a result, places may be returned in the results whose listed names do not match the search term, but whose associated names do.
I always get the maximum of 200 results, which may include 1 or 2 Tescos. When I check on Google Maps, there are 10 Tescos in the radius I am searching. It's as if the API is ignoring the name field: no matter what I put in the name field, I still get the same results.
UPDATE: Seems this is a known bug https://code.google.com/p/gmaps-api-issues/issues/detail?id=7082
Maybe I am wrong, but I believe this is a commercial issue: Google shows all businesses, filtered by criteria whose rules it does not publish. In your search, for example, the type you used was "store", so the API returns all stores and applies the name or keyword in its own way; who knows which criteria it uses internally. There is something else. In the API description, the sample response for radar search shows the name of each place, but in the tests I am doing the name is not even sent, so you can't iterate over the results and filter them yourself. To get the name, you have to make another call using:
https://maps.googleapis.com/maps/api/place/details/json?placeid=ChIJq4lX1doEdkgR5JXPstgQjc0&key=YOUR_KEY
Maybe there is another way but I don't see it.
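As a rough sketch of that second step: build one details URL per place id, fetch it, and keep only the places whose name contains your search term. The endpoint and the placeid and key parameters come from the URL above. The response shape (a "result" object with a "name" field) is an assumption you should check against a real response, and fetch_json is a hypothetical callable you supply to download and parse a URL.

```python
from urllib.parse import urlencode

DETAILS_ENDPOINT = "https://maps.googleapis.com/maps/api/place/details/json"


def details_url(place_id, key):
    """Build the Place Details URL for one place id."""
    return DETAILS_ENDPOINT + "?" + urlencode({"placeid": place_id, "key": key})


def filter_by_name(place_ids, key, needle, fetch_json):
    """Return (place_id, name) pairs whose name contains needle, ignoring case.

    fetch_json(url) must return the decoded JSON body as a dict; it is passed
    in so the HTTP client stays your choice (assumed helper, not a library call).
    """
    matches = []
    for place_id in place_ids:
        details = fetch_json(details_url(place_id, key))
        name = details.get("result", {}).get("name", "")
        if needle.lower() in name.lower():
            matches.append((place_id, name))
    return matches
```

This costs one extra request per radar search result, so with 200 results it adds up quickly against your quota.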
I find that the radar search is returning strange results today. It worked differently a couple of days ago.
The keyword parameter has no effect at the moment, and integration tests of mine that passed before are now failing. I hope this is a temporary issue.
I filed a bug report for it: https://code.google.com/p/gmaps-api-issues/issues/detail?id=7086

Freebase API - listing a city's tourist attractions by relevance

I'm trying to use Freebase to list tourist attractions for cities by relevance.
Using the Topic API, it's simple to retrieve results for a certain city using its MID (e.g. "/m/04jpl" for London)
https:// www.googleapis.com/freebase/v1/topic/m/04jpl/?&filter=/travel/travel_destination/tourist_attractions
However, this returns only 10 results. The response ends with "count": 87.0. How do I get all 87? It's possible to click an "87 values total" link on London's Freebase page. Effectively, I want to do the same here.
I realise I could use MQL, but I want the results to be ranked by relevance, not by timestamp. Using the Search API, it's possible to rank by freebase, entity or schema, so I'd rather use that.
First, I looked at the Search Output schema for the Search API. However, even outputting "all" didn't produce Tourist Attraction results. Using metaschema with the Search API DID work. I used "part_of" to select London. However, it only works for some locations:
https:// www.googleapis.com/freebase/v1/search?limit=50&filter=(all%20type:/travel/tourist_attraction%20part_of:/m/04jpl)&indent=true
What I REALLY want to be able to do is make it work for a relatively unknown location like "Loughborough" (MID /m/01z21p). As you can see, substituting /m/04jpl for /m/01z21p produces no results:
https:// www.googleapis.com/freebase/v1/search?limit=50&filter=(all%20type:/travel/tourist_attraction%20part_of:/m/01z21p)&indent=true
Looking at "Loughborough", we see that one of its tourist attractions, such as "Loughborough Town Hall", has a "/travel/tourist_attraction/near_travel_destination" of "Loughborough". How would I compose this filter?
I want something like the following (that actually works):
https:// www.googleapis.com/freebase/v1/search?limit=50&filter=(all%20type:/travel/tourist_attraction)&filter=(/travel/tourist_attraction/near_travel_destination:/m/01z21p)&indent=true
Thanks!
NOTE: To enter the links into your browser you need to remove the space between the https:// and www. I would have done so, but I don't have the required permissions here yet to post more than 2 links.
I solved this problem using 2 Freebase API calls.
1) An MQL query that gets a list of all the tourist attractions for a particular MID. These results are not ranked in any useful way. I also return the result count to make later processing a little easier:
https://www.googleapis.com/freebase/v1/mqlread?query={"mid":"/m/04jpl","/travel/travel_destination/tourist_attractions":[{"mid":null}],"resultnumber:/travel/travel_destination/tourist_attractions":[{"return":"count"}]}
The list of returned MIDs is then used to build a new query (using a for loop). You must include all MIDs returned by the query above, so that they can all be ranked together.
2) https://www.googleapis.com/freebase/v1/search?limit=10&filter=(any%20mid:/m/0gsxw%20mid:/m/01d_0p%20mid:/m/07gyc)&scoring=entity
It's best to choose a return format that just returns MIDs, to ensure that loading times aren't extensive.
You then have a ranked list of MIDs! You'll need one final query to return whatever details you desire.
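For anyone who wants to script the two calls, here is a minimal sketch of how the URLs could be assembled. The endpoints, the MQL query and the (any mid:...) filter are the ones shown above. The helper names are hypothetical, and the assumption that the mqlread response wraps its data in a "result" key should be checked against a real response. The HTTP requests themselves are left out.

```python
import json
from urllib.parse import urlencode

BASE = "https://www.googleapis.com/freebase/v1"
ATTRACTIONS = "/travel/travel_destination/tourist_attractions"


def mqlread_url(city_mid):
    """Step 1: MQL query listing every attraction MID for the city, plus a count."""
    query = {
        "mid": city_mid,
        ATTRACTIONS: [{"mid": None}],
        "resultnumber:" + ATTRACTIONS: [{"return": "count"}],
    }
    return BASE + "/mqlread?" + urlencode({"query": json.dumps(query)})


def extract_mids(mql_response):
    """Pull the attraction MIDs out of the decoded mqlread response (assumed shape)."""
    return [item["mid"] for item in mql_response["result"][ATTRACTIONS]]


def ranked_search_url(mids, limit=10):
    """Step 2: Search API call that ranks all the MIDs together by entity score."""
    mid_filter = "(any " + " ".join("mid:" + mid for mid in mids) + ")"
    return BASE + "/search?" + urlencode(
        {"limit": limit, "filter": mid_filter, "scoring": "entity"}
    )
```

urlencode writes spaces as + rather than %20, which is equivalent inside a query string.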
I hope this has proved helpful.