Google Custom Search with no sites and "Search entire web" returns way less results than normal google search - google-custom-search

I'm trying to create a simple python program that returns the ratio of search results between two keywords (two searches in total, thus). However, the google-api-python-client that uses my Custom Search Engine always returns about 10x less results than a normal google search.
I don't have any restrictions in the CSE except I've set it to use 'google.fi' and set the user geolocation as Finland, because that mirrors the way I normally search on Google on the web.
Any ideas?

According to:
https://developers.google.com/custom-search/json-api/v1/reference/cse/list
You can adjust the number of results returned with the "num" parameter. However:
Valid values are integers between 1 and 10, inclusive.

Related

Getting search suggestions to work on 2 (or more) non-consecutive words (to improve search on a medical conditions list - ICD10 codes)

Context:
We are using Azure Cognitive Services in a mobile app to search patient diagnostic codes (ICD10 codes).
The ICD10 code list is approximately 94,000 items. For anyone interested here is a list.
We currently have set-up a standard Lucene analyser on the diagnostic description field
Requirement:
We want to provide a really good search as you type experience, which provides the most relevant suggestions
Using the Suggest method with the fuzzy parameter set to true works reasonably well for a single search term:
As you can see it does well in finding partial matches and is resilient to typos.
The issue comes in when I add a second search term. E.g. I want to search for asthma that is moderate:
In both these examples, there is no match.
So when searching for more than one term, requiring the user to express this in the sequence that this is in the data is not a good user experience.
Using the Search method instead, we can overcome the problem of finding matches where 2 search terms are supplied that do not appear consecutively in the data:
And this is resilient to typos
However, this is not good at finding partial matches (like the Suggest does).
E.g. in this search, we would still want the term moderate to be picked up:
Seemingly if we could combine a wild card search with a fuzzy search we could solve this problem. e.g. supplying the following search phrase: ashtma~* AND moder~*.
But from what we have seen this syntax is not supported.
Any suggestions on how to overcome this limitation so we can get the best of both worlds, i.e:
For 2 or more search terms, it will work on partial matches
And the search terms are treated independently and do not need to appear consecutively in the data
Many thanks in advance,
Andreas.
I recommend using (or at least experimenting with) Lucene ngrams.
An example custom analyzer can use the NGramTokenFilter.
This filter splits each source token into one or more indexed tokens by chopping up the source into substrings of different lengths.
An example from the above link:
"abc" will give "a", "ab", "abc", "b", "bc", "c"
You can, as an example, set each token to be from 3 to 5 characters long (but this is one of the areas where you can experiment with different settings).
When you use this analyzer for indexing, it's going to create many more tokens (larger index) but that gives you more searching flexibility.
Use the same analyzer for searching.
If the user enters the following two words as their search values:
ashtma moder
You would convert that into the following Lucene search phrase:
ashtma~ AND moder~
This will find the following hits:
doc id = 12877
field = Moderate persistent asthma with status asthmaticus
doc id = 12874
field = Moderate persistent asthma
doc id = 12875
field = Moderate persistent asthma, uncomplicated
doc id = 12876
field = Moderate persistent asthma with (acute) exacerbation
doc id = 94210
field = Family history of asthma and oth chronic lower resp diseases
doc id = 6970
field = Xanthelasma of right lower eyelid
doc id = 6973
field = Xanthelasma of left lower eyelid
doc id = 6979
field = Chloasma of right lower eyelid and periocular area
doc id = 6982
field = Chloasma of left lower eyelid and periocular area
As you can see it does find some false positives, but the first four hits (the highest scored) are the ones you want.
You can see how this approach performs in terms of index size and search speed.
One reason for suggesting ngrams is your point about wanting to handle mis-spellings: ngrams may help to isolate spelling mistakes into smaller tokens,since the ~ fuzzy search operator is fairly limited in what it can handle. But, definitely experiment with different ngram lengths - and maybe also without using ngrams at all.

How to find search volume of large set of Keywords (say 1 million) by Keyword planner or Adword API

I am using Keyword planner to find keywords' search volume. But when I upload a .CSV file, there is a limit of 3000 keywords.
How can I get the search volume of a large set of Keywords which I want to.
Is there any way by Adword API or somw other tool ?
Yes. You can use the AdWords API TargetingIdeaService to get Keyword Planner data for large sets of keywords.
This will allow you to get search volume (and other metrics) for a specified list of keywords. As well as your input list of keywords you would have to specify a requestType of STATS in you TargetingIdeaSelector.

Lucene Paging with search

Hello I am currently using Lucene 4.6.1
In my design I need to be able to search and page possibly many results, so i have some general questions for optimization.
First in the "search(query q, int n)" What is the goal of the variable "n" , Is "n" different from ".totalHits()" ? How should this number be chosen and with what specifications?
Second, it seems that there are two general algorithms for paging. I can either use "searchAfter" or process the "ScoreDoc[]" given a page size.
Currently what way do most people recommend, and what are the design ideas that are required?
searchAfter can be used for efficient "deep paging".
A tutorial on using it with Solr
http://heliosearch.org/solr/paging-and-deep-paging/
The int passed to search is the maximum number of hits the search will retrieve. totalHits, from the TopDocs is the total number of hits for the query. It may be more or less than the value passed in.
Not clear to me what you mean by processing the ScoreDoc array. searchAfter is specifically intended to be used for pagination. Use it.

CategoryId in venues search not working correctly

In foursquare Api documentation for "Search venues" https://developer.foursquare.com/docs/venues/search it states
"categoryId - A comma separated list of categories to limit results to. This is an experimental feature and subject to change or may be unavailable. If you specify categoryId you may also specify a radius. If specifying a top-level category, all sub-categories will also match the query."
Realise its supposed to be experimental, but when I provide Food category i.e. 4d4b7105d754a06374d81259, it only returns a few local results, the rest are miles away. However if I execute same search on website sing Food category, it returns correctly lots of results, assuming its the last bit "If specifying a top-level category, all sub-categories will also match the query" is not working , i.e. its not searching sub-categories ?
Any fix work around for this ?
Thanks,
Neil Pepper
You're making a /venues/search request with its default intent of intent=checkin. This returns a filter on nearby results, heavily biased by distance since it's trying to guess where the user might be checking in.
Foursquare Explore uses the /venues/explore endpoint and attempts to return recommended results for a query. If you want to get the sorts of results you get in that tool, call /venues/explore?section=food

Flickr Geo queries not returning any data

I cannot get the Flickr API to return any data for lat/lon queries.
view-source:http://api.flickr.com/services/rest/?method=flickr.photos.search&media=photo&api_key=KEY_HERE&has_geo=1&extras=geo&bbox=0,0,180,90
This should return something, anything. Doesn't work if I use lat/lng either. I can get some photos returned if I lookup a place_id first and then use that in the query, except then all the photos returned are from anywhere and not the place id
Eg,
http://api.flickr.com/services/rest/?method=flickr.photos.search&media=photo&api_key=KEY_HERE&placeId=8iTLPoGcB5yNDA19yw
I deleted out my key obviously, replace with yours to test.
Any help appreciated, I am going mad over this.
I believe that the Flickr API won't return any results if you don't put additional search terms in your query. If I recall from the documentation, this is treated as an unbounded search. Here is a quote from the documentation:
Geo queries require some sort of limiting agent in order to prevent the database from crying. This is basically like the check against "parameterless searches" for queries without a geo component.
A tag, for instance, is considered a limiting agent as are user defined min_date_taken and min_date_upload parameters — If no limiting factor is passed we return only photos added in the last 12 hours (though we may extend the limit in the future).
My app uses the same kind of geo searching so what I do is put in an additional search term of the minimum date taken, like so:
http://api.flickr.com/services/rest/?method=flickr.photos.search&media=photo&api_key=KEY_HERE&has_geo=1&extras=geo&bbox=0,0,180,90&min_taken_date=2005-01-01 00:00:00
Oh, and don't forget to sign your request and fill in the api_sig field. My experience is that the geo based searches don't behave consistently unless you attach your api_key and sign your search. For example, I would sometimes get search results and then later with the same search get no images when I didn't sign my query.