NYT article search API not returning results for certain queries - api

I have a set of queries and I am trying to get web_urls using the NYT article search API. But I am seeing that it works for q2 below but not for q1.
q1: Seattle+Jacob Vigdor+the University of Washington
q2: Seattle+Jacob Vigdor+University of Washington
If you paste the url below with your API key in the web browser, you get an empty result.
Search request for q1
api.nytimes.com/svc/search/v2/articlesearch.json?q=Seattle+Jacob%20Vigdor+the%20University%20of%20Washington&begin_date=20170626&api-key=XXXX
Empty results for q1
{"response":{"meta":{"hits":0,"time":27,"offset":0},"docs":[]},"status":"OK","copyright":"Copyright (c) 2013 The New York Times Company. All Rights Reserved."}
Instead if you paste the following in your web browser (without the article 'the' in the query) you get non-empty results
Search request for q2
api.nytimes.com/svc/search/v2/articlesearch.json?q=Seattle+Jacob%20Vigdor+University%20of%20Washington&begin_date=20170626&api-key=XXXX
Non-empty results for q2
{"response":{"meta":{"hits":1,"time":22,"offset":0},"docs":[{"web_url":"https://www.nytimes.com/aponline/2017/06/26/us/ap-us-seattle-minimum-wage.html","snippet":"Seattle's $15-an-hour minimum wage law has cost the city jobs, according to a study released Monday that contradicted another new study published last week....","lead_paragraph":"Seattle's $15-an-hour minimum wage law has cost the city jobs, according to a study released Monday that contradicted another new study published last week.","abstract":null,"print_page":null,"blog":[],"source":"AP","multimedia":[],"headline":{"main":"New Study of Seattle's $15 Minimum Wage Says It Costs Jobs","print_headline":"New Study of Seattle's $15 Minimum Wage Says It Costs Jobs"},"keywords":[],"pub_date":"2017-06-26T15:16:28+0000","document_type":"article","news_desk":"None","section_name":"U.S.","subsection_name":null,"byline":{"person":[],"original":"By THE ASSOCIATED PRESS","organization":"THE ASSOCIATED PRESS"},"type_of_material":"News","_id":"5951255195d0e02550996fb3","word_count":643,"slideshow_credits":null}]},"status":"OK","copyright":"Copyright (c) 2013 The New York Times Company. All Rights Reserved."}
Interestingly, both queries work fine on the api test page
http://developer.nytimes.com/article_search_v2.json#/Console/
Also, if you look at the article below returned by q2, you see that the query term in q1, 'the University of Washington' does occur in it and it should have returned this article.
https://www.nytimes.com//aponline//2017//06//26//us//ap-us-seattle-minimum-wage.html
I am confused about this behaviour of the API. Any ideas what's going on? Am I missing something?

Thank you for all the answers. Below I am pasting the answer I received from NYT developers.
NYT's Article Search API uses Elasticsearch. There are lots of docs online about the query syntax of Elasticsearch (it is based on Lucene).
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
If you want articles that contain "Seattle", "Jacob Vigdor" and "University of Washington", do
"Seattle" AND "Jacob Vigdor" AND "University of Washington"
or
+"Seattle" +"Jacob Vigdor" +"University of Washington"

I think you need to change encoding of spaces (%20) to + (%2B):
In your example,
q=Seattle+Jacob%20Vigdor+the%20University%20of%20Washington
When I submit from the page on the site, it uses %2B:
q=Seattle%2BJacob+Vigdor%2Bthe+University+of+Washington
How are you URL encoding? One way to fix it would be to replace your spaces with + before URL encoding.
Also, you may need to replace %20 with +. There are various schemes for URL encoding, so the best way would depend on how you are doing it.

Related

How to display only searched string in column in postgresql

I want to only display searched string from a table, as example this is my table:
Table name: guidelines
id content
1 An individual is accused “of” a crime, not “with” or “for” a crime. Accused, often as “the accused”, refers to the individual or individuals standing trial. EXAMPLES: The prosecutor accused the politician of bribery. The accused politician stood trial for bribery. See alleged, charged, suspected.
2 There were a lot of people getting accused on this particular town.
If I use search query to search for "accused", it will show the full result:
SELECT content FROM "guidelines" WHERE "content" 'ILIKE' '%accused%';
Result:
content
An individual is accused “of” a crime, not “with” or “for” a crime. Accused, often as “the accused”, refers to the individual or individuals standing trial. EXAMPLES: The prosecutor accused the politician of bribery. The accused politician stood trial for bribery. See alleged, charged, suspected.
There were a lot of people getting accused on this particular town.
How can I only get the first matching string and followed by the data on the column, as example this is my goal:
content
Accused, often as “the accused”, refers to th...
accused on this particular to...
update: I updated the table and column name to make it better to differentiate table and column
In Postgresql, you can do that by using position function and substring function. see the following query as an example:
SELECT
id,
substring(content, position ('accused' in content)) as matched
FROM
guidelines
WHERE
content LIKE '%accused%'
Try this :
SELECT substring(content from '%#"accused%#"%' for '#') from guidelines;
each # is the place holder defined in the last part for '#' and need and aditional "
So you have % and function will return what is found inside both placeholder. In this case is % or the rest of the string after accused

Odd results from Google Reverse Geocode API

Recently I've been occasionally receiving odd results from the Google Reverse Geocode API.
For example when looking up the address for coordinates 43.2379396 -72.44746565 the result I get is:
6HQ3+52 Springfield, VT, USA
In another case looking up 43.703563 -72.209753 results with:
PQ3R+C3 Hanover, NH, USA
Does anyone know what the initial 7 bytes of the returned address symbolize? When I receive this type of result it's always 4 bytes of alphanumeric data followed by a plus sign then 2 more alphanumeric bytes.
After some additional research I found that these are Plus Code addresses, a relatively new feature in Google Maps. These are used for places that don't have a street address. These seem to have some similarities to "what 3 words" addresses.

Understanding Themes in Google BigQuery GDELT GKG 2.0

I'm using Google bigquery to analyze the GDELT GKG 2.0 dataset and would like to better understand how to query based on themes (or V2Themes). The docs mention a 'Category List' spreadsheet but so far I've been unsuccessful in finding that list.
the following asesome blog mentions that you can use World Bank Taxonomy among others to narrow down your search. My objective is to find all items that mention "droughts / too little water" ,all items that mention "floods / too much water" and all items that mention " poor quality / too dirty water" that have a geographical match on a sub-country level.
So far I've been able to get a list of distinct themes but this is non-extensive and I don't get the hierarchy / structure of it.
SELECT
DISTINCT theme
FROM (
SELECT
GKGRECORDID,
locations,
REGEXP_EXTRACT(themes,r'(^.[^,]+)') AS theme,
CAST(REGEXP_EXTRACT(locations,r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){2}([^#]*)') AS location_countrycode,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){3}([^#]*)') AS location_adm1code,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){4}([^#]*)') AS location_adm2code,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){5}([^#]*)') AS location_latitude,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){6}([^#]*)') AS location_longitude,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){7}([^#]*)') AS location_featureid,
REGEXP_EXTRACT(locations,r'^(?:[^#]*#){8}([^#]*)') AS location_characteroffset,
DocumentIdentifier
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`,
UNNEST(SPLIT(V2Locations,';')) AS locations,
UNNEST(SPLIT(V2Themes,';')) AS themes
WHERE
_PARTITIONTIME >= "2018-08-20 00:00:00"
AND _PARTITIONTIME < "2018-08-21 00:00:00" )
WHERE
(location_type = 5
OR location_type = 4
OR location_type = 2) --WorldState, WorldCity or US State
ORDER BY
theme
And a list of water related themes I've been able to find so far (sample, not exhaustive):
CRISISLEX_C06_WATER_SANITATION
ENV_WATERWAYS
HUMAN_RIGHTS_ABUSES_WATERBOARD
HUMAN_RIGHTS_ABUSES_WATERBOARDED
HUMAN_RIGHTS_ABUSES_WATERBOARDING
NATURAL_DISASTER_FLOODWATER
NATURAL_DISASTER_FLOODWATERS
NATURAL_DISASTER_FLOOD_WATER
NATURAL_DISASTER_FLOOD_WATERS
NATURAL_DISASTER_HIGH_WATER
NATURAL_DISASTER_HIGH_WATERS
NATURAL_DISASTER_WATER_LEVEL
TAX_AIDGROUPS_WATERAID
TAX_DISEASE_WATERBORNE_DISEASE
TAX_DISEASE_WATERBORNE_DISEASES
TAX_FNCACT_WATERBOY
TAX_FNCACT_WATERMAN
TAX_FNCACT_WATERMEN
TAX_FNCACT_WATER_BOY
TAX_WEAPONS_WATER_CANNON
TAX_WEAPONS_WATER_CANNONS
TAX_WORLDBIRDS_WATERFOWL
TAX_WORLDMAMMALS_WATER_BUFFALO
UNGP_CLEAN_WATER_SANITATION
WATER_SECURITY
WB_1000_WATER_MANAGEMENT_STRUCTURES
WB_1021_WATER_LAW
WB_1063_WATER_ALLOCATION_AND_WATER_SUPPLY
WB_1064_WATER_DEMAND_MANAGEMENT
WB_1199_WATER_SUPPLY_AND_SANITATION
WB_1215_WATER_QUALITY_STANDARDS
WB_137_WATER
WB_138_WATER_SUPPLY
WB_139_SANITATION_AND_WASTEWATER
WB_140_AGRICULTURAL_WATER_MANAGEMENT
WB_141_WATER_RESOURCES_MANAGEMENT
WB_143_RURAL_WATER
WB_144_URBAN_WATER
WB_1462_WATER_SANITATION_AND_HYGIENE
WB_149_WASTEWATER_TREATMENT_AND_DISPOSAL
WB_150_WASTEWATER_REUSE
WB_155_WATERSHED_MANAGEMENT
WB_156_GROUNDWATER_MANAGEMENT
WB_159_TRANSBOUNDARY_WATER
WB_1729_URBAN_WATER_FINANCIAL_SUSTAINABILITY
WB_1731_NON_REVENUE_WATER
WB_1778_FRESHWATER_ECOSYSTEMS
WB_1790_INTERNATIONAL_WATERWAYS
WB_1798_WATER_POLLUTION
WB_1805_WATERWAYS
WB_1998_WATER_ECONOMICS
WB_2008_WATER_TREATMENT
WB_2009_WATER_QUALITY_MONITORING
WB_2971_WATER_PRICING
WB_2981_DRINKING_WATER_QUALITY_STANDARDS
WB_2992_FRESHWATER_FISHERIES
WB_427_WATER_ALLOCATION_AND_WATER_ECONOMICS
While this link is provided as a theme listing:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx
...it is far from complete (perhaps just the original theme list?). I just pulled a single day's worth of GKG, and there are tons of themes not on the list of 283 themes in that spreadsheet.
GKG documentation located at https://blog.gdeltproject.org/world-bank-group-topical-taxonomy-now-in-gkg/ points to a World Bank Taxonomy located at http://pubdocs.worldbank.org/en/275841490966525495/Theme-Taxonomy-and-definitions.pdf. The GKG post implies this World Bank taxonomy has been rolled into the GKG theme list.
This is presented as a complete listing of World Bank Taxonomy themes. Unfortunately, I've found numerous World Bank themes in GKG that aren't in this publication. The union of these two lists represents a portion of GKG themes, but it definitely isn't all of them.
Here is the list of GKG Themes:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx
If anyone needs this, I have added a list of all themes in the GKG v1 in the timeperiod from 1/1/2017-31/12/2020 which are at least present in 10 or more articles for that particular day: Themes.parquet
It consists of 17639 unique themes with the count per day. Looks like this:
The complete numbers for that 4 year dataset is 36 713 385 unique actors, 50 845 unique themes as well as 26 389 528 unique organizations. These numbers are not filtered for different spellings for the same entity, and hence Donald Trump and Donald J. Trump will count as two separate actors.
The best GDELT GKG Themes list I could find is here, as described in this blog post.
I put it into a CSV file, which I find slightly easier to work with, and put that file here.

DAX / Xetra on alphavantage

I am very very happy with Alphavantage.
BUT I can't find the german stocks (Xetra)
I have tried:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=xtr:lin&apikey=MYKEY
(But this works https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=NYSE:DIN&apikey=MYKEY)
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=Lin.be&apikey=MYKEY
(But this works: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=Novo-b.CO&apikey=MYKEY)
So my question is - has anyone had any luck getting german stocks on Alphavanta (or another free service. Realtime is not crucial, but obviously a plus).
I use the "Search Endpoint" function to find german stocks on alphavantage.
Let's say you look for "BASF" you could query:
https://www.alphavantage.co/query?function=SYMBOL_SEARCH&keywords=BASF&apikey=[your API key]&datatype=csv
You get a list with possible matches:
symbol,name,type,region,marketOpen,marketClose,timezone,currency,matchScore
BASFY,BASF SE,Equity,United States,09:30,16:00,UTC-05,USD,0.8889
BFFAF,BASF SE,Equity,United States,09:30,16:00,UTC-05,USD,0.8889
BASFX,BMO Short Tax-Free Fund Class A,Mutual Fund,United States,09:30,16:00,UTC- 05,USD,0.8889
BAS.DEX,BASF SE,Equity,XETRA,08:00,20:00,UTC+02,EUR,0.7273
BAS.FRK,BASF SE,Equity,Frankfurt,08:00,20:00,UTC+02,EUR,0.7273
BASA.DEX,BASF SE,Equity,XETRA,08:00,20:00,UTC+02,EUR,0.7273
BAS.BER,BASF SE NA O.N.,Equity,Berlin,08:00,20:00,UTC+02,EUR,0.7273
BASF.NSE,BASF India Limited,Equity,India/NSE,09:15,15:30,UTC+5.5,INR,0.6000
See documentation: https://www.alphavantage.co/documentation/
It seems to work with the yahoo symbols on alphavantage, at least for a few stocks (I did not check all). BASF for example works with:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=BASF.TI&apikey=MYKEY
The alphavantage symbols for German securities consist of the Xetra symbol + .DE. For example EUNL.DE (for iShares MSCI World Core ETF). You can find a list of all Xetra stocks here.

Geocoding returns two possible locations but presents wrong address as most accurate and right address is least accurate

using https://www.google.ca/maps and the geocoding api gives the same results:
using https://www.google.ca/maps and searching for:
143 GARRISON CIR , RED DEER, AB , Canada
returns two results:
143 Garrison PL
143 Garrison Cir
using the API reveals that it considers the first one '... Pl' more accurate than '... Cir' when clearly the second one is more true to the original addressed used to search since it contains 'Cir'...
using:
https://maps.googleapis.com/maps/api/geocode/xml?address=143%20GARRISON%20CIR%20%2C%20RED%20DEER%2C%20AB%20%2C%20Canada
reveals the first result's accuracy is:
ROOFTOP
and the second result's accuracy is:
RANGE_INTERPOLATED {not as accurate}
WHY???
Interestingly... if I use the postal code in the full address {which I verified with Canada Post as being correct}:
'143 GARRISON CIR , RED DEER, AB T4P0P5, Canada'
I get no results from either method!
again... WHY???
The RANGE_INTERPOLATED result means that there is no exact street address feature in the Google database and the service tries to guess where the address is located. Maybe due to this reason the exact ROOFTOP result is scored higher than an interpolation. Especially taking into account that the coordinates of both results are very close to each other:
https://google-developers.appspot.com/maps/documentation/utils/geocoder/#q%3D143%2520GARRISON%2520CIR%2520%252C%2520RED%2520DEER%252C%2520AB%2520%252C%2520Canada
In order to solve this you should report a missing address to Google using Send feedback mechanism:
https://support.google.com/maps/answer/3094045
Also note that an interpolated result for the address has a different postal code T4N 3M4. Even more, if I try to search the postal code T4P 0P5, I'll get back only postal code prefix T4P:
https://google-developers.appspot.com/maps/documentation/utils/geocoder/#q%3D%26options%3Dtrue%26in_country%3DCA%26in_postal_code%3DT4P%25200P5
That means the postal code T4P 0P5 is also missing from Google database and you should report it as well.
As the postal code is missing you are getting ZERO_RESULTS for complete string 143 GARRISON CIR , RED DEER, AB T4P0P5, Canada
https://google-developers.appspot.com/maps/documentation/utils/geocoder/#q%3D143%2520GARRISON%2520CIR%2520%252C%2520RED%2520DEER%252C%2520AB%2520T4P0P5%252C%2520Canada%26options%3Dtrue
As you mentioned, we can see that the same behavior is reproducible on maps.google.com. There are two options for the address and Garrison Pl is the first item while Garrison Cir is the second. That confirms that this is a data issue rather than API issue:
I hope this explains your doubt.