Understanding Themes in Google BigQuery GDELT GKG 2.0 - google-bigquery

I'm using Google BigQuery to analyze the GDELT GKG 2.0 dataset and would like to better understand how to query based on themes (or V2Themes). The docs mention a 'Category List' spreadsheet, but so far I've been unsuccessful in finding that list.
The following awesome blog mentions that you can use the World Bank Taxonomy, among others, to narrow down your search. My objective is to find all items that mention "droughts / too little water", all items that mention "floods / too much water", and all items that mention "poor quality / too dirty water" that have a geographical match at a sub-country level.
So far I've been able to get a list of distinct themes, but it is not exhaustive and I don't understand the hierarchy / structure behind it.
SELECT
  DISTINCT theme
FROM (
  SELECT
    GKGRECORDID,
    locations,
    REGEXP_EXTRACT(themes,r'(^.[^,]+)') AS theme,
    CAST(REGEXP_EXTRACT(locations,r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){2}([^#]*)') AS location_countrycode,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){3}([^#]*)') AS location_adm1code,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){4}([^#]*)') AS location_adm2code,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){5}([^#]*)') AS location_latitude,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){6}([^#]*)') AS location_longitude,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){7}([^#]*)') AS location_featureid,
    REGEXP_EXTRACT(locations,r'^(?:[^#]*#){8}([^#]*)') AS location_characteroffset,
    DocumentIdentifier
  FROM
    `gdelt-bq.gdeltv2.gkg_partitioned`,
    UNNEST(SPLIT(V2Locations,';')) AS locations,
    UNNEST(SPLIT(V2Themes,';')) AS themes
  WHERE
    _PARTITIONTIME >= "2018-08-20 00:00:00"
    AND _PARTITIONTIME < "2018-08-21 00:00:00" )
WHERE
  location_type IN (2, 4, 5)  -- US State, World City or World State
ORDER BY
  theme
And here is a list of water-related themes I've been able to find so far (a sample, not exhaustive):
CRISISLEX_C06_WATER_SANITATION
ENV_WATERWAYS
HUMAN_RIGHTS_ABUSES_WATERBOARD
HUMAN_RIGHTS_ABUSES_WATERBOARDED
HUMAN_RIGHTS_ABUSES_WATERBOARDING
NATURAL_DISASTER_FLOODWATER
NATURAL_DISASTER_FLOODWATERS
NATURAL_DISASTER_FLOOD_WATER
NATURAL_DISASTER_FLOOD_WATERS
NATURAL_DISASTER_HIGH_WATER
NATURAL_DISASTER_HIGH_WATERS
NATURAL_DISASTER_WATER_LEVEL
TAX_AIDGROUPS_WATERAID
TAX_DISEASE_WATERBORNE_DISEASE
TAX_DISEASE_WATERBORNE_DISEASES
TAX_FNCACT_WATERBOY
TAX_FNCACT_WATERMAN
TAX_FNCACT_WATERMEN
TAX_FNCACT_WATER_BOY
TAX_WEAPONS_WATER_CANNON
TAX_WEAPONS_WATER_CANNONS
TAX_WORLDBIRDS_WATERFOWL
TAX_WORLDMAMMALS_WATER_BUFFALO
UNGP_CLEAN_WATER_SANITATION
WATER_SECURITY
WB_1000_WATER_MANAGEMENT_STRUCTURES
WB_1021_WATER_LAW
WB_1063_WATER_ALLOCATION_AND_WATER_SUPPLY
WB_1064_WATER_DEMAND_MANAGEMENT
WB_1199_WATER_SUPPLY_AND_SANITATION
WB_1215_WATER_QUALITY_STANDARDS
WB_137_WATER
WB_138_WATER_SUPPLY
WB_139_SANITATION_AND_WASTEWATER
WB_140_AGRICULTURAL_WATER_MANAGEMENT
WB_141_WATER_RESOURCES_MANAGEMENT
WB_143_RURAL_WATER
WB_144_URBAN_WATER
WB_1462_WATER_SANITATION_AND_HYGIENE
WB_149_WASTEWATER_TREATMENT_AND_DISPOSAL
WB_150_WASTEWATER_REUSE
WB_155_WATERSHED_MANAGEMENT
WB_156_GROUNDWATER_MANAGEMENT
WB_159_TRANSBOUNDARY_WATER
WB_1729_URBAN_WATER_FINANCIAL_SUSTAINABILITY
WB_1731_NON_REVENUE_WATER
WB_1778_FRESHWATER_ECOSYSTEMS
WB_1790_INTERNATIONAL_WATERWAYS
WB_1798_WATER_POLLUTION
WB_1805_WATERWAYS
WB_1998_WATER_ECONOMICS
WB_2008_WATER_TREATMENT
WB_2009_WATER_QUALITY_MONITORING
WB_2971_WATER_PRICING
WB_2981_DRINKING_WATER_QUALITY_STANDARDS
WB_2992_FRESHWATER_FISHERIES
WB_427_WATER_ALLOCATION_AND_WATER_ECONOMICS
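
For context, here is a rough, untested sketch of the direction I want to take from Python, counting mentions of a few of the water-related themes above by prefix. The prefixes are just guesses pulled from that list (note there isn't even a drought theme in it, which is part of my problem), so an authoritative theme list would really help:

from google.cloud import bigquery

client = bigquery.Client()

# Theme prefixes are guesses based on the list above; this is exactly where a
# complete, authoritative theme list would help.
sql = """
SELECT
  REGEXP_EXTRACT(raw_theme, r'^[^,]+') AS theme,
  COUNT(*) AS mentions
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`,
  UNNEST(SPLIT(V2Themes, ';')) AS raw_theme
WHERE
  _PARTITIONTIME >= TIMESTAMP("2018-08-20")
  AND _PARTITIONTIME < TIMESTAMP("2018-08-21")
  AND (raw_theme LIKE 'NATURAL_DISASTER_FLOOD%'
    OR raw_theme LIKE 'WB_2981_DRINKING_WATER_QUALITY%'
    OR raw_theme LIKE 'UNGP_CLEAN_WATER_SANITATION%'
    OR raw_theme LIKE 'WB_1798_WATER_POLLUTION%')
GROUP BY theme
ORDER BY mentions DESC
"""

for row in client.query(sql).result():
    print(row.theme, row.mentions)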

While this link is provided as a theme listing:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx
...it is far from complete (perhaps just the original theme list?). I just pulled a single day's worth of GKG, and there are tons of themes not on the list of 283 themes in that spreadsheet.
The GKG documentation at https://blog.gdeltproject.org/world-bank-group-topical-taxonomy-now-in-gkg/ points to a World Bank taxonomy located at http://pubdocs.worldbank.org/en/275841490966525495/Theme-Taxonomy-and-definitions.pdf and implies that this World Bank taxonomy has been rolled into the GKG theme list.
That PDF is presented as a complete listing of World Bank taxonomy themes. Unfortunately, I've found numerous World Bank themes in the GKG that aren't in that publication. The union of these two lists covers a portion of the GKG themes, but it definitely isn't all of them.

Here is the list of GKG Themes:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx

If anyone needs this, I have added a list of all themes in GKG v1 for the period 1/1/2017 - 31/12/2020 that are present in at least 10 articles on a given day: Themes.parquet
It consists of 17,639 unique themes with a count per day.
The complete numbers for that four-year dataset are 36,713,385 unique actors, 50,845 unique themes and 26,389,528 unique organizations. These numbers are not de-duplicated across different spellings of the same entity, so "Donald Trump" and "Donald J. Trump" count as two separate actors.
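A minimal sketch of reading that file back with pandas; the column names (theme, date, count) are an assumption, so adjust them to whatever Themes.parquet actually contains:

import pandas as pd

# Column names are an assumption; adjust them to whatever Themes.parquet actually uses.
themes = pd.read_parquet("Themes.parquet")
water = themes[themes["theme"].str.contains("WATER", na=False)]
print(water.groupby("theme")["count"].sum().sort_values(ascending=False).head(20))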

The best GDELT GKG Themes list I could find is here, as described in this blog post.
I put it into a CSV file, which I find slightly easier to work with, and put that file here.

Related

Matching an element in a column to others in the same column

I have columns taken from Excel as a dataframe; the columns are as follows:
HolidayTourProvider|Packages|Meals|Accommodation|LocalTravelVehicle|Cancellationfee
HolidayTourProvider holds a couple of company names.
For Packages, the features provided in each package are mostly the same (Meals, Accommodation, etc.), even though one company may call a package "Saver" while another calls it "Budget". Most of these columns hold Yes/No values, except LocalTravelVehicle, which holds car names like Ford Taurus or Jeep Cherokee, and Cancellationfee, which holds integers.
I need to write a function like
match(HolidayTP, Package)
where the user can give input like
match(AdventureLife, Luxury)
and I then need to return all the packages from other holiday tour providers that have features similar to Luxury, no matter what those providers call their package ('Semi Lux', 'Comfort', etc.).
I want to keep a counter for every match and display all the packages whose counter exceeds 3 or 4.
This is my first Python code and I am stuck here.
fb is the full DataFrame I exported the Excel data into:
def mapHol(HTP, PACKAGE):
    # rows for the selected provider / package combination
    mfb = (fb['HTP'] == HTP) & (fb['package'] == PACKAGE)
    B = fb[mfb]
    count = 0
    # count how many cell values in the rest of fb match the selected rows, column by column
    for col in B.columns:
        for val in B[col]:
            count += int((fb.loc[~mfb, col] == val).sum())
    return count
I don't know how to proceed. Please help me; this is my first major project and I started it on my own.

DAX / Xetra on alphavantage

I am very, very happy with Alphavantage, but I can't find the German stocks (Xetra).
I have tried:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=xtr:lin&apikey=MYKEY
(But this works https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=NYSE:DIN&apikey=MYKEY)
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=Lin.be&apikey=MYKEY
(But this works: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=Novo-b.CO&apikey=MYKEY)
So my question is: has anyone had any luck getting German stocks on Alphavantage (or another free service)? Real-time data is not crucial, but obviously a plus.
I use the "Search Endpoint" function to find german stocks on alphavantage.
Let's say you look for "BASF" you could query:
https://www.alphavantage.co/query?function=SYMBOL_SEARCH&keywords=BASF&apikey=[your API key]&datatype=csv
You get a list with possible matches:
symbol,name,type,region,marketOpen,marketClose,timezone,currency,matchScore
BASFY,BASF SE,Equity,United States,09:30,16:00,UTC-05,USD,0.8889
BFFAF,BASF SE,Equity,United States,09:30,16:00,UTC-05,USD,0.8889
BASFX,BMO Short Tax-Free Fund Class A,Mutual Fund,United States,09:30,16:00,UTC-05,USD,0.8889
BAS.DEX,BASF SE,Equity,XETRA,08:00,20:00,UTC+02,EUR,0.7273
BAS.FRK,BASF SE,Equity,Frankfurt,08:00,20:00,UTC+02,EUR,0.7273
BASA.DEX,BASF SE,Equity,XETRA,08:00,20:00,UTC+02,EUR,0.7273
BAS.BER,BASF SE NA O.N.,Equity,Berlin,08:00,20:00,UTC+02,EUR,0.7273
BASF.NSE,BASF India Limited,Equity,India/NSE,09:15,15:30,UTC+5.5,INR,0.6000
See documentation: https://www.alphavantage.co/documentation/
It seems to work with the Yahoo symbols on Alphavantage, at least for a few stocks (I did not check all of them). BASF, for example, works with:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=BASF.TI&apikey=MYKEY
The Alphavantage symbols for German securities consist of the Xetra symbol + .DE, for example EUNL.DE (for the iShares MSCI World Core ETF). You can find a list of all Xetra stocks here.
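To tie the search and time-series steps together, here is a rough Python sketch (mine, not from the answers above) that looks up the XETRA listing for a keyword via SYMBOL_SEARCH and then pulls its daily series; the JSON field names ("bestMatches", "1. symbol", "4. region", "Time Series (Daily)") follow Alphavantage's documented response format:

import requests

API_KEY = "MYKEY"  # replace with your own key
BASE = "https://www.alphavantage.co/query"

# 1. Find the XETRA listing for a keyword
search = requests.get(BASE, params={
    "function": "SYMBOL_SEARCH",
    "keywords": "BASF",
    "apikey": API_KEY,
}).json()
xetra = [m for m in search.get("bestMatches", []) if m.get("4. region") == "XETRA"]
symbol = xetra[0]["1. symbol"] if xetra else "BAS.DEX"

# 2. Pull the daily time series for that symbol
daily = requests.get(BASE, params={
    "function": "TIME_SERIES_DAILY",
    "symbol": symbol,
    "apikey": API_KEY,
}).json()
print(symbol, list(daily.get("Time Series (Daily)", {}).items())[:3])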

NYT article search API not returning results for certain queries

I have a set of queries and I am trying to get web_urls using the NYT article search API. But I am seeing that it works for q2 below but not for q1.
q1: Seattle+Jacob Vigdor+the University of Washington
q2: Seattle+Jacob Vigdor+University of Washington
If you paste the url below with your API key in the web browser, you get an empty result.
Search request for q1
api.nytimes.com/svc/search/v2/articlesearch.json?q=Seattle+Jacob%20Vigdor+the%20University%20of%20Washington&begin_date=20170626&api-key=XXXX
Empty results for q1
{"response":{"meta":{"hits":0,"time":27,"offset":0},"docs":[]},"status":"OK","copyright":"Copyright (c) 2013 The New York Times Company. All Rights Reserved."}
Instead if you paste the following in your web browser (without the article 'the' in the query) you get non-empty results
Search request for q2
api.nytimes.com/svc/search/v2/articlesearch.json?q=Seattle+Jacob%20Vigdor+University%20of%20Washington&begin_date=20170626&api-key=XXXX
Non-empty results for q2
{"response":{"meta":{"hits":1,"time":22,"offset":0},"docs":[{"web_url":"https://www.nytimes.com/aponline/2017/06/26/us/ap-us-seattle-minimum-wage.html","snippet":"Seattle's $15-an-hour minimum wage law has cost the city jobs, according to a study released Monday that contradicted another new study published last week....","lead_paragraph":"Seattle's $15-an-hour minimum wage law has cost the city jobs, according to a study released Monday that contradicted another new study published last week.","abstract":null,"print_page":null,"blog":[],"source":"AP","multimedia":[],"headline":{"main":"New Study of Seattle's $15 Minimum Wage Says It Costs Jobs","print_headline":"New Study of Seattle's $15 Minimum Wage Says It Costs Jobs"},"keywords":[],"pub_date":"2017-06-26T15:16:28+0000","document_type":"article","news_desk":"None","section_name":"U.S.","subsection_name":null,"byline":{"person":[],"original":"By THE ASSOCIATED PRESS","organization":"THE ASSOCIATED PRESS"},"type_of_material":"News","_id":"5951255195d0e02550996fb3","word_count":643,"slideshow_credits":null}]},"status":"OK","copyright":"Copyright (c) 2013 The New York Times Company. All Rights Reserved."}
Interestingly, both queries work fine on the API test page
http://developer.nytimes.com/article_search_v2.json#/Console/
Also, if you look at the article below, which is returned by q2, you can see that the q1 query term 'the University of Washington' does occur in it, so q1 should have returned this article too.
https://www.nytimes.com/aponline/2017/06/26/us/ap-us-seattle-minimum-wage.html
I am confused about this behaviour of the API. Any ideas what's going on? Am I missing something?
Thank you for all the answers. Below I am pasting the answer I received from NYT developers.
NYT's Article Search API uses Elasticsearch. There are lots of docs online about the query syntax of Elasticsearch (it is based on Lucene).
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
If you want articles that contain "Seattle", "Jacob Vigdor" and "University of Washington", do
"Seattle" AND "Jacob Vigdor" AND "University of Washington"
or
+"Seattle" +"Jacob Vigdor" +"University of Washington"
I think you need to change the encoding: the spaces should become + (not %20) and the + separators should become %2B:
In your example,
q=Seattle+Jacob%20Vigdor+the%20University%20of%20Washington
When I submit from the page on the site, it uses %2B:
q=Seattle%2BJacob+Vigdor%2Bthe+University+of+Washington
How are you URL encoding? One way to fix it would be to replace your spaces with + before URL encoding.
Also, you may need to replace %20 with +. There are various schemes for URL encoding, so the best way would depend on how you are doing it.
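Putting both suggestions together, here is a small Python sketch (mine, not from the NYT reply) that quotes each phrase, ANDs them, and lets requests handle the URL encoding so the +/%2B issue goes away:

import requests

API_KEY = "XXXX"  # your NYT API key

# Quote each phrase and require all of them, per the Elasticsearch query-string syntax;
# requests URL-encodes the parameters, so no manual +/%2B juggling is needed.
query = '"Seattle" AND "Jacob Vigdor" AND "University of Washington"'
resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={"q": query, "begin_date": "20170626", "api-key": API_KEY},
)
for doc in resp.json()["response"]["docs"]:
    print(doc["web_url"])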

Facebook Application that Searches for nearest Place

Is it possible to create a Facebook application that would display a list of the nearest places of interest?
Much like the existing Urbanspoon application (http://www.urbanspoon.com/c/338/Perth-restaurants.html), but on Facebook.
Can somebody please point me in the right direction? :)
Get place ID for user's current location:
SELECT current_location FROM user WHERE uid=me()
Get coordinates for that place ID:
https://graph.facebook.com/106412259396611
Get nearby places:
https://graph.facebook.com/search?type=place&center=65.5833,22.15&distance=1000
You could of course get the latitude and longitude by other means.
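For completeness, here is a sketch of those three calls from Python as they would have looked at the time; FQL and unauthenticated Graph API calls have long since been removed, so treat this as historical, and ACCESS_TOKEN is a placeholder for a valid user access token:

import requests

ACCESS_TOKEN = "..."  # placeholder for a valid user access token

# 1. Place ID of the user's current location (FQL, long since removed)
loc = requests.get("https://graph.facebook.com/fql", params={
    "q": "SELECT current_location FROM user WHERE uid=me()",
    "access_token": ACCESS_TOKEN,
}).json()

# 2. Coordinates for that place ID
place = requests.get("https://graph.facebook.com/106412259396611",
                     params={"access_token": ACCESS_TOKEN}).json()

# 3. Places within 1000 m of those coordinates
nearby = requests.get("https://graph.facebook.com/search", params={
    "type": "place",
    "center": "65.5833,22.15",
    "distance": 1000,
    "access_token": ACCESS_TOKEN,
}).json()
print([p.get("name") for p in nearby.get("data", [])])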
I know this is a resolved post, but I've also been doing some work on this and thought I'd add my findings / advice...
Using the FQL:
SELECT page_id, name, description, display_subtext FROM place WHERE distance(latitude, longitude, "52.3", "-1.53333") < 10000 AND checkin_count > 50 AND CONTAINS("coffee") order by checkin_count DESC LIMIT 100
This will get pages containing the word "coffee" in their title, description or meta-data that are within 10k of the specified lat/lon.
"page_id": 163313220350407,
"name": "Starbucks At Warwick Services, M40",
"display_subtext": "Coffee Shop・Service Station Supply・3,012 were here"
I like to set a minimum check-in count of around 50 to help filter out the ad-hoc places that single users create, such as "the coffee machine at Jim's house". This helps ensure better quality and relevancy of the returned results.
Hope this helps someone.

Creating, Visualizing and Querying simple Data Structures

Simple and common tree-like data structures
Data structure example:
Animated Cartoons have 4 extremities (arm, leg, limb, ...)
Humans have 4 ext.
Insects have 6 ext.
Arachnids have 6 ext.
Animated Cartoons have 4 fingers per extremity
Humans have 5 per ext.
Insects have 1 per ext.
Arachnids have 1 per ext.
Some Kind of Implementation
Level/Table0
Quantity, Item
Level/Table1
ItemName, Kingdom
Level/Table2
Kingdom, NumberOfExtremities
Level/Table3
ExtremityName, NumberOfFingers
Example Dataset
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
Querying.. "Number of fingers"
Number = 1*4*4 + 1*4*4 + 1*4*5 + 3*6*1 + 2*6*1 = 82 fingers (Let Jon be a Human)
I wonder if there is any tool that lets me define this in a parseable way so the inherited data can be generated automatically, and that can draw this kind of tree (with the bonus of making this kind of data access possible, if such a thing exists).
It could be drawn manually with, for example, FreeMind, but AFAIK it doesn't let you define datatypes or structures that automatically create inherited branches of items, so it's really annoying to repeat a structure over and over by copying (with the risk of mistakes). Repeated work over repeated data (a human running repeated code) is a recipe for bugs.
So I would like to write the data in a language that lets me reuse it for queries and visualization. If all the data is in XML, Java classes, a database file, etc., is there some tool for viewing the tree and running the queries?
PS: Creating nested folders in a filesystem and browsing them with Norton Commander in tree view is not an option, I hope (simply because it would have to be built manually).
Your answer is mostly going to depend on what programming skills you already have and what skills you are willing to acquire. I can tell you what I would do with what I know.
I think for drawing trees you want a LaTeX package like qtree. If you don't like this one, there are a bunch of others out there. You'd have to write a script in whatever your favorite scripting language is to parse your input into the LaTeX code to generate the trees, but this could easily be done with less than 100 lines in most languages, if I properly understand your intentions. I would definitely recommend storing your data in an XML format using a library like Ruby's REXML, or whatever your favorite scripting language has.
If you are looking to generate more interactive trees, check out the Adobe Flex Framework. Again, if you don't like this specific framework, there are bunches of others out there (I recommend the blog FlowingData).
Hope this helps and I didn't miserably misunderstand your question.
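As a concrete illustration of the "small script that emits LaTeX" idea, here is a hypothetical Python sketch (the names and input structure are mine, not from any existing tool) that turns a nested (label, children) tuple into qtree syntax:

# Hypothetical sketch: convert a nested (label, children) tuple into qtree syntax.
def to_qtree(node):
    label, children = node
    if not children:
        return label
    return "[.{} {} ]".format(label, " ".join(to_qtree(c) for c in children))

taxonomy = ("Creatures", [
    ("AnimatedCartoons", [("ext=4", []), ("fingers-per-ext=4", [])]),
    ("Human", [("ext=4", []), ("fingers-per-ext=5", [])]),
    ("Insects", [("ext=6", []), ("fingers-per-ext=1", [])]),
    ("Arachnids", [("ext=6", []), ("fingers-per-ext=1", [])]),
])

# Paste the output into a LaTeX document that loads \usepackage{qtree}.
print("\\Tree " + to_qtree(taxonomy))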
The data structure that you are describing looks like it fits the XML format well. Take a look at the eXist XML database; if I may say so, it is the most complete XML database. It comes with many tools to get you started fast, like the XQuery sandbox option in the admin HTTP interface.
Example Dataset
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
I am assuming that there are 2 instances of jon skeet, 3 instances of Atomic ant and 2 instances of Shelob
Here is a XQuery example:
let $doc :=
  <root>
    <definition>
      <AnimatedCartoons>
        <extremities>4</extremities>
        <fingers_per_ext>4</fingers_per_ext>
      </AnimatedCartoons>
      <Human>
        <extremities>4</extremities>
        <fingers_per_ext>5</fingers_per_ext>
      </Human>
      <Insects>
        <extremities>6</extremities>
        <fingers_per_ext>1</fingers_per_ext>
      </Insects>
      <Arachnids>
        <extremities>6</extremities>
        <fingers_per_ext>1</fingers_per_ext>
      </Arachnids>
    </definition>
    <subject><name>Homer Simpson</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>Ralph Wiggum</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
  </root>
let $definitions := $doc/definition/*
let $subjects := $doc/subject
(: here goes some query logic :)
let $fingers := fn:sum(
  for $subject in $subjects
  return (
    for $x in $definitions
    where fn:name($x) = $subject/kind
    return $x/extremities * $x/fingers_per_ext
  )
)
return $fingers
An XML Schema editor with visualization is perhaps what I am searching for:
http://en.wikipedia.org/wiki/XML_Schema_Editor
I'm checking it out...