How to improve the results from DBpedia Spotlight?

I am using DBpedia Spotlight to extract DBpedia resources as follows.
import json
import urllib.parse

import requests
from SPARQLWrapper import SPARQLWrapper, JSON

## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'

REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

all_urls = []
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])  # the key in Spotlight's JSON response is '@URI'
print(all_urls)
The results I got are as follows.
['http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Helix',
'http://dbpedia.org/resource/Bronchitis',
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']
As you can see, the results are not very good.
For example, consider Hedera helix extract in the text above. Even though DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight splits it into two URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix.
For my dataset, I would like the longest matching term in DBpedia to be returned. What improvements can I make to get my desired output?
I am happy to provide more details if needed.

Although I am answering quite late, you can use the BabelNet API in Python to obtain DBpedia URIs that cover longer text spans. I reproduced the problem using the code below:
from babelpy.babelfy import BabelfyClient

text = """Tolerance, safety and efficacy of Hedera helix extract in inflammatory
bronchial diseases under clinical practice conditions: a prospective, open,
multicentre postmarketing study in 9657 patients. In this postmarketing
study 9657 patients (5181 children) with bronchitis (acute or chronic
bronchial inflammatory disease) were treated with a syrup containing dried ivy
leaf extract. After 7 days of therapy, 95% of the patients showed improvement
or healing of their symptoms. The safety of the therapy was very good with an
overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders
with 1.5%). In those patients who got concomitant medication as well, it could
be shown that the additional application of antibiotics had no benefit
respective to efficacy but did increase the relative risk for the occurrence
of side effects by 26%. In conclusion, it is to say that the dried ivy leaf
extract is effective and well tolerated in patients with bronchitis. In view
of the large population considered, future analyses should approach specific
issues concerning therapy by age group, concomitant therapy and baseline
conditions."""

# Instantiate the Babelfy client.
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Babelfy the text.
babel_client.babelfy(text)

# Get all merged entities.
babel_client.all_merged_entities
The output for each merged entity in the text has the format shown below. You can store and process this dictionary structure further to extract the DBpedia URIs (a short sketch follows the sample).
{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
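For example, a minimal sketch (assuming all_merged_entities returns a list of dictionaries like the sample above) that collects the non-empty DBpedia URIs:
# Collect DBpedia URIs from the merged entities returned by Babelfy.
dbpedia_urls = [
    ent['DBpediaURL']
    for ent in babel_client.all_merged_entities
    if ent.get('isEntity') and ent.get('DBpediaURL')
]
print(dbpedia_urls)
# e.g. ['http://dbpedia.org/resource/Hedera_helix', ...]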


Scraping contents of news articles

I was able to scrape the title, date, links, and content of news on these links: https://www.news24.com/news24/southafrica/crime-and-courts and https://www.news24.com/news24/southafrica/education. The output is saved in an Excel file. However, I noticed that not all the content inside the articles was scraped. I have tried different methods in the "Getting Content Section" of my code. Any help with this will be appreciated. Below is my code:
import sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from datetime import timedelta

art_title = []  # to store the titles of all news articles
art_date = []   # to store the dates of all news articles
art_link = []   # to store the links of all news articles
pagesToGet = ['southafrica/crime-and-courts', 'southafrica/education']

for i in range(0, len(pagesToGet)):
    print('processing page : \n')
    url = 'https://www.news24.com/news24/' + str(pagesToGet[i])
    print(url)
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.maximize_window()
    try:
        driver.get("https://www.news24.com/news24/" + str(pagesToGet[i]))
    except Exception as e:
        error_type, error_obj, error_info = sys.exc_info()
        print('ERROR FOR LINK:', url)
        print(error_type, 'Line:', error_info.tb_lineno)
        continue
    time.sleep(3)
    # Scroll one screen height at a time until the bottom of the page is reached
    scroll_pause_time = 1
    screen_height = driver.execute_script("return window.screen.height;")
    i = 1
    while True:
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
        i += 1
        time.sleep(scroll_pause_time)
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        if screen_height * i > scroll_height:
            break
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    news = soup.find_all('article', attrs={'class': 'article-item'})
    print(len(news))
    # Getting titles, dates, and links
    for j in news:
        titles = j.select_one('.article-item__title span')
        title = titles.text.strip()
        dates = j.find('p', attrs={'class': 'article-item__date'})
        date = dates.text.strip()
        address = j.find('a').get('href')
        news_link = 'https://www.news24.com' + address
        art_title.append(title)
        art_date.append(date)
        art_link.append(news_link)

df = pd.DataFrame({'Article_Title': art_title, 'Date': art_date, 'Source': art_link})

# Getting Content Section
news_articles = []  # to store the content of each news article
news_count = 0
for link in df['Source']:
    print('\n')
    start_time = time.monotonic()
    print('Article No. ', news_count)
    print('Link: ', link)
    # Countermeasure for broken links
    try:
        if requests.get(link):
            news_response = requests.get(link)
    except requests.exceptions.ConnectionError:
        news_response = 'N/A'
    # Auto sleep trigger after every 100 saved articles
    sleep_time = [100, 200, 300, 400, 500]
    if news_count in sleep_time:
        time.sleep(12)
    try:
        if news_response.text:
            news_data = news_response.text
    except AttributeError:
        news_data = 'N/A'
    news_soup = BeautifulSoup(news_data, 'html.parser')
    try:
        if news_soup.find('div', {'class': 'article__body'}):
            art_cont = news_soup.find('div', 'article__body')
            art = []
            article_text = [i.text.strip().replace("\xa0", " ") for i in art_cont.findAll('p')]
            art.append(article_text)
    except AttributeError:
        art = 'N/A'
    print('\n')
    news_count += 1
    news_articles.append(art)
    end_time = time.monotonic()
    print(timedelta(seconds=end_time - start_time))
    print('\n')

# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset="Source", keep=False, inplace=True)
# Don't store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_excel('SA_news24_3.xlsx')
driver.quit()
I tried the following code in the Getting Content Section as well. However, it produced the same output.
article_text = [i.get_text(strip=True).replace("\xa0", " ") for i in art_cont.findAll('p')]
The site has various types of URLs, so your code was omitting some of them: either it found them malformed, or they required a subscription to read. For the articles that require a subscription, I have added "Login to read" followed by the article's link. I ran this code up to article number 670 and it didn't give any error. I had to change the output from .xlsx to .csv, since openpyxl was giving an error in Python 3.11.0.
Full Code
import time
import sys
from datetime import timedelta
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json

art_title = []  # to store the titles of all news articles
art_date = []   # to store the dates of all news articles
art_link = []   # to store the links of all news articles
pagesToGet = ['southafrica/crime-and-courts',
              'southafrica/education', 'places/gauteng']

for i in range(0, len(pagesToGet)):
    print('processing page : \n')
    if "places" in pagesToGet[i]:
        url = f"https://news24.com/api/article/loadmore/tag?tagType=places&tag={pagesToGet[i].split('/')[1]}&pagenumber=1&pagesize=100&ishomepage=false&ismobile=false"
    else:
        url = f"https://news24.com/api/article/loadmore/news24/{pagesToGet[i]}?pagenumber=1&pagesize=1200&ishomepage=false&ismobile=false"
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.json()["htmlContent"], 'html.parser')
    news = soup.find_all('article', attrs={'class': 'article-item'})
    print(len(news))
    # Getting titles, dates, and links
    for j in news:
        titles = j.select_one('.article-item__title span')
        title = titles.text.strip()
        dates = j.find('p', attrs={'class': 'article-item__date'})
        date = dates.text.strip()
        address = j.find('a').get('href')
        # Countermeasure for links with full url
        if "https://" in address:
            news_link = address
        else:
            news_link = 'https://www.news24.com' + address
        art_title.append(title)
        art_date.append(date)
        art_link.append(news_link)

df = pd.DataFrame({'Article_Title': art_title,
                   'Date': art_date, 'Source': art_link})

# Getting Content Section
news_articles = []  # to store the content of each news article
news_count = 0
for link in df['Source']:
    start_time = time.monotonic()
    print('Article No. ', news_count)
    print('Link: ', link)
    news_response = requests.get(link)
    news_data = news_response.content
    news_soup = BeautifulSoup(news_data, 'html.parser')
    art_cont = news_soup.find('div', 'article__body')
    # Countermeasure for links with subscribe form
    try:
        try:
            article = art_cont.text.split("Newsletter")[0] + art_cont.text.split("Sign up")[1]
        except:
            article = art_cont.text
        article = " ".join(article.strip().split())
    except:
        article = f"Login to read {link}"
    news_count += 1
    news_articles.append(article)
    end_time = time.monotonic()
    print(timedelta(seconds=end_time - start_time))
    print('\n')

# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset="Source", keep=False, inplace=True)
# Don't store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_csv('SA_news24_3.csv')
Output
,Article_Title,Date,News
0,Pastor gets double life sentence plus two 15-year terms for rape and murder of women,2h ago,"A pastor has been sentenced to two life sentences and two 15-year jail terms for rape and murder His modus operandi was to take the women to secluded areas before raping them and tying them to trees.One woman managed to escape after she was raped but the bodies of two others were found tied to the trees.The North West High Court has sentenced a 50-year-old pastor to two life terms behind bars and two 15-year jail terms for rape and murder.Lucas Chauke, 50, was sentenced on Monday for the crimes, which were committed in 2017 and 2018 in Temba in the North West.According to North West National Prosecuting Authority spokesperson Henry Mamothame, Chauke's first victim was a 53-year-old woman.He said Chauke pretended that he would assist the woman with her spirituality and took her to a secluded place near to a dam.""Upon arrival, he repeatedly raped her and subsequently tied her to a tree before fleeing the scene,"" Mamothame said. The woman managed to untie herself and ran to seek help. She reported the incident to the police, who then started searching for Chauke.READ | Kidnappings doubled nationally: over 4 000 cases reported to police from July to SeptemberOn 10 May the following year, Chauke pounced on his second victim - a 55-year-old woman.He took her to the same secluded area next to the dam, raped her and tied her to a tree before fleeing. This time his victim was unable to free herself.Her decomposed body was later found, still tied to the tree, Mamothame said. His third victim was targeted months later, on 3 August, in the same area. According to Mamothame, Chauke attempted to rape her but failed.""He then tied her to a tree and left her to die,"" he said. Chauke was charged in connection with her murder.He was linked to the crimes via DNA.READ | 'These are not pets': Man gives away his two pit bulls after news of child mauled to deathIn aggravation of his sentence, State advocate Benny Kalakgosi urged the court not to deviate from the prescribed minimum sentences, saying that the offences Chauke had committed were serious.""He further argued that Chauke took advantage of unsuspecting women, who trusted him as a pastor but instead, [he] took advantage of their vulnerability,"" Mamothame said. Judge Frances Snyman agreed with the State and described Chauke's actions as horrific.The judge also alluded to the position of trust that he abused, Mamothame said.Chauke was sentenced to life for the rape of the first victim, 15 years for the rape of the second victim and life for her murder, as well as 15 years for her murder. He was also declared unfit to possess a firearm."
1,'I am innocent': Alleged July unrest instigator Ngizwe Mchunu pleads not guilty,4h ago,"Former Ukhozi FM DJ Ngizwe Mchunu has denied inciting the July 2022 unrest.Mchunu pleaded not guilty to charges that stem from the incitement allegations.He also claimed he had permission to travel to Gauteng for work during the Covid-19 lockdown.""They are lying. I know nothing about those charges,"" alleged July unrest instigator, former Ukhozi FM DJ Ngizwe Brian Mchunu, told the Randburg Magistrate's Court when his trial started on Tuesday.Mchunu pleaded not guilty to the charges against him, which stems from allegations that he incited public violence, leading to the destruction of property, and convened a gathering in contravening of Covid-19 lockdown regulations after the detention of former president Jacob Zuma in July last year.In his plea statement, Mchunu said all charges were explained to him.""I am a radio and television personality. I'm also a poet and cultural activist. In 2020, I established my online radio.READ | July unrest instigators could face terrorism-related charges""On 11 July 2021, I sent invitations to journalists to discuss the then-current affairs. At the time, it was during the arrest of Zuma.""I held a media briefing at a hotel in Bryanston to show concerns over Zuma's arrest. Zuma is my neighbour [in Nkandla]. In my African culture, I regard him as my father.""Mchunu continued that he was not unhappy about Zuma's arrest but added: ""I didn't condone any violence. I pleaded with fellow Africans to stop destroying infrastructure. I didn't incite any violence.I said to them, 'My brothers and sisters, I'm begging you as we are destroying our country.'""He added:They are lying. I know nothing about those charges. I am innocent. He also claimed that he had permission to travel to Gauteng for work during the lockdown.The hearing continues."
2,Jukskei River baptism drownings: Pastor of informal 'church' goes to ground,5h ago,"A pastor of the church where congregants drowned during a baptism ceremony has gone to ground.Johannesburg Emergency Medical Services said his identity was not known.So far, 14 bodies have been retrieved from the river.According to Johannesburg Emergency Medical Services (EMS), 13 of the 14 bodies retrieved from the Jukskei River have been positively identified.The bodies are of congregants who were swept away during a baptism ceremony on Saturday evening in Alexandra.The search for the other missing bodies continues.Reports are that the pastor of the church survived the flash flood after congregants rescued him.READ | Jukskei River baptism: Families gather at mortuary to identify loved onesEMS spokesperson Robert Mulaudzi said they had been in contact with the pastor since the day of the tragedy, but that they had since lost contact with him.It is alleged that the pastor was not running a formal church, but rather used the Jukskei River as a place to perform rituals for people who came to him for consultations. At this stage, his identity is not known, and because his was not a formal church, Mulaudzi could not confirm the number of people who could have been attending the ceremony.Speaking to the media outside the Sandton fire station on Tuesday morning, a member of the rescue team, Xolile Khumalo, said: Thirteen out of the 14 bodies retrieved have been identified, and the one has not been identified yet.She said their team would continue with the search. ""Three families have since come forward to confirm missing persons, and while we cannot be certain that the exact number of bodies missing is three, we will continue with our search."""
3,Six-month-old infant ‘abducted’ in Somerset West CBD,9h ago,"Authorities are on high alert after a baby was allegedly abducted in Somerset West on Monday.The alleged incident occurred around lunchtime, but was only reported to Somerset West police around 22:00. According to Sergeant Suzan Jantjies, spokesperson for Somerset West police, the six-month-old baby boy was taken around 13:00. It is believed the infant’s mother, a 22-year-old from Nomzamo, entrusted a fellow community member and mother with the care of her child before leaving for work on Monday morning. However, when she returned home from work, she was informed that the child was taken. Police were apparently informed that the carer, the infant and her nine-year-old child had travelled to Somerset West CBD to attend to Sassa matters. She allegedly stopped by a liquor store in Victoria Street and asked an unknown woman to keep the baby and watch over her child. After purchasing what was needed and exiting the store, she realised the woman and the children were missing. A case of abduction was opened and is being investigated by the police’s Family Violence, Child Protection and Sexual Offences (FCS) unit. Police obtained security footage which shows the alleged abductor getting into a taxi and making off with the children. The older child was apparently dropped off close to her home and safely returned. However, the baby has still not been found. According to a spokesperson, FCS police members prioritised investigations immediately after the case was reported late last night and descended on the local township, where they made contact with the visibly “traumatised” parent and obtained statements until the early hours of Tuesday morning – all in hopes of locating the child and the alleged suspect.Authorities are searching for what is believed to be a foreign national woman with braids, speaking isiZulu.Anyone with information which could aid the investigation and search, is urged to call Captain Trevor Nash of FCS on 082 301 8910."
4,Have you herd: Dubai businessman didn't know Ramaphosa owned Phala Phala buffalo he bought - report,8h ago,"A Dubai businessman who bought buffaloes at Phala Phala farm reportedly claims he did not know the deal was with President Cyril Ramaphosa.Hazim Mustafa also claimed he was expecting to be refunded for the livestock after the animals were not delivered.He reportedly brought the cash into the country via OR Tambo International Airport, and claims he declared it.A Dubai businessman who reportedly bought 20 buffaloes from President Cyril Ramaphosa's Phala Phala farm claims that he didn't know the deal with was with the president, according to a report.Sky News reported that Hazim Mustafa, who reportedly paid $580 000 (R10 million) in cash for the 20 buffaloes from Ramaphosa's farm in December 2019, said initially he didn't know who the animals belonged to.A panel headed by former chief justice Sandile Ngcobo released a report last week after conducting a probe into allegations of a cover-up of a theft at the farm in February 2020.READ | Ramaphosa wins crucial NEC debate as parliamentary vote on Phala Phala report delayed by a weekThe panel found that there was a case for Ramaphosa to answer and that he may have violated the law and involved himself in a conflict between his official duties and his private business.In a statement to the panel, Mustafa was identified as the source of the more than $500 000 (R9 million) that was stolen from the farm. Among the evidence was a receipt for $580 000 that a Phala Phala employee had written to ""Mr Hazim"".According to Sky News, Mustafa said he celebrated Christmas and his wife's birthday in Limpopo in 2019, and that he dealt with a broker when he bought the animals.He reportedly said the animals were to be prepared for export, but they were never delivered due to the Covid-19 lockdown. He understood he would be refunded after the delays.He also reportedly brought the cash into the country through OR Tambo International Airport and said he declared it. Mustafa also told Sky News that the amount was ""nothing for a businessman like [him]"".READ | Here's the Sudanese millionaire - and his Gucci wife - who bought Ramaphosa's buffaloThe businessman is the owner Sudanese football club Al Merrikh SC. He is married to Bianca O'Donoghue, who hails from KwaZulu-Natal. O'Donoghue regularly takes to social media to post snaps of a life of wealth – including several pictures in designer labels and next to a purple Rolls Royce Cullinan, a luxury SUV worth approximately R5.5 million.Sudanese businessman Hazim Mustafa with his South African-born wife, Bianca O'Donoghue.Facebook PHOTO: Bianca O'Donoghue/Facebook News24 previously reported that he also had ties to former Sudanese president, Omar al-Bashir.There have been calls for Ramaphosa to step down following the saga. A motion of no confidence is expected to be submitted in Parliament.He denied any wrongdoing and said the ANC's national executive committee (NEC) would decide his fate.Do you have a tipoff or any information that could help shape this story? Email tips#24.com"
5,Hefty prison sentence for man who killed stranded KZN cop while pretending to offer help,9h ago,"Two men have been sentenced – one for the murder of a KwaZulu-Natal police officer, and the other for an attempt to rob the officer.Sergeant Mzamiseni Mbele was murdered in Weenen in April last year.He was attacked and robbed when his car broke down on the highway while he was on his way home.A man who murdered a KwaZulu-Natal police officer, after pretending that he wanted to help him with his broken-down car, has been jailed.A second man, who was only convicted of an attempt to rob the officer, has also been sentenced to imprisonment.On Friday, the KwaZulu-Natal High Court in Madadeni sentenced Sboniso Linda, 36, to an effective 25 years' imprisonment, and Nkanyiso Mungwe, 25, to five years' imprisonment.READ | Alleged house robber shot after attack at off-duty cop's homeAccording to Hawks spokesperson, Captain Simphiwe Mhlongo, 39-year-old Sergeant Mzamiseni Mbele, who was stationed at the Msinga police station, was on his way home in April last year when his car broke down on the R74 highway in Weenen.Mbele let his wife know that the car had broken down. While stationary on the road, Linda and Mungwe approached him and offered to help.Mhlongo said: All of a sudden, [they] severely assaulted Mbele. They robbed him of his belongings and fled the scene. A farm worker found Mbele's body the next dayA case of murder was reported at the Weenen police station and the Hawks took over the investigation.The men were arrested.""Their bail [application] was successfully opposed and they appeared in court several times until they were found guilty,"" Mhlongo added.How safe is your neighbourhood? Find out by using News24's CrimeCheckLinda was sentenced to 20 years' imprisonment for murder and 10 years' imprisonment for robbery with aggravating circumstances. Half of the robbery sentence has to be served concurrently, leaving Linda with an effective sentence of 25 years.Mungwe was sentenced to five years' imprisonment for attempted robbery with aggravating circumstances."

WebDriverWait Selenium get all anchor href and surfing

I'd like to get all the links on this website: https://www.sciencedirect.com/browse/journals-and-books?accessType=openAccess&accessType=containsOpenAccess
Then I'd like to go to every link and extract the text that appears after clicking the "View Full Aims & Scope" button.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
wait = WebDriverWait(driver, 20)
url = "https://www.sciencedirect.com/browse/journals-and-books?accessType=openAccess&accessType=containsOpenAccess"
driver.get(url)
page_description = wait.until(EC.presence_of_element_located((By.XPATH, "//span[@class='pagination-pages-label u-margin-s-left-from-sm u-margin-s-right-from-sm']")))
index_of = wait.until(EC.presence_of_element_located((By.XPATH, "//span[@class='pagination-pages-label u-margin-s-left-from-sm u-margin-s-right-from-sm']"))).text.index('of')
index_number = index_of + 3
time.sleep(2)  # otherwise sometimes it doesn't work
length = len(page_description.text)
pages = int(page_description.text[index_number:length])
allLi = []
for i in range(pages):
    index = i + 1
    url = "https://www.sciencedirect.com/browse/journals-and-books?page=" + str(index) + "&accessType=containsOpenAccess&accessType=openAccess"
    driver.get(url)
    currentAli = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[@class='anchor js-publication-title anchor-default']")))
    for li in currentAli:
        link = li.get_attribute('href')
        allLi.append(link)
for li in allLi:
    driver.get(li)
    button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@class='button-link button-link-secondary']")))
    button.click()
    descrip = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//span[@class='spaced']")))
    print(descrip)
First, it doesn't work as written. I also have a problem when I try to compute length or pages: sometimes it works, sometimes it doesn't. Is this related to the page loading asynchronously? I have to add time.sleep(2), which I know is not good practice (an explicit-wait alternative is sketched below).
Thanks!
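One way to avoid the fixed sleep is to wait explicitly for the label's text rather than just its presence; a minimal sketch, assuming the wait and driver objects from the question's code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

pagination_xpath = "//span[@class='pagination-pages-label u-margin-s-left-from-sm u-margin-s-right-from-sm']"
# Wait until the label's text actually contains "of" (i.e. the page count
# has rendered), instead of sleeping a fixed 2 seconds.
wait.until(EC.text_to_be_present_in_element((By.XPATH, pagination_xpath), "of"))
page_description = driver.find_element(By.XPATH, pagination_xpath)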
Here is a way of getting the information you're after without the overhead of Selenium (though you might want to add a 2- or 3-second pause between network calls, just to be nice to ScienceDirect's servers):
from bs4 import BeautifulSoup as bs
from tqdm import tqdm  ## if using Jupyter, do: from tqdm.notebook import tqdm
import pandas as pd
import cloudscraper

scraper = cloudscraper.create_scraper(disableCloudflareV1=True)
initial_list = []
journal_stuffs = []
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

for x in tqdm(range(1, 39)):
    r = scraper.get(f'https://www.sciencedirect.com/browse/journals-and-books?page={x}&accessType=containsOpenAccess&accessType=openAccess')
    soup = bs(r.text, 'html.parser')
    links = ['https://www.sciencedirect.com' + y.get('href') for y in soup.select('a[class="anchor js-publication-title anchor-default"]')]
    initial_list.extend(links)
print('there are ', len(set(initial_list)), 'journal links')

for url in tqdm(initial_list[:20]):
    r = scraper.get(url)
    soup = bs(r.text, 'html.parser')
    title = soup.select_one('a[class="anchor js-title-link anchor-default anchor-has-background-color anchor-has-inherit-color"]').get_text(strip=True)
    try:
        more_info = soup.select_one('div[class="slide-out"]').get_text(strip=True)
    except Exception as e:
        more_info = 'Without full aim and scope. A bit pointless, really.'
    journal_stuffs.append((title, more_info))

df = pd.DataFrame(journal_stuffs, columns=['Title', 'Info'])
print(df)
Result in terminal:
100%
38/38 [00:11<00:00, 3.49it/s]
there are 3718 journal links
100%
20/20 [00:12<00:00, 1.46it/s]
Title Info
0 AACE Clinical Case Reports Aims & ScopeAACE Clinical Case Reports is an online journal that publishes case reports with accompanying commentaries six times a year. The primary mission of the journal is to present the most up-to-date information for practicing endocrinologists, fellows in endocrinology and health care professionals dealing with endocrine disorders including diabetes, obesity, osteoporosis, thyroid and other general endocrine disorders.
1 AASRI Procedia Without full aim and scope. A bit pointless, really.
2 Academic Pathology Aims & ScopeAcademic Pathologyis the official open-access journal of theAssociation of Pathology Chairs, established to give voice to innovations in education, practice, and management from academic departments of pathology and laboratory medicine, with the potential for broad impact on medicine, medical research, and the delivery of care.Academic Pathologyaddresses methods for improving patient care (clinical informatics, genomic testing and data management, lab automation, electronic health record integration, and annotate biorepositories); best practices in inter-professional clinical partnerships; innovative pedagogical approaches to medical education and educational program evaluation in pathology; models for training academic pathologists and advancing academic career development; administrative and organizational models supporting the discipline; and leadership development in academic medical centers, health systems, and other relevant venues. Intended authorship and audiences forAcademic Pathologyare international and reach beyond academic pathology itself, including but not limited to healthcare providers, educators, researchers, and policy-makers.Academic Pathologypublishes original research, reviews, brief reports, and educational cases. All articles are rigorously peer-reviewed for relevance and quality.
3 Academic Pediatrics Aims & ScopeAcademic Pediatrics, the official journal of theAcademic Pediatric Association, is a peer-reviewed publication whose purpose is to strengthen the research and educational base of academic generalpediatrics. The journal provides leadership in pediatric education, research, patient care and advocacy. Content areas includepediatric education,emergency medicine,injury,abuse,behavioral pediatrics,holistic medicine,child health servicesandhealth policy,and theenvironment. The journal provides an active forum for the presentation of pediatric educational research in diverse settings, involving medical students, residents, fellows, and practicing professionals. The journal also emphasizes important research relating to the quality of child health care, health care policy, and the organization of child health services. It also includes systematic reviews of primary care interventions and important methodologic papers to aid research in child health and education.Benefits to authorsWe also provide many author benefits, such as free PDFs, a liberal copyright policy, special discounts on Elsevier publications and much more. Please click here for more information on ourauthor services.Please see ourGuide for Authorsfor information on article submission. If you require any further information or help, please visit ourSupport Center
4 Academic Radiology Aims & ScopeAcademic Radiologypublishes original reports of clinical and laboratory investigations indiagnostic imaging, the diagnostic use ofradioactive isotopes,computed tomography,positron emission tomography,magnetic resonance imaging,ultrasound,digital subtraction angiography,image-guided interventionsand related techniques.\nIt also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.
5 ACC Current Journal Review Without full aim and scope. A bit pointless, really.
6 Accident Analysis & Prevention Aims & ScopeAccident Analysis & Preventionprovides wide coverage of the general areas relating toaccidental injuryand damage, including the pre-injury and immediate post-injury phases. Published papers deal with medical, legal, economic, educational, behavioral, theoretical or empirical aspects of transportation accidents, as well as withaccidentsat other sites. Selected topics within the scope of the Journal may include: studies of human, environmental and vehicular factors influencing the occurrence, type and severity of accidents and injury; the design, implementation and evaluation of countermeasures; biomechanics of impact and human tolerance limits to injury; modelling and statistical analysis of accident data; policy, planning and decision-making in safety.Benefits to authorsWe also provide many author benefits, such as free PDFs, a liberal copyright policy, special discounts on Elsevier publications and much more. Please click here for more information on ourauthor services.Please see ourGuide for Authorsfor information on article submission. If you require any further information or help, please visit ourSupport Center
7 Accounting Forum Without full aim and scope. A bit pointless, really.
8 Accounting, Organizations and Society Aims & ScopeAccounting, Organizations & Society is a leading international interdisciplinary journal concerned with the relationships among accounting and human behaviour, organizational and institutional structures and processes, and the wider socio-political environment of the enterprise. It aims to challenge and extend our understanding of the roles of accounting and related emergent and calculative practices in the construction of economic and societal actors, and their modes of economic organizing, including ways in which such practices influence and are influenced by the development of market and other infrastructures.We aim to publish high quality work which draws upon diverse methodologies and theoretical developments from across the social sciences, and which illuminates the development, processes and effects of accounting within its organizational, political, historical and social contexts. AOS particularly wishes to attract innovative work which analyses accounting phenomena to advance theory development in, for example, the psychological, social psychological, organizational, sociological and human sciences.The journal's unique focus covers, but is not limited to, such topics as:•\tThe roles of accounting in organizations and society;•\tThe contribution of accounting practices to the emergence, maintenance and transformation of organizational and societal institutions;•\tThe roles of accounting in the development of new organizational and institutional forms, both public and private;•\tThe relationships between accounting, auditing, accountability, ethics and social justice;•\tBehavioural studies of accounting practices and the providers, verifiers, and users of accounting information, including cognitive aspects of accounting, judgment and decision-making processes, and the behavioural aspects of planning, control and valuation processes;•\tOrganizational process studies of the design, implementation and use of accounting, information and management control systems;•\tAccounting for human actors, and the impact of accounting technologies upon human subjectivities and evaluations;•\tThe roles of accounting in shaping the design, operation and delivery of public service providers, not-for-profit entities, government bodies, as well as local, national and transnational governmental organizations;•\tSocial, organizational, political, and psychological studies of the standard-setting process, and the effects of accounting regulations and rules;•\tThe roles and practices of audit, auditors and accounting firms in the construction and understanding of organizational and societal valuations;•\tAccounting for sustainability and the environment, including studies of environmental and social reporting;•\tHistorical studies of the emergence, transformation and impact of accounting calculations, practices, and representations, including the development and the changing roles of accounting theories, techniques, individual and teams of practitioners and their firms, professional associations, and regulators.Benefits to authorsWe also provide many author benefits, such as free PDFs, a liberal copyright policy, special discounts on Elsevier publications and much more. Please click here for more information on ourauthor services.Please see ourGuide for Authorsfor information on article submission. If you require any further information or help, please visit ourSupport Center
9 Achievements in the Life Sciences Without full aim and scope. A bit pointless, really.
10 ACOG Clinical Review Without full aim and scope. A bit pointless, really.
11 Acta Anaesthesiologica Taiwanica Without full aim and scope. A bit pointless, really.
12 Acta Astronautica Aims & ScopeActa Astronauticais sponsored by theInternational Academy of Astronautics. Content is based on original contributions in all fields of basic,engineering,lifeandsocial space sciencesand ofspace technologyrelated to:The peaceful scientific exploration of space,Its exploitation for human welfare and progress,Conception, design, development and operation of space-borne and Earth-based systems,In addition to regular issues, the journal publishes selected proceedings of the annual International Astronautical Congress (IAC), transactions of the IAA and special issues on topics of current interest, such asmicrogravity,space station technology,geostationary orbits, andspace economics. Other subject areas includesatellite technology,space transportationandcommunications,space energy,power and propulsion,astrodynamics,extraterrestrial intelligenceandEarth observations.For more information on the International Academy of Astronautics (IAA), visit their home page:http://www.iaaweb.org. Members of the IAA are eligible for a discount on a personal subscription toActa Astronautica. Please clickhereto download an order form.
13 Acta Biomaterialia Aims & ScopeActa Biomaterialiais an international journal that publishes peer-reviewed original research reports, review papers and communications in the broadly defined field ofbiomaterials science. The emphasis of the journal is on the relationship betweenbiomaterial structureandfunctionat all length scales.The scope ofActa Biomaterialiaincludes:Hypothesis-driven design of biomaterialsBiomaterial surface science linking structure to biocompatibility, including protein adsorption and cellular interactionsBiomaterial mechanical characterization and modeling at all scalesMolecular, statistical and other types of modeling applied to capture biomaterial behaviorInteractions of biological species with defined surfacesCombinatorial approaches to biomaterial developmentStructural biology as it relates structure to function for biologically derived materials that have application as a medical material, or as it aids in understanding the biological response to biomaterialsMethods for biomaterial characterizationProcessing of biomaterials to achieve specific functionalityMaterials development for arrayed genomic and proteomic screeningBenefits to authorsFree and automatic manuscript deposit service to meet NIH public access requirements at one year;Multiple options for data-sharing (seehttp://www.materialstoday.com/materials-genome-initiative/);Free author pdf and Sharelink share your article with your peers (seehttps://www.elsevier.com/journal-authors/share-link);And more information on our author services can be foundherePlease see ourGuide for Authorsfor information on article submission. If you require any further information or help, please visit ourSupport Center
14 Acta Colombiana de Cuidado Intensivo Aims & ScopeActa Colombiana de Cuidado Intensivois the official publication of the Asociación Colombiana de Medicina Crítica y Cuidado Intensivo (Colombian Association of Critical Medicine and Intensive Care). It is published every three months in March, June, September and December and is intended to be a means of dissemination in all areas associated with the management of the critically ill patient.All the manuscripts received by theActa Colombiana de Cuidado Intensivoare reviewed using a double blind system by experts in the specialty (peer review).The Journal publishes articles on research (Originals), Reviews, Case Reports , and Case Series, as well as Articles on Reflections, and Clinical Comments. Also, it offers the possibility of publishing supplements on specific topics that allows the reader to get into a particular area of knowledge in depth.The development of Intensive Care has encouraged certain areas of specialisation within the specialists dedicated to the care of the critically ill patient. Responding to this need,Acta Colombiana de Cuidado Intensivopays particular attention to certain areas of interests which are made up by experts. The subject matter organisation of the Journal enables it to approach not just technical subjects, but also those related to the logistic organisation of the practice of intensive care.The areas of interest of the Journal are the following:• epidemiology• infection and sepsis• coagulation and inflammation• cardiovascular critical care• mechanical ventilation• bioethics• nutrition and metabolism• quality and costs• neurological intensive care• toxicology• trauma• obstetrics intensive care• sedation and analgesia• paediatrics intensive careLa revistaActa Colombiana de Cuidado Intensivoes el órgano oficial de la Asociación Colombiana de Medicina Crítica y Cuidado Intensivo. Se publica trimestralmente en los meses de marzo, junio, septiembre y diciembre y pretende ser un órgano de divulgación en todas las áreas relacionadas con el manejo del paciente críticamente enfermo.Todos los manuscritos recibidos porActa Colombiana de Cuidado Intensivoson revisados mediante el sistema de doble ciego por expertos de la especialidad.La revista publica artículos de investigación (Originales), de Revisión, Reportes de Casos y Series de Casos, así como Artículos de Reflexión y Comentarios Clínicos. Además, ofrece la posibilidad de publicar suplementos sobre temas específicos que permitan al lector profundizar a fondo en un área particular del conocimiento.El desarrollo del cuidado intensivo ha promovido ciertas áreas de especialización dentro de los especialistas dedicados al cuidado del paciente críticamente enfermo. Respondiendo a esta necesidad,Acta Colombiana de Cuidado Intensivopresta especial atención a determinadas áreas de interés en las que se agrupa a los expertos. La organización temática de la revista permite abordar no solo temas técnicos sino también aquellos relacionados con la organización logística de la práctica del cuidado intensivo.Las áreas de interés de la revista son las siguientes:• epidemiologia• infecciones y sepsis• coagulación e inflamación• cuidado crítico cardiovascular• ventilación mecánica• bioética• nutrición y metabolismo• calidad y costos• cuidado intensivo neurológico• toxicología• trauma• cuidado intensivo obstétrico• sedación y analgesia• cuidado intensivo pediátrico
15 Acta Ecologica Sinica Aims & ScopeActa Ecologica Sinica (International Journal)is a bimonthly academic journal sponsored by the Ecological Society of China and the Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences.Acta Ecologica Sinicapublishes novel research inecology, promotes the exchange and cooperation of ecologists and ecological research between developing and developed countries. The Journal aims to show the scientific mechanism of the interaction between life and environment and facilitates the academic dissemination and scientific development of ecological research in the world, especially in developing countries.Position of the journalActa Ecologica Sinicais a comprehensive journal devoted to the development of Ecology and its sub-disciplines. It unites ecological scientists in the world with the aim of publishing high-quality papers on innovative research. Published papers should unveil mechanisms of the interactions between life and environment, and contribute to the innovation and sustainable development of ecological science in the world.International perspectiveMembers of the editorial board of Acta Ecologica Sinica are all internationally renowned ecologists, and its presentEditor-in-Chiefis the academician of theChinese Academy of Sciences (CAS)and theChinese Academy of Engineering (CAE). In recent years, Acta Ecologica Sinica is receiving an increasing international attention, and its editorial members come from 8 different countries and regions in various areas of ecological research, which strengthens the journal`s impact worldwide. It is anticipated that Acta Ecologica Sinica will further gain more international recognition and have a great prospect of development.Journal coverageThis journal publishes papers on animal ecology, plant ecology, microbial ecology, agro-ecology, forestry ecology, grassland ecology, soil ecology, ocean and aquatic ecosystems, landscape ecology, chemical ecology, contaminant ecology, urban and human ecology. We particularly welcome reviews on recent developments in ecology, novel experimental studies, and short communications, new theories, methodologies, new techniques, book reviews, and research news and laboratory introductions.
16 Acta Histochemica Aims & ScopeActa Histochemicais a classic scientific journal established in 1954 currently focused on basic research and methodological innovations in cell and tissue biology. The aim of the journal is to promote the peer-reviewed publication of original articles and short communications reporting novel results and experimental approaches in the field, as well as comprehensive reviews, letters to the editor and meeting reports, serving as an open forum for the cell and histochemical research community. Manuscripts analysing the mechanisms of functional regulation of living systems at a cell/tissue level, in physiological or pathological conditions, or reporting new techniques and methodological approaches to quantify/visualize cellular activities are particularly welcomed.
17 Acta de Investigación Psicológica Without full aim and scope. A bit pointless, really.
18 Acta Materialia Aims & ScopeActa Materialiaprovides a forum for publishing full-length, original papers and commissioned overviews that advance the in-depth understanding of the relationship between the processing, the structure and the properties of inorganic materials. Papers that have a high impact potential and/or substantially advance the field are sought. The structure encompasses atomic and molecular arrangements, chemical and electronic structures, and microstructure. Emphasis is on either the mechanical or functional behavior of inorganic solids at all length scales down to nanostructures.The following aspects of the science and engineering of inorganic materials are of particular interest:(i) Cutting-edge experiments and theory as they relate to the understanding of the properties,(ii) Elucidation of the mechanisms involved in the synthesis and processing of materials specifically as they relate to the understanding of the properties,and(iii) Characterization of the structure and chemistry of materials specifically as it relates to the understanding of the properties.Acta Materialiawelcomes papers that employ theory and/or simulation (or numerical methods) that substantially advance our understanding of the structure and properties of inorganic materials. Such papers should demonstrate relevance to the materials community by, for example, making a comparison with experimental results (in the literature or in the present study), making testable microstructural or property predictions or elucidating an important phenomenon. Papers that focus primarily on model parameter studies, development of methodology or those employing existing software packages to obtain standard or incremental results are discouraged.Short communications and comments to papers published inActa Materialiamay besubmitted toScripta Materialia.
19 Acta Metallurgica Without full aim and scope. A bit pointless, really.

Spacy.io Wikipedia Entity Linker - Results NLP Model Have no KB Entities

I have been learning how to use the spaCy Entity Linker, following the Wikipedia example here.
I started with a small training size of 2000 articles (it ran for 20 hours), but the resulting model does not recognize or return any KB entities, even for text that was used in training.
nlp_kb.from_disk("/path/to/nel-wikipedia/output_lt_kb80k_model_vsm/nlp")
text = "Anarchism is a political philosophy and movement that rejects all involuntary, coercive forms of hierarchy. It calls for the abolition of the state which it holds to be undesirable, unnecessary and harmful. It is usually described alongside libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement and as having a historical association with anti-capitalism and socialism. The history of anarchism goes back to prehistory, when some humans lived in anarchistic societies long before the establishment of formal states, realms or empires. With the rise of organised hierarchical bodies, skepticism toward authority also rose, but it was not until the 19th century that a self-conscious political movement emerged. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished in most parts of the world and had a significant role in workers' struggles for emancipation. Various anarchist schools of thought formed during this period. Anarchists have taken part in several revolutions, most notably in the Spanish Civil War, whose end marked the end of the classical era of anarchism. In the last decades of the 20th century and into the 21st century, the anarchist movement has been resurgent once more. Anarchism employs various tactics in order to meet its ideal ends; these can be broadly separated into revolutionary and evolutionary tactics."
doc = nlp_kb(text)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
Results
the 19th century DATE
the latter half of the 19th and the first decades of the 20th century DATE
Anarchists NORP
the Spanish Civil War EVENT
the last decades of the 20th century DATE
the 21st century DATE
The NLP model doesn't have an entity linker pipeline.
nlp_kb.meta["pipeline"]
['tagger', 'parser', 'ner']
But the meta.json has it.
{
  "lang": "en",
  "name": "core_web_lg",
  "license": "MIT",
  "author": "Explosion",
  "url": "https://explosion.ai",
  "email": "contact@explosion.ai",
  "description": "English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, POS tags, dependency parses and named entities.",
  "sources": [
    {
      "name": "OntoNotes 5",
      "url": "https://catalog.ldc.upenn.edu/LDC2013T19",
      "license": "commercial (licensed by Explosion)"
    },
    {
      "name": "GloVe Common Crawl",
      "author": "Jeffrey Pennington, Richard Socher, and Christopher D. Manning",
      "url": "https://nlp.stanford.edu/projects/glove/",
      "license": "Public Domain Dedication and License v1.0"
    }
  ],
  "pipeline": [
    "tagger",
    "parser",
    "ner",
    "entity_linker"
  ],
Here are the contents of the NLP directory:
(spacy) ➜ nlp git:(master) ✗ ls
entity_linker meta.json ner parser tagger tokenizer vocab
(spacy) ➜ nlp git:(master) ✗ ls -l entity_linker
total 55040
-rw-r--r-- 1 staff 323 Sep 8 04:40 cfg
-rw-r--r-- 1 staff 25294844 Sep 8 04:40 kb
-rw-r--r-- 1 staff 2875799 Sep 8 04:40 model
I am assuming I am loading the model wrong, but I am not sure how to fix it.
You've used this line:
nlp_kb.from_disk("/path/to/nel-wikipedia/output_lt_kb80k_model_vsm/nlp")
which basically loads trained weights for the existing nlp_kb from disk. However, it doesn't actually change any internals of this nlp_kb object - it also won't automagically add new components.
Instead, what you want to do is
nlp_el = spacy.load("/path/to/nel-wikipedia/output_lt_kb80k_model_vsm/nlp")
and then you should have a new NLP object with the entity_linker component.
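A quick way to verify, reusing the question's path and text (a minimal sketch):
import spacy

# Load the full pipeline, including the entity_linker component, from disk.
nlp_el = spacy.load("/path/to/nel-wikipedia/output_lt_kb80k_model_vsm/nlp")
print(nlp_el.pipe_names)  # should now include 'entity_linker'

doc = nlp_el(text)  # `text` as in the question
for ent in doc.ents:
    # kb_id_ should be populated wherever the linker finds a KB match
    print(ent.text, ent.label_, ent.kb_id_)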

Generating similar named entities/compound nouns

I have been trying to create distractors (false answers) for multiple choice questions. Using word vectors, I was able to get decent results for single-word nouns.
When dealing with compound nouns (such as "car park" or "Donald Trump"), my best attempt was to compute similar words for each part of the compound and combine them. The results are very entertaining:
Car park -> vehicle campground | automobile zoo
Fire engine -> flame horsepower | fired motor
Donald Trump -> Richard Jeopardy | Jeffrey Gamble
Barrack Obama -> Obamas McCain | Auschwitz Clinton
Unfortunately, these are not very convincing. Especially in the case of named entities, I want to produce other named entities that appear in similar contexts, e.g.:
Fire engine -> Fire truck | Fireman
Donald Trump -> Barrack Obama | Hillary Clinton
Niagara Falls -> American Falls | Horseshoe Falls
Does anyone have any suggestions for how this could be achieved? Is there a way to generate similar named entities/noun chunks?
I managed to get some good distractors by searching for the named entities on Wikipedia, then extracting entities which are similar from the summary. Though I'd prefer to find a solution using just spacy.
If you haven't seen it yet, you might want to check out sense2vec, which allows learning context-sensitive vectors by including the part-of-speech tags or entity labels. Quick usage example of the spaCy extension:
import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load('en_core_web_sm')  # any English model; the original snippet assumed `nlp` already existed
s2v = Sense2VecComponent('/path/to/reddit_vectors-1.1.0')
nlp.add_pipe(s2v)

doc = nlp(u"A sentence about natural language processing.")
most_similar = doc[3]._.s2v_most_similar(3)
# [(('natural language processing', 'NOUN'), 1.0),
#  (('machine learning', 'NOUN'), 0.8986966609954834),
#  (('computer vision', 'NOUN'), 0.8636297583580017)]
See here for the interactive demo using a sense2vec model trained on Reddit comments. Using this model, "car park" returns things like "parking lot" and "parking garage", and "Donald Trump" gives you "Sarah Palin", "Mitt Romney" and "Barack Obama". For ambiguous entities, you can also include the entity label – for example, "Niagara Falls|GPE" will show similar terms to the geopolitical entity (GPE), i.e. the city as opposed to the actual waterfalls. The results obviously depend on what was present in the data, so for even more specific similarities, you could also experiment with training your own sense2vec vectors.
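With the standalone sense2vec package, the same kind of entity-labelled query looks roughly like this (a sketch; the vectors path is a placeholder, and key spelling depends on the model you download):
from sense2vec import Sense2Vec

# Load standalone sense2vec vectors from disk (path is a placeholder).
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_vectors")
# Include the entity label to disambiguate, e.g. the city vs. the waterfalls.
query = "Niagara_Falls|GPE"
if query in s2v:
    print(s2v.most_similar(query, n=3))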

Understanding Themes in Google BigQuery GDELT GKG 2.0

I'm using Google BigQuery to analyze the GDELT GKG 2.0 dataset, and I would like to better understand how to query based on themes (or V2Themes). The docs mention a 'Category List' spreadsheet, but so far I've been unsuccessful in finding that list.
The following awesome blog mentions that you can use the World Bank Taxonomy, among others, to narrow down your search. My objective is to find all items that mention "droughts / too little water", all items that mention "floods / too much water", and all items that mention "poor quality / too dirty water" that have a geographical match on a sub-country level.
So far I've been able to get a list of distinct themes, but it is not exhaustive, and I don't understand its hierarchy/structure.
SELECT
  DISTINCT theme
FROM (
  SELECT
    GKGRECORDID,
    locations,
    REGEXP_EXTRACT(themes, r'(^.[^,]+)') AS theme,
    CAST(REGEXP_EXTRACT(locations, r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){2}([^#]*)') AS location_countrycode,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){3}([^#]*)') AS location_adm1code,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){4}([^#]*)') AS location_adm2code,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){5}([^#]*)') AS location_latitude,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){6}([^#]*)') AS location_longitude,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){7}([^#]*)') AS location_featureid,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){8}([^#]*)') AS location_characteroffset,
    DocumentIdentifier
  FROM
    `gdelt-bq.gdeltv2.gkg_partitioned`,
    UNNEST(SPLIT(V2Locations, ';')) AS locations,
    UNNEST(SPLIT(V2Themes, ';')) AS themes
  WHERE
    _PARTITIONTIME >= "2018-08-20 00:00:00"
    AND _PARTITIONTIME < "2018-08-21 00:00:00" )
WHERE
  (location_type = 5
   OR location_type = 4
   OR location_type = 2) -- WorldState, WorldCity or US State
ORDER BY
  theme
And here is a list of water-related themes I've been able to find so far (a sample, not exhaustive; a query sketch using them follows the list):
CRISISLEX_C06_WATER_SANITATION
ENV_WATERWAYS
HUMAN_RIGHTS_ABUSES_WATERBOARD
HUMAN_RIGHTS_ABUSES_WATERBOARDED
HUMAN_RIGHTS_ABUSES_WATERBOARDING
NATURAL_DISASTER_FLOODWATER
NATURAL_DISASTER_FLOODWATERS
NATURAL_DISASTER_FLOOD_WATER
NATURAL_DISASTER_FLOOD_WATERS
NATURAL_DISASTER_HIGH_WATER
NATURAL_DISASTER_HIGH_WATERS
NATURAL_DISASTER_WATER_LEVEL
TAX_AIDGROUPS_WATERAID
TAX_DISEASE_WATERBORNE_DISEASE
TAX_DISEASE_WATERBORNE_DISEASES
TAX_FNCACT_WATERBOY
TAX_FNCACT_WATERMAN
TAX_FNCACT_WATERMEN
TAX_FNCACT_WATER_BOY
TAX_WEAPONS_WATER_CANNON
TAX_WEAPONS_WATER_CANNONS
TAX_WORLDBIRDS_WATERFOWL
TAX_WORLDMAMMALS_WATER_BUFFALO
UNGP_CLEAN_WATER_SANITATION
WATER_SECURITY
WB_1000_WATER_MANAGEMENT_STRUCTURES
WB_1021_WATER_LAW
WB_1063_WATER_ALLOCATION_AND_WATER_SUPPLY
WB_1064_WATER_DEMAND_MANAGEMENT
WB_1199_WATER_SUPPLY_AND_SANITATION
WB_1215_WATER_QUALITY_STANDARDS
WB_137_WATER
WB_138_WATER_SUPPLY
WB_139_SANITATION_AND_WASTEWATER
WB_140_AGRICULTURAL_WATER_MANAGEMENT
WB_141_WATER_RESOURCES_MANAGEMENT
WB_143_RURAL_WATER
WB_144_URBAN_WATER
WB_1462_WATER_SANITATION_AND_HYGIENE
WB_149_WASTEWATER_TREATMENT_AND_DISPOSAL
WB_150_WASTEWATER_REUSE
WB_155_WATERSHED_MANAGEMENT
WB_156_GROUNDWATER_MANAGEMENT
WB_159_TRANSBOUNDARY_WATER
WB_1729_URBAN_WATER_FINANCIAL_SUSTAINABILITY
WB_1731_NON_REVENUE_WATER
WB_1778_FRESHWATER_ECOSYSTEMS
WB_1790_INTERNATIONAL_WATERWAYS
WB_1798_WATER_POLLUTION
WB_1805_WATERWAYS
WB_1998_WATER_ECONOMICS
WB_2008_WATER_TREATMENT
WB_2009_WATER_QUALITY_MONITORING
WB_2971_WATER_PRICING
WB_2981_DRINKING_WATER_QUALITY_STANDARDS
WB_2992_FRESHWATER_FISHERIES
WB_427_WATER_ALLOCATION_AND_WATER_ECONOMICS
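As an illustration of the intended filtering, here is a hedged sketch using the BigQuery Python client; the LIKE patterns and selected columns are illustrative, reusing the aliases from the query above:
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
# Filter to flood-related themes with a sub-country geographic match,
# reusing the theme/location extraction from the query above.
sql = """
SELECT DISTINCT GKGRECORDID, theme, location_fullname, DocumentIdentifier
FROM (
  SELECT
    GKGRECORDID,
    REGEXP_EXTRACT(themes, r'(^.[^,]+)') AS theme,
    CAST(REGEXP_EXTRACT(locations, r'^(?:[^#]*#){0}([^#]*)') AS NUMERIC) AS location_type,
    REGEXP_EXTRACT(locations, r'^(?:[^#]*#){1}([^#]*)') AS location_fullname,
    DocumentIdentifier
  FROM `gdelt-bq.gdeltv2.gkg_partitioned`,
    UNNEST(SPLIT(V2Locations, ';')) AS locations,
    UNNEST(SPLIT(V2Themes, ';')) AS themes
  WHERE _PARTITIONTIME >= "2018-08-20 00:00:00"
    AND _PARTITIONTIME < "2018-08-21 00:00:00")
WHERE location_type IN (2, 4, 5)  -- WorldState, WorldCity or US State
  AND (theme LIKE 'NATURAL_DISASTER_FLOOD%' OR theme LIKE '%HIGH_WATER%')
"""
for row in client.query(sql).result():
    print(row.theme, row.location_fullname, row.DocumentIdentifier)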
While this link is provided as a theme listing:
http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_CategoryList.xlsx
...it is far from complete (perhaps it is just the original theme list?). I pulled just a single day's worth of GKG, and there are tons of themes not among the 283 themes in that spreadsheet.
GKG documentation located at https://blog.gdeltproject.org/world-bank-group-topical-taxonomy-now-in-gkg/ points to a World Bank Taxonomy located at http://pubdocs.worldbank.org/en/275841490966525495/Theme-Taxonomy-and-definitions.pdf. The GKG post implies this World Bank taxonomy has been rolled into the GKG theme list.
This is presented as a complete listing of World Bank Taxonomy themes. Unfortunately, I've found numerous World Bank themes in GKG that aren't in this publication. The union of these two lists represents a portion of GKG themes, but it definitely isn't all of them.
If anyone needs this, I have added a list of all themes in GKG v1 for the period 1/1/2017 to 31/12/2020 that appear in at least 10 articles on a given day: Themes.parquet
It consists of 17,639 unique themes with the count per day.
The complete numbers for that four-year dataset are 36,713,385 unique actors, 50,845 unique themes, and 26,389,528 unique organizations. These numbers are not de-duplicated across different spellings of the same entity, so "Donald Trump" and "Donald J. Trump" count as two separate actors.
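Loading the file is straightforward with pandas (a minimal sketch; the column name is an assumption based on the description above):
import pandas as pd

themes = pd.read_parquet("Themes.parquet")
# e.g. filter to water-related themes (assumes a 'theme' column)
water = themes[themes["theme"].str.contains("WATER", na=False)]
print(water.head())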
The best GDELT GKG Themes list I could find is here, as described in this blog post.
I put it into a CSV file, which I find slightly easier to work with, and put that file here.