How to extract xml tags with BeautifulSoup? - beautifulsoup

I am trying to extract the tags from this data:
[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},
But I cannot seem to get the tags; I am trying:
# Import BeautifulSoup
from bs4 import BeautifulSoup as bs

content = []
# Read the XML file
with open("file.xml", "r") as file:
    # Read each line in the file
    content = file.readlines()
    # Combine the lines in the list into a string
    content = "".join(content)

bs_content = bs(content, "lxml")
result = bs_content.find_all("title")
print(result)
But I only get an empty []
Appreciate any help!

It is not XML, it's a JSON-like structure, so simply iterate the list of dicts:
l = [{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},]
for d in l:
    print(d['title'])
Or, while you have it as a string, just convert it first via json.loads():
import json
l = '[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"}]'
for d in json.loads(l):
    print(d['title'])
Output:
Joshua Cohen
Louise Erdrich
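Since the original question reads the data from file.xml, here is a minimal sketch that skips BeautifulSoup entirely, assuming the file actually contains valid JSON (a complete array, not the truncated excerpt shown above):
import json

# Parse the file directly as JSON - no XML/HTML parsing involved
with open("file.xml", "r") as f:
    records = json.load(f)

for d in records:
    print(d["title"])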


Scraping contents of news articles

I was able to scrape the title, date, links, and content of news articles on these links: https://www.news24.com/news24/southafrica/crime-and-courts and https://www.news24.com/news24/southafrica/education. The output is saved in an Excel file. However, I noticed that not all of the content inside the articles was scraped. I have tried different methods in the "Getting Content Section" of my code. Any help with this will be appreciated. Below is my code:
import sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from datetime import timedelta

art_title = []  # to store the titles of all news articles
art_date = []   # to store the dates of all news articles
art_link = []   # to store the links of all news articles

pagesToGet = ['southafrica/crime-and-courts', 'southafrica/education']

for i in range(0, len(pagesToGet)):
    print('processing page : \n')
    url = 'https://www.news24.com/news24/' + str(pagesToGet[i])
    print(url)
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.maximize_window()
    try:
        driver.get("https://www.news24.com/news24/" + str(pagesToGet[i]))
    except Exception as e:
        error_type, error_obj, error_info = sys.exc_info()
        print('ERROR FOR LINK:', url)
        print(error_type, 'Line:', error_info.tb_lineno)
        continue
    time.sleep(3)
    scroll_pause_time = 1
    screen_height = driver.execute_script("return window.screen.height;")
    i = 1
    while True:
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
        i += 1
        time.sleep(scroll_pause_time)
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        if (screen_height) * i > scroll_height:
            break
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    news = soup.find_all('article', attrs={'class': 'article-item'})
    print(len(news))
    # Getting titles, dates, and links
    for j in news:
        titles = j.select_one('.article-item__title span')
        title = titles.text.strip()
        dates = j.find('p', attrs={'class': 'article-item__date'})
        date = dates.text.strip()
        address = j.find('a').get('href')
        news_link = 'https://www.news24.com' + address
        art_title.append(title)
        art_date.append(date)
        art_link.append(news_link)

df = pd.DataFrame({'Article_Title': art_title, 'Date': art_date, 'Source': art_link})

# Getting Content Section
news_articles = []  # to store the content of each news article
news_count = 0
for link in df['Source']:
    print('\n')
    start_time = time.monotonic()
    print('Article No. ', news_count)
    print('Link: ', link)
    # Countermeasure for broken links
    try:
        if requests.get(link):
            news_response = requests.get(link)
        else:
            print("")
    except requests.exceptions.ConnectionError:
        news_response = 'N/A'
    # Auto sleep trigger after saving every 300 articles
    sleep_time = ['100', '200', '300', '400', '500']
    if news_count in sleep_time:
        time.sleep(12)
    else:
        ""
    try:
        if news_response.text:
            news_data = news_response.text
        else:
            print('')
    except AttributeError:
        news_data = 'N/A'
    news_soup = BeautifulSoup(news_data, 'html.parser')
    try:
        if news_soup.find('div', {'class': 'article__body'}):
            art_cont = news_soup.find('div', 'article__body')
            art = []
            article_text = [i.text.strip().replace("\xa0", " ") for i in art_cont.findAll('p')]
            art.append(article_text)
        else:
            print('')
    except AttributeError:
        article = 'N/A'
    print('\n')
    news_count += 1
    news_articles.append(art)
    end_time = time.monotonic()
    print(timedelta(seconds=end_time - start_time))
    print('\n')

# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset="Source", keep=False, inplace=True)
# Don't store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_excel('SA_news24_3.xlsx')
driver.quit()
I tried the following code in the Getting Content Section as well. However, it produced the same output.
article_text = [i.get_text(strip=True).replace("\xa0", " ") for i in art_cont.findAll('p')]
The site has various types of URLs, so your code was omitting some articles, either because the URL looked malformed or because the article requires a subscription to read. For the subscription-only ones I have added "Login to read" followed by the article link. I ran this code up to article number 670 and it didn't raise any errors. I had to change the output from .xlsx to .csv since openpyxl was giving an error on Python 3.11.0.
Full Code
import time
import sys
from datetime import timedelta
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json

art_title = []  # to store the titles of all news articles
art_date = []   # to store the dates of all news articles
art_link = []   # to store the links of all news articles

pagesToGet = ['southafrica/crime-and-courts',
              'southafrica/education', 'places/gauteng']

for i in range(0, len(pagesToGet)):
    print('processing page : \n')
    if "places" in pagesToGet[i]:
        url = f"https://news24.com/api/article/loadmore/tag?tagType=places&tag={pagesToGet[i].split('/')[1]}&pagenumber=1&pagesize=100&ishomepage=false&ismobile=false"
    else:
        url = f"https://news24.com/api/article/loadmore/news24/{pagesToGet[i]}?pagenumber=1&pagesize=1200&ishomepage=false&ismobile=false"
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.json()["htmlContent"], 'html.parser')
    news = soup.find_all('article', attrs={'class': 'article-item'})
    print(len(news))
    # Getting titles, dates, and links
    for j in news:
        titles = j.select_one('.article-item__title span')
        title = titles.text.strip()
        dates = j.find('p', attrs={'class': 'article-item__date'})
        date = dates.text.strip()
        address = j.find('a').get('href')
        # Countermeasure for links with full url
        if "https://" in address:
            news_link = address
        else:
            news_link = 'https://www.news24.com' + address
        art_title.append(title)
        art_date.append(date)
        art_link.append(news_link)

df = pd.DataFrame({'Article_Title': art_title,
                   'Date': art_date, 'Source': art_link})

# Getting Content Section
news_articles = []  # to store the content of each news article
news_count = 0
for link in df['Source']:
    start_time = time.monotonic()
    print('Article No. ', news_count)
    print('Link: ', link)
    news_response = requests.get(link)
    news_data = news_response.content
    news_soup = BeautifulSoup(news_data, 'html.parser')
    art_cont = news_soup.find('div', 'article__body')
    # Countermeasure for links with subscribe form
    try:
        try:
            article = art_cont.text.split("Newsletter")[0] + art_cont.text.split("Sign up")[1]
        except:
            article = art_cont.text
        article = " ".join(article.strip().split())
    except:
        article = f"Login to read {link}"
    news_count += 1
    news_articles.append(article)
    end_time = time.monotonic()
    print(timedelta(seconds=end_time - start_time))
    print('\n')

# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset="Source", keep=False, inplace=True)
# Don't store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_csv('SA_news24_3.csv')
Output
,Article_Title,Date,News
0,Pastor gets double life sentence plus two 15-year terms for rape and murder of women,2h ago,"A pastor has been sentenced to two life sentences and two 15-year jail terms for rape and murder His modus operandi was to take the women to secluded areas before raping them and tying them to trees.One woman managed to escape after she was raped but the bodies of two others were found tied to the trees.The North West High Court has sentenced a 50-year-old pastor to two life terms behind bars and two 15-year jail terms for rape and murder.Lucas Chauke, 50, was sentenced on Monday for the crimes, which were committed in 2017 and 2018 in Temba in the North West.According to North West National Prosecuting Authority spokesperson Henry Mamothame, Chauke's first victim was a 53-year-old woman.He said Chauke pretended that he would assist the woman with her spirituality and took her to a secluded place near to a dam.""Upon arrival, he repeatedly raped her and subsequently tied her to a tree before fleeing the scene,"" Mamothame said. The woman managed to untie herself and ran to seek help. She reported the incident to the police, who then started searching for Chauke.READ | Kidnappings doubled nationally: over 4 000 cases reported to police from July to SeptemberOn 10 May the following year, Chauke pounced on his second victim - a 55-year-old woman.He took her to the same secluded area next to the dam, raped her and tied her to a tree before fleeing. This time his victim was unable to free herself.Her decomposed body was later found, still tied to the tree, Mamothame said. His third victim was targeted months later, on 3 August, in the same area. According to Mamothame, Chauke attempted to rape her but failed.""He then tied her to a tree and left her to die,"" he said. Chauke was charged in connection with her murder.He was linked to the crimes via DNA.READ | 'These are not pets': Man gives away his two pit bulls after news of child mauled to deathIn aggravation of his sentence, State advocate Benny Kalakgosi urged the court not to deviate from the prescribed minimum sentences, saying that the offences Chauke had committed were serious.""He further argued that Chauke took advantage of unsuspecting women, who trusted him as a pastor but instead, [he] took advantage of their vulnerability,"" Mamothame said. Judge Frances Snyman agreed with the State and described Chauke's actions as horrific.The judge also alluded to the position of trust that he abused, Mamothame said.Chauke was sentenced to life for the rape of the first victim, 15 years for the rape of the second victim and life for her murder, as well as 15 years for her murder. He was also declared unfit to possess a firearm."
1,'I am innocent': Alleged July unrest instigator Ngizwe Mchunu pleads not guilty,4h ago,"Former Ukhozi FM DJ Ngizwe Mchunu has denied inciting the July 2022 unrest.Mchunu pleaded not guilty to charges that stem from the incitement allegations.He also claimed he had permission to travel to Gauteng for work during the Covid-19 lockdown.""They are lying. I know nothing about those charges,"" alleged July unrest instigator, former Ukhozi FM DJ Ngizwe Brian Mchunu, told the Randburg Magistrate's Court when his trial started on Tuesday.Mchunu pleaded not guilty to the charges against him, which stems from allegations that he incited public violence, leading to the destruction of property, and convened a gathering in contravening of Covid-19 lockdown regulations after the detention of former president Jacob Zuma in July last year.In his plea statement, Mchunu said all charges were explained to him.""I am a radio and television personality. I'm also a poet and cultural activist. In 2020, I established my online radio.READ | July unrest instigators could face terrorism-related charges""On 11 July 2021, I sent invitations to journalists to discuss the then-current affairs. At the time, it was during the arrest of Zuma.""I held a media briefing at a hotel in Bryanston to show concerns over Zuma's arrest. Zuma is my neighbour [in Nkandla]. In my African culture, I regard him as my father.""Mchunu continued that he was not unhappy about Zuma's arrest but added: ""I didn't condone any violence. I pleaded with fellow Africans to stop destroying infrastructure. I didn't incite any violence.I said to them, 'My brothers and sisters, I'm begging you as we are destroying our country.'""He added:They are lying. I know nothing about those charges. I am innocent. He also claimed that he had permission to travel to Gauteng for work during the lockdown.The hearing continues."
2,Jukskei River baptism drownings: Pastor of informal 'church' goes to ground,5h ago,"A pastor of the church where congregants drowned during a baptism ceremony has gone to ground.Johannesburg Emergency Medical Services said his identity was not known.So far, 14 bodies have been retrieved from the river.According to Johannesburg Emergency Medical Services (EMS), 13 of the 14 bodies retrieved from the Jukskei River have been positively identified.The bodies are of congregants who were swept away during a baptism ceremony on Saturday evening in Alexandra.The search for the other missing bodies continues.Reports are that the pastor of the church survived the flash flood after congregants rescued him.READ | Jukskei River baptism: Families gather at mortuary to identify loved onesEMS spokesperson Robert Mulaudzi said they had been in contact with the pastor since the day of the tragedy, but that they had since lost contact with him.It is alleged that the pastor was not running a formal church, but rather used the Jukskei River as a place to perform rituals for people who came to him for consultations. At this stage, his identity is not known, and because his was not a formal church, Mulaudzi could not confirm the number of people who could have been attending the ceremony.Speaking to the media outside the Sandton fire station on Tuesday morning, a member of the rescue team, Xolile Khumalo, said: Thirteen out of the 14 bodies retrieved have been identified, and the one has not been identified yet.She said their team would continue with the search. ""Three families have since come forward to confirm missing persons, and while we cannot be certain that the exact number of bodies missing is three, we will continue with our search."""
3,Six-month-old infant ‘abducted’ in Somerset West CBD,9h ago,"Authorities are on high alert after a baby was allegedly abducted in Somerset West on Monday.The alleged incident occurred around lunchtime, but was only reported to Somerset West police around 22:00. According to Sergeant Suzan Jantjies, spokesperson for Somerset West police, the six-month-old baby boy was taken around 13:00. It is believed the infant’s mother, a 22-year-old from Nomzamo, entrusted a fellow community member and mother with the care of her child before leaving for work on Monday morning. However, when she returned home from work, she was informed that the child was taken. Police were apparently informed that the carer, the infant and her nine-year-old child had travelled to Somerset West CBD to attend to Sassa matters. She allegedly stopped by a liquor store in Victoria Street and asked an unknown woman to keep the baby and watch over her child. After purchasing what was needed and exiting the store, she realised the woman and the children were missing. A case of abduction was opened and is being investigated by the police’s Family Violence, Child Protection and Sexual Offences (FCS) unit. Police obtained security footage which shows the alleged abductor getting into a taxi and making off with the children. The older child was apparently dropped off close to her home and safely returned. However, the baby has still not been found. According to a spokesperson, FCS police members prioritised investigations immediately after the case was reported late last night and descended on the local township, where they made contact with the visibly “traumatised” parent and obtained statements until the early hours of Tuesday morning – all in hopes of locating the child and the alleged suspect.Authorities are searching for what is believed to be a foreign national woman with braids, speaking isiZulu.Anyone with information which could aid the investigation and search, is urged to call Captain Trevor Nash of FCS on 082 301 8910."
4,Have you herd: Dubai businessman didn't know Ramaphosa owned Phala Phala buffalo he bought - report,8h ago,"A Dubai businessman who bought buffaloes at Phala Phala farm reportedly claims he did not know the deal was with President Cyril Ramaphosa.Hazim Mustafa also claimed he was expecting to be refunded for the livestock after the animals were not delivered.He reportedly brought the cash into the country via OR Tambo International Airport, and claims he declared it.A Dubai businessman who reportedly bought 20 buffaloes from President Cyril Ramaphosa's Phala Phala farm claims that he didn't know the deal with was with the president, according to a report.Sky News reported that Hazim Mustafa, who reportedly paid $580 000 (R10 million) in cash for the 20 buffaloes from Ramaphosa's farm in December 2019, said initially he didn't know who the animals belonged to.A panel headed by former chief justice Sandile Ngcobo released a report last week after conducting a probe into allegations of a cover-up of a theft at the farm in February 2020.READ | Ramaphosa wins crucial NEC debate as parliamentary vote on Phala Phala report delayed by a weekThe panel found that there was a case for Ramaphosa to answer and that he may have violated the law and involved himself in a conflict between his official duties and his private business.In a statement to the panel, Mustafa was identified as the source of the more than $500 000 (R9 million) that was stolen from the farm. Among the evidence was a receipt for $580 000 that a Phala Phala employee had written to ""Mr Hazim"".According to Sky News, Mustafa said he celebrated Christmas and his wife's birthday in Limpopo in 2019, and that he dealt with a broker when he bought the animals.He reportedly said the animals were to be prepared for export, but they were never delivered due to the Covid-19 lockdown. He understood he would be refunded after the delays.He also reportedly brought the cash into the country through OR Tambo International Airport and said he declared it. Mustafa also told Sky News that the amount was ""nothing for a businessman like [him]"".READ | Here's the Sudanese millionaire - and his Gucci wife - who bought Ramaphosa's buffaloThe businessman is the owner Sudanese football club Al Merrikh SC. He is married to Bianca O'Donoghue, who hails from KwaZulu-Natal. O'Donoghue regularly takes to social media to post snaps of a life of wealth – including several pictures in designer labels and next to a purple Rolls Royce Cullinan, a luxury SUV worth approximately R5.5 million.Sudanese businessman Hazim Mustafa with his South African-born wife, Bianca O'Donoghue.Facebook PHOTO: Bianca O'Donoghue/Facebook News24 previously reported that he also had ties to former Sudanese president, Omar al-Bashir.There have been calls for Ramaphosa to step down following the saga. A motion of no confidence is expected to be submitted in Parliament.He denied any wrongdoing and said the ANC's national executive committee (NEC) would decide his fate.Do you have a tipoff or any information that could help shape this story? Email tips#24.com"
5,Hefty prison sentence for man who killed stranded KZN cop while pretending to offer help,9h ago,"Two men have been sentenced – one for the murder of a KwaZulu-Natal police officer, and the other for an attempt to rob the officer.Sergeant Mzamiseni Mbele was murdered in Weenen in April last year.He was attacked and robbed when his car broke down on the highway while he was on his way home.A man who murdered a KwaZulu-Natal police officer, after pretending that he wanted to help him with his broken-down car, has been jailed.A second man, who was only convicted of an attempt to rob the officer, has also been sentenced to imprisonment.On Friday, the KwaZulu-Natal High Court in Madadeni sentenced Sboniso Linda, 36, to an effective 25 years' imprisonment, and Nkanyiso Mungwe, 25, to five years' imprisonment.READ | Alleged house robber shot after attack at off-duty cop's homeAccording to Hawks spokesperson, Captain Simphiwe Mhlongo, 39-year-old Sergeant Mzamiseni Mbele, who was stationed at the Msinga police station, was on his way home in April last year when his car broke down on the R74 highway in Weenen.Mbele let his wife know that the car had broken down. While stationary on the road, Linda and Mungwe approached him and offered to help.Mhlongo said: All of a sudden, [they] severely assaulted Mbele. They robbed him of his belongings and fled the scene. A farm worker found Mbele's body the next dayA case of murder was reported at the Weenen police station and the Hawks took over the investigation.The men were arrested.""Their bail [application] was successfully opposed and they appeared in court several times until they were found guilty,"" Mhlongo added.How safe is your neighbourhood? Find out by using News24's CrimeCheckLinda was sentenced to 20 years' imprisonment for murder and 10 years' imprisonment for robbery with aggravating circumstances. Half of the robbery sentence has to be served concurrently, leaving Linda with an effective sentence of 25 years.Mungwe was sentenced to five years' imprisonment for attempted robbery with aggravating circumstances."
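The loadmore endpoint above also accepts a pagenumber parameter, so if the server ever caps pagesize you can page through the listing instead of requesting everything at once. A hypothetical sketch (the empty-page stop condition is an assumption, not documented behaviour):
import requests
from bs4 import BeautifulSoup

articles = []
page = 1
while True:
    url = (f"https://news24.com/api/article/loadmore/news24/southafrica/education"
           f"?pagenumber={page}&pagesize=100&ishomepage=false&ismobile=false")
    html = requests.get(url).json()["htmlContent"]
    batch = BeautifulSoup(html, "html.parser").find_all("article", attrs={"class": "article-item"})
    if not batch:  # assumed: an empty batch signals the end of the listing
        break
    articles.extend(batch)
    page += 1
print(f"collected {len(articles)} article teasers")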

How to get all the substates from a country in OSMnx?

What would be the code to easily get all the states (second subdivisions) of a country?
The pattern from OSMnx is, more or less:

division       admin_level
country        2
region         3
state          4
city           8
neighborhood   10
For example, to get all the neighborhoods of a city:
import pandas as pd
import geopandas as gpd
import osmnx as ox
place = 'Rio de Janeiro'
tags = {'admin_level': '10'}
gdf = ox.geometries_from_place(place, tags)
Wouldn't the same approach apply if one wants the states of a country?
place = 'Brasil'
tags = {'admin_level': '4'}
gdf = ox.geometries_from_place(place, tags)
I'm not even sure whether this snippet works, because I let it run for 4 hours and it never finished. Maybe the package isn't made for downloading big chunks of data, or there's a more efficient solution than ox.geometries_from_place() for this task, or there's more information I could add to the tags. Help is appreciated.
OSMnx can potentially get all the states or provinces of a country, but this isn't a use case it's optimized for, and your specific query creates a few obstacles (you can see it reproduced on Overpass Turbo):
You're using the default query area size, so it's making thousands of requests
Brazil's bounding box intersects portions of overseas French territory, which in turn pulls in all of France (spanning the entire globe)
OSMnx uses an r-tree to filter the final results, but globe-spanning results make this index perform very slowly
OSMnx can acquire geometries either via the geometries module (as you're doing) or via the geocode_to_gdf function in the geocoder module. You may want to try the latter if it fits your use case, as it's far more efficient.
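For instance, if you already know the names of the subdivisions you want, a minimal sketch of the geocoder route (the state names here are purely illustrative):
import osmnx as ox

# Geocode each state by name straight to a GeoDataFrame of boundary polygons
states = ["Bahia, Brazil", "São Paulo, Brazil", "Paraná, Brazil"]
gdf = ox.geocode_to_gdf(states)
gdf.plot()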
With that in mind, if you must use the geometries module, you can try a few things to improve performance. First, adjust the max query area size so you're downloading everything with a single API request. You're downloading relatively few entities, so the huge query area should still be OK within the timeout interval. The "intersecting overseas France" and "globe-spanning r-tree" problems are harder to solve. But as a demonstration, here's a simple example with Uruguay instead. It takes twenty-odd seconds to run everything on my machine:
import osmnx as ox
ox.settings.log_console = True
ox.settings.max_query_area_size = 25e12
place = 'Uruguay'
tags = {'admin_level': '4'}
gdf = ox.geometries_from_place(place, tags)
gdf = gdf[gdf["is_in:country"] == place]
gdf.plot()

Using BS4 and requests to scrape links of a specific class?

I was trying to use requests and BeautifulSoup4 to scrape the top page of r/askreddit, but when I tried to pull links using the class of that link, I would sometimes receive an empty list. Using this code:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.reddit.com/r/AskReddit/top/?t=day'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
links = []
for link in soup.find_all('a', 'SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE'):
    print(link.get('href'))
    links.append(link.get('href'))
print(links)
Sometimes the code would return a printed version of each link as well as a list of the links as intended:
/r/AskReddit/comments/yaqpzk/what_is_a_cult_that_pretends_its_not_cult/
/r/AskReddit/comments/yaugmy/whats_a_name_you_would_never_give_to_your_child/
/r/AskReddit/comments/yavldx/what_is_the_single_greatest_animated_series_of/
/r/AskReddit/comments/yb64tg/what_have_you_survived_that_wouldve_killed_you/
/r/AskReddit/comments/yat0xj/what_is_your_favorite_movie_that_most_people_have/
/r/AskReddit/comments/yasntt/what_is_the_craziest_cult_of_all_time/
/r/AskReddit/comments/yas0s7/54_of_americans_between_the_ages_of_16_and_74/
['/r/AskReddit/comments/yaqpzk/what_is_a_cult_that_pretends_its_not_cult/', '/r/AskReddit/comments/yaugmy/whats_a_name_you_would_never_give_to_your_child/', '/r/AskReddit/comments/yavldx/what_is_the_single_greatest_animated_series_of/', '/r/AskReddit/comments/yb64tg/what_have_you_survived_that_wouldve_killed_you/', '/r/AskReddit/comments/yat0xj/what_is_your_favorite_movie_that_most_people_have/', '/r/AskReddit/comments/yasntt/what_is_the_craziest_cult_of_all_time/', '/r/AskReddit/comments/yas0s7/54_of_americans_between_the_ages_of_16_and_74/']
>>>
but most of the time I would simply receive:
[]
>>>
I am confused as to why the same code would be providing two different outputs, and I don't understand why I only sometimes receive the data I actually want to scrape. I have looked at some of the other posts about these libraries on this site, but I haven't found anything that looks like the problem I am having. I have looked over the BS4 documentation, albeit a bit ineffectively because I am a beginner, but I am still unsure of where the program is going wrong.
I recommend parsing old.reddit.com (note the old. at the beginning of the URL; it serves much simpler HTML) or using their JSON API (add .json at the end of the URL). For example:
import requests

base_url = "https://www.reddit.com/r/AskReddit/top/.json?t=day"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0"
}
data = requests.get(base_url, headers=headers).json()["data"]["children"]

for i, d in enumerate(data, 1):
    print("{:>3} {:<90} {}".format(i, d["data"]["title"], d["data"]["url"]))
Prints:
1 What is a cult that pretends it’s not cult? https://www.reddit.com/r/AskReddit/comments/yaqpzk/what_is_a_cult_that_pretends_its_not_cult/
2 Men of Reddit, what was something you didn't know about women till you got with one? https://www.reddit.com/r/AskReddit/comments/yb54lt/men_of_reddit_what_was_something_you_didnt_know/
3 What's a name you would NEVER give to your child? https://www.reddit.com/r/AskReddit/comments/yaugmy/whats_a_name_you_would_never_give_to_your_child/
4 What is the single greatest animated series of all time? https://www.reddit.com/r/AskReddit/comments/yavldx/what_is_the_single_greatest_animated_series_of/
5 What have you survived that would’ve killed you 150+ years ago? https://www.reddit.com/r/AskReddit/comments/yb64tg/what_have_you_survived_that_wouldve_killed_you/
6 What is your “favorite movie” that most people have never seen? https://www.reddit.com/r/AskReddit/comments/yat0xj/what_is_your_favorite_movie_that_most_people_have/
7 What is the craziest cult of all time? https://www.reddit.com/r/AskReddit/comments/yasntt/what_is_the_craziest_cult_of_all_time/
8 [Serious]: What are some early warning signs of an abusive relationship? https://www.reddit.com/r/AskReddit/comments/yar1os/serious_what_are_some_early_warning_signs_of_an/
9 54% of Americans between the ages of 16 and 74 read below a 6th grade reading level. Why do you think that is? https://www.reddit.com/r/AskReddit/comments/yas0s7/54_of_americans_between_the_ages_of_16_and_74/
10 What is something positive going on in your life? https://www.reddit.com/r/AskReddit/comments/yb28sm/what_is_something_positive_going_on_in_your_life/
11 The alien overlords demand that one American major metropolitan city be sacrificed and turned into a no human zone. Which one goes? https://www.reddit.com/r/AskReddit/comments/yb0l7y/the_alien_overlords_demand_that_one_american/
12 What movie has a great soundtrack in your opinion? https://www.reddit.com/r/AskReddit/comments/yapb09/what_movie_has_a_great_soundtrack_in_your_opinion/
13 What is an obscure reference to something that only true fans will understand? https://www.reddit.com/r/AskReddit/comments/yb7og2/what_is_an_obscure_reference_to_something_that/
14 What's socially acceptable within your own gender, but not with the opposite? https://www.reddit.com/r/AskReddit/comments/yb7pei/whats_socially_acceptable_within_your_own_gender/
15 What stages of drunk do you have? https://www.reddit.com/r/AskReddit/comments/yb39tn/what_stages_of_drunk_do_you_have/
16 What is the worst chocolate? https://www.reddit.com/r/AskReddit/comments/yawn49/what_is_the_worst_chocolate/
17 What would the US state mottos be if they were brutally honest? https://www.reddit.com/r/AskReddit/comments/yb5eaw/what_would_the_us_state_mottos_be_if_they_were/
18 What’s your opinion on circumcision? https://www.reddit.com/r/AskReddit/comments/yat9pm/whats_your_opinion_on_circumcision/
19 What examples of 'Internet etiquette' do you feel deserve more awareness? https://www.reddit.com/r/AskReddit/comments/yb25n3/what_examples_of_internet_etiquette_do_you_feel/
20 What 90s song will always be a banger? https://www.reddit.com/r/AskReddit/comments/ybeys0/what_90s_song_will_always_be_a_banger/
21 What show never had a 'meh' season? https://www.reddit.com/r/AskReddit/comments/yb4ww4/what_show_never_had_a_meh_season/
22 What was the scariest thing you have witnessed? https://www.reddit.com/r/AskReddit/comments/yb0ajj/what_was_the_scariest_thing_you_have_witnessed/
23 You start a new job, what's an instant red flag in the workplace social atmosphere? https://www.reddit.com/r/AskReddit/comments/yb2siq/you_start_a_new_job_whats_an_instant_red_flag_in/
24 which fictional world would you like to live in the most? https://www.reddit.com/r/AskReddit/comments/yawtqi/which_fictional_world_would_you_like_to_live_in/
25 How did you come up with your username? https://www.reddit.com/r/AskReddit/comments/yb7zm8/how_did_you_come_up_with_your_username/
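For completeness, here is a rough sketch of the old.reddit.com route mentioned at the top; the p.title a.title selector is an assumption about old Reddit's markup and may need adjusting:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0"
}
r = requests.get("https://old.reddit.com/r/AskReddit/top/?t=day", headers=headers)
soup = BeautifulSoup(r.text, "html.parser")

# Post title links on old.reddit carry the "title" class (assumed selector)
for a in soup.select("p.title a.title"):
    print(a.text, a.get("href"))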

How to improve the results from DBpedia Spotlight?

I am using DBpedia Spotlight to extract DBpedia resources as follows.
import json
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'

REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
print(all_urls)
The results I got are as follows.
['http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Helix',
'http://dbpedia.org/resource/Bronchitis',
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']
As you can see, the results are not very good.
For example, consider Hedera helix extract in the text above. Even though DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight outputs it as two separate URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix.
Given my dataset, I would like to get the longest matching term in DBpedia as the result. What improvements can I make to get my desired output?
I am happy to provide more details if needed.
Although I am answering quite late, you can use the BabelNet API in Python to obtain DBpedia URIs covering longer text spans. I reproduced the problem using the code below:
from babelpy.babelfy import BabelfyClient

text = """Tolerance, safety and efficacy of Hedera helix extract in inflammatory
bronchial diseases under clinical practice conditions: a prospective, open,
multicentre postmarketing study in 9657 patients. In this postmarketing
study 9657 patients (5181 children) with bronchitis (acute or chronic
bronchial inflammatory disease) were treated with a syrup containing dried ivy
leaf extract. After 7 days of therapy, 95% of the patients showed improvement
or healing of their symptoms. The safety of the therapy was very good with an
overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders
with 1.5%). In those patients who got concomitant medication as well, it could
be shown that the additional application of antibiotics had no benefit
respective to efficacy but did increase the relative risk for the occurrence
of side effects by 26%. In conclusion, it is to say that the dried ivy leaf
extract is effective and well tolerated in patients with bronchitis. In view
of the large population considered, future analyses should approach specific
issues concerning therapy by age group, concomitant therapy and baseline
conditions."""

# Instantiate the BabelFy client.
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Babelfy the text.
babel_client.babelfy(text)

# Get all merged entities.
babel_client.all_merged_entities
The output will be in the format shown below for each merged entity in the text. You can then store and process the dictionary structure to extract the DBpedia URIs.
{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
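From there, pulling out just the DBpedia URIs is a short step. A minimal sketch, continuing from the babel_client above and assuming all_merged_entities returns a list of dicts shaped like the one shown:
# Keep only entities that actually carry a DBpedia link
dbpedia_uris = [
    entity['DBpediaURL']
    for entity in babel_client.all_merged_entities
    if entity.get('isEntity') and entity.get('DBpediaURL')
]
print(dbpedia_uris)  # e.g. ['http://dbpedia.org/resource/Hedera_helix', ...]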

BS4 - grabbing information from something you've already parsed

hey, this was kind of explained to me before, but I'm having trouble applying the same thing now to almost the same page...
page = 'http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp'
table = soup.find_all("table", {"class": "results"})
for item in list(table):
    for info in item.contents[1::2]:
        info.a.extract()
        link = info.a['href']
        print(link)
        name = info.text.strip()
        print(name)
The code above tries to capture the link to each film's page from the a tag in the variable info... and the text inside it has the name of each film, but instead I get all the text. Is there any way of just getting the name?
thanks guys in advance!!!
You just need to pull the text from the anchor tag inside the td with the class title:
In [15]: from bs4 import BeautifulSoup
In [16]: import requests
In [17]: url = "http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp"
In [18]: soup = BeautifulSoup(requests.get(url).content, "lxml")
In [19]: for td in soup.select("table.results td.title"):
   ....:     print(td.a.text)
   ....:
X-Men: Apocalypse
Warcraft
Captain America: Civil War
The Do-Over
Teenage Mutant Ninja Turtles: Out of the Shadows
The Angry Birds Movie
The Nice Guys
Batman v Superman: Dawn of Justice
Suicide Squad
Deadpool
Gods of Egypt
Zootopia
13 Hours: The Secret Soldiers of Benghazi
Now You See Me 2
The Brothers Grimsby
Hardcore Henry
Monster Trucks
Independence Day: Resurgence
Star Trek Beyond
The Legend of Tarzan
Deepwater Horizon
X-Men: Days of Future Past
Star Wars: The Force Awakens
X-Men: First Class
The 5th Wave
Pretty much all the data you would want is inside the td with the title class:
So if you also wanted the outline, all you need is the text from the span.outline:
In [24]: for td in soup.select("table.results td.title"):
   ....:     print(td.a.text)
   ....:     print(td.select_one("span.outline").text)
   ....:
X-Men: Apocalypse
With the emergence of the world's first mutant, Apocalypse, the X-Men must unite to defeat his extinction level plan.
Warcraft
The peaceful realm of Azeroth stands on the brink of war as its civilization faces a fearsome race of...
Captain America: Civil War
Political interference in the Avengers' activities causes a rift between former allies Captain America and Iron Man.
The Do-Over
Two down-on-their-luck guys decide to fake their own deaths and start over with new identities, only to find the people they're pretending to be are in even deeper trouble.
Teenage Mutant Ninja Turtles: Out of the Shadows
As Shredder joins forces with mad scientist Baxter Stockman and henchmen Bebop and Rocksteady to take over the world, the Turtles must confront an even greater nemesis: the notorious Krang.
The Angry Birds Movie
Find out why the birds are so angry. When an island populated by happy, flightless birds is visited by mysterious green piggies, it's up to three unlikely outcasts - Red, Chuck and Bomb - to figure out what the pigs are up to.
The Nice Guys
A mismatched pair of private eyes investigate the apparent suicide of a fading porn star in 1970s Los Angeles.
Batman v Superman: Dawn of Justice
Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs.
Suicide Squad
A secret government agency recruits imprisoned supervillains to execute dangerous black ops missions in exchange for clemency.
Deadpool
A former Special Forces operative turned mercenary is subjected to a rogue experiment that leaves him with accelerated healing powers, adopting the alter ego Deadpool.
Gods of Egypt
Mortal hero Bek teams with the god Horus in an alliance against Set, the merciless god of darkness, who has usurped Egypt's throne, plunging the once peaceful and prosperous empire into chaos and conflict.
Zootopia
In a city of anthropomorphic animals, a rookie bunny cop and a cynical con artist fox must work together to uncover a conspiracy.
13 Hours: The Secret Soldiers of Benghazi
During an attack on a U.S. compound in Libya, a security team struggles to make sense out of the chaos.
Now You See Me 2
The Four Horsemen resurface and are forcibly recruited by a tech genius to pull off their most impossible heist yet.
The Brothers Grimsby
A new assignment forces a top spy to team up with his football hooligan brother.
Hardcore Henry
Henry is resurrected from death with no memory, and he must save his wife from a telekinetic warlord with a plan to bio-engineer soldiers.
Monster Trucks
Looking for any way to get away from the life and town he was born into, Tripp (Lucas Till), a high school senior...
Independence Day: Resurgence
Two decades after the first Independence Day invasion, Earth is faced with a new extra-Solar threat. But will mankind's new space defenses be enough?
Star Trek Beyond
The USS Enterprise crew explores the furthest reaches of uncharted space, where they encounter a mysterious new enemy who puts them and everything the Federation stands for to the test.
The Legend of Tarzan
Tarzan, having acclimated to life in London, is called back to his former home in the jungle to investigate the activities at a mining encampment.
Deepwater Horizon
A story set on the offshore drilling rig Deepwater Horizon, which exploded during April 2010 and created the worst oil spill in U.S. history.
X-Men: Days of Future Past
The X-Men send Wolverine to the past in a desperate effort to change history and prevent an event that results in doom for both humans and mutants.
Star Wars: The Force Awakens
Three decades after the defeat of the Galactic Empire, a new threat arises. The First Order attempts to rule the galaxy and only a ragtag group of heroes can stop them, along with the help of the Resistance.
X-Men: First Class
In 1962, the United States government enlists the help of Mutants with superhuman abilities to stop a malicious dictator who is determined to start World War III.
The 5th Wave
Four waves of increasingly deadly alien attacks have left most of Earth decimated. Cassie is on the run, desperately trying to save her younger brother.
For the runtime, use td.select_one("span.runtime").text, etc.
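Putting those selectors together (continuing from the soup created above), a small sketch that collects the title, link, and outline of each film into one structure:
movies = []
for td in soup.select("table.results td.title"):
    outline = td.select_one("span.outline")
    movies.append({
        "title": td.a.text,
        "link": td.a["href"],
        "outline": outline.text.strip() if outline else None,
    })
print(movies[0])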
Just like how you got the link by doing
info.a['href']
You can also get the title of the movie by doing
info.a['title']
Hopefully this is what you're looking for!