React Native handling HTML p tags - react-native

I have a screen that displays article information that's been pulled from a WordPress API call, which returns JSON (inclusive of all its lovely HTML tags).
<Text style={styles.summary}>{htmlRegex(item.content.rendered)}{"\n"}{Moment(item.date, "YYYYMMDD").fromNow()}</Text>
I have a function that strips out all of the HTML tags, tidies up any unicode, etc...
function htmlRegex(string) {
  string = string.replace(/<\/?[^>]+(>|$)/g, "")
  string = string.replace(/…/g,"...")
  let changeencode = entities.decode(string);
  return changeencode;
}
The challenge is that the tags returned in the content appear to be causing odd line spacing issues, as shown in the screen grab;
The content.rendered contains;
rendered: "
<figure class="wp-block-image size-large"><img data-attachment-id="655" data-permalink="https://derbyfutsal.com/derby-futsal-club-women-name-change-june20/" data-orig-file="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png" data-orig-size="1024,512" data-comments-opened="1" data-image-meta="{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}" data-image-title="derby-futsal-club-women-name-change-june20" data-image-description="" data-medium-file="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=300" data-large-file="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=730" src="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=1024" alt="" class="wp-image-655" srcset="https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png 1024w, https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=150 150w, https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=300 300w, https://derbyfutsal.files.wordpress.com/2020/06/derby-futsal-club-women-name-change-june20.png?w=768 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
<p>Derby Futsal Club Ladies’ team are renamed Derby Futsal Club Women.</p>
<p>The change in name reflects Derby Futsal’s work in developing all aspects of futsal on and off the court.</p>
<p>It reflects the way the league (FA National Futsal Women’s Super Series), the players, the fans and the management refer to the game.</p>
<p>Hannah Roberts, Derby Futsal Club Women captain, believes “the change from Ladies to Women’s is a subtle but important one. Many professional sports teams have moved towards ‘Women’s’ in the last five years in order to stay modern and in touch, and as a forward-thinking club it’s important for Derby Futsal to do the same. We’re making so many strides in our community work and marketing, and this name change is another step forward to the future for the club”.</p>
<p>Derby Futsal Club Women first team coach, Matt Hardy feels this name change signifies evolution for the team; “the future of the women’s game both at Derby and nationally is looking bright. So it’s only right that we have a name that is modern, and inline with the national game”. </p>
<p>This news follows similar moves in professional football. Chelsea, Manchester City and Arsenal have all renamed their women’s team recently. It is something Professor Kath Woodward from the Open University, an expert on sociology and sport agrees with, “the use of ladies suggests a physical frailty and need for protection”.</p>
<p>Alex Scott, former Arsenal Women captain, adds: “the term ‘Women’s’ delineates between men and women without as many stereotypes or preconceived notions and it is in keeping with modern-day thinking on equality”.</p>
<p></p>
",
My question is, how do you handle the tags so that the line-break white space is manageable?

Put this in your css:
p {
  margin: 0;
  padding: 0;
}
And just replace 0 with whatever suits (0.5rem, 20px, whatever floats your boat really).

Related

Scraping contents of news articles

I was able to scrape the title, date, links, and content of news on these links: https://www.news24.com/news24/southafrica/crime-and-courts and https://www.news24.com/news24/southafrica/education. The output is saved in an Excel file. However, I noticed that not all the content inside the articles was scraped. I have tried different methods in the "Getting Content Section" of my code. Any help with this would be appreciated. Below is my code:
import sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from datetime import timedelta
art_title = [] # to store the titles of all news article
art_date = [] # to store the dates of all news article
art_link = [] # to store the links of all news article
pagesToGet = ['southafrica/crime-and-courts', 'southafrica/education']
for i in range(0, len(pagesToGet)):
    print('processing page : \n')
    url = 'https://www.news24.com/news24/'+str(pagesToGet[i])
    print(url)
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.maximize_window()
    try:
        driver.get("https://www.news24.com/news24/" +str(pagesToGet[i]))
    except Exception as e:
        error_type, error_obj, error_info = sys.exc_info()
        print('ERROR FOR LINK:', url)
        print(error_type, 'Line:', error_info.tb_lineno)
        continue
    time.sleep(3)
    scroll_pause_time = 1
    screen_height = driver.execute_script("return window.screen.height;")
    i = 1
    while True:
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))
        i += 1
        time.sleep(scroll_pause_time)
        scroll_height = driver.execute_script("return document.body.scrollHeight;")
        if (screen_height) * i > scroll_height:
            break
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    news = soup.find_all('article', attrs={'class': 'article-item'})
    print(len(news))
    # Getting titles, dates, and links
    for j in news:
        titles = j.select_one('.article-item__title span')
        title = titles.text.strip()
        dates = j.find('p', attrs={'class': 'article-item__date'})
        date = dates.text.strip()
        address = j.find('a').get('href')
        news_link = 'https://www.news24.com' + address
        art_title.append(title)
        art_date.append(date)
        art_link.append(news_link)
df = pd.DataFrame({'Article_Title': art_title, 'Date': art_date, 'Source': art_link})
# Getting Content Section
news_articles = [] # to store the content of each news article
news_count = 0
for link in df['Source']:
    print('\n')
    start_time = time.monotonic()
    print('Article No. ', news_count)
    print('Link: ', link)
    # Countermeasure for broken links
    try:
        if requests.get(link):
            news_response = requests.get(link)
        else:
            print("")
    except requests.exceptions.ConnectionError:
        news_response = 'N/A'
    # Auto sleep trigger after saving every 300 articles
    sleep_time = ['100', '200', '300', '400', '500']
    if news_count in sleep_time:
        time.sleep(12)
    else:
        ""
    try:
        if news_response.text:
            news_data = news_response.text
        else:
            print('')
    except AttributeError:
        news_data = 'N/A'
    news_soup = BeautifulSoup(news_data, 'html.parser')
    try:
        if news_soup.find('div', {'class': 'article__body'}):
            art_cont = news_soup.find('div','article__body')
            art = []
            article_text = [i.text.strip().replace("\xa0", " ") for i in art_cont.findAll('p')]
            art.append(article_text)
        else:
            print('')
    except AttributeError:
        article = 'N/A'
    print('\n')
    news_count += 1
    news_articles.append(art)
    end_time = time.monotonic()
    print(timedelta(seconds=end_time - start_time))
    print('\n')
# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset ="Source", keep = False, inplace = True)
# Dont store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_excel('SA_news24_3.xlsx')
driver.quit()
I tried the following code in the Getting Content Section as well. However, it produced the same output.
article_text = [i.get_text(strip=True).replace("\xa0", " ") for i in art_cont.findAll('p')]
The site uses several URL formats, so your code skipped some articles: some links ended up malformed, and some articles require a subscription to read. For the subscriber-only articles, I store "Login to read" followed by the link in place of the article text. I ran this code up to article number 670 and it didn't raise any errors. I changed the output from .xlsx to .csv because writing the Excel file raised an openpyxl error on Python 3.11.0.
Full Code
import time
import sys
from datetime import timedelta
import pandas as pd
from bs4 import BeautifulSoup
import requests
import json
art_title = [] # to store the titles of all news article
art_date = [] # to store the dates of all news article
art_link = [] # to store the links of all news article
pagesToGet = ['southafrica/crime-and-courts',
              'southafrica/education', 'places/gauteng']
for i in range(0, len(pagesToGet)):
    print('processing page : \n')
    if "places" in pagesToGet[i]:
        url = f"https://news24.com/api/article/loadmore/tag?tagType=places&tag={pagesToGet[i].split('/')[1]}&pagenumber=1&pagesize=100&ishomepage=false&ismobile=false"
    else:
        url = f"https://news24.com/api/article/loadmore/news24/{pagesToGet[i]}?pagenumber=1&pagesize=1200&ishomepage=false&ismobile=false"
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.json()["htmlContent"], 'html.parser')
    news = soup.find_all('article', attrs={'class': 'article-item'})
    print(len(news))
    # Getting titles, dates, and links
    for j in news:
        titles = j.select_one('.article-item__title span')
        title = titles.text.strip()
        dates = j.find('p', attrs={'class': 'article-item__date'})
        date = dates.text.strip()
        address = j.find('a').get('href')
        # Countermeasure for links with full url
        if "https://" in address:
            news_link = address
        else:
            news_link = 'https://www.news24.com' + address
        art_title.append(title)
        art_date.append(date)
        art_link.append(news_link)
df = pd.DataFrame({'Article_Title': art_title,
                   'Date': art_date, 'Source': art_link})
# Getting Content Section
news_articles = [] # to store the content of each news article
news_count = 0
for link in df['Source']:
    start_time = time.monotonic()
    print('Article No. ', news_count)
    print('Link: ', link)
    news_response = requests.get(link)
    news_data = news_response.content
    news_soup = BeautifulSoup(news_data, 'html.parser')
    art_cont = news_soup.find('div', 'article__body')
    # Countermeasure for links with subscribe form
    try:
        try:
            # Drop the newsletter sign-up widget text embedded in the article body
            article = art_cont.text.split("Newsletter")[0] + art_cont.text.split("Sign up")[1]
        except:
            article = art_cont.text
        article = " ".join((article).strip().split())
    except:
        # art_cont is None when the article body is behind a login
        article = f"Login to read {link}"
    news_count += 1
    news_articles.append(article)
    end_time = time.monotonic()
    print(timedelta(seconds=end_time - start_time))
    print('\n')
# Create a column to add all the scraped text
df['News'] = news_articles
df.drop_duplicates(subset="Source", keep=False, inplace=True)
# Dont store links
df.drop(columns=['Source'], axis=1, inplace=True)
df.to_csv('SA_news24_3.csv')
Output
,Article_Title,Date,News
0,Pastor gets double life sentence plus two 15-year terms for rape and murder of women,2h ago,"A pastor has been sentenced to two life sentences and two 15-year jail terms for rape and murder His modus operandi was to take the women to secluded areas before raping them and tying them to trees.One woman managed to escape after she was raped but the bodies of two others were found tied to the trees.The North West High Court has sentenced a 50-year-old pastor to two life terms behind bars and two 15-year jail terms for rape and murder.Lucas Chauke, 50, was sentenced on Monday for the crimes, which were committed in 2017 and 2018 in Temba in the North West.According to North West National Prosecuting Authority spokesperson Henry Mamothame, Chauke's first victim was a 53-year-old woman.He said Chauke pretended that he would assist the woman with her spirituality and took her to a secluded place near to a dam.""Upon arrival, he repeatedly raped her and subsequently tied her to a tree before fleeing the scene,"" Mamothame said. The woman managed to untie herself and ran to seek help. She reported the incident to the police, who then started searching for Chauke.READ | Kidnappings doubled nationally: over 4 000 cases reported to police from July to SeptemberOn 10 May the following year, Chauke pounced on his second victim - a 55-year-old woman.He took her to the same secluded area next to the dam, raped her and tied her to a tree before fleeing. This time his victim was unable to free herself.Her decomposed body was later found, still tied to the tree, Mamothame said. His third victim was targeted months later, on 3 August, in the same area. According to Mamothame, Chauke attempted to rape her but failed.""He then tied her to a tree and left her to die,"" he said. 
Chauke was charged in connection with her murder.He was linked to the crimes via DNA.READ | 'These are not pets': Man gives away his two pit bulls after news of child mauled to deathIn aggravation of his sentence, State advocate Benny Kalakgosi urged the court not to deviate from the prescribed minimum sentences, saying that the offences Chauke had committed were serious.""He further argued that Chauke took advantage of unsuspecting women, who trusted him as a pastor but instead, [he] took advantage of their vulnerability,"" Mamothame said. Judge Frances Snyman agreed with the State and described Chauke's actions as horrific.The judge also alluded to the position of trust that he abused, Mamothame said.Chauke was sentenced to life for the rape of the first victim, 15 years for the rape of the second victim and life for her murder, as well as 15 years for her murder. He was also declared unfit to possess a firearm."
1,'I am innocent': Alleged July unrest instigator Ngizwe Mchunu pleads not guilty,4h ago,"Former Ukhozi FM DJ Ngizwe Mchunu has denied inciting the July 2022 unrest.Mchunu pleaded not guilty to charges that stem from the incitement allegations.He also claimed he had permission to travel to Gauteng for work during the Covid-19 lockdown.""They are lying. I know nothing about those charges,"" alleged July unrest instigator, former Ukhozi FM DJ Ngizwe Brian Mchunu, told the Randburg Magistrate's Court when his trial started on Tuesday.Mchunu pleaded not guilty to the charges against him, which stems from allegations that he incited public violence, leading to the destruction of property, and convened a gathering in contravening of Covid-19 lockdown regulations after the detention of former president Jacob Zuma in July last year.In his plea statement, Mchunu said all charges were explained to him.""I am a radio and television personality. I'm also a poet and cultural activist. In 2020, I established my online radio.READ | July unrest instigators could face terrorism-related charges""On 11 July 2021, I sent invitations to journalists to discuss the then-current affairs. At the time, it was during the arrest of Zuma.""I held a media briefing at a hotel in Bryanston to show concerns over Zuma's arrest. Zuma is my neighbour [in Nkandla]. In my African culture, I regard him as my father.""Mchunu continued that he was not unhappy about Zuma's arrest but added: ""I didn't condone any violence. I pleaded with fellow Africans to stop destroying infrastructure. I didn't incite any violence.I said to them, 'My brothers and sisters, I'm begging you as we are destroying our country.'""He added:They are lying. I know nothing about those charges. I am innocent. He also claimed that he had permission to travel to Gauteng for work during the lockdown.The hearing continues."
2,Jukskei River baptism drownings: Pastor of informal 'church' goes to ground,5h ago,"A pastor of the church where congregants drowned during a baptism ceremony has gone to ground.Johannesburg Emergency Medical Services said his identity was not known.So far, 14 bodies have been retrieved from the river.According to Johannesburg Emergency Medical Services (EMS), 13 of the 14 bodies retrieved from the Jukskei River have been positively identified.The bodies are of congregants who were swept away during a baptism ceremony on Saturday evening in Alexandra.The search for the other missing bodies continues.Reports are that the pastor of the church survived the flash flood after congregants rescued him.READ | Jukskei River baptism: Families gather at mortuary to identify loved onesEMS spokesperson Robert Mulaudzi said they had been in contact with the pastor since the day of the tragedy, but that they had since lost contact with him.It is alleged that the pastor was not running a formal church, but rather used the Jukskei River as a place to perform rituals for people who came to him for consultations. At this stage, his identity is not known, and because his was not a formal church, Mulaudzi could not confirm the number of people who could have been attending the ceremony.Speaking to the media outside the Sandton fire station on Tuesday morning, a member of the rescue team, Xolile Khumalo, said: Thirteen out of the 14 bodies retrieved have been identified, and the one has not been identified yet.She said their team would continue with the search. ""Three families have since come forward to confirm missing persons, and while we cannot be certain that the exact number of bodies missing is three, we will continue with our search."""
3,Six-month-old infant ‘abducted’ in Somerset West CBD,9h ago,"Authorities are on high alert after a baby was allegedly abducted in Somerset West on Monday.The alleged incident occurred around lunchtime, but was only reported to Somerset West police around 22:00. According to Sergeant Suzan Jantjies, spokesperson for Somerset West police, the six-month-old baby boy was taken around 13:00. It is believed the infant’s mother, a 22-year-old from Nomzamo, entrusted a fellow community member and mother with the care of her child before leaving for work on Monday morning. However, when she returned home from work, she was informed that the child was taken. Police were apparently informed that the carer, the infant and her nine-year-old child had travelled to Somerset West CBD to attend to Sassa matters. She allegedly stopped by a liquor store in Victoria Street and asked an unknown woman to keep the baby and watch over her child. After purchasing what was needed and exiting the store, she realised the woman and the children were missing. A case of abduction was opened and is being investigated by the police’s Family Violence, Child Protection and Sexual Offences (FCS) unit. Police obtained security footage which shows the alleged abductor getting into a taxi and making off with the children. The older child was apparently dropped off close to her home and safely returned. However, the baby has still not been found. 
According to a spokesperson, FCS police members prioritised investigations immediately after the case was reported late last night and descended on the local township, where they made contact with the visibly “traumatised” parent and obtained statements until the early hours of Tuesday morning – all in hopes of locating the child and the alleged suspect.Authorities are searching for what is believed to be a foreign national woman with braids, speaking isiZulu.Anyone with information which could aid the investigation and search, is urged to call Captain Trevor Nash of FCS on 082 301 8910."
4,Have you herd: Dubai businessman didn't know Ramaphosa owned Phala Phala buffalo he bought - report,8h ago,"A Dubai businessman who bought buffaloes at Phala Phala farm reportedly claims he did not know the deal was with President Cyril Ramaphosa.Hazim Mustafa also claimed he was expecting to be refunded for the livestock after the animals were not delivered.He reportedly brought the cash into the country via OR Tambo International Airport, and claims he declared it.A Dubai businessman who reportedly bought 20 buffaloes from President Cyril Ramaphosa's Phala Phala farm claims that he didn't know the deal with was with the president, according to a report.Sky News reported that Hazim Mustafa, who reportedly paid $580 000 (R10 million) in cash for the 20 buffaloes from Ramaphosa's farm in December 2019, said initially he didn't know who the animals belonged to.A panel headed by former chief justice Sandile Ngcobo released a report last week after conducting a probe into allegations of a cover-up of a theft at the farm in February 2020.READ | Ramaphosa wins crucial NEC debate as parliamentary vote on Phala Phala report delayed by a weekThe panel found that there was a case for Ramaphosa to answer and that he may have violated the law and involved himself in a conflict between his official duties and his private business.In a statement to the panel, Mustafa was identified as the source of the more than $500 000 (R9 million) that was stolen from the farm. Among the evidence was a receipt for $580 000 that a Phala Phala employee had written to ""Mr Hazim"".According to Sky News, Mustafa said he celebrated Christmas and his wife's birthday in Limpopo in 2019, and that he dealt with a broker when he bought the animals.He reportedly said the animals were to be prepared for export, but they were never delivered due to the Covid-19 lockdown. 
He understood he would be refunded after the delays.He also reportedly brought the cash into the country through OR Tambo International Airport and said he declared it. Mustafa also told Sky News that the amount was ""nothing for a businessman like [him]"".READ | Here's the Sudanese millionaire - and his Gucci wife - who bought Ramaphosa's buffaloThe businessman is the owner Sudanese football club Al Merrikh SC. He is married to Bianca O'Donoghue, who hails from KwaZulu-Natal. O'Donoghue regularly takes to social media to post snaps of a life of wealth – including several pictures in designer labels and next to a purple Rolls Royce Cullinan, a luxury SUV worth approximately R5.5 million.Sudanese businessman Hazim Mustafa with his South African-born wife, Bianca O'Donoghue.Facebook PHOTO: Bianca O'Donoghue/Facebook News24 previously reported that he also had ties to former Sudanese president, Omar al-Bashir.There have been calls for Ramaphosa to step down following the saga. A motion of no confidence is expected to be submitted in Parliament.He denied any wrongdoing and said the ANC's national executive committee (NEC) would decide his fate.Do you have a tipoff or any information that could help shape this story? Email tips#24.com"
5,Hefty prison sentence for man who killed stranded KZN cop while pretending to offer help,9h ago,"Two men have been sentenced – one for the murder of a KwaZulu-Natal police officer, and the other for an attempt to rob the officer.Sergeant Mzamiseni Mbele was murdered in Weenen in April last year.He was attacked and robbed when his car broke down on the highway while he was on his way home.A man who murdered a KwaZulu-Natal police officer, after pretending that he wanted to help him with his broken-down car, has been jailed.A second man, who was only convicted of an attempt to rob the officer, has also been sentenced to imprisonment.On Friday, the KwaZulu-Natal High Court in Madadeni sentenced Sboniso Linda, 36, to an effective 25 years' imprisonment, and Nkanyiso Mungwe, 25, to five years' imprisonment.READ | Alleged house robber shot after attack at off-duty cop's homeAccording to Hawks spokesperson, Captain Simphiwe Mhlongo, 39-year-old Sergeant Mzamiseni Mbele, who was stationed at the Msinga police station, was on his way home in April last year when his car broke down on the R74 highway in Weenen.Mbele let his wife know that the car had broken down. While stationary on the road, Linda and Mungwe approached him and offered to help.Mhlongo said: All of a sudden, [they] severely assaulted Mbele. They robbed him of his belongings and fled the scene. A farm worker found Mbele's body the next dayA case of murder was reported at the Weenen police station and the Hawks took over the investigation.The men were arrested.""Their bail [application] was successfully opposed and they appeared in court several times until they were found guilty,"" Mhlongo added.How safe is your neighbourhood? Find out by using News24's CrimeCheckLinda was sentenced to 20 years' imprisonment for murder and 10 years' imprisonment for robbery with aggravating circumstances. 
Half of the robbery sentence has to be served concurrently, leaving Linda with an effective sentence of 25 years.Mungwe was sentenced to five years' imprisonment for attempted robbery with aggravating circumstances."
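As a side note on the "countermeasure for links with full url" branch in the code above: the same normalization could probably be done with the standard library's urllib.parse.urljoin, which leaves an absolute URL untouched and resolves a relative one against a base. This is only a minimal sketch; the helper name absolute_link and the example paths are made up for illustration and are not part of the answer's code.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.news24.com"  # same base the answer prepends to relative links


def absolute_link(address: str, base: str = BASE_URL) -> str:
    """Return an absolute URL for an href that may be relative or absolute.

    Hypothetical helper, not part of the original answer.
    """
    # urljoin returns `address` unchanged when it already carries a scheme
    # and host, and otherwise resolves it against `base`.
    return urljoin(base, address)


# Illustrative hrefs only; these paths are invented for the example.
print(absolute_link("/news24/southafrica/news/example-article"))
print(absolute_link("https://www.news24.com/fin24/example-article"))
```

Unlike the `"https://" in address` check, this also covers hrefs that start with http:// or with //.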

Is spaCy's tok2vec component required for POS tagging?

I am using spaCy to do POS tagging and lemmatization. I believe the best practice is to disable unneeded components to maximize performance. Having disabled several components, however, it now seems that every token's POS is noun!
It seems the tok2vec component is required for POS tagging. Is that correct, and if so, is this explained anywhere?
Additionally, is there a better way to optimize spaCy pipelines besides removing components?
import spacy
txt = '''ex-4.1 2 d879007dex41.htm ex-4.1 ex-4.1 exhibit 4.1 amendment no. 6 to note amendment no. 6 to note (this " amendment "), dated and effective as of january 30, 2020, is made by and between the u.s. small business administration (" sba "), an agency of the united states, and its successors and assigns, and freshstart venture capital corporation (the " licensee "), a small business investment borrower, licensed under the small business investment act of 1958, as amended, whose principal office is located at 437 madison avenue, new york, ny 10022. recitals whereas , the licensee issued that certain note, effective as of march 1, 2017 in the principal amount of $34,024,755.58 (thirty-four million twenty-four thousand seven hundred fifty-five and 58/100 dollars) in favor of sba (the " existing note "). whereas , sba and the licensee have agreed, subject to the terms and conditions of this amendment, that the existing note be amended to reflect certain agreed upon revisions to the terms of the existing note. now therefore, sba and the licensee hereby agree, in consideration of the mutual premises and mutual obligations set forth herein, that the existing note is hereby amended as follows: section 1. defined terms . except as otherwise indicated herein, all words and terms defined in the existing note shall have the same meanings when used herein. section 2. amendments . a. in the last sentence of the second paragraph of the existing note the phrase, "february 1, 2020" is hereby deleted in its entirety and replaced with the following: "april 1, 2020" b. in the third paragraph of the existing note the phrase, "february 1, 2020" is hereby deleted in its entirety and replaced with the following: "april 1, 2020" section 3. representations and warranties . 
each party hereby represents and warrants to the other party that it is in compliance with all the terms and provisions set forth in the existing note on its part to be observed or performed and hereby confirms and reaffirms each of its representations and warranties contained in the existing note. section 4. limited effect . except as expressly amended and modified by this amendment, the existing note shall continue to be, and shall remain, in full force and effect in accordance with its terms (and as duly amended). 1 section 5. counterparts . this amendment may be executed by each of the parties hereto on any number of separate counterparts, each of which shall be an original and all of which taken together shall constitute one and the same instrument. delivery of an executed signature page of this amendment in portable document format (pdf) or by facsimile transmission shall be effective as delivery of an executed original counterpart of this amendment. section 6. governing law . pursuant to section 101.106(b) of part 13 of the code of federal regulations, this amendment is to be construed and enforced in accordance with the act, the regulations and other federal law, and in the absence of applicable federal law, then by applicable new york law to the extent it does not conflict with the act, the regulations or other federal law. [signatures appear on next page] 2 in witness whereof, the parties have caused this amendment to be executed by their respective officers thereunto duly authorized, as of the date first above written. freshstart venture capital corporation by: /s/ thomas j. munson name: thomas j. munson title: svp u.s. small business administration by: /s/ thomas g. morris name: thomas g. morris title: director, o/l & acting deputy a/a oii 3'''
nlp = spacy.load('en_core_web_sm')
nlp.disable_pipe("parser")
nlp.disable_pipe("tok2vec") # it seems this is needed in fact?
nlp.disable_pipe("ner")
nlp.enable_pipe("senter")
nlp.max_length = 5000000
doc = nlp(txt)
print(nlp.pipe_names)
for token in doc:
    print(token.text, token.pos_, token.lemma_)
NER is not required for POS tagging. Assuming you are actually using the above code, tok2vec is the issue, as it is required for POS tagging.
For advice on making spaCy faster, please see the spaCy speed FAQ. Besides disabling components you aren't using, another thing you can do is use nlp.pipe to batch requests.

Detect predefined topics in a text

I would like to find, in a text corpus (of long, boring texts), allusions to some predefined topics (let's say I am interested in two topics: "Remuneration" and "Work condition").
For example, finding where in my corpus (the specific paragraph) problems with "remuneration" are pointed out.
To accomplish that, I first thought of a deterministic approach: building a big dictionary and using regex to flag those words in the text corpus. It is a very basic idea, but I do not know how to build my dictionary efficiently (I need a lot of words in the lexical field of remuneration). Do you know of a website in French that could help me build this dictionary?
Perhaps you can think of a cleverer approach based on a machine learning algorithm that could perform this task (I know about topic modelling, but the difference here is that I am focusing on predetermined subjects/topics like "Remuneration"). I need a simple approach :)
The dictionary approach is a very basic one, but it could work. You can build the dictionary iteratively:
Suppose you want a dictionary of terms related to "work conditions".
Start with a seed, a small number of terms that may be related, with high probability, to work conditions.
Use this dictionary to run through the corpus and find relevant documents.
Now go through the relevant documents and find terms with high TF-IDF values (terms that are well represented in those documents but poorly represented in the rest of the corpus). These terms can be assumed to refer to the subject of "work conditions" as well.
Add the new terms you found to the dictionary.
Now you can run again through the corpus and find additional relevant documents.
You can repeat the above process for a pre-configured number of times, or until no more new terms are found.
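The loop described above can be sketched in a few lines of standard-library Python. This is a rough illustration under stated assumptions rather than a definitive implementation: the function names, the toy English corpus, the letter-only tokenizer, and the scoring rule (term frequency in the relevant documents times log(N / document frequency)) are all choices made for this example. A real French corpus would also want stop-word removal and stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters only (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())


def expand_dictionary(docs, seed, rounds=3, top_k=5):
    """Grow a seed dictionary from the documents it matches (illustrative)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    terms = set(seed)
    for _ in range(rounds):
        # Documents containing at least one dictionary term count as relevant.
        relevant = [toks for toks in tokenized if terms & set(toks)]
        if not relevant:
            break
        term_freq = Counter(t for toks in relevant for t in toks)
        # Frequent in the relevant documents, rare in the corpus as a whole.
        scores = {
            t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in term_freq.items()
            if t not in terms
        }
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        new_terms = [t for t, score in ranked[:top_k] if score > 0]
        if not new_terms:
            break  # nothing new to add: stop early
        terms.update(new_terms)
    return terms


# Toy corpus, invented for the example.
corpus = [
    "salary cut and bonus cancelled",
    "late salary, small bonus",
    "bonus frozen again while salary stagnates",
    "noisy office and broken chairs",
    "long hours in a cold office",
    "the canteen menu is great",
]
print(sorted(expand_dictionary(corpus, {"salary"}, rounds=1, top_k=1)))
```

Each extra round tends to pull in noisier terms, so in practice it is worth reviewing the candidates by hand before adding them to the dictionary.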
A full treatment of such "topic analysis" problems is well beyond the scope of a Stack Overflow Q&A; there are multiple books and papers on the subject.
For a small, starter project: collect a number of articles which focus on discussing your topic(s), and on other topics. Rate each document according to whether or not it covers each of your chosen topics. Calculate the term-frequency-inverse-document-frequency for each of the sample articles. Convert these into a vector of the frequency of appearance of each word for each document. (You'll probably want to eliminate extremely common or ambiguous "stop words" from the analysis and do "stemming" as well. You can also scan for common sequences of two or more words.) This then defines a set of "positive" and "negative" examples for each defined topic.
If there's only a single topic of interest, you can then use a cosine-similarity function to determine which sample article is most like your new / input text sample. For multiple topics, you'll probably want to do something like Principal Component Analysis from the original text samples to identify which words and word combinations are most representative of each topic.
The quality of the classification will depend largely on the number of example texts you have to train the model and how much they differ.
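For the single-topic case, the TF-IDF and cosine-similarity steps could look roughly like the following. It is a sketch using only the standard library; the helper names, the smoothed IDF formula and the two labelled example sentences are assumptions made for illustration, and stemming is left out.

```python
import math
import re
from collections import Counter

def tokenize(text, stop_words=frozenset()):
    """Lowercase, keep runs of letters, drop stop words."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in stop_words]

def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}  # smoothed IDF
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in token_lists]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify(text, examples, stop_words=frozenset()):
    """Return the label of the rated example most similar to `text`."""
    token_lists = [tokenize(sample, stop_words) for sample, _ in examples]
    vectors, idf = tfidf_vectors(token_lists)
    counts = Counter(tokenize(text, stop_words))
    # Words never seen in the examples carry no information and are ignored.
    query = {t: c * idf[t] for t, c in counts.items() if t in idf}
    best = max(range(len(examples)), key=lambda i: cosine(query, vectors[i]))
    return examples[best][1]

# Hypothetical positive and negative examples for the topic "remuneration".
examples = [
    ("Le salaire et la prime sont versés chaque mois", "remuneration"),
    ("Le match de football a lieu au stade", "other"),
]
print(classify("La prime de salaire augmente", examples))
```

With more than a handful of examples, a library implementation of TF-IDF and a proper classifier would likely be a better fit than this nearest-example lookup.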
If you're asking how to solve this in code, you could write a program (in your language of choice) that finds the paragraphs containing the word or the allusion word.
For example, I would do it like this in JavaScript:
// longText is a long text that includes 4 paragraphs in total
const longText = `
In 1893, the first running, gasoline-powered American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts. The first public run of the Duryea Motor Wagon took place on 21 September 1893, on Taylor Street in Metro Center Springfield.[32][33] The Studebaker Automobile Company, subsidiary of a long-established wagon and coach manufacturer, started to build cars in 1897[34]: p.66  and commenced sales of electric vehicles in 1902 and gasoline vehicles in 1904.[35]
In Britain, there had been several attempts to build steam cars with varying degrees of success, with Thomas Rickett even attempting a production run in 1860.[36] Santler from Malvern is recognized by the Veteran Car Club of Great Britain as having made the first gasoline-powered car in the country in 1894,[37] followed by Frederick William Lanchester in 1895, but these were both one-offs.[37] The first production vehicles in Great Britain came from the Daimler Company, a company founded by Harry J. Lawson in 1896, after purchasing the right to use the name of the engines. Lawson's company made its first car in 1897, and they bore the name Daimler.[37]
In 1892, German engineer Rudolf Diesel was granted a patent for a "New Rational Combustion Engine". In 1897, he built the first diesel engine.[1] Steam-, electric-, and gasoline-powered vehicles competed for decades, with gasoline internal combustion engines achieving dominance in the 1910s. Although various pistonless rotary engine designs have attempted to compete with the conventional piston and crankshaft design, only Mazda's version of the Wankel engine has had more than very limited success.
All in all, it is estimated that over 100,000 patents created the modern automobile and motorcycle.
`
document.querySelector('.searchbox').addEventListener('submit', (e) => { e.preventDefault(); search() })
function search() {
  const allusion = document.querySelector('#searchbox').value.trim().toLowerCase()
  const output = document.querySelector('#results-body ol')
  output.innerHTML = "" // reset the output
  if (allusion === "") return // an empty query would match every paragraph
  const paragraphs = longText.split('\n').filter(item => item != "")
  const included = paragraphs.filter((paragraph) => paragraph.toLowerCase().includes(allusion))
  // highlight inside the paragraph text before wrapping it, so the query can never match the markup itself
  const foundIn = included.map(paragraph => {
    const highlighted = paragraph.toLowerCase().replaceAll(allusion, `<span class="highlight">${allusion}</span>`)
    return `<div class="result-row"> <li>${highlighted}</li> </div>`
  })
  output.insertAdjacentHTML('afterbegin', foundIn.join('\n'))
}
.container{
padding : 5px;
border: .2px solid black;
}
.searchbox{
padding-bottom: 5px
}
.searchbox input {
width: 90%
}
.result-row{
padding-bottom: 5px;
}
.highlight{
background: yellow;
}
h3 span {
font-size: 14px;
font-style: italic;
}
<div class="container">
<form class="searchbox">
<h3>Give me a hint: <span>ex: car, gasoline, company</span> </h3>
<input id="searchbox" type="text" placeholder="allusion word, ex: car, gasoline, company">
<button type="submit"> find </button>
</form>
<div id="results-body">
<ol></ol>
</div>
</div>

Generating similar named entities/compound nouns

I have been trying to create distractors (false answers) for multiple choice questions. Using word vectors, I was able to get decent results for single-word nouns.
When dealing with compound nouns (such as "car park" or "Donald Trump"), my best attempt was to compute similar words for each part of the compound and combine them. The results are very entertaining:
Car park -> vehicle campground | automobile zoo
Fire engine -> flame horsepower | fired motor
Donald Trump -> Richard Jeopardy | Jeffrey Gamble
Barrack Obama -> Obamas McCain | Auschwitz Clinton
Unfortunately, these are not very convincing. Especially in case of named entities, I want to produce other named entities, which appear in similar contexts; e.g:
Fire engine -> Fire truck | Fireman
Donald Trump -> Barrack Obama | Hillary Clinton
Niagara Falls -> American Falls | Horseshoe Falls
Does anyone have any suggestions on how this could be achieved? Is there a way to generate similar named entities/noun chunks?
I managed to get some good distractors by searching for the named entities on Wikipedia, then extracting entities which are similar from the summary. Though I'd prefer to find a solution using just spacy.
If you haven't seen it yet, you might want to check out sense2vec, which allows learning context-sensitive vectors by including the part-of-speech tags or entity labels. Quick usage example of the spaCy extension:
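import spacy
from sense2vec import Sense2VecComponent
# load a spaCy English model, then attach the sense2vec vectors as a pipeline component
nlp = spacy.load('en')
s2v = Sense2VecComponent('/path/to/reddit_vectors-1.1.0')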
nlp.add_pipe(s2v)
doc = nlp(u"A sentence about natural language processing.")
most_similar = doc[3]._.s2v_most_similar(3)
# [(('natural language processing', 'NOUN'), 1.0),
# (('machine learning', 'NOUN'), 0.8986966609954834),
# (('computer vision', 'NOUN'), 0.8636297583580017)]
There is also an interactive demo using a sense2vec model trained on Reddit comments. Using this model, "car park" returns things like "parking lot" and "parking garage", and "Donald Trump" gives you "Sarah Palin", "Mitt Romney" and "Barack Obama". For ambiguous entities, you can also include the entity label: for example, "Niagara Falls|GPE" will show terms similar to the geopolitical entity (GPE), i.e. the city as opposed to the actual waterfalls. The results obviously depend on what was present in the data, so for even more specific similarities, you could also experiment with training your own sense2vec vectors.
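To illustrate why keying vectors by whole phrase plus label helps with compound nouns, here is a small self-contained sketch. It does not use sense2vec itself: the three-dimensional vectors are made-up toy values, the `phrase|LABEL` key layout is assumed from the "Niagara Falls|GPE" example, and restricting candidates to the same label is a design choice of this sketch rather than something the library necessarily does.

```python
import math

# Hypothetical toy vectors; real ones would come from a trained model.
VECTORS = {
    "Donald_Trump|PERSON": [0.9, 0.1, 0.0],
    "Barack_Obama|PERSON": [0.8, 0.2, 0.1],
    "Hillary_Clinton|PERSON": [0.7, 0.3, 0.0],
    "Niagara_Falls|GPE": [0.1, 0.9, 0.2],
    "Niagara_Falls|LOC": [0.0, 0.3, 0.9],
    "Horseshoe_Falls|LOC": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def distractors(key, vectors, n=2):
    """Return the n phrases most similar to `key` among entries with the same label."""
    label = key.rsplit("|", 1)[1]
    target = vectors[key]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != key and other.endswith("|" + label)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other.rsplit("|", 1)[0].replace("_", " ") for _, other in scored[:n]]

print(distractors("Donald_Trump|PERSON", VECTORS))
print(distractors("Niagara_Falls|LOC", VECTORS, n=1))
```

Because each compound is a single vocabulary entry, its neighbours are other whole phrases rather than recombined parts, which avoids outputs like "vehicle campground".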

BS4 - grabbing information from something youve already parsed

Hey, this was kind of explained to me before, but I'm having trouble applying the same thing to an almost identical page...
page = 'http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp'
table = soup.find_all("table", {"class": "results"})
for item in list(table):
    for info in item.contents[1::2]:
        info.a.extract()
        link = info.a['href']
        print(link)
        name = info.text.strip()
        print(name)
The code above tries to capture the link to each film's page from the a tag in the variable info, and the name of each film from its text, but instead I get all the text. Is there any way of getting just the name?
thanks guys in advance!!!
You just need to pull the text from the anchor tag inside the td with the class title:
In [15]: from bs4 import BeautifulSoup
In [16]: import requests
In [17]: url = "http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp"
In [18]: soup = BeautifulSoup(requests.get(url).content, "lxml")
In [19]: for td in soup.select("table.results td.title"):
   ....:     print(td.a.text)
   ....:
X-Men: Apocalypse
Warcraft
Captain America: Civil War
The Do-Over
Teenage Mutant Ninja Turtles: Out of the Shadows
The Angry Birds Movie
The Nice Guys
Batman v Superman: Dawn of Justice
Suicide Squad
Deadpool
Gods of Egypt
Zootopia
13 Hours: The Secret Soldiers of Benghazi
Now You See Me 2
The Brothers Grimsby
Hardcore Henry
Monster Trucks
Independence Day: Resurgence
Star Trek Beyond
The Legend of Tarzan
Deepwater Horizon
X-Men: Days of Future Past
Star Wars: The Force Awakens
X-Men: First Class
The 5th Wave
Pretty much all the data you would want is inside the td with the title class. So if you also wanted the outline, all you need is the text from the span.outline:
In [24]: for td in soup.select("table.results td.title"):
   ....:     print(td.a.text)
   ....:     print(td.select_one("span.outline").text)
   ....:
X-Men: Apocalypse
With the emergence of the world's first mutant, Apocalypse, the X-Men must unite to defeat his extinction level plan.
Warcraft
The peaceful realm of Azeroth stands on the brink of war as its civilization faces a fearsome race of...
Captain America: Civil War
Political interference in the Avengers' activities causes a rift between former allies Captain America and Iron Man.
The Do-Over
Two down-on-their-luck guys decide to fake their own deaths and start over with new identities, only to find the people they're pretending to be are in even deeper trouble.
Teenage Mutant Ninja Turtles: Out of the Shadows
As Shredder joins forces with mad scientist Baxter Stockman and henchmen Bebop and Rocksteady to take over the world, the Turtles must confront an even greater nemesis: the notorious Krang.
The Angry Birds Movie
Find out why the birds are so angry. When an island populated by happy, flightless birds is visited by mysterious green piggies, it's up to three unlikely outcasts - Red, Chuck and Bomb - to figure out what the pigs are up to.
The Nice Guys
A mismatched pair of private eyes investigate the apparent suicide of a fading porn star in 1970s Los Angeles.
Batman v Superman: Dawn of Justice
Fearing that the actions of Superman are left unchecked, Batman takes on the Man of Steel, while the world wrestles with what kind of a hero it really needs.
Suicide Squad
A secret government agency recruits imprisoned supervillains to execute dangerous black ops missions in exchange for clemency.
Deadpool
A former Special Forces operative turned mercenary is subjected to a rogue experiment that leaves him with accelerated healing powers, adopting the alter ego Deadpool.
Gods of Egypt
Mortal hero Bek teams with the god Horus in an alliance against Set, the merciless god of darkness, who has usurped Egypt's throne, plunging the once peaceful and prosperous empire into chaos and conflict.
Zootopia
In a city of anthropomorphic animals, a rookie bunny cop and a cynical con artist fox must work together to uncover a conspiracy.
13 Hours: The Secret Soldiers of Benghazi
During an attack on a U.S. compound in Libya, a security team struggles to make sense out of the chaos.
Now You See Me 2
The Four Horsemen resurface and are forcibly recruited by a tech genius to pull off their most impossible heist yet.
The Brothers Grimsby
A new assignment forces a top spy to team up with his football hooligan brother.
Hardcore Henry
Henry is resurrected from death with no memory, and he must save his wife from a telekinetic warlord with a plan to bio-engineer soldiers.
Monster Trucks
Looking for any way to get away from the life and town he was born into, Tripp (Lucas Till), a high school senior...
Independence Day: Resurgence
Two decades after the first Independence Day invasion, Earth is faced with a new extra-Solar threat. But will mankind's new space defenses be enough?
Star Trek Beyond
The USS Enterprise crew explores the furthest reaches of uncharted space, where they encounter a mysterious new enemy who puts them and everything the Federation stands for to the test.
The Legend of Tarzan
Tarzan, having acclimated to life in London, is called back to his former home in the jungle to investigate the activities at a mining encampment.
Deepwater Horizon
A story set on the offshore drilling rig Deepwater Horizon, which exploded during April 2010 and created the worst oil spill in U.S. history.
X-Men: Days of Future Past
The X-Men send Wolverine to the past in a desperate effort to change history and prevent an event that results in doom for both humans and mutants.
Star Wars: The Force Awakens
Three decades after the defeat of the Galactic Empire, a new threat arises. The First Order attempts to rule the galaxy and only a ragtag group of heroes can stop them, along with the help of the Resistance.
X-Men: First Class
In 1962, the United States government enlists the help of Mutants with superhuman abilities to stop a malicious dictator who is determined to start World War III.
The 5th Wave
Four waves of increasingly deadly alien attacks have left most of Earth decimated. Cassie is on the run, desperately trying to save her younger brother.
For the runtime, use td.select_one("span.runtime").text, and so on.
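If you want everything in one structure, the same selectors can be wrapped in a small helper. The sketch below runs against a made-up, trimmed-down HTML sample that only mimics the table layout described above (the real page may differ or may have changed), and the helper names are hypothetical. It also guards against a missing span, since select_one returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure used in this answer.
SAMPLE_HTML = """
<table class="results">
  <tr><td class="title">
    <a href="/title/tt0000001/">First Film</a>
    <span class="outline">A short outline.</span>
    <span class="runtime">120 mins.</span>
  </td></tr>
  <tr><td class="title">
    <a href="/title/tt0000002/">Second Film</a>
  </td></tr>
</table>
"""

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if there is no match."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None

def parse_films(html):
    """Collect name, link, outline and runtime for every td.title cell."""
    soup = BeautifulSoup(html, "html.parser")
    films = []
    for td in soup.select("table.results td.title"):
        films.append({
            "name": td.a.get_text(strip=True),
            "link": td.a["href"],
            "outline": text_or_none(td, "span.outline"),
            "runtime": text_or_none(td, "span.runtime"),
        })
    return films

for film in parse_films(SAMPLE_HTML):
    print(film)
```

The built-in "html.parser" is used here only to avoid an extra dependency; "lxml" works the same way if it is installed.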
Just like how you got the link by doing
info.a['href']
You can also get the title of the movie by doing
info.a['title']
Hopefully this is what you're looking for!