extract text from specific sections in html, python - beautifulsoup

I'm trying to write a program that shows you the lyrics of a song, but I get stuck on this error:
AttributeError: 'NoneType' object has no attribute 'text'
here's the code:
import requests
from bs4 import BeautifulSoup

def get_lyrics(url):
    lyrics_html = requests.get(url)
    soup = BeautifulSoup(lyrics_html.content, "html.parser")
    lyrics = soup.find('div', {"class": "lyrics"})
    return lyrics.text
This is the site where I take the lyrics from.
I can't tell what's wrong. For example, I'll search for the lyrics of this song, so here are the lyrics of the song: click.
You can see for yourself that in the page, the "place" where the lyrics are is a div with class "lyrics". This is how all the lyrics pages of this site are made. Can someone help me please? Thanks

The page returns two versions of the page (probably to confuse scrapers and bots): one version with a class that begins with "Lyrics__Container..." and one with class lyrics. If a tag with class Lyrics__Container is not found, the lyrics are inside the tag with class lyrics.
This should always print the lyrics:
import requests
from bs4 import BeautifulSoup
url = 'https://genius.com/Luis-sal-ciao-mi-chiamo-luis-lyrics'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
text = soup.select_one('div[class^="Lyrics__Container"], .lyrics').get_text(strip=True, separator='\n')
print(text)
Prints:
[Intro]
Ah, mhh (ehi)
Ho la bocca piena
Va bene
[Verse]
Ciao, mi chiamo Luis (eh, eh-eh)
Ciao, mi chiamo Luis (eh, eh-eh)
Ciao, Ciao mi chiamo Luis (eh, eh-eh)
Ciao, mi chiamo Luis
Si, si, si Sal
A a a a Si si si si si si
Proprio così mi chiamo io
Ciao mi chiamo Luis Aah
... and so on.
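Wrapped back into the question's get_lyrics, the combined selector plus a None check avoids the original AttributeError whenever neither page variant matches (a minimal sketch):

import requests
from bs4 import BeautifulSoup

def get_lyrics(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    lyrics = soup.select_one('div[class^="Lyrics__Container"], .lyrics')
    if lyrics is None:  # neither page variant matched
        return None
    return lyrics.get_text(strip=True, separator='\n')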
EDIT: Updated version:
import requests
from bs4 import BeautifulSoup
url = 'https://genius.com/Avicii-the-nights-lyrics'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
def get_text(elements):
    text = ''
    for c in elements:
        # unwrap inline <a> and <span> tags so their contents stay in place
        for t in c.select('a, span'):
            t.unwrap()
        if c:
            # merge the now-adjacent strings so separator='\n' only breaks
            # between real block-level elements
            c.smooth()
        text += c.get_text(strip=True, separator='\n')
    return text

cs = soup.select('div[class^="Lyrics__Container"]')
if cs:
    text = get_text(cs)
else:
    text = get_text(soup.select('.lyrics'))
print(text)
Prints:
[Verse 1]
(Hey)
Once upon a younger year
When all our shadows disappeared
The animals inside came out to play (Hey)
Hey, went face to face with all our fears
Learned our lessons through the tears
Made memories we knew would never fade
[Pre-Chorus]
One day my father he told me
Son, don't let it slip away
...etc.

You should use this link https://genius.com/Luis-sal-ciao-mi-chiamo-luis-lyrics
instead of https://genius.com/, which you mentioned as the song.
import requests
from bs4 import BeautifulSoup

def get_lyrics(url):
    lyrics_html = requests.get(url)
    soup = BeautifulSoup(lyrics_html.text, "lxml")
    lyrics_text = []
    # note: this hashed class name is specific to the current build of the page
    lyrics = soup.find_all('div', class_="Lyrics__Container-sc-1ynbvzw-2 jgQsqn")
    for i in lyrics:
        lyrics_text.append(i.text.strip())
        # print(i.text.strip())
    return lyrics_text

output = get_lyrics("https://genius.com/Luis-sal-ciao-mi-chiamo-luis-lyrics")
Output will be:
['[Intro]Ah, mhh (ehi)Ho la bocca pienaVa bene[Verse]Ciao, mi chiamo Luis (eh, eh-eh)Ciao, mi chiamo Luis (eh, eh-eh)Ciao, Ciao mi chiamo Luis (eh, eh-eh)Ciao, mi chiamo LuisSi, si, si SalA a a a Si si si si si siProprio così mi chiamo ioCiao mi chiamo Luis AahLuis Sal, Luis, Luis, Luis SalCiao mi chiamo Luis, Luis SalEeemEeeCiao, Ciao BolognaMi chiamo LuisCiao Mamma (Eee) EeeCiao, Ciao anche a voi LuistiMi chiamo Luis, Lo youtuber EeeEeeCiao, Sono uno youtuberMi chiamo LuisSono uno youtuberEeeCiao, Sono uno youtuberMi chiamo LuisSono uno youtuberA e (Diglielo Luis) a e ă a e e a ă a a a-aaaaCiao mi chiamo LuisEee (Ma chi ti caga)Eee Ciao (Ma chi vuoi che ti guardi)Mi chiamo LuisHahahahaEeeVoglio diventare uno youtuberEee', '', '[Outro]Uuu BolognaDuemila EeeEee EeeEe']
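Note that .text with no separator glues the lines together, which is why each entry above is one long run-on string; get_text can keep the line breaks instead (a minimal tweak to the loop above):

for i in lyrics:
    # separator='\n' puts a newline between the texts of sibling tags
    lyrics_text.append(i.get_text(strip=True, separator='\n'))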

Related

UnboundLocalError: local variable 'range' referenced before assignment

I'm new to Python.
I just found an error in my code and I have no idea what the problem is. I've already searched multiple websites for the error message, but I haven't found a solution.
I didn't even put range into a variable, so it's kind of weird.
import turtle  # I haven't added colour yet because I don't know how; I think you have to pick a set of colours and then use a random command
import random

couleur = [(255, 127, 36),
           (238, 118, 33),
           (205, 102, 29),
           (255, 114, 86),
           (238, 106, 80),
           (205, 91, 69),
           (255, 127, 0),
           (238, 118, 0),
           (205, 102, 0),
           (139, 69, 0),
           (139, 69, 19)]

def random_color():
    return random.choice(couleur)

def briques():  # this function draws the rows of bricks
    for i in range(12):  # I could have used nblong, but that would have been odd; it's the number of rows of bricks
        for i in range(12):  # this is for the first row of bricks
            random_color()
            color(couleur)
            fillcolor(couleur)
            begin_fill()
By the way, this is a turtle project: we have to draw a house (in 2D) using turtle's features.
Maybe try defining i with a value before the for i in range loop, e.g.:
i = " " or i = 0

Beautiful Soup - get text from all <li> elements in <ul>

With this code:
match_url = f'https://interativos.globoesporte.globo.com/cartola-fc/mais-escalados/mais-escalados-do-cartola-fc'
browser.visit(match_url)
browser.find_by_tag('li[class="historico-rodadas__rodada historico-rodadas__rodada--ativa"]').click()
soup = BeautifulSoup(browser.html, 'html.parser')
innerContent = soup.findAll('ul',class_="field__players")
print (innerContent)
I've managed to fetch the <ul>:
[<ul class="field__players"><li class="player"...]
Now how can I access text for player__name and player__value for all players in the list?
This should help you:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://interativos.globoesporte.globo.com/cartola-fc/mais-escalados/mais-escalados-do-cartola-fc')
src = driver.page_source
driver.close()

soup = BeautifulSoup(src, 'html5lib')
innerContent = soup.find('ul', class_="field__players")
li_items = innerContent.find_all('li')
for li in li_items:
    p_tags = li.find_all('p')[:-1]  # [:-1] drops the last <p> in each item, which is player__label
    for p in p_tags:
        print(p.text)
Output:
Keno
2.868.755
Pedro
2.483.069
Bruno Henrique
1.686.894
Hugo Souza
809.186
Guilherme Arana
1.314.769
Filipe Luís
776.147
Thiago Galhardo
2.696.853
Vinícius
1.405.012
Nenê
1.369.209
Jorge Sampaoli
1.255.731
Réver
1.505.522
Víctor Cuesta
1.220.451
I should just put this here to show you what he wants:
soup = BeautifulSoup(browser.html, 'html.parser')
innerContent = soup.find('ul', class_="field__players")  # find (not findAll), so we get a single tag to search within
for li in innerContent.findAll('li'):
    player_name = li.find('p', class_="player__name")
    player_value = li.find('p', class_="player__value")
    print(player_name.text)
    print(player_value.text)
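For completeness, the same two fields can also be pulled with CSS selectors instead of nested find calls (a sketch against the soup built above; the li class player is visible in the <ul> the question prints):

for li in soup.select('ul.field__players li.player'):
    name = li.select_one('p.player__name')
    value = li.select_one('p.player__value')
    if name and value:  # skip any li that lacks either field
        print(name.text)
        print(value.text)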

Training a spaCy model for NER on French resumes doesn't give any results

Sample of training data (input.json); the full JSON has only 100 resumes.
{"content": "Resume 1 text in french","annotation":[{"label":["diplomes"],"points":[{"start":1233,"end":1423,"text":"1995-1996 : Lycée Dar Essalam Rabat \n Baccalauréat scientifique option sciences Expérimentales "}]},{"label":["diplomes"],"points":[{"start":1012,"end":1226,"text":"1996-1998 : Faculté des Sciences Rabat \n C.E.U.S (Certificat des Etudes universitaires Supérieurs) option physique et chimie "}]},{"label":["diplomes"],"points":[{"start":812,"end":1004,"text":"1999-2000 : Faculté des Sciences Rabat \n Licence es sciences physique option électronique "}]},{"label":["diplomes"],"points":[{"start":589,"end":805,"text":"2002-2004 : Faculté des Sciences Rabat \nDESA ((Diplôme des Etudes Supérieures Approfondies) en informatique \n\ntélécommunication multimédia "}]},{"label":["diplomes"],"points":[{"start":365,"end":582,"text":"2014-2017 : Institut National des Postes et Télécommunications INPT Rabat \n Thèse de doctorat en informatique et télécommunication "}]},{"label":["adresse"],"points":[{"start":122,"end":157,"text":"Rue 34 n 17 Hay Errachad Rabat Maroc"}]}],"extras":null,"metadata":{"first_done_at":1586140561000,"last_updated_at":1586140561000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}
{"content": "Resume 2 text in french","annotation":[{"label":["diplomes"],"points":[{"start":1251,"end":1345,"text":"Lycée Oued El Makhazine - Meknès \n\n- Bachelier mention très bien \n- Option : Sciences physiques"}]},{"label":["diplomes"],"points":[{"start":1122,"end":1231,"text":"Classes préparatoires Moulay Youssef - Rabat \n\n- Admis au Concours National Commun CNC \n- Option : PCSI - PSI "}]},{"label":["diplomes"],"points":[{"start":907,"end":1101,"text":"Institut National des Postes et Télécommunications INPT - Rabat \n\n- Ingénieur d’État en Télécommunications et technologies de l’information \n- Option : MTE Management des Télécoms de l’entreprise"}]},{"label":["adresse"],"points":[{"start":79,"end":133,"text":"94, Hay El Izdihar, Avenue El Massira, Ouislane, MEKNES"}]}],"extras":null,"metadata":{"first_done_at":1586126476000,"last_updated_at":1586325851000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}
{"content": "Resume 3 text in french","annotation":[{"label":["adresse"],"points":[{"start":2757,"end":2804,"text":"N141 Av. El Hansali Agharass \nBouargane \nAgadir "}]},{"label":["diplomes"],"points":[{"start":262,"end":369,"text":"2009-2010 : Baccalauréat Scientifique, option : Sciences Physiques au Lycée Qualifiant \nIBN MAJJA à Agadir."}]},{"label":["diplomes"],"points":[{"start":125,"end":259,"text":"2010-2016 : Diplôme d’Ingénieur d’Etat, option : Génie Informatique, à l’Ecole \nNationale des Sciences Appliquées d’Agadir (ENSAA). "}]}],"extras":null,"metadata":{"first_done_at":1586141779000,"last_updated_at":1586141779000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}
{"content": "Resume 4 text in french","annotation":[{"label":["diplomes"],"points":[{"start":505,"end":611,"text":"2012 Baccalauréat Sciences Expérimentales option Sciences Physiques, Lycée Hassan Bno \nTabit, Ouled Abbou. "}]},{"label":["diplomes"],"points":[{"start":375,"end":499,"text":"2012–2015 Diplôme de licence en Informatique et Gestion Industrielle, IGI, Faculté des sciences \net Techniques, Settat, LST. "}]},{"label":["diplomes"],"points":[{"start":272,"end":367,"text":"2015–2017 Master Spécialité BioInformatique et Systèmes Complexes, BISC, ENSA , Tanger, \n\nBac+5."}]},{"label":["adresse"],"points":[{"start":15,"end":71,"text":"246 Hay Pam Eljadid OULED ABBOU \n26450 BERRECHID, Maroc "}]}],"extras":null,"metadata":{"first_done_at":1586127374000,"last_updated_at":1586327010000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}
{"content": "Resume 5 text in french","annotation":null,"extras":null,"metadata":{"first_done_at":1586139511000,"last_updated_at":1586139511000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}
Code that transforms this JSON data to spaCy format:
import json
import pickle

input_file = "input.json"
output_file = "output.json"
training_data = []
lines = []
with open(input_file, 'r', encoding="utf8") as f:
    lines = f.readlines()

for line in lines:
    data = json.loads(line)
    print(data)
    text = data['content']
    entities = []
    if data['annotation']:  # some rows have "annotation": null (e.g. resume 5)
        for annotation in data['annotation']:
            point = annotation['points'][0]
            labels = annotation['label']
            if not isinstance(labels, list):
                labels = [labels]
            for label in labels:
                # the tool's 'end' offset is inclusive; spaCy expects an exclusive end, hence +1
                entities.append((point['start'], point['end'] + 1, label))
    training_data.append((text, {"entities": entities}))

with open(output_file, 'wb') as fp:
    pickle.dump(training_data, fp)
Code for training the spaCy model:
import random
import datetime as dt
from pathlib import Path
import spacy

def train_spacy():
    TRAIN_DATA = training_data
    nlp = spacy.load('fr_core_news_md')  # load the pretrained French pipeline
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    # if 'ner' not in nlp.pipe_names:
    #     ner = nlp.create_pipe('ner')
    #     nlp.add_pipe(ner, last=True)
    ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(20):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    # drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(itn, dt.datetime.now(), losses)

    output_dir = "new-model"
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = "addr_edu"  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

train_spacy()
When I test the model this is what happens
import spacy
nlp = spacy.load("new-model")
doc = nlp("Text of a Resume already trained on")
print(doc.ents)
# It prints out this ()
doc = nlp("Text of a Resume not trained on")
print(doc.ents)
# It prints out this ()
What I expect it to give me are the entities adresse (address) and diplomes (academic degrees) present in the text.
Edit 1
The sample data (input.json) at the very top is part of the data I get after annotating resumes on a text annotation platform.
But I have to transform it to spaCy format so I can give it to the model for training.
This is what a resume with annotations looks like when I give it to the model:
training_data = [(
'Dr.XXXXXX XXXXXXX \n\n \nEmail : XXXXXXXXXXXXXXXXXXXXXXXX \n\nGSM : XXXXXXXXXX \n\nAdresse : XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \n \n\n \n\nETAT CIVIL \n \n\nSituation de famille : célibataire \n\nNationalité : Marocaine \n\nNé le : 10 février 1983 \n\nLieu de naissance : XXXXXXXXXXXXXXXX \n\n \n FORMATION \n\n• 2014-2017 : Institut National des Postes et Télécommunications INPT Rabat \n Thèse de doctorat en informatique et télécommunication \n \n\n• 2002-2004 : Faculté des Sciences Rabat \nDESA ((Diplôme des Etudes Supérieures Approfondies) en informatique \n\ntélécommunication multimédia \n \n\n• 1999-2000 : Faculté des Sciences Rabat \n Licence es sciences physique option électronique \n \n\n• 1996-1998 : Faculté des Sciences Rabat \n C.E.U.S (Certificat des Etudes universitaires Supérieurs) option physique et chimie \n \n\n• 1995-1996 : Lycée Dar Essalam Rabat \n Baccalauréat scientifique option sciences Expérimentales \n\nSTAGE DE FORMATION \n\n• Du 03/03/2004 au 17/09/2004 : Stage de Projet de Fin d’Etudes à l’ INPT pour \nl’obtention du DESA (Diplôme des Etudes Supérieures Approfondies). \n\n Sujet : AGENT RMON DANS LA GESTION DE RESEAUX. \n\n• Du 03/06/2002 au 17/01/2003: Stage de Projet de Fin d’année à INPT \n Sujet : Mécanisme d’Authentification Kerbéros Dans un Réseau Sans fils sous Redhat. \n\nPUBLICATION \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "New \n\nstrategy to optimize the performance of epidemic routing protocol." International Journal \n\nof Computer Applications, vol. 92, N.7, 2014. \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "New \n\nStrategy to optimize the Performance of Spray and wait Routing Protocol." International \n\nJournal of Wireless and Mobile Networks v.6, N.2, 2014. \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "Impact of \n\nmobility models on Supp-Tran optimized DTN Spray and Wait routing." International \n\njournal of Mobile Network Communications & Telematics ( IJMNCT), Vol.4, N.2, April \n\n2014. \n\n✓ M. Ababou, R. Elkouch, M. Bellafkih and N. Ababou, "AntProPHET: A new routing \n\nprotocol for delay tolerant networks," Proceedings of 2014 Mediterranean Microwave \n\nSymposium (MMS2014), Marrakech, 2014, IEEE. \n\nmailto:XXXXXXXXXXXXXXXXXXXXXXXX\n\n\n✓ Ababou, Mohamed, et al. "BeeAntDTN: A nature inspired routing protocol for delay \n\ntolerant networks." Proceedings of 2014 Mediterranean Microwave Symposium \n\n(MMS2014). IEEE, 2014. \n\n✓ Ababou, Mohamed, et al. "ACDTN: A new routing protocol for delay tolerant networks \n\nbased on ant colony." Information Technology: Towards New Smart World (NSITNSW), \n\n2015 5th National Symposium on. IEEE, 2015. \n\n✓ Ababou, Mohamed, et al. "Energy-efficient routing in Delay-Tolerant Networks." RFID \n\nAnd Adaptive Wireless Sensor Networks (RAWSN), 2015 Third International Workshop \n\non. IEEE, 2015. \n\n✓ Ababou, Mohamed, et al. "Energy efficient and effect of mobility on ACDTN routing \n\nprotocol based on ant colony." Electrical and Information Technologies (ICEIT), 2015 \n\nInternational Conference on. IEEE, 2015. \n\n✓ Mohamed, Ababou et al. "Fuzzy ant colony based routing protocol for delay tolerant \n\nnetwork." 10th International Conference on Intelligent Systems: Theories and Applications \n\n(SITA). IEEE, 2015. 
\n\nARTICLES EN COURS DE PUBLICATION \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou.”Dynamic \n\nUtility-Based Buffer Management Strategy for Delay-tolerant Networks. “International \n\nJournal of Ad Hoc and Ubiquitous Computing, 2017. ‘accepté par la revue’ \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "Energy \n\nefficient routing protocol for delay tolerant network based on fuzzy logic and ant colony." \n\nInternational Journal of Intelligent Systems and Applications (IJISA), 2017. ‘accepté par la \n\nrevue’ \n\nCONNAISSANCES EN INFORMATIQUE \n\n \n\nLANGUES \n\nArabe, Français, anglais. \n\nLOISIRS ET INTERETS PERSONNELS \n\n \n\nVoyages, Photographie, Sport (tennis de table, footing), bénévolat. \n\nSystèmes : UNIX, DOS, Windows \n\nLangages : Séquentiels ( C, Assembleur), Requêtes (SQL), WEB (HTML, PHP, MySQL, \n\nJavaScript), Objets (C++, DOTNET,JAVA) , I.A. (Lisp, Prolog) \n\nLogiciels : Open ERP (Enterprise Resource Planning), AutoCAD, MATLAB, Visual \n\nBasic, Dreamweaver MX. \n\nDivers : Bases de données, ONE (Opportunistic Network Environment), NS3, \n\nArchitecture réseaux,Merise,... \n\n',
{'entities': [(1233, 1424, 'diplomes'), (1012, 1227, 'diplomes'), (812, 1005, 'diplomes'), (589, 806, 'diplomes'), (365, 583, 'diplomes'), (122, 158, 'adresse')]}
)]
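A side note on these offsets: spaCy silently drops entity spans whose boundaries do not fall on token boundaries (trailing spaces or newlines inside a span are a typical culprit), which can produce exactly this kind of empty doc.ents. A quick alignment check with the v2 API (a diagnostic sketch, assuming training_data as built above):

import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2 location

nlp = spacy.blank('fr')
for text, annotations in training_data:
    doc = nlp.make_doc(text)
    tags = biluo_tags_from_offsets(doc, annotations['entities'])
    if '-' in tags:  # '-' marks tokens inside a span that doesn't align
        print('misaligned span(s) in this resume')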
I agree it's better to try training the model on just one resume, and test with it to see if it learns.
I've changed the code; the difference is that now I train a blank model.
def train_spacy():
    TRAIN_DATA = training_data
    nlp = spacy.blank('fr')
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
    ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    optimizer = nlp.begin_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update(
                [text],  # batch of texts
                [annotations],  # batch of annotations
                drop=0.1,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses
            )
        print(itn, dt.datetime.now(), losses)
    return nlp
Here are the losses I get during training.
Here is the test; I test on the same resume used for training.
The good thing is that now I don't get the empty tuple; the model actually recognized something correctly, in this case the "adresse" entity.
But it won't recognize the "diplomes" entity, of which there are 5 in this resume, even though it was trained on it.
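If the spans do align, it may also help to update on shuffled minibatches rather than one document at a time, as the spaCy v2 training examples do (a sketch reusing nlp, optimizer and TRAIN_DATA from the function above):

from spacy.util import minibatch, compounding

for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch size grows from 4 towards 32 over the course of training
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(list(texts), list(annotations), drop=0.1, sgd=optimizer, losses=losses)
    print(itn, losses)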

Embed a pdf in a R Markdown file and adapt pagination

I am finishing my PhD, and I need to embed some papers (in PDF format) somewhere in the middle of my R Markdown text.
When converting the R Markdown to PDF, I would like those PDF papers to be embedded in the conversion.
However, I would also like those PDF papers to be numbered consistently with the rest of the Markdown text.
How can I do it?
UPDATE: New error
By using \includepdf, I get this error:
output file: Tesis_doctoral_-_TEXTO.knit.md
! Undefined control sequence.
l.695 \includepdf
[pages=1-10, angle=90, pagecommand={}]{PDF/Paper1.pdf}
Here is how much of TeX's memory you used:
12157 strings out of 495028
174654 string characters out of 6181498
273892 words of memory out of 5000000
15100 multiletter control sequences out of 15000+600000
40930 words of font info for 89 fonts, out of 8000000 for 9000
14 hyphenation exceptions out of 8191
31i,4n,35p,247b,342s stack positions out of 5000i,500n,10000p,200000b,80000s
Error: Failed to compile Tesis_doctoral_-_TEXTO.tex. See Tesis_doctoral_-_TEXTO.log for more info.
Execution halted
EXAMPLE of the R Markdown code
---
title: Histología dental de los homininos de la Sierra de Atapuerca (Burgos, España)
  y patrón de estrategia de vida
author: "Mario Modesto-Mata"
date: "20 September 2018"
output:
  pdf_document:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 4
  word_document:
    toc: yes
    toc_depth: '4'
  html_document: default
csl: science.csl
bibliography: references.bib
header-includes:
  - \usepackage{pdfpages}
---
```{r opciones_base_scripts, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(captioner)
tabla_nums <- captioner(prefix = "Tabla")
figura_nums <- captioner(prefix = "Figura")
anx_tabla_nums <- captioner(prefix = "Anexo Tabla")
```
# Resumen
Los estudios de desarrollo dental en homínidos han sido sesgados involuntariamente en especies pre-Homo y algunos especímenes Homo tempranos, que representan la condición primitiva con tiempos de formación dental más rápidos, respetan a los Neandertales posteriores y a los humanos modernos, que comparativamente tienen tiempos de formación más lentos.
## PDF Article
\includepdf[pages=1-22, pagecommand={}]{PDF/Paper1.pdf}
## Bayes
El desarrollo dental relativo se evaluó empleando un enfoque estadístico bayesiano (31).
This is the link to download the PDF
I had to remove a few things from your example, but after that it worked without problems:
---
title: Histología dental de los homininos de la Sierra de Atapuerca (Burgos, España)
  y patrón de estrategia de vida
author: "Mario Modesto-Mata"
date: "20 September 2018"
output:
  pdf_document:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 4
    keep_tex: yes
  word_document:
    toc: yes
    toc_depth: '4'
  html_document: default
header-includes:
  - \usepackage{pdfpages}
---
# Resumen
Los estudios de desarrollo dental en homínidos han sido sesgados involuntariamente en especies pre-Homo y algunos especímenes Homo tempranos, que representan la condición primitiva con tiempos de formación dental más rápidos, respetan a los Neandertales posteriores y a los humanos modernos, que comparativamente tienen tiempos de formación más lentos.
## PDF Article
\includepdf[pages=1-22, pagecommand={}, scale = 0.9]{Paper1.pdf}
## Bayes
El desarrollo dental relativo se evaluó empleando un enfoque estadístico bayesiano (31).
Result:
BTW, for something like a thesis I would use bookdown, since this gives you cross-referencing etc.
If that does not work for you, I suggest first looking at plain LaTeX, i.e. does the following LaTeX document work for you:
\documentclass{article}
\usepackage{pdfpages}
\begin{document}
foo
\includepdf[pages=1-22, pagecommand={}, scale = 0.9]{Paper1.pdf}
bar
\end{document}

Part of HTML instead of full using Soup/Selenium

I've been struggling with the following problem:
I am trying to retrieve the full HTML of a certain page. I've managed to scrape a few other sites, but this one just won't cooperate.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
##########################################
url = "https://fd.nl/laatste-nieuws"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
page_soup1 = soup(html, "html5lib")
page_soup1
The output is just a subpart of the HTML. When I inspect the page through Chrome there are many more elements.
I've tried just using Soup with multiple parsers (html.parser, html5lib and lxml), as well as using Selenium before Soup, both to no avail.
I'm fairly new to all of this so any tips/guides are welcome!
Cheers!
It seems that the site is using a "cookiewall". Just set "Cookie" in the headers to "cookieconsent=true" and it should work:
from bs4 import BeautifulSoup
import requests

headers = {"Host": "fd.nl",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Encoding": "gzip,deflate,br",
           "Cookie": "cookieconsent=true",
           "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}

url = "https://fd.nl/laatste-nieuws"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')

for h1, p in zip(soup.select('h1'), soup.select('h1 ~ p')):
    print(h1.text)
    print(p.text)
    print('-' * 80)
Prints:
Hogere omzet voor Nederlandse zuivelgroep A-ware
Familiebedrijf A-ware bouwt mozzarellafabriek in Heerenveen
--------------------------------------------------------------------------------
Via negentig procedures van amorfe betonkolos tot hotel met welnesscentrum
Ook fabrieken hebben een levensduur. Niet zelden staan de gebouwen er nog, maar is de oorspronkelijke functie verdwenen. Soms krijgen ze een nieuwe bestemming. In dit eerste deel over industrieel erfgoed: meelfabriek De Sleutels in Leiden.
--------------------------------------------------------------------------------
Egyptische miljardair en oprichter Fortress Investment Group kopen voetbalclub Aston Villa
Nieuwe eigenaren Nassef Sawiris en Wes Edens hopen met hun investering Aston Villa wel snel weer op het hoogste niveau te krijgen.
--------------------------------------------------------------------------------
Greet Prins struint door Marrakesh
Een ideale agenda zonder beperkingen van tijd, afstand of geld. Deze week in de rubriek Droomweekend: Greet Prins, voorzitter van de raad van bestuur van Philadelphia Zorg.
--------------------------------------------------------------------------------
Trump drukt op de beurs, Wall Street licht lager
Koersen op Wall Street dalen nadat Amerikaanse president heeft gezegd 'klaar te zijn om tot 500' mrd aan importheffingen te gaan.
--------------------------------------------------------------------------------
...and so on
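If you would rather stay with the Selenium approach from the question, the same consent cookie can be set on the driver (a sketch, assuming Firefox as in the original code; add_cookie only works once the domain has been loaded):

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://fd.nl/laatste-nieuws"
driver = webdriver.Firefox()
driver.get(url)  # load once so the cookie's domain matches
driver.add_cookie({"name": "cookieconsent", "value": "true"})
driver.get(url)  # reload; the page should now render past the cookiewall
page_soup = BeautifulSoup(driver.page_source, "html5lib")
driver.quit()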