I've been struggling with the following problem:
I am trying to retrieve the full HTML of a certain page. I've managed to scrape a few other sites, but this one just won't cooperate.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
##########################################
url = "https://fd.nl/laatste-nieuws"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
page_soup1 = soup(html, "html5lib")
page_soup1
The output is only part of the HTML; when I inspect the page through Chrome there are many more elements. I've tried using just Soup with multiple parsers (html.parser, html5lib and lxml), as well as running Selenium before Soup, both to no avail.
I'm fairly new to all of this so any tips/guides are welcome!
Cheers!
It seems the site is using a "cookie wall". Just set the "Cookie" header to "cookieconsent=true" and it should work:
from bs4 import BeautifulSoup
import requests
headers = {"Host":"fd.nl",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding":"gzip,deflate,br",
"Cookie": "cookieconsent=true",
"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
url = "https://fd.nl/laatste-nieuws"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')
for h1, p in zip(soup.select('h1'), soup.select('h1 ~ p')):
print(h1.text)
print(p.text)
print('-' * 80)
Prints:
Hogere omzet voor Nederlandse zuivelgroep A-ware
Familiebedrijf A-ware bouwt mozzarellafabriek in Heerenveen
--------------------------------------------------------------------------------
Via negentig procedures van amorfe betonkolos tot hotel met welnesscentrum
Ook fabrieken hebben een levensduur. Niet zelden staan de gebouwen er nog, maar is de oorspronkelijke functie verdwenen. Soms krijgen ze een nieuwe bestemming. In dit eerste deel over industrieel erfgoed: meelfabriek De Sleutels in Leiden.
--------------------------------------------------------------------------------
Egyptische miljardair en oprichter Fortress Investment Group kopen voetbalclub Aston Villa
Nieuwe eigenaren Nassef Sawiris en Wes Edens hopen met hun investering Aston Villa wel snel weer op het hoogste niveau te krijgen.
--------------------------------------------------------------------------------
Greet Prins struint door Marrakesh
Een ideale agenda zonder beperkingen van tijd, afstand of geld. Deze week in de rubriek Droomweekend: Greet Prins, voorzitter van de raad van bestuur van Philadelphia Zorg.
--------------------------------------------------------------------------------
Trump drukt op de beurs, Wall Street licht lager
Koersen op Wall Street dalen nadat Amerikaanse president heeft gezegd 'klaar te zijn om tot 500' mrd aan importheffingen te gaan.
--------------------------------------------------------------------------------
...and so on
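If you'd rather stay with Selenium, as in your original attempt, the same cookie can be set on the driver before reloading the page. A minimal, untested sketch, assuming Firefox and that the cookie wall only checks the cookieconsent cookie:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
# Load the domain once so add_cookie applies to fd.nl
driver.get("https://fd.nl/laatste-nieuws")
driver.add_cookie({"name": "cookieconsent", "value": "true"})
# Reload the page; the cookie wall should now be bypassed
driver.get("https://fd.nl/laatste-nieuws")
page_soup = BeautifulSoup(driver.page_source, "html5lib")
driver.quit()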
I am trying to use Selenium to scrape dynamic webpages. Here, I tried to print all the authors on the website:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://quotes.toscrape.com/js")
elements = driver.find_elements_by_class_name("author")
for i in elements:
    print(i.text)
driver.quit()
This worked pretty well and printed the right result:
Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
But when I try to use similar code for another website, I get an error:
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid locator
(Session info: chrome=98.0.4758.102)
This is the second script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://www.myperfume.co.il/155567-%D7%9B%D7%9C-%D7%94%D7%9E%D7%95%D7%AA%D7%92%D7%99%D7%9D-%D7%9C%D7%92%D7%91%D7%A8?order=up_title'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
elements = driver.find_elements_by_class_name("title text-center")
for i in elements:
    print(i.text)
driver.quit()
What I am trying to do in this code is print all the names of the perfumes on the webpage. After inspecting, I saw that all of the names are in a class called 'title text-center'. How can I fix my code?
title text-center is actually 2 class names, title and text-center. In order to locate elements by 2 class names, you have to use an XPath or CSS selector.
So, instead of
elements = driver.find_elements_by_class_name("title text-center")
You can use
elements = driver.find_elements_by_xpath("//h3[@class='title text-center']")
Or
elements = driver.find_elements_by_css_selector("h3.title.text-center")
Also, you should add waits to access the web elements only when they are loaded and ready.
This should be done with Expected Conditions explicit waits, as follows:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.myperfume.co.il/155567-%D7%9B%D7%9C-%D7%94%D7%9E%D7%95%D7%AA%D7%92%D7%99%D7%9D-%D7%9C%D7%92%D7%91%D7%A8?order=up_title'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
wait = WebDriverWait(driver, 20)
driver.get(url)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h3.title.text-center")))
elements = driver.find_elements_by_css_selector("h3.title.text-center")
for i in elements:
    print(i.text)
driver.quit()
This error message...
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid locator
...implies that the locator strategy you have used is not a valid locator strategy, as By.CLASS_NAME takes a single class name as an argument.
To print all the names of the perfumes on the webpage with a list comprehension, you can use the following locator strategy:
Using css_selector and get_attribute("innerHTML"):
driver.get("https://www.myperfume.co.il/155567-%D7%9B%D7%9C-%D7%94%D7%9E%D7%95%D7%AA%D7%92%D7%99%D7%9D-%D7%9C%D7%92%D7%91%D7%A8?order=up_title")
print([my_elem.get_attribute("innerHTML") for my_elem in driver.find_elements_by_css_selector("h3.title")])
Ideally you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:
Using CSS_SELECTOR and text:
driver.get("https://www.myperfume.co.il/155567-%D7%9B%D7%9C-%D7%94%D7%9E%D7%95%D7%AA%D7%92%D7%99%D7%9D-%D7%9C%D7%92%D7%91%D7%A8?order=up_title")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h3.title")))])
Console Output:
[' 212 וי אי פי לגבר א.ד.ט 212 vip for men e.d.t ', ' 212 ניו יורק לגבר א.ד.ט 212 nyc for men e.d.t ', ' 212 סקסי לגבר א.ד.ט 212 sexy men e.d.t ', ' אברקרומבי פירס 100 מל א.ד.ק Abercrombie & Fitch Fierce 100 ml e.d.c ', ' אברקרומבי פירס 50 מל א.ד.ק Abercrombie & Fitch Fierce 50 ml e.d.c ', ' אברקרומבי פירס גודל ענק 200 מל א.ד.ק Abercrombie & Fitch Fierce 200 ml e.d.c ', ' אברקרומבי פירסט אינסטינקט לגבר א.ד.ט Abercrombie & Fitch First Instinct e.d.t ', ' אגואיסט א.ד.ט Egoiste e.d.t ', ' אגואיסט פלטינום א.ד.ט Egoiste Platinum e.d.t ', ' או דה בלנק א.ד.ט Eau De Blanc e.d.t ', ' או דה פרש א.ד.ט Eau Fraiche e.d.t ', ' אובסיישן לגבר א.ד.ט Obsession for men e.d.t ']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
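Note that the find_elements_by_* helpers used in these answers belong to the Selenium 3 API; they were deprecated and later removed in Selenium 4 in favour of the generic find_elements plus a By strategy. A minimal sketch of the Selenium 4 equivalent, reusing the same selector:

from selenium.webdriver.common.by import By

# Selenium 4 replacement for find_elements_by_css_selector
elements = driver.find_elements(By.CSS_SELECTOR, "h3.title.text-center")
for element in elements:
    print(element.text)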
I am trying to create a pandas DataFrame of the top 1000 recruits from the 2022 football recruiting class on the 247sports website, in a Google Colab notebook. This is the code I am using so far:
#Importing all necessary packages
import pandas as pd
import time
import datetime as dt
import os
import re
import requests
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import twofourseven
from bs4 import BeautifulSoup
from splinter import Browser
from kora.selenium import wd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import requests
from geopy.geocoders import Nominatim
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
year = '2022'
url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeRecruitRankings?InstitutionGroup=HighSchool'
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers = headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="rankings-page__list-item"):  # `[1:]` Since the first result is a table header
    # meta = tag.find_all("span", class_="meta")

    rank = tag.find_next("div", class_="primary").text
    TwoFourSeven_rank = tag.find_next("div", class_="other").text
    name = tag.find_next("a", class_="rankings-page__name-link").text
    school = tag.find_next("span", class_="meta").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text

    data.append(
        {
            "Rank": rank,
            "247 Rank": TwoFourSeven_rank,
            "Name": name,
            "School": school,
            "Class of": year,
            "Position": position,
            "Height & Weight": height_weight,
            "Rating": rating,
            "National Rank": nat_rank,
            "State Rank": state_rank,
            "Position Rank": pos_rank,
            # "School": ???,
        }
    )

    print(rank)

df = pd.DataFrame(data)
data
Ideally, I would also like to grab the school name the recruit chose from the logo on the table, but I am not sure how to go about that. For example, I would like to print out "Florida State" for the school column from this "row" of data.
Along with that, the ranks do print, but afterwards I get the following error, which prevents me from collecting or printing any additional data:
AttributeError Traceback (most recent call last)
<ipython-input-11-56f4779601f8> in <module>()
16 # meta = tag.find_all("span", class_="meta")
17
---> 18 rank = tag.find_next("div", class_="primary").text
19 # TwoFourSeven_rank = tag.find_next("div", class_="other").text
20 name = tag.find_next("a", class_="rankings-page__name-link").text
AttributeError: 'NoneType' object has no attribute 'text'
Lastly, I do understand that this webpage only displays 50 recruits unless my Python code clicks the "Load More" tab via Selenium, but I am not sure how to incorporate that in the most efficient and legible way possible. If anyone knows a good way to do all this, I'd greatly appreciate it. Thanks in advance.
Use try/except, as some of the elements will not be present. There is also no need for Selenium: the "Load More" button simply fetches the next batch through a Page query parameter, so plain requests will do.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://247sports.com/Season/2022-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}
rows = []
page = 0
while True:
    page += 1
    print('Page: %s' % page)
    payload = {'Page': '%s' % page}
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    athletes = soup.find_all('li', {'class': 'rankings-page__list-item'})

    if len(athletes) == 0:
        break

    continue_loop = True
    while continue_loop == True:
        for athlete in athletes:
            if athlete.text.strip() == 'Load More':
                continue_loop = False
                continue

            primary_rank = athlete.find('div', {'class': 'rank-column'}).find('div', {'class': 'primary'}).text.strip()
            try:
                other_rank = athlete.find('div', {'class': 'rank-column'}).find('div', {'class': 'other'}).text.strip()
            except:
                other_rank = ''
            name = athlete.find('div', {'class': 'recruit'}).find('a').text.strip()
            link = 'https://247sports.com' + athlete.find('div', {'class': 'recruit'}).find('a')['href']
            highschool = ' '.join([x.strip() for x in athlete.find('div', {'class': 'recruit'}).find('span', {'class': 'meta'}).text.strip().split('\n')])
            pos = athlete.find('div', {'class': 'position'}).text.strip()
            ht = athlete.find('div', {'class': 'metrics'}).text.split('/')[0].strip()
            wt = athlete.find('div', {'class': 'metrics'}).text.split('/')[1].strip()
            rating = athlete.find('span', {'class': 'score'}).text.strip()
            nat_rank = athlete.find('a', {'class': 'natrank'}).text.strip()
            pos_rank = athlete.find('a', {'class': 'posrank'}).text.strip()
            st_rank = athlete.find('a', {'class': 'sttrank'}).text.strip()
            try:
                team = athlete.find('div', {'class': 'status'}).find('img')['title']
            except:
                team = ''

            row = {'Primary Rank': primary_rank,
                   'Other Rank': other_rank,
                   'Name': name,
                   'Link': link,
                   'Highschool': highschool,
                   'Position': pos,
                   'Height': ht,
                   'weight': wt,
                   'Rating': rating,
                   'National Rank': nat_rank,
                   'Position Rank': pos_rank,
                   'State Rank': st_rank,
                   'Team': team}
            rows.append(row)

df = pd.DataFrame(rows)
Output (first 10 rows of 1321):
print(df.head(10).to_string())
Primary Rank Other Rank Name Link Highschool Position Height weight Rating National Rank Position Rank State Rank Team
0 1 1 Quinn Ewers https://247sports.com/Player/Quinn-Ewers-45572600 Southlake Carroll (Southlake, TX) QB 6-3 206 1.0000 1 1 1 Ohio State
1 2 3 Travis Hunter https://247sports.com/Player/Travis-Hunter-46084728 Collins Hill (Suwanee, GA) CB 6-1 165 0.9993 2 1 1 Florida State
2 3 2 Walter Nolen https://247sports.com/Player/Walter-Nolen-46083769 St. Benedict at Auburndale (Cordova, TN) DL 6-4 300 0.9991 3 1 1
3 4 14 Domani Jackson https://247sports.com/Player/Domani-Jackson-46057101 Mater Dei (Santa Ana, CA) CB 6-1 185 0.9966 4 2 1 USC
4 5 10 Zach Rice https://247sports.com/Player/Zach-Rice-46086346 Liberty Christian Academy (Lynchburg, VA) OT 6-6 282 0.9951 5 1 1
5 6 4 Gabriel Brownlow-Dindy https://247sports.com/Player/Gabriel-Brownlow-Dindy-46084792 Lakeland (Lakeland, FL) DL 6-3 275 0.9946 6 2 1
6 7 5 Shemar Stewart https://247sports.com/Player/Shemar-Stewart-46080267 Monsignor Pace (Opa Locka, FL) DL 6-5 260 0.9946 7 3 2
7 8 20 Denver Harris https://247sports.com/Player/Denver-Harris-46081216 North Shore (Houston, TX) CB 6-1 180 0.9944 8 3 2
8 9 33 Travis Shaw https://247sports.com/Player/Travis-Shaw-46057330 Grimsley (Greensboro, NC) DL 6-5 310 0.9939 9 4 1
9 10 23 Devon Campbell https://247sports.com/Player/Devon-Campbell-46093947 Bowie (Arlington, TX) IOL 6-3 310 0.9937 10 1 3
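If you did want the Selenium route the question asked about, clicking "Load More" until it stops appearing would look roughly like the sketch below. This is untested and assumes the button's visible link text is exactly "Load More" and that driver is already on the rankings page:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Click "Load More" until it no longer shows up, then parse the page once
while True:
    try:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.LINK_TEXT, "Load More")))
    except TimeoutException:
        break
    # A JavaScript click avoids overlays intercepting the native click
    driver.execute_script("arguments[0].click();", button)
soup = BeautifulSoup(driver.page_source, 'html.parser')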
I am trying to learn scraping with Selenium while parsing the page_source with BS4's "html.parser". I have all the tags that contain an h2 tag and a class name, but extracting the text in between doesn't seem to work.
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup
opts = webdriver.ChromeOptions()
opts.binary_location = os.environ.get('GOOGLE_CHROME_BIN', None)
opts.add_argument("--headless")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--no-sandbox")
browser = webdriver.Chrome(executable_path="chromedriver", options=opts)
url1='https://www.animechrono.com/date-a-live-series-watch-order'
browser.get(url1)
req = browser.page_source
sou = soup(req, "html.parser")
h = sou.find_all('h2', class_='heading-5')
p = sou.find_all('div', class_='text-block-5')
for i in range(len(h)):
    h[i] == h[i].getText()
for j in range(len(p)):
    p[j] = p[j].getText()
print(h)
print(p)
browser.quit()
My output:
[<h2 class="heading-5">Season 1</h2>, <h2 class="heading-5">Date to Date OVA</h2>, <h2 class="heading-5">Season 2</h2>, <h2 class="heading-5">Kurumi Star Festival OVA</h2>, <h2 class="heading-5">Date A Live Movie: Mayuri Judgement</h2>, <h2 class="heading-5">Season 3</h2>, <h2 class="heading-5">Date A Bullet: Dead or Bullet Movie</h2>, <h2 class="heading-5">Date A Bullet: Nightmare or Queen Movie</h2>]
['Episodes 1-12', 'Date to Date OVA', 'Episodes 1-10', 'Kurumi Star Festival OVA', 'Date A Live Movie: Mayuri Judgement', 'Episodes 1-12', 'Date A Bullet: Dead or Bullet Movie', 'Date A Bullet: Nightmare or Queen Movie']
In your first loop you wrote h[i] == h[i].getText(), which is a comparison rather than an assignment, so h is never modified. Replace that loop with this line before driver.quit():
h = [elem.text for elem in h]
print(h)
Full code:
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup
opts = webdriver.ChromeOptions()
opts.binary_location = os.environ.get('GOOGLE_CHROME_BIN', None)
opts.add_argument("--headless")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--no-sandbox")
browser = webdriver.Chrome(executable_path="chromedriver", options=opts)
url1='https://www.animechrono.com/date-a-live-series-watch-order'
browser.get(url1)
req = browser.page_source
sou = soup(req, "html.parser")
h = sou.find_all('h2', class_='heading-5')
p = sou.find_all('div', class_='text-block-5')
for j in range(len(p)):
    p[j] = p[j].getText()
h = [elem.text for elem in h]
print(h)
browser.quit()
Output:
['Season 1', 'Date to Date OVA', 'Season 2', 'Kurumi Star Festival OVA', 'Date A Live Movie: Mayuri Judgement', 'Season 3', 'Date A Bullet: Dead or Bullet Movie', 'Date A Bullet: Nightmare or Queen Movie']
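Since h and p come from the same page and line up one-to-one (compare the two printed lists above), you can also pair them directly once both are converted to text; a small follow-up sketch using the variables from the code above:

# Pair each season/OVA heading with its matching text block
for heading, text in zip(h, p):
    print(heading, '-', text)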
I cannot iterate through the list of restaurants.
Here is a quick video demonstrating my issue: https://streamable.com/vorg7
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find("div", {"class": "ui cards"})
restaurant_title = zomato_containers[1].find("a", {"class": "result-title hover_feedback zred bold ln24 fontsize0 "}).text
print("restaurant_title: ", restaurant_title)
I expect Python to report that there are 15 restaurants on 1 page, but I am getting 39.
Your code uses bs.find, which returns a single Tag rather than a list, so indexing it with [1] doesn't do what you expect. I changed the class used to locate the results and used the find_all method to get all the snippet cards, which finds 15 restaurants:
CODE:
import requests
from bs4 import BeautifulSoup as soup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
bs = soup(response.text,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
print(len(zomato_containers))
for zc in zomato_containers:
    restaurant_title = zc.find("a", {"class": "result-title"})
    print("restaurant_title: ", restaurant_title.get_text())
RESULT:
15
restaurant_title: Delfina
restaurant_title: Boudin Bakery
restaurant_title: In-N-Out Burger
restaurant_title: Hollywood Cafe
restaurant_title: The Slanted Door
restaurant_title: Tartine Bakery
restaurant_title: The Original Ghirardelli Ice Cream and Chocolate...
restaurant_title: The Cheesecake Factory
restaurant_title: Scoma's
restaurant_title: Boulevard
restaurant_title: Foreign Cinema
restaurant_title: Zuni Café
restaurant_title: Brenda's French Soul Food
restaurant_title: Gary Danko
restaurant_title: Hog Island Oyster Company
I am finishing my PhD, and I need to embed some papers (in PDF format) somewhere in the middle of my R Markdown text.
When converting the R Markdown to PDF, I would like those PDF papers to be embedded in the output.
However, I would also like the embedded pages to be numbered consistently with the rest of the Markdown text.
How can I do it?
UPDATE: New error
By using \includepdf, I get this error:
output file: Tesis_doctoral_-_TEXTO.knit.md
! Undefined control sequence.
l.695 \includepdf
[pages=1-10, angle=90, pagecommand={}]{PDF/Paper1.pdf}
Here is how much of TeX's memory you used:
12157 strings out of 495028
174654 string characters out of 6181498
273892 words of memory out of 5000000
15100 multiletter control sequences out of 15000+600000
40930 words of font info for 89 fonts, out of 8000000 for 9000
14 hyphenation exceptions out of 8191
31i,4n,35p,247b,342s stack positions out of 5000i,500n,10000p,200000b,80000s
Error: Failed to compile Tesis_doctoral_-_TEXTO.tex. See Tesis_doctoral_-_TEXTO.log for more info.
Execution halted
EXAMPLE of the R Markdown code
---
title: Histología dental de los homininos de la Sierra de Atapuerca (Burgos, España)
  y patrón de estrategia de vida
author: "Mario Modesto-Mata"
date: "20 September 2018"
output:
  pdf_document:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 4
  word_document:
    toc: yes
    toc_depth: '4'
  html_document: default
csl: science.csl
bibliography: references.bib
header-includes:
  - \usepackage{pdfpages}
---
```{r opciones_base_scripts, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(captioner)
tabla_nums <- captioner(prefix = "Tabla")
figura_nums <- captioner(prefix = "Figura")
anx_tabla_nums <- captioner(prefix = "Anexo Tabla")
```
# Abstract
Studies of dental development in hominids have been unintentionally biased towards pre-Homo species and some early Homo specimens, which represent the primitive condition with faster dental formation times, relative to later Neanderthals and modern humans, which comparatively have slower formation times.
## PDF Article
\includepdf[pages=1-22, pagecommand={}]{PDF/Paper1.pdf}
## Bayes
Relative dental development was assessed using a Bayesian statistical approach (31).
I had to remove a few things from your example, but after that it worked without problems:
---
title: Histología dental de los homininos de la Sierra de Atapuerca (Burgos, España)
  y patrón de estrategia de vida
author: "Mario Modesto-Mata"
date: "20 September 2018"
output:
  pdf_document:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 4
    keep_tex: yes
  word_document:
    toc: yes
    toc_depth: '4'
  html_document: default
header-includes:
  - \usepackage{pdfpages}
---
# Abstract
Studies of dental development in hominids have been unintentionally biased towards pre-Homo species and some early Homo specimens, which represent the primitive condition with faster dental formation times, relative to later Neanderthals and modern humans, which comparatively have slower formation times.
## PDF Article
\includepdf[pages=1-22, pagecommand={}, scale = 0.9]{Paper1.pdf}
## Bayes
Relative dental development was assessed using a Bayesian statistical approach (31).
BTW, for something like a thesis I would use bookdown, since this gives you cross-referencing etc.
If that does not work for you, I suggest first testing with plain LaTeX, i.e. checking whether the following LaTeX document compiles for you:
\documentclass{article}
\usepackage{pdfpages}
\begin{document}
foo
\includepdf[pages=1-22, pagecommand={}, scale = 0.9]{Paper1.pdf}
bar
\end{document}