Web Scraping/ Web Crawling

Web Scraping/ Web Crawling - selenium

Can somebody help me figure out how to scrape / crawl this website? https://www.arkansasonline.com/i/lrcrime/
I've downloaded the page source with requests and parsed it with BeautifulSoup, but I can't figure out what's going on.
Here is what I have so far:
#####################################################
import requests
from bs4 import BeautifulSoup
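url = 'https://www.arkansasonline.com/i/lrcrime/'
# `headers` was not defined in the snippet; a basic User-Agent is assumed here
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers).text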
soup = BeautifulSoup(r,'html.parser' )
data = [x.get_text() for x in soup.find_all('td')]
#####################################################
Typically this would do the trick and I'd get a list of all the table cell contents.
But I'm getting:
['SEARCH | DISPATCH LOG | STORIES | HOMICIDES | OLD MAP',
'\n\n\n\nClick here to load this Caspio Cloud Database\nCloud Database by Caspio\n']
Which is far from what I need.
Also, how do I crawl the 3000 pages?
I also tried a macro that records my keystrokes and saves to Google Drive, but the page layout shifts as you move through the pages, which makes that basically impossible. In my opinion they are trying to hide the crime data. I want to scrape it all into one database and release it to the public.

What happens?
Most of the content is loaded dynamically, so you won't get it with requests. Only the first table, which holds some navigation, is served as static HTML; that's why you get it as the result.
There is also a "big" hint you should respect:
This Content is Only Available to Subscribers.
How to fix?
NOTE: Be kind and subscribe to consume the content; that would be the best solution.
Assuming you have done so, go with Selenium, which renders the page as a browser does and therefore also provides the table you are looking for.
You have to wait until the table is loaded:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table[data-cb-name="cbTable"]')))
Then grab the page_source and load it with pandas as a DataFrame.
Example (Selenium 4)
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
service = Service(executable_path='ENTER YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://www.arkansasonline.com/i/lrcrime/')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table[data-cb-name="cbTable"]')))
pd.read_html(driver.page_source)[1]
Output
INCIDENT NUMBER  OFFENSE DESCRIPTION       Location/address         ZIP    INCIDENT DATE
2021-134824      ALL OTHER LARCENY         8109 W 34TH ST           72204  11/1/2021
2021-134815      BURGLARY/B&E              42 NANDINA LN            72210  11/1/2021
2021-134790      ROBBERY                   1800 BROADWAY ST         72206  11/1/2021
2021-134788      AGGRAVATED ASSAULT        11810 PLEASANT RIDGE RD  72223  11/1/2021
2021-134778      THEFT FROM MOTOR VEHICLE  4 HANOVER DR             72209  11/1/2021
...              ...                       ...                      ...    ...

Related

Selenium (Python)- Webscraping verb-conjugation tables (Accessing web elements underneath '#document')

Section 0: Introduction:
This is my first web scraping project and I am not experienced in using Selenium. I am trying to scrape Arabic verb-conjugation tables from the website:
Online Sarf Generator
Any help with the following problem would be great.
Thank you.
Section 1: The Problem:
I am trying to webscrape from the following website:
Online Sarf Generator
For doing this, I am trying to use Selenium.
I basically need to select the three root letters and the family from the four toggle menus as shown in the picture below:
After this, I have to click the 'Generate Sarf Table' button.
Section 2: My Attempt:
Here is my code:
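from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
#------------------ Just Setting Up the web_driver: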
s = Service('/usr/local/bin/chromedriver')
# Set some selenium chrome options:
chromeOptions = Options()
# chromeOptions.headless = False
driver = webdriver.Chrome(service=s, options=chromeOptions)
driver.get('https://sites.google.com/view/sarfgenerator/home')
# I switch the frame once:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
# I switch the frame again:
iframe = driver.find_elements(by=By.CSS_SELECTOR, value='iframe')[0]
driver.switch_to.frame(iframe)
This takes me to the frame within which the webelements that I need are located.
Now, I print the html to see where I am at:
print(BeautifulSoup(driver.execute_script("return document.body.innerHTML;"),'html.parser'))
Here is the output that I get:
<iframe frameborder="0" id="userHtmlFrame" scrolling="yes">
</iframe>
<script>function loadGapi(){var loaderScript=document.createElement('script');loaderScript.setAttribute('src','https://apis.google.com/js/api.js?checkCookie=1');loaderScript.onload=function(){this.onload=function(){};loadGapiClient();};loaderScript.onreadystatechange=function(){if(this.readyState==='complete'){this.onload();}};(document.head||document.body||document.documentElement).appendChild(loaderScript);}function updateUserHtmlFrame(userHtml,enableInteraction,forceIosScrolling){var frame=document.getElementById('userHtmlFrame');if(enableInteraction){if(forceIosScrolling){var iframeParent=frame.parentElement;iframeParent.classList.add('forceIosScrolling');}else{frame.style.overflow='auto';}}else{frame.setAttribute('scrolling','no');frame.style.pointerEvents='none';}clearCookies();clearStorage();frame.contentWindow.document.open();frame.contentWindow.document.write('<base target="_blank">'+userHtml);frame.contentWindow.document.close();}function onGapiInitialized(){gapi.rpc.call('..','innerFrameGapiInitialized');gapi.rpc.register('updateUserHtmlFrame',updateUserHtmlFrame);}function loadGapiClient(){gapi.load('gapi.rpc',onGapiInitialized);}if(document.readyState=='complete'){loadGapi();}else{self.addEventListener('load',loadGapi);}function clearCookies(){var cookies=document.cookie.split(";");for(var i=0;i<cookies.length;i++){var cookie=cookies[i];var equalPosition=cookie.indexOf("=");var name=equalPosition>-1?cookie.substr(0,equalPosition):cookie;document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:00 GMT";document.cookie=name+"=;expires=Thu, 01 Jan 1970 00:00:01 GMT ;domain=.googleusercontent.com";}}function clearStorage(){try{localStorage.clear();sessionStorage.clear();}catch(e){}}</script>
However, the actual html on the website looks like this:
Section 3: The main problem with my approach:
I am unable to access anything under the #document node contained within the iframe.
Section 4: Conclusion:
Is there a possible solution that can fix my current approach to the problem?
Is there any other way to solve the problem described in Section 1?

You put a lot of effort into structuring your question, so I couldn't not answer it, even if it meant double negation.
Here is how you can drill down into the iframe with content:
EDIT: here is how you can select some options, click the button and access the results:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'https://sites.google.com/view/sarfgenerator/home'
driver.get(url)
# drill down through the three nested iframes, outermost first
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@aria-label="Custom embed"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="innerFrame"]')))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="userHtmlFrame"]')))
# wrap the three root-letter dropdowns so options can be chosen by their text
first_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root1"]'))))
second_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root2"]'))))
third_select = Select(wait.until(EC.element_to_be_clickable((By.XPATH, '//select[@id="root3"]'))))
first_select.select_by_visible_text("ج")
second_select.select_by_visible_text("ت")
third_select.select_by_visible_text("ص")
# click the generate button, then wait for the result paragraph
wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@onclick="sarfGenerator(false)"]'))).click()
print('clicked')
result = wait.until(EC.presence_of_element_located((By.XPATH, '//p[@id="demo"]')))
print(result.text)
Result printed in terminal:
clicked
جَتَّصَ يُجَتِّصُ تَجتِيصًا مُجَتِّصٌ
جُتِّصَ يُجَتَّصُ تَجتِيصًا مُجَتَّصٌ
جَتِّصْ لا تُجَتِّصْ مُجَتَّصٌ Highlight Root Letters
The Selenium setup above is for Linux; on another platform, keep the imports and everything after the driver definition, and adjust only the driver setup itself.
Selenium documentation can be found here.

Unable to paginate with selenium-scrapy, only extracting data for first page

I am scraping a website with several pages for the most recent customer ratings.
I am able to interact with the "sortby" option and select "most recent" using Selenium, and to scrape the data for the first page using Scrapy. However, I am unable to extract the data for the other pages; the Selenium web driver somehow does not render the next page. My intention is to automate the data scraping.
I am new to web scraping. A snippet of the code is attached here (some information is removed due to confidentiality).
import scrapy
import selenium.webdriver as webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait,Select
import time
from selenium.webdriver.support import expected_conditions as EC
from scrapy import Selector
from selenium.webdriver.edge.options import Options
from scrapy.utils.project import get_project_settings
class ABC(scrapy.Spider):
    #"........."
    def start_requests(self):
        #" ...... "
        yield scrapy.Request(url)
    def parse(self, response):
        settings = get_project_settings()
        driver_path = settings.get('EDGE_DRIVER_PATH')
        options = Options()
        options.add_argument("headless")
        ser = Service(driver_path)
        driver = webdriver.Edge(service=ser, options=options)
        driver.get(response.url)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "sort-order-dropdown")))
        element_dropdown = driver.find_element(By.ID, "sort-order-dropdown")
        select = Select(element_dropdown)
        select.select_by_value("recent")
        time.sleep(5)
        for review in response.css('[data-hook="review"]'):
            res = {
                "rating": review.css('[class="a-icon-alt"]::text').get(),
            }
            yield res
        next_page = response.xpath('//a[text()="Next page"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
        driver.quit()

It looks like you're using Scrapy and Selenium side by side rather than scrapy_selenium (I don't see any SeleniumRequest in your code).
Your current spider works like this:
1. Get the page using Scrapy
2. Get the same page using the Selenium webdriver
3. Perform some actions using Selenium
4. Parse the Scrapy response (for rating and next_page)
As you can see, you never use or parse the Selenium result.
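One way to act on this, sketched under assumptions rather than offered as the definitive fix: treat the Selenium driver as the only source of HTML, parse driver.page_source after each interaction, and let Selenium move between pages. The names parse_page and click_next are hypothetical callbacks (they are not part of Scrapy or Selenium) that you would implement with your own selectors, and max_pages is an arbitrary safety cap.

```python
from typing import Callable, Iterable, List


def scrape_all_pages(
    driver,
    parse_page: Callable[[str], Iterable[dict]],
    click_next: Callable[[object], bool],
    max_pages: int = 100,
) -> List[dict]:
    """Collect items from every page rendered by a Selenium-like driver.

    driver      -- any object exposing `page_source` (e.g. a Selenium WebDriver)
    parse_page  -- hypothetical callback: rendered HTML -> iterable of items
    click_next  -- hypothetical callback: clicks "Next page" and returns True,
                   or returns False when there is no further page
    """
    items: List[dict] = []
    for _ in range(max_pages):
        # Parse what the browser rendered, not the original Scrapy response.
        items.extend(parse_page(driver.page_source))
        if not click_next(driver):
            break
    return items
```

Inside parse_page, scrapy.Selector(text=html) gives the same .css() and .xpath() interface as a Scrapy response, so the existing selectors for rating can be reused on the Selenium-rendered HTML.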

how to use time sleep to make selenium output consistent

This might be the stupidest question I've asked yet, but this is driving me nuts...
Basically I want to get all links from profiles, but for some reason Selenium returns a different number of links most of the time (sometimes all of them, sometimes only a tenth).
I experimented with time.sleep and I know it's affecting the output somehow, but I don't understand where the problem is.
(But that's just my hypothesis; maybe it's wrong.)
I have no other explanation for the inconsistent output. Since I do get all profile links from time to time, the program is able to find all relevant profiles.
Here's what the output should be (for different GUI inputs):
input: anlagenbau output: 3070
input: Fahrzeugbau output: 4065
input: laserschneiden output: 1311
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException
from urllib.request import urlopen
from datetime import date
from datetime import datetime
import easygui
import re
from selenium.common.exceptions import NoSuchElementException
import time
# input window for the search term (suchbegriff)
suchbegriff = easygui.enterbox("Enter search term | Note: the search term should not contain '/'")
#get date and time
now = datetime.now()
current_time = now.strftime("%H-%M-%S")
today = date.today()
date = today.strftime("%Y-%m-%d")
def get_profile_url(label_element):
    # get the url from a result element
    onlick = label_element.get_attribute("onclick")
    # some regex magic
    return re.search(r"(?<=open\(\')(.*?)(?=\')", onlick).group()
def load_more_results():
    # load more results if needed // use only on the search page!
    button_wrapper = wd.find_element_by_class_name("loadNextBtn")
    button_wrapper.find_element_by_tag_name("span").click()
#### Script starts here ####
# Set some Selenium Options
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
# Webdriver
wd = webdriver.Chrome(options=options)
# Load URL
wd.get("https://www.techpilot.de/zulieferer-suchen?"+str(suchbegriff))
# let's first wait for the iframe
iframe = WebDriverWait(wd, 5).until(
    EC.frame_to_be_available_and_switch_to_it("efficientSearchIframe")
)
# the result parent
result_pane = WebDriverWait(wd, 5).until(
    EC.presence_of_element_located((By.ID, "resultPane"))
)
# get all profile links as a list
time.sleep(5)
href_list = []
wait = WebDriverWait(wd, 15)
while True:
    try:
        #time.sleep(1)
        wd.execute_script("loadFollowing();")
        #time.sleep(1)
        try:
            wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fancyCompLabel")))
        except TimeoutException:
            break
        #time.sleep(1) # somehow influences how many results are found
        result_elements = wd.find_elements_by_class_name("fancyCompLabel")
        #time.sleep(1)
        for element in result_elements:
            url = get_profile_url(element)
            href_list.append(url)
        #time.sleep(2)
        while True:
            try:
                element = wd.find_element_by_class_name('fancyNewProfile')
                wd.execute_script("""var element = arguments[0];element.parentNode.removeChild(element);""", element)
            except NoSuchElementException:
                break
    except NoSuchElementException:
        break
wd.close()  # the parentheses are needed for the call to actually run
print("####links secured: "+str(len(href_list)))

Since you say that the sleep is affecting the number of results, it sounds like they're loading asynchronously and populating as they're loaded, instead of all at once.
The first question is whether you can ask the web site developers to change this, to only show them when they're all loaded at once.
Assuming you don't work for the same company as them, consider:
Is there something else on the page that shows up when they're all loaded? It could be a button or a status message, for instance. Can you wait for that item to appear, and then get the list?
How frequently do new items appear? You could poll for the number of results relatively infrequently, such as only every 2 or 3 seconds, and then consider the results all present when you get the same number of results twice in a row.
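The second suggestion (poll until the count stops changing) can be sketched as follows. This is a minimal illustration, not a tested fix for the site in question: wait_until_stable is a hypothetical helper, get_count is any zero-argument callable that returns the current number of results, and the 2 second interval and 60 second timeout are assumed defaults.

```python
import time
from typing import Callable


def wait_until_stable(
    get_count: Callable[[], int],
    interval: float = 2.0,
    timeout: float = 60.0,
) -> int:
    """Poll get_count() until two consecutive polls return the same value.

    Returns the last observed count. If `timeout` seconds pass first, the
    last count seen is returned even though it never stabilised.
    """
    deadline = time.monotonic() + timeout
    previous = get_count()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_count()
        if current == previous:
            return current
        previous = current
    return previous
```

With the code from the question, a call could look like wait_until_stable(lambda: len(wd.find_elements(By.CSS_SELECTOR, ".fancyCompLabel"))). The trade-off is that a pause in loading longer than the interval is mistaken for completion, so the interval should be generous relative to how slowly the site delivers results.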

The issue is the method presence_of_all_elements_located doesn't wait for all elements matching a passed locator. It waits for presence of at least 1 element matching the passed locator and then returns a list of elements found on the page at that moment matching that locator.
In Java we have
wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(element, expectedElementsAmount));
and
wait.until(ExpectedConditions.numberOfElementsToBe(element, expectedElementsAmount));
With these methods you can wait for a predefined number of elements to appear.
Selenium with Python doesn't provide these methods.
The only thing you can do with Selenium in Python is to build a custom method that performs these checks.
So if you are expecting a certain number of elements, links, etc. to be present on the page, you can use such a method.
This will make your test stable and avoids hard-coded sleeps.
UPD
I have found this solution.
This looks to be the solution for the mentioned above methods.
This seems to be a Python equivalent for wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(element, expectedElementsAmount));
myLength = 9
WebDriverWait(browser, 20).until(lambda browser: len(browser.find_elements(By.XPATH, "//img[@data-blabla]")) > int(myLength))
And this
myLength = 10
WebDriverWait(browser, 20).until(lambda browser: len(browser.find_elements(By.XPATH, "//img[@data-blabla]")) == int(myLength))
is equivalent to Java's wait.until(ExpectedConditions.numberOfElementsToBe(element, expectedElementsAmount));
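The lambda-based waits above can also be wrapped in reusable condition objects, which is roughly how the "custom method" idea might look. This is a sketch under assumptions: the class names number_of_elements_more_than and number_of_elements_to_be are hypothetical (chosen to mirror the Java names) and are not part of Selenium's Python expected_conditions module. A locator is a (by, value) tuple such as (By.XPATH, "//img").

```python
class number_of_elements_more_than:
    """Truthy once more than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator  # e.g. (By.XPATH, "//img")
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # WebDriverWait.until() keeps polling while the return value is falsy.
        return elements if len(elements) > self.count else False


class number_of_elements_to_be:
    """Truthy once exactly `count` elements match `locator` (count > 0)."""

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        return elements if len(elements) == self.count else False
```

Usage would then read WebDriverWait(browser, 20).until(number_of_elements_more_than((By.XPATH, "//img"), 9)), which returns the list of matching elements once the condition holds.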

Selenium fails to load elements, despite EC, waits, and scrolling attempts

With Selenium (3.141), BeautifulSoup (4.7.9), and Python (3.79), I'm trying to scrape which streaming, rental, and buying options are available for a given movie/show. I've spent hours trying to solve this, so any help would be appreciated. Apologies for the poor formatting, in terms of mixing in comments and prior attempts.
Example Link: https://www.justwatch.com/us/tv-show/24
The desired outcome is a BeautifulSoup element that I can then parse (e.g., which streaming services have it, how many seasons are available, etc.),
which has 3 elements (as of now): Hulu, IMDB TV, and DirecTV.
I tried numerous variations, but I only get one of the 3 streaming services for the example link, and even then the result is not consistent. Often, I get an empty object.
Some of the things I've tried include waiting for an expected condition (presence or visibility) and explicitly calling sleep() from the time library. I'm using a Mac (but running Linux via a USB), so there is no "PAGE DOWN" on the physical keyboard. With the Keys module, I've tried control+arrow down, page down, and space (space bar), but on this particular web page they don't work. However, if I'm browsing it in a normal fashion, control+arrow down and space bar help scroll the desired section into view. As far as I know, there is no fn + arrow down option that works in Keys, but that's another way that I can move in a normal fashion.
I've run both headless and regular options to try to debug, as well as trying both Firefox and Chrome drivers.
Here's my code:
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
firefox_options = Options()
firefox_options.add_argument('--enable-javascript') # double-checking to make sure that javascript is enabled
firefox_options.add_argument('--headless')
firefox_driver_path = 'geckodriver'
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=firefox_options)
url_link = 'https://www.justwatch.com/us/tv-show/24'
driver.get(url_link) # initial page
cookies = driver.get_cookies()
Examples of things I've tried around this part of the code
various time.sleep(3) and driver.implicitly_wait(3) commands
webdriver.ActionChains(driver).key_down(Keys.CONTROL).key_down(Keys.ARROW_DOWN).perform()
webdriver.ActionChains(driver).key_down(Keys.SPACE).perform()
This code yields a timeout error when used
stream_results = WebDriverWait(driver, 15)
stream_results.until(EC.presence_of_element_located(
    (By.CLASS_NAME, "price-comparison__grid__row price-comparison__grid__row--stream")))
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser') # 'lxml' didn't work either
Here's code for getting the html related to the streaming services. I've also tried to grab the html code at various levels, ids, and classes of the tree, but the code just isn't there
stream_row = soup.find('div', attrs={'class':'price-comparison__grid__row price-comparison__grid__row--stream'})
stream_row_holder = soup.find('div', attrs={'class':'price-comparison__grid__row__holder'})
stream_items = stream_row_holder\
    .find_all('div', attrs={'class':'price-comparison__grid__row__element__icon'})
driver.quit()

I'm not sure if you are saying your code works in some cases or not at all, but I use chrome and the four find_all() lines at the end all produce results. If this isn't what you mean, let me know. The one thing you may be missing is a time.sleep() that is long enough. That could be the only difference...
Note you need chromedriver to run this code, but perhaps you have chrome and can download chromedriver.exe.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
url = 'https://www.justwatch.com/us/tv-show/24'
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get(url)
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
soup.find_all(class_="price-comparison__grid__row__price")
soup.find_all(class_="price-comparison__grid__row__holder")
soup.find_all(class_="price-comparison__grid__row__element__icon")
soup.find_all(class_="price-comparison__grid__row--stream")
This is the output from the last line:
[<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title price-comparison__promoted__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>,
<div class="price-comparison__grid__row price-comparison__grid__row--stream"><div class="price-comparison__grid__row__title"> Stream </div><div class="price-comparison__grid__row__holder"><!-- --><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="Hulu" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/116305230/s100" title="Hulu"/><div class="price-comparison__grid__row__price"> 9 Seasons <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="IMDb TV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/134049674/s100" title="IMDb TV"/><div class="price-comparison__grid__row__price"> 8 Seasons <!-- --></div></div></div><div class="price-comparison__grid__row__element"><div class="presentation-type price-comparison__grid__row__element__icon"><img alt="DIRECTV" class="jw-provider-icon price-comparison__grid__row__icon" src="https://images.justwatch.com/icon/158260222/s100" title="DIRECTV"/><div class="price-comparison__grid__row__price"> 1 Season <span class="price-comparison__badge price-comparison__badge--hd price-comparison__badge--hd"> HD </span></div></div></div><!-- --></div></div>]
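A possible additional cause of the timeout in the question, offered as an assumption rather than a confirmed diagnosis: By.CLASS_NAME expects a single class name, while the locator in the question passes two names separated by a space. A CSS selector can require both classes at once. The helper classes_to_css below is hypothetical and only builds the selector string.

```python
def classes_to_css(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a CSS selector.

    Each class name gets a leading dot and the names are chained without
    spaces, so the selector matches elements carrying all of them.
    """
    return tag + "".join("." + name for name in class_attr.split())


selector = classes_to_css(
    "price-comparison__grid__row price-comparison__grid__row--stream", tag="div"
)
print(selector)
# -> div.price-comparison__grid__row.price-comparison__grid__row--stream
```

The resulting string could then be passed as EC.presence_of_element_located((By.CSS_SELECTOR, selector)) in place of the By.CLASS_NAME locator.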

selenium scraper works, but after some time chrome says "This site can't be reached"

I'm scraping the US Patent website. Their robots.txt has no restrictions when it comes to scraping, but after a few hundred requests Chrome reports "This site can't be reached".
I clear cookies after each search request, and I have also tried using different proxies. Any ideas as to why this is happening? My code works fine, but after 10-20 minutes of scraping I get this error.
Here's my code, though I don't think it will be very helpful, as it works fine up to this point.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
import time
import pandas as pd
from fake_useragent import UserAgent
from webdriver_manager.chrome import ChromeDriverManager
PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(executable_path=PATH)
num_rows = 50000
df = pd.read_csv('company_names.csv').head(500)
df_new = pd.DataFrame(index=range(num_rows),columns=['company_name','link','patent title','abstract','company_id'])
row_number = 0
for company in df['company_name']:
    company_id = df.loc[df.company_name == company, 'company_id'].values[0]
    print(company_id)
    df_new.iloc[row_number,4]=str(company_id)
    print(company)
    df_new.iloc[row_number,0]=str(company)
    driver.get("http://patft.uspto.gov/netahtml/PTO/")
    driver.get("http://patft.uspto.gov/netahtml/PTO/search-adv.htm")
    search_box = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/center/form/table/tbody/tr[1]/td[1]/textarea")))
    print('found search box')
    search_box.send_keys("AN/"+'"'+str(company)+'"')
    search_button = driver.find_element_by_xpath("/html/body/center/form/table/tbody/tr[2]/td[2]/input[1]").click()
    #multiple results
    check_table = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/table/tbody/tr[1]/th[1]")))
    if check_table.text == 'PAT. NO.':
        #multiple links
        rows = driver.find_elements_by_xpath("/html/body/table/tbody/tr")
        num_patents = len(rows)-1
        min_patents = min(10,num_patents)
        for row in range(min_patents):
            df_new.iloc[row_number,4]=str(company_id)
            df_new.iloc[row_number,0]=str(company)
            title_link = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH,"/html/body/table/tbody/tr["+str(row+2)+"]/td[4]/a")))
            link = title_link.get_attribute('href')
            print(str(link))
            title_text = title_link.text
            print(title_text)
            df_new.iloc[row_number,1] = str(link)
            df_new.iloc[row_number,2] = str(title_text)
            title_link.click()
            abstract = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/p[1]")))
            print(abstract.text)
            df_new.iloc[row_number,3] = str(abstract.text)
            row_number += 1
            driver.back()
    #get patent abstract data
    elif check_table.text == 'Inventors:':
        #one link
        df_new.iloc[row_number,4]=str(company_id)
        df_new.iloc[row_number,0]=str(company)
        abstract = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"/html/body/p[1]")))
        link = driver.current_url
        df_new.iloc[row_number,1] = str(link)
        abstract_text = abstract.text
        title = driver.find_element_by_xpath('/html/body/font')
        title_text = title.text
        print(title_text)
        df_new.iloc[row_number,2] = str(title_text)
        print(abstract_text)
        df_new.iloc[row_number,3] = str(abstract_text)
        row_number += 1
    driver.delete_all_cookies()
df_new.to_csv('patent_results.csv')

Terms of Use for USPTO websites:
USPTO’s online databases are not designed or intended to be a source for bulk downloads of USPTO data when accessed through the website’s interfaces. Individuals, companies, IP addresses, or blocks of IP addresses who, in effect, deny or decrease service by generating unusually high numbers of database accesses (searches, pages, or hits), whether generated manually or in an automated fashion, may be denied access to USPTO servers without notice.
next paragraph:
Bulk data products may be separately obtained from the USPTO, either for free or at the cost of dissemination. For details, see information on Electronic Bulk Data Products.
