How do I make python try the next URL in my file if the current one returns a 404?

I'm having a problem figuring out what code I need to write to make Python try the next URL in my CSV file. Each URL is on its own line, like this:
http://www.indexedamerica.com/states/PR/Adjuntas/Restaurants-Adjuntas-00601.html
http://www.indexedamerica.com/states/PR/Aguada/Restaurants-Aguada-00602.html
http://www.indexedamerica.com/states/PR/Aguadilla/Restaurants-Aguadilla-00603.html
http://www.indexedamerica.com/states/PR/Aguadilla/Restaurants-Aguadilla-00604.html
http://www.indexedamerica.com/states/PR/Aguadilla/Restaurants-Aguadilla-00605.html
http://www.indexedamerica.com/states/PR/Maricao/Restaurants-Maricao-00606.html
http://www.indexedamerica.com/states/MI/Kent/Restaurants-Grand-Rapids-49503.html
#open csv file
#read csv file line by line
#Pass each line to beautiful soup to try
#If URL raises a 404 error continue to next line
#extract tables from url
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import csv

mech = Browser()
indexed = open('C://python27/longlist.csv')
reader = csv.reader(indexed)
html = mech.open(reader)

for line in html:
    try:
        mechanize.open(html)
        table = soup.find("table", border=3)
    else:
        #!!!! try next url from file. How do I do this?
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        BusinessName = col[0].string
        Phone = col[1].string
        Address = col[2].string
        City = col[3].string
        State = col[4].string
        Zip = col[5].string
        Restaurantinfo = (BusinessName, Phone, Address, City, State)
        print "|".join(Restaurantinfo)

for line in html:
    try:
        mechanize.open(html)
        table = soup.find("table", border=3)
    except Exception:
        continue
Alternatively, you could check the status code of the page, and skip if you receive a 404 (in a for loop):
if urllib.urlopen(url).getcode() == 404:
    continue
continue inside a loop stops execution of the rest of that iteration and moves on to the next entry in the loop. (Note that getcode() returns an integer, so compare against 404, not the string '404'.)
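Spelled out inside a loop, that check might look like this (a sketch assuming Python 2's urllib, to match the snippet above, and a urls list holding the URL strings read from the CSV):
import urllib

for url in urls:
    if urllib.urlopen(url).getcode() == 404:
        continue  # got a 404, so skip straight to the next URL
    # otherwise hand the page to BeautifulSoup and extract the table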

Add all the URLs you want to search through to a list. Then loop through the list, opening each URL in sequence. If a given URL returns any kind of error, you can use continue to ignore that URL and move on to the next one.
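Here is a minimal sketch of that approach, using the requests and bs4 packages as stand-ins for mechanize and the old BeautifulSoup (the CSV path is the one from the question; adjust to whichever HTTP client you actually use):
import csv
import requests
from bs4 import BeautifulSoup

urls = []
with open('C://python27/longlist.csv') as indexed:
    for row in csv.reader(indexed):
        urls.append(row[0])                     # one URL per line

for url in urls:
    try:
        response = requests.get(url)
    except requests.RequestException:
        continue                                # network error: move on to the next URL
    if response.status_code == 404:
        continue                                # page missing: move on to the next URL
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find("table", border="3")
    if table is None:
        continue                                # no table on this page: skip it
    # ...extract the rows from table as in the question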

Related

Wait until the blob storage folder is created

I would like to download a picture into a blob folder, and I need to create the folder first.
The code below is what I am doing.
The issue is that the folder needs time to be created: when execution reaches with open(abs_file_name, "wb") as f: it cannot find the folder.
I am wondering whether there is an 'await'-like way to know when the folder creation has completed before doing the write.
for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    abs_file_name = lake_root + file_name
    dbutils.fs.mkdirs(abs_file_name)
    if r.status_code == 200:
        with open(abs_file_name, "wb") as f:
            f.write(r.content)
The final sub folder will not be created when using dbutils.fs.mkdirs() on blob storage.
It creates a file with the final sub folder's name which gets treated as a directory, but it is not actually a directory. Look at the following demonstration:
dbutils.fs.mkdirs('/mnt/repro/s1/s2/s3.csv')
When I try to open this file, the error says that it is a directory.
This might be the issue with the code, so try using the following code instead:
for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    abs_file_name = lake_root + 'fail'  # creates the fake directory (to counter the problem we are facing above)
    dbutils.fs.mkdirs(abs_file_name)
    if r.status_code == 200:
        with open(lake_root + file_name, "wb") as f:
            f.write(r.content)

Add Proxy to Selenium & export dataframe to CSV

I'm trying to make a scraper for Capterra. I keep getting blocked, so I think I need a proxy for my driver.get calls. I'm also having trouble exporting a dataframe to a CSV. The first half of my code (not attached) gets all the links and stores them in a list, which I then access with Selenium to get the information I want; the second part is where I'm having trouble.
For example, these are the types of links I am storing in the plinks list and that the driver is accessing:
https://www.capterra.com/p/212448/Blackbaud-Altru/
https://www.capterra.com/p/80509/Volgistics-Volunteer-Management/
https://www.capterra.com/p/179048/One-Earth/
for link in plinks:
    driver.get(link)
    #driver.implicitly_wait(20)
    companyProfile = bs(driver.page_source, 'html.parser')
    try:
        name = companyProfile.find("h1", class_="sm:nb-type-2xl nb-type-xl").text
    except AttributeError:
        name = "couldn't find"
    try:
        reviews = companyProfile.find("div", class_="nb-ml-3xs").text
    except AttributeError:
        reviews = "couldn't find"
    try:
        location = driver.find_element(By.XPATH, "//*[starts-with(., 'Located in')]").text
    except NoSuchElementException:
        location = "couldn't find"
    try:
        url = driver.find_element(By.XPATH, "//*[starts-with(., 'http')]").text
    except NoSuchElementException:
        url = "couldn't find"
    try:
        features = [x.get_text() for x in companyProfile.select('[id="LoadableProductFeaturesSection"] li span')]
    except AttributeError:
        features = "couldn't find"
    companyInfo.append([name, reviews, location, url, features])

companydf = pd.DataFrame(companyInfo, columns=["Name", "Reviews", "Location", "URL", "Features"])
companydf.to_csv(wmtest.csv, sep='\t')
driver.close()
I am using Mozilla for the webdriver, and I am happy to change to Chrome if it works better, but is it possible to have the webdriver pick from a random set of proxies for each get request?
Thanks!
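For the proxy part, here is a sketch of one possible approach, assuming Selenium's FirefoxOptions and a hypothetical list of proxy addresses. The proxy is fixed when the driver is created, so picking a new proxy per request generally means creating a new driver:
import random
from selenium import webdriver

PROXIES = ["111.111.111.111:8080", "222.222.222.222:3128"]    # hypothetical proxy addresses

def make_driver():
    host, port = random.choice(PROXIES).split(":")
    options = webdriver.FirefoxOptions()
    options.set_preference("network.proxy.type", 1)            # 1 = manual proxy configuration
    options.set_preference("network.proxy.http", host)
    options.set_preference("network.proxy.http_port", int(port))
    options.set_preference("network.proxy.ssl", host)
    options.set_preference("network.proxy.ssl_port", int(port))
    return webdriver.Firefox(options=options)

driver = make_driver()    # recreate this whenever you want to switch to another proxy
Separately, to_csv expects the file name as a string, so the export line should read companydf.to_csv('wmtest.csv', sep='\t').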

Use URLs from List to save zip file

Trying to use urllib.request to read a list of URLs from a shapefile, then download the zips from all those URLs. So far I've got my list of a certain number of URLs, but I am unable to pass all of them through. The error is "expected string or bytes-like object", meaning there's probably an issue with the URL. As a side note, I also need to download them and name them by their file name/#. Need help! Code below.
import arcpy
import urllib.request
import os

os.chdir('C:\\ProgInGIS\\FinalExam\\Final')
lidar_shp = 'C:\\ProgInGIS\\FinalExam\\Final\\lidar-2013.shp'
zip_file_download = 'C:\\ProgInGIS\\FinalExam\\Final\\file1.zip'

data = []
with arcpy.da.SearchCursor(lidar_shp, "*") as cursor:
    for row in cursor:
        data.append(row)
data.sort(key=lambda tup: tup[2])

i = 0
with arcpy.da.UpdateCursor(lidar_shp, "*") as cursor:
    for row in cursor:
        row = data[i]
        i += 1
        cursor.updateRow(row)

counter = 0
url_list = []
with arcpy.da.UpdateCursor(lidar_shp, ['geotiff_ur']) as cursor:
    for row in cursor:
        url_list.append(row)
        counter += 1
        if counter == 18:
            break

for item in url_list:
    print(item)
    urllib.request.urlretrieve(item)
I understand your question this way: you want to download a zip file for each record in a shapefile, from a URL defined in a certain field.
It's easier to use the requests package, which is also recommended in the urllib.request documentation:
The Requests package is recommended for a higher-level HTTP client interface.
Here is an example:
import arcpy, arcpy.da
import shutil
import requests

SHAPEFILE = "your_shapefile.shp"

with arcpy.da.SearchCursor(SHAPEFILE, ["name", "url"]) as cursor:
    for name, url in cursor:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            with open(f"{name}.zip", "wb") as file:
                response.raw.decode_content = True
                shutil.copyfileobj(response.raw, file)
There is another example on GIS StackExchange:
https://gis.stackexchange.com/a/392463/21355
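As for the original "expected string or bytes-like object" error: SearchCursor and UpdateCursor rows are tuples, so url_list ends up holding 1-tuples rather than URL strings, and urllib.request.urlretrieve is handed a tuple. A minimal fix of the original loop might look like this (a sketch; the output file name is simply taken from the end of the URL):
for (url,) in url_list:                          # each row is a 1-tuple holding the URL string
    file_name = url.split('/')[-1]
    urllib.request.urlretrieve(url, file_name)   # save the zip under its own file name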

How to get website to consistently return content from a GET request when it's inconsistent?

I posted a similar question earlier but I think this is a more refined question.
I'm trying to scrape: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0
My code randomly throws errors when I send a GET request to the URL. After debugging, I saw the following happen. A GET request for the following URL will be sent (example URL; it could happen on any page): https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=2400
The webpage will then say "There were no matching transactions found." However, if I refresh the page, the content will then be loaded. I'm using BeautifulSoup and Selenium and have put sleep statements in my code in the hope that it'll work, but to no avail. Is this a problem on the website's end? It doesn't make sense to me how one GET request returns nothing but the exact same request returns something. Also, is there anything I can do to fix it, or is it out of my control?
Here is a sample of my code:
def scrapeWebsite(url, start, stop):
    driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
    print(start, stop)
    madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}
    #for i in range(0, 214025, 25):
    for i in range(start, stop, 25):
        print("Current Page: " + str(i))
        currUrl = url + str(i)
        #print(currUrl)
        #r = requests.get(currUrl)
        #soupPage = BeautifulSoup(r.content)
        driver.get(currUrl)
        #Sleep program for dynamic refreshing
        time.sleep(1)
        soupPage = BeautifulSoup(driver.page_source, 'html.parser')
        #page = urllib2.urlopen(currUrl)
        #time.sleep(2)
        #soupPage = BeautifulSoup(page, 'html.parser')
        info = soupPage.find("table", attrs={'class': 'datatable center'})
        time.sleep(1)
        extractedInfo = info.findAll("td")
The error occurs at the last line: the findAll call complains because info is None when the content is missing (meaning the GET request returned a page with no table).
I worked around this and scraped all the pages using try/except.
The request loop is probably so fast that the page can't keep up.
See the example below; it worked like a charm:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
      '&PlayerMovementChkBx=yes&submit=Search&start=%s'

def scrape(start=0, stop=214525):
    for page in range(start, stop, 25):
        current_url = URL % page
        print('scrape: current %s' % page)
        while True:
            try:
                response = requests.request('GET', current_url)
                if response.ok:
                    soup = BeautifulSoup(response.content.decode('utf-8'), features='html.parser')
                    table = soup.find("table", attrs={'class': 'datatable center'})
                    trs = table.find_all('tr')
                    slice_pos = 1 if page > 0 else 0
                    for tr in trs[slice_pos:]:
                        yield tr.find_all('td')
                    break
            except Exception as exception:
                print(exception)

for columns in scrape():
    values = [column.text.strip() for column in columns]
    # Continue your code ...
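If the diagnosis is right that the loop simply hits the site faster than it can respond, one possible refinement is to pause between attempts and cap them, so a page that genuinely has no table cannot spin forever. A sketch (the delay and retry limit are arbitrary choices):
import time
import requests
from bs4 import BeautifulSoup

def fetch_table(url, retries=5, delay=2):
    # Try a handful of times, sleeping between attempts, and give up with None
    # instead of retrying forever on a page that never returns the table.
    for attempt in range(retries):
        response = requests.get(url)
        if response.ok:
            soup = BeautifulSoup(response.text, 'html.parser')
            table = soup.find("table", attrs={'class': 'datatable center'})
            if table is not None:
                return table
        time.sleep(delay)
    return None
The scrape generator above could then call fetch_table(current_url) and skip the page when it returns None.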

Scrapy: handle Hebrew (non-English) language

I am using Scrapy to scrape a Hebrew website. However, even after encoding the scraped data as UTF-8, I am not able to get the Hebrew characters.
I am getting a weird string (× ×¨×¡×™ בעמ) in the CSV. However, if I print the same item, I can see the correct string on the terminal.
Following is the website I am using.
http://www.moch.gov.il/rasham_hakablanim/Pages/pinkas_hakablanim.aspx
class Spider(BaseSpider):
    name = "moch"
    allowed_domains = ["www.moch.gov.il"]
    start_urls = ["http://www.moch.gov.il/rasham_hakablanim/Pages/pinkas_hakablanim.aspx"]

    def parse(self, response):
        data = {'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$cboAnaf': unicode(140),
                'SearchFreeText:': u'חפש',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtShemKablan': u'',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtMisparYeshut': u'',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtShemYeshuv': u'הקלד יישוב',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtMisparKablan': u'',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$btnSearch': u'חפש',
                'ctl00$ScriptManager1': u'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$UpdatePanel1|ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$btnSearch'}
        yield FormRequest.from_response(response,
                                        formdata=data,
                                        callback=self.fetch_details,
                                        dont_click=True)

    def fetch_details(self, response):
        # print response.body
        hxs = HtmlXPathSelector(response)
        item = MochItem()
        names = hxs.select("//table[@id='ctl00_ctl13_g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d_ctl00_gridRashamDetails']//tr/td[2]/font/text()").extract()
        phones = hxs.select("//table[@id='ctl00_ctl13_g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d_ctl00_gridRashamDetails']//tr/td[6]/font/text()").extract()
        index = 0
        for name in names:
            item['name'] = name.encode('utf-8')
            item['phone'] = phones[index].encode('utf-8')
            index += 1
            print item  # This is printed correctly on the terminal.
            yield item  # If I create a CSV output file, I am not able to see the proper Hebrew string.
The weird thing is, if I open the same CSV in Notepad++, I can see the correct output. So as a workaround, I opened the CSV in Notepad++, changed the encoding to UTF-8, and saved it. Now when I open the CSV in Excel again, it shows me the correct Hebrew string.
Is there any way to specify the CSV encoding from within Scrapy?
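The symptom (the file looks right in Notepad++ but wrong in Excel until it is re-saved as UTF-8) suggests Excel is not detecting the UTF-8 encoding of the exported CSV. In Scrapy releases that have the FEED_EXPORT_ENCODING setting, one option is to export UTF-8 with a byte-order mark, which Excel does recognize. A sketch, assuming such a release:
# settings.py
FEED_EXPORT_ENCODING = 'utf-8-sig'   # UTF-8 with a BOM so Excel detects the encoding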