python - urllib.request.urlretrieve throws unexpected exception "unknown url type: ''" - urllib

I am trying to download files using urllib.request.urlretrieve().
I am using Python 3 and the downloads are successful, but I don't know why it throws an exception for some of the lines.
This is the main file:
import os
import urllib.request

zip_file_open = open("urls.txt")
if not os.path.exists('zip'):
    os.makedirs('zip')
num = 1
true = True
b = true
for i in zip_file_open.read().splitlines():
    try:
        print(str(i))
        #response = urllib.request.urlopen(str(i))
        #print(response)
        #html = response.read()
        urllib.request.urlretrieve(i, "zip/code"+str(num)+".zip")
        if(b):
            num += 1
            b = False
        else:
            b = true
    except Exception as e:
        print("Exception: "+str(e))
        if(b):
            num += 1
            b = False
        else:
            b = true
This is urls.txt:
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c01_code.zip
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c02_code.zip
........
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c25_code.zip
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c26_code.zip
Here is how I create the txt file:
f = open("urls.txt","w")
k = """http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c"""
k1 = """_code.zip"""
import os
for i in range(26):
    if(i<9):
        f.write(k+str(0)+str(i+1)+k1+os.linesep)
    else:
        f.write(k+str(i+1)+k1+os.linesep)
f.close()
Here is the output:
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c01_code.zip
Exception2: unknown url type: ''
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c02_code.zip
Exception3: unknown url type: ''
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c03_code.zip
Exception3: HTTP Error 404: Not Found
........
Exception26: unknown url type: ''
http://media.wiley.com/product_ancillary/50/11188580/DOWNLOAD/c26_code.zip
Exception27: unknown url type: ''
I didn't include all the lines of output as they were the same. The code is functional, but I would like to know how to get rid of the exception.

It looks like you have some blank lines in your file, so urllib throws a ValueError exception when you try to fetch '', which is clearly not a URL. (The blank lines most likely come from writing os.linesep to a file opened in text mode: on Windows each line ending then becomes \r\r\n, and splitlines() yields an empty string after every URL.)
You can fix this error by adding a condition to the loop that skips empty strings:
for i in zip_file_open.read().splitlines():
    if not i.strip():
        continue
    ...
But this won't catch non-empty strings that are not URLs, for example 'not a url'.
A better approach is to check the URL scheme with urllib.parse.urlparse:
for i in zip_file_open.read().splitlines():
    if not urllib.parse.urlparse(i).scheme:
        continue
    ...
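For completeness, here's a minimal sketch of the whole download loop with that scheme check folded in. It assumes the same urls.txt and zip/ layout as the question, and it advances the file counter once per URL so the numbering follows the order in urls.txt even when a download fails:

import os
import urllib.parse
import urllib.request

os.makedirs('zip', exist_ok=True)
with open("urls.txt") as zip_file_open:
    num = 1
    for i in zip_file_open.read().splitlines():
        if not urllib.parse.urlparse(i).scheme:
            continue  # blank or non-URL line: skip it instead of letting urlretrieve raise
        try:
            urllib.request.urlretrieve(i, "zip/code" + str(num) + ".zip")
        except Exception as e:
            print("Exception: " + str(e))
        num += 1  # one counter step per URL, whether or not the download succeeded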

Related

I got an error from calling json() when trying to run my app

Hello, I want to ask about an error I got when I tried to run my Streamlit app. The error shows up on my frontend page, which calls my backend page.
I will show my code too:
Predict = st.button('Predict Satisfacion Rate')
if Predict:
    r = requests.post(URL, json=data)
    res = r.json()
    if res['code'] == 200:
        res2 = (res['result']['description'])
        if res2 == 'Not Satisfied':
            st.markdown('**The Passenger is not Satisfied**')
            col4,col5,col6 = st.columns([1,1,1])
            with col5:
                st.image('Happy.jpg')
        else:
            st.markdown('**The Passenger is Satisfied**')
            col7,col8,col9 = st.columns([1,1,1])
            with col8:
                st.image('NotHappy.jpg')
    else:
        st.write('**Error**')
        st.write(f"Details : {res['result']['description']}")
How can I solve this error?
Thank you.
In your code, the conversion res = r.json() is unnecessary for checking success. Unless you really need the body as JSON somewhere else, you can test the status code directly on the r object as r.status_code.
After if r.status_code == 200:, you can then convert to JSON if you really need to, since at that point you can be confident the server returned a valid response.
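As a rough sketch of that reordering (URL and data here are placeholders standing in for whatever the rest of the app already defines, and the messages are only illustrative):

import requests
import streamlit as st

URL = 'http://localhost:8000/predict'   # placeholder for the backend endpoint
data = {}                               # placeholder for the features collected from the form

if st.button('Predict Satisfaction Rate'):
    r = requests.post(URL, json=data)
    if r.status_code == 200:     # check the HTTP status first
        res = r.json()           # only parse the body once the response looks valid
        description = res['result']['description']
        st.markdown('**The Passenger is ' + description + '**')
    else:
        st.write('**Error**')
        st.write('Details: ' + r.text)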

How to get website to consistently return content from a GET request when it's inconsistent?

I posted a similar question earlier but I think this is a more refined question.
I'm trying to scrape: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0
My code randomly throws errors when I send a GET request to the URL. After debugging, I saw the following happen. A GET request for the following URL will be sent (example URL; it could happen on any page): https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=2400
The webpage will then say "There were no matching transactions found." However, if I refresh the page, the content loads. I'm using BeautifulSoup and Selenium and have put sleep statements in my code in hopes that it would help, but to no avail. Is this a problem on the website's end? It doesn't make sense to me how one GET request can return nothing while the exact same request returns something. Also, is there anything I can do to fix it, or is it out of my control?
Here is a sample of my code:
def scrapeWebsite(url, start, stop):
    driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
    print(start, stop)
    madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}
    #for i in range(0, 214025, 25):
    for i in range(start, stop, 25):
        print("Current Page: " + str(i))
        currUrl = url + str(i)
        #print(currUrl)
        #r = requests.get(currUrl)
        #soupPage = BeautifulSoup(r.content)
        driver.get(currUrl)
        #Sleep program for dynamic refreshing
        time.sleep(1)
        soupPage = BeautifulSoup(driver.page_source, 'html.parser')
        #page = urllib2.urlopen(currUrl)
        #time.sleep(2)
        #soupPage = BeautifulSoup(page, 'html.parser')
        info = soupPage.find("table", attrs={'class': 'datatable center'})
        time.sleep(1)
        extractedInfo = info.findAll("td")
The error occurs at the last line: the call to findAll fails because info is None when the table isn't in the page (meaning the GET request returned an empty result).
I used a workaround to scrape all the pages with try/except.
The request loop is probably so fast that the site can't keep up.
See the example below; it worked like a charm:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
      '&PlayerMovementChkBx=yes&submit=Search&start=%s'

def scrape(start=0, stop=214525):
    for page in range(start, stop, 25):
        current_url = URL % page
        print('scrape: current %s' % page)
        while True:
            try:
                response = requests.request('GET', current_url)
                if response.ok:
                    soup = BeautifulSoup(response.content.decode('utf-8'), features='html.parser')
                    table = soup.find("table", attrs={'class': 'datatable center'})
                    trs = table.find_all('tr')
                    slice_pos = 1 if page > 0 else 0
                    for tr in trs[slice_pos:]:
                        yield tr.find_all('td')
                    break
            except Exception as exception:
                print(exception)
for columns in scrape():
    values = [column.text.strip() for column in columns]
    # Continue your code ...
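If the empty pages really are the site failing to keep up with back-to-back requests, a short pause between retries can reduce the hammering. This is only a variation on the loop above, with fetch_with_retry as a hypothetical helper:

import time
import requests

def fetch_with_retry(url, delay=2, max_tries=5):
    """Retry a GET until the results table shows up, pausing between attempts."""
    for _ in range(max_tries):
        try:
            response = requests.get(url)
            # the "no matching transactions" page has no data table, so treat it as a miss
            if response.ok and 'datatable center' in response.text:
                return response
        except requests.RequestException as exc:
            print(exc)
        time.sleep(delay)  # give the site a moment before the next attempt
    return None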

Problems with selenium 2 and python 3.4.1

I have a simple automation that fills in login form fields. It mostly works, but there's a problem: I need to see an actual result in my console after the script fills the fields, like "Logged in successfully" or "Username not found". I tried many things, but nothing worked. My last attempt was a while loop, and it works great, but only when the result is positive. I wrote a second condition, but when I type incorrect data it drives me crazy to see all these errors in my console. So here's the code and part of the output.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException

baseurl = "http://www.somesite/login"
email = input("Type an email: ")
password = input("Type a password: ")
xpaths = { 'loginBox' : "//input[@id='session_email']",
           'passwordBox' : "//input[@id='session_password']",
           'submitButton' : "//input[@class='ufs-but']",
           'success' : "//div[@class='flash-message success']",
           'error' : "//span[@class='form_error']"
         }
mydriver = webdriver.Firefox()
mydriver.get(baseurl)
mydriver.find_element_by_xpath(xpaths['loginBox']).send_keys(email)
mydriver.find_element_by_xpath(xpaths['passwordBox']).send_keys(password)
mydriver.find_element_by_xpath(xpaths['submitButton']).click()
while mydriver.find_element_by_xpath(xpaths['success']):
    print("Success")
if mydriver.find_element_by_xpath(xpaths['error']):
    print("No")
And here's what I get when I try to intercept an error:
File "ab.py", line 32, in <module>
while mydriver.find_element_by_xpath(xpaths['success']):
File "/usr/local/lib/python3.4/site-packages/selenium-2.43.0-py3.4.egg/selenium/webdriver/remote/webdriver.py", line 230, in find_element_by_xpath
return self.find_element(by=By.XPATH, value=xpath)
File "/usr/local/lib/python3.4/site-packages/selenium-2.43.0-py3.4.egg/selenium/webdriver/remote/webdriver.py", line 662, in find_element
{'using': by, 'value': value})['value']
File "/usr/local/lib/python3.4/site-packages/selenium-2.43.0-py3.4.egg/selenium/webdriver/remote/webdriver.py", line 173, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.4/site-packages/selenium-2.43.0-py3.4.egg/selenium/webdriver/remote/errorhandler.py", line 166, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: 'Unable to locate element: {"method":"xpath","selector":"//div[#class=\'flash-message success\']"}' ; Stacktrace:
at FirefoxDriver.prototype.findElementInternal_ (file:///tmp/tmpjax8kj1u/extensions/fxdriver#googlecode.com/components/driver-component.js:9618:26)
at FirefoxDriver.prototype.findElement (file:///tmp/tmpjax8kj1u/extensions/fxdriver#googlecode.com/components/driver-component.js:9627:3)
at DelayedCommand.prototype.executeInternal_/h (file:///tmp/tmpjax8kj1u/extensions/fxdriver#googlecode.com/components/command-processor.js:11612:16)
at DelayedCommand.prototype.executeInternal_ (file:///tmp/tmpjax8kj1u/extensions/fxdriver#googlecode.com/components/command-processor.js:11617:7)
at DelayedCommand.prototype.execute/< (file:///tmp/tmpjax8kj1u/extensions/fxdriver#googlecode.com/components/command-processor.js:11559:5)
As I said, a successful result isn't a problem.
UPD. I corrected the last part of my code a little bit and now I have this:
while mydriver.find_element_by_xpath(xpaths['success']):
    print("Success")
    break
while mydriver.find_element_by_xpath(xpaths['error']):
    print("No")
    break
And it works, but not how I want. Here's the output when I expect a negative result:
Type an email: w
Type a password: wer
Success
No
As you can see, I want to see 'Success' when the result is positive and 'No' when it's negative, but I don't want to see both at the same time.
UPD. Props to Macro Giancarli for the huge help; here's how I got exactly what I want:
try:
    success = True
    success_element = mydriver.find_element_by_xpath(xpaths['success'])
except NoSuchElementException:
    success = False
    print("Can't log in. Check email and/or password")
try:
    failure = True
    failure_element = mydriver.find_element_by_xpath(xpaths['error'])
except NoSuchElementException:
    failure = False
    print("Logged in successfully")
The problem looks like it's in the way you structure your while loops at the end. You shouldn't need to loop in order to check for success or failure.
Consider that there are four outcomes, assuming that you input the login data: you find the element that indicates success, you find the element that indicates failure, you find both (which should be impossible), or you find neither (probably an unexpected screen, or the page failing to load).
Instead of expecting values to be returned from the webdriver queries, put them in a try block to catch NoSuchElementException and check for non-None contents. Also, handle each of the four cases so that your program crashes less often.
Edit:
Try this.
try:
    success = True
    success_element = mydriver.find_element_by_xpath(xpaths['success'])
except NoSuchElementException:
    success = False

try:
    failure = True
    failure_element = mydriver.find_element_by_xpath(xpaths['error'])
except NoSuchElementException:
    failure = False

# now handle the four possibilities
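A sketch of what that last comment could expand into; the messages are only illustrative:

if success and not failure:
    print("Logged in successfully")
elif failure and not success:
    print("Can't log in. Check email and/or password")
elif success and failure:
    print("Unexpected: both the success and the error element were found")
else:
    print("Neither element was found; the page may not have loaded as expected")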

How do I make python try the next URL in my file if the current one returns a 404?

I'm having a problem figuring out what code I need to make Python try the next URL in my CSV file. Each URL is on its own line, like this:
http://www.indexedamerica.com/states/PR/Adjuntas/Restaurants-Adjuntas-00601.html
http://www.indexedamerica.com/states/PR/Aguada/Restaurants-Aguada-00602.html
http://www.indexedamerica.com/states/PR/Aguadilla/Restaurants-Aguadilla-00603.html
http://www.indexedamerica.com/states/PR/Aguadilla/Restaurants-Aguadilla-00604.html
http://www.indexedamerica.com/states/PR/Aguadilla/Restaurants-Aguadilla-00605.html
http://www.indexedamerica.com/states/PR/Maricao/Restaurants-Maricao-00606.html
http://www.indexedamerica.com/states/MI/Kent/Restaurants-Grand-Rapids-49503.html
#open csv file
#read csv file line by line
#Pass each line to beautiful soup to try
#If URL raises a 404 error continue to next line
#extract tables from url
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import csv

mech = Browser()
indexed = open('C://python27/longlist.csv')
reader = csv.reader(indexed)
html = mech.open(reader)
for line in html:
    try:
        mechanize.open(html)
        table = soup.find("table", border=3)
    else:
        #!!!! try next url from file. How do I do this?
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        BusinessName = col[0].string
        Phone = col[1].string
        Address = col[2].string
        City = col[3].string
        State = col[4].string
        Zip = col[5].string
        Restaurantinfo = (BusinessName, Phone, Address, City, State)
        print "|".join(Restaurantinfo)
for line in html:
    try:
        mechanize.open(html)
        table = soup.find("table", border=3)
    except Exception:
        continue
Alternatively, you could check the status code of the page and skip the URL if you receive a 404 (inside a for loop); note that getcode() returns an integer:
if urllib.urlopen(url).getcode() == 404:
    continue
continue in a loop stops execution of the rest of the loop body and moves on to the next entry in the loop.
Add all the URLs you want to search through to a list. Then loop through the list, opening each URL in sequence. If a given URL returns any kind of error, you can use continue to ignore that URL and move on to the next one.
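Putting those pieces together, here is a rough Python 2 sketch under the question's setup (mechanize plus the old BeautifulSoup, one URL per line in the CSV). The exception class and the column handling are assumptions, not taken from the answers above:

import csv
import mechanize
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
with open('C://python27/longlist.csv') as indexed:
    for row in csv.reader(indexed):
        url = row[0]
        try:
            page = mech.open(url)
        except mechanize.HTTPError:
            continue  # 404 (or any other HTTP error): move on to the next URL
        soup = BeautifulSoup(page.read())
        table = soup.find("table", border=3)
        if table is None:
            continue  # no data table on this page, skip it
        for tr in table.findAll('tr')[2:]:
            col = tr.findAll('td')
            fields = [c.string or '' for c in col[:5]]  # BusinessName, Phone, Address, City, State
            print "|".join(fields)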

python ldap "Bad search filter" error

This filter works just fine in my LDAP browser, but python-ldap won't accept it:
(&(!objectClass=computer)(sn=*%s*))
resulting in:
Request Method: GET
Request URL: http://localhost:8000/ldap_find/%D0%B1%D0%BE%D0%BB%D0%BE%D1%82/
Django Version: 1.4
Exception Type: FILTER_ERROR
Exception Value: {'desc': 'Bad search filter'}
here's the code that does it:
try:
    LDAPClient.connect()
    base = AUTH_LDAP_SEARCH_BASE
    scope = ldap.SCOPE_SUBTREE
    filter = '(&(!objectClass=computer)(sn=*%s*))' % search_string
    result_set = list()
    result = LDAPClient.client.search(base.encode(encoding='utf-8'), scope, filter.encode(encoding='utf-8'), ['cn','mail'])
    res_type, res_data = LDAPClient.client.result(result)
    for data in res_data:
        if data[0]:
            result_set.append(data)
    return json.dumps(result_set)
except Exception, e:
    raise e
finally:
    LDAPClient.unconnect()
it works fine with simple filters, like
filter = 'sn=*%s*' % search_string
so I'm guessing it's some kind of escaping issue with & or something inside the ldap lib, but I can't find the root cause yet.
The search filter syntax is incorrect: the ! (NOT) operator must wrap a fully parenthesized filter, so (!objectClass=computer) is not valid. Use (&(sn=*%s*)(!(objectClass=computer))) instead. Search filters are well documented in RFC 4511 and RFC 4515.
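Applied to the snippet in the question, that looks roughly like this. escape_filter_chars from python-ldap's ldap.filter module additionally escapes any filter metacharacters in the user-supplied search term (the example value is made up):

import ldap.filter

search_string = 'smith'  # example input; in the view this comes from the request
safe_term = ldap.filter.escape_filter_chars(search_string)
# the NOT operand must itself be a parenthesized filter
filter = '(&(sn=*%s*)(!(objectClass=computer)))' % safe_term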