Python 3.10 using "urllib.request.Request" shows unsupported browser message - urllib

This is a part of the code I am using:
req = urllib.request.Request(url, headers = user_agent)
Then I have the following commands:
resp = urllib.request.urlopen(req)
resp_data = resp.read()
print(resp_data)
When I read the command line output from print(resp_data) I see the following message:
Loading\n \n\n \n Unsupported Browser\n Please use IE 10+, Microsoft Edge, Chrome, Firefox, or Safari.\n We apologize for any inconvenience.
Clearly, the website I am requesting sees the browser Python is connecting with as unsupported. I am not sure how to remedy this...
Currently, my user_agent variable is coded as follows:
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
I have researched around and experimented with user agents that Google says are valid; however, I have not found one that works.
I am very new to the urllib module and, honestly, Python in general. Any help would be greatly appreciated!
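For reference, sites that show this message often check for a full browser-style User-Agent string. Here is a minimal sketch with urllib; the URL is a placeholder and the Chrome-style UA string is just one illustrative value (the actual network call is commented out):

```python
import urllib.request

# Sketch only: example.com stands in for the real site, and the UA string
# below is an illustrative full Chrome-style value.
url = "https://example.com/"
user_agent = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
    )
}
req = urllib.request.Request(url, headers=user_agent)
print(req.get_header("User-agent"))  # urllib normalizes the header key
# resp = urllib.request.urlopen(req)  # network call omitted in this sketch
```

Whether a given string works depends entirely on how the target site validates it; some sites inspect more than the User-Agent header.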

Related

Selenium Google Login Blocked in Automation

As of today, a user cannot log in to a Google account using Selenium in a new profile. I found that Google is blocking (rejecting?) the process even when trying via stackauth. (Experienced this after updating to v90.)
This is the answer that I'd posted previously for Google login using OAuth, and that was working until very recently!
In short, you'll be logging in indirectly via stackauth.
The only way I found to bypass the restrictions is by disabling Secure-App-Access or adding the argument given below. (Which I don't prefer, as I cannot convince my users (100+) who use my app to disable that!)
options.add_argument('user-data-dir=C:/Users/{username}/path to data of browser/')
The only other way to log in is by using stealth to fake the user agent to DN, which is mentioned here, and it works pretty well.
The major disadvantage I found was that you cannot open another tab while the automation is running, or the process is interrupted. But aside from that disadvantage, it works perfectly.
However, the further disadvantage I found was that once you log in, you cannot get your job done, as the website you're visiting restricts you and forces you to update the browser in order to access it (Google Meet in my case).
On the other hand, theoretically, one can open up the automation with the user data, but in a new window. And I feel it's pretty optimal compared to the others, except OAuth, as that was the best way to do it.
Any other optimal working suggestions to bypass these restrictions by Google?
Finally, I was able to bypass Google security restrictions in Selenium successfully and hope it helps you as well. Sharing the entire code here.
In short:
You need to use an old/outdated user agent, then revert back.
In detail:
Use selenium-stealth for faking the user agent.
Set user-agent to DN initially, before login.
Then, after logging in, revert back to a normal one (not really "normal", but Chrome v>80).
That's it.
No need to keep the user data, enable less secure app access, nothing!
Here's the snippet of my code that currently works; it's quite long, though! (Comments included for better understanding.)
# Import required packages, modules etc. Selenium is a must!

def login(username, password):  # Logs in the user
    driver.get("https://stackoverflow.com/users/login")
    WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
        (By.XPATH, '//*[@id="openid-buttons"]/button[1]'))).click()
    try:
        WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
            (By.ID, "Email"))).send_keys(username)  # Enters username
    except TimeoutException:
        del username
        driver.quit()
    WebDriverWait(driver, 60).until(expected_conditions.element_to_be_clickable(
        (By.XPATH, "/html/body/div/div[2]/div[2]/div[1]/form/div/div/input"))).click()  # Clicks NEXT
    time.sleep(0.5)
    try:
        try:
            WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
                (By.ID, "password"))).send_keys(password)  # Enters decoded Password
        except TimeoutException:
            driver.quit()
        WebDriverWait(driver, 5).until(expected_conditions.element_to_be_clickable(
            (By.ID, "submit"))).click()  # Clicks on Sign-in
    except (TimeoutException, NoSuchElementException):
        print('\nUsername/Password seems to be incorrect, please re-check\nand Re-Run the program.')
        del username, password
        driver.quit()
    try:
        WebDriverWait(driver, 60).until(lambda webpage: "https://stackoverflow.com/" in webpage.current_url)
        print('\nLogin Successful!\n')
    except TimeoutException:
        print('\nUsername/Password seems to be incorrect, please re-check\nand Re-Run the program.')
        del username, password
        driver.quit()

USERNAME = input("User Name : ")
PASSWORD = white_password(prompt="Password : ")  # A custom function for secure password input, explained at the end.

# Expected and required arguments added here.
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)

# Assign drivers here.
stealth(driver,
        user_agent='DN',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )  # Before login, using stealth
login(USERNAME, PASSWORD)  # Call login function/method
stealth(driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )  # After logging in, revert the user agent back to normal.

# Redirecting to the Google Meet web page
time.sleep(2)
driver.execute_script("window.open('https://the website that you want to go to.')")
driver.switch_to.window(driver.window_handles[1])  # Switches to the required website after logging in
driver.switch_to.window(driver.window_handles[0])  # Switches back to the stackoverflow tab
driver.close()  # Closes the stackoverflow tab
driver.switch_to.window(driver.window_handles[0])  # Focuses on the remaining tab
Click here to learn about white_password.
Do this:
Install this Python module:
pip install selenium-stealth
Add this to your code:
from selenium_stealth import stealth

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )
This worked for me.

Connection error(10061) during web scraping occasionally

I'm trying to use BeautifulSoup to do web scraping. It ran perfectly fine at first, but an error occurred when I ran the same code again.
Then I used pd.read_html instead of BeautifulSoup to do the scraping, but the same connection error occurred (occasionally).
Code I tried:
link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
f = urllib.urlopen(link)
soup = BeautifulSoup(f,'html.parser')
pf = pd.read_html(link)[0]
Error message:
[Error no 10061]No connection could be made because the target machine
actively refused it
If you're simultaneously accessing a website that does not fall into the category of websites you should access from your connection, the server will refuse the connection. You can still do it using a VPN.
Instead of urllib, go for requests. Install it with pip install requests:
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
f = requests.get(link, headers=headers)
soup = bs(f.text, 'html.parser')
th = [i.text.strip() for i in soup.find_all('th')]
td = [i.text for i in soup.find_all('td')]
print(th, td)
Your pandas code is perfectly fine; just don't use it alongside urllib. If you face the same error, incorporate some delay between your requests with time.sleep, e.g.:
import time
import pandas as pd

link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
while True:
    pf = pd.read_html(link)[0:10]
    print(pf)
    time.sleep(1)  # delays for 1 second

NSE ACCESS DENIED

I created a basic program in Visual Basic to download data from the NSE website.
While the code still downloads the previous years' data, it gives a download error for the current new year.
The RAW URL is https://www.nseindia.com/products/content/equities/equities/archieve_eq.htm If you choose a date (say today) and then select BHAVCOPY report, the site will provide you with a link to download the csv.zip file.
However, if you click on the link directly (https://www.nseindia.com/content/historical/EQUITIES/2017/JAN/cm02JAN2017bhav.csv.zip), the URL returns an error: Access Denied
You don't have permission to access "THE LINK" on this server.
Reference #18.11367a5c.1483362327.35d38c1b
What might be the problem with change in year?
I was also facing the same issue. I fixed it by adding two HTTP header properties:
"User-Agent" : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
"Referer" : "https://www1.nseindia.com/products/content/equities/equities/archieve_eq.htm"
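In Python, setting those two headers might look like the sketch below; the header values come from this answer and the URL is the bhavcopy link from the question (the actual download call is commented out):

```python
import urllib.request

# Header values are the ones suggested above; the URL is the link
# from the question.
url = ("https://www.nseindia.com/content/historical/EQUITIES/2017/JAN/"
       "cm02JAN2017bhav.csv.zip")
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 "
                   "(KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"),
    "Referer": "https://www1.nseindia.com/products/content/equities/equities/archieve_eq.htm",
}
req = urllib.request.Request(url, headers=headers)
print(req.get_header("Referer"))
# data = urllib.request.urlopen(req).read()  # actual download omitted here
```

The Referer header matters here because the server appears to reject direct downloads that don't look like they came from the archive page.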
After a bit of tweaking I noticed it was something to do with the browser. I blocked cookies and everything is working fine.

Neko hxssl not working for HTTPS

I'm working on a bigger project rewrite, with quite a big codebase already written in neko. One of the aspects of the project is a data scraper which would (during peak hours) have 100+ connections open to a WebSockets server. Originally, this was done with lots of nodejs processes running, using a WebSockets npm package. The problem was that this was somewhat unreliable, and would slow down the machine running these processes quite a lot. I hoped to solve this with Threads running in a single neko process.
But I ran into a problem where I didn't expect it – the very awkward support (or lack thereof) for SSL / TLS in Haxe. As I understand it, the only native OpenSSL wrapper available is the hxssl haxelib. I installed it, but it still didn't work with the WebSockets, so I traced the problem to a simpler case – just a single HTTPS connection, like so:
import haxe.Http;

class Main {
    public static function main() {
        var http = new Http("https://www.facebook.com/");
        http.certFolder = 'certs';
        http.certFile = 'certs/ca-certificates.crt';
        http.setHeader("Accept", "text/html,application/xhtml+xml,application/xml");
        http.setHeader("Accept-Language", "en-US");
        http.setHeader("Cache-Control", "max-age=0");
        http.setHeader("Connection", "close");
        http.setHeader("DNT", "1");
        http.setHeader("Upgrade-Insecure-Requests", "1");
        http.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        http.onData = function(data:String) {
            Sys.println("Data: " + data.substr(0, 50) + " ...");
        }
        http.onError = function(msg:String) {
            Sys.println("Error: " + msg);
        }
        http.onStatus = function(status:Int) {
            Sys.println("Status: " + status);
        }
        http.request(false);
    }
}
The problem is that sometimes the output of this is simply:
Status: 200
Error: Custom((1) : An unknown error has occurred.)
And the worst part is the randomness with which this happens. Sometimes it happens a number of times in a row, even if I don't rebuild the project. I'm running this on an OS X machine at the moment.
The certs folder is filled with certificates copied from the certs on an up-to-date Ubuntu server. I've tried without the certFolder and certFile lines, with pretty much the same results, however.
Any ideas about what could cause this? Writing a better wrapper / native implementation of OpenSSL is probably out of question, I'm somewhat pressed for time. I tried a cpp build of the above, which failed spectacularly with Sockets code, I'm not sure I want to go down that road either.
Perhaps you can try the RC for the upcoming 3.3 release; it has built-in Neko/Hxcpp support for SSL/TLS.

Selenium error - Cannot navigate to invalid URL

I get the following error :
unknown error: unhandled inspector error:
{"code":-32603,"message":"Cannot navigate to
invalid URL"} (Session info: chrome=29.0.1547.57) (Driver info:
chromedriver=2.2,platform=Windows NT 6.1 SP1 x86_64)
I think it's got to do with the Chrome browser's latest update (version 29), released about two days ago.
Note: my chromedriver is up to date (2.2).
Please let me know what I should do to fix it.
I received the same error while using Selenium on python. Prepending the destination url with http:// solved my problem:
self.driver.get("http://"+url.rstrip())
This literally happens because the url you are passing in is using an invalid format.
Try the following debug code, where ourUrl is the String of the URL you are trying to connect to:
System.out.println("!!URL " +ourUrl);
driver.get(ourUrl);
for me it was printing out: !!URL "http://www.salesforce.com"
And the problem was that there were quotes around the url. In your case it may be something similar. Once you properly format the url, it will work.
I met this error just minutes ago, but I solved it by adding "https://" to the front of the URL. Hope it works for you too. Good luck!
If the error message is "invalid url", check the url on the webpage that you are trying to access, then compare it to what prints out when you do something like:
System.out.println(url);
When Selenium tries to open up a webpage, it needs the exact URL. It won't infer the Hyper Text Transfer Protocol (http:// or https://). In other words, if you call driver.get(url) and url is www.myurl.com, it will likely fail because http or https was not appended.
//Append the Hyper Text Transfer Protocol to the url
driver.get("http://" + url);
If you are getting your urls from a list, or from a file, and you know which protocol your website page(s) use (http:// or https://), you can do something like:
public static void getURLByDriverFromList(List<String> urls) {
    for (String url : urls) {
        if (!url.contains("http://")) {
            url = "http://" + url;
        }
        driver.get(url);
    }
}
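Since several of the answers here use Python, the same normalization can be sketched as a small helper (the function name is illustrative, not from any library):

```python
def ensure_scheme(url: str, default_scheme: str = "http") -> str:
    """Prepend a scheme if the URL lacks one, so driver.get() accepts it."""
    if not url.startswith(("http://", "https://")):
        return f"{default_scheme}://{url}"
    return url

print(ensure_scheme("www.myurl.com"))       # http://www.myurl.com
print(ensure_scheme("https://a.example"))   # unchanged
```

Checking with startswith rather than contains also avoids mangling URLs that embed "http://" somewhere in a query string.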
You can use an absolute path as mentioned in other comments or, if it is an internal link/button on the page, you can map it as a WebElement and perform the click() method.
I had exactly the same error but it was due to a parsing issue in Python Behave BDD.
For example, if I have the following feature syntax
Given the user is on <page> using <url>
and my examples syntax has
Examples: Pages
| page | url |
| Mobile App using Guide | https://www.example.com |
See how I have the word using between my variables in the Given statement and also used using in the page title Mobile App using Guide. Because of this, the word Guide gets appended to the url and Selenium returns the invalid url error.
If you're using Behave or possibly any BDD with Gherkin syntax, avoid using the same keyword in between variables from the Given, When, Then statements in the Example table.
Try the following code; it's working for me.
WebDriver driver = new ChromeDriver();
driver.get("https://www.facebook.com");
Add https to your URL.