Connection error(10061) during web scraping occasionally - pandas

I'm trying to use BeautifulSoup for web scraping. It ran perfectly fine at first, but an error occurred when I ran the same code again.
I then used pd.read_html instead of BeautifulSoup, but the same connection error still occurred (occasionally).
Code I tried:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
f = urllib.request.urlopen(link)
soup = BeautifulSoup(f, 'html.parser')
pf = pd.read_html(link)[0]
Error message:
[WinError 10061] No connection could be made because the target machine
actively refused it

If the website you're accessing doesn't fall within what you're allowed to reach from your connection, the server will refuse the connection. You can still get through using a VPN.
Instead of urllib, go for requests. Install it with pip install requests:
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
f = requests.get(link, headers=headers)
soup = bs(f.text, 'html.parser')
th = [i.text.strip() for i in soup.find_all('th')]
td = [i.text for i in soup.find_all('td')]
print(th, td)
Your pandas code is perfectly fine; just don't use it alongside urllib. If you face the same error, add some delay between your requests with time.sleep, e.g.:
import time
import pandas as pd

while True:
    link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
    pf = pd.read_html(link)[0:10]
    print(pf)
    time.sleep(1)  # delays for 1 second
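If the refusals persist, the two answers above can be combined: fetch the page with requests (sending a browser User-Agent) and hand the downloaded HTML to pd.read_html, retrying with a delay when the connection is refused. A minimal sketch; the retry count and delay values are arbitrary assumptions:

```python
import time

import pandas as pd
import requests

link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def fetch_table(url, retries=3, delay=1.0):
    """Fetch the first HTML table on a page, retrying on refused connections."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return pd.read_html(resp.text)[0]  # parse the HTML we already downloaded
        except requests.ConnectionError:
            time.sleep(delay)  # back off before retrying
    raise RuntimeError('connection still refused after %d attempts' % retries)
```

Parsing resp.text instead of passing the URL to read_html means pandas never opens its own connection, so the custom headers always apply.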

Related

Python 3.10 using "urllib.request.Request" shows unsupported browser message

This is a part of the code I am using:
req = urllib.request.Request(url, headers = user_agent)
Then I have the following commands:
resp = urllib.request.urlopen(req)
resp_data = resp.read()
print(resp_data)
When I read the command line output from print(resp_data) I see the following message:
Loading ... Unsupported Browser ... Please use IE 10+, Microsoft Edge, Chrome, Firefox, or Safari. We apologize for any inconvenience.
Clearly, the website I am requesting treats the browser Python is connecting with as unsupported. I am not sure how to remedy this...
Currently, my user_agent variable is coded as follows:
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
I have researched around and have played with what Google says to be valid user agents; however, I have not found one that works.
I am very new to the urllib module and, honestly, Python in general. Any help would be greatly appreciated!
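One thing worth trying (an assumption on my part, since the site's browser check isn't documented): the User-Agent above stops at the platform token and never names a browser, so the server may not recognize it. A fuller Chrome-style string, like the ones used elsewhere on this page, sometimes clears the "Unsupported Browser" page:

```python
import urllib.request

# Full Chrome-style User-Agent; the version numbers are illustrative only.
user_agent = {'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/73.0.3683.86 Safari/537.36')}

url = 'https://example.com'  # placeholder; the question does not name the actual site
req = urllib.request.Request(url, headers=user_agent)

# Inspect the header that will actually be sent
print(req.get_header('User-agent'))

# resp = urllib.request.urlopen(req)  # then fetch exactly as before
# resp_data = resp.read()
```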

Unable to set a cookie of Github using Selenium Webdriver

I tried to set a cookie for GitHub using Selenium, but it always failed. After deeper analysis, I found that it was throwing an exception when setting a cookie with the name __Host-user_session_same_site. This seems very strange and I would like to know the reason for this phenomenon.
from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service
import json
import time
driveroptions = Options()
driveroptions.use_chromium = True
driveroptions.add_argument('--start-maximized')
driveroptions.binary_location = r'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'
service = Service(
    executable_path=r'C:\Program Files (x86)\Microsoft\Edge\Application\msedgedriver.exe')
driver = webdriver.Edge(options=driveroptions, service=service)
driver.set_page_load_timeout(60)
driver.implicitly_wait(3)
driver.get("https://github.com")
driver.maximize_window()
driver.delete_all_cookies()
with open('cookies.txt', 'r') as f:
    cookies_list = json.load(f)
for cookie in cookies_list:
    cookie['expiry'] = int(time.time() + 10000)
    new_cookie = {k: cookie[k] for k in {'name', 'value', 'domain', 'path', 'expiry'}}
    # if cookie['name'] == '__Host-user_session_same_site':
    #     continue
    driver.add_cookie(new_cookie)
Before that, cookies.txt was exported using f.write(json.dumps(driver.get_cookies())) after I logged into GitHub. If I uncomment the code above, everything works fine. Otherwise, the program throws an exception: selenium.common.exceptions.UnableToSetCookieException: Message: unable to set cookie. I don't quite understand what is so special about cookies with this name (__Host-user_session_same_site).
My runtime environment information is as follows.
MicrosoftEdge=103.0.1264.62
MsEdgeDriver=103.0.1264.62
I would be very grateful if I could get your help.
This cookie is set so that browsers supporting SameSite cookies can check whether a request originates from GitHub.
You will find that only this cookie has sameSite set to Strict, while the others are Lax. That is why everything works when you skip it. You can set this cookie separately by adding this code:
driver.add_cookie({'name':'__Host-user_session_same_site','value': 'itsValue','sameSite':'Strict'})
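Following that explanation, the loop from the question can be adjusted to carry the sameSite (and secure) attributes through instead of dropping them; a sketch, with the cookie dict below assumed to look like an entry exported by driver.get_cookies():

```python
def rebuild_cookie(cookie, expiry):
    """Keep only the fields Selenium accepts, preserving sameSite/secure when present."""
    keep = {'name', 'value', 'domain', 'path', 'secure', 'sameSite'}
    new_cookie = {k: v for k, v in cookie.items() if k in keep}
    new_cookie['expiry'] = expiry
    return new_cookie

# Example input shaped like an entry from driver.get_cookies()
exported = {'name': '__Host-user_session_same_site', 'value': 'itsValue',
            'domain': 'github.com', 'path': '/', 'secure': True,
            'httpOnly': False, 'sameSite': 'Strict'}
print(rebuild_cookie(exported, 1700000000))
```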

Selenium Google Login Blocked in Automation

As of today, a user cannot log in to a Google account using Selenium in a new profile. I found that Google is blocking (rejecting?) the process even when trying with stackauth. (Experienced this after updating to v90.)
This is the answer that I'd posted previously for Google login using OAuth and that was working till very recently!
In short, you'll be logging in indirectly via stackauth.
The only way I could bypass the restrictions was by disabling Secure-App-Access or adding the argument below (which I don't prefer, as I cannot convince the 100+ users of my app to disable that!):
options.add_argument('user-data-dir=C:/Users/{username}/path to data of browser/')
The other way to log in is to use stealth to fake the user agent as DN, which is mentioned here, and it works pretty well.
The major disadvantage I found is that you cannot open another tab while the automation is running; otherwise the process is interrupted. But it works perfectly, with that disadvantage.
Another disadvantage I found is that once you log in, you may still not get the job done, because the website you're visiting restricts you and forces you to update the browser in order to access it (Google Meet in my case).
On the other hand, theoretically, one can open the automation with the user data, but in a new window. I feel it's fairly optimal compared to the others, except OAuth, which was the best way to do it.
Any other optimal working suggestions to bypass these restrictions by Google?
Finally, I was able to bypass Google security restrictions in Selenium successfully and hope it helps you as well. Sharing the entire code here.
In short:
You need to use an old/outdated user agent, and then revert back.
In detail:
Use selenium-stealth to fake the user agent.
Set the user agent to DN initially, before login.
Then, after logging in, revert to a normal user agent (not exactly normal, but Chrome v>80).
That's it.
No need to keep the user data, enable less secure app access, nothing!
Here's the snippet of my code that currently works (it's quite long, though; comments are included for better understanding).
# Import required packages, modules etc.. Selenium is a must!
def login(username, password):  # Logs in the user
    driver.get("https://stackoverflow.com/users/login")
    WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
        (By.XPATH, '//*[@id="openid-buttons"]/button[1]'))).click()
    try:
        WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
            (By.ID, "Email"))).send_keys(username)  # Enters username
    except TimeoutException:
        del username
        driver.quit()
    WebDriverWait(driver, 60).until(expected_conditions.element_to_be_clickable(
        (By.XPATH, "/html/body/div/div[2]/div[2]/div[1]/form/div/div/input"))).click()  # Clicks NEXT
    time.sleep(0.5)
    try:
        try:
            WebDriverWait(driver, 60).until(expected_conditions.presence_of_element_located(
                (By.ID, "password"))).send_keys(password)  # Enters decoded password
        except TimeoutException:
            driver.quit()
        WebDriverWait(driver, 5).until(expected_conditions.element_to_be_clickable(
            (By.ID, "submit"))).click()  # Clicks on Sign-in
    except (TimeoutException, NoSuchElementException):
        print('\nUsername/Password seems to be incorrect, please re-check\nand re-run the program.')
        del username, password
        driver.quit()
    try:
        WebDriverWait(driver, 60).until(lambda webpage: "https://stackoverflow.com/" in webpage.current_url)
        print('\nLogin Successful!\n')
    except TimeoutException:
        print('\nUsername/Password seems to be incorrect, please re-check\nand re-run the program.')
        del username, password
        driver.quit()
USERNAME = input("User Name : ")
PASSWORD = white_password(prompt="Password : ") # A custom function for secure password input.
# Expected and required arguments added here.
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-logging'])
# Assign drivers here.
stealth(driver,
        user_agent='DN',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )  # Before login, using stealth
login(USERNAME, PASSWORD) # Call login function/method
stealth(driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )  # After logging in, revert the user agent to normal.
# Redirecting to Google Meet Web-Page
time.sleep(2)
driver.execute_script("window.open('https://the website that you want to go to')")
driver.switch_to.window(driver.window_handles[1]) # Redirecting to required from stackoverflow after logging in
driver.switch_to.window(driver.window_handles[0]) # This switches to stackoverflow website
driver.close() # This closes the stackoverflow website
driver.switch_to.window(driver.window_handles[0]) # Focuses on present website
Do this:
Install this python module
pip install selenium-stealth
Add this to your code:
from selenium_stealth import stealth
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )
This worked for me.

NSE ACCESS DENIED

I created a basic script in Visual Basic to download data from the NSE website.
While the script still downloads the previous years' data, it gives a download error for the current new year.
The raw URL is https://www.nseindia.com/products/content/equities/equities/archieve_eq.htm. If you choose a date (say, today) and then select the BHAVCOPY report, the site will provide you with a link to download the csv.zip file.
However, if you click on the link directly (https://www.nseindia.com/content/historical/EQUITIES/2017/JAN/cm02JAN2017bhav.csv.zip), the URL returns an error: Access Denied
You don't have permission to access "THE LINK" on this server.
Reference #18.11367a5c.1483362327.35d38c1b
What might be the problem with the change in year?
I was also facing the same issue. I fixed it by adding two HTTP header properties:
"User-Agent" : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
"Referer" : "https://www1.nseindia.com/products/content/equities/equities/archieve_eq.htm"
After a bit of tweaking I noticed it was something to do with the browser. I blocked cookies and everything worked fine.

Neko hxssl not working for HTTPS

I'm working on a bigger project rewrite, with quite a big codebase already written in neko. One of the aspects of the project is a data scraper which would (during peak hours) have 100+ connections open to a WebSockets server. Originally, this was done with lots of nodejs processes running, using a WebSockets npm package. The problem was that this was somewhat unreliable, and would slow down the machine running these processes quite a lot. I hoped to solve this with Threads running in a single neko process.
But I ran into a problem where I didn't expect it: the very awkward support (or lack thereof) for SSL/TLS in Haxe. As I understand it, the only native OpenSSL wrapper available is the hxssl haxelib. I installed it, but it still didn't work with the WebSockets, so I traced the problem to a simpler case: just a single HTTPS connection, like so:
import haxe.Http;
class Main {
    public static function main() {
        var http = new Http("https://www.facebook.com/");
        http.certFolder = 'certs';
        http.certFile = 'certs/ca-certificates.crt';
        http.setHeader("Accept", "text/html,application/xhtml+xml,application/xml");
        http.setHeader("Accept-Language", "en-US");
        http.setHeader("Cache-Control", "max-age=0");
        http.setHeader("Connection", "close");
        http.setHeader("DNT", "1");
        http.setHeader("Upgrade-Insecure-Requests", "1");
        http.setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        http.onData = function(data:String) {
            Sys.println("Data: " + data.substr(0, 50) + " ...");
        }
        http.onError = function(msg:String) {
            Sys.println("Error: " + msg);
        }
        http.onStatus = function(status:Int) {
            Sys.println("Status: " + status);
        }
        http.request(false);
    }
}
The problem is that sometimes the output of this is simply:
Status: 200
Error: Custom((1) : An unknown error has occurred.)
And the worst part is the randomness with which this happens. Sometimes it happens a number of times in a row, even if I don't rebuild the project. I'm running this on an OS X machine at the moment.
The certs folder is filled with certificates copied from the certs on an up-to-date Ubuntu server. I've tried without the certFolder and certFile lines, with pretty much the same results, however.
Any ideas about what could cause this? Writing a better wrapper / native implementation of OpenSSL is probably out of the question; I'm somewhat pressed for time. I tried a cpp build of the above, which failed spectacularly in the sockets code, so I'm not sure I want to go down that road either.
Perhaps you can try the RC of the upcoming 3.3 release; it has built-in SSL/TLS support for Neko/Hxcpp.