Python web crawler using urllib

Python web crawler using urllib - beautifulsoup

I am trying to extract the data from a webpage/website. Here's my code:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
webpage=urlopen('http://www.xxxxxxxxx.com').read()
patFinderTitle=re.compile('<title>(.*)</title>')
patFinderLink=re.compile('<link rel.*href="(.*)"/>')
findPatTitle=re.findall(patFinderTitle,webpage)
findPatLink=re.findall(patFinderLink,webpage)
listIterator=[]
listIterator[:]=range(2,16)
for i in listIterator:
print findPatTitle[i]
print findPatLink[i]
print "\n"
articlepage=urlopen(findPatLink[i]).read()
divbegin=articlepage.find('<div class="">')
article=articlepage[divbegin:(divbegin+1000)]
soup=BeautifulSoup(article)
paralist=soup.findAll('<p>')
for i in paralist:
print i
I want to list the title and all the links in the webpage. When I run the script it throws an error:
Traceback (most recent call last):
File "justdialcrawl.py", line 21, in <module>
print findPatTitle[i]
IndexError: list index out of range
I tried searching Google but I could not find answers.

You forgot one minor thing:
webpage=urlopen('http://www.xxxxxxxxx.com').read()
# this -> ^^^^^^^
Your code just generated an urlopen object and assigned it to webpage. To assign the contents of the page, you need .read().

Related

Question about insert from TK form to csv file

I am trying to get some info from TK form that i built to new CSV file, unfortunately i am getting some Errors.
The code:
def sub_func():
with open('Players.csv','w') as df:
df = pd.DataFrame
i=len(df.index)
data=[]
data.append(entry_box1.get())
data.append(entry_box2.get())
data.append(entry_box3.get())
data.append(entry_box4.get())
if i==4:
df.loc[i,:]=data
The Errors:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\tkinter\_init.py", line 1883, in __call_
return self.func(*args)
File "C:/Users/user/PycharmProjects/test11/Final Project/registration form.py", line 23, in sub_func
i=len(df.index)
TypeError: object of type 'pandas._libs.properties.AxisProperty' has no len()

You have two problems in your code:
The file is opened as df, but df is overwritten on the next line
df = pd.DataFrame does not return anything as it would need parenthesis () to create a dataframe. This means that df does not exist -> error comes when you try to take length of 'nothing'.
Solution:
df = pd.read_csv('./Players') # Make sure that the file is in the same directory
i = len(df.index)
Not sure what you are trying to achieve, but you might need to work on the rest of the code, too.

Cant fint next button on linkedin

I am trying to scrape LinkedIn website using selenium and Beautiful Soup.
The idea is simple, I started in the company's LinkedIn website then go to company search and scroll to the bottom of the page to get all results on the page.
But because Linkedin just provides 10 people per page so I need to find the next button on that page to go to the next 10 people.
I use this code
browser.find_element_by_class_name('next')
Traceback (most recent call last): File "LinkedInWebcrawler2019.py",
line 71, in
nextt = browser.find_element_by_class_name('next') File "C:\Users\Afdal\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webdriver.py",
line 564, in find_element_by_class_name
return self.find_element(by=By.CLASS_NAME, value=name) File "C:\Users\Afdal\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webdriver.py",
line 978, in find_element
'value': value})['value'] File "C:\Users\Afdal\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\webdriver.py",
line 321, in execute
self.error_handler.check_response(response) File "C:\Users\Afdal\AppData\Roaming\Python\Python37\site-packages\selenium\webdriver\remote\errorhandler.py",
line 242, in check_response
raise exception_class(message, screen, stacktrace) selenium.common.exceptions.NoSuchElementException: Message: no such
element: Unable to locate element: {"method":"css
selector","selector":".next"} (Session info: chrome=77.0.3865.75)
Any ideas?

Try this for finding the next button.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Next =WebDriverWait(driver, 10).until(EC.element_to_be_clickable(By.XPATH,'//*[contains(#class,'button')'))
Next.click()

SQL CE connection with Python

Problem Statement: Extract data stored in the .sdf file to python.
System config: Win 10 Pro 64-bit, python 3.5.2-64 bit, adodbapi library, SQL CE 3.5
I am fairly new to programming and I have picked up Python as my first language to learn. Currently, I have hit a wall in process of connecting a SQL CE 3.5 .sdf file.
I have used the adodbapi library. I have searched the web extensively over the past week to find a solution to this problem and to make sure that my connection string is correct. I have tried multiple options/solutions provided on stack overflow and https://www.connectionstrings.com/microsoft-sqlserver-ce-oledb-3-5/.
Code:
import adodbapi
cons_str = "Provider=Microsoft.SQLSERVER.MOBILE.OLEDB.3.5;" \
"Data Source=D:\Work\Programming\Python\SQL_DataTransfer\LF.sdf;"\
"Persist Security Info=False;" \
"SSCE:Max Database Size=4091"
connection = adodbapi.connect(cons_str)
print(connection)
Error Message:
Traceback (most recent call last):
File "D:\Work\Programs\Python35.virtualenvs\sql_output\lib\site-packages\adodbapi\adodbapi.py", line 93, in make_COM_connecter
c = Dispatch('ADODB.Connection') #connect after CoIninialize v2.1.1 adamvan
NameError: name 'Dispatch' is not defined
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Work\Programs\Python35.virtualenvs\sql_output\lib\site-packages\adodbapi\adodbapi.py", line 112, in connect
co.connect(kwargs)
File "D:\Work\Programs\Python35.virtualenvs\sql_output\lib\site-packages\adodbapi\adodbapi.py", line 269, in connect
self.connector = connection_maker()
File "D:\Work\Programs\Python35.virtualenvs\sql_output\lib\site-packages\adodbapi\adodbapi.py", line 95, in make_COM_connecter
raise api.InterfaceError ("Windows COM Error: Dispatch('ADODB.Connection') failed.")
adodbapi.apibase.InterfaceError: Windows COM Error: Dispatch('ADODB.Connection') failed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/Work/Programming/Python/SQL_DataTransfer/SQL_CE_reportDB.py", line 8, in
connection = adodbapi.connect(cons_str)
File "D:\Work\Programs\Python35.virtualenvs\sql_output\lib\site-packages\adodbapi\adodbapi.py", line 116, in connect
raise api.OperationalError(e, message)
adodbapi.apibase.OperationalError: (InterfaceError("Windows COM Error: Dispatch('ADODB.Connection') failed.",), 'Error opening connection to "Provoider=Microsoft.SQLSERVER.MOBILE.OLEDB.3.5;Data Source=D:\Work\Programming\Python\SQL_DataTransfer\LF.sdf;Persist Security Info=False;SSCE:Max Database Size=4091"')
At this point any help is much appreciated.
Thank you,
Sincerely,
JD.

Looks like you have a typo:
Provoider => Provider

adodbapi version = '2.6.0.6' depends on pypiwin32 to be installed in your Python environment.
For adodbapi.py, from line 51:
if api.onIronPython:
from System import Activator, Type, DBNull, DateTime, Array, Byte
from System import Decimal as SystemDecimal
from clr import Reference
def Dispatch(dispatch):
type = Type.GetTypeFromProgID(dispatch)
return Activator.CreateInstance(type)
def getIndexedValue(obj,index):
return obj.Item[index]
else: # try pywin32
try:
import win32com.client
import pythoncom
import pywintypes
onWin32 = True
def Dispatch(dispatch):
return win32com.client.Dispatch(dispatch)
except ImportError:
import warnings
warnings.warn("pywin32 package (or IronPython) required for adodbapi.",ImportWarning)
def getIndexedValue(obj,index):
return obj(index)
In my situation, I traced the fact that the Dispatch function was not defined because an ImportError exception is generated at line 62, (import win32com.client) caught in the except block but for some reason the warning message was not displayed in my console.
Try:
pip install pypiwin32
and the ImportError exception described above should not be raised anymore.

AttributeError: 'Context' object has no attribute 'browser'

I am currently experimenting with Behavioral Driven Development. I am using behave_django with selenium. I get the following output
Creating test database for alias 'default'...
Feature: Open website and print title # features/first_selenium.feature:1
Scenario: Open website # features/first_selenium.feature:2
Given I open seleniumframework website # features/steps/first_selenium.py:2 0.001s
Traceback (most recent call last):
File "/home/vagrant/newproject3/newproject3/venv/local/lib/python2.7/site-packages/behave/model.py", line 1456, in run
match.run(runner.context)
File "/home/vagrant/newproject3/newproject3/venv/local/lib/python2.7/site-packages/behave/model.py", line 1903, in run
self.func(context, *args, **kwargs)
File "features/steps/first_selenium.py", line 4, in step_impl
context.browser.get("http://www.seleniumframework.com")
File "/home/vagrant/newproject3/newproject3/venv/local/lib/python2.7/site-packages/behave/runner.py", line 214, in __getattr__
raise AttributeError(msg)
AttributeError: 'Context' object has no attribute 'browser'
Then I print the title # None
Failing scenarios:
features/first_selenium.feature:2 Open website
0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
0 steps passed, 1 failed, 1 skipped, 0 undefined
Took 0m0.001s
Destroying test database for alias 'default'...
Here is the code:
first_selenium.feature
Feature: Open website and print title
Scenario: Open website
Given I open seleniumframework website
Then I print the title
first_selenium.py
from behave import *
#given('I open seleniumframework website')
def step_impl(context):
context.browser.get("http://www.seleniumframework.com")
#then('I print the title')
def step_impl(context):
title = context.browser.title
assert "Selenium" in title
manage.py
#!/home/vagrant/newproject3/newproject3/venv/bin/python
import os
import sys
sys.path.append("/home/vagrant/newproject3/newproject3/site/v2/features")
import dotenv
if __name__ == "__main__":
path = os.path.realpath(os.path.dirname(__file__))
dotenv.load_dotenv(os.path.join(path, '.env'))
from configurations.management import execute_from_command_line
#from django.core.management import execute_from_command_line
execute_from_command_line(sys.argv)
I'm not sure what this error means

I know it is a late answer but maybe somebody is going to profit from it:
you need to declare the context.browser (in a before_all/before_scenario/before_feature hook definition or just test method definition) before you use it, e.g.:
context.browser = webdriver.Chrome()
Please note that the hooks must be defined in a separate environment.py module

In my case the browser wasn't installed. That can be a case too. Also ensure path to geckodriver is exposed if you are working with Firefox.

hiveserver2 gives no response on port 10000

Does apache hive have a contact place where I can ask them questions? I've tried everything I could but it still does not work.
I setup hadoop then hive based on the tutorial
http://doctuts.readthedocs.org/en/latest/hive.html
Then I tried to run hiveserver2 and wrote a python script to interact with it but it would hang when I tried to execute a hive command. However, based on the solution here Requests hang when using Hiveserver2 Thrift Java client
Now when I start hiveserver I still get errors.
Using the hive_service library I get the error
Invalid method name: 'execute'
when I call client.execute
and when I try to use pyhs2 I get this output
Traceback (most recent call last):
File "test1.py", line 8, in <module>
database='default') as conn:
File "/home/sakib/anaconda/lib/python2.7/site-packages/pyhs2/__init__.py", line 7, in connect
return Connection(*args, **kwargs)
File "/home/sakib/anaconda/lib/python2.7/site-packages/pyhs2/connections.py", line 46, in __init__
transport.open()
File "/home/sakib/anaconda/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 74, in open
status, payload = self._recv_sasl_message()
File "/home/sakib/anaconda/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 92, in _recv_sasl_message
header = self._trans.readAll(5)
File "/home/sakib/anaconda/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz-have)
File "/home/sakib/anaconda/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 94, in read
raise TTransportException('TSocket read 0 bytes')
thrift.transport.TTransport.TTransportException: None
Here is my sample python scripts to connect with hive
import sys
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
try:
print "1111"
transport = TSocket.TSocket('localhost', 10000)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
print "2222"
client = ThriftHive.Client(protocol)
transport.open()
print "3333"
client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
print "4444"
transport.close()
except Thrift.TException, tx:
print '%s' % (tx.message)
and
import pyhs2
with pyhs2.connect(host='localhost',
port=10000,
authMechanism="PLAIN",
user='root',
password='',
database='default') as conn:
with conn.cursor() as cur:
#Show databases
print cur.getDatabases()

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Python web crawler using urllib - beautifulsoup

You forgot one minor thing: webpage=urlopen('http://www.xxxxxxxxx.com').read() # this -> ^^^^^^^ Your code just generated an urlopen object and assigned it to webpage. To assign the contents of the page, you need .read().

Related

Question about insert from TK form to csv file

Cant fint next button on linkedin

SQL CE connection with Python

AttributeError: 'Context' object has no attribute 'browser'

hiveserver2 gives no response on port 10000

Categories

Resources