I am currently working on a college project for LinkedIn web scraping using Selenium. Here is the code:
from selenium import webdriver
from time import sleep
from selenium.webdriver.common.keys import Keys
from parsel import Selector
driver = webdriver.Chrome('location of web driver')
driver.get('https://www.linkedin.com')
# username
username = driver.find_element_by_id('session_key')
username.send_keys('Linkedin Username')
sleep(0.5)
# password
password = driver.find_element_by_id('session_password')
password.send_keys('Linkedin Password')
sleep(0.5)
#submit value
sign_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
sign_in_button.click()
sleep(0.5)
driver.get('https://www.google.com/') #Navigate to google to search the profile
# locate search form by_name
search_query = driver.find_element_by_name('q')
# send_keys() to simulate the search text key strokes
search_query.send_keys('https://www.linkedin.com/in/khushi-thakkar-906b56188/')
sleep(0.5)
search_query.send_keys(Keys.RETURN)
sleep(3)
# locate the first link
search_person = driver.find_element_by_class_name('yuRUbf')
search_person.click()
#Experience
experience = driver.find_elements_by_css_selector('#experience-section .pv-profile-section')
for item in experience:
    print(item.text)
    print("")
#Education
education = driver.find_elements_by_css_selector('#education-section .pv-profile-section')
for item in education:
    print(item.text)
    print("")
#Certification
certification = driver.find_elements_by_css_selector('#certifications-section .pv-profile-section')
for item in certification:
    print(item.text)
    print("")
When I scrape the Experience part, it extracts the information perfectly, but when I do the same for the Education and Certification parts, I get an empty list. Please help!
I think the problem is your CSS selector. I tried it myself, and it was unable to locate any element in the HTML body. Fix your CSS selectors and you will be fine:
#Education
education = driver.find_elements_by_css_selector('#education-section li')
#Certification
certification = driver.find_elements_by_css_selector('#certifications-section li')
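As a minimal sketch, the corrected selectors can also be combined with an explicit wait instead of fixed sleeps; the selector string is the one from this answer, and the 10-second timeout is an arbitrary choice:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until at least one education entry is present before reading it.
education = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#education-section li'))
)
for item in education:
    print(item.text)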
I want to scrape this site for some of my natural language processing work. I have a subscription to the website, but I am still not able to get the result; I get an error that it is unable to locate the element.
The link to the login page is the one passed to driver.get() in the code below.
This is the code that I tried in Python with Selenium:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-infobars")
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=options)
driver.get('https://login.newscorpaustralia.com/login?state=hKFo2SBmOXc1TjRJNDlBX3hObkZPN1NsRWgzcktONTlPVnJMS6FupWxvZ2luo3RpZNkgUi1ZRmV2Z2dwcWJmZUpqdWtZdk5CUUllX0h3YngwanSjY2lk2SAwdjlpN0tvVzZNQkxTZmUwMzZZU1FUNzl6QThaYXo0WQ&client=0v9i7KoW6MBLSfe036YSQT79zA8Zaz4Y&protocol=oauth2&response_type=token%20id_token&scope=openid%20profile&audience=newscorpaustralia&site=couriermail&redirect_uri=https%3A%2F%2Fwww.couriermail.com.au%2Fremote%2Fidentity%2Fauth%2Flatest%2Flogin%2Fcallback.html%3FredirectUri%3Dhttps%253A%252F%252Fwww.couriermail.com.au%252Fsearch-results%253Fq%253Djason%252520huny&prevent_sign_up=true&nonce=7j4grLXRD39EVhGsxcagsO5c-PtAY4Md&auth0Client=eyJuYW1lIjoiYXV0aDAuanMiLCJ2ZXJzaW9uIjoiOS4xOS4wIn0%3D')
time.sleep(10)
elem = driver.find_element(by=By.CLASS_NAME,value='navigation_search')
username = driver.find_element(by=By.ID,value='1-email')
password = driver.find_element(by=By.NAME,value='password')
login = driver.find_element(by=By.NAME,value='submit')
username.send_keys("myid");
password.send_keys("password");
login.click();
time.sleep(20)
soup = BeautifulSoup(driver.page_source, 'html.parser')
search = driver.find_element(by=By.CSS_SELECTOR,value='form.navigation_search')
search.click()
search.send_keys("jason hunt")
print(driver.page_source)
Below is the error that I am getting. I want to grab the search icon and send keys there, but I am not getting the search form after login.
Below is the text-based HTML of the element.
I tried printing the page source and was not able to locate the HTML element there either.
Not a proper answer, but since you can't add formatting to comments and this has the same desired effect:
driver.get("https://www.couriermail.com.au/search-results");
WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "search_box_input"))
searchBox = driver.find_element(By.CLASS_NAME, "search_box_input")
searchBox.send_keys("test");
I'm trying to scrape data that only appears on mouseover, using Selenium. It's a concert map, and this is my entire code. I keep getting TypeError: 'ActionChains' object is not iterable.
The idea would be to hover over the whole map and scrape the markup whenever the HTML changes. I'm pretty sure I need two for loops for that, but I don't know yet how to combine them; see the sketch after the code below. Also, I know I'll have to use bs4. Could someone share ideas on how I could go about this?
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.get('https://www.ticketcorner.ch/event/simple-minds-hallenstadion-12035377/')
#accept shadow-root cookie banner
time.sleep(5)
driver.execute_script('return document.querySelector("#cmpwrapper").shadowRoot.querySelector("#cmpbntyestxt")').click()
time.sleep(5)
# Click on the saalplan so we get to the concert map
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#tickets > div.seat-switch-wrapper.js-tab-switch-group > a:nth-child(3) > div > div.seat-switch-radio.styled-checkbox.theme-switch-bg.theme-text-color.theme-link-color-hover.theme-switch-border > label'))).click()
time.sleep(5)
#Scroll to the concert map, which will be the element to hover over
element_map = driver.find_element(By.CLASS_NAME, 'js-tickettype-cta')
actions = ActionChains(driver)
actions.move_to_element(element_map).perform()
# Close the drop-down which is partially hiding the concert map
driver.find_element(by=By.XPATH, value='//*[@id="seatmap-tab"]/div[2]/div/div/section/div[2]/div[1]/div[1]/div/div/div[1]/div').click()
# Mouse Hover over the concert map and find the empty seats to extract the data.
actions = ActionChains(driver)
data = actions.move_to_element_with_offset(driver.find_element(by=By.XPATH, value='//*[@id="seatmap-tab"]/div[2]/div/div/section/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[2]/canvas'), 0, 0)
for i in data:
    actions.move_by_offset(50, 50).perform()
    time.sleep(2)
    # print the content of each box
    hover_data = driver.find_element(By.XPATH, '//*[@id="tooltipster-533522"]/div[1]').get_attribute('tooltipster-content')
    print(hover_data)
# The code I would use to hover over the element
#actions.move_by_offset(100, 50).perform()
# time.sleep(5)
# actions.move_by_offset(150, 50).perform()
# time.sleep(5)
# actions.move_by_offset(200, 50).perform()
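A minimal sketch of the two-loop idea described in the question, assuming the canvas located above, the imports from the snippet (plus BeautifulSoup), and purely illustrative grid bounds, step size, and tooltip selector. An ActionChains object is not iterable, which is what causes the TypeError; the loops iterate over offsets instead:
# Hover over a grid of points on the canvas; bounds and step are assumptions.
canvas = driver.find_element(By.XPATH, '//*[@id="seatmap-tab"]//canvas')
step = 50
for x in range(0, 500, step):
    for y in range(0, 400, step):
        # Move relative to the canvas (the offset origin depends on the Selenium version).
        ActionChains(driver).move_to_element_with_offset(canvas, x, y).perform()
        time.sleep(1)
        # Parse whatever tooltip markup is currently shown with bs4.
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        tooltip = soup.select_one('.tooltipster-content')
        if tooltip:
            print(tooltip.get_text(strip=True))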
I'm trying to enter text into a field (the subject field in the image) in a section using Selenium.
I've tried locating it by XPath, ID, and a few others, but it looks like maybe I need to switch context to the section. I've tried the following; errors are in comments after the lines.
from time import sleep
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
browser = Firefox(options=opts)
browser.get('https://www.linkedin.com/feed/')
sign_in = '/html/body/div[1]/main/p/a'
browser.find_element_by_xpath(sign_in).click()
email = '//*[#id="username"]'
browser.find_element_by_xpath(email).send_keys(my_email)
pword = '//*[#id="password"]'
browser.find_element_by_xpath(pword).send_keys(my_pword)
signin = '/html/body/div/main/div[2]/div[1]/form/div[3]/button'
browser.find_element_by_xpath(signin).click()
search = '/html/body/div[8]/header/div[2]/div/div/div[1]/div[2]/input'
name = 'John McCain'
browser.find_element_by_xpath(search).send_keys(name+"\n")#click()
#click on first result
first_result = '/html/body/div[8]/div[3]/div/div[1]/div/div[1]/main/div/div/div[1]/div/div/div/div[2]/div[1]/div[1]/span/div/span[1]/span/a/span/span[1]'
browser.find_element_by_xpath(first_result).click()
#hit message button
msg_btn = '/html/body/div[8]/div[3]/div/div/div/div/div[2]/div/div/main/div/div[1]/section/div[2]/div[1]/div[2]/div/div/div[2]/a'
browser.find_element_by_xpath(msg_btn).click()
sleep(10)
## find subject box in section
section_class = '/html/body/div[3]/section'
browser.find_element_by_xpath(section_class) # no such element
browser.switch_to().frame('/html/body/div[3]/section') # no such frame
subject = '//*[@id="compose-form-subject-ember156"]'
browser.find_element_by_xpath(subject).click() # no such element
compose_class = 'compose-form__subject-field'
browser.find_element_by_class_name(compose_class) # no such class
id = 'compose-form-subject-ember156'
browser.find_element_by_id(id) # no such element
css_selector= 'compose-form-subject-ember156'
browser.find_element_by_css_selector(css_selector) # no such element
wind = '//*[@id="artdeco-hoverable-outlet__message-overlay"]'
browser.find_element_by_xpath(wind) # no such element
A figure showing the developer info for the text box in question is attached.
How do I locate the text box and send keys to it? I'm new to Selenium but have gotten through login and basic navigation to this point.
I've put the page source (as seen by the Selenium browser object at this point) here.
The page source (as seen when I click in the browser window and hit 'copy page source') is here.
Despite the window in focus being the one I wanted, it seems the browser object saw things differently. Using
window_after = browser.window_handles[1]
browser.switch_to_window(window_after)
allowed me to find the element using an Xpath.
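A slightly fuller sketch of the same idea, assuming a second window has just been opened; switch_to.window is the current spelling of the older switch_to_window used above:
# Remember the original window, then switch to the first other handle.
original = browser.current_window_handle
for handle in browser.window_handles:
    if handle != original:
        browser.switch_to.window(handle)
        break
# ... locate elements in the new window here ...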
I would like to automatically save images that the browser holds as data after the page renders, using their corresponding data URLs.
For example:
1. Go to the webpage https://en.wikipedia.org/wiki/Truck.
2. Using the Web Inspector in Firefox, pick the first thumbnail image on the right.
3. In the Inspector tab, right-click the img tag, go to Copy, and press "Image Data-URL".
4. Open a new tab, paste, and press Enter to see the image from the data URL.
Notice that the data URL is not available in the page source. On the website I want to scrape, the images are rendered after passing through a PHP script. The server returns a 404 response if the images are accessed directly via the src attribute.
I believe it should be possible to list the data URLs of the images rendered by the website and download them, however I was unable to find a way to do it.
I normally scrape using selenium webdriver with Firefox coded in python, but any solution would be welcome.
I managed to work out a solution using the Chrome webdriver with CORS disabled, as I could not find a CLI argument to disable it in Firefox.
The solution executes some JavaScript to redraw the image on a new canvas element and then uses the toDataURL method to get the data URL. (Drawing a cross-origin image normally taints the canvas and makes toDataURL raise a security error, which is why web security is disabled here.) To save the image, I convert the base64 data to binary data and write it out as a PNG.
This apparently solved the issue in my use case.
Code to get the first truck image:
from binascii import a2b_base64
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--disable-site-isolation-trials")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://en.wikipedia.org/wiki/Truck")
img = driver.find_element_by_xpath("/html/body/div[3]/div[3]"
                                   "/div[5]/div[1]/div[4]/div"
                                   "/a/img")
img_base64 = driver.execute_script(
"""
const img = arguments[0];
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
canvas.width = img.width;
canvas.height = img.height;
ctx.drawImage(img, 0, 0);
data_url = canvas.toDataURL('image/png');
return data_url
""",
img)
binary_data = a2b_base64(img_base64.split(',')[1])
with open('image.png', 'wb') as save_img:
    save_img.write(binary_data)
Also, I found that the data URL you get with the procedure described in my question is generated by the Firefox Web Inspector on request, so it should not be possible to get a list of data URLs (that are not within the page source) as I first thought.
BeautifulSoup is a good fit for problem statements like this. When the data you want is in the static HTML, BeautifulSoup with requests is faster than Selenium: it takes around 10 seconds to complete this task, whereas Selenium would take approximately 15-20 seconds, so it is the better choice here. Here is how you do it using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import time
st = time.time()
src = requests.get('https://en.wikipedia.org/wiki/Truck').text
soup = BeautifulSoup(src,'html.parser')
divs = soup.find_all('div',class_ = "thumbinner")
count = 1
for x in divs:
    url = x.a.img['srcset']
    url = url.split('1.5x,')[-1]
    url = url.split('2x')[0]
    url = "https:" + url
    url = url.replace(" ", "")
    path = f"D:\\Truck_Img_{count}.png"
    response = requests.get(url)
    file = open(path, "wb")
    file.write(response.content)
    file.close()
    count += 1
print(f"Execution Time = {time.time()-st} seconds")
Output:
Execution Time = 9.65831208229065 seconds
29 images. Here is the first image:
Hope this helps!
I was trying to automate control of a page where there is an iframe and an element that can be controlled with AutoIt. I need to click the Scan button within the iframe. I used driver.switch_to.frame("frmDemo") to switch frames, but it did not seem to work. Any ideas, please?
Here is the code:
import win32com.client
import time
from selenium import webdriver
autoit = win32com.client.Dispatch("AutoItX3.Control")
# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get("http://example.com")
time.sleep(2)
driver.switch_to.frame("frmDemo")
scanButton = driver.find_element_by_css_selector('body.input[type="button"]')
scanButton.click()
input is not a class; it's a child element of body. Try without body:
scanButton = driver.find_element_by_css_selector('input[type="button"]')
You can also try the value attribute (note that an attribute selector needs brackets):
scanButton = driver.find_element_by_css_selector('input[value="Scan"]')
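If the frame itself is slow to appear, a hedged alternative is to wait for it explicitly before looking for the button; frmDemo and the Scan value come from the question, and the rest is standard Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com")
# Wait until the iframe is available, then switch into it in one step.
WebDriverWait(driver, 30).until(
    EC.frame_to_be_available_and_switch_to_it((By.NAME, "frmDemo"))
)
scan_button = driver.find_element(By.CSS_SELECTOR, 'input[value="Scan"]')
scan_button.click()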