How can I get the link to a YouTube channel from a video page? - selenium

I've been trying to get the link to the YouTube channel from a video page, but I couldn't locate the link's element. In the Inspector, it is obvious that the link is right there, as shown in the following picture.
Using the CSS selector 'a.yt-simple-endpoint.style-scope.yt-formatted-string', I tried to get the link with the following code.
! pip install selenium
! pip install beautifulsoup4
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(r'D:\chromedrive\chromedriver.exe')  # raw string keeps the backslashes literal
driver.get('https://www.youtube.com/watch?v=P6Cc2R2jK6s')
soup = BeautifulSoup(driver.page_source, 'lxml')
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
for link in links:
    print(link.get('href'))  # BeautifulSoup tags use .get(), not Selenium's .get_attribute()
However, no matter whether I used links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string') or links = soup.find('a', class_='yt-simple-endpoint style-scope ytd-video-owner-renderer'), it did not print anything. Could someone please help me solve this?

Instead of this:
links = soup.select('a.yt-simple-endpoint.style-scope.yt-formatted-string')
In Selenium, I would do:
links = driver.find_elements_by_css_selector('a.yt-simple-endpoint.style-scope.yt-formatted-string')
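If you are on Selenium 4, the same idea with an explicit wait instead of relying on timing (a minimal sketch; the selector is the one from the question, and the 15-second timeout is arbitrary):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=P6Cc2R2jK6s')

# Wait until YouTube's JavaScript has rendered the channel link,
# then read the href from the live Selenium elements
wait = WebDriverWait(driver, 15)
links = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'a.yt-simple-endpoint.style-scope.yt-formatted-string')))
for link in links:
    print(link.get_attribute('href'))

driver.quit()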

Related

Scraping data from CrowdTangle using the API returns expired image links

I wanted to download images from the CrowdTangle Dashboard, so I wrote code to fetch the data using its API. However, historical posts scraped via the API return expired media links: when downloading an image, I got a "URL expired" error. How can I generate new links?
After talking with people, I figured out that I needed to scroll through the CrowdTangle dashboard to make it regenerate the image links. However, scrolling manually through thousands of posts would be tedious, so I decided to code a bot that scrolls for me. This solved my problem and I was able to generate new links.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

link = "<insert_link>"  # paste the dashboard URL copied from your browser
browser.get(link)
browser.maximize_window()

# Follow the "click here." link to reach the Facebook login form
fb_button = browser.find_element(by=By.LINK_TEXT, value="click here.")
fb_button.click()
time.sleep(7)

# Log in with your Facebook credentials
phone = browser.find_element(by=By.ID, value="email")
password = browser.find_element(by=By.ID, value="pass")
submit = browser.find_element(by=By.ID, value="loginbutton")
phone.send_keys("<phone number>")
password.send_keys("<password>")
submit.click()
time.sleep(6)

# Keep scrolling the posts container; close the browser to stop the loop
element = browser.find_element(by=By.XPATH, value="/html/body/div[1]/div/div/div[3]/div")
while True:
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", element)
    time.sleep(3)
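As written, the loop scrolls forever and you stop it by closing the browser. If you would rather have it stop on its own, a minimal variation (continuing from the snippet above, with the same element and pause) can watch for the container's scroll height to stop growing:

# Scroll until the container's height stops changing, i.e. no more posts load
last_height = browser.execute_script("return arguments[0].scrollHeight", element)
while True:
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", element)
    time.sleep(3)
    new_height = browser.execute_script("return arguments[0].scrollHeight", element)
    if new_height == last_height:
        break  # nothing new loaded; assume we reached the end
    last_height = new_height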
Go to the CrowdTangle dashboard, enter your filters, and run the query. Copy the link from the browser into the code. I would recommend running the scroll bot one month at a time. Sometimes more posts won't load; this is an issue with CrowdTangle. Just close the browser and move on to the next month.

Splash not rendering a webpage completely

I am trying to use scrapy + splash to scrape this site: https://www.teammitsubishihartford.com/new-inventory/index.htm?compositeType=new. However, I am unable to extract any data from it. When I rendered the page through the Splash API in a browser, I found that the site is not fully loaded (the Splash render returns an image of a partially loaded page). How can I render the site completely?
@Vinu Abraham, if your requirement is not specific to scrapy + splash, you can use Selenium. This issue occurs when you try to scrape a dynamic site.
Below is a code snippet for reference.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

# URL of the page we want to scrape
url = 'https://www.*******/drugs-all-medicines'

driver = webdriver.Chrome('./chromedriver')
driver.get(url)
time.sleep(5)  # give the page's JavaScript time to render

# Hand the fully rendered HTML to BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find('div', {'class': 'style__container___1i8GI'})
print(all_divs)
Also, let me know if you find a solution for this using scrapy.
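If you do want to stay with scrapy + splash, the usual first step is to give Splash more time to execute the page's JavaScript before it returns. A minimal sketch using the wait argument of SplashRequest (assuming the scrapy-splash plugin is installed and enabled per its README; the 10-second wait is arbitrary):

import scrapy
from scrapy_splash import SplashRequest

class InventorySpider(scrapy.Spider):
    name = 'inventory'

    def start_requests(self):
        url = 'https://www.teammitsubishihartford.com/new-inventory/index.htm?compositeType=new'
        # 'wait' tells Splash to pause before snapshotting the page
        yield SplashRequest(url, self.parse, args={'wait': 10})

    def parse(self, response):
        # Log the title to check whether the page rendered fully
        self.logger.info(response.css('title::text').get())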

BeautifulSoup: Why doesn't it find all iframes?

I'm pretty new to BeautifulSoup, and I'm trying to figure out why it doesn't work as expected.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285710")
bsObj = BeautifulSoup(html.read(), features="html.parser")
print(bsObj.find_all('iframe'))
I get a list of only 2 iframes.
However, when I open this page with a browser and type:
document.getElementsByTagName("iframe")
in the dev-tools console, I get a list of 14 elements.
Can you please help me?
This is because the site dynamically adds more iframes after the page has loaded. Additionally, iframe content is loaded dynamically by the browser, so it won't be downloaded via urlopen either. You may need to use Selenium so that JavaScript can load the additional iframes, and then locate each iframe and download its content via its src URL.
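A minimal sketch of that approach (assuming chromedriver is available on your PATH; the 10-second sleep is a crude stand-in for a proper wait):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.globes.co.il/news/article.aspx?did=1001285710")
time.sleep(10)  # let the page's scripts inject the remaining iframes

soup = BeautifulSoup(driver.page_source, "html.parser")
iframes = soup.find_all("iframe")
print(len(iframes))  # should now be closer to the 14 seen in dev-tools
for iframe in iframes:
    print(iframe.get("src"))  # each iframe's content lives behind its src URL

driver.quit()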

Selenium with Python: can't find element by link text

Could you help me understand why, in this particular case, find_element_by_partial_link_text doesn't catch the element?
from selenium import webdriver
import unittest

class RegisterNewUser(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.driver.get("http://web.archive.org/web/20141117213704/http://demo.magentocommerce.com/")

    def test_register_new_user(self):
        self.driver.find_element_by_link_text("Log In").click()
Pardon the strange link. I'm reading a book on Selenium and the link originally came from there, but the content has since changed. The book still seems fine to me, so I just pulled the old page from an archive.
Oddly, if I view the page source, I can find the link there, but I can't reach it via Selenium.
Could you give me a hint? Thank you in advance.
The link is hidden; you first need to click on the Account menu to open the dropdown that contains it.
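A sketch of the fix (keeping the book-era find_element_by_* API to match the question; the a.skip-account selector for the Account toggle is an assumption based on that Magento demo theme):

def test_register_new_user(self):
    # "Log In" sits in a dropdown that only becomes visible
    # after the Account menu in the header is opened
    self.driver.find_element_by_css_selector("a.skip-account").click()  # hypothetical selector
    self.driver.find_element_by_link_text("Log In").click()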

Can Beautiful Soup also trigger webpage events?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. I use it to extract webpage data, but I haven't found any way to click the buttons and anchor links that, in my case, drive the page navigation. Do I need another tool for this, or does Beautiful Soup have a capability I'm not aware of?
Please advise me!
To answer your tags/comment: yes, you can use Selenium and BeautifulSoup together, and no, you can't use BeautifulSoup directly to execute events (clicking, etc.). Although I haven't used them together in the same situation myself, a hypothetical workflow could involve using Selenium to navigate to a target page via a certain path (i.e. click() these options and then click() the button to the next page), and then using BeautifulSoup to read driver.page_source (where driver is the Selenium driver you created to 'drive' the browser). Since driver.page_source is the HTML of the page, you can use BeautifulSoup as you are used to, parsing out whatever information you need.
Simple example:
from bs4 import BeautifulSoup
from selenium import webdriver

# Create your driver
driver = webdriver.Firefox()
# Get a page
driver.get('http://news.ycombinator.com')
# Feed the source to BeautifulSoup (an explicit parser avoids a warning)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)  # <title>Hacker News</title>
The main idea is that anytime you need to read the source of a page, you can pass driver.page_source to BeautifulSoup in order to read whatever you want.