Python BeautifulSoup get text from class - beautifulsoup

How can I get the text "Lionel Messi" from this HTML code?
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names[0] I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code do I need to get only the text?
I want to scrape all player names from that page. But first I think I need to find a way to extract that text.
Unfortunately, I can't find a way to make it work.
I am new to Python and am trying some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it changes. The part that is always the same is "form rating ut20", so I thought there might be some kind of placeholder that lets me search for all class names including "form rating ut20".
Could you maybe help me with this as well?

To select elements by a partial class you can use either a regular expression or, if you have bs4 version 4.7.1 or above installed, a CSS selector.
Using a regular expression returns a list of elements.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Alternatively, a CSS selector also returns a list of elements. The first selector below means "contains" and the second means "starts with".
prating = pool.select('[class*="form rating ut20"]')
OR
prating = pool.select('[class^="form rating ut20"]')
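As a minimal sketch of the selector approach, the snippet below runs against a small made-up HTML fragment (the markup and rating values are assumptions, not taken from the real page). It also shows a third option, chaining class selectors, which matches each class token separately and so does not depend on the order of the classes in the attribute.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the changing class suffixes from the question.
html = """
<div id="repTb">
  <span class="form rating ut20 toty gold rare">99</span>
  <span class="form rating ut20 gold rare">94</span>
  <span class="other rating">12</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pool = soup.find(id="repTb")

# Substring match on the raw class attribute (order and spacing must match).
by_substring = pool.select('[class*="form rating ut20"]')

# Match each class token separately; their order in the attribute is irrelevant.
by_tokens = pool.select(".form.rating.ut20")

ratings = [tag.get_text(strip=True) for tag in by_tokens]
print(ratings)
```

The chained form is usually the safer choice when the site may reorder its classes; the substring form only matches when the three names appear together in exactly that order.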

Get text using the getText() method.
player_names[0].getText()
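To collect every name rather than only the first, the same call can be applied to each element in the list. This is a sketch on an inline fragment: the first link is the one from the question, the second row is a hypothetical example added for illustration. get_text() is the same method as getText(), and strip=True trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# The first link comes from the question; the second is a hypothetical extra row.
html = """
<table id="repTb">
  <tr><td><a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a></td></tr>
  <tr><td><a class="player_name_players_table" href="/20/player/0/example-player"> Example Player </a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
pool = soup.find(id="repTb")
player_names = pool.find_all(class_="player_name_players_table")

# Extract the text of every matching element, not just the first one.
names = [tag.get_text(strip=True) for tag in player_names]
print(names)
```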

Related

Trying to resolve a scrapy python for loop

If possible I would like to ask for some assistance in scraping some details from a webpage.
https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&period=today&gf=13
The structure is as follows
[Image: Webpage data structure]
[Image: Webpage data structure expanded]
I am able to retrieve all songs using the following command:
response.css("div.trk-cell.title a").xpath("@href").extract()
or
response.xpath("//div[@class='trk-cell title']/a/@href").get()
I am able to retrieve all artists using the following command:
response.css("div.trk-cell.artists a").xpath("@href").extract()
or
response.xpath("//div[@class='trk-cell artists']/a/@href").get()
Now I am trying to write a loop that extracts all the titles and artists on the page and groups each result together in either CSV or JSON. I am struggling to work out the for loop; I have been trying the following with no success.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        for track in response.css("div.trklist.v-.full.v5"):
            yield {
                'link': track.xpath("//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath("//div[@class='trk-cell artists']/a/@href").get()
            }
As far as I can tell, the "trklist" div appears to encapsulate the artist and title divs, so I'm unsure why this code doesn't work.
I have tried the following command in the scrapy shell and it doesn't return any results which I suspect is the issue, but why not?
response.css("div.trklist.v-.full.v5")
A push in the correct direction would be a lot of help, thanks
You only select the table that contains the items, not the items themselves, so you're not really looping through them.
The table's class list is slightly different in the response Scrapy receives, so the CSS selector needs to match that (no v5).
Inside the loop you're missing a leading dot in track.xpath(...); without it, // searches the whole document instead of the current track.
Notice that I added "hdr" to the selector in order to skip the table's header row.
I included both a CSS and an XPath selector for the for loop (both work; choose one of them):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        # for track in response.css('div.trklist.v-.full div.trk-row:not(.hdr)'):
        for track in response.xpath('//div[@class="trklist v- full init-invis"]/div[not(contains(@class, "hdr"))]'):
            yield {
                'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get()
            }
If you run view(response) in the scrapy shell to open the response in a web browser, you will find that there is no data, because the data is generated dynamically with JavaScript, which Scrapy does not execute.
You should use Selenium or a similar tool.

Beautifulsoup Scraping Object Selection

I am new to BeautifulSoup and am trying to scrape with it. I need the table rows shown in the image [image not available].
The tables come from a React app and are then shown on the website. I need suggestions on how to do this. I am struggling to create the BeautifulSoup object and do not know which class to grab to reach the table rows and their content.
webpage = urlopen(req).read()
soup = bs(webpage, "html.parser")
table=soup.find('table', {'class': 'equity'})
rows=list()
for row in table.findAll("tr"):
    rows.append(row)
I need your help; I'm having a hard time getting this done. It would be really appreciated!
You can grab the td elements with this code:
webpage = urlopen(req).read()
soup = bs(webpage, "lxml")
# Take the first <tr> of the table with class "table"
first_row = soup.find('table', {'class': 'table'}).find('tr')
cells = list()
for cell in first_row.findAll("td"):
    cells.append(cell)
I preferred lxml as the parser because it has some advantages, but you can keep using html.parser.
You can also use pandas, which can create the table for you; its extensive documentation makes it easy to learn.
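As a rough sketch of going one step further, the snippet below turns every row of a table into a list of cell strings. The table contents are invented for illustration, and only the class name "equity" comes from the question. It only works when the table is actually present in the downloaded HTML; if the page builds the table with JavaScript after loading, soup.find will return None.

```python
from bs4 import BeautifulSoup

# Invented table; only the class name "equity" is taken from the question.
html = """
<table class="equity">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.5</td></tr>
  <tr><td>BBB</td><td>20.0</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "equity"})

rows = []
for tr in table.find_all("tr"):
    # Collect header and data cells alike, as stripped text.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```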

Using regular expressions with BeautifulSoup's select_one

I have read answers on this website that describe using regex for 'find' queries in BeautifulSoup. However, the answers are less clear about using regex to query multiple tags with 'select_one'.
Specifically I have two tags, shown below.
'#CommitYear14'
'#CommitYear12'
So I just need a query that looks for matches with '#CommitYear'.
My query right now is
college_info = beautiful_soup_parsing.select_one(tag)
where tag is either '#CommitYear14' or '#CommitYear12'. I don't know how to get both 14 and 12.
select_one() applies CSS selectors, so you cannot use re with it. However, you can use the CSS attribute selector ^=, which selects elements whose attribute value begins with the given string:
data = '''
<div id="CommitYear12">CommitYear12</div>
<div id="CommitYear14">CommitYear14</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('[id^="CommitYear"]'))
Prints:
<div id="CommitYear12">CommitYear12</div>
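select_one() returns only the first match. To get both elements, as the question asks, the same selector can be passed to select(), which returns a list. A short sketch using the same data (html.parser is used here only so the example needs no extra parser installed):

```python
from bs4 import BeautifulSoup

data = '''
<div id="CommitYear12">CommitYear12</div>
<div id="CommitYear14">CommitYear14</div>'''

soup = BeautifulSoup(data, "html.parser")

# select() returns every element whose id starts with "CommitYear".
matches = soup.select('[id^="CommitYear"]')
ids = [tag["id"] for tag in matches]
print(ids)
```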

How to explain the following Beautiful Soup code?

I am new to Beautiful Soup and trying to learn it, and I got stuck on a certain piece of code. Below is the code:
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
I am unable to understand the meaning of ".attrs". How is it used and what does it do?
Secondly, when I execute this code it prints all the links but leaves out the name "href" itself. What is going on? Can someone please explain it to me?
Below is the complete code :
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
According to the beautifulsoup4 documentation, attrs holds all the attributes of an HTML tag together with their values. An "a" tag, for example, may have an "href" attribute, a "class" attribute, and so on. attrs returns a dictionary, so you get a value by accessing its key, here "href". For example, when it prints the following link:
"/wiki/Wikipedia:Protection_policy#semi"
then dictionary["href"] = "/wiki/Wikipedia:Protection_policy#semi",
so the value for the key "href" is "/wiki/Wikipedia:Protection_policy#semi".
Just run the following code:
print(link.attrs)
and everything will become clear.
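To make the dictionary behaviour concrete, here is a small sketch on two hand-written links (the markup is an assumption for illustration; only the href value comes from the answer above). It also shows tag.get(), which returns None instead of raising an error when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hand-written example: the first link has an href, the second does not.
html = (
    '<a class="internal" href="/wiki/Wikipedia:Protection_policy#semi">policy</a>'
    '<a name="top">no link</a>'
)

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# attrs is a plain dict of attribute names to values.
print(first.attrs)

# Indexing the tag is shorthand for indexing attrs.
href = first["href"]

# get() avoids a KeyError when the attribute is absent.
missing = second.get("href")
print(href, missing)
```

This is why the original loop prints only the URL: link.attrs['href'] looks up the value stored under the key "href", so the key name itself is never printed.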

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is get a part of a huge XML file called "quests", based on the <quest> tag and its "id" attribute. Then it is supposed to make a new soup from that part and get all the <location_id> tags from within the <quest>. For now, before I figure out which quest ids I want to choose (and how I will handle input), I have just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
There is no need to build a new soup object from a Tag object. You can call find_all on both, because BeautifulSoup is a subclass of Tag, so they behave the same way and can be searched the same way.
A soup object is essentially a special Tag object whose name is '[document]':
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
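A self-contained sketch of searching directly on a tag, using a tiny invented XML fragment in place of the real quests.xml (the structure and values are assumptions; html.parser is used only to avoid needing lxml):

```python
from bs4 import BeautifulSoup, Tag

# Invented stand-in for Data/xmls/quests.xml.
xml = """
<quests>
  <quest id="1478"><location_id>12</location_id><location_id>13</location_id></quest>
  <quest id="2000"><location_id>99</location_id></quest>
</quests>
"""

soup = BeautifulSoup(xml, "html.parser")

floors = []
for quest in soup.find_all("quest", {"id": "1478"}):
    # quest is a Tag, so find_all works on it directly.
    for floor in quest.find_all("location_id"):
        floors.append(floor.get_text())

# The soup itself is also a Tag, which is why both support the same methods.
soup_is_tag = isinstance(soup, Tag)
print(soup_is_tag, floors)
```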