I want to fetch all the details of wrestlers from the tables - pandas

I have a link: www.cagematch.net/?id=8&nr=1&page=15
On this page you can see a table of wrestlers. If you click on the name of a wrestler, you can see that wrestler's details. I want to fetch all the wrestlers with their details in an easy way. In my mind, I am thinking of something like this:
urls = [
    link1, link2, link3, link4
]
for u in urls:
    # ... do the scraping
But there are 275 wrestlers, and I don't want to enter all the links by hand like this. Is there an easier way to do it?

To get all the links into a list, and then the info about each wrestler, you can use this example:
import requests
from bs4 import BeautifulSoup

url = "http://www.cagematch.net/?id=8&nr=1&page=15"
headers = {"Accept-Encoding": "deflate"}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

links = [
    "https://www.cagematch.net/" + a["href"] for a in soup.select(".TCol a")
]

for u in links:
    soup = BeautifulSoup(
        requests.get(u, headers=headers).content, "html.parser"
    )
    print(soup.h1.text)
    for info in soup.select(".InformationBoxRow"):
        print(
            info.select_one(".InformationBoxTitle").text.strip(),
            info.select_one(".InformationBoxContents").text.strip(),
        )
    # get other info here
    # ...
    print("-" * 80)
Prints:
Adam Pearce
Current gimmick: Adam Pearce
Age: 44 years
Promotion: World Wrestling Entertainment
Active Roles: Road Agent, Trainer, On-Air Official, Backstage Helper
Birthplace: Lake Forest, Illinois, USA
Gender: male
Height: 6' 2" (188 cm)
Weight: 238 lbs (108 kg)
WWW: http://twitter.com/ScrapDaddyAP https://www.facebook.com/OfficialAdamPearce https://www.youtube.com/watch?v=us91bK1ScL4
Alter egos: Adam O'BrienAdam Pearce    a.k.a.  US Marshall Adam J. PearceMasked Spymaster #2Tommy Lee Ridgeway
Roles: Singles Wrestler (1996 - 2014)Road Agent (2015 - today)Booker (2008 - 2010)Trainer (2013 - today)On-Air Official (2020 - today)Backstage Helper (2015 - today)
Beginning of in-ring career: 16.05.1996
End of in-ring career: 21.12.2014
In-ring experience: 18 years
Wrestling style: Allrounder
Trainer: Randy Ricci & Sonny Rogers
Nicknames: "Scrap Iron"
Signature moves: PiledriverFlying Body SplashRackbomb II
--------------------------------------------------------------------------------
AJ Styles
Current gimmick: AJ Styles
Age: 45 years
Promotion: World Wrestling Entertainment
Brand: RAW
Active Roles: Singles Wrestler
Birthplace: Jacksonville, North Carolina, USA
Gender: male
Height: 5' 11" (180 cm)
Weight: 218 lbs (99 kg)
Background in sports: Ringen, Football, Basketball, Baseball
WWW: http://AJStyles.org https://www.facebook.com/AJStylesOrg-110336188978264/ https://twitter.com/AJStylesOrg https://www.instagram.com/ajstylesp1/ https://www.twitch.tv/Stylesclash
Alter egos: AJ Styles    a.k.a.  Air StylesMr. Olympia
Roles: Singles Wrestler (1999 - today)Tag Team Wrestler (2001 - 2021)
Beginning of in-ring career: 15.02.1999
In-ring experience: 23 years
Wrestling style: Techniker, High Flyer
Trainer: Rick Michaels
Nicknames: "The Phenomenal""The Prince Of Phenomenal"
Signature moves: Styles ClashPelé KickCalf Killer/Calf CrusherStylin' DDTCliffhangerSpiral TapPhenomenal Forearm450 Splash
--------------------------------------------------------------------------------
...and so on.
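Since the question title mentions pandas, here is a minimal sketch of turning each profile page's InformationBox rows into one dict per wrestler, which pandas can then tabulate. The HTML fragment below is a made-up stand-in for a real cagematch.net profile page; only the CSS class names are taken from the answer above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one wrestler profile page; the real pages come
# from cagematch.net, only the class names here match the answer above.
html = """
<h1>Adam Pearce</h1>
<div class="InformationBoxRow">
  <div class="InformationBoxTitle">Age:</div>
  <div class="InformationBoxContents">44 years</div>
</div>
<div class="InformationBoxRow">
  <div class="InformationBoxTitle">Gender:</div>
  <div class="InformationBoxContents">male</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Build one record per wrestler: title cells become dict keys.
record = {"Name": soup.h1.text}
for info in soup.select(".InformationBoxRow"):
    key = info.select_one(".InformationBoxTitle").text.strip().rstrip(":")
    record[key] = info.select_one(".InformationBoxContents").text.strip()

print(record)
```

Collecting one such dict per wrestler into a list and passing it to pandas.DataFrame() would give you a table with one row per wrestler.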


Why does this not work? (school project, btw)

import sys, time, random

typing_speed = 80  # wpm

def slow_type(t):
    for l in t:
        sys.stdout.write(l)
        sys.stdout.flush()
        time.sleep(random.random() * 10.0 / typing_speed)

slow_type("Hello which person do you want info for ")
inputs = input(
    "Type 1 For Malcom X, type 2 for Kareem Abdul-Jabbar ")
if inputs == ('1'):
    inputs = input(
        "what info do you want. 1. overall life 2. accomplishments and obstacles. 3. His legacy "
    )
    if inputs == ('1'):
        slow_type(
            "born in may 19 1925 in Omaha Nebraska his parents both died when he was a young child and there wasn't anyone who really could take care of him so he spent much of his time bouncing around different foater homes, in 1952 he joined the nation of islam and became a preacher, he left the NOI to make a new group because he embraced a different type of Islam, sunni islam, he died in febuary 21 on 1965 by assasins who were part of the NOI."
        )
    elif inputs == ('2'):
        slow_type(
            "Some of his major accomplishments include preaching islam and the message that the oppressed ahould fight back. "
        )
if inputs == ('2'):
    inputs = input(
        "what info do you want. 1. Birth and age 2. Early Life. 3. Nba life 4. Later Life 5. Accomplishments and Accolades"
    )
    if inputs == ('1', '2', '3', '4', '5'):
        if inputs == ('1'):
            slow_type(
                "Kareem was born in New York during 1947 on the day of April 16th with the birth name of Lew Alcindor Jr. the son of Fernando Lewis Alcindor., New York policeman and Cora Alcindor. Later in his life Lew Alcindor changed his name to Kareem Abdul-Jabbar, meaning noble servant of the powerful One. Kareem is still alive today and is 74 years of age"
            )
        if inputs == ('2'):
            slow_type(
                "Kareem/ Lew Alcindor was always the tallest person in his class. When Kareem turned 9 he was already 5’8”. When he hit eighth grade he was 6’8”. Lew was playing basketball since he was young. At power memorial academy, Lew had a high-school career that nobody could match. Lew brought his team to 71 straight wins and 3 straight city titles."
            )
        if inputs == ('3'):
            slow_type(
                "In 1969 the Milwaukee Bucks selected Lew Alcindor with the first overall pick in the NBA draft. Lew quickly became a star being second in the league in scoring and third in rebounding, Lew was named the NBA Rookie of The Year. In the following season Lew became better and better and the bucks added future Oscar Robertson to the roster, making the Bucks the best team in the league with a 66-16 record. The bucks won the ring that year and Lew won MVP. Later that Summer Lew converted to Islam and Changed his name to Kareem Abdul-jabbar. Kareem and the bucks got to the NBA finals that year but lost to the Celtics. Even with al the success with the bucks Kareem struggled to be happy. Later that off season demanded a trade to either The Lakers or the Nicks. The bucks complied and traded Kareem to the Los Angelos Lakers where he was paired with Magic Johnson, making the lakers by far the best team in the league. During the rest of Kareems career he dominated the NBA winning 5 more titles and wining 5 more MVPs."
            )
        if inputs == ('4'):
            slow_type("o")
To be specific, the info doesn't print for some reason. Please help.
It doesn't work because of your logic.
if inputs == ('1', '2', '3', '4', '5'): will always be False, because your inputs variable will never equal that tuple. You are also overwriting the inputs variable; I would consider giving those distinct names.
I made a few changes. Take a look and compare it to your code. This code works just fine (relative to what you provided).
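A quick standalone demonstration of the difference between the broken comparison and a membership test (plain Python, independent of the script above):

```python
inputs = '3'

# Equality against a tuple compares the string to the whole tuple: False.
print(inputs == ('1', '2', '3', '4', '5'))   # False

# Parentheses around a single string do not create a tuple...
print(('1') == '1')                          # True
# ...which is why the original `if inputs == ('1'):` checks happened to work.

# A membership test is what the fixed code uses instead.
print(inputs in ('1', '2', '3', '4', '5'))   # True
```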
import sys, time, random

typing_speed = 80  # wpm

def slow_type(t):
    print('\n')
    for l in t:
        sys.stdout.write(l)
        sys.stdout.flush()
        time.sleep(random.random() * 10.0 / typing_speed)

slow_type("Hello which person do you want info for?")
inputs_alpha = input(
    "Type 1 For Malcom X, type 2 for Kareem Abdul-Jabbar\n--> ")
if inputs_alpha == '1':
    inputs = input(
        "what info do you want?\n1. overall life\n2. accomplishments and obstacles.\n3. His legacy\n--> "
    )
    if inputs == '1':
        slow_type(
            "born in may 19 1925 in Omaha Nebraska his parents both died when he was a young child and there wasn't anyone who really could take care of him so he spent much of his time bouncing around different foater homes, in 1952 he joined the nation of islam and became a preacher, he left the NOI to make a new group because he embraced a different type of Islam, sunni islam, he died in febuary 21 on 1965 by assasins who were part of the NOI."
        )
    elif inputs == '2':
        slow_type(
            "Some of his major accomplishments include preaching islam and the message that the oppressed ahould fight back. "
        )
if inputs_alpha == '2':
    inputs = input(
        "what info do you want?\n1. Birth and age\n2. Early Life.\n3. Nba life\n4. Later Life\n5. Accomplishments and Accolades\n--> "
    )
    if inputs in ['1', '2', '3', '4', '5']:
        if inputs == '1':
            slow_type(
                "Kareem was born in New York during 1947 on the day of April 16th with the birth name of Lew Alcindor Jr. the son of Fernando Lewis Alcindor., New York policeman and Cora Alcindor. Later in his life Lew Alcindor changed his name to Kareem Abdul-Jabbar, meaning noble servant of the powerful One. Kareem is still alive today and is 74 years of age"
            )
        if inputs == '2':
            slow_type(
                "Kareem/ Lew Alcindor was always the tallest person in his class. When Kareem turned 9 he was already 5’8”. When he hit eighth grade he was 6’8”. Lew was playing basketball since he was young. At power memorial academy, Lew had a high-school career that nobody could match. Lew brought his team to 71 straight wins and 3 straight city titles."
            )
        if inputs == '3':
            slow_type(
                "In 1969 the Milwaukee Bucks selected Lew Alcindor with the first overall pick in the NBA draft. Lew quickly became a star being second in the league in scoring and third in rebounding, Lew was named the NBA Rookie of The Year. In the following season Lew became better and better and the bucks added future Oscar Robertson to the roster, making the Bucks the best team in the league with a 66-16 record. The bucks won the ring that year and Lew won MVP. Later that Summer Lew converted to Islam and Changed his name to Kareem Abdul-jabbar. Kareem and the bucks got to the NBA finals that year but lost to the Celtics. Even with al the success with the bucks Kareem struggled to be happy. Later that off season demanded a trade to either The Lakers or the Nicks. The bucks complied and traded Kareem to the Los Angelos Lakers where he was paired with Magic Johnson, making the lakers by far the best team in the league. During the rest of Kareems career he dominated the NBA winning 5 more titles and wining 5 more MVPs."
            )
        if inputs == '4':
            slow_type("o")

How can I force spaCy to recognise "Mr. Smith" and "Mrs. Smith" as separate entities?

How can I use spaCy NER to find people in text and differentiate between Mr. Smith and Mrs. Smith as different people/named entities?
For example, this identifies Smith and Smith as the same person:
import spacy
from spacy import displacy

text = "Mr. Smith walked along the sea front. Mrs. Smith stayed at home."
basenlp = spacy.load("en_core_web_sm")
doc = basenlp(text)
displacy.render(doc, style="ent")
I have tried to merge the tokens:
from spacy.tokens import Span

def compounds(doc):
    with doc.retokenize() as rt:
        for t in doc:
            if t.dep_ == "compound":
                newt = Span(doc, t.i, t.head.i + 1)
                rt.merge(newt)
    return doc

basenlp.add_pipe(compounds, "compounds", before="parser")
Same result: Smith and Smith. I try:
basenlp.add_pipe(compounds, "compounds", before="ner")
Now it does not find any entities.
OK, I found it in the documentation under "Expanding named entities":
https://spacy.io/usage/rule-based-matching

Web scraping - get tag through text in sibling tag - Beautiful Soup

I'm trying to get the text inside a table on Wikipedia, and I will do it for many pages (books in this case). I want to get the book genres.
Html code for the page
I need to extract the td containing the genre, i.e. the row where the header text is "Genre".
I did this:
import urllib.request
from bs4 import BeautifulSoup

page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
    for tr in table.findAll('tr')[5:6]:
        for td in tr.findAll('td'):
            print(td.getText(separator="\n"))
This gets me the genre, but only on some pages, because the row count differs from page to page.
Example of a page where this does not work:
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Does anyone know how to find the row whose header text is "Genre"? Thank you.
In this particular case, you don't need to bother with all that. Just try:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
Output:
    0                     1
0   First edition cover   First edition cover
1   Author                J. D. Salinger
2   Cover artist          E. Michael Mitchell[1][2]
3   Country               United States
4   Language              English
5   Genre                 Realistic fictionComing-of-age fiction
6   Published             July 16, 1951
7   Publisher             Little, Brown and Company
8   Media type            Print
9   Pages                 234 (may vary)
10  OCLC                  287628
11  Dewey Decimal         813.54
From here you can use standard pandas methods to extract whatever you need.
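If you do want a Beautiful Soup approach that works regardless of the row's position, you can locate the row by its header text and take the sibling cell. A minimal sketch against a made-up infobox fragment (only the class name matches Wikipedia's real markup):

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the row layout of a Wikipedia infobox.
html = """
<table class="infobox vcard">
  <tr><th>Author</th><td>J. D. Salinger</td></tr>
  <tr><th>Genre</th><td>Realistic fiction</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
th = soup.find("th", string="Genre")            # header cell of the row we want
genre = th.find_next_sibling("td").get_text(strip=True)
print(genre)  # Realistic fiction
```

This keys the lookup on the header text rather than a hard-coded row index, so it is robust to infoboxes with different numbers of rows.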

Remove rows with character(0) from a data.frame before proceeding to dtm

I'm analyzing a data frame of product reviews that contains some empty entries and some text written in a foreign language. The data also contain some customer attributes which can be used as "features" in later analysis.
To begin with, I will first convert the reviews column into a DocumentTermMatrix and then convert it to lda format. I then plan to pass the documents and vocab objects generated by the lda process, along with selected columns from the original data frame, into stm's prepDocuments() function, so that I can leverage the more versatile estimation functions from that package, using customer attributes as features to predict topic salience.
However, some of the empty cells, punctuation, and foreign characters might be removed during pre-processing, thereby creating some character(0) rows in lda's documents object and making those reviews unable to match their corresponding rows in the original data frame. Eventually, this will prevent me from generating the desired stm object from prepDocuments().
Methods to remove empty documents certainly exist (such as the methods recommended in this previous thread), but I am wondering if there is a way to also remove the rows corresponding to the empty documents from the original data frame, so that the number of lda documents and the row dimension of the data frame that will be used as meta in the stm functions are aligned. Will indexing help?
Part of my data is listed at below.
df = data.frame(
  reviews = c("buenisimoooooo", "excelente", "excelent",
              "awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
              "phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
              "//:", "//:", "phone work card non sim card description",
              "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend",
              "1111111", "great bang buck",
              "actually happy little sister really first good great picture late",
              "good phone good reception home fringe area screen lovely just right size good buy",
              "##haha", "phone verizon contract phone buyer beware", "这东西太棒了",
              "excellent product total satisfaction",
              "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
              "good phone price fine", "phone star battery little soon yes"),
  rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
  source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
             "amazon", "zappos", "newegg", "amazon", "amazon",
             "amazon", "amazon", "amazon", "zappos", "amazon",
             "amazon", "newegg", "amazon", "amazon", "amazon"))
This is a situation where embracing tidy data principles can really offer a nice solution. To start with, "annotate" the dataframe you presented with a new column, doc_id, that keeps track of which document each word belongs to, and then use unnest_tokens() to transform this into a tidy data structure.
library(tidyverse)
library(tidytext)
library(stm)
df <- tibble(
  reviews = c("buenisimoooooo", "excelente", "excelent",
              "awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
              "phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
              "//:", "//:", "phone work card non sim card description",
              "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend",
              "1111111", "great bang buck",
              "actually happy little sister really first good great picture late",
              "good phone good reception home fringe area screen lovely just right size good buy",
              "##haha", "phone verizon contract phone buyer beware", "这东西太棒了",
              "excellent product total satisfaction",
              "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
              "good phone price fine", "phone star battery little soon yes"),
  rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
  source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
             "amazon", "zappos", "newegg", "amazon", "amazon",
             "amazon", "amazon", "amazon", "zappos", "amazon",
             "amazon", "newegg", "amazon", "amazon", "amazon"))
tidy_df <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews)
tidy_df
#> # A tibble: 154 x 4
#> rating source doc_id word
#> <dbl> <chr> <int> <chr>
#> 1 4 amazon 1 buenisimoooooo
#> 2 4 bestbuy 2 excelente
#> 3 4 amazon 3 excelent
#> 4 4 newegg 4 awesome
#> 5 4 newegg 4 phone
#> 6 4 newegg 4 awesome
#> 7 4 newegg 4 price
#> 8 4 newegg 4 almost
#> 9 4 newegg 4 month
#> 10 4 newegg 4 issue
#> # … with 144 more rows
Notice that all the information you had before is still there; it is just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps dealing with non-English text however you need, or keeping/not keeping punctuation, etc. This is where empty documents get thrown out, if appropriate for you.
Next, transform this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to the words and the rows correspond to the documents.
sparse_reviews <- tidy_df %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)
colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente" "excelent" "almost"
#> [5] "awesome" "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"
Next, make a dataframe of covariate (i.e. meta) information to use in topic modeling from the tidy dataset you already have.
covariates <- tidy_df %>%
  distinct(doc_id, rating, source)
covariates
#> # A tibble: 18 x 3
#> doc_id rating source
#> <int> <dbl> <chr>
#> 1 1 4 amazon
#> 2 2 4 bestbuy
#> 3 3 4 amazon
#> 4 4 4 newegg
#> 5 5 4 amazon
#> 6 8 4 newegg
#> 7 9 1 amazon
#> 8 10 4 amazon
#> 9 11 3 amazon
#> 10 12 1 amazon
#> 11 13 4 amazon
#> 12 14 3 zappos
#> 13 15 1 amazon
#> 14 16 2 amazon
#> 15 17 4 newegg
#> 16 18 4 amazon
#> 17 19 1 amazon
#> 18 20 1 amazon
Now you can put this together into stm(). For example, if you want to train a topic model with the document-level covariates looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:
topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
                   prevalence = ~source + s(rating),
                   data = covariates,
                   verbose = FALSE)
Created on 2019-08-03 by the reprex package (v0.3.0)

Problems while trying to get specific items and format them with Selenium

I am crawling a website which has some tables. Specifically, I would like to extract from all the tables (if they exist) the first column (presentation) and the company name (which is located at this XPath: .//*[@id='accordion']//h3), in a two-dimensional format like this:
['Mission Pharmacal (Reverified 01/21/2015)' , '250 mg (NDC 01780-500-01)']
['Hospira, Inc. (Reverified 11/07/2016)', '5 mEq/mL; 20 mL vial (NDC 0409-6043-01)']
['Shire US Inc. (Reverified 07/01/2016)', 'AGRYLIN® (anagrelide hydrochloride) Dosage Form: 0.5 mg capsules for oral administration (NDC 54092-063-01)']
['Teva Pharmaceuticals (Reverified 11/01/2016)', '1mg 100 (NDC 00172-5240-60)']
['Teva Pharmaceuticals (Reverified 11/01/2016)', '0.5 mg 10 (NDC 00172-5241-60)']
['Jazz Pharmaceuticals, Inc. (Revised 11/14/2016)', 'ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 5 vial carton (NDC 57902-249-05)']
[' Jazz Pharmaceuticals, Inc. (Revised 11/14/2016)', 'ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 1 vial (NDC 57902-249-01)']
So far I have tried the approach below. However, I do not understand how to tweak the resulting list, and I do not understand why I am not catching some hidden items from the accordion.
In:
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.accessdata.fda.gov/scripts/drugshortages/default.cfm')

links = driver.find_elements_by_xpath(".//*[@id='tabs-1']//tbody//td[1]//a[2]")
links = [x.get_attribute('href') for x in links]

lis = list()
for x in links:
    driver.get(x)
    # .//*[@id='accordion']//div//table
    xpath_list = ['.//*[@id="accordion"]//div//tr//td[1]', ".//*[@id='accordion']//h3//a"]
    full_content = [[x.text for x in driver.find_elements_by_xpath(xpath)] for xpath in xpath_list]
    lis.append(full_content)
lis
Out:
[[['250 mg (NDC 01780-500-01)'], []],
[['5 mEq/mL; 20 mL vial (NDC 0409-6043-01)'], []],
[['AGRYLIN® (anagrelide hydrochloride) Dosage Form: 0.5 mg capsules for oral administration (NDC 54092-063-01)',
'',
''],
['Shire US Inc. (Reverified 07/01/2016)',
'Teva Pharmaceuticals (Reverified 11/01/2016)']],
[['ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 5 vial carton (NDC 57902-249-05)',
'ERWINAZE 10,000 IU lyophilized powder supplied in a clear 3 mL glass vial 1 vial (NDC 57902-249-01)'],
['Jazz Pharmaceuticals, Inc. (Revised 11/14/2016)']],
[['0.4 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-0401-25)',
'1 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-1010-25)',
'',
'',
'',
'',
'',
''],......
import requests
from lxml.html import fromstring
r = requests.get('http://www.accessdata.fda.gov/scripts/drugshortages/dsp_ActiveIngredientDetails.cfm?AI=Atropine%20Sulfate%20Injection&st=c&tab=tabs-1')
html = fromstring(r.text)
in:
[i.text_content().strip() for i in html.xpath('//div[@id="accordion"]//h3')]
out:
['American Regent/Luitpold (Reverified 11/10/2016)',
'Amphastar Pharmaceuticals, Inc./IMS (Reverified 08/18/2016)',
'Hospira, Inc. (Revised 11/07/2016)',
'West-Ward Pharmaceuticals (Revised 05/02/2016)']
in:
[i.xpath('.//td[1]//text()') for i in html.xpath('//div[@id="accordion"]//tbody')]
out:
[['0.4 mg/mL, 1 mL single-dose vial, package of 25\r\n(NDC 00517-0401-25)',
'1 mg/mL, 1 mL single-dose vial, package of 25 (NDC 00517-1010-25)'],
['0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe\r\n(NDC 76329-3339-1, Old NDC 0548-3339-00) \r\n'],
['0.1 mg/mL; 10 mL Ansyr syringe\r\n(NDC 0409-1630-10)',
'0.05 mg/mL; 5 mL Ansyr syringe\r\n(NDC 0409-9630-05)',
'0.1 mg/mL; 5 mL Lifeshield syringe\r\n(NDC 0409-4910-34)',
'0.1 mg/mL; 10 mL Lifeshield syringe\r\n(NDC 0409-4911-34)'],
['0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)\r\n']]
I use lxml's XPath; I hope this will be helpful. By the way, nested list comprehensions are really hard to read; maybe you could create the lists separately, then zip them together.
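To get the two-dimensional [company, presentation] pairs the question asks for, the two query results can be combined afterwards. A minimal sketch with made-up stand-in data (the real lists would come from the two lxml queries above, where each company's h3 corresponds to one tbody of presentations):

```python
# Stand-in data mimicking the shapes of the two lxml query results above.
companies = ['American Regent/Luitpold (Reverified 11/10/2016)',
             'Hospira, Inc. (Revised 11/07/2016)']
presentations = [['0.4 mg/mL, 1 mL single-dose vial (NDC 00517-0401-25)'],
                 ['0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)',
                  '0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)']]

# Pair company i with every presentation row in tbody i.
rows = [[company, item]
        for company, items in zip(companies, presentations)
        for item in items]

for row in rows:
    print(row)
```

Flattening with zip() this way repeats the company name once per presentation row, which matches the desired output format in the question.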