Web scraping - get a tag through the text in its sibling ("brother") tag - BeautifulSoup

I'm trying to get the text inside a table on Wikipedia, and I will be doing it for many pages (books in this case). I want to get the book genres.
HTML code for the page
I need to extract the td containing the genre, i.e. the cell whose row header text is "Genre".
I did this:
import urllib.request
from bs4 import BeautifulSoup

page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
    for tr in table.findAll('tr')[5:6]:
        for td in tr.findAll('td'):
            print(td.getText(separator="\n"))
This gets me the genre, but only on some pages, because the row count differs from page to page.
Example of a page where this does not work:
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Does anyone know how to find the row by searching for the string "Genre"? Thank you

In this particular case, you don't need to bother with all that. Just try:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
Output:
0 1
0 First edition cover First edition cover
1 Author J. D. Salinger
2 Cover artist E. Michael Mitchell[1][2]
3 Country United States
4 Language English
5 Genre Realistic fictionComing-of-age fiction
6 Published July 16, 1951
7 Publisher Little, Brown and Company
8 Media type Print
9 Pages 234 (may vary)
10 OCLC 287628
11 Dewey Decimal 813.54
From here you can use standard pandas methods to extract whatever you need.
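If you do want to answer the sibling-tag question directly, a minimal BeautifulSoup sketch (assuming the infobox keeps the "Genre" label in a th and the value in the neighbouring td, as it does on this page) could be:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye'
soup = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')

infobox = soup.find('table', class_='infobox vcard')
# find the header cell whose text is exactly "Genre" ...
genre_th = infobox.find(lambda tag: tag.name == 'th' and tag.get_text(strip=True) == 'Genre')
if genre_th is not None:
    # ... then read its sibling value cell
    print(genre_th.find_next_sibling('td').get_text(separator='\n'))
With the pandas table above, the same value can be pulled out with something like tables[0].set_index(0).loc['Genre', 1].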

Related

Concatenate rows across multiple columns in Access 2013 using VBA

I have an Access database that follows each Line Item across multiple stages where tasks are assigned to individuals. Each stage has a Comments field, and the comments are recorded in a table which looks like:
Line Item Stage Title Comments
1 1 Introduction Trial comment
1 1 Introduction Another one
1 2 Abstract Following one
1 2 Abstract Andi nexto
1 3 Thesis Nexto
2 1 Introduction Comment for next item
2 1 Introduction Andi another one
...
I want to be able to concatenate these comments for each stage and each Line Item as:
Line Item Stage Title Comments
1 1 Introduction Trial comment, Another one
1 2 Abstract Following one, Andi nexto
1 3 Thesis Nexto
2 1 Introduction Comment for next item, Andi another one
I tried using Allen Brown's ConcatRelated() function with multiple WHERE criteria:
ConcatRelated("[Comments]","[CommentsT]","[LineItemNo]=" & "[txtLineItemNo] AND "[StageNo]=" & [txtStageNo])
but with no luck. Using a single WHERE clause does concatenate all the comments in the required field without considering StageNo and Title.
Kindly advise as to what is the best way for me to achieve this.
Thank you.
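For comparison with the pandas questions elsewhere on this page, here is a sketch of the same per-group concatenation in pandas; it is only an illustration of the desired result, not an Access solution (column names follow the sample above):
import pandas as pd

comments = pd.DataFrame({
    'LineItem': [1, 1, 1, 1, 1, 2, 2],
    'Stage':    [1, 1, 2, 2, 3, 1, 1],
    'Title':    ['Introduction', 'Introduction', 'Abstract', 'Abstract',
                 'Thesis', 'Introduction', 'Introduction'],
    'Comments': ['Trial comment', 'Another one', 'Following one', 'Andi nexto',
                 'Nexto', 'Comment for next item', 'Andi another one'],
})

# join the comments within each (LineItem, Stage, Title) group with ", "
merged = (comments
          .groupby(['LineItem', 'Stage', 'Title'], as_index=False)['Comments']
          .agg(', '.join))
print(merged)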

Read comma separated string data from csv file into list in R

I have a csv file with data on grouping of cereal brands in an experiment. I have one row for each subject (~2000 subjects), and each row has a variable number of brands that the person liked (text strings separated by commas).
srno Brands
1 GMI 'TOTAL' WG ORIG,POS H-C GRAPE NUTS ORIG,POST GREAT GRAINS CRUNCHY PCN
2 GMI TINY TST STB,GMI HONEY NUT CHEERIOS REG,GMI TINY TST BB,GMI APPLE CN TOAST CRUNCH
3 QKR SQUARES CN
I want to read the data into a data frame so that the brands in each row become one element of a list.
My goal is to be able to do a text-mining analysis to explore similarities (i.e. brands that occur together).
I see a similar question asked a few years ago but I was not able to adapt the solution
Text file to list in R
Managed to work this out!
I read in the csv file with the stringsAsFactors = FALSE option (this is important):
x <- read.csv("datafile.csv", stringsAsFactors = FALSE)
# the comma-separated brand strings are in the Brands column of the sample above
# str_split then turns each string into a character vector, so brands_list is a
# list with one element per subject
brands_list <- stringr::str_split(x$Brands, pattern = ",")

Unable to create new features in Machine learning

I have a dataset. I am using pandas dataframe and named it df.
The dataset has 50,000 rows - here are the first 5:
Name_Restaurant cuisines_available Average cost
Food Heart Japnese, chinese 60$
Spice n Hungary Indian, American, mexican 42$
kfc, Lukestreet Thai, Japnese 29$
Brown bread shop American 11$
kfc, Hypert mall Thai, Japnese 40$
I want to create a column which contains the number of cuisines available.
I am trying this code:
df['no._of_cuisines_available']=df['cuisines_available'].str.len()
But instead of showing the number of cuisines, it is showing the number of characters.
For example, for the first row the output should be 2, but it is showing 17.
I also need a new column that contains the number of stores for each restaurant. For example,
here kfc has 2 stores: kfc, Lukestreet and kfc, Hypert mall. I have
no idea how to code this.
i)
df['cuisines_available'].str.split(',').apply(len)
ii)
df['Name_Restaurant'].str.split(',', expand=True).melt()['value'].str.strip().value_counts()
What ii) does: it splits the column at ',' and stores each resulting string in its own column, then uses melt to stack them into one big column, strips away the surrounding spaces, and counts the individual entries.
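A minimal sketch on the sample rows from the question (with the column names as given there) shows both steps end to end:
import pandas as pd

df = pd.DataFrame({
    'Name_Restaurant': ['Food Heart', 'Spice n Hungary', 'kfc, Lukestreet',
                        'Brown bread shop', 'kfc, Hypert mall'],
    'cuisines_available': ['Japnese, chinese', 'Indian, American, mexican',
                           'Thai, Japnese', 'American', 'Thai, Japnese'],
})

# i) number of cuisines per row: split on ',' and count the pieces
df['no_of_cuisines_available'] = df['cuisines_available'].str.split(',').str.len()

# ii) number of rows per name part: split, stack into one column, strip, count
store_counts = (df['Name_Restaurant']
                .str.split(',', expand=True)
                .melt()['value']
                .dropna()
                .str.strip()
                .value_counts())
print(store_counts['kfc'])   # 2
Note that value_counts also counts the branch parts ('Lukestreet', 'Hypert mall', ...), so you look up the restaurant name you care about, as in the last line.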

pandas dataframe sub setting on multiple loop conditions

I have a pandas df with users, their answers to a survey, and a score, e.g.
Userid incomebracket insurance-knowledge..... score
123 3 3 56
346 4 6 65
Assume incomebracket has 6 levels (1: 1000-5000 ... 6: 100000+); similarly, insurance-knowledge has 6 levels (1: very little to 6: expert).
Now I have another df which has user profile features like
userid,age,gender,education....(10 such features)
Now I iterate through the set of users (first df) and for each of them I want to get the entire subset of other users who have the same user profile but a higher answer on a given column of the first df, say income. I am doing this with the following, here for 3 profile features (age, gender and education):
df_sameusergroup=df[(df['PPGENDER']==sameuser_gender.values[0])
& (df['EDUC']==sameuser_educ.values[0])
& (df['age']==sameuser_agecat.values[0])
& (df['incomebracket']>user_feature.values[0])]
Although this works, the profile features here are hardcoded, which is a problem for longer conditions. What I want is:
get the subset of users who have the same profile on all 10 features but a higher answer; if no such record exists (which is possible), reduce to 9 features, then to 8, 7, ..., 2 (keeping the most important features, say age and gender). My pseudocode for this looks like this:
for i in range(10, 1, -1):   # iterate over the profile features: all 10, then 9, ..., down to 2
    Similaruserdf = df[<subset where all i features are the same and income is higher>]
    if len(Similaruserdf) == 0:   # no such users with all i features the same
        continue   # reduce the number of features to match on
    else:
        return Similaruserdf
I am stuck trying to do this and have been looking throughout to find a solution. Any help would be greatly appreciated. Thanks.
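A minimal sketch of that loop, assuming the survey answers and the 10 profile columns have already been merged into one dataframe on userid; the feature ordering and the helper name similar_users are hypothetical (only PPGENDER, EDUC and age are named in the question):
import pandas as pd

def similar_users(df, user_row, profile_features, answer_col='incomebracket'):
    # profile_features: the 10 profile columns, ordered from least to most important,
    # e.g. [..., 'EDUC', 'PPGENDER', 'age']
    for k in range(len(profile_features), 1, -1):       # try all 10, then 9, ... down to 2
        mask = df[answer_col] > user_row[answer_col]    # higher answer than this user
        for f in profile_features[-k:]:                 # the k most important features
            mask &= df[f] == user_row[f]                # same profile value
        subset = df[mask]
        if not subset.empty:
            return subset                               # users matching on all k features
    return df.iloc[0:0]                                 # empty frame: nothing matched even on 2
Each user's row from the merged frame can then be passed in, e.g. similar_users(merged_df, merged_df.loc[i], features).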

need to extract all the content between two string in pandas dataframe

I have data in a pandas dataframe. I need to extract all the content between "Impact Factor:" and the next "&#". If a row doesn't contain "Impact Factor:", I want null in that row of the dataframe.
This is sample data from a single row:
Save to EndNote online &# Add to Marked List &# Impact Factor: Journal 2 and Citation Reports 500 &# Other Information &# IDS Number: EW5UR &#
I want content like the below in a dataframe:
Journal 2 and Citation Reports 500
Journal 6 and Citation Reports 120
Journal 50 and Citation Reports 360
Journal 30 and Citation Reports 120
Hi, you can just use a regular expression here:
import re
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:(.*?)&#', x))
You may want to strip whitespace too, in which case you could use:
result = your_df.your_col.apply(lambda x: re.findall(r'Impact Factor:\s*(.*?)\s*&#', x))
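To check this on the sample row, and to get a proper null where "Impact Factor:" is missing, str.extract is a handy alternative, since it yields NaN instead of an empty list (a sketch; the column name your_col is kept from the answer above and the second row is made up):
import pandas as pd

your_df = pd.DataFrame({'your_col': [
    'Save to EndNote online &# Add to Marked List &# Impact Factor: '
    'Journal 2 and Citation Reports 500 &# Other Information &#',
    'Save to EndNote online &# Add to Marked List &#',   # made-up row without the marker
]})

# one capture group + expand=False -> a Series, with NaN where the pattern is absent
your_df['impact_factor'] = your_df['your_col'].str.extract(r'Impact Factor:\s*(.*?)\s*&#',
                                                           expand=False)
print(your_df['impact_factor'])
# 0    Journal 2 and Citation Reports 500
# 1                                   NaN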