Limitation of BeautifulSoup - beautifulsoup

I am trying to convert XML to JSON using BeautifulSoup, for an XML file with the structure below:
<H3 id="LinkTarget_311">ISSS.1.A1Acceptance of Overall(B)</H3>
<Standard>An organisation's Top Management.</Standard>
<Standard>The Top Management MUST define.</Standard>
<H3 id="LinkTarget_3116">ISS.2.A2Acceptance of Overall(C)</H3>
<Standard>An organisation's Top.</Standard>
<Standard>Top Management.</Standard>
<H3 id="LinkTarget_316">ISS.2.2Acceptance of Overall(D)</H3>
<Standard>An organisation's Top resource.</Standard>
<Standard>Top Management resource.</Standard>
......
.......
The code I wrote is below:
extract2 = re.compile(r"[A-Z][a-z]\w*")
control_ids = {}
header = bs_content.find_all('h3',{'id':True})
sub = bs_content.find_all('standard')
for i,j in zip(header,sub):
    req_id = str.strip(re.split(extract2,i.text)[0])
    control_ids[req_id] = j.text
The result is too long, so I cannot paste all of it.
Expected result: text of H3 tag paired with text of the following 'standard' tags
[{ISSS.1.A1Acceptance of Overall(B) : 'An organisation's Top Management.Top Management.'} , {ISS.2.A2Acceptance of Overall(C):'An organisation's Top.Top Management.'},....]

Try something simpler:
control_ids = {}
headings = bs_content.select('H3')
for heading in headings:
    # join the text of the two sibling tags that follow each heading
    value = " ".join([stan.text for stan in heading.find_next_siblings()[:2]])
    control_ids[heading.text] = value
print(control_ids)
Output, based on your sample HTML:
{'ISSS.1.A1Acceptance of Overall(B)': "An organisation's Top Management. The Top Management MUST define.",
'ISS.2.A2Acceptance of Overall(C)': "An organisation's Top. Top Management.",
'ISS.2.2Acceptance of Overall(D)': "An organisation's Top resource. Top Management resource."}
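If the number of Standard tags under each heading varies, slicing off the first two siblings will not be enough. Here is a minimal sketch that collects every Standard sibling up to the next H3 instead. It assumes the markup is parsed with html.parser (which lowercases tag names) and that each Standard tag is a direct sibling of its H3; the sample string is shortened from the question.

```python
from bs4 import BeautifulSoup

xml = """
<H3 id="LinkTarget_311">ISSS.1.A1Acceptance of Overall(B)</H3>
<Standard>An organisation's Top Management.</Standard>
<Standard>The Top Management MUST define.</Standard>
<H3 id="LinkTarget_3116">ISS.2.A2Acceptance of Overall(C)</H3>
<Standard>An organisation's Top.</Standard>
"""

# html.parser lowercases tag names, so we look for "h3" and "standard"
soup = BeautifulSoup(xml, "html.parser")

control_ids = {}
for heading in soup.find_all("h3", id=True):
    texts = []
    for sibling in heading.find_next_siblings():
        if sibling.name == "h3":
            break  # the next heading starts a new group
        if sibling.name == "standard":
            texts.append(sibling.get_text(strip=True))
    control_ids[heading.get_text(strip=True)] = " ".join(texts)

print(control_ids)
```

The second heading has only one Standard tag here, and it still ends up with exactly its own text.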

compare the 'class' of container tag

Let's say I extract some classes from some HTML:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    print(p_standard)
And the output looks like this:
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
And let's say I only wanted to print the text inside the P3 classes so that the output looks like:
a
c
I thought this code below would work, but it didn't. How can I compare the class name of the container tag to some value?
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class") == "P3":
        print(p_standard.get_text())
I'm aware that in my first line, I could have simply done r"P3" instead of r"Standard|P3", but this is only a small fraction of the actual code (not the full story), and I need to leave that first line as it is.
Note: doing something like .find("p", class_ = "P3") only works for descendants, not for the container tag.
OK, so after playing around with the code, it turns out that
p_standard.get("class")[0] == "P3"
works. (I was missing the [0])
So this code works:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class")[0] == "P3":
        print(p_standard.get_text())
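The reason the [0] is needed: class is a multi-valued attribute, so BeautifulSoup returns its value as a list of strings rather than a single string. A membership test is a bit more robust than indexing, because it also matches tags that carry several classes. A small sketch follows; the third tag, with two classes, is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="P3">a</p><p class="Standard">b</p><p class="extra P3">c</p>'
soup = BeautifulSoup(html, "html.parser")

# .get("class") gives a list of class names, not a string
first_classes = soup.find("p").get("class")

# "in" matches P3 wherever it appears in the class list;
# the [] default covers tags that have no class attribute at all
p3_texts = [
    p.get_text()
    for p in soup.find_all("p")
    if "P3" in p.get("class", [])
]

print(first_classes)
print(p3_texts)
```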
I think the following is more efficient. Use select with a CSS selector list (the comma acts as OR) to gather the elements that have either class.
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
p_standards = soup.select('.Standard,.P3')
for p_standard in p_standards:
    if 'P3' in p_standard['class']:
        print(p_standard.text)

Liferay Dynamic Data Lists: How to get image URL?

I am creating a custom template in Velocity for a Dynamic Data List and I want to get the image URL for the selected image. How can I get it?
The code is:
#set ( $DDLRecordService = $serviceLocator.findService("com.liferay.portlet.dynamicdatalists.service.DDLRecordLocalService") )
#set ( $records = $DDLRecordService.getRecords($mathTool.toNumber($reserved_record_set_id)) )
#foreach ($record in $records)
    #set( $fields = $record.getFields() )
    #set( $URL = $fields.get("URL").getValue() )
    #set( $Link = $fields.get("Linktitle").getValue() )
    #set( $Preview = $fields.get("Vorschaubild").getValue() ) ##the image is here
    $URL
    $Link
    $Preview
#end
The $Preview output is: {"groupId":"0000000","uuid":"ccdaccec-00a0-4284-a000-589be4899281","version":"1.0"}
Any suggestion?
It will work if you replace UUID_HERE with the real UUID.
<a href='${themeDisplay.getPortalURL()}/c/document_library/get_file?uuid=UUID_HERE&groupId=${themeDisplay.getScopeGroupId()}'>MyFile OR Image</a>
I also came across a similar situation and, after searching the internet for hours, didn't find any useful information apart from the LPS-34792 ticket.
Well, you can render an image in the UI from a Documents and Media field using:
<#assign hasPicture = cur_record.getFieldValue("picture")?has_content>
<#if hasPicture>
    <#assign picture = jsonFactoryUtil.createJSONObject(cur_record.getFieldValue("picture"))>
    <img src='/documents/${picture.getString("groupId")}/${picture.getString("uuid")}' />
</#if>
Here picture is the name of the field, and hasPicture checks whether an image was selected.
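Both answers come down to the same string manipulation: parse the JSON that the field returns, then splice its uuid and groupId into a URL. The sketch below shows only that step, in Python rather than Velocity or FreeMarker, purely as an illustration. It is not Liferay API, and it assumes the groupId stored in the JSON is the one the portal expects (the first answer takes it from themeDisplay instead).

```python
import json
from urllib.parse import urlencode

# The value printed by $Preview in the question
raw = '{"groupId":"0000000","uuid":"ccdaccec-00a0-4284-a000-589be4899281","version":"1.0"}'
field = json.loads(raw)

# Path form used in the FreeMarker answer: /documents/<groupId>/<uuid>
documents_url = f"/documents/{field['groupId']}/{field['uuid']}"

# Query-string form used in the first answer
get_file_url = "/c/document_library/get_file?" + urlencode(
    {"uuid": field["uuid"], "groupId": field["groupId"]}
)

print(documents_url)
print(get_file_url)
```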

Modeling condition, nested forms with elm-simple-forms

I have a form that starts with a select. Depending on what was selected the form then expands to a common main bit and a details section that depends on the selection.
I started modeling with a separate details section
type ProductDetails
    = Book Book.Model
    | Brochure Brochure.Model
    | Card Card.Model
type alias Model =
    { form : Form CustomError Estimate
    , details : ProductDetails -- a Form CustomError CardModel / BookModel / ....
    , msg : String
    }
but this is becoming quite convoluted to handle in e.g. view.
The alternative would seem to be to conditionally add the details to the main form model, e.g.
type alias Estimate =
    { customer : String
    , project : String
    , product : String
    , details : ProductDetails
    }
Before I get started, I'd welcome experience from others on what has worked well.
If I understand correctly, you have separate modules for Book, Brochure and Card? I don't quite understand what the purpose of your Model is, but I would structure it like this:
import Book
import Brochure
import Card
type Product
    = Book
    | Brochure
    | Card
type Msg
    = Details Product
type alias Model =
    { selectedProduct : Product
    }
update : Msg -> Model -> Model
update msg model =
    case msg of
        Details prd ->
            Model prd
view : Model -> Html
view model =
    model.selectedProduct.view
As you can see, you first define all available products and then say that Msg can be Details, whose function is to set the selectedProduct value in Model to the selected Product. You can implement the selection with:
button [ onClick (Details Book) ] [ text "Book" ]
for example, to select a book. Later on you will want to make it dynamic, and the first instinct should be to call the view function of the selected Product.
Alternatively, you could define a view that requires some fields which every Product's model contains, and then use those fields to display information on the page.
Please note that the code above isn't meant to work; it just represents the idea.
I'm not familiar with elm-simple-forms, but this seems like a good representation of your form:
type ProductType
    = TBook
    | TBrochure
    | TCard
type Product
    = Book Book.Model
    | Brochure Brochure.Model
    | Card Card.Model
type alias Model =
    { product : Maybe Product
    , customer : String
    , project : String
    }
type Msg
    = SelectProductType ProductType
init : Model
init =
    { product = Nothing
    , customer = ""
    , project = ""
    }
update : Msg -> Model -> Model
update msg model =
    case msg of
        SelectProductType productType ->
            { model
                | product =
                    -- product is a Maybe Product, so wrap the result in Just
                    Just
                        (case productType of
                            TBook ->
                                Book Book.init
                            TBrochure ->
                                Brochure Brochure.init
                            TCard ->
                                Card Card.init
                        )
            }
view : Model -> Html Msg
view model =
    case model.product of
        Nothing ->
            myProductTypeSelect
        Just product ->
            withCommonFormInputs model <|
                case product of
                    Book submodel ->
                        Book.view submodel
                    Brochure submodel ->
                        Brochure.view submodel
                    Card submodel ->
                        Card.view submodel
The Maybe gives you a nice way to choose between the first form (just the select) and the second form (customer details + selected product type details).
The Book.view etc. give you Html which you can add to the common case:
withCommonFormInputs : Model -> Html Msg -> Html Msg
withCommonFormInputs model productInputs =
    div
        []
        [ input [] [] -- customer
        , input [] [] -- project
        , productInputs -- product (Book, Brochure, Card) subform
        ]
I ended up using a Dict of the various fields and changed the fields when the product changed. Trying to model each product more explicitly created more boilerplate than I needed.

Get title, nav_title and subtitle

I want to print the fields title, nav_title and subtitle with TypoScript. I saw that there are several possibilities, e.g. data, field, levelfield, leveltitle, ...
Currently I'm using this code (because it is the only one that works for me so far):
lib.heading = TEXT
lib.heading.dataWrap = <p class="title"> {leveltitle:0} </p>
but I want something with fallbacks, like this:
stdWrap.field = subtitle // nav_title // title
What is the correct way of retrieving these fields?
Edit:
[userFunc = user_isMobile]
page.headerData.10 = TEXT
page.headerData.10.value (
    // ...
)
// ...
lib.heading = TEXT
#lib.heading.dataWrap = <p class="title"> {leveltitle:0} </p>
lib.heading {
    field = title
    wrap = <p class="title"> | </p>
}
lib.subpages = HMENU
lib.subpages {
    // ...
}
[global]
The user function itself is defined in a PHP script (user_mobile.php). It performs user-agent detection for mobile devices and returns true or false.
field gets its values from the current data, which in your context is the data of the current page. The // operator falls back to the next field when the preceding one is empty.
lib.heading = TEXT
lib.heading {
    field = subtitle // nav_title // title
    wrap = <p class="title">|</p>
}
leveltitle, leveluid, levelmedia, etc. allow you to retrieve some of the data from other pages in the rootline of the current page.
For more information see getText in the documentation.
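To make the fallback behaviour of `subtitle // nav_title // title` concrete, here is a rough model of it in Python. This only illustrates the semantics (take the first field whose value is not empty); it is not how TYPO3 implements it, and the function name first_non_empty and the sample page records are made up.

```python
def first_non_empty(record, field_spec):
    """Return the first non-empty value among the fields separated by '//'."""
    for name in (part.strip() for part in field_spec.split("//")):
        value = record.get(name)
        if value is not None and str(value) != "":
            return value
    return ""


# Hypothetical page records
page_a = {"title": "Products", "nav_title": "", "subtitle": None}
page_b = {"title": "Products", "nav_title": "Shop", "subtitle": ""}

heading_a = first_non_empty(page_a, "subtitle // nav_title // title")
heading_b = first_non_empty(page_b, "subtitle // nav_title // title")

print(heading_a)
print(heading_b)
```

Here page_a falls all the way through to its title, and page_b stops at nav_title.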

Extracting href from attribute with BeautifulSoup

I use this method
allcity = dom.body.findAll(attrs={'id' : re.compile(r"\d{1,2}")})
to return a list like this:
[<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp?id=6182" target="_blank"><font size="3">安康房地产信息网 → 饰品味连接</font></a>,
百度快照]
How do I extract this href?
http://www.ylyd.com/showurl.asp?id=6182
Thanks. :)
You can use:
for a in dom.body.findAll(attrs={'id' : re.compile(r"\d{1,2}")}, href=True):
    print(a['href'])
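For anyone on bs4, the same filter can be written with find_all and keyword arguments. This is a minimal sketch on made-up markup (the id values and the example.com link are hypothetical). Note that it anchors the pattern with ^ and $ so that only ids consisting of one or two digits match, which is stricter than the unanchored regex in the question.

```python
import re
from bs4 import BeautifulSoup

html = """
<a id="7" href="http://www.ylyd.com/showurl.asp?id=6182">first</a>
<a id="8">no href here</a>
<a id="abc" href="http://example.com/skipped">non-numeric id</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only tags that actually have an href attribute
hrefs = [
    a["href"]
    for a in soup.find_all(id=re.compile(r"^\d{1,2}$"), href=True)
]

print(hrefs)
```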
In this example there's no real need for a regex; you can simply access the <a> tag and then its ['href'] attribute, like so:
get_me_url = soup.a['href'] # http://www.ylyd.com/showurl.asp?id=6182
# cached URL
get_me_cached_url = soup.find('a', class_='m')['href']
You can always use the prettify() method to see the HTML code more clearly.
from bs4 import BeautifulSoup
string = '''
[
<a href="http://www.ylyd.com/showurl.asp?id=6182" onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" target="_blank">
<font size="3">
安康房地产信息网 → 饰品味连接
</font>
</a>
,
<a class="m" href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu" target="_blank">
百度快照
</a>
]
'''
soup = BeautifulSoup(string, 'html.parser')
href = soup.a['href']
cache_href = soup.find('a', class_='m')['href']
print(f'{href}\n{cache_href}')
# output:
'''
http://www.ylyd.com/showurl.asp?id=6182
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu
'''
Alternatively, you can do the same thing using Baidu Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the main difference in this example is that you don't have to figure out how to grab certain elements since it's already done for the end-user with a JSON output.
Code to grab href/cached href from first page results:
from serpapi import BaiduSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "baidu",
    "q": "ylyd"
}
search = BaiduSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    # .get() returns None when a result has no link or cached link
    link = result.get('link')
    cached_link = result.get('cached_page_link')
    print(f'{link}\n{cached_link}\n')
# Part of the output:
'''
http://www.baidu.com/link?url=7VlSB5iaA1_llQKA3-0eiE8O9sXe4IoZzn0RogiBMCnJHcgoDDYxz2KimQcSDoxK
http://cache.baiducontent.com/c?m=LU3QMzVa1VhvBXthaoh17aUpq4KUpU8MCL3t1k8LqlKPUU9qqZgQInMNxAPNWQDY6pkr-tWwNiQ2O8xfItH5gtqxpmjXRj0m2vEHkxLmsCu&p=882a9646d5891ffc57efc63e57519d&newp=926a8416d9c10ef208e2977d0e4dcd231610db2151d6d5106b82c825d7331b001c3bbfb423291505d3c77e6305a54d5ceaf13673330923a3dda5c91d9fb4c57479c77a&s=c81e728d9d4c2f63&user=baidu&fm=sc&query=ylyd&qid=e42a54720006d857&p1=1
'''
Disclaimer, I work for SerpApi.