How to add an XML fragment to an element - lxml

I am converting from Amara (http://xml3k.org/Amara/Tutorial) to lxml. In Amara I could do:
for wxp in self.Points:
    points.xml_append_fragment('<point><x>%i</x><y>%i</y></point>' % (wxp[0], wxp[1]))
where 'points' is an element. How can I do this with lxml.objectify?

Found it.
for wxp in self.Points:
    #points.xml_append_fragment('<point><x>%i</x><y>%i</y></point>' % (wxp[0], wxp[1]))
    point = xmlu.addElement(points, u'point', ns=None)
    xmlu.addElementPlusValue(point, u'x', wxp[0])
    xmlu.addElementPlusValue(point, u'y', wxp[1])
'addElement' basically does an objectify.SubElement, and 'addElementPlusValue' does a setattr(element, name, value).
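As an alternative sketch that stays closer to Amara's xml_append_fragment, the fragment string can be parsed and appended as a child. This uses the plain etree API (fromstring and append), which lxml.etree shares with the standard library's ElementTree, rather than lxml.objectify. The helper name append_fragment and the sample coordinates are made up for illustration, and whether this fits an objectify tree in your code is an assumption to verify.

```python
try:
    from lxml import etree  # preferred when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback exposing the same calls


def append_fragment(parent, fragment):
    """Parse a single-rooted XML string and attach it as the last child of parent."""
    child = etree.fromstring(fragment)
    parent.append(child)
    return child


points = etree.Element("points")
for x, y in [(1, 2), (3, 4)]:  # hypothetical stand-in for self.Points
    append_fragment(points, "<point><x>%i</x><y>%i</y></point>" % (x, y))

xml_text = etree.tostring(points).decode()
print(xml_text)
```

The fragment must have exactly one root element for fromstring to accept it.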

How to use mapFieldType with gdal.VectorTranslate

I'm trying to export a PostgreSQL database into a .gpkg file, but some of my fields are lists, and ogr2ogr gives me this message:
Warning 1: The output driver does not natively support StringList type for field my_field_name. Misconversion can happen. -mapFieldType can be used to control field type conversion.
But according to the documentation, -mapFieldType is not a -lco, and I can't find how to use it with the Python version, gdal.VectorTranslate.
Here is my config:
gdal_conn = gdal.OpenEx(f"PG:service={my_pgsql_service}", gdal.OF_VECTOR)
gdal.VectorTranslate(
    "path_to_my_file.gpkg",
    gdal_conn,
    SQLStatement=my_sql_query,
    layerName=my_mayer_name,
    format="GPKG",
    accessMode='append',
)
So I tried to add it to the -lco options:
layerCreationOptions=["-mapFieldType StringList=String"]
but it didn't work.
So I dug into the GDAL code, added a mapFieldType=None parameter to the VectorTranslateOptions function, and added the following lines to its body:
if mapFieldType is not None:
    mapField_str = ''
    i = 0
    for k, v in mapFieldType.items():
        i += 1
        mapField_str += f"{k}={v}" if i == len(mapFieldType) else f"{k}={v},"
    new_options += ['-mapFieldType', mapField_str]
And it worked, but is there another way?
And if not, where can I propose this feature?
Thank you for your help.
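One possibility that avoids patching GDAL, offered as a sketch: build the -mapFieldType argument yourself and hand it over as an extra command-line style argument. The helper build_map_field_type_args is hypothetical. The commented-out call assumes that the options keyword of gdal.VectorTranslate accepts a list of ogr2ogr-style arguments alongside the other keywords, which should be checked against your GDAL version.

```python
def build_map_field_type_args(mapping):
    """Return ['-mapFieldType', 'A=B,C=D'] for a dict of source type -> target type."""
    if not mapping:
        return []
    # Same comma-joined form as the patched loop above, without the manual counter.
    joined = ",".join(f"{src}={dst}" for src, dst in mapping.items())
    return ["-mapFieldType", joined]


extra_args = build_map_field_type_args({"StringList": "String", "IntegerList": "String"})
print(extra_args)

# Assumed usage, not executed here because it needs GDAL and a data source:
# gdal.VectorTranslate("path_to_my_file.gpkg", gdal_conn,
#                      options=extra_args, format="GPKG", accessMode="append")
```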

What to do for a non-present DIV tag, using Selenium & Python

I'm trying to extract information with Selenium and Python from the "PROJECT INFORMATION" container at //www.rera.mp.gov.in/view_project_details.php?id=aDRYYk82L2hhV0R0WHFDSnJRK3FYZz09
but while doing this I get this error:
Unable to locate element:
{"method":"xpath","selector":"/html/body/div/article/div[2]/div/div[2]/div[2]/div[2]"}
After looking into it, I found that the highlighted div is missing, and there are many places in this container where a div is missing. How am I supposed to handle that? I only want the information from the right side of the table.
MY CODE:
for c in [c for c in range(1, 13) if (c == True)]:
    row = driver.find_element_by_xpath("/html/body/div/article/div[2]/div/div[2]/div["+ str(c) +"]/div[2]").text
    print(row, end=" ")
    print(" ")
else:
    print('NoN')
error:
no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div/article/div[2]/div/div[2]/div[2]/div[2]"}
(Session info: chrome=83.0.4103.106)
The fields highlighted are two different cases. While for "Registration Number" the required div does not exist, for "Proposed End Date" it exists but contains only white space.
Give this a try instead of the for c... loop. It should handle both cases.
# find parent element
proj_info = driver.find_element_by_xpath("//div[@class='col-md-12 box']")
# find all rows in parent element
proj_info_rows = proj_info.find_elements_by_class_name('row')
for row in proj_info_rows:
    try:
        if row.find_element_by_class_name('col-md-8').text.strip() == "":
            print(f"{row.find_element_by_class_name('col-md-4').text} contains only whitespace {row.find_element_by_class_name('col-md-8').text}")
            print('NaN')
        else:
            print(row.find_element_by_class_name('col-md-8').text)
    except SE.NoSuchElementException:
        print('NaN')
You need this import:
from selenium.common import exceptions as SE
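The two cases (value div missing, value div present but blank) can be tried out without a browser. The sketch below is not Selenium: it uses the standard library's xml.etree.ElementTree on a small, made-up snippet that only imitates the class names from the answer, to show the same per-row lookup relative to the parent element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real page.
page = """<div class="col-md-12 box">
  <div class="row"><div class="col-md-4">Project Name</div><div class="col-md-8">Sample Heights</div></div>
  <div class="row"><div class="col-md-4">Registration Number</div></div>
  <div class="row"><div class="col-md-4">Proposed End Date</div><div class="col-md-8">   </div></div>
</div>"""


def right_column_values(parent):
    """Collect the right-hand cell of every row, using 'NaN' for missing or blank cells."""
    values = []
    for row in parent.findall("div[@class='row']"):
        cell = row.find("div[@class='col-md-8']")
        if cell is None:
            values.append("NaN")  # the value div does not exist at all
        elif not (cell.text or "").strip():
            values.append("NaN")  # the value div exists but holds only whitespace
        else:
            values.append(cell.text.strip())
    return values


values = right_column_values(ET.fromstring(page))
print(values)
```

Looking up each cell relative to its row, instead of through one absolute XPath, is what keeps a single missing div from breaking the whole loop.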

How to write value to input field

I am getting element with
var nameEl = document.getElementById("<portlet:namespace />kategorijaName");
That is an input field. How can I write some text in it?
Since the question (at this time) is tagged liferay and alloy-ui, I am assuming an answer appropriate for those two tags would be beneficial:
<aui:input id='textFieldId' name='textFieldName' label='My Text Field'></aui:input>
<script>
AUI().use('node', function(A){
    A.one('#<portlet:namespace/>textFieldId').set('value', "A new input value");
});
</script>
If you are using plain JavaScript, you can set the value of a text input like this:
document.getElementById("<portlet:namespace />kategorijaName").value = 'some value';
With jQuery you can use:
$("#<portlet:namespace />kategorijaName").val("some value");
If you are using alloy-ui, you can set the value like this:
<aui:script>
A.one('#<portlet:namespace />kategorijaName').set('value',kategorijaName);
</aui:script>
nameEl.value = "value you need"

Convert </br> to end line

I'm trying to extract some text using BeautifulSoup. I'm using the get_text() function for this purpose.
My problem is that the text contains </br> tags and I need to convert them to line breaks. How can I do this?
You can do this using the BeautifulSoup object itself, or any element of it:
for br in soup.find_all("br"):
    br.replace_with("\n")
As the official documentation says:
You can specify a string to be used to join the bits of text together: soup.get_text("\n")
A regex should do the trick.
import re
# raw string avoids an invalid-escape warning; /? also accepts the self-closing <br/>
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
Hope this helps!
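A slightly broader sketch of the regex approach, in case the markup mixes several spellings of the tag. The pattern below is an assumption about which variants are worth covering (<br>, <br/>, <br /> and the non-standard </br> from the question). The names BR_TAG and br_to_newline are made up for this example.

```python
import re

# Matches <br>, <br/>, <br /> and the non-standard </br>, case-insensitively.
BR_TAG = re.compile(r"</?br\s*/?>", re.IGNORECASE)


def br_to_newline(html_text):
    """Replace every line-break tag variant with a newline character."""
    return BR_TAG.sub("\n", html_text)


sample = "first line<br>second line<br/>third line</br>fourth line<BR />end"
converted = br_to_newline(sample)
print(converted)
```

This works on the raw string only, so it is best applied before the text is handed to a parser, or in cases where no parser is involved at all.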
You can also use get_text(separator='\n', strip=True):
from bs4 import BeautifulSoup
bs=BeautifulSoup('<td>some text<br>some more text</td>','html.parser')
text=bs.get_text(separator = '\n', strip = True)
print(text)
>>
some text
some more text
it works for me.
Adding to Ian's and dividebyzero's posts/comments, you can do this to efficiently filter/replace many tags in one go:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.
To steal the list from @petezurich:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
If you call element.text you'll get the text without br tags.
Maybe you need to define your own custom method for this purpose:
def clean_text(elem):
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name == 'br' or e.name == 'p':
            text += '\n'
    return text
# get page content
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))

The ':' character, hexadecimal value 0x3A, cannot be included in a name

I saw this question already, but I didn't see an answer.
So I get this error:
The ':' character, hexadecimal value 0x3A, cannot be included in a name.
On this code:
XDocument XMLFeed = XDocument.Load("http://feeds.foxnews.com/foxnews/most-popular?format=xml");
XNamespace content = "http://purl.org/rss/1.0/modules/content/";
var feeds = from feed in XMLFeed.Descendants("item")
            select new
            {
                Title = feed.Element("title").Value,
                Link = feed.Element("link").Value,
                pubDate = feed.Element("pubDate").Value,
                Description = feed.Element("description").Value,
                MediaContent = feed.Element(content + "encoded")
            };
foreach (var f in feeds.Reverse())
{
    ....
}
An item looks like that:
<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink="false">http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src="http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc" height="1" width="1"/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&amp;#160;</description>
<dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-4T19:44:51Z</dc:date>
<feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink>
</item>
....items....
</channel>
</rss>
All I want is to get the "http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg" URL, and before that to check whether content:encoded exists.
Thanks.
EDIT:
I've found a sample that I can show and edit the code that tries to handle it..
EDIT2:
I've done it in the ugly way:
text.Replace("content:encoded", "contentt").Replace("xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"","");
and then get the element in the normal way:
MediaContent = feed.Element("contentt").Value
The following code
static void Main(string[] args)
{
    var XMLFeed = XDocument.Parse(
@"<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc='http://purl.org/dc/elements/1.1/' />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink='false'>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content='http://purl.org/rss/1.0/modules/content/'><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src='http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc' height='1' width='1'/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&amp;#160;</description>
<dc:date xmlns:dc='http://purl.org/dc/elements/1.1/'>2012-04-4T19:44:51Z</dc:date>
<!-- <feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink> -->
</item>
....items....
</channel>
</rss>");
    XNamespace contentNs = "http://purl.org/rss/1.0/modules/content/";
    var feeds = from feed in XMLFeed.Descendants("item")
                select new
                {
                    Title = (string)feed.Element("title"),
                    Link = (string)feed.Element("link"),
                    pubDate = (string)feed.Element("pubDate"),
                    Description = (string)feed.Element("description"),
                    MediaContent = GetMediaContent((string)feed.Element(contentNs + "encoded"))
                };
    foreach (var item in feeds)
    {
        Console.WriteLine(item);
    }
}
private static string GetMediaContent(string content)
{
    // The (string) cast yields null when content:encoded is missing, so guard before IndexOf.
    if (string.IsNullOrEmpty(content))
    {
        return string.Empty;
    }
    int imgStartPos = content.IndexOf("<img");
    if (imgStartPos > 0)
    {
        int startPos = content[0] == '|' ? 1 : 0;
        return content.Substring(startPos, imgStartPos - startPos);
    }
    return string.Empty;
}
results in:
{ Title = Pentagon confirms plan to create new spy agency, Link = http://feeds.f
oxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/, pubDate = Tue, 24 Apr 2012 1
2:44:51 PDT, Description = The Pentagon confirmed Tuesday that it is carving out
a brand new spy agency expected to include several hundred officers focused on
intelligence gathering around the world. , MediaContent = http://global
.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg }
Press any key to continue . . .
A few points:
You never want to treat XML as text. In your case you removed the namespace declaration, but if the namespace had been declared inline (i.e. without binding it to the prefix) or a different prefix had been defined, your code would not work even though semantically both documents would be equivalent.
Unless you know what's inside CDATA and how to treat it, you always want to treat it as text. If you know it's something else, you can treat it differently after parsing. See my elaboration on CDATA below for more details.
To avoid NullReferenceExceptions if the element is missing, I used the explicit conversion operator (string) instead of invoking .Value.
The XML you posted was not valid XML: the namespace URI for the feedburner prefix was missing.
This is no longer related to the problem but may be helpful for some folks so I am leaving it
As far as the contents of the encoded element are concerned, they are inside a CDATA section. What's inside a CDATA section is not XML but plain text. CDATA is usually used to avoid having to encode the '<', '>' and '&' characters (without CDATA they would have to be encoded as &lt; &gt; and &amp; to not break the XML document itself), but the XML processor treats characters in the CDATA as if they were encoded (or, to be more correct, it encodes them). CDATA is convenient if you want to embed HTML, because textually the embedded content looks like the original, yet it won't break your XML if the HTML is not well-formed XML.
Since the CDATA content is not XML but text, it is not possible to treat it as XML. You will probably need to treat it as text and use, for instance, regular expressions. If you know it is valid XML, you can load the contents into an XElement again and process it. In your case you have got mixed content, so it is not easy to do unless you use a little dirty hack. Everything would be easy if you had just one top-level element instead of mixed content. The hack is to add that element to avoid all the hassle. Inside the foreach loop you can do something like this:
var mediaContentXml = XElement.Parse("<content>" + (string)item.MediaContent + "</content>");
Console.WriteLine((string)mediaContentXml.Element("img").Attribute("src"));
Again, it's not pretty and it is a hack, but it will work if the content of the encoded element is valid XML. The more correct way of doing this is to use XmlReader with ConformanceLevel set to Fragment and recognize all kinds of nodes appropriately to create a corresponding LINQ to XML node.
You should use XNamespace:
XNamespace content = "...";
// later in your code ...
MediaContent = feed.Element(content + "encoded")
See more details here.
(Of course, the string to be assigned to content is the same as in xmlns:content="...".)
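The idea behind XNamespace (look the element up by namespace URI plus local name, never by the literal "content:encoded" prefix) is not specific to C#. As a cross-language illustration only, here is a sketch with Python's standard xml.etree.ElementTree, which spells the qualified name as {uri}local. The feed snippet, the example.com URLs and the media_url helper are made up. The extraction logic mirrors GetMediaContent above and also covers the item that has no content:encoded element.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Hypothetical two-item feed: one item with content:encoded, one without.
feed_xml = """<rss><channel>
<item>
<title>Sample item</title>
<content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[|http://example.com/a.jpg<img src="http://example.com/pixel" height="1" width="1"/>]]></content:encoded>
</item>
<item><title>No media</title></item>
</channel></rss>"""


def media_url(item):
    """Return the text before '<img' in content:encoded, or '' if it is absent."""
    encoded = item.find("{%s}encoded" % CONTENT_NS)  # namespace URI + local name
    if encoded is None or not encoded.text:
        return ""
    text = encoded.text  # CDATA content arrives as plain text
    img_pos = text.find("<img")
    if img_pos <= 0:
        return ""
    start = 1 if text.startswith("|") else 0  # skip the leading '|' marker
    return text[start:img_pos]


root = ET.fromstring(feed_xml)
urls = [media_url(item) for item in root.iter("item")]
print(urls)
```

Because the lookup depends only on the namespace URI, it keeps working if the feed switches to a different prefix for the same namespace.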