I have a few cells in OpenRefine that contain XML (coming from Nominatim). For each node, I would like to extract the value of an attribute only if the value of an element in the same node is equal to a specific string ('Paris'). I am using Jython to loop over the elements and, if the element value is equal to Paris, return the desired attribute. Here's the code for it:
from xml.etree import ElementTree as ET
element = ET.fromstring(value).encode('utf8')
root = element.getroot()
resultsList = root.findall(".//place")
for result in resultsList:
    typerecord = result.find("city")
    if typerecord.text == "Paris":
        return result.attrib["lat"]
However, it doesn't seem to work, even though the code looks fine to me. I get the following error:
Error: Traceback (most recent call last):
File "<string>", line 3, in __temp_242115945__
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1313, in XML
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/parsers/expat.py", line 193, in Parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 115: ordinal not in range(128)
which appears to be about the character encoding. I added .encode('utf8') to the script, but nothing changed.
Here is a sample of the XML:
<?xml version="1.0" encoding="UTF-8" ?>
<searchresults attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" exclude_place_ids="18482590,103398643,118557459,109798886" more_url="https://nominatim.openstreetmap.org/search/?street=11+rue+Girardon&city=Paris&country=France&addressdetails=1&extratags=1&polygon_geojson=1&exclude_place_ids=18482590%2C103398643%2C118557459%2C109798886&format=xml" querystring="11 rue Girardon, Paris, France" timestamp="Tue, 25 Oct 22 09:32:26 +0000">
<place address_rank="30" boundingbox="43.6242386,43.6243386,1.4264894,1.4265894" class="place" display_name="11, Rue François Girardon, Minimes - Barrière de Paris, Toulouse Nord, Toulouse, Haute-Garonne, Occitanie, France métropolitaine, 31200, France" geojson="{"type":"Point","coordinates":[1.4265394,43.6242886]}" importance="0.5201" lat="43.6242886" lon="1.4265394" osm_id="2084506137" osm_type="node" place_id="18482590" place_rank="30" type="house">
<extratags/>
<house_number>11</house_number>
<road>Rue François Girardon</road>
<neighbourhood>Minimes - Barrière de Paris</neighbourhood>
<suburb>Toulouse Nord</suburb>
<city>Toulouse</city>
<municipality>Toulouse</municipality>
<county>Haute-Garonne</county>
<ISO3166-2-lvl6>FR-31</ISO3166-2-lvl6>
<state>Occitanie</state>
<ISO3166-2-lvl4>FR-OCC</ISO3166-2-lvl4>
<region>France métropolitaine</region>
<postcode>31200</postcode>
<country>France</country>
<country_code>fr</country_code>
</place>
<place address_rank="26" boundingbox="48.8872626,48.8876471,2.3372233,2.3374922" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{"type":"LineString","coordinates":[[2.3372233,48.8872626],[2.3372534,48.8873072],[2.337453,48.8875915],[2.3374922,48.8876471]]}" importance="0.52" lat="48.8875915" lon="2.337453" osm_id="10662867" osm_type="way" place_id="103398643" place_rank="26" type="residential">
<extratags>
<tag key="lit" value="yes"/>
<tag key="surface" value="sett"/>
<tag key="maxspeed" value="30"/>
<tag key="sidewalk" value="both"/>
<tag key="smoothness" value="intermediate"/>
<tag key="cycleway:both" value="no"/>
<tag key="zone:maxspeed" value="FR:30"/>
<tag key="motor_vehicle:conditional" value="no # (Su,PH 11:00-18:00)"/>
</extratags>
<road>Rue Girardon</road>
<city_block>Quartier des Grandes-Carrières</city_block>
<suburb>Paris 18e Arrondissement</suburb>
<city_district>Paris</city_district>
<city>Paris</city>
<ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
<state>Île-de-France</state>
<ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
<region>France métropolitaine</region>
<postcode>75018</postcode>
<country>France</country>
<country_code>fr</country_code>
</place>
<place address_rank="26" boundingbox="48.8885135,48.8886689,2.3380551,2.3381062" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{"type":"LineString","coordinates":[[2.3381062,48.8885135],[2.3380648,48.8886091],[2.3380551,48.8886689]]}" importance="0.52" lat="48.8886091" lon="2.3380648" osm_id="23371363" osm_type="way" place_id="109798886" place_rank="26" type="pedestrian">
<extratags>
<tag key="lit" value="yes"/>
<tag key="surface" value="paving_stones"/>
<tag key="smoothness" value="good"/>
</extratags>
<road>Rue Girardon</road>
<city_block>Quartier des Grandes-Carrières</city_block>
<suburb>Paris 18e Arrondissement</suburb>
<city_district>Paris</city_district>
<city>Paris</city>
<ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
<state>Île-de-France</state>
<ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
<region>France métropolitaine</region>
<postcode>75018</postcode>
<country>France</country>
<country_code>fr</country_code>
</place>
</searchresults>
Given the sample and the code, what I am expecting as a result is:
48.8875915
48.8886091
Can anyone help, or suggest a GREL alternative?
Personally, I find non-trivial Python a royal pain to debug in OpenRefine's Jython preview, whereas GREL's fluent style is much easier to build up incrementally, so here's a GREL equivalent of your Python:
forEach(value.parseXml().select('place'),p,if(p.select('city')[0].htmlText()=='Paris',p.htmlAttr('lat'),None)).join('|')
It returns 48.8875915|48.8886091 (the values are joined into one string because you can't store an array in a cell).
Having said that, there are two problems with your Python:
1. You need to encode the string, not the value returned from fromstring(), i.e. ET.fromstring(value.encode('utf8')), not ET.fromstring(value).encode('utf8').
2. ElementTree.fromstring() returns the root element directly, so the getroot() call is unnecessary.
The patched-up code is below, but note that it only returns the first value. It would need additional modifications to return all matches concatenated together in a string.
from xml.etree import ElementTree as ET
root = ET.fromstring(value.encode('utf8'))
resultsList = root.findall(".//place")
for result in resultsList:
    typerecord = result.find("city")
    if typerecord.text == "Paris":
        return result.attrib["lat"]
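One possible way to return every match rather than only the first is to collect the latitudes in a list and join them at the end. The sketch below is written as a standalone function so it can be tried outside OpenRefine; the function name, the `city` and `separator` parameters, and the guard for a missing `<city>` element are my own assumptions, not part of the original answer.

```python
from xml.etree import ElementTree as ET


def paris_latitudes(xml_text, city="Paris", separator="|"):
    """Return the lat of every <place> whose <city> equals `city`, joined."""
    # Encode the string itself before parsing, as in the patched code above.
    root = ET.fromstring(xml_text.encode("utf8"))
    matches = []
    for place in root.findall(".//place"):
        city_node = place.find("city")
        # A <place> may have no <city> child, so guard against None.
        if city_node is not None and city_node.text == city:
            matches.append(place.attrib["lat"])
    return separator.join(matches)
```

In the OpenRefine expression editor the same loop would presumably sit directly in the expression body and end with `return "|".join(matches)`, using `value` in place of `xml_text`.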
I'm trying to convert PDFs to text files. The problem is that these PDFs contain images, which I don't care about (this is the type of file I want to extract: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I copy and paste with my mouse, it works quite well (except for the line breaks), so I'd guess that it's possible. Most of the answers I found online work pretty well on dummy PDFs with text only, but give especially bad results on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I get a lot of trash like this:
ERY
CTR
3
CH
A
which appears because of the map.
Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it in a very similar thread on Stack Overflow, but there is no answer):
import sys

import pytesseract as pt
from pdf2image import convert_from_path  # pip install pdf2image
from PIL import Image

def convert(name):
    # Render each PDF page to an image, then OCR the image
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page' + str(idx) + '.jpg', 'JPEG')
        quote = Image.open('page' + str(idx) + '.jpg')
        text = pt.image_to_string(quote, lang="fra")
        file_ex = open('page' + str(idx) + '.text', "w")
        file_ex.write(text)
        file_ex.close()

if __name__ == '__main__':
    convert(sys.argv[1])
Finally, I tried removing the images first and then using one of the solutions above, but it didn't work any better:
from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader
# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
for i in range(src.getNumPages()):
    output.addPage(src.getPage(i))
output.removeImages()
output.write(outputStream)
outputStream.close()
inputStream.close()
# Read from the PDF without images
raw = parser.from_file('test3.pdf')
print(raw['content'])
Do you know how to solve this ? It can be in any language.
Thanks
One approach you could try is to use a toolkit capable of parsing the text characters in the PDF, then use the object properties to remove the unwanted map labels while keeping the text characters you need.
For example, the ParsePages method from LEADTOOLS PDF toolkit (which is what I am familiar with since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
    PDFParsePagesOptions options = PDFParsePagesOptions.All;
    document.ParsePages(options, 1, -1);
    using (StreamWriter writer = File.CreateText(txtFileName))
    {
        IList<PDFObject> objects = document.Pages[0].Objects;
        writer.WriteLine("Objects: {0}", objects.Count);
        foreach (PDFObject obj in objects)
        {
            if (obj.TextProperties.IsEndOfLine)
                writer.WriteLine(obj.Code);
            else
                writer.Write(obj.Code);
        }
        writer.WriteLine("---------------------");
    }
}
This will obtain all the text on the first page of the PDF, including the unwanted results you mentioned. Here is an excerpt:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text might be filtered out. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code sample might be altered to extract only text with a FontHeight above 7.25 PDF units:
foreach (PDFObject obj in objects)
{
    if (obj.TextProperties.FontHeight > 7.25)
    {
        if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
        else
            writer.Write(obj.Code);
    }
}
The extracted output would give a better result, an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, with this approach you will have to come up with good criteria for filtering out the unwanted text without removing the text you need to keep.
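The filtering idea itself does not depend on a particular toolkit. As a rough, library-free illustration in Python, assume the characters have already been extracted into simple records carrying the character, its font height and an end-of-line flag. The `ParsedChar` class and `extract_text` function below are hypothetical stand-ins of mine, not the LEADTOOLS API.

```python
from dataclasses import dataclass


@dataclass
class ParsedChar:
    """Hypothetical stand-in for one parsed PDF character."""
    code: str            # the character itself
    font_height: float   # font height in PDF units
    end_of_line: bool    # True if this character ends a line


def extract_text(chars, min_font_height=7.25):
    """Join characters taller than min_font_height, keeping line breaks."""
    parts = []
    for ch in chars:
        # Skip small glyphs, which are assumed to be map labels.
        if ch.font_height > min_font_height:
            parts.append(ch.code)
            if ch.end_of_line:
                parts.append("\n")
    return "".join(parts)
```

Exposing the threshold as a parameter makes it easy to try several values against the same page; other properties such as the bounds could be added as further conditions in the same way.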
Example html:
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
I want to find <p> elements, but only those positioned after span#target.
It should return p4, p5, p6 and p7 in the above example.
I tried getting all the <p>s first and then filtering, but I don't know how to tell whether an element comes after span#target.
You can do this by using the find_all_next method in BeautifulSoup.
from bs4 import BeautifulSoup
doc = # Read the HTML here
# Parse the HTML
soup = BeautifulSoup(doc, 'html.parser')
# Select the first element you want to use as the reference
span = soup.select("span#target")[0]
# Find all elements after the `span` element that have the tag - p
print(span.find_all_next("p"))
The above snippet will result in
[<p>p4</p>, <p>p5</p>, <p>p6</p>, <p>p7</p>]
Edit: as per the OP's request below to compare positions:
If you want to compare the positions of two elements, you'll have to rely on the sourceline and sourcepos attributes provided by the html.parser and html5lib parsers.
First off, store the sourceline and/or sourcepos of your reference element in a variable.
span_srcline = span.sourceline
span_srcpos = span.sourcepos
(You don't actually have to store them, though; you can just use span.sourcepos directly as long as you have the span stored.)
Now iterate through the result of find_all_next and compare the values:
for tag in span.find_all_next("p"):
    print(f'line diff: {tag.sourceline - span_srcline}, pos diff: {tag.sourcepos - span_srcpos}, tag: {tag}')
You're most likely interested in line numbers, though, as sourcepos denotes the position on a line.
However, sourceline and sourcepos mean slightly different things for each parser. Check the docs for details.
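To make the position comparison concrete without depending on BeautifulSoup, here is a small sketch that uses only the standard library's html.parser and its getpos() method, which reports a (line, offset) pair for the tag currently being handled. This is a swapped-in technique for illustration, not what the answer above uses, and the class and function names are made up. Comparing the pairs as tuples orders elements by line first and by offset within the line second.

```python
from html.parser import HTMLParser


class PositionCollector(HTMLParser):
    """Record where span#target and every <p> start in the source."""

    def __init__(self):
        super().__init__()
        self.target_pos = None   # (line, offset) of span#target
        self.paragraphs = []     # [position, text] for each <p>
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "target":
            self.target_pos = self.getpos()
        elif tag == "p":
            self.paragraphs.append([self.getpos(), ""])
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1][1] += data


def paragraphs_after_target(html):
    """Return the text of every <p> that starts after span#target."""
    collector = PositionCollector()
    collector.feed(html)
    collector.close()
    if collector.target_pos is None:
        return []
    # Tuple comparison: earlier line wins, then earlier offset on the line.
    return [text for pos, text in collector.paragraphs
            if pos > collector.target_pos]
```

The <p> that contains the span starts before it on the same line, so it is excluded, which matches the result expected in the question.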
Try this
html_doc = """
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find(id="target").findNext('p').contents[0])
Result
p4
try
span = soup.select("span > #target > p")
My XML
<?xml version="1.0" encoding="utf-8"?>
<metadata created="2014-05-15T12:26:07.701Z" xmlns="http://site/cu-2.0#" xmlns:ext="http://site/cu/b-2.0">
  <customer-list count="47" offset="0">
    <customer id="7123456" type="Cust" ext:mark="1">
      <name>Tony Watt</name>
      <sort-name>Watt, Tony</sort-name>
      <gender>male</gender>
      <country>US</country>
      <knownAs-list>
        <knownAs locale="ko" sort-name="Tony Watt"></knownAs>
        <knownAs locale="ja" sort-name="Watt Tony"></knownAs>
      </knownAs-list>
      <tag-list>
        <begin>Country</begin>
        <tag count="1">
          <name>usa</name>
        </tag>
      </tag-list>
    </customer>
    <customer id="9876543" type="Cust" ext:mark="2">
    ....
  </customer-list>
So I have some code that gets all the data. I went one step further to use anonymous types and add the values into a class, as below:
Dim c = From cust As XElement In XDoc.Descendants(ns + "customer")
        Select New Customer() With {.Name = cust.Element(ns + "name"),
                                    .Surname = CStr(cust.Element(ns + "surname")),
                                    .Id = cust.Attribute("id"),
                                    .Tag = CStr(cust.Element("tag-list").Element("begin"))}
The above code returns data from the XML, but adding this line of code
.Tag = CStr(cust.Element("tag-list").Element("begin"))
throws an exception, "Object reference not set to an instance of an object". Now there are two possibilities here:
1. My code is wrong for that particular line (to retrieve 'begin' from the 'tag-list' element).
2. I know some tag-list elements don't have a nested begin element, so that could be adding some confusion. I added CStr to overcome this, but I'm not sure if this is enough.
After reading MSDN, it seems that using .Descendants (XDoc.Descendants) gets all the data from all elements, whereas Elements returns data only up to the path I have stated, so as far as I can tell the data 'should' be available with the above code. Could anyone assist me in getting the begin data from tag-list?
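The XML namespace is missing from the element names. The document declares a default namespace, so cust.Element("tag-list") matches nothing and returns Nothing, and the chained .Element call then throws. Use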
.Tag = CStr(cust.Element(ns + "tag-list").Element(ns + "begin"))
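The same effect can be reproduced in a few lines of Python with the standard xml.etree.ElementTree module, which may make the cause easier to see: in a document with a default namespace, a lookup by bare element name finds nothing, while the namespace-qualified name does. This is only an illustrative sketch; the trimmed-down XML, the second customer and the `customer_tags` function are assumptions of mine. It also shows one way to cope with customers whose tag-list has no begin element.

```python
from xml.etree import ElementTree as ET

NS = "http://site/cu-2.0#"  # default namespace declared on <metadata>

XML = """<metadata xmlns="http://site/cu-2.0#">
  <customer-list>
    <customer id="7123456">
      <name>Tony Watt</name>
      <tag-list><begin>Country</begin></tag-list>
    </customer>
    <customer id="9876543">
      <name>Hypothetical Customer</name>
      <tag-list></tag-list>
    </customer>
  </customer-list>
</metadata>"""


def customer_tags(xml_text):
    """Map each customer id to the text of tag-list/begin, or None if absent."""
    root = ET.fromstring(xml_text)
    tags = {}
    for cust in root.iter("{%s}customer" % NS):
        # Every step of the path needs the namespace, not just the first.
        begin = cust.find("{%s}tag-list/{%s}begin" % (NS, NS))
        tags[cust.get("id")] = begin.text if begin is not None else None
    return tags
```

Searching for the bare name, for example `ET.fromstring(XML).find(".//tag-list")`, returns None here, which is the Python counterpart of the Nothing that triggers the exception in the VB code.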
XML File:
<apps>
  <app name="google">
    <branches>
      <branch location="us">
        <datacenters>
          <datacenter city="Mayes County" />
          <datacenter city="Douglas County" />
        </datacenters>
      </branch>
    </branches>
  </app>
  <app name="facebook">
  </app>
</apps>
Java Code:
XPath xpath = XPathFactory.newInstance().newXPath();
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
String expression = "/apps/app[@name='google']";
Element app = (Element) xpath.evaluate(expression, document.getDocumentElement(), XPathConstants.NODE);
I am able to store the app element and evaluate an XPath expression relative to it, like this:
xpath.evaluate("branches/branch[@location='us']", app, XPathConstants.STRING);
How can I do this in VTD-XML?
I think you can learn the API by visiting VTD-XML's SourceForge web site. Below is how I think it would look:
VTDGen vg = new VTDGen();
String filename = "this.xml";
if (vg.parseFile(filename, true)) {
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    // name is an attribute of app, so it is addressed as @name
    ap.selectXPath("/apps/app[@name='google']");
    int i = -1;
    while ((i = ap.evalXPath()) != -1) {
        // do something with i, it is the node index
    }
}
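The two steps discussed in this thread (select the app element by its name attribute, then evaluate a second expression relative to that element) can also be sketched with Python's standard xml.etree.ElementTree. This is neither VTD-XML nor javax.xml.xpath, ElementTree supports only a subset of XPath, and the function name is hypothetical; the point is only to show the attribute predicate and the relative lookup side by side.

```python
from xml.etree import ElementTree as ET

XML = """<apps>
  <app name="google">
    <branches>
      <branch location="us">
        <datacenters>
          <datacenter city="Mayes County" />
          <datacenter city="Douglas County" />
        </datacenters>
      </branch>
    </branches>
  </app>
  <app name="facebook">
  </app>
</apps>"""


def datacenter_cities(xml_text, app_name, location):
    """List datacenter cities for one app and branch location."""
    root = ET.fromstring(xml_text)  # root is the <apps> element
    # '@name' tests the attribute; a bare 'name' would test a child element.
    app = root.find("app[@name='%s']" % app_name)
    if app is None:
        return []
    # This expression is evaluated relative to the stored <app> element.
    branch = app.find("branches/branch[@location='%s']" % location)
    if branch is None:
        return []
    return [dc.get("city") for dc in branch.findall("datacenters/datacenter")]
```

With the sample document, a predicate written as `app[name='google']` matches nothing, because there is no child element called name; only the `@name` form selects the app.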
I saw this question already, but I didn't see an answer.
So I get this error:
The ':' character, hexadecimal value 0x3A, cannot be included in a name.
On this code:
XDocument XMLFeed = XDocument.Load("http://feeds.foxnews.com/foxnews/most-popular?format=xml");
XNamespace content = "http://purl.org/rss/1.0/modules/content/";
var feeds = from feed in XMLFeed.Descendants("item")
            select new
            {
                Title = feed.Element("title").Value,
                Link = feed.Element("link").Value,
                pubDate = feed.Element("pubDate").Value,
                Description = feed.Element("description").Value,
                MediaContent = feed.Element(content + "encoded")
            };
foreach (var f in feeds.Reverse())
{
    ....
}
An item looks like this:
<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink="false">http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src="http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc" height="1" width="1"/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&#160;</description>
<dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-4T19:44:51Z</dc:date>
<feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink>
</item>
....items....
</channel>
</rss>
All I want is to get the "http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg" URL and, before that, to check whether content:encoded exists.
Thanks.
EDIT:
I've found a sample that I can show, and I've edited the code that tries to handle it.
EDIT2:
I've done it in the ugly way:
text.Replace("content:encoded", "contentt").Replace("xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"","");
and then get the element in the normal way:
MediaContent = feed.Element("contentt").Value
The following code
static void Main(string[] args)
{
    var XMLFeed = XDocument.Parse(
@"<rss>
<channel>
....items....
<item>
<title>Pentagon confirms plan to create new spy agency</title>
<link>http://feeds.foxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/</link>
<category>politics</category>
<dc:creator xmlns:dc='http://purl.org/dc/elements/1.1/' />
<pubDate>Tue, 24 Apr 2012 12:44:51 PDT</pubDate>
<guid isPermaLink='false'>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</guid>
<content:encoded xmlns:content='http://purl.org/rss/1.0/modules/content/'><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src='http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc' height='1' width='1'/>]]></content:encoded>
<description>The Pentagon confirmed Tuesday that it is carving out a brand new spy agency expected to include several hundred officers focused on intelligence gathering around the world.&#160;</description>
<dc:date xmlns:dc='http://purl.org/dc/elements/1.1/'>2012-04-4T19:44:51Z</dc:date>
<!-- <feedburner:origLink>http://www.foxnews.com/politics/2012/04/24/pentagon-confirms-plan-to-create-new-spy-agency/</feedburner:origLink> -->
</item>
....items....
</channel>
</rss>");
    XNamespace contentNs = "http://purl.org/rss/1.0/modules/content/";
    var feeds = from feed in XMLFeed.Descendants("item")
                select new
                {
                    Title = (string)feed.Element("title"),
                    Link = (string)feed.Element("link"),
                    pubDate = (string)feed.Element("pubDate"),
                    Description = (string)feed.Element("description"),
                    MediaContent = GetMediaContent((string)feed.Element(contentNs + "encoded"))
                };
    foreach (var item in feeds)
    {
        Console.WriteLine(item);
    }
}

private static string GetMediaContent(string content)
{
    // content is null when the item has no content:encoded element
    if (string.IsNullOrEmpty(content))
    {
        return string.Empty;
    }
    int imgStartPos = content.IndexOf("<img");
    if (imgStartPos > 0)
    {
        // Skip the leading '|' if there is one
        int startPos = content[0] == '|' ? 1 : 0;
        return content.Substring(startPos, imgStartPos - startPos);
    }
    return string.Empty;
}
results in:
{ Title = Pentagon confirms plan to create new spy agency, Link = http://feeds.f
oxnews.com/~r/foxnews/most-popular/~3/lVUZwCdjVsc/, pubDate = Tue, 24 Apr 2012 1
2:44:51 PDT, Description = The Pentagon confirmed Tuesday that it is carving out
a brand new spy agency expected to include several hundred officers focused on
intelligence gathering around the world. , MediaContent = http://global
.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg }
Press any key to continue . . .
A few points:
1. You never want to treat XML as text. In your case you removed the namespace declaration, but if the namespace were declared inline (i.e. without binding it to the prefix) or a different prefix were defined, your code would not work even though semantically both documents would be equivalent.
2. Unless you know what's inside CDATA and how to treat it, you always want to treat it as text. If you know it's something else, you can treat it differently after parsing; see my elaboration on CDATA below for more details.
3. To avoid NullReferenceExceptions if the element is missing, I used the explicit conversion operator (string) instead of invoking .Value.
4. The XML you posted was not valid: the namespace URI for the feedburner prefix was missing.
This is no longer related to the problem, but it may be helpful for some folks, so I am leaving it.
As far as the contents of the encoded element are concerned, they are inside a CDATA section. What's inside a CDATA section is not XML but plain text. CDATA is usually used so that the '<', '>' and '&' characters do not have to be encoded (without CDATA they would have to be encoded as &lt;, &gt; and &amp; so as not to break the XML document itself); the XML processor treats the characters in CDATA as if they were encoded (or, to be more correct, it encodes them). CDATA is convenient if you want to embed HTML, because textually the embedded content looks like the original, yet it won't break your XML if the HTML is not well-formed XML.
Since the CDATA content is not XML but text, it is not possible to treat it as XML. You will probably need to treat it as text and use, for instance, regular expressions. If you know it is valid XML, you can load the contents into an XElement again and process it. In your case you have mixed content, so it is not easy to do unless you use a slightly dirty hack. Everything would be easy if you had just one top-level element instead of mixed content. The hack is to add that element yourself to avoid all the hassle. Inside the foreach loop you can do something like this:
var mediaContentXml = XElement.Parse("<content>" + (string)item.MediaContent + "</content>");
Console.WriteLine((string)mediaContentXml.Element("img").Attribute("src"));
Again, it's not pretty and it is a hack, but it will work if the content of the encoded element is valid XML. The more correct way of doing this is to use XmlReader with ConformanceLevel set to Fragment and recognize all kinds of nodes appropriately to create a corresponding LINQ to XML node.
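For readers who want to experiment with the CDATA behaviour outside .NET, the following Python sketch with the standard xml.etree.ElementTree shows the same three ideas: look the element up by its namespace-qualified name, check that it exists, and treat the CDATA content as plain text. The trimmed feed, the second item and the function names are assumptions made for this illustration, and unlike the C# helper it returns None rather than an empty string when nothing is found.

```python
from xml.etree import ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

RSS = """<rss><channel>
<item>
  <title>Pentagon confirms plan to create new spy agency</title>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[|http://global.fncstatic.com/static/managed/img/Politics/panetta_hearing_030712.jpg<img src="http://feeds.feedburner.com/~r/foxnews/most-popular/~4/lVUZwCdjVsc" height="1" width="1"/>]]></content:encoded>
</item>
<item>
  <title>Hypothetical item without content:encoded</title>
</item>
</channel></rss>"""


def media_url(item):
    """Return the URL that precedes '<img' in content:encoded, or None."""
    encoded = item.find("{%s}encoded" % CONTENT_NS)
    if encoded is None or not encoded.text:
        return None  # the element is missing or empty
    text = encoded.text  # CDATA arrives as plain text, markup included
    img_start = text.find("<img")
    if img_start <= 0:
        return None  # no '<img', or nothing in front of it
    # Drop any leading '|' characters before the URL.
    return text[:img_start].lstrip("|")


def media_urls(rss_text):
    """Return one entry per <item>, None where no URL could be extracted."""
    root = ET.fromstring(rss_text)
    return [media_url(item) for item in root.iter("item")]
```

Because the lookup uses the namespace URI rather than the content prefix, it keeps working if the feed binds the same namespace to a different prefix.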
You should use XNamespace:
XNamespace content = "...";
// later in your code ...
MediaContent = feed.Element(content + "encoded")
See more details here.
(Of course, the string assigned to content must be the same as in xmlns:content="...".)