Convert lxml _Element to HtmlElement - lxml

For various reasons I'm trying to switch from lxml.html.fromstring() to lxml.html.html5parser.document_fromstring(). The big difference between the two is that the first returns an lxml.html.HtmlElement, and the second returns an lxml.etree._Element.
Mostly this is OK, but when I try to run my code with the _Element object, it crashes, saying:
AttributeError: 'lxml.etree._Element' object has no attribute 'rewrite_links'
Which makes sense. My question is: what's the best way to deal with this problem? I have a lot of code that expects HtmlElements, so I think the best solution would be to convert to those. I'm not sure that's possible, though.
Update
One terrible solution looks like this:
from lxml.html import fromstring, tostring
from lxml.html import html5parser
e = html5parser.fromstring(text)
html_element = fromstring(tostring(e))
Obviously, that's pretty brute force, but it does work. I'm able to get an HtmlElement that's been parsed by the html5parser, which is what I'm after.
The other option would be to work out how to do the rewrite_links and xpath queries that I rely on, but _Elements don't seem to have that function (which, again, makes sense!)

One solution that is less CPU-intensive than brute force is to create an almost empty HtmlElement based on the root tree and append the _Element's children to it.
from lxml.html import fromstring, tostring
from lxml.html import html5parser
text = "<html lang='en'><body><a href='http://localhost'>hello</body></html>"
e = html5parser.fromstring(text)
html_element = fromstring(tostring(e.getroottree()))
for child in e.getchildren():
    html_element.append(child)
print(tostring(html_element))
def rewriter(link):
    return "http://newlink.com"
html_element.rewrite_links(rewriter)
print(tostring(html_element.body))
Will output :
b'<html><body><html xmlns:html="http://www.w3.org/1999/xhtml" lang="en"><head></head><body>hello</body></html></body><html:head xmlns:html="http://www.w3.org/1999/xhtml"></html:head><html:body xmlns:html="http://www.w3.org/1999/xhtml"><html:a href="http://localhost">hello</html:a></html:body></html>'
b'<body><html xmlns:html="http://www.w3.org/1999/xhtml" lang="en"><head></head><body>hello</body></html></body>'
So both attributes like 'body' and methods like 'rewrite_links' work in this situation.
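The other option mentioned in the question, doing the link rewriting directly on plain _Element objects, can be sketched as follows. Plain elements do support .xpath(), so the attribute rewriting can be done by hand. This is only a rough approximation of rewrite_links: the helper name rewrite_link_attributes and the choice of the href and src attributes are assumptions made for illustration, and links inside CSS or other attributes are not handled. The wildcard match is used because html5parser elements appear to be namespaced (see the html: prefixes in the output above), so a bare //a would not match them.

```python
from lxml import etree

# Any parser that returns plain etree elements will do for this illustration.
root = etree.fromstring(
    '<div><a href="http://localhost">hello</a><img src="/logo.png"/></div>'
)


def rewrite_link_attributes(element, replace, attributes=("href", "src")):
    """Hypothetical helper: apply replace() to the given attributes, in place."""
    for name in attributes:
        # descendant-or-self also covers the element itself; the * wildcard
        # matches elements whether or not they live in a namespace.
        for node in element.xpath("descendant-or-self::*[@%s]" % name):
            node.set(name, replace(node.get(name)))


rewrite_link_attributes(root, lambda link: "http://newlink.com")
print(etree.tostring(root))
```

This avoids serializing and re-parsing the tree, at the cost of reimplementing only a small part of what HtmlElement provides.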

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary & end results were previously stored in an SQL database using sqlalchemy, but we need to transform it to delta.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes,FileDelta.getSchema())
Where the _parseBytes() method takes a binary stream and outputs a dictionary of variables
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict : vars(Spectrum(inputDict)),SpectrumDelta.getSchema())
However the Spectrum() init method generates multiple Pandas Dataframes as fields.
I'm getting errors as soon as the Executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easy way to make these work?
I read in [1] that we could switch to the pandas API on Spark, but to me that seems to be something to do within the package method itself. Is that maybe the solution: to rewrite the entire package and its parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in serialization when the result is materialized (with the show(), display() or save() methods).
The UDF's schema expects ArrayType(xxxType()), but the UDF returns a pandas.Series object, which Spark does not know how to unpickle.
If you explicitly convert these values to plain Python types inside the UDF, it works.
import pandas as pd
from pyspark.sql import functions as F
def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}  # avoid shadowing the built-in dict
    for key, value in vars(spectrum).items():
        if isinstance(value, pd.Series):
            # Series -> plain list, which matches an ArrayType column
            result[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            # DataFrame -> {column name: list of values}
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result
spectrumparser = F.udf(getSpectrumDict, SpectrumDelta.getSchema())
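The conversion step can be tried without a Spark cluster. The following is a minimal sketch, not the package's code: FakeSpectrum and its fields are hypothetical stand-ins for the real Spectrum class, and the extra branch for NumPy scalars is an assumption about what such a class might also contain.

```python
import numpy as np
import pandas as pd


class FakeSpectrum:
    """Hypothetical stand-in for the package's Spectrum class."""

    def __init__(self):
        self.name = "sample"
        self.peak = np.float64(1.5)               # NumPy scalar
        self.intensities = pd.Series([0.1, 0.2])  # pandas Series
        self.table = pd.DataFrame({"wavenumber": [400, 401], "value": [1.0, 2.0]})


def to_plain(value):
    """Convert pandas/NumPy values into built-in Python types."""
    if isinstance(value, pd.Series):
        return value.tolist()
    if isinstance(value, pd.DataFrame):
        return value.to_dict("list")
    if isinstance(value, np.generic):
        return value.item()
    return value


plain = {key: to_plain(value) for key, value in vars(FakeSpectrum()).items()}
print(plain)
```

A dictionary built this way contains only lists, dicts and scalars, which is the kind of value a UDF with a struct schema is expected to return.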

VS Code - Completion is terrible, is it my setup?

Code completion and IntelliSense in VS Code are absolutely god-awful for me. In every language. I have extensions installed and updated, but it's always absolute trash.
import pandas as pd
data_all = pd.read_csv(DATA_FILE, header=None)
data_all. (press tab)
No suggestions.
Do you really not know it's a pandas DataFrame object? It's literally the line above.
I have this issue in Python, in Ruby/Rails; in pretty much every language I try to use, the completion is absolute garbage. Do I have an extension that is breaking other extensions? Is Code just this bad? Why is it so inexplicably useless?
Installed Currently:
abusaidm.html-snippets@0.2.1
alefragnani.numbered-bookmarks@8.0.2
bmewburn.vscode-intelephense-client@1.6.3
bung87.rails@0.16.11
bung87.vscode-gemfile@0.4.0
castwide.solargraph@0.21.1
CoenraadS.bracket-pair-colorizer@1.0.61
donjayamanne.python-extension-pack@1.6.0
ecmel.vscode-html-css@1.10.2
felixfbecker.php-debug@1.14.9
felixfbecker.php-intellisense@2.3.14
felixfbecker.php-pack@1.0.2
formulahendry.auto-close-tag@0.5.10
golang.go@0.23.2
groksrc.ruby@0.1.0
k--kato.intellij-idea-keybindings@1.4.0
KevinRose.vsc-python-indent@1.12.0
Leopotam.csharpfixformat@0.0.84
magicstack.MagicPython@1.1.0
miguel-savignano.ruby-symbols@0.1.8
ms-dotnettools.csharp@1.23.9
ms-mssql.mssql@1.10.1
ms-python.python@2021.2.636928669
ms-python.vscode-pylance@2021.3.1
ms-toolsai.jupyter@2021.3.619093157
ms-vscode.cpptools@1.2.2
rebornix.ruby@0.28.1
sianglim.slim@0.1.2
VisualStudioExptTeam.vscodeintellicode@1.2.11
wingrunr21.vscode-ruby@0.28.0
Zignd.html-css-class-completion@1.20.0
If you check the IntelliSense for the read_csv() function (by hovering your mouse over it), you will see that it returns a DataFrame object:
(function)
read_csv(reader: IO, sep: str = ...,
#Okay... very long definition but scroll to the end...
float_precision: str | None = ...) -> DataFrame
But if you use IntelliSense to check the variable data_all:
import pandas as pd
data_all = pd.read_csv(DATA_FILE, header=None)
It is listed as Any, the catch-all type that Python tooling falls back to when it cannot infer anything more specific. That's why the editor isn't offering any completions.
So you simply need to tell the language server explicitly that it is, in fact, a DataFrame object, as shown.
import pandas as pd
from pandas.core.frame import DataFrame
DATA_FILE = "myfile"
data_all: DataFrame = pd.read_csv(DATA_FILE, header=None)
# Now all autocomplete options on data_all are available!
It might seem strange that the language server cannot infer the type in this example, until you realize that read_csv() is overloaded with many definitions, and some of them return Any. So the type checker assumes the worst case and treats the result as Any unless told otherwise.
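The same idea can be shown without pandas. This is only a sketch: Frame and read_table are hypothetical stand-ins for DataFrame and read_csv, chosen so that the loader is declared as returning Any. Both a variable annotation and typing.cast give the editor a concrete type to work with, and neither changes what happens at runtime.

```python
from typing import Any, cast


class Frame:
    """Hypothetical stand-in for a DataFrame-like class."""

    def __init__(self, rows):
        self.rows = rows

    def head(self, n=5):
        return Frame(self.rows[:n])


def read_table(path: str) -> Any:
    """Hypothetical loader whose declared return type is Any."""
    return Frame([[1, 2], [3, 4]])


untyped = read_table("myfile")              # seen as Any by static tooling
annotated: Frame = read_table("myfile")     # variable annotation
casted = cast(Frame, read_table("myfile"))  # typing.cast returns its argument unchanged

print(type(untyped) is type(annotated) is type(casted))
```

All three variables hold the same kind of object when the script runs; the difference exists only for static tooling, which can list Frame members for the annotated and cast variables.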

Unable to use pickAFile in TigerJython

In JES, I am able to use:
file=pickAFile()
In TigerJython, however, I get the following error
NameError: name 'pickAFile' is not defined
What am I doing wrong here?
You are not doing anything wrong at all. The thing is that pickAFile() is not a standard function in Python. It is a function that JES has added for convenience, and you will probably not find it in any other environment.
Since TigerJython and JES are both based on Jython, you can easily write a pickAFile() function on your own that uses Java's Swing. Here is a possible simple implementation (the pickAFile() found in JES might be a bit more complex, but this should get you started):
def pickAFile():
    from javax.swing import JFileChooser
    fc = JFileChooser()
    retVal = fc.showOpenDialog(None)
    if retVal == JFileChooser.APPROVE_OPTION:
        return fc.getSelectedFile()
    else:
        return None
Given that it is certainly a useful function, we might consider including it in our next update of TigerJython.
P.S. I would like to apologise for answering so late, I have just joined SO recently and was not aware of your question (I am one of the original authors of TigerJython).

How to get the current time with GHCJS?

How to get the current time with GHCJS? Should I try to access Date or use the Haskell base libraries? Is there a utility function somewhere in the GHCJS base libraries?
The Data.Time.Clock module seems to work well:
import Data.Time.Clock (getCurrentTime)
import Data.Time.Format -- Show instance
main :: IO ()
main = do
  now <- getCurrentTime
  print now
The solution I found is quite ugly, but it works for me, so maybe it can save somebody some time:
{-# LANGUAGE JavaScriptFFI #-}
import GHCJS.Types( JSVal )
import GHCJS.Prim( fromJSString )
foreign import javascript unsafe "Date.now()+''" dateNow :: IO JSVal
-- dateNow is an IO action, so the conversion has to happen inside IO as well
asInteger :: IO Integer
asInteger = fmap (read . fromJSString) dateNow
The ugliness comes from not finding a JSInteger type in GHCJS, which would be needed to receive the result of Date.now(), a long integer. So I have to produce a string by concatenating an empty string to the result of Date.now() in JavaScript. At this point I could get a JSString as the result, but that is not an instance of Read, so using read would not work. So I get a JSVal and convert it to a String using fromJSString.
Eventually there might be a JSInteger in GHCJS, or JSString might become an instance of Read, so if you are reading this from the future, try out something more elegant!

Split and find specific text?

OK, so I've made an HttpWebRequest and I show the source of the result in a RichTextBox. Now say I have this in the RichTextBox:
<p>Short URL: <code>http://URL.me/u/eywnp</code></p>
How would I go about getting just the "http://URL.me/u/eywnp"? I've tried Split but it didn't work; I guess I'm doing it wrong?
NOTE: the URL will be different every time.
Split isn’t the right tool for the job. It will result in a rather complex piece of code that’s quite brittle (meaning it will break as soon as there’s the slightest change in the input).
For a robust, well-written solution you need to parse the HTML properly. Luckily there is a ready-made solution for that: the HtmlAgilityPack library.
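' Requires: Imports HtmlAgilityPack
Dim doc As New HtmlDocument()
doc.LoadHtml(yourCode)
' Take the first <a> element that has an href attribute and read its value
Dim result = doc.DocumentNode.SelectNodes("//a[@href]")(0).Attributes("href").Value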
The only complicated part here is the string "//a[@href]". This is an XPath string. XPath is a mini-language used to address elements in an HTML or XML document. XPath strings are conceptually similar to file paths (like C:\Users\foo\Documents\file.txt) but with a slightly different syntax.
The XPath simply selects all the <a> elements having a href attribute from your document. Then you can grab the first of that collection and retrieve the href attribute’s value.
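To see what that XPath expression does outside of VB.NET, here is a small Python illustration using lxml rather than HtmlAgilityPack. The markup is an assumption: it is the snippet from the question with the URL wrapped in an <a href="..."> tag, which the IndexOf-based code later in this thread implies is present in the real page.

```python
import lxml.html

# Assumed markup: the question's snippet with the link wrapped in an <a> tag.
page = (
    '<p>Short URL: <code><a href="http://URL.me/u/eywnp">'
    "http://URL.me/u/eywnp</a></code></p>"
)

doc = lxml.html.fromstring(page)

# Select every <a> element that carries an href attribute.
links = doc.xpath("//a[@href]")

# Take the first match, guarding against pages without any link.
short_url = links[0].get("href") if links else None
print(short_url)
```

Because the query addresses the element by its tag and attribute rather than by its position in the text, it keeps working when the surrounding markup changes.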
Thanks for all your help. I did find a solution, and I used:
Dim iStartIndex, iEndIndex As Integer
With RichTextBox1.Text
    iStartIndex = .IndexOf("<p>Short URL: <code><a href=") + 29
    iEndIndex = .IndexOf(""">", iStartIndex)
    Clipboard.SetText(.Substring(iStartIndex, iEndIndex - iStartIndex))
End With
Works perfectly so far.