lxml tag name with a ":" - lxml

I am trying to create an XML tree from a JSON object using lxml.etree. Some of the tag names contain a colon, something like:
'settings:current'
I tried using '{settings}current' as the tag name, but I get this:
ns0:current xmlns:ns0="settings"

Yes, first read and understand XML namespaces. Then use that to generate an XML tree with namespaces:
>>> MY_NAMESPACES={'settings': 'http://example.com/url-for-settings-namespace'}
>>> e=etree.Element('{%s}current' % MY_NAMESPACES['settings'], nsmap=MY_NAMESPACES)
>>> etree.tostring(e)
'<settings:current xmlns:settings="http://example.com/url-for-settings-namespace"/>'
And you can combine that with default namespaces
>>> MY_NAMESPACES={'settings': 'http://example.com/url-for-settings-namespace', None: 'http://example.com/url-for-default-namespace'}
>>> r=etree.Element('my-root', nsmap=MY_NAMESPACES)
>>> d=etree.Element('{%s}some-element' % MY_NAMESPACES[None])
>>> e=etree.Element('{%s}current' % MY_NAMESPACES['settings'])
>>> d.append(e)
>>> r.append(d)
>>> etree.tostring(r)
'<my-root xmlns:settings="http://example.com/url-for-settings-namespace" xmlns="http://example.com/url-for-default-namespace"><some-element><settings:current/></some-element></my-root>'
Note that you have to have an element with nsmap=MY_NAMESPACES somewhere in your XML tree hierarchy; all descendant nodes can then use that declaration. In your case that bit is missing, so lxml generates prefixes like ns0.
Also, when you create a new node, use the namespace URI in the tag name, not the namespace prefix: {http://example.com/url-for-settings-namespace}current
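For illustration, here's a minimal sketch (the URI is a placeholder) showing that declaring nsmap once on an ancestor is enough; descendants created with the URI then serialize with the settings: prefix:
>>> from lxml import etree
>>> NSMAP = {'settings': 'http://example.com/url-for-settings-namespace'}
>>> root = etree.Element('root', nsmap=NSMAP)  # declare the prefix once, on the root
>>> child = etree.SubElement(root, '{%s}current' % NSMAP['settings'])  # tag uses the URI
>>> etree.tostring(root)
'<root xmlns:settings="http://example.com/url-for-settings-namespace"><settings:current/></root>'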

Related

How to search for a specific word within a text?

I have a file, of type txt, with the following text:
The dataset is available at: https://archive.ics.uci.edu/ml/datasets.php
The file name is Cancer_Data.xml
This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.
I need to search within this text for the word that contains "xml". I tried the following implementation:
import pandas as pd
with open(local_arquivo, "r") as file_read:
    for line in file_read:
        var_split = line.split()
        for i in range(0, len(var_split)):
            if(var_split[i].str.contains('xml')):
                archive_name = var_split.iloc[i]
The idea was to separate the text using the split function and then look for the part that contains the 'xml'. However, when I run it, the following error appears:
AttributeError: 'str' object has no attribute 'str'
I would like the output to be:
archive_name = Cancer_Data.xml
Try
if('xml' in var_split[i]):
source: https://docs.python.org/3/reference/expressions.html#in
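A minimal sketch of the corrected loop (the file path is a placeholder), using a plain membership test on the str instead of the pandas .str accessor:
archive_name = None
with open("local_arquivo.txt", "r") as file_read:  # placeholder path
    for line in file_read:
        for word in line.split():
            if 'xml' in word:  # plain 'in' works on a str
                archive_name = word
print(archive_name)  # -> Cancer_Data.xml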

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=10"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=20"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=30"))
results.append(requests.get("https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=40"))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml")  # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk": True}))  # re.compile("data-jk")
# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL
# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0])) # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs))) # DEBUG
print("Example JobID: " + str(jobIDs[1])) # DEBUG
The examples I've seen online generally try to get the information contained between the opening and closing tags, but I am not sure how to get the info from inside the (opening) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent with how the tags operate, and I think that using beautifulsoup would be more flexible than multiple cases and substrings.
Any pointers would be greatly appreciated!
Looks like you can regex them out from a script tag
import requests,re
html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
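Alternatively, since each item in justjobs is already a bs4 Tag, the attribute value can be read with dict-style access instead of find; a sketch based on the code above:
jobIDs = [tag.get("data-jk") for tag in justjobs]  # Tag attributes are dict-like; .get avoids a KeyError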

How to get StreamSets Record Fields Type inside Jython Evaluator

I have a StreamSets pipeline, where I read from a remote SQL Server database using JDBC component as an origin and put the data into a Hive and a Kudu Data Lake.
I'm facing some issues with the type Binary Columns, as there is no Binary type support in Impala, which I use to access both Hive and Kudu.
I decided to convert the Binary type columns (which flow through the pipeline as Byte_Array type) to String and insert them like that.
I tried to use a Field Type Converter element to convert all Byte_Array types to String, but it didn't work. So I used a Jython component to convert all arr.arr types to String. It worked fine until I got a Null value in that field: the Jython type was then NoneType, so I was unable to detect the Byte_Array type and unable to convert it to String, and I couldn't insert it into Kudu.
Any help how to get StreamSets Record Field Types inside Jython Evaluator? Or any suggested work around for the problem I'm facing?
You need to use sdcFunctions.getFieldNull() to test whether the field is NULL_BYTE_ARRAY. For example:
import array

def convert(item):
    return ':-)'

def is_byte_array(record, k, v):
    # getFieldNull expects a field path, so we need to prepend the '/'
    return (sdcFunctions.getFieldNull(record, '/'+k) == NULL_BYTE_ARRAY
            or (type(v) == array.array and v.typecode == 'b'))

for record in records:
    try:
        record.value = {k: convert(v) if is_byte_array(record, k, v) else v
                        for k, v in record.value.items()}
        output.write(record)
    except Exception as e:
        error.write(record, str(e))
So here is my final solution:
You can use the logic below to detect any StreamSets type inside the Jython component by using the NULL_CONSTANTS:
NULL_BOOLEAN, NULL_CHAR, NULL_BYTE, NULL_SHORT, NULL_INTEGER, NULL_LONG,
NULL_FLOAT, NULL_DOUBLE, NULL_DATE, NULL_DATETIME, NULL_TIME, NULL_DECIMAL,
NULL_BYTE_ARRAY, NULL_STRING, NULL_LIST, NULL_MAP
The idea is to save the value of the field in a temp variable, set the value of the field to be None and use the function sdcFunctions.getFieldNull to know the StreamSets type by comparing it to one of the NULL_CONSTANTS.
import binascii

def toByteArrayToHexString(value):
    if value is None:
        return NULL_STRING
    value = '0x'+binascii.hexlify(value).upper()
    return value

for record in records:
    try:
        for colName, value in record.value.items():
            temp = record.value[colName]
            record.value[colName] = None
            if sdcFunctions.getFieldNull(record, '/'+colName) is NULL_BYTE_ARRAY:
                temp = toByteArrayToHexString(temp)
            record.value[colName] = temp
        output.write(record)
    except Exception as e:
        error.write(record, str(e))
Limitation:
The code above converts the Date type to Datetime type only when it has a value (when it's not NULL).

How to update Saved Document Extra data in GridFS

I have stored files on GridFS that have extra information stored on them; it is not set in the metadata, otherwise I would use this.
For PyMongo 3.2.2 we stored the information on the same level as the actual data in fs.files (not using the metadata),
so for example we have:
fs.files = [{
    _id, description, title, ...
}]
When I call GridFS.put like so, nothing happens:
FS = GridFS(mongo_,)
file.description = request_data.get('description', None)
FS.put(file)
How can I update that file extra information such as description?
What is file in your code example? That is, what are you passing to GridFS.put?
To add metadata to a GridFS file, pass additional keyword arguments to put, as it shows in the PyMongo tutorial:
>>> fs.put(b'data', filename='foo', description='my description')
ObjectId('5825ea8ea08bff9df5059099')
Now the metadata is stored along with your data in GridFS:
>>> gridout = fs.get(ObjectId('5825ea8ea08bff9df5059099'))
>>> gridout.description
u'my description'
Under the hood, you can see that PyMongo stored the metadata in the fs.files collection in MongoDB:
>>> for doc in db.fs.files.find():
... print(doc.get('description'))
...
my description
But the better way to access GridFS data is with PyMongo's GridFS API, not by directly querying the collections.
Another way to store metadata is to create a GridIn, set a field, and call close:
>>> gridin = fs.new_file()
>>> gridin.filename = 'foo'
>>> gridin.description = 'my description'
>>> gridin.write(b'data')
>>> gridin.close()
This is a good option if you need to call write multiple times with chunks of data.
Manually update those extra fields without using GridFS:
file = mongo_.fs.files.find_one({'_id': ObjectId(fileId)})
file['description'] = request_data.get('description', None)
mongo_.fs.files.save(file)
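If the documents already exist in fs.files, here is a sketch of the same direct update using update_one instead of the legacy Collection.save (assuming fileId and request_data as in the question):
from bson import ObjectId
mongo_.fs.files.update_one(
    {'_id': ObjectId(fileId)},
    {'$set': {'description': request_data.get('description', None)}}
)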

Pandas HDF5 Select with Where on non natural-named columns

In my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-natural-named columns (nb: for a good reason, with negative numbers being "system" ids etc.), which normally doesn't cause an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
However, my select statement does fall over it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my negative-value columns from "a-1" to "a_1", but that means reloading all of the data in my system, which is rather much! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
   a-6
0    1
1    2
2    3
3  NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
There is a way, but you would have to build this into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master):
def select(self):
    """
    generate the selection
    """
    if self.condition is not None:
        return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
    elif self.coordinates is not None:
        return self.table.table.readCoordinates(self.coordinates)
    return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
          condvars={ 'x' : getattr(self.table.table.cols,'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by substituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.
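A sketch of that renaming, assuming the whole table fits in memory (the output file name is a placeholder):
df = fact_hdf.select('store_0_0')                      # read the existing table into memory
df = df.rename(columns=lambda c: c.replace('-', '_'))  # e.g. 'a-6' -> 'a_6'
df.to_hdf('fact_renamed.h5', 'store_0_0', mode='w', table=True, data_columns=True)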