String interpolation while inserting multiple elements into xml from inside a for loop - lxml

For some reason, I found it difficult to generate an XML file by inserting multiple nodes from inside a for loop. The following does achieve that, but a couple of questions remain.
Given a list of dictionaries
dict_list = [{'x': 'abc', 'y': 'efg'},
             {'x': 'hij', 'y': 'klm'}]
I can use objectify to generate the desired XML:
from lxml import etree, objectify

end_game = etree.XML('<final_root/>')
E = objectify.ElementMaker(annotate=False)
tmp_root = E.entry
for d in dict_list:
    att_values = [val for val in d.values()]
    doc = tmp_root(
        x=att_values[0],
        y=att_values[1]
    )
    end_game.append(doc)
print(etree.tostring(end_game, pretty_print=True).decode())
Output, as desired:
<final_root>
  <entry x="abc" y="efg"/>
  <entry x="hij" y="klm"/>
</final_root>
The problem is that I need to hardwire the attribute names x and y into the loop. Attempting to use string interpolation fails. For instance:
for d in dict_list:
    att_items = [item for item in d.items()]
    doc = tmp_root(
        att_items[0][0]=att_items[0][1],
        att_items[1][0]=att_items[1][1]
    )
raises
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
The same error is raised using f-strings (f'{att_items[0][0]}' = att_items[0][1]) or str.format ('{}'.format(att_items[0][0]) = att_items[0][1]).
So, the obvious question: is there a way to avoid having to manually insert attribute names? Alternatively: is it possible to duplicate the outcome (and maybe avoid the issue) using lxml.etree instead?

Since etree Elements carry attributes as a dict, you should be able to just pass in the dict when constructing the Element...
from lxml import etree

end_game = etree.XML('<final_root/>')
dict_list = [{'x': 'abc', 'y': 'efg'},
             {'x': 'hij', 'y': 'klm'}]
for meta in dict_list:
    end_game.append(etree.Element('item', meta))
print(etree.tostring(end_game, pretty_print=True).decode())
printed output...
<final_root>
  <item x="abc" y="efg"/>
  <item x="hij" y="klm"/>
</final_root>
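Incidentally, the original objectify approach can also avoid hardwiring the attribute names: ElementMaker turns keyword arguments into attributes, so unpacking each dict with ** should do it. A minimal sketch:
from lxml import etree, objectify

end_game = etree.XML('<final_root/>')
E = objectify.ElementMaker(annotate=False)
for d in dict_list:
    # ** unpacks the dict into keyword arguments,
    # so the attribute names come straight from the dict keys
    end_game.append(E.entry(**d))
print(etree.tostring(end_game, pretty_print=True).decode())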

Randomly choose value between two numpy arrays

I have two numpy arrays:
import numpy as np

left = np.array([2, 7])
right = np.array([4, 7])
right_p1 = right + 1
What I want to do is
rand = np.zeros(left.shape[0])
for i in range(left.shape[0]):
    rand[i] = np.random.randint(left[i], right_p1[i])
Is there a way I could do this without using a for loop?
You could try with:
extremes = zip(left, right_p1)
rand = map(lambda x: np.random.randint(x[0], x[1]), extremes)
This way you end up with a map object. If you need to save memory, you can keep it that way; otherwise you can get the full np.array by passing it through a list conversion, like this:
rand = np.array(list(map(lambda x: np.random.randint(x[0], x[1]), extremes)))
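Note that, assuming NumPy 1.11 or later, np.random.randint itself broadcasts array arguments for low and high, so the whole thing collapses into one vectorized call with no loop and no map:
import numpy as np

left = np.array([2, 7])
right = np.array([4, 7])
# high is exclusive, hence right + 1 to keep the right endpoint reachable
rand = np.random.randint(left, right + 1)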

TypeError: 'Value' object is not iterable: iterating over a dataframe for prediction with a GCP Natural Language model

I'm trying to iterate over a dataframe in order to apply a predict function, which calls a Natural Language model hosted on GCP. Here is the loop code:
model = 'XXXXXXXXXXXXXXXX'
barometre_df_processed = barometre_df
barometre_df_processed['theme'] = ''
barometre_df_processed['proba'] = ''
print('DEBUT BOUCLE FOR')
for ind in barometre_df.index:
    if barometre_df.verbatim[ind] is np.nan:
        barometre_df_processed.theme[ind] = "RAS"
        barometre_df_processed.proba[ind] = "1"
    else:
        print(barometre_df.verbatim[ind])
        print(type(barometre_df.verbatim[ind]))
        res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]}, 'mime_type': 'text/plain'}, model_name=model)
        print(res)
        theme = res['displayNames']
        proba = res["classification"]["score"]
        barometre_df_processed.theme[ind] = theme
        barometre_df_processed.proba[ind] = proba
and the get_prediction function that I took from the Natural Language AI documentation:
def get_prediction(file_path, model_name):
    options = ClientOptions(api_endpoint='eu-automl.googleapis.com:443')
    prediction_client = automl_v1.PredictionServiceClient(client_options=options)
    payload = file_path
    # Uncomment the following line (and comment the above line) if you want to predict on PDFs.
    # payload = pdf_payload(file_path)
    parameters_dict = {}
    params = json_format.ParseDict(parameters_dict, Value())
    request = prediction_client.predict(name=model_name, payload=payload, params=params)
    print("fonction prediction")
    print(request)
    return resultat[0]["displayName"], resultat[0]["classification"]["score"], resultat[1]["displayName"], resultat[1]["classification"]["score"], resultat[2]["displayName"], resultat[2]["classification"]["score"]
I'm looping this way because I want each [displayName, score] pair to create a new row in my final dataframe, to get something like this:
verbatim1, theme1, proba1
verbatim1, theme2, proba2
verbatim1, theme3, proba3
verbatim2, theme1, proba1
verbatim2, theme2, proba2
...
The if barometre_df.verbatim[ind] is np.nan check is not causing problems; I just use it to deal with NaNs, so don't worry about it.
The error that I have is this one :
TypeError: 'Value' object is not iterable
I guess the issue is with
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]} },model_name=model)
but I can't figure out what's going wrong here.
I already tried removing
,'mime_type': 'text/plain'}
from my get_prediction parameters, but it doesn't change anything.
Does someone know how to deal with this issue?
Thanks in advance.
I think you are not iterating correctly.
The way to iterate through a dataframe is:
for index, row in df.iterrows():
    print(row['col1'])
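For the row-per-(theme, proba) shape described in the question, one option is to accumulate plain rows and build the result dataframe at the end. This is only a sketch: it assumes get_prediction returns the flat (name, score, name, score, ...) tuple shown above, and it puts mime_type inside text_snippet, which is where automl_v1's TextSnippet carries it:
rows = []
for ind, row in barometre_df.iterrows():
    if row['verbatim'] is np.nan:
        rows.append({'verbatim': row['verbatim'], 'theme': 'RAS', 'proba': '1'})
    else:
        res = get_prediction(
            file_path={'text_snippet': {'content': row['verbatim'],
                                        'mime_type': 'text/plain'}},
            model_name=model)
        # assumed shape: (name1, score1, name2, score2, name3, score3)
        for theme, proba in zip(res[0::2], res[1::2]):
            rows.append({'verbatim': row['verbatim'], 'theme': theme, 'proba': proba})
barometre_df_processed = pd.DataFrame(rows)
This also avoids chained-indexing writes like barometre_df_processed.theme[ind] = ..., which pandas warns about.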

What exactly to test (unittest) in a larger function containing several dataframe manipulations

Perhaps this is a constraint of my understanding of unit tests, but I get quite confused as to what should be tested, patched, etc. in a method that has several pandas dataframe manipulations. Many of the unittest examples out there focus on classes and methods that are typically small; for larger methods, I get a bit lost in the typical unittest paradigm. For example:
myscript.py
class Pivot:
    def prepare_dfs(self):
        df = pd.read_csv(self.file, sep=self.delimiter)
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        locations = ["O12-03-01", "O12-03-02"]
        cp = df1["PRN"]
        cp = cp[locations].tolist()
        data = [locations, cp]
        new_df = pd.DataFrame({"Other_Location": data[0], "Free": data[1]})
        return new_df, df
test_myscript.py
class TestPivot(unittest.TestCase):
    def setUp(self):
        args = parse_args(["-f", "test1", "-d", ","])
        self.pivot = Pivot(args)
        self.pivot.path = "Pivot/path"

    @mock.patch("myscript.cp[locations].tolist()", return_value=None)
    @mock.patch("myscript.pd.read_csv", return_value=df)
    def test_prepare_dfs_1(self, mock_read_csv, mock_cp):
        new_df, df = self.pivot.prepare_dfs()
        # Here I get a bit lost
For example here I try to circumvent the following error message:
ModuleNotFoundError: No module named 'myscript.cp[locations]'; 'myscript' is not a package
I managed to mock pd.read_csv in my method correctly; however, further down in the code there are groupby, apply, tolist, etc. The error message is thrown at the following line:
cp = cp[locations].tolist()
What is the best way to approach unit testing when your method involves several manipulations on a dataframe? Is refactoring the code (into smaller chunks) always advised? In this case, how can I correctly mock the tolist call?
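As a sketch of the refactoring idea raised in the question (with hypothetical method names): separating the I/O from the transformation lets the pandas logic run against a small in-memory frame, so neither groupby nor tolist needs to be patched at all:
class Pivot:
    def load_df(self):
        # thin I/O wrapper: the only part worth patching in tests
        return pd.read_csv(self.file, sep=self.delimiter)

    @staticmethod
    def summarize_free(df, locations=("O12-03-01", "O12-03-02")):
        # pure transformation, testable with a hand-built frame
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        cp = df1["PRN"][list(locations)].tolist()
        return pd.DataFrame({"Other_Location": list(locations), "Free": cp})

class TestPivot(unittest.TestCase):
    def test_summarize_free(self):
        df = pd.DataFrame({"Other_Location": ["O12-03-01", "O12-03-01", "O12-03-02"],
                           "PRN": ["Free", "Used", "Free"]})
        out = Pivot.summarize_free(df)
        self.assertEqual(out["Free"].tolist(), [1, 1])
With that split, only load_df would ever need mock.patch("myscript.pd.read_csv", ...); everything after it is tested directly.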

Autocorrect a column in a pandas dataframe using pyenchant

I tried to apply the code from the accepted answer of this question to one of my dataframe columns where each row is a sentence, but it didn't work.
My code looks like this:
from enchant.checker import SpellChecker

checker = SpellChecker("id_ID")

h = df['Jawaban'].astype(str).str.lower()
hayo = []
for text in h:
    checker.set_text(text)
    for s in checker:
        sug = s.suggest()[0]
        s.replace(sug)
    hayo.append(checker.get_text())
I got this following error:
IndexError: list index out of range
Any help is greatly appreciated.
I don't get the error using your code. The only thing I'm doing differently is loading a different dictionary.
import pandas as pd
from enchant.checker import SpellChecker

checker = SpellChecker('en_US', 'en_UK')  # not using id_ID

# sample data
ds = pd.DataFrame({'text': ['here is a spllng mstke', 'the wrld is grwng']})

p = ds['text'].str.lower()
hayo = []
for text in p:
    checker.set_text(text)
    for s in checker:
        sug = s.suggest()[0]
        s.replace(sug)
    print(checker.get_text())
    hayo.append(checker.get_text())
print(hayo)
here is a spelling mistake
the world is growing
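The IndexError in the original code most likely comes from s.suggest() returning an empty list for some word (the id_ID dictionary may simply have no candidates), at which point [0] fails. A defensive guard, sketched over the same loop, avoids it:
for text in h:
    checker.set_text(text)
    for s in checker:
        suggestions = s.suggest()
        # only replace when the dictionary actually offers a candidate
        if suggestions:
            s.replace(suggestions[0])
    hayo.append(checker.get_text())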

Search for repeating word in text

I haven't found any straight answers.
I need to find the word in a text/string that is repeated the most.
E.g.
A string that has the following values:
000587\local_users
000587\local_users
4444\et-4444
et\pmostowiak
et\pmostowiak
et\pmostowiak
Then the result needs to be et\pmostowiak
How should I accomplish this?
EDIT:
I'm using an older version of Jython, so I can't use the collections library's Counter
This collects all values that are found more than once:
d = {}
for x in users:
    d[x] = x in d
_result = [x for x in d if d[x]]
Could I reuse this further?
Once you have some iterable container of words, collections does exactly what you need.
>>> import collections
>>> words = ['000587\local_users', '000587\local_users', '4444\et-4444', 'et\pmostowiak', 'et\pmostowiak', 'et\pmostowiak']
>>> print collections.Counter(words).most_common(1)
[('et\\pmostowiak', 3)]
That leaves the question of how to split the string in the first place.
This works:
>>> str = """000587\local_users
... 000587\local_users
... 4444\et-4444
... et\pmostowiak
... et\pmostowiak
... et\pmostowiak"""
>>> str.split('\n')
['000587\\local_users', '000587\\local_users', '4444\\et-4444', 'et\\pmostowiak', 'et\\pmostowiak', 'et\\pmostowiak']
>>> words = str.split('\n')
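Given the EDIT's constraint that collections.Counter isn't available on the older Jython, a plain dict gives the same result; a sketch in the same Python 2 style as the answer above:
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# manual arg-max, avoiding max(key=...) in case the Jython predates Python 2.5
best, best_count = None, 0
for w, c in counts.items():
    if c > best_count:
        best, best_count = w, c
print best, best_count   # et\pmostowiak 3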