Search for repeating word in text - jython

I haven't found any straightforward answers.
I need to find the word in a text / string that is repeated the most.
E.g.
String that has following values:
000587\local_users
000587\local_users
4444\et-4444
et\pmostowiak
et\pmostowiak
et\pmostowiak
Then the result needs to be et\pmostowiak
How should I accomplish this?
EDIT:
I'm using an older version of Jython, so I can't use the Counter class from the collections library.
This collects all values that are found more than once:
d = {}
for x in users:
    d[x] = x in d
_result = [x for x in d if d[x]] # [1]
Could I build on this further?

Once you have some iterable container of words, collections does exactly what you need.
>>> import collections
>>> words = ['000587\local_users', '000587\local_users', '4444\et-4444', 'et\pmostowiak', 'et\pmostowiak', 'et\pmostowiak']
>>> print collections.Counter(words).most_common(1)
[('et\\pmostowiak', 3)]
This raises the question of how to split a string.
This works:
>>> str = """000587\local_users
... 000587\local_users
... 4444\et-4444
... et\pmostowiak
... et\pmostowiak
... et\pmostowiak"""
>>> str.split('\n')
['000587\\local_users', '000587\\local_users', '4444\\et-4444', 'et\\pmostowiak', 'et\\pmostowiak', 'et\\pmostowiak']
>>> words = str.split('\n')
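Since the question says collections.Counter is not available in the asker's Jython version, the same count can be sketched with a plain dictionary. This is only a minimal illustration: the helper name most_common_word is made up for this example, and on a tie it simply returns whichever of the tied words it encounters first.

```python
def most_common_word(words):
    """Return (word, count) for the most frequent item, or None if empty."""
    counts = {}
    for word in words:
        # dict.get avoids needing collections.Counter or defaultdict
        counts[word] = counts.get(word, 0) + 1
    best = None
    for word, count in counts.items():
        if best is None or count > best[1]:
            best = (word, count)
    return best

# Raw string so the backslashes are kept literally
text = r"""000587\local_users
000587\local_users
4444\et-4444
et\pmostowiak
et\pmostowiak
et\pmostowiak"""

print(most_common_word(text.splitlines()))
```

Only dictionary methods and a loop are used, so nothing here depends on the collections module.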


Tensor to Dataframe for each sentence

For a 6-class sentence classification task, I have a list of sentences for which I retrieve the raw output values (logits) before the softmax is applied. Example list of sentences:
s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']
I get the values the following way:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    print(output.logits.detach().numpy())
# returns [[-0.8390876 2.9480567 -0.5134539 0.70386493 -0.5019671 -2.619496 ]]
# [[-0.8847909 -0.9642067 -2.2108874 -0.43932158 4.3386173 -0.37383893]]
# [[-0.48750368 3.2949197 2.1660519 -0.6453249 -1.7101991 -2.817954 ]]
How do I create a data frame with the columns sentence, class_1, class_2, class_3, class_4, class_5, class_6, either by adding values iteratively or, preferably, in a more efficient way that appends each new sentence together with its values? What would be the best way?
Expected output:
   sentence                  class_1      class_2      class_3     ....
0  I like the weather today  -0.8390876    2.9480567   -0.5134539  ....
1  The movie was very scary  -0.8847909   -0.9642067   -2.2108874  ....
2  Love is in the air        -0.48750368   3.2949197    2.1660519  ....
...
If I only had one sentence, I could transform it to a data frame like this, but I would still need to add the sentence itself somehow:
sentence = tokenizer("Love is in the air", return_tensors="pt")
output = model(sentence["input_ids"])
px = pd.DataFrame(output.logits.detach().numpy())
Maybe creating two separate data frames and then appending them would be one plausible way of doing this?
Save the model outputs in a list and then create the dataframe from an object:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
outputs = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    outputs.append(output.logits.detach().numpy()[0])
# convert to one numpy array
outputs = np.array(outputs)
# create dataframe
obj = {"sentence": s}
for class_id in range(outputs.shape[1]):
    # get the data column for that class (numbered from 1, as in the question)
    obj[f"class_{class_id + 1}"] = outputs[:, class_id].tolist()
df = pd.DataFrame(obj)
I managed to come up with a solution and I am posting it as someone might find it useful.
The idea is to initialize a data frame and to append the values for every sentence while iterating:
absolute_vals = pd.DataFrame()
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    px = pd.DataFrame(output.logits.detach().numpy())
    # DataFrame.append was removed in pandas 2.0; there, use pd.concat([absolute_vals, px], ignore_index=True)
    absolute_vals = absolute_vals.append(px, ignore_index = True)
absolute_vals
Returns:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
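As a rough sketch of the "build it in one go" idea from the answers above, the frame can also be created from a stacked array of logits and the sentence column inserted afterwards. The numbers below are stand-in values rounded from the question's output, not real model output, and the variable names are chosen only for this illustration; in practice the rows would be collected from output.logits as shown earlier.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): one row of logits per sentence
sentences = ["I like the weather today", "The movie was very scary"]
logits = np.array([
    [-0.84, 2.95, -0.51, 0.70, -0.50, -2.62],
    [-0.88, -0.96, -2.21, -0.44, 4.34, -0.37],
])

# Name the columns class_1 ... class_6, matching the question's layout
columns = [f"class_{k}" for k in range(1, logits.shape[1] + 1)]
df = pd.DataFrame(logits, columns=columns)

# Put the sentence text in front of the numeric columns
df.insert(0, "sentence", sentences)
print(df)
```

Building the frame once from a complete array avoids growing a DataFrame row by row inside the loop.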

How to convert a string inside a function to a variable name that holds a Pandas DataFrame outside the function? [duplicate]

I know that some other languages, such as PHP, support a concept of "variable variable names" - that is, the contents of a string can be used as part of a variable name.
I heard that this is a bad idea in general, but I think it would solve some problems I have in my Python code.
Is it possible to do something like this in Python? What can go wrong?
If you are just trying to look up an existing variable by its name, see How can I select a variable by (string) name?. However, first consider whether you can reorganize the code to avoid that need, following the advice in this question.
You can use dictionaries to accomplish this. Dictionaries are stores of keys and values.
>>> dct = {'x': 1, 'y': 2, 'z': 3}
>>> dct
{'y': 2, 'x': 1, 'z': 3}
>>> dct["y"]
2
You can use variable key names to achieve the effect of variable variables without the security risk.
>>> x = "spam"
>>> z = {x: "eggs"}
>>> z["spam"]
'eggs'
For cases where you're thinking of doing something like
var1 = 'foo'
var2 = 'bar'
var3 = 'baz'
...
a list may be more appropriate than a dict. A list represents an ordered sequence of objects, with integer indices:
lst = ['foo', 'bar', 'baz']
print(lst[1]) # prints bar, because indices start at 0
lst.append('potatoes') # lst is now ['foo', 'bar', 'baz', 'potatoes']
For ordered sequences, lists are more convenient than dicts with integer keys, because lists support iteration in index order, slicing, append, and other operations that would require awkward key management with a dict.
Use the built-in getattr function to get an attribute on an object by name. Modify the name as needed.
obj.spam = 'eggs'
name = 'spam'
getattr(obj, name) # returns 'eggs'
It's not a good idea. If you are accessing a global variable you can use globals().
>>> a = 10
>>> globals()['a']
10
If you want to access a variable in the local scope you can use locals(), but you cannot assign values to the returned dict.
A better solution is to use getattr or store your variables in a dictionary and then access them by name.
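To illustrate the point about locals(), here is a small sketch (the function and variable names are invented for this example). Inside a function, writing into the dict returned by locals() does not rebind the local variable, whereas an ordinary dictionary can always be read and written by name.

```python
def write_through_locals():
    x = 1
    locals()["x"] = 2   # has no effect on the local variable x
    return x

# A plain dict used as an explicit namespace behaves predictably
namespace = {}
namespace["x"] = 2

print(write_through_locals())
print(namespace["x"])
```

This is why the dictionary approach is usually preferred: the lookup by string is explicit and does not depend on how the interpreter stores local variables.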
New coders sometimes write code like this:
my_calculator.button_0 = tkinter.Button(root, text=0)
my_calculator.button_1 = tkinter.Button(root, text=1)
my_calculator.button_2 = tkinter.Button(root, text=2)
...
The coder is then left with a pile of named variables, with a coding effort of O(m * n), where m is the number of named variables and n is the number of times that group of variables needs to be accessed (including creation). The more astute beginner observes that the only difference in each of those lines is a number that changes based on a rule, and decides to use a loop. However, they get stuck on how to dynamically create those variable names, and may try something like this:
for i in range(10):
    my_calculator.('button_%d' % i) = tkinter.Button(root, text=i)
They soon find that this does not work.
If the program requires arbitrary variable "names," a dictionary is the best choice, as explained in other answers. However, if you're simply trying to create many variables and you don't mind referring to them with a sequence of integers, you're probably looking for a list. This is particularly true if your data are homogeneous, such as daily temperature readings, weekly quiz scores, or a grid of graphical widgets.
This can be assembled as follows:
my_calculator.buttons = []
for i in range(10):
    my_calculator.buttons.append(tkinter.Button(root, text=i))
This list can also be created in one line with a comprehension:
my_calculator.buttons = [tkinter.Button(root, text=i) for i in range(10)]
The result in either case is a populated list, with the first element accessed with my_calculator.buttons[0], the next with my_calculator.buttons[1], and so on. The "base" variable name becomes the name of the list and the varying identifier is used to access it.
Finally, don't forget other data structures, such as the set - this is similar to a dictionary, except that each "name" doesn't have a value attached to it. If you simply need a "bag" of objects, this can be a great choice. Instead of something like this:
keyword_1 = 'apple'
keyword_2 = 'banana'
if query == keyword_1 or query == keyword_2:
    print('Match.')
You will have this:
keywords = {'apple', 'banana'}
if query in keywords:
    print('Match.')
Use a list for a sequence of similar objects, a set for an arbitrarily-ordered bag of objects, or a dict for a bag of names with associated values.
Whenever you want to use variable variables, it's probably better to use a dictionary. So instead of writing
$foo = "bar"
$$foo = "baz"
you write
mydict = {}
foo = "bar"
mydict[foo] = "baz"
This way you won't accidentally overwrite previously existing variables (which is the security aspect) and you can have different "namespaces".
Use globals() (disclaimer: this is bad practice, but it is the most straightforward answer to your question; please use another data structure, as in the accepted answer).
You can actually assign variables to the global scope dynamically. For instance, if you want 10 variables that can be accessed on a global scope, i_1, i_2 ... i_10:
for i in range(1, 11):
    globals()['i_{}'.format(i)] = 'a'
This will assign 'a' to all 10 of these variables; of course, you can change the value dynamically as well. All of these variables can now be accessed like any other globally declared variable:
>>> i_5
'a'
Instead of a dictionary you can also use namedtuple from the collections module, which makes access easier.
For example:
# using dictionary
variables = {}
variables["first"] = 34
variables["second"] = 45
print(variables["first"], variables["second"])
# using namedtuple
from collections import namedtuple
Variables = namedtuple('Variables', ['first', 'second'])
v = Variables(34, 45)
print(v.first, v.second)
The SimpleNamespace class can be used to create new attributes with setattr; alternatively, subclass SimpleNamespace and write your own function to add new attribute names (variables).
from types import SimpleNamespace
variables = {"b":"B","c":"C"}
a = SimpleNamespace(**variables)
setattr(a,"g","G")
a.g = "G+"
something = a.b  # attributes come from the dict keys; a.a does not exist and would raise AttributeError
If you don't want to use any object, you can still use setattr() inside your current module:
import sys
current_module = module = sys.modules[__name__] # i.e the "file" where your code is written
setattr(current_module, 'variable_name', 15) # 15 is the value you assign to the var
print(variable_name) # >>> 15, created from a string
You have to use the globals() built-in function to achieve that behaviour:
def var_of_var(k, v):
    globals()[k] = v
print(variable_name) # NameError: name 'variable_name' is not defined
some_name = 'variable_name'
globals()[some_name] = 123
print(variable_name) # 123
some_name = 'variable_name2'
var_of_var(some_name, 456)
print(variable_name2) # 456
Variable variables in Python
"""
<?php
$a = 'hello';
$e = 'wow';
?>
<?php
$$a = 'world';
?>
<?php
echo "$a ${$a}\n";
echo "$a ${$a[1]}\n";
?>
<?php
echo "$a $hello";
?>
"""
a = 'hello' #<?php $a = 'hello'; ?>
e = 'wow' #<?php $e = 'wow'; ?>
vars()[a] = 'world' #<?php $$a = 'world'; ?>
print(a, vars()[a]) #<?php echo "$a ${$a}\n"; ?>
print(a, vars()[vars()['a'][1]]) #<?php echo "$a ${$a[1]}\n"; ?>
print(a, hello) #<?php echo "$a $hello"; ?>
Output:
hello world
hello wow
hello world
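At module level, globals(), locals(), and vars() all return the same namespace dictionary, so each produces the same results (inside a function, writing to locals() does not reliably change local variables):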
#<?php $a = 'hello'; ?>
#<?php $e = 'wow'; ?>
#<?php $$a = 'world'; ?>
#<?php echo "$a ${$a}\n"; ?>
#<?php echo "$a ${$a[1]}\n"; ?>
#<?php echo "$a $hello"; ?>
print('locals():\n')
a = 'hello'
e = 'wow'
locals()[a] = 'world'
print(a, locals()[a])
print(a, locals()[locals()['a'][1]])
print(a, hello)
print('\n\nglobals():\n')
a = 'hello'
e = 'wow'
globals()[a] = 'world'
print(a, globals()[a])
print(a, globals()[globals()['a'][1]])
print(a, hello)
Output:
locals():
hello world
hello wow
hello world
globals():
hello world
hello wow
hello world
Bonus (creating variables from strings)
# Python 2.7.16 (default, Jul 13 2019, 16:01:51)
# [GCC 8.3.0] on linux2
Creating variables and unpacking tuple:
g = globals()
listB = []
for i in range(10):
    g["num%s" % i] = i ** 10
    listB.append("num{0}".format(i))
def printNum():
    print "Printing num0 to num9:"
    for i in range(10):
        print "num%s = " % i,
        print g["num%s" % i]
printNum()
listA = []
for i in range(10):
    listA.append(i)
listA = tuple(listA)
print listA, '"Tuple to unpack"'
listB = str(str(listB).strip("[]").replace("'", "") + " = listA")
print listB
exec listB
printNum()
Output:
Printing num0 to num9:
num0 = 0
num1 = 1
num2 = 1024
num3 = 59049
num4 = 1048576
num5 = 9765625
num6 = 60466176
num7 = 282475249
num8 = 1073741824
num9 = 3486784401
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) "Tuple to unpack"
num0, num1, num2, num3, num4, num5, num6, num7, num8, num9 = listA
Printing num0 to num9:
num0 = 0
num1 = 1
num2 = 2
num3 = 3
num4 = 4
num5 = 5
num6 = 6
num7 = 7
num8 = 8
num9 = 9
I'm answering the question How to get the value of a variable given its name in a string?
which is closed as a duplicate with a link to this question. (Editor's note: It is now closed as a duplicate of How can I select a variable by (string) name?)
If the variables in question are part of an object (part of a class for example) then some useful functions to achieve exactly that are hasattr, getattr, and setattr.
So for example you can have:
class Variables(object):
    def __init__(self):
        self.foo = "initial_variable"
    def create_new_var(self, name, value):
        setattr(self, name, value)
    def get_var(self, name):
        if hasattr(self, name):
            return getattr(self, name)
        else:
            # string exceptions are not valid; raise a real exception type
            raise AttributeError("Class does not have a variable named: " + name)
Then you can do:
>>> v = Variables()
>>> v.get_var("foo")
'initial_variable'
>>> v.create_new_var(v.foo, "is actually not initial")
>>> v.initial_variable
'is actually not initial'
I have tried both in Python 3.7.3; you can use either globals() or vars():
>>> food #Error
>>> milkshake #Error
>>> food="bread"
>>> drink="milkshake"
>>> globals()[food] = "strawberry flavor"
>>> vars()[drink] = "chocolate flavor"
>>> bread
'strawberry flavor'
>>> milkshake
'chocolate flavor'
>>> globals()[drink]
'chocolate flavor'
>>> vars()[food]
'strawberry flavor'
Reference:
https://www.daniweb.com/programming/software-development/threads/111526/setting-a-string-as-a-variable-name#post548936
The consensus is to use a dictionary for this - see the other answers. This is a good idea in most cases; however, it has several consequences:
- you are responsible for this dictionary yourself, including garbage collection (of in-dict variables) etc.
- variable variables are neither local nor global in themselves; that depends on the scope of the dictionary
- if you want to rename a variable, you have to do it manually
- however, you are much more flexible, e.g.
  - you can decide to overwrite existing variables or ...
  - ... choose to implement const variables
  - raise an exception when overwriting with a different type
  - etc.
That said, I've implemented a variable variables manager-class which provides some of the above ideas. It works for python 2 and 3.
You'd use the class like this:
from variableVariablesManager import VariableVariablesManager
myVars = VariableVariablesManager()
myVars['test'] = 25
print(myVars['test'])
# define a const variable
myVars.defineConstVariable('myconst', 13)
try:
    myVars['myconst'] = 14 # <- this raises an error, since 'myconst' must not be changed
    print("not allowed")
except AttributeError as e:
    pass
# rename a variable
myVars.renameVariable('myconst', 'myconstOther')
# preserve locality
def testLocalVar():
    myVars = VariableVariablesManager()
    myVars['test'] = 13
    print("inside function myVars['test']:", myVars['test'])
testLocalVar()
print("outside function myVars['test']:", myVars['test'])
# define a global variable
myVars.defineGlobalVariable('globalVar', 12)
def testGlobalVar():
    myVars = VariableVariablesManager()
    print("inside function myVars['globalVar']:", myVars['globalVar'])
    myVars['globalVar'] = 13
    print("inside function myVars['globalVar'] (having been changed):", myVars['globalVar'])
testGlobalVar()
print("outside function myVars['globalVar']:", myVars['globalVar'])
If you wish to allow overwriting of variables with the same type only:
myVars = VariableVariablesManager(enforceSameTypeOnOverride = True)
myVars['test'] = 25
myVars['test'] = "Cat" # <- raises Exception (different type on overwriting)
Any set of variables can also be wrapped up in a class.
"Variable" variables may be added to the class instance during runtime by directly accessing the built-in dictionary through __dict__ attribute.
The following code defines Variables class, which adds variables (in this case attributes) to its instance during the construction. Variable names are taken from a specified list (which, for example, could have been generated by program code):
# some list of variable names
L = ['a', 'b', 'c']
class Variables:
    def __init__(self, L):
        for item in L:
            self.__dict__[item] = 100
v = Variables(L)
print(v.a, v.b, v.c)
#will produce 100 100 100
This is extremely risky...
but you can use exec():
a = 'b=5'
exec(a)
c = b*2
print (c)
Result:
10
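If exec() really is needed, one way to limit the damage to the surrounding code is to pass it an explicit dictionary as its namespace, so that any names the string creates end up in that dictionary instead of in the module's globals. This is only a sketch: it keeps names contained but does not make untrusted code safe to run.

```python
code = "b = 5"           # illustrative string; never exec untrusted input
namespace = {}
exec(code, namespace)    # names created by the code are stored in `namespace`

c = namespace["b"] * 2
print(c)
```

Reading the value back with namespace["b"] is again just a dictionary lookup, which is the pattern the other answers recommend.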
The setattr() function sets the value of the named attribute of the given object.
The syntax is:
setattr(object, name, value)
Example:
setattr(self, 'id', 123)
which is equivalent to self.id = 123
As you might have observed, setattr() expects an object to be passed along with the name and value in order to create or modify an attribute.
We can use setattr() with a workaround to be able to use it within modules. Here's how:
import sys
x = "pikachu"
value = 46
thismodule = sys.modules[__name__]
setattr(thismodule, x, value)
print(pikachu)

New column with word at nth position of string from other column pandas

import numpy as np
import pandas as pd
d = {'ABSTRACT_ID': [14145090,1900667, 8157202,6784974],
'TEXT': [
"velvet antlers vas are commonly used in tradit",
"we have taken a basic biologic RPA to elucidat4",
"ceftobiprole bpr is an investigational cephalo",
"lipoperoxidationderived aldehydes for example",],
'LOCATION': [1, 4, 2, 1]}
df = pd.DataFrame(data=d)
df
def word_at_pos(x,y):
    pos=x
    string= y
    count = 0
    res = ""
    for word in string:
        if word == ' ':
            count = count + 1
            if count == pos:
                break
            res = ""
        else :
            res = res + word
    print(res)
word_at_pos(df.iloc[0,2],df.iloc[0,1])
For this df I want to create a new column WORD that contains the word from TEXT at the position indicated by LOCATION. E.g., for the first row it would be "velvet".
I can do this for a single row with the standalone function word_at_pos(x,y), but I can't work out how to apply it to the whole column. I have created new columns with lambda functions before, but I can't work out how to fit this function into a lambda.
Looping over TEXT and LOCATION could be the best idea because splitting creates a jagged array, so filtering using numpy advanced indexing won't be possible.
df["WORDS"] = [txt.split()[loc] for txt, loc in zip(df["TEXT"], df["LOCATION"]-1)]
print(df)
ABSTRACT_ID ... WORDS
0 14145090 ... velvet
1 1900667 ... a
2 8157202 ... bpr
3 6784974 ... lipoperoxidationderived
[4 rows x 4 columns]
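The list comprehension above raises an IndexError if a LOCATION is larger than the number of words in the row. A small helper with a bounds check is one way to guard against that; the name word_at and the choice to return None for out-of-range positions are assumptions made for this sketch.

```python
def word_at(text, position):
    """Return the word at 1-based `position`, or None if out of range."""
    words = text.split()
    if 1 <= position <= len(words):
        return words[position - 1]
    return None

print(word_at("velvet antlers vas are commonly used in tradit", 1))
print(word_at("too short", 5))
```

It can be dropped into the same comprehension, e.g. [word_at(t, p) for t, p in zip(df["TEXT"], df["LOCATION"])], or used with df.apply(lambda row: word_at(row["TEXT"], row["LOCATION"]), axis=1) if a lambda is preferred.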

apply generic function in a vectorized fashion using numpy/pandas

I am trying to vectorize my code and, thanks in large part to some users (https://stackoverflow.com/users/3293881/divakar, https://stackoverflow.com/users/625914/behzad-nouri), I was able to make huge progress. Essentially, I am trying to apply a generic function (in this case max_dd_array_ret) to each of the bins I found (see "vectorize complex slicing with pandas dataframe" for details on date vectorization and "Start, End and Duration of Maximum Drawdown in Python" for the rationale behind max_dd_array_ret). The problem is the following: I should be able to obtain the result df_2, and ranged_DD(asd_1.values, starts, ends+1) is close to what I am looking for, except that the first two bins appear to be merged and the last one is missing, as can be seen in the results.
Any explanation and fix would be very welcome.
import pandas as pd
import numpy as np
from time import time
from scipy.stats import binned_statistic
def max_dd_array_ret(xs):
    xs = (xs+1).cumprod()
    i = np.argmax(np.maximum.accumulate(xs) - xs) # end of the period
    j = np.argmax(xs[:i])
    max_dd = abs(xs[j]/xs[i] -1)
    return max_dd if max_dd is not None else 0
def get_ranges_arr(starts,ends):
    # Taken from https://stackoverflow.com/a/37626057/3293881
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1],dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()
def ranged_DD(arr,starts,ends):
    # Get all indices and the IDs corresponding to same groups
    idx = get_ranges_arr(starts,ends)
    id_arr = np.repeat(np.arange(starts.size),ends-starts)
    slice_arr = arr[idx]
    return binned_statistic(id_arr, slice_arr, statistic=max_dd_array_ret)[0]
asd_1 = pd.Series(0.01 * np.random.randn(500), index=pd.date_range('2011-1-1', periods=500)).pct_change()
index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1','2011-7-2', '2011-8-3', '2011-9-1','2011-10-2', '2011-11-3', '2011-12-1','2012-1-2', '2012-2-3', '2012-3-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17','2011-7-17', '2011-8-17', '2011-9-17','2011-10-17', '2011-11-17', '2011-12-17','2012-1-17', '2012-2-17', '2012-3-17',])
starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)
df_2 = pd.DataFrame([max_dd_array_ret(asd_1.loc[i:j]) for i, j in zip(index_1, index_2)], index=index_1)
print(df_2[0].values)
print(ranged_DD(asd_1.values, starts, ends+1))
results:
df_2
[ 1.75893509 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085
1.43863472 1.85322338 1.84767224 1.32605754 1.48688414 5.44786663]
ranged_DD(asd_1.values, starts, ends+1)
[ 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085 1.43863472
1.85322338 1.84767224 1.32605754 1.48688414]
which are identical except for the first two:
[ 1.75893509 6.08002911 vs [ 6.08002911
and the last two
1.48688414 5.44786663] vs 1.48688414]
P.S.: While looking in more detail at the docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html), I found that this might be the problem:
"All but the last (righthand-most) bin is half-open. In other words,
if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1,
but excluding 2) and the second [2, 3). The last bin, however, is [3,
4], which includes 4. New in version 0.11.0."
The problem is that I don't know how to change this behaviour.

Why does this numpy array comparison fail?

I try to compare the results of some numpy.array calculations with expected results, and I constantly get false comparison, but the printed arrays look the same, e.g:
def test_gen_sine():
    A, f, phi, fs, t = 1.0, 10.0, 1.0, 50.0, 0.1
    expected = array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
    result = gen_sine(A, f, phi, fs, t)
    npt.assert_array_equal(expected, result)
prints back:
> raise AssertionError(msg)
E AssertionError:
E Arrays are not equal
E
E (mismatch 100.0%)
E x: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
E y: array([ 0.540302, -0.633324, -0.931718, 0.05749 , 0.967249])
My gen_sine function is:
def gen_sine(A, f, phi, fs, t):
    sampling_period = 1 / fs
    num_samples = fs * t
    samples_range = (np.arange(0, num_samples) * 2 * f * np.pi * sampling_period) + phi
    return A * np.cos(samples_range)
Why is that? How should I compare the two arrays?
(I'm using numpy 1.9.3 and pytest 2.8.1)
The problem is that npt.assert_array_equal (numpy.testing.assert_array_equal) returns None and performs the assertion internally. It is incorrect to wrap it in a separate assert, as you do:
assert npt.assert_array_equal(x,y)
Instead, in your test you would just do something like:
import numpy as np
from numpy.testing import assert_array_equal
def test_equal():
    assert_array_equal(np.arange(0,3), np.array([0,1,2])) # No assertion raised
    assert_array_equal(np.arange(0,3), np.array([2,0,1])) # Raises AssertionError
Update:
A few comments:
Don't rewrite your entire original question, because it then becomes unclear what an answer was actually addressing.
As for your updated question, the issue is that assert_array_equal is not appropriate for comparing floating-point arrays, as explained in the documentation. Instead, use assert_allclose and set the desired relative and absolute tolerances.
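A minimal sketch of that suggestion, using the numbers from the question: the expected array is rounded to 8 decimals, so an exact comparison fails while a tolerance-based one passes. The tolerance values rtol=1e-6 and atol=1e-8 are chosen here only for illustration, and the sine values are recomputed inline with the question's parameters (A=1, f=10, phi=1, fs=50, t=0.1) rather than by calling gen_sine.

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

expected = np.array([0.54030231, -0.63332387, -0.93171798, 0.05749049, 0.96724906])
# Five samples of cos(2*pi*f*n/fs + phi) with f=10, fs=50, phi=1
result = np.cos(np.arange(5) * 2 * 10.0 * np.pi / 50.0 + 1.0)

# Passes silently: the arrays agree within the stated tolerances
assert_allclose(result, expected, rtol=1e-6, atol=1e-8)

# The exact comparison fails because `expected` holds rounded values
try:
    assert_array_equal(result, expected)
    exact_match = True
except AssertionError:
    exact_match = False
print(exact_match)
```

The printed arrays in the pytest output look identical only because they are displayed with fewer digits than the values actually hold.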