Negative of "contains" in OpenRefine

I would like to add a column based on another column and fill it with all the values that do NOT contain "jpg", i.e. the negation of this:
filter(value.split(","), v, v.contains("jpg")).join("|")
How can I write "does not contain"?

contains gives a boolean output, i.e. true or false. So we have:
v = "picture.jpg" -> v.contains("jpg") = true
v = "picture.gif" -> v.contains("jpg") = false
filter keeps all values in an array that return true for whatever condition you use. There are a couple of ways you could filter an array to find the values that don't contain a string, but with contains the simplest is probably to use not to reverse the result of your condition:
filter(value.split(","), v, not(v.contains("jpg"))).join("|")

Related

Filter out entries of datasets based on string matching

I'm working with a dataframe of chemical formulas (str objects). Example
formula
Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...
I want to filter it based on specified elements. For example, I want to produce an output which only contains the elements 'Na', 'Cl', 'Rb'; therefore the desired output should be:
formula
Na0.2Cl0.4O0.7Rb1
What I've tried to do is the following:
for i, formula in enumerate(df['formula']):
    if ('Na' and 'Cl' and 'Rb' not in formula):
        df = df.drop(index=i)
but it seems not to work.
You can use contains with an or (|) pattern for multiple string matching, keeping rows that contain at least one of them:
df[df['formula'].str.contains("Na|Cl|Rb", na=False)]
Or you can use a lookahead pattern with contains if you want to match all of them:
df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
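A quick demonstration on the sample data (a sketch, reconstructing the three example formulas above as a DataFrame):

import pandas as pd

df = pd.DataFrame({'formula': ['Na0.2Cl0.4O0.7Rb1', 'Hg0.04Mg0.2Ag2O4', 'Rb0.2AgO']})

# Rows containing at least one of Na/Cl/Rb:
print(df[df['formula'].str.contains("Na|Cl|Rb", na=False)])

# Rows containing all of Na, Cl and Rb:
print(df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')])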
Your requirements are unclear, but assuming you want to filter based on a set of elements.
Keeping formulas where all elements from the set are used:
s = {'Na','Cl','Rb'}
regex = f'({"|".join(s)})'
mask = (df['formula']
        .str.extractall(regex)[0]
        .groupby(level=0).nunique().eq(len(s))
       )
df.loc[mask[mask].index]
(extractall returns one row per matched element, with the original row number as index level 0, so grouping on that level counts how many distinct elements of the set each formula contains.)
output:
formula
0 Na0.2Cl0.4O0.7Rb1
Keeping formulas where only elements from the set are used:
s = {'Na','Cl','Rb'}
mask = (df['formula']
        .str.extractall('([A-Z][a-z]*)')[0]
        .isin(s)
        .groupby(level=0).all()
       )
df[mask]
output: no rows for this dataset (every formula contains at least one element outside the set)
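To see what the groupby in both snippets operates on, here is the intermediate extractall result on the sample data (a sketch; the DataFrame is reconstructed from the example above):

import pandas as pd

df = pd.DataFrame({'formula': ['Na0.2Cl0.4O0.7Rb1', 'Hg0.04Mg0.2Ag2O4', 'Rb0.2AgO']})

# One row per element symbol; the first index level is the original row
# number, the second ("match") numbers the matches within each formula.
print(df['formula'].str.extractall('([A-Z][a-z]*)')[0])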

Adding column value for a list of indexes

I have a list of indexes, and I am trying to populate a column 'Type' for these rows only.
What I tried to do:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']=="gain+loss"
Output:
1 False
5 False
9 False
10 False
13 False
But the output just gives the list with all False instead of populating these rows.
Thanks for any advice.
You need to use a single equals sign instead of a double one. In Python, as in most programming languages, == is the comparison operator; in your case you need the assignment operator =.
So the following code will do what you want:
index_list = [1, 5, 9, 10, 13]  # recent pandas versions reject sets as .loc indexers, so use a list
df.loc[index_list, 'Type'] = "gain+loss"
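A runnable sketch on a toy DataFrame (the column contents and size are illustrative):

import pandas as pd

df = pd.DataFrame({'Type': [None] * 15})
index_list = [1, 5, 9, 10, 13]
df.loc[index_list, 'Type'] = "gain+loss"
print(df.loc[index_list, 'Type'])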

How to return ONLY top 5% of responses in a column PANDAS

I am looking to return the top 5% of responses in a column using pandas. So, for col_1, basically, I want a list of the responses that make up at least 5% of the responses in that column.
The following returns ALL unique responses in col_1, flagged with boolean True or False according to whether they meet the condition:
df['col_1'].value_counts(normalize = True) >= .05
While this is somewhat helpful, I would like to return ONLY those that evaluate to true. Should I use a dictionary and loop? If so, how do I signal that I am using value_counts(normalize = True) >= .05 to append to that dictionary?
Thank you for your help!
You can filter with boolean indexing:
s = df['col_1'].value_counts(normalize=True)
L = s.index[s >= .05].tolist()
print(L)
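For example, on a hypothetical column of responses:

import pandas as pd

df = pd.DataFrame({'col_1': ['yes'] * 15 + ['no'] * 9 + ['maybe'] * 1})

s = df['col_1'].value_counts(normalize=True)
print(s.index[s >= .05].tolist())  # ['yes', 'no'] ('maybe' is only 4%)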

Python Get index of Tuple with using endswith

Is there a way to get the index into the tuple when I use the endswith function?
eg:
lst = ('jpg','mp4','mp3')
b = "filemp4"
If I use b.endswith(lst), this gives me True or False, but I need to find the index of the element of lst that matches the string provided in b.
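A minimal sketch of one way to do this, using enumerate with next (returning None when nothing matches is an assumption, not from the question):

lst = ('jpg', 'mp4', 'mp3')
b = "filemp4"

# Index of the first suffix that b ends with, or None if nothing matches.
index = next((i for i, ext in enumerate(lst) if b.endswith(ext)), None)
print(index)  # 1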

Need a TRUE and FALSE column in Spark-SQL

I'm trying to write a multi-value filter for a Spark SQL DataFrame.
I have:
val df: DataFrame // my data
val field: String // The field of interest
val values: Array[Any] // The allowed possible values
and I'm trying to come up with the filter specification.
At the moment, I have:
val filter = values.map(value => df(field) === value).reduce(_ || _)
But this isn't robust in the case where I get passed an empty list of values. To cover that case, I would like:
val filter = values.map(value => df(field) === value).fold(falseColumn)(_ || _)
but I don't know how to specify falseColumn.
Anyone know how to do so?
And is there a better way of writing this filter? (If so, I still need the answer for how to get a falseColumn - I need a trueColumn for a separate piece).
A column that is always true:
val trueColumn = lit(true)
A column that is always false:
val falseColumn = lit(false)
(Both need import org.apache.spark.sql.functions.lit.)
Using lit(...) means these will always be valid columns, regardless of what columns the DataFrame contains.
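For reference, a rough PySpark sketch of the same pattern (the field name and values are illustrative, and df is assumed to be an existing pyspark.sql.DataFrame as in the question):

from functools import reduce
from pyspark.sql import functions as F

field = "ext"             # hypothetical column name
values = ["jpg", "mp4"]   # hypothetical allowed values

# OR together one equality test per value, starting from an always-false
# column so that an empty values list yields a filter matching nothing.
condition = reduce(lambda acc, v: acc | (F.col(field) == v), values, F.lit(False))
filtered = df.filter(condition)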