How to set multiple conditions for a Dataframe while modifying the values? - pandas

So, I'm looking for an efficient way to set values in an existing column, and to set values in a new column, based on some conditions. If I have 10 conditions in a big data set, do I have to write 10 lines, or can I combine them somehow? I haven't figured it out yet.
Can you suggest something?
For example:
data_frame.loc[data_frame.col1 > 50 ,["col1","new_col"]] = "Cool"
data_frame.loc[data_frame.col2 < 100 ,["col1","new_col"]] = "Cool"
Can it be written in a single expression? Neither "&" nor "and" seems to work...
Thanks!

Yes, you can. Here is an example:
data_frame.loc[(data_frame["col1"]>100) & (data_frame["col2"]<10000) | (data_frame["col3"]<500),"test"] = 0
Explanation:
The filter combines "and" (&) and "or" (|) conditions: (data_frame["col1"]>100) & (data_frame["col2"]<10000) | (data_frame["col3"]<500). Each comparison must be wrapped in parentheses because & and | bind more tightly than the comparison operators.
The column that will be changed is "test" and the value assigned is 0.
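To make the operator-precedence point concrete, here is a minimal, self-contained sketch (the data, thresholds, and column names are invented purely for illustration):
import pandas as pd

# toy data, invented only to illustrate the combined mask
data_frame = pd.DataFrame({
    "col1": [20, 150, 300],
    "col2": [5000, 20000, 400],
    "col3": [600, 700, 100],
})

# each comparison sits in its own parentheses because & and |
# bind more tightly than > and < in Python
mask = (data_frame["col1"] > 100) & (data_frame["col2"] < 10000) | (data_frame["col3"] < 500)
data_frame.loc[mask, "test"] = 0
print(data_frame)  # only the rows matching the combined condition get test == 0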

You can try:
all_conditions = [condition_1, condition_2]
fill_with = [fill_condition_1_with, fill_condition_2_with]
df[["col1","new_col"]] = np.select(all_conditions, fill_with, default=default_value_here)
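As a rough, self-contained sketch of the np.select approach (the data and fill values here are invented; note that np.select returns a single array, so it is assigned to one column below, and you would repeat the assignment per column if several columns should receive the same values):
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [20, 150, 300], "col2": [5000, 50, 400]})

all_conditions = [df["col1"] > 50, df["col2"] < 100]
fill_with = ["Cool", "Cool"]

# conditions are evaluated in order; the first matching condition wins
df["new_col"] = np.select(all_conditions, fill_with, default="Not cool")
print(df)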

Related

Filter out entries of datasets based on string matching

I'm working with a dataframe of chemical formulas (str objects). Example
formula
Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...
I want to filter it based on specified elements. For example, I want to produce an output which only contains the elements 'Na', 'Cl' and 'Rb'; therefore the desired output should be:
formula
Na0.2Cl0.4O0.7Rb1
What I've tried is the following:
for i, formula in enumerate(df['formula']):
    if ('Na' and 'Cl' and 'Rb' not in formula):
        df = df.drop(index=i)
but it does not seem to work.
You can use str.contains with an OR pattern ("|") if you only need to match one of them:
df[df['formula'].str.contains("Na|Cl|Rb", na=False)]
Or you can use a lookahead pattern with contains if you want to match all of them:
df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
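For reference, a small self-contained check of both patterns against the sample formulas (a sketch; it assumes the element symbols of interest do not occur as substrings of other symbols in the data):
import pandas as pd

df = pd.DataFrame({"formula": ["Na0.2Cl0.4O0.7Rb1", "Hg0.04Mg0.2Ag2O4", "Rb0.2AgO"]})

# rows containing at least one of the elements
print(df[df["formula"].str.contains("Na|Cl|Rb", na=False)])

# rows containing all of the elements (one lookahead per element)
print(df[df["formula"].str.contains(r"^(?=.*Na)(?=.*Cl)(?=.*Rb)")])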
Your requirements are a bit unclear, but I'll assume you want to filter based on a set of elements.
Keeping formulas where all elements from the set are used:
s = {'Na','Cl','Rb'}
regex = f'({"|".join(s)})'
mask = (
    df['formula']
    .str.extractall(regex)[0]
    .groupby(level=0).nunique().eq(len(s))
)
df.loc[mask[mask].index]
output:
formula
0 Na0.2Cl0.4O0.7Rb1
Keeping formulas where only elements from the set are used:
s = {'Na','Cl','Rb'}
mask = (df['formula']
        .str.extractall('([A-Z][a-z]*)')[0]
        .isin(s)
        .groupby(level=0).all()
)
df[mask]
output: no rows for this dataset

how to reduce rows to 1 row by concatenate in Azure Log Analytics

string row1
string row2
Is it possible to reduce these rows to one row? The rows should be joined with a comma.
As a result I expect:
string row1, string row2
One workaround can solve the above issue.
To concatenate values we can use, e.g., | extend New_Column = strcat(tagname, ",", tagvalue), which joins the two strings with a comma between them.
For example, we tested this in our environment with a tag name and tag value:
resourcecontainers
| where type =~ 'microsoft.resources/subscriptions'
| extend tagname = tostring(bag_keys(tags)[0])
| extend tagvalue = tostring(tags[tagname])
| extend New_Column = strcat(tagname,",", tagvalue) // concatenate the two strings with a comma between them
Here is the sample output for reference:
For more information please refer to this SO THREAD.
UPDATE: To concatenate the rows, we tried the example code in the given SO THREAD, as suggested by #Yoni L.:
| summarize result = strcat_array(make_list(word), ",")
Sample output for reference:
Thanks for the tips. In your links I found:
| summarize result = strcat_array(make_list(name_s), ",")

Pandas - data per row instead of all in one cell

I have problems getting the data into separate rows. At the moment all my data per column is in one cell. I would really appreciate your support!
The column header is "Dealer" and it shows one value below, like this:
|Dealer|
|:---- |
|['Automobiles', 'Garage Benz', 'Cencini SA']|
I would like to get three rows out of this:
|Row|Dealer|
|:----|:----|
|1|'Automobiles'|
|2|'Garage Benz'|
|3|'Cencini SA'|
|4|....|
|5|....|
|...|...|
What would be the easiest way to achieve this?
Thanks for your support, as I am totally new to pandas!
The easiest way is to convert your data into dict-like data:
x = {'Dealer':['Automobiles', 'Garage Benz', 'Cencini SA']}
Then
x = pd.DataFrame(x)
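If the data is already in a DataFrame and each "Dealer" cell holds an actual Python list (not a string), pandas' explode can split the list into one row per element; a minimal sketch under that assumption:
import pandas as pd

df = pd.DataFrame({"Dealer": [["Automobiles", "Garage Benz", "Cencini SA"]]})

# explode turns each list element into its own row
df = df.explode("Dealer").reset_index(drop=True)
print(df)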

How to drop multiple column names given in a list from Spark DataFrame?

I have a dynamic list which is created based on the value of n.
n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)
But the above is not working.
Note:
My use case requires a dynamic list.
If I just do the below without a list, it works:
df.drop('a0','a1','a2')
How do I make the drop function work with a list?
Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?
You can use the * operator to pass the contents of your list as arguments to drop():
df.drop(*drop_lst)
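A minimal PySpark sketch of the unpacking approach (the DataFrame and column names are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4)], ["a0", "a1", "a2", "keep"])

n = 3
drop_lst = ["a" + str(i) for i in range(n)]

# * unpacks the list so each name is passed as a separate argument
df.drop(*drop_lst).show()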
You can give the column names as a comma-separated argument list, e.g.:
df.drop("col1","col11","col21")
This is how to drop a specified number of consecutive columns in Scala:
val ll = dfwide.schema.names.slice(1,5)
dfwide.drop(ll:_*).show
slice takes two parameters: a start index and an end index.
Use a simple loop:
for c in drop_lst:
    df = df.drop(c)
You can use drop(*cols) in two ways:
df.drop('age').collect()
df.drop(df.age).collect()
Check the official documentation: DataFrame.drop.

Finding the count of a set of substrings in pandas dataframe

I am given a set of substrings. I need to find the count of occurrences of all those substrings in a particular column of a dataframe. The relevant dataframe would look like this:
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and values are the probabilities with which they occur
reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
# The length of the dictionary is 2000
In particular, I need to find those substrings which occur more than twice.
I have written the following code that performs the task. Is there a more elegant, Pythonic or pandas-specific way to achieve the same? The current implementation is taking quite some time to execute.
elites = dict()
for reg_pat in reg_:
    count = 0
    eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
    if eliter >= 3:
        elites[reg_pat] = reg_[reg_pat]
You can use apply instead of str.contains; it is faster:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
    if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
        elites[reg_pat] = reg_[reg_pat]
print (elites)
{'a': 0.0005}
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.
for substr in reg:
    totalStringAppearances = training['concat'].apply(lambda string: substr in string)
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        pass  # do what you want to with the very rare substrings
Some gotchas:
If you wanted something like the substring 'a' in 'abcdefa' to return 2, then this will not work; it merely checks for the existence of the substring in each string (see the sketch after these notes).
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
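If you do need the number of occurrences rather than mere presence, one option (a sketch, not part of the original answers) is pandas' str.count; note it interprets its argument as a regular expression, so literal substrings containing characters like $ should be escaped:
import re
import pandas as pd

training = pd.DataFrame({"concat": ["svAxu$paxArWAn", "xvAxaSa$varRANi", "kaSa$AGAwam"]})
reg_ = {"a": 0.0005, "a$p": 0.2}  # tiny stand-in for the real 2000-entry dict

# total number of occurrences of each substring across the column
counts = {pat: int(training["concat"].str.count(re.escape(pat)).sum()) for pat in reg_}
print(counts)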
Post-edit: Jezrael's answer is more complete, as it uses the same variable names. But in a simple case, regarding regex vs. apply with in, I validated his claim, and my presumption.