Alternate way to use Dask loc like in Pandas loc | = operator not working in dask - pandas

for col1 in columns_1:
    for col2 in columns_2:
        df.loc[df['any_column_in_df'] == col2, col1] = 0
What I want: an alternative way to do this in Dask; the code above works in pandas.
Problem: I can't assign with df.loc[...] = value in Dask, because in-place assignment is not supported.
Explanation: I want to assign 0 (or another value) where the condition is met and get back a DataFrame (not a Series).
I tried mask, and map_partitions with df.replace (which works fine for this simple single-column value replacement and returns a DataFrame as required):
import numpy as np
import pandas as pd

def replace(x: pd.DataFrame) -> pd.DataFrame:
    return x.replace(
        {'any_column_to_replace_value': [np.nan]},
        {'any_column_to_replace_value': [0]}
    )

df = df.map_partitions(replace)
How can I do the same for the first code snippet and get a DataFrame back?
Thanks in advance. Please help, Dask experts; I'm new to Dask and still exploring it.

Answer by #martindurant on gitter…
This is a row-wise compute, so you can use apply or map_partitions:
def process(df):
    for col1 in columns_1:
        for col2 in columns_2:
            df.loc[df['any_column_in_df'] == col2, col1] = 0
    return df

df2 = df.map_partitions(process)
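A variation that avoids the custom partition function: the same assignment can usually be written with Series.mask and isin, both of which Dask supports (a sketch, assuming columns_1 holds column names and columns_2 holds the values to match):
for col1 in columns_1:
    # zero out col1 wherever 'any_column_in_df' matches any value in columns_2
    df[col1] = df[col1].mask(df['any_column_in_df'].isin(list(columns_2)), 0)
This stays lazy and returns a DataFrame, just like the map_partitions version.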

Related

Filter dataframe based on condition before groupby

Suppose I have a dataframe like this
Create sample dataframe:
import pandas as pd
import numpy as np
data = {
    'gender': np.random.choice(['m', 'f'], size=100),
    'vaccinated': np.random.choice([0, 1], size=100),
    'got sick': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)
and I want to see, by gender, what proportion of vaccinated people got sick.
I've tried something like this:
df.groupby('gender').agg(lambda group: sum(group['vaccinated']==1 & group['sick']==1)
                                        / sum(group['sick']==1))
but this doesn't work because agg works on the series level. Same applies for transform. apply doesn't work either, but I'm not as clear why or how apply functions on groupby objects.
Any ideas how to accomplish this with a single line of code?
You could first filter for the vaccinated people and then group by gender and calculate the proportion of people that got sick.
df[df.vaccinated == 1].groupby("gender").agg({"got sick":"mean"})
Output:
got sick
gender
f 0.548387
m 0.535714
In this case the proportions are calculated from the random sample data created above.
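Equivalently, named aggregation (available since pandas 0.25) gives the result column a clearer name; a small variation on the answer above:
df[df.vaccinated == 1].groupby("gender").agg(prop_sick=("got sick", "mean"))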
The docs for GroupBy.apply state that the function is applied "group-wise". This means that the function is called on each group separately as a data frame.
That is, df.groupby(c).apply(f) is conceptually equivalent to:
results = {}
for val in df[c].unique():
    group = df.loc[df[c] == val]
    result = f(group)
    results[val] = result
pd.concat(results)
We can use this understanding to apply your custom aggregation function, using a top-level def just to make the code easier to read:
def calc_vax_sick_frac(group):
    vaccinated = group['vaccinated'] == 1
    sick = group['got sick'] == 1   # column name from the sample data above
    return (vaccinated & sick).sum() / sick.sum()

(
    df
    .groupby('gender')
    .apply(calc_vax_sick_frac)
)
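For the sample frame above this returns a Series indexed by gender (the actual values depend on the random data). If you want a tidy frame, a rename plus reset_index does it; the result-column name here is just illustrative:
out = df.groupby('gender').apply(calc_vax_sick_frac).rename('frac_vax_and_sick').reset_index()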

Create a new column after if-else in dask

df['new_col'] = np.where(df['col1'] == df['col2'], True, False), where col1 and col2 are both str dtypes, seems pretty straightforward. What is a more efficient way to create a column in Dask after an if-else condition? I tried the recommendation from Create an if-else condition column in dask dataframe, but it is taking forever: it had only processed about 30% after about an hour. I have 13 million rows and 70 columns.
IIUC, if you need to set the column to a boolean:
df['new_col'] = df['col1'] == df['col2']
If you need to set it to other values:
df['new_col'] = 'val for true'
ddf = df.assign(new_col=df.new_col.where(cond=df['col1'] == df['col2'], other='val for false'))
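Put together, a minimal sketch of the boolean version on a Dask frame (the frame and values here are made up purely to illustrate the pattern):
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['a', 'x', 'c']})
ddf = dd.from_pandas(pdf, npartitions=2)

ddf['new_col'] = ddf['col1'] == ddf['col2']   # lazy, element-wise comparison
print(ddf.compute())
Because the comparison is element-wise, Dask evaluates it partition by partition without any apply, which is why it tends to be much faster than row-wise approaches.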

How can I skip even/odd rows while reading a csv file?

Is there a simple way to ignore all even/odd rows when reading a csv using pandas?
I know about the skiprows argument in pd.read_csv, but for that I'd need to know the number of rows in advance.
The pd.read_csv skiprows argument accepts a callable, so you could use a lambda function. E.g.:
df = pd.read_csv(some_path, skiprows=lambda x: x%2 == 0)
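One caveat: with a callable, the row numbers passed to skiprows include the header line (row 0), so the lambda above also drops the header. To keep it, exclude row 0 explicitly:
df = pd.read_csv(some_path, skiprows=lambda x: x > 0 and x % 2 == 0)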
A possible solution after reading would be:
import pandas as pd
df = pd.read_csv(some_path)
# remove odd rows:
df = df.iloc[::2]
# remove even rows:
df = df.iloc[1::2]
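After slicing, the surviving rows keep their original labels; if you want a fresh 0..n-1 index, reset it:
df = df.reset_index(drop=True)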

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a': [['1234', 'abc', '444'],
                        ['5678'],
                        ['2468', 'def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
And I gave an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
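Another common pattern, since the column just holds Python lists, is a plain list comprehension; on small to medium frames it is typically at least as fast as apply:
t['element_two'] = [x[1] if len(x) > 1 else np.nan for x in t['a']]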

Extracting value and creating new column out of it

I would like to extract certain section of a URL, residing in a column of a Pandas Dataframe and make that a new column. This
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE)
returns me a Series with tuples in it. How can I take out only one part of that tuple before the Series is created, so I can simply turn that into a column? Sample data for referrerurl is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,
In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
In 0.11.1, here is a neat way of doing this as well:
In [34]: df.replace({ 'A' : "http:.+\d\d\/(.*?)(;|\\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2
This also worked:
import re

def extract(x):
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)", x)
    if res: return res[0][0]

session['RU_2'] = session['REFERRERURL'].apply(extract)
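On more recent pandas versions, str.extract with a single capture group may be simpler still, since it returns a plain Series directly (a sketch, reusing the sample column A from above):
df['first'] = df['A'].str.extract(r"\d\d/(.*?)[;?]", expand=False)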