Assign conditional values to columns in Dask - dataframe

I am trying to do a conditional assignation to the rows of a specific column: target. I have done some research, and it seemed that the answer was given here: "How to do row processing and item assignment in dask".
I will reproduce my necessity. Mock data set:
x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))
The look of mock is:
In [4]: mock.head(7)
Out [4]:
speed target
0 200 3
1 300 0
2 400 3
3 215 4
4 219 0
5 360 0
6 280 0
Having this Pandas DataFrame, I convert it into a Dask DataFrame:
mock_dask = dd.from_pandas(mock, npartitions = 2)
I apply my conditional rule: all values in target above 0, must be 1, all others 0 (binaryze target). Following the mentioned thread above, it should be:
result = mock_dask.target.where(mock_dask.target > 0, 1)
I have a look at the result dataset and it is not working as expected:
In [7]: result.head(7)
Out [7]:
0 3
1 1
2 3
3 4
4 1
5 1
6 1
Name: target, dtype: object
As we can see, the column target in mock and result are not the expected results. It seems that my code is converting all 0 original values to 1, instead of the values that are greater than 0 into 1 (the conditional rule).
Dask newbie here, Thanks in advance for your help.

OK, the documentation in Dask DataFrame API is pretty clear. Thanks to #MRocklin feedback I have realized my mistake. In the documentation, where function (the last one in the list) is used with the following syntax:
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Thus, the correct code line would be:
result = mock_dask.target.where(mock_dask.target <= 0, 1)
This will output:
In [7]: result.head(7)
Out [7]:
0 1
1 0
2 1
3 1
4 0
5 0
6 0
Name: target, dtype: int64
Which is the expected output.

They seem to be the same to me
In [1]: import pandas as pd
In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
...: mock = pd.DataFrame(dict(target = x, speed = y))
...:
In [3]: import dask.dataframe as dd
In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)
In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64

Related

Pandas : How to Apply a Condition on Every Values of a Dataframe, Based on a Second Symmetrical Dataframe

I have a dictionary with 2 DF : "quantity variation in %" and "prices". They are both symmetrical DF.
Let's say I want to set the price = 0 if the quantity variation in percentage is greater than 100 %
import numpy as np; import pandas as pd
d = {'qty_pct': pd.DataFrame({ '2020': [200, 0.5, 0.4],
'2021': [0.9, 0.5, 500],
'2022': [0.9, 300, 0.4]}),
'price': pd.DataFrame({ '2020': [-6, -2, -9],
'2021': [ 2, 3, 4],
'2022': [ 4, 6, 8]})}
# I had something like that in mind ...
df = d['price'].applymap(lambda x: 0 if x[d['qty_pct']] >=1 else x)
P.S. If by any chance there is a way to do this on asymmetrical DF, I would be curious to see how it's done.
Thanks,
I want to obtain this DF :
price = pd.DataFrame({'2020': [ 0, -2, -9],
'2021': [ 2, 3, 0],
'2022': [ 4, 0, 8]})
Assume price and qty_pct always have the same dimension, then you can just do:
d['price'][d['qty_pct'] >= 1] = 0
d['price']
2020 2021 2022
0 0 2 4
1 -2 3 0
2 -9 0 8

How to create a multiIndex (hierarchical index) dataframe object from another df's column's unique values?

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.
Is there an easier way to have this information summarized besides creating this dataframe?
Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...
Original Dataframe:
data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
'B': [2001, 2002, 2003, 2004, 2005],
'C': ['F', 'M', 'F', 'F', 'M'],
'D': [0, 0, 0, 1, 0],
'E': [1, 0, 1, 0, 1],
'F': [1, 1, 0, 0, 0]})
A B C D E F
0 case 2001 F 0 1 1
1 case 2002 M 0 0 1
2 case 2003 F 0 1 0
3 case 2004 F 1 0 0
4 case 2005 M 1 1 0
Desired outcome:
unique percent
A case 100
B 2001 20
2002 20
2003 20
2004 20
2005 20
C F 60
M 40
D 0 80
1 20
E 0 40
1 60
F 0 60
1 40
My failed for loop attempt:
def unique_values(df):
values = {}
columns = []
df = pd.DataFrame(values, columns=columns)
for col in data:
df2 = data[col].value_counts(normalize=True)*100
values = values.update(df2.to_dict)
columns = columns.append(col*len(df2))
return df
unique_values(data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
11
12
---> 13 unique_values(data)
<ipython-input-84-a341284fb859> in unique_values(df)
5 for col in data:
6 df2 = data[col].value_counts(normalize=True)*100
----> 7 values = values.update(df2.to_dict)
8 columns = columns.append(col*len(df2))
9 return df
TypeError: 'method' object is not iterable
Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.
This is a fairly straightforward application of .melt:
data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
output
index
variable value
A case 1.0
B 2001 0.2
2002 0.2
2003 0.2
2004 0.2
2005 0.2
C F 0.6
M 0.4
D 0 0.8
1 0.2
E 0 0.4
1 0.6
F 0 0.6
1 0.4
I'm sorry! I've written an answer, but it's in javascript. I came here after I thought I've clicked on javascript and started coding, but on posting I saw that you're coding in python.
I will post it anyway, maybe it will help you. Python is not that much different from javascript ;-)
const data = {
A: ["case", "case", "case", "case", "case"],
B: [2001, 2002, 2003, 2004, 2005],
C: ["F", "M", "F", "F", "M"],
D: [0, 0, 0, 1, 0],
E: [1, 0, 1, 0, 1],
F: [1, 1, 0, 0, 0]
};
const getUniqueStats = (_data) => {
const results = [];
for (let row in _data) {
// create list of unique values
const s = [...new Set(_data[row])];
// filter for unique values and count them for percentage, then push
results.push({ index: row, values: s.map((x) => ({ unique: x, percentage: (_data[row].filter((y) => y === x).length / data[row].length) * 100 })) });
}
return results;
};
const results = getUniqueStats(data);
results.forEach((row) =>
row.values.forEach((value) =>
console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
)
);

how to change value based on criteria pandas

I have a following problem. I have this df:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
I would like to have a new column where value will be 1 if in any other cases it is also 1. See desired output:
d = {'id': [1, 1, 2, 2, 3], 'value': [0, 1, 0, 0, 1], 'newvalue': [1, 1, 0, 0, 1]}
df = pd.DataFrame(data=d)
How can I do it please?
If need set 0,1 by condition - here at least one 1 use GroupBy.transform with GroupBy.any for mask and casting to integers for True, False to 1,0 map:
df['newvalue'] = df['value'].eq(1).groupby(df['id']).transform('any').astype(int)
Alternative:
df['newvalue'] = df['id'].isin(df.loc[df['value'].eq(1), 'id']).astype(int)
Or if only 0,1 values is possible simplify solution for new column by maximal values per groups:
df['newvalue'] = df.groupby('id')['value'].transform('max')
print (df)
id value newvalue
0 1 0 1
1 1 1 1
2 2 0 0
3 2 0 0
4 3 1 1

Generating boolean dataframe based on contents in series and dataframe

I have:
df = pd.DataFrame(
[
[22, 33, 44],
[55, 11, 22],
[33, 55, 11],
],
index=["abc", "def", "ghi"],
columns=list("abc")
) # size(3,3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # size(1,5)
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
[
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
[0, 0, 1],
[1, 1, 0],
],
index=unique,
columns=df.columns
)
This new df is either true or false depending on whether the value is present in the original dataframe or not. For example, the first column has three values: [22, 55, 33]. In a df with dimensions (5,3), this first column would be: [0, 1, 1, 0, 1] i.e. [0, 22, 33, 0 , 55]
I tried filter2 = unique.isin(df) but this doesn't work, also notnull. I tried applying a filter but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with DataFrame.reset_index, DataFrame.pivot, then check if not missing values by DataFrame.notna, cast to integers for True->1 and False->0 mapping and last remove index and columns names by DataFrame.rename_axis:
df_new = (df.stack()
.reset_index(name='v')
.pivot('v','level_1','level_0')
.notna()
.astype(int)
.rename_axis(index=None, columns=None))
print (df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
Helper Series is not necessary, but if there is more values or is necessary change order by helper Series use add DataFrame.reindex:
#added 66
unique = pd.Series([11, 22, 33, 44, 55,66])
df_new = (df.stack()
.reset_index(name='v')
.pivot('v','level_1','level_0')
.reindex(unique)
.notna()
.astype(int)
.rename_axis(index=None, columns=None))
print (df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
66 0 0 0

Deleting Chained Duplicates

Lets say I have a list:
lits = [1, 1, 1, 2, 0, 0, 0, 0, 3, 3, 1, 4, 5, 2, 2, 2, 0, 0, 0]
and i need this to become [1, 1, 2, 0, 0, 3, 3, 1, 4, 5, 2, 2, 0, 0]
(Delete duplicates, but only in a chain of duplicates. Going to do this on a huge HDF5 file, with pandas, numpy. Would rather not use a for loop iterating through all elements.
table = table.drop_duplicates(cols='[SPEED OVER GROUND [kts]]', take_last=True)
Is there a modification I can do to this code?
In pandas you can do a boolean mask, selecting a row only if it is differs from either the preceding or succeeding value:
>>> df=pd.DataFrame({ 'lits':lits })
>>> df[ (df.lits != df.lits.shift(1)) | (df.lits != df.lits.shift(-1)) ]
lits
0 1
2 1
3 2
4 0
7 0
8 3
9 3
10 1
11 4
12 5
13 2
15 2
16 0
18 0