How to make a new 0 and 1 column? - pandas

I have a pandas DataFrame and I want to make a new column with 0 and 1:
if col1 is 0 and col2 is positive, set the new column to 1.
if col1 is 0 and col2 is negative, set the new column to 0.
if col1 is 1 and col2 is positive, set the new column to 0.
if col1 is 1 and col2 is negative, set the new column to 1.
col1  col2
0     2
0    -4
1    -2
1     5
1     9

new_column
1
0
1
0
0

You can determine if col2 is positive and get the absolute difference with col1 (booleans behave like 0/1):
df['new_column'] = df['col1'].sub(df['col2'].gt(0)).abs()
Alternatively, compare the two conditions as booleans; you want them to be different:
df['new_column'] = df['col1'].ne(df['col2'].gt(0)).astype(int)
output:
   col1  col2  new_column
0     0     2           1
1     0    -4           0
2     1    -2           1
3     1     5           0
4     1     9           0
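Both one-liners amount to an XOR of "col1 is 1" and "col2 is positive", so a bitwise ^ is an equivalent third spelling. A minimal, self-contained sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 0, 1, 1, 1], 'col2': [2, -4, -2, 5, 9]})

# |col1 - (col2 > 0)|: booleans act as 0/1, so this is 1 exactly when they differ
df['new_column'] = df['col1'].sub(df['col2'].gt(0)).abs()

# equivalent: XOR of the two boolean conditions
df['xor_version'] = (df['col1'].astype(bool) ^ df['col2'].gt(0)).astype(int)
```

Both columns come out as [1, 0, 1, 0, 0], matching the expected result.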

Related

Perform similar computations in every dataframe in a list of dataframes

I have a list of 18 different dataframes. The only thing these dataframes have in common is that each contains a variable that ends with "_spec". The computations I would like to perform on each dataframe in the list are as follows:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above); and
store the results of #2 in a separate list of 18 dataframes
I can get the output that I would like for each individual dataframe with the following:
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) # count (negative) no. of numeric vars in df
lvmo_spec = df[df.sum(numeric_only=True,axis=1)==lvmo_numlength].filter(regex='_spec') # does ^ = sum of numeric vars?
lvmo_spec.to_list()
but I don't want to copy and paste this 18(+) times...
I am new to writing functions and loops, but I know these can be utilized to perform the procedure I desire; yet I don't know how to execute it. The below code shows the abomination I have created, which can't even make it off the ground. Any suggestions?
# make list of dataframes
name_list = [lvmo, trx_nonrx, pd, odose_drg, fx, cpn_use, dem_hcc, dem_ori, drg_man, drg_cou, nlx_gvn, nlx_ob, opd_rsn, opd_od, psy_yn, sti_prep_tkn, tx_why, tx_curtx]

# create variable that satisfies condition 1
def numlen(name):
    return name + "_numlen"

# create variable that satisfies condition 2
def spec(name):
    return name + "_spec"

# loop it all together
for name in name_list:
    numlen(name) = -len(name.select_dtypes('number').columns.tolist())
    spec(name) = name[name.sum(numeric_only=True,axis=1)]==numlen(name).filter(regex='spec')
You can achieve what I believe your question is asking as follows, given input df_list which is a list of dataframes:
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
Explanation:
for each input dataframe, create a new dataframe as follows: for rows where the sum of the values in numeric columns is <=0 and is equal in magnitude to the number of numeric columns, select only those columns with a label ending in '_spec'
use a list comprehension to compile the above new dataframes into a list
Note that this can also be expressed using a standard for loop instead of a list comprehension as follows:
res_list = []
for df in df_list:
    res_list.append(df[df.sum(numeric_only=True, axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec'))
Sample code (using 7 input dataframe objects instead of 18):
import pandas as pd
df_list = [pd.DataFrame({'b':['a','b','c','d']} | {f'col{i+1}{"_spec" if not i%3 else ""}':[-1,0,0]+([0 if i!=n-1 else -n]) for i in range(n)}) for n in range(7)]
for df in df_list: print(df)
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
for df in res_list: print(df)
Input:
b
0 a
1 b
2 c
3 d
b col1_spec
0 a -1
1 b 0
2 c 0
3 d -1
b col1_spec col2
0 a -1 -1
1 b 0 0
2 c 0 0
3 d 0 -2
b col1_spec col2 col3
0 a -1 -1 -1
1 b 0 0 0
2 c 0 0 0
3 d 0 0 -3
b col1_spec col2 col3 col4_spec
0 a -1 -1 -1 -1
1 b 0 0 0 0
2 c 0 0 0 0
3 d 0 0 0 -4
b col1_spec col2 col3 col4_spec col5
0 a -1 -1 -1 -1 -1
1 b 0 0 0 0 0
2 c 0 0 0 0 0
3 d 0 0 0 0 -5
b col1_spec col2 col3 col4_spec col5 col6
0 a -1 -1 -1 -1 -1 -1
1 b 0 0 0 0 0 0
2 c 0 0 0 0 0 0
3 d 0 0 0 0 0 -6
Output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
col1_spec
0 -1
3 -1
col1_spec
0 -1
3 0
col1_spec
0 -1
3 0
col1_spec col4_spec
0 -1 -1
3 0 -4
col1_spec col4_spec
0 -1 -1
3 0 0
col1_spec col4_spec
0 -1 -1
3 0 0
Also, a couple of comments about the original question:
lvmo_spec.to_list() doesn't work because a DataFrame has no to_list() method. Series has tolist() (with to_list() as an alias in newer pandas), but lvmo_spec is a DataFrame, not a Series.
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) gives a negative result. I have assumed this is your intention, and that you want the sum of each row's numeric values to have a negative value, but this is slightly at odds with your description which states:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above);

pandas creating new columns for each value in categorical columns

I have a pandas dataframe with some numeric and some categoric columns. I want to create a new column for each value of every categorical column and give that column a value of 1 in every row where that value is true and 0 in every row where that value is false. So the df is something like this -
col1 col2 col3
A    P    1
B    P    3
A    Q    7

expected result is something like this:

col1 col2 col3  A  B  P  Q
A    P    1     1  0  1  0
B    P    3     0  1  1  0
A    Q    7     1  0  0  1
Is this possible? Can someone please help me?
Use df.select_dtypes, pd.get_dummies with pd.concat:
# First select all columns which have object dtypes
In [826]: categorical_cols = df.select_dtypes('object').columns
# Create one-hot encoding for the above cols and concat with df
In [817]: out = pd.concat([df, pd.get_dummies(df[categorical_cols])], axis=1)
In [818]: out
Out[818]:
col1 col2 col3 col1_A col1_B col2_P col2_Q
0 A P 1 1 0 1 0
1 B P 3 0 1 1 0
2 A Q 7 1 0 0 1
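If you want the new columns named exactly A, B, P, Q as in the expected result, rather than col1_A etc., get_dummies can drop the prefix. A sketch (newer pandas returns bool dummies, hence the astype(int)):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A'], 'col2': ['P', 'P', 'Q'], 'col3': [1, 3, 7]})
categorical_cols = df.select_dtypes('object').columns

# prefix='' with prefix_sep='' keeps the bare category values as column names
dummies = pd.get_dummies(df[categorical_cols], prefix='', prefix_sep='').astype(int)
out = pd.concat([df, dummies], axis=1)
```

This yields the columns col1, col2, col3, A, B, P, Q with 0/1 values.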

Achieve incremental values for a month based on value in another column and date

I have a scenario where I need to increment numbers within a month.
Condition 1: if the value in col2 is greater than 0, the expected output is 0.
Condition 2: if the value in col1 is 0, the expected output is 999.
Condition 3: if the value in col2 is 0, increment the numbers from 1.
Note: whenever condition 1 or condition 2 is satisfied, the increment restarts from 1.

Id   Date   Col1  col2  Expected Output
101  01/01  28    1     0
101  01/02  43    0     1
101  01/03  46    0     2
101  01/04  0     0     999
101  01/05  56    0     1
101  01/06  95    5     0
101  01/07  0     0     999
101  01/08  65    0     1
101  01/09  1     0     2
101  01/10  2     0     3

Please suggest how this can be achieved.
A cumulative count plus Teradata's RESET WHEN option:
-- similar to ROW_NUMBER, but counts only zeros
case
    when col1 = 0 then 999
    else count(case when col2 > 0 or col1 = 0 then NULL else 1 end)
         over (partition by id
               order by date_
               reset when col2 > 0 or col1 = 0
               rows unbounded preceding)
end
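For readers without Teradata, the same logic can be sketched in pandas: a boolean marks the reset rows, its cumsum labels the stretches between resets, and cumcount restarts the counter inside each stretch (column names assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Id':   [101] * 10,
    'Col1': [28, 43, 46, 0, 56, 95, 0, 65, 1, 2],
    'col2': [1, 0, 0, 0, 0, 5, 0, 0, 0, 0],
})

# a reset happens wherever col2 > 0 or Col1 == 0; the cumsum of that boolean
# gives each stretch between resets its own group label
reset = (df['col2'] > 0) | (df['Col1'] == 0)
counter = df.groupby([df['Id'], reset.cumsum()]).cumcount()

# overwrite the reset rows with their fixed outputs (0 and 999)
df['output'] = counter.mask(df['col2'] > 0, 0).mask(df['Col1'] == 0, 999)
```

This reproduces the expected column 0, 1, 2, 999, 1, 0, 999, 1, 2, 3.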

Count the number of columns that have a true value, then divide by the total number of columns

Let's assume that the table below is called Table

ID  Col1  Col2  Col3  Col4  ...  Total
1   1     0     NULL  1          30.33
2   0     1     1     1          60.12
3   1     1     0     0          20.12
4   1     0     1     1          60.12
5   0     NULL  NULL  1          10.19
6   1     1     NULL  1          90.00
7   0     0     NULL  0          0.00
I want to count the columns that hold a true value and display the average in the Total column. For example, if there are 10 columns and 5 of them are true, the total is 50%. All of the columns I will be counting are bit columns with values NULL, 0, or 1. How do I achieve this?
You could use:
SELECT
    ID,
    100.0 * (COALESCE(Col1, 0) + COALESCE(Col2, 0) + ... + COALESCE(Col10, 0)) / 10 AS pct
FROM yourTable;
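The same per-row percentage can be sketched in pandas, where fillna(0) plays the role of COALESCE (four hypothetical bit columns here instead of ten):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 5, 7],
    'Col1': [1, 0, 0],
    'Col2': [0, None, 0],
    'Col3': [None, None, None],
    'Col4': [1, 1, 0],
})
bit_cols = ['Col1', 'Col2', 'Col3', 'Col4']

# treat NULL as 0; the row mean is then the fraction of true bits
df['pct'] = df[bit_cols].fillna(0).mean(axis=1) * 100
```

For ID 1 this gives (1 + 0 + 0 + 1) / 4 = 50.0, and 0.0 for the all-false/NULL row.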

Pandas, create new column applying groupby values

I have a DF:
Col1  Col2  Label
0     0     5345
1     0     7574
2     0     3445
0     1     2126
1     1     4653
2     1     9566
So I'm trying to group by Col1 and Col2 to get an index value based on the Label column, like this:
df_gb = df.groupby(['Col1','Col2'])['Label'].agg(['sum', 'count'])
df_gb['sum_count'] = df_gb['sum'] / df_gb['count']
sum_count_total = df_gb['sum_count'].sum()
index = df_gb['sum_count'] / 10
Col2 Col1
0 0 2.996036
1 3.030063
2 3.038579
1 0 2.925314
1 2.951295
2 2.956083
2 0 2.875549
1 2.899254
2 2.905063
Everything so far is as I expected. But now I would like to assign this 'index' groupby result back to my original df based on those two groupby columns. If it were only one column, it would work with the map() function, but I can't assign the index values based on a combination of two columns.
df_index = df.copy()
df_index['index'] = df.groupby([]).apply(index)
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Tried with agg() and transform() but without success. Any ideas how to proceed?
Thanks in advance.
Hristo.
I believe you need join:
a = df.join(index.rename('new'), on=['Col1','Col2'])
print (a)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
Or GroupBy.transform:
df['new']=df.groupby(['Col1','Col2'])['Label'].transform(lambda x: x.sum() / x.count()) / 10
print (df)
Col1 Col2 Label new
0 0 0 5345 534.5
1 1 0 7574 757.4
2 2 0 3445 344.5
3 0 1 2126 212.6
4 1 1 4653 465.3
5 2 1 9566 956.6
And if there are no NaNs in the Label column, use the solution from Zero's suggestion (thank you):
df.groupby(['Col1','Col2'])['Label'].transform('mean') / 10
If you need to count only non-NaN values with count, use the solution with transform.
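For completeness, a self-contained sketch of the transform('mean') approach on the question's data (each (Col1, Col2) pair appears only once here, so the group mean equals Label itself):

```python
import pandas as pd

df = pd.DataFrame({
    'Col1':  [0, 1, 2, 0, 1, 2],
    'Col2':  [0, 0, 0, 1, 1, 1],
    'Label': [5345, 7574, 3445, 2126, 4653, 9566],
})

# broadcast each group's mean of Label back to its rows, then divide by 10
df['new'] = df.groupby(['Col1', 'Col2'])['Label'].transform('mean') / 10
```

This reproduces the 'new' column 534.5, 757.4, 344.5, 212.6, 465.3, 956.6 shown above.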