Creating dummy columns from cells with multiple values - pandas

I have a DF as shown below:
DF =
id Result
1 Li_In-AR-B, Or_Ba-AR-B
1 Li_In-AR-L, Or_Ba-AR-B
3 N
4 Lo_In-AR-U
5 Li_In-AR-U
6 Or_Ba-AR-B
6 Or_Ba-AR-L
7 N
Now I want to create new columns for every unique value in Result before the first "-". Every other value in the new column should be set to N. The delimiter "," is used to separate both instances in case of multiple values (2 or more).
DF =
id Result Li_In Lo_In Or_Ba
1 Li_In-AR-B Li_In-AR-B N Or_Ba-AR-B
1 Li_In-AR-L Li_In-AR-L N Or_Ba-AR-B
3 N N N N
4 Lo_In-AR-U N Lo_In-AR-U N
5 Li_In-AR-U Li_In-AR-U N N
6 Or_Ba-AR-B N N Or_Ba-AR-B
6 Or_Ba-AR-L N N Or_Ba-AR-L
7 N N N N
I thought I could do this easily using .get_dummies but this only returns a binary value for each cell.
DF_dummy = DF.Result.str.get_dummies(sep='-')
DF = pd.concat([DF,DF_dummy ],axis=1)
Also this solution for an earlier post is not applicable for the new case.
m = DF['Result'].str.split('-', n=1).str[0].str.get_dummies().drop('N', axis=1) == 1
df1 = pd.concat([DF['Result']] * len(m.columns), axis=1, keys=m.columns)
Any ideas?

Use a dictionary comprehension with the DataFrame constructor, splitting each cell on the regex ,\s+ (a comma followed by one or more whitespace characters):
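import re
import pandas as pd
# map each prefix (the text before the first '-') to the full value, skipping the 'N' placeholder
f = lambda x: {y.split('-', 1)[0]: y for y in re.split(r',\s+', x) if y != 'N'}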
df1 = pd.DataFrame(DF['Result'].apply(f).values.tolist(), index=DF.index).fillna('N')
print (df1)
Li_In Lo_In Or_Ba
0 Li_In-AR-B N Or_Ba-AR-B
1 Li_In-AR-L N Or_Ba-AR-B
2 N N N
3 N Lo_In-AR-U N
4 Li_In-AR-U N N
5 N N Or_Ba-AR-B
6 N N Or_Ba-AR-L
7 N N N
Finally, join the result to the original DataFrame:
df = DF.join(df1)
print (df)
id Result Li_In Lo_In Or_Ba
0 1 Li_In-AR-B, Or_Ba-AR-B Li_In-AR-B N Or_Ba-AR-B
1 1 Li_In-AR-L, Or_Ba-AR-B Li_In-AR-L N Or_Ba-AR-B
2 3 N N N N
3 4 Lo_In-AR-U N Lo_In-AR-U N
4 5 Li_In-AR-U Li_In-AR-U N N
5 6 Or_Ba-AR-B N N Or_Ba-AR-B
6 6 Or_Ba-AR-L N N Or_Ba-AR-L
7 7 N N N N


df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

I have a dataframe that looks like this:
a b c
0 x x x
1 y y y
2 z z z
I would like to apply a function to each row of dataframe. That function then creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()]*dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)]*dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe should look something like this:
a b c
0 x x x
1 x x x
2 y y y
3 y y y
4 y y y
5 z z z
6 z z z
So I did:
df_expanded = df.apply(my_func, axis=1)
I inserted breakpoints inside the function and for each row, the created dataframe from my_func is correct. However, at the end, when the last row returns, I get an error stating that:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
It is as if apply is trying to return a Series rather than the group of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
Which just creates groups of single rows and applies the same function. This on the other hand works.
Why?
Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
a b c
0 1 1 4
1 2 2 4
2 3 3 4
Doing:
new_df = (df.apply(lambda x: [x.tolist()]*(x.c-x.a), axis=1)
.explode(ignore_index=True)
.apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
a b c
0 1 1 4
1 1 1 4
2 1 1 4
3 2 2 4
4 2 2 4
5 3 3 4
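As for the "why": DataFrame.apply with axis=1 expects each call to return a scalar or a Series and tries to assemble those into one result, whereas groupby(...).apply is built to concatenate whole DataFrames returned per group. A different technique that avoids apply entirely is to repeat index labels with Index.repeat and select them with .loc. The sketch below assumes the same numeric frame as in the answer above and that c - a is a non-negative integer on every row:

```python
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 4, 4]})

# Number of copies wanted for each row (assumed non-negative integers)
repeats = (df["c"] - df["a"]).to_numpy()

# Repeat each index label `repeats` times, select those rows, then renumber
new_df = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
print(new_df)
```

This keeps the original column names and dtypes, so no column renaming is needed afterwards.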

Hive impala query

Input.
Key id ind1 ind2
1 A Y N
1 B N N
1 C Y Y
2 A N N
2 B Y N
Output
Key ind1 ind2
1 Y Y
2 Y N
So basically, whenever an ind1..n column is Y for any id under the same key, the output for that key should be Y; otherwise it should be N.
That is why both indicators are Y for key 1,
while for key 2 ind1 is Y and ind2 is N.
You can use max() for this, since 'Y' sorts after 'N'. Group by the key column, not by id:
select key, max(ind1) as ind1, max(ind2) as ind2
from t
group by key;

how to iterate a Series with multiindex in pandas

I am a beginner with pandas, and I want to implement the decision tree algorithm with it. First, I read the test data into a pandas.DataFrame, which looks like this:
In [4]: df = pd.read_csv('test.txt', sep = '\t')
In [5]: df
Out[5]:
Chocolate Vanilla Strawberry Peanut
0 Y N Y Y
1 N Y Y N
2 N N N N
3 Y Y Y Y
4 Y Y N Y
5 N N N N
6 Y Y Y Y
7 N Y N N
8 Y N Y N
9 Y N Y Y
Then I group by 'Peanut' and 'Chocolate', and what I get is:
In [15]: df2 = df.groupby(['Peanut', 'Chocolate'])
In [16]: serie1 = df2.size()
In [17]: serie1
Out[17]:
Peanut Chocolate
N N 4
Y 1
Y Y 5
dtype: int64
Now, the type of serie1 is Series. I can access the values of serie1, but I cannot get the values of 'Peanut' and 'Chocolate'. How can I get the counts in serie1 and the values of 'Peanut' and 'Chocolate' at the same time?
You can use index:
>>> serie1.index
MultiIndex(levels=[[u'N', u'Y'], [u'N', u'Y']],
labels=[[0, 0, 1], [0, 1, 1]],
names=[u'Peanut', u'Chocolate'])
You can obtain the level names and the level values from it. Note that the labels (exposed as codes instead of labels since pandas 0.24) are positions into the corresponding entry of levels. So, for example, for 'Peanut' the first label is levels[0][labels[0][0]], which is 'N'. The last label of 'Chocolate' is levels[1][labels[1][2]], which is 'Y'.
I created a small example which loops through the indexes and prints all data:
# loop over the rows
for i in range(len(serie1)):
    print "Row",i,"Value",serie1.iloc[i],
    # loop over the index levels
    for j in range(len(serie1.index.names)):
        print "Column",serie1.index.names[j],"Value",serie1.index.levels[j][serie1.index.labels[j][i]],
    print
Which results in:
Row 0 Value 4 Column Peanut Value N Column Chocolate Value N
Row 1 Value 1 Column Peanut Value N Column Chocolate Value Y
Row 2 Value 5 Column Peanut Value Y Column Chocolate Value Y
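The loop above uses Python 2 print statements and the old labels attribute. On Python 3 with a recent pandas, a shorter route is Series.items(), which yields each index tuple together with its value. The sketch below rebuilds only the two relevant columns from the question's data (the variable name rows is just illustrative):

```python
import pandas as pd

# The 'Chocolate' and 'Peanut' columns from the question, rows 0..9
df = pd.DataFrame({
    "Chocolate": list("YNNYYNYNYY"),
    "Peanut":    list("YNNYYNYNNY"),
})

serie1 = df.groupby(["Peanut", "Chocolate"]).size()

# Each item is ((peanut, chocolate), count): the index tuple plus the value
rows = [(peanut, chocolate, count)
        for (peanut, chocolate), count in serie1.items()]

for peanut, chocolate, count in rows:
    print("Peanut", peanut, "Chocolate", chocolate, "Value", count)
```

If a flat table is more convenient than a loop, serie1.reset_index(name="count") turns the index levels into ordinary columns.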

Write a number N as sum of K prime numbers

Is there any condition for writing a number N as sum of K prime numbers(prime numbers not necessarily distinct)?
Example: If N=6 and K=2 then we can write N as 6=3+3 whereas if N=11 and K=2 then we cannot represent 11 as sum of two primes.
My approach: I deduced that if K >= N, then N cannot be represented as a sum of K primes. Also, if K = 1, a primality test tells us whether N is a prime number. And by Goldbach's conjecture, every even N (except 2) can be represented as a sum of two prime numbers.
But the main problem is that I'm not able to predict it for K>=3.
1. First, list all the prime numbers less than or equal to N.
2. Then use a brute-force approach with backtracking.
Example 1:
N = 8, k = 2
2 2
2 3
2 5
2 7
3 3 (don't consider 3 2 again)
3 5 (3 + 5 = 8)
Done!
Example 2:
N = 12, k = 4
2 2 2 2
2 2 2 3
2 2 2 5
2 2 2 7
2 2 3 3 (don't check 2 2 3 2 again)
2 2 3 5 (2 + 2 + 3 + 5 = 12)
Done!
Example 3:
N = 11, k = 3
2 2 2
2 2 3
2 2 5
2 2 7 (2 + 2 + 7 = 11)
Done!
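The two steps above can be sketched in Python as follows. This is only one possible reading of the approach: the function names primes_up_to and k_primes_summing_to are made up for illustration, candidates are tried in non-decreasing order so that permutations such as 2 3 and 3 2 are not both explored, and a branch is abandoned as soon as the smallest possible remaining sum already exceeds the target.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: return all primes <= n in increasing order."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def k_primes_summing_to(n, k):
    """Return a non-decreasing list of k primes summing to n, or None."""
    primes = primes_up_to(n)

    def search(remaining, slots, start):
        if slots == 0:
            return [] if remaining == 0 else None
        for i in range(start, len(primes)):
            p = primes[i]
            # Every remaining slot holds a prime >= p, so the sum is >= p * slots
            if p * slots > remaining:
                break
            rest = search(remaining - p, slots - 1, i)  # reuse of p is allowed
            if rest is not None:
                return [p] + rest
        return None

    return search(n, k, 0)


print(k_primes_summing_to(8, 2))
print(k_primes_summing_to(12, 4))
print(k_primes_summing_to(11, 2))
```

The search returns the first combination it finds, so it answers the yes/no question and also gives one witness; it does not enumerate all representations.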

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the column L, pandas will automatically align it on that index with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
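To get back a full dataframe rather than a bare Series, one option is to look up the per-row offset with Series.map and subtract it from M. This is a sketch using the klmn1 and m0 values shown in the question; the helper names offsets and klmn1_inplace are only illustrative. It covers both variants asked about: a new frame klmn11 with klmn1 left untouched, and an in-place update (done on a copy here so both results can be shown side by side).

```python
import pandas as pd

# klmn1 and m0 rebuilt from the values printed in the question
klmn1 = pd.DataFrame(
    {"K": [1, 1, 2, 2],
     "L": ["a", "b", "a", "b"],
     "M": [0.469924, 1.231064, -0.979462, 0.322454],
     "N": [44, 68, 73, 97]},
    index=[8, 9, 10, 11],
)
m0 = pd.Series({"a": -0.307671, "b": 0.451144}, name="M")

# For every row, fetch the m0 entry that matches its L value
offsets = klmn1["L"].map(m0)

# Variant 1: new dataframe, klmn1 itself is not modified
klmn11 = klmn1.assign(M=klmn1["M"] - offsets)

# Variant 2: overwrite the M column (applied to a copy for demonstration)
klmn1_inplace = klmn1.copy()
klmn1_inplace["M"] -= offsets

print(klmn11)
```

Because map works row by row, the original row order and index (8 to 11) are preserved.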
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)