How to create and populate pandas columns based on cell values - pandas

I have created a dataframe called df as follows:
import pandas as pd
d = {'feature1': [1, 22, 45, 78, 78], 'feature2': [33, 2, 2, 65, 65], 'feature3': [100, 2, 359, 87, 2]}
df = pd.DataFrame(data=d)
print(df)
The dataframe looks like this:
   feature1  feature2  feature3
0         1        33       100
1        22         2         2
2        45         2       359
3        78        65        87
4        78        65         2
I want to create two new columns called Freq_1 and Freq_2 that count, for each record, how many times the number 1 and the number 2 appear respectively. So, I'd like the resulting dataframe to look like this:
   feature1  feature2  feature3  Freq_1  Freq_2
0         1        33       100       1       0
1        22         2         2       0       2
2        45         2       359       0       1
3        78        65        87       0       0
4        78        65         2       0       1
So, let's take a look at the column called Freq_1:
for the first record, it's equal to 1 because the number 1 appears only once across the whole first record;
for the other records, it's equal to 0 because the number 1 never appears.
Let's take a look now at the column called Freq_2:
for the first record, Freq_2 is equal to 0 because number 2 doesn't appear;
for the second record, Freq_2 is equal to 2 because the number 2 appears twice;
and so on ...
How do I create the columns Freq_1 and Freq_2 in pandas?

Try this:
# For each candidate value i, df.eq(i) marks the cells equal to i
# and sum(axis=1) counts them per row.
freq = {
    i: df.eq(i).sum(axis=1) for i in range(10)
}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)
Result:
   feature1  feature2  feature3  Freq_0  Freq_1  Freq_2  Freq_3  Freq_4  Freq_5  Freq_6  Freq_7  Freq_8  Freq_9
0         1        33       100       0       1       0       0       0       0       0       0       0       0
1        22         2         2       0       0       2       0       0       0       0       0       0       0
2        45         2       359       0       0       1       0       0       0       0       0       0       0
3        78        65        87       0       0       0       0       0       0       0       0       0       0
4        78        65         2       0       0       1       0       0       0       0       0       0       0
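If you only need the Freq_1 and Freq_2 columns from the question, the same eq/sum idea can be restricted to those two values. A minimal self-contained sketch:

import pandas as pd

df = pd.DataFrame({'feature1': [1, 22, 45, 78, 78],
                   'feature2': [33, 2, 2, 65, 65],
                   'feature3': [100, 2, 359, 87, 2]})

# Freeze the original feature columns so the new Freq_* columns
# are not included in later comparisons.
features = list(df.columns)
for i in [1, 2]:
    df[f"Freq_{i}"] = df[features].eq(i).sum(axis=1)

print(df)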

String pattern matching can be performed once the columns are cast to strings.
d = {'feature1': [1, 22, 45, 78, 78], 'feature2': [33, 2, 2, 65, 65], 'feature3': [100, 2, 359, 87, 2]}
df = pd.DataFrame(data=d)
df = df.stack().astype(str).unstack()
Now we can iterate over each pattern that we are looking for:
useful_columns = df.columns
for pattern in ['1', '2']:
    df[f'freq_{pattern}'] = df[useful_columns].stack().str.count(pattern).unstack().max(axis=1)
Printing the output:
feature1 feature2 feature3 freq_1 freq_2
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0
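Note that str.count counts digit occurrences inside each cell, so '22' contains '2' twice; taking max across the row is what keeps the result at 2 here, but it would undercount a row like 2, 2, 2. If you want whole-cell matches on the string frame instead, a hedged alternative (reusing df and useful_columns from above) is to compare each cell to the pattern directly:

# Compare whole cells to the pattern instead of counting substrings,
# then count the matches per row.
for pattern in ['1', '2']:
    df[f'freq_{pattern}'] = df[useful_columns].eq(pattern).sum(axis=1)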

We can do:
s = df.where(df.isin([1, 2])).stack()
out = df.join(pd.crosstab(s.index.get_level_values(0), s).add_prefix('Freq_')).fillna(0)
Out[299]:
feature1 feature2 feature3 Freq_1.0 Freq_2.0
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0
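The Freq_1.0/Freq_2.0 column names appear because where introduces NaN, which turns the stacked values into floats. A small variant, assuming the matched values are integers, casts them back before building the crosstab:

s = df.where(df.isin([1, 2])).stack().astype(int)  # floats back to int labels
out = df.join(pd.crosstab(s.index.get_level_values(0), s).add_prefix('Freq_')).fillna(0)
# the new columns are now named Freq_1 and Freq_2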

Related

how to conditionally create new column in dataframe based on other column values in julia

I have a julia dataframe that looks like this:
time data
0 34
1 34
2 30
3 37
4 32
5 35
How do I create a new binary column that is 0 if time is less than 2 or greater than 4, and 1 otherwise?
Like this:
time data x
0 34 0
1 34 0
2 30 1
3 37 1
4 32 1
5 35 0
In python, I would do something like:
def func(df):
    if df.time < 2 or df.time > 4:
        return 0
    else:
        return 1
df['x'] = df.apply(func, axis=1)
In Julia we have the beautiful Dot Syntax which can be gracefully applied here:
julia> df[!, :x] = 2 .<= df[!, :time] .<= 4
6-element BitVector:
0
0
1
1
1
0
or alternatively
df.x = 2 .<= df.time .<= 4
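For comparison, the asker's Python version can be vectorized the same way. A minimal pandas sketch using Series.between (inclusive on both ends by default):

import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2, 3, 4, 5],
                   'data': [34, 34, 30, 37, 32, 35]})

# Vectorized chained comparison, then cast the boolean mask to 0/1.
df['x'] = df['time'].between(2, 4).astype(int)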

Pandas condition across 2 dataframes

I have 2 dataframes df1 and df2
df1:
A B C
0 11 22 55
1 66 34 54
2 0 34 66
df2:
A B C
0 11 33 455
1 0 0 54
2 0 34 766
Both dataframes have the same dimensions. If a value is 0 in df2, I want to set the value at the same column and index in df1 to 0.
So df1 will be
df1:
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Use DataFrame.mask:
df1 = df1.mask(df2 == 0, 0)
For better performance use numpy.where:
import numpy as np

df1 = pd.DataFrame(np.where(df2 == 0, 0, df1),
                   index=df1.index,
                   columns=df1.columns)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Using where:
df1 = df1.where(df2.ne(0), 0)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Another way -
df1 = df1[~df2.eq(0)].fillna(0)
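All of these produce the same values; here is a self-contained sketch built from the question's data for checking that (the fillna variant returns floats, so it is left out of the equals comparison):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [11, 66, 0], 'B': [22, 34, 34], 'C': [55, 54, 66]})
df2 = pd.DataFrame({'A': [11, 0, 0], 'B': [33, 0, 34], 'C': [455, 54, 766]})

# mask replaces values where the condition is True;
# where replaces values where the condition is False.
via_mask = df1.mask(df2 == 0, 0)
via_where = df1.where(df2.ne(0), 0)
via_numpy = pd.DataFrame(np.where(df2 == 0, 0, df1),
                         index=df1.index, columns=df1.columns)

assert via_mask.equals(via_where) and via_mask.equals(via_numpy)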

group data using pandas, but how do I keep the order of the groups and do math on the rows of two of the columns?

df:
Time Name X Y
0 00 AA 0 0
1 30 BB 1 1
2 45 CC 2 2
3 60 GG:AB 3 3
4 90 GG:AC 4 4
5 120 AA 5 3
dataGroup = df.groupby([pd.Grouper(key='Time', freq='30s'), 'Name']).sort_values(by=['Timestamp'], ascending=True)
I have tried doing a diff() on the row, but it is returning NaN or something not expected.
df.groupby('Name', sort=False)['X'].diff()
How do I keep the groupings and the time sort, and take the diff between a row and its previous row (for both the X and the Y columns)?
Expected output:
For group AA, XDiff would be:
XDiff row 1 = (X row 1 - origin (known))
XDiff row 2 = (X row 2 - X row 1)
Time Name X Y XDiff YDiff
0 00 AA 0 0 0 0
5 120 AA 5 3 5 3
1 30 BB 1 1 0 0
6 55 BB 2 3 1 2
2 45 CC 2 2 0 0
3 60 GG:AB 3 3 0 0
4 90 GG:AC 4 4 0 0
It would be nice to see the total distance for each group (i.e., AA is 5, BB is 1).
In my example, I only have a couple of rows for each group, but what if there were 100 of them? The diff would give me values for the distance between any two rows, but not the total distance for that group.
Ripping off https://stackoverflow.com/a/20664760/6672746, you can use a lambda function to calculate the difference between rows for X and Y. I also included two lines to set the index (after the groupby) and sort it.
df['x_diff'] = df.groupby(['Name'])['X'].transform(lambda x: x.diff()).fillna(0)
df['y_diff'] = df.groupby(['Name'])['Y'].transform(lambda x: x.diff()).fillna(0)
df.set_index(["Name", "Time"], inplace=True)
df.sort_index(level=["Name", "Time"], inplace=True)
Output:
            X  Y  x_diff  y_diff
Name  Time
AA    0     0  0     0.0     0.0
      120   5  3     5.0     3.0
BB    30    1  1     0.0     0.0
CC    45    2  2     0.0     0.0
GG:AB 60    3  3     0.0     0.0
GG:AC 90    4  4     0.0     0.0
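For the total distance per group that the asker also wanted, one option is to sum the per-row diffs computed above, since within each group they telescope to (last X - first X):

# x_diff sums to 5.0 for AA and 0.0 for the single-row groups
# in this sample; df still has the (Name, Time) MultiIndex here.
totals = df.groupby(level='Name')[['x_diff', 'y_diff']].sum()
print(totals)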

Pandas groupby with MultiIndex columns and different levels

I want to do a groupby on a MultiIndex dataframe, counting the occurrences for each column for every user2 in df:
>>> df
   user1  user2  count
                     0       1   2
                     a   x   d   a
0      2      6      0   1   0   0
1      4      6      0   0   0   3
2     21     76      2   0   1   0
3      5     18      0   0   0   0
Note that user1 and user2 are at the same level as count (side effect of merging).
Desired output:
   user2  count
              0       1   2
              a   x   d   a
0      6      0   1   0   1
1     76      1   0   0   0
3     18      0   0   0   0
I've tried
>>> df.groupby(['user2','count'])
but I get
ValueError: Grouper for 'count' not 1-dimensional
GENERATOR CODE:
df = pd.DataFrame({'user1':[2,4,21,21],'user2':[6,6,76,76],'param1':[0,2,0,1],'param2':['x','a','a','d'],'count':[1,3,2,1]}, columns=['user1','user2','param1','param2','count'])
df = df.set_index(['user1','user2','param1','param2'])
df = df.unstack([2,3]).sort_index(axis=1).reset_index()
df2 = pd.DataFrame({'user1':[2,5,21],'user2':[6,18,76]})
df2.columns = pd.MultiIndex.from_product([df2.columns, [''],['']])
final_df = df2.merge(df, on=['user1','user2'], how='outer').fillna(0)
IIUC, you want:
final_df.where(final_df>0).groupby('user2').count().drop('user1', axis=1).reset_index()
Output:
   user2  count
              0       1   2
              a   x   d   a
0      6      0   1   0   1
1     18      0   0   0   0
2     76      1   0   1   0
To avoid dropping columns, select only 'count' and change the function to sum:
final_df.where(final_df>0).groupby('user2').sum()[['count']].reset_index()
Output:
   user2  count
              0         1    2
              a    x    d    a
0      6    0.0  1.0  0.0  3.0
1     18    0.0  0.0  0.0  0.0
2     76    2.0  0.0  1.0  0.0
To also avoid dropping rows where user2 equals zero:
final_df[['count']].where(final_df[['count']] > 0)\
    .groupby(final_df.user2).sum().reset_index()
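Since the counts are whole numbers here, an optional cleanup (my addition, not part of the answer) is to cast the summed count block to int before resetting the index, so the .0 suffixes disappear:

result = (final_df.where(final_df > 0)
                  .groupby('user2').sum()[['count']]
                  .astype(int)
                  .reset_index())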

plot pandas data frame but most columns have zeros

I am new to pandas and IPython. I have just set everything up and am currently playing around. I have the following data frame:
Field 10 20 30 40 50 60 70 80 90 95
0 A 0 0 0 0 0 0 0 0 1 3
1 B 0 0 0 0 0 0 0 1 4 14
2 C 0 0 0 0 0 0 0 1 2 7
3 D 0 0 0 0 0 0 0 1 5 15
4 u 0 0 0 0 0 0 0 1 5 14
5 K 0 0 0 0 0 0 1 2 7 21
6 S 0 0 0 0 0 0 0 1 3 8
7 E 0 0 0 0 0 0 0 1 3 8
8 F 0 0 0 0 0 0 0 1 6 16
I used a csv file to import this data:
df = pd.read_csv('/mycsvfile.csv',
                 index_col=False, header=0)
As you can see, most of the columns are zero. This data frame has a large number of rows, and in a given column most of the rows can be zero while one or two hold a value like 70.
I wonder how I can turn this into a nice graph that shows the 70, 80, 95 columns with emphasis.
I found the following tutorial: http://pandas.pydata.org/pandas-docs/version/0.9.1/visualization.html but I am still unable to get a good figure.
It depends a bit on how you want to handle the zero values, but here is an approach:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [0, 0, 0, 0, 70, 0, 0, 90, 0, 0, 80, 0, 0],
                   'b': [0, 0, 0, 50, 0, 60, 0, 90, 0, 80, 0, 0, 0]})

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# plot the original, for comparison
df.plot(ax=axs[0])

# plot only the nonzero values of each column
for name, col in df.items():  # .iteritems() on older pandas
    col[col != 0].plot(ax=axs[1], label=name)

axs[1].set_xlim(df.index[0], df.index[-1])
axs[1].set_ylim(bottom=0)
axs[1].legend(loc=0)
You could also go for something with .replace(0, np.nan), but matplotlib doesn't draw lines if there are NaNs in between, so you would probably end up looping over the columns anyway (and then using dropna().plot(), for example).
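A minimal sketch of that replace-and-dropna variant (my illustration, continuing with the df and plt from above), with markers so isolated points stay visible:

import numpy as np

fig, ax = plt.subplots()
for name, col in df.replace(0, np.nan).items():
    # dropna() closes the gaps so each column draws as a connected line;
    # a marker keeps a single isolated value visible.
    col.dropna().plot(ax=ax, label=name, marker='o')
ax.legend(loc=0)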