Merge/concatenate/append dataframes in pandas

I have two dataframes that look like this:
df1:
Index var1
0 56
1 67
2 21
df2:
Index var2
0 89
1 64
2 31
When I append or concatenate them, I get this:
Index var1 var2
0 56 nan
1 67 nan
2 21 nan
0 nan 89
1 nan 64
2 nan 31
But I would like to get this:
Index var1 var2
0 56 89
1 67 64
2 21 31
The commands I used are:
pd.concat([df1, df2], axis=1)
df1.append([df2])
EDIT:
This is a min-example:
df1 = pd.DataFrame({'var1' : [56,67,21]})
df2 = pd.DataFrame({'var2' : [89,64,31]})
df1.to_dict()
{'var1': {0: 56, 1: 67, 2: 21}}
df2.to_dict()
{'var2': {0: 89, 1: 64, 2: 31}}
df1.index.dtype
dtype('int64')
df2.index.dtype
dtype('int64')

Use:
df1 = df1.set_index('Index')
df2 = df2.set_index('Index')
pd.concat([df1,df2], axis=1)
NOTE: Be sure that Index is the index of both dataframes, not a regular column.
Output:
var1 var2
Index
0 56 89
1 67 64
2 21 31
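The alignment behaviour behind this answer can be checked on the min-example from the question. The sketch below assumes the default integer index on both frames; the relabelled copy `df2_shifted` is a hypothetical case added only to reproduce the NaN-padded result.

```python
import pandas as pd

df1 = pd.DataFrame({"var1": [56, 67, 21]})
df2 = pd.DataFrame({"var2": [89, 64, 31]})

# axis=1 aligns rows on index labels; both frames share the labels 0, 1, 2.
side_by_side = pd.concat([df1, df2], axis=1)

# Hypothetical mismatch: relabel df2 so that no index label is shared.
df2_shifted = df2.copy()
df2_shifted.index = [10, 11, 12]
padded = pd.concat([df1, df2_shifted], axis=1)  # 6 rows, half of them NaN

# Dropping the old labels restores purely positional alignment.
realigned = pd.concat(
    [df1.reset_index(drop=True), df2_shifted.reset_index(drop=True)],
    axis=1,
)

print(side_by_side)
print(padded)
print(realigned)
```

If the two frames are known to be in the same row order, resetting the index before concatenating is usually the simplest fix.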

Related

How to stack and rename N successive columns in df

How would I achieve the desired output as shown below? I.e., stack the first 3 columns underneath each other, stack the second 3 columns underneath each other, and rename the columns.
d = {'A': [76, 34], 'B': [21, 48], 'C': [45, 89], 'D': [56, 41], 'E': [3, 2],
'F': [78, 32]}
df = pd.DataFrame(data=d)
df.columns=['A', 'A', 'A', 'A', 'A', 'A']
Output
df
A A A A A A
0 76 21 45 56 3 78
1 34 48 89 41 2 32
Desired Output
Z1 Z2
0 76 56
1 34 41
2 21 3
3 48 2
4 45 78
5 89 32
Go down into numpy, reshape and create a new dataframe:
pd.DataFrame(df.to_numpy().reshape((-1, 2), order='F'), columns = ['Z1','Z2'])
Out[19]:
Z1 Z2
0 76 56
1 34 41
2 21 3
3 48 2
4 45 78
5 89 32
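The reshape works because column-major ('F') order reads the values column by column and writes them back column by column. A minimal sketch that generalizes this to any block size follows; the helper name `stack_blocks` and its parameters are hypothetical, not part of pandas or NumPy.

```python
import pandas as pd


def stack_blocks(df, block_size, names):
    """Stack each run of `block_size` adjacent columns into one output column."""
    n_rows, n_cols = df.shape
    if n_cols % block_size != 0:
        raise ValueError("column count must be a multiple of block_size")
    n_out = n_cols // block_size
    if len(names) != n_out:
        raise ValueError("need exactly one name per output column")
    # 'F' order flattens column by column, then fills the result column by
    # column, so each output column holds one block of stacked input columns.
    values = df.to_numpy().reshape((n_rows * block_size, n_out), order="F")
    return pd.DataFrame(values, columns=names)


d = {"A": [76, 34], "B": [21, 48], "C": [45, 89],
     "D": [56, 41], "E": [3, 2], "F": [78, 32]}
df = pd.DataFrame(data=d)
df.columns = ["A"] * 6  # duplicate labels, as in the question

stacked = stack_blocks(df, block_size=3, names=["Z1", "Z2"])
print(stacked)
```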

How to create and populate pandas columns based on cell values

I have created a dataframe called df as follows:
import pandas as pd
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
print(df)
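The dataframe looks like this:
feature1 feature2 feature3
0 1 33 100
1 22 2 2
2 45 2 359
3 78 65 87
4 78 65 2
I want to create two new columns called Freq_1 and Freq_2 that count, for each record, how many times the number 1 and number 2 appear respectively. So, I'd like the resulting dataframe to look like this:
feature1 feature2 feature3 Freq_1 Freq_2
0 1 33 100 1 0
1 22 2 2 0 2
2 45 2 359 0 1
3 78 65 87 0 0
4 78 65 2 0 1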
So, let's take a look at the column called Freq_1:
for the first record, it's equal to 1 because the number 1 appears only once across the whole first record;
for the other records, it's equal to 0 because the number 1 never appears.
Let's take a look now at the column called Freq_2:
for the first record, Freq_2 is equal to 0 because number 2 doesn't appear;
for the second record, Freq_2 is equal to 2 because the number 2 appears twice;
and so on ...
How do I create the columns Freq_1 and Freq_2 in pandas?
Try this:
freq = {
    i: df.eq(i).sum(axis=1) for i in range(10)
}
pd.concat([df, pd.DataFrame(freq).add_prefix("Freq_")], axis=1)
Result:
feature1 feature2 feature3 Freq_0 Freq_1 Freq_2 Freq_3 Freq_4 Freq_5 Freq_6 Freq_7 Freq_8 Freq_9
1 33 100 0 1 0 0 0 0 0 0 0 0
22 2 2 0 0 2 0 0 0 0 0 0 0
45 2 359 0 0 1 0 0 0 0 0 0 0
78 65 87 0 0 0 0 0 0 0 0 0 0
78 65 2 0 0 1 0 0 0 0 0 0 0
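If only the two requested columns are wanted, the same `eq(...).sum(axis=1)` idea can be limited to the values 1 and 2. This is a sketch under the assumption that the counts should cover only the three original feature columns; the name `features` is introduced here for illustration.

```python
import pandas as pd

d = {"feature1": [1, 22, 45, 78, 78],
     "feature2": [33, 2, 2, 65, 65],
     "feature3": [100, 2, 359, 87, 2]}
df = pd.DataFrame(data=d)

# Take a snapshot of the feature columns so the new Freq_* columns
# are never counted themselves.
features = df[["feature1", "feature2", "feature3"]]

for value in (1, 2):
    # eq() marks cells equal to `value`; summing along axis=1 counts per row.
    df[f"Freq_{value}"] = features.eq(value).sum(axis=1)

print(df)
```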
String pattern matching can be performed once the columns are cast to strings.
d = {'feature1': [1, 22,45,78,78], 'feature2': [33, 2,2,65,65], 'feature3': [100, 2,359,87,2],}
df = pd.DataFrame(data=d)
df = df.stack().astype(str).unstack()
Now we can iterate for each pattern that we are looking for:
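useful_columns = df.columns
for pattern in ['1', '2']:
    # str.count counts substring occurrences per cell ('22' yields 2 for '2');
    # max(axis=1) then keeps the largest per-cell count in each row.
    df[f'freq_{pattern}'] = df[useful_columns].stack().str.count(pattern).unstack().max(axis=1)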
Printing the output:
feature1 feature2 feature3 freq_1 freq_2
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0
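Substring counting and exact value matching agree on the data above, but they are not the same operation. The sketch below uses a made-up single row (12, 21, 5), which is not from the original post, to show where they diverge.

```python
import pandas as pd

# Hypothetical row chosen so that the digit 2 appears inside other numbers.
row = pd.DataFrame({"feature1": [12], "feature2": [21], "feature3": [5]})

# Substring approach: count the character '2' in each cell, keep the row max.
as_text = row.astype(str)
substring_hits = as_text.stack().str.count("2").unstack().max(axis=1)

# Exact approach: count cells whose value equals the number 2.
exact_hits = row.eq(2).sum(axis=1)

print(int(substring_hits.iloc[0]), int(exact_hits.iloc[0]))
```

When the goal is to count cells equal to a number, comparing values with `eq` avoids these partial matches.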
We can do
s = df.where(df.isin([1,2])).stack()
out = df.join(pd.crosstab(s.index.get_level_values(0),s).add_prefix('Freq_')).fillna(0)
Out[299]:
feature1 feature2 feature3 Freq_1.0 Freq_2.0
0 1 33 100 1.0 0.0
1 22 2 2 0.0 2.0
2 45 2 359 0.0 1.0
3 78 65 87 0.0 0.0
4 78 65 2 0.0 1.0

how to conditionally create new column in dataframe based on other column values in julia

I have a julia dataframe that looks like this:
time data
0 34
1 34
2 30
3 37
4 32
5 35
How do I create a new binary column that is 0 if time is less than 2 or greater than 4, and 1 otherwise?
Like this:
time data x
0 34 0
1 34 0
2 30 1
3 37 1
4 32 1
5 35 0
In python, I would do something like:
def func(df):
    if df.time < 2 or df.time > 4:
        return 0
    else:
        return 1

df['x'] = df.apply(func, axis=1)
In Julia we have the beautiful Dot Syntax which can be gracefully applied here:
julia> df[!, :x] = 2 .<= df[!, :time] .<= 4
6-element BitVector:
0
0
1
1
1
0
or alternatively
df.x = 2 .<= df.time .<= 4
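For comparison with the Python snippet in the question, the same broadcasted comparison can be written in pandas without a row-wise `apply`. This is a sketch that rebuilds the sample table by hand; `Series.between` is inclusive at both ends by default.

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 1, 2, 3, 4, 5],
                   "data": [34, 34, 30, 37, 32, 35]})

# True where 2 <= time <= 4, then converted to 0/1 integers.
df["x"] = df["time"].between(2, 4).astype(int)

print(df)
```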

How to compute mean for different size subsets within pandas dataframe?

I want to compute the mean of a particular column for each unique subset of rows in a pandas dataframe. In the following example, each subset runs until a 1 appears in column "Flag", i.e. (54+34+78+91+29)/5 = 57.2 and (81+44+61)/3 = 62.0.
I am currently unable to compute rolling subsets of different sizes based on a condition on a particular column.
>>> import pandas as pd
>>> df = pd.DataFrame({"Indx": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "Units": [54, 34, 78, 91, 29, 81, 44, 61, 73, 19], "Flag": [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]})
>>> df
Indx Units Flag
0 1 54 0
1 2 34 0
2 3 78 0
3 4 91 0
4 5 29 1
5 6 81 0
6 7 44 0
7 8 61 1
8 9 73 0
9 10 19 1
# DESIRED OUTPUT
>>> df
Indx Units Flag avg
0 1 54 0 57.2
1 2 34 0 57.2
2 3 78 0 57.2
3 4 91 0 57.2
4 5 29 1 57.2
5 6 81 0 62.0
6 7 44 0 62.0
7 8 61 1 62.0
8 9 73 0 46.0
9 10 19 1 46.0
Create the group key with a reversed cumsum, then use transform:
df['Units'].groupby(df.Flag.iloc[::-1].cumsum()).transform('mean')
0 57.2
1 57.2
2 57.2
3 57.2
4 57.2
5 62.0
6 62.0
7 62.0
8 46.0
9 46.0
Name: Units, dtype: float64
#df['new']=df['Units'].groupby(df.Flag.iloc[::-1].cumsum()).transform('mean')
The shortest solution (I think) is:
df['avg'] = df.groupby(df.Flag[::-1].cumsum()).Units.transform('mean')
You don't even need to use iloc, as df.Flag[::-1] retrieves the Flag column in reversed order.
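To see why the reversed cumulative sum works as a group key, it helps to look at the key itself. The sketch below rebuilds the question's data; the variable name `key` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Indx": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Units": [54, 34, 78, 91, 29, 81, 44, 61, 73, 19],
    "Flag": [0, 0, 0, 0, 1, 0, 0, 1, 0, 1],
})

# Summing Flag from the bottom up means every row that closes a subset
# (Flag == 1) and all rows leading up to it receive the same running total.
key = df["Flag"][::-1].cumsum()

# groupby aligns the key on index labels, so its reversed order is harmless.
df["avg"] = df.groupby(key)["Units"].transform("mean")

print(key.sort_index().tolist())
print(df)
```

Read in the original row order, the key is 3 for the first five rows, 2 for the next three, and 1 for the last two, which matches the three subsets in the desired output.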

Pandas condition across 2 dataframes

I have 2 dataframes df1 and df2
df1:
A B C
0 11 22 55
1 66 34 54
2 0 34 66
df2:
A B C
0 11 33 455
1 0 0 54
2 0 34 766
Both dataframes have the same dimensions. Wherever a value is 0 in df2, I want to set the value at the same column and index in df1 to 0.
So df1 will be
df1:
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Use DataFrame.mask:
df1 = df1.mask(df2 == 0, 0)
For better performance use numpy.where:
import numpy as np

df1 = pd.DataFrame(np.where(df2 == 0, 0, df1),
                   index=df1.index,
                   columns=df1.columns)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Using where:
df1 = df1.where(df2.ne(0), 0)
print(df1)
A B C
0 11 22 55
1 0 0 54
2 0 34 66
Another way (the intermediate NaN values upcast the affected columns to float):
df1 = df1[~df2.eq(0)].fillna(0)
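The mask, where and numpy.where variants above can be run side by side to confirm they give the same values. This sketch rebuilds both frames from the question; the result names (`masked`, `via_where`, `via_numpy`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 66, 0], "B": [22, 34, 34], "C": [55, 54, 66]})
df2 = pd.DataFrame({"A": [11, 0, 0], "B": [33, 0, 34], "C": [455, 54, 766]})

zero_in_df2 = df2.eq(0)  # boolean frame, True where df2 holds a 0

# mask replaces values where the condition is True.
masked = df1.mask(zero_in_df2, 0)

# where keeps values where the condition is True, so the condition is negated.
via_where = df1.where(~zero_in_df2, 0)

# numpy.where works on the raw arrays; index and columns are restored by hand.
via_numpy = pd.DataFrame(
    np.where(zero_in_df2, 0, df1),
    index=df1.index,
    columns=df1.columns,
)

print(masked)
```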