Concatenating dataframes that have different numbers of rows - pandas

I have a dataframe df = df[['A', 'B', 'C']] with 3 columns and 2000 rows.
Then I have another set of data with only 200 rows.
How can I add it as df['D'] so that these 200 rows line up with the tail of the 2000 rows?
That is, df['D'] should be NaN for the first 1800 rows and hold the values in the last 200 rows.
I've been trying various ways without success... thank you.
The data with 200 rows is in this format:
[[ 0.43628979]
[ 0.43454027]
[ 0.43552566]
[ 0.43542767]
[ 0.43331838]
...

I believe you need join, after setting the index of df2 to the last index values of df1:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=list('ABC'))
print (df1)
A B C
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
6 6 2 4
7 1 5 3
8 4 4 3
9 7 1 1
10 7 7 0
11 2 9 9
12 3 2 5
13 8 1 0
14 7 6 2
15 0 8 2
16 5 1 8
17 1 5 4
18 2 8 3
19 5 0 9
df2 = pd.DataFrame(np.random.randint(10, size=(2,5)), columns=list('werty'))
print (df2)
w e r t y
0 3 6 3 4 7
1 6 3 9 0 4
df2.index = df1.index[-len(df2.index):]
df = df1.join(df2)
print (df)
A B C w e r t y
0 8 8 3 NaN NaN NaN NaN NaN
1 7 7 0 NaN NaN NaN NaN NaN
2 4 2 5 NaN NaN NaN NaN NaN
3 2 2 2 NaN NaN NaN NaN NaN
4 1 0 8 NaN NaN NaN NaN NaN
5 4 0 9 NaN NaN NaN NaN NaN
6 6 2 4 NaN NaN NaN NaN NaN
7 1 5 3 NaN NaN NaN NaN NaN
8 4 4 3 NaN NaN NaN NaN NaN
9 7 1 1 NaN NaN NaN NaN NaN
10 7 7 0 NaN NaN NaN NaN NaN
11 2 9 9 NaN NaN NaN NaN NaN
12 3 2 5 NaN NaN NaN NaN NaN
13 8 1 0 NaN NaN NaN NaN NaN
14 7 6 2 NaN NaN NaN NaN NaN
15 0 8 2 NaN NaN NaN NaN NaN
16 5 1 8 NaN NaN NaN NaN NaN
17 1 5 4 NaN NaN NaN NaN NaN
18 2 8 3 3.0 6.0 3.0 4.0 7.0
19 5 0 9 6.0 3.0 9.0 0.0 4.0
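For the case in the question, where the extra data is a NumPy array of shape (n, 1) rather than a DataFrame, the same idea can be sketched as follows. The frame here is a small stand-in (10 rows instead of 2000), and the names tail_values and new_col are only illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 2000-row frame (assumption: default RangeIndex).
df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=list("ABC"))

# Stand-in for the 200-row array, shaped (n, 1) like the data in the question.
tail_values = np.array([[0.43628979], [0.43454027], [0.43552566]])

# Give the new values the last len(tail_values) index labels of df,
# then join on the index; rows without a match become NaN.
new_col = pd.Series(
    tail_values.ravel(),
    index=df.index[-len(tail_values):],
    name="D",
)
df = df.join(new_col)
print(df)
```

Because the alignment is done on index labels, this works for any index on df, not only the default 0..n-1 range.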

Related

Pandas groupby calculation using values from different rows based on other column

I have the following dataframe, where observations are grouped in pairs. NaN in name represents a different product traded in a pair with A. I want to group by transaction and compute A/NaN, so that the value of every NaN row is expressed in units of A.
transaction name value ...many other columns
1 A 3
1 NaN 5
2 NaN 7
2 A 6
3 A 4
3 NaN 3
4 A 10
4 NaN 9
5 C 8
5 A 6
..
Thus the desired df would be
transaction name value new_column ...many other columns
1 A 3 NaN
1 NaN 6 0.5
2 NaN 7 0.8571
2 A 6 NaN
3 A 4 1.333
3 NaN 3 NaN
4 A 10 1.111
4 NaN 9 NaN
5 C 8 0.75
5 A 6 NaN
...
First filter the rows with A and set transaction as the index; then, for the rows that are not A, look up the A value of the same transaction with Series.map and divide it by the row's own value:
m = df['name'].ne('A')
s = df[~m].set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN
1 1 NaN 5 0.600000
2 2 NaN 7 0.857143
3 2 A 6 NaN
4 3 A 4 NaN
5 3 NaN 3 1.333333
6 4 A 10 NaN
7 4 NaN 9 1.111111
8 5 NaN 8 0.750000
9 5 A 6 NaN
EDIT: If there are multiple A values per group, not only one, a possible solution is to remove the duplicates:
print (df)
transaction name value
0 1 A 3
1 1 A 4
2 1 NaN 5
3 2 NaN 7
4 2 A 6
5 3 A 4
6 3 NaN 3
7 4 A 10
8 4 NaN 9
9 5 C 8
10 5 A 6
# s = df[~m].set_index('transaction')['value']
# df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
# print (df)
#InvalidIndexError: Reindexing only valid with uniquely valued Index objects
m = df['name'].ne('A')
print (df[~m].drop_duplicates(['transaction','name']))
transaction name value
0 1 A 3
4 2 A 6
5 3 A 4
7 4 A 10
10 5 A 6
s = df[~m].drop_duplicates(['transaction','name']).set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print (df)
transaction name value new_column
0 1 A 3 NaN <- 2 times a per 1 group
1 1 A 4 NaN <- 2 times a per 1 group
2 1 NaN 5 0.600000
3 2 NaN 7 0.857143
4 2 A 6 NaN
5 3 A 4 NaN
6 3 NaN 3 1.333333
7 4 A 10 NaN
8 4 NaN 9 1.111111
9 5 C 8 0.750000
10 5 A 6 NaN
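The duplicate-handling variant can be tried on a small self-contained frame. This is only a sketch: the data below is a shortened, made-up version of the example, and keeping the first A row per transaction is an assumption about which duplicate should win.

```python
import numpy as np
import pandas as pd

# Shortened example data; transaction 1 has two A rows.
df = pd.DataFrame({
    "transaction": [1, 1, 1, 2, 2],
    "name": ["A", "A", np.nan, np.nan, "A"],
    "value": [3, 4, 5, 7, 6],
})

m = df["name"].ne("A")  # True for every row that is not an A row

# Keep one A row per transaction so the lookup Series has a unique index.
s = (
    df[~m]
    .drop_duplicates(["transaction", "name"])
    .set_index("transaction")["value"]
)

# Look up the A value of each non-A row's transaction and divide.
df.loc[m, "new_column"] = df.loc[m, "transaction"].map(s) / df.loc[m, "value"]
print(df)
```

If a different A row should be used (for example the last one), drop_duplicates accepts keep='last'.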
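Assuming there are only two values per transaction, you can use agg and divide the first element by the last one (after sorting by name, the A row comes first and the NaN row last), then map the ratio back by transaction, because the aggregated result is indexed by transaction and not by the original row labels:
ratio = df.sort_values(by='name').\
groupby('transaction')['value'].\
agg(f='first', l='last').agg(lambda x: x['f'] / x['l'], axis=1)
df.loc[df['name'].isna(), 'new_column'] = df['transaction'].map(ratio)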
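A small self-contained check of the first/last idea, assuming exactly one A row and one NaN row per transaction. The sketch maps the ratio back through the transaction key, since the aggregated result is indexed by transaction rather than by row label; the names ordered, pairs and ratio are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "transaction": [1, 1, 2, 2, 3, 3],
    "name": ["A", np.nan, np.nan, "A", "A", np.nan],
    "value": [3, 5, 7, 6, 4, 3],
})

# sort_values puts NaN names last, so within each transaction
# the A row comes first and the NaN row comes last.
ordered = df.sort_values(by="name")
pairs = ordered.groupby("transaction")["value"].agg(f="first", l="last")
ratio = pairs["f"] / pairs["l"]  # indexed by transaction

# Write the ratio only on the NaN rows, looked up by transaction.
mask = df["name"].isna()
df.loc[mask, "new_column"] = df.loc[mask, "transaction"].map(ratio)
print(df)
```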

How to keep True and None Value using pandas?

I've one DataFrame
import pandas as pd
data = {'a': [1,2,3,None,4,None,2,4,5,None],'b':[6,6,6,'NaN',4,'NaN',11,11,11,'NaN']}
df = pd.DataFrame(data)
condition = (df['a']>2) | (df['a'] == None)
print(df[condition])
a b
0 1.0 6
1 2.0 6
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
6 2.0 11
7 4.0 11
8 5.0 11
9 NaN NaN
Here, I want to keep the rows where the condition is True, and I also want to keep the rows where a is None.
Expected output is :
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
Thanks in Advance
You can add another condition with | (or). (Note: see #ALlolz's comment; you shouldn't compare a Series with np.nan.)
condition = (df['a']>2) | (df['a'].isna())
df[condition]
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
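To see why the original condition did not keep the missing rows, here is a minimal sketch using only column a from the question. The variable names eq_none and is_missing are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, None, 4, None, 2, 4, 5, None]})

# None is stored as NaN in a float column, and an elementwise
# comparison with None is False everywhere.
eq_none = df["a"] == None  # noqa: E711 (kept to show the pitfall)

# isna() is the reliable way to detect missing values.
is_missing = df["a"].isna()

condition = (df["a"] > 2) | is_missing
result = df[condition]
print(eq_none.sum(), is_missing.sum())
print(result)
```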

Pandas assign value in one column based on top 10 values in another column

I have a table:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
I would like to make a new column called 'flag' for the top 2 values in column D.
I've tried:
for i in df.D.nlargest(2):
    df['flag'] = 1
But that gets me:
A B C D flag
0 NaN 2.0 NaN 0 1
1 3.0 4.0 NaN 1 1
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
What I want is:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
IIUC:
df['flag'] = 0
df.loc[df.D.nlargest(2).index, 'flag'] = 1
Or:
df['flag'] = df.index.isin(df.D.nlargest(2).index).astype(int)
Output:
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1
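For reference, a self-contained version of the second variant, with the table from the question rebuilt by hand. The intermediate name top_idx is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 3.0, np.nan, np.nan],
    "B": [2.0, 4.0, np.nan, 3.0],
    "C": [np.nan] * 4,
    "D": [0, 1, 5, 4],
})

# nlargest returns the rows with the largest D values; only their
# index labels are needed to build the flag.
top_idx = df["D"].nlargest(2).index
df["flag"] = df.index.isin(top_idx).astype(int)
print(df)
```

Note that nlargest keeps the first occurrences by default when values are tied, so with ties at the cut-off only some of the tied rows are flagged.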
IIUC
df['flag']=df.D.sort_values().tail(2).eq(df.D).astype(int)
df
A B C D flag
0 NaN 2.0 NaN 0 0
1 3.0 4.0 NaN 1 0
2 NaN NaN NaN 5 1
3 NaN 3.0 NaN 4 1

Create a new ID column based on conditions in other column using pandas

I am trying to make a new column 'ID' that gives a unique ID to each run of non-NaN values in the 'Data' column. If non-null values are directly adjacent to each other, the ID stays the same. I have shown below what my final Id column should look like, as a reference. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
Group by the cumsum of the null mask to get the consecutive groups, using where to mask the NaN rows. .ngroup then numbers the groups consecutively. This is also possible with rank.
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
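The rank variant from the commented line can be run on its own. This sketch rebuilds the Data column from the question; it does not depend on how ngroup labels rows whose group key is missing, because the NaN rows are filled with 0 explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Data": [np.nan, np.nan, np.nan, 54, 55, np.nan, np.nan,
             67, np.nan, np.nan, 33, 44, 22, np.nan]
})

# The running count of NaNs is constant within a run of non-null values,
# so it works as a key for each run; the NaN rows themselves are masked out.
s = df["Data"].isnull().cumsum().where(df["Data"].notnull())

# Dense rank turns the keys into 1, 2, 3, ...; masked rows become 0.
df["ID"] = s.rank(method="dense").fillna(0).astype(int)
print(df)
```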
Using factorize
v=pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0]+1
df.loc[df.Data.notnull(),'Newid']=v
df['Newid'] = df['Newid'].fillna(0)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0

Boxplot with pandas and groupby

I have the following dataset sample:
0 1
0 0 0.040158
1 2 0.500642
2 0 0.005694
3 1 0.065052
4 0 0.034789
5 2 0.128495
6 1 0.088816
7 1 0.056725
8 0 -0.000193
9 2 -0.070252
10 2 0.138282
11 2 0.054638
12 2 0.039994
13 2 0.060659
14 0 0.038562
And need a box and whisker plot, grouped by column 0. I have the following:
plt.figure()
grouped = df.groupby(0)
grouped.boxplot(column=1)
plt.savefig('plot.png')
But I end up with three subplots. How can I place all three on one plot?
Thanks.
In version 0.16.0 of pandas, you can simply do this:
df.boxplot(by='0')
I don't believe you need to use groupby.
df2 = df.pivot(columns=df.columns[0], index=df.index)
df2.columns = df2.columns.droplevel()
>>> df2
0 0 1 2
0 0.040158 NaN NaN
1 NaN NaN 0.500642
2 0.005694 NaN NaN
3 NaN 0.065052 NaN
4 0.034789 NaN NaN
5 NaN NaN 0.128495
6 NaN 0.088816 NaN
7 NaN 0.056725 NaN
8 -0.000193 NaN NaN
9 NaN NaN -0.070252
10 NaN NaN 0.138282
11 NaN NaN 0.054638
12 NaN NaN 0.039994
13 NaN NaN 0.060659
14 0.038562 NaN NaN
df2.boxplot()
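The reshaping step can be sketched on the sample data as below. The column names group and value are assumptions made for readability (the question uses 0 and 1 as column labels); with named columns, passing values= to pivot yields single-level columns, so no droplevel is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [0, 2, 0, 1, 0, 2, 1, 1, 0, 2, 2, 2, 2, 2, 0],
    "value": [0.040158, 0.500642, 0.005694, 0.065052, 0.034789,
              0.128495, 0.088816, 0.056725, -0.000193, -0.070252,
              0.138282, 0.054638, 0.039994, 0.060659, 0.038562],
})

# One column per group, keeping the original row index;
# cells that belong to another group are NaN.
wide = df.pivot(columns="group", values="value")
print(wide)

# wide.boxplot() would then draw one box per column on a single
# set of axes (requires matplotlib, so it is left commented out here).
```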