I have two dataframes with mostly different columns, but two of the columns are almost the same: frame and date.
df_1
id FRAME var_1 date
1 10 15 3/4/16
2 12 69 3/5/17
df_2
id frame var_2 date_time
1 11 15 3/2/16 08:14:32
2 12 69 3/5/17 09:12:29
Right now, I'm using pd.concat:
df_3 = pd.concat([df_1, df_2], axis=0, ignore_index=True)
df_3
id FRAME var_1 date frame var_2 date_time
1 10 15 3/4/16 NaN NaN NaT
2 12 69 3/5/17 NaN NaN NaT
3 NaN NaN NaT 11 15 3/2/16 08:14:32
4 NaN NaN NaT 12 69 3/5/17 09:12:29
What I would like to have is the FRAME and date/date_time columns merged
df_3
id FRAME var_1 var_2 date_time
1 10 15 NaN 3/4/16
2 12 69 NaN 3/5/17
3 11 NaN 15 3/2/16 08:14:32
4 12 NaN 69 3/5/17 09:12:29
Use pd.concat with rename:
df_3 = pd.concat([df_1,
                  df_2.rename(columns={'frame': 'FRAME', 'date_time': 'date'})],
                 ignore_index=True,
                 sort=True)
Output
FRAME date var_1 var_2
0 10 3/4/16 15.0 NaN
1 12 3/5/17 69.0 NaN
2 11 3/2/16 08:14:32 NaN 15.0
3 12 3/5/17 09:12:29 NaN 69.0
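For reference, a minimal self-contained sketch of the same approach, with the frames rebuilt from the question (the dates are kept as plain strings and id is treated as the index):
import pandas as pd

# rebuild the example frames from the question
df_1 = pd.DataFrame({'FRAME': [10, 12], 'var_1': [15, 69],
                     'date': ['3/4/16', '3/5/17']}, index=[1, 2])
df_2 = pd.DataFrame({'frame': [11, 12], 'var_2': [15, 69],
                     'date_time': ['3/2/16 08:14:32', '3/5/17 09:12:29']}, index=[1, 2])

# rename df_2's columns so the shared columns line up, then concatenate
df_3 = pd.concat([df_1,
                  df_2.rename(columns={'frame': 'FRAME', 'date_time': 'date'})],
                 ignore_index=True,
                 sort=True)
print(df_3)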
Related
I'm currently using pandas to summarize my data. I have data listed like this (the real data has tens of thousands of entries).
A  B  Intensity  Area
3  4       20.2    55
3  4       20.7    23
3  4       30.2    17
3  4       51.8    80
5  6       79.6    46
5  6       11.9    77
5  7       56.7    19
5  7       23.4    23
I would like to group the columns (A & B) together and list all the intensity and area values without aggregating them (e.g. computing the mean, median, mode, etc.).
           A,B
Intensity  3,4   20.2  20.7  30.2  51.8
           5,6   79.6  11.9   NaN   NaN
           5,7   56.7  23.4   NaN   NaN
Area       3,4     55    23    17    80
           5,6     46    77   NaN   NaN
           5,7     19    23   NaN   NaN
Here is one way to do it:
# melt the wide layout to long, bringing Intensity and Area in as rows
df2 = df.melt(id_vars=['A', 'B'])
# concatenate A and B into a single column
df2['A,B'] = df2['A'].astype(str) + ',' + df2['B'].astype(str)
# drop A and B
df2.drop(columns=['A', 'B'], inplace=True)
# create a sequence number per group to serve as the columns of the result
df2['seq'] = df2.assign(seq=1).groupby(['variable', 'A,B'])['seq'].cumsum()
# pivot, then tidy up the result
df2 = (df2.pivot(index=['variable', 'A,B'], columns='seq', values='value')
          .reset_index()
          .rename_axis(columns=None)
          .rename(columns={'variable': ''}))
df2
A,B 1 2 3 4
0 Area 3,4 55.0 23.0 17.0 80.0
1 Area 5,6 46.0 77.0 NaN NaN
2 Area 5,7 19.0 23.0 NaN NaN
3 Intensity 3,4 20.2 20.7 30.2 51.8
4 Intensity 5,6 79.6 11.9 NaN NaN
5 Intensity 5,7 56.7 23.4 NaN NaN
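A side note on the seq column: the assign/cumsum line above just numbers the rows within each (variable, 'A,B') group. groupby.cumcount does the same thing a bit more directly, if you prefer (a sketch, equivalent to that line):
# per-group row counter; cumcount starts at 0, so add 1 to match seq above
df2['seq'] = df2.groupby(['variable', 'A,B']).cumcount() + 1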
You can use:
df['class'] = df['A'].astype(str) + ',' + df['B'].astype(str)

def convert_values(col_name):
    # collect each group's values as a list, then spread them across columns
    grouped = df[[col_name, 'class']].groupby('class').agg(list)
    dfx = pd.DataFrame(grouped[col_name].to_list(), index=grouped.index).reset_index()
    dfx.index = [col_name] * len(dfx)
    return dfx

df1 = convert_values('Intensity')
df2 = convert_values('Area')
final = pd.concat([df1, df2])
print(final)
'''
class 0 1 2 3
Intensity 3,4 20.2 20.7 30.2 51.8
Intensity 5,6 79.6 11.9 nan nan
Intensity 5,7 56.7 23.4 nan nan
Area 3,4 55 23 17.0 80.0
Area 5,6 46 77 nan nan
Area 5,7 19 23 nan nan
'''
Following on from expand year values to month in pandas:
I have:
pd.DataFrame({'comp':['a','b'], 'period':['20180331','20171231'],'value':[12,24]})
comp period value
0 a 20180331 12
1 b 20171231 24
and would like to expand this to cover 201701 through 201812 inclusive. The value should be spread over the 12 months preceding the period.
comp yyyymm value
a 201701 na
a 201702 na
...
a 201705 12
a 201706 12
...
a 201803 12
a 201804 na
b 201701 24
...
b 201712 24
b 201801 na
...
Use:
# create a monthly period range covering the full span
r = pd.period_range('2017-01', '2018-12', freq='m')
# convert the column to monthly periods
df['period'] = pd.to_datetime(df['period']).dt.to_period('m')
# create a MultiIndex with all combinations of comp and period
mux = pd.MultiIndex.from_product([df['comp'], r], names=('comp','period'))
# reindex to add the missing rows
df = df.set_index(['comp','period'])['value'].reindex(mux).reset_index()
# back-fill at most 11 missing values per group (the 11 months before each known period)
df['new'] = df.groupby('comp')['value'].bfill(limit=11)
print (df)
comp period value new
0 a 2017-01 NaN NaN
1 a 2017-02 NaN NaN
2 a 2017-03 NaN NaN
3 a 2017-04 NaN 12.0
4 a 2017-05 NaN 12.0
...
...
10 a 2017-11 NaN 12.0
11 a 2017-12 NaN 12.0
12 a 2018-01 NaN 12.0
13 a 2018-02 NaN 12.0
14 a 2018-03 12.0 12.0
15 a 2018-04 NaN NaN
16 a 2018-05 NaN NaN
17 a 2018-06 NaN NaN
18 a 2018-07 NaN NaN
19 a 2018-08 NaN NaN
20 a 2018-09 NaN NaN
21 a 2018-10 NaN NaN
22 a 2018-11 NaN NaN
23 a 2018-12 NaN NaN
24 b 2017-01 NaN 24.0
25 b 2017-02 NaN 24.0
26 b 2017-03 NaN 24.0
...
...
32 b 2017-09 NaN 24.0
33 b 2017-10 NaN 24.0
34 b 2017-11 NaN 24.0
35 b 2017-12 24.0 24.0
36 b 2018-01 NaN NaN
37 b 2018-02 NaN NaN
38 b 2018-03 NaN NaN
...
...
44 b 2018-09 NaN NaN
45 b 2018-10 NaN NaN
46 b 2018-11 NaN NaN
47 b 2018-12 NaN NaN
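If you also want the yyyymm text column shown in the question, the period column can be formatted afterwards (a small sketch, assuming df is the result printed above):
# format the monthly periods as yyyymm strings
df['yyyymm'] = df['period'].dt.strftime('%Y%m')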
See if this works:
# populate the full date range, formatted as year-month strings
dftime = pd.DataFrame(pd.date_range('20170101','20181231'), columns=['dt']).apply(lambda x: x.dt.strftime('%Y-%m'), axis=1)
# drop the duplicate months left over from the daily range above
dftime = dftime.assign(dt=dftime.dt.drop_duplicates().reset_index(drop=True)).dropna()
# add a matching year-month column to df for merging
df['dt'] = pd.to_datetime(df.period).apply(lambda x: x.strftime('%Y-%m'))
# populate the full range of months for each company
target = df.groupby('comp').apply(lambda x: dftime.merge(x[['comp','dt','value']], on='dt', how='left').fillna({'comp': x.comp.unique()[0]})).reset_index(drop=True)
This gives the desired output:
print(target)
dt comp value
0 2017-01 a NaN
1 2017-02 a NaN
2 2017-03 a NaN
3 2017-04 a NaN
4 2017-05 a NaN
5 2017-06 a NaN
6 2017-07 a NaN
and so on.
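As an aside, the two dftime lines can be shortened by building a monthly range directly; a sketch that should produce the same 24 month labels:
# month-start dates from 2017-01 to 2018-12, formatted as 'YYYY-MM' strings
dftime = pd.DataFrame({'dt': pd.date_range('2017-01', '2018-12', freq='MS').strftime('%Y-%m')})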
I am new to pandas.
I have a DataFrame as shown below, which contains mobile repair data.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I tried to find the duration for each ID since the previous status, and the mean duration for each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FM = Average F-M Duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DF as shown below
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create a new column with the differences via DataFrameGroupBy.diff and join it with the shifted Status values from DataFrameGroupBy.shift. Remove rows with missing values in the S column. Then reshape with DataFrame.unstack, using GroupBy.cumcount for a counter column, create the means per pair of S values with DataFrame.pivot_table, and finally use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
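The column names come out as S_1/D_1 rather than the S1/S1_Dur layout in the question; if the exact names matter, a purely cosmetic rename sketch (the Avg_ columns keep the M-F/F-M spelling):
# map S_1 -> S1 and D_1 -> S1_Dur, etc.; rename silently skips missing keys
rename_map = {}
for i in range(1, 5):
    rename_map['S_%d' % i] = 'S%d' % i
    rename_map['D_%d' % i] = 'S%d_Dur' % i
df3 = df3.rename(columns=rename_map)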
I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
I then created a second dataframe, df2, that I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x+1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 through Experiment 12 in df2, and the first column is filled with the right values, but all the following columns are filled with NaN.
  Experiment 1  Experiment 2  Experiment 3  Experiment 4  Experiment 5  Experiment 6  Experiment 7  Experiment 8  Experiment 9  Experiment 10  Experiment 11  Experiment 12
0       2430.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
1       2870.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
2       2890.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: I had to use .values. The final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output. The reason is that pandas aligns an assigned Series/DataFrame on its index, and each slice df1.iloc[3*x:3*x+3] keeps df1's original index (3..5, 6..8, ...), which no longer matches df2's index of 0..2 after the first assignment; .values drops the index so the values are assigned by position.
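A minimal sketch of that alignment behaviour, with made-up values standing in for df1:
import pandas as pd

# hypothetical stand-in for df1, with six SQ values
df1 = pd.DataFrame({'SQ': [2430.0, 2870.0, 2890.0, 3270.0, 3350.0, 3520.0]})
df2 = pd.DataFrame()

# without .values: the second slice keeps df1's index (3, 4, 5), which does not
# align with df2's index (0, 1, 2) from the first column, so pandas fills NaN
df2['Experiment 1'] = df1['SQ'].iloc[0:3]
df2['Experiment 2'] = df1['SQ'].iloc[3:6]
print(df2)
#    Experiment 1  Experiment 2
# 0        2430.0           NaN
# 1        2870.0           NaN
# 2        2890.0           NaN

# with .values: the index is stripped and the values are assigned by position
df2['Experiment 2'] = df1['SQ'].iloc[3:6].values
print(df2)
#    Experiment 1  Experiment 2
# 0        2430.0        3270.0
# 1        2870.0        3350.0
# 2        2890.0        3520.0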
I'm working on a multiindex pivot table with 470 columns and more than 7000 rows:
pivot.head()
Componentnr 1 2 3 4 5 6 7 8
Genename
A2M Mediancoverage 84.5 281 156 131 11 81.5 251 101
inhouse3-5 NaN NaN NaN NaN 1 NaN NaN NaN
A2ML1 Mediancoverage 146 156 4.75 124.5 208.5 111.5 136.5 164.5
inhouse3-5 5 0 0 0 0 0 0 0
A4GALT Mediancoverage 165 NaN NaN NaN NaN NaN NaN NaN
I want to select those level-0 index names that have 'Mediancoverage'<20 and 'inhouse3-5'>0 (excl. NaN) for the same componentnr (=columns).
Thus, for the example above, the result should be 'A2M' because of column 5.
Until now, I managed to select rows in which any value meets a criterion:
df.apply(lambda row: any(i < 20 for i in row), axis=1)
df.apply(lambda row: any(i >= 1 for i in row), axis=1)
but this way I don't know whether both criteria were met in the same column.
Does anyone have an idea?
I slightly modified/reformatted your data frame example to make it easier to work with:
1 2 3 4 5 6 7
Genename Filter
A2M Mediancoverage 84.5 281.0 156.00 131.0 11.0 81.5 251.0
inhouse3-5 NaN NaN NaN NaN 1.0 NaN NaN
A2ML1 Mediancoverage 146.0 156.0 4.75 124.5 208.5 111.5 136.5
inhouse3-5 5.0 0.0 0.00 0.0 0.0 0.0 0.0
You can use groupby/apply here to filter your relevant genes:
def filter_gene(sub_df):
    # drop multiindex for easier selection
    sub_df = sub_df.reset_index(0, drop=True)
    # define your filters
    bool_coverage = sub_df.loc["Mediancoverage", :] < 20
    bool_inhouse = sub_df.loc["inhouse3-5", :] > 0
    # if any column fulfills both filter requirements, return True
    if (bool_coverage & bool_inhouse).any():
        return True
# group by gene, apply the filter, and remove non-hits
result = df.groupby(level="Genename").apply(filter_gene).dropna().index
print(result)
Index(['A2M'], dtype='object', name='Genename')
The idea is simple. You iterate over your genes, apply the filter, and keep those which fulfill your filter requirements.
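With 470 columns and more than 7000 rows, a vectorized variant avoids the Python-level loop; a sketch, assuming the index levels are named Genename and Filter as in the reformatted example and every gene has both rows:
# take the two filter rows as gene-by-component frames; NaN comparisons are False
low_cov = df.xs('Mediancoverage', level='Filter') < 20
inhouse = df.xs('inhouse3-5', level='Filter') > 0
# keep genes where at least one component meets both conditions
hits = (low_cov & inhouse).any(axis=1)
result = hits[hits].index
print(result)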