Can I go from 15 object variables to one final binary target variable?
The 15 variables have ~10,000 different codes, and my dataset is about 21,000,000 records. What I'm trying to do is first replace the codes I want with 1 and all other codes with 0; then, if any of the fifteen variables is 1, the target variable will be 1, and if all fifteen variables are 0, the target variable will be 0.
I have tried to_replace, astype, to_numeric, and infer_objects with no good results. For example, my dataset looks like this (head(5)):
D P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15
41234 1234 4367 874 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
42345 7657 4367 874 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
34212 7654 4347 474 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
34212 8902 4317 374 NAN 452 NAN 719 NAN NAN NAN NAN NAN NAN NAN NAN
19374 2564 4387 274 NAN 452 NAN 799 NAN NAN NAN NAN NAN NAN NAN NAN
I want to transform all NaN values to 0 and the selected codes to 1, so that P1-P15 all become binary; then I will create a final P variable from them.
For example, if P1-P15 contain '3578', '9732', '4734', ... (I'm using about 200 codes), I want the value to become 1.
All other values should become 0.
The D variable should stay as it is.
The final dataset will be (D, P); then I will add the training variables.
Any ideas? The following code gives me wrong results.
selCodes=['3722','66']
dfnew['P']=(dfnew.loc[:,'PR1':].astype(str).isin(selCodes).any(axis=1).astype(int))
Take a look at a test dataset (left) and the new P (right). With the example code 3722, P should be 1.
IIUC, use DataFrame.isin:
# example select codes
selCodes = ['1234', '9732', '719']
df['P'] = (
df.loc[:, 'P1':].astype(str)
.isin(selCodes).any(axis=1).astype(int)
)
df = df[['D', 'P']]
Result:
D P
0 41234 1
1 42345 0
2 34212 0
3 34212 1
4 19374 0
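As an aside, two things are worth double-checking in the original attempt. First, the sample frame's columns are P1..P15, while the failing snippet slices from 'PR1', so make sure the slice label matches the real column names. Second, a common reason isin returns all zeros (a guess, since the real data isn't shown): if the P columns were read as floats, astype(str) produces strings like '3722.0', which never match '3722'. Note also that astype(str) turns NaN into the string 'nan', which isn't in selCodes, so missing values correctly end up as 0. A minimal sketch of the float pitfall:
import numpy as np
import pandas as pd

s = pd.Series([3722.0, np.nan])                 # float dtype, as read_csv often infers
print(s.astype(str).isin(['3722']).tolist())    # [False, False] -- '3722.0' != '3722'

# One possible workaround: strip a trailing '.0' before matching.
cleaned = s.astype(str).str.replace(r'\.0$', '', regex=True)
print(cleaned.isin(['3722']).tolist())          # [True, False]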
I have two dataframes of the same size (510x6)
preds
0 1 2 3 4 5
0 2.610270 -4.083780 3.381037 4.174977 2.743785 -0.766932
1 0.049673 0.731330 1.656028 -0.427514 -0.803391 -0.656469
2 -3.579314 3.347611 2.891815 -1.772502 1.505312 -1.852362
3 -0.558046 -1.290783 2.351023 4.669028 3.096437 0.383327
4 -3.215028 0.616974 5.917364 5.275736 7.201042 -0.735897
... ... ... ... ... ... ...
505 -2.178958 3.918007 8.247562 -0.523363 2.936684 -3.153375
506 0.736896 -1.571704 0.831026 2.673974 2.259796 -0.815212
507 -2.687474 -1.268576 -0.603680 5.571290 -3.516223 0.752697
508 0.182165 0.904990 4.690155 6.320494 -2.326415 2.241589
509 -1.675801 -1.602143 7.066843 2.881135 -5.278826 1.831972
510 rows × 6 columns
outputStats
0 1 2 3 4 5
0 2.610270 -4.083780 3.381037 4.174977 2.743785 -0.766932
1 0.049673 0.731330 1.656028 -0.427514 -0.803391 -0.656469
2 -3.579314 3.347611 2.891815 -1.772502 1.505312 -1.852362
3 -0.558046 -1.290783 2.351023 4.669028 3.096437 0.383327
4 -3.215028 0.616974 5.917364 5.275736 7.201042 -0.735897
... ... ... ... ... ... ...
505 -2.178958 3.918007 8.247562 -0.523363 2.936684 -3.153375
506 0.736896 -1.571704 0.831026 2.673974 2.259796 -0.815212
507 -2.687474 -1.268576 -0.603680 5.571290 -3.516223 0.752697
508 0.182165 0.904990 4.690155 6.320494 -2.326415 2.241589
509 -1.675801 -1.602143 7.066843 2.881135 -5.278826 1.831972
510 rows × 6 columns
when I execute:
preds - outputStats
I expect a 510 x 6 dataframe with elementwise subtraction. Instead I get this:
0 1 2 3 4 5 0 1 2 3 4 5
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
505 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
506 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
507 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
508 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
509 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've tried dropping columns and the like, and that hasn't helped. I also get the same result with preds.subtract(outputStats). Any ideas?
There are many ways that two different values can appear the same when displayed. One of the main ones is when they are different types but with corresponding values. For instance, depending on how you're displaying them, the int 1 and the str '1' may not be easily distinguished. You can also have whitespace differences, such as '1' versus ' 1'.
If the problem is that one set of column labels is int while the other is str, you can solve it by converting them all to int or all to str. To do the former: df.columns = [int(col) for col in df.columns]. To do the latter: df.columns = [str(col) for col in df.columns]. Converting to str is somewhat safer, since converting to int raises an error if the string isn't amenable to conversion (e.g. int('y') raises an error), but int labels can be more useful because they preserve the numerical structure.
You asked in a comment about dropping columns. You can do this with drop, passing axis=1 to tell it to drop columns rather than rows, or you can use the del keyword. But fixing the column labels should remove the need to drop columns.
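A minimal reproduction of the symptom and the fix (a sketch; it assumes the labels are int on one side and str on the other, which matches the doubled 0 1 2 3 4 5 0 1 2 3 4 5 header above):
import numpy as np
import pandas as pd

preds = pd.DataFrame(np.ones((3, 2)), columns=[0, 1])             # int labels
outputStats = pd.DataFrame(np.ones((3, 2)), columns=['0', '1'])   # str labels

# Arithmetic aligns on labels; int 0 and str '0' don't match, so every
# column pairs with nothing and the result is a 3 x 4 all-NaN frame:
print(preds - outputStats)

# Normalising the labels restores elementwise subtraction:
outputStats.columns = [int(col) for col in outputStats.columns]
print(preds - outputStats)   # 3 x 2 frame of zeros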
I have a dataframe that I first group, counting QuoteLine items grouped by stock (1 = true, 0 = false) and mfg typecode (K = Kit, M = Manufactured, P = Purchased). Ultimately, I am interested in quotes where ALL items are either NonStock/Kit and/or Stock/['M','P']:
grouped = df.groupby(['QuoteNum', 'typecode', 'stock']).agg({"QuoteLine": "count"})
and I get this:
QuoteLine-count
QuoteNum typecode stock
10001 K 0 1
10003 M 0 1
10005 M 0 3
1 1
10006 M 1 1
... ... ... ...
26961 P 1 1
26962 P 1 1
26963 P 1 2
26964 K 0 1
M 1 2
If I unstack it twice:
grouped = df.groupby(['QuoteNum', 'typecode', 'stock']).agg({"QuoteLine": "count"}).unstack().unstack()
# I get
QuoteLine-count
stock 0 1
typecode K M P K M P
QuoteNum
10001 1.0 NaN NaN NaN NaN NaN
10003 NaN 1.0 NaN NaN NaN NaN
10005 NaN 3.0 NaN NaN 1.0 NaN
10006 NaN NaN NaN NaN 1.0 NaN
10007 2.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
26959 NaN NaN NaN NaN NaN 1.0
26961 NaN 1.0 NaN NaN NaN 1.0
26962 NaN NaN NaN NaN NaN 1.0
26963 NaN NaN NaN NaN NaN 2.0
26964 1.0 NaN NaN NaN 2.0 NaN
Now I need to filter out all records matching the following (this is where I need help):
# pseudo-code
(stock == 0 and typecode in ['M','P']) -> values are NOT NaN (don't want those)
and
(stock == 1 and typecode == 'K') -> values are NOT NaN (don't want those either)
so I'm left with these records:
Basically: columns "0/M", "0/P" and "1/K" must be all NaN, and the other columns must have at least one non-NaN value.
QuoteLine-count
stock 0 1
typecode K M P K M P
QuoteNum
10001 1.0 NaN NaN NaN NaN NaN
10006 NaN NaN NaN NaN 1.0 NaN
10007 2.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
26959 NaN NaN NaN NaN NaN 1.0
26962 NaN NaN NaN NaN NaN 1.0
26963 NaN NaN NaN NaN NaN 2.0
26964 1.0 NaN NaN NaN 2.0 NaN
IIUC, use a boolean mask to set the rows matching your desired combinations to NaN, then unstack the desired levels; the quotes you want are exactly those whose unstacked row is entirely NaN (meaning every item matched a desired combination):
# Shortcut (for readability)
lvl_vals = grouped.index.get_level_values
m1 = (lvl_vals('typecode') == 'K') & (lvl_vals('stock') == 0)
m2 = (lvl_vals('typecode').isin(['M', 'P'])) & (lvl_vals('stock') == 1)
grouped[m1|m2] = np.nan
out = grouped.unstack(level=['stock', 'typecode']) \
.loc[lambda x: x.isna().all(axis=1)]
Output:
>>> out
QuoteLine-count
stock 0 1
typecode K M M P
QuoteNum
10001 NaN NaN NaN NaN
10006 NaN NaN NaN NaN
26961 NaN NaN NaN NaN
26962 NaN NaN NaN NaN
26963 NaN NaN NaN NaN
26964 NaN NaN NaN NaN
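To go from this all-NaN view back to usable rows, a hedged follow-up (a sketch, assuming df is the original line-item frame): the surviving index holds the qualifying QuoteNums, so you can re-select them from the original data.
# Sketch: recover the qualifying quotes from the original frame.
keep = out.index                        # QuoteNums whose items all matched
result = df[df['QuoteNum'].isin(keep)]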
The relevant rows could be obtained with as_index=False, though I am not sure whether this is the desired format; the filter below selects the offending stock/typecode combinations:
grouped = df.groupby(['QuoteNum', 'typecode', 'stock'], as_index=False).agg({"QuoteLine": "count"})
grouped[((grouped["stock"]==0) & (grouped["typecode"].isin(["M" ,"P"]))) | ((grouped["stock"]==1) & (grouped["typecode"].isin(["K"])))]
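If the goal is instead to drop every quote that contains at least one offending item, a hedged sketch building on the same mask (reusing the flat grouped frame from the snippet above):
# Sketch: exclude quotes containing any offending stock/typecode combination.
bad = (((grouped["stock"] == 0) & (grouped["typecode"].isin(["M", "P"]))) |
       ((grouped["stock"] == 1) & (grouped["typecode"] == "K")))
bad_quotes = grouped.loc[bad, "QuoteNum"].unique()
result = grouped[~grouped["QuoteNum"].isin(bad_quotes)]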
I'm struggling with this problem and I'm not sure if I'm approaching it correctly.
I have this dataset:
ticker date filing_date_x currency_symbol_x researchdevelopment effectofaccountingcharges incomebeforetax minorityinterest netincome sellinggeneraladministrative grossprofit ebit nonoperatingincomenetother operatingincome otheroperatingexpenses interestexpense taxprovision interestincome netinterestincome extraordinaryitems nonrecurring otheritems incometaxexpense totalrevenue totaloperatingexpenses costofrevenue totalotherincomeexpensenet discontinuedoperations netincomefromcontinuingops netincomeapplicabletocommonshares preferredstockandotheradjustments filing_date_y currency_symbol_y totalassets intangibleassets earningassets othercurrentassets totalliab totalstockholderequity deferredlongtermliab ... totalcurrentliabilities shorttermdebt shortlongtermdebt shortlongtermdebttotal otherstockholderequity propertyplantequipment totalcurrentassets longterminvestments nettangibleassets shortterminvestments netreceivables longtermdebt inventory accountspayable totalpermanentequity noncontrollinginterestinconsolidatedentity temporaryequityredeemablenoncontrollinginterests accumulatedothercomprehensiveincome additionalpaidincapital commonstocktotalequity preferredstocktotalequity retainedearningstotalequity treasurystock accumulatedamortization noncurrrentassetsother deferredlongtermassetcharges noncurrentassetstotal capitalleaseobligations longtermdebttotal noncurrentliabilitiesother noncurrentliabilitiestotal negativegoodwill warrants preferredstockredeemable capitalsurpluse liabilitiesandstockholdersequity cashandshortterminvestments propertyplantandequipmentgross accumulateddepreciation commonstocksharesoutstanding
116638 JNJ.US 2019-12-31 2020-02-18 USD 3.232000e+09 NaN 4.218000e+09 NaN 4.010000e+09 6.039000e+09 1.363200e+10 6.119000e+09 6.500000e+07 4.238000e+09 NaN 85000000.0 208000000.0 81000000.0 -4000000.0 NaN 104000000.0 NaN 208000000.0 2.074700e+10 9.414000e+09 7.115000e+09 -1.200000e+08 NaN 4.010000e+09 4.010000e+09 NaN 2020-02-18 USD 1.577280e+11 4.764300e+10 NaN 2.486000e+09 9.825700e+10 5.947100e+10 5.958000e+09 ... 3.596400e+10 1.202000e+09 1.202000e+09 NaN -1.589100e+10 1.765800e+10 4.527400e+10 1.149000e+09 -2.181100e+10 1.982000e+09 1.448100e+10 2.649400e+10 9.020000e+09 3.476200e+10 NaN NaN NaN NaN NaN 3.120000e+09 NaN 1.106590e+11 -3.841700e+10 NaN 5.695000e+09 7.819000e+09 1.124540e+11 NaN 2.649400e+10 2.984100e+10 6.229300e+10 NaN NaN NaN NaN 1.577280e+11 1.928700e+10 NaN NaN 2.632507e+09
116569 JNJ.US 2020-03-31 2020-04-29 USD 2.580000e+09 NaN 6.509000e+09 NaN 5.796000e+09 5.203000e+09 1.364400e+10 8.581000e+09 7.460000e+08 5.788000e+09 NaN 25000000.0 713000000.0 67000000.0 42000000.0 300000000.0 58000000.0 NaN 713000000.0 2.069100e+10 7.135000e+09 7.047000e+09 6.210000e+08 NaN 5.796000e+09 5.796000e+09 NaN 2020-04-29 USD 1.550170e+11 4.733800e+10 NaN 2.460000e+09 9.372300e+10 6.129400e+10 5.766000e+09 ... 3.368900e+10 2.190000e+09 2.190000e+09 NaN -1.624300e+10 1.740100e+10 4.422600e+10 NaN -1.951500e+10 2.494000e+09 1.487400e+10 2.539300e+10 8.868000e+09 3.149900e+10 NaN NaN NaN NaN NaN 3.120000e+09 NaN 1.129010e+11 -3.848400e+10 NaN 5.042000e+09 NaN 7.539000e+09 NaN 2.539300e+10 2.887500e+10 6.003400e+10 NaN NaN NaN NaN 1.550170e+11 1.802400e+10 4.324700e+10 -2.584600e+10 2.632392e+09
116420 JNJ.US 2020-06-30 2020-07-24 USD 2.707000e+09 NaN 3.940000e+09 NaN 3.626000e+09 4.993000e+09 1.177900e+10 5.711000e+09 -5.000000e+06 3.990000e+09 NaN 45000000.0 314000000.0 19000000.0 -26000000.0 NaN 67000000.0 NaN 314000000.0 1.833600e+10 7.839000e+09 6.557000e+09 -8.500000e+07 NaN 3.626000e+09 3.626000e+09 NaN 2020-07-24 USD 1.583800e+11 4.741300e+10 NaN 2.688000e+09 9.540200e+10 6.297800e+10 5.532000e+09 ... 3.677200e+10 5.332000e+09 5.332000e+09 NaN -1.553300e+10 1.759800e+10 4.589200e+10 NaN -1.832500e+10 7.961000e+09 1.464500e+10 2.506200e+10 9.424000e+09 3.144000e+10 NaN NaN NaN NaN NaN 3.120000e+09 NaN 1.138980e+11 -3.850700e+10 NaN 5.782000e+09 NaN 7.805000e+09 NaN 2.506200e+10 2.803600e+10 5.863000e+10 NaN NaN NaN NaN 1.583800e+11 1.913500e+10 4.405600e+10 -2.645800e+10 2.632377e+09
116235 JNJ.US 2020-09-30 2020-10-23 USD 2.840000e+09 NaN 4.401000e+09 NaN 3.554000e+09 5.431000e+09 1.411000e+10 4.445000e+09 -1.188000e+09 5.633000e+09 NaN 44000000.0 847000000.0 12000000.0 -32000000.0 NaN 206000000.0 NaN 847000000.0 2.108200e+10 8.477000e+09 6.972000e+09 -1.268000e+09 NaN 3.554000e+09 3.554000e+09 NaN 2020-10-23 USD 1.706930e+11 4.700600e+10 NaN 2.619000e+09 1.062200e+11 6.447300e+10 5.615000e+09 ... 3.884700e+10 5.078000e+09 5.078000e+09 NaN -1.493800e+10 1.785500e+10 5.757800e+10 NaN -1.684000e+10 1.181600e+10 1.457900e+10 3.268000e+10 9.599000e+09 3.376900e+10 NaN NaN NaN NaN NaN 3.120000e+09 NaN 1.148310e+11 -3.854000e+10 NaN 6.131000e+09 NaN 7.816000e+09 NaN 3.268000e+10 2.907800e+10 6.737300e+10 NaN NaN NaN NaN 1.706930e+11 3.078100e+10 4.516200e+10 -2.730700e+10 2.632167e+09
116135 JNJ.US 2020-12-31 2021-02-22 USD 4.032000e+09 NaN 1.647000e+09 NaN 1.738000e+09 6.457000e+09 1.466100e+10 1.734000e+09 -2.341000e+09 4.075000e+09 NaN 87000000.0 -91000000.0 13000000.0 -74000000.0 NaN 97000000.0 NaN -91000000.0 2.247500e+10 1.058600e+10 7.814000e+09 -2.414000e+09 NaN 1.738000e+09 1.738000e+09 NaN 2021-02-22 USD 1.748940e+11 5.340200e+10 NaN 3.132000e+09 1.116160e+11 6.327800e+10 7.214000e+09 ... 4.249300e+10 2.631000e+09 2.631000e+09 NaN -1.524200e+10 1.876600e+10 5.123700e+10 NaN -2.651700e+10 1.120000e+10 1.357600e+10 3.263500e+10 9.344000e+09 3.986200e+10 NaN NaN NaN NaN NaN 3.120000e+09 NaN 1.138900e+11 -3.849000e+10 NaN 6.562000e+09 NaN 8.534000e+09 NaN 3.263500e+10 2.927400e+10 6.912300e+10 NaN NaN NaN NaN 1.748940e+11 2.518500e+10 NaN NaN 2.632512e+09
Then I have this dataframe of (daily) prices:
ticker date open high low close adjusted_close volume
0 JNJ.US 2021-08-02 172.470 172.840 171.300 172.270 172.2700 3620659
1 JNJ.US 2021-07-30 172.540 172.980 171.840 172.200 172.2000 5346400
2 JNJ.US 2021-07-29 172.740 173.340 171.090 172.180 172.1800 4214100
3 JNJ.US 2021-07-28 172.730 173.380 172.080 172.180 172.1800 5750700
4 JNJ.US 2021-07-27 171.800 172.720 170.670 172.660 172.6600 7089300
I have daily data in the price dataframe but quarterly data in the first dataframe. I want to merge them so that all the prices between, for example, Jan-01-2020 and Mar-31-2020 are matched with the correct quarterly row.
I'm not sure exactly how to do this. I thought of extracting the date to month-year, but I still don't know how to merge based on a range of values.
Any suggestions would be welcome; if I'm not clear, please let me know and I can clarify.
If I understand correctly, you could create common year and quarter columns in each DataFrame and merge on those columns. I did a left merge, assuming you only want to keep rows from the left (daily) dataset.
If this is not what you are looking for, could you please clarify with a sample input/output?
# importing pandas as pd
import pandas as pd
# Creating dummy data of daily values
dt = pd.Series(['2020-08-02', '2020-07-30', '2020-07-29',
'2020-07-28', '2020-07-27'])
# Convert the underlying data to datetime
dt = pd.to_datetime(dt)
dt_df = pd.DataFrame(dt, columns=['date'])
dt_df['quarter_1'] = dt_df['date'].dt.quarter
dt_df['year_1'] = dt_df['date'].dt.year
print(dt_df)
date quarter_1 year_1
0 2020-08-02 3 2020
1 2020-07-30 3 2020
2 2020-07-29 3 2020
3 2020-07-28 3 2020
4 2020-07-27 3 2020
# Creating dummy data of quarterly values
dt2 = pd.Series(['2019-12-31', '2020-03-31', '2020-06-30',
'2020-09-30', '2020-12-31'])
# Convert the underlying data to datetime
dt2 = pd.to_datetime(dt2)
dt2_df = pd.DataFrame(dt2, columns=['date2'])
dt2_df['quarter_2'] = dt2_df['date2'].dt.quarter
dt2_df['year_2'] = dt2_df['date2'].dt.year
print(dt2_df)
date2 quarter_2 year_2
0 2019-12-31 4 2019
1 2020-03-31 1 2020
2 2020-06-30 2 2020
3 2020-09-30 3 2020
4 2020-12-31 4 2020
Then you can merge however you want.
dt_df.merge(dt2_df, how='left', left_on=['quarter_1', 'year_1'], right_on=['quarter_2', 'year_2'] , validate="many_to_many")
OUTPUT:
date quarter_1 year_1 date2 quarter_2 year_2
0 2020-08-02 3 2020 2020-09-30 3 2020
1 2020-07-30 3 2020 2020-09-30 3 2020
2 2020-07-29 3 2020 2020-09-30 3 2020
3 2020-07-28 3 2020 2020-09-30 3 2020
4 2020-07-27 3 2020 2020-09-30 3 2020
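Alternatively, pandas ships pd.merge_asof for exactly this kind of range alignment. A sketch under the assumption that each daily row should pick up the most recent quarterly row at or before its date (column names borrowed from the question, tiny dummy values):
import pandas as pd

prices = pd.DataFrame({
    'ticker': ['JNJ.US'] * 3,
    'date': pd.to_datetime(['2020-07-27', '2020-07-28', '2020-07-29']),
    'close': [172.66, 172.18, 172.18],
})
fundamentals = pd.DataFrame({
    'ticker': ['JNJ.US'] * 2,
    'date': pd.to_datetime(['2020-03-31', '2020-06-30']),
    'netincome': [5.796e9, 3.626e9],
})

# Both frames must be sorted by the merge key.
merged = pd.merge_asof(prices.sort_values('date'),
                       fundamentals.sort_values('date'),
                       on='date', by='ticker')
print(merged)   # each daily row matched to the latest quarter at or before it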
I'm reading a simple CSV file and creating a pandas dataframe. The CSV file can have 1 row, 2 rows, or 10 rows.
If the CSV has 1 row I want to create a few new columns; if it has <=2 rows, a couple of new columns; and if it has 10 rows, 10 new columns.
After reading the csv, my sample dataframe looks like below.
df=pd.read_csv('/home/abc/myfile.csv',sep=',')
print(df)
id rate amount address lb ub msa
1 2.50 100 abcde 30 90 101
10 20 102
103
104
105
106
107
108
109
110
Case 1) If the dataframe has only 1 record, I want to create new columns 'new_id', 'new_rate' & 'new_address' and assign them the values from the 'id', 'rate' and 'address' columns of the dataframe.
Expected Output:
id rate amount address lb ub msa new_id new_rate new_address
1 2.50 100 abcde 30 90 101 1 2.50 abcde
Case 2) If the dataframe has <=2 records, I want to create, for the 1st record, 'lb_1' & 'ub_1' with values 30 and 90, and for the 2nd record, 'lb_2' & 'ub_2' with values 10 & 20 from the dataframe.
Expected Output:
if there is only 1 row:
id rate amount address lb ub msa lb_1 ub_1
1 2.50 100 abcde 30 90 101 30 90
if there are 2 rows:
id rate amount address lb ub msa lb_1 ub_1 lb_2 ub_2
1 2.50 100 abcde 30 90 101 30 90 10 20
10 20 102
Case 3) If the dataframe has 10 records, I want to create 10 new columns, i.e. msa_1, msa_2, ..., msa_10, and assign the respective values msa_1=101, msa_2=102, ..., msa_10=110 for each row of the dataframe.
Expected Output:
id rate amount address lb ub msa msa_1 msa_2 msa_3 msa_4 msa_5 msa_6 msa_7 msa_8 msa_9 msa_10
1 2.50 100 abcde 30 90 101 101 102 103 104 105 106 107 108 109 110
10 20 102
103
104
105
106
107
108
109
110
I'm trying to write the code as below, but for the 2nd and 3rd cases I'm not sure how to do it; also, if there is a better way to handle all 3 cases, that would be great.
I'd appreciate it if anyone can show me the best way to get it done. Thanks in advance.
Case 1:
if df.shape[0] == 1:
    df["new_id"] = df["id"]
    df["new_rate"] = df["rate"]
    df["new_address"] = df["address"]
Case 2:
if df.shape[0] <= 2:
    for i in range(1, len(df.index) + 1):
        # pseudo-code: this copies the whole column, not the i-th value
        df[f"lb_{i}"] = df["lb"]
        df[f"ub_{i}"] = df["ub"]
Case 3:
if df.shape[0] <= 10:
    for i in range(1, len(df.index) + 1):
        # pseudo-code: same problem as case 2
        df[f"msa_{i}"] = df["msa"]
For case 2 and case 3, you can do something like this.
Case 2:
# case 2
df = pd.read_csv('test.txt')
lb_dict = {f'lb_{i}': value for i, value in enumerate(df['lb'].to_list(), start=1)}
lb_df = pd.DataFrame.from_dict(lb_dict, orient='index').transpose()
ub_dict = {f'ub_{i}': value for i, value in enumerate(df['ub'].to_list(), start=1)}
ub_df = pd.DataFrame.from_dict(ub_dict, orient='index').transpose()
final_df = pd.concat([df, lb_df, ub_df], axis=1)
print(final_df)
Output:
   id  rate  amount address  lb  ub  msa  lb_1  lb_2  ub_1  ub_2
0  1.0   2.5   100.0   abcde  30  90  101  30.0  10.0  90.0  20.0
1  NaN   NaN     NaN     NaN  10  20  102   NaN   NaN   NaN   NaN
For case 3:
# case 3
df = pd.read_csv('test.txt')
msa_dict = {f'msa_{i}': value for i, value in enumerate(df['msa'].to_list(), start=1)}
msa_df = pd.DataFrame.from_dict(msa_dict, orient='index').transpose()
pd.concat([df, msa_df], axis=1)
Output:
   id  rate  amount address    lb    ub  msa  msa_1  msa_2  msa_3  msa_4  msa_5  msa_6  msa_7  msa_8  msa_9  msa_10
0  1.0   2.5   100.0   abcde  30.0  90.0  101  101.0  102.0  103.0  104.0  105.0  106.0  107.0  108.0  109.0   110.0
1  NaN   NaN     NaN     NaN  10.0  20.0  102    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
2  NaN   NaN     NaN     NaN   NaN   NaN  103    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
3  NaN   NaN     NaN     NaN   NaN   NaN  104    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
4  NaN   NaN     NaN     NaN   NaN   NaN  105    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
5  NaN   NaN     NaN     NaN   NaN   NaN  106    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
6  NaN   NaN     NaN     NaN   NaN   NaN  107    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
7  NaN   NaN     NaN     NaN   NaN   NaN  108    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
8  NaN   NaN     NaN     NaN   NaN   NaN  109    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
9  NaN   NaN     NaN     NaN   NaN   NaN  110    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
Solution: I created a dictionary from the required column, turned it into a one-row DataFrame, and concatenated it with the original dataframe column-wise.
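For completeness, case 1 needs no loop at all; a minimal sketch (assuming the column names from the question):
# case 1 -- a single-row frame just gets three renamed copies of existing columns
if len(df) == 1:
    df = df.assign(new_id=df['id'], new_rate=df['rate'], new_address=df['address'])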
Looking for some help accessing the first empty df column that is also a duplicate name, by name.
Consider this dataframe
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C', 'C', 'C', 'C', 'D', 'E'], index=[0,1,2,3])
A B C C C C D E
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
then access a slice by indexer and column name
indexer = [1,3]
df.loc[indexer, 'C']
C C C C
1 NaN NaN NaN NaN
3 NaN NaN NaN NaN
I want to edit only the first instance of column C so that I get
A B C C C C D E
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 99 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN 99 NaN NaN NaN NaN NaN
I tried df.loc[indexer, 'C'].iloc[:, 0] = 99, but it did not set the values.
Thanks in advance for your replies and ideas.
IIUC, locate the first 'C' by integer position with argmax (on a boolean array, argmax returns the position of the first True) and assign through iloc:
indexer = [1, 3]
col = (df.columns == 'C').argmax()
df.iloc[indexer, col] = 99
df
A B C C C C D E
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 99 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN 99 NaN NaN NaN NaN NaN
I would use Index.get_loc to get the slice of integer locations for the 'C' columns, and pass its start to .iloc, as follows:
indexer = [1, 3]
df.iloc[indexer, df.columns.get_loc('C').start] = 99
Or using np.nonzero:
import numpy as np

c_loc = np.nonzero(df.columns == 'C')[0]
df.iloc[indexer, c_loc[0]] = 99
Output:
A B C C C C D E
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 99 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN 99 NaN NaN NaN NaN NaN
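One caveat worth knowing: Index.get_loc only returns a slice when the duplicate labels are consecutive, as they are here; for non-adjacent duplicates it returns a boolean mask instead, in which case the np.nonzero variant still works. A quick check (small made-up indexes):
import numpy as np
import pandas as pd

cols = pd.Index(['A', 'B', 'C', 'C', 'C', 'C', 'D', 'E'])
print(cols.get_loc('C'))                    # slice(2, 6, None) -- consecutive duplicates

scattered = pd.Index(['C', 'A', 'C'])
print(scattered.get_loc('C'))               # boolean mask [ True False  True]
print(np.nonzero(scattered == 'C')[0][0])   # 0 -- first position, works either way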