Duplicate column pandas DataFrame slice issue

I have a DataFrame df with duplicate columns. (I need a DataFrame with duplicate columns; it will be passed as a parameter to matplotlib to plot, so the columns' names and contents may be the same or different.)
>>> df
                        PE     RT    Ttl_mkv      PE
STK_ID    RPT_Date
11_STK79  20130115  41.932  2.744   3629.155  41.932
21_STK58  20130115  14.223  0.048  30302.324  14.223
22_STK229 20130115  22.436  0.350  15968.313  22.436
23_STK34  20130115 -63.252  0.663   4168.189 -63.252
I can get the second column with df[df.columns[1]]:
>>> df[df.columns[1]]
STK_ID RPT_Date
11_STK79 20130115 2.744
21_STK58 20130115 0.048
22_STK229 20130115 0.350
23_STK34 20130115 0.663
but if I try to get the first column with df[df.columns[0]], it gives:
>>> df[df.columns[0]]
                        PE      PE
STK_ID    RPT_Date
11_STK79  20130115  41.932  41.932
21_STK58  20130115  14.223  14.223
22_STK229 20130115  22.436  22.436
23_STK34  20130115 -63.252 -63.252
This result has two columns, which breaks my application: it only wants the first column, but pandas returns both the 1st and the 4th! Is this a bug, or is it designed this way on purpose? How can I work around it?
My pandas version is 0.8.1.

I don't really understand why you need two columns with the same name; avoiding that would probably be best.
But to answer your question, this returns only one of the 'PE' columns:
df.T.drop_duplicates().T.PE
STK_ID RPT_Date
11_STK79 20130115 41.932
21_STK58 20130115 14.223
22_STK229 20130115 22.436
23_STK34 20130115 -63.252
Name: PE
or:
df.T.ix[0].T
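Note that .ix was removed in later pandas versions; on any modern pandas, position-based selection sidesteps the duplicate-label lookup entirely. A minimal sketch:
first_col = df.iloc[:, 0]  # selects by position, so the duplicate 'PE' label is never consulted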

Related

Duplicated rows when merging in pandas

I have a list that contains multiple pandas DataFrames.
Each DataFrame has a 'Trading Day' column plus one or more maturity columns.
The names of the maturity columns change depending on the maturity; for example, the first DataFrame's columns are 'Trading Day', 'Y_2021', 'Y_2022', while the second has 'Trading Day', 'Y_2022', 'Y_2023', 'Y_2024'.
The 'Trading Day' column holds all-unique np.datetime64 dates in every DataFrame, and the maturity columns hold either floats or NaNs.
My goal is to merge all the DataFrames into one and have something like:
'Trading Day', 'Y_2021', 'Y_2022', 'Y_2023', ..., 'Y_2030'
In my code, gh is the list that contains all the DataFrames, original is a DataFrame that contains all the dates from 5 years ago through today, and gt is the final DataFrame.
So far what I have done is:
import numpy as np
import pandas as pd
from datetime import date

year_now = date.today().year  # assumed definition; the original snippet uses year_now without defining it

original = pd.DataFrame()
original['Trading Day'] = np.arange(np.datetime64(str(year_now - 5) + '-01-01'),
                                    np.datetime64(date.today()) + 1)
for i in range(len(gh)):
    gh[i]['Trading Day'] = gh[i]['Trading Day'].astype('datetime64[ns]')
gt = pd.merge(original, gh[0], on='Trading Day', how='left')
for i in range(1, len(gh)):
    gt = pd.merge(gt, gh[i], how='outer')
The code more or less works; the problem is that when the years change between DataFrames, I get results like this:
            Y_2021  Y_2023  Y_2024
2020-06-05    45.0     NaN     NaN
2020-06-05     NaN    54.0     NaN
2020-06-05     NaN     NaN    43.0
2020-06-06    34.0     NaN     NaN
2020-06-06     NaN    23.0     NaN
2020-06-06     NaN     NaN    34.0

# While what I want is:
            Y_2021  Y_2023  Y_2024
2020-06-05      45      54      43
2020-06-06      34      23      34
Given your actual output and what you want, you should be able to just do:
output.ffill().bfill().drop_duplicates()
to get the output you want.
Found the fix:
gt = gt.groupby('Trading Day').sum()
gt = gt.replace(0, np.nan)
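A small variant that avoids the replace step: groupby sum accepts min_count (pandas >= 0.22), so groups that are all NaN stay NaN instead of becoming 0:
gt = gt.groupby('Trading Day').sum(min_count=1)  # all-NaN groups remain NaN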

pandas resampling: aggregating monthly values with offset

I work with monthly climate data (e.g. monthly mean temperature or precipitation) where I am often interested in taking several-month means, e.g. December-March or May-September. To do this, I'm attempting to aggregate monthly time series data using offsets in pandas (version 1.3.5), following the documentation.
For example, I have a monthly time series:
import pandas as pd
index = pd.date_range(start="2000-01-31", end="2000-12-31", freq="M")
data = pd.Series(range(12), index=index)
Taking a 4-month mean:
data_4M = data.resample("4M").mean()
>>> data_4M
2000-01-31 0.0
2000-05-31 2.5
2000-09-30 6.5
2001-01-31 10.0
Freq: 4M, dtype: float64
Attempting a 4-month mean with a 2-month offset produces a warning with the same results as the no-offset example above:
data_4M_offset = data.resample("4M", offset="2M").mean()
c:\program files\python39\lib\site-packages\pandas\core\resample.py:1381: FutureWarning: Units 'M', 'Y' and 'y' do not represent unambiguous timedelta values and will be removed in a future version
tg = TimeGrouper(**kwds)
>>> data_4M_offset
2000-01-31 0.0
2000-05-31 2.5
2000-09-30 6.5
2001-01-31 10.0
Freq: 4M, dtype: float64
Does this mean that the monthly offset functionality has already been removed?
Is there another way that I can take multi-month averages with offsets?
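In the meantime, one manual workaround (a sketch, not pandas' built-in offset machinery) is to build an integer month key, shift it by the desired offset, and group into 4-month blocks:
# integer month count; shifting by 2 makes the blocks start in March
width, offset = 4, 2
months = data.index.year * 12 + data.index.month - 1
blocks = (months - offset) // width
data_4M_offset = data.groupby(blocks).mean()
# relabel each block with its last timestamp instead of the integer block id
data_4M_offset.index = data.index.to_series().groupby(blocks).max().values
# for the example series the blocks are Jan-Feb, Mar-Jun, Jul-Oct, Nov-Dec,
# giving means 0.5, 3.5, 7.5, 10.5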

Looping over columns in Pandas

I am trying to divide each pair of value columns by the last two columns in the data set. For example, I want to divide column[0] and column[2] by column[-2] and then store the results in column[0] and column[2], respectively.
Ideally, what I want is to obtain from this:
   fra1  ger1  fra2  ger2  fra pop  ger pop
0    12    14   525    52       14       14
something like this:
   fra1        ger1        fra2         ger2
0  12/fra pop  14/ger pop  525/fra pop  52/ger pop
that is, I want to create a new DataFrame (that keeps the original column labels) by dividing the values for each country by its population.
Doing this manually for every column would take too much time with the real dataset, and I cannot figure out how to run a loop.
Can anybody help?
Thanks a lot!
You can select the columns to fit your use case with df.columns and slicing.
Setting up the DataFrame:
import pandas as pd
import io

t = '''
fra1  ger1  fra2  ger2  fra pop  ger pop
0  12  14  525  52  14  14'''
df = pd.read_csv(io.StringIO(t), sep=r'\s\s+', engine='python')
df
Out:
   fra1  ger1  fra2  ger2  fra pop  ger pop
0    12    14   525    52       14       14
The slices [:4] and [-2:], and the multiplication factor 2 for the column names to divide by, have to be adjusted for your real data:
df[df.columns[:4]].div(df[df.columns[-2:].tolist()*2].values)
Out:
       fra1  ger1  fra2      ger2
0  0.857143   1.0  37.5  3.714286
If you change your original organization you can do this much more easily, but from this point it's probably best to use some logic to determine the prefixes, perform the division for each subgroup, and then join the results with concat at the end.
# Prefix is everything before ' pop'
prefixes = [x.rsplit(' ', 1)[0] for x in df.columns if x.endswith('pop')]
# ['fra', 'ger']
l = []
for pref in prefixes:
    l.append(df[[x for x in df.columns if x.startswith(pref) and not x.endswith('pop')]]
             .divide(df[f'{pref} pop'], axis=0))
res = pd.concat(l, axis=1)
#       fra1  fra2  ger1      ger2
# 0 0.857143  37.5   1.0  3.714286
I think I also found a solution:
divisor = df.iloc[:, -2:]
for index, column in enumerate(df):
    values = df[column]
    if index < 2:
        df[column] = values / divisor.iloc[:, index]
    if 1 < index < 4:  # was `1 < index < 3`, which skipped the fourth column
        df[column] = values / divisor.iloc[:, index - 2]
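The same idea can be written without the index bookkeeping; a hedged generalization, assuming the value columns alternate country in the same order as the two population columns:
# divide each value column by the population column of the same country,
# relying on the alternating fra/ger column order
n_pop = 2
for i, col in enumerate(df.columns[:-n_pop]):
    df[col] = df[col] / df.iloc[:, -n_pop + (i % n_pop)]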
Here is a solution that uses pandas' multi-indexing and broadcasting. Multi-indexing puts country and metric in two separate levels of the column labels. Broadcasting lets you divide every German (or French) metric by the German (or French) population.
from io import StringIO
import pandas as pd

# add 2nd row to validate results below
t = '''
fra1  ger1  fra2  ger2  fra pop  ger pop
0  12  14  525  52  14  14
1  2  3  4  5  6  7
'''
df = pd.read_csv(StringIO(t), sep=r'\s\s+', engine='python')

# create hierarchical index (i.e., multi-index);
# note the second tuple is ('germany', 'm1'), not 'm2'
midx = [('france', 'm1'), ('germany', 'm1'),
        ('france', 'm2'), ('germany', 'm2'),
        ('france', 'pop'), ('germany', 'pop')]
midx = pd.MultiIndex.from_tuples(midx, names=['country', 'metric'])
df.columns = midx

# create `metrics` data frame (excludes population)
metrics = df.loc[:, (slice(None), ['m1', 'm2'])]

# create population data frame (and remove one level of index)
pop = df.loc[:, (slice(None), 'pop')].droplevel(level='metric', axis=1)

result = metrics.div(pop, level='country')
print(result)

country    france   germany     france   germany
metric         m1        m1         m2        m2
0        0.857143  1.000000  37.500000  3.714286
1        0.333333  0.428571   0.666667  0.714286
More info here: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
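If you need the original flat labels back afterwards, the MultiIndex can be collapsed again; a hypothetical follow-up, assuming the country names and metric labels used above:
# collapse ('france', 'm1') -> 'fra1', ('germany', 'm2') -> 'ger2', etc.
result.columns = [f'{country[:3]}{metric[-1]}' for country, metric in result.columns]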

pybbg: extracting data from Bloomberg using ISIN codes, not tickers

I have a large list of ISIN codes and would like to use them to pull Bloomberg data into Python using pybbg.
For example, this gives me NaN values for all ISIN codes:
fld_list = ['OAS_SPREAD_MID','DUR_ADJ_MID','DUR_ADJ_OAS_MID']
bb = bbg.bdp("US46628LAA61 ISIN", fld_list)
When using the tickers, I get all field values.
Any ideas would be really appreciated.
Many thanks,
The correct syntax to request data for an ISIN is /isin/US46628LAA61.
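With the same pybbg call from the question, that would presumably look like this (untested sketch; bbg and fld_list as defined above):
fld_list = ['OAS_SPREAD_MID', 'DUR_ADJ_MID', 'DUR_ADJ_OAS_MID']
bb = bbg.bdp("/isin/US46628LAA61", fld_list)  # /isin/ prefix instead of the "<code> ISIN" form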
With xbbg you can do this:
In[1]: from xbbg import blp
In[2]: fld_list = ['OAS_SPREAD_MID','DUR_ADJ_MID','DUR_ADJ_OAS_MID']
In[3]: blp.bdp(['US46628LAA61 Mtge', 'US46631JAA60 Mtge'], fld_list)
Out[3]:
              ticker            field  value
0  US46628LAA61 Mtge   OAS_SPREAD_MID  -5.30
1  US46628LAA61 Mtge      DUR_ADJ_MID   6.00
2  US46628LAA61 Mtge  DUR_ADJ_OAS_MID   2.43
3  US46631JAA60 Mtge   OAS_SPREAD_MID  50.10
4  US46631JAA60 Mtge      DUR_ADJ_MID   1.71
5  US46631JAA60 Mtge  DUR_ADJ_OAS_MID   4.09

Fill missing Values by a ratio of other values in Pandas

I have a column in a pandas DataFrame with around 78% missing values.
The remaining 22% values are divided between three labels - SC, ST, GEN with the following ratios.
SC - 16%
ST - 8%
GEN - 76%
I need to replace the missing values with the above three labels so that the ratio of all the elements remains the same as above. The assignment can be random as long as the ratio stays as above.
How do I accomplish this?
Starting with this DataFrame (only to create something similar to yours):
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': np.random.choice(['SC', 'ST', 'GEN'],
                                          p=[0.16, 0.08, 0.76],
                                          size=1000)})
df.loc[df.sample(frac=0.22).index] = np.nan
It yields a column with 22% NaN and the remaining proportions are similar to yours:
df['C1'].value_counts(normalize=True, dropna=False)
Out:
GEN 0.583
NaN 0.220
SC 0.132
ST 0.065
Name: C1, dtype: float64
df['C1'].value_counts(normalize=True)
Out:
GEN 0.747436
SC 0.169231
ST 0.083333
Name: C1, dtype: float64
Now you can use fillna with np.random.choice:
df['C1'] = df['C1'].fillna(pd.Series(np.random.choice(['SC', 'ST', 'GEN'],
                                                      p=[0.16, 0.08, 0.76],
                                                      size=len(df))))
The resulting column will have these proportions:
df['C1'].value_counts(normalize=True, dropna=False)
Out:
GEN 0.748
SC 0.165
ST 0.087
Name: C1, dtype: float64
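If you'd rather not hard-code the probabilities, a small variant derives them from the observed (non-missing) distribution first:
# estimate the label proportions from the non-missing values,
# then sample the fill values from that empirical distribution
probs = df['C1'].value_counts(normalize=True)
fill = pd.Series(np.random.choice(probs.index, p=probs.values, size=len(df)),
                 index=df.index)
df['C1'] = df['C1'].fillna(fill)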