Multi-index dataframe split and stack - pandas

When I download data from yfinance, I get 8 columns (Open, High, Low, etc.) per ticker. Since I am downloading 15 tickers, I end up with 120 columns plus 1 index column (date), laid out side by side under a two-level column header. See Image 1.
Instead of having that many columns in 2 levels, I want just the 8 unique columns, plus one new column that identifies the ticker. See Image 2.
Image 1: Current Form
Image 1 but in raw text:
Adj Close ... Volume
DANHOS13.MX FCFE18.MX FHIPO14.MX FIBRAHD15.MX FIBRAMQ12.MX FIBRAPL14.MX FIHO12.MX FINN13.MX FMTY14.MX FNOVA17.MX ... FIBRAPL14.MX FIHO12.MX FINN13.MX FMTY14.MX FNOVA17.MX FPLUS16.MX FSHOP13.MX FUNO11.MX FVIA16.MX TERRA13.MX
Date
2015-01-02 26.065336 NaN 18.526043 NaN 16.337654 18.520781 14.683501 11.301384 9.247743 NaN ... 338697 189552 148064 57 NaN NaN 212451 2649823 NaN 1111343
2015-01-05 24.670488 NaN 18.436762 NaN 15.857328 17.859756 13.795850 11.071105 9.209846 NaN ... 449555 364819 244594 19330 NaN NaN 491587 3317923 NaN 1255128
Image 2: Desired outcome
The code I'm applying is:
import datetime as dt
import yfinance as yf
start = dt.datetime(2015,1,1)
end = dt.datetime.now()
df = yf.download("FUNO11.MX FIBRAMQ12.MX FIHO12.MX DANHOS13.MX FINN13.MX FSHOP13.MX TERRA13.MX FMTY14.MX FIBRAPL14.MX FHIPO14.MX FIBRAHD15.MX FPLUS16.MX FVIA16.MX FNOVA17.MX FCFE18.MX",
                 start = start,
                 end = end,
                 group_by = 'Ticker',
                 actions = True)

I will download the data a little differently:
import yfinance as yf
from datetime import datetime as dt
from dateutil.relativedelta import relativedelta
start = dt(2015,1,1)
end = dt.now()
symbols = ["FUNO11.MX", "FIBRAMQ12.MX", "FIHO12.MX", "DANHOS13.MX", "FINN13.MX", "FSHOP13.MX", "TERRA13.MX", "FMTY14.MX",
"FIBRAPL14.MX", "FHIPO14.MX", "FIBRAHD15.MX", "FPLUS16.MX", "FVIA16.MX", "FNOVA17.MX", "FCFE18.MX"]
data = yf.download(symbols, start=start, end=end, actions=True)
And then
Option 1:
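import pandas as pd
def reshaper(symb, dframe):
    # Flatten the two column levels (variable, symbol) into rows
    df = dframe.unstack().reset_index()
    df.columns = ['variable','symbol','Date','Value']
    # Keep a single ticker and pivot the variables back into columns
    df = df.loc[df.symbol==symb,['Date','variable','Value']].pivot_table(index='Date', columns='variable', values='Value').reset_index()
    df.columns.name = ''
    df['Ticker'] = symb
    return df
# DataFrame.append was removed in pandas 2.0, so build all pieces and concatenate once
h = pd.concat([reshaper(s, data) for s in symbols], ignore_index=True)
h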
Option 2: For a one-liner, you could do this:
data.stack().reset_index().rename(columns={'level_1':'Ticker'})

A slightly simpler version first stacks both column index levels (measure and ticker) to get long-form tidy data, and then unstacks the measure level, keeping date and ticker as the index:
import yfinance as yf
symbols = ["FUNO11.MX", "FIBRAMQ12.MX", "FIHO12.MX", "DANHOS13.MX",
"FINN13.MX", "FSHOP13.MX", "TERRA13.MX", "FMTY14.MX",
"FIBRAPL14.MX", "FHIPO14.MX", "FIBRAHD15.MX", "FPLUS16.MX",
"FVIA16.MX", "FNOVA17.MX", "FCFE18.MX"]
data = yf.download(symbols, start='2015-01-01', end='2020-11-15', actions=True)
data_reshape=data.stack(level=[0,1]).unstack(1)
data_reshape.index=data_reshape.index.set_names(['ticker'],level=[1])
data_reshape.head()
Adj Close Close Dividends High \
Date ticker
2015-01-02 DANHOS13.MX 26.065336 37.000000 0.0 37.400002
FHIPO14.MX 18.526043 24.900000 0.0 24.900000
FIBRAMQ12.MX 16.337654 24.490000 0.0 25.110001
FIBRAPL14.MX 18.520781 26.740801 0.0 27.118500
FIHO12.MX 14.683501 21.670000 0.0 22.190001
Low Open Stock Splits Volume
Date ticker
2015-01-02 DANHOS13.MX 36.330002 36.330002 0.0 82849.0
FHIPO14.MX 24.900000 24.900000 0.0 94007.0
FIBRAMQ12.MX 24.350000 24.990000 0.0 1172917.0
FIBRAPL14.MX 26.343100 26.750700 0.0 338697.0
FIHO12.MX 21.209999 22.120001 0.0 189552.0
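The reshaping above can be tried without downloading anything. The following is a minimal sketch on a small synthetic frame; the tickers AAA.MX and BBB.MX, the values, and the level names Price and Ticker are assumptions made for illustration, not real yfinance output.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded data: two measures for two made-up tickers
dates = pd.date_range("2015-01-02", periods=3, freq="D", name="Date")
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAA.MX", "BBB.MX"]], names=["Price", "Ticker"]
)
wide = pd.DataFrame(
    np.arange(12, dtype=float).reshape(3, 4), index=dates, columns=columns
)

# Move the ticker level from the columns into the rows, then flatten the index
long = wide.stack(level="Ticker").reset_index()
long.columns.name = None
print(long)
```

Each row of the result holds one (Date, Ticker) pair, with one column per measure. Recent pandas versions may print a FutureWarning about the stack implementation; the resulting layout is the same for data without missing values.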

Related

How to replace dates of 1 month to month in python?

Above is a plot of the dataframe below, with the date on the x-axis and High on the y-axis.
What I want is for the dates from 06-09-21 to 31-09-21 to be labelled Sep 21, and likewise for the remaining dates with their respective months, because right now the x-axis is not readable.
I don't even know where to start.
Below is the code that I used to plot the graph:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Stock.csv")
x1=df['High'].values.tolist()
r=df['Date'].values.tolist()
plt.plot(r, x1,color="green", label = "High")
You can use pandas.to_datetime with pandas.Series.dt.strftime.
Try this :
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b-%d")
# Output :
As a new column to illustrate the change :
print(df)
Date Date(new)
0 08-09-21 Sep-08
1 09-09-21 Sep-09
2 13-09-21 Sep-13
3 30-08-21 Aug-30
4 01-09-22 Sep-01
5 02-09-22 Sep-02
6 05-09-22 Sep-05
7 06-09-22 Sep-06
# Edit :
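You can use matplotlib.dates.DateFormatter :
from matplotlib.dates import DateFormatter
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
f, ax = plt.subplots(figsize = [8, 4])
ax.plot(df["Date"], df["High"])
# Label each major tick with the abbreviated month and the year, e.g. Sep-2021
ax.xaxis.set_major_formatter(DateFormatter("%b-%Y"))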
Used input :
Date Open High Low Close AdjClose Volume
0 06-09-21 1579.949951 1580.949951 1561.949951 1565.699951 1547.704712 3938448
1 07-09-21 1562.500000 1582.000000 1555.199951 1569.250000 1551.213989 3622748
2 08-09-21 1571.949951 1580.500000 1565.599976 1576.400024 1558.281860 3362040
3 09-09-21 1574.000000 1579.449951 1561.000000 1568.599976 1550.571411 4125474
4 13-09-21 1562.000000 1584.000000 1553.650024 1555.550049 1537.671509 4479582
5 30-08-22 1446.449951 1489.949951 1443.099976 1486.099976 1486.099976 5067700
6 01-09-22 1464.750000 1489.449951 1459.000000 1472.150024 1472.150024 11201568
7 02-09-22 1472.150024 1490.500000 1465.199951 1485.500000 1485.500000 6019043
8 05-09-22 1486.099976 1499.000000 1484.099976 1495.050049 1495.050049 6065966
9 06-09-22 1498.900024 1506.650024 1487.099976 1502.000000 1502.000000 4066957
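If the goal is exactly one label per month, the formatter can be paired with a locator. This is a minimal sketch, assuming the Date column is day-first (dd-mm-yy) as in the input above; the October and November rows are made up for illustration, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.dates import DateFormatter, MonthLocator

# Small illustrative frame in the same dd-mm-yy layout as Stock.csv
df = pd.DataFrame({
    "Date": ["06-09-21", "07-09-21", "13-09-21", "01-10-21", "05-11-21"],
    "High": [1580.95, 1582.0, 1584.0, 1590.0, 1601.5],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%y")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["Date"], df["High"], color="green", label="High")
ax.xaxis.set_major_locator(MonthLocator())            # one major tick per month
ax.xaxis.set_major_formatter(DateFormatter("%b %y"))  # labels such as "Sep 21"
ax.legend()
fig.canvas.draw()  # call plt.show() or fig.savefig(...) to view the result
```

Plotting real datetime values, rather than the raw strings, is what lets matplotlib space the points by time and pick month boundaries for the ticks.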

How can I find '\[a-z]' in dataframe using reg?

I'm trying to find the string with specific pattern in my dataframe
import re
import pandas as pd
import numpy as np
df = pd.read_excel(io = "mydata.xlsx", sheet_name = 'Sheet1', index_col = 0)
To find strings matching '\[a-z]':
header = df.select_dtypes(['object']).columns
df_header = df[header]
p = re.compile('\[a-z]')
df_header_check = df_header.apply(lambda x: x.str.contains(p, na=False))
df_header.loc[df_header_check.any(axis=1), df_header_check.any()]
And I don't get any results: no error message, just an empty dataframe.
I've also tried:
p = re.compile(r'\\[a-z]') but that does not work either.
The sample dataset:
TIME11 WARNEMOTION4 WARNEMOTION4DTL TIME12 WARNSIGN_DTL EVENT_DTL EVENT_DTL_2
EXCLUDE
1_3 1 NaN 2.0 1.0 2.0 2.0 1.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN 언어: ****************** 1. 변사자 정보 : ***_*****-*******_x000D__x000D_\n2. 발견일시 : ****년 **월 **일 **:**_x000D__x000D_\n3. 시도... NaN
And I expect the dataframe output like the above.
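One possible explanation, offered as a sketch rather than a diagnosis: the pattern '\[a-z]' escapes the opening bracket, so it looks for the literal text "[a-z]", while r'\\[a-z]' looks for a backslash followed by a lowercase letter. If the "\n" shown in the sample is a real newline character rather than a backslash and an n, neither pattern can match and a control-character class is needed. The sample strings and the column names reused below are made up for illustration.

```python
import re
import pandas as pd

samples = [
    "literal [a-z] text",                   # contains the characters [ a - z ]
    r"backslash \n kept as two characters",  # raw string: backslash followed by n
    "real newline\nhere",                    # an actual newline character
    "plain",
]

def hits(pattern):
    """Return, for each sample, whether the pattern is found in it."""
    return [bool(re.search(pattern, s)) for s in samples]

print(hits(r"\[a-z]"))    # [True, False, False, False]
print(hits(r"\\[a-z]"))   # [False, True, False, False]
print(hits(r"[\n\r\t]"))  # [False, False, True, False]

# The same control-character check applied to the text columns of a dataframe
df = pd.DataFrame({"EVENT_DTL": samples, "TIME11": [1, 2, 3, 4]})
text_cols = ["EVENT_DTL"]
check = df[text_cols].apply(lambda col: col.str.contains(r"[\n\r\t]", na=False))
print(df.loc[check.any(axis=1), text_cols])
```

Printing a suspicious cell with repr() is a quick way to see which of the two cases applies to the real data.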

Pandas load from_records non-sparsely

I am trying to load a list of dictionaries into pandas as efficiently as possible. Here is a minimal example for constructing my data, which I call below, mylist:
import pandas as pd
import random
from pprint import pprint
from string import ascii_lowercase
random.seed(100)
mylist = []
for i in range(100):
    random_string_variable = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
    random_string = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
    for j in range(10):
        myrecord = {"i": i,
                    "identifier" : random_string,
                    f"var_{ascii_lowercase[j].upper()}_xx" : random.random(),
                    f"var_{ascii_lowercase[j].upper()}_yy" : random.random()*10,
                    f"var_{ascii_lowercase[j].upper()}_zz" : random.random()*100
                    }
        mylist.append(myrecord)
pprint(mylist[0:5])
[{'i': 0,
'identifier': 'NROUIDSA',
'var_A_xx': 0.03694960304368877,
'var_A_yy': 4.4615792434297585,
'var_A_zz': 68.37385464983947},
{'i': 0,
'identifier': 'NROUIDSA',
'var_B_xx': 0.7476846773635049,
'var_B_yy': 3.2014779786116643,
'var_B_zz': 58.91595571819701},
{'i': 0,
'identifier': 'NROUIDSA',
'var_C_xx': 0.3502573960649995,
'var_C_yy': 6.713087131908023,
'var_C_zz': 74.36827046647622},
{'i': 0,
'identifier': 'NROUIDSA',
'var_D_xx': 0.23513409285324904,
'var_D_yy': 3.894932754840866,
'var_D_zz': 65.35552900764706},
{'i': 0,
'identifier': 'NROUIDSA',
'var_E_xx': 0.6660170004345193,
'var_E_yy': 1.9094479278081555,
'var_E_zz': 36.84983796653053}]
When I try to load this into pandas, the resulting data frame is very sparse, with a lot of NaN repetition:
df = pd.DataFrame.from_records(mylist)
df
produces:
df
i identifier var_A_xx var_A_yy var_A_zz var_B_xx var_B_yy ... var_H_zz var_I_xx var_I_yy var_I_zz var_J_xx var_J_yy var_J_zz
0 0 NROUIDSA 0.03695 4.461579 68.373855 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
1 0 NROUIDSA NaN NaN NaN 0.747685 3.201478 ... NaN NaN NaN NaN NaN NaN NaN
2 0 NROUIDSA NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
3 0 NROUIDSA NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
4 0 NROUIDSA NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
.. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
996 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
997 99 SORIUDAN NaN NaN NaN NaN NaN ... 63.72333 NaN NaN NaN NaN NaN NaN
998 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN 0.367797 4.162167 84.699542 NaN NaN NaN
999 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.634893 7.628154 75.903316
[1000 rows x 32 columns]
What I would like it to look like is:
var_A_xx var_A_yy var_A_zz var_B_xx var_B_yy var_B_zz ... var_I_xx var_I_yy var_I_zz var_J_xx var_J_yy var_J_zz
i identifier ...
0 NROUIDSA 0.036950 4.461579 68.373855 0.747685 3.201478 58.915956 ... 0.962999 7.332500 13.216899 0.847280 6.504308 8.552283
1 NURDASOI 0.814194 9.570388 21.239626 0.468727 6.180384 24.260818 ... 0.346681 9.865105 82.261586 0.221160 8.481875 92.645263
2 OARNDUIS 0.813418 1.103359 1.198749 0.646912 2.409214 76.037434 ... 0.404528 2.112085 8.461932 0.621124 5.372169 36.500880
3 DISORNAU 0.533450 1.094177 44.053734 0.804385 5.947438 28.360524 ... 0.121844 5.806337 85.657067 0.735207 4.011567 38.368097
4 SIONUDRA 0.672725 3.724022 58.280713 0.346717 7.432624 49.726532 ... 0.238869 0.769056 58.188641 0.415537 6.828866 38.802765
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 URIADNSO 0.231775 3.114448 65.241238 0.116461 4.330002 12.864624 ... 0.516712 5.589706 87.261427 0.572551 4.060943 80.102004
96 ISDONRAU 0.295684 8.406004 22.817404 0.160434 8.415922 47.288958 ... 0.050647 8.720049 44.407892 0.038166 5.027924 73.852513
97 OIAUSDNR 0.331393 9.480417 90.311381 0.985708 6.384429 55.459062 ... 0.947673 4.406426 68.098531 0.377523 5.258620 61.035638
98 DIONAURS 0.690593 4.316975 9.866558 0.822896 3.822044 68.863371 ... 0.994493 3.550660 22.769721 0.199187 7.254650 91.232969
99 SORIUDAN 0.960168 6.769579 49.488535 0.671168 1.577146 78.835216 ... 0.367797 4.162167 84.699542 0.634893 7.628154 75.903316
[100 rows x 30 columns]
You can see it is a 10x waste of memory to have the first representation. Obviously, there are a variety of ways to get from A to B. How can I tell pandas to /read in/ this list of records as non-sparse, as I assume this would be the most performant? You can see extra records are inserted with NaN values. I'm expecting 100 rows, where the index is given by ["i", "identifier"] and 30 columns.
My preference is to do this at load time with the correct keywords and data load method, rather than relying on pivot operations in pandas after the fact, as they are comparatively slow. I'm asking this question largely for performance, for example with much larger i and somewhat larger j.
df = pd.DataFrame.from_records(mylist, index=["i", "identifier"])
df
Did not do the job.
pd.DataFrame.from_records(mylist, index=["i", "identifier"]).unstack()
ValueError: Index contains duplicate entries, cannot reshape
Also fails.
If there do not exist arguments to ingest the list of dictionaries non-sparsely into a dataframe---this is the focus of my question---which of the .agg, pivot_table, reshape, long_to_wide, and unstack methods would be the fastest at getting from A to B for larger data sets?
There’s a number of ways to load the data as it is:
>>> df_idx = pd.DataFrame.from_records(mylist, index=['i', 'identifier'])
>>> df = pd.DataFrame.from_records(mylist)
>>> df = pd.DataFrame.from_dict(mylist)
>>> df = pd.DataFrame(mylist)
Then we could group by columns or levels, and take the first non-NA value:
>>> df_idx.groupby(level=[0, 1]).first()
>>> df.groupby(['i', 'identifier']).first()
Those are both pretty much improved versions of .agg() as you don’t have to specify a lambda function.
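To make the effect of groupby(...).first() concrete, here is a minimal sketch on a four-row frame with the same shape of sparsity as the question. The identifiers and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Each record fills only one of the variable columns, as in the question
sparse = pd.DataFrame({
    "i": [0, 0, 1, 1],
    "identifier": ["AAA", "AAA", "BBB", "BBB"],
    "var_A_xx": [0.1, np.nan, 0.3, np.nan],
    "var_B_xx": [np.nan, 0.2, np.nan, 0.4],
})

# first() returns the first non-NA value per column within each group,
# which collapses the partial rows into one dense row per (i, identifier)
dense = sparse.groupby(["i", "identifier"]).first()
print(dense)
```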
Then with the same loading, we can try to reshape the data, with stack/unstack or melt/pivot
>>> df_idx.stack().unstack()
>>> pd.melt(pd.DataFrame(mylist), id_vars=['i', 'identifier']).dropna().pivot(index=['i', 'identifier'], columns='variable', values='value')
If that’s not satisfactory, there’s also reshaping before loading, which could be done through list comprehensions, and either ChainMap or dict comprehensions, or with numpy. This relies on the fact that there’s always 10 dictionaries with the same keys in a row, and chaining the same iterator the appropriate number of times with zip():
>>> pd.DataFrame({k: v for d in tup for k, v in d.items()} for tup in zip(*[iter(mylist)] * 10))
>>> pd.DataFrame(ChainMap(*tup) for tup in zip(*[iter(mylist)] * 10))
>>> df = pd.concat([pd.DataFrame(mylist[n::10]) for n in range(10)], axis='columns')
>>> df.groupby(df.columns, axis='columns').first()
>>> reshaped = np.reshape(mylist, (100, 10))
>>> df = pd.concat([pd.json_normalize(reshaped[:,n]) for n in range(10)], axis='columns')
>>> df.groupby(df.columns, axis='columns').first()
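The zip(*[iter(mylist)] * 10) idiom can be hard to read at first, so here is a minimal sketch of what it does on a tiny list. It assumes, as the answer states, that records belonging to one row always arrive consecutively and in groups of a fixed size; the records and the name group_size are made up for illustration.

```python
import pandas as pd

records = [
    {"i": 0, "identifier": "AAA", "var_A_xx": 0.1},
    {"i": 0, "identifier": "AAA", "var_B_xx": 0.2},
    {"i": 1, "identifier": "BBB", "var_A_xx": 0.3},
    {"i": 1, "identifier": "BBB", "var_B_xx": 0.4},
]
group_size = 2  # plays the role of the 10 in the answer

# The list holds the SAME iterator group_size times, so zip pulls
# group_size consecutive records into each tuple
chunks = list(zip(*[iter(records)] * group_size))

# Merge the dicts of each chunk into one dict per output row
merged = [{k: v for d in chunk for k, v in d.items()} for chunk in chunks]

df = pd.DataFrame(merged).set_index(["i", "identifier"])
print(df)
```

Note that zip silently drops a trailing incomplete chunk, so this only fits data whose length is a multiple of the group size.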
I’ve measured with i=100 and j=10 as in your example and with i=1000 and j=100.
You can see that the way you get the data into a dataframe does not matter: all groupby variants have the same results. As you suspected, loading the data and then “fixing” it performs pretty badly. pd.concat does not work too well on 100x10 but scales better on the 1000x100 data, and what seems the best is pure-python dict iteration (maybe because it’s a comprehension and not a list? Not sure). The reshaping techniques, stack/unstack and melt/pivot, are always the worst.
Of course these results may change with different data sizes, and you probably know better what the right sizes to test are, based on your real data. Here’s the full script I used to run the tests so you can run some yourself:
#!/usr/bin/python3
import numpy as np
import pandas as pd
from collections import ChainMap
from matplotlib import pyplot as plt
import timeit
def gen(imax, jmax):
    # Build imax * jmax sparse records, jmax consecutive ones per (i, identifier)
    mylist = []
    l = ord('A')
    for i in range(imax):
        random_string = ''.join(np.random.permutation(list('DINOSAUR')))
        for j in range(jmax):
            var = f'var_{chr(l + (j // 26)) if j >= 26 else ""}{chr(l + (j % 26))}'
            mylist.append({
                'i': i,
                'identifier' : random_string,
                var + '_xx' : np.random.random(),
                var + '_yy' : np.random.random()*10,
                var + '_zz' : np.random.random()*100
            })
    return mylist
def load_rec_idx_groupby(mylist, j):
    return pd.DataFrame.from_records(mylist, index=['i', 'identifier']).groupby(level=[0, 1]).first()
def load_rec_groupby(mylist, j):
    return pd.DataFrame.from_records(mylist).groupby(['i', 'identifier']).first()
def load_dict_groupby(mylist, j):
    return pd.DataFrame.from_dict(mylist).groupby(['i', 'identifier']).first()
def load_constr_groupby(mylist, j):
    return pd.DataFrame(mylist).groupby(['i', 'identifier']).first()
def load_constr_stack(mylist, j):
    return pd.DataFrame(mylist).set_index(['i', 'identifier']).stack().unstack()
def load_constr_melt_pivot(mylist, j):
    return pd.melt(pd.DataFrame(mylist), id_vars=['i', 'identifier']).dropna().pivot(index=['i', 'identifier'], columns='variable', values='value')
def load_zip_iter_dict(mylist, j):
    return pd.DataFrame({k: v for d in tup for k, v in d.items()} for tup in zip(*[iter(mylist)] * j))
def load_zip_iter_chainmap(mylist, j):
    return pd.DataFrame(ChainMap(*tup) for tup in zip(*[iter(mylist)] * j))
def load_concat_step(mylist, j):
    return pd.concat([pd.DataFrame(mylist[n::j]).drop(columns=['i', 'identifier'] if n else []) for n in range(j)], axis='columns')
def load_concat_reshape(mylist, j):
    reshaped = np.reshape(mylist, (len(mylist) // j, j))
    return pd.concat([pd.json_normalize(reshaped[:,n]).drop(columns=['i', 'identifier'] if n else []) for n in range(j)], axis='columns')
def plot_results(df):
    # Express each timing as a slowdown relative to the fastest candidate per data size
    mins = df.groupby(level=0).median().min(axis='columns')
    rel = df.unstack().T.div(mins)
    ax = rel.groupby(level=0).median().plot.barh()
    ax.set_xlabel('slowdown over fastest')
    ax.axvline(1, color='black', lw=1)
    ax.set_xticks([1, *ax.get_xticks()[1:]])
    ax.set_xticklabels([f'{n:.0f}×' for n in ax.get_xticks()])
    plt.subplots_adjust(left=.4, bottom=.15)
    plt.show()
def run():
    candidates = {n: f for n, f in globals().items() if n.startswith('load_') and callable(f)}
    df = {}
    for tup in [(100, 10), (1000, 100)]:
        glob = {'mylist': gen(*tup), **candidates}
        # Pass the actual group size (jmax) of this data set to every candidate
        dat = pd.DataFrame({name:
            timeit.Timer(f'{name}(mylist, {tup[1]})', globals=glob).repeat(5, int(100000 / np.multiply(*tup)))
            for name in candidates
        })
        print(dat)
        df['{}×{}'.format(*tup)] = dat
    df = pd.concat(df).rename(columns=lambda s: s.replace('load_', '').replace('_', ' '))
    print(df)
    plot_results(df)
if __name__ == '__main__':
    run()
I don't think this is possible with a pandas argument at load time, but you can use a comprehension to collapse your list of dicts into a single dict for each row
Data:
a = [
{'i': 0,
'identifier': 'NROUIDSA',
'var_A_xx': 0.03694960304368877,
'var_A_yy': 4.4615792434297585,
'var_A_zz': 68.37385464983947},
{'i': 0,
'identifier': 'NROUIDSA',
'var_B_xx': 0.7476846773635049,
'var_B_yy': 3.2014779786116643,
'var_B_zz': 58.91595571819701},
{'i': 0,
'identifier': 'NROUIDSA',
'var_C_xx': 0.3502573960649995,
'var_C_yy': 6.713087131908023,
'var_C_zz': 74.36827046647622},
{'i': 0,
'identifier': 'NROUIDSA',
'var_D_xx': 0.23513409285324904,
'var_D_yy': 3.894932754840866,
'var_D_zz': 65.35552900764706},
{'i': 0,
'identifier': 'NROUIDSA',
'var_E_xx': 0.6660170004345193,
'var_E_yy': 1.9094479278081555,
'var_E_zz': 36.84983796653053},
{'i': 1,
'identifier': 'SORIUDAN',
'var_A_xx': 0.03694960304368877,
'var_A_yy': 4.4615792434297585,
'var_A_zz': 68.37385464983947},
{'i': 1,
'identifier': 'SORIUDAN',
'var_B_xx': 0.7476846773635049,
'var_B_yy': 3.2014779786116643,
'var_B_zz': 58.91595571819701},
{'i': 1,
'identifier': 'SORIUDAN',
'var_C_xx': 0.3502573960649995,
'var_C_yy': 6.713087131908023,
'var_C_zz': 74.36827046647622},
{'i': 1,
'identifier': 'SORIUDAN',
'var_D_xx': 0.23513409285324904,
'var_D_yy': 3.894932754840866,
'var_D_zz': 65.35552900764706},
{'i': 1,
'identifier': 'SORIUDAN',
'var_E_xx': 0.6660170004345193,
'var_E_yy': 1.9094479278081555,
'var_E_zz': 36.84983796653053}
]
Cleaning the list of dicts:
# get the list of unique keys (assumed here to be the identifier value of each dict)
l_key = list(dict.fromkeys([l.get('identifier') for l in a]))
# a list we'll append the merged dicts to
data = list()
# iterate through the unique identifiers and merge all dicts that share one
for i in l_key:
    l_data = [l for l in a if l.get('identifier') == i]
    data.append({k: v for d in l_data for k, v in d.items()})
Adding the data to the pandas df:
import pandas as pd
df = pd.DataFrame.from_records(data)
print(df)
print(df.dtypes)
I don't know that this would be faster than dealing with the data after you've loaded it into the DataFrame but it is another approach.
I think your issue does not come from pandas: you can create the records in the shape you want as a result. I edited your implementation as below:
import random
from string import ascii_lowercase
import pandas as pd
random.seed(100)
mylist = []
for i in range(100):
    random_string_variable = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
    random_string = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
    # one record per (i, identifier), filled with all of its variables
    record = {
        "i": i,
        "identifier": random_string
    }
    for j in range(10):
        record[f"var_{ascii_lowercase[j].upper()}_xx"] = random.random()
        record[f"var_{ascii_lowercase[j].upper()}_yy"] = random.random() * 10
        record[f"var_{ascii_lowercase[j].upper()}_zz"] = random.random() * 100
    mylist.append(record)
print(len(mylist))
df = pd.DataFrame.from_records(mylist)
df
Since this is about loading the records into pandas, it may be easier to process the list before passing it to pandas, for example:
from itertools import groupby
from collections import ChainMap
records = []
# merge consecutive dicts that share the same (i, identifier) key
for k, v in groupby(mylist, key=lambda x: (x['i'], x['identifier'])):
    record = dict(ChainMap(*v))
    records.append(record)
df = pd.DataFrame.from_records(records)
print(df)
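One caveat worth illustrating: itertools.groupby only merges consecutive items with equal keys, so the approach above relies on the records already being ordered by (i, identifier). The following is a minimal sketch for unordered input, which sorts first; the shuffled records are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

import pandas as pd

# Records for the same (i, identifier) are deliberately not adjacent here
records = [
    {"i": 1, "identifier": "BBB", "var_A_xx": 0.3},
    {"i": 0, "identifier": "AAA", "var_A_xx": 0.1},
    {"i": 1, "identifier": "BBB", "var_B_xx": 0.4},
    {"i": 0, "identifier": "AAA", "var_B_xx": 0.2},
]

key = itemgetter("i", "identifier")
rows = []
# Sorting by the grouping key makes equal keys adjacent before grouping
for _, group in groupby(sorted(records, key=key), key=key):
    row = {}
    for d in group:
        row.update(d)  # later dicts add their own variable columns
    rows.append(row)

df = pd.DataFrame(rows).set_index(["i", "identifier"])
print(df)
```

The sort adds cost, so it is only worth doing when the input order cannot be trusted.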

Pandas to mark both if cell value is a substring of another

I have a column with short and full forms of people's names, and I want to unify them when one name is a part of another. E.g. for "James.J" and "James.Jones", I want to tag them both as "James.J".
data = {'Name': ["Amelia.Smith",
"Lucas.M",
"James.J",
"Elijah.Brown",
"Amelia.S",
"James.Jones",
"Benjamin.Johnson"]}
df = pd.DataFrame(data)
I can't figure out how to do it in Pandas, so I only have an xlrd way, using the similarity ratio from SequenceMatcher (after sorting it manually in Excel):
import xlrd
from xlrd import open_workbook,cellname
import xlwt
from xlutils.copy import copy
workbook = xlrd.open_workbook("C:\\TEM\\input.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")
from difflib import SequenceMatcher
wb = copy(workbook)
sheet = wb.get_sheet(0)
# start at 1 so that row_index-1 always refers to the row directly above
for row_index in range(1, old_sheet.nrows):
    current = old_sheet.cell(row_index, 0).value
    previous = old_sheet.cell(row_index-1, 0).value
    sro = SequenceMatcher(None, current.lower(), previous.lower(), autojunk=True).ratio()
    if sro > 0.7:
        sheet.write(row_index, 1, previous)
        sheet.write(row_index-1, 1, previous)
wb.save("C:\\TEM\\output.xls")
What's the nice Pandas way to do it? Thank you.
Using pandas, you can make use of str.split and .map with some boolean conditions to identify the dupes.
df1 = df['Name'].str.split('.',expand=True).rename(columns={0 : 'FName', 1 :'LName'})
df2 = df1.loc[df1['FName'].duplicated(keep=False)]\
         .assign(ky=df['Name'].str.len())\
         .sort_values('ky')\
         .drop_duplicates(subset=['FName'],keep='first').drop(columns='ky')
df['NewName'] = df1['FName'].map(df2.assign(newName=df2.agg('.'.join,axis=1))\
                                    .set_index('FName')['newName'])
print(df)
Name NewName
0 Amelia.Smith Amelia.S
1 Lucas.M NaN
2 James.J James.J
3 Elijah.Brown NaN
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson NaN
Here is an example of using apply with a custom function. For small dfs this should be fine; this will not scale well for large dfs. A more sophisticated data structure for memo would be an ok place to start to improve performance without degrading readability too much:
df = df.sort_values("Name")
def short_name(row, col="Name", memo=[]):
    name = row[col]
    for m_name in memo:
        if name.startswith(m_name):
            return m_name
    memo.append(name)
    return name
df["short_name"] = df.apply(short_name, axis=1)
df = df.sort_index()
output:
Name short_name
0 Amelia.Smith Amelia.S
1 Lucas.M Lucas.M
2 James.J James.J
3 Elijah.Brown Elijah.Brown
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson Benjamin.Johnson
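One thing to watch with the function above is that the mutable default memo=[] persists between calls, so running it a second time in the same session reuses names remembered from the first run. Below is a minimal sketch of the same prefix idea with the memo kept local; the helper name unify_names is hypothetical.

```python
import pandas as pd

def unify_names(names):
    """Map each name to the shortest seen name that it starts with."""
    seen = []     # short forms found so far; local, so reruns start clean
    mapping = {}
    # Sorting puts a short form such as "James.J" before "James.Jones"
    for name in sorted(set(names)):
        for short in seen:
            if name.startswith(short):
                mapping[name] = short
                break
        else:
            seen.append(name)
            mapping[name] = name
    return mapping

df = pd.DataFrame({"Name": ["Amelia.Smith", "Lucas.M", "James.J", "Elijah.Brown",
                            "Amelia.S", "James.Jones", "Benjamin.Johnson"]})
df["short_name"] = df["Name"].map(unify_names(df["Name"]))
print(df)
```

Like the original, this is quadratic in the number of distinct names, so it is meant for small frames.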

Copy values from one column in a pandas dataframe only when target field is blank in target dataframe

I have 2 dataframes of equal length. The source has one column, ML_PREDICTION that I want to copy to the target dataframe, which has some values already that I don't want to overwrite.
#Select only blank values in target dataframe
mask = br_df['RECOMMENDED_ACTION'] == ''
# Attempt 1 - Results: KeyError: "['Retain' 'Retain' '' ... '' '' 'Retain'] not in index"
br_df.loc[br_df['RECOMMENDED_ACTION'][mask]] = ML_df['ML_PREDICTION'][mask]
br_df.loc['REASON_CODE'][mask] = 'ML01'
br_df.loc['COMMENT'][mask] = 'Automated Prediction'
# Attempt 2 - Results: Overwrites all values in target dataframe
br_df['RECOMMENDED_ACTION'].where(mask, other=ML_df['ML_PREDICTION'], inplace=True)
br_df['REASON_CODE'].where(mask, other='ML01', inplace=True)
br_df['COMMENT'].where(mask, other='Automated Prediction', inplace=True)
# Attempt 3 - Results: Overwrites all values in target dataframe
br_df['RECOMMENDED_ACTION'] = [x for x in ML_df['ML_PREDICTION'] if [mask] ]
br_df['REASON_CODE'] = ['ML01' for x in ML_df['ML_PREDICTION'] if [mask]]
br_df['COMMENT'] = ['Automated Prediction' for x in ML_df['ML_PREDICTION'] if [mask]]
Attempt 4 - Results: Values in target (br_df) were unchanged
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'REASON_CODE'] = 'ML01'
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'COMMENT'] = 'Automated Prediction'
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'RECOMMENDED_ACTION'] = ML_df['ML_PREDICTION']
Attempt 5
#Dipanjan
# Before - br_df['REASON_CODE'].value_counts()
BR03 10
BR01 8
Name: REASON_CODE, dtype: int64
#Attempt 5
br_df.loc['REASON_CODE'] = br_df['REASON_CODE'].fillna('ML01')
br_df.loc['COMMENT'] = br_df['COMMENT'].fillna('Automated Prediction')
br_df.loc['RECOMMENDED_ACTION'] = br_df['RECOMMENDED_ACTION'].fillna(ML_df['ML_PREDICTION'])
# After -- print(br_df['REASON_CODE'].value_counts())
BR03 10
BR01 8
ML01 2
Automated Prediction 1
Name: REASON_CODE, dtype: int64
#WTF? -- br_df[br_df['REASON_CODE'] == 'Automated Prediction']
PERSON_STATUS ... RECOMMENDED_ACTION REASON_CODE COMMENT
COMMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Automated Prediction Automated Prediction Automated Prediction
What am I missing here?
Use one of the options below:
df.loc[df['A'].isnull(), 'A'] = df['B']
or
df['A'] = df['A'].fillna(df['B'])
import numpy as np
df_a = pd.DataFrame([0,1,np.nan])
df_b = pd.DataFrame([0,np.nan,2])
df_a
0
0 0.0
1 1.0
2 NaN
df_b
0
0 0.0
1 NaN
2 2.0
df_a[0] = df_a[0].fillna(df_b[0])
Final output:
df_a
0
0 0.0
1 1.0
2 2.0
Ultimately, this is the syntax that appears to solve my problem:
mask = mask[:len(br_df)] # create the boolean index
br_df = br_df[:len(mask)] # make sure they are the same length
br_df['RECOMMENDED_ACTION'].loc[mask] = ML_df['ML_PREDICTION'].loc[mask]
br_df['REASON_CODE'].loc[mask] = 'ML01'
br_df['COMMENT'].loc[mask] = 'Automated Prediction'
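For comparison, here is a minimal sketch of the same update written with a single .loc[rows, column] call per column, which avoids chained indexing such as br_df['COMMENT'].loc[mask] (chained assignment can trigger SettingWithCopyWarning and may not write back to the frame in newer pandas). It assumes both frames share the same index and that blank means an empty string; the sample values are made up for illustration.

```python
import pandas as pd

br_df = pd.DataFrame({
    "RECOMMENDED_ACTION": ["Retain", "", "Release", ""],
    "REASON_CODE": ["BR01", "", "BR03", ""],
    "COMMENT": ["manual", "", "manual", ""],
})
ML_df = pd.DataFrame({"ML_PREDICTION": ["Release", "Retain", "Retain", "Release"]})

# Compute the mask once, before any column is modified
mask = br_df["RECOMMENDED_ACTION"] == ""

# .loc[row_selector, column] writes into the original frame; the right-hand
# side is aligned on the index, so only the masked rows receive predictions
br_df.loc[mask, "RECOMMENDED_ACTION"] = ML_df.loc[mask, "ML_PREDICTION"]
br_df.loc[mask, "REASON_CODE"] = "ML01"
br_df.loc[mask, "COMMENT"] = "Automated Prediction"
print(br_df)
```

By contrast, br_df.loc['REASON_CODE'] = ... with a single label addresses a row named REASON_CODE, which is consistent with the extra COMMENT row seen in Attempt 5.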