How to simplify the iteration through rows in Openpyxl?

Hello, I have come up with this solution to iterate through several rows (by specifying the row numbers '2', '3', '4', etc.); however, I would like to do the same iteration in a different context where there are many more rows.
What should I use instead of specifying the row number?
for col in range(3, 4 + 1):
    for row in range(2, 14):
        Jan = database[str(get_column_letter(col)) + '2'].value
        Feb = database[str(get_column_letter(col)) + '3'].value
        Mar = database[str(get_column_letter(col)) + '4'].value
        Apr = database[str(get_column_letter(col)) + '5'].value
        May = database[str(get_column_letter(col)) + '6'].value
        Jun = database[str(get_column_letter(col)) + '7'].value
        Jul = database[str(get_column_letter(col)) + '8'].value
        Aug = database[str(get_column_letter(col)) + '9'].value
        Sep = database[str(get_column_letter(col)) + '10'].value
        Oct = database[str(get_column_letter(col)) + '11'].value
        Nov = database[str(get_column_letter(col)) + '12'].value
        Dec = database[str(get_column_letter(col)) + '13'].value
the result I am trying to obtain is this:
{'Jan': [218, 124], 'Feb': [541, 874], 'Mar': [215, 156], 'Apr': [365, 189], 'May': [245, 645], 'Jun': [542, 245], 'Jul': [542, 654], 'Aug': [987, 354], 'Sep': [167, 369], 'Oct': [367, 785], 'Nov': [174, 412], 'Dec': [841, 213]}
considering this structure of data in Excel:
Thanks in advance

Yes, as Charlie Clark has said, this is covered in the documentation. Here is the code I wrote based on the openpyxl docs:
import openpyxl
from itertools import zip_longest

wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active

col1 = []
col2 = []
col3 = []
elems = []

# each slice yields tuples of cells; zip the three columns row by row
for row1, row2, row3 in zip_longest(ws['A1':'A13'], ws['B1':'B13'], ws['C1':'C13']):
    for cell1, cell2, cell3 in zip_longest(row1, row2, row3):
        # the cells already carry their values, so no ws.cell() round trip
        # is needed (and cell.column is an int in recent openpyxl anyway)
        col1.append(cell1.value)
        col2.append(cell2.value)
        col3.append(cell3.value)

for count, elem in enumerate(zip(col1, col2, col3), 1):
    print(elem)
    elems.append(elem)

# the first element of each tuple is the month name, the rest are the two values
dict_answer = {x[0]: list(x[1:3]) for x in elems}
print(dict_answer)
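For what it's worth, the same result can be had with a much shorter loop over ws.iter_rows (a sketch, assuming as above that the month names sit in column A and the two values in columns B and C of rows 1-13):

import openpyxl

wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active

result = {}
# values_only=True yields plain values instead of Cell objects,
# one tuple per row, already bounded to columns A..C
for month, v1, v2 in ws.iter_rows(min_row=1, max_row=13, min_col=1, max_col=3, values_only=True):
    result[month] = [v1, v2]
print(result)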
Please make sure in future to read the docs and search previous questions before asking your own.

Related

Compare each string with all other strings in a dataframe

I have this dataframe:
mylist = [
    "₹67.00 to Rupam Sweets using Bank Account XXXXXXXX5343<br>11 Feb 2023, 20:42:25",
    "₹66.00 to Rupam Sweets using Bank Account XXXXXXXX5343<br>10 Feb 2023, 21:09:23",
    "₹32.00 to Nagori Sajjad Mohammed Sayyed using Bank Account XXXXXXXX5343<br>9 Feb 2023, 07:06:52",
    "₹110.00 to Vikram Manohar Jsohi using Bank Account XXXXXXXX5343<br>9 Feb 2023, 06:40:08",
    "₹120.00 to Winner Dinesh Gupta using Bank Account XXXXXXXX5343<br>30 Jan 2023, 06:23:55",
]
import pandas as pd
df = pd.DataFrame(mylist)
df.columns = ["full_text"]
ndf = df.full_text.str.split("to", expand=True)
ndf.columns = ["amt", "full_text"]
ndf2 = ndf.full_text.str.split("using Bank Account XXXXXXXX5343<br>", expand=True)
ndf2.columns = ["client", "date"]
df = ndf.join(ndf2)[["date", "client", "amt"]]
I have created embeddings for each client name:
from openai.embeddings_utils import get_embedding, cosine_similarity
import openai
openai.api_key = 'xxx'
embedding_model = "text-embedding-ada-002"
embeddings = df.client.apply(lambda x: get_embedding(x, engine=embedding_model))
df["embeddings"] = embeddings
I can now calculate the similarity index for a given string, e.g. "Rupam Sweet", using:
query_embedding = get_embedding("Rupam Sweet", engine="text-embedding-ada-002")
df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))
But I need the similarity score of each client across all other clients. In other words, the client names will be in rows as well as in columns and the score will be the data. How do I achieve this?
I managed to get the expected results using:
for k, i in enumerate(df.client):
    query_embedding = get_embedding(i, engine="text-embedding-ada-002")
    if i in df.columns:
        df[i + str(k)] = df.embeddings.apply(
            lambda x: cosine_similarity(x, query_embedding)
        )
    else:
        df[i] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))
I am not sure how efficient this is for big data.
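A vectorized sketch that avoids the per-client apply loop (assuming each entry of df.embeddings holds one embedding vector): stack the embeddings into a matrix and get all pairwise cosine similarities with a single matrix product.

import numpy as np
import pandas as pd

# stack the per-row embedding lists into an (n, d) array
emb = np.array(df.embeddings.tolist())
# normalise the rows so that a dot product equals cosine similarity
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
# client names as both rows and columns, similarity scores as the data
sim = pd.DataFrame(emb @ emb.T, index=df.client, columns=df.client)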

How can I add several columns within a dataframe (broadcasting)?

import numpy as np
import pandas as pd
data = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]
df = pd.DataFrame(data = data, index = ['A', 'B', 'C'], columns = ['Bulgary', 'Robbery', 'Car Theft'])
df
I get the following:
   Bulgary  Robbery  Car Theft
A       30       19          6
B       12       23         14
C        8       18         20
I would like to assign:
df['Total'] = df['Bulgary'] + df['Robbery'] + df['Car Theft']
But does this operation have to be done manually? I am looking for a function that can handle this conveniently.
#pseudocode
#df['Total'] = df.Some_Column_Adding_Function([0:3])
#df['Total'] == df['Bulgary'] + df['Robbery'] + df['Car Theft'] returns True
Similarly, how do I add across rows?
Use sum:
df['Total'] = df.sum(axis=1)
Or if you want subset of columns:
df['Total'] = df[df.columns[0:3]].sum(axis=1)
# or df['Total'] = df[['Bulgary', 'Robbery', 'Car Theft']].sum(axis=1)
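And for the "add across rows" part of the question: summing down the rows is axis=0, and one way (a sketch) to attach the result as an extra row is via .loc:

# axis=0 sums down the rows, giving one total per column;
# run this before adding the 'Total' column, or it will be summed too
df.loc['Total'] = df.sum(axis=0)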

Speeding up conversion of pandas to numpy

I've got a Pandas df of approximately 2.5m rows, with a multi-index of the form:
('assetCode', 'date') and approximately 60 columns.
I'm trying to convert this to a 3D numpy matrix:
assetCodes = X_calculated.index.get_level_values(0).unique().sort_values().to_numpy()
dates = X_calculated.index.get_level_values(1).unique().sort_values().to_numpy()
columns = X_calculated.columns.to_numpy()
myData = np.empty((assetCodes.size, dates.size, columns.size))

def updateMatrix(row):
    idx = row.name
    assetLabel = np.searchsorted(assetCodes, idx[0])
    dateLabel = np.where(dates == idx[1])
    myData[assetLabel][dateLabel] = row.to_numpy()

X_calculated.apply(updateMatrix, axis=1)
This operation takes a very long time. Is there a quicker way?
I think if you already have all the combinations of assetCode and date in your dataframe, you can do it with reshape:
# example data
X_calculated = pd.DataFrame(np.arange(36).reshape(9, -1),
                            index=pd.MultiIndex.from_product([range(101, 104),
                                                              range(111, 114)],
                                                             names=('assetCode', 'date')),
                            columns=list('abcd'))

# get dimensions
nb_asset = X_calculated.index.get_level_values(0).nunique()
nb_dates = X_calculated.index.get_level_values(1).nunique()
nb_cols = len(X_calculated.columns)

# create myData
myData = X_calculated.sort_index().to_numpy().reshape(nb_asset, nb_dates, nb_cols)
print(myData)  # same result as with your code
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
[[24 25 26 27]
[28 29 30 31]
[32 33 34 35]]]
If you have missing combinations, you can first reindex with a pd.MultiIndex.from_product built from the unique values in both index levels. There is no need to sort_index any more, I think, as the new MultiIndex is generated already sorted:
assetCodes = X_calculated.index.get_level_values(0).unique().sort_values()
dates = X_calculated.index.get_level_values(1).unique().sort_values()
myData = (X_calculated.reindex(pd.MultiIndex.from_product([assetCodes, dates]))
                      .to_numpy()
                      .reshape(len(assetCodes), len(dates), len(X_calculated.columns)))

Represent negative timedelta in most basic form

If I create a negative Timedelta for e.g. 0.5 hours, the internal representation looks as follow:
In [2]: pd.Timedelta('-0.5h')
Out[2]: Timedelta('-1 days +23:30:00')
How can I get back a (str) representation of this Timedelta in the form -00:30?
I want to display these deltas, and requiring the user to calculate the expression -1 day + something is a bit awkward.
I can't comment yet, so adding it here. Don't know if this helps, but I think you can use Python's humanize package.
import humanize as hm
hm.naturaltime((pd.Timedelta('-0.5h')))
Out:
'30 minutes from now'
OK, I will live with a hack going through a date:
sign = ''
date = pd.to_datetime('today')
if delta.total_seconds() < 0:
    sign = '-'
    date = date - delta
else:
    date = date + delta
print('{}{:%H:%M}'.format(sign, date.to_pydatetime()))
You can use the components of a Pandas timedelta
import pandas as pd
t = pd.Timedelta('-0.5h')
print(t.components)
>> Components(days=-1, hours=23, minutes=30, seconds=0, milliseconds=0, microseconds=0, nanoseconds=0)
You can access each component with
print(t.components.days)
>> -1
print(t.components.hours)
>> 23
print(t.components.minutes)
>> 30
The rest is then formatting.
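For example, a minimal sketch of that formatting step (fmt_hhmm is a made-up helper name), going through total_seconds rather than the components so the sign handling stays in one place:

import pandas as pd

def fmt_hhmm(td):
    # int() truncates toward zero, so -30.5 minutes becomes -30
    total_min = int(td.total_seconds() / 60)
    sign = '-' if total_min < 0 else ''
    hours, minutes = divmod(abs(total_min), 60)
    return '{}{:02d}:{:02d}'.format(sign, hours, minutes)

print(fmt_hhmm(pd.Timedelta('-0.5h')))  # -00:30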
This is a total hack that won't work for Series data, but...
import pandas as pd
import numpy as np

t = pd.Timedelta('-0.5h').components
mins = t.days * 24 * 60 + t.hours * 60 + t.minutes
print(str(np.sign(mins))[0] + str(divmod(abs(mins), 60)[0]).zfill(2)
      + ':' + str(divmod(abs(mins), 60)[1]).zfill(2))
>> -00:30
I was looking for something similar (see https://github.com/pandas-dev/pandas/issues/17232).
I'm not sure if it will be implemented in Pandas, so here is a workaround:
import pandas as pd

def timedelta2str(td, display_plus=False, format=None):
    """
    Parameters
    ----------
    format : None|all|even_day|sub_day|long

    Returns
    -------
    converted : string of a Timedelta

    >>> td = pd.Timedelta('00:00:00.000')
    >>> timedelta2str(td)
    '0 days'

    >>> td = pd.Timedelta('00:01:29.123')
    >>> timedelta2str(td, display_plus=True, format='sub_day')
    '+ 00:01:29.123000'

    >>> td = pd.Timedelta('-00:01:29.123')
    >>> timedelta2str(td, display_plus=True, format='sub_day')
    '- 00:01:29.123000'
    """
    td_zero = pd.Timedelta(0)
    sign_sep = ' '
    if td >= td_zero:
        s = td._repr_base(format=format)
        if display_plus:
            s = "+" + sign_sep + s
        return s
    else:
        s = timedelta2str(-td, display_plus=False, format=format)
        s = "-" + sign_sep + s
        return s

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Pandas Apply(), Transform() ERROR = invalid dtype determination in get_concat_dtype

Following on from this question, which I link as background; this question is standalone.
Four questions:
1) I cannot understand the error I see when using apply or transform: "invalid dtype determination in get_concat_dtype"
2) Why does ClipAndNetMean work but the other two methods not?
3) Unsure if or why I need the .copy(deep=True)
4) Why is slightly different syntax needed to call the InnerFoo function?
The DataFrame:
              cost
section item
11      1       25
        2      100
        3       77
        4       10
12      5       50
        1       39
        2        7
        3       32
13      4       19
        1       21
        2       27
The code:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'section': [11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13],
                        'item': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2],
                        'cost': [25., 100., 77., 10., 50., 39., 7., 32., 19., 21., 27.]})
df.set_index(['section', 'item'], inplace=True)

upper = 50
lower = 10

def ClipAndNetMean(cost, upper, lower):
    avg = cost.mean()
    new_cost = (cost - avg).clip(lower, upper)
    return new_cost

def MiniMean(cost, upper, lower):
    cost_clone = cost.copy(deep=True)
    cost_clone['A'] = lower
    cost_clone['B'] = upper
    v = cost_clone.apply(np.mean, axis=1)
    return v.to_frame()

def InnerFoo(lower, upper):
    def inner(group):
        group_clone = group.copy(deep=True)
        group_clone['lwr'] = lower
        group_clone['upr'] = upper
        v = group_clone.apply(np.mean, axis=1)
        return v.to_frame()
    return inner

# These 2 work fine.
print(df.groupby(level='section').apply(ClipAndNetMean, lower, upper))
print(df.groupby(level='section').transform(ClipAndNetMean, lower, upper))

# apply works but not transform
print(df.groupby(level='section').apply(MiniMean, lower, upper))
print(df.groupby(level='section').transform(MiniMean, lower, upper))

# apply works but not transform
print(df.groupby(level='section').apply(InnerFoo(lower, upper)))
print(df.groupby(level='section').transform(InnerFoo(lower, upper)))
exit()
Following up on Chris's answer: note that if I add back the column header, the methods will work in a transform call; see v.columns = ['cost'] below.
def MiniMean(cost, upper, lower):
    cost_clone = cost.copy(deep=True)
    cost_clone['A'] = lower
    cost_clone['B'] = upper
    v = cost_clone.apply(np.mean, axis=1)
    v = v.to_frame()
    v.columns = ['cost']
    return v

def InnerFoo(lower, upper):
    def inner(group):
        group_clone = group.copy(deep=True)
        group_clone['lwr'] = lower
        group_clone['upr'] = upper
        v = group_clone.apply(np.mean, axis=1)
        v = v.to_frame()
        v.columns = ['cost']
        return v
    return inner
1 & 2) transform expects something "like-indexed", while apply is flexible. The two failing functions are adding additional columns.
3) In some cases (e.g. if you're passing a whole DataFrame into a function) it can be necessary to copy to avoid mutating the original. It should not be necessary here.
4) The first two functions take a DataFrame plus two parameters and return data. InnerFoo actually returns another function, so it needs to be called before being passed into apply.
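To illustrate point 4 with a sketch: InnerFoo is a factory; calling it binds lower and upper and hands back the inner function, and that returned closure is what groupby.apply actually runs on each group.

foo = InnerFoo(lower, upper)   # returns the `inner` closure, not a result
print(foo)                     # <function InnerFoo.<locals>.inner at 0x...>
print(df.groupby(level='section').apply(foo))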