Loop over Pandas dataframe to populate list (Python) - pandas

I have the following dataframe:
import pandas as pd
action = ['include','exclude','ignore','include', 'exclude', 'exclude','ignore']
names = ['john','michael','joshua','peter','jackson','john', 'erick']
df = pd.DataFrame(list(zip(action,names)), columns = ['action','names'])
I also have a list of starting participants like this:
participants = [['michael','jackson','jeremiah','martin','luis']]
I want to iterate over df['action']. If df['action'] == 'include', add another list to the participants list that includes all previous names and the one in df['names']. So, after the first iteration, participants list should look like this:
participants = [['michael','jackson','jeremiah','martin','luis'],['michael','jackson','jeremiah','martin','luis','john']]
I have managed to achieve this with the following code (I don't know if this part could be improved, although it is not my question):
for i, row in df.iterrows():
    if df.at[i, 'action'] == 'include':
        person = [df.at[i, 'names']]
        old_list = participants[-1]
        new_list = old_list + person
        participants.append(new_list)
    else:
        pass
The main problem (and my question) is: how do I accomplish the same thing, but removing the name when df['action'] == 'exclude'? So, after the second iteration, I should have this list in participants:
participants = [['michael','jackson','jeremiah','martin','luis'],['michael','jackson','jeremiah','martin','luis','john'],['jackson','jeremiah','martin','luis','john']]

You can just add an elif to your code. With the remove method you can remove an item by value. Just be careful: your person is a list and not a string, so call it by index with [0]. Also note that remove mutates the list in place and returns None, so append a copy of the updated list rather than the return value:
elif df.at[i, 'action'] == 'exclude':
    person = [df.at[i, 'names']]
    new_list = participants[-1].copy()  # copy so the previous snapshot stays intact
    new_list.remove(person[0])          # remove() works in place and returns None
    participants.append(new_list)
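For reference, here is the whole loop with both branches together (a sketch based on the code above; a name that is excluded but missing from the current list would raise ValueError):
for i, row in df.iterrows():
    action = df.at[i, 'action']
    name = df.at[i, 'names']
    if action == 'include':
        participants.append(participants[-1] + [name])
    elif action == 'exclude':
        new_list = participants[-1].copy()
        new_list.remove(name)  # raises ValueError if name is absent
        participants.append(new_list)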

Change number format in Excel using names of headers - openpyxl [duplicate]

I have an Excel (.xlsx) file that I'm trying to parse, row by row. I have a header (first row) that has a bunch of column titles like School, First Name, Last Name, Email, etc.
When I loop through each row, I want to be able to say something like:
row['School']
and get back the value of the cell in the current row and the column with 'School' as its title.
I've looked through the OpenPyXL docs but can't seem to find anything terribly helpful.
Any suggestions?
I'm not incredibly familiar with OpenPyXL, but as far as I can tell it doesn't have any kind of dict reader/iterator helper. However, it's fairly easy to iterate over the worksheet rows, as well as to create a dict from two lists of values.
def iter_worksheet(worksheet):
    # It's necessary to get a reference to the generator, as
    # `worksheet.rows` returns a new iterator on each access.
    rows = worksheet.rows
    # Get the header values as keys and move the iterator past the header row.
    keys = [c.value for c in next(rows)]
    for row in rows:
        values = [c.value for c in row]
        yield dict(zip(keys, values))
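A quick usage sketch (the file name and column title here are hypothetical):
from openpyxl import load_workbook

wb = load_workbook('example.xlsx')
for row in iter_worksheet(wb.active):
    print(row['School'])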
Excel sheets are far more flexible than CSV files, so it makes little sense to have something like DictReader.
Just create an auxiliary dictionary from the relevant column titles.
If you have columns like "School", "First Name", "Last Name", "EMail", you can create the dictionary like this:
# `values` holds the cell values of the header row.
keys = dict((value, idx) for (idx, value) in enumerate(values))
for row in list(ws.rows)[1:]:  # skip the header row
    school = row[keys['School']].value
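For completeness, `values` can be taken from the first worksheet row, for example like this:
header_row = next(ws.iter_rows(min_row=1, max_row=1))
values = [c.value for c in header_row]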
I wrote a DictReader based on openpyxl. Save the second listing to a file named 'excel.py' and use it the same way as csv.DictReader. See the usage example in the first listing.
from excel import DictReader

with open('example01.xlsx', 'rb') as source_data:
    for row in DictReader(source_data, sheet_index=0):
        print(row)
excel.py:
__all__ = ['DictReader']

from openpyxl import load_workbook
from openpyxl.cell import Cell

# Change the default value for a Cell from None to '' so blank cells
# behave the same way as in csv.DictReader.
Cell.__init__.__defaults__ = (None, None, '', None)

class DictReader(object):
    def __init__(self, f, sheet_index,
                 fieldnames=None, restkey=None, restval=None):
        self._fieldnames = fieldnames  # list of keys for the dict
        self.restkey = restkey  # key to catch long rows
        self.restval = restval  # default value for short rows
        self.reader = load_workbook(f, data_only=True).worksheets[sheet_index].iter_rows(values_only=True)
        self.line_num = 0

    def __iter__(self):
        return self

    @property
    def fieldnames(self):
        if self._fieldnames is None:
            try:
                self._fieldnames = next(self.reader)
                self.line_num += 1
            except StopIteration:
                pass
        return self._fieldnames

    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value

    def __next__(self):
        if self.line_num == 0:
            # Used only for its side effect of consuming the header row.
            self.fieldnames
        row = next(self.reader)
        self.line_num += 1
        # Unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None values.
        while row == ():
            row = next(self.reader)
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d
The following seems to work for me:
header = True
headings = []
for row in ws.rows:
    if header:
        for cell in row:
            headings.append(cell.value)
        header = False
        continue
    rowData = dict(zip(headings, row))
    wantedValue = rowData['myHeading'].value
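A variant of the same pattern that skips the Cell objects entirely, using values_only (available in newer openpyxl versions):
rows = ws.iter_rows(values_only=True)
headings = next(rows)  # the first row holds the column titles
for values in rows:
    rowData = dict(zip(headings, values))
    wantedValue = rowData['myHeading']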
I was running into the same issue as described above, so I created a simple extension called openpyxl-dictreader that can be installed through pip. It is very similar to the suggestion made by @viktor earlier in this thread.
The package is largely based on the source code of Python's native csv.DictReader class. It allows you to select items based on column names using openpyxl. For example:
import openpyxl_dictreader

reader = openpyxl_dictreader.DictReader("names.xlsx", "Sheet1")
for row in reader:
    print(row["First Name"], row["Last Name"])
Putting this here for reference.

Pandas: SUM of list items in a DataFrame

Below I have the list bu_lst, which I'm passing to a dataframe df2 to sum each individual item in the list. How could I achieve that in one go, so I do not repeat it multiple times?
bu_lst = ['FPG','IPG','DSG','STG','WFO','IT']
FPG = ['ADE','FPG AE','FPG PE','MMSIM','OrCAD','Virtuoso DashBoard','SPB AE','SPB PE']
IPG = ['DDR','DDR_DT','Tensilica']
DSG = ['FLA','FLS','FEQoS','IFD PT','Sasus R&D','Sasus PE','Voltus','Tempus','Quantus','Genus']
STG = ['ATS','HST','TIP','System Engineering']
WFO = ['Academic Network','CRAFT','Chip Estimate','Education Services','Licensing','Sales','Services','VCAD']
IT = ['App Development','Cumulus','InfoSec']
My current approach:
print(df2[FPG].sum())
print(df2[IPG].sum())
print(df2[DSG].sum())
print(df2[STG].sum())
print(df2[WFO].sum())
print(df2[IT].sum())
I just took the relevant lines of code to show here.
You can create a dictionary of lists and then use a dictionary comprehension, provided the lists contain column names:
d = {'FPG': FPG, 'IPG': IPG, ...}
d2 = {k: df2[v].sum() for k, v in d.items()}
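For illustration, here is the same idea with the dictionary written out in full, using the six group lists defined in the question:
d = {'FPG': FPG, 'IPG': IPG, 'DSG': DSG, 'STG': STG, 'WFO': WFO, 'IT': IT}
d2 = {k: df2[v].sum() for k, v in d.items()}
for name, totals in d2.items():
    print(name)
    print(totals)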

How to map different indices in Pyomo?

I am a new Pyomo/Python user. I need to formulate one set of constraints with index 'n', where all 3 components have different indices but correlate with index 'n'. I am curious how I can map the relationship between these sets.
In my case, I read csv files whose indices are related to 'n' to generate my sets. For example: a1.n1, a2.n3, a3.n5 /// b1.n2, b2.n4, b3.n6, b4.n7 /// c1.n1, c2.n2, c3.n4, c4.n6 ///. The constraint expressions for indices n1 and n2, for example, are as follows:
for n1: P(a1.n1) + L(c1.n1) == D(n1)
for n2: - F(b1.n2) + L(c2.n2) == D(n2)
Now let's get to the coding. The set-creation code is as follows; it lives within a class:
import pyomo
import pandas
import pyomo.opt
import pyomo.environ as pe

class MyModel:
    def __init__(self, Afile, Bfile, Cfile):
        self.A_data = pandas.read_csv(Afile)
        self.A_data.set_index(['a'], inplace=True)
        self.A_data.sort_index(inplace=True)
        self.A_set = self.A_data.index.unique()
        # ... ...
Then I tried to map the relationship in the constraint construction as follows:
    def createModel(self):
        self.m = pe.ConcreteModel()
        self.m.A_set = pe.Set(initialize=self.A_set)

        def obj_rule(m):
            return ...
        self.m.OBJ = pe.Objective(rule=obj_rule, sense=pe.minimize)

        def constr(m, n):
            As = self.A_data.reset_index()
            Amap = As[As['n'] == n]['a']
            Bs = self.B_data.reset_index()
            Bmap = Bs[Bs['n'] == n]['b']
            Cs = self.C_data.reset_index()
            Cmap = Cs[Cs['n'] == n]['c']
            return sum(m.P[(p, n)] for p in Amap) - sum(m.F[(s, n)] for s in Bmap) + sum(m.L[(r, n)] for r in Cmap) == self.D_data.ix[n, 'D']
        self.m.cons = pe.Constraint(self.m.D_set, rule=constr)

    def solve(self):
        # ... ...
Finally, this error is raised when I run it:
KeyError: "Index '(1, 1)' is not valid for indexed component 'P'"
I know it is the wrong way, so I am wondering if there is a good way to map their relationships. Thanks in advance!
Gabriel
I forgot to post the answer to my own question when I solved it a week ago. The key to this problem is setting up a mapped index.
Let me modify the code in the question. First, we need to modify the dataframe to include the information of the mapped indices. Then the set for the mapped index can be constructed, taking 2 mapped indices as an example:
self.m.A_set = pe.Set( initialize = self.A_set, dimen = 2 )
The names of the two mapped indices are 'alpha' and 'beta' respectively. Then the constraint can be formulated, based on the variables declared at the beginning:
def constr(m, n):
    Amap = self.A_data[self.A_data['alpha'] == n]['beta']
    Bmap = self.B_data[self.B_data['alpha'] == n]['beta']
    return sum(m.P[(i, n)] for i in Amap) + sum(m.L[(r, n)] for r in Bmap) == D.loc[n, 'D']
m.TravelingBal = pe.Constraint(m.A_set, rule=constr)
The summation groups all associated B to A with a mapped index set.
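To sketch what the mapping lookup does in pandas terms (the column names 'alpha' and 'beta' match the answer; the data here is hypothetical):
import pandas as pd

# Each row links a component index (beta) to the constraint index (alpha) it belongs to.
A_data = pd.DataFrame({'alpha': ['n1', 'n3', 'n5'],
                       'beta': ['a1', 'a2', 'a3']})

# For a given constraint index n, collect all mapped component indices.
n = 'n1'
Amap = A_data[A_data['alpha'] == n]['beta']
print(list(Amap))  # ['a1']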

How to get dates along with the functions I perform?

My initial data frame is like this:
import pandas as pd
df = pd.DataFrame({'serialNo': ['aaaa', 'aaaa', 'cccc', 'ffff'],
                   'Date': ['2018-09-15', '2018-09-16', '2018-09-15', '2018-09-19'],
                   'moduleLocation': ['face', 'head', 'stomach', 'legs'],
                   'moduleName': ['singing', 'dance', 'booze', 'vocals'],
                   'warning': [4402, 3747, 5555, 8754],
                   'failed': [0, 3462, 5161, 3262]})
I have performed the following functions to clean up the data. The first makes all the datatypes strings:
all_columns = list(df)
df[all_columns] = df[all_columns].astype(str)
This is followed by the function to perform certain concatenations:
def concatenate(diagnostics, field, target):
    diagnostics.sort_values(by=['serialNo', field], inplace=True)
    diagnostics.drop_duplicates(inplace=True)
    diagnostics[target] = \
        diagnostics.groupby(['serialNo'], as_index=False)[field].transform(lambda s: ','.join(filter(None, s)))
    diagnostics.drop([field], axis=1, inplace=True)
    diagnostics.drop_duplicates(inplace=True)
    return diagnostics
module = concatenate(df[['serialNo','moduleName']], 'moduleName', 'Module')
Warn = concatenate(df[['serialNo','warning']], 'warning', 'Warn')
Err = concatenate(df[['serialNo','failed']], 'failed', 'Err')
Location = concatenate(df[['serialNo','moduleLocation']], 'moduleLocation', 'Location')
diag_final = pd.merge(module,Warn,on=['serialNo'],how='inner')
diag_final = pd.merge(diag_final,Err,on=['serialNo'],how='inner')
diag_final = pd.merge(diag_final,Location,on=['serialNo'],how='inner')
Now the problem is that the Date column no longer exists in my diag_final data frame, and I would like to have it. I do not want to make changes to the existing function, just make sure that I keep the corresponding dates. How should I achieve this?
There are likely to be multiple values for each serial number, so you will have to concatenate the values, similar to what you are doing for moduleLocation and moduleName:
dates = concatenate(df[['serialNo','Date']], 'Date', 'Date_cat')
diag_final = pd.merge(diag_final,dates,on=['serialNo'],how='inner')
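With the sample frame from the question, dates then holds one concatenated entry per serial number (output shown approximately):
print(dates)
#   serialNo               Date_cat
# 0     aaaa  2018-09-15,2018-09-16
# 2     cccc             2018-09-15
# 3     ffff             2018-09-19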

Populating data to individual columns in pandas dataframe

I am trying to get the data from the list (list_addresses) and populate it into different columns of the dataframe (dfloc). I use the code below, but I'm not sure where I am going wrong.
Values are present in list_addresses but are not getting populated into the dataframe.
Any help would be appreciated.
for index in range(len(list_addresses)):
    location = geolocator.reverse([list_addresses[index][0], list_addresses[index][1]])
    dfloc.loc[dfloc.Latitude] = list_addresses[index][0]
    dfloc.loc[dfloc.Longitude] = list_addresses[index][1]
    dfloc.loc[dfloc.Address] = location.address
So it looks like you have a list of lists or tuples of the form [(lat1, lon1), (lat2, lon2), ...]. I like to make a list for each column, then assign each entire column at once:
lat_list = [x[0] for x in list_addresses]
lon_list = [x[1] for x in list_addresses]
address_list = []
for index in range(len(list_addresses)):
    location = geolocator.reverse([list_addresses[index][0], list_addresses[index][1]])
    address_list.append(location.address)
dfloc['Latitude'] = lat_list
dfloc['Longitude'] = lon_list
dfloc['Address'] = address_list
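A slightly more compact variant of the same idea, unpacking each (lat, lon) pair directly (a sketch; geolocator is assumed to be an initialized geopy geocoder, as in the question):
lat_list, lon_list, address_list = [], [], []
for lat, lon in list_addresses:
    lat_list.append(lat)
    lon_list.append(lon)
    address_list.append(geolocator.reverse([lat, lon]).address)
dfloc['Latitude'] = lat_list
dfloc['Longitude'] = lon_list
dfloc['Address'] = address_list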