Define a new column based on preferences in one set of columns, and available resources in another - pandas

I have a dataframe with many columns, of which 7 are relevant here.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Shipment ID': [1, 2, 3, 4, 5, 6],
                    'Pref 1': ['UPS', 'DHL', 'DHL', 'ARA', 'USPS', 'FED'],
                    'Pref 2': ['DHL', '', 'FED', 'FED', 'UPS', 'USPS'],
                    'Pref 3': ['FED', '', '', 'DHL', 'ARA', ''],
                    'BudgetUPS': [np.nan, 'No', 'Yes', np.nan, 'No', 'Yes'],
                    'BudgetUSPS': ['Yes', 'Yes', 'Yes', np.nan, 'Yes', 'No'],
                    'BudgetFED': ['No', 'Yes', np.nan, 'Yes', 'Yes', 'No'],
                    'BudgetARA': ['Yes', np.nan, np.nan, np.nan, np.nan, 'Yes'],
                    'BudgetDHL': ['No', 'Yes', 'Yes', np.nan, 'Yes', 'Yes']})
The data here represents the top 3 customer preferences for a shipping agent for each of the shipments being generated by an e-commerce site. The Budget columns specify whether the budget for the corresponding shipping agent is available, not available, or unknown (due to query failure).
What I need to generate is a column (name: Prefnbudget) that picks up the top two (or one, or none) of the preferences for each shipment ID and creates entries like "FED UPS", "USPS", "DHL ARA", "None" for each shipment. The purpose of this step is to a) detect whether a shipment can be processed given both customer preference and the budget constraint (to prevent deadlocks), and b) query the customer for confirmation.
I would like to make the answer as pythonic as possible. It is certainly easy to do this in a loop over a list ['DHL','UPS','USPS','ARA','FED'], etc. but I want something that is more vectorized, and pithily compact.

I copied your dataframe. Below is one way to implement your task:
# this dictionary maps each column name to its positional index in the dataframe
dictIndexToCol = {col_name: i for i, col_name in enumerate(df1.columns)}

def getPref(row, dictIdxToCol=dictIndexToCol):
    """Take the three preferences and keep those whose Budget value is Yes."""
    pref1 = row.iloc[1]
    pref2 = row.iloc[2]
    pref3 = row.iloc[3]
    pref_budget = ""
    if len(pref1) > 0:
        if row.iloc[dictIdxToCol[f"Budget{pref1}"]] == "Yes":
            pref_budget += pref1 + " "
    if len(pref2) > 0:
        if row.iloc[dictIdxToCol[f"Budget{pref2}"]] == "Yes":
            pref_budget += pref2 + " "
    if len(pref3) > 0:
        if row.iloc[dictIdxToCol[f"Budget{pref3}"]] == "Yes":
            pref_budget += pref3
    return pref_budget.strip()

df1["Prefnbudget"] = df1.apply(getPref, axis=1)

Store targets as collections that handle logic operation

I think my title is kinda unclear, but I don't know how to phrase it otherwise.
My problem is:
We have users that belong to groups; there are many types of groups, and every user belongs to exactly one group of each type.
Example: with group types A, B and C, containing respectively the groups (A1; A2; A3), (B1; B2) and (C1; C2; C3)
Every user must have a list of groups like [A1, B1, C1] or [A1, B2, C3], but never [A1, A2, B1] or [A1, C2]
We have messages that target certain groups, but not just as a union: the targeting can involve more complex collection operations
Example: we can have a message intended for [A1, B1, C3], [A1, *, *], [A1|A2, *, *] or even ([A1, B1, C2] | [A2, B2, C1])
(* = any group of the type, | = or)
Messages are stored in a SQL DB, and users can retrieve all messages intended for their groups.
How should I store the messages and write my query to reproduce this behavior?
An option could be to encode both the user groups and the message targets in a (big) integer built on the powers of 2, and then base your query on a bitwise AND between user group code and message target code.
The idea is, group 1 is 1, group 2 is 2, group 3 is 4 and so on.
Level 1:
Assumptions:
you know in advance how many group types you have, and you have very few of them
you don't have more than 64 groups per type (assuming you work with 64-bit integers)
the message has only one target: A1|A2,B..,C... is ok, A*,B...,C... is ok, (A1,B1,C1)|(A2,B2,C2) is not.
Solution:
Encode each user group as the corresponding power of 2
Encode each message target as the sum of the allowed values: if groups 1 and 3 are allowed (A1|A3) the code will be 1+4=5, if all groups are allowed (A*) the code will be 2**64-1
you will have a User table and a Message table, and both will have one field for each group type code
The query will be WHERE (u.g1 & m.g1) * (u.g2 & m.g2) * ... * (u.gN & m.gN) <> 0
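To make the Level 1 encoding concrete, here is a minimal Python sketch; the group names and counts are illustrative, not part of the original answer:
def encode_groups(allowed, total=64):
    """allowed is a list of 1-based group numbers, or '*' for every group of the type."""
    if allowed == '*':
        return 2**total - 1
    return sum(2**(g - 1) for g in allowed)

user    = {'A': encode_groups([1]),    'B': encode_groups([1]), 'C': encode_groups([3])}
message = {'A': encode_groups([1, 2]), 'B': encode_groups('*'), 'C': encode_groups('*')}

# mirrors the SQL predicate (u.g1 & m.g1) * (u.g2 & m.g2) * ... <> 0
matches = all(user[t] & message[t] for t in user)   # True: the user is in A1, which is targeted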
Level 2:
Assumptions:
you have some more group types, and/or you don't know in advance how many they are, or how they are composed
you don't have more than 64 groups in total (e.g. 10 for the first type, 12 for the second, ...)
the message still has only one target as above
Solution:
encode each user group and each message target as a single integer, taking care of the offset: if the first type has 10 groups they will be encoded from 1 to 1023 (2**10-1); then if the second type has 12 groups they will go from 1024 (2**10) up to 4194303 (2**(10+12)-1), and so on
you will still have a User table and a Message table, and both will have one single field for the cumulative code
you will need to define a function which is able to check the user group vs the message target separately by each range; this can be difficult to do in SQL, and depends on which engine you are using
The following is a Python implementation of both the encoding and the check:
class IdEncoder:
    def __init__(self, sizes):
        # sizes: number of groups in each group type, in order
        self.sizes = sizes
        self.grouplimits = {}
        offset = 0
        for i, size in enumerate(sizes):
            # (lowest bit value, highest cumulative value) reserved for this type
            self.grouplimits[i] = (2**offset, 2**(offset + size) - 1)
            offset += size

    def encode(self, vals):
        # vals: one spec per type, e.g. '1|2' or '*'
        n = 0
        for i, val in enumerate(vals):
            if val == '*':
                # all bits of this type's range set
                g = self.grouplimits[i][1] - self.grouplimits[i][0] + 1
            else:
                svals = val.split('|')
                g = 0
                for sval in svals:
                    g += 2**(int(sval) - 1)
                if i > 0:
                    # shift into this type's bit range
                    g *= self.grouplimits[i][0]
            print(g)  # debug output: code contributed by this type
            n += g
        return n

    def check(self, user, message):
        res = False
        for i, size in enumerate(self.sizes):
            # compare only the bits belonging to the current type
            if (user % 2**size) & (message % 2**size) == 0:
                break
            if i < len(self.sizes) - 1:
                user >>= size
                message >>= size
            else:
                res = True
        return res
c = IdEncoder([10, 12, 10])
m3 = c.encode(['1|2', '*', '*'])
u1 = c.encode(['1', '1', '1'])
c.check(u1, m3)   # True
u2 = c.encode(['4', '1', '1'])
c.check(u2, m3)   # False
Level 3:
Assumptions:
you adopt one of the above solutions, but you need multiple targets for each message
Solution:
You will need a third table, MessageTarget, containing the target code fields as above and a FK linking to the message
The query will search for all the MessageTarget rows compatible with the User group code(s) and show the related Message data
So you have 3 main tables:
Messages
Users
Groups
You then create 2 relationship tables:
Message-Group
User-Group
If you want to limit users to have access to just "their" messages then you join:
User > User-Group > Message-Group > Message

Arcpy Script to loop through field and run Union Analysis

I have a polygon file in the form of a fishnet, and another feature class of polygons named Trawl_Buffers. There is a unique field within Trawl_Buffers based on YEAR. I'd like to create a script to run a selection on YEAR and then perform a union analysis with the fishnet polygon for each YEAR, so the desired output would be "Trawl_Buffers_union2003", "Trawl_Buffers_union2004", etc. I have a function that gets the unique list of the years and puts them in a list which I called vals.
Then it seems I need to run a for loop over this list of unique years, create a temporary selection, and use that as input for the union, but I am having trouble implementing the query process.
Here is where I started, but I'm seriously tripping up:
import arcpy

# Set the data environment
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb'
trawlBuffs = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\buffers\buffers_testing'
fishnet = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\fishnets\vms_net1k'
unionOut = r'C:\Data\working\AK_Fishing_VMS\2021_Delivery\ArcPro_proj\ArcPro_proj.gdb\unions\union'

# function to get unique values for the YEAR field found within the trawlBuffs fc
def unique_values(table, field):
    with arcpy.da.SearchCursor(table, [field]) as cursor:
        return sorted({row[0] for row in cursor})

# Get the unique values for the field 'YEAR' found within the 'trawl_buffs' featureclass table
vals = unique_values(trawlBuffs, "YEAR")

# Create a query string for the selected country
yearSelectionClause = '"YEAR" = ' + "'" + vals + "'"

# loop through the years, create selection, union, make permanent
for year in vals:
    year_layer = str(year) + "_union"
    arcpy.MakeFeatureLayer_management(trawlBuffs, year_layer)
    arcpy.SelectLayerByAttribute_management(year_layer, "NEW_SELECTION", "\"YEAR"\" = %d" % (year))
    arcpy.Union_analysis(fishnet, year_layer, unionOut)
    arcpy.CopyFeatures_management(year_layer, "union_" + str(year))
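One way to get the loop running is sketched below. It is an illustration rather than a tested answer: it assumes YEAR is stored as a number (wrap the value in single quotes in the where clause if the field is text), and the output naming simply appends the year to the unionOut path.
for year in vals:
    year_layer = "buffers_{0}".format(year)
    where_clause = '"YEAR" = {0}'.format(year)
    # make a layer holding only this year's buffers
    arcpy.MakeFeatureLayer_management(trawlBuffs, year_layer, where_clause)
    # union with the fishnet and write e.g. ...\unions\union2003
    arcpy.Union_analysis([fishnet, year_layer], unionOut + str(year))
    arcpy.Delete_management(year_layer)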

Create variable based on value in multiple columns?

There is a rather large Stata dataset (education) with 60+ variables devoted to 'exam taken' information and a few others based on student gender, age, demographics, etc. There are tens of thousands of students (rows). Unfortunately the grades on various tests are not standard (combo of letters and numbers, and may appear in any of the 60+ columns for each student, depending on when they took the relevant exam). I'm trying to create a new variable, identifying all those who took some variation of the G40 or G41 exam at this time. The grade columns are all assigned as dx with a number, so I've started by trying the following:
gen byte event = 0
replace event = 1 if dx1 == "G40" | dx1 == "G41"| dx2 == "G40" | dx2 == "G41" | dx3 == "G40" | dx3 == "G41" | dx4 == "G40" | dx4 == "G41" | dx5 == "G40" | dx5 == "G41" & age < 12
I don't want to write out every single one of the 60+ columns each time I'm making a new variable for a new exam. Is there a faster way of doing this?
I am going to show two techniques, as one is good for the smaller code example you give and one is better for 60+ "columns" (variables!).
Just your example I would tend to write as one line
gen byte event = ( inlist("G40", dx1, dx2, dx3, dx4, dx5) | ///
                   inlist("G41", dx1, dx2, dx3, dx4, dx5) ) & age < 12
For 60+ such variables I would write a loop.
gen byte event = 0

foreach v of var dx* {
    display "`v' " _c
    replace event = 1 if inlist(`v', "G40", "G41") & age < 12
}
where, for purposes of debugging or just understanding, the output is noisier than would be customary once the operations seem routine. A standard trick with inlist() is to note that a test of the form foo == whatever is the same as a test of whatever == foo, so there is often a choice about which argument comes first and which argument(s) follow.

How can I add an array of values to Google OR-Tools versus a lower and upper bound?

In the documentation and all the examples I can find (in terms of nurse scheduling, at least), everyone just declares shift values within a search space of {1,4}, let's say, for shifts 1, 2, 3, 4:
from ortools.constraint_solver import pywrapcp

solver = pywrapcp.Solver("schedule_shifts")
num_nurses = 4
num_shifts = 4  # Nurse assigned to shift 0 means not working that day.
num_days = 7
# [START]
# Create shift variables.
shifts = {}
for j in range(num_nurses):
    for i in range(num_days):
        shifts[(j, i)] = solver.IntVar(0, num_shifts - 1, "shifts(%i,%i)" % (j, i))
shifts_flat = [shifts[(j, i)] for j in range(num_nurses) for i in range(num_days)]
# Create nurse variables.
nurses = {}
for j in range(num_shifts):
    for i in range(num_days):
        nurses[(j, i)] = solver.IntVar(0, num_nurses - 1, "shift%d day%d" % (j, i))
I want to avoid using a range of values when I call solver.IntVar(lowerbound, upperbound, ...)
I want something like solver.IntVar([available values that you can choose], ...)
I created a matrix with all shifts as the columns, flowing from the first day to the last. My row indexes don't matter, but in each day/shift column I have the index values of nurses ranked in descending order of who bid the highest for that shift. I then want to create a constraint where, if I choose a nurse, I choose the maximum bid that is allowed via the other constraints from that column, but I don't know how to do that given the limited documentation OR-Tools has for the Python IntVar.
Can you try
solver.IntVar([values...], 'name')
It should work.
See https://github.com/google/or-tools/blob/master/examples/python/einav_puzzle2.py
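For illustration, here is a minimal sketch of that call with the classic CP solver; the bid values and the variable name are made up, not taken from the question:
from ortools.constraint_solver import pywrapcp

solver = pywrapcp.Solver("domain_example")
allowed_bids = [3, 7, 12, 20]                 # only these values may be chosen
x = solver.IntVar(allowed_bids, "x")          # domain given as an explicit list
db = solver.Phase([x], solver.CHOOSE_FIRST_UNBOUND, solver.ASSIGN_MIN_VALUE)
solver.NewSearch(db)
while solver.NextSolution():
    print(x.Value())                          # prints 3, 7, 12, 20
solver.EndSearch()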

Podio calculations - how to get the source of the entry chosen in a relationship field?

Let's say I have a relationship field H where the user can choose to add a single entry from any of three different apps (X app, Y app, Z app).
I would like to use this information in a calculation field to calculate results depending on which app the info comes from, rather than the values of the incoming entry.
Is that possible?
If there is always only one related item allowed, you can use .length. In this case it counts the number of related items:
var $AppX = #TitleFieldAppX.length;
var $AppY = #TitleFieldAppY.length;
var $AppZ = #TitleFieldAppZ.length;
$AppX == 1 ? "Calculate this" : $AppY == 1 ? "Calculate that" : $AppZ == 1 ? "Calculate else" : null
If an item from App X is related, then do this calculation; if from App Y, ...