Querying election results in ArcGIS

I'm building electoral maps from mayoral races in ArcGIS over the last 3 election cycles, trying to see if a particular part of town votes like the rest of the town.
There are multiple candidates for mayor and an example table looks like this:
PRECINCT_NUMBER  TOTAL_BALLOTSCAST  CANDIDATE1  CANDIDATE2  CANDIDATE3
500              1000               500         300         200
501              2000               800         700         500
502              1000               400         500         100
I can show how a specific candidate did by precinct by calculating CANDIDATEx / TOTAL_BALLOTSCAST to get a percentage for each precinct, and then displaying the precincts classed by percentage for each candidate.
However, I am also trying to show who was the biggest winner in each precinct by percentage and I am not sure how to query the data to reflect this.

To accomplish this within ArcMap:
Create a new field (e.g. WINNER) and use the Field Calculator to compare the values from the three candidate fields and output the maximum (biggest winner) as follows:
Check "Show Codeblock" to use an advanced Python expression in the Field Calculator.
The expression should be:
find_winner(!CANDIDATE1!, !CANDIDATE2!, !CANDIDATE3!)
The pre-logic script code is where you define the function find_winner that the expression is calling:
def find_winner(a, b, c):
    if a > b and a > c:  # a got more votes than b and c
        return "Candidate 1"
    elif b > a and b > c:  # b got more votes than a and c
        return "Candidate 2"
    elif c > a and c > b:  # c got more votes than a and b
        return "Candidate 3"
    else:
        return "TIE?"
Once WINNER is populated with values, it should be straightforward to symbolize your polygons based on that attribute.
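If more candidates get added later, the same pre-logic script could be written with a dictionary and max() instead of one branch per candidate. A minimal sketch, assuming the same three fields and with the label strings purely illustrative:

def find_winner(a, b, c):
    # Map a display label to each candidate's vote count.
    votes = {"Candidate 1": a, "Candidate 2": b, "Candidate 3": c}
    top = max(votes.values())
    winners = [name for name, v in votes.items() if v == top]
    # One name means a clear winner; more than one means a tie.
    return winners[0] if len(winners) == 1 else "TIE?"

The calculator expression stays find_winner(!CANDIDATE1!, !CANDIDATE2!, !CANDIDATE3!); only the code block changes.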


Combining multiple dataframe columns into a single time series

I have built a financial model in Python where I can enter sales and profit for x years in y scenarios - a base scenario plus however many I add.
Annual figures are uploaded per scenario in my first dataframe (e.g. if x = 5 beginning in 2022, the base scenario sales column shows figures for 2022, 2023, 2024, 2025 and 2026).
I then use monthly weightings to create a monthly phased sales forecast in a new dataframe, with columns titled Base sales 2022, Base sales 2023, Base sales 2024, etc., each shown monthly.
I want to show these figures in a single series, so that I have a single time series for base sales from Jan 2022 to Dec 2026 for charting and analysis purposes.
I've managed to get this to work by creating a list and manually adding the names of each column I want to include, but this won't work if I have a different number of scenarios or years, so I'm trying to automate the process and can't find a way to do it.
I don't want to share my main model code, but I have created a mini model below doing a similar thing. It doesn't work: although it generates most of the output I want (three lists are requested: listA0, listA1, listA2), the lists clearly aren't created, as they aren't callable. Also, I really need all the text on a single line rather than split over multiple lines (or perhaps I should use list append for each subsequent item). Any help gratefully received.
Below is the code I have tried:
import pandas as pd

# Create the list of scenarios and capture the number for use later
Scenlist = ["Bad", "Very bad", "Terrible"]
Scen_number = 3
# Create the list of years under assessment and count the number of years
Years = [2020, 2021, 2022]
Totyrs = len(Years)
# Create the dataframe dprofit and, for example purposes, create the columns, all showing two datapoints 10 and 10
dprofit = pd.DataFrame()
a = 0
b = 0
# This creates column names in the format Bad profit 2020, Bad profit 2021 etc
while a < Scen_number:
    while b < Totyrs:
        dprofit[Scenlist[a] + " profit " + str(Years[b])] = [10, 10]
        b = b + 1
    b = 0
    a = a + 1
# Now that the columns have been created, print the table
print(dprofit)
# Now create the new table dprofit2 which will be used to capture the three columns (bad, very bad and terrible) for the full time period by listing the years one after another
dprofit2 = pd.DataFrame()
# Create the output to recall the columns from dprofit to combine into 3 lists: listA0, listA1 and listA2
a = 0
b = 0
Totyrs = len(Years)
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]")
        b = b + 1
    b = 0
    a = a + 1
print(listA0)
# print(listA0) will not run: NameError: name 'listA0' is not defined. Did you mean: 'list'?
To fix the printing you could set the end parameter to end=''.
# Reset the counters (and add a list to collect the pieces) before re-running the loop
a = 0
b = 0
results = []
while a < Scen_number:
    while b < Totyrs:
        if b == 0:
            print(f"listA{a}=dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        else:
            print(f"+dprofit[{Scenlist[a]} profit {Years[b]}]", end="")
        results.append([Scenlist[a], Years[b]])
        b = b + 1
    print()
    b = 0
    a = a + 1
Output:
listA0=dprofit[Bad profit 2020]+dprofit[Bad profit 2021]+dprofit[Bad profit 2022]
listA1=dprofit[Very bad profit 2020]+dprofit[Very bad profit 2021]+dprofit[Very bad profit 2022]
listA2=dprofit[Terrible profit 2020]+dprofit[Terrible profit 2021]+dprofit[Terrible profit 2022]
To obtain a list or pd.DataFrame of the columns, you could simply filter() for the required columns. No loop required.
listA0 = dprofit.filter(regex="Bad profit", axis=1)
listA1 = dprofit.filter(regex="Very bad profit", axis=1)
listA2 = dprofit.filter(regex="Terrible profit", axis=1)
print(listA1)
Output for listA1:
   Very bad profit 2020  Very bad profit 2021  Very bad profit 2022
0                    10                    10                    10
1                    10                    10                    10
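If the end goal is one continuous series per scenario rather than a frame of separate yearly columns, the filtered columns can also be stacked end to end with pd.concat. A minimal sketch continuing from the dprofit example above (the dictionary and series naming are illustrative, not part of the original model):

import pandas as pd

# Stack each scenario's yearly columns one after another into a single
# Series, e.g. "Bad profit 2020/2021/2022" becomes one chronological series.
series_by_scenario = {}
for scen in Scenlist:
    cols = [c for c in dprofit.columns if c.startswith(scen + " profit ")]
    stacked = pd.concat([dprofit[c] for c in cols], ignore_index=True)
    stacked.name = scen + " profit"
    series_by_scenario[scen] = stacked

print(series_by_scenario["Bad"])

The same loop works for any number of scenarios or years, since the column names are discovered from the dataframe rather than typed out.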

Create variable based on value in multiple columns?

There is a rather large Stata dataset (education) with 60+ variables devoted to 'exam taken' information and a few others based on student gender, age, demographics, etc. There are tens of thousands of students (rows). Unfortunately the grades on various tests are not standard (combo of letters and numbers, and may appear in any of the 60+ columns for each student, depending on when they took the relevant exam). I'm trying to create a new variable, identifying all those who took some variation of the G40 or G41 exam at this time. The grade columns are all assigned as dx with a number, so I've started by trying the following:
gen byte event = 0
replace event = 1 if dx1 == "G40" | dx1 == "G41"| dx2 == "G40" | dx2 == "G41" | dx3 == "G40" | dx3 == "G41" | dx4 == "G40" | dx4 == "G41" | dx5 == "G40" | dx5 == "G41" & age < 12
I don't want to write out every single one of the 60+ columns each time I'm making a new variable for a new exam. Is there a faster way of doing this?
I am going to show two techniques, as one is good for the smaller code example you give and one is better for 60+ "columns" (variables!).
Just your example I would tend to write as one line
gen byte event = ( inlist("G40", dx1, dx2, dx3, dx4, dx5) | ///
                   inlist("G41", dx1, dx2, dx3, dx4, dx5) ) & age < 12
For 60+ such variables I would write a loop.
gen byte event = 0
foreach v of var dx* {
    display "`v' " _c
    replace event = 1 if inlist(`v', "G40", "G41") & age < 12
}
where for purposes of debugging, or just understanding, the output is noisier than would be customary once the operations seem routine. A standard trick with inlist() is to note that a test of the form foo == whatever is the same as a test of whatever == foo so there is often a choice about which argument is first and which other argument(s) follow.

Creating similar samples based on three different categorical variables

I am trying to do an analysis where I create two similar samples based on three different attributes. I want to create these samples first and then do the analysis to see which of the two samples is better. The categorical variables are sales_group, age_group, and country. So I want to make both samples such that the proportions of country, age group, and sales group are similar in both samples.
For example: Sample A and B have following variables in it:
Id Country Age Sales
The proportion of Country in Sample A is:
USA - 58%
UK - 22%
India - 8%
France - 6%
Germany - 6%
The proportion of Country in Sample B is:
India - 42%
UK - 36%
USA - 12%
France - 3%
Germany - 5%
The same goes for the other categorical variables, age_group and sales_group.
Thanks in advance for any help.
You do not need to set up a special sampling procedure, since the sample proportion is an unbiased estimate of the population proportion. If you have, say, more than 1000 observations and you sample more than, let us say, 30 rows, the estimate will be quite accurate (Central Limit Theorem).
You can see it in the simulation below:
set.seed(123)
n <- 10000  # Number of rows in the source data frame
df <- data.frame(sales_group = sample(LETTERS[1:4], n, replace = TRUE),
                 age_group = sample(c("old", "young"), n, replace = TRUE),
                 country = sample(c("USA", "UK", "India", "France", "Germany"), n, replace = TRUE),
                 amount = abs(100 * rnorm(n)))
s <- 100  # Number of sampled rows
sampleA <- df[sample(nrow(df), s), ]
sampleB <- df[sample(nrow(df), s), ]
table(sampleA$sales_group)
#  A  B  C  D
# 23 22 32 23
table(sampleB$sales_group)
#  A  B  C  D
# 25 22 28 25
DISCLAIMER: However, if some proportions are very small or very large and your sample is too small, you will need to use a more advanced procedure such as Laplace smoothing.
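For reference, the add-one (Laplace) version of the estimated proportion has a standard form; assuming k categories and x_i observations of category i in a sample of size n:

p_i = (x_i + 1) / (n + k)

This pulls very small observed proportions away from zero at the cost of a little bias.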

Distribute numbers as close to possible

This seems to be a 2-step problem I'm trying to solve.
Let's say we have N records, and we are trying to distribute them as evenly as possible into K groups.
The second problem: each group in K can only accept up to M records.
For example, if we have 5 records and 3 groups, then we would distribute 2 into Group K1, 2 into Group K2 and 1 record into Group K3. However, if, say, Group K1 only accepts at most 1 record, then the arrangement would need to be 1 into Group K1, 2 into Group K2, and 2 into Group K3.
I'm not necessarily after the solution, but what algorithm might I need to use to solve this? Apparently for the distribution I need to use a greedy algorithm? But the second step seems to be a bit more complicated.
Edit:
The example I'm looking at is:
Number of records: 23
Groups: 10
Max records for each group
G1 = 4
G2 = 1
G3 = 0
G4 = 5
G5 = 0
G6 = 0
G7 = 2
G8 = 4
G9 = 2
G10 = 2
If N=12 and K=3, then in the normal situation you just split it: V = 12/3 = 4 for each group. But since you have the M limitation, and for example K3 can only accept 1, the distribution can end up 6-5-1, which is not evenly distributed.
So I guess you need to sort K based on the M limitation, so for the example above the group order becomes K3-K1-K2.
Then, if the distributed value V is bigger than the accepted amount M for that group, you need to take the remainder and distribute it again over the remaining groups (K3=1, so 4-1=3 must be distributed to K1 and K2).
The implementation might be complicated; I hope you can find a simpler solution for this. A sketch of the idea follows.
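A minimal Python sketch of that idea (the function name and the capacity representation are my own; groups are visited from smallest limit to largest, and the fair share is recomputed from whatever is still unplaced):

import math

def distribute(n, capacities):
    # capacities: group name -> max records that group accepts.
    allotted = {}
    remaining = n
    # Most constrained groups first, so any share they cannot take is
    # re-averaged over the groups that still have room.
    ordered = sorted(capacities.items(), key=lambda kv: kv[1])
    for i, (group, cap) in enumerate(ordered):
        groups_left = len(ordered) - i
        fair_share = math.ceil(remaining / groups_left)
        take = min(cap, fair_share, remaining)
        allotted[group] = take
        remaining -= take
    return allotted

# The 12-record example: K3 accepts at most 1, K1 and K2 are unlimited.
print(distribute(12, {"K1": math.inf, "K2": math.inf, "K3": 1}))
# {'K3': 1, 'K1': 6, 'K2': 5}

Note that in the 23-record example from the edit the capacities only add up to 20, so 3 records would stay unplaced whichever algorithm is used.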
From what I understood, you need to separate out all groups which allow only a fixed number of values first, and then equally distribute records among the remaining groups. Let's take an example: say we have 15 records which need to be distributed among 5 groups (G1, G2, G3, G4 and G5). Also let's assume that G2 and G4 allow a maximum of 2 and 4 records respectively. Now the algorithm should go like this:
Get average(ceiling integer) of records based on number of groups (In this example we'll get 3).
Add all max allowed records which are smaller than our average (In this example it's G2 only who's max limit(i.e. 2) is less than our average hence the number comes as 2).
Now subtract our number from step 2 from total records and also subtract the number of groups involved in step 2 from total groups. (remaining total records: 13, remaining total groups 4).
Get the new average(ceiling integer) using remaining records and groups. (New average 4).
Allot the new average to each remaining group except the last one (i.e. to remaining groups - 1, here 3 groups at 4 records each).
Allot whatever is left (i.e. 13 - 3 × 4 = 1) to the last group.
Now what we finally will have here:
G1(No limit): 4
G2(Limit 2): 2
G3(No limit): 4
G4(Limit 4): 4
G5(No limit): 1
Let me know if you think that this algo might fail for some scenarios.
Formula to get ceiling integer average
floor((#total_records + #total_groups-1) / #total_groups)
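A sketch of those steps in Python, to make the bookkeeping concrete (the function name is mine, limits maps only the capped groups, and, as the answer itself acknowledges, the limits are not re-checked after the new average is computed):

import math

def allot(total_records, groups, limits):
    allotment = {}
    # Step 1: ceiling average over all groups.
    avg = math.ceil(total_records / len(groups))
    # Step 2: groups whose limit is below the average keep their limit.
    for g in groups:
        if limits.get(g, math.inf) < avg:
            allotment[g] = limits[g]
    # Step 3: remove those records and groups from the totals.
    remaining_records = total_records - sum(allotment.values())
    remaining_groups = [g for g in groups if g not in allotment]
    # Step 4: new ceiling average over what is left.
    new_avg = math.ceil(remaining_records / len(remaining_groups))
    # Step 5: every remaining group but the last gets the new average.
    for g in remaining_groups[:-1]:
        allotment[g] = new_avg
    # Step 6: the last group gets whatever remains.
    allotment[remaining_groups[-1]] = remaining_records - new_avg * (len(remaining_groups) - 1)
    return allotment

# The 15-record example: G2 allows at most 2 records and G4 at most 4.
print(allot(15, ["G1", "G2", "G3", "G4", "G5"], {"G2": 2, "G4": 4}))
# {'G2': 2, 'G1': 4, 'G3': 4, 'G4': 4, 'G5': 1}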

Calculating Percentage of Percentage in SSRS

Figure A: the report table. Figure B: the current result. Figure C: the desired result.
I have created the table (figure A) that is giving me the result (figure B) where the subcategory percentages are taken from the TOTALS (example: for 1/16/2016 | 3.04% + 11.13% + 0.02% = 14.19%).
I need the subcategory percentages to be taken from the respective category total making it the new 100%, as displayed in figure C: the desired result (example: for 1/16/2016 | 21.40% + 78.42% + 0.17% = 100%).
In your example you want to refer to that cell with a total of 584 for Category B. SSRS doesn't have the option for you to refer to a value within multiple groups like that. You can only provide one scope override. To get this functionality you can add a subquery to your dataset that aggregates those values in a new column.
So for example your dataset should end up looking like this:
CategoryName  SubcategoryName  Number  CategorySubtotal
Category B    subcategory a    125     584
Category B    subcategory b    458     584
...
Now you can easily calculate the percent of total for each category in the report.
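With that CategorySubtotal column in the dataset, the detail cell can then use a plain expression; assuming the field names above (and a percentage format applied to the textbox), something along these lines:

=Fields!Number.Value / Fields!CategorySubtotal.Value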