I've got data (which changes every time) in 2 columns - basically state and number. This is an example:
Example Data
State Total
Connecticut 624
Georgia 818
Washington 10
Arkansas 60
New Jersey 118
Ohio 2,797
N. Carolina 336
Illinois 168
California 186
Utah 69
Texas 183
Minnesota 172
Kansas 945
Florida 113
Arizona 1,430
S. Dakota 293
Puerto Rico 184
Each state needs to be grouped. The groupings are as follows:
Groupings
**US Group 1**
California
District of Columbia
Florida
Hawaii
Illinois
Michigan
Nevada
New York
Pennsylvania
Texas
**US Group 3**
Iowa
Idaho
Kansas
Maine
Missouri
Montana
North Dakota
Nebraska
New Hampshire
South Dakota
Utah
Wyoming
Every other state belongs in US Group 2.
What I am trying to do is sum a total for each group. So in this example I would have totals of:
Totals
650 in Group 1 (4 states)
6365 in Group 2 (9 states)
1307 in Group 3 (3 states)
So what I would like to do, each time I get a new spreadsheet with this data, is not have to build an IF/COUNTIF/SUMIF formula from scratch each time. I figure it would be much more efficient to select my data and run a macro which will do that (possibly checking against some legend or lookup list).
Can anyone point me in the right direction? I have been banging my head against the VBA editor for 2 days now...
Here is one way.
Step 1: Create a named range for each of your groups.
Step 2: Try this formula: =SUMPRODUCT(SUMIF(A2:A18,Group1,B2:B18))
Formula Breakdown:
A2:A18 is the range of state names
Group1 is the named range that has each of your states in group 1
B2:B18 is the range of values you want to sum.
It's important that the range of state names and the range of values you want summed are the same size (number of rows). You should also standardize your state names: having S. Dakota in your data and South Dakota in your named range won't work. Either add the different variations of the state name(s) to your list, or standardize your data coming in.
To get a clear visual of what the formula is doing, use the Evaluate Formula button on the Formulas tab; it will do a much better job than me trying to explain it.
EDIT
Try this formula for summing up values that are not in Group1 or Group3:
=SUMPRODUCT(--(NOT(ISNUMBER(MATCH(A2:A18,Group1,0)))),--(NOT(ISNUMBER(MATCH(A2:A18,Group3,0)))),B2:B18)
Seemed to work on my end. Basically it works by only summing values in B2:B18 where both MATCH functions return #N/A (meaning the state isn't in either defined group list).
Use a VLOOKUP with a mapping of your states to groups. Then, from the group number it returns, add the value when a match is found, or add 0.
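Just to make that lookup-then-sum logic concrete, here is a rough pandas sketch of the same idea (the column names and the abbreviated group legend below are illustrative only; the formulas above are the Excel-native way to do it):

import pandas as pd

# illustrative data in the same two-column shape as the question
data = pd.DataFrame({
    "State": ["Connecticut", "California", "Kansas", "Texas", "Utah"],
    "Total": [624, 186, 945, 183, 69],
})

# the "legend" (truncated here): any state not listed falls into US Group 2
group_legend = {
    "California": "US Group 1", "Florida": "US Group 1", "Illinois": "US Group 1", "Texas": "US Group 1",
    "Kansas": "US Group 3", "South Dakota": "US Group 3", "Utah": "US Group 3",
}

data["Group"] = data["State"].map(group_legend).fillna("US Group 2")
print(data.groupby("Group")["Total"].sum())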
Related
I have a dataframe:
| State | County | Candidate | CandidateVotes | Mode |
|---|---|---|---|---|
| South Carolina | Beaufort | Joe Biden | 13713 | ABSENTEE BY MAIL |
| South Carolina | Beaufort | Joe Biden | 63 | FAILSAFE |
| South Carolina | Beaufort | Joe Biden | 33 | FAILSAFE PROVISIONAL |
| South Carolina | Beaufort | Donald Trump | 9122 | ABSENTEE BY MAIL |
| South Carolina | Beaufort | Donald Trump | 26495 | ELECTION DAY |
| South Carolina | Beaufort | Donald Trump | 42 | FAILSAFE PROVISIONAL |
| Pennsylvania | York | Donald Trump | 146733 | TOTAL |
| Pennsylvania | York | Joe Biden | 88114 | TOTAL |
The mode can be a variety of things, but the total number of votes will always be the sum of that column for the candidate. Also, some states/counties will keep a total rather than breaking everything down. What I am looking to do is produce the same kind of TOTAL rows that Pennsylvania has at the bottom.
This is my desired output:
| State | County | Candidate | CandidateVotes | Mode |
|---|---|---|---|---|
| South Carolina | Beaufort | Joe Biden | 13809 | TOTAL |
| South Carolina | Beaufort | Donald Trump | 26537 | TOTAL |
| Pennsylvania | York | Donald Trump | 146733 | TOTAL |
| Pennsylvania | York | Joe Biden | 88114 | TOTAL |
I think the correct way to do this is to group by State, County and Candidate. From there, sum the votes across all of the modes for that candidate and create a new column with that total. And where Mode = 'TOTAL', simply carry that value over to the new column, then delete Mode.
How do I do this?
You can group by the three columns State, County, and Candidate and sum the votes:
df = df.groupby(['State', 'County', 'Candidate'])['CandidateVotes'].sum().reset_index()
This will give an output with the first four columns, and then you can add the Mode column back separately, since it will hold a single static value:
df['Mode'] = 'TOTAL'
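In case a fully self-contained version is useful, here is a small sketch that rebuilds the sample rows from the question and applies the same groupby (the DataFrame construction is just for illustration):

import pandas as pd

rows = [
    ("South Carolina", "Beaufort", "Joe Biden", 13713, "ABSENTEE BY MAIL"),
    ("South Carolina", "Beaufort", "Joe Biden", 63, "FAILSAFE"),
    ("South Carolina", "Beaufort", "Joe Biden", 33, "FAILSAFE PROVISIONAL"),
    ("South Carolina", "Beaufort", "Donald Trump", 9122, "ABSENTEE BY MAIL"),
    ("South Carolina", "Beaufort", "Donald Trump", 26495, "ELECTION DAY"),
    ("South Carolina", "Beaufort", "Donald Trump", 42, "FAILSAFE PROVISIONAL"),
    ("Pennsylvania", "York", "Donald Trump", 146733, "TOTAL"),
    ("Pennsylvania", "York", "Joe Biden", 88114, "TOTAL"),
]
df = pd.DataFrame(rows, columns=["State", "County", "Candidate", "CandidateVotes", "Mode"])

# one row per State/County/Candidate, with every mode summed into a single total
out = df.groupby(["State", "County", "Candidate"])["CandidateVotes"].sum().reset_index()
out["Mode"] = "TOTAL"
print(out)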
I'm trying to achieve a result where only one row for each TEAM and each PLACE is returned.
The twist is that the highest result from each place should have priority.
My table currently looks something like this:
ENTRY_ID TEAM_ID DATE PLACE SCORE
1 1 2021-10-12 Ireland 64
2 2 2021-10-12 Ireland 31
3 3 2021-10-12 France 137
4 2 2021-10-12 France 61
5 5 2021-10-12 France 38
6 1 2021-10-12 France 66
7 2 2021-10-12 Italy 17
8 3 2021-10-12 Italy 61
9 1 2021-10-12 Italy 74
The competition is held at three different places at the same time, with technically all teams being able to have people playing in all of them at the same time.
Each team however can only win one point so, in the example, it's possible to see that Team 1 would win both in Italy and Ireland, but it should be awarded only one point for the highest score, so only Italy. The point in Ireland should go to the second place.
I've tried over 30 queries I've found in several correlated questions, but none of them seems to be applicable to my situation.
Basically:
"Return the highest score on each PLACE, but only calls each TEAM once.
If that certain TEAM was already called, ignore it, get the second place."
So I could retrieve all three winners with no further processing. The results I'm trying to achieve should repeat neither the TEAM_ID nor the PLACE; in this particular example it should output:
3 FRANCE (Since it has the highest score in France at 137)
1 ITALY (For the highest score in Italy at 74)
2 IRELAND (For the second-highest score in Ireland, since Team 1 already won in Italy)
The production model of this table has far more entries so it's unlikely there would be any clashes with too many second-places.
How can I achieve that?
I'd like to make a query that returns just one row for each combination of 3 columns. I have a database that looks like this:
| Location | Date | Item | Price |
|---|---|---|---|
| Chicago | 2021-06-10 | 1 | 150 |
| New York | 2021-06-10 | 2 | 130 |
| Chicago | 2021-06-10 | 1 | 150 |
| Los Angeles | 2021-06-10 | 3 | 100 |
| Atlanta | 2021-06-10 | 4 | 120 |
| New York | 2021-06-09 | 2 | 125 |
| Chicago | 2021-06-09 | 1 | 155 |
| Los Angeles | 2021-06-09 | 3 | 99 |
| Atlanta | 2021-06-09 | 4 | 140 |
This database contains the price of different items, by date and location. The price changes each day, and the price of the same item does not need to be the same in each location. Since the database records every sale made in a day for each item, I'd like to make a query that returns only one observation per Location, Date and Item. I want something like a time series of the price of each item in each location. So the resulting table should look like this:
| Location | Date | Item | Price |
|---|---|---|---|
| Chicago | 2021-06-10 | 1 | 150 |
| New York | 2021-06-10 | 2 | 130 |
| Los Angeles | 2021-06-10 | 3 | 100 |
| Atlanta | 2021-06-10 | 4 | 120 |
| New York | 2021-06-09 | 2 | 125 |
| Chicago | 2021-06-09 | 1 | 155 |
| Los Angeles | 2021-06-09 | 3 | 99 |
| Atlanta | 2021-06-09 | 4 | 140 |
Hope someone can help me, thanks.
To elaborate on the comments, this will give exactly what you have specified.
SELECT DISTINCT
    *
FROM
    yourTable
The DISTINCT keyword looks at all columns in each row and eliminates duplicate rows, keeping a single copy of any row that exactly matches another.
If the price can vary within a day, but you want the maximum value, for example, use a GROUP BY...
SELECT
    location,
    date,
    item,
    MAX(price) AS max_price
FROM
    yourTable
GROUP BY
    location,
    date,
    item
That will ensure you get one row per unique combination of location, date, item, and then you can pick which price to include using aggregate functions.
Note: Using keywords such as date as column names is a bad idea. Depending on your database you may need to "quote"/"escape" such column names, and even then they make the code harder for others to read.
I am trying to create a function that averages over both row and column. For example:
| State | 1943 | 1944 | 1945 | 1946 | 1947 | 1947_AVG | 1948 | 1948_AVG |
|---|---|---|---|---|---|---|---|---|
| Alaska | 1 | 2 | 3 | 4 | 5 | 2 | 6 | 3 |
| CA | 234 | 234 | 234 | 6677 | 34 | | | |
I want code that will give me an average for 1947 using 1943, 1944, and 1945; something that gives me 1948 using 1944, 1945, and 1946; and so on.
I currently have:
d3['pandas_SMA_Year'] = d3.iloc[:,1].rolling(window=3).mean()
But this is simply working over the rows, not the columns, and it doesn't take into account the fact that I'm looking 2 years back. Please and thank you for any guidance!
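For what it's worth, here is a minimal sketch of the column-wise version of that rolling mean, using only the Alaska row from the example above (the column names are assumed); transposing first makes rolling() run across the year columns, and shift(2) supplies the two-year look-back:

import pandas as pd

# illustrative frame based on the Alaska row above
d3 = pd.DataFrame({
    "State": ["Alaska"],
    "1943": [1], "1944": [2], "1945": [3], "1946": [4], "1947": [5], "1948": [6],
})

year_cols = [c for c in d3.columns if c != "State"]

# rolling over the rows of the transposed frame == rolling across the year columns;
# shift(2) moves each 3-year mean forward so that, e.g., 1947_AVG uses 1943-1945
avgs = d3[year_cols].T.rolling(window=3).mean().shift(2).T

d3["1947_AVG"] = avgs["1947"]  # 2.0 for Alaska
d3["1948_AVG"] = avgs["1948"]  # 3.0 for Alaska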
ID Address City State Country Name Employees
0 1 3666 21st St San Francisco CA 94114 USA Madeira 8
1 2 735 Dolores St San Francisco CA 94119 USA Bready Shop 15
2 3 332 Hill St San Francisco Cal USA Super River 25
3 4 3995 23rd St San Francisco CA 94114 USA Ben's Shop 10
4 5 1056 Sanchez St San Francisco California USA Sanchez 12
5 6 551 Alvarado St San Francisco CA 94114 USA Richvalley 20
df=df.drop(['3666 21st St'], axis=1, inplace=True)
I am using this code and it's still showing an error stating:
KeyError: "['3666 21st St'] not found in axis"
Can anyone help me solve this?
The drop method only works on the index or column names. There are 2 ways to do what you want:
1. Make the Address column the index, then drop the value(s) you want to drop. You should use axis=0 for this, and not axis=1. The default is axis=0. Do not use inplace=True if you are assigning the output.
2. Use a Boolean filter instead of drop.
The 1st method is preferred if the Address values are all distinct. The index of a data frame is effectively a sequence of row labels, so it doesn't make much sense to have duplicate row labels:
df.set_index('Address', inplace=True)
df.drop(['3666 21st St'], inplace=True)
The 2nd method is therefore preferred if the Address column is not distinct:
is_bad_address = df['Address'] == '3666 21st St'
# Alternative if you have multiple bad addresses:
# is_bad_address = df['Address'].isin(['3666 21st St'])
df = df.loc[~is_bad_address]
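For completeness, here is a small end-to-end reproduction of both methods, with the frame typed in by hand from the question's printout (only a few of the columns are included):

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Address": ["3666 21st St", "735 Dolores St", "332 Hill St",
                "3995 23rd St", "1056 Sanchez St", "551 Alvarado St"],
    "Employees": [8, 15, 25, 10, 12, 20],
})

# method 1: make Address the index, then drop by row label (axis=0 is the default)
by_label = df.set_index("Address").drop(["3666 21st St"])

# method 2: Boolean filter; keeps the original index and tolerates duplicate addresses
by_filter = df.loc[df["Address"] != "3666 21st St"]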
You need to consult the Pandas documentation for the correct usage of the axis= and inplace= keyword arguments. You are using both of them incorrectly. DO NOT COPY AND PASTE CODE WITHOUT UNDERSTANDING HOW IT WORKS.