Figure out average for multiple subsets of rows - pandas

I have a csv file with data on store sales for each province, including the store ID. I've already figured out how to get a list of the provinces with the most sales, and a list of the stores with the most sales, but now I need to calculate: 1) The average store sales for each province and 2) The best-selling store in each province and then 3) The difference between them. The data looks like this:
>>> store_sales
sales
store_num province
1396 ONTARIO 223705.21
1891 ONTARIO 71506.85
4823 MANITOBA 114692.70
4861 MANITOBA 257.69
6905 ONTARIO 19713.24
6973 ONTARIO 336392.25
7104 BRITISH COLUMBIA 32233.31
7125 BRITISH COLUMBIA 11873.71
7167 BRITISH COLUMBIA 87488.70
7175 BRITISH COLUMBIA 14096.53
7194 BRITISH COLUMBIA 6327.60
7238 ALBERTA 1958.75
7247 ALBERTA 6231.31
7269 ALBERTA 451.56
7296 ALBERTA 184410.04
7317 SASKATCHEWAN 43491.55
8142 ONTARIO 429871.74
8161 ONTARIO 6479.71
9604 ONTARIO 20823.49
9609 ONTARIO 148.02
9802 ALBERTA 54101.00
9807 ALBERTA 543703.84
I was able to get there by using the following:
import pandas as pd
df = pd.read_csv('/path/to/sales.csv')
store_sales = df.groupby(['store_num', 'province']).agg({'sales': 'sum'})
I think 3) is probably pretty simple but for 1) I'm not sure how to apply an average to subsets of specific rows (I imagine it involves using 'groupby') and for 2) although I was able to generate a list of the best-selling stores, I'm uncertain as to how I could display a single top store for each province (although something tells me it should be simpler than it seems).

For (1), group the per-store totals by province and take the mean of the sales column:
store_sales.groupby("province")["sales"].mean()
For (2), apply a different function to the same groupby to get the highest store total in each province:
store_sales.groupby("province")["sales"].max()
For (3), the difference is simply (2) minus (1):
store_sales.groupby("province")["sales"].max() - store_sales.groupby("province")["sales"].mean()
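As a concrete check of the three steps, here is a minimal sketch that rebuilds a handful of rows from the question's store_sales frame by hand. The choice of rows and the names by_province, avg_sales, top_sales, top_store and summary are only for illustration. It also uses idxmax, which the answer does not mention, as one possible way to recover which store holds each province's maximum.

```python
import pandas as pd

# A few rows copied from the question, indexed by (store_num, province)
# in the same way as the asker's store_sales frame.
store_sales = pd.DataFrame(
    {
        "store_num": [4823, 4861, 7317, 9802, 9807],
        "province": ["MANITOBA", "MANITOBA", "SASKATCHEWAN", "ALBERTA", "ALBERTA"],
        "sales": [114692.70, 257.69, 43491.55, 54101.00, 543703.84],
    }
).set_index(["store_num", "province"])

# Group the per-store totals by the province index level.
by_province = store_sales.groupby(level="province")["sales"]

avg_sales = by_province.mean()   # (1) average store sales per province
top_sales = by_province.max()    # (2) best single-store total per province

# idxmax returns the (store_num, province) label of the top row; keep the store.
top_store = by_province.idxmax().map(lambda label: label[0])

difference = top_sales - avg_sales  # (3) best store minus provincial average

summary = pd.DataFrame(
    {
        "avg_sales": avg_sales,
        "top_store": top_store,
        "top_sales": top_sales,
        "difference": difference,
    }
)
print(summary)
```

A province with a single store, such as SASKATCHEWAN here, naturally ends up with a difference of zero.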

Related

Best way to add info/description to my items?

I made a geo game a while back where the player has to guess an item from an image (what I call an item is a SQL row basically) for example the bot sends the flag of the Netherlands, you have to type "Netherlands" to win.
Items can be the flag of a country, a capital city, a french department...
I made an info tab where it would basically give info about an item (ie region, former name, capital city, etc).
What I would like to do is properly save this information. I don't really know if I should store this in files like JSON because I would also like to give stats (Win rate per region, amount of games played per region, etc...).
Also, these elements are not fixed because some items have regions, capital cities or whatever and some don't.
Item examples :
(For a flag)
Column       | Attribute
-------------+------------
ID           | 1
Name         | United Kingdom
Former name  | United Kingdom of Great Britain and Northern Ireland
Code         | GB
Continent    | Europe
Subregion    | Northern Europe
Capital city | London
...
(For a U.S. State)
Column       | Attribute
-------------+------------
ID           | 1
Name         | Arizona
Capital city | Phoenix
Largest city | Phoenix
...
Neither of the two solutions (adding every attribute as a column, or storing JSON) is the proper way.
I think the best design is a key-value table:
CREATE TABLE tableName (ID INT, [Key] SYSNAME, [Value] NVARCHAR(MAX))
And data will look like:
ID | Key          | Value
---+--------------+------------
1  | Name         | Arizona
1  | Capital City | Phoenix
1  | Largest City | Phoenix
2  | Name         | United Kingdom
2  | Former name  | United Kingdom of Great Britain and Northern Ireland
The most valuable benefit: no storage is wasted on columns that would hold NULL for a large number of rows.
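To make the key-value layout concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the SQL Server syntax above. The table name item_attribute, its column names and the helper item_info are all hypothetical. It shows that items with different sets of attributes fit in the same table, and that an attribute an item lacks is simply absent rather than stored as NULL.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item_attribute (item_id INTEGER, attr_key TEXT, attr_value TEXT)"
)

# The same rows as in the answer's example table.
rows = [
    (1, "Name", "Arizona"),
    (1, "Capital City", "Phoenix"),
    (1, "Largest City", "Phoenix"),
    (2, "Name", "United Kingdom"),
    (2, "Former name", "United Kingdom of Great Britain and Northern Ireland"),
]
conn.executemany("INSERT INTO item_attribute VALUES (?, ?, ?)", rows)


def item_info(item_id):
    """Return one item's attributes as a dict built from its key-value rows."""
    cursor = conn.execute(
        "SELECT attr_key, attr_value FROM item_attribute WHERE item_id = ?",
        (item_id,),
    )
    return dict(cursor.fetchall())


arizona = item_info(1)
uk = item_info(2)
print(arizona)
print(uk.get("Capital City"))  # None: item 2 has no such row in this sample
```

Per-region statistics such as win rates would presumably live in separate tables keyed by the item ID, since they are numeric facts rather than descriptive attributes.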

Dataset interpretation Continuous vs Categorical for House Prices

I'm working with the UK house price dataset and was wanting to create a ML model to predict the price of a house based on the city (plus some other categories).
As a newb to all of this, I am stumped. I am fine creating models with continuous variables, or even carrying out one-hot encoding (dummy variables) for some of the other categories which have 4 different options (type of house for example).
However, when it comes to cities, there are about 1200 different cities in the data set and so I am not sure how to engineer the data to deal with this.
Would greatly appreciate anyone having any idea about this!
No matter how much I search, I can't find an answer to this, but this could perhaps be due to not knowing exactly what to search for.
In my view, you need a grade for every city and a base price for each house.
For example:
City        | City Grade
------------+------------
Los Angeles | 1
New York    | 4
House       | Price
------------+------------
Option1     | $200,000
Option2     | $300,000
Then calculate the house price in a given city by multiplying the house price by the city grade.
So the Option1 house in Los Angeles will still be $200,000, but in New York it would be $800,000.
You don't need to worry about the 1200 cities; they are easy to query in a database.
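The arithmetic of this grade-based approach can be sketched in a few lines. The dictionaries and the function name graded_price are hypothetical, and the grades are the illustrative ones from the tables above, not values derived from the UK dataset. With those numbers, Option1 times a grade of 4 comes to $800,000 and Option2 times 4 comes to $1,200,000.

```python
# Illustrative lookup tables taken from the example above (not real data).
city_grade = {"Los Angeles": 1, "New York": 4}
base_price = {"Option1": 200_000, "Option2": 300_000}


def graded_price(house, city):
    """Scale a house's base price by the grade assigned to the city."""
    return base_price[house] * city_grade[city]


for city in city_grade:
    for house in base_price:
        print(f"{house} in {city}: ${graded_price(house, city):,}")
```

Where the grades would come from for 1200 real cities is left open by the answer; this sketch only shows the multiplication step.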

PostgreSQL - query column that has city, state, country for another column?

I have a database with a column that has geographical location data like this:
party_no ...misc_cols_in_between... loc_reverse_geoloc_adr
------------------------------------------------------------------------------------------------
013 ...data... 367 Cleta Tunnel, South Moriahton, Palestinian Territory
666 ...data... NULL
185 ...data... Schillerstraße 50-50, 08340, Schwarzenberg/Erzgeb., Sachsen, DE
267 ...data... 18701-18999 N County Road 2500E, Oakland, IL, 61943, US
389 ...data... Le Rocher-Percé, Québec, CA
666 ...data... 94531, Antioch, CA, US
185 ...data... 76 Willochra Rd, Salisbury Plain, South Australia, 5109, AU
As you can see, not all of the geographical location data is in the same format. I need to run several queries against it, for example finding the most frequently appearing city, state, and country, and which party number each is linked to.
Would wildcards using LIKE % be the best bet here? I'm thinking I need some way to parse the data in the column on the comma delimiter, since the city, state, and country appear to be separated by commas. However, sometimes the address is missing, and the data isn't always formatted the same way.
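The comma-splitting idea can be tried outside the database first. The following is a rough sketch in Python over the sample rows, under the assumption (which holds for these rows but is not guaranteed in general) that the last comma-separated part is the country. The helper name last_part is hypothetical. In PostgreSQL itself, functions such as split_part or string_to_array could perform a similar split.

```python
from collections import Counter

# (party_no, loc_reverse_geoloc_adr) pairs copied from the sample data.
rows = [
    ("013", "367 Cleta Tunnel, South Moriahton, Palestinian Territory"),
    ("666", None),
    ("185", "Schillerstraße 50-50, 08340, Schwarzenberg/Erzgeb., Sachsen, DE"),
    ("267", "18701-18999 N County Road 2500E, Oakland, IL, 61943, US"),
    ("389", "Le Rocher-Percé, Québec, CA"),
    ("666", "94531, Antioch, CA, US"),
    ("185", "76 Willochra Rd, Salisbury Plain, South Australia, 5109, AU"),
]


def last_part(address):
    """Return the text after the final comma, or None for a missing address."""
    if address is None:
        return None
    return address.split(",")[-1].strip()


# Count how often each trailing part (assumed to be the country) appears.
country_counts = Counter(last_part(adr) for _, adr in rows if adr is not None)
top_country, top_count = country_counts.most_common(1)[0]

# Party numbers linked to the most common country.
parties = sorted({party for party, adr in rows if last_part(adr) == top_country})
print(top_country, top_count, parties)
```

City and state are harder, because their position relative to the end of the string varies between rows (a postal code sometimes sits between them), so position-based parsing alone is unlikely to be reliable for those fields.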

How to write a SQL query for showing route information from a flight database?

I have a set of flight data and I am trying to write a query (e.g. a recursive query using a CTE) to show the number of flights per route, the destination city, the departure city, the airline, and the total delay time per route.
Currently I don't know how to group the total number of flights per route for each airline. I also have trouble grouping the total delay time for each airline's routes.
Sample flight data - five columns in total (all the data below is from the fact table in an OLAP database)
AirlineName DepartureCity DestinationCity TimeDelay(min) FlightID
CA NY CA 9 389
OA NJ TX 8 321
AA SEA NY 10 231
UA NY CA 20 098
HA NJ TX 15 321
OA NJ TX 20 123
< Expected output: 5 columns >
AirlineName DeparCity DestiCity TotalNumberofFlights TotaltimeDelay
Thanks a lot I hope I made it clear enough. Any sort of help or direction would be appreciated.
A simple GROUP BY should be enough...
SELECT
AirlineName,
DepartureCity AS DeparCity,
DestinationCity AS DestiCity,
COUNT(*) AS TotalNumberofFlights,
SUM(TimeDelay) AS TotaltimeDelay
FROM Flight
GROUP BY
AirlineName,
DepartureCity,
DestinationCity
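To try the GROUP BY query without a database server, one option is Python's built-in sqlite3 module. This is only a sketch: the column types and the added ORDER BY are assumptions, and the rows are the six sample flights from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Flight (AirlineName TEXT, DepartureCity TEXT, "
    "DestinationCity TEXT, TimeDelay INTEGER, FlightID TEXT)"
)

# Sample rows from the question (FlightID kept as text to preserve '098').
flights = [
    ("CA", "NY", "CA", 9, "389"),
    ("OA", "NJ", "TX", 8, "321"),
    ("AA", "SEA", "NY", 10, "231"),
    ("UA", "NY", "CA", 20, "098"),
    ("HA", "NJ", "TX", 15, "321"),
    ("OA", "NJ", "TX", 20, "123"),
]
conn.executemany("INSERT INTO Flight VALUES (?, ?, ?, ?, ?)", flights)

# Same grouping as the answer; ORDER BY is added only for stable output.
query = """
SELECT AirlineName,
       DepartureCity   AS DeparCity,
       DestinationCity AS DestiCity,
       COUNT(*)        AS TotalNumberofFlights,
       SUM(TimeDelay)  AS TotaltimeDelay
FROM Flight
GROUP BY AirlineName, DepartureCity, DestinationCity
ORDER BY AirlineName
"""
route_totals = conn.execute(query).fetchall()
for row in route_totals:
    print(row)
```

In this sample only OA flies the same route twice, so it is the only row with a flight count above one.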

DAX for Grouping

I have following data set as data Model.
Country City AssetCount
USA Newyork 50
USA Washington 40
USA California 30
India Bangalore 100
India Delhi 50
India Bombay 30
I want to show one row showing sum of Assetcount at country level & city level on the same row.
There are two slicers for slicing City & Country as below:
Country   City
USA       Newyork
India     Washington
          California
          Bangalore
          Delhi
          Bombay
So when I select country as India it should show sum of Asset-Count at country(India) level.
In the same way when I select City as Delhi it should show Asset-Count at City(Delhi) level.
India Delhi
180 50
Is it possible using PowerPivot using DAX?
Related content from their question on MSDN
Actually, your solution is not working. I have created the hierarchy as Country-->City and kept that in Rows. So when I select a particular Country and City, it shows this:
Row Labels AssetCount
USA 40
Washington 40
Grand Total 40
But I want
USA Washington
120 40
or maybe like
USA 120
Washington 40
I have tried some aggregate functions like below:
=SUMX(VALUES(Query[City]),CALCULATE(SUM(Query[AssetCount])))
=CALCULATE(SUM(Query[AssetCount]),SUMMARIZE('Query',Query[City]))
Here Query is the table in the Data Model, and City can be replaced by Country.
But neither is working.
So is showing such counts on the same row possible or not?
Sounds like you are just getting started with Power Pivot. You might browse through the links on this page for more help.
I took the data you provided and pasted it into Excel.
Selected the data and clicked Add to Data Model and checked the box for My Data Has Headers.
I made sure the AssetCount Column had a data type of whole number. Then clicked the Pivot Table button and created a pivot table on my existing spreadsheet.
I put AssetCount in the values and made sure it was set to Sum in the Field Value Settings.
I selected my pivot table and then went to the Analyze tab under PivotTable Tools and clicked the Insert Slicer button.
I selected both Country and City as slicers.
This gives your desired result.
If you want the two numbers in a row, that's pretty straightforward. Keep in mind that all those slicers do is put filters on the pivot table.
Therefore, to get your city result, you could use either an implicit measure or an explicit measure that simply sums up AssetCount.
For the country result, you'd want to override the city filter like this:
=calculate(SUM(Query[AssetCount]),ALL(Query[City]))
If you also need the country and city names there, it gets a bit tricky.
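For readers more at home in pandas than DAX, the effect of removing the city filter can be imitated roughly as follows. This is an analogy, not DAX: the variables selected_country and selected_city stand in for the two slicers and are hypothetical, and the frame holds the sample data from the question.

```python
import pandas as pd

# Sample data from the question; the frame name mirrors the Query table.
query = pd.DataFrame(
    {
        "Country": ["USA", "USA", "USA", "India", "India", "India"],
        "City": ["Newyork", "Washington", "California", "Bangalore", "Delhi", "Bombay"],
        "AssetCount": [50, 40, 30, 100, 50, 30],
    }
)

# Stand-ins for the two slicer selections.
selected_country = "India"
selected_city = "Delhi"

# City-level result: both filters apply, like a plain SUM under the slicers.
city_mask = (query["Country"] == selected_country) & (query["City"] == selected_city)
city_total = query.loc[city_mask, "AssetCount"].sum()

# Country-level result: the city filter is dropped, which is roughly what
# ALL(Query[City]) achieves inside CALCULATE.
country_mask = query["Country"] == selected_country
country_total = query.loc[country_mask, "AssetCount"].sum()

print(selected_country, selected_city)
print(country_total, city_total)
```

For the India and Delhi selection, this reproduces the two numbers the question asks to see side by side.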