Dataset interpretation Continuous vs Categorical for House Prices - pandas

I'm working with the UK house price dataset and want to build an ML model to predict the price of a house based on the city (plus some other categories).
As a newb to all of this, I am stumped. I am fine creating models with continuous variables, or even carrying out one-hot encoding (dummy variables) for some of the other categories which have 4 different options (type of house for example).
However, when it comes to cities, there are about 1200 different cities in the data set and so I am not sure how to engineer the data to deal with this.
Would greatly appreciate anyone having any idea about this!
No matter how much I search, I can't find an answer to this, but this could perhaps be due to not knowing exactly what to search for.

In my view, you need a grade for every city and a base price for each house.
For example:
City        | City Grade
------------+------------
Los Angeles | 1
New York    | 4

House       | Price
------------+------------
Option1     | $200,000
Option2     | $300,000
Then calculate the house price for a given city by multiplying the house price by the city grade.
So the Option1 house in Los Angeles would still be $200,000, but in New York it would be $800,000.
You don't need to worry about having 1200 cities; a lookup table like this is easy to query in a database.
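One way the city-grade idea above could be sketched in pandas is to derive the grade from the data itself, as each city's mean price relative to the overall mean. This is a form of target (mean) encoding, which is a swapped-in technique and not something the answer spells out. The column names and prices below are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are assumptions.
df = pd.DataFrame({
    "city": ["London", "London", "Leeds", "Leeds", "York"],
    "price": [500_000, 700_000, 200_000, 240_000, 300_000],
})

# Grade each city by its mean price relative to the overall mean price.
overall_mean = df["price"].mean()
city_grade = df.groupby("city")["price"].mean() / overall_mean

# Map the grade back onto every row as a single numeric feature,
# so 1200 cities become one column instead of 1200 dummy columns.
df["city_grade"] = df["city"].map(city_grade)
```

If you go this way, compute the grades on the training split only and then map them onto the test split, otherwise the target leaks into the feature.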

Related

SQL how to display results only if two parts are unique

I'm currently having an issue trying to write a query that displays rows only if a combination of two fields is unique. For example, let's say the fields to be displayed are currently as follows:
SELECT
    Name,
    CompanyName,
    JobStartDate,
    Birthday,
    Age,
    "Favorite Ice Cream",
    Height
FROM sample_person_data
How would I set this up so that it only displays rows where the combination of CompanyName and JobStartDate is distinct?
At first I thought just adding DISTINCT would be enough, but I came to realize that would not work. I then thought of checking CompanyName + JobStartDate together as the unique key, showing only the rows where that pair is unique, but I could not work out how to implement it.
Essentially, what I'm aiming to achieve is this: given a large dataset with some repeated values, how can I display only the unique rows? I use CompanyName and JobStartDate as examples here, but I understand that people can start at the same company on the same day, so this is a concept that could be expanded by adding more comparison fields.
Thank you for your time.
EDIT: Based on comments I am trying to provide further detail by example
Say this is the sample data:
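Name   | CompanyName | JobStartDate | Birthday | Age | Favorite Ice Cream | Height
-------+-------------+--------------+----------+-----+--------------------+-------
John   | Google      | 04-17-00     | 01-01-78 | 50  | Vanilla            | 5-7
John   | Google      | 04-17-00     | 01-01-78 | 50  | Chocolate          | 5-7
John   | Microsoft   | 04-17-00     | 02-01-95 | 30  | Chocolate          | 5-8
Nancy  | Google      | 06-27-00     | 04-01-78 | 50  | Vanilla            | 5-2
Joanna | Google      | 08-19-00     | 05-01-78 | 50  | Vanilla            | 5-0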
Here the same John from Google filled in the form twice, say because he decided to change his favorite ice cream. How do I edit the query so that it displays the following?
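Name   | CompanyName | JobStartDate | Birthday | Age | Favorite Ice Cream | Height
-------+-------------+--------------+----------+-----+--------------------+-------
John   | Google      | 04-17-00     | 01-01-78 | 50  | Vanilla            | 5-7
John   | Microsoft   | 04-17-00     | 02-01-95 | 30  | Chocolate          | 5-8
Nancy  | Google      | 06-27-00     | 04-01-78 | 50  | Vanilla            | 5-2
Joanna | Google      | 08-19-00     | 05-01-78 | 50  | Vanilla            | 5-0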
I don't really care whether his favorite ice cream shows up as Chocolate or Vanilla, only that a single entry for John from Google shows up, using company + job start date as the identifying fields in this example.
Use the simple approach below:
select *
from your_table
qualify 1 = row_number() over (partition by CompanyName, JobStartDate)
Applied to the sample data in your question, this returns one row for each distinct (CompanyName, JobStartDate) pair.
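QUALIFY is only available in some dialects (BigQuery, Snowflake and Teradata, for example). Where it is missing, the same row_number() idea can be written with a subquery. Below is a minimal sketch using Python's built-in sqlite3 module, assuming an SQLite build with window functions (3.25 or later). The table is a trimmed version of the sample data, and the column name FavoriteIceCream (without spaces) is my own simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_person_data "
    "(Name TEXT, CompanyName TEXT, JobStartDate TEXT, FavoriteIceCream TEXT)"
)
rows = [
    ("John", "Google", "04-17-00", "Vanilla"),
    ("John", "Google", "04-17-00", "Chocolate"),
    ("John", "Microsoft", "04-17-00", "Chocolate"),
    ("Nancy", "Google", "06-27-00", "Vanilla"),
    ("Joanna", "Google", "08-19-00", "Vanilla"),
]
conn.executemany("INSERT INTO sample_person_data VALUES (?, ?, ?, ?)", rows)

# Number the rows inside each (CompanyName, JobStartDate) group,
# then keep only the first row of every group.
query = """
SELECT Name, CompanyName, JobStartDate, FavoriteIceCream
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CompanyName, JobStartDate
               ORDER BY rowid
           ) AS rn
    FROM sample_person_data
)
WHERE rn = 1
ORDER BY Name, CompanyName
"""
deduped = conn.execute(query).fetchall()
```

The ORDER BY rowid inside the window is only there to make the surviving row predictable (the first one inserted). Order by a timestamp column instead if you want to keep the latest submission.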

Best way to add info/description to my items?

I made a geo game a while back where the player has to guess an item from an image (what I call an item is basically a SQL row). For example, the bot sends the flag of the Netherlands and you have to type "Netherlands" to win.
Items can be the flag of a country, a capital city, a French department...
I made an info tab that gives information about an item (i.e. region, former name, capital city, etc.).
What I would like to do is save this information properly. I don't know whether I should store it in files such as JSON, because I would also like to provide stats (win rate per region, number of games played per region, etc.).
Also, these attributes are not fixed: some items have regions, capital cities and so on, and some don't.
Item examples :
(For a flag)
Column       | Attribute
-------------+-----------
ID           | 1
Name         | United Kingdom
Former name  | United Kingdom of Great Britain and Northern Ireland
Code         | GB
Continent    | Europe
Subregion    | Northern Europe
Capital city | London
...
(For a U.S. State)
Column       | Attribute
-------------+-----------
ID           | 1
Name         | Arizona
Capital city | Phoenix
Largest city | Phoenix
...
Neither solution (adding every attribute as a column, or storing JSON) is the proper way.
I think the best design is a key-value table.
CREATE TABLE tableName (ID INT, [Key] SYSNAME, [Value] NVARCHAR(MAX))
-- NVARCHAR(MAX) is an assumed type for [Value]; pick whatever fits your data
And data will look like:
ID | Key          | Value
---+--------------+------------------------------------------------------
1  | Name         | Arizona
1  | Capital City | Phoenix
1  | Largest City | Phoenix
2  | Name         | United Kingdom
2  | Former name  | United Kingdom of Great Britain and Northern Ireland
The most valuable benefit: no storage is wasted on columns that would be NULL for a large number of rows.
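To illustrate, here is a rough sketch of the key-value layout using Python's built-in sqlite3 module. The table name item_attribute, the column names and the helper attributes() are hypothetical, chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (item, attribute) pair; items only store the keys they have.
conn.execute(
    'CREATE TABLE item_attribute (item_id INTEGER, "key" TEXT, "value" TEXT)'
)
conn.executemany(
    "INSERT INTO item_attribute VALUES (?, ?, ?)",
    [
        (1, "Name", "Arizona"),
        (1, "Capital city", "Phoenix"),
        (1, "Largest city", "Phoenix"),
        (2, "Name", "United Kingdom"),
        (2, "Former name", "United Kingdom of Great Britain and Northern Ireland"),
    ],
)


def attributes(item_id):
    """Return every stored attribute of one item as a dict."""
    cur = conn.execute(
        'SELECT "key", "value" FROM item_attribute WHERE item_id = ?',
        (item_id,),
    )
    return dict(cur.fetchall())


arizona = attributes(1)
uk = attributes(2)
```

Each item simply has no row for the attributes it lacks, so the info tab can loop over whatever attributes() returns. Per-region stats can still be done in SQL by filtering on the key and grouping by the value.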

DAX for Grouping

I have the following data set as my data model.
Country City AssetCount
USA Newyork 50
USA Washington 40
USA California 30
India Bangalore 100
India Delhi 50
India Bombay 30
I want to show the sum of AssetCount at country level and at city level on the same row.
There are two slicers, one for Country and one for City, as below:
Country: USA, India
City: Newyork, Washington, California, Bangalore, Delhi, Bombay
So when I select India as the country, it should show the sum of AssetCount at country (India) level.
In the same way, when I select Delhi as the city, it should show AssetCount at city (Delhi) level.
India Delhi
180 50
Is this possible in PowerPivot using DAX?
Related content from their question on MSDN
Actually, your solution is not working. I created the hierarchy Country --> City and put it in Rows. When I select a particular country and city, it shows this:
Row Labels AssetCount
USA 40
Washington 40
Grand Total 40
But I want
USA Washington
120 40
or may be like
USA 120
Washington 40
I have tried some aggregate functions like below:
=SUMX(VALUES(Query[City]),CALCULATE(SUM(Query[AssetCount])))
=CALCULATE(SUM(Query[AssetCount]),SUMMARIZE('Query',Query[City]))
Here Query is the table in the data model, and City can be replaced by Country.
Neither of them works.
So is showing such counts on the same row possible or not?
Sounds like you are just getting started with Power Pivot. You might browse through the links on this page for more help.
I took the data you provided and pasted it into Excel.
Selected the data and clicked Add to Data Model and checked the box for My Data Has Headers.
I made sure the AssetCount Column had a data type of whole number. Then clicked the Pivot Table button and created a pivot table on my existing spreadsheet.
I put AssetCount in the values and made sure it was set to Sum in the Field Value Settings.
I selected my pivot table and then went to the Analyze tab under PivotTable Tools and clicked the Insert Slicer button.
I selected both Country and City as slicers.
This gives your desired result.
If you want the two numbers in a row, that's pretty straightforward. Keep in mind that all those slicers do is put filters on the pivot table.
Therefore, to get your city result, you could use either an implicit or an explicit measure that simply sums AssetCount.
For the country result, you'd want to override the city filter like this:
=CALCULATE(SUM(Query[AssetCount]),ALL(Query[City]))
If you also need the country and city names there, it gets a bit tricky.
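If the filter logic is hard to picture, here is a small plain-Python sketch of what the two measures compute. It is only an analogy for the filter context, not DAX, and the function name totals() is made up for this example.

```python
# Rows of the data model from the question: (Country, City, AssetCount).
rows = [
    ("USA", "Newyork", 50),
    ("USA", "Washington", 40),
    ("USA", "California", 30),
    ("India", "Bangalore", 100),
    ("India", "Delhi", 50),
    ("India", "Bombay", 30),
]


def totals(country, city):
    """Return (country_total, city_total) for one slicer selection."""
    # City-level measure: both slicer filters apply.
    city_total = sum(n for c, t, n in rows if c == country and t == city)
    # Country-level measure: the city filter is dropped, which is the
    # role ALL(Query[City]) plays in the DAX measure.
    country_total = sum(n for c, t, n in rows if c == country)
    return country_total, city_total


india_delhi = totals("India", "Delhi")
usa_washington = totals("USA", "Washington")
```

With India / Delhi selected this gives 180 and 50, and with USA / Washington it gives 120 and 40, which are the numbers asked for in the question.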

Joining multiple fields between the same tables

I have a table called 'Resources' that looks like this:
Country        | City    | Street      | Headcount
---------------+---------+-------------+----------
UK             | Halifax | High Street | 20
United Kingdom | Oxford  | High Street | 30
Canada         | Halifax | North St    | 40
Because of the nature of the location fields, I need to map them to a single 'Address' field, and so I also have the following table called 'Addresses':
Country        | City    | Street      | Address
---------------+---------+-------------+------------------------------
UK             | Halifax | High Street | High Street, Halifax, UK
Canada         | Halifax | North St    | North Street, Halifax, Canada
United Kingdom | Oxford  | High Street | High Street, Oxford, UK
(In reality the Address field does add information rather than just combining what is already there.)
I am currently using the following SQL to produce the query:
SELECT Resources.Country, Resources.City, Resources.Street, Addresses.Address,
       Resources.Headcount
FROM Resources
INNER JOIN Addresses
    ON Resources.Country = Addresses.Country
    AND Resources.City = Addresses.City
    AND Resources.Street = Addresses.Street
This works for me, but I am worried because I have not seen people use this many ANDs in a single join elsewhere, so I don't know whether it is a bad idea. (This is a simplified version; I may need up to 8 ANDs in a single join in another case.) Is this the best way to approach the problem, or is there a better solution?
Thanks
Joining on multiple columns is fine. You don't have to "fear" this.
As for "a better way": I would suggest creating some table variables, putting some data in them, and posting that T-SQL (DDL and DML) here. Then you can get some possible alternatives. At present your question is vague with regard to the "is there a better way" part.
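For what it's worth, here is a small sketch of the join from the question, run with Python's built-in sqlite3 module against the sample rows. The variable names are mine. It shows why all three columns are needed: "Halifax" and "High Street" each appear more than once, so no single column identifies an address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Resources (Country TEXT, City TEXT, Street TEXT, Headcount INTEGER)"
)
conn.execute(
    "CREATE TABLE Addresses (Country TEXT, City TEXT, Street TEXT, Address TEXT)"
)
conn.executemany(
    "INSERT INTO Resources VALUES (?, ?, ?, ?)",
    [
        ("UK", "Halifax", "High Street", 20),
        ("United Kingdom", "Oxford", "High Street", 30),
        ("Canada", "Halifax", "North St", 40),
    ],
)
conn.executemany(
    "INSERT INTO Addresses VALUES (?, ?, ?, ?)",
    [
        ("UK", "Halifax", "High Street", "High Street, Halifax, UK"),
        ("Canada", "Halifax", "North St", "North Street, Halifax, Canada"),
        ("United Kingdom", "Oxford", "High Street", "High Street, Oxford, UK"),
    ],
)

# Composite join: a row matches only if Country, City and Street all agree.
query = """
SELECT r.Country, r.City, r.Street, a.Address, r.Headcount
FROM Resources AS r
INNER JOIN Addresses AS a
    ON r.Country = a.Country
    AND r.City = a.City
    AND r.Street = a.Street
ORDER BY r.Headcount
"""
joined = conn.execute(query).fetchall()
# Look up the address by the full three-column key.
address_by_key = {(c, ci, s): addr for c, ci, s, addr, _ in joined}
```

Each resource row matches exactly one address row, so the result has three rows, and the two Halifax entries resolve to different addresses.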

best way to merge rows that have matching ids

I have a table of households, which holds the address and city information,
and an individuals table with all the people in each household.
A household could have 1 person belonging to it, or it could have 10.
What I want to achieve is that, if individuals belong to the same household, their information shows up in the same row as the household information, all in 1 row.
So if there are 10 people the information will still be in 1 row, and if there are 2 people it is still only 1 row.
Household table:
1 bekshire st dell MA 10001 02639 50 0002 dell NULL ALRGEN
BERKSHIRE ST NULL NULL NULL NULL
Individuals that belong to household ID 10001:
first | last    | code
------+---------+-----
BOB   | BUILDER | U
JESS  | BUILDER | A
I want:
1 bekshire st dell MA 10001 02639 50 0002 dell NULL ALRGEN 1 BERKSHIRE ST BOB,JESS BUILDER U,A
The reason this is so hard is that SQL favors normalization and structure, and essentially what you're asking for goes in the opposite direction. I know I'm not directly answering your question, but maybe your best bet is to manipulate and display the data on the client side and stick to simple queries to get the data from the database.
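As a rough sketch of that client-side approach, the following Python uses the built-in sqlite3 module with one simple query and does the merging in code. The table layout, the column names and the helper merge() are assumptions based on the sample rows in the question.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE individuals "
    "(household_id INTEGER, first TEXT, last TEXT, code TEXT)"
)
conn.executemany(
    "INSERT INTO individuals VALUES (?, ?, ?, ?)",
    [(10001, "BOB", "BUILDER", "U"), (10001, "JESS", "BUILDER", "A")],
)

# Simple query: one row per individual, in insertion order per household.
cur = conn.execute(
    "SELECT household_id, first, last, code "
    "FROM individuals ORDER BY household_id, rowid"
)

# Group the individuals by household on the client side.
members = defaultdict(list)
for household_id, first, last, code in cur:
    members[household_id].append((first, last, code))


def merge(people):
    """Collapse a household's members into comma-separated fields."""
    firsts = ",".join(p[0] for p in people)
    # dict.fromkeys drops repeated last names while keeping their order.
    lasts = ",".join(dict.fromkeys(p[1] for p in people))
    codes = ",".join(p[2] for p in people)
    return firsts, lasts, codes


merged = {hid: merge(people) for hid, people in members.items()}
```

Here merged[10001] comes out as ("BOB,JESS", "BUILDER", "U,A"), which can then be appended to the household row when you display it.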