Geocoding for Data Manipulation

Use data from KY's Department of Alcoholic Beverage Control (ABC) to calculate the availability of alcohol in Fayette County's neighborhoods. In particular, for each neighborhood, calculate the rate of liquor licenses per capita. Show the top 20 neighborhoods with the highest rate of alcohol availability. Show the top 20 neighborhoods with the highest number of licenses. Discuss whether or not these two top-20 lists differ and how. Define a neighborhood as a US Census Bureau tract.
I was trying to manipulate the ABC data so that I could group the licenses by neighbourhood, i.e. convert each record to its Census tract and then perform the calculations, but I could only get the longitude and latitude of each license.
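Something like the following geopandas spatial join against the Census TIGER/Line tract boundaries is one way to go from coordinates to tracts; the file names and column names below are placeholders, not my actual data:

import geopandas as gpd
import pandas as pd

# ABC licenses already geocoded to longitude/latitude (placeholder file and columns)
licenses = pd.read_csv("abc_licenses.csv")  # columns: license_id, lon, lat
points = gpd.GeoDataFrame(
    licenses,
    geometry=gpd.points_from_xy(licenses["lon"], licenses["lat"]),
    crs="EPSG:4326",
)

# Census tract polygons, e.g. the TIGER/Line tract shapefile for Kentucky
tracts = gpd.read_file("tl_2020_21_tract.shp").to_crs("EPSG:4326")

# Spatial join: attach each license point to the tract polygon that contains it
joined = gpd.sjoin(points, tracts[["GEOID", "geometry"]], how="left", predicate="within")

# Licenses per tract; divide by tract population (from a separate ACS table) for the per-capita rate
per_tract = joined.groupby("GEOID").size().rename("n_licenses").reset_index()
pop = pd.read_csv("tract_population.csv")  # placeholder: GEOID, population
rates = per_tract.merge(pop, on="GEOID")
rates["licenses_per_capita"] = rates["n_licenses"] / rates["population"]
print(rates.sort_values("licenses_per_capita", ascending=False).head(20))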

Related

Dataset interpretation: Continuous vs Categorical for House Prices

I'm working with the UK house price dataset and want to create an ML model to predict the price of a house based on the city (plus some other categories).
As a newb to all of this, I am stumped. I am fine creating models with continuous variables, or even carrying out one-hot encoding (dummy variables) for some of the other categories which have 4 different options (type of house for example).
However, when it comes to cities, there are about 1200 different cities in the data set and so I am not sure how to engineer the data to deal with this.
Would greatly appreciate anyone having any idea about this!
No matter how much I search, I can't find an answer to this, but this could perhaps be due to not knowing exactly what to search for.
The way I see it, you need a city grade for every city and a base price for each house.
For example:
City        | City Grade
------------+-----------
Los Angeles | 1
New York    | 4

House   | Price
--------+----------
Option1 | $200,000
Option2 | $300,000
Then calculate the house price for a given city by multiplying the base price by the city grade.
So the Option1 house in Los Angeles would still be $200,000, but in New York it would be $800,000.
You don't need to worry about the 1200 cities; a grade lookup like this is easy to query in a database.
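One concrete way to build such a per-city number is target (mean) encoding: learn the grade from the training data itself. A minimal pandas sketch with made-up column names, as an illustration of the idea rather than a full recipe:

import pandas as pd

# Placeholder UK house price data: one row per sale
df = pd.read_csv("uk_house_prices.csv")  # columns: city, house_type, price, ...

train = df.sample(frac=0.8, random_state=0).copy()
test = df.drop(train.index).copy()

# Learn a numeric "grade" per city from the training split only:
# here, the mean sale price observed in that city.
city_grade = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Map the grade onto both splits; cities unseen in training fall back to the global mean.
train["city_grade"] = train["city"].map(city_grade)
test["city_grade"] = test["city"].map(city_grade).fillna(global_mean)

# city_grade is now a single continuous feature instead of ~1200 one-hot columns.
print(train[["city", "city_grade", "price"]].head())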

How to use normalization to set levels of confidence between a rating and the number of ratings in Python or SQL?

I have a list of about 800 sales items, each with a rating (from 1 to 5) and the number of ratings it received. I'd like to rank the items by how likely they are to have a genuinely "good" rating, in an unbiased way, meaning that 1 person voting 5.0 isn't nearly as good as 50 people having voted with the item rated 4.5.
Initially I thought about taking the smallest number of votes for any item on the list (which will be zero 99% of the time) and the highest number of votes, and factoring those into the ratings, giving me a confidence level of 0 to 100%; however, I'm thinking this approach would be too simplistic.
I've heard about Bayesian probability but have no idea how to implement it. My list of items, ratings and number of ratings is in a MySQL view, but I'm parsing the code using Python, so I can make the calculations on either side (preferably in the SQL view).
Is there any practical way that I can normalize this voting with SQL, considering the rating and number of votes as parameters?
| itemCode | rating | numOfRatings |
|----------|--------|--------------|
| 12330    | 5.00   | 2            |
| 85763    | 4.65   | 36           |
| 85333    | 3.11   | 9            |
I've started off trying to assign percentiles to the rating and numOfRatings, this way I'd be able to do normalization (sum them with an initial 50/50 weight). Here's the code I've attempted:
SELECT p.itemCode AS itemCode,
       (p.rating - MIN(p.rating)) / (MAX(p.rating) - MIN(p.rating)) AS percentil_rating,
       (p.numOfRatings - MIN(p.numOfRatings)) / (MAX(p.numOfRatings) - MIN(p.numOfRatings)) AS percentil_qtd_ratings
FROM products p
WHERE p.available = 1
GROUP BY p.itemCode
However that's only bringing me a result for the first itemCode on the list, not all of them.
The core issue here is the low number of observations for many of your items. A Bayesian approach is the way to go because it works well for rating applications with limited observations: each item's score is shrunk toward a prior and updated as more votes arrive (this article provides an excellent explanation of Bayesian probability for beginners).
I would suggest exporting your data to CSV files so it is easier to manipulate in Python. Denormalizing the data via joins is the first task to do before analyzing your ratings.
This is the simplified Bayesian (weighted rating) formula to use in your Python code:
WR = (v / (v + m)) * R + (m / (v + m)) * C
where:
R – average rating of the individual product
v – number of votes for that product
C – average vote across all products
m – tuneable parameter: the minimum number of votes required before a product's own rating is trusted (roughly, how many votes you want before an item is displayed)
Since this is the simplified formula, this article explains how it has been derived from the original formula. This article is helpful too in explaining the parameters.
Knowing the formula gets about 50% of your work done; the rest is just importing your data and working with it. Below are examples similar to your problem in case you need a full demonstration:
Github example 1
Github example 2
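As a quick demonstration, here is a minimal pandas sketch of the weighted rating applied to a table shaped like the one in your question; m is just an example value you would tune:

import pandas as pd

# Same shape as the MySQL view in the question
df = pd.DataFrame({
    "itemCode": [12330, 85763, 85333],
    "rating": [5.00, 4.65, 3.11],
    "numOfRatings": [2, 36, 9],
})

# C: mean vote over all ratings; m: minimum votes before a rating is taken at face value
C = (df["rating"] * df["numOfRatings"]).sum() / df["numOfRatings"].sum()
m = 10

v = df["numOfRatings"]
R = df["rating"]
df["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C

# Items with few votes are pulled toward the global mean C,
# so 2 votes of 5.00 no longer outrank 36 votes averaging 4.65.
print(df.sort_values("weighted_rating", ascending=False))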

Choosing a value based on a ranking of another column

I have decent Google-Fu skills, but they've utterly failed me on this.
I'm working in PowerPivot and I'm trying to match up a product with a price point in another table. Sounds easy, right? Well, each product has several price points, based on the source of the price, with a hierarchy of importance.
For Example:
Product 1 has three prices living in a Pricing Ledger:
Price 1 has an account # of A22011
Price 2 has an account # of B22011
Price 3 has an account # of C22011
The A-account price overrides the B-account price, which overrides the C-account price.
What I want to do is pull the most relevant price (i.e., the one with the highest rank in the hierarchy) when not all price points are being used.
I'd originally used a series of IF statements, but that's when there were only four points. We now have ten points, and that might grow, so the IF statements are an untenable solution.
I'd appreciate any help.
-Thanks-
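Outside PowerPivot, the selection rule I'm after looks like the following pandas sketch (the column names and values are made up), in case it helps clarify what I mean:

import pandas as pd

# Placeholder pricing ledger: one row per (product, price source)
ledger = pd.DataFrame({
    "product": ["P1", "P1", "P1", "P2"],
    "account": ["A22011", "B22011", "C22011", "B22011"],
    "price":   [10.0, 9.5, 9.0, 20.0],
})

# Priority of each account prefix; lower number = higher priority.
# Extending this mapping replaces nesting another IF each time a source is added.
priority = {"A": 1, "B": 2, "C": 3}
ledger["rank"] = ledger["account"].str[0].map(priority)

# Keep, for each product, the row with the best available rank
best = ledger.sort_values("rank").groupby("product", as_index=False).first()
print(best[["product", "account", "price"]])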

With MDX is there a generic way to calculate the ratio of cells with regards to the selected members of a specific hierarchy?

I want to define a cube measure in a SSAS Analysis Services Cube (multidimensional model) that calculates ratios for the selection a user makes for a predefined hierarchy. The following example illustrates the desired behavior:
| City    | Amount |
|---------|--------|
| Hamburg | 2      |
| Berlin  | 1      |
| Munich  | 3      |
This is my base table. What I want to achieve is a cube measure that calculates ratios based on a user's selection. E.g. when the user queries Hamburg (2) and Berlin (1), the measure should return the values 67% (for Hamburg) and 33% (for Berlin). However, if Munich (3) is added to the same query, the returned values would be 33% (Hamburg), 17% (Berlin) and 50% (Munich). The sum of the values should always equal 100%, no matter how many hierarchy members are included in the MDX query.
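Outside MDX, what I want the measure to return is simply each selected member's share of the selected total; a small pandas sketch using the numbers from the table above:

import pandas as pd

amounts = pd.Series({"Hamburg": 2, "Berlin": 1, "Munich": 3}, name="Amount")

def ratio_for_selection(selection):
    # Share of each selected city's Amount within the selection (sums to 100%)
    sel = amounts.loc[selection]
    return sel / sel.sum()

print(ratio_for_selection(["Hamburg", "Berlin"]))            # 0.667, 0.333
print(ratio_for_selection(["Hamburg", "Berlin", "Munich"]))  # 0.333, 0.167, 0.500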
So far I have come up with different measures, but they all seem to suffer from the same problem: it appears impossible to access the context of the whole MDX query from within a single cell.
My first approach to this was the following measure:
[Measures].[Ratio] AS SUM([City].MEMBERS,[Measures].[Amount])/[Measures].[Amount]
This, however, sums up the amount of all cities regardless of the user's selection and thus always returns the ratio of a city with regard to the whole city hierarchy.
I also tried to restrict the members to the query context by adding the EXISTING keyword.
[Measures].[Ratio] AS SUM(EXISTING [City].MEMBERS,[Measures].[Amount])/[Measures].[Amount]
But this seems to restrict the context to the cell which means that I get 100% as a result for each cell (because EXISTING [City].MEMBERS is now restricted to a cell it only returns the city of the current cell).
I also googled to find out whether it is possible to add a column or row with totals, but that also does not seem possible within MDX.
The closest I got was with the following measure:
[Measures].[Ratio] AS SUM(Axis(1),[Measures].[Amount])/[Measures].[Amount]
Along with this MDX query
SELECT {[Measures].[Ratio]} ON 0, {[City].[Hamburg],[City].[Berlin]} ON 1 FROM [Cube]
it would yield the correct result. However, this requires the user to put the correct hierarchy for this specific measure onto a specific axis - very error prone, very unintuitive, I don't want to go this way.
Are there any other ideas or approaches that could help me to define this measure?
I would first define a set with the selected cities
[GeoSet] AS {[City].[Hamburg],[City].[Berlin]}
Then the Ratio
[Measures].[Ratio] AS [Measures].[Amount]/SUM([GeoSet],[Measures].[Amount])
To get the ratio of that city to the set of cities. Lastly
SELECT [Measures].[Ratio] ON COLUMNS,
[GeoSet] ON ROWS
FROM [Cube]
Whenever you select a list of cities, change the [GeoSet] to the list of cities, or other levels in the hierarchy, as long as you don't select 2 overlapping values ([City].[Hamburg] and [Region].[DE6], for example).

CUBE dimension reduction with sequential slices

I have data in a cube, organized across 5 axes:
Source (data provider)
GEO (country)
Product (A or B)
Item (Sales, Production, Sales_national)
Date
In short, I have multiple data providers for different Product, Item, GEO and Date, i.e. for different slices of the cube.
Not all "sources" cover all dates, products and countries. Some will have more up-to-date information, but it will be preliminary.
The core of the problem is to have a synthesis of what all sources say.
Importantly, the choice of data provider for each "slice" of the cube is made by the user/analyst, and it needs to be (business knowledge of provider methodology, quality, etc.).
What I am looking for, is a way to create a 'central dictionary' with all the calculation-types.
Such dictionary would be organized like this:
Operation          | Source  | GEO | Item  | Product   | Date_start | Date_end
-------------------|---------|-----|-------|-----------|------------|-----------
Assign             | Source3 | ITA | Sales | Product_A | 01/01/2016 | 01/01/2017
Assign             | Source1 | ITA | Sales | Product_A | 01/01/2017 | last
Assign with %delta | Source2 | ITA | Sales | Product_A | 01/01/2018 | last
This means:
From Jan 2016 to Jan 2017, for Product_A Sales in Italy, take Source3
From Jan 2017 to the last available date, take Source1
From Jan 2018 to the last available date, take the existing values and apply the %difference over time from Source2
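For illustration, a rough pandas sketch of how I imagine such a rule table driving the plain "Assign" operations; the data layout, file name and column names are made up:

import pandas as pd

# Placeholder source cube in long format: one row per Source/GEO/Product/Item/Date
cube = pd.read_csv("source_cube.csv", parse_dates=["Date"])
# columns: Source, GEO, Product, Item, Date, Value

# The "central dictionary" of rules, applied top to bottom (plain Assign only)
rules = pd.DataFrame([
    {"Source": "Source3", "GEO": "ITA", "Item": "Sales", "Product": "Product_A",
     "Date_start": "2016-01-01", "Date_end": "2017-01-01"},
    {"Source": "Source1", "GEO": "ITA", "Item": "Sales", "Product": "Product_A",
     "Date_start": "2017-01-01", "Date_end": "2100-01-01"},
])

keys = ["GEO", "Product", "Item", "Date"]  # dimensions of the 4-D target cube
slices = []
for rule in rules.itertuples():
    mask = (
        (cube["Source"] == rule.Source)
        & (cube["GEO"] == rule.GEO)
        & (cube["Item"] == rule.Item)
        & (cube["Product"] == rule.Product)
        & (cube["Date"].between(rule.Date_start, rule.Date_end))
    )
    slices.append(cube.loc[mask, keys + ["Value"]])

# Stack the selected slices; where rules overlap, the later rule wins
target = pd.concat(slices).drop_duplicates(subset=keys, keep="last")
print(target.sort_values(keys))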
The data and calculation are examples, there are other more complex, but the gist of it is putting slices of the "Source" 5-dimensional cube into a "Target" 4-dimensional cube, with a set of sequential calculations.
In SQL, it is the equivalent of a bunch of filtered SELECTs + INSERT, but the complexity of the calculations will probably lead to lots of nested JOINS.
The solution will most likely be custom functions, but I was wondering if anyone is aware of a language or software other than DAX/MDX that would allow doing this with minimal customization?
Many thanks