I have a database with a column that has geographical location data like this:
party_no ...misc_cols_in_between... loc_reverse_geoloc_adr
------------------------------------------------------------------------------------------------
013 ...data... 367 Cleta Tunnel, South Moriahton, Palestinian Territory
666 ...data... NULL
185 ...data... Schillerstraße 50-50, 08340, Schwarzenberg/Erzgeb., Sachsen, DE
267 ...data... 18701-18999 N County Road 2500E, Oakland, IL, 61943, US
389 ...data... Le Rocher-Percé, Québec, CA
666 ...data... 94531, Antioch, CA, US
185 ...data... 76 Willochra Rd, Salisbury Plain, South Australia, 5109, AU
As you can see, not all the geographical location data is in the same format. I basically need to run multiple queries, one of them being, for example, the most frequently appearing city/state/country and which party number it's linked to.
Would wildcards using LIKE % be the best bet here? I'm thinking I need some way to parse out the data in the column on the comma delimiter, since it seems the city, state, and country are delimited by commas. However, sometimes the address is missing and the data isn't always formatted the same.
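Splitting on the comma delimiter is a reasonable start. As a rough sketch, assuming MySQL (for SUBSTRING_INDEX and GROUP_CONCAT) and a hypothetical table name party, the most frequent trailing "city, state, country" triple could be pulled like this; rows with fewer comma-separated parts (like the Palestinian Territory row) will misparse and need extra handling:
-- Take the last three comma-separated tokens as "city, state, country",
-- count how often each triple occurs, and list the party numbers linked to it.
SELECT TRIM(SUBSTRING_INDEX(loc_reverse_geoloc_adr, ',', -3)) AS city_state_country,
       COUNT(*)                                               AS occurrences,
       GROUP_CONCAT(DISTINCT party_no)                        AS party_nos
FROM   party  -- hypothetical table name
WHERE  loc_reverse_geoloc_adr IS NOT NULL
GROUP  BY city_state_country
ORDER  BY occurrences DESC
LIMIT  1;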
Related
I have a csv file with data on store sales for each province, including the store ID. I've already figured out how to get a list of the provinces with the most sales and a list of the stores with the most sales, but now I need to calculate: 1) the average store sales for each province, 2) the best-selling store in each province, and 3) the difference between them. The data looks like this:
>>> store_sales
sales
store_num province
1396 ONTARIO 223705.21
1891 ONTARIO 71506.85
4823 MANITOBA 114692.70
4861 MANITOBA 257.69
6905 ONTARIO 19713.24
6973 ONTARIO 336392.25
7104 BRITISH COLUMBIA 32233.31
7125 BRITISH COLUMBIA 11873.71
7167 BRITISH COLUMBIA 87488.70
7175 BRITISH COLUMBIA 14096.53
7194 BRITISH COLUMBIA 6327.60
7238 ALBERTA 1958.75
7247 ALBERTA 6231.31
7269 ALBERTA 451.56
7296 ALBERTA 184410.04
7317 SASKATCHEWAN 43491.55
8142 ONTARIO 429871.74
8161 ONTARIO 6479.71
9604 ONTARIO 20823.49
9609 ONTARIO 148.02
9802 ALBERTA 54101.00
9807 ALBERTA 543703.84
I was able to get there by using the following:
import pandas as pd
df = pd.read_csv('/path/to/sales.csv')
# total sales per store (keyed by store_num and its province)
store_sales = df.groupby(['store_num', 'province']).agg({'sales': 'sum'})
I think 3) is probably pretty simple, but for 1) I'm not sure how to apply an average to subsets of specific rows (I imagine it involves groupby). For 2), although I was able to generate a list of the best-selling stores, I'm uncertain how to display a single top store for each province (although something tells me it should be simpler than it seems).
For (1), you just need to group the summed store_sales frame by province and take the mean:
store_sales.groupby("province").mean()
For (2), you just need to apply a different aggregate to the same grouping:
store_sales.groupby("province").max()
For (3), the difference can be calculated by subtracting (1) from (2):
store_sales.groupby("province").max() - store_sales.groupby("province").mean()
Database name: landmarks
landmark(landmarkId(PK),landmarkName,type,GPSCoordinates,locationId)
Comments(commentId(PK),comment,landmarkId)
Location(locationId(PK),streetName,streetNumber,buildingNumber,suburbId)
Suburb(suburbId(PK),suburbName,cityId)
City(cityId(PK),cityName,stateId)
State(stateId(PK),stateName,countryId)
Country(countryId(PK),countryName)
I'm trying to write a query to get the closest landmark whose type = 'coffee shop' to a specific address.
For example:
The closest coffee shop to 54 George Street, Los Angeles, California, USA.
I'm new to spatial databases; what is the best optimised SQL query for this example?
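For illustration, a minimal sketch of the usual shape of such a query, assuming MySQL 5.7+ spatial functions, GPSCoordinates stored as a POINT, and the target address already geocoded (the longitude/latitude below are placeholders, not the real coordinates of that address):
-- ST_Distance_Sphere returns the distance in metres between two points.
SELECT l.landmarkId,
       l.landmarkName,
       ST_Distance_Sphere(l.GPSCoordinates,
                          ST_GeomFromText('POINT(-118.25 34.05)')) AS metres
FROM   landmark l
WHERE  l.type = 'coffee shop'
ORDER  BY metres
LIMIT  1;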
I have the following data set as a data model.
Country City AssetCount
USA Newyork 50
USA Washington 40
USA California 30
India Bangalore 100
India Delhi 50
India Bombay 30
I want to show the sum of AssetCount at the country level and the city level on the same row.
There are two slicers for slicing City & Country as below:
Country slicer: USA, India
City slicer: Newyork, Washington, California, Bangalore, Delhi, Bombay
So when I select India in the Country slicer, it should show the sum of AssetCount at the country (India) level.
In the same way, when I select Delhi in the City slicer, it should show AssetCount at the city (Delhi) level.
India Delhi
180 50
Is this possible in PowerPivot using DAX?
Related content from their question on MSDN
Actually, your solution is not working. I have created the hierarchy Country --> City and kept that in Rows. So when I select a particular Country and City, it shows this:
Row Labels AssetCount
USA 40
Washington 40
Grand Total 40
But I want
USA Washington
120 40
or maybe like
USA 120
Washington 40
I have tried some aggregate functions like below:
=SUMX(VALUES(Query[City]),CALCULATE(SUM(Query[AssetCount])))
=CALCULATE(SUM(Query[AssetCount]),SUMMARIZE('Query',Query[City]))
Here Query is the Data Model table, and City can be replaced by Country, but neither is working.
So, is showing such counts on the same row possible or not?
Sounds like you are just getting started with Power Pivot. You might browse through the links on this page for more help.
I took the data you provided and pasted it into Excel.
Selected the data and clicked Add to Data Model and checked the box for My Data Has Headers.
I made sure the AssetCount Column had a data type of whole number. Then clicked the Pivot Table button and created a pivot table on my existing spreadsheet.
I put AssetCount in the Values area and made sure it was set to Sum in the Value Field Settings.
I selected my pivot table and then went to the Analyze tab under PivotTable Tools and clicked the Insert Slicer button.
I selected both Country and City as slicers.
This gives your desired result.
If you want the two numbers in a row, that's pretty straightforward. Keep in mind that all those slicers do is put filters on the pivot table.
Therefore, to get your city result, you could use either an implicit or an explicit measure that simply sums up AssetCount.
For the country result, you'd want to override the city filter like this:
=CALCULATE(SUM(Query[AssetCount]), ALL(Query[City]))
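As a sketch, assuming the Query table above, the pair of explicit measures could look like this, with the first responding to the City slicer and the second ignoring it:
CityAssetCount := SUM(Query[AssetCount])
CountryAssetCount := CALCULATE(SUM(Query[AssetCount]), ALL(Query[City]))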
If you also need the country and city names there, it gets a bit tricky.
I have to compare addresses from two tables and get the Id if the address matches.
Each table has three columns: Houseno, street, state.
The addresses are not in a standard format in either of the tables, and there are approx. 50,000 rows I need to scan through.
In some places it's Ave., Avenue, or Ave; Str or Street; ST.; Lane or Ln.; Place or PL; Cir or CIRCLE.
Any combination with a dot, comma, spaces, or hyphen.
I was thinking of combining all three columns. What would be the best way to do this in SQL or PL/SQL? For example:
table1
HNO STR State
----- ----- -----
12 6th Ave NY
10 3rd Aven SD
12-11 Fouth St NJ
11 sixth Lane NY
A23 Main Parkway NY
A-21 124 th Str. VA
table2
id HNO STR state
-- ----- ----- -----
1 12 6 Ave. NY
13 10 3 Avenue SD
15 1121 Fouth Street NJ
33 23 9th Lane NY
24 X23 Main Cir. NY
34 A1 124th Street VA
There is no simple way to achieve what you want. There is expensive software (google for "address standardization software") that can do this, but rarely 100% automatically.
What this type of software does is take the data, use complex heuristics to try to figure out the "official" address, and then return that (sometimes with a confidence that the result is correct, sometimes a list of results sorted by confidence).
For a small percentage of the data, the software will simply not work and you'll have to fix that yourself.
Oracle has a built-in package, UTL_MATCH, which has an edit_distance function (based on the Levenshtein algorithm; this is a measure of how many changes you would need to make to one string to turn it into another). More info about this package/function can be found here: http://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm
You would need to make some decisions about whether to compare each column separately or concatenate and then compare, and about what a reasonable threshold is. For example, you may want to do a manual check on any pair with an edit distance of less than 8 on the concatenated values.
Let me know if you want any help with the syntax, the edit_distance function just takes 2 varchar2 args (the strings you want to compare) and returns a number.
This is not a perfect solution: if you set the threshold high you will have a lot of manual checking to do to discard false matches, and if you set it too low you will miss some matches. But it may be about the best you can do if you want a relatively simple solution.
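As a minimal sketch of that idea, assuming the table1/table2 layout from the question and an illustrative threshold of 8 (tune it against a manually checked sample):
-- Pair up rows whose concatenated addresses are within an edit distance of 8,
-- pre-filtering on state to keep the comparison cheap.
SELECT t2.id,
       t1.hno, t1.str, t1.state,
       UTL_MATCH.EDIT_DISTANCE(
         t1.hno || ' ' || t1.str || ' ' || t1.state,
         t2.hno || ' ' || t2.str || ' ' || t2.state) AS dist
FROM   table1 t1
       JOIN table2 t2 ON t1.state = t2.state
WHERE  UTL_MATCH.EDIT_DISTANCE(
         t1.hno || ' ' || t1.str || ' ' || t1.state,
         t2.hno || ' ' || t2.str || ' ' || t2.state) <= 8
ORDER BY dist;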
The way we did this for one of our applications was to use a third-party address normalization API (e.g. Pitney Bowes), normalize each address (an address being a combination of street address, city, state, and ZIP), and create a T-SQL hash for that address. For the address to compare, do the same thing, then compare the two hashes; if they match, we have a match.
You can open a cursor that first groups rows whose house number and city are equal.
Then, in a loop, you can split each row into words with INSTR and SUBSTR, using chr(32) (the space character) as the separator, as sketched below.
After that, you can compare the pieces, treating a number and its ordinal (6 = 6th) as equal, and likewise abbreviations (street = str).
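A rough PL/SQL sketch of the tokenizing step, assuming the table1 layout from the question (the comparison itself is left as a placeholder):
DECLARE
  v_str   VARCHAR2(200);
  v_pos   PLS_INTEGER;
  v_word  VARCHAR2(100);
BEGIN
  FOR r IN (SELECT hno, str, state FROM table1) LOOP
    v_str := r.str || ' ';  -- trailing space so the last word is captured
    WHILE INSTR(v_str, CHR(32)) > 0 LOOP
      v_pos  := INSTR(v_str, CHR(32));
      v_word := SUBSTR(v_str, 1, v_pos - 1);
      v_str  := SUBSTR(v_str, v_pos + 1);
      DBMS_OUTPUT.PUT_LINE(r.hno || ': ' || v_word);  -- compare v_word here
    END LOOP;
  END LOOP;
END;
/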
Good luck!
I have a table called 'Resources' that looks like this:
Country City Street Headcount
UK Halifax High Street 20
United Kingdom Oxford High Street 30
Canada Halifax North St 40
Because of the nature of the location fields, I need to map them to a single 'Address' field, and so I also have the following table called 'Addresses':
Country City Street Address
UK Halifax High Street High Street, Halifax, UK
Canada Halifax North St North Street, Halifax, Canada
United Kingdom Oxford High Street High Street, Oxford, UK
(In reality the Address field does add information rather than just combining what is already there.)
I am currently using the following SQL to produce the query:
SELECT Resources.Country, Resources.City, Resources.Street, Addresses.Address,
Resources.Headcount
FROM Resources
INNER JOIN Addresses ON Resources.Country = Addresses.Country
AND Resources.City = Addresses.City
AND Resources.Street = Addresses.Street
This works for me, but I am worried because I have not seen people use this many ANDs in a single join elsewhere, so I don't know if it is a bad idea. (This is a simplified version; I may need up to 8 ANDs in a single join in another case.) Is this the best way to approach the problem, or is there a better solution?
Thanks
Joining on multiple columns is fine. You don't have to "fear" this.
As far as "a better way" goes: I would suggest creating some table variables, putting some data in them, and posting that T-SQL (DDL and DML) here. Then you can get some possible alternatives. Your question is vague at present (in regard to the "is there a better way" portion).
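For reference, a sketch of the kind of repro script meant here (T-SQL table variables, names taken from the question, data abbreviated):
-- Declare the two tables as table variables and seed a row in each.
DECLARE @Resources TABLE (Country varchar(50), City varchar(50),
                          Street varchar(50), Headcount int);
DECLARE @Addresses TABLE (Country varchar(50), City varchar(50),
                          Street varchar(50), Address varchar(200));

INSERT INTO @Resources VALUES ('UK', 'Halifax', 'High Street', 20);
INSERT INTO @Addresses VALUES ('UK', 'Halifax', 'High Street',
                               'High Street, Halifax, UK');

-- The same multi-column join as in the question, against the table variables.
SELECT r.Country, r.City, r.Street, a.Address, r.Headcount
FROM   @Resources r
       INNER JOIN @Addresses a ON  r.Country = a.Country
                               AND r.City    = a.City
                               AND r.Street  = a.Street;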