PostGIS query for addresses (with OSM data) - SQL

I want to query a PostGIS database filled with OpenStreetMap data, check whether a given address exists in the database and, if so, get its coordinates. The database was populated from a .pbf file using Osmosis. This is the schema for the database: http://pastebin.com/Yigjt77f. I have addresses in the form of city name, street name and house number. The most important table for me is this one:
CREATE TABLE node_tags (
    node_id BIGINT NOT NULL,
    k text NOT NULL,
    v text NOT NULL
);
The k column holds tag keys; the ones I'm interested in are addr:housenumber, addr:street and addr:city, and v holds the corresponding value. First I check whether the city name matches one in the database, then in that result set I search for the street, and then for the house number. The problem is that I don't know how to write a SQL query that gets this result in a single request. Currently I can ask only for the city name, get all node_ids that match my city and save them in my Java program, then query the street for each matching node_id (from the list in my Java program), and so on. This approach is really slow: the more detailed the information gets (city, then street, then number), the more queries I have to make, and on top of that I have a lot of addresses to check. Once I have a matching node_id I can easily find its coordinates, so that part is not a problem.
Example of this table:
node_id | k                 | v
--------+-------------------+----------
    123 | addr:housenumber  | 50
    123 | addr:street       | Kingsway
    123 | addr:city         | London
    123 | (some other tags) | .....
    100 | addr:housenumber  | 121
    100 | addr:street       | Edmund St
    100 | addr:city         | London
I hope I explained my problem clearly.

This is not as easy as you might think. Addresses in OSM are hierarchical, like in the real world. Not all elements in OSM have a full address attached. Some only have addr:housenumber and simply belong to the nearest street. Some have addr:housenumber and addr:street but no addr:city because they simply belong to the nearest city. Or they are enclosed by a boundary relation which specifies the corresponding city. And instead of addr:housenumber there are sometimes also just address interpolations described by the addr:interpolation key. See the addr key wiki page for more information.
The Karlsruhe Schema page in the OSM wiki explains a lot about addresses in OSM. It also mentions associatedStreet relations which are sometimes used to group house numbers and their corresponding streets.
As you can see, a single database query probably won't suffice. If you need some inspiration you can take a look at OSM's address search engine, Nominatim. But note that Nominatim uses a different database schema than the usual one in order to optimize address queries. You can also take a look at one of the many routing applications, which all have to do address lookups.
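That said, for nodes that do have a complete address attached, the "ask only once" part of the question can be handled with a single grouped query against node_tags. A minimal sketch, using the example values from the question (nodes with partial addresses will not be found this way):
SELECT t.node_id
FROM node_tags t
WHERE (t.k, t.v) IN (('addr:city',        'London'),
                     ('addr:street',      'Kingsway'),
                     ('addr:housenumber', '50'))
GROUP BY t.node_id
HAVING count(DISTINCT t.k) = 3;
The HAVING clause keeps only nodes for which all three key/value pairs matched.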

Related

SQL different null values in different rows

I have a quick question regarding writing a SQL query to obtain a complete entry from two or more entries where the data is missing in different columns.
This is the example, suppose I have this table:
Client Id | Name | Email
1234 | John | (null)
1244 | (null) | john@example.com
Would it be possible to write a query that would return the following?
Client Id | Name | Email
1234 | John | john@example.com
I am finding this particularly hard because these are 2 entries in the same table.
I apologize if this is trivial; I am still studying SQL and learning. I tried looking online, but I couldn't phrase the question properly, I suppose, so I couldn't really find the answer I was after.
Many thanks in advance for the help!
Yes, but actually no.
It is possible to write a query that works with your example data.
But only under the assumption that the first part of the email is always equal to the name.
SELECT clients.id, clients.name, bclients.email
FROM clients
JOIN clients bclients
  ON upper(clients.name) = upper(substring(bclients.email from 0 for position('@' in bclients.email)));
Explanation:
We join the table onto itself to get the information into one row.
For this we first search for the position of the '@' in the email, then take the substring from the start (0) of the string for the number of characters until we hit the '@' (the result of position).
To avoid case problems, the name and the substring are cast to uppercase for the comparison.
(Lowercase would work the same.)
The design is flawed
How can a client have multiple ids and different kinds of information about the same user at the same time?
I think you want to split the table between clients and users, so that a user can have multiple clients.
I recommend that you read up on database normalization, as it provides the necessary knowledge for successful database design.
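A rough sketch of what that split could look like (table and column names are illustrative, not a finished design):
CREATE TABLE users (
    id    integer PRIMARY KEY,
    name  text NOT NULL,
    email text NOT NULL UNIQUE
);

CREATE TABLE clients (
    id      integer PRIMARY KEY,
    user_id integer NOT NULL REFERENCES users (id)
);
With this layout each user's name and email live in exactly one row, and a user can have any number of client records.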

How to match phone number prefix to country from phonenumber in SQL

I am trying to extract the country code prefix from a list of numbers, and match them to the region that they belong to. The data might look something like this:
| id | phone_number |
|----|----------------|
| 1 | +27000000000 |
| 2 | +16840000000 |
| 3 | +10000000000 |
| 4 | +27000000000 |
The country codes here are:
American Samoa: +1684
United States and Caribbean: +1
South Africa: +27
And the desired result would be something like this:
| country | count |
|-----------------------------|-------|
| South Africa | 2 |
| American Samoa | 1 |
| United States and Caribbean | 1 |
There are some difficulties: country prefix codes vary from 1 to 4 digits, and even without the country prefix, phone number length varies from place to place.
Also, I do not have write access to this DB, so adding another column, while probably the best solution, will not work in this use case.
This is my current solution:
SELECT
    CASE
        WHEN SUBSTRING(phone_number, 1, 5) = '+1684' THEN 'American Samoa'
        WHEN SUBSTRING(phone_number, 1, 5) = '+1264' THEN 'Anguilla'
        ...
        WHEN SUBSTRING(phone_number, 1, 5) = '+1599' THEN 'Saint Martin'
        WHEN SUBSTRING(phone_number, 1, 4) = '+355' THEN 'Albania'
        WHEN SUBSTRING(phone_number, 1, 4) = '+213' THEN 'Algeria'
        ...
        WHEN SUBSTRING(phone_number, 1, 4) = '+263' THEN 'Zimbabwe'
        WHEN SUBSTRING(phone_number, 1, 3) = '+93' THEN 'Afghanistan'
        WHEN SUBSTRING(phone_number, 1, 3) = '+54' THEN 'Argentina'
        ...
        WHEN SUBSTRING(phone_number, 1, 3) = '+58' THEN 'Venezuela'
        WHEN SUBSTRING(phone_number, 1, 3) = '+84' THEN 'Vietnam'
        WHEN SUBSTRING(phone_number, 1, 2) = '+1' THEN 'United States and Caribbean'
        WHEN SUBSTRING(phone_number, 1, 2) = '+7' THEN 'Kazakhstan, Russia'
        ELSE 'unknown'
    END AS country_name,
    count(*)
FROM users
GROUP BY country_name
ORDER BY count DESC
There are ~205 WHEN ... THEN cases. It seems very inefficient and times out. I assume this is because it runs the pattern matching on every row. This would need to scale to roughly tens of millions of rows.
Is there a more efficient way to do this?
I am using PostgreSQL 9.6.16.
Even though the whole table has to be read, an index could help here. In order to aggregate the data per country code, the DBMS must sort all rows by country code. Sorting is an expensive operation, especially on large data sets. If you had an index on the country codes, the DBMS would find the codes already pre-sorted in the index and could avoid the work of sorting the data.
You don't have the separate country code in a column, but each phone number starts with the code, so you could index the complete phone number:
create index idx on users (phone_number);
Then you must make it obvious to the DBMS that you are interested in the beginning of the string, so it will consider using the index. Invoking a function like SUBSTRING on the phone number is likely to make the DBMS blind to this. Use LIKE instead. According to the docs (https://www.postgresql.org/docs/9.3/indexes-types.html), indexes on strings can be used with LIKE 'something%':
WHEN phone_number LIKE '+1684%' THEN 'American Samoa'
There is no guarantee this will help, but it's worth a try I think. It depends on whether the optimizer sees the advantage of using the pre-sorted phone numbers from the index.
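Putting the two pieces together, a sketch of what that could look like (only a few of the ~205 branches are shown; note that unless the database uses the C locale, the index needs the text_pattern_ops operator class for LIKE prefix matching to be able to use it):
CREATE INDEX idx_users_phone ON users (phone_number text_pattern_ops);

SELECT CASE
           -- longer prefixes must come first so '+1684' wins over '+1'
           WHEN phone_number LIKE '+1684%' THEN 'American Samoa'
           WHEN phone_number LIKE '+27%'   THEN 'South Africa'
           WHEN phone_number LIKE '+1%'    THEN 'United States and Caribbean'
           ELSE 'unknown'
       END AS country_name,
       count(*) AS count
FROM users
GROUP BY country_name
ORDER BY count DESC;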

How to mimic Active Record association using raw SQL query?

In a Rails application I have two tables associated many-to-many: volumes and tracks.
volumes is a collection of radio show episodes. Each episode has multiple tracks. But each track can appear on many shows.
volumes:
id | title
-----+-----------------------
23 | Oldstyle
24 | How to stop worrying
tracks:
id | artist | title
------+----------------------+------------------------
4764 | John Lennon | Mind Games
4765 | George Harrison | All Those Years Ago
4766 | Paul McCartney | Here Today
What I need is to aggregate tracks by artist and show all the volumes that have that artist's tracks. Something like Artist.where(name: 'Beatles').volumes.
The problem is I don't have Artist model. And don't want to create it, because I believe it will make manual data cleanup much harder.
For example, I can query data I need like:
SELECT DISTINCT t.artist, v.number
FROM tracks t
JOIN volumes_tracks vt ON vt.track_id = t.id
JOIN volumes v ON v.id = vt.volumes_id
GROUP BY artist, number;
But is it possible to wrap it model-like data structure for easy access?
You can map each result returned by the SQL query to a Struct, which may give you the ease of access that you're looking for. Something like this:
Artist = Struct.new(:artist, :volume_number)
result = ActiveRecord::Base.connection.execute(sql)
artists = result.map { |r| Artist.new(r['artist'], r['number']) }
first_artist = artists[0].artist
(the sql variable holds your SQL query string; note that the result rows are keyed by the output column names, i.e. 'artist' and 'number', not 't.artist' or 'v.number')

sqlite variable and unknown number of entries in column

I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients | salad_cost |
| apple | fruity | apple | cheap |
| unlikely | meaty | sausages, chorizo | expensive |
| normal | standard | leaves, cucumber, tomatoes | mid |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
- Just enter a single, comma-separated string and split it at run-time. Seems hacky, and I couldn't search by salad_ingredients!
- Have another table for each salad, such as "apple_ingredients", which could have a varying number of rows, one per ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
- Have a separate salad_ingredients table, where each row is a salad_name plus an arbitrary number of ingredient fields, say 10, so you could have up to 10 ingredients. Again, seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
    id
    salad_name
    salad_type
    salad_cost

table ingredients
    id
    name

table salad_ingredients
    id
    id_salad
    id_ingredients

where id_salad is the corresponding id from salads and id_ingredients is the corresponding id from ingredients.
Using proper joins you can get (SELECT) and filter (WHERE) all the values you need.
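A sketch of that layout in SQLite, together with an example join (column types are assumptions; remember to enable foreign key enforcement with PRAGMA foreign_keys = ON):
CREATE TABLE salads (
    id         INTEGER PRIMARY KEY,
    salad_name TEXT NOT NULL,
    salad_type TEXT NOT NULL,
    salad_cost TEXT NOT NULL
);

CREATE TABLE ingredients (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE salad_ingredients (
    id             INTEGER PRIMARY KEY,
    id_salad       INTEGER NOT NULL REFERENCES salads (id),
    id_ingredients INTEGER NOT NULL REFERENCES ingredients (id)
);

-- Example: find every salad that contains cucumber.
SELECT s.salad_name
FROM salads s
JOIN salad_ingredients si ON si.id_salad = s.id
JOIN ingredients i        ON i.id = si.id_ingredients
WHERE i.name = 'cucumber';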

Multiple conflicting facts in database / data warehouse

Our organization is currently in the process of building a new data warehouse. We are actually able to use some techniques borrowed from the DW community, such as ETL processing to conform data, de-normalized dimensions in the Kimball style, etc. Overall, data warehousing is still fairly new to our organization, but we are learning the concepts as we go along.
The problem: We have multiple sources of data, with often conflicting sources of facts. For example, we have a Master Person Index, where we use a score-based matching algorithm during ETL to match an inbound person to an existing person, so even if the inbound record doesn't exactly match, we can score based on other things like zip code radius.
Here's the question: What is the standard way to handle multiple versions of a fact from two or more sources?
I understand one of the main ideas of the data warehouse is to keep a running history of any fact, which we are doing. That's all fine and dandy when a record is maintained by one inbound source: we keep the history of that fact over time. The problem occurs when two different sources, perhaps updating on a daily basis, have two different facts, e.g. source A says the name is Mary Smith, source B says the name is Mary Jane, and this value changes every day! Based on the matching algorithm we're confident it's the same person, but due to our history-style table, the name basically keeps flopping back and forth between the two values every day, because each load reads the name as a "change" from the other data source.
An example table:
first_name | last_name | source | last_updated
-----------+-----------+--------+---------------
Mary       | Smith     | A      | 5/2/12 1:00am
Mary       | Jane      | B      | 5/2/12 2:00am
Mary       | Smith     | A      | 5/3/12 1:00am
Mary       | Jane      | B      | 5/3/12 2:00am
Mary       | Smith     | A      | 5/4/12 1:00am
Mary       | Jane      | B      | 5/4/12 2:00am
...
Have one table that stores your external data:
id | first_name | last_name | source | external_unique_id | import_date
----+------------+-----------+--------+--------------------+-------------
1 | Mary | Smith | A | abcdefg123 | 5/2/12 1:00am
2 | Mary | Jane | B | 1234567abc | 5/2/12 2:00am
Then have a second table that contains your cleaned data:
id | first_name | last_name
----+------------+-----------
1 | Mary | Jane-Smith (or whatever)
Then have a mapping table between the two.
local_person_id | foreign_person_id
-----------------+-------------------
1 | 1
1 | 2
Or something broadly similar.
The objective is to load the facts from your source once, and keep them.
Then use your fuzzy logic to relate them to master records somewhere, which you only need to do when new facts are loaded or old facts change.
Still, you have a choice of which last_name to use, but that can be almost arbitrary in the absence of determining data. For example: pick the last name from the fact loaded most recently.
You can still quickly and simply relate the master to the child facts, to their sources, and to their corresponding data. But you have a unified entity in your warehouse to hang these external facts on.
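A sketch of that three-table layout, using the column names from the examples above (the table names external_person, person and person_mapping are placeholders, and the column types are assumptions):
-- Source facts, loaded once and kept.
CREATE TABLE external_person (
    id                 bigint PRIMARY KEY,
    first_name         text,
    last_name          text,
    source             text NOT NULL,
    external_unique_id text NOT NULL,
    import_date        timestamp NOT NULL,
    UNIQUE (source, external_unique_id)
);

-- Cleaned, unified entities.
CREATE TABLE person (
    id         bigint PRIMARY KEY,
    first_name text,
    last_name  text
);

-- Mapping produced by the fuzzy matching.
CREATE TABLE person_mapping (
    local_person_id   bigint NOT NULL REFERENCES person (id),
    foreign_person_id bigint NOT NULL REFERENCES external_person (id),
    PRIMARY KEY (local_person_id, foreign_person_id)
);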
One thing about terminology - What you've listed are "Attributes", not "Facts". A fact is a measure that you take on a set of dimensional Attributes. (for example, an order that this "person" places, or the dollar value of this customer's recent order, etc). In this case, you have multiple sources of dimensional attributes, each one considered the "same".
@Dems' method above is one way (and a good one) to keep your cleaned data separate from your staging / operational data set.
Another, if you need to have access to both data sets in reporting, while still keeping a "clean" version, would be to put all the attributes on your person/customer dimension:
FIRST_NAME
LAST_NAME
SOURCE1_FIRST_NAME
SOURCE1_LAST_NAME
SOURCE2_FIRST_NAME
SOURCE2_LAST_NAME
For reports on measures where the user community expects to see the name from source 2, you can use the source2 attributes. For people expecting source 1, use those. For people looking for the results of the processing which "conforms" the name, use the main attributes.
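A minimal sketch of such a widened dimension, using the attribute names from the list above (the table name and column types are assumptions):
CREATE TABLE dim_person (
    person_key         bigint PRIMARY KEY,
    first_name         text,   -- conformed ("clean") value
    last_name          text,
    source1_first_name text,   -- value as delivered by source 1
    source1_last_name  text,
    source2_first_name text,   -- value as delivered by source 2
    source2_last_name  text
);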