PostgreSQL parse countries in array against the countries table - sql

We have content and country tables.
Country is pretty simple:
country_name column defined as string:
Albania,
Belgium,
China,
Denmark etc...
Content is a table with half a million rows of various data, including a countries column defined as a text[] array. Each value there holds a list of countries, like:
{"denmark,finland,france,germany,ireland,gb,italy,netherlands,poland,russia,spain,sweden,australia,brazil,canada,china,india,indonesia,japan,malaysia,vietnam,mexico,"south korea",thailand,usa,singapore,uae"}
The update from the internal team covers about a thousand records, and we are not sure that all countries are spelled correctly. So the task is to reconcile them against country_name in the country table.
I am doing replace(replace(country_array::text,'{',''),'}','') as country_text and am thinking about doing an UNPIVOT to check each column against the country table.
Is there an easier way to make sure the countries array in the Content table contains only valid country names from the country table?
Thank you

You can unnest() each array to a set of rows, and ensure that all values occur in the country table. The following query gives you the array elements that are missing in the reference table:
select *
from
content c
cross join lateral unnest(c.countries) as t(country_name)
left join country y on y.country_name = t.country_name
where y.country_name is null
Demo on DB Fiddle
country table:
id | country_name
-: | :-----------
1 | albania
2 | denmark
content table:
id | countries
-: | :----------------
1 | {albania,denmark}
1 | {albania,france}
query results:
id | countries | country_name
-: | :--------------- | :-----------
1 | {albania,france} | france

If you have doubts about some countries not being spelled correctly, then no doubt there are such examples.
Start by getting the list of countries that are not in the reference table:
select c_country, count(*)
from content c cross join lateral
unnest(c.countries) c_country left join
country co
on co.country_name = c_country
where co.country_name is null
group by c_country
order by count(*) desc;
Then, you can go in and fix the data.
There is nothing wrong a priori with storing values in arrays. However, if you were designing the database from scratch, I would probably recommend a contentCountries table and a countryId. That would ensure unambiguous relationships.
In your case, you should probably fix the ingestion process so the values are known to be correct when input. That might be sufficient, given that you already have a lot of data and just need to fix it.
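As a rough illustration of the contentCountries suggestion above, here is a minimal sketch using Python's built-in sqlite3 module in place of PostgreSQL. The column names (contentId, countryId) and the sample rows are assumptions made for the example; the point is only that a foreign key rejects a country that is not in the reference table.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, country_name TEXT NOT NULL UNIQUE);
    CREATE TABLE content (id INTEGER PRIMARY KEY);
    -- Hypothetical junction table: one row per (content, country) pair
    CREATE TABLE contentCountries (
        contentId INTEGER NOT NULL REFERENCES content(id),
        countryId INTEGER NOT NULL REFERENCES country(id),
        PRIMARY KEY (contentId, countryId)
    );
    INSERT INTO country (id, country_name) VALUES (1, 'albania'), (2, 'denmark');
    INSERT INTO content (id) VALUES (1);
""")

# A known country id is accepted.
conn.execute("INSERT INTO contentCountries (contentId, countryId) VALUES (1, 2)")

# An unknown country id (99) violates the foreign key and is rejected.
try:
    conn.execute("INSERT INTO contentCountries (contentId, countryId) VALUES (1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT contentId, countryId FROM contentCountries").fetchall()
```

With this layout a misspelled country cannot be stored at all, because there is no id for it to point to.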

Related

Language dependent column headers

I am working on a PostgreSQL-based application and am curious whether there is a clever way to have language-dependent column headers.
I know that I can set an alias for a header with the "as" keyword, but that obviously has to be done for every select, over and over again.
So I have a table for converting the technical column name to a mnemonic one, to be shown to the user.
I can handle the mapping in the application, but would prefer a database solution. Is there any?
At least could I set the column header to table.column?
You could use a "view". You can think of a view as a pseudo-table defined by a query over one or more tables. For instance, if I have a table that has the following shape
Table: Pets
Id | Name | OwnerId | AnimalType
1 | Frank| 1 | 1
2 | Jim | 1 | 2
3 | Bobo | 2 | 1
I could create a "view" that changes the Name field to look like PetName instead without changing the table
CREATE VIEW PetView AS
SELECT Id, Name as PetName, OwnerId, AnimalType
FROM Pets
Then I can use the view just like any other table
SELECT PetName
FROM PetView
WHERE AnimalType = 1
Further we could combine another table as well into the view. For instance if we add another table to our DB for Owners then we could create a view that automatically joins the two tables together before subjecting to other queries
Table: Owners
Id | Name
1 | Susan
2 | Ravi
CREATE VIEW PetsAndOwners AS
SELECT p.Id, p.Name as PetName, o.Name as OwnerName, p.AnimalType
FROM Pets p, Owners o
WHERE p.OwnerId = o.Id
Now we can query the new view just like any other table. Note that a view joining several tables, like this one, is not automatically updatable, so inserts and deletes against it are not supported.
SELECT * FROM PetsAndOwners
WHERE OwnerName = 'Susan'
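Since the question already has a table mapping technical column names to display names, one view per language could be generated from it. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 module rather than PostgreSQL, and the column_labels table, the create_language_view helper, and the German labels are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pets (Id INTEGER PRIMARY KEY, Name TEXT, OwnerId INTEGER, AnimalType INTEGER);
    INSERT INTO Pets VALUES (1, 'Frank', 1, 1), (2, 'Jim', 1, 2), (3, 'Bobo', 2, 1);
    -- Hypothetical mapping table: technical column name -> label per language
    CREATE TABLE column_labels (column_name TEXT, lang TEXT, label TEXT);
    INSERT INTO column_labels VALUES
        ('Name', 'en', 'PetName'), ('Name', 'de', 'Tiername'),
        ('OwnerId', 'en', 'Owner'), ('OwnerId', 'de', 'Besitzer');
""")

def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def create_language_view(conn, table, lang):
    """Create a view <table>_<lang> whose column headers come from column_labels."""
    labels = dict(conn.execute(
        "SELECT column_name, label FROM column_labels WHERE lang = ?", (lang,)))
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quote_ident(table)})")]
    # Columns without a label keep their technical name.
    select_list = ", ".join(
        f"{quote_ident(col)} AS {quote_ident(labels.get(col, col))}" for col in columns)
    view = f"{table}_{lang}"
    conn.execute(
        f"CREATE VIEW {quote_ident(view)} AS SELECT {select_list} FROM {quote_ident(table)}")
    return view

view_de = create_language_view(conn, "Pets", "de")
cursor = conn.execute(
    f'SELECT * FROM {quote_ident(view_de)} WHERE "AnimalType" = 1 ORDER BY "Id"')
headers_de = [d[0] for d in cursor.description]
rows_de = cursor.fetchall()
```

The application then selects from the view that matches the user's language instead of aliasing columns in every query. The views have to be regenerated whenever the mapping table or the base table changes.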

Getting count of foreign keys and joining to another list?

I have two tables, lists and items
Lists table looks like this and
the items table looks like this
I am trying to query the database to get the below result,
| list | count |
|------------|-------|
| my list | 1 |
| my list 2 | 5 |
I could get the count with
SELECT count(items.list_id) as count
from items
group by list_id
When I join the lists table to this query, the counts come out wrong. What query gives the correct results? My database is SQLite.
Looking at your sample, it seems you need a join between the tables:
SELECT lists.name, count(items.list_id) as count
from lists
inner join items on lists.id = items.list_id
group by lists.name
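The table screenshots from the question are not available, so the following is a sketch on an assumed schema (lists(id, name) and items(id, list_id)) with made-up rows, run through Python's sqlite3 module. It uses a LEFT JOIN instead of the inner join, a variation that also reports lists with no items as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema; the original screenshots are missing
    CREATE TABLE lists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, list_id INTEGER REFERENCES lists(id));
    INSERT INTO lists VALUES (1, 'my list'), (2, 'my list 2'), (3, 'empty list');
    INSERT INTO items (list_id) VALUES (1), (2), (2), (2), (2), (2);
""")

# COUNT(items.list_id) ignores the NULLs produced by the LEFT JOIN,
# so a list without items is counted as 0 rather than dropped.
counts = conn.execute("""
    SELECT lists.name, COUNT(items.list_id) AS count
    FROM lists
    LEFT JOIN items ON items.list_id = lists.id
    GROUP BY lists.id, lists.name
    ORDER BY lists.id
""").fetchall()
```

Grouping by lists.id as well as lists.name keeps two lists that happen to share a name from being merged into one row.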

MS Access Database tables comparison

I am trying to compare three MS Access tables on a given field. For example, I have a Main table, which holds the records for school children. It has the fields Student ID and Name. Then there are three sub-tables, one per school, and they have some data discrepancies. Let's call these schools A, B and C. These schools have somehow mixed up Student ID with Name, so I need a way to return any Student ID that has a mismatch for Name. The Main table has Student ID as its primary key, and A, B and C have Student ID as their primary key as well. The problem is that when I build relationships in Access, it only returns IDs that are common to all three tables (INNER JOIN). I need an efficient way to match schools A -> B and A -> C and concatenate the results. I think joining each of these in pairs might take far too long. Please let me know if you have any alternatives.
So, you have two problems:
You have bad data that needs to be fixed: Student_ID and Name are mixed up.
Your schema is not good.
Addressing the data issue:
If your student_ids are all numeric, you could try something like:
UPDATE subA SET student_id = [name], [name]=student_id WHERE isnumeric([name]);
And repeat for the other mixed up sub tables.
Addressing the schema issue:
You have three "Subtables" one for each school. These three tables should be a single table, and "School" should be a field in that table. So your data looks something like:
+--------+------------+---------+
| School | Student_Id | Name |
+--------+------------+---------+
| A | 1 | John |
| A | 2 | Jasmine |
| B | 3 | Fred |
| C | 5 | Harold |
| C | 6 | Donna |
+--------+------------+---------+
This way you only join in a single table, and your data only grows in rows as new schools are brought into your database.
Second, if I'm reading your question correctly, you have both student_id and name in the main table as well as the three sub-tables? It seems like you should only keep these in a single table, maybe named student.
Lastly, you can combine the three subtables into a single view that will make it 9000% (guesstimate) easier to join for future queries, using a UNION query:
SELECT 'A' as school, student_id, name FROM subA
UNION ALL
SELECT 'B', student_id, name FROM subB
UNION ALL
SELECT 'C', student_id, name FROM subC
This will stack all three tables on top of each other and give you a schema similar to the example above. You can join to your main table like:
SELECT *
FROM mainTable
INNER JOIN
(
SELECT 'A' as school, student_id, name FROM subA
UNION ALL
SELECT 'B', student_id, name FROM subB
UNION ALL
SELECT 'C', student_id, name FROM subC
) AS subs ON
mainTable.student_id = subs.student_id
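To get from that join to the mismatches the question asks about, a filter on differing names can be added. The sketch below runs the same UNION ALL query through Python's sqlite3 module rather than Access, with made-up sample rows, so treat it as an outline of the idea and not as Access-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mainTable (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subA (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subB (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subC (student_id INTEGER PRIMARY KEY, name TEXT);
    -- Made-up sample data: two school rows disagree with the main table
    INSERT INTO mainTable VALUES (1, 'John'), (2, 'Jasmine'), (3, 'Fred');
    INSERT INTO subA VALUES (1, 'John'), (2, 'Jasmin');
    INSERT INTO subB VALUES (3, 'Fred');
    INSERT INTO subC VALUES (2, 'Jasmine'), (3, 'Frederick');
""")

# Stack the three school tables, join once to the main table,
# and keep only the rows where the names disagree.
mismatches = conn.execute("""
    SELECT subs.school, m.student_id, m.name AS main_name, subs.name AS school_name
    FROM mainTable AS m
    INNER JOIN (
        SELECT 'A' AS school, student_id, name FROM subA
        UNION ALL
        SELECT 'B', student_id, name FROM subB
        UNION ALL
        SELECT 'C', student_id, name FROM subC
    ) AS subs ON m.student_id = subs.student_id
    WHERE subs.name <> m.name
    ORDER BY subs.school, m.student_id
""").fetchall()
```

Each result row names the school, the student ID, and both versions of the name, so the three pairwise comparisons collapse into one query.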

'Implicit' JOIN based on schema's foreign keys?

Hello all :) I'm wondering if there is way to tell the database to look at the schema and infer the JOIN predicate:
+--------------+ +---------------+
| prices | | products |
+--------------+ +---------------+
| price_id (PK)| |-1| product_id(PK)|
| prod_id |*-| | weight |
| shop | +---------------+
| unit_price |
| qty |
+--------------+
Is there a way (preferably in Oracle 10g) to go from:
SELECT * FROM prices JOIN products ON prices.prod_id = products.product_id
to:
SELECT * FROM prices IMPLICIT JOIN products
The closest you can get to not writing the actual join condition is a natural join.
select * from t1 natural join t2
Oracle will look for columns with identical names and join on them (your tables have none: the key is prod_id in one table and product_id in the other). See the documentation on the SELECT statement:
A natural join is based on all columns in the two tables that have the same name. It selects rows from the two tables that have equal values in the relevant columns. If two columns with the same name do not have compatible data types, then an error is raised
This is very poor practice, and I strongly recommend not using it in any environment.
You shouldn't do that. Some DB systems allow it, but what if you later modify the foreign keys (e.g. add new ones)? You should always state what to join on to avoid problems. Most DB systems won't even allow an implicit join, though (good!).
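To make the risk concrete, here is a small sketch of what a natural join does with the tables from the question. It uses Python's sqlite3 module, not Oracle 10g, and the sample rows are made up. Because prices and products share no column name, the natural join has nothing to match on and returns every combination of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, weight REAL);
    CREATE TABLE prices (
        price_id INTEGER PRIMARY KEY,
        prod_id INTEGER REFERENCES products(product_id),
        shop TEXT, unit_price REAL, qty INTEGER
    );
    INSERT INTO products VALUES (1, 0.5), (2, 1.2);
    INSERT INTO prices VALUES (10, 1, 'north', 3.0, 4), (11, 2, 'south', 7.5, 1);
""")

# Explicit join condition: each price row matches exactly one product.
explicit_rows = conn.execute(
    "SELECT * FROM prices JOIN products ON prices.prod_id = products.product_id"
).fetchall()

# NATURAL JOIN matches on identically named columns only. There are none here
# (prod_id vs. product_id), so the result is the full cross product.
natural_rows = conn.execute("SELECT * FROM prices NATURAL JOIN products").fetchall()
```

The declared foreign key plays no part in the natural join; only the column names do.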

Joining Three tables without a matching column

I want to create a geography dimension using SSIS 2008. I have 3 source tables.
Here is the explanation
Table 1 = Country: country code and country name
Table 2 = Post code: post code and city name
Table 3 = Territory : Territory code and Territory name
Here is how data looks
[Table 1= Country]
code name
------------------
US | United states
CA | Canada
[Table 2= post code]
Code city
---------------
1000 | Paris
2000 | Niece
[Table 3= Territory]
Code name
----------------
N | North
S | south
As you can see, there is no single common column. I want to group these 3 tables into the same geography dimension.
So, how can I do it?
Also, this geography dimension will be used together with other dimensions, for example a customer dimension: we want to know the revenue of a client according to his geography, or the top salespersons in some city.
In both the customer and salesperson tables you can find those 3 codes as foreign keys.
You don't need a "common column" shared by all three tables.
You do need a "common column" between each pair of tables. How else are you going to link them?
Q: Is there any column that links "Country" to "City"? You should have a "country code" column in "city".
Q: Is there any way to link "Territory" with either "post code" or "country"? If "Yes": problem solved. Please list the fields. If "No" ... then you need to change your schema.
Based on your comment to paulsm4, you then want to use the tables that hold the linking information to join to each of the 3 tables above.
On the other hand, if you really want to join just those three tables:
select * from Country
full outer join [Post code]
on 'a' = 'a'
full outer join Territory
on 'b' = 'b'
-- geoID is declared as an identity column here (an assumption) so that the key is generated on insert
create table dim.geography (geoID int identity(1,1), citycode int, countrycode char(2), territorycode char(1))
insert into dim.geography (citycode, countrycode, territorycode)
select city, country, territory from Customer
union
select city, country, territory from salesperson
Assuming here that the Customer and salesperson tables hold the codes, not the values, for city, country, and territory.
The code above will build a dimension for the geography you want. Of course if you add any additional unique city,country,territory codes into the customer/salesperson tables you will need to add it to your dimension. This is just an initial load. You may also need to modify the code to account for nulls in any of the three qualifiers.
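For loads after the initial one, a possible approach is to insert only the combinations that are not yet in the dimension. The sketch below is an illustration using Python's sqlite3 module instead of SQL Server; the table name dim_geography, the name column, and the sample rows are assumptions. The EXCEPT step treats NULLs as equal to each other when comparing rows, which is one way to deal with missing codes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed source tables holding codes, as in the answer above
    CREATE TABLE customer (name TEXT, city INTEGER, country TEXT, territory TEXT);
    CREATE TABLE salesperson (name TEXT, city INTEGER, country TEXT, territory TEXT);
    CREATE TABLE dim_geography (
        geoID INTEGER PRIMARY KEY AUTOINCREMENT,
        citycode INTEGER, countrycode TEXT, territorycode TEXT
    );
    INSERT INTO customer VALUES ('Ann', 1000, 'US', 'N'), ('Bob', 2000, 'CA', 'S');
    INSERT INTO salesperson VALUES ('Cy', 1000, 'US', 'N'), ('Di', 2000, 'US', 'S');
""")

# Insert every distinct (city, country, territory) combination that the
# dimension does not contain yet. Safe to run repeatedly.
LOAD_NEW = """
    INSERT INTO dim_geography (citycode, countrycode, territorycode)
    SELECT city, country, territory FROM customer
    UNION
    SELECT city, country, territory FROM salesperson
    EXCEPT
    SELECT citycode, countrycode, territorycode FROM dim_geography
"""

def dimension_size(conn):
    return conn.execute("SELECT COUNT(*) FROM dim_geography").fetchone()[0]

sizes = []
conn.execute(LOAD_NEW)                      # initial load
sizes.append(dimension_size(conn))
conn.execute("INSERT INTO customer VALUES ('Eve', 3000, 'CA', NULL)")
conn.execute(LOAD_NEW)                      # picks up the one new combination
sizes.append(dimension_size(conn))
conn.execute(LOAD_NEW)                      # nothing new, so nothing is added
sizes.append(dimension_size(conn))
```

Running the same statement for the initial load and for later refreshes keeps the dimension free of duplicate combinations.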