Adding in missing Country Codes into a dataset (GDP Dataset) - pandas

I have downloaded a dataset which has countries, their codes and their GDP by year in 4 columns (5 if you include the unique row number far left). I noticed however that there are some missing codes for the country codes and was wondering if anyone could help me out and tell me how to get those codes and add them in , probably from a seperate dataset I imagine . You can see this isin the pictures I posted. Second pictures shows the missing country code data. Thanks.
.

Your country codes look like ISO 3166-1, which are only defined for countries and not for the larger entities such as « East Asia » and « Western Offshoots ».
You could roll your own for these entities, see ISO country codes glossary:
User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters [...] AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available.
I think the easiest is to prefix them all with X so you know easily that they are your own codes. Then use the 2 next letters for initials:
East Asia: XEA
Western Offshoots: XWO
etc.

Related

Tricks to exceed column limitations in SQL Database

Hello swarm intelligence!
I have the following use case: For every movie that is requested by a user, I create a number of tags for that specific movie, derived from several sources (actors, plot etc.. ).
I will use this data for associaton mining.
The problem: If I use the movie for rows and the tags for columns, the tags will easily exceed the technical limitations of 3000 columns ( there is even more actors, and then plot keywords etc)
Is there any way, I can organize this data to then use it for (quick) association mining?
Thanks a lot
Don't put tags in columns. Instead create a separate table, named something like movie_tags with two columns, movie_id and tag. Put each tag in a separate row of that table.
This is known as "normalizing" your data. Here's a nice walkthrough with an example very similar to yours.
Edit: Let's say you have a catalog of movies about the Italian Mafia in New York City in the 20th century. Let's say the movies are
1 Godfather
2 Goodfellas
3 Godfather II
Then your movie_tags table might contain these rows.
1 Gangsters
2 Gangsters
3 Gangsters
1 Francis Ford Coppola
3 Francis Ford Coppola
2 Martin Scorsese
Pro tip If you find yourself thinking about putting lots of data items with the same meaning in their own columns, you probably need to normalize the data and add appropriate tables.

How to extract 4 words before each word of a given list in sql

I have got a table with a column containing text (the column name is 'Text'). There are some acronyms in brackets, so I would like to extract them along with the five words appearing before them.
I have already extracted the rows that contain all the acronyms of my list using the like operator:
select Text from table
where Text like '(NASA)'
or Text like '(NBA)'
In stead of getting an output of the whole text in each row:
Text
He works for the National Aeronautics and Space Administration (NASA).
He played basketball for the National Basketball Association (NBA) from 2000 to 2002.
I would like to get the output of two columns one for the acronym and another for the meaning of the acronym (showing the five words prior to the acronym):
Acronym Meaning
(NASA) National Aeronautics and Space Administration
(NBA) for the National Basketball Association
Without actually seeing your data, I will assume that all the acronyms follow the same pattern but you should be able to adapt the code with the correct logic if your strings are structured differently. In this case '(Acronym) meaning' is the structure which I'm going to work with.
select '(NASA) National Aeronautics and Space Administration' as text
into #temp1
union all
select '(FBI) Federal Bureau of Investigation' as text
select SUBSTRING(text,CHARINDEX('(',text)+1 ,CHARINDEX(')',text)-CHARINDEX('(',text)-1) as Acronym,
SUBSTRING(text,CHARINDEX(')',text)+2 ,len(text)-CHARINDEX(')',text)+1) as meaning
from #temp1
This code subsets the original string by using character positions in the string between the brackets for the acronym and then character positions starting after closed brackets for the meanings.

Oracle REGEXP_SUBSTR for string matching b/w two columns

The problem
Users are frequently inputting "country name" strings into the "city name" field. Heuristically, this appears to be an extremely common practice. For example, a user might put "TAIPEI TAIWAN" in the city name when only "TAIPEI" should be input and then the country would be "TAIWAN". I am working to aggregate these instances for this specific field (your help will allow me to expand this to other columns and tables) and then identify where possible rankings associated with strictly the "country" names in the "city" field.
I have two tables that I am attempting to leverage to track down data validation issues. Tbl1 is named "Customer_Address" comprised of geographic columns like (Customer_Num, Address, City_Name, State, Country_Code, Zipcode). Tbl2 named "HR_Countries" is clean table of 2-digit ISO country codes with their corresponding name values (Lebanon, Taiwan, China, Syria, Russia, Ukraine, etc) and some other fields not presently used.
The initial step is to query "Customer_Address" to find City_Names LIKE a series of OR statements (LIKE '%CHINA', OR LIKE 'TAIWAN', OR etc etc) and count the number of occurrences where the City_Name is like the designated country_name string I passed it and the results are pretty good. I've coded in some exclusions to deal with things like "Lebanon, OH" so my overall results are satisfactory for the first phase.
Part of the query does a LEFT join from Tbl1 to Tbl2 to add the risk rating from tbl2 as a result of the query against tbl1:
LEFT JOIN tbl2 risk
ON INSTR(addr.CITY_NM, risk.COUNTRY_NAME,1) <> 0
Example of Tbl1 Data Output (head(tbl1), n=7)
CountryNameInCity CountOfOccurences RR
China 15 High
Taiwan 2000 Medium
Japan 250 Low
Taipei, Taiwan 25 NULL
Kabul, Afghanistan 10 NULL
Shenzen China 100 NULL
Afghanistan 52 Very High
Example of Tb2 Data (head(tbl2), n=6)
CountryName CountryCode RR
China CN High
Taiwan TW High
Iraq IQ Very High
Cuba CU Medium
Lebanon LB Very High
Greece GR High
So my question(s) are as follows:
1) Instead of manually passing in a series of OR-statements for country codes is there a better way to using Tbl2 as the matching "LIKE" driving the query?
2) Can you recommend a better way of comparing the output of the query (see Tbl1 example) and ensuring that multiple strings (Taipei, Taiwan, etc) are appropriately aggregated and bring back the correct 'RR' rating.
Thanks for taking the time to review this and respond.

Get following letter from given

I have a table with company names. Some companies have different locations and different legal names but they should be reported under the same Group Code. The Code is made up using the first five letters.
Company GroupCode
DEEZER FRANCE DEEZE
DEEZER SPAIN DEEZE
DEEZER ALGERIA DEEZE
So far so good. Now I’m adding a different company which starts with the same letters but should get a new Group Code.
A new Code should be assigned if the company name does not contain a word which is part of a company name already having a GroupCode. In this Case DEEZER is the key word which determines association with GroupCode DEEZE
Rule is that the code should then use the first four letters + the fifth letter next in the alphabet. If this code also exists then use the first four letters + the fifth letter next but one in the alphabet. The required result would look like:
Company GroupCode Status
DEEZER FRANCE DEEZE EXISTING
DEEZER SPAIN DEEZE EXISITNG
DEEZER ALGERIA DEEZE EXISTING
DEEZEMBER DEEZF CREATED
DEEZEMAL DEEZG CREATED
So what I need to figure out is the next „unused“ letter. How can I achieve this with SQL Server 2008 R2?
Try this:
;with cte as
(select max(groupcode) maxcode
from yourtable
where left(code,4) = left(#companyname,4))
insert into yourtable (company, groupcode, [status])
select #companyname,
case when maxcode is null then left(#companyname,4) + 'a' else left(maxcode,4) + char(ascii(right(maxcode,1))+1) end,
'created'
from cte
Assumption: Your input is taking the company name as a parameter from somewhere, presumably the front end.
The idea is to use ascii function to get the ASCII code of the last letter, increment it by 1 and go back to the corresponding character using char function.
Be warned, however, that this is definitely not the best solution. For instance, I have not implemented bounds checking to ensure range between A and Z. In fact, I would suggest that you handle this in application code rather than at DB level.

Sort Postcode for menu/list

I need to sort a list of UK postcodes in to order.
Is there a simple way to do it?
UK postcodes are made up of letters and numbers:
see for full info of the format:
http://en.wikipedia.org/wiki/UK_postcodes
But my problem is this a simple alpha sort doesn't work because each code starts with 1 or two letters letters and then is immediately followed by a number , up to two digits, then a space another number then a letter. e.g. LS1 1AA or ls28 1AA, there is also another case where once the numbers in the first section exceed 99 then it continues 9A etc.
Alpha sort cause the 10s to immediately follow the 1:
...
LS1 9ZZ
LS10 1AA
...
LS2
I'm looking at creating a SQL function to convert the printable Postcode into a sortable postcode e.g. 'LS1 9ZZ' would become 'LS01 9ZZ', then use this function in the order by clause.
Has anybody done this or anything similar already?
You need to think of this as a tokenization issue so SW1A 1AA should tokenize to:
SW
1
A
1AA
(although you could break the inward part down into 1 and AA if you wanted to)
and G12 8QT should tokenize to:
G
12
(empty string)
8QT
Once you have broken the postcode down into those component parts then sorting should be easy enough. There is an exception with the GIR 0AA postcode but you can just hardcode a test for that one
edit: some more thoughts on tokenization
For the sample postcode SW1A 1AA, SW is the postcode area, 1A is the postcode district (which we'll break into two parts for sorting purposes), 1 is the postcode sector and AA is the unit postcode.
These are the valid postcode formats (source: Royal Mail PAF user guide page 8 - link at bottom of this page):
AN NAA
AAN NAA
ANN NAA
ANA NAA
AAA NAA (only for GIR 0AA code)
AANN NAA
AANA NAA
So a rough algorithm would be (assuming we want to separate the sector and unit postcode):
code = GIR 0AA? Tokenize to GI/R/ /0/AA (treating R as the district simplifies things)
code 5 letters long e.g G1 3AF? Tokenize to G/1/ /3/AF
code 6 letters long with 3rd character being a letter e.g. W1P 1HQ? Tokenize to W/1/P/1/HQ
code 6 letters long with 2nd character being a letter e.g. CR2 6XH? Tokenize to CR/2/ /6/XH
code 7 letters long with 4th character being a letter e.g. EC1A 1BB? Tokenize to EC/1/A/1/BB
otherwise e.g. TW14 2ZZ, tokenize to TW/14/ /2/ZZ
If the purpose is to display a list of postcodes for the user to choose from then I would adopt Neil Butterworth's suggestion of storing a 'sortable' version of the postcode in the database. The easiest way to create a sortable version is to pad all entries to nine characters:
two characters for the area (right-pad if shorter)
two for the district number (left-pad if shorter)
one for the district letter (pad if missing)
space
one for the sector
two for the unit
and GIR 0AA is a slight exception again. If you pad with spaces then the sort order should be correct. Examples using # to represent a space:
W1#1AA => W##1##1AA
WC1#1AA => WC#1##1AA
W10#1AA => W#10##1AA
W1W#1AA => W##1W#1AA
GIR#0AA => GI#R##0AA
WC10#1AA => WC10##1AA
WC1W#1AA => WC#1W#1AA
You need to right-pad the area if it's too short: left-padding produces the wrong sort order. All of the single letter areas - B, E, G, L, M, N, S, W - would sort before all of the two-letter areas - AB, AL, ..., ZE - if you left-padded
The district number needs to be left padded to ensure that the natural W1, W2, ..., W9, W10 order remains intact
I know this is a couple of years late but i too have just experienced this problem.
I have managed to over come it with the following code, so thought i would share as i searched the internet and could not find anything!
mysql_query("SELECT SUBSTRING_INDEX(postcode,' ',1) as p1, SUBSTRING_INDEX(postcode,' ',-1) as p2 from `table` ORDER BY LENGTH(p1), p1, p2 ASC");
This code will take a Full UK postcode and split it into 2.
It will then order by the first part of the postcode followed by the second.
I'd be tempted to store the normalised postcode in the database along with the real postcode - that way you only do the string manipulation once, and you can use an index to help you with the sort.