How to find a specific word within a phrase?

I want to find a specific word within a phrase. I am trying to find the word "Pizza" within a set of keywords, but no keyword consists of "Pizza" alone. There are keywords such as "Pizza Delivery" and "Pizza Delivery Boy", but they don't show up. How can I do this?
Desired output:
MOVIE                               KEYWORD
----------------------------------- ----------------------------------
Drive Angry                         Waitress
Taken                               France
Saving Private Ryan                 France
30 Minutes or Less                  Pizza Delivery
30 Minutes or Less                  Pizza Delivery Boy
My script:
SELECT MovieTitle AS "MOVIE", KEYWORDDESC AS "KEYWORD"
FROM TBLMOVIE
JOIN TBLKEYWORDDETAIL ON TBLMOVIE.MOVIEID = TBLKEYWORDDETAIL.MOVIEID
JOIN TBLKEYWORD ON TBLKEYWORDDETAIL.KEYWORDID = TBLKEYWORD.KEYWORDID
WHERE TBLKEYWORD.KEYWORDDESC IN ('France', 'Waitress', 'Pizza');
My output:
MOVIE                               KEYWORD
----------------------------------- ----------------------------------
Drive Angry                         Waitress
Taken                               France
Saving Private Ryan                 France

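IN compares whole values, so 'Pizza' never matches 'Pizza Delivery'. To match a word anywhere inside the keyword, one method uses LIKE: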
WHERE TBLKEYWORD.KEYWORDDESC LIKE '%France%' OR
TBLKEYWORD.KEYWORDDESC LIKE '%Waitress%' OR
TBLKEYWORD.KEYWORDDESC LIKE '%Pizza%'
Another method uses REGEXP_LIKE():
WHERE REGEXP_LIKE(TBLKEYWORD.KEYWORDDESC, 'France|Waitress|Pizza')
If you use REGEXP_LIKE() you should spend a little bit of time learning about regular expressions and how to use them.
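As a rough illustration of the LIKE approach, here is a minimal sketch that runs against an in-memory SQLite database through Python's sqlite3 module. The sample rows (including "Heist") and the helper name keywords_containing are made up for the example, and only the TBLKEYWORD table is modelled. Note that SQLite's LIKE is case-insensitive for ASCII letters by default, which may differ from the database you are using.

```python
import sqlite3

# Hypothetical in-memory copy of the keyword table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TBLKEYWORD (KEYWORDID INTEGER PRIMARY KEY, KEYWORDDESC TEXT)"
)
conn.executemany(
    "INSERT INTO TBLKEYWORD (KEYWORDDESC) VALUES (?)",
    [("Waitress",), ("France",), ("Pizza Delivery",),
     ("Pizza Delivery Boy",), ("Heist",)],
)


def keywords_containing(words):
    """Return keyword descriptions that contain any of the given words."""
    # One "KEYWORDDESC LIKE ?" clause per word, joined with OR.
    where = " OR ".join("KEYWORDDESC LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    sql = f"SELECT KEYWORDDESC FROM TBLKEYWORD WHERE {where} ORDER BY KEYWORDID"
    return [row[0] for row in conn.execute(sql, params)]


# IN only matches whole values, so the two "Pizza ..." keywords are missed.
exact = [row[0] for row in conn.execute(
    "SELECT KEYWORDDESC FROM TBLKEYWORD "
    "WHERE KEYWORDDESC IN ('France', 'Waitress', 'Pizza') ORDER BY KEYWORDID"
)]

# LIKE with wildcards on both sides matches the word anywhere in the value.
partial = keywords_containing(["France", "Waitress", "Pizza"])

print(exact)
print(partial)
```

The search words are passed as bound parameters rather than pasted into the SQL string, which avoids quoting problems if a keyword contains an apostrophe.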

Related

SQL how to display results only if two parts are unique

I'm currently having an issue trying to write a query that displays rows only if two fields together are unique. For example, let's say the fields currently displayed are as follows:
SELECT
Name,
CompanyName,
JobStartDate,
Birthday,
Age,
`Favorite Ice Cream`,
Height
FROM `sample_person_data`
How would I set this so that it only displays rows where the combination of CompanyName and JobStartDate is distinct?
At first I thought adding DISTINCT would be enough, but I realized that would not work. I then thought of treating CompanyName + JobStartDate together as the unique key, so that only one row is shown per combination, but I could not work out how to implement it.
Essentially, given a large dataset with some repeated values, how can I display only the unique rows? I use CompanyName and JobStartDate as examples here, but I understand that people can start at the same company on the same day, so the idea would need to expand to compare more columns.
Thank you for your time.
EDIT: Based on comments I am trying to provide further detail by example
Say this is the sample data:
Name   | CompanyName | JobStartDate | Birthday | Age | Favorite Ice Cream | Height
John   | Google      | 04-17-00     | 01-01-78 | 50  | Vanilla            | 5-7
John   | Google      | 04-17-00     | 01-01-78 | 50  | Chocolate          | 5-7
John   | Microsoft   | 04-17-00     | 02-01-95 | 30  | Chocolate          | 5-8
Nancy  | Google      | 06-27-00     | 04-01-78 | 50  | Vanilla            | 5-2
Joanna | Google      | 08-19-00     | 05-01-78 | 50  | Vanilla            | 5-0
Here the same John from Google filled in the form twice because, say, he decided to change his favorite ice cream. How do I edit the query so that it displays the following?
Name   | CompanyName | JobStartDate | Birthday | Age | Favorite Ice Cream | Height
John   | Google      | 04-17-00     | 01-01-78 | 50  | Vanilla            | 5-7
John   | Microsoft   | 04-17-00     | 02-01-95 | 30  | Chocolate          | 5-8
Nancy  | Google      | 06-27-00     | 04-01-78 | 50  | Vanilla            | 5-2
Joanna | Google      | 08-19-00     | 05-01-78 | 50  | Vanilla            | 5-0
I don't really care whether his favorite ice cream shows up as Chocolate or Vanilla, only that a single entry for John from Google shows up, using the current company + job start date as the identifying fields, for example.

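Use the simple approach below:
select * from your_table
qualify 1 = row_number() over(partition by CompanyName, JobStartDate)
Applied to the sample data in your question, this keeps one row per CompanyName and JobStartDate pair. Because the window has no ORDER BY, which of the duplicate rows survives is arbitrary.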
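QUALIFY is not available in every engine. As a hedged sketch of the same idea, the example below filters on ROW_NUMBER() through a subquery instead, using SQLite (which has no QUALIFY) via Python's sqlite3 module. It assumes SQLite 3.25 or later for window functions, trims the table to four columns, renames the ice cream column to FavoriteIceCream for convenience, and orders each partition by SQLite's implicit rowid so that the first inserted row wins; all of those are choices made for the illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down version of the sample table (only the columns needed here).
conn.execute(
    "CREATE TABLE sample_person_data ("
    "Name TEXT, CompanyName TEXT, JobStartDate TEXT, FavoriteIceCream TEXT)"
)
conn.executemany(
    "INSERT INTO sample_person_data VALUES (?, ?, ?, ?)",
    [
        ("John", "Google", "04-17-00", "Vanilla"),
        ("John", "Google", "04-17-00", "Chocolate"),
        ("John", "Microsoft", "04-17-00", "Chocolate"),
        ("Nancy", "Google", "06-27-00", "Vanilla"),
        ("Joanna", "Google", "08-19-00", "Vanilla"),
    ],
)

# Number the rows inside each (CompanyName, JobStartDate) group and keep
# only the first one. ORDER BY rowid makes the choice deterministic here.
DEDUP_SQL = """
SELECT Name, CompanyName, JobStartDate, FavoriteIceCream
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CompanyName, JobStartDate
               ORDER BY rowid
           ) AS rn
    FROM sample_person_data
)
WHERE rn = 1
ORDER BY CompanyName, JobStartDate
"""

rows = conn.execute(DEDUP_SQL).fetchall()
for row in rows:
    print(row)
```

Adding more columns to PARTITION BY widens the definition of a duplicate, which covers the case where several people start at the same company on the same day.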

Adding in missing Country Codes into a dataset (GDP Dataset)

I have downloaded a dataset that has countries, their codes, and their GDP by year in 4 columns (5 if you include the unique row number on the far left). However, I noticed that some country codes are missing, and was wondering if anyone could tell me how to get those codes and add them in, probably from a separate dataset. The pictures I posted show this; the second picture shows the rows with missing country codes. Thanks.

Your country codes look like ISO 3166-1 codes, which are defined only for countries, not for larger entities such as "East Asia" and "Western Offshoots".
You could roll your own for these entities, see ISO country codes glossary:
User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters [...] AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available.
I think the easiest approach is to prefix them all with X, so you can tell at a glance that they are your own codes, and then use the next two letters for initials:
East Asia: XEA
Western Offshoots: XWO
etc.
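A minimal sketch of how such custom codes could be filled in, using plain Python dictionaries. The field names entity, code, and year, the sample rows, and the helper fill_missing_codes are hypothetical; the regular expression only encodes the user-assigned alpha-3 ranges quoted above.

```python
import re

# User-assigned alpha-3 ranges quoted above: AAA-AAZ, QMA-QZZ, XAA-XZZ, ZZA-ZZZ.
USER_ASSIGNED = re.compile(r"^(AA[A-Z]|Q[M-Z][A-Z]|X[A-Z]{2}|ZZ[A-Z])$")

# Hypothetical custom codes for the aggregate regions.
CUSTOM_CODES = {"East Asia": "XEA", "Western Offshoots": "XWO"}


def fill_missing_codes(rows, custom_codes):
    """Return copies of rows with empty 'code' fields filled from custom_codes."""
    for code in custom_codes.values():
        if not USER_ASSIGNED.match(code):
            raise ValueError(f"{code} is outside the user-assigned ranges")
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        if not row["code"]:
            # Stay empty if there is no custom code for this entity.
            row["code"] = custom_codes.get(row["entity"], "")
        filled.append(row)
    return filled


rows = [
    {"entity": "France", "code": "FRA", "year": 2000},
    {"entity": "East Asia", "code": "", "year": 2000},
    {"entity": "Western Offshoots", "code": None, "year": 2000},
]
filled = fill_missing_codes(rows, CUSTOM_CODES)
print([row["code"] for row in filled])
```

Validating the custom codes against the reserved ranges guards against accidentally reusing a code that ISO 3166-1 assigns to a real country.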

How to generate a dummy variable in Stata based on a sub-string of an existing string variable?

I am looking for a way to create a dummy variable that checks a variable called text against multiple given substrings such as "book", "buy", and "journey".
I want to check whether an observation contains book, buy, or journey. If one of these keywords is found in the text, the dummy variable should be 1, otherwise 0.
An example:
TEXT
Book your tickets now
Swiss is making your journey easy
Buy your holiday tickets now!
A touch of Austria in your lungs.
The desired outcome should be
dummy variable
1
1
1
0
I tried strpos and also regexm, with very limited results.
Regards,
Johi

Using strpos may be tedious because you have to take capitalization into account, so I would use regular expressions.
* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 text
"Book your tickets now"
"Swiss is making your journey easy"
"Buy your holiday tickets now!"
"A touch of Austria in your lungs."
end
generate wanted = regexm(text, "[Bb]ook|[Bb]uy|[Jj]ourney")
list
Result:
. list

     +--------------------------------------------+
     |                              text   wanted |
     |--------------------------------------------|
  1. |             Book your tickets now        1 |
  2. | Swiss is making your journey easy        1 |
  3. |     Buy your holiday tickets now!        1 |
  4. | A touch of Austria in your lungs.        0 |
     +--------------------------------------------+
See also this link for info on regular expressions.
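For readers who want to experiment with the pattern outside Stata, here is a small Python sketch of the same idea. It is not Stata code: it swaps the [Bb]ook style character classes for a case-insensitive flag, and the names PLAIN, WHOLE_WORD, and keyword_dummy are invented for the example. It also shows a whole-word variant, since a plain substring match would flag "Facebook" as containing "book".

```python
import re

# Case-insensitive match anywhere in the text, like the regexm pattern above.
PLAIN = re.compile(r"book|buy|journey", re.IGNORECASE)

# Variant that only matches the keywords as whole words.
WHOLE_WORD = re.compile(r"\b(?:book|buy|journey)\b", re.IGNORECASE)


def keyword_dummy(text, pattern=PLAIN):
    """Return 1 if the pattern is found anywhere in text, else 0."""
    return 1 if pattern.search(text) else 0


texts = [
    "Book your tickets now",
    "Swiss is making your journey easy",
    "Buy your holiday tickets now!",
    "A touch of Austria in your lungs.",
]
wanted = [keyword_dummy(text) for text in texts]
print(wanted)
```

Whether the whole-word variant is the better choice depends on whether forms such as "booking" should count as a hit.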

postgis query for addresses (with osm data)

I want to query a PostGIS database filled with OpenStreetMap data for addresses: check whether a given address exists in the database and, if so, get its coordinates. The database was filled from a .pbf file using osmosis. This is the schema for the database: http://pastebin.com/Yigjt77f. My addresses are in the form of city name, street name, and house number. The most important table for me is this one:
CREATE TABLE node_tags (
node_id BIGINT NOT NULL,
k text NOT NULL,
v text NOT NULL
);
The k column holds tag keys; the ones I'm interested in are addr:housenumber, addr:street, and addr:city, and v is the corresponding value. First I check whether the city name matches one in the database, then I search the result set for the street, and then for the house number. The problem is that I don't know how to write a SQL query that gets this result in a single request. I can ask first only for the city name, get all node_id values that match my city and save them in my Java program, then query for the street for each of those node_id values, and so on. This is really slow, because each level of detail (city, then street, then number) requires more and more queries, and I have a lot of addresses to check. Once I have a matching node_id I can easily find the coordinates, so that's not a problem.
Example of this table:
node_id | k                  | v
123     | addr:housenumber   | 50
123     | addr:street        | Kingsway
123     | addr:city          | London
123     | (some other stuff) | .....
100     | addr:housenumber   | 121
100     | addr:street        | Edmund St
100     | addr:city          | London
I hope I have explained my problem clearly.

This is not as easy as you might think. Addresses in OSM are hierarchical, like in the real world. Not all elements in OSM have a full address attached. Some only have addr:housenumber and simply belong to the nearest street. Some have addr:housenumber and addr:street but no addr:city because they simply belong to the nearest city. Or they are enclosed by a boundary relation which specifies the corresponding city. And instead of addr:housenumber there are sometimes just address interpolations described by the addr:interpolation key. See the addr key wiki page for more information.
The Karlsruhe Schema page in the OSM wiki explains a lot about addresses in OSM. It also mentions associatedStreet relations which are sometimes used to group house numbers and their corresponding streets.
As you can see, a single query against the database probably won't suffice. If you need some inspiration you can take a look at OSM's address search engine Nominatim. But note that Nominatim uses a different database schema than the usual one in order to optimize address queries. You can also take a look at one of the many routing applications, which all have to do address lookups.
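For the simple case only, where a single node carries all three of addr:city, addr:street, and addr:housenumber, the lookup can be done in one query. The sketch below shows this against the node_tags table from the question, using an in-memory SQLite database through Python's sqlite3 module instead of PostGIS; the helper name find_nodes is made up. It does not handle the hierarchical cases described above (missing addr:city, interpolations, associatedStreet relations).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_tags ("
    "node_id BIGINT NOT NULL, k text NOT NULL, v text NOT NULL)"
)
conn.executemany(
    "INSERT INTO node_tags VALUES (?, ?, ?)",
    [
        (123, "addr:housenumber", "50"),
        (123, "addr:street", "Kingsway"),
        (123, "addr:city", "London"),
        (100, "addr:housenumber", "121"),
        (100, "addr:street", "Edmund St"),
        (100, "addr:city", "London"),
    ],
)

# Keep the tag rows that match any of the three address parts, then keep
# only the nodes for which all three distinct keys matched.
FIND_SQL = """
SELECT node_id
FROM node_tags
WHERE (k = 'addr:city' AND v = :city)
   OR (k = 'addr:street' AND v = :street)
   OR (k = 'addr:housenumber' AND v = :number)
GROUP BY node_id
HAVING COUNT(DISTINCT k) = 3
"""


def find_nodes(city, street, number):
    """Return node ids whose own tags match city, street and house number."""
    params = {"city": city, "street": street, "number": number}
    return [row[0] for row in conn.execute(FIND_SQL, params)]


print(find_nodes("London", "Kingsway", "50"))
print(find_nodes("London", "Kingsway", "121"))
```

The second call returns nothing because no single node has both that street and that house number, even though each value exists somewhere in the table.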

Beginner SQL question: querying gold and silver tag badges in Stack Exchange Data Explorer

I'm using the Stack Exchange Data Explorer to learn SQL, but I think the fundamentals of the question are applicable to other databases.
I'm trying to query the Badges table, which according to Stexdex (that's what I'm going to call it from now on) has the following schema:
Badges
Id
UserId
Name
Date
This works well for badges like [Epic] and [Legendary], which have unique names, but the silver and gold tag-specific badges seem to be mixed together, since both have exactly the same name.
Here's an example query I wrote for [mysql] tag:
SELECT
UserId as [User Link],
Date
FROM
Badges
Where
Name = 'mysql'
Order By
Date ASC
The (slightly annotated) output, as seen on stexdex, is:
User Link       Date
--------------- -------------------   // all for silver except where noted
Bill Karwin     2009-02-20 11:00:25
Quassnoi        2009-06-01 10:00:16
Greg            2009-10-22 10:00:25
Quassnoi        2009-10-31 10:00:24   // for gold
Bill Karwin     2009-11-23 11:00:30   // for gold
cletus          2010-01-01 11:00:23
OMG Ponies      2010-01-03 11:00:48
Pascal MARTIN   2010-02-17 11:00:29
Mark Byers      2010-04-07 10:00:35
Daniel Vassallo 2010-05-14 10:00:38
This is consistent with the current list of silver and gold earners at the moment of this writing, but to speak in more timeless terms, as of the end of May 2010 only 2 users have earned the gold [mysql] tag: Quassnoi and Bill Karwin, as evidenced in the above result by their names being the only ones that appear twice.
So this is the way I understand it:
The first time an Id appears (in chronological order) is for the silver badge
The second time is for the gold
Now, the above result mixes the silver and gold entries together. My questions are:
Is this a typical design, or is there a much friendlier schema/normalization/whatever you call it?
In the current design, how would you query the silver and gold badges separately?
GROUP BY Id and picking the min/max or first/second by the Date somehow?
How can you write a query that lists all the silver badges first then all the gold badges next?
Imagine also that the "real" query may be more complicated, i.e. not just listing by date.
How would you write it so that it doesn't have too much repetition between the silver and gold subqueries?
Is it perhaps more typical to do two totally separate queries instead?
What is this idiom called? A row "partitioning" query to put them into "buckets" or something?
Requirement clarification
Originally I wanted the following output, essentially:
User Link       Date
--------------- -------------------
Bill Karwin     2009-02-20 11:00:25   // result of query for silver
Quassnoi        2009-06-01 10:00:16   // :
Greg            2009-10-22 10:00:25   // :
cletus          2010-01-01 11:00:23   // :
OMG Ponies      2010-01-03 11:00:48   // :
Pascal MARTIN   2010-02-17 11:00:29   // :
Mark Byers      2010-04-07 10:00:35   // :
Daniel Vassallo 2010-05-14 10:00:38   // :
------- maybe some sort of row separator here? can SQL do this? -------
Quassnoi        2009-10-31 10:00:24   // result of query for gold
Bill Karwin     2009-11-23 11:00:30   // :
But the answers so far with a separate column for silver and gold is also great, so feel free to pursue that angle as well. I'm still curious how you'd do the above, though.

Is this a typical design, or are there much friendlier schema/normalization/whatever you call it?
Sure, you could add a type code to make it more explicit. But when you consider that one can not get a gold badge before a silver one, the date stamp makes a lot of sense to differentiate between them.
In the current design, how would you query the silver and gold badges separately? GROUP BY Id and picking the min/max or first/second by the Date somehow?
Yes - joining onto a derived table (AKA inline view) that lists users and their minimum date would return the silver badges. Using HAVING COUNT(*) >= 1 would work too. You'd have to use a combination of GROUP BY and HAVING COUNT(*) = 2 to get gold badges - the max date alone doesn't ensure that there is more than one record for a userid...
How can you write a query that lists all the silver badges first then all the gold badges next?
Sorry - by users, or all silvers first and then golds? The former might be done simply by using ORDER BY t.userid, t.date; the latter I'd likely use analytic functions (IE: ROW_NUMBER(), RANK())...
Is it perhaps more typical to do two totally separate queries instead?
See above about how vague your requirements are, to me anyways...
What is this idiom called? A row "partitioning" query to put them into "buckets" or something?
What you're asking about is referred to by the following synonyms: Analytic, Windowing, ranking...
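To make the analytic-function idea concrete, here is a rough sketch that numbers each user's badge rows by date and labels the first as silver and the second as gold, so one query lists all silvers followed by all golds. It runs on SQLite (3.25 or later, for window functions) through Python's sqlite3 module and not on the Data Explorer itself; the rows are a subset of the output shown in the question, and UserId is stored as the display name purely for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified Badges table; UserId holds the display name for readability.
conn.execute(
    "CREATE TABLE Badges (Id INTEGER PRIMARY KEY, UserId TEXT, Name TEXT, Date TEXT)"
)
conn.executemany(
    "INSERT INTO Badges (UserId, Name, Date) VALUES (?, ?, ?)",
    [
        ("Bill Karwin", "mysql", "2009-02-20 11:00:25"),
        ("Quassnoi", "mysql", "2009-06-01 10:00:16"),
        ("Greg", "mysql", "2009-10-22 10:00:25"),
        ("Quassnoi", "mysql", "2009-10-31 10:00:24"),
        ("Bill Karwin", "mysql", "2009-11-23 11:00:30"),
    ],
)

# rn = 1 is a user's first (silver) award, rn = 2 the second (gold).
RANKED_SQL = """
SELECT UserId, Date,
       CASE rn WHEN 1 THEN 'silver' ELSE 'gold' END AS Class
FROM (
    SELECT UserId, Date,
           ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY Date) AS rn
    FROM Badges
    WHERE Name = 'mysql'
)
ORDER BY rn, Date
"""

rows = conn.execute(RANKED_SQL).fetchall()
for row in rows:
    print(row)
```

Filtering the outer query on rn = 1 or rn = 2 gives the silver or gold list on its own without repeating the inner query's logic.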

You'd rely only on the date or the count in an aggregate.
Arguably, it also makes no sense to query silver followed by gold; it makes more sense to get the data side by side.
Unfortunately, you haven't really specified what you want, but a good starting point for aggregates is to express the requirement in plain English.
Example: "Give me the dates of silver and gold badge awards per user for tag mysql", which this query does:
SELECT
UserId as [User Link],
min(Date) as [Silver Date],
case when count(*) = 1 THEN NULL ELSE max(date) END as [Gold Date]
FROM
Badges
Where
Name = 'mysql'
group by
UserId
Order By
case when count(*) = 1 THEN NULL ELSE max(date) END DESC, min(Date)
Edit, after update:
Your desired output is not really SQL: it's 2 separate recordsets. The separator is a no-go. As a set-based operation, there is no "natural" order, so this introduces one:
SELECT
UserId as [User Link],
min(Date) as [Date],
0 as dummyorder
FROM
Badges
Where
Name = 'mysql'
group by
UserId
union all
select
UserId as [User Link],
max(Date) as [Date],
1 as dummyorder
FROM
Badges
Where
Name = 'mysql'
group by
UserId
having
count(*) = 2
Order By
dummyorder, Date
