Hive: Split string using regexp as a separate column - sql

I have a string in a text column. I want to extract the hashtag values from the string into a new table so that I can find the distinct count for each hashtag.
Example strings->
NeverTrump is never more. They were crushed last night in Cleveland at
Rules Committee by a vote of 87-12. MAKE AMERICA GREAT AGAIN!
CrookedHillary is outspending me by a combined 31 to 1 in Florida,
Ohio, & Pennsylvania. I haven't started yet!
CrookedHillary is not qualified!
MakeAmericaSafeAgain!#GOPConvention #RNCinCLE
MakeAmericaGreatAgain #ImWithYou

I am outlining the steps here because I'm not confident writing the full query; I may update the answer once I get it right.
1. Replace '#' in the string with ' #'.
2. Split the string into words, using space as the delimiter.
3. Use explode() with a lateral view to get one row per word.
4. Use a WHERE condition to keep only the words starting with "#"; a LIKE '#%' condition should work.
5. Add a GROUP BY to get the count of each hashtag.
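The steps above can also be sketched outside Hive. The following is a minimal Python illustration of the same logic, not the Hive query itself; the function name `hashtag_counts` is made up for this example, and the two sample strings are taken from the question.

```python
from collections import Counter


def hashtag_counts(texts):
    """Count hashtags across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Put a space before every '#' so tags glued to the
        # previous word ("Again!#GOPConvention") are separated.
        spaced = text.replace("#", " #")
        # Split on whitespace, keep only words starting with '#'.
        counts.update(w for w in spaced.split() if w.startswith("#"))
    return counts


sample = [
    "MakeAmericaSafeAgain!#GOPConvention #RNCinCLE",
    "MakeAmericaGreatAgain #ImWithYou",
]
counts = hashtag_counts(sample)
```

Calling `counts.most_common()` would then list the hashtags by frequency, which corresponds to the final GROUP BY step.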

This is what @lazilyInitialised described; I ran a query on your example data:
with your_data as (--This is your data example, use your table instead of this CTE
select stack( 1,
1, --ID
" NeverTrump is never more. They were crushed last night in Cleveland at Rules Committee by a vote of 87-12. MAKE AMERICA GREAT AGAIN!
CrookedHillary is outspending me by a combined 31 to 1 in Florida, Ohio, & Pennsylvania. I haven't started yet!
CrookedHillary is not qualified!
MakeAmericaSafeAgain!#GOPConvention #RNCinCLE
MakeAmericaGreatAgain #ImWithYou
"
) as (id, str)
)
select id, word as hashtag
from
(
select id, word
from your_data d
lateral view outer explode(split(regexp_replace(d.str, '#',' #' ),'\\s')) l as word --replace hash w space+hash, split and explode words
)s
where word rlike '^#'
;
Result:
OK
id hashtag
1 #GOPConvention
1 #RNCinCLE
1 #ImWithYou
Time taken: 0.405 seconds, Fetched: 3 row(s)

Related

Get multiple occurrence of string from a column in sql query

I have a table which has the following data
Ticketid created Details
205853669 2020-03-05 #CLOSE# Next action value://346004/ next action value://346002/ or value://346008/
205853670 2020-03-06 #Archive Next action value://346088/ next action value://346077/ or value://346057/
The string "value://" pattern is same in all column, I want to extract those numbers from the string.
ticketid Numbers
205853669 346004
205853669 346002
205853669 346008
205853670 346088
205853670 346077
205853670 346057
I am using standard Sql only
I have created something like below.
select ticketid,TRIM(REPLACE(SUBSTR(
details, STRPOS(details, "value//"),10
),"value//"","")) AS number from table
Below is for BigQuery Standard SQL
#standardSQL
SELECT Ticketid, Numbers
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(Details, r'value://(\d+)/')) Numbers
Applied to the sample data from your question, the output is:
Row Ticketid Numbers
1 205853669 346004
2 205853669 346002
3 205853669 346008
4 205853670 346088
5 205853670 346077
6 205853670 346057
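For readers who want to check the pattern outside BigQuery, here is a small Python sketch of the same idea: find every match of `value://(\d+)/` and emit one row per match. The names `VALUE_PATTERN` and `extract_numbers` are illustrative only, and Python's `re.findall` is used as a stand-in for REGEXP_EXTRACT_ALL plus UNNEST.

```python
import re

# Same pattern as in the query: capture the digits after "value://".
VALUE_PATTERN = re.compile(r"value://(\d+)/")


def extract_numbers(rows):
    """Yield one (ticket_id, number) pair per match in the details text."""
    for ticket_id, details in rows:
        for number in VALUE_PATTERN.findall(details):
            yield ticket_id, number


rows = [
    (205853669, "#CLOSE# Next action value://346004/ next action value://346002/ or value://346008/"),
    (205853670, "#Archive Next action value://346088/ next action value://346077/ or value://346057/"),
]
pairs = list(extract_numbers(rows))
```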
The query below would also work. It splits the Details text on 'value://' and then extracts the six-digit id from each piece.
with `project.dataset.table` as (
select id, split(details, 'value://') AS number from (
select '1' as id, '#CLOSE# Next action value://346004/ next action value://346002/ or value://346008/' as details
union all
select '2' as id, '#Archive Next action value://346088/ next action value://346077/ or value://346057/'
)
)
select id, regexp_extract(number1, "\\d{6}") as number
from `project.dataset.table` ,
UNNEST( number ) number1
where regexp_extract(number1, "\\d{6}") is not null
One remark about the UNNEST function. As per the documentation:
The UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY.
If each comment contains only a few 'value://' occurrences, this will not cause much of a problem, but with an unbounded number of them it might become a performance bottleneck, so keep that in mind. On the other hand, this is the only way I know to achieve this using CloudSQL.

query to display count of corresponding each distinct word

There is a column in a table which can store up to 4000 characters. So for a given row, we need to write a query to display count of corresponding each distinct word in the sentence.
For e.g. the column has "Jack and Jill went up a hill. Jack came tumbling down"
Output :
<Word> - <Count>
Jack - 2
Jill - 1
hill - 1
and - 1
a - 1
came - 1 ... and so on
Since you tagged Teradata you can use STRTOK_SPLIT_TO_TABLE for the tokenizing part. Just add more characters to the separators list:
with cte as
(select
1 as keycol,
'Jack and Jill went up a hill. Jack came tumbling down' as col)
select keycol, token, count(*) as cnt
FROM TABLE (STRTOK_SPLIT_TO_TABLE(cte.keycol, cte.col,
' .,;:-?!()''"') -- list of separators
RETURNS (keycol INTEGER,
tokennum INTEGER,
token VARCHAR(100) CHARACTER SET UNICODE)
) AS d
group by 1,2
order by 1, cnt desc
But counting words might be much more complicated, as it usually includes tokenizing, stemming and stop words.
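As a rough illustration of the tokenizing part only (no stemming or stop words), the same separator list can be applied in Python. This is a sketch under the assumption that every separator character should simply act like a space; `SEPARATORS` and `word_counts` are names invented for the example.

```python
from collections import Counter

# Same separator characters as in the STRTOK_SPLIT_TO_TABLE call.
SEPARATORS = " .,;:-?!()'\""


def word_counts(text):
    """Split text on any separator character and count the tokens."""
    # Map every separator to a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in SEPARATORS})
    return Counter(text.translate(table).split())


counts = word_counts("Jack and Jill went up a hill. Jack came tumbling down")
```

Note that the count is case-sensitive, so "Jack" and "jack" would be counted separately unless the text is lowercased first.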
First, convert the words into rows and then group them.
This query uses a basic row-generation technique based on CONNECT BY.
For example:
select level from dual CONNECT BY level <= 10;
The above query generates 10 rows (a hierarchical LEVEL query).
Based on this simple logic, we count the number of words in the sentence and generate that many rows. REGEXP_COUNT(str,'[^ ]+') gives the number of space-separated words (runs of non-space characters), not the number of spaces.
Then, using the level, we extract one word from the sentence in each row. REGEXP_SUBSTR(str,'[^ ]+',1,level) does this.
You can play around with this query to handle other scenarios. Good luck.
with tokenised_rows(str) as(
SELECT REGEXP_SUBSTR('Jack and Jill went up a hill. Jack came tumbling down','[^ ]+',1,LEVEL)
FROM dual
CONNECT BY level <= REGEXP_COUNT('Jack and Jill went up a hill. Jack came tumbling down','[^ ]+')
)
select str,count(1) from tokenised_rows
group by str;
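A quick way to see what the pattern '[^ ]+' matches is to try it in Python, used here only as a stand-in for Oracle's regex functions (for a pattern this simple the behaviour is assumed to be the same). Each match is a maximal run of non-space characters, so the number of matches is the number of words, and punctuation stays attached to the word.

```python
import re
from collections import Counter

sentence = "Jack and Jill went up a hill. Jack came tumbling down"

# Each match is one run of non-space characters, i.e. one word.
tokens = re.findall(r"[^ ]+", sentence)
word_count = len(tokens)            # what REGEXP_COUNT would return
space_count = sentence.count(" ")   # one less than the word count here

# Grouping the tokens corresponds to the final GROUP BY.
counts = Counter(tokens)
```

Because only the space is excluded by the pattern, the seventh token is "hill." with the full stop, which is one of the "other scenarios" the query would need to handle.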

Oracle - Query to return first line of a value

I am new to Queries.
how I can write a query to pull only the first line from value?
Example -
select address
from user
where id =1;
Sample Output (Single row & not 3 rows)
Anthony Benoit
490 E Main Street
Norwich CT 06360
I would like to get only the first line,
which is Anthony Benoit
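You could use SUBSTR together with INSTR to cut the value at the first new-line character, something like this (appending chr(10) before searching keeps a single-line value from returning NULL, and the -1 drops the new-line itself):
select substr(lines, 1, instr(lines || chr(10), chr(10)) - 1) d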
from
(select 'line 1
line2' lines
from dual)
Pay attention to chr(10): you will most likely need to extend that part to support other new-line conventions (e.g. chr(13) alone, or chr(13) followed by chr(10)).
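If the value is post-processed in application code instead, the different new-line conventions can be handled in one place. A minimal Python sketch, assuming the address arrives as a single string; `first_line` is a hypothetical helper name:

```python
def first_line(text):
    """Return the text up to the first line break, or "" for empty input."""
    # str.splitlines() recognises "\n", "\r\n" and "\r" (and a few
    # other line boundaries), so no separate cases are needed.
    lines = text.splitlines()
    return lines[0] if lines else ""


address = "Anthony Benoit\r\n490 E Main Street\r\nNorwich CT 06360"
name = first_line(address)
```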
This assumes that your example output is three rows.
If you are on 11g or earlier, wrap your posted query in an outer query such as select * from ... where rownum = 1; you also need an ORDER BY in the inner query so that the first row is well defined.
If you are on 12c, you can simply use the row limiting clause: add FETCH FIRST 1 ROWS ONLY to the query.

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac") we sort its letters, look it up in our table of sorted letters joined to the table of the original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
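The pre-processing step described above could look roughly like this in Python. It is only a sketch: an in-memory dictionary stands in for the sorted_text table, and `signature` and `build_index` are names chosen for the example.

```python
from collections import defaultdict


def signature(word):
    """Return the word's letters in sorted order, e.g. 'cat' -> 'act'."""
    return "".join(sorted(word))


def build_index(words):
    """Map each sorted-letter signature to the words that produce it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index


index = build_index(["act", "cat", "tom"])
# Look up the user's input by sorting its letters the same way.
matches = index.get(signature("tac"), [])
```

In the database version, the signature would be stored in an indexed column so the lookup becomes a plain equality match.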
If you use the solution you describe, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that you'll retrieve the rows in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
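The bitmask idea can be checked quickly outside the database. A minimal Python sketch, assuming lowercase a to z only and the same numbering as the query ('a' is 1, 'b' is 2, 'c' is 4, and so on); `letter_mask` is an illustrative name:

```python
def letter_mask(word):
    """Combine one bit per distinct letter: 'a' -> 1, 'b' -> 2, 'c' -> 4, ..."""
    mask = 0
    for ch in set(word.lower()):
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


masks = {w: letter_mask(w) for w in ["act", "cat", "tom", "moot", "mote"]}
```

As with the SQL version, repeated letters do not change the value, so this matches on the set of letters rather than on exact letter counts.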
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
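select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord)) as UnmatchedLetters
from
(
-- wordletters is assumed here to hold the input letters, one row per distinct letter
select w.word, (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
from word w
cross join wordletters wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(wls.LettersInWord))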
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord)) as UnmatchedLetters
from
(
select w.word,
(case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= wl.maxcount
then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
else wl.maxcount end) as LettersInWord
from word w
cross join
(
-- maxcount = how many times each letter was passed in
select letter, count(*) as maxcount
from wordletters wl
group by letter
) wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(wls.LettersInWord))
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
In my experience, I have actually seen decent performance with small cross joins. This might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.
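To make the counting logic easier to follow, here is the capped version expressed in Python. It is a sketch of the same arithmetic rather than a replacement for the query: for each distinct input letter, count its occurrences in the word with the length-difference trick, cap that at the number of times the letter was supplied, and compare the total with the word length. `can_build` is a name made up for the example.

```python
from collections import Counter


def can_build(word, letters):
    """True if every letter of word is covered by the supplied letters."""
    available = Counter(letters)  # letter -> maxcount
    matched = 0
    for letter, max_count in available.items():
        # Same trick as LEN(word) - LEN(REPLACE(word, letter, '')).
        occurrences = len(word) - len(word.replace(letter, ""))
        matched += min(occurrences, max_count)
    return matched == len(word)


results = {w: can_build(w, "tom") for w in ["tom", "moot", "to", "cat"]}
```

A word fails either because it uses a letter more often than it was supplied ("moot" needs two o's) or because it contains a letter that was not supplied at all.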

mysql query to dynamically convert row data to columns

I am working on a pivot table query.
The schema is as follows
Sno, Name, District
The same name may appear in many districts; take the following sample data, for example:
1 Mike CA
2 Mike CA
3 Proctor JB
4 Luke MN
5 Luke MN
6 Mike CA
7 Mike LP
8 Proctor MN
9 Proctor JB
10 Proctor MN
11 Luke MN
As you can see, I have a set of 4 distinct districts (CA, JB, MN, LP). Now I want to generate a pivot table that maps the names against the districts:
Name CA JB MN LP
Mike 3 0 0 1
Proctor 0 2 2 0
Luke 0 0 3 0
i wrote the following query for this
select name,sum(if(District="CA",1,0)) as "CA",sum(if(District="JB",1,0)) as "JB",sum(if(District="MN",1,0)) as "MN",sum(if(District="LP",1,0)) as "LP" from district_details group by name
However, the number of districts may increase, in which case I would have to manually edit the query and add the new district to it.
I want to know if there is a query that can dynamically take the names of the distinct districts and run the above query. I know I can do it with a procedure that generates the script on the fly; is there any other method?
I ask because the query "select distinct(districts) from district_details" returns a single column with one district name per row, which I would like to transpose into columns.
You simply cannot have a static SQL statement returning a variable number of columns. You need to build such a statement each time the number of different districts changes. To do that, you first execute:
SELECT DISTINCT District FROM district_details;
This will give you the list of districts where there are details. You then build a SQL statement iterating over the previous result (pseudocode)
statement = "SELECT name "
For each row returned in d = SELECT DISTINCT District FROM district_details
statement = statement & ", SUM(IF(District=""" & d.District & """,1 ,0)) AS """ & d.District & """"
statement = statement & " FROM district_details GROUP BY name;"
Then execute that query. You will also have to handle the variable number of columns in your code.
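The pseudocode above could be written in application code along these lines. This is a hedged Python sketch: `build_pivot_sql` is a hypothetical helper, the district list is assumed to come from the SELECT DISTINCT query, and the quote doubling shown is only a minimal precaution, not complete escaping (MySQL also treats backslashes specially in string literals).

```python
def build_pivot_sql(districts):
    """Build the pivot SELECT for the given list of district names."""
    select_items = ["name"]
    for district in districts:
        # Double embedded quotes/backticks so the name cannot end
        # the literal or the alias early (illustrative only).
        literal = district.replace("'", "''")
        alias = district.replace("`", "``")
        select_items.append(
            f"SUM(IF(District = '{literal}', 1, 0)) AS `{alias}`"
        )
    return (
        "SELECT " + ", ".join(select_items)
        + " FROM district_details GROUP BY name;"
    )


sql = build_pivot_sql(["CA", "JB", "MN", "LP"])
```

The resulting string is then sent to the server like any other query, and the calling code reads the column names from the result metadata.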
a) "For each" is not supported in MySQL stored procedures; you have to iterate with a cursor inside a LOOP, WHILE or REPEAT block.
b) Stored procedures can execute dynamic SQL only via PREPARE and EXECUTE on a statement string (for example one held in a user variable), and OUT parameters carry single values rather than multi-row results.
c) Stored functions cannot execute dynamic SQL at all.
It is a nightmare to keep track of once you got a good idea and everyone seems to debunk it before they think "Why would anyone wanna..."
I hope you find your solution, I am still searching for mine.
The closest I got was
(excuse the pseudo code)
-> to stored procedure, build function that...
1) create temp table
2) load data to temp table from columns using your if statements
3) load the temp table out to INOUT or OUT parameters in a stored procedure as you would a table call... IF you can get it to return more than one row
Also another tip...
Store your districts in a conventional table, load it, and loop through the districts marked active to dynamically concatenate a query string, which can be plain text as far as the system is concerned.
Then use;
prepare stmName from @yourquerystring;
execute stmName;
deallocate prepare stmName;
(find much more on the stored procedures part of the mysql forum too)
to run a different set of districts every time, without having to re-design your original proc
Maybe it's easier in numerical form.
I work on plain text content in my tables and have nothing to sum, count or add up
The following assumes you want matches of distinct (name/district) pairs. I.e. Luke/CA and Duke/CA would yield two results:
SELECT name, District, count(District) AS count
FROM district_details
GROUP BY District, name
If this is not the case simply remove name from the GROUP BY clause.
Lastly, notice that I switched sum() for count() as you are trying to count all of the grouped rows rather than getting a summation of values.
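If the grouped (name, district) counts are fetched as plain rows, turning them into the pivoted layout from the question is a small step in application code. A minimal Python sketch using the question's sample data; here the raw rows are counted directly, and all variable names are illustrative:

```python
from collections import Counter

# (name, district) pairs from the sample data in the question.
rows = [
    ("Mike", "CA"), ("Mike", "CA"), ("Proctor", "JB"), ("Luke", "MN"),
    ("Luke", "MN"), ("Mike", "CA"), ("Mike", "LP"), ("Proctor", "MN"),
    ("Proctor", "JB"), ("Proctor", "MN"), ("Luke", "MN"),
]

# Equivalent of GROUP BY District, name with count(*).
pair_counts = Counter(rows)

# Column headers are whatever districts occur, so new ones appear automatically.
districts = sorted({district for _, district in rows})
names = list(dict.fromkeys(name for name, _ in rows))  # keep first-seen order

# One list of counts per name, in the same order as `districts`.
pivot = {
    name: [pair_counts[(name, d)] for d in districts]
    for name in names
}
```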
Via the comment by @cballou above, I was able to perform this sort of function, which is not exactly what the OP asked for but suited my similar situation, so I am adding it here to help those who come after.
Normal select statement:
SELECT d.id ID,
q.field field,
q.quota quota
FROM defaults d
JOIN quotas q ON d.id=q.default_id
Vertical results:
ID field quota
1 male 25
1 female 25
2 male 50
Select statement using group_concat:
SELECT d.id ID,
GROUP_CONCAT(q.field SEPARATOR ",") fields,
GROUP_CONCAT(q.quota SEPARATOR ",") quotas
FROM defaults d
JOIN quotas q ON d.id=q.default_id
GROUP BY d.id -- one output row per ID
Then I get comma-separated fields of "fields" and "quotas" which I can then easily process programmatically later.
Horizontal results:
ID fields quotas
1 male,female 25,25
2 male 50
Magic!
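The "process programmatically later" step might look like this in Python. It is a sketch under a few assumptions: the field names contain no commas, both lists come back in the same order, and the concatenated strings were not truncated by MySQL's group_concat_max_len limit. `parse_quota_row` is a hypothetical helper name.

```python
def parse_quota_row(fields, quotas):
    """Pair up the comma-separated fields and quotas of one result row."""
    names = fields.split(",")
    values = [int(q) for q in quotas.split(",")]
    if len(names) != len(values):
        raise ValueError("fields and quotas have different lengths")
    return dict(zip(names, values))


# Rows as shown in the horizontal results: (ID, fields, quotas).
result_rows = [(1, "male,female", "25,25"), (2, "male", "50")]
parsed = {row_id: parse_quota_row(f, q) for row_id, f, q in result_rows}
```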