Hive for bag of words (word count for each word in the dictionary) - sql

I have a table with this structure:
user_id | message_id | content
1       | 1          | "I like cats"
1       | 2          | "I like dogs"
And I have a list of valid words in dictionary.txt (or an external Hive table), for example:
I,like,dogs,cats,lemurs
My goal is to generate a word-count table for each user:
user_id | "I" | "like" | "dogs" | "cats" | "lemurs"
1 | 2 | 2 | 1 | 1 | 0
This is what I tried so far:
SELECT user_id, word, COUNT(*)
FROM messages LATERAL VIEW explode(split(content, ' ')) lTable as word
GROUP BY user_id,word;
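This produces the counts in long format (one row per user and word). To get the pivoted table from the question, a common Hive approach is conditional aggregation over the exploded words; a minimal sketch, assuming the dictionary is small enough to write out by hand (backticks because like is a keyword):
SELECT user_id,
       SUM(CASE WHEN word = 'I'      THEN 1 ELSE 0 END) AS `I`,
       SUM(CASE WHEN word = 'like'   THEN 1 ELSE 0 END) AS `like`,
       SUM(CASE WHEN word = 'dogs'   THEN 1 ELSE 0 END) AS `dogs`,
       SUM(CASE WHEN word = 'cats'   THEN 1 ELSE 0 END) AS `cats`,
       SUM(CASE WHEN word = 'lemurs' THEN 1 ELSE 0 END) AS `lemurs`
FROM messages LATERAL VIEW explode(split(content, ' ')) ltable AS word
GROUP BY user_id;
Dictionary words a user never used come out as 0, and words outside the dictionary are simply not matched by any CASE branch.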

Check this:
select ename,
       length(ename) - length(replace(ename, 'A', '')) A,
       length(ename) - length(replace(ename, 'W', '')) W
FROM EMP;
Otherwise you can define a variable (your search string) and use it in place of 'A', 'W', etc.
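To see why this trick works: removing every occurrence of the search term shortens the string by the length of that term, so dividing the difference by the term's length generalizes the counter from single characters to whole words. A worked example (standard SQL, with a made-up literal):
SELECT (length('I like cats and cats')
        - length(replace('I like cats and cats', 'cats', '')))
       / length('cats') AS cats_count;   -- returns 2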

I am not very familiar with doing a pivot in Hive, but it can be done in Pig.
DEFINE GET_WORDCOUNTS com.stackoverflow.pig.GetWordCounts('$dictionary_path');
A = LOAD .... AS user_id, message_id, content;
B = GROUP A BY (user_id);
C = FOREACH B GENERATE group, FLATTEN(GET_WORDCOUNTS(A.content));
You will have to write a simple UDF, GetWordCounts, which tokenizes the input content for each grouped record and checks it against the input dictionary.

Related

Count string occurrences within a list column - Snowflake/SQL

I have a table with a column that contains a list of strings like below:
EXAMPLE:
STRING User_ID [...]
"[""null"",""personal"",""Other""]" 2122213 ....
"[""Other"",""to_dos_and_thing""]" 2132214 ....
"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]" 4342323 ....
QUESTION:
I want to get a count of the number of times each unique string appears (the strings are separable within the STRING column by commas), but I only know how to do the following:
SELECT u.STRING, count(u.USER_ID) as cnt
FROM table u
group by u.STRING
order by cnt desc;
However, the above method doesn't work: it only counts the number of user ids that use a specific grouping of strings.
The ideal output using the example above would look like this:
DESIRED OUTPUT:
STRING                  COUNT_Instances
"null"                  1223
"personal"              543
"Other"                 324
"to_dos_and_thing"      221
"getting_things_done"   146
"Work!!!!!"             22
Based on your description, here is my sample table:
create table u (user_id number, string varchar);
insert into u values
(2122213, '"[""null"",""personal"",""Other""]"'),
(2132214, '"[""Other"",""to_dos_and_thing""]"'),
(2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"' );
I used SPLIT_TO_TABLE to split each string into rows, and then REGEXP_SUBSTR to clean the data. So here's the query and output:
select REGEXP_SUBSTR( s.VALUE, '""(.*)""', 1, 1, 'i', 1 ) extracted, count(*)
from u,
     lateral SPLIT_TO_TABLE( string, ',' ) s
GROUP BY extracted
order by count(*) DESC;
+---------------------+----------+
| EXTRACTED           | COUNT(*) |
+---------------------+----------+
| Other               |        2 |
| null                |        1 |
| personal            |        1 |
| to_dos_and_thing    |        1 |
| getting_things_done |        1 |
| TO_dos_and_thing    |        1 |
| Work!!!!!           |        1 |
+---------------------+----------+
SPLIT_TO_TABLE https://docs.snowflake.com/en/sql-reference/functions/split_to_table.html
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
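As an aside, the stored values look like a JSON array that was escaped on load. If the column instead held a real JSON array in a VARIANT column, LATERAL FLATTEN would avoid the regex cleanup entirely; a minimal sketch, assuming a hypothetical table u2 with a VARIANT column tags:
select f.value::string as extracted, count(*)
from u2, lateral flatten(input => u2.tags) f
group by extracted
order by count(*) desc;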

How to fetch data with a WHERE condition on jsonb in PostgreSQL

I have a table data_table like this:
| id (bigint) | receiver (jsonb)                                                                        |
|-------------|-----------------------------------------------------------------------------------------|
| 1           | [{"name":"ABC","email":"abc@gmail.com"},{"name":"ABDFC","email":"ab34c@gmail.com"},...]  |
| 2           | [{"name":"DEF","email":"deef@gmail.com"},{"name":"AFDBC","email":"a45bc@gmail.com"},...] |
| 3           | [{"name":"GHI","email":"ghfi@gmail.com"},{"name":"AEEBC","email":"5gf@gmail.com"},...]   |
| 4           | [{"name":"LMN","email":"lfmn@gmail.com"},{"name":"EEABC","email":"gfg5@gmail.com"},...]  |
| 5           | [{"name":"PKL","email":"dfdf@gmail.com"},{"name":"ABREC","email":"a4rbc@gmail.com"},...] |
| 6           | [{"name":"ANI","email":"fdffd@gmail.com"},{"name":"ABWC","email":"abrtc@gmail.com"},...] |
I want to fetch rows by putting an email in the WHERE condition, like select * from data_table where receiver = 'abc@gmail.com' (there can be more data in each array, which is why I have shown "...").
I have tried where receiver ->> 'email' = 'abc@gmail.com'; when I run it in pgAdmin it works fine for a single object like {"name":"ABC","email":"abc@gmail.com"}, but not for an array, where I have to check every email in the array.
Help will be appreciated.
One option is to use exists and jsonb_array_elements():
select t.*
from mytable t
where exists (
    select 1
    from jsonb_array_elements(t.receiver) x(elt)
    where x.elt ->> 'email' = 'abc@gmail.com'
)
This gives you all rows where at least one element in the array has the given email.
If you want to actually exhibit the matching elements, then you can use a lateral join instead (if more than one element in the array has the given email, this duplicates the row):
select t.*, x.elt
from mytable t
cross join lateral jsonb_array_elements(t.receiver) x(elt)
where x.elt ->> 'email' = 'abc@gmail.com'
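A note beyond the original answer: since receiver is jsonb, the containment operator @> can express the same existence check, and it can take advantage of a GIN index on the column:
select *
from mytable
where receiver @> '[{"email": "abc@gmail.com"}]';
-- optional supporting index (default jsonb_ops operator class):
create index on mytable using gin (receiver);
jsonb containment descends into array elements, so this matches any row whose array contains an object with that email.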

Filter json values regardless of keys in PostgreSQL

I have a table called diary which includes columns listed below:
| id | user_id | custom_foods |
|----|---------|--------------------|
| 1 | 1 | {"56": 2, "42": 0} |
| 2 | 1 | {"19861": 1} |
| 3 | 2 | {} |
| 4 | 3 | {"331": 0} |
I would like to count, for each user, how many diaries have custom_foods value(s) larger than 0. I don't care about the keys, since the keys can be any number in string form.
The desired output is:
| user_id | count |
|---------|---------|
| 1 | 2 |
| 2 | 0 |
| 3 | 0 |
I started with:
select *
from diary as d
join json_each_text(d.custom_foods) as e
on d.custom_foods != '{}'
where e.value > 0
I don't even know whether the syntax is correct. Now I am getting the error:
ERROR: function json_each_text(text) does not exist
LINE 3: join json_each_text(d.custom_foods) as e
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
My version is psql (10.5 (Ubuntu 10.5-1.pgdg14.04+1), server 9.4.19). According to the PostgreSQL 9.4.19 documentation, that function should exist. I am so confused that I don't know how to proceed.
Threads that I referred to:
Postgres and jsonb - search value at any key
Query postgres jsonb by value regardless of keys
Your custom_foods column is defined as text, so you should cast it to json before applying json_each_text. Since json_each_text produces no rows for an empty json object, you can get a count of 0 for those users from a separate CTE and combine the results with a UNION ALL:
WITH empty AS (
    SELECT DISTINCT user_id, 0 AS count
    FROM diary
    WHERE custom_foods = '{}'
)
SELECT user_id,
       -- distinct diary ids, so a diary with several positive values counts once
       count(DISTINCT CASE WHEN value::int > 0 THEN d.id END)
FROM diary d,
     json_each_text(d.custom_foods::json)
GROUP BY user_id
UNION ALL
SELECT *
FROM empty
ORDER BY user_id;
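An alternative not in the original answer, sketched for PostgreSQL 9.4+: a LEFT JOIN LATERAL keeps the users whose objects are empty (making the separate CTE unnecessary), and a FILTER clause counts only the positive values:
select d.user_id,
       count(distinct d.id) filter (where e.value::int > 0) as count
from diary d
left join lateral json_each_text(d.custom_foods::json) e on true
group by d.user_id
order by d.user_id;
Counting distinct d.id makes each qualifying diary count once even if it has several positive values.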

Counting SQLite rows that might match multiple times in a single query

I have a SQLite table which has a column containing categories that each row may fall into. Each row has a unique ID, but may fall into zero, one, or more categories, for example:
|-------+-------|
| name | cats |
|-------+-------|
| xyzzy | a b c |
| plugh | b |
| quux | |
| quuux | a c |
|-------+-------|
I'd like to obtain counts of how many items are in each category. In other words, output like this:
|------------+-------|
| categories | total |
|------------+-------|
| a | 2 |
| b | 2 |
| c | 2 |
| none | 1 |
|------------+-------|
I tried to use the case statement like this:
select case
           when cats like '%a%' then 'a'
           when cats like '%b%' then 'b'
           when cats like '%c%' then 'c'
           else 'none'
       end as categories,
       count(*)
from test
group by categories
But the problem is this only counts each row once, so it can't handle multiple categories. You then get this output instead:
|------------+-------|
| categories | total |
|------------+-------|
| a | 2 |
| b | 1 |
| none | 1 |
|------------+-------|
One possibility is to use as many union statements as you have categories:
select case
when cats like "%a%" then 'a'
end as categories, count(*)
from test
group by categories
union
select case
when cats like "%b%" then 'b'
end as categories, count(*)
from test
group by categories
union
...
but this seems really ugly and the opposite of DRY.
Is there a better way?
Fix your data structure! You should have a table with one row per name and per category:
create table nameCategories (
name varchar(255),
category varchar(255)
);
Then your query would be easy:
select category, count(*)
from namecategories
group by category;
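For the sample data above, populating the junction table is straightforward (a sketch with the rows written out by hand; the category-less name quux simply gets no rows):
insert into nameCategories (name, category) values
    ('xyzzy', 'a'), ('xyzzy', 'b'), ('xyzzy', 'c'),
    ('plugh', 'b'),
    ('quuux', 'a'), ('quuux', 'c');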
Why is your data structure bad? Here are some reasons:
A column should contain a single value.
SQL has pretty lousy string functionality.
SQL queries to do what you want cannot be optimized.
SQL has a great data structure for storing lists. It is called a table, not a string.
With that in mind, here is one brute force method for doing what you want:
with categories as (
    select 'a' as category union all
    select 'b' union all
    . . .
)
select c.category, count(t.name)
from categories c
left join test t
    on ' ' || t.cats || ' ' like '% ' || c.category || ' %'
group by c.category;
If you already have a table of valid categories, then the CTE is not needed.
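One gap in the brute-force query: it never produces the 'none' row from the desired output. A small UNION ALL covers it (a sketch, assuming "no category" means a NULL or blank cats value):
select 'none' as category, count(*) as total
from test
where cats is null or trim(cats) = '';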

Get previous and next row from rows selected with (WHERE) conditions

For example I have this statement:
my name is Joseph and my father's name is Brian
This statement is split into words, like this table:
------------------------------
| ID | word |
------------------------------
| 1 | my |
| 2 | name |
| 3 | is |
| 4 | Joseph |
| 5 | and |
| 6 | my |
| 7 | father's |
| 8 | name |
| 9 | is |
| 10 | Brian |
------------------------------
I want to get the previous and next word of each word.
For example, I want to get the previous and next word of "name":
--------------------------
| my | name | is |
--------------------------
| father's | name | is |
--------------------------
How could I get this result?
You didn't specify your DBMS, so the following is ANSI SQL:
select prev_word, word, next_word
from (
    select id,
           lag(word) over (order by id) as prev_word,
           word,
           lead(word) over (order by id) as next_word
    from words
) as t
where word = 'name';
SQLFiddle: http://sqlfiddle.com/#!12/7639e/1
Why did nobody give the simple answer?
SELECT LAG(word) OVER ( ORDER BY ID ) AS PreviousWord,
       word,
       LEAD(word) OVER ( ORDER BY ID ) AS NextWord
FROM words;
Note that this returns a row for every word; you cannot just add WHERE word = 'name' here, because the WHERE is applied before the window functions, leaving LAG and LEAD to look only at the filtered rows. That is why the answer above wraps them in a subquery.
Without subqueries:
SELECT a.word
FROM my_table AS a
JOIN my_table AS b
ON b.word = 'name' AND abs(a.id - b.id) <= 1
ORDER BY a.id
Use a join to get the expected result (SQL Server 2005 and later).
create table words (id integer, word varchar(20));
insert into words
values
(1 ,'my'),
(2 ,'name'),
(3 ,'is'),
(4 ,'joseph'),
(5 ,'and'),
(6 ,'my'),
(7 ,'father'),
(8 ,'name'),
(9 ,'is'),
(10,'brian');
SELECT A.Id, C.word AS PrevName,
       A.word AS CurName,
       B.word AS NxtName
FROM words AS A
LEFT JOIN words AS B ON A.Id = B.Id - 1
LEFT JOIN words AS C ON A.Id = C.Id + 1
WHERE A.Word = 'name'
Try this:
SELECT *
FROM tablename a
WHERE ID IN (SELECT ID - 1
             FROM tablename
             WHERE word = 'name') -- will fetch previous rows of word `name`
   OR ID IN (SELECT ID + 1
             FROM tablename
             WHERE word = 'name') -- will fetch next rows of word `name`
   OR word = 'name'               -- to fetch the rows where word = `name`
Here's a different approach, if you want the selects to be fast. It takes a bit of preparation work.
Create a new column (e.g. "phrase") that will contain the words you want (i.e. the previous, the current, and the next).
Write a trigger that, on insert, appends the new word to the previous row's phrase and seeds the new row's phrase with the previous row's word followed by the new word; a sketch follows below.
If the individual words can change, you'll also need an update trigger to keep the phrase in sync.
Then just select the phrase. You get much better speed, but at the cost of extra storage, slower inserts, and harder maintainability. Obviously you have to backfill the phrase column for the existing records, but you have the SQL to do that in the other answers.
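A minimal SQLite sketch of that insert trigger (the phrase column name and the assumption of sequential ids are mine, taken from the description above):
alter table words add column phrase text;

create trigger words_phrase after insert on words
begin
    -- append the new word to the previous row's phrase
    update words
    set phrase = coalesce(phrase || ' ', '') || new.word
    where id = new.id - 1;

    -- seed the new row's phrase with the previous word plus the new word
    update words
    set phrase = coalesce((select word from words where id = new.id - 1) || ' ', '')
                 || new.word
    where id = new.id;
end;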