Get the unique word count of each word in Hive - sql

I am having a table such as follows,
select * from tablename;
ID sentence
1 This is a sentence
2 This might be a test
3 America
4 This this
I want to write a query to split the sentence into words and get the count of the words in the descending order. I want to have an output something like,
word count Unique(ids)
This 4 3
a 2 2
might 1 1
.
.
.
where count is the number of times the word has occurred in the column and Unique(ids) is the number of users with that word.
I am thinking in what way we can write a query to do this?
Can anybody help me doing this in hive?
Thanks

laterral View
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
select id, word
from tablename tn lateral view explode( split( tn.sentense, ' ' ) ) tb as word
the result will be:
1 This
1 is
1 a
1 sentense
2 This
2 might
2 be
2 a
2 test
3 america
aggregate the result

Related

BigQuery INSERT SELECT results in random order of records?

I used standard SQL to insert data form one table to another in BigQuery using Jupyter Notebook.
For example I have two tables:
table1
ID Product
0 1 book1
1 2 book2
2 3 book3
table2
ID Product Price
0 5 book5 8.0
1 6 book6 9.0
2 4 book4 3.0
I used the following codes
INSERT test_data.table1
SELECT *
FROM test_data.table2
ORDER BY Price;
SELECT *
FROM test_data.table1
I got
ID Product
0 1 book1
1 3 book3
2 2 book2
3 5 book5
4 6 book6
5 4 book4
I expected it appears in the order of ID 1 2 3 4 5 6 which 4,5,6 are ordered by Price
It also seems that the data INSERT and/or SELECT FROM display records in a random order in different run.
How do I control the SELECT FROM output without including the 'Price' column in the output table in order to sort them?
And this happened when I import a csv file to create a new table, the record order is random when using SELECT FROM to display them.
The ORDER BY clause specifies a column or expression as the sort criterion for the result set.
If an ORDER BY clause is not present, the order of the results of a query is not defined.
Column aliases from a FROM clause or SELECT list are allowed. If a query contains aliases in the SELECT clause, those aliases override names in the corresponding FROM clause.
So, you most likely wanted something like below
SELECT *
FROM test_data.table1
ORDER BY Price DESC
LIMIT 100
Note the use of LIMIT - it is important part - If you are sorting a very large number of values, use a LIMIT clause to avoid resource exceeded type of error

Access SQL query select with a specific pattern

I want to select each 5 rows to be unique and the select pattern applies for the rest of the result (i.e if the result contains 10 records I am expecting to have 2 set of 5 unique rows)
Example:
What I have:
1
1
5
3
4
5
2
4
2
3
Result I want to achieve:
1
2
3
4
5
1
2
3
4
5
I have tried and searched a lot but couldn't find anything close to what I want to achieve.
Assuming that you can somehow order the rows within the sets of 5:
SELECT t.Row % 5, t.Row FROM #T t
ORDER BY t.Row , t.Row % 5
We could probably get closer to the truth with more details about what your data looks like and what it is you're trying to actually do.
This will work with the sample of data you provided
SELECT DISTINCT(thevalue) FROM theresults
UNION ALL
SELECT DISTINCT(thevalue) FROM theresults
But it's unclear to me if it's really what you need.
For instance :
if your table/results returns 12 rows, do you still want 2x5 rows or do you want 2x6 rows ?
do you have always in your table/results the same rows in double ?
There's a lot more questions to rise and no hint about them in what you asked.

Very simple BigQuery SQL script won't return "0" for Count rows with no results

I am trying to make this very simple SQL script work:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) date_submission,
COUNT(*) AS num_apples_oranges_submissions
FROM
[fh-bigquery:reddit_comments.2008]
WHERE
(LOWER(body) CONTAINS ('apples')
AND LOWER(body) CONTAINS ('oranges'))
GROUP BY
date_submission
ORDER BY
date_submission
The results look like this:
1 2008-01-07 3
2 2008-01-08 1
3 2008-01-09 2
4 2008-01-10 3
5 2008-01-11 2
6 2008-01-13 2
7 2008-01-15 2
8 2008-01-16 3
As you can see, for days where there were no submissions containing both "apples" and "oranges", instead of a value of 0 being returned, the entire row is simply missing (such as on the 12th and 14th).
How can I fix this? I'm at my wits end. Thank you.
Try below, it will return all submissions days
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) date_submission,
SUM((LOWER(body) CONTAINS ('apples') AND LOWER(body) CONTAINS ('oranges'))) AS num_apples_oranges_submissions
FROM
[fh-bigquery:reddit_comments.2008]
GROUP BY
date_submission
ORDER BY
date_submission

Searching for a number in a database column where column contains series of numbers seperated by a delimeter '"&" in SQLite

My table structure is as follows :
id category
1 1&2&3
2 18&2&1
3 11
4 1&11
5 3&1
6 1
My Question: I need a sql query which generates the result set as follows when the user searched category is 1
id category
1 1&2&3
2 18&2&1
4 1&11
5 3&1
6 1
but i am getting all the results not the expected one
I have tried regexp and like operators but no success.
select * from mytable where category like '%1%'
select * from mytable where category regexp '([.]*)(1)(.*)'
I really dont know about regexp I just found it.
so please help me out.
For matching a list item separated by &, use:
SELECT * FROM mytable WHERE '&'||category||'&' LIKE '%&1&%';
this will match entire item (ie, only 1, not 11, ...), whether it is at list beginning, middle or end.

removing duplicates in ms access

please tell me how to write this query
i have an access table
number
2
2
1
2
2
1
1
3
2
i want a query that gives
number count
2 5
1 3
3 1
any help appreciated
something like...
SELECT number, count(number) AS count
FROM table
GROUP BY number
It's a bad idea to have a column named number since it is a reserved keyword.
You probably want something like
select number_, count(*) as count from ... group by number_