ClickHouse approach for word-frequency count on a textual field

I have a ClickHouse table where one of the fields contains a textual description (~300 words).
For example, Reviews:
┌────────┬──────────┬───────┬──────────┬──────────────────────────────────┐
│ Rev_id │ Place_id │ Stars │ Category │ Text                             │
├────────┼──────────┼───────┼──────────┼──────────────────────────────────┤
│ 1      │ 12       │ 3     │ Food     │ Nice food but a bad dirty place. │
│ 2      │ 31       │ 4     │ Sport    │ Not bad, they have everything.   │
│ 3      │ 55       │ 1     │ Bar      │ Poor place,bad audience.         │
└────────┴──────────┴───────┴──────────┴──────────────────────────────────┘
I'd like to do some word-count analysis, such as a general word frequency count (how many times each word has appeared) or the top-K words per Category.
In the example:
word count
bad 3
place 2
...
Is there a way to do it solely in ClickHouse without involving programming languages?

SELECT
arrayJoin(splitByChar(' ', replaceRegexpAll(x, '[.,]', ' '))) AS w,
count()
FROM
(
SELECT 'Nice food but a bad dirty place.' AS x
UNION ALL
SELECT 'Not bad, they have everything.'
UNION ALL
SELECT 'Poor place,bad audience.'
)
GROUP BY w
ORDER BY count() DESC
┌─w──────────┬─count()─┐
│ │ 4 │
│ bad │ 3 │
│ place │ 2 │
│ have │ 1 │
│ Poor │ 1 │
│ food │ 1 │
│ Not │ 1 │
│ they │ 1 │
│ audience │ 1 │
│ Nice │ 1 │
│ but │ 1 │
│ dirty │ 1 │
│ a │ 1 │
│ everything │ 1 │
└────────────┴─────────┘
The empty token with count 4 comes from consecutive separators (for example, a period at the end of a sentence); it can be dropped by wrapping the array in arrayFilter(x -> notEmpty(x), ...).
To get the counts per category, add the category column to the select list and the grouping:
SELECT CATEGORY, ....
GROUP BY CATEGORY, w
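Spelled out against the Reviews table from the question (table and column names taken from there), the per-category count would look something like:

```sql
SELECT
    Category,
    arrayJoin(splitByChar(' ', replaceRegexpAll(Text, '[.,]', ' '))) AS w,
    count()
FROM Reviews
GROUP BY Category, w
ORDER BY count() DESC;
```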

If applicable in your case, I would consider using alphaTokens as a more efficient option.
SELECT
category,
arrayJoin(arrayFilter(x -> NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x), alphaTokens(text))) token,
count() count
FROM
(
/* test data */
SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
FROM
(
SELECT arrayJoin([
(1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
(4, 12, 3, 'Food', ' the the the the good food ..'),
(2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
(3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
)
)
GROUP BY category, token
ORDER BY count DESC
LIMIT 5;
/*
┌─category─┬─token────┬─count─┐
│ Food │ food │ 2 │
│ Food │ bad │ 1 │
│ Bar │ audience │ 1 │
│ Food │ Nice │ 1 │
│ Bar │ Poor │ 1 │
└──────────┴──────────┴───────┘
*/
Example of using topK:
SELECT
category,
arrayReduce('topK(3)',
arrayFilter(x -> (NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x)), groupArrayArray(alphaTokens(text)))) AS result
FROM
(
/* test data */
SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
FROM
(
SELECT arrayJoin([
(1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
(4, 12, 3, 'Food', ' the the the the good food ..'),
(2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
(3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
)
)
GROUP BY category;
/* result
┌─category─┬─result─────────────────┐
│ Bar │ ['Poor','place','bad'] │
│ Food │ ['food','Nice','bad'] │
│ Sport │ ['Not','bad','they'] │
└──────────┴────────────────────────┘
*/
PS: it probably makes sense to lowercase all strings/tokens before processing.
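Following that PS, a sketch of the same per-category count with everything lowercased via lowerUTF8 (ClickHouse's Unicode-aware lowercasing), so that 'Not' and 'not' collapse into one token:

```sql
SELECT
    category,
    arrayJoin(arrayFilter(x -> NOT has(['a', 'the', 'but'], x),
                          alphaTokens(lowerUTF8(text)))) AS token,
    count() AS count
FROM
(
    /* same test data as above */
    SELECT data.4 AS category, data.5 AS text
    FROM
    (
        SELECT arrayJoin([
            (1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
            (4, 12, 3, 'Food', ' the the the the good food ..'),
            (2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
            (3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
    )
)
GROUP BY category, token
ORDER BY count DESC
LIMIT 5;
```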

Related

How to convert the "2022-10-26T00:00:00.654199+00:00" time format to Unix

I need to convert the string 2022-10-26T00:00:00.654199+00:00
to a Unix timestamp; is this possible with ClickHouse?
I tried
toUnixTimestamp64Milli(visitParamExtractString(msg, 'time'))
where time is 2022-10-26T00:00:00.654199+00:00,
but it doesn't work.
toUnixTimestamp64Milli expects a DateTime64 value, not a string, so parse the string first with parseDateTime64BestEffort. Try this way:
SELECT
json,
JSONExtractString(json, 'time') AS time,
parseDateTime64BestEffort(time, 6) AS dt,
toUnixTimestamp64Milli(dt) AS ts
FROM
(
WITH [
'{"time": "2022-10-26T00:00:00.654199+00:00"}',
'{"time": "2022-10-26T00:00:00.654199+08:00"}'] AS jsons
SELECT arrayJoin(jsons) AS json
)
/*
┌─json─────────────────────────────────────────┬─time─────────────────────────────┬─────────────────────────dt─┬────────────ts─┐
│ {"time": "2022-10-26T00:00:00.654199+00:00"} │ 2022-10-26T00:00:00.654199+00:00 │ 2022-10-26 00:00:00.654199 │ 1666742400654 │
│ {"time": "2022-10-26T00:00:00.654199+08:00"} │ 2022-10-26T00:00:00.654199+08:00 │ 2022-10-25 16:00:00.654199 │ 1666713600654 │
└──────────────────────────────────────────────┴──────────────────────────────────┴────────────────────────────┴───────────────┘
*/
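If you need the full microsecond precision rather than milliseconds, toUnixTimestamp64Micro can be substituted in the same pipeline; a minimal sketch:

```sql
SELECT toUnixTimestamp64Micro(
    parseDateTime64BestEffort('2022-10-26T00:00:00.654199+00:00', 6)) AS ts_us;
```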

SQL Server - Join to avoid NULLs

I'm sorry I couldn't explain the issue clearly. The actual problem is that I have a transaction table that contains item transactions, such as purchases and sales, across various locations. I need to find the unit purchase cost of all items across all branches. Now, in a given location, not all items may be purchased, while all items are purchased at the central warehouse; that is, some items are transferred from the warehouse to locations instead of being purchased at the location. In such cases, the unit cost should be picked from the central warehouse purchase data.
Now, I can get the items and purchase cost for each location from the transaction table, given that the item was purchased at that location. My question was how to fetch the central warehouse price for items that have no purchase history in the transaction table and list it alongside all the other locations' purchase costs. The difficulty is that if there is no purchase history, I have no item number to search for in the central warehouse.
Frankly, I did not know how to do this through a SQL query in a single go. Hence, as a first step I made a master view containing all branches and items. This is not ideal because the data is huge: I have around 50 locations and 200K items, resulting in 50 x 200K rows. However, it served the purpose of acting as a location-item master.
Second, I made a central warehouse master with item and purchase cost at the warehouse.
Third, I queried the transaction table to fetch items that have no purchase at specific locations. These item IDs were linked to the location-item master, and I used a CASE statement: if the purchase cost is NULL, get the cost from the warehouse.
Thank you for pointing out the mistakes and for introducing COALESCE.
Table (Tab1) is as below:
┌─────────┐
│ TabCol1 │
├─────────┤
│ 01 │
│ 02 │
│ 03 │
│ 04 │
│ 05 │
└─────────┘
I have a table (Tab2 ) with two columns:
┌──────┬──────┐
│ Col1 │ Col2 │
├──────┼──────┤
│ 1111 │ 01 │
│ 1111 │ 02 │
│ 1111 │ 03 │
└──────┴──────┘
If we join the above table we get:
┌─────────┬──────┬──────┐
│ TabCol1 │ Col1 │ Col2 │
├─────────┼──────┼──────┤
│ 01 │ 1111 │ 01 │
│ 02 │ 1111 │ 02 │
│ 03 │ 1111 │ 03 │
│ 04 │ NULL │ NULL │
│ 05 │ NULL │ NULL │
└─────────┴──────┴──────┘
What I need is, instead of NULL, I must get 1111:
┌─────────┬──────┬──────┐
│ TabCol1 │ Col1 │ Col2 │
├─────────┼──────┼──────┤
│ 01 │ 1111 │ 01 │
│ 02 │ 1111 │ 02 │
│ 03 │ 1111 │ 03 │
│ 04 │ 1111 │ 04 │
│ 05 │ 1111 │ 05 │
└─────────┴──────┴──────┘
In other words, I need to make a master table, with all COL1 filled to avoid NULL.
What you are trying to achieve makes no sense to me at all, but there's one way to get that result:
select T1.TabCol1,
coalesce(T2.Col1, '1111'),
coalesce(T2.Col2, T1.TabCol1)
from Tab1 T1 left join Tab2 T2 on T1.TabCol1 = T2.Col2
You can replace NULLs with whatever you want (note that Tab1 must go on the left of the LEFT JOIN so that all of its rows are kept):
SELECT TabCol1, ISNULL(Col1, '1111') AS Col1, ISNULL(Col2, TabCol1) AS Col2
FROM Tab1
LEFT JOIN Tab2 ON Tab1.TabCol1 = Tab2.Col2
Note that ISNULL only works in SQL Server. Alternatively, you can use COALESCE, which is supported by most databases:
SELECT TabCol1, COALESCE(Col1, '1111') AS Col1, COALESCE(Col2, TabCol1) AS Col2
FROM Tab1
LEFT JOIN Tab2 ON Tab1.TabCol1 = Tab2.Col2

dialog --buildlist option, how to use it?

I've been reading up on the many uses of dialog to create interactive shell scripts, but I'm stumped on how to use the --buildlist option. I've read the man pages, searched Google, searched Stack Overflow, and even read through some old articles of Linux Journal from 1994, to no avail.
Can someone give me a clear example of how to use it properly?
Let's imagine a directory with 5 files which you'd want to select from, to copy to another directory. Can someone give a working example?
Thank you!
Consider the following:
dialog --buildlist "Select a directory" 20 50 5 \
f1 "Directory One" off \
f2 "Directory Two" on \
f3 "Directory Three" on
This will display something like
┌────────────────────────────────────────────────┐
│ Select a directory │
│ ┌─────────────────────┐ ┌────^(-)─────────────┐│
│ │Directory One │ │Directory Two ││
│ │ │ │Directory Three ││
│ │ │ │ ││
│ │ │ │ ││
│ │ │ │ ││
│ └─────────────────────┘ └─────────────100%────┘│
│ │
│ │
│ │
│ │
│ │
│ │
│ │
│ │
├────────────────────────────────────────────────┤
│ <OK> <Cancel> │
└────────────────────────────────────────────────┘
The box is 50 characters wide and 20 rows tall; each column displays 5 items. off/on determines if the item starts in the left or right column, respectively.
The controls:
^ selects the left column
$ selects the right column
Move up and down the selected column with the arrow keys
Move the selected item to the other column with the space bar
Toggle between OK and Cancel with the tab key. If you use the --visit-items option, the tab key lets you cycle through the lists as well as the buttons.
Hit Enter to select OK or Cancel.
If you select OK, the tags (f1, f2, etc.) associated with each item in the right column are printed to standard error.
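Putting that together for the original question, here is a sketch of a script that lets you pick files from a source directory and copies the chosen ones to a destination (directory names and defaults are assumptions for illustration). Since the selected tags go to standard error, file descriptors are swapped to capture them:

```shell
#!/bin/sh
# Sketch: pick files from $src with --buildlist, copy them to $dest.
# Assumes the ncurses `dialog` utility is installed.

copy_selected() {
    # $1 = source dir, $2 = destination dir, remaining args = filenames
    s=$1 d=$2
    shift 2
    for f in "$@"; do
        cp "$s/$f" "$d/$f"
    done
}

if [ -t 0 ]; then                 # only show the dialog on a real terminal
    src=${1:-.} dest=${2:-/tmp}
    # Build the "tag item status" triples from the files in $src
    set --
    for f in "$src"/*; do
        [ -f "$f" ] && set -- "$@" "${f##*/}" "${f##*/}" off
    done
    # Swap stdout/stderr so the selected tags land in $selected
    selected=$(dialog --buildlist "Select files to copy" 20 60 10 "$@" \
               3>&1 1>&2 2>&3)
    # Tags are whitespace-separated; this assumes filenames without spaces
    copy_selected "$src" "$dest" $selected
fi
```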

SQL query for converting column breaks in a single column

I have a database in Postgres where one of the columns contains text data with multiple line breaks.
So when I export the data into a CSV file, the columns get jumbled!
I need a query that ignores the line breaks within that single column, so that the column's data stays in the same column and does not spill over into the next one.
This example table exhibits the problem you are talking about:
test=> SELECT * FROM breaks;
┌────┬───────────┐
│ id │ val │
├────┼───────────┤
│ 1 │ text with↵│
│ │ three ↵│
│ │ lines │
│ 2 │ text with↵│
│ │ two lines │
└────┴───────────┘
(2 rows)
Then you can use the replace function to replace the line breaks with spaces:
test=> SELECT id, replace(val, E'\n', ' ') FROM breaks;
┌────┬───────────────────────┐
│ id │ replace │
├────┼───────────────────────┤
│ 1 │ text with three lines │
│ 2 │ text with two lines │
└────┴───────────────────────┘
(2 rows)
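If the text may also contain carriage returns (data that originated on Windows, say), a variant using regexp_replace collapses \r\n, bare \r, and runs of consecutive line breaks into a single space:

```sql
SELECT id, regexp_replace(val, E'[\\r\\n]+', ' ', 'g') FROM breaks;
```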

How to convert several duplicate rows into an array in SQL (Postgres)?

I have the following table One:
id │ value
────┼───────
1 │ a
2 │ b
And Two:
id │ value
─────┼───────
10 │ a
20 │ a
30 │ b
40 │ a
50 │ b
One.value has a unique constraint but not Two.value (one-to-many relationship).
Which SQL (Postgres) query will retrieve as array the ids of Two whose value match One.value? The result I am looking for is:
id │ value
─────────────┼───────
{10,20,40} │ a
{30,50} │ b
SELECT array_agg(id) AS id, "value"
FROM Two
GROUP BY "value";
Using value as an identifier (a column name here) is bad practice, as it is a reserved keyword.
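If rows of Two whose value has no match in One should be excluded (rather than aggregated regardless), an explicit join against One expresses that; with the sample data the result is the same:

```sql
SELECT array_agg(t.id) AS id, t."value"
FROM Two AS t
JOIN One AS o ON o."value" = t."value"
GROUP BY t."value";
```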