N-gram combinations of words in Hive

I have a table with 1000 rows and 3 columns (an ID, a COUNTRY and a string column "VAR1"). VAR1 is a sentence made up of words separated by spaces.
I want, by COUNTRY, a count of all pairs (or all triplets) of words. Importantly, the pairs (or triplets) must be combinations of all the words in the sentence, not only of adjacent words. Perhaps this can be done with Hive's ngrams function, but when I use it, it only counts adjacent words and not every combination.
Here is an example of what I want:
> **"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA" "bird lion car"
VAR1 does not necessarily contain 3 words; that is just to keep the example simple.
The results I want for 2-grams, in 4 steps:
STEP 1 : THE MOST IMPORTANT STEP : crossing words by 2
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "lion bird" 1
> "3" "USA" "lion car" 1
> "3" "USA" "car bird" 1
> "3" "USA" "car lion" 1
STEP 2 : order the 2-grams
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "car lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "car lion" 1
STEP 3 : distinct by ID, COUNTRY, 2-ngrams
> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA" "bird lion"
> "3" "USA" "bird car"
> "3" "USA" "car lion"
STEP 4 : count by COUNTRY, 2-ngrams
> "CANADA" "dad mum" 2
> "CANADA" "child dad" 1
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1
> "CANADA" "dog mum" 1
> "USA" "bird lion" 1
> "USA" "bird car" 1
> "USA" "car lion" 1
Thank you very much.

with cte as
(
    select  t.ID
           ,t.COUNTRY
           ,pe.pos
           ,pe.val
    from    mytable t
            lateral view posexplode(split(VAR1,'\\s+')) pe
)
select  t1.COUNTRY
       ,concat_ws(' ',t1.val,t2.val) as combination
       ,count(*) as cnt
from    cte t1
        join cte t2
        on   t2.id = t1.id
where   t1.pos < t2.pos
group by t1.COUNTRY
        ,t1.val
        ,t2.val
;
+----------+--------------+------+
| country | combination | cnt |
+----------+--------------+------+
| CANADA | dad child | 1 |
| CANADA | dad dog | 1 |
| CANADA | dad mum | 2 |
| CANADA | mum child | 1 |
| CANADA | mum dog | 1 |
| USA | bird car | 1 |
| USA | bird lion | 1 |
| USA | lion car | 1 |
+----------+--------------+------+
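
Note that the query output above keeps each pair in sentence order ("dad child"), while the question's STEP 2 asks for alphabetical order ("child dad") and STEP 3 asks for pairs to be distinct per ID. As a cross-check of the expected result, the four steps can be sketched outside Hive in plain Python. This is only an illustration under stated assumptions: the function name count_word_combinations and the in-memory rows list are made up for the example, and words are split on whitespace.

```python
from collections import Counter
from itertools import combinations

# Sample rows from the question: (ID, COUNTRY, VAR1)
rows = [
    (1, "CANADA", "dad mum child"),
    (2, "CANADA", "dad mum dog"),
    (3, "USA", "bird lion car"),
]

def count_word_combinations(rows, n=2):
    """Count, per country, the rows containing each combination of n words."""
    counts = Counter()
    for _id, country, sentence in rows:
        # Sorting gives the alphabetical order of STEP 2; the set removes
        # repeated words so each combination is counted once per ID (STEP 3).
        words = sorted(set(sentence.split()))
        # combinations() crosses every word with every other word (STEP 1).
        for combo in combinations(words, n):
            counts[(country, " ".join(combo))] += 1  # STEP 4
    return counts

pair_counts = count_word_combinations(rows, n=2)
for (country, combo), cnt in sorted(pair_counts.items()):
    print(country, combo, cnt)
```

Passing n=3 gives the triplets instead of the pairs.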


Use column value as column name when inserting (Clickhouse)

I have a source table:
Data
122435
2912
32
I want to select data from this table and insert into a destination table. The desired output for the destination table is:
Index_1 | Index_2 | Index_3
2       | 4       | 5
2       | 9       |
        |         | 2
The logic behind this is numbers in odd positions are indexes (columns), and numbers in even positions are values:
1) "122435" -> "12", "24", "35" -> "Index_1 = 2", "Index_2 = 4", "Index_3 = 5"
2) "2912" -> "29", "12" -> "Index_2 = 9", "Index_1 = 2"
3) "32" -> "Index_3 = 2"
My problem is that I don't know whether it is possible to use a column value as a column name in Clickhouse.
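
To make the pairing rule concrete, here is a minimal sketch of the parsing logic in Python, outside ClickHouse. It does not answer whether ClickHouse can do this natively. It assumes that every index and every value is a single digit, as in the examples, and the function name parse_index_value_pairs and the num_indexes parameter are hypothetical.

```python
def parse_index_value_pairs(data, num_indexes=3):
    """Turn a digit string such as "2912" into one destination row."""
    # Start with every Index_N column empty (None).
    row = {f"Index_{i}": None for i in range(1, num_indexes + 1)}
    # Walk the string two characters at a time: index digit, then value digit.
    for pos in range(0, len(data) - 1, 2):
        index, value = data[pos], int(data[pos + 1])
        row[f"Index_{index}"] = value
    return row

for data in ["122435", "2912", "32"]:
    print(data, "->", parse_index_value_pairs(data))
```

Columns that never appear as an index in a given string stay empty for that row.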

how to name matrix columns

I have a matrix like the one below. How can I assign column names such as "month", "2015", "2016", "2017" to columns 2:5? Thank you.
[,1] [,2] [,3] [,4] [,5]
[1,] "" "1" "75" "75" "94"
[2,] "" "2" "77" "67" "69"
[3,] "" "3" "67" "78" "80"
[4,] "" "4" "71" "99" "84"
[5,] "" "5" "62" "89" "74"
Assuming you're using R, you could do something like this (for matrix M)
colnames(M) <- c("","month","2015","2016","2017")

Filling up an ms-access table with sql

I am very new to databases and am currently working with Microsoft Access 2013. I have a large amount of data that I want to insert into an existing table (Inventory) using an SQL statement in a query.
What I have is the following:
INSERT INTO Inventory (Col 1, Col 2, Col 3, Col 4)
VALUES ("Val 1", "Val 2", "Val 3", "Val 4"),
("Val 5", "Val 6", "Val 7", "Val 8"),
....
("Val 9", "Val 10", "Val 11", "Val 12");
And what I want is simply this table:
Col 1 | Col 2 | Col 3 | Col 4
| | |
Val 1 | Val 2 | Val 3 | Val 4
Val 5 | Val 6 | Val 7 | Val 8
Val 9 | Val 10 | Val 11 | Val 12
The problem is that I keep getting the error "Missing semicolon at the end of sql-statement". I therefore assumed that I should add a semicolon after each line, but if I do that, I get an error saying that Access found characters after the semicolon.
What is the right syntax for a multi-row INSERT INTO statement?
I think MS Access only allows you to insert one record at a time using INSERT . . . VALUES:
INSERT INTO Inventory (Col 1, Col 2, Col 3, Col 4)
VALUES ("Val 1", "Val 2", "Val 3", "Val 4");
INSERT INTO Inventory (Col 1, Col 2, Col 3, Col 4)
VALUES ("Val 5", "Val 6", "Val 7", "Val 8");
....
INSERT INTO Inventory (Col 1, Col 2, Col 3, Col 4)
VALUES ("Val 9", "Val 10", "Val 11", "Val 12");
You can bulk insert using INSERT INTO ... SELECT and a union query:
INSERT INTO Inventory (Col 1, Col 2, Col 3, Col 4)
SELECT "Val 1", "Val 2", "Val 3", "Val 4"
FROM (SELECT First(ID) FROM MSysObjects) dummy
UNION ALL
SELECT "Val 5", "Val 6", "Val 7", "Val 8"
FROM (SELECT First(ID) FROM MSysObjects) dummy
UNION ALL
SELECT "Val 9", "Val 10", "Val 11", "Val 12"
FROM (SELECT First(ID) FROM MSysObjects) dummy
However, the overhead might not make this construct worthwhile, and Access does have a maximum length on single queries of ~65K characters.
To serialize and deserialize tables, I recommend using ADO and persistence. This stores field properties properly, whereas serializing to other file formats or database formats will cause information loss.

SSCAN doesn't return only part of the members

I have a set that contains integer values, and I want to retrieve part of it with SSCAN.
127.0.0.1:6379[1]> smembers d
1) "1"
2) "2"
3) "3"
4) "4"
5) "5"
6) "6"
7) "7"
8) "8"
...
But SSCAN returns the full list of members:
127.0.0.1:6379[1]> sscan d 0
1) "0"
2) 1) "1"
   2) "2"
   3) "3"
   4) "4"
   5) "5"
   6) "6"
   7) "7"
   8) "8"
   9) "9"
....
Is there any way to get the members page by page (e.g. 10 items per scan)?
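Use the COUNT option, as explained in SCAN's documentation, to limit how many elements each call returns (for example, sscan d 0 COUNT 10). Note that COUNT is only a hint: for small sets stored in a compact encoding, Redis may still return every member in a single call.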

SQL - CASE WHEN troubles

Running into an issue with a CASE WHEN statement. Sample script below:
SELECT
CASE WHEN Column1 = "Example 1" THEN "Name1"
     WHEN Column1 = "Example 2" THEN "Name2"
     WHEN Column1 = "Example 3" THEN "Name3"
     WHEN Column1 = "Example 3" AND Column2 IN ("Sample1", "Sample2") THEN "Name4"
     WHEN Column1 = "Example 3" AND Column2 IN ("Sample3", "Sample4") THEN "Name5"
     ELSE "-" END AS Name,
[aggregation language that doesn't affect the script]
FROM Table1
GROUP BY Name
HAVING Name IN ("Name1", "Name2", "Name3", "Name4", "Name5")
ORDER BY Name ASC
The issue I'm having is that when executing the script "Name1", "Name2", and "Name3" all pull (and pull accurately), but "Name4" and "Name5" won't pull at all, presumably because they share a condition with "Name3" (Column1 = "Example3").
Essentially, I'm trying to pull both the aggregate that is "Name3" and its components, "Name4" and "Name5".
One way to think about it is that "Name3" is the NFL and "Name4" and "Name5" are the AFC and NFC, respectively. Because I'm pulling in the NFL with the condition {Column1 = "Example3"}, it won't pull in the AFC and NFC, despite having a second required "AND" condition.
Would LOVE if someone could help here. I've tried using parentheses, changing the order of the WHENs...no luck.
Thanks in advance!
My recommendation would be to change the ordering of your cases:
SELECT
CASE WHEN Column1 = "Example 1" THEN "Name1"
     WHEN Column1 = "Example 2" THEN "Name2"
     WHEN Column1 = "Example 3" AND Column2 IN ("Sample1", "Sample2") THEN "Name4"
     WHEN Column1 = "Example 3" AND Column2 IN ("Sample3", "Sample4") THEN "Name5"
     WHEN Column1 = "Example 3" THEN "Name3"
     ELSE "-" END AS Name,
[aggregation language that doesn't affect the script]
FROM Table1
GROUP BY Name
HAVING Name IN ("Name1", "Name2", "Name3", "Name4", "Name5")
ORDER BY Name ASC
With your current ordering, whenever the "Name4" or "Name5" condition is true, the "Name3" condition is also true, and because it comes first it is the one that matches. With the modified ordering, "Name3" is returned only when the "Name4" and "Name5" conditions are both false. Make sense?
> Would LOVE if someone could help here. I've tried using parentheses, changing the order of the WHENs...no luck.
You're not being entirely honest, are you?
http://sqlfiddle.com/#!6/11381/3/0 -- a simple switch of the WHEN conditions fixes your little problem. The lesson here is that testing stops at the first condition that's true.
For "Name3", you need to exclude the conditions of "Name4" and "Name5":
CASE WHEN Column1 = "Example 1" THEN "Name1"
     WHEN Column1 = "Example 2" THEN "Name2"
     WHEN Column1 = "Example 3"
          AND Column2 NOT IN ("Sample1", "Sample2", "Sample3", "Sample4") THEN "Name3"
     WHEN Column1 = "Example 3" AND Column2 IN ("Sample1", "Sample2") THEN "Name4"
     WHEN Column1 = "Example 3" AND Column2 IN ("Sample3", "Sample4") THEN "Name5"
     ELSE "-" END AS Name,
CASE WHEN returns the result of the first matching condition and ignores the rest, so you need to make the earlier condition narrower. Alternatively, you could put the Name4/Name5 conditions before the Name3 condition.
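
The first-match behaviour is easy to reproduce. The sketch below uses Python's built-in sqlite3 module with an in-memory table. The three sample rows are invented for illustration, SQLite stands in for whatever database the question uses, and string literals use single quotes, as standard SQL expects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Column1 TEXT, Column2 TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [("Example 3", "Sample1"), ("Example 3", "Sample3"), ("Example 3", "Other")],
)

# Broad condition first: it matches every 'Example 3' row, so the
# narrower branches below it are never reached.
broad_first = """
SELECT CASE
         WHEN Column1 = 'Example 3' THEN 'Name3'
         WHEN Column1 = 'Example 3' AND Column2 IN ('Sample1', 'Sample2') THEN 'Name4'
         WHEN Column1 = 'Example 3' AND Column2 IN ('Sample3', 'Sample4') THEN 'Name5'
         ELSE '-'
       END AS Name
FROM Table1
ORDER BY rowid
"""

# Narrow conditions first: the broad one only catches what is left over.
narrow_first = """
SELECT CASE
         WHEN Column1 = 'Example 3' AND Column2 IN ('Sample1', 'Sample2') THEN 'Name4'
         WHEN Column1 = 'Example 3' AND Column2 IN ('Sample3', 'Sample4') THEN 'Name5'
         WHEN Column1 = 'Example 3' THEN 'Name3'
         ELSE '-'
       END AS Name
FROM Table1
ORDER BY rowid
"""

broad_first_names = [row[0] for row in conn.execute(broad_first)]
narrow_first_names = [row[0] for row in conn.execute(narrow_first)]
print(broad_first_names)   # ['Name3', 'Name3', 'Name3']
print(narrow_first_names)  # ['Name4', 'Name5', 'Name3']
```

In both orderings each row gets exactly one label, because a CASE expression yields a single value per row.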