Tokenize Function in Hive - hive

I am trying to follow this example where the term frequency and inverse document frequency is calculated in Hive:https://github.com/myui/hivemall/wiki/TFIDF-calculation
I have a table called pigoutputhive where I have the following fields:
The 'body' column contains a string of words [a-z A-Z & 0-9 only] separated by spaces.
I would like to tokenize the body so that I can generate a relation with a owneruserid and body tuple in order to perform the TF-IDF algorithm.
I am receiving an error relating the the tokenize function, can anyone tell me where I am going wrong?
My error is as follows: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 8:37 Invalid function 'tokenize' [ERROR_STATUS]
create or replace view pigoutputhive_exploded
as
select
owneruserid,
body,
score
from
pigoutputhive LATERAL VIEW explode(tokenize(body,true)) t as word
where
not is_stopword(word);

The tokenize function is a Hivemall extension to Hive.
So, you need to install Hivemall first.
See the following page for loading Hivemall functions into Hive.
https://github.com/myui/hivemall/wiki/Installation

Tokenize does not work in Hive and had to use sentences() function.

Related

Extract JSON content in Metabase SQL query

Using: Django==2.2.24, Python=3.6, PostgreSQL is underlying DB
Working with Django ORM, I can easily make all sort of queries, but I started using Metabase, and my SQL might be a bit rusty.
The problem:
I am trying to get a count of the items in a list, under a key in a dictionary, stored as a JSONField:
from django.db import models
from jsonfield import JSONField
class MyTable(models.Model):
data_field = JSONField(blank=True, default=dict)
Example of the dictionary stored in data_field:
{..., "my_list": [{}, {}, ...], ...}
Under "my_list" key, the value stored is a list, which contains a number of other dictionaries.
In Metabase, I am trying to get a count for the number of dictionaries in the list, but even more basic things, none of which work.
Some stuff I tried:
Attempt:
SELECT COUNT(elem->'my_list') as my_list_count
FROM my_table, json_object_keys(data_field:json) AS elem
Error:
ERROR: syntax error at or near ":" Position: 226
Attempt:
SELECT ARRAY_LENGTH(elem->'my_list') as my_list_count
FROM my_table, JSON_OBJECT_KEYS(data_field:json) AS elem
Error:
ERROR: syntax error at or near ":" Position: 233
Attempt:
SELECT JSON_ARRAY_LENGTH(data_field->'my_list'::json)
FROM my_table
Error:
ERROR: invalid input syntax for type json Detail: Token "my_list" is invalid. Position: 162 Where: JSON data, line 1: my_list
Attempt:
SELECT ARRAY_LENGTH(JSON_QUERY_ARRAY(data_field, '$.my_list'))
FROM my_table
Error:
ERROR: function json_query_array(text, unknown) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 140
Basically, I think the issue is that I am using the wrong signatures (most of the time) in the methods I am trying to use.
I used this query to make sure I can at least get the keys from the dictionary:
SELECT JSON_OBJECT_KEYS(data_field::json)
FROM my_table
I was not able to use JSON_OBJECT_KEYS() without adding the ::json cast, I was getting this error:
ERROR: function json_object_keys(text) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 127
But with the json cast, I am getting all the keys as intended.
Thank you for taking a look!
EDIT:
I also found this interesting article with different solution but none of the solutions worked.
Also seen this SO post which did not help.
Ok, after some more digging around, I found this article, which had the correct format/syntax.
This code is what I used to fetch the list from the JSON object successfully:
select data_field::json->'my_list' as the_list
from my_table
Then, I used json_array_length() to get the number of elements:
select json_array_length(data_field::json->'my_list') as number_of_elements
from my_table
All done! :)
EDIT:
I just found the reason to this whole shenanigan.
In the code (which goes years back) we used this package:
jsonfield==1.0.3
And used this way:
from jsonfield import JSONField
The issue is that in the background, Postgres saves the data as a string, so it needs to be cast into a JSON.
Later Django introduced its own JSONField, which stores data as you would expect, without a need to cast:
from django.contrib.postgres.fields import JSONField

PostgresSQL - Converting String in JSONB to Actual JSONB, Error: Token is invalid

I am having trouble working with the JSONB structure in PostgreSQL. currently my data is saved as follows:
"{\"Hello\":\"World\",\"idx\":0}"
Which obviously is not correct 😀 so I am trying to "repair" this and get the actual JSON representation for querying with:
SELECT regexp_replace(trim('"' FROM json_data::text), '\\"', '"', 'g')::jsonb FROM My_table
However when trying this, I get the following error:
ERROR: invalid input syntax for type json
DETAIL: Token "Рыба" is invalid.
CONTEXT: JSON data, line 1: ...х, как : Люди X, Пароль \\"Рыба...
SQL state: 22P02
So I am thinking that this is due to the character encoding that is not being accepted by the JSONB standard.
My main question then is though, how can I repair this kind of table so that I am still able to query it? I tried utilizing conver_from and convert_to but am unable to figure out how to fix this error... did anyone encounter this already?
Update: Found it! (thanks to Convert JSON string to JSONB), utilizing
SELECT (json_data#>>'{}')::jsonb FROM my_table
fixed it

react-native-realm pattern matching causes invalid predicate

I have this line of code:
realm.objects('Users').filtered("profile LIKE '%athletic%'")
and I tried this line of code:
realm.objects('Users').filtered("profile LIKE '*athletic*'")
It gave the error profile LIKE '%athletic%':1:0: Invalid predicate when running the program. I am unable to find documentation for pattern matching in react-native-realm. How do I find records that return users who have profiles with the string athletic in it?
For that kind of query, you can use CONTAINS
realm.objects('Users').filtered("profile CONTAINS 'athletic'")
See more on the query language used in Realm-JS here

String interpolation in HiveQL

I'm trying to create an external table that routes to an S3 bucket. I want to group all the tables by date, so the path will be something like s3://<bucketname>/<date>/<table-name>. My current function for creating it looks something like
concat('s3://<bucket-name>/', date_format(current_date,'yyyy-MM-dd'), '/<table-name>/');
I can run this in a SELECTquery fine; however when I try to put it in my table creation statement, I get the following error:
set s3-path = concat('s3://<bucket-name>/', date_format(current_date,'yyyy-MM-dd'), '/<table-name>/');
CREATE EXTERNAL TABLE users
(id STRING,
name STRING)
STORED AS PARQUET
LOCATION ${hiveconf:s3-path};
> FAILED: ParseException line 7:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to do string interpolation or a function invocation in Hive in this context?
you can try something like this. retrieve udf results in Hive. Essentially, as workaround, you can create a script and call it from the terminal passing the parameter like a hive config.

What is wrong with my SQL in SelectLayerByAttribute?

My code is arcpy.SelectLayerByAttribute_management("polygons_file", "NEW_SELECTION", "Shape_Area=(SELECT MAX(Shape_Area) FROM polygons_file")
I am getting an error "The SQL expression is invalid."
How can I fix it?
Subqueries should work, but you have a mismatched parentheses.
Wrong: Shape_Area=(SELECT MAX(Shape_Area) FROM polygons_file
Right: Shape_Area=(SELECT MAX(Shape_Area) FROM polygons_file)
Also note the different field delimiters to use depending on your data source (e.g. shapefile or geodatabase); you may need to include quotation marks around Shape_Area, e.g.
arcpy.SelectLayerByAttribute_management('polygons_file', 'NEW_SELECTION', '"Shape_Area" = (SELECT MAX("Shape_Area") FROM polygons_file)')