extracting a substring from a text column in hive

extracting a substring from a text column in hive - hive

We have text data in a column named title like below
"id":"S-1-98-13474422323-33566802","name":"uid=Xzdpr0,ou=people,dc=vm,dc=com","shortName":"XZDPR0","displayName":"Jund Lee","emailAddress":"jund.lee#bm.com","title":"Leading Product Investor"
Need to extract just the display name (Jund lee in this example) from the above text data in hive, I have tried using substring function but don't seem to work,Please help

Use regexp_extract function with the matching regex to capture only the displayName from your title field value.
Ex:
hive> with tb as(select string('"id":"S-1-98-13474422323-33566802",
"name":"uid=Xzdpr0,ou=people,dc=vm,dc=com","shortName":"XZDPR0",
"displayName":"Jund Lee","emailAddress":"jund.lee#bm.com",
"title":"Leading Product Investor"')title)
select regexp_extract(title,'"displayName":"(.*?)"',1) title from tb;
+-----------+--+
| title |
+-----------+--+
| Jund Lee |
+-----------+--+

Related

extract json in column

I have a table
id | status | outgoing
-------------------------
1 | paid | {"a945248027_14454878":"processing","old.a945248027_14454878":"cancelled"}
2 | pending| {"069e5248cf_45299995":"processing"}
I am trying to extract the values after each underscore in the outgoing column e.g from a945248027_14454878 I want 14454878
Because the json data is not standardised I can't seem to figure it out.

You may extract the json key part after the underscore using regexp version of substring.
select id, status, outgoing,
substring(key from '_([^_]+)$') as key
from the_table, lateral jsonb_object_keys(outgoing) as j(key);
See demo.

Combine query to get all the matching search text in right order

I have the following table:
postgres=# \d so_rum;
Table "public.so_rum"
Column | Type | Collation | Nullable | Default
-----------+-------------------------+-----------+----------+---------
id | integer | | |
title | character varying(1000) | | |
posts | text | | |
body | tsvector | | |
parent_id | integer | | |
Indexes:
"so_rum_body_idx" rum (body)
I wanted to do phrase search query, so I came up with the below query, for example:
select id from so_rum
where body ## phraseto_tsquery('english','Is it possible to toggle the visibility');
This gives me the results, which only match's the entire text. However, there are documents, where the distance between lexmes are more and the above query doesn't gives me back those data. For example: 'it is something possible to do toggle between the. . . visibility' doesn't get returned. I know I can get it returned with <2> (for example) distance operator by giving in the to_tsquery, manually.
But I wanted to understand, how to do this in my sql statement itself, so that I get the results first with distance of 1 and then 2 and so on (may be till 6-7). Finally append results with the actual count of the search words like the following query:
select count(id) from so_rum
where body ## to_tsquery('english','string & string . . . ')
Is it possible to do in a single query with good performance?

I don't see a canned solution to this. It sounds like you need to use plainto_tsquery to get all the results with all the lexemes, and then implement your own custom ranking function to rank them by distance between the lexemes, and maybe filter out ones with the wrong order.

postgresql: automated extracting strings from text

I've the following table in a postgresl database
id | species
----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 |[{"id":1,"animalName":"Lupo appennico","animalCode":"LUPO"},{"id":2,"animalName":"Orso bruno marsicano","animalCode":"ORSO"},{"id":3,"animalName":"Volpe","animalCode":"VOLPE"}]
----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2 |[{"id":1,"animalName":"Cinghiale","animalCode":"CINGHIALE"},{"id":2,"animalName":"Orso bruno marsicano","animalCode":"ORSO"},{"id":3,"animalName":"Cervo","animalCode":"CERVO"}]|
I would like to extract only values after '"animalName":' and put them in a new field.
id | new_field |
----+--------------------------------------------+
1 |Lupo appennico, Orso bruno marsicano,Volpe |
----+--------------------------------------------+
2 |Cinghiale, Orso bruno marsicano, Cervo |
Unfortunately the field is a text type (not json or array). I've tried with regexp without success.

Your column is not of a json datatype, but it seems to contain valid json. If so, you can cast it and use json functions on it:
select id, string_agg(j ->> 'animalName', ', ') new_field
from mytable t
cross join lateral jsonb_array_elements(t.species::jsonb) j(obj)
group by id
order by id
Demo on DB Fiddle:
id | new_field
-: | :------------------------------------------
1 | Lupo appennico, Orso bruno marsicano, Volpe
2 | Cinghiale, Orso bruno marsicano, Cervo

Split string and Pivot Result - SQL Server 2012

I am using SQL Server 2012 and I have a table called XMLData that looks like this:
| Tag | Attribute |
|--------------|-----------------------------|
| tag1 | Cantidad=222¬ClaveProdServ=1|
| tag1 | Cantidad=333¬ClaveProdServ=2|
The column Tag has many repeated values, what is different is the column Attribute that has a string of attributes separated by "¬". I want to separate the list of attributes and then pivot the table so the tags are the column names.
The result I want is like this:
| tag1 | tag1 |
|-----------------|----------------|
| Cantidad=222 | Cantidad=333 |
| ClaveProdServ=1 | ClaveProdServ=2|
I have a custom made function that splits the string since SQL server 2012 doesn't have a premade function that does this. The function I have receives a
string as a parameter and the delimiter like so:
select *
from [dbo].[Split]('lol1,lol2,lol3,lol4',',')
this function will return this:
| item |
|--------|
| lol1 |
| lol2 |
| lol3 |
I can't find a way to pass the values of the column Attribute as parameter of this function, something like this:
SELECT *
FROM Split(A.Attribute,'¬'),XMLData A
And then put the values of the column Tag as the the column names for each set of Attributes

My magic crystal ball tells me, that you have - why ever - decided to do it this way and any comments about don't store CSV data are just annoying to you.
How ever...
If this is just a syntax issue, try it like this:
SELECT t.Tag
,t.Attribute
,splitted.item
FROM YourTable AS t
CROSS APPLY dbo.Split(t.Attribute,'¬') AS splitted
Otherwise show some more relevant details. Please read How to ask a good SQL question and How to create a MCVE

How do I create sql query for searching partial matches?

I have a set of items in db .Each item has a name and a description.I need to implement a search facility which takes a number of keywords and returns distinct items which have at least one of the keywords matching a word in the name or description.
for example
I have in the db ,three items
1.item1 :
name : magic marker
description: a writing device which makes erasable marks on whiteboard
2.item2:
name: pall mall cigarettes
description: cigarette named after a street in london
3.item3:
name: XPigment Liner
description: for writing and drawing
A search using keyword 'writing' should return magic marker and XPigment Liner
A search using keyword 'mall' should return the second item
I tried using the LIKE keyword and IN keyword separately ,..
For IN keyword to work,the query has to be
SELECT DISTINCT FROM mytable WHERE name IN ('pall mall cigarettes')
but
SELECT DISTINCT FROM mytable WHERE name IN ('mall')
will return 0 rows
I couldn't figure out how to make a query that accommodates both the name and description columns and allows partial word match..
Can somebody help?
update:
I created the table through hibernate and for the description field, used javax.persistence #Lob annotation.Using psql when I examined the table,It is shown
...
id | bigint | not null
description | text |
name | character varying(255) |
...
One of the records in the table is like,
id | description | name
21 | 133414 | magic marker

First of all, this approach won't scale in the large, you'll need a separate index from words to item (like an inverted index).
If your data is not large, you can do
SELECT DISTINCT(name) FROM mytable WHERE name LIKE '%mall%' OR description LIKE '%mall%'
using OR if you have multiple keywords.

This may work as well.
SELECT *
FROM myTable
WHERE CHARINDEX('mall', name) > 0
OR CHARINDEX('mall', description) > 0

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

extracting a substring from a text column in hive - hive

Related

extract json in column

Combine query to get all the matching search text in right order

postgresql: automated extracting strings from text

Split string and Pivot Result - SQL Server 2012

How do I create sql query for searching partial matches?

Categories

Resources