Hive UDF to return multiple colunm output - hive

How create a UDF which take a String return multiple Strings ?
The UDF so far I have seen could only give one output. How to get multiple feilds as output from a UDF ?
Simplest would be implementation of name -> FirstName, LastName.
Not looking for alternate solution to split names, but looking for API / UDF which would help implement such needs .
Lets Say nameSplitteris my UDF
Select age,nameSplitter(name) as firstName,LastName from myTable;
InPut
****Input****
------------------------
Age | Name
------------------------
24 | John Smit
13 | Sheldon Cooper
-------------------------
OutPut
****Out put ****
-----------------------------------
Age | First Name | Last Name
-----------------------------------
24 | John | Smit
13 | Sheldon | Cooper
-----------------------------------

Use split() function, it splits strinng around regexp pattern and returns an array:
select age,
NameSplitted[0] as FirstName,
NameSplitted[1] as LastName
from
(
select age,
split(Name,' +') as NameSplitted
from myTable
)s;
Or just select age, split(Name,' +')[0] FirstName, split(Name,' +')[0] LastName from myTable;
Pattern ' +' means one or more spaces.
Also if you have three words names or even longer and you want to split only first word as a name and everything else as last name, or using more complex rule, you can use regexp_extract function like in this example:
hive> select regexp_extract('Johannes Chrysostomus Wolfgangus Theophilus Mozart', '^(.*?)(?: +)(.*)$', 1);
OK
Johannes
Time taken: 1.144 seconds, Fetched: 1 row(s)
hive> select regexp_extract('Johannes Chrysostomus Wolfgangus Theophilus Mozart', '^(.*?)(?: +)(.*)$', 2);
OK
Chrysostomus Wolfgangus Theophilus Mozart
Time taken: 0.692 seconds, Fetched: 1 row(s)
Pattern here means: the beginning of the string '^', first capturing group consisting of any number of characters (.*?), non-capturing group consisting of any number of spaces (?: +), last capturing group consisting of any number of characters greedy (.*), and $ means the end of the string

Related

is there a way to extract 70% of character from a string using bigquery?

I have column with Names of various length like :
Name | ID
Avi | 01
Li | 02
Amandeep | 03
I want to extract 70% of characters.
I am using :
substring(Name,1, (length(Name)-5))
But this does not work when length(name) is less than 2 or 3
I think you want:
SELECT Name, ID, SUBSTRING(Name, 1, CAST(CEIL(0.7*LENGTH(Name)) AS INT64)) AS Name70Pct
FROM yourTable;
Here we are taking 70% of the length of the name. I wrap with CEIL() to ensure that a name of one character will at least return that one character.

Big Query -- Reorder elements within a delimited string by another delimiter

Summary
I'd like to reorder elements in a string, the elements are delimited by new lines.
The elements I'd like to sort should be ordered by a string that can have numbers or letters within it. This sorting string is not at the beginning of the data, but rather it is also a delimited string (messy data set, I know). To make this even messier, there is an extra new line; this doesn't seem like the crux of this issue
Example
Below is a simplified version of what I'd like to do. I have a table, and I'd like to sort students' favorite shows and characters by the show's name, which is the second element of a pipe-delimited string.
student
favorite characters and shows
alice
10th doctor | dr who troy | community
bob
11 | stranger things Liz | 30 Rock mr peanut butter | bojack horseman
would become this:
student
favorite characters and shows
alice
troy | community 10th doctor | dr who
bob
Liz | 30 Rock mr peanut butter | bojack horseman 11 | stranger things
What I've tried
Big Query doesn't allow arrays of arrays. If it did, I would have an easier time here. I've tried working with COLLATE but today is my first time seeing that function; I'm not sure that is the right way to go, anyways.
Currently, I'm working to split by new line, and rejoin later. I have never done this with tables, so I'm a bit out of my element. Here is the query I'm working from:
WITH
-- example data from above
example_data AS (
SELECT
'alice' AS student,
-- note: the new line is at the end of every pipe-delimited line, so there is always some floating empty row when using functions like split()
'10th doctor | dr who\ntroy | community\n' AS favorite_characters_and_shows
UNION ALL
SELECT
'bob' AS student,
"11 | stranger things\nLiz | 30 Rock\nmr peanut butter | bojack horseman\n" AS favorite_characters_and_shows ),
-- I have no need for this to be another table, but it is where I am. Tell me if this is misguided, please.
soln_table AS (
SELECT
example_data.student,
example_data.favorite_characters_and_shows,
SPLIT(example_data.favorite_characters_and_shows, '\n'),
array( select x from unnest(SPLIT(example_data.favorite_characters_and_shows, '\n') ) as x order by x) as foo,
FROM
example_data )
-- where I am trying to display a sorted solution
SELECT
*
FROM
soln_table;
Consider below approach
select student, (
select string_agg(line, '\n' order by split(line, '|')[safe_offset(1)])
from unnest(split(favorite_characters_and_shows, '\n')) line
where trim(line) != ''
) as favorite_characters_and_shows
from example_data
if applied to sample data in your question - output is

SQL WHERE column values into capital letters

Let's say I have the following entries in my database:
Id
Name
12
John Doe
13
Mary anne
13
little joe
14
John doe
In my program I have a string variable that is always capitalized, for example:
myCapString = "JOHN DOE"
Is there a way to retrieve the rows in the table by using a WHERE on the name column with the values capitalized and then matching myCapString?
In this case the query would return two entries, one with id=12, and one with id=14
A solution is NOT to change the actual values in the table.
A general solution in Postgres would be to capitalize the Name column and then do a comparison against an all-caps string literal, e.g.
SELECT *
FROM yourTable
WHERE UPPER(Name) = 'JOHN DOE';
If you need to implement this is Knex, you will need to figure out how to uppercase a column. This might require using a raw query.

How to limit returned rows using max characters in sqlite query

Say I have students table with one column named name as follows:
| name |
--------
Jhon
Natalie Kocher
Jonell Dickson
Irvin
Kiara Audet
Shawna Duvall
Cobey Maryellen
Kenny
Lindsy Taylor
How to get all rows that under 6 characters so that I get the following name:
Jhon
Irvin
Kenny
In case that's not possible at least how to sort / set the order of returned rows start from smallest length of chars to the longest.
Thanks,
About SQLite query
Use the length() function:
select t.*
from t
where length(name) < 6;
Or you can use not like:
where name not like '______%' -- there are 6 underscores

Custom sorting (order by) in PostgreSQL, independent of locale

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to be first just like in C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?
If you want a custom sort, you must define some function that modifies your values in some way so that the natural ordering of the modified values fits your requirement.
For example, you can append some character or string it the value starts with uppercase:
CREATE OR REPLACE FUNCTION mysort(text) returns text IMMUTABLE as $$
SELECT CASE WHEN substring($1 from 1 for 1) =
upper( substring($1 from 1 for 1)) then 'AAAA' || $1 else $1 END
;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolprof (you might want to change 'AAA' for something more apt) and hurts performance, of course.
If you want it efficient, you'll need to create another column that "naturally" sorts correctly (e.g. even in the C locale), and use that as a sorting criterion. For that, you should use the approach of the strxfrm C library function. As a straight-forward strxfrm table for your approach, replace each letter with two ASCII letters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and the regular ASCII sorting will sort it correctly.
If you don't want to create a separate column, you can try to use a function in the where clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).