How to build distribution table for number of digits after the dot in hive? - hive

A Hive table has a string column that was converted from a float (double) column. I need a table that shows, for each number of digits after the dot, how many rows have that many digits.
+-----+------+--+
| num | _c0 |
+-----+------+--+
| 2 | 300 |
| 3 | 400 |
| 4 | 248 |
| 5 | 117 |
| 6 | 43 |
| NULL| 999 |
+-----+------+--+
There is a function to obtain number of digits after dot in column foo
length(split(foo, '\\.')[1])
So, my failed attempt to obtain the above table was
select length(split(foo, '\\.')[1]) as num, count(num) from tbl_bar group by num;
The error message was
Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:77 Invalid table alias or column reference 'num': (possible column names are: foo, moo, hroo) (state=42000,code=10004)
What is the correct query to get distribution by number of digits after the dot in column foo?

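A column alias defined in the select list can't be referenced in the group by clause or in other select expressions at the same query level. Use the actual calculation instead, and use count(*) rather than count(num): count(expr) skips NULLs, so the NULL group would otherwise be counted as 0.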
select length(split(foo, '\\.')[1]) as num, count(*)
from tbl_bar
group by length(split(foo, '\\.')[1]);
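To sanity-check the counts the query should return, the same grouping can be sketched outside Hive. This is only an illustration in Python, assuming the column values are available as plain strings; digits_after_dot and distribution are hypothetical helper names, and the sample values are made up.

```python
from collections import Counter

def digits_after_dot(value):
    """Mirror length(split(foo, '\\.')[1]): None when there is no fractional part."""
    parts = value.split(".")
    return len(parts[1]) if len(parts) > 1 else None

def distribution(values):
    # Group by the computed expression itself and count every row,
    # including those that land in the None (NULL) group.
    return Counter(digits_after_dot(v) for v in values)

sample = ["1.25", "3.141", "2.50", "7", "0.001"]  # hypothetical values
result = distribution(sample)
```

Here result holds 2 rows with two digits, 2 rows with three digits, and 1 row in the None group, which corresponds to the NULL row in the desired table.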

Related

PostgreSQL Compare value from row to value in next row (different column)

I have a table of encounters called user_dates that is ordered by 'user' and 'start' like below. I want to create a column indicating whether an encounter was followed up by another encounter within 30 days. So basically I want to go row by row checking if "encounter_stop" is within 30 days of "encounter_start" in the following row (as long as the following row is the same user).
user | encounter_start | encounter_stop
A | 4-16-1989 | 4-20-1989
A | 4-24-1989 | 5-1-1989
A | 6-14-1993 | 6-27-1993
A | 12-24-1999 | 1-2-2000
A | 1-19-2000 | 1-24-2000
B | 2-2-2000 | 2-7-2000
B | 5-27-2001 | 6-4-2001
I want a table like this:
user | encounter_start | encounter_stop | subsequent_encounter_within_30_days
A | 4-16-1989 | 4-20-1989 | 1
A | 4-24-1989 | 5-1-1989 | 0
A | 6-14-1993 | 6-27-1993 | 0
A | 12-24-1999 | 1-2-2000 | 1
A | 1-19-2000 | 1-24-2000 | 0
B | 2-2-2000 | 2-7-2000 | 1
B | 5-27-2001 | 6-4-2001 | 0
You can write select ..., exists (select ... <criteria>), which returns a boolean (always true or false). If you really want 1 or 0, just cast the result to integer: true => 1 and false => 0.
select ts1.user_id
     , ts1.encounter_start
     , ts1.encounter_stop
     , (exists ( select null
                   from test_set ts2
                  where ts1.user_id = ts2.user_id
                    and ts2.encounter_start
                        between ts1.encounter_stop
                            and (ts1.encounter_stop + interval '30 days')::date
               )::integer
       ) subsequent_encounter_within_30_days
  from test_set ts1
 order by user_id, encounter_start;
Difference: The above (and demo) disagree with your expected result:
B | 2-2-2000 | 2-7-2000| 1
subsequent_encounter (last column) should be 0. This entry starts and ends in Feb 2000, while the other B entry starts in May 2001. Please explain how these are within 30 days (unless it is just a simple typo).
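The expected flags can be cross-checked outside the database. The following is a rough Python sketch of the same exists logic, assuming the rows are loaded as (user_id, start, stop) tuples with real date values; flag_followups is a hypothetical helper, not part of the query.

```python
from datetime import date, timedelta

def flag_followups(encounters, window_days=30):
    """encounters: list of (user_id, start, stop). Returns rows with a 0/1 flag."""
    result = []
    for user_id, start, stop in encounters:
        limit = stop + timedelta(days=window_days)
        # Same test as the exists subquery: another start for this user
        # falling between this stop and this stop + 30 days.
        has_followup = any(
            other_user == user_id and stop <= other_start <= limit
            for other_user, other_start, _ in encounters
        )
        result.append((user_id, start, stop, int(has_followup)))
    return result

rows = [
    ("A", date(1989, 4, 16), date(1989, 4, 20)),
    ("A", date(1989, 4, 24), date(1989, 5, 1)),
    ("A", date(1993, 6, 14), date(1993, 6, 27)),
    ("A", date(1999, 12, 24), date(2000, 1, 2)),
    ("A", date(2000, 1, 19), date(2000, 1, 24)),
    ("B", date(2000, 2, 2), date(2000, 2, 7)),
    ("B", date(2001, 5, 27), date(2001, 6, 4)),
]
flags = [row[3] for row in flag_followups(rows)]
```

With the sample data from the question, flags comes out as 1, 0, 0, 1, 0, 0, 0, so the first B row is 0 here as well.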
Caution: Do not use user as a column name. It is a reserved word in both Postgres and the SQL Standard. You can sometimes get away with it, or you can double-quote it, but if you double-quote it you MUST always do so. The big problem is that it has a predefined meaning (run select user;), and if you forget the double quotes it does not necessarily produce an error or exception; it is much worse: wrong results.

How to convert arrays from two different table columns to parallel rows?

I'm working with hive and I have a table of the following format (I present only one row, but it has many rows)
_______________________________
segments | rates | sessID
---------|-----------|---------
'1,2,3' | '10,20,30'| 555
Namely, two columns have a string representing arrays of the same length and the third column has some integer. I want to flatten the arrays such that first member of the first array appears in the same row with the first member of the second array, etc:
Something like:
----------------------------
segment | rate | sessId
--------|------|------------
1 | 10 | 555
2 | 20 | 555
3 | 30 | 555
I've tried the following query (for simplicity I've hardcoded the values):
SELECT explode(segments), explode (rates), sessID FROM
(SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) data ;
However, this does not produce the required result and returns an error instead:
FAILED: SemanticException 1:26 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'rates'
When I try to flatten just one column it does work:
The query:
SELECT explode(segments) FROM (
SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) data ;
the result:
1
2
3
How can I get the result I want?
I don't have access to Hive to test this, but the approach should basically work.
POSEXPLODE() can be used to get two columns, the position within an array and the item itself. Then you can use that position to look up the corresponding item from the other array...
SELECT
yourData.sessID,
segment.item AS segment,
SPLIT(yourData.rates, ',')[segment.pos] AS rate
FROM
yourData
LATERAL VIEW
POSEXPLODE(SPLIT(yourData.segments,',')) segment AS pos, item
POSEXPLODE() returns positions starting from 0, which matches Hive's zero-based array indexes, so [segment.pos] should work as written.
Please give this a try.
select sessID,tf1.val as segments, tf2.val as rates
from (SELECT Split('1,2,3', ',') as segments, Split('10,20,30', ',') as rates, 555 as sessID) t
lateral view posexplode(segments) tf1
lateral view posexplode(rates) tf2
where tf1.pos = tf2.pos;
+---------+-----------+--------+--+
| sessid | segments | rates |
+---------+-----------+--------+--+
| 555 | 1 | 10 |
| 555 | 2 | 20 |
| 555 | 3 | 30 |
+---------+-----------+--------+--+
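The position-matching idea behind both answers can be sketched in a few lines. This is only an illustration in Python, assuming both strings split into the same number of items; explode_parallel is a hypothetical name, and enumerate() stands in for posexplode by yielding zero-based (pos, val) pairs.

```python
def explode_parallel(segments, rates, sess_id):
    """Pair items of two comma-separated strings by position."""
    seg_items = list(enumerate(segments.split(",")))   # (pos, val) pairs
    rate_items = list(enumerate(rates.split(",")))     # (pos, val) pairs
    return [
        (sess_id, seg, rate)
        for seg_pos, seg in seg_items
        for rate_pos, rate in rate_items
        if seg_pos == rate_pos   # same role as: where tf1.pos = tf2.pos
    ]

rows = explode_parallel("1,2,3", "10,20,30", 555)
```

Without the position filter the two exploded lists would form a cross product (9 rows here); the filter keeps only the 3 rows whose positions agree.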

Filter json values regardless of keys in PostgreSQL

I have a table called diary which includes columns listed below:
| id | user_id | custom_foods |
|----|---------|--------------------|
| 1 | 1 | {"56": 2, "42": 0} |
| 2 | 1 | {"19861": 1} |
| 3 | 2 | {} |
| 4 | 3 | {"331": 0} |
I would like to count, for each user, how many diaries have custom_foods value(s) larger than 0. I don't care about the keys, since a key can be any number stored as a string.
The desired output is:
| user_id | count |
|---------|---------|
| 1 | 2 |
| 2 | 0 |
| 3 | 0 |
I started with:
select *
from diary as d
join json_each_text(d.custom_foods) as e
on d.custom_foods != '{}'
where e.value > 0
I don't even know whether the syntax is correct. Now I am getting the error:
ERROR: function json_each_text(text) does not exist
LINE 3: join json_each_text(d.custom_foods) as e
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
The version I am using is: psql (10.5 (Ubuntu 10.5-1.pgdg14.04+1), server 9.4.19). According to the PostgreSQL 9.4.19 documentation, that function should exist. I am confused and don't know how to proceed.
Threads that I referred to:
Postgres and jsonb - search value at any key
Query postgres jsonb by value regardless of keys
Your custom_foods column is defined as text, so you should cast it to json before applying json_each_text. Because json_each_text returns no rows for an empty JSON object, diaries containing {} drop out of the join; you can get the count of 0 for them from a separate CTE and combine the two results with UNION ALL.
WITH empty AS
( SELECT DISTINCT user_id,
0 AS COUNT
FROM diary
WHERE custom_foods = '{}' )
SELECT user_id,
count(CASE
WHEN VALUE::int > 0 THEN 1
END)
FROM diary d,
json_each_text(d.custom_foods::JSON)
GROUP BY user_id
UNION ALL
SELECT *
FROM empty
ORDER BY user_id;
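As a cross-check of what the query counts, here is a rough Python sketch, assuming the rows are available as (id, user_id, custom_foods) tuples with custom_foods as JSON text; count_positive_values is a hypothetical helper. Like the query, it counts positive values rather than diaries, so the two only agree when each diary has at most one positive value, as in the sample data.

```python
import json
from collections import defaultdict

def count_positive_values(diary_rows):
    """diary_rows: iterable of (id, user_id, custom_foods_text)."""
    counts = defaultdict(int)
    for _id, user_id, custom_foods in diary_rows:
        counts[user_id] += 0  # make sure users with only {} still appear
        for value in json.loads(custom_foods).values():
            if int(value) > 0:
                counts[user_id] += 1
    return dict(counts)

diary = [
    (1, 1, '{"56": 2, "42": 0}'),
    (2, 1, '{"19861": 1}'),
    (3, 2, '{}'),
    (4, 3, '{"331": 0}'),
]
result = count_positive_values(diary)
```

For the sample table this gives 2 for user 1 and 0 for users 2 and 3, matching the desired output.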

How to use Regex in SQL for extracting values after repetitive numbers

I have the following table (table1):
+---+---------------------------------------------+
+---|--------att1 --------------------------------+
| 1 | 10.2.5.4 4.3.2.1.in-addr.arpa |
| 2 | asd 100.99.98.97 97.3.2.1.a.b.c fsdf |
| 3 | fd 95.94.93.92 92.5.7.1.a.b.c |
| 4 | a 11.4.99.75 75.77.52.41.in-addr.arpa |
+---+---------------------------------------------+
I would like to get the following values (that are located after the repetitive numbers): in-addr.arpa, a.b.c, a.b.c, in-addr.arpa.
I tried to use the following format with no success:
SELECT att1
FROM table1
WHERE REGEXP_LIKE(att1 , '^(\d+?)\1$')
I would like it to run in Impala and Oracle.
Use REGEXP_SUBSTR (assuming you are using an Oracle DB).
select regexp_substr(att1,'[0-9]\.([^0-9 ]+)',1,1,null,1)
from table1
[0-9]\. a digit followed by a .
[^0-9 ]+ any run of characters other than digits and spaces, so the match stops at the next digit or space (without the space in the bracket, row 2 would return 'a.b.c fsdf'). The () around this marks the group (the first one in this case), and the final argument 1 extracts only that part of the string.
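The pattern can be tried against the sample rows before running it in the database. This is a sketch in Python's re module, not Oracle or Impala syntax, and it uses a variant of the pattern that also stops at whitespace (an assumption on my part, so that trailing words such as fsdf are not captured); extract_suffix is a hypothetical helper.

```python
import re

# A digit, a literal dot, then a captured run of characters
# that are neither digits nor whitespace.
PATTERN = re.compile(r"[0-9]\.([^0-9\s]+)")

def extract_suffix(text):
    """Return the first captured group, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

samples = [
    "10.2.5.4 4.3.2.1.in-addr.arpa",
    "asd 100.99.98.97 97.3.2.1.a.b.c fsdf",
    "fd 95.94.93.92 92.5.7.1.a.b.c",
    "a 11.4.99.75 75.77.52.41.in-addr.arpa",
]
extracted = [extract_suffix(s) for s in samples]
```

For the four sample rows, extracted is in-addr.arpa, a.b.c, a.b.c, in-addr.arpa, which is the list asked for in the question.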

Can't select an existing column in PostgreSQL

I'm new to SQL and I'm trying to select column Foto_municipis:
askdbase4=# select * from avatar_avatarx;
id | llista_municipis | Foto_municipis | primary | date_uploaded
----+------------------+-----------------+---------+------------------------
1 | Tore | tore.jpg | t | 2014-06-05 01:19:40+02
2 | Calldetenes | calldetenes.jpg | f | 2014-06-05 23:24:18+02
3 | Rupit i Pruit | baixa.jpeg | f | 2014-06-16 03:09:48+02
4 | Olost | olost.jpg | f | 2014-06-16 23:20:05+02
(4 rows)
For some reason I can select llista_municipis successfully:
SELECT llista_municipis FROM avatar_avatarx;
but when I try to select Foto_municipis this is what I get:
askdbase4=# select Foto_municipis from avatar_avatarx;
ERROR: column "Foto_municipis" does not exist
LINE 1: select Foto_municipis from avatar_avatarx;
What am I doing wrong?
You probably created the column with a double-quoted identifier and this will work:
select "Foto_municipis"
from avatar_avatarx
That is almost always a bad idea as it will be forever necessary to reference it using double-quotes, unless it is an all lower case identifier in which case it can be referenced in lower case without double quotes.
If the column is created with an identifier without double quotes then it is possible to reference it in any case style like Foto_municipis or foto_Municipis regardless of the original identifier case style.
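The folding rule described above can be modelled in a few lines. This is a simplified Python sketch, assuming an identifier is either fully wrapped in double quotes or not quoted at all (escaped quotes inside the name are ignored); resolve_identifier is a hypothetical helper, not something PostgreSQL provides.

```python
def resolve_identifier(written):
    """Return the name the server would look up for an identifier as typed."""
    if len(written) >= 2 and written[0] == '"' and written[-1] == '"':
        return written[1:-1]      # quoted: case is preserved exactly
    return written.lower()        # unquoted: folded to lower case

stored_name = "Foto_municipis"    # column created with a quoted, mixed-case name

unquoted_hit = resolve_identifier("Foto_municipis") == stored_name
quoted_hit = resolve_identifier('"Foto_municipis"') == stored_name
```

Here unquoted_hit is False, because the unquoted name is folded to foto_municipis, while quoted_hit is True, which is why only the double-quoted form finds the column.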