Select specific value from Hive array - sql

I have a table in Hive structured with 3 columns as follows:
timestamp UserID OtherId
2016-09-01 123 "101","222","321","987","393.1","090","467","863"
2016-09-01 124 "188","389","673","972","193","100","143","210"
2016-09-01 125 "888","120","482","594","393.2"
2016-09-01 126 "441","501","322","671","008","899"
2016-09-01 127 "004","700","393.4","761","467","356","643","578"
2016-09-01 128 "322","582","348"
2016-09-01 129 "029","393.8","126","187"
Where OtherId is an array.
I need to parse OtherId so that the resulting dataset is the following, since I am only interested in values matching '393%':
timestamp UserID OtherId
2016-09-01 123 393.1
2016-09-01 125 393.2
2016-09-01 127 393.4
2016-09-01 129 393.8
I have researched a ton of parse functions, but it seems they are all intended to return the position of the value, or they require you to specify the position of the value to return it. Neither option works here, because '393%' can occur at any point in the array for any given row.
There's also the fact that I need to incorporate the wildcard to allow for variations of my desired value.
Another option is explode, but my table is simply too large for that.
I'm thinking a UDF might be the only way to go but would welcome some guidance there.
Grateful for any assistance.

It's easy to do what you need using the LATERAL VIEW option available in Hive.
0: jdbc:hive2://quickstart:10000/default> select * from test_5;
+-----------+-----------+---------------------------------------------+
| test_5.t  | test_5.id | test_5.oid                                  |
+-----------+-----------+---------------------------------------------+
| 123       | 123       | "222","321","987","393.1","090","467","863" |
+-----------+-----------+---------------------------------------------+
And this is the trick:
SELECT id, ooid
FROM test_5
LATERAL VIEW EXPLODE(SPLIT(oid,",")) temp AS ooid;
+------+---------+
| id   | ooid    |
+------+---------+
| 123  | "222"   |
| 123  | "321"   |
| 123  | "987"   |
| 123  | "393.1" |
| 123  | "090"   |
| 123  | "467"   |
| 123  | "863"   |
+------+---------+
Ergo:
SELECT id, regexp_replace(ooid,'"','') AS ooid
FROM test_5
LATERAL VIEW EXPLODE(SPLIT(oid,",")) temp AS ooid
WHERE ooid LIKE '"393%';
+------+---------+
| id   | ooid    |
+------+---------+
| 123  | 393.1   |
+------+---------+

Maybe you can try as below:
hive> select timestamp1, userid, otherids from userdet1 LATERAL VIEW explode(otherid) testTable as otherids where otherids LIKE concat('393','%');
OK
2016-09-01 123 393.1
2016-09-01 125 393.2
2016-09-01 127 393.4
2016-09-01 129 393.8
Time taken: 0.297 seconds, Fetched: 4 row(s)
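If even a one-off explode pass is too expensive, the explode can be avoided entirely by flattening the array to a string and pulling the match out with a regular expression. A minimal sketch, assuming OtherId is an array<string>, that at most one '393'-prefixed value occurs per row, and a hypothetical table name my_table:
SELECT `timestamp`, UserID,
       -- grab the first element that starts with 393 (Java regex, as Hive uses)
       regexp_extract(concat_ws(',', OtherId), '(?:^|,)(393[^,]*)', 1) AS OtherId
FROM my_table
WHERE concat_ws(',', OtherId) RLIKE '(^|,)393';
This still scans every row, but it never multiplies the row count the way LATERAL VIEW EXPLODE does.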

Related

SQL or Pandas: Join/Pivot information from two tables

I have three relational Postgres tables (TimescaleDB hypertables) and need to get my data into a CSV file, but I am struggling to get it into the format I want. I am using Django as the framework, but I need to solve this with raw SQL.
Imagine I have 2 tables: drinks and consumption_data.
The drinks table looks like this:
name   | fieldx | fieldy
-------+--------+-------
test-0 |        |
test-1 |        |
test-2 |        |
The consumption_data table looks like this:
time                   | drink_id | consumption
-----------------------+----------+------------
2018-12-15 00:00:00+00 |        2 |         123
2018-12-15 00:01:00+00 |        2 |         122
2018-12-15 00:02:00+00 |        2 |         125
My target table should join these two tables and give me all consumption data with the drink names back.
time                   | test-0 | test-1 | test-2
-----------------------+--------+--------+-------
2018-12-15 00:00:00+00 |    123 |    123 |     22
2018-12-15 00:01:00+00 |    334 |    122 |     32
2018-12-15 00:02:00+00 |    204 |    125 |     24
I do have all the drink IDs and all the names, but there are hundreds or thousands of them.
I tried this by first querying the consumption data for a single drink and renaming the column (the alias has to be double-quoted because of the hyphen): SELECT time, drink_id, consumption AS "test-0" FROM heatflowweb_timeseriestestperformance WHERE drink_id = 1;
Result:
time                   | test-0 |
-----------------------+--------+
2018-12-15 00:00:00+00 |    123 |
2018-12-15 00:01:00+00 |    334 |
2018-12-15 00:02:00+00 |    204 |
But now, I would have to add hundreds of columns and I am not sure how to do this. With UNION? But I don't want to write thousands of union statements...
Maybe there is an easier way to achieve what I want? I am not an SQL expert, so what I need could be super easy to achieve or also impossible... Thanks in advance for any help, really appreciated.
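One way to avoid hand-writing hundreds of UNIONs is conditional aggregation: join the two tables once, then turn each drink name into its own column. A minimal sketch, assuming drinks has an id primary key that consumption_data.drink_id references (that column isn't shown in the question, so treat it as an assumption):
SELECT c.time,
       max(c.consumption) FILTER (WHERE d.name = 'test-0') AS "test-0",
       max(c.consumption) FILTER (WHERE d.name = 'test-1') AS "test-1",
       max(c.consumption) FILTER (WHERE d.name = 'test-2') AS "test-2"
FROM consumption_data c
JOIN drinks d ON d.id = c.drink_id  -- assumed key; adjust to your schema
GROUP BY c.time
ORDER BY c.time;
For hundreds of drinks you would generate the FILTER lines from the drinks table (e.g. with format() and string_agg()) rather than typing them, or look at the tablefunc extension's crosstab() function.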

Select data from multiple existing tables dynamically

I have tables "T1" in the database that are broken down by month of the form (table_082020, table_092020, table_102020). Each contains several million records.
+----+----------+-------+
| id | date     | value |
+----+----------+-------+
|  1 | 20200816 | abc   |
|  2 | 20200817 | xyz   |
+----+----------+-------+

+----+----------+-------+
| id | date     | value |
+----+----------+-------+
|  1 | 20200901 | cba   |
|  2 | 20200901 | zyx   |
+----+----------+-------+
There is a second table, "T2", that stores a reference to the primary key of the first table, and to the table itself via the period column, which is the table name without the "table_" prefix.
+------------+--------+--------+--------+--------+
| rec_number | period | field1 | field2 | field3 |
+------------+--------+--------+--------+--------+
|        777 | 092020 | aaa    | bbb    | ccc    |
|        987 | 102020 | eee    | fff    | ggg    |
|     123456 | 082020 | xxx    | yyy    | zzz    |
+------------+--------+--------+--------+--------+
There is also a third table, "T3", which maps each period to its table name.
+--------+--------------+
| period | table_name   |
+--------+--------------+
| 082020 | table_082020 |
| 092020 | table_092020 |
| 102020 | table_102020 |
+--------+--------------+
How can I combine the 3 tables to get data dynamically for several periods? For example, from 15082020 to 04092020, where the data will be located in different tables.
There really is no good reason for storing data in this format. It makes querying a nightmare.
If you cannot change the data format, then add a view each month that combines the data:
create view t as
select '202010' as YYYYMM, t.*
from table_102020 t
union all
select '202009' as YYYYMM, t.*
from table_092020 t
union all
. . .;
For a once-a-month effort, you can spend 10 minutes writing the code, prompted by a calendar reminder. Or, better yet, set up a job that uses dynamic SQL to generate the code, and run it each month after the underlying tables are created.
What should you be doing? Well, 5 million rows a month isn't actually that much data. But if you are concerned about it, you can use table partitioning to store the data by month. This can be a little tricky; for instance, the primary key needs to include the partitioning key.
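As a concrete illustration of that dynamic-SQL job, here is a minimal sketch in PostgreSQL syntax; the question never names the database, so the DO/format() mechanics are an assumption to adapt to your engine. It rebuilds the combined view directly from T3:
DO $$
DECLARE
  body text;
BEGIN
  -- builds "SELECT '082020' AS period, t.* FROM table_082020 t UNION ALL ..."
  SELECT string_agg(format('SELECT %L AS period, t.* FROM %I t', period, table_name),
                    ' UNION ALL ')
    INTO body
    FROM t3;
  EXECUTE 'DROP VIEW IF EXISTS t';
  EXECUTE 'CREATE VIEW t AS ' || body;
END $$;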

Merging some columns from two postgres tables into a new table based on row value

Hello PostgreSQL experts (and maybe this is also a task for Perl's DBI, since I also happen to be working with it, but...). I may have misused some terminology here, so bear with me.
I have a set of 32 tables, each one exactly like the others. The first column of every table always contains a date, while the second column contains values (integers) that can change once every 24 hours; some samples get back-dated. In many cases, a table may never contain data for a particular date. So here's an example of two such tables:
date_list  | sum          date_list  | sum
-----------+-----         -----------+-----
2020-03-12 |   4          2020-03-09 |   1
2020-03-14 |   5          2020-03-11 |   3
                          2020-03-12 |   5
                          2020-03-13 |   9
                          2020-03-14 |  12
The idea is to merge the separate tables into one, sort of like a grid, but with the samples placed in the correct row in its own column and ensuring that the date column (always the first column) is not missing any dates, looking like this:
date_list  | sum1 | sum2 | sum3 .... | sum32
-----------+------+------+-----------+------
2020-03-08 |      |      |
2020-03-09 |      |    1 |
2020-03-10 |      |      |    5
2020-03-11 |      |    3 |   25
2020-03-12 |    4 |    5 |   35
2020-03-13 |      |    9 |   37
2020-03-14 |    5 |   12 |   40
And so on, so 33 columns by 2020-01-01 to date.
Now, I have tried doing a FULL OUTER JOIN and it succeeds. It's the subsequent attempts that get me into trouble, creating a long, cascading table with the values in the wrong place or accidentally clobbering data. I know this works if I use a table of one column with a date sequence and join the first data table, just as a test of my theory using baby steps:
SELECT date_table.date_list, sums_1.sum FROM date_table FULL OUTER JOIN sums_1 ON date_table.date_list = sums_1.date_list
2020-03-07 | 1
2020-03-08 |
2020-03-09 |
2020-03-10 | 2
2020-03-11 |
2020-03-12 | 4
Encouraged, I thought I'd get a little more ambitious with my testing, this time trying USING as an alternative, but that places some rows out of sequence at the bottom of the table and I'm not sure whether I'm losing data or not:
SELECT * FROM sums_1 FULL OUTER JOIN sums_2 USING (date_list);
Result:
fecha_sintomas | sum   | sum
---------------+-------+-------
2020-03-09     |       |     1
2020-03-11     |       |     3
2020-03-12     |     4 |     5
2020-03-13     |       |     9
2020-03-14     |     5 |    12
2020-03-15     |     6 |    15
2020-03-16     |     8 |    20
:              :       :
2020-10-29     | 10053 | 22403
2020-10-30     | 10066 | 22407
2020-10-31     | 10074 | 22416
2020-11-01     | 10076 | 22432
2020-11-02     | 10077 | 22434
2020-03-07     |     1 |
2020-03-10     |     2 |
(240 rows)
I think I'm getting close. In any case, how do I get to what I want, which is my grid of data described above? Maybe this is an iterative process that could benefit from using DBI?
Thanks,
You can full join like so:
select date_list, s1.sum as sum1, s2.sum as sum2, s3.sum as sum3
from sums_1 s1
full join sums_2 s2 using (date_list)
full join sums_3 s3 using (date_list)
order by date_list;
The USING syntax makes the unqualified column date_list unambiguous in the SELECT and ORDER BY clauses. Then we just need to enumerate the sum columns, providing an alias for each of them.
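Note that a full join only produces dates that occur in at least one of the tables. To also guarantee that no calendar date is missing, you can left-join everything onto a generated date spine instead. A minimal sketch using PostgreSQL's generate_series (extend with the remaining tables in the same way):
SELECT g.ts::date AS date_list,
       s1.sum AS sum1,
       s2.sum AS sum2
FROM generate_series(date '2020-01-01', current_date, interval '1 day') AS g(ts)
LEFT JOIN sums_1 s1 ON s1.date_list = g.ts::date
LEFT JOIN sums_2 s2 ON s2.date_list = g.ts::date
ORDER BY date_list;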

How can I find the users that queried a view in Redshift?

Hello everyone, and thank you in advance!
I'm having trouble finding a query to get the list of users that have queried some specific views.
An example to clarify: say I have a couple of views,
user_activity_last_6_months &
user_compliance_last_month
I need to know who is querying those 2 views and, if possible, other statistics. This could be a desired output:
+--------+-----------------------------+----------+----------------------------+----------------------------+----------------+-------------------+----------------------+------------------+
| userid | view_name | queryid | starttime | endtime | query_cpu_time | query_blocks_read | query_execution_time | return_row_count |
+--------+-----------------------------+----------+----------------------------+----------------------------+----------------+-------------------+----------------------+------------------+
| 293 | user_activity_last_6_months | 88723456 | 2018-05-08 13:08:08.727686 | 2018-05-08 13:08:12.423532 | 4 | 1023 | 6 | 435 |
| 345 | user_compliance_last_month | 99347882 | 2018-05-10 00:00:03.049967 | 2018-05-10 00:00:09.177362 | 6 | 345 | 8 | 214 |
| 345 | user_activity_last_6_months | 99347883 | 2018-05-10 12:27:36.637483 | 2018-05-10 12:27:44.502705 | 8 | 14 | 9 | 13 |
| 293 | user_compliance_last_month | 99347884 | 2018-05-10 12:31:00.433556 | 2018-05-10 12:31:30.090183 | 30 | 67 | 35 | 7654 |
+--------+-----------------------------+----------+----------------------------+----------------------------+----------------+-------------------+----------------------+------------------+
I have developed a query that gets this info for tables in the database, using the system tables and views, but I can't find any clue to getting the same results for views.
As I've said, the first 3 columns are mandatory and the others would be nice to have. Plus, any further information is welcome!!
Thank you all!!
If you need that level of auditing for table and view access, then I recommend you start by enabling Database Audit Logging for your Redshift cluster. This will generate a number of log files in S3.
The "User Activity Log" contains the text of all queries run on the cluster; it can then either be loaded back into Redshift or added as a Spectrum table so that the query text can be parsed for table and view names.

SAP Business Objects Cross Table Data Duplication

I'm using Business Objects to construct a simple report on whether a unit is on or off for a given day. When constructing a vertical table, the data is correct and looks like this:
Unit ID | Status | Date
1       | On     | 2016-09-10
1       | On     | 2016-09-11
1       | Off    | 2016-09-12
2       | Off    | 2016-09-10
2       | Off    | 2016-09-11
2       | On     | 2016-09-12
However, the cross table I've created, with columns of "date" and rows of "Unit ID", is duplicating Unit ID, producing an entire row of 'On' followed by an entire row of 'Off', like:
____| 2016-09-10 | 2016-09-11 | 2016-09-12
1   | On         | On         | On
1   | Off        | Off        | Off
2   | On         | On         | On
2   | Off        | Off        | Off
instead of what it should be as:
____| 2016-09-10 | 2016-09-11 | 2016-09-12
1   | On         | On         | Off
2   | Off        | Off        | On
Any suggestions as to why it's doing this? The table isn't particularly useful if it has these duplicate rows and I can't understand why it's resulting in this odd table.
It turns out the "Status" field was a dimension type, but the cross table requires the data field to be a measure type. Simply making a new variable that was a measure equal to "Status" solved the issue.