Hive "argmin" failing without output - hive

I am currently trying to run a query on a table that looks like this:
Key  Desc1  Desc2  Val
1    Hello  World  37
2    Alpha  Beta   27
2    Gamma  Kappa  28
1    Bjr    Mde    42
My goal is to group by "Key" and, within each group, get the row where Val = min(Val). For the dummy table above, I expect something like:
Key  Desc1  Desc2  Val
1    Hello  World  37
2    Alpha  Beta   27
To do so, I am using the following query:
select Key,
  min(struct(Val, Desc1)).col2 as Desc1,
  min(struct(Val, Desc2)).col2 as Desc2,
  min(Val) as Val
from mytable
group by Key;
When I try to execute the query, there is no error during the syntax check, but Hive just hangs without creating any jobs and then fails with the following error:
FAILED: SemanticException org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
I have no idea what the exact reason for the failure is. Do you see an obvious mistake in my query?
(If so, it should have failed during the syntax check; note that it fails "normally" if I introduce a mistake into the query, e.g. "col2" -> "col3".)
Do you know if Hive can be forced to display more information about the error? I have not seen a "verbose" mode, but I may have missed it...
Thanks a lot for your help.

This is fairly straightforward using Hive windowing functions. Just take the min() over the window and then select the rows where arg_min and Val are equal.
Query:
select Key, Desc1, Desc2, arg_min
from (
  select *,
         min(Val) over (partition by Key) as arg_min
  from db.tbl
) x
where Val = arg_min
Output:
Key  Desc1  Desc2  arg_min
1    Hello  World  37
2    Alpha  Beta   27

Actually, the query proposed in my question is correct and works (in Hive 0.10 at least).
The problem was purely a timeout problem, which can be solved by setting the configuration as follows:
set hive.metastore.client.socket.timeout=300;
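For completeness, here is the whole session as a sketch: the timeout raise followed by the aggregate from the question (this simply reuses the mytable query above; the value is in seconds on Hive of that era, if I recall the config correctly, and 300 is just the value that worked here):

set hive.metastore.client.socket.timeout=300;

-- argmin-style aggregate from the question
select Key,
  min(struct(Val, Desc1)).col2 as Desc1,
  min(struct(Val, Desc2)).col2 as Desc2,
  min(Val) as Val
from mytable
group by Key;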

Related

How to find values that can't be cast as date in Redshift?

I have a big set of data. I try to parse one column to extract a substring that is a date and cast it as such - date(<substring here>). I'm getting the error ERROR: Error converting text to date, but I don't know what the actual issue is. How do I find the values that are causing the problem? I want something like try_cast, but that doesn't exist in Redshift. I'm not sure I can use a regex, since I don't know the format of what I'm looking for.
Your question is broad and not very specific about the task you're trying to solve, so it's hard to give a definitive answer.
You can use pattern matching.
Say you have the following data in TableA:
id  dt
1   2020-08-20
2   2021-08-20
3   2021-08-21
4   2021-08-2000
5   asdfghjkl
6   08-01-2021
7   06/07/2021
With pattern matching you can find all the rows containing correctly formatted dates:
select id from TableA
where dt similar to '\\d{4}-\\d{2}-\\d{2}'
or dt similar to '\\d{2}-\\d{2}-\\d{4}'
or dt similar to '\\d{2}/\\d{2}/\\d{4}'
All you need to do now is invert this query to find the opposite:
select id, dt from TableA
where id not in (
  select id from TableA
  where dt similar to '\\d{4}-\\d{2}-\\d{2}'
     or dt similar to '\\d{2}-\\d{2}-\\d{4}'
     or dt similar to '\\d{2}/\\d{2}/\\d{4}'
)
This will give you the result:
id  dt
5   asdfghjkl
4   2021-08-2000
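As a side note (my own rewrite, not part of the answer above): the same rows can be found in a single scan by negating the combined predicate, which avoids the NOT IN subquery and its extra pass over the table:

select id, dt
from TableA
where not (dt similar to '\\d{4}-\\d{2}-\\d{2}'
        or dt similar to '\\d{2}-\\d{2}-\\d{4}'
        or dt similar to '\\d{2}/\\d{2}/\\d{4}')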
If this does not work, you can try sorting by the date column and inspecting the head and tail - bad values usually live there.
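For example, a quick sketch of that head/tail check (assuming the same TableA):

-- garbage tends to sort to one extreme of the column
select dt from TableA order by dt asc limit 20;
select dt from TableA order by dt desc limit 20;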
Use processing outside Redshift if possible. It is usually good practice to take care of data cleaning before putting data into the database. I believe a simple Python (or any other language) script will do the job.

How to use a group by but still access every raw value cleverly

I need to "group by" my data to distinguish between tests (each test has a specific id, name and temperature) and to calculate their count, standard deviation, etc. But I also need access to every raw data value from each group, for further index calculations that I do in a Python script.
I have found two solutions to this problem, but both seem non-optimal/flawed:
1) Using listagg to store every raw value that was grouped into a single string row. It does the job but it is not optimized: I concatenate multiple float values into a giant string that I will immediately de-concatenate and convert back to floats. That seems unnecessary and costly.
2) Removing the group by entirely and doing the count and standard deviation through partitioning. But that seems even worse to me. I don't know whether PL/SQL/Oracle optimizes this; it could be calculating the same count and standard deviation for every line (I don't know how to check this). The query result also becomes messy: since there is no 'group by' anymore, I have to add multiple checks in my Python file in order to differentiate the data of each test (specific id, name and temperature).
I think my first solution can be improved, but I don't know how. How can I use a group by but still access every raw value cleverly?
A function similar to listagg but with a collection/array output type instead of a string output type could maybe do the trick (a sort of 'array_agg' compatible with Oracle), but I don't know of any.
EDIT:
The sample data is complex and probably restricted to company-internal viewing, but I can show you my simplified query for option 1):
SELECT
    rav.rav_testid as test_id,
    tte.tte_testname as test_name,
    tsc.tsc_temperature as temperature,
    listagg(rav.rav_value, ' ') WITHIN GROUP (ORDER BY rav.rav_value) as all_specific_test_values,
    COUNT(rav.rav_value) as n,
    STDDEV(rav.rav_value) as sigma
FROM
    ...
    (8 inner joins)
GROUP BY
    rav.rav_testid, tte.tte_testname, tsc.tsc_temperature
ORDER BY
    rav.rav_testid, tte.tte_testname, tsc.tsc_temperature
The result looks like :
test_id | test_name | temperature | all_specific_test_values | n | sigma
-------------------------------------------------------------------------
6001 |VADC_A(...) | -40 | ,8094034194946289 ,8(...)| 58 | 0,54
6001 |VADC_A(...) | 25 | ,5054857852946545 ,6(...)| 56 | 0,24
6001 |VADC_A(...) | 150 | ,8625754277452524 ,4(...)| 56 | 0,26
6002 |VADC_B(...) | -40 | ,9874514651548454 ,5(...)| 57 | 0,44
I think you want analytic functions:
select t.*,
count(*) over (partition by test) as cnt,
avg(value) over (partition by test) as avg_value,
stddev(value) over (partition by test) as stddev_value
from t;
This adds the additional columns to each row.
I would suggest going with @Gordon_Linoff's solution; that is likely the most standard one.
If you want to go with a less standard solution, you can have a group by that returns a collection as one of the columns. Presumably your script could iterate through that collection, though it might take a bit of work in the script to do so.
create type num_tbl as table of number;
/
create table foo (
grp integer,
val number
);
insert into foo values( 1, 1.1 );
insert into foo values( 2, 1.2 );
insert into foo values( 1, 1.3 );
insert into foo values( 2, 1.4 );
select grp, avg(val), cast( collect( val ) as num_tbl )
from foo
group by grp;
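If the consuming script finds the nested-table column awkward to read back, one option (standard Oracle collection unnesting, my own addition rather than part of the answer above) is to flatten it again with TABLE(); each collection element comes back as the COLUMN_VALUE pseudo-column:

select f.grp, t.column_value as val
from (
  select grp, cast( collect( val ) as num_tbl ) as vals
  from foo
  group by grp
) f, table(f.vals) t;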

SQL in KDB or am I crazy?

I am trying to see if I can use KDB for some of my current work. I have a fair bit of code in legacy SQL and the prospect of reuse seems exciting.
Which is when I came across: http://code.kx.com/q/interfaces/q-client-for-odbc/
This link only speaks of SQL select - is it OK to use this for insert and delete as well? What about performance?
Based on your question, I'm not sure this will do what you are hoping for. You seem to want to reuse SQL code on a non-SQL database.
This driver does not run SQL against the current database, it allows you to connect to an external database, and pull back data using the SQL capability of that other database. (ODBC is a standardised driver system for connecting to various kinds of databases, sending queries, and returning data).
This would only be useful if you intended to leave two different databases running side-by-side, and needed them to interact at the database level (rather than, as @millimoose mentions above, connecting to them individually from your application).
It is seldom used, but there is a way to use ANSI SQL with KDB. Just prefix the query with s)
q)t:([]col1:1 1 2 2;col2:10 10 20 20;col3:5.0 2.0 2.3 2.4;grp:`a`a`b`c)
q)t
col1 col2 col3 grp
------------------
1    10   5    a
1    10   2    a
2    20   2.3  b
2    20   2.4  c
q) /standard select
q)select from t
col1 col2 col3 grp
------------------
1    10   5    a
1    10   2    a
2    20   2.3  b
2    20   2.4  c
q)/SQL type select with select *
q)select * from t
'rank
q) /Prefix the query with s)
q)s)select * from t
col1 col2 col3 grp
------------------
1    10   5    a
1    10   2    a
2    20   2.3  b
2    20   2.4  c
Now - this feature is rarely used, the parser is not optimized for this type of usage, and resources on it are scarce. You'd probably spend more time debugging issues with it than you would just converting your code to Q. Hope this helps.
Another option is to use the qodbc server -- http://code.kx.com/q/interfaces/q-server-for-odbc/

Aggregation over order-dependent partition?

I have a source data set like this (simplified for clarity):
Key F1 F2
1 X 4
2 X 5
3 Y 6
4 X 9
5 X 7
6 X 8
7 Y 9
8 X 6
9 X 5
10 Y 3
The data is sorted by the Key field. Now, I want to compute an aggregate of the F2 field over partitions that are defined by the F1 field: A partition starts at the first X value and ends with the first subsequent Y value.
So, for example, I might want to compute the MIN() over the partitions defined as described above. The result set would then look like this:
rownum MIN(F2)
1 4
2 7
3 3
I have tried a number of resources (incl. our own intranet community and of course Stack Overflow) but found nothing for my case. Usually, partitioning only works with a field that can be used to identify the partitions. Here, the partitions are defined by a change in a field's content with respect to a given order.
Although I am aware that I may have to resort to writing a procedural solution I would prefer to solve this in pure SQL.
Any ideas how such a partitioning could be achieved with a SQL select statement?
Thanks and regards
Kai.
A slightly shorter solution: http://sqlfiddle.com/#!12/7390d/24
Query:
select min(f2)
from t t1
group by (select max(key)
          from t t2
          where t2.f1 = 'Y'
            and t1.key > t2.key)
Result:
| MIN |
-------
| 4 |
| 7 |
| 3 |
The idea is to find the key of the preceding 'Y' for each row and group by it. It should work with any SQL engine.
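To see the grouping key that the correlated subquery computes for each row (my own illustration against the same fiddle table t), you can select it explicitly; rows at or before the first 'Y' get a NULL key and therefore form the first group:

select t1.key, t1.f1, t1.f2,
       (select max(t2.key)
        from t t2
        where t2.f1 = 'Y' and t1.key > t2.key) as preceding_y_key
from t t1
order by t1.key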
You didn't specify an engine, dialect or version, so I assumed SQL Server 2012.
Example that you can run to see the solution: http://sqlfiddle.com/#!6/f5d38/21
You solve it by creating the correct partitions in your set. The code looks like this:
WITH groupLimits as
(
SELECT
[Key] AS groupend
,COALESCE(LAG([Key]) OVER (order by [Key]),0)+1 AS groupstart
FROM sourceData
WHERE F1 = 'Y'
)
SELECT
MIN(sourceData.F2)
FROM groupLimits
INNER JOIN sourceData
ON sourceData.[Key] BETWEEN groupLimits.groupstart and groupLimits.groupend
GROUP BY groupLimits.groupstart
ORDER BY groupLimits.groupstart

how does a SQL query work?

How does a SQL query work?
How does it get compiled?
Is the from clause compiled first to see if the table exists?
How does it actually retrieve data from the database?
How and in what format are the tables stored in a database?
I am using phpMyAdmin - is there any way I can peek into the files where the data is stored?
I am using MySQL.
SQL logical execution order:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT
A SQL query mainly works in three phases:
1) Row filtering - phase 1: done by the FROM, WHERE, GROUP BY and HAVING clauses.
2) Column filtering: columns are filtered by the SELECT clause.
3) Row filtering - phase 2: done by the DISTINCT, ORDER BY and LIMIT clauses.
Here I will explain with an example. Suppose we have a students table as follows:
id_  name_     marks  section_
1    Julia     88     A
2    Samantha  68     B
3    Maria     10     C
4    Scarlet   78     A
5    Ashley    63     B
6    Abir      95     D
7    Jane      81     A
8    Jahid     25     C
9    Sohel     90     D
10   Rahim     80     A
11   Karim     81     B
12   Abdullah  92     D
Now we run the following SQL query:
select section_,sum(marks) from students where id_<10 GROUP BY section_ having sum(marks)>100 order by section_ LIMIT 2;
Output of the query is:
section_  sum
A         247
B         131
But how did we get this output?
I will explain the query step by step. Please read below:
1. FROM, WHERE clause execution
Since the FROM and WHERE clauses are processed first, from students where id_<10 eliminates the rows whose id_ is greater than or equal to 10. The following rows remain:
id_  name_     marks  section_
1    Julia     88     A
2    Samantha  68     B
3    Maria     10     C
4    Scarlet   78     A
5    Ashley    63     B
6    Abir      95     D
7    Jane      81     A
8    Jahid     25     C
9    Sohel     90     D
2. GROUP BY clause execution
Next comes the GROUP BY clause; after executing GROUP BY section_, the rows are grouped as below:
id_  name_     marks  section_
9    Sohel     90     D
6    Abir      95     D
1    Julia     88     A
4    Scarlet   78     A
7    Jane      81     A
2    Samantha  68     B
5    Ashley    63     B
3    Maria     10     C
8    Jahid     25     C
3. HAVING clause execution
having sum(marks)>100 eliminates groups. The sum(marks) of group D is 185, of group A is 247, of group B is 131, and of group C is 35. We can see that group C's sum is not greater than 100, so group C is eliminated. The table now looks like this:
id_  name_     marks  section_
9    Sohel     90     D
6    Abir      95     D
1    Julia     88     A
4    Scarlet   78     A
7    Jane      81     A
2    Samantha  68     B
5    Ashley    63     B
4. SELECT clause execution
select section_,sum(marks) only decides which columns to print: here, the section_ and sum(marks) columns.
section_  sum
D         185
A         247
B         131
5. ORDER BY clause execution
order by section_ sorts the rows in ascending order:
section_  sum
A         247
B         131
D         185
6. LIMIT clause execution
LIMIT 2 prints only the first 2 rows:
section_  sum
A         247
B         131
This is how we got our final output.
Well...
First you have a syntax check, followed by the generation of an expression tree - at this stage it can also be tested whether elements exist and "line up" (i.e. whether the fields actually exist within the table). This is the first step - any error here and you just tell the submitter to get real.
Then you have... analysis. A SQL query is different from a program in that it does not say HOW to do something, just WHAT THE RESULT IS - set-based logic. So a query analyzer comes in (varying from bad to good depending on the product - Oracle had crappy ones for a long time; DB2 has the most sophisticated ones, even measuring disc speed) to decide how best to approach this result. This is a really complicated beast - it may try dozens or hundreds of approaches to find the one it believes to be fastest (cost-based, basically using statistics).
Then that gets executed.
The query analyzer, by the way, is where you see huge differences. Not sure about MySQL - SQL Server (Microsoft) shines not in having the best one (though it has one of the good ones), but in having really nice visual tools to SHOW the query plan and compare the analyzer's estimates with the actual costs (if they differ too much, the table statistics may be off, so the analyzer THINKS a large table is small). It presents that nicely, visually.
DB2 had a great optimizer for some time, measuring - as I already said - disc speed to put into its estimates. Oracle went "left to right" (no real analysis) for a long time and took user-provided query hints (a crap approach). I think MySQL was VERY primitive at the start too - not sure where it is now.
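Since the question mentions MySQL: the simplest way to see what its analyzer picked is EXPLAIN, which prints the chosen access path for each table. The sketch below reuses the students table from the earlier answer purely for illustration; any SELECT works:

EXPLAIN
select section_, sum(marks)
from students
where id_ < 10
group by section_;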
Table format in the database etc. - that is really something you should not care about. It is documented (clearly, especially for an open-source database), but why should you care? I have done SQL work for nearly 15 years or so and never had that need, and that includes quite high-end work in some areas. Unless you are trying to build a database file repair tool... it makes no sense to bother.
The order of SQL statement clause execution:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
My answer is specific to the Oracle database, which provides tutorials pertaining to these questions. When the SQL database engine processes any SQL query/statement, it first starts parsing, and within parsing it performs three checks: syntax, semantic and shared pool. To learn how these checks work, follow the link below.
Once query parsing is done, the execution plan is triggered. But hey, database engine, you are smart enough: you check whether this SQL query has already been parsed (soft parse); if so, you jump directly to the execution plan, or else you dive deep and optimize the query (hard parse). While performing a hard parse, a component called the row source generator is also used, which provides the iterative execution plan received from the optimizer. Enough! See the SQL query processing stages below.
Note - before the execution plan, bind operations are also performed for variables' values, and once the query is executed a fetch obtains the records and finally stores them into the result set. So, in short, the order is:
PARSE -> BIND -> EXECUTE -> FETCH
And for in depth details, this tutorial is waiting for you.
This may be helpful to someone.
If you're using SSMS with SQL Server and want to know where your data files are stored, you can use this query:
SELECT
mdf.database_id,
mdf.name,
mdf.physical_name as data_file,
ldf.physical_name as log_file,
db_size = CAST((mdf.size * 8.0)/1024 AS DECIMAL(8,2)),
log_size = CAST((ldf.size * 8.0 / 1024) AS DECIMAL(8,2))
FROM (SELECT * FROM sys.master_files WHERE type_desc = 'ROWS' ) mdf
JOIN (SELECT * FROM sys.master_files WHERE type_desc = 'LOG' ) ldf
ON mdf.database_id = ldf.database_id
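Since the question itself is about MySQL rather than SQL Server, a rough equivalent there (standard MySQL, not part of the answer above) is to ask the server for its data directory and then look at the files on disk:

-- prints the path MySQL stores its data files under
SHOW VARIABLES LIKE 'datadir';
-- each database is a subdirectory of that path; for InnoDB, table data
-- lives in .ibd files (e.g. <datadir>/mydb/mytable.ibd - illustrative names)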