Currently I am migrating a database from SQL_SERVER to SPARK using HIVE_SQL.
I ran into an issue when trying to convert a number to a date format. The answer I found is:
from_unixtime(unix_timestamp(cast(DATE as string) , 'dd-MM-yyyy'))
When I execute this query it returns the data. Notice that I used an alias different from the column name FECHA:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(FECHA AS STRING ) ,'yyyyMMdd'), 'yyyy-MM-dd') AS FECHA_1
FROM reportes_hechos_avisos_diarios
LIMIT 1
| FECHA_1 |
| -------- |
| 2019-01-01 |
But when I use the same alias as the column name, it returns inconsistent information:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(FECHA AS STRING ) ,'yyyyMMdd'), 'yyyy-MM-dd') AS FECHA
FROM reportes_hechos_avisos_diarios
LIMIT 1
| FECHA |
| -------- |
| 2.019 |
I know the trivial answer is to use an alias that is different from the column name, but I have a Tableau implementation that feeds from this query, and changing the column would mean changing the whole implementation, so I need to preserve the column name. This query works for me in SQL Server, but I don't know why it doesn't work in Hive.
P.S.: Thanks for your attention. This is the first question I have asked on Stack and my native language is not English, so sorry for any grammatical errors.
LIMIT 1 without ORDER BY can produce non-deterministic results from run to run: rows are processed in parallel, so their order is effectively random, and getting the same row each time is not guaranteed.
What is happening, I guess, is that you are receiving a different row and the date in that row is corrupted, which is why a weird result is returned.
Also, you can use another method of conversion:
select date(regexp_replace(cast(20200101 as string),'(\\d{4})(\\d{2})(\\d{2})','$1-$2-$3')) --put your column instead of constant.
Result:
2020-01-01
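To make the regex-based conversion easier to follow, here is the same idea as a minimal Python sketch (not Hive): the number is cast to a string, split into year, month and day groups, and parsed as a date. The helper name int_to_date is hypothetical and only for illustration.

```python
import re
from datetime import date


def int_to_date(value: int) -> date:
    """Turn an integer such as 20200101 into a date, mirroring the regexp_replace idea."""
    # Insert dashes between the yyyy, MM and dd groups: '20200101' -> '2020-01-01'
    text = re.sub(r"(\d{4})(\d{2})(\d{2})", r"\1-\2-\3", str(value))
    return date.fromisoformat(text)


converted = int_to_date(20200101)
```

A value that does not form a valid yyyyMMdd date raises ValueError here, which makes corrupted rows easy to spot.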
My table is stud.
+-----+------+-------+
| no | name | grade |
+-----+------+-------+
| 101 | naga | A |
| 102 | raj | A |
| 103 | john | A |
+-----+------+-------+
The query I'm using is:
SELECT * FROM stud WHERE no = 101 AND grade = 'A'.
If I am using single record buffering, how much data is being stored in the buffer area?
This query doesn't do anything. There is no "into" clause, meaning it won't store anything it selects.
You are probably looking to do something like this....
SELECT * FROM stud into wa_stud WHERE no = 101 AND grade = 'A'.
"processing of each single row is performed here
endselect.
or perhaps something like this, where only 1 row (the first row ordered by primary key) is selected...
select single * from stud into wa_stud where no = 101 and grade = 'A' .
or perhaps you want everything brought into a table, in case number and grade do not make up the full primary key.
select * from stud into table it_stud where no = 101 and grade = 'A'.
This is from the ABAP Keyword documentation in SE38:
SAP Buffer - Single Record Buffering
Only those rows in the table are buffered that are actually accessed.
This requires less space in the buffer than when using generic or full
buffering. On the other hand, more administration work is required and
significantly more direct database accesses.
So since your query returns a single record (based on the data you displayed), it should fetch just one row and hold it in the buffer.
I'd suggest looking at SAP Help and Google. Also have a look at SELECT SINGLE with incompletely specified keys: there used to be a problem with the buffer being bypassed in some situations, so have a read for reference.
I have a large dataset in an oracle database that is currently accessed from Java one item at a time. For example if a user is trying to do a bulk get of 50 items it will process them sequentially, calling a stored procedure for each one. I am now trying to implement a bulk get, but am having some difficulty due to the way the user can pass in a range query:
An example table:
prim_key | identifier | start | end
----------+--------------+---------+-------
1 | aaa | 1 | 3
2 | aaa | 3 | 7
3 | bbb | 1 | 5
The way it works is that if you have a query like (id='aaa' and pos=1) it will find prim_key = 1, but if you query (id='aaa' and pos=2) it won't find anything. If you do (id='aaa' and pos=-2) then it will again find prim_key=1 because the stored proc converts the -2 into a range scan equivalent to start<=2 and end>2.
(Extra context: the start/end are actually dates and this querying mechanism allows efficient "latest as of date" queries as opposed to doing something like
select prim_key, start
from myTable
where start = (select max(start) from myTable where start <= 2))
This is all fine and works correctly for single gets, but now I'm trying to do bulk gets so that we can speed up the batch considerably. The first attempt was to multithread the individual calls, but it put too much stress on the database to be doing so many parallel queries on the same table. To solve this I've been trying to create a query like
select prim_key
from myTable
where (identifier='aaa' and start=3)
or (identifier='aaa' and start<=2 and end>2)
building this up from the list of input parameters ('aaa',3 ; 'bbb',-2), which works well and produces an explain plan using all of the indexes I would expect.
My Problem: I need to know which input parameters retrieved each row in order to do further processing and return the relevant prim_key. I need to use something like a pseudocolumn that I can define myself:
select prim_key, PSUEDO
from myTable
where (identifier='aaa' and start=3 and PSUEDO='a3')
or (identifier='aaa' and start<=2 and end>2 and PSUEDO='a-2')
but I can't find any way to return a value from the where clause, and I think subqueries would lose the indexing efficiencies gained by doing it all in one select.
Try something like:
select
prim_key,
case when start = 3 then 'a3' else 'a-2' end pseudo
from
your_table
where
...
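A slightly fuller sketch of the CASE idea, run against an in-memory SQLite database from Python rather than Oracle, so treat it as an illustration only. Each CASE branch repeats one predicate from the WHERE clause and returns a label for the input parameter behind it. The names my_table, start_pos, end_pos and matched_param are assumptions made for this example (start and end are renamed because END is a keyword).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "prim_key INTEGER, identifier TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [(1, "aaa", 1, 3), (2, "aaa", 3, 7), (3, "bbb", 1, 5)],
)

# Input parameters ('aaa', 3) and ('bbb', -2): each one becomes a WHERE
# predicate and a CASE branch carrying the same condition plus a label.
query = """
SELECT prim_key,
       CASE
         WHEN identifier = 'aaa' AND start_pos = 3 THEN 'aaa,3'
         WHEN identifier = 'bbb' AND start_pos <= 2 AND end_pos > 2 THEN 'bbb,-2'
       END AS matched_param
FROM my_table
WHERE (identifier = 'aaa' AND start_pos = 3)
   OR (identifier = 'bbb' AND start_pos <= 2 AND end_pos > 2)
ORDER BY prim_key
"""
rows = conn.execute(query).fetchall()
```

One caveat: CASE returns only the first branch that matches, so a row that satisfies two input parameters is labelled with just one of them.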
Background
For a data entry project, a user can enter variables using a short-hand notation:
"Pour i1 into a flask."
"Warm the flask to 25 degrees C."
"Add 1 drop of i2 to the flask."
"Immediately seek cover."
In this case i1 and i2 are reference variables, where the number refers to an ingredient. The text strings are in the INSTRUCTION table, the ingredients in the INGREDIENT table.
Each ingredient has a sequence number for sorting purposes.
Problem
Users may rearrange the ingredient order, which adversely changes the instructions. For example, the ingredient order might look as follows, initially:
seq | label
1 | water
2 | sodium
The user adds another ingredient:
seq | label
1 | water
2 | sodium
3 | francium
The user reorders the list:
seq | label
1 | water
2 | francium
3 | sodium
At this point, the following line is now incorrect:
"Add 1 drop of i2 to the flask."
The i2 must be renumbered (because ingredient #2 was moved to position #3) to point to the original reference variable:
"Add 1 drop of i3 to the flask."
Updated Details
This is a simplified version of the problem. The full problem can have lines such as:
"Add 1 drop of i2 to the o3 of i1."
Where o3 is an object (flask), and i1 and i2 are water and sodium, respectively.
Table Structure
The ingredient table is structured as follows:
id | seq | label
The instruction table is structured as follows:
step
Algorithm
The algorithm I have in mind:
Repeat for all steps that match the expression '\mi([0-9]+)':
Break the step into word tokens.
For each token:
If the numeric portion of the token matches the old sequence number, replace it with the new sequence number.
Recombine the tokens and update the instruction.
Update the ingredient number.
Update
The algorithm may be incorrect as written. There could be two reference variables that must change. Consider before:
seq | label
1 | water
2 | sodium
3 | caesium
4 | francium
And after (swapping sodium and caesium):
seq | label
1 | water
2 | caesium
3 | sodium
4 | francium
Every i2 in every step must become i3; similarly i3 must become i2. So
"Add 1 drop of i2 to the flask, but absolutely do not add i3."
Becomes:
"Add 1 drop of i3 to the flask, but absolutely do not add i2."
Code
The code to perform the first two parts of the algorithm resembles:
CREATE OR REPLACE FUNCTION
renumber_steps(
p_ingredient_id integer,
p_old_sequence integer,
p_new_sequence integer )
RETURNS void AS
$BODY$
DECLARE
v_tokens text[];
BEGIN
FOR v_tokens IN
SELECT
t.tokens
FROM (
SELECT
regexp_split_to_array( step, '\W' ) tokens,
regexp_matches( step, '\mi([0-9]+)' ) matches
FROM
instruction
) t
LOOP
RAISE NOTICE '%', v_tokens;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
Question
What is a more efficient way to solve this problem (i.e., how would you eliminate the looping constructs), possibly leveraging PostgreSQL-specific features, without a major revision to the data model?
Thank you!
System Details
PostgreSQL 9.1.2.
You have to take care that you don't change ingredients and seq numbers back and forth. I introduce a temporary prefix for ingredients and negative numbers for seq for that purpose and exchange them for permanent values when all is done.
Could work like this:
CREATE OR REPLACE FUNCTION renumber_steps(_old int[], _new int[])
RETURNS void AS
$BODY$
DECLARE
_prefix CONSTANT text := ' i'; -- prefix, incl. leading space
_new_prefix CONSTANT text := ' ###'; -- temp prefix, incl. leading space
i int;
o text;
n text;
BEGIN
IF array_upper(_old,1) <> array_upper(_new,1) THEN
RAISE EXCEPTION 'Array length mismatch!';
END IF;
FOR i IN 1 .. array_upper(_old,1) LOOP
IF _old[i] <> _new[i] THEN
o := _prefix || _old[i] || ' '; -- leading and trailing blank!
-- new references are temporarily prefixed with _new_prefix
n := _new_prefix || _new[i] || ' ';
UPDATE instruction
SET step = replace(step, o, n) -- replace all instances
WHERE step ~~ ('%' || o || '%');
UPDATE ingredient
SET seq = _new[i] * -1 -- temporarily negative
WHERE seq = _old[i];
END IF;
END LOOP;
-- finally replace temp. prefix
UPDATE instruction
SET step = replace(step, _new_prefix, _prefix)
WHERE step ~~ ('%' || _new_prefix || '%');
-- .. and temp. negative seq numbers
UPDATE ingredient
SET seq = seq * -1
WHERE seq < 0;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT;
Call:
SELECT renumber_steps('{2,3,4}'::int[], '{4,3,2}'::int[]);
The algorithm requires ...
... that ingredients in the steps are delimited by spaces.
... that there are no permanent negative seq numbers.
_old and _new are ARRAYs of the old and new ingredient.seq of ingredients that change position. The length of both arrays has to match, or an exception will be raised. They can contain seq values that don't change. Nothing will happen to those.
Requires PostgreSQL 9.1 or later.
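The temporary prefix exists so that a swap such as 2 -> 3 and 3 -> 2 is not undone halfway through. As a comparison, here is a minimal Python sketch (not PL/pgSQL, and not the function above) of a different technique: a single regex pass that looks up every reference in an old-to-new mapping, so no temporary marker is needed. The function name renumber and the mapping argument are hypothetical.

```python
import re


def renumber(step: str, mapping: dict) -> str:
    """Replace every iN reference using an {old_seq: new_seq} mapping in one pass."""

    def swap(match):
        old = int(match.group(1))
        # References whose sequence number did not change are left alone.
        return "i" + str(mapping.get(old, old))

    # \b keeps the match to whole tokens, so 'i2' matches but 'i21' is its own token.
    return re.sub(r"\bi(\d+)\b", swap, step)


before = "Add 1 drop of i2 to the flask, but absolutely do not add i3."
after = renumber(before, {2: 3, 3: 2})
```

Because each reference is rewritten exactly once, the swap cannot flip back, and references followed by punctuation are handled as well.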
I think your model is problematic... you should have the "real name (id)" (i1, o3 etc.) FIXED after creation and have a second field in the ingredient table providing the "sorting". The user enters the "sorting name" and you immediately replace it with the "real name" (id) when saving the entered data into the step table.
When you read it from the step table, you just replace/map the "real name" (id) with the current "sorting name" for display purposes if need be...
This way you don't have to change the data already in the step table every time someone changes the sorting, which is a complex and expensive operation IMHO and is prone to concurrency problems too...
The above option reduces the whole problem to a mapping operation (table ingredient) on INSERT/UPDATE/SELECT (table step) for the one entry currently worked on. It doesn't touch any other entries already there.
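A rough Python sketch of that mapping idea, under a few assumptions that are not in the question: the stored form of a reference is written as {ing:ID}, and the two dictionaries stand in for lookups against the ingredient table. The function names to_stored and to_display are hypothetical.

```python
import re


def to_stored(step: str, seq_to_id: dict) -> str:
    """On save: replace the user's iN (sorting name) with the fixed ingredient id."""
    return re.sub(
        r"\bi(\d+)\b",
        lambda m: "{ing:%d}" % seq_to_id[int(m.group(1))],
        step,
    )


def to_display(stored: str, id_to_seq: dict) -> str:
    """On read: map the fixed id back to the ingredient's current sequence number."""
    return re.sub(
        r"\{ing:(\d+)\}",
        lambda m: "i%d" % id_to_seq[int(m.group(1))],
        stored,
    )


# Ingredient ids 10 (water), 11 (sodium), 12 (francium) with seq 1, 2, 3.
stored = to_stored("Add 1 drop of i2 to the flask.", {1: 10, 2: 11, 3: 12})

# After the user moves francium to position 2, only the mapping changes.
shown = to_display(stored, {10: 1, 12: 2, 11: 3})
```

The stored step never changes when the list is reordered; only the mapping used at display time does.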
I've been beating my head on the desk trying to figure this one out. I have a table that stores job information and the reasons a job was not completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending, that stores what those values actually represent. 01=Not enough info, 02=Not enough time, 03=Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reason. I can break out the first value, but can't figure out how to do the second. What my limited skills have produced so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find ( ',' |fld| ',', ',02,' ) > 0
assuming your SQL dialect has a string search function (find in this case, but I think charindex for SQLServer).
This will ensure all sub-columns begin and end with a comma (comma plus field plus comma) and look for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
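To show why the commas on both sides matter, here is the same test as a small Python sketch (an illustration of the idea, not SQL for any particular dialect). The function name has_reason is hypothetical.

```python
def has_reason(pending_info: str, reason: str) -> bool:
    """Check for a whole comma-separated entry by wrapping both sides in commas."""
    # ',01,02,' contains ',02,' but ',102,03,' does not, so '102' is not mistaken for '02'.
    return ("," + reason + ",") in ("," + pending_info + ",")


first = has_reason("01,02", "02")
second = has_reason("102,03", "02")
```

Without the wrapping commas, a plain substring search for 02 would also hit 102.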
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is only on row insert/update, not on every single select, amortising that cost efficiently.
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach with a support table like this:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOIN or subqueries.
If you need retro-compatibility on your software you can add a view to reach this goal.
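For example, a JOIN across the three tables could look roughly like this. The sketch runs the SQL against an in-memory SQLite database from Python, so it is only an illustration and the syntax may need adjusting for your server. The table names jobs, pending and job_pending are lower-case versions of the tables above, and pendingID is stored as text to keep the leading zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (jobID INTEGER, userID TEXT);
    CREATE TABLE pending (pendingID TEXT, pendingText TEXT);
    CREATE TABLE job_pending (jobID INTEGER, pendingID TEXT);
    """
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(1, "user13"), (2, "user32"), (3, "user44")],
)
conn.executemany(
    "INSERT INTO pending VALUES (?, ?)",
    [("01", "Not Enough Info"), ("02", "Not Enough Time")],
)
conn.executemany(
    "INSERT INTO job_pending VALUES (?, ?)",
    [(1, "01"), (1, "02"), (2, "01"), (3, "03"), (3, "01")],
)

# One output row per (job, reason) pair; job 1 has two reasons.
rows = conn.execute(
    """
    SELECT j.jobID, j.userID, p.pendingID, p.pendingText
    FROM jobs j
    JOIN job_pending jp ON jp.jobID = j.jobID
    JOIN pending p ON p.pendingID = jp.pendingID
    WHERE j.jobID = 1
    ORDER BY p.pendingID
    """
).fetchall()
```

Each reason comes back as its own row, so a job can have any number of reasons without changing the query.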
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write 2 procedures in my site code, not SQL code.
One procedure converts the table field (eventTypeIds) value like "3,4,15,6" into a ViewState array, so I can use it anywhere in code.
This procedure does the opposite: it collects any options you checked and converts it in
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
SELECT JOB_NUMBER, USER_ASSIGNED, SUBSTRING(PENDING_INFO, Numbers.N, CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs. Then, extract the appropriate part of the string.
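The same row-multiplying idea, sketched in plain Python instead of T-SQL to make the two steps visible: each job row is repeated once per comma-separated code, and each copy is then looked up in the pending table. The function name split_rows and the pending_words dictionary are made up for this illustration.

```python
def split_rows(rows):
    """Yield one (job_number, user, pending_num) tuple per code in each row."""
    out = []
    for job_number, user, pending_info in rows:
        # One output row for every comma-separated entry in the column.
        for pending_num in pending_info.split(","):
            out.append((job_number, user, pending_num))
    return out


# Stand-in for the Pending table.
pending_words = {"01": "Not Enough Info", "02": "Not Enough Time"}

split = split_rows([(1, "user1", "01,02")])
# Equivalent of the final JOIN: attach the wording to every split row.
joined = [row + (pending_words[row[2]],) for row in split if row[2] in pending_words]
```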
While I agree with the DBA perspective that you should not store multiple values in a single field, it is doable, as below, and can be practical for application logic and for some performance issues. Let's say you have 10000 user groups, each having 1000 members on average. You may want to have a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,') each number being a memberID, all separated with a comma. Also add a comma at the start and end of the data. Then your select would use LIKE '%,someID,%'.
If you cannot change your data ('01,02,03' or similar) and you want the rows containing 01, you can still use " select ... WHERE col = '01' OR col LIKE '01,%' OR col LIKE '%,01' OR col LIKE '%,01,%' " (col being your column), which ensures a match when 01 is the only entry, at the start, at the end or inside the list, while avoiding similar numbers (e.g. 101).