How can I perform a similar UPDATE-WHERE statement in SparkSQL 2.0?

How can I implement an SQL query like this in SparkSQL 2.0, using DataFrames and the Scala language? I've read a lot of posts, but none of them seems to achieve what I need (or if you can point me to one, that would do). Here's the problem:
UPDATE table SET value = 100 WHERE id = 2
UPDATE table SET value = 70 WHERE id = 4
.....
Suppose that you have a table table with two columns like this:
id | value
--- | ---
1 | 1
2 | null
3 | 3
4 | null
5 | 5
Is there a way to implement the above query, using map, match cases, UDFs or if-else statements? The values that I need to store in the value field are not sequential, so I have specific values to put there. I'm also aware that it is not possible to modify immutable data when dealing with DataFrames. I have no code to share because I can't get it to work, nor reproduce any errors.

Yes you can; it's very simple. You can use when and otherwise:
import org.apache.spark.sql.functions.{lit, when}
val pf = df.select($"id",
  when($"id" === 2, lit(100))
    .otherwise(when($"id" === 4, lit(70)).otherwise($"value"))
    .as("value"))
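If you'd rather not list the other columns by hand, the same logic can also be written with withColumn, which replaces just the one column and keeps the rest. A sketch along the same lines (when conditions can be chained directly):
val pf = df.withColumn("value",
  when($"id" === 2, lit(100))
    .when($"id" === 4, lit(70))
    .otherwise($"value"))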

Related

How to replace text contained in one row with text contained in another row using a select statement

I am crafting a SQL query that dynamically builds a WHERE clause. I was able to transform the separate pieces of the WHERE clause into returned rows like so:
-----------------------------------
| ID      | Query Part            |
-----------------------------------
| TOKEN 1 | (A = 1 OR B = 2)      |
| TOKEN 2 | ([TOKEN 1] OR C = 3)  |
| TOKEN 3 | ([TOKEN 2] AND D = 4) |
-----------------------------------
My goal is to wrap the current results above in a STUFF and/or REPLACE (or something entirely different I hadn't considered) to output the following result:
(((A=1 OR B=2) OR C=3) AND D=4)
Ideally there would be no temp table necessary but I am open to recommendations.
Thank you for any guidance, this has had me pretty stumped at work.
It's unusual. It looks like the query part you want is only TOKEN 3. The process should then replace any [token] tags in this query part with the corresponding query parts, and with the resulting query part it should again replace any [token] tags with the corresponding query parts, continuing until there are no more [token] tags to replace.
I think there should be a way of indicating the master query (i.e. TOKEN 3), then using a recursive common table expression to build the expression up until there are no more [token]s.
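A hedged sketch of that idea in SQL Server syntax (query_parts, its columns, and the master id 'TOKEN 3' are hypothetical names standing in for the table above):
WITH expanded (part) AS (
    SELECT CAST(part AS varchar(4000))
    FROM query_parts WHERE id = 'TOKEN 3'
    UNION ALL
    SELECT CAST(REPLACE(e.part, '[' + q.id + ']', q.part) AS varchar(4000))
    FROM expanded e
    JOIN query_parts q ON CHARINDEX('[' + q.id + ']', e.part) > 0
)
SELECT part
FROM expanded
WHERE part NOT LIKE '%[[]TOKEN%'; -- keep only the row with no [TOKEN n] tags left
The recursion stops by itself once a row contains no more [token] tags, because the join no longer matches anything.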

What is the shortest notation for updating a column in an internal table?

I have an ABAP internal table. Structured, with several columns (e.g. 25). Names and types are irrelevant. The table can get pretty large (e.g. 5,000 records).
| A | B | ... |
| --- | --- | --- |
| 7 | X | ... |
| 2 | CCC | ... |
| 42 | DD | ... |
Now I'd like to set one of the columns (e.g. B) to a specific constant value (e.g. 'Z').
What is the shortest, fastest, and most memory-efficient way to do this?
My best guess is a LOOP REFERENCE INTO. This is pretty efficient as it changes the table in-place, without wasting new memory. But it takes up three statements, which makes me wonder whether it's possible to get shorter:
LOOP AT lt_table REFERENCE INTO DATA(ls_row).
  ls_row->b = 'Z'.
ENDLOOP.
Then there is the VALUE operator which reduces this to one statement but is not very efficient because it creates new memory areas. It also gets longish for a large number of columns, because they have to be listed one by one:
lt_table = VALUE #( FOR ls_row IN lt_table
                    ( a = ls_row-a
                      b = 'Z' ) ).
Are there better ways?
The following code sets PRICE = 0 on all lines at once. Theoretically, it should be the fastest way to update all the lines of one column, because it's a single statement. Note that it's impossible to omit the WHERE, so I use a simple trick to update all lines.
DATA flights TYPE TABLE OF sflight.
DATA flight TYPE sflight.
SELECT * FROM sflight INTO TABLE flights.
flight-price = 0.
MODIFY flights FROM flight TRANSPORTING price WHERE price <> flight-price.
Reference: MODIFY itab - itab_lines
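Applied to the question's table, the same one-statement trick would look like this (a sketch, assuming lt_table and column B as above):
DATA row LIKE LINE OF lt_table.
row-b = 'Z'.
" the WHERE skips only lines that already contain 'Z', which need no update anyway
MODIFY lt_table FROM row TRANSPORTING b WHERE b <> row-b.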
If you have a work area declared...
workarea-field = 'Z'.
MODIFY lt_table FROM workarea TRANSPORTING field WHERE field <> workarea-field.
Edit:
After being able to check that syntax on my current system, I could prove (to myself) that the WHERE clause must be added, or a dump is raised.
Thanks Sandra Rossi.

Is condensing the number of columns in a database beneficial?

Say you want to record three numbers for every Movie record...let's say, :release_year, :box_office, and :budget.
Conventionally, using Rails, you would just add those three attributes to the Movie model and call @movie.release_year, @movie.box_office, and @movie.budget.
Would it save any database space or provide any other benefits to condense all three numbers into one umbrella column?
So when adding the three numbers, it would go something like:
def update
  ...
  @movie.umbrella = params[:movie_release_year] +
    "," + params[:movie_box_office] + "," + params[:movie_budget]
end
So the final @movie.umbrella value would be along the lines of "2015,617293,748273".
And then in the controller, to access the three values, it would be something like
@umbrella_array = @movie.umbrella.strip.split(',').map(&:strip)
@release_year = @umbrella_array.first
@box_office = @umbrella_array.second
@budget = @umbrella_array.third
This way, it would be the same amount of data (actually a little more, with the extra commas) but stored only in one column. Would this be better in any way than three columns?
There is no benefit in squeezing such attributes into a single column. In fact, following that path will increase the complexity of your code and limit your capabilities.
Here are some of the possible issues you'll face:
- You will not be able to add indexes to speed up the lookup of records with a specific attribute value, or to filter on one efficiently
- You will not be able to query on a specific attribute value
- You will not be able to sort by a specific column value
- The values will be stored and represented as Strings, rather than Integers
...and I can continue. There are no advantages, only disadvantages.
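For contrast, the conventional approach keeps the three attributes as separate, indexable columns. A minimal sketch, assuming a standard Rails migration (the class name is made up):
class AddDetailsToMovies < ActiveRecord::Migration
  def change
    add_column :movies, :release_year, :integer
    add_column :movies, :box_office, :integer
    add_column :movies, :budget, :integer
    add_index :movies, :release_year # enables fast lookups and sorting
  end
end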
Agreed with the comments above; as an example, try pg_column_size() to compare the results:
WITH test(data_txt,data_int,data_date) AS ( VALUES
('9999'::TEXT,9999::INTEGER,'2015-01-01'::DATE),
('99999999'::TEXT,99999999::INTEGER,'2015-02-02'::DATE),
('2015-02-02'::TEXT,99999999::INTEGER,'2015-02-02'::DATE)
)
SELECT pg_column_size(data_txt)  AS txt_size,
       pg_column_size(data_int)  AS int_size,
       pg_column_size(data_date) AS date_size
FROM test;
The result is:
txt_size | int_size | date_size
----------+----------+-----------
5 | 4 | 4
9 | 4 | 4
11 | 4 | 4
(3 rows)

Is there a way to transpose data in Hive?

Can data in Hive be transposed? As in, the rows become columns and the columns become rows? If there is no function that does it straight up, is there a way to do it in a couple of steps?
I have a table like this:
| ID | Names | Proc1 | Proc2 | Proc3 |
| 1 | A1 | x | b | f |
| 2 | B1 | y | c | g |
| 3 | C1 | z | d | h |
| 4 | D1 | a | e | i |
I want it to be like this:
| A1 | B1 | C1 | D1 |
| x | y | z | a |
| b | c | d | e |
| f | g | h | i |
I have been looking up other related questions and they all mention using lateral views and explode, but is there a way to selectively choose columns for lateral(ly) view(ing) and explod(ing)?
Also, what might be the rough process to achieve what I would like to do? Please help me out. Thanks!
Edit: I have been reading this link: https://cwiki.apache.org/Hive/languagemanual-lateralview.html and it shows me half of what I want to achieve. The first example in the link is basically what I'd like, except that I don't want the rows to repeat and want them as column names. Any ideas on how to get the data into a form such that, if I do an explode, it results in my desired output? Or the other way around: explode first, leading to another step that then leads to my desired output table. Thanks again!
I don't know of an out-of-the-box way to do this in Hive, sorry. You get close with explode etc., but I don't think it can get the job done.
Overall, conceptually, I think it's hard to do a transpose without knowing in advance what the columns of the destination table are going to be. This is true in particular for Hive, because the metadata related to how many columns there are, their types, their names, etc. lives in a database - the metastore. And it's true in general, because not knowing the columns beforehand would require some sort of in-memory holding of data (OK, sure, with spills), and users may need to be careful about not overflowing the memory and such (just like dynamic partitioning in Hive).
In any case, long story short, if you know the columns of the destination table beforehand, life is good. There isn't a set command in Hive for this per se, to the best of my knowledge, but you could use a bunch of if clauses and case statements (ugly, I know, but that's how I have done the same in the past) in the select clause to transpose the data. Something along the lines of SQL - How to transpose?
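For this table, a hedged sketch of that case-statement approach (it assumes the four Names values are known in advance; input_table stands in for your table):
SELECT max(CASE WHEN names = 'A1' THEN proc END) AS a1,
       max(CASE WHEN names = 'B1' THEN proc END) AS b1,
       max(CASE WHEN names = 'C1' THEN proc END) AS c1,
       max(CASE WHEN names = 'D1' THEN proc END) AS d1
FROM (SELECT names, 1 AS rn, proc1 AS proc FROM input_table
      UNION ALL
      SELECT names, 2 AS rn, proc2 AS proc FROM input_table
      UNION ALL
      SELECT names, 3 AS rn, proc3 AS proc FROM input_table) unpivoted
GROUP BY rn; -- one output row per proc column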
Do let me know how it goes!
As Mark pointed out, there's no easy way to do this in Hive since PIVOT is not present in Hive, and you may also encounter issues when trying to use the case/when 'trick', since you have multiple values (proc1, proc2, proc3).
For testing purposes, you may try a different approach:
select v, o1, o2, o3 from (
  select k, v,
         LEAD(v, 3) OVER () as o1,
         LEAD(v, 6) OVER () as o2,
         LEAD(v, 9) OVER () as o3
  from (select transform(name, proc1, proc2, proc3) using 'python strm.py' AS (k, v)
        from input_table) q1
) q2 where k = 'A1';
where strm.py is:
import sys

for line in sys.stdin:
    line = line.strip()
    name, proc1, proc2, proc3 = line.split('\t')
    print '%s\t%s' % (name, proc1)
    print '%s\t%s' % (name, proc2)
    print '%s\t%s' % (name, proc3)
The trick here is to use a python script in the map phase which emits each column of a row as distinct rows. Then every third (since we have 3 proc columns) row will form the resulting row which we get by peeking forward (lead).
However, while this query does the job, it has the drawback that as the input grows, you need to peek at the element 3 rows ahead, which may lead to a performance hit. Anyway, you may evaluate it for testing purposes.

User-defined pseudocolumn in Oracle

I have a large dataset in an oracle database that is currently accessed from Java one item at a time. For example if a user is trying to do a bulk get of 50 items it will process them sequentially, calling a stored procedure for each one. I am now trying to implement a bulk get, but am having some difficulty due to the way the user can pass in a range query:
An example table:
prim_key | identifier | start | end
----------+--------------+---------+-------
1 | aaa | 1 | 3
2 | aaa | 3 | 7
3 | bbb | 1 | 5
The way it works is that if you have a query like (id='aaa' and pos=1) it will find prim_key = 1, but if you query (id='aaa' and pos=2) it won't find anything. If you do (id='aaa' and pos=-2) then it will again find prim_key=1 because the stored proc converts the -2 into a range scan equivalent to start<=2 and end>2.
(Extra context: the start/end are actually dates, and this querying mechanism allows efficient "latest as of date" queries, as opposed to doing something like:
select prim_key, start
from myTable
where start = (select max(start) from myTable where start <= 2))
This is all fine and works correctly for single gets, but now I'm trying to do bulk gets so that we can speed up the batch considerably. The first attempt was to multithread the individual calls, but it put too much stress on the database to be doing so many parallel queries on the same table. To solve this I've been trying to create a query like
select prim_key
from myTable
where (identifier='aaa' and start=3)
or (identifier='aaa' and start<=2 and end>2)
building this up from the list of input parameters ('aaa',3 ; 'bbb',-2), which works well and produces an explain plan using all of the indexes I would expect.
My problem: I need to know which input parameters retrieved each row, in order to do further processing and return the relevant prim_key. I need to use something like a pseudocolumn that I can define myself:
select prim_key, PSEUDO
from myTable
where (identifier='aaa' and start=3 and PSEUDO='a3')
   or (identifier='aaa' and start<=2 and end>2 and PSEUDO='a-2')
but I can't find any way to return a value from the where clause, and I think subqueries would lose the indexing efficiencies gained by doing it all in one select.
Try something like:
select
  prim_key,
  case when start = 3 then 'a3' else 'a-2' end pseudo
from
  your_table
where
  ...
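Extending that idea so each OR branch tags the input pair that produced it, a sketch only (my_table and the start_pos/end_pos column names are hypothetical, END being a reserved word in Oracle):
SELECT prim_key,
       CASE
         WHEN identifier = 'aaa' AND start_pos = 3 THEN 'a3'
         WHEN identifier = 'aaa' AND start_pos <= 2 AND end_pos > 2 THEN 'a-2'
       END AS pseudo
FROM my_table
WHERE (identifier = 'aaa' AND start_pos = 3)
   OR (identifier = 'aaa' AND start_pos <= 2 AND end_pos > 2);
Each CASE branch repeats the corresponding predicate from the WHERE clause, so the returned pseudo value identifies which input parameter pair matched the row.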