Hive auto increment after a certain number

I have to insert data into a target table where all columns are populated from different source tables, except the surrogate key column, which should be the maximum value of the target table plus an auto-increment value starting at 1. I can generate the auto-increment value with the row_number() function, but how do I get the max value of the surrogate key from the target table in the same query? Is there any concept in Hive where I can select the max value of the surrogate key and save it in a temporary variable? Or is there any other simple way to achieve this result?

Here are two approaches that worked for me for the above problem (explained with an example).
Approach 1: get the max value and pass it to the Hive commands through a ${hiveconf} variable, using a shell script
Approach 2: use row_sequence(), max() and a join
My Environment:
hadoop-2.6.0
apache-hive-2.0.0-bin
Steps (note: steps 1 and 2 are common to both approaches; from step 3 onward the approaches differ):
Step 1: create source and target tables
source
hive>create table source_table1(name string);
hive>create table source_table2(name string);
hive>create table source_table3(name string);
target
hive>create table target_table(id int, name string);
Step 2: load data into source tables
hive>load data local inpath 'source_table1.txt' into table source_table1;
hive>load data local inpath 'source_table2.txt' into table source_table2;
hive>load data local inpath 'source_table3.txt' into table source_table3;
Sample Input:
source_table1.txt
a
b
c
source_table2.txt
d
e
f
source_table3.txt
g
h
i
Approach 1:
Step 3: create a shell script hive_auto_increment.sh
#!/bin/sh
hive -e 'select max(id) from target_table' > max.txt
wait
value=`cat max.txt`
hive --hiveconf mx=$value -e "add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
set mx;
set hiveconf:mx;
INSERT INTO TABLE target_table SELECT row_sequence(),name from source_table1;
INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} +row_sequence()),name from source_table2;
INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} +row_sequence()),name from source_table3;"
wait
hive -e "select * from target_table;"
Step 4: run the shell script
> bash hive_auto_increment.sh
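One caveat with step 3, as a hedged aside: if target_table is empty, select max(id) returns NULL, so max.txt ends up holding the literal string NULL and the arithmetic in the later inserts yields NULL ids. Wrapping the max in coalesce (a standard Hive function) guards against that:
hive -e 'select coalesce(max(id), 0) from target_table' > max.txt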
Approach 2:
Step 3: Add Jar
hive>add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
Step 4: register row_sequence function with help of hive contrib jar
hive>create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Step 5: load source_table1 into target_table
hive>INSERT INTO TABLE target_table SELECT row_sequence(), name FROM source_table1;
Step 6: load the other sources into target_table
hive>INSERT INTO TABLE target_table SELECT M.rowcount + row_sequence(), T.name FROM source_table2 T JOIN (SELECT max(id) AS rowcount FROM target_table) M;
hive>INSERT INTO TABLE target_table SELECT M.rowcount + row_sequence(), T.name FROM source_table3 T JOIN (SELECT max(id) AS rowcount FROM target_table) M;
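As a hedged safeguard for step 6: if target_table could still be empty at this point (say, if step 5 was skipped), wrap the max in coalesce so the arithmetic does not produce NULL ids:
hive>INSERT INTO TABLE target_table SELECT coalesce(M.rowcount, 0) + row_sequence(), T.name FROM source_table2 T JOIN (SELECT max(id) AS rowcount FROM target_table) M;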
output:
INFO : OK
+------------------+--------------------+--+
| target_table.id  | target_table.name  |
+------------------+--------------------+--+
| 1                | a                  |
| 2                | b                  |
| 3                | c                  |
| 4                | d                  |
| 5                | e                  |
| 6                | f                  |
| 7                | g                  |
| 8                | h                  |
| 9                | i                  |
+------------------+--------------------+--+

create table autoincrement1 (id int, name string);
insert into autoincrement1
select if(isnull(max(id)), 0, max(id)) + 1, 'sagar' from autoincrement1;
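The snippet above inserts a single row. A hedged multi-row variant of the same idea, assuming a source table named source_table1 with a name column, numbers all incoming rows on top of the current maximum (row_number() with an empty OVER clause and cross join are both supported by Hive):
insert into autoincrement1
select coalesce(m.mx, 0) + row_number() over (), s.name
from source_table1 s
cross join (select max(id) as mx from autoincrement1) m;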

Related

Hive Query to find out the average rating of movies having rating more than 2

I have a table named movierating consisting of the following fields:
| Column No. | Name    | DataType |
| Column1    | id      | int      |
| Column2    | movieid | int      |
| Column3    | rating  | int      |
| Column4    | time    | string   |
I have created the table per the above description using a SQL query and loaded it with data successfully. Now I am supposed to write a Hive query to find the average rating of movies having a rating greater than 2, round the average to two decimal places, and save the output in output.txt. In the Bash terminal, I typed the Hive command as follows (after receiving help from @SKM):
hive -e "select movieid, round(avg(rating), 2) from movierating group by movieid having avg(rating) > 2;" > output.txt
I have even referred here for a similar situation, but it didn't help me much.
Please have a look at the screenshots after I ran the query:
And when I open output.txt with vim output.txt, I just get a blank file with no data. I am unable to understand what is being described in the terminal (shown in the screenshot).
Query for table creation:
create table if not exists movierating (id int, movieid int, rating int, time string);
load data local inpath '/tmp/Movie-rating.txt' overwrite into table movierating;
This is the first step in the challenge I am attempting. Since I am new to Hive, I am not very familiar with how it works.
Challenge description (screenshot not included): Step 1 was to create the table with the above-mentioned fields.
Please do help me in this regard.
Based on Comments
Output after running the query sent by @SKM (screenshot not included):
You can change your query like below:
hive -e "select movieid, round(avg(rating), 2) from movierating group by movieid having avg(rating) > 2;" > output.txt

PostgreSQL order by special char

I have to sync two databases. The first one (source) runs on Sybase and the second (destination) runs on PostgreSQL. For each row I compare the primary keys to check whether the row already exists. But for one specific table, the PK is a varchar.
I run:
SELECT * FROM table ORDER BY 1;
The table output looks like this on Sybase:
# | row 1
.. | row 2
1 | row 3
...
and on PostgreSQL :
.. | row 1
# | row 2
1 | row 3
So the rows are not in the same order. I would like to obtain the same output in PostgreSQL as in my Sybase (SQL Anywhere) database.
Any ideas?
If I understand you correctly, you need this query:
WITH x(a) AS (SELECT * FROM table ORDER BY 1)
SELECT *
FROM x
ORDER BY a COLLATE "C";
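If the CTE feels indirect, a hedged equivalent (assuming the key column is named pk_col; adjust to your schema) applies the collation directly in the ORDER BY:
SELECT * FROM the_table ORDER BY pk_col COLLATE "C";
COLLATE "C" sorts by byte value, which usually matches a binary ordering like the one your Sybase database appears to use.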

How can I update all rows in PostgreSQL from 1 to N, where N is the number of rows?

I found this example https://stackoverflow.com/a/13629639/947111 but it's for SQL Server, while I need it for PostgreSQL.
For example my table has the following structure:
id|title|slot_id
----------------
1 | 1| 1
2 | 2| 1
3 | 3| 1
4 | 1| 2
5 | 2| 2
When I delete a row from the middle of a set (a set is defined by slot_id), for example when the row with title = 2 is removed from 1,2,3 where slot_id = 1, I need to renumber the titles so the result is 1,2 rather than 1,3.
You haven't provided sufficient information about your table structure, so you will need to adjust the following query to your own table and column names:
update the_table
set the_column_to_update = t.rn
from (
  select the_primary_key_column,
         row_number() over (order by the_primary_key_column) as rn
  from the_table
) t
where t.the_primary_key_column = the_table.the_primary_key_column;
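Applied to the table shown in the question (columns id, title, slot_id; the_table is a placeholder name), and assuming title holds the per-slot position to renumber, a hedged sketch partitions the numbering by slot_id:
update the_table
set title = t.rn
from (
  select id,
         row_number() over (partition by slot_id order by id) as rn
  from the_table
) t
where t.id = the_table.id;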
The issue in "Update int column in table with unique incrementing values" refers to adding an automatically incremented id to a table after the table has been defined.
In SQL Server this is called an IDENTITY column while in PostgreSQL it is called a SERIAL column.
You can find a similar solution (if that is what you need) in:
Adding 'serial' to existing column in Postgres.
That is, if you want to update a column with values from 1 up to N to serve as a unique id for each row of your table.
If you define the table that way from the beginning, you can declare one column as SERIAL, as in the following article: PostgreSQL Autoincrement
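For completeness, a minimal sketch of both routes in PostgreSQL (the items tables and their columns are invented for illustration):
-- route 1: declare the column auto-incrementing from the start
CREATE TABLE items (
  id serial PRIMARY KEY,
  name text
);
-- route 2: retrofit a sequence onto a table created without one
CREATE TABLE items2 (id integer, name text);
CREATE SEQUENCE items2_id_seq OWNED BY items2.id;
ALTER TABLE items2 ALTER COLUMN id SET DEFAULT nextval('items2_id_seq');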

SQL Multiple Row Insert w/ multiple selects from different tables

I am trying to do a multiple-row insert based on values that I am pulling from another table. Basically I need to give every existing user who had access to an old service access to a new one. Table1 will take the data and run a job to do this.
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT
  (SELECT Max(id) + 1
   FROM Table1),
  '33',          -- the new service id
  clnt_alias_id,
  'PI'           -- the code to let the job know to grant access
FROM Table2
WHERE serv_id = '11' -- the old service id
I am getting a Primary key constraint error on id.
Please help.
Thanks,
Colin
This query is impossible. The max(id) sub-select will evaluate only ONCE and return the same value for all rows in the parent query:
MariaDB [test]> create table foo (x int);
MariaDB [test]> insert into foo values (1), (2), (3);
MariaDB [test]> select *, (select max(x)+1 from foo) from foo;
+------+----------------------------+
| x    | (select max(x)+1 from foo) |
+------+----------------------------+
|    1 |                          4 |
|    2 |                          4 |
|    3 |                          4 |
+------+----------------------------+
3 rows in set (0.04 sec)
You will have to run your query multiple times, once for each record you're trying to copy. That way the max(id) will get the ID from the previous query.
Is there a requirement that Table1.id be incremental ints? If not, just add the clnt_alias_id to Max(id). This is a nasty workaround though, and you should really try to get that column's type changed to auto_increment, like Marc B suggested.
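If the database supports window functions, a hedged single-statement rewrite (column names taken from the question) numbers the new rows on top of the current maximum instead of reusing one max(id) for every row:
INSERT INTO Table1 (id, serv_id, clnt_alias_id, serv_cat_rqst_stat)
SELECT m.max_id + ROW_NUMBER() OVER (ORDER BY t2.clnt_alias_id),
       '33',
       t2.clnt_alias_id,
       'PI'
FROM Table2 t2
CROSS JOIN (SELECT MAX(id) AS max_id FROM Table1) m
WHERE t2.serv_id = '11';
Note that this still races with concurrent inserts, which is why an identity/auto-increment column remains the cleaner fix.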

Is it possible to use a PG sequence on a per record label?

Does PostgreSQL 9.2+ provide any functionality to make it possible to generate a sequence that is namespaced to a particular value? For example:
.. | user_id | seq_id | body | ...
----------------------------------
- | 4 | 1 | "abc...."
- | 4 | 2 | "def...."
- | 5 | 1 | "ghi...."
- | 5 | 2 | "xyz...."
- | 5 | 3 | "123...."
This would be useful to generate custom urls for the user:
domain.me/username_4/posts/1
domain.me/username_4/posts/2
domain.me/username_5/posts/1
domain.me/username_5/posts/2
domain.me/username_5/posts/3
I did not find anything in the PG docs (regarding sequence and sequence functions) to do this. Are sub-queries in the INSERT statement or with custom PG functions the only other options?
You can use a subquery in the INSERT statement like @Clodoaldo demonstrates. However, this defeats the nature of a sequence as being safe to use in concurrent transactions; it will result in race conditions and eventually duplicate key violations.
You should rather rethink your approach. Just one plain sequence for your table and combine it with user_id to get the sort order you want.
You can always generate the custom URLs with the desired numbers using row_number() with a simple query like:
SELECT format('domain.me/username_%s/posts/%s'
             , user_id
             , row_number() OVER (PARTITION BY user_id ORDER BY seq_id)
             )
FROM tbl;
Maybe this answer is a little off-piste, but I would consider partitioning the data and giving each user their own partitioned table for posts.
There's a bit of setup overhead, as you will need triggers to manage the DDL statements for the partitions, but it effectively gives each user their own table of posts, along with their own sequence, with the benefit of still being able to treat all posts as one big table.
General gist of the concept...
psql# CREATE TABLE posts (user_id integer, seq_id integer);
CREATE TABLE
psql# CREATE TABLE posts_001 (seq_id serial) INHERITS (posts);
CREATE TABLE
psql# CREATE TABLE posts_002 (seq_id serial) INHERITS (posts);
CREATE TABLE
psql# INSERT INTO posts_001 VALUES (1);
INSERT 0 1
psql# INSERT INTO posts_001 VALUES (1);
INSERT 0 1
psql# INSERT INTO posts_002 VALUES (2);
INSERT 0 1
psql# INSERT INTO posts_002 VALUES (2);
INSERT 0 1
psql# select * from posts;
user_id | seq_id
---------+--------
1 | 1
1 | 2
2 | 1
2 | 2
(4 rows)
I left out some rather important CHECK constraints in the above setup; make sure you read the docs for how these kinds of setups are used.
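For instance, a hedged sketch of the constraints alluded to above, pinning each child table to its user:
psql# ALTER TABLE posts_001 ADD CHECK (user_id = 1);
psql# ALTER TABLE posts_002 ADD CHECK (user_id = 2);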
insert into t (user_id, seq_id) values
(4, (select coalesce(max(seq_id), 0) + 1 from t where user_id = 4))
Check for a duplicate primary key error in the front end and retry if needed.
Update
Although @Erwin's advice is sensible (that is, a single sequence with the ordering done in the select query), it can be expensive.
If you don't use a sequence, there is nothing to defeat about the nature of a sequence, nor will it result in duplicate key violations. To demonstrate, I created a table and wrote a Python script to insert into it. I launched 3 parallel instances of the script, inserting as fast as possible, and it just works.
The table must have a primary key on those columns:
create table t (
user_id int,
seq_id int,
primary key (user_id, seq_id)
);
The Python script:
#!/usr/bin/env python
import psycopg2, psycopg2.extensions

query = """
begin;
insert into t (user_id, seq_id) values
(4, (select coalesce(max(seq_id), 0) + 1 from t where user_id = 4));
commit;
"""
conn = psycopg2.connect('dbname=cpn user=cpn')
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE)
cursor = conn.cursor()
for i in range(0, 1000):
    while True:
        try:
            cursor.execute(query)
            break  # insert succeeded, move on to the next row
        except psycopg2.IntegrityError as e:
            # another session took this seq_id first; roll back and retry
            print(e.pgerror)
            cursor.execute("rollback;")
cursor.close()
conn.close()
After the parallel run:
select count(*), max(seq_id) from t;
count | max
-------+------
3000 | 3000
Just as expected. I developed at least two applications using that logic, and one of them is more than 13 years old and has never failed. I concede that if you are Facebook or some other giant you could have a problem.
Yes:
CREATE TABLE your_table
(
  column_name type DEFAULT NEXTVAL('sequence_name'),
  ...
);
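A concrete hedged instance (sequence, table, and column names invented for illustration):
CREATE SEQUENCE posts_seq;
CREATE TABLE posts
(
  seq_id integer DEFAULT NEXTVAL('posts_seq'),
  body text
);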
More details here:
http://www.postgresql.org/docs/9.2/static/ddl-default.html