Does HBase support ACID just as Hive does?

As https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions explains, Hive supports some limited ACID transactions. So, if I just need row-level transactions, is Hive enough? Are HBase's advantages becoming less and less relevant?
Thanks.

It is possible to do ACID transactions in HBase with Apache Phoenix, a layer for HBase which provides an SQL interface for handling data.
To use transactions, after installing Phoenix, set the property phoenix.transactions.enabled to true in your hbase-site.xml, then use the TRANSACTIONAL option when you create your table. For example:
CREATE TABLE my_table (id INTEGER PRIMARY KEY, val VARCHAR) TRANSACTIONAL=true;
Following that, you simply interact with your table normally, using SQL through JDBC or another interface. (Note that you can also alter an existing non-transactional table to make it transactional.)
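If you already have a non-transactional table, the Phoenix transactions page linked below also shows an ALTER statement for switching it over; a minimal sketch, with a made-up table name:
ALTER TABLE my_existing_table SET TRANSACTIONAL=true;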
For more, you can read about Phoenix and its transaction support at the project's website:
https://phoenix.apache.org/transactions.html

Related

Create index in huge MariaDB production database without table locking

I have a table with 202M records where I need to add a few indexes, and I can't find anywhere (or maybe I don't understand the lingo) whether that is possible to do without locking in MariaDB 10.3.
I found this post where I can see that it is possible in MySQL 5.6+, but my Google-fu didn't get me any info on MariaDB.
I tried using pt-online-schema-change, but since I don't have any index (not even a primary key), that is not an option.
This is possible with the use of ALTER ONLINE TABLE.
ALTER ONLINE TABLE is equivalent to LOCK=NONE. Therefore, the ALTER ONLINE TABLE statement can be used to ensure that your ALTER TABLE operation allows all concurrent DML.
Further reading shows that adding a primary key is a "copy" operation, as the DB engine needs to copy the whole table to a new file, but adding other indexes is an in-place operation.
InnoDB supports adding a primary key to a table with ALGORITHM set to INPLACE. The table is rebuilt, which means that all of the data is reorganized substantially, and the indexes are rebuilt. As a result, the operation is quite expensive. This operation supports the non-locking strategy. This strategy can be explicitly chosen by setting the LOCK clause to NONE. When this strategy is used, all concurrent DML is permitted.
InnoDB supports adding a plain index to a table with ALGORITHM set to INPLACE. The table is not rebuilt. This operation supports the non-locking strategy. This strategy can be explicitly chosen by setting the LOCK clause to NONE. When this strategy is used, all concurrent DML is permitted.
More info in the MariaDB documentation.
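For example, a non-locking index build might look like this (the table and column names are hypothetical, and behaviour can vary by MariaDB version):
ALTER TABLE big_table ADD INDEX idx_created_at (created_at), ALGORITHM=INPLACE, LOCK=NONE;
Or, using the shorthand from the quote above:
ALTER ONLINE TABLE big_table ADD INDEX idx_created_at (created_at);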

How can I update the rows in external Hive table when ACID properties are off?

The transaction manager is non-ACID, so I obviously cannot use ACID transactional operations here. I tried using INSERT OVERWRITE, but it only works on managed tables, not on external tables.
Is there a possible way to do it from PySpark?
PS: The Hive table gets loaded by a job in production. There are a few rows which we need to update manually, and the table is stored in AWS S3.

How to join a table which is in another database in postgres [duplicate]

I'm going to guess that the answer is "no" based on the error message below (and this Google result), but is there any way to perform a cross-database query using PostgreSQL?
databaseA=# select * from databaseB.public.someTableName;
ERROR: cross-database references are not implemented:
"databaseB.public.someTableName"
I'm working with some data that is partitioned across two databases, although data is really shared between the two (userid columns in one database come from the users table in the other database). I have no idea why these are two separate databases instead of two schemas, but c'est la vie...
Note: As the original asker implied, if you are setting up two databases on the same machine, you probably want to make two schemas instead; in that case you don't need anything special to query across them.
postgres_fdw
Use postgres_fdw (foreign data wrapper) to connect to tables in any Postgres database - local or remote.
Note that there are foreign data wrappers for other popular data sources. At this time, only postgres_fdw and file_fdw are part of the official Postgres distribution.
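A minimal setup might look like the sketch below; the server, database, credentials, and table names are all hypothetical, and IMPORT FOREIGN SCHEMA requires Postgres 9.5 or later:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
-- Register the other database as a foreign server and map your user to it.
CREATE SERVER otherdb_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'otherdb');
CREATE USER MAPPING FOR CURRENT_USER SERVER otherdb_server
    OPTIONS (user 'myuser', password 'mypass');
-- Either pull in a whole schema ...
IMPORT FOREIGN SCHEMA public FROM SERVER otherdb_server INTO public;
-- ... or declare a single foreign table explicitly.
CREATE FOREIGN TABLE remote_users (id integer, name text)
    SERVER otherdb_server OPTIONS (schema_name 'public', table_name 'users');
SELECT * FROM remote_users;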
For Postgres versions before 9.3
Versions this old are no longer supported, but if you need to do this in a pre-2013 Postgres installation, there is a function called dblink.
I've never used it, but it is maintained and distributed with the rest of PostgreSQL. If you're using the version of PostgreSQL that came with your Linux distro, you might need to install a package called postgresql-contrib.
dblink() -- executes a query in a remote database
dblink executes a query (usually a SELECT, but it can be any SQL statement that returns rows) in a remote database.
When two text arguments are given, the first one is first looked up as a persistent connection's name; if found, the command is executed on that connection. If not found, the first argument is treated as a connection info string as for dblink_connect, and the indicated connection is made just for the duration of this command.
A good example:
SELECT *
FROM table1 tb1
LEFT JOIN (
    SELECT *
    FROM dblink('dbname=db2', 'SELECT id, code FROM table2')
        AS remote_t(id int, code text)
) AS tb2 ON tb2.id = tb1.id;
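The two-argument form can also refer to a named persistent connection opened beforehand with dblink_connect; a short sketch (the connection name conn_db2 is made up):
SELECT dblink_connect('conn_db2', 'dbname=db2');
SELECT * FROM dblink('conn_db2', 'SELECT id, code FROM table2') AS tb2(id int, code text);
SELECT dblink_disconnect('conn_db2');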
Note: I am giving this information for future reference.
I have run into this before and came to the same conclusion about cross-database queries as you. What I ended up doing was using schemas to divide the table space; that way, I could keep the tables grouped but still query them all.
Just to add a bit more information.
There is no way to query a database other than the current one. Because PostgreSQL loads database-specific system catalogs, it is uncertain how a cross-database query should even behave.
contrib/dblink allows cross-database queries using function calls. Of course, a client can also make simultaneous connections to different databases and merge the results on the client side.
PostgreSQL FAQ
Yes, you can, by using dblink (PostgreSQL only), DBI-Link (which allows foreign cross-database queries), and TDS_link, which allows queries to be run against MS SQL Server.
I have used dblink and TDS_link before with great success.
I have checked and tried to create foreign key relationships between 2 tables in 2 different databases using both dblink and postgres_fdw, but with no result.
Having read other people's feedback on this, for example here and here and in some other sources, it looks like there is no way to do that currently:
dblink and postgres_fdw indeed enable one to connect to and query tables in other databases, something that is not possible with standard Postgres, but they do not allow you to establish foreign key relationships between tables in different databases.
If performance is important and most queries are read-only, I would suggest replicating data over to another database. While this seems like unneeded duplication of data, it might help if indexes are required.
This can be done with simple ON INSERT triggers which in turn call dblink to update another copy. There are also full-blown replication options (like Slony), but that's off-topic.
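A minimal sketch of such a trigger, assuming a local table users(id, name) mirrored into a database called reports_db (all names and credentials here are hypothetical; use EXECUTE PROCEDURE instead of EXECUTE FUNCTION on Postgres versions before 11):
CREATE EXTENSION IF NOT EXISTS dblink;
CREATE OR REPLACE FUNCTION replicate_user_insert() RETURNS trigger AS $func$
BEGIN
    -- Push the freshly inserted row into the copy kept in the other database.
    PERFORM dblink_exec(
        'dbname=reports_db user=myuser password=mypass',
        format('INSERT INTO users (id, name) VALUES (%s, %L)', NEW.id, NEW.name)
    );
    RETURN NEW;
END;
$func$ LANGUAGE plpgsql;
CREATE TRIGGER users_replicate
    AFTER INSERT ON users
    FOR EACH ROW EXECUTE FUNCTION replicate_user_insert();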
see https://www.cybertec-postgresql.com/en/joining-data-from-multiple-postgres-databases/ [published 2017]
These days you also have the option to use https://prestodb.io/
You can run SQL on that PrestoDB node and it will distribute the SQL query as required. It can connect to the same node twice for different databases, or it might be connecting to different nodes on different hosts.
It does not support:
DELETE
ALTER TABLE
CREATE TABLE (CREATE TABLE AS is supported)
GRANT
REVOKE
SHOW GRANTS
SHOW ROLES
SHOW ROLE GRANTS
So you should only use it for SELECT and JOIN needs. Connect directly to each database for the needs listed above. (It looks like you can also INSERT or UPDATE, which is nice.)
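For example, a cross-database join through Presto might look like this; the catalog names pg_main and pg_reports are hypothetical and depend on how your connectors are configured:
SELECT u.id, u.name, o.total
FROM pg_main.public.users u
JOIN pg_reports.public.orders o ON o.user_id = u.id;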
Client applications connect to PrestoDB primarily using JDBC, but other types of connection are possible, including a Tableau-compatible web API.
This is an open source tool governed by the Linux Foundation and Presto Foundation.
The founding members of the Presto Foundation are: Facebook, Uber, Twitter, and Alibaba.
The current members are: Facebook, Uber, Twitter, Alibaba, Alluxio, Ahana, Upsolver, and Intel.
In case someone needs a more involved example of how to do cross-database queries, here's one that cleans up the databasechangeloglock table on every database that has it:
CREATE EXTENSION IF NOT EXISTS dblink;
DO
$$
DECLARE
    database_name TEXT;
    conn_template TEXT;
    conn_string   TEXT;
    table_exists  BOOLEAN;
BEGIN
    conn_template := 'user=myuser password=mypass dbname=';
    -- Loop over every real (non-template) database in the cluster.
    FOR database_name IN
        SELECT datname FROM pg_database WHERE datistemplate = false
    LOOP
        conn_string := conn_template || database_name;
        -- Check whether the table exists in that database before touching it.
        table_exists := (SELECT table_exists_
                         FROM dblink(conn_string,
                                     'SELECT count(*) > 0 FROM information_schema.tables WHERE table_name = ''databasechangeloglock''')
                         AS (table_exists_ BOOLEAN));
        IF table_exists THEN
            PERFORM dblink_exec(conn_string, 'delete from databasechangeloglock');
        END IF;
    END LOOP;
END
$$;

Is it possible to roll back CREATE TABLE and ALTER TABLE statements in major SQL databases?

I am working on a program that issues DDL. I would like to know whether CREATE TABLE and similar DDL can be rolled back in
Postgres
MySQL
SQLite
et al
Describe how each database handles transactions with DDL.
http://wiki.postgresql.org/wiki/Transactional_DDL_in_PostgreSQL:_A_Competitive_Analysis provides an overview of this issue from PostgreSQL's perspective.
Is DDL transactional according to this document?
PostgreSQL - yes
MySQL - no; DDL causes an implicit commit
Oracle Database 11g Release 2 and above - by default, no, but an alternative called edition-based redefinition exists
Older versions of Oracle - no; DDL causes an implicit commit
SQL Server - yes
Sybase Adaptive Server - yes
DB2 - yes
Informix - yes
Firebird (Interbase) - yes
SQLite also appears to have transactional DDL. I was able to ROLLBACK a CREATE TABLE statement in SQLite, and its CREATE TABLE documentation does not mention any special transactional 'gotchas'.
PostgreSQL has transactional DDL for most database objects (certainly tables, indexes, etc., but not databases or users). However, practically any DDL will take an ACCESS EXCLUSIVE lock on the target object, making it completely inaccessible until the DDL transaction finishes. Also, not all situations are handled quite correctly; for example, if you try to select from table foo while another transaction is dropping it and creating a replacement table foo, the blocked transaction will finally receive an error rather than finding the new foo table. (Edit: this was fixed in or before PostgreSQL 9.3.)
CREATE INDEX ... CONCURRENTLY is an exception: it uses three transactions to add an index to a table while allowing concurrent updates, so it cannot itself be performed inside a transaction.
Also, the database maintenance command VACUUM cannot be used in a transaction.
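As a quick illustration, transactional DDL in PostgreSQL lets you do the following (the table name demo_rollback is made up):
BEGIN;
CREATE TABLE demo_rollback (id integer PRIMARY KEY);
ALTER TABLE demo_rollback ADD COLUMN note text;
ROLLBACK;
-- The table no longer exists after the rollback; this returns zero rows.
SELECT * FROM pg_tables WHERE tablename = 'demo_rollback';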
It can't be done with MySQL, it seems. Very dumb, but true... (as per the accepted answer):
"The CREATE TABLE statement in InnoDB is processed as a single
transaction. This means that a ROLLBACK from the user does not undo
CREATE TABLE statements the user made during that transaction."
https://dev.mysql.com/doc/refman/5.7/en/implicit-commit.html
I tried a few different ways, and it simply won't roll back.
A workaround is to simply set a failure flag and run DROP TABLE tblname if one of the queries fails.
Looks like the other answers are pretty outdated.
As of 2019:
Postgres has supported transactional DDL for many releases.
SQLite has supported transactional DDL for many releases.
MySQL has supported atomic DDL since 8.0 (released in 2018). Note, though, that atomic DDL makes each DDL statement crash-safe and all-or-nothing; DDL statements still cause an implicit commit, so they still cannot be rolled back by the user.
While it is not strictly speaking a "rollback", in Oracle the FLASHBACK command can be used to undo these types of changes, if the database has been configured to support it.
