Creating a hive table with ~40K columns - hive

I'm trying to create a fairly large table with Hive: ~3 million rows and ~40K columns. To begin, I'm creating an empty table and then inserting the data into it.
However, I hit an error when trying this:
Unable to acquire IMPLICIT, SHARED lock default after 100 attempts. FAILED: Error in acquiring locks: Locks on the underlying objects cannot be acquire. retry after some time
The query is pretty straightforward:
create external table database.dataset (
var1 decimal(10,2),
var2 decimal(10,2),
...
var40000 decimal(10,2)
) location 'hdfs://nameservice1/root/user1/project1';
Has anybody seen this error before? Cloudera says there is no limit on the number of columns, but I'm clearly hitting some system limitation here.
Additionally, I can create a smaller Hive table in the specified location.

Ran across this blog post which appears to identify and fix the problem: http://gbif.blogspot.com/2014/03/lots-of-columns-with-hive-and-hbase.html
Short answer: there is a limit on the size of the column metadata Hive can store in its metastore (the PARAM_VALUE column of the SERDE_PARAMS table), but you can raise it by widening that column in the metastore database:
alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;
Untested, as I went with a different tool to handle the data (for the problem above) since Hive was failing for unknown reasons. If you come across something similar, try this out and please post an update.
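For what it's worth, the current width can be checked directly in the metastore database before altering anything. A minimal sketch, assuming a PostgreSQL-backed metastore (which is what the quoted ALTER syntax targets); a MySQL-backed metastore would use SHOW COLUMNS / ALTER TABLE ... MODIFY COLUMN instead:
-- run against the metastore database itself, not through Hive
SELECT data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'SERDE_PARAMS'
  AND column_name = 'PARAM_VALUE';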

Related

DB2: Working with concurrent DDL operations

We are working on a data warehouse using IBM DB2 and we wanted to load data by partition exchange. That means we prepare a temporary table with the data we want to load into the target table and then use that entire table as a data partition in the target table. If there was previous data we just discard the old partition.
Basically you just do "ALTER TABLE target_table ATTACH PARTITION pname [starting and ending clauses] FROM temp_table".
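For concreteness, the flow looks roughly like this (a sketch with hypothetical table, partition and boundary values; the range clauses depend on the partitioning key):
-- staging table with the same column layout as the target
CREATE TABLE temp_sales LIKE sales_target;
-- ... load temp_sales with the new partition's data ...
ALTER TABLE sales_target
    ATTACH PARTITION p_2024_01
    STARTING FROM ('2024-01-01') ENDING AT ('2024-01-31')
    FROM temp_sales;
-- the attached data only becomes visible once SET INTEGRITY completes
SET INTEGRITY FOR sales_target IMMEDIATE CHECKED;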
It works wonderfully, but only for one operation at a time. If we do multiple loads in parallel or try to attach multiple partitions to the same table it's raining deadlock errors from the database.
From what I understand, the problem isn't necessarily with parallel access to the target table itself (locking it changes nothing), but accesses to system catalog tables in the background.
I have combed through the DB2 documentation, but the only reference to concurrent DDL statements I found at all was advice to avoid them. Surely the answer can't simply be to not attempt it?
Does anyone know a way to deal with this problem?
I tried having a single global synchronization table that has to be locked before attaching any partition, but it didn't help either. Either I'm missing something (implicit commits somewhere?) or some of the catalog updates happen asynchronously, which would make the whole problem much worse. If that is the case, is there any chance at all to query whether an attach is safe to perform at a given moment?
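For reference, the serialization attempt mentioned above looked roughly like this (a sketch, not the actual code; ddl_mutex is a hypothetical one-row table, and whether a commit sneaks in before the ATTACH is exactly the open question):
-- created once, shared by all load jobs
CREATE TABLE ddl_mutex (id INT NOT NULL PRIMARY KEY);
-- each loader takes an exclusive lock before attaching its partition
LOCK TABLE ddl_mutex IN EXCLUSIVE MODE;
ALTER TABLE sales_target
    ATTACH PARTITION p_2024_02
    STARTING FROM ('2024-02-01') ENDING AT ('2024-02-29')
    FROM temp_sales_feb;
COMMIT;  -- releases the lock; an earlier implicit commit would release it too soon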

How to track changes for certain database tables?

I have a program that takes a user and updates information about him/her in five tables. The process is fairly sophisticated, as it takes many steps (pages) to complete. I have logs, sysout and syserr statements that help me find the SQL queries in the IDE console, but they don't cover all of them. I've already spent many days trying to catch the missing queries by debugging, with no luck so far. The reason I'm doing this is that I want to automate the user information updates so I don't have to go through every page entering user details manually.
I wonder if there is some technique that will show me the changes to the database tables (I already know the table names) - by changes I mean whether it was an update or insert statement and what exactly changed (column name and value inserted/updated). Any advice is greatly appreciated. I have IBM RAD and a DB2 database. Thanks.
In DB2 you can track basic auditing information.
DB2 can track what data was modified, who modified the data, and the SQL operation that modified the data.
To track when data was modified, define your table as a system-period temporal table. The row-begin and row-end columns in the associated history table contain information about when data modifications occurred.
To track who and what SQL modified the data, you can use non-deterministic generated expression columns. These columns can contain values that are helpful for auditing purposes, such as the value of the CURRENT SQLID special register at the time that the data was modified. Possible values for non-deterministic generated expression columns are defined in the syntax for the CREATE TABLE and ALTER TABLE statements.
For example:
CREATE TABLE TempTable (
    balance INT,
    userId VARCHAR(128) GENERATED ALWAYS AS ( SESSION_USER ),
    opCode CHAR(1) GENERATED ALWAYS AS ( DATA CHANGE OPERATION )
    -- row-begin, row-end and transaction-start-id column definitions elided
    ... PERIOD SYSTEM_TIME (SYS_START, SYS_END));
The userId column stores who modified the data. This column is defined as a non-deterministic generated expression column that contains the value of SESSION_USER special register.
The opCode column stores the SQL operation that modified the data. This column is defined as a non-deterministic generated expression column and stores a value that indicates the type of SQL operation.
Suppose that you then use the following statements to create a history table for TempTable and to associate that history table with TempTable:
CREATE TABLE TempTable_HISTORY (
    balance INT,
    userId VARCHAR(128),
    opCode CHAR(1) ... );  -- column names and types must match the base table
ALTER TABLE TempTable ADD VERSIONING
    USE HISTORY TABLE TempTable_HISTORY ON DELETE ADD EXTRA ROW;
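To illustrate (hypothetical data; the generated columns are never supplied explicitly), each data change then records who made it and which kind of operation it was:
-- the generated columns are filled in automatically on every data change
INSERT INTO TempTable (balance) VALUES (100);
UPDATE TempTable SET balance = 150;
-- current rows show the latest modifier and the operation code ('I', 'U' or 'D')
SELECT balance, userId, opCode FROM TempTable;
-- superseded row versions, with their audit values, land in the history table
SELECT balance, userId, opCode FROM TempTable_HISTORY;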
Capturing SQL statements for a limited number of tables and a limited time - as far as I understand your problem - could be solved with the DB2 Audit facility.
create audit policy tabsql categories execute status both error type normal
audit table <tabname> using policy tabsql
You have to have SECADM rights in the database, and the second command starts the auditing. You can stop it with
audit table <tabname> remove policy
Check out the db2audit command to configure paths and to extract the data from the audit files into delimited files, which can then be loaded back into the database.
The necessary tables can be created with the provided sqllib/misc/db2audit.ddl script. You will need to query the EXECUTE table for your SQL details.
Please note that auditing can capture huge amounts of data, so make sure to switch it off again after you have captured the necessary information.

Teradata Drop Column returns with "no more room"

I am trying to drop a varchar(100) column from a 150 GB table (4.6 billion records). All the data in this column is null. I have 30 GB of free space left in the database.
When I attempt to drop the column, it fails with "no more room in database XY". Why does such an action need so much space?
The ALTER TABLE statement needs temporary storage for the altered version of the table before it overwrites the original. I guess the table you are trying to alter occupies at least 1/3 of your total storage, so that working copy simply does not fit in the 30 GB you have free.
This could happen for a variety of reasons. It's possible that one of the AMPs in your database is full; that would cause this error even with a minor table alteration.
Try running the following SQL to check space:
select VProc, CurrentPerm, MaxPerm
from dbc.DiskSpace
where DatabaseName='XY';
Also, check which column the primary index of this very large table is defined on. If the primary index does not distribute the rows evenly across the AMPs (i.e. the table is skewed), you can also run into space issues when altering the table or running queries against it.
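To check for skew specifically, something along these lines (a sketch; the database and table names are placeholders) shows how evenly the table's rows are spread across the AMPs - one AMP far above the rest points to a poor primary index choice:
select VProc, CurrentPerm
from dbc.TableSize
where DatabaseName='XY'
  and TableName='BigTable'
order by CurrentPerm desc;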
For additional suggestions, I found a decent article on the kinds of things you may want to investigate when the "no more room in database" error occurs - Teradata SQL Tutorial. Some of the suggestions include:
dropping any intermediary work or "sandbox" tables
implementing single-value or multi-value compression (see the sketch after this list)
dropping unwanted/unnecessary secondary indexes
removing data in DBC tables such as the access log or DBQL tables
removing and archiving old tables that are no longer used
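On the compression point, here is a sketch of what multi-value compression looks like in a table definition (a hypothetical table, not yours); since the column you are dropping is entirely null, even the bare COMPRESS form would have reclaimed most of its space:
create table sandbox.settings_archive (
  setting_id integer,
  status char(6) compress ('OPEN', 'CLOSED'),  -- multi-value compression
  flag char(1) compress                        -- bare compress stores nulls for free
) primary index (setting_id);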

How is the ALTER TABLE command handled by SQL Server?

We are using SQL Server 2008. We have an existing database, and we needed to add a new column to one of the tables, which has only 2700 rows but one column of type VARCHAR(8000). When I try to add the new column (CHAR(1) NULL) using the ALTER TABLE command, it takes far too long: it had been running for 5 minutes and was still going, so I stopped it.
Below is the command I was using to add the new column:
ALTER TABLE myTable Add ColumnName CHAR(1) NULL
Can someone help me understand how SQL Server handles the ALTER TABLE command? What happens exactly?
Why does it take so much time to add a new column?
EDIT:
What is the effect of table size on the ALTER command?
Altering a table requires a schema lock. Many other operations require the same lock too; after all, it wouldn't make sense to add a column halfway through a select statement.
So a likely explanation is that another process had the table locked for those 5 minutes. The ALTER then has to wait until it can acquire the lock itself.
You can see blocked processes, and the blocking process, from the Activity Monitor in SQL Server Management Studio.
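If you prefer a query over the GUI, here is a quick sketch of the same check (sys.dm_exec_requests is the standard DMV; nothing here is specific to your database):
-- sessions that are currently blocked, and which session is blocking them
SELECT session_id, blocking_session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;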
Well, one thing to bear in mind is that you were adding a new fixed-length column to the table. Because of the way rows are laid out in storage, all fixed-length columns are placed before all of the variable-length columns in each row, so every row would have had to be rewritten in storage to make this change.
If, in turn, this changed the number of rows that fit on each page, a great many new allocations may have been required.
That being said, for the number of rows indicated I wouldn't have thought it should take 5 minutes - unless, as Andomar indicated, there was some lock contention involved as well.

Prevent update to non-existent rows

At work we have a table to hold settings which essentially contains the following columns:
PARAMNAME
VALUE
Most of the time new settings are added, but on rare occasions settings are removed. Unfortunately this means that any script which previously updated such a value will continue to do so, even though the update now results in "0 rows updated", and this leads to unexpected behaviour.
This situation was picked up recently by a regression test failure but only after much investigation into why the data in the system was different.
So my question is: Is there a way to generate an error condition when an update results in zero rows updated?
Here are some options I have thought of, but none of them are really all that desirable:
PL/SQL wrapper which notices the failed update and throws an exception.
Not ideal as it doesn't stop anyone/a script from manually doing an update.
A trigger on the table which throws an exception.
Goes against our current policy of phasing out triggers.
Requires updating trigger every time a setting is removed and maintaining a list of obsolete settings (if doing exclusion).
Might have problems with mutating table (if doing inclusion by querying what settings currently exist).
A PL/SQL wrapper seems like the best option to me. Triggers are a great thing to phase out, with the exception of generating sequences and inserting history records.
If you're concerned about someone manually updating rather than using the PL/SQL wrapper, just restrict the user role so that it does not have UPDATE privileges on the table but has EXECUTE privileges on the procedure.
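A minimal sketch of that combination (the column names come from the question; the table, procedure and role names are made up):
-- wrapper that fails loudly when the setting does not exist
CREATE OR REPLACE PROCEDURE set_param(p_name IN VARCHAR2, p_value IN VARCHAR2) AS
BEGIN
  UPDATE settings
     SET value = p_value
   WHERE paramname = p_name;
  IF SQL%ROWCOUNT = 0 THEN
    RAISE_APPLICATION_ERROR(-20001, 'Unknown setting: ' || p_name);
  END IF;
END set_param;
/
-- applications get the procedure, not the table
GRANT EXECUTE ON set_param TO app_role;
REVOKE UPDATE ON settings FROM app_role;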
Not really a solution but a method to organize things a bit:
Create a separate table with the parameter definitions and link to that table from the parameter value table. Make the reference to the parameter definition required (nulls not allowed).
Definition table PARAMS (ID, NAME)
Actual settings table PARAM_VALUES (PARAM_ID, VALUE)
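A sketch of that layout in DDL (the column types are guesses; the names follow the ones given above):
CREATE TABLE params (
  id   NUMBER        PRIMARY KEY,
  name VARCHAR2(100) NOT NULL UNIQUE
);
CREATE TABLE param_values (
  param_id NUMBER NOT NULL REFERENCES params (id),
  value    VARCHAR2(4000)
);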
(changing your table structure is also a very effective way to evoke errors in scripts that have not been updated...)
Maybe you can use a MERGE statement.
Here is a link for it:
http://www.oracle-developer.net/display.php?id=203
The MERGE statement lets you combine insert and update in the same query, so if the desired row does not exist you can insert a record into a buffer table to flag that it is missing, or else update the required record.
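Here is a rough sketch of the MERGE idea against the settings table (the table name is assumed, the column names come from the question; the buffer-table variant described above would need a slightly different shape):
MERGE INTO settings s
USING (SELECT 'SOME_PARAM' AS paramname, '42' AS value FROM dual) src
   ON (s.paramname = src.paramname)
 WHEN MATCHED THEN
   UPDATE SET s.value = src.value
 WHEN NOT MATCHED THEN
   INSERT (paramname, value) VALUES (src.paramname, src.value);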
Hope it helps