I have a scenario where I need to increase hbase.client.scanner.caching from 100 to 10000. But I don't want to make this a permanent change; I only need it for the particular session in which I am querying from the Hive query engine. Is there any way to set this property for just that session?
i.e.
set hbase.client.scanner.caching = 10000;
SELECT count(*) FROM hive_external_table;
-- but setting the parameter is not taking any effect,
-- where hive_external_table is an external table mapped from an HBase table
Yes, you can definitely set the property value that way; just don't put whitespace around the = between the key and the value.
Use the following:
hive> set hbase.client.scanner.caching=10000;
hive> SELECT count(*) FROM hive_external_table;
It will override the default value for the current session.
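To confirm that the override took effect, issue set with just the key; the Hive CLI echoes the property's current session value:
hive> set hbase.client.scanner.caching;
hbase.client.scanner.caching=10000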
Currently, I have these Hive properties:
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=undefined;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;
This, by default, creates ACID tables.
I would like to create non-ACID tables by default. If I want to create non-ACID tables, should I change hive.txn.manager to DummyTxnManager?
When users want to create a transactional table, they should explicitly set transactional=true while creating the table. In that case, how does the transactional table get its transactional features from DbTxnManager?
I would like to know on what basis DbTxnManager is applicable if we don't have these properties set in hive-site.xml.
Also, I want to know the difference between DbTxnManager and setting transactional=true on a table.
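For reference, the explicit opt-in described above looks roughly like this (a sketch with hypothetical table names; older Hive versions additionally require ACID tables to be bucketed ORC):
-- explicitly transactional (ACID) table; needs DbTxnManager active at query time
CREATE TABLE t_acid (id INT, val STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
-- plain (non-ACID) table: simply omit the property
CREATE TABLE t_plain (id INT, val STRING)
STORED AS ORC;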
I am using SSDT 2017 and I am working on a solution that basically gets a full result set from a query into a variable (one column only: AccountID), and I need to include the values from that object variable in a query, something like this:
"SELECT * FROM dbo.account WHERE AccountID IN (" + #AccountIDObjectVariable + ")"
I tried with an expression but I get an error, so I am not sure if there's a better way. I also tried a Foreach Loop Container, but since I have millions of records in the object variable I don't think that's the best way.
Any ideas?
It doesn't work that way, where "it" is going to be a host of things.
The SSIS data types are primitive types (boolean, date, numbers) or Object. The only supported operations for Object are a null check and enumeration.
SSIS parameterization is only for equality-based substitutions. There is no concept of a list data type in SQL, so there's no analog in SSIS.
I have millions of records in the object variable
Even if you converted your list to a string and used string concatenation, the next problem you're going to run into is the 4000-character string length limit.
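To make the equality-substitution point concrete, here is roughly what a parameterized source query can and cannot do (a sketch; each ? placeholder binds exactly one scalar value):
-- supported: one placeholder, one scalar value
SELECT * FROM dbo.account WHERE AccountID = ?;
-- not supported: a placeholder cannot expand into a list of values
-- SELECT * FROM dbo.account WHERE AccountID IN (?);  -- still binds a single scalar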
What is the way?
Let's reset the problem: you have a non-trivial set of identities from a source system, and that set of IDs needs to be used as the basis for a subsequent extract.
Are the source of identities and the actual data on the same server?
While you can empty the ocean with a teaspoon, it's not the correct tool. The same holds true here: move the query that identifies the recordset to be extracted into a filter condition for your source.
i.e.
Load the dataset into #AccountIDObjectVariable:
SELECT
    OA.AccountId
FROM
    dbo.OutstandingAccount AS OA;
The extract that isn't working,
"SELECT * FROM dbo.account WHERE AccountID IN (" + #AccountIDObjectVariable + ")"
is rewritten as:
SELECT * FROM dbo.account AS A WHERE EXISTS (SELECT * FROM dbo.OutstandingAccount AS OA WHERE OA.AccountID = A.AccountID);
There are two reasonable approaches to solving this.
Pull it all
If the list of source IDs and the source table are of similar orders of magnitude, it might be easier to just bring it all down and use the account-ID-generating query in a Lookup Transformation. If AccountID exists in the lookup, then it's data you want. Yes, you pulled more than you needed, but you likely would have burned more cycles and complexity trying to selectively pull what you wanted.
Push and pull
This approach will work for SQL Server; I have no idea about other databases, though I suppose Sybase would behave the same given the shared database paternity.
Open SSMS and create a global temporary table in the database where dbo.account lives. Do not disconnect from SSMS.
IF OBJECT_ID('tempdb..##SO_66961235') IS NOT NULL
BEGIN
    DROP TABLE ##SO_66961235;
END
GO
CREATE TABLE ##SO_66961235
(
    AccountID int NOT NULL
);
Modify the connection manager for the dbo.account database to set its RetainSameConnection property to True.
Execute SQL Task - Make Temp Table
Use the connection to the account database and the query above. This ensures the table exists for the SSIS sessions that follow.
Data Flow 1 - Load IDs
In the data flow's properties, set DelayValidation to True.
Use your source query to generate the list of IDs and select the temporary table as the destination. You might need a second connection manager to this system pointed at tempdb; it's been a long time since I've done this. The same rule about RetainSameConnection holds true, though.
When this data flow completes, we will have a temporary table on the source server that we can reference.
Data Flow 2 - Get Data
Again, set DelayValidation to True.
The source will be a query:
SELECT * FROM dbo.account AS A WHERE EXISTS (SELECT * FROM ##SO_66961235 AS OA WHERE OA.AccountID = A.AccountID);
What's with all the delay validation?
When an SSIS package starts, the first thing it does is ensure all the pieces are in place for it to run successfully, checking not only that the pieces exist but that the shape of the data is still the same. A temporary table won't exist when the package starts, so the package would fail with a VS_NEEDSNEWMETADATA error. Setting DelayValidation to True tells SSIS not to validate a component's metadata until that component actually gets the signal to start. Since the preceding Execute SQL Task creates the table, the validation should then succeed.
I used global temporary tables here. You can use locally scoped temporary tables, but they make the already fiddly design process much more so. Were it me, I'd have a Boolean package parameter that selects a global temp table for development sessions and a local temp table for actual run-time operations, but that's beyond the scope of this question.
I have one environment in which queries reference more than 100 tables. Now I need to run the same queries in a read-only environment, which means I need to write <schema_name>.<table_name> for every table. Because the environment is read-only, I cannot create synonyms for all of them.
Instead of prefixing each table with the schema name, is there any shortcut? I am just guessing whether anything is possible. The tables all belong to the same schema.
Try this out. It sets your session's default schema to the specified schema, and as a consequence there is no need to provide the <schema_name> prefix:
ALTER SESSION SET CURRENT_SCHEMA = <schema_name>;
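For example, assuming a hypothetical schema app_owner that owns a table called employees:
ALTER SESSION SET CURRENT_SCHEMA = app_owner;
-- unqualified names now resolve against app_owner
SELECT count(*) FROM employees;  -- same as app_owner.employees
The setting lasts only for the current session and grants no extra privileges; you still need SELECT rights on the underlying tables.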
In the project I'm working on, the IDs for certain insert statements are managed by hdbsequences. Now I want to create a sequence for another table that already contains data, and I want it to start with the maximum ID value found in that table.
I know I could just manually set the start_with property, but that is not an option, because we need to transport the sequence to another system later, where the data in the corresponding table is not the same as on the current system (and therefore the maximum ID is different).
I also know of the reset_by property, in which I can select the max value of the table; the problem is that I don't know how to trigger that explicitly.
What I have found out is that the reset_by query runs whenever the database is restarted, but unfortunately that is also not an option, because we can't restart the database without disrupting the other systems.
Thanks in advance for your time and help.
You can do an ALTER SEQUENCE and set the value to be used by the next sequence access with the RESTART WITH option.
For instance (schema name and sequence name have to be replaced):
alter sequence "<schema name>"."<sequence name>" restart with 100;
The integer value after RESTART WITH has to be the value that should be used next. So if your last ID is 100, set it to 101; 101 is then the value returned by the next NEXTVAL call on the sequence.
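To drive that value from the table's current contents without a restart, one option is a small anonymous SQLScript block that reads the current maximum and issues the ALTER SEQUENCE dynamically (a sketch; the schema, table, column, and sequence names are hypothetical):
DO BEGIN
    DECLARE max_id BIGINT;
    DECLARE stmt NVARCHAR(500);
    -- next value = current maximum ID + 1 (1 for an empty table)
    SELECT IFNULL(MAX(id), 0) + 1 INTO max_id FROM "MYSCHEMA"."MYTABLE";
    -- RESTART WITH expects a literal, so build and execute the DDL dynamically
    stmt := 'ALTER SEQUENCE "MYSCHEMA"."MYSEQUENCE" RESTART WITH ' || TO_VARCHAR(:max_id);
    EXEC :stmt;
END;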
Our database is set up so that each of our clients is hosted in a separate schema (the organizational level above a table in Postgres/Redshift, not the database structure definition). We have a table in the public schema that holds metadata about our clients, and I want to use some of that metadata in a view I am creating.
Say I have two tables:
public.clients
    name_of_schema_for_client
    metadata_of_client
client_name.usage_info
    (whatever columns; they aren't that important)
I basically want to get the metadata for the client I'm running my query on and use it later:
SELECT *
FROM client_name.usage_info
INNER JOIN public.clients
ON CURRENT_SCHEMA() = public.clients.name_of_schema_for_client
This is not possible, because CURRENT_SCHEMA() is a leader-node function: it returns an error if it is used in a query that references a user-created table, an STL or STV system table, or an SVV or SVL system view (see https://docs.aws.amazon.com/redshift/latest/dg/r_CURRENT_SCHEMA.html).
Is there another way to do this? Or am I just barking up the wrong tree?
Your best bet is probably to manually set the search path within the transaction, from whatever source you call this from. See:
https://docs.aws.amazon.com/redshift/latest/dg/r_search_path.html
Let's say you only want to use the tables matching your best client:
set search_path to your_best_clients_schema, whatever_other_schemas_you_need_for_this;
Then you can just do:
select * from clients;
This will match the first clients table available on the search path, which, conveniently, you just set to your client's schema!
You can manually revert afterwards if need be, or just reset the connection to return to the default; it's up to you.
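For completeness, checking and reverting the setting are one-liners as well (standard session commands in Redshift):
show search_path;   -- inspect the current value
reset search_path;  -- return to the default search path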