How to configure datanucleus.connectionPool.maxPoolSize in hive? - hive

I have a question about datanucleus.connectionPool.maxPoolSize in Hive.
According to the Hive wiki, the following formula should be used to obtain the datanucleus.connectionPool.maxPoolSize value:
(2 * pool_size * metastore_instances + 2 * pool_size * HS2_instances_with_embedded_metastore) = (2 * physical_core_count + hard_disk_count).
What exactly do physical_core_count and hard_disk_count mean in this formula?
Suppose I have one server with 96 CPU cores and 10 disks and another with 96 CPU cores and 5 disks, and I have configured a Hive metastore in remote mode on each server. Each server also runs a HiveServer2 instance.
Then,
Q1) Is it correct that there are 2 metastore_instances and 0 HS2_instances_with_embedded_metastore?
Q2) In this case, what is 2 * physical_core_count + hard_disk_count? Is 'physical_core_count' the average number of CPU cores across the two servers?
What is the hard_disk_count value?
Please let me know
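As a worked illustration of the arithmetic only, the formula quoted above can be solved for pool_size. This is a minimal sketch: the function name is hypothetical, and the inputs (96 cores, 10 or 5 disks, 2 metastore instances, 0 HS2 instances with an embedded metastore) are simply the numbers from the question plugged in as placeholders. Which machine's cores and disks the formula actually refers to is exactly the open question here, so the result is not a recommendation.

```python
def max_pool_size(physical_core_count, hard_disk_count,
                  metastore_instances, hs2_embedded_instances):
    """Solve 2*pool*(metastores + hs2_embedded) = 2*cores + disks for pool.

    Hypothetical helper: it only rearranges the formula quoted from the wiki.
    """
    consumers = metastore_instances + hs2_embedded_instances
    if consumers <= 0:
        raise ValueError("need at least one metastore or embedded HS2 instance")
    return (2 * physical_core_count + hard_disk_count) / (2 * consumers)


# Placeholder inputs taken from the question (assumption: 2 remote
# metastores, no HiveServer2 instance with an embedded metastore).
print(max_pool_size(96, 10, 2, 0))  # 50.5
print(max_pool_size(96, 5, 2, 0))   # 49.25
```

The setting itself is an integer, so a fractional result like this would still have to be rounded before use.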

Related

pypyodbc sql query of cloud stored ms access database is slow when querying newest data but fast when querying oldest data

I'm using pypyodbc and pandas.read_sql_query to query a cloud stored MS Access Database .accdb file.
from datetime import datetime

import pandas as pd
import pypyodbc

def query_data(group_id, dbname=r'\\cloudservername\myfile.accdb', table_names=['ContainerData']):
    start_time = datetime.now()
    print(start_time)
    pypyodbc.lowercase = False
    conn = pypyodbc.connect(
        r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};" +
        r"DBQ=" + dbname + r";")
    connection_time = datetime.now() - start_time
    print("Connection Time: " + str(connection_time))
    # group_id is passed as a string and concatenated into the SQL text
    querystring = ("SELECT TOP 10 Column1, Column2, Column3, Column4 FROM " +
                   table_names[0] + " WHERE Column0 = " + group_id)
    my_data = pd.read_sql_query(querystring, conn)
    print("Query Time: " + str(datetime.now() - start_time - connection_time))
    conn.close()
    return my_data
The database has about 30,000 rows. The group_id values are sequential numbers from 1 to 3000, with 10 rows assigned to each group. For example, rows 1-10 in the database (oldest data) all have group_id = 1. Rows 29,991-30,000 (newest data) all have group_id = 3000.
When I store the database locally on my PC and run query_data('1') the connection time is 0.1s and the query time is 0.01s. Similarly, running query_data('3000') the connection time is 0.2s and the query time is 0.08s.
When the database is stored on the cloud server, the connection time varies from 20-60s. When I run query_data('1') the query time is ~3 seconds. NOW THE BIG ISSUE: When I run query_data('3000') the query time is ~10 minutes!
I've tried using ORDER BY group_id DESC but that causes both queries to take ~ 10 minutes.
I've also tried changing the "Order by" group_id to Descending in the accdb itself and setting "Order by on load" to yes. Neither of these seems to change how the SQL query locates the data.
The problem is, the code I'm using almost always needs to find the newest data (e.g. group_id = max), which takes the longest to find. Is there a way to have the SQL query reverse its search order, so that the newest entries are looked through first, rather than the oldest entries? I wouldn't mind a 3 second (or even 1 minute) query time, but a 10 minute query time is too long. Or is there a setting I can change in the Access database to change the order in which the data is stored?
I've also watched the network monitor while running the script, and python.exe steadily sends about 2kb/s and receives about 25kb/s throughout the full 10 minute duration of the script.

SQL query to combine column data from multiple columns

SQL Server 2012, Python 3.
Background Info:
I have two tables; LOADS and FACTORS.
LOADS has the following columns:
FACTORS has the following columns:
Table FACTORS is basically a roadmap for how to combine and factor the data in table LOADS.
FACTORS columns mlc and tlc contain the same numbers as LOADS table column subcase.
In this example we will only use value 32771 for LOADS column eid.
Problem:
I need to get a set of results with columns similar to LOADS but with additional columns, like this:
eid, mlc, tlc, fx, fy, fz.
To do this manually:
start with the first row of FACTORS. mlc=1053002, tlc=4053400
select fx, fy, fz from LOADS where subcase=mlc and eid=32771 (this is our mlc loads)
select fx, fy, fz from LOADS where subcase=tlc and eid=32771 (this is our tlc loads)
the final values in columns fx, fy, fz are from these equations:
(mlc.fx * mf + tlc.fx * tf) * cf
(mlc.fy * mf + tlc.fy * tf) * cf
(mlc.fz * mf + tlc.fz * tf) * cf
If tlc=0 then only use mlc. The fx, fy, fz values are from these equations:
(mlc.fx * mf) * cf
(mlc.fy * mf) * cf
(mlc.fz * mf) * cf
I am somewhere between beginner and intermediate using SQL so I have no idea how to do this using only SQL. I have successfully done this using pandas basically doing this the manual way by creating a blank DataFrame, calculating the (3) DOF fx,fy,fz and adding Rows one at a time, building the DataFrame from the ground up. I can share that code if anyone needs to see it, but I really want to do this in SQL if it's possible. The reason is because most general queries using this procedure can take several minutes (for 50,000+ Rows, granted my pandas knowledge might not be too efficient for real world usage) and I really want to get that time down to a few seconds.
Following ZLK's comment, it seems like you want this:
select
    LoadsMlc.eid,
    Factors.mlc,
    Factors.tlc,
    (LoadsMlc.fx * Factors.mf + isnull(LoadsTlc.fx * Factors.tf, 0)) * Factors.cf fx,
    (LoadsMlc.fy * Factors.mf + isnull(LoadsTlc.fy * Factors.tf, 0)) * Factors.cf fy,
    (LoadsMlc.fz * Factors.mf + isnull(LoadsTlc.fz * Factors.tf, 0)) * Factors.cf fz
from
    dbo.Factors
    join dbo.Loads LoadsMlc on Factors.mlc = LoadsMlc.subcase
    left outer join dbo.Loads LoadsTlc on nullif(Factors.tlc, 0) = LoadsTlc.subcase
This assumes there is exactly one match for subcase against mlc; your question doesn't currently indicate one way or the other. If there are potentially multiple matches across the Loads table for each row of Factors, you may want to add something like
where
    LoadsTlc.eid is null -- We didn't find a tlc (it was 0)
    or LoadsMlc.eid = LoadsTlc.eid
Note that the tlc join is a left outer join, as you indicated that we should only use mlc when tlc is 0.
The second join to Loads uses nullif to turn a tlc of 0 into null. In that case no LoadsTlc record is joined for that row, and the three columns account for the null by adding 0.
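To try the join pattern above without a SQL Server instance, here is a small self-contained sketch using Python's built-in sqlite3 module. Several things in it are assumptions rather than part of the question or the answer: the column lists for Loads and Factors are guessed from the names used in the text, the mf/tf/cf values and all load values are made up, and SQLite's COALESCE stands in for SQL Server's isnull. The eid match is placed in the ON clause of the left join here, which is one possible variant of the where clause suggested above.

```python
import sqlite3

# SQLite variant of the join: COALESCE replaces SQL Server's isnull().
QUERY = """
SELECT m.eid, f.mlc, f.tlc,
       (m.fx * f.mf + COALESCE(t.fx * f.tf, 0)) * f.cf AS fx,
       (m.fy * f.mf + COALESCE(t.fy * f.tf, 0)) * f.cf AS fy,
       (m.fz * f.mf + COALESCE(t.fz * f.tf, 0)) * f.cf AS fz
FROM Factors AS f
JOIN Loads AS m ON f.mlc = m.subcase
LEFT JOIN Loads AS t ON NULLIF(f.tlc, 0) = t.subcase AND t.eid = m.eid
WHERE m.eid = ?
ORDER BY f.mlc
"""


def make_demo_db():
    """Build an in-memory database with an assumed schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Loads (eid INTEGER, subcase INTEGER, "
                 "fx REAL, fy REAL, fz REAL)")
    conn.execute("CREATE TABLE Factors (mlc INTEGER, tlc INTEGER, "
                 "mf REAL, tf REAL, cf REAL)")
    conn.executemany("INSERT INTO Loads VALUES (?, ?, ?, ?, ?)", [
        (32771, 1053002, 10.0, 20.0, 30.0),  # mlc loads for the first factor row
        (32771, 4053400, 1.0, 2.0, 3.0),     # tlc loads for the first factor row
        (32771, 1053003, 5.0, 5.0, 5.0),     # mlc loads for a row with tlc = 0
    ])
    conn.executemany("INSERT INTO Factors VALUES (?, ?, ?, ?, ?)", [
        (1053002, 4053400, 1.0, 2.0, 1.5),
        (1053003, 0, 2.0, 0.0, 1.0),
    ])
    return conn


def combined_loads(conn, eid):
    """Return (eid, mlc, tlc, fx, fy, fz) rows for one element id."""
    return conn.execute(QUERY, (eid,)).fetchall()


for row in combined_loads(make_demo_db(), 32771):
    print(row)
```

With this made-up data, the first row combines both subcases, (10 * 1.0 + 1 * 2.0) * 1.5 = 18.0 for fx, while the second row has tlc = 0 and uses only the mlc term, (5 * 2.0) * 1.0 = 10.0.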

BUG in SQL query "select * in table" using RODBC package with ODBC Driver 13 for SQL Server in R

There seems to be a problem with ODBC Driver 13 for SQL Server (running on local Ubuntu 16.04) with the RODBC package (version 1.3-15) in R (version 3.4.1 (2017-06-30)). First, we make a query to see the size of the SQL table called TableName.
library(RODBC)
connectionString <- "Driver={ODBC Driver 13 for SQL Server};Server=tcp:<DATABASE-URL>,<NUMBER>;Database=<DATABASE NAME>;Uid=<USER ID>;Pwd=<PASSWORD>;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
connection <- odbcDriverConnect(connectionString)
count <- sqlQuery(connection, 'select count(*) from TableName')
odbcGetErrMsg(connection)
The above gives a count value of 200,000 (odbcGetErrMsg returns no errors), which is known to be the correct size of the SQL table called TableName.
Now comes the strange part.
TableName <- sqlQuery(connection, 'select * from TableName')
count = dim(TableName)[1]
odbcGetErrMsg(connection)
The above first gives a value of 700 (odbcGetErrMsg returns no errors). But when the code is executed again it returns a different count, e.g. 2300; that is, the result is random. Repeating the code multiple times, I see count values ranging from approximately 700 to 8,000 (TableName has 8 columns).
None of the above outputs change when ConnectionTimeout is set to either 0 or some absurdly high number.
Does anybody know what is going on here? The goal is to store the full SQL table called TableName as a dataframe in R for further data processing.
Any help is much appreciated.
Note for others with a similar problem:
I did not solve this bug; however, switching to Microsoft JDBC Driver 6.2 for SQL Server with the R package RJDBC returns the correct result. With this setup I am now able to load the full SQL table (200,000 rows and counting) into R as a dataframe for further processing.

Issue with broadcast Join in Spark 1.6.0

Spark creates different SQL execution plans in versions 1.5.2 and 1.6.0.
The database contains one large facts table (f) and several small dimension tables (d1, d2).
The dimension tables (d1, d2) are cached in Spark using:
sqlContext.cacheTable("d1");
sqlContext.sql("SELECT * from d1").count();
sqlContext.cacheTable("d2");
sqlContext.sql("SELECT * from d2").count();
The size of tables d1 and d2 is ~100MB each (the STORAGE tab in the Spark Web UI shows this size for the tables).
Spark SQL is configured with a broadcast threshold (about 1GB) that is larger than the tables:
spark.sql.autoBroadcastJoinThreshold=1048576000
The SQL is:
SELECT d1.name, d2.name, SUM(f.clicks)
FROM f
JOIN d1 on f.d1_id = d1.id
JOIN d2 on f.d2_id = d2.id
WHERE d1.name='A' AND f.date=20160101
GROUP BY d1.name, d2.name
ORDER BY SUM(f.clicks) DESC
The query is executed using ThriftServer.
When we run this query using Spark 1.5.2, the query executes quickly and the execution plan contains BroadcastJoin with the dimensions.
But when we run the same query using Spark 1.6.0, the query executes very slowly and the execution plan contains SortMergeJoin with the dimensions.
The environment for both executions is the same (the same Yarn Cluster).
Provided executors count, tasks count per executor and memory of the executors are the same for both executions.
What can I do to configure Spark 1.6.0 for using BroadcastJoin in sql executions?
Thanks.

Math operation in SQL?

I'm creating an application to calculate login/logout intervals for a call center; basically, what I do is get the interval between two times.
Which would be best:
to get the interval on the DB Server (SQL Server 2000),
or in the code itself (Perl)?
I'm running on Windows Server 2003.
Basically the operation is:
Logout - Login + 1
But there are about 1 000 000 rows on each query.
P.S. I do know how to do it; what I'm wondering is what would be best practice.
This is my actual query:
select S.Ident, S.Dateissued,
       S.LoginMin, S.LogoutMin,
       E.Exc_Name,
       CAST(CAST((LoginMin / 60 + (LoginMin % 60) / 100.0) as int) AS varchar) + ':' + CASE WHEN LoginMin % 60 < 10 THEN '0' + CAST(LoginMin % 60 AS varchar) ELSE CAST(LoginMin % 60 AS varchar) END,
       CAST(CAST((LogoutMin / 60 + (LogoutMin % 60) / 100.0) as int) AS varchar) + ':' + CASE WHEN LogoutMin % 60 < 10 THEN '0' + CAST(LogoutMin % 60 AS varchar) ELSE CAST(LogoutMin % 60 AS varchar) END,
       (LogoutMin - LoginMin) + 1 as Mins,
       E.Exc_ID, action
FROM igp_ScheduleLoginLogout S INNER JOIN igp_ExemptionsCatalog E
     ON S.Exc_ID = E.Exc_ID
where ident = $ident
  and dateissued between '$dateissued' and '$dateissued2'
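For readers untangling the two long CAST expressions: they appear to turn a minute count into an H:MM string (whole hours, a colon, then zero-padded minutes), and the Mins column is the inclusive interval. The sketch below shows the same arithmetic in Python rather than the Perl used by the asker; the function names are hypothetical, and it assumes LoginMin and LogoutMin are non-negative integer minute counts, which the question does not state explicitly.

```python
def format_minutes(total_minutes: int) -> str:
    """Format a minute count as H:MM, e.g. 545 -> '9:05'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


def interval_minutes(login_min: int, logout_min: int) -> int:
    """Inclusive interval, matching (LogoutMin - LoginMin) + 1 in the query."""
    return (logout_min - login_min) + 1


print(format_minutes(545))         # 9:05
print(interval_minutes(540, 545))  # 6
```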
Short answer:
If you are doing math on a set of data (like your 1 million row example), SQL is optimized for set-based operations.
If you are doing math on an iterative, row-by-row basis, your calling application or script is probably best.
Generally aggregating on the server and returning the final answer is faster than pulling all of the rows to an application and chugging through them there.
Generally the answer is that if you can do the calculation as part of the SQL query without having to change the form of the query, and if your application-layer code supports it (e.g. you aren't using an ORM that makes it difficult), then you may as well do the calculation in SQL. With such a simple calculation it's not likely to make much difference, so you should write whatever leads to the most maintainable code.
As with any performance question, the real answer is to benchmark it yourself. Answers on StackOverflow can only get you so far, since so many factors can affect performance in the real world.
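A benchmark along those lines can be quite small. The sketch below is only an illustration of the approach, not a measurement of the asker's setup: it uses Python's built-in sqlite3 with an in-memory table instead of SQL Server 2000 and Perl, and the table name, column names, and generated data are all made up. It computes the same total once inside the database and once in application code and reports the elapsed time for each.

```python
import sqlite3
import time


def build_db(n_rows):
    """Create an in-memory table of made-up login/logout minute pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (login_min INTEGER, logout_min INTEGER)")
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?)",
        ((i % 1000, i % 1000 + 30) for i in range(n_rows)),
    )
    return conn


def total_in_sql(conn):
    """Let the database compute and aggregate the intervals."""
    sql = "SELECT SUM(logout_min - login_min + 1) FROM sessions"
    return conn.execute(sql).fetchone()[0]


def total_in_app(conn):
    """Pull every row and do the arithmetic in application code."""
    rows = conn.execute("SELECT login_min, logout_min FROM sessions")
    return sum(logout - login + 1 for login, logout in rows)


def timed(fn, conn):
    """Return (result, elapsed seconds) for one call of fn(conn)."""
    start = time.perf_counter()
    result = fn(conn)
    return result, time.perf_counter() - start


conn = build_db(100_000)
for fn in (total_in_sql, total_in_app):
    result, seconds = timed(fn, conn)
    print(f"{fn.__name__}: total={result} in {seconds:.4f}s")
```

Both functions must return the same total; only the timings differ, and those depend on the database engine, the network between client and server, and the row count, so the numbers from a toy setup like this do not transfer to a real deployment.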
It partially depends on how scalable this has to be.
With 1 client and 1 server, as others have noted, doing it in SQL may be faster (but benchmark yourself!)
With several clients and 1 server (either now or in projection), you scale the calculations per client and offload ALL of them from 1 server, so the server load is dramatically lower. In this case, do the calculation in the client (or app server).