Timeout error when writing large amounts of data to big query - sql

I am getting the following error when trying to write large amounts of data to big query using
client.insert_rows_json()
google.api_core.exceptions.RetryError: Deadline of 600.0s exceeded while calling target function, last exception: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
method. I have tried modifying the timeout parameter in the following way:
client.insert_rows_json(*args, timeout=1000000)
but I still get the same timeout error where the deadline is still at 600.0s.
Is there someway to establish the client with:
credentials = service_account.Credentials.from_service_account_info(service_account_json)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
and specify how long before timeout should occur?

Please try this solution
from google.cloud import bigquery
from google.oauth2 import service_account
# Load the service account credentials
service_account_json = 'path/to/service_account.json'
credentials = service_account.Credentials.from_service_account_file(service_account_json)
# Create the client with the desired timeout value
client = bigquery.Client(
credentials=credentials,
project=credentials.project_id,
default_query_job_config=bigquery.QueryJobConfig(
timeout=1800, # Set the timeout to 1800 seconds (30 minutes)
),
)

Related

Getting error while connecting ADLS to Notebook in AML

I am getting below error while connecting dataset created and registered in AML notebook and which is based on ADLS. When I connect this dataset in designer I am able to visualize the same. Below is the code that I am using. Please let me know the solution if anyone have faced the same error.
Examle 1 Import dataset to notebbok
from azureml.core import Workspace, Dataset
subscription_id = 'abcd'
resource_group = 'RGB'
workspace_name = 'DSG'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='abc')
dataset.to_pandas_dataframe()
Error 1
ExecutionError: Could not execute the specified transform.
(Error in getting metadata for path /local/top.txt.
Operation: GETFILESTATUS failed with Unknown Error: The operation has timed out..
Last encountered exception thrown after 5 tries.
[The operation has timed out.,The operation has timed out.,The operation has timed out.,The operation has timed out.,The operation has timed out.]
[ServerRequestId:])|session_id=2d67
Example 2 Import data from datastore to notebook
from azureml.core import Workspace, Datastore, Dataset
datastore_name = 'abc'
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name)
datastore_paths = [(datastore, '/local/top.txt')]
df_ds = Dataset.Tabular.from_delimited_files(
path=datastore_paths, validate=True,
include_path=False, infer_column_types=True,
set_column_types=None, separator='\t',
header=True, partition_format=None
)
df = df_ds.to_pandas_dataframe()
Error 2
Cannot load any data from the specified path. Make sure the path is accessible.
Try removing the initial slash from your path 'local/top.txt'
datastore_paths = [(datastore, 'local/top.txt')]
For your dataset abc, can you visualize/preview the data on ml.azure.com?
Might be due to the fact that your data permission is not set up correctly in ADLS. You need to give permission to the service principal for the file/folder you are access.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control
Data Access Setting on a file in ADLS

mysql python multiprocessing pool issues

Error I keep getting:
Lost connection to MySQL server during query
My code:
def runDBQuery(bl_sel):
dbResponse = []
bl_cur.execute(bl_sel)
myresult2 = bl_cur.fetchall()
dbResponse.append(myresult2)
return(dbResponse)
if __name__ == '__main__':
p1abl_sel = bl_sel_template.replace("{firstupc}",p1afirstupc).replace("{lastupc}",p1alastupc)
p2abl_sel = bl_sel_template.replace("{firstupc}",p2afirstupc).replace("{lastupc}",p2alastupc)
list_of_columns = [ p1abl_sel, p2abl_sel ]
#list_of_columns = [ p1abl_sel ]
p = Pool(processes=2)
data = p.map(runDBQuery, [i for i in list_of_columns])
# the 4 lines below are my failed attempts to try to resolve this.
bl_cur.close()
if cur and con:
cur.close()
con.close()
p.close()
print(data)
Whenever I uncomment the list_of_columns so there's only one element(query) in the list, it works and I get back a response from the DB. However, if I have more than one element in the list, I encounter the listed error.
Can anyone help me solve this problem?
The problem can be not in your code.
MySQL error "Lost connection to MySQL server during query" can accrue because of reading timeout. It can be either on the client side or mysql server configuration
MySQL
max_execution_time: The execution timeout for SELECT statements, in milliseconds. If the value is 0, timeouts are not enabled.
connect_timeout: Number of seconds the mysqld server waits for a connect packet before responding with 'Bad handshake'
interactive_timeout Number of seconds the server waits for activity on an interactive connection before closing it
wait_timeout Number of seconds the server waits for activity on a connection before closing it
https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_max_execution_time
For pyMysql check read_timeout
https://pymysql.readthedocs.io/en/latest/modules/connections.html

is there a way to setup timeout in grpc server side?

Unable to timeout a grpc connection from server side. It is possible that client establishes a connection but kept on hold/sleep which is resulting in grpc server connection to hang. Is there a way at server side to disconnect the connection after a certain time or set the timeout?
We tried disconnecting the connection from client side but unable to do so from server side. In this link Problem with gRPC setup. Getting an intermittent RPC unavailable error, Angad says that it is possible but unable to define those parameters in python.
My code snippet:
def serve():
server = grpc.server(thread_pool=futures.ThreadPoolExecutor(max_workers=2), maximum_concurrent_rpcs=None, options=(('grpc.so_reuseport', 1),('grpc.GRPC_ARG_KEEPALIVE_TIME_MS', 1000)))
stt_pb2_grpc.add_ListenerServicer_to_server(Listener(), server)
server.add_insecure_port("localhost:50051")
print("Server starting in port "+str(50051))
server.start()
try:
while True:
time.sleep(60 * 60 * 24)
except KeyboardInterrupt:
server.stop(0)
if __name__ == '__main__':
serve()
I expect the connection should be timed out from grpc server side too in python.
In short, you may find context.abort(...) useful, see API reference. Timeout a server handler is not supported by the underlying C-Core API of gRPC Python. So, you have to implement your own timeout mechanism in Python.
You can try out some solution from other StackOverflow questions.
Or use a simple-but-big-overhead extra threads to abort the connection after certain length of time. It might look like this:
_DEFAULT_TIME_LIMIT_S = 5
class FooServer(FooServicer):
def RPCWithTimeLimit(self, request, context):
rpc_ended = threading.Condition()
work_finished = threading.Event()
def wrapper(...):
YOUR_ACTUAL_WORK(...)
work_finished.set()
rpc_ended.notify_all()
def timer():
time.sleep(_DEFAULT_TIME_LIMIT_S)
rpc_ended.notify_all()
work_thread = threading.Thread(target=wrapper, ...)
work_thread.daemon = True
work_thread.start()
timer_thread = threading.Thread(target=timer)
timer_thread.daemon = True
timer_thread.start()
rpc_ended.wait()
if work_finished.is_set():
return NORMAL_RESPONSE
else:
context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, 'RPC Time Out!')

How to set producer/publisher socket timeout for RabbitMQ (pika)?

Is there a way to make set the socket timeout when publishing?
I'm testing correct recovery from lost connection with Pika, by
establishing a BlockingConnection connection
disconnecting from the network to force an error
reestablishing a connection and checking that the producer reconnects correctly and continues producing.
However, I don't seem to be able to set the socket timeout and basic_publish hangs - for WAY more than 5 seconds -- 60 or more.
credentials = pika.PlainCredentials(worker_config.username, worker_config.password)
connection = pika.BlockingConnection(pika.ConnectionParameters(
host=worker_config.host,
credentials=credentials,
port=worker_config.port,
connection_attempts=1,
retry_delay=5,
socket_timeout=5,
))
# No effect
#connection._impl.socket.settimeout(5)
channel = connection.channel()
while True:
result = channel.basic_publish(
exchange=EXCHANGE,
routing_key=ROUTING_KEY,
body=message,
properties=pika.BasicProperties(
delivery_mode, # MQ_TRANSIENT_DELIVERY_MODE, #1
))
# Someone after some success, disconnect network.
Pika comes into (select_connection.py):
def poll(self, write_only=False):
"""Poll until the next timeout waiting for an event
:param bool write_only: Only process write events
"""
while True:
try:
events = self._poll.poll(self.get_next_deadline())
break
except _SELECT_ERROR as error:
if _get_select_errno(error) == errno.EINTR:
continue
else:
raise
... and indeed, get_next_deadline is sending 5.
_poll is a python Poll object which takes a timeout in seconds.
What's up with this?
There's a similar question, but has no answers (not enough detail?)

create a mass of connections on client side by twisted

I use twisted to do a test job for a server. I need create a lot of connections connect to the server. This is my code:
class Account(Protocol):
def connectionMade(self):
print "connection made"
def connectionLost(self, reason):
print "connection Lost. reason: ", reason
def createAccount(self, name):
self.transport.write(...)
print "create account: ", name
class AccountFactory(Factory):
def buildProtocol(self, addr):
return Account()
def accountCreate(p, i):
print "begin create"
p.createAccount(NAME_PREFIX+str(i))
def onError(err):
return 'error: ', err
c = 0
while c < 100:
accountPoint = TCP4ClientEndpoint(reactor, server_ip, port)
accountConn = accountPoint.connect(AccountFactory())
accountConn.addCallback(accountCreate, c)
accountConn.addErrback(onError)
c += 1
reactor.run()
If server and client located in same LAN, there is no problem, all of 100 "create account: xxx" will printed. But when I put server on a remote address(internet), the client only prints near 50% number of "create account: xxx". onError doesn't fire.
The log is:
2014-07-29 15:57:06+0800 [Uninitialized] connection made
2014-07-29 15:57:06+0800 [Uninitialized] begin create
2014-07-29 15:57:06+0800 [Uninitialized] create account: xxx
repeat 60 times
2014-07-29 15:57:17+0800 [Uninitialized] Stopping factory <__main__.AccountFactory instance at xxx>
repeat 40 times
Some callback failed to be calling, even the connection haven't be made. The only different is the latency between server and client.
The most interested thing is the duration between first success log and first "Stopping factory" log is exactly 20 seconds(I try this many times). But I am sure this is not caused by timeout because TCP4ClientEndpoint default timeout is 30 seconds.
And the log time stamp is also abnormal, the log time stamp is in bundle, for example: 10 logs are 2014-07-29 17:25:09, 20 logs are 2014-07-29 17:25:15. If the connection made in async manner, the time stamp should be random enough. It should not gather together: made 10 connections at time point a, made another 20 at time point a+15sec. Or log utility problem?
Revised:
I think this is bug of twisted. The reason of "Stopping" is timeout. When I run this in linux, the time duration between first log and first stopping is timeout seconds I passed into TCP4ClientEndpoint, but under windows whatever I set the timeout seconds, the duration always 21 seconds. I use socket(blocking) to do same thing instead, all is pretty good. So this should be a bug in twisted which involve timeout when making a lot of connections.
You haven't added any error handlers to your code, nor have you enabled logging so that unhandled errors will be reported anywhere.
Enable logging, either by calling twisted.python.log.startLogging or by writing your code as an ISeviceMaker plugin and running it with twistd.
And add errbacks to each Deferred in your application so you can handle failures from their associate operations.