BigQuery Job Scheduler - google-bigquery

Someone left the company without leaving any documentation of the job schedulers he had set up, and some of the jobs connected to Data Studio have stopped working. Is there a way I can find which job scheduler is connected to a given table?

You can find all Cloud Scheduler jobs here: https://console.cloud.google.com/cloudscheduler/start?project=your-project-id
However, for BigQuery in particular, it's also worth checking the scheduled queries: https://console.cloud.google.com/bigquery/scheduled-queries?project=your-project-id
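If there are many scheduled queries, one rough way to narrow down which one writes to a particular table is to list them programmatically. The sketch below uses the BigQuery Data Transfer Service Python client (which backs scheduled queries); the project ID, location, and table name are placeholders you would need to adjust:

from google.cloud import bigquery_datatransfer  # pip install google-cloud-bigquery-datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = "projects/your-project-id/locations/us"  # placeholder project and location

for config in client.list_transfer_configs(parent=parent):
    # Scheduled queries are transfer configs with this data source ID.
    if config.data_source_id == "scheduled_query":
        query = dict(config.params).get("query", "")
        # Crude check: does the query text reference the table you care about?
        if "my_table" in query:
            print(config.display_name, config.name, config.schedule)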

Related

How to make Dataproc detect Python-Hive connection as a Yarn Job?

I launch a Dataproc cluster and serve Hive on it. Remotely, from any machine, I use PyHive or PyODBC to connect to Hive and do things. It's not just one query; it can be a long session with intermittent queries. (The query itself has issues; I will ask about that separately.)
Even during a single, active query, the operation does not show up as a "Job" (I guess it's YARN) on the dashboard. In contrast, when I "submit" tasks via PySpark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, without a Job, the cluster may not reliably detect that a Python client is "connected" to it, so the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to accompany my Python session, and to cancel/delete the job at times of my choosing? For my case, it would be a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let YARN detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now; you need to submit jobs via the Dataproc Jobs API to make them visible on the Jobs UI page and to have them taken into account by the cluster TTL feature.
If you cannot use the Dataproc Jobs API to execute your actual jobs, you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"

Bigquery user statistics from Microstrategy

I am using MicroStrategy to connect to BigQuery using a service account. I want to collect user-level job statistics from MSTR, but since I am using a service account, I need a way to track user-level job statistics in BigQuery for all the jobs executed via MicroStrategy.
Since you are using a service account to make the requests from MicroStrategy, you could list all of your project's jobs and then, using each job ID in the list, retrieve the job's details, which include the email used to run that job.
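As an illustration of that approach (not MicroStrategy-specific), a small sketch with the BigQuery Python client could look like this; the project ID and the seven-day window are assumptions to adjust:

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project

# List recent jobs across all users in the project and show who ran each one.
since = datetime.now(timezone.utc) - timedelta(days=7)
for job in client.list_jobs(all_users=True, min_creation_time=since):
    print(job.job_id, job.user_email, job.job_type, job.created)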
Another workaround would be to use Stackdriver Logging advanced filters and filter for the jobs run by the service account. For instance:
resource.type="bigquery_resource"
protoPayload.authenticationInfo.principalEmail="<your service account>"
Keep in mind this only shows jobs from the last 30 days due to the log retention periods.
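If you want to pull those log entries programmatically instead of through the Logs Viewer, a rough sketch with the Cloud Logging Python client could be (the project ID and service account address are placeholders):

from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client(project="your-project-id")  # placeholder project

log_filter = (
    'resource.type="bigquery_resource" '
    'protoPayload.authenticationInfo.principalEmail="your-sa@your-project-id.iam.gserviceaccount.com"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.log_name)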
Hope it helps.

resource management on spark jobs on Yarn and spark shell jobs

Our company has a 9-node cluster on Cloudera.
We have 41 long-running Spark Streaming jobs [YARN + cluster mode] and some regular spark-shell jobs scheduled to run daily at 1 pm.
All jobs are currently submitted under the user A role [with root permission].
The issue I encounter is that while all 41 Spark Streaming jobs are running, my scheduled jobs are unable to obtain the resources to run.
I have tried the YARN fair scheduler, but the scheduled jobs still do not run.
We expect the Spark Streaming jobs to always be running, but to reduce the resources they occupy whenever other scheduled jobs start.
Please feel free to share your suggestions or possible solutions.
Your Spark Streaming jobs are consuming too many resources for your scheduled jobs to get started. This is either because they are always scaled to the point where there aren't enough resources left for scheduled jobs, or because they aren't scaling back.
For the case where the streaming jobs aren't scaling back, you could check whether you have dynamic resource allocation enabled for your streaming jobs. One way of checking is via the spark shell, using spark.sparkContext.getConf.get("spark.streaming.dynamicAllocation.enabled"). If dynamic allocation is enabled, then you could look at reducing the minimum resources for those jobs.
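As a sketch of what that configuration could look like when building a PySpark streaming application (the executor counts are placeholders to tune to your cluster, and in practice you would more likely pass the same settings as --conf flags on spark-submit):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Cap the streaming jobs so the daily spark-shell jobs can still get containers.
conf = (
    SparkConf()
    .set("spark.streaming.dynamicAllocation.enabled", "true")
    .set("spark.streaming.dynamicAllocation.minExecutors", "1")   # placeholder: tune per job
    .set("spark.streaming.dynamicAllocation.maxExecutors", "4")   # placeholder: tune per job
)

spark = SparkSession.builder.appName("streaming-job").config(conf=conf).getOrCreate()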

How to detect APScheduler's running jobs?

I have some recurring jobs that run frequently or last for a while.
It seems that Scheduler().get_jobs() will only return the list of scheduled jobs that are not currently running, so I cannot determine whether a job with a certain ID does not exist or is actually running.
How may I test whether a job is running or not in this situation?
(I did not set up those jobs the usual way, because I need them to run at a random interval, not a fixed one. They are jobs that execute only once, but each adds a job with the same ID at the end of its execution, and they stop doing so when a certain threshold is reached.)
APScheduler does not filter the list of jobs for get_jobs() in any way. If you need random scheduling, why not implement that in a custom trigger instead of constantly re-adding the job?
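A minimal sketch of such a custom trigger (APScheduler 3.x style; the interval bounds are placeholders) could look like this:

import random
from datetime import timedelta

from apscheduler.triggers.base import BaseTrigger


class RandomIntervalTrigger(BaseTrigger):
    """Fires again after a random delay between min_seconds and max_seconds."""

    def __init__(self, min_seconds=60, max_seconds=300):
        self.min_seconds = min_seconds
        self.max_seconds = max_seconds

    def get_next_fire_time(self, previous_fire_time, now):
        delay = random.uniform(self.min_seconds, self.max_seconds)
        start = previous_fire_time or now
        return start + timedelta(seconds=delay)

You would then add the job once with scheduler.add_job(func, RandomIntervalTrigger(60, 300)), and get_jobs() keeps reporting it between runs instead of the job disappearing after each one-shot execution.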

how to run (or not) a cron job based on the results of another cron job

I need to execute a cron job based on whether or not the cron job that ran before it was at least partially successful. I am trying to set up conditions for the run; sometimes these conditions would be on the local machine, sometimes on a remote machine.
Is there a way to do this?
In that case you can share a common file in which the first job writes its status, and the second cron job reads it and decides on that basis whether it should proceed or not.
In the case of a remote machine, you may share that file over the network.
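A very small sketch of that idea, assuming the second cron entry calls a Python wrapper and the first job writes "OK" to a status file on success (the paths, schedule, and "OK" marker are placeholders):

# check_and_run.py - called by the second cron entry, e.g.:
#   30 2 * * * python3 /opt/jobs/check_and_run.py
import subprocess
import sys

STATUS_FILE = "/var/run/first_job.status"   # placeholder: first job writes "OK" here on success

try:
    with open(STATUS_FILE) as f:
        status = f.read().strip()
except FileNotFoundError:
    sys.exit("first job has not reported a status yet")

if status == "OK":
    subprocess.run(["/opt/jobs/second_job.sh"], check=True)  # placeholder command
else:
    sys.exit(0)  # first job failed or was only partially successful; skip this run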