Cassandra DSE, restore from S3 timeouts - amazon-s3

I'm trying to test S3 backup/restore functionality.
What I did:
Installed DSE + OpsCenter on Amazon.
Scheduled hourly backups of all keyspaces (60 MB total size); I had 10 backups by the morning.
Terminated the instances and created a new one.
Tried to get my data back. No luck: OpsCenter can't connect to my S3 bucket.
It has been taking more than 10 minutes now...
What am I doing wrong?
UPD:
Finally, I got a response:

I believe that this may be OPSC-5915 (sorry no public bug tracker) which is fixed in the upcoming 5.2.0 release.
The summary is that the API calls will still work as expected but the UI is not pushing the destination information to the API endpoint correctly.
You can confirm that this is the error you are experiencing as follows:
1) Go to /etc/opscenter/clusters/<cluster_name>.conf (or a similar location, depending on whether you've done a tarball install, etc.)
2) Find the destination ID that matches your bucket, it'll look something like b699738d9bd8409c82e664b543f24030
3) Confirm the clustername in your opsc URLs, it'll look something like localhost:8888/my_cluster
4) Manually hit the API to retrieve your backup list
curl localhost:8888/<clustername>/backups?amount=6\&last_seen=\&list_all=1\&destination=<destination ID>
It'll look like this
curl localhost:8888/dse/backups?amount=6\&last_seen=\&list_all=1\&destination=b699738d9bd8409c82e664b543f24030
5) You should get back a JSON response; confirm that your backup is listed
{"opscenter_adhoc_2014-12-17-20-22-57-UTC": {"keyspaces": {"OpsCenter":...
If you see your backup in the JSON, then opsc sees your backup and this is indeed OPSC-5915, so that's at least confirmed.
If this is your case, we can work around it by manually hitting the restore API (this is admittedly a bit more involved).
http://docs.datastax.com/en/opscenter/5.1/api/docs/backups.html#backups
It'll look a bit like this:
BACKUP='opscenter_4a269167-96c1-40c7-84b7-b070c6bcd0cd_2012-06-07-18-00-00-UTC'
curl -X POST \
http://192.168.1.1:8888/Test_Cluster/backups/restore/$BACKUP \
-d '{
"destination": "fe85800f3f4043a88fbe76fc45b22b19",
"keyspaces": {
"Keyspace1": {
"column-families": ["users", "dates"],
"truncate": true
},
"OpsCenter": {
"truncate": false
}
}
}'
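For completeness, the same workaround can be scripted. The following is a rough sketch (not from the original answer) using Python's requests library; the cluster name, destination ID and backup tag are the placeholder values from the examples above, so substitute your own:
import requests

OPSC = "http://localhost:8888"
CLUSTER = "dse"                                       # cluster name from your OpsCenter URL
DEST = "b699738d9bd8409c82e664b543f24030"             # destination ID from <cluster_name>.conf
BACKUP = "opscenter_adhoc_2014-12-17-20-22-57-UTC"    # backup tag from the list call

# 1) List backups for the S3 destination and confirm the one we want is present.
backups = requests.get(
    f"{OPSC}/{CLUSTER}/backups",
    params={"amount": 6, "last_seen": "", "list_all": 1, "destination": DEST},
).json()
assert BACKUP in backups

# 2) Manually trigger the restore for the selected keyspaces.
restore = requests.post(f"{OPSC}/{CLUSTER}/backups/restore/{BACKUP}", json={
    "destination": DEST,
    "keyspaces": {"OpsCenter": {"truncate": False}},
})
print(restore.status_code, restore.text)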

Related

RavenDb: Document Refresh feature does not run at or after the time specified by the #refresh flag

I need to mark documents as expired after some time, so I am trying to use the #refresh feature to re-run a subscription and compute my 'expired' flag. I know there is a 'Document expiration' feature, but that one removes the data, which I don't want.
I have turned on the Refresh feature in the settings and added a #refresh UTC datetime to the metadata of the required documents. For example, I added this document manually:
{
"Name": "My data",
"#metadata": {
"#collection": "Testing",
"#refresh": "2021-04-30T07:41:35.4845961Z"
}
}
It looks like I am facing non-deterministic behavior - sometimes the refresh is processed, sometimes not. I tried different combinations of times, set both through code and via Raven Studio.
The refresh interval is set, but the Studio still says "in less than a minute".
I am using:
Community license (document refresh is not mentioned here, but I don't see it mentioned for any other license either)
community license extensions
I tried several versions of RavenDB with the same result (5.1.7 looked more promising as it worked for some time, but after a while it stopped):
4.2.111 server/studio version in Docker on Windows 10
5.1.7 server/studio version
C# RavenDB.Client 5.1.6
I did not find a related issue in the bug tracker:
https://issues.hibernatingrhinos.com/issues/RavenDB?q=document%20refresh
Any ideas what to check or what might be the case?
EDIT: After turning on logging to the console, I found an error log. It looks like this:
RavendbProject, Raven.Server.Documents.Expiration.ExpiredDocumentsCleaner, Failed to refresh documents on RavendbProject which are older than 05/17/2021 09:48:47, EXCEPTION: System.NullReferenceException: Object reference not set to an instance of an object.
RavendbProject | at Sparrow.Server.ByteStringContext`1.From(String value, ByteStringType type, ByteString& str) in C:\Builds\RavenDB-Stable-5.1\51024\src\Sparrow.Server\ByteString.cs:line 1297
RavendbProject | at Raven.Server.Documents.DocumentPutAction.PutDocument(DocumentsOperationContext context, String id, String expectedChangeVector, BlittableJsonReaderObject document, Nullable`1 lastModifiedTicks, String changeVector, DocumentFlags flags, NonPersistentDocumentFlags nonPersistentFlags) in C:\Builds\RavenDB-Stable-5.1\51024\src\Raven.Server\Documents\DocumentPutAction.cs:line 190
Also worth mentioning is that my document was stored in a cluster-wide transaction, so I can see the corresponding flag in one of my documents:
"#flags": "FromClusterTransaction",
My current suspicion is that one of these documents prevented the other documents from being refreshed. After deleting the cluster-transaction document, the other documents in the collection were refreshed.
The bug is related to documents added via a cluster transaction; the workaround for now is to not use cluster transactions.
I have opened an issue on the bug tracker:
https://issues.hibernatingrhinos.com/issue/RavenDB-16710

How to invoke an on-demand BigQuery Data Transfer Service run?

I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema waiting to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picked up GCS files matching a pattern and loaded them into BQ. I like the built-in option to delete source files after copy and the email notification in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10 min delay perhaps.
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes. But I can't figure out from the docs how to call it.
Also, what is my next best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
If I use a Cloud Function, how can I submit all the files in my GCS bucket at the time of invocation as one BQ load job?
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 minutes. But I can't figure out from the docs how to call it.
StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).
As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):
import time
from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
client = bigquery_datatransfer_v1.DataTransferServiceClient()
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc' # alphanumeric ID you'll find in the UI
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))
response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)
Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.
You should get a response such as:
runs {
name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
destination_dataset_id: "DATASET_NAME"
schedule_time {
seconds: 1579358571
nanos: 922599371
}
...
data_source_id: "google_cloud_storage"
state: PENDING
params {
...
}
run_time {
seconds: 1579358581
}
user_id: 28...65
}
and the transfer is triggered as expected.
Also, what is my next best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?
With this you can set a cron job to execute your function every ten minutes. As discussed in the comments, the minimum interval is 60 minutes so it won't pick up files less than one hour old (docs).
Apart from that, this is not a very robust solution, and this is where your follow-up questions come into play. I think these might be too broad to address in a single StackOverflow question, but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well.
Dataflow would be best if you needed ETL but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer, set the watch interval and the load job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline as opposed to the previous approach but a 10-minute watch period is possible.
If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
If I use a Cloud Function, how can I submit all the files in my GCS bucket at the time of invocation as one BQ load job?
You can use wildcards to match a file pattern when you create the transfer.
Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
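As a rough sketch of that file-by-file approach (mine, not from the docs): a background Cloud Function wired to the google.storage.object.finalize trigger (a close variation of the Pub/Sub notification route) can start one load job per new object. The dataset/table names below are placeholders, and the files are assumed to be flat CSVs whose schema already matches the table:
from google.cloud import bigquery

def load_gcs_file(event, context):
    """Triggered when a new object is finalized in the bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,  # assumption: flat CSV files
        write_disposition="WRITE_APPEND",
        skip_leading_rows=1,                      # assumption: files have a header row
    )
    # The target table already matches the file schema, so no schema is given here.
    job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
    job.result()  # wait for the load job to finish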
Lastly, does anyone know if DTS will lower the limit to 10 minutes in the future?
There is already a Feature Request here. Feel free to star it to show your interest and receive updates.
Now you can easily trigger a manual BigQuery Data Transfer run using the REST API:
HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
For the {parent=projects/*/locations/*/transferConfigs/*} part, check the CONFIGURATION tab of your transfer; the full resource name is shown there.
More here:
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
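If you prefer to stay in Python but hit the REST endpoint directly, a minimal sketch (assuming application-default credentials and a placeholder transfer config name copied from the CONFIGURATION tab) could look like this:
import time
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Full resource name of the transfer config (placeholder).
CONFIG_NAME = "projects/123456789/locations/us/transferConfigs/5e6...7bc"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# requestedRunTime is an RFC 3339 timestamp; here we ask for a run "now".
body = {"requestedRunTime": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}

resp = session.post(
    f"https://bigquerydatatransfer.googleapis.com/v1/{CONFIG_NAME}:startManualRuns",
    json=body,
)
print(resp.status_code, resp.json())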
Following Guillem's answer and the API updates, this is my new code:
import time
from google.cloud import bigquery_datatransfer_v1 as datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp

client = datatransfer_v1.DataTransferServiceClient()

config = '34y....654'
PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config

# Full resource name of the transfer config.
parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)

# Request an immediate run.
start_time = Timestamp(seconds=int(time.time()))
request = datatransfer_v1.types.StartManualTransferRunsRequest(
    {"parent": parent, "requested_run_time": start_time}
)
response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
For this to work, you need to know the correct TRANSFER_CONFIG_ID.
In my case, I wanted to list all the BigQuery scheduled queries to get a specific ID. You can do it like this:
from google.cloud import bigquery_datatransfer_v1

# Put your project ID here
PROJECT_ID = 'PROJECT_ID'

bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)

# Iterate over all results
for element in bq_transfer_client.list_transfer_configs(parent):
    # Print the display name of each scheduled query
    print(f'[Schedule Query Name]:\t{element.display_name}')
    # Print the full resource name (it contains the ID)
    print(f'[Name]:\t\t{element.name}')
    # Extract the ID:
    TRANSFER_CONFIG_ID = element.name.split('/')[-1]
    print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')
    # You can print the entire element for debug purposes
    print(element)

Is it possible to use service accounts to schedule queries in BigQuery's "Schedule Query" feature?

We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have a few ETL scheduled queries running overnight to optimize aggregation and reduce query cost. They work well and there haven't been many issues.
The problem arises when the person who scheduled the query with their own credentials leaves the organization. I know we can do an "update credential" in such cases.
I read through the documentation and also gave it a try, but couldn't really find whether we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner, tie in with the rest of the IAM framework and are not dependent on a single user.
So if you have any additional information regarding scheduled queries and service account please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now does support creating a scheduled query with a service account and updating a scheduled query with a service account. Will these work for you?
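For reference, recent versions of the Python client (google-cloud-bigquery-datatransfer) let you pass a service_account_name when creating the transfer config. A minimal sketch, with placeholder project, dataset and service account names:
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",                     # placeholder
    display_name="scheduled_query_with_service_account",
    data_source_id="scheduled_query",
    params={"query": "SELECT CURRENT_DATE() AS date, RAND() AS val"},
    schedule="every 24 hours",
)

response = client.create_transfer_config(
    bigquery_datatransfer.CreateTransferConfigRequest(
        parent="projects/my-project/locations/us",           # placeholder
        transfer_config=transfer_config,
        # The scheduled query will run with this service account's credentials.
        service_account_name="etl-sa@my-project.iam.gserviceaccount.com",  # placeholder
    )
)
print(response.name)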
While it's not supported in the BigQuery UI, it's possible to create a transfer (including a scheduled query) using the Python GCP SDK for DTS, or from the bq CLI.
The following is an example using Python SDK:
r"""Example of creating TransferConfig using service account.
Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant p4 service account with
iam.serviceAccout.GetAccessTokens permission on your project.
$ gcloud projects add-iam-policy-binding {user_project_id} \
--member='serviceAccount:service-{user_project_number}#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
--role='roles/iam.serviceAccountTokenCreator'
where {user_project_id} and {user_project_number} are the user project's
project id and project number, respectively. E.g.,
$ gcloud projects add-iam-policy-binding my-test-proj \
--member='serviceAccount:service-123456789#'\
'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com'\
--role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT to your user project, and
GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
$ export PROJECT_ID='my_project_id'
$ export GOOGLE_APPLICATION_CREDENTIALS=./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os
from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct
PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = (
service_account.Credentials.from_service_account_file(SA_KEY_PATH))
client = bigquery_datatransfer.DataTransferServiceClient(
credentials=credentials)
# Get full path to project
parent_base = client.project_path(PROJECT)
params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
"destination_dataset_id": "my_data_set",
"display_name": "scheduled_query_test",
"data_source_id": "scheduled_query",
"params": params,
}
parent = parent_base + "/locations/us"
response = client.create_transfer_config(parent, transfer_config)
print(response)
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old, but I came across this thread while searching for the same thing.
Yes, it is possible to use a service account to schedule BigQuery jobs.
While creating the scheduled query job, click on "Advanced options"; you will get the option to select a service account.
By default it uses the credentials of the requesting user.
(Image from the BigQuery "create scheduled query" page.)

Are BigQuery job logs stored?

Using the logging service https://console.cloud.google.com/logs?project=xxxx&service=bigquery.googleapis.com I'm not able to find logs related to jobs run from the UI or the bq command line.
[EDIT]
As suggested by @DoIT International, bq ls -j shows the list of jobs, but there is no log about the failure.
Use the following command to list the history of query jobs:
bq ls -j
bq is part of the Google Cloud SDK, available here: https://cloud.google.com/sdk/
You can use the Jobs: list API to collect job info and upload it to GBQ.
Since it is in GBQ, you can analyze it any way you want using the power of BigQuery.
You can either flatten the result or use the original - I recommend using the original, as it is less of a headache since there is no transformation before loading into GBQ (you literally upload whatever you got from the API). Of course, all of this happens in a simple app/script that you still have to write.
Note: make sure you use the full value for the projection parameter.
Info about the failure should be present, as below:
"errorResult": {
"reason": string,
"location": string,
"debugInfo": string,
"message": string
},
see more details in https://cloud.google.com/bigquery/docs/reference/v2/jobs/list#response
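As a rough sketch (not from the answer above), the same job information, including errorResult, can also be pulled with the google-cloud-bigquery Python client; the project ID is a placeholder:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# List recent jobs for all users in the project and print any failure details.
for job in client.list_jobs(all_users=True, max_results=50):
    if job.error_result:
        print(job.job_id, job.job_type,
              job.error_result.get("reason"), job.error_result.get("message"))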

GitHub v3 API - how do I create the initial commit in a repository?

I'm using the v3 API and managed to list repos/trees/branches, access file contents, and create blobs/trees/commits. I'm now trying to create a new repo, and managed to do it with "POST user/repos"
But when I try to create blobs/trees/commits/references in this new repo I get the same error message: (409) "Git Repository is empty." Obviously I could go and init the repository myself through the git command line, but I would rather have my application do it for me.
Is there a way to do that? What's the first thing I need to do through the API after I create an empty repository?
Thanks
Since 2012, it is possible to auto-initialize a repository after creation, according to this blog post published on the GitHub blog:
Today we’ve made it easier to add commits to a repository via the GitHub API. Until now, you could create a repository, but you would need to initialize it locally via your Git client before adding any commits via the API.
Now you can optionally init a repository when it’s created by sending true for the auto_init parameter:
curl -i -u pengwynn \
-d '{"name": "create-repo-test", "auto_init": true}' \
https://api.github.com/user/repos
The resulting repository will have a README stub and an initial commit.
Update May 2013: Note that the repository contents API now authorizes adding files.
See "File CRUD and repository statistics now available in the API".
Original answer (May 2012)
Since it doesn't seem to be supported yet ("GitHub v3 API: How to create initial commit for my shiny new repository?", as aclark comments), you can start by pushing an initial empty commit:
git commit --allow-empty -m 'Initial commit'
git push origin master
That can be a good practice to initialize one's repository anyway.
And it is illustrated in "git's semi-secret empty tree".
If you want to create an empty initial commit (i.e. one without any file) you can do the following:
Create the repository using the auto_init option as in Jai Pandya's answer; or, if the repository already exists, use the create file endpoint to create a dummy file - this will create the branch:
PUT https://api.github.com/repos/USER/REPO/contents/dummy
{
"branch": "master",
"message": "Create a dummy file for the sake of creating a branch",
"content": "ZHVtbXk="
}
This will give you a bunch of data including a commit SHA, but you can discard all of it since we are about to obliterate that commit.
Use the create commit endpoint to create a commit that points to the empty tree:
POST https://api.github.com/repos/USER/REPO/git/commits
{
"message": "Initial commit",
"tree": "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
}
This time you need to take note of the returned commit SHA.
Use the update reference endpoint to make the branch point to the commit you just created (notice the Use Of The Force™):
PATCH https://api.github.com/repos/USER/REPO/git/refs/heads/master
{
"sha": "<the SHA of the commit>",
"force": true
}
Done! Your repository has now one branch, one commit and zero files.
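A minimal sketch of the same three steps with Python's requests library (the user, repo and token are placeholders; the endpoints and payloads are the ones shown above):
import requests

API = "https://api.github.com"
USER, REPO, TOKEN = "octocat", "my-repo", "<personal access token>"  # placeholders
headers = {"Authorization": f"token {TOKEN}",
           "Accept": "application/vnd.github.v3+json"}

# 1) Create a dummy file so the branch exists.
requests.put(f"{API}/repos/{USER}/{REPO}/contents/dummy", headers=headers, json={
    "branch": "master",
    "message": "Create a dummy file for the sake of creating a branch",
    "content": "ZHVtbXk=",  # base64 for "dummy"
})

# 2) Create a commit that points at the well-known empty tree.
commit = requests.post(f"{API}/repos/{USER}/{REPO}/git/commits", headers=headers, json={
    "message": "Initial commit",
    "tree": "4b825dc642cb6eb9a060e54bf8d69288fbee4904",
}).json()

# 3) Force the branch to point at the new, empty commit.
requests.patch(f"{API}/repos/{USER}/{REPO}/git/refs/heads/master", headers=headers, json={
    "sha": commit["sha"],
    "force": True,
})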
January 2023: Step 2 of Konamiman's solution did not work for me unless I deleted the dummy file with the contents API:
DELETE https://api.github.com/repos/USER/REPO/contents/dummy
{
"branch": "master",
"message": "Delete dummy file",
"sha": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
}
(This sha is for the empty content "".)
After deleting the dummy file, 4b825dc642cb6eb9a060e54bf8d69288fbee4904 somehow becomes assignable to a new commit.
It looks like this behavior has changed over time. It feels bad to rely on undocumented API behavior. :(