In DataStax Cassandra (DSE) 5.1, why is dsetool missing the insights_config command?

Running dsetool insights_config on a DataStax Cassandra node returns Unknown command: insights_config, whereas the documentation states that this command should be present.

The DSE Metrics Collector was introduced in DSE 5.1.14 (the current version is 5.1.17), so make sure that you're using a version that has this functionality. In my setup it works just fine:
(dse-5.1.17) ...> dsetool insights_config --show_config
{
  "mode" : "DISABLED",
  "config_refresh_interval_in_seconds" : 30,
  "metric_sampling_interval_in_seconds" : 30,
  "data_dir_max_size_in_mb" : 1024,
  "node_system_info_report_period" : "PT1H"
}
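By default the collector is DISABLED (as in the output above). If you want to turn it on, something like the following should work; the mode name is taken from the DSE Metrics Collector documentation, so treat it as an assumption to verify for your exact version:
# Enable the Metrics Collector, storing insights locally (mode name per DSE docs; verify for your version)
dsetool insights_config --mode ENABLED_WITH_LOCAL_STORAGE
# Check the resulting configuration
dsetool insights_config --show_config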

Related

GCSToBigQueryOperator not working in composer-2.1.0-airflow-2.3.4

After a recent upgrade to composer-2.1.0-airflow-2.3.4 the GCSToBigQueryOperator is no longer able to find data in buckets to upload to BigQuery.
All other aspects of the DAGs still work.
The usage is as follows
gcs_to_bq = GCSToBigQueryOperator(
    task_id = f"transfer_{data_type}_to_bq_task",
    bucket = os.environ["GCS_BUCKET"],
    source_objects = file_names,
    destination_project_dataset_table = os.environ["GCP_PROJECT"] + f".creditsafe.{data_type}",
    schema_object = f"dags/schema/creditsafe/{data_type}.json",
    source_format = "CSV",
    field_delimiter = '|',
    quote_character = "",
    max_bad_records = 0,
    create_disposition = "CREATE_IF_NEEDED",
    ignore_unknown_values = True,
    allow_quoted_newlines = True,
    allow_jagged_rows = True,
    write_disposition = "WRITE_TRUNCATE",
    gcp_conn_id = 'google_cloud_default',
    skip_leading_rows = 1,
    dag = dag
)
The error from the API is
google.api_core.exceptions.NotFound: 404 GET
{ "error": { "code": 400, "message": "Unknown output format: media:", "errors": [ { "message": "Unknown output format: media:", "domain": "global", "reason": "invalidAltValue", "locationType": "parameter", "location": "alt" } ] } }
The error delivered by Cloud Composer is
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/download/storage/v1/b/[BUCKET_HIDDEN]/o/data%2Fcreditsafe%2FCD01%2Ftxt%2F%2A.txt?alt=media: No such object: [BUCKET_HIDDEN]/data/creditsafe/CD01/txt/*.txt: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
I can't see the cause of the error. The reference to the GCS location has not changed and appears correct while the gcp_conn_id appears sufficient for all other tasks. I'm at a loss.
A fix for the above issue has now been made:
https://github.com/apache/airflow/pull/28444
It is unclear how long it will take for this to be integrated into the Cloud Composer libraries.
GCSToBigQueryOperator does not support the wildcard *.csv. For your requirement, you can try the steps below:
You can attach to a worker pod in the Composer environment by running the commands below:
gcloud container clusters get-credentials --region __GCP_REGION__ __GKE_CLUSTER_NAME__
kubectl get pods -n [Namespace]
kubectl exec -it [Worker] -n [Namespace] -- bash
You can run the command below to identify the installed Google provider package version:
pip list | grep -i goo | grep provider
If the output of the above command shows a version other than 8.3.0, pin apache-airflow-providers-google==8.3.0 (see the sketch below).
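A minimal sketch, run inside the worker pod; this assumes a temporary pip-level change is acceptable for testing, while a permanent change should go through the Composer PyPI packages mechanism mentioned further below:
# Pin the provider to the version that still honours the operator options
pip install "apache-airflow-providers-google==8.3.0"
# Confirm the installed version
pip list | grep -i apache-airflow-providers-google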
Release 8.5.0 of apache-airflow-providers-google, which comes with Airflow >2.3.4 and <2.5.1, introduced several critical regressions, notably:
The GCSToBigQuery operator is broken, as it ignores its options: https://github.com/apache/airflow/pull/27961
This means that all custom settings specified in the operator (delimiter, formatting, null values, wildcards on files) are no longer sent to BigQuery, leading to unexpected results.
Until Google releases a Composer version based on Airflow 2.5.1, the workaround is to upgrade the apache-airflow-providers-google library (or to use a Composer version based on Airflow <= 2.2.5).
There is no need to connect via gcloud/kubectl to change the apache-airflow-providers-google version; you can change it directly in the Composer UI via the PyPI Packages page (or via the Terraform provider).
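Alternatively, the gcloud CLI can update the pinned package on an existing environment; a rough sketch, where the environment name, region and version are placeholders to adapt:
# Pin/upgrade the provider package on an existing Composer environment (placeholders: name, region, version)
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-pypi-package "apache-airflow-providers-google==8.8.0"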
I can confirm that on the latest Composer as of today (composer-1.20.4-airflow-2.4.3), configuring apache-airflow-providers-google==8.8.0 (latest) solves those issues for me.
But as mentioned previously this is only a workaround and your mileage might vary...
See the Cloud Composer documentation on configuring custom PyPI packages.

Multiple Versions of GNOME Packages After a yum update

I had to update some packages related to GNOME. I simply executed a yum update:
yum update libsoup PackageKit PackageKit-glib pipewire pipewire-libs python3-gobject python3-gobject-base tracker
Now there are two versions of each package installed. For example:
Installed Packages
Name : tracker
Version : 2.1.5
Release : 1.el8
Architecture : x86_64
Size : 4.1 M
Source : tracker-2.1.5-1.el8.src.rpm
Repository : #System
From repo : AppStream
Summary : Desktop-neutral metadata database and search tool
URL : https://wiki.gnome.org/Projects/Tracker
License : GPLv2+
Description : Tracker is a powerful desktop-neutral first class object database,
: tag/metadata database and search tool.
Name : tracker
Version : 2.1.5
Release : 2.el8
Architecture : x86_64
Size : 4.0 M
Source : tracker-2.1.5-2.el8.src.rpm
Repository : #System
How do I get the system to use the new package and remove the old one, like yum update does every other time?
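Duplicates like this are usually left behind by an interrupted yum/dnf transaction, and the usual cleanup is to remove the older copies; a minimal sketch with dnf on RHEL/CentOS 8 (verify the --duplicates options exist on your dnf version before running):
# List the duplicate installed packages that dnf can see
dnf repoquery --duplicates
# Remove the older copy of each duplicate, keeping the newest installed version
dnf remove --duplicates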

Error using Notebook "Streaming structured data from Elasticsearch using Tensorflow-IO" when it is tested in local environment

I downloaded the notebook "Streaming structured data from Elasticsearch using Tensorflow-IO" to my PC.
"This tutorial focuses on streaming data from an Elasticsearch cluster into a tf.data.Dataset which is then used in conjunction with tf.keras for training and inference."
Following the instructions, Elasticsearch was installed locally (Windows 10, ELK version 7.9).
The steps run OK, but at the "Training dataset" step, when the exercise reads the datasets from the "train" and "test" indexes, the notebook displays the error "Skipping node: http://localhost:9200/_cluster/health" with the additional info:
"ConnectionError: No healthy node available for the index: train, please check the cluster config"
I checked the index status (http://localhost:9200/_cat/indices?v=true&s=index) and the response from Elasticsearch is as expected:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open test EKQKEYWCSBOLY1-8-dqeUg 1 1 3462 0 306.7kb 306.7kb
yellow open train 8D4LF-TqRQ6f-CZmgnhM9g 1 1 8075 0 698.9kb 698.9kb
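The _cluster/health endpoint named in the error can also be queried directly; a default local install is assumed here:
curl -s "http://localhost:9200/_cluster/health?pretty"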
Running the same notebook in the Colab environment, the exercises go OK, without errors.
My environment:
OS: Windows 10
tensorflow-io version: 0.17.0
tensorflow version: 2.4.1
curl -sX GET "localhost:9200/"
{
  "name" : "nnnnnnnnnnn",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "_fdrIUXPScCIPqOCvPPorA",
  "version" : {
    "number" : "7.9.0",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "a479a2a7fce0389512d6a9361301708b92dff667",
    "build_date" : "2020-08-11T21:36:48.204330Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
This is a couple of months too late, but the reason this happens is likely a missing core ops dependency (which the ElasticsearchIODataset function relies on) in the Windows distribution of tensorflow-io. Maybe try this from a Linux WSL environment instead.
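A rough sketch of reproducing the same setup under WSL, reusing the versions from the question (the exact pip/python invocation depends on your WSL distribution):
# Inside a WSL shell (e.g. Ubuntu), install the versions used in the question
pip install "tensorflow==2.4.1" "tensorflow-io==0.17.0"
# Quick check that tensorflow-io (and its core ops) load on Linux
python -c "import tensorflow_io as tfio; print(tfio.__version__)"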

Programmatically determine latest Gradle version?

Is there a way to find, via an API or CLI, the latest available version of Gradle?
I'm working on a tool to check versions of dependencies and compare them to the latest versions, and I need a way to determine the latest version of Gradle that is available.
To be clear, I am not looking for the version of Gradle I already have. I know how to get that any number of ways. I'm just looking for some officially maintained endpoint I can call to determine the latest version available.
Gradle has an API to retrieve all sorts of information:
https://services.gradle.org/
For the current version:
GET https://services.gradle.org/versions/current
{
  "version" : "6.8.1",
  "buildTime" : "20210122132008+0000",
  "current" : true,
  "snapshot" : false,
  "nightly" : false,
  "releaseNightly" : false,
  "activeRc" : false,
  "rcFor" : "",
  "milestoneFor" : "",
  "broken" : false,
  "downloadUrl" : "https://services.gradle.org/distributions/gradle-6.8.1-bin.zip",
  "checksumUrl" : "https://services.gradle.org/distributions/gradle-6.8.1-bin.zip.sha256",
  "wrapperChecksumUrl" : "https://services.gradle.org/distributions/gradle-6.8.1-wrapper.jar.sha256"
}
You can get the data using curl and then use jq to extract the version key.
Node.js has built-in JSON support, so this will be even easier.
CURRENT_GRADLE_VERSION="$(curl -s https://services.gradle.org/versions/current | jq -r '.version')"
echo "${CURRENT_GRADLE_VERSION}" # prints 6.8.1

Querying Hive from Apache Drill causes Stackoverflow error

I am trying to query a table named customers in Hive from Drill. I am using Drill in embedded mode. I am using the default Derby database for the Hive metastore.
When I do a describe, it shows all the columns and types.
But, when I do a select command like this,
select * from customers limit 10;
In the Web UI, this is what I got
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: StackOverflowError
Hive plugin:
{
"type": "hive",
"enabled": true,
"configProps": {
"hive.metastore.uris": "thrift://ip_address:9083",
"javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=../sample-data/drill_hive_db;create=true",
"hive.metastore.warehouse.dir": "/user/hive/warehouse",
"fs.default.name": "file///",
"hive.metastore.sasl.enabled": "false"
}
}
Errors showed in the Log file:
org.apache.drill.exec.work.foreman.ForemanException: Unexpected exception during fragment initialization:
java.lang.AssertionError: Internal error: Error while applying rule DrillPushProjIntoScan,
java.lang.StackOverflowError: null
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:355) ~[hadoop-common-2.7.1.jar:na]
And, finally this
Query failed: org.apache.drill.common.exceptions.UserRemoteException:
SYSTEM ERROR: StackOverflowError
And, the versions I am using are:
Apache Drill : 1.3.0
Hive : 0.13.1-cdh5.3.0
Hadoop : 2.5.0-cdh5.3.0
This is a version conflict I guess.
According to Drill's Documentation:
Drill 1.0 supports Hive 0.13. Drill 1.1 supports Hive 1.0.
So, for Drill 1.1+ you may get issues with Hive 0.13.
Read more here.
So upgrade Hive to 1.0 or downgrade Drill to 1.0 to test this.