How to see the data (primary and backup) on each node in Apache Ignite?

I have 6 Ignite nodes, all connected to form a cluster, and I have configured 2 backup copies. I have put 20 entries into the cluster to check the partitioning of the data (primary and backup). I can see the counts using the cache -a -r command.
Is there a command or another way to see the actual data on each node, so I can inspect both the primary data and the backup copies?

You could use cache -scan -c=cacheName. Sample output:
Entries in cache: SQL_PUBLIC_PERSON
+=============================================================================================================================================+
| Key Class | Key | Value Class | Value |
+=============================================================================================================================================+
| java.lang.Integer | 1 | o.a.i.i.binary.BinaryObjectImpl | SQL_PUBLIC_PERSON_.. [hash=357088963, NAME=Name1] |
+---------------------------------------------------------------------------------------------------------------------------------------------+
Use help cache to see all cache-related commands.
See: https://apacheignite-tools.readme.io/docs/command-line-interface
You also have the option of turning on SQL: https://apacheignite-sql.readme.io/docs/schema-and-indexes
and: https://apacheignite-sql.readme.io/docs/getting-started
then use JDBC/SQL to see the entries in your cache.
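To reason about where primary and backup copies of a key live, it helps to know the mechanics: Ignite hashes each key to one of a fixed number of partitions, and the affinity function assigns every partition a primary node plus the configured number of backup nodes (from Java code you can query this with the Affinity API, e.g. mapKeyToPrimaryAndBackups). The sketch below is only a toy illustration of that idea in Python; it is not the actual Ignite affinity function, and the node names and round-robin assignment are made up:

```python
# Toy illustration of how a partitioned cache with 2 backups assigns
# each key a primary node and backup nodes. This is NOT Ignite's
# affinity function -- just the general idea.

NODES = ["node1", "node2", "node3", "node4", "node5", "node6"]
PARTITIONS = 1024   # Ignite's default partition count
BACKUPS = 2

def partition(key):
    # A key maps to a partition via its hash.
    return hash(key) % PARTITIONS

def owners(part):
    # Pick a primary and BACKUPS distinct backup nodes for a partition.
    start = part % len(NODES)
    order = NODES[start:] + NODES[:start]
    return order[0], order[1:1 + BACKUPS]   # (primary, [backups])

for key in range(5):
    p = partition(key)
    primary, backups = owners(p)
    print(f"key={key} partition={p} primary={primary} backups={backups}")
```

The takeaway: every entry lives on exactly one primary node and, with 2 backups configured, on 2 other nodes as well, which is why a scan on any single node shows a mix of primary and backup data.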


Octavia apply on Airbyte gives a json schema validation error

I'm trying to create a new BigQuery destination on Airbyte with the Octavia CLI.
When launching:
octavia apply
I receive:
Error: {"message":"The provided configuration does not fulfill the specification. Errors: json schema validation failed when comparing the data to the json schema. \nErrors:
$.loading_method.method: must be a constant value Standard
Here is my conf:
# Configuration for airbyte/destination-bigquery
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/destinations/bigquery
resource_name: "BigQueryFromOctavia"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.12
# EDIT THE CONFIGURATION BELOW!
configuration:
  dataset_id: "airbyte_octavia_thibaut" # REQUIRED | string | The default BigQuery Dataset ID that tables are replicated to if the source does not specify a namespace. Read more here.
  project_id: "data-airbyte-poc" # REQUIRED | string | The GCP project ID for the project containing the target BigQuery dataset. Read more here.
  loading_method:
    ## -------- Pick one valid structure among the examples below: --------
    # method: "Standard" # REQUIRED | string
    ## -------- Another valid structure for loading_method: --------
    method: "GCS Staging" # REQUIRED | string
    credential:
      ## -------- Pick one valid structure among the examples below: --------
      credential_type: "HMAC_KEY" # REQUIRED | string
      hmac_key_secret: ${AIRBYTE_BQ1_HMAC_KEY_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. | Example: 1234567890abcdefghij1234567890ABCDEFGHIJ
      hmac_key_access_id: ${AIRBYTE_BQ1_HMAC_KEY_ACCESS_ID} # SECRET (please store in environment variables) | REQUIRED | string | HMAC key access ID. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. | Example: 1234567890abcdefghij1234
      gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
      gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
    # keep_files_in_gcs-bucket: "Delete all tmp files from GCS" # OPTIONAL | string | This upload method is supposed to temporarily store records in a GCS bucket. With this option you can choose whether these records should be removed from GCS when the migration has finished. The default "Delete all tmp files from GCS" value is used if not set explicitly.
  credentials_json: ${AIRBYTE_BQ1_CREDENTIALS_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | The contents of the JSON service account key. Check out the docs if you need help generating this key. Default credentials will be used if this field is left empty.
  dataset_location: "europe-west1" # REQUIRED | string | The location of the dataset. Warning: Changes made after creation will not be applied. Read more here.
  transformation_priority: "interactive" # OPTIONAL | string | Interactive run type means that the query is executed as soon as possible, and these queries count towards concurrent rate limit and daily limit. Read more about interactive run type here. Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. Batch queries don't count towards your concurrent rate limit. Read more about batch queries here. The default "interactive" value is used if not set explicitly.
  big_query_client_buffer_size_mb: 15 # OPTIONAL | integer | Google BigQuery client's chunk (buffer) size (MIN=1, MAX=15) for each table. The size that will be written by a single RPC. Written data will be buffered and only flushed upon reaching this size or closing the channel. The default 15MB value is used if not set explicitly. Read more here. | Example: 15
It was an indentation issue on my side. The lines
gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
should be one level up, i.e. directly under loading_method rather than under credential (this wasn't clear in the commented template, hence the error, and other people may well make the same mistake).
Here is full final conf:
# Configuration for airbyte/destination-bigquery
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/destinations/bigquery
resource_name: "BigQueryFromOctavia"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.12
# EDIT THE CONFIGURATION BELOW!
configuration:
  dataset_id: "airbyte_octavia_thibaut" # REQUIRED | string | The default BigQuery Dataset ID that tables are replicated to if the source does not specify a namespace. Read more here.
  project_id: "data-airbyte-poc" # REQUIRED | string | The GCP project ID for the project containing the target BigQuery dataset. Read more here.
  loading_method:
    ## -------- Pick one valid structure among the examples below: --------
    # method: "Standard" # REQUIRED | string
    ## -------- Another valid structure for loading_method: --------
    method: "GCS Staging" # REQUIRED | string
    credential:
      ## -------- Pick one valid structure among the examples below: --------
      credential_type: "HMAC_KEY" # REQUIRED | string
      hmac_key_secret: ${AIRBYTE_BQ1_HMAC_KEY_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. | Example: 1234567890abcdefghij1234567890ABCDEFGHIJ
      hmac_key_access_id: ${AIRBYTE_BQ1_HMAC_KEY_ACCESS_ID} # SECRET (please store in environment variables) | REQUIRED | string | HMAC key access ID. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. | Example: 1234567890abcdefghij1234
    gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
    gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
    # keep_files_in_gcs-bucket: "Delete all tmp files from GCS" # OPTIONAL | string | This upload method is supposed to temporarily store records in a GCS bucket. With this option you can choose whether these records should be removed from GCS when the migration has finished. The default "Delete all tmp files from GCS" value is used if not set explicitly.
  credentials_json: ${AIRBYTE_BQ1_CREDENTIALS_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | The contents of the JSON service account key. Check out the docs if you need help generating this key. Default credentials will be used if this field is left empty.
  dataset_location: "europe-west1" # REQUIRED | string | The location of the dataset. Warning: Changes made after creation will not be applied. Read more here.
  transformation_priority: "interactive" # OPTIONAL | string | Interactive run type means that the query is executed as soon as possible, and these queries count towards concurrent rate limit and daily limit. Read more about interactive run type here. Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. Batch queries don't count towards your concurrent rate limit. Read more about batch queries here. The default "interactive" value is used if not set explicitly.
  big_query_client_buffer_size_mb: 15 # OPTIONAL | integer | Google BigQuery client's chunk (buffer) size (MIN=1, MAX=15) for each table. The size that will be written by a single RPC. Written data will be buffered and only flushed upon reaching this size or closing the channel. The default 15MB value is used if not set explicitly. Read more here. | Example: 15
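The nesting rule that the commented template obscures can be sanity-checked mechanically. Below is a minimal Python sketch; plain dicts stand in for the parsed YAML, and the check function and its logic are my own illustration, not part of Octavia:

```python
# For the "GCS Staging" loading method, gcs_bucket_name/gcs_bucket_path
# must be direct children of loading_method (siblings of "method" and
# "credential"), NOT nested inside "credential".

loading_method = {
    "method": "GCS Staging",
    "credential": {
        "credential_type": "HMAC_KEY",
        "hmac_key_access_id": "...",
        "hmac_key_secret": "...",
    },
    "gcs_bucket_name": "airbyte-octavia-thibaut-gcs",
    "gcs_bucket_path": "gcs",
}

def check(lm):
    """Return a list of nesting problems (empty list = looks right)."""
    problems = []
    if lm.get("method") == "GCS Staging":
        for field in ("gcs_bucket_name", "gcs_bucket_path"):
            if field in lm.get("credential", {}):
                problems.append(f"{field} is nested under credential")
            if field not in lm:
                problems.append(f"{field} missing from loading_method")
    return problems

print(check(loading_method))  # → []
```

With the bucket fields mistakenly indented under credential, the same check reports both fields as misplaced and missing, which mirrors the schema-validation failure Octavia produced.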

Geomesa: Using statistics in redis

I am trying to use GeoMesa with Redis. I thought that Redis enables statistics in GeoMesa by default.
My Redis GeoMesa DB:
./geomesa-redis describe-schema -u localhost:6379 -c geomesa -f SignalBuilder
INFO Describing attributes of feature 'SignalBuilder'
geo | Point (Spatio-temporally indexed) (Spatially indexed)
time | Date (Spatio-temporally indexed) (Attribute indexed)
cam | String (Attribute indexed) (Attribute indexed)
imei | String
dir | Double
alt | Double
vlc | Double
sl | Integer
ds | Integer
dir_y | Double
poi_azimuth_x | Double
poi_azimuth_y | Double
User data:
geomesa.attr.splits | 0
geomesa.feature.expiry | time(30 days)
geomesa.id.splits | 0
geomesa.index.dtg | time
geomesa.indices | z3:7:3:geo:time,z2:5:3:geo,attr:8:3:time,attr:8:3:cam,attr:8:3:cam:time
geomesa.stats.enable | true
geomesa.table.partition | time
geomesa.z.splits | 0
geomesa.z3.interval | week
From the docs: https://www.geomesa.org/documentation/stable/user/datastores/query_planning.html#stats-collected
Stat generation can be enabled or disabled through the simple feature type user data using the key geomesa.stats.enable.
Cached statistics, and thus cost-based query planning, are currently only implemented for the Accumulo and Redis data stores.
The stats collected are:
* total count
* min/max (bounds) for the default geometry, default date and any indexed attributes
* histograms for the default geometry, default date and any indexed attributes
* frequencies for any indexed attributes
...
Why does the return time increase as the amount of data grows?
./geomesa-redis export -u localhost:6379 -c geomesa -f SignalBuilder -q "cam like '%' and bbox(geo,38,56,39,57)" --hints STATS_STRING='Enumeration(cam)'
INFO Running export - please wait...
id,stats:String,*geom:Geometry
stat,"{""5798a065-d51e-47a1-b04b-ab48df9f1324"":203215}",POINT (0 0)
INFO Feature export complete to standard out for 1 features in 2056ms
The next request:
./geomesa-redis export -u localhost:6379 -c geomesa -f SignalBuilder -q "cam like '%' and bbox(geo,38,56,39,57)" --hints STATS_STRING='Enumeration(cam)'
INFO Running export - please wait...
id,stats:String,*geom:Geometry
stat,"{""5798a065-d51e-47a1-b04b-ab48df9f1324"":595984}",POINT (0 0)
INFO Feature export complete to standard out for 1 features in 3418ms
How can I verify that statistics are collected and saved, and that they are used when returning stats hints such as STATS_STRING='MinMax(time)' or STATS_STRING='Enumeration(cam)'?
And how can I use sampling with GeoTools?
I tried the following:
geomesa-cassandra export -P 10.200.217.24:9042 -u cassandra -p cassandra \
-k geomesa -c gsm_events -f SignalBuilder \
-q "cam like '%' and time DURING 2021-12-27T16:50:38.004Z/2022-01-26T16:50:38.004Z" \
--hints SAMPLE_BY='cam';SAMPLING=0.000564
but it does not work.
Thank you for any answer.
When you run an export with a stats query hint, GeoMesa will always run a query. If you want to use the cached statistics, use the stats-* CLI commands instead. In code, you'd use the stats method that all GeoMesa data stores implement.

JQ Pulling Inconsistent Results

If I run this straightforward query on the JSON output of an AWS command, I get a correct result for how many AWS server instances I have in an account:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[].InstanceId'
This produces a list of 47 instance IDs, which corresponds to the number of server instances I have in the account. For example:
i-01adbf1408ef1a333
i-0f92d078ce975c138
i-0e4e117c44b17b417
and so on, up to 47 instances.
This next query still produces the correct number of results:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | [( .InstanceId ) ]'
However, if I extend the query to include the Name tags of the servers, I get a dramatically smaller number of server instances reported:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | [( (.Tags[]|select(.Key=="Name")|.Value), .InstanceId ) ]'
This is the output of that command:
"i-08d3c05eed1316c9d"
"USAMZLAB10003","i-79eebb29"
"EOMLABAMZ1306","i-dbc98af4"
"USAMZLAB10002","i-d1dc1d83"
"i-0366c9bf18d27eb96"
"i-04d061334bc2f2d6b"
"USAMZLAB10007","i-f7a680a7"
"i-090e84eff4fece2b3"
"EOMLABAMZ1303","i-7cc98a53"
"EOMLABCSE713","i-08233926"
"i-0705eb3039cd56e04"
jq: error (at <stdin>:5013): Cannot iterate over null (null)
For some reason that query reports only 11 AWS server instances (when there should be 47). It does report servers both with and without Name tags, but it's not reporting the correct number of servers.
It also produces the jq error "Cannot iterate over null".
I have put the original JSON into this paste:
Original JSON
How can I make the error more verbose so I can find out what's going on?
And why does adding the Name tag to the query dramatically reduce the number of results?
In your JSON, not all instances have a set of Tags, hence the error. You have to handle it, or substitute an empty array in its place with (.Tags // []). But overall, I would write it like this:
.Reservations[].Instances[] | [ (.Tags // [] | from_entries.Name), .InstanceId ]
How can I make the error more verbose so I can find out what's going on?
You could use debug.
why does adding the name tag to the query dramatically reduce the number of results?
Because your jq program is at variance with your expectations; specifically, you have overlooked what happens when .Tags evaluates to null. To understand the mismatch, consider:
$ jq -n '{} | .Tags[]|select(.Key=="Name")|.Value'
Another issue is the handling of empty arrays. You might like to handle the case of empty arrays along the lines suggested by the following:
$ jq -n '{Tags: []} | (.Tags[] | select(.Key=="Name")|.Value) // null'
null
One solution
If you want null to appear whenever there isn't a tag:
.Reservations[].Instances[]
| [ ((.Tags // [])[] | select(.Key=="Name") | .Value) // null,
.InstanceId ]
Given your input, the first two lines of the output would be:
[null,"i-08d3c05eed1316c9d"]
["USAMZLAB10003","i-79eebb29"]
Variant using try
.Reservations[].Instances[]
| [ try (.Tags[] | select(.Key=="Name")|.Value) // null,
.InstanceId ]
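The null-vs-empty distinction behind this answer is easy to reproduce outside jq. Here is a small Python sketch of the same logic; the sample instances are made up, and inst.get("Tags") or [] plays the role of jq's .Tags // []:

```python
# Mimics: .Reservations[].Instances[]
#         | [ ((.Tags // [])[] | select(.Key=="Name") | .Value) // null,
#             .InstanceId ]
instances = [
    {"InstanceId": "i-08d3c05eed1316c9d"},                 # no Tags key at all
    {"InstanceId": "i-79eebb29",
     "Tags": [{"Key": "Name", "Value": "USAMZLAB10003"}]},
    {"InstanceId": "i-0366c9bf18d27eb96", "Tags": []},     # empty Tags list
]

def name_and_id(inst):
    # Fall back to an empty list when Tags is missing or null,
    # so untagged instances still produce a row instead of an error.
    tags = inst.get("Tags") or []
    name = next((t["Value"] for t in tags if t["Key"] == "Name"), None)
    return [name, inst["InstanceId"]]

for inst in instances:
    print(name_and_id(inst))
```

Every instance appears in the output, tagged or not, which is exactly what the // [] fallback buys you in the jq version.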

Can GraphDB load 10 million statements with OWL reasoning?

I am struggling to load most of the Drug Ontology OWL files and most of the ChEBI OWL files into a GraphDB Free v8.3 repository with Optimized OWL Horst reasoning on.
Is this possible? Should I do something other than "be patient"?
Details:
I'm using the loadrdf offline bulk loader to populate an AWS r4.16xlarge instance with 488 GiB of RAM and 64 vCPUs.
Over the weekend, I played around with different pool buffer sizes and found that most of these files individually load fastest with a pool buffer of 2,000 or 20,000 statements instead of the suggested 200,000. I also added -Xmx470g to the loadrdf script. Most of the OWL files would load individually in less than one hour.
Around 10 pm EDT last night, I started loading all of the files listed below simultaneously. Now, 11 hours later, there are still millions of statements to go. The load rate is around 70 statements/second. It appears that only 30% of my RAM is being used, but the CPU load is consistently around 60.
Are there websites that document other people doing something at this scale?
Should I be using a different reasoning configuration? I chose this configuration because it was the fastest-loading OWL configuration in my experiments over the weekend. I think I will need to look for relationships that go beyond rdfs:subClassOf.
Files I'm trying to load:
+-------------+------------+---------------------+
| bytes | statements | file |
+-------------+------------+---------------------+
| 471,265,716 | 4,268,532 | chebi.owl |
| 61,529 | 451 | chebi-disjoints.owl |
| 82,449 | 1,076 | chebi-proteins.owl |
| 10,237,338 | 135,369 | dron-chebi.owl |
| 2,374 | 16 | dron-full.owl |
| 170,896 | 2,257 | dron-hand.owl |
| 140,434,070 | 1,986,609 | dron-ingredient.owl |
| 2,391 | 16 | dron-lite.owl |
| 234,853,064 | 2,495,144 | dron-ndc.owl |
| 4,970 | 28 | dron-pro.owl |
| 37,198,480 | 301,031 | dron-rxnorm.owl |
| 137,507 | 1,228 | dron-upper.owl |
+-------------+------------+---------------------+
@MarkMiller, you can take a look at the Preload tool, which is part of the GraphDB 8.4.0 release. It's specially designed to handle large amounts of data at constant speed. Note that it works without inference, so you'll need to load your data and then change the ruleset and re-infer the statements.
http://graphdb.ontotext.com/documentation/free/loading-data-using-preload.html
Just typing out @Konstantin Petrov's correct suggestion with tidier formatting. All of these queries should be run in the repository of interest; at some point while working this out, I misled myself into thinking I should be connected to the SYSTEM repo when running them.
All of these queries also require the following prefix definition:
prefix sys: <http://www.ontotext.com/owlim/system#>
This doesn't directly address the timing/performance of loading large datasets into an OWL reasoning repository, but it does show how to switch to a higher level of reasoning after loading lots of triples into a no-inference ("empty" ruleset) repository.
You could start by querying for the current reasoning level/ruleset, and then run this same SELECT statement after each INSERT:
SELECT ?state ?ruleset {
?state sys:listRulesets ?ruleset
}
Add a predefined ruleset
INSERT DATA {
_:b sys:addRuleset "rdfsplus-optimized"
}
Make the new ruleset the default
INSERT DATA {
_:b sys:defaultRuleset "rdfsplus-optimized"
}
Re-infer... could take a long time!
INSERT DATA {
[] <http://www.ontotext.com/owlim/system#reinfer> []
}

What is the easiest way to find the biggest objects in Redis?

I have a 20GB+ rdb dump in production.
I suspect there's a specific set of keys bloating it.
I'd like to have a way to always spot the 100 biggest objects, either from static dump analysis or by asking the server itself, which by the way holds over 7M objects.
Dump-analysis tools like rdbtools are not helpful for this (I think) really common use case!
I was thinking of writing a script that iterates the whole keyset with "redis-cli debug object", but I have the feeling there must be some tool I'm missing.
An option was added to redis-cli: redis-cli --bigkeys
Sample output based on https://gist.github.com/michael-grunder/9257326
$ ./redis-cli --bigkeys
# Press ctrl+c when you have had enough of it... :)
# You can use -i 0.1 to sleep 0.1 sec every 100 sampled keys
# in order to reduce server load (usually not needed).
Biggest string so far: day:uv:483:1201737600, size: 2
Biggest string so far: day:pv:2013:1315267200, size: 3
Biggest string so far: day:pv:3:1290297600, size: 5
Biggest zset so far: day:topref:2734:1289433600, size: 3
Biggest zset so far: day:topkw:2236:1318723200, size: 7
Biggest zset so far: day:topref:651:1320364800, size: 20
Biggest string so far: uid:3467:auth, size: 32
Biggest set so far: uid:3029:allowed, size: 1
Biggest list so far: last:175, size: 51
-------- summary -------
Sampled 329 keys in the keyspace!
Total key length in bytes is 15172 (avg len 46.12)
Biggest list found 'day:uv:483:1201737600' has 5235597 items
Biggest set found 'day:uvx:555:1201737600' has 47 members
Biggest hash found 'day:uvy:131:1201737600' has 2888 fields
Biggest zset found 'day:uvz:777:1201737600' has 1000 members
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
19 lists with 5236744 items (05.78% of keys, avg size 275618.11)
50 sets with 112 members (15.20% of keys, avg size 2.24)
250 hashs with 6915 fields (75.99% of keys, avg size 27.66)
10 zsets with 1294 members (03.04% of keys, avg size 129.40)
redis-rdb-tools does have a memory report that does exactly what you need. It generates a CSV file with the memory used by every key. You can then sort it and find the top N keys.
There is also an experimental memory profiler that has started to do what you need. It's not yet complete, and so isn't documented, but you can try it: https://github.com/sripathikrishnan/redis-rdb-tools/tree/master/rdbtools/cli. And of course, I'd encourage you to contribute as well!
Disclaimer: I am the author of this tool.
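Once you have the memory-report CSV, "sort it and find the top N" is a few lines in any language. A minimal Python sketch, using a hypothetical excerpt of such a CSV (columns trimmed to key and size_in_bytes purely for the illustration):

```python
import csv
import heapq
import io

# Hypothetical excerpt of an rdbtools-style memory report
# (the real CSV has more columns; only two are kept here).
report = io.StringIO("""key,size_in_bytes
day:uv:483:1201737600,5235597
uid:3467:auth,96
last:175,2048
day:topref:651:1320364800,840
""")

rows = list(csv.DictReader(report))

# Take the N biggest keys by reported size.
top = heapq.nlargest(3, rows, key=lambda r: int(r["size_in_bytes"]))
for r in top:
    print(r["key"], r["size_in_bytes"])
```

heapq.nlargest avoids sorting the whole file, which matters when the report covers millions of keys.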
I am pretty new to bash scripting. I came up with this:
for line in $(redis-cli keys '*' | awk '{print $1}'); do echo $(redis-cli DEBUG OBJECT "$line" | awk '{print $5}' | sed 's/serializedlength://g') "$line"; done | sort -h
This script:
Lists all the keys with redis-cli keys "*"
Gets each key's size with redis-cli DEBUG OBJECT
Sorts the output by the size prepended to each key name
This may be very slow because bash loops through every single Redis key. With 7M keys you may need to cache the output of keys to a file.
If you have keys that follow the pattern "A:B" or "A:B:*", I wrote a tool that analyzes both existing content and monitors things such as hit rate, number of gets/sets, network traffic, lifetime, etc. The output is similar to the one below.
https://github.com/alexdicianu/redis_toolkit
$ ./redis-toolkit report -type memory -name NAME
+----------------------------------------+----------+-----------+----------+
| KEY | NR KEYS | SIZE (MB) | SIZE (%) |
+----------------------------------------+----------+-----------+----------+
| posts:* | 500 | 0.56 | 2.79 |
| post_meta:* | 440 | 18.48 | 92.78 |
| terms:* | 192 | 0.12 | 0.63 |
| options:* | 109 | 0.52 | 2.59 |
Try redis-memory-analyzer, a console tool that scans the Redis keyspace in real time and aggregates memory-usage statistics by key pattern. You can use this tool on production servers without a maintenance window. It shows you detailed statistics about each key pattern in your Redis server.
You can also scan the Redis DB by all or selected Redis types such as "string", "hash", "list", "set", "zset". Matching patterns are also supported.
RMA also tries to discern key names by pattern; for example, if you have keys like 'user:100' and 'user:101', the application will pick out the common pattern 'user:*' in the output, so you can analyze the most memory-hungry data in your instance.