GeoMesa: Using statistics in Redis

I am trying to use GeoMesa with Redis. I thought that the Redis data store enables statistics in GeoMesa by default.
My Redis GeoMesa schema:
./geomesa-redis describe-schema -u localhost:6379 -c geomesa -f SignalBuilder
INFO Describing attributes of feature 'SignalBuilder'
geo | Point (Spatio-temporally indexed) (Spatially indexed)
time | Date (Spatio-temporally indexed) (Attribute indexed)
cam | String (Attribute indexed) (Attribute indexed)
imei | String
dir | Double
alt | Double
vlc | Double
sl | Integer
ds | Integer
dir_y | Double
poi_azimuth_x | Double
poi_azimuth_y | Double
User data:
geomesa.attr.splits | 0
geomesa.feature.expiry | time(30 days)
geomesa.id.splits | 0
geomesa.index.dtg | time
geomesa.indices | z3:7:3:geo:time,z2:5:3:geo,attr:8:3:time,attr:8:3:cam,attr:8:3:cam:time
geomesa.stats.enable | true
geomesa.table.partition | time
geomesa.z.splits | 0
geomesa.z3.interval | week
From the docs (https://www.geomesa.org/documentation/stable/user/datastores/query_planning.html#stats-collected):
Stat generation can be enabled or disabled through the simple feature type user data
using the key geomesa.stats.enable
Cached statistics, and thus cost-based query planning, are currently
only implemented for the Accumulo and Redis data stores.
* Total count
* Min/max (bounds) for the default geometry, default date and any indexed attributes
* Histograms for the default geometry, default date and any indexed attributes
* Frequencies for any indexed attributes
...
Why does the response time increase as the amount of data grows?
./geomesa-redis export -u localhost:6379 -c geomesa -f SignalBuilder -q "cam like '%' and bbox(geo,38,56,39,57)" --hints STATS_STRING='Enumeration(cam)'
INFO Running export - please wait...
id,stats:String,*geom:Geometry
stat,"{""5798a065-d51e-47a1-b04b-ab48df9f1324"":203215}",POINT (0 0)
INFO Feature export complete to standard out for 1 features in 2056ms
Next request:
./geomesa-redis export -u localhost:6379 -c geomesa -f SignalBuilder -q "cam like '%' and bbox(geo,38,56,39,57)" --hints STATS_STRING='Enumeration(cam)'
INFO Running export - please wait...
id,stats:String,*geom:Geometry
stat,"{""5798a065-d51e-47a1-b04b-ab48df9f1324"":595984}",POINT (0 0)
INFO Feature export complete to standard out for 1 features in 3418ms
How can I verify that statistics are collected and saved, and that they are used when querying with stats hints like STATS_STRING='MinMax(time)' or STATS_STRING='Enumeration(cam)'?
And how do I use sampling with GeoTools? I tried the following:
geomesa-cassandra export -P 10.200.217.24:9042 -u cassandra -p cassandra \
-k geomesa -c gsm_events -f SignalBuilder \
-q "cam like '%' and time DURING 2021-12-27T16:50:38.004Z/2022-01-26T16:50:38.004Z" \
--hints SAMPLE_BY='cam';SAMPLING=0.000564
but it does not work.
Thank you for any answer.

When you run an export with a query hint for stats, GeoMesa will always run a query. If you want to use the cached statistics, use the stats-* commands instead. In code, you'd use the stats method which all GeoMesa data stores implement.
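For example, reading the cached statistics back through the CLI could look like the commands below. This is a sketch under assumptions: it presumes the Redis distribution of the tools exposes the same stats-count and stats-bounds subcommands (and the --attributes flag) documented for the Accumulo tools, with the same connection options used above.
./geomesa-redis stats-count -u localhost:6379 -c geomesa -f SignalBuilder
./geomesa-redis stats-bounds -u localhost:6379 -c geomesa -f SignalBuilder --attributes time
If those return immediately with sensible values, the cached statistics are being collected and read; the export command, by contrast, always executes the query.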

Related

How to see the data (primary and backup) on each node in Apache Ignite?

I have 6 Ignite nodes and all are connected to form a cluster. I have also configured 2 backup copies. I have sent 20 entries to the cluster to check the partitioning and the data (primary and backup), and I can see the count using the cache -a -r command.
Is there a command or another way to see the actual data on each node, both the primary data and the backup copies?
You could use cache -scan -c=cacheName
Entries in cache: SQL_PUBLIC_PERSON
+=============================================================================================================================================+
| Key Class | Key | Value Class | Value |
+=============================================================================================================================================+
| java.lang.Integer | 1 | o.a.i.i.binary.BinaryObjectImpl | SQL_PUBLIC_PERSON_.. [hash=357088963, NAME=Name1] |
+---------------------------------------------------------------------------------------------------------------------------------------------+
Use help cache to see all cache-related commands.
See: https://apacheignite-tools.readme.io/docs/command-line-interface
You also have the option of turning on SQL: https://apacheignite-sql.readme.io/docs/schema-and-indexes
and: https://apacheignite-sql.readme.io/docs/getting-started
then use JDBC/SQL to see the entries in your cache.
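For the SQL route, here is a minimal sketch using Ignite's thin JDBC driver; the host, the PERSON table and the NAME column are taken from the scan output above and are placeholders for your own schema:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CacheSql {
    public static void main(String[] args) throws Exception {
        // The thin JDBC driver connects to a single node, port 10800 by default
        Class.forName("org.apache.ignite.IgniteJdbcThinDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement();
             // PERSON / NAME correspond to the SQL_PUBLIC_PERSON cache scanned above
             ResultSet rs = stmt.executeQuery("SELECT * FROM PERSON")) {
            while (rs.next()) {
                System.out.println(rs.getString("NAME"));
            }
        }
    }
}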

JQ Pulling Inconsistent Results

If I run this very straightforward query on the JSON output of an AWS command, I get a correct count of how many AWS server instances I have in the account:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[].InstanceId'
This produces a list of 47 instance IDs, which corresponds to the number of server instances I have in the account. For example:
i-01adbf1408ef1a333
i-0f92d078ce975c138
i-0e4e117c44b17b417
and so on, up to 47 instances.
This next query still produces the correct number of results:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | [( .InstanceId ) ]'
However, if I extend the query to also include the Name tags of the servers, dramatically fewer server instances are reported:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | [( (.Tags[]|select(.Key=="Name")|.Value), .InstanceId ) ]'
This is the output of that command:
"i-08d3c05eed1316c9d"
"USAMZLAB10003","i-79eebb29"
"EOMLABAMZ1306","i-dbc98af4"
"USAMZLAB10002","i-d1dc1d83"
"i-0366c9bf18d27eb96"
"i-04d061334bc2f2d6b"
"USAMZLAB10007","i-f7a680a7"
"i-090e84eff4fece2b3"
"EOMLABAMZ1303","i-7cc98a53"
"EOMLABCSE713","i-08233926"
"i-0705eb3039cd56e04"
jq: error (at <stdin>:5013): Cannot iterate over null (null)
For some reason that query reports only 11 AWS server instances (when there should be 47). It does report servers both with and without Name tags, but it's not reporting the correct number of servers.
It also produces the jq error "Cannot iterate over null".
I have put the original JSON into this paste:
Original JSON
How can I make the error more verbose so I can find out what's going on?
And why does adding the name tag to the query dramatically reduce the number of results?
In your JSON, not all instances have a set of Tags, hence the error. You would have to handle it or substitute an empty array in its place with (.Tags // []). But overall, I would write it like this:
.Reservations[].Instances[] | [ (.Tags // [] | from_entries.Name), .InstanceId ]
How can I make the error more verbose so I can find out what's going on?
You could use debug.
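For example (a sketch; where you place debug in the pipeline is up to you), inserting debug ahead of the part that fails echoes each instance object to stderr as it is processed, so you can see which input triggers the error:
aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | debug | [( (.Tags[]|select(.Key=="Name")|.Value), .InstanceId ) ]'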
why does adding the name tag to the query dramatically reduce the number of results?
Because your jq program is at variance with your expectations; specifically, you have overlooked what happens when .Tags evaluates to null. To understand the mismatch, consider:
$ jq -n '{} | .Tags[]|select(.Key=="Name")|.Value'
Another issue is the handling of empty Tags arrays. You might like to handle that case along the lines suggested by the following:
$ jq -n '{Tags: []} | (.Tags[] | select(.Key=="Name")|.Value) // null'
null
One solution
If you want null to appear whenever there isn't a tag:
.Reservations[].Instances[]
| [ ((.Tags // [])[] | select(.Key=="Name") | .Value) // null,
.InstanceId ]
Given your input, the first two lines of the output would be:
[null,"i-08d3c05eed1316c9d"]
["USAMZLAB10003","i-79eebb29"]
Variant using try
.Reservations[].Instances[]
| [ try (.Tags[] | select(.Key=="Name")|.Value) // null,
.InstanceId ]

Execute an Impala query and get query time

I want to be able to execute a number of Impala queries and return the time it took for each query to execute. Using the Impala shell, I can do this with the following command:
impl -q "select count(*) from database.table;"
This gives me the output
Using service name 'impala'
SSL is enabled. Impala server certificates will NOT be verified (set --ca_cert to change)
Connected to *****.************:21000
Server version: impalad version 2.6.0-cdh5.8.3 RELEASE (build c644f476b774db9db87a619628f7a6ecc5f843e0)
Query: select count(*) from database.table
+----------+
| count(*) |
+----------+
| 1130976 |
+----------+
Fetched 1 row(s) in 0.86s
I want to be able to fetch that last line and extract the time. It doesn't really matter how, which is why I haven't tagged a language. I have tried using grep like this:
impl -q "select count(*) from database.table" | grep -Po "\d+\.\d+"
But that does nothing but remove the table. Putting the query in a Python script and using subprocess couldn't find impl as a command, and the same happened with Scala.
The weird thing is that impala-shell writes those messages to stderr rather than stdout, so to fetch the last line you have to append 2>&1 to redirect stderr to stdout:
impala-shell -q "query string" 2>&1 | grep -Po "\d+\.\d+(?=s)"
Notice that a positive lookahead (?=s) is probably required to avoid capturing version numbers.
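To time several queries in one go, a small wrapper along these lines should work (the query strings are placeholders; add whatever connection flags your impala-shell invocation needs):
for q in "select count(*) from database.table1" "select count(*) from database.table2"; do
    # grab the elapsed seconds from the "Fetched N row(s) in X.XXs" line
    t=$(impala-shell -q "$q" 2>&1 | grep -Po '\d+\.\d+(?=s)' | tail -1)
    echo "$q: ${t}s"
done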

DynamoDB Local to DynamoDB AWS

I've built an application using DynamoDB Local and now I'm at the point where I want to set it up on AWS. I've gone through numerous tools but have had no success finding a way to take my local DB, set up the schema, and migrate the data into AWS.
For example, I can get the data into CSV format, but AWS has no way to recognize that. It seems that I'm forced to create a Data Pipeline... Does anyone have a better way to do this?
Thanks in advance
As was mentioned earlier, DynamoDB Local is there for testing purposes. However, you can still migrate your data if you need to. One approach would be to save the data in some format, like JSON or CSV, store it in S3, and then use something like Lambdas or your own server to read from S3 and save it into your new DynamoDB tables. As for setting up the schema, you can use the same code you used to create your local tables to create the remote tables via the AWS SDK.
You can create a standalone application that gets the list of tables from the local DynamoDB and creates them in your AWS account; after that you can read all the data from each table and save it.
I'm not sure which language you're familiar with, but here are some Java APIs that might help (a sketch combining them follows the notes below).
DynamoDB.listTables();
DynamoDB.createTable(CreateTableRequest);
Example of how to create a table using the above API:
ProvisionedThroughput provisionedThroughput = new ProvisionedThroughput(1L, 1L);
try {
    CreateTableRequest groupTableRequest = mapper.generateCreateTableRequest(Group.class); // 1
    groupTableRequest.setProvisionedThroughput(provisionedThroughput); // 2
    // groupTableRequest.getGlobalSecondaryIndexes().forEach(index -> index.setProvisionedThroughput(provisionedThroughput)); // 3
    Table groupTable = client.createTable(groupTableRequest); // 4
    groupTable.waitForActive(); // 5
} catch (ResourceInUseException e) {
    log.debug("Group table already exists");
}
1 - creates the CreateTableRequest from your mapped class
2 - sets the provisioned throughput; this will vary depending on your requirements
3 - if the table has a global secondary index you can use this line (optional)
4 - the actual table is created here
5 - the thread blocks until the table becomes active
I didn't mention the APIs related to data access (insert, etc.); I assume you're familiar with them since you already use them with the local DynamoDB.
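Here is a minimal sketch of that standalone table-copying step, assuming the AWS SDK for Java v1; the local endpoint, region and throughput values are placeholders, and secondary indexes are omitted for brevity:
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.TableDescription;

public class CopySchemaToAws {
    public static void main(String[] args) {
        // Source: DynamoDB Local; target: the real AWS endpoint (region/credentials from your environment)
        AmazonDynamoDB local = AmazonDynamoDBClientBuilder.standard()
                .withEndpointConfiguration(new EndpointConfiguration("http://localhost:8000", "us-east-1"))
                .build();
        AmazonDynamoDB aws = AmazonDynamoDBClientBuilder.standard().build();

        for (String name : local.listTables().getTableNames()) {
            TableDescription table = local.describeTable(name).getTable();
            // Add global/local secondary indexes here if your tables use any
            aws.createTable(new CreateTableRequest()
                    .withTableName(table.getTableName())
                    .withKeySchema(table.getKeySchema())
                    .withAttributeDefinitions(table.getAttributeDefinitions())
                    .withProvisionedThroughput(new ProvisionedThroughput(1L, 1L))); // placeholder throughput
        }
    }
}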
I did a little work setting up my local dev environment. I use SAM to create the DynamoDB tables in AWS. I didn't want to do the work twice, so I ended up copying the schema from AWS to my local instance. The same approach can work the other way around.
aws dynamodb describe-table --table-name chess_lobby \
| jq '.Table' \
| jq 'del(.TableArn)' \
| jq 'del(.TableSizeBytes)' \
| jq 'del(.TableStatus)' \
| jq 'del(.TableId)' \
| jq 'del(.ItemCount)' \
| jq 'del(.CreationDateTime)' \
| jq 'del(.GlobalSecondaryIndexes[].IndexSizeBytes)' \
| jq 'del(.ProvisionedThroughput.NumberOfDecreasesToday)' \
| jq 'del(.GlobalSecondaryIndexes[].IndexStatus)' \
| jq 'del(.GlobalSecondaryIndexes[].IndexArn)' \
| jq 'del(.GlobalSecondaryIndexes[].ItemCount)' \
| jq 'del(.GlobalSecondaryIndexes[].ProvisionedThroughput.NumberOfDecreasesToday)' > chess_lobby.json
aws dynamodb create-table \
--cli-input-json file://chess_lobby.json \
--endpoint-url http://localhost:8000
The top command uses the describe-table AWS CLI capability to get the schema JSON. Then I use jq to delete all the unneeded keys, since create-table is strict with its parameter validation. Then I can use create-table to create the table in the local environment via the --endpoint-url parameter.
You can instead put the --endpoint-url parameter on the top command to fetch your local schema, and then run create-table without --endpoint-url to create the table directly in AWS.
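To copy the data as well, a rough sketch in the same spirit (assuming the chess_lobby table from above already exists on the AWS side; note that batch-write-item accepts at most 25 items per request and that scan output is paginated, so larger tables need to be chunked):
aws dynamodb scan --table-name chess_lobby --endpoint-url http://localhost:8000 \
    | jq '{"chess_lobby": [.Items[] | {PutRequest: {Item: .}}]}' > items.json
aws dynamodb batch-write-item --request-items file://items.json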

What is the easiest way to find the biggest objects in Redis?

I have a 20GB+ rdb dump in production.
I suspect there's a specific set of keys bloating it.
I'd like a way to spot the 100 biggest objects, either from static dump analysis or by asking the server itself, which by the way holds over 7M objects.
Dump analysis tools like rdbtools are not helpful in this (I think) really common use case!
I was thinking of writing a script to iterate over the whole keyspace with "redis-cli debug object", but I have the feeling there must be some tool I'm missing.
An option was added to redis-cli: redis-cli --bigkeys
Sample output based on https://gist.github.com/michael-grunder/9257326
$ ./redis-cli --bigkeys
# Press ctrl+c when you have had enough of it... :)
# You can use -i 0.1 to sleep 0.1 sec every 100 sampled keys
# in order to reduce server load (usually not needed).
Biggest string so far: day:uv:483:1201737600, size: 2
Biggest string so far: day:pv:2013:1315267200, size: 3
Biggest string so far: day:pv:3:1290297600, size: 5
Biggest zset so far: day:topref:2734:1289433600, size: 3
Biggest zset so far: day:topkw:2236:1318723200, size: 7
Biggest zset so far: day:topref:651:1320364800, size: 20
Biggest string so far: uid:3467:auth, size: 32
Biggest set so far: uid:3029:allowed, size: 1
Biggest list so far: last:175, size: 51
-------- summary -------
Sampled 329 keys in the keyspace!
Total key length in bytes is 15172 (avg len 46.12)
Biggest list found 'day:uv:483:1201737600' has 5235597 items
Biggest set found 'day:uvx:555:1201737600' has 47 members
Biggest hash found 'day:uvy:131:1201737600' has 2888 fields
Biggest zset found 'day:uvz:777:1201737600' has 1000 members
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
19 lists with 5236744 items (05.78% of keys, avg size 275618.11)
50 sets with 112 members (15.20% of keys, avg size 2.24)
250 hashs with 6915 fields (75.99% of keys, avg size 27.66)
10 zsets with 1294 members (03.04% of keys, avg size 129.40)
redis-rdb-tools does have a memory report that does exactly what you need. It generates a CSV file with memory used by every key. You can then sort it and find the Top x keys.
There is also an experimental memory profiler that started to do what you need. It's not yet complete, and so isn't documented, but you can try it: https://github.com/sripathikrishnan/redis-rdb-tools/tree/master/rdbtools/cli. And of course, I'd encourage you to contribute as well!
Disclaimer: I am the author of this tool.
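For reference, generating the memory report mentioned above and sorting it could look roughly like this (a sketch; it assumes the rdb command installed by redis-rdb-tools and that size_in_bytes is the fourth CSV column):
rdb -c memory /path/to/dump.rdb -f memory.csv
sort -t, -k4 -nr memory.csv | head -100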
I am pretty new to bash scripting. I came up with this:
for line in $(redis-cli keys '*' | awk '{print $1}'); do echo `redis-cli DEBUG OBJECT $line | awk '{print $5}' | sed 's/serializedlength://g'` $line; done | sort -h
This script:
Lists all the keys with redis-cli keys "*"
Gets the size of each key with redis-cli DEBUG OBJECT
Sorts the output by the size prepended to each key name
This may be very slow because bash loops through every single Redis key. With 7M keys you may need to cache the output of the keys command to a file.
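A less intrusive variant of the same idea, not covered above: if your server is on Redis 4.0 or newer, you can iterate with SCAN instead of KEYS and ask the server directly with MEMORY USAGE, for example:
redis-cli --scan | while read -r key; do
    echo "$(redis-cli MEMORY USAGE "$key") $key"
done | sort -rn | head -100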
If you have keys that follow the pattern "A:B" or "A:B:*", I wrote a tool that analyzes existing content and also monitors things such as hit rate, number of gets/sets, network traffic, lifetime, etc. The output is similar to the one below.
https://github.com/alexdicianu/redis_toolkit
$ ./redis-toolkit report -type memory -name NAME
+----------------------------------------+----------+-----------+----------+
| KEY | NR KEYS | SIZE (MB) | SIZE (%) |
+----------------------------------------+----------+-----------+----------+
| posts:* | 500 | 0.56 | 2.79 |
| post_meta:* | 440 | 18.48 | 92.78 |
| terms:* | 192 | 0.12 | 0.63 |
| options:* | 109 | 0.52 | 2.59 |
Try redis-memory-analyzer, a console tool that scans the Redis keyspace in real time and aggregates memory usage statistics by key pattern. You can use this tool on production servers without a maintenance window. It shows you detailed statistics about each key pattern in your Redis server.
You can also scan the Redis db for all or selected Redis types such as "string", "hash", "list", "set", "zset". Matching by pattern is also supported.
RMA also tries to discern key names by pattern: for example, if you have keys like 'user:100' and 'user:101', the application will pick out the common pattern 'user:*' in the output, so you can analyze the most memory-hungry data in your instance.