Use multiple fields as key in aerospike loader - aerospike

I want to upload a PSV file with records holding key statistics for a physician, a location and a practice, stored per day.
A unique key for such an entry would consist of:
physician name,
practice name,
location name, and
date of service.
Four fields altogether.
The configuration file example for the Aerospike loader only shows a single-key version, and I am not seeing the syntax for multiple key fields.
Can someone please advise me whether this is possible (a configuration listing multiple key fields drawn from columns of the loaded file), and show me an example?

Join the keys into one string. For readability, use a separator such as ":".
It might be useful to know that Aerospike does not store the original keys; it stores digests (hashes) instead.
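Not the loader configuration syntax itself, but a minimal Java sketch of the composite-key idea (the field values, namespace and set name below are made up); the same concatenation can also be done as a preprocessing step on the PSV columns before feeding the file to the loader:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

public class CompositeKeyExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Join the four key fields into one string with ":" as the separator.
        String physician = "Jane Doe";
        String practice = "Main Street Clinic";
        String location = "Springfield";
        String dateOfService = "2017-05-01";
        String userKey = String.join(":", physician, practice, location, dateOfService);

        // Aerospike hashes this string into a 20-byte digest; the original string
        // is only kept on the server if the write policy sets sendKey = true.
        Key key = new Key("test", "physician_stats", userKey);
        client.put(null, key, new Bin("visits", 12));
        client.close();
    }
}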

There is no simple answer as to the "best way"; it depends on what you want to query, and at what speed and scale. Your data model should reflect how you want to read the data, and at what latency and throughput.
If you want low latency (1-5 ms) and high throughput (100k operations per second) access to a particular piece of data, you will need to aggregate the data as you write it to Aerospike and store it under a composite key that lets you fetch it quickly, e.g. doctor-day-location.
If you want statistical analysis over a period of time, and the query can take a few seconds to several minutes, then you can store the data in a less structured format and run Aerospike aggregations on it, or even use Hadoop or Spark directly on the Aerospike data.

You can create a byte buffer, convert each key field into bytes, and append them to the buffer. When reading, however, you will need to know the data types (the layout of the key) to extract the fields from the byte buffer again.
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.{IntegerType, LongType, StringType}
import com.aerospike.client.Key

// Append each key field to the buffer in an order-preserving layout.
val keyVal = ArrayBuffer[Byte]()
for (j <- 0 until keyIndex.length) {
  val field = schema(keyIndex(j))
  field.dataType match {
    case StringType =>
      // variable-length: append the UTF-8 bytes of the string
      keyVal ++= row(keyIndex(j)).asInstanceOf[String].getBytes(StandardCharsets.UTF_8)
    case IntegerType =>
      // 4 bytes for an Int
      keyVal ++= ByteBuffer.allocate(4).putInt(row(keyIndex(j)).asInstanceOf[Int]).array()
    case LongType =>
      // 8 bytes for a Long
      keyVal ++= ByteBuffer.allocate(8).putLong(row(keyIndex(j)).asInstanceOf[Long]).array()
  }
}
val key: Key = new Key(namespace, set, keyVal.toArray)
keyIndex = array containing the indexes of the key fields.
schema = schema of the fields.
row = a single record to be written.
When extracting the values you need to know the layout of the key: if you built it from int, int, long, you can decode the first 4 bytes as an Int, the next 4 as an Int and the last 8 as a Long (strings are variable-length, so store their length as well or place them last).
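For illustration, a small Java sketch of that decoding, assuming a key packed as int, int, long (the values are made up):
import java.nio.ByteBuffer;

public class KeyDecoder {
    public static void main(String[] args) {
        // Pack int, int, long the same way the writer did.
        byte[] keyBytes = ByteBuffer.allocate(16)
                .putInt(42)
                .putInt(7)
                .putLong(1478897100L)
                .array();

        // Decode: first 4 bytes -> int, next 4 bytes -> int, last 8 bytes -> long.
        ByteBuffer buf = ByteBuffer.wrap(keyBytes);
        int first = buf.getInt();
        int second = buf.getInt();
        long last = buf.getLong();
        System.out.printf("%d %d %d%n", first, second, last);
    }
}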

Related

How to trim a list/set in ScyllaDB to a specific size?

Is there a way to trim a list/set to a specific size (in terms of number of elements)?
Something similar to LTRIM command on Redis (https://redis.io/commands/ltrim).
The goal is to insert an element into a list/set while ensuring that its final size is always <= X (discarding old entries).
Example of what I would like to be able to do:
CREATE TABLE images (
name text PRIMARY KEY,
owner text,
tags set<text> // A set of text values
);
-- single command
UPDATE images SET tags = ltrim(tags + { 'gray', 'cuddly' }, 10) WHERE name = 'cat.jpg';
-- two commands (Redis style)
UPDATE images SET tags = tags + { 'gray', 'cuddly' } WHERE name = 'cat.jpg';
UPDATE images SET tags = ltrim(tags, 10) WHERE name = 'cat.jpg';
No, there is no such operation in Scylla (or in Cassandra).
The first reason is efficiency: As you may be aware, one reason why writes in Scylla are so efficient is that they do not do a read: Appending an element to a list just writes this single item to a sequential file (a so-called "sstable"). It does not need to read the existing list and check what elements it already has. The operation you propose would have needed to read the existing item before writing, slowing it down significantly.
The second reason is consistency: What happens if multiple operations like the one you propose are done in parallel, reaching different coordinators and replicas in different orders? What happens if, after earlier problems, one of the replicas is missing one of the values? There is no magic way to solve these problems, and the general solution that Scylla offers for concurrent read-modify-write operations is LWT (Lightweight Transactions). You can emulate your ltrim operation using LWT, but it will be significantly slower than ordinary writes. You will need to read the list to the client, modify it (append, ltrim, etc.) and then write it back with an LWT (with the extra condition that it still has its old value, or using an additional "version number" column).
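I have not run this, but a rough Java sketch of that read-modify-write with the DataStax Java driver (works against Scylla or Cassandra), using the images table from the question; the keyspace name is a placeholder, and since a CQL set has no insertion order, "old entries" below just means arbitrary ones:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class LtrimEmulation {
    static final int MAX_TAGS = 10;

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("images_ks").build()) {
            // 1. Read the current value (assumes the row already exists).
            Row row = session.execute(
                    "SELECT tags FROM images WHERE name = 'cat.jpg'").one();
            Set<String> oldTags = new LinkedHashSet<>(row.getSet("tags", String.class));

            // 2. Modify it client-side: append, then trim down to MAX_TAGS elements.
            Set<String> newTags = new LinkedHashSet<>(oldTags);
            newTags.add("gray");
            newTags.add("cuddly");
            Iterator<String> it = newTags.iterator();
            while (newTags.size() > MAX_TAGS) {
                it.next();
                it.remove();
            }

            // 3. Write back with an LWT condition: only apply if nobody changed it meanwhile.
            PreparedStatement ps = session.prepare(
                    "UPDATE images SET tags = ? WHERE name = ? IF tags = ?");
            boolean applied = session.execute(ps.bind(newTags, "cat.jpg", oldTags)).wasApplied();
            if (!applied) {
                System.out.println("Concurrent update detected; retry the read-modify-write.");
            }
        }
    }
}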

Migrating records to another set on Aerospike

Current Setup:
I have a single namespace running on the Aerospike cluster. The namespace has few sets in it.
My Use-case:
I want to copy all the records from one set (which has ~100 records) to another, new set (keeping the same schema) under the same namespace.
My Finding:
I did some deep dive and found out a few solutions using aql:
List all the records from the first set and insert them one by one into the new set.
Pros: Simple to implement.
Cons: Time-consuming and prone to manual error.
Use the asbackup/asrestore commands.
Pros: Immune to manual error.
Cons: It doesn't allow changing the set name during restoration, which I can't afford. Aerospike's FAQ does provide a workaround, but again it is risky.
Help Needed:
Is there any efficient way to migrate data from one set to another, with less effort and less validation? I did think of writing some Java code that would scan the entire set and write those records into another set, but again that falls into the first category I explained earlier.
Thanks!
A record in Aerospike is stored under the hash of your key and your set name. The set name is "stored" with that record purely as metadata.
So you can scan the entire namespace, return the records belonging to that set, and in the scan callback write each of them back as a new record (new because of the different set name).
You will, however, have to know "your key" for each record that comes back from the scan. By default Aerospike only stores the 20-byte hash digest as the key of the record, so unless you stored the key explicitly, either with sendKey set to true or in a bin, I don't see how you would identify "your key".
Storing "your key" in a bin is easiest; you may have to first update all your 100 records and add a bin that holds it.
Then, in the scan callback, where records arrive in no particular order, you will be able to compose a new Key from "your key" and the new set name. You will have to write your own Java code for it. (If you have "your key" in the original records, it is easy to do.)
I have not tested this .. but something along these lines would work assuming original records had your key in the "mykey" bin.
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.ScanCallback;

// Scan every record of set1 and write it back under set2, using the "mykey" bin as the user key.
client.scanAll(null, "test", "set1", new ScanCallback() {
    public void scanCallback(Key key, Record record) throws AerospikeException {
        String mykey = (String) record.getValue("mykey");
        String bin1data = record.getString("bin1"); // alternate way to read a string bin
        Key reckey = new Key("test", "set2", mykey);
        client.put(null, reckey, new Bin("bin1", bin1data));
    }
});

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
  "business_guid": "b4f16300-8e78-4358-b3d2-b29436eaeba8",
  "ingress_timestamp": 1523053808,
  "client": {
    "name": "Jake",
    "age": 55
  },
  "transactions": [
    {
      "name": "peanut",
      "amount": 100
    },
    {
      "name": "avocado",
      "amount": 2
    }
  ]
}
All invoices are stored in ADLS and can be queried. But it is my desire to provide access to the same data inside an ADL database.
I am not an expert on unstructured data: I have an RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables.
1 table - client info could be denormalized into the invoice data. But transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>.
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like asking: do you prefer your sauce on the pasta or next to the pasta? :) The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized, which works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions), want the benefits of normalization and the right indexing, and are not limited by the row-size limits (e.g., your array of maps would need to fit into a single row). So I would recommend that approach unless your transaction data is always small, you always access the data together, and you mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that either preserves the correlation of parent to child (normally done by stepwise downward navigation with CROSS APPLY), using the key value of the parent data item as the foreign key, or has the extractor generate the key (as an int or GUID).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON-to-rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON-reader based, so you can process larger documents without loading them into memory.
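Not U-SQL, but a minimal Java sketch of the flattening idea (using Jackson and a generated GUID as the invoice ID; the row layouts are illustrative), just to show how the parent key propagates to the transaction rows:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.UUID;

public class InvoiceFlattener {
    public static void main(String[] args) throws Exception {
        String json = "{\"business_guid\":\"b4f16300-8e78-4358-b3d2-b29436eaeba8\","
                + "\"ingress_timestamp\":1523053808,"
                + "\"client\":{\"name\":\"Jake\",\"age\":55},"
                + "\"transactions\":[{\"name\":\"peanut\",\"amount\":100},"
                + "{\"name\":\"avocado\",\"amount\":2}]}";

        JsonNode root = new ObjectMapper().readTree(json);

        // Generate the surrogate invoice key once, at the parent level.
        String invoiceId = UUID.randomUUID().toString();

        // invoice row: invoice_id | business_guid | ingress_timestamp | client_name | client_age
        System.out.printf("invoice: %s|%s|%d|%s|%d%n",
                invoiceId,
                root.get("business_guid").asText(),
                root.get("ingress_timestamp").asLong(),
                root.get("client").get("name").asText(),
                root.get("client").get("age").asInt());

        // transaction rows carry the parent key as a foreign key.
        for (JsonNode tx : root.get("transactions")) {
            System.out.printf("transaction: %s|%s|%d%n",
                    invoiceId, tx.get("name").asText(), tx.get("amount").asInt());
        }
    }
}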

What InfluxDB schema is suitable for these measurements?

I have data about the status of my server collected over the years: temperatures, fan speeds, CPU load, SMART data. They are stored in an SQLite database in various tables, each one specific to a type of data.
I'm switching to InfluxDB for easier graphing (Grafana) and future expansion: the data will include values from another server and also UPS data (voltages, battery, ...).
I read the guidelines about schemas in InfluxDB, but I'm still confused because I have no experience on the topic. I found another question about a schema recommendation, but I cannot apply it to my case.
How should I approach the problem, and how do I design an appropriate schema for the time series? What should I put in tags and what in fields? Should I use a single "measurement" or should I create multiple ones?
These are the data I am starting with:
CREATE TABLE "case_readings"(date, sensor_id INTEGER, sensor_name TEXT, Current_Reading)
CREATE TABLE cpu_load(date, load1 REAL, load2 REAL, load3 REAL)
CREATE TABLE smart_readings(date, disk_serial TEXT, disk_long_name TEXT, smart_id INTEGER, value)
Examples of actual data:
case_readings:
"1478897100" "4" "01-Inlet Ambient" "20.0"
"1478897100" "25" "Power Supply 1" "0x0"
cpu_load:
"1376003998" "0.4" "0.37" "0.36"
smart_readings:
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "194" "26 (Min/Max 16/76)"
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "195" "0/174553172"
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "196" "0"
"1446075624" "50026B732C022B93" "KINGSTON SV300S37A60G" "230" "100"
This is my idea for an InfluxDB schema. I use uppercase placeholders for the actual values, and spaces only when a string actually contains spaces:
case_readings,server=SERVER_NAME,sensor_id=SENSOR_ID "sensor name"=CURRENT_READING DATE
cpu_readings,server=SERVER_NAME load1=LOAD1 load2=LOAD2 load3=LOAD3 DATE
smart_readings,server=SERVER_NAME,disk=SERIAL,disk="DISK LONG NAME" smart_id=VALUE DATE
I found the schema used by an official Telegraf plugin for the same IPMI readings I have:
ipmi_sensor,server=10.20.2.203,unit=degrees_c,name=ambient_temp \
status=1i,value=20 1458488465012559455
I will convert my old data into that format; I have all the required fields stored in my old SQLite DB. I will modify the plugin to save the name of the server instead of the IP, which here at home is more volatile than the name itself. I will also probably reduce the precision of the timestamps to milliseconds or seconds.
Using that one as an example, I understand that the schema I proposed for the CPU readings could be improved:
cpu,server=SERVER_NAME,name=load1 value=LOAD1 DATE
cpu,server=SERVER_NAME,name=load2 value=LOAD2 DATE
cpu,server=SERVER_NAME,name=load3 value=LOAD3 DATE
However, I am still considering the one I proposed, without indexing the individual values:
cpu,server=SERVER_NAME load1=LOAD1 load2=LOAD2 load3=LOAD3 DATE
For SMART data my proposal was also not optimal, so I will use:
smart_readings,server=SERVER_NAME,serial=SERIAL,name="DISK LONG NAME",\
smart_id=SMART_ID,smart_description=SMART_DESCRIPTION \
value=VALUE value_raw=VALUE_RAW DATE
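A minimal Java sketch of that conversion for the old cpu_load rows, using the influxdb-java client (the URL, credentials and database name are placeholders):
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;
import java.util.concurrent.TimeUnit;

public class CpuLoadMigration {
    public static void main(String[] args) {
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "pass");

        // One old SQLite cpu_load row: date, load1, load2, load3
        long date = 1376003998L;
        double load1 = 0.4, load2 = 0.37, load3 = 0.36;

        // Wide format: one point per reading, the server name as a tag.
        Point point = Point.measurement("cpu")
                .time(date, TimeUnit.SECONDS)
                .tag("server", "SERVER_NAME")
                .addField("load1", load1)
                .addField("load2", load2)
                .addField("load3", load3)
                .build();

        influxDB.write("servers", "autogen", point);
        influxDB.close();
    }
}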

How to add filter to match the value in a map bin Aerospike

I have a requirement where I have to find a record in Aerospike based on an attributeId. The data in Aerospike is in the below format:
{
  name=ABC,
  id=xyz,
  ts=1445879080423,
  inference={2601=0.6}
}
Now I will be getting the value "2601" programmatically and I should find this record based on that value. But the problem is that the value is a key in a map, and the map may contain more than one entry, like
inference={{2601=0.6},{2830=0.9},{2931=0.8}}
So how can I find this record using the attributeId in Java? Any suggestions are much appreciated.
A little-known feature of Aerospike is that, in addition to a bin's scalar value, you can define a secondary index on:
List values
Map Keys
Map Values
Using an index defined on the map keys of the "inference" bin, you will be able to query (filter) based on the key's name.
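I have not tested this against your exact data, but a sketch with the Java client would look something like the following (the namespace and set names are placeholders, and it assumes the map keys such as "2601" are stored as strings; use IndexType.NUMERIC if they are integers):
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Record;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexCollectionType;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class MapKeyQuery {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // One-time step: secondary index on the KEYS of the "inference" map bin.
        client.createIndex(null, "test", "profiles", "inference_keys_idx",
                "inference", IndexType.STRING, IndexCollectionType.MAPKEYS)
              .waitTillComplete();

        // Query: return records whose "inference" map contains the key "2601".
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("profiles");
        stmt.setFilter(Filter.contains("inference", IndexCollectionType.MAPKEYS, "2601"));

        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                Record record = rs.getRecord();
                System.out.println(record.getValue("inference"));
            }
        } finally {
            rs.close();
        }
        client.close();
    }
}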
I hope this helps