Is there any way to delete a set from a namespace in Aerospike, using aql or the CLI?
My set also contains LDTs.
Please suggest a way to delete the whole set, including its LDT data.
You can delete a set by using
asinfo -v "set-config:context=namespace;id=namespace_name;set=set_name;set-delete=true;"
This link explains in more detail how sets are deleted:
http://www.aerospike.com/docs/operations/manage/sets/#deleting-a-set-in-a-namespace
There is a new and better way to do this as of Aerospike Server version 3.12.0, released in March 2017:
asinfo -v "truncate:namespace=namespace_name;set=set_name"
The previous set-delete command has been DEPRECATED and works only up to Aerospike 3.12.1, released in April 2017:
asinfo -v "set-config:context=namespace;id=namespace_name;set=set_name;set-delete=true;"
The new command is better in several ways:
Can be issued during migrations
Can be issued while data is being written to the set
It is sufficient to run it on just one node of the cluster
I used it under those conditions (during migration, while data was being written, on 1 node) and it ran very quickly. A set with 30 million records was reduced to 1000 records in about 6 seconds. (Those 1000 records were presumably the ones written during those 6 seconds)
Details are in the Aerospike documentation for the truncate info command: http://www.aerospike.com/docs/reference/info#truncate
As of Aerospike 3.12, which was released in March 2017, the feature of deleting all data in a set or namespace is now supported in the database.
Please see the following documentation:
http://www.aerospike.com/docs/reference/info#truncate
It provides a command-line command that looks like:
asinfo -v "truncate:namespace=<namespace_name>;set=<set_name>;lut=<last_update_time>"
A set can also be truncated from the client APIs; here is the Java documentation:
http://www.aerospike.com/apidocs/java/com/aerospike/client/AerospikeClient.html
and scroll down to the "truncate" method.
Please note that this feature can optionally take a time specification. This allows you to delete records and then start inserting new records. Because timestamps are used, please make certain your server clocks are well synchronized.
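As a hedged illustration of that time specification (a sketch, not copied from the docs): the lut value is a last-update-time cutoff and, to my understanding, is given as a UTC timestamp in nanoseconds since the Unix epoch, so only records last updated before that cutoff are removed. Reusing the placeholder namespace/set names from above, a cutoff-based truncate might look like:
asinfo -v "truncate:namespace=namespace_name;set=set_name;lut=1483228800000000000"
Here 1483228800000000000 is an example timestamp for 2017-01-01 00:00:00 UTC; records written after that moment would be kept.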
You can also delete a set with the Java client as follows:
(1) Use the client "execute" method, which applies a UDF to all records returned by a query:
AerospikeClient.execute(WritePolicy policy, Statement statement, String packageName, String functionName, Value... functionArgs) throws AerospikeException
(2) Define the statement to include all rows of the given set:
Statement statement = new Statement();
statement.setNamespace("my_namespace");
statement.setSetName("my_set");
(3) Specify a UDF that deletes the given record:
function delete_rec(rec)
aerospike:remove(rec)
end
(4) Register the UDF module on the server (the package name "myUdf" below assumes it was registered as myUdf.lua), then call the method on your client instance:
ExecuteTask task = client.execute(null, statement, "myUdf", "delete_rec");
task.waitTillComplete(timeout);
Is it performant? Unclear, but my guess is that asinfo is better. Still, this approach is very convenient for testing/debugging/setup.
You can't delete a set itself, but you can delete all records that exist in the set by scanning them and deleting them one by one. Sample C# code that will do the trick:
AerospikeClient.ScanAll(null, AerospikeNameSpace, category, DeleteAllRecordsCallBack);
Here DeleteAllRecordsCallBack is a callback function in which you can delete records one by one. This callback gets called for every record returned by the scan.
private void DeleteAllRecordsCallBack(Key key, Record record)
{
AerospikeClient.Delete(null, key);
}
Using AQL:
TRUNCATE namespace_name.set_name
https://www.aerospike.com/docs/tools/aql/aql-help.html
Take care that you may need to restart your nodes (one by one, to avoid downtime), because truncating a set does not give back the bin-name slots it used.
The maximum number of distinct bin names in an Aerospike namespace is 32,767, and the bin-name counter is kept in RAM. If you delete and recreate your set several times, and each pass creates, for example, 10,000 new bin names, then on the fourth pass you will only be able to create 2,767 more bins.
If you restart your cluster, the counter is released.
You can't dynamically delete a set from a namespace the way you would "drop table" in an RDBMS.
The following asinfo command only lazily deletes the data inside a set:
asinfo -v "set-config:context=namespace;id=namespace_name;set=set_name;set-delete=true;"
There is a post about this on the Aerospike discussion forum, and I haven't seen any progress on the issue yet.
In our production experience, the dedicated Java deletion utility delete-set works quite well, even without the recently introduced durable deletes. You build it from source, put it somewhere near the cluster, and run it this way:
java -jar delete-set-1.0.0-jar-with-dependencies.jar -h <aerospike_host> -p 3000 -s <set_to_delete> -n <namespace_name>
In our production environment cold restarts are quite rare events, basically only when Aerospike crashes. And the data flow is quite high, so defragmentation kicks in early and we don't even have the zombie-record issue.
By the way, the asinfo approach mentioned earlier didn't work for us: the records stayed there for a couple of days, so we use the delete-set utility, which worked right away.
Related
I have a series of processes in a Nextflow pipeline, employing multiple heavy computing steps and database (SQL) inserts/fetches. I need to insert certain (intermediate) process results into the DB and fetch them later for further processing (within the same pipeline). In the most simplified form it will be something like:
process1 (fetch data from DB)
process2 (analyze process1.out)
process3 (inserts process2.out to DB)
The problem is that when any values are changed in the DB, the output from process1 is still cached (when using the -resume flag), so changes in the DB are not reflected at all.
Is there any way to force reprocessing of process1 while using -resume, ignoring its cache?
So far I have been manually deleting the respective work folder, or adding a dummy line to process1, but that is an extremely inefficient solution.
Thanks for any help here.
Result caching is enabled by default, but this feature can be disabled using the cache directive by setting its value to false. For example:
process process1 {
cache false
...
}
Not sure if we have the full picture here, but updating a database with some set of process results just to fetch them again later on seems wasteful. Or maybe I've just misunderstood. I would instead try to separate the heavy computational work (hours) from the database transactions (minutes) if at all possible. Note that if you need to make per process database transactions, you might be able to achieve this using the beforeScript and afterScript directives (which can be enabled/disabled using a nextflow.config profile, for example). For example, a beforeScript could be used to create a database object that is updated (using an afterScript) once the process has completed. Since both of these scripts are run from inside the workDir, you could use the basename of the current/working directory (i.e. the task UUID) as a key.
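As a minimal sketch of that suggestion (the helper scripts db_insert_start.sh, db_insert_finish.sh and analyze.sh are hypothetical placeholders for whatever your database transactions and analysis actually are; beforeScript and afterScript themselves are standard Nextflow directives):
process process3 {
    // both scripts run inside the task workDir, so the directory basename
    // (the task hash) can be used as a key for the database row
    beforeScript 'db_insert_start.sh "$(basename "$PWD")"'
    afterScript 'db_insert_finish.sh "$(basename "$PWD")"'

    input:
    path results

    script:
    """
    analyze.sh $results
    """
}
The two directives could then be switched on or off from a nextflow.config profile, as suggested above.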
Is it possible to query a recently expired view in BigQuery (it expired 2 hours ago) and save a snapshot?
You can try the Managing tables documentation; it has examples of how to do this in the section Restoring deleted tables.
You can undelete a table within seven days of deletion, including explicit deletions and implicit deletions due to table expiration. After seven days, it is not possible to undelete a table using any method, including opening a support ticket.
You can restore a deleted table by:
Using the @ snapshot decorator in the bq command-line tool
Using the client libraries
To restore a table, use a table copy operation with the @ snapshot decorator. First, determine a UNIX timestamp (in milliseconds) of when the table existed. Then, use the bq copy command with the snapshot decorator.
For example, enter the following command to copy mydataset.mytable at the time 1418864998000 into a new table mydataset.newtable.
bq cp mydataset.mytable@1418864998000 mydataset.newtable
(Optional) Supply the --location flag and set the value to your location.
You can also specify a relative offset. The following example copies the version of a table from one hour ago:
bq cp mydataset.mytable@-3600000 mydataset.newtable
For more information, see Restore a table from a point in time.
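For the situation in the question (a table/view that expired about two hours ago), the relative offset just has to point to a moment before the expiration, for example three hours ago (3 x 3,600,000 ms), still using the hypothetical mydataset.mytable names from above:
bq cp mydataset.mytable@-10800000 mydataset.newtable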
Note: I believe I may be missing a simple solution to this problem. I'm relatively new to programming. Any advice is appreciated.
The problem: A small team of people (~3-5) want to be able to automate, as far as possible, the filing of downloaded files in appropriate folders. Files will be downloaded into a shared downloads folder. The files in this downloads folder will be sorted into a large shared folder structure according to their file type, the URL the file was downloaded from, and so on and so forth. These files are stored on a shared server, and the actual sorting will be done by some kind of shell script running on the server itself.
Whilst there are some utilities which do this (such as Maid), they don't do everything I want them to do. Maid, for example, doesn't have a way to get the download URL of a file on Linux. Additionally, it is written in Ruby, which I'd like to avoid.
The biggest stumbling block, then, is finding a way to get the URL of the downloaded file that can be passed into the shell script. Originally I thought this could be done via getfattr, which gets a file's extended attributes. Frustratingly, however, whilst Chromium saves a file's download URL as an extended attribute, Firefox doesn't seem to do the same thing. So relying on extended attributes seems to be out of the question.
What Firefox does do, however, is store download 'metadata' in the places.sqlite file, in two separate tables: moz_annos and moz_places. Inspired by this, I decided to build a Firefox extension that writes all information about the downloaded file to a SQLite database downloads.sqlite on our server upon the completion of said download. This includes the URL, MIME type, etc. of the downloaded file.
The idea is that with this data, the server could run a shell script that does some fine-grained sorting of the downloaded file into our shared file system.
However, I am struggling to find out a stable, reliable, and portable way of 'triggering' the script that will actually move the files, as well as passing information about these files to the script so that it can sort them accordingly.
There are a few ways I thought I could go about this. I'm not sure which method is the most appropriate:
1) Watch Downloads Folder
This method would watch for changes to the shared downloads directory, use the file name of the downloaded file to query downloads.sqlite to get the matching row, and finally pass the file's attributes into a bash script that sorts said file.
Difficulties: Finding a way to reliably match the downloaded file with the appropriate record in the database. Files may have the same download name but need to be sorted differently, for example if they were downloaded from different URLs. Additionally, I'd like to get additional attributes such as whether the file was downloaded in incognito mode.
2) Create Auxiliary 'Helper' File
Upon a file download event, the extension creates a 'helper' text file, named after the downloaded file plus some marker, which contains the additional file attributes:
/Downloads/
mydownload.pdf
mydownload-downloadhelper.txt
The server can then watch for the creation of a .txt file in the downloads directory and run the necessary shell script from that.
Difficulties: Whilst this avoids using a SQLite database, it seems rather ungraceful and hacky, and I can see a multitude of ways in which this method would just break or not work.
3) Watch SQLite Database
This method writes to the shared SQLite database downloads.sqlite on the server, then watches, by some means, for a new row being inserted into this database. This could either be done by watching the SQLite database for a new INSERT on a table, or by having a SQLite trigger on INSERT that runs a bash script, passing the download information on to a shell script.
Difficulties: There doesn't seem to be any easy way to watch a SQLite database for a new row insert, and a trigger within SQLite doesn't seem to be able to launch an external script/program. I've searched high and low for a method of doing either of these two things, but I'm struggling to find any documented way to do it that I am able to understand.
What I would like is:
Some feedback on which of these methods is appropriate, or if there is a more appropriate method that I am overlooking.
An example of a system/program that does something similar to this.
Many thanks in advance.
It seems to me that you have put the cart before the horse:
Use cron to periodically check for new downloads. Process them on the command line instead of trying to trigger things from inside sqlite3:
a) Here is an approach using your shared sqlite3 database "downloads.sqlite":
Upfront once:
1) Add a table to your database containing just an integer as a record counter and a timestamp field, e.g. "table_counter":
sqlite3 downloads.sqlite "CREATE TABLE table_counter (counter INTEGER PRIMARY KEY NOT NULL, timestamp DATETIME DEFAULT (datetime('now','UTC')));" 2>/dev/null
2) Insert an initial record into this new table, setting the counter to zero and recording a timestamp:
sqlite3 downloads.sqlite "INSERT INTO table_counter VALUES (0, (SELECT datetime('now','UTC')));" 2>/dev/null
Every so often:
3) Query the table containing the downloads with a "SELECT COUNT(*)" statement:
sqlite3 downloads.sqlite "SELECT COUNT(*) from table_downloads;" 2>/dev/null
Result e.g., 20
4) Compare this number to the number stored in the record counter field:
sqlite3 downloads.sqlite "SELECT (counter) from table_counter;" 2>/dev/null
Result e.g., 17
5) If the result from 3) is greater than the result from 4), you have downloaded more files than you have processed. In that case, query the table containing the downloads with a SELECT statement for the oldest not-yet-processed download, using a subselect:
sqlite3 downloads.sqlite "SELECT * from table_downloads where rowid = (SELECT (counter+1) from table_counter);" 2>/dev/null
In my example this would SELECT all values for the data record with the rowid of 17+1 = 18;
6) Do your magic with the downloaded file stored as record #18.
7) Increase the record counter in the "table_counter", again using a subselect:
sqlite3 downloads.sqlite "UPDATE table_counter SET counter = (SELECT (counter) from table_counter)+1;" 2>/dev/null
8) Finally, update the timestamp for the "table_counter":
Why? Shit happens on shared drives... This way you can always check how many download records have been processed and when that last happened.
sqlite3 downloads.sqlite "UPDATE table_counter SET timeStamp = datetime('now','UTC');" 2>/dev/null
If you want to keep a log of this processing, then change the SQL statement in 4) to a "SELECT COUNT(*)" and the one in 7) to an INSERT, with its subselect being "(SELECT (counter+1) from table_counter)", respectively...
Please note: the redirections "2>/dev/null" at the end of the SQL statements are just there to suppress the following kind of line, which newer versions of SQLite3 print before showing your query results:
-- Loading resources from /usr/home/bernie/.sqliterc
If you don't like timeStamps based on UTC then use localtime instead:
(datetime('now','localtime'))
Put steps 3) through 8) in a shell script and use a cron entry to run this query/comparison periodically...
Use the complete /path/to/sqlite3 in this shell script (just in case it is running on a shared drive; someone could be fooling around with paths and surprise your cron...). A rough sketch of such a script follows below.
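As an illustration only, here is a sketch of that cron-driven script (call it process_downloads.sh; the table name table_downloads, the database path and all other paths are placeholders, so adjust them to your setup):
#!/bin/sh
# process_downloads.sh - steps 3) to 8) from above
SQLITE=/usr/bin/sqlite3
DB=/path/to/shared/downloads.sqlite

downloaded=$($SQLITE "$DB" "SELECT COUNT(*) from table_downloads;" 2>/dev/null)   # step 3
processed=$($SQLITE "$DB" "SELECT counter from table_counter;" 2>/dev/null)       # step 4

while [ "$downloaded" -gt "$processed" ]; do                                      # step 5
    row=$($SQLITE "$DB" "SELECT * from table_downloads where rowid = (SELECT (counter+1) from table_counter);" 2>/dev/null)
    # step 6: do your magic with "$row" here, e.g. move the file into the shared tree
    $SQLITE "$DB" "UPDATE table_counter SET counter = (SELECT (counter) from table_counter)+1;" 2>/dev/null   # step 7
    $SQLITE "$DB" "UPDATE table_counter SET timestamp = datetime('now','UTC');" 2>/dev/null                   # step 8
    processed=$((processed + 1))
done
A matching cron entry, e.g. every five minutes:
*/5 * * * * /path/to/process_downloads.sh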
b) I will give you a simpler answer using awk and a hash like md5 in a separate answer, so it is easier for future readers and easier for you to "rate" :-)
I'm trying to implement two behaviors in my system using Hiredis and Redis:
1) Fetch all keys matching a pattern, delivered via a publish event rather than via the array returned by the SCAN command (my system works only with publish events, even for gets, so I need to stick to this behavior).
2) Delete all keys matching a pattern.
After reading the manuals I understand that the SCAN command is my friend.
I see two approaches and am not sure of their pros/cons:
1) Use a Lua script that calls SCAN until the cursor comes back as 0, and publishes an event / deletes the key for each entry found.
2) Use a Lua script, but return the cursor as the result and call the Lua script again from the hiredis client with the new cursor until it returns 0.
Other ideas would also be welcome.
My database is not huge at all: no more than 500k entries, with keys/values of less than 100 bytes.
Thank you!
Option 1 is probably not ideal to run inside a Lua script, since it blocks all other requests from being executed. SCAN works best when you drive it from your application, so Redis can still process other requests between iterations.
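As a rough sketch of that "drive SCAN from outside" idea (the pattern myprefix:* and the channel deleted-keys are made-up placeholders), the same loop can even be demonstrated with redis-cli, which iterates SCAN for you:
# delete all keys matching a pattern, in batches, without blocking the server
redis-cli --scan --pattern 'myprefix:*' | xargs -L 100 redis-cli DEL

# publish each matching key as an event instead of returning an array
redis-cli --scan --pattern 'myprefix:*' | while read -r key; do
    redis-cli PUBLISH deleted-keys "$key"
done
In hiredis you would implement the same cursor loop with repeated SCAN calls (your option 2), which keeps each individual command short and lets the server interleave other work.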
As part of an ongoing research project, I am checking whether a URL exists using the cURL command. I have been running a shell script for a couple of days, and it does some updates for each URL in my database. However, the script seems to update only around 100,000 rows per day.
I was thinking that if I could write the values to a file first and then do the updates, the execution might be faster.
I am connecting to the database from the command line:
mysql -h servername -u username -ppassword databasename -e "Update Query"
For example, instead of connecting to the database 2 million times like the above and updating 2 million rows one at a time, I am planning to connect to the database only once from the command line and update the 2 million rows from a file.
So is the second approach better than the first one, or would the time difference be negligible?
Three approaches:
You could use LOAD DATA INFILE.
You could build up a .sql file with all of the updates you need and run it in a single mysql session.
You could use something other than the CLI to connect to the URLs and the DB. In other words, instead of the curl and mysql commands, use a real programming language and its libraries for checking URLs and updating databases.
Any of those would probably be faster, though you'll likely get a bigger speed improvement by making the HTTP calls in parallel. You can do that more easily with a real programming language.
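For illustration, a minimal sketch of the second approach (the file names updates.sql and url_list.txt, the table urls and its columns id and status are made up; adapt them to your schema). It appends one UPDATE per checked URL to a file and then sends the whole file to MySQL over a single connection:
#!/bin/sh
# build one big batch of UPDATE statements instead of one mysql call per row
: > updates.sql
while read -r id url; do
    code=$(curl -o /dev/null -s -w '%{http_code}' "$url")
    echo "UPDATE urls SET status = '$code' WHERE id = $id;" >> updates.sql
done < url_list.txt    # assumed format: "<id> <url>" per line

# one connection for all 2 million statements (a surrounding transaction helps further)
mysql -h servername -u username -ppassword databasename < updates.sql
Parallelizing the curl checks themselves (for example with xargs -P) would speed things up even more, as noted above.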