Can Aerospike return only specified bins? - aerospike

Simply, how can I get only specified bins out of a record?
A record has the bins:
(data1, data2 ...)
I query the record by its primary key, but I want to specify that only the data1 bin should be returned, so I don't have to pull a massive record but only the parts I want!
So the Aerospike result would be something like this:
id: (data1)
This is not a secondary index query!

Yes, in every language client there's a way to limit a get to just the bins you want.
Check the documentation for your client. In the case of Python, see https://aerospike-python-client.readthedocs.io/en/latest/

Yes, you can get specific bins from a set using the Java client:
client.get(policy, key, "bin1", "bin2");
or, with a query statement:
statement.setBinNames("bin1", "bin2", "bin3");

$ aql
aql> SELECT data1 FROM test.testset WHERE PK = 'rec1'


How to get repeatable sample using Presto SQL?

I am trying to get a sample of data from a large table and want to make sure this can be repeated later on. Other SQL engines allow repeatable sampling to be done by either setting a seed with set.seed(integer) or a repeatable (integer) clause. However, this is not working for me in Presto. Is such a command not available yet? Thanks.
One solution is to simulate the sampling by adding a column (or creating a view) with random values (such as a UUID) and then selecting rows by filtering on that column (for example, UUIDs ending with '1'). You can tune the condition to get the sample size you need.
By design, the result is random and also repeatable across multiple runs.
If you are using Presto 0.263 or higher you can use key_sampling_percent to reproducibly generate a double between 0.0 and 1.0 from a varchar.
For example, to reproducibly sample 20% of records in table using the id column:
select
id
from table
where key_sampling_percent(id) < 0.2
If you are using an older version of Presto (e.g. AWS Athena), you can use what's in the source code for key_sampling_percent:
select
id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero because of the negative exponent.
select id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
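The mechanics of this trick can be sketched outside Presto. Below is a minimal Python illustration (using the stdlib's hashlib in place of Presto's xxhash64, so the exact buckets differ) of why hashing the key yields a sample that is repeatable across runs:

```python
import hashlib

def key_bucket(key: str) -> float:
    # Hash the key to a stable 64-bit integer, then map it to [0, 1).
    # Any deterministic hash works; Presto's trick uses xxhash64.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return (h % 100) / 100.0

ids = ["user-%d" % i for i in range(1000)]

# The same predicate always selects the same rows, run after run.
sample_a = [k for k in ids if key_bucket(k) < 0.2]
sample_b = [k for k in ids if key_bucket(k) < 0.2]

print(len(sample_a), sample_a == sample_b)
```

Because the bucket is a pure function of the key, re-running the query (or running it on another machine) selects exactly the same rows, which is the repeatability property the question asks for.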
You may create a simple intermediate table with selected ids:
CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);
This will contain only the sampled ids and will be ready to use downstream in your analysis by doing a JOIN with the data of interest.

pandas read sql query improvement

So I downloaded some data from a database which conveniently has a sequential ID column. I saved the max ID for each table I am querying to a small text file which I read into memory (max_ids dataframe).
I was trying to create a query that says: give me all of the data where Idcol > max_id for that table. I was getting errors that Series are mutable, so I could not use one as a parameter. The code below ended up working, but it was literally a guess-and-check process: I turned the value into an int and then a string, which extracted the actual value from the dataframe.
Is this the correct way to accomplish what I am trying to do before I replicate this for about 32 different tables? I want to always be able to grab only the latest data from these tables which I am then doing stuff to in pandas and eventually consolidating and exporting to another database.
df = pd.read_sql_query('SELECT * FROM table WHERE Idcol > %s;', engine, params=(str(int(max_ids['table_max'])),))
Can I also make the table name more dynamic as well? I need to go through a list of tables. The database is MS SQL and I am using pymssql and sqlalchemy.
Here is an example of where I ran max_ids['table_max']:
Out[11]:
0 1900564174
Name: max_id, dtype: int64
assuming that your max_ids DF looks as following:
In [24]: max_ids
Out[24]:
   table  table_max
0  tab_a      33333
1  tab_b     555555
2  tab_c   66666666
you can do it this way:
qry = 'SELECT * FROM {} WHERE Idcol > :max_id'
for i, r in max_ids.iterrows():
    print('Executing: [%s], max_id: %s' % (qry.format(r['table']), r['table_max']))
    pd.read_sql_query(qry.format(r['table']), engine, params={'max_id': r['table_max']})
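As a sanity check of that loop, here is a self-contained sketch using the stdlib's sqlite3 in place of pymssql/SQLAlchemy (table names and values are invented for the demo); the :max_id named parameter works the same way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab_a (Idcol INTEGER, payload TEXT);
    INSERT INTO tab_a VALUES (1, 'old'), (2, 'new'), (3, 'newer');
    CREATE TABLE tab_b (Idcol INTEGER, payload TEXT);
    INSERT INTO tab_b VALUES (10, 'old'), (20, 'new');
""")

# Stand-in for the max_ids dataframe: table name -> last id already loaded.
max_ids = {"tab_a": 1, "tab_b": 10}

qry = "SELECT * FROM {} WHERE Idcol > :max_id"
latest = {}
for table, max_id in max_ids.items():
    # Table names cannot be bound parameters, hence the format();
    # the id value is passed as a real bound parameter.
    latest[table] = conn.execute(qry.format(table), {"max_id": max_id}).fetchall()

print(latest["tab_a"])  # rows newer than id 1
```

Note the split: the table name is interpolated with format() (only safe because the names come from your own list, not user input), while the id threshold is a proper bound parameter.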

Celery result backend stores a encoded string in result column

After I run an async task
tasks.add.apply_async( (10, 10))
I checked the result backend's database table celery_taskmeta and noticed the result column containing something like gAJLBC4=
I couldn't find in the docs what that result implies, or whether I can store the actual result of the function call (i.e., the return value) in the table as is.
For this instance, where I am executing a task which adds two numbers, 10 and 10, the result column in celery_taskmeta should have 20 as per my understanding (which is probably wrong).
How should I achieve that ?
I'm assuming that the result is also serialized? I'm using a Redis broker and I'm not clear which configuration I need to set to be able to retrieve the actual return value.
The best way to get the result is not to query the database directly but to use the result API:
result = tasks.add.apply_async((10, 10))
result.ready()
> True
result.result
> 20
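As for the opaque string itself: with the pickle serializer (the default in older Celery versions), the database backend stores the base64-encoded pickle of the return value, so you can decode it directly. The example string from the question happens to decode to the integer 4, so it presumably came from a different task run, not 10 + 10. Setting result_serializer = 'json' in the Celery configuration stores a human-readable value instead.

```python
import base64
import pickle

stored = "gAJLBC4="  # the value seen in celery_taskmeta's result column

# Only unpickle data you trust: pickle can execute arbitrary code.
value = pickle.loads(base64.b64decode(stored))
print(value)  # -> 4
```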

Return a SQL query where field doesn't contain specific text

I will setup a quick scenario and then ask my question: Let's say I have a DB for my warehouse with the following fields: StorageBinID, StorageReceivedDT, StorageItem, and StorageLocation.
Any single storage bin could have multiple records because of the multiple items in it. What I am trying to do is create a query that only returns storage bins that don't contain a certain item, BUT I don't want the rest of the contents. For example, let's say I have 5000 storage bins in my warehouse and I know that there are a handful of bins that do not have "ItemX" listed in the StorageItem field. I would like to return that short list of StorageBinIDs without getting a full list of all of the bins without ItemX and their full contents. (I think that rules out IN, LIKE, and CONTAIN and their NOTs.)
My workaround right now is running two queries, usually within a StorageReceivedDT. The first is the bins received with the date and then the second is the bins containing ItemX. Then import both .csv files into Excel and use a ISNA(MATCH) formula to compare the two columns.
Is this possible through a query? Thank you very much in advance for any advice.
You can do this as an aggregation query, with a having clause. Just count the number of rows where "ItemX" appears in each bin, and choose the bins where the count is 0:
select StorageBinID
from table t
group by StorageBinID
having sum(case when StorageItem = 'ItemX' then 1 else 0 end) = 0;
Note that this only returns bins that have some items in them. If you have completely empty bins, they will not appear in the results. You do not provide enough information to handle that situation (although I can speculate that you have a StorageBins table that would be needed to solve this problem).
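If you want to try the aggregation approach before running it on your real data, here is a self-contained sketch against SQLite with a made-up Storage table (the names stand in for your schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Storage (StorageBinID TEXT, StorageItem TEXT);
    INSERT INTO Storage VALUES
        ('bin1', 'ItemX'), ('bin1', 'ItemY'),
        ('bin2', 'ItemY'), ('bin2', 'ItemZ'),
        ('bin3', 'ItemX');
""")

# Bins whose ItemX count is zero -- only bin2 qualifies in this data.
rows = conn.execute("""
    SELECT StorageBinID
    FROM Storage
    GROUP BY StorageBinID
    HAVING SUM(CASE WHEN StorageItem = 'ItemX' THEN 1 ELSE 0 END) = 0
""").fetchall()
print(rows)
```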
What flavour of SQL do you use?
From the info that you gave, you could use:
select distinct StorageBinID
from table_name
where StorageBinID not in (
select StorageBinID
from table_name
where StorageItem like '%ItemX%'
)
You'll have to replace table_name with the name of your table.
If you want only exact matches (the StorageItem to be exactly "ItemX"), you should replace the condition
where StorageItem like '%ItemX%'
with
where StorageItem = 'ItemX'
Another option (should be faster; note that MINUS is Oracle syntax, while most other databases spell it EXCEPT):
select StorageBinID
from table_name
minus
select StorageBinID
from table_name
where StorageItem like '%ItemX%'
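The same toy data demonstrates the set-difference variant; SQLite spells MINUS as EXCEPT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Storage (StorageBinID TEXT, StorageItem TEXT);
    INSERT INTO Storage VALUES
        ('bin1', 'ItemX'), ('bin1', 'ItemY'),
        ('bin2', 'ItemY'), ('bin2', 'ItemZ'),
        ('bin3', 'ItemX');
""")

# All bins, minus the bins that contain ItemX anywhere in StorageItem.
rows = conn.execute("""
    SELECT StorageBinID FROM Storage
    EXCEPT
    SELECT StorageBinID FROM Storage WHERE StorageItem LIKE '%ItemX%'
""").fetchall()
print(rows)
```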

What's the least expensive way to get the number of rows (data) in a SQLite DB?

When I need to get the number of rows in a SQLite database, I run the following pseudo code.
cmd = "SELECT Count(*) FROM benchmark"
res = runcommand(cmd)
read res to get result.
But I'm not sure if it's the best way to go. What would be the optimum way to get the row count of a SQLite DB? I use Python for accessing SQLite.
Your query is correct but I would add an alias to make it easier to refer to the result:
SELECT COUNT(*) AS cnt FROM benchmark
Regarding this line:
count size of res
You don't want to count the number of rows in the result set - there will always be only one row. Just read the result out from the column cnt of the first row.
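Putting the answer together with Python's stdlib sqlite3 driver (the table contents are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE benchmark (x INTEGER)")
conn.executemany("INSERT INTO benchmark VALUES (?)", [(i,) for i in range(5)])

# COUNT(*) always yields exactly one row; read it with fetchone().
(cnt,) = conn.execute("SELECT COUNT(*) AS cnt FROM benchmark").fetchone()
print(cnt)  # -> 5
```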