I need to solve a count-distinct problem in Redis without using HyperLogLog (because of its known standard error of 0.81%).
I receive different requests with a list of objects [O1, O2, ... On] for a specific key A.
For each list of objects received, Redis should store the objects not yet saved and return the number of newly saved objects.
For Example:
Request 1 : Key: A - Objects: [O1, O2, O3] -> Response 1: Number of new objects : 3
Request 2 : Key: A - Objects: [O1, O2, O4] -> Response 2: Number of new objects : 1
Request 3 : Key: A - Objects: [O1, O2, O4] -> Response 3: Number of new objects : 0
I have tried to solve this problem with HyperLogLog and it works, but as the dataset of objects grows, the reported number of new objects becomes less accurate.
With sets and hashes, memory usage grows too much.
I have read some material about bitmaps, but it is not very clear to me. Do you have any references to projects that already face this problem?
Thanks in advance
You might want to consider using a Bloom filter. This is available as a Redis module: https://redis.com/redis-best-practices/bloom-filter-pattern/.
Bloom filters allow quick membership tests with no false negatives and a very low false-positive rate, provided you know in advance the maximum number of elements. You would need to write code along these lines:
result = bf.exists(key, item)
if result == 0:
    # not seen before: add it to the filter and bump a separate counter key
    bf.add(key, item)
    bf.inc(key_count)
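For illustration, here's a rough sketch (untested, with placeholder key names and sizing) of how that flow could look with redis-py and the RedisBloom module; note that Bloom-filter false positives mean the count can very occasionally miss a genuinely new object:

import redis

r = redis.Redis()
# Reserve a filter sized for the expected number of objects (run once; values are placeholders)
r.bf().create("objects:A", 0.001, 1_000_000)

def add_objects(key, objects):
    # BF.MADD returns 1 for every item that was not already present, 0 otherwise
    flags = r.bf().madd(key, *objects)
    return sum(flags)  # number of new objects in this request

print(add_objects("objects:A", ["O1", "O2", "O3"]))  # -> 3
print(add_objects("objects:A", ["O1", "O2", "O4"]))  # -> 1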
In Sanity Studio you get a nice list of the most recent versions of all your documents. If there is a draft you get that; if not, you get the published one.
I need the same list for a few filters and scripts. The following GROQ does the job, but it is not very fast and does not work in the new API (v2021-03-25).
*[
_type == $type &&
!defined(*[_id == "drafts." + ^._id])
]._id
A way around the breaking changes in the API is to use length() == 0 in place of !defined(), but that makes an already slow query 10-20x slower.
Does anyone know a way of making filters that consider only the latest version?
Edit: An example where I need this is when I want to see all documents without any categories. In a normal filter the document shows up regardless of whether it is the published version or the draft that has no categories. So if you add categories to a draft but don't want to publish it immediately, it will be confusing to still see the document in the no-categories list. ;-)
100 X improvement on API v2021-03-25 🥳
The only way I was able to solve this with speed was to first make a projection of the sub-query so it doesn't run once for every non-draft. Then I thought, why not project both sets and then figure out the overlap, and that was even faster! It runs more than 10x faster than was possible on API v1 and 100x faster than any of the suggestions for the new API.
{
'drafts': *[ _type == $type && _id in path("drafts.**") ]._id,
'published': *[ _type == $type && !(_id in path("drafts.**"))]._id,
}
{
'current': published[ !("drafts." + # in ^.drafts) ] + drafts
}
First I get both drafts and non-drafts and "store" them in this projection, a bit like a variable 😉
Then I start with my non-drafts (published)
And filter out any that have a counterpart in my drafts "variable"
Lastly I add all drafts to my list of filtered non-drafts
Overall I think you're on the right track. Some ideas to help you out:
Drafts are always fresher and newer than published documents, so if a given doc's _id is in path("drafts.**"), it is already the most recently updated version.
Knowing the above allows you to skip the defined(*[_id == ...]) part of the query for drafts, speeding up your execution
As drafts are already included, we can exclude published documents with a draft (defined(*[_id == "drafts." + ^._id][0]))
Notice I added a [0] to the end of the query to pick only the first element that matches. This will improve performance slightly.
For getting only documents that have no categories, use count(categoriesField) < 1
Order documents with | order(_updatedAt desc) to get the freshest documents first
And paginate your request to reduce the payload and speed things up.
Here's a sample query applying these principles (I haven't run it, so you may have to make some adjustments):
*[
_type == $type &&
// Assuming you only want those without categories:
count(categories) < 1 &&
(
// Is either a draft -> drafts are always fresher
_id in path("drafts.**") ||
// Or a published document with no draft
!defined(*[_id == "drafts." + ^._id][0])
// 👆 with the check above we're ensuring only
// published documents run the expensive defined query
)
]
// Order by last updated
| order(_updatedAt desc)
// Paginate for faster queries
[$paginationStart..$paginationEnd]
// Get only the _id, assuming that's what you want
._id
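If you need to run this outside the Studio, here's a rough, untested sketch of executing the parameterized query against Sanity's HTTP query API from Python; the project id, dataset, token and parameter values are all placeholders:

import json
import requests

PROJECT_ID = "your-project-id"   # placeholder
DATASET = "production"           # placeholder
TOKEN = "your-read-token"        # placeholder, only needed for private datasets

query = """*[
  _type == $type &&
  count(categories) < 1 &&
  (_id in path("drafts.**") || !defined(*[_id == "drafts." + ^._id][0]))
] | order(_updatedAt desc) [$paginationStart..$paginationEnd]._id"""

resp = requests.get(
    f"https://{PROJECT_ID}.api.sanity.io/v2021-03-25/data/query/{DATASET}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "query": query,
        # GROQ parameters are sent as query-string params prefixed with $, JSON-encoded
        "$type": json.dumps("article"),
        "$paginationStart": json.dumps(0),
        "$paginationEnd": json.dumps(49),
    },
)
print(resp.json()["result"])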
Hope this helps 🙌
I'm stuck with a problem using PyROOT. I'm not able to read a leaf of a tree which is a two-dimensional (48x48) array of float values. You can see the related tree in the following:
root [1] TTree *tr = (TTree*)g->Get("tevent_2nd_integral")
root [2] tr->Print()
*Tree :tevent_2nd_integral: Event packet tree 2nd GTUs integral *
*Entries : 57344 : Total = 548967602 bytes File Size = 412690067 *
: : Tree compression factor = 1.33 *
*Br 7 :photon_count_data : photon_count_data[1][1][48][48]/F *
*Entries : 57344 : Total Size= 530758073 bytes File Size = 411860735 *
*Baskets : 19121 : Basket Size= 32000 bytes Compression= 1.29 *
…
The array in question is photon_count_data[1][1][48][48]. Actually I have several ROOT files, and I tried both making a chain and merging the files with hadd, e.g. hadd file.root `ls /path/*.root`.
I tried several approaches, as I will show below, and each time I hit a different problem: in one case the numpy array that should contain the 48x48 values for each event was not created at all, in others it was empty or contained strange values (including negative ones, which is not possible).
My code is the following:
import os
import numpy as np
import ROOT as XROOT
from ROOT import TChain

# calling the root file after using hadd to merge all files
rootFile = path + "merge.root"   # path is the directory holding the ROOT files
f = XROOT.TFile(rootFile, 'read')
tree = f.Get('tevent_2nd_integral')

# making a chain
PDMchain = TChain("tevent_2nd_integral")
for filename in sorted(os.listdir(path)):
    if filename.endswith('.root') and ("CPU_RUN_MAIN" in filename):
        PDMchain.Add(filename)

pdm_counts = []
nEntries_pixel = tree.GetEntries()   # number of events to read

# First method: using a python class wrapping the leaves
leaves = tree.GetListOfLeaves()

# define dynamically a python class containing the ROOT Leaf objects
class PyListOfLeaves(dict):
    pass

# create an instance
pyl = PyListOfLeaves()
for i in range(0, leaves.GetEntries()):
    leaf = leaves.At(i)
    name = leaf.GetName()
    # dynamically add each leaf as an attribute of the instance
    pyl.__setattr__(name, leaf)

for iev in range(0, nEntries_pixel):
    tree.GetEntry(iev)
    pdm_counts.append(pyl.photon_count_data.GetValue())

# Second method: the Draw method
count = tree.Draw("photon_count_data", "", "")
pdm_counts.append(np.array(np.frombuffer(tree.GetV1(), dtype=np.float64, count=count)))

# Third method: the ROOT buffer method
for event in PDMchain:
    pdm_data_for_this_event = event.photon_count_data
    pdm_data_for_this_event.SetSize(2304)  # 48x48 ROOT buffer
    pdm_counts.append(np.array(pdm_data_for_this_event, copy=True))
With the python class method, the pdm_counts array is filled with just the first element of photon_count_data.
With the Draw method, I get a segmentation violation or a strange kernel issue.
With the ROOT buffer method, I do get back a list containing all 2304 (48x48) values, but they are completely different from those in photon_count_data, i.e. negative values or values that are orders of magnitude off.
Could you tell me where I'm going wrong, or whether there is a more elegant and quicker way to do this?
Thanks in advance
Actually, I found the solution and I would like to share it in case someone needs it!
The third method explained above,
for event in PDMchain:
    pdm_data_for_this_event = event.photon_count_data
    pdm_data_for_this_event.SetSize(2304)  # 48x48 ROOT buffer
    pdm_counts.append(np.array(pdm_data_for_this_event, copy=True))
works, but unfortunately I was using Spyder to visualize the data and for some reason it returned strange values which were not right! So... don't use Spyder!!!
Moreover, another method works fine:
from root_pandas import read_root
data = read_root('merge.root', 'tevent_2nd_integral', columns=['cpu_packet_time', 'photon_count_data'])
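As a side note (not part of the original fix), a similar read can be done with uproot, assuming the same merged file and branch name; each event comes back as a numpy array you can treat as 48x48:

import uproot

tree = uproot.open("merge.root")["tevent_2nd_integral"]
# shape is (n_events, 1, 1, 48, 48) for the fixed-size branch
photon_count_data = tree["photon_count_data"].array(library="np")
print(photon_count_data.shape)
print(photon_count_data[0, 0, 0])  # 48x48 array for the first event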
Cheers!
I am new to Neo4j and trying to do a POC by implementing a graph DB for an Enterprise Reference / Integration Architecture (an architecture showing all enterprise applications as nodes, underlying tables / APIs logically grouped as nodes, and integrations between apps as relationships).
The objective is to achieve seamless 'impact analysis' using the strengths of a graph DB (note: I understand this may be an incorrect approach for what I'm trying to achieve, so suggestions are welcome).
Let me briefly describe my question now.
There are four apps - A1, A2, A3, A4. A1 has a set of tables (represented by a node A1TS1) that is updated by Integration 1 (a relationship in this case), and the same set of tables is read by Integration 2. So the data model looks like this:
(A1TS1)<-[:INT1]-(A1)<-[:INT1]-(A2)
(A1TS1)-[:INT2]->(A1)-[:INT2]->(A4)
I have the underlying application table names captured as a List property in A1TS1 node.
Let's say one of the app tables is altered with a new column or data type, and I want to understand all impacted integrations and applications. I am trying to write a query as below to retrieve all nodes & relationships that are associated with/impacted by this table alteration, but I am not able to get it to work.
Expected Result is - all impacted nodes (A1TS1, A1, A2, A4) and relationships (INT1, INT2)
Option 1 (Using APOC)
MATCH (a {TCName:'A1TS1',AppName:'A1'})-[r]-(b)
WITH a as STRTND, Collect(type(r)) as allr
CALL apoc.path.subgraphAll(STRTND, {relationshipFilter:allr}) YIELD nodes, relationships
RETURN nodes, relationships
This fails with the error: Failed to invoke procedure 'apoc.path.subgraphAll': Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
Option 2 (Using with, unwind, collect clause)
MATCH (a {TCName:'A1TS1',AppName:'A1'})-[r]-(b)
WITH a as STRTND, Collect(r) as allr
UNWIND allr as rels
MATCH p=()-[rels]-()-[rels]-()
RETURN p
This fails with the error "Cannot use the same relationship variable 'rels' for multiple patterns", but if I use [rels] only once, like p=()-[rels]-(), it works, although it does not yield all nodes.
Any help/suggestion/lead is appreciated. Thanks in advance
Update
Trying to give more context
Showing the Underlying Data
MATCH (TC:TBLCON) RETURN TC
"TC"
{"Tables":["TBL1","TBL2","TBL3"],"TCName":"A1TS1","AppName":"A1"}
{"Tables":["TBL4","TBL1"],"TCName":"A2TS1","AppName":"A2"}
MATCH (A:App) RETURN A
"A"
{"Sponsor":"XY","Platform":"Oracle","TechOwnr":"VV","Version":"12","Tags":["ERP","OracleEBS","FinanceSystem"],"AppName":"A1"}
{"Sponsor":"CC","Platform":"Teradata","TechOwnr":"RZ","Tags":["EDW","DataWarehouse"],"AppName":"A2"}
MATCH ()-[r]-() RETURN distinct r.relname
"r.relname"
"FINREP" │ (runs between A1 to other apps)
"UPFRNT" │ (runs between A2 to different Salesforce App)
"INVOICE" │ (runs between A1 to other apps)
With this, here is what I am trying to achieve:
Assume "TBL3" is altered in app A1. I want to write a query specifying the table "TBL3" in the match pattern and get all associated relationships and connected nodes (upstream).
Maybe I need to achieve this in 3 steps:
Step 1 - Write a match pattern to find the start node and associated relationship(s)
Step 2 - Store the relationship(s) from step 1 in an array variable / parameter
Step 3 - Pass the start node from step 1 and the parameter from step 2 to apoc.path.subgraphAll to see all the impacted nodes
This may sound conceptually valid, but how to do it technically in a Neo4j Cypher query is the question.
Hope this helps clarify the question.
This query may do what you want:
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
MATCH p=(tc)-[:Foo*]-()
WITH tc,
REDUCE(s = [], x IN COLLECT(NODES(p)) | s + x) AS ns,
REDUCE(t = [], y IN COLLECT(RELATIONSHIPS(p)) | t + y) AS rs
UNWIND ns AS n
WITH tc, rs, COLLECT(DISTINCT n) AS nodes
UNWIND rs AS rel
RETURN tc, nodes, COLLECT(DISTINCT rel) AS rels;
It assumes that you provide the name of the table of interest (e.g., "TBL3") as the value of a table parameter. It also assumes that the relationships of interest all have the Foo type.
It first finds tc, the TBLCON node(s) containing that table name. It then uses a variable-length non-directional search for all paths (with non-repeating relationships) that include tc. It then uses COLLECT twice: to aggregate the list of nodes in each path, and to aggregate the list of relationships in each path. Each aggregation result would be a list of lists, so it uses REDUCE on each outer list to merge the inner lists. It then uses UNWIND and COLLECT(DISTINCT x) on each list to produce a list with unique elements.
[UPDATE]
If you differentiate between your relationships by type (rather than by property value), your Cypher code can be a lot simpler by taking advantage of APOC functions. The following query assumes that the desired relationship types are passed via a types parameter:
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
CALL apoc.path.subgraphAll(
tc, {relationshipFilter: apoc.text.join($types, '|')}) YIELD nodes, relationships
RETURN nodes, relationships;
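As an illustration only (not part of the answer above), passing the $table and $types parameters from Python with the official neo4j driver might look roughly like this; the connection details are placeholders:

from neo4j import GraphDatabase

QUERY = """
MATCH (tc:TBLCON)
WHERE $table IN tc.Tables
CALL apoc.path.subgraphAll(
    tc, {relationshipFilter: apoc.text.join($types, '|')}) YIELD nodes, relationships
RETURN nodes, relationships
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # parameters are passed as keyword arguments
    for record in session.run(QUERY, table="TBL3", types=["INT1", "INT2"]):
        print(record["nodes"], record["relationships"])
driver.close()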
With some lead from cybersam's response, the query below gets me what I want. The only constraint is that the result is limited to 3 layers (the 3rd layer coming through the OPTIONAL MATCH).
MATCH (TC:TBLCON) WHERE 'TBL3' IN TC.Tables
CALL apoc.path.subgraphAll(TC, {maxLevel:1}) YIELD nodes AS invN, relationships AS invR
WITH TC, REDUCE (tmpL=[], tmpr IN invR | tmpL+type(tmpr)) AS impR
MATCH FLP=(TC)-[]-()-[FLR]-(SL) WHERE type(FLR) IN impR
WITH FLP, TC, SL,impR
OPTIONAL MATCH SLP=(SL)-[SLR]-() WHERE type(SLR) IN impR RETURN FLP,SLP
This works for my needs, hope this might also help someone.
Thanks everyone for the responses and suggestions
Update
I enhanced the query to get rid of the OPTIONAL MATCH criteria and the other limitations mentioned above:
MATCH (initTC:TBLCON) WHERE $TL IN initTC.Tables
WITH Reduce(O="",OO in Reduce (I=[], II in collect(apoc.node.relationship.types(initTC)) | I+II) | O+OO+"|") as RF
MATCH (TC:TBLCON) WHERE $TL IN TC.Tables
CALL apoc.path.subgraphAll(TC,{relationshipFilter:RF}) YIELD nodes, relationships
RETURN nodes, relationships
Thanks all (especially cybersam)
I wonder whether it's a good or bad idea to use deepstream record.getList for storing a lot of unique values, for example emails or any other unique identifiers. The main purpose is to be able to quickly answer whether we already have, say, a user with a given email (email in use), or another record identified by a specific unique field.
I made a few experiments today and ran into two problems:
1) When I tried to populate the list with a few thousand values I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server went down. I was able to fix it by giving the server's node process more memory with this flag:
--max-old-space-size=5120
It doesn't look right, but it allowed me to make a list with more than 5000 items.
2) That wasn't enough for my tests, so I pre-created the list with 50000 items, put the data directly into the RethinkDB table, and got another issue when getting or modifying the list:
RangeError: Maximum call stack size exceeded
I was able to fix it with another flag:
--stack-size=20000
It helps, but I believe it's only a matter of time before one of those errors appears in production once the list reaches a certain size. I don't really know whether it's a Node.js, JavaScript, deepstream or RethinkDB issue. All of this made me think that I'm using the deepstream List the wrong way. Please let me know. Thank you in advance!
Whilst you can use lists to store arrays of strings, they are actually intended as collections of record names; the actual data would be stored in the records themselves, and the list would only manage the order of the records.
Having said that, there are two open GitHub issues to improve performance for very long lists, by sending more efficient deltas and by introducing a pagination option.
Interesting results with regard to memory, though; that is definitely something that needs to be handled more gracefully. In the meantime, you could drastically improve performance by combining updates into one:
var myList = ds.record.getList( 'super-long-list' );

// Sends 10,000 messages
for( var i = 0; i < 10000; i++ ) {
    myList.addEntry( 'something-' + i );
}

// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
    entries.push( 'something-' + i );
}
myList.setEntries( entries );
I have data consisting of user_ids and the tags of these user ids.
The user_ids occur multiple times and have a pre-specified number of tags (500), although that might change in the future. What must be stored is the user_id, their tags and their counts.
Later I want to easily find the tags with the top scores, etc. Every time a tag appears, its count is incremented.
My implementation in Redis is done using sorted sets:
every user_id is a sorted set
the key is the user_id, which is a hex number
It works like this:
zincrby user_id:x 1 "tag0"
zincrby user_id:x 1 "tag499"
zincrby user_id:y 1 "tag3"
and so on
Bearing in mind that I want to get the tags with the highest scores, is there a better way?
The second issue is that right now I'm using "keys *" to retrieve these keys for client-side manipulation, which I know is not intended for production systems.
Also, to avoid memory problems, it would be great to iterate through a specified number of keys at a time (in the range of 10000). I know that the keys have to be stored in memory, but they don't follow a specific pattern that would allow partial retrieval, so that I could avoid the "zmalloc" error (4 GB, 64-bit Debian server).
The keys amount to around 20 million.
Any thoughts?
My first point would be to note that 4 GB is tight for storing 20M sorted sets. A quick try shows that 20M users, each with 20 tags, would take about 8 GB on a 64-bit box (and that accounts for the sorted set ziplist memory optimizations provided with Redis 2.4 - don't even try this with earlier versions).
Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.
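For example, getting the highest-scored tags for one user is a single command (a small sketch using redis-py):

import redis

r = redis.Redis()
# top 5 tags for user_id:x, highest score first
print(r.zrevrange("user_id:x", 0, 4, withscores=True))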
As you pointed out, KEYS cannot be used to iterate on keys. It is rather meant as a debug command. To support key iteration, you need to add a data structure to provide this access path. The only structures in Redis which can support iteration are the list and the sorted set (through the range methods). However, they tend to transform O(n) iteration algorithms into O(n^2) (for list), or O(nlogn) (for zset). A list is also a poor choice to store keys since it will be difficult to maintain it as keys are added/removed.
A more efficient solution is to add an index composed of regular sets. You need to use a hash function to associate a specific user with a bucket, and add the user id to the set corresponding to this bucket. If the user ids are numeric values, a simple modulo function will be enough. If they are not, a simple string hashing function will do the trick.
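As an aside, here is a small sketch of that bucketing idea; crc32 is just one possible string hash, and the bucket count is an assumption:

import zlib

NBUCKETS = 1000

def bucket_for(user_id):
    # numeric ids: simple modulo; other ids: hash the string first
    if user_id.isdigit():
        return int(user_id) % NBUCKETS
    return zlib.crc32(user_id.encode()) % NBUCKETS

# e.g. r.sadd("index:%d" % bucket_for("1000"), "1000")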
So to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0 while user:1001 will be put in bucket index:1.
So on top of the zsets, we now have the following keys:
index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]
In the sets, the key prefix is not needed, and this allows Redis to optimize memory consumption by serializing the sets, provided they are kept small enough (the integer set optimization proposed by Sripathi Krishnan).
The global iteration consists of a simple loop over the buckets from 0 to 1000 (exclusive). For each bucket, the SMEMBERS command is applied to retrieve the corresponding set, and the client can then iterate on the individual items.
Here is an example in Python:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ----------------------------------------------------
import redis, random
POOL = redis.ConnectionPool(host='localhost', port=6379, db=0)
NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000
# ----------------------------------------------------
# Fill redis with some random data
def fill(r):
p = r.pipeline()
# Create only 10000 users for this example
for id in range(0,NUSERS):
user = "user:%d" % id
# Add the user in the index: a simple modulo is used to hash the user id
# and put it in the correct bucket
p.sadd( "index:%d" % (id%NBUCKETS), id )
# Add random tags to the user
for x in range(0,20):
tag = "tag:%d" % (random.randint(0,NTAGS))
p.zincrby( user, tag, 1 )
# Flush the pipeline every 1000 users
if id % 1000 == 0:
p.execute()
print id
# Flush one last time
p.execute()
# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags
def iterate(r):
# Iterate on the buckets of the key index
# The range depends on the function used to hash the user id
for x in range(0,NBUCKETS):
# Iterate on the users in this bucket
for id in r.smembers( "index:%d"%(x) ):
user = "user:%d" % int(id)
print user,r.zrevrangebyscore(user,"+inf","-inf", 0, 5, True )
# ----------------------------------------------------
# Main function
def main():
r = redis.Redis(connection_pool=POOL)
r.flushall()
m = r.info()["used_memory"]
fill(r)
info = r.info()
print "Keys: ",info["db0"]["keys"]
print "Memory: ",info["used_memory"]-m
iterate(r)
# ----------------------------------------------------
main()
By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.
IMO this strategy is simple and efficient, because it offers O(1) complexity to add/remove users, and true O(n) complexity to iterate over all items. The only downside is that the key iteration order is random.