Use Spark RDD as a source of data in a REST API

There is a graph that is computed with Spark and stored in Cassandra. There is also a REST API with an endpoint that returns a graph node together with its edges and the edges of those edges.
This second-degree graph may include up to 70,000 nodes. We currently use Cassandra as the database, but extracting a lot of data by key from Cassandra takes a lot of time and resources. We tried TitanDB, Neo4j and OrientDB to improve performance, but Cassandra showed the best results.
Now there is another idea: persist the RDD (or maybe a GraphX object) in the API service and, on each API call, filter the necessary data from the persisted RDD.
I guess this will work fast while the RDD fits in memory, but once it spills to disk it will behave like a full scan (e.g. a full scan of a Parquet file).
I also expect that we will face these issues:
memory leaks in Spark;
updating this RDD (unpersist the previous one, read and persist the new one) will require stopping the API;
concurrent use of this RDD will require manually managing CPU resources.
Does anybody have experience with this?

Spark is NOT a storage engine. Unless you process a large amount of data on each request, you should consider:
In-memory data grids - Hazelcast, Apache Ignite, Coherence, GigaSpaces, etc. (see the sketch below)
Cassandra in-memory - https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/inMemory.html
looking for an "in-memory" option in other frameworks/databases
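As a rough illustration of the keyed-lookup pattern these options enable (instead of filtering a persisted RDD), here is a minimal sketch using Apache Ignite's thin Python client (pyignite); the cache name and the adjacency-list layout are assumptions, not part of the original question:

```python
# Sketch only: assumes an Ignite node is reachable on localhost:10800 and that
# adjacency lists were loaded into a cache keyed by node id.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

# Hypothetical cache holding node_id -> list of neighbour ids.
edges = client.get_or_create_cache("graph_edges")

def second_degree(node_id):
    """Return the node's neighbours and their neighbours via keyed lookups,
    rather than scanning a persisted RDD on every API call."""
    first = edges.get(node_id) or []
    result = {node_id: first}
    for neighbour in first:
        result[neighbour] = edges.get(neighbour) or []
    return result
```

The point of the sketch is that each request touches only the keys it needs, and updating the data does not require restarting the API service.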

Related

Cons of using MemoryCache as a temporary copy of DB table

I have a site where you can list your car for sale. There is a list and a map with filtering on car types and other car specifications. My idea was to cache the cars table and filter on it when a user searches for a car on the website. Currently, especially when zooming in/out on the map, each time the user does that an HTTP request is made that queries the database, which can be slow and heavy on the server.
As an experiment with 1,000 items, I have cached the map data (trimmed down to only the basic info) and it's working fine. I was thinking of basically keeping a copy of the cars table, with all the needed joins, in MemoryCache and using that instead of querying the DB on every request, for both the list and the map. I would have a cron job every 5 minutes (the data can change, but it doesn't have to be reflected immediately) to update MemoryCache with the latest cars data from the DB.
What would be the cons of this approach in the long term, for example when storing 100,000 records? Besides the server needing more RAM, would there be any concerns about the scalability or usability of this approach? Would it be better to use Redis instead?
I do have a "search as you type" service in place now, but I don't really need that functionality since the filtering is fairly exact; I added it more as a caching server, and I think I would be better off just using MemoryCache until there is a real need for that kind of service.
Thank you
Since memory isn’t infinite, we need to limit the number of items stored in the In-Memory cache.
MemoryCache VS Redis
MemoryCache
MemoryCache is embedded in the process, hence it can only be used as a plain key-value store from that process.
Redis
Redis is a remote data structure server. It is certainly slower than just storing the data in local memory.
Since MemoryCache runs inside the web server of the current application, it is limited by the performance of that web server. Of course, it will be very fast under the same configuration. I think the disadvantage is that the stored data cannot be shared with other applications.
If Redis is used, reading data is not as fast as reading directly from memory with MemoryCache, but it offers high reliability and high scalability.
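To make the trade-off concrete, here is a minimal sketch of the in-process, periodically refreshed cache the question describes, written in Python purely for illustration (the original site is not Python); `load_cars_from_db` and the field names are hypothetical:

```python
# Sketch: keep a trimmed copy of the cars table in process memory and refresh
# it on a fixed interval, so requests filter in RAM instead of hitting the DB.
import threading
import time

_cars = []             # in-process copy, analogous to MemoryCache
_lock = threading.Lock()
REFRESH_SECONDS = 300  # "cron job every 5 minutes"

def load_cars_from_db():
    # Hypothetical: run the joined query and return only the columns
    # the list/map views need. Hardcoded here so the sketch runs.
    return [{"id": 1, "type": "sedan", "lat": 59.33, "lng": 18.06}]

def _refresh_loop():
    global _cars
    while True:
        fresh = load_cars_from_db()
        with _lock:
            _cars = fresh
        time.sleep(REFRESH_SECONDS)

threading.Thread(target=_refresh_loop, daemon=True).start()

def search(car_type=None):
    with _lock:
        cars = list(_cars)
    return [c for c in cars if car_type is None or c["type"] == car_type]
```

The same pattern moved into Redis looks almost identical, except the copy lives in a separate server process and can be shared by several web servers.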
Related Post:
1. How to update redis after updating database?
2. how to keep caching up to date
3. How can MySQL update data in real time in redis cache?

How to enrich events using a very large database with azure stream analytics?

I'm in the process of analyzing Azure Stream Analytics to replace a stream processing solution based on NiFi with some REST microservices.
One step is the enrichment of sensor data from a very large database of sensors (>120 GB).
Is it possible with Azure Stream Analytics? I tried with a very small subset of the data (60 MB) and couldn't even get it to run.
Job logs give me warnings that memory usage is too high. I tried scaling to 36 streaming units to see if it was even possible, to no avail.
What strategies do I have to make it work?
If I deterministically (via a hash function) partition the input stream into N partitions by ID, and then partition the database using the same hash function (so that an ID on the stream and the same ID in the database end up in the same partition), can I make this work? Do I need to create several separate Stream Analytics jobs to be able to do that?
I suppose I can use 5 GB chunks, but I could not get it to work with an ADLS Gen2 data lake. Does it really only work with Azure SQL?
Stream Analytics supports reference datasets of up to 5 GB. Please note that large reference datasets come with the downside of making job/node restarts very slow (up to 20 minutes for the reference data to be distributed; restarts may be user initiated, caused by service updates, or caused by various errors).
If you can downsize that 120 GB to 5 GB (scoping only the columns and rows you need, converting to types that are smaller in size), then you should be able to run that workload. Sadly we don't support partitioned reference data yet. This means that as of now, if you have to use ASA and can't reduce those 120 GB, you will have to deploy one distinct job for each subset of stream/reference data.
Now I'm surprised you couldn't get 60 MB of reference data to run; if you have details on what exactly went wrong, I'm happy to provide guidance.
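A minimal sketch of the deterministic partitioning idea raised in the question, shown in Python; the partition count and function names are assumptions. The key property is that the same stable hash is applied both when splitting the reference database into per-job subsets and when routing stream events, so a given ID always lands in the same job:

```python
# Sketch: route each sensor id to one of N partitions with a stable hash, so the
# input-stream subset and the reference-data subset for a given id always end up
# in the same (separately deployed) Stream Analytics job.
import hashlib

N_PARTITIONS = 24  # hypothetical: chosen to keep each reference subset under 5 GB

def partition_for(sensor_id: str) -> int:
    digest = hashlib.md5(sensor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_PARTITIONS

# Apply the same function when splitting the 120 GB reference database into
# per-partition files and when routing events to each job's input.
```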

PubSub topic with binary data to BigQuery

I'm expecting to have thousands of sensors sending telemetry data at 10 FPS with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data into BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
The question is: what's my best alternative?
I've thought about a Cloud Run service running an Express app to accept the data from Pub/Sub, using a global variable to accumulate around 500 rows in RAM and then dumping them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from the accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams one row at a time irrespective of the accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python and Go (in preview) client libraries.
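As a sketch of the accumulate-then-insert idea, here is the equivalent using the Python client (google-cloud-bigquery) rather than the Node.js client mentioned in the question; the table name, batch size and row shape are assumptions:

```python
# Sketch: buffer decoded rows in memory and flush them to BigQuery in batches
# via the legacy streaming API (insert_rows_json accepts a list of rows).
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.telemetry.frames"  # hypothetical table
BATCH_SIZE = 500

_buffer = []

def handle_frame(row: dict):
    """Called for each decoded Pub/Sub message."""
    _buffer.append(row)
    if len(_buffer) >= BATCH_SIZE:
        flush()

def flush():
    global _buffer
    if not _buffer:
        return
    errors = client.insert_rows_json(TABLE_ID, _buffer)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
    _buffer = []
```

Note that with streaming inserts the cost is per ingested data volume, so batching mainly saves HTTP round trips rather than money.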
Batch Ingestion
If your requirement is to load large, bounded data sets that don't have to be processed in real time, prefer batch loading. BigQuery batch load jobs are free; you only pay for storing and querying the data, not for loading it. Refer to the quotas and limits for batch load jobs here. Some more key points on batch load jobs, quoted from this article:
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.
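For comparison, a minimal sketch of a batch load with the same Python client, assuming the destination table already exists (the table name is hypothetical):

```python
# Sketch: batch-load accumulated rows as a load job instead of streaming them.
# Load jobs are free; you pay only for storage and queries.
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.telemetry.frames"  # hypothetical table

def batch_load(rows: list[dict]):
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_json(rows, TABLE_ID, job_config=job_config)
    job.result()  # wait for the load job to complete
```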

How to introduce Redis just for caching no CRUD

I am planning to implement a caching layer in my application using Redis. Right now the application fetches a huge amount of data from the DB whenever the user initiates a certain plan load. Behind the scenes, this plan load triggers a few heavyweight data accesses and orchestrates all the calls into the final result.
Data access currently happens through a JPA repository against my Oracle DB. When I introduced the Redis layer, it did not populate the cache on first access; instead the application tried to fetch data from the empty cache.
My questions are:
Would my design work, since I want to keep the CRUD operations as they are in the JPA repositories? I just want to introduce Redis for caching, not for CRUD operations.
I have a huge amount of data (probably 2 GB) that should sit in the cache layer. What is the maximum amount of data Redis can hold?
My questions are:
Would my design work, since I want to keep the CRUD operations as they are in the JPA repositories? I just want to introduce Redis for caching, not for CRUD operations.
It is going to work; however, your main problem will be cache invalidation.
When you do a CRUD operation, your Redis cache will still hold the old data and you will have an inconsistency. The general way of using Redis as a cache is to set a TTL (time-to-live) for each key, but you can also address the inconsistency by introducing a trigger that erases the key in Redis whenever you perform a CRUD operation on the underlying data.
Depending on your workload, you may run into cases with a low cache hit rate.
For example, if you rarely access keys in the cache, then all of them will have expired before the next access; frankly, the cache will not work effectively in this case. This can be avoided by warming the cache or by using Redis not as a cache but as secondary storage with replicated data.
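A minimal cache-aside sketch of the TTL-plus-invalidation approach described above, shown in Python with redis-py for illustration (the application itself uses JPA; key names, TTL and helper functions are hypothetical):

```python
# Sketch: read-through cache with a TTL, plus explicit invalidation on writes
# so CRUD operations don't leave stale data behind.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 3600

def load_plan(plan_id, fetch_from_db):
    key = f"plan:{plan_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    plan = fetch_from_db(plan_id)          # cache miss: go to Oracle via the repository
    r.set(key, json.dumps(plan), ex=TTL_SECONDS)
    return plan

def update_plan(plan_id, new_data, save_to_db):
    save_to_db(plan_id, new_data)          # CRUD stays in the JPA repository
    r.delete(f"plan:{plan_id}")            # invalidate so the next read repopulates
```

In a Spring application the same pattern is usually expressed declaratively with @Cacheable and @CacheEvict, but the data flow is the one shown here.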
I have a huge amount of data (probably 2 GB) that should sit in the cache layer. What is the maximum amount of data Redis can hold?
Redis is extremely efficient; it is limited by your physical resources (RAM) and by the maximum size of a key and of a value stored under a key, which is 512 MB.
You also have to account for the fact that Redis can fragment data in memory, so your source 2 GB of data, represented as keys and values, can occupy 3 GB of RAM.

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free-text search for certain fields of that data, Elasticsearch may be a better option to use from the get-go.
I want this to be auto-scaling, but I'm not sure which database would be best for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not use Elasticsearch alone because it does not provide auto-scaling for write capacity; in fact, it's not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast (everything is done in RAM), and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance, you will have to shard your Redis database, and Redis does not support that by itself; you will have to handle it in your application code. Moreover, as far as I know, Redis does not handle complex queries. You will also need to map your data onto Redis data structures, which could be an issue for you.
DynamoDB handles auto-scaling really well, but on the other hand it is a key/value database, so it does not let you run queries like "get all data for user X between a date range". DynamoDB also allows you to save your data in any format.
A solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index your keys with only the metadata (user and dates). That way your index will be small, and if you lose the ability to index because Elasticsearch gets too busy, you keep the ability to save users' data.
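A minimal sketch of that split, in Python with boto3 and the Elasticsearch client; the table name, index name, field names and key layout are all assumptions for illustration:

```python
# Sketch: keep the large payloads in DynamoDB under an opaque item key, and index
# only the small metadata (user, date) in Elasticsearch; queries hit the index
# first and then fetch the matching items from DynamoDB by key.
import boto3
from elasticsearch import Elasticsearch

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_data")        # hypothetical table with partition key "item_id"
es = Elasticsearch("http://localhost:9200")

def store(item_id, user_id, date, payload):
    table.put_item(Item={"item_id": item_id, "payload": payload})
    es.index(index="user_data_meta", id=item_id,
             document={"user_id": user_id, "date": date})

def fetch_for_user(user_id, start, end):
    hits = es.search(index="user_data_meta", query={
        "bool": {
            "filter": [
                {"term": {"user_id": user_id}},
                {"range": {"date": {"gte": start, "lte": end}}},
            ]
        }
    })["hits"]["hits"]
    return [table.get_item(Key={"item_id": h["_id"]}).get("Item") for h in hits]
```

If Elasticsearch is temporarily overloaded, writes to DynamoDB still succeed and the metadata can be re-indexed later, which is the resilience point made above.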