Why does the vertex count work only in Development mode in DSE Graph? - datastax

When I try to get the vertex count in DSE Graph using
g.V().count()
in Production mode, the following error is displayed:
Could not find a suitable index to answer graph query and graph scans are disabled
But the same query works in Development mode. Am I doing something wrong?
I am following the quick start guide but created my own schema:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/QuickStartStudio.html
Schema:
schema.propertyKey("id").Int().single().create()
schema.propertyKey("name").Text().single().create()
schema.edgeLabel("created").multiple().create()
schema.vertexLabel("feed").properties("id").create()
schema.vertexLabel("user").properties("id", "name").create()
schema.vertexLabel("feed").index("byFeedId").materialized().by("id").add()
schema.vertexLabel("user").index("byUser").materialized().by("id").add()

TL;DR
Counting across your vertices is an expensive query that is not expected to be run in production in an OLTP use case.
Why?
In DSE Graph, vertices are stored in adjacency lists, partitioned by vertex type. To count all vertices, you would have to do a full table scan over each of these, which is impractical in a distributed transactional system (you'd have to hit an entire replica set, which could be a lot of nodes).
It's a real query. What do I do?
If this is a real use case, it is probably an analytical one, in which case you should leverage the Spark graph computer by running the query in analytics mode.
Note: the full table scan will still take time, but it will be executed in a massively distributed fashion via DSE Analytics.
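As a sketch of what "analytics mode" looks like from the Gremlin console (the graph name `mygraph` is an assumption; in Studio you would instead switch the query profile to the analytic/OLAP one): pointing the traversal source at the `.a` (analytic) alias instead of the default `.g` (OLTP) alias routes the traversal through the Spark graph computer.

```groovy
// Alias g to the analytic (OLAP) traversal source of the graph
// ("mygraph" is a placeholder for your graph's name):
:remote config alias g mygraph.a

// The same count now runs as a distributed Spark job
// instead of an (index-blocked) OLTP scan:
g.V().count()
```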

Related

Singlestore (MemSQL)

I have a Singlestore (previously MemSQL) cloud database set up.
My software is running in the background, constantly writing to a table.
When I try to query this table, it takes 10+ seconds. When the software is shut off, the query takes milliseconds.
What would be the reason for this? And is there anything that can be done to mitigate against this?
From a high level, cluster resources are much more utilized while the background software constantly writes to the table. The same resources that handle the constant writes are concurrently trying to serve the query, so it makes sense that it's faster when there is no writing.
A "knob to turn" with regard to database ingest performance is partition count: you can try creating a test DB with more partitions than the current DB (say 2x more). Then try querying the test DB, both while the background software is running and while it is not, and compare this to the DB with fewer partitions.
For general guidance on troubleshooting query performance, see this section of the docs: https://docs.singlestore.com/managed-service/en/query-data/query-procedures/troubleshooting-poorly-performing-queries.html
If you're an active customer, you can file a support ticket for the issue to get some additional analysis of the backend workings.
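As a sketch of the partition-count experiment described above (the database name and partition counts are illustrative; SingleStore picks a default partition count if you omit the clause):

```sql
-- Create a test database with more partitions than the current one
-- (here 16, assuming the current DB uses 8), load the same table into
-- it, and compare query latency under write load in both databases.
CREATE DATABASE test_db PARTITIONS 16;

-- Check how the partitions were laid out across the cluster:
SHOW PARTITIONS ON test_db;
```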

How to enrich events using a very large database with azure stream analytics?

I'm in the process of evaluating Azure Stream Analytics to replace a stream processing solution based on NiFi with some REST microservices.
One step is the enrichment of sensor data from a very large database of sensors (>120GB).
Is it possible with Azure Stream Analytics? I tried with a very small subset of the data (60MB) and couldn't even get it to run.
The job logs give me warnings that memory usage is too high. I tried scaling to 36 streaming units to see if it was even possible, to no avail.
What strategies do I have to make it work?
If I deterministically partition the input stream (via a hash function) into N partitions by ID, and then partition the database using the same hash function (so that an ID on the stream and the same ID in the database end up in the same partition), can I make this work? Do I need to create several separate Stream Analytics jobs to be able to do that?
I suppose I can use 5GB chunks, but I could not get it to work with an ADLS Gen2 data lake. Does it really only work with Azure SQL?
Stream Analytics supports reference datasets of up to 5GB. Please note that large reference datasets come with the downside of making job/node restarts very slow (up to 20 minutes for the reference data to be distributed); restarts may be user-initiated, triggered by service updates, or caused by various errors.
If you can downsize that 120GB to 5GB (scoping only the columns and rows you need, and converting to types that are smaller in size), then you should be able to run that workload. Sadly, we don't support partitioned reference data yet. This means that, as of now, if you have to use ASA and can't reduce those 120GB, you will have to deploy one distinct job for each subset of stream/reference data.
Now, I'm surprised you couldn't get 60MB of reference data to run; if you have details on what exactly went wrong, I'm happy to provide guidance.
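The hash-partitioning idea from the question can be sketched in plain Python (a minimal illustration, not ASA code; the names and partition count are arbitrary): the same deterministic hash routes a sensor's events and its reference row to the same partition, so each of the N jobs only needs to load its own reference subset.

```python
import hashlib

def partition_for(sensor_id: str, n_partitions: int) -> int:
    """Deterministically map an ID to a partition with a stable hash.
    (Python's built-in hash() is salted per process, hence hashlib.)"""
    digest = hashlib.md5(sensor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

# Shard both the reference data and the stream with the same function,
# so a sensor's events and its reference row land in the same job.
N = 4
reference_rows = [{"id": f"sensor-{i}", "meta": "..."} for i in range(10)]
events = [{"id": f"sensor-{i % 10}", "value": i} for i in range(100)]

ref_shards = {p: [r for r in reference_rows if partition_for(r["id"], N) == p]
              for p in range(N)}
event_shards = {p: [e for e in events if partition_for(e["id"], N) == p]
                for p in range(N)}
```

Each `(event_shards[p], ref_shards[p])` pair would then feed one of the N separate jobs.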

Apache Impala - YARN like CPU utilization report for queries (on Cloudera)

We have YARN and Impala co-located on the same Cloudera cluster. The YARN utilization report and YARN history server provide valuable information such as YARN CPU (vcores) and memory usage.
Does something like that exist for Impala, where I can fetch CPU and memory usage per query and for the Cloudera cluster as a whole?
Precisely, I want to know how many vcores are utilized out of the CPU allocation.
For example, if an Impala query takes 10s to execute and, let's say, it used 4 vcores and 50MB of RAM, how do I find out that 4 vcores were utilized?
Is there any direct way to query this from the cluster, or any other method to compute the CPU utilization?
You can get a lot of information through the Cloudera Manager Charts. You can find an overview of all available metrics on their website or by clicking on the help symbol on the right side when creating a new chart.
There are quite a few categories for Impala that might be worth a read for you, for example the general Impala metrics and the Impala query metrics. The query metrics, for instance, contain "memory_usage", measured in bytes, and the general metrics contain "impala_query_cm_cpu_milliseconds_rate" and "impala_query_memory_accrual_rate". These seem to be relevant for your use case, but check them out and the linked sites to see which ones fit.
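For pulling one of those metrics programmatically, a rough sketch against the Cloudera Manager timeseries API follows (the host, port 7180, and API version are assumptions; adjust them to your CM deployment and check the tsquery syntax against your CM version's docs):

```python
from urllib.parse import urlencode

def cm_timeseries_url(cm_host: str, tsquery: str, api_version: str = "v19") -> str:
    """Build a Cloudera Manager timeseries API URL for a tsquery.
    Port 7180 and the API version are illustrative assumptions."""
    base = f"http://{cm_host}:{7180}/api/{api_version}/timeseries"
    return f"{base}?{urlencode({'query': tsquery})}"

# e.g. the per-query CPU rate metric mentioned above:
url = cm_timeseries_url(
    "cm.example.com",
    "SELECT impala_query_cm_cpu_milliseconds_rate WHERE category = SERVICE",
)
```

Fetching `url` with your CM credentials (e.g. via `requests.get(url, auth=...)`) would return the time series as JSON.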
More information is available from the service page of the Impala service in your Cloudera Manager. You can find out more about this page here, but for example the linked page mentions:
The Impala Queries page displays information about Impala queries that are running and have run in your cluster. You can filter the queries by time period and by specifying simple filtering expressions.
It also allows you to display "Threads: CPU Time" and "Work CPU Time" for each query, which again could be relevant for you.
That is all the information available from Impala.

Hive or HBase for reporting?

I am trying to understand what would be the best big data solution for reporting purposes?
Currently I narrowed it down to HBase vs Hive.
The use case is that we have hundreds of terabytes of data across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance:
Show all users that logged in to the system in the last hour whose origin is the US.
Show a graph from the most played games to the least played games.
From all users in the system, show the percentage of paying vs non-paying users.
For a given user, show his entire history: how many games did he play, what kind of games, and what was his score in each and every game?
The way I see it, there are 3 solutions:
Store all data in Hadoop and run the queries in Hive. This might work, but I am not sure about the performance. How will it perform when the data is 100 TB? Also, having Hadoop as the main database is probably not the best solution, as update operations will be hard to achieve, right?
Store all data in HBase and run the queries using Phoenix. This solution is nice, but HBase is a key/value store. If I join on a key that is not indexed, then HBase will do a full scan, which will probably be even worse than Hive. I can put indexes on columns, but that would require an index on almost every column, which I think is not the best practice.
Store all data in HBase and run the queries in Hive, which communicates with HBase through its proprietary bridge.
Responses to each of your suggested solutions (based on my personal experience with a similar problem):
1) You should not think of Hive as a regular RDBMS, as it is best suited for immutable data. So doing updates through Hive is like killing your box.
2) As Paul suggested in the comments, you can use Phoenix to create indexes, but we tried it and it was really slow with the data volume you describe (we saw slowness in HBase with ~100 GB of data).
3) Hive over HBase is slower than Phoenix (we tried it, and Phoenix worked faster for us).
If you are going to do updates, then HBase is the best option you have, and you can use Phoenix with it. However, if you can make the updates in HBase, dump the data into Parquet, and then query it using Hive, it will be super fast.
You can use a lambda architecture: HBase along with a stream-compute tool such as Spark Streaming. You store data in HBase, and when new data arrives, you update both the original data and the report via stream compute. When a new report is created, you can generate it from a full scan of HBase; after that, the report can be updated by stream compute. You can also use a MapReduce job to reconcile the stream-compute results periodically.
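The lambda idea can be illustrated with a toy sketch (plain Python, names made up; real batch and speed layers would be HBase scans and Spark Streaming jobs): a batch view is built from a full scan, then kept fresh by folding each streamed event into it.

```python
def build_batch_view(rows):
    """Batch layer: aggregate play counts per user from a full scan."""
    view = {}
    for r in rows:
        view[r["user"]] = view.get(r["user"], 0) + r["plays"]
    return view

def apply_event(view, event):
    """Speed layer: fold one streamed event into the existing view."""
    view[event["user"]] = view.get(event["user"], 0) + event["plays"]

history = [{"user": "alice", "plays": 2}, {"user": "bob", "plays": 1}]
view = build_batch_view(history)                   # report from full scan
apply_event(view, {"user": "alice", "plays": 3})   # incremental update
```

A periodic batch job rebuilding `view` from scratch plays the role of the reconciliation step mentioned above.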
The first solution (store all data in Hadoop and query with Hive) won't allow you to update data; you can only insert into the Hive table. Plain Hive is pretty slow; in my opinion it's better to use Hive LLAP or Impala. I've used Impala: it shows pretty good performance, but it can efficiently run only one query at a time. Certainly, updating rows isn't possible in Impala either.
The third solution will give really slow join performance. I've tried Impala with HBase, and joins work extremely slowly.
On the ratio of data size to cluster size for Impala, see https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_cluster_sizing.html
If you need row updates, you can try Apache Kudu.
Here you can find the integration guide for Kudu with Impala: https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_kudu.html
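As a hedged sketch of what the Kudu route looks like from Impala (table and column names are made up), Kudu tables declare a primary key and support UPSERT, which plain HDFS-backed Hive/Impala tables lack:

```sql
-- Kudu-backed table in Impala; the primary key enables row-level updates.
CREATE TABLE user_stats (
  user_id BIGINT PRIMARY KEY,
  games_played INT,
  total_score BIGINT
)
PARTITION BY HASH (user_id) PARTITIONS 16
STORED AS KUDU;

-- UPSERT inserts the row, or updates it in place if the key exists:
UPSERT INTO user_stats VALUES (42, 10, 9001);
```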

Use spark RDD as a source of data in a REST API

There is a graph that is computed on Spark and stored in Cassandra. There is also a REST API with an endpoint to get a graph node with its edges and the edges of those edges.
This second-degree graph may include up to 70,000 nodes. We currently use Cassandra as the database, but extracting a lot of data by key from Cassandra takes a lot of time and resources. We tried TitanDB, Neo4j and OrientDB to improve performance, but Cassandra showed the best results.
Now there is another idea: persist the RDD (or maybe a GraphX object) in the API service and, on each API call, filter the necessary data out of the persisted RDD.
I guess that it will work fast while the RDD fits in memory, but if it gets cached to disk it will work like a full scan (e.g. a full scan of a Parquet file).
I also expect that we will face these issues:
a memory leak in Spark;
updating this RDD (unpersisting the previous one, reading the new data, and persisting the new one) will require stopping the API;
concurrent use of this RDD will require manually managing CPU resources.
Does anybody have experience with this?
Spark is NOT a storage engine. Unless you are processing a big amount of data each time, you should consider:
In-memory data grids - Hazelcast, Apache Ignite, Coherence, GigaSpaces, etc.
Cassandra in-memory - https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/inMemory.html
searching for an "in-memory" option in other frameworks/databases