Running Analytics on DataStax DSE Graph

I have a large graph on DataStax DSE and I'm trying to run some analytics queries.
I started with simple ones and noticed something odd.
Running without Analytics
gremlin> :remote config alias g test.g
==>g=test.g
gremlin> g.V().hasLabel("person").has("id",5903806).count()
==>1
gremlin>
Running with Analytics
gremlin> :remote config alias g test.a
==>g=test.a
gremlin> g.V().hasLabel("person").has("id",5903806).count()
==>0
gremlin>
Could this be a configuration issue?
Many Thanks

This issue was due to specifying a float value while the "id" property was a Text type. Wrapping the id in quotes, i.e. has("id", "5903806") instead of has("id", 5903806), solved the issue.

CristiC, we have an open defect for counts and OLAP ComputeEngine right now that will be addressed very shortly. I can't tell if what you're seeing is related to this, but it likely is. What version of DSE Graph are you using?

Related

PySpark and Elasticsearch connectivity and creating a dataframe from an index

I'm having some challenges connecting Elasticsearch and PySpark. I want to create dataframes from an index and query them. I have all the required versions set up, but I haven't found anything specific in the documentation. Can anyone here guide me through the end-to-end process? It would be very helpful to have your support.
Based on my research, I have tried passing the configuration while creating the Spark session, but I'm not sure whether it's right or wrong.
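For reference, a minimal sketch of reading an Elasticsearch index into a PySpark dataframe, assuming the elasticsearch-spark (elasticsearch-hadoop) connector matching your Spark/Scala version is on the classpath; the connector coordinates, host, port, and index name below are placeholders:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("es-read")
    # Assumption: the connector is pulled in via spark.jars.packages (or --jars);
    # pick the artifact/version matching your Spark, Scala, and Elasticsearch versions.
    .config("spark.jars.packages", "org.elasticsearch:elasticsearch-spark-30_2.12:8.4.1")
    .getOrCreate()
)

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")       # Elasticsearch host (placeholder)
    .option("es.port", "9200")
    .option("es.nodes.wan.only", "true")   # often needed for cloud/managed clusters
    .load("my-index")                      # index name (placeholder)
)

df.createOrReplaceTempView("my_index")
spark.sql("SELECT * FROM my_index LIMIT 10").show()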

Flink Table API streaming S3 sink throws SerializedThrowable exception

I am trying to write a simple Table API S3 streaming sink (CSV format) using Flink 1.15.1 and I am facing the following exception:
Caused by: org.apache.flink.util.SerializedThrowable: S3RecoverableFsDataOutputStream cannot sync state to S3. Use persist() to create a persistent recoverable intermediate point.
at org.apache.flink.fs.s3.common.utils.RefCountedBufferingFileStream.sync(RefCountedBufferingFileStream.java:111) ~[flink-s3-fs-hadoop-1.15.1.jar:1.15.1]
at org.apache.flink.fs.s3.common.writer.S3RecoverableFsDataOutputStream.sync(S3RecoverableFsDataOutputStream.java:129) ~[flink-s3-fs-hadoop-1.15.1.jar:1.15.1]
at org.apache.flink.formats.csv.CsvBulkWriter.finish(CsvBulkWriter.java:110) ~[flink-csv-1.15.1.jar:1.15.1]
at org.apache.flink.connector.file.table.FileSystemTableSink$ProjectionBulkFactory$1.finish(FileSystemTableSink.java:642) ~[flink-connector-files-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.BulkPartWriter.closeForCommit(BulkPartWriter.java:64) ~[flink-file-sink-common-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.closePartFile(Bucket.java:263) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.prepareBucketForCheckpointing(Bucket.java:305) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.onReceptionOfCheckpoint(Bucket.java:277) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.snapshotActiveBuckets(Buckets.java:270) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.snapshotState(Buckets.java:261) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSinkHelper.snapshotState(StreamingFileSinkHelper.java:87) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.connector.file.table.stream.AbstractStreamingWriter.snapshotState(AbstractStreamingWriter.java:129) ~[flink-connector-files-1.15.1.jar:1.15.1]
In my config, I am reading from Kafka and writing to S3 (s3a) using the Table API, with checkpoints configured against s3p (Presto). I even tried a simple datagen example instead of Kafka and I get the same issue. I think I am following exactly the steps mentioned in the docs, and the exception above is not very helpful. It fails exactly when the code triggers the checkpoint, but I don't have any clue beyond that. Could someone please help me understand what I am missing here? I can't find any open issue with such logs.
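For context, here is a minimal PyFlink sketch of the setup described above (a datagen source writing CSV to an S3 filesystem sink with checkpointing enabled); the bucket path, column names, and checkpoint interval are placeholders, and per the answer below this is the combination that triggers the reported exception:
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# Checkpointing must be enabled for the streaming filesystem sink to commit files.
t_env.get_config().get_configuration().set_string("execution.checkpointing.interval", "60s")

# Datagen source standing in for Kafka.
t_env.execute_sql("""
    CREATE TABLE src (
        id BIGINT,
        name STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '10'
    )
""")

# S3 CSV filesystem sink (bucket path is a placeholder).
t_env.execute_sql("""
    CREATE TABLE s3_sink (
        id BIGINT,
        name STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3a://my-bucket/output/',
        'format' = 'csv'
    )
""")

t_env.execute_sql("INSERT INTO s3_sink SELECT id, name FROM src").wait()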
Looks like a bug. I've raised it here (after discussing with the community). Sadly, I am not able to find any workaround for the Table API S3 CSV streaming sink. There is a similar issue here for the DataStream API, with a workaround.

ADT Explorer error: can't put models into graph view

I followed this guide to set up the ADT Explorer. I can upload models into the model view, but I can't put any of them into the graph view. I get this error: Error in instance creation: SyntaxError: unexpected token o in JSON at position 1.
It doesn't seem like there is anything wrong with the models, since they work with an older version of the ADT Explorer.
Could any of the packages be causing this problem? Or could it be that my PC didn't complete the console installation properly for some reason?
Edit: I can put models into the graph view in earlier versions of the ADT Explorer, then close it, start up the latest version again, and run a query to get the models into the graph view. So the problem seems to be with creating new twins, either straight from the models themselves or by importing a graph.
Something is wrong with the ADT Explorer, I think; I can't make it work. As a workaround, use the az CLI (MS Link):
az dt twin create -n <ADT_instance_name> --dtmi "<dtmi>" --twin-id <twin-id>
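If the CLI route works, the same can also be scripted; below is a hedged sketch using the azure-digitaltwins-core Python SDK (the instance URL, model id, twin id, and property names are placeholders and depend on your ADT instance and model):
from azure.identity import DefaultAzureCredential
from azure.digitaltwins.core import DigitalTwinsClient

# Placeholders: replace with your ADT instance host name and ids.
client = DigitalTwinsClient(
    "https://<ADT_instance_name>.api.<region>.digitaltwins.azure.net",
    DefaultAzureCredential(),
)

twin = {
    "$metadata": {"$model": "<dtmi>"},  # DTMI of a model already uploaded to the instance
    # "someProperty": "someValue",      # properties defined by that model (placeholder)
}

created = client.upsert_digital_twin("<twin-id>", twin)
print(created)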

Does DataStax DSE 5.1 Search support the Solr local parameter as used in facet.pivot?

I understand that DSE 5.1 runs Solr version 6.0.
I am trying to use the facet.pivot feature with a Solr local parameter, but it does not seem to be working.
My data has just four simple fields.
What I need is to group the results by the name field so as to get sum(money) for each year. I believe facet.pivot with a local parameter can solve this, but it is not working with DSE 5.1.
From the Solr documentation, under "Combining Stats Component With Pivots":
In addition to some of the general local parameters supported by other types of faceting, a stats local parameter can be used with facet.pivot to refer to stats.field instances (by tag) that you would like to have computed for each Pivot Constraint.
Here is what I want to use.
stats=true&stats.field={!tag=piv1}money&facet=true&facet.pivot={!stats=piv1}name
If you're trying to execute these queries from solr_query within CQL, the stats component is not supported. We keep the faceting to simple parameters, as the purpose is to provide more GROUP BY type functionality in solr_query, not analytics.
With DSE 5.1 (Solr 6.0.1), if you want analytics with Solr, use Solr's JSON Facet API over HTTP. It has replaced the stats component and provides what you are looking for in a more robust fashion.
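For illustration, a hedged sketch of an equivalent JSON Facet API request over HTTP using Python; the host, core name (keyspace.table for DSE Search), and the name/year/money field names are assumed from the question:
import json
import requests

# Placeholders: DSE Search / Solr host and core name.
url = "http://localhost:8983/solr/my_keyspace.my_table/select"

facet = {
    "names": {
        "type": "terms",
        "field": "name",
        "facet": {
            "years": {
                "type": "terms",
                "field": "year",
                # sum of money per name per year
                "facet": {"total_money": "sum(money)"}
            }
        }
    }
}

resp = requests.get(url, params={"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)})
print(resp.json()["facets"])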

Failed to import large data as a dataframe from Google BigQuery to Google Cloud Datalab

I tried two approaches to import a large table from Google BigQuery (about 50,000,000 rows, 18 GB) into a dataframe in Google Datalab, in order to do machine learning using TensorFlow.
First I use (all needed modules are imported):
data = bq.Query('SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME`').execute().result().to_dataframe()
Then it just keeps running, seemingly forever.
Even if I add LIMIT 1000000, nothing changes.
Second, I use:
data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID')
It runs well at first, but when it gets to about 450,000 rows (calculated from the percentage and the total row count), it gets stuck at:
Got page: 32; 45.0% done. Elapsed 293.1 s.
And I cannot find how to enable allowLargeResults in read_gbq().
As its documentation suggests, I try:
data = pd.read_gbq(query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000', dialect ='standard', project_id='PROJECT_ID', configuration = {'query': {'allowLargeResult': True}})
Then I get:
read_gbq() got an unexpected keyword argument 'configuration'
So I failed to import even 1,000,000 rows into Google Cloud Datalab.
I actually want to import 50 times that amount of data.
Any idea about it?
Thanks
Before loading large datasets into Google Cloud Datalab, make sure to consider alternatives such as those mentioned in the comments of this answer. Use sampled data for the initial analysis, determine the correct model for the problem, and then use a pipeline approach, such as Google Cloud Dataflow, to process the large dataset.
There is an interesting discussion regarding Datalab performance improvements when downloading data from BigQuery to Datalab here. Based on those performance tests, a performance improvement was merged into Google Cloud Datalab in Pull Request #339. This improvement does not appear to be mentioned in the release notes for Datalab, but I believe the fixes are included as part of Datalab 1.1.20170406. Please check the version of Google Cloud Datalab to make sure that you're running at least version 1.1.20170406. To check the version, first click on the user icon in the top right corner of the navigation bar in Cloud Datalab, then click About Datalab.
Regarding the pandas.read_gbq() command that appears to be stuck, I would like to offer a few suggestions:
Open a new issue in the pandas-gbq repository here.
Try extracting data from BigQuery to Google Cloud Storage in CSV format, for example, which you can then load into a dataframe using pd.read_csv. Here are two methods to do this:
Using Google BigQuery/Cloud Storage CLI tools:
Using the bq and gsutil command line tools, extract data from BigQuery to Google Cloud Storage, and then download the object into Google Cloud Datalab. To do this, run bq extract <source_table> <destination_uris>, followed by gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [LOCAL_DESTINATION].
Using Google Cloud Datalab:
import google.datalab.bigquery as bq
import google.datalab.storage as storage
bq.Query(<your query>).execute(output_options=bq.QueryOutput.file(path='gs://<your_bucket>/<object name>', use_cache=False)).result()
result = storage.Bucket(<your_bucket>).object(<object name>).download()
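A possible continuation, assuming the exported object is a single CSV file and that the download() call above returns its raw bytes:
import io
import pandas as pd

# 'result' holds the bytes downloaded from Cloud Storage above.
df = pd.read_csv(io.BytesIO(result))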
Regarding the error read_gbq() got an unexpected keyword argument 'configuration': the ability to pass arbitrary keyword arguments (configuration) was added in version 0.20.0. I believe this error is caused by the fact that pandas is not up to date. You can check the installed pandas version by running
import pandas
pandas.__version__
To upgrade to version 0.20.0, run pip install --upgrade pandas pandas-gbq. This will also install pandas-gbq, which is an optional dependency for pandas.
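After upgrading, the configuration keyword should be accepted; a hedged sketch mirroring the call above (project id and column list are the question's own placeholders, and note that the BigQuery key name is allowLargeResults):
import pandas as pd

df = pd.read_gbq(
    query='SELECT {ABOUT_30_COLUMNS...} FROM `TABLE_NAME` LIMIT 1000000',
    project_id='PROJECT_ID',
    dialect='standard',
    # Requires pandas >= 0.20.0 with pandas-gbq installed.
    configuration={'query': {'allowLargeResults': True}},
)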
Alternatively, you could try iterating over the table in Google Cloud Datalab. This works, but it's likely slower. This approach was mentioned in another StackOverflow answer here: https://stackoverflow.com/a/43382995/5990514
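A rough sketch of that chunked approach, under the assumption that google.datalab.bigquery.Table.to_dataframe accepts start_row and max_rows as in the linked answer (table name and chunk size are placeholders):
import pandas as pd
import google.datalab.bigquery as bq

table = bq.Table('DATASET.TABLE_NAME')  # placeholder
chunk_size = 500000                     # placeholder

chunks = []
start = 0
while True:
    # Pull one slice of the table at a time to limit memory pressure.
    chunk = table.to_dataframe(start_row=start, max_rows=chunk_size)
    if len(chunk) == 0:
        break
    chunks.append(chunk)
    start += chunk_size

data = pd.concat(chunks, ignore_index=True)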
I hope this helps! Please let me know if you have any issues so I can improve this answer.
Anthonios Partheniou
Contributor at Cloud Datalab
Project Maintainer at pandas-gbq