Load CSV data from AWS S3 in DSE Graph Loader

I have data on AWS S3 (in CSV format) and I want to load that data into DSE Graph using the Graph Loader. I have searched but found nothing on this topic. Is it possible using the DSE Graph Loader?

Here's how the mapping looks for the graph loader when reading from CSVs:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/dgl/dglCSV.html
Here's an HDFS example (also with CSV files); S3 should be similar (just swap the dfs_uri):
// Configures the data loader to create the schema
config create_schema: true, load_new: true, preparation: true
// Define the data input sources
// dfs_uri specifies the URI to the HDFS directory in which the files are stored.
dfs_uri = 'hdfs://host:port/path/'
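// For S3 the rest of the mapping should stay the same; presumably only the URI changes, e.g.
// (an untested sketch; bucket name and path are placeholders):
// dfs_uri = 's3://my-bucket/path/'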
authorInput = File.csv(dfs_uri + 'author.csv.gz').gzip().delimiter('|')
//Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "author"
    key "name"
}
// graphloader call
./graphloader myMap.groovy -graph testHDFS -address localhost
// start gremlin console and check the data
bin/dse gremlin-console
:remote config reset g testHDFS.g
schema.config().option('graph.schema_mode').set('Development')
g.V().hasLabel('author')

Related

How to read s3 rasters with accompanying ".aux.xml" metadata file using rasterio?

Suppose a GeoTIFF raster in an S3 bucket which has, next to the raw TIF file, an associated .aux.xml metadata file:
s3://my_s3_bucket/myraster.tif
s3://my_s3_bucket/myraster.tif.aux.xml
I'm trying to load this raster directly from the bucket using rasterio:
fn = 's3://my_s3_bucket/myraster.tif'
with rasterio.Env(session, **rio_gdal_options):
    with rasterio.open(fn) as src:
        src_nodata = src.nodata
        scales = src.scales
        offsets = src.offsets
        bands = src.tags()['bands']
And this seems to be a problem. The raster file itself is successfully opened, but because rasterio did not automatically load the associated .aux.xml, the metadata was never loaded. Therefore, no band tags, no proper scales and offsets.
I should add that doing exactly the same on a local file does work perfectly. The .aux.xml automatically gets picked up and all relevant metadata is correctly loaded.
Is there a way to make this work on S3 as well? And if not, could there be a workaround for this problem? Obviously, the metadata was too large to be encoded into the TIF file itself; rasterio (GDAL under the hood) generated the .aux.xml automatically when creating the raster.
Finally got it to work. It appears to be essential that in the GDAL options passed to rasterio.Env, .xml is added as an allowed extension in CPL_VSIL_CURL_ALLOWED_EXTENSIONS.
The documentation of this option states:
Consider that only the files whose extension ends up with one that is listed in CPL_VSIL_CURL_ALLOWED_EXTENSIONS exist on the server.
And while almost all examples found online set only .tif as the allowed extension (because doing so can dramatically speed up file opening), this means any .aux.xml files are never seen by rasterio/GDAL.
So if we expect .aux.xml metadata files to accompany the .tif files, we have to change the example to:
rio_gdal_options = {
    'AWS_VIRTUAL_HOSTING': False,
    'AWS_REQUEST_PAYER': 'requester',
    'GDAL_DISABLE_READDIR_ON_OPEN': 'FALSE',
    'CPL_VSIL_CURL_ALLOWED_EXTENSIONS': '.tif,.xml', # Adding .xml is essential!
    'VSI_CACHE': False
}
with rasterio.Env(session, **rio_gdal_options):
    with rasterio.open(fn) as src:  # The associated .aux.xml file will automatically be found and loaded now
        src_nodata = src.nodata
        scales = src.scales
        offsets = src.offsets
        bands = src.tags()['bands']
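For completeness: the session object passed to rasterio.Env above is never defined in this thread. A minimal sketch of how it is typically constructed, assuming rasterio's AWSSession wrapper and credentials available through the default boto3 chain (names here are illustrative):
import boto3
from rasterio.session import AWSSession

# Wrap a boto3 session so GDAL/rasterio can sign the S3 requests
session = AWSSession(boto3.Session())
# It is then passed to rasterio.Env exactly as in the snippets above:
# with rasterio.Env(session, **rio_gdal_options): ...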

AzureML: error "The SSL connection could not be established, see inner exception." while creating Tabular Dataset from Azure Blob Storage file

I have a new error using Azure ML, maybe due to the Ubuntu upgrade to 22.04 which I did yesterday.
I have an Azure ML workspace created through the portal and I can access it without any issue with the Python SDK:
from azureml.core import Workspace
ws = Workspace.from_config("config/config.json")
ws.get_details()
output
{'id': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.MachineLearningServices/workspaces/azml_lk',
'name': 'azml_lk',
'identity': {'principal_id': 'XXXXX',
'tenant_id': 'XXXXX',
'type': 'SystemAssigned'},
'location': 'westeurope',
'type': 'Microsoft.MachineLearningServices/workspaces',
'tags': {},
'sku': 'Basic',
'workspaceid': 'XXXXX',
'sdkTelemetryAppInsightsKey': 'XXXXX',
'description': '',
'friendlyName': 'azml_lk',
'keyVault': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.Keyvault/vaults/azmllkXXXXX',
'applicationInsights': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.insights/components/azmllkXXXXX',
'storageAccount': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.Storage/storageAccounts/azmllkXXXXX',
'hbiWorkspace': False,
'provisioningState': 'Succeeded',
'discoveryUrl': 'https://westeurope.api.azureml.ms/discovery',
'notebookInfo': {'fqdn': 'ml-azmllk-westeurope-XXXXX.westeurope.notebooks.azure.net',
'resource_id': 'XXXXX'},
'v1LegacyMode': False}
I then use this workspace ws to upload a file (or a directory) to Azure Blob Storage like so
from azureml.core import Dataset
ds = ws.get_default_datastore()
Dataset.File.upload_directory(
    src_dir="./data",
    target=ds,
    pattern="*dataset1.csv",
    overwrite=True,
    show_progress=True
)
which again works fine and outputs
Validating arguments.
Arguments validated.
Uploading file to /
Filtering files with pattern matching *dataset1.csv
Uploading an estimated of 1 files
Uploading ./data/dataset1.csv
Uploaded ./data/dataset1.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Creating new dataset
{
"source": [
"('workspaceblobstore', '//')"
],
"definition": [
"GetDatastoreFiles"
]
}
My file is indeed uploaded to Blob Storage and I can see it either in the Azure portal or in Azure ML studio (ml.azure.com).
The error comes up when I try to create a Tabular dataset from the uploaded file. The following code doesn't work:
from azureml.core import Dataset
data1 = Dataset.Tabular.from_delimited_files(
    path=[(ds, "dataset1.csv")]
)
and it gives me the error:
ExecutionError:
Error Code: ScriptExecution.DatastoreResolution.Unexpected
Failed Step: XXXXXX
Error Message: ScriptExecutionException was caused by DatastoreResolutionException.
DatastoreResolutionException was caused by UnexpectedException.
Unexpected failure making request to fetching info for Datastore 'workspaceblobstore' in subscription: 'XXXXXX', resource group: 'gr_louis', workspace: 'azml_lk'. Using base service url: https://westeurope.experiments.azureml.net. HResult: 0x80131501.
The SSL connection could not be established, see inner exception.
| session_id=XXXXXX
After some research, I assumed it might be due to the OpenSSL version (which is now 1.1.1), but I am not sure and I surely don't know how to fix it... any ideas?
According to the documentation, there is no direct procedure to convert a file dataset into a tabular dataset. Instead, we can create a workspace, which creates two storage methods (blob storage, which is the default, and file storage). The SSL will be taken care of by the workspace.
We can create a datastore in the workspace and connect that to the blob storage.
Follow the procedure to do the same.
Create a workspace
If we want, we can create a dataset.
We can create it from local files or from a datastore.
To choose a datastore, we first need to have a file in that datastore.
Go to Datastores and click on Create dataset. Observe that the name is workspaceblobstorage (default).
Fill in the details and make sure the dataset type is Tabular.
In the path we provide the local file path, and under "Select or create a datastore" we can check that the default storage is blob.
After uploading, we can see the name in this section, which is a datastore-backed tabular dataset.
In your workspace, check whether public access is Disabled or Enabled. If it is disabled, access will not be allowed because the SSL connection cannot be established. After enabling it, use the same procedure that was implemented up to now.
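Once public network access on the workspace is enabled, the original call from the question should go through. A minimal sketch, reusing the workspace and default datastore from the question (the registered dataset name is illustrative):
from azureml.core import Dataset, Workspace

ws = Workspace.from_config("config/config.json")
ds = ws.get_default_datastore()

# Retry the tabular dataset creation from the question
data1 = Dataset.Tabular.from_delimited_files(path=[(ds, "dataset1.csv")])

# Optionally register it so it shows up under Datasets in the studio
data1 = data1.register(workspace=ws, name="dataset1", create_new_version=True)
print(data1.to_pandas_dataframe().head())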

Serverless offline CUSTOM: using external file AND internally added variables?

I am having a weird problem where I need to use Serverless "custom:" variables read both from an external file and internally from the serverless.yml file.
Something like this:
custom: ${file(../config.yml)}
  dynamodb:
    stages:
      - local
...except this doesn't work (I get a "bad indentation of a mapping entry" error).
I'm not sure if that's possible and, if so, how to do it. Please help :)
The reason is that the serverless-dynamodb-local plugin won't work if its config is set in the external file. But we use the external file config in our project and we don't want to change that.
So I need to have the dynamodb config separately in the serverless.yml file, I'm just not sure of the proper way to do it.
Please someone help :) Thanks
You will either have to put all your vars in the external file or import each var from the external file one at a time, as ${file(../config.yml):foo}.
However... you can also use JS instead of YAML/JSON and create a serverless.js file instead, allowing you to build your config programmatically if you need more power. I have fairly complex needs for my stuff and have about 10 YAML files for all the different services. For my offline sls I need to add extra stuff and modify some other things, so I just read the YAML files using Node, parse them into JSON, build what I need, and then export that.
Here's an example of loading multiple configs and exporting a merged one:
import { readFileSync } from 'fs'
import { resolve } from 'path'
import yaml from 'yaml-cfn'
import merge from 'deepmerge'

const { yamlParse } = yaml
const root = './' // wherever the configs reside

// List of yml files to read
const files = [
  'lambdas',
  'databases',
  'whatever'
]

// A function to load them all
const getConfigs = () =>
  files.map((file) =>
    yamlParse(readFileSync(resolve(root, `${file}.yml`), 'utf8'))
  )

// A function to merge them together - you would want to adjust this - this uses the deepmerge package which has options
const mergeConfigs = (configs) => merge.all(configs)

// Now load and merge into one
const combined = mergeConfigs(getConfigs())
// Do stuff... maybe add some vars just for offline for ex
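// For example (a hedged sketch): merge in the dynamodb-local settings from the question,
// so the serverless-dynamodb-local plugin finds them under `custom` in the generated config
combined.custom = {
  ...combined.custom,
  dynamodb: { stages: ['local'] }
}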
// Export - sls will pick that up
export default combined

Storing graph to gremlin server from in memory graph

I'm new to Graphs in general.
I'm attempting to store a TinkerPopGraph that I've created dynamically to gremlin server to be able to issue gremlin queries against it.
Consider the following code:
Graph inMemoryGraph;
inMemoryGraph = TinkerGraph.open();
inMemoryGraph.io(IoCore.graphml()).readGraph("test.graphml");
GraphTraversalSource g = inMemoryGraph.traversal();
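// Note: 'client' is not created in this snippet; with the TinkerPop driver it would
// come from something like the following (a hedged sketch, default Gremlin Server on localhost):
// Cluster cluster = Cluster.open();
// Client client = cluster.connect();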
List<Result> results = client.submit("g.V().valueMap()").all().get();
I need some glue code. The gremlin query here is issued against the modern graph that is the default binding for the g variable. I would like to somehow store my inMemoryGraph so that when I run a gremlin query, it's run against my graph.
All graph configurations in Gremlin Server must occur through its YAML configuration file. Since you say you're connected to the modern graph I'll assume that you're using the default "modern" configuration file that ships with the standard distribution of Gremlin Server. If that is the case, then you should look at conf/gremlin-server-modern.yaml. You'll notice this:
graphs: {
  graph: conf/tinkergraph-empty.properties}
That creates a Graph reference in Gremlin Server called "graph" which you can reference from scripts. Next, note this second configuration:
org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/generate-modern.groovy]}}}
Specifically, pay attention to scripts/generate-modern.groovy which is a Gremlin Server initialization script. Opening that up you will see this:
// an init script that returns a Map allows explicit setting of global bindings.
def globals = [:]
// Generates the modern graph into an "empty" TinkerGraph via LifeCycleHook.
// Note that the name of the key in the "global" map is unimportant.
globals << [hook : [
onStartUp: { ctx ->
ctx.logger.info("Loading 'modern' graph data.")
org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory.generateModern(graph)
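// To load your own graph here instead (a hedged sketch, reusing the GraphML read from the
// question; assumes test.graphml is readable from the server's working directory):
// graph.io(IoCore.graphml()).readGraph("test.graphml")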
}
] as LifeCycleHook]
// define the default TraversalSource to bind queries to - this one will be named "g".
globals << [g : graph.traversal()]
The comments should do most of the explaining. The connection here is that you need to inject your graph initialization code into this script and assign your inMemoryGraph.traversal() to g or whatever variable name you wish to use to identify it on the server. All of this is described in the Reference Documentation.
There is a way to make this work in a more dynamic fashion, but it involves extending Gremlin Server through its interfaces. You would have to build a custom GraphManager - the interface can be found here. Then you would set the graphManager key in the server configuration file to the fully qualified class name of your implementation.

How to configure the Kafka S3 sink connector for JSON using its fields AND time-based partitioning?

I have a json coming in like this:
{
  "app" : "hw",
  "content" : "hello world",
  "time" : "2018-05-06 12:53:04"
}
I wish to push to S3 in the following file format:
/upper-directory/$jsonfield1/$jsonfield2/$date/$HH
I know I can achieve:
/upper-directory/$date/$HH
with the TimeBasedPartitioner and topics.dir, but how do I put in the 2 JSON fields as well?
You need to write your own Partitioner to achieve a combination of the TimeBased and Field partitioners.
That means making a new Java project, looking at the existing partitioner source code for a reference point, building a JAR out of the project, and then copying that JAR into the kafka-connect-storage-common directory on all servers running Kafka Connect, where it is picked up by the S3 connector. After you've copied the JAR, you will need to restart the Connect process.
Note: there's already a PR that is trying to add this - https://github.com/confluentinc/kafka-connect-storage-common/pull/73/files