How can I load geo.point using graphloader?

I'm using the latest version of DataStax DSE Graph. I need to load geo points from a text file into the graph.
Is it OK to write POINT(12.3 34.5) for a geo point in the text data file?
Or POINT(X,Y)? Or Geo.point(x,y)?

There is an example in the documentation that shows how to load geo data - https://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/reference/refDSEGraphDataTypes.html?hl=graph,geo,data
graph.addVertex(label,'author','name','Jamie Oliver','gender','M','point',Geo.point(1,2))

You can convert the longitude and latitude values into a Point using a transform with the DSE GraphLoader (https://docs.datastax.com/en/datastax_enterprise/latest/datastax_enterprise/graph/dgl/dglTransform.html), for example:
geoPointInput = geoPointInput.transform { it['location'] = new com.datastax.driver.dse.geometry.Point(Double.parseDouble(it['longitude']),Double.parseDouble(it['latitude'])); it }
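For context, here is a minimal mapping-script sketch that uses this transform end to end. The input file name, the name/longitude/latitude column names, and the author vertex label are assumptions for illustration, not something from the original question:
// Hypothetical DSE GraphLoader mapping script (Groovy); file name, columns and label are assumed
inputfiledir = '/path/to/data/'
geoPointInput = File.csv(inputfiledir + 'authors.csv').delimiter(',')
// Build a Point from the plain numeric longitude/latitude columns
geoPointInput = geoPointInput.transform { it['location'] = new com.datastax.driver.dse.geometry.Point(Double.parseDouble(it['longitude']),Double.parseDouble(it['latitude'])); it }
load(geoPointInput).asVertices {
    label "author"
    key "name"
}
With this approach the text/CSV file just carries plain numeric longitude and latitude columns and the transform builds the Point, so you don't need POINT(...) or Geo.point(...) literals in the file.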

Related

Splunk chart function displaying zero values when trying to round off input

I have been trying to display a chart in Splunk. I uploaded my JSON data through the Splunk HTTP forwarder and am running the query below.
After uploading the JSON data, I have fields such as:
"message":{"acplbuild":"ACPL 1.20.1","coresyncbuild":"4.3.10.25","testregion":"EU_Stage","client":"EU_Mac","date":"2019-08-27","iteration":"20","localCreateTime":"6.672","createSyncTime":"135.768","createSearchTime":"0.679","filetype":"CPSD","filesize":"690_MB","filename":"690MB_NissPoetry.cpsd","operation":"upload","upload_DcxTime":"133.196","upload_manifest_time":"133.141","upload_journal_time":"1.753","upload_coresync_time":"135.225","upload_total_time":142.44},"severity":"info"}
I am trying to run the following query
index="coresync-ue1" host="acpsync_allacpl_7" message.testregion=EU_STAGE message.client=EU_Mac message.operation="upload" |eval roundVal = round(message.upload_total_time, 2) | chart median(roundVal) by message.acplbuild
I am getting no values. It should display rounded-off median values as a chart. Can someone point out if I am doing anything wrong here?
I used the same data you specified and faced an issue while rounding off the upload_total_time value. So I first converted it to a number, and then the Splunk search query worked.
Input Data Set
{"message":{"acplbuild":"ACPL 1.20.1","coresyncbuild":"4.3.10.25","testregion":"EU_Stage","client":"EU_Mac","date":"2019-08-27","iteration":"20","localCreateTime":"6.672","createSyncTime":"135.768","createSearchTime":"0.679","filetype":"CPSD","filesize":"690_MB","filename":"690MB_NissPoetry.cpsd","operation":"upload","upload_DcxTime":"133.196","upload_manifest_time":"133.141","upload_journal_time":"1.753","upload_coresync_time":"135.225","upload_total_time":142.44},"severity":"info"}
{ "message":{"acplbuild":"ACPL 1.20.2","coresyncbuild":"4.3.10.25","testregion":"EU_Stage","client":"EU_Mac","date":"2019-08-27","iteration":"20","localCreateTime":"6.672","createSyncTime":"135.768","createSearchTime":"0.679","filetype":"CPSD","filesize":"690_MB","filename":"690MB_NissPoetry.cpsd","operation":"upload","upload_DcxTime":"133.196","upload_manifest_time":"133.141","upload_journal_time":"1.753","upload_coresync_time":"135.225","upload_total_time":152.44123},"severity":"info"}
{ "message":{"acplbuild":"ACPL 1.20.3","coresyncbuild":"4.3.10.25","testregion":"EU_Stage","client":"EU_Mac","date":"2019-08-27","iteration":"20","localCreateTime":"6.672","createSyncTime":"135.768","createSearchTime":"0.679","filetype":"CPSD","filesize":"690_MB","filename":"690MB_NissPoetry.cpsd","operation":"upload","upload_DcxTime":"133.196","upload_manifest_time":"133.141","upload_journal_time":"1.753","upload_coresync_time":"135.225","upload_total_time":160.456},"severity":"info"}
Splunk Search Query
source="sample.json" index="splunk_answers" sourcetype="_json"
| convert num(message.upload_total_time) as total_upld_time
| eval roundVal = round(total_upld_time,2)
| chart median(roundVal) by message.acplbuild
(Screenshots of the Statistics and Visualization views omitted.)
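An alternative that skips the separate convert step is to coerce the field inside eval itself. This is an untested sketch against the data above; note the single quotes around the dotted field name, which eval requires:
source="sample.json" index="splunk_answers" sourcetype="_json"
| eval roundVal = round(tonumber('message.upload_total_time'), 2)
| chart median(roundVal) by message.acplbuild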

Need to join vertices in DSE

I have created property keys and vertex labels like:
schema.propertyKey('REFERENCE_ID').Int().multiple().create();
schema.propertyKey('Name').Text().single().create();
schema.propertyKey('PARENT_NAME').Text().single().create();
...
schema.propertyKey('XXX').Text().single().create();
schema.vertexLabel('VERT1').properties("REFERENCE_ID",.."PROPERTY10"....."PROPERTY15") // 15 properties
schema.vertexLabel('VER2').properties("REFERENCE_ID",.."PROPERTY20"......"PROPERTY35") // 35 properties
schema.vertexLabel('VERT3').properties("REFERENCE_ID",.."PROPERTY20"....."PROPERTY25") // 25 properties
schema.vertexLabel('VERT4').properties("REFERENCE_ID",.."PROPERTY20"....."PROPERTY25") // 25 properties
and loaded CSV data using DSE GraphLoader (CSV to vertex),
and created edges:
schema.edgeLabel('ed1').single().create()
schema.edgeLabel('ed1').connection('VERT1', 'VER2').add()
schema.edgeLabel('ed1').single().create()
schema.edgeLabel('ed1').connection('VERT1', 'VERT3').add()
schema.edgeLabel('ed2').single().create()
schema.edgeLabel('ed2').connection('VERT3','VERT4').add()
But I don't know how to map the data between the vertices and edges. I want to join all these 4 vertices. Could you please help with this?
I'm new to DSE. I just ran the above code in DataStax Studio successfully and I can see the loaded data. I need to join the vertices...
SQL code (I want the same in DSE Gremlin):
select v1.REFERENCE_ID,v2.name,v3.total from VERT1 v1
join VER2 v2 on v1.REFERENCE_ID=v2.REFERENCE_ID
join VERT3 v3 on v2.sid=v3.sid
there are 2 "main" options in DSE for adding edge data, plus one if you're also using DSE Analytics.
One is to use Gremlin, like what's documented here - https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/using/insertTraversalAPI.html
This is a traversal-based approach and may not be the best/fastest choice for bulk operations.
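As a rough sketch of that traversal approach (assuming you pair up the two endpoint vertices by their REFERENCE_ID value, here the made-up value 1001), adding a single ed1 edge could look like:
g.V().hasLabel('VERT1').has('REFERENCE_ID', 1001).as('v1').
  V().hasLabel('VER2').has('REFERENCE_ID', 1001).
  addE('ed1').from('v1')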
Another solution is to use the DSE Graph Loader; check out the example with the asEdges code sample here - https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/dgl/dglCSV.html#dglCSV
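Adapted to your labels, a graph loader edge-mapping sketch could look like the following; the edge CSV file name and its v1_ref/v2_ref columns are assumptions (they should hold the REFERENCE_ID values of the two endpoints), and it assumes REFERENCE_ID is usable as the lookup key for both vertex labels:
edgeInput = File.csv(inputfiledir + 'vert1_ver2_edges.csv').delimiter(',')
load(edgeInput).asEdges {
    label "ed1"
    outV "v1_ref", {
        label "VERT1"
        key "REFERENCE_ID"
    }
    inV "v2_ref", {
        label "VER2"
        key "REFERENCE_ID"
    }
}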
If you have DSE Analytics enabled, you can also use DataStax's DSE GraphFrame implementation, which leverages Spark, to perform this task as well. Here's an example - https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameImport.html
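Once the edges are in place, a traversal roughly equivalent to the SQL join in the question might look like the sketch below. It assumes VER2 carries a Name property and VERT3 a total property (as the SQL implies), and that both VER2 and VERT3 are reached from VERT1 over ed1 edges, per the schema above:
g.V().hasLabel('VERT1').as('v1').
  out('ed1').hasLabel('VER2').as('v2').
  select('v1').out('ed1').hasLabel('VERT3').as('v3').
  select('v1', 'v2', 'v3').
    by('REFERENCE_ID').by('Name').by('total')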

Converting map into tuple

I am loading an HBase table using Pig.
product = LOAD 'hbase://product' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:*', '-loadKey true') AS (id:bytearray, a:map[])
The relation product has tuples that contain a map. I want to convert the map data into tuples.
Here is a sample:
grunt>dump product;
06:177602927,[cloud_service#true,wvilnk#true,cmpgeo#true,cmplnk#true,webvid_standard#true,criteria_search#true,typeahead_search#true,aasgbr#true,lnkmin#false,aasdel#true,aasmcu#true,aasvia#true,lnkalt#false,aastlp#true,cmpeel#true,aasfsc#true,aasser#true,aasdhq#true,aasgbm#true,gboint#true,lnkupd#true,aasbig#true,webvid_basic#true,cmpelk#true]
06:177927527,[cloud_service#true,wvilnk#true,cmpgeo#true,cmplnk#true,webvid_standard#true,criteria_search#true,typeahead_search#true,aasgbr#false,lnkmin#false,aasdel#false,aasmcu#false,aasvia#false,lnkalt#false,aastlp#true,cmpeel#true,aasfsc#false,aasser#false,aasdhq#true,aasgbm#false,gboint#true,lnkupd#true,aasbig#false,webvid_basic#true,cmpelk#true,blake#true]
I want to convert each tuple into individual records like the ones below:
177602927,cloud_service,true
177602927,wvilnk,true
177602927,cmpgeo,true
177602927,cmpgeo,true
I am pretty new to pig and perhaps this is my first time to do something with Pig Latin. Any help is much appreciated.
I was able to find a fix for my problem.
I used a UDF called MapEntriesToBag which will convert all the maps into bags.
Here is my code.
>register /your/path/to/this/Jar/Pigitos-1.0-SNAPSHOT.jar
>DEFINE MapEntriesToBag pl.ceon.research.pigitos.pig.udf.MapEntriesToBag();
>product = LOAD 'hbase://product' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:*', '-loadKey true') AS (id:bytearray, a:map[])
>b = foreach product generate flatten(SUBSTRING($0,3,12)), flatten(MapEntriesToBag($1));
The UDF is available in the jar Pigitos-1.0-SNAPSHOT.jar. You can download this jar from here.
For more information you can refer to this link. It has more interesting UDFs related to the map datatype.
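If you then want the comma-separated records shown in the question, storing the flattened relation with PigStorage(',') should do it. A small sketch, assuming MapEntriesToBag yields (key, value) tuples as its name suggests, with a placeholder output path:
>STORE b INTO '/your/output/path' USING PigStorage(','); -- yields lines like 177602927,cloud_service,true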

Find all the paths between a group of defined vertices in DataStax DSE Graph

According to this, the following query:
g.V(ids).as("a").repeat(bothE().otherV().simplePath()).times(5).emit(hasId(within(ids))).as("b").filter(select(last,"a","b").by(id).where("a", lt("b"))).path().by().by(label)
does not work in DataStax Graph because the lt("b") part cannot operate on a DataStax vertex id, which has a JSON-like format:
{
'~label=person',
member_id=54666,
community_id=505443455
}
How can I change the lt("b) part in order the query to work ?
Please help
You can pick any property that is comparable. E.g., if all your vertices have a name property:
g.V(ids).as("a").repeat(bothE().otherV().simplePath()).times(5).
emit(hasId(within(ids))).as("b").
filter(select(last,"a","b").by("name").where("a", lt("b"))).
path().by().by(label)

How to evenly distribute data in apache pig output files?

I've got a Pig Latin script that takes in some XML, uses the XPath UDF to pull out some fields, and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
The above script works fine and produces the data I'm looking for. However, it generates over 1,900 part-files with only a few MB in each file. I learned about the default_parallel option, so I set it to 128 to try to get 128 part-files. I ended up having to add a piece that forces a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data. Also, I now get 128 part-files. My problem now is that the data is not evenly distributed among the part-files. Some have 8 gigs, others have 100 mb. I should have expected this when grouping them by RANDOM() :).
My question is what would be the preferred way to limit the number of part-files yet still have them evenly-sized? I'm new to pig/pig latin and assume I'm going about this in the completely wrong way.
p.s. the reason I care about the number of part-files is because I'd like to process the output with spark and our spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the Pig script, but for now my "solution" is to repartition the data within the Spark process that works on the output of the Pig script. I use the RDD.coalesce function to rebalance the data.
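For reference, the Spark-side rebalancing is just a coalesce over the loaded Pig output. A minimal Scala sketch (the paths and the target partition count are placeholders), assuming a SparkContext sc as in spark-shell:
// Read the Pig output and rebalance it into ~128 partitions without a full shuffle
val pigOutput = sc.textFile("hdfs:///path/to/pig/output")
val rebalanced = pigOutput.coalesce(128)
rebalanced.saveAsTextFile("hdfs:///path/to/rebalanced/output")
Note that coalesce avoids a shuffle but can still leave partitions uneven; repartition(128) forces a shuffle and spreads the data more evenly at the cost of extra I/O.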
From the first code snippet, I am assuming it is a map-only job since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB (size given in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
pig.maxCombinedSplitSize - setting this property makes sure each mapper reads around 1 GB of data, and the above code works as an identity-mapper job, which helps you write the data in 1 GB part-file chunks.