Pig Latin joda-time error with StanfordCoreNLP - apache-pig

I am trying to create a Pig UDF that extracts the locations mentioned in a tweet using the Stanford CoreNLP package interfaced through the sista Scala API. It works fine when run locally with 'sbt run', but throws a "java.lang.NoSuchMethodError" exception when called from Pig:
Loading default properties from tagger edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
2013-06-14 10:47:54,952 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
done [7.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ...
2013-06-14 10:48:02,108 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Collection threshold init = 18546688(18112K) used = 358671232(350264K) committed = 366542848(357952K) max = 699072512(682688K)
done [5.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ...
2013-06-14 10:48:10,522 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - first memory handler call - Usage threshold init = 18546688(18112K) used = 590012928(576184K) committed = 597786624(583776K) max = 699072512(682688K)
done [5.6 sec].
2013-06-14 10:48:11,469 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.NoSuchMethodError: org.joda.time.Duration.compareTo(Lorg/joda/time/ReadableDuration;)I
    at edu.stanford.nlp.time.SUTime$Duration.compareTo(SUTime.java:3406)
    at edu.stanford.nlp.time.SUTime$Duration.max(SUTime.java:3488)
    at edu.stanford.nlp.time.SUTime$Time.difference(SUTime.java:1308)
    at edu.stanford.nlp.time.SUTime$Range.<init>(SUTime.java:3793)
    at edu.stanford.nlp.time.SUTime.<clinit>(SUTime.java:570)
Here is the relevant code:
object CountryTokenizer {
  def tokenize(text: String): String = {
    val locations = TweetEntityExtractor.NERLocationFilter(text)
    println(locations)
    locations.map(x => Cities.country(x)).flatten.mkString(" ")
  }
}

class PigCountryTokenizer extends EvalFunc[String] {
  override def exec(tuple: Tuple): java.lang.String = {
    val text: java.lang.String = Util.cast[java.lang.String](tuple.get(0))
    CountryTokenizer.tokenize(text)
  }
}

object TweetEntityExtractor {
  val processor: Processor = new CoreNLPProcessor()

  def NERLocationFilter(text: String): List[String] = {
    val doc = processor.mkDocument(text)
    processor.tagPartsOfSpeech(doc)
    processor.lemmatize(doc)
    processor.recognizeNamedEntities(doc)
    val locations = doc.sentences.map(sentence => {
      val entities = sentence.entities.map(List.fromArray(_)) match {
        case Some(l) => l
        case _ => List()
      }
      val words = List.fromArray(sentence.words)
      (words zip entities).filter(x => {
        x._1 != "" && x._2 == "LOCATION"
      }).map(_._1)
    })
    List.fromArray(locations).flatten
  }
}
I am using sbt-assembly to build a fat jar, so the joda-time classes should be available on the classpath. What is going on?

Pig ships with its own, older version of joda-time (1.6), which is binary-incompatible with 2.x. At runtime Pig's copy is loaded ahead of the 2.x classes bundled in your fat jar, so Stanford's SUTime (compiled against joda-time 2.x) looks for Duration.compareTo(ReadableDuration), a signature that does not exist in 1.6, and fails with the NoSuchMethodError above.
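One way around this kind of conflict, shown here as a hedged sketch rather than part of the original answer, is to shade joda-time inside the fat jar with sbt-assembly (shade rules require sbt-assembly 0.14 or later), so the UDF always resolves its own relocated copy instead of Pig's 1.6 classes:

// build.sbt -- hypothetical sketch, not from the original answer.
// Relocates the bundled joda-time classes so they cannot be shadowed by
// the joda-time 1.6 jar that ships with Pig/Hadoop at runtime.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.joda.time.**" -> "shaded.org.joda.time.@1").inAll
)

An alternative is to configure Hadoop to prefer user classes over its own, but shading keeps the fix contained in the UDF jar.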

Related

Writing a huge dataframe iterator to S3: any configuration to reclaim memory after write?

I have a method that writes huge DataFrames to S3. It runs in a Docker container that has been allocated 30 GB of memory, but it causes an OOM on the node and the task dies. Is there any Spark configuration that can be set, or a more efficient way to do this? cardsWithdfs can hold anywhere from 9 to 72 frames, and each DataFrame can be around 3 GB in size. I want to know why memory keeps going up even though I am processing only two DataFrames at a time, and how I can do a clean-up operation to reclaim the memory.
def writeParquetData(basePath: String, batchNumber: Int, resultCatalogCard: ResultsCatalogCard, cardsWithdfs: Iterable[(Int, DataFrame, String)], errorCategory: String, sortColumn: String, partitionColumns: Seq[String]): Iterable[(Int, String, Boolean, String)] = {
  var outLevelIterator: mutable.MutableList[(Int, String, Boolean, String)] = mutable.MutableList.empty
  cardsWithdfs.grouped(2).foreach( batch => {
    val writeDataFrames = Future.traverse(batch) {
      cardDataFrameCodeTriple =>
        Future {
          val (outputLevelCode, df, perspCode) = cardDataFrameCodeTriple
          //.....
          Try {
            if (!df.isEmpty) {
              saveResultsSortedDesc(df, partitionColumns, Seq(sortColumn), ulfPathLocation)
              (outputLevelCode, pathWithouttenant, false, perspCode)
            } else (outputLevelCode, pathWithouttenant, true, perspCode) // df is empty, path does not matter
          } match {
            case Success(x) => {
              val (outLevel, ulfPath, ignoreSchema, perspective) = x
              logger.info(s"RUN: writeParquetData ulfPathLocation written for workflowId: ${workflowId}, analysisId: ${analysisId} outputLevel: ${outputLevelCode} lossPath: ${ulfPath} ignoreSchema ${ignoreSchema},perspective ${perspective}")
              //Add to iterator
            }
            case Failure(ex) => this.synchronized {
              resultsMetadataClient.update(resultCatalogCard.id.toInt, s"""{"$errorCategory": "FAILED", "$CONVERTED_TO_ULF_MIGRATION_ERROR": "FAILED - analysisId: ${analysisId} runningWorkflowId: $runningWorkflowId with ${ex.getMessage}"}""")
              logger.warn(s"Exception Analysis readParquetData df for workflowId: ${workflowId}, analysisId: ${analysisId} outputLevel: ${outputLevelCode} batchNumber: ${batchNumber} failed with ${ex.getMessage} StackTrace: ${ex.printStackTrace()}", ex)
            }
          }
        }
    }
    Await.result(writeDataFrames, Duration.Inf)
  })
  outLevelIterator
}
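One pattern that is often suggested for loops like this (a hedged sketch, not from the original post) is to drop and unpersist each DataFrame as soon as its write completes, so its cached blocks can be reclaimed instead of accumulating across batches:

// Hypothetical sketch, not from the original post: release a DataFrame's
// cached blocks immediately after its write finishes.
import org.apache.spark.sql.DataFrame

def writeAndRelease(df: DataFrame, path: String, partitionColumns: Seq[String]): Unit = {
  try {
    df.write.partitionBy(partitionColumns: _*).parquet(path)
  } finally {
    df.unpersist() // a no-op unless the DataFrame was persisted/cached upstream
  }
}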

Case Class serialization in Spark

In a Spark app (Spark 2.1) I'm trying to pass a case class as an input parameter to a function that is meant to run on the executors:
object TestJob extends App {
  val appName = "TestJob"
  val out = "out"
  val p = Params("my-driver-string")

  val spark = SparkSession.builder()
    .appName(appName)
    .getOrCreate()

  import spark.implicits._

  (1 to 100).toDF.as[Int].flatMap(i => Dummy.process(i, p))
    .write
    .option("header", "true")
    .csv(out)
}

object Dummy {
  def process(i: Int, v: Params): Vector[String] = {
    Vector { if (i % 2 == 1) v + "_odd" else v + "_even" }
  }
}

case class Params(v: String)
When I run it with master local[*] everything works, but when running on a cluster the Params state does not get serialized and the output results in
null_even
null_odd
...
Could you please help me understand what I'm doing wrong?
Googling around, I found this post that gave me the solution: Spark broadcasted variable returns NullPointerException when run in Amazon EMR cluster.
In the end the problem comes from extending App: objects that extend App rely on DelayedInit, so vals defined in their body (such as p) may not have been initialized yet when the object is referenced on the executors, which is why the output contains null.
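A minimal sketch of the fix implied by that answer, assuming the same Params and Dummy definitions as above: replace extends App with an explicit main method, so p is initialized as an ordinary local value before the closure captures it rather than inside App's DelayedInit body.

// Hypothetical sketch of the fix: no "extends App", so there is no
// DelayedInit and p is fully initialized before it is captured by the closure.
import org.apache.spark.sql.SparkSession

object TestJob {
  def main(args: Array[String]): Unit = {
    val p = Params("my-driver-string")
    val spark = SparkSession.builder().appName("TestJob").getOrCreate()
    import spark.implicits._

    (1 to 100).toDF.as[Int]
      .flatMap(i => Dummy.process(i, p))
      .write
      .option("header", "true")
      .csv("out")

    spark.stop()
  }
}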

Corda: Party rejected session request as Requestor has not been registered

I have a Corda application, built and run against M14, that runs a TwoPartyProtocol in which either party can exchange data to reach a consensus on data validity. I've followed the Corda flow cookbook to build the flow.
Also, after reading the docs for several different Corda milestones, I understand that M14 no longer needs flow sessions, as mentioned in the release notes, which also eliminates the need to register services.
My TwoPartyFlow with inner FlowLogics:
class TwoPartyFlow {
  @InitiatingFlow
  @StartableByRPC
  open class Requestor(val price: Long,
                       val otherParty: Party) : FlowLogic<SignedTransaction>() {
    @Suspendable
    override fun call(): SignedTransaction {
      val notary = serviceHub.networkMapCache.notaryNodes.single().notaryIdentity
      send(otherParty, price)
      /*Some code to generate SignedTransaction*/
    }
  }

  @InitiatedBy(Requestor::class)
  open class Responder(val requestingParty : Party) : FlowLogic<SignedTransaction>() {
    @Suspendable
    override fun call(): SignedTransaction {
      val request = receive<Long>(requestor).unwrap { price -> price }
      println(request)
      /*Some code to generate SignedTransaction*/
    }
  }
}
But running the above using startTrackedFlow from the API causes the following error:
Party CN=Other,O=Other,L=NY,C=US rejected session request: com.testapp.flow.TwoPartyFlow$Requestor has not been registered
I had a hard time finding the reason in the Corda docs or logs, since the two-party flow implementations have changed across several Corda milestones. Can someone help me understand the problem here?
My API Call:
@GET
@Path("start-flow")
fun requestOffering(@QueryParam(value = "price") price: String) : Response {
  val price : Long = 10L
  /*Code to get otherParty details*/
  val otherPartyHostAndPort = HostAndPort.fromString("localhost:10031")
  val client = CordaRPCClient(otherPartyHostAndPort)
  val services : CordaRPCOps = client.start("user1", "test").proxy
  val otherParty: Party = services.nodeIdentity().legalIdentity
  val (status, message) = try {
    val flowHandle = services.startTrackedFlow(::Requestor, price, otherParty)
    val result = flowHandle.use { it.returnValue.getOrThrow() }
    // Return the response.
    Response.Status.CREATED to "Transaction id ${result.id} committed to ledger.\n"
  } catch (e: Exception) {
    Response.Status.BAD_REQUEST to e.message
  }
  return Response.status(status).entity(message).build()
}
My Gradle deployNodes task:
task deployNodes(type: net.corda.plugins.Cordform, dependsOn: ['build']) {
directory "./build/nodes"
networkMap "CN=Controller,O=R3,OU=corda,L=London,C=UK"
node {
name "CN=Controller,O=R3,OU=corda,L=London,C=UK"
advertisedServices = ["corda.notary.validating"]
p2pPort 10021
rpcPort 10022
cordapps = []
}
node {
name "CN=Subject,O=Subject,L=NY,C=US"
advertisedServices = []
p2pPort 10027
rpcPort 10028
webPort 10029
cordapps = []
rpcUsers = [[ user: "user1", "password": "test", "permissions": []]]
}
node {
name "CN=Other,O=Other,L=NY,C=US"
advertisedServices = []
p2pPort 10030
rpcPort 10031
webPort 10032
cordapps = []
rpcUsers = [[ user: "user1", "password": "test", "permissions": []]]
}
There appear to be a couple of problems with the code you posted:
The annotation should be @StartableByRPC, not @StartableNByRPC
The price passed to startTrackedFlow should be a long, not an int
However, even after fixing these issues, I couldn't replicate your error. Can you apply these fixes, do a clean re-deploy of your nodes (gradlew clean deployNodes), and see whether the error changes?
You shouldn't be connecting to the other node via RPC. RPC is how a node's owner speaks to their node. In the real world, you wouldn't have the other node's RPC credentials, and couldn't log into the node in this way.
Instead, you should use your own node's RPC client to retrieve the counterparty's identity:
val otherParty = services.partyFromX500Name("CN=Other,O=Other,L=NY,C=US")!!
See an M14 example here: https://github.com/corda/cordapp-example/blob/release-M14/kotlin-source/src/main/kotlin/com/example/api/ExampleApi.kt.

Akka http - SSE - Not receiving streaming Json response

I am playing with Server-Sent Events to get updates from an akka-http v2.4.11 based micro-service. I am using akka-sse. For some reason, I am not receiving any updates on my JavaScript front-end. However, as soon as I terminate or kill the server process, I get some of the messages in the front-end. My code looks like this:
val start = ByteString.empty
val sep = ByteString("\n")
val end = ByteString.empty

import Fill._

implicit val jsonStreamingSupport: JsonEntityStreamingSupport =
  EntityStreamingSupport.json()
    .withFramingRenderer(Flow[ByteString].intersperse(start, sep, end))

import de.heikoseeberger.akkasse.EventStreamMarshalling._

def routes: Route = pathPrefix("subscribe") {
  path("fills") {
    get {
      complete {
        Source.actorPublisher[Fill](FillProvider())
          .map(fill ⇒ sse(fill))
          .keepAlive(1.second, () ⇒ ServerSentEvent.heartbeat)
      }
    }
  }
}

def sse[T: ClassTag](obj: T)(implicit w: JsonWriter[T]): ServerSentEvent = {
  ServerSentEvent(data = w.write(obj).compactPrint,
                  eventType = classTag[T].runtimeClass.getSimpleName)
}
Any pointers on what I might be doing wrong? To me, it seems that I am following every instruction mentioned here.

ScalikeJDBC: Connection pool is not yet initialized.(name:'default)

I'm playing with the ScalikeJDBC library. I want to retrieve data from a PostgreSQL database. The error I get is quite strange to me. Even if I configure the connection pool manually:
val poolSettings = new ConnectionPoolSettings(initialSize = 100, maxSize = 100)
ConnectionPool.singleton("jdbc:postgresql://localhost:5432/test", "user", "pass", poolSettings)
I still see the error. Here is my DAO:
class CustomerDAO {
  case class Customer(id: Long, firstname: String, lastname: String)
  object Customer extends SQLSyntaxSupport[Customer]

  val c = Customer.syntax("c")

  def findById(id: Long)(implicit session: DBSession = Customer.autoSession) =
    withSQL {
      select.from(Customer as c)
    }.map(
      rs => Customer(
        rs.int("id"),
        rs.string("firstname"),
        rs.string("lastname")
      )
    ).single.apply()
}
The App:
object JdbcTest extends App {
  val dao = new CustomerDAO
  val res: Option[dao.Customer] = dao.findById(2)
}
My application.conf file
# PostgreSQL
db.default.driver = "org.postgresql.Driver"
db.default.url = "jdbc:postgresql://localhost:5432/test"
db.default.user = "user"
db.default.password = "pass"
# Connection Pool settings
db.default.poolInitialSize = 5
db.default.poolMaxSize = 7
db.default.poolConnectionTimeoutMillis = 1000
The error:
Exception in thread "main" java.lang.IllegalStateException: Connection pool is not yet initialized.(name:'default)
at scalikejdbc.ConnectionPool$$anonfun$get$1.apply(ConnectionPool.scala:57)
at scalikejdbc.ConnectionPool$$anonfun$get$1.apply(ConnectionPool.scala:55)
at scala.Option.getOrElse(Option.scala:120)
at scalikejdbc.ConnectionPool$.get(ConnectionPool.scala:55)
at scalikejdbc.ConnectionPool$.apply(ConnectionPool.scala:46)
at scalikejdbc.NamedDB.connectionPool(NamedDB.scala:20)
at scalikejdbc.NamedDB.db$lzycompute(NamedDB.scala:32)
What did I miss?
To load application.conf, scalikejdbc-config's DBs.setupAll() should be called in advance.
http://scalikejdbc.org/documentation/configuration.html#scalikejdbc-config
https://github.com/scalikejdbc/hello-scalikejdbc/blob/9d21ec7ddacc76977a7d41aa33c800d89fedc7b6/test/settings/DBSettings.scala#L3-L22
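For reference, a minimal sketch of that setup (assuming the scalikejdbc-config artifact is on the classpath), initializing the pools from application.conf before the DAO is used and closing them afterwards:

// Hypothetical sketch: initialize connection pools from application.conf
// via scalikejdbc-config before running any queries.
import scalikejdbc.config.DBs

object JdbcTest extends App {
  DBs.setupAll()          // reads the db.default.* settings from application.conf

  try {
    val dao = new CustomerDAO
    val res: Option[dao.Customer] = dao.findById(2)
    println(res)
  } finally {
    DBs.closeAll()        // shut the pools down when the app is done
  }
}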
In my case, I had omitted play.modules.enabled += "scalikejdbc.PlayModule" in conf/application.conf while using the ScalikeJDBC Play support...