DStream Twitter example -- flatMap twitter_id with text
New to Scala and Spark Streaming. I'm trying to extend the example Twitter stream code to split each tweet into words while keeping those words connected to the tweet's user id.
setupLogging()
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(status => status.getText())
val tweetwords = statuses.flatMap(tweetText => tweetText.split(" "))
tweetwords.print()
//get running list of words from tweets.
This
is
my
tweet
"#mytweet"
//instead I want the same list with a twitter_id attached
val statuses = tweets.map(status => (status.getUser().getId(), status.getText()))
val tweetwords = statuses.flatMap( ????? This is where I am lost )
//this is what I want
tweetwords.print
1523523, This
1523523, is
1523523, my
1523523, tweet
1523523, #mytweet
I'm open to other ways of doing this, including DataFrames/Datasets.
Thank you!
In case anyone else is looking for this...
val tweetwords = statuses.flatMap { case (t1, t2) => t2.split(" ").map((t1, _)) }
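For anyone piecing it together, here is a minimal sketch of the whole pipeline with that fix applied (assuming ssc is the StreamingContext from the original example and the twitter4j-backed TwitterUtils receiver):
import org.apache.spark.streaming.twitter.TwitterUtils
val tweets = TwitterUtils.createStream(ssc, None)
// (user id, tweet text) pairs
val statuses = tweets.map(status => (status.getUser().getId(), status.getText()))
// one (user id, word) pair per word of each tweet
val tweetwords = statuses.flatMap { case (id, text) => text.split(" ").map((id, _)) }
tweetwords.print()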
I've been trying to work with Spark Streaming. My problem is that I want to use wordCountsDataFrame again outside of the foreachRDD block.
I want to conditionally join wordCountsDataFrame with another DataFrame that is created from a DStream. Is there any way to do that, or another approach?
Thanks.
My Scala code block is below.
val Seq(projectId, subscription) = args.toSeq
val sparkConf = new SparkConf().setAppName("PubsubWordCount")
val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
val credential = SparkGCPCredentials.builder.build()
val pubsubStream: ReceiverInputDStream[SparkPubsubMessage] = PubsubUtils.createStream(ssc, projectId, None, subscription, credential, StorageLevel.MEMORY_AND_DISK_SER_2)
val stream1= pubsubStream.map(message => new String(message.getData()))
stream1.foreachRDD{ rdd =>
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
// Convert RDD[String] to DataFrame
val wordsDataFrame = rdd.toDF("word")
wordsDataFrame.createOrReplaceTempView("words")
val wordCountsDataFrame =
spark.sql("select word, count(*) from words group by word")
wordCountsDataFrame.show()
}
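One hedged way to approach this (a sketch based on my own assumptions, not the original code): keep a driver-side reference to the latest per-batch result, and/or register it as a temp view, so code outside this foreachRDD block (for example another DStream's foreachRDD) can join against it.
import org.apache.spark.sql.{DataFrame, SparkSession}
// updated on the driver once per batch
var latestWordCounts: Option[DataFrame] = None
stream1.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val wordCountsDataFrame = rdd.toDF("word").groupBy("word").count()
  wordCountsDataFrame.createOrReplaceTempView("word_counts") // reachable via spark.sql elsewhere
  latestWordCounts = Some(wordCountsDataFrame)                // reachable from other driver-side code
}
// elsewhere, e.g. inside another (hypothetical) DStream's foreachRDD:
// latestWordCounts.foreach(wc => otherDataFrame.join(wc, "word").show())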
My objective is to collect the distinct values of selected fields to provide them as filter options for the frontend. DistinctValuesCollector seems to be the tool for this, but since I haven't found code samples or documentation beyond the Javadocs, I can't currently construct this collector correctly. Can anyone provide an example?
This is my attempt which doesn't deliver the desired distinct values of the field PROJEKTSTATUS.name.
val groupSelector = TermGroupSelector(PROJEKTSTATUS.name)
val searchGroup = SearchGroup<BytesRef>()
val valueSelector = TermGroupSelector(PROJEKTSTATUS.name)
val groups = mutableListOf(searchGroup)
val distinctValuesCollector = DistinctValuesCollector(groupSelector, groups, valueSelector)
That field is indexed as follows:
document.add(TextField(PROJEKTSTATUS.name, aggregat.projektstatus, YES))
document.add(SortedDocValuesField(PROJEKTSTATUS.name, BytesRef(aggregat.projektstatus)))
Thanks to @andrewJames's hint about a test class, I could figure it out:
fun IndexSearcher.collectFilterOptions(query: Query, field: String, topNGroups: Int = 128, mapper: Function<String?, String?> = Function { it }): Set<String?> {
    // first pass: collect the top groups for the field
    val firstPassGroupingCollector = FirstPassGroupingCollector(TermGroupSelector(field), Sort(), topNGroups)
    search(query, firstPassGroupingCollector)
    val topGroups = firstPassGroupingCollector.getTopGroups(0)
    val groupSelector = firstPassGroupingCollector.groupSelector
    // second pass: collect the distinct values within those groups
    val distinctValuesCollector = DistinctValuesCollector(groupSelector, topGroups, groupSelector)
    search(query, distinctValuesCollector)
    return distinctValuesCollector.groups.map { mapper.apply(it.groupValue.utf8ToString()) }.toSet()
}
I'm trying to run DBSCAN on each group of latitudes and longitudes per user. The implementation of this clustering algorithm was done by irvingc here. I bumped up all dependencies to make the code work properly in my environment.
Describing the situation: I have a DataFrame composed of events from users; each event has an id, a latitude, and a longitude, as you can see in the case class below. I transform the DataFrame into a Dataset to use the groupByKey and mapGroups methods to apply a function to the grouped data. However, the DBSCAN I'm using receives an RDD[linalg.Vector], so I have to transform each group into a Vector of lat/lon, and this transformation gives the error described in SPARK-28702. Can you give some advice on how to handle this issue?
case class StayDataset(objectID: Long, latitude: Double, longitude: Double, timeStart: Long, timeEnd: Long)
var dfs: Array[DataFrame] = Array()
val s = dataset.groupByKey(k => k.objectID).mapGroups {
  case (k, iter) =>
    val df = POIDetection.groupStayPointsFromUser(k, iter, dataset.sparkSession)
    dfs = dfs ++ Array(df)
    k
}
def groupStayPointsFromUser(k: Long, dataset: Iterator[StayDataset], spark: SparkSession): DataFrame = {
  val points = dataset.map(row => Vectors.dense(Array(row.latitude, row.longitude))).toSeq
  val rddVector = spark.sparkContext.parallelize(points)
  val size = points.length
  val model = DBSCAN.train(rddVector, eps = 20, minPoints = (size * 0.18).toInt, maxPointsPerPartition = (size / 4).toInt)
  val pointRDD = new PointRDD(model.labeledPoints.map(p => {
    val point = POIDetection.geoFactory.createPoint(new Coordinate(p.x, p.y))
    point.setUserData(p.cluster.toString())
    point
  }))
  val df = Adapter.toDf(pointRDD, Seq("cluster"), spark)
    .select(col("cluster").cast("long"), col("geometry"))
  df
}
I think this problem arises whenever we want to apply something like a KNN to grouped data. How can that be done?
I don't fully understand what you want to achieve, but first you need to create an RDD[linalg.Vector]. I suppose you already have the Dataset of StayDataset; to get the RDD, map each record to a linalg.Vector. Going through the Dataset's underlying RDD keeps this simple (RDD.map does not need an Encoder):
val rddVector = dataset.rdd.map(rec => linalg.Vectors.dense(rec.latitude, rec.longitude))
and then you pass the RDD to your DBSCAN:
DBSCAN.train(rddVector, ...)
That gives you the RDD you need for training.
I think you also need to do some aggregation beforehand. If that is the case, you need to manipulate the Dataset you have first.
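If per-user clustering is really the goal, here is a rough sketch of one way to do it (my assumption, not part of the original answer): collect the distinct objectIDs on the driver and train one DBSCAN model per user. The eps, minPoints and maxPointsPerPartition values below are placeholders.
val spark = dataset.sparkSession
import spark.implicits._
// all user ids, collected to the driver
val userIds = dataset.map(_.objectID).distinct().collect()
// one DBSCAN model per user, trained on that user's points only
val modelsPerUser = userIds.map { id =>
  val points = dataset
    .filter(_.objectID == id)
    .rdd
    .map(r => Vectors.dense(r.latitude, r.longitude))
  id -> DBSCAN.train(points, eps = 20, minPoints = 10, maxPointsPerPartition = 250)
}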
I implemented a groupBy function which successfully groups columns based on a particular aggregation. The issue is that I am passing the chosen columns and aggregations as a Map[String,String], which means multiple aggregations cannot be performed on one column, for example sum, mean and max all on the same column.
Below is what works so far:
groupByFunction(input, Map("someSignal" -> "mean"))
def groupByFunction(dataframeDummy: DataFrame,
columnsWithOperation: Map[String,String],
someSession: String = "sessionId",
someSignal: String = "signalName"): DataFrame = {
dataframeDummy
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation)
}
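The limitation comes from Map itself: keys are unique, so a second aggregation on the same column silently replaces the first. A tiny illustration (my addition):
// the second entry overwrites the first, leaving only "max" for someSignal
Map("someSignal" -> "mean", "someSignal" -> "max")  // == Map("someSignal" -> "max")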
Upon looking into it a bit more, the agg function can take a list of columns like below
userData
.groupBy(
window(
(col(timeStampColumnName) / lit(millisSecondsPerSecond)).cast(TimestampType),
timeWindowInS.toString.concat(" seconds")
),
col(sessionColumnName),
col(signalColumnName)
).agg(
mean("physicalSignalValue"),
sum("physicalSignalValue")).show()
So I decided to try to manipulate the input to look like that, below is how I did it:
val signalIdColumn = columnsWithOperation.toSeq.flatMap { case (key, list) => list.map(key -> _) }
val result = signalIdColumn.map {
  case (column, "mean") => mean(column)
  case (column, "sum")  => sum(column)
  case (column, "max")  => max(column)
}
Now I have a list of Columns, which is still a problem for the agg function.
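As a hedged aside (my addition, not part of the question or the answer below): agg also has a (Column, Column*) overload, so the list of Columns built above can be passed with the same head/tail trick the accepted answer uses, for example:
// assuming `result` is the list of Columns built above and the default column names from groupByFunction
dataframeDummy
  .groupBy(col("sessionId"), col("signalName"))
  .agg(result.head, result.tail: _*)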
I was able to solve it by using a sequence of tuples, Seq[(String, String)], instead of Map[String,String]:
def groupByFunction(dataframeDummy: DataFrame,
columnsWithOperation: Seq[(String, String)],
someSession: String = "sessionId",
someSignal: String = "signalName"): DataFrame = {
dataframeDummy
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
}
where the agg call uses the overload that takes the first aggregation plus a varargs tail, as pointed out in this post:
https://stackoverflow.com/a/34955432/2091294
Outside the helper, the same pattern looks like this:
userData
.groupBy(
col(someSession),
col(someSignal)
).agg(columnsWithOperation.head, columnsWithOperation.tail: _*)
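With the Seq-based signature, several aggregations on the same column can now be requested in a single call; a hypothetical invocation would look like:
groupByFunction(input, Seq("someSignal" -> "mean", "someSignal" -> "sum", "someSignal" -> "max"))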
I was using the code below to extract the strings I needed in Spark SQL. But now I am working with more data in Spark on Hadoop and I want to extract the strings in the same way. I tried the same code, but it does not work.
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
//User Defined function to extract
def toExtract(str: String) = {
val pattern = Pattern.compile("#\\w+")
val tmplst = scala.collection.mutable.ListBuffer.empty[String]
val matcher = pattern.matcher(str)
while (matcher.find()) {
tmplst += matcher.group()
}
tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
How do I solve this problem?
First off, you're not using Spark the way it's meant to be used: your DataFrame isn't partitioned at all. Use:
val values = List("#always_nidhi", "#YouTube", "no", "i", "dnt", "understand" ...). That way, each chunk of words is assigned to a different partition, and therefore to different JVMs and/or executors (depending on the total number of partitions and the size of the data). In your solution, the entire sentence is assigned to a single partition, so there is no parallelism or distribution.
Second, you don't have to use a UDF (try to avoid those in general).
In order to find your regex, you can simply execute:
dataFrame.filter(col("words") rlike "#\\w+")
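If you actually need to pull the hashtags out (not just filter rows), a UDF-free sketch along the same lines (my addition, assuming the df with a "words" column from the question): split each sentence into tokens, explode them into one row per token, and keep only the tokens matching the hashtag pattern.
import org.apache.spark.sql.functions.{col, explode, split}
df.select(explode(split(col("words"), "\\s+")).as("token"))
  .filter(col("token").rlike("^#\\w+"))
  .show(false)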
Hope it helps :-)