Spark streaming from a Hive table, is it possible? - hive

I have a use case:
We have a Java framework that parses real-time data from Kinesis into a Hive table every half hour.
I need to access this Hive table and do some processing in near real time. An hour of delay is fine, as I don't have permission to access the Kinesis stream directly.
Once the processing is done in Spark (PySpark preferably), I have to create a new Kinesis stream and push the data to it.
I will then use Splunk to pull it in near real time.
Question is, has anyone done Spark streaming from Hive using Python? I have to do a POC first and then the actual work.
Any help will be highly appreciated.
Thanks in advance!!

There are two ways to go ahead on this:
Use Spark Streaming to obtain messages directly from Kinesis. That will give you something that is truly real time.
Once the file drops into your staging area (either your Hive warehouse or some HDFS location), you can pick it up for processing using Spark Streaming's file source, as in the sketch below.
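For the second option, here is a minimal PySpark sketch of a Structured Streaming file source watching the staging directory. The path, schema, and file format are placeholders, assuming the framework writes Parquet; swap in your table's actual location and columns:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("hive-staging-stream").getOrCreate()

    # File sources require an explicit schema (placeholder columns here).
    schema = StructType([
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    # Every new file that lands in the directory becomes part of a micro-batch.
    stream = (spark.readStream
              .schema(schema)
              .format("parquet")  # match whatever format the framework writes
              .load("hdfs:///user/hive/warehouse/mydb.db/my_table/"))

    # Console sink is just for the POC; replace it with your processing plus a
    # Kinesis producer for the real job.
    query = (stream.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()

Given the half-hourly drop cadence, this comfortably meets the one-hour delay budget without touching the Kinesis stream you can't access.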
Do let us know which approach worked best for you.

Related

Accessing Spark Streaming Data pipelines. What option works best?

I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics and creating a streaming DataFrame, which is then cleaned and printed to the console. I need this data to be integrated with existing Python scripts that do all the data operations with Pandas. I have considered the following options:
Write the streaming data to local storage (e.g. Hive tables).
Use the Spark Structured Streaming foreachBatch sink.
I want to mention that the data is to be read after a certain interval, and there will be a real-time data dashboard built on this data in the future.
Please advise which will be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save the data to Hive each time before accessing the newly streamed data through Python scripts, the newly added Hive partitions also need to be refreshed each time (e.g. with MSCK REPAIR TABLE):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of using Hive for the real-time scenarios mentioned:
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Spark Structured Streaming, on the other hand, looks like the better choice for a near-real-time experience:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
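As a rough sketch of the foreachBatch option, assuming a local Kafka broker; process_with_pandas is a hypothetical stand-in for the existing Pandas scripts:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-pandas").getOrCreate()

    # Streaming DataFrame over a Kafka topic (broker and topic are placeholders).
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "my_topic")
              .load()
              .selectExpr("CAST(value AS STRING) AS value"))

    def process_with_pandas(pdf):
        # Stand-in for the existing Pandas-based scripts.
        print(pdf.head())

    def handle_batch(batch_df, batch_id):
        # Called once per micro-batch with a plain (non-streaming) DataFrame.
        process_with_pandas(batch_df.toPandas())

    query = (stream.writeStream
             .foreachBatch(handle_batch)
             .start())
    query.awaitTermination()

This skips the Hive round trip and the partition-refresh problem entirely. The caveat is that toPandas() collects each batch onto the driver, which is fine for modest batch sizes but not for very large ones.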

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks, but I am only interested in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess data coming from the sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday's data into a separate table and then remove it? (Knowing that the tables are named using timestamps, which makes it possible to query them by day.)
As far as I know, both options mentioned are currently not possible on the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to solely filter the jsonPayload.
I would therefore suggest creating a feature request regarding this on the public issue tracker. Note that feature requests do not have an ETA for when they'll be processed, or whether they'll be implemented at all.

How to read from BigQuery as a stream

I'm using Java + Apache Beam SDK for Java 2.0.1-SNAPSHOT.
Scenario:
Read data from BigQuery (BQ) -> ETL process in Dataflow -> Write data to BQ tables
The problem is that the pipeline tries to process all the data before performing the insertion into BQ.
Is there a way to execute streaming inserts in this case? I've already tried setting a timestamp on the elements when extracting from BQ, but it didn't work.
Or is it possible to configure BatchLoads so that it inserts chunks of data from time to time?
I would take a look at this link to Google's solution. That being said, BigQuery sounds like it is being treated as a bounded source, but that shouldn't be a problem for writing the data back to BQ from Dataflow; see here.
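For illustration only, in the current Beam Python SDK rather than the Java 2.0.1 SDK the question uses, the BigQuery sink can be switched from a single final load job to streaming inserts; project, dataset, and table names are placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
               query="SELECT * FROM `my-project.my_dataset.source`",
               use_standard_sql=True)
         | "ETL" >> beam.Map(lambda row: row)  # your transform goes here
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.destination",
               # Insert rows as they are produced instead of waiting for
               # one batch load at the end of the pipeline.
               method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

The read is still bounded either way; what changes is only how the sink emits results.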

Moving data from HDFS to SQL

I'm testing my setup, and I need to move data from HDFS to a SQL DB as soon as it is generated. What I mean is: once the MapReduce job is completed, it sends an ActiveMQ message. I need to move the data to SQL automatically, using Sqoop, once I receive that ActiveMQ message. Can someone help with how to achieve this?
Can someone let me know whether MQ & Sqoop work together?
Thank you.
I am not entirely clear about the use case, but you can set up an Oozie workflow: the Sqoop job will then only start once the MapReduce job is complete. You can actually create a complex DAG using Oozie, and the Oozie workflow can in turn be invoked from a remote Java client.
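On the MQ question: there is no built-in bridge between ActiveMQ and Sqoop, but a small consumer can glue them together. A hypothetical Python sketch using the stomp.py client, with the queue name, JDBC URL, and paths as placeholders:

    import subprocess
    import time

    import stomp  # pip install stomp.py

    class SqoopTrigger(stomp.ConnectionListener):
        def on_message(self, frame):
            # Fired when the MapReduce job publishes its completion message;
            # shell out to Sqoop to export the HDFS output into the SQL DB.
            subprocess.run([
                "sqoop", "export",
                "--connect", "jdbc:mysql://dbhost/mydb",
                "--table", "results",
                "--export-dir", "/user/output/mr-job",
            ], check=True)

    conn = stomp.Connection([("activemq-host", 61613)])
    conn.set_listener("sqoop-trigger", SqoopTrigger())
    conn.connect(wait=True)
    conn.subscribe(destination="/queue/job.done", id="1", ack="auto")

    # Keep the consumer alive, waiting for completion messages.
    while True:
        time.sleep(1)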
Hope this helped.

Easiest way to persist Cassandra data to S3 using Spark

I am trying to figure out how best to store and retrieve data, from Cassandra to S3, using Spark. I have log data that I store in Cassandra. I run Spark using DSE to perform analysis of the data, and it works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store older logs somewhere for at least 6 months, and after research, S3 with Glacier looks like the most promising solution.

I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3. My problem is this: I can't seem to settle on the right format to save the Cassandra rows to a file, such that I can one day potentially load the file back into Spark and run an analysis if I have to. I only want to run the analysis in Spark one day, not persist the data back into Cassandra. JSON seems to be an obvious solution, but is there any other format that I am not considering? Should I use Spark SQL? Any advice appreciated before I commit to one format or another.
Apache Parquet is designed for this kind of use case. It is a columnar storage format that provides column compression and some indexing.
It is becoming a de facto standard. Many big data platforms are adopting it or at least providing some support for it.
You can query it efficiently directly in S3 using SparkSQL, Impala or Apache Drill. You can also run EMR jobs against it.
To write data to Parquet using Spark, use DataFrame.write.parquet (DataFrame.saveAsParquetFile in older Spark releases).
Depending on your specific requirements you may even end up not needing a separate Cassandra instance.
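A minimal sketch of the daily archival job, assuming the spark-cassandra-connector is on the classpath and that the table has a date column to filter on; keyspace, table, column, and bucket names are all placeholders:

    from datetime import date, timedelta
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("logs-to-s3").getOrCreate()

    cutoff = date.today() - timedelta(days=15)

    # Read the day-15 slice of logs out of Cassandra.
    logs = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="logs_ks", table="log_events")
            .load()
            .where(F.col("log_date") == F.lit(str(cutoff))))

    # Columnar Parquet keeps the archive directly queryable from S3 later.
    logs.write.mode("overwrite").parquet("s3a://my-log-archive/day=%s/" % cutoff)

    # Deleting the archived rows is a separate step (CQL, or a table TTL);
    # the connector's DataFrame API does not delete rows.

The archived files can later be read straight back with spark.read.parquet on the S3 path, so the one-off re-analysis never has to touch Cassandra.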
You may also find this post interesting.