Accessing Spark Streaming Data pipelines. What option works best? - pandas

I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics and creating a streaming DataFrame, which is then cleaned and printed to the console. I need to integrate this data with existing Python scripts that do all of their data operations in pandas. I have considered the following options:
Write streaming data to local memory (e.g. Hive tables).
Use the Spark Structured Streaming foreachBatch sink.
I want to mention that the data is to be read after a certain interval and there will be a real time data dashboard in the future with this data.
Please advise which will be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.

If you save the data to Hive before your Python scripts read the newly streamed data, the newly added Hive partitions also have to be refreshed each time.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of using Hive for the real-time scenario you describe.
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Spark Structured Streaming, on the other hand, looks like a better choice for a near-real-time experience.
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
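As a rough illustration of the foreachBatch option, here is a minimal PySpark sketch that hands every micro-batch to existing pandas code. It assumes Spark 2.4 or later (where foreachBatch is available) and the Kafka connector package on the classpath. The broker address, the topic name, and the on_pandas_frame callback are placeholders, not details from the question.

```python
from typing import Any, Callable


def make_batch_handler(on_pandas_frame: Callable[[Any, int], None]) -> Callable[[Any, int], None]:
    """Build a foreachBatch callback that hands each micro-batch to pandas code."""

    def handle_batch(batch_df: Any, batch_id: int) -> None:
        # toPandas() collects the whole micro-batch onto the driver,
        # so this only suits batches that fit in driver memory.
        pandas_frame = batch_df.toPandas()
        on_pandas_frame(pandas_frame, batch_id)

    return handle_batch


def start_stream(on_pandas_frame, bootstrap_servers="localhost:9092", topic="events"):
    """Start a Kafka -> foreachBatch query; server and topic are placeholders."""
    # Imported lazily so the handler above can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-pandas").getOrCreate()
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )
    return (
        stream.writeStream
        .foreachBatch(make_batch_handler(on_pandas_frame))
        .trigger(processingTime="1 minute")  # process data at a fixed interval
        .start()
    )
```

The processing-time trigger covers the "read after a certain interval" requirement. The same callback could later feed a dashboard store instead of, or in addition to, the pandas scripts.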

Related

Hortonworks: Hbase, Hive, etc used for which type of data

I would like to ask if anyone could tell me or refer me to an internet page which describes all possibilities to store data in an apache hadoop cluster.
What I would like to know is: which type of data should be stored in which "system"? By type of data I mean, for example:
Live data (realtime)
Historical data
Data which is regularly accessed from an application
...
The question is not limited to HBase or Hive (the "system") but covers everything that is available in HDP.
I hope someone can point me in a direction where I can find my answer. Thanks!
I can give you an overview, but you will have to read up on the rest yourself.
Let's begin with the types of data you want to store in HDFS:
Data in motion (what you called real-time data).
So, how can you fetch real-time data? Is it even possible? Strictly speaking, no: there will always be some delay. However, we can reduce the latency and the processing time of the data. For that we have HDF (Hortonworks DataFlow), which works with data in motion. Many services handle real-time data streams, for example Kafka, NiFi, Storm and many more. These tools are used to process the data. You also need to store the data in such a way that you can fetch it in almost no time (~2 sec); for that we use HBase. HBase stores data in a column-oriented (column-family) structure.
Data at rest (Historic/Data stored for future use)
Storing data at rest raises no such issues. HDP (Hortonworks Data Platform) provides the services to ingest, store and process the data. We can even integrate HDF services into HDP (prior to version 2.6), which makes it easier to process data in motion as well. Here we need databases to store a large amount of data. HDFS (Hadoop Distributed File System) can store any kind of data, but we don't ONLY want to store our data, we also want to fetch it quickly when it is required. We do that by storing the data in a structured form, for which we have Hive and HBase. Processing data at this scale (terabytes) requires heavy jobs, and that is where MapReduce, YARN, Spark and Kubernetes come into the picture.
This is the basic idea of storing and processing data in Hadoop.
You can always read up on the rest online.

How can I load data from BigQuery to Spanner?

I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programmatically? We are already using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it in there, that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
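A minimal sketch of such a pipeline with the Apache Beam Python SDK might look like the following. It assumes apache-beam[gcp] is installed and uses the SDK's experimental spannerio module. The column list and every project, instance, database and table name are hypothetical placeholders, and ReadFromBigQuery also needs a temp_location (or gcs_location) in the pipeline options.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical column list for the aggregated table; replace with the real schema.
COLUMNS = ("id", "segment", "total")


def row_to_values(row: Dict[str, Any], columns: Sequence[str] = COLUMNS) -> List[Any]:
    """Order a BigQuery row (a dict) to match the Spanner column list."""
    return [row[name] for name in columns]


def run(project: str, instance: str, database: str, table: str, query: str, pipeline_args=None):
    """Read a query result from BigQuery and upsert it into a Spanner table."""
    # Imported lazily so the helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.io.gcp.experimental.spannerio import WriteMutation, WriteToSpanner
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
        (
            pipeline
            | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
            | "ToMutation" >> beam.Map(
                lambda row: WriteMutation.insert_or_update(
                    table=table, columns=COLUMNS, values=[row_to_values(row)]
                )
            )
            | "WriteToSpanner" >> WriteToSpanner(
                project_id=project, instance_id=instance, database_id=database
            )
        )
```

Since Airflow is already in use, a job like this could be launched from a DAG task on the Dataflow runner. The sketch does not address the switchover between uploads.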

How to read from BigQuery as a stream

I'm using Java + Apache Beam SDK for Java 2.0.1-SNAPSHOT
Scenario:
Read Data from BigQuery(BQ) -> ETL Process in Dataflow -> Write Data in BQ tables
The problem is that the pipeline is trying to process all data before performing the insertion in BQ.
Is there a way to execute stream inserts in this case? I've already tried to set a timestamp to the elements when extracting from BQ, but it didn't work.
Or is it possible to configure BatchLoads so that it inserts data in bulk from time to time?
I would take a look at this link to Google's solution. That said, it sounds like BigQuery is being treated as a bounded source, but that shouldn't be a problem for sinking the data back into Dataflow, see here.

Beam - Handling failures during huge data load for bigquery

I recently started with Apache Beam, and I am sure I am missing something here. I need to load data from a very large database into BigQuery. The tables are huge. So far I have written sample Beam jobs that load a small number of rows from simple tables.
How would I be able to load n rows from a table using JdbcIO? Is there any way to load this data in batches, as we do in conventional data migration jobs?
Can I do a batch read from a database and write in batches to BigQuery?
Also, I have seen that the suggested approach for loading data into BigQuery is to add the files to storage buckets. But in an automated environment, the requirement is to write a Dataflow job that loads from the database and writes to BigQuery. What should my design approach be to solve this with Apache Beam?
Please help!
It looks[1] like BigQueryIO will write batches of data if it comes from a bounded PCollection (otherwise it uses streaming inserts). It also appears to bound the size of each file and batch, so I don't think you'll need to do any manual batching.
I'd just read from your database via JdbcIO, transform the data if needed, and write it with BigQueryIO.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
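The question uses the Java SDK, but the same shape can be sketched with the Beam Python SDK for illustration. This is only a sketch under assumptions: the PostgreSQL driver class, the connection details and the table names are placeholders, ReadFromJdbc is a cross-language transform that needs a Java runtime available, and its output rows are assumed to be NamedTuple-like so that _asdict() works.

```python
from typing import Any, Dict


def to_bq_row(record: Any) -> Dict[str, Any]:
    """Turn a NamedTuple-like JDBC row into the dict form BigQuery writes expect."""
    return dict(record._asdict())


def run(jdbc_url, username, password, source_table, target_table, schema, pipeline_args=None):
    """Read a whole table over JDBC and load it into BigQuery with load jobs."""
    # Imported lazily so the helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.io.jdbc import ReadFromJdbc
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
        (
            pipeline
            | "ReadFromDatabase" >> ReadFromJdbc(
                table_name=source_table,
                driver_class_name="org.postgresql.Driver",  # assumed driver
                jdbc_url=jdbc_url,
                username=username,
                password=password,
            )
            | "ToDict" >> beam.Map(to_bq_row)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                target_table,
                schema=schema,  # e.g. "id:INTEGER,name:STRING"
                # Bounded input is written through load jobs rather than streaming inserts.
                method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

No manual batching appears in the pipeline; splitting the output into files and load jobs is left to the BigQuery sink.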

Easiest way to persist Cassandra data to S3 using Spark

I am trying to figure out how best to store and retrieve data, from Cassandra to S3, using Spark: I have log data that I store in Cassandra. I run Spark using DSE to analyze the data, and it works beautifully. The log data grows daily, and I only need two weeks' worth in Cassandra at any given time. I still need to store older logs somewhere for at least 6 months, and after some research, S3 with Glacier looks like the most promising solution. I'd like to use Spark to run a daily job that finds the logs from day 15, deletes them from Cassandra, and sends them to S3. My problem is this: I can't seem to settle on the right format for saving the Cassandra rows to a file, such that I can one day load the file back into Spark and run an analysis if I have to. I only want to run the analysis in Spark one day, not persist the data back into Cassandra. JSON seems to be an obvious solution, but is there any other format that I am not considering? Should I use Spark SQL? Any advice is appreciated before I commit to one format or another.
Apache Parquet is designed for this kind of use case. It is a columnar storage format that provides column compression and some indexing.
It is becoming a de facto standard. Many big data platforms are adopting it or at least providing some support for it.
You can query it efficiently directly in S3 using SparkSQL, Impala or Apache Drill. You can also run EMR jobs against it.
To write data to Parquet using Spark, use DataFrame.saveAsParquetFile (deprecated since Spark 1.4 and removed in Spark 2.0; newer versions use DataFrame.write.parquet).
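A daily archive job along these lines could be sketched in PySpark as follows, using the newer DataFrame.write.parquet API. The keyspace, the table, the log_date column and the S3 path are hypothetical. The sketch also assumes the Spark Cassandra connector and an S3 filesystem such as s3a are configured, and it leaves the delete from Cassandra out.

```python
from datetime import date, timedelta


def archive_day(today: date, keep_days: int = 14) -> date:
    """Return the day that lies keep_days before today."""
    return today - timedelta(days=keep_days)


def archive_logs(spark, keyspace: str, table: str, day: date, s3_path: str) -> None:
    """Copy one day of logs from Cassandra to Parquet files under s3_path."""
    logs = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )
    (
        logs.filter(logs["log_date"] == day.isoformat())  # log_date is an assumed column
        .write.mode("append")
        .partitionBy("log_date")  # one directory per day, which keeps later reads selective
        .parquet(s3_path)
    )
```

Reading the archive back later would then be a matter of spark.read.parquet on the same path, filtered on log_date.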
Depending on your specific requirements you may even end up not needing a separate Cassandra instance.
You may also find this post interesting