Storing data obtained from Cassandra in Spark memory and making it available to other Spark Job Server jobs in the same context - apache-spark-sql

I am using Spark Job Server and Spark SQL to get data from a Cassandra table as follows:
public Object runJob(JavaSparkContext jsc, Config config) {
    CassandraSQLContext sq = new CassandraSQLContext(JavaSparkContext.toSparkContext(jsc));
    sq.setKeyspace("rptavlview");
    DataFrame vadevent = sq.sql("SELECT username,plan,plate,ign,speed,datetime,odo,gd,seat,door,ac from rptavlview.vhistory ");
    vadevent.registerTempTable("history");
    sq.cacheTable("history");
    DataFrame vadevent1 = sq.sql("SELECT plate,ign,speed,datetime FROM history where username='"+params[0]+"' and plan='"+params[1]+"'");
    long count = vadevent.rdd().count();
}
But I am getting a "table not found: history" error.
Can anybody explain how to cache Cassandra data in Spark memory and reuse the same data, either across concurrent requests to the same job or as two jobs, one for caching and the other for querying?
I am using DSE 5.0.4, so the Spark version is 1.6.1.

You can allow Spark jobs to share state with other contexts. This link goes into more depth.
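One prerequisite is that every job in the shared context goes through the same SQLContext instance: a temp table registered on one CassandraSQLContext is not visible to a fresh context created by a later job. A minimal sketch of a per-JVM holder, assuming SharedSQLContext is your own class (it is not part of Spark or Spark Job Server):
public final class SharedSQLContext {
    private static volatile CassandraSQLContext instance;

    // Lazily create a single CassandraSQLContext per JVM so that temp tables
    // registered and cached by one job remain visible to later jobs running
    // in the same Spark Job Server context.
    public static CassandraSQLContext get(JavaSparkContext jsc) {
        if (instance == null) {
            synchronized (SharedSQLContext.class) {
                if (instance == null) {
                    instance = new CassandraSQLContext(JavaSparkContext.toSparkContext(jsc));
                }
            }
        }
        return instance;
    }
}
The caching job would then call SharedSQLContext.get(jsc), register and cache the "history" table, and the querying job would call the same method and run its SELECT against the already-cached table.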

Related

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which is reading messages from PubSub Lite and streaming the data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me "This query will process 1.9 GB when run", but when I actually run the query I don't get any results. My pipeline has been running for a whole week now, and I am seeing the same behaviour for the last two days. However, for 2021-10-11 and the days before that I do see actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me there is actually data available?
EDIT 1:
BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
Check more details in this Stack Overflow answer and also in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
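For an ingestion-time partitioned table, you can count the rows still sitting in the streaming buffer with a query like the sketch below (the table name is the question's placeholder); note that the table in the question is partitioned on the timestamp column instead, so the _PARTITIONTIME pseudo column does not apply there:
SELECT COUNT(*) AS rows_in_streaming_buffer
FROM `my-project.my-dataset.my-table`
WHERE _PARTITIONTIME IS NULL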
If you are having problems writing the data from Pub/Sub into BigQuery, I recommend you use a template available in Dataflow.
Use the Dataflow template available in GCP to write the data from Pub/Sub to BigQuery: there is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job as follows (a sketch of the equivalent gcloud command is shown below):
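If you prefer the command line over the console, a roughly equivalent gcloud invocation is sketched below, assuming the Google-provided "Pub/Sub Subscription to BigQuery" template and placeholder names for the job, project, subscription, bucket and table:
gcloud dataflow jobs run pubsub-to-bq-test \
    --gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --region us-central1 \
    --staging-location gs://my-temp-bucket/temp \
    --parameters inputSubscription=projects/my-project/subscriptions/test-topic-sub,outputTableSpec=my-project:my_dataset.output_table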
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template code from GitHub and adjust it to your needs.

Synapse Spark - Deltalake configs for Schema evolution and write optimizations

I am looking for the Databricks-equivalent properties in Synapse Spark. Please let me know if there are any, or a workaround for the same.
I am using the MERGE command to insert/update the data. However, it does not support schema merge. Is there any property to enable auto merge? In Databricks this is:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
How do I control the number of part files or optimize writes with the Delta MERGE command? In Databricks these are:
set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;

Retrieving data from s3 bucket in pyspark

I am reading data from an S3 bucket in PySpark. I need to parallelize the read operation and do some transformation on the data, but it is throwing an error. Below is the code.
s3 = boto3.resource('s3',aws_access_key_id=access_key,aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)
prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix = prefix)
keys=[k.key for k in files]
pkeys = sc.parallelize(keys)
I have a global variable d which is an empty list, and I am appending deviceID data into it.
I am applying flatMap on the keys:
pkeys.flatMap(map_func)
This is the function:
def map_func(key):
    print "in map func"
    for line in key.get_contents_as_string().splitlines():
        # parse one line of json
        content = json.loads(line)
        d.append(content['deviceID'])
But the above code gives me an error. Can anyone help?
You have two issues that I can see. The first is that you are trying to manually read data from S3 using boto instead of using the direct S3 support built into Spark and Hadoop. It looks like you are trying to read text files containing one JSON record per line. If that is the case, you can just do this in Spark:
df = spark.read.json('s3://my-bucket/path/to/json/files/')
This will create a Spark DataFrame for you by reading in the JSON data with each line as a row. DataFrames require a rigid pre-defined schema (like a relational database table), which Spark will try to determine by sampling some of your JSON data. Once you have the DataFrame, all you need to do to get your column is select it like this:
df.select('deviceID')
The other issue worth pointing out is that you are attempting to use a global variable to store data computed across your Spark cluster. It is possible to send data from your driver to all of the executors running on Spark workers using either broadcast variables or implicit closures, but there is no way in Spark to write to a variable in your driver from an executor. To transfer data from executors back to the driver you need to use Spark's action methods, which are intended for exactly this purpose.
Actions are methods that tell Spark you want a result computed, so it needs to go execute the transformations you have told it about. In your case you would probably want one of the following (see the sketch after this list):
If the results are large: use DataFrame.write to save the results of your transformations back to S3
If the results are small: use DataFrame.collect() to download them back to your driver and do something with them
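A minimal PySpark sketch of both options, assuming the JSON records contain a deviceID field and the bucket paths are placeholders:
# Read the JSON files directly and keep only the column we care about
device_ids_df = spark.read.json('s3://my-bucket/path/to/json/files/').select('deviceID')

# Small result: collect the rows back to the driver as plain Python values
device_ids = [row['deviceID'] for row in device_ids_df.collect()]

# Large result: write the rows back to S3 instead of collecting them
device_ids_df.write.mode('overwrite').parquet('s3://my-bucket/output/device-ids/')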

Lazy Evaluation in Spark. How does Spark load data from DB

Suppose we have set a limit of 100, and the Spark application is connected to a DB with a million records. Does Spark load all million records, or does it load them 100 by 100?
How does Spark load data from a DB? It depends on the database type and its connector implementation. Of course, for a distributed processing framework, distributed data ingestion is always the primary aim when building connectors.
As a brief example, if we have 1 million records in a table and we define the number of partitions to be 100 when we load(), then ideally the read task will be distributed across executors so that each executor reads a range of 10,000 records and stores them in its corresponding partition in memory. See SQL Databases using JDBC.
In the Spark UI, you can see that numPartitions dictates the number of tasks that are launched. The tasks are spread across the executors, which can increase the parallelism of the reads and writes through the JDBC interface.
Spark provides flexible interfaces (Spark DataSource V2) that allow us to build our own custom datasource connectors. The main design key here is parallelizing the read operation according to how many partitions defined. Also check (figure 4) to understand how distributed CSV ingestion works in Spark.
Update
Read from JDBC connections across multiple workers
df = spark.read.jdbc(
    url=jdbcUrl,
    table="employees",
    column="emp_no",
    lowerBound=1,
    upperBound=100000,
    numPartitions=100
)
display(df)
In the above sample code, we used a JDBC read to split the table read across executors on the emp_no column, using the column (partitioning column), lowerBound, upperBound, and numPartitions parameters.

Apache Ignite and Apache Spark integration, Cache loading into Spark Context using IgniteRDD

If I create an IgniteRDD out of a cache with 10M entries in my Spark job, will it load all 10M entries into my Spark context? Please find my code below for reference.
SparkConf conf = new SparkConf().setAppName("IgniteSparkIntgr").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaIgniteContext<Integer, Subscriber> igniteCxt = new JavaIgniteContext<Integer,Subscriber>(context,"example-ignite.xml");
JavaIgniteRDD<Integer,Subscriber> cache = igniteCxt.fromCache("subscriberCache");
DataFrame query_res = cache.sql("select id, lastName, company from Subscriber where id between ? and ?", 12, 15);
DataFrame input = loadInput(context);
DataFrame joined_df = input.join(query_res,input.col("id").equalTo(query_res.col("ID")));
System.out.println(joined_df.count());
In the above code, subscriberCache has more than 10M entries. Will the 10M Subscriber objects be loaded into the JVM at any point in the above code, or does it only load the query output?
FYI: Ignite is running in a separate JVM.
The cache.sql(...) method queries data that is already in the Ignite in-memory cache, so before doing this you should load the data. You can use the IgniteRDD.saveValues(...) or IgniteRDD.savePairs(...) method for this. Each of them will iterate through all partitions and load all the data that currently exists in Spark into Ignite.
Note that any transformations or joins that you're doing with the resulting DataFrame will be done in Spark rather than pushed down to Ignite. You should avoid this as much as possible to get the best performance from the Ignite SQL engine.
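A rough sketch of that loading step, assuming a JavaPairRDD<Integer, Subscriber> built elsewhere in the job (loadSubscribers below is a hypothetical helper, not an Ignite API):
// Hypothetical helper that produces the pairs to load; replace with your own source
JavaPairRDD<Integer, Subscriber> subscriberPairs = loadSubscribers(context);

// Push the pairs from Spark into the Ignite cache backing the IgniteRDD
cache.savePairs(subscriberPairs);

// Only the rows matching the predicate come back to Spark as a DataFrame;
// the rest of the entries stay in Ignite's in-memory cache
DataFrame recent = cache.sql("select id, lastName, company from Subscriber where id between ? and ?", 12, 15);
recent.show();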