I'm a beginner with big data and I'm working with Spark in Scala. I work with dataframes, and to keep things clear I split my code across multiple Scala objects, each with its own main method to run it. The first object loads files into dataframes and the other objects compute statistics. This is some of the code of the first one:
import java.io.File
import scala.util.Try
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SaveMode, SparkSession}

object LoadFiles {

  // case class for the dataset rows
  case class T(X: Option[String], P: Option[String], Y: Option[String])

  println("Load File 1 into dataframe")

  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "LoadFiles1")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    import sqlContext.implicits._

    val dataframe1 = sc.textFile("file1.ttl")
      .map(_.split(" |\\ . "))
      .map(p => T(Try(p(0).toString()).toOption, Try(p(1).toString()).toOption, Try(p(2).toString()).toOption))
      .toDF()

    dataframe1
      .write
      .partitionBy("P") // the partition column must be one of the dataframe's columns (X, P, Y)
      .mode(SaveMode.Overwrite)
      .saveAsTable("dataframe1")
  }
}
The other Scala objects run various computations on the loaded dataframes and create new dataframes. This is the second one:
import java.io.File
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SaveMode, SparkSession}

object Statistics1 {

  def main(args: Array[String]) {
    val sc = new SparkContext("local[*]", "LoadFiles1")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    import sqlContext.implicits._

    // subject query
    val Query1 = spark.sql("SELECT X As Res, P as Pred, COUNT(Y) As nbr FROM dataframe1 GROUP BY X, P")
      .write
      .mode(SaveMode.Overwrite)
      .saveAsTable("stat1")
  }
}
I got this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: dataframe1; line 1 pos 75
How can I fix this?
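The usual cause is that new SQLContext(sc) registers a default session without Hive support before the Hive-enabled builder runs, so saveAsTable most likely writes to an in-memory catalog that disappears when the first JVM exits, and the second application never sees the table. A minimal sketch of the fix, shown in PySpark for consistency with the later examples in this thread (the Scala version uses SparkSession.builder the same way): build everything from a single Hive-enabled SparkSession with the same warehouse directory, and run both jobs from the same working directory so they share the metastore.

from pyspark.sql import SparkSession

# one Hive-enabled session per application; no separate SQLContext
spark = (
    SparkSession.builder
    .appName("LoadFiles1")
    .config("spark.sql.warehouse.dir", "spark-warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

# first application: build the dataframe from this session and persist it
df = spark.createDataFrame([("s1", "p1", "o1")], ["X", "P", "Y"])
df.write.mode("overwrite").partitionBy("P").saveAsTable("dataframe1")

# second application (built the same way and run from the same working
# directory so it sees the same metastore) can then read the table back
stats = spark.sql("SELECT X AS Res, P AS Pred, COUNT(Y) AS nbr FROM dataframe1 GROUP BY X, P")
stats.write.mode("overwrite").saveAsTable("stat1")

In the Scala code above, the equivalent change is to create the Hive-enabled SparkSession first (or drop the SQLContext entirely) and import spark.implicits._ instead of sqlContext.implicits._, so that toDF and saveAsTable go through the session that owns the persistent catalog.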
I have a function parse_report which produces a pandas-on-Spark dataframe. I would like to use it inside a UDF, like this:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, BooleanType, IntegerType, StructField, StructType

# input_schema is a StructType built from a list of StructFields
data_schema = ArrayType(StructType.fromJson(input_schema), False)
S_SCHEMA = StructType([
    StructField('validation_result', BooleanType()),
    StructField('parsed_content', data_schema),
    StructField('year_month', IntegerType()),
])

# parse_report returns a pandas-on-Spark dataframe
parse_udf = F.udf(lambda x, y, z: parse_report(x, y, z, **properties), S_SCHEMA)
df = df.withColumn('p', parse_udf(F.col('testx'), F.col('testy'), F.col('testz')))
As a result, I get this error (I also tried setting the Spark session/context explicitly, which threw the same error). I wanted to use a pandas UDF instead, but saw that nested StructTypes aren't supported:
SparkContext should only be created and accessed on the driver
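The error comes from the UDF body calling into Spark: a pandas-on-Spark dataframe needs the SparkSession, and UDFs run on the executors, where no SparkContext is available. A common workaround is to make the parser return plain Python values that match the result schema and keep all Spark calls on the driver. A minimal sketch, with a hypothetical parse_report_rows standing in for parse_report:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (ArrayType, BooleanType, IntegerType,
                               StringType, StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# hypothetical pure-Python variant of parse_report: it returns plain values
# (bool, list of dicts, int) instead of a pandas-on-Spark dataframe, so it
# never touches the SparkContext on the executors
def parse_report_rows(x, y, z):
    return (True, [{"field": f"{x}-{y}-{z}"}], 202401)

result_schema = StructType([
    StructField("validation_result", BooleanType()),
    StructField("parsed_content", ArrayType(StructType([StructField("field", StringType())]))),
    StructField("year_month", IntegerType()),
])

parse_udf = F.udf(parse_report_rows, result_schema)
df = spark.createDataFrame([("a", "b", "c")], ["testx", "testy", "testz"])
df.withColumn("p", parse_udf("testx", "testy", "testz")).show(truncate=False)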
I have Spark installed locally and I'm running a Jupyter notebook in VSCode. I'm using the test code below to create a small dataframe and display it in the console with .show(), but the output is not aligned:
# %%
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("local").appName("my-application-name").getOrCreate()
)
sc = spark.sparkContext
spark.conf.set("spark.sql.shuffle.partitions", "5")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
columns = ["language", "users_count"]
data = [
("Java", "20000"),
("Python", "100000"),
("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show(truncate=False)
Converting the dataframe to pandas and printing it shows a similar misalignment:
df_pd = df.toPandas()
print(df_pd)
Can you help me figure out where to look to fix this?
Thanks
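The borders drawn by show() are plain ASCII, so misalignment in a notebook is usually the output cell not using a monospace font rather than a Spark problem. Since spark.sql.repl.eagerEval.enabled is already set, one workaround is to let Jupyter render the dataframe as an HTML table by evaluating it as the last expression of a cell instead of calling .show(). A small sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("my-application-name").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

df = spark.createDataFrame(
    [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")],
    ["language", "users_count"],
)

# Last expression of the cell: with eagerEval enabled, Jupyter shows an HTML
# table instead of the fixed-width ASCII output of df.show()
df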
I tried joining two sample dataframes using the code below:
from pyspark import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# read the CSV once and convert the DynamicFrame to two DataFrames
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bayu-wbi-test/customers.csv"]},
    format="csv",
)
DF1 = inputDF.toDF()
DF2 = inputDF.toDF()

DoubleDF = DF1.join(DF2, DF1.col0 == DF2.col0)
DoubleDF.show()
However, I encounter this error when I run it in my Glue container:
An error was encountered:
An error occurred while calling o135.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 4 ($anonfun$withThreadLocalCaptured$1 at FutureTask.java:266) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:351)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at
This container is running on my local machine, and I've already tried increasing the Spark driver memory.
Thanks for the help.
I think it might be related to this issue:
https://issues.apache.org/jira/browse/SPARK-34790
According to this report, https://github.com/delta-io/delta/issues/841, a possible workaround is to set:
sparkConf.set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
I have declared a table in SQLAlchemy with the custom type SqliteDecimal. I am struggling to retrieve the value into a dataframe: the returned type is numpy.float64 where I expect a Decimal. I suspect the TypeDecorator is incorrect:
import decimal
from sqlalchemy import Column
import sqlalchemy.types as types
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
import pandas as pd
Base = declarative_base()
engine = create_engine("sqlite://", echo=False)
Decorator Class:
class SqliteDecimal(types.TypeDecorator):
    """Decimal decorator converts b/w decimal and text in SQLite"""
    impl = types.String

    def load_dialect_impl(self, dialect):
        return dialect.type_descriptor(types.VARCHAR(100))

    def process_bind_param(self, value, dialect):
        return str(value)

    def process_result_value(self, value, dialect):
        if value in ["None", "NaN", "nan"]:
            result = decimal.Decimal("NaN")
        else:
            result = decimal.Decimal(value)
        return result
SQL Alchemy table:
class MyTable(Base):
    __tablename__ = "my_table"
    number = Column(SqliteDecimal, primary_key=True)
Main:
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
row1 = MyTable(number=decimal.Decimal(1.1))
session.add(row1)
session.commit()
session.close()
query = session.query(MyTable)
df = pd.read_sql_query(query.statement, engine)
print(type(df.iloc[0,0]))
Fixed. Needed coerce_float = False:
df = pd.read_sql_query(query.statement, engine, coerce_float=False)
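For what it's worth, with coerce_float=False pandas keeps the objects produced by process_result_value, so type(df.iloc[0, 0]) should now be decimal.Decimal. One more detail worth checking in the example above: decimal.Decimal(1.1) captures the binary float exactly, so passing the value as a string is usually what's intended. A small illustration:

import decimal

print(decimal.Decimal(1.1))    # 1.100000000000000088817841970012523233890533447265625
print(decimal.Decimal("1.1"))  # 1.1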
I want to read JSON data from a folder location through Spark Streaming.
Assume my JSON data is:
{"transactionId":111,"customerId":1,"itemId": 1,"amountPaid": 100}
I want the output in a Spark SQL table like this:
transactionId  customerId  itemId  amountPaid
111            1           1       100
My code is:
package org.training.spark.streaming

import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Duration
import org.apache.spark.sql.functions.udf
import org.training.spark.streaming.sqlstreaming.Persons

object jsonread {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("jsonstreaming")
    val sc = new SparkContext(sparkConf)

    // Create the streaming context with a 40-second batch interval
    val ssc = new StreamingContext(sc, Seconds(40))
    val lines = ssc.textFileStream("src/main/resources/fileStreaming")
    lines.foreachRDD(rdd => rdd.foreach(println))

    val words = lines.flatMap(_.split(","))
    words.foreachRDD(rdd => rdd.foreach(println))

    val sqc = new SQLContext(sc)
    import sqc.implicits._

    words.foreachRDD { rdd =>
      val persons = rdd.map(_.split(":")).map(p => (p(0), p(1))).toDF()
      persons.registerTempTable("data")
      val jsontable = sqc.sql("SELECT * from data")
      jsontable.show
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
Json Data:
{"transactionId":"111","customerId":"1","itemId": "1","amountPaid": "100"}
PySpark code to read the above JSON data:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.types import IntegerType, LongType, DecimalType, StructType, StructField, StringType
from pyspark.sql import Row
from pyspark.sql.functions import col
import pyspark.sql.functions as F
from pyspark.sql import Window

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
ssc = StreamingContext(sc, 5)

stream_data = ssc.textFileStream("/filepath/")

def readMyStream(rdd):
    if not rdd.isEmpty():
        # each RDD contains the JSON lines picked up in this batch
        df = spark.read.json(rdd)
        print('Started the Process')
        print('Selection of Columns')
        df = df.select('transactionId', 'customerId', 'itemId', 'amountPaid') \
               .where(col("transactionId").isNotNull())
        df.show()

stream_data.foreachRDD(lambda rdd: readMyStream(rdd))
ssc.start()
ssc.awaitTermination()  # keep the stream running; call ssc.stop() to end it
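On newer Spark versions the DStream plus foreachRDD plumbing isn't needed: Structured Streaming can read the folder of JSON files directly into a streaming dataframe, given an explicit schema. A minimal sketch, assuming the same /filepath/ folder and console output for inspection:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("json-stream").getOrCreate()

schema = StructType([
    StructField("transactionId", StringType()),
    StructField("customerId", StringType()),
    StructField("itemId", StringType()),
    StructField("amountPaid", StringType()),
])

# every new JSON file dropped into the folder becomes part of the stream
stream_df = spark.readStream.schema(schema).json("/filepath/")

query = (
    stream_df.writeStream
    .format("console")       # print each micro-batch, similar to df.show()
    .outputMode("append")
    .start()
)
query.awaitTermination()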