There is a field of type NUMBER(2,3) in a table in an Oracle database.
When I read this table with Spark, the following exception occurs:
org.apache.spark.sql.AnalysisException:
Decimal scale (3) cannot be greater than precision (2).
How can I solve this problem?
The code I use to read the table data is as follows:
val df = spark.read
  .format("jdbc")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("url", "jdbc:oracle:thin:#**:1521:**")
  .option("user", "**")
  .option("password", "**")
  .option("dbtable", "table_name")
  .load()
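One workaround that is often suggested for this error is to override the decimal type Spark derives from the Oracle metadata, using the customSchema option of the JDBC reader. The sketch below is hedged: COL_A is a placeholder for the real column name, the chosen precision/scale is arbitrary, and the option needs a reasonably recent Spark version (shown in PySpark; the same option exists on the Scala DataFrameReader).

# Hedged sketch: "COL_A" is a placeholder for the NUMBER(2,3) column.
df = (spark.read
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("url", "jdbc:oracle:thin:#**:1521:**")
      .option("user", "**")
      .option("password", "**")
      .option("dbtable", "table_name")
      # Force a decimal type whose scale does not exceed its precision,
      # instead of the NUMBER(2,3) reported by Oracle.
      .option("customSchema", "COL_A DECIMAL(10, 3)")
      .load())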
I have a protocol that needs to take in many (read: millions of) records. The protocol requires all of the data to be sent as a single line-feed-separated payload (InfluxDB / QuestDB line protocol). Using the InfluxDB client isn't currently an option, so I need to do this via a socket.
I am at the end of my ETL process, and now I just have to take the final RDD I have created, take all of those rows, and collapse them into a single string, but I can't figure out how to do this (and how to do it properly!).
In PySpark / AWS Glue I currently have:
import socket

def write_to_quest(df, measurement, table, timestamp_field, args):
    HOST = args['questdb_host']
    PORT = int(args['questdb_port'])
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((HOST, PORT))
        rows = df.rdd.map(lambda row: row.asDict(True))
        new_rdd = rows.map(lambda row:
            _row_to_line_protocol(row, measurement, table, timestamp_field)).glom()
        # transform new_rdd to single_line_rdd here
        sock.sendall((single_line_rdd).encode())
    except socket.error as e:
        print("Got error: %s" % (e))
Called by:
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

allDaily = glueContext.create_dynamic_frame.from_catalog(
    database=args['db_name'],
    table_name="daily",
    transformation_ctx="allDaily",
    push_down_predicate="(date_str='20040302' and meter_id='NEM1206106')"  # for faster testing
)

# TODO: Handle entire DF as a single payload
df = allDaily.toDF()
tdf = df.withColumn('reading_date_time', F.to_timestamp(df['reading_date_time'], '%Y-%m-%dT%H:%M:%S.%f'))
tdf = tdf.drop(*["ingestion_date", "period_start", "period_end", "quality_method",
                 "event", "import_reactive_total", "export_reactive_total"])

write_to_quest(df=tdf, measurement="meter_id", table="daily", timestamp_field="reading_date_time", args=args)
The shape of new_rdd is a set of lists of strings:
RDD[
['string here,to,join','another string,to,join'...x70]
['string here,to,join','another string,to,join'...x70]
['string here,to,join','another string,to,join'...x70]
x200
]
How do I get this so I have a single line that has everything joined by '\n' (newline)?
e.g:
'string here,to,join\nanother string,to,join\n....'
I have so far tried several combinations of foreach like:
foreach(lambda x: ("\n".join(x)))
But to absolutely no avail. I am also concerned about scalability here - for example, I am pretty sure using .collect() on millions of rows is going to kill things.
So what is the best way to solve this final step?
Edit after accepted answer
The specific solution from Werner's answer that I implemented was this (I removed the glom() so that I get one list item per row, and then removed the whitespace separator, as Influx / Quest line protocol is whitespace sensitive):
def write_to_quest(df, measurement, table, timestamp_field, args):
    """
    Open a socket and write the rows directly into Quest
    :param df:
    :param measurement:
    :param table:
    :param timestamp_field:
    :param args:
    :return:
    """
    HOST = args['questdb_host']
    PORT = int(args['questdb_port'])
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((HOST, PORT))
        rows = df.rdd.map(lambda row: row.asDict(True))
        new_rdd = rows.map(lambda row:
            _row_to_line_protocol(row, measurement, table, timestamp_field))
        result = new_rdd.map(lambda r: "".join(r) + "\n") \
            .aggregate("", lambda a, b: a + b, lambda a, b: a + b)
        sock.sendall(result.encode())
    except socket.error as e:
        print("Got error: %s" % (e))
    finally:
        sock.close()
Each row of the RDD can be mapped into one string using map, and then the result of the map call can be aggregated into one large string:
result = rdd.map(lambda r: " ".join(r) + "\n")\
.aggregate("", lambda a,b: a+b, lambda a,b: a+b)
If the goal is to have one large string, all the data has to be moved to a single place, at least for the final step. Using aggregate here is slightly better than collecting all rows and concatenating the strings on the driver, as aggregate can do most of the work distributed and in parallel. However, enough memory for the whole final string is still required on a single node.
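If holding the whole final string on one node ever becomes a problem, a rough alternative (not part of the accepted answer; it assumes the QuestDB socket endpoint accepts many newline-terminated writes on one connection, and it reuses the _row_to_line_protocol helper from the question) is to open a socket per partition and stream each partition's lines directly from the executors:

import socket

def write_partition(lines, host, port):
    # Open one connection per partition and send that partition's
    # line-protocol rows, newline-terminated, without building one
    # giant string on the driver.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((host, port))
        for line in lines:
            sock.sendall((line + "\n").encode())
    finally:
        sock.close()

# new_rdd is the RDD of line-protocol strings from the question;
# HOST and PORT are plain values captured by the closure.
new_rdd.foreachPartition(lambda lines: write_partition(lines, HOST, PORT))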
The pyspark groupby operation does not produce unique group keys for large data sets
I see repeated keys in the final output.
new_df = df.select('key','value') \
.where(...) \
.groupBy('key') \
.count()
E.g. the above query returns multiple rows for the same groupBy key. The datatype of the groupBy column ('key') is string.
I'm storing the output as CSV by doing:
new_df.write.format("csv") \
.option("header", "true") \
.mode("Overwrite") \
.save(CSV_LOCAL_PATH)
E.g. the output in CSV has duplicate rows:
key1, 10
key2, 20
key1, 05
Tested in Spark 2.4.3 and 2.3
There are duplicates, and there is no visible difference between the keys. This happens for multiple keys.
It gives 1 when I count the rows for a particular key:
new_df.select('key','total')\
.where((col("key") == "key1"))\
.count()
I'm not sure if the pyarrow setting makes any difference. I had it enabled before; I tried both enabling and disabling pyarrow, but got the same result.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
I found that the issue was in saving to CSV, which by default trims leading and trailing whitespace, so keys that differed only by whitespace ended up looking like duplicates in the output. Adding the options below resolves it:
.option("ignoreLeadingWhiteSpace", "false")\
.option("ignoreTrailingWhiteSpace", "false")
I have an output Spark DataFrame which needs to be written to CSV. A column in the DataFrame is of 'struct' type, which is not supported by the CSV writer. I am trying to convert it to a string, or to convert the DataFrame to pandas, but nothing works.
from pyspark.sql.functions import explode

userRecs1 = userRecs.withColumn("recommendations", explode(userRecs.recommendations))
# userRecs1.write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')
Expected result: the recommendations column as string type, so that it can be split into two separate columns and written to CSV.
Actual results:
(recommendations column is struct type and cannot be written to csv)
+-------+-----------------+
| ID_CTE|  recommendations|
+-------+-----------------+
|3974081| [2229,0.8915096]|
|3974081| [2224,0.8593609]|
|3974081| [2295,0.8577902]|
|3974081|[2248,0.29922757]|
|3974081|[2299,0.28952467]|
Another option is to convert the struct column to JSON and then save:
from pyspark.sql import functions as f

userRecs1 \
    .select(f.col('ID_CTE'), f.to_json(f.col('recommendations'))) \
    .write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')
The following command will flatten your StructType into separate named columns:
userRecs1 \
    .select('ID_CTE', 'recommendations.*') \
    .write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')
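If you also want the two resulting columns to carry explicit names, the struct fields can be pulled out and aliased individually. The field names below (item and rating) are hypothetical placeholders; check userRecs1.printSchema() for the real ones.

from pyspark.sql import functions as f

# 'item' and 'rating' are hypothetical field names inside the recommendations struct.
userRecs1 \
    .select(
        f.col('ID_CTE'),
        f.col('recommendations.item').alias('rec_item'),
        f.col('recommendations.rating').alias('rec_rating')) \
    .write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')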
I am attempting to insert records into a MySql table. The table contains id and name as columns.
I am doing it like below in a PySpark shell.
name = 'tester_1'
id = '103'
import pandas as pd
l = [id,name]
df = pd.DataFrame([l])
df.write.format('jdbc').options(
url='jdbc:mysql://localhost/database_name',
driver='com.mysql.jdbc.Driver',
dbtable='DestinationTableName',
user='your_user_name',
password='your_password').mode('append').save()
I am getting the below attribute error
AttributeError: 'DataFrame' object has no attribute 'write'
What am I doing wrong? What is the correct method to insert records into a MySql table from PySpark?
Use a Spark DataFrame instead of a pandas one, as .write is available only on Spark DataFrames.
So the final code could be:
data = [('103', 'tester_1')]
df = sc.parallelize(data).toDF(['id', 'name'])
df.write.format('jdbc').options(
url='jdbc:mysql://localhost/database_name',
driver='com.mysql.jdbc.Driver',
dbtable='DestinationTableName',
user='your_user_name',
password='your_password').mode('append').save()
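Alternatively (a sketch under the assumption that a SparkSession named spark is already available in the shell), the DataFrame can be built directly with spark.createDataFrame instead of going through an RDD:

# Build the single-row DataFrame straight from the session.
df = spark.createDataFrame([('103', 'tester_1')], ['id', 'name'])

df.write.format('jdbc').options(
    url='jdbc:mysql://localhost/database_name',
    driver='com.mysql.jdbc.Driver',
    dbtable='DestinationTableName',
    user='your_user_name',
    password='your_password').mode('append').save()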
Just to add to @mrsrinivas's answer:
Make sure that the MySQL connector JAR is available to your Spark session. This code helps:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.jars", "/Users/coder/Downloads/mysql-connector-java-8.0.22.jar") \
    .master("local[*]") \
    .appName("pivot and unpivot") \
    .getOrCreate()

Otherwise it will throw an error.
I have a test CSV file with the below content:
"TVBQGEM461
2016-10-05 14:04:33 cvisser gefixt door company
"
I need to store this entire content in one single row. However, while processing with PySpark, it is getting split into 2 rows.
Below is the PySpark code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("desc", StringType(), True)])

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='false', inferschema='true') \
    .load('/dev/landingzone/test.csv', schema=customSchema)
df.registerTempTable("temp")
sqlContext.sql("create table dev_core_source.test as select * from temp")
Data is getting loaded into the Hive table, but it is split into 2 rows instead of 1 row.
I have also tried some other options for creating the data frame, like the one below, but I am still facing the same issue.
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .option('inferschema', 'true') \
    .option('wholeFile', 'true') \
    .options(parserLib='UNIVOCITY') \
    .load('/dev/landingzone/test.csv', schema=customSchema)
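One thing worth trying, as a hedged sketch: on Spark 2.2+ the built-in CSV reader has a multiLine option, and since the multi-line value in the file is wrapped in double quotes, it should come back as a single row. This assumes a SparkSession named spark is available and reuses the customSchema defined above.

# Read the quoted, multi-line value as one row with the built-in CSV reader
# (assumes Spark 2.2 or later).
df = spark.read \
    .option('header', 'false') \
    .option('multiLine', 'true') \
    .schema(customSchema) \
    .csv('/dev/landingzone/test.csv')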