PySpark SQL: write a DataFrame that has a JSON column to MySQL via JDBC

I am trying to bulk-write a DataFrame to a MySQL database over JDBC. I am using Databricks/pyspark.sql to write the DataFrames to the table. The table has a column that accepts JSON data (binary data). I transformed the JSON object into a StructType with the following structure:
JSON object structure and conversion to DataFrame:
schema_dict = {
    'fields': [
        {'metadata': {}, 'name': 'dict', 'nullable': True,
         'type': {'containsNull': True,
                  'elementType': {'fields': [
                      {'metadata': {}, 'name': 'y1', 'nullable': True, 'type': 'integer'},
                      {'metadata': {}, 'name': 'y2', 'nullable': True, 'type': 'integer'}
                  ], 'type': 'struct'},
                  'type': 'array'}}
    ],
    'type': 'struct'}

cSchema = StructType([
    StructField("x1", IntegerType()), StructField("x2", IntegerType()),
    StructField("x3", IntegerType()), StructField("x4", TimestampType()),
    StructField("x5", IntegerType()), StructField("x6", IntegerType()),
    StructField("x7", IntegerType()), StructField("x8", TimestampType()),
    StructField("x9", IntegerType()), StructField("x10", StructType.fromJson(schema_dict))
])

df = spark.createDataFrame(parsedList, schema=cSchema)
The output DataFrame schema:
df: pyspark.sql.dataframe.DataFrame
    x1: integer
    x2: integer
    x3: integer
    x4: timestamp
    x5: integer
    x6: integer
    x7: integer
    x8: timestamp
    x9: integer
    x10: struct
        dict: array
            element: struct
                y1: integer
                y2: integer
Now I am trying to write this DataFrame to the MySQL table over JDBC:
import urllib
from pyspark.sql import SQLContext
from pyspark.sql.functions import regexp_replace, col

sqlContext = SQLContext(sc)

driver = "org.mariadb.jdbc.Driver"
url = "jdbc:mysql://dburl?rewriteBatchedStatements=true"
trial = "dbname.tablename"
user = "dbuser"
password = "dbpassword"

properties = {
    "user": user,
    "password": password,
    "driver": driver
}

df.write.jdbc(url=url, table=trial, mode="append", properties=properties)
I am getting this error:
An error occurred while calling o2118.jdbc.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 176.0 failed 4 times, most recent failure: Lost task 15.3 in stage 176.0 (TID 9528, 10.168.231.82, executor 5): java.lang.IllegalArgumentException: Can't get JDBC type for struct<dict:array<struct<y1:int,y2:int>>>
Any ideas on how to write a DataFrame that has a JSON column to a MySQL table, or how to solve this issue?
I am using Databricks 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
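A workaround I would try (an assumption on my part, not something verified against this particular table) is to serialize the struct column to a JSON string with to_json before writing, so the JDBC writer only has to map a plain string column:
from pyspark.sql.functions import to_json, col

# Replace the struct column x10 with its JSON string representation so the
# JDBC writer can map it to a text type that MySQL's JSON column accepts.
df_out = df.withColumn("x10", to_json(col("x10")))
df_out.write.jdbc(url=url, table=trial, mode="append", properties=properties)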

Related

How to download a `geojson` file with Pandas from Google Colab?

I'm a newbie in Python and Pandas, and I use Google Colab.
I have a DataFrame that I manipulate in many ways. This is OK.
At the end of my manipulations I have GeoJSON-formatted data:
for record in json_result:
    geojson['features'].append({
        'type': 'Feature',
        'geometry': {
            'type': 'Polygon',
            'coordinates': [[
                [record['tlx'], record['tly']],
                [record['blx'], record['bly']],
                [record['brx'], record['bry']],
                [record['trx'], record['try']],
                [record['tlx'], record['tly']],
            ]],
        },
        'properties': {"surface": record['SURFACE']},
    })
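(For context, the loop assumes geojson was initialized as a GeoJSON FeatureCollection beforehand, something along these lines:)
# Assumed initialization of the geojson object appended to in the loop above
geojson = {'type': 'FeatureCollection', 'features': []}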
The output (geojson) is good; I don't think I have any problem here.
However, I want to download this geojson file to my computer.
I've tried numerous ways, without success.
from google.colab import files
......
How can I do this?
Thanks
Mount your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
Save your GeoJSON object:
import json

# create a JSON string from the geojson dictionary
geojson_str = json.dumps(geojson)

# open the file for writing ("w"), write the JSON string, then close the file
f = open("drive/MyDrive/geojson_export.json", "w")
f.write(geojson_str)
f.close()
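Alternatively, since the question already imports google.colab.files, you can trigger a direct browser download instead of going through Drive (a minimal sketch):
import json
from google.colab import files

# Write the GeoJSON to the Colab VM's local filesystem, then download it to your computer
with open("geojson_export.geojson", "w") as f:
    json.dump(geojson, f)

files.download("geojson_export.geojson")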

Airflow BigQueryInsertJobOperator configuration

I'm having some issues converting from the deprecated BigQueryOperator to BigQueryInsertJobOperator. I have the task below:
bq_extract = BigQueryInsertJobOperator(
    dag="big_query_task",
    task_id='bq_query',
    gcp_conn_id='google_cloud_default',
    params={'data': Utils().querycontext},
    configuration={
        "query": {
            "query": "{% include 'sql/bigquery.sql' %}",
            "useLegacySql": False,
            "writeDisposition": "WRITE_TRUNCATE",
            "destinationTable": {"datasetId": bq_dataset}
        }
    })
This line in my bigquery_extract.sql query is throwing the error:
{% for field in data.bq_fields %}
I want to use 'data' from params, which calls a method that reads from a .json file:
class Utils():
    bucket = Variable.get('s3_bucket')
    _qcontext = None

    @property
    def querycontext(self):
        if self._qcontext is None:
            self.load_querycontext()
        return self._qcontext

    def load_querycontext(self):
        with open(path.join(conf.get("core", "dags"), 'traffic/bq_query.json')) as f:
            self._qcontext = json.load(f)
bq_query.json has this format, and I need to use the nested bq_fields list values:
{
    "bq_fields": [
        { "name": "CONCAT(ID, '-', CAST(ID AS STRING))", "alias": "new_id" },
        { "name": "TIMESTAMP(CAST(visitStartTime * 1000 AS INT64))", "alias": "new_timestamp" },
        { "name": "TO_JSON_STRING(hits.experiment)", "alias": "hit_experiment" }
    ]
}
This file has a list which I want to use in the above-mentioned query line, but it's throwing this error:
jinja2.exceptions.UndefinedError: 'data' is undefined
There are two issues with your code.
First, "params" is not a supported field in BigQueryInsertJobOperator. See this post, where I explain how to pass params to a SQL file when using BigQueryInsertJobOperator: How do you pass variables with BigQueryInsertJobOperator in Airflow
Second, if you happen to get an error that your file cannot be found, make sure you set the full path to your file. I have had to do this when migrating from local testing to the cloud, even though the file is in the same directory. You can set the path in the DAG config as in the example below (replace the path with your own):
with DAG(
    ...
    template_searchpath='/opt/airflow/dags',
    ...
) as dag:
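As a further workaround (my own sketch, not from the linked post, with a hypothetical source table name and file path, reusing bq_dataset and dag from the snippets above), you can render the field list in plain Python when the DAG file is parsed and interpolate it into the query, so no Jinja 'data' variable is needed:
import json
from os import path

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Build the SELECT list from bq_query.json in plain Python (hypothetical path),
# instead of relying on a Jinja `data` variable inside the templated SQL file.
with open(path.join('/opt/airflow/dags', 'traffic/bq_query.json')) as f:
    qcontext = json.load(f)

select_list = ", ".join(
    "{} AS {}".format(field["name"], field["alias"]) for field in qcontext["bq_fields"]
)

bq_extract = BigQueryInsertJobOperator(
    task_id='bq_query',
    gcp_conn_id='google_cloud_default',
    configuration={
        "query": {
            # 'source_table' is a placeholder; substitute your own table reference
            "query": "SELECT {} FROM `{}.source_table`".format(select_list, bq_dataset),
            "useLegacySql": False,
        }
    },
    dag=dag,
)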

Getstream Chat Documentation for Attachments & Actions / How to make clickable predefined answers?

I am not able to find Stream Chat React Native documentation for Actions, or I am not using what I found correctly :-)
I am able to find a message example like:
{
    'text': 'Wonderful! Thanks for asking.',
    'attachments': [
        {
            'type': 'form',
            'title': 'Select your account',
            'actions': [
                {
                    'name': 'account',
                    'text': 'Checking',
                    'style': 'primary',
                    'type': 'button',
                    'value': 'checking'
                },
                {
                    'name': 'account',
                    'text': 'Saving',
                    'style': 'default',
                    'type': 'button',
                    'value': 'saving'
                },
                {
                    'name': 'account',
                    'text': 'Cancel',
                    'style': 'default',
                    'type': 'button',
                    'value': 'cancel'
                }
            ]
        }
    ]
}
Pushing this message results in an OK rendering in the client.
Image: Message in chat client
However, the React-Native client throws an error when clicking any of the 3 buttons.
The error I get is:
WARN Possible Unhandled Promise Rejection (id: 0):
Error: StreamChat error code 4: RunMessageAction failed with error: "invalid or disabled command ``"
I have found some references to documentation like this:
"Actions in combination with attachments can be used to build commands."
But the link does not end up describing anything about Commands.
Does anyone have a tip, e.g. a link to some documentation describing how to make clickable predefined answers work? :-D

RabbitMQ Performance Tool

I'm trying to use this tool (displaying the results in HTML):
https://github.com/rabbitmq/rabbitmq-perf-test/blob/master/html/README.md
I am trying to generate messages into a topic exchange named "MYXG.XYZ", but they just end up in the default direct exchange.
My JSON is:
cat spec-file.js
[{'name': 'AMQPe',
  'type': 'simple',
  'uri': 'amqp://guest:guest@192.168.127.23:5672',
  'exchange-type': 'topic',
  'exchange-name': 'MYXG.XYZ',
  'routing-key': '#',
  'variables': [{'name': 'min-msg-size', 'values': [3200]}],
  'producer-rate-limit': 30000,
  'params': [{
      'time-limit': 100,
      'producer-count': 4,
      'consumer-count': 2
  }]
}]
Can you help?

How should I run a task as the "owner", not the shell owner, in Airflow?

My task code is the following:
from airflow.models import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

rootdir = "/tmp/airflow"

default_args = {
    'owner': 'max',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['max@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('test3', default_args=default_args,
          schedule_interval='*/2 * * * *')

t1 = BashOperator(
    task_id='test3-task1',
    bash_command='date >> {rootdir}/test3-task1.out'.format(rootdir=rootdir),
    owner='max',
    dag=dag)

t2 = BashOperator(
    task_id='test3-task2',
    bash_command='whoami',
    retries=3,
    owner='max',
    dag=dag)
Then I run the command "airflow test test3 test3-task2 2016-07-25" as the 'airflow' Linux user. The output of "whoami" is "airflow".
But I expected the output to be the "owner" of the task.
What am I doing wrong?
Thanks
The following is the output:
[2016-07-25 11:22:37,716] {bash_operator.py:64} INFO - Temporary script location :/tmp/airflowtmpoYNJE8//tmp/airflowtmpoYNJE8/test3-task2U1lpom
[2016-07-25 11:22:37,716] {bash_operator.py:65} INFO - Running command: whoami
[2016-07-25 11:22:37,722] {bash_operator.py:73} INFO - Output:
[2016-07-25 11:22:37,725] {bash_operator.py:77} INFO - airflow
[2016-07-25 11:22:37,725] {bash_operator.py:80} INFO - Command exited with return code 0
You can use the "run_as_user" parameter under default_args:
default_args = {
    'owner': 'max',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'email': ['max@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'run_as_user': 'max'
}
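run_as_user can also be set on an individual operator (a minimal sketch, assuming the 'max' account exists on the worker and the airflow user is allowed to impersonate it):
t2 = BashOperator(
    task_id='test3-task2',
    bash_command='whoami',
    retries=3,
    owner='max',
    run_as_user='max',   # the bash command for this task now runs as 'max'
    dag=dag)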
It doesn't look like what you're trying to do is supported. Looking at the source code for both the bash_operator and the BaseOperator, neither makes any attempt to change users before executing the task, unfortunately.