Dataset and its strict typing vs. DataFrame

I am new to Spark. Regarding Datasets, I read that Spark first checks that the schema matches the types of the data at compile time. In streaming mode, how would this check happen? I mean, how does Spark verify that the schema matches the data when I don't have the data at compile time?
I tried searching for this on my own but couldn't find an answer.

Related

Airflow GCSToBigQueryOperator is reordering my columns

I have the following operators in my DAG. They are receiving data from my MySQL database, uploading it to GCS, and then importing it to BigQuery. It runs great! With one small issue...
I can see that, in between the create and import tasks, the target table is created in BigQuery with the schema specified in the schema argument, with the correct column ordering. But, as soon as the import task runs, the schema of the table changes and the columns are reordered into a seemingly arbitrary ordering. Why does this happen, and is there a way to get BigQuery to stop doing this? I see that there are schema_update_options available on the operator, but the documentation is quite poor...
create = BigQueryCreateEmptyTableOperator(
    task_id="create",
    bigquery_conn_id='google_cloud',
    project_id="<myproject>",
    dataset_id=target_dataset,
    table_id=table_name,
    schema_fields=schema
)
upload = MySQLToGCSOperator(
    task_id='mysql_to_gcs',
    mysql_conn_id='bi_mysql',
    sql=self.sql,
    bucket=self.bucket,
    filename=self.filename,
    export_format='NEWLINE_DELIMITED_JSON',
    google_cloud_storage_conn_id='google_cloud'
)
# 'import' is a reserved keyword in Python, so the task variable is renamed here
import_to_bq = GCSToBigQueryOperator(
    task_id='gcs_to_bigquery',
    bucket=self.bucket,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[self.filename],
    destination_project_dataset_table=f"<myproject>.{target_dataset}.{table_name}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='google_cloud',
    google_cloud_storage_conn_id='google_cloud',
)
create >> upload >> import_to_bq
The re-ordering happens because you did not define schema_fields inside your GCSToBigQueryOperator, which triggered BigQuery schema auto-detection, wherein
BigQuery makes a best-effort attempt to automatically infer the schema from the source data.
In your case, to ensure your columns keep the ordering you defined, you must set schema_fields inside your GCSToBigQueryOperator.
You can also omit BigQueryCreateEmptyTableOperator entirely, since GCSToBigQueryOperator can create BigQuery tables and define their schemas itself.
Please see updated code based on your posted question:
upload = MySQLToGCSOperator(
    task_id='mysql_to_gcs',
    mysql_conn_id='bi_mysql',
    sql=self.sql,
    bucket=self.bucket,
    filename=self.filename,
    export_format='NEWLINE_DELIMITED_JSON',
    google_cloud_storage_conn_id='google_cloud'
)
create_and_import = GCSToBigQueryOperator(
    task_id='gcs_to_bigquery',
    bucket=self.bucket,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[self.filename],
    destination_project_dataset_table=f"<myproject>.{target_dataset}.{table_name}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='google_cloud',
    google_cloud_storage_conn_id='google_cloud',
    schema_fields=schema
)
upload >> create_and_import
You may refer to this GCSToBigQueryOperator Documentation for more details.

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and the schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's TableCopy command instead.
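For illustration, here is a minimal sketch of that approach using a recent Beam Python SDK (the question uses the Java SDK's SchemaAndRecord/readTableRows; the table names here are placeholders, and the pipeline options, including a GCS temp location for the BigQuery export, are assumed to be configured elsewhere):
import apache_beam as beam

SOURCE = 'my-project:my_dataset.source_table'            # placeholder
DESTINATION = 'my-project:my_dataset.partitioned_copy'   # pre-created with the desired partitioning

with beam.Pipeline() as p:
    (
        p
        # Read the existing table; rows come back as Python dicts.
        | 'ReadRows' >> beam.io.ReadFromBigQuery(table=SOURCE)
        # Write them unchanged. CREATE_NEVER means the destination table
        # (and its schema/partitioning) must already exist, so no schema
        # detection is involved here.
        | 'WriteRows' >> beam.io.WriteToBigQuery(
            DESTINATION,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )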
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.

Retrieving data from s3 bucket in pyspark

I am reading data from an S3 bucket in PySpark. I need to parallelize the read operation and do some transformations on the data, but it's throwing an error. Below is the code.
s3 = boto3.resource('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)
prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix=prefix)
keys = [k.key for k in files]
pkeys = sc.parallelize(keys)
I have a global variable d, which is an empty list, and I am appending the deviceID data into it. I apply flatMap on the keys:
pkeys.flatMap(map_func)
This is the function:
def map_func(key):
    print "in map func"
    for line in key.get_contents_as_string().splitlines():
        # parse one line of json
        content = json.loads(line)
        d.append(content['deviceID'])
But the above code gives me an error.
Can anyone help?
You have two issues that I can see. The first is that you are trying to manually read data from S3 using boto instead of using the direct S3 support built into Spark and Hadoop. It looks like you are trying to read text files containing one JSON record per line. If that is the case, you can just do this in Spark:
df = spark.read.json('s3://my-bucket/path/to/json/files/')
This will create a Spark DataFrame for you by reading in the JSON data, with each line becoming a row. DataFrames require a rigid, pre-defined schema (like a relational database table), which Spark will determine by sampling some of your JSON data. After you have the DataFrame, all you need to do to get your column is select it like this:
df.select('deviceID')
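If you'd rather not rely on schema inference, here is a minimal sketch of supplying an explicit schema (only the deviceID field name comes from the question; the use of StringType and the path reuse are assumptions):
from pyspark.sql.types import StructType, StructField, StringType

# Declare just the field we need; other JSON keys are ignored.
schema = StructType([
    StructField('deviceID', StringType(), True),  # assumed to be a string
])

# Same path as above; providing a schema skips the sampling pass.
df = spark.read.json('s3://my-bucket/path/to/json/files/', schema=schema)
df.select('deviceID').show()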
The other issue worth pointing out is that you are attempting to use a global variable to store data computed across your Spark cluster. It is possible to send data from your driver to all of the executors running on Spark workers using either broadcast variables or implicit closures, but there is no way in Spark to write to a variable in your driver from an executor! To transfer data from executors back to the driver, you need to use Spark's action methods, which are intended for exactly this purpose.
Actions are methods that tell Spark you want a result computed, so it needs to go and execute the transformations you have told it about. In your case, you would probably want to do one of the following (a short sketch of both options follows below):
If the results are large: use DataFrame.write to save the results of your transformations back to S3.
If the results are small: use DataFrame.collect() to download them back to your driver and do something with them.
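A minimal sketch of both options, continuing from the df above (the output path and the distinct() step are assumptions, not from the question):
# Large results: write the selected column back to S3 (placeholder path; Parquet chosen arbitrarily).
df.select('deviceID').write.mode('overwrite').parquet('s3://my-bucket/output/device_ids/')

# Small results: collect the rows back to the driver as a plain Python list.
device_ids = [row['deviceID'] for row in df.select('deviceID').distinct().collect()]
print(device_ids[:10])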

Lazy Evaluation in Spark. How does Spark load data from DB

Suppose we have set a limit of 100 and the Spark application is connected to a DB with a million records. Does Spark load all million records, or does it load them 100 by 100?
How does Spark load data from a DB? It depends on the database type and its connector implementation. Of course, for a distributed processing framework, distributed data ingestion is always the primary aim when building connectors.
As a brief example, if we have 1 million records in a table and we set the number of partitions to 100 when we load(), then ideally the read tasks are distributed across the executors so that each one reads a range of 10,000 records and stores them in its corresponding partitions in memory. See SQL Databases using JDBC.
In the Spark UI, you can see that numPartitions dictates the number of tasks that are launched. The tasks are spread across the executors, which can increase the parallelism of the reads and writes through the JDBC interface.
Spark also provides flexible interfaces (Spark DataSource V2) that allow us to build our own custom data source connectors. The main design key here is parallelizing the read operation according to how many partitions are defined. Also check (figure 4) to understand how distributed CSV ingestion works in Spark.
Update
Read from JDBC connections across multiple workers
df = spark.read.jdbc(
    url=jdbcUrl,
    table="employees",
    column="emp_no",
    lowerBound=1,
    upperBound=100000,
    numPartitions=100
)
display(df)
In the above sample code, we used a JDBC read to split the table read across executors on the emp_no column, using the partitionColumn, lowerBound, upperBound, and numPartitions parameters.
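As a side note, the bounds don't have to be hard-coded; here is a rough sketch of deriving them from the table first (same jdbcUrl and employees table as above, with credentials assumed to be carried in the URL as in that snippet):
# Fetch the actual min/max of the partition column with a single-row query.
bounds = spark.read.jdbc(
    url=jdbcUrl,
    table="(SELECT MIN(emp_no) AS lo, MAX(emp_no) AS hi FROM employees) AS b"
).collect()[0]

df = spark.read.jdbc(
    url=jdbcUrl,
    table="employees",
    column="emp_no",
    lowerBound=bounds["lo"],
    upperBound=bounds["hi"],
    numPartitions=100
)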

Does Parquet predicate pushdown work on S3 using Spark (non-EMR)?

Just wondering if Parquet predicate pushdown also works on S3, not only HDFS, specifically if we use Spark (non-EMR).
Further explanation might be helpful, since it may involve some understanding of distributed file systems.
I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.
I generated some dummy data in Spark and saved it as a parquet file locally as well as on S3.
I created multiple Spark jobs with different kind of filters and column selections. I ran these tests once for the local file and once for the S3 file.
I then used the Spark History Server to see how much data each job had as input.
Results:
For the local parquet file: The results showed that the column selection and filters were pushed down to the read as the input size was reduced when the job contained filters or column selection.
For the S3 parquet file: The input size was always the same as the Spark job that processed all of the data. None of the filters or column selections were pushed down to the read; the parquet file was always completely loaded from S3, even though the query plan (.queryExecution.executedPlan) showed that the filters were pushed down.
I will add more details about the tests and results when I have time.
Yes. Filter pushdown does not depend on the underlying file system. It only depends on the spark.sql.parquet.filterPushdown setting and on the type of filter (not all filters can be pushed down).
See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
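To see this in practice, here is a minimal PySpark sketch (the s3a path and column names are placeholders) that enables the setting and inspects the physical plan for a PushedFilters entry:
spark.conf.set("spark.sql.parquet.filterPushdown", "true")   # this is the default anyway

df = spark.read.parquet("s3a://my-bucket/events/")            # placeholder path
filtered = df.filter(df.site == "example.com").select("site", "month")

# If the filter type is supported, the parquet scan in the physical plan
# should list the comparison under PushedFilters.
filtered.explain()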
Here are the keys I'd recommend for s3a work:
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
For committing the work, use the S3A "zero-rename committer" (Hadoop 3.1+) or the EMR equivalent. The original FileOutputCommitters are slow and unsafe.
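For reference, a minimal sketch of applying those keys when building the session in PySpark (values copied from the list above; s3a credentials and the committer setup are assumed to be configured elsewhere):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-parquet-pushdown")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .config("spark.sql.orc.filterPushdown", "true")
    .config("spark.sql.orc.splits.include.file.footer", "true")
    .config("spark.sql.orc.cache.stripe.details.size", "10000")
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .getOrCreate()
)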
Recently I tried this with Spark 2.4, and it seems like predicate pushdown works with S3.
This is the Spark SQL query:
explain select * from default.my_table where month = '2009-04' and site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html' limit 100;
And here is the part of output:
PartitionFilters: [isnotnull(month#6), (month#6 = 2009-04)], PushedFilters: [IsNotNull(site), EqualTo(site,http://jdnews.com/sports/game_1997_jdnsports__article.html/play_ra...
This clearly states that PushedFilters is not empty.
Note: the table used was created on top of AWS S3.
Spark uses the HDFS Parquet and S3 libraries, so the same logic works.
(And in Spark 1.6 they added an even faster shortcut for flat-schema Parquet files.)