`spark.sql.warehouse.dir` is ignored if `enableHiveSupport()` - apache-spark-sql

I ran into a strange situation where a location, hdfs://<host>/hive/warehouse, is used to hold the data of Spark managed tables.
The path seems to come out of nowhere: I have spark.sql.warehouse.dir set in spark-defaults.conf to hdfs://<host>/usr/<usr>/spark/warehouse, a completely different location, and Hive's default warehouse location (metastore.warehouse.dir) is /user/hive/warehouse.
I am using a standalone Hive Metastore Service instead of a full-fledged Hive instance.
This happens only when enableHiveSupport() is used.
If I remove the .enableHiveSupport() call from the SparkSession initialization code, saveAsTable() does what I expect: the data is put inside the path set by spark.sql.warehouse.dir.
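One way to narrow this down is to compare the configured warehouse directory with the database locations the catalog reports. The sketch below assumes (this is not confirmed by the question) that the unexpected path is the location recorded for a database in the metastore, since with Hive support a managed table is normally placed under its database's location. The helper names `warehouse_report` and `databases_outside_warehouse` are made up for illustration; `spark.conf.get` and `spark.catalog.listDatabases()` are the PySpark calls used.

```python
def warehouse_report(spark):
    """Collect the settings that influence where managed tables are stored."""
    return {
        "spark.sql.warehouse.dir": spark.conf.get("spark.sql.warehouse.dir"),
        # Each catalog entry exposes the database name and its locationUri.
        "database_locations": {
            db.name: db.locationUri for db in spark.catalog.listDatabases()
        },
    }


def databases_outside_warehouse(report):
    """Return the databases whose location is not under spark.sql.warehouse.dir."""
    warehouse = report["spark.sql.warehouse.dir"].rstrip("/")
    return {
        name: location
        for name, location in report["database_locations"].items()
        if not location.rstrip("/").startswith(warehouse)
    }
```

Calling `databases_outside_warehouse(warehouse_report(spark))` on a session built with `enableHiveSupport()` would list any database, including `default`, whose stored location differs from the configured warehouse directory.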

Related

Spark streaming writing as delta and checkpoint location

I am trying to stream from a Delta table as a source and then write the result as Delta after performing some transformations. This all worked. I recently looked at some videos and posts about best practices and found that I needed to make one addition and one modification.
The addition was setting a queryName.
The modification was changing the checkpoint location so that it resides alongside the data and not in a separate directory, as it did before.
So, I have one question and one problem.
The question: can I add the queryName now, after my stream has been running for some time, without any consequences?
The problem: now that I have put my checkpoint location in the same directory as my Delta table, I can no longer create an external Hive table. It fails with
pyspark.sql.utils.AnalysisException: Cannot create table ('`spark_catalog`.`schemaname`.`tablename`'). The associated location ('abfss://refined#datalake.dfs.core.windows.net/curated/schemaname/tablename') is not empty but it's not a Delta table
So, this was my original code, which worked
from delta.tables import DeltaTable

def upsert(microbatchdf, batchId):
    # ..... some transformations on microbatchdf
    # ..........................
    # Create Delta table beforehand as otherwise generated columns can't be created
    # after having written the data into the data lake with the usual partitionBy
    deltaTable = (
        DeltaTable.createIfNotExists(spark)
        .tableName(f"{target_schema_name}.{target_table_name}")
        .addColumns(microbatchdf_deduplicated.schema)
        .addColumn(
            "trade_date_year",
            "INT",
            generatedAlwaysAs="YEAR(trade_date)",
        )
        .addColumn(
            "trade_date_month",
            "INT",
            generatedAlwaysAs="MONTH(trade_date)",
        )
        .addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
        .partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
        .location(
            f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/{target_table_location_directory}"
        )
        .execute()
    )
    # ..... some transformations and writing to the delta table
# end
# this is how the stream is run
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option(
        "checkpointLocation",
        f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/curated/checkpoints/",
    )
    .start()
)
streamjob.awaitTermination()
To this working code, I only added the queryName and changed the checkpoint location (see the comments marking the addition and the change):
streamjob = (
    spark.readStream.format("delta")
    .table(f"{source_schema_name}.{source_table_name}")
    .writeStream.format("delta")
    .queryName(f"{source_schema_name}.{source_table_name}")  # this added
    .outputMode("append")
    .foreachBatch(upsert)
    .trigger(availableNow=True)
    .option(
        "checkpointLocation",
        f"abfss://{target_table_location_filesystem}@{datalakename}.dfs.core.windows.net/{target_table_location_directory}/_checkpoint",  # this changed
    )
    .start()
)
streamjob.awaitTermination()
In my data lake the _checkpoint folder did get created, and apparently it is because of this folder that the external table creation complains about a non-empty location, whereas the documentation here mentions that
So why does the external Hive table creation fail? Also, please note my question about adding the queryName to an already running stream.
A point to note: I have tried dropping the external table and also removed the contents of that directory, so there is nothing in it except the _checkpoint folder, which got created when I ran the streaming job, just before it got to creating the table inside the upsert method.
I am happy to clarify if there are any questions.
The problem is that the checkpoint files are written before you call the `DeltaTable.createIfNotExists` function. That function checks whether there is any data in the target location, and it fails because files are already there that don't belong to a Delta Lake table.
If you want to keep the checkpoint with your data, you need to move DeltaTable.createIfNotExists(spark)... outside of the upsert function. In that case, the table is created before any checkpoint files are written.
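The reordering described in this answer could look roughly like the sketch below. It is not the asker's actual job. The function names, the `target_schema` parameter (the schema of the data after the transformations, which now has to be known before the stream starts) and the argument layout are assumptions made for illustration. The Delta and Spark calls mirror the ones already used in the question.

```python
def table_location(filesystem, account, directory):
    # ABFS URIs take the form abfss://<container>@<account>.dfs.core.windows.net/<path>
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{directory}"


def checkpoint_location(filesystem, account, directory):
    # Checkpoint kept in a _checkpoint folder next to the table data.
    return f"{table_location(filesystem, account, directory)}/_checkpoint"


def create_target_table(spark, table_name, schema, location):
    # Imported lazily so the path helpers work without delta-spark installed.
    from delta.tables import DeltaTable

    return (
        DeltaTable.createIfNotExists(spark)
        .tableName(table_name)
        .addColumns(schema)
        .addColumn("trade_date_year", "INT", generatedAlwaysAs="YEAR(trade_date)")
        .addColumn("trade_date_month", "INT", generatedAlwaysAs="MONTH(trade_date)")
        .addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
        .partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
        .location(location)
        .execute()
    )


def run_stream(spark, source_table, target_table, target_schema, upsert,
               filesystem, account, directory):
    # Step 1: create the table while its location holds no files yet.
    create_target_table(
        spark, target_table, target_schema,
        table_location(filesystem, account, directory),
    )
    # Step 2: only now start the stream, which writes the _checkpoint folder.
    query = (
        spark.readStream.format("delta")
        .table(source_table)
        .writeStream.foreachBatch(upsert)
        .trigger(availableNow=True)
        .option("checkpointLocation",
                checkpoint_location(filesystem, account, directory))
        .start()
    )
    query.awaitTermination()
```

With this layout the `upsert` function no longer creates the table. It only transforms each micro-batch and writes it to the already existing table.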

missing variables in HRRR data accessing via THREDDS Data Server

I accessed HRRR data through the THREDDS server as shown here. However, the two variables I need,
"Upward long-wave radiation" and "Downward short-wave radiation", are not contained in the dataset I get this way. These variables do exist when the data is accessed through AWS.
{'Best_4_layer_Lifted_Index_pressure_difference_layer',
'Categorical_freezing_rain_surface',
'Categorical_ice_pellets_surface',
'Categorical_rain_surface',
'Categorical_snow_surface',
'Composite_reflectivity_entire_atmosphere',
'Convective_available_potential_energy_pressure_difference_layer',
'Convective_available_potential_energy_surface',
'Convective_inhibition_pressure_difference_layer',
'Convective_inhibition_surface',
'Dewpoint_temperature_height_above_ground',
'Dewpoint_temperature_isobaric',
'Echo_top_cloud_tops',
'Geopotential_height_adiabatic_condensation_lifted',
'Geopotential_height_cloud_ceiling',
'Geopotential_height_cloud_tops',
'Geopotential_height_isobaric',
'Geopotential_height_surface',
'High_cloud_cover_high_cloud',
'Hourly_Maximum_of_Downward_Vertical_Velocity_in_the_lowest_400hPa_pressure_difference_layer_Mixed_intervals_Maximum',
'Hourly_Maximum_of_Simulated_Reflectivity_at_1_km_AGL_height_above_ground_Mixed_intervals_Maximum',
'Hourly_Maximum_of_Updraft_Helicity_over_Layer_2km_to_5_km_AGL_height_above_ground_layer_Mixed_intervals_Maximum',
'Hourly_Maximum_of_Upward_Vertical_Velocity_in_the_lowest_400hPa_pressure_difference_layer_Mixed_intervals_Maximum',
'Lightning_entire_atmosphere',
'Low_cloud_cover_low_cloud',
'Medium_cloud_cover_middle_cloud',
'Per_cent_frozen_precipitation_surface',
'Planetary_boundary_layer_height_surface',
'Precipitable_water_entire_atmosphere_single_layer',
'Pressure_of_level_from_which_parcel_was_lifted_pressure_difference_layer',
'Pressure_reduced_to_MSL_msl',
'Pressure_surface',
'Reflectivity_height_above_ground',
'Snow_depth_surface',
'Storm_relative_helicity_height_above_ground_layer',
'Surface_lifted_index_isobaric_layer',
'Temperature_height_above_ground',
'Temperature_isobaric',
'Total_cloud_cover_entire_atmosphere',
'Total_column_integrated_graupel_entire_atmosphere_single_layer_Mixed_intervals_Maximum',
'Total_precipitation_surface_1_Hour_Accumulation',
'Vertical_u-component_shear_height_above_ground_layer',
'Vertical_v-component_shear_height_above_ground_layer',
'Vertical_velocity_geometric_sigma_layer_Mixed_intervals_Average',
'Vertically_integrated_liquid_water_VIL_entire_atmosphere',
'Visibility_surface',
'Water_equivalent_of_accumulated_snow_depth_surface_1_Hour_Accumulation',
'Wind_speed_gust_surface',
'Wind_speed_height_above_ground_Mixed_intervals_Maximum',
'u-component_of_wind_height_above_ground',
'u-component_of_wind_isobaric',
'u-component_storm_motion_height_above_ground_layer',
'v-component_of_wind_height_above_ground',
'v-component_of_wind_isobaric',
'v-component_storm_motion_height_above_ground_layer'}
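A quick way to confirm that a listing has no radiation fields is to search the names by keyword. The sketch below assumes the set printed above is the variable listing returned for the THREDDS dataset (the question does not state this explicitly) and uses only a handful of those names as sample input. `find_variables` is a hypothetical helper, not part of any THREDDS client.

```python
def find_variables(names, keywords):
    """Return the variable names that contain every keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    return sorted(
        name for name in names
        if all(keyword in name.lower() for keyword in wanted)
    )


# A small sample taken from the listing above.
sample_names = {
    "Temperature_height_above_ground",
    "Total_cloud_cover_entire_atmosphere",
    "Visibility_surface",
    "Wind_speed_gust_surface",
}

print(find_variables(sample_names, ["radiation"]))       # → []
print(find_variables(sample_names, ["cloud", "cover"]))  # → ['Total_cloud_cover_entire_atmosphere']
```

Running the same search against the full listing from each access route shows at a glance which route exposes the radiation variables.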

AsterixDB ERROR: Code: 1 "HYR0010: Node asterix_nc2 does not exist" M1 Mac

I'm trying to set up a sample cluster with AsterixDB on my M1 Mac. My environment is up and running, and I can successfully run SQL queries such as the following:
drop dataverse csv if exists;
create dataverse csv;
use csv;
create type csv_type as {
lat: int32,
long: int32
};
create dataset csv_set (csv_type)
primary key lat;
However, when I try to load the dataset from a CSV file, it seems to brick my sample cluster and throws the error: Error Code: 1 "HYR0010: Node asterix_nc2 does not exist". The code that causes this is below.
use csv;
load dataset csv_set using localfs
(("path"="127.0.0.1:///Users/nicholassantini/Downloads/test.csv"),
("format"="delimited-text"));
So far I have tried both Java's newest release, version 18, and 17.0.3, as well as a variety of ports for the queries. I'm not sure what else to try. Some logs that I think are relevant say that it is failing to connect to the node. I'm not sure whether that's an issue with the port or with the node itself. Here is a snippet of those logs.
image.png
Also, in case it matters, my CSV is a simple 2-column, 2-row file with all single-digit integer values.
I appreciate any and all help.
After consulting the developer help email thread, I found that the issue stems from the release of AsterixDB that I was using (0.9.7.1). Upgrading to the newest release (0.9.8) fixed this issue.
The link can be found here:
https://ci-builds.apache.org/job/AsterixDB/job/asterixdb-snapshot-integration/lastSuccessfulBuild/artifact/asterixdb/asterix-server/target/asterix-server-0.9.8-SNAPSHOT-binary-assembly.zip

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads roughly 70K records from the DB into a data frame; each record has 2 fields.
After reading the data from the DB, we apply a minor mapping and load the result as a broadcast for later use.
In the local environment, running the following code fails with a timeout exception from the RetryingBlockFetcher:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with a local "spark.master".
When I limit the number of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing many Spark-related configurations in my local environment (memory, number of executors, timeout-related settings, and more), but nothing helped. I just got the timeout after a longer wait.
I realized that the data frame I'm reading from the DB has 1 partition of 62K records. When I repartitioned it into 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a Spark configuration that can solve this without repartitioning?
Thanks!
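For reference, the repartition workaround from the update can be sketched in PySpark as below (the question's code is Scala; Python is used here only for illustration). The `LOCAL_SESSION_CONF` part is an assumption, not something the question confirms. The log shows the driver's `task-result-getter` thread failing to connect to an address, so a plausible but unverified explanation is that a large single-partition result is fetched over the network from an address that is not reachable on the local machine. Pinning the driver host to localhost is one configuration-only alternative worth trying.

```python
def collect_mapping(df, num_partitions=2):
    """Collect (id -> mapping_id) pairs to the driver as a plain dict."""
    return (
        df.select("id", "mapping_id")
        .repartition(num_partitions)  # the workaround reported in the update
        .rdd.map(lambda row: (row[0], row[1]))
        .collectAsMap()
    )


# Assumption: untested guess at settings that avoid the unreachable address.
LOCAL_SESSION_CONF = {
    "spark.driver.host": "localhost",
    "spark.driver.bindAddress": "127.0.0.1",
}
```

The settings would be applied when the session is built, for example by passing each key and value to `SparkSession.builder.config(key, value)` before `getOrCreate()`.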

Insert overwrite Hive Table via the databricks notebook is throwing error

I have a batch job that inserts data into a Hive table on a daily basis and creates multiple small ORC files in the blob location. I need to combine all the small ORC files into one larger ORC file so that read performance is better.
To do this, I used to schedule the SQL query below to run every day in Azure HDInsight after my batch job completed. When I try to schedule the same query in an Azure Databricks notebook, it throws the error below. Is there a reason why this works in HDInsight but not in an Azure Databricks notebook?
Is there a better way to achieve this?
My Azure Databricks runtime version: 6.3 (includes Apache Spark 2.4.4, Scala 2.11)
INSERT OVERWRITE TABLE TABLE_NAME SELECT * FROM TABLE_NAME ORDER BY dlloaddate desc;
Error:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyNotReadPath(ddl.scala:962)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:194)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:136)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:136)
at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:54)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:112)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:109)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:109)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:101)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:101)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:137)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:131)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:103)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$executeAndTrack$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$executeAndTrack$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:79)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:115)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:114)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:114)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$analyzed$1.apply(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$analyzed$1.apply(QueryExecution.scala:83)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:75)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:696)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:716)
at com.databricks.backend.daemon.driver.SQLDriverLocal$$anonfun$1.apply(SQLDriverLocal.scala:88)
at com.databricks.backend.daemon.driver.SQLDriverLocal$$anonfun$1.apply(SQLDriverLocal.scala:34)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:34)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:141)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:385)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:362)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:251)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:246)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:288)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:362)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:126)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:141)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:385)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:362)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:251)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:246)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:288)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:362)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE
can be used to merge small ORC files into a larger file, starting in Hive 0.14.0. The merge happens at the stripe level, which avoids decompressing and decoding the data.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
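As a small illustration of the statement quoted above, the helper below builds the `ALTER TABLE ... CONCATENATE` text with an optional partition spec. The function name and the example partition column are assumptions: the question does not say whether the table is partitioned by `dlloaddate`. This is Hive syntax, so it is meant to be submitted through Hive (for example on HDInsight). Whether a given Spark or Databricks runtime accepts it is not something the answer confirms.

```python
def concatenate_statement(table, partition=None):
    """Build a Hive ALTER TABLE ... CONCATENATE statement as a string."""
    sql = f"ALTER TABLE {table}"
    if partition:
        # Render the partition spec as col='value' pairs.
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        sql += f" PARTITION ({spec})"
    return sql + " CONCATENATE"


print(concatenate_statement("TABLE_NAME"))
# → ALTER TABLE TABLE_NAME CONCATENATE
print(concatenate_statement("TABLE_NAME", {"dlloaddate": "2020-01-31"}))
# → ALTER TABLE TABLE_NAME PARTITION (dlloaddate='2020-01-31') CONCATENATE
```

Because the merge works on the table's own files in place, it avoids reading from and overwriting the same path in a single query, which is what the `INSERT OVERWRITE ... SELECT` above was rejected for.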