I have some code that works without generating small files in Hive, but when I run the same logic in PySpark, I hit the small-file issue.
Hi folks, I use the insert overwrite method to get rid of the small-file issue in an ORC table, like this:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table {tgt_db}.{tgt_tbl} PARTITION (p_checking_date='{espdate}')
select
week_num,
.....
.....
from {stg_db}.{stg_tbl} where p_checking_date='{espdate}'
order by .... desc
It works pretty well in Hive, but when I run it from PySpark it generates lots of small files:
sql = """
insert overwrite table {tgt_db}.{tgt_tbl} PARTITION (p_checking_date='{espdate}')
select
week_num,
........
from {stg_db}.{stg_tbl} where p_checking_date='{espdate}'
order by ..... desc
""".format(tgt_db = tgt_db,
tgt_tbl = tgt_tbl,
espdate=espdate,
stg_db = stg_db,
stg_tbl = stg_tbl)
spark.sql(sql)
I've also tried setting:
spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.tezfiles=true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("set hive.merge.smallfiles.avgsize = 128000000")
spark.sql("set hive.merge.size.per.task = 128000000")
But it does not work... By the way, ALTER TABLE ... CONCATENATE only merges some of the small files rather than all of them.
Could anyone help me? Thank you so much.
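One workaround that is often suggested (an assumption on my part, not something from this thread) is to control the file count on the Spark side, since the hive.merge.* settings are applied by Hive's own execution engines rather than by Spark SQL's writer. A minimal sketch, reusing the same session and SQL string as above:
# Sketch only: the order by introduces a shuffle, and the number of files written
# per partition follows the shuffle partition count, so lowering it reduces the
# number of output files. 10 is just an arbitrary example value.
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.sql(sql)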
Related
I'm trying to run multiple export statements in BigQuery like this:
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT field1, field2 FROM mydataset.table1 ORDER BY field1 LIMIT 10
The problem is that I don't want to overwrite existing files; instead, I want new files to be created only if the query returns something.
I've tried changing the uri to gs://bucket/folder/*1.csv, but this creates an empty file (which I don't want).
I've also set the overwrite parameter to false; this results in "Invalid value: overwrite option is not specified and destination is not empty. at [109:1]".
Any ways to fix this?
You may try and consider the approach below.
EXECUTE IMMEDIATE CONCAT(
"EXPORT DATA OPTIONS(uri='gs://your-bucket-bucket/your-folder/file_",
FORMAT_TIMESTAMP("%F-%T", CURRENT_TIMESTAMP(), "UTC"),
"_*.csv', format = 'CSV', overwrite = true, header = true, field_delimiter = ';') AS SELECT field1, field2 FROM your-dataset.table1 ORDER BY field1 "
)
This approach concatenates the current timestamp to your filename so that it serves as a unique identifier, and a new file is created whenever your query returns something.
In addition, if you experience a location error when executing the above query, the posted answer here will solve the problem.
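If you are driving these exports from Python, a sketch of the same idea with the google-cloud-bigquery client could look like the following (the bucket, folder, dataset, and table names are placeholders, and this is my own illustration rather than part of the original answer):
# Sketch only: run the timestamped EXPORT DATA statement from Python.
# Bucket, folder, dataset, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
EXECUTE IMMEDIATE CONCAT(
  "EXPORT DATA OPTIONS(uri='gs://your-bucket/your-folder/file_",
  FORMAT_TIMESTAMP("%F-%T", CURRENT_TIMESTAMP(), "UTC"),
  "_*.csv', format='CSV', overwrite=true, header=true, field_delimiter=';') ",
  "AS SELECT field1, field2 FROM your_dataset.table1 ORDER BY field1"
)
"""
client.query(sql).result()  # waits for the export job to finish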
I am trying to read data from Hive into PySpark in order to write CSV files. The following SQL query returns 5 months:
select distinct posting_date from my_table
When I read the data with PySpark, I only get 4 months:
sql_query = 'select * from my_table'
data = spark_session.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
I had the same problem in the past, and I solved it by using the deprecated API for running SQL:
sql_context = SQLContext(spark_session.sparkContext)
data = sql_context.sql(sql_query)
data.groupBy("posting_date").count().orderBy("posting_date").show()
The problem is that on my current project I have the same issue, and I cannot solve it with either method.
I also tried using HiveContext instead of SQLContext, but had no luck.
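One thing worth checking (my assumption, not something from the original post) is whether Spark is serving stale table or partition metadata from its catalog cache. A minimal sketch:
# Sketch only: rule out stale catalog or partition metadata before digging deeper.
spark_session.catalog.refreshTable("my_table")   # drop Spark's cached metadata for the table
spark_session.sql("MSCK REPAIR TABLE my_table")  # re-discover partitions, if the table is partitioned
spark_session.sql("select distinct posting_date from my_table").show()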
I want to delete data from a Delta file in Databricks.
I'm using these commands:
Ex:
PR=spark.read.format('delta').options(header=True).load('/mnt/landing/Base_Tables/EventHistory/')
PR.write.format("delta").mode('overwrite').saveAsTable('PR')
spark.sql('delete from PR where PR_Number=4600')
This deletes data from the table but not from the actual Delta file. I want to delete the data in the file without using a merge operation, because the join condition does not match. Can anyone please help me resolve this issue?
Thanks
Please do remember: subqueries are not supported in DELETE for Delta.
Issue Link : https://github.com/delta-io/delta/issues/730
From the documentation itself, an alternative is as follows.
For example:
DELETE FROM tdelta.productreferencedby_delta
WHERE id IN (SELECT KEY
FROM delta.productreferencedby_delta_dup_keys)
AND srcloaddate <= '2020-04-15'
This can be written as below in the case of Delta:
MERGE INTO delta.productreferencedby_delta AS d
using (SELECT KEY FROM tdatamodel_delta.productreferencedby_delta_dup_keys) AS k
ON d.id = k.KEY
AND d.srcloaddate <= '2020-04-15'
WHEN MATCHED THEN DELETE
Using Spark SQL functions in Python, it would be:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

dt_path = "/mnt/landing/Base_Tables/EventHistory/"
my_dt = DeltaTable.forPath(spark, dt_path)
seq_keys = ["4600"]  # You could add here as many as you want
my_dt.delete(col("PR_Number").isin(seq_keys))
And in Scala:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val dt_path = "/mnt/landing/Base_Tables/EventHistory/"
val my_dt: DeltaTable = DeltaTable.forPath(spark, dt_path)
val seq_keys = Seq("4600") // You could add here as many as you want
my_dt.delete(col("PR_Number").isin(seq_keys:_*))
You can remove data that matches a predicate from a Delta table
https://docs.delta.io/latest/delta-update.html#delete-from-a-table
It worked like this:
delete from delta.`/mnt/landing/Base_Tables/EventHistory/` where PR_Number=4600
I have seen many posts and followed the syntax to write the query below, but it still gives the error "Column/Parameter wm_ad_hoc.temp.temp does not exist".
Please assist in figuring out what I am doing wrong here.
UPDATE temp
FROM wm_ad_hoc.OWNED_ITEM_STORE_DLY temp,
wm_ad_hoc.OWNED_ITEM_STORE_DLY_UTIL util
SET temp.VENDOR_STOCK_ID = util.VENDOR_STOCK_ID,
temp.ON_HAND_EACH_QTY = util.ON_HAND_EACH_QTY,
temp.VENDOR_STOCK_ID = util.VENDOR_STOCK_ID
WHERE temp.VENDOR_NBR = util.VENDOR_NBR
AND temp.WMI_ITEM_NBR = util.WMI_ITEM_NBR
AND temp.store_nbr = util.store_nbr
AND temp.BUSINESS_DATE = util.BUSINESS_DATE
You need to not qualify your SET columns. So:
UPDATE temp
FROM wm_ad_hoc.OWNED_ITEM_STORE_DLY temp,
wm_ad_hoc.OWNED_ITEM_STORE_DLY_UTIL util
SET VENDOR_STOCK_ID = util.VENDOR_STOCK_ID,
ON_HAND_EACH_QTY = util.ON_HAND_EACH_QTY,
VENDOR_STOCK_ID = util.VENDOR_STOCK_ID
...
For a table with
create table mytable (
..
)
partitioned by (my_part_column String)
We are executing a Hive SQL query as follows:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
data = hc.sql("select * from my_table limit 10")
The values read back show "my_part_column" as the FIRST item in each row instead of the last one.
It turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1:
https://issues.apache.org/jira/browse/SPARK-5049
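If upgrading is not an option, one workaround (my assumption, not from the original post; col_a and col_b are placeholder column names) is to make the column order explicit instead of relying on select *:
# Sketch only: listing the columns explicitly avoids depending on the physical
# column order that select * returns, so the partition column appears exactly
# where it is asked for.
data = hc.sql("select col_a, col_b, my_part_column from my_table limit 10")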