I have a DataFrame loaded from a Hive table. After making some changes to it, I want to save it back to Hive as a new table. Which method should I use? Assume the DataFrame has 70 million records; I want the save to be memory- and time-efficient.
For example:
DataFrame name = df
df.createOrReplaceTempView("new_table")
spark.sql("create table new_table as select * from new_table")
The way I see it, there's no way operation 1 can be more efficient.
createOrReplaceTempView creates a temporary table in memory; you can read about it in this previous question.
As such, between (1) reading from disk to create a temp table in memory and then writing the same table to disk, and (2) reading from disk and writing to disk, number 2 seems the obvious favorite.
If this answer doesn't satisfy you, you can always try both ways and compare the total time and memorySeconds consumed in the YARN application UI.
I created a Delta Lake table in Databricks using a SQL command like the following:
LOCATION '/mnt/s3-mount-point/mytable/'
I then optimized the table:
OPTIMIZE mytable
When I query the table using a simple query like the one below, Spark fires off 26,537 ScanParquet tasks. The table that I'm working with is on the order of 3.5 TB stored in 9,000 files. The cardinality of A is greater than 2 million.
FROM mytable
WHERE A = 'value'
A few questions:
Why is the number of tasks to scan this table significantly greater than the number of files?
Why is every file being scanned if the table is zordered for this column? Shouldn't the data for 'value' be colocated in the same file? In this case, rows with 'value' represent roughly 5E-7 of the whole data, so it should be located in the same parquet file.
Why is the entire table scanned again when I perform the same query? This happens whether or not I cache the table in 'Delta Cache'. I can't cache in memory since it would far overwhelm the memory available to the workers.
The files are most likely optimized to ~1 GB each. Spark splits the files into 128 MB chunks for the read, so each file produces several scan tasks.
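The arithmetic behind the task count can be sketched roughly as below. This is a simplified model, not Spark's actual planner: it assumes every file has the average size, takes the figures from the question (3.5 TB read as decimal terabytes, 9,000 files), assumes a 128 MiB split size, and ignores file-open cost and split packing. The function name is made up for illustration.

```python
import math

def estimate_scan_tasks(total_bytes: float, num_files: int, split_bytes: int) -> int:
    """Rough estimate: each file is cut into ceil(file_size / split_bytes) chunks."""
    avg_file_bytes = total_bytes / num_files
    splits_per_file = math.ceil(avg_file_bytes / split_bytes)
    return num_files * splits_per_file

# Figures from the question; the 128 MiB split size is an assumption.
tasks = estimate_scan_tasks(3.5e12, 9_000, 128 * 1024 * 1024)
print(tasks)
```

Under these assumptions the estimate comes out well above the file count and in the same range as the 26,537 tasks reported in the question.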
If you're using Databricks, I don't expect the entire table to be scanned. You can validate that it's not reading all the files by opening the Spark UI, going to the SQL tab, and looking at the number of files in the read stage.
Lazy evaluation re-runs the DAG. If you're using a Delta Cache enabled instance, it should read from the cache on the second read. Check the Spark UI, then go to the SQL tab and look at the cache read metrics.
If you're seeing something that contradicts what I said, let me know, and I'd like to explore further.
I have a large amount of data in ORC files in AWS S3. The data in ORC files is sorted by uuid. I create an AWS Athena (Presto) table on top of them and run the following experiment.
First, I retrieve the first row to see how much data gets scanned:
select * from my_table limit 1
This query reports 18 MB of data being scanned.
I record the uuid from the row returned from the first query and run the following query:
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query reports 8.5 GB of data being scanned.
By design, both queries return the same result but the second query scans 500 times more data!
Any ideas why this is happening? Is this something inherent to ORC design or is it specific to how Presto interacts with S3?
[EDIT after ilya-kisil's response]
Let's change the last query to only select the uuid column:
select uuid from my_table where uuid=<FIRST_ROW_UUID> limit 1
For this query, the amount of data scanned drops to about 600 MB! This means that the bulk of the 8.5 GB scanned in the second query is attributed to gathering values from all columns for the record found and not to finding this record.
Given that all values in the record add up to no more than 1 MB, scanning almost 8 GB of data to put these values together seems extremely excessive. This seems like some idiosyncrasy of ORC or columnar formats in general and I am wondering if there are standard practices, e.g. ORC properties, that help reduce this overhead?
Well, this is fairly simple. The first query picks an arbitrary record from your data. On top of that, it is not guaranteed to read the very first record, since ORC files are splittable and can be processed in parallel. The second query, on the other hand, looks for a specific record.
Here is an analogy. Let's assume you have 100 coins with a UUID and some other info imprinted on their backs. All of them are face up on a table, so you can't see their UUIDs.
select * from my_table limit 1
This query is like flipping some random coin, looking at what is written on the back, and putting it back on the table face up. Next, someone comes and shuffles all of the coins.
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query is like wanting to look at the information written on the back of one specific coin. It is unlikely that you would flip the correct coin on your first try, so you would need to "scan" more coins (data).
One of the common ways to reduce the size of scanned data is to partition your data, i.e. put it into separate "folders" (not files) in your S3 bucket. The "folder" names can then be used as virtual columns within your table definition, i.e. additional metadata for your table. Have a look at this post, which goes into more detail on how to optimise queries in Athena.
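As a rough illustration of why partitioning cuts the scanned volume, here is a minimal sketch of pruning objects by their "folder" names. It is not how Athena is implemented; the uuid_prefix partition column, the object keys, and both function names are hypothetical.

```python
def parse_partitions(key: str) -> dict:
    """Extract Hive-style key=value path segments from an object key."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts

def prune(keys, column, wanted):
    """Keep only objects whose partition column matches the predicate value."""
    return [k for k in keys if parse_partitions(k).get(column) == wanted]

# Hypothetical layout: one "folder" per uuid prefix.
keys = [
    "my_table/uuid_prefix=0a/part-0000.orc",
    "my_table/uuid_prefix=0a/part-0001.orc",
    "my_table/uuid_prefix=7f/part-0000.orc",
    "my_table/uuid_prefix=ff/part-0000.orc",
]
to_scan = prune(keys, "uuid_prefix", "7f")
print(to_scan)
```

Only the objects under the matching "folder" remain candidates for reading; the rest are skipped based on their path alone.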
I have a dataset that has data added almost every day and needs to be processed every day as part of a larger ETL.
When I select the partition directly, the query is really fast:
SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`
Yet, the issue is that the event type does not generate data on some Sundays, resulting in a non-existing partition on that particular day. Because of this, I cannot use the previous statement to run my daily job.
Another attempt led me to try to have spark find the latest partition of that dataset, in order to be sure the bigger query wouldn't fail:
SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This works every time, but it is unbelievably slow.
I found numerous articles and references on how to build and maintain partitions, yet nothing about how to read them correctly.
Any idea how to have this done properly?
(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reasons for the slowness:
1. The subquery needs to be executed completely before the actual query execution starts.
2. The above query will list all the S3 files to retrieve the partition information. If the folder has a large number of files, this process will take a long time. The time taken for listing is directly proportional to the number of files.
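To make the second point concrete, here is a back-of-the-envelope sketch. It relies on the fact that a single S3 list request returns at most 1,000 keys; the object counts are made-up examples, the function name is hypothetical, and a real Hadoop/Spark listing may issue additional requests per directory, so treat the result as a lower bound.

```python
import math

S3_MAX_KEYS_PER_LIST_CALL = 1000  # one S3 list request returns at most 1,000 keys

def list_calls_needed(num_objects: int, page_size: int = S3_MAX_KEYS_PER_LIST_CALL) -> int:
    """Minimum number of sequential list requests needed to enumerate a prefix."""
    return max(1, math.ceil(num_objects / page_size))

# Hypothetical object counts, for illustration only.
for n in (1_000, 100_000, 2_500_000):
    print(n, list_calls_needed(n))
```

The number of requests grows linearly with the number of objects, which is where the listing time goes.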
We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme; let's say the table is named tbl.
You could then run a partition recovery on it, such as ALTER TABLE tbl RECOVER PARTITIONS, which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing to make it faster.
Then we could run:
SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)
This gets the partition information from the metastore only, without having to list the object store, which I believe is an expensive operation.
The cost incurred in this approach is that of recovering partitions.
After that, all queries will be faster (when data for a new partition arrives, we need to recover partitions again).
Workaround when you don't have Hive:
FileSystem.get(URI.create("s3://datalake/partitoned_table"), conf).listStatus(new Path("s3://datalake/partitoned_table/"))
The above code will give you the list of partition directories, for example: List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21....)
This is very efficient because it only fetches metadata from the S3 location.
Just take the latest partition and use it in your SQL.
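The last step, picking the latest partition and plugging it into the query, could look roughly like this. It is a sketch under assumptions: the listing result is hard-coded from the example above (plus a hypothetical _SUCCESS marker to show that non-partition entries are skipped), partition values are ISO dates so they sort correctly as strings, and latest_partition is a made-up helper name.

```python
def latest_partition(paths, column="partition"):
    """Return the largest value of `column` found among Hive-style directory paths."""
    marker = column + "="
    values = []
    for path in paths:
        last_segment = path.rstrip("/").rsplit("/", 1)[-1]
        if last_segment.startswith(marker):
            values.append(last_segment[len(marker):])
    if not values:
        raise ValueError("no partitions found")
    return max(values)  # ISO dates compare correctly as plain strings

# Stand-in for the listing result shown above.
paths = [
    "s3://datalake/partitoned_table/partition=2019-05-20",
    "s3://datalake/partitoned_table/partition=2019-05-21",
    "s3://datalake/partitoned_table/_SUCCESS",
]
latest = latest_partition(paths)
query = (
    "SELECT * FROM JSON.`s3://datalake/partitoned_table/` "
    f"WHERE partition = '{latest}'"
)
print(query)
```

Because the partition value is now a literal, the engine no longer needs the MAX subquery to decide which partition to read.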
I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used in the query are:
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default distribution style is EVEN. Therefore, the CTAS operation would also have involved redistributing the data among the nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the count query by reading very little data from disk, because it realized that very few columns (or perhaps none) needed to be read to produce the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (and that wasn't processed in full as part of the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all the data returned by the query to disk, while the count query does not; that explains the difference. Writing all rows is a more expensive operation than reading a row count.
I have a table containing 110GB in BLOBs in one schema and I want to copy it to another schema to a different table.
I only want to copy one column of the source table, so I am using an UPDATE statement, but it takes 2.5 hours to copy 3 GB of data.
Is there a faster way to do this?
The code I am using is very simple:
update schema1.A a set blobA = (select blobB from schema2.B b where b.IDB = a.IDA);
IDA and IDB are indexed.
Check whether there are indexes on the destination table that are causing the performance issue. If so, temporarily disable them, then recreate them after the data has been copied from the column in the source table to the column in the destination table.
If you are on Oracle 10 or 11, check ADDM to see what is causing problems. It is probably an I/O or transaction log problem.
What kind of disk storage is this? Did you try copying a 110 GB file from one place to another on that disk system? How long does it take?
I don't know whether Oracle automatically grows the database size or not. If it does, then before running your query, increase the space allocated to the database so that it exceeds the amount you are about to add.
I know that in SQL Server, under the default setup, it will automatically allocate an additional 10% of the database size as you start filling it up. When that fills up, it stops everything and allocates another 10%. When running queries that bulk load data, this can seriously slow the query down.
Also, as zendar pointed out, check the disk I/O. If it has a high queue length, then you may be constrained by how fast the drives work.
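To get a feel for how often the 10% auto-growth described above would interrupt a large load, here is a small sketch. The 100 GB starting size is purely hypothetical, the 110 GB figure comes from the question, and the function is an illustration of the arithmetic rather than a model of any particular database engine.

```python
def growth_events(start_gb: float, added_gb: float, growth_pct: float = 10.0) -> int:
    """Count how many times the file must auto-grow to absorb `added_gb` of new data."""
    size = start_gb
    target = start_gb + added_gb
    events = 0
    while size < target:
        size += size * growth_pct / 100.0  # each growth adds growth_pct of the current size
        events += 1
    return events

# Hypothetical 100 GB database receiving the 110 GB of BLOBs from the question.
print(growth_events(100, 110))
```

Each of those growth events is a pause in the middle of the load, which is why allocating the space up front can help.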