How to fix the block size in external Databricks tables?

I have a SQL notebook to change data and insert into another table.
I'm trying to change the size of the files stored in Blob Storage: I want fewer, bigger files. I have tried changing a lot of parameters.
Then I noticed a behaviour.
When I run the notebook, the command creates files of almost 10 MB each.
If I create the table internally in Databricks and run another command
create external_table as
select * from internal_table
the files are almost 40 MB each...
So my questions are:
Is there a way to fix the minimum block size in external Databricks tables?
When transforming data in a SQL notebook, are there best practices? For example, transform all the data and store it locally, then move the data to the external source?
Thanks!

Spark doesn't have a straightforward way to control the size of output files. One method people use is to call repartition or coalesce with the desired number of files. To control output file size this way, you need an idea of how many files you want to create. For example, to create 10 MB files when your output data is 100 MB, you could call repartition(10) before the write command.
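The arithmetic behind picking the repartition count can be sketched in a few lines of Python. This is only an illustration: partitions_for_target_size is a hypothetical helper, and it assumes you already have a rough estimate of the total output size and that the data is spread evenly across partitions.

```python
import math

MB = 1024 * 1024


def partitions_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Estimate how many partitions give files of roughly target_file_bytes each."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Round up so no file exceeds the target, and never go below one partition.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# The example from the answer: 100 MB of output, 10 MB per file.
num_files = partitions_for_target_size(100 * MB, 10 * MB)
print(num_files)  # 10
```

The resulting number is what you would pass to repartition before the write. Actual file sizes will still vary with compression and skew.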
It sounds like you are using Databricks, in which case you can use the OPTIMIZE command for Delta tables. Delta's OPTIMIZE will take your underlying files and compact them for you into approximately 1GB files, which is an optimal size for the JVM in large data use cases.
https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html

Related

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while the data is written to the staging area, it is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000, and that causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I get the same error: a leading zero appears in front of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good. I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to write your data to BlobStorage / ADLS (the direct copy does this anyway), preferably in the Parquet file format and with a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity that loads the data from those files into a table; you can use a regular query or write a stored procedure and call it.
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created by a direct copy to Snowflake, I ended up writing a plain file to blob storage and reading it back to load it into Snowflake.
So instead of having it all in one step, I manually split it into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file on blob storage.
And a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.
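If you write and post-process the intermediate file yourself, as in the two-step approach, the malformed values from the question could also be cleaned up before loading. The Python sketch below is only an illustration under that assumption: strip_leading_zero_component is a hypothetical helper, not part of ADF or Snowflake, and it only handles the exact pattern reported in this thread (a "0:" in front of an otherwise normal time).

```python
import re

# Matches values like "0:08:00:00.0000000": a leading "0:" followed by HH:MM:SS[.fraction].
_PREFIXED_TIME = re.compile(r"^0:(\d{2}:\d{2}:\d{2}(?:\.\d+)?)$")


def strip_leading_zero_component(value: str) -> str:
    """Drop the leading "0:" if present; return any other value unchanged."""
    match = _PREFIXED_TIME.match(value)
    return match.group(1) if match else value


cleaned = strip_leading_zero_component("0:08:00:00.0000000")
print(cleaned)  # 08:00:00.0000000
```

Values that do not match the pattern are passed through untouched, so the helper is safe to apply to a whole column.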

Is there any performance-wise better option for exporting data to local than Redshift unload via s3?

I'm working on a Spring project that needs to export Redshift table data into a single local CSV file. The current approach is to:
Execute Redshift UNLOAD via JDBC to write data across multiple files to S3
Download said files from S3 to local
Join them together into one single CSV file
UNLOAD (
'SELECT DISTINCT #{#TYPE_ID}
FROM target_audience
WHERE #{#TYPE_ID} is not null
AND #{#TYPE_ID} != \'\'
GROUP BY #{#TYPE_ID}'
)
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_'
credentials 'aws_access_key_id=#{#accesskey};aws_secret_access_key=#{#secretkey}'
DELIMITER AS ',' ESCAPE GZIP ;
The above approach has worked fine, but I think the overall performance could be improved by, for example, skipping the S3 part and getting the data directly from Redshift to local.
After searching through online resources, I found that you can export data from Redshift directly through psql, or perform SELECT queries and move the result data yourself. But neither option can top Redshift UNLOAD performance with parallel writing.
So is there any way I can mimic UNLOAD's parallel writing to achieve the same performance without having to go through S3?
You can avoid the need to join files together by using UNLOAD with the PARALLEL OFF parameter. It will output only one file.
This will, however, create multiple files if the filesize exceeds 6.2GB.
See: UNLOAD - Amazon Redshift
It is doubtful that you would get better performance by running psql, but if performance is important for you then you can certainly test the various methods.
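If you keep the parallel UNLOAD, the final joining step from the question can be fairly cheap. The Python sketch below assumes the gzip parts have already been downloaded to a local directory. merge_gzip_parts and the part file names are made up for illustration and are not the names UNLOAD produces.

```python
import gzip
import os
import shutil
import tempfile


def merge_gzip_parts(part_paths, output_csv):
    """Decompress each gzip part, in sorted name order, into one CSV file."""
    with open(output_csv, "wb") as out:
        for part in sorted(part_paths):
            with gzip.open(part, "rb") as src:
                # copyfileobj streams in chunks, so large parts are not held in memory.
                shutil.copyfileobj(src, out)


# Small self-contained demo with two made-up part files.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for name, payload in [("part_0000.gz", b"a,1\n"), ("part_0001.gz", b"b,2\n")]:
        path = os.path.join(tmp, name)
        with gzip.open(path, "wb") as f:
            f.write(payload)
        parts.append(path)
    merged = os.path.join(tmp, "merged.csv")
    merge_gzip_parts(parts, merged)
    with open(merged, "rb") as f:
        merged_bytes = f.read()
```

Because the parts are streamed one after another, the merge is bounded by disk and decompression speed rather than memory.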
We do exactly what you are trying to do here. In our performance comparison, psql turned out to be almost the same, or even better in some cases, for our use case. It is also easier to program and debug, as there is practically only one step.
# Replace user, password, host, region and dbname appropriately in the command below
psql "postgresql://user:password@xxx1.xxxx.us-region-1.redshift.amazonaws.com:5439/dbname?sslmode=require" -c "select C1,C2 from sch1.tab1" > ABC.csv
This lets us avoid three steps:
Unload using JDBC
Download the exported data from S3
Decompress the gzip file (we used gzip to save network I/O).
It also saves some cost (S3 storage, though it's negligible).
By the way, from pgsql 9.0+ onwards, sslcompression is on by default.

Append data in existing file in U-SQL

Can we append data to an existing file in U-SQL?
I have created a CSV file as output in U-SQL. I am writing another U-SQL query and I want to append the output of that query to the existing file.
Is it possible?
It's not supported, and would go against the design of a robust, distributed, idempotent big data system (although you could implement that behaviour by reading the previous output as a rowset and do UNION ALL).
The best way to deal with this is to use partitions properly, for example, create one or more new partitions for each of your executions: https://msdn.microsoft.com/en-us/library/azure/mt621324.aspx

Hive: create table and write it locally at the same time

Is it possible in hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track eventual mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I usually do is use hive -e "select * from db.table" > filename.tsv to get the data locally; however, when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly, doing it the way you are is the better of the two possible ways, but it is worth noting that you can perform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store the result somewhere in a local directory (as long as there is enough space and you have the correct privileges).
A disadvantage of this is that with shell redirection you get the data nicely tab-delimited and newline-separated, whereas this method stores the values with the Hive default field delimiter, which is '^A' (Ctrl-A, \001).
A workaround is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this only works in Hive 0.11 or higher.
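On older Hive versions, where the ROW FORMAT clause is not available for this statement, the exported files could be converted afterwards. The Python sketch below assumes the files use Ctrl-A (\x01) as the field delimiter and newlines as row separators. hive_default_to_csv is a hypothetical helper, and the demo file is made up.

```python
import csv
import os
import tempfile

HIVE_DEFAULT_DELIMITER = "\x01"  # assumed field delimiter (Ctrl-A)


def hive_default_to_csv(src_path, dst_path):
    """Rewrite a Ctrl-A delimited text file as a properly quoted CSV file."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            # csv.writer quotes fields that themselves contain commas.
            writer.writerow(line.rstrip("\r\n").split(HIVE_DEFAULT_DELIMITER))


# Small self-contained demo with a made-up export file.
with tempfile.TemporaryDirectory() as tmp:
    exported = os.path.join(tmp, "000000_0")
    with open(exported, "w", newline="", encoding="utf-8") as f:
        f.write("a\x01b\nc,d\x01e\n")
    converted = os.path.join(tmp, "out.csv")
    hive_default_to_csv(exported, converted)
    with open(converted, newline="", encoding="utf-8") as f:
        csv_text = f.read()
```

Going through the csv module rather than a plain string replace keeps values that contain commas intact.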

Can a Hive reducer script give output to more than one table?

I wish to save some data which the reducer script (which is in php) accumulates over the course of its execution into another table (having a totally different schema). What is the best way to go about it?
Also, can we write to a file from within a reducer?
One option you have is to upload the data to HDFS via WebHDFS into the appropriate directory (http://hadoop.apache.org/docs/stable/webhdfs.html / CREATE).
If it's more convenient you can also then expose it as a partition using WebHCatalog (http://hive.apache.org/docs/hcat_r0.5.0/resources.html / PUT ddl/database/:db/table/:table/partition/:partition)
It will then show up in the table you wanted to write it into.
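To make the WebHDFS option a bit more concrete, here is a Python sketch that only builds the URL for a CREATE request. The host, port, path and user are placeholders, and webhdfs_create_url is a hypothetical helper. As far as I know, the upload itself is a PUT to this URL, which the NameNode answers with a redirect to a DataNode, followed by a second PUT with the file content to the redirect location.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user=None, overwrite=False):
    """Build the WebHDFS URL for op=CREATE on an absolute HDFS path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = {"op": "CREATE", "overwrite": "true" if overwrite else "false"}
    if user:
        query["user.name"] = user  # simple (non-Kerberos) authentication
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{urlencode(query)}"


# Placeholder values, for illustration only.
url = webhdfs_create_url(
    "namenode.example.com", 50070, "/user/hive/warehouse/t/part-0", user="hive"
)
print(url)
```

The reducer script would issue the equivalent requests from PHP; the URL layout is the same regardless of the client language.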