Pig : Load in table, then overwrite that table after transformation - apache-pig

Let's say I have a table:
db.table
I load the table and do some transforms on it, and, finally, attempt to store it
mytable = LOAD 'db.table' USING HCatLoader();
.
.
-- My transforms
.
.
STORE mytable_final INTO 'db.table' USING HCatStorer();
But the code complains I'm writing into a table with existing data.
I've looked at this JIRA ticket, which seems to be inactive (I have tried using FORCE and OVERWRITE in several places in the STORE command).
I've also looked at this SO post, but the author is loading from one location and storing in a different location. If I use what is in that post, the result from the transformation is no data. Deleting the files isn't an option. I'm thinking of storing the files temporarily, but I don't know if this is the best option.
I am trying to get the behavior you get in Hive using INSERT OVERWRITE.

I am not familiar with HCatLoader and HCatStorer. But if you LOAD from and STORE to HDFS, Pig provides shell commands that enable you to do the deleting and moving from within your script.
STORE A INTO '/this/path/is/temporary';
RMF '/this/path/is/permanent';
MV '/this/path/is/temporary' '/this/path/is/permanent';

Related

Trying to upload a CSV into Snowflake and the date column isn't recognized

The column contains dates in the following format:
dd/mm/yyyy
I've tried DATE, TIMESTAMP etc but whatever I do, I can't seem to upload the file.
In the classic UI you can click on the Load Tablebutton and follow the dialogs to upload your file. It is a bit hard to find. Click on Databases right to the big snowflake icon. Then select a database and a table. Then you should the the button. In the wizard there will be a step for defining the 'File Format'. There, you have to scroll down to define the date format. (see Classic snowflake UI)
Without the classic UI you have to install SnowSQL on your device first (https://docs.snowflake.com/en/user-guide/snowsql-install-config.html).
Start SnowSQL and apply the following steps:
Use the database where to upload the file to. You need various privileges for creating a stage, a fileformat, and a table. E.g. USE TEST_DB
Create the fileformat you want to use for uploading your csv file. E.g.
CREATE FILE FORMAT "TEST_DB"."PUBLIC".MY_FILE_FORMAT TYPE = 'CSV' DATE_FORMAT = 'dd/mm/yyyy';
Create a stage using this fileformat
CREATE STAGE MY_STAGE file_format = "TEST_DB"."PUBLIC".MY_FILE_FORMAT;
Now you can put your file to this stage
PUT file://<file_path>/file.csv #MY_STAGE;
You can check the upload with
SELECT d.$1, ..., d.$N FROM #MY_STAGE/file.csv d;
Then, create your table.
Copy the content from your stage to your table. If you want to transform your data at this point, you have to use an inner select. If not then the following command is enough.
COPY INTO mycsvtable from #MY_STAGE/file.csv;
You can find documentation for configuring the fileupload at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
You can find documentation for configuring the stage at https://docs.snowflake.com/en/sql-reference/sql/create-stage.html
You can find documentation for copying the staged file into a table https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
I recommend that you upload your file with disabled automatically date detection OR that your initial table has a string column instead of a date column. IMHO, it is easier to transform your upload afterwards using the try_to_date function. Using this function it is much easier to handle possible parsing errors.
e.g. SELECT try_to_date(date_column, 'dd/mm/yyyy') as MY_DATE, IFNULL(MY_DATE, date_column) as MY_DATE_NOT_PARSABLE FROM upload;
You see that it is pretty much to do for loading a simple CSV file to Snowflake. It becomes even more complicated when you take into account that every step can cause some specific failures and that your file might contain erroneous lines. This is why my team and I are working at Datameer to make these types of tasks easier. We aim for a simple drag and drop solution that does most of the work for you. We would be happy if you would try it out here: https://www.datameer.com/upload-csv-to-snowflake/

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue, with a source table that looks like this:
There is two columns SLA_Processing_start_time and SLA_Processing_end_time that have the datatype TIME
Somehow, while writing the data to the staged area, the data is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000 and that causes for an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas to why 08:00:00 becomes 0:08:00:00.0000000 and how to avoid it?
Finally, I was able to recreate your case in my environment.
I have the same error, a leading zero appears ahead of time (0: 08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good, I even had problems to use it for AWS environment, I had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to insert your data on BlobStorage / ADLS (this activity did it anyway) preferably in the parquet file format and a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and do the loading of data into a table from files there, you can use a regular query or write a stored procedure and call it.
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the build-in direct to Snowflake copy feature.
Since I could not figure out, how to control that intermediate blob file, that is created on a direct to Snowflake copy, I ended up writing a plain file into the blob storage, and reading it again, to load into Snowflake
So instead having it all in one step, I manually split it up in two actions
One action that takes the data from the AzureSQL and saves it as a plain text file on the blob storage
And then the second action, that reads the file, and loads it into Snowflake.
This works, and is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a solution to create a table, using Databticks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
my suggestion is to use the Azure SQL Spark library, as also mentioned in documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' is what you want to use to have good performances. Just load your file into a DataFrame and bulk copy it to Azure SQL
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!

How do I force hive to always create a consistent filename like 000000_0?

I am doing an Insert overwrite operation through a hive external table onto AWS S3. Hive creates a output file 000000_0 onto S3. However at times I am noticing that it creates file with other names like 0000003_0 etc. I always need to overwrite the existing file but with inconsistent file names I am unable to do so. How do I force hive to always create a consistent filename like 000000_0? Below is an example of how my code looks like, where tab_content is a hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better do not do this and modify your program to accept any number of files in the directory.
Each reducer (or mapper if it runs on map-only) creates it's own file. These reducers do know nothing about each other, they named during creation. Files are marked as 000001_0,000002_0. But it can be 000001_1 also if attempt number 0 has failed and attempt number 1 has succeeded. Also if table is partitioned and there is no distribute by partition key at the end, each reducer will create it's own file in each partition.
You can force it to work on a single final reducer (it can be done for example if you add order by clause or setting set mapred.reduce.tasks = 1;). But bear in mind that this solution is not scalable, because too many data will cause performance problems on single reducer. Also What will happen if attempt 0 has failed and it was restarted and attempt 1 succeeded? It will create 000001_1 instead of 000001_0.

Hive: create table and write it locally at the same time

Is it possible in hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track eventual
mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I do usually is using hive -e "select * from db.table" > filename.tsv to get the data locally; however when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly doing it the way you are is the best way out of the two possible ways but it is worth noting you can preform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store it somewhere in the local directory (as long as there is enough space and correct privileges)
A disadvantage to this is that with a pipe you get the data stored nicely as '|' delimitation and new line separated, but this method will store the values in the hive default '^b' I think.
A work around is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only in Hive 0.11 or higher