SkipTrash in hive in insert overwrite query - hive

I have an INSERT OVERWRITE query in Hive. After the query is executed, the old data is dumped into the trash folder.
Is there any option or property by which this can be avoided?

In Hive 1.2.0 there is a PURGE option for DROP operation: https://issues.apache.org/jira/browse/HIVE-9118
and
https://issues.apache.org/jira/browse/HIVE-7100
Unfortunately, this does not work for external tables or for the INSERT OVERWRITE statement.
But you can still drop the files before the INSERT OVERWRITE (I know this is not always an acceptable solution) using the rm command with the -skipTrash option:
hadoop fs -rm -r -f -skipTrash hdfs://your_table_path/*
If you are on a DEV environment, you may want to disable the trash feature altogether. This can be done by setting fs.trash.interval=0 in core-site.xml.
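Putting the workaround together, a minimal sketch assuming a managed table mydb.mytable stored under a hypothetical warehouse path (table, source, and path are placeholders):
# Permanently delete the current table files, bypassing the trash
hadoop fs -rm -r -f -skipTrash /user/hive/warehouse/mydb.db/mytable/*
# Nothing is left for Hive to move to .Trash during the overwrite
hive -e "INSERT OVERWRITE TABLE mydb.mytable SELECT * FROM mydb.staging_table"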

Related

PostgreSQL Query To Create A Directory

Files are being written to a directory using the COPY query:
Copy (SELECT * FROM animals) To '/var/lib/postgresql/data/backups/2020-01-01/animals.sql' With CSV DELIMITER ',';
However, if the directory 2020-01-01 does not exist, we get the error:
could not open file "/var/lib/postgresql/data/backups/2020-01-01/animals.sql" for writing: No such file or directory
The PostgreSQL server is running inside a Docker container with the volume mapping /mnt/backups:/var/lib/postgresql/data/backups
The Copy query is being sent from a Node.js app outside of the Docker container.
The mapped host directory /mnt/backups was created by Docker Compose and is owned by root, so the Node.js app sending the COPY query is unable to create the missing directories due to insufficient permissions.
The backup file is meant to be transferred out of the Docker container to the Docker host.
Question: Is it possible to use an SQL query to ask PostgreSQL 11.2 to create a directory if it does not exist? If not, how would you recommend the directory creation be done?
Using Node.js 12.14.1 on Ubuntu 18.04 host. Using PostgreSQL 11.2 inside container, Docker 19.03.5
An easy way to solve it is to create the file directly on the client machine. Using COPY ... TO STDOUT, you can redirect the query output to the client's standard output, which you can then catch and save in a file. For instance, using psql on the client machine:
$ psql -U your_user -d your_db -c "COPY (SELECT * FROM animals) TO STDOUT WITH CSV DELIMITER ','" > file.csv
To create the output directory in case it does not exist:
$ mkdir -p /mnt/backups/2020-01/ && psql -U your_user -d your_db -c "COPY (SELECT * FROM animals) TO STDOUT WITH CSV DELIMITER ','" > /mnt/backups/2020-01/file.csv
On a side note: try to avoid exporting files onto the database server. Although it is possible, I consider it a bad practice. Doing so, you will either write a file into the Postgres system directories or give the postgres user permission to write somewhere else, and that is something you shouldn't be comfortable with. Export data directly to the client, either using COPY as I mentioned or following the advice from @Schwern. Good luck!
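For the Docker setup described in the question, the same STDOUT trick can be driven from the host so the file lands directly under /mnt/backups without the server writing anything; a sketch assuming a hypothetical container name pg_container (the mkdir may need sudo, since /mnt/backups is owned by root):
$ mkdir -p /mnt/backups/2020-01-01
$ docker exec pg_container psql -U your_user -d your_db -c "COPY (SELECT * FROM animals) TO STDOUT WITH CSV DELIMITER ','" > /mnt/backups/2020-01-01/animals.csv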
Postgres has its own backup and restore utilities which are likely to be a better choice than rolling your own.
When used with one of the archive file formats and combined with pg_restore, pg_dump provides a flexible archival and transfer mechanism. pg_dump can be used to backup an entire database, then pg_restore can be used to examine the archive and/or select which parts of the database are to be restored. The most flexible output file formats are the “custom” format (-Fc) and the “directory” format (-Fd). They allow for selection and reordering of all archived items, support parallel restoration, and are compressed by default. The “directory” format is the only format that supports parallel dumps.
A simple backup rotation script might look like this:
#!/bin/sh
# Dump a single table into a dated, custom-format archive file.
table='animals'
url='postgres://username@host:port/database_name'
date=`date -Idate`
file="/path/to/your/backups/$date/$table.sql"
# Create the dated backup directory if it does not exist yet
mkdir -p `dirname $file`
pg_dump $url -w -Fc --table=$table -f $file
To avoid hard-coding the database password, -w means pg_dump will not prompt for one and will instead look for a password file (~/.pgpass). Or you can use any of the many other Postgres authentication options.
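To get the data back out of the custom-format archive produced by the script above, a restore could look roughly like this (the database name and paths are placeholders):
# Inspect the archive's contents without restoring anything
pg_restore -l /path/to/your/backups/2020-01-01/animals.sql
# Restore only the animals table into target_db
pg_restore -d target_db --table=animals /path/to/your/backups/2020-01-01/animals.sql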

How to export HDFS directory with beeline (no HDFS access)?

I have access to a Hive cluster through beeline. The results of some queries get stored as files in HDFS (e.g. /user/hive/warehouse/project). These results are just lines of text.
Would it be possible to "download" those files to my local machine using only beeline, as I don't have access to HDFS?
You can with:
INSERT OVERWRITE LOCAL DIRECTORY '/your/path/' SELECT your_query
(Note that through beeline/HiveServer2, the LOCAL directory is local to the host where the statement runs, which may not be your client machine.)
Or try something like this (see the fuller sketch below):
beeline -e "select * from yourtable" > /local/path/your_output
I'm running this command from a Unix server against a remote HDFS cluster.
Regards.
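A slightly fuller beeline invocation, assuming a reasonably recent Hive (the JDBC URL, user, and paths are placeholders); --outputformat=csv2 and --silent=true keep the redirected output free of table borders and log noise:
beeline -u "jdbc:hive2://your_hs2_host:10000/default" -n your_user --silent=true --outputformat=csv2 -e "select * from yourtable" > /local/path/your_output.csv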

Save database on external hard drive

I am creating some databases using PostgreSQL, but I want to save them on an external hard drive due to a lack of space on my computer.
How can I do this?
You can store the database on another disk by specifying it as the data_directory setting. You need to specify this at startup and it will apply to all databases.
You can put it in postgresql.conf:
data_directory = '/volume/path/'
Or, specify it on the command line when you start PostgreSQL:
postgres -c data_directory='/volume/path/'
Reference: 18.2. File Locations
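You can confirm which directory a running server is actually using with a quick check from psql, which should print the path you configured above:
sudo -u postgres psql -c "SHOW data_directory;"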
STEP 1: If PostgreSQL is running, stop it:
sudo systemctl stop postgresql
STEP 2: Get the path to access your hard drive.
(If on Linux) Find and mount your hard drive:
# Retrieve your device's name with:
sudo fdisk -l
# Then mount your device
sudo mount /dev/DEVICE_NAME YOUR_HD_DIR_PATH
STEP 3: Copy the existing database directory to the new location (in your hard drive) with rsync.
sudo rsync -av /var/lib/postgresql YOUR_HD_DIR_PATH
Then rename the previous Postgres main directory with a .bak extension to prevent conflicts:
sudo mv /var/lib/postgresql/11/main /var/lib/postgresql/11/main.bak
Note: my Postgres version was 11. Replace 11 in the path with your version.
STEP 4: Edit postgres configuration file:
sudo nano /etc/postgresql/11/main/postgresql.conf
Change the data_directory line with:
data_directory = 'YOUR_HD_DIR_PATH/postgresql/11/main'
STEP 5: Restart Postgres & Check everything is working
sudo systemctl start postgresql
pg_lsclusters
The output should show the status as 'online':
Ver Cluster Port Status Owner Data directory Log file
11 main 5432 online postgres YOUR_HD_DIR_PATH/postgresql/11/main /var/log/postgresql/postgresql-11-main.log
Finally, you can access your PostgreSQL instance with:
sudo -u postgres psql
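Once pg_lsclusters reports the cluster online and you have confirmed your data is intact, the old directory renamed in STEP 3 can be removed to actually free the space that motivated the move:
# Only after verifying the cluster runs correctly from the new location
sudo rm -rf /var/lib/postgresql/11/main.bak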
You can try following the walkthrough here. This worked well for me and is similar to @Antiez's answer.
Currently, I am trying to do the same, and the only conflict I'm having at the moment is that there seems to be an issue with PostgreSQL's incremental backup and point-in-time recovery processes. I think it has something to do with folder permissions. If I try uploading a ~30MB CSV to the Postgres DB, it crashes and the server will not start again, because files cannot be written to the pg_wal directory. The only file in that directory is 000000010000000000000001, and it does not move on to 000000010000000000000002 etc. while writing to a new table.
My Stack Overflow post looking for a solution to this issue can be found here.

load local data files into hive table failed when using hive

When I tried to load local data files into a Hive table, it reported an error while moving the files. I found the link below, which gives suggestions for fixing this issue. I followed those steps, but it still doesn't work.
http://answers.mapr.com/questions/3565/getting-started-with-hive-load-the-data-from-sample-table txt-into-the-table-fails
After running mkdir /user/hive/tmp and setting hive.exec.scratchdir=/user/hive/tmp, it still reports RuntimeException Cannot make directory: file/user/hive/tmp/hive_2013*. How can I fix this issue? Can anyone familiar with Hive help me? Thanks!
Hive version is 0.10.0
Hadoop version is 1.1.2
I suspect a permission issue here, because you are using the MapR distribution.
Make sure that the user trying to create the directory has permission to create it on CLDB.
An easy way to debug here is to run
$ hadoop fs -chmod -R 777 /user/hive
and then try to load the data, to confirm whether it's a permission issue.
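A rough sequence for that check, assuming the scratch directory from the question and a placeholder table and local file path (adjust to your setup):
# Create the scratch directory on the cluster filesystem and open it up for the test
hadoop fs -mkdir /user/hive/tmp
hadoop fs -chmod -R 777 /user/hive/tmp
# Point Hive at it for this session, then retry the load
hive -e "SET hive.exec.scratchdir=/user/hive/tmp; LOAD DATA LOCAL INPATH '/path/to/local/file.txt' INTO TABLE your_table;"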

Hadoop put command doing nothing!

I am running Cloudera's distribution of Hadoop and everything is working perfectly. The HDFS contains a large number of .seq files. I need to merge the contents of all the .seq files into one large .seq file. However, the getmerge command did nothing for me. I then used cat and piped the data of some .seq files into a local file. When I want to "put" this file into HDFS, it does nothing. No error message shows up, and no file is created.
I am able to "touchz" files in HDFS, and user permissions are not a problem here. The put command simply does not work. What am I doing wrong?
Write a job that merges all the sequence files into a single one. It's just the standard mapper and reducer with only one reduce task.
If the "hadoop" command fails silently, you should have a look at it.
Just type 'which hadoop'; this will give you the location of the "hadoop" executable. It is a shell script, so you can edit it and add logging to see what's going on.
If the hadoop bash script fails at the beginning, it is no surprise that the hadoop dfs -put command does not work.
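A quick way to confirm whether the put really fails silently is to check its exit code and list the destination afterwards, then trace the wrapper script itself; the file and directory names below are placeholders:
hadoop fs -put merged.seq /user/you/merged.seq
echo "put exit code: $?"
hadoop fs -ls /user/you/
# Trace the hadoop wrapper script to see where it bails out
bash -x "$(which hadoop)" fs -ls / 2>&1 | tail -40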