This question may have been asked before; I am relatively new to Hadoop and the Hive query language. I'm trying to export content, as a test, to see if I am doing things correctly. The code is below.
USE MY_DATABASE_NAME;
INSERT OVERWRITE LOCAL DIRECTORY '/random/directory/test'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
SELECT date_ts,script_tx,sequence_id FROM dir_test WHERE date_ts BETWEEN '2018-01-01' and '2018-01-02';
That is what I have so far, but it generates multiple files, and I want to combine them into a single .csv or .xls file to be worked on. My question is: what do I do next to accomplish this?
Thanks in advance.
You can achieve this in the following ways:
Force a single reducer in the query, e.g. by adding ORDER BY <col_name>
Store the output in HDFS and then use the command hdfs dfs -getmerge [-nl] <src> <localdst> (see the sketch after this list)
Using beeline: beeline --outputformat=csv2 -f query_file.sql > <file_name>.csv
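For example, a minimal sketch of option 2, reusing the query from the question (assuming Hive 0.11 or later, where INSERT OVERWRITE DIRECTORY accepts a ROW FORMAT clause):

USE MY_DATABASE_NAME;
-- Write to an HDFS directory instead of a local one
INSERT OVERWRITE DIRECTORY '/random/directory/test'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
SELECT date_ts, script_tx, sequence_id
FROM dir_test
WHERE date_ts BETWEEN '2018-01-01' AND '2018-01-02';

Afterwards, hdfs dfs -getmerge /random/directory/test /some/local/path/test.csv pulls the reducer outputs down as a single local file.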
My question is somewhat similar to the post above. I want to download some data from a Hive table using a select query, but because the data is large, I want to write it as an external table at a given path so that I can create a csv file. I use the code below:
CREATE EXTERNAL TABLE output (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{outdir}/output';

INSERT OVERWRITE TABLE output
SELECT col1, col2 FROM atable LIMIT 1000;
This works fine and creates a file in the 000000_0 format, which can be copied as a csv file.
But my question is: how do I ensure that the output will always be a single file? If no partition is defined, will it always be a single file? What rule does it use to split files?
I saw a few similar questions, like the one below, but they discuss HDFS file access.
How to point to a single file with external table
I know the alternative below, but I use a Hive connection object to execute queries from a remote node.
hive -e ' selectsql; ' | sed 's/[\t]/,/g' > outpathwithfilename
You can set the property below before doing the overwrite:
set mapreduce.job.reduces=1;
Note: if the Hive engine doesn't allow this parameter to be modified at runtime, whitelist it by setting the property below in hive-site.xml:
hive.security.authorization.sqlstd.confwhitelist.append=mapreduce.job.*|mapreduce.map.*|mapreduce.reduce.*
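For example, a minimal sketch combining the setting with the insert from the question above:

set mapreduce.job.reduces=1;

-- The setting only matters if the query actually runs a reduce stage
-- (e.g. an ORDER BY, or the LIMIT trick mentioned in other answers);
-- a map-only insert still writes one file per mapper.
INSERT OVERWRITE TABLE output
SELECT col1, col2 FROM atable LIMIT 1000;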
I am new to using SQL, so please bear with me.
I need to import several hundred csv files into PostgreSQL. My web searches have only turned up how to import many csv files into one table. However, most of the csv files have different column types (all have one-line headers). Is it possible to somehow run a loop and have each csv imported into a table with the same name as the csv? Creating each table manually and specifying columns is not an option. I know that COPY will not work, as the table needs to already be specified.
Perhaps this is not feasible in PostgreSQL? I would like to accomplish this in pgAdmin III or the PSQL console, but I am open to other ideas (using something like R to change the csv to a format more easily entered into PostgreSQL?).
I am using PostgreSQL on a Windows 7 computer. It was requested that I use PostgreSQL, thus the focus of the question.
The desired result is a database full of tables, that I will then join with a spreadsheet that includes specific site data. Thanks!
Use pgfutter.
The general syntax looks like this:
pgfutter csv
In order to run this on all csv files in a directory from Windows Command Prompt, navigate to the desired directory and enter:
for %f in (*.csv) do pgfutter csv %f
Note that the directory containing the downloaded program must be added to the PATH environment variable.
EDIT:
Here is the command line code for Linux users
Run it as
pgfutter *.csv
Or if that won't do
find -iname '*.csv' -exec pgfutter csv {} \;
In the terminal, use nano to create a script that loops over the csv files under my directory and loads them into the Postgres DB:
nano run_pgfutter.sh
The content of run_pgfutter.sh:
#!/bin/bash
# Load every csv file in the directory into Postgres with pgfutter
for i in /mypath/*.csv
do
  ./pgfutter csv "${i}"
done
Then make the file executable:
chmod u+x run_pgfutter.sh
I run the following against a Hive table, myTable.
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
So, this command generates 2 files 000000_0 and 000001_0 inside the folder out/.
But, I need the contents as a single file. What should I do?
There are multiple files in the directory because every reducer writes one file. If you really need the contents as a single file, run your MapReduce job with only one reducer, which will write to a single file.
However, depending on your data size, running a single reducer might not be a good approach.
Edit: Instead of forcing Hive to run one reduce task and output a single reduce file, it would be better to use hadoop fs operations to merge the outputs into a single file.
For example
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt
A bit late to the game, but I found that using LIMIT large_number, where large_number is bigger than the number of rows in your query, forces Hive to use at least one reducer. For example:
set mapred.reduce.tasks=1; INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000
Worked flawlessly.
CLUSTER BY will also do the job.
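Like the LIMIT trick above, CLUSTER BY pushes the query through a reduce stage, so together with a single reduce task you end up with one output file. A rough sketch against the same table (the clustering column is just illustrative):

set mapred.reduce.tasks=1;
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out'
SELECT NAME, PRODUCT, PRC, field1, field2, field3, field4, field5
FROM myTable
CLUSTER BY NAME;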
I know that by doing:
COPY test FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;
I can import csv data to postgresql.
However, I do not have a static csv file. My csv file gets downloaded several times a day and it includes data which has previously been imported to the database. So, to get a consistent database I would have to leave out this old data.
My best-case idea to realize this would be something like the above. The worst case would be a Java program that manually checks each entry of the DB against the csv file. Any recommendations for the implementation?
I really appreciate your answer!
You can load the latest data into a temp table using the COPY command and then merge the temp table with the live table.
If you are using a Java program to execute the COPY command, try the CopyManager API.
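A minimal sketch of that approach (table and column names are hypothetical; the live table is assumed to have a unique key id):

-- Stage the freshly downloaded file, then copy over only the new rows
CREATE TEMP TABLE staging (LIKE live_table INCLUDING DEFAULTS);

COPY staging FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;

INSERT INTO live_table
SELECT s.*
FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM live_table l WHERE l.id = s.id);

DROP TABLE staging;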
Is it possible for me to write an SQL query from within PhpMyAdmin that will search for matching records from a .csv file and match them to a table in MySQL?
Basically I want to do a WHERE IN query, but I want the WHERE IN to check records in a .csv file on my local machine, not a column in the database.
Can I do this?
I'd load the .csv content into a new table, do the comparison/merge and drop the table again.
Loading .csv files into mysql tables is easy:
LOAD DATA INFILE 'path/to/industries.csv'
INTO TABLE `industries`
FIELDS TERMINATED BY ';'
IGNORE 1 LINES (`nogaCode`, `title`);
There are a lot more things you can tell the LOAD DATA command, like which character wraps the entries, etc.
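For instance, a variant of the statement above that also declares the quoting character and line terminator (values are illustrative):

LOAD DATA INFILE 'path/to/industries.csv'
INTO TABLE `industries`
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES (`nogaCode`, `title`);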
I would do the following:
Create a temporary or MEMORY table on the server
Copy the CSV file to the server
Use the LOAD DATA INFILE command
Run your comparison (sketched below)
There is no way to have the CSV file on the client and the table on the server and be able to compare the contents of both using only SQL.
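A rough sketch of those steps in SQL, with hypothetical table and column names (step 2, copying the file to the server, happens outside SQL):

-- Step 1: temporary table to hold the CSV contents
CREATE TEMPORARY TABLE csv_import (code VARCHAR(50));

-- Step 3: load the file (it must be readable by the MySQL server)
LOAD DATA INFILE '/path/on/server/records.csv'
INTO TABLE csv_import
FIELDS TERMINATED BY ','
IGNORE 1 LINES;

-- Step 4: compare against the existing table
SELECT t.*
FROM my_table t
WHERE t.code IN (SELECT code FROM csv_import);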
Short answer: no, you can't.
Long answer: you'll need to build the query locally, maybe with a script (Python/PHP), or just upload the CSV into a table and do a JOIN query (or just WHERE x IN (SELECT y FROM mytmpTABLE ...)).
For anyone coming to this question now, there is a new tool that I used: Write SQL on CSV file