I have an existing table (NameList) into which I would like to load the contents of multiple csv files (fileA.csv, fileB.csv ...). The columns of the table are identical to those of the csv files, except that for each row I also want to record the id of the csv file it came from. The id would be taken from another table that holds the properties of each file.
The table with the list of files would look like this:
CREATE TABLE files
(
id serial,
fileName varchar(128),
path varchar(256),
PRIMARY KEY (id)
);
The table to insert the csv contents into would look like:
CREATE TABLE NameList
(
FirstName varchar(40),
LastName varchar(40),
SourceFile_ID int,
FOREIGN KEY (SourceFile_ID) REFERENCES files(id)
);
The csv files would look as follows:
Name of file:
fileA.csv
Contents:
FirstName,LastName
John,Smith
.
.
.
The closest thing I have found so far is this:
Add extra column while importing csv data in table in SQL server table
However, that answer suggests using a default value on the additional column, which does not solve my problem, since I need a different value for each file I add.
You could insert the data into a temporary table (https://www.postgresqltutorial.com/postgresql-temporary-table), fill in the source-file id, then move the data to the main table.
This avoids problems when two CSVs are loaded at once, because each load uses its own temp table (as long as the inserts run in two different db sessions). Even within a single session, you could use a different temp-table name for each CSV.
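A minimal sketch of that approach, assuming psql and a hypothetical temp-table name "staging" (here the file id is attached during the final INSERT rather than a separate UPDATE):

CREATE TEMP TABLE staging
(
FirstName varchar(40),
LastName varchar(40)
);

-- \copy reads fileA.csv from the client machine; HEADER skips the header row
\copy staging FROM 'fileA.csv' WITH (FORMAT csv, HEADER)

INSERT INTO NameList (FirstName, LastName, SourceFile_ID)
SELECT FirstName, LastName,
       (SELECT id FROM files WHERE fileName = 'fileA.csv')
FROM staging;

DROP TABLE staging;

Repeat with a different file name (or temp-table name) for fileB.csv and the rest.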
I'm new to PostgreSQL and am looking for some guidance and best practices.
I have created a table by importing data from a csv file. I then altered the table by creating multiple generated columns like this:
ALTER TABLE master
ADD office VARCHAR(50)
GENERATED ALWAYS AS (CASE WHEN LEFT(location,4)='Chic' THEN 'CHI'
ELSE LEFT(location,strpos(location,'_')-1) END) STORED;
But when I try to import new data into the table I get the following error:
ERROR: column "office" is a generated column
DETAIL: Generated columns cannot be used in COPY.
My goal is to be able to import new data each day to the table and have the generated columns automatically populate in order to transform the data as I would like. How can I do so?
CREATE TEMP TABLE master (location VARCHAR);
ALTER TABLE master
ADD office VARCHAR
GENERATED ALWAYS AS (
CASE
WHEN LEFT(location, 4) = 'Chic' THEN 'CHI'
ELSE LEFT(location, strpos(location, '_') - 1)
END
) STORED;
--INSERT INTO master (location) VALUES ('Chicago');
--INSERT INTO master (location) VALUES ('New_York');
COPY master (location) FROM $$d:\cities.csv$$ CSV;
SELECT * FROM master;
Is this the structure and the behaviour you are expecting? If not, please provide more details regarding your table structure, the data you are importing, and your import commands.
Also, when you try to import the csv file, the columns may not be linked properly, or the delimiter may not be set correctly. Try to specify each column in the exact order that they appear in your csv file.
https://www.postgresql.org/docs/12/sql-copy.html
Note: d:\cities.csv contains:
Chicago
New_York
EDIT:
If column positions are mixed up between the table and the csv, the following approach may come in handy:
1. create temporary table tmp (csv_column1 <data_type>, csv_column_2 <data_type>, ...); (including ALL csv columns)
2. copy tmp from '/path/to/file.csv';
3. insert into master (location, other_info, ...) select csv_column_3 as location, csv_column_7 as other_info, ... from tmp;
Importing data using an intermediate table may slow things down a little, but gives you a lot of flexibility.
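For example, applied to the master table above (the tmp column names are hypothetical placeholders for whatever your CSV actually contains):

create temporary table tmp (csv_column_1 text, csv_column_2 text, csv_column_3 text);

copy tmp from '/path/to/file.csv' csv;

-- list only the non-generated columns; the generated "office" column fills itself in
insert into master (location)
select csv_column_3 from tmp;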
I was getting the same error when importing a csv into PG. I found that even though my column was generated, I still had to have it in the imported data; I just left its values empty. The import worked fine once the column name was in the file and mapped to my DB column name.
I have some data in a general table called ImportH. The data has been imported from a csv file. I have also created two tables, Media and Host (each one has its respective ID). These tables are related by a third table called HostMedia.
Each Host can have (or not) different types of Media (facebook, email, phone...).
I'll provide some images of the tables (screenshots of ImportH, Host, and Media, not reproduced here).
How can I insert the data from the other tables into table HostMedia? This table looks like this:
create table HostMedia (
host_id int references Host (host_id),
id_media int references Media (id_verification),
primary key (host_id, id_media)
);
I have tried this:
insert into HostMedia (host_id, id_media)
select Host.host_id, Media.id_verification
from Host, Media;
But this produces the Cartesian product, assigning every host all the rows of the Media table. What's the correct way?
The "media" column in your "ImportH" table looks almost like a valid JSON, so this might work:
INSERT INTO HostMedia (host_id, id_media)
SELECT i.host_id, m.id_verification
FROM (
SELECT host_id,
json_array_elements_text(replace(media,'''','"')::json) AS media_name
FROM ImportH
) AS i
JOIN Media AS m ON m.media = i.media_name;
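If it helps to see what the inner subquery is doing, here is a standalone sketch with a made-up media value, assuming the column holds Python-style lists such as ['facebook', 'email']:

SELECT json_array_elements_text(
           replace($$['facebook', 'email']$$, '''', '"')::json
       ) AS media_name;

-- returns two rows: facebook, email

The replace() turns the single quotes into double quotes so the value becomes valid JSON, and json_array_elements_text() unnests the array into one row per media name, ready to join against Media.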
Notes: it would be easier if you
1. provided text data instead of screenshots
2. used logical column names
I have a csv file with 2 columns: one column is an id and the other a name. There is no csv header.
"5138334","Here's Where the Story Ends"
"36615796","W31CZ-D"
"10283436","Elliant"
"8773661","Dobos torte"
"33139146","RTP Informação"
"2846867","The Legend of the Rollerblade Seven"
"36001757","Prescription Monitoring Program"
"2574520","Greg Wells"
"4498288","Catonsville Community College"
"15429275","Nozières"
"31736463","Bályok"
How do I insert this into a Postgres table?
I have tried creating a table:
create table csvtable (
    id bigserial not null primary key,
    csv_id int not null,
    csv_title varchar(100) not null
);
and other variations without the id column (I tried making my own id in case the existing id wasn't unique),
and I have tried inserting the data through the copy command.
copy csvtable from 'file.csv' with csv;
and other variations with the delimiter option, etc., but no luck.
You need to specify which columns you are copying:
\copy csvtable(csv_id, csv_title) FROM 'data.csv' WITH (FORMAT csv)
Note that this uses \copy from psql, which reads the file from the client machine, rather than server-side COPY (which requires the file to be accessible on the database server).
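Because the command lists only csv_id and csv_title, the bigserial id column is filled from its sequence automatically. A quick way to verify, assuming data.csv is the file shown above:

\copy csvtable(csv_id, csv_title) FROM 'data.csv' WITH (FORMAT csv)

-- id is generated; csv_id and csv_title come from the file
select * from csvtable limit 3;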
I have the following file on HDFS (two comma-separated columns, a date string and a session count):
"2016-08-20",86
"2016-08-21",78
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive, but I cannot see any data in it.
Since it's an external table, shouldn't the data be picked up automatically?
Your file's columns should be in this sequence:
int,string
but your file's contents are in this sequence:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should then work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a csv file that matches the structure of the table, the data is loaded automatically and you can immediately use select queries to see it.
Solved!
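A follow-up note: if more dated subfolders land under /flumania/google_analytics later, you don't need one ALTER TABLE per partition. MSCK REPAIR TABLE scans the table location for partition-style directories and registers any that are missing:

MSCK REPAIR TABLE google_analytics;

-- the new partition is now queryable
SELECT * FROM google_analytics WHERE date_string = '2016-09-06';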
I'm trying to create a bucket in hive by using following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully; loading data into the table also succeeds, and all the data is shown by select * from emp.
However, on HDFS only a single file with all the data is created under the table directory. That is, there is no folder for each specific country's records.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the statement above I have specified 2 buckets; similarly, you can specify any number you want.
But you are not done yet!
While loading data into the table, you also have to give Hive the following hint:
set hive.enforce.bucketing = true;
That should do it.
After this, you should be able to see that the number of files created under the table directory is the same as the number of buckets specified in the DDL statement.
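As a sketch, assuming a plain unbucketed staging table emp_stage that already holds the raw rows (bucketing only takes effect when rows are written through an INSERT ... SELECT; a plain LOAD DATA just moves files and does not hash rows into buckets):

set hive.enforce.bucketing = true;

-- rows are hashed on country into the bucket files of emp
insert overwrite table emp
select id, name, country from emp_stage;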
Bucketing doesn't create HDFS folders; if you want a separate folder to be created per country, you should PARTITION instead (see the sketch below).
Please go through Hive partitioning and bucketing in detail.
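For completeness, a minimal sketch of the partitioned alternative, in case per-country folders are what you actually need (emp_part is a hypothetical table name; the two set commands are Hive's standard dynamic-partitioning properties):

create table emp_part( id int, name string)
partitioned by (country string)
row format delimited
fields terminated by ','
stored as textfile;

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- each distinct country becomes a country=<value> subfolder under the table directory
insert overwrite table emp_part partition (country)
select id, name, country from emp;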