In MySQL I've used LOAD DATA LOCAL INFILE which works fine. At the end I get a message like:
Records: 460377 Deleted: 0 Skipped: 145280 Warnings: 0
How can I view the line numbers of the records that were skipped? SHOW WARNINGS doesn't work:
mysql> show warnings;
Empty set (0.00 sec)
If there were no warnings but some rows were skipped, it may mean that the primary key was duplicated for the skipped rows.
The easiest way to find the duplicates is to open the local file in Excel and run a duplicate removal on the primary key column to see if there are any.
You could create a temp table without the primary key so that it allows duplicates, and then load the data into it.
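A rough sketch of that idea in MySQL (the table and file names here are placeholders, not taken from the question):

CREATE TABLE my_table_tmp LIKE my_table;   -- copy the structure of the target table
ALTER TABLE my_table_tmp DROP PRIMARY KEY; -- allow duplicate keys (drop AUTO_INCREMENT from the key column first if it uses it)
LOAD DATA LOCAL INFILE 'data.txt' INTO TABLE my_table_tmp;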
Construct a SQL statement like
select count(column_with_duplicates) AS num_duplicates,column_with_duplicates
from table
group by column_with_duplicates
having num_duplicates > 1;
This will show you the values that appear more than once. Another way is to dump out the rows that were actually inserted into the table and run a file difference command against the original to see which ones weren't included.
For anyone stumbling onto this:
Another option would be to do a SELECT INTO and diff the two files. For example:
LOAD DATA LOCAL INFILE 'data.txt' INTO TABLE my_table FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\r' IGNORE 1 LINES (title, desc, is_viewable);
SELECT title, desc, is_viewable INTO OUTFILE 'data_rows.txt' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r' FROM my_table;
Then run FileMerge (on Mac OS X) on data.txt and data_rows.txt to see the differences. If you get an access denied error when doing the SELECT INTO, make sure you run:
GRANT FILE ON *.* TO 'mysql_user'@'localhost';
FLUSH PRIVILEGES;
as the root user in the mysql client.
Records are skipped when any database constraint is not met. Check for common ones like:
Primary key duplication
Unique key condition
Partition condition
I use the bash command line to find the duplicate rows in the CSV file:
awk -F, '{print $1$2}' /my/source/file.csv | sort -n | uniq -c | grep -v "^ *1 "
when the first two columns form the primary key.
I have a table T in Oracle DB and I need to load (replace) the data on this table.
Replacing the data can be a long process and there are other processes that can use this table during my loading process (other processes can be running or start after I begin my loading process).
The solution is to load the data into a temporary table T_TMP and, when the loading process finishes:
1. Rename table T to T_REMOVE.
2. Rename table T_TMP to T.
During the renaming steps above, other processes/jobs can use table T, so they might receive invalid data.
Therefore, I need to rename the tables atomically.
In MySQL, the atomic statement is:
RENAME TABLE tbl_name TO new_tbl_name
[, tbl_name2 TO new_tbl_name2] ...
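For example, the swap described in steps 1 and 2 above is a single atomic statement in MySQL:
RENAME TABLE T TO T_REMOVE,
             T_TMP TO T;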
The question is what is the parallel atomic statement in Oracle?
It was suggested to use a transaction: delete from t; insert into t ...; commit;
But the loading process isn't a simple set of insert statements; it is a script that I have to run:
sqlldr user/pass control=scopes_group.ctl direct=true
The file "scopes_group.ctl":
LOAD DATA
INFILE 'scopes_group.dat'
BADFILE 'scopes_group.bad'
DISCARDFILE 'scopes_group.dsc'
TRUNCATE
INTO TABLE T
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(group_id,
 scopes char(4000) "replace(:scopes,';',',')",
 updated SYSDATE)
Thank you!
Mike
I have data stored in S3 like:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:
CREATE EXTERNAL TABLE search_input(
col1 STRING,
col2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However, Hive doesn't recognize any data; any queries I run return 0 results. If I instead just grab one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col1 STRING,
col2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query data just fine.
Why doesn't Hive recognize the nested directories with the "date=date_str" partition?
Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?
In order to get this to work I had to do two things:
1. Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
2. For some reason it would still not recognize my partitions, so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
You can use:
SHOW PARTITIONS table;
to check and see that they've been recovered.
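Note that ALTER TABLE ... RECOVER PARTITIONS is the Amazon EMR flavor of Hive; on a stock Hive build the equivalent statement should be:
MSCK REPAIR TABLE search_input;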
I faced the same issue and realized that Hive does not have the partition metadata. We need to add that metadata using ALTER TABLE ... ADD PARTITION statements, which becomes tedious if you have a few hundred partitions and have to build the same statement with different values.
ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);
Once you run the above statement for all available partitions, you should see results in your Hive queries.
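For example, with the table and S3 layout from the question, each day would be registered with something along these lines:
ALTER TABLE search_input ADD PARTITION (date='20140701')
LOCATION 's3n://bucket/date=20140701/';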
I would like to import data into my PostgreSQL table.
I have a .csv file that is formatted like this:
1; John Blake
2; Roberto Young
3;Mark Palmer
Is there any solution to strip the leading whitespace where it exists?
I used the following command:
\copy users from 'users.csv' using delimiters E';'
And it keeps the whitespace.
COPY to a temporary staging table and INSERT into the target table from there, trimming the text column.
CREATE TEMP TABLE tmp_x AS
SELECT * FROM users LIMIT 0; -- empty temp table with structure of target
\copy tmp_x FROM '/absolute/path/to/file' delimiters E';'; -- psql command (!)
INSERT INTO users
(usr_id, usr, ...) -- list columns
SELECT usr_id, ltrim(usr), ...
FROM tmp_x;
DROP TABLE tmp_x; -- optional; is destroyed at end of session automatically
ltrim() only trims space from the left of the string.
This sequence of actions performs better than updating rows in the table after COPY, which takes longer and produces dead rows. Also, only the newly imported rows are manipulated this way.
Related answer:
Delete rows of a table specified in a text file in Postgres
You won't be able to use COPY alone to do that.
You can use an UPDATE coupled with trim:
UPDATE table SET column = trim(from column)
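Applied to the users table from the question (the column name usr is borrowed from the other answer, so treat it as an assumption), that might look like:
UPDATE users SET usr = ltrim(usr); -- ltrim() removes only the leading whitespace the question asks about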
Or use a script to clean the data before bulk-inserting it into the DB.
I want to import data in the form of CSV files into a table (using Oracle SQL Developer). I have a hundred such files, and each has about 50 columns.
From the wiki of SQL*Loader (http://www.orafaq.com/wiki/SQL*Loader_FAQ)
load data
infile 'c:\data\mydata.csv'
into table emp
fields terminated by "," optionally enclosed by '"'
( empno, empname, sal, deptno ) -- these are the column headers
What I don't want to do is list all the column headers. I just want all the entries in the CSV file to be assigned to the columns of the table in the order in which they appear.
Moreover, after all this, I want to automate it for all 100 files.
You should list the columns (and optionally their types) so that the values in your CSV file can be assigned to each column. You need to do this because the order of the columns in the table in your Oracle database is not known to the script.
After you write the columns in the order they appear in your csv files, you can automate this script for all of your files by typing:
infile *.csv
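If the wildcard is not accepted by your version of SQL*Loader, a fallback (just a sketch; the file names are placeholders) is to list several INFILE clauses in one control file:
load data
infile 'c:\data\mydata1.csv'
infile 'c:\data\mydata2.csv'
into table emp
fields terminated by "," optionally enclosed by '"'
( empno, empname, sal, deptno )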
You can try the Oracle CSV loader. It automatically creates the table and the control file based on the CSV content and loads the CSV into an Oracle table using SQL*Loader.
An alternative to sqlldr that does what you are looking for is the LOAD command in SQLcl. It simply matches the header row in the CSV to the table and loads it. However, it is not as performant and does not offer as much control as sqlldr.
LOAD [schema.]table_name[#db_link] file_name
Here's the full help for it.
sql klrice/klrice
...
KLRICE#xe>help load
LOAD
-----
Loads a comma separated value (csv) file into a table.
The first row of the file must be a header row. The columns in the header row must match the columns defined on the table.
The columns must be delimited by a comma and may optionally be enclosed in double quotes.
Lines can be terminated with standard line terminators for windows, unix or mac.
File must be encoded UTF8.
The load is processed with 50 rows per batch.
If AUTOCOMMIT is set in SQLCL, a commit is done every 10 batches.
The load is terminated if more than 50 errors are found.
LOAD [schema.]table_name[#db_link] file_name
KLRICE#xe>
Example from a git repo I have at https://github.com/krisrice/maxmind-oracledb
SQL> drop table geo_lite_asn;
Table GEO_LITE_ASN dropped.
SQL> create table geo_lite_asn (
2 "network" varchar2(32),
3 "autonomous_system_number" number,
4 "autonomous_system_organization" varchar2(200))
5 /
Table GEO_LITE_ASN created.
SQL> load geo_lite_asn GeoLite2-ASN-CSV_20180130/GeoLite2-ASN-Blocks-IPv4.csv
--Number of rows processed: 397,040
--Number of rows in error: 0
0 - SUCCESS: Load processed without errors
SQL>
When using LOAD DATA INFILE, is there a way to either flag a duplicate row, or dump any/all duplicates into a separate table?
From the LOAD DATA INFILE documentation:
The REPLACE and IGNORE keywords control handling of input rows that duplicate existing rows on unique key values:
If you specify REPLACE, input rows replace existing rows. In other words, rows that have the same value for a primary key or unique index as an existing row. See Section 12.2.7, “REPLACE Syntax”.
If you specify IGNORE, input rows that duplicate an existing row on a unique key value are skipped. If you do not specify either option, the behavior depends on whether the LOCAL keyword is specified. Without LOCAL, an error occurs when a duplicate key value is found, and the rest of the text file is ignored. With LOCAL, the default behavior is the same as if IGNORE is specified; this is because the server has no way to stop transmission of the file in the middle of the operation.
Effectively, there's no way to redirect the duplicate records to a different table. You'd have to load them all in, and then create another table to hold the non-duplicated records.
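A rough sketch of that two-table approach (table, column and key names are placeholders):
CREATE TABLE deduped LIKE staging;                   -- staging was loaded without a unique key, so every input row was kept
ALTER TABLE deduped ADD UNIQUE KEY uk_key (key_col);
INSERT IGNORE INTO deduped SELECT * FROM staging;    -- rows that duplicate key_col are silently skipped here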
It looks as if there actually is something you can do when it comes to duplicate rows for LOAD DATA calls. However, the approach that I've found isn't perfect: it acts more as a log for all deletes on a table, instead of just for LOAD DATA calls. Here's my approach:
Table test:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
text VARCHAR(255) DEFAULT NULL
);
Table test_log:
CREATE TABLE test_log (
id INTEGER, -- not primary key, we want to accept duplicate rows
text VARCHAR(255) DEFAULT NULL,
time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Trigger del_chk:
drop trigger if exists del_chk;
delimiter //
CREATE TRIGGER del_chk AFTER DELETE ON test
FOR EACH ROW
BEGIN
INSERT INTO test_log(id,text) values(OLD.id,OLD.text);
END;//
delimiter ;
Test import (/home/user/test.csv):
1,asdf
2,jkl
3,qwer
1,tyui
1,zxcv
2,bnm
Query:
LOAD DATA INFILE '/home/user/test.csv'
REPLACE INTO TABLE test
FIELDS
TERMINATED BY ','
LINES
TERMINATED BY '\n' (id,text);
Running the above query will result in 1,asdf, 1,tyui, and 2,jkl being added to the log table. Based on a timestamp, it could be possible to associate the rows with a particular LOAD DATA statement.
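For example, if you note the time just before running the load, a query along these lines (the timestamp is a placeholder) pulls out the rows that a particular LOAD DATA replaced:
SELECT id, text
FROM test_log
WHERE time >= '2018-01-30 12:00:00'; -- placeholder: the moment the LOAD DATA statement started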