Unfortunately I have had issues with my storage and was forced to reacquire data. However, this data came in many .csv files, and I don't know how to import them without doing it one by one. I would like to load the 10,000+ .csv files into a single table, and I would like help scripting all of the imports at once.
All of the files have the same schema:
'Symbol' (varchar(15))
'Date' (Date)
'Open' (Float)
'High' (Float)
'Low' (Float)
'Close' (Float)
'Volume' (Int)
Also: all of the files will have the same naming structure:
XXXXXX_YYYYMMDD
(XXXXXX is the name of the market; I have 7 unique names)
Create Table [investment data 1].dbo.AA
(
Symbol varchar(15),
[Date] Date,
[Open] Float,
High Float,
Low Float,
[Close] Float,
Volume Int
)
At this point I do not know how to write a loop that will look at all of the files in the "Investment Data" folder; the code below is my sample for a single .csv file. If there is a better way than "bulk insert", I will modify the statement below.
bulk insert [investment data 1].dbo.AA
from 'R:\Investment Data\NASDAQ_20090626.csv'
with
(
firstrow=2
,rowterminator = '\n'
,fieldterminator = ','
)
Any help is appreciated; if I can be more clear please let me know. Thanks for your time.
Does what you wrote (for that one file) work?
Great.
Open a DOS prompt
Navigate to the folder with your 10,000 files
type DIR /b >c:\temp\files.txt
Now install a decent text editor, like Notepad++ (these instructions are for Notepad++)
Open c:\temp\files.txt in that editor
Open the find/replace dialog and place a tick next to "Extended (\n, \r...)" - this makes it match newlines and support newlines in replacements
Put this in Find: \r\n
Put this in Replace: ' with(firstrow=2,rowterminator = '\\n',fieldterminator = ',');\r\nbulk insert [investment data 1].dbo.AA from 'R:\Investment Data\
This will make your list of files that used to look like this:
a.txt
b.txt
c.txt
d.txt
Look like this:
a.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\b.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\c.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\d.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\
Now just clean up the first and last lines so the whole thing is valid SQL, then paste it into SSMS and run it.
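For the four example files above, the cleaned-up script is nothing more than one BULK INSERT per file, along the lines of the sketch below (with your real list there will be 10,000+ such lines, and the .txt names are just the illustrative ones from this example):
-- Sketch of the cleaned-up generated script: prefix added to the first line,
-- dangling last line removed; file names are illustrative only.
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\a.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\b.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\c.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');
bulk insert [investment data 1].dbo.AA from 'R:\Investment Data\d.txt' with(firstrow=2,rowterminator = '\n',fieldterminator = ',');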
I have a CSV file, but when I try to import it into the database I can't, because my data gets shifted into other columns. I'm using SQL Server Management Studio. I have a lot of data. How can I import my file? This is my CSV file and this is the view after the import.
Rows that aren't loaded into the table because of invalid data or formatting can be handled using the ERRORFILE option of BULK INSERT (https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?redirectedfrom=&view=sql-server-ver16). Specify the error file name, and the rows that have errors will be written to that error file. The code should look like this:
BULK INSERT SchoolsTemp
FROM 'C:\CSVData\Schools.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ',', --CSV field delimiter
ROWTERMINATOR = '\n', --Use to shift the control to next row
ERRORFILE = 'C:\CSVDATA\SchoolsErrorRows.csv',
TABLOCK
)
I am using R to handle large datasets (the largest data frame is 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and restore them into a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow=10))
# Save as parquet file
write_parquet(test, tempfile(fileext = ".parquet"))
# Load main table
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = "/tmp/RtmpeJBgyB/file2b5f4764e153.parquet", memory = FALSE, overwrite = TRUE)
# Save into SQL table
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "schema", table = "table"),
                  value = test)
Is it possible to write a SQL table without loading parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow=10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)
#Upload table using bulk insert
dbExecute(connection,
          paste0("
            BULK INSERT [database].[schema].[table]
            FROM '", gsub('\\\\', '/', f), "'
            WITH (FORMAT = 'PARQUET');
          ")
)
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify Snappy compression within the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
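For reference, here is roughly the statement the paste0() call above assembles once f has been substituted. The file path is a hypothetical placeholder, and the same disclaimer applies to whether a parquet format option is accepted by your SQL Server version at all:
-- Hypothetical expansion of the assembled statement; the path is a placeholder
-- and FORMAT = 'PARQUET' may not be supported on your platform (see disclaimer).
BULK INSERT [database].[schema].[table]
FROM 'C:/Temp/test.parquet'
WITH (FORMAT = 'PARQUET');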
Now, the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you are likely looking for OPENROWSET combined with CREATE TABLE AS [select statement].
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
          paste0("CREATE TABLE MySqlTable
                  AS
                  SELECT *
                  FROM OPENROWSET(
                      BULK '", f, "' FORMAT = 'PARQUET'
                  ) WITH (
                      ", column_definition, "
                  );")
)
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to T-SQL data types is available on the T-SQL documentation page (note two very necessary translations: Date and POSIXlt are not present there). Once again, a disclaimer: my time in T-SQL did not get as far as BULK INSERT or similar.
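For concreteness, here is roughly the statement the second paste0() call would build once column_definition is filled in, using hypothetical column names (X1, X2 from the test data frame) and a hypothetical path; again, CREATE TABLE ... AS SELECT over OPENROWSET with FORMAT = 'PARQUET' is only available on some platforms (for example Azure Synapse), so treat it as an untested sketch:
-- Hypothetical expansion of the generated statement; the path and the
-- column definitions (X1, X2) are placeholders for illustration only.
CREATE TABLE MySqlTable
AS
SELECT *
FROM OPENROWSET(
    BULK 'C:/Temp/test.parquet' FORMAT = 'PARQUET'
) WITH (
    X1 float,
    X2 float
);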
I'm trying to export data from PostgreSQL to CSV.
First I created the query and tried exporting from pgAdmin via File -> Export to CSV. The CSV is wrong; for example, it contains:
The header: Field1;Field2;Field3;Field4
Now, the rows start out fine, except that the last field gets put on another line:
Example:
Data1;Data2;Data3;
Data4;
The problem is that I get an error when trying to import the data into another server.
The data comes from a view I created.
I also tried
COPY view(field1,field2...) TO 'C:\test.csv' DELIMITER ',' CSV HEADER;
It exports the same file.
I just want to export the data to another server.
Edit:
When trying to import the CSV I get the error:
ERROR : Extra data after the last expected column. Context Copy
actions, line 3: <<"Data1, data2 etc.">>
So the first line is the header, the second line is the first row with data minus the last field, which is on the 3rd line, alone.
In order to export the file to another server, you have two options:
Creating a shared folder between the two servers, so that the database also has access to this directory.
COPY (SELECT field1,field2 FROM your_table) TO '[shared directory]' DELIMITER ',' CSV HEADER;
Triggering the export from the target server using the STDOUT of COPY. Using psql, you can achieve this by running the following command:
psql yourdb -c "COPY (SELECT * FROM your_table) TO STDOUT" > output.csv
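For completeness, the matching import on the target server is the mirror-image COPY ... FROM. A minimal sketch, assuming a table with the same columns already exists on the target and the exported file is reachable from it (the path and column names are placeholders):
-- Sketch of the import side; run on the target server, which must be able
-- to read the file (otherwise use psql's \copy from the client instead).
COPY your_table (field1, field2)
FROM '[shared directory]/export.csv'
DELIMITER ',' CSV HEADER;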
EDIT: Addressing the issue of fields containing line feeds (\n)
In case you want to get rid of the line feeds, use the REPLACE function.
Example:
SELECT E'foo\nbar';
?column?
----------
foo +
bar
(1 row)
Removing the line feed:
SELECT REPLACE(E'foo\nbaar',E'\n','');
replace
---------
foobaar
(1 row)
So your COPY should look like this:
COPY (SELECT field1,REPLACE(field2,E'\n','') AS field2 FROM your_table) TO '[shared directory]' DELIMITER ',' CSV HEADER;
The export procedure described above is OK, e.g.:
t=# create table so(i int, t text);
CREATE TABLE
t=# insert into so select 1,chr(10)||'aaa';
INSERT 0 1
t=# copy so to stdout csv header;
i,t
1,"
aaa"
t=# create table so1(i int, t text);
CREATE TABLE
t=# copy so1 from stdout csv header;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> i,t
1,"
aaa"
>> >> >> \.
COPY 1
t=# select * from so1;
i | t
---+-----
1 | +
| aaa
(1 row)
I currently have a problem loading my .CSV file into an Oracle SQL database.
I am using SQLLDR.
I have an Excel file that has a lot of stock information in it; I will give you a sample of what it looks like:
Tdate Symbol Open High Low Close Volume
19500103 SPX 16.66 16.66 16.66 16.66 1260000
19500104 SPX 16.85 16.85 16.85 16.85 1890000
19500105 SPX 16.93 16.93 16.93 16.93 2550000
Tdate, symbol, open, high, low, close, and volume aren't in the .CSV file; I just put them there because my database table will hold those values under those names.
I created my table in SQL Developer:
create table cts ( tdate date, symbol varchar(20), open numeric(18,8), high numeric(18,8), low numeric(18,8), close numeric(18,8), volume int );
So then I opened up a Notepad file and created this:
LOAD Data infile c:\cts.dump.csv
into table CTS
fields terminated by "," optionally enclosed by '"'
( tdate, symbol, open, high, low , close, volume)
I saved it as loaderval.ctl in the folder c:\data.
I then opened up my cmd window and typed:
sqlldr username/password control=c:\data\loaderval.ctl
I receive back that 64 lines have been committed, which is impossible since the file has tons and tons of data. I then check my database and the table is empty.
I also receive a .bad file, and the .bad file has the records from the first couple of rows of the Excel sheet:
( 19500103,SPX,16.66,16.66,16.66,16.66,1260000
19500104,SPX,16.85,16.85,16.85,16.85,1890000
19500105,SPX,16.93,16.93,16.93,16.93,2550000
19500106,SPX,16.98,16.98,16.98,16.98,2010000
19500109,SPX,17.08,17.08,17.08,17.08,2520000
19500110,SPX,17.03,17.03,17.03,17.03,2160000
19500111,SPX,17.09,17.09,17.09,17.09,2630000
19500112,SPX,16.76,16.76,16.76,16.76,2970000
19500113,SPX,16.67,16.67,16.67,16.67,3330000
19500116,SPX,16.72,16.72,16.72,16.72,1460000
19500117,SPX,16.86,16.86,16.86,16.86,1790000
19500118,SPX,16.85,16.85,16.85,16.85,1570000
19500119,SPX,16.87,16.87,16.87,16.87,1170000
19500120,SPX,16.90,16.90,16.90,16.90,1440000
19500123,SPX,16.92,16.92,16.92,16.92,1340000
19500124,SPX,16.86,16.86,16.86,16.86,1250000
19500125,SPX,16.74,16.74,16.74,16.74,1700000
19500126,SPX,16.73,16.73,16.73,16.73,1150000
19500127,SPX,16.82,16.82,16.82,16.82,1250000
19500130,SPX,17.02,17.02,17.02,17.02,1640000
19500131,SPX,17.05,17.05,17.05,17.05,1690000
19500201)
Please help :)
Looking at the code, it seems that the date column may be the culprit here. You can check the link below for how to handle dates with SQL*Loader:
https://oracle-base.com/articles/12c/sql-loader-enhancements-12cr1
LOAD DATA
INFILE c:\cts.dump.csv
INTO TABLE CTS
FIELDS CSV WITH EMBEDDED
(tdate DATE "YYYYMMDD" ":tdate",
symbol,
open,
high,
low,
close,
volume)
$ sqlldr userid=userid/passwd@connect_string control=test.ctl
Imagine that you have the following data in a CSV:
Name, Age, Gender
Jake, 40, M
Bill, 17, M
Suzie, 21, F
Is it possible to exclude the Age variable when importing the above CSV? My current approach is to simply use the cut shell command.
Update
iluvcapra has a great answer for small CSVs. However, for very large CSVs this approach is inefficient. For example, imagine that the above CSV was very large, 30 GB let's say. Loading all that Age data only to immediately remove it is a waste of time. With this in mind, is there a more efficient way to load subsets of columns into sqlite databases?
I suspect that the best option is to use the shell command cut to cull out unnecessary columns. Is that intuition correct? Is it common to use shell commands to pre-process CSV files into more sqlite-friendly versions?
Create a temporary table with the age column, and then use an INSERT... SELECT to move the data from the temporary table into your main one:
CREATE TEMP TABLE _csv_import (name text, age integer, gender text);
.separator ","
.import file.csv _csv_import
INSERT INTO names_genders (name, gender) SELECT name, gender
FROM _csv_import WHERE 1;
DROP TABLE _csv_import;
EDIT: Updating into a view with a phantom age column:
CREATE VIEW names_ages_genders AS
SELECT name, 0 AS age, gender FROM names_genders;
CREATE TRIGGER lose_age
INSTEAD OF INSERT ON names_ages_genders
BEGIN
INSERT INTO names_genders (name, gender)
VALUES (NEW.name, NEW.gender);
END;
This will create a view called names_ages_genders that will say everybody is zero years old, and will silently drop the age field from any INSERT statement called on it. Not tested! (I'm actually not sure .import can import into views.)
If you wish to avoid reading more than necessary into SQLite, and if you wish to avoid the hazards of using standard text-processing tools (such as cut and awk) on CSV files, one possibility would be to use your favorite csv2tsv converter (*) along the following lines:
csv2tsv input.csv | cut -f 1,3- > tmp.tsv
cat << EOF | sqlite3 demo.db
drop table if exists demo;
.mode csv
.separator "\t"
.import tmp.tsv demo
EOF
/bin/rm tmp.tsv
Note, though, that if input.csv has literal tabs or newlines or escaped double-quotes, then
whether the above will have the desired effect will depend on the csv2tsv that is used.
(*) csv2tsv
In case you don't have ready access to a suitable csv2tsv converter, here is a simple python3 script that does the job, handling embedded literal newlines, tabs, and the two-character sequences "\t" and "\n" in the CSV:
#!/usr/bin/env python3
# Take care of embedded tabs and newlines in the CSV
import csv, re, sys
if len(sys.argv) > 2 or (len(sys.argv) > 1 and sys.argv[1] == '--help'):
    sys.exit("Usage: " + sys.argv[0] + " [input.csv [output.tsv]]")

csv.field_size_limit(sys.maxsize)

if len(sys.argv) == 3:
    out = open(sys.argv[2], 'w+')
else:
    out = sys.stdout

if len(sys.argv) == 1:
    csvfile = sys.stdin
else:
    csvfile = open(sys.argv[1])

# Escape any pre-existing "\t" / "\n" sequences, then encode literal tabs and newlines
def edit(s):
    s = re.sub(r'\\t', r'\\\\t', s)
    s = re.sub(r'\\n', r'\\\\n', s)
    s = re.sub('\t', r'\\t', s)
    return re.sub('\n', r'\\n', s)

reader = csv.reader(csvfile, dialect='excel')
for row in reader:
    line = ""
    for s in row:
        s = edit(s)
        if len(line) == 0:
            line = s
        else:
            line += '\t' + s
    print(line, file=out)