concatenate text files and import them into a SQLite DB - sql

Let us say I have thousands of comma-separated text files with 1050 columns each (no header). Is there a way to concatenate and import all the text files into one table in one SQLite database? (Ideally I'd use R and sqldf to communicate with SQLite.)
I.e.,
The files are called table1.txt, table2.txt, table3.txt, and so on; they have different numbers of rows, but the same column types, and different unique IDs in the ID column (the first column of each file).
table1.txt
id1,20.3,1.2,3.4
id10,2.1,5.2,9.3
id21,20.5,1.2,8.4
table2.txt
id2,20.3,1.2,3.4
id92,2.1,5.2,9.3
table3.txt
id3,1.3,2.2,5.4
id30,9.1,4.4,9.3
The real example is pretty much the same but with more columns and more rows. As you can see, the first column in each file holds a unique ID.
Now I'd like my new table, mysupertable, in the database super.db to look like this (also uniquely indexed):
super.db - name of the DB
mysupertable - name of the table in the DB
myids,v1,v2,v3
id1,20.3,1.2,3.4
id10,2.1,5.2,9.3
id21,20.5,1.2,8.4
id2,20.3,1.2,3.4
id92,2.1,5.2,9.3
id3,1.3,2.2,5.4
id30,9.1,4.4,9.3
For reference, I am using SQLite3, and I am looking for a SQL command that I can run in the background without logging interactively into the sqlite3 interpreter, i.e., something like IMPORT bla INTO ...
In Unix I could try:
cat *.txt > allmyfiles.txt
and then a .sql file,
CREATE TABLE test (myids varchar(255), v1 float, v2 float, v3 float);
.separator ,
.import allmyfiles.txt test
But this approach does not work from R: I am using the sqldf library and dbGetQuery(db, sql), and I have no idea how to build such a command string in R without getting an error.
P.S. I asked a similar question about appending tables from a DB, but this time I need to append/import text files, not tables from a DB.

If you are using SQLite database files anyway, you might want to consider working with RSQLite.
install.packages( "RSQLite" ) # will install package "DBI"
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "super.db" )
You can still use the Unix command from within R, which should be faster than any loop in R, via the system() command:
system( "cat *.txt > allmyfiles.txt" )
Provided that your allmyfiles.txt has a consistent format, you can import it into R as a data.frame:
allMyFiles <- read.table( "allmyfiles.txt", header = FALSE, sep = "," )
and write it to your database, following @Martín Bel's advice, with something like
dbWriteTable( db, "mysupertable", allMyFiles, overwrite = TRUE, append = FALSE )
EDIT:
Or, if you don't want to route your data through R, you can again resort to the system() command. This may get you started:
You have a file called allmyfiles.txt with the data you want to get into SQLite. Create a file called table.sql with this content (obviously the structure must match):
CREATE TABLE mysupertable (myids varchar(255), v1 float, v2 float, v3 float);
.separator ,
.import allmyfiles.txt mysupertable
and call it from R with
system( "sqlite3 super.db < table.sql" )
That should avoid routing the data through R but still do all the work from within R.

Take a look at termsql:
https://gitorious.org/termsql/pages/Home
cat *.txt | termsql -d ',' -t mysupertable -c 'myids,v1,v2,v3' -o mynew.db
This should do the job.
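For completeness, the whole import can also be scripted without R, the sqlite3 shell or external tools. The following is a minimal Python sketch using only the standard library; the table*.txt file pattern, the database name super.db and the 4-column layout are taken from the example above (the real case would need the full 1050-column definition and a matching number of ? placeholders), so treat it as a starting point rather than a drop-in solution.
import csv
import glob
import sqlite3

con = sqlite3.connect("super.db")
# PRIMARY KEY on myids gives the unique index asked for above
con.execute("""CREATE TABLE IF NOT EXISTS mysupertable
               (myids TEXT PRIMARY KEY, v1 REAL, v2 REAL, v3 REAL)""")

for path in sorted(glob.glob("table*.txt")):
    with open(path, newline="") as f:
        # csv.reader yields one list per line; SQLite's REAL affinity converts the numeric strings
        con.executemany("INSERT INTO mysupertable VALUES (?, ?, ?, ?)", csv.reader(f))

con.commit()
con.close()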

Related

Inserting huge batch of data from multiple csv files into distinct tables with Postgresql

I have a folder with multiple CSV files; they all have the same column attributes.
My goal is to turn every CSV file into a distinct PostgreSQL table named after the file, but as there are 1k+ of them it would be a pretty long process to do manually.
I've been searching for a solution all day, but the closest I've come to solving the problem is this code:
for filename in select pg_ls_dir2 ('/directory_name/') loop
if (filename ~ '.csv$') THEN create table filename as fn
copy '/fullpath/' || filename to table fn
end if;
END loop;
The logic behind this code is to select every filename inside the folder, create a table named after the filename and import the content into said table.
The issue is that I have no idea how to actually put that into practice; for instance, where should I execute this code, since neither for nor pg_ls_dir2 is a SQL instruction?
If you use DBeaver, a recently added feature fixes this exact issue. (On Windows) Right-click the "Tables" section inside your schema (not your target table!), then select "Import data"; you can select all the .csv files you want at the same time, creating a new table for each file as you mentioned.
Normally, I don't like giving the answer directly, but I think you will need to change a few things at least.
Based on the example from here, I prepared a small example using a bash script. Let's assume you are in the directory where your files are kept.
postgres#213b483d0f5c:/home$ ls -ltr
total 8
-rwxrwxrwx 1 root root 146 Jul 25 13:58 file1.csv
-rwxrwxrwx 1 root root 146 Jul 25 14:16 file2.csv
In the same directory you can run:
for i in `ls | grep csv`
do
export table_name=`echo $i | cut -d "." -f 1`;
psql -d test -c "CREATE TABLE $table_name(emp_id SERIAL, first_name VARCHAR(50), last_name VARCHAR(50), dob DATE, city VARCHAR(40), PRIMARY KEY(emp_id));";
psql -d test -c "\COPY $table_name(emp_id,first_name,last_name,dob,city) FROM './$i' DELIMITER ',' CSV HEADER;";
done
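If you prefer to drive the loop from Python rather than bash, here is a minimal sketch of the same idea. It assumes the psycopg2 driver, a database called test and the same hard-coded schema as the bash example above; with 1k+ files you would probably want to derive the column list from each file's header instead.
import glob
import os
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=test")  # connection details are placeholders
cur = conn.cursor()

for path in glob.glob("*.csv"):
    table = os.path.splitext(os.path.basename(path))[0]  # file1.csv -> table file1
    cur.execute(sql.SQL(
        "CREATE TABLE {} (emp_id SERIAL, first_name VARCHAR(50), last_name VARCHAR(50), "
        "dob DATE, city VARCHAR(40), PRIMARY KEY(emp_id))"
    ).format(sql.Identifier(table)))
    with open(path) as f:
        # client-side COPY, equivalent to psql's \COPY
        cur.copy_expert(sql.SQL(
            "COPY {} (emp_id, first_name, last_name, dob, city) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)"
        ).format(sql.Identifier(table)).as_string(conn), f)

conn.commit()
cur.close()
conn.close()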

BigQuery, is there a way to search the whole db for a string?

I have a datum that I need to find within the database, for example 'dsfsdfsads'. But, there are 100+ tables and views to search through. I have found numerous queries written for other databases that can find a specific string within the database. Example postings are below. However, I don't see the same for BigQuery. I found this question: Is it possible to do a full text search in all tables in BigQuery?, but this post feels incomplete and the 2 links provided in the answer do not answer my question.
Examples of other database queries capable of finding a specific string:
Find a string by searching all tables in SQL Server Management Studio 2008
Search all tables, all columns for a specific value SQL Server
How do I search an SQL Server database for a string?
I am not sure why it doesn't suit you to search through your database using a wildcard table like in the post you mentioned, because I have run this sample query against a public dataset and it works just fine:
SELECT *
FROM `bigquery-public-data.baseball.*` b
WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r'Cubs')
I guess it is because one of the limitations is that the wildcard table functionality does not support views.
Do you have a lot of them?
In that case you can use the wildcard only for your tables and filter out the views with _TABLE_SUFFIX or with a less general wildcard (it depends on the names of your views).
In general, with wildcard tables, using _TABLE_SUFFIX can greatly reduce the number of bytes scanned, which reduces the cost of running your queries. So use it also if you suspect some tables contain the string more often than others.
For the views (or the whole dataset), you could:
• Iterate by calling the BigQuery API using one of the client libraries, together with a multiprocessing module such as multiprocessing in Python.
• Iterate by calling the REST API from a bash script.
• Iterate by using the bq command from a bash script.
If you get stuck with the programmatic part, post a new question and add the link here.
EDIT:
Here are two examples for you (bash and Python). I tried them both and they work, but any comments to help improve them are of course welcome.
Python:
Install the google-cloud-bigquery package (the script below uses the standard-library multiprocessing module, so nothing else is needed):
pip install --upgrade google-cloud-bigquery
Create filename.py. Change YOUR_PROJECT_ID and YOUR_DATASET.
from google.cloud import bigquery
import multiprocessing

def search(dataset_id):
    """
    Lists and filters your dataset to keep only views
    """
    client = bigquery.Client()
    tables = client.list_tables(dataset_id)
    views = []
    for table in tables:
        if table.table_type == 'VIEW':
            views.append(table.table_id)
    return views

def query(dataset_id, view):
    """
    Searches for the string in your views and prints the first one it finds.
    You can change or remove 'LIMIT 1' if needed.
    """
    client = bigquery.Client()
    query_job = client.query(
        """
        SELECT *
        FROM {}.{} b
        WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
        LIMIT 1
        """.format(dataset_id, view)
    )
    results = query_job.result()  # Waits for job to complete.
    for row in results:
        print(row)

if __name__ == '__main__':
    # TODO: Set dataset_id to the ID of the dataset that contains the tables you are listing.
    dataset_id = 'YOUR_PROJECT_ID.YOUR_DATASET'
    views = search(dataset_id)
    processes = []
    for i in views:
        p = multiprocessing.Process(target=query, args=(dataset_id, i))
        p.start()
        processes.append(p)
    for process in processes:
        process.join()
Run python filename.py
Bash:
Install jq (json parser) and test it
sudo apt-get install jq
Test
echo '{ "name":"John", "age":31, "city":"New York" }' | jq .
Output:
{
"name": "John",
"age": 31,
"city": "New York"
}
Create filename.sh. Change YOUR_PROJECT_ID and YOUR_DATASET.
#!/bin/bash
FILES="bq ls --format prettyjson YOUR_DATASET"
RESULTS=$(eval $FILES)
DETAILS=$(echo "${RESULTS}" | jq -c '.[]')
for d in $DETAILS
do
ID=$(echo $d | jq -r .tableReference.tableId)
table_type=$(echo $d | jq -r '.type')
if [[ $table_type == "VIEW" ]]
then
bq query --use_legacy_sql=false \
'SELECT *
FROM
`YOUR_PROJECT_ID`.YOUR_DATASET.'$ID' b
WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
LIMIT 1'
fi
done
Run bash filename.sh

How to clean bad data from huge csv file

So I have a huge CSV file (assume 5 GB) and I want to insert the data into a table, but it returns an error that the length of the data is not the same.
I found that some rows have more columns than I want.
For example, the correct data has 8 columns but some rows have 9 (it can be human/system error).
I want to take only the 8-column data, but because the file is so huge, I cannot do it manually or by parsing it in Python.
Any recommendation of a way to do it?
I am using Linux, so any Linux command is also welcome.
In SQL I am using the COPY ... FROM ... CSV HEADER; command to import the CSV into the table.
You can use awk for this purpose. Assuming your field delimiter is a comma (,), this one-liner will do the work:
awk -F\, 'NF==8 {print}' input_file >output_file
A quick and dirty PHP solution as a single command line:
php -r '$f=fopen("a.csv","rb"); $g=fopen("b.csv","wb"); while ( $r=fgetcsv($f) ) { $r = array_slice($r,0,8); fputcsv($g,$r); }'
It reads file a.csv and writes b.csv.
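A streaming approach also works in Python despite the file size, because rows are handled one at a time and memory use stays flat. A minimal sketch, reusing the a.csv/b.csv names from the PHP example; it trims every row to 8 fields, and replacing the slice with a check like if len(row) != 8: continue would instead drop the bad rows, as the awk one-liner does.
import csv

with open("a.csv", newline="") as src, open("b.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(row[:8])  # keep only the first 8 fields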

PSQL lo_import in client side script

We have a simple SQL script we maintain that sets up the schema and populates a set of text/example values - so it's just create table, create table, insert into table ... and we run it with a simple shell script which calls psql.
One of our tables requires files - what I wanted to do was just have the files in the same directory as the script and do something like insert into repository (id, picture) values ('first', lo_import('first.jpg')),
but I get an error saying I must be superuser to use server-side lo_import. Is there any way I can achieve this - with just a .sql file and a bunch of image files, import them by running psql against the file?
Running as superuser is not an option.
Using psql, you could write a shell script like
oid=`psql -At -c "\lo_import 'first.jpg'" | tail -1 | cut -d " " -f 2`
psql -Aqt -c "INSERT INTO repository (id, picture) values ('first', $oid)"
Because comments can't have code - thanks to Laurenz, I got it "working" like this:
drop table if exists some_landing_table;
create table some_landing_table( load_time timestamp, filename varchar, data bytea);
\set the_file 'example.jpg';
\lo_import 'example.jpg';
insert into some_landing_table
select now(), 'example.jpg', string_agg(data,decode('','escape') order by pageno)
from
pg_largeobject
where
loid = (select max(loid) from pg_largeobject);
select lo_unlink( max(loid) ) from pg_largeobject;
However, that is ugly for a few reasons:
• I don't seem to be able to get the result of \lo_import into a variable in any way; even though select \lo_import filename works, select \lo_import filename into x doesn't.
• I can't use a variable: if I do \lo_import :the_file, it just says example.jpg doesn't exist, even though if I put the name in directly it works perfectly.
• I can't find a simpler way of providing a zero-length bytea value than decode('','escape').
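If a small script is acceptable, the whole thing can also be done client side with psycopg2, which creates the large object through the client API and so needs no superuser rights and no \lo_import at all. A minimal sketch; the connection string is a placeholder and the repository table layout is taken from the question.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
lobj = conn.lobject(0, "wb")            # create a new, empty large object
with open("first.jpg", "rb") as f:
    lobj.write(f.read())
oid = lobj.oid                          # remember its OID before closing
lobj.close()

cur = conn.cursor()
cur.execute("INSERT INTO repository (id, picture) VALUES (%s, %s)", ("first", oid))
conn.commit()
conn.close()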

Generate a Properties File using Shell Script and Results from a SQL Query

I am trying to create a properties file like this...
firstname=Jon
lastname=Snow
occupation=Nights_Watch
family=Stark
...from a query like this...
SELECT
a.fname as firstname,
a.lname as lastname,
b.occ as occupation...
FROM
names a,
occupation b,
family c...
WHERE...
How can I do this? I am only aware of using spool to create a CSV file, which won't work here.
These property files will be picked up by shell scripts to run automated tasks. I am using Oracle DB.
Perhaps something like this?
psql -c 'select id, name from test where id = 1' -x -t -A -F = dbname -U dbuser
Output would be like:
id=1
name=test1
(For the full list of options: man psql.)
Since you mentioned spool, I will assume you are running on Oracle. This should produce a result in the desired format, which you can spool straight away.
SELECT
'firstname=' || firstname || CHR(10) ||
'lastname=' || lastname || CHR(10) -- and so on for all fields
FROM your_tables;
The same approach should be possible with all database engines, if you know the correct incantation for a literal newline and the syntax for string concatenation.
It is possible to do this from your command-line SQL client, but as STTLCU notes it might be better to get the query to output something "standard" (like CSV) and then transform the results with a shell script. Otherwise, because a lot of the features you would use are not part of any SQL standard, they would depend on the database server and client application. Think of this step as sort of the obverse of ETL, where you clean up the data you "unload" so that it is useful for some other application.
For sure there are ways to build this into your query application: e.g. if you use something like Perl DBI::Shell as your client (which allows you to connect to many different servers using the DBI module) you can jazz up your output in various ways. But here you'd probably be best off if you could send the query output to a text file and run it through awk.
Having said that ... here's how the Postgresql client could do what you want. Notice how the commands to set up the formatting are not SQL but specific to the client.
~/% psql -h 192.168.2.69 -d cropdusting -u stubblejumper
psql (9.2.4, server 8.4.14)
WARNING: psql version 9.2, server version 8.4.
Some psql features might not work.
You are now connected to database "cropdusting" as user "stubblejumper".
cropdusting=# \pset border 0 \pset format unaligned \pset t \pset fieldsep =
Border style is 0.
Output format is unaligned.
Showing only tuples.
Field separator is "=".
cropdusting=# select year,wmean_yld from bckwht where year=1997 AND freq > 13 ;
1997=19.9761904762
1997=14.5533333333
1997=17.9942857143
cropdusting=#
With the psql client the \pset command sets options affecting the output of query results tables. You can probably figure out which option is doing what. If you want to do this using your SQL client tell us which one it is or read through the manual page for tips on how to format the output of your queries.
My answer is very similar to the two already posted for this question, but I try to explain the options, and try to provide a precise answer.
When using Postgres, you can use the psql command-line utility to get the intended output:
psql -F = -A -x -X <other options> -c 'select a.fname as firstname, a.lname as lastname from names as a ... ;'
The options are:
-F : Use '=' sign as the field separator, instead of the default pipe '|'
-A : Do not align the output; so there is no space between the column header, separator and the column value.
-x : Use expanded output, so column headers are on left (instead of top) and row values are on right.
-X : Do not read $HOME/.psqlrc, as it may contain commands/options that can affect your output.
-c : The SQL command to execute
<other options> : Any other options, such as connection details, database name, etc.
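If you would rather not rely on client-specific switches at all, the file can also be written by a small script through any DB-API driver (cx_Oracle for the Oracle case, psycopg2 for Postgres, and so on). A minimal sketch; sqlite3 is used here only because it ships with Python, and the query, join condition, table and file names are placeholders modelled on the question.
import sqlite3

conn = sqlite3.connect("example.db")  # swap in your own driver and connection
cur = conn.cursor()
cur.execute("""SELECT a.fname AS firstname, a.lname AS lastname, b.occ AS occupation
               FROM names a JOIN occupation b ON b.person_id = a.id
               WHERE a.id = 1""")
row = cur.fetchone()

# the column aliases become the property keys
keys = [col[0] for col in cur.description]
with open("person.properties", "w") as f:
    for key, value in zip(keys, row):
        f.write("{}={}\n".format(key, value))

conn.close()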
You have to choose if you want to maintain such a file from shell or from PL/SQL. Both solutions are possible and both are correct.
Because Oracle has to read from and write to the file, I would do it from the database side.
You can write data to file using UTL_FILE package.
DECLARE
fileHandler UTL_FILE.FILE_TYPE;
BEGIN
fileHandler := UTL_FILE.FOPEN('test_dir', 'test_file.txt', 'W');
UTL_FILE.PUTF(fileHandler, 'firstname=Jon\n');
UTL_FILE.PUTF(fileHandler, 'lastname=Snow\n');
UTL_FILE.PUTF(fileHandler, 'occupation=Nights_Watch\n');
UTL_FILE.PUTF(fileHandler, 'family=Stark\n');
UTL_FILE.FCLOSE(fileHandler);
EXCEPTION
WHEN utl_file.invalid_path THEN
raise_application_error(-20000, 'ERROR: Invalid PATH FOR file.');
END;
Example's source: http://psoug.org/snippet/Oracle-PL-SQL-UTL_FILE-file-write-to-file-example_538.htm
At the same time, you can read from the file using an Oracle external table.
CREATE TABLE parameters_table
(
parameters_coupled VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY test_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE
FIELDS
(
parameters_coupled VARCHAR2(4000)
)
)
LOCATION ('test_file.txt')
);
At this point you can write data to your table, which has one column holding the coupled parameter and value, e.g. 'firstname=Jon'.
• You can read it from Oracle.
• You can read it from any shell script, because it is plain text.
Then it is just a matter of a query, e.g.:
SELECT MAX(CASE WHEN INSTR(parameters_coupled, 'firstname=') = 1 THEN REPLACE(parameters_coupled, 'firstname=') ELSE NULL END) AS firstname
, MAX(CASE WHEN INSTR(parameters_coupled, 'lastname=') = 1 THEN REPLACE(parameters_coupled, 'lastname=') ELSE NULL END) AS lastname
, MAX(CASE WHEN INSTR(parameters_coupled, 'occupation=') = 1 THEN REPLACE(parameters_coupled, 'occupation=') ELSE NULL END) AS occupation
FROM parameters_table;