How to run an SQL query on a very large CSV file (4 million+ records) without needing to open it

I have a very large CSV file of 4 million+ records that I want to run an SQL query on. However, when I open the file only about 1 million contacts are loaded and the rest are dropped. Is there a way for me to run my query without opening the file so that I do not lose any records? PS: I am using a MacBook, so some functions and add-ins are not available to me.
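One common workaround for the roughly 1-million-row limit of spreadsheet tools is to skip opening the file altogether: load the CSV into an on-disk database and run the SQL there. A minimal sketch in Python (its sqlite3 and csv modules ship with the standard library; the file, database, and table names below are placeholders, not from the question):

import csv
import sqlite3

# Load the CSV into an on-disk SQLite database once, then query it with SQL.
# "contacts.csv", "contacts.db" and the table name are placeholders.
conn = sqlite3.connect("contacts.db")
with open("contacts.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join('"%s"' % c for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute("CREATE TABLE IF NOT EXISTS contacts (%s)" % cols)
    # executemany streams the reader row by row, so the whole file
    # never has to fit in memory.
    conn.executemany("INSERT INTO contacts VALUES (%s)" % marks, reader)
    conn.commit()

# Any SQL now runs against all 4 million+ rows.
for row in conn.execute("SELECT COUNT(*) FROM contacts"):
    print(row)
conn.close()

The sqlite3 command-line shell that ships with macOS can do much the same with .mode csv and .import, if staying out of Python is preferable.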

Related

Divide a SQL file into smaller files and save each of them as a CSV/Excel/TXT file dynamically

So, I'm working on an SAP HANA database that has 10 million records in one table, and there are 'n' tables in the DB. The constraints that I'm facing are:
I do not have write access to the db.
The maximum RAM in the system is 6 GB.
Now, I need to extract the data from this table and save it as a CSV, TXT, or Excel file. I tried a plain SELECT * query; with that, the machine extracts ~700k records before throwing an out-of-memory exception.
I've tried using LIMIT and OFFSET in SAP HANA and it works perfectly, but it takes around 30 minutes for the machine to process ~500k records, so going by this route will be very time-consuming.
So, I wanted to know if there is any way to automate the process of selecting 500k records at a time using LIMIT and OFFSET and saving each such chunk of 500k records as a CSV/TXT file on the system, so that I can start the job and leave the system overnight to extract the data.
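A rough way to automate that loop, assuming the SAP HANA Python driver (hdbcli) is installed; the host, credentials, schema/table name, and the ID ordering column are placeholders:

import csv
from hdbcli import dbapi  # SAP HANA Python driver; assumed available

CHUNK = 500000
conn = dbapi.connect(address="hana-host", port=30015, user="READ_USER", password="secret")
cursor = conn.cursor()

offset = 0
part = 0
while True:
    # MYSCHEMA.BIGTABLE and the ID column are placeholders; a stable
    # ORDER BY is needed so the LIMIT/OFFSET pages do not overlap.
    cursor.execute(
        "SELECT * FROM MYSCHEMA.BIGTABLE ORDER BY ID LIMIT %d OFFSET %d" % (CHUNK, offset)
    )
    rows = cursor.fetchall()
    if not rows:
        break
    with open("bigtable_part%03d.csv" % part, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header in every file
        writer.writerows(rows)
    offset += CHUNK
    part += 1

cursor.close()
conn.close()

Each pass writes one 500k-record CSV, so the job can run unattended overnight. Note that deep OFFSETs tend to get slower; if the table has a usable key, paging with WHERE ID > last_seen_id scales better.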

Using WbImport in SQL Workbench/J with the Presto driver

I am using SQL Workbench/J to import a 160k-line text file into a table. The code is:
WbImport
-usepgcopy
-type=text
-endrow=164841
-file='book1.csv'
-table=it.table1
-delimiter=,
-multiline=true
I have tried this with a 3-line version of my 160k-line file and it completed in a few seconds, but only in auto-commit mode. When I try to run it on the full 160k-line file it takes over 200 hours to complete. Any idea why, or any alternatives?
I am using SQL Workbench/J build 125 and presto-jdbc-0.216.
Thanks
Most likely the reason is that the overall transaction gets too large and this imposes too big a load on WbImport and the JDBC connection. It would probably work much faster if you break this up into separate imports of e.g. 1000 records each.
If you first cut the file up into multiple smaller files and then import them one at a time, you also avoid repeatedly re-reading the large file to find the right starting record.
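One way to do the splitting up front is a small script; a sketch in Python, with the file name taken from the question and the 1000-record chunk size from the suggestion above:

import csv

CHUNK = 1000  # records per output file, matching the suggested batch size
with open("book1.csv", newline="", encoding="utf-8") as src:
    reader = csv.reader(src)  # CSV-aware, so multiline fields stay intact
    part, out, writer = 0, None, None
    for i, row in enumerate(reader):
        if i % CHUNK == 0:
            if out:
                out.close()
            part += 1
            out = open("book1_part%03d.csv" % part, "w", newline="", encoding="utf-8")
            writer = csv.writer(out)
        writer.writerow(row)
    if out:
        out.close()

Each part can then be imported with the same WbImport call (dropping -endrow), so every file commits on its own.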

Improve ETL from COBOL file to SQL

I have a multi-server/multi-process/multi-threaded solution which can parse and extract over 7 million records from a 6 GB EBCDIC COBOL file into 27 SQL tables, all in under 20 minutes. The problem: the actual parsing and extraction of the data only takes about 10 minutes, using bulk inserts into staging tables. It then takes almost another 10 minutes to copy the data from the staging tables to their final tables. Any ideas on how I can improve the second half of my process? I've tried using in-memory tables, but that blows out the SQL Server.
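One commonly used lever for the staging-to-final copy on SQL Server is an INSERT ... SELECT with a TABLOCK hint on the target, which can be minimally logged when the preconditions hold (e.g. simple or bulk-logged recovery model, heap or empty clustered target). A hedged sketch driven from Python via pyodbc; the server, database, and table names are placeholders:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cur = conn.cursor()

# TABLOCK on the target lets SQL Server attempt a minimally logged,
# parallel insert when the preconditions above are met.
cur.execute("""
    INSERT INTO dbo.FinalTable WITH (TABLOCK)
    SELECT *
    FROM dbo.StagingTable
""")
conn.commit()
conn.close()

If the final tables are partitioned and the staging tables match their layout exactly, ALTER TABLE ... SWITCH moves the data as a metadata-only operation and avoids the copy entirely.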

Pentaho | Tools-> Wizard-> Copy Tables

I want to copy tables from one database to another database.
I have searched Google and found that this can be done with the Wizard option of the Tools menu in Spoon.
Currently I am trying to copy just one table from one database into another.
My table has just 130,000 records and it took 10 minutes to copy the table.
Can we improve these loading times? I mean, copying just ~100k records should not take more than 10 seconds.
Try the MySQL bulk loader (note: that is Linux-only).
OR
fix the batch size:
http://julienhofstede.blogspot.co.uk/2014/02/increase-mysql-output-to-80k-rowssecond.html
You'll get massive improvements that way.
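For the batch-size route, the usual combination (as far as I recall from posts like the one linked; exact values are illustrative) is a larger commit size on the Table Output step together with MySQL Connector/J options that turn JDBC batches into multi-row inserts, e.g. appended to the connection URL or set as connection options in Spoon:

jdbc:mysql://dbhost:3306/targetdb?useServerPrepStmts=false&rewriteBatchedStatements=true&useCompression=true

rewriteBatchedStatements=true is what collapses each batch into one multi-row INSERT; the host, port, and database name here are placeholders.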

BigQuery: faster way to insert millions of rows

I'm using the bq command line tool and trying to insert a large number of JSON files, with one table per day.
My approach:
list all files to be pushed (date-named YYYMMDDHHMM.meta1.meta2.json)
concatenate files from the same day => YYYMMDD.ndjson
split the YYYMMDD.ndjson file (500-line files each) => YYYMMDD.ndjson_splittedij
loop over each YYYMMDD.ndjson_splittedij file and run
bq insert --template_suffix=20160331 --dataset_id=MYDATASET TEMPLATE YYYMMDD.ndjson_splittedij
This approach works. I just wonder if it is possible to improve it.
Again, you are confusing streaming inserts with load jobs.
You don't need to split each file into 500-row chunks (that limit applies to streaming inserts).
You can have very large files for insert, see the Command line tab examples listed here: https://cloud.google.com/bigquery/loading-data#loading_csv_files
You only have to run:
bq load --source_format=NEWLINE_DELIMITED_JSON --schema=personsDataSchema.json mydataset.persons_data personsData.json
A compressed JSON file must be under 4 GB; uncompressed, it must be under 5 TB, so larger files are better. Always try with a 10-line sample file until you get the command working.
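If the daily loop ends up scripted anyway, the same load job can be submitted per daily file through the BigQuery Python client instead of shelling out to bq; a sketch assuming a recent google-cloud-bigquery client, with the dataset and table names as placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or pass an explicit schema, as with --schema on the CLI
)

# MYDATASET and the date-suffixed table name are placeholders.
with open("20160331.ndjson", "rb") as f:
    job = client.load_table_from_file(
        f, "MYDATASET.persons_data_20160331", job_config=job_config
    )
job.result()  # wait for the load job to finish
print("Loaded %d rows" % job.output_rows)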