BigQuery: faster way to insert millions of rows - google-bigquery

I'm using the bq command line and trying to insert a large number of JSON files, with one table per day.
My approach:
list all files to be pushed (date-named YYYMMDDHHMM.meta1.meta2.json)
concatenate files from the same day into one file => YYYMMDD.ndjson
split each YYYMMDD.ndjson file (500 lines per file) => YYYMMDD.ndjson_splittedij
loop over the YYYMMDD.ndjson_splittedij files and run
bq insert --template_suffix=20160331 --dataset_id=MYDATASET TEMPLATE YYYMMDD.ndjson_splittedij
This approach works. I just wonder if it is possible to improve it.
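For reference, a minimal sketch of that pipeline for a single day, assuming all of that day's files sit in the current directory (the glob and the split prefix are my guesses; the bq insert command is the one from above):
# concatenate one day's files, split into 500-line chunks, stream each chunk
cat 20160331*.json > 20160331.ndjson
split -l 500 20160331.ndjson 20160331.ndjson_splitted
for chunk in 20160331.ndjson_splitted*; do
  bq insert --template_suffix=20160331 --dataset_id=MYDATASET TEMPLATE "$chunk"
done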

Again, you are confusing streaming inserts and load jobs.
You don't need to split each file into 500 rows (that limit applies to streaming inserts).
You can have very large files for a load job; see the Command line tab examples listed here: https://cloud.google.com/bigquery/loading-data#loading_csv_files
You only have to run:
bq load --source_format=NEWLINE_DELIMITED_JSON --schema=personsDataSchema.json mydataset.persons_data personsData.json
A compressed JSON file must be under 4 GB; uncompressed, it must be under 5 TB, so larger files are better. Always try with a 10-line sample file until you get the command working.
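With load jobs, the concatenate/split/loop steps collapse into one bq load per daily file. A rough sketch, assuming one table per day named MYDATASET.events_YYYYMMDD and a schema file eventsSchema.json (both names are mine, not from the question):
# one load job per daily NDJSON file, no splitting needed
for f in 201603*.ndjson; do
  day="${f%.ndjson}"
  bq load --source_format=NEWLINE_DELIMITED_JSON \
    MYDATASET.events_"$day" "$f" ./eventsSchema.json
done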

Related

How to run SQL code on a very large CSV file (4 million+ records) without needing to open it

I have a very large file of 4 million+ records that I want to run a SQL query on. However, when I open the file it will only return 1 million contacts and not load the rest. Is there a way for me to run my query without opening the file, so I do not lose any records? PS: I am using a MacBook, so some functions and add-ins are not available to me.

How to split big sql dump file into small chunks and maintain each record in origin files despite later other records deletions

Here's what I want to do (MySQL example):
dumping only the structure - structure.sql
dumping all table data - data.sql
splitting data.sql and putting each table's data into a separate file - table1.sql, table2.sql, table3.sql ... tablen.sql
splitting each table file into smaller files (1k lines per file)
committing all files in my local git repository
copying the whole directory out to a remote secure server
I have a problem with step #4.
For instance I split table1.sql into 3 files: table1_a.sql, table1_b.sql and table1_c.sql.
If a new dump contains new records, that's fine - they just get added to table1_b.sql.
But if records that were in table1_a.sql get deleted, all subsequent records shift, and git will treat table1_b.sql and table1_c.sql as changed, and that's not OK.
Basically it destroys the whole idea of keeping SQL backups in SCM.
My question: How to split big sql dump file into small chunks and maintain each record in origin files despite later other records deletions?
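(For concreteness, steps 1-4 of the list above could look roughly like the sketch below; mydb and the credentials-free invocation are placeholders, not from the question.)
# 1. structure only
mysqldump --no-data mydb > structure.sql
# 2./3. data only, one file per table
for t in $(mysql -N -e 'SHOW TABLES' mydb); do
  mysqldump --no-create-info mydb "$t" > "$t.sql"
done
# 4. naive line-based split of one table's dump - the step that breaks when earlier records are deleted
split -l 1000 table1.sql table1_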
To split SQL dumps into files of 5000 lines, execute in your terminal:
$ split -l 5000 hit_2017-09-28_20-07-25.sql dbpart-
Don't split them at all. Or split them by ranges of PK values. Or split them right down to 1 db row per file (and name the file after tablename + the content of the primary key).
(That's apart from the even more obvious XY-problem answer, which was my instinctive reaction.)
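If you go the PK-range route, a rough sketch (assuming an integer primary key id with values up to about a million and a database called mydb - neither is stated in the question) could be:
# dump table1 in fixed primary-key ranges, one file per range;
# --skip-extended-insert writes one row per INSERT so git diffs stay row-level
for start in $(seq 0 100000 900000); do
  end=$((start + 99999))
  mysqldump --no-create-info --skip-extended-insert mydb table1 \
    --where="id BETWEEN ${start} AND ${end}" > "table1_${start}_${end}.sql"
done
Deleting a row then only touches the file for the range it lived in, instead of shifting every later chunk.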

Google Bigquery - Bulk Load

We have a CSV file with 300 columns. Size is approx 250 MB. We're trying to upload it to BQ through the Web UI, but the schema specification is hard work. I was anticipating that BQ would identify the file headers, but it doesn't seem to recognise them unless I am missing something. Is there a way forward?
Yes, you have to write the schema on your own. BigQuery is not able to auto-infer it. If you have 300 columns, I suggest writing a script to automatically create the schema.
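For example, a quick-and-dirty way to build such a schema file from the CSV header (the file names are placeholders, and it defaults every column to STRING, which you may need to adjust per column) could be:
# turn the header row into a BigQuery JSON schema, one STRING field per column
head -1 myfile.csv | tr -d '\r' | tr ',' '\n' | \
  awk 'BEGIN{printf "["} {printf "%s{\"name\":\"%s\",\"type\":\"STRING\"}", (NR>1?",":""), $0} END{print "]"}' \
  > mySchema.json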
With the command-line tool (cf. here), if you have some lines with a wrong/different schema, you can use the following option to continue loading the other records:
--max_bad_records : the maximum number of bad rows to skip before the load job is aborted
In your case, if you want to skip the first line of headers, it could look like the following:
bq load --skip_leading_rows=1 --max_bad_records=10000 <destination_table> <data_source_uri> [<table_schema>]

My file gets truncated in Hive after uploading it completely to Cloudera Hue

I am using Cloudera's Hue. In the file browser, I upload a .csv file with about 3,000 rows (my file is small <400k).
After uploading the file I go to the Data Browser, create a table and import the data into it.
When I go to Hive and perform a simple query (say SELECT * FROM table) I only see results for 99 rows. The original .csv has more rows than that.
When I do other queries I notice that several rows of data are missing although they show in the preview in the Hue File Browser.
I have tried with other files and they also get truncated sometimes at 65 rows or 165 rows.
I have also removed all the "," from the .csv data before uploading the file.
I finally solved this. There were several issues that appeared to cause the truncation.
The main one was that the column types set automatically after importing the data were assigned according to the first lines. So when the data type changed from TINYINT to INT, values got truncated or changed to NULL. To solve this, perform some EDA and set the data types yourself before creating the table.
Other issues were that the memory I had assigned to the virtual machine slowed down the preview process, and that the CSV contained commas. You can give the VM more memory, or change the CSV to tab-separated.
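To illustrate the first point, a sketch of creating the table with explicit types before loading, rather than accepting what the importer inferred from the first rows (the column names, types and file path are hypothetical, not from the question):
-- declare the types yourself, then load the tab-separated file
CREATE TABLE my_table (
  id     INT,      -- INT up front, so values outside the TINYINT range are not nulled out
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/user/me/my_file.tsv' INTO TABLE my_table;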

Hive query - INSERT OVERWRITE LOCAL DIRECTORY creates multiple files for a single table

I do the following from a hive table myTable.
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
So, this command generates 2 files 000000_0 and 000001_0 inside the folder out/.
But, I need the contents as a single file. What should I do?
There are multiple files in the directory because every reducer writes one file. If you really need the contents as a single file, run your MapReduce job with only one reducer, which will write a single file.
However, depending on your data size, running a single reducer might not be a good approach.
Edit: Instead of forcing Hive to run 1 reduce task and output a single reduce file, it would be better to use hadoop fs operations to merge the outputs into a single file.
For example
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt
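A closely related option, not mentioned in the answer (and assuming the output really is on HDFS, as the command above implies), is getmerge, which concatenates everything under a directory into one local file:
# merge every part file under /myDir/out into a single local file
hadoop fs -getmerge /myDir/out /tmp/out.txt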
A bit late to the game, but I found that using LIMIT large_number, where large_number is bigger than the number of rows in your query, works. It forces Hive to use at least one reducer. For example:
set mapred.reduce.tasks=1; INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000
Worked flawlessly.
CLUSTER BY will also do the job.
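Presumably that means something like the sketch below: CLUSTER BY forces a reduce stage, and with the reducer count pinned to 1 the output lands in a single file (the query is the one from the question; NAME as the clustering column is my pick):
set mapred.reduce.tasks=1;
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out'
SELECT concat_ws('', NAME, PRODUCT, PRC, field1, field2, field3, field4, field5)
FROM myTable
CLUSTER BY NAME;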