I have a standalone Java application that I use to fetch data from a BigQuery (BQ) table as a CSV file. The table has more than 50k rows, but the application cannot read more than 15k rows into the CSV file. Please suggest a solution.
You can't get more than 128 MB of data back from a query unless you store the result in a destination table. Save the result in a destination table, then run an export (extract) job, which produces a CSV file in a Google Cloud Storage bucket.
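The two steps above can be sketched as request bodies for the BigQuery REST `jobs.insert` method. This is a minimal sketch, not the answerer's code: the function names and the project, dataset, table, and bucket values are hypothetical, and it only builds the JSON bodies without sending them.

```python
def query_job_body(sql, project, dataset, table):
    """Body for jobs.insert: run a query and keep the result in a table."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                # Legacy SQL needs this flag together with a destination table.
                "allowLargeResults": True,
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }


def extract_job_body(project, dataset, table, gcs_uri):
    """Body for jobs.insert: export a table to CSV files in GCS."""
    return {
        "configuration": {
            "extract": {
                "sourceTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "destinationUris": [gcs_uri],
                "destinationFormat": "CSV",
            }
        }
    }
```

A URI with a wildcard, such as the hypothetical `gs://my-bucket/result-*.csv`, lets BigQuery split an export that is too large for a single file.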
Related
I am exporting a table larger than 1 GB from BigQuery into GCS, but the export splits it into very small files of 2-3 MB. Is there a way to get bigger files, like 40-60 MB per file, rather than 2-3 MB?
I do the export via the API:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table is 60 GB in BigQuery. I extract the data in NEWLINE_DELIMITED_JSON format with GZIP compression:
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
Are you trying to export a partitioned table? If so, each partition is exported as a separate table, which might cause small files.
I ran the export from the CLI with each of the following commands and in both cases received files of 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
Please add more details to the question so we can provide specific advice: how exactly are you requesting this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
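One detail to plan for: a single compose call accepts at most 32 source objects, so a large set of shards has to be merged in stages. The following is a minimal sketch that only builds the `gsutil compose` argument lists; the function name and the temporary object prefix are hypothetical, and the intermediate objects would need to be deleted afterwards.

```python
MAX_COMPONENTS = 32  # limit on source objects per compose call


def compose_commands(sources, final, tmp_prefix):
    """Return gsutil argument lists that merge `sources` into `final`."""
    if not sources:
        raise ValueError("need at least one source object")
    commands = []
    current = list(sources)
    level = 0
    # Merge in groups of 32 until one call can produce the final object.
    while len(current) > MAX_COMPONENTS:
        merged = []
        for i in range(0, len(current), MAX_COMPONENTS):
            chunk = current[i:i + MAX_COMPONENTS]
            target = f"{tmp_prefix}-{level}-{i // MAX_COMPONENTS}"
            commands.append(["gsutil", "compose", *chunk, target])
            merged.append(target)
        current = merged
        level += 1
    commands.append(["gsutil", "compose", *current, final])
    return commands
```

Each returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues.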
I have several CSV files on GCS that share the same schema but have different timestamps, for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should run every day on a scheduler.
Right now what I think could work is as follows:
The CSV files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw CSV files into the raw folder. A Cloud Function then moves 1 file from the raw folder into the queue folder if the queue folder is empty, and does nothing otherwise.
Dataprep scans the queue folder on a schedule. If a CSV file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. It creates a new BigQuery table named after the timestamp column in the CSV file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and then moves 1 file from the raw folder to the queue folder if there is one.
Repeat until all tables are created.
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.
I have a folder of CSV files separated by date in Google Cloud Storage. How can I upload it directly to BigQuery as a partitioned table?
You can do the following:
Create a partitioned table (for example: T)
Run multiple load jobs to load each day's data into the corresponding partition. For example, you can load data for May 15th, 2016 by specifying the destination table of the load job as 'T$20160515'
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
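The per-day load jobs can be generated from the file names. This is a minimal sketch that assumes each file name ends in `YYYYMMDD.csv` (an assumption, the question does not show its naming scheme) and builds a `bq load` argument list per file; the function, dataset, and bucket names are hypothetical.

```python
import re
from datetime import datetime


def partition_load_command(gcs_uri, dataset, table):
    """Build a bq load command targeting the partition named in the file."""
    match = re.search(r"(\d{8})\.csv$", gcs_uri)
    if match is None:
        raise ValueError(f"no YYYYMMDD date in {gcs_uri!r}")
    day = match.group(1)
    datetime.strptime(day, "%Y%m%d")  # raises ValueError for an invalid date
    # The "$YYYYMMDD" suffix is the partition decorator, as in 'T$20160515'.
    destination = f"{dataset}.{table}${day}"
    return ["bq", "load", "--source_format=CSV", destination, gcs_uri]
```

Passing the list to `subprocess.run` without a shell keeps the `$` in the decorator from being expanded.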
I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as JSON in a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) The steps you mention are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained in the BigQuery documentation; also check the variants for different path syntax.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first need to export to GCS, then transfer the files to your local machine. If you have a message queue system (like Beanstalkd) in place to drive all of this, it's easy to chain the operations: submit the job, monitor its state, when it is done start the export to GCS, then delete the temp table.
Also note that you can update a table via the API and set the expirationTime property; with this approach you don't need to delete the table yourself.
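In the REST API, `expirationTime` is expressed in milliseconds since the Unix epoch. A minimal sketch of the body for a table update (for example `tables.patch`) follows; the function name and the one-hour lifetime in the usage note are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expiration_patch_body(now, ttl):
    """Return a table update body that expires the table `ttl` after `now`."""
    expires = now + ttl
    # Integer division of timedeltas gives whole milliseconds exactly.
    millis = (expires - EPOCH) // timedelta(milliseconds=1)
    # int64 fields are carried as strings in the JSON API.
    return {"expirationTime": str(millis)}
```

For example, `expiration_patch_body(datetime.now(timezone.utc), timedelta(hours=1))` would let BigQuery drop the temporary table roughly an hour after the query finishes.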
2) If you use the bq CLI tool, you can set the output format to JSON and redirect the output to a file. This gives you a local export, but it is subject to certain other limits.
This exports the first 1000 lines as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json
I know that by doing:
COPY test FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;
I can import CSV data into PostgreSQL.
However, I do not have a static CSV file. My CSV file is downloaded several times a day and includes data that has already been imported into the database. To keep the database consistent, I would have to leave out this old data.
Ideally I would do this with something like the command above. The worst case would be a Java program that manually checks each database entry against the CSV file. Any recommendations for the implementation?
I really appreciate your answer!
You can load the latest data into a temporary table using the COPY command and then MERGE the temporary table into the live table.
If you run the COPY command from a Java program, try the CopyManager API.
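The stage-then-merge pattern can be sketched as follows. To keep the example runnable without a database server it uses SQLite from the Python standard library as a stand-in, not PostgreSQL. The two-column `test` table with a unique `id` is an assumption, since the question does not show its schema. In PostgreSQL the staging step would be the COPY command, and the insert could instead use `INSERT ... ON CONFLICT DO NOTHING`.

```python
import csv
import io
import sqlite3


def merge_new_rows(conn, csv_text):
    """Stage CSV rows, then insert only ids not yet in `test`.

    Returns the number of rows added to `test`.
    """
    cur = conn.cursor()
    before = cur.execute("SELECT COUNT(*) FROM test").fetchone()[0]
    # Step 1: load the whole file into a temporary staging table.
    cur.execute("CREATE TEMP TABLE staging (id INTEGER, value TEXT)")
    cur.executemany(
        "INSERT INTO staging VALUES (?, ?)",
        csv.reader(io.StringIO(csv_text)),
    )
    # Step 2: copy over only the rows the live table does not have yet.
    cur.execute(
        """
        INSERT INTO test (id, value)
        SELECT s.id, s.value
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM test AS t WHERE t.id = s.id)
        """
    )
    cur.execute("DROP TABLE staging")
    conn.commit()
    after = cur.execute("SELECT COUNT(*) FROM test").fetchone()[0]
    return after - before
```

Because the duplicate check runs inside the database, the program never has to compare individual rows itself.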