I need to run a join query on BigQuery in one project that may return a large amount of data (which may not fit in the VM's memory), and then save the results in BigQuery in another project.
Is there an easy way to do this without loading the data onto the VM, since the data size can vary and the VM may not have enough memory to hold it?
One method is to bypass the VM for the operation and use Google Cloud Storage instead.
The process looks like the following (a sketch with the Python client follows these steps):
Create a GCS bucket that both projects have access to.
Source project - Export the table to the GCS bucket (this is possible from the web interface; the CLI tools can do it too).
Destination project - Create a new table from the files in the GCS bucket.
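A minimal sketch of those two steps with the google-cloud-bigquery Python client might look like this; the project, dataset, table, and bucket names are placeholders you would replace with your own.

from google.cloud import bigquery

# Export the query-result table from the source project into the shared bucket.
src_client = bigquery.Client(project="source-project")
extract_job = src_client.extract_table(
    "source-project.source_dataset.join_results",    # table holding the join results
    "gs://shared-bucket/join_results/part-*.json",   # wildcard so large tables can be sharded
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    ),
)
extract_job.result()  # wait for the export to finish

# Load the exported files into a table in the destination project.
dst_client = bigquery.Client(project="destination-project")
load_job = dst_client.load_table_from_uri(
    "gs://shared-bucket/join_results/part-*.json",
    "destination-project.dest_dataset.join_results",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()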
To save the result of a query to a table in any project, you do not need to save it to the VM first; you just need to set the destination property correctly, and of course you need write permissions on the dataset that contains that table!
The destination property varies depending on the client tool you use.
For example, if you are using the REST API's jobs.insert, you should set the properties below:
configuration.query.destinationTable nested object [Optional]
Describes the table where the query results should be stored. If not present, a new table will be created to store the results. This property must be set for large results that exceed the maximum response size.
configuration.query.destinationTable.datasetId string [Required]
The ID of the dataset containing this table.
configuration.query.destinationTable.projectId string [Required]
The ID of the project containing this table.
configuration.query.destinationTable.tableId string [Required]
The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
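Similarly, if you use the google-cloud-bigquery Python client, you set the destination on the query job configuration. This is just a sketch with placeholder project, dataset, and table names:

from google.cloud import bigquery

# Run the query in the source project and write the results directly into a
# table in the destination project (names below are placeholders).
client = bigquery.Client(project="source-project")

job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string(
    "destination-project.dest_dataset.join_results"
)
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

query_job = client.query("SELECT ... FROM ... JOIN ...", job_config=job_config)
query_job.result()  # the result set never passes through the VM's memory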
I built an API in Node.js + Express that allows ReactJS clients to upload CSV files (maximum size at most 1 GB) to the server.
I also wrote another API which, when given a filename and an array of row numbers as input, selects the rows corresponding to those row numbers from the previously stored file and writes them to another result file (writeStream).
Then the resultant file is piped back to the client (all via streaming).
Currently, as you see, I am using files (basically Node.js read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and a memory leak (900 MB consumption) when I have 15 requests, each retrieving about 600 rows from files of approximately 150 MB.
I also have planned an alternate design.
Basically, I will store the entire file as a SQL Table with row numbers as primary indexed key.
I will convert the user-inputted array of row numbers to another table using SQL unnest and then join these two tables to get the rows needed.
Then I will supply back the resultant table as a csv file to the client.
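For what it's worth, if I go the SQL route (assuming PostgreSQL, since unnest is mentioned), the row lookup becomes a single join against the unnested array. A rough sketch from Python, with table and column names made up for illustration:

import psycopg2

# Hypothetical table: file_rows(file_name text, row_no int, line text),
# indexed on (file_name, row_no).
conn = psycopg2.connect("dbname=uploads")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT f.row_no, f.line
        FROM file_rows AS f
        JOIN unnest(%s::int[]) AS wanted(row_no) ON wanted.row_no = f.row_no
        WHERE f.file_name = %s
        ORDER BY f.row_no
        """,
        ([5, 42, 600], "upload-123.csv"),
    )
    for row_no, line in cur:
        pass  # stream each selected row back to the client as CSV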
Would this architecture be better than the previous architecture?
Any suggestions from devs are highly appreciated.
Thanks.
Use the client to do all the heavy lifting by using the XLSX package for any manipulation of content. Then have an API to save information about the transaction. This removes the upload to and download from the server and helps you provide a better experience.
I have a Known\Folder Path.
That folder contains several hundred small txt files.
Generally the filenames are of the form Prefix_<Code1>_<SubCode2>_<State>.txt
I want to know how many files there are for a specific value of Code1.
I was hoping to use the GetMetadata activity, with the path Known\Folder\Prefix_Value_*.txt, but that just returns an empty set :(
Currently I've got it working with GetMetadata on Known\Folder, with childItems captured, and then a foreach over all the files, with If on #startsWith(file.name, 'Prefix_Value').
But that results in hundreds of iterations of the loop, in sequence, and each activity takes ~1 second so it ends up taking minutes to do this check.
Is there a better way to do this? Either to directly locate all files matching my mask, or a better way to count the matching elements of a hundreds-of-items array?
Lots of little activities might be expensive if you do it often.
If you only want the count, you can do this in the following hideous way (promise it isn't written in Brainf&ck) ... it relies on the fact that you can use XPATH to scan XML in ADF. You only need a set-variable activity after your metadata lookup.
Set a variable equal to this - it will contain the number of files with 'Code1' in the name.
#{xpath(xml(json(concat('{"files":{',replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":',''),'}}'))),'count(/files/*[contains(local-name(),''Code1'')])')}
The inner part:
replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":','')
takes the metadata activity's output and strips the []{} parts and the type and name elements, then
json(concat('{"files":{', <the foregoing>, '}}'))
wraps that up in to a JSON object, with files as the outer key and the filenames as inner keys (with text = "file" but that's going to be irrelevant).
Then you can take that JSON, turn it into XML and query the XML.
xpath(xml(<the above JSON>), 'count(/files/*[contains(local-name(),''Code1'')])')
The XPATH query counts all the elements under /files (which are now our filenames) whose names contain the text 'Code1'.
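If it helps to see what that expression is doing, the same count outside ADF looks roughly like this in Python (the filenames are made up):

from lxml import etree

# Toy version of the XML that the expression builds from the Get Metadata
# output: each child element of <files> is a filename.
doc = etree.fromstring(
    "<files>"
    "<Prefix_Code1_Sub1_OH.txt>file</Prefix_Code1_Sub1_OH.txt>"
    "<Prefix_Code2_Sub1_TX.txt>file</Prefix_Code2_Sub1_TX.txt>"
    "</files>"
)
count = doc.xpath("count(/files/*[contains(local-name(), 'Code1')])")
print(int(count))  # 1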
There is currently no way to get the file count for a wildcard match directly in the Get Metadata activity. You can vote for Get Metadata for Multiple Files Matching Wildcard to progress this feature.
If you only want to copy those files, you can use a wildcard file path.
If those files are stored in Azure Blob Storage, or somewhere else where a file count by prefix can be obtained through an API, you can use an Azure Function activity.
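For example, a small HTTP-triggered Azure Function (Python) could return the count of blobs matching a prefix. This is only a sketch; the connection string setting and container name are placeholders:

import os

import azure.functions as func
from azure.storage.blob import ContainerClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Prefix to match, e.g. "Known/Folder/Prefix_Value_"; passed by the caller.
    prefix = req.params.get("prefix", "")
    container = ContainerClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"], "my-container"
    )
    # list_blobs supports server-side prefix filtering, so only matching
    # blob names are enumerated.
    count = sum(1 for _ in container.list_blobs(name_starts_with=prefix))
    return func.HttpResponse(str(count))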
I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in this BigQuery table (output), I have no means of extracting only the rows for, for instance, a particular value of the month parameter.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest . . . but in your scenario, you could easily bring it into a column and do any needed matching or dropping using it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
There is currently no access to data source location or parameter match values within the flow. Only the data in the dataset is available to you. (except SOURCEROWNUMBER())
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to have multiple dataset imports by parameter and then union these before running your transformations into a final table.
For each known parameter search dataset, have a recipe that fills a column with that parameter per dataset and then union the results of each of these.
Obviously, this is only so scalable, i.e. it works if you know the set of parameter values that will match. Once you get to the granularity of a timestamp in the source file, this is no longer feasible.
In this example just the year value is the filtered parameter.
Longer Solution (An aside)
The alternative I eventually skated to was to define the Dataflow jobs using Dataprep, use these as Dataflow templates, and then run an orchestration function that ran the Dataflow job (not Dataprep) and amended the parameters for input AND output via the API; a rough sketch follows. Then a BigQuery transformation job did the final roundup/append.
Worth it if the flow is pretty settled, but not for ad hoc work; it all depends on your scale.
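For reference, the orchestration piece (launching the Dataprep-generated Dataflow template with overridden parameters) can be done from Python roughly like this; the project, region, template path, and parameter keys are placeholders that depend on how your template was generated:

from googleapiclient.discovery import build

# Launch a Dataflow job from a template, overriding its input/output
# parameters. All names and parameter keys below are placeholders.
dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://my-bucket/templates/monthly-import-template",
    body={
        "jobName": "monthly-import-2017-03",
        "parameters": {
            # Parameter names depend on how the template was generated.
            "inputLocations": "gs://source/user/me/datasets/2017-03/2017-03-31-file.csv",
            "outputLocations": "my-project:my_dataset.monthly_import",
        },
    },
)
response = request.execute()
print(response)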
I have a CSV file with the following structure.
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
MARKETING,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
As you can see there is no header, but for your information the first part (first column) represents the sector the data is coming from.
What I have to do is, depending on the value of that first column (for example MARKETING or ERP), send those rows to a different output directory.
For example, all rows with ERP go to /output/ERP/
and all rows with MARKETING go to /output/marketing/
I have an idea of how to do it, but my problem is with the RouteOnAttribute processor I am using: I don't know how to refer to the first column and to specify the value (ERP or MARKETING) so the rows can later be sent to the correct output directory.
Here is my schema.
Thanks.
Use the PartitionRecord processor for this case.
Configure the processor with Record Reader/Writer controller services. Even though you don't have a header, you can use col1, col2, ... etc. in the Avro schema.
Add a new property that tells the processor which field to use to partition the flowfile.
The PartitionRecord processor then adds the partition field as an attribute along with its value; by making use of this attribute value we can dynamically store files into their respective directories.
Flow:
1.GetFile
2.PartitionRecord
3.PutFile //configure directory as /output/${<keep_partition_field_name_here>}
Please refer to this link for configuring and using the PartitionRecord processor.
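For reference, outside NiFi the core routing logic is simply grouping rows by the first column and writing each group under its own directory; a minimal Python sketch (input path and output layout are placeholders):

import csv
import os

# Route each row by its first column (the sector) into /output/<sector>/.
with open("input.csv", newline="") as src:
    for row in csv.reader(src):
        sector = row[0]                                   # e.g. ERP or MARKETING
        out_dir = os.path.join("/output", sector.lower())
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "rows.csv"), "a", newline="") as dst:
            csv.writer(dst).writerow(row)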
(or)
Old Approach:
Using RouteText processor instead of SplitText + RouteOnAttribute Processors
Configure the RouteText processor to route on the first column value (ERP/MARKETING).
Connect the ERP/MARKETING relationships to PutFile processors and use the RouteText.Route attribute value to dynamically save the files into directories.
Flow:
1.GetFile
2.RouteText
3.PutFile //configure directory as /output/${RouteText.Route}/
You can also use the Grouping Regular Expression property value to create partitions.
Note
Using the PartitionRecord processor will be more efficient than the RouteText processor.
I used Python code for exporting data from BigQuery to GCS, and then gsutil to export to S3. But after exporting to GCS, I noticed that some files are more than 5 GB, which gsutil cannot deal with. So I want to know a way to limit the file size.
So, following the issue tracker, the correct way to approach this is:
Single URI ['gs://[YOUR_BUCKET]/file-name.json']
Use a single URI if you want BigQuery to export your data to a single
file. The maximum exported data with this method is 1 GB.
Please note that the 1 GB maximum refers to the size of the data being exported, not to the size of the resulting file.
Single wildcard URI ['gs://[YOUR_BUCKET]/file-name-*.json']
Use a single wildcard URI if you think your exported data set will be
larger than 1 GB. BigQuery shards your data into multiple files based
on the provided pattern. Exported file sizes may vary, and the files won't
be equal in size.
So again, you need to use this method when your data size is above 1 GB; the resulting file sizes may vary and may go beyond 1 GB, which is why the 5 GB and 160 MB pair you mentioned can happen with this method.
Multiple wildcard URIs
['gs://my-bucket/file-name-1-*.json',
'gs://my-bucket/file-name-2-*.json',
'gs://my-bucket/file-name-3-*.json']
Use multiple wildcard URIs if you want to partition the export output.
You would use this option if you're running a parallel processing job
with a service like Hadoop on Google Cloud Platform. Determine how
many workers are available to process the job, and create one URI per
worker. BigQuery treats each URI location as a partition, and uses
parallel processing to shard your data into multiple files in each
location.
The same applies here as well: exported file sizes may vary and can go beyond 1 GB.
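As a concrete example, the single-wildcard export looks roughly like this with the Python client (bucket and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/file-name-*.json",   # the wildcard lets BigQuery shard the output
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    ),
)
extract_job.result()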
Try using a single wildcard URI.
See documentation for Exporting data into one or more files
Use a single wildcard URI if you think your exported data will be
larger than BigQuery's 1 GB per file maximum value. BigQuery shards
your data into multiple files based on the provided pattern. If you
use a wildcard in a URI component other than the file name, be sure
the path component does not exist before exporting your data.
Property definition:
['gs://[YOUR_BUCKET]/file-name-*.json']
Creates:
gs://my-bucket/file-name-000000000000.json
gs://my-bucket/file-name-000000000001.json
gs://my-bucket/file-name-000000000002.json ...
Property definition:
['gs://[YOUR_BUCKET]/path-component-*/file-name.json']
Creates:
gs://my-bucket/path-component-000000000000/file-name.json
gs://my-bucket/path-component-000000000001/file-name.json
gs://my-bucket/path-component-000000000002/file-name.json