I am currently sending StackDriver log files for my app to a BigQuery table. I would like to strip the dataset down and place it into a new BigQuery table to be queried later, and render those results in a view in my app. I will be using Python as my main language, as I do not know Java, and creating a cron job to run this script every 15 minutes to populate the new log dataset from StackDriver.
Stripping the dataset down involves two steps:
1.) Write only some of the columns from the original BigQuery table to the new one
2.) Extract substrings from certain columns and write them into new columns in the new BigQuery table. For example:
A row in the original BigQuery table will contain the string
Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3
I would like to strip out iPad and place it into a devices column, AppleWebKit into a browsers column, and so on, in the new BigQuery table.
I know I can load the BigQuery client library into Python to query the original BigQuery table, but how do I strip out what I want and write that to a new table? Would this be a good use case for pandas? Is there an easier way to accomplish this task than my current idea?
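One way to make this concrete: since both the column selection and the substring extraction can be expressed in SQL, the query itself can do the stripping and BigQuery can write the result straight into the new table, skipping pandas entirely. A minimal sketch with the google-cloud-bigquery client; the project/dataset/table names, the log schema fields, and the regexes are all hypothetical and only illustrative:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Pull only the columns we care about and let SQL do the user-agent
# parsing; these regexes are illustrative, not production-grade.
query = r"""
SELECT
  timestamp,
  httpRequest.status AS status,
  REGEXP_EXTRACT(httpRequest.userAgent, r'\((\w+);') AS device,
  REGEXP_EXTRACT(httpRequest.userAgent, r'(AppleWebKit|Gecko|Trident)') AS browser
FROM `my-project.logs.stackdriver_logs`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.logs.stripped_logs"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.query(query, job_config=job_config).result()  # blocks until the job finishes

Run from cron every 15 minutes, the WHERE clause keeps each run to the newest rows; pandas would only come into play if the parsing logic outgrows what REGEXP_EXTRACT can express.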
I have a single parquet file that I have been incrementally building every day for several months. The file size is around 1.1GB now, and when read into memory it approaches my PC's memory limit. So, I would like to split it up into several files based on the year and month combination (i.e. Data_YYYYMM.parquet.snappy) that will all be in a directory.
My current process reads in the daily csv that I need to append, reads in the historical parquet file with pyarrow and converts it to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])), and then writes back to a single parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I am overwriting the file each time.
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have tried this already and got errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.ParquetDataset. From the docs it looks like it takes a pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the exact error and can't connect to my workstation right now, so I can't reproduce it - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure whether my use case fits what that covers; also, the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if parquet just isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of the new file, comparing it against the historical schema, and, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset - but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data in multiple parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values from rows prior to when this column was available would be null. If so, how do you do this? (A sketch of what I'm hoping for follows these questions.)
If the answer to 1 is no, is there another method or library for handling this?
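To make question 1 concrete, here is a minimal sketch of the behavior I'm hoping for, assuming a recent pyarrow. Note this leans on the experimental pyarrow.dataset API, and pa.unify_schemas plus the null-fill behavior are my assumptions about how it should work; the file names follow the hypothetical monthly layout from above. (As an aside to my AttributeError: the Schema class lives at pyarrow.Schema, not pyarrow.parquet.Schema.)

import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical monthly files; newer months may carry extra columns.
paths = ["data/Data_202001.parquet.snappy", "data/Data_202002.parquet.snappy"]

# Build the superset schema across all files.
unified = pa.unify_schemas([ds.dataset(p, format="parquet").schema for p in paths])

# Read everything with the unified schema; columns absent from a file
# should come back as nulls for that file's rows.
table = ds.dataset(paths, format="parquet", schema=unified).to_table()
df = table.to_pandas()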
Some resources I've been going through:
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset
I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in the resulting BigQuery table (the output), I have no means of selecting only the rows that came from, for instance, a particular month parameter.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, an update rolled out last week added a number of features not specifically addressed in the release notes, including another solution for this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there is now also a source metadata reference called $filepath, which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest... but in your scenario, you could easily bring it into a column and use it for any needed matching or dropping.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
There is currently no access to the data source location or parameter match values within the flow; only the data in the dataset is available to you (except SOURCEROWNUMBER()).
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to have multiple dataset imports by parameter and then union these before running your transformations into a final table.
For each known parameter value, import a dataset filtered on that value, add a recipe step that fills a column with the value, and then union the results of each of these.
Obviously, this only scales so far, i.e. it works if you know the set of parameter values that will match; once you get to the granularity of a timestamp in the source file name, it is no longer feasible.
In this example, just the year value is the filtered parameter.
Longer Solution (An aside)
The alternative I eventually skated to was to define the Dataflow jobs using Dataprep, use these as Dataflow templates, and then run an orchestration function that ran the Dataflow job (not Dataprep) and amended the parameters for input AND output via the API. A transformation BigQuery job then did the roundup/append step.
Worth it if the flow is pretty settled, but not for ad hoc use; it all depends on your scale.
I need to run a join query on BigQuery in one project that may return a large amount of data (which may not fit in the VM's memory), and then save the results in BigQuery in another project.
Is there an easy way to do this without loading the data into the VM, given that the data size can vary and the VM may not have enough memory to load it?
One method is to bypass the VM for the operation and utilize Google Cloud Storage instead.
The process will look like the following:
Create a GCS bucket that both projects have access to
Source project - Export the table to the GCS bucket (this is possible from the web interface; pretty sure the CLI tools can do it too)
Destination project - Create a new table from the files in the GS bucket
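A hedged sketch of steps 2 and 3 with the google-cloud-bigquery Python client rather than the web interface; all project, dataset, and bucket names are hypothetical:

from google.cloud import bigquery

# Hypothetical project, dataset, and bucket names throughout.
src = bigquery.Client(project="source-project")

# Step 2: export the table to the shared bucket. Avro is used because it
# carries the schema along with the data, so no manual schema work later.
src.extract_table(
    "source-project.src_dataset.my_table",
    "gs://shared-bucket/my_table-*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
).result()

# Step 3: load the exported files into a table in the destination project.
dst = bigquery.Client(project="dest-project")
dst.load_table_from_uri(
    "gs://shared-bucket/my_table-*.avro",
    "dest-project.dest_dataset.my_table",
    job_config=bigquery.LoadJobConfig(source_format="AVRO"),
).result()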
To save the result of a query to a table in any project, you do not need to save it to the VM first - you should just set the destination property properly, and of course you need write permissions on the dataset that contains that table!
How you set the destination property varies depending on the client tool you use.
For example, if you are using the REST API's jobs.insert, you should set the properties below:
configuration.query.destinationTable nested object [Optional]
Describes the table where the query results should be stored. If not present, a new table will be created to store the results. This property must be set for large results that exceed the maximum response size.
configuration.query.destinationTable.datasetId string [Required]
The ID of the dataset containing this table.
configuration.query.destinationTable.projectId string [Required]
The ID of the project containing this table.
configuration.query.destinationTable.tableId string [Required]
The ID of the table. The ID must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_). The maximum length is 1,024 characters.
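The same idea with the google-cloud-bigquery Python client - a minimal sketch, with hypothetical project, dataset, and table names:

from google.cloud import bigquery

# The client runs (and bills) the query in the source project.
client = bigquery.Client(project="source-project")

query = """
SELECT t1.id, t2.value
FROM `source-project.src_dataset.table1` AS t1
JOIN `source-project.src_dataset.table2` AS t2 USING (id)
"""

# The destination lives in the other project; you need write access to its dataset.
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "dest-project.dest_dataset.join_results"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
# Results flow from BigQuery straight into the destination table;
# nothing is materialized in this process's memory.
client.query(query, job_config=job_config).result()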
My table name format: tableName_YYYYMMDD. I am trying to write to this table from a streaming Dataflow pipeline. The reason I want to write to a new table every day is that I want to expire tables after 30 days and only keep a window of 30 tables at a time.
Current code:
tableRow.apply(BigQueryIO.Write
    .named("WriteBQTable")
    .to(String.format("%1$s:%2$s.%3$s", projectId, bqDataSet, bqTable))
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
I do realize the above code will not roll over to the new day and start writing to the new table.
As this answer suggests, I can partition the table and expire partitions, but writing to a partitioned table does not seem to be supported from a streaming pipeline.
Any ideas on how I can work around this?
In the Dataflow 2.0 SDK there is a way to specify DynamicDestinations.
See to(DynamicDestinations<T,?> dynamicDestinations) in BigQuery Dynamic Destinations.
Also, see the TableDestination version, which should be simpler and require less code. Though unfortunately there is no example in the javadoc.
to(SerializableFunction<ValueInSingleWindow<T>,TableDestination> tableFunction)
https://beam.apache.org/documentation/sdks/javadoc/2.0.0/
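For the Python-inclined: the Beam Python SDK grew a similar capability, in that WriteToBigQuery accepts a callable for the table argument, so a per-element tableName_YYYYMMDD shard can be derived at write time. This is the Python analogue, not the Java API named above; a rough sketch, where the timestamp field, schema, and project/dataset names are all hypothetical:

import datetime
import apache_beam as beam

def daily_table(element):
    # Derive tableName_YYYYMMDD from a hypothetical epoch-seconds field.
    day = datetime.datetime.utcfromtimestamp(element["ts"]).strftime("%Y%m%d")
    return f"my-project:my_dataset.tableName_{day}"

with beam.Pipeline() as p:
    # Stand-in bounded input; in practice this would be the streaming source.
    rows = p | beam.Create([{"ts": 1500000000, "message": "hello"}])
    rows | beam.io.WriteToBigQuery(
        table=daily_table,
        schema="ts:INTEGER,message:STRING",  # hypothetical schema
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)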
This is an open-source pipeline you can use to connect Pub/Sub to BigQuery. I think Google has also added support for streaming pipelines to date-partitioned tables. Details here.
I have a Google BigQuery table and I want to stream the entire table into a Pub/Sub topic.
What would be the easy/fast way to do it?
Thank you in advance,
2019 update:
Now it's really easy with a click-to-BigQuery option in Pub/Sub:
Find it on: https://console.cloud.google.com/cloudpubsub/topicList
The easiest way I know of is going through Google Cloud Dataflow, which natively knows how to access BigQuery and Pub/Sub.
In theory it should be as easy as the following Python lines:
p = beam.Pipeline(options=pipeline_options)
tablerows = p | 'read' >> beam.io.Read(
    beam.io.BigQuerySource('clouddataflow-readonly:samples.weather_stations'))
tablerows | 'write' >> beam.io.Write(
    beam.io.PubSubSink('projects/fh-dataflow/topics/bq2pubsub-topic'))
This combination of Python/Dataflow/BigQuery/PubSub doesn't work today (Python Dataflow is in beta, but keep an eye on the changelog).
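(For readers landing here later: the Python SDK has long since matured, and those transforms were renamed along the way. A rough present-day equivalent, assuming a current Apache Beam release, the same sample table, and the topic from above:)

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Note: WriteToPubSub needs a runner that supports it (e.g. DirectRunner
# or a streaming Dataflow job); runner options are elided here.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'read' >> beam.io.ReadFromBigQuery(
         table='clouddataflow-readonly:samples.weather_stations')
     | 'to_bytes' >> beam.Map(lambda row: str(row).encode('utf-8'))
     | 'write' >> beam.io.WriteToPubSub(
         topic='projects/fh-dataflow/topics/bq2pubsub-topic'))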
We can do the same with Java, and it works well - I just tested it. It runs both locally and in the hosted Dataflow runner:
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<TableRow> weatherData = p.apply(
    BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));

weatherData.apply(ParDo.named("tableRow2string").of(new DoFn<TableRow, String>() {
    @Override
    public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
        // Serialize each row to a string so it can be published as a message.
        c.output(c.element().toString());
    }
})).apply(PubsubIO.Write.named("WriteToPubsub").topic("projects/myproject/topics/bq2pubsub-topic"));

p.run();
Test if the messages are there with:
gcloud --project myproject beta pubsub subscriptions pull --auto-ack sub1
Hosted Dataflow screenshot: [image omitted]
That really depends on the size of the table.
If it's a small table (a few thousand records, a couple dozen columns) then you could set up a process to query the entire table, convert the response into a JSON array, and push it to Pub/Sub.
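For the small-table case, a minimal sketch of that query-and-publish loop with the Python clients (project, dataset, table, and topic names are all hypothetical):

import json
from google.cloud import bigquery, pubsub_v1

client = bigquery.Client(project="my-project")
publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "my-topic")

# Page through the whole (small) table and publish each row as JSON.
for row in client.query("SELECT * FROM `my-project.my_dataset.small_table`"):
    publisher.publish(topic, json.dumps(dict(row), default=str).encode("utf-8"))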
If it's a big table (millions/billions of records, hundreds of columns) you'd have to export it to a file, and then prepare/ship it to Pub/Sub.
It also depends on your partitioning policy - if your tables are set up to partition by date you might be able to, again, query instead of export.
Last but not least, it also depends on the frequency - is this a one time deal (then export) or a continuous process (then use table decorators to query only the latest data)?
Need some more information if you want a truly helpful answer.
Edit
Based on your comments for the size of the table, I think the best way would be to have a script that would:
Export the table to GCS as newline delimited JSON
Process the file (read line by line) and send to pub-sub
There are client libraries for most programming languages. I've done similar things with Python, and it's fairly straightforward.
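A hedged sketch of that two-step script, assuming the google-cloud-bigquery, google-cloud-storage, and google-cloud-pubsub client libraries, with hypothetical resource names throughout:

from google.cloud import bigquery, pubsub_v1, storage

# 1) Export the table to GCS as newline-delimited JSON.
bq = bigquery.Client(project="my-project")
bq.extract_table(
    "my-project.my_dataset.big_table",
    "gs://my-bucket/export/rows-*.json",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    ),
).result()

# 2) Read the exported files and publish one message per line.
publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "my-topic")

for blob in storage.Client(project="my-project").list_blobs("my-bucket", prefix="export/"):
    # download_as_text() pulls the whole shard into memory; for very large
    # shards, stream with blob.open("rt") instead.
    for line in blob.download_as_text().splitlines():
        publisher.publish(topic, line.encode("utf-8"))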