How to add timestamp column when loading file to table - google-bigquery

I'm loading batch files to a table.
I want to add a timestamp column to the table so I can know the insertion time
of each record. I'm loading in append mode, so not all records are inserted at the same time.
Unfortunately, I didn't find a way to do this in BigQuery. When loading a file into a table, there doesn't seem to be an option to pad the inserted rows with additional columns. I just want to compute a timestamp in my code and set it as a constant field for the whole insertion process.
The solution I'm using now is to load into a temp table and then query that table plus a new timestamp field into the target table. It works, but it's an extra step, and since I have multiple loads the full process takes too long because of the added latency.
Does anyone know of a solution with only one step?

That's a great feature request for https://code.google.com/p/google-bigquery/issues/list. Unfortunately, there is no automated way to do it today. I like the way you are doing it though :)

If you are willing to make a new table to house this information, I recommend making a new table with the following settings:
a table partitioned by ingestion time, i.e. with the default _PARTITIONTIME field
If you create a table using the default _PARTITIONTIME partitioning field, BigQuery populates that pseudo-column with the time of insertion for every row, which does exactly what you are asking.
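For example, a minimal sketch with the google-cloud-bigquery Python client (project, dataset, and table names are placeholders):
from google.cloud import bigquery
client = bigquery.Client()
# Ingestion-time partitioning: no partitioning field is specified, so BigQuery
# fills the _PARTITIONTIME pseudo-column automatically at load time.
table = bigquery.Table("your-project.dataset_name.table_name",
                       schema=[bigquery.SchemaField("some_field", "STRING")])
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
client.create_table(table)
# Later you can read the load time back via the pseudo-column:
query = "SELECT _PARTITIONTIME AS load_day, * FROM `your-project.dataset_name.table_name`"
for row in client.query(query).result():
    print(row)
Note that _PARTITIONTIME is truncated to the partition boundary (daily by default), so it gives you the load day rather than an exact insertion timestamp.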

You can add a timestamp column/value using a pandas DataFrame:
from datetime import datetime
import pandas as pd
from google.cloud import bigquery
# Compute one timestamp up front so every row in this load gets the same value
insertDate = datetime.utcnow()
bigqueryClient = bigquery.Client()
tableRef = bigqueryClient.dataset("dataset-name").table("table-name")
# Read the batch file and append the constant insert_date column before loading
dataFrame = pd.read_json("file.json")
dataFrame['insert_date'] = insertDate
bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef)
bigqueryJob.result()  # wait for the load job to finish

You can leverage the "hive partitioning" functionality of BigQuery load jobs to accomplish this. This feature is normally used for "external tables" where the data just sits in GCS in carefully-organized folders, but there's no law against using it to import data into a native table.
When you write your batch files, include your timestamp as part of the path. For example, if your timestamp field is called "added_at" then write your batch files to gs://your-bucket/batch_output/added_at=1658877709/file.json
Load your data with the hive partitioning parameters so that the "added_at" value comes from the path instead of from the contents of your file. Example:
bq load --source_format=NEWLINE_DELIMITED_JSON \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=gs://your-bucket/batch_output/ \
dataset-name.table-name \
gs://your-bucket/batch_output/added_at=1658877709/*
The python API has equivalent functionality.
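For reference, here is a rough sketch of the same load with the Python client (the bucket, dataset, and table names, and the unix-seconds path segment, are placeholders):
from google.cloud import bigquery
client = bigquery.Client()
hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"
hive_config.source_uri_prefix = "gs://your-bucket/batch_output/"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    hive_partitioning=hive_config,
)
# added_at is taken from the path, not from the contents of the files
load_job = client.load_table_from_uri(
    "gs://your-bucket/batch_output/added_at=1658877709/*",
    "dataset-name.table-name",
    job_config=job_config,
)
load_job.result()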

Related

BigQueryIO.write() use SQL functions

I have a Dataflow streaming job. I am using BigQueryIO.write to insert rows into BigQuery tables. One column in the BQ table is supposed to store the row creation timestamp, and I need the SQL function CURRENT_TIMESTAMP() to set its value.
I cannot use any of Java's libraries (like Instant.now()) to get the current timestamp, because that would compute the value during pipeline execution. I am using BigQuery load jobs with a triggering frequency of 10 minutes, so a timestamp derived in Java would not reflect when the rows actually land in the table.
I could not find any method in BigQueryIO.write that accepts a SQL function as input. What's the solution to this issue?
It sounds like you want BigQuery to assign a timestamp to each row, based on when the row was inserted. The only way I can think of to accomplish this is to submit a QueryJob to BigQuery that contains an INSERT statement that includes CURRENT_TIMESTAMP() along with the values of the other columns. But this method is not particularly scalable with data volume, and it's not something that BigQueryIO.write() supports.
BigQueryIO.write supports batch loads, the streaming inserts API, and the Storage Write API, none of which to my knowledge provide a method to inject a BigQuery-side timestamp like you are suggesting.
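To illustrate the INSERT idea above, here is a minimal sketch in Python rather than in the Dataflow Java pipeline (the table and column names are made up):
from google.cloud import bigquery
client = bigquery.Client()
# CURRENT_TIMESTAMP() is evaluated server-side when the query runs,
# so created_at reflects the actual insertion time.
query = """
    INSERT INTO `your-project.your_dataset.your_table` (id, payload, created_at)
    VALUES (@id, @payload, CURRENT_TIMESTAMP())
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ScalarQueryParameter("id", "INT64", 42),
    bigquery.ScalarQueryParameter("payload", "STRING", "hello"),
])
client.query(query, job_config=job_config).result()
As noted, issuing one DML INSERT per element does not scale well with data volume, which is why it is not a drop-in replacement for BigQueryIO.write.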

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in the resulting BigQuery table (the output), I have no way to select only the rows that came from, for instance, a particular month parameter.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were accurate at the time of posting, an update rolled out last week added a number of features not specifically mentioned in the release notes, including another solution to this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there is now a source metadata reference called $filepath, which, as you would expect, stores the path to the file in Cloud Storage.
There are a number of caveats: it does not return a value for BigQuery sources and is not available if you pivot, join, or unnest. But in your scenario, you could easily bring it into a column and do any needed matching or dropping with it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
There is currently no access to the data source location or parameter match values within the flow; only the data in the dataset is available to you (except for SOURCEROWNUMBER()).
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to import the dataset once per parameter value and then union those imports before running the transformations into a final table.
For each known parameter value, have a recipe that fills a column with that value, then union the results of each of these.
Obviously, this only scales so far: it works if you know the set of parameter values that will match, but once you get down to the granularity of a timestamp in the source file it is no longer feasible.
In my example, just the year value was the filtered parameter.
Longer Solution (An aside)
The alternative I eventually moved to was to define the Dataflow jobs using Dataprep, save them as Dataflow templates, and then run an orchestration function that ran the Dataflow job (not Dataprep) and amended the input AND output parameters via the API. A BigQuery transformation job then handled the final roll-up/append.
Worth it if the flow is fairly settled, but not for ad hoc work; it all depends on your scale.

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, but it takes a long time to load into Spark and also to query. I think Spark partitions the data across too many nodes, and I want to be able to set that manually. I know that when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. I have also seen repartition(), but I am not sure it solves my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: what is commonly called a "partition key", where a column is selected and indexed to speed up queries (that is not what I want). Definition two (this is where my concern is): given a data set, Spark decides to distribute it across many nodes so it can run operations on the data in parallel; if the data size is too small, this can actually slow the process down. How can I set that value?
By default, Spark SQL uses 200 partitions for shuffles. You can change this with a set command in the SQL context: sqlContext.sql("set spark.sql.shuffle.partitions=10");. However, you need to set it with caution, based on your data characteristics.
You can call repartition() on the dataframe to set the number of partitions. You can also set the spark.sql.shuffle.partitions property after creating the Hive context, or pass it to the spark-submit command:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
Number of "input" partitions are fixed by the File System configuration.
1 file of 1Go, with a block size of 128M will give you 10 tasks. I am not sure you can change it.
repartition can be very bad, if you have lot of input partitions this will make lot of shuffle (data traffic) between partitions.
There is no magic method, you have to try, and use the webUI to see how many tasks are generated.
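As a quick illustration, here is a sketch in the same SQLContext style as the question (it assumes sqlContext is already created; the path and partition count are placeholders):
data = sqlContext.read.json("s3n://your-bucket/path/")
# The initial partition count is driven by the file layout / block size
print(data.rdd.getNumPartitions())
# Explicitly set a partition count; this triggers a shuffle, so only do it
# when the default number of partitions is clearly too high or too low
data = data.repartition(10)
data.registerTempTable("table")
sqlData = sqlContext.sql("SELECT * FROM table")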

BigQuery Table Data Export

I am trying to export data from a BigQuery table using the Python API. The table contains 1 to 4 million rows, so I set the maxResults parameter to its maximum, i.e. 100000, and page through the results. The problem is that each page only returns 2652 rows, so the number of pages is very large. Can anyone explain why, or suggest how to deal with it? The format is JSON.
Or can I export the data in CSV format without using GCS?
I tried inserting a job with allowLargeResults=true, but the result remained the same.
Below is my query body:
queryData = {'query':query,
'maxResults':100000,
'timeoutMs':'130000'}
Thanks in advance.
You can try to export data from the table without using GCS by using the bq command-line tool (https://cloud.google.com/bigquery/bq-command-line-tool), like this:
bq --format=prettyjson query --n=10000000 "SELECT * from publicdata:samples.shakespeare"
You can use --format=json depending on your needs as well.
The actual page size is determined not by row count but by the total size of the rows in a given page; I think the cap is somewhere around 10 MB.
You can also set maxResults to limit the rows per page, in addition to the size cap above.
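If it helps, here is a rough sketch of paging through a table with the newer google-cloud-bigquery client (an assumption on my part, since the question uses the REST-style API); it handles page tokens for you:
from google.cloud import bigquery
client = bigquery.Client()
# list_rows pages through tabledata.list under the hood; page_size is only a
# hint, and the service may still return smaller pages when rows are large.
rows = client.list_rows("dataset-name.table-name", page_size=100000)
for page in rows.pages:
    for row in page:
        print(row)  # replace with whatever processing you need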

How should I partition data in s3 for use with hadoop hive?

I have an S3 bucket containing about 300 GB of log files in no particular order.
I want to partition this data for use in Hadoop/Hive using a date-time stamp, so that log lines belonging to a particular day end up together in the same S3 'folder'. For example, log entries for January 1st would be in files matching the following naming:
s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3
etc
What would be the best way for me to transform the data? Am I best off just running a single script that reads each file in turn and writes the data out to the right S3 location?
I'm sure there's a good way to do this using Hadoop; could someone tell me what that is?
What I've tried:
I tried using hadoop-streaming, passing in a mapper that collected all log entries for each date and wrote them directly to S3, returning nothing to the reducer, but that seemed to create duplicates. (Using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4 million.)
Does anyone have any ideas how best to approach this?
If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
Why not create an external table over this data and then use Hive to populate a new, partitioned table?
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table partitioned (some_field string, `timestamp` string) partitioned by (created_date date);
insert overwrite table partitioned partition (created_date) select some_field, `timestamp`, to_date(`timestamp`) from orig_external_table;
In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.