Custom delimiters in CSV data when exporting from BigQuery to a GCS bucket? - google-bigquery

Background:
I have a GA Premium account and currently have the following process set up:
1. The raw data from the GA account flows into BigQuery.
2. Query the BigQuery tables.
3. Export the query results to a GCS bucket, as gzipped CSV files.
4. Copy the gzipped CSV data from the GCS bucket to HDFS on my Hadoop cluster.
5. Generate Hive tables from the data on the cluster, using the comma as the field delimiter.
I run Steps 1-3 programmatically using the BigQuery REST API.
Problem:
My data contains embedded commas and newlines within quoted fields. When I generate my Hive tables, these embedded commas and newlines shift the field values of a row or produce nulls in the records of the Hive table.
I want to clean the data by either removing these embedded commas and newlines or replacing them with custom delimiters inside the quotes.
The catch is that I would like to do this cleaning at Step 3, while exporting to GCS. I looked into the query parameters I could use to achieve this but did not find any. The parameters that can be used to populate the configuration.extract object are listed at: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract
Here is a snippet of the code that exports data from the BigQuery tables to the GCS bucket:
query_request = bigquery_service.jobs()

DATASET_NAME = '#######'
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()
Thanks in advance.

EDIT: It looks like I misunderstood your question. You wanted to modify the values so they do not include commas and newlines. I thought your issue was only commas and that the fix would be to simply not use the comma as the delimiter.
To be clear, there is no way to make the modification while exporting. You will need to run another query to produce a new table.
Example:
SELECT
  x, y, z,
  REGEXP_REPLACE(
    REGEXP_REPLACE(
      REGEXP_REPLACE(bad_data, '%', '%25'),
      '\n', '%0A'
    ),
    ',', '%2C'
  )
FROM ds.tbl
This will encode the bad_data field in a query string compatible format. Remember to run this query with large results enabled if necessary.
A java.net.URLDecoder or something similar should be able to decode the values later if you don't want to do it by hand.
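For reference, a minimal sketch (not from the original answer) of how such a cleanup query could be submitted with the same REST client used in the question, writing to a destination table so the cleaned data can then be extracted to GCS. The query string and destination table name are placeholders; allowLargeResults requires a destination table in the v2 API:

CLEANUP_QUERY = 'SELECT ...'  # the REGEXP_REPLACE query shown above

query_job_body = {
    'projectId': PROJECT_ID,
    'configuration': {
        'query': {
            'query': CLEANUP_QUERY,
            'allowLargeResults': True,        # needed if the result set is large
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': 'cleaned_table'    # placeholder destination
            },
            'writeDisposition': 'WRITE_TRUNCATE'
        }
    }
}

query_job = bigquery_service.jobs().insert(
    projectId=PROJECT_ID, body=query_job_body).execute()

The cleaned table can then be exported with the extract configuration shown earlier, and the percent-encoded values decoded downstream, for example with java.net.URLDecoder or Python's urllib.parse.unquote.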
You can set the field delimiter of the export object.
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract.fieldDelimiter
query_request = bigquery_service.jobs()

DATASET_NAME = '#######'
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'fieldDelimiter': '~',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()

With Python:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

job_config = bigquery.ExtractJobConfig()
job_config.field_delimiter = ';'

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for the export to finish.
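If the goal is to reproduce the gzipped, headerless export from the REST snippets with this client, the same options appear to be available on ExtractJobConfig. A sketch under that assumption (attribute names may vary slightly across library versions; client and table_ref are reused from the block above):

job_config = bigquery.ExtractJobConfig()
job_config.field_delimiter = '~'
job_config.destination_format = bigquery.DestinationFormat.CSV
job_config.compression = bigquery.Compression.GZIP
job_config.print_header = False

# A wildcard URI lets BigQuery shard a large export across multiple files.
destination_uri = 'gs://bucketname/foldername/my-files-*.gz'

extract_job = client.extract_table(
    table_ref, destination_uri, job_config=job_config, location='US')
extract_job.result()  # Waits for the export to finish.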

Related

LIKE operator working on AWS lambda function but not =

I have a small CSV file that looks like this:
is_employee,candidate_id,gender,hesa_type,university
FALSE,b9bb80,Male,Mathematical sciences,Birmingham
FALSE,8e552d,Female,Computer science,Swansea
TRUE,2bc475,Male,Engineering & technology,Aston
TRUE,c3ac8d,Female,Mathematical sciences,Heriot-Watt
FALSE,ceb2fa,Female,Mathematical sciences,Imperial College London
The following Lambda function is used to query it from an S3 bucket.
import boto3
import os
import json

def lambda_handler(event, context):
    BUCKET_NAME = 'foo'
    KEY = 'bar/data.csv'
    s3 = boto3.client('s3', 'eu-west-1')
    response = s3.select_object_content(
        Bucket=BUCKET_NAME,
        Key=KEY,
        ExpressionType='SQL',
        Expression='Select count(*) from s3object s where s.gender like \'%Female%\'',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )
    for i in response['Payload']:
        if 'Records' in i:
            query_result = i['Records']['Payload'].decode('utf-8')
            print(list(json.loads(query_result).values())[0])
Now, this works great: I get back a result of 3.
But for some reason the same code does not work when changing the LIKE operator to =; the result drops to 0, so no match is found. What is happening here?
So I found the problem. The items of the last column were followed by a newline character, which was not handled by the S3 Select interpreter. So a university name was not really Swansea, but rather Swansea\n.
So s.university = 'Swansea' does not work; however, s.university LIKE 'Swansea%' does work, and is still a sargable expression.
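If rewriting the CSV is not an option, a small variation of the Lambda call above may also work. This is a sketch, assuming S3 Select's TRIM string function is available; BUCKET_NAME, KEY and s3 are the same placeholders as in the Lambda code:

response = s3.select_object_content(
    Bucket=BUCKET_NAME,
    Key=KEY,
    ExpressionType='SQL',
    # Trim the trailing newline inside the expression before comparing.
    Expression="SELECT count(*) FROM s3object s "
               "WHERE TRIM(s.university) = 'Swansea'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
    OutputSerialization={'JSON': {}},
)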

How can I save Kafka data read via Structured Streaming as a DataFrame and apply parsing on it?

I am trying to read real-time streaming data from Kafka topics through Spark Structured Streaming. However, my understanding is that I would need the streaming to stop at some point so I can apply my parsing logic to it and push it to MongoDB. Is there a way I can save the streaming data into a separate DataFrame, with or without stopping the streaming?
I checked the guide and other blogs, and I am not getting a straightforward answer for my requirement.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092, host:9092, host:9092")
  .option("subscribe", "TOPIC_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
  .option("startingOffsets", "earliest")
  .load()

val dfs = df.selectExpr("CAST(value AS STRING)")

val consoleOutput = dfs.writeStream
  .outputMode("append")
  .format("console")
  .start()

consoleOutput.awaitTermination()
consoleOutput.stop()
I need the streaming data to be saved somehow in a DataFrame, either by stopping the streaming or without stopping it.
Below is the parsing logic I have. Instead of picking the dataset up from a file path, I need the streamed data to be my new dataset, and I should be able to apply the rest of the logic and get the output. Saving it to Mongo is not my primary focus right now.
val log = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("C:\\Users\\raheem_mohammed\\IdeaProjects\\diag.csv")
log.createOrReplaceTempView("logs")

val df = spark.sql("select _raw, _time from logs").toDF

// Adds an Id number to each of the events
val logs = df.withColumn("Id", monotonicallyIncreasingId() + 1)

// Register the DataFrame as a temp table
logs.createOrReplaceTempView("logs")
val dfss = spark.sql("select Id, value from logs")

// Extracts columns from the _raw column. Also finds the probabilities of compositeNames.
// If true then the compositeName belongs to one of the four possibilities.
val extractedDF = dfss.withColumn("managed_server", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]", 2))
  .withColumn("alert_summary", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]", 3))
  .withColumn("oracle_details", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]", 5))
  .withColumn("ecid", regexp_extract($"_raw", "(?<=ecid: )(.*?)(?=,)", 1))
  //.withColumn("CompName", regexp_extract($"_raw", """.*(composite_name|compositename|composites|componentDN):\s+(\S+)\]""", 2))
  .withColumn("CompName", regexp_extract($"_raw", """.*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""", 2))
  .withColumn("composite_name", col("_raw").contains("composite_name"))
  .withColumn("compositename", col("_raw").contains("compositename"))
  .withColumn("composites", col("_raw").contains("composites"))
  .withColumn("componentDN", col("_raw").contains("componentDN"))

// Filters out any NULL values if found
val finalData = extractedDF.filter(
  col("managed_server").isNotNull &&
  col("alert_summary").isNotNull &&
  col("oracle_details").isNotNull &&
  col("ecid").isNotNull &&
  col("CompName").isNotNull &&
  col("composite_name").isNotNull &&
  col("compositename").isNotNull &&
  col("composites").isNotNull &&
  col("componentDN").isNotNull)

finalData.show(false)
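One common way to get a plain, non-streaming DataFrame per micro-batch without stopping the stream is Structured Streaming's foreachBatch sink (Spark 2.4+). A minimal PySpark sketch, not from the original post; broker and topic names are placeholders and the MongoDB write is only indicated in a comment:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-parse").getOrCreate()

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "host:9092")
             .option("subscribe", "TOPIC_P2_R2")
             .option("startingOffsets", "earliest")
             .load()
             .selectExpr("CAST(value AS STRING)"))

def parse_and_write(batch_df, batch_id):
    # batch_df is an ordinary DataFrame, so the regexp_extract and
    # filter logic above can be applied to it as-is.
    parsed = batch_df.filter(col("value").isNotNull())
    parsed.show(truncate=False)
    # parsed.write.format("mongo").mode("append").save()  # if the Mongo Spark connector is configured

query = stream_df.writeStream.foreachBatch(parse_and_write).start()
query.awaitTermination()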

Read Avro File and Write it into BigQuery table

My objective is to read Avro file data from Cloud Storage and write it to a BigQuery table using Java. It would be good if someone could provide a code snippet/ideas to read Avro-format data and write it to a BigQuery table using Cloud Dataflow.
I see two possible approaches:
1. Using Dataflow:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);

// Read an AVRO file.
// Alternatively, read the schema from a file.
// https://beam.apache.org/releases/javadoc/2.11.0/index.html?org/apache/beam/sdk/io/AvroIO.html
Schema avroSchema = new Schema.Parser().parse(
    "{\"type\": \"record\", "
        + "\"name\": \"quote\", "
        + "\"fields\": ["
        + "{\"name\": \"source\", \"type\": \"string\"},"
        + "{\"name\": \"quote\", \"type\": \"string\"}"
        + "]}");
PCollection<GenericRecord> avroRecords = p.apply(
    AvroIO.readGenericRecords(avroSchema).from("gs://bucket/quotes.avro"));

// Convert Avro GenericRecords to BigQuery TableRows.
// It's probably better to use Avro-generated classes instead of manually casting types.
// https://beam.apache.org/documentation/io/built-in/google-bigquery/#writing-to-bigquery
PCollection<TableRow> bigQueryRows = avroRecords.apply(
    MapElements.into(TypeDescriptor.of(TableRow.class))
        .via(
            (GenericRecord elem) ->
                new TableRow()
                    .set("source", ((Utf8) elem.get("source")).toString())
                    .set("quote", ((Utf8) elem.get("quote")).toString())));

// https://cloud.google.com/bigquery/docs/schemas
TableSchema bigQuerySchema =
    new TableSchema()
        .setFields(
            ImmutableList.of(
                new TableFieldSchema()
                    .setName("source")
                    .setType("STRING"),
                new TableFieldSchema()
                    .setName("quote")
                    .setType("STRING")));

bigQueryRows.apply(BigQueryIO.writeTableRows()
    .to(new TableReference()
        .setProjectId("project_id")
        .setDatasetId("dataset_id")
        .setTableId("avro_source"))
    .withSchema(bigQuerySchema)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

p.run().waitUntilFinish();
2. Import the data into BigQuery directly, without Dataflow. See this documentation: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro (a minimal load-job sketch follows the Beam script below).
For the Dataflow route, you can also try the following Python script:
import apache_beam as beam
import sys

PROJECT = 'YOUR_PROJECT'
BUCKET = 'YOUR_BUCKET'

def run():
    argv = [
        '--project={0}'.format(PROJECT),
        '--staging_location=gs://{0}/staging/'.format(BUCKET),
        '--temp_location=gs://{0}/staging/'.format(BUCKET),
        '--runner=DataflowRunner'
    ]
    p = beam.Pipeline(argv=argv)
    (p
     | 'ReadAvroFromGCS' >> beam.io.avroio.ReadFromAvro('gs://{0}/file.avro'.format(BUCKET))
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('{0}:dataset.avrotable'.format(PROJECT))
    )
    p.run()

if __name__ == '__main__':
    run()
Hope it helps.
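For the second approach (loading directly, without Dataflow), here is a minimal sketch with the google-cloud-bigquery Python client; this is not from the original answers, and the project, dataset, table and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project='project_id')

# Avro files carry their own schema, so no explicit schema is needed here.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO

load_job = client.load_table_from_uri(
    'gs://bucket/quotes.avro',
    client.dataset('dataset_id').table('avro_source'),
    job_config=job_config)
load_job.result()  # Waits for the load job to complete.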

export data from bigquery to cloud storage - php client library - there is one extra empty new line in the cloud storage file

I followed this sample
https://cloud.google.com/bigquery/docs/exporting-data
public function exportDailyRecordsToCloudStorage($date, $tableId)
{
    $validTableIds = ['table1', 'table2'];
    if (!in_array($tableId, $validTableIds)) {
        die("Wrong TableId");
    }

    $date = date("Ymd", date(strtotime($date)));
    $datasetId = $date;
    $dataset = $this->bigQuery->dataset($datasetId);
    $table = $dataset->table($tableId);

    // load the storage object
    $storage = $this->storage;
    $bucketName = 'mybucket';
    $objectName = "daily_records/{$tableId}_" . $date;
    $destinationObject = $storage->bucket($bucketName)->object($objectName);

    // create the export job
    $format = 'NEWLINE_DELIMITED_JSON';
    $options = ['jobConfig' => ['destinationFormat' => $format]];
    $job = $table->export($destinationObject, $options);

    // poll the job until it is complete
    $backoff = new ExponentialBackoff(10);
    $backoff->execute(function () use ($job) {
        print('Waiting for job to complete' . PHP_EOL);
        $job->reload();
        if (!$job->isComplete()) {
            //throw new Exception('Job has not yet completed', 500);
        }
    });

    // check if the job has errors
    if (isset($job->info()['status']['errorResult'])) {
        $error = $job->info()['status']['errorResult']['message'];
        printf('Error running job: %s' . PHP_EOL, $error);
    } else {
        print('Data exported successfully' . PHP_EOL);
    }
}
My table1 has 37670 rows, and its Cloud Storage file has 37671 lines.
My table2 has 388065 rows, and its Cloud Storage file has 388066 lines.
The last line in both Cloud Storage files is an empty line.
Is this a Google BigQuery feature improvement request, or did I do something wrong in my code above?
What you describe seems like an unexpected outcome. The output file should generally have the same number of lines as the source table.
Your PHP code looks fine and shouldn't be the cause of the issue.
I tried to reproduce the problem but was unable to. Could you double-check whether the last empty line is somehow added by another tool, such as a text editor? How are you counting the lines of the resulting output?
If you have ruled that out and are sure the newline is indeed added by the BigQuery export feature, please consider opening a bug in the BigQuery issue tracker, as suggested by xuejian, and include your job ID so that we can investigate further.
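One quick check worth doing before filing a bug: a newline-delimited JSON export normally ends with a trailing newline after the last record, and some tools report that as an extra blank line. A small sketch (assuming gsutil is installed; the object path is a placeholder):

import subprocess

# Hypothetical check: is there a genuinely empty last record, or just a
# single trailing newline after the final JSON record?
data = subprocess.run(
    ['gsutil', 'cat', 'gs://mybucket/daily_records/table1_20180101'],
    capture_output=True, check=True).stdout

records = [line for line in data.split(b'\n') if line.strip()]
print('non-empty records:', len(records))
print('ends with a truly empty line:', data.endswith(b'\n\n'))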

output for append job in BigQuery using Luigi Orchestrator

I have a BigQuery task whose only aim is to append a daily temp table (Table-xxxx-xx-xx) to an existing table (PersistingTable).
I am not sure how to handle the output(self) method. Indeed, I cannot just output PersistingTable as a luigi.contrib.bigquery.BigQueryTarget, since it already existed before the process started. Has anyone run into such a question?
I could not find an answer anywhere else, so I will give my solution even though this is a very old question.
I created a new class that inherits from luigi.contrib.bigquery.BigQueryLoadTask:
class BigQueryLoadIncremental(luigi.contrib.bigquery.BigQueryLoadTask):
    '''
    a subclass that checks whether a write-log on gcs exists to append data to the table
    needs to define Two Outputs! [0] of type BigQueryTarget and [1] of type GCSTarget
    Everything else is left unchanged
    '''

    def exists(self):
        return luigi.contrib.gcs.GCSClient.exists(self.output()[1].path)

    @property
    def write_disposition(self):
        """
        Set to WRITE_APPEND as this subclass only makes sense for this
        """
        return luigi.contrib.bigquery.WriteDisposition.WRITE_APPEND

    def run(self):
        output = self.output()[0]
        gcs_output = self.output()[1]
        assert isinstance(output,
                          luigi.contrib.bigquery.BigQueryTarget), 'Output[0] must be a BigQueryTarget, not %s' % (
            output)
        assert isinstance(gcs_output,
                          luigi.contrib.gcs.GCSTarget), 'Output[1] must be a Cloud Storage Target, not %s' % (
            gcs_output)

        bq_client = output.client
        source_uris = self.source_uris()
        assert all(x.startswith('gs://') for x in source_uris)

        job = {
            'projectId': output.table.project_id,
            'configuration': {
                'load': {
                    'destinationTable': {
                        'projectId': output.table.project_id,
                        'datasetId': output.table.dataset_id,
                        'tableId': output.table.table_id,
                    },
                    'encoding': self.encoding,
                    'sourceFormat': self.source_format,
                    'writeDisposition': self.write_disposition,
                    'sourceUris': source_uris,
                    'maxBadRecords': self.max_bad_records,
                    'ignoreUnknownValues': self.ignore_unknown_values
                }
            }
        }

        if self.source_format == luigi.contrib.bigquery.SourceFormat.CSV:
            job['configuration']['load']['fieldDelimiter'] = self.field_delimiter
            job['configuration']['load']['skipLeadingRows'] = self.skip_leading_rows
            job['configuration']['load']['allowJaggedRows'] = self.allow_jagged_rows
            job['configuration']['load']['allowQuotedNewlines'] = self.allow_quoted_new_lines

        if self.schema:
            job['configuration']['load']['schema'] = {'fields': self.schema}

        # test write to and removal of GCS pseudo output in order to make sure this does not fail.
        gcs_output.fs.put_string(
            'test write for task {} (this file should have been removed immediately)'.format(self.task_id),
            gcs_output.path)
        gcs_output.fs.remove(gcs_output.path)

        bq_client.run_job(output.table.project_id, job, dataset=output.table.dataset)

        gcs_output.fs.put_string(
            'success! The following BigQuery Job went through without errors: {}'.format(self.task_id),
            gcs_output.path)
It uses a second output (which might violate Luigi's atomicity principle) on Google Cloud Storage. Example usage:
class LeadsToBigQuery(BigQueryLoadIncremental):
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        return luigi.contrib.bigquery.BigQueryTarget(project_id=...,
                                                     dataset_id=...,
                                                     table_id=...), \
               create_gcs_target(...)
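For completeness, a small sketch (not part of the original answer) of how such a task could be kicked off with Luigi's local scheduler; in production a central luigid scheduler would normally be used, and create_gcs_target remains the author's own helper:

import datetime
import luigi

if __name__ == '__main__':
    # Run the incremental load task for today's date.
    luigi.build(
        [LeadsToBigQuery(date=datetime.date.today())],
        local_scheduler=True)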