Apache Beam Pipeline Write to Multiple BQ tables - google-bigquery

I have a scenario where I need to do the following:
Read data from Pub/Sub.
Apply multiple transformations to the data.
Persist the PCollection to multiple Google BigQuery tables based on some config.
My question is: how can I write data to multiple BigQuery tables?
I searched for multiple BigQuery writes using Apache Beam but could not find any solution.
You can do that with three sinks. Here is an example with Beam Python:
import logging

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def map1(element):
    ...

def map2(element):
    ...

def map3(element):
    ...


def main() -> None:
    logging.getLogger().setLevel(logging.INFO)

    # YourOptions is your custom PipelineOptions subclass.
    your_options = PipelineOptions().view_as(YourOptions)
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:
        result_pcollection = (
            p
            | 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription')
            | 'Map 1' >> beam.Map(map1)
            | 'Map 2' >> beam.Map(map2)
            | 'Map 3' >> beam.Map(map3)
        )

        (result_pcollection
         | 'Write to BQ table 1' >> beam.io.WriteToBigQuery(
             project='project_id',
             dataset='dataset',
             table='table1',
             method='STREAMING_INSERTS',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection
         | 'Write to BQ table 2' >> beam.io.WriteToBigQuery(
             project='project_id',
             dataset='dataset',
             table='table2',
             method='STREAMING_INSERTS',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection
         | 'Write to BQ table 3' >> beam.io.WriteToBigQuery(
             project='project_id',
             dataset='dataset',
             table='table3',
             method='STREAMING_INSERTS',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    main()
The first PCollection is the result of the input from Pub/Sub.
I applied 3 transformations to the input PCollection and sank the result to 3 different BigQuery tables:
res = Flow
=> Map 1
=> Map 2
=> Map 3
res => Sink result to BQ table 1 with `BigQueryIO`
res => Sink result to BQ table 2 with `BigQueryIO`
res => Sink result to BQ table 3 with `BigQueryIO`
In this example I used STREAMING_INSERTS for ingestion to the BigQuery tables, but you can adapt and change it if needed in your case.
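If the target table really depends on per-element config rather than being a fixed set of three, note that a single sink can also route rows dynamically: WriteToBigQuery accepts a callable for its table argument, evaluated per element. A minimal sketch of that variant, where the route_table helper and the table_spec field are assumptions and not part of the original pipeline:

import apache_beam as beam

# Hypothetical router: each element is assumed to carry a 'table_spec' field
# (e.g. 'project_id:dataset.table1') decided earlier from your config.
def route_table(element):
    return element['table_spec']

# One sink writes each element to the table returned by route_table(element).
(result_pcollection
 | 'Write to config-driven tables' >> beam.io.WriteToBigQuery(
     table=route_table,
     method='STREAMING_INSERTS',
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
     create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))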

I see the previous answers satisfy your requirement of writing the same result to multiple tables. However, assuming the scenario below, a slightly different pipeline may help:
Read data from Pub/Sub.
Filter the data based on configs (from event message keys).
Apply different/same transformations to the filtered collections.
Write the results from the previous collections to different BigQuery sinks.
Here, we filter the events at an early stage in the pipeline, which helps you:
Avoid processing the same event messages multiple times.
Skip the messages which are not needed.
Apply only the relevant transformations to each group of event messages.
Build an overall efficient and cost-effective system.
For example, you are processing messages from all around the world and you need to process and store the data with respect to geography - storing Europe messages in the Europe region.
Also, you need to apply transformations which are relevant to the country-specific data - add an Aadhar number to messages generated from India and a Social Security number to messages generated from the USA.
And you don't want to process/store events from specific countries - in our use case, data from oceanic countries is irrelevant and does not need to be processed or stored.
So, in this made-up example, by filtering the data (based on the config) at an early stage, you can store country-specific data in multiple sinks, you don't have to run the Aadhar transformation on events generated from the USA or any other region (event-specific transformations), and you can skip/drop the remaining records or simply store them in BigQuery without applying any transformations.
If the above made-up example resembles your scenario, the sample pipeline design may look like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, ...
from apache_beam.io.gcp.internal.clients import bigquery


class TaggedData(beam.DoFn):
    def process(self, element):
        try:
            # filter here
            if element["country"] == "in":
                yield beam.pvalue.TaggedOutput("india", element)
            if element["country"] == "usa":
                yield beam.pvalue.TaggedOutput("usa", element)
            ...
        except Exception:
            yield beam.pvalue.TaggedOutput("unprocessed", element)


def addAadhar(element):
    """Filtered messages - only India"""
    yield "elementwithAadhar"


def addSSN(element):
    """Filtered messages - only USA"""
    yield "elementwithSSN"


p = beam.Pipeline(options=options)

messages = (
    p
    | "ReadFromPubSub" >> ...
    | "Tagging" >> beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', 'unprocessed', ...)
)

india_messages = (
    messages.india
    | "AddAadhar" >> ...
    | "WriteIndiaMsgToBQ" >> ...  # streaming inserts
)

usa_messages = (
    messages.usa
    | "AddSSN" >> ...
    | "WriteUSAMsgToBQ" >> ...  # streaming inserts
)

oceania_messages = (
    messages.oceania
    | "DoNothing&WriteOceaniaMsgToBQ" >> ...  # streaming inserts
)

deadletter = (
    # unprocessed messages plus the failed outputs of the other stages
    (messages.unprocessed, stage1.failed, stage2.failed)
    | "CombineAllFailed" >> beam.Flatten()
    | "WriteUnprocessed/InvalidMessagesToBQ" >> ...  # streaming inserts
)
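For completeness, here is a self-contained sketch of the tagged-output fan-out above, assuming the Pub/Sub messages are JSON with a country key; the project, dataset, table names, and subscription path are placeholders rather than values from the question, and the USA/Oceania branches would follow the same pattern as the India one:

import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


class TagByCountry(beam.DoFn):
    def process(self, element):
        try:
            record = json.loads(element.decode("utf-8"))
            if record.get("country") == "in":
                yield beam.pvalue.TaggedOutput("india", record)
            else:
                yield beam.pvalue.TaggedOutput("other", record)
        except Exception:
            yield beam.pvalue.TaggedOutput("unprocessed", {"raw": str(element)})


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        tagged = (
            p
            | "ReadFromPubSub" >> ReadFromPubSub(subscription="projects/p/subscriptions/s")
            | "Tagging" >> beam.ParDo(TagByCountry()).with_outputs(
                "india", "other", "unprocessed")
        )

        # India branch: its own transformations, then its own table.
        (tagged.india
         | "WriteIndiaToBQ" >> beam.io.WriteToBigQuery(
             table="project:dataset.india_events",
             method="STREAMING_INSERTS",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        # Dead-letter branch: unparseable messages go to their own table.
        (tagged.unprocessed
         | "WriteDeadLetterToBQ" >> beam.io.WriteToBigQuery(
             table="project:dataset.dead_letter",
             method="STREAMING_INSERTS",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))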

Related

Record Duplication in BigQuery while Running a DataFlow Job

I'm running an hourly Dataflow job that reads records from a source table, processes them, and writes them to a target table. Since some of the records may repeat in the source table, we've created a hash value based on the record fields of interest, appended it to the source table records read into memory, and filtered out the hashes already stored in the target table (the hash value is stored in the target table). This way we aim to avoid duplications across different jobs (triggered at different times).

To avoid duplication within the same job, we're using the Apache Beam GroupByKey method, where the key is the hash value, and we pick only the first element in the list. However, the duplication in BigQuery still persists. My only hunch is that maybe, due to multiple workers handling the same job, they might be out of sync and process the same data, but since I'm using pipelines all the way, this assumption sounds unreasonable (at least to me). Does any of you have an idea why the problem still persists?
Here's the job which creates the duplication:
with beam.Pipeline(options=options) as p:
    # read fields of interest from the source table
    records = p | 'Read Records from BigQuery' >> beam.io.Read(
        beam.io.ReadFromBigQuery(query=read_from_source_query, use_standard_sql=True))

    # step 1 - filter already existing records
    # read existing hashes from the target table
    hashes = p | 'read existing hashes from the target table' >> \
        beam.io.Read(beam.io.ReadFromBigQuery(
            query=select_hash_value_from_target_table,
            use_standard_sql=True)) | \
        'Get vals' >> beam.Map(lambda hash: hash['HashValue'])

    # add hash value to each record and filter out the ones which already exist in the target table
    hashed_records = (
        records
        | 'Add Hash Column in Memory to Each source table Record' >> beam.Map(lambda record: add_hash_field(record))
        | 'Filter Existing Hashes' >> beam.Filter(
            lambda record, hashes: record['HashValue'] not in hashes,
            hashes=beam.pvalue.AsIter(hashes))
    )

    # step 2 - filter duplicated hashes created on the same job
    key_val_records = (
        hashed_records | 'Create a Key Value Pair' >> beam.Map(lambda record: (record['HashValue'], record))
    )

    # combine elements with the same key and get only one of them
    unique_hashed_records = (
        key_val_records | 'Combine the Same Hashes' >> beam.GroupByKey()
        | 'Get First Element in Collection' >> beam.Map(lambda element: element[1][0])
    )

    records_to_store = unique_hashed_records | 'Create Records to Store' >> beam.ParDo(CreateTargetTableRecord(gal_options))

    records_to_store | 'Write to target table' >> beam.io.WriteToBigQuery(target_table)
As the code above suggests, I expected to have no duplicates in the target table, but I'm still getting them.

Export BigQuery results to GCS based on the value of a column

I've been using the following code to export BQ results to GCS.
export_query = f"""
EXPORT DATA
OPTIONS(
    uri='{uri}',
    format='CSV',
    overwrite=true,
    compression='GZIP')
AS {query}"""

client.query(export_query, project=project).result()
Now I need to split a large table based on the value of a particular column, and then export each part separately, i.e.
query_k = """
SELECT * FROM table WHERE col = k
"""
for k = 1, 2, 3, ..., N.
I can do it by running N queries, but that seems to be very slow and consumes a lot of resources. I am wondering if it is possible to accomplish the task with one single query.
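For reference, here is roughly what the N-query approach above looks like when driven from the BigQuery Python client; the table, column, and bucket path are placeholders, and the column values are assumed to be numeric:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders - adapt to your table, column, and bucket.
table = "project.dataset.table"
column = "col"
bucket_prefix = "gs://my-bucket/exports"

# Find the distinct values of the column, then run one EXPORT DATA per value.
values = [row[0] for row in client.query(
    f"SELECT DISTINCT {column} FROM `{table}`").result()]

for k in values:
    export_query = f"""
    EXPORT DATA OPTIONS(
        uri='{bucket_prefix}/{column}={k}/part-*.csv.gz',
        format='CSV',
        overwrite=true,
        compression='GZIP')
    AS SELECT * FROM `{table}` WHERE {column} = {k}
    """
    client.query(export_query).result()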

How to truncate a table in PySpark?

In one of my projects, I need to check if an input dataframe is empty or not. If it is not empty, I need to do a bunch of operations, load some results into a table, and overwrite the old data there.
On the other hand, if the input dataframe is empty, I do nothing and simply need to truncate the old data in the table. I know how to insert data with overwrite but don't know how to truncate the table only. I searched existing questions/answers and found no clear answer.
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
stage_url = 'jdbc:sqlserver://server_name\DEV:51433;databaseName=project_stage;user=xxxxx;password=xxxxxxx'

if input_df.count() > 0:
    # Do something here to generate result_df
    print("write to table")
    write_dbtable = 'Project_Stage.StageBase.result_table'
    write_df = result_df
    write_df.write.format('jdbc').option('url', stage_url).option('dbtable', write_dbtable). \
        option('truncate', 'true').mode('overwrite').option('driver', driver).save()
else:
    print('no account to process!')
    query = """TRUNCATE TABLE Project_Stage.StageBase.result_table"""
    ### Not sure how to run the query
Truncating is probably easiest done like this:
write_df = write_df.limit(0)
Also, for better performance, instead of input_df.count() > 0 you should use:
Spark 3.2 and below: len(input_df.head(1)) > 0
Spark 3.3+: not input_df.isEmpty()
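Putting that together with the JDBC write already in the question, the whole branch could look roughly like this; it's a sketch that reuses stage_url, driver, and the table name from the question, and assumes the empty DataFrame's schema is compatible with the target table:

# Sketch: mode('overwrite') with option('truncate', 'true') truncates the table
# instead of dropping it, so writing an empty DataFrame clears it.
if not input_df.isEmpty():          # Spark 3.3+; use len(input_df.head(1)) > 0 on older versions
    # ... generate result_df as in the question ...
    write_df = result_df
else:
    print('no account to process!')
    write_df = input_df.limit(0)    # empty DataFrame; schema must match the target table

write_df.write.format('jdbc') \
    .option('url', stage_url) \
    .option('dbtable', 'Project_Stage.StageBase.result_table') \
    .option('driver', driver) \
    .option('truncate', 'true') \
    .mode('overwrite') \
    .save()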

How can I efficiently create unique relationships in Neo4j?

Following up on my question here, I would like to create a constraint on relationships. That is, I would like there to be multiple nodes that share the same "neighborhood" name, but each uniquely points to a particular city in which it resides.
As encouraged in user2194039's answer, I am using the following index:
CREATE INDEX ON :Neighborhood(name)
Also, I have the following constraint:
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
The following code fails to create unique relationships, and takes an excessively long period of time:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});
Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).
Profile with Limit 10,000:
+--------------+------+--------+---------------------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+---------------------------+------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 9750 | 3360 | anon[307], b, neighborhood, line | MergePattern |
| SchemaIndex | 9750 | 19500 | b, line | line.City; :City(name) |
| ColumnFilter | 9750 | 0 | line | keep columns line |
| Filter | 9750 | 0 | anon[220], line | anon[220] |
| Extract | 10000 | 0 | anon[220], line | anon[220] |
| Slice | 10000 | 0 | line | { AUTOINT0} |
| LoadCSV | 10000 | 0 | line | |
+--------------+------+--------+---------------------------+------------------------------+
Total database accesses: 22860
Following Guilherme's recommendation below, I implemented the helper, yet it is raising the error py2neo.error.Finished. I've searched the documentation and wasn't able to determine a workaround. It looks like there's an open SO post about this exception.
def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
        results = tx.process()
        tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection
main:
queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000

c = 1
start = time.time()

# city_neighborhood_map is a defaultdict that maps city -> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c += 1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time()*1000
            r = run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

print c
if queries:
    s = time.time()*1000
    r = run_batch_query(queries, 300)
    e = time.time()*1000
    print("\t{0} {1:.00f}ms".format(c, e-s))

end = time.time()
print("End. {0}s".format(end-start))
If you want to create unique relationships you have 2 options:
Prevent the path from being duplicated using MERGE, just like user2194039 suggested. I think this is the simplest and best approach you can take.
Turn your relationship into a node, and create a unique constraint on it. But it's hardly necessary for most cases.
If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through LOAD CSV in 2.2.1, and it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process and probably gain a performance boost.
I managed to import 1M rows in just under 11 minutes.
I used an INDEX for Neighbourhood(name) and a unique constraint for City(name).
Give it a try and see if it works for you.
Edit:
The transactional endpoint is a RESTful endpoint that allows you to execute transactions in batch. You can read about it here.
Basically, it allows you to stream a bunch of queries to the server at once.
I don't know what programming language/stack you're using, but in python, using a package like py2neo, it would be something like this:
with open("city.csv", "r") as fp:
reader = csv.reader(fp)
queries = []
template = ["MERGE (c :`City` {name: {city}})",
"MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
for row in reader:
city, neighborhood = row
params = dict(city=city, neighborhood=neighborhood)
queries.append((statement, params))
if c % batch == 0:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 10)
e = time.time()*1000
print("\t{0}, {1:.00f}ms".format(c, e-s))
del queries[:]
c += 1
if queries:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 300)
e = time.time()*1000
print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))
Helper functions:
def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph(uri)  # "{protocol}://{host}:{port}/db/data/"
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
        results = tx.process()
        tx.commit()
    except http.SocketError as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection
You can monitor how long each transaction takes, and you can tweak the number of queries per transaction, as well as the timeout.
To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speeds on even small databases.
CREATE INDEX ON :Neighborhood(name)
And for the import:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})
If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.
Best of luck!

SQL query engine for text files on Linux?

We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) and a column header that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just use SQL queries on our text files. Since doing analysis is usually ad hoc, it would have to be minimal setup to query new files (just use the files I specify in this directory) rather than declaring particular tables in some config.
Practically speaking, what's the easiest? That is, the SQL engine and driver that is easiest to set up and use to apply against text files?
David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.
An example on squeal's home page:
$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source |
--------+--------------------+
1633 |kernel |
1324 |NetworkManager |
98 |ntpd |
70 |avahi-daemon |
63 |dhclient |
48 |setroubleshoot |
39 |dnsmasq |
29 |nm-system-settings |
27 |bluetoothd |
14 |/usr/sbin/gpm |
13 |acpid |
10 |init |
9 |pcscd |
9 |pulseaudio |
6 |gnome-keyring-ask |
6 |gnome-keyring-daemon|
6 |gnome-session |
6 |rsyslogd |
5 |rpc.statd |
4 |vpnc |
3 |gdm-session-worker |
2 |auditd |
2 |console-kit-daemon |
2 |libvirtd |
2 |rpcbind |
1 |nm-dispatcher.action|
1 |restorecond |
q - Run SQL directly on CSV or TSV files:
https://github.com/harelba/q
Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.
I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.
Sample invocation:
$ sql.py --file foo --sql "select count(*) from data"
The code:
#!/usr/bin/env python
"""Run a SQL statement on a text file"""

import os
import sys
import getopt
import tempfile
import re


class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg


def runCmd(cmd):
    if os.system(cmd):
        print "Error running " + cmd
        sys.exit(1)
        # TODO(dan): Return actual exit code


def usage():
    print >>sys.stderr, "Usage: sql.py --file file --sql sql"


def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error, msg:
            raise Usage(msg)
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o in ("--file"):
            filename = a
        elif o in ("--sql"):
            sql = a
        else:
            print "Found unexpected option " + o

    if not filename:
        print >>sys.stderr, "Must give --file"
        sys.exit(1)
    if not sql:
        print >>sys.stderr, "Must give --sql"
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header. If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    foo = open(filename)
    datafile = tempfile.NamedTemporaryFile()
    first = True
    for line in foo.readlines():
        if first:
            headers = line.rstrip().split()
            first = False
        else:
            print >>datafile, line,
    datafile.flush()
    #print datafile.name
    #runCmd("cat %s" % datafile.name)

    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

    cmdfile = tempfile.NamedTemporaryFile()
    #print cmdfile.name
    print >>cmdfile, create_statement
    print >>cmdfile, ".separator ' '"
    print >>cmdfile, ".import '" + datafile.name + "' data"
    print >>cmdfile, sql + ";"
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)


if __name__ == "__main__":
    sys.exit(main())
Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?
Depending on the amount of data, performance could be acceptable.
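A rough sketch of that idea in modern Python, using only the standard csv and sqlite3 modules with an in-memory database; the file name, delimiter, and query below are placeholders, and every data row is assumed to have the same number of columns as the header:

#!/usr/bin/env python3
"""Sketch: load a delimited text file into an in-memory SQLite DB and query it."""
import csv
import sqlite3
import sys

def query_file(path, sql, delimiter=","):
    with open(path, newline="") as fp:
        reader = csv.reader(fp, delimiter=delimiter)
        headers = next(reader)                      # first line gives column names
        conn = sqlite3.connect(":memory:")
        cols = ", ".join('"%s"' % h for h in headers)
        conn.execute("CREATE TABLE data (%s)" % cols)
        placeholders = ", ".join("?" for _ in headers)
        conn.executemany(
            "INSERT INTO data VALUES (%s)" % placeholders,
            (row for row in reader if row))
        return conn.execute(sql).fetchall()

if __name__ == "__main__":
    # e.g. python3 sqltext.py foo.csv "SELECT count(*) FROM data"
    for row in query_file(sys.argv[1], sys.argv[2]):
        print(row)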
MySQL has a CSV storage engine that might do what you need, if your files are CSV files.
Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport, which figures out columns etc. and creates the necessary table.
You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.
That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?
I have used Microsoft LogParser to query CSV files several times... and it serves the purpose. It was surprising to see such a useful tool from M$, and free at that!