I just want to load 5GB from MySql into BigQuery - google-bigquery

Long time no see. I'd like to get 5GB of data from MySql into BigQuery. My best bet seems to be some sort of CSV export / import, which doesn't work for various reasons; see these failed jobs:
agile-coral-830:splitpapers1501200518aa150120052659
agile-coral-830:splitpapers1501200545aa150120055302
agile-coral-830:splitpapers1501200556aa150120060231
This is likely because I don't have the right MySql incantation to generate perfect CSV in accordance with RFC 4180. However, instead of arguing RFC 4180 minutiae, this whole load business could be solved in five minutes by supporting customizable multi-character field separators and multi-character line separators. I'm pretty sure my data contains neither ### nor ###, so the following would work like a charm:
mysql> select * from $TABLE_NAME
into outfile '$DATA.csv'
fields terminated by '###'
enclosed by ''
lines terminated by '###'
$ bq load --nosync -F '###' -E '###' $TABLE_NAME $DATA.csv $SCHEMA.json
Edit: Fields contain '\n', '\r', ',' and '"'. They also contain NULLs, which MySql represents as [escape]N (it shows up as "N in the sample below). Sample row:
"10.1.1.1.1483","5","9074080","Candidate high myopia loci on chromosomes 18p and 12q do not play a major role in susceptibility to common myopia","Results
There was no strong evidence of linkage of common myopia to these candidate regions: all two-point and multipoint heterogeneity LOD scores were < 1.0 and non-parametric linkage p-values were > 0.01. However, one Amish family showed slight evidence of linkage (LOD>1.0) on 12q; another 3 Amish families each gave LOD >1.0 on 18p; and 3 Jewish families each gave LOD >1.0 on 12q.
Conclusions
Significant evidence of linkage (LOD> 3) of myopia was not found on chromosome 18p or 12q loci in these families. These results suggest that these loci do not play a major role in the causation of common myopia in our families studied.","2004","BMC MEDICAL GENETICS","JOURNAL","N,"5","20","","","","0","1","USER","2007-11-19 05:00:00","rep1","PDFLib TET","0","2009-05-24 20:33:12"
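One way to sidestep the MySql quoting incantations is to let Python's csv module produce the RFC 4180 quoting instead of SELECT ... INTO OUTFILE. This is only a minimal sketch (connection parameters, table and file names are placeholders, and it has not been run against this schema):

import csv
import MySQLdb  # or any DB-API compatible driver

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='papers_db', charset='utf8')
cursor = conn.cursor()
cursor.execute("SELECT * FROM papers")  # placeholder table name

with open('papers.csv', 'w', newline='', encoding='utf-8') as f:
    # QUOTE_ALL + doublequote=True gives RFC 4180-style output:
    # embedded quotes are doubled and embedded newlines stay inside quoted fields.
    writer = csv.writer(f, quoting=csv.QUOTE_ALL, doublequote=True,
                        lineterminator='\r\n')
    for row in cursor:
        # MySQL NULLs come back as None; write them as empty strings.
        writer.writerow(['' if value is None else value for value in row])

The resulting file should then load with bq load using --allow_quoted_newlines, so the embedded '\n' and '\r' characters survive.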

I found loading through CSV very difficult, with more restrictions and complications. I have been experimenting this morning with moving data from MySQL to BigQuery.
Below is a Python script that will build the table decorator and stream the data directly into the BigQuery table.
My db is in the Cloud, so you may need to change the connection string. Fill in the missing values for your particular situation, then call it with:
SQLToBQBatch(tableName, limit)
I put the limit in for testing. For my final test I sent 999999999 for the limit and everything worked fine.
I would recommend using a backend module to run this over 5 GB.
Use "RowToJSON" to clean up any invalid characters (i.e. anything that isn't UTF-8).
I haven't tested on 5 GB, but it was able to do 50k rows in about 20 seconds. The same load via CSV took over 2 minutes.
I wrote this to test things, so please excuse the bad coding practices and mini hacks. It works, so feel free to clean it up for any production-level work.
import MySQLdb
import logging
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials
import httplib2
OAUTH_SCOPE = 'https://www.googleapis.com/auth/bigquery'
PROJECT_ID =
DATASET_ID =
TABLE_ID =
SQL_DATABASE_NAME =
SQL_DATABASE_DB =
SQL_USER =
SQL_PASS =
def Connect():
    return MySQLdb.connect(unix_socket='/cloudsql/' + SQL_DATABASE_NAME, db=SQL_DATABASE_DB, user=SQL_USER, passwd=SQL_PASS)
def RowToJSON(cursor, row, fields):
    newData = {}
    for i, value in enumerate(row):
        try:
            # Cast numeric columns; anything else raises and falls through
            # to the string clean-up below.
            if fields[i]["type"] == bqTypeDict["int"]:
                value = int(value)
            else:
                value = float(value)
        except:
            # Map common cp1252 punctuation bytes to ASCII and escape quotes.
            if value is not None:
                value = value.replace("\x92", "'") \
                             .replace("\x96", "'") \
                             .replace("\x93", '"') \
                             .replace("\x94", '"') \
                             .replace("\x97", '-') \
                             .replace("\xe9", 'e') \
                             .replace("\x91", "'") \
                             .replace("\x85", "...") \
                             .replace("\xb4", "'") \
                             .replace('"', '""')
        newData[cursor.description[i][0]] = value
    return newData
def GetBuilder():
    return build('bigquery', 'v2', http=AppAssertionCredentials(scope=OAUTH_SCOPE).authorize(httplib2.Http()))
bqTypeDict = { 'int' : 'INTEGER',
'varchar' : 'STRING',
'double' : 'FLOAT',
'tinyint' : 'INTEGER',
'decimal' : 'FLOAT',
'text' : 'STRING',
'smallint' : 'INTEGER',
'char' : 'STRING',
'bigint' : 'INTEGER',
'float' : 'FLOAT',
'longtext' : 'STRING'
}
def BuildFields(table):
    # Derive the BigQuery schema from the MySQL table definition.
    conn = Connect()
    cursor = conn.cursor()
    cursor.execute("DESCRIBE %s;" % table)
    tableDecorator = cursor.fetchall()
    fields = []
    for col in tableDecorator:
        field = {}
        field["name"] = col[0]
        colType = col[1].split("(")[0]
        if colType not in bqTypeDict:
            logging.warning("Unknown type detected, using string: %s", str(col[1]))
        field["type"] = bqTypeDict.get(colType, "STRING")
        if col[2] == "YES":
            field["mode"] = "NULLABLE"
        fields.append(field)
    return fields
def SQLToBQBatch(table, limit=3000):
    logging.info("****************************************************")
    logging.info("Starting SQLToBQBatch. Got: Table: %s, Limit: %i" % (table, limit))
    bqDest = GetBuilder()
    fields = BuildFields(table)

    # Create the destination dataset, ignoring "already exists" errors.
    try:
        responce = bqDest.datasets().insert(projectId=PROJECT_ID, body={'datasetReference':
                                            {'datasetId': DATASET_ID}}).execute()
        logging.info("Added Dataset")
        logging.info(responce)
    except Exception as e:
        logging.info(e)
        if ("Already Exists: " in str(e)):
            logging.info("Dataset already exists")
        else:
            logging.error("Error creating dataset: " + str(e), "Error")

    # Create the destination table with the schema derived from DESCRIBE.
    try:
        responce = bqDest.tables().insert(projectId=PROJECT_ID, datasetId=DATASET_ID,
                                          body={'tableReference': {'projectId': PROJECT_ID,
                                                                   'datasetId': DATASET_ID,
                                                                   'tableId': TABLE_ID},
                                                'schema': {'fields': fields}}
                                          ).execute()
        logging.info("Added Table")
        logging.info(responce)
    except Exception as e:
        logging.info(e)
        if ("Already Exists: " in str(e)):
            logging.info("Table already exists")
        else:
            logging.error("Error creating table: " + str(e), "Error")

    conn = Connect()
    cursor = conn.cursor()

    logging.info("Starting load loop")
    count = -1
    cur_pos = 0
    total = 0
    batch_size = 1000
    while count != 0 and cur_pos < limit:
        count = 0
        if batch_size + cur_pos > limit:
            batch_size = limit - cur_pos
        sqlCommand = "SELECT * FROM %s LIMIT %i, %i" % (table, cur_pos, batch_size)
        logging.info("Running: %s", sqlCommand)
        cursor.execute(sqlCommand)

        data = []
        for _, row in enumerate(cursor.fetchall()):
            data.append({"json": RowToJSON(cursor, row, fields)})
            count += 1
        logging.info("Read complete")

        if count != 0:
            logging.info("Sending request")
            insertResponse = bqDest.tabledata().insertAll(
                projectId=PROJECT_ID,
                datasetId=DATASET_ID,
                tableId=TABLE_ID,
                body={"rows": data}).execute()
            cur_pos += batch_size
            total += count
            logging.info("Done %i, Total: %i, Response: %s", count, total, insertResponse)
            # insertErrors is a list of {index, errors} objects, one per failed row.
            if "insertErrors" in insertResponse:
                for insert_error in insertResponse["insertErrors"]:
                    logging.error("Error inserting data index: %i", insert_error["index"])
                    for error in insert_error["errors"]:
                        logging.error(error)
        else:
            logging.info("No more rows")
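As an alternative to the hard-coded byte replacements in RowToJSON above, the raw bytes can be decoded as Windows-1252 (the code page those \x92/\x93-style bytes come from) and re-encoded as UTF-8. A minimal Python 2 sketch under that assumption; clean_value is a hypothetical helper, not part of the script above:

def clean_value(value):
    # Assumes MySQLdb handed back an undecoded byte string in Windows-1252.
    if isinstance(value, str):
        return value.decode('cp1252', 'replace').encode('utf-8')
    return value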

• Generate a Google service account key
o IAM & Admin > Service accounts > Create service account
o Once created, create a key, download it, and save it to the project folder on the local machine as google_key.json
• Run the code in a PyCharm environment after installing the packages.
NOTE: The table data in MySQL remains intact. Also, newly streamed rows may not show up if you use Preview in BQ; go to the console and fire a query instead.
• CODE
import MySQLdb
from google.cloud import bigquery
import mysql.connector
import logging
import os
from MySQLdb.converters import conversions
import click
import MySQLdb.cursors
from google.cloud.exceptions import ServiceUnavailable
import sys
bqTypeDict = {'int': 'INTEGER',
'varchar': 'STRING',
'double': 'FLOAT',
'tinyint': 'INTEGER',
'decimal': 'FLOAT',
'text': 'STRING',
'smallint': 'INTEGER',
'char': 'STRING',
'bigint': 'INTEGER',
'float': 'FLOAT',
'longtext': 'STRING',
'datetime': 'TIMESTAMP'
}
def conv_date_to_timestamp(str_date):
    import time
    import datetime
    date_time = MySQLdb.times.DateTime_or_None(str_date)
    unix_timestamp = (date_time - datetime.datetime(1970, 1, 1)).total_seconds()
    return unix_timestamp

def Connect(host, database, user, password):
    # Use the arguments passed in rather than hard-coded credentials.
    return mysql.connector.connect(host=host,
                                   database=database,
                                   user=user,
                                   password=password)
def BuildSchema(host, database, user, password, table):
    logging.debug('build schema for table %s in database %s' % (table, database))
    conn = Connect(host, database, user, password)
    cursor = conn.cursor()
    cursor.execute("DESCRIBE %s;" % table)
    tableDecorator = cursor.fetchall()
    schema = []
    for col in tableDecorator:
        colType = col[1].split("(")[0]
        if colType not in bqTypeDict:
            logging.warning("Unknown type detected, using string: %s", str(col[1]))
        field_mode = "NULLABLE" if col[2] == "YES" else "REQUIRED"
        field = bigquery.SchemaField(col[0], bqTypeDict.get(colType, "STRING"), mode=field_mode)
        schema.append(field)
    return tuple(schema)
def bq_load(table, data, max_retries=5):
    logging.info("Sending request")
    uploaded_successfully = False
    num_tries = 0
    while not uploaded_successfully and num_tries < max_retries:
        try:
            insertResponse = table.insert_data(data)
            for row in insertResponse:
                if 'errors' in row:
                    logging.error('not able to upload data: %s', row['errors'])
            uploaded_successfully = True
        except ServiceUnavailable as e:
            num_tries += 1
            logging.error('insert failed with exception trying again retry %d', num_tries)
        except Exception as e:
            num_tries += 1
            logging.error('not able to upload data: %s', str(e))
@click.command()
@click.option('-h', '--host', default='tempus-qa.hashmapinc.com', help='MySQL hostname')
@click.option('-d', '--database', required=True, help='MySQL database')
@click.option('-u', '--user', default='root', help='MySQL user')
@click.option('-p', '--password', default='docker', help='MySQL password')
@click.option('-t', '--table', required=True, help='MySQL table')
@click.option('-i', '--projectid', required=True, help='Google BigQuery Project ID')
@click.option('-n', '--dataset', required=True, help='Google BigQuery Dataset name')
@click.option('-l', '--limit', default=0, help='max num of rows to load')
@click.option('-s', '--batch_size', default=1000, help='max num of rows to load')
@click.option('-k', '--key', default='key.json', help='Location of google service account key (relative to current working dir)')
@click.option('-v', '--verbose', default=0, count=True, help='verbose')
def SQLToBQBatch(host, database, user, password, table, projectid, dataset, limit, batch_size, key, verbose):
    # set to max verbose level
    verbose = verbose if verbose < 3 else 3
    loglevel = logging.ERROR - (10 * verbose)
    logging.basicConfig(level=loglevel)
    logging.info("Starting SQLToBQBatch. Got: Table: %s, Limit: %i", table, limit)

    ## set env key to authenticate application
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(os.getcwd(), key)
    print('file found')

    # Instantiates a client
    bigquery_client = bigquery.Client()
    print('Project id created')

    try:
        bq_dataset = bigquery_client.dataset(dataset)
        bq_dataset.create()
        logging.info("Added Dataset")
    except Exception as e:
        if ("Already Exists: " in str(e)):
            logging.info("Dataset already exists")
        else:
            logging.error("Error creating dataset: %s Error", str(e))

    bq_table = bq_dataset.table(table)
    bq_table.schema = BuildSchema(host, database, user, password, table)
    print('Creating schema using build schema')
    bq_table.create()
    logging.info("Added Table %s", table)

    conn = Connect(host, database, user, password)
    cursor = conn.cursor()

    logging.info("Starting load loop")
    cursor.execute("SELECT * FROM %s" % (table))
    cur_batch = []
    count = 0
    for row in cursor:
        count += 1
        if limit != 0 and count >= limit:
            logging.info("limit of %d rows reached", limit)
            break
        cur_batch.append(row)
        if count % batch_size == 0 and count != 0:
            bq_load(bq_table, cur_batch)
            cur_batch = []
            logging.info("processed %i rows", count)

    # send last elements
    bq_load(bq_table, cur_batch)
    logging.info("Finished (%i total)", count)
    print("table created")

if __name__ == '__main__':
    # run the command
    SQLToBQBatch()
• Command to run the file: python mysql_to_bq.py -d 'recommendation_spark' -t temp_market_store -i inductive-cocoa-250507 -n practice123 -k key.json
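To verify the streamed rows mentioned in the NOTE above (they may not show up in the Preview tab right away), a quick count can also be run from code. This is only a sketch: it assumes a newer google-cloud-bigquery client than the API the script above was written against, and it reuses the dataset/table names from the example command, which may differ in your setup.

from google.cloud import bigquery

client = bigquery.Client()
# practice123 / temp_market_store come from the example command above.
query = "SELECT COUNT(*) AS row_count FROM `practice123.temp_market_store`"
for row in client.query(query).result():
    print("rows loaded so far: %d" % row.row_count)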

Related

Pool apply function hangs and never executes

I am trying to fetch Rally data using its Python library pyral. The same code works sequentially, but it is slow.
I thought of using the Python multiprocessing package; however, my pool.apply_async call gets stuck and never executes. I tried running it in the PyCharm IDE as well as the Windows command prompt.
import pandas as pd
from pyral import Rally
from multiprocessing import Pool, Manager
from pyral.entity import Project
def process_row(sheetHeaders: list, item: Project, L: list):
    print('processing row : ' + item.Name)  ## this print never gets called
    row = []  # must be a list, not a tuple, so append() works
    for header in sheetHeaders:
        row.append(process_cell(header, item))
    L.append(row)
def process_cell(attr, item: Project):
    param = getattr(item, attr)
    if param is None:
        return None
    try:
        if attr == 'Owner':
            return param.__getattr__('Name')
        elif attr == 'Parent':
            return param.__getattr__('ObjectID')
        else:
            return param
    except KeyError as e:
        print(e)
# Projects
# PortfolioItem
# User Story
# Hierarchical Req
# tasks
# defects
# -------------MAIN-----------------
def main():
    # Rally connection
    rally = Rally('rally1.rallydev.com', apikey='<my_key>')
    file = 'rally_data.xlsx'
    headers = {
        'Project': ['Name', 'Description', 'CreationDate', 'ObjectID', 'Parent', 'Owner', 'State'],
    }
    sheetName = 'Project'
    sheetHeaders = headers.get(sheetName)

    p = Pool(1)
    result = rally.get(sheetName, fetch=True, pagesize=10)
    with Manager() as manager:
        L = manager.list()
        for item in result:
            print('adding row for : ' + item.Name)
            p.apply_async(func=process_row, args=(sheetHeaders, item, L))  ## gets stuck here
        p.close()
        p.join()
        pd.DataFrame(L).to_excel(file, sheet_name=sheetName)

if __name__ == '__main__':
    main()
I also tried without the Manager list, with no difference in the outcome:
def main():
    # Rally connection
    rally = Rally('rally1.rallydev.com', apikey='<key>')
    file = 'rally_data.xlsx'
    headers = {
        'Project': ['Name', 'Description', 'CreationDate', 'ObjectID', 'Parent', 'Owner', 'State'],
    }
    sheetName = 'Project'
    sheetHeaders = headers.get(sheetName)

    result = rally.get(sheetName, fetch=True, pagesize=10)
    async_results = []
    with Pool(50) as p:
        for item in result:
            print('adding row for : ' + item.Name)
            async_results.append(p.apply_async(func=process_row, args=(sheetHeaders, item)))
        res = [r.get() for r in async_results]
    pd.DataFrame(res).to_excel(file, sheet_name=sheetName)
I don't know why, but replacing multiprocessing with multiprocessing.dummy in the import statement worked.
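For reference, the change is just the import line below. multiprocessing.dummy exposes the same Pool API but backs it with threads instead of processes, so the pyral objects never have to be pickled and shipped to worker processes, which is the likely reason the process-based pool stalled:

# before (process-based pool, hangs):
# from multiprocessing import Pool
# after (thread-based drop-in replacement):
from multiprocessing.dummy import Pool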

Python Dbf append to memory indexed table fails

I'm using the Python dbf-0.99.1 library from Ethan Furman. This approach to adding a record to a table fails:
tab = dbf.Table( "MYTABLE" )
tab.open(mode=dbf.READ_WRITE)
idx = tab.create_index(lambda rec: (rec.id if not is_deleted(rec) else DoNotIndex ) ) # without this, append works
rec = { "id":id, "col2": val2 } # some values, id is numeric and is not None
tab.append( rec ) # fails here
My table contains various character and numeric columns; this is just an example. The exception is:
line 5959, in append
newrecord = Record(recnum=header.record_count, layout=meta, kamikaze=kamikaze)
line 3102, in __new__
record._update_disk()
line 3438, in _update_disk
index(self)
line 7550, in __call__
vindex = bisect_right(self._values, key)
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Any help appreciated. Thanks.
EDIT: Here is a testing script:
import dbf
from dbf import is_deleted, DoNotIndex
tab = dbf.Table('temptable', "ID N(12,0)" )
tab.open(mode=dbf.READ_WRITE)
rc = { "id":1 }
tab.append( rc ) # need some data without index first
idx = tab.create_index(lambda rec: (rec.id if not is_deleted(rec) else DoNotIndex ) )
rc = { "id":2 }
tab.append( rc ) # fails here
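Judging from the traceback, the index callback appears to receive a record whose id is still None at the moment append() re-indexes the table, and bisect cannot compare None with the existing integer keys. A hedged workaround sketch (not verified against dbf-0.99.1) is to send None ids to DoNotIndex as well:

idx = tab.create_index(
    lambda rec: rec.id if not (is_deleted(rec) or rec.id is None) else DoNotIndex
)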

Ruby - Implement an SQL server?

I have an application which has a Ruby API. I would like to link to this application from a SQL server system.
Is there a way for me to implement a Ruby SQL server which receives SQL statements and returns the requested data from the application? Is it then possible to hook into this from an SQL server application?
E.G.
# request as string like "SELECT * FROM MAIN_TABLE WHERE SOME_COLUMN = <SOME DATA>"
SQLEngine.OnRequest do |request|
Application.RunSQL(request)
end
P.S. I don't have any experience with SQL server, so have no idea how one would go about this...
Note: I'm not asking how I can query an SQL server database, I'm asking how I can implement an SQL server connection.
After some searching I found a few other Stack Overflow questions about how to write database drivers in other languages:
creating a custom odbc driver for application
Implementing a ODBC driver
Creating a custom ODBC driver
Alternatives to writing an ODBC driver
These will potentially be useful for others going down this road. The most promising suggestion is implementing the wire protocol; a PostgreSQL example has been written in Python, which should be relatively easy to port:
import SocketServer
import struct
def char_to_hex(char):
    retval = hex(ord(char))
    if len(retval) == 4:
        return retval[-2:]
    else:
        assert len(retval) == 3
        return "0" + retval[-1]

def str_to_hex(inputstr):
    return " ".join(char_to_hex(char) for char in inputstr)
class Handler(SocketServer.BaseRequestHandler):
    def handle(self):
        print "handle()"
        self.read_SSLRequest()
        self.send_to_socket("N")
        self.read_StartupMessage()
        self.send_AuthenticationClearText()
        self.read_PasswordMessage()
        self.send_AuthenticationOK()
        self.send_ReadyForQuery()
        self.read_Query()
        self.send_queryresult()

    def send_queryresult(self):
        fieldnames = ['abc', 'def']
        HEADERFORMAT = "!cih"
        fields = ''.join(self.fieldname_msg(name) for name in fieldnames)
        rdheader = struct.pack(HEADERFORMAT, 'T', struct.calcsize(HEADERFORMAT) - 1 + len(fields), len(fieldnames))
        self.send_to_socket(rdheader + fields)
        rows = [[1, 2], [3, 4]]
        DRHEADER = "!cih"
        for row in rows:
            dr_data = struct.pack("!ii", -1, -1)
            dr_header = struct.pack(DRHEADER, 'D', struct.calcsize(DRHEADER) - 1 + len(dr_data), 2)
            self.send_to_socket(dr_header + dr_data)
        self.send_CommandComplete()
        self.send_ReadyForQuery()

    def send_CommandComplete(self):
        HFMT = "!ci"
        msg = "SELECT 2\x00"
        self.send_to_socket(struct.pack(HFMT, "C", struct.calcsize(HFMT) - 1 + len(msg)) + msg)

    def fieldname_msg(self, name):
        tableid = 0
        columnid = 0
        datatypeid = 23
        datatypesize = 4
        typemodifier = -1
        format_code = 0  # 0=text 1=binary
        return name + "\x00" + struct.pack("!ihihih", tableid, columnid, datatypeid, datatypesize, typemodifier, format_code)

    def read_socket(self):
        print "Trying recv..."
        data = self.request.recv(1024)
        print "Received {} bytes: {}".format(len(data), repr(data))
        print "Hex: {}".format(str_to_hex(data))
        return data

    def send_to_socket(self, data):
        print "Sending {} bytes: {}".format(len(data), repr(data))
        print "Hex: {}".format(str_to_hex(data))
        return self.request.sendall(data)

    def read_Query(self):
        data = self.read_socket()
        msgident, msglen = struct.unpack("!ci", data[0:5])
        assert msgident == "Q"
        print data[5:]

    def send_ReadyForQuery(self):
        self.send_to_socket(struct.pack("!cic", 'Z', 5, 'I'))

    def read_PasswordMessage(self):
        data = self.read_socket()
        b, msglen = struct.unpack("!ci", data[0:5])
        assert b == "p"
        print "Password: {}".format(data[5:])

    def read_SSLRequest(self):
        data = self.read_socket()
        msglen, sslcode = struct.unpack("!ii", data)
        assert msglen == 8
        assert sslcode == 80877103

    def read_StartupMessage(self):
        data = self.read_socket()
        msglen, protoversion = struct.unpack("!ii", data[0:8])
        print "msglen: {}, protoversion: {}".format(msglen, protoversion)
        assert msglen == len(data)
        parameters_string = data[8:]
        print parameters_string.split('\x00')

    def send_AuthenticationOK(self):
        self.send_to_socket(struct.pack("!cii", 'R', 8, 0))

    def send_AuthenticationClearText(self):
        self.send_to_socket(struct.pack("!cii", 'R', 8, 3))

if __name__ == "__main__":
    server = SocketServer.TCPServer(("localhost", 9876), Handler)
    try:
        server.serve_forever()
    except:
        server.shutdown()
Every programming language – including Ruby – supplies packages which implement interfaces to various SQL servers.
Start here: Ruby database access.

Scrapy best practice

I'm using scrapy to download a large amount of data, with the default of 16 concurrent requests.
Following a guide, I use the pipeline method process_item to collect the data in a shared variable, and in close_spider I save the data to SQL.
If I crawl a very large website, I run out of system memory.
How should I avoid that problem? At the moment I use one DB connection, prepared in the open_spider method, and I cannot use it in every process_item call simultaneously.
Create a list of scraped items in your pipeline, and once that list's size reaches N, call the DB function to save the data. Here is 100% working code from my project. Note close_spider(): when the spider is closed, there is a chance that self.items holds fewer than N items, so any remaining data in the self.items list is also saved to the DB as the spider shuts down.
from scrapy import signals
class YourPipeline(object):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.extend([item])
        if len(self.items) >= 50:
            self.insert_current_items(spider)
        return item

    def insert_current_items(self, spider):
        for item in self.items:
            update_query = ', '.join(["`" + key + "` = %s " for key, value in item.iteritems()])
            query = "SELECT asin FROM " + spider.tbl_name + " WHERE asin = %s LIMIT 1"
            spider.cursor.execute(query, (item['asin'],))
            existing = spider.cursor.fetchone()
            if spider.cursor.rowcount > 0:
                query = "UPDATE " + spider.tbl_name + " SET " + update_query + ", date_update = CURRENT_TIMESTAMP WHERE asin = %s"
                update_query_vals = list(item.values())
                update_query_vals.extend([existing['YOUR_UNIQUE_COLUMN']])
                try:
                    spider.cursor.execute(query, update_query_vals)
                except Exception as e:
                    if 'MySQL server has gone away' in str(e):
                        spider.connectDB()
                        spider.cursor.execute(query, update_query_vals)
                    else:
                        raise e
            else:
                # This ELSE is likely never to get executed because we are not scraping ASINs from the Amazon website, we just import ASINs into the DB from another script
                try:
                    placeholders = ', '.join(['%s'] * len(item))
                    columns = ', '.join(item.keys())
                    query = "INSERT INTO %s ( %s ) VALUES ( %s )" % (spider.tbl_name, columns, placeholders)
                    spider.cursor.execute(query, item)
                except Exception as e:
                    if 'MySQL server has gone away' in str(e):
                        spider.connectDB()
                        spider.cursor.execute(query, item)
                    else:
                        raise e
        self.items = []

    def close_spider(self, spider):
        self.insert_current_items(spider)
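For this to run, the pipeline also has to be enabled in the Scrapy project settings; a minimal sketch, assuming the class above lives in a module called yourproject/pipelines.py (the module path is a placeholder):

# settings.py
ITEM_PIPELINES = {
    'yourproject.pipelines.YourPipeline': 300,
}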

Error: Message: Too many sources provided: 15285. Limit is 10000

I'm currently trying to run a Dataflow (Apache Beam, Python SDK) task to import a >100GB Tweet file into BigQuery, but I am running into Error: Message: Too many sources provided: 15285. Limit is 10000.
The task takes the tweets (JSON), extracts 5 relevant fields, transforms/sanitizes them a bit, and then writes those values into BigQuery, which will be used for further processing.
There's Cloud Dataflow to BigQuery - too many sources, but that one seems to be caused by having a lot of different input files, whereas I have a single input file, so it doesn't seem relevant. Also, the solutions mentioned there are rather cryptic and I'm not sure whether or how I could apply them to my problem.
My guess is that BigQuery writes temporary files for each row or something before persisting them, and that's what's meant by "too many sources"?
How can I fix this?
[Edit]
Code:
import argparse
import json
import logging
import apache_beam as beam
class JsonCoder(object):
    """A JSON coder interpreting each line as a JSON string."""
    def encode(self, x):
        return json.dumps(x)

    def decode(self, x):
        return json.loads(x)

def filter_by_nonempty_county(record):
    if 'county_fips' in record and record['county_fips'] is not None:
        yield record

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        default='...',
                        help=('Input twitter json file specified as: '
                              'gs://path/to/tweets.json'))
    parser.add_argument(
        '--output',
        required=True,
        help=('Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
              'or DATASET.TABLE.'))
    known_args, pipeline_args = parser.parse_known_args(argv)
    p = beam.Pipeline(argv=pipeline_args)

    # read text file
    # Read all tweets from given source file
    read_tweets = "Read Tweet File" >> beam.io.ReadFromText(known_args.input, coder=JsonCoder())

    # Extract the relevant fields of the source file
    extract_fields = "Project relevant fields" >> beam.Map(lambda row: {'text': row['text'],
                                                                        'user_id': row['user']['id'],
                                                                        'location': row['user']['location'] if 'location' in row['user'] else None,
                                                                        'geo': row['geo'] if 'geo' in row else None,
                                                                        'tweet_id': row['id'],
                                                                        'time': row['created_at']})

    # check what type of geo-location the user has
    has_geo_location_or_not = "partition by has geo or not" >> beam.Partition(lambda element, partitions: 0 if element['geo'] is None else 1, 2)

    check_county_not_empty = lambda element, partitions: 1 if 'county_fips' in element and element['county_fips'] is not None else 0

    # tweet has coordinates partition or not
    coordinate_partition = (p
                            | read_tweets
                            | extract_fields
                            | beam.ParDo(TimeConversion())
                            | has_geo_location_or_not)

    # lookup by coordinates
    geo_lookup = (coordinate_partition[1] | "geo coordinates mapping" >> beam.ParDo(BeamGeoLocator())
                  | "filter successful geo coords" >> beam.Partition(check_county_not_empty, 2))

    # lookup by profile
    profile_lookup = ((coordinate_partition[0], geo_lookup[0])
                      | "join streams" >> beam.Flatten()
                      | "Lookup from profile location" >> beam.ParDo(ComputeLocationFromProfile())
                      )

    bigquery_output = "write output to BigQuery" >> beam.io.Write(
        beam.io.BigQuerySink(known_args.output,
                             schema='text:STRING, user_id:INTEGER, county_fips:STRING, tweet_id:INTEGER, time:TIMESTAMP, county_source:STRING',
                             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    # file_output = "write output" >> beam.io.WriteToText(known_args.output, coder=JsonCoder())

    output = ((profile_lookup, geo_lookup[1]) | "merge streams" >> beam.Flatten()
              | "Filter entries without location" >> beam.FlatMap(filter_by_nonempty_county)
              | "project relevant fields" >> beam.Map(lambda row: {'text': row['text'],
                                                                   'user_id': row['user_id'],
                                                                   'county_fips': row['county_fips'],
                                                                   'tweet_id': row['tweet_id'],
                                                                   'time': row['time'],
                                                                   'county_source': row['county_source']})
              | bigquery_output)

    result = p.run()
    result.wait_until_finish()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    run()
It's a little bit complicated, so it would probably take too much time to do it in BigQuery directly. The code reads the tweet JSON and splits the PCollection by whether it's geotagged or not; if not, it tries to look the location up via the profile location, maps the location to what's relevant for our GIS analysis, and then writes it to BigQuery.
The number of files corresponds to the number of shards in which the elements were processed.
One trick to reducing this is to generate some random keys and group the elements based on them before writing them out.
For example, you could use the following DoFn and PTransform in your pipeline:
import random

class _RoundRobinKeyFn(beam.DoFn):
    def __init__(self, count):
        self.count = count

    def start_bundle(self):
        self.counter = random.randint(0, self.count - 1)

    def process(self, element):
        self.counter += 1
        if self.counter >= self.count:
            self.counter -= self.count
        yield self.counter, element

class LimitBundles(beam.PTransform):
    def __init__(self, count):
        self.count = count

    def expand(self, input):
        # Parentheses keep the chained pipe expression as a single return value.
        return (input
                | beam.ParDo(_RoundRobinKeyFn(self.count))
                | beam.GroupByKey()
                | beam.FlatMap(lambda kv: kv[1]))
You would just use this before the bigquery_output:
output = (# ...
| LimitBundles(10000)
| bigquery_output)
(Note that I just typed this in without testing it, so there are likely some Python typos.)