Append to tables in python - google-bigquery

I would like to simply update a table in BigQuery from Python. I have a large table of data that I need to update every hour.
The closest thing to updating tables I could find is this link here. However, only the command line and web UI appear to be supported for this feature.
Is it possible to do so? Or are there other alternatives? I tried searching for a similar question but did not find one. Thanks.

There is a Python example in the documentation you shared, and you can find the complete examples on GitHub.
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for table load to complete.
I’m not sure if this is what you mean by “updating” the table, but you can truncate before inserting or append to the existing data by changing the write disposition of your load job’s configuration. Possible values are:
WRITE_TRUNCATE,
WRITE_APPEND,
WRITE_EMPTY
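As a rough sketch of how the write disposition fits into the load job above: the helper below sets it on the job configuration before loading a local file. The function name load_file, the file name new_rows.csv and the table ID my_project.my_dataset.my_table are made up for illustration; main() assumes the google-cloud-bigquery package is installed and default credentials are available.

```python
def load_file(client, job_config, filename, table_id, write_disposition="WRITE_APPEND"):
    """Load a local file into table_id, appending to existing rows by default."""
    allowed = {"WRITE_TRUNCATE", "WRITE_APPEND", "WRITE_EMPTY"}
    if write_disposition not in allowed:
        raise ValueError(f"unknown write disposition: {write_disposition}")
    # These plain strings are the values behind bigquery.WriteDisposition.*
    job_config.write_disposition = write_disposition
    with open(filename, "rb") as source_file:
        job = client.load_table_from_file(source_file, table_id, job_config=job_config)
    return job.result()  # Waits for the load to complete.


def main():
    # Imported here so the helper above can be read and tested without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )
    # Hypothetical file and table names.
    load_file(client, job_config, "new_rows.csv", "my_project.my_dataset.my_table")
```

Running main() once an hour (from cron or a scheduler) would then append each new file to the table; passing write_disposition="WRITE_TRUNCATE" would replace the table contents instead.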

Related

Airflow GCSToBigQueryOperator is reordering my columns

I have the following operators in my DAG. They are receiving data from my MySQL database, uploading it to GCS, and then importing it to BigQuery. It runs great! With one small issue...
I can see that, in between the create and import tasks, the target table is created in BigQuery with the schema specified in the schema argument, with the correct column ordering. But as soon as the import task runs, the schema of the table changes, and the columns are reordered into a seemingly arbitrary order. Why does this happen, and is there a way to get BigQuery to stop doing this? I see that there are schema_update_options available on the operator, but the documentation is quite poor...
create = BigQueryCreateEmptyTableOperator(
    task_id="create",
    bigquery_conn_id='google_cloud',
    project_id="<myproject>",
    dataset_id=target_dataset,
    table_id=table_name,
    schema_fields=schema
)
upload = MySQLToGCSOperator(
    task_id='mysql_to_gcs',
    mysql_conn_id='bi_mysql',
    sql=self.sql,
    bucket=self.bucket,
    filename=self.filename,
    export_format='NEWLINE_DELIMITED_JSON',
    google_cloud_storage_conn_id='google_cloud'
)
import_task = GCSToBigQueryOperator(
    task_id='gcs_to_bigquery',
    bucket=self.bucket,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[self.filename],
    destination_project_dataset_table=f"<myproject>.{target_dataset}.{table_name}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='google_cloud',
    google_cloud_storage_conn_id='google_cloud',
)
create >> upload >> import_task
The reordering happens because you did not define schema_fields in your GCSToBigQueryOperator, which triggered BigQuery schema auto-detection, wherein
BigQuery makes a best-effort attempt to automatically infer the schema from the source data.
In your case, to ensure the columns are ordered the way you want, you must define schema_fields in your GCSToBigQueryOperator.
You can also omit BigQueryCreateEmptyTableOperator, since GCSToBigQueryOperator can create BigQuery tables and define their schemas itself.
Please see the updated code, based on your posted question:
upload = MySQLToGCSOperator(
    task_id='mysql_to_gcs',
    mysql_conn_id='bi_mysql',
    sql=self.sql,
    bucket=self.bucket,
    filename=self.filename,
    export_format='NEWLINE_DELIMITED_JSON',
    google_cloud_storage_conn_id='google_cloud'
)
create_and_import = GCSToBigQueryOperator(
    task_id='gcs_to_bigquery',
    bucket=self.bucket,
    source_format='NEWLINE_DELIMITED_JSON',
    source_objects=[self.filename],
    destination_project_dataset_table=f"<myproject>.{target_dataset}.{table_name}",
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='google_cloud',
    google_cloud_storage_conn_id='google_cloud',
    schema_fields=schema
)
upload >> create_and_import
You may refer to this GCSToBigQueryOperator Documentation for more details.
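For reference, schema_fields takes a list of dictionaries with name, type and mode keys, and the order of that list is the column order of the table. Here is a small sketch of building such a list; the helper make_schema_fields and the column names are invented for illustration and are not part of Airflow.

```python
def make_schema_fields(columns):
    """Turn (name, type) or (name, type, mode) tuples into schema_fields dicts.

    The output keeps the input order, which becomes the column order.
    """
    fields = []
    for column in columns:
        name, field_type, *rest = column
        mode = rest[0] if rest else "NULLABLE"
        fields.append({"name": name, "type": field_type, "mode": mode})
    return fields


# Hypothetical columns; replace with the ones exported from MySQL.
schema = make_schema_fields([
    ("id", "INTEGER", "REQUIRED"),
    ("created_at", "TIMESTAMP"),
    ("customer_name", "STRING"),
])
```

The resulting schema list can be passed as schema_fields=schema to the operator.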

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a solution to create a table, using Databricks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
  user "u_name",
  password "p_wd",
  dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
My suggestion is to use the Azure SQL Spark library, as also mentioned in the documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' feature is what you want to use for good performance. Just load your file into a DataFrame and bulk copy it to Azure SQL:
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, the answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!
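The snippet above does not yet capture the file name in a fifth field. In Spark, the built-in input_file_name() function exists for that purpose. To illustrate the idea on its own, here is a plain-Python sketch (not Spark, so only suitable for small data) that walks a folder tree, reads every gzipped, pipe-delimited file matching ABC*, and appends the file name to each row. The function name rows_with_filename is made up for this example.

```python
import csv
import fnmatch
import gzip
import os


def rows_with_filename(root, pattern="ABC*.gz", sep="|"):
    """Yield each row of every matching file under root, with the file name appended."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-folders in a stable order
        for filename in sorted(fnmatch.filter(filenames, pattern)):
            path = os.path.join(dirpath, filename)
            with gzip.open(path, "rt", newline="") as handle:
                for row in csv.reader(handle, delimiter=sep):
                    yield row + [filename]
```

Each yielded row then holds the four data fields followed by the name of the file it came from, which is the shape the question asks for.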

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud to store data, and several times throughout the day data is uploaded into CSVs there. I use BigQuery to run jobs to populate BigQuery tables with the data there.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
1. Identify new files that have not been added (something like date = today - 1)
2. Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
3. De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
4. Insert the de-duped temp table into the BigQuery table
5. Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and append to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your cloud function is triggered automatically, which calls your Dataflow pipeline to start reading the new file, deduping it and writing the results to BigQuery.
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable which is used to search for yesterday's data files in GCS and output to a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are shown because it's saved as .CMD file and run through Windows Task Scheduler.
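For anyone who would rather avoid the batch string slicing, the same sequence can be sketched in Python. This only builds the bq command lines from the script above for yesterday's date; the function name build_bq_commands and its parameters are invented here, while the bq subcommands, flags, table names and schema are taken from the script.

```python
from datetime import date, timedelta


def build_bq_commands(today, dataset="data", table="data", bucket="mybucket"):
    """Return the bq commands for loading, de-duping and appending yesterday's files."""
    yesterday = today - timedelta(days=1)
    stamp = yesterday.strftime("%Y%m%d")
    raw = f"{dataset}.{table}_{stamp}_1"      # temporary table with the raw load
    deduped = f"{dataset}.{table}_{stamp}_2"  # temporary table after de-duping
    uri = f"gs://{bucket}/data/{yesterday.isoformat()}*.csv.gz"
    return [
        ["bq", "load", "--skip_leading_rows=1", raw, uri,
         "Timestamp:TIMESTAMP,TransactionID:STRING"],
        ["bq", "query", f"--destination_table={deduped}",
         f"SELECT * FROM {raw} group by 1,2"],
        ["bq", "cp", "-a", deduped, f"{dataset}.{table}"],
        ["bq", "rm", "-f", raw],
        ["bq", "rm", "-f", deduped],
    ]
```

Each list could be passed in turn to subprocess.run(command, check=True), which stops at the first failing command much like the && chain does.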

PDI or mysqldump to extract data without blocking the database nor getting inconsistent data?

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a staging database. For this I use several transformations with table input and table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes several minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction lets me get the data consistently without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think the "Table Input" step doesn't take into account any modifications that happen while you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. In the middle of the data load, insert a few records into the source database. You will find that those records are not read into the target table. (Note: I tried this with a PostgreSQL database, and the number of rows read was 1000000.)
Now for your question: I suggest using PDI, since it gives you more control over the data in terms of versioning, sequences, SCDs and all the DW/BI-related activities. PDI makes it easier to load to the staging environment rather than simply dumping entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation, then at least they all start at the same time, but whilst the result is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk insert or extract it's nearly always better to use the vendor-provided tools.
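As a rough sketch of the mysqldump side, the helper below builds a command line that dumps only specific tables from one database inside a single transaction. The function name and the database, table, user and file names in the usage note are placeholders; --single-transaction, --user and --result-file are standard mysqldump options.

```python
def mysqldump_command(database, tables, result_file, user=None):
    """Build a mysqldump command for selected tables using a consistent snapshot."""
    command = ["mysqldump", "--single-transaction"]
    if user:
        command.append(f"--user={user}")
    command.append(f"--result-file={result_file}")
    command.append(database)  # the database name comes first ...
    command.extend(tables)    # ... followed by the tables to dump
    return command
```

For example, mysqldump_command("source_db", ["orders", "customers"], "stage_dump.sql", user="etl") gives a list that can be handed to subprocess.run, or the equivalent command line can be called from a PDI job.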

Pig : Load in table, then overwrite that table after transformation

Let's say I have a table:
db.table
I load the table and do some transforms on it, and, finally, attempt to store it
mytable = LOAD 'db.table' USING HCatLoader();
.
.
-- My transforms
.
.
STORE mytable_final INTO 'db.table' USING HCatStorer();
But the code complains I'm writing into a table with existing data.
I've looked at this JIRA ticket, which seems to be inactive (I have tried using FORCE and OVERWRITE in several places in the STORE command).
I've also looked at this SO post, but the author is loading from one location and storing in a different location. If I use what is in that post, the result from the transformation is no data. Deleting the files isn't an option. I'm thinking of storing the files temporarily, but I don't know if this is the best option.
I am trying to get the behavior you get in Hive using INSERT OVERWRITE.
I am not familiar with HCatLoader and HCatStorer. But if you LOAD from and STORE to HDFS, Pig provides shell commands that enable you to do the deleting and moving from within your script.
STORE A INTO '/this/path/is/temporary';
RMF '/this/path/is/permanent';
MV '/this/path/is/temporary' '/this/path/is/permanent';
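The same write-to-temporary-then-swap pattern, shown on a local filesystem in Python purely to illustrate the order of operations. This is an analogue, not HDFS or Pig code, and the function name replace_directory is made up.

```python
import os
import shutil


def replace_directory(temporary, permanent):
    """Replace the permanent directory with the freshly written temporary one."""
    if os.path.exists(permanent):
        shutil.rmtree(permanent)     # like RMF: remove the old output if present
    os.rename(temporary, permanent)  # like MV: move the new output into place
```

The old data is only deleted after the new output has been fully written to the temporary location, so a failed transform leaves the permanent path untouched.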