Record Duplication in BigQuery while Running a DataFlow Job - google-bigquery

I'm running an hourly Dataflow job that reads records from a source table, processes them, and writes them to a target table. Since some records may repeat in the source table, we compute a hash value from the record fields of interest, append it to each record read from the source table (in memory), and filter out hashes that already exist in the target table (the hash value is stored in the target table). This is meant to prevent duplicates across different jobs (triggered at different times). To avoid duplicates within the same job, we key the records by hash value, apply Apache Beam's GroupByKey, and keep only the first element of each group. However, duplicates still appear in BigQuery. My only hunch is that multiple workers handling the same job might be out of sync and process the same data, but since everything flows through the pipeline, that assumption seems unreasonable (at least to me). Does anyone have an idea why the problem persists?
Here's the job which creates the duplication:
with beam.Pipeline(options=options) as p:
    # read fields of interest from the source table
    records = p | 'Read Records from BigQuery' >> beam.io.Read(
        beam.io.ReadFromBigQuery(query=read_from_source_query, use_standard_sql=True))

    # step 1 - filter already existing records
    # read existing hashes from the target table
    hashes = p | 'read existing hashes from the target table' >> \
        beam.io.Read(beam.io.ReadFromBigQuery(
            query=select_hash_value_from_target_table,
            use_standard_sql=True)) | \
        'Get vals' >> beam.Map(lambda hash: hash['HashValue'])

    # add a hash value to each record and filter out the ones which already exist in the target table
    hashed_records = (
        records
        | 'Add Hash Column in Memory to Each source table Record' >> beam.Map(lambda record: add_hash_field(record))
        | 'Filter Existing Hashes' >> beam.Filter(
            lambda record, hashes: record['HashValue'] not in hashes,
            hashes=beam.pvalue.AsIter(hashes))
    )

    # step 2 - filter duplicated hashes created on the same job
    key_val_records = (
        hashed_records | 'Create a Key Value Pair' >> beam.Map(lambda record: (record['HashValue'], record))
    )

    # group elements with the same key and keep only one of them
    unique_hashed_records = (
        key_val_records
        | 'Combine the Same Hashes' >> beam.GroupByKey()
        | 'Get First Element in Collection' >> beam.Map(lambda element: element[1][0])
    )

    records_to_store = unique_hashed_records | 'Create Records to Store' >> beam.ParDo(CreateTargetTableRecord(gal_options))

    records_to_store | 'Write to target table' >> beam.io.WriteToBigQuery(target_table)
As the code above suggests, I expected no duplicates in the target table, but I'm still getting them.

Related

Apache Beam Pipeline Write to Multiple BQ tables

I have a scenario where I need to do the following:
Read data from Pub/Sub.
Apply multiple transformations to the data.
Persist the PCollection to multiple BigQuery tables based on some config.
My question is: how can I write data to multiple BigQuery tables?
I searched for multiple BigQuery writes using Apache Beam but could not find a solution.
You can do that with 3 sinks; here is an example with Beam Python:
import logging

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def map1(element):
    ...

def map2(element):
    ...

def map3(element):
    ...

def main() -> None:
    logging.getLogger().setLevel(logging.INFO)

    # YourOptions is your own custom options class, defined elsewhere
    your_options = PipelineOptions().view_as(YourOptions)
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:
        result_pcollection = (
            p
            | 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription')
            | 'Map 1' >> beam.Map(map1)
            | 'Map 2' >> beam.Map(map2)
            | 'Map 3' >> beam.Map(map3)
        )

        (result_pcollection |
            'Write to BQ table 1' >> beam.io.WriteToBigQuery(
                project='project_id',
                dataset='dataset',
                table='table1',
                method='STREAMING_INSERTS',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
            'Write to BQ table 2' >> beam.io.WriteToBigQuery(
                project='project_id',
                dataset='dataset',
                table='table2',
                method='STREAMING_INSERTS',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection |
            'Write to BQ table 3' >> beam.io.WriteToBigQuery(
                project='project_id',
                dataset='dataset',
                table='table3',
                method='STREAMING_INSERTS',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    main()
The first PCollection is the result of the input from Pub/Sub.
I applied 3 transformations to the input PCollection,
then sink the result to 3 different BigQuery tables:
res = Flow
=> Map 1
=> Map 2
=> Map 3
res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`
In this example I used STREAMING_INSERTS for ingestion into the BigQuery tables, but you can adapt and change the write method if needed in your case.
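For instance, if streaming inserts are not required, one of the writes above could be switched to batch load jobs. This is only a sketch, not part of the original answer; the triggering_frequency value is an assumption and is only needed when FILE_LOADS is used in a streaming pipeline:
(result_pcollection |
    'Write to BQ table 1 via load jobs' >> beam.io.WriteToBigQuery(
        project='project_id',
        dataset='dataset',
        table='table1',
        method='FILE_LOADS',        # load jobs instead of streaming inserts
        triggering_frequency=300,   # seconds between load jobs in streaming mode (assumed value)
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))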
I see the previous answers satisfy your requirement of writing the same result to multiple tables. However, assuming the scenario below, a slightly different pipeline may be useful.
Read data from Pub/Sub
Filter the data based on configs (from event message keys)
Apply different/same transformations to the filtered collections
Write the results from the previous collections to different BigQuery sinks
Here, we filter the events at an early stage in the pipeline, which helps to:
Avoid processing the same event messages multiple times.
Skip the messages which are not needed.
Apply only the relevant transformations to each event message.
Keep the overall system efficient and cost-effective.
For example, you are processing messages from all around the world and you need to process and store the data with respect to geography - storing European messages in the Europe region.
Also, you need to apply transformations which are relevant to country-specific data - add an Aadhar number to messages generated from India and a Social Security number to messages generated from the USA.
And you don't want to process/store events from specific countries - data from Oceanian countries is irrelevant and does not need to be processed/stored in our use case.
So, in this made-up example, by filtering the data (based on the config) at an early stage, you will be able to store country-specific data (multiple sinks), you won't have to process events generated from the USA/any other region when adding an Aadhar number (event-specific transformations), and you will be able to skip/drop the remaining records or simply store them in BigQuery without applying any transformations.
If the above made-up example resembles your scenario, the sample pipeline design may look like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, ...
from apache_beam.io.gcp.internal.clients import bigquery

class TaggedData(beam.DoFn):
    def process(self, element):
        try:
            # filter here and route each element to a named (tagged) output
            if element["country"] == "in":
                yield beam.pvalue.TaggedOutput("india", element)
            if element["country"] == "usa":
                yield beam.pvalue.TaggedOutput("usa", element)
            ...
        except:
            yield beam.pvalue.TaggedOutput("unprocessed", element)

def addAadhar(element):
    "Filtered messages - only India"
    yield "elementwithAadhar"

def addSSN(element):
    "Filtered messages - only USA"
    yield "elementwithSSN"

p = beam.Pipeline(options=options)

messages = (
    p
    | "ReadFromPubSub" >> ...
    | "Tagging" >> beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', ...)
)

india_messages = (
    messages.india
    | "AddAadhar" >> ...
    | "WriteIndiamsgToBQ" >> streaming inserts
)

usa_messages = (
    messages.usa
    | "AddSSN" >> ...
    | "WriteUSAmsgToBQ" >> streaming inserts
)

oceania_messages = (
    messages.oceania
    | "DoNothing&WriteOceaniamsgToBQ" >> streaming inserts
)

deadletter = (
    (messages.unprocessed, stage1.failed, stage2.failed)
    | "CombineAllFailed" >> Flatten...
    | "WriteUnprocessed/InvalidMessagesToBQ" >> streaming inserts...
)

How to truncate a table in PySpark?

In one of my projects, I need to check if an input dataframe is empty or not. If it is not empty, I need to do a bunch of operations and load some results into a table and overwrite the old data there.
On the other hand, if the input dataframe is empty, I do nothing and simply need to truncate the old data in the table. I know how to insert data with overwrite but don't know how to truncate the table only. I searched existing questions/answers and found no clear answer.
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
stage_url = 'jdbc:sqlserver://server_name\DEV:51433;databaseName=project_stage;user=xxxxx;password=xxxxxxx'

if input_df.count() > 0:
    # Do something here to generate result_df
    print(" write to table ")
    write_dbtable = 'Project_Stage.StageBase.result_table'
    write_df = result_df
    write_df.write.format('jdbc').option('url', stage_url).option('dbtable', write_dbtable). \
        option('truncate', 'true').mode('overwrite').option('driver', driver).save()
else:
    print('no account to process!')
    query = """TRUNCATE TABLE Project_Stage.StageBase.result_table"""
    ### Not sure how to run the query
Truncating is probably easiest done like this:
write_df = write_df.limit(0)
Also, for better performance, instead of input_df.count() > 0 you should use:
Spark 3.2 and below: len(input_df.head(1)) > 0
Spark 3.3+: not input_df.isEmpty()
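Tying that back to the JDBC writer from the question, the else branch could simply write an empty DataFrame with the truncate option, which empties the table while keeping its definition. A minimal sketch, reusing the question's stage_url, driver, result_df, and table name, and assuming input_df has the same columns as the target table:
if not input_df.isEmpty():          # Spark 3.3+; use len(input_df.head(1)) > 0 on older versions
    write_df = result_df            # generated elsewhere, as in the question
else:
    print('no account to process!')
    write_df = input_df.limit(0)    # empty DataFrame; assumes input_df matches the target table's columns

# with mode('overwrite') and truncate=true the JDBC writer issues TRUNCATE TABLE
# instead of dropping and recreating the table, so writing zero rows empties it
write_df.write.format('jdbc') \
    .option('url', stage_url) \
    .option('driver', driver) \
    .option('dbtable', 'Project_Stage.StageBase.result_table') \
    .option('truncate', 'true') \
    .mode('overwrite') \
    .save()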

matching several combinations of columns in a table

I am reading a table where all of its values have to be validated before we process it further. The valid values are stored in another table that we match our main table against. The validation criteria are to match several columns as follows:
Table 1 (the main data we read in)
Name --- Unit --- Age --- Address --- Nationality
The above shows the column names that we are reading from the table, and the other table contains the valid values for those columns. When we look for valid values in our main table, we have to consider a combination of columns in the main data table, for example Name --- Unit --- Age. If all the values in a particular row for that column combination match against the other table, then we keep the row; otherwise we delete it.
How do I address this with NumPy?
Thanks
You can just loop through the rows. An easy/simple way would be:
import numpy as np

dummy_df = table_df.copy()  # make a copy of your table; since we are deleting rows, we want the original df saved
relevant_columns = ['age', 'name', 'sex', ...]  # define relevant columns, in case either dataframe has columns you don't want to compare on

for indx in dummy_df.index:
    # checks if any row of main_df is identical on the relevant columns; if so, drops it
    if ((np.array(dummy_df.loc[indx][relevant_columns]) == main_df[relevant_columns].values).sum(1) == len(relevant_columns)).sum() > 0:
        dummy_df = dummy_df.drop(indx)
PS: I am assuming the data is in pandas DataFrame format.
Hope it helps :)
PS2: if the headers/columns have different names, it won't work.
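If the tables are large, a vectorized alternative to the row loop (not part of the original answer, just a sketch) is an inner merge on the relevant columns, which keeps only the rows of the main table whose combination of values appears in the valid-values table. The frame names below are hypothetical:
import pandas as pd

# hypothetical example frames; replace with your actual tables
main_df = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cara'],
    'Unit': ['A', 'B', 'A'],
    'Age':  [30, 41, 25],
})
valid_df = pd.DataFrame({
    'Name': ['Ann', 'Cara'],
    'Unit': ['A', 'A'],
    'Age':  [30, 25],
})

relevant_columns = ['Name', 'Unit', 'Age']

# the inner merge keeps only rows whose relevant-column combination exists in valid_df
filtered_df = main_df.merge(valid_df[relevant_columns].drop_duplicates(),
                            on=relevant_columns, how='inner')
print(filtered_df)  # Ann and Cara remain; Bob is dropped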

Generating variable observations for one id to be observation for new variable of another id

I have a data set that allows linking friends (i.e. observing peer groups) and thereby one can observe the characteristics of an individual's friends. What I have is an 8 digit identifier, id, each id's friend id's (up to 10 friends), and then many characteristic variables.
I want to take an individual and create variables that are the foreign-born status of each of their friends.
I already have an indicator for each person that is 1 if foreign born. Below is a small example, for just one friend. Notice, MF1 means male friend 1 and then MF1id is the id number for male friend 1. The respondents could list up to 5 male friends and 5 female friends.
So, I need Stata to look at MF1id and then match it down the id column, then look over to f_born for that matched id, and finally input the value of f_born there back up to the original id under MF1f_born.
edit: I did a poor job of explaining the data structure. I have a cross section so 1 observation per unique id. Row 1 is the first 8 digit id number with all the variables following over the row. The repeating id numbers are between the friend id's listed for each person (mf1id for example) and the id column. I hope that is a bit more clear.
Kevin Crow wrote vlookup, which makes this sort of thing pretty easy:
use http://www.ats.ucla.edu/stat/stata/faq/dyads, clear
drop team y
rename (rater ratee) (id mf1_id)
bys id: gen f_born = mod(id,2)==1
net install vlookup
vlookup mf1_id, gen(mf1f_born) key(id) value(f_born)
So, Dimitriy's suggestion of vlookup is perfect, except it will not work for me. After trying vlookup with my data set, with the UCLA data that Dimitriy used for his example, and with a toy data set I created, vlookup always failed at the point where the program attempts to save a temp file to my temp folder. Below is the program for vlookup. Notice it sets a tempfile, manipulates the data, and then saves the file.
*! version 1.0.0 KHC 16oct2003
program define vlookup, sortpreserve
    version 8.0
    syntax varname, Generate(name) Key(varname) Value(varname)
    qui {
        tempvar g k
        egen `k' = group(`key')
        egen `g' = group(`key' `value')
        local k = `k'[_N]
        local g = `g'[_N]
        if `k' != `g' {
            di in red "`value' is not unique within `key';"
            di in red /*
                */ "there are multiple observations with different `value'" /*
                */ " within `key'."
            exit 9
        }
        preserve
        tempvar g _merge
        tempfile file
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save `file', replace
        restore
        sort `varlist'
        joinby `varlist' using `file', unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end
exit
For some reason, Stata gave me an error, "invalid file," at the save `file', replace point. I have a restricted data set with requirements to point all my Stata temp files to a very specific folder that has an erasure program sweeping it every so often. I don't know why this would create a problem, but maybe it does; I really don't know. Regardless, I tweaked the vlookup program and it appears to do what I need now.
clear all
set more off
capture log close

input aid mf1aid fborn
1 2 1
2 1 1
3 5 0
4 2 0
5 1 0
6 4 0
7 6 1
8 2 .
9 1 0
10 8 1
end

program define justlinkit, sortpreserve
    syntax varname, Generate(name) Key(varname) Value(name)
    qui {
        preserve
        tempvar g _merge
        sort `key'
        by `key' : keep if _n == 1
        keep `key' `value'
        sort `key'
        rename `key' `varlist'
        rename `value' `generate'
        save "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", replace
        restore
        sort `varlist'
        joinby `varlist' using "Z:\Jonathan\created data sets\justlinkit program\fchara.dta", unmatched(master) _merge(`_merge')
        drop `_merge'
    }
end

// set trace on
justlinkit mf1aid, gen(mf1_fborn) key(aid) value(fborn)

sort aid
list
Well, this fixed my problem. Thanks to all who responded; I would not have figured this out without you.

Evaluate SQL records in a For Each loop taking into account previous steps through loop

I have written a script in PowerShell that searches through a relational database using SELECT TOP statements to pick records which are suitable to match with items in an input text file. For the sake of simplicity I have not included all of the conditions that need to be met, just the ones I am having trouble with:
Each item in the input file has a corresponding requirement for resource x and resource y.
Input File:
Record 1 2x 1y
Record 2 1x 1y
In the database each record is similar
Database:
Record 1 4x 3y
Record 2 1x 2y
What my script does is loop through each item in my input file and searches through the database to find a record that has sufficient amount of resource x and resource y. The script does this and outputs a file which basically matches records of the input file to suitable records of the database.
However, it doesn't work properly: as it steps through each item in the loop, it doesn't take into account whether the previous item(s) in the loop have already been matched to records (and used up resources). For example:
The script evaluates input file record 1 (2x 1y) and matches it to record 1 in the DB (4x 3y). Now when the script goes to the next item in the input file (1x 1y), it evaluates record 1 in the DB as still having 4x and 3y, despite having been matched previously in the loop; its resources should now be treated as 2x 2y (4x-2x, 3y-1y).
How can I accomplish this? In the end the script could be evaluating 200 input records at a time against a database with 70,000 records. The answer doesn't have to be in PowerShell; I'm just having a hard time thinking of a conceptual answer to this problem.
Here's a PowerShell example using randomly generated CSVs.
Input table format:
RecordName ResourceX ResourceY Match
Record 0 8 0
Record 1 2 5
Record 2 5 9
Processing:
$cResources = Import-Csv resources-before.csv
$cResourcesNeeded = Import-Csv needed-infile.csv

foreach ($needed in $cResourcesNeeded) {
    foreach ($supply in $cResources) {
        if (($needed.ResourceX -le $supply.ResourceX) -and `
            ($needed.ResourceY -le $supply.ResourceY)) {
            # Match found.
            $needed.Match = $supply.RecordName
            $supply.Match += $needed.RecordName
            # Updating supply record.
            $supply.ResourceX = $supply.ResourceX - $needed.ResourceX
            $supply.ResourceY = $supply.ResourceY - $needed.ResourceY
            # Back to outer loop.
            break
        }
    }
}

$cResources | Export-Csv -NoTypeInformation resources-after.csv
$cResourcesNeeded | Export-Csv -NoTypeInformation needed-outfile.csv
Of course this is just a very basic example. I don't know what other requirements you have so feel free to elaborate further (i.e. update the question with your actual code) if you need something more specific.