I have a DAG with the following structure:
loop by dates (date_var):
-> loop by countries:
-> process1 that outputs to GCS -> GCS to BigQuery table 1
-> process2 that outputs to GCS -> GCS to BigQuery table 2
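For reference, a rough sketch of how the nested loops are wired up (task ids, bucket, callables and table names below are simplified placeholders, not the actual code):

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.python_operator import PythonOperator

for date_var in dates:                       # loop by dates
    for country in countries:                # loop by countries
        # process1 / process2 are placeholder callables that write to GCS
        for i, proc in enumerate([process1, process2], start=1):
            run_process = PythonOperator(
                task_id='process{}_{}_{}'.format(i, country, date_var),
                python_callable=proc,
                op_kwargs={'date_var': date_var, 'country': country},
                dag=dag,
            )
            to_bq = GoogleCloudStorageToBigQueryOperator(
                task_id='gcs_to_bq_{}_{}_{}'.format(i, country, date_var),
                bucket='my-bucket',
                source_objects=['process{}/{}/{}/*'.format(i, country, date_var)],
                destination_project_dataset_table='project.dataset.table_{}'.format(i),
                write_disposition='WRITE_APPEND',
                dag=dag,
            )
            run_process >> to_bq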
I want to add a date_var identifier to each processed dataset, but I have a few restrictions:
I cannot change the output of each process to add the date_var identifier at that step.
I don't want to have a different table for each date_var.
I want the countries and dates to run in parallel, so I can only append data to the tables (WRITE_APPEND).
Any idea how to add a column coming from a parameter at this loading step? Or any other way to solve this given the three restrictions?
Thanks
I have a very large amount of data in my S3 bucket, partitioned by two columns, MODULE and DATE,
such that the file structure of my parquet files is:
s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01
I have 7 MODULE values, and DATE ranges from 2020-01-01 to 2020-09-01.
I found a discrepancy in the data and need to correct the MODULE entries for one of the modules. Basically, I need to change all data for a particular index number from MODULE XYZ to MODULE ABC.
I can do this in PySpark by loading the dataframe and doing something like:
df = df.withColumn('MODULE', when(col('index') == 34, 'ABC').otherwise(col('MODULE')))
But how do I repartition it so that only those entries that are changed get moved to the ABC MODULE partition? If I do something like:
df.write.mode('append').partitionBy('MODULE', 'DATE').parquet("s3://my_bucket/path/file.parquet")
I would be adding the data along with the erroneous MODULE data. Plus, I have almost a year's worth of data and don't want to repartition the entire dataset, as it would take a very long time.
Is there a way to do this?
If I understand correctly, you have data in the MODULE=XYZ partition that should be moved to MODULE=ABC.
First, identify the impacted files.
from pyspark.sql import functions as F

# Collect the distinct source files that contain the impacted rows.
rows = df.where(F.col("index") == 34).select(F.input_file_name()).distinct().collect()
file_list = [r[0] for r in rows]
Then, you create a dataframe based only on these files and use it to fix both MODULE partitions.
df = (
    spark.read.option("basePath", "s3://my_bucket/path/file.parquet/")  # keep MODULE/DATE partition columns
    .parquet(*file_list)
    .withColumn("MODULE", F.when(F.col("index") == 34, "ABC").otherwise(F.col("MODULE")))
)
# Append back to the dataset root so partitionBy routes the corrected rows into MODULE=ABC.
df.write.parquet(
    "s3://my_bucket/path/file.parquet/", mode="append", partitionBy=["MODULE", "DATE"]
)
At this point, ABC should be OK (you just added the missing data), but XYZ is now wrong because of duplicated data. To fix XYZ, you just need to delete the files listed in file_list.
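If the files live on S3 and you have boto3 configured, a minimal sketch of that cleanup could look like this (purely illustrative, adapt to your environment):

from urllib.parse import urlparse
import boto3

s3 = boto3.client("s3")
for uri in file_list:
    # uri looks like s3://my_bucket/path/file.parquet/MODULE=XYZ/DATE=.../part-....parquet
    parsed = urlparse(uri)
    # Delete the original file so its rows are no longer duplicated.
    s3.delete_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))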
IIUC, you can do this by filtering the data for that particular index and then saving that data with DATE as the partition.
from pyspark.sql.functions import col, when

df = df.withColumn('MODULE', when(col('index') == 34, 'ABC').otherwise(col('MODULE')))
df = df.filter(col('index') == 34)
df.write.mode('overwrite').partitionBy('DATE').parquet("s3://my_bucket/path/ABC/")
This way you will end up modifying only the changed module, i.e. ABC.
Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSVs frequently have a changing schema. After loading the first file the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value, I would fill in the missing columns and do the insert.
Thanks for the feedback.
After loading two prior files, example_data_1.csv and example_data_2.csv, I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
Edit: The light bulb moment was realizing that the schema_update_options exist. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table for processing your new data.
Since Airflow is an orchestration tool, it's not recommended to push large volumes of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the main table's schema with the temporary table's schema.
Run a query to move the data from the temporary table to the main table (see the sketch after this list). If the temporary table has new fields, add them to the main table using the schema_update_options parameter. Besides that, if your main table has fields in NULLABLE mode, it will easily handle missing columns when your new data is missing some field.
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
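As a rough sketch of steps 3 and 4 with the google-cloud-bigquery client (project, dataset and table names below are just placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Append everything from the temporary table into the main table, letting
# BigQuery add new columns and relax REQUIRED -> NULLABLE where needed.
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.main_table",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)
client.query(
    "SELECT * FROM `my_project.my_dataset.temp_table`",
    job_config=job_config,
).result()

# Step 4: drop the temporary table once the data has been moved.
client.delete_table("my_project.my_dataset.temp_table", not_found_ok=True)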
Finally, I would like to point out some links that might be useful to you:
Airflow documentation (BigQuery operators)
An article which shows a problem similar to yours and where you can find some of the information mentioned.
I hope it helps
I'm working on a project with test data of close to 1 million records, in 4 such files.
The task is to perform around 40 calculations joining the data from the 4 different files, each close to 1 GB.
Currently, I save the data from each file into a Spark table using saveAsTable and perform the operations. For example, table1 joins with table2 and the result is saved to table3. Table3 (the result of 1 and 2) joins with table4, and so on. Finally, I save these calculations in a different table and generate the reports.
The entire process takes around 20 minutes, and my concern is whether there will be performance issues when this code gets to production with probably 5 times more data than this.
Or is it better to save the data from each file in a partitioned way and then perform the joins to arrive at the final result set?
P.S. The objective is to get instant results, and there might be cases where the user updates a few rows in the file and expects an instant result. The data comes in on a monthly basis, basically once every month, with categories and sub-categories within.
What you are doing is just fine, but make sure to cache + count after every resource-intensive operation instead of writing all the joins and only saving at the last step.
If you do not cache in between, Spark will run the entire DAG from top to bottom at the last step; this may cause the JVM to run out of memory and spill to disk during operations, which in turn may affect the execution time.
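A rough illustration of that pattern (table and column names below are made up, and an existing SparkSession named spark is assumed):

# Hypothetical tables corresponding to the saved input files.
t1 = spark.table("table1")
t2 = spark.table("table2")
t4 = spark.table("table4")

t3 = t1.join(t2, "key")      # first resource-intensive join
t3.cache()
t3.count()                   # action that materializes the cache

result = t3.join(t4, "key")  # the next join reuses the cached t3
result.cache()
result.count()

result.write.mode("overwrite").saveAsTable("final_results")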
I have data generated as part-r files from a MapReduce job in the following format:
(19,[2468:5.0,1894:5.0,3173:5.0,3366:5.0,3198:5.0,1407:5.0,407:5.0,1301:5.0,2153:5.0,3007:5.0])
(20,[3113:5.0,3285:5.0,3826:5.0,3755:5.0,373:5.0,3510:5.0,3300:5.0,22:5.0,1358:5.0,3273:5.0])
19 and 20 are user ids, and the array within the [] holds the recommendations for that user, each recommendation separated by a comma. I want to load this data in a tabular format - row 1 = 19, 2468, 5.0; row 2 = 19, 1894, 5.0; and so on.
How could I achieve this with Pig or Hive?
So far, I have tried Pig but haven't been able to parse the data to get the desired output.
I am looking to create a report where I can display the user name (by joining with the user table), the recommended movie names for the user (by joining with the movie table) and the user rating.
In the data above, 19 is the user id. Within the parentheses are the recommended movie ids for that user, along with the rating. Each recommendation is separated by a comma.
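Just to make the target rows concrete, here is a tiny plain-Python illustration (not Pig/Hive) of the layout I'm after:

import re

lines = [
    "(19,[2468:5.0,1894:5.0,3173:5.0])",
    "(20,[3113:5.0,3285:5.0,3826:5.0])",
]

for line in lines:
    # Split "(userid,[movie:rating,...])" into one (user, movie, rating) row per recommendation.
    user, recs = re.match(r"\((\d+),\[(.*)\]\)", line).groups()
    for rec in recs.split(","):
        movie, rating = rec.split(":")
        print(user, movie, rating)   # e.g. 19 2468 5.0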
It is possible to return nested results (RECORD type) if the noflatten_results flag is specified, but is it possible to just view them on screen without writing them to a table first?
For example, here is a simple user table (my actual table is quite large: 400+ columns with multiple levels of nesting):
ID,
name: {first, last}
I want to view the record for a particular user and display it in my application, so my query is
SELECT * FROM dataset.user WHERE id=423421 limit 1
is it possible to return the result directly?
You should write your output to a "temp" table with the noflatten_results option (also set a respective expiration so the table is purged after it is used) and serve your client out of this temp table. There is no way to do all of this "on the fly".
Keep in mind that no matter how small the "temp" table is, if you query it (in the second step above) you will be billed for at least 10MB, so you'd better use the Tabledata.list API in this step (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list), which is free!
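For example, with the google-cloud-bigquery Python client, Client.list_rows() reads the table through tabledata.list instead of running a query (the table name below is just a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# list_rows() uses the free tabledata.list API, so reading the temp table
# this way does not incur query charges.
table = client.get_table("my_project.my_dataset.temp_user_result")
for row in client.list_rows(table, max_results=10):
    print(dict(row))  # nested RECORD fields come back as Python dicts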
Note that if you try to get repeated records, it will fail in the interface/BQ console with the error:
Error: Cannot output multiple independently repeated fields at the same time.
and the way to get past this error is to FLATTEN your output.