Google Cloud Dataprep - Scan for multiple input csv and create corresponding bigquery tables - google-bigquery

I have several CSV files on GCS that share the same schema but have different timestamps, for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should run every day on a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw CSV files into the raw folder. A Cloud Function then moves one file from the raw folder into the queue folder if the queue folder is empty, and does nothing otherwise.
Dataprep scans the queue folder on a schedule. If a CSV file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. It creates a new BigQuery table named after the timestamp column in the CSV file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and then moves one file from the raw folder to the queue folder, if there is any.
Repeat until all tables are created.
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestions for my use case are appreciated.

Related

I tried to upload a csv file (from my desktop) to create a table in Bigquery; however, the table does not appear under my project

The steps I used to upload the .csv file were:
upload, browse, select .csv from my desktop
select auto detect schema and input parameters
create table
One possible cause of the initial malfunction is that my .csv was a zip file, which did not load in the usual manner, so I canceled the process. After several failed attempts to upload the zip file I tried to upload a different .csv, and I have not been able to create a table then or since.
I located the error message (HTTP 409), which states that the table already exists, so I tried changing the write preference to "overwrite" as well as "append to the table", but the processing takes several minutes and does not complete. I'm unable to create any tables.

Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created. A Lambda will then be invoked with a batch of x SQS messages, read the data in those files, combine it and save the result to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Will this be the best approach or am I missing something?
Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
Kinesis Firehose then batches the data by size or time
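The three steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the clients are passed in (in a real Lambda they would be boto3.client("s3") and boto3.client("firehose")), the factory name make_handler and the stream name are hypothetical, and it assumes each uploaded file holds a single-line JSON document.

```python
import urllib.parse


def make_handler(s3_client, firehose_client, stream_name):
    """Build a Lambda-style handler that forwards each uploaded object to Firehose.

    s3_client and firehose_client are expected to behave like the boto3
    "s3" and "firehose" clients; stream_name is a hypothetical delivery stream.
    """

    def handler(event, context=None):
        forwarded = 0
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys in S3 event notifications are URL-encoded.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            # Firehose concatenates records, so terminate each one with a
            # newline to keep the JSON documents separable downstream.
            if not body.endswith(b"\n"):
                body += b"\n"
            firehose_client.put_record(
                DeliveryStreamName=stream_name,
                Record={"Data": body},
            )
            forwarded += 1
        return forwarded

    return handler
```

Injecting the clients keeps the handler easy to exercise locally with stand-in objects before wiring it to the real S3 trigger.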
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second so it is difficult to query the incoming files in batches (so you know which files have been loaded and which ones have not been loaded). A kludge could be a script that does the following:
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
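For the Athena kludge described above, the step that moves objects from incoming/ to batch/ could look something like this. It is only a sketch under assumptions: the function name is made up, the client is assumed to behave like a boto3 S3 client, and both prefixes live in the same bucket.

```python
def move_incoming_to_batch(s3_client, bucket,
                           src_prefix="incoming/", dst_prefix="batch/"):
    """Copy every object under src_prefix to dst_prefix, then delete the source.

    Only keys seen in the initial listing are moved, so objects that arrive
    while the copy is running are left for the next batch.
    """
    # Take a snapshot of the keys first (following pagination).
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": src_prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]

    moved = []
    for key in keys:
        dst_key = dst_prefix + key[len(src_prefix):]
        s3_client.copy_object(
            Bucket=bucket,
            Key=dst_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        # Delete only after the copy call has returned without raising.
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append(dst_key)
    return moved
```

The INSERT INTO query and the cleanup of batch/ would run after this function returns; they are left out here.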
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
This is what I think would be simpler:
Have an input folder, input/, where the 5KB/1KB files land, and a data/ folder that holds JSON files with a maximum size of 200MB.
Have a Lambda that runs every minute, reads a set of files from input/ and appends them to the last file in the data/ folder, using Go or Java.
The Lambda (with max concurrency of 1) copies a set of 5KB files from input/ and the X MB files from data/ into its /tmp folder, merges them, uploads the merged file to data/ and deletes the merged files from input/.
Whenever the file size crosses 200MB, create a new file in the data/ folder.
The advantage here is that at any instant, if somebody wants the data, it is the union of the input/ and data/ folders.
With a few tweaks here and there you can expose a view on top of the input/ and data/ folders that presents a final de-duplicated snapshot of the data.
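The append-and-roll-over rule in the steps above can be illustrated in memory. This is a toy sketch in Python rather than Go or Java, with a hypothetical function name; it works on byte strings instead of S3 objects, and it starts a new file before the limit would be crossed, which is one possible reading of the 200MB rule.

```python
def merge_into_data_files(data_files, small_files, max_bytes=200 * 1024 * 1024):
    """Append small payloads to the last data file, rolling over when it is full.

    data_files: list of bytes objects, the existing files under data/ in order.
    small_files: list of bytes objects read from input/.
    Returns a new list of data files; the input lists are not modified.
    """
    merged = list(data_files) or [b""]
    for payload in small_files:
        # Start a new file if appending would push the current one past the limit.
        if merged[-1] and len(merged[-1]) + len(payload) > max_bytes:
            merged.append(b"")
        merged[-1] += payload
    return merged
```

In a real Lambda the same decision would be made from object sizes, and only the last data file would need to be downloaded to /tmp.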

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
the ability to query the data using Athena
Problem:
ORC files are read-only once written, and we want to update the file contents every 20 minutes,
which implies we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by ~2GB every day. Downloading 10GB files, reading them, appending to them and uploading them again is a very costly process.
Question:
Is there another file format that supports appends/inserts and can also be queried by Athena?
This article says Avro is such a file format, but I am not sure:
whether Athena can be used to query it
whether there are any other issues
Note: My skill with big data technologies is at beginner level.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
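If the partition statement has to be issued after every upload, it can be generated from the date. The helper below is a hypothetical sketch: it only builds the SQL string, it uses an assumed partition column named dt, and it quotes the value on the assumption that the partition column is a string.

```python
import re


def add_partition_sql(table, table_path, day, column="dt"):
    """Return an ALTER TABLE statement that registers one date partition.

    table: qualified table name, e.g. "dataset.tablename".
    table_path: S3 path of the table, e.g. "s3://your-bucket/path_to_table".
    day: partition value in YYYYMMDD form.
    column: partition column name (an assumption, adjust to your schema).
    """
    if not re.fullmatch(r"\d{8}", day):
        raise ValueError(f"expected YYYYMMDD, got {day!r}")
    location = f"{table_path.rstrip('/')}/{column}={day}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column} = '{day}') "
        f"LOCATION '{location}'"
    )
```

For example, add_partition_sql("dataset.tablename", "s3://your-bucket/path_to_table", "20180103") produces a statement pointing at the dt=20180103/ prefix, which can then be submitted to Athena.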

Automatic ETL data before loading to Bigquery

I have CSV files added to a GCS bucket daily or weekly; each file name contains a date and a specific parameter.
The files contain the columns (id, name), and we need to automatically load these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We tried Dataflow templates, but we couldn't pass the date and specific parameter from the file name to Dataflow.
We also tried a Cloud Function (where we can get the date and specific parameter from the file name), but couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article on this kind of problem using Cloud Workflows, for cases where you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file then, we will further use parts from the filename to use in BigQuery’s generated table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
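Separately from the Workflows approach, the missing piece in the Cloud Function attempt (adding the filename values as columns before loading) can be sketched in plain Python. This is not the method from the article. The filename pattern YYYYMMDD_parameter.csv, the function names and the presence of a header row are all assumptions, since the question does not state the exact format.

```python
import csv
import io
import re

# Hypothetical naming scheme: 20180103_paramA.csv
FILENAME_RE = re.compile(r"^(?P<date>\d{8})_(?P<param>[^/]+)\.csv$")


def parse_filename(name):
    """Extract (date, parameter) from an object name, ignoring any folder prefix."""
    match = FILENAME_RE.match(name.rsplit("/", 1)[-1])
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match.group("date"), match.group("param")


def add_filename_columns(name, csv_text):
    """Return the CSV text with 'date' and 'parameter' columns appended to every row.

    Assumes the first row of csv_text is a header (id,name).
    """
    date, param = parse_filename(name)
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["date", "parameter"])
    for row in reader:
        writer.writerow(row + [date, param])
    return out.getvalue()
```

The enriched text could then be written back to a staging object or loaded directly, so that the BigQuery table ends up with the four columns (id, name, date, parameter).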

Are there any problems with saving Parquet as a single file with no directory?

I am currently working on a PySpark application that outputs daily delta extracts as Parquet. Each extract is to be a single partition (the natural partition is the date the data is created/updated, which is how the extracts are being built).
I was planning to then take the outputted parquet folder and files, rename the actual parquet file itself, move it to another location and discard the original *.parquet directory including its _SUCCESS and *.crc files.
While I have tested reading files produced using the above scenario with Spark and Pandas, I am unsure whether this will cause issues with other applications that we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
If you have a single Parquet file and simply rename it, the renamed file is still a valid Parquet file.
If you combine several Parquet files into one by concatenating them, the combined file will not be a valid Parquet file.
If you need to combine several Parquet files into one, it is better to create a single file with Spark (using repartition) and write it to the table.
(or)
You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.
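For the single-file case, the rename-and-discard step from the question could be sketched like this. It assumes the Spark output directory is on a local filesystem (not S3 or HDFS) and that part files follow Spark's usual part-*.parquet naming; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def promote_single_part_file(spark_output_dir, destination):
    """Move the only part file out of a Spark output directory, then remove the directory.

    Raises ValueError if the directory does not hold exactly one part file or
    has no _SUCCESS marker, so a partial write is never promoted.
    """
    out_dir = Path(spark_output_dir)
    parts = sorted(out_dir.glob("part-*.parquet"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    if not (out_dir / "_SUCCESS").exists():
        raise ValueError("missing _SUCCESS marker; the write may not have finished")

    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(parts[0]), str(destination))
    # Discards the leftover _SUCCESS and .crc files along with the directory.
    shutil.rmtree(out_dir)
    return destination
```

Refusing to proceed when more than one part file exists keeps the function away from the multi-file case, which needs a real merge as described above.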