I have my data available in a Smartsheet location organized into month folders, and each month folder contains weekly sheets (for example smartsheet/October/wk1, wk2, wk3, wk4 files). I want to load this data dynamically into a Hive table. Can someone suggest how to load it dynamically?
Currently we use the ORC file format to store the incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, yet we want to update the file contents constantly, every 20 minutes, which implies we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by about 2 GB every day, so it is a very costly process to download ~10 GB of files, read them, append to them, and upload them back.
Question:
Is there any other file format which offers appends/inserts and can also be queried by Athena?
From this article it seems Avro is such a file format, but I am not sure:
Can Athena be used for querying it?
Are there any other issues?
Note: my skills in big data technologies are at a beginner level.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path, and they will be available instantly for querying via Athena.
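For the unpartitioned case, the same copy can also be scripted. Below is a minimal boto3 sketch, assuming the new ORC file is produced locally first; the bucket, key, and file names are placeholders, not taken from the question:

# Minimal sketch: upload a new ORC file into the table's S3 location.
# Bucket, key, and file names below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="traffic_batch.orc",            # local ORC file (placeholder)
    Bucket="your-bucket",                    # placeholder bucket
    Key="path_to_table/traffic_batch.orc",   # the table's S3 prefix (placeholder)
)

Once the object lands under the table's location, Athena sees it on the next query without any extra step.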
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
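If you want to run that ALTER TABLE automatically after each copy, it can also be submitted through the Athena API. A rough boto3 sketch, where the database, table, partition value, and result location are placeholders:

# Rough sketch: register the new partition via the Athena API.
# Database, table, partition value, and output location are placeholders;
# the quotes around the partition value assume a string-typed partition column.
import boto3

athena = boto3.client("athena")

ddl = """
alter table dataset.tablename add if not exists
partition (date = '20240101')
location 's3://your-bucket/path_to_table/date=20240101/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "dataset"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)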
I have CSV files added to a GCS bucket daily or weekly; each file name contains a date and a specific parameter.
The files contain the (id, name) columns, and we need to automatically load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't pass the date and specific parameter from the file name into the Dataflow job.
We also tried a Cloud Function (we can get the date and specific parameter values from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article on this kind of problem using Cloud Workflows, for the case when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file, we will then use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
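For orientation, here is a rough Python sketch of those same three steps using the Cloud Storage and BigQuery client libraries; it is not the Cloud Workflows definition from the article, and the bucket, dataset, table naming, and filename pattern are assumptions:

# Sketch of the three workflow steps in Python (not the Cloud Workflows YAML).
# Bucket, dataset, table naming, and the filename pattern are assumptions.
import re
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

# (1) List files in the bucket, filtered by a folder prefix.
for blob in storage_client.list_blobs("my-bucket", prefix="incoming/"):
    # (2) Extract segments from the filename, e.g. "incoming/20240101_paramA.csv".
    match = re.search(r"(\d{8})_(\w+)\.csv$", blob.name)
    if not match:
        continue
    file_date, parameter = match.groups()

    # (3) Load the file into a table whose name is derived from the filename parts.
    table_id = f"my-project.my_dataset.data_{parameter}_{file_date}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    bq_client.load_table_from_uri(
        f"gs://my-bucket/{blob.name}", table_id, job_config=job_config
    ).result()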
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
I have set up a data transfer that automatically sends data from an S3 bucket to GCS. This transfer runs automatically every morning.
I created a table in BigQuery that reads the data from GCS; so far no problem.
Now my concern is that even though the files are updated daily in GCS, the BigQuery table that is supposed to consume the GCS Parquet files doesn't seem to update every day.
What is the procedure for the table to consume the latest data in the GCS bucket?
For example, I created my table on the 17th of April.
My data transfer sent some files on the 19th,
but when I run select max(created_at) from mytable,
it doesn't return the latest data.
The blob contains Hive partitioned table data, with partitions created on year, month, and day.
The container looks like Year=2016/Months=1/Day-1 - 0000_1 (file) through Day-31 - 0000_31 (file).
Like this, we have 3 months inside the year, each month contains day folders, and each day folder contains a file.
Now we want to put that data into an Azure SQL DB table which is not partitioned.
If I understand it right, you have blobs with the structure
2016/03/01_001
2016/03/01_003
and the intent is to copy the data to SQL Azure. I am assuming that the blob structure is the same for all the files.
I suggest:
1: Use the GetMetadata activity to get all the blob info.
2: Use a ForEach activity to read one blob at a time.
3: Inside the ForEach, add a Copy activity with the blob as the source and SQL Azure as the sink.
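If it helps to see the same flow outside Data Factory, here is a rough Python sketch of steps 1-3 (list blobs, read each one, insert into the unpartitioned table). It assumes the blob files are delimited text, and the connection strings, container, table, and column names are placeholders:

# Not the Data Factory pipeline itself; a scripted sketch of the same flow.
# Connection strings, container name, and table/column names are placeholders,
# and the blob files are assumed to be delimited text.
import csv
import io

import pyodbc
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>", container_name="<container>"
)
sql = pyodbc.connect("<azure-sql-odbc-connection-string>")
cursor = sql.cursor()

# 1: get all the blob info (GetMetadata equivalent)
for blob in container.list_blobs(name_starts_with="Year=2016/"):
    # 2: read one blob at a time (ForEach equivalent)
    text = container.download_blob(blob.name).readall().decode("utf-8")
    # 3: copy into the unpartitioned table (Copy activity equivalent)
    rows = list(csv.reader(io.StringIO(text)))
    cursor.executemany(
        "INSERT INTO dbo.TargetTable (col1, col2) VALUES (?, ?)", rows
    )

sql.commit()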
I have a folder of CSV files separated by date in Google Cloud Storage. How can I upload it directly to BigQuery as a partitioned table?
You can do the following:
Create a partitioned table (for example: T).
Run multiple load jobs to load each day's data into the corresponding partition. For example, you can load the data for May 15th, 2016 by specifying the load's destination table as 'T$20160515'.
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
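As a rough sketch, such a per-day load can also be issued from the Python client by pointing the destination at the partition decorator; the project, dataset, and GCS paths below are placeholders:

# Sketch: load one day's CSV files into the matching partition (T$20160515).
# Project, dataset, and GCS paths are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

dest = bigquery.TableReference(
    bigquery.DatasetReference("my-project", "my_dataset"), "T$20160515"
)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    # WRITE_TRUNCATE replaces the whole partition, as in the restating docs above.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://my-bucket/csv/2016-05-15/*.csv", dest, job_config=job_config
).result()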