I have started testing out AWS Athena, and so far it looks good. One problem I am having is with updating the data in a table.
Here is the scenario: in order to update the data for a given date in the table, I am basically emptying out the S3 bucket that contains the CSV files and uploading the new files to become the updated data source. However, the period during which the bucket is empty (i.e. after the old source is deleted and while the new source is being uploaded) is a problem, because during this interval anyone querying the table will get no results.
Is there a way around this?
Thanks.
Athena is a web service that allows you to query data that resides on AWS S3. In order to run queries, Athena needs to know the table schema and where to look for the data on S3. All of this information is stored in the AWS Glue Data Catalog. This essentially means that each time you get new data, you simply need to upload a new CSV file to S3.
Let's assume that you get new data every day at midnight and you store it in an S3 bucket:
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
└── data-file-2019-01-03.csv
and each of these files looks like:
| date | volume | product | price |
|------------|---------|---------|-------|
| 2019-01-01 | 100 | apple | 10 |
| 2019-01-01 | 200 | orange | 50 |
| 2019-01-01 | 50 | cherry | 100 |
Then, after you have uploaded them to AWS S3, you can use the following DDL statement to define the table:
CREATE EXTERNAL TABLE `my_table`(
  `date` date,
  `volume` int,
  `product` string,
  `price` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/'
-- Additional table properties, e.g. to skip the CSV header row:
TBLPROPERTIES ('skip.header.line.count'='1')
Now, when you get a new file data-file-2019-01-04.csv and upload it to the same location as the other files, Athena will be able to query the new data as well:
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
├── data-file-2019-01-03.csv
└── data-file-2019-01-04.csv
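Getting the new file there is just a plain S3 upload; for instance, a minimal sketch assuming you do it from Python with boto3 (the file and bucket names are the ones from the listing above):

import boto3

s3 = boto3.client("s3")

# Drop the new day's file next to the existing ones; no DDL change is needed.
s3.upload_file(
    Filename="data-file-2019-01-04.csv",
    Bucket="my-data-bucket",
    Key="data-file-2019-01-04.csv",
)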
Update 2019-09-19
If your scenario is one where you need to update data that is already in the S3 bucket, then you can try combining views, tables, and keeping different versions of the data.
Let's say you have table_v1, which queries data in the s3://my-data-bucket/v1/ location. You create a view for table_v1, which can be seen as a wrapper of sorts:
CREATE VIEW `my_table_view` AS
SELECT *
FROM `table_v1`
Now your users can use my_table_view to query the data in s3://my-data-bucket/v1/ instead of table_v1. When you want to update the data, you can simply upload it to s3://my-data-bucket/v2/ and define a table table_v2. Next, you need to update your my_table_view view, since all queries are run against it:
CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`
After this is done, you can drop table_v1 and delete the files from s3://my-data-bucket/v1/. Provided that the data schema hasn't changed, all queries that ran against the my_table_view view while it was based on table_v1 should still be valid and succeed after my_table_view has been replaced.
I don't know what the downtime of replacing a view would be, but I'd expect it to be less than a second, which is definitely less than the time it takes to upload new files.
What most people want to do is probably MSCK REPAIR TABLE <table_name>.
This updates the metadata if you have added more files to the location, but it is only available if your table has partitions.
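For example, if the table is declared with PARTITIONED BY and the new files land under Hive-style prefixes such as s3://my-data-bucket/dt=2019-01-04/, you can also trigger the repair from code; a minimal sketch assuming boto3, with placeholder database, table, and result-bucket names:

import boto3

athena = boto3.client("athena")

# Re-scan the table's S3 location and register any newly added partitions.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)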
You might also want to do this with a Glue Crawler, which can be scheduled to refresh the table with new data.
Relevant documentation.
Related
I have a table in BigQuery with 100 columns. Now I want to append more rows to it via a Transfer, but the new CSV has only 99 columns. How should I proceed with this?
I tried creating a schema and adding that column as NULLABLE, but it didn't work.
I am presuming your CSV file is stored in a GCS bucket and that you are trying to use the BigQuery Data Transfer Service to load the data periodically by scheduling it.
You cannot directly load/append the data into the BigQuery table because of the schema mismatch.
As an alternative, create a staging table named staging_table_csv with 99 columns and schedule a Data Transfer Service run to load the CSV into this table in overwrite mode.
Now write a query to append the contents of this staging table staging_table_csv to the target BigQuery table.
The query might look like this:
#standardSQL
INSERT INTO `project.dataset.target_table`
SELECT
*,
<DEFAULT_VALUE> AS COL100
FROM
`project.dataset.staging_table_csv`
Now schedule this query to run after the staging table is loaded.
Make sure to keep a buffer between the staging table load and the target table load; you can run a few trials to find a suitable buffer.
For example: if the transfer is scheduled at 12:00, schedule the target table load query at 12:05 or 12:10.
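If you prefer to set up the scheduled append programmatically rather than in the console, a rough sketch using the google-cloud-bigquery-datatransfer client (the project, dataset, and schedule values are placeholders, and <DEFAULT_VALUE> is the same placeholder as in the query above):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")

# Scheduled query that appends the staging table to the target table at 12:05 UTC,
# leaving a buffer after the 12:00 transfer that loads the staging table.
transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="Append staging_table_csv to target_table",
    data_source_id="scheduled_query",
    params={
        "query": (
            "INSERT INTO `project.dataset.target_table` "
            "SELECT *, <DEFAULT_VALUE> AS COL100 "
            "FROM `project.dataset.staging_table_csv`"
        ),
    },
    schedule="every day 12:05",
)

client.create_transfer_config(parent=parent, transfer_config=transfer_config)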
Note: Creating an extra staging table will incur storage costs, but since it is overwritten on each load, no historical data cost is incurred.
I'd like to share my requirement and ask how it can best be solved.
I have an SQL query, say "SQL_QUERY_RUNS_AND_GIVES_RESULT_SET", which runs and passes the result set to a dataframe. Since the result set is huge, I create several partitions out of it and save the CSV files in a folder called "/tmp/CSV_FILE_NAME.csv". Below is a sample list of the partitioned CSV files that get generated in that folder.
df_3 = sqlContext.sql("""
SQL_QUERY_RUNS_AND_GIVES_RESULT_SET
""")
rowsPerPartition = 10000
partitions = int(1 + df_3.count() / rowsPerPartition)
df_3.repartition(numPartitions=partitions) \
    .write.mode("overwrite") \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .save("/tmp/CSV_FILE_NAME.csv")
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00002-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-169-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00001-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-168-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00003-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-170-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00000-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-167-1-c000.csv
Now, my requirement is to create a table for each of these partitioned CSV files. Since I do not know how many partitioned CSV files will result, the number of tables will be dynamic. Also, the table names can be made simpler, like CSV_FILE_NAME_{part-0001}.
That is, I'd like to pick just the first two words from each partitioned CSV file name. Also, the tables must be recreated every time I execute this whole code snippet in the notebook.
The reason for doing this: while the partitioned files are being imported into some other technology, some are failing and not all rows are being imported, so I want to focus my analysis on the partitions that fail. If I create a single table out of the dataframe that the SQL query produces, it becomes tedious to identify the root cause. Since I am re-creating the tables each time the code snippet runs, there is no issue with storage being consumed.
I think the command below helps in creating a table, but I am not sure exactly how the looping should be done for my requirement. I'd appreciate any mindshare on this.
# Create the table.
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
Add a column to your dataset and set its value to the partition number (a sketch of one way to do this follows the example output below).
For example,
| row  | partition |
|------|-----------|
| data | 1         |
| data | 1         |
| data | 2         |
| ...  | ...       |
and then use partitionBy to create the csv.
df.write.partitionBy('partition').csv('test')
The result will be saved as follows:
.
└── test
├── partition=1
│ └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
├── partition=2
│ └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
└── _SUCCESS
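One way to add that partition column while keeping the output path and CSV options from the question (a sketch; using spark_partition_id() after repartition() is just one convenient way to number the buckets):

from pyspark.sql import functions as F

rowsPerPartition = 10000
partitions = int(1 + df_3.count() / rowsPerPartition)

# Tag each row with the Spark partition it lands in after repartitioning,
# then let partitionBy() split the output into one folder per value.
df_tagged = (df_3.repartition(partitions)
                 .withColumn("partition", F.spark_partition_id()))

(df_tagged.write
          .mode("overwrite")
          .partitionBy("partition")
          .option("header", "true")
          .option("delimiter", ";")
          .csv("/tmp/CSV_FILE_NAME.csv"))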
Then you can fix the paths to the partition folders and create the tables with a for loop once, rather than every time you run the snippet.
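A minimal sketch of that loop, assuming a Databricks notebook (so spark and dbutils are available); the save path and the csv_file_name_part_N naming scheme are illustrative:

# Register one table per partition folder; safe to re-run.
save_path = "/tmp/CSV_FILE_NAME.csv"

for entry in dbutils.fs.ls(save_path):
    if not entry.name.startswith("partition="):
        continue  # skip _SUCCESS and any other non-partition files
    partition_id = entry.name.rstrip("/").split("=")[1]
    table_name = "csv_file_name_part_{}".format(partition_id)
    spark.sql("DROP TABLE IF EXISTS {}".format(table_name))
    spark.sql(
        "CREATE TABLE {} USING CSV "
        "OPTIONS ('header'='true', 'delimiter'=';') "
        "LOCATION '{}'".format(table_name, entry.path)
    )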
My requirement is as follows: I have two storage accounts, sa01 and sa02. Say sa01 has 10 files and sa02 also has 10 files at 01:00 AM. Now I upload 4 more files to sa01 at 01:15 AM, and my copy activity runs automatically because I have implemented an event trigger, so it copies the 4 files to sa02.
Question - it copies the 4 new files but also re-copies the previous 10 files, so I end up with 14 files written at 01:15 AM. The requirement is that, since 10 files were already copied at 01:00 AM, only the 4 latest files should be inserted into sa02.
See the timings in the image: I uploaded just one file, yet the modified time of all the files has changed.
Azure Data Share is one good way to accomplish this. It is typically used to sync storage with a partner company, but you can also sync within your own subscription. There is no code to write; there is a UI and a sync schedule.
You can use a Metadata activity to get the lastModified of the destination folder.
In your Copy activity, put dynamic content in the "Filter by last modified: start time" field and choose the lastModified field from the output of the Metadata activity.
Only files in the source newer than the destination's lastModified will be copied.
The Metadata activity costs only tiny fractions of a penny.
I am troubleshooting an application issue on an external (unmanaged) table that was created using the CREATE TABLE X LIKE PARQUET syntax via Cloudera Impala. I am trying to determine the location of the files comprising the partitions of the external table, but I am having difficulty determining how to do this, or finding documentation describing it.
If I do a:
show create table T1;
I see the hive-managed location such as:
LOCATION 'hdfs://nameservice1/user/hive/warehouse/databaseName'
If I do a:
describe formatted T1;
I see that the table is in fact external, but it doesn't give any insight into the unmanaged location.
| Table Type: | EXTERNAL_TABLE
| Location: | hdfs://nameservice1/user/hive/warehouse/databaseName/T1
Question:
How do I determine the Location/URI/Parent Directory of the actual external files that comprise this External Table?
When you create an external table with Impala or Hive and you want to know the location, you should provide the HDFS location explicitly, for example:
CREATE EXTERNAL TABLE my_db.table_name (
  column1 STRING
)
LOCATION 'hdfs_path';
If you don't provide this, the likely location of these files is under the directory of the user that executed the CREATE TABLE command.
For more detail you can see this link:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html
I hope this helps!
An external table in HIVE is partitioned on year, month and day.
So does the following query delete data from the external table for the specific partition referenced in the query?
ALTER TABLE MyTable DROP IF EXISTS PARTITION(year=2016,month=7,day=11);
The partitioning scheme is not data. The partitioning scheme is part of the table DDL stored in the metadata (simply put: the partition key value plus the location where the data files are stored).
The data itself is stored in files in the partition location (folder). If you drop a partition of an external table, the location remains untouched, but it is unmounted as a partition (the metadata about this partition is deleted). You can have a few versions of a partition location unmounted (for example, previous versions).
You can drop a partition and mount another location as the partition (ALTER TABLE ... ADD PARTITION), or change an existing partition's location. Also, dropping an external table does not delete the table/partition folders or the files in them, and later you can create a table on top of this location.
Have a look at this answer for a better understanding of the external table/partition concept: it is possible to create many tables (both managed and external, at the same time) on top of the same location in HDFS.
No. An external table holds only references, and it is these references that will be deleted; the actual files will still persist at the location.
External table data files are not owned by the table, nor are they moved to the Hive warehouse directory.
Only the partition metadata will be deleted from the Hive metastore tables.
Difference between internal and external tables:
For External Tables -
An external table stores files on the HDFS server, but the table is not completely linked to the source file.
If you delete an external table, the file still remains on the HDFS server.
As an example, if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to the HDFS file structure, and therefore security needs to be managed at the HDFS file/folder level.
Metadata is maintained on the master node, and deleting an external table from HIVE only deletes the metadata, not the data/file.
For Internal Tables -
Internal tables are stored in a directory based on the hive.metastore.warehouse.dir setting; by default this is “/user/hive/warehouse”. You can change it by updating the location in the config file.
Deleting the table deletes the metadata and data from the master node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (this varies from organisation to organisation).
Hive may have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.
You are not creating a table based on an existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Note: these are the metastore tables you will see if you look into the configured metastore database:
| BUCKETING_COLS |
| COLUMNS |
| DBS |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SORT_COLS |
| TABLE_PARAMS |
| TBLS |