Databricks - Save partitioned CSV files into respective tables - apache-spark-sql

I would like to share my requirement and how best it can be solved.
I have an SQL query, say SQL_QUERY_RUNS_AND_GIVES_RESULT_SET, which runs and passes its result set to a DataFrame. Since the result set is huge, I create several partitions out of it and save the CSV files in a folder called "/tmp/CSV_FILE_NAME.csv". Below is a sample list of the partitioned CSV files that get generated in that folder.
df_3 = sqlContext.sql("""
SQL_QUERY_RUNS_AND_GIVES_RESULT_SET
""")
rowsPerPartition = 10000
partitions = int(1 + df_3.count() / rowsPerPartition)
df_3.repartition(numPartitions=partitions) \
    .write.mode("overwrite") \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .save("/tmp/CSV_FILE_NAME.csv")
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00002-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-169-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00001-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-168-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00003-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-170-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00000-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-167-1-c000.csv
Now, my requirement is to create a table for each of these partitioned CSV files. Since I do not know how many partitioned CSV files will result, the number of tables will be dynamic. Also, the table name can be made simpler, like CSV_FILE_NAME_{part-0001}.
That is, I would like to pick just the first two words from each partitioned CSV file name. And the tables must be recreated every time I execute this whole code snippet in the notebook.
The reason for doing this: while the partitioned files are being imported into some other technology, some are failing and not all rows are being imported, so I want to focus my analysis on the partitions that fail. If I create a single table out of the DataFrame that runs the SQL query, it becomes tedious to identify the root cause. Since I am re-creating the tables each time the code snippet runs, storage consumption is not an issue.
I think the command below helps in creating a table, but I am not sure exactly how to do the looping for my requirement. I appreciate any mindshare on this.
# Create the table.
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")

Add a column to your dataset and set its value to the partition number.
For example,
| row  | partition |
|------|-----------|
| data | 1         |
| data | 1         |
| data | 2         |
| ...  | ...       |
and then use partitionBy to create the csv.
df.write.partitionBy('partition').csv('test')
The result will be saved as follows:
.
└── test
├── partition=1
│ └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
├── partition=2
│ └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
└── _SUCCESS
Then the path to each partition folder is fixed, so you can create the tables with a for loop once rather than every time; a sketch of such a loop follows.
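A rough, untested sketch of that loop, assuming it runs in a Databricks notebook where spark and dbutils are available and the data was written with .partitionBy('partition') to the folder below; the table names are illustrative.
# List the partition folders and create one table per folder.
base_path = "/tmp/CSV_FILE_NAME.csv"

for entry in dbutils.fs.ls(base_path):
    # Skip _SUCCESS and anything that is not a partition folder.
    if not entry.name.startswith("partition="):
        continue
    part_id = entry.name.rstrip("/").split("=")[1]
    table_name = f"CSV_FILE_NAME_part_{part_id}"   # illustrative naming scheme
    # Recreate the table on every run, as required.
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    # Options mirror the original write above; drop them if you used the plain .csv() writer.
    spark.sql(f"""
        CREATE TABLE {table_name}
        USING CSV
        OPTIONS (header 'true', delimiter ';')
        LOCATION '{entry.path}'
    """)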

Related

How to modify the source file path of a BigQuery external table in Airflow?

There is a process which exports some CSV files to GCS and puts the current datetime into the path, e.g.:
gs://my_gcs_bucket/path/to/export/<current_timestamp>/exported_file_1.csv
gs://my_gcs_bucket/path/to/export/<current_timestamp>/exported_file_2.csv
I want to create an external table for these files before I cast its columns and merge with other tables.
The problem is that I can't implement a DAG in Airflow which can handle the changing timestamps.
I can create an external table by specifying the path (with the current timestamp), but for tomorrow's exports this external table will not read the new files. Yet tomorrow I will need the same project_name.dataset.tablename for processing.
from airflow.contrib.operators.bigquery_operator import BigQueryCreateExternalTableOperator
CreateExternalBqTable = BigQueryCreateExternalTableOperator(
    task_id = "CreateExternalBqTable",
    field_delimiter = '|',
    destination_project_dataset_table = f'{project_name}.{dataset}.{tablename}',
    bucket = BUCKET_URI,
    source_objects = [ 'path/to/export/2021-12-12-12-12-12/exported_file_*' ],
    schema_fields = generate_custom_struct()
)
If I try to run the same table creation task again tomorrow, it will fail because the external table already exists.
I can delete the existing external table and then recreate it so the creation task does not fail, but if the table has already been deleted then the delete task itself will fail:
from airflow.providers.google.cloud.operators.bigquery import BigQueryDeleteTableOperator
DeleteExternalBqTable = BigQueryDeleteTableOperator(
    task_id = "DeleteExternalBqTable",
    deletion_dataset_table = f'{project_name}.{dataset}.{tablename}',
)
What is the good pattern here?
Should I always delete and recreate the external table just because of the new timestamps?
It seems to me very bad and error prone pattern.
Or could I change the path under an existing external table somehow?
I would be much happier if I could separate the initialization and deletion phases of my external table from the daily runs and not always delete or create it. I plan to initialize these tables only once, move the CSV files away once I am done with the processing, and keep the external tables empty until the next run.
(I need a working solution for Airflow 1.x)
In BigQueryDeleteTableOperator you can use the ignore_if_missing argument, as documented in the Airflow docs.
But if you don't want to recreate the table every time, consider using the idempotent operator BigQueryCreateEmptyTableOperator to create the table (it will skip creation if the table already exists) and then GoogleCloudStorageToBigQueryOperator to load the data. This way you separate the creation from the load, as sketched below.
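A rough sketch of that split, assuming Airflow 1.x contrib operators, that these tasks sit inside an existing DAG definition, and that the export path can be matched with a date-based wildcard; the {{ ds }} glob below is a hypothetical stand-in for however the timestamp prefix is actually derived.
from airflow.contrib.operators.bigquery_operator import BigQueryCreateEmptyTableOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# Idempotent: skips creation if the table already exists.
CreateBqTableIfMissing = BigQueryCreateEmptyTableOperator(
    task_id = "CreateBqTableIfMissing",
    project_id = project_name,
    dataset_id = dataset,
    table_id = tablename,
    schema_fields = generate_custom_struct()
)

# Load the current export; source_objects is templated, so the path can change per run.
# BigQuery load URIs allow only one wildcard, so a single '*' covers the rest of the path.
LoadExportedCsv = GoogleCloudStorageToBigQueryOperator(
    task_id = "LoadExportedCsv",
    bucket = BUCKET_URI,
    source_objects = [ 'path/to/export/{{ ds }}*' ],
    destination_project_dataset_table = f'{project_name}.{dataset}.{tablename}',
    schema_fields = generate_custom_struct(),
    field_delimiter = '|',
    write_disposition = 'WRITE_TRUNCATE'
)

CreateBqTableIfMissing >> LoadExportedCsv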

AWS Athena Table Data Update

I have started testing out AWS Athena, and it so far looks good. One problem I am having is about the updating of data in a table.
Here is the scenario: In order to update the data for a given date in the table, I am basically emptying out the S3 bucket that contains the CSV files, and uploading the new files to become the updated data source. However, the period of time during which the bucket is empty (i.e. when the old source is deleted and new source is being uploaded) actually is a bottleneck, because during this interval, anyone querying the table will get no result.
Is there a way around this?
Thanks.
Athena is a web service that allows you to query data that resides on AWS S3. In order to run queries, Athena needs to know the table schema and where to look for the data on S3. All this information is stored in the AWS Glue Data Catalog. This essentially means that each time you get new data, you simply need to upload a new CSV file to S3.
Let's assume that you get new data every day at midnight and you store it in an S3 bucket:
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
└── data-file-2019-01-03.csv
and each of these files looks like:
| date       | volume | product | price |
|------------|--------|---------|-------|
| 2019-01-01 | 100    | apple   | 10    |
| 2019-01-01 | 200    | orange  | 50    |
| 2019-01-01 | 50     | cherry  | 100   |
Then, after you have uploaded them to AWS S3, you can use the following DDL statement to define the table:
CREATE EXTERNAL TABLE `my_table`(
  `date` timestamp,
  `volume` int,
  `product` string,
  `price` double)
LOCATION
  's3://my-data-bucket/'
-- Additional table properties
Now, when you get a new file data-file-2019-01-04.csv and upload it to the same location as the other files, Athena will be able to query the new data as well.
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
├── data-file-2019-01-03.csv
└── data-file-2019-01-04.csv
Update 2019-09-19
If your scenario is that you need to update data in the S3 bucket, then you can try combining views and tables while keeping different versions of the data.
Let's say you have table_v1 that queries data in s3://my-data-bucket/v1/ location. You create a view for table_v1 which can be seen as a wrapper of some sort:
CREATE VIEW `my_table_view` AS
SELECT *
FROM `table_v1`
Now your users can use my_table_view to query data in s3://my-data-bucket/v1/ instead of table_v1. When you want to update the data, you can simply upload it to s3://my-data-bucket/v2/ and define a table table_v2. Next, you need to update your my_table_view view, since all queries run against it:
CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`
After this is done, you can drop table_v1 and delete the files from s3://my-data-bucket/v1/. Provided that the data schema hasn't changed, all queries that ran against the my_table_view view while it was based on table_v1 should still be valid and succeed after my_table_view has been replaced.
I don't know what the downtime of replacing a view would be, but I'd expect it to be less than a second, which is definitely less than the time it takes to upload new files.
What most people want to do is probably MSCK REPAIR TABLE <table_name>.
This updates the metadata if you have added more files to the location, but it is only available if your table has partitions.
You might also want to do this with a Glue Crawler which can be scheduled to refresh the table with new data.
Relevant documentation.
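If you want to kick off that MSCK REPAIR TABLE from code rather than the console, here is a rough boto3 sketch; the database and results-bucket names below are hypothetical placeholders.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},                        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"}   # hypothetical bucket
)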

Hive import only certain file types from hdfs

I'm creating an external table using the SerDe
org.apache.hive.hcatalog.data.JsonSerde
The HDFS folder location contains multiple file formats and I would like to import only the JSON file type. I tried to use
SERDEPROPERTIES ("input.regex" = "*.json")
but it does not seem to work.
Hive has a virtual column named INPUT__FILE__NAME and you can filter on it in the where clause.
select * from --do everything else on this level
(select * from your_table --do filtering in the subquery wrapper
where INPUT__FILE__NAME rlike '\\.json$' --ends with .json
)s
I'm not sure it will work efficiently, because these related JIRAs are unfortunately not implemented yet: Add file pruning into Hive and Selectively include EXTERNAL TABLE source files via REGEX.
If this solution works too slowly because the mappers still read all the files, you may need to copy the required files to a separate folder and build a table on top of it.
I have found one more solution which you can possibly use: SymlinkTextInputFormat.
Create a new location with a file containing a list of all the required files and build a table on top of this location; a sketch of generating such a manifest follows the example below.
For example, the location of the table is "/user/hive/mytable".
There is a file called "/user/hive/mytable/myfile.txt".
Inside the file, there are 2 lines, "/user/myname/textfile1.txt" and "/user/myname/textfile2.txt"
We can do:
CREATE TABLE mytable (...) STORED AS INPUTFORMAT 'org.apache.hadoop.hive.io.SymlinkTextInputFormat' LOCATION '/user/hive/mytable';
SELECT * FROM mytable;
which will return the content of the 2 files: "/user/myname/textfile1.txt" and "/user/myname/textfile2.txt"
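A rough sketch of generating such a manifest for the JSON-only case, assuming the hadoop CLI is available and recent enough to support -ls -C (paths-only output); the source and table folders below are hypothetical, reusing the names from the example above.
import subprocess

source_dir = "/user/myname"           # hypothetical folder with mixed file types
table_dir = "/user/hive/mytable"      # table location used by SymlinkTextInputFormat

# List the source folder and keep only the *.json paths.
listing = subprocess.run(
    ["hadoop", "fs", "-ls", "-C", source_dir],
    capture_output=True, text=True, check=True
).stdout.splitlines()
json_paths = [p for p in listing if p.endswith(".json")]

# Write the manifest locally, then push it into the table location.
with open("symlinks.txt", "w") as f:
    f.write("\n".join(json_paths) + "\n")
subprocess.run(["hadoop", "fs", "-put", "-f", "symlinks.txt", table_dir + "/symlinks.txt"], check=True)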

Loading data from multiple s3 folders selectively into table in Hive

I have an S3 bucket with multiple folders, say A and B, plus some other folders. The folder structure is as below:
s3://buckets/AGGREGATED_STUDENT_REPORT/data/A/,
s3://buckets/AGGREGATED_STUDENT_REPORT/data/B/ etc.
Inside these two folders a daily report gets generated in another folder like run_date=2019-01-01, so the resulting folder structure is something like below:
s3://buckets/AGGREGATED_STUDENT_REPORT/data/A/run_date=2019-01-01/..,
s3://buckets/AGGREGATED_STUDENT_REPORT/data/B/run_date=2019-01-01/..
Now, in Hive, I want to create an external table taking the data generated on the last day of every month in only these two folders, ignoring the others, as follows:
CREATE EXTERNAL TABLE STUDENT_SUMMARY
(
ROLL_NUM STRING,
CLASS STRING,
REMARKS STRING,
LAST_UPDATED STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION 's3://AGGREGATED_STUDENT_REPORT/data/*/run_date=2018-12-31';
But in the above query I am not able to figure out how to process group of selected folders.
Is there any chance you can copy the folders to HDFS?
Two reasons:
a) You can create just one folder in HDFS, copy A, B, etc. into that same HDFS folder, and use it as your location parameter (see the sketch after this list).
b) I am guessing the query performance would be better if the data resides in HDFS rather than S3.
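A rough sketch of option (a), assuming the s3a connector and credentials are configured on the cluster, that the staging path below is a hypothetical placeholder, and that file names do not collide between the A and B folders; for large volumes, hadoop distcp would be the usual choice instead of fs -cp.
import subprocess

run_date = "2018-12-31"                                   # last day of the month to load
staging = "hdfs:///user/hive/staging/student_summary"     # hypothetical single HDFS folder

subprocess.run(["hadoop", "fs", "-mkdir", "-p", staging], check=True)
for folder in ("A", "B"):                                 # only the two folders of interest
    src_glob = ("s3a://buckets/AGGREGATED_STUDENT_REPORT/data/"
                + folder + "/run_date=" + run_date + "/*")
    subprocess.run(["hadoop", "fs", "-cp", src_glob, staging], check=True)
# Then point the STUDENT_SUMMARY table's LOCATION at the staging folder.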

External Table in Hive - Location

The table below returns no data when running a select statement:
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my Hive table to point to a dynamic folder, so that when a MapReduce job puts a part file in a folder, Hive loads it into the table.
Is there any way the location can be made dynamic, like
/user/data/CSV/*/*/*/*/part-*
or would just /user/data/CSV/* do fine?
(The same code works fine when created as an internal table and loaded with the file path, hence there are no issues due to formatting.)
First off, your table definition is missing columns. Second, an external table location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated e.g. on a daily basis by some external process you should consider partitioning your table by date. Then you need to add a new partition to the table when the data is available.
Hive does not iterate through multiple folders, hence for the above scenario
I ran a command-line pipeline that iterates through these multiple folders, cats (prints to the console) all the part files, and then puts the output into a desired location (that Hive points to):
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
does not look correct; I don't think that a table can be created from multiple locations. Have you tried just importing from a single location to confirm this?
It could also be that the delimiter you're using is not correct. If you are using a CSV file to import your data, try delimiting by ','.
You can use an alter table statement to change the locations. In the example below, partitions are based on dates, with data stored in time-dependent file locations. If I want to search many days, I have to add an alter table statement for each location. This idea may extend to your situation quite well. You can create a script that generates these statements using some other technology such as Python (see the sketch after this example).
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (date='20160201') location '/user/data/CSV/20160201/data';
alter table foo add partition (date='20160202') location '/user/data/CSV/20160202/data';
alter table foo add partition (date='20160203') location '/user/data/CSV/20160203/data';
alter table foo add partition (date='20160204') location '/user/data/CSV/20160204/data';
You can use as many add and drop statements as you need to define your locations. Then your table can find data held in many locations in HDFS rather than having all your files in one location.
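A rough sketch of the Python generator mentioned above, assuming the partition column is called date and the data sits under /user/data/CSV/<yyyymmdd>/data; adjust the date range and paths to your layout.
from datetime import date, timedelta

start, end = date(2016, 2, 1), date(2016, 2, 4)
statements = []
day = start
while day <= end:
    d = day.strftime("%Y%m%d")
    statements.append(
        "alter table foo add partition (date='" + d + "') "
        "location '/user/data/CSV/" + d + "/data';"
    )
    day += timedelta(days=1)

print("\n".join(statements))   # paste the output into Hive, or run it via beeline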
You may also be able to leverage a create table like statement to create a schema like the one you have in another table, then alter the table to point at the files you want.
I know this isn't exactly what you want and is more of a work around. Good luck!