I have a requirement to create an Athena table from multiple gzipped ('.gz') files spread across multiple folders in S3.
My folder structure in S3 is as follows: S3 bucket ==> Clients folder ==> one folder per country (US, JAPAN, UK, ... up to 50 countries) ==> 10 to 50 '.gz' files in each country folder.
I need to combine all the '.gz' files from all the region folders and create a single table. I used Glue crawlers and classifiers, but the files are not getting merged into one table.
Please help me with other ways to create a table 'companies_all_regions' in Athena from all of the files.
You could create an Amazon Athena external table whose location points at the top-level Clients prefix of the bucket. All files at that level, and in all sub-folders beneath it, will be included in the table. All of the files will need to be in the same format.
If your CSV files contain commas within a column, then the values for the column would need to be placed "inside double quotes".
If you are able to change the way the files are created, you could choose an alternate column separator, such as the pipe (|) character. That will avoid problems with commas inside field values. You can then configure the table to use the pipe as the separator character.
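As a rough sketch (the column names and types below are placeholders rather than your real schema, and 'your-bucket' stands in for the actual bucket name), the table definition pointing at the top-level Clients prefix could look something like this; Athena reads gzipped text files transparently:
-- Hedged sketch: placeholder columns over pipe-delimited, gzip-compressed text files.
CREATE EXTERNAL TABLE companies_all_regions (
  company_id   string,
  company_name string,
  country      string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://your-bucket/Clients/';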
I have an S3 bucket with 500 csv files that are identical except for the number values in each file.
How do I write a query that grabs dividendsPaid, makes it positive for each file, and sends that back to S3?
Amazon Athena is a query engine that can perform queries on objects stored in Amazon S3. It cannot modify files in an S3 bucket. If you want to modify those input files in-place, then you'll need to find another way to do it.
However, it is possible for Amazon Athena to create a new table with the output files stored in a different location. You could use the existing files as input and then store new files as output.
The basic steps are:
Create a table definition (DDL) for the existing data (I would recommend using an AWS Glue crawler to do this for you)
Use CREATE TABLE AS to select data from that table and write it to a different location in S3. The command can include an SQL SELECT statement that modifies the data (for example, turning the negative dividendsPaid values positive).
See: Creating a table from query results (CTAS) - Amazon Athena
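As an illustration (the table names, output location, and any extra columns are hypothetical; only dividendsPaid comes from your description), the CTAS step could look roughly like this:
-- Hedged sketch: writes a corrected copy of the data to a new S3 location.
CREATE TABLE dividends_fixed
WITH (
  format = 'TEXTFILE',
  field_delimiter = ',',
  external_location = 's3://your-bucket/fixed-output/'
) AS
SELECT
  ABS(dividendsPaid) AS dividendsPaid  -- turn negative values positive
  -- , any other columns the crawler discovered
FROM source_table;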
I have an S3 bucket with multiple folders, say A and B, and there are also some other folders. The folder structure is as below:
s3://buckets/AGGREGATED_STUDENT_REPORT/data/A/,
s3://buckets/AGGREGATED_STUDENT_REPORT/data/B/ etc.
And inside these two folders a daily report gets generated in another folder like run_date=2019-01-01, so the resulting folder structure is something like below:
s3://buckets/AGGREGATED_STUDENT_REPORT/data/A/run_date=2019-01-01/..,
s3://buckets/AGGREGATED_STUDENT_REPORT/data/B/run_date=2019-01-01/..
Now, in Hive, I want to create an external table over the data generated on the last day of every month in only these two folders, ignoring the others, as follows:
CREATE EXTERNAL TABLE STUDENT_SUMMARY
(
ROLL_NUM STRING,
CLASS STRING,
REMARKS STRING,
LAST_UPDATED STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION 's3://AGGREGATED_STUDENT_REPORT/data/*/run_date=2018-12-31';
But in the above query I am not able to figure out how to process group of selected folders.
Any chance you can copy the folders to HDFS?
Two reasons:
a) You can create just one folder in HDFS, copy all of A, B, etc. into that same HDFS folder, and use it as your LOCATION parameter, as in the sketch below.
b) I am guessing the query performance would be better if the data resides in HDFS rather than S3.
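As a sketch of option (a), assuming both run_date=2018-12-31 folders (A and B) have first been copied into a single, hypothetical HDFS directory such as /data/student_summary/2018-12-31 (for example with hadoop distcp or hdfs dfs -cp), the table definition then needs no wildcard:
-- Hedged sketch: LOCATION points at the one consolidated HDFS folder.
CREATE EXTERNAL TABLE STUDENT_SUMMARY
(
  ROLL_NUM     STRING,
  CLASS        STRING,
  REMARKS      STRING,
  LAST_UPDATED STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs:///data/student_summary/2018-12-31';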
Is it possible to create a grok classifier for Parquet files? If so, where can I find examples?
I'm using the AWS Glue Catalog and I'm trying to create external tables on top of Parquet files. I'd like the classifier to split the files according to one of the columns in the files.
All my files have the column "table", and all records in a given file have the same value for it.
My S3 structure is like this:
- s3://my-bucket/my-prefix/table1/...
- s3://my-bucket/my-prefix/table2/...
No, a classifier is not used for conditionally parsing data and routing it to different tables.
You could write a Lambda function, an ECS task, or a Glue job (depending on the processing time) that takes these files and moves them into table-wise folders in an S3 bucket, e.g. s3-data-lake/ingestion/table1, s3-data-lake/ingestion/table2, and so on. Then you can run a crawler over s3-data-lake/ingestion/, which will create all the Glue tables (see the sketch below).
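For reference, the table the crawler registers over one of those folders would be roughly equivalent to a DDL like the sketch below (the column names are placeholders for whatever your Parquet schema actually contains; the bucket and prefix are taken from the example paths above):
-- Hedged sketch: what one of the crawled Parquet tables would roughly look like.
CREATE EXTERNAL TABLE table1 (
  id      string,
  payload string
)
STORED AS PARQUET
LOCATION 's3://s3-data-lake/ingestion/table1/';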
I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should run every day on a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function then moves one file from the raw folder into the queue folder if the queue is empty, and does nothing otherwise.
Dataprep scans the queue folder per the scheduler. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function runs whenever a new file is added to the wrangled folder. It creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and then moves one file from the raw folder to the queue folder, if there is any.
Repeat until all tables are created.
This seems overly complicated to me, and I'm not sure how to handle cases where the Cloud Functions fail to do their job.
Any other suggestion for my use-case is appreciated.
I have a set of ~100 files, each with 50k IDs in them. I want to be able to run a query against Hive that has a WHERE ... IN clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing in Hive instead of referencing an external set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files, e.g. for tab-delimited files:
create external table my_ids (
  id bigint,
  other_col string
)
row format delimited fields terminated by '\t'
stored as textfile
location 'hdfs://mydfs/data/myids';
Now you can use Hive to access this data.
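From there, a query against your main table can filter on those IDs, either with an IN subquery or with a LEFT SEMI JOIN (which older Hive versions require). A minimal sketch, assuming a hypothetical main table my_events with an id column:
-- Keep only the rows of the (hypothetical) main table whose id appears in my_ids.
SELECT e.*
FROM my_events e
LEFT SEMI JOIN my_ids i ON (e.id = i.id);

-- Or, on Hive versions that support IN subqueries:
SELECT e.*
FROM my_events e
WHERE e.id IN (SELECT id FROM my_ids);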