Is there a way to use a UDF in Redshift, execute a SQL query, and upload the result to AWS S3? Would really appreciate it if someone knows how to do this.
Thanks
To create a UDF in Redshift, you can use Python. You can then call the function in a SQL SELECT statement. To output the results of a query to a file in S3, you can use the UNLOAD statement.
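A rough sketch of how the two pieces fit together (the function, table, bucket, and IAM role names below are placeholders, not anything from your setup):

CREATE OR REPLACE FUNCTION f_normalize (value FLOAT, max_value FLOAT)
RETURNS FLOAT
STABLE
AS $$
    # Scalar Python UDF: Redshift calls this once per row
    if max_value is None or max_value == 0:
        return None
    return value / max_value
$$ LANGUAGE plpythonu;

-- Run a SELECT that uses the UDF and write the result set to S3
UNLOAD ('SELECT id, f_normalize(score, 100.0) FROM my_table')
TO 's3://my-bucket/exports/scores_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
CSV;

UNLOAD writes the output files under the given S3 prefix; the IAM role you pass must allow Redshift to write to that bucket.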
When I add a file to S3 and run a query against Athena, Athena returns the expected result with the data from this file.
Now if I then delete that same file from S3 and run the same query, Athena still returns the same data even though the file is not in S3 anymore.
Is this the expected behaviour? I thought Athena calls out to S3 on every query, but I'm now starting to think there is some sort of caching going on?
Does anyone have any ideas? I can't find any information online about this.
Thanks for the help in advance!
Athena (via the Hive/Glue catalog) only loads partitions periodically. If you want to query the latest data, you need to run

MSCK REPAIR TABLE table_name;

to refresh the partition metadata Athena uses.
Thanks for the help guys.
I actually was looking at the wrong files in S3 and the files I thought were removed were still present. Once I deleted them from S3, the query against Athena returned the expected results immediately.
Thanks!
I need to create procedural logic over data stored in AWS S3, using Athena or Glue.
It is actually a migration of a stored procedure from SQL Server to AWS, but I don't know which AWS service to implement it with; it doesn't use a database, just tables over S3.
Thank you very much for guiding me on how to do it.
Athena doesn't support stored procedures; however, you can leverage UDFs to implement the same logic as in your source stored procedure.
Below is the syntax for a UDF; refer to the Athena documentation for more information:
USING EXTERNAL FUNCTION UDF_name(variable1 data_type[, variable2 data_type][,...])
RETURNS data_type
LAMBDA 'lambda_function'
SELECT [...] UDF_name(expression) [...]
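For example, assuming a Lambda function named my-athena-udf-handler that implements a lower-casing routine (both the function and the table here are made up for illustration):

USING EXTERNAL FUNCTION to_lower(input VARCHAR)
RETURNS VARCHAR
LAMBDA 'my-athena-udf-handler'
SELECT to_lower(customer_name)
FROM sales_data;

The Lambda function receives rows in batches and returns one output value per input row, so the per-row procedural logic from your stored procedure can live inside that function.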
Is there a way for us to check how frequently a table has been accessed/queried in AWS Redshift?
The frequency could be daily/monthly/hourly, whatever. Can someone help me?
It could be SQL queries using system tables from AWS Redshift or some Python script. What is the best way?
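One possible approach (just a sketch built on the standard Redshift system views stl_scan and svv_table_info) is to count how many distinct queries scanned each table per day:

-- How many distinct queries touched each table, per day
SELECT ti."table" AS table_name,
       DATE_TRUNC('day', s.starttime) AS scan_day,
       COUNT(DISTINCT s.query) AS query_count
FROM stl_scan s
JOIN svv_table_info ti ON s.tbl = ti.table_id
GROUP BY 1, 2
ORDER BY scan_day, query_count DESC;

Keep in mind that STL system tables only retain a few days of history, so for monthly trends you would need to snapshot this result into your own table on a schedule.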
I have a Lambda function in which I am fetching a CSV file from S3. Now I want to run a SQL query on that CSV, or on JSON (after converting the CSV into JSON). What is the best and easiest approach for this in Node.js? Since I want to use a GROUP BY query, S3 Select is not possible.
I found the module "querycsv" in Python, so I changed the code over to Python. https://pythonhosted.org/querycsv/
Take a look at AWS Athena, which lets you run more complex queries on the files in S3.
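As a sketch (the bucket path and columns are placeholders), you would define an external table over the CSV files and then query it freely, including GROUP BY:

-- External table over the CSV files; the data stays in S3
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    order_id STRING,
    category STRING,
    amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- GROUP BY works here, unlike with S3 Select
SELECT category, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM orders
GROUP BY category;

From a Node.js Lambda you can start such a query with the AWS SDK's Athena StartQueryExecution call and read the results from the query's S3 output location.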
I've been trying to store CSV data into a table in a database using a Pig script.
But instead of inserting the data into a table in the database, I ended up creating a new file in the metastore.
Can someone please let me know if it is possible to insert data into a database table with a Pig script, and if so, what that script might look like?
You can take a look at DBStorage, but be sure to REGISTER the JDBC jar in your Pig script and declare the storage UDF, as in the sketch below.
The documentation for the storage UDF is here:
http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/DBStorage.html
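A minimal sketch, assuming a MySQL target (the driver path, connection details, and table layout are all placeholders):

-- Register the JDBC driver and the piggybank jar that ships DBStorage
REGISTER /path/to/mysql-connector-java.jar;
REGISTER /path/to/piggybank.jar;

data = LOAD 'input.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- DBStorage ignores the INTO location; the INSERT statement drives the write
STORE data INTO 'ignored' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver',
    'jdbc:mysql://dbhost:3306/mydb',
    'dbuser',
    'dbpassword',
    'INSERT INTO my_table (id, name) VALUES (?, ?)');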
Alternatively, if the target table is defined in the Hive metastore, you can use HCatStorer:

STORE data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();