Move specific file from SFTP to S3 using Airflow - amazon-s3

I have a requirement where I need to move a specific file from an SFTP server to an S3 bucket. I am currently using the code below to move the required file if I provide its complete path (including the filename). There are multiple files in the SFTP directory, but I only want to move files that are in .xlsx format or have .xlsx in the filename; please suggest how I can do that. I am using SFTPToS3Operator to get the file.
import pysftp
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow import models
from airflow.providers.amazon.aws.transfers.sftp_to_s3 import SFTPToS3Operator
with DAG("sftp_operators_workflow",
schedule_interval=None,
start_date=days_ago(1)) as dag:
wait_for_input_file = SFTPSensor(task_id="check-for-file",
sftp_conn_id="ssh_conn_id",
path="<full path with filename>",
poke_interval=10)
sftp_to_s3 = SFTPToS3Operator(
task_id="sftp_to_s3",
sftp_conn_id="ssh_conn_id",
sftp_path="<full path with filename>",
s3_conn_id="s3_conn_id",
s3_bucket="<bucket name>",
s3_key="full bucket path with filename")
wait_for_input_file >> sftp_to_s3
The file that needs to be moved has a filename like M.DD.YYYY.xlsx.
I really appreciate all the help & support on this one.
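One way to pick up only the .xlsx files (a minimal sketch, not tested against your setup, assuming the same "ssh_conn_id" and "s3_conn_id" connections and reasonably recent SFTP/Amazon provider packages) is to list the remote directory with SFTPHook, filter on the extension, and upload each match with S3Hook inside a PythonOperator:

from airflow.providers.sftp.hooks.sftp import SFTPHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def move_xlsx_files(**context):
    sftp_hook = SFTPHook(ssh_conn_id="ssh_conn_id")
    s3_hook = S3Hook(aws_conn_id="s3_conn_id")
    remote_dir = "<remote directory>"
    # Only transfer files whose name ends in .xlsx (e.g. M.DD.YYYY.xlsx).
    for filename in sftp_hook.list_directory(remote_dir):
        if filename.endswith(".xlsx"):
            local_path = f"/tmp/{filename}"
            sftp_hook.retrieve_file(f"{remote_dir}/{filename}", local_path)
            s3_hook.load_file(filename=local_path,
                              key=f"<bucket prefix>/{filename}",
                              bucket_name="<bucket name>",
                              replace=True)

move_xlsx = PythonOperator(task_id="move_xlsx_files",
                           python_callable=move_xlsx_files)

Here move_xlsx would replace sftp_to_s3 inside the with DAG block (wait_for_input_file >> move_xlsx); move_xlsx_files, <remote directory> and <bucket prefix> are illustrative placeholders, not names from your environment.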

Related

Error message on google bigquery when I try to import a file

I want to import my CSV files to BigQuery but it doesn't work.
I get this message: c2580321527929797.
I don't know why. The file is clean and works in DBeaver.

Trying to combine all csv files in directory into one csv file

My directory structure is as follows:
>Pandas-Data-Science
    >Demo
    >SalesAnalysis
        >Sales_Data
            >Sales_April_2019.csv
            >Sales_August_2019.csv
            ....
            >Sales_December_2019.csv
So Demo is a new Python file I made, and I want to take all the csv files from Sales_Data and create one csv file in Demo.
I was able to read any particular csv file from Sales_Data into a dataframe:
df = pd.read_csv('./SalesAnalysis/Sales_Data/Sales_August_2019.csv')
So I figured that if I just get the file names and iterate through them, I can concatenate everything into the empty csv file:
import os
import pandas as pd

df = pd.DataFrame(list())
df.to_csv('one_file.csv')

files = [f for f in os.listdir('./SalesAnalysis/Sales_Data')]
for f in files:
    current = pd.read_csv("./SalesAnalysis/Sales_Data/" + f)
So my thinking was that current would create a single csv file, since f prints out the exact string required, i.e. Sales_August_2019.csv.
However, I get an error from current that says: No such file or directory: './SalesAnalysis/Sales_Data/Sales_April_2019.csv'
when clearly I was able to read a csv file with the exact same string. So why does my code not work?
This is probably a problem with your current working directory not being what you expect. I prefer to do these operations with absolute paths, which makes debugging easier:
from pathlib import Path
import pandas as pd

# Resolve the data directory to an absolute path, then read every csv in it.
path = Path('./SalesAnalysis/Sales_Data').resolve()
current = [pd.read_csv(file) for file in path.glob('*.csv')]
demo = pd.concat(current)
You can set a breakpoint to find out exactly what path is.
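If you also want to write the combined data back out to a single csv, as the question asks, a to_csv call on demo should do it; the output filename below is just an example:

demo.to_csv('one_file.csv', index=False)  # write the concatenated frames to one csv file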
try this:
import os
import glob
import pandas as pd
os.path.expanduser("/mydir")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')

Can you use xr.open_mfdataset when reading files from S3 via s3fs?

I'm trying to read multiple netCDF files at once from an S3 bucket with xr.open_mfdataset, using s3fs. Is this possible?
I tried the code below, which works with xr.open_dataset for a single file, but doesn't work for multiple files:
import s3fs
import xarray as xr
fs = s3fs.S3FileSystem(anon=False)
s3path = 's3://my-bucket/wind_data*'
store = s3fs.S3Map(root=s3path, s3=s3fs.S3FileSystem(), check=False)
data = xr.open_mfdataset(store, combine='by_coords')
I'm not sure exactly what S3Map does; the s3fs documentation isn't specific about this.
However, I was able to create a working implementation of this within a Jupyter environment using S3FileSystem.glob() and S3FileSystem.open().
Here's a code sample:
import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(anon=False)
# This generates a list of strings with filenames
s3path = 's3://your-bucket/your-folder/file_prefix*'
remote_files = s3.glob(s3path)
# Iterate through remote_files to create a fileset
fileset = [s3.open(file) for file in remote_files]
# This works
data = xr.open_mfdataset(fileset, combine='by_coords')

How do I write items from the Items file to csv and then append the csv file every time I run the program afterwards

I want to create a csv file, fill it with items from the Items file, and append new data to it every time I run the program afterwards. My aim is to use cron to run it at certain intervals once it has been set up.
import scrapy
import json
from ..items import AnotherddItem
import datetime
import csv
class AnotherddSpider(scrapy.Spider):
    name = 'ddgrab'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/all?source=private']

    csv_columns = ['timestamp', 'sellerId', 'sellerName', 'adUrl']
    dict_data = [timestamp, sellerId, sellerName, adUrl]
    csv_file = 'test.csv'
    with open(csv_file, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for data in dict_data:
            writer.writerow(data)
dict_data contains all the fields in my items.py file. I know what I have done here is wrong because the dict_data fields haven't been defined yet, but I don't know how to access them.
If you really want to append data to your file, you need to create a custom pipeline that checks whether the output file already exists and writes the header line only if needed.
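A minimal sketch of such a pipeline, reusing the csv_file and csv_columns from the question (the class name and module path are placeholders):

import csv
import os

class AppendCsvPipeline:
    csv_file = 'test.csv'
    csv_columns = ['timestamp', 'sellerId', 'sellerName', 'adUrl']

    def open_spider(self, spider):
        # Only write the header when the file is being created for the first time.
        write_header = not os.path.exists(self.csv_file)
        self.file = open(self.csv_file, 'a', newline='')
        self.writer = csv.DictWriter(self.file, fieldnames=self.csv_columns)
        if write_header:
            self.writer.writeheader()

    def process_item(self, item, spider):
        # Scrapy items behave like dicts, so pull out just the configured columns.
        self.writer.writerow({col: item.get(col, '') for col in self.csv_columns})
        return item

    def close_spider(self, spider):
        self.file.close()

You would then enable it in settings.py, e.g. ITEM_PIPELINES = {'<your_project>.pipelines.AppendCsvPipeline': 300}, and drop the file-writing code from the spider itself.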

Reading csv file from s3 using pyarrow

I want to read a csv file located in an S3 bucket using pyarrow and convert it to parquet in another bucket.
I am facing a problem reading the csv file from S3. I tried the code below but it failed. Does pyarrow support reading csv from S3?
from pyarrow import csv
s3_input_csv_path='s3://bucket1/0001.csv'
table=csv.read_csv(s3_input_csv_path)
This throws the error:
"errorMessage": "Failed to open local file 's3://bucket1/0001.csv', error: No such file or directory",
I know we can read the csv file using boto3, then use pandas to convert it into a data frame, and finally convert it to parquet using pyarrow. But with this approach pandas also has to be added to the package, which together with pyarrow pushes the package size beyond the 250 MB limit for Lambda.
Try passing a file handle to pyarrow.csv.read_csv instead of an S3 file path.
Note that future versions of pyarrow will have built-in S3 support, but I am not sure of the timeline (and any answer I provide here will quickly grow out of date, given the nature of StackOverflow).
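A minimal sketch of that file-handle approach, assuming s3fs is installed (the bucket and key names follow the question and are placeholders):

import s3fs
from pyarrow import csv
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()
# Open the S3 object as a file handle and let pyarrow parse the csv from it.
with fs.open("bucket1/0001.csv", "rb") as f:
    table = csv.read_csv(f)
# Write the resulting table out to the other bucket as parquet.
with fs.open("bucket2/0001.parquet", "wb") as f:
    pq.write_table(table, f)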
import pyarrow.parquet as pq
from s3fs import S3FileSystem

s3 = S3FileSystem()  # or s3fs.S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)
s3_input_csv_path = f"s3://bucket1/0001.csv"
dataset = pq.ParquetDataset(s3_input_csv_path, filesystem=s3)
table = dataset.read_pandas()  # returns a pyarrow Table
print(table.to_pandas())

s3_output_csv_path = f"s3://bucket2/0001.csv"
# Writing the table to another bucket
pq.write_to_dataset(table=table,
                    root_path=s3_output_csv_path,
                    filesystem=s3)
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
Example of CSV read:
import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")
It's not possible as of now. But here is a workaround: we can load the data into pandas and cast it to a pyarrow table.
import pandas as pd
import pyarrow as pa
df = pd.read_csv("s3://your_csv_file.csv", nrows=10). #reading 10 lines
pa.Table.from_pandas(df)