How to set FileIO writeDynamic name with input fields?

I'm using Dataflow to load some CSV files into Google Cloud Storage, and I need to save them into different directories based on data values (like uuid, region, etc.).
How can I do this? Currently I'm able to add the key (from the KV) to the path, but I also need other information that is currently only available in the values.
At the moment this saves data to gs://my-bucket/<uuid>/extraction.csv, but I need something like gs://my-bucket/<uuid>/<region>/<store>/extraction.
Example csv:
uuid,region,store,....
123e4567-e89b-12d3-a456-426614174000,central,store1,foo,bar
.apply("Write CSV files",
FileIO.<String, KV<String, String>>writeDynamic()
.by(KV::getKey)
.to("gs://my-bucket")
.withDestinationCoder(StringUtf8Coder.of())
.withNumShards(1)
.via(Contextful.fn(KV::getValue), TextIO.sink())
.withNaming(key -> FileIO.Write.defaultNaming(String.format("%s/extraction",key),"csv"))
);

You would need to make <region> and <store> part of the key as well (i.e. derive the destination key from the value rather than only from KV::getKey), and then generate the right path in the function that you pass to withNaming.
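For example, a minimal sketch building on the snippet above, assuming the value is the raw CSV line with uuid, region and store as its first three columns (header rows would have to be filtered out earlier in the pipeline):

.apply("Write CSV files",
    FileIO.<String, KV<String, String>>writeDynamic()
        // the destination key becomes "<uuid>/<region>/<store>"
        .by(kv -> {
            String[] cols = kv.getValue().split(",");
            return cols[0] + "/" + cols[1] + "/" + cols[2];
        })
        .withDestinationCoder(StringUtf8Coder.of())
        .withNumShards(1)
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("gs://my-bucket")
        // the key already carries everything the path needs
        .withNaming(key -> FileIO.Write.defaultNaming(String.format("%s/extraction", key), "csv"))
);

With one shard and defaultNaming, the files land under gs://my-bucket/<uuid>/<region>/<store>/ while the rest of the pipeline stays unchanged.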

Related

Databricks: Adding path to table from csv

In Databricks I have several CSV files that I need to load. I would like to add a column to my table with the file path, but I can't seem to find that option.
My data is structured like this:
FileStore/subfolders/DATE01/filenameA.csv
FileStore/subfolders/DATE01/filenameB.csv
FileStore/subfolders/DATE02/filenameA.csv
FileStore/subfolders/DATE02/filenameB.csv
I'm using this SQL in Databricks, since it can loop through all the dates and load every filenameA into clevertablenameA, every filenameB into clevertablenameB, and so on.
DROP view IF EXISTS clevertablenameA;
create temporary view clevertablenameA
USING csv
OPTIONS (path "dbfs:/FileStore/subfolders/*/filenameA.csv", header = true)
My desired outcome is something like this
col1 | col2|....| path
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
Is there a clever option, or should I load my data another way?
The function input_file_name() could be used to retrieve the file name while reading.
SELECT *, input_file_name() as path FROM clevertablenameA
Note that this does not add a column to the view; it merely returns the name of the file being read.
Refer to the link below for more information.
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/functions/input_file_name
Alternatively, you could read the files in a PySpark/Scala cell, add the file name with .withColumn("path", input_file_name()), and then create the view on top of that DataFrame, as in the sketch below.
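A rough sketch of that approach (the wildcard path follows the folder layout from the question):

from pyspark.sql.functions import input_file_name

# Read every filenameA.csv across the date folders and tag each row
# with the file it came from
df = (spark.read
        .option("header", True)
        .csv("dbfs:/FileStore/subfolders/*/filenameA.csv")
        .withColumn("path", input_file_name()))

# Expose it to SQL, like the temporary view in the question
df.createOrReplaceTempView("clevertablenameA")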

Read specific file names in ADF pipeline

I have a requirement where blob storage has multiple files with the names file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only file names 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset and pass '*' in the dataset parameters to get all files.)
Pass the Get Metadata output child items to a ForEach activity:
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and set the True-case expression so that only the required files are copied to the sink:
@and(greater(int(substring(item().name,4,1)),4),lessOrEquals(int(substring(item().name,4,1)),7))
When the If Condition is true, add a Copy Data activity to copy the current item (file) to the sink.
I took a slightly different approach using a Filter activity and the endsWith function.
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results, it depends what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters for the lower and upper range.
Create a ForEach loop and pass the parameters to create a range [lowerlimit, upperlimit].
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to create a dynamic file name expression like
@concat('file', item(), '.csv')

How can I read files from S3 using PySpark that were created after a particular time

I need to read JSON files from S3 using PySpark. The S3 location may contain hundreds of thousands of files, and every file has the same metadata. But each time I need to read only the files that were created after a particular time. How can I do this?
If you have access to the system that creates these files, the simplest way to approach this would be to add a date partition when you write them:
s3://mybucket/myfolder/date=20210901/myfile1.json
s3://mybucket/myfolder/date=20210901/myfile2.json
s3://mybucket/myfolder/date=20210831/myfileA.json
Then you can read them with a filter; PySpark will only load the files that it needs into memory.
from pyspark.sql import functions as F

path = "s3://mybucket/myfolder/"   # root folder above the date= partitions
start_dt = '20210831'
end_dt = '20210901'

df = (
    spark
    .read
    .json(path)
    .filter(F.col("date").between(start_dt, end_dt))
)
Note that I have not explicitly tested this with JSON files, just with Parquet, so this method may need to be adapted.
If you don't have access to change how the files are written, I don't think PySpark has direct access to the metadata of the files. Instead, you will want to query S3 directly using boto3 to generate a list of files, filter them using the boto3 metadata, and then pass the list of files into the read method:
# generate this by querying via boto3
recent_files = ['s3://mybucket/file1.json', 's3://mybucket/file2.json']
df = spark.read.json(recent_files)
Info about listing files from boto3.
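For reference, a rough sketch of that listing-and-filtering step (bucket and prefix are just the placeholder names from the example above; the cutoff is whatever your "particular time" is):

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
cutoff = datetime(2021, 8, 31, tzinfo=timezone.utc)  # your "particular time"

recent_files = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mybucket", Prefix="myfolder/"):
    # LastModified is timezone-aware, so the cutoff must be too
    for obj in page.get("Contents", []):
        if obj["LastModified"] > cutoff:
            recent_files.append("s3://mybucket/" + obj["Key"])

df = spark.read.json(recent_files)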
You can provide the modifiedAfter and modifiedBefore parameters to the DataFrameReader.json function.
modifiedBefore: an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the format YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
modifiedAfter: an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the format YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
Example
from datetime import datetime
# Fill this variable with your last date
lowerbound = datetime(2021, 9, 1, 13, 0, 0)
# Current execution
upperbound = datetime.now()
df = spark.read.json(source_path,
                     modifiedAfter=lowerbound.strftime('%Y-%m-%dT%H:%M:%S'),
                     modifiedBefore=upperbound.strftime('%Y-%m-%dT%H:%M:%S'))
As noted in the discussion on Kafels' answer, modifiedBefore and modifiedAfter don't work with S3 as a data source. This is a real shame!
The next best alternative is to use boto3 to list all objects in the partition, and then filter the results on the lastModified element in the results. The results don't contain a creation timestamp so lastModified is the best you can do. You also need to be careful to handle pagination given the large number of objects.
Something like this should work to retrieve the matching keys:
import boto3

s3 = boto3.client("s3")

def get_matching_s3_keys(bucket, prefix="", after_date=None):
    """
    List keys in an S3 bucket that match the specified criteria.

    :param bucket: Name of the S3 bucket.
    :param prefix: Only get objects whose key starts with this prefix.
    :param after_date: Only get objects that were last modified after this
        date. Note: this needs to be a timezone-aware datetime.
    """
    paginator = s3.get_paginator("list_objects_v2")
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    for page in paginator.paginate(**kwargs):
        try:
            contents = page["Contents"]
        except KeyError:
            break
        for obj in contents:
            last_modified = obj["LastModified"]
            if after_date is None or last_modified > after_date:
                yield obj["Key"]

Is there a way to search for a filename in the MinIO API/SDK?

I'm trying to use MinIO as storage for our manufacturing data. I plan to create a bucket named "color" and store measurement data files such as 160f33fa03fa8.csv (it's just a unique ID from our other system), and it would be nice to put the date into the filename to organize it, like 2020_04_160f33fa03fa8.csv, or to create subfolders like color/2020/04/160f33fa03fa8.csv.
Is there any way to search with the API/SDK for the filename 160f33fa03fa8.csv without knowing its specific path? Or is there any way to store the date on the object and do a simple search, for example, for all objects with date 2020/04?
Thank you
You can use the MinIO client mc for this:
mc find myminio/mybucket --name "*160f33fa03fa8.csv"
source: https://docs.min.io/docs/minio-client-complete-guide.html
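If you need to do it from the SDK instead, there is no server-side search by filename, but you can list the bucket and filter client-side. A sketch with the MinIO Python SDK (the endpoint and credentials are placeholders):

from minio import Minio

client = Minio("minio.example.com", access_key="...", secret_key="...")

# Walk the whole bucket (recursive descends into "subfolders") and match on the name
matches = [obj.object_name
           for obj in client.list_objects("color", recursive=True)
           if obj.object_name.endswith("160f33fa03fa8.csv")]

# If you encode the date in the key, e.g. color/2020/04/160f33fa03fa8.csv,
# a prefix listing gives you "all objects from 2020/04" cheaply
april = list(client.list_objects("color", prefix="2020/04/", recursive=True))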

Concatenation of multiple file in Qlikview

Is it possible in QlikView to concatenate multiple files from different paths?
Suppose I am loading multiple files from one path and want to concatenate files from other paths that have the same number and names of columns as the first path's files. My question is how I can do that.
Thanks in Advance.
When you say "load a file", I am assuming you mean that you are loading the contents into a table, as you would with a QVD, XML, or Excel file.
If that is the case, and the columns are identical in each load, QlikView will attempt to concatenate them by default when they are loaded in sequence.
Otherwise, name your first table, such as TableName:, then preface the subsequent loads of other files with CONCATENATE(TableName).
Ex:
TableName:
LOAD Col1, Col2
FROM [file.qvd];

CONCATENATE(TableName)
LOAD Col1, Col2
FROM [file2.qvd];
Note: As I mentioned above, since these are in sequence and have identically named columns, QlikView will attempt to autoconcatenate them in my example, so the CONCATENATE line, though still functional, is not required.
I just want to add an example of how to do it when there is a dynamic number of files with a given name spread across multiple directories:
SUB LoadFromFolder (RootDir)
    TRACE Loading data ...;
    TRACE Directory: $(RootDir);
    TRACE ;

    FOR Each FoundFile in FileList(RootDir & '\FileName.xml')
        TRACE Loading data from '$(FoundFile)' ...;
        Data:
        LOAD Prop1,
             Prop2,
             Prop3
        FROM [$(FoundFile)] (XmlSimple, Table is [XmlRoot/XmlTag]);
        TRACE Loaded.;
    NEXT FoundFile

    FOR Each SubDirectory in DirList(RootDir & '\*')
        CALL LoadFromFolder(SubDirectory);
    NEXT SubDirectory

    TRACE ;
END SUB

CALL LoadFromFolder ('C:\Path\To\Dir\WithoutslashAtTheEnd');
As Dickie already said, each time you load into the Data: table, the rows will be appended there.