I have a SparkContext and a bunch of files that I want to access using the textFile method. The logic to find which files I need to access is complex and requires careful testing.
In my testing environment (using pytest), all the Python files and the data files the SparkContext needs to access are on my local machine. The problem I encounter is that creating a SparkContext object in this testing environment and loading data using textFile often fails with a timeout exceeding 1 minute (cf. Mock a Spark RDD in the unit tests).
In my testing environment, how can I test that the file patterns I send to textFile are right, and therefore ensure that textFile returns an RDD built from the right files?
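To make the question concrete, here is a minimal sketch of the kind of test I would like to be able to write; build_input_rdd, its signature, and the expected pattern are hypothetical stand-ins for my real pattern-building code. The idea is to pass a mocked SparkContext so the pattern logic runs without starting Spark at all:

from unittest import mock

from myproject.loading import build_input_rdd  # hypothetical function under test

def test_build_input_rdd_uses_expected_pattern():
    # A mocked SparkContext: no real Spark startup, so no timeout.
    sc = mock.Mock(name="SparkContext")

    rdd = build_input_rdd(sc, date="2020-01-01")  # hypothetical signature

    # Check that textFile received exactly the pattern we expect, and that
    # whatever it returned is what the helper hands back.
    sc.textFile.assert_called_once_with("/data/2020/01/01/*.gz")  # hypothetical pattern
    assert rdd is sc.textFile.return_value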
I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"
I want to refer to these objects in the catalog by their S3 URI so that my team can use them. HOWEVER, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?
This is a good question. Kedro has CachedDataSet for caching datasets within a single run: it keeps the dataset in memory when it is used/loaded multiple times in the same run. There isn't really an equivalent that persists across runs; in general, Kedro doesn't do much that is persistent.
That said, off the top of my head, I can think of two options that (mostly) replicate or give this functionality:
Use the same catalog in the same config environment, but with the TemplatedConfigLoader, so that your catalog datasets have filepaths that look something like:
my_dataset:
  filepath: ${base_data}/01_raw/blah.csv
and you set base_data to s3://bucket/blah when running in "production" mode and to a local path (e.g. local_filepath/data) when running locally. You can decide exactly how you do this in your overridden context method, whether via local/globals.yml (see the TemplatedConfigLoader documentation), environment variables, or what not (a minimal sketch of this wiring is shown at the end of this answer).
Use separate environments, most likely local (it's kind of what it was made for!), where you keep a separate copy of your catalog in which the filepaths are replaced with local ones.
Otherwise, your next best bet is to write a PersistentCachedDataSet, similar to CachedDataSet, which intercepts loading/saving for the wrapped dataset, makes a local copy in a deterministic location the first time it loads, and looks that copy up on subsequent loads.
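For the first option, a minimal sketch of the wiring, assuming a Kedro 0.16/0.17-era project layout (the globals.yml values shown are made up):

from kedro.config import TemplatedConfigLoader
from kedro.framework.context import KedroContext

class ProjectContext(KedroContext):
    def _create_config_loader(self, conf_paths):
        # Fill ${...} placeholders in catalog.yml from any *globals.yml found in
        # the active config environments (conf/base, conf/local, ...).
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")

# conf/base/globals.yml   ->  base_data: s3://my_bucket/data
# conf/local/globals.yml  ->  base_data: data    # local value wins when present

Because conf/local overrides conf/base, a local run picks up the local base_data while production runs fall back to the S3 one.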
I'm using the Mosaic Decisions data flow feature to read a file from Azure Blob storage, do a few transformations, and write that data back to Azure. It worked fine, except that at the output path I specified it created a folder, and inside I can see many files with strange "part-000" etc. in their names. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the dataframe being read is split into multiple partitions, and these partitions are written to the output location in parallel. That's why it creates multiple files named "part-0000", "part-0001", etc. at the target location ("part" here represents a partition).
The workaround is to check the "combine-output-files-into-one" option in the writer node. This will combine all of the part files into one big file. But use this with caution, and only if you really need a single file, as it comes with a performance tradeoff.
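For reference, the equivalent behaviour in plain PySpark looks roughly like this (paths are hypothetical); it shows why multiple part files appear and what a "write one file" option effectively does:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("wasbs://container@account.blob.core.windows.net/input/", header=True)

# Default: each partition is written by its own task, in parallel, giving
# part-00000, part-00001, ... in the output folder.
df.write.mode("overwrite").csv("wasbs://container@account.blob.core.windows.net/output_many/")

# Collapsing to a single partition first yields a folder containing just one
# part file, but every row now flows through a single task, hence the tradeoff.
df.coalesce(1).write.mode("overwrite").csv("wasbs://container@account.blob.core.windows.net/output_single/")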
I have been dealing with this problem for a week.
I use the command
from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")
I get an "invalid parquet magic" error.
However, ddf.read_parquet works fine with "webhdfs://" URLs.
I would like ddf.read_parquet to work over HTTP because I want to use it on a dask-ssh cluster whose workers have no HDFS access.
Although the comments already partly answer this question, I thought I would add some information as an answer.
HTTP(S) is supported by Dask (actually fsspec) as a backend filesystem; but to get partitioning within a file you need to know the size of that file, and to resolve globs you need to be able to get a list of links, neither of which is necessarily provided by any given server.
webHDFS (or indeed httpFS) doesn't work like plain HTTP downloads: you need to use a specific API to open a file and then fetch a final URL to that file on a cluster member; so the two methods are not interchangeable.
webHDFS is normally intended for use from outside the Hadoop cluster; within the cluster, you would probably use plain HDFS ("hdfs://"). However, Kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
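To illustrate the difference, the working form uses the webhdfs:// protocol rather than a raw http://.../webhdfs/v1/... URL (host, port and path below are hypothetical; the kerberos option is an assumption that only applies on a secured cluster):

import dask.dataframe as ddf

df = ddf.read_parquet(
    "webhdfs://namenode.example.com:9870/user/someuser/data/*.parquet",
    storage_options={"kerberos": True},  # omit on an unsecured cluster
)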
We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files instead, but can ConcurrentAppend write to a file that was output by a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, after the join (concatenation) operation, no append operations can be executed on the resulting file, so ConcurrentAppend will not work on it. We are currently working on a feature to remove this limitation.
I am trying to run a twisted.trial.TestCase that depends on resource folders (images, for instance) that reside alongside my Python package called test. Unfortunately, the temporary directory that gets created upon running the test runner (i.e. issuing trial test) doesn't include (naturally) a copy of the whole original working directory, and my tests fail because the images cannot be found. The function of the software is heavily dependent on those images, so they'll need to be a part of testing.
The question is, is there a way to customize the _trial_temp directory that the test runner normally creates from scratch so that it includes certain files and folders, besides what the test runner itself thinks it needs?
No.
Don't do it this way. If you need data from your project, it is not in any sense temporary data. If you point trial at a directory using --temp-directory, it will assume it is in fact "temporary" and will blow it away. Instead, you should access the data relative to the path of the tests.
If you put your sample data into the same directory as your tests, and treat it as package_data, you can do this:
from twisted.python.modules import getModule

# Locate the directory on disk that contains this test module's package.
thisModule = getModule(__name__)
dataPath = thisModule.filePath.parent()
and to get data in your tests:
fileobj = dataPath.child("sample_file.data").open()        # file object for streaming reads
databytes = dataPath.child("other_file.txt").getContent()  # whole contents as bytes
so keep your temporary directories and your sample data separate.
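As a minimal sketch of the package_data side (project and package names are hypothetical), a setuptools configuration like the following ships the sample files alongside the test modules, so the FilePath lookup above finds them wherever the package is installed:

from setuptools import setup, find_packages

setup(
    name="myproject",
    packages=find_packages(),
    package_data={
        # sample images and data files that live next to the test modules
        "myproject.test": ["*.data", "*.txt", "images/*.png"],
    },
)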