Snakemake sample validation schema for a CSV file - jsonschema

I have a samples CSV file with the following columns:
SMFID,Fastq1,Fastq2
I tried to use the following YAML schema to validate it in Snakemake:
$schema: "https://json-schema.org/draft/2020-12/schema"
description: an entry in the sample sheet
properties:
  SMFID:
    type: string
    description: sample name/identifier
  Fastq1:
    type: string
    description: path to fastq file (first mate)
  Fastq2:
    type: string
    description: path to fastq file (second mate)
required:
  - SMFID
  - Fastq1
  - Fastq2
But I get the following error:
WorkflowError: Unsupported data type for validation.
Is there a way to specify that the input file is CSV?

This can be done using snakemake's snakemake.utils.validate function after the CSV has been loaded into a pandas DataFrame.
If the schema you provided is saved as your_schema.yaml and you have samples.csv:
SMFID,Fastq1,Fastq2
my_id,sread1,sread2
Then you can validate in your Snakefile like so:
import pandas as pd
from snakemake.utils import validate
samples = pd.read_csv("samples.csv")
validate(samples, "your_schema.yaml")
This is further described in the docs.
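One gotcha worth noting: pandas may infer non-string dtypes (an all-numeric SMFID column becomes int64, for example), which would then fail the schema's type: string checks. A minimal sketch of the same call with string dtypes forced; dtype=str is the only assumption added to the answer's code:

import pandas as pd
from snakemake.utils import validate

# dtype=str keeps numeric-looking IDs as strings so the
# schema's "type: string" checks still pass
samples = pd.read_csv("samples.csv", dtype=str)
validate(samples, "your_schema.yaml")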

You would need to translate the file from CSV into JSON (or a JSON-compatible format that your evaluator supports).
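For instance, a minimal sketch of that approach outside Snakemake, using pandas plus the jsonschema package directly (the file names are the ones from the question; each CSV row becomes one JSON object):

import pandas as pd
import yaml
from jsonschema import validate

# load the YAML schema as a plain dict
with open("your_schema.yaml") as f:
    schema = yaml.safe_load(f)

samples = pd.read_csv("samples.csv", dtype=str)

# validate each row as a JSON-compatible record against the schema
for record in samples.to_dict(orient="records"):
    validate(instance=record, schema=schema)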

Related

How to read a mounted dbc file in Databricks?

I am trying to read a dbc file in Databricks (mounted from an S3 bucket).
The file path is:
file_location = "dbfs:/mnt/airbnb-dataset-ml/dataset/airbnb.dbc"
How can I read this file using Spark?
I tried the code below:
df = spark.read.parquet(file_location)
But it generates an error:
AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
Thanks for the help!
I tried the code below: df = spark.read.parquet(file_location) But it generates an error:

You are using spark.read.parquet but want to read a dbc file. It won't work this way.
Don't use parquet; use load. Give the file path with the file name (without the .dbc extension) in the path parameter and dbc in the format parameter.
Try the code below:
df = spark.read.load(path='<file_path_with_filename>', format='dbc')
E.g.: df = spark.read.load(path='/mnt/airbnb-dataset-ml/dataset/airbnb', format='dbc')

PySpark: how to get a specific file based on date to load into a dataframe from a list of files

I'm trying to load a specific file from a group of files.
Example: I have files in HDFS named in the format app_name_date.csv, and there are hundreds of files like this in a directory. I want to load a CSV file into a dataframe based on the date.
dataframe1 = spark.read.csv("hdfs://XXXXX/app/app_name_+$currentdate+.csv")
But it throws an error, since $currentdate is not interpolated and it says the file does not exist.
Error:
pyspark.sql.utils.AnalysisException: Path does not exist: hdfs://XXXXX/app/app_name_+$currentdate+.csv
Any idea how to do this in PySpark?
You can format the string with an f-string, using the datetime package:
from datetime import date
formatted = date.today().strftime("%d/%m/%Y")
f"hdfs://XXXXX/app/app_name_{formatted}.csv"
Out[25]: 'hdfs://XXXXX/app/app_name_02/03/2022.csv'
Note that the strftime pattern should match the date format your file names actually use; slashes, as in this example, would be read as extra path separators.
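Putting it together, a minimal sketch that builds the path and loads the file; the "%d-%m-%Y" pattern and the header=True option are assumptions, so adjust them to match how the files are actually named and structured:

from datetime import date

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build today's file name; use a separator-free date pattern so the
# formatted date does not introduce extra path components
current_date = date.today().strftime("%d-%m-%Y")
path = f"hdfs://XXXXX/app/app_name_{current_date}.csv"

dataframe1 = spark.read.csv(path, header=True)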

When using pandas DataFrame.to_csv() with compression='zip', it creates a zip file with two archive files with the EXACT same name

I am trying to save OHLCV (stock pricing) data from a dataframe into a single zipped csv file as follows. My test data is ohlcvData.csv, which I read into a dataframe with
import pandas as pd
df = pd.read_csv('ohlcvData.csv', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')
and when I try to write it to a zip file like so (following stackoverflow.com/questions/55134716):
df.to_csv('ohlcvData.zip', header=False, compression=dict(method='zip', archive_name='ohlcv.csv'))
I get the following warning ...
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\lib\zipfile.py:1473: UserWarning: Duplicate name: 'ohlcv.csv'
return self._open_to_write(zinfo, force_zip64=force_zip64)
and the resultant ohlcvData.zip file contains two files, both named ohlcv.csv, each containing a portion of the results.
When I try to read the zip file back into a dataframe ...
dfRead = pd.read_csv('ohlcvData.zip', header=None, names=['datetime', 'open', 'high', 'low', 'close', 'volume'], index_col='datetime')
... I get the following error...
*File "C:\Users\jeffm\AppData\Roaming\Python\Python37\site-packages\pandas\io\common.py", line 618, in get_handle
"Multiple files found in ZIP file. "
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['ohlcv.csv', 'ohlcv.csv']*
However, when I reduce the number of rows in the input file from 200 to around 175 (for this file structure, how many lines I have to remove varies slightly depending on the data), it works and produces a zip file containing one csv file, which can be loaded back into a dataframe without error. I have tried many different files, with different data and formats, and I still get the same result: any file with over (approx) 175 lines fails and any file with fewer works fine. So it looks like it's splitting the file after a certain size, but from the docs there doesn't appear to be such a setting. Any help on this would be appreciated. Thanks.
This appears to be a bug introduced in 1.2.0, I created a minimal reproducing example and posted an issue: https://github.com/pandas-dev/pandas/issues/39190
import pandas as pd

# enough data to cause chunking into multiple files
n_data = 100000
df = pd.DataFrame(
    {
        'name': ["Raphael"] * n_data,
        'mask': ["red"] * n_data,
        'weapon': ["sai"] * n_data,
    }
)
compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('out.csv.zip', index=False, compression=compression_opts)

# reading back the data produces an error
r_df = pd.read_csv("out.csv.zip")

# passing in compression_opts doesn't work either
r_df = pd.read_csv("out.csv.zip", compression=compression_opts)
Looks like this may be a recent pandas bug. I was having the same issue in pandas 1.2.0. Reverting to 1.1.3 (i.e. what I was using before) solved the issue. I haven't tested 1.1.4 or 1.1.5.
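Until a fixed pandas version is available, one possible workaround is to bypass to_csv's chunked zip writing and build the archive with the standard zipfile module instead; a sketch, assuming df and the file names from the question:

import io
import zipfile

# serialize the whole frame into memory first, so the archive
# receives a single member instead of one per write chunk
buf = io.StringIO()
df.to_csv(buf, header=False)

with zipfile.ZipFile('ohlcvData.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('ohlcv.csv', buf.getvalue())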

Cannot open a csv file

I have a CSV file I need to work on in my Jupyter notebook. Even though I am able to view the contents of the file using the code in the picture,
when I try to convert the data into a dataframe I get a "no columns to parse from file" error.
I have no headers. My CSV file looks like this, and I have saved it in UTF-8 format.
Try using pandas to read the CSV file:
import pandas as pd

df = pd.read_csv("BON3_NC_CUISINES.csv")
print(df)
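Two things may help here. pandas raises "No columns to parse from file" (an EmptyDataError) when it sees an empty file, so an empty file or a wrong path is worth ruling out first. And since the file has no header row, pass header=None so the first data row is not consumed as column names; a sketch with made-up column names:

import pandas as pd

# header=None: the file has no header row; the names below
# are hypothetical placeholders, rename them to fit the data
df = pd.read_csv(
    "BON3_NC_CUISINES.csv",
    header=None,
    names=["col1", "col2", "col3"],
    encoding="utf-8",
)
print(df.head())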

ADLA AUs assigned for JSON files

I have a custom Extractor with AtomicFileProcessing set to false. It extracts a large number of JSON files (each line in a file is a JSON document) and outputs two files with successful and failed requests, both of which contain the JSON rows (more than 1 AU is allocated to extract the files). The problem is that when I use the same extractor to extract the files output in the first step with more than one AU, it fails with the error: Unexpected character encountered while parsing value: e. Path '', line 0, position 0.
If I assign 1 AU on Azure or run this locally with AU set to more than 1, it successfully processes the data. Is this behavior because more AUs were provided to process a single JSON file, and since the file is in a non-splittable format, it can't be parallelized?
You can solve this problem by converting your JSON file to JSON Lines:
http://jsonlines.org/examples/
Then you need to read the file using the text extractor and use the JsonFunctions available in Microsoft.Analytics.Samples.Formats to read the JSON.
That transformation will make your file splittable, and you can parallelize it!
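As an illustration of the conversion step only (the extractor side stays in U-SQL), a small Python sketch that rewrites a file holding one JSON array as JSON Lines; the file names are placeholders:

import json

# read a file containing a single JSON array...
with open("input.json") as src:
    records = json.load(src)

# ...and write it back out as JSON Lines: one compact object per line
with open("output.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")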