Validate and change different date formats in pyspark - dataframe

This is a continuation of this problem (Validate and change the date formats in pyspark).
In that scenario the solution worked perfectly, but what if I also have timestamp formats and some additional date formats like the ones below?
df = sc.parallelize([['12-21-2006'],
['05/30/2007'],
['01-01-1984'],
['22-12-2017'],
['12222019'],
['2020/12/23'],
['2020-12-23'],
['12.11.2020'],
['22/02/2012'],
['2020/12/23 04:50:10'],
['12/23/1996 05:56:20'],
['23/12/2002 10:30:50'],
['24.12.1990'],
['12/03/20']]).toDF(["Date"])
df.show()
+-------------------+
| Date|
+-------------------+
| 12-21-2006|
| 05/30/2007|
| 01-01-1984|
| 22-12-2017|
| 12222019|
| 2020/12/23|
| 2020-12-23|
| 12.11.2020|
| 22/02/2012|
|2020/12/23 04:50:10|
|12/23/1996 05:56:20|
|23/12/2002 10:30:50|
| 24.12.1990|
| 12/03/20|
+-------------------+
When I tried the same approach as in that question (Validate and change the date formats in pyspark), I get an error. As far as I can tell, the error is caused by the timestamp formats, and records with similar patterns such as MM/dd/yyyy and dd/MM/yyyy cannot be converted into the required format.
sdf = df.withColumn("d1", F.to_date(F.col("Date"),'yyyy/MM/dd')) \
.withColumn("d2", F.to_date(F.col("Date"),'yyyy-MM-dd')) \
.withColumn("d3", F.to_date(F.col("Date"),'MM/dd/yyyy')) \
.withColumn("d4", F.to_date(F.col("Date"),'MM-dd-yyyy')) \
.withColumn("d5", F.to_date(F.col("Date"),'MMddyyyy')) \
.withColumn("d6", F.to_date(F.col("Date"),'MM.dd.yyyy')) \
.withColumn("d7", F.to_date(F.col("Date"),'dd-MM-yyyy')) \
.withColumn("d8", F.to_date(F.col("Date"),'dd/MM/yy')) \
.withColumn("d9", F.to_date(F.col("Date"),'yyyy/MM/dd HH:MM:SS'))\
.withColumn("d10", F.to_date(F.col("Date"),'MM/dd/yyyy HH:MM:SS'))\
.withColumn("d11", F.to_date(F.col("Date"),'dd/MM/yyyy HH:MM:SS'))\
.withColumn("d12", F.to_date(F.col("Date"),'dd.MM.yyyy')) \
.withColumn("d13", F.to_date(F.col("Date"),'dd-MM-yy')) \
.withColumn("result", F.coalesce("d1", "d2", "d3", "d4",'d5','d6','d7','d8','d9','d10','d11','d12','d13'))
sdf.show()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 34, ip-10-191-0-117.eu-west-1.compute.internal, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '01-01-1984' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Is there a better way of solving this? I just want a function or library that can convert any of these date formats into a single date format.
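One possible workaround (a sketch only, not a definitive answer; it assumes the existing spark session and the df created above): keep the coalesce-over-candidate-formats idea from the linked question, but fix the time patterns ('HH:mm:ss' rather than 'HH:MM:SS') and set spark.sql.legacy.timeParserPolicy to CORRECTED so that a non-matching format simply yields null instead of raising SparkUpgradeException. Genuinely ambiguous strings such as 12.11.2020 still resolve to whichever candidate format is listed first, so order the list by preference.

import pyspark.sql.functions as F

# treat strings the new parser cannot handle as invalid (null) instead of failing
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

# candidate formats, ordered by preference; the first one that parses wins
formats = ["yyyy/MM/dd", "yyyy-MM-dd", "MM/dd/yyyy", "MM-dd-yyyy", "MMddyyyy",
           "MM.dd.yyyy", "dd-MM-yyyy", "dd/MM/yyyy", "dd.MM.yyyy", "dd/MM/yy",
           "yyyy/MM/dd HH:mm:ss", "MM/dd/yyyy HH:mm:ss", "dd/MM/yyyy HH:mm:ss"]

sdf = df.withColumn("result",
                    F.coalesce(*[F.to_date(F.col("Date"), fmt) for fmt in formats]))
sdf.show()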

Related

Input function for technical / biological replicates in snakemake

I'm currently trying to write a Snakemake workflow that automatically checks, via a sample.tsv file, whether a given sample is a biological or technical replicate, and then at some point in the workflow uses a rule to merge technical/biological replicates.
My tsv file looks like this:
|sample | unit_bio | unit_tech | fq1 | fq2 |
|----------|----------|-----------|-----|-----|
| bCalAnn1 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R2.fastq.gz |
| bCalAnn1 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R2.fastq.gz |
| bCalAnn2 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R2.fastq.gz |
| bCalAnn2 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R2.fastq.gz |
| bCalAnn2 | 2 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R2.fastq.gz |
| bCalAnn2 | 3 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R2.fastq.gz |
My Pipeline looks like this:
import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], dtype=str)

rule all:
    input:
        expand(config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam", zip,
               sample=samples["sample"], unit_bio=samples["unit_bio"], unit_tech=samples["unit_tech"])

..
some rules
..

rule add_read_groups:
    input:
        config["arima_mapping"] + "paired/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    params:
        platform = "ILLUMINA",
        sampleName = "{sample}",
        library = "{sample}",
        platform_unit = "None"
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/paired_read_groups/{sample}_{unit_bio}_{unit_tech}.log"
    shell:
        "picard AddOrReplaceReadGroups I={input} O={output} SM={params.sampleName} LB={params.library} PU={params.platform_unit} PL={params.platform} 2> {log}"

rule merge_tech_repl:
    input:
        config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam"
    params:
        val_string = "SILENT"
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/merged_tech_repl/{sample}_{unit_bio}_{unit_tech}.log"
    threads:
        2  # uses at most 2 threads
    shell:
        "picard MergeSamFiles -I {input} -O {output} --ASSUME_SORTED true --USE_THREADING true --VALIDATION_STRINGENCY {params.val_string} 2> {log}"

rule mark_duplicates:
    input:
        config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam" if config["tech_repl"] else config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        bam = config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam",
        metric = config["arima_mapping"] + "final/metric_{sample}_{unit_bio}_{unit_tech}.txt"
    #params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mark_duplicates/{sample}_{unit_bio}_{unit_tech}.log"
    shell:
        "picard MarkDuplicates I={input} O={output.bam} M={output.metric} 2> {log}"
At the moment I have set a boolean in a config file that tells the mark_duplicates rule whether to take its input from the add_read_groups or the merge_tech_repl rule. This is of course not optimal, since some samples may have duplicates (of any number) while others don't. Therefore I want to create a syntax that checks the tsv table for rows where the sample name and the unit_bio number are identical while the unit_tech number differs (and later, analogously, for biological replicates), merging those specific samples while samples without duplicates skip the merging rule.
EDIT
For clarification, since I think I explained my goal in a confusing way:
My first attempt looks like this. I want "i" to be flexible, in case the number of duplicates changes. I don't think my input function returns all matching duplicates together; instead it gives them one by one, which is not what I want. I'm also unsure how I would handle samples that do not have duplicates, since they would have to skip this rule somehow.
def input_function(wildcards):
    return expand("{sample}_{unit_bio}_{i}.bam",
                  sample=wildcards.sample,
                  unit_bio=wildcards.unit_bio,
                  i=samples["sample"].str.count(wildcards.sample))

rule tech_duplicate_check:
    input:
        input_function  # (should return a list of 2-n duplicates, where n could be different for each sample)
    output:
        "{sample}_{unit_bio}.bam"
    shell:
        "MergeTechDupl_tool {input}"  # input is a list
To restate the goal: samples that share a sample name and unit_bio number but differ in unit_tech should be merged (and later, analogously, biological replicates), while samples without duplicates skip the merging rule.
rule gather_techdups_of_a_biodup:
    output: "{sample}/{unit_bio}"
    input: gather_techdups_of_a_biodup_input_fn
    shell: "true"  # Fill this in

rule gather_biodips_of_a_techdup:
    output: "{sample}/{unit_tech}"
    input: gather_biodips_of_a_techdup_input_fn
    shell: "true"  # Fill this in
After some attempts, my main problem is the table checking. As far as I know, Snakemake takes templates as input and matches all samples against them. But I would need to check the table for every sample that shares (e.g. for technical replicates) the sample name and the unit_bio number, take all of those samples, and give them together as input to a single rule run. Then I would have to take the next sample that was not already part of a previous run, to prevent merging the same samples multiple times.
The logic you describe here can be implemented in the gather_techdups_of_a_biodup_input_fn and gather_biodips_of_a_techdup_input_fn functions above. For example, read your sample TSV file with pandas, filter for wildcards.sample and wildcards.unit_bio (or wildcards.unit_tech), then extract columns fq1 and fq2 from the filtered data frame.
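A minimal sketch of what such an input function could look like, assuming it lives in the same Snakefile as the pipeline above (so samples, config and expand are already available) and that the technical replicates to merge are the read-grouped BAMs from add_read_groups; anything beyond what the question shows is an assumption:

def gather_techdups_of_a_biodup_input_fn(wildcards):
    # keep only the rows for this sample and this biological unit
    subset = samples[(samples["sample"] == wildcards.sample) &
                     (samples["unit_bio"] == wildcards.unit_bio)]
    # return one read-grouped BAM per technical replicate of that biological unit
    return expand(config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam",
                  sample=wildcards.sample,
                  unit_bio=wildcards.unit_bio,
                  unit_tech=subset["unit_tech"].tolist())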

Splunk bucket name conversion to epoch to human script

I'm facing the following issue: I have some frozen buckets in a Splunk environment whose names are saved in epoch format. More specifically, the template is:
db_1181756465_1162600547_1001
The first number, when converted, gives the end date, and the second one gives the start date. So, based on my example:
1181756465 = Wednesday 13 June 2007 17:41:05
1162600547 = Saturday 4 November 2006 00:35:47
Now, how to convert epoch to human-readable time is clear to me, otherwise I couldn't have put the translation here. My problem is that I have a file full of bucket names that must be converted, with hundreds of entries; so I'm asking whether there is a script or another way to automate this conversion and print the output to a file. The idea is to have a final output something like this:
db_1181756465_1162600547_1001 = Wednesday 13 June 2007 17:41:05 - Saturday 4 November 2006 00:35:47
You could use Splunk itself to view these values. They are output by the dbinspect command, which provides startEpoch and endEpoch times for each frozen bucket:
| dbinspect index=* state=frozen
| eval startDate=strftime(startEpoch,"%A %d %B %Y %H:%M:%S")
| eval endDate=strftime(endEpoch,"%A %d %B %Y %H:%M:%S")
| fields index, path, startDate, endDate
The listing example above was run against hot buckets, since I don't have frozen buckets in this test system.
If you just have the list of folder names, you can upload it to a Splunk instance as a CSV and do some processing to extract startDate and endDate:
| makeresults
| eval frozenbucket="db_1181756465_1162600547_1001"
| eval temp=split(frozenbucket,"_")
| eval sDate=mvindex(temp,2)
| eval eDate=mvindex(temp,1)
| eval startDate=strftime(sDate,"%A %d %B %Y %H:%M:%S")
| eval endDate=strftime(eDate,"%A %d %B %Y %H:%M:%S")
| fields frozenbucket,startDate,endDate
| fields - _time
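If you would rather process the file outside Splunk entirely, a small standalone script can do the same split-and-format. This is only a sketch: the input and output file names are made up, and it assumes the epochs should be rendered in UTC.

from datetime import datetime, timezone

def human(epoch):
    # format an epoch such as 1181756465 as "Wednesday 13 June 2007 17:41:05" (UTC assumed)
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc).strftime("%A %d %B %Y %H:%M:%S")

with open("buckets.txt") as src, open("buckets_human.txt", "w") as dst:
    for line in src:
        name = line.strip()
        if not name:
            continue
        # bucket names look like db_<endEpoch>_<startEpoch>_<id>
        _, end_epoch, start_epoch, _ = name.split("_")
        dst.write("%s = %s - %s\n" % (name, human(end_epoch), human(start_epoch)))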

Function to filter values in PySpark

I'm trying to run a for loop in PySpark that needs to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+----+-----------+--------+
|  ID|       NAME|    TYPE|
+----+-----------+--------+
|7983|SNEAKERS 01|Sneakers|
|7034|   SHIRT 13|   Shirt|
|3360|  SHORTS 15|   Short|
+----+-----------+--------+
I want to iterate over a list of ID's, get the match from the algorithm and then filter the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
    return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And wanted it to return:
print(get_type(7983))
Sneakers
But I find two issues:
1 - it takes a long time (longer than a similar operation took in plain Python);
2 - it returns a string array, ['Sneakers'], and when I try to filter the products with it, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this on PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.
A little adjustment to your function. This returns the actual string of the target column from the first record found after filtering.
from pyspark.sql.functions import col

def get_type(ID_PROD):
    return df_prods.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]

type = get_type(7983)
df_prods.filter(col("TYPE") == type)  # works
I find using col("colname") to be much more readable.
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.
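If the slowness comes from calling get_type (and therefore collect()) once per ID inside a Python loop, one common alternative is to fetch all the types in a single pass with a join, keeping the work on the executors. A rough sketch, assuming an existing SparkSession named spark, the df_prods from the question, and a hypothetical id_list:

id_list = [7983, 7034, 3360]  # hypothetical list of IDs to look up
ids_df = spark.createDataFrame([(i,) for i in id_list], ["ID"])

# one distributed join instead of one collect() per ID
types_df = ids_df.join(df_prods.select("ID", "TYPE"), on="ID", how="left")
types_df.show()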

Stata time conversion

I am working with an int (%8.0g) variable called timeinsecond that was badly coded. For example, a value of 12192 for this variable should mean 3h 23min 12s. I'm trying to create a new variable that, based on the value of timeinsecond, gives me the total time expressed as HH:MM:SS.
In the example I mentioned, the new variable would be 03:23:12.
Stata uses units of milliseconds for date-times, so assuming that no time here is longer than 24 hours, you can use this principle:
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen timeinsecond = 12192
. gen double wanted = timeinsecond * 1000
. format wanted %tcHH:MM:SS
. list
+---------------------+
| timein~d wanted |
|---------------------|
1. | 12192 03:23:12 |
+---------------------+
All documented at help datetime.

Value to table header in Pentaho

Hi, I'm quite new to Pentaho Spoon and I have a problem:
I have a table like this:
| model | type | color | q |
|-------|------|-------|---|
| 1     | 1    | blue  | 1 |
| 1     | 2    | blue  | 2 |
| 1     | 1    | red   | 1 |
| 1     | 2    | red   | 3 |
| 2     | 1    | blue  | 4 |
| 2     | 2    | blue  | 5 |
And I would like to create a separate table (to export to CSV or Excel) for each model, grouped by type, with the color values as column headers and the q values as cell values:
table-1.csv

| type | blue | red |
|------|------|-----|
| 1    | 1    | 1   |
| 2    | 2    | 3   |

table-2.csv

| type | blue |
|------|------|
| 1    | 4    |
| 2    | 5    |
I tried with the row denormalizer step but got nothing useful.
Any suggestions?
Typically it's helpful to see what you have done in order to offer help, but I know how counterintuitive the "help" on this step is.
Make sure you sort the rows on Model and Type before sending them to the denormalizer step, then give it a try.
As for splitting the output into files, there are a few ways to handle that. Take a look at the Switch/Case step using the Model field.
Also, if you haven't found them already, take a look at the sample files that come with the PDI download. They should be in ...pdi-ce-6.1.0.1-196\data-integration\samples. They can be more helpful than the online documentation sometimes.
The row denormalizer can't be used here if the number of colors is unknown; also, you can't define text output fields dynamically.
There are a few ways I can see to do this without using Java and JavaScript steps. One of them is based on the following idea: we can prepare rows with two columns:
Row            Model
type|blue|red  1
1|1|1          1
2|2|3          1
type|blue      2
1|4            2
2|5            2
Then we can prepare a filename for each row using the Model field and output all rows with text output where the file name is taken from the filename field. That way, all records are exported into two files without additional effort.
Here you can find a sample transformation: copy-paste me into a new transformation
Please note that it's a sample solution that works only with CSV. It also works only if you have the same number of colors for each type inside a model. It's just a hint at how to use Spoon, not a complete solution.
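Outside of Spoon, and just to make the intended reshape concrete, the same result can be expressed in a few lines of pandas (the column names and output file names follow the example above; this is only an illustration, not a PDI solution):

import pandas as pd

data = pd.DataFrame({
    "model": [1, 1, 1, 1, 2, 2],
    "type":  [1, 2, 1, 2, 1, 2],
    "color": ["blue", "blue", "red", "red", "blue", "blue"],
    "q":     [1, 2, 1, 3, 4, 5],
})

# pivot q by color within each model and write one CSV per model
for model, group in data.groupby("model"):
    pivot = group.pivot(index="type", columns="color", values="q").reset_index()
    pivot.to_csv("table-%d.csv" % model, index=False)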