Get array from 1 to number of columns of csv in nextflow

One of my processes outputs a single CSV file. I want to create an array channel from 1 to the number of columns. For example:
My output
my_out_ch.view() -> test.csv
Assume test.csv has 11 columns. Now I want to create a channel which gives me:
1,2,3,4,5,6,7,8,9,10,11
How could I get this? I have tried the splitText operator as below, without luck:
my_out_ch.splitText(by:1,limit:1)
But it only gives me the column names. There is a parameter elem; I am not sure whether elem could give me the array, and I am also not sure how to use it. Any help?

You could use the splitCsv operator to parse the CSV file, then create an IntRange using the map operator. Either call collect() to emit a java.util.ArrayList or call join() to emit a string. For example:
params.input_tsv = 'test.tsv'
Channel.fromPath( params.input_tsv )
| splitCsv( sep: '\t', limit: 1 )
| map { (1..it.size()).join(',') }
| view()
Results:
1,2,3,4,5,6,7,8,9,10,11

Related

Convert a spark dataframe to a column

I have an org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first("col1", ignoreNulls = true).over(window)), which I want to convert into a Spark SQL column. Could anyone help with that?
Thank you,
@Jaime Caffarel: this is exactly what I am trying to do; this should give you more visibility. You may also check the error message in the second screenshot.
From the documentation of the class org.apache.spark.sql.Column
A column that will be computed based on the data in a DataFrame. A new column can be constructed based on the input columns present in a DataFrame:
df("columnName")          // On a specific DataFrame.
col("columnName")         // A generic column not yet associated with a DataFrame.
col("columnName.field")   // Extracting a struct field
col("a.column.with.dots") // Escape `.` in column names.
$"columnName"             // Scala short hand for a named column.
expr("a + 1")             // A column that is constructed from a parsed SQL Expression.
lit("abc")                // A column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that product_id is a unique key for each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
.join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("driver_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

How to print lists in a specific order in Kotlin?

I'm working on a project and I have a list in Kotlin like:
val list = listOf("banana", "1","apple","3","banana","2")
and I want to print it like:
Output:
banana = 1
banana = 2
apple = 3
So every word and its number should act like one value, and I need to print them in a specific order (the order is too random for any sort command), so I'm planning on just copying the whole Xinhua dictionary here (since every Chinese character has a specific Unicode code point) and making the code replace the entries like:
val list = listOf("banana丨", "1","apple丩","3","banana丨","2")
But how do I print them in that order?
PS: even I, as a Chinese speaker, don't know most of the characters in the Xinhua dictionary, lol, so there are more than enough of them.
Assuming that you have the following input list, as shown in your question, where the order of occurrence is always one word followed by its order:
val list = listOf("banana", "1","apple","3","banana","2")
You could do the following:
1. Create a data class that defines one entry in your raw input list
data class WordEntry(val word: String, val order: Int)
2. Map over your raw input list by using the windowed and map methods
val dictionary = list.windowed(2, 2).map { WordEntry(it.first(), it.last().toInt()) }
Here, the windowed(2, 2) method creates windows of size 2 with step 2, meaning that we walk over the raw input list two entries at a time. Assuming that the raw input list always contains the word followed by its order, this should work. Otherwise, it would not, so the order is very important here!
3. Sort the transformed dictionary by the order property
val sortedDictionary = dictionary.sortedBy { it.order }
Edit: You can also sort by any other property. Just pass another property to the lambda expression of sortedBy (e.g. sortedBy { it.word } if you want to sort it by the word property)
4. Finally, you can print out your sorted dictionary
val outputStr = sortedDictionary.joinToString("\n") { "${it.word} = ${it.order}" }
print(outputStr)
Output:
banana = 1
banana = 2
apple = 3

Function to filter values in PySpark

I'm trying to run a for loop in PySpark that needs to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+----------+--------------------+--------------------+
|        ID|                NAME|                TYPE|
+----------+--------------------+--------------------+
|      7983|         SNEAKERS 01|            Sneakers|
|      7034|            SHIRT 13|               Shirt|
|      3360|           SHORTS 15|               Short|
+----------+--------------------+--------------------+
I want to iterate over a list of ID's, get the match from the algorithm and then filter the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
    return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And I wanted it to return:
print(get_type(7983))
Sneakers
But I found two issues:
1. It takes a long time (longer than a similar approach in plain Python).
2. It returns a list of strings: ['Sneakers'], and when I try to filter the products, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this on PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.
A little adjustment to your function. This returns the actual string of the target column from the first record found after filtering:
from pyspark.sql.functions import col
def get_type(ID_PROD):
    return df_prods.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]
type = get_type(7983)
df_prods.filter(col("TYPE") == type) # works
I find using col("colname") to be much more readable.
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.

Call function in pyspark with values from dataframe as strings

I have to call a function func_test(spark, a, b) which accepts two string values and creates a df out of them. spark is a SparkSession variable.
These two string values are two columns of another dataframe and would be different for different rows of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So, I want the first call to be func_test(spark, US, LA); the second call to be func_test(spark, UK, LN); the third call to be func_test(spark, SN, SN), and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark, string1, string2) is a function which accepts two string values. Inside this function, a series of dataframe operations is performed. For example, the first Spark SQL in func_test is a simple select, and the two variables string1 and string2 are used in the where clause. The result of that Spark SQL, which generates a df, is a temp table for the next Spark SQL, and so on. Finally, the function creates a df which func_test(spark, string1, string2) returns.
Now, in the main class, I have to call this func_test, and the two parameters string1 and string2 will be fetched from records of the dataframe. So the first func_test call generates the query select * from dummy where CTRY='US' and CITY='LA', and the subsequent operations happen, which result in a df. The second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN'. The third call becomes select * from dummy where CTRY='SN' and CITY='SN', and so on.
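For reference, a hypothetical sketch of what such a func_test could look like in PySpark, based only on the description above (the table name dummy and the CTRY/CITY columns come from the question; the intermediate steps are purely illustrative):
def func_test(spark, string1, string2):
    # First Spark SQL step: the two string parameters go into the where clause,
    # mirroring the query described in the question.
    df1 = spark.sql(
        "select * from dummy where CTRY = '{}' and CITY = '{}'".format(string1, string2)
    )
    # The intermediate result becomes a temp table for the next Spark SQL step.
    df1.createOrReplaceTempView("dummy_filtered")
    # ... further spark.sql steps would build on dummy_filtered here ...
    result_df = spark.sql("select * from dummy_filtered")
    return result_df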
Instead of first(), use collect() and iterate through the results in a loop:
collect_vals = df.select('CTRY','CITY').distinct().collect()
for row_col in collect_vals:
    func_test(spark, row_col['CTRY'], row_col['CITY'])
hope this helps !!

Iteration in Spark SQL dataframe, getting the 1st row value in the first iteration, the second row value in the next iteration, and so on

Below is the query that will give the date and distance where distance is <= 10 km:
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
this will give the output like
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
---------- | --
I want to use the first row of the dataframe s and store the date value in a variable v in the first iteration.
In the next iteration it should pick the second row, and the corresponding date value should replace the value previously stored in the variable.
And so on.
I think you should look at Spark "Window Functions". You may find what you need there.
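For example, here is a minimal PySpark sketch of the window-function idea (the table and the date/distance columns come from the question; using lag() to compare each row with the previous one is just an illustration of the approach):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
s = spark.sql("select date, distance from table_new where distance <= 10")

# lag() pulls the previous row's date onto the current row, so consecutive
# rows can be compared without collecting the dataframe to the driver.
w = Window.orderBy("date")
s.withColumn("prev_date", F.lag("date").over(w)).show()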
The "bad" way to do this would be to collect the dataframe using df.collect() which would return a list of Rows which you can manually iterate over each using a loop.This is bad cause it brings all the data in your driver.
The better way would be to use foreach() :
df.foreach(lambda x: <<your code here>>)
foreach() takes a lambda function as an argument and applies it to each row of the dataframe without bringing all the data to the driver. But you can't simply overwrite a local variable v inside a lambda function; when that kind of accumulation is involved, you can use Spark accumulators instead.
e.g. if I want to sum all the values in the 2nd column:
counter = sc.accumulator(0)
df.foreach(lambda row: counter.add(row[1]))
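Putting it together, a minimal self-contained sketch of the accumulator pattern (the sample data is made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

df = spark.createDataFrame([("12/05/2018", 5), ("13/05/2018", 8)], ["date", "distance"])

counter = sc.accumulator(0)                  # driver-side accumulator
df.foreach(lambda row: counter.add(row[1]))  # updates happen on the executors

print(counter.value)                         # 13, read back on the driver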