[KDB+/Q]: Apply list of functions over data sequentially (pipe) - currying

In kdb+/q, how do you pipe data through a sequential list of functions so that the output of each step is the input to the next?
For example:
q)t:([]sym:`a`c`b;val:1 3 2)
q)`sym xkey `sym xasc t / how to achieve the same result as this?
I presume some variation of over or / could work:
?? over (xasc;xkey)
Bonus: how to achieve the same in a way where t is piped in from the right-hand side (in the spirit of left-of-right reading of the q syntax)?
(xasc;xkey) ?? t

how to pipe data through a sequential list of functions so that output of previous step is the input to next step?
You can use the little-known composition operator. For example:
q)f:('[;])over(2+;3*;neg)
q)f 1  / 2+3*neg 1
-1
If you want to use the left of right syntax, you will have to define your own verb:
q).q.bonus:{(('[;])over x)y}
q)(2+;3*;neg)bonus 1
-1
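Purely for illustration (not part of the q answer), the same idea of folding a composition over a list of functions, with the data "piped in" from the right through a hypothetical bonus helper, could be sketched in Python:

from functools import reduce

def compose_all(funcs):
    # fold composition over the list; the rightmost function runs first,
    # mirroring ('[;])over(2+;3*;neg) above
    return reduce(lambda f, g: (lambda x: f(g(x))), funcs)

def bonus(funcs, data):
    # "pipe the data in from the right": apply the composed function to data
    return compose_all(funcs)(data)

print(bonus([lambda v: 2 + v, lambda v: 3 * v, lambda v: -v], 1))  # -1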

Use a lambda on the left together with the over adverb (a form of iteration).
The dot (.) form of apply is used to apply each function to the table and the column:
{.[y;(z;x)]}/[t;(xasc;xkey);`sym]
sym| val
---| ---
a | 1
b | 2
c | 3

Related

How can I read and parse files with varying spaces as the delimiter?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care for the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside said files
How to parse these files, given that the spaces between the columns vary for almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1,2345,0.0241,0.5901|
|File3|LLKM|0.9812,0,2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    filter!(x -> !startswith(x, "#"), l)    # drop the comment/header lines
    sl = split.(l)                          # split each data line on whitespace
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),                # 2nd column joined into one string
            SCORE=parse.(Float64, getindex.(sl, 3)))   # 3rd column as Float64 values
end
df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame.
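If you are on Python/pandas instead, a rough sketch of the same approach (skip the # header lines, split on whitespace, join column 2, collect column 3 as floats) might look like this; it assumes the file1.no..file3.no names from the question:

import pandas as pd

def parse_file(filename):
    with open(filename) as fh:
        # keep only data lines; the header/comment lines start with '#'
        rows = [line.split() for line in fh if line.strip() and not line.startswith("#")]
    return {"File": filename,
            "SEQ": "".join(r[1] for r in rows),    # 2nd column joined into one string
            "SCORE": [float(r[2]) for r in rows]}  # 3rd column as a list of floats

df = pd.DataFrame([parse_file(f"file{i}.no") for i in range(1, 4)])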

Split a column in Hive

I am new to Hive and the Hadoop framework. I am trying to write a Hive query to split a column delimited by the pipe '|' character. Then I want to group each pair of adjacent values and put each pair in a separate row.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|") which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to use the lateral view to split the mapper array into separate rows, but it separates all the values, whereas I want to separate them by pair.
Expected:
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
Actual
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
I would suggest splitting your pairs with split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |.
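Hive applies Java's regex flavor here, but the fixed-width lookbehind/lookahead idea can be sanity-checked with Python's re module, purely as an illustration:

import re

# split on '|' only where a digit precedes it and a word character follows it
pairs = re.split(r'(?<=\d)\|(?=\w)', 'c|0.2|d|0.3|e|0.6')
print(pairs)                          # ['c|0.2', 'd|0.3', 'e|0.6']
print([p.split('|') for p in pairs])  # [['c', '0.2'], ['d', '0.3'], ['e', '0.6']]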
Update:
If the keys can be digits as well and your float numbers have only one digit after the decimal marker, then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
Update 2:
OK, the best way is to split on every second | as follows:
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]

Iteration in a Spark SQL DataFrame, getting the 1st row value in the first iteration, the second row value in the next iteration, and so on

Below is the query that gives the date and distance where distance is <= 10 km:
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
this will give the output like
date       | distance
---------- | --------
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
I want to use the first row of the dataframe s and store its date value in a variable v in the first iteration.
In the next iteration it should pick the second row, and the corresponding date value should replace the old value of v.
And so on.
I think you should look at Spark "Window Functions". You may find what you need there.
The "bad" way to do this would be to collect the dataframe using df.collect() which would return a list of Rows which you can manually iterate over each using a loop.This is bad cause it brings all the data in your driver.
The better way would be to use foreach() :
df.foreach(lambda x: <<your code here>>)
foreach() takes a lambda function as its argument and applies it to each row of the dataframe without bringing all the data into the driver. But you can't use a simple local variable v inside a lambda function when overwriting is involved; you can use Spark accumulators for such a case.
E.g., if I want to sum all the values in the 2nd column:
counter = sc.accumulator(0)                    # accumulator shared between driver and executors
df.foreach(lambda row: counter.add(row[1]))    # add the 2nd column of each row
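After the foreach action finishes, the driver can read the total from the accumulator (assuming the PySpark form above):

print(counter.value)   # sum of the values in the 2nd column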

Apply function with pandas dataframe - POS tagger computation time

I'm very confused about the apply function for pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply as I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
from collections import Counter   # Counter is used below; st is the Stanford tagger instance

def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flat = [y for z in list_of_lists for y in z]
    Parts_of_speech = [row[1] for row in flat]
    c = Counter(Parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger, but I have a big problem with computation time, and I'm using the left 3 words model. I'm noticing that it's calling the .jar file again and again (java keeps opening and closing in the task manager) and maybe that's unavoidable, but it's really taking far too long to run. Any way I can speed it up?
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2
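Assuming the tagger_nouns function from the edit above is in scope and the column really is named string, the same pattern should give the noun counts:

df['num_nouns'] = df['string'].apply(tagger_nouns)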

Ordering a list by two properties, ParentID and ChildID

I have a class that has 3 properties: Name, ID, and ParentID.
My data:
Name ID ParentID
Event A 1 1
Event B 2 1
Event C 3 1
Event D 4 2
I have everything in a List and was trying to use the OrderBy or perhaps the Sort methods. Not sure which would be better.
I need the data in the list to be ordered so that an event has its child as the next item in the list. Any help on this would be greatly appreciated; I am doing this in VB, by the way. Thanks!!
You can sort the list like this
list.Sort(Function(x, y) 2 * x.ParentID.CompareTo(y.ParentID) + _
                         x.ChildID.CompareTo(y.ChildID))
Explanation: I am using a lambda expression here. You can think of it as a kind of inline declaration of a function. CompareTo returns -1, 0 or +1: a negative number means x is less than y, 0 means both are equal, and +1 means x is greater than y. By multiplying the first comparison by two, its sign takes precedence over the second comparison; the second one only has an effect if the first returns 0.
The advantage of using the list's Sort method over LINQ is that the list is sorted in place. With LINQ you would have to create a new list.
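As a quick sanity check of the weighting trick (each comparison is -1, 0 or +1, so 2*first + second orders exactly like comparing the first property and then the second), here is the same comparator sketched in Python with stand-in tuples:

from functools import cmp_to_key

def cmp(a, b):
    return (a > b) - (a < b)

# (Name, ID, ParentID) tuples standing in for the class in the question
events = [("Event D", 4, 2), ("Event B", 2, 1), ("Event A", 1, 1), ("Event C", 3, 1)]
events.sort(key=cmp_to_key(lambda x, y: 2 * cmp(x[2], y[2]) + cmp(x[1], y[1])))
print(events)  # ordered by ParentID first, then ID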