Array Pair Loading from databases using Qlik Sense - qlikview

Does anyone have experience with how to load/prepare data like:
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
taken from an SQL database (stored there as a value) into a Qlik Sense table:
ID, Value
1, a
2, b
3, c
4, d

Check out the annotated script below.
After it runs, the resulting table will be the ID/Value table shown in the question.
set vSQLData = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')];

SQLData:
Load
    // at this point each row looks like "1, a", "2, b", ...
    // split the string on "," and take the first part as ID
    // and the second part as Value (Trim drops the leading space)
    Trim(SubField(TempField2, ',', 1)) as ID,
    Trim(SubField(TempField2, ',', 2)) as Value
;
Load
    // split the string on ")," to generate N rows,
    // then remove the "(", ")" and "'" characters from each row
    PurgeChar(SubField(TempField1, '),'), '''()''') as TempField2
;
Load
    // remove the "[" and "]" characters
    PurgeChar('$(vSQLData)', '[]') as TempField1
AutoGenerate(1)
;
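The three Load statements form a preceding-load chain, so the script is evaluated bottom-up: AutoGenerate(1) creates the single starting row, and each Load consumes the output of the Load directly below it.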

Related

How/where to format INSERT statement by aligning columns with values

I would like to know how I can format the INSERT statement from:
INSERT tablename ([column1], [column2], [column3], [column4], [column5]) VALUES (2,1,11,NULL,4), (2,1,11,NULL,4), (2,1,11,NULL,4);
to:
INSERT tablename ([column1], [column2], [column3], [column4], [column5])
VALUES ( 2, 1, 11, NULL, 4),
( 2, 1, 11, NULL, 4),
( 2, 1, 11, NULL, 4);
Do you know of a text formatter that does this?
I found several options, but none that does it like that.
I didn't find an existing solution, so I did it with PHP.
Let me know if you have any questions.
Enjoy!
#!/usr/bin/env php
<?php
echo preg_replace_callback("/(INSERT)\s+(.+)\s+\((.+)\)\s+(VALUES)\s+\((.+)\);*/", function($matches) {
    $columns = explode(",", $matches[3]);
    $values = explode(",", $matches[5]);
    // length of "INSERT" + tablename
    $length = strlen("$matches[1] $matches[2]");
    // fills VALUES with leading spaces so it lines up under the table name
    $values_with_spaces = str_pad($matches[4], $length, " ", STR_PAD_LEFT);
    $newColumns = $newValues = [];
    foreach ($columns as $key => $column) {
        $value = !isset($values[$key]) ? 'NULL' : $values[$key];
        // takes the longest length between the column and the value
        $column_length = max(strlen($column), strlen($value));
        $newColumns[$key] = str_pad($column, $column_length, " ", STR_PAD_LEFT);
        $newValues[$key] = str_pad($value, $column_length, " ", STR_PAD_LEFT);
    }
    return "$matches[1] $matches[2] (".join(",", $newColumns).")\n$values_with_spaces (".join(",", $newValues).");\n";
}, file_get_contents("php://stdin"));
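To use it, save the script (the name format_insert.php below is just an example) and pipe your SQL through it on stdin: php format_insert.php < queries.sql, or make it executable and run ./format_insert.php < queries.sql.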

Selecting values with Pandas multiindex using lists of tuples

I have a DataFrame with a MultiIndex with 3 levels:
                col1
id foo bar
0  1   a   -0.225873
   2   a   -0.275865
   2   b   -1.324766
3  1   a   -0.607122
   2   a   -1.465992
   2   b   -1.582276
   3   b   -0.718533
7  1   a   -1.904252
   2   a    0.588496
   2   b   -1.057599
   3   a    0.388754
   3   b   -0.940285
Preserving the id index level, I want to sum along the foo and bar levels, but with a different selection for each id.
For example, for id = 0 I want to sum over foo = [1] and bar = ["a", "b"]; for id = 3 over foo = [2] and bar = ["a", "b"]; and for id = 7 over foo = [1, 2] and bar = ["a"]. Giving the result:
id col1
0 -0.225873
3 -3.048268
7 -1.315756
I have been trying something along these lines:
df.loc(axis=0)[[(0, 1, ["a","b"]), (3, 2, ["a","b"]), (7, [1,2], "a")]].sum()
Not sure if this is even possible. Any elegant solution (possibly removing the MultiIndex?) would be much appreciated!
The list of tuples is not the problem. The problem is that each tuple does not correspond to a single index entry (a list isn't a valid key inside a tuple). If you want to index a DataFrame like this, you need to expand the lists inside each tuple into their own entries.
Define your options like the following list of dictionaries, then transform using a list comprehension and index using all individual entries.
d = [
    {
        'id': 0,
        'foo': [1],
        'bar': ['a', 'b']
    },
    {
        'id': 3,
        'foo': [2],
        'bar': ['a', 'b']
    },
    {
        'id': 7,
        'foo': [1, 2],
        'bar': ['a']
    },
]

all_idx = [
    (el['id'], i, j)
    for el in d
    for i in el['foo']
    for j in el['bar']
]
# [(0, 1, 'a'), (0, 1, 'b'), (3, 2, 'a'), (3, 2, 'b'), (7, 1, 'a'), (7, 2, 'a')]

df.loc[all_idx].groupby(level=0).sum()
col1
id
0 -0.225873
3 -3.048268
7 -1.315756
A more succinct solution using slicers:
sections = [(0, 1, slice(None)), (3, 2, slice(None)), (7, slice(1,2), "a")]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()
col1
id
0 -0.225873
3 -3.048268
7 -1.315756
Two things to note:
This may be less memory-efficient than the accepted answer, since pd.concat creates a new DataFrame.
The slice(None) entries are mandatory; without them the index levels of the individual df.loc[s] results don't match when calling pd.concat.
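If spelling out slice(None) feels verbose, pd.IndexSlice provides sugar for the same slice objects (idx[:] is just slice(None)); a small variation on the snippet above, assuming pandas is imported as pd:
idx = pd.IndexSlice
sections = [(0, 1, idx[:]), (3, 2, idx[:]), (7, idx[1:2], "a")]
pd.concat(df.loc[s] for s in sections).groupby("id").sum()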

internal representation not hidden in f# module

I'm doing a project, using modules, where I compute images and print them on screen.
Everything in my code works just fine, but I'm not quite happy with one inconvenience: in my .fs file (where I define the type and all my functions) I've declared
type Picture = P of (Set<((float*float)*(float*float))> * (int*int))
The set describes the endpoints of the segments (the lines) in the picture,
and the pair of ints defines the bounding box where the image will be shown (width and height).
When I test my functions (in another file) I use my grid function:
let rec setBuilder (ls: ((int*int)*(int*int)) list) =
    match ls with
    | [] -> Set.empty
    | ((x,y),(w,z))::xs -> setBuilder xs |> Set.add ((float x, float y), (float w, float z))

let grid ls (wdt, hgt) = P (setBuilder ls, (wdt, hgt))
With this I take a list of int coordinate pairs and build my Picture.
The problem is that when I build a Picture with grid (in another file where I test all my functions), the internal representation is visible:
let persn = [((4, 0),(6, 7)); ((6, 7), (6, 10)); ((6, 10), (0, 10)); ((0, 10), (0, 12)); ((0, 12), (6, 12)); ((6, 12), (6, 14)); ((6, 14), (4, 16)); ((4, 16), (4, 18)); ((4, 18), (6, 20)); ((6, 20), (8, 20));
((8, 20), (10,18)); ((10, 18), (10, 16)); ((10, 16), (8, 14)); ((8, 14), (8, 12)); ((8, 12), (10, 12)); ((10, 12), (10, 14)); ((10, 14), (12, 14)); ((12, 14), (12, 10)); ((12, 10), (8, 10)); ((8, 10), (8, 8));
((8, 8), (10, 0)); ((10, 0), (8, 0)); ((8, 0), (7, 4)); ((7, 4), (6, 0)); ((6, 0), (4, 0))]
let box = (15,20)
let person = grid persn box
When I evaluate the last line I get this from the console:
val person : Picture =
P (set
[((0.0, 10.0), (0.0, 12.0)); ((0.0, 12.0), (6.0, 12.0));
((4.0, 0.0), (6.0, 7.0)); ((4.0, 16.0), (4.0, 18.0));
((4.0, 18.0), (6.0, 20.0)); ((6.0, 0.0), (4.0, 0.0));
((6.0, 7.0), (6.0, 10.0)); ((6.0, 10.0), (0.0, 10.0));
((6.0, 12.0), (6.0, 14.0)); ...], (15, 20))
Is there a way to hide this information? I looked it up and the solution seems to be tagged values (but I'm already using them).
* EDIT *
I noticed that this behavior could be associated with the static members in my implementation file; without them the inner type is not shown.
type Picture with
    static member (*) (c: float, pic: Picture) =
        match pic with
        | P (set, (wdt, hgt)) ->
            P (Set.map (fun ((x,y),(w,z)) -> ((x*c, y*c), (w*c, z*c))) set,
               (int (round (float wdt * c)), int (round (float hgt * c))))
    static member (|>>) (pic1: Picture, pic2: Picture) =
        match pic1, pic2 with
        | P (set1, (w1, h1)), P (set2, (w2, h2)) ->
            let new_p2 = (float h1 / float h2) * pic2
            match new_p2 with
            | P (nset2, (nw2, nh2)) ->
                P (Set.union set1 (Set.map (fun ((x,y),(w,z)) -> ((x + float w1, y), (w + float w1, z))) nset2),
                   (w1 + nw2, h1))
    static member (|^^) (pic1: Picture, pic2: Picture) =
        match pic1, pic2 with
        | P (set1, (w1, h1)), P (set2, (w2, h2)) ->
            let new_pic2 = (float w1 / float w2) * pic2
            match new_pic2 with
            | P (nset2, (nw2, nh2)) ->
                P (Set.union set1 (Set.map (fun ((x,y),(w,z)) -> ((x, float h1 + y), (w, float h1 + z))) nset2),
                   (w1, h1 + nh2))
    static member (>||>) (n, pic: Picture) =
        match n with
        | 0 -> pic
        | m -> pic |>> ((m-1) >||> pic)
    static member (^||^) (n, pic: Picture) =
        match n with
        | 0 -> pic
        | m -> pic |^^ ((m-1) ^||^ pic)
Simply write type Picture = private P of ... (in your case, type Picture = private P of (Set<((float*float)*(float*float))> * (int*int))); then other modules cannot see the internals of Picture.
Note: if you instead write type private Picture = P of ..., it means that other modules cannot see the Picture type at all.

How to get objects from array of objects for given key

I get rows from a database into an array. These rows have fields like ID and Section. Is it possible to get all IDs for a chosen section from the array without loops, using some operators?
For instance:
1 a
2 b
2 a
3 a
4 b
and then for section 'a' I'd like to get [1,2,3]
If your database is an Array of Tuples you can do something like this:
let database: [(Int, String)] = [(1, "a"), (2, "b"), (2, "a"), (3, "a"), (4, "b")]
let aNumbers = database.filter{ $0.1 == "a" }.map{ $0.0 }
Or with Swift 2:
let aNumbers = database.flatMap{ $0.1 == "a" ? $0.0 : nil }
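Note that in Swift 4.1 and later this optional-filtering use of flatMap is deprecated in favour of compactMap, which behaves the same way here.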

How do I get a SQL row_number equivalent for a Spark RDD?

I need to generate a full list of row_numbers for a data table with many columns.
In SQL, this would look like this:
select
key_value,
col1,
col2,
col3,
row_number() over (partition by key_value order by col1, col2 desc, col3)
from
temp
;
Now, let's say in Spark I have an RDD of the form (K, V), where V=(col1, col2, col3), so my entries are like
(key1, (1,2,3))
(key1, (1,4,7))
(key1, (2,2,3))
(key2, (5,5,5))
(key2, (5,5,9))
(key2, (7,5,5))
etc.
I want to order these using commands like sortBy(), sortWith(), sortByKey(), zipWithIndex, etc. and have a new RDD with the correct row_number
(key1, (1,2,3), 2)
(key1, (1,4,7), 1)
(key1, (2,2,3), 3)
(key2, (5,5,5), 1)
(key2, (5,5,9), 2)
(key2, (7,5,5), 3)
etc.
(I don't care about the parentheses, so the form can also be (K, (col1,col2,col3,rownum)) instead)
How do I do this?
Here's my first attempt:
val sample_data = Seq(((3,4),5,5,5),((3,4),5,5,9),((3,4),7,5,5),((1,2),1,2,3),((1,2),1,4,7),((1,2),2,2,3))
val temp1 = sc.parallelize(sample_data)
temp1.collect().foreach(println)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
// ((1,2),1,2,3)
// ((1,2),1,4,7)
// ((1,2),2,2,3)
temp1.map(x => (x, 1)).sortByKey().zipWithIndex.collect().foreach(println)
// ((((1,2),1,2,3),1),0)
// ((((1,2),1,4,7),1),1)
// ((((1,2),2,2,3),1),2)
// ((((3,4),5,5,5),1),3)
// ((((3,4),5,5,9),1),4)
// ((((3,4),7,5,5),1),5)
// note that this isn't ordering with a partition on key value K!
val temp2 = temp1.???
Also note that the function sortBy cannot be applied directly to an RDD; one must run collect() first, and then the output isn't an RDD either, but an array:
temp1.collect().sortBy(a => a._2 -> -a._3 -> a._4).foreach(println)
// ((1,2),1,4,7)
// ((1,2),1,2,3)
// ((1,2),2,2,3)
// ((3,4),5,5,5)
// ((3,4),5,5,9)
// ((3,4),7,5,5)
Here's a little more progress, but still not partitioned:
val temp2 = sc.parallelize(temp1.map(a => (a._1,(a._2, a._3, a._4))).collect().sortBy(a => a._2._1 -> -a._2._2 -> a._2._3)).zipWithIndex.map(a => (a._1._1, a._1._2._1, a._1._2._2, a._1._2._3, a._2 + 1))
temp2.collect().foreach(println)
// ((1,2),1,4,7,1)
// ((1,2),1,2,3,2)
// ((1,2),2,2,3,3)
// ((3,4),5,5,5,4)
// ((3,4),5,5,9,5)
// ((3,4),7,5,5,6)
The row_number() over (partition by ... order by ...) functionality was added to Spark 1.4. This answer uses PySpark/DataFrames.
Create a test DataFrame:
from pyspark.sql import Row, functions as F
testDF = sc.parallelize(
    (Row(k="key1", v=(1,2,3)),
     Row(k="key1", v=(1,4,7)),
     Row(k="key1", v=(2,2,3)),
     Row(k="key2", v=(5,5,5)),
     Row(k="key2", v=(5,5,9)),
     Row(k="key2", v=(7,5,5))
    )
).toDF()
Add the partitioned row number:
from pyspark.sql.window import Window
(testDF
 .select("k", "v",
         F.rowNumber()
          .over(Window
                .partitionBy("k")
                .orderBy("k")
               )
          .alias("rowNum")
        )
 .show()
)
+----+-------+------+
| k| v|rowNum|
+----+-------+------+
|key1|[1,2,3]| 1|
|key1|[1,4,7]| 2|
|key1|[2,2,3]| 3|
|key2|[5,5,5]| 1|
|key2|[5,5,9]| 2|
|key2|[7,5,5]| 3|
+----+-------+------+
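(Note: in more recent Spark releases this function is exposed as F.row_number() rather than F.rowNumber(), and ordering the window by a column other than the partition key gives a deterministic numbering within each partition.)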
This is an interesting problem you're bringing up. I will answer it in Python but I'm sure you will be able to translate seamlessly to Scala.
Here is how I would tackle it:
1- Simplify your data:
temp2 = temp1.map(lambda x: (x[0],(x[1],x[2],x[3])))
temp2 is now a "real" key-value pair RDD. It looks like this:
[
    ((3, 4), (5, 5, 5)),
    ((3, 4), (5, 5, 9)),
    ((3, 4), (7, 5, 5)),
    ((1, 2), (1, 2, 3)),
    ((1, 2), (1, 4, 7)),
    ((1, 2), (2, 2, 3))
]
2- Then, use the group-by function to reproduce the effect of the PARTITION BY:
temp3 = temp2.groupByKey()
temp3 is now an RDD with 2 rows:
[((1, 2), <pyspark.resultiterable.ResultIterable object at 0x15e08d0>),
((3, 4), <pyspark.resultiterable.ResultIterable object at 0x15e0290>)]
3- Now, you need to apply a rank function for each value of the RDD. In Python, I would use the simple sorted function (the enumerate will create your row_number column):
temp4 = temp3.flatMap(lambda x: tuple([(x[0],(i[1],i[0])) for i in enumerate(sorted(x[1]))])).take(10)
Note that to implement your particular order, you would need to feed the right "key" argument (in Python, I would just create a lambda function like this one):
lambda t: (t[0], -t[1], t[2])
In the end (without the key argument), the result looks like this:
[
    ((1, 2), ((1, 2, 3), 0)),
    ((1, 2), ((1, 4, 7), 1)),
    ((1, 2), ((2, 2, 3), 2)),
    ((3, 4), ((5, 5, 5), 0)),
    ((3, 4), ((5, 5, 9), 1)),
    ((3, 4), ((7, 5, 5), 2))
]
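For completeness, here is a rough sketch of that last step with the ordering key plugged in and a 1-based index to mirror SQL's row_number (this assumes temp3 is the grouped RDD from step 2 and drops the take(10)):
temp4 = temp3.flatMap(
    lambda kv: [
        (kv[0], row, i + 1)  # (key, (col1, col2, col3), row_number)
        for i, row in enumerate(
            sorted(kv[1], key=lambda t: (t[0], -t[1], t[2]))  # col1 asc, col2 desc, col3 asc
        )
    ]
)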
Hope that helps!
Good luck.
val test = Seq(("key1", (1,2,3)),("key1",(4,5,6)), ("key2", (7,8,9)), ("key2", (0,1,2)))
test: Seq[(String, (Int, Int, Int))] = List((key1,(1,2,3)), (key1,(4,5,6)), (key2,(7,8,9)), (key2,(0,1,2)))
test.foreach(println)
(key1,(1,2,3))
(key1,(4,5,6))
(key2,(7,8,9))
(key2,(0,1,2))
val rdd = sc.parallelize(test, 2)
rdd: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ParallelCollectionRDD[41] at parallelize at :26
val rdd1 = rdd.groupByKey.map(x => (x._1,x._2.toArray)).map(x => (x._1, x._2.sortBy(x => x._1).zipWithIndex))
rdd1: org.apache.spark.rdd.RDD[(String, Array[((Int, Int, Int), Int)])] = MapPartitionsRDD[44] at map at :25
val rdd2 = rdd1.flatMap {
  elem =>
    val key = elem._1
    elem._2.map(row => (key, row._1, row._2))
}
rdd2: org.apache.spark.rdd.RDD[(String, (Int, Int, Int), Int)] = MapPartitionsRDD[45] at flatMap at :25
rdd2.collect.foreach(println)
(key1,(1,2,3),0)
(key1,(4,5,6),1)
(key2,(0,1,2),0)
(key2,(7,8,9),1)
From Spark SQL, read the data files (the Window, functions and DataTypes imports below are the standard ones this snippet needs):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

val df = spark.read.json("s3://s3bukcet/key/activity/year=2018/month=12/date=15/*")
The above file has the fields user_id, pageviews and clicks.
Generate the activity id (row_number) partitioned by user_id and ordered by clicks:
val output = df.withColumn("activity_id", functions.row_number().over(Window.partitionBy("user_id").orderBy("clicks")).cast(DataTypes.IntegerType))