Using multiple functions on the same field in QlikView

I want to use multiple functions on a field and store the result into one field, like this:
left(Campagne,len(Campagne)-4) and Replace(Campagne,'%2f','/') and PurgeChar (Campagne,'.g.c') as Campagne;
How can I do this?

You can either nest functions or use a preceding load to get what you want. Depending on your load script, a preceding load is often neater and easier to follow, at the cost of slightly more script.
Preceding load:
MyTable:
LOAD
    // Runs second, on the output of the LOAD below.
    left(Campagne, len(Campagne) - 4) as Campagne;
LOAD
    // Runs first, straight from the source. Note that PurgeChar strips
    // each listed character ('.', 'g', 'c') individually, not the string '.g.c'.
    Replace(PurgeChar(Campagne, '.g.c'), '%2f', '/') as Campagne
FROM ...
Nesting:
MyTable:
LOAD
    // Because len() needs the transformed value as well, the whole
    // Replace(PurgeChar(...)) chain has to be written out twice.
    left(Replace(PurgeChar(Campagne, '.g.c'), '%2f', '/'),
         len(Replace(PurgeChar(Campagne, '.g.c'), '%2f', '/')) - 4) as Campagne
FROM ...
As you can see in the nesting example, because len() needs the transformed value as well, you end up writing the same chain of operations twice.

Related

Qlik Sense: How to aggregate strings into a single row in script

I am trying to aggregate strings that belong to the same product code into one row. Which Qlik Sense aggregation function should I use?
I am able to aggregate integers in an example like this, but string aggregation fails.
Have you tried MaxString()? It is a string aggregation function.
As x3ja mentioned, you can use an aggregation function in charts that will work for strings, including:
MaxString()
Only()
Concat()
These can result in the type of thing you're looking for.
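For instance, a chart expression along these lines (assuming a hypothetical Comment field, with the product code as the chart dimension) concatenates all of the strings into a single cell:
Concat(DISTINCT Comment, ', ')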
It's worth noting, though, that this sort of problem is almost always an issue with the underlying data model. Depending on what your source data looks like, you should consider investigating your use of Join and/or Concatenate. You can find more info on how to use those functions in the Qlik Help documentation.
Here's a very basic example of using a Join to combine the data properly, so that everything shows up in a single record without needing any aggregations in the table chart.
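A minimal load-script sketch of that idea; the table and field names here are hypothetical stand-ins, not from the original example:
Products:
LOAD ProductCode,
     ProductName
FROM Products.qvd (qvd);

// The Left Join merges the comment onto the matching ProductCode row,
// so each product ends up as one record with all fields populated.
LEFT JOIN (Products)
LOAD ProductCode,
     Comment
FROM Comments.qvd (qvd);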

Difference between "Preview" and Query in BigQuery

I have the following table schema:
+------+----------+----------+
| chn  | INTEGER  | NULLABLE |
+------+----------+----------+
| size | STRING   | NULLABLE |
+------+----------+----------+
| char | REPEATED | NULLABLE |
+------+----------+----------+
| ped  | INTEGER  | NULLABLE |
+------+----------+----------+
When I click on 'preview' in the Google BigQuery Web UI, I get the following result:
But when I query my table, I get this result:
It seems like "preview" interprets my repeated field as an array; I would like to get the same result from a query, to limit the number of rows.
I did try unchecking "Use Legacy SQL", which gave me the same result, but the problem is that with my table the same query takes ~1.0 second with "Use Legacy SQL" checked and ~12 seconds when it's unchecked.
I am looking for speed here, so unfortunately not using Legacy SQL is not an option...
Is there another way to render my repeated field like it does in the "preview"?
Thanks for the help :)
In legacy SQL, BigQuery flattens the result of queries by default. This means two things:
1. All child fields of RECORD fields are propagated to the top level, with their names changed from record.subrecord.leaf to record_subrecord_leaf. Parent records are removed from the schema.
2. All repeated fields are converted to fields of optional mode, with each repeated value expanded into its own row. (As a side note, this step is very similar to the FLATTEN function exposed in legacy SQL.)
What you see here is a product of #2. Each repeated value is becoming its own row (as you can see by the row count on the left-hand side in your two images) and the values from the other columns are, well, repeated for each new row.
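As a hypothetical illustration (invented values, using the chn and char fields from the schema above), a single stored row with a two-element repeated field comes back as two flattened rows:
-- Stored row: chn = 1, char = ['a', 'b']
SELECT chn, char FROM [dataset.table]
-- Flattened legacy SQL result:
-- chn | char
--   1 | a
--   1 | b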
You can prevent this behavior and receive "unflattened results" in a couple ways.
Using standard SQL, as you note in your original question. All standard SQL queries return unflattened results.
While using legacy SQL, setting the flattenResults parameter to false. This also requires specifying a destination table and setting allowLargeResults to true. These can be found in the Show Options panel beneath the query editor if you want to set them within the UI. Mikhail has some good suggestions for managing the temporary-ness of destination tables if you aren't interested in keeping them around.
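To illustrate the first option, the standard SQL version of a simple select returns the repeated field as an array inside a single row rather than expanding it (a sketch, with a hypothetical table path):
#standardSQL
SELECT chn, char
FROM `project.dataset.table`
-- char comes back as an array within one row instead of one row per value.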
I should note that there are a number of corner cases with legacy SQL with flattenResults set to false which might trip you up if you start writing more complex queries. A prominent example is that you can't output more than one independently repeated field in query results using legacy SQL, but you can output multiple with standard SQL. These issues are unlikely to be resolved in legacy SQL, and going forward we're suggesting people use standard SQL when they run into them.
If you could provide more details about your much slower query using standard SQL (e.g. job ID in legacy SQL, job ID in standard SQL, for comparison), I, and the rest of the BigQuery team, would be very interested in investigating further.
Is there another way to render my repeated field like it does in the "preview"?
To see the original, non-flattened output in the Web UI for legacy SQL, I used to set the respective options (click Show Options) to actually write the output to a table, with Allow Large Results checked and Flatten Results unchecked.
This not only saves the result into a table but also shows the result the same way the preview does (because it actually is a preview of that table). To make sure the table gets removed afterwards, I have a "dedicated" dataset (temp) with a default expiration set to 1 day (or 1 hour, depending on how aggressive you want to be with your junk), so you don't need to worry about those tables; they get deleted automatically for you. Worth noting: this was quite a common pattern for us, and having to set the extra options every time was boring, so we ended up with our own custom UI that does all this for the user automatically.
What you see is called flattening.
By default the UI flattens the query output; there is currently no option to show query results the way you want. In order to produce unflattened results you must write to a table, but that's a different thing.

Dynamically execute a transformation against a column at runtime

I have a Pentaho Kettle job that can load data from x number of tables, and put it into target tables with a different schema.
Assume I have table 1, like so:
I want to load this table into a destination table that looks like this:
The columns have been renamed, the order has been changed, and the data has been transformed. The rename, and order is easily managed by using the Select Values step, which can be used within an ETL Metadata Injection step, making it dependent on some configuration values loaded at runtime.
But if I need to perform some transformation logic on some of the columns, based on where they go in the target table, this seems to be less straightforward.
In my example, I want the column "CountryName" to be capitalised, and the column "Rating" to be floored (rounded down to the nearest integer).
While I could do this by just manually adding a transformation to accomplish each, I want my solution to be dynamic, so it could just as easily run the "CountryName" column through a checksum component, or perform a ceiling on "Rating" instead.
I can easily wrap these transformations in another transformation so that they can be parameterised and executed when needed:
But, where I'm having trouble is, when I process a row of data, I need a way to be able to say:
Column "CountryName" should be passed through the Capitalisation transform
Column "Rating" should be passed through the Floor transform
Column(s) "AnythingElse" should be passed through the SomeOther transform
Is there a way to dynamically split out the columns in a row, and execute a different transform on each one, based on some configuration metadata that can be supplied?
Logically, it would be something like this, although I suspect there may be a way to handle it as a loop or some form of dynamic transformation, rather than mapping out a path per column:
Kettle is so flexible that it seems like there must be a way to do this, I'm just struggling to know which components to use and how to do it. Any experts out there have some suggestions?
I'm dealing with some biggish data sets here (hundreds of millions of rows), so I'm reluctant to use the Row Normaliser/Denormaliser or write to file/DB if possible.
Have you considered the Modified Java Script Value step? Start with a Data Grid step, then a Select Values step, then the Modified Java Script Value step. In that step you can transform the value of each column into whatever form you want and output the result to a file.
That of course requires some JavaScript knowledge, but given your example it seems the required knowledge is pretty basic; a sketch follows.
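A rough sketch of the script inside the Modified Java Script Value step, assuming the row carries the CountryName and Rating fields from your example. The transform map is hard-coded here, but it could just as well be built from configuration metadata loaded earlier in the job (all names are illustrative):
// Map each column name to the transform it should receive.
var transforms = {
  CountryName: function (v) { return v.toUpperCase(); }, // capitalise
  Rating:      function (v) { return Math.floor(v); }    // floor
};

// Apply the configured transform to each incoming field. Declare
// outCountryName and outRating as new fields in the step's Fields grid.
var outCountryName = transforms.CountryName(CountryName);
var outRating      = transforms.Rating(Rating);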

What is the order of data across multiple nested fields in BigQuery?

Given a BigQuery table with the schema: target:STRING,evName:STRING,evTime:TIMESTAMP, consider the following subselect:
SELECT target,
       NEST(evName) AS evNames,
       NEST(evTime) AS evTimes,
FROM [...]
GROUP BY target
This will group events by target into rows with two repeated fields evNames and evTimes. I understand that the values within each of the repeated fields are not ordered in any predictable way, but is the ordering guaranteed to be consistent between the two repeated fields?
In other words, if I pick N-th value from evNames and N-th value from evTimes within a given row, will they form a proper pair from the original table?
What I would really like to do is to create a nested repeated record, something like:
SELECT target, NEST(RECORD(evName, evTime)) AS events FROM [...] GROUP BY target
but I believe creating RECORDs on the fly like this is currently not supported.
By the way, this question is motivated by the desire to use recently introduced BigQuery user defined functions to implement state machines, as an alternative to window functions tricks.
Note: I realize that an alternative is to emulate record by serializing multiple fields into a single string representation, e.g.:
SELECT target, NEST(CONCAT(evName, ',', STRING(evTime))) ...
and then deserialize the "record" in later stages, but I'd like to avoid that if I can.
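For reference, the later-stage deserialization could look something like this legacy SQL sketch (the table name is hypothetical, and it assumes the serialized values were saved as a repeated field named events):
SELECT target,
       REGEXP_EXTRACT(events, r'^([^,]+),') AS evName,
       REGEXP_EXTRACT(events, r',(.*)$') AS evTime
FROM FLATTEN([dataset.events_table], events)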

Hive UDF returning an array called twice - performance?

I have created a GenericUDF in Hive that takes one string argument and returns an array of two strings, something like:
> select normalise("ABC-123");
...
> [ "abc-123", "abc123" ]
The UDF makes a call out via JNI to a C++ program for each row to calculate the return data so it would be preferable to only have to make the call once per input row for performance reasons.
However, I want to be able to take each value from the array and put it into a separate field in the output table. I know I can do:
> select normalise("ABC-123")[0] as first_string, normalise("ABC-123")[1] as second_string;
Will Hive call the normalise function twice (once for each time it is used in this statement), or will it see that both calls have the same argument, call it only once, and reuse the cached output rather than making the call a second time?
If it is going to make two UDF calls per row, what other options are there to use this UDF and put the two strings from the output array into separate columns in an output table? (I don't think INLINE will work here)
The use case for this function will be something like:
a|b
1|ABC-123
2|DEF-456
select a, normalise(b)[0] as first_string, normalise(b)[1] as second_string from mytable;
If you want to make sure that the udf is only called once, you could save the results to a temporary table first:
create table tmp as
select a, normalise(b) as arr
from mytable;

select a, arr[0] as first_string, arr[1] as second_string
from tmp;
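If materializing a temp table feels too heavyweight, a subquery is sometimes enough, though whether Hive actually evaluates the UDF only once here depends on your Hive version and optimizer, so it's worth verifying with EXPLAIN:
select a, arr[0] as first_string, arr[1] as second_string
from (
  select a, normalise(b) as arr
  from mytable
) t;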
That said, I probably wouldn't worry about this kind of performance tuning if I were you. In my opinion, Hive is best approached with more of a "brute force" state of mind: just write the simplest code that achieves your task, and if it's slow, you can always add more nodes to your cluster.
Also, it might be worth considering whether you really need a custom UDF for your task, or whether you can simplify your codebase by using built-in Hive functions; in the example you gave:
select lower(b) as first_string,
       regexp_replace(lower(b), '-', '') as second_string
from mytable;