Apache Pig: Extracting url query parameters that appear in arbitrary order

I have a logfile with urls that are tagged with custom Google Analytics campaign parameters (utm_source, utm_medium, utm_campaign). I need to extract the parameters from the urls and create a csv file where source, medium and campaign appear each in their own column (plus several other fields from the logfile).
This is how I started (url is the field that contains the url obviously):
extracted = foreach mydata GENERATE date, time,
FLATTEN(REGEX_EXTRACT_ALL(url, '.*utm_source=(.*)&utm_medium=(.*)&utm_campaign=(.*)&.*?'))
AS (source:CHARARRAY, medium:CHARARRAY, campaign:CHARARRAY);
This works, but only as long as the parameters appear in a fixed order (and are not preceded by another parameter in the url).
So this will e.g. extract data from https://www.example.com/page.html?&utm_source=publisher&utm_medium=display&utm_campaign=standard&someotherparam but not from https://www.example.com/page.html?&utm_medium=display&utm_source=publisher&utm_campaign=standard&someotherparam. Since the parameter order is not consistent that doesn't work for me.
I have tried multiple conditions for the regexp separated by or (|), but that only ever gave me the first match. I have also tried to extract each parameter in its own extract command and then join the data, but that took ages and ended up duplicating the data.
So what would be the best (or at least a working) way to rewrite my pig command so that it will extract all three utm parameters from the urls independently of the order in which they appear?

I would simply have three REGEX_EXTRACT calls:
... FOREACH mydata GENERATE REGEX_EXTRACT(url, '.*utm_source=([^&]*)', 1) AS source:CHARARRAY
...
You could probably do it with just one regex, but I find this simpler and more readable.
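For example, a sketch of the complete statement with all three parameters (reusing the field names from the question; adjust to your schema):
extracted = FOREACH mydata GENERATE date, time,
    -- one REGEX_EXTRACT per parameter; each matches independently of position
    REGEX_EXTRACT(url, '.*utm_source=([^&]*)', 1) AS source:CHARARRAY,
    REGEX_EXTRACT(url, '.*utm_medium=([^&]*)', 1) AS medium:CHARARRAY,
    REGEX_EXTRACT(url, '.*utm_campaign=([^&]*)', 1) AS campaign:CHARARRAY;
Since each call only looks for its own parameter, the order in which the utm parameters appear in the url no longer matters, and a missing parameter simply yields a null.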

Related

How to properly load a GCS file into GBQ with double-pipe delimiters

My existing query:
bq load --field_delimiter="||" --skip_leading_rows=1 source.taxassessor gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_001.txt taxassessor.txt
the error I get back is:
Not enough positional args, still looking for destination_table
I tried to mimic the command in the Web UI, but I cannot reproduce it because the Web UI doesn't allow double-pipe delimiters (a limitation of the UI? or the solution?)
I have two questions:
How do I repair the current query?
The source file OFFRS_5_0_TAXASSESSOR_0001_001.txt is one of many source files, with the last three characters of the file name showing which file number in the series that file is. How do I use wildcards so I can get file 002.txt, 003.txt, etc., something like OFFRS_5_0_TAXASSESSOR_0001_*.txt?
Thanks
How do I use wildcards so I can get file 002.txt, 003.txt, etc., something like OFFRS_5_0_TAXASSESSOR_0001_*.txt?
Do as you suggested, for instance:
bq load --field_delimiter="||" --skip_leading_rows=1 source.taxassessor gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_*.txt taxassessor.txt
It should already work.
How do I repair the current query?
Not sure why you are getting this message, as everything seems to be correct... but it still shouldn't work, as your delimiter has 2 characters and it is supposed to have just one (imagine, for instance, that your file contains the string "abcd|||efg||hijk|||l"; it would be hard to tell where the delimiter is, whether it's the first two pipes or the last).
If you can't change the delimiter, one thing you could do is save everything in BigQuery as one entire STRING field. After that, you can extract the fields as you want, something like:
WITH data AS (
  SELECT "alsdkfj||sldkjf" AS field UNION ALL
  SELECT "sldkfjld|||dlskfjdslk"
)
SELECT SPLIT(field, "||") AS all_fields FROM data
all_fields will have all the columns in your files; you can then save the results to some other table or run any analyses you want.
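If you later need the individual columns back, one possible follow-up (a sketch in standard SQL, assuming you know which position each field occupies in the row) is to index into the resulting array:
WITH data AS (
  SELECT "alsdkfj||sldkjf" AS field UNION ALL
  SELECT "sldkfjld|||dlskfjdslk"
),
split_data AS (
  -- same SPLIT as above, kept in a CTE so the array can be indexed
  SELECT SPLIT(field, "||") AS all_fields FROM data
)
SELECT
  all_fields[SAFE_OFFSET(0)] AS first_field,
  all_fields[SAFE_OFFSET(1)] AS second_field
FROM split_data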
As a recommendation, it would probably be better if you could change this delimiter to something else with just one character.

Pentaho Data Integration (Spoon) Value Mapper Wildcard

Is there a wildcard character for the Value Mapper transformation in Pentaho Spoon? I've done some digging and only found wildcard solutions for uploading files and documents. I need to be able to map any and all potential values that contain a specific word, yet I don't have a way of identifying all possible variations of the phrase that contains that word.
Example: Map website values to a category.
Value -> Mapped Category
facebook.com -> Facebook
m.facebook.com -> Facebook
google.com -> Google
google.ca -> Google
I'd prefer to use a wildcard character (let's call it % for example) so that one mapping captures all cases for a given category (e.g. %facebook% -> Facebook) in my Value Mapper. Another benefit is that the wildcard would correctly map any future site traffic value that comes along. (e.g. A hypothetical l.facebook.com would be correctly mapped if it ever entered my data)
I've tried various characters as wildcards (+ \ * %) and none have worked.
Please and thank you!
You can use the step Replace in String with regular expressions to do this.
If you still need the original field, create a copy first using the Calculator step. Then you can put a number of mappings into the Replace step. They will run in sequence and if the regex matches, replace the contents of the field with your chosen mapping.
The performance may not be great, but it gives you the full flexibility of regexes. Do keep in mind that this approach gives you the first match, which is where things can go wrong.
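As an illustration, a hypothetical pair of regex mappings for the Replace in String step (the patterns and replacement values here are made up for this example) could look like:
.*facebook.* -> Facebook
.*google.* -> Google
Because the mappings run in sequence, a value that happens to match more than one pattern ends up with whichever replacement comes first in the list, so the order of the rows in the step matters.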

How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about two days' worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: when we receive a file, it contains data for multiple stores. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was this:
// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts:
The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.
Then, through scripting or a tool written using our SDKs, download the previous output file and programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber.
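A rough sketch of that two-script pattern (paths, column names, and the sample values are assumptions for illustration only, not the actual schema):
// Script 1: collect every distinct CustomerNumber/StoreNumber pair.
@dt =
    EXTRACT CustomerNumber string,
            StoreNumber string,
            TransactionData string
    FROM "/Data/PosData.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@pairs =
    SELECT CustomerNumber, StoreNumber
    FROM @dt
    GROUP BY CustomerNumber, StoreNumber;

OUTPUT @pairs
TO "/Data/Pairs.csv"
USING Outputters.Csv();

// Script 2 (generated): repeat the EXTRACT above, then emit one filtered
// OUTPUT per pair, e.g. for CustomerNumber "42" and StoreNumber "7":
@subset =
    SELECT *
    FROM @dt
    WHERE CustomerNumber == "42" AND StoreNumber == "7";

OUTPUT @subset
TO "/Data/42/7/PosData.csv"
USING Outputters.Csv();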

Extracting data using U-SQL file set pattern when silent switch is true

I want to extract data from multiple files, so I am using a file set pattern that requires one virtual column. Because of some issues in my data, I also require the silent switch; otherwise I am not able to process my data. It looks like when I use the virtual column together with the silent switch, it does not extract any rows.
@drivers =
    EXTRACT name string,
            age string,
            origin string
    FROM "/input/{origin:*}file.csv"
    USING Extractors.Csv(silent:true);
Note that I can extract data from a single file by removing the virtual column. Is there any solution for this problem?
First, you do not need to name the wildcard (and expose a virtual column) if you do not plan on referring to the value. That said, we recommend that you make sure you are not processing too many files with this pattern, so it may be best to use the virtual column as a filter to restrict the number of files to a few thousand for now, until we improve the implementation to work on more files.
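As a sketch of that suggestion (reusing the pattern from the question; the "us" prefix and output path are assumed example values):
@drivers =
    EXTRACT name string,
            age string,
            origin string
    FROM "/input/{origin:*}file.csv"
    USING Extractors.Csv(silent: true);

// Filtering on the virtual column is intended to restrict which files are read.
@filtered =
    SELECT name, age, origin
    FROM @drivers
    WHERE origin.StartsWith("us");

OUTPUT @filtered
TO "/output/drivers.csv"
USING Outputters.Csv();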
I assume that at least one file contains some rows with two columns? If that is the case, I think you found a bug. Could you please send me a simple repro (one file that works, an additional file where it stops working, and the script) to my email address so I can file it and we can investigate?
Thanks!

Test multiple regex on each document

I am getting all documents from a MongoDB collection (millions of them), and I have a lot of regexes in PostgreSQL.
I want to test each regex against multiple fields contained in the documents until one matches.
Do you have any idea how to do that?
I tried with a Filter Rows step, but I can't figure out how to loop over all the regexes from PostgreSQL.
You can solve your problem by using a Join Rows (Cartesian product) component. One of your inputs will have to read in the docs, the other will have to read in the regular expressions. The join component will create an outer product from these, resulting in every possible combination of regexes and docs. You will then have to feed this stream into the Filter Rows component and send the result to some output.
A transformation along these lines mimics this approach (it reads from CSV files, but that should make no difference compared to reading from PostgreSQL or MongoDB).
The Join Rows component does not have to be configured at all, since we will NOT provide a join condition, which effectively makes it a cross join (Cartesian product).
In the Filter Rows component you will have to use the DOC_TEXT and REGEX_TEXT fields to perform the check, based on the REGEXP operator.
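In the condition, this roughly boils down to comparing the two fields:
DOC_TEXT REGEXP REGEX_TEXT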
For this document input
DOC_ID;DOC_TEXT
1;DFGBGGG
2;UHLLJAL
3;JJJJHHH
4;FGAKKBL
and this regex input
REGEX_ID;REGEX_TEXT
1;.*A.*
2;.*B.*
the transformation will output the following result:
DOC_ID;DOC_TEXT;REGEX_ID;REGEX_TEXT
1;DFGBGGG;2;.*B.*
2;UHLLJAL;1;.*A.*
4;FGAKKBL;1;.*A.*
4;FGAKKBL;2;.*B.*