Extracting data using U-SQL file set pattern when silent switch is true - azure-data-lake

I want to extract data from multiple files, so I am using a file set pattern that requires one virtual column. Because of some issues in my data, I also require the silent switch, otherwise I am not able to process my data. It looks like, when I use the virtual column together with the silent switch, it does not extract any rows.
@drivers =
    EXTRACT name string,
            age string,
            origin string
    FROM "/input/{origin:*}file.csv"
    USING Extractors.Csv(silent:true);
Note that I can extract data from a single file by removing the virtual column. Is there any solution for this problem?

First, you do not need to name the wildcard (and expose a virtual column) if you do not plan on referring to its value. That said, we recommend making sure you are not processing too many files with this pattern, so it may be best to use the virtual column as a filter to restrict the number of files to a few thousand for now, until we improve the implementation to work on more files. Both variants are sketched below.
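For illustration, here is a minimal sketch of the two variants, assuming the same schema and path layout as in the question (the filter value "us" is a hypothetical example):

// Variant 1: unnamed wildcard, no virtual column exposed
@drivers =
    EXTRACT name string,
            age string,
            origin string
    FROM "/input/{*}file.csv"
    USING Extractors.Csv(silent:true);

// Variant 2: named wildcard, used as a filter to limit the number of files read
@drivers =
    EXTRACT name string,
            age string,
            origin string
    FROM "/input/{origin}file.csv"
    USING Extractors.Csv(silent:true);

@filtered =
    SELECT name, age, origin
    FROM @drivers
    WHERE origin == "us";   // hypothetical filter; the predicate restricts which files are read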
I assume that at least one file contains some rows with two columns? If that is the case, I think you have found a bug. Could you please send me a simple repro (one file that works, an additional file where it stops working, and the script) to my email address, so I can file it and we can investigate?
Thanks!

Related

How to properly load a GCS file into GBQ with double-pipe delimiters

My existing query:
bq load --field_delimiter="||" --skip_leading_rows=1 source.taxassessor gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_001.txt taxassessor.txt
The error I get back is:
Not enough positional args, still looking for destination_table
I tried to mimic the command in the web UI, but I cannot reproduce it there because the web UI doesn't allow double-pipe delimiters (a limitation of the UI, or a hint at the solution?).
I have two questions:
How do I repair the current query?
The source file OFFRS_5_0_TAXASSESSOR_0001_001.txt is one of many source files, where the last three characters of the file name indicate which file in the series it is. How do I use wildcards so I can get file 002.txt, 003.txt, etc.? Something like OFFRS_5_0_TAXASSESSOR_0001_*.txt?
Thanks
How do I use wildcards so I can get file 002.txt, 003.txt, etc.? Something like OFFRS_5_0_TAXASSESSOR_0001_*.txt?
Do as you suggested, for instance:
bq load --field_delimiter="||" --skip_leading_rows=1 source.taxassessor gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_*.txt taxassessor.txt
It should already work.
How do I repair the current query?
I'm not sure why you are getting this message, as everything seems to be correct... but it still shouldn't work, because your delimiter has two characters and it is supposed to have just one (imagine, for instance, that your file contains the string "abcd|||efg||hijk|||l"; it would be hard to tell where each delimiter is, whether it's the first two pipes or the last two).
If you can't change the delimiter, one thing you could do is save everything in BigQuery as one single STRING field. After that, you can extract the fields as you want, with something like:
WITH data AS (
  SELECT "alsdkfj||sldkjf" AS field UNION ALL
  SELECT "sldkfjld|||dlskfjdslk"
)
SELECT SPLIT(field, "||") AS all_fields FROM data
all_fields will contain all the columns in your files; you can then save the results to some other table or run any analyses you want.
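As a rough, unverified sketch of the whole approach (the \x01 delimiter, the raw table name, and the column positions are assumptions; pick any single character that never appears in your data):

# Load each whole line into a single STRING column by using a field
# delimiter that never occurs in the data (here the control character \x01).
bq load --field_delimiter=$'\x01' --skip_leading_rows=1 \
    source.taxassessor_raw \
    gs://taxassessor/OFFRS_5_0_TAXASSESSOR_0001_*.txt \
    line:STRING

-- Then split each line on "||" and pull out individual columns by position.
SELECT
  SPLIT(line, "||")[SAFE_OFFSET(0)] AS first_column,
  SPLIT(line, "||")[SAFE_OFFSET(1)] AS second_column
FROM source.taxassessor_raw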
As a recommendation, it would probably be better if you could change this delimiter to something else that has just one character.

Pentaho Data Integration (Spoon) Value Mapper Wildcard

Is there a wildcard character for the Value Mapper transformation in Pentaho Spoon? I've done some digging and have only found wildcard solutions for uploading files and documents. I need to be able to map any and all potential values that contain a specific word, yet I don't have a way of identifying all possible variations of the phrase containing that word.
Example: Map website values to a category.
Value -> Mapped Category
facebook.com -> Facebook
m.facebook.com -> Facebook
google.com -> Google
google.ca -> Google
I'd prefer to use a wildcard character (let's call it % for example) so that one mapping captures all cases for a given category (e.g. %facebook% -> Facebook) in my Value Mapper. Another benefit is that the wildcard would correctly map any future site traffic value that comes along. (e.g. A hypothetical l.facebook.com would be correctly mapped if it ever entered my data)
I've tried various characters as wildcards (+, \, *, %) and none have worked.
Please and thank you!
You can use the Replace in String step with regular expressions to do this.
If you still need the original field, create a copy first using the Calculator step. Then you can put a number of mappings into the Replace in String step; they run in sequence, and if a regex matches, the contents of the field are replaced with your chosen mapping.
The performance may not be great, but it gives you the full flexibility of regexes. Do keep in mind that this approach gives you the first match; see my example for what can go wrong.
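For illustration, a hypothetical set of mappings for the websites above might look like this in the Replace in String step with the regex option enabled (field and column names here are approximate, not exact step settings):

In stream field: site    Search (regex): .*facebook\..*          Replace with: Facebook
In stream field: site    Search (regex): .*google\.(com|ca).*    Replace with: Google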

How do I partition a large file into files/directories using only U-SQL and certain fields in the file?

I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: when we receive a file, it contains multiple stores' data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement I used was:
// Output to file
OUTPUT @dt
TO @"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the output path from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two IDs and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently, U-SQL requires that all the file outputs of a script be known at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working on, for release sometime later in 2017.
In the meantime, until the dynamic output feature is available, the pattern to accomplish what you want requires two scripts:
The first script uses GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and writes them to a file.
Then, using scripting or a tool written with our SDKs, download that output file and programmatically generate a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber, as sketched below.
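A rough sketch of the two-script pattern, assuming @dt is the rowset extracted from the large CSV and the IDs are numeric (the paths and values here are hypothetical):

// Script 1: write out the distinct CustomerNumber/StoreNumber pairs.
@pairs =
    SELECT CustomerNumber, StoreNumber
    FROM @dt
    GROUP BY CustomerNumber, StoreNumber;

OUTPUT @pairs
TO "/Data/pairs.csv"
USING Outputters.Csv();

// Script 2 (generated programmatically from pairs.csv): one filtered
// rowset and one explicit OUTPUT per pair, e.g. for customer 1, store 42:
@c1_s42 =
    SELECT *
    FROM @dt
    WHERE CustomerNumber == 1 AND StoreNumber == 42;

OUTPUT @c1_s42
TO "/Data/1/42/PosData.csv"
USING Outputters.Csv();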

Azure Stream Analytics -> how much control over path prefix do I really have?

I'd like to set the prefix based on some of the data coming from event hub.
My data is something like:
{"id":"1234",...}
I'd like to write a blob prefix that is something like:
foo/{id}/guid....
Ultimately I'd like to have one blob for each id. This will help how it gets consumed downstream by a couple of things.
What I don't see is a way to create prefixes that aren't related to date and time. In theory I can write another job to pull from blobs and break it up after the stream analytics step. However, it feels like SA should allow me to break it up immediately.
Any ideas?
{date}, {time}, and {partition} are the only tokens supported in the blob output prefix; {partition} is a number.
Using a column value in blob prefix is currently not supported.
If you have a limited number of such {id}s, then you could work around this by writing multiple SELECT statements with different filters, each writing to a different output with a hardcoded prefix, as sketched below. Otherwise it is not possible with ASA alone.
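For instance, the workaround might look like this in the ASA query (the output aliases and id values are hypothetical; each output would be configured with a hardcoded path prefix such as foo/1234/):

SELECT * INTO [output-id-1234] FROM [input] WHERE id = '1234'

SELECT * INTO [output-id-5678] FROM [input] WHERE id = '5678'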
It should be noted that you actually can do this now. I'm not sure when it was implemented, but you can now use a single property from your message as a custom partition key, and the syntax is exactly what the OP asked for: foo/{id}/something/else
More details are documented here: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-custom-path-patterns-blob-storage-output
Key points:
Only one custom property allowed
Must be a direct reference to an existing message property (i.e. no concatenations like {prop1+prop2})
If the custom property results in too many partitions (more than 8,000), then an arbitrary number of blobs may be created for the same partition.

Solr 5.3 implementation processes docs but doesn't return results

I have recently set up a local instance of Solr 5.3 in an effort to get it going for my company. As an initial test case, I've set up a Data Import Handler (DIH) that indexes PDFs stored within a file directory. When I execute the full import in the admin tool, the DIH processes all the files within the directory, and I'm able to run a general query (*:*) which returns all indexed fields for every record in the index.
When I switch to a specific query using a word definitely contained within the files, however, Solr returns no results. What connection am I not making here?
I can provide excerpts from the schema, solrconfig, and custom data config if needed, but I don't want to oversaturate this post.
The answer I came up with involved a simple newbie mistake combined with something I wasn't anticipating.
1) First, I didn't have my field set to indexed="true". I set that. Yeesh, it stinks being new to this!
2) I needed to make a change to solrconfig.xml for the core in question. Thanks to this article, I was able to determine that I needed to add a default field in the /select requestHandler. Uncommenting the relevant line in solrconfig and changing the field name did the trick; I no longer need to supply the field name in df to return results.
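For reference, the relevant pieces look roughly like this; the field name "content" and its type are assumptions, and your schema will differ:

<!-- schema.xml: the field being searched must be indexed -->
<field name="content" type="text_general" indexed="true" stored="true"/>

<!-- solrconfig.xml: set a default search field for the /select handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">content</str>
  </lst>
</requestHandler>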
My carryover question for anyone coming across this question in the future is whether this latter point is the proper way to go about using default fields. I see that the default search field setting in schema.xml is deprecated (or heading that direction) in 5.3.0, so is it all right to define df in solrconfig instead?