I have an extremely large CSV, where each row contains customer and store ids, along with transaction information. The current test file is around 40 GB (about 2 days worth), so partitioning is an absolute must for any reasonable return time on select queries.
My question is this: When we receive a file, it contains multiple store's data. I would like to use the "virtual column" functionality to separate this file into the respective directory structure. That structure is "/Data/{CustomerId}/{StoreID}/file.csv".
I haven't yet gotten it to work with the OUTPUT statement. The statement use was thus:
// Output to file
OUTPUT #dt
TO #"/Data/{CustomerNumber}/{StoreNumber}/PosData.csv"
USING Outputters.Csv();
It gives the following error:
Bad request. Invalid pathname. Cosmos Path: adl://<obfuscated>.azuredatalakestore.net/Data/{0}/{1}/68cde242-60e3-4034-b3a2-1e14a5f7343d
Has anyone attempted the same kind of thing? I tried to concatenate the outputpath from the fields, but that was a no-go. I thought about doing it as a function (UDF) that takes the two ID's and filters the whole dataset, but that seems terribly inefficient.
Thanks in advance for reading/responding!
Currently U-SQL requires that all the file outputs of a script must be understood at compile time. In other words, the output files cannot be created based on the input data.
Dynamic outputs based on data are something we are actively working for release sometime later in 2017.
In the meanwhile until the dynamic output feature is available, the pattern to accomplish what you want requires using two scripts
The first script will use GROUP BY to identify all the unique combinations of CustomerNumber and StoreNumber and write that to a file.
Then through the use of scripting or a tool written using our SDKs, download the previous output file and then programmatically create a second U-SQL script that has an explicit OUTPUT statement for each pair of CustomerNumber and StoreNumber
Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row to add placeholders for Supervisor piece that's missing when the unique identifier is (2, 1).
Towards this end, I started putting together this pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
sorted = ORDER data BY key, value;
GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null place holders for missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of redundant columns (the first two which are replicated multiple times - once per every key value pair).
Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if for most rows, most of the columns are not filled out, that a better solution IMO would be to use a map. The builtin TOMAP UDF combined with a custom UDF to combine maps would enable you to do this.
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect is no better than organizing your data in some other way.
You could also write a UDF to take in a bag of key/value pairs, another bag all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.
I am a newbie to ABAP (3 days experience) and I am currently on a task to write reports using ABAP code. It is like moving some data from a specific SAP database to a Business Intelligence staging area.
So the core difficulty is that some data on the SAP server is in the format of dictionary structures (FMOIX, FMCOX, etc.) and I need to move these data into internal tables during program runtime. I was told that OPENSQL would not work in this case.
If you still do not get what I mean, I can suggest several ways, actually given by my supervisor. First is to use GET event, say
GET FMOIX.
IF FMOIX-zhdlt > From_dat and FMOIX-zhdlt < to_dat.
Append FMOIX to itab.
ENDIF.
The thing is that I am still not very clear about this GET event. Is it just a event handler thing, or can it loop through data records?
What I googled for more than two days give me something like
LOOP at FMOIX.
MOVE FMOIX to itab.
ENDLOOP.
So what are the ways to move transactional structure like FMOIX into internal tables, say the internal table name is ITAB?
Your answer would be greatly appreciated. Though I have time, I am totally new.
Thanks a lot.
If your supervisor is suggesting that you use the GET event, it means that your program is (or should be) using a logical database - in this case probably FMF or FMF_BCS.
Doing GET FMOIX reads a set of fields defined in the logical database (as a node). Underneath your GET statement, you can use FMOIX as a structure, e.g. WRITE FMOIX-field1. The program will (implicitly, it's not explicity defined in the code like a LOOP...ENDLOOP is) loop through all the rows returned according to your selection criteria. You should be able to use MOVE-CORRESPONDING to move the contents of each row into a proper structure, and then APPEND that structure to your itab.
Quick link on GET in ABAPDocu
Note: this answer is a bit of a guess, since I've only used a logical database once, and the documentation is a little thin on the ground compared to the volumes out there about standard SELECTs and internal tables.
You can create your internal table in type of that structure such as:
data: itab like table of fmoix with header line.
And you can use this internal table to fill up wherever you are using your select codes.
Such as:
select * from ____
into corresponding fields of itab
where zhdlt gt from_dat
and zhdlt lt to_dat.
I'm not sure this is what you are looking for but I can tell you creating itab in type of that structure can be filled up with all corresponding datas that coming from your select. You cant loop FMOIX because its not a table, its a structure. So is there any specific reason to hold your datas in structures?
Hope it was helpful.
Talha
If I have a POCO class with ResultColumn attribute set and then when I do a Single<Entity>() call, my result column isn't mapped. I've set my column to be a result column because its value should always be generated by SQL column's default constraint. I don't want this column to be injected or updated from business layer. What I'm trying to say is that my column's type is a simple SQL data type and not a related entity type (as I've seen ResultColumn being used mostly on those).
Looking at code I can see this line in PetaPoco:
// Build column list for automatic select
QueryColumns = ( from c in Columns
where !c.Value.ResultColumn
select c.Key
).ToArray();
Why are result columns excluded from automatic select statement because as I understand it their nature is to be read only. So used in selects only. I can see this scenario when a column is actually a related entity type (complex). Ok. but then we should have a separate attribute like ComputedColumnAttribute that would always be returned in selects but never used in inserts or updates...
Why did PetaPoco team decide to omit result columns from selects then?
How am I supposed to read result columns then?
I can't answer why the creator did not add them to auto-selects, though I would assume it's because your particular use-case is not the main one that they were considering. If you look at the examples and explanation for that feature on their site, it's more geared towards extra columns you bring back in a join or calculation (like maybe a description from a lookup table for a code value). In these situations, you could not have them automatically added to the select because they are not part of the underlying table.
So if you want to use that attribute, and get a value for the property, you'll have to use your own manual select statement rather than relying on the auto-select.
Of course, the beauty of using PetaPoco is that you can easily modify it to suit your needs, by either creating a new attribute, like you suggest above, or modifying the code you showed to not exclude those fields from the select (assuming you are not using ResultColumn in other join-type situations).
I have a table called book with, the attrbutes are booked_id, yearmon, and day_01....day_31. Now i need to unpivot the table and transform day_01...day_31 into rows, I have successed in doing that, but the problem is that my yearmon is a format of 200805 and i need to append a day to it based on day_01 or day_02 etc, so that i can create a new column with date information for example, if it is day_01, it looks like 20080501. Instead of writing huge query, does anyone how to use ssis to tranform it
You should be able to use the Unpivot component and the Derived Column component to do what you need. Look into those and post back if they don't seem to do what you need.