apache-pig dse pig flatten usage

When should I use FLATTEN in Pig? I'm not able to understand it from the documentation. The error messages shown and the actual issue are entirely different in Pig: sometimes it says flatten could not be imported, yet the same flatten works elsewhere.

Whenever you use the GROUP command on an identifier in your data file, Pig lists all the tuples belonging to that identifier inside a bag, which can be quite cumbersome to read.
If you use FLATTEN on top of the GROUP clause, it lists all the tuples separately in your output file. The drawback of using FLATTEN is that it can duplicate the same record, so to remove duplicates you need to write an extra piece of code (see the DISTINCT sketch after the examples below).
Example of Non-flattened code:
X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
Example of flattened code:
X = GROUP A BY f1;
Y = FOREACH X GENERATE FLATTEN(A);
DUMP Y;
(1,2,3)
(4,2,1)
(4,3,3)
(8,3,4)
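As noted above, FLATTEN can repeat records; the usual extra piece of code is a DISTINCT pass. A minimal sketch, reusing the relations from the flattened example above:
Z = DISTINCT Y;
DUMP Z;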

Related

is there a way to subset an AnnData object after reading it in?

I read in the excel file like so:
data = sc.read_excel('/Users/user/Desktop/CSVB.xlsx',sheet= 'Sheet1', dtype= object)
There are 3 columns in this data set that I need to work with as .obs but it looks like everything is in the .X data matrix.
Anyone successfully subset after reading in the file or is there something I need to do beforehand?
Okay, so assuming sc stands for the scanpy package, read_excel just takes the first row as .var and the first column as .obs of the AnnData object.
The data returned by read_excel can be tweaked a bit to get what you want.
Let's say the index of the three columns you want in the .obs are stored in idx variable.
idx = [1,2,4]
Now, .obs is just a Pandas DataFrame, and data.X is just a Numpy matrix. Thus, the job is simple.
# assign some names to the new columns
new_col_names = ['C1', 'C2', 'C3']
# add the columns to data.obs
data.obs[new_col_names] = data.X[:,idx]
If you wish to also remove the idx columns from data.X, I suggest making a new AnnData object for this.
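A minimal sketch of that last step (assuming the anndata package is importable as ad, and that idx and the new .obs columns are set up as above):
import numpy as np
import anndata as ad

# columns of X to keep: everything except the ones moved into .obs
keep = np.setdiff1d(np.arange(data.X.shape[1]), idx)
# build a fresh AnnData without the idx columns
data_new = ad.AnnData(X=data.X[:, keep],
                      obs=data.obs.copy(),
                      var=data.var.iloc[keep].copy())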

Selecting Columns Based on Multiple Criteria in a Julia DataFrame

I need to select values from a single column in a Julia DataFrame based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (principal component analysis), so I first split the original data into an analytical matrix and a label array. This is my code so far (it doesn't work):
### Initialize source dataframe for PCA
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
### Extract last column as labels
arLabels=dfSource[1:2:end,3]
### Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)
output> MethodError: no method matching...
At the last line before the print(datGet) statement, I get a MethodError indicating a method mismatch related to use of the & logic. What have I done wrong?
A small example of an alternative implementation (maybe you will find it useful to see what DataFrames.jl has built in):
# avoid materialization if dfSource is large
dfSourceHalf = @view dfSource[1:2:end, :]
lazyFilter = Iterators.filter(row -> 0.2 <= row[3] < 0.7, eachrow(dfSourceHalf))
matFiltered = mapreduce(row -> collect(row[1:2]), hcat, lazyFilter)
matFiltered[1, :]
(this is not optimized for speed, but is rather a showcase of what is possible; still, it is already several times faster than your code)
This code works:
dfSource=DataFrame(
colDataX=[0,5,10,15,5,20,0,5,10,30],
colDataY=[1,2,3,4,5,6,7,8,9,0],
colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
arLabels=dfSource[1:2:end,3]
datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
print(datGet)
output> [0,10,0]
Note the use of the parenthetical enclosures (arLabels.>=0.2) and (arLabels.<0.7), as well as the use of the .>= and .< syntax (which makes Julia broadcast the comparison across a container/collection). Finally, and most crucially (since it's the part most people miss), note the use of .& in place of plain &. The dot operator makes all the difference!
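If you prefer to stay entirely inside DataFrames.jl, here is a sketch using the built-in filter (untested; column names are taken from your dfSource definition):
using DataFrames
dfHalf = dfSource[1:2:end, :]
# keep only rows whose label falls in [0.2, 0.7)
dfKept = filter(row -> 0.2 <= row.colRowLabels < 0.7, dfHalf)
datGet = Matrix(dfKept[:, [:colDataX, :colDataY]])'[1, :]
This should reproduce the [0, 10, 0] output above.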

Parsing a SQL spatial column in Python

I am struggling a bit as I am new to programming. I am currently writing a Python script and I am a bit stuck. The goal is to parse some spatial information that gets pulled from SQL into a format that is usable for my py script down the line.
I was able to CAST through a SQL query and fetchall using the odbc module. However, once I fetch the data, that is where it gets tricky for me. Here is an example of a print from the fetchall:
[(u'POLYGON ((7014.186279296875 6602.99658203125 1612.5, 7015.984375 6600.416015625 1612.5))',), (u'POLYGON ((6730.962646484375 6715.2490234375 1522.5, 6730.0869140625 6714.13916015625 1522.5))',)]
I am not exactly sure what I am getting here; it looks like a list of tuples, which I have tried converting to a list of lists, but there must be something I am missing.
Here is the usable format I am looking for:
[[7014.186279296875, 6602.99658203125, 1612.5], [7015.984375, 6600.416015625, 1612.5]]
[[6730.962646484375, 6715.2490234375, 1522.5], [6730.0869140625, 6714.13916015625, 1522.5]]
Any ideas of how I can accomplish this? Maybe there is a better way to CAST in SQL or a module in python that would be easier to use instead of just doing a cursor.fetchall() and parsing? Or any any parsing help would be useful. Thanks.
If you want to do the parsing yourself, that should be straightforward. For the example you've provided, the following code would do it:
result = []
for element in data:
    # strip the leading "POLYGON ((" and the trailing "))"
    single_elements = element[0][10:-2].split(', ')
    for se in single_elements:
        # each point is "x y z", separated by spaces
        row = str(se).split(' ')
        result.append([float(a) for a in row])
Result will contain what you need. If parsing is not an option, then paste some of your code so I can see how you're fetching the data.
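Alternatively, a regex-based sketch (assuming the coordinates are always plain decimal numbers grouped in x y z triples) avoids the hard-coded [10:-2] slicing and also keeps the points grouped per polygon:
import re

parsed = []
for element in data:
    # pull every numeric token out of the WKT string
    nums = [float(n) for n in re.findall(r'-?\d+(?:\.\d+)?', element[0])]
    # regroup the flat list of numbers into [x, y, z] points
    parsed.append([nums[i:i + 3] for i in range(0, len(nums), 3)])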

apache pig group by output -- remove "(" and "{"

I do the following:
a = load '/hive/warehouse/' USING PigStorage('^') as (a1,b1,c1);
b = group a by (a1) ;
c = foreach b generate group, a.$2;
dump c;
Output shows all the groups:
abc {(1),(44),(66)}
cde {(1),(44),(66)}
How can I remove the "{" and "(" characters so that the final HDFS file can be read as a comma-delimited file?
You can't do this directly in Pig. The special syntax is required because you are storing a bag, and in order for Pig to be able to read this bag later, it needs to be stored with braces (for the bag) and parentheses (for the tuples contained in the bag).
You have a couple of options. You can read the file back into Pig, but instead of reading it as a bag, read it as a chararray. Then you can perform regex substitution to get rid of the punctuation (untested):
a = LOAD 'output' AS (group:chararray, list:chararray);
b = FOREACH a GENERATE group, REPLACE(list, '[{()}]', '');
Another option is to write a UDF which will turn a bag into a tuple. Note that this is not a well-defined operation: bags have no particular order, so from one run to the next, your tuple is not guaranteed to be in the same order. But for your purposes it sounds like that may not matter. The UDF could look like (very rough draft, untested):
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DefaultTuple;
import org.apache.pig.data.Tuple;

public class BAG_TO_TUPLE extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> iterator = bag.iterator();
        Tuple out = new DefaultTuple();
        while (iterator.hasNext()) {
            // assumes each tuple in the bag has exactly one field
            out.append(iterator.next().get(0));
        }
        return out;
    }
}
The above UDF is terrible -- it assumes that you have exactly one element in every tuple of the bag (that you care about) and does no checking whatsoever that the input is valid, etc. But it should get you towards what you want.
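If you do go this route, usage might look like the following (the jar name is illustrative, and $1 refers to the bag projected in relation c above):
REGISTER bag_to_tuple.jar;
d = FOREACH c GENERATE group, BAG_TO_TUPLE($1);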
The best solution, though, is to find a way to handle the extra punctuation outside of Pig if Pig is not part of your downstream processing.
This functionality is now provided in Pig as a built-in func (I'm using 0.11).
http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
c = foreach b generate group, a.$2 as stuff;
d = foreach c generate group, BagToString(stuff, ',');
I don't need a comma-delimited file for my use case, but I assume you can use a store func to get the final commas (between group and the now-comma-delimited list of bag items).
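For example, a store step with a comma delimiter might look like this (the output path is illustrative):
STORE d INTO '/output/comma_delimited' USING PigStorage(',');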
Try the FLATTEN operator:
c = foreach b generate group, FLATTEN(a.$2);
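With the sample groups shown earlier, that should produce rows like:
(abc,1)
(abc,44)
(abc,66)
(cde,1)
(cde,44)
(cde,66)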

How to read data from data bag within a PIG script

I have a DataBag which is in the following format:
{([ChannelName#{ (bigXML,[])} ])}
The DataBag consists of only one item, which is a Tuple.
The Tuple consists of only one item, which is a Map.
The Map is a map between channel names and values.
Here the value is of type DataBag, which consists of only one tuple.
That tuple consists of two items: one is a chararray (a very big string) and the other is a map.
I have a UDF that emits the above bag.
Now I need to invoke another UDF, passing the single tuple within the DataBag for a given channel from the Map.
Assuming there was no data bag, just a tuple such as
([ChannelName#{ (bigXML,[])} ])
I can access the data using $0.$0#'StdOutChannel'
Now with the tuple inside a bag
{([ChannelName#{ (bigXML,[])} ])}
If I do $0.$0.$0#'StdOutChannel' (prepend $0), I get the following error:
ERROR 1052: Cannot cast bag with schema bag({bytearray}) to map
How can I access data within a data bag?
Try to break this problem down a little.
Let's say you get your inner bag:
MYBAG = $0.$0#'StdOutChannel';
First, can you ILLUSTRATE or DUMP this?
What can you do with this bag? Usually FOREACH over the tuples inside.
A = FOREACH MYBAG {
GENERATE $0 AS MyCharArray, $1 AS MyMap
};
ILLUSTRATE A; -- or if this doesn't work
DUMP A;
Can you try this interactively, and maybe edit your question with a few more details as a result of trying these things?
Some editing hints for StackOverflow:
put backticks around your code (`ILLUSTRATE`)
indent code blocks by 4 spaces on each line