How to load array of strings with tab delimiter in pig - apache-pig

I have a tab-delimited text file and I am trying to print the first column as id and the remaining strings as a second column, names.
Consider the file below:
cat file.txt;
1 A B
2 C D E F
3 G
4 H I J K L M
In the above file, first column is an id and the remaining are names.
I should get the output like:
id names
1 A,B
2 C,D,E,F
3 G
4 H,I,J,K,L,M
If the names are separated by commas instead, I get this output with the commands below:
test = load '/tmp/arr' using PigStorage('\t') as (id:int,names:chararray);
btest = FOREACH test GENERATE id, FLATTEN(TOBAG(STRSPLIT(names,','))) as value:tuple(name:CHARARRAY);
But when the names are tab-delimited, this does not work, because only the first value lands in column 2 (i.e., names).
Any solution for this?

I have a solution for this:
PigStorage('\t') splits each line on every tab, so a line containing n tabs is loaded as n+1 columns. That is why only the first name ends up in the names field.
But there is a trick:
Separate the id from the names with a different delimiter, such as a comma, and load the file with that delimiter. The names then load as a single field that you can split afterwards.
This should work.
Input file sample:
1,A B
2,C D E F
3,G
4,H I J K L M
Hope this helps
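For comparison, here is a minimal Python sketch of the same idea outside Pig: treat the first tab-separated field as the id and everything after it as the names. The function name split_id_and_names and the inline sample lines are made up for illustration; it assumes every line starts with an integer id.

```python
def split_id_and_names(line, sep="\t"):
    """Return (id, [names]) for one delimited line.

    Assumes the first field is an integer id and all remaining
    fields are names (hypothetical helper, not part of Pig).
    """
    fields = line.rstrip("\n").split(sep)
    return int(fields[0]), fields[1:]


# Sample data mirroring file.txt from the question, with real tabs.
lines = ["1\tA\tB", "2\tC\tD\tE\tF", "3\tG", "4\tH\tI\tJ\tK\tL\tM"]
rows = [split_id_and_names(line) for line in lines]

# Print the id, a tab, then the names joined with commas.
for id_, names in rows:
    print(f"{id_}\t{','.join(names)}")
```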

Related

How can I read and parse files with variant spaces as delim?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care for the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also collapsing all the rows of each file into a single row.
How to parse these files, given that the spacing between the columns varies in almost all of them.
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1,2345,0.0241,0.5901|
|File3|LLKM|0.9812,0,2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
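Here is example code that should work for you:
using DataFrames
function parsefile(filename)
    # Keep only data lines: drop "#" comment lines and blank lines
    l = readlines(filename)
    filter!(x -> !startswith(x, "#") && !isempty(strip(x)), l)
    # split with no delimiter splits on any run of whitespace
    sl = split.(l)
    # Field 2 is SEQ and field 3 is SCORE
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),
            SCORE=parse.(Float64, getindex.(sl, 3)))
end
df = DataFrame()
# One row per file
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame.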
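The core of the parsing step is splitting on runs of whitespace after dropping comment lines. A rough Python sketch of that step follows, using only the standard library. The function name parse_lines and the inline sample are illustrative assumptions; reading the real files and building a DataFrame are left out.

```python
def parse_lines(lines):
    """Return (SEQ string, SCORE list) from the lines of one file.

    Lines starting with "#" and blank lines are skipped.
    str.split() with no argument splits on any run of whitespace,
    so uneven spacing between columns does not matter.
    """
    rows = [ln.split() for ln in lines
            if ln.strip() and not ln.startswith("#")]
    seq = "".join(r[1] for r in rows)      # second field: SEQ
    scores = [float(r[2]) for r in rows]   # third field: SCORE
    return seq, scores


# Sample taken from the question, with deliberately uneven spacing.
sample = [
    "#POS SEQ SCORE QQ-INTERVAL STD MSA DATA",
    "#LL=-4797.62",
    "1   M  0.3821 [0.01331,0.5465] 0.4421 7/7",
    "2 E     0.4508   [0.05393,0.6788] 0.5331 7/7",
    "3  L 0.5334 [0.05393,0.6788]   0.6279 7/7",
    "4 G   0.5339 [0.05393,0.6788] 0.624  7/7",
]
seq, scores = parse_lines(sample)
```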

Split one row into multiple rows based on comma-separated string column

I have a table like below with columns A(int) and B(string):
A B
1 a,b,c
2 d,e
3 f,g,h
I want to create an output like below:
A B
1 a
1 b
1 c
2 d
2 e
3 f
3 g
3 h
If it helps, I am doing this in Amazon Athena (which is based on Presto). I know that Presto provides a function to split a string into an array. From the Presto docs:
split(string, delimiter) → array
Splits string on delimiter and returns an array.
Not sure how to proceed from here though.
Use unnest on the array returned by split.
SELECT a,split_b
FROM tbl
CROSS JOIN UNNEST(SPLIT(b,',')) AS t (split_b)
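To make the semantics of CROSS JOIN UNNEST concrete, here is a small Python sketch of what the query computes: each input row is paired with every element of its split array. The function name unnest_split and the in-memory tbl list are assumptions for illustration only; this is not how Presto executes the query.

```python
def unnest_split(rows, delimiter=","):
    """Pair each value of A with every piece of the split B string."""
    return [(a, part) for a, b in rows for part in b.split(delimiter)]


# The table from the question as (A, B) tuples.
tbl = [(1, "a,b,c"), (2, "d,e"), (3, "f,g,h")]
result = unnest_split(tbl)

for a, split_b in result:
    print(a, split_b)
```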

Extracting a word from string from n rows and append that word as a new col in SQL Server

I have a data set that contains 3 columns and 15565 observations. One of the columns has several words in the same row.
What I am looking to do is extract a particular word from each row and append it to a new column (I will have 4 columns in total).
The problem is that the words I am looking for are not the same in every row and are not always in the same position.
Here is an extract of my DS:
x y z
-----------------------------------------------------------------------
1 T 3C00652722 (T558799A)
2 T NA >> MSP: T0578836A & 3C03024632
3 T T0579010A, 3C03051500, EAET03051496
4 U T0023231A > MSP: T0577506A & 3C02808556
8 U (T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502
I am looking to extract all the words that start with 3C and are 10 characters long, and then append them to a new column so it looks like this:
x y z Ref
----------------------------------------------------------------
1 T 3C00652722 (T558799A) 3C00652722
2 T NA >> MSP: T0578836A & 3C03024632 3C03024632
3 T T0579010A, 3C03051500, EAET03051496 3C03051500
4 U T0023231A > MSP: T0577506A & 3C02808556 3C02808556
8 U >POPMigr.T576447A,C72/221816*3C00721502 3C00721502
I have tried using the Contains, Like and substring methods, but they do not give me the results I am looking for: they find the rows that have the 3C number but do not extract it, and instead copy the whole cell into the Ref column.
SQL Server has limited string functions, but this should suffice if you only want to extract one value per row:
-- stuff() deletes everything before the match; left() keeps the 10-character reference
select t.*,
       left(stuff(z,
                  1,
                  patindex('%3C[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', z) - 1,
                  ''
                 ), 10) as Ref
from t;
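The pattern the query looks for (3C followed by eight digits) is easier to see as a regular expression. Below is a minimal Python sketch of the same extraction, assuming one reference per row as in the answer. The names REF_PATTERN and extract_ref are hypothetical, and the sample strings are copied from the question.

```python
import re

# "3C" followed by exactly eight digits: ten characters in total.
REF_PATTERN = re.compile(r"3C[0-9]{8}")


def extract_ref(z):
    """Return the first 3C reference found in z, or None if absent."""
    match = REF_PATTERN.search(z)
    return match.group(0) if match else None


samples = [
    "3C00652722 (T558799A)",
    "NA >> MSP: T0578836A & 3C03024632",
    "T0579010A, 3C03051500, EAET03051496",
    "(T561041A C72/59460)>POPMigr.T576447A,C72/221816*3C00721502",
]
refs = [extract_ref(z) for z in samples]
```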

character Counting in apache

I have a few text files and I want to count the letters in all of them combined. For example, text1.txt contains "Stackoverflow is so cool". I want to get the total letter count.
Load all the files using the wildcard character * into a field of type chararray. Split each line into words, then into letters, and count them.
A = LOAD '/path/text*.txt' AS (lines:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(lines))) AS words;
C = FOREACH B GENERATE FLATTEN(TOKENIZE(REPLACE(words,'','|'), '|')) AS letters;
D = GROUP C BY letters;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;
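The same word-then-letter counting can be sketched in Python with collections.Counter. This is an illustration under stated assumptions, not a translation of the Pig script: the function name count_letters is made up, the input is a list of strings rather than files, and only alphabetic characters are counted.

```python
from collections import Counter


def count_letters(texts):
    """Count each alphabetic character across all texts, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text.lower() if ch.isalpha())
    return counts


# The example sentence from the question.
counts = count_letters(["Stackoverflow is so cool"])
total = sum(counts.values())  # total number of letters across all texts
```

Summing the per-letter counts gives the combined total the question asks for, while the Counter itself corresponds to the per-letter groups in relation E.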

Split content of a column and get the other replicated

I have a file (too large) with a structure like this
A B C,D,E,F
The third column contains 4 values (but could be variable) separated with commas. I would like to convert that file into
A B C
A B D
A B E
A B F
Basically replicating the first two columns and splitting the third into rows.
Any idea on how to do that in awk?
$ awk '{n=split($3,a,/,/);for(i=1;i<=n;i++)print $1,$2,a[i]}' file
A B C
A B D
A B E
A B F
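Since the file is large, the one-line-at-a-time behaviour of awk matters. A rough Python equivalent that keeps that property is a generator, sketched below. The name explode_third_column is hypothetical, and it assumes whitespace-separated columns with the comma list in the third one.

```python
def explode_third_column(lines):
    """Yield one output line per comma-separated value in column 3.

    Works on any iterable of lines (for example an open file), so the
    whole input never has to be held in memory.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip lines without a third column
        first, second, third = fields[:3]
        for value in third.split(","):
            yield f"{first} {second} {value}"


out = list(explode_third_column(["A B C,D,E,F"]))
```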