Adding date to new columns returns error - apache-pig

I'm trying to add a new column to my file. I want to add the date to each row of my file.
Filename is: 2016-06-15.txt
The schema my file is:
A B C
7 8 13
I want to obtain:
Date A B C
2016-06-15 7 8 13
For that I'm using Pig with following script:
A = LOAD 'user/cloudera/Analytics/source/file.txt' using PigStorage(' ','-tagPath');
DUMP A ; ****--> ERROR****
STORE A INTO 'user/cloudera/Analytics/source/file.txt' USING PigStorage(' '); ****--> ERROR****
But I'm getting an error and I don't have any log available :( Anyone can help? Many thanks!

You will have to use the -tagFile option to get the file name as the first field.
Before that check to make sure the file path is correct.Looks like a forward slash is missing at the beginning of your file path.Ensure you are using the correct delimiter in PigStorage.Seems like the columns are separated by a tab or multiple spaces.Lastly choose a different folder to Store the new file or else you will get a file exists error.
A = LOAD '/user/cloudera/Analytics/source/2016-06-15.txt' using PigStorage(' ','-tagFile');
STORE A INTO '/user/cloudera/Analytics/NEW_source/2016-06-15.txt' USING PigStorage(' ');

Related

How to load data into pig using different PigStorage operator

I am new to Apache Pig and trying to load test twitter data to find out the number of tweets by each user name. Below is my data
format(twitterId,comment,userRefId):
Sample Data
When I am trying to load data into Pig using PigStorage as (',') it is separating my comment section also into multiple fields because comments could also have','. Please let me know how to load this data properly in Pig. I am using below command:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
Load the record into a line,then replace ," with | and ", with |.This will ensure the fields are separated and then use STRSPLIT to get the 3 fields.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
EDIT:
I used sample text to run the script and works fine.See below
If changing the separator in your source data is an option, I would go that route. Makes it probably a lot easier to get started and to track down issues.
If you change your separator to a |, your code could look like:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Load CSV file in PIG

In PIG, When we load a CSV file using LOAD statement without mentioning schema & with default PIGSTORAGE (\t), what happens? Will the Load work fine and can we dump the data? Else will it throw error since the file has ',' and the pigstorage is '/t'? Please advice
When you load a csv file without defining a schema using PigStorage('\t'), since there are no tabs in each line of the input file, the whole line will be treated as one tuple. You will not be able to access the individual words in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter was a comma. But since the delimiter was a tab and a tab does not exist in the input records, the whole line was treated as one field. Pig doe snot complain if a field is null- It just outputs nothing when there is a null. Hence you see only the commas when you dump b.
Hope that was useful.

apache pig load data with multiple delimiters

Hi everyone I have a problem about loading data using apache pig, the file format is like:
"1","2","xx,yy","a,sd","3"
So I want to load it by using the multiple delimiter "," 2double quotes and one comma like:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
but the PigStorage doesn't accept the multiple delimiter ",".How I can do it? Thank you very much!
PigStorage takes single character as delimiter.You will have use builtin functions from PiggyBank. Download piggybank.jar and save in the same folder as your pigscript.Register the jar in your pigscript.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader(',') AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
Alternate option is to load the data into a line and then use STRSPLIT
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;

Loading selected files from a directory in Pig script

I would like to know how to load some files from a directory in Pig Script .
Let's say there are 4 files in a directory for JAN month and those 4 file names are as below
2016-01-01.txt
2016-01-02.txt
2016-01-03.txt
2016-01-04.txt
Now my requirement is to read files from 2016-01-01 to 2016-01-03, that means taking first 3 files of JAN 2016 ..
My Pig script :
This below line works:
rec = LOAD '/home/dir/{2016-01-01*,2016-01-02*,2016-01-03*}' USING PigStorage(',');
This below line does not work :
rec = LOAD '/home/dir/{2016-01-{01*-03*}}' USING PigStorage(',');
I am getting the below error. I am using Pig 0.14 in MAPR Cluster
N/A file_records MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern maprfs:///home/dir/{2016-01-{01*-03*}} matches 0 files. Paths with components .*, _* were skipped.
0 additional path filters were applied
Could some body explain me what happened and how do I resolve this ?
Possible duplicate Load mutilple files over a date range in PIG
rec = LOAD '/home/dir/{2016-01-0{1,2,3}*}' USING PigStorage(',');
or
rec = LOAD '/home/dir/{2016-01-{01,02,03}*}' USING PigStorage(',');
or
rec = LOAD '/home/dir/{2016-01-0[1-3]*}' USING PigStorage(',');

UniVerse - SQL LIST: View List of All Database Tables

I am trying to obtain a list of all the DB Tables that will give me visibility on what tables I may need to JOIN for running SQL scripts.
For example, in TCL when I run "LIST.DICT" it returns "Name of File:" for input. I then enter "PRODUCT" and it returns a list of all available fields.
However, Where can I get a list of all my available Tables or list of my options that I can enter after "Name of File:"?
Here is what I am trying to achieve. In the screen shot below, I would like to run a SQL script that gives me the latest Log File Activity, Date - Time - Description. I would like the script to return '8/13/14 08:40am BR: 3;BuyPkg'
Thank you in advance for your help.
From TCL within the database account containing your database files, type: LISTF
Sample output:
FILES in your vocabulary 03:21:38pm 29 Jun 2015 Page 1
Filename........................... Pathname...................... Type Modulo
File - Contains all logical device names
DICT &DEVICE& /u1/uv/D_&DEVICE& 2 1
DATA &DEVICE& /u1/uv/&DEVICE& 2 3
File - Used by MAKE.MAP.FILE
DICT &MAP& /u1/uv/D_&MAP& 2 1
DATA &MAP& /u1/uv/&MAP& 6 7
File - Contains all parts of Distributed Files
DICT &PARTFILES& /u1/uv/D_&PARTFILES& 2 1
DATA &PARTFILES& /u1/uv/&PARTFILES& 18 7
DICT &PH& D_&PH& 3 1
DATA &PH& &PH& 1
DICT &SAVEDLISTS& D_&SAVEDLISTS& 3 1
DATA &SAVEDLISTS& &SAVEDLISTS& 1
File - Used by uniVerse to access the current directory.
DICT &UFD& /u1/uv/D_UFD 2 1
DATA &UFD& . 19 1
DICT &XML& D_&XML& 18 3
DATA &XML& &XML& 19 1
Firstly, UniVerse has no Log File Activity Date and Time.
However, you can still obtain the table's modified/ accessed date from the file system however.
To do this,
You need to have a subroutine accepting a path of the table to return a date or a time.
e.g. SUBROUTINE GET.FILE.MOD.DATE(DAT.MOD, S.FILE.PATH)
Inside the subroutine, you can use EXECUTE to run shell command like istat for getting these info on a unix e.g.
Please beware that for a dynamic file e.g. there are Data and Overflow parts under a directory. You should compare the dates obtained and return only the latest one.
Globally catalog the subroutine
Create an I-Desc in VOC, e.g. I.FILE.MOD.DATE in the field definition of this I-Desc: SUBR("*GET.FILE.MOD.DATE",F2) and Conv Code as "D/MDY2"
Create another I-Desc e.g. I.FILE.MOD.TIME
Finally, you can
LIST VOC I.FILE.MOD.DATE I.FILE.MOD.TIME DESC WITH TYPE LIKE "F..."
alternatively in SQL,
SELECT I.FILE.MOD.DATE, I.FILE.MOD.TIME, VOC.DESC FROM VOC WHERE TYPE LIKE "F%";