pig latin - not showing the right record numbers - apache-pig

I have written a Pig script for wordcount which works fine. I could see the results from the Pig script in my output directory in HDFS. But towards the end of the console output, I see the following:
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1695568121_0002 1 1 0 0 0 0 0 0 0 0 words_sorted SAMPLER
job_local2103470491_0003 1 1 0 0 0 0 0 0 0 0 words_sorted ORDER_BY /output/result_pig,
job_local696057848_0001 1 1 0 0 0 0 0 0 0 0 book,words,words_agg,words_grouped GROUP_BY,COMBINER
Input(s):
Successfully read 0 records from: "/data/pg5000.txt"
Output(s):
Successfully stored 0 records in: "/output/result_pig"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local696057848_0001 -> job_local1695568121_0002,
job_local1695568121_0002 -> job_local2103470491_0003,
job_local2103470491_0003
2014-07-01 14:10:35,241 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
As you can see, the job is a success, but the Input(s) and Output(s) are not: both say 0 records were successfully read/stored, and the counter values are all 0.
Why are the values zero? They should not be zero.
I am using Hadoop 2.2 and Pig 0.12.
Here is the script:
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
words = foreach book generate FLATTEN(TOKENIZE(lines)) as word;
words_grouped = group words by word;
words_agg = foreach words_grouped generate group as word, COUNT(words);
words_sorted = ORDER words_agg BY $1 DESC;
STORE words_sorted into '/output/result_pig' using PigStorage(':','-schema');
NOTE: my data is present in /data/pg5000.txt and not in the default directory, which is /usr/name/data/pg5000.txt.
EDIT: here is the output of printing my file to the console:
hadoop fs -cat /data/pg5000.txt | head -10
The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete
by Leonardo Da Vinci
(#3 in our series by Leonardo Da Vinci)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file. Please do not remove it. Do not change or edit the
cat: Unable to write to output stream.

Please correct the following line
book = load '/data/pg5000.txt' using PigStorage() as (lines:chararray);
to
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
I am assuming the delimiter is a comma here; use whichever character actually separates the fields in your file. This will solve the issue.
Also note --
If no argument is provided, PigStorage will assume tab-delimited format. If a delimiter argument is provided, it must be a single-byte character; any literal (e.g. 'a', '|') or known escape character (e.g. '\t', '\r') is a valid delimiter.
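If the job still reports 0 records after that change, a quick sanity check (a minimal sketch; grouped_all and record_count are just illustrative alias names) is to count what the LOAD actually sees:
book = load '/data/pg5000.txt' using PigStorage(',') as (lines:chararray);
grouped_all = GROUP book ALL; -- a single group containing every record
record_count = FOREACH grouped_all GENERATE COUNT(book);
DUMP record_count; -- should print the number of lines read, not 0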

Related

Why does the Dump operator return a path?

I have some simple Pig code:
CRE_28001 = LOAD '$input' USING PigStorage(';') AS (CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray);
-- Generate the file's columns
Data = FOREACH CRE_28001 GENERATE
(chararray) CIA_CD_CRV_CIA AS CIA_CD_CRV_CIA,
(chararray) CIA_DA_EM_CRV AS CIA_DA_EM_CRV,
(chararray) CIA_CD_CTRL_BLCE AS CIA_CD_CTRL_BLCE,
(chararray) RUB_202 AS RUB_202;
-- Apply the required filter
CRE_28001_FILTER = FILTER Data BY (RUB_202 == '6');
LIMIT_DATA = LIMIT CRE_28001_FILTER 10;
DUMP LIMIT_DATA;
I am sure that my filter is correct. The column RUB_202 has more than 100 rows with '6' as the value; I verified that many times.
Look at what I get:
Input(s):
Successfully read 444 records (583792 bytes) from: "/hdfs/data/adhoc/PR/02/RDO0/BB0/MGM28001-2019-08-19.csv"
Output(s):
Successfully stored 0 records in: "hdfs://ha-manny/hdfs/hadoop/pig/tmp/temp1618713487/tmp-1281522727"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1549794175705_3500029 -> job_1549794175705_3500031,
job_1549794175705_3500031
Note that I did not ask to save the data in hdfs://ha-manny/hdfs/hadoop/pig/tmp/temp1618713487/tmp-1281522727.
Why was this generated automatically, and why can't I see any description or presentation of the data?
I also get this when I just try to look at the result of the filter.
The solution is to reference columns using their index number instead of names.
In other words :
Data = FOREACH CRE_28001 GENERATE
(chararray) $0 AS CIA_CD_CRV_CIA,
(chararray) $1 AS CIA_DA_EM_CRV,
(chararray) $2 AS CIA_CD_CTRL_BLCE,
(chararray) $3 AS RUB_202;
Then I used the TRIM function, because some columns had empty spaces in the data!
And it works.
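Putting both fixes together, the script body could look like this (a minimal sketch, assuming the same LOAD statement as in the question; TRIM is a built-in Pig function):
Data = FOREACH CRE_28001 GENERATE
    TRIM((chararray) $0) AS CIA_CD_CRV_CIA,
    TRIM((chararray) $1) AS CIA_DA_EM_CRV,
    TRIM((chararray) $2) AS CIA_CD_CTRL_BLCE,
    TRIM((chararray) $3) AS RUB_202;
CRE_28001_FILTER = FILTER Data BY (RUB_202 == '6'); -- TRIM removes the stray spaces that made this comparison fail
LIMIT_DATA = LIMIT CRE_28001_FILTER 10;
DUMP LIMIT_DATA;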

Combine image data, row id and labels in one input file?

I have train / test input files in this format (filename label):
...\000881.JPG 2
...\000961.JPG 1
...\001700.JPG 1
...\001291.JPG 1
The input file above will be used with the ImageDeserializer. Since I have been unable to retrieve a row ID and the label from my code after the model has been trained, I created a second test file in this format:
|index 881 |piece_type 0 0 1 0 0 0
|index 961 |piece_type 0 1 0 0 0 0
|index 1700 |piece_type 0 1 0 0 0 0
|index 1291 |piece_type 0 1 0 0 0 0
The second file contains the same information as the first, but formatted differently. The index is the row number and |piece_type is the label encoded in one-hot format. I need the file in the second format in order to be able to get to the row number and the label. The second file is used with the CTFDeserializer to create a composite reader like this:
image_source = ImageDeserializer(map_file, StreamDefs(
    features = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    labels = StreamDef(field='label', shape=num_classes)         # and second as 'label'
))
text_source = CTFDeserializer("test_map2.txt")
text_source.map_input('index', dim=1, format="dense")
text_source.map_input('piece_type', dim=6, format="dense")
# define a composite reader
reader_config = ReaderConfig([image_source, text_source])
minibatch_source = reader_config.minibatch_source()
The reason I have added the second file is to be able to create a confusion matrix, for which I need both the true labels and the predicted labels for a given minibatch that I test with. The row numbers are nice to have in order to get a pointer back to the input images.
Would it be possible to do this with just one input file somehow? It's a bit of a hassle to deal with multiple files and formats.
You could load the test images without using a reader, as described in this wiki page. Admittedly this puts the burden of all the transformations (cropping/mean subtraction etc.) on the user, but at least the PIL package makes these easy. This CNTK tutorial uses PIL to crop and scale the input images before feeding them to CNTK.
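As a rough illustration, reader-free evaluation could look like this (a minimal sketch: the model variable, the 128x128 size, the mean value, and the test_rows list are all assumptions to adapt to your setup):
import numpy as np
from PIL import Image

def load_image(path, width=128, height=128, mean=128.0):
    img = Image.open(path).convert('RGB')     # drop any alpha channel
    img = img.resize((width, height), Image.ANTIALIAS)
    data = np.asarray(img, dtype=np.float32)  # HWC layout
    return data.transpose(2, 0, 1) - mean     # CHW layout, simple mean subtraction

# Keeping the row id, true label and prediction together makes the confusion matrix easy.
test_rows = [(881, r'...\000881.JPG', 2), (961, r'...\000961.JPG', 1)]
for row_id, path, true_label in test_rows:
    probs = model.eval(load_image(path))      # model is your trained CNTK network
    print(row_id, true_label, int(np.argmax(probs)))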

How to get the CPU (idle) value from the vmstat result

I am trying to get the value of 'id' in the vmstat result.
However, I found out that the position of the 'id' column differs between platforms such as Linux/AIX/HP...
## Linux
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 35268 117568 158244 1849104 0 0 3 11321 5 2 9 15 73 3 0
So, I think I should find the string 'id', get its column position, and then get the value at that position in the next row.
How can I do that with an awk script?
This one-liner does what you want:
awk '{for(i=NF;i>0;i--)if($i=="id"){x=i;break}}END{print $x}'
First find the index of the 'id' column, then print the corresponding column of the last line.
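For example, to pull the idle percentage from a live run (a sketch; taking two samples so the last line reflects current load rather than the averages since boot):
vmstat 1 2 | awk '{for(i=NF;i>0;i--)if($i=="id"){x=i;break}}END{print $x}'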

Fortran non-advancing reading of a text file

I have a text file with a header of information followed by lines with just numbers, which are the data to be read.
I don't know how many lines there are in the header, and the number varies.
Here is an example:
filehandle: 65536
total # scientific data sets: 1
file description:
This file contains a Northern Hemisphere polar stereographic map of snow and ice coverage at 1024x1024 resolution. The map was produced using the NOAA/NESDIS Interactive MultisensorSnow and Ice Mapping System (IMS) developed under the directionof the Interactive Processing Branch (IPB) of the Satellite Services Division (SSD). For more information, contact: Mr. Bruce Ramsay at bramsay#ssd.wwb.noaa.gov.
Data Set # 1
Data Label:
Northern Hemisphere 1024x1024 Snow & Ice Chart
Coordinate System: Polar Stereographic
Data Type: BYTE
Format: I3
Dimensions: 1024 1024
Min/Max Values: 0 165
Units: 8-bit Flag
Dimension # 0
Dim Label: Longitude
Dim Format: Device Coordinates
Dim Units: Pixels
Dimension # 1
Dim Label: Latitude
Dim Format: Device Coordinates
Dim Units: Pixels
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
..........................
I open the file using:
open(newunit=U, file = ValFile, STATUS = 'OLD', ACCESS = 'SEQUENTIAL', ACTION = 'READ')
Then, I read the file line by line and test for the type of line: header line or data line:
ios = 0
do while ( .NOT. is_iostat_end(ios) )
    read(U, '(A)', iostat = ios, advance = 'NO') line ! Shouldn't advance to the next line
    if (is_iostat_end(ios)) stop "End of file reached before data section."
    tol = getTypeOfLine(line, nValues) ! nValues = 1024, needed to test if the line is data.
    if ( tol > 0 ) then ! If the line holds data,
        exit            ! exit the loop.
    else
        read(U, '(A)', iostat = ios, advance = 'YES') line ! We advance to the next line.
    end if
end do
But the first read in the loop always advances to the next line, and this is a problem.
After exiting the above loop, I enter a new loop to read the data:
read(U, '(1024I1)', iostat = ios) Values(c,:)
Each set of 1024 values can span several lines, but each set is one row of the matrix "Values".
The problem is that this second loop doesn't read the last line read in the testing loop (which is the first line of data).
A possible solution is to read the lines in the testing loop without advancing to the next line. I used advance='no' for this, but it still advances to the next line. Why?
A non-advancing read will still set the file position to before the start of the next record if the end of the current record is encountered while reading from the file to satisfy the items in the input item list of the read statement. Non-advancing doesn't mean "never-advancing". You can use the value assigned to the variable nominated in the iostat specifier of the read statement to see whether the end of the current record was reached: use the IS_IOSTAT_EOR intrinsic, or test against the equivalent value from ISO_FORTRAN_ENV.
(Implicit in the above is the fact that a non-advancing read still advances over the file positions that correspond to items actually read... hence once that getTypeOfLine procedure decides that it has a line of data, at least part of that line has already been read. Unless you reposition the file, subsequent "data" read statements will miss that part.)
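As a hedged illustration, the testing loop could be restructured like this (a minimal sketch, assuming unit U is open, line is longer than any header record, and using BACKSPACE as the repositioning step mentioned above):
do
    read (U, '(A)', iostat = ios, advance = 'NO') line
    if (is_iostat_end(ios)) stop "End of file reached before data section."
    ! is_iostat_eor(ios) being true here means the whole record fitted into
    ! line and the file is now positioned before the next record.
    if (getTypeOfLine(line, nValues) > 0) then
        backspace (U) ! reposition so the data loop re-reads this first data line
        exit
    end if
end do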

Convert data into a specific format in Apache Pig

I want to convert data into a specific format in Apache Pig so that I can use a reporting tool on top of it.
For example:
10:00,abc
10:00,cde
10:01,abc
10:01,abc
10:02,def
10:03,efg
The output should be in the following format:
        abc  cde  def  efg
10:00     1    1    0    0
10:01     2    0    0    0
10:02     0    0    1    0
The main problem here is that a value can occur multiple times per timestamp, and the number of distinct values in the sample CSV file, and hence the number of output columns, can be up to 120 in total.
Any suggestions to tackle this are more than welcome.
Thanks
Gagan
Try something like the following:
A = load 'data' using PigStorage(',') as (key:chararray, value:chararray);
B = foreach A generate key, (value=='abc'?1:0) as abc, (value=='cde'?1:0) as cde, (value=='def'?1:0) as def, (value=='efg'?1:0) as efg;
C = group B by key;
D = foreach C generate group as key, SUM(B.abc) as abc, SUM(B.cde) as cde, SUM(B.def) as def, SUM(B.efg) as efg;
That should get you a count of the occurrences of a particular value for a particular key.
EDIT: I just noticed the "limit 120" part of the question. If you cannot go above 120, add the following:
E = foreach D generate key, (abc>120?'OVER 120':(chararray)abc) as abc, (cde>120?'OVER 120':(chararray)cde) as cde, (def>120?'OVER 120':(chararray)def) as def, (efg>120?'OVER 120':(chararray)efg) as efg;
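For the sample input above, dumping D (with the def column included as in the corrected code) should produce tuples along these lines:
(10:00,1,1,0,0)
(10:01,2,0,0,0)
(10:02,0,0,1,0)
(10:03,0,0,0,1)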