How do I set one variable equal to another in Pig Latin?

I would like to do
register s3n://uw-cse344-code/myudfs.jar
-- load the test file into Pig
--raw = LOAD 's3n://uw-cse344-test/cse344-test-file' USING TextLoader as (line:chararray);
-- later you will load to other files, example:
raw = LOAD 's3n://uw-cse344/btc-2010-chunk-000' USING TextLoader as (line:chararray);
-- parse each line into ntriples
ntriples = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject:chararray,predicate:chararray,object:chararray);
--filter 1
subjects1 = filter ntriples by subject matches '.*rdfabout\\.com.*' PARALLEL 50;
--filter 2
subjects2 = subjects1;
but I get the error:
2012-03-10 01:19:18,039 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input ';' expecting LEFT_PAREN
Details at logfile: /home/hadoop/pig_1331342327467.log
So it seems Pig doesn't like that. How do I accomplish this?

I don't think that kind of "typical" assignment works in Pig. It's not really a programming language in the strict sense; it's a high-level language on top of Hadoop with some specialized functions.
I think you'll need to simply re-project the data from subjects1 into subjects2, such as:
subjects2 = foreach subjects1 generate $0, $1, $2;
Another approach might be to use the LIMIT operator with some absurdly high parameter:
subjects2 = LIMIT subjects1 100000000;
There could be a lot of reasons why that doesn't make sense, but it's a thought.
I sense you are approaching this as you would in a general-purpose programming language.
I have found that rarely works out the way you want it to, but you can always get the job done once you think like Pig.

As I understand it, your example is from the Data Science course on Coursera.
Strangely, I ran into the same problem: the code works on one amount of data but not on another.
Because we need to change the parameters (here, the field names), I used this code:
filtered2 = foreach filtered generate subject as subject2, predicate as predicate2, object as object2;


How to load data into pig using different PigStorage operator

I am new to Apache Pig and am trying to load test Twitter data to find the number of tweets by each user name. Below is my data
format (twitterId,comment,userRefId):
Sample Data
When I load the data into Pig using PigStorage(','), it also splits the comment field into multiple fields, because comments can contain ',' as well. Please let me know how to load this data properly in Pig. I am using the command below:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
Load each record as a single line, then replace ," with | and ", with |. This ensures the fields are separated; then use STRSPLIT to get the 3 fields.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
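To make the replace-then-split idea concrete, here is a minimal sketch of the same string logic in Python rather than Pig. The sample record and the helper name split_quoted_comment are hypothetical, since the original sample data is not shown; the sketch assumes only the comment field is wrapped in double quotes.

```python
def split_quoted_comment(line):
    # Mirror REPLACE(REPLACE(line, ',"', '|'), '",', '|'):
    # turn the comma-quote boundaries around the comment into pipes.
    piped = line.replace(',"', '|').replace('",', '|')
    # maxsplit=2 yields at most 3 fields, like the limit of 3 passed to STRSPLIT.
    return piped.split('|', 2)


# Hypothetical record in the (twitterId,comment,userRefId) format.
sample = '101,"great day, everyone",u42'
print(split_quoted_comment(sample))
```

Note that this approach would likely misbehave if a comment itself contains a | character or the sequence ," so it is worth checking the data for those first.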
EDIT:
I ran the script on sample text and it works fine.
If changing the separator in your source data is an option, I would go that route. Makes it probably a lot easier to get started and to track down issues.
If you change your separator to a |, your code could look like:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Apache Pig - Numeric data missing while loading in a pig relation

I am learning Apache Pig and am trying to load some data into Pig. When I view the txt file in the vi editor, I see the following (sample) row.
[ABBOTT,DEEDEE W GRADES 9-12 TEACHER 52,122.10 0 LBOE
ATLANTA INDEPENDENT SCHOOL SYSTEM 2010].
I use the following command to load data into a pig relation.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage() as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
However, when I do a dump in pig in the distributed environment, I find the following result (for the row mentioned above):
(ABBOTT,DEEDEE W,GRADES 9-12 TEACHER,,0.0,LBOE,ATLANTA INDEPENDENT
SCHOOL SYSTEM,2010).
The numeric data "52,122.10 " seems to be missing.
Please help.
PigStorage() is a built-in Pig function that takes the field delimiter as its argument. Here it is a tab (\t), which is also PigStorage's default:
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage('\t') as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
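One possible contributing factor that the answer above does not mention: the value 52,122.10 contains a thousands separator, so it is not a valid float literal. As far as I know, Pig yields null when a field cannot be converted to its declared type, which would match the empty max_sal in the dump. The sketch below shows the same parsing issue in Python (not Pig); strict_float and parse_salary are hypothetical helper names used only for illustration.

```python
def strict_float(text):
    # Plain float parsing: a thousands separator makes the text invalid.
    try:
        return float(text)
    except ValueError:
        return None


def parse_salary(text):
    # Strip surrounding whitespace and thousands separators before parsing.
    cleaned = text.strip().replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        return None


print(strict_float('52,122.10'))   # the comma prevents parsing
print(parse_salary('52,122.10 '))  # parses once the comma is removed
```

In Pig itself, the analogous step would presumably be to load max_sal as a chararray, remove the comma with REPLACE, and then cast the result to float.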

Unable to extract data with double pipe delimiter in Pig Script

I am trying to extract data which is pipe delimited in Pig. Following is my command
L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');
I am getting the following error:
2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'
My input sample file has exactly 5 lines as following
POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99
I tried several options, such as putting a backslash before the delimiter (\||, \|\|), but everything failed. I also tried with a schema but got the same error. I am using Hortonworks (HDP 2.2.4) and Pig 0.14.0.
Any help is appreciated. Please let me know if you need any further details.
I have faced this case, and from checking the PigStorage source code, I think the PigStorage argument must resolve to a single character.
So we can use this code instead:
L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;
This is helpful if you know how many columns you have, and it should not affect performance because it runs on the map side.
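Why the even-numbered positions are the right ones: splitting on a single | leaves an empty token between each pair of pipes, so the real values sit at positions 0, 2, 4, and so on. A small Python sketch of that behaviour, using the first sample row from the question (split_double_pipe is a hypothetical helper name):

```python
def split_double_pipe(line):
    # Splitting on a single '|' produces an empty token for every '||' pair.
    tokens = line.split('|')
    # Keep the even positions only, like $0, $2, ..., $16 in the FOREACH above.
    return tokens[::2]


row = 'POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0'
print(len(row.split('|')))     # 17 tokens, 8 of them empty
print(split_double_pipe(row))  # the 9 actual fields
```

This only holds if no field is itself empty or contains a | character; otherwise the positions shift.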
When you load data using PigStorage, it expects only a single character as the delimiter.
If you still want to split on the two-character delimiter, you can use MyRegExLoader:
REGISTER '/path/to/piggybank.jar'
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('||')
as (movieid:int, title:chararray, genre:chararray);

Print only nonzero results using AMPL + Neos server

I'm building a relatively big optimization model. It will use 15 time steps, but for now I'm testing it with only 4. However, even with 11 fewer time steps than intended, the model still prints 22,000 rows of variables, of which perhaps merely a hundred differ from 0.
Does anyone see a way past this? That is, a way to make NEOS Server print the variable name and corresponding value only if the value is higher than 0.
What I've tested is:
solve;
option omit_zero_rows 0; (also tried 1;)
display _varname, _var;
With either omit_zero_rows 0; or omit_zero_rows 1; it still prints every result, not only those higher than 0.
I've also tried:
solve;
if _var > 0 then {
display _varname, _var;
}
but it gave me a syntax error. Both (or really, all three) variants were tested in the .run file I use for NEOS Server.
I'm posting a solution to this issue, as I believe more people will stumble upon it. Basically, to print only non-zero values using NEOS Server, write your command file (.run file) as:
solve;
display {j in 1.._nvars: _var[j] > 0} (_varname[j], _var[j]);

powershell assigning output of a ps1 script to a variable

Let me start with I am very new to powershell and programming for that matter. I have a powershell script that takes some arguments and that outputs a value.
The result of the script is going to be something like 9/10 where 9 would be the number active out of the total amount of nodes. I want to assign the output to a variable so I can then call another script based on the value.
This is what I have tried, but it does not work:
$active = (./MyScript.ps1 lb uid **** site)
I have also tried the following which seems to assign the variable an empty string
$active = (./MyScript.ps1 lb uid **** site | out-string)
In both cases they run and give me the value immediately instead of assigning it to the variable. When I call the variable, I get no data.
I would embrace PowerShell's object-oriented nature: rather than outputting a string like "9/10", create an object with properties like NumActiveNodes and TotalNodes, e.g. output this from your script:
new-object psobject -Property @{NumActiveNodes = 9; TotalNodes = 10}
Of course, substitute the dynamic values for the number of active and total nodes. Note that uncaptured objects automatically appear in your script's output. Then, if this is your script's only output, you can do this:
$obj = .\MyScript.ps1
$obj.NumActiveNodes
9
$obj.TotalNodes
10
It will make it nicer for those consuming the output of your script. In fact the output is somewhat self-documenting e.g.:
C:\PS> .\MyScript.ps1
NumActiveNodes TotalNodes
-------------- ----------
             9         10
P.S. When did StackOverflow start sucking so badly at formatting PowerShell script?
If you don't want to change the script (and assuming only the $avail_count/$total_count line is written by the script), you can do:
$var= powershell .\MyScript.ps1
Or just drop the Write-Host so the script outputs $avail_count/$total_count directly,
and then do:
$var = .\MyScript.ps1
You could also assign to a $global:foobar variable in your script; it will persist after the script finishes.
I know the question is a bit old, but this might help someone find the right answer.
I had a similar problem with running one PS script from another PS script and saving the output into a variable. Here are two very good answers:
Mathias
mklement0
Hope it helps!
Please up-vote them if so, because they are really good!