In Pig, how do I filter on a date, the way SQL uses WHERE date = ''? - apache-pig

I am new to Pig scripting but good with SQL. I want the Pig equivalent of this SQL line:
SELECT * FROM Orders WHERE Date='2008-11-11'.
Basically, I want to load the data for a single id or date. How do I do that?

I did this and it worked: I used FILTER in Pig and got the desired results.
`ivr_src = LOAD '/raw/prod/...;
info = foreach ivr_src generate timeEpochMillisUTC as time, cSId as id;
Filter_table= FILTER info BY id == '700000';
sorted_filter_table = Order Filter_table BY $1;
store sorted_filter_table into 'sorted_filter_table1' USING PigStorage('\t', '-schema');`

Related

Data processing in Pig with a tab-separated file

I am very new to Pig, so I am facing some issues while trying to perform very basic processing in Pig.
1- Load that file using Pig
2- Write processing logic to filter records based on date. For example, the lines have two columns, col_1 and col_2 (assume both are chararray), and I need only the records where col_1 and col_2 are one day apart.
3- Finally, store the filtered records in a Hive table.
Input file (tab separated):
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
the result I get is:
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
I am not sure why.
Can someone please help me parse the tab-separated file, convert the chararray values to dates, and filter on the difference in days?
Thanks
Convert both columns to datetime objects with ToDate, then use DaysBetween to get the difference in days and keep only the records where it equals 1. Finally, store the result in Hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
-- The input is ISO 8601, so the single-argument ToDate can parse it; DaysBetween takes the later date first
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2),ToDate(col_1)) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
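As a cross-check of the filtering logic only (not a replacement for the Pig script), here is a minimal Python sketch of the same idea: parse both tab-separated timestamps and keep the lines whose whole-day difference is 1. The second sample line is a hypothetical record added to show a row being rejected, and the helper name one_day_apart is made up for illustration.

```python
from datetime import datetime

# Layout of the sample timestamps, e.g. 2016-01-01T16:31:40.000+01:00
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def one_day_apart(line):
    """Return True when col_2 is one whole day after col_1 (partial days are truncated)."""
    col_1, col_2 = line.rstrip("\n").split("\t")
    start = datetime.strptime(col_1, ISO_FORMAT)
    end = datetime.strptime(col_2, ISO_FORMAT)
    return (end - start).days == 1


lines = [
    "2016-01-01T16:31:40.000+01:00\t2016-01-02T16:31:40.000+01:00",
    # Hypothetical record, four days apart, so it should be filtered out
    "2017-01-01T16:31:40.000+01:00\t2017-01-05T16:31:40.000+01:00",
]

kept = [line for line in lines if one_day_apart(line)]
print(kept)
```

Running a few real lines through a check like this before the Pig job can help confirm that the timestamps parse and that the sign of the difference is the one you expect.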

PIG script for creating IDxCITY matrix from given csv file

I have an input file that contains ID, CITY and COUNT information as shown below, and I want to create a CSV file that lists, for each ID, the count for every CITY. The count should be written as '0' if the ID has no record for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN, but could not get it to produce this sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
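To make the intended result of the COGROUP easier to see, here is a small Python sketch of the same pivot, using the sample rows from the question. It assumes the list of cities is known in advance, just as the Pig script hard-codes one filter per city; build_matrix is a hypothetical helper name.

```python
def build_matrix(rows, cities):
    """Pivot (id, city, count) rows into one row per id with a column per city."""
    counts = {}
    for id_, city, count in rows:
        counts.setdefault(id_, {})[city] = count
    # Cities with no record for an id default to "0"
    return [
        [id_] + [counts[id_].get(city, "0") for city in cities]
        for id_ in sorted(counts)
    ]


rows = [
    ("00004589", "IZMIR", "2"),
    ("00005275", "KOCAELI", "1"),
    ("00005275", "ISTANBUL", "1"),
    ("00008523", "ESKISEHIR", "2"),
]
cities = ["IZMIR", "ISTANBUL", "ESKISEHIR", "KOCAELI"]

matrix = build_matrix(rows, cities)
print(",".join(["ID"] + cities))
for row in matrix:
    print(",".join(row))
```

Each printed row corresponds to one tuple of CG_CITY: the ID followed by the count for each city in a fixed order.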

How to count the number of rows with a date from a certain year in CodeIgniter?

I have the following query.
$query = $this->db->query('SELECT COUNT(*) FROM iplog.persons WHERE begin_date LIKE '2014%'');
I need to count the number of columns with a begin_date in the year 2014.
When I run this script I'm getting an error:
Parse error: syntax error, unexpected '2014' (T_LNUMBER) in C:\xampp\htdocs\iPlog2\application\controllers\stat.php on line 12
I was trying to change my CI script to
$query = $this->db->query('SELECT COUNT(*) FROM iplog.persons WHERE begin_date LIKE "2014%"');
but it caused an error.
You mean count ROWS.
For that, just count the rows that match a condition:
$year = '2014';
$this->db->from('persons');
// 'after' puts the wildcard at the end: LIKE '2014%'
$this->db->like('begin_date', $year, 'after');
$query = $this->db->get();
$rowcount = $query->num_rows();
First, you have a simple typo in your use of single quotes. Your complete SQL string should be double-quoted so that the value inside it can be single-quoted.
Second, you are using inappropriate query logic. When you want to make a comparison on a DATE or DATETIME column, you should never use LIKE. MySQL has specific functions dedicated to handling these types. In your case, you should use YEAR() to isolate the year component of your begin_date values.
Resource: https://www.w3resource.com/mysql/date-and-time-functions/mysql-year-function.php
You could write the raw query like this (COUNT(*) and COUNT(1) are equivalent); the alias gives the result object a property name you can read:
$count = $this->db
->query("SELECT COUNT(1) AS total FROM persons WHERE YEAR(begin_date) = 2014")
->row()
->total;
Or if you want to employ Codeigniter methods to build the query:
$count = $this->db
->where("YEAR(begin_date) = 2014")
->count_all_results("persons");
You could return all of the values in all of the rows that qualify, but that would mean asking the database for values that you have no intention of using -- this is not best practice. I do not recommend the following:
$count = $this->db
->get_where('persons', 'YEAR(begin_date) = 2014')
->num_rows();
For this reason, you should not be generating a fully populated result set then calling num_rows() or count() when you have no intention of using the values in the result set.
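The point about letting the database do the counting can be illustrated outside CodeIgniter. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, so it is an assumption-laden illustration rather than the CodeIgniter code: SQLite has no YEAR() function, so strftime('%Y', ...) plays that role here, and the table contents are made-up sample rows.

```python
import sqlite3

# In-memory stand-in for the persons table; the rows are invented sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (begin_date TEXT)")
conn.executemany(
    "INSERT INTO persons (begin_date) VALUES (?)",
    [("2013-12-31",), ("2014-01-15",), ("2014-07-04",), ("2015-02-01",)],
)


def count_for_year(connection, year):
    """Ask the database for a single COUNT value instead of fetching every row."""
    row = connection.execute(
        "SELECT COUNT(*) FROM persons WHERE strftime('%Y', begin_date) = ?",
        (str(year),),
    ).fetchone()
    return row[0]


print(count_for_year(conn, 2014))
```

Only one row with one value crosses the connection, no matter how many rows match, which is the reason for preferring a COUNT query over num_rows() on a full result set.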
Replace the quotes like this:
$query = $this->db->query("SELECT COUNT(*) FROM iplog.persons WHERE begin_date LIKE '2014%'");
Double-quote your entire query, then single-quote your LIKE criteria.

Storing Date and Time In PIG

I am trying to store a text file that has two columns, date and time.
Something like this:
1999-01-01 12:08:56
Now I want to perform some date operations using Pig, but I want to store the date and time like this:
1999-01-01T12:08:56 (I checked this link):
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
What I want to know is which format I can use so that my date and time are in one column that I can feed to Pig, and how to load that value into Pig. I know it needs to be converted to datetime, but that is showing errors. Can somebody tell me how to load date and time data together? An example would be of great help.
Please let me know if this works for you.
input.txt
1999-01-01 12:08:56
1999-01-02 12:08:57
1999-01-03 12:08:58
1999-01-04 12:08:59
PigScript:
A = LOAD 'input.txt' using PigStorage(' ') as(date:chararray,time:chararray);
B = FOREACH A GENERATE CONCAT(date,'T',time) as myDateString;
C = FOREACH B GENERATE ToDate(myDateString);
dump C;
Output:
(1999-01-01T12:08:56.000+05:30)
(1999-01-02T12:08:57.000+05:30)
(1999-01-03T12:08:58.000+05:30)
(1999-01-04T12:08:59.000+05:30)
Now myDateString has been converted to a datetime object, so you can process this data with all the built-in date functions.
If you want to store the output in this format
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
you can use REGEX_EXTRACT to keep everything up to the ".", something like this:
D = FOREACH C GENERATE ToString($0) as temp;
E = FOREACH D GENERATE REGEX_EXTRACT(temp, '(.*)\\.(.*)', 1);
dump E;
Output:
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
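For comparison, the same two steps (join the date and time columns with a "T", then parse the result as a timestamp) can be sketched in Python. This only mirrors the logic of the Pig script; the helper name combine is hypothetical, and no time zone is attached here, unlike Pig's ToDate, which applied a default zone in the output above.

```python
from datetime import datetime


def combine(line):
    """Join the space-separated date and time columns and parse them as one timestamp."""
    date_part, time_part = line.split()
    return datetime.fromisoformat(f"{date_part}T{time_part}")


sample = [
    "1999-01-01 12:08:56",
    "1999-01-02 12:08:57",
]

stamps = [combine(line) for line in sample]
for stamp in stamps:
    # isoformat() omits the fractional part when microseconds are zero
    print(stamp.isoformat())
```

Once the value is a real datetime, formatting it without the fractional seconds is a matter of output formatting rather than string surgery.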

Conversion from Hive to Pig Latin

I am trying to convert the below Hive statement to Pig:
max(substr(case when url like 'http:%' then '' else url end,1,50))
My pig statement for the above is:
url_group = GROUP data by (uid);
max_substr_url= FOREACH url_group generate SUBSTRING(MAX(((Coalesce(data.url) matches '.*http:%.*') ? '' : Coalesce(data.url))), 0, 49);
For some of the data, the url can be null. So I have written a pig UDF called Coalesce(String) which returns an empty string if the data is either null or empty. If the data is not null or not empty it returns the string back.
The above Pig statement is giving me a lot of trouble. I have tried many different options, but nothing worked. Does anyone have any ideas on how to implement this? Please help me.
Thanks in advance
You will want a nested FOREACH so that you can apply the substring transformation to each tuple in the data bag and then take the MAX of the transformed bag.
url_group = GROUP data by (uid);
B = FOREACH url_group {
-- MAX needs a one column bag
-- Hive's LIKE 'http:%' means "starts with http:", hence 'http:.*'
-- SUBSTRING's stop index is exclusive, so 0, 50 keeps the first 50 characters
transformed = FOREACH data
GENERATE SUBSTRING((Coalesce(url) matches 'http:.*' ? '' : Coalesce(url)), 0, 50);
GENERATE group AS uid, MAX(transformed) ;
}
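To check what the nested FOREACH should produce, here is a short Python sketch of the same per-group logic: blank out URLs that start with "http:", truncate to 50 characters, then take the maximum per uid. It assumes, like the Coalesce UDF described in the question, that a null URL is treated as an empty string; the function name and sample data are hypothetical.

```python
def max_url_prefix(urls):
    """Blank out null or http: URLs, keep the first 50 characters, return the largest."""
    cleaned = [
        "" if (url is None or url.startswith("http:")) else url
        for url in urls
    ]
    return max(url[:50] for url in cleaned)


# Hypothetical sample data keyed by uid
data = {
    "u1": ["http://example.com/a", "ftp://files.example.com/x", None],
    "u2": ["http://example.com/b", None],
}

result = {uid: max_url_prefix(urls) for uid, urls in data.items()}
print(result)
```

A group in which every URL is null or starts with "http:" yields an empty string, which matches what the Hive expression returns when every branch of the CASE produces ''.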