How to apply global MAX() in pig - apache-pig

I am new to Pig scripting.
I have a dataset as follows:
name | age
-------+----
Ashis | 60
Arun | 22
Nirmal | 48
Ram | 67
Amar | 35
How can I get the record with maximum age using Pig Scripting?
My Output should be
Ram,67

You need to order your data by age in descending order and limit it to 1 to get the record with the maximum age. Like so:
inputData = LOAD 'path' USING PigStorage('\t') AS (name:chararray, age:long);
sortedInput = ORDER inputData BY age DESC;
topRecord = LIMIT sortedInput 1;
DUMP topRecord;
It is worth mentioning that this is not an operation well suited to map-reduce (via Pig here), since ORDER and LIMIT don't make good use of parallelism and your job will be bottlenecked by a single reducer.
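If the single-reducer sort ever becomes a bottleneck, an alternative sketch (reusing inputData from above; note that ties at the maximum age would all be returned) is to compute the global maximum with GROUP ... ALL and join it back:
grouped   = GROUP inputData ALL;                       -- one group holding every record
maxAge    = FOREACH grouped GENERATE MAX(inputData.age) AS max_age;
topRecord = JOIN inputData BY age, maxAge BY max_age;  -- keep the record(s) whose age equals the max
result    = FOREACH topRecord GENERATE inputData::name, inputData::age;
DUMP result;
The ALL group still funnels into one reducer, but MAX is algebraic, so combiners pre-aggregate on the map side and that reducer only sees one value per map task; the JOIN then runs as a regular parallel job.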

Related

django database design when you will have too many rows

I have a django web app with a postgres db; the general operation is that every day I have an array of values that needs to be stored in one of the tables.
There is no foreseeable need to query the individual values of the array, but I need to be able to plot the values for a specific day.
The problem is that this array is pretty big: if I were to store it one value per row, I'd have 60 million rows per year, but if I collapse each array into a blob object, I'd have 60 thousand rows per year.
Is it a good decision to use a blob object to reduce table size when you do not need to query the individual values?
Here are the two options:
option1: keeping all
group(foreignkey)| parent(foreignkey) | pos(int) | length(int)
A | B | 232 | 45
A | B | 233 | 45
A | B | 234 | 45
A | B | 233 | 46
...
option2: collapsing the array into a blob:
group(fk)| parent(fk) | mean_len(float)| values(blob)
A | B | 45 |[(pos=232, len=45),...]
...
So I do NOT want to query pos or length, but I do want to query group or parent.
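For concreteness, option 2 as a Django model might look roughly like the sketch below (model and field names are placeholders I'm assuming, not the real app's):
from django.db import models

class ValueBlob(models.Model):
    # one row per (group, parent) per day instead of one row per (pos, length) pair
    group = models.ForeignKey('Group', on_delete=models.CASCADE)
    parent = models.ForeignKey('Parent', on_delete=models.CASCADE)
    mean_len = models.FloatField()
    # serialized array of (pos, length) pairs, e.g. a packed or pickled bytes object
    values = models.BinaryField()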
An example of read query that I'm talking about is:
SELECT * FROM "mytable"
LEFT OUTER JOIN "group"
ON ( "group"."id" = "grouptable"."id" )
ORDER BY "pos" DESC LIMIT 100
which is a typical django admin list_view page main query.
I tried loading the data and displaying the table in the django admin page without doing any complex query (just a read query).
When I get past 1.5 million rows, the admin page freezes. All it takes is a count query on that table to crash the app, so I should definitely either keep the data as a blob or not keep it in the db at all and use the filesystem instead.
I want to emphasize that I've used django 1.8 as my test bench so this is not a postgres evaluation but rather a system evaluation with django admin and postgres.

Single record buffering in SAP ABAP

My table is stud.
+-----+------+-------+
| no | name | grade |
+-----+------+-------+
| 101 | naga | A |
| 102 | raj | A |
| 103 | john | A |
+-----+------+-------+
The query I'm using is:
SELECT * FROM stud WHERE no = 101 AND grade = 'A'.
If I am using single record buffering, how much data is being stored in the buffer area?
This query doesn't do anything: there is no INTO clause, meaning it won't store anything that is selected.
You are probably looking to do something like this....
SELECT * FROM stud INTO wa_stud WHERE no = 101 AND grade = 'A'.
  "processing of each single row is performed here
ENDSELECT.
or perhaps something like this, where only one row (the first row, ordered by primary key) is selected...
SELECT SINGLE * FROM stud INTO wa_stud WHERE no = 101 AND grade = 'A'.
or perhaps you want everything brought into an internal table, meaning no and grade do not make up the full primary key.
SELECT * FROM stud INTO TABLE it_stud WHERE no = 101 AND grade = 'A'.
This is from the ABAP keyword documentation in SE38:
SAP Buffer - Single Record Buffering
Only those rows in the table are buffered that are actually accessed.
This requires less space in the buffer than when using generic or full
buffering. On the other hand, more administration work is required and
significantly more direct database accesses.
So, since your query returns a single record (based on the data you displayed), it should just fetch that one row and hold it in the buffer.
I'd suggest looking at the SAP help and Google. Also have a look at SELECT SINGLE and incompletely specified keys: there used to be a problem with the buffer being bypassed in some situations, so have a read for reference.
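To illustrate that last point, a small sketch (assuming the primary key of stud is just the field no): only a SELECT SINGLE that specifies the complete key with '=' can be served from the single-record buffer; with an incomplete key the buffer is bypassed and the database is read directly.
DATA wa_stud TYPE stud.

SELECT SINGLE * FROM stud INTO wa_stud
  WHERE no = 101.                    "full key specified -> single-record buffer can be used

SELECT SINGLE * FROM stud INTO wa_stud
  WHERE grade = 'A'.                 "incomplete key -> buffer bypassed, direct database access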

Poor performance on Amazon Redshift queries based on VARCHAR size

I'm building an Amazon Redshift data warehouse, and experiencing unexpected performance impacts based on the defined size of the VARCHAR column. Details are as follows. Three of my columns are shown from pg_table_def:
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+-----------+-----------------+-----------------------------+-----------+---------+---------+---------
public | logs | log_timestamp | timestamp without time zone | delta32k | f | 1 | t
public | logs | event | character varying(256) | lzo | f | 0 | f
public | logs | message | character varying(65535) | lzo | f | 0 | f
I've recently run Vacuum and Analyze, I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.
Query 1:
For instance, the following query takes about 3 seconds:
select log_timestamp from logs order by log_timestamp desc limit 5;
Query 2:
A similar query asking for more data runs in 8 seconds:
select log_timestamp, event from logs order by log_timestamp desc limit 5;
Query 3:
However, this query, very similar to the previous, takes 8 minutes to run!
select log_timestamp, message from logs order by log_timestamp desc limit 5;
Query 4:
Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):
select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;
The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10). The average length of the event field is 5 characters (std_dev 2). The only distinction I can really see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude effect on the time a simple query takes to return!
Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand any subtle or not-so-subtle effects of my table design.
Thanks!
Dave
Redshift is a "true columnar" database and only reads columns that are specified in your query. So, when you specify 2 small columns, only those 2 columns have to be read at all. However when you add in the 3rd large column then the work that Redshift has to do dramatically increases.
This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.) where the entire row is stored together. In a row store adding/removing query columns does not make much difference in response time because the database has to read the whole row anyway.
Finally, the reason your last query is very fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
The LIMIT clause doesn't reduce the work that has to be done, because you've told Redshift that it must first order everything by log_timestamp descending. The problem is that the ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big it's slow.
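If you want to see how much disk each column actually occupies, something along these lines against the Redshift system tables stv_blocklist and stv_tbl_perm should show the number of 1 MB blocks per column (columns are reported by ordinal position, not by name):
select b.col, count(*) as blocks
from stv_blocklist b
join stv_tbl_perm p
  on b.tbl = p.id
 and b.slice = p.slice
where p.name = 'logs'
group by b.col
order by b.col;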
Out of curiosity, how long does this take?
select log_timestamp, message
from logs l join
     (select min(log_timestamp) as log_timestamp
      from (select log_timestamp
            from logs
            order by log_timestamp desc
            limit 5
           ) lt
     ) lt
     on l.log_timestamp >= lt.log_timestamp;

Is there a way to transpose data in Hive?

Can data in Hive be transposed? As in, the rows become columns and columns are the rows? If there is no function straight up, is there a way to do it in a couple of steps?
I have a table like this:
| ID | Names | Proc1 | Proc2 | Proc3 |
| 1 | A1 | x | b | f |
| 2 | B1 | y | c | g |
| 3 | C1 | z | d | h |
| 4 | D1 | a | e | i |
I want it to be like this:
| A1 | B1 | C1 | D1 |
| x | y | z | a |
| b | c | d | e |
| f | g | h | i |
I have been looking up other related questions and they all mention using lateral views and explode, but is there a way to selectively choose columns for lateral(ly) view(ing) and explod(ing)?
Also, what might be the rough process to achieve what I would like to do? Please help me out. Thanks!
Edit: I have been reading this link: https://cwiki.apache.org/Hive/languagemanual-lateralview.html and it shows me half of what I want to achieve. The first example in the link is basically what I'd like, except that I don't want the rows to repeat and want them as column names. Any ideas on how to get the data into a form such that, if I do an explode, it results in my desired output? Or the other way around, i.e. explode first, leading to another step that then leads to my desired output table. Thanks again!
I don't know of a way to do this in Hive out of the box, sorry. You get close with explode etc., but I don't think it can get the job done.
Overall, conceptually, I think it's hard to do a transpose without knowing in advance what the columns of the destination table are going to be. This is true in particular for Hive, because the metadata about how many columns there are, their types, their names, etc. lives in a database (the metastore). And it's true in general, because not knowing the columns beforehand would require some sort of in-memory holding of data (ok, sure, with spills), and users may need to be careful about not overflowing the memory and such (just like dynamic partitioning in Hive).
In any case, long story short, if you know the columns of the destination table beforehand, life is good. There isn't a single built-in command in Hive for this, to the best of my knowledge, but you could use a bunch of IF clauses and CASE statements (ugly, I know, but that's how I have done the same in the past) in the SELECT clause to transpose the data. Something along the lines of SQL - How to transpose?
Do let me know how it goes!
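To make the CASE idea concrete, here is a rough sketch, assuming the destination columns A1..D1 are known up front and the source table is called input_table: un-pivot the three proc columns into rows first, then pivot back by Names with conditional aggregation.
SELECT
  MAX(CASE WHEN Names = 'A1' THEN val END) AS A1,
  MAX(CASE WHEN Names = 'B1' THEN val END) AS B1,
  MAX(CASE WHEN Names = 'C1' THEN val END) AS C1,
  MAX(CASE WHEN Names = 'D1' THEN val END) AS D1
FROM (
  SELECT Names, 1 AS proc_no, Proc1 AS val FROM input_table
  UNION ALL
  SELECT Names, 2 AS proc_no, Proc2 AS val FROM input_table
  UNION ALL
  SELECT Names, 3 AS proc_no, Proc3 AS val FROM input_table
) unpivoted
GROUP BY proc_no;
Each output row corresponds to one proc column; add proc_no to the SELECT (and an ORDER BY) if the proc1/proc2/proc3 row order matters.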
As Mark pointed out, there's no easy way to do this in Hive, since PIVOT isn't available in Hive and you may also encounter issues when trying to use the case/when 'trick', since you have multiple values (proc1, proc2, proc3).
For testing purposes, you may try a different approach:
select v, o1, o2, o3 from (
  select k,
         v,
         LEAD(v,3) OVER() as o1,
         LEAD(v,6) OVER() as o2,
         LEAD(v,9) OVER() as o3
  from (select transform(name,proc1,proc2,proc3) using 'python strm.py' AS (k, v)
        from input_table) q1
) q2 where k = 'A1';
where strm.py:
import sys

for line in sys.stdin:
    line = line.strip()
    name, proc1, proc2, proc3 = line.split('\t')
    print '%s\t%s' % (name, proc1)
    print '%s\t%s' % (name, proc2)
    print '%s\t%s' % (name, proc3)
The trick here is to use a python script in the map phase which emits each column of a row as a distinct row. Then every third row (since we have 3 proc columns) forms a resulting row, which we build by peeking forward with LEAD.
However, while this query does the job, it has the drawback that as the input grows you need to peek further and further ahead, which may lead to a performance hit. Anyway, you may evaluate it for testing purposes.

Is this SELECT and ORDER BY query the most efficient way I could have done it?

In my journey to learn SQL, I'm writing various queries on an old database of mine, but getting into more complex things, I want to make sure I'm not over engineering this. I have a table Agent, with different agents offering different prices for cities. Multiple agents can serve the same city, each with different prices. I wanted to run a query which would return the total cost of hiring all of the agents for any given city, ordered by the most expensive.
WITH orderedPrices AS (
    SELECT SUM(agtFMPrice)
           OVER (PARTITION BY agtCity) AS IX
    FROM Agent
)
SELECT IX
FROM orderedPrices
ORDER BY IX DESC
I found that without the orderedPrices CTE it wouldn't order the prices (I assume because SUM ... OVER is a window/aggregate function, or whatever they're called). Did I do this in the best way I could have, or could it be simplified?
Also, if you're feeling particularly bored, go ahead and give me a new assignment/query to do on this table. I could use the practice.
What you have written in English doesn't seem to quite match what you have written in SQL.
English:
- One record per City
- One field per record, showing the total cost of all associated agents
SQL:
- One record per Agent
- One field per record, showing the total cost of all agents in the same city
AgentID | agtCity | agtFMPrice
---------+---------+------------
1 | 1 | 10
2 | 1 | 20
3 | 2 | 30
4 | 2 | 10
5 | 2 | 25
Results of SQL version    Results of English version
----------------------    --------------------------
30                        30
30                        65
65
65
65
If you want the English version, I'd do this...
SELECT
    agtCity,
    SUM(agtFMPrice) AS IX
FROM
    Agent
GROUP BY
    agtCity
ORDER BY
    SUM(agtFMPrice) DESC
To assist performance, the table could (should?) also have an index on (agtCity).
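For example (index name is arbitrary, and exact syntax varies slightly between engines):
CREATE INDEX IX_Agent_agtCity ON Agent (agtCity);
-- optionally include agtFMPrice as well, e.g. ON Agent (agtCity, agtFMPrice),
-- so the SUM can be computed from the index alone (a covering index)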