Join two files without any common key - apache-pig

I have 2 input files. The first file has two columns and the second file has three columns, with different values in them, like:
First file:
Type:
(String)|(Integer)
Value:
City1|Value1
City2|Value2
City3|Value3
Second File:
Type:
(String)|(String)|(Integer)
Value:
String1|Text1|Int1
String2|Text2|Int2
String3|Text3|Int3
I need the output as:
Text1|City1|Value1
Text2|City2|Value2
Text3|City3|Value3
I can use any programming tool to get this; if it is not possible in Pig then I can go with other programs as well. Please suggest which one would be better and how to do it.
Please help me with this. Thanks in advance.

Your example is not clear. If the first relation has M rows and the second has N rows, do you expect M*N rows in the result? Or do you expect M=N rows in the result?
Assuming the first (M*N rows), you can use the CROSS operator.
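For illustration (relation names are illustrative):
-- CROSS pairs every tuple of one relation with every tuple of the other (M*N rows)
crossed = CROSS first_relation, second_relation;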
Assuming the second (M=N rows), you can:
a. Enumerate both relations to add a unique row number to each tuple.
b. Then join on that row number so that the 1st rows from both relations join, then the 2nd rows, and so on (see the sketch below).
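A minimal sketch of this approach, assuming Pig 0.11+ (whose RANK operator prepends a sequential row number) and that both files really have the same number of rows; file paths and aliases are illustrative:
cities = LOAD 'first_file' USING PigStorage('|') AS (city:chararray, value:int);
texts = LOAD 'second_file' USING PigStorage('|') AS (str:chararray, txt:chararray, num:int);
r1 = RANK cities; -- prepends rank_cities:long to each tuple
r2 = RANK texts;  -- prepends rank_texts:long to each tuple
joined = JOIN r1 BY rank_cities, r2 BY rank_texts;
result = FOREACH joined GENERATE r2::txt, r1::city, r1::value;
STORE result INTO 'output' USING PigStorage('|');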
Hope this helps.

You can't join without a common key in Pig. Try using the CONCAT function for your use case.
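For illustration, the builtin CONCAT glues two chararray fields into one (relation and field names are illustrative):
-- combine two fields of a single relation into one chararray field
combined = FOREACH some_relation GENERATE CONCAT(field1, field2);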

Related

Compare names of two columns

Hello everyone. Here is my question. In the first table, the person's name is written in two languages, in two columns. In the second table there is a single name column, so names are written in either the first or the second language.
How do I compare these two tables? Does my code work?
... t.datebirth=p.datebirth and (t.name=p.name1 or t.name=p.name2)
Does my code work?
As I understood your question, with the limited information you provided: yes, it works. It checks whether either of the two names in table p equals the name in table t.
You can simplify the logic with in:
t.datebirth = p.datebirth and t.name in (p.name1, p.name2)
This might not be a very efficient approach though. Depending on your use case, you might also want to consider two left joins, each joining on one of the names, and additional conditional logic in the rest of the query. But that cannot be assessed without a more detailed description of your use case.
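For illustration, the two-left-join variant could look like this (a sketch using the table and column names from the fragments above; COALESCE stands in for the extra conditional logic):
SELECT t.*,
       COALESCE(p1.name1, p2.name2) AS matched_name
FROM t
LEFT JOIN p AS p1 ON t.datebirth = p1.datebirth AND t.name = p1.name1
LEFT JOIN p AS p2 ON t.datebirth = p2.datebirth AND t.name = p2.name2;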

How to use a WHERE statement in Postgres with an array of OR combinations?

I'm not sure how to phrase this question, but the premise is that I have a table whose primary key is defined by two columns: row and col. I also want to query for many individual (row, col) pairs, which is where my problem comes into play.
If I had a simple column named id, I would be able to use a clause such as WHERE id=ANY($1), where $1 is an array of integers. However, with a primary key consisting of two columns, I wouldn't be able to do the same.
WHERE row=ANY($1) AND col=ANY($2) gives me a region of what I want, but not the exact set of tuples that I need. Right now I'm generating a template query string with many conditions, such as:
WHERE row=$1 AND col=$2 OR
row=$3 AND col=$4 OR ...
How can I avoid generating this "query template"? I don't think this is a very elegant solution, but it's the solution I have right now. Any insight would be appreciated!
where (row,col) = any(array[(1,2),(3,4)])
or
where (row,col) in ((1,2),(3,4))
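If the goal is to keep the statement parameterized instead of generating SQL, one option (a sketch, assuming your driver can bind two integer arrays; the table name is illustrative) is to unnest parallel arrays, since unnest with two array arguments produces a two-column row set:
-- $1 holds the row values and $2 the col values in matching order
SELECT * FROM mytable
WHERE (row, col) IN (SELECT * FROM unnest($1::int[], $2::int[]));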

pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in db terminology), and for a given value of this identifier we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row, which act as placeholders for the Supervisor entry that's missing when the unique identifier is (2, 1).
Towards this end, I started putting together this Pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
    sorted = ORDER data BY key, value;
    GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems to be just a matter of another projection to get rid of the redundant columns (the first two, which are replicated multiple times - once per key-value pair).
Short of using a UDF, is there a way to accomplish this in Pig using the built-in functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if, for most rows, most of the columns are not filled out, a better solution IMO would be to use a map. The builtin TOMAP UDF combined with a custom UDF to combine maps would enable you to do this.
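For illustration, the map-based direction might start like this (a sketch; CombineMaps is a hypothetical custom UDF that merges a bag of single-entry maps into one map):
data = LOAD 'key_value' USING PigStorage(',') AS (i1:int, i2:int, key:chararray, value:chararray);
pairs = FOREACH data GENERATE i1, i2, TOMAP(key, value) AS kv:map[];
grouped = GROUP pairs BY (i1, i2);
-- CombineMaps (hypothetical): bag of maps -> one merged map per group
result = FOREACH grouped GENERATE FLATTEN(group), CombineMaps(pairs.kv);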
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values, and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect it is no better than organizing your data in some other way.
You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.

Hide Empty columns

I have a table with 75 columns. What is the SQL statement to display only the columns with values in them?
Thanks.
It's true that such a statement doesn't exist (in a SELECT you can use condition filters only for rows, not for columns). But you could try to write a (somewhat tricky) procedure. It must check, using queries, which columns contain at least one non-NULL/non-empty value. Once you have this list of columns, join them into a comma-separated string and compose a query that you can run, returning what you wanted.
EDIT: I thought about it, and I think you can do it with a procedure, but only under one of these conditions:
find a way to retrieve column names dynamically in the procedure, that is, from the metadata (I had never heard about it, but I'm new to procedures; see the sketch below)
or hardcode all column names (losing generality)
You could collect column names inside an array, if the stored procedures of your DBMS support arrays (or write the procedure in a programming language like C), and loop over them, running a SELECT each time to check whether the column is empty*. If it contains at least one value, concatenate its name into a comma-separated string. Finally you can run your query with only the non-empty columns!
As an alternative to a stored procedure, you could write a short program (e.g., in Java), which gives you more flexibility.
*If you check for NULL values it will be simple, but if you check for empty values you will need to handle each column's data type... another array with data types?
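To illustrate the metadata point from the list above: most DBMSs expose column names through the standard INFORMATION_SCHEMA, and COUNT over a column counts only its non-NULL values (table and column names are illustrative):
-- retrieve the column names of a table dynamically
SELECT column_name FROM information_schema.columns WHERE table_name = 'table1';
-- run per column inside the procedure's loop; a result of 0 means the column is empty
SELECT COUNT(column1) AS non_null_values FROM table1;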
I would suggest that you write a SELECT statement and define which COLUMNS you wish to display and then save that QUERY as a VIEW.
This will save you the trouble of typing in the column names every time you wish to run that query.
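For example (column, table, and view names are illustrative):
CREATE VIEW table1_filled AS
SELECT column1, column3, column19
FROM table1;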
As marc_s pointed out in the comments, there is no select statement to hide columns of data.
You could do a pre-parse and dynamically create a statement to do this, but it would be a very inefficient thing to do from a SQL performance perspective. I would strongly advise against what you are trying to do.
A simplified version of this is to just select the relevant columns, which was what I needed personally. A quick look at what we're dealing with in a table:
SELECT * FROM table1 LIMIT 10;
-> shows 20 columns, of which I'm interested in 3. The LIMIT is just there to avoid overflowing the console.
SELECT column1,column3,colum19 FROM table1 WHERE column3='valueX';
It is a bit of a manual filter but it works for what I need.

Read number of columns and their type from query result table (in C)

I use a PostgreSQL database and C to connect to it. With help from dyntest.pgc I can access the number of columns and their (SQL3) types from the result table of a query.
The problem is that when the result table is empty, I can't fetch a row to get this data. Does anyone have a solution for this?
The query can be SELECT 1,2,3, so I think I can't use the INFORMATION_SCHEMA for this because there is no base table.
I'm not familiar with ecpg, but with libpq you should be able to call PQnfields to get the number of fields and then call various PQf* routines (like PQftype and PQfname) to get detailed info. Those functions take a PGresult, which you have even if there are no rows.
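A minimal libpq sketch (the connection string and query are illustrative; compile with cc demo.c -lpq):
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test"); /* illustrative conninfo */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return EXIT_FAILURE;
    }

    /* A query that yields zero rows but still carries a result description. */
    PGresult *res = PQexec(conn, "SELECT 1 AS a, 'x'::text AS b WHERE false");
    if (PQresultStatus(res) != PGRES_TUPLES_OK) {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
    } else {
        int n = PQnfields(res); /* column count, available even with zero rows */
        for (int i = 0; i < n; i++)
            printf("column %d: name=%s type-oid=%u\n",
                   i, PQfname(res, i), (unsigned) PQftype(res, i));
    }

    PQclear(res);
    PQfinish(conn);
    return EXIT_SUCCESS;
}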
The problem is that when the result table is empty, I can't fetch a row to get this data. Does anyone have a solution for this?
I am not sure I really get what you want, but it seems the answer is in the question. If the result table is empty, there are no rows...
The only solution here seems to be to wait for a non-empty result table, and then get the needed information.