PIG generate data using different load variables - apache-pig

I have two files and I want to generate output that combines columns from both of them.
Here is my problem, with an example:
I have two files, abc.txt (col1, col2) and xyz.txt (col3, col4). The number of records differs between them: say abc.txt has 1000 records and xyz.txt has 100.
I want to store output in a file such that I get col1 and col2 from abc.txt and col3 from xyz.txt. Because xyz.txt has fewer records than abc.txt, the col3 values should repeat, either randomly or in the same sequence as in the input file; either is fine.
Input
abc.txt      xyz.txt
col1 col2    col3 col4
1    A       4    X
2    B       5    Y
3    C       6    Z
4    D
5    D
6    F
7    A
A = LOAD '/user/abc.txt' Using PigStorage('|');
B = LOAD '/user/xyz.txt' Using PigStorage('|');
C = FOREACH A GENERATE A.$0,A.$1,B.$0;
Output
col1 col2 col3
1 A 4
2 B 5
3 C 6
4 D 5
5 D 4
6 F 4
7 A 6
Is it possible to do this using PIG?

GENERATE is not an operator in Pig on its own, so you cannot use it by itself to generate data. Pig provides FOREACH for iterating over a relation, and it works on one relation only. It looks to me like you cannot generate the data as specified in the question unless you perform some sort of JOIN on the data.
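To make the join idea concrete outside Pig, here is a minimal Python sketch (not Pig Latin) of the pairing the question asks for: each row of the longer input is matched with the next value from the shorter input, restarting from the top when the shorter input runs out. The sample rows come from the question; the function name `pair_with_cycle` is made up for this illustration.

```python
from itertools import cycle


def pair_with_cycle(long_rows, short_values):
    """Attach one value from short_values to each row, restarting when exhausted."""
    if not short_values:
        raise ValueError("short_values must not be empty")
    # zip stops at the end of long_rows; cycle repeats short_values indefinitely.
    return [row + (value,) for row, value in zip(long_rows, cycle(short_values))]


# Sample data from the question: (col1, col2) from abc.txt, col3 from xyz.txt.
abc = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "D"), (6, "F"), (7, "A")]
xyz_col3 = [4, 5, 6]

result = pair_with_cycle(abc, xyz_col3)
for row in result:
    print(*row)
```

This repeats col3 in input order rather than randomly, which the question says is acceptable. Doing the equivalent in Pig would presumably mean giving both relations a row number and joining on a key derived from it.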

Related

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like a new dataframe produced with the result of multiplying each row of the second dataframe against each row of the first dataframe, where it multiplies the value of Name1 against the value in the column of dataframe1 which matches the Name4 value of dataframe2.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
import pandas as pd
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First I changed the data in df1 at this point so this new example will turn out better.
Okay, so from those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding factor from df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. For each row r in df2, you want to take the column c of df1 named by r's col4 value and multiply every element of c by the first element of r (col1). The rest of the row doesn't change and is repeated once per element of c.
I was thinking there might be a way to join df1.transpose() and df2, but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    repeated_rows = pd.concat([row]*len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
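If the row-by-row `stretch` approach turns out to be slow on larger frames, the same result can probably be built without `iterrows`. The sketch below is one possible alternative, not the method from the answer above: it repeats each row of `df2` `len(df1)` times and lines up the matching `df1` column for each repeat. It assumes the sample `df1` and `df2` from the question's first edit and that every `col4` label is a column of `df1`.

```python
import numpy as np
import pandas as pd

# Sample data from the question's first edit.
d1 = {'D': [1, 2, 3, 4, 5, 6], 'W': [2, 2, 2, 2, 2, 2],
      'L': [6, 5, 4, 3, 2, 1], 'S': [1, 2, 3, 4, 5, 6]}
d2 = {'col1': [3, 2, 7, 4, 5, 6], 'col2': [2, 2, 2, 2, 3, 4],
      'col3': ['data'] * 6, 'col4': ['D', 'L', 'D', 'W', 'S', 'S']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)

n = len(df1)
# Repeat every row of df2 n times, keeping the original row order.
out = df2.loc[df2.index.repeat(n)].reset_index(drop=True)
# For each df2 row, take the df1 column named by col4 and concatenate in row order.
factors = np.concatenate([df1[label].to_numpy() for label in df2['col4']])
# Multiply the repeated col1 values element-wise by the aligned factors.
out['col1'] = out['col1'].to_numpy() * factors
print(out.head(12))
```

The list comprehension still loops over the rows of `df2`, but each step is only a column lookup, and the multiplication itself runs once over the whole array.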

Dynamic transpose of rows to column without pivot (Number of rows are not fixed all the time)

I have a table like this:
a 1
a 2
b 1
b 3
b 2
b 4
I want output like this:
1 2 3 4
a a
b b b b
The number of rows in the output may vary.
Pivoting is not working because this is in Exasol, and CASE can't work because the values are dynamic.

How do I get the values of another column that correspond to selected values in specific columns?

I am trying to get the rows of a table that correspond to selected indexes. For example, I have an xls file with several columns of data. Currently my code searches two selected columns and finds the indexes of the matching values; now I want to look up the elements of another column in those same rows.
Let A B C D E F G be the column names, each with 1000s of rows of numbers,
like
A B C D E F G
1 3 4 5 6 3 3
3 4 5 6 3 2 7
.............
4 7 3 2 5 3 2
So currently my code searches two specific columns (say B and F) for values that fall within some range. Now I want to find the column A values in the rows where those matches occur.
B F A
3 4 5
3 5 3
7 7 3
5 4 6
...
like this
This is my current code VI
I hope we've finally gotten to the bottom of it. How about this one?

Python Pandas: LabelEncoding fitting unknown variables

Hi I have a dataframe full of strings and I want to encode these strings and store their corresponding codes.
I want to produce these codes from one column and apply them to another column.
When I apply these codes to another column that contains a string I haven't seen in my training column, I want to create another unique value for that string.
I have tried the LabelEncoder, but it raises an error on previously unseen strings.
For example, I have a dataframe:
col1 col2
a a
b b
c e
d f
After training LabelEncoding on first column I get something like this:
col1 col2
1 a
2 b
3 e
4 f
After applying the created codes to the second column, I want to have something like this:
col1 col2
1 1
2 2
3 5
4 6
What is the easiest way to do this? Thank you.
Created df dataframe by copying sample from OP's post as follows.
df=pd.read_clipboard()
Its value will be as follows when we print it:
col1 col2
0 a a
1 b b
2 c e
3 d f
Could you please try the following? I have listed only the first 6 letters here; you can add the rest if they appear in your actual input.
dict1 = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6}
df.applymap(lambda s: dict1.get(s) if s in dict1 else s)
Output will be as follows.
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
You could do the encoding yourself using pd.factorize:
v, k = pd.factorize(sorted(df.stack().unique()))
m = dict(zip(k.tolist(), (v+1).tolist()))
df.replace(m)
Output:
col1 col2
0 1 1
1 2 2
2 3 5
3 4 6
I think the real trick is to stack col1 and col2 and then encode the values of both columns as one list.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack())
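Fitting on the stacked columns only works when every column is available up front. If the second column arrives later, one alternative is to keep a running mapping and give each unseen string the next free code. The sketch below uses a plain dictionary rather than scikit-learn's `LabelEncoder`; the helper name `encode` is hypothetical, and the sample values are the ones from the question.

```python
def encode(values, codes):
    """Return the code for each value, adding unseen values to codes as they appear.

    codes is a dict that is updated in place, so later calls reuse earlier codes.
    New labels get the next integer after the current largest code (starting at 1).
    """
    return [codes.setdefault(value, len(codes) + 1) for value in values]


codes = {}
col1 = ["a", "b", "c", "d"]
col2 = ["a", "b", "e", "f"]

col1_codes = encode(col1, codes)  # "trains" the mapping on col1
col2_codes = encode(col2, codes)  # reuses known codes, extends for "e" and "f"

print(col1_codes)
print(col2_codes)
print(codes)
```

Because the mapping is extended in the order values are first seen, the codes depend on processing order, unlike `LabelEncoder`, which sorts its classes.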

How to match already-calculated means to the original data set?

I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category __ Average
A ________ a
B ________ b
C ________ c
etc...
Now I want to take these averages and add them as another column to my original data.
So, I want it to look something like this:
Categories _____ Averages
B _____________ b
A______________a
B______________b
C______________c
B______________b
C______________c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- lapply(split(data$Value, data$Category), mean)
$A
[1] 2
$B
[1] 5.5
$C
[1] 8.5
> data$Averages <- avg[data$Category]
> data
Category Value Averages
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
You can use plyr, data.table, etc. more efficiently for larger datasets.