How to change a particular column value for a certain number of rows in Pig Latin

I have a Pig relation with, say, 10000 rows. Is there any quick way to change the value of a certain column for, say, the first 1000 rows?

Since some info is missing, I will make a few assumptions and then offer a solution:
by "first 1000 rows" you mean that you can order the records using some column
you wish to change the value of column $1 in the first 1000 records when ordering by column $2
The following code snippet will do what you asked for (here the new value is 3*$1, i.e. the old value tripled, as an example):
a = load ...  -- load the data (path and schema elided in the question)
b = rank a by $2;  -- RANK prepends a rank_a column as the new $0, so every original field shifts right by one
c = foreach b generate $1, (rank_a <= 1000 ? 3*$2 : $2), $3..;  -- drop the rank, triple the original $1 (now $2) for the first 1000 rows, pass the remaining fields through

Use the FOREACH and LIMIT operators to achieve the effect.
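A minimal sketch of that idea (assuming, as above, ordering by $2 and tripling $1 as the example change); note that LIMIT only yields the transformed first 1000 rows, so the untouched remainder would still have to be recovered, e.g. with RANK as in the previous answer:
ordered = order a by $2;  -- impose the ordering that defines "first"
first1000 = limit ordered 1000;  -- keep only the first 1000 rows
changed = foreach first1000 generate $0, 3*$1, $2..;  -- rewrite column $1, pass everything else through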

Related

Removing more than 2 duplicates from a CSV file

I have found the following script to remove duplicates:
awk -F, '!x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
When it finds duplicate records, instead of deleting all the duplicates and keeping only the first record, it would be amazing if it could keep the first 2 or 3 records and remove the rest. So basically: allow the original and one duplicate, but delete the entire row of anything beyond the first one or two duplicates.
How do I adjust it so that it keeps the original record and the first duplicate, and deletes the entire rows of anything beyond the first duplicate?
You can use awk like this:
awk -F, '++x[$7] <= 2' business-records.csv > business-records-deduped.csv
This keeps at most 2 records for each distinct value of the 7th column and deletes any further dupes, as you desire.
I propose the following minimal amelioration of your code:
awk -F, '2>x[$7]++' 'business-records.csv' > 'business-records-deduped.csv'
Explanation: ++ is a post-increment operation, so the execution order might be somewhat counter-intuitive.
x[$7] gets the value from array x for the key being the content of the 7th field; if it is not present, 0 is assumed
2> is the test deciding about printing; if this condition holds, the line is printed
++ increases the value inside array x, so the next time you encounter the same 7th-field content, the value will be bigger by 1
Observe that the sole thing altered is the test: for non-negative integers, ! is true for zero and false for values above 0.
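As a quick sanity check, a hypothetical stream whose 7th field runs A, A, A, B passes each key through at most twice with either variant:
$ printf 'r1,,,,,,A\nr2,,,,,,A\nr3,,,,,,A\nr4,,,,,,B\n' | awk -F, '2>x[$7]++'
r1,,,,,,A
r2,,,,,,A
r4,,,,,,B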

Building a new dataset

I want to take data from one set and enter it into another empty set.
So, for example, I want to do something like:
if (data[i, x] > 9) {
  new_data$House[y, x] <- data[i, 2]
}
but I want to do it over and over, creating new rows in new_data.
How do I keep adding data to new_data and overriding/saving the new row?
Essentially, I just want to know how to "grow" an empty data set.
Please ignore any errors in the code, it is just an example and I am still working on other details.
Thanks
If you are using the R language, I presume you are looking for rbind:
new_data <- NULL  # define your new (empty) dataset
for (i in 1:nrow(data)) {  # loop over the rows of data
  if (data[i, x] > 9) {  # condition: column x of row i exceeds 9
    new_data <- rbind(new_data, data[i, 2:6])  # append columns 2 to 6 of row i
  }
}
At the end, new_data will contain as many rows as satisfy the if statement, and each row will hold the values extracted from columns 2 to 6.
If that is what you are looking for, there are various ways to do it without a for loop, for example:
new_data <- data[data[, x] > 9, 2:6]
If this answer is not satisfying, please provide more details in your question and include a reproducible example of your data and the expected output.
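A self-contained toy example (the data frame and the column index x are invented for illustration), showing that the vectorized filter picks exactly the rows where column x exceeds 9:
# hypothetical data: 5 rows, 6 columns; we filter on column 1
data <- data.frame(v1 = c(3, 12, 8, 15, 20),
                   v2 = letters[1:5], v3 = 1:5,
                   v4 = 6:10, v5 = 11:15, v6 = 16:20)
x <- 1                                # column to test
new_data <- data[data[, x] > 9, 2:6]  # rows 2, 4 and 5; columns 2 to 6
print(new_data)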

Is there any option in Apache Pig to print the first 40 of 50 fields? I require something like the range $0-$39

I have 50 fields. Is there any option in Pig to print the first 40 fields? I require something like the range $0-$39.
I don't want to specify each and every field like $0, $1, $2, etc.
Listing every column is acceptable when there are only a few, but what do you do when there is a huge number of columns?
You can use the .. notation.
First 40 fields
B = FOREACH A GENERATE $0..$39;
All fields
B = FOREACH A GENERATE $0..;
Multiple ranges, for example fields 1-10, 15-20, and 25-50
B = FOREACH A GENERATE $0..$9,$14..$19,$24..;
Assorted fields 22, 33-44, and 46
B = FOREACH A GENERATE $21,$32..$43,$45;
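For completeness, a minimal end-to-end script around the first case (the file name and the comma delimiter are assumptions):
A = LOAD 'data.csv' USING PigStorage(',');  -- 50 comma-separated fields per row
B = FOREACH A GENERATE $0..$39;             -- keep only the first 40 fields
DUMP B;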

Matching several combinations of columns in a table

I am reading a table whose values have to be validated before we process it further. The valid values are stored in another table that we match our main table against. The validation criterion is to match several columns, as follows:
Table 1 (the main data we read in)
Name --- Unit --- Age --- Address --- Nationality
The above are the column names we read from the table; the other table contains the valid values for these columns. When looking for valid values in our main table, we have to consider a combination of columns in the main data table, for example Name --- Unit --- Age. If all the values in a particular row for that column combination match a row in the other table, we keep the row; otherwise we delete it.
How do I address this issue with NumPy?
Thanks
You can just loop through the rows. An easy/simple way would be:
dummy_df = main_df.copy()  # work on a copy, since we delete rows and want the original df saved
relevant_columns = ['Name', 'Unit', 'Age']  # the column combination to validate on; extend as needed
for indx in dummy_df.index:
    # count rows of the valid-values table that match this row on every relevant column;
    # if none match, the row is invalid and gets dropped
    row_vals = dummy_df.loc[indx, relevant_columns].to_numpy()
    full_matches = ((row_vals == valid_df[relevant_columns].to_numpy()).sum(axis=1) == len(relevant_columns)).sum()
    if full_matches == 0:
        dummy_df = dummy_df.drop(indx)
P.S.: I am assuming the data is in pandas DataFrame format.
Hope it helps :)
P.P.S.: if the headers/columns have different names in the two tables, this won't work.
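A loop-free alternative is an inner merge: it keeps exactly the main-table rows whose column combination appears in the valid-values table. A sketch, assuming two pandas DataFrames named main_df and valid_df (the toy data is invented):
import pandas as pd

# hypothetical tables: main_df is the data to validate, valid_df holds the allowed combinations
main_df = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Unit': [1, 2, 3],
                        'Age': [30, 40, 50], 'Address': ['x', 'y', 'z'],
                        'Nationality': ['US', 'FR', 'DE']})
valid_df = pd.DataFrame({'Name': ['a', 'c'], 'Unit': [1, 3], 'Age': [30, 50]})

cols = ['Name', 'Unit', 'Age']  # the column combination to validate on
# inner merge keeps only rows of main_df that match some row of valid_df on cols
kept = main_df.merge(valid_df[cols].drop_duplicates(), on=cols, how='inner')
print(kept)  # the rows for 'a' and 'c' survive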

AWK: Ignore lines grouped by a unique value, conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to first group lines by a unique value in the first field and accumulate the occurrences of a specific value in another field within each group of lines. If the number of occurrences doesn't meet a self-defined threshold, the lines in the group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that because the unique values 222 and 444 in the first field don't have at least one P (the threshold can be any desired number) in the third field, the lines corresponding to 222 and 444 are ignored.
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. As a result, a few lines will not end up in the resulting split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field and a running count c which we use later. If the third field is P, a key is set in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for that value, print the stored lines from a.
You mention a threshold number of entries in your question. If by that you mean that there must be N occurrences of "P" for a group's lines to be printed, you could change {p[$1]} to {++p[$1]} and change if(i in p) to if(p[i]>=N) in the END block.
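Put together, a parameterized variant could look like this (N is the threshold; with N=2 and the sample input above, only the 111 and 333 groups survive, since they are the only ones with at least two P lines):
$ awk -F, -v N=2 '{a[$1,++c[$1]]=$0} $3=="P"{++p[$1]} END{for(i in c) if(p[i]>=N) for(j=1;j<=c[i];++j) print a[i,j]}' file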