Hadoop Pig Query - apache-pig

Whenever I try to run this query, it returns zero records even though I know there are records that meet these criteria.
I want to find records where column 10 is equal to 0, 1, 2, or 3 and column 16 is equal to 4, 5, 6, or 7.
Pig Script:
newdata = FILTER data BY ((((($10==0) OR ($10==1)
OR ($10==2)
OR ($10==3)))) AND
(((($16==4) OR ($16==5)
OR ($16==6)
OR ($16==7)))));

Column indexes start from 0, so the 10th column is $9 and the 16th column is $15. Try this:
newdata = FILTER data BY ((($9==0) OR
($9==1) OR
($9==2) OR
($9==3))
AND
(($15==4) OR
($15==5) OR
($15==6) OR
($15==7)));


Code to obtain the maximum value out of a list and the three consecutive values after the maximum value

I am writing a model to calculate the maximum production capacity for a machine in a year based on 15-min data. As the maximum capacity is not the sum of the required capacity for all 15-min intervals over the year, I want to write a piece of code that determines the maximum value in the list and then adds this maximum value and the next three consecutive values after it to a new variable. A simplified example would be:
fifteen_min_capacity = [10, 12, 3, 4, 8, 12, 10, 9, 2, 10, 4, 3, 15, 8, 9, 3, 4, 10]
The piece of code I want to write would determine the maximum capacity in this list (15) and then add this capacity plus the three consecutive ones (8, 9, 3) to a new variable:
hourly_capacity = 35
Does anyone know the code that would give this output?
I have tried using max(), sum(), and a combination of both, but I have not been able to get working code. Any help would be much appreciated!
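One possible approach, sketched here under two assumptions that the question does not settle: if the maximum occurs more than once, the first occurrence is used, and if the maximum sits within the last three entries, the window is simply cut short at the end of the list. The function name `peak_hour_capacity` is made up for illustration.

```python
fifteen_min_capacity = [10, 12, 3, 4, 8, 12, 10, 9, 2, 10, 4, 3, 15, 8, 9, 3, 4, 10]


def peak_hour_capacity(values, window=4):
    """Sum the maximum value and the (window - 1) values that follow it."""
    # list.index returns the position of the FIRST occurrence of the maximum.
    peak_index = values.index(max(values))
    # Slicing past the end of a list is safe; it just yields fewer items.
    return sum(values[peak_index:peak_index + window])


hourly_capacity = peak_hour_capacity(fifteen_min_capacity)
print(hourly_capacity)  # 35 (15 + 8 + 9 + 3)
```

If what is actually needed is the largest sum over any four consecutive intervals, that is a different calculation (a sliding-window maximum) and can give a different result.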

How can I use IF and ELSE IF in looping and display 2 statements in GAMS?

I am a beginner with this program. I am trying to improve this loop so that it meets the following condition:
After computing CUTI(k) = CUTI(k)-4:
1) If the resulting CUTI(k) value is greater than 0, print that value.
2) If the resulting CUTI(k) value is less than 0, print that value plus 12, followed by a "*" in the display, e.g. 10*, 9*
I am not sure whether this loop is correct and sufficient for this condition. I look forward to your recommendations. :)
set k /1*20/;
parameter
CUTI(k)/1 6, 2 2, 3 8, 4 5, 5 1, 6 3, 7 7, 8 8, 9 6, 10 8,11 1, 12 2, 13 4, 14 7,
15 5, 16 2, 17 8, 18 9, 19 2, 20 10/;
loop(k,
if(CUTI(k)-4 > 0,
CUTI(k) = CUTI(k)-4;
else
CUTI(k) = (CUTI(k)-4)+12 ;
)
);
display CUTI;
Your logic looks correct. However, instead of the loop/if/else you could simplify this to one assignment:
CUTI(k) = CUTI(k)-4+12$(CUTI(k)<=4);
However, modifying the display statement by adding a * to some elements is not possible. If you need to distinguish the cases in such a statement, you might assign the values to two different parameters and display them individually.
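As a cross-check outside GAMS, the following Python sketch compares the loop/if/else logic with the one-line assignment on the same data and builds the starred labels the question asks for. It is only an illustration of the arithmetic, not GAMS code; the function names are invented here.

```python
# Same data as the CUTI(k) parameter in the question.
cuti = {1: 6, 2: 2, 3: 8, 4: 5, 5: 1, 6: 3, 7: 7, 8: 8, 9: 6, 10: 8,
        11: 1, 12: 2, 13: 4, 14: 7, 15: 5, 16: 2, 17: 8, 18: 9, 19: 2, 20: 10}


def shift_loop_style(value):
    """Mirror of the loop: subtract 4, and add 12 when the result is not positive."""
    shifted = value - 4
    return shifted if shifted > 0 else shifted + 12


def shift_single_expression(value):
    """Mirror of CUTI(k)-4+12$(CUTI(k)<=4)."""
    return value - 4 + (12 if value <= 4 else 0)


# Both formulations agree for every element.
assert all(shift_loop_style(v) == shift_single_expression(v) for v in cuti.values())

# Append "*" to the values that were wrapped by adding 12.
labels = {k: f"{shift_loop_style(v)}{'*' if v <= 4 else ''}" for k, v in cuti.items()}
print(labels[1], labels[2], labels[5])  # 2 10* 9*
```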

Select column with the most unique values from csv, python

I'm trying to come up with a way to select from a csv file the one numeric column that shows the most unique values. If there are multiple with the same amount of unique values it should be the left-most one. The output should be either the name of the column or the index.
Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
So here I would want 'Floor' or 4 as my output, given that Floor and Room have the same number of unique values but Floor is the left-most one. (I need it in pure Python; I can't use pandas.)
I have this nested in a whole bunch of other code for what I need to do as a whole. I will spare you the details, but these are the elements used in the code:
new_types_list = [str, int, str, datetime.datetime, int, int] #all the datatypes of the columns
l1_listed = ['Position', 'Experience in Years', 'Salary', 'Starting Date', 'Floor', 'Room'] #the header for each column
difference = [3, 5, 9, 9, 7, 7] #is basically the amount of unique values each column has
And here I try to do exactly what I mentioned before:
another_list = [] #now i create another list
for i in new_types_list: # this is where the error occurs, it only fills the list with the index of the first integer 3 times instead of with the individual indices
    if i == int:
        another_list.append(new_types_list.index(i))
integer_listi = [difference[i] for i in another_list] #and this list is the corresponding unique values from the integers
for i in difference: #now we want to find out the one that is the highest
    if i == max(integer_listi):
        chosen_one_i = difference.index(i) #the index of the column with the most unique values is the chosen one
MUV_LMNC = l1_listed[chosen_one_i]
You can use .nunique() to get the number of unique values in each column:
import pandas as pd
df = pd.read_csv("your_file.csv")
print(df.nunique())
Prints:
Position 3
Experience in Years 5
Salary 9
Starting Date 9
Floor 7
Room 7
dtype: int64
Then to find max, use .idxmax():
print(df.nunique().idxmax())
Prints:
Salary
EDIT: To select only integer columns:
print(df.select_dtypes(include="integer").nunique().idxmax())
Prints:
Floor
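Since the question asks for pure Python without pandas, here is a minimal sketch using only the standard library. It assumes that "numeric" means every value in the column parses as an integer (so Salary, which holds decimals, is skipped, matching the expected answer of Floor). The helper names `is_int` and `most_unique_int_column` are made up for this example, and the sample data is inlined instead of being read from a file.

```python
import csv
import io

# Sample data copied from the question; in practice, pass an open file instead.
CSV_TEXT = """Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
"""


def is_int(text):
    """Return True if the text parses as an integer."""
    try:
        int(text)
    except ValueError:
        return False
    return True


def most_unique_int_column(lines):
    """Return the name of the left-most integer column with the most unique values."""
    rows = list(csv.reader(lines))
    header, data = rows[0], rows[1:]
    best_name, best_count = None, 0
    for index, name in enumerate(header):
        column = [row[index] for row in data]
        if not all(is_int(value) for value in column):
            continue  # skip non-integer columns such as Salary or Starting Date
        unique_count = len({int(value) for value in column})
        # Strict ">" keeps the left-most column when counts tie.
        if unique_count > best_count:
            best_name, best_count = name, unique_count
    return best_name


print(most_unique_int_column(io.StringIO(CSV_TEXT)))  # Floor
```

The bug flagged in the question comes from `new_types_list.index(i)`, which always returns the position of the first matching element; iterating with `enumerate`, as above, yields each column's own index.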

Picking one of many identical rows with certain condition

To set the scene, what I define as identical rows are when the combination of destination and vehicle_brand are the same. For instance in the figure below,
SQL table name: cardriven
rows 2 and 3 are "identical" because of the Dallas-Toyota "combination." Now I want to only display the row with the higher request_id. So for example, between rows 2 and 3, row 3 would get displayed and row 2 would be hidden/removed because 169 > 100. So in the end, only rows 3, 4, 5, 7, and 8 will show and rows 1, 2, 6, and 9 would get hidden/removed.
Hopefully you understand what I am going for here but if you have any questions, please let me know. This will be written in SQL code.
Another problem: I added a new column for dates and entered some random ones for rows 2-4. Row 2 is 12/1/17, row 3 is 11/5/2016, and row 4 is 7/6/2017. Note that row 3 has the highest request_id of the Dallas-Toyota combination. I then added a new entry with request_id = 501, destination Dallas, vehicle brand Toyota, and date 12/22/2017. After running the query, for Dallas-Toyota I get row 3 back but with request_id = 501! It SHOULD return the entry I just added.
You can use Group By and the Max function to get the highest value.
SELECT MAX(request_id), destination, vehicle_brand
FROM cardriven
GROUP BY destination, vehicle_brand
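The follow-up problem (the right request_id paired with the wrong row's date) is the kind of result that can appear when a query also selects a column that is neither grouped nor aggregated. One common way around it is to join the grouped maxima back to the table so that the whole winning row is returned. The sketch below demonstrates this with Python's built-in sqlite3 module; the column name `drive_date` and the Austin/Honda row are assumptions made up for the demo, since the original table is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cardriven ("
    "request_id INTEGER, destination TEXT, vehicle_brand TEXT, drive_date TEXT)"
)
conn.executemany(
    "INSERT INTO cardriven VALUES (?, ?, ?, ?)",
    [
        (100, "Dallas", "Toyota", "12/1/17"),
        (169, "Dallas", "Toyota", "11/5/2016"),
        (501, "Dallas", "Toyota", "12/22/2017"),
        (120, "Austin", "Honda", "7/6/2017"),  # hypothetical row, not from the question
    ],
)

# Find the highest request_id per (destination, vehicle_brand),
# then join back to fetch every column of that exact row.
QUERY = """
SELECT c.request_id, c.destination, c.vehicle_brand, c.drive_date
FROM cardriven AS c
JOIN (
    SELECT destination, vehicle_brand, MAX(request_id) AS max_id
    FROM cardriven
    GROUP BY destination, vehicle_brand
) AS m
  ON c.destination = m.destination
 AND c.vehicle_brand = m.vehicle_brand
 AND c.request_id = m.max_id
ORDER BY c.request_id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

With this data, the Dallas-Toyota group yields the row with request_id 501 together with its own date, 12/22/2017.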

Get key and score of the last inserted item in Redis sorted set

Now I made some changes in my solution. What I want to get now is key->score pairs for given range of keys. For example:
set = [1: 3, 2: 5, 7: 8, 10: 1]
for range [2, 8] I want to get: [2: 5, 7: 8]
How can I get the last inserted (or last 5) items from a Redis sorted set? I tried the zrange function, but it sorts by score. Can I somehow get them sorted by insertion time? Or by key?
I considered using a list, but I also need to access elements by key, which is why I want to use sorted sets (better access time complexity).
Thanks!
You can make your score a composite value: a concatenation of the timestamp and the original score.
The first 10 digits are the timestamp at which the item was inserted. The last x digits are the score of the item (which means you have to pad the score with leading zeros so that it always has the same number of digits).
Example: 148594228400023
Then you can get the last 5 inserted items with zrevrangebyscore and recover each item's original score from the composite value.
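The encoding and decoding can be sketched in plain Python, without a Redis client. This assumes a 10-digit Unix timestamp in seconds and a non-negative integer score of at most 5 digits, as in the example value; `SCORE_DIGITS`, `encode_score`, and `decode_score` are names invented for this illustration.

```python
SCORE_DIGITS = 5  # assumed fixed width reserved for the original score
SCORE_BASE = 10 ** SCORE_DIGITS


def encode_score(timestamp, score):
    """Combine an insertion timestamp and a score into one sortable number."""
    if not 0 <= score < SCORE_BASE:
        raise ValueError(f"score must fit in {SCORE_DIGITS} digits")
    # Multiplying shifts the timestamp left, which zero-pads the score.
    return timestamp * SCORE_BASE + score


def decode_score(composite):
    """Split a composite value back into (timestamp, score)."""
    return divmod(composite, SCORE_BASE)


composite = encode_score(1485942284, 23)
print(composite)                # 148594228400023
print(decode_score(composite))  # (1485942284, 23)
```

Redis stores sorted-set scores as double-precision floats, which represent integers exactly only up to 2^53, so the total number of digits should stay within that limit (15 digits, as here, is fine). Items inserted within the same second are ordered by their original score under this scheme.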