Hive - selecting the id for which the other field's value is ascending in consecutive timestamps

I need to select the equipment_id for which the "Reading" is ascending in consecutive timestamps from the Hive table 'Whether_report' below.
station_id equipment_id timpe_stamp Reading
1 100 00:00:01 60
2 100 00:00:02 61
3 100 00:00:03 62
4 100 00:00:04 60
5 100 00:00:05 61
. . . .
. . . .
16 114 00:00:11 66
17 114 00:00:12 65
. . . .
. . . .
. . . .
. . . .
29 112 00:00:23 71
30 113 00:00:24 69
For example, I need to select the equipment_id whose Reading is ascending for five consecutive timestamps (e.g. 60->61->62->63->64->65), and I should not select an equipment_id whose readings drop between consecutive timestamps (e.g. 60->61->62->60->61). I am struggling to get the correct query. Any suggestion is much appreciated.

I tried a loop for your requirement:
import java.util.ArrayList;
import java.util.List;

public class AscendingRun {
    public static void main(String[] args) {
        List<Integer> lis = new ArrayList<Integer>();
        // j holds the previous reading (0 means "no previous reading yet"),
        // width counts the consecutive increases seen so far
        int j = 0, flag = 1, width = 0;
        lis.add(60);
        lis.add(61);
        lis.add(61);
        lis.add(60);
        lis.add(61);
        lis.add(62);
        lis.add(64);
        lis.add(66);
        lis.add(68);
        for (int i : lis) {
            if (j != 0) {
                if (width == 4)
                    break; // four consecutive increases found
                if (i > j) {
                    flag = 1;
                    width++;
                } else {
                    // an equal or lower reading breaks the run
                    flag = 0;
                    width = 0;
                }
            }
            System.out.println(i);
            j = i;
        }
        System.out.println("flag = " + flag + " width = " + width);
    }
}
Output:
60
61
61
60
61
62
64
66
flag = 1 width = 4
I think this can be plugged into a reducer class where the key is the IntWritable equipment_id and the value is an Iterable of IntWritable readings; feed the values to this loop, assuming all timestamp values are unique and the values reach the loop in timestamp order.
I don't know whether this is an optimal solution, considering the volume of data. Hope it helps!
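The loop above boils down to counting consecutive increases and resetting the count on any reading that is not higher than the previous one. A minimal Python sketch of that idea, applied per equipment_id, might look like the following. The function name `ids_with_ascending_run`, the `(equipment_id, timestamp, reading)` tuple layout, and the `steps` parameter are assumptions made for illustration; they are not part of the original table or of any Hive API.

```python
from collections import defaultdict


def ids_with_ascending_run(rows, steps=5):
    """Return ids that have `steps` consecutive strictly increasing readings.

    rows: iterable of (equipment_id, timestamp, reading) tuples (assumed layout).
    """
    by_id = defaultdict(list)
    for equipment_id, timestamp, reading in rows:
        by_id[equipment_id].append((timestamp, reading))

    result = []
    for equipment_id, readings in by_id.items():
        # Order by timestamp; assumes zero-padded, unique timestamps per id.
        readings.sort()
        run = 0          # consecutive increases seen so far
        previous = None  # previous reading, None before the first one
        for _, reading in readings:
            if previous is not None and reading > previous:
                run += 1
            else:
                run = 0  # first, equal, or lower reading restarts the run
            if run >= steps:
                result.append(equipment_id)
                break
            previous = reading
    return sorted(result)
```

With the sample data, equipment 100 (60, 61, 62, 60, 61) would not be selected for `steps=5`, because the drop back to 60 resets the run.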

You probably have to go to Pig or MapReduce. You are trying to find a sorted subsequence of length 5 in a bunch of readings, which probably cannot be achieved in a single query.

How to print optimal tours of a vehicle routing problem in CPLEX?

I modeled a Vehicle Routing Problem in CPLEX and now I'd like to print the optimal tours it found using post-processing.
My decision variable looks like this:
dvar boolean x[vehicles][edges];
It is 1 if the edge is traversed by the vehicle, 0 otherwise.
Edge is a tuple containing two customers, as follows:
tuple edge {
string i;
string j;
}
with customers being:
{string} customers = {"0", "1", "2", "3", "4", "5", "6"}
where 0 and 6 represent the depot where all tours start and end.
My post-processing currently looks like this:
execute {
  writeln("Optimal value: ", cplex.getObjValue());
  writeln("The following tours should be driven:");
  for (var k in vehicles) {
    write("Vehicle ", k, ": ");
    var y = 0;
    write(y);
    for (var a in edges) {
      if (x[k][a] == 1 && a.i == y) {
        write(" - ", a.j);
        y = a.j;
      }
    }
    writeln();
  }
}
Sadly, it doesn't work the intended way.
You need to turn the boolean values for the edges into tours.
See the MTZ example from "How to with OPL":
// What is better and relies on CPLEX is the MTZ model ( Miller-Tucker-Zemlin formulation )
// Cities
int n = ...;
range Cities = 1..n;
// Edges -- sparse set
tuple edge {int i; int j;}
setof(edge) Edges = {<i,j> | ordered i,j in Cities};
int dist[Edges] = ...;
setof(edge) Edges2 = {<i,j> | i,j in Cities : i!=j};
int dist2[<i,j> in Edges2] = (<i,j> in Edges)?dist[<i,j>]:dist[<j,i>];
// Decision variables
dvar boolean x[Edges2];
dvar int u[1..n] in 1..n;
/*****************************************************************************
*
* MODEL
*
*****************************************************************************/
// Objective
minimize sum (<i,j> in Edges2) dist2[<i,j>]*x[<i,j>];
subject to {
  // Each city is linked with two other cities
  forall (j in Cities)
  {
    sum (<i,j> in Edges2) x[<i,j>]==1;
    sum (<j,k> in Edges2) x[<j,k>] == 1;
  }
  // MTZ
  u[1]==1;
  forall(i in 2..n) 2<=u[i]<=n;
  forall(e in Edges2:e.i!=1 && e.j!=1) (u[e.j]-u[e.i])+1<=(n-1)*(1-x[e]);
};
{edge} solution={e | e in Edges2 : x[e]==1};
int follower[Cities];
{int} sol;
execute
{
  //writeln("path ",solution);
  for(var e in solution) follower[e.i]=e.j;
  var k=1;
  for(var i in Cities)
  {
    sol.add(k);
    k=follower[k];
  }
  writeln("sol = ",sol);
}
/*
which gives
// solution (optimal) with objective 7542
sol = {1 22 31 18 3 17 21 42 7 2 30 23 20 50 29 16 46 44 34 35 36 39 40 37 38 48
24 5 15 6 4 25 12 28 27 26 47 13 14 52 11 51 33 43 10 9 8 41 19 45 32
49}
*/
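The `follower` array in the execute block above is what turns unordered edges into an ordered tour: record each selected edge's successor, then walk from the start node. A single pass over the edge set, as in the question's loop, only finds the next edge if it happens to come later in the set. The idea can be sketched outside OPL in Python as follows. The function name `build_tour` and the list-of-pairs input are assumptions for illustration, and the default depot labels "0" and "6" are taken from the question.

```python
def build_tour(selected_edges, start="0", end="6"):
    """Chain the (i, j) edges used by one vehicle into an ordered tour.

    selected_edges: iterable of (i, j) pairs whose decision variable is 1
    (hypothetical input format).
    """
    # Map each node to the node visited right after it.
    successor = {i: j for i, j in selected_edges}
    tour = [start]
    # Bound the walk by the edge count so a malformed solution cannot loop forever.
    for _ in range(len(successor)):
        current = tour[-1]
        if current == end or current not in successor:
            break
        tour.append(successor[current])
    return tour


# Edges listed out of order still give the ordered tour.
print(" - ".join(build_tour([("3", "6"), ("0", "2"), ("2", "3")])))
```

In the OPL post-processing, the same walk would be done once per vehicle `k`, using only the edges with `x[k][a] == 1`.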

Splitting a coordinate string into X and Y columns with a pandas data frame

I created a pandas data frame showing the coordinates for an event and the number of times those coordinates appear. The coordinates are shown in a string like this:
Coordinates Occurrences x
0 (76.0, -8.0) 1 0
1 (-41.0, -24.0) 1 1
2 (69.0, -1.0) 1 2
3 (37.0, 30.0) 1 3
4 (-60.0, 1.0) 1 4
.. ... ... ..
63 (-45.0, -11.0) 1 63
64 (80.0, -1.0) 1 64
65 (84.0, 24.0) 1 65
66 (76.0, 7.0) 1 66
67 (-81.0, -5.0) 1 67
I want to create a new data frame that shows the x and y coordinates individually and shows their occurrences as well like this--
x Occurrences y Occurrences
76 ... -8 ...
-41 ... -24 ...
69 ... -1 ...
37 ... -30 ...
60 ... 1 ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the result to the table. I think I'd have to do something like a for loop later in my code. I scraped the data from an API; here is the code that sets up the data frame shown.
for key in contents['liveData']['plays']['allPlays']:
    # for plays in key['result']['event']:
    # print(key)
    if (key['result']['event'] == "Shot"):
        #print(key['result']['event'])
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if (key['result']['event'] == "Goal"):
        #print(key['result']['event'])
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1
#create data frame using pandas
gdf = pd.DataFrame(list(goals.items()),columns = ['Coordinates','Occurences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()),columns = ['Coordinates','Occurences'])
print()
Try this:
import re
df[['x', 'y']] = df.Coordinates.apply(lambda c: pd.Series(dict(zip(['x', 'y'], re.findall(r'[-]?[0-9]+\.[0-9]+', c.strip())))))
Using the built-in string methods to achieve this should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(This also converts x and y to float values, which was not requested but is probably desired.)
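Since the scraping code in the question stores each coordinate as an `(x, y)` tuple key in `shots` and `goals`, the per-axis counts could also be computed before any data frame is built. The following is a minimal sketch under that assumption, using only the standard library; `axis_counts` is a hypothetical helper name, and the sample dictionary is made up for illustration.

```python
from collections import Counter


def axis_counts(coordinate_counts):
    """Split {(x, y): occurrences} into separate x and y occurrence counters."""
    x_counts, y_counts = Counter(), Counter()
    for (x, y), occurrences in coordinate_counts.items():
        x_counts[x] += occurrences
        y_counts[y] += occurrences
    return x_counts, y_counts


# Made-up sample in the same shape as the `shots` dictionary.
sample = {(76.0, -8.0): 1, (76.0, 7.0): 1, (80.0, -1.0): 2, (69.0, -1.0): 1}
x_counts, y_counts = axis_counts(sample)
print(x_counts[76.0], y_counts[-1.0])
```

Each counter could then be turned into a data frame the same way the question does, for example with `pd.DataFrame(list(x_counts.items()), columns=[...])`.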

Output the result of each loop in different columns

price.txt file has two columns: (name and value)
Mary 134
Lucy 56
Jack 88
range.txt file has three columns: (fruit and min_value and max_value)
apple 57 136
banana 62 258
orange 88 99
blueberry 98 121
My aim is to test whether each value in the price.txt file is between the min_value and max_value in range.txt. If yes, output 1; if not, output "x".
I tried:
awk 'FNR == NR { name=$1; price[name]=$2; next} {
for (name in price) {
if ($2<=price[name] && $3>=price[name]) {print 1} else {print "x"}
}
}' price.txt range.txt
But my results are all in one column, just like follows:
1
1
x
x
x
x
x
x
1
1
1
x
Actually, I want my result to be like: (Each name has one column)
1 x 1
1 x 1
x x 1
x x x
Because I need to use paste to add the output file and range.txt file together. The final result should be like:
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
So, how can I get the result of each loop in different columns? And is there any way to output the final result without paste, based on my current code? Thank you.
This builds on what you provided:
# load prices by index to maintain read order;
# names ends up holding the number of prices,
# which avoids using the non-standard length(array)
FNR == NR {
    price[names++] = $2
    next
}
# for each line of range.txt, append one flag per price
{
    l = $1 " " $2 " " $3
    for (i = 0; i < names; i++) {
        if ($2 <= price[i] && $3 >= price[i]) {
            l = l " 1"
        } else {
            l = l " x"
        }
    }
    print l
}
and generates this output:
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
However, the output doesn't carry the person's name for each column (anonymous results). Maybe that's intentional?
The change here is to explicitly index the array populated in the first block, so that the read order is maintained.
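For cross-checking the awk output, the same row-building logic can be sketched in Python: keep the prices in read order, then append one flag per price to each range line. The function name `flag_rows` and the in-memory lists are assumptions for illustration; reading the two files is left out.

```python
def flag_rows(prices, ranges):
    """Build one output line per range, with a 1/x flag for every price.

    prices: list of numbers in the order they were read from price.txt.
    ranges: list of (fruit, min_value, max_value) tuples from range.txt.
    """
    lines = []
    for fruit, low, high in ranges:
        # One column per price, in read order.
        flags = ["1" if low <= price <= high else "x" for price in prices]
        lines.append(" ".join([fruit, str(low), str(high)] + flags))
    return lines


prices = [134, 56, 88]
ranges = [("apple", 57, 136), ("banana", 62, 258),
          ("orange", 88, 99), ("blueberry", 98, 121)]
print("\n".join(flag_rows(prices, ranges)))
```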

OCaml: Print a long int list 10 elements per row

I'm working with really long lists of integers and need a way of printing them 10 to a row. This is what I've got so far and now I'm stuck:
open Printf
let print_list list = List.iter (printf "%d ") list;;
(* Remove first n elements from list *)
let rec remove n list =
if n== 0 then list
else match list with
| [] -> []
| hd::tl -> remove (n-1) tl;;
(* Remove and return first n elements from a list *)
let rec take n list =
match n with
| 0 -> []
| _ -> List.hd list :: take (n-1) (List.tl list);;
let rec print_rows list =
if List.length list > 10 then
begin
let l = take 10 list;
print_list l;
print_endline " ";
print_rows (remove 5 list)
end else print_list list;;
I'm sure there is a better way to do this recursively with pattern matching, but I can't figure it out. Help!
Here's a function that does something close to what you want. It doesn't do anything fancy, it just counts the number of ints printed so far and inserts endlines at the right times.
let printby10 intlist =
  let iprint count n =
    Printf.printf "%d " n;
    if count mod 10 = 9 then Printf.printf "\n";
    count + 1
  in
  ignore (List.fold_left iprint 0 intlist)
This code leaves an incomplete line if the number of ints isn't a multiple of 10. Maybe you would want to fix that up.
Another approach (very close to that of @Jeffrey Scofield) would be to use the standard function List.iteri, which provides the current element's index:
let print_by_rows n_per_row =
  List.iteri (fun i x ->
    print_int x;
    if (i + 1) mod n_per_row <> 0 then print_string " "
    else print_newline ())
A test:
μ> print_by_rows 10 (Array.to_list (Array.make 20 42));;
42 42 42 42 42 42 42 42 42 42
42 42 42 42 42 42 42 42 42 42
- : unit = ()
And one more:
μ> print_by_rows 5 (Array.to_list (Array.make 20 42));;
42 42 42 42 42
42 42 42 42 42
42 42 42 42 42
42 42 42 42 42
- : unit = ()

Efficient way to calculate averages, standard deviations from a txt file

Here is a copy of what one of many txt files looks like.
Class 1:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
0 2 180 63 38
-1 -2 0 79 84
-1 -2 180 85 95
. . . . .
Subject B:
posX posY posZ x(%) y(%)
0 2 0 71 73
-1 -2 0 69 88
. . . . .
Subject C:
posX posY posZ x(%) y(%)
0 2 0 86 71
-1 -2 0 81 55
. . . . .
Class 2:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
-1 -2 0 79 84
. . . . .
The number of classes, subjects, row entries all vary.
Class1-Subject A always has posZ entries that have 0 alternating with 180
Calculate average of x(%), y(%) by class and by subject
Calculate standard deviation of x(%), y(%) by class and by subject
Also ignore the posZ of 180 row when calculating averages and std_deviations
I have developed an unwieldy solution in Excel (using macros and VBA), but I would rather go for a better solution in Python.
numpy is very helpful, but the .mean() and .std() functions only work with arrays. I am still researching it, as well as pandas' groupby function.
I would like the final output to look as follows (1. By Class, 2. By Subject)
1. By Class
X Y
Average
std_dev
2. By Subject
X Y
Average
std_dev
I think working with dictionaries (and a list of dictionaries) is a good way to get familiar with working with data in python. To format your data like this, you'll want to read in your text files and define variables line by line.
To start:
for line in infile:
    if line.startswith("Class"):
        temp, class_var = line.split(' ')
        # drop the colon and the trailing newline
        class_var = class_var.replace(':', '').strip()
    elif line.startswith("Subject"):
        temp, subject = line.split(' ')
        subject = subject.replace(':', '').strip()
This will create variables that correspond to the current class and current subject. Then, you want to read in your numeric variables. A good way to just read in those values is through a try statement, which will try to make them into integers.
    else:
        line = line.split(" ")
        try:
            keys = ['posX','posY','posZ','x_perc','y_perc']
            values = [int(item) for item in line]
            entry = dict(zip(keys,values))
            entry['class'] = class_var
            entry['subject'] = subject
            outputList.append(entry)
        except ValueError:
            pass
This will put them into dictionary form, including the earlier defined class and subject variables, and append them to an outputList. You'll end up with this:
[{'posX': 0, 'x_perc': 81, 'posZ': 0, 'y_perc': 72, 'posY': 2, 'class': '1', 'subject': 'A'},
{'posX': 0, 'x_perc': 63, 'posZ': 180, 'y_perc': 38, 'posY': 2, 'class': '1', 'subject': 'A'}, ...]
etc.
You can then average/take SD by subsetting the list of dictionaries (applying rules like excluding posZ=180 etc.). Here's for averaging by Class:
import numpy as np

classes = ['1', '2']
print("By Class:")
print("Class", "Avg X", "Avg Y", "X SD", "Y SD")
for class_var in classes:
    x_m = np.mean([item['x_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    y_m = np.mean([item['y_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    x_sd = np.std([item['x_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    y_sd = np.std([item['y_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    print(class_var, x_m, y_m, x_sd, y_sd)
You'll have to play around with the printed output to get exactly what you want, but this should get you started.
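The parsing and grouping steps can also be combined into one pass. The sketch below uses only the standard library and makes several assumptions: `summarize` is a hypothetical name, "by subject" is read as "per (class, subject) pair", rows with posZ equal to 180 are skipped, and the standard deviation is the population one (`statistics.pstdev`), which matches the default of `np.std`.

```python
import statistics
from collections import defaultdict


def summarize(lines, skip_posz=180):
    """Return (by_class, by_subject) dicts of averages and population std devs."""
    by_class = defaultdict(lambda: ([], []))    # class -> (x values, y values)
    by_subject = defaultdict(lambda: ([], []))  # (class, subject) -> (x, y values)
    class_var = subject = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Class"):
            class_var = line.split()[1].rstrip(":")
        elif line.startswith("Subject"):
            subject = line.split()[1].rstrip(":")
        else:
            try:
                pos_x, pos_y, pos_z, x, y = (int(tok) for tok in line.split())
            except ValueError:
                continue  # header rows, blank lines, ". . ." fillers
            if pos_z == skip_posz:
                continue
            for xs, ys in (by_class[class_var], by_subject[(class_var, subject)]):
                xs.append(x)
                ys.append(y)

    def stats(groups):
        return {
            key: {
                "avg_x": statistics.mean(xs),
                "avg_y": statistics.mean(ys),
                "sd_x": statistics.pstdev(xs),
                "sd_y": statistics.pstdev(ys),
            }
            for key, (xs, ys) in groups.items()
        }

    return stats(by_class), stats(by_subject)
```

Passing an open file object as `lines` should work, since the function only iterates over it line by line. If a sample standard deviation is wanted instead, `statistics.stdev` could be swapped in, but it needs at least two values per group.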