Hive query execution issue - sql

When I execute a Hive query I get the output below. What do the 1 and 2 in "Map 1" and "Reducer 2" mean?
Map 1: 21/27 Reducer 2: 0/1
Map 1: 22/27 Reducer 2: 0/1
Map 1: 23/27 Reducer 2: 0/1
Map 1: 24/27 Reducer 2: 0/1
Map 1: 26/27 Reducer 2: 0/1
Map 1: 27/27 Reducer 2: 0/1
Map 1: 27/27 Reducer 2: 1/1
thanks in advance,
Lin

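Hive compiles the query into a map-reduce style job made up of map and reduce stages, and each stage is assigned a number of tasks based on the input. While the job runs, the console reports the progress of each stage as completed tasks over total tasks, so "Map 1: 21/27" means 21 of the 27 map tasks have finished. The 1 and 2 appear to be labels that identify each stage in the query plan rather than counts.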

Optimize batch transform inference on sagemaker

With my current batch transform inference setup I see several bottlenecks:
Each input file can only hold close to 1000 records.
It currently processes about 2000 records/min on one ml.g4dn.12xlarge instance.
GPU instances do not seem to give any advantage over CPU instances.
I wonder whether this is a limitation of the currently available TensorFlow Serving container v2.8. If so, which configuration settings should I tune to improve performance?
I tried changing max_concurrent_transforms, but it doesn't seem to help.
My current config:
transformer = tensorflow_serving_model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    max_concurrent_transforms=0,
    output_path=output_data_path,
)
transformer.transform(
    data=input_data_path,
    split_type='Line',
    content_type="text/csv",
    job_name=job_name + datetime.now().strftime("%m-%d-%Y-%H-%M-%S"),
)
Generally speaking, you should first have a well-performing model (steps 1 and 2 below) that yields a satisfactory TPS before you move on to the batch transform parallelization knobs to push your overall TPS higher.
Steps:
1. GPU enabling - Run a manual test to check that your model can utilize GPU instances to begin with (this isn't related to batch transform).
2. Picking an instance - Use SageMaker Inference Recommender to find the most cost-effective instance type to run inference on.
3. Batch transform inputs - It sounds like you have multiple input files, which is needed if you want to speed up the job by adding more instances.
4. Batch transform job single-instance knobs - If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
5. Batch transform cluster size - Increase instance_count to more than 1, using the cost-effective instance type you found in steps 1 and 2.
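The single-instance knobs and the cluster size from the steps above map onto keyword arguments of the SageMaker Python SDK's Model.transformer(). The sketch below only collects those arguments in a dict. The helper name transformer_settings and the worker count of 4 are hypothetical, the right worker count depends on your container, and the 100 MB cap on MaxConcurrentTransforms * MaxPayloadInMB is taken from the SageMaker documentation as I understand it, so please verify it for your setup.

```python
def transformer_settings(instance_type, instance_count, workers_per_instance,
                         max_payload_mb=6):
    """Collect keyword arguments for Model.transformer() (hypothetical helper)."""
    if workers_per_instance < 1:
        raise ValueError("workers_per_instance must be at least 1")
    # Assumption: SageMaker caps MaxConcurrentTransforms * MaxPayloadInMB at 100 MB.
    if workers_per_instance * max_payload_mb > 100:
        raise ValueError("max_concurrent_transforms * max_payload exceeds 100 MB")
    return {
        "instance_count": instance_count,
        "instance_type": instance_type,
        "strategy": "MultiRecord",      # BatchStrategy: pack several records per request
        "assemble_with": "Line",        # write one output record per line
        "max_payload": max_payload_mb,  # MaxPayloadInMB
        "max_concurrent_transforms": workers_per_instance,  # MaxConcurrentTransforms
    }


# Assumed values: 2 instances, 4 compute workers per instance.
settings = transformer_settings("ml.g4dn.12xlarge", 2, workers_per_instance=4)
print(settings)

# With the SDK installed, the dict would be passed on like this:
# transformer = tensorflow_serving_model.transformer(
#     output_path=output_data_path, **settings)
```

Keeping the settings in one place makes it easier to rerun the same job while changing a single knob and comparing throughput.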

Remeshing depending on physical quantity values

I am performing Code Saturne simulations of fluid flows. I mesh the computational domain with gmsh. I would like to adapt my mesh by refining regions of my investigated case depending on the results of a previous simulation. Here is the process:
Step 1 : Mesh -> first_mesh.msh
Step 2 : Simulation with first_mesh.msh
Step 3 : Mesh adaptation -> if a quantity extracted from the simulation field is above a given value in a particular area then I would like to refine this area and produce a new mesh called adaptated_mesh.msh for instance
Step 4 : Simulation with adaptated_mesh.msh
Is that possible?
If yes, could you give me a simple example to show how it works?
Thanks a lot

Is it possible to refresh a customer level dataset daily to give churn predictions using BigqueryML

Is it possible to refresh a customer-level dataset daily to give churn predictions using BigQuery ML?
You can run a scheduled query to run CREATE MODEL periodically. The model is persisted in BigQuery storage when CREATE MODEL is executed.
For immediate updates, automating the pipeline workflow with either Cloud Composer or Airflow is probably the best option.
The word "refresh" is ambiguous here, and "dataset" probably refers to a BigQuery table.
As I understand it, your question is whether you can maintain an up-to-date table of users with a churn score calculated by BigQuery ML.
The answer is yes: after you train a model with BQML, you can run scheduled queries to produce predictions. Please keep in mind:
Updating rows in a BigQuery table in place is not recommended. Instead, consider appending the results to a table with a timestamp that identifies each batch of predictions (one per day, say), then create a view that shows only the most recent predictions.
You may need to run a chain of queries, for example: create a training dataset with features, run predictions, and save the results. For this you may want to use Cloud Workflows.
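The append-then-view pattern described above could look roughly like the sketch below. It only builds the two SQL statements as strings. All project, dataset, table, model and view names are hypothetical, and the prediction_date column is an assumed name for the timestamp mentioned above.

```python
def prediction_query(model, features_table):
    """SQL that scores every row of features_table and stamps it with today's date."""
    return (
        "SELECT CURRENT_DATE() AS prediction_date, *\n"
        f"FROM ML.PREDICT(MODEL `{model}`, TABLE `{features_table}`)"
    )


def latest_view_query(view, predictions_table):
    """SQL for a view that exposes only the most recent batch of predictions."""
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS\n"
        f"SELECT * FROM `{predictions_table}`\n"
        f"WHERE prediction_date = "
        f"(SELECT MAX(prediction_date) FROM `{predictions_table}`)"
    )


# Hypothetical names, for illustration only.
sql = prediction_query("my_project.churn.model", "my_project.churn.features_today")
view_sql = latest_view_query("my_project.churn.latest_predictions",
                             "my_project.churn.predictions")
print(sql)
print(view_sql)

# With google-cloud-bigquery installed, the daily append could be run as:
# from google.cloud import bigquery
# client = bigquery.Client()
# job_config = bigquery.QueryJobConfig(
#     destination="my_project.churn.predictions",
#     write_disposition="WRITE_APPEND",
# )
# client.query(sql, job_config=job_config).result()
```

The same SELECT statement can also be pasted into a scheduled query whose destination table is set to append, so no Python is strictly required.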
One approach is to create a BigQuery ML model and store its output at run time, then use the predicted output to drive any further action on BigQuery tables.
You can run Dataflow or simple Python code to perform the required data archival.
from google.cloud import bigquery
# Clear the target table before writing fresh results.
stream_query = """DELETE FROM `ikea-itsd-ml.test123.YOUR_NEW_TABLE4` WHERE 1=1"""
stream_client = bigquery.Client()
stream_Q = stream_client.query(stream_query)
# A DELETE returns no rows, so just wait for the job to finish.
stream_Q.result()

Tensorboard printed summary Precision and Accuracy does not show the Step count

We want to correlate the step count with the trends of precision and accuracy. The TensorBoard GUI allows this, but we want to automate the results. The printed summary does not include the step count, at least not by default.
Is there a way to tease the step count out of TensorBoard's printed results?
The steps are actually available in the TF Events file but require coding to extract and correlate. I show how to do that here: How to parse the tensorflow events file?
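As a rough sketch of that extraction, TensorBoard's EventAccumulator can read scalar events, each of which carries a step, and the series can then be joined on the steps they share. The tag names, the log directory and the sample values below are made up for illustration. load_scalars needs the tensorboard package and a real events directory, so the runnable part uses hand-written pairs instead.

```python
def load_scalars(logdir, tag):
    """Read (step, value) pairs for one scalar tag from a TF events directory."""
    # Imported lazily so the joining logic below runs without TensorBoard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
    acc = EventAccumulator(logdir)
    acc.Reload()
    return [(event.step, event.value) for event in acc.Scalars(tag)]


def join_by_step(**series):
    """Align several (step, value) series on the steps present in all of them."""
    if not series:
        return []
    lookup = {name: dict(points) for name, points in series.items()}
    common = set.intersection(*(set(values) for values in lookup.values()))
    return [(step, {name: lookup[name][step] for name in lookup})
            for step in sorted(common)]


# Made-up sample values standing in for load_scalars("logs/run1", "precision") etc.
precision = [(100, 0.71), (200, 0.78), (300, 0.80)]
accuracy = [(100, 0.65), (200, 0.74)]
rows = join_by_step(precision=precision, accuracy=accuracy)
for step, metrics in rows:
    print(step, metrics)
```

Steps that are missing from any one series are dropped, which keeps each printed row complete at the cost of losing partial data.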

I want to map a function to each element of a vector in Theano, can I do it without using scan?

Say a function that counts the appearances of ones at each index of an array:
import theano
import theano.tensor as T
A = T.vector("A")
idx_range = T.arange(A.shape[0])
result, updates = theano.scan(fn=lambda idx: T.sum(A[:idx+1]), sequences=idx_range)
count_ones = theano.function(inputs=[A], outputs=result)
print(count_ones([0,0,1,0,0,1,1,1]))
# gives [ 0. 0. 1. 1. 1. 2. 3. 4.]
As said here, using scan may not be efficient. Plus, theano.scan always produces "RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility from scan_perform.scan_perform import *" on my machine.
So I was wondering is there a better way of mapping functions in Theano?
Thanks in advance.
Edit:
I just realized it is a terrible example. Apparently there's a more efficient way that loops over the vector only once:
# scan passes the sequence element first, then the prior result
result, updates = theano.scan(fn=lambda a, prior_result: prior_result + a,
                              outputs_info=T.alloc(np.int32(0), 1),
                              sequences=A,
                              n_steps=A.shape[0])
However, according to @Daniel Renshaw's answer, since "the computation in one step is dependent on the same computation at some earlier step", I can't actually avoid using scan in this case, right?
Edit:
I thought of a way of vectorizing it:
import numpy
A = T.vector()
in_size = 8
# a matrix with ones at and below the given diagonal and zeros elsewhere
mask = theano.shared(numpy.tri(in_size))
result = T.dot(mask, A)
count_ones = theano.function(inputs=[A], outputs=result)
print(count_ones(numpy.asarray([0,0,1,0,0,1,1,1])))
But in this case I have to know the size of the input in advance (unless I can craft numpy.tri like matrices on the fly?).
Any suggestions would be welcome. :)
Edit:
I benchmarked the three methods using a 512-D input array and 10000 iterations, and got the following results:
map a sum function to each element: CPU 16s, GPU 140s
loop over the array using scan: CPU 13s, GPU 32s
vectorization: CPU 0.8s, GPU 0.8s (actually I don't think Theano has engaged the GPU to do this)
In the most general case, if no assumptions are made about the function, then scan would have to be used. However, many (maybe most?) useful functions can be vectorized such that scan is not needed. As is pointed out in the question edit, the example function can certainly be applied to the input without using scan.
Whether scan is needed depends on the function to be applied. Cases that certainly require scan are those where the computation in one step depends on the same computation at some earlier step.
P.S. The warning about binary incompatibility can be safely ignored.
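To illustrate the point about vectorizing, here is a small NumPy sketch (not Theano) of the running count from the question computed three ways: a scan-style loop, a lower-triangular mask built on the fly from the input length, and a built-in cumulative sum. The function names are hypothetical. Theano appears to offer a comparable cumulative-sum op in theano.tensor.extra_ops.cumsum, but check that against your version.

```python
import numpy as np


def running_sum_loop(a):
    """Scan-style version: each step depends on the previous total."""
    total = 0.0
    out = []
    for x in a:
        total += x
        out.append(total)
    return np.array(out)


def running_sum_mask(a):
    """Vectorized version: the mask is sized from the input, not fixed in advance."""
    a = np.asarray(a, dtype=float)
    mask = np.tri(a.shape[0])  # ones on and below the diagonal
    return mask.dot(a)


def running_sum_cumsum(a):
    """Vectorized version using the built-in cumulative sum."""
    return np.cumsum(np.asarray(a, dtype=float))


a = [0, 0, 1, 0, 0, 1, 1, 1]
print(running_sum_loop(a))
print(running_sum_mask(a))
print(running_sum_cumsum(a))
```

All three agree on this input. The mask version needs memory quadratic in the input length, so the cumulative sum is usually the better choice for long vectors.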