One of the essential features of Azure Machine Learning is the ability to start a batch scoring job. This post illustrates how to invoke the batch endpoint to start a batch scoring job, a fundamental task covered by the Microsoft Azure DP-100 exam: “Designing and Implementing a Data Science Solution on Azure”.
A Brief Overview
Batch scoring is helpful when dealing with a large amount of data that doesn’t require real-time scoring. In Azure, you can use pipelines to create a batch scoring process. All it takes is an input dataset, your trained model, and a compute resource to run the process.
Defining The Batch Scoring Job
The batch scoring job starts with creating a ParallelRunStep, which is the step in your pipeline where the batch scoring will take place. ParallelRunStep splits a dataset into mini-batches and processes them in parallel across the nodes of a compute cluster, which significantly speeds up scoring. You need to specify a few things while creating this step, such as the script to be run, the input and output data, and the compute target:
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=model_folder,
    entry_script=batch_score_script,
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=env,
    compute_target=cpu_cluster,
    node_count=2
)

parallelrun_step = ParallelRunStep(
    name="batch-score",
    parallel_run_config=parallel_run_config,
    inputs=[named_input],
    output=output_dir,
    arguments=[model_path],
    allow_reuse=True
)
In the above code, the ParallelRunConfig is set to execute a specified script (which would contain your scoring logic) on your compute target. It also defines the mini-batch size: for a FileDataset input this is the number of files processed per mini-batch, while for a TabularDataset it is the approximate amount of data per mini-batch.
Creating and Running the Pipeline
After defining the batch scoring step, create your pipeline. Add the defined step and submit the pipeline to the Azure Machine Learning workspace:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
experiment = Experiment(workspace=ws, name="batch-scoring")
pipeline_run = experiment.submit(pipeline)
The above code creates a new pipeline, adds the step defined earlier, and submits it to the Azure Machine Learning experiment.
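Once the pipeline works, publishing it (for example with pipeline.publish in the v1 SDK) exposes a REST endpoint that starts a new scoring run when you POST to it with an Azure AD bearer token. As a sketch, here is how that authenticated request could be built with the standard library; the endpoint URL, token, and experiment name below are placeholders:

```python
import json
import urllib.request


def build_invoke_request(endpoint_url, aad_token, experiment_name):
    # Build (but do not send) the POST request that triggers a published
    # pipeline run; the response to this request contains the new run's Id.
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: in practice use the published pipeline's endpoint URL
# and a real Azure AD token.
request = build_invoke_request(
    "https://example.azureml.net/pipelines/v1.0/run", "TOKEN", "batch-scoring"
)
# urllib.request.urlopen(request) would submit the run.
```

This same shape, an endpoint URL plus a bearer token plus a JSON body, is what any REST client (curl, Postman, requests) sends to trigger the job.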
Monitoring the Batch Scoring Job
Once your scoring job is up and running, it’s good practice to monitor its performance and progress. Azure provides monitoring through the dashboard in the Azure portal, and you can also monitor directly from the console output in your Jupyter notebook:
# Monitor the pipeline run status
pipeline_run.wait_for_completion(show_output=True)
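With output_action="append_row", a completed run writes all result lines to a single parallel_run_step.txt file in the output datastore. After downloading that file locally, parsing it is straightforward; this sketch assumes each line is a comma-separated "filename,prediction" pair, which is an illustrative format rather than anything Azure ML mandates:

```python
def parse_append_row_output(path):
    # Each line of parallel_run_step.txt is one result emitted by the entry
    # script's run() function; we assume "filename,prediction" lines here.
    results = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            filename, prediction = line.split(",", 1)
            results.append((filename, prediction))
    return results
```

Because append_row concatenates results from all nodes, the line order is not guaranteed; including an identifier (here, the filename) in each line is what lets you join predictions back to their inputs.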
Final Thoughts
Batch scoring jobs, as covered in the DP-100 exam, are crucial when scoring large amounts of data in parallel. Invoking the batch endpoint to start a batch scoring job involves creating your batch scoring step with ParallelRunStep, incorporating it into a pipeline, and submitting that pipeline to the Azure Machine Learning workspace.
Practice Test
True or False: The batch endpoint is used to start a batch scoring job in Azure Machine Learning.
- True
- False
Answer: True
Explanation: The batch endpoint in Azure Machine Learning is used for batch predictions or batch scoring jobs.
True or False: In order to invoke the batch endpoint, you don’t need to be authenticated.
- True
- False
Answer: False
Explanation: You must authenticate to invoke the batch endpoint; batch endpoints use Microsoft Entra ID (Azure AD) token-based authentication.
Which Azure service allows you to set up a batch scoring job?
- a. Azure Storage
- b. Azure Batch AI
- c. Azure Machine Learning
- d. Azure Cognitive Services
Answer: c. Azure Machine Learning
Explanation: Azure Machine Learning offers the batch endpoint for setting up and starting a batch scoring job.
True or False: The batch endpoint in Azure Machine Learning runs in real-time.
- True
- False
Answer: False
Explanation: The batch endpoint in Azure Machine Learning does not run in real-time. Instead, it is designed specifically for asynchronous jobs such as large-scale scoring tasks.
True or False: You can use the Azure portal to invoke the batch endpoint and start a batch scoring job.
- True
- False
Answer: False
Explanation: Invoking the batch endpoint is usually done through code, using methods such as the Python SDK, and not directly through the Azure portal.
What is the primary purpose of a batch scoring job in Azure Machine Learning?
- a. Storing data
- b. Running real-time predictions
- c. Running large-scale predictions
- d. Monitoring application performance
Answer: c. Running large-scale predictions
Explanation: The primary purpose of a batch scoring job is to run large-scale predictions or classifications on datasets.
True or False: You can use the REST API to invoke the batch endpoint for batch scoring.
- True
- False
Answer: True
Explanation: The batch endpoint exposes a REST interface; you can also invoke it through the Azure CLI or the Python SDK. gRPC is not supported.
True or False: A batch scoring job can only use a trained model that is registered in Azure Machine Learning.
- True
- False
Answer: True
Explanation: In Azure Machine Learning, a batch scoring job uses a model that has been trained and registered within the service.
Which of the following are required to invoke the batch endpoint for a batch scoring job?
- a. Endpoint URL
- b. Authentication credentials
- c. Data to score
- d. All of the above
Answer: d. All of the above
Explanation: The endpoint URL, authentication credentials, and the data to score are all required to invoke the batch endpoint to start a batch scoring job.
True or False: Batch scoring jobs are billable based on the number of transactions processed by the batch endpoint.
- True
- False
Answer: False
Explanation: Batch scoring charges are based on the compute resources consumed while the job runs, not on the number of transactions processed.
True or False: Once started, a batch scoring job immediately returns predictions.
- True
- False
Answer: False
Explanation: Unlike real-time endpoints, batch scoring jobs do not return predictions immediately. They work asynchronously, processing large datasets in the background.
Which programming language do you often use to invoke the batch endpoint for batch scoring in Azure Machine Learning?
- a. C#
- b. Python
- c. Java
- d. Ruby
Answer: b. Python
Explanation: Python is the most commonly used language for invoking the batch endpoint for batch scoring in Azure Machine Learning due to its strong support for data analysis and manipulation tasks.
Interview Questions
What is the basic purpose of using a batch endpoint in Azure?
A batch endpoint in Azure is used to run large-scale predictions where the results aren’t needed immediately.
What steps do we need to take to invoke a batch endpoint?
To invoke a batch endpoint, you need to generate an input file, store the input file in data storage accessible to the service, submit a job, get the job status until it’s done, and then retrieve the result file.
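The "get the job status until it's done" step can be sketched as a generic polling loop. The terminal state names below match common Azure ML job statuses, and get_status stands in for whatever status call your SDK version provides (for example, a run's get_status() in the v1 SDK):

```python
import time


def wait_for_terminal_state(get_status, poll_seconds=30.0,
                            terminal=("Completed", "Failed", "Canceled")):
    # Repeatedly query the job status until it reaches a terminal state,
    # then return that final state.
    while True:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
```

In practice, helpers such as wait_for_completion(show_output=True) wrap exactly this kind of loop for you.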
How can you submit a batch scoring job?
You can submit a batch scoring job by invoking the batch endpoint (for example, with ml_client.batch_endpoints.invoke() in the Python SDK) or, when using pipelines, by submitting the pipeline to an experiment.
Which Python object can be used to access the status information about the batch scoring job?
The BatchJob object can be used to access the status info about the batch scoring job.
What role does Azure blob storage have in batch scoring?
Azure blob storage is used to store the input and output data used in batch scoring.
What is contained in the input file used to invoke a batch scoring job?
The input file includes all the rows of data that need to be processed in the batch scoring job.
How is the result data provided after the scoring job is done?
After the batch scoring job is complete, the results are typically provided in a blob storage, which is defined during the creation of the batch endpoint.
Can we monitor the status of a batch scoring job?
Yes, we can monitor the status of a batch scoring job using the get_status() method on the batch job object in Python.
What does the function get_logs() for batch jobs in Azure return?
The get_logs() function on a BatchJob object retrieves the container logs for the completed job.
What data types are considered valid input for a batch scoring job in Azure?
Commonly used input formats for batch scoring are CSV and Parquet files.
Is it possible to cancel a batch scoring job?
Yes, it is possible to cancel a batch scoring job by using the function batch_job.cancel().
Why might you use a batch endpoint rather than deploying a real-time scoring endpoint on Azure?
A batch endpoint would be more appropriate when dealing with large-scale data sets or when real-time results are not required.
How is error handling managed in batch scoring jobs?
Errors are typically logged during execution of the batch scoring job. These logs can be accessed after the job completion to understand and troubleshoot the issues.
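ParallelRunStep's error_threshold controls how many failed items (those where run() raises) are tolerated before the whole job fails; alternatively, the entry script can catch failures per item itself so that one bad row doesn't abort the mini-batch. A sketch of that per-item pattern, where the predict callable is a stand-in for your model:

```python
def score_with_error_capture(rows, predict):
    # Score each row independently; collect failures instead of raising so
    # the rest of the mini-batch still gets processed. Captured errors can
    # then be logged, where they show up in the job's log files.
    results, errors = [], []
    for index, row in enumerate(rows):
        try:
            results.append(predict(row))
        except Exception as exc:
            errors.append((index, repr(exc)))
    return results, errors
```

The trade-off is visibility: errors swallowed inside run() do not count toward error_threshold, so they should be logged explicitly or written into the output for later inspection.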
Can batch scoring jobs be run concurrently in Azure?
Yes, Azure supports running multiple batch scoring jobs concurrently by default.
Is response time important for batch scoring jobs?
Generally, batch scoring jobs prioritize high throughput over low latency, so response time is not critical as long as all data is processed in the window defined for the job.