Oracle Data Science Jobs & MLOps - Kxodia 肯佐迪亞

MLOps, short for machine learning operations, is based on DevOps principles and practices that increase the efficiency of workflows and improve the quality and consistency of your machine learning solution.

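MLOps is the standardization, streamlining, and automation of machine learning life cycle management. ML assets are treated like any other software asset in an iterative continuous integration (CI) and continuous delivery (CD) environment.

ML models are deployed together with the services that wrap them and the services that consume them, as part of a unified release process.

Ultimately, MLOps is a set of techniques and practices for rapidly deploying and managing scalable, governed ML applications in production.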

Oracle Data Science Jobs & MLOps — MLOps life cycle

Continuous Practices in MLOps

Continuous integration refers to the validation and integration of new data and ML models.
Continuous deployment refers to releasing that model into production.
Continuous training is unique to MLOps and refers to the automatic retraining of ML models for redeployment.

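DevOps and MLOps have a lot in common, but there is one key difference. Software may be relatively static, whereas data is always changing. This means machine learning models need to keep learning and adapting to new inputs. This data drift is why continuous retraining of machine learning models is so important.

If a model is not updated to reflect new data, its predictions can become increasingly inaccurate. Because model performance degrades over time, it is important to retrain the model on new data as soon as possible.

This is where continuous integration in MLOps offers a major advantage. Each retrained model should also be cross-validated and pass through a process that verifies the quality of its predictions.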
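To make the retraining idea concrete, here is a minimal sketch of a drift check that could decide when to trigger retraining. It is only an illustration: the function names, the single-feature comparison, and the threshold of 0.5 standard deviations are assumptions, not part of any Oracle service.

```python
from statistics import mean, pstdev


def drift_score(reference, recent):
    """Shift of the recent mean from the reference mean, in reference standard deviations."""
    spread = pstdev(reference)
    if spread == 0:
        # A constant reference feature: any change at all counts as unbounded drift.
        return 0.0 if mean(recent) == mean(reference) else float("inf")
    return abs(mean(recent) - mean(reference)) / spread


def should_retrain(reference, recent, threshold=0.5):
    """Return True when the feature has drifted past the (assumed) threshold."""
    return drift_score(reference, recent) > threshold
```

A real pipeline would usually track several features and the model's prediction quality, but the decision has the same shape: compare new data against a reference and retrain when the gap gets too large.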

Oracle Data Science Jobs Service

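The Oracle Data Science Jobs service is part of the Data Science service and enables MLOps on Oracle Cloud. It provides on-demand infrastructure when a process needs to run, which helps optimize cost. A job defines a repeatable task that runs on fully managed infrastructure, and that infrastructure exists only while the job is running.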

Enables MLOps for Oracle Cloud
Provides the on-demand infrastructure required to run processes
Runs repeatable tasks on fully managed infrastructure

OCI Jobs: Benefits

Fully Managed : No need to install or manage software or servers. Reduces complexity.

Integrated : Jobs is an OCI native service, so you can leverage Oracle Cloud capabilities.

On-Demand : Infrastructure is provisioned on demand and automatically deprovisioned at job end.

Jobs Versus Job Runs

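Data Science Jobs has two key concepts: the job itself and the job run. A job is a template that describes the task and specifies the details of its components. It defines the infrastructure and the artifact for the actual use case. A job run is a single execution of a job, usually with some custom parameters set. Each job can have multiple associated job runs.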
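The template-versus-execution relationship can be sketched with two small data classes. This is a conceptual model only: the class and field names are made up for illustration and do not mirror the OCI SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Job:
    """The template: describes the task and its default parameters."""
    name: str
    artifact: str  # immutable once the job is created
    environment: dict = field(default_factory=dict)
    cli_args: str = ""


@dataclass(frozen=True)
class JobRun:
    """A single execution of a job, optionally overriding its defaults."""
    job: Job
    environment_override: dict = field(default_factory=dict)
    cli_args_override: Optional[str] = None

    def effective_environment(self):
        # Values set on the run win over the defaults defined on the job.
        return {**self.job.environment, **self.environment_override}

    def effective_cli_args(self):
        if self.cli_args_override is None:
            return self.job.cli_args
        return self.cli_args_override
```

One `Job` can back any number of `JobRun` objects, each with its own overrides, which is the one-to-many relationship described above.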

Job Components

Name
Artifact : Instructions for the job to be run; required and immutable. It cannot be modified after it is uploaded to the job.
Environment Variables
Command-Line Arguments

Environment Variables and Command-Line Arguments are optional parameters that can be customized for each job run. The values may be the same for all future job runs, or each job run can override them.

Compute: CPU, GPU
Logging
Block Storage
VCN
Max Run Time

A single job can have several sequential or simultaneous job runs with different parameters. For example, you can apply different hyperparameters to the same model and compare the results.
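As a rough illustration of the hyperparameter case, the sketch below expands a small grid into one set of environment variable overrides per job run. The helper name and the variable names (`LEARNING_RATE`, `MAX_DEPTH`) are hypothetical.

```python
from itertools import product


def run_overrides(grid):
    """Yield one environment-variable override dict per hyperparameter combination."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        # Environment variables are strings, so every value is converted.
        yield {name: str(value) for name, value in zip(names, values)}
```

Calling `list(run_overrides({"LEARNING_RATE": [0.1, 0.01], "MAX_DEPTH": [3, 5]}))` gives four override sets, one for each job run to start against the same job.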

Jobs Life Cycle

Create Job

- Name
- Artifact
- Environment Variables
- Command-Line Arguments
- Compute: CPU, GPU
- Logging
- Block Storage
- Max Run Time
- VCN

Run Job

- Name
- Environment Variables
- Command-Line Arguments
- Logging
- Max Run Time

Monitor + Log

- Compute Metrics
- Service Logging
- Custom Logging

End

- Finish/Cancel
- Deprovisioning
- Events
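From the caller's side, the life cycle above often turns into a polling loop that waits for a job run to end. The sketch below assumes a hypothetical `get_state` callable that returns the current state as a string; the terminal state names are assumptions and should be checked against the service documentation.

```python
import time

# Assumed names for states in which a job run has ended.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def wait_for_job_run(get_state, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_state() until the job run ends or the timeout elapses."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"job run still in state {state!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting.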

Ways to Run Jobs

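Jobs can be run in several ways and accept several different types of artifacts.

For simple projects with a single-file script, you can use Python or Bash/Shell. Job runs come with Python preinstalled. Your code can use all Python system libraries, as well as any third-party libraries that are installed. Running the job in a conda environment lets you control and package all the third-party Python dependencies the job run needs, such as NumPy, Dask, or XGBoost. Bash and Shell scripts run on Oracle Linux.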
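A single-file Python artifact might look something like the sketch below. It reads one command-line argument and one environment variable, which are the two kinds of per-run parameters a job accepts. The `--epochs` flag and the `DATASET_NAME` variable are invented for this example.

```python
import argparse
import os


def parse_settings(argv, environ):
    """Combine command-line arguments and environment variables into run settings."""
    parser = argparse.ArgumentParser(description="Hypothetical single-file job artifact")
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args(argv)
    return {"epochs": args.epochs, "dataset": environ.get("DATASET_NAME", "sample")}


def main(argv=None):
    settings = parse_settings(argv, os.environ)
    # Anything printed here goes to standard output of the job run.
    print(f"training on {settings['dataset']} for {settings['epochs']} epochs")
    return settings


if __name__ == "__main__":
    main()
```

Keeping the parsing in its own function makes the script easy to test locally before uploading it as an artifact.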

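If your project is more complex and needs more code than is practical in a single file, you can use a .zip or .tar file. If you have a complex Python project with Shell scripts, you can archive the whole project and run it as a job that uses a Data Science service conda environment or a custom conda environment. Zip and compressed tar artifacts can also include a runtime YAML file that sets the environment variables the job run needs. For the job run, use the JOB_RUN_ENTRYPOINT environment variable to point to the main entry file. This variable is used only for jobs with zip or compressed tar job artifacts.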
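Packaging a multi-file project can be sketched with the standard library alone. The example below builds a zip archive in memory and produces the `JOB_RUN_ENTRYPOINT` setting after checking that the entry file really is in the archive. The helper names and the file layout are hypothetical, and the exact path format the service expects for the entry point is an assumption.

```python
import io
import zipfile


def build_zip_artifact(files):
    """Pack a {path: source_text} mapping into zip archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, source in files.items():
            archive.writestr(path, source)
    return buffer.getvalue()


def entrypoint_environment(artifact_bytes, entrypoint):
    """Return the environment variable that points the job run at its main file."""
    with zipfile.ZipFile(io.BytesIO(artifact_bytes)) as archive:
        if entrypoint not in archive.namelist():
            raise ValueError(f"{entrypoint} is not in the artifact")
    return {"JOB_RUN_ENTRYPOINT": entrypoint}
```

Validating the entry point locally catches a misspelled path before a job run is started and fails on it.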

Access for Jobs

Oracle Cloud Access for Jobs

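Jobs and job runs both support multiple access and management options. With the appropriate policies in place, jobs can access all OCI resources in the tenancy, such as data in ADW or Object Storage. With an appropriately configured VCN, jobs can also access external sources. You can use Vault to provide a secure way to authenticate against third-party resources.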

External Access for Jobs

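Job runs support the OCI SDK and API. You can run jobs from a wide range of external services, including client machines, MLOps, Bitbucket, GitHub, or other CI/CD pipelines, Oracle or other AI services, and the Events service.

You can create jobs and start job runs in the OCI Console, but you can also create and run jobs with the OCI CLI or with the SDKs and tooling for several languages, including Python, Java, JavaScript, TypeScript, Go, Ruby, and Terraform.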
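As a rough sketch of the Python route, the code below builds a job run request and hands it to the OCI Python SDK. Treat the details as assumptions to verify against the SDK reference: the payload field names follow the camelCase REST shape as I understand it, the OCIDs are placeholders you must supply, and `start_job_run` needs the `oci` package plus a valid configuration file.

```python
def job_run_payload(project_id, compartment_id, job_id, display_name,
                    environment=None, cli_args=None):
    """Build a job run request body with optional per-run overrides."""
    override = {"jobType": "DEFAULT"}
    if environment:
        override["environmentVariables"] = dict(environment)
    if cli_args:
        override["commandLineArguments"] = cli_args
    return {
        "projectId": project_id,
        "compartmentId": compartment_id,
        "jobId": job_id,
        "displayName": display_name,
        "jobConfigurationOverrideDetails": override,
    }


def start_job_run(payload, profile="DEFAULT"):
    """Submit the payload with the OCI Python SDK (imported lazily)."""
    import oci  # assumption: the OCI Python SDK is installed and configured

    config = oci.config.from_file(profile_name=profile)
    client = oci.data_science.DataScienceClient(config)
    return client.create_job_run(payload).data
```

Separating the payload builder from the network call keeps the part that a CI/CD pipeline varies per run easy to inspect and test.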

Batch Inference

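One way to use jobs is for batch inference. Batch inference is an asynchronous process that bases its predictions on a batch of observations or data. The model is built and stored in Object Storage. When new data becomes available, or at a set interval such as hourly or daily, a new job is triggered to analyze the incoming data. You need to configure a shape with enough processing power to handle the volume of incoming data.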
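The flow of a batch inference job can be sketched as load, read, score, write. In the example below, `load_model`, `read_new_rows`, and `write_predictions` are hypothetical callables standing in for Object Storage access, and the model is any callable that maps one row to one prediction.

```python
def score_batch(model, rows):
    """Apply the model to every row of the batch."""
    return [model(row) for row in rows]


def run_batch_job(load_model, read_new_rows, write_predictions):
    """One triggered run: load the stored model, score the new data, save the results."""
    model = load_model()
    rows = read_new_rows()
    predictions = score_batch(model, rows)
    write_predictions(predictions)
    return len(predictions)
```

Because nothing waits on the result in real time, the whole function can run inside a single job run that starts on a schedule or when new data lands.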

Mini Batch

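Sometimes so much data arrives, or the data must be processed so quickly, that mini batches are the better choice. For example, you may not want to wait for millions of transactions to accumulate, because analyzing that much data at once would slow down the whole detection process. Mini batch jobs like this are usually executed with a cron scheduler or triggered by some progress or event.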
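A minimal sketch of the mini batch idea is a generator that cuts an incoming stream into small fixed-size groups, so each group can be scored as soon as it is full. The function name and batch size are illustrative.

```python
from itertools import islice


def mini_batches(records, size):
    """Yield consecutive lists of at most `size` records from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

For example, `list(mini_batches(range(7), 3))` gives `[[0, 1, 2], [3, 4, 5], [6]]`, and each of those lists could be handed to a short job run.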

Distributed Batch

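The third batch processing option is distributed batch. When the data volume is very large, it can be split into chunks so that multiple models and jobs run at the same time to speed up processing. Each of these tasks or jobs can execute independently, with no dependencies between them. These are also known as embarrassingly parallel jobs.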
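The split-and-run-independently pattern can be sketched locally as follows. A thread pool stands in for separate job runs here, which is a simplification: in practice each chunk would go to its own job run on its own infrastructure. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, parts):
    """Split rows into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for index in range(parts):
        end = start + size + (1 if index < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def score_partitions(model, rows, parts=4):
    """Score each chunk independently, then stitch results back in order."""
    chunks = partition(rows, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(lambda chunk: [model(row) for row in chunk], chunks))
    return [prediction for chunk in results for prediction in chunk]
```

Since no chunk depends on another, adding more workers (or more job runs) shortens the wall-clock time without changing the results.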

	Batch Inference	Mini Batch Inference	Distributed Batch Inference
Infrastructure	Large	Light to medium	Very large
VM	Single	Single or multiple (at a small scale)	Multiple
Provisioning Speed – Required	Medium	Fast	Average to slow
Scheduler – Required	Yes	Yes	Use case dependent
Trigger – Required	Yes	Yes	No
Workloads	Large	Light	Large or heavy
Datasets Size	Large	Small	Extremely large or auto-scaling
Batch Process Time (estimate; varies by use case)	Medium to very long (usually from tens of minutes to hours or days)	Short to near real-time	Medium to very long (usually from a few hours up to days)
Model Deployment	Not required	Yes, but not required	Not required
Endpoints	No	No	No

Compare Batch Inference Workloads

Scaling

Compute Shape : 1 OCPU / 15 GB to 24 OCPUs / 320 GB

Block Storage : 50 GB to 10 TB

Jobs Monitoring and Logging

Monitoring and logging is the last step in the job’s life cycle before the job is finished and the infrastructure is de-provisioned. It provides you with insights into your jobs’ performance and metrics and creates records for each job run for later reference.

Monitoring consists of metrics and alarms, and it enables you to check the health, capacity, and performance of your cloud resources.

Alarms are a passive monitoring service, triggered when a metric breaches the configured alarm thresholds.

Depending on your instance, metrics track CPU or GPU utilization, the percentage of available job run container memory in use, container network bytes in and out, and container disk utilization.
When these numbers reach a certain threshold, you will want to scale up your resources, such as block storage and compute shape, to accommodate the workload.
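The threshold logic behind an alarm can be sketched in a few lines. The metric names and limits below are invented for illustration and are not the names the Monitoring service uses.

```python
def breached_metrics(metrics, thresholds):
    """Return the names of metrics at or above their configured threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    )
```

For example, with `{"cpu_utilization": 92.0, "disk_utilization": 85.0}` and limits of 90 and 80, both metrics are reported, which would be the signal to pick a larger compute shape or more block storage for the next job run.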

Job Run Logs

With Jobs, customers can either use service logs or custom logs.

Service logs are standard out-of-the-box logs emitted by the job run to OCI Logging service.
Custom logs capture log events collected in a particular context, and the customer specifies the location where the logs are stored. You can use the Logging service to enable, manage, and browse job run logs for your jobs. Both standard out (stdout) and standard error (stderr) outputs from your job artifact are captured and made available in the custom log.

The job run resource principal must also have the appropriate IAM permissions to write to service logs or custom logs.

Integrating jobs resources with the Logging service is optional but recommended.

Each job run can send outputs to its own log or all can use the same log, though it is recommended to use separate logs for each job.

Logs aren’t deleted when the job and job runs are deleted.
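Because stdout and stderr from the artifact are both captured, a job script only needs ordinary Python logging to show up in the log. The sketch below routes informational messages to stdout and warnings or errors to stderr. The logger name and message format are arbitrary choices for this example.

```python
import logging
import sys


def configure_logging(stdout=None, stderr=None):
    """Send INFO-level records to stdout and WARNING or above to stderr."""
    logger = logging.getLogger("job_artifact")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False  # avoid duplicate lines through the root logger

    formatter = logging.Formatter("%(levelname)s %(message)s")

    out_handler = logging.StreamHandler(stdout or sys.stdout)
    out_handler.setLevel(logging.INFO)
    out_handler.addFilter(lambda record: record.levelno < logging.WARNING)
    out_handler.setFormatter(formatter)

    err_handler = logging.StreamHandler(stderr or sys.stderr)
    err_handler.setLevel(logging.WARNING)
    err_handler.setFormatter(formatter)

    logger.addHandler(out_handler)
    logger.addHandler(err_handler)
    return logger
```

Inside the artifact, `configure_logging().info("scoring started")` is then enough to leave a trace for that job run.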

Events, Rules, and Actions

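The Events service notifies you when a resource changes. An event is a structured message that indicates a change in a resource. An event can be a CRUD operation, which stands for create, read, update, and delete. You can respond to events by using Oracle Functions, Notifications, or Streaming.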

Event : Structured message that denotes a change in a resource
Rule : Filter that selects events to monitor and triggers an action
Action : User-defined response to event, e.g. triggering a notification
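The relationship between the three terms can be sketched as a tiny dispatcher: a rule filters events by type and, on a match, runs its action. The event type strings and the `eventType` field name used here are assumptions for illustration, not a list of real Data Science event types.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A filter over event types plus the action to run on a match."""
    event_types: frozenset
    action: Callable[[dict], None]

    def matches(self, event):
        return event.get("eventType") in self.event_types


def dispatch(event, rules):
    """Run the action of every rule that matches; return how many fired."""
    fired = 0
    for rule in rules:
        if rule.matches(event):
            rule.action(event)
            fired += 1
    return fired
```

In the real service the action would be a function invocation, a notification, or a stream message rather than a local Python callable.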