Oracle Data Science Model Catalog provides a method to track and immutably store models. The model catalog allows organizations to maintain the provenance of models during all phases of the model's lifecycle.
These model artifacts can be shared among data scientists, tracked for provenance, reproduced, and deployed. You create the model artifact to save with your model in the model catalog. This creates centralized storage of the model and enables you to track model metadata (a short ADS sketch follows the list below).
A model's artifacts include:
- the model
- hyperparameters
- metadata about the model
- the input and output schema
- a script to load the model and make predictions
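As a rough end-to-end illustration, the ADS SDK can assemble these pieces into an artifact and push it to the catalog. This is a minimal sketch, not the only workflow: it assumes an ADS 2.x notebook session, a small scikit-learn model, and a conda environment slug `generalml_p38_cpu_v1`, all of which are placeholder choices.

```python
# Minimal sketch: preparing and saving a model artifact with ADS 2.x
# from a Data Science notebook session (environment slug is assumed).
import ads
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from ads.model.generic_model import GenericModel

X, y = load_iris(return_X_y=True, as_frame=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

ads.set_auth("resource_principal")  # authenticate inside a notebook session

model = GenericModel(estimator=clf, artifact_dir="./model_artifact")

# prepare() generates score.py, runtime.yaml, and the input/output schemas.
model.prepare(
    inference_conda_env="generalml_p38_cpu_v1",  # assumed environment slug
    X_sample=X.head(),
    y_sample=y.head(),
    force_overwrite=True,
)

# save() zips the artifact and creates the entry in the model catalog.
model_id = model.save(display_name="my-model")
```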
Oracle Data Science Model Catalog
The purpose of the model catalog is to provide a managed and centralized storage space for models. It ensures that model artifacts are immutable and allows data scientists to share models, and reproduce them as needed.
The model catalog can be accessed directly from a notebook session with ADS. Alternatively, go to the Data Science projects page in the console, select a project, and then click Models. The Models page lists the model artifacts stored in the model catalog under that project.
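Outside a notebook, the plain OCI Python SDK can also list the catalog entries in a compartment; a sketch, assuming a configured ~/.oci/config file and a placeholder compartment OCID:

```python
# Sketch: listing model catalog entries with the OCI Python SDK.
import oci

config = oci.config.from_file()  # default profile in ~/.oci/config
ds_client = oci.data_science.DataScienceClient(config)

# The compartment OCID below is a placeholder.
models = ds_client.list_models(
    compartment_id="ocid1.compartment.oc1..<unique-id>"
).data
for m in models:
    print(m.id, m.display_name, m.lifecycle_state)
```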
Model Entry
Components of a model entry in the model catalog:
- Model artifact, i.e., a ZIP archive that includes the saved model object
- Metadata about the model provenance, including Git-related information
- Script or notebook used to push the model to the catalog
Model Artifacts
Model artifacts stored in the model catalog are immutable by design.
- To apply any changes to a model, create a new model.
- Immutability prevents unwanted changes.
- It ensures that any model in production can be traced back to the exact artifact behind the model predictions.
Maximum size limit of artifacts
- 100 MB when saved from the console
- 20 GB when saved from ADS, the OCI SDKs, or the CLI
Components of a Model Artifact
A model artifact is a ZIP archive of the files necessary to deploy your model as a model deployment.
- Custom logic, score.py: contains your custom logic for loading serialized model objects into memory and defines an inference endpoint, predict().
- Deployment configuration, runtime.yaml: provides instructions about which conda environment to use when deploying the model with a Data Science model deployment.
- Test definitions, model_artifact_validate.py: an optional series of test definitions that you can run on your model artifact before saving it to the model catalog. These model introspection tests catch many of the most common errors made when preparing a model artifact.
- Third-party requirements, requirements.txt: lists the third-party dependencies that you must install in your local environment before running the introspection tests.
- README.md: gives step-by-step instructions to prepare and save a model artifact to the model catalog. Following these steps is highly recommended.
Important: Any code used for inference should be zipped at the same level as score.py or any level below the file. If any required files are present at folder levels above the score.py file, they are ignored and could result in deployment failure.
Custom Logic: score.py
score.py contains two functions whose parameters are customizable (a minimal sketch follows this list).
- load_model(): deserializes the model object and returns it.
- The score.py template uses load_model() to return the model estimator object.
- Any custom Python modules can be imported in score.py if they are available in the artifact file or as part of the conda environment used for inference purposes.
- predict(): takes in data and, optionally, the model, and returns a dictionary of prediction results.
- This function takes in data and the model object returned by load_model().
- The body of predict() can include data transformations and other data manipulation tasks before a model prediction is made.
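The sketch below illustrates the shape of such a file. It assumes a scikit-learn model serialized with joblib as model.joblib inside the artifact directory; the file name and serialization library are illustrative choices, not requirements.

```python
# score.py -- illustrative sketch; assumes a scikit-learn model was
# serialized with joblib as "model.joblib" in the artifact directory.
import os
import joblib
import pandas as pd

MODEL_FILE_NAME = "model.joblib"  # assumed file name


def load_model(model_file_name=MODEL_FILE_NAME):
    """Deserialize the model from the artifact directory and return it."""
    model_dir = os.path.dirname(os.path.realpath(__file__))
    return joblib.load(os.path.join(model_dir, model_file_name))


def predict(data, model=load_model()):
    """Transform the payload if needed and return predictions as a dict."""
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data)  # accept JSON-like payloads too
    return {"prediction": model.predict(data).tolist()}
```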
Deployment Configuration File: runtime.yaml
The purpose of the runtime YAML file is to provide the necessary runtime conda environment references for model deployment purposes. This file is required if you want to deploy your model by using the model deployment feature of the Data Science service. The key fields are listed below; an illustrative example follows the list.
- MODEL_ARTIFACT_VERSION: The version of this artifact format. This is automatically extracted by ADS when the model is saved in a notebook session.
- MODEL_DEPLOYMENT.INFERENCE_CONDA_ENV.INFERENCE_ENV_SLUG: The slug of the conda environment that you want to use for deployment and scoring. In most cases, the inference environment is the same as the training environment, though that does not have to be the case.
- MODEL_DEPLOYMENT.INFERENCE_CONDA_ENV.INFERENCE_ENV_TYPE: The type of the conda environment. There are two possible values: data_science or published.
- MODEL_DEPLOYMENT.INFERENCE_CONDA_ENV.INFERENCE_ENV_PATH: The Object Storage path of the conda environment. The path follows this syntax: oci://<bucket-name>@<namespace>/<file-path>.
- MODEL_DEPLOYMENT.INFERENCE_CONDA_ENV.INFERENCE_PYTHON_VERSION: The Python version of the conda environment that you want to use for model deployment. The default version is Python 3.6. The supported versions are Python 3.6 and Python 3.7.
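Put together, a runtime.yaml might look like the following sketch. The slug, bucket, namespace, and path values are placeholders, and the exact layout can vary between artifact versions; ADS generates this file for you when it prepares the artifact.

```yaml
# Illustrative runtime.yaml sketch; all values are placeholders.
MODEL_ARTIFACT_VERSION: '3.0'
MODEL_DEPLOYMENT:
  INFERENCE_CONDA_ENV:
    INFERENCE_ENV_SLUG: generalml_p37_cpu_v1     # assumed slug
    INFERENCE_ENV_TYPE: data_science             # or: published
    INFERENCE_ENV_PATH: oci://<bucket-name>@<namespace>/<file-path>
    INFERENCE_PYTHON_VERSION: '3.7'
```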
Model Catalog Documentation
Model Input and Output Schema: Description of the features that are necessary to make a successful model prediction (an inspection sketch follows this list).
- Input Schema: Provides the blueprint of the data parameter of the predict() function in score.py. It is the definition of the input feature vector that your model requires to make successful predictions.
- Output Schema: Documents what the predict() function returns.
- Successful Predictions: The schema definition is a description of the features that are necessary to make a successful model prediction.
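When prepare() is called with X_sample and y_sample, as in the earlier sketch, ADS infers both schemas automatically; a small sketch for inspecting them before saving, reusing the model name from above:

```python
# Sketch: inspecting the schemas ADS inferred from X_sample/y_sample
# (reuses `model` from the earlier GenericModel sketch).
print(model.schema_input)   # required input feature vector
print(model.schema_output)  # definition of what predict() returns
```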
Model Provenance: Documentation that helps you improve model reproducibility and auditability.
When you save a model with the ADS SDK, the provenance parameters are extracted automatically. When you work in Git, ADS can pick up the Git information and automatically populate the model provenance metadata. To reproduce a model, you need to know where it was trained, in which environment, with which source code and compute resources, and with which training data and generated features.
Model Introspection Tests: Series of tests and checks run on a model artifact to test all aspects of the operational health of the model.
Introspection tests are optional; you can run them before saving the model to the model catalog, and the test results can then be saved as part of the model metadata. When the introspection tests run, they generate a local test_json_output.json file.
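With ADS, introspection can also be triggered programmatically; a hedged sketch, again assuming the GenericModel instance from the earlier examples:

```python
# Sketch: running introspection before saving (ADS 2.x GenericModel).
results = model.introspect()  # executes the model_artifact_validate.py checks
print(results)                # pass/fail summary per test; the run also
                              # writes a local test_json_output.json file
```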
Model Taxonomy: Description of the model that you are saving to the model catalog.
- Preset model taxonomy : The metadata fields associated with model taxonomy allow you to describe the machine learning use case and framework behind the model.
- UseCaseType
- Framework and Framework Version
- Algorithm
- Hyperparameters
- ArtifactsTestResults
- Custom model taxonomy: You can add your own custom metadata to document your model (see the sketch after this list). The maximum allowed size for the combined defined and custom metadata is 32 KB.
- Key
- Value
- Category
- Description
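A sketch of attaching one custom metadata entry with ADS; the key, value, and description are invented examples, and MetadataCustomCategory is assumed from ADS 2.x:

```python
# Sketch: adding custom metadata to an ADS model (illustrative values).
from ads.model.model_metadata import MetadataCustomCategory

model.metadata_custom.add(
    key="training-dataset",  # hypothetical key
    value="oci://<bucket-name>@<namespace>/train.csv",
    category=MetadataCustomCategory.TRAINING_AND_VALIDATION_DATASETS,
    description="Location of the training data used for this model",
)
model.save(display_name="my-model-with-metadata")
```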