In OCI – Spark & Data Flow, we explore how Spark, Data Flow, and OCI relate to one another.
Oracle Cloud Infrastructure Data Flow
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed service for the open-source Apache Spark engine. It runs processing jobs against very large datasets with no infrastructure to deploy or manage. Developers can also use Spark Streaming to perform cloud ETL on continuously generated streaming data. This speeds up application delivery, because developers can focus on application development rather than infrastructure management.
Data Flow enables you to:
- Run large-scale Apache Spark jobs for data science and machine learning
- Run applications written in any Spark language: PySpark, SQL, Java, or Scala
- Process any data in Object Storage (or other Spark-compatible data sources via connectors)
- Perform a variety of data preparation tasks, including data aggregation and transformation, feature engineering, data cleaning, and data joins (see the sketch after this list)
- The Spark distribution also comes with the MLlib machine learning library, which offers a comprehensive set of algorithms and models that can be trained at scale on Spark dataframes.
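As a minimal illustration of the data-preparation tasks listed above, the hedged PySpark sketch below aggregates one DataFrame and joins it with another. The column names (customer_id, amount, region) and the in-memory sample data are hypothetical placeholders, not from the source.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point; on Data Flow a session already exists, but getOrCreate() is safe either way.
spark = SparkSession.builder.appName("data-prep-example").getOrCreate()

# Hypothetical sample data standing in for tables read from Object Storage.
orders = spark.createDataFrame(
    [(1, 120.0), (1, 80.0), (2, 200.0)], ["customer_id", "amount"])
customers = spark.createDataFrame(
    [(1, "EMEA"), (2, "APAC")], ["customer_id", "region"])

# Aggregation + join: total spend per customer, enriched with region.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
result = totals.join(customers, on="customer_id", how="inner")
result.show()
```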
Data Flow: Components
- An Application is an infinitely reusable Spark application template consisting of a Spark application, its dependencies, default parameters, and a default run-time resource specification. Once a developer creates a Data Flow Application, anyone can use it without worrying about the complexities of deploying it, setting it up, or running it.
- The Library is the central repository of Data Flow Applications. Anyone can browse, search, and execute applications published to the Library, provided they have the correct permissions in the Data Flow system.
- Every time a Data Flow Application is run, a Run is created. The Data Flow Run captures the Application's output, logs, and statistics, which are automatically and securely stored. Output is saved so it can be viewed by anyone with the correct permissions using the UI or REST API.
- Spark generates Spark Log files, which are useful for debugging and diagnostics. Each Data Flow Run automatically stores its log files in an Object Storage bucket. You can access them via the UI or API, subject to the Run's authorization policies.
Data Flow: Capabilities
- Connect to Apache Spark data sources and launch a job in seconds
- Create reusable Apache Spark applications in any Spark language
- Manage all Apache Spark applications from a single platform
- Process data in the Cloud or on-premises in your data center
- Bring your own connectors to connect to Object Storage
- Execute any Spark job (in our case, PySpark and SQL) with no changes to the source code
Data Flow: Security
- Privacy: Private clusters, VMs, networks, and isolated pools
- Encryption: Data encrypted at rest and in motion
- Access Control: Authentication and authorization with OCI IAM
Spark Application Configuration
When you create an Application in Data Flow, you select its Spark resources: you must choose the driver and executor shapes, as well as the number of executors.
SparkContext is the entry point to any Spark functionality.
The Spark Driver is the central coordinator that executes your code. It communicates with the Cluster Manager that launches the Spark application and with the distributed worker nodes.
Each node has at least one Executor, which is responsible for running tasks. Executors register themselves with the Driver and report their information to it.
This combination of Driver and Workers working together is called a Spark Application.
As a sizing example, processing 500 GB of data within 10 hours would require 5 executor OCPUs.
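A hedged sketch of what a minimal PySpark script on Data Flow looks like, showing the SparkSession/SparkContext entry point and how the Run's resource specification surfaces as standard Spark properties. The Object Storage path is a placeholder (bucket, namespace, and prefix are assumptions).

```python
from pyspark.sql import SparkSession

# SparkSession wraps the SparkContext and is the entry point to Spark functionality.
spark = SparkSession.builder.appName("dataflow-example").getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext

# The executor count and cores chosen for the Run appear as standard Spark properties.
print("executor instances:", spark.conf.get("spark.executor.instances", "not set"))
print("executor cores:    ", spark.conf.get("spark.executor.cores", "not set"))
print("default parallelism:", sc.defaultParallelism)

# Hypothetical Object Storage path; replace bucket, namespace, and prefix with your own.
df = spark.read.option("header", "true").csv("oci://my-bucket@my-namespace/input/")
df.show(10)
```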
OCI – Spark & Data Flow Integration with Data Science
Data Flow integrates with the Oracle Accelerated Data Science (ADS) Python library. You can create Data Flow Applications and submit Runs to Data Flow using ADS, which is preinstalled in Data Science notebook sessions.
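A hedged sketch of that integration, following the job pattern in the oracle-ads documentation. The compartment OCID, bucket URIs, shapes, and script path are placeholders, and exact method names may vary across ADS versions.

```python
import ads
from ads.jobs import Job, DataFlow, DataFlowRuntime

# In a Data Science notebook session, resource principal auth avoids managing API keys.
ads.set_auth("resource_principal")

job = Job(
    name="pyspark-etl-example",  # hypothetical name
    infrastructure=(
        DataFlow()
        .with_compartment_id("ocid1.compartment.oc1..xxxx")       # placeholder OCID
        .with_driver_shape("VM.Standard2.1")                      # placeholder shape
        .with_executor_shape("VM.Standard2.1")
        .with_num_executors(2)
        .with_logs_bucket_uri("oci://dataflow-logs@my-namespace/")
    ),
    runtime=(
        DataFlowRuntime()
        .with_script_uri("oci://my-bucket@my-namespace/scripts/etl.py")
        .with_script_bucket("oci://my-bucket@my-namespace/scripts/")
    ),
)

job.create()     # registers the Data Flow Application
run = job.run()  # submits a Run and returns a handle for monitoring
```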
How to use Data Flow and Data Science to create and manage Spark apps.
- Prerequisites for Data Science: create a Data Science project and notebook session with the appropriate permissions
- Prerequisites for Data Flow:
  - Object Storage buckets to store the Data Flow logs and the Data Flow warehouse
  - A Spark application in Java, Scala, SparkSQL, or PySpark uploaded to Object Storage
  - The data to be processed, also uploaded to Object Storage
  - Policies granting permission to read the storage buckets and other resources (see the example statements after this list)
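A hedged example of the kind of IAM policy statements involved; the group name, compartment name, and scope are placeholders, and the exact statements you need depend on your tenancy setup.

```
Allow group dataflow-users to manage dataflow-family in compartment data-projects
Allow group dataflow-users to read buckets in compartment data-projects
Allow group dataflow-users to manage objects in compartment data-projects
```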
Build and Train ML Models with Data Flow
- Launch a notebook session in Data Science.
- Install a Spark Conda environment.
- Configure the environment.
- Develop your ML training script in PySpark (a sketch follows this list).
- Create a Data Flow application and run with ADS.
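For the training-script step above, a minimal, hedged PySpark MLlib sketch. The input path, feature and label column names, and the choice of logistic regression are illustrative assumptions, not from the source.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-training-example").getOrCreate()

# Hypothetical training data in Object Storage with a binary 'label' column.
df = (spark.read.option("header", "true").option("inferSchema", "true")
      .csv("oci://my-bucket@my-namespace/training-data/"))

# Assemble feature columns into the single vector column MLlib expects, then train at scale.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(df)

# Persist the trained model back to Object Storage (path is a placeholder).
model.write().overwrite().save("oci://my-bucket@my-namespace/models/lr-model")
```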