Oracle Data Science Data Access - Kxodia 肯佐迪亞

Data Access : Real-world data resides in many different data sources. Organizations store their data on multiple platforms and in different formats. Usually referred to as raw data, it can be application logs, user event logs, mobile/web data, on-premises batch data, or streaming data from devices. No domain is untouched. Data can be accessed on OCI from different sources, either locally or remotely.

Data Collection

Batch Data : Data silos created over a period of time from daily workloads, backups, and migrations.
Streaming Data : Messages or logs from various user events and IoT devices, such as geospatial or telemetry data.
Application Data : Usually collected through API calls, application events, and log files.

Data Access from Some Common Sources

OCI Object Storage

To load a dataframe from Object Storage, use a URI of the form:

oci://<bucketname>@<namespace>/<file-name>

Authentication can use either api_key or resource_principal:

ads.set_auth(auth="api_key", profile="DEFAULT") or
ads.set_auth(auth="resource_principal")
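
The two steps above (choose an authentication method, then read from an oci:// URI) can be put together as in the following minimal sketch. It assumes the oracle-ads and ocifs packages are installed. The helper names oci_uri and read_csv_from_object_storage are hypothetical and exist only for illustration, and the bucket, namespace, and file names in the usage note are placeholders.

```python
def oci_uri(bucket: str, namespace: str, object_name: str) -> str:
    """Build a URI of the form oci://<bucketname>@<namespace>/<file-name>."""
    return f"oci://{bucket}@{namespace}/{object_name.lstrip('/')}"


def read_csv_from_object_storage(bucket: str, namespace: str, object_name: str,
                                 use_resource_principal: bool = False):
    """Read a CSV object from Object Storage into a pandas DataFrame."""
    # Imported lazily so that oci_uri() can be used without ADS installed.
    import ads
    import pandas as pd
    from ads.common.auth import default_signer

    if use_resource_principal:
        # Typical inside an OCI notebook session or job.
        ads.set_auth(auth="resource_principal")
    else:
        # Uses the DEFAULT profile of the local OCI configuration file.
        ads.set_auth(auth="api_key", profile="DEFAULT")

    # default_signer() returns the credentials selected by set_auth().
    return pd.read_csv(oci_uri(bucket, namespace, object_name),
                       storage_options=default_signer())
```

For example, read_csv_from_object_storage("my-bucket", "my-namespace", "data/train.csv") would read oci://my-bucket@my-namespace/data/train.csv using the API key profile.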

Local Storage

To load a dataframe from a local source, use functions from pandas directly

Pandas

import pandas as pd
df = pd.read_csv("/path/to/data.data")

OCI Autonomous DB (ADB/ATP)

Use pd.DataFrame.ads.read_sql(…) to read a SQL query or database table into a dataframe. It is up to 15 times faster than pandas.read_sql() because it is written specifically for Oracle databases.

A connection to Autonomous Database is established using the database wallet file downloaded from OCI.
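
A wallet-based read might look like the sketch below. It assumes oracle-ads is installed with its Oracle database dependencies. The helper names are hypothetical, and the connection parameter keys (user_name, password, service_name, wallet_location) follow the pattern used in the ADS documentation, so check them against the ADS version you run.

```python
def adb_connection_parameters(user_name: str, password: str,
                              service_name: str, wallet_location: str) -> dict:
    """Collect the settings needed for a wallet-based ADB connection."""
    return {
        "user_name": user_name,
        "password": password,
        "service_name": service_name,        # service name defined in the wallet
        "wallet_location": wallet_location,  # full path to the wallet zip file
    }


def read_query(sql: str, connection_parameters: dict):
    """Run a SQL query against Autonomous Database and return a DataFrame."""
    # Importing ads registers the .ads accessor on pandas DataFrames.
    import ads  # noqa: F401
    import pandas as pd

    return pd.DataFrame.ads.read_sql(
        sql,
        connection_parameters=connection_parameters,
    )
```

A call such as read_query("SELECT * FROM MY_TABLE", params) would then return the query result as a dataframe. MY_TABLE is a placeholder table name.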

OCIFS / PyArrow

Apache Arrow is a development platform for in-memory analytics.

The Arrow Python bindings (PyArrow) have first-class integration with NumPy, pandas, and built-in Python objects.

ADS supports reading files into the PyArrow data set directly via ocifs.
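
As a rough sketch of this, the function below opens Parquet files under an Object Storage prefix as a PyArrow dataset. It assumes the ocifs, pyarrow, and oracle-ads packages are installed and that ads.set_auth(...) has already been called. The helper names are hypothetical. Passing the output of default_signer() straight to OCIFileSystem is also an assumption to verify against your ADS and ocifs versions.

```python
def object_storage_path(bucket: str, namespace: str, prefix: str) -> str:
    """Path without the oci:// scheme, used when a filesystem is passed explicitly."""
    return f"{bucket}@{namespace}/{prefix.strip('/')}"


def open_parquet_dataset(bucket: str, namespace: str, prefix: str):
    """Open the Parquet files under an Object Storage prefix as a PyArrow dataset."""
    # Imported lazily so the path helper works without these packages installed.
    import ocifs
    import pyarrow.dataset as ds
    from ads.common.auth import default_signer

    # Filesystem object that authenticates with the signer chosen via ads.set_auth().
    fs = ocifs.OCIFileSystem(**default_signer())

    # PyArrow accepts fsspec-compatible filesystems such as ocifs.
    return ds.dataset(object_storage_path(bucket, namespace, prefix),
                      format="parquet", filesystem=fs)
```

Calling .to_table().to_pandas() on the returned dataset converts it to a pandas dataframe when needed.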

MySQL

Available with Accelerated Data Science (ADS) v2.5.6 and later.

Load a dataframe from a MySQL database using pd.DataFrame.ads.read_sql.

Save the dataframe df to MySQL using df.ads.to_sql.

In both calls, set engine="mysql".
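
The read and write calls above can be sketched as follows. This assumes oracle-ads v2.5.6 or later with a MySQL driver installed. The helper names are hypothetical, and the connection parameter keys follow the pattern used in the ADS documentation.

```python
def mysql_connection_parameters(user_name: str, password: str, host: str,
                                port: str, database: str) -> dict:
    """Collect the settings needed to connect to a MySQL database."""
    return {
        "user_name": user_name,
        "password": password,
        "host": host,
        "port": port,
        "database": database,
    }


def load_table(sql: str, connection_parameters: dict):
    """Load the result of a SQL query from MySQL into a DataFrame."""
    # Importing ads registers the .ads accessor on pandas DataFrames.
    import ads  # noqa: F401
    import pandas as pd

    return pd.DataFrame.ads.read_sql(
        sql,
        connection_parameters=connection_parameters,
        engine="mysql",
    )


def save_table(df, table_name: str, connection_parameters: dict) -> None:
    """Save a DataFrame to a MySQL table, replacing the table if it exists."""
    import ads  # noqa: F401

    df.ads.to_sql(
        table_name,
        connection_parameters=connection_parameters,
        if_exists="replace",
        engine="mysql",
    )
```

The if_exists="replace" choice overwrites an existing table. Use a different value if existing rows should be kept.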

Amazon S3

You can open public and private Amazon S3 files in Accelerated Data Science.

For private files, pass the credentials through storage_options.

For large files, increase the blocksize.
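
A minimal sketch of these three cases using pandas with the s3fs package (an assumption, since the text does not say which reader is used). The helper names are hypothetical. With s3fs, the block size option is named default_block_size, which is assumed here to correspond to the blocksize mentioned above.

```python
def s3_storage_options(key=None, secret=None, block_size=None) -> dict:
    """Build storage_options for a public or private S3 file."""
    if key and secret:
        # Private file: pass the AWS access key and secret.
        options = {"key": key, "secret": secret}
    else:
        # Public file: anonymous access, no credentials needed.
        options = {"anon": True}
    if block_size is not None:
        # Larger blocks mean fewer requests when reading large files.
        options["default_block_size"] = block_size
    return options


def read_csv_from_s3(s3_url: str, key=None, secret=None, block_size=None):
    """Read a CSV file from an s3:// URL into a pandas DataFrame."""
    # Imported lazily so the options helper works without pandas installed.
    import pandas as pd

    return pd.read_csv(
        s3_url,
        storage_options=s3_storage_options(key, secret, block_size),
    )
```

For example, read_csv_from_s3("s3://my-bucket/data.csv") reads a public file, where my-bucket is a placeholder bucket name.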

HTTP(S) endpoints

To open a data set from a remote web server source, pass the URL of the data to pandas.
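
A short sketch of reading from an HTTP(S) endpoint with pandas. The helper names are hypothetical, the scheme check is an illustrative safeguard rather than something pandas requires, and the URL in the usage note is a placeholder.

```python
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True when the URL uses the http or https scheme."""
    return urlparse(url).scheme in ("http", "https")


def read_remote_csv(url: str):
    """Read a CSV file from a remote web server into a pandas DataFrame."""
    if not is_http_url(url):
        raise ValueError(f"Expected an http(s) URL, got: {url}")
    # Imported lazily so the URL check works without pandas installed.
    import pandas as pd

    # pandas downloads and parses the file in a single call.
    return pd.read_csv(url)
```

For example, read_remote_csv("https://example.com/path/to/data.csv") would return the remote file as a dataframe.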

DatasetBrowser

To open a data set from reference libraries, use DatasetBrowser.

To see supported libraries, use DatasetBrowser.list().
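
For illustration, the sketch below lists the reference libraries and opens one scikit-learn data set. It assumes oracle-ads and scikit-learn are installed. The function name is hypothetical, and the data set name "wine" is only an example choice.

```python
def open_sklearn_dataset(name: str = "wine"):
    """Open a reference data set from the scikit-learn library via DatasetBrowser."""
    # Imported lazily so this module can be loaded without ADS installed.
    from ads.dataset.dataset_browser import DatasetBrowser

    # Show which reference libraries are available.
    print(DatasetBrowser.list())

    # Pick the scikit-learn library and show the data sets it offers.
    sklearn_library = DatasetBrowser.sklearn()
    print(sklearn_library.list())

    # Open the requested data set.
    return sklearn_library.open(name)
```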

Data Types

To inspect the ADS data type of each column, use feature_types.

To see the summary information about the data set, use show_in_notebook().
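
Both calls can be combined in a small helper, sketched below. It takes an already opened ADS data set and assumes feature_types behaves like a mapping from column name to type, which you should confirm for your ADS version. The function name is hypothetical.

```python
def describe_dataset(ds) -> dict:
    """Print each column's ADS type, show the summary, and return the types as text."""
    types_by_column = {}
    for column, feature_type in ds.feature_types.items():
        # str() gives a readable description of the type object.
        types_by_column[column] = str(feature_type)
        print(f"{column}: {feature_type}")

    # Renders the summary view; intended for use inside a notebook.
    ds.show_in_notebook()
    return types_by_column
```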

There are four ADS semantic data types:

Categorical or Qualitative: Data that can be labeled into different groups. Categorical data can be either nominal or ordinal.
Continuous: Quantitative data that is measured rather than counted, usually expressed as fractional values.
Datetime: Data in a date and time format.
Ordinal: Categorical data with an intrinsic ordering. For example, education level: Elementary, High School, Undergraduate, Graduate.

Nominal data, by contrast, is categorical data used for labeling without any quantitative value or intrinsic ordering. For example, race: it can be categorized, but no category ranks above another.

Supported Sources/Formats by Oracle ADS

ADS doesn't support text, DOC, PDF, or raw image files, nor sequence (list, tuple, range), dict, and set type data.

ADS does provide a text extraction module to convert PDF and DOC files into plain text files.