Data Access: Real-world data resides in many different data sources. Organizations typically store their data on multiple platforms and in different formats. Usually referred to as raw data, it can include application logs, user event logs, mobile/web data, on-premises batch data, or streaming data from devices. No domain is untouched. On OCI, data can be accessed from different sources, either local or remote.
Data Collection
- Batch Data: Data silos built up over time from daily workloads, backups, and migrations.
- Streaming Data: Messages or logs from user events and IoT devices, such as geospatial or telemetry data.
- Application Data: Usually collected through API calls, application events, and log files.
Data Access from Some Common Sources
OCI Object Storage
To load a dataframe from Object Storage, use a URI of the form:
oci://<bucketname>@<namespace>/<file-name>
Authentication can use either api_key or resource_principal:
- ads.set_auth(auth="api_key", profile="DEFAULT") or
- ads.set_auth(auth="resource_principal")
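The steps above can be put together as a minimal sketch. The helper names object_storage_uri and read_csv_from_object_storage are hypothetical, and the sketch assumes that the oracle-ads, ocifs, and pandas packages are installed and that the signer returned by ads.common.auth.default_signer() is passed to pandas through storage_options.

```python
def object_storage_uri(bucket: str, namespace: str, file_name: str) -> str:
    """Build an oci:// URI in the <bucketname>@<namespace>/<file-name> form."""
    return f"oci://{bucket}@{namespace}/{file_name.lstrip('/')}"


def read_csv_from_object_storage(bucket: str, namespace: str, file_name: str):
    """Load a CSV object from Object Storage into a pandas dataframe."""
    # Imported lazily so the URI helper works without the OCI packages.
    import ads
    import pandas as pd
    from ads.common.auth import default_signer

    # Authenticate with the DEFAULT profile of the local OCI config file.
    # Inside a notebook session, ads.set_auth(auth="resource_principal")
    # is the alternative.
    ads.set_auth(auth="api_key", profile="DEFAULT")

    uri = object_storage_uri(bucket, namespace, file_name)
    # Assumption: pandas forwards storage_options to ocifs, which handles
    # the oci:// scheme.
    return pd.read_csv(uri, storage_options=default_signer())
```

Calling read_csv_from_object_storage("my-bucket", "my-namespace", "data/train.csv") would then read oci://my-bucket@my-namespace/data/train.csv; the bucket, namespace, and file names here are placeholders.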
Local Storage
To load a dataframe from a local source, use pandas functions directly.
Pandas
import pandas as pd
df = pd.read_csv("/path/to/data.data")
OCI Autonomous DB (ADB/ATP)
Use pd.DataFrame.ads.read_sql(…) to read a SQL query or a database table into a dataframe. Because it is written specifically for Oracle databases, it is reported to be up to 15 times faster than pandas.read_sql().
A connection to Autonomous Database is established through the database wallet file downloaded from OCI.
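As a rough illustration, a wallet-based read might look like the sketch below. The helper names are hypothetical, the credential values are placeholders, and the connection_parameters keys (user_name, password, service_name, wallet_location) follow the form shown in the ADS documentation; treat them as an assumption to verify against your ADS version.

```python
def adb_connection_parameters(user_name: str, password: str,
                              service_name: str, wallet_location: str) -> dict:
    """Collect the settings needed for a wallet-based ADB connection."""
    return {
        "user_name": user_name,
        "password": password,
        "service_name": service_name,        # for example, a *_high/_medium/_low service
        "wallet_location": wallet_location,  # path to the downloaded wallet zip
    }


def read_adb_query(sql: str, connection_parameters: dict):
    """Run a SQL query against ADB and return the result as a dataframe."""
    # Imported lazily: requires the oracle-ads package.
    import ads  # noqa: F401  (assumed to register the .ads accessor on DataFrame)
    import pandas as pd

    return pd.DataFrame.ads.read_sql(
        sql, connection_parameters=connection_parameters
    )
```

In practice, read the password from a secret store or environment variable rather than hard-coding it in a notebook.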
OCIFS / PyArrow
Apache Arrow is a development platform for in-memory analytics.
The Arrow Python bindings (PyArrow) have first-class integration with NumPy, pandas, and built-in Python objects.
ADS supports reading files directly into a PyArrow dataset via ocifs.
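A minimal sketch of this approach follows, assuming that ocifs, pyarrow, and oracle-ads are installed and that the files under the prefix are Parquet. The helper names, the default format, and the use of default_signer() for credentials are illustrative assumptions.

```python
def dataset_path(bucket: str, namespace: str, prefix: str) -> str:
    """Build the <bucket>@<namespace>/<prefix>/ path that ocifs expects."""
    return f"{bucket}@{namespace}/{prefix.strip('/')}/"


def open_arrow_dataset(bucket: str, namespace: str, prefix: str,
                       file_format: str = "parquet"):
    """Open the files under an Object Storage prefix as a PyArrow dataset."""
    # Imported lazily: these packages are only needed when data is read.
    import ocifs
    import pyarrow.dataset as pa_ds
    from ads.common.auth import default_signer

    # ocifs exposes Object Storage through the fsspec filesystem interface,
    # which PyArrow can use directly.
    fs = ocifs.OCIFileSystem(**default_signer())
    return pa_ds.dataset(
        dataset_path(bucket, namespace, prefix),
        filesystem=fs,
        format=file_format,
    )
```

The returned dataset is lazy; calling .to_table().to_pandas() on it materializes the data as a pandas dataframe.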
MySQL
Available with Accelerated Data Science v2.5.6 and later.
To load a dataframe from a MySQL database, use pd.DataFrame.ads.read_sql.
To save the dataframe df to MySQL, use df.ads.to_sql.
In both cases, set engine="mysql".
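The read and write calls can be sketched as below. The helper names and the table name argument are hypothetical; the connection_parameters keys (user_name, password, host, port, database) and the if_exists argument follow the ADS documentation and should be checked against your installed version.

```python
def mysql_connection_parameters(user_name: str, password: str, host: str,
                                port: str, database: str) -> dict:
    """Collect the settings needed for a MySQL connection."""
    return {
        "user_name": user_name,
        "password": password,
        "host": host,
        "port": port,
        "database": database,
    }


def save_and_reload(df, table_name: str, connection_parameters: dict):
    """Write df to a MySQL table, then read the table back."""
    # Imported lazily: requires the oracle-ads package and a MySQL driver.
    import ads  # noqa: F401  (assumed to register the .ads accessor)
    import pandas as pd

    df.ads.to_sql(
        table_name,
        connection_parameters=connection_parameters,
        if_exists="replace",  # overwrite the table if it already exists
        engine="mysql",
    )
    # table_name is assumed to be a trusted identifier, not user input.
    return pd.DataFrame.ads.read_sql(
        f"SELECT * FROM {table_name}",
        connection_parameters=connection_parameters,
        engine="mysql",
    )
```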
Amazon S3
You can open both public and private Amazon S3 files in Accelerated Data Science.
For private files, pass the credentials through storage_options.
For large files, increase the blocksize.
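One way to express these options is sketched below, assuming the file is read with pandas and the s3fs package. The helper names are hypothetical, and mapping the block size to the s3fs default_block_size option is an assumption; other readers may expose it under a different name, such as blocksize.

```python
from typing import Optional


def s3_storage_options(key: Optional[str] = None,
                       secret: Optional[str] = None,
                       block_size: Optional[int] = None) -> dict:
    """Build storage_options for a public or private S3 file."""
    if key is None or secret is None:
        # Public files: access the bucket anonymously.
        options = {"anon": True}
    else:
        # Private files: pass the access key pair.
        options = {"key": key, "secret": secret}
    if block_size is not None:
        # Larger blocks mean fewer requests when reading large files.
        options["default_block_size"] = block_size
    return options


def read_s3_csv(uri: str, **credentials):
    """Read a CSV file from an s3:// URI into a dataframe."""
    # Imported lazily: requires pandas and s3fs.
    import pandas as pd

    return pd.read_csv(uri, storage_options=s3_storage_options(**credentials))
```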
HTTP(S) endpoints
To open a data set from a remote web server, use pandas and specify the URL of the data.
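Reading from an HTTP(S) endpoint can be sketched as follows. The function name read_remote_csv is hypothetical, the data is assumed to be CSV, and the scheme check is an optional safeguard added for illustration rather than something pandas requires.

```python
from urllib.parse import urlparse


def read_remote_csv(url: str):
    """Read a CSV file served over HTTP(S) into a dataframe."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"Expected an http(s) URL, got: {url!r}")

    # Imported lazily: requires pandas.
    import pandas as pd

    # pandas downloads the file and parses it in one step.
    return pd.read_csv(url)
```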
DatasetBrowser
To open a data set from reference libraries, use DatasetBrowser.
To see supported libraries, use DatasetBrowser.list().
Data Types
To inspect the ADS data types of a data set, use feature_types.
To see summary information about the data set, use show_in_notebook().
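The sketch below ties DatasetBrowser and the inspection calls together. It assumes an ADS version that still ships the ads.dataset module, and it uses the scikit-learn "wine" data set purely as an illustration; the function name and its default arguments are hypothetical.

```python
def inspect_reference_dataset(library: str = "sklearn", name: str = "wine"):
    """Open a reference data set and show its ADS type information."""
    # Imported lazily: requires the oracle-ads package.
    from ads.dataset.dataset_browser import DatasetBrowser

    print(DatasetBrowser.list())   # supported reference libraries
    browser = getattr(DatasetBrowser, library)()
    print(browser.list())          # data sets offered by that library
    ds = browser.open(name)

    print(ds.feature_types)        # type information detected per column
    ds.show_in_notebook()          # summary view; renders inside Jupyter
    return ds
```

show_in_notebook() produces visual output, so it is only useful inside a notebook session.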
There are four ADS semantic data types:
- Categorical (qualitative): Data that can be labeled into distinct groups. In general statistics, both nominal and ordinal data fall under categorical data.
- Continuous: Quantitative data that is measured rather than counted, usually taking fractional values.
- Datetime: Date and time values.
- Ordinal: Categorical data with an intrinsic ordering. For example, education level: Elementary, High School, Undergraduate, Graduate.
Nominal data, by contrast, labels items without any associated quantitative value or intrinsic ordering. For example, race: the values can be grouped, but no value ranks above another.
Supported Sources/Formats by Oracle ADS
ADS doesn't support text, DOC, PDF, or raw image files, or sequence (list, tuple, range), dict, and set data.
ADS does provide a text extraction module to convert PDF and DOC files into plain text.