DataSources
This section describes the data sources you can connect to in order to fetch datasets or files.
from utils.rc.client.requests import Requests
from utils.rc.client.auth import AuthClient
from utils.rc.dtos.project import Project
from utils.rc.dtos.dataset import Dataset
from utils.rc.dtos.recipe import Recipe
from utils.rc.dtos.transform import Transform
from utils.rc.dtos.template_v2 import TemplateV2, TemplateTransformV2
from utils.rc.dtos.segment import Segment, ItemExpression, Operator
from utils.rc.dtos.scenario import Scenario
from utils.rc.dtos.dataSource import (
    DataSource,
    DataSourceType,
    SnowflakeConfig,
    MongoConfig,
    S3Config,
    GcpConfig,
    AzureBlobConfig,
    MySQLConfig,
    RedshiftConfig,
    RedisStorageConfig,
)
# ProjectRun is needed for the scheduling examples below; the module path is
# assumed to follow the same convention as the other dtos imports.
from utils.rc.dtos.projectRun import ProjectRun
import pandas as pd
import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
# Requests.setRootHost("https://test.dev.rapidcanvas.net/api/")
# Requests.setRootHost("http://localhost:8080/api/")
AuthClient.setToken()
INFO:Authentication successful
Snowflake
Establishing a connection with Snowflake datasource
Use this code snippet in a Notebook to establish a connection with the Snowflake data source.
dataSource = DataSource.createDataSource(
"snowflake-101",
DataSourceType.SNOWFLAKE,
{
SnowflakeConfig.USER: "user-name",
SnowflakeConfig.PASSWORD: "password",
SnowflakeConfig.ACCOUNT: "account-identifier"
}
)
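Before registering the data source, you can optionally verify the credentials directly with the snowflake-connector-python package. This check is independent of the RapidCanvas SDK; a minimal sketch, assuming the same user, password, and account values:
import snowflake.connector

# Confirm the credentials work before registering the data source.
conn = snowflake.connector.connect(
    user="user-name",
    password="password",
    account="account-identifier"
)
conn.cursor().execute("SELECT CURRENT_VERSION()")
conn.close()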
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Snowflake",
description="Testing snowflake lib",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Fetching data from this database and uploading to canvas
Use this code snippet in a Notebook to fetch data from Snowflake with a SQL query and upload the result to the canvas as a dataset.
signup = project.addDataset(
dataset_name="signup",
dataset_description="signup golden",
data_source_id=dataSource.id,
data_source_options={
SnowflakeConfig.WAREHOUSE: "COMPUTE_WH",
SnowflakeConfig.QUERY: "SELECT * FROM rapidcanvas.public.SIGNUP"
}
)
signup.getData()
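getData() returns the dataset as a pandas DataFrame (which is why pandas is imported above), so the usual DataFrame tools apply:
df = signup.getData()
print(df.shape)   # (rows, columns) fetched by the query
print(df.dtypes)  # verify column types before building recipes
df.head()         # preview the first five rows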
Exporting the output dataset to Snowflake datasource
The following code snippet allows you to export the output dataset to the Snowflake datasource. Here, dataset refers to an output dataset generated on the project canvas.
dataset.update_sync_options(
dataSource.id,
{
SnowflakeConfig.TABLE: "table name",
SnowflakeConfig.DATABASE: "database name",
SnowflakeConfig.SCHEMA: "schema name",
SnowflakeConfig.IF_TABLE_EXISTS: "append"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Snowflake. The third argument to create_project_run is a standard cron expression; */2 * * * * runs the job every two minutes.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
SnowflakeConfig.TABLE: "table name",
SnowflakeConfig.DATABASE: "database name",
SnowflakeConfig.SCHEMA: "schema name"
}
)
Mongo
Establishing a connection with Mongo datasource
Use this code snippet in a Notebook to establish a connection with the Mongo data source.
dataSource = DataSource.createDataSource(
"mongo-101",
DataSourceType.MONGO,
{
MongoConfig.CONNECT_STRING: "mongodb://testuser2:testuser2@34.68.122.18:27017/test"
}
)
2023-02-02 12:07:09.094 INFO root: Found existing data source by name: mongo-101
2023-02-02 12:07:09.095 INFO root: Updating the same
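If the connection fails, you can test the same connect string directly with pymongo. This check is independent of the SDK:
from pymongo import MongoClient

# Ping the server using the same connect string as above.
client = MongoClient("mongodb://testuser2:testuser2@34.68.122.18:27017/test")
client.admin.command("ping")  # raises on connection failure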
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Mongodb",
description="Testing mongodb lib",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
2023-02-02 12:09:07.010 INFO root: Found existing project by name: Test Mongodb
2023-02-02 12:09:07.011 INFO root: Deleting existing project
2023-02-02 12:09:07.123 INFO root: Creating new project by name: Test Mongodb
Fetching a file from the database and uploading to canvas
The following code snippet uploads a dataset fetched from the Mongo database onto the canvas.
titanic = project.addDataset(
dataset_name="titanic",
dataset_description="titanic golden",
data_source_id=dataSource.id,
data_source_options={
MongoConfig.DATABASE: "test",
MongoConfig.COLLECTION: "titanic",
MongoConfig.QUERY_IN_JSON_FORMAT: "{}"
}
)
2023-02-02 12:09:07.300 INFO root: Creating new dataset by name:titanic
titanic.getData()
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S |
1 | 6 | 0 | 3 | Moran, Mr. James | male | nan | 0 | 0 | 330877 | 8.4583 | nan | Q |
2 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
3 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | nan | S |
4 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | nan | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 105 | 0 | 3 | Gustafsson, Mr. Anders Vilhelm | male | 37.0 | 2 | 0 | 3101276 | 7.925 | nan | S |
96 | 106 | 0 | 3 | Mionoff, Mr. Stoytcho | male | 28.0 | 0 | 0 | 349207 | 7.8958 | nan | S |
97 | 107 | 1 | 3 | Salkjelsvik, Miss. Anna Kristine | female | 21.0 | 0 | 0 | 343120 | 7.65 | nan | S |
98 | 78 | 0 | 3 | Moutal, Mr. Rahamin Haim | male | nan | 0 | 0 | 374746 | 8.05 | nan | S |
99 | 109 | 0 | 3 | Rekic, Mr. Tido | male | 38.0 | 0 | 0 | 349249 | 7.8958 | nan | S |
100 rows × 12 columns
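QUERY_IN_JSON_FORMAT accepts any MongoDB filter document serialized as a JSON string; "{}" above matches every document in the collection. A filtered pull (with a hypothetical dataset name) might look like this:
survivors = project.addDataset(
    dataset_name="titanic_survivors",
    dataset_description="passengers with Survived == 1",
    data_source_id=dataSource.id,
    data_source_options={
        MongoConfig.DATABASE: "test",
        MongoConfig.COLLECTION: "titanic",
        # Any MongoDB filter document, serialized as JSON:
        MongoConfig.QUERY_IN_JSON_FORMAT: '{"Survived": 1}'
    }
)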
Exporting the output dataset to Mongo datasource
The following code snippet allows you to export the output dataset to the Mongo datasource.
dataset.update_sync_options(
dataSource.id,
{
MongoConfig.COLLECTION: "collection name",
MongoConfig.DATABASE: "database name",
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Mongo.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
MongoConfig.COLLECTION: "collection name",
MongoConfig.DATABASE: "database name",
}
)
Amazon S3
Establishing a connection with Amazon S3 datasource
Use this code snippet in a Notebook to establish a connection with the Amazon S3 data source.
dataSource = DataSource.createDataSource(
"s3-101",
DataSourceType.S3_STORAGE,
{
S3Config.BUCKET: "bucket-name",
S3Config.ACCESS_KEY_ID: "access-key-id",
S3Config.ACCESS_KEY_SECRET: "access-key-secret"
}
)
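To rule out credential problems, the same values can be checked with boto3. This check is independent of the SDK:
import boto3

# Confirm the credentials can see the bucket before registering the source.
s3 = boto3.client(
    "s3",
    aws_access_key_id="access-key-id",
    aws_secret_access_key="access-key-secret"
)
s3.head_bucket(Bucket="bucket-name")  # raises if missing or inaccessible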
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Amazon S3",
description="Testing Amazon S3",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Fetching a file from the bucket and uploading to canvas
The following code snippet uploads a dataset imported from Amazon S3 onto the canvas.
signup = project.addDataset(
dataset_name="signup",
dataset_description="signup golden",
data_source_id=dataSource.id,
data_source_options={
S3Config.FILE_PATH: "file-path"
}
)
signup.getData()
Exporting the output dataset to Amazon S3 datasource
The following code snippet allows you to export the output dataset to the Amazon S3 datasource.
dataset.update_sync_options(
dataSource.id,
{
S3Config.OUTPUT_FILE_DIRECTORY: "files/",
S3Config.OUTPUT_FILE_NAME: "dataset.parquet"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Amazon S3.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
S3Config.OUTPUT_FILE_DIRECTORY: "files/",
S3Config.OUTPUT_FILE_NAME: "dataset-${RUN_ID}.parquet"
}
)
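The ${RUN_ID} placeholder is substituted with the run's identifier at execution time, so each scheduled run writes its output to a distinct file instead of overwriting the previous one. The same placeholder is used in the GCS and Azure Blob examples below.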
Google Cloud Storage (GCS)
Establishing a connection with GCS datasource
Use this code snippet in a Notebook to establish a connection with the Google Cloud Storage data source.
dataSource = DataSource.createDataSource(
"gcp-101",
DataSourceType.GCP_STORAGE,
{
GcpConfig.BUCKET: "bucket-name",
GcpConfig.ACCESS_KEY: "access key path"
}
)
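A quick access check with the google-cloud-storage client, assuming ACCESS_KEY is the path to a service-account JSON key file (this check is independent of the SDK):
from google.cloud import storage

# Confirm the service account can see the bucket.
client = storage.Client.from_service_account_json("access key path")
assert client.bucket("bucket-name").exists()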
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Google Cloud Storage",
description="Testing Google Cloud Storage",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Fetching a file from the bucket and uploading to canvas
The following code snippet uploads a dataset imported from Google Cloud Storage onto the canvas.
signup = project.addDataset(
dataset_name="signup",
dataset_description="signup golden",
data_source_id=dataSource.id,
data_source_options={
GcpConfig.FILE_PATH: "file-path"
}
)
signup.getData()
Exporting the output dataset to GCS datasource
The following code snippet allows you to export the output dataset to the Google Cloud Storage (GCS) datasource.
dataset.update_sync_options(
dataSource.id,
{
GcpConfig.OUTPUT_FILE_DIRECTORY: "files/",
GcpConfig.OUTPUT_FILE_NAME: "dataset.parquet"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Google Cloud Storage.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
GcpConfig.OUTPUT_FILE_DIRECTORY: "files/",
GcpConfig.OUTPUT_FILE_NAME: "dataset-${RUN_ID}.parquet"
}
)
Azure Blob Storage
Establishing a connection with Azure Blob datasource
Use this code snippet in a Notebook to establish a connection with the Azure Blob Storage data source.
dataSource = DataSource.createDataSource(
"azure-101",
DataSourceType.AZURE_BLOB,
{
AzureBlobConfig.CONTAINER_NAME: "container-name",
AzureBlobConfig.CONNECT_STR: "connect-string",
}
)
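You can verify the connection string and container with the azure-storage-blob client before registering the source (independent of the SDK):
from azure.storage.blob import BlobServiceClient

# Confirm the connection string resolves and the container exists.
service = BlobServiceClient.from_connection_string("connect-string")
assert service.get_container_client("container-name").exists()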
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Azure Blob Storage",
description="Testing Azure Blob Storage",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Fetching a file from the container and uploading to canvas
The following code snippet uploads a dataset imported from Azure Blob Storage onto the canvas. FILE_PATH specifies the location of the file within the container.
signup = project.addDataset(
dataset_name="signup",
dataset_description="signup golden",
data_source_id=dataSource.id,
data_source_options={
AzureBlobConfig.FILE_PATH: "file-path"
}
)
Exporting the output dataset to Azure Blob datasource
The following code snippet allows you to export the output dataset to the Azure Blob datasource.
dataset.update_sync_options(
dataSource.id,
{
AzureBlobConfig.OUTPUT_FILE_DIRECTORY: "files/",
AzureBlobConfig.OUTPUT_FILE_NAME: "dataset.parquet"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Azure Blob.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
AzureBlobConfig.OUTPUT_FILE_DIRECTORY: "files/",
AzureBlobConfig.OUTPUT_FILE_NAME: "dataset-${RUN_ID}.parquet"
}
)
MySQL/MsSQL
Establishing a connection with MySQL datasource
Use this code snippet in a Notebook to establish a connection with the MySQL data source.
dataSource = DataSource.createDataSource(
"mysql-101",
DataSourceType.MYSQL,
{
MySQLConfig.CONNECT_STRING: "mysql://root:password@34.170.43.138/azure"
}
)
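The connect string follows the SQLAlchemy URL convention, so it can be validated with SQLAlchemy directly (requires a MySQL driver such as mysqlclient; independent of the SDK):
from sqlalchemy import create_engine, text

# Confirm the connect string resolves before registering the source.
engine = create_engine("mysql://root:password@34.170.43.138/azure")
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))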
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test MySQL/MsSQL",
description="Testing MySQL/MsSQL",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Fetching data from the database and uploading to canvas
The following code snippet uploads a dataset imported from MySQL/MsSQL onto the canvas.
dataset = project.addDataset(
dataset_name="titanic",
dataset_description="titanic golden",
data_source_id=dataSource.id,
data_source_options={
MySQLConfig.QUERY: "SELECT * FROM titanic limit 100"
}
)
Exporting the output dataset to MySQL datasource
The following code snippet allows you to export the output dataset to the MySQL/MsSQL datasource.
dataset.update_sync_options(
dataSource.id,
{
MySQLConfig.TABLE: "titanic"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to MySQL/MsSQL.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
MySQLConfig.TABLE: "titanic"
}
)
Redshift
Establishing a connection with Redshift datasource
Use this code snippet in a Notebook to establish a connection with the Redshift data source.
dataSource = DataSource.createDataSource(
"redshift-101",
DataSourceType.REDSHIFT,
{
# SQLAlchemy-style Redshift URL (default port 5439); replace with your cluster's endpoint
RedshiftConfig.CONNECT_STRING: "redshift+psycopg2://user:password@host:5439/database"
}
)
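As with MySQL, the endpoint can be validated with SQLAlchemy, assuming the sqlalchemy-redshift and psycopg2 packages are installed (independent of the SDK):
from sqlalchemy import create_engine, text

# Confirm the Redshift cluster is reachable.
engine = create_engine("redshift+psycopg2://user:password@host:5439/database")
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))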
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Redshift",
description="Testing Redshift",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Fetching data from the database and uploading to canvas
The following code snippet uploads a dataset imported from Redshift onto the canvas.
dataset = project.addDataset(
dataset_name="titanic",
dataset_description="titanic golden",
data_source_id=dataSource.id,
data_source_options={
RedshiftConfig.QUERY: "SELECT * FROM titanic limit 100"
}
)
Exporting the output dataset to Redshift datasource
The following code snippet allows you to export the output dataset to the Redshift datasource.
dataset.update_sync_options(
dataSource.id,
{
RedshiftConfig.TABLE: "titanic"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Redshift.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
RedshiftConfig.TABLE: "titanic"
}
)
Redis
Establishing a connection with Redis datasource
Use this code snippet in a Notebook to establish a connection with the Redis data source.
dataSource = DataSource.createDataSource(
"redis-101",
DataSourceType.REDIS_STORAGE,
{
RedisStorageConfig.HOST: "127.0.0.1",
RedisStorageConfig.PORT: "6379"
}
)
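A reachability check with the redis-py client (independent of the SDK):
import redis

# Confirm the server is reachable at the configured host and port.
r = redis.Redis(host="127.0.0.1", port=6379)
assert r.ping()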
Creating a project
The following code snippet is used to create a project.
project = Project.create(
name="Test Redis",
description="Testing Redis",
icon="https://rapidcanvas.ai/wp-content/uploads/2022/09/bitcoin_prediction_med.jpg",
createEmpty=True
)
Note
You cannot import files from Redis into the platform; you can only export output datasets to this data source.
Exporting the output dataset to Redis datasource
The following code snippet allows you to export the output dataset to the Redis datasource.
dataset.update_sync_options(
dataSource.id,
{
RedisStorageConfig.FEATURE_NAME: "titanic",
RedisStorageConfig.FEATURE_KEY_COLUMN: "PassengerId",
RedisStorageConfig.FEATURE_VALUE_COLUMNS: "Sex,Parch"
}
)
dataset.sync()
Scheduling a job
When a scheduled job runs, the source dataset is refreshed with a new set of records and passed through the project's machine learning flow to generate a new output dataset, which is then exported to Redis.
project_run = ProjectRun.create_project_run(
project.id, "test-run-v1", "*/2 * * * *"
)
project_run.add_project_run_sync(
dataset.id,
dataSource.id,
{
RedisStorageConfig.FEATURE_NAME: "titanic",
RedisStorageConfig.FEATURE_KEY_COLUMN: "PassengerId",
RedisStorageConfig.FEATURE_VALUE_COLUMNS: "Sex,Parch"
}
)