Dataset API

This module provides functionality for interacting with the Amorphic Dataset API. You will need the following details to query Amorphic:

  • API TOKEN

  • ROLE ID

  • ENVIRONMENT

  • AMORPHIC URL

Functionalities

  • Get Dataset details by dataset name

  • Create and Delete Dataset

  • Create and Delete Domain

Usage

  1. Initialize Amorphic wrapper

  • Running from Local machine

If you are querying Amorphic from a local machine, set AMORPHIC_API_TOKEN to the user's authorization token. Follow the tutorial in the Amorphic documentation to create the authorization token.

from amorphicutils.api.amorphic import Amorphic

url = "https://bw7rwkd87f.execute-api.us-east-1.amazonaws.com"
environment = "master"
role_id = "admin-role-535343eb-0g44-4h34-g5df-766u87ed5ded"

amorphic_api = Amorphic(url, environment, role_id)
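
Because the wrapper is constructed without an explicit token in the local case, it can help to fail early when the token is missing. The sketch below assumes AMORPHIC_API_TOKEN is read from the process environment, which the text above does not state explicitly; `require_token` is a hypothetical helper and not part of amorphicutils.

```python
import os


def require_token(env_var="AMORPHIC_API_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing or blank.

    Assumption: the token is supplied through an environment variable.
    Hypothetical helper, not part of amorphicutils.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; create an authorization token first"
        )
    return token
```

Calling `require_token()` just before `Amorphic(url, environment, role_id)` turns a missing token into a clear local error instead of a failed API request.
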
  • Running in Amorphic ETL job

If you are running the job in Amorphic ETL, store the token in the parameter store and provide its key to the wrapper.

from amorphicutils.api.amorphic import Amorphic

url = "https://bw7rwkd87f.execute-api.us-east-1.amazonaws.com"
environment = "master"
role_id = "admin-role-535343eb-0g44-4h34-g5df-766u87ed5ded"

amorphic_api = Amorphic(url, environment, role_id, param_store_key="parameter-name-key")

  2. Create Dataset

payload = {
    "DatasetName": "dataset_name",
    "Domain": "domain_name",
    "ConnectionType": "api",
    "FileType": "csv",
    "TargetLocation": "s3",
    "TableUpdate": "append"
}
response = amorphic_api.dataset.create_dataset(**payload)
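
Several of the required arguments only accept a fixed set of values (listed under create_dataset in the Implementation section). As a minimal sketch, a payload can be checked locally before it is sent. `check_payload` and the two constants are hypothetical names introduced for illustration; they are not part of amorphicutils, and the library may perform its own validation.

```python
# Allowed values copied from the create_dataset parameter list in this document.
ALLOWED_VALUES = {
    "ConnectionType": ("api", "s3", "jdbc", "ext-fs"),
    "FileType": ("csv", "xlsx", "parquet", "txt", "pdf",
                 "jpg", "png", "mp3", "wav", "others"),
    "TargetLocation": ("redshift", "auroramysql", "s3athena", "s3"),
    "TableUpdate": ("append", "reload", "update"),
}

# Positional (required) arguments of create_dataset.
REQUIRED_KEYS = ("DatasetName", "Domain", "ConnectionType",
                 "FileType", "TargetLocation", "TableUpdate")


def check_payload(payload):
    """Return a list of problems found in a create_dataset payload (empty if none)."""
    problems = [f"missing required key: {key}"
                for key in REQUIRED_KEYS if key not in payload]
    for key, allowed in ALLOWED_VALUES.items():
        if key in payload and payload[key] not in allowed:
            problems.append(f"{key}={payload[key]!r} is not one of {list(allowed)}")
    return problems


payload = {
    "DatasetName": "dataset_name",
    "Domain": "domain_name",
    "ConnectionType": "api",
    "FileType": "csv",
    "TargetLocation": "s3",
    "TableUpdate": "append",
}
print(check_payload(payload))  # prints []
```

An empty list means the payload uses only documented values and can be passed on as `amorphic_api.dataset.create_dataset(**payload)`.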

Implementation

class amorphicutils.api.models.datasets.Dataset(api_wrapper)

Class to call dataset related API

create_dataset(DatasetName, Domain, ConnectionType, FileType, TargetLocation, TableUpdate, MalwareDetectionOptions=None, Keywords=None, DatasetDescription='Created using Amorphicutils', IsDataProfilingEnabled='false', FileDelimiter=',', IsDataValidationEnabled='false', SkipFileHeader='false', DataClassification=None, TargetTablePrepMode='truncate', NotificationSettings='all', DatasetSchema=None, RecordKeys=None, DatasetKeyOptions=None, LatestRecordIndicator=None, **kwargs)

Creates the dataset in Amorphic; returns the dataset details if it already exists

Parameters
  • DatasetName – Dataset name

  • Domain – Domain under which to create dataset

  • ConnectionType – Connection type for dataset, must be one of [‘api’, ‘s3’, ‘jdbc’, ‘ext-fs’]

  • FileType – File type of the input based on target type. Full list is [‘csv’, ‘xlsx’, ‘parquet’, ‘txt’, ‘pdf’, ‘jpg’, ‘png’, ‘mp3’, ‘wav’, ‘others’]

  • TargetLocation – Target location of dataset, must be from [‘redshift’, ‘auroramysql’, ‘s3athena’, ‘s3’]

  • TableUpdate – Data ingestion type, must be from [‘append’, ‘reload’, ‘update’]

  • MalwareDetectionOptions – Malware detection in dict format, key must be from [‘ScanForMalware’, ‘AllowUnscannableFiles’] and value can be True or False

  • Keywords – Keywords for the dataset

  • DatasetDescription – Description of the dataset

  • IsDataProfilingEnabled – Whether to enable data profiling. Default: false

  • FileDelimiter – Delimiter for structured data. Default: ‘,’

  • IsDataValidationEnabled – Validate schema for each file for s3athena type dataset. Default: false

  • SkipFileHeader – Set to true if the file has a header row. Default: false

  • DataClassification – Data classification for the dataset

  • TargetTablePrepMode – Mode for reload type dataset, can be from [‘recreate’, ‘truncate’]. Default: truncate

  • NotificationSettings – Type of notification settings. Default: all

  • DatasetSchema – schema of the dataset in format of [{‘name’: ‘id’, ‘type’: ‘varchar’}]

  • RecordKeys – Record keys, used when TableUpdate is ‘update’

  • DatasetKeyOptions – Defines options for SortType, SortKeys, DistType, DistKey

      • SortType: sort type for target type redshift, must be from [‘none’, ‘interleaved’, ‘compound’]

      • SortKeys: sort keys for target type redshift

      • DistType: dist type for target type redshift, must be from [‘auto’, ‘even’, ‘key’, ‘all’]

      • DistKey: dist key for target type redshift

      • Default: {‘SortType’: ‘none’, ‘SortKeys’: None, ‘DistType’: ‘auto’, ‘DistKey’: ‘’}

  • LatestRecordIndicator – Latest record key, used when TableUpdate is ‘update’. Default: {“name”: “upload_time”, “type”: “bigint”}

  • kwargs
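
To illustrate how the DatasetKeyOptions entries above fit together for a redshift target, the sketch below starts from the documented default and overrides selected keys, rejecting values outside the documented lists. `build_key_options` is a hypothetical helper, not part of amorphicutils, and passing SortKeys as a list of column names is an assumption; the list above does not specify its format.

```python
# Values and default copied from the DatasetKeyOptions description above.
SORT_TYPES = ("none", "interleaved", "compound")
DIST_TYPES = ("auto", "even", "key", "all")
DEFAULT_KEY_OPTIONS = {
    "SortType": "none",
    "SortKeys": None,
    "DistType": "auto",
    "DistKey": "",
}


def build_key_options(**overrides):
    """Merge overrides into the documented default and check the enum fields."""
    unknown = set(overrides) - set(DEFAULT_KEY_OPTIONS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    options = {**DEFAULT_KEY_OPTIONS, **overrides}
    if options["SortType"] not in SORT_TYPES:
        raise ValueError(f"SortType must be one of {list(SORT_TYPES)}")
    if options["DistType"] not in DIST_TYPES:
        raise ValueError(f"DistType must be one of {list(DIST_TYPES)}")
    return options


# Assumption: SortKeys is a list of column names.
key_options = build_key_options(
    SortType="compound", SortKeys=["id"], DistType="key", DistKey="id"
)
```

The resulting dict would be passed as `DatasetKeyOptions=key_options` together with `TargetLocation="redshift"`.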

Returns

create_domain(DomainName, DisplayName=None, DomainDescription=None)

Creates a domain in Amorphic; returns success if the domain already exists

Parameters
  • DomainName – domain name

  • DisplayName – display name for domain

  • DomainDescription – description of the domain

Returns

delete_dataset(DatasetName=None, DatasetId=None)

Deletes the dataset from Amorphic

Parameters
  • DatasetName – Dataset name

  • DatasetId – Id of the dataset

Returns

delete_domain(DomainName)

Deletes domain

Parameters

DomainName

Returns

get_all_dataset(datasets_list=None)

Returns a list of all datasets owned by the user

Parameters

datasets_list

Returns

get_dataset(DatasetName=None, DatasetId=None)

Get the dataset details based on name or dataset id

Parameters
  • DatasetName – Dataset name

  • DatasetId – Dataset Id
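
Both identifiers default to None in the signature, and the description does not say what happens when neither is given. A cautious caller might therefore build the keyword arguments explicitly, as in this sketch. `dataset_lookup_kwargs` is a hypothetical helper, not part of amorphicutils, and the rule that at least one identifier is required is an assumption.

```python
def dataset_lookup_kwargs(dataset_name=None, dataset_id=None):
    """Build keyword arguments for get_dataset from a name and/or an id.

    Assumption: at least one identifier must be supplied.
    Hypothetical helper, not part of amorphicutils.
    """
    if dataset_name is None and dataset_id is None:
        raise ValueError("provide a dataset name or a dataset id")
    kwargs = {}
    if dataset_name is not None:
        kwargs["DatasetName"] = dataset_name
    if dataset_id is not None:
        kwargs["DatasetId"] = dataset_id
    return kwargs


# Only the identifier that was given ends up in the call.
kwargs = dataset_lookup_kwargs(dataset_name="dataset_name")
# response = amorphic_api.dataset.get_dataset(**kwargs)
```

The same keyword arguments match the signature of delete_dataset, which takes the same two optional identifiers.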

Returns

get_domain(DomainName)

Returns domain details

Parameters

DomainName – domain name

Returns

search_dataset(DatasetName, datasets_list=None)

Searches for a dataset by name

Parameters
  • DatasetName – Dataset name

  • datasets_list

Returns

update_dataset(DatasetName, MalwareDetectionOptions=None, Keywords=None, DatasetDescription=None, IsDataProfilingEnabled=None, IsDataValidationEnabled=None, DataClassification=None, **kwargs)

Updates the dataset in Amorphic, returns dataset details if already exists

Parameters
  • DatasetName – Dataset name

  • MalwareDetectionOptions – Malware detection in dict format, key must be from [‘ScanForMalware’, ‘AllowUnscannableFiles’] and value can be True or False

  • Keywords – Keywords for the dataset

  • DatasetDescription – Description of the dataset

  • IsDataProfilingEnabled – Whether to enable data profiling, default: false

  • IsDataValidationEnabled – Validate schema for each file for s3athena type dataset. Default: false

  • DataClassification – data classification for the dataset

  • kwargs

Returns

validated_schema(schema)

Returns the validated schema for Amorphic request

Parameters

schema – schema of format [{‘name’: ‘id’, ‘type’: ‘varchar’}]
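
The schema format above can be checked locally before it is handed to the API. The sketch below only verifies the shape shown in the example (a list of dicts, each with a non-empty string under 'name' and 'type'). `check_schema` is a hypothetical helper; it is not the implementation of validated_schema, whose actual rules are not described here.

```python
def check_schema(schema):
    """Return a list of problems with a [{'name': ..., 'type': ...}] schema."""
    if not isinstance(schema, list):
        return ["schema must be a list of column dicts"]
    problems = []
    for index, column in enumerate(schema):
        if not isinstance(column, dict):
            problems.append(f"entry {index} is not a dict")
            continue
        for key in ("name", "type"):
            value = column.get(key)
            # Each column needs a non-empty string for both keys.
            if not isinstance(value, str) or not value:
                problems.append(f"entry {index} has no usable {key!r}")
    return problems


print(check_schema([{"name": "id", "type": "varchar"}]))  # prints []
```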

Returns