SONG Python SDK¶
The SONG Python SDK is a simple Python module that allows you to interact with a SONG server through Python with minimal coding effort.
It lets you upload payloads synchronously or asynchronously, check their status, and create analyses. From there, you can use the power of Python to process and analyze the data within those objects however you see fit.
Prerequisites¶
Python 3.6 or higher is REQUIRED, since the SDK uses the dataclasses module.
Installation¶
The official SONG Python SDK is publicly hosted on PyPI. To install it, just run the command below:
pip install overture-song
Configuration¶
To configure the SDK, create an ApiConfig object from the overture_song.model module. It holds the SONG server URL, the study ID, and your access token, and is passed to the Api client (imported from overture_song.client) that performs all interactions with the server. See the Configuration step of the Tutorial below for a complete example.
Tutorial¶
This section demonstrates example usage of the overture-song SDK.
After completing this tutorial, you will have uploaded your first SONG metadata payload!
For the impatient, the code used below can be found in examples/example_upload.py.
Warning
Python 3.6 or higher is required.
Configuration¶
Create an ApiConfig object. This object contains the serverUrl, accessToken, and studyId that will be used to interact with the SONG API. In this example we will use https://song.cancercollaboratory.org for the serverUrl and 'ABC123' for the studyId. For the access token, please refer to Creating an Access Token.
from overture_song.model import ApiConfig
api_config = ApiConfig('https://song.cancercollaboratory.org', 'ABC123', <my_access_token>)
Next, the main API client needs to be instantiated in order to interact with the SONG server.
from overture_song.client import Api
api = Api(api_config)
As a sanity check, ensure that the server is running. If the response is True, then you may proceed with the next section; otherwise, the server is not running.
>>> api.is_alive()
True
Create a Study¶
If the studyId 'ABC123' does not exist, then the StudyClient must be instantiated in order to read and create studies.
First, create a study client:
from overture_song.client import StudyClient
study_client = StudyClient(api)
If the study associated with the payload does not exist, then create a Study entity:
from overture_song.entities import Study
if not study_client.has(api_config.study_id):
    study = Study.create(api_config.study_id, "myStudyName", "myStudyDescription", "myStudyOrganization")
    study_client.create(study)
Create a Simple Payload¶
Now that the study exists, you can create your first payload!
In this example, a SequencingReadAnalysis will be created. It follows the SequencingRead JsonSchema.
See also
Similarly, for the VariantCallAnalysis, refer to the VariantCall JsonSchema.
First, import all the entities to minimize the number of import statements.
from overture_song.entities import *
Next, create an example Donor entity:
donor = Donor()
donor.studyId = api_config.study_id
donor.donorGender = "male"
donor.donorSubmitterId = "dsId1"
donor.set_info("randomDonorField", "someDonorValue")
Create an example Specimen entity:
specimen = Specimen()
specimen.specimenClass = "Tumour"
specimen.specimenSubmitterId = "sp_sub_1"
specimen.specimenType = "Normal - EBV immortalized"
specimen.set_info("randomSpecimenField", "someSpecimenValue")
Create an example Sample entity:
sample = Sample()
sample.sampleSubmitterId = "ssId1"
sample.sampleType = "RNA"
sample.set_info("randomSample1Field", "someSample1Value")
Create one or more example File entities:
# File 1
file1 = File()
file1.fileName = "myFilename1.bam"
file1.studyId = api_config.study_id
file1.fileAccess = "controlled"
file1.fileMd5sum = "myMd51"
file1.fileSize = 1234561
file1.fileType = "VCF"
file1.set_info("randomFile1Field", "someFile1Value")
# File 2
file2 = File()
file2.fileName = "myFilename2.bam"
file2.studyId = api_config.study_id
file2.fileAccess = "controlled"
file2.fileMd5sum = "myMd52"
file2.fileSize = 1234562
file2.fileType = "VCF"
file2.set_info("randomFile2Field", "someFile2Value")
Create an example SequencingRead experiment entity:
# SequencingRead
sequencing_read_experiment = SequencingRead()
sequencing_read_experiment.aligned = True
sequencing_read_experiment.alignmentTool = "myAlignmentTool"
sequencing_read_experiment.pairedEnd = True
sequencing_read_experiment.insertSize = 0
sequencing_read_experiment.libraryStrategy = "WXS"
sequencing_read_experiment.referenceGenome = "GR37"
sequencing_read_experiment.set_info("randomSRField", "someSRValue")
Finally, use the SimplePayloadBuilder class along with the previously created entities to create a payload.
from overture_song.tools import SimplePayloadBuilder
builder = SimplePayloadBuilder(donor, specimen, sample, [file1, file2], sequencing_read_experiment)
payload = builder.to_dict()
Use a Custom AnalysisId¶
In some situations, the user may prefer to use a custom analysisId. If not specified in the payload, it is automatically generated by the SONG server during the Save the Analysis step. Although this tutorial uses the analysisId generated by the SONG server, a custom analysisId can be set as follows:
payload['analysisId'] = 'my_custom_analysis_id'
Upload the Payload¶
With the payload built, the data can now be uploaded to the SONG server for validation. There are 2 modes of validation:
- Synchronous - uploads are validated synchronously. This is the default mode, but it can also be selected explicitly by setting the kwarg is_async_validation to False in the upload method.
- Asynchronous - uploads are validated asynchronously, which allows the user to upload a batch of payloads. This mode can be selected by setting is_async_validation to True.
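For example, asynchronous validation makes it easy to submit a batch of payloads and collect their uploadIds for later status checks. The helper below is a minimal sketch, not part of the SDK; it assumes each upload response carries an uploadId field, as in the example response shown in this tutorial.

```python
# A minimal sketch (not part of the SDK): submit several payloads for
# asynchronous validation and collect the returned uploadIds. Assumes each
# response from api.upload carries an "uploadId" field, as in the example
# responses shown in this tutorial.

def upload_batch(api, payloads):
    """Submit each payload with async validation; return the uploadIds."""
    upload_ids = []
    for payload in payloads:
        response = api.upload(json_payload=payload, is_async_validation=True)
        upload_ids.append(response['uploadId'])
    return upload_ids
```

Each returned uploadId can then be checked individually with the status method described below.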
After calling the upload method, the payload will be sent to the SONG server for validation, and a response will be returned:
>>> api.upload(json_payload=payload, is_async_validation=False)
{
"status": "ok",
"uploadId": "UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58"
}
If the status field from the response is ok, this means the payload was successfully submitted to the SONG server for validation, and a randomly generated uploadId was returned as a receipt for the upload request.
Check the Status of the Upload¶
Before continuing, the previous upload's status must be checked using the status method, in order to ensure the payload was successfully validated. Using the previous uploadId, the status of the upload can be requested and will return the following response:
>>> api.status('UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58')
{
"analysisId": "",
"uploadId": "UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58",
"studyId": "ABC123",
"state": "VALIDATED",
"createdAt": [
2018,
2,
16,
0,
54,
31,
73774000
],
"updatedAt": [
2018,
2,
16,
0,
54,
31,
75476000
],
"errors": [
""
],
"payload": {
"analysisState": "UNPUBLISHED",
"sample": [
{
"info": {
"randomSample1Field": "someSample1Value"
},
"sampleSubmitterId": "ssId1",
"sampleType": "RNA",
"specimen": {
"info": {
"randomSpecimenField": "someSpecimenValue"
},
"specimenSubmitterId": "sp_sub_1",
"specimenClass": "Tumour",
"specimenType": "Normal - EBV immortalized"
},
"donor": {
"info": {
"randomDonorField": "someDonorValue"
},
"donorSubmitterId": "dsId1",
"studyId": "ABC123",
"donorGender": "male"
}
}
],
"file": [
{
"info": {
"randomFile1Field": "someFile1Value"
},
"fileName": "myFilename1.bam",
"studyId": "ABC123",
"fileSize": 1234561,
"fileType": "VCF",
"fileMd5sum": "myMd51",
"fileAccess": "controlled"
},
{
"info": {
"randomFile2Field": "someFile2Value"
},
"fileName": "myFilename2.bam",
"studyId": "ABC123",
"fileSize": 1234562,
"fileType": "VCF",
"fileMd5sum": "myMd52",
"fileAccess": "controlled"
}
],
"analysisType": "sequencingRead",
"experiment": {
"info": {
"randomSRField": "someSRValue"
},
"aligned": true,
"alignmentTool": "myAlignmentTool",
"insertSize": 0,
"libraryStrategy": "WXS",
"pairedEnd": true,
"referenceGenome": "GR37"
}
}
}
In order to continue with the next section, the state field MUST have the value VALIDATED, which indicates the upload was validated and there were no errors. If there were errors, the state field would have the value VALIDATION_ERROR, and the errors field would contain details of the validation issues. If there is an error, the user can simply correct the payload, re-upload, and check the status again.
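The check-and-retry step can be sketched as a small polling helper. This helper is not part of the SDK; get_status stands in for api.status, and the returned mapping is assumed to have the state and errors fields shown in the response above.

```python
import time

# A minimal sketch (not part of the SDK): poll an upload's status until it
# reaches VALIDATED, failing fast on VALIDATION_ERROR. `get_status` stands
# in for api.status and must return a mapping with "state" and "errors"
# fields, as in the status response shown in this tutorial.

def wait_for_validation(get_status, upload_id, retries=10, delay=2.0):
    for _ in range(retries):
        status = get_status(upload_id)
        if status['state'] == 'VALIDATED':
            return status
        if status['state'] == 'VALIDATION_ERROR':
            raise ValueError(
                "upload %s failed validation: %s" % (upload_id, status['errors']))
        time.sleep(delay)  # wait before polling again
    raise TimeoutError(
        "upload %s not validated after %d checks" % (upload_id, retries))
```

With a live client, this would be called as wait_for_validation(api.status, 'UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58').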
Save the Analysis¶
Once the upload is successfully validated, the upload must be saved using the save method. This generates the following response:
>>> api.save('UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58', ignore_analysis_id_collisions=False)
{
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"status": "ok"
}
The value ok in the status field of the response indicates that an analysis was successfully created. The analysis will contain the same data as the payload, with the addition of server-side generated ids, which are generated by a centralized id server. By default, the request DOES NOT IGNORE analysisId collisions; however, by setting the save method parameter ignore_analysis_id_collisions to True, collisions will be ignored. This mechanism is considered an override and is heavily discouraged, however it is sometimes necessary considering the complexities associated with managing genomic data.
Observe the UNPUBLISHED Analysis¶
Verify the analysis is unpublished by observing the value of the analysisState field in the response of the get_analysis call. The value should be UNPUBLISHED. Also, observe that the SONG server generated a unique sampleId, specimenId, analysisId and objectId:
>>> api.get_analysis('23c61f55-12b4-11e8-b46b-23a48c7b1324')
{
"analysisType": "sequencingRead",
"info": {},
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"study": "ABC123",
"analysisState": "UNPUBLISHED",
"sample": [
{
"info": {
"randomSample1Field": "someSample1Value"
},
"sampleId": "SA599347",
"specimenId": "SP196154",
"sampleSubmitterId": "ssId1",
"sampleType": "RNA",
"specimen": {
"info": {
"randomSpecimenField": "someSpecimenValue"
},
"specimenId": "SP196154",
"donorId": "DO229595",
"specimenSubmitterId": "sp_sub_1",
"specimenClass": "Tumour",
"specimenType": "Normal - EBV immortalized"
},
"donor": {
"donorId": "DO229595",
"donorSubmitterId": "dsId1",
"studyId": "ABC123",
"donorGender": "male",
"info": {}
}
}
],
"file": [
{
"info": {
"randomFile1Field": "someFile1Value"
},
"objectId": "f553bbe8-876b-5a9c-a436-ff47ceef53fb",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename1.bam",
"studyId": "ABC123",
"fileSize": 1234561,
"fileType": "VCF",
"fileMd5sum": "myMd51",
"fileAccess": "controlled"
},
{
"info": {
"randomFile2Field": "someFile2Value"
},
"objectId": "6e2ee06b-e95d-536a-86b5-f2af9594185f",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename2.bam",
"studyId": "ABC123",
"fileSize": 1234562,
"fileType": "VCF",
"fileMd5sum": "myMd52",
"fileAccess": "controlled"
}
],
"experiment": {
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"aligned": true,
"alignmentTool": "myAlignmentTool",
"insertSize": 0,
"libraryStrategy": "WXS",
"pairedEnd": true,
"referenceGenome": "GR37",
"info": {
"randomSRField": "someSRValue"
}
}
}
Generate the Manifest¶
With an analysis created, a manifest file must be generated using the ManifestClient, the analysisId from the previously generated analysis, a path to the directory containing the files to be uploaded, and an output file path. If the source_dir does not exist, or if the files to be uploaded are not present in that directory, then an error will be shown. By calling the write_manifest method, a Manifest object is generated and then written to a file. This step is required for the next section, involving the upload of the object files to the storage server.
from overture_song.client import ManifestClient
manifest_client = ManifestClient(api)
source_dir = "/path/to/directory/containing/files"
manifest_file_path = './manifest.txt'
manifest_client.write_manifest('23c61f55-12b4-11e8-b46b-23a48c7b1324', source_dir, manifest_file_path)
After successful execution, a manifest.txt file will be generated with the following contents:
23c61f55-12b4-11e8-b46b-23a48c7b1324
f553bbe8-876b-5a9c-a436-ff47ceef53fb /path/to/directory/containing/files/myFilename1.bam myMd51
6e2ee06b-e95d-536a-86b5-f2af9594185f /path/to/directory/containing/files/myFilename2.bam myMd52
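The manifest format is simple enough to inspect programmatically: the first line holds the analysisId, and each subsequent line holds an objectId, a file path, and an md5 checksum. The parser below is an illustrative sketch, not part of the SDK; it assumes the three columns are whitespace-separated (with no spaces in the file paths), as shown above.

```python
# An illustrative sketch (not part of the SDK): parse the contents of a
# manifest file. The first non-empty line is the analysisId; each following
# non-empty line holds an objectId, a file path, and an md5 checksum,
# separated by whitespace (file paths are assumed to contain no spaces).

def parse_manifest(text):
    lines = [line for line in text.splitlines() if line.strip()]
    analysis_id = lines[0].strip()
    entries = []
    for line in lines[1:]:
        object_id, file_path, md5 = line.split()
        entries.append({'objectId': object_id, 'path': file_path, 'md5': md5})
    return analysis_id, entries
```

This can be handy for sanity-checking a generated manifest.txt before handing it to the storage client.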
Upload the Object Files¶
Upload the object files specified in the payload, using the icgc-storage-client and the manifest file.
This will upload the files specified in the manifest.txt
file, which should all be located in the same directory.
For Collaboratory - Toronto:
./bin/icgc-storage-client --profile collab upload --manifest ./manifest.txt
For AWS - Virginia:
./bin/icgc-storage-client --profile aws upload --manifest ./manifest.txt
See also
Refer to the SCORE Client section for more information about installation, configuration and usage.
Publish the Analysis¶
Using the same analysisId as before, publish it. Essentially, this is the handshake between the metadata stored in the SONG server (via the analysisIds) and the object files stored in the storage server (the files described by the analysisId).
>>> api.publish('23c61f55-12b4-11e8-b46b-23a48c7b1324')
AnalysisId 23c61f55-12b4-11e8-b46b-23a48c7b1324 successfully published
Observe the PUBLISHED Analysis¶
Finally, verify the analysis is published by observing the value of the analysisState field in the response of the get_analysis call. If the value is PUBLISHED, then congratulations on your first metadata upload!!
>>> api.get_analysis('23c61f55-12b4-11e8-b46b-23a48c7b1324')
{
"analysisType": "sequencingRead",
"info": {},
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"study": "ABC123",
"analysisState": "PUBLISHED",
"sample": [
{
"info": {
"randomSample1Field": "someSample1Value"
},
"sampleId": "SA599347",
"specimenId": "SP196154",
"sampleSubmitterId": "ssId1",
"sampleType": "RNA",
"specimen": {
"info": {
"randomSpecimenField": "someSpecimenValue"
},
"specimenId": "SP196154",
"donorId": "DO229595",
"specimenSubmitterId": "sp_sub_1",
"specimenClass": "Tumour",
"specimenType": "Normal - EBV immortalized"
},
"donor": {
"donorId": "DO229595",
"donorSubmitterId": "dsId1",
"studyId": "ABC123",
"donorGender": "male",
"info": {}
}
}
],
"file": [
{
"info": {
"randomFile1Field": "someFile1Value"
},
"objectId": "f553bbe8-876b-5a9c-a436-ff47ceef53fb",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename1.bam",
"studyId": "ABC123",
"fileSize": 1234561,
"fileType": "VCF",
"fileMd5sum": "myMd51 ",
"fileAccess": "controlled"
},
{
"info": {
"randomFile2Field": "someFile2Value"
},
"objectId": "6e2ee06b-e95d-536a-86b5-f2af9594185f",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename2.bam",
"studyId": "ABC123",
"fileSize": 1234562,
"fileType": "VCF",
"fileMd5sum": "myMd52 ",
"fileAccess": "controlled"
}
],
"experiment": {
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"aligned": true,
"alignmentTool": "myAlignmentTool",
"insertSize": 0,
"libraryStrategy": "WXS",
"pairedEnd": true,
"referenceGenome": "GR37",
"info": {
"randomSRField": "someSRValue"
}
}
}