PISCES metadata file format

Tip

Column order does not matter to PISCES, but column names must match the required fields exactly.

SampleID	Directory	Fastq1	Fastq2
Sample1	/path/to/output_dir	s1_R1_001.fastq.gz	s1_R2_001.fastq.gz
Sample2	/path/to/output_dir	s2_R1_001.fastq.gz	s2_R2_001.fastq.gz
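Because column names must match exactly, it can help to check the header before starting a run. The following is a minimal sketch, not part of PISCES: the function name `check_header` is hypothetical, it assumes a tab-separated file, and the set of columns it checks (SampleID, Directory, and either Fastq1 or SRA) is an assumption based on the examples on this page.

```shell
# Build a small tab-separated metadata file to check (example values from this page).
metadata=$(mktemp)
printf 'SampleID\tDirectory\tFastq1\tFastq2\n' > "$metadata"
printf 'Sample1\t/path/to/output_dir\ts1_R1_001.fastq.gz\ts1_R2_001.fastq.gz\n' >> "$metadata"

# check_header FILE: succeed only if the header row has the expected column names.
# Hypothetical helper; the list of expected names is an assumption.
check_header() {
  # put one column name per line so grep -x can match names exactly
  header=$(head -n 1 "$1" | tr '\t' '\n')
  for name in SampleID Directory; do
    if ! printf '%s\n' "$header" | grep -qx "$name"; then
      echo "missing column: $name" >&2
      return 1
    fi
  done
  # reads come either from local files (Fastq1) or from SRA run accessions (SRA)
  if ! printf '%s\n' "$header" | grep -qx -e Fastq1 -e SRA; then
    echo "missing column: Fastq1 or SRA" >&2
    return 1
  fi
}

check_header "$metadata" && echo "header OK"
```

The match is case-sensitive, so a header such as `sampleid` is reported as missing.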

SampleID is the unique identifier used to construct output folders and to label columns in pisces summarize-expression data tables. The Directory path points to the top-level directory where PISCES outputs for a sample should be created; it may be a relative or absolute path. Fastq1 is required for fragment (single-end) sequencing libraries, and Fastq2 is additionally required for paired-end libraries. The Fastq1 and Fastq2 fields can each list multiple files using a semicolon (;) separator:

Fastq1	Fastq2
s1_R1_001.fastq.gz;s1_R1_002.fastq.gz	s1_R2_001.fastq.gz;s1_R2_002.fastq.gz
s2_R1_001.fastq.gz;s2_R1_002.fastq.gz	s2_R2_001.fastq.gz;s2_R2_002.fastq.gz
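One way to build such semicolon-separated values is sketched below. The helper name `join_fastqs` and the temporary directory layout are assumptions made for illustration; only the `;` separator comes from the format described above.

```shell
# Create a throwaway sample directory with two files per read (illustrative names).
workdir=$(mktemp -d)
mkdir -p "$workdir/Sample1"
touch "$workdir/Sample1/s1_R1_001.fastq.gz" "$workdir/Sample1/s1_R1_002.fastq.gz" \
      "$workdir/Sample1/s1_R2_001.fastq.gz" "$workdir/Sample1/s1_R2_002.fastq.gz"

# join_fastqs DIR PATTERN: print the files in DIR whose names match PATTERN,
# sorted and joined with ';'. Hypothetical helper, not a PISCES command.
join_fastqs() {
  find "$1" -name "$2" | sort | paste -sd ';' -
}

fq1=$(join_fastqs "$workdir/Sample1" '*_R1_*')
fq2=$(join_fastqs "$workdir/Sample1" '*_R2_*')

# Print the two fields as one tab-separated row fragment.
printf '%s\t%s\n' "$fq1" "$fq2"
```

Sorting both lists the same way keeps the R1 and R2 files in matching order, provided the file names differ only in the read tag.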

To analyze data from the NCBI Sequence Read Archive (SRA), list SRR “run” accessions in the SRA column. If all samples are specified by SRA accession, the Fastq1 column is optional.

SampleID	Directory	Fastq1	Fastq2	SRA
Sample1	/path/to/output_dir	s1_R1_001.fastq.gz	s1_R2_001.fastq.gz
Sample2	/path/to/output_dir	s2_R1_001.fastq.gz	s2_R2_001.fastq.gz
Sample3	/path/to/output_dir			SRR000001
Sample4	/path/to/output_dir			SRR000002

A metadata table for local FASTQ files can be constructed using a bash script. For example, given one directory per sample:

$ ls Sample*
Sample1:
s1_R1_001.fastq.gz
s1_R2_001.fastq.gz

Sample2:
s2_R1_001.fastq.gz
s2_R2_001.fastq.gz

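# write the header row, then append one row per sample directory
echo "SampleID,Directory,Fastq1,Fastq2" > metadata.csv
for dir in Sample*
  do
    # join multiple FASTQ files per read with ';' and drop the trailing separator
    fq1=$(ls "$dir"/*_R1_* | tr '\n' ';' | sed 's/;$//')
    fq2=$(ls "$dir"/*_R2_* | tr '\n' ';' | sed 's/;$//')
    # pass values as printf arguments so '%' or '\' in a path is not interpreted
    printf '%s,%s,%s,%s\n' "$dir" "$dir/PISCES" "$fq1" "$fq2"
  done >> metadata.csv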

Tip

You may also find it easy to construct the metadata table using a spreadsheet editor.

Including analysis variables as metadata

PISCES uses variables defined in the metadata file when pisces summarize-expression runs differential expression analysis and produces normalized fold-change tables. Any column added to the file can be used in downstream analysis.

SampleID	Treatment	Timepoint	Directory	Fastq1	Fastq2
Sample1	DMSO	4hours	/path/to/output_dir	s1_R1_001.fastq.gz	s1_R2_001.fastq.gz
Sample2	Estrogen	12hours	/path/to/output_dir	s2_R1_001.fastq.gz	s2_R2_001.fastq.gz

Tip

For differential expression analysis in pisces summarize-expression, it is often handy to create “replicate group” variables composed of one or more treatment variables, e.g. Treatment_Timepoint.
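A replicate group column like this could be appended with awk. This is a sketch that assumes a tab-separated metadata file with Treatment and Timepoint columns, as in the example table; the temporary file names are placeholders.

```shell
# Recreate the example metadata table as a tab-separated file.
metadata=$(mktemp)
printf 'SampleID\tTreatment\tTimepoint\tDirectory\tFastq1\tFastq2\n' > "$metadata"
printf 'Sample1\tDMSO\t4hours\t/path/to/output_dir\ts1_R1_001.fastq.gz\ts1_R2_001.fastq.gz\n' >> "$metadata"
printf 'Sample2\tEstrogen\t12hours\t/path/to/output_dir\ts2_R1_001.fastq.gz\ts2_R2_001.fastq.gz\n' >> "$metadata"

# Append a Treatment_Timepoint column. Columns are looked up by header name,
# so the result does not depend on column order.
with_groups=$(mktemp)
awk -F '\t' -v OFS='\t' '
  NR == 1 {
    for (i = 1; i <= NF; i++) col[$i] = i    # map column name to position
    print $0, "Treatment_Timepoint"
    next
  }
  { print $0, $(col["Treatment"]) "_" $(col["Timepoint"]) }
' "$metadata" > "$with_groups"

cat "$with_groups"
```

Samples that share the same treatment and timepoint end up with the same group label, which is what makes the column usable as a replicate group.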

Specifying NCBI SRA projects

You can create a metadata file for PISCES from the NCBI SRA “runinfo” format by renaming three header fields: Run to SRA, SampleName to SampleID, and Sample to Directory. For example:

$ wget -O - 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP093386' | \
  sed -e '1 s_Run,_SRA,_' -e '1 s_SampleName,_SampleID,_'  -e '1 s_Sample,_Directory,_' > metadata.csv

SRA,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Directory,BioSample,SampleType,TaxID,ScientificName,SampleID,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR5024081,2017-06-05 10:28:10,2016-11-15 17:02:50,19429539,2914430850,19429539,150,1118,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-9/SRR5024081/SRR5024081.1,SRX2350772,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800856,SAMN06018691,simple,9606,Homo sapiens,GSM2392582,,,,,,,no,,,,,GEO,SRA494622,,public,3CA090F9CB93F0E2B50ECA6C5F3B51D0,09AF133AC433F56BE0CAA545601CC843
SRR5024082,2017-06-05 10:28:10,2016-11-15 17:05:18,19357711,2903656650,19357711,150,1121,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-9/SRR5024082/SRR5024082.1,SRX2350773,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800857,SAMN06018690,simple,9606,Homo sapiens,GSM2392583,,,,,,,no,,,,,GEO,SRA494622,,public,16CC07DBC3F955605BBF174F03CE97E6,F0780C1A43CCBDCC8C84920A69AA7AC4
SRR5024083,2017-06-05 10:28:10,2016-11-15 17:05:35,20295588,3044338200,20295588,150,1166,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-9/SRR5024083/SRR5024083.1,SRX2350774,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800858,SAMN06018689,simple,9606,Homo sapiens,GSM2392584,,,,,,,no,,,,,GEO,SRA494622,,public,94593849B90924BBAA2205455EE74D93,849E78F3439B48BB1DF3D7A0AC357EA6
SRR5024084,2017-06-05 10:28:10,2016-11-15 17:04:24,21185745,3177861750,21185745,150,1229,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024084/SRR5024084.1,SRX2350775,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800859,SAMN06018688,simple,9606,Homo sapiens,GSM2392585,,,,,,,no,,,,,GEO,SRA494622,,public,944F48E5B1B2C30FDD695434880FA9A7,2691E7B75072BE574F33D87063D7094D
SRR5024085,2017-06-05 10:28:10,2016-11-15 17:05:18,19049066,2857359900,19049066,150,1101,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024085/SRR5024085.1,SRX2350776,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800860,SAMN06018687,simple,9606,Homo sapiens,GSM2392586,,,,,,,no,,,,,GEO,SRA494622,,public,3A908C56681585EF268C7D25F97904BC,C0A3987AACA6AFE3484021BD7EB5582C
SRR5024086,2017-06-05 10:28:10,2016-11-15 17:04:29,19311258,2896688700,19311258,150,1121,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024086/SRR5024086.1,SRX2350777,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800861,SAMN06018686,simple,9606,Homo sapiens,GSM2392587,,,,,,,no,,,,,GEO,SRA494622,,public,D574DA5DCD34978C9CDD584271E86FCA,860508A148A28FF63DFF56C2E188F9DC
SRR5024087,2017-06-05 10:28:10,2016-11-15 17:04:43,19680863,2952129450,19680863,150,1136,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024087/SRR5024087.1,SRX2350778,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800862,SAMN06018685,simple,9606,Homo sapiens,GSM2392588,,,,,,,no,,,,,GEO,SRA494622,,public,3BC055EDBC9B2B3F6ABC6B40974A5EC3,ABF39F53854C0920B3DFDE038443724D
...
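The runinfo table carries many columns that may not be needed for analysis. If a smaller file is easier to work with, columns can be selected by name with awk. This is an optional sketch, not a PISCES requirement: the input below is an abbreviated excerpt of the rows above, the `keep` list is an arbitrary choice, and splitting on commas assumes no field contains a quoted comma.

```shell
# Abbreviated runinfo excerpt (only four of the columns shown above).
runinfo=$(mktemp)
cat > "$runinfo" <<'EOF'
SRA,LibraryLayout,Directory,SampleID
SRR5024081,PAIRED,SRS1800856,GSM2392582
SRR5024082,PAIRED,SRS1800857,GSM2392583
EOF

# Keep only the columns named in "keep", in that order.
slim=$(mktemp)
awk -F ',' -v OFS=',' -v keep='SampleID,Directory,SRA' '
  NR == 1 {
    n = split(keep, want, ",")
    for (i = 1; i <= NF; i++) col[$i] = i    # map column name to position
    for (j = 1; j <= n; j++)
      if (!(want[j] in col)) {
        print "missing column: " want[j] > "/dev/stderr"
        exit 1
      }
  }
  {
    line = $(col[want[1]])
    for (j = 2; j <= n; j++) line = line OFS $(col[want[j]])
    print line
  }
' "$runinfo" > "$slim"

cat "$slim"
```

Any additional runinfo columns wanted as analysis variables can be added to the `keep` list.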