PISCES metadata file format
Tip
Column order does not matter to PISCES, but column names must match the required fields exactly.
| SampleID | Directory | Fastq1 | Fastq2 |
|---|---|---|---|
| Sample1 | /path/to/output_dir | s1_R1_001.fastq.gz | s1_R2_001.fastq.gz |
| Sample2 | /path/to/output_dir | s2_R1_001.fastq.gz | s2_R2_001.fastq.gz |
`SampleID` is the unique identifier used to construct output folders and to label the columns of `pisces summarize-expression` data tables. The `Directory` path points to the top-level directory where PISCES outputs for a sample should be created; it may be a relative or absolute path. `Fastq1` is a required field for fragment (single-end) sequencing libraries, and `Fastq2` is required to analyze paired-end libraries. `Fastq1` and `Fastq2` can each specify multiple files using a semicolon (`;`) separator:
| Fastq1 | Fastq2 |
|---|---|
| s1_R1_001.fastq.gz;s1_R1_002.fastq.gz | s1_R2_001.fastq.gz;s1_R2_002.fastq.gz |
| s2_R1_001.fastq.gz;s2_R1_002.fastq.gz | s2_R2_001.fastq.gz;s2_R2_002.fastq.gz |
If NCBI Sequence Read Archive (SRA) accessions are specified, these must be added as SRR “run” accessions in the `SRA` column. If only SRA experiments are specified, the `Fastq1` column is optional.
| SampleID | Directory | Fastq1 | Fastq2 | SRA |
|---|---|---|---|---|
| Sample1 | /path/to/output_dir | s1_R1_001.fastq.gz | s1_R2_001.fastq.gz | |
| Sample2 | /path/to/output_dir | s2_R1_001.fastq.gz | s2_R2_001.fastq.gz | |
| Sample3 | /path/to/output_dir | | | SRR000001 |
| Sample4 | /path/to/output_dir | | | SRR000002 |
A metadata table such as this can be constructed using a bash script:
$ ls Sample*
Sample1:
s1_R1_001.fastq.gz
s1_R2_001.fastq.gz
Sample2:
s2_R1_001.fastq.gz
s2_R2_001.fastq.gz
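echo "SampleID,Directory,Fastq1,Fastq2" > metadata.csv
for dir in Sample*
do
    # join each mate's files into one semicolon-separated field
    fq1=$(ls "$dir"/*_R1_* | tr '\n' ';' | sed 's/;$//')
    fq2=$(ls "$dir"/*_R2_* | tr '\n' ';' | sed 's/;$//')
    # pass values as arguments so they are never parsed as a printf format
    printf '%s,%s,%s,%s\n' "$dir" "$dir/PISCES" "$fq1" "$fq2"
done >> metadata.csv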
Tip
You may also find it easy to construct the metadata table using a spreadsheet editor.
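However the table is produced, it can help to confirm that the header names match exactly before running PISCES, since a spreadsheet export may change case or add Windows line endings. The sketch below is an illustration only: `has_column` and `check_metadata_header` are hypothetical helpers, and the rule it encodes (`SampleID` and `Directory` always, plus at least one of `Fastq1` or `SRA`) is one reading of the requirements described above rather than PISCES's own validation. It also assumes a plain comma-separated file with no quoted fields.

```shell
# Hypothetical helper: succeed if the header of CSV file $1
# contains a column named exactly $2 (case-sensitive).
has_column() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | grep -qxF -- "$2"
}

# Hypothetical helper: report missing columns on stderr and
# return non-zero if the header looks incomplete.
check_metadata_header() {
    file=$1
    rc=0
    for col in SampleID Directory; do
        if ! has_column "$file" "$col"; then
            echo "missing required column: $col" >&2
            rc=1
        fi
    done
    # Assumption: a sample needs local FASTQ files or an SRA run accession.
    if ! has_column "$file" Fastq1 && ! has_column "$file" SRA; then
        echo "need a Fastq1 or SRA column" >&2
        rc=1
    fi
    return "$rc"
}

# Example run against a small temporary file.
tmp=$(mktemp)
printf 'SampleID,Directory,Fastq1,Fastq2\n' > "$tmp"
printf 'Sample1,/path/to/output_dir,s1_R1_001.fastq.gz,s1_R2_001.fastq.gz\n' >> "$tmp"

if check_metadata_header "$tmp"; then
    echo "header OK"
fi
```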
Including analysis variables as metadata
PISCES uses the variables defined in the metadata file when `pisces summarize-expression` runs differential expression analysis and produces normalized fold-change tables. Any column added to the file can be used in downstream analysis.
| SampleID | Treatment | Timepoint | Directory | Fastq1 | Fastq2 |
|---|---|---|---|---|---|
| Sample1 | DMSO | 4hours | /path/to/output_dir | s1_R1_001.fastq.gz | s1_R2_001.fastq.gz |
| Sample2 | Estrogen | 12hours | /path/to/output_dir | s2_R1_001.fastq.gz | s2_R2_001.fastq.gz |
Tip
For differential expression analysis in `pisces summarize-expression`, it is often handy to create “replicate group” variables composed of one or more treatment variables, e.g. `Treatment_Timepoint`.
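One way to derive such a combined column is with `awk`. This is a rough sketch under stated assumptions: `add_replicate_group` is a hypothetical helper rather than a PISCES command, the input is a plain comma-separated file with no quoted fields, and the new column is simply appended at the end (column order does not matter to PISCES).

```shell
# Hypothetical helper: append a column named "<A>_<B>" that joins
# the values of columns A and B with an underscore.
# Usage: add_replicate_group FILE A B
add_replicate_group() {
    awk -F',' -v OFS=',' -v a="$2" -v b="$3" '
        NR == 1 {
            # Locate the two source columns by name in the header.
            for (i = 1; i <= NF; i++) {
                if ($i == a) ca = i
                if ($i == b) cb = i
            }
            if (!ca || !cb) {
                print "column not found: " a " or " b > "/dev/stderr"
                exit 1
            }
            print $0, a "_" b
            next
        }
        { print $0, $ca "_" $cb }
    ' "$1"
}

# Example run on the table shown above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
SampleID,Treatment,Timepoint,Directory,Fastq1,Fastq2
Sample1,DMSO,4hours,/path/to/output_dir,s1_R1_001.fastq.gz,s1_R2_001.fastq.gz
Sample2,Estrogen,12hours,/path/to/output_dir,s2_R1_001.fastq.gz,s2_R2_001.fastq.gz
EOF

add_replicate_group "$tmp" Treatment Timepoint
```

With this input, the appended column holds `DMSO_4hours` for Sample1 and `Estrogen_12hours` for Sample2.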
Specifying NCBI SRA projects
You can easily create a metadata file for PISCES from the NCBI SRA “runinfo” format. For example:
$ wget -O - 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP093386' | \
sed -e '1 s_Run,_SRA,_' -e '1 s_SampleName,_SampleID,_' -e '1 s_Sample,_Directory,_' > metadata.csv
SRA,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Directory,BioSample,SampleType,TaxID,ScientificName,SampleID,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR5024081,2017-06-05 10:28:10,2016-11-15 17:02:50,19429539,2914430850,19429539,150,1118,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-9/SRR5024081/SRR5024081.1,SRX2350772,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800856,SAMN06018691,simple,9606,Homo sapiens,GSM2392582,,,,,,,no,,,,,GEO,SRA494622,,public,3CA090F9CB93F0E2B50ECA6C5F3B51D0,09AF133AC433F56BE0CAA545601CC843
SRR5024082,2017-06-05 10:28:10,2016-11-15 17:05:18,19357711,2903656650,19357711,150,1121,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-9/SRR5024082/SRR5024082.1,SRX2350773,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800857,SAMN06018690,simple,9606,Homo sapiens,GSM2392583,,,,,,,no,,,,,GEO,SRA494622,,public,16CC07DBC3F955605BBF174F03CE97E6,F0780C1A43CCBDCC8C84920A69AA7AC4
SRR5024083,2017-06-05 10:28:10,2016-11-15 17:05:35,20295588,3044338200,20295588,150,1166,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-9/SRR5024083/SRR5024083.1,SRX2350774,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800858,SAMN06018689,simple,9606,Homo sapiens,GSM2392584,,,,,,,no,,,,,GEO,SRA494622,,public,94593849B90924BBAA2205455EE74D93,849E78F3439B48BB1DF3D7A0AC357EA6
SRR5024084,2017-06-05 10:28:10,2016-11-15 17:04:24,21185745,3177861750,21185745,150,1229,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024084/SRR5024084.1,SRX2350775,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800859,SAMN06018688,simple,9606,Homo sapiens,GSM2392585,,,,,,,no,,,,,GEO,SRA494622,,public,944F48E5B1B2C30FDD695434880FA9A7,2691E7B75072BE574F33D87063D7094D
SRR5024085,2017-06-05 10:28:10,2016-11-15 17:05:18,19049066,2857359900,19049066,150,1101,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024085/SRR5024085.1,SRX2350776,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800860,SAMN06018687,simple,9606,Homo sapiens,GSM2392586,,,,,,,no,,,,,GEO,SRA494622,,public,3A908C56681585EF268C7D25F97904BC,C0A3987AACA6AFE3484021BD7EB5582C
SRR5024086,2017-06-05 10:28:10,2016-11-15 17:04:29,19311258,2896688700,19311258,150,1121,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024086/SRR5024086.1,SRX2350777,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800861,SAMN06018686,simple,9606,Homo sapiens,GSM2392587,,,,,,,no,,,,,GEO,SRA494622,,public,D574DA5DCD34978C9CDD584271E86FCA,860508A148A28FF63DFF56C2E188F9DC
SRR5024087,2017-06-05 10:28:10,2016-11-15 17:04:43,19680863,2952129450,19680863,150,1136,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR5024087/SRR5024087.1,SRX2350778,,RNA-Seq,cDNA,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2000,SRP093386,PRJNA353646,2,353646,SRS1800862,SAMN06018685,simple,9606,Homo sapiens,GSM2392588,,,,,,,no,,,,,GEO,SRA494622,,public,3BC055EDBC9B2B3F6ABC6B40974A5EC3,ABF39F53854C0920B3DFDE038443724D
...
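Since `SampleID` must be unique and the renamed `SampleName` values come from the submitter, it may be worth checking the converted table for repeats. The sketch below is an illustration, not part of PISCES: `duplicate_sample_ids` is a hypothetical helper, it assumes a plain comma-separated file with no quoted fields, and the example input is cut down to three columns taken from the first two rows above.

```shell
# Hypothetical helper: print each SampleID value that occurs
# more than once in CSV file $1 (one line per duplicated value).
duplicate_sample_ids() {
    awk -F',' '
        NR == 1 {
            # Find the SampleID column by name.
            for (i = 1; i <= NF; i++) if ($i == "SampleID") col = i
            if (!col) {
                print "no SampleID column" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Print a value the second time it is seen, and only then.
        seen[$col]++ == 1 { print $col }
    ' "$1"
}

# Example run on a three-column excerpt of the converted runinfo table.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
SRA,Directory,SampleID
SRR5024081,SRS1800856,GSM2392582
SRR5024082,SRS1800857,GSM2392583
EOF

dups=$(duplicate_sample_ids "$tmp")
if [ -z "$dups" ]; then
    echo "all SampleID values are unique"
else
    echo "duplicated SampleID values: $dups" >&2
fi
```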