PISCES metadata file format
---------------------------

.. tip:: Column order does not matter to PISCES, but column names must match the required fields exactly.

.. _submit_metadata_example:

.. table:: "pisces submit" metadata file example

    +------------+-------------------------+---------------------+--------------------+
    | SampleID   | Directory               | Fastq1              | Fastq2             |
    +============+=========================+=====================+====================+
    | Sample1    | /path/to/output_dir     | s1_R1_001.fastq.gz  | s1_R2_001.fastq.gz |
    +------------+-------------------------+---------------------+--------------------+
    | Sample2    | /path/to/output_dir     | s2_R1_001.fastq.gz  | s2_R2_001.fastq.gz |
    +------------+-------------------------+---------------------+--------------------+

``SampleID`` is the unique identifier used to construct output folders, and it appears as the identifier in :ref:`summarize_example` data table column headers.

The ``Directory`` path points to the top-level directory where PISCES outputs for a sample should be created. This may be a relative or absolute path.

``Fastq1`` is required for fragment (single-end) sequencing libraries, and ``Fastq2`` is also required to analyze paired-end libraries. ``Fastq1`` and ``Fastq2`` can each list multiple files, separated by semicolons (``;``):

+---------------------------------------+---------------------------------------+
| Fastq1                                | Fastq2                                |
+=======================================+=======================================+
| s1_R1_001.fastq.gz;s1_R1_002.fastq.gz | s1_R2_001.fastq.gz;s1_R2_002.fastq.gz |
+---------------------------------------+---------------------------------------+
| s2_R1_001.fastq.gz;s2_R1_002.fastq.gz | s2_R2_001.fastq.gz;s2_R2_002.fastq.gz |
+---------------------------------------+---------------------------------------+

If NCBI Sequence Read Archive (SRA) accessions are specified, these must be added as ``SRR`` "run" accessions in the ``SRA`` column.
If only SRA accessions are specified, the ``Fastq1`` column is optional.

+------------+-------------------------+---------------------+--------------------+-------------+
| SampleID   | Directory               | Fastq1              | Fastq2             | SRA         |
+============+=========================+=====================+====================+=============+
| Sample1    | /path/to/output_dir     | s1_R1_001.fastq.gz  | s1_R2_001.fastq.gz |             |
+------------+-------------------------+---------------------+--------------------+-------------+
| Sample2    | /path/to/output_dir     | s2_R1_001.fastq.gz  | s2_R2_001.fastq.gz |             |
+------------+-------------------------+---------------------+--------------------+-------------+
| Sample3    | /path/to/output_dir     |                     |                    | SRR000001   |
+------------+-------------------------+---------------------+--------------------+-------------+
| Sample4    | /path/to/output_dir     |                     |                    | SRR000002   |
+------------+-------------------------+---------------------+--------------------+-------------+

A metadata table such as this can be constructed using a bash script:

.. code:: shell

    $ ls Sample*
    Sample1:
    s1_R1_001.fastq.gz s1_R2_001.fastq.gz

    Sample2:
    s2_R1_001.fastq.gz s2_R2_001.fastq.gz

.. code:: shell

    # Write the header row with the column names PISCES expects
    echo "SampleID,Directory,Fastq1,Fastq2" > metadata.csv
    for dir in Sample*
    do
        # Join all R1 (and R2) files for this sample with semicolons
        fq1=$(ls "$dir"/*_R1_* | tr '\n' ';' | sed 's/;$//')
        fq2=$(ls "$dir"/*_R2_* | tr '\n' ';' | sed 's/;$//')
        # One row per sample; pass values as printf arguments, not as the format
        printf '%s,%s/PISCES,%s,%s\n' "$dir" "$dir" "$fq1" "$fq2"
    done >> metadata.csv

.. tip:: You may also find it easy to construct the metadata table using a spreadsheet editor.

Including analysis variables as metadata
========================================

PISCES uses variables defined in the metadata file when using :ref:`summarize_example` to run differential expression analysis and to produce normalized fold-change tables. Any columns added to the file can be used in downstream analysis.

+------------+----------------+--------------+-------------------------+---------------------+--------------------+
| SampleID   | Treatment      | Timepoint    | Directory               | Fastq1              | Fastq2             |
+============+================+==============+=========================+=====================+====================+
| Sample1    | DMSO           | 4hours       | /path/to/output_dir     | s1_R1_001.fastq.gz  | s1_R2_001.fastq.gz |
+------------+----------------+--------------+-------------------------+---------------------+--------------------+
| Sample2    | Estrogen       | 12hours      | /path/to/output_dir     | s2_R1_001.fastq.gz  | s2_R2_001.fastq.gz |
+------------+----------------+--------------+-------------------------+---------------------+--------------------+

.. tip:: For differential expression analysis in :ref:`summarize_example` it's often handy to create "replicate group" variables composed of one or more treatment variables, e.g. ``Treatment_Timepoint``.

Specifying NCBI SRA projects
============================

You can easily create a metadata file for PISCES from the NCBI SRA "runinfo" format. For example:

.. code:: shell

    $ wget -O - 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP093386' | \
      sed -e '1 s_Run,_SRA,_' -e '1 s_SampleName,_SampleID,_' -e '1 s_Sample,_Directory,_' > metadata.csv

.. program-output:: wget --quiet -O - 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP093386' | sed -e '1 s_Run,_SRA,_' -e '1 s_SampleName,_SampleID,_' -e '1 s_Sample,_Directory,_'
    :shell:
    :ellipsis: 8
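
Because column names must match the required fields exactly, it can be worth checking the header row after a rewrite like the ``sed`` command above. The following is a minimal sketch and not part of PISCES: ``missing_columns`` is a hypothetical helper, and ``example_metadata.csv`` with its header is a made-up stand-in so the snippet runs without network access.

.. code:: shell

    # Hypothetical helper (not a PISCES command): print each expected column
    # name that is absent from the header (first line) of a CSV file.
    missing_columns() {
        file=$1
        shift
        # Split the header on commas, one name per line; drop any carriage return.
        header=$(head -n 1 "$file" | tr -d '\r' | tr ',' '\n')
        for col in "$@"
        do
            # -x matches the whole line, -F treats the name as a literal string.
            printf '%s\n' "$header" | grep -qxF -- "$col" || printf '%s\n' "$col"
        done
    }

    # Made-up stand-in for a converted runinfo file.
    printf 'SRA,ReleaseDate,SampleID,Directory\nSRR000001,2016-01-01,s1,d1\n' > example_metadata.csv

    # Prints nothing when every listed column is present in the header.
    missing_columns example_metadata.csv SampleID Directory SRA

Empty output means all listed columns were found. Any name that is printed is missing from the header or spelled differently, for example with different capitalization.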