
Starcode: Sequence clustering based on all-pairs search

Contents:

1. What is starcode?
2. Source file list.
3. Compilation and installation.
4. Running starcode.
5. Running starcode-umi.
6. File formats.
7. License.
8. Citation.

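I. What is starcode?

Starcode is a DNA sequence clustering software. Starcode clustering is based on an all-pairs search within a specified Levenshtein distance (allowing insertions and deletions), followed by a clustering algorithm: Message Passing, Spheres or Connected Components. Typically, a file containing a set of DNA sequences is passed as input, together with the desired clustering distance and algorithm. Starcode returns the canonical sequence of each cluster, the cluster size, the set of distinct sequences that compose the cluster and the input line numbers of the cluster components.

Starcode has many applications in biology, such as DNA/RNA motif recovery, barcode/UMI clustering and sequencing error recovery.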
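To make the distance criterion concrete, here is a minimal Python sketch of the Levenshtein distance between two sequences. It is a textbook dynamic-programming version for illustration only: starcode itself searches with a trie (see trie.c), and the function name `levenshtein` is an assumption, not part of starcode.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    # prev[j] is the distance between the prefix of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Two sequences that differ by a single substitution.
    print(levenshtein("TAGCTAGACGTA", "TAGCTAGCCGTA"))
    # Two sequences that differ by a single deletion.
    print(levenshtein("ACGT", "AGT"))
```

Two sequences are candidates for the same cluster when this value is at most the chosen clustering distance.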

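II. Source file list

starcode-umi       Starcode script to cluster UMI-tagged sequences.
main-starcode.c    Starcode main file (parameter parsing).
starcode.c         Main starcode algorithm.
trie.c             Trie search and construction functions.
view.c             Graphical representation of starcode output.
Makefile           Make instruction file.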

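III. Compilation and installation

To install starcode, clone this git repository (or manually download the latest release, starcode v1.3):

git clone https://github.com/gui11aume/starcode

The files should be downloaded into a folder named 'starcode'. Use make to compile (Mac users need 'xcode', available on the Mac App Store):

make -C starcode

A binary file 'starcode' will be created. You can optionally create a symbolic link to run starcode from any directory:

sudo ln -s "$(pwd)/starcode/starcode" /usr/bin/starcode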

IV. Running starcode

Starcode runs on Linux and Mac. It has not been tested on Windows.

Usage:

starcode [options] {[-i] INPUT_FILE | -1 PAIRED_END_FILE1 -2 PAIRED_END_FILE2} [-o OUTPUT_FILE]
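When starcode is driven from a script, the usage line can be turned into an argument list. The sketch below is only an illustration: the helper name `build_starcode_command` and the file names are hypothetical, and it uses only the flags `-d`, `-i`, `-1`, `-2` and `-o` described in this document. It builds the command but does not run it.

```python
from typing import List, Optional, Tuple


def build_starcode_command(
    input_file: Optional[str] = None,
    paired: Optional[Tuple[str, str]] = None,
    output_file: Optional[str] = None,
    distance: Optional[int] = None,
    binary: str = "starcode",
) -> List[str]:
    """Return an argument list that follows the usage line above."""
    if input_file is not None and paired is not None:
        raise ValueError("give either a single input file or a paired-end pair")
    cmd = [binary]
    if distance is not None:
        cmd += ["-d", str(distance)]
    if input_file is not None:
        cmd += ["-i", input_file]
    elif paired is not None:
        file1, file2 = paired
        cmd += ["-1", file1, "-2", file2]
    # With no input option at all, starcode reads standard input.
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


if __name__ == "__main__":
    # "reads.txt" and "clusters.txt" are placeholder file names.
    print(build_starcode_command(input_file="reads.txt",
                                 output_file="clusters.txt", distance=2))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if the starcode binary is on the PATH.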

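Starcode defaults (please read this):

By default, starcode uses clustering parameters that are meaningful for many problems. However, the output may not look like what you expect. This can be due to the following reasons:

The clustering method is Message Passing. This means that clusters are built bottom-up by merging small clusters into bigger ones. The process is recursive, so sequences in a cluster may not be neighbors, i.e. they may not be within the specified Levenshtein distance of each other. If this is a problem, use sphere clustering instead (see option -s or --spheres below).
The clustering ratio is 5. This means that a cluster can absorb a smaller one only if it is more than five times bigger. A practical implication is that clusters of similar size are not merged. You can choose another threshold for merging clusters (see option -r or --cluster-ratio below).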

Search options:

-d or --distance distance

 Defines the maximum Levenshtein distance for clustering.
 When not set it is automatically computed as:
 min(8, 2 + [median seq length]/30)
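The automatic distance can be reproduced with a few lines of Python. This is a sketch of the formula above, not starcode's code: it assumes the division is an integer division and that a fractional median is truncated, and the function name `default_distance` is hypothetical.

```python
from statistics import median


def default_distance(sequences):
    """min(8, 2 + median_length / 30), with integer division assumed."""
    median_length = int(median(len(s) for s in sequences))
    return min(8, 2 + median_length // 30)


if __name__ == "__main__":
    # Median length 20 gives distance 2; median length 60 gives distance 4.
    print(default_distance(["A" * 20]))
    print(default_distance(["A" * 60]))
```

Under these assumptions the distance stays at 2 for median lengths below 30 and reaches the cap of 8 at a median length of 180.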

Clustering algorithm:

-r or --cluster-ratio ratio

 (Message passing only) Specifies the minimum sequence count ratio to cluster two matching
 sequences, i.e. two matching sequences A and B will be clustered together only if
 count(A) > ratio * count(B).
 Sparse datasets may need to set -r to small values (minimum is 1.0) to trigger clustering.
 Default is 5.0.
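The ratio rule and its recursive effect can be illustrated with a simplified Python sketch. It is not starcode's implementation: it assumes each sequence attaches to its most frequent match that satisfies count(A) > ratio * count(B), and then follows these links up to the canonical sequence. The names `message_passing` and `levenshtein` are hypothetical helpers.

```python
def levenshtein(a, b):
    """Textbook edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def message_passing(counts, distance, ratio=5.0):
    """counts maps sequence -> count; returns canonical -> (size, members)."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1.0")
    seqs = list(counts)
    parent = {}
    for child in seqs:
        # Matches that are more than `ratio` times bigger than the child.
        candidates = [p for p in seqs
                      if p != child
                      and counts[p] > ratio * counts[child]
                      and levenshtein(p, child) <= distance]
        if candidates:
            parent[child] = max(candidates, key=lambda p: counts[p])

    def root(seq):
        # Counts strictly increase along the links, so this terminates.
        while seq in parent:
            seq = parent[seq]
        return seq

    clusters = {}
    for seq in seqs:
        size, members = clusters.get(root(seq), (0, []))
        clusters[root(seq)] = (size + counts[seq], members + [seq])
    return clusters


if __name__ == "__main__":
    toy = {"AAAA": 100, "AAAT": 10, "AATT": 1}
    print(message_passing(toy, distance=1))
```

In this toy input AATT joins AAAT, which in turn joins AAAA, so AATT ends up in the cluster of AAAA although the two are at distance 2. With counts 100 and 30 instead, the default ratio of 5 would keep AAAA and AAAT apart.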

-s or --spheres

 Use sphere clustering algorithm instead of message passing (MP). Spheres is more greedy than MP:
 sorted by size, centroids absorb all their matches.
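The greedy behaviour described above can be sketched in Python as follows. This is an illustration under stated assumptions, not starcode's code: sequences are visited in decreasing order of count, and each sequence that is still unassigned becomes a centroid and absorbs every unassigned sequence within the distance. The names `sphere_clustering` and `levenshtein` are hypothetical.

```python
def levenshtein(a, b):
    """Textbook edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sphere_clustering(counts, distance):
    """counts maps sequence -> count; returns centroid -> (size, members)."""
    order = sorted(counts, key=counts.get, reverse=True)
    assigned = set()
    clusters = {}
    for centroid in order:
        if centroid in assigned:
            continue
        # The centroid absorbs all of its matches that are still free.
        members = [centroid] + [s for s in order
                                if s != centroid
                                and s not in assigned
                                and levenshtein(centroid, s) <= distance]
        assigned.update(members)
        clusters[centroid] = (sum(counts[m] for m in members), members)
    return clusters


if __name__ == "__main__":
    toy = {"AAAA": 100, "AAAT": 10, "AATT": 1}
    print(sphere_clustering(toy, distance=1))
```

Every member of a sphere is within the distance of its centroid, so in the toy input AATT (distance 2 from AAAA) stays in its own cluster. No count ratio is applied.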

-c or --connected-comp

 Clusters are defined by the connected components.
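Connected components can be sketched with a union-find structure over all pairs of matching sequences. The Python code below is a quadratic-time illustration, not starcode's implementation. It assumes that each component is reported under its most frequent sequence, and the names `connected_components` and `levenshtein` are hypothetical.

```python
def levenshtein(a, b):
    """Textbook edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def connected_components(counts, distance):
    """counts maps sequence -> count; returns canonical -> (size, members)."""
    seqs = list(counts)
    parent = {s: s for s in seqs}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Link every pair of sequences that match within the distance.
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if levenshtein(a, b) <= distance:
                parent[find(a)] = find(b)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    # Assumption: the most frequent member names the component.
    return {max(members, key=counts.get):
            (sum(counts[s] for s in members), members)
            for members in groups.values()}


if __name__ == "__main__":
    toy = {"AAAA": 100, "AAAT": 90, "AATT": 80, "GGGG": 5}
    print(connected_components(toy, distance=1))
```

In the toy input AAAA, AAAT and AATT form a single component through the chain AAAA, AAAT, AATT, regardless of their counts.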

Output format:

--non-redundant

 Removes redundant sequences from the output. Only the canonical sequence of each cluster is
 returned.

--print-clusters

 Adds a third column to the starcode output, containing the sequences that compose each cluster.
 By default, the output contains only the centroid and the counts.

--seq-id

 Shows the input sequence order (1-based) of the cluster components.

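Input files:

Single-file mode:
-i or --input file
 Specifies input file.
Paired-end FASTQ files:
-1 file1 -2 file2
 Specifies two paired-end FASTQ files for paired-end clustering mode.

Standard input is used when neither -i nor -1/-2 is set.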

Output files:

-o or --output file

 Specifies output file. When not set, standard output is used instead.

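--output1 file1 --output2 file2

 (Paired-end mode with --non-redundant option only). Specifies the output file names of the
 processed paired-end files.

Standard output is used when -o is not set.

When --output1/2 are not specified in paired-end --non-redundant mode, the output file names are the input file names with a "-starcode" suffix.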

Other options:

-t or --threads threads

 Defines the maximum number of parallel threads.
 Default is 1.

-q or --quiet

 Non verbose. By default, starcode prints verbose information to
 the standard error channel.

-v or --version

 Prints version information.

-h or --help

 Prints usage information.

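V. Running starcode-umi

Starcode-umi is a Python script that uses starcode to cluster UMI-tagged sequences. UMI-tagged sequences are assumed to contain a unique molecular identifier at the beginning of the read, followed by some other (longer) sequence. Starcode-umi performs two rounds of clustering and merges the results to find the best clusters of UMI and sequence pairs.

Usage:

starcode-umi [options] --umi-len N INPUT_FILE1 [INPUT_FILE2]

Required arguments:

--umi-len number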

 Defines the length of the UMI tags. Adding some extra nucleotides may improve the clustering
 performance.

--starcode-path path

 Path to `starcode` binary file. Default is `./starcode`.

Clustering options:

--umi-d distance

 Match distance (Levenshtein) for the UMI region.

--seq-d distance

 Match distance (Levenshtein) for the sequence region.

--umi-cluster algorithm

 Clustering algorithm to be used in the UMI region. ('mp' for message passing, 's' for spheres,
 'cc' for connected components). Default is message passing.

--seq-cluster algorithm

 Clustering algorithm to be used in the seq region. ('mp' for message passing, 's' for spheres,
 'cc' for connected components). Default is message passing.

--umi-cluster-ratio ratio

 (Only for message passing in UMI). Minimum clustering ratio (same as -r option in starcode).

--seq-cluster-ratio ratio

 (Only for message passing in seq). Minimum clustering ratio (same as -r option in starcode).

--seq-trim trim

 Use only *trim* nucleotides of the sequence for clustering. Starcode becomes memory inefficient
 with very long sequences; this parameter defines the maximum length of the sequence that will
 be used for clustering. Set it to 0 to use the full sequence. Default is 50.
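The roles of --umi-len and --seq-trim can be pictured as a split of each read. The Python sketch below is based only on the description in this section and makes two assumptions: the UMI is the first umi_len nucleotides of the read, and trimming keeps the first trim nucleotides of what follows. It is not the code of starcode-umi, and the function name `split_umi_read` is hypothetical.

```python
def split_umi_read(read, umi_len, seq_trim=50):
    """Return (umi, sequence) for one read; seq_trim=0 keeps everything."""
    if umi_len <= 0 or umi_len > len(read):
        raise ValueError("umi_len must be between 1 and the read length")
    umi = read[:umi_len]
    sequence = read[umi_len:]
    if seq_trim > 0:
        # Keep only the first seq_trim nucleotides after the UMI.
        sequence = sequence[:seq_trim]
    return umi, sequence


if __name__ == "__main__":
    # An 8-nucleotide UMI followed by a short sequence, trimmed to 4.
    print(split_umi_read("ACGTACGTTTGGCCAA", umi_len=8, seq_trim=4))
```

Each part would then be clustered with its own distance, algorithm and ratio, as set by the options above.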

Output options:

--seq-id

 Shows the input sequence order (1-based) of the cluster components.

Other options:

--umi-threads threads

 Defines the maximum number of parallel threads to be used in the UMI process.
 Default is 1.

--seq-threads threads

 Defines the maximum number of parallel threads to be used in the sequence process.
 Default is 1.

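VI. File formats

VI.I. Supported input file formats:

VI.I.I. Plain text:

Consists of a file containing one sequence per line. Only the standard DNA base characters ('A', 'C', 'G', 'T') are supported. Sequences must not contain spaces at the beginning or end of the line, as these would be treated as alignment characters. The file must not contain empty lines, as these would be treated as zero-length sequences. The sequences do not need to be sorted and may be repeated.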

Example:

TTACTATCGATCATCATCGACTGACTACG
ACTGCATCGACTAGCTACGACTACGCTACCATCAG
TTACTATCGATCATCATCGACTGACTAGC
ACTACGACTACGACTCAGCTCACTATCAGC
GCATCGACCGCTACTACGCATACTACGACATC
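The rules for plain text input can be checked before running starcode. The following Python sketch is an illustration, not part of starcode: the function name `check_plain_text` is hypothetical, and it assumes that only uppercase 'A', 'C', 'G' and 'T' are valid.

```python
def check_plain_text(lines):
    """Return (line_number, problem) pairs for lines that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        seq = line.rstrip("\n")
        if seq == "":
            problems.append((number, "empty line"))
        elif seq != seq.strip():
            problems.append((number, "leading or trailing whitespace"))
        elif set(seq) - set("ACGT"):
            problems.append((number, "non-ACGT character"))
    return problems


if __name__ == "__main__":
    sample = ["TTACTATCGATCATCATCGACTGACTACG\n", " ACGT\n", "\n", "ACNT\n"]
    for number, problem in check_plain_text(sample):
        print(number, problem)
```

An empty result means the lines satisfy the constraints listed above.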

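VI.I.II. Plain text with sequence count:

If the count of the sequences is known, it can be specified in the input file using the following format:

[SEQUENCE]\t[COUNT]\n

Where '\t' denotes the TAB character and '\n' the NEWLINE character. The sequences do not need to be sorted and may be repeated. If a repeated sequence is found, its counts are added together. As before, the sequences must not contain any other characters and the file must not contain empty lines.

Example:

TATCGACTCTATCTATCGCTGATGCGTAC       200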
CGAGCCGCCGGCACGTCACGACGCATCAA       1
TAGCACCTACGCATCTCGACTATCACG         234
CGAGCCGCCGGCACGTCACGACGCATCAA       17
TGACTCTATCAGCTAC                    39
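The way repeated sequences are combined can be mimicked in Python. This sketch only illustrates the format described above and is not how starcode reads its input: the helper names `merge_counts` and `format_counts` are hypothetical, and the fields are assumed to be separated by a single TAB character.

```python
from collections import Counter


def merge_counts(lines):
    """Sum the counts of repeated sequences in 'SEQUENCE<TAB>COUNT' lines."""
    totals = Counter()
    for line in lines:
        sequence, count = line.rstrip("\n").split("\t")
        totals[sequence] += int(count)
    return totals


def format_counts(totals):
    """Write the totals back in the same tab-separated format."""
    return "".join(f"{seq}\t{count}\n" for seq, count in totals.items())


if __name__ == "__main__":
    lines = [
        "CGAGCCGCCGGCACGTCACGACGCATCAA\t1\n",
        "TAGCACCTACGCATCTCGACTATCACG\t234\n",
        "CGAGCCGCCGGCACGTCACGACGCATCAA\t17\n",
    ]
    print(format_counts(merge_counts(lines)), end="")
```

In the example input above, the sequence CGAGCCGCCGGCACGTCACGACGCATCAA appears with counts 1 and 17, so it is treated as a single sequence with count 18.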

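VI.I.III. FASTA/FASTQ

Starcode also supports FASTA and FASTQ files. Note, however, that starcode does not use the quality scores; the only relevant information is the sequence itself. The FASTA/FASTQ labels are not used to identify the sequences in the output file. The sequences do not need to be sorted and may be repeated.

Example FASTA:

> FASTA sequence 1 label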
ATGCATCGATCACTCATCAGCTACAG
> FASTA sequence 2 label
TATCGACTATCTACGACTACATCA
> FASTA sequence 3 label
ATCATCACTCTAGCAGCGTACTCGCA
> FASTA sequence 4 label
ATGCATCGATTACTCATCAGCTACAG

Example FASTQ:

@ FASTQ sequence 1 label
CATCGAGCAGCTATGCAGCTACGAGT
+
-$#'%-#.&)%#)"".)--'*()$)%
@ FASTQ sequence 2 label
TACTGCTGATATTCAGCTCACACC
+
,*#%+#&*$-#,''+*)'&.,).,
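Since only the sequences matter, the labels and quality lines can be dropped to see what is actually clustered. The Python sketch below is an illustration, not starcode's parser: the function name `read_sequences` is hypothetical, FASTQ records are assumed to span exactly four lines, and wrapped FASTA sequence lines are joined.

```python
def read_sequences(lines):
    """Return the sequences of a FASTA or four-line FASTQ file as a list."""
    lines = [line.rstrip("\n") for line in lines]
    if not lines:
        return []
    if lines[0].startswith("@"):
        # FASTQ: the sequence is the second line of each four-line record.
        # Positions are used because quality lines may also start with '@'.
        return lines[1::4]
    if lines[0].startswith(">"):
        sequences, current = [], []
        for line in lines[1:]:
            if line.startswith(">"):
                sequences.append("".join(current))
                current = []
            else:
                current.append(line)
        sequences.append("".join(current))
        return sequences
    raise ValueError("input is neither FASTA nor FASTQ")


if __name__ == "__main__":
    fasta = ["> FASTA sequence 1 label\n", "ATGCATCGATCACTCATCAGCTACAG\n",
             "> FASTA sequence 2 label\n", "TATCGACTATCTACGACTACATCA\n"]
    print(read_sequences(fasta))
```

The returned list is equivalent to a plain text input with one sequence per line.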

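VI.II. Output format:

VI.II.I. Standard output format:

Starcode prints one line for each detected cluster, in the following format:

[CANONICAL SEQUENCE]\t[CLUSTER SIZE]\t[CLUSTER SEQUENCES]\n

Where '\t' denotes the TAB character and '\n' the NEWLINE character. 'CANONICAL SEQUENCE' is the sequence of the cluster with the highest count, 'CLUSTER SIZE' is the aggregated count of all the sequences that form the cluster, and 'CLUSTER SEQUENCES' is a list of all the cluster sequences, separated by commas and in arbitrary order. The lines are sorted by 'CLUSTER SIZE' in descending order.

For instance, a run with the following input and a clustering distance of 3 (-d3):

TAGCTAGACGTA   250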
TAGCTAGCCGTA   10
TAAGCTAGGGGT   16
ACGCGAGCGGAA   155
ACTTTAGCGGAA   1

would produce the following output:

TAGCTAGACGTA    260       TAGCTAGACGTA,TAGCTAGCCGTA
ACGCGAGCGGAA    156       ACGCGAGCGGAA,ACTTTAGCGGAA
TAAGCTAGGGGT    16        TAAGCTAGGGGT

The same example run with a more restrictive distance (-d2) would produce the following output:

TAGCTAGACGTA    260       TAGCTAGACGTA,TAGCTAGCCGTA
ACGCGAGCGGAA    155       ACGCGAGCGGAA
TAAGCTAGGGGT    16        TAAGCTAGGGGT
ACTTTAGCGGAA    1         ACTTTAGCGGAA
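Downstream scripts usually need this output as structured data. The following Python sketch parses the tab-separated lines described in this section. The names `Cluster` and `parse_starcode_output` are hypothetical, and the third column is treated as optional because the option list states that it is only present when cluster sequences are printed.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    canonical: str
    size: int
    sequences: List[str]


def parse_starcode_output(lines):
    """Parse 'CANONICAL<TAB>SIZE[<TAB>SEQ1,SEQ2,...]' lines into Clusters."""
    clusters = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # The list of cluster sequences is empty when the column is absent.
        members = fields[2].split(",") if len(fields) > 2 else []
        clusters.append(Cluster(fields[0], int(fields[1]), members))
    return clusters


if __name__ == "__main__":
    output = [
        "TAGCTAGACGTA\t260\tTAGCTAGACGTA,TAGCTAGCCGTA\n",
        "ACGCGAGCGGAA\t155\tACGCGAGCGGAA\n",
    ]
    for cluster in parse_starcode_output(output):
        print(cluster.canonical, cluster.size, len(cluster.sequences))
```

Because starcode already sorts the lines by cluster size, the first element of the returned list is the largest cluster.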

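VI.II.II. Non-redundant output format:

In non-redundant output mode, starcode prints only the canonical sequence of each cluster, one per line. Following the example from the previous section, the output with distance 3 (-d3) would be: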

  TAGCTAGACGTA
  ACGCGAGCGGAA
  TAAGCTAGGGGT

and with -d2:

  TAGCTAGACGTA
  ACGCGAGCGGAA
  TAAGCTAGGGGT
  ACTTTAGCGGAA

VII. License

Starcode is licensed under the GNU General Public License, version 3 (GPLv3). For more information, read the LICENSE file or refer to:

http://www.gnu.org/licenses/

VIII. Citation

If you use our software, please cite:

Zorita E, Cusco P, Filion GJ. 2015. Starcode: sequence clustering based on all-pairs search. Bioinformatics 31 (12): 1913-1919.
