Starcode: Sequence clustering based on all-pairs search

Contents:

1. What is starcode?
2. Source file list.
3. Compilation and installation.
4. Running starcode.
5. Running starcode-umi.
6. File formats.
7. License.
8. Citation.

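I. What is starcode?

Starcode is a DNA sequence clustering software. Clustering is based on an all-pairs search within a specified Levenshtein distance (allowing insertions and deletions), followed by a clustering algorithm: message passing, spheres or connected components. Typically, a file containing a set of DNA sequences is passed as input, together with the desired clustering distance and algorithm. Starcode returns the canonical sequence of each cluster, the cluster size, the set of different sequences that compose the cluster and the input line numbers of the cluster components.

Starcode has many applications in biology, such as DNA/RNA motif recovery, barcode/UMI clustering and sequencing error recovery.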

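II. Source file list

starcode-umi      Starcode script to cluster UMI-tagged sequences.
main-starcode.c   Starcode main file (parameter parsing).
starcode.c        Main starcode algorithm.
trie.c            Trie search and construction functions.
view.c            Graphical representation of starcode output.
Makefile          Make instruction file.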

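III. Compilation and installation

To install starcode, clone this git repository (or manually download the latest release, starcode v1.3):

  git clone https://github.com/gui11aume/starcode

The files should be downloaded into a folder named 'starcode'. Use make to compile (Mac users need 'Xcode', available from the Mac App Store):

  make -C starcode

A binary file 'starcode' will be created. You can optionally create a symbolic link to run starcode from any directory (the link target must be an absolute path, otherwise the link is resolved relative to /usr/bin):

  sudo ln -s "$(pwd)/starcode/starcode" /usr/bin/starcode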

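IV. Running starcode

Starcode runs on Linux and Mac. It has not been tested on Windows.

Usage:

  starcode [options] {[-i] INPUT_FILE | -1 PAIRED_END_FILE1 -2 PAIRED_END_FILE2} [-o OUTPUT_FILE]

Starcode defaults (please read this):

By default, starcode uses clustering parameters that are meaningful for many problems. Yet the output may not look like what you expect. This can be due to the following reasons:

The clustering method is message passing. This means that clusters are built bottom-up by merging small clusters into bigger ones. The process is recursive, so sequences in a cluster may not be neighbors, i.e. they may not be within the specified Levenshtein distance of each other. If this is a problem, use sphere clustering instead (see option -s or --spheres below).
The clustering ratio is 5. This means that a cluster can absorb a smaller one only if it is more than five times bigger. One practical implication is that clusters of similar size are not merged. You can choose another threshold for merging clusters (see option -r or --cluster-ratio below).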
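
To make the first point concrete, the sketch below computes Levenshtein distances for a chain of three sequences. It is a plain dynamic-programming implementation written for illustration, not starcode's trie-based search; the function name `levenshtein` and the three sequences are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance allowing substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical sequences: neighbors pairwise along the chain, not end to end.
a, b, c = "AAAA", "AAAT", "AATT"
chain = (levenshtein(a, b), levenshtein(b, c), levenshtein(a, c))
print(chain)  # (1, 1, 2)
```

With a distance threshold of 1, `a` matches `b` and `b` matches `c`, but `a` and `c` are at distance 2. A recursive bottom-up merge can therefore place `a` and `c` in the same cluster even though they are not neighbors.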

Search options:

-d or --distance distance

 Defines the maximum Levenshtein distance for clustering.
 When not set it is automatically computed as:
 min(8, 2 + [median seq length]/30)
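
As a quick check of the formula above, here is a minimal sketch that evaluates it for a list of sequences. The function name `default_distance` is hypothetical, and the code assumes that the division by 30 is an integer division; the text above does not say how fractional values are handled.

```python
from statistics import median


def default_distance(sequences):
    """Distance used when -d is not set, following min(8, 2 + median/30).

    Assumption: the median length is truncated and the division is integer.
    """
    med = int(median(len(s) for s in sequences))
    return min(8, 2 + med // 30)


print(default_distance(["A" * 20, "C" * 20, "G" * 25]))  # 2
print(default_distance(["A" * 300]))                     # 8 (capped)
```

Under this reading, sequences shorter than 30 nucleotides get a default distance of 2, and the cap of 8 is reached at a median length of 180.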

Clustering algorithm:

-r or --cluster-ratio ratio

 (Message passing only) Specifies the minimum sequence count ratio to cluster two matching
 sequences, i.e. two matching sequences A and B will be clustered together only if
 count(A) > ratio * count(B).
 Sparse datasets may need to set -r to small values (minimum is 1.0) to trigger clustering.
 Default is 5.0.
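
The ratio rule can be written as a one-line predicate. The sketch below follows the inequality exactly as stated above (strictly greater than); the function name `can_cluster` is hypothetical and the behavior of starcode at the exact boundary is not verified here.

```python
def can_cluster(count_a: int, count_b: int, ratio: float = 5.0) -> bool:
    """True if A may be clustered with B under count(A) > ratio * count(B)."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1.0")
    return count_a > ratio * count_b


print(can_cluster(250, 10))             # True: 250 > 5 * 10
print(can_cluster(16, 10))              # False: clusters of similar size
print(can_cluster(16, 10, ratio=1.0))   # True once the ratio is lowered
```

This is why sparse datasets, where most counts are small and similar, may produce no merges at all with the default ratio.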

-s or --spheres

 Use sphere clustering algorithm instead of message passing (MP). Spheres is more greedy than MP:
 sorted by size, centroids absorb all their matches.
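
The following is a rough sketch of the greedy idea described above: sequences are visited by decreasing count, and each sequence that has not been absorbed yet becomes a centroid and absorbs all of its unassigned matches. It is an illustration under simplifying assumptions (brute-force all-pairs comparison, hypothetical names `levenshtein` and `sphere_clusters`, made-up toy data), not starcode's implementation.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming (illustrative helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sphere_clusters(counts, max_dist):
    """Greedy sphere sketch: biggest unassigned sequence absorbs its matches."""
    order = sorted(counts, key=lambda s: (-counts[s], s))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue  # already absorbed by a bigger centroid
        members = [s for s in order
                   if s not in assigned and levenshtein(centroid, s) <= max_dist]
        assigned.update(members)
        clusters.append((centroid, sum(counts[s] for s in members), members))
    return clusters


toy = {"AAAA": 10, "GGGG": 5, "AAAT": 3, "AATT": 2}
for cluster in sphere_clusters(toy, max_dist=1):
    print(cluster)
```

In this toy run, AATT is within distance 1 of AAAT but stays in its own cluster, because AAAT was absorbed by AAAA and never acts as a centroid. Every member of a sphere cluster is therefore within the distance threshold of its centroid.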

-c or --connected-comp

 Clusters are defined by the connected components.

Output format:

--non-redundant

 Removes redundant sequences from the output. Only the canonical sequence of each cluster is
 returned.

--print-clusters

 Adds a third column to the starcode output, containing the sequences that compose each cluster.
 By default, the output contains only the centroid and the counts.

--seq-id

 Shows the input sequence order (1-based) of the cluster components.

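Input files:

Single-file mode:
-i or --input file
Specifies the input file.
Paired-end FASTQ files:
-1 file1 -2 file2
Specifies the two paired-end FASTQ files for paired-end clustering mode.

Standard input is used when neither -i nor -1/-2 is set.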

Output files:

-o or --output file

 Specifies output file. When not set, standard output is used instead.

--output1 file1 --output2 file2

 (Paired-end mode with --non-redundant option only). Specifies the output file names of the
  processed paired-end files.

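Standard output is used when -o is not set.

When --output1/2 are not specified in paired-end --non-redundant mode, the output file names are the input file names with a "-starcode" suffix.

Other options:

-t or --threads threads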

 Defines the maximum number of parallel threads.
 Default is 1.

-q or --quiet

 Non verbose. By default, starcode prints verbose information to
 the standard error channel.

-v or --version

 Prints version information.

-h or --help

 Prints usage information.
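
For scripted pipelines, the options above can be assembled into a command line programmatically. The sketch below only builds the argument list from flags documented in this section; the helper name `starcode_command` and the file names are hypothetical, and it assumes the `starcode` binary is reachable on the PATH.

```python
import shlex


def starcode_command(input_file, output_file=None, distance=None,
                     spheres=False, threads=1, binary="starcode"):
    """Build a starcode argument list for single-file mode."""
    cmd = [binary, "-t", str(threads)]
    if distance is not None:
        cmd += ["-d", str(distance)]   # otherwise starcode picks a default
    if spheres:
        cmd.append("-s")               # spheres instead of message passing
    cmd += ["-i", input_file]
    if output_file is not None:
        cmd += ["-o", output_file]     # otherwise standard output is used
    return cmd


cmd = starcode_command("barcodes.txt", "clusters.txt",
                       distance=2, spheres=True, threads=4)
print(shlex.join(cmd))
# starcode -t 4 -d 2 -s -i barcodes.txt -o clusters.txt
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the clustering.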

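V. Running starcode-umi

Starcode-umi is a Python script that uses starcode to cluster UMI-tagged sequences. UMI-tagged sequences are assumed to contain a unique molecular identifier (UMI) at the start of the read, followed by some other (longer) sequence. Starcode-umi performs two rounds of clustering and merging to find the best clusters of UMI and sequence pairs.

Usage:

  starcode-umi [options] --umi-len N input_file1 [input_file2]

Required arguments:

--umi-len number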

 Defines the length of the UMI tags. Adding some extra nucleotides may improve the clustering
 performance.

--starcode-path path

  Path to `starcode` binary file. Default is `./starcode`.

Clustering options:

--umi-d distance

 Match distance (Levenshtein) for the UMI region.

--seq-d distance

 Match distance (Levenshtein) for the sequence region.

--umi-cluster algorithm

 Clustering algorithm to be used in the UMI region. ('mp' for message passing, 's' for spheres,
 'cc' for connected components). Default is message passing.

--seq-cluster algorithm

 Clustering algorithm to be used in the seq region. ('mp' for message passing, 's' for spheres,
 'cc' for connected components). Default is message passing.

--umi-cluster-ratio ratio

 (Only for message passing in UMI). Minimum clustering ratio (same as -r option in starcode).

--seq-cluster-ratio ratio

 (Only for message passing in seq). Minimum clustering ratio (same as -r option in starcode).

--seq-trim trim

  Use only *trim* nucleotides of the sequence for clustering. Starcode becomes memory inefficient
  with very long sequences; this parameter defines the maximum length of the sequence that will
  be used for clustering. Set it to 0 to use the full sequence. Default is 50.

Output options:

--seq-id

 Shows the input sequence order (1-based) of the cluster components.

Other options:

--umi-threads threads

 Defines the maximum number of parallel threads to be used in the UMI process.
 Default is 1.

--seq-threads threads

 Defines the maximum number of parallel threads to be used in the sequence process.
 Default is 1.
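
The read layout that starcode-umi assumes (UMI first, then the sequence, optionally trimmed) can be illustrated with a short sketch. How the script splits reads internally is not described in this document, so the function `split_read` below is a hypothetical illustration of the --umi-len and --seq-trim semantics, not the script's code.

```python
def split_read(read: str, umi_len: int, seq_trim: int = 50):
    """Split a read into (UMI, sequence), assuming the UMI is at the read start.

    seq_trim caps the length of the sequence part; 0 keeps the full sequence.
    """
    if umi_len <= 0 or umi_len > len(read):
        raise ValueError("umi_len must be between 1 and the read length")
    umi, seq = read[:umi_len], read[umi_len:]
    if seq_trim > 0:
        seq = seq[:seq_trim]
    return umi, seq


# Made-up read: an 8-nucleotide UMI followed by the tagged sequence.
print(split_read("ACGTACGTTTTTGGGG", umi_len=8, seq_trim=4))  # ('ACGTACGT', 'TTTT')
print(split_read("ACGTACGTTTTTGGGG", umi_len=8, seq_trim=0))  # ('ACGTACGT', 'TTTTGGGG')
```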

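VI. File formats

VI.I. Supported input formats:

VI.I.I. Plain text:

A file containing one sequence per line. Only the standard DNA base characters ('A', 'C', 'G', 'T') are supported. The sequences must not contain spaces at the beginning or end of the line, as these would be treated as alignment characters. The file must not contain empty lines either, as these would be treated as zero-length sequences. The sequences do not need to be sorted and may be repeated.

Example:

  TTACTATCGATCATCATCGACTGACTACG
  ACTGCATCGACTAGCTACGACTACGCTACCATCAG
  TTACTATCGATCATCATCGACTGACTAGC
  ACTACGACTACGACTCAGCTCACTATCAGC
  GCATCGACCGCTACTACGCATACTACGACATC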

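VI.I.II. Plain text with sequence counts:

If the counts of the sequences are known, they can be specified in the input file using the following format:

  [SEQUENCE]\t[COUNT]\n

where '\t' denotes the TAB character and '\n' the NEWLINE character. The sequences do not need to be sorted and may be repeated. If a sequence appears more than once, its counts are added together. As before, the sequences must not contain any other characters and the file must not contain empty lines.

Example:

  TATCGACTCTATCTATCGCTGATGCGTAC       200
  CGAGCCGCCGGCACGTCACGACGCATCAA       1
  TAGCACCTACGCATCTCGACTATCACG         234
  CGAGCCGCCGGCACGTCACGACGCATCAA       17
  TGACTCTATCAGCTAC                    39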
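
The way repeated sequences accumulate can be reproduced with a few lines of Python. This is a minimal sketch of the input semantics described above, not starcode's parser; the name `read_counts` is hypothetical, and it assumes that a line without a count column contributes a count of 1.

```python
from collections import Counter


def read_counts(lines):
    """Sum counts per sequence from 'SEQUENCE' or 'SEQUENCE<TAB>COUNT' lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: a bare sequence line counts once.
        counts[fields[0]] += int(fields[1]) if len(fields) > 1 else 1
    return counts


example = [
    "TATCGACTCTATCTATCGCTGATGCGTAC\t200\n",
    "CGAGCCGCCGGCACGTCACGACGCATCAA\t1\n",
    "TAGCACCTACGCATCTCGACTATCACG\t234\n",
    "CGAGCCGCCGGCACGTCACGACGCATCAA\t17\n",
    "TGACTCTATCAGCTAC\t39\n",
]
counts = read_counts(example)
print(counts["CGAGCCGCCGGCACGTCACGACGCATCAA"])  # 18
```

The repeated sequence from the example ends up with a single entry whose count is 1 + 17 = 18.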

VI.I.III. FASTA/FASTQ

Starcode also supports FASTA and FASTQ files. Note, however, that starcode does not use the quality scores; the only relevant information is the sequence itself. The FASTA/FASTQ labels are not used to identify the sequences in the output file. The sequences do not need to be sorted and may be repeated.

Example FASTA:

  > FASTA sequence 1 label
  ATGCATCGATCACTCATCAGCTACAG
  > FASTA sequence 2 label
  TATCGACTATCTACGACTACATCA
  > FASTA sequence 3 label
  ATCATCACTCTAGCAGCGTACTCGCA
  > FASTA sequence 4 label
  ATGCATCGATTACTCATCAGCTACAG

Example FASTQ:

  @ FASTQ sequence 1 label
  CATCGAGCAGCTATGCAGCTACGAGT
  +
  -$#'%-#.&)%#)"".)--'*()$)%
  @ FASTQ sequence 2 label
  TACTGCTGATATTCAGCTCACACC
  +
  ,*#%+#&*$-#,''+*)'&.,).,
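
Since only the sequence line of each record matters, a FASTQ file reduces to the plain-text format described earlier. The sketch below shows that reduction; the name `fastq_sequences` is hypothetical, the code assumes strict four-line records (no wrapped sequences), and the quality strings are placeholders.

```python
def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record.

    Labels, '+' separators and quality strings are skipped.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


records = [
    "@ FASTQ sequence 1 label",
    "CATCGAGCAGCTATGCAGCTACGAGT",
    "+",
    "I" * 26,  # placeholder quality string, ignored
    "@ FASTQ sequence 2 label",
    "TACTGCTGATATTCAGCTCACACC",
    "+",
    "I" * 24,  # placeholder quality string, ignored
]
print(list(fastq_sequences(records)))
```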

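VI.II. Output format:

VI.II.I. Standard output format:

Starcode prints one line for each detected cluster, in the following format:

  [CANONICAL SEQUENCE]\t[CLUSTER SIZE]\t[CLUSTER SEQUENCES]\n

where '\t' denotes the TAB character and '\n' the NEWLINE character. 'CANONICAL SEQUENCE' is the sequence of the cluster that has the highest count, 'CLUSTER SIZE' is the aggregated count of all the sequences that form the cluster, and 'CLUSTER SEQUENCES' is a list of all the sequences of the cluster, separated by commas and in arbitrary order. The lines are sorted by 'CLUSTER SIZE' in descending order.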

For example, running starcode on the following input with a clustering distance of 3 (-d3):

  TAGCTAGACGTA   250
  TAGCTAGCCGTA   10
  TAAGCTAGGGGT   16
  ACGCGAGCGGAA   155
  ACTTTAGCGGAA   1

would produce the following output:

  TAGCTAGACGTA    260       TAGCTAGACGTA,TAGCTAGCCGTA
  ACGCGAGCGGAA    156       ACGCGAGCGGAA,ACTTTAGCGGAA
  TAAGCTAGGGGT    16        TAAGCTAGGGGT

The same example run with the more restrictive distance -d2 would produce the following output:

  TAGCTAGACGTA    260       TAGCTAGACGTA,TAGCTAGCCGTA
  ACGCGAGCGGAA    155       ACGCGAGCGGAA
  TAAGCTAGGGGT    16        TAAGCTAGGGGT
  ACTTTAGCGGAA    1         ACTTTAGCGGAA
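
Downstream scripts usually need this output as structured data. The sketch below parses the format described above into tuples; the name `parse_clusters` is hypothetical, and the third column is treated as optional because, according to the option list, it is only printed with --print-clusters.

```python
def parse_clusters(lines):
    """Parse starcode output lines into (canonical, size, members) tuples."""
    clusters = []
    for line in lines:
        fields = line.split()  # columns are whitespace (TAB) separated
        if not fields:
            continue
        members = fields[2].split(",") if len(fields) > 2 else []
        clusters.append((fields[0], int(fields[1]), members))
    return clusters


output_d3 = [
    "TAGCTAGACGTA\t260\tTAGCTAGACGTA,TAGCTAGCCGTA\n",
    "ACGCGAGCGGAA\t156\tACGCGAGCGGAA,ACTTTAGCGGAA\n",
    "TAAGCTAGGGGT\t16\tTAAGCTAGGGGT\n",
]
clusters = parse_clusters(output_d3)
print(sum(size for _, size, _ in clusters))  # 432
```

The cluster sizes add up to 432, which is the sum of the five input counts (250 + 10 + 16 + 155 + 1), a handy sanity check that no reads were lost.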

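VI.II.II. Non-redundant output format:

In non-redundant output mode, starcode prints only the canonical sequence of each cluster, one per line. Following the example from the previous section, the output with distance 3 (-d3) would be: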

  TAGCTAGACGTA
  ACGCGAGCGGAA
  TAAGCTAGGGGT

whereas for -d2:

  TAGCTAGACGTA
  ACGCGAGCGGAA
  TAAGCTAGGGGT
  ACTTTAGCGGAA

VII. License

Starcode is licensed under the GNU General Public License, version 3 (GPLv3). For more information, read the LICENSE file or refer to:

http://www.gnu.org/licenses/

VIII. Citation

If you use our software, please cite:

Zorita E, Cusco P, Filion GJ. 2015. Starcode: sequence clustering based on all-pairs search. Bioinformatics 31(12):1913-1919.
