
Starcode: sequence clustering based on all-pairs search

Contents:

1. What is starcode?
2. Source file list.
3. Compilation and installation.
4. Running starcode.
5. Running starcode-umi.
6. File formats.
7. License.
8. Citation.

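I. What is starcode?

Starcode is a DNA sequence clustering software. Starcode clustering is based on an all-pairs search within a specified Levenshtein distance (allowing insertions and deletions), followed by a clustering algorithm: message passing, spheres or connected components. Typically, a file containing a set of DNA sequences is passed as input, together with the desired clustering distance and algorithm. Starcode returns the canonical sequence of each cluster, the cluster size, the set of different sequences that compose the cluster and the input line numbers of the cluster components.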
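
To make the all-pairs search concrete, here is a minimal Python sketch that computes Levenshtein distances and lists every pair of sequences within a given distance by brute force. The function names are illustrative only, and this quadratic scan is not how starcode works internally (its search is trie-based, see trie.c).

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def all_pairs(seqs, max_dist):
    """Brute-force list of (s, t) pairs with distance <= max_dist."""
    return [(s, t) for s, t in combinations(seqs, 2)
            if levenshtein(s, t) <= max_dist]
```

For example, levenshtein("ACGCGAGCGGAA", "ACTTTAGCGGAA") is 3, so these two sequences form a pair at distance 3 but not at distance 2.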

Starcode has many applications in the field of biology, such as DNA/RNA motif recovery, barcode/UMI clustering and sequencing error recovery.

II. Source file list

starcode-umi     Starcode script to cluster UMI-tagged sequences.
main-starcode.c  Starcode main file (parameter parsing).
starcode.c       Main starcode algorithm.
trie.c           Trie search and construction functions.
view.c           Graphical representation of starcode output.
Makefile         Make instruction file.

III. Compilation and installation

To install starcode, clone this git repository (or manually download the latest release, starcode v1.3):

git clone https://github.com/gui11aume/starcode

The files should be downloaded into a folder named 'starcode'. Use make to compile (Mac users need 'Xcode', available from the Mac App Store):

make -C starcode

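A binary file 'starcode' will be created. Optionally, you can create a symbolic link to run starcode from any directory (use an absolute path as the link target):

sudo ln -s "$(pwd)/starcode/starcode" /usr/bin/starcode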

IV. Running starcode

Starcode runs on Linux and Mac. It has not been tested on Windows.

Usage:

starcode [options] {[-i] input_file | -1 paired_end_file1 -2 paired_end_file2} [-o output_file]
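
As an illustration of the usage line, the following Python sketch assembles a single-file starcode command from the short options documented in this section (-i, -o, -d, -t). The helper name and the file names in the comment are hypothetical, and the binary is assumed to be on the PATH.

```python
def starcode_command(input_file, output_file=None, dist=None, threads=None):
    """Build an argv list for starcode from options documented in this README."""
    cmd = ["starcode", "-i", input_file]
    if output_file is not None:
        cmd += ["-o", output_file]
    if dist is not None:
        cmd += ["-d", str(dist)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


# To actually run it (requires the starcode binary; file names are made up):
# import subprocess
# subprocess.run(starcode_command("reads.txt", "clusters.txt", dist=3), check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.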

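Starcode defaults (read this):

By default, starcode uses clustering parameters that are meaningful for many problems. However, the output may not look like what you expect, for the following reasons:

The clustering method is message passing. This means that clusters are built bottom-up by merging small clusters into bigger ones. The process is recursive, so sequences in a cluster may not be neighbors, i.e. they may not be within the specified Levenshtein distance. If this is a problem, use sphere clustering instead (see option -s or --spheres below).
The clustering ratio is 5. This means that a cluster can absorb a smaller one only if it is at least five times bigger. The practical implication is that clusters of similar size are not merged. You can choose a different threshold for merging clusters (see option -r or --cluster-ratio below).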

Search options:

-d or --dist distance

 Defines the maximum Levenshtein distance for clustering.
 When not set it is automatically computed as:
 min(8, 2 + [median seq length]/30)
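
The automatic distance can be worked out by hand with a short Python sketch. It assumes that the division by 30 is an integer division and that the median is truncated to an integer; neither detail is stated above, so treat the function as an approximation of the formula and not as starcode's code.

```python
from statistics import median


def default_distance(seqs):
    """min(8, 2 + median length / 30), assuming integer division."""
    med = int(median(len(s) for s in seqs))
    return min(8, 2 + med // 30)
```

Under these assumptions, sequences of length 12 give a distance of 2, length 60 gives 4, and anything with a median length of 180 or more is capped at 8.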

Clustering algorithm:

-r or --cluster-ratio ratio

 (Message passing only) Specifies the minimum sequence count ratio to cluster two matching
 sequences, i.e. two matching sequences A and B will be clustered together only if
 count(A) > ratio * count(B).
 Sparse datasets may need to set -r to small values (minimum is 1.0) to trigger clustering.
 Default is 5.0.
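
The merge rule can be illustrated with a deliberately simplified Python sketch. It is not the algorithm in starcode.c: it assumes that each sequence attaches to its highest-count neighbor satisfying count(A) > ratio * count(B), and that counts are passed upwards from the smallest sequence to the largest. The list of matching pairs is taken as given.

```python
def message_passing(counts, pairs, ratio=5.0):
    """counts: {seq: count}; pairs: (seq, seq) tuples within the distance.
    Returns {canonical_seq: aggregated_count} (simplified illustration)."""
    neighbors = {s: set() for s in counts}
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Each sequence B picks as parent its largest neighbor A such that
    # count(A) > ratio * count(B).
    parent = {}
    for s in counts:
        candidates = [n for n in neighbors[s] if counts[n] > ratio * counts[s]]
        if candidates:
            parent[s] = max(candidates, key=lambda n: counts[n])

    # Pass counts upwards, smallest sequences first, so that a parent
    # forwards what it received to its own parent (recursive merging).
    totals = dict(counts)
    for s in sorted(counts, key=lambda x: counts[x]):
        if s in parent:
            totals[parent[s]] += totals.pop(s)
    return totals
```

With counts a=100, b=10, c=1 and matching pairs (a, b) and (b, c), everything ends up in the cluster of a with size 111, even though a and c are not neighbors. This is the recursive behavior described in the notes on defaults.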

-s or --spheres

 Use sphere clustering algorithm instead of message passing (MP). Spheres is more greedy than MP:
 sorted by size, centroids absorb all their matches.
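
A rough Python sketch of the sphere idea follows. It assumes that sequences are visited from largest to smallest count and that each sequence not yet absorbed becomes a centroid taking all of its still-unassigned matches; tie-breaking and other details of the real implementation are not covered here.

```python
def sphere_clustering(counts, pairs):
    """counts: {seq: count}; pairs: (seq, seq) tuples within the distance.
    Returns {centroid: [centroid, absorbed matches...]} (illustration only)."""
    neighbors = {s: set() for s in counts}
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    clusters = {}
    assigned = set()
    # Visit sequences from largest to smallest count; each unassigned
    # sequence becomes a centroid and absorbs its unassigned matches.
    for s in sorted(counts, key=lambda x: counts[x], reverse=True):
        if s in assigned:
            continue
        members = [s] + sorted(n for n in neighbors[s] if n not in assigned)
        assigned.update(members)
        clusters[s] = members
    return clusters
```

With counts a=100, b=10, c=1 and matching pairs (a, b) and (b, c), the centroid a absorbs b, and c stays in its own cluster because it does not match a directly. Every member of a sphere cluster is therefore within the distance of its centroid.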

-c or --connected-comp

 Clusters are defined by the connected components.
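
Connected components can be sketched in Python with a small union-find structure over the graph of matching pairs. This is a generic textbook technique shown for illustration, with the list of pairs taken as given; it makes no claim about how starcode computes the components.

```python
def connected_components(seqs, pairs):
    """Union-find over the match graph; returns a sorted list of clusters."""
    parent = {s: s for s in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())
```

Here sequence counts play no role: any chain of matches joins sequences into one cluster, whatever their sizes.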

Output format:

--non-redundant

 Removes redundant sequences from the output. Only the canonical sequence of each cluster is
 returned.

--print-clusters

 Adds a third column to the starcode output, containing the sequences that compose each cluster.
 By default, the output contains only the centroid and the counts.

--seq-id

 Shows the input sequence order (1-based) of the cluster components.

Input files:

Single-file mode:
-i or --input file
Specifies the input file.
Paired-end FASTQ files:
-1 file1 -2 file2
Specifies two paired-end FASTQ files for paired-end clustering mode.

Standard input is used when neither -i nor -1/-2 is set.

Output files:

-o or --output file

 Specifies output file. When not set, standard output is used instead.

--output1 file1 --output2 file2

 (Paired-end mode with --non-redundant option only). Specifies the output file names of the
  processed paired-end files.

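Standard output is used when -o is not set.

When --output1/2 are not specified in paired-end --non-redundant mode, the output file names are the input file names with a "-starcode" suffix.

Other options:

-t or --threads threads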

 Defines the maximum number of parallel threads.
 Default is 1.

-q or --quiet

 Non verbose. By default, starcode prints verbose information to
 the standard error channel.

-v or --version

 Prints version information.

-h or --help

 Prints usage information.

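V. Running starcode-umi

starcode-umi is a Python script that uses starcode to cluster UMI-tagged sequences. UMI-tagged sequences are assumed to contain a unique molecular identifier at the beginning of the read, followed by another (longer) sequence. starcode-umi performs a double round of clustering and merging to find the best clusters of UMI and sequence pairs.

Usage:

starcode-umi [options] --umi-len N input_file1 [input_file2]

Required arguments:

--umi-len number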

 Defines the length of the UMI tags. Adding some extra nucleotides may improve the clustering
 performance.

--starcode-path path

  Path to `starcode` binary file. Default is `./starcode`.

Clustering options:

--umi-d distance

 Match distance (Levenshtein) for the UMI region.

--seq-d distance

 Match distance (Levenshtein) for the sequence region.

--umi-cluster algorithm

 Clustering algorithm to be used in the UMI region. ('mp' for message passing, 's' for spheres,
 'cc' for connected components). Default is message passing.

--seq-cluster algorithm

 Clustering algorithm to be used in the seq region. ('mp' for message passing, 's' for spheres,
 'cc' for connected components). Default is message passing.

--umi-cluster-ratio ratio

 (Only for message passing in UMI). Minimum clustering ratio (same as -r option in starcode).

--seq-cluster-ratio ratio

 (Only for message passing in seq). Minimum clustering ratio (same as -r option in starcode).

--seq-trim trim

  Use only *trim* nucleotides of the sequence for clustering. Starcode becomes memory inefficient
  with very long sequences, so this parameter defines the maximum length of the sequence that will
  be used for clustering. Set it to 0 to use the full sequence. Default is 50.
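
To illustrate how --umi-len and --seq-trim relate to a read, here is a small Python sketch that splits a UMI-tagged read into its two regions. It assumes that the UMI is the first umi_len nucleotides and that trimming keeps the first nucleotides that follow it; the function is hypothetical and is not taken from the starcode-umi script.

```python
def split_umi_read(read, umi_len, seq_trim=50):
    """Split a read into (umi, seq); seq_trim=0 keeps the full sequence."""
    umi, seq = read[:umi_len], read[umi_len:]
    if seq_trim > 0:
        seq = seq[:seq_trim]
    return umi, seq
```

For instance, split_umi_read("AAAACCCCGGGG", 4, 5) returns ("AAAA", "CCCCG"), and with seq_trim=0 the second element is the whole remainder "CCCCGGGG".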

Output options:

--seq-id

 Shows the input sequence order (1-based) of the cluster components.

Other options:

--umi-threads threads

 Defines the maximum number of parallel threads to be used in the UMI process.
 Default is 1.

--seq-threads threads

 Defines the maximum number of parallel threads to be used in the sequence process.
 Default is 1.

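VI. File formats

VI.I. Supported input file formats:

VI.I.I. Plain text:

Consists of a file containing one sequence per line. Only the standard DNA-base characters are supported ('A', 'C', 'G', 'T'). The sequences may not contain empty spaces at the beginning or the end of the string, as these are counted as alignment characters. The file may not contain empty lines either. The sequences do not need to be sorted and may be repeated.

Example:

TTACTATCGATCATCATCGACTGACTACG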
ACTGCATCGACTAGCTACGACTACGCTACCATCAG
TTACTATCGATCATCATCGACTGACTAGC
ACTACGACTACGACTCAGCTCACTATCAGC
GCATCGACCGCTACTACGCATACTACGACATC
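
The rules above can be checked before running starcode with a short Python sketch. It applies the rules exactly as stated in this section (only uppercase A, C, G, T, no surrounding whitespace, no empty lines); the function name is illustrative and starcode's own parser may behave differently on malformed input.

```python
def check_plain_text(lines):
    """Return a list of (line_number, problem) for a plain-text input."""
    problems = []
    for number, line in enumerate(lines, start=1):
        seq = line.rstrip("\n")
        if seq == "":
            problems.append((number, "empty line"))
        elif seq != seq.strip():
            problems.append((number, "leading or trailing whitespace"))
        elif set(seq) - set("ACGT"):
            problems.append((number, "non-ACGT character"))
    return problems
```

An empty result means that the file follows the format described here. The function accepts any iterable of lines, including an open file object.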

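VI.I.II. Plain text with sequence count:

If the count of each sequence is known, it may be specified in the input file using the following format:

[SEQUENCE]\t[COUNT]\n

Where '\t' denotes the TAB character and '\n' the NEWLINE character. The sequences do not need to be sorted and may be repeated. If a repeated sequence is found, its counts are added. As before, the sequences may not contain any additional characters and the file may not contain empty lines.

Example:

TATCGACTCTATCTATCGCTGATGCGTAC       200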
CGAGCCGCCGGCACGTCACGACGCATCAA       1
TAGCACCTACGCATCTCGACTATCACG         234
CGAGCCGCCGGCACGTCACGACGCATCAA       17
TGACTCTATCAGCTAC                    39
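
How repeated sequences are combined can be sketched in Python as follows. The function assumes tab-separated lines as described above and is only an illustration of the stated rule, not starcode's parser.

```python
from collections import Counter


def read_counts(lines):
    """Sum the counts of repeated sequences from 'SEQUENCE<TAB>COUNT' lines."""
    totals = Counter()
    for line in lines:
        seq, count = line.rstrip("\n").split("\t")
        totals[seq] += int(count)
    return totals
```

In the example above, CGAGCCGCCGGCACGTCACGACGCATCAA appears twice, with counts 1 and 17, so it enters the clustering with a total count of 18.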

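VI.I.III. FASTA/FASTQ

Starcode supports FASTA and FASTQ files as well. Note, however, that starcode does not use the quality scores: the only relevant information is the sequence itself. The FASTA/FASTQ labels are not used to identify the sequences in the output file. The sequences do not need to be sorted and may be repeated.

Example FASTA:

> FASTA sequence 1 label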
ATGCATCGATCACTCATCAGCTACAG
> FASTA sequence 2 label
TATCGACTATCTACGACTACATCA
> FASTA sequence 3 label
ATCATCACTCTAGCAGCGTACTCGCA
> FASTA sequence 4 label
ATGCATCGATTACTCATCAGCTACAG

Example FASTQ:

@ FASTQ sequence 1 label
CATCGAGCAGCTATGCAGCTACGAGT
+
-$#'%-#.&)%#)"".)--'*()$)%
@ FASTQ sequence 2 label
TACTGCTGATATTCAGCTCACACC
+
,*#%+#&*$-#,''+*)'&.,).,
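
Since only the sequence lines matter, the relevant part of a FASTQ file can be extracted with a few lines of Python. The sketch assumes the common layout of exactly four lines per record (label, sequence, '+', quality), as in the example above; it does not handle wrapped sequences.

```python
def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()
```

Applied to the example above, this yields CATCGAGCAGCTATGCAGCTACGAGT and TACTGCTGATATTCAGCTCACACC; labels and quality strings are discarded.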

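VI.II. Output format:

VI.II.I. Standard output format:

Starcode prints one line for each detected cluster, in the following format:

[CANONICAL SEQUENCE]\t[CLUSTER SIZE]\t[CLUSTER SEQUENCES]\n

Where '\t' denotes the TAB character and '\n' the NEWLINE character. 'CANONICAL SEQUENCE' is the sequence of the cluster that has the most counts, 'CLUSTER SIZE' is the aggregated count of all the sequences that form the cluster, and 'CLUSTER SEQUENCES' (printed only with --print-clusters) is a list of all the cluster sequences, separated by commas and in arbitrary order. The lines are sorted by 'CLUSTER SIZE' in descending order.

For instance, an execution with the following input and a clustering distance of 3 (-d3):

TAGCTAGACGTA   250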
TAGCTAGCCGTA   10
TAAGCTAGGGGT   16
ACGCGAGCGGAA   155
ACTTTAGCGGAA   1

produces the following output:

TAGCTAGACGTA    260       TAGCTAGACGTA,TAGCTAGCCGTA
ACGCGAGCGGAA    156       ACGCGAGCGGAA,ACTTTAGCGGAA
TAAGCTAGGGGT    16        TAAGCTAGGGGT

The same example run with a more restrictive distance, -d2, produces the following output:

TAGCTAGACGTA    260       TAGCTAGACGTA,TAGCTAGCCGTA
ACGCGAGCGGAA    155       ACGCGAGCGGAA
TAAGCTAGGGGT    16        TAAGCTAGGGGT
ACTTTAGCGGAA    1         ACTTTAGCGGAA
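
For downstream processing, one output line can be split into its fields with a Python sketch like the one below. It assumes tab-separated columns as described above and treats the third column as optional; the function name is illustrative, and any additional columns are ignored.

```python
def parse_cluster_line(line):
    """Split one output line into (canonical, size, member sequences)."""
    fields = line.rstrip("\n").split("\t")
    canonical, size = fields[0], int(fields[1])
    members = fields[2].split(",") if len(fields) > 2 else []
    return canonical, size, members
```

For the first line of the -d3 example, the result is ("TAGCTAGACGTA", 260, ["TAGCTAGACGTA", "TAGCTAGCCGTA"]).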

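VI.II.II. Non-redundant output format:

In non-redundant output mode, starcode prints only the canonical sequence of each cluster, one per line. Following the example from the previous section, the output with distance 3 (-d3) would be:

  TAGCTAGACGTA
  ACGCGAGCGGAA
  TAAGCTAGGGGT

and for -d2: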

  TAGCTAGACGTA
  ACGCGAGCGGAA
  TAAGCTAGGGGT
  ACTTTAGCGGAA

VII. License

Starcode is licensed under the GNU General Public License, version 3 (GPLv3). For more information, read the LICENSE file or refer to:

http://www.gnu.org/licenses/

VIII. Citation

If you use our software, please cite:

Zorita E, Cusco P, Filion GJ. 2015. Starcode: sequence clustering based on all-pairs search. Bioinformatics 31 (12): 1913-1919.
