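This code is designed to make large collections of OpenAI Whisper transcripts easy to search, so you can find where a specific passage or term occurs and at what point in the transcript. It should also work with any folder of .vtt files, such as podcast transcripts that were not produced by Whisper.
I use OpenAI's Whisper to transcribe the Accidental Tech Podcast, and have deployed a live search engine website front end powered by this module (specifically SearchTranscripts).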
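The module has two classes:
LoadTranscripts: creates a SQLite database and an FTS5 virtual table from a folder of transcript files (.vtt or .json). It joins the short transcript segments from the original files into longer texts (about 300 words each) so that the searchable chunks are a useful size. The individual transcript segments are also stored separately.
SearchTranscripts: a Python class that uses the SQLite database to return a pandas DataFrame of the best results for a search query.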
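The chunking step that LoadTranscripts performs can be pictured roughly as follows. This is a minimal sketch, not the module's actual code: the function name `merge_segments` and the `target_words` parameter are hypothetical, and the real implementation may split chunks differently.

```python
from typing import Dict, List


def merge_segments(segments: List[Dict], target_words: int = 300) -> List[Dict]:
    """Group consecutive segments into chunks of roughly target_words words.

    Hypothetical helper for illustration; each segment is a dict with
    "start", "end" and "text" keys, like the JSON format in this README.
    """
    chunks: List[Dict] = []
    texts: List[str] = []
    start = None
    end = None
    word_count = 0

    for seg in segments:
        text = seg["text"].strip()
        if start is None:
            start = seg["start"]  # a chunk starts where its first segment starts
        texts.append(text)
        end = seg["end"]  # and ends where its latest segment ends
        word_count += len(text.split())

        if word_count >= target_words:
            chunks.append({"start": start, "end": end, "text": " ".join(texts)})
            texts, start, word_count = [], None, 0

    if texts:  # keep the shorter final chunk
        chunks.append({"start": start, "end": end, "text": " ".join(texts)})
    return chunks
```

Keeping the start and end times on each chunk is what lets a search hit point back to a position in the audio.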
Once the SQLite database has been created with LoadTranscripts, you can access it through any SQLite interface you like, such as Datasette, DBeaver, the command line, SQLAlchemy, and so on. The SearchTranscripts class is a simple, convenient way to query it from Python, using the built-in sqlite3 module and pandas.
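If you want to explore the database yourself, a reasonable first step is to see which tables it contains. The sketch below uses only the built-in sqlite3 module; `list_tables` is a hypothetical helper, and it makes no assumption about the table names that LoadTranscripts creates.

```python
import sqlite3
from contextlib import closing
from typing import List, Tuple


def list_tables(db_path: str) -> List[Tuple[str, str]]:
    """Return (name, type) for every table and view in a SQLite file.

    Note: sqlite3.connect() creates an empty file if db_path does not exist.
    """
    query = (
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    )
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(query).fetchall()
```

For example, `list_tables("main.db")` lists whatever LoadTranscripts wrote. The same connection can be passed to `pandas.read_sql_query` if you prefer to work with DataFrames.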
Clone the repo, cd into its main directory, then run:
pip install .
from search_transcripts import LoadTranscripts, SearchTranscripts
l = LoadTranscripts('transcripts') ## will create main.db and bm25.pickle
s = SearchTranscripts()
## Returns a pandas dataframe of the top scoring transcript sections, across all transcripts.
s.search('starship enterprise')
## Find the exact phrase.
s.search('"starship enterprise"')
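The difference between the two queries mirrors how FTS5 match expressions behave. The sketch below illustrates this with a plain FTS5 table. It assumes the query string is handed to an FTS5 MATCH, which this README does not confirm (the bm25.pickle file suggests the module may also do its own ranking). The table name `chunks`, the `fts_search` helper, and the sample rows are made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical FTS5 table of text chunks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(text)")
conn.executemany(
    "INSERT INTO chunks (text) VALUES (?)",
    [
        ("the starship enterprise left orbit",),
        ("an enterprise plan for a new starship",),
        ("nothing relevant here",),
    ],
)


def fts_search(query: str) -> list:
    """Return matching texts, best BM25 score first."""
    rows = conn.execute(
        "SELECT text FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


# Bare terms: every row that contains both words, in any order.
both_terms = fts_search("starship enterprise")

# Double quotes: only rows where the words appear as an adjacent phrase.
exact_phrase = fts_search('"starship enterprise"')
```

Here `both_terms` holds the first two rows, while `exact_phrase` holds only the first.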
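Before I realized that Whisper could write standard .vtt files, I was using its Python API directly, which produces a list of Python dictionaries. Saving that as JSON seemed logical at the time. I find JSON easier to read than .vtt, and it converts to VTT easily, so I still support this somewhat eccentric format. It looks like this: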
[
{
"start": 606.1800000000001,
"end": 610.74,
"text": " It's important to have a goal to work toward and accomplish rather than just randomly learning and half building things"
},
{
"start": 610.74,
"end": 613.0600000000001,
"text": " Having a specific thing you want to build is a good substitute"
},
{
"start": 613.38,
"end": 619.78,
"text": " Keep making things until you've made something you're proud enough a proud of enough to show off in an interview by the time you've built a few"
},
{
"start": 619.78,
"end": 624.26,
"text": " Things you'll start developing the taste you need to make that determination of what's quote unquote good enough"
}
]
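To show how little separates the two formats, here is a minimal sketch that turns a segment list like the one above into WebVTT text. `format_timestamp` and `segments_to_vtt` are hypothetical helpers written for this example and are not part of the module.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments: List[Dict]) -> str:
    """Build a WebVTT document from dicts with start, end and text keys."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```

Load the JSON file with `json.load`, pass the resulting list to `segments_to_vtt`, and write the returned string to a .vtt file.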