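This code is designed to make a large collection of OpenAI Whisper transcripts easy to search, so that you can find where a particular passage or term occurs and at what point in the transcript. It should, however, work with any folder of .vtt files, including podcast transcripts that did not come from OpenAI tools.
I use OpenAI Whisper to transcribe the Accidental Tech Podcast, and I have deployed a live search engine website front end here, powered by this module (specifically SearchTranscripts).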
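The module has two classes:
LoadTranscripts: creates a SQLite database and an FTS5 virtual table from a folder of transcript files (.vtt or .json). It combines the short transcript segments in the original files into longer text chunks (about 300 words each), and those chunks are what gets searched. It saves the individual transcript segments in a separate database.
SearchTranscripts: a Python class that uses the SQLite database to return a pandas DataFrame of the best results for a search query.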
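The chunking step described for LoadTranscripts can be pictured with a minimal sketch like the one below. This is not the module's actual implementation: the function name `chunk_segments`, the `target_words` parameter, and the exact cut-off rule are assumptions made for illustration. It merges consecutive segments until a chunk reaches roughly the target word count, keeping the start time of the first segment and the end time of the last.

```python
from typing import Dict, List


def chunk_segments(segments: List[Dict], target_words: int = 300) -> List[Dict]:
    """Merge consecutive segments into chunks of roughly `target_words` words.

    Hypothetical helper for illustration; the real module may chunk differently.
    """
    chunks = []
    texts, word_count, chunk_start = [], 0, None
    for seg in segments:
        text = seg["text"].strip()
        if chunk_start is None:
            chunk_start = seg["start"]  # a chunk starts where its first segment starts
        texts.append(text)
        word_count += len(text.split())
        if word_count >= target_words:
            # Close the chunk at the end time of the segment that filled it.
            chunks.append({"start": chunk_start, "end": seg["end"], "text": " ".join(texts)})
            texts, word_count, chunk_start = [], 0, None
    if texts:
        # Leftover segments form a final, shorter chunk.
        chunks.append({"start": chunk_start, "end": segments[-1]["end"], "text": " ".join(texts)})
    return chunks
```

Each chunk keeps its start and end times, so a search hit on a chunk can still be traced back to a position in the recording.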
Once the SQLite database has been created with LoadTranscripts, you can access it through any SQLite interface you like, e.g. Datasette, DBeaver, the command line, SQLAlchemy, etc. The SearchTranscripts class is a simple, convenient way to do this from Python, using the built-in sqlite3 module and pandas.
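As a rough idea of what direct access looks like, here is a small sketch that queries an FTS5 table with Python's built-in sqlite3 module. It builds its own in-memory database as a stand-in; the table name `demo_chunks` and its columns are hypothetical and are not the schema LoadTranscripts produces, so inspect the real database for the actual names. It also assumes your Python's SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3

# In-memory stand-in for the database file that LoadTranscripts creates.
conn = sqlite3.connect(":memory:")

# Hypothetical FTS5 table and column names, chosen only for this sketch.
conn.execute(
    "CREATE VIRTUAL TABLE demo_chunks USING fts5(episode, start_time UNINDEXED, chunk_text)"
)
conn.executemany(
    "INSERT INTO demo_chunks VALUES (?, ?, ?)",
    [
        ("ep1", 12.5, "the starship enterprise left spacedock"),
        ("ep2", 98.0, "an enterprise customer bought a starship model"),
        ("ep3", 40.0, "nothing relevant here"),
    ],
)


def search(query, limit=10):
    """Run an FTS5 MATCH query and return the best-ranked rows first."""
    sql = (
        "SELECT episode, start_time, chunk_text FROM demo_chunks "
        "WHERE demo_chunks MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


both_terms = search("starship enterprise")      # rows containing both words, in any order
exact_phrase = search('"starship enterprise"')  # rows containing the exact phrase
```

In FTS5 query syntax, bare terms separated by a space are combined with an implicit AND, while double quotes request a phrase match.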
Clone the repo, cd into its main directory, and then run:
pip install .
from search_transcripts import LoadTranscripts, SearchTranscripts
l = LoadTranscripts('transcripts') ## will create main.db and bm25.pickle
s = SearchTranscripts()
## Returns a pandas dataframe of the top scoring transcript sections, across all transcripts.
s.search('starship enterprise')
## Find the exact phrase
s.search('"starship enterprise"')
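Before I realized that Whisper creates standard .vtt files, I was using its Python API directly. It produces a list of Python dictionaries, and saving that as JSON seemed logical at the time. I find JSON easier to read than .vtt, and it can easily be converted to VTT, so I still support this somewhat quirky format. It looks like this: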
[
{
"start": 606.1800000000001,
"end": 610.74,
"text": " It's important to have a goal to work toward and accomplish rather than just randomly learning and half building things"
},
{
"start": 610.74,
"end": 613.0600000000001,
"text": " Having a specific thing you want to build is a good substitute"
},
{
"start": 613.38,
"end": 619.78,
"text": " Keep making things until you've made something you're proud enough a proud of enough to show off in an interview by the time you've built a few"
},
{
"start": 619.78,
"end": 624.26,
"text": " Things you'll start developing the taste you need to make that determination of what's quote unquote good enough"
}
]
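To illustrate the conversion to VTT mentioned above, here is a minimal sketch that turns this JSON format into WebVTT text. The helper names `format_timestamp` and `segments_to_vtt` are made up for this example and are not part of the module. It assumes each entry has "start" and "end" in seconds plus a "text" field, as in the sample.

```python
import json


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))  # rounding absorbs float noise like 606.1800000000001
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments) -> str:
    """Build WebVTT text from a list of {"start", "end", "text"} dictionaries."""
    lines = ["WEBVTT", ""]  # required header, followed by a blank line
    for seg in segments:
        lines.append(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


# Two segments in the JSON format shown above (text shortened for the example).
raw = """[
  {"start": 606.1800000000001, "end": 610.74, "text": " It's important to have a goal"},
  {"start": 610.74, "end": 613.0600000000001, "text": " Having a specific thing you want to build is a good substitute"}
]"""
vtt_text = segments_to_vtt(json.loads(raw))
```

Writing `vtt_text` to a file with a .vtt extension should give a file that LoadTranscripts can read alongside other .vtt transcripts.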