THIS REPOSITORY IS NO LONGER MAINTAINED
textpipe is a Python package for converting raw text into clean, readable text and
extracting metadata from that text. Its features include cleaning raw text by
removing HTML and other unreadable constructs, and extracting metadata such as
the number of words and the named entities in the text.

It is recommended that you install textpipe using a virtual environment.
First, create your virtual environment using venv, virtualenv, or virtualenvwrapper.

Using venv, if your default interpreter is Python 3.6:
    python3 -m venv .venv
Using virtualenv:
    virtualenv venv -p python3.6
Using virtualenvwrapper:
    mkvirtualenv textpipe -p python3.6

Then install textpipe using pip:
    pip install textpipe
Install the required packages using requirements.txt:
    pip install -r requirements.txt

The requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, but you can change this to whichever model and language your intended use requires. See spaCy.io's page on their different models for more information.
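Because spaCy models are installed as ordinary Python packages, you can check whether the model you need is present before running textpipe. The following is a minimal sketch using only the standard library; the helper name `spacy_model_installed` is hypothetical and not part of textpipe or spaCy.

```python
import importlib.util


def spacy_model_installed(model_name="en_core_web_sm"):
    """Return True if a top-level package with this name can be imported.

    Hypothetical helper for illustration; it only checks that the package
    exists and does not load the model.
    """
    return importlib.util.find_spec(model_name) is not None


if not spacy_model_installed():
    # spaCy ships a download command for its published models
    print("en_core_web_sm not found; install it with "
          "`python -m spacy download en_core_web_sm`")
```

Pass a different model name to the helper if you changed the model in requirements.txt.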
>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
'Sample text!'
>>> print(document.language)
'en'
>>> print(document.nwords)
2
>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3}

To extend the existing textpipe operations with your own custom operations:
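from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

def custom_op(doc, context=None, settings=None, **kwargs):
    # A custom operation receives the document and returns its result
    return 1

custom_argument = {'argument': 1}
# Register the operation under a name, then append it as a pipeline step
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))

See CONTRIBUTING for guidelines for contributors.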
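As a slightly fuller illustration of a custom operation, the sketch below follows the same function signature as `custom_op`. It assumes that the `doc` argument exposes a `clean` attribute like the `Doc` object shown earlier, and that the dictionary attached to the step arrives as `settings`; both are assumptions, as is the `min_length` setting name. A stand-in object is used so the function can be tried without textpipe installed.

```python
from types import SimpleNamespace


def count_long_words(doc, context=None, settings=None, **kwargs):
    """Count whitespace-separated tokens in doc.clean of at least min_length characters."""
    # `min_length` is a hypothetical setting name chosen for this example
    min_length = (settings or {}).get('min_length', 5)
    return sum(1 for word in doc.clean.split() if len(word) >= min_length)


# Stand-in for a textpipe document (assumption: only `clean` is needed here)
stand_in_doc = SimpleNamespace(clean='Sample text!')
result = count_long_words(stand_in_doc, settings={'min_length': 6})
```

Such a function would be registered and added to a pipeline in the same way as `custom_op`, using `register_operation` and `steps.append`.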
0.12.1
0.12.0
0.11.9
  ents properties
0.11.8
  cats attribute
0.11.7
0.11.6
0.11.5
0.11.4
0.11.1
0.11.0
0.9.0
0.8.6
0.8.5
0.8.4
0.8.3
0.8.2
0.8.1
0.8.0
0.7.2
0.7.0
  context kwarg
  register_operation in pipeline