clintk.text_parser.parser module¶
Object to parse text reports, compatible with the scikit-learn transformer API.
The format of typical reports to be parsed can be found in the data/ directory of this repo. ReportsParser enables choosing custom:
- section delimiters with headers attribute
- tags that do not contain informative text (the style tag, for instance) with remove_tags
- additional stop words, that may be specific to a corpus or a task
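The section-splitting behaviour described above can be sketched with the standard library alone. The class below is a hypothetical illustration (not the actual ReportsParser implementation): it treats each h3 tag as a section header and discards text inside removed tags such as style or table.

```python
from html.parser import HTMLParser

class SectionSketch(HTMLParser):
    """Hypothetical sketch of section splitting: tags named in
    `header` delimit sections, and text inside `remove_tags` is
    discarded (as ReportsParser does with headers/remove_tags)."""
    def __init__(self, header='h3', remove_tags=('style', 'table')):
        super().__init__()
        self.header = header
        self.remove_tags = set(remove_tags)
        self.sections = {}    # section name -> list of text chunks
        self.current = None   # name of the section being filled
        self.in_header = False
        self.skip_depth = 0   # depth inside removed tags

    def handle_starttag(self, tag, attrs):
        if tag == self.header:
            self.in_header = True
        elif tag in self.remove_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self.header:
            self.in_header = False
        elif tag in self.remove_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.in_header:
            self.current = text.lower()
            self.sections.setdefault(self.current, [])
        elif self.current is not None:
            self.sections[self.current].append(text)

report = ("<h3>Diagnosis</h3><p>stable disease</p>"
          "<style>p {}</style>"
          "<h3>Treatment</h3><p>chemotherapy</p>")
parser = SectionSketch()
parser.feed(report)
print(parser.sections)
# {'diagnosis': ['stable disease'], 'treatment': ['chemotherapy']}
```

Keeping only a subset of the resulting section names corresponds to passing a tuple to the sections parameter.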
@TODO add examples
@TODO change remove_sections into sections_to_keep
class clintk.text_parser.parser.ReportsParser(strategy='strings', sections=None, remove_tags=['h4', 'table', 'link', 'style'], col_name='report', headers='h3', is_html=True, stop_words=[], norm=True, verbose=False, n_jobs=1)[source]¶
Bases: sklearn.base.BaseEstimator
A parser for HTML-like text reports.
Parameters: - strategy (string, default='strings') – defines the type of object returned by the transformation. If 'strings', each line of the returned df is a string; 'strings' is to be used with CountVectorizer and TfidfVectorizer. If 'tokens', the string is split into a list of words; 'tokens' is to be used with gensim's Word2Vec and Doc2Vec models
- sections (tuple or None, default=None) – tuple containing the section names to keep; if None, keep all the sections
- remove_tags (list, default=['h4', 'table', 'link', 'style']) – list of tags to remove from the html page
- headers (str or None, default='h3') – name of the html tag that delimits the sections in the report
- is_html (bool, default=True) – boolean indicating whether the structure of the reports is strictly html format or not. Check the usage documentation for details
- stop_words (list, default=[]) – additional words to remove from the text, specific to the kind of parsed document
- verbose (bool, default=False)
- norm (bool, default=True) – whether to normalise the text (removing stop words, lemmatization, etc.)
- n_jobs (int, default=1) – number of CPU cores to use; if -1, all the available ones are used
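The effect of the strategy parameter can be sketched in plain Python. This is a hypothetical illustration of the difference between the two output shapes, not the actual transform code:

```python
def finalize(text, strategy='strings'):
    """Sketch of the strategy parameter: 'strings' keeps one string
    per report (suited to CountVectorizer/TfidfVectorizer input),
    while 'tokens' splits it into a list of words (suited to
    gensim's Word2Vec and Doc2Vec)."""
    if strategy == 'tokens':
        return text.split()
    return text

cleaned = "stable disease under treatment"
print(finalize(cleaned, 'strings'))
# 'stable disease under treatment'
print(finalize(cleaned, 'tokens'))
# ['stable', 'disease', 'under', 'treatment']
```

In both cases the upstream cleaning (tag removal, stop words, normalisation) is assumed to have already happened; only the final representation differs.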
See also
text_parser