clintk.text_parser.parser module

Object for parsing text reports, compatible with the scikit-learn transformer API.

The format of typical reports to be parsed can be found in the data/ directory of this repo. ReportsParser lets you customise:

section delimiters, with the headers attribute
tags that do not contain informative text (a style tag, for instance), with remove_tags
additional stop words, which may be specific to a corpus or a task

@TODO add examples
@TODO change remove_sections into sections_to_keep

class clintk.text_parser.parser.ReportsParser(strategy='strings', sections=None, remove_tags=['h4', 'table', 'link', 'style'], col_name='report', headers='h3', is_html=True, stop_words=[], norm=True, verbose=False, n_jobs=1)

Bases: sklearn.base.BaseEstimator

A parser for HTML-like text reports.

Parameters:

strategy (string, default='strings') – defines the type of object returned by the transformation. If 'strings', each entry of the returned object is a string; use this for CountVectorizer and TfidfVectorizer. If 'tokens', each string is split into a list of words; use this for gensim's Word2Vec and Doc2Vec models
sections (tuple or None, default=None) – tuple of section names to keep; if None, all sections are kept
remove_tags (list, default=['h4', 'table', 'link', 'style']) – list of tags to remove from the HTML page
headers (str or None, default='h3') – name of the HTML tag that delimits sections
is_html (bool, default=True) – whether the reports are strictly HTML-formatted. See the usage documentation for details
stop_words (list, default=[]) – additional words to remove from the text, specific to the kind of parsed document
verbose (bool, default=False)
norm (bool, default=True) – whether to normalise the text (stop-word removal, lemmatization, etc.)
n_jobs (int, default=1) – number of CPU cores to use; if -1, all available cores are used
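
The interplay of headers, sections and remove_tags described above can be illustrated with a minimal standard-library sketch. This is not the ReportsParser implementation: SectionSplitter and split_sections are hypothetical names introduced only for this example, and the sample report is made up. It assumes that text following a header tag belongs to that header's section and that everything inside a removed tag is dropped.

```python
from html.parser import HTMLParser

# Void elements have no closing tag and carry no text, so there is nothing to skip.
VOID_TAGS = {"link", "br", "hr", "img", "meta", "input"}


class SectionSplitter(HTMLParser):
    """Hypothetical helper (not part of clintk): split a report into sections."""

    def __init__(self, headers="h3", remove_tags=("h4", "table", "link", "style")):
        super().__init__()
        self.headers = headers
        self.remove_tags = set(remove_tags) - VOID_TAGS
        self.sections = {}       # section title -> list of text chunks
        self._current = None     # title of the section currently being filled
        self._in_header = False  # True while inside a header tag
        self._title_parts = []   # text chunks of the header being read
        self._skip_depth = 0     # > 0 while inside a removed tag

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            self._skip_depth += 1
        elif tag == self.headers:
            self._in_header = True
            self._title_parts = []

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self.headers and self._in_header:
            self._in_header = False
            self._current = " ".join(self._title_parts).strip()
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth > 0:
            return
        if self._in_header:
            self._title_parts.append(text)
        elif self._current is not None:
            self.sections[self._current].append(text)


def split_sections(report, headers="h3", sections=None,
                   remove_tags=("h4", "table", "link", "style")):
    """Return {section title: text}, keeping only `sections` when it is given."""
    parser = SectionSplitter(headers, remove_tags)
    parser.feed(report)
    parser.close()
    return {
        title: " ".join(chunks)
        for title, chunks in parser.sections.items()
        if sections is None or title in sections
    }


# Made-up report, for illustration only.
report = (
    "<style>p {color: red;}</style>"
    "<h3>History</h3><p>Patient reports mild cough.</p>"
    "<table><tr><td>ignored cell</td></tr></table>"
    "<h3>Conclusion</h3><p>No sign of <b>progression</b></p>"
)

all_sections = split_sections(report)
# {'History': 'Patient reports mild cough.', 'Conclusion': 'No sign of progression'}
kept = split_sections(report, sections=("Conclusion",))
# {'Conclusion': 'No sign of progression'}
```

With sections=None every section is returned, mirroring the documented default; passing a tuple of names keeps only those sections.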

Parameters:	X (pd.Series or DataFrame) – each entry is a string defining a report
Returns:	each entry is either a string or a list of words, depending on the strategy
Return type:	pd.Series
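
To make the two output shapes concrete, here is a small pure-Python sketch of what a 'strings' entry and a 'tokens' entry could look like. It does not call clintk: render is a hypothetical helper, and the lowercasing and the simple regular-expression tokenisation are assumptions made for this example (the real normalisation also involves lemmatization, which is not reproduced here).

```python
import re


def render(report_text, strategy="strings", stop_words=()):
    """Hypothetical helper (not part of clintk) mimicking the two output shapes."""
    stop = {word.lower() for word in stop_words}
    # Assumption: lowercase the text and keep alphanumeric runs as tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", report_text.lower())
              if t not in stop]
    if strategy == "tokens":
        return tokens             # list of words, e.g. for Word2Vec / Doc2Vec
    if strategy == "strings":
        return " ".join(tokens)   # single string, e.g. for CountVectorizer
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up reports, for illustration only.
reports = ["No sign of progression", "Patient reports mild cough"]

as_strings = [render(r, "strings", stop_words=["of"]) for r in reports]
# ['no sign progression', 'patient reports mild cough']
as_tokens = [render(r, "tokens", stop_words=["of"]) for r in reports]
# [['no', 'sign', 'progression'], ['patient', 'reports', 'mild', 'cough']]
```

The string form suits bag-of-words vectorizers that tokenise on their own, while the list form matches models that expect pre-tokenised sentences.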