clintk.text_parser.parser module

Object for parsing text reports, compatible with the scikit-learn transformer API.

The format of typical reports to be parsed can be found in the data/ directory of this repo. ReportsParser lets you customise:

  • section delimiters, with the headers attribute
  • tags that don't contain informative text (the style tag, for instance), with remove_tags
  • additional stop words that may be specific to a corpus or a task

@TODO add examples @TODO change remove_sections into sections_to_keep

class clintk.text_parser.parser.ReportsParser(strategy='strings', sections=None, remove_tags=['h4', 'table', 'link', 'style'], col_name='report', headers='h3', is_html=True, stop_words=[], norm=True, verbose=False, n_jobs=1)

Bases: sklearn.base.BaseEstimator

A parser for HTML-like text reports.

Parameters:
  • strategy (string, default='strings') – defines the type of object returned by the transformation. If 'strings', each entry of the returned object is a string; use this for CountVectorizer and TfidfVectorizer. If 'tokens', each string is split into a list of words; use this for gensim's Word2Vec and Doc2Vec models
  • sections (tuple or None, default=None) – tuple containing the names of the sections to keep. If None, all sections are kept
  • remove_tags (list, default=['h4', 'table', 'link', 'style']) – list of tags to remove from the HTML page
  • headers (str or None, default='h3') – name of the HTML tag that delimits the sections
  • is_html (bool, default=True) – whether the reports are strictly in HTML format. See the usage documentation for details
  • stop_words (list, default=[]) – additional words to remove from the text, specific to the kind of document being parsed
  • verbose (bool, default=False)
  • norm (bool, default=True) – whether to normalise the text (stop-word removal, lemmatization, etc.)
  • n_jobs (int, default=1) – number of CPU cores to use. If -1, all available cores are used
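
To make the roles of headers, remove_tags and sections concrete, here is a minimal sketch built only on the Python standard library. It is not the clintk implementation: SectionSplitter, split_sections and VOID_TAGS are hypothetical names introduced for illustration, and the real parser may treat edge cases differently.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they carry no text to skip.
VOID_TAGS = {"link", "meta", "br", "img", "hr", "input"}


class SectionSplitter(HTMLParser):
    """Hypothetical helper (not part of clintk): groups text by header tag."""

    def __init__(self, headers="h3", remove_tags=("h4", "table", "link", "style")):
        super().__init__()
        self.headers = headers
        self.remove_tags = set(remove_tags)
        self.sections = {}       # section title -> list of text chunks
        self._current = None     # title of the section being filled
        self._in_header = False  # True while inside a header tag
        self._skip_depth = 0     # > 0 while inside a removed tag

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            if tag not in VOID_TAGS:
                self._skip_depth += 1
        elif tag == self.headers:
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.remove_tags:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag == self.headers:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth > 0:
            return
        if self._in_header:
            # A header starts a new section named after its text.
            self._current = text
            self.sections.setdefault(text, [])
        elif self._current is not None:
            self.sections[self._current].append(text)


def split_sections(html, sections=None, **kwargs):
    """Return {section title: text}, optionally restricted to `sections`."""
    parser = SectionSplitter(**kwargs)
    parser.feed(html)
    parser.close()
    result = {title: " ".join(chunks) for title, chunks in parser.sections.items()}
    if sections is not None:
        result = {t: txt for t, txt in result.items() if t in sections}
    return result


report = (
    "<h3>History</h3><p>Patient stable.</p>"
    "<style>p {color: red}</style>"
    "<h3>Conclusion</h3><p>No progression.</p>"
    "<table><tr><td>1</td></tr></table>"
)
print(split_sections(report))
print(split_sections(report, sections=("Conclusion",)))
```

In this sketch the text inside style and table never reaches a section, and passing sections=("Conclusion",) keeps only that section, which mirrors the behaviour the parameter descriptions above suggest.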

See also

text_parser

fit(X, y=None)
transform(X)

Parses the input reports.

Parameters: X (pd.Series or DataFrame) – each entry is a string defining a report
Returns: each entry is either a string or a list of words, depending on the strategy
Return type: pd.Series
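
The difference between the two strategy values can be illustrated with a small standard-library sketch. The normalise function below is a hypothetical stand-in, not the clintk code: it lowercases, tokenises on word characters and drops the supplied stop words, but does no lemmatization.

```python
import re


def normalise(text, strategy="strings", stop_words=()):
    """Hypothetical stand-in for one report's output under each strategy."""
    stop = set(stop_words)
    # Lowercase, split on word characters, and drop the extra stop words.
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stop]
    if strategy == "tokens":
        return tokens            # list of words, e.g. for Word2Vec / Doc2Vec
    if strategy == "strings":
        return " ".join(tokens)  # single string, e.g. for CountVectorizer
    raise ValueError(f"unknown strategy: {strategy!r}")


text = "The tumour is stable"
print(normalise(text, strategy="strings", stop_words=["the", "is"]))
print(normalise(text, strategy="tokens", stop_words=["the", "is"]))
```

The 'strings' form is what text vectorizers expect as raw documents, while the 'tokens' form matches the list-of-words input that gensim models take.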