clintk.text_parser.parser_utils module

This script contains the functions used to parse a single report, i.e. the functions that split the HTML text into a dictionary of sections.

Only main_parser is used in practice; all the other functions are auxiliary. Moreover, they should not be called “as-is”, since the ReportsParser object wraps them for convenience.

clintk.text_parser.parser_utils.clean_soup(soup, remove, verbose)[source]

Remove the tags listed in the remove parameter from the soup. The transformation is done in place.

@TODO change function name to to_alpha_num

Parameters:
  • soup (BeautifulSoup instance)
  • remove (list) – name of the tags to remove from the soup
  • verbose (bool) – controls logging
Returns:

the same object as the input, since the transformation is done in place

Return type:

BeautifulSoup

clintk.text_parser.parser_utils.clean_string(s)[source]

Remove non-alphanumeric characters from string s and return the lowercased string.

Parameters: s (str)
Returns: lowercased string containing only alphanumeric characters
Return type: str

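
The behaviour described above can be approximated with a short sketch. It is not the clintk source: the name clean_string_sketch is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that whitespace is dropped along with punctuation.

```python
import re

# Hypothetical re-implementation for illustration; not the clintk source.
# Assumption: anything that is not an ASCII letter or digit is removed,
# including whitespace and accented characters.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def clean_string_sketch(s: str) -> str:
    """Drop non-alphanumeric characters, then lowercase the result."""
    return _NON_ALNUM.sub("", s).lower()


print(clean_string_sketch("Compte-Rendu n°12 :"))  # prints: compterendun12
```

If section names need to stay readable, a variant that replaces the matched characters with a single space instead of an empty string may be preferable.
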
clintk.text_parser.parser_utils.last_tag_text(final_tag, is_html)[source]

Fetches the text of the last tag.

Parameters:
  • final_tag
  • is_html
Returns: content of the last section
Return type: string

clintk.text_parser.parser_utils.main_parser(text, is_html, verbose, remove, headers)[source]

Takes the report string as input and splits it into sections.

Parameters:
  • text (string) – report in html format
  • is_html (bool) – set True if text is actually structured as html
  • verbose (bool) – True for logging
  • remove (list) – names of the tags to remove because they contain useless information
  • headers (string) – name of the tags that delimit the sections
Returns:

keys are section names, values are the content of the section

Return type:

dict

clintk.text_parser.parser_utils.parse_soup(soup, is_html, verbose, headers='h3')[source]

Splits the soup between headers and returns a dictionary.

Parameters:
  • soup (BeautifulSoup)
  • is_html (bool) – true if text is exact html format
  • verbose (bool, (default=False)) – whether to print information about parsing
  • headers (string) – delimiters of the sections
Returns:

keys are section names, values are section contents

Return type:

dict
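
To make the idea of splitting a report between headers concrete, here is a minimal sketch built on the standard library's html.parser rather than BeautifulSoup. It is an illustration under stated assumptions, not the clintk implementation: SectionSplitter and split_sections are hypothetical names, text found before the first header is ignored, and the text chunks of a section are joined with single spaces.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Collect the text following each header tag (illustrative only)."""

    def __init__(self, headers="h3"):
        super().__init__()
        self.headers = headers
        self.sections = {}        # section name -> list of text chunks
        self._in_header = False   # True while inside a header tag
        self._title_parts = []    # text pieces of the header being read
        self._current = None      # name of the section being filled

    def handle_starttag(self, tag, attrs):
        if tag == self.headers:
            self._in_header = True
            self._title_parts = []

    def handle_endtag(self, tag):
        if tag == self.headers and self._in_header:
            self._in_header = False
            self._current = "".join(self._title_parts).strip()
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        if self._in_header:
            self._title_parts.append(data)
        elif self._current is not None and data.strip():
            # Assumption: text before the first header is discarded.
            self.sections[self._current].append(data.strip())


def split_sections(text, headers="h3"):
    """Return a dict mapping section names to section contents."""
    parser = SectionSplitter(headers)
    parser.feed(text)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.sections.items()}


report = (
    "<h3>History</h3><p>No prior surgery.</p>"
    "<h3>Conclusion</h3><p>Stable.</p><p>Follow up in 6 months.</p>"
)
print(split_sections(report))
# prints: {'History': 'No prior surgery.', 'Conclusion': 'Stable. Follow up in 6 months.'}
```

The real functions take extra arguments (is_html, verbose) that this sketch leaves out; how they affect the split is not documented here.
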

clintk.text_parser.parser_utils.text_between_tags(tag1, tag2, is_html)[source]

This function fetches the text between tag1 and tag2.

The soup should already be cleaned of useless tags such as span.

Parameters:
  • tag1
  • tag2
  • is_html
Returns:

all the text between tag1 and tag2

Return type:

str