clintk.text_parser.parser_utils module

This module contains the functions used to parse a single report, i.e. the functions that split the HTML text into a dictionary of sections.

In practice only main_parser is called; all the other functions are auxiliary helpers. None of them is meant to be used “as-is”, since they are wrapped in the ReportsParser object for convenience.

clintk.text_parser.parser_utils.clean_soup(soup, remove, verbose)

Removes the tags listed in the remove parameter from the soup. The transformation is done in place.

@TODO change function name to to_alpha_num

Parameters:	soup (BeautifulSoup instance)
	remove (list) – names of the tags to remove from the soup
	verbose (bool) – controls logging
Returns:	the input soup itself, since the transformation is done in place
Return type:	BeautifulSoup

clintk.text_parser.parser_utils.clean_string(s)

Removes non-alphanumeric characters from the string s and returns the result in lowercase.

Parameters:	s (str)
Returns:	lowercased string containing only alphanumeric characters
Return type:	str
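
The behaviour described above can be sketched with the standard library. This is only an illustration, not the clintk implementation: the name clean_string_sketch is hypothetical, and the character class assumes that "alphanumeric" means ASCII letters and digits, so whitespace and accented characters are dropped as well.

```python
import re


def clean_string_sketch(s: str) -> str:
    """Drop every character that is not an ASCII letter or digit, then lowercase."""
    # Assumption: only A-Z, a-z and 0-9 are kept; spaces are removed too.
    return re.sub(r"[^A-Za-z0-9]", "", s).lower()


print(clean_string_sketch("Hello, World! 42"))
```

A normalisation of this kind is typically useful for making section names comparable across reports whose headers differ only in punctuation or case.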

clintk.text_parser.parser_utils.last_tag_text(final_tag, is_html)

Fetches the text that follows the last tag, i.e. the content of the last section.

Parameters:	final_tag
	is_html
Returns:	content of the last section
Return type:	string

clintk.text_parser.parser_utils.main_parser(text, is_html, verbose, remove, headers)

Takes the report as a string and splits it into sections.

Parameters:	text (string) – report in HTML format
	is_html (bool) – set to True if text is actually structured as HTML
	verbose (bool) – True for logging
	remove (list) – names of the tags to remove because they contain useless information
	headers (string) – name of the tags that delimit the sections
Returns:	keys are section names, values are the content of the section
Return type:	dict
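
To illustrate the kind of result described above, here is a minimal sketch of header-delimited section splitting. It is not the clintk code: it uses the standard-library html.parser rather than BeautifulSoup so that it runs without dependencies, and the names SectionSplitter and split_sections are hypothetical. It assumes that text before the first header is ignored and that every tag listed in remove has a closing tag.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Collect the text under each header tag into a {title: content} dict."""

    def __init__(self, headers="h3", remove=()):
        super().__init__(convert_charrefs=True)
        self.headers = headers
        self.remove = set(remove)
        self.sections = {}
        self.title = None       # current section title, None before the first header
        self.in_header = False  # True while reading the header's own text
        self.skip_depth = 0     # > 0 while inside a tag listed in `remove`

    def handle_starttag(self, tag, attrs):
        if tag in self.remove:
            # Assumption: removed tags are paired, never void tags like <br>.
            self.skip_depth += 1
        elif tag == self.headers and self.skip_depth == 0:
            self.in_header = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag in self.remove:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.headers and self.in_header:
            self.in_header = False
            self.title = self.title.strip()
            self.sections.setdefault(self.title, "")

    def handle_data(self, data):
        if self.skip_depth or self.title is None:
            return  # inside a removed tag, or before the first header
        if self.in_header:
            self.title += data
        else:
            self.sections[self.title] += data


def split_sections(text, remove=(), headers="h3"):
    """Return a dict mapping section names to whitespace-normalised content."""
    parser = SectionSplitter(headers=headers, remove=remove)
    parser.feed(text)
    parser.close()
    return {k: " ".join(v.split()) for k, v in parser.sections.items()}


report = (
    "<h3>History</h3><p>No prior <span>x</span>surgery.</p>"
    "<h3>Conclusion</h3><p>Stable disease.</p>"
)
print(split_sections(report, remove=["span"]))
```

The sample report and its section names are made up for the example. In clintk itself the same job is split across clean_soup, parse_soup, text_between_tags and last_tag_text.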

clintk.text_parser.parser_utils.parse_soup(soup, is_html, verbose, headers='h3')

Splits the soup between headers and returns a dictionary.

Parameters:	soup (BeautifulSoup)
	is_html (bool) – True if text is in exact HTML format
	verbose (bool, default=False) – whether to print information about parsing
	headers (string) – delimiters of the sections
Returns:	keys are section names, values are section contents
Return type:	dict

clintk.text_parser.parser_utils.text_between_tags(tag1, tag2, is_html)

Fetches the text between tag1 and tag2.

The soup should already have been cleaned of useless tags such as span.

Parameters:	tag1
	tag2
	is_html
Returns:	all the text between tag1 and tag2
Return type:	str