clintk.text_parser.parser_utils module
This script contains the functions used to parse one report, i.e. the functions that split the HTML text into a dictionary of sections.
Only main_parser is used in practice since all the other functions are auxiliary. Moreover, they should not be used “as-is” since they are wrapped up in the ReportsParser object for convenience.
clintk.text_parser.parser_utils.clean_soup(soup, remove, verbose)
Removes the tags listed in the remove parameter from the soup. The transformation is done in place.
TODO: change function name to to_alpha_num.
Parameters: - soup (BeautifulSoup instance)
- remove (list) – name of the tags to remove from the soup
- verbose (bool) – controls logging
Returns: the same object as the input, since the transformation is done in place
Return type: BeautifulSoup
clintk.text_parser.parser_utils.clean_string(s)
Removes non-alphanumeric characters from the string s and returns the lowercased string.
Parameters: s (str)
Returns: lowercased string containing only alphanumeric characters
Return type: str
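The behaviour described for clean_string can be illustrated with a minimal sketch. This is not the library's implementation: the name clean_string_sketch is hypothetical, and it assumes that whitespace counts as non-alphanumeric and that Unicode letters (for example accented characters) are kept.

```python
def clean_string_sketch(s):
    """Hypothetical stand-in for clean_string, not the clintk source.

    Assumption: every character for which str.isalnum() is False,
    including whitespace and punctuation, is dropped.
    """
    # Keep only alphanumeric characters, then lowercase the result.
    return "".join(ch for ch in s if ch.isalnum()).lower()


cleaned = clean_string_sketch("Hello, World! 42")
print(cleaned)
```

Normalising section names this way makes them easier to compare across reports whose headers differ only in case or punctuation.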
clintk.text_parser.parser_utils.last_tag_text(final_tag, is_html)
Fetches the text from the last tag.
Parameters: final_tag
Returns: content of the last section
Return type: string
clintk.text_parser.parser_utils.main_parser(text, is_html, verbose, remove, headers)
Takes as input the string from the report and splits it into sections.
Parameters: - text (string) – report in html format
- is_html (bool) – set True if text is actually structured as html
- verbose (bool) – True for logging
- remove (list) – names of the tags to remove because they contain useless information
- headers (string) – name of the tags that delimit the sections
Returns: keys are section names, values are the content of each section
Return type: dict
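The idea of splitting a report into sections delimited by header tags can be sketched as follows. This is an illustration only, not the clintk implementation: it uses the standard library html.parser instead of BeautifulSoup so that it runs without extra dependencies, the names SectionSplitter and split_sections are hypothetical, and it ignores the remove, is_html and verbose parameters.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Hypothetical helper: collects the text found after each header tag."""

    def __init__(self, headers="h3"):
        super().__init__()
        self.headers = headers      # tag name that delimits sections
        self.sections = {}          # section title -> accumulated text
        self._in_header = False     # True while inside a header tag
        self._title_parts = []      # text fragments of the current title
        self._current = None        # title of the section being filled

    def handle_starttag(self, tag, attrs):
        if tag == self.headers:
            self._in_header = True
            self._title_parts = []

    def handle_endtag(self, tag):
        if tag == self.headers and self._in_header:
            # The header is complete: open a new, empty section.
            self._in_header = False
            self._current = "".join(self._title_parts).strip()
            self.sections[self._current] = ""

    def handle_data(self, data):
        if self._in_header:
            self._title_parts.append(data)
        elif self._current is not None:
            # Text before the first header is discarded in this sketch.
            self.sections[self._current] += data


def split_sections(text, headers="h3"):
    """Return a dict mapping section titles to section text."""
    parser = SectionSplitter(headers)
    parser.feed(text)
    parser.close()
    return {title: body.strip() for title, body in parser.sections.items()}


report = (
    "<h3>History</h3><p>Patient is stable.</p>"
    "<h3>Plan</h3><p>Follow up in <b>2</b> weeks.</p>"
)
sections = split_sections(report)
```

With this input, sections holds two entries, "History" and "Plan", each mapped to the plain text that follows its header. Nested tags such as b contribute their text to the enclosing section.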
clintk.text_parser.parser_utils.parse_soup(soup, is_html, verbose, headers='h3')
Splits the soup between headers and returns a dictionary.
Parameters: - soup (BeautifulSoup)
- is_html (bool) – true if text is exact html format
- verbose (bool, (default=False)) – whether to print information about parsing
- headers (string) – delimiters of the sections
Returns: keys are section names, values are section contents
Return type: dict
This function fetches the text between tag1 and tag2.
The soup should already have been cleaned of useless tags such as span.
Parameters: - tag1
- tag2
- is_html
Returns: all the text between tag1 and tag2
Return type: