What inputs does the pdfplumber.utils.cluster_objects function expect?


I am trying to extract the content of PDF documents using the pdfplumber package. I found the code below through a Google search:

pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5)

non_table_words is the list of words on the page in question that do not fall inside any of the table bounding boxes on that page.

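tables is a list with one entry per table on that page. Each entry is a dict holding the table content (a list of rows, each row a list of cell strings) under 'table' and the table's 'top' coordinate.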
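For context, this is roughly how the two inputs are put together. It is only a sketch using plain dicts instead of real pdfplumber objects: the helper names `inside_bbox` and `split_inputs` are my own, and I am assuming each table bounding box is an (x0, top, x1, bottom) tuple.

```python
def inside_bbox(word, bbox):
    """True if the word's box lies entirely within bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (
        word["x0"] >= x0
        and word["x1"] <= x1
        and word["top"] >= top
        and word["bottom"] <= bottom
    )


def split_inputs(words, table_boxes, table_rows):
    """Return (non_table_words, tables) in the shape passed to cluster_objects.

    words       -- list of word dicts with x0, x1, top and bottom keys
    table_boxes -- one bbox tuple per table (assumed layout, see above)
    table_rows  -- one list of rows per table, in the same order as table_boxes
    """
    # Keep only the words that are not inside any table bounding box.
    non_table_words = [
        w for w in words
        if not any(inside_bbox(w, box) for box in table_boxes)
    ]
    # Give every table a 'top' key so it can be keyed the same way as a word.
    tables = [
        {"table": rows, "top": box[1]}
        for box, rows in zip(table_boxes, table_rows)
    ]
    return non_table_words, tables
```

Giving each table a 'top' key is what lets words and tables go into the same list and be read with itemgetter('top').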

For text, the output of the above function looks something like this:

[{'text': 'Part', 'x0': 283.55999025, 'x1': 302.5229490361815, 'top': 50.991063499999996, 'doctop': 3418.6712235, 'bottom': 60.951438499999995, 'upright': True, 'direction': 1}, {'text': 'B', 'x0': 305.0131206065914, 'x1': 311.6467303565914, 'top': 50.991063499999996, 'doctop': 3418.6712235, 'bottom': 60.951438499999995, 'upright': True, 'direction': 1}]

For a table, the output of the above function looks something like this:

[{'table': [['S.No', 'Name of\nDisease', 'Definitions of Critical Illnesses'], ['1', 'Cancer of\nspecified\nseverity', 'A malignant tumor characterized by the uncontrolled growth and spread of malignant\ncells with invasion and destruction of normal tissues. This diagnosis must be\nsupported by histological evidence of malignancy. The term cancer includes\nleukemia, lymphoma and sarcoma.\nThe following are excluded –\ni. All tumors which are histologically described as carcinoma in situ, benign, pre-\nmalignant, borderline malignant, low malignant potential, neoplasm of unknown\nbehavior, or non-invasive, including but not limited to: Carcinoma in situ of breasts,\nCervical dysplasia CIN-1, CIN -2 and CIN-3.\nii. Any non-melanoma skin carcinoma unless there is evidence of metastases to lymph\nnodes or beyond;\niii. Malignant melanoma that has not caused invasion beyond the epidermis;\niv. All tumors of the prostate unless histologically classified as having a Gleason\nscore greater than 6 or having progressed to at least clinical TNM classification\nT2N0M0\nv. All Thyroid cancers histologically classified as T1N0M0 (TNM Classification) or\nbelow;\nvi. Chronic lymphocytic leukaemia less than RAI stage 3\nvii. Non-invasive papillary cancer of the bladder histologically described as TaN0M0\nor of a lesser classification,\nviii. All Gastro-Intestinal Stromal Tumors histologically classified as T1N0M0 (TNM\nClassification) or below and with mitotic count of less than or equal to 5/50 HPFs;'], ['2', 'Myocardia\nl\nInfarction', 'The first occurrence of heart attack or myocardial infarction, which means the death\nof a portion of the heart muscle as a result of inadequate blood supply to the relevant\narea. The diagnosis for Myocardial Infarction should be evidenced by all of the\nfollowing criteria:\ni. A history of typical clinical symptoms consistent with the diagnosis of acute\nmyocardial infarction (For e.g. typical chest pain)\nii. 
New characteristic electrocardiogram changes\niii. Elevation of infarction specific enzymes, Troponins or other specific biochemical\nmarkers.\nThe following are excluded:\ni. Other acute Coronary Syndromes\nii. Any type of angina pectoris']], 'top': 297.119998625}]
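From the two outputs above, my working guess is that the function orders the objects by the key (here 'top') and puts neighbours whose keys differ by no more than the tolerance into the same group. The sketch below is my own plain-Python approximation of that guess, not pdfplumber's code; `cluster_by_key` is a name I made up, and the sample values are shortened from the outputs above.

```python
from operator import itemgetter


def cluster_by_key(objects, key_fn, tolerance):
    """Group objects whose key values chain together within `tolerance`.

    Hypothetical stand-in for what cluster_objects appears to do.
    """
    ordered = sorted(objects, key=key_fn)  # stable sort by the key value
    clusters = []
    last_value = None
    for obj in ordered:
        value = key_fn(obj)
        if last_value is not None and value - last_value <= tolerance:
            # Close enough to the previous object: same cluster.
            clusters[-1].append(obj)
        else:
            # Gap larger than the tolerance: start a new cluster.
            clusters.append([obj])
        last_value = value
    return clusters


items = [
    {"text": "Part", "top": 50.99},
    {"text": "B", "top": 50.99},
    {"table": [["S.No", "Name of\nDisease"]], "top": 297.12},
    {"text": "Note", "top": 300.0},  # made-up word close to the table's top
]
clusters = cluster_by_key(items, itemgetter("top"), tolerance=5)
for cluster in clusters:
    print([obj.get("text", "<table>") for obj in cluster])
```

If this guess is right, the words "Part" and "B" land in one group because they share the same 'top', and anything within 5 units of the table's 'top' is grouped with the table.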

I tried to read the docstring for this function, but it returns None. I could not find any leads on the PyPI documentation page either.

Can anyone help me understand this function?

Thanks in advance.

-Subbu S.
