I am trying to use the CoreNLP annotation pipeline with default settings, all the way from tokenization through NER tagging. I observed that the tokenizer splits, say, "Vice President" into two individual tokens {Vice, President}, so the NER tags come out as {O, TITLE} instead of a single {Vice President} tagged {TITLE}. How can I get the tokenizer to treat "Vice President" as a single token, so that the NER tagger identifies titles appropriately?
How to identify n-grams before tokenization in Stanford CoreNLP?
365 views. Asked by Trinadh Gupta.
What properties are you using to get TITLE as an NER tag? This is not one of the standard tags, and if you're using the TokensRegexNER annotator (e.g., for the kbp annotator), multi-word titles like 'vice president' should be picked up. It works on corenlp.run at least.

In general, it's not the tokenizer's job to collapse NER spans into a single mention. The tokenizer should separate 'vice' and 'president' into different tokens, both of which should be marked TITLE by an appropriate NER annotator.

You may be interested in the entitymentions annotator, which groups contiguous NER tags into NER mentions. This would give you 'vice president' as a single mention, rather than two tokens both marked as TITLE. These mentions can be retrieved using the MentionsAnnotation on a sentence CoreMap, or using the List<String> mention(String nerTag) or List<String> mentions() functions in the simple API.
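
To make "groups contiguous NER tags into mentions" concrete, here is a minimal plain-Java sketch of the idea. It is not CoreNLP's implementation and uses no CoreNLP classes: the class name MentionGrouper, the group method, and the hand-assigned tags in main are all hypothetical, and it assumes "O" marks tokens outside any entity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: merges runs of adjacent tokens that share a non-"O" NER tag. */
class MentionGrouper {

    /** A merged span: the joined token text plus the tag shared by its tokens. */
    static final class Mention {
        final String text;
        final String tag;

        Mention(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return text + "/" + tag;
        }
    }

    /** Groups contiguous tokens with the same non-"O" tag into one mention each. */
    static List<Mention> group(List<String> tokens, List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<Mention> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed, so close the open span (if any) and start fresh.
                if (!currentTag.equals("O")) {
                    mentions.add(new Mention(current.toString(), currentTag));
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens.get(i));
            }
        }
        // Flush a span that runs to the end of the sentence.
        if (!currentTag.equals("O")) {
            mentions.add(new Mention(current.toString(), currentTag));
        }
        return mentions;
    }

    public static void main(String[] args) {
        // Tags are assigned by hand for illustration; they are not CoreNLP output.
        List<String> tokens = Arrays.asList("The", "Vice", "President", "met", "Smith", ".");
        List<String> tags = Arrays.asList("O", "TITLE", "TITLE", "O", "PERSON", "O");
        System.out.println(group(tokens, tags)); // prints [Vice President/TITLE, Smith/PERSON]
    }
}
```

The tokens stay separate, as the tokenizer intends, and the multi-word title only appears once the tagged tokens are merged afterwards. In a real pipeline you would let the mention annotator do this step rather than hand-rolling it.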