How does the BERT model select the label ordering?

I'm training BertForSequenceClassification for a classification task. My dataset consists of sentences labelled 'contains adverse effect' (1) and 'does not contain adverse effect' (0). The dataset contains all of the 1s first and then all of the 0s (the data isn't shuffled). For training I've shuffled my data and obtained the logits. From what I've understood, the logits are the raw scores that become a probability distribution after softmax. An example logit is [-4.673831, 4.7095485]. Does the first value correspond to label 1 (contains AE) because it appears first in the dataset, or to label 0? Any help would be appreciated, thanks.
1 Answer
The first value corresponds to label 0 and the second value corresponds to label 1. What BertForSequenceClassification does is feed the output of the pooler to a linear layer (after a dropout, which I will ignore in this answer). Let's look at the following example:
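Here is a minimal sketch of such an example (the bert-base-uncased checkpoint and the input sentence are illustrative choices, and the attribute access assumes a recent transformers version that returns model outputs as objects):

```python
from transformers import BertModel, BertTokenizer

# Illustrative checkpoint; any BERT checkpoint behaves the same way.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

# Illustrative input sentence.
inputs = tokenizer('This tastes good.', return_tensors='pt')
outputs = bert(**inputs)

pooled_output = outputs.pooler_output
print(pooled_output.shape)  # torch.Size([1, 768]) -> [batch_size, hidden_size]
```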
The pooled_output is a tensor of shape [batch_size, hidden_size] and represents the contextualized (i.e. attention was applied) [CLS] token of your input sequences. This tensor is fed to a linear layer to calculate the logits of your sequence. When we normalize these logits with softmax, we can see which of the two labels the linear layer currently assigns to our input. The exact numbers will differ from run to run, since the linear layer is initialized randomly:
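Continuing from the snippet above with a freshly initialized classification head (a stand-in for the head inside BertForSequenceClassification):

```python
import torch
from torch import nn

# A fresh binary classification head: hidden_size 768 -> 2 labels.
classifier = nn.Linear(768, 2)

logits = classifier(pooled_output)
probabilities = torch.softmax(logits, dim=-1)

print(logits.shape)   # torch.Size([1, 2]) -> one raw score per label
print(probabilities)  # rows sum to 1; values change every run (random initialization)
```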
The linear layer applies a linear transformation y = xA^T + b, and you can already see that the linear layer is not aware of your labels. It 'only' has a weight matrix of size [2, 768] to produce logits of size [1, 2] (i.e. the first row of the weight matrix produces the first value and the second row produces the second):
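You can verify this with the illustrative classifier from the previous snippet:

```python
# The head is just a weight matrix and a bias; it knows nothing about label names.
print(classifier.weight.shape)  # torch.Size([2, 768])
print(classifier.bias.shape)    # torch.Size([2])

# Row 0 of the weight matrix produces the first logit (label 0),
# row 1 produces the second logit (label 1).
manual_logits = pooled_output @ classifier.weight.T + classifier.bias
print(torch.allclose(manual_logits, logits))  # True
```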
The BertForSequenceClassification model learns by applying a CrossEntropyLoss. This loss function produces a small loss when the logit of the correct class (label, in your case) is much larger than the other logits, and a large loss otherwise. That means the CrossEntropyLoss is what lets your model learn that the first logit should be high when the input does not contain an adverse effect and small when it does contain an adverse effect. You can check this for our example with the following:
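A self-contained sketch of that check, using the logit pair from your question:

```python
import torch
from torch import nn

loss_fct = nn.CrossEntropyLoss()

# The logits from the question: first value for label 0, second value for label 1.
example_logits = torch.tensor([[-4.673831, 4.7095485]])

loss_if_label_is_0 = loss_fct(example_logits, torch.tensor([0]))
loss_if_label_is_1 = loss_fct(example_logits, torch.tensor([1]))

print(loss_if_label_is_0)  # large (~9.4): these logits strongly contradict label 0
print(loss_if_label_is_1)  # tiny (~1e-4): these logits agree with label 1
```

With this logit pair the loss is only small if the true label is 1, which confirms that the second position corresponds to label 1 (contains adverse effect) and the first to label 0.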