Extract text from CSV in R

Question

Asked by CogNeuro123 on 09 February 2023 at 03:21 · 126 views

I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.

One example sentence:

so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day

Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.

Any ideas on how to do this? Or which function/package to use?

Edited to add: there are about 100 cells that contain text in this Excel sheet. I guess where I'm confused is how I treat this entire CSV as a "string".

I don't want to copy and paste every cell as a "string" -- I was hoping I could somehow just upload the entire CSV.

There is 1 answer

**Jon Spring** · Answer 1 · 2023-02-09T03:50:52+00:00

To load the CSV into R, you could use readr::read_csv("YOUR_FILE.csv"). Note that the function name is lowercase and the file path goes in quotes. There are more options, some of which are available to you if you use the "File > Import Dataset > From Text (readr)" menu option in RStudio.
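
As a rough sketch of that loading step: the temp file and the column name `text` below are stand-ins I made up so the example runs on its own, not details from the question. The point is that once the file is loaded, the whole column is already a vector of strings, so nothing needs to be pasted in cell by cell.

```r
library(readr)

# Hypothetical stand-in for the real file: a one-column CSV with a header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "text",
  "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise",
  "[hola] how are you"
), csv_path)

# read_csv() returns a tibble; each cell of the column becomes one string.
transcript <- read_csv(csv_path, show_col_types = FALSE)

# The column is a character vector with one element per cell.
is.character(transcript$text)
length(transcript$text)
```

With the real data, `csv_path` would simply be the path to the exported CSV, and `text` would be whatever the transcription column is actually called.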

Supposing you have the data loaded, you will likely need to rely on some form of "regex" to parse the text into sections based on the brackets. There are some base R functions for this, but I find the functions in stringr (part of the tidyverse meta-package) to be useful for this. And tidyr::separate_rows is a nice way to split the text into more lines.

In the regex below, there are a few ingredients:

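(?=...) is a "lookahead": it means to split just before the [ but to keep it.
\\[ is how we refer to [ because brackets have special meaning in regex so we need to "escape" them to treat them as a literal character.
(?<=...) is a "lookbehind": it means to split just after the ] (here, after the ] and the space that follows it) but keep it.
| in the last two rows means "or", so \\[|\\] matches either bracket.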

(Granted, I'm still a regex beginner, so I expect there are more concise ways to do this.)
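
The ingredients above can be tried one at a time on a short string. This is only an illustrative sketch using stringr::str_split and stringr::str_remove_all on a shortened, made-up input; the variable names are mine.

```r
library(stringr)

x <- "so [usualmente] maybe"

# Lookahead: split at the position just before each "[", keeping the bracket.
before_open <- str_split(x, "(?=\\[)")[[1]]

# Lookbehind: split at the position just after each "] ", keeping it as well.
after_close <- str_split(x, "(?<=\\] )")[[1]]

# "|" means "or": remove every literal "[" or "]".
no_brackets <- str_remove_all("[usualmente] ", "\\[|\\]")

print(before_open)
print(after_close)
print(no_brackets)
```

Because the lookarounds match a position rather than any characters, nothing is consumed by the split, which is why the brackets survive and can be used afterwards to tell the languages apart.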

So we could do something like:

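library(tidyverse)

# One-row example; with real data, df1 would be the data frame loaded from the CSV
df1 <- data.frame(text = "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day")

df1 %>%
  # remember which original cell each piece came from
  mutate(orig_row = row_number()) %>%
  # split before each "[" and after each "] ", keeping the brackets for now
  separate_rows(text, sep = "(?=\\[)") %>%
  separate_rows(text, sep = "(?<=\\] )") %>%
  # pieces that still contain a bracket are Spanish; then drop the brackets
  mutate(language = if_else(str_detect(text, "\\[|\\]"), "Espanol", "English"),
         text = str_remove_all(text, "\\[|\\]"))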

Result

# A tibble: 5 × 3
  text                                                                                               orig_row language
  <chr>                                                                                                 <int> <chr>   
1 "so "                                                                                                     1 English 
2 "usualmente "                                                                                             1 Espanol 
3 "maybe "                                                                                                  1 English 
4 "me levanto como a las nueve y media "                                                                    1 Espanol 
5 "like I exercise and the I like either go to class online or in person like it depends on the day"        1 English
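
For the original goal of two separate files, one possible follow-up (a sketch, not part of the answer above) is to pull the bracketed text out directly with stringr::str_extract_all and write each language out with writeLines. The two example strings and the output file names are made up, and "English" here is simply whatever remains once the bracketed spans are removed.

```r
library(stringr)

# Made-up sample; with real data this would be the transcription column.
text <- c(
  "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise",
  "[hola] how are you"
)

# Spanish: everything between "[" and "]", brackets excluded via lookarounds.
spanish <- unlist(str_extract_all(text, "(?<=\\[)[^\\]]+(?=\\])"))

# English: drop the bracketed spans, then tidy up the leftover whitespace.
english <- str_squish(str_remove_all(text, "\\[[^\\]]*\\]"))

# Hypothetical output paths; point these wherever the files should go.
spanish_path <- file.path(tempdir(), "spanish.txt")
english_path <- file.path(tempdir(), "english.txt")
writeLines(spanish, spanish_path)
writeLines(english, english_path)
```

This writes one Spanish phrase per line and one English line per original cell. If the link back to the original row matters, the data-frame approach above keeps that in the orig_row column.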
