I am used to web scraping with R, but I run into difficulties when the site is password protected. My goal is to simulate a web browser and read in my conversations from the page that is only visible in a logged-in session. On the site below it is easy to register and create an account. The site lets users find flats online and apply for them.
I am not looking for RSelenium solutions :)
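library(httr2)  # request(), req_*(), resp_*()
library(rvest)  # read_html(), html_nodes(), html_text()
wg_site <- "https://www.wg-gesucht.de"
# wg_id and wg_pw hold my account e-mail and password
credentials_id <- list(login_email_username = wg_id,
                       login_password = wg_pw)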
I found two possible ways of logging in:
Type A - submit a form
wg_req_form <- request(wg_site) %>%             # base URL
  req_url_path_append("/nachrichten.html") %>%  # path
  req_body_form(login_email_username = wg_id,
                login_password = wg_pw)         # submit credentials
wg_form_resp <- wg_req_form %>%
  req_perform() %>%
  resp_body_html()
wg_form_resp %>%
  html_nodes("#main_column > div.row.my40 > div > div > div > div") %>%
  html_text() %>%
  gsub(" ", "", .) %>%
  gsub("\n", " ", .)
- The login did not work. I wonder whether I am missing another field that has to be sent in the POST request.
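One way I could check for missing fields is to list every input of the login form, hidden ones included. Below is a minimal sketch with rvest's html_form(). The HTML is a hypothetical stand-in that I inlined so the example runs offline: the action "/login" and the hidden field csrf_token are assumptions, not taken from the real site. Only the two credential field names come from my code above.

```r
library(rvest)

# Hypothetical login form, inlined so the example runs without network access.
# On the real site the page would come from read_html(wg_site) instead.
page <- minimal_html('
  <form id="login_form" action="/login" method="post">
    <input type="text" name="login_email_username">
    <input type="password" name="login_password">
    <input type="hidden" name="csrf_token" value="abc123">
  </form>')

forms <- html_form(page)      # one entry per <form> on the page
login_form <- forms[[1]]

# Names of all fields the browser would send with this form
field_names <- names(login_form$fields)
print(field_names)

# Hidden inputs are the usual candidates for a "missing" field
hidden <- Filter(function(f) identical(f$type, "hidden"), login_form$fields)
hidden_names <- names(hidden)
print(hidden_names)
```

If the real form turns out to be rendered by JavaScript, html_form() will not see it, and the fields would have to be read from the browser's network tab instead.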
Type B - find the JavaScript file that carries the session token, extract the token, and send it together with the credentials as JSON
wg_session_token <- read_html("https://www.wg-gesucht.de") %>%
  html_nodes(xpath = "/html/body/script[21]") %>%
  html_attr("src") %>%
  sub("^/ajax/api/Smp/js/Session\\.min\\.tjb", "", .) %>%  # strip the path prefix
  sub("\\.js$", "", .)                                     # strip the file extension
request(wg_site) %>%
  req_url_path_append("/nachrichten.html") %>%  # path
  req_body_json(credentials_id) %>%             # credentials as a JSON object
  req_perform()                                 # response check
# req_url_query(login_token = wg_session_token)
I wonder whether this token should be added right after the JSON body or somewhere else. Besides, there are several tokens that could be relevant for accessing my account. While logging in and out and refreshing the site, I found another token-related link that could help:
https://www.wg-gesucht.de/ajax/sessions.php?action=login
I believe there are different types of tokens, for example one for the session and another one for the login.
So my questions are:
How should one proceed in general when scraping a password-protected page?
I rely on this schema: open an html_session -> collect cookies and tokens -> POST the credentials via a form or a JSON body -> GET (retrieve the data according to its format)
How can the form be submitted in this example? Is submitting a form better than finding the API endpoint that accepts JSON?
How can I log in with the JSON variant? How do I authenticate via JSON so that the credentials are sent together with the right token? I guess the answer depends on finding the right JavaScript file or API endpoint.
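To make the schema from my first question concrete, here is a rough httr2 sketch of how I picture it. It is only an illustration under assumptions: I am guessing that the sessions.php link above is the login endpoint and that it accepts a JSON body with these two field names. The credentials are placeholders, and no token is included because I do not know where it belongs. The requests are only built here, not sent.

```r
library(httr2)

wg_site <- "https://www.wg-gesucht.de"

# One cookie file shared by all requests, so they behave like one browser session
cookie_jar <- tempfile(fileext = ".cookies")
wg_base <- request(wg_site) %>%
  req_cookie_preserve(cookie_jar)

# Step 1: plain GET of the start page to pick up the initial session cookies
req_home <- wg_base

# Step 2: POST the credentials (endpoint and field names are assumptions)
req_login <- wg_base %>%
  req_url_path_append("/ajax/sessions.php") %>%
  req_url_query(action = "login") %>%
  req_body_json(list(login_email_username = "me@example.com",  # placeholder
                     login_password = "secret"))               # placeholder

# Step 3: GET the protected page, reusing the cookies set by the login
req_messages <- wg_base %>%
  req_url_path_append("/nachrichten.html")

# Nothing is sent until req_perform() is called, for example:
# req_perform(req_home)
# req_perform(req_login)
# conversations <- req_perform(req_messages) %>% resp_body_html()
```

For the form variant, req_body_json() in step 2 would be swapped for req_body_form() with the same fields, and the rest would stay unchanged.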
Thank you very much and any help is more than welcome!