javascript - R web scraping - hidden text in HTML
I want to scrape the URLs from the following page:
http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=cre&leg=5
There are 180 URLs to be collected from the page (each link is a speech given in parliament), but I run into problems whenever there are more than 100 URLs to be scraped, because the additional speeches are only accessible by clicking on the "see more" box at the bottom of the page. I've tried to figure out how to reveal the additional links, which I think are hidden in a "getmore" function, but no luck! Apologies for my naiveté here...
My current code is as follows:
# read in page
mep.speech.list.url <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=cre&leg=5"
speech.list.data <- try(readLines(mep.speech.list.url), silent = TRUE)

# find urls: keep only lines that contain both "href" and "target"
mep.speech.list <- speech.list.data
mep.speech.lines <- grep("href", mep.speech.list)
mep.speech.list <- mep.speech.list[mep.speech.lines]
mep.speech.lines <- grep("target", mep.speech.list)
mep.speech.list <- mep.speech.list[mep.speech.lines]
mep.speech.list <- mep.speech.list[-length(mep.speech.list)]

# clean urls: cut each line at "target", then strip the surrounding markup
mep.speech.list.end <- regexpr("target", mep.speech.list)
mep.speech.list <- substr(mep.speech.list, 1, mep.speech.list.end)
mep.speech.list <- gsub("\t", "", mep.speech.list)
mep.speech.list <- gsub('<a href=\"', "", mep.speech.list)
mep.speech.list <- gsub('\" target', "", mep.speech.list)
mep.speech.list <- gsub('\" targe', "", mep.speech.list)
mep.speech.list <- gsub('\" targ', "", mep.speech.list)
mep.speech.list <- gsub('\" tar', "", mep.speech.list)
mep.speech.list <- gsub('\" ta', "", mep.speech.list)
mep.speech.list <- gsub('\" t', "", mep.speech.list)
mep.speech.list <- mep.speech.list[5:length(mep.speech.list)]
print(mep.speech.list)
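As a side note on the cleaning step above, the chain of gsub() calls could probably be collapsed into a single regular-expression match. Here is a minimal sketch in base R, assuming each link sits on one line as an <a href="..." target=...> tag. The sample lines and the helper name extract_hrefs are made up for illustration and are not taken from the real page.

```r
# Hypothetical sample lines standing in for the output of readLines()
sample.lines <- c(
  '\t<a href="http://example.org/speech?id=1" target="_blank">Speech 1</a>',
  '<p>no link here</p>',
  '\t<a href="http://example.org/speech?id=2" target="_blank">Speech 2</a>'
)

# Hypothetical helper: return the href value of the first anchor on each line
# that is immediately followed by a target attribute
extract_hrefs <- function(lines) {
  pattern <- '<a href="([^"]*)" target'
  hits <- regmatches(lines, regexec(pattern, lines))
  # regexec() yields character(0) for lines without a match;
  # otherwise element 1 is the full match and element 2 the captured URL
  unlist(lapply(hits, function(m) if (length(m) == 2) m[2] else NULL))
}

speech.urls <- extract_hrefs(sample.lines)
print(speech.urls)
```

This only tidies the parsing. It still sees just the links present in the initial HTML, so it does not by itself reach the speeches behind the "see more" box.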
The "see more" button executes JavaScript that carries out an AJAX call. You can use Selenium to automate a browser and extract the links:
require(RSelenium)
appURL <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=cre&leg=5"
# note: startServer() is defunct in newer RSelenium releases, which provide rsDriver() instead
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
# click the "see more" button, then pause so the AJAX call can finish
remDr$findElement("id", "seemore")$clickElement()
Sys.sleep(5)
# collect the href of every link in the speech list using the page's own jQuery
jsScript <- "var hrefs = new Array(); $('#content_left .listcontent a').each(function(){ hrefs.push($(this).attr('href')); }); return hrefs;"
appHref <- remDr$executeScript(jsScript)[[1]]

> head(appHref)
[1] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040504+item-008+doc+xml+v0//en&language=en&query=interv&detail=2-205"
[2] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040422+item-005+doc+xml+v0//en&language=en&query=interv&detail=4-069"
[3] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040422+item-005+doc+xml+v0//en&language=en&query=interv&detail=4-122"
[4] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040421+item-008+doc+xml+v0//en&language=en&query=interv&detail=3-207"
[5] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040330+item-004+doc+xml+v0//en&language=en&query=interv&detail=2-074"
[6] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040330+item-004+doc+xml+v0//en&language=en&query=interv&detail=2-099"
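Before looping over the returned links, a little post-processing may help. Depending on the RSelenium version, the result may come back as a list or as a character vector, and the page may repeat a link. Here is a minimal sketch, assuming the speech links all contain "detail=" as in the output above. The sample values and the helper name tidy_links are hypothetical.

```r
# Hypothetical stand-in for the value returned by executeScript()
raw.links <- list(
  "http://example.org/doc?detail=2-205",
  "http://example.org/doc?detail=4-069",
  "http://example.org/doc?detail=2-205",
  "http://example.org/other"
)

# Hypothetical helper: flatten to a character vector, drop duplicates,
# and keep only links that contain the given literal substring
tidy_links <- function(links, pattern = "detail=") {
  links <- unique(unlist(links))
  links[grepl(pattern, links, fixed = TRUE)]
}

speech.links <- tidy_links(raw.links)
print(speech.links)
```

unlist() leaves a character vector unchanged, so the helper should work whichever shape the driver returns.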