javascript - R web-scraping - hidden text in HTML


I want to scrape the URLs from the following page:

http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=cre&leg=5

There are 180 URLs to be collected from the page (each link points to a speech given in parliament), but I run into problems whenever there are more than 100 URLs to scrape, because the additional speeches are only accessible by clicking the "see more" box at the bottom of the page. I've tried to figure out how to reveal these additional links, which I think are hidden behind a "getmore" function, but no luck! Apologies for my naiveté here...

My current code is as follows.

Read in the page:

mep.speech.list.url <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=cre&leg=5"
speech.list.data <- try(readLines(mep.speech.list.url), silent = TRUE)

Find the URLs:

mep.speech.list <- speech.list.data
mep.speech.lines <- grep("href", mep.speech.list)
mep.speech.list <- mep.speech.list[mep.speech.lines]
mep.speech.lines <- grep("target", mep.speech.list)
mep.speech.list <- mep.speech.list[mep.speech.lines]
mep.speech.list <- mep.speech.list[-length(mep.speech.list)]

Clean the URLs:

mep.speech.list.end <- regexpr("target", mep.speech.list)
mep.speech.list <- substr(mep.speech.list, 1, mep.speech.list.end)
mep.speech.list <- gsub("\t", "", mep.speech.list)
mep.speech.list <- gsub('<a href=\"', "", mep.speech.list)
mep.speech.list <- gsub('\" target', "", mep.speech.list)
mep.speech.list <- gsub('\" targe', "", mep.speech.list)
mep.speech.list <- gsub('\" targ', "", mep.speech.list)
mep.speech.list <- gsub('\" tar', "", mep.speech.list)
mep.speech.list <- gsub('\" ta', "", mep.speech.list)
mep.speech.list <- gsub('\" t', "", mep.speech.list)
mep.speech.list <- mep.speech.list[5:length(mep.speech.list)]
print(mep.speech.list)

The "see more" button executes JavaScript that carries out an AJAX call, so the extra links are not in the HTML that readLines downloads. You can use Selenium to automate a browser and extract the links:

require(RSelenium)
appurl <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=cre&leg=5"
RSelenium::startServer()
remdr <- remoteDriver()
remdr$open()
remdr$navigate(appurl)
# click the "see more" box, then wait for the AJAX call to finish
remdr$findElement("id", "seemore")$clickElement()
Sys.sleep(5)
# collect the href of every link in the list using the page's own jQuery
jsscript <- "var hrefs = new Array();
$('#content_left .listcontent a').each(function(){
  hrefs.push($(this).attr('href'));
});
return hrefs;"
apphref <- remdr$executeScript(jsscript)[[1]]

> head(apphref)
[1] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040504+item-008+doc+xml+v0//en&language=en&query=interv&detail=2-205"
[2] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040422+item-005+doc+xml+v0//en&language=en&query=interv&detail=4-069"
[3] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040422+item-005+doc+xml+v0//en&language=en&query=interv&detail=4-122"
[4] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040421+item-008+doc+xml+v0//en&language=en&query=interv&detail=3-207"
[5] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040330+item-004+doc+xml+v0//en&language=en&query=interv&detail=2-074"
[6] "http://www.europarl.europa.eu/sides/getdoc.do?pubref=-//ep//text+cre+20040330+item-004+doc+xml+v0//en&language=en&query=interv&detail=2-099"
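If you would rather pull the links out in R than through jQuery, something along these lines might work once you have the fully loaded HTML as a character vector (for example from the browser session's page source). This is only a minimal base-R sketch: extract_hrefs is a hypothetical helper name, the example.org sample lines are made up, and it assumes the links look like the ones the question's code cleans up, i.e. an opening <a href="..." followed directly by a target attribute.

```r
# Hypothetical helper (not part of RSelenium or base R): return the href
# values of <a> tags whose href is immediately followed by a target attribute.
extract_hrefs <- function(html) {
  # Collapse the lines so a tag split across lines is still matched
  html <- paste(html, collapse = "\n")
  # Find every '<a href="..." target' fragment
  pattern <- '<a\\s+href="[^"]*"\\s+target'
  tags <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  # Keep only the text between the quotes after href=
  sub('(?s)^<a\\s+href="([^"]*)".*$', "\\1", tags, perl = TRUE)
}

# Made-up sample input standing in for the downloaded page
sample_html <- c(
  '<li><a href="http://example.org/speech-1" target="_blank">Speech 1</a></li>',
  '<li><a href="http://example.org/speech-2" target="_blank">Speech 2</a></li>',
  '<a href="/about">About</a>'
)

links <- extract_hrefs(sample_html)
print(links)
```

This does the finding and cleaning in one pass instead of the chain of grep and gsub calls, and links without a target attribute (like the "About" link in the sample) are skipped. A proper HTML parser would be more robust than a regular expression if the markup varies.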
