30/09/2018, 23:33
How to crawl web data using urllib2 and BeautifulSoup?
Hi guys,
I’m using urllib2 and BeautifulSoup to crawl web data. I can get all the tags in a webpage, but some tags have no class='some value' attribute; they look like <a rel="nofollow" href="http://google.com.vn">Google</a> instead.
How can I handle this case? I want to extract the href value from a tag like <a rel="nofollow" href="http://google.com.vn">.
If you know any links about a similar requirement, please send them to me.
Thank you so much.
The BeautifulSoup class has find() and find_all() methods. In your situation, you would use find() to get the first link’s href (see the sketch below). find_all() returns a list of the found links matching the attributes you pass in.
PS: I recommend using requests instead of urllib2.
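If it helps, here is a minimal sketch of what that could look like, sticking with urllib2 and bs4 as in your question; the URL is just a placeholder:

```python
# Sketch: fetch a page with urllib2 and read href values with BeautifulSoup (Python 2.7)
import urllib2
from bs4 import BeautifulSoup

url = 'http://example.com'  # placeholder URL, not from this thread
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
first_link = soup.find('a', href=True)
if first_link is not None:
    print first_link['href']

# find_all() returns a list of every matching tag; href=True keeps only tags that have an href
for link in soup.find_all('a', href=True):
    print link['href']
```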
Yes, I know. But my PC is behind a proxy, so I had only found tutorials related to urllib2 and BeautifulSoup. Wait, requests also supports proxies. Thanks for the mention.
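For reference, a small sketch of requests going through a proxy; the proxy address here is a placeholder, not something from this thread:

```python
# Sketch: the same fetch done with requests behind a proxy.
# Replace the placeholder proxy address with your real proxy host and port.
import requests
from bs4 import BeautifulSoup

proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}

resp = requests.get('http://example.com', proxies=proxies, timeout=10)
soup = BeautifulSoup(resp.text, 'html.parser')
for link in soup.find_all('a', href=True):
    print link['href']
```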
Problem solved, still using urllib2 and bs4. I redirect the result to a *.txt file, so the next step will be to write a def that calls a download manager program and runs it automatically.
Python version: 2.7.10
OS platform: Windows 7
Your solution gives an error. Could you check it?
find_all() returns a list, so you have to loop through it to get the items. Each item has its own attrs.
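For example, a tiny sketch of that loop, using the tag from your first post:

```python
# Sketch: loop over the list returned by find_all() and read each tag's attributes
from bs4 import BeautifulSoup

html = '<a rel="nofollow" href="http://google.com.vn">Google</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    print link['href']   # one attribute by name
    print link.attrs     # dict of all attributes on this tag
```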
For sending links to a download manager, please refer to my script: https://github.com/htlcnn/scripts/blob/master/send_link_to_IDM.py
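That script targets IDM specifically; as a rough, generic sketch of the idea you described (write the hrefs to a *.txt file and hand them to a downloader), assuming a purely hypothetical downloader executable:

```python
# Sketch: write the collected hrefs to a .txt file and hand them to a download manager.
# The downloader path and whether it accepts a link file are hypothetical; adjust for your tool.
import subprocess

def send_links(links, txt_path='links.txt', downloader_exe=None):
    # one href per line, so the file can also be imported manually
    with open(txt_path, 'w') as f:
        for link in links:
            f.write(link + '\n')
    # optionally launch the external download manager with the link file
    if downloader_exe:
        subprocess.call([downloader_exe, txt_path])
```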