30/09/2018, 23:33

How to crawl web data using urllib2 and BeautifulSoup?

Hi guys,

I’m using urllib2 and BeautifulSoup to crawl web data. I can get all the tags in a webpage, but the tag I need has no class='some value' attribute. It looks something like <a rel="nofollow" href="http://google.com.vn">Google</a>.
So how can I solve this?
I want to extract the href value from this tag: <a rel="nofollow" href="http://google.com.vn">.

If you know any links or tutorials covering this kind of task, please send them to me.

Thank you so much.

htl@PyMI.vn wrote at 01:40 on 01/10/2018

The BeautifulSoup class has find() and find_all() methods. In your situation, it would be:

link = soup.find('a', {'rel':'nofollow'})
print(link.attrs['href'])

to find the first matching link’s href.
find_all() returns a list of all the links matching the attributes you pass in.
PS: I recommend using requests instead of urllib
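
For example, a rough sketch of the same lookup with requests doing the fetch (the URL here is just a placeholder):

import requests
from bs4 import BeautifulSoup

# fetch the page with requests instead of urllib2
page = requests.get("http://web_you_wanna_crawl")
soup = BeautifulSoup(page.text, "html.parser")

link = soup.find('a', {'rel': 'nofollow'})
print(link.attrs['href'])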

Jack Vo wrote at 01:44 on 01/10/2018

Yes, I know. But my PC is behind a proxy, so I could only find tutorials that use urllib2 and BeautifulSoup. Wait, requests also supports proxies. Thanks for the tip.
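
For reference, requests takes a proxies argument; a minimal sketch using the same proxy address as in my script below:

import requests

# pass the proxy settings as a dict (same address as the urllib2 version below)
proxies = {'http': 'http://192.168.1.1:8080'}
page = requests.get("http://web_you_wanna_crawl", proxies=proxies)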
Anyway, I solved the problem while still using urllib2 and bs4. I write the results to a *.txt file; the next step will be a function that calls a download manager program and runs it automatically (roughly sketched after the script below).

Python version : 2.7.10
OS Platform : Windows 7

# coding:utf-8
import urllib2, re
from bs4 import BeautifulSoup

# declare proxy configuration
proxy = urllib2.ProxyHandler({'http': 'http://192.168.1.1:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# specify the url
quote_page = "http://web_you_wanna_crawl"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
# collect the hrefs that point to .rar files
links = []
for i in soup.find_all('a'):
    j = i.get('href')
    # j is unicode (or None if the tag has no href), so guard before matching
    if j and re.search(r'\.rar', j):
        print j
        links.append(j)

# write the collected links to a text file, one per line
with open('linkdownload.txt', 'wb') as f:
    for item in links:
        f.write(("%s\n" % item).encode('utf-8'))
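
For that next step (handing the saved links to a download manager), a rough sketch could look like the one below; the downloader path is hypothetical, and the exact command-line arguments depend on which download manager you use:

import subprocess

# hypothetical path; point this at your download manager's executable
DOWNLOADER = r'C:\path\to\downloader.exe'

def send_links(filename='linkdownload.txt'):
    with open(filename) as f:
        for url in f:
            url = url.strip()
            if url:
                # assumes the downloader accepts a URL as its only argument
                subprocess.call([DOWNLOADER, url])

send_links()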

Your solution gave me an error, though. Could you check it?

print(link.attrs['href'])
AttributeError: 'ResultSet' object has no attribute 'attrs'
htl@PyMI.vn wrote at 01:49 on 01/10/2018

find_all() returns a list, so you have to loop through it to get the items. Each item has its own attrs.
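
For example, with find_all() the lookup becomes something like:

for link in soup.find_all('a', {'rel': 'nofollow'}):
    print(link.attrs['href'])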

For sending links to a download manager, please refer to my script: https://github.com/htlcnn/scripts/blob/master/send_link_to_IDM.py
