Python Web Crawling Course: 1. A Quick Guide to BeautifulSoup

Python / Web Crawling | 2023. 10. 21. 22:09 | @webnautes


This post shows how to build a simple web crawler with BeautifulSoup.

First written: 2015. 10. 31

Last updated: 2023. 5. 21

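A web crawler is a program that periodically collects web documents, images, and other resources and automatically stores them in a database. The work a web crawler does is called web crawling.

Web crawlers are usually used to pull the information you need out of web documents. Search engines index the data collected this way so that searches run quickly.

Let's build a simple web crawler that fetches the contents of a web page.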

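Before starting, install the requests and beautifulsoup4 packages. (The examples in this post fetch pages with urlopen from the standard library, so only beautifulsoup4 is actually imported here.)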

pip install requests beautifulsoup4

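1. Fetching the entire web document

Use the urlopen function to fetch the web page at the address you want, then convert it into a BeautifulSoup object.

A BeautifulSoup object holds the web document in parsed form: the document is broken down tag by tag into a tree of tags.

For example, the head and body tags sit under the html tag, and each of them has its own child tags.

Parsing means breaking a document made up of a sequence of characters into meaningful tokens and building a parse tree out of those tokens.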

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.naver.com")
bsObject = BeautifulSoup(html, "html.parser")

print(bsObject)  # Prints the entire web document.

<!DOCTYPE html>

  . . . (rest of the document omitted) . . .

</body> </html>
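The tree structure described above is easier to see on a small document than on a full portal page. The following is a minimal sketch, assuming a made-up HTML string (SAMPLE_HTML is hypothetical and stands in for a downloaded page) so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used here instead of a live page.
SAMPLE_HTML = """<html><head><title>Sample</title></head>
<body><h1>Hello</h1><p>First paragraph</p></body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all(True, recursive=False) returns only the direct child tags,
# skipping text nodes such as the newline between </head> and <body>.
top_level = [tag.name for tag in soup.html.find_all(True, recursive=False)]
body_tags = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# Every tag also knows its parent, so the tree can be walked upward.
h1_parent = soup.h1.parent.name

print(top_level)
print(body_tags)
print(h1_parent)
```

Each tag in the tree can be reached by attribute access (soup.html, soup.body, soup.h1), which is the same mechanism the later examples rely on.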

2. Getting the title

Print only the title tag from the tag tree.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.naver.com")
bsObject = BeautifulSoup(html, "html.parser")

print(bsObject.head.title)

<title>NAVER</title>
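The output above includes the surrounding title tag because the tag object itself is printed. If only the text is wanted, get_text can be used. This is a small sketch on a hypothetical HTML string (SAMPLE_HTML is made up for illustration), not the page fetched above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page.
SAMPLE_HTML = "<html><head><title> NAVER </title></head><body></body></html>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_tag = soup.head.title                   # Tag object, prints with its markup
title_text = title_tag.get_text(strip=True)   # text only, surrounding whitespace trimmed

print(title_tag)
print(title_text)
```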

3. Getting the content of every meta tag

Find the meta tags in the web document and get the value of each content attribute.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.python.org/about")
bsObject = BeautifulSoup(html, "html.parser")

for meta in bsObject.head.find_all('meta'):
    print(meta.get('content'))

None
IE=edge
Python.org
The official home of the Python Programming Language
Python.org
yes
black
width=device-width, initial-scale=1.0
True
telephone=no
on
false
/static/metro-icon-144x144-precomposed.png
#3673a5
#3673a5
The official home of the Python Programming Language
Python programming language object oriented web free open source software license documentation download community
website
Python.org
Welcome to Python.org
The official home of the Python Programming Language
https://www.python.org/static/opengraph-icon-200x200.png
https://www.python.org/static/opengraph-icon-200x200.png
https://www.python.org/about/
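A None appears in the output whenever a meta tag has no content attribute, because get returns None for a missing attribute. One way to skip such tags and keep track of which value belongs to which tag is sketched below. It assumes a made-up HTML string (SAMPLE_HTML) and a hypothetical "(unnamed)" fallback key; neither comes from the pages above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample head section, used instead of a live page.
SAMPLE_HTML = """<html><head>
<meta charset="utf-8">
<meta name="description" content="Sample description">
<meta property="og:title" content="Sample title">
</head><body></body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# attrs={"content": True} matches only meta tags that have a content attribute,
# so the charset tag is skipped instead of producing None.
meta_info = {}
for meta in soup.head.find_all("meta", attrs={"content": True}):
    key = meta.get("name") or meta.get("property") or "(unnamed)"
    meta_info[key] = meta["content"]

for key, value in meta_info.items():
    print(f"{key}: {value}")
```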

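4. Getting the content of a specific tag

find lets you fetch only the tag you want.

For example, suppose you want the content attribute value of the description meta tag at www.python.org/about.

First, narrow the meta tags in the document down to the one whose name attribute is description.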

print(bsObject.head.find("meta", {"name": "description"}))

Then get the content attribute of that meta tag.

print(bsObject.head.find("meta", {"name": "description"}).get('content'))

The official home of the Python Programming Language

Here is the full source code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.python.org/about")
bsObject = BeautifulSoup(html, "html.parser")

print(bsObject.head.find("meta", {"name": "description"}).get('content'))
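One thing to watch for with find is that it returns None when no tag matches, so chaining .get('content') directly raises an AttributeError on pages that lack the tag. The sketch below wraps the lookup in a helper; get_meta_content and SAMPLE_HTML are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup


def get_meta_content(soup, name):
    """Return the content of <meta name=...>, or None if the tag is absent."""
    # The attribute filter is passed as a dict because find's own `name`
    # parameter already means "tag name".
    tag = soup.find("meta", {"name": name})
    return tag.get("content") if tag is not None else None


# Hypothetical sample document, used instead of a live page.
SAMPLE_HTML = (
    '<html><head><meta name="description" content="Sample description">'
    "</head><body></body></html>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

found = get_meta_content(soup, "description")
missing = get_meta_content(soup, "keywords")

print(found)
print(missing)
```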

5. Getting the text and address of every link

Print the text enclosed by each a tag along with the tag's href attribute.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.naver.com")
bsObject = BeautifulSoup(html, "html.parser")

for link in bsObject.find_all('a'):
    print(link.text.strip(), link.get('href'))

바로가기 #
/
전체삭제 #
@txt@ #
삭제 #
도움말 https://help.naver.com/support/alias/search/word/word_29.naver
도움말 https://help.naver.com/support/alias/search/word/word_29.naver
도움말 https://help.naver.com/support/service/main.help?serviceNo=605&categoryNo=1991
자동저장 끄기 #
@5@회 당첨번호 동행복권 제공 @6@@7@@8@@9@@10@@11@@12@ #
@14@ @txt@@currency@ @8@(@9@%) @6@원 #
@txt@ @7@, @message@ @7@ @8@° #
@txt@ @5@ 바로가기 @5@
@txt@ #
추가 #
@txt@ #
추가 #
@query@ @intend@ #
추가 #
자세히보기 #
관심사를 반영한 컨텍스트 자동완성도움말 https://help.naver.com/support/alias/search/word/word_16.naver
컨텍스트 자동완성 #
자세히 https://help.naver.com/support/alias/search/word/word_16.naver
로그인 https://nid.naver.com/nidlogin.login
자세히 https://help.naver.com/support/alias/search/word/word_16.naver
컨텍스트 자동완성 레이어 닫기 #
도움말 https://help.naver.com/support/service/main.help?serviceNo=605&categoryNo=1987
신고 https://help.naver.com/support/contents/contents.help?serviceNo=605&categoryNo=18215
자동완성 끄기 #
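As the output shows, many href values are just "#" placeholders, and some addresses are relative (for example "/"). A possible cleanup step is sketched below: it skips links without a usable address and resolves relative ones with urljoin from the standard library. BASE_URL and SAMPLE_HTML are hypothetical values chosen for illustration, not taken from the page above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base address and sample markup, used instead of a live page.
BASE_URL = "https://www.example.com/news/"
SAMPLE_HTML = """<body>
<a href="#">Shortcut</a>
<a href="/help">Help</a>
<a href="today.html"> Today </a>
<a>No address</a>
<a href="https://nid.naver.com/nidlogin.login">Login</a>
</body>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = []
# href=True matches only <a> tags that actually have an href attribute.
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # in-page placeholder, not a real address
    # urljoin leaves absolute URLs alone and resolves relative ones against BASE_URL.
    links.append((link.get_text(strip=True), urljoin(BASE_URL, href)))

for text, url in links:
    print(text, url)
```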

Attribution / NonCommercial / ShareAlike


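I write posts on this blog whenever I find time to try things out.

The posts on this blog are updated to newer versions from time to time.
I do this in my spare time, so I can't say when an update will happen.

I also run a blog about movies, books, and thoughts.
https://freewriting2024.tistory.com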
