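Python] Crawling - bs4 (PyCharm)

Crawling (scraping)

Fetching a web page and extracting data from it.

header - When Python makes an HTTP request, the defaults may not carry all the information a site expects, so you specify the headers yourself.

cookie - Data that HTTP uses to store information about the user.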

Code

requests.get(url=url, headers=header, cookies=cookie)

% Every site uses headers and cookies, so by adjusting their values you can, for example, make a request look like one sent by a browser.
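As a minimal sketch of how headers and cookies travel with a request, the snippet below builds a request without sending it, so the resulting headers can be inspected offline. The URL, the shortened User-Agent value, and the `session_id` cookie are placeholders assumed for illustration, not values from any real site.

```python
import requests

# Placeholder values, assumed for illustration only.
url = "https://example.com/chart"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
cookie = {"session_id": "abc123"}

# Build the request and prepare it without sending anything over the network.
prepared = requests.Request("GET", url, headers=header, cookies=cookie).prepare()

# The prepared request carries the custom User-Agent and a Cookie header.
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

To actually send the request, the equivalent call would be `requests.get(url, headers=header, cookies=cookie)`.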

Crawling with Python

Install bs4 (beautifulsoup4)

Ctrl + Alt + S (opens Settings in PyCharm)

Click + > search for bs4 > install the package

import requests
from bs4 import BeautifulSoup

Write the header code

header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
a = requests.get('https://www.genie.co.kr/chart/top200', headers=header)

After writing the header

b = BeautifulSoup(a.text, 'html.parser')

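Code that parses the information received from the URL

Extract the information you need from the URL

After copying the element's selector (for example with Copy selector in the browser's developer tools)

num = b.select_one('copied selector')

paste it inside the quotes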

num = b.select_one('#body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.title.ellipsis')
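select_one gives back a tag object rather than a plain string, so printing it shows the whole element. Below is a minimal sketch of pulling out just the text. The HTML is a small hand-written sample that only imitates the ids and class names in the selector; it is an assumption for illustration, not Genie's real markup, and the song names are made up.

```python
from bs4 import BeautifulSoup

# Hand-written sample that only imitates the structure targeted by the selector.
sample_html = """
<div id="body-content"><div class="newest-list"><div><table><tbody>
  <tr><td class="info"><a class="title ellipsis" href="#">  First Song  </a></td></tr>
  <tr><td class="info"><a class="title ellipsis" href="#">  Second Song  </a></td></tr>
</tbody></table></div></div></div>
"""

b = BeautifulSoup(sample_html, "html.parser")
num = b.select_one(
    "#body-content > div.newest-list > div > table > tbody > "
    "tr:nth-child(1) > td.info > a.title.ellipsis"
)

# get_text(strip=True) returns the link text without surrounding whitespace.
title = num.get_text(strip=True)
print(title)
```

Changing `tr:nth-child(1)` to `tr:nth-child(2)` would target the second row of the sample instead.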

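% Difference between select and select_one

select returns every matching element, bundled in a list

select_one returns a single element, the first match (or None if nothing matches)

Full code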
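The difference can be seen side by side in a short sketch. The table below is hand-written sample HTML with made-up song names, assumed purely for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample HTML, assumed for illustration only.
sample_html = """
<table>
  <tr><td class="info"><a class="title">Song A</a></td></tr>
  <tr><td class="info"><a class="title">Song B</a></td></tr>
  <tr><td class="info"><a class="title">Song C</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

all_titles = soup.select("td.info > a.title")       # every match, as a list
first_title = soup.select_one("td.info > a.title")  # first match only
missing = soup.select_one("td.info > a.artist")     # no match, so None

# Turn the list of tags into a list of plain strings.
titles = [tag.get_text(strip=True) for tag in all_titles]
print(titles)
print(first_title.get_text(strip=True))
print(missing)
```

Use select when you want to loop over every row, and select_one when a single element is enough.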

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request
header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
a = requests.get('https://www.genie.co.kr/chart/top200', headers=header)

# Parse the returned HTML
b = BeautifulSoup(a.text, 'html.parser')

# Select the title link in the first row of the chart table
num = b.select_one('#body-content > div.newest-list > div > table > tbody > tr:nth-child(1) > td.info > a.title.ellipsis')

# Prints the matched tag, or None if the selector matched nothing
print(num)

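Result

print(num) outputs the matched a.title.ellipsis element as HTML, or None if the selector matched nothing.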
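Because select_one returns None when the selector matches nothing, and a failed request can return an error page instead of the chart, a slightly more defensive variant may be useful. The sketch below is one possible arrangement, not the original post's code: the function names `fetch_html` and `first_title`, the shortened selector, the shortened User-Agent, and the 10-second timeout are all assumptions for illustration.

```python
import requests
from bs4 import BeautifulSoup

HEADER = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
SELECTOR = "td.info > a.title.ellipsis"  # shortened selector, assumed for this sketch


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=HEADER, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def first_title(html, selector=SELECTOR):
    """Return the text of the first element matching selector, or None."""
    tag = BeautifulSoup(html, "html.parser").select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


# Offline check with hand-written HTML; no request is sent here.
demo_html = '<table><tr><td class="info"><a class="title ellipsis">Demo</a></td></tr></table>'
print(first_title(demo_html))
print(first_title("<p>layout changed</p>"))
```

Against the live page this would be called as `first_title(fetch_html('https://www.genie.co.kr/chart/top200'))`; whether the shortened selector matches the real markup has not been verified here.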

김인광의 코딩식당
