Scrapling: A Lightweight, Adaptive Web Scraping Tool

About Scrapling

Scrapling is a high-performance, intelligent Python web scraping library that automatically adapts to website changes and, according to the project, significantly outperforms other popular tools. It aims to give beginners and experts alike powerful features while staying simple to use.

>>> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
>>> # Fetch websites' source under the radar!
>>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>>> print(page.status)
200
>>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>>> # Later, if the website structure changes, pass `auto_match=True`
>>> products = page.css('.product', auto_match=True)  # and Scrapling still finds them!

Features

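1. Fetch websites the way you prefer;

2. Adaptive scraping with smart content extraction;

3. Fast and memory-efficient, with fast JSON serialization;

4. A powerful navigation API and rich text processing;

5. Automatic selector generation;

6. An API similar to Scrapy/BeautifulSoup.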

Requirements

Python 3.8+

Installation

Scrapling is written in Python 3, so first install and configure a recent Python 3 environment (3.8 or later) on your machine.

Install with pip

pip3 install scrapling

Next, download the Camoufox browser with the command for your platform.

Windows

camoufox fetch --browserforge

macOS

python3 -m camoufox fetch --browserforge

Linux

python -m camoufox fetch --browserforge

Debian-based distributions

sudo apt install -y libgtk-3-0 libx11-xcb1 libasound2

Arch-based distributions

sudo pacman -S gtk3 libx11 libxcb cairo libasound alsa-lib

Getting the source

git clone https://github.com/D4Vinci/Scrapling.git

Usage

Smart navigation

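>>> # Assumed setup: `quote` is the first quote element on https://quotes.toscrape.com/
>>> page = Fetcher.get('https://quotes.toscrape.com/')
>>> quote = page.css_first('.quote')

>>> quote.tag
'div'
>>> quote.parent  # the parent element
>>> quote.parent.tag
'div'
>>> quote.children  # the child elements: the quote text, the author line, and the tags block
>>> quote.siblings  # the sibling elements
>>> quote.next  # gets the next element, the same logic applies to `quote.previous`
>>> quote.children.css_first(".author::text")
>>> quote.has_class('quote')
>>> # Generate new selectors for any element
>>> quote.generate_css_selector
>>> # Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.generate_xpath_selector
'//body/div/div[2]/div/div'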

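If you need more than an element's parent, you can iterate over the entire ancestor tree of any element.

You can also search for a specific ancestor that satisfies a condition. Pass a function that takes an Adaptor object as its argument and returns True if the condition is met and False otherwise, as shown below: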

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
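To make the two ideas above concrete (walking the whole ancestor chain, and stopping at the first ancestor that satisfies a predicate), here is a minimal standard-library sketch. The Node class and the iter_ancestors name are hypothetical stand-ins introduced for illustration; this is not Scrapling's Adaptor class or its implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed element (not Scrapling's Adaptor)."""
    tag: str
    classes: tuple = ()
    parent: Optional["Node"] = None

    def has_class(self, name: str) -> bool:
        return name in self.classes

    def iter_ancestors(self) -> Iterator["Node"]:
        """Yield the parent, then the grandparent, and so on up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def find_ancestor(self, predicate: Callable[["Node"], bool]) -> Optional["Node"]:
        """Return the nearest ancestor for which predicate(ancestor) is true, else None."""
        return next((a for a in self.iter_ancestors() if predicate(a)), None)


# Build the chain body > div.container > div.row > div.col-md-8 > div.quote
body = Node("body")
container = Node("div", ("container",), body)
row = Node("div", ("row",), container)
col = Node("div", ("col-md-8",), row)
quote = Node("div", ("quote",), col)

print([a.tag for a in quote.iter_ancestors()])                   # ['div', 'div', 'div', 'body']
print(quote.find_ancestor(lambda a: a.has_class("row")) is row)  # True
```

The predicate receives one ancestor at a time, nearest first, which is why the search can stop as soon as the condition holds.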

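Content-based selection and finding similar elements

You can select elements by their text content in several ways. Here is a complete example on another website: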

>>> page = Fetcher.get('https://books.toscrape.com/index.html')

>>> # Find the first element whose text fully matches this text
>>> page.find_by_text('Tipping the Velvet')

>>> # Get all matches if there are more
>>> page.find_by_text('Tipping the Velvet', first_match=False)

>>> # Get the first element whose text content matches the price regex (here, the element containing £51.77)
>>> page.find_by_regex(r'£[\d\.]+')

>>> # Get all elements that match the price regex (£51.77, £53.74, £50.10, £47.82, ...)
>>> page.find_by_regex(r'£[\d\.]+', first_match=False)

>>> # Find elements similar to the matched one. For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])

>>> # You will notice that the number of elements is 19 not 20 because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19

>>> # Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html', 'catalogue/soumission_998/index.html', 'catalogue/sharp-objects_997/index.html', ...]

>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
...     print({
...         "name": product.css_first('h3 a::text'),
...         "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
...         "stock": product.css('.availability::text')[-1].clean()
...     })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
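As a rough illustration of what text matching, regex matching, and "similar element" lookup can mean, the sketch below reimplements the three ideas on a tiny document using only the standard library. It assumes that "similar" means same tag, same depth, same parent tag, and the same attribute names once the ignored ones are removed; that definition, the sample markup, and the helper functions are assumptions made for this example, not Scrapling's actual matching rules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the real books.toscrape.com page).
HTML = """
<ol>
  <li><h3><a href="catalogue/a/index.html" title="Book A">Book A</a></h3><p class="price_color">£51.77</p></li>
  <li><h3><a href="catalogue/b/index.html" title="Book B">Book B</a></h3><p class="price_color">£53.74</p></li>
  <li><h3><a href="catalogue/c/index.html" title="Book C">Book C</a></h3><p class="price_color">£50.10</p></li>
</ol>
""".strip()


def index_tree(root):
    """Map every element to (parent, depth); ElementTree keeps no parent links."""
    info = {root: (None, 0)}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            info[child] = (node, info[node][1] + 1)
            stack.append(child)
    return info


def find_by_text(root, text):
    """First element whose own text equals `text` exactly, or None."""
    return next((el for el in root.iter() if (el.text or "").strip() == text), None)


def find_by_regex(root, pattern):
    """All elements whose own text matches `pattern`, in document order."""
    rx = re.compile(pattern)
    return [el for el in root.iter() if rx.search(el.text or "")]


def find_similar(root, target, ignore_attributes=()):
    """Elements that share tag, depth, parent tag and attribute names with `target`."""
    info = index_tree(root)

    def shape(el):
        parent, depth = info[el]
        names = frozenset(k for k in el.attrib if k not in ignore_attributes)
        return (el.tag, depth, getattr(parent, "tag", None), names)

    # The target itself is excluded, so N similar items yield N - 1 results.
    return [el for el in root.iter() if el is not target and shape(el) == shape(target)]


root = ET.fromstring(HTML)
first = find_by_text(root, "Book A")
similar = find_similar(root, first, ignore_attributes=("title",))
prices = [el.text for el in find_by_regex(root, r"£[\d.]+")]

print([el.get("href") for el in similar])  # ['catalogue/b/index.html', 'catalogue/c/index.html']
print(prices)                              # ['£51.77', '£53.74', '£50.10']
```

Excluding the target from the result mirrors the note above that 19 rather than 20 elements come back for a page of 20 books.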

Handling structural changes

Suppose you are scraping a product page whose content looks like this:

Product 1

Description 1

Product 2

Description 2

To scrape the first product, the one whose ID is p1, you might write a selector like this:

page.css('#p1')

Now suppose the site owner makes structural changes to the page:

Description 1

Description 2

The selector will no longer work, and your code will need maintenance. This is where Scrapling's auto-match feature comes in.
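One way to picture the idea behind this kind of matching is sketched below with the standard library only: remember a few properties of the element while the selector still works, and when the selector later fails, pick the element in the new page that is most similar to what was remembered. The before/after markup, the fingerprint fields, and the scoring formula are all assumptions made for this illustration; the sketch does not claim to reproduce Scrapling's actual algorithm or storage.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical markup before and after a redesign (only the text content comes from the example above).
OLD = """<div class="container"><section class="products">
<article class="product" id="p1"><h3>Product 1</h3><p class="description">Description 1</p></article>
<article class="product" id="p2"><h3>Product 2</h3><p class="description">Description 2</p></article>
</section></div>"""

NEW = """<div class="new-container"><div class="product-wrapper"><section class="products">
<article class="product new-class" data-id="p1"><div class="product-info"><h3>Product 1</h3><p class="new-description">Description 1</p></div></article>
<article class="product new-class" data-id="p2"><div class="product-info"><h3>Product 2</h3><p class="new-description">Description 2</p></div></article>
</section></div></div>"""


def fingerprint(el):
    """Properties remembered about an element (an assumed, simplified choice)."""
    return {
        "tag": el.tag,
        "attrs": " ".join(f"{k}={v}" for k, v in sorted(el.attrib.items())),
        "text": " ".join("".join(el.itertext()).split()),
    }


def ratio(a, b):
    """String similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def score(saved, el):
    """Average of tag equality, attribute similarity and text similarity."""
    cand = fingerprint(el)
    tag_score = 1.0 if cand["tag"] == saved["tag"] else 0.0
    return (tag_score + ratio(saved["attrs"], cand["attrs"]) + ratio(saved["text"], cand["text"])) / 3


def select_by_id(root, element_id):
    """Stand-in for a '#id' selector: the element with that id, or None."""
    return next((el for el in root.iter() if el.get("id") == element_id), None)


def relocate(root, saved):
    """Return the element in `root` that best matches the saved fingerprint."""
    return max(root.iter(), key=lambda el: score(saved, el))


# 1. While the selector still works, remember what the element looked like.
saved = fingerprint(select_by_id(ET.fromstring(OLD), "p1"))

# 2. After the redesign the id is gone, so the selector finds nothing.
new_root = ET.fromstring(NEW)
element = select_by_id(new_root, "p1")

# 3. Fall back to the most similar element in the new page.
if element is None:
    element = relocate(new_root, saved)

print(element.tag, element.get("data-id"))  # article p1
```

In Scrapling, this save-then-relocate flow is exposed through the auto_save and auto_match arguments of the selection methods: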

from scrapling import Adaptor

# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)
if not element:  # One day website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...

>>> import re
>>> from scrapling import Fetcher
>>> page = Fetcher.get('https://quotes.toscrape.com/')

>>> # Find all elements with tag name `div`.
>>> page.find_all('div')

>>> # Find all div elements with a class that equals `quote`.
>>> page.find_all('div', class_='quote')

>>> # Same as above.
>>> page.find_all('div', {'class': 'quote'})

>>> # Find all elements with a class that equals `quote`.
>>> page.find_all({'class': 'quote'})

>>> # Find all div elements with a class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
>>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))

>>> # Find all elements that have children.
>>> page.find_all(lambda element: len(element.children) > 0)

>>> # Find all elements that contain the word 'world' in their content.
>>> page.find_all(lambda element: "world" in element.text)

>>> # Find all span elements that match the given regex
>>> page.find_all('span', re.compile(r'world'))

>>> # Find all div and span elements with class 'quote' (No span elements like that so only div returned)
>>> page.find_all(['div', 'span'], {'class': 'quote'})

>>> # Mix things up
>>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein', 'J.K. Rowling',...]

License

This project is developed and released under the BSD-3-Clause open-source license.

Project repository

Scrapling: https://github.com/D4Vinci/Scrapling


References

https://camoufox.com/python/installation/#download-the-browser
https://github.com/Vinyzu
https://github.com/daijro/browserforge
https://github.com/daijro/camoufox