Welcome to zyte-parsers’s documentation!
zyte-parsers is a Python 3.8+ library that contains functions to extract data from webpage parts.
Intro
zyte-parsers provides functions that extract specific data from HTML elements. The input element can be an instance of either parsel.selector.Selector or lxml.html.HtmlElement. Some functions can also take a string with text (e.g. extracted from HTML or JSON) as input.
- zyte_parsers.SelectorOrElement
alias of Union[Selector, HtmlElement, HtmlComment]
Parsers
Brand
- zyte_parsers.extract_brand_name(node: Selector | HtmlElement | HtmlComment, search_depth: int = 0) -> str | None
Extract a brand name from a node that contains it.
It tries element text and image alt and title attributes.
- Parameters:
node – Node including the brand name.
search_depth – Max depth for searching images.
- Returns:
The brand name or None.
GTIN
- zyte_parsers.extract_gtin(node: Selector | HtmlElement | HtmlComment | str) -> Gtin | None
Extract a GTIN (Global Trade Item Number) from a node or a string that contains its text.
It detects the GTIN type and returns it together with the cleaned GTIN value. The following types are supported: isbn10, isbn13, issn, ismn, upc, gtin8, gtin13, gtin14.
- Parameters:
node – A node or a string that includes the GTIN text.
- Returns:
A GTIN item.
Price
- zyte_parsers.extract_price(node: Selector | HtmlElement | HtmlComment | str, *, currency_hint: Selector | HtmlElement | HtmlComment | str | None = None) -> Price
Extract a price value from a node or a string that contains it.
- Parameters:
node – A node or a string that includes the price text.
currency_hint – A string or a node that can contain the currency. It is passed as a hint to price-parser. If a currency is present in the price string itself, it may be preferred over the value extracted from currency_hint.
- Returns:
The price value as a price_parser.Price object.
Ratings and review count
- class zyte_parsers.AggregateRating(bestRating: float | None = None, ratingValue: float | None = None)
- zyte_parsers.extract_rating(node: Selector | HtmlElement | HtmlComment) -> AggregateRating
Extract rating data from a node.
- Parameters:
node – Node that includes the rating data.
- Returns:
AggregateRating item.
- zyte_parsers.extract_rating_stars(node: Selector | HtmlElement | HtmlComment) -> float | None
Extract a rating value from a node containing rating stars.
- Parameters:
node – Node that includes the rating stars.
- Returns:
Rating value as a float or None.
- zyte_parsers.extract_review_count(node: Selector | HtmlElement | HtmlComment) -> int | None
Extract review count from a node containing it.
- Parameters:
node – Node that includes the review count.
- Returns:
Review count as an int or None.
Changes
0.5.0 (2024-01-24)
Add the extract_rating and extract_rating_stars functions for extracting rating values.
Add the extract_review_count function for extracting review counts.
0.4.0 (2023-12-26)
New dependencies:
gtin-validator >= 1.0.3
python-stdnum >= 1.19
six
Add the extract_gtin function for extracting GTIN values of various types.
Add support for text input to extract_price.
Add support for Python 3.12.
CI improvements.
0.3.0 (2023-07-28)
Now requires price-parser >= 0.3.4.
Add the extract_price function for extracting prices and currencies.
0.2.0 (2023-07-07)
Add the extract_brand_name function for extracting brands.
Drop Python 3.7 support.
0.1.1 (2023-05-24)
Fix building documentation.
0.1.0 (2023-05-24)
Initial version.
Includes extraction of Breadcrumb objects.