Welcome to zyte-parsers’s documentation!

zyte-parsers is a Python 3.8+ library that contains functions to extract data from parts of web pages.

Intro

zyte-parsers provides functions that extract specific data from HTML elements. The input element can be an instance of either parsel.selector.Selector or lxml.html.HtmlElement. Some functions can also take a string with text (e.g. extracted from HTML or JSON) as input.

zyte_parsers.SelectorOrElement

alias of Union[Selector, HtmlElement, HtmlComment]

Parsers

Brand

zyte_parsers.extract_brand_name(node: Selector | HtmlElement | HtmlComment, search_depth: int = 0) → str | None

Extract a brand name from a node that contains it.

It tries element text and image alt and title attributes.

Parameters:
  • node – Node including the brand name.

  • search_depth – Max depth for searching images.

Returns:

The brand name or None.

GTIN

class zyte_parsers.Gtin(type: str, value: str)

type: str

value: str

zyte_parsers.extract_gtin(node: Selector | HtmlElement | HtmlComment | str) → Gtin | None

Extract a GTIN (Global Trade Item Number) from a node or a string that contains its text.

It detects the GTIN type and returns it together with the cleaned GTIN value. The following types are supported: isbn10, isbn13, issn, ismn, upc, gtin8, gtin13, gtin14.

Parameters:

node – A node or a string that includes the GTIN text.

Returns:

A Gtin item or None.

Price

zyte_parsers.extract_price(node: Selector | HtmlElement | HtmlComment | str, *, currency_hint: Selector | HtmlElement | HtmlComment | str | None = None) → Price

Extract a price value from a node or a string that contains it.

Parameters:
  • node – A node or a string that includes the price text.

  • currency_hint – A string or a node that may contain the currency. It is passed as a hint to price-parser. If a currency is present in the price string, it may be preferred over the value extracted from currency_hint.

Returns:

The price value as a price_parser.Price object.

Ratings and review count

class zyte_parsers.AggregateRating(bestRating: float | None = None, ratingValue: float | None = None)

bestRating: float | None

ratingValue: float | None

zyte_parsers.extract_rating(node: Selector | HtmlElement | HtmlComment) → AggregateRating

Extract rating data from a node.

Parameters:

node – Node that includes the rating data.

Returns:

AggregateRating item.

zyte_parsers.extract_rating_stars(node: Selector | HtmlElement | HtmlComment) → float | None

Extract a rating value from a node containing rating stars.

Parameters:

node – Node that includes the rating stars.

Returns:

Rating value as a float or None.

zyte_parsers.extract_review_count(node: Selector | HtmlElement | HtmlComment) → int | None

Extract review count from a node containing it.

Parameters:

node – Node that includes the review count.

Returns:

Review count as an int or None.

Changes

0.5.0 (2024-01-24)

  • Add the extract_rating and extract_rating_stars functions for extracting rating values.

  • Add the extract_review_count function for extracting review counts.

0.4.0 (2023-12-26)

  • New dependencies:

    • gtin-validator >= 1.0.3

    • python-stdnum >= 1.19

    • six

  • Add the extract_gtin function for extracting GTIN values of various types.

  • Add support for text input to extract_price.

  • Add support for Python 3.12.

  • CI improvements.

0.3.0 (2023-07-28)

  • Now requires price-parser >= 0.3.4.

  • Add the extract_price function for extracting prices and currencies.

0.2.0 (2023-07-07)

  • Add the extract_brand_name function for extracting brands.

  • Drop Python 3.7 support.

0.1.1 (2023-05-24)

  • Fix building documentation.

0.1.0 (2023-05-24)

  • Initial version.

  • Includes extraction of Breadcrumb objects.