Module `pyfx.pricingscience.similarity`

Functions and utilities for product similarity analysis.

Contains formulas to evaluate distances between products based on textual, categorical, or numerical features. Textual features are extracted from text fields (e.g. the description) with SentenceTransformer. Similar products are identified with Fast Approximate Nearest Neighbor search. Groups of similar products are extracted as a partition of the nearest-neighbor graph using the Leiden algorithm.

Functions

def approx_top_knn(text_embedding: Optional[numpy.ndarray], cat_array: Optional[numpy.ndarray], num_array: Optional[numpy.ndarray], top_k: int, batch_queries: bool = True, text_n_neighbor_connect: int = 30, cat_n_neighbor_connect: int = 30, num_n_neighbor_connect: int = 30, text_query_rate: float = 0.1, cat_query_rate: float = 0.1, num_query_rate: float = 0.1) ‑> Optional[numpy.ndarray]

Identify closest neighbours.

Given arrays of products * descriptors, identify which products are neighbors using fast approximate nearest neighbor search.

Args

text_embedding: 2-D input array of products * textual features,
cat_array: 2-D input array of products * categorical features,
num_array: 2-D input array of products * numerical features,
top_k: number of nearest neighbors to return,
batch_queries: use parallelism for queries,
text_n_neighbor_connect: number of neighbors to use for ANN search on textual features; higher values increase accuracy at the cost of time,
cat_n_neighbor_connect: number of neighbors to use for ANN search on categorical features; higher values increase accuracy at the cost of time,
num_n_neighbor_connect: number of neighbors to use for ANN search on numerical features; higher values increase accuracy at the cost of time,
text_query_rate: backtracking parameter that increases accuracy at the cost of time during ANN search on textual features,
cat_query_rate: backtracking parameter that increases accuracy at the cost of time during ANN search on categorical features,
num_query_rate: backtracking parameter that increases accuracy at the cost of time during ANN search on numerical features.

Returns

The array of approximate nearest neighbors across all kinds of features.
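As a point of reference for what the approximate search estimates, the sketch below computes exact nearest neighbours by brute force on a tiny feature array. It is only an illustration under stated assumptions: `exact_top_k` is a hypothetical helper, it uses plain Python lists and Euclidean distance instead of this module's ANN index, and it does not reproduce how `approx_top_knn` combines textual, categorical and numerical neighbours.

```python
import math


def exact_top_k(features, top_k):
    """Return, for each row, the indices of its top_k closest other rows.

    Hypothetical helper: exact brute-force search with Euclidean distance.
    """
    neighbours = []
    for i, row in enumerate(features):
        # Distance from product i to every other product (self excluded).
        dists = [
            (math.dist(row, other), j)
            for j, other in enumerate(features)
            if j != i
        ]
        dists.sort()  # closest first
        neighbours.append([j for _, j in dists[:top_k]])
    return neighbours


# Four products described by two numerical features (made-up values).
products = [
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
]
nearest = exact_top_k(products, top_k=1)
```

Brute force costs one distance per pair of products, which is why an approximate index is preferable on large catalogues.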

def extract_embeddings(text_list_: List[str], text_transformer_: str) ‑> pandas.core.frame.DataFrame

Create the embeddings from a list of texts using a specific transformer.

Args

text_list_: the list of texts to process,
text_transformer_: the name of the text transformer to use; see https://www.sbert.net/docs/pretrained_models.html for available transformers.

Returns

A DataFrame of the embeddings.

def get_leiden_partition_opti_modularity(prod_graph: igraph.Graph, _max_communities_size: int) ‑> igraph.Graph

Build the modularity-maximising partition of a graph using the Leiden algorithm.

Args

prod_graph: original graph built using the igraph library,
_max_communities_size: maximum size allowed for a community.

Returns

The graph containing optimal partition information.

def get_leiden_partition_with_resolution_parameter(prod_graph: igraph.Graph, _resolution_param: float, _max_communities_size: int) ‑> igraph.Graph

Build a partition of a graph using the Leiden algorithm.

Args

prod_graph: original graph built using the igraph library,
_resolution_param: higher resolutions lead to more communities and lower resolutions lead to fewer communities,
_max_communities_size: maximum size allowed for a community.

Returns

The graph containing partition information.
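To illustrate why a higher resolution yields more communities, the sketch below evaluates generalised modularity, Q = sum over communities c of (L_c / m - resolution * (d_c / 2m)^2), for two candidate partitions of a toy graph. Here L_c is the number of edges inside c, d_c the total degree of c and m the number of edges; a resolution of 1 gives standard modularity. This is an assumption-laden illustration: `modularity` is a hypothetical helper, the graph is made up, and the quality function optimised by the functions above may differ.

```python
def modularity(edges, membership, resolution=1.0):
    """Generalised modularity of a partition of an undirected, unweighted graph.

    Hypothetical helper; `membership` maps each node to its community label.
    """
    m = len(edges)
    internal = {}      # number of edges with both ends in the community
    total_degree = {}  # sum of node degrees per community
    for a, b in edges:
        # Each edge adds one to the degree of both of its endpoints.
        for node in (a, b):
            c = membership[node]
            total_degree[c] = total_degree.get(c, 0) + 1
        if membership[a] == membership[b]:
            c = membership[a]
            internal[c] = internal.get(c, 0) + 1
    return sum(
        internal.get(c, 0) / m - resolution * (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

# Low resolution favours a single community, higher resolution favours two.
low_prefers_merged = modularity(edges, merged, 0.2) > modularity(edges, split, 0.2)
high_prefers_split = modularity(edges, split, 1.0) > modularity(edges, merged, 1.0)
```

On this graph the preferred partition switches from one community to two once the resolution exceeds 2/7.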

def pairwise_cos(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Cosine similarity.

Compute an array of Cosine similarity between an array (of textual descriptors) of products and their respective neighbours.

Args

ndarray: first 2-D input array of products,
approximate_neighbors: second 2-D input array of neighbours.

Returns

The array of Cosine distance between ndarray and approximate_neighbors.
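The relationship between cosine similarity and cosine distance can be sketched as follows. The example assumes that row i of the first array is compared with row i of the second, and that distance is defined as 1 minus similarity. The helper names are hypothetical, and plain Python lists stand in for NumPy arrays.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rowwise_cosine(products, neighbours):
    """Compare row i of `products` with row i of `neighbours` (assumed layout)."""
    return [cosine_similarity(p, n) for p, n in zip(products, neighbours)]


products = [[1.0, 0.0], [1.0, 1.0]]
neighbours = [[2.0, 0.0], [1.0, -1.0]]

similarities = rowwise_cosine(products, neighbours)
# Assumed convention: distance = 1 - similarity.
distances = [1.0 - s for s in similarities]
```

The first pair points in the same direction (similarity 1, distance 0); the second pair is orthogonal (similarity 0, distance 1).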

def pairwise_ham(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Hamming similarity.

Compute an array of Hamming similarity between an array (of categorical descriptors) of products and their respective neighbours.

Args

ndarray: first 2-D input array of products,
approximate_neighbors: second 2-D input array of neighbours.

Returns

The array of Hamming distance between ndarray and approximate_neighbors.
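A minimal sketch of a row-wise Hamming computation on categorical descriptors follows. It assumes the normalised convention (the fraction of positions that differ) and takes similarity as 1 minus distance; the helper name and the sample categories are hypothetical.

```python
def hamming_distance(u, v):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


# Each row holds the categorical descriptors of one product (made-up values).
products = [["red", "S", "cotton"], ["blue", "M", "wool"]]
neighbours = [["red", "M", "cotton"], ["blue", "M", "wool"]]

# Row i of `products` is compared with row i of `neighbours` (assumed layout).
distances = [hamming_distance(p, n) for p, n in zip(products, neighbours)]
similarities = [1.0 - d for d in distances]
```

The first pair differs in one of three categories, so its distance is 1/3; the second pair is identical, so its distance is 0.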

def pairwise_maha(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray, vi: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Mahalanobis similarity.

Compute an array of Mahalanobis similarity between an array (of numerical descriptors) of products and their respective neighbours.

Args

ndarray: first 2-D input array of products,
approximate_neighbors: second 2-D input array of neighbours,
vi: the pseudo-inverse of the covariance matrix.

Returns

The array of Mahalanobis distance between ndarray and approximate_neighbors.
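For reference, the standard Mahalanobis distance between two vectors u and v is sqrt((u - v)^T VI (u - v)), where VI is the (pseudo-)inverse of the covariance matrix. The sketch below implements that formula for a single pair with plain Python lists. The helper name is hypothetical, and whether `pairwise_maha` returns this distance or a similarity derived from it is not specified here.

```python
import math


def mahalanobis(u, v, vi):
    """Standard Mahalanobis distance; `vi` is the inverse covariance matrix."""
    diff = [a - b for a, b in zip(u, v)]
    # Matrix-vector product vi @ diff.
    weighted = [sum(w * d for w, d in zip(row, diff)) for row in vi]
    # Quadratic form diff^T @ vi @ diff, then square root.
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


# With the identity matrix the distance reduces to the Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_euclid = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)

# A feature with variance 4 (inverse 0.25) contributes less per unit of difference.
vi = [[0.25, 0.0], [0.0, 1.0]]
d_scaled = mahalanobis([0.0, 0.0], [2.0, 0.0], vi)
```

Scaling by the inverse covariance means that a difference of 2 on a feature with standard deviation 2 counts as a distance of 1.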

def process_new_product(new_product_index: int, new_products_text_df: pandas.core.frame.DataFrame, new_products_num_df: pandas.core.frame.DataFrame, new_products_cat_df: pandas.core.frame.DataFrame, original_text_df: pandas.core.frame.DataFrame, original_num_df: pandas.core.frame.DataFrame, original_cat_df: pandas.core.frame.DataFrame, graph_of_prods: igraph.Graph, products_with_group: pandas.core.frame.DataFrame, max_iterations: int = 10, text_weight: float = 1.0, categorical_weight: float = 1.0, numerical_weight: float = 1.0, nb_neighbours: int = 20, vi: numpy.ndarray = array([], dtype=float64), group_column: str = 'GroupID', index_column: str = 'prodmap_index', sampling_fraction: float = 0.03) ‑> tuple[list[int], numpy.ndarray]

Find the known products closest to newly added products.

Args

new_product_index: index of one of the newly added products,
new_products_text_df: textual DataFrame of new products,
new_products_num_df: numerical DataFrame of new products,
new_products_cat_df: categorical DataFrame of new products,
original_text_df: textual DataFrame of known products,
original_num_df: numerical DataFrame of known products,
original_cat_df: categorical DataFrame of known products,
graph_of_prods: graph of known products based on similarity information,
products_with_group: known products with their assigned group,
max_iterations: maximum number of iterations performed by the algorithm when searching for the best set of known products; the algorithm stops early once the best set is found,
nb_neighbours: number of candidates to find,
vi: inverse of the covariance matrix of the original numerical DataFrame, used to compute Mahalanobis similarity,
group_column: name of the column of products_with_group used for stratified sampling,
index_column: name of the column containing product indexes in products_with_group,
sampling_fraction: fraction of products returned by stratified sampling.

Returns

The list of indexes of the closest candidates and their similarities.
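The stratified sampling controlled by `group_column` and `sampling_fraction` can be pictured with the sketch below, which draws the same fraction of products from every group. It is a guess at the general idea, not the module's implementation: `stratified_sample`, the rule of keeping at least one product per group, and the fixed seed are all assumptions made for illustration.

```python
import math
import random


def stratified_sample(group_of, fraction, seed=0):
    """Sample `fraction` of the product indexes from each group.

    Hypothetical helper; `group_of` maps a product index to its group label.
    """
    rng = random.Random(seed)
    by_group = {}
    for index, group in group_of.items():
        by_group.setdefault(group, []).append(index)

    sample = []
    for group in sorted(by_group):
        members = by_group[group]
        # Assumption: every group contributes at least one product.
        n_keep = max(1, math.ceil(fraction * len(members)))
        sample.extend(rng.sample(members, n_keep))
    return sample


# Twelve known products: indexes 0-7 in group "A", 8-11 in group "B".
group_of = {i: ("A" if i < 8 else "B") for i in range(12)}
candidates = stratified_sample(group_of, fraction=0.25)
```

Sampling per group keeps small groups represented among the candidates, which a uniform sample over all products would not guarantee.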

def similarity(input_df: pandas.core.frame.DataFrame, cat_columns: Optional[List[str]] = None, num_columns: Optional[List[str]] = None, text_embedding: Optional[pandas.core.frame.DataFrame] = None, k_approx_neighbors: int = 400) ‑> pandas.core.frame.DataFrame

Build a table of similar products based on numerical, categorical and textual descriptors.

Args

input_df: input DataFrame of products and their descriptors,
cat_columns: list of categorical column names,
num_columns: list of numerical column names,
text_embedding: DataFrame of products * textual features,
k_approx_neighbors: number of approximate nearest neighbors to return.

Returns

The DataFrame of similar products.