Module pyfx.pricingscience.similarity

Functions and utilities for product similarity analysis.

Contains formulas to evaluate distances between products based on textual, categorical, or numerical features. Uses SentenceTransformer to extract textual features of products from text fields (such as the description), fast approximate nearest neighbor search to identify similar products, and the Leiden algorithm to partition the graph of nearest neighbors into groups of similar products.

Functions

def approx_top_knn(text_embedding: Optional[numpy.ndarray], cat_array: Optional[numpy.ndarray], num_array: Optional[numpy.ndarray], top_k: int, batch_queries: bool = True, text_n_neighbor_connect: int = 30, cat_n_neighbor_connect: int = 30, num_n_neighbor_connect: int = 30, text_query_rate: float = 0.1, cat_query_rate: float = 0.1, num_query_rate: float = 0.1) ‑> Optional[numpy.ndarray]

Identify closest neighbours.

Identify among an array of products * descriptors which products are neighbors using fast approximate nearest neighbor search.

Args

text_embedding
2-D input array of products * textual features,
cat_array
2-D input array of products * categorical features,
num_array
2-D input array of products * numerical features,
top_k
number of nearest neighbors to return,
batch_queries
use parallelism for queries,
text_n_neighbor_connect
The number of neighbors to use for ANN search using
textual features. High values increase accuracy at cost of time,
cat_n_neighbor_connect
The number of neighbors to use for ANN search using
categorical features. High values increase accuracy at cost of time,
num_n_neighbor_connect
The number of neighbors to use for ANN search using
numerical features. High values increase accuracy at cost of time,
text_query_rate
backtracking parameter that increases accuracy at cost of time
during textual features based ANN search,
cat_query_rate
backtracking parameter that increases accuracy at cost of time
during categorical features based ANN search,
num_query_rate
backtracking parameter that increases accuracy at cost of time
during numerical features based ANN search.

Returns

The array of approximate nearest neighbors across all kinds of features.
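
As a point of reference for what an approximate search trades away, the exact answer can be computed by brute force on small inputs. The sketch below is not how approx_top_knn works internally; exact_top_k, the Euclidean metric, and the toy data are assumptions made for illustration, and plain lists are used instead of NumPy arrays to keep the example dependency-free.

```python
import math

def exact_top_k(points, top_k):
    """For each row, return the indices of its top_k nearest other rows (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from row i to every other row; the row itself is excluded.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:top_k]])
    return neighbors

# Hypothetical data: two tight pairs of products in a 2-D feature space.
products = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(exact_top_k(products, top_k=1))  # [[1], [0], [3], [2]]
```

Brute force costs one distance per pair of products, which is why an approximate index becomes attractive as the number of products grows.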

def extract_embeddings(text_list_: List[str], text_transformer_: str) ‑> pandas.core.frame.DataFrame

Create the embeddings from a list of texts using a specific transformer.

Args

text_list_
the list of texts to process,
text_transformer_
the name of the text transformer to use; see https://www.sbert.net/docs/pretrained_models.html for other available transformers.

Returns

A DataFrame of the embeddings.

def get_leiden_partition_opti_modularity(prod_graph: igraph.Graph, _max_communities_size: int) ‑> igraph.Graph

Build the modularity-maximising partition of a graph using the Leiden algorithm.

Args

prod_graph
original graph built using iGraph library,
_max_communities_size
restrict the communities size.

Returns

The graph containing optimal partition information.

def get_leiden_partition_with_resolution_parameter(prod_graph: igraph.Graph, _resolution_param: float, _max_communities_size: int) ‑> igraph.Graph

Build a partition of a graph using the Leiden algorithm.

Args

prod_graph
original graph built using iGraph library,
_resolution_param
higher resolutions lead to more communities and
lower resolutions lead to fewer communities,
_max_communities_size
restrict the communities size.

Returns

The graph containing partition information.
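
The effect of the resolution parameter can be illustrated by scoring a few hand-picked partitions of a tiny graph. The sketch assumes a generalised modularity of the form sum over communities of e_c/m - resolution * (d_c/2m)^2, where e_c counts edges inside community c, d_c is its total degree, and m is the number of edges. Whether this module optimises exactly this quality function is not stated here, and the sketch only compares candidate partitions; it does not run the Leiden algorithm. The names modularity, partitions, and best_partition are hypothetical.

```python
def modularity(edges, membership, resolution=1.0):
    """Generalised modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = {}  # community -> number of edges with both ends inside it
    degree = {}    # community -> total degree of its vertices
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - resolution * (d / (2 * m)) ** 2
        for c, d in degree.items()
    )

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partitions = {
    "one group": [0, 0, 0, 0, 0, 0],
    "two triangles": [0, 0, 0, 1, 1, 1],
    "singletons": [0, 1, 2, 3, 4, 5],
}

def best_partition(resolution):
    """Name of the candidate partition with the highest score at this resolution."""
    return max(partitions, key=lambda name: modularity(edges, partitions[name], resolution))

for gamma in (0.1, 1.0, 3.0):
    print(gamma, best_partition(gamma))
# 0.1 one group
# 1.0 two triangles
# 3.0 singletons
```

With resolution 1.0 the score reduces to standard modularity and the two triangles win; lowering it favours merging everything, and raising it favours splitting into more, smaller communities.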

def pairwise_cos(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Cosine similarity.

Compute an array of Cosine similarity between an array (of textual descriptors) of products and their respective neighbours.

Args

ndarray
first 2-D input array of products,
approximate_neighbors
second 2-D input array of neighbours.

Returns

The array of Cosine distance between ndarray and approximate_neighbors.
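
The summary above speaks of similarity and the Returns entry of distance; the two are related by distance = 1 - similarity. A minimal sketch of both follows, assuming row i of the first array is compared with row i of the second. That pairing, the function names, and the use of plain lists instead of NumPy arrays are assumptions for illustration, not the signature of pairwise_cos.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rowwise_cosine(products, neighbors):
    """Similarity between row i of products and row i of neighbors."""
    return [cosine_similarity(p, n) for p, n in zip(products, neighbors)]

# Hypothetical embeddings: the first pair is parallel, the second orthogonal.
products = [[1.0, 0.0], [1.0, 1.0]]
neighbors = [[2.0, 0.0], [1.0, -1.0]]

similarities = rowwise_cosine(products, neighbors)
distances = [1.0 - s for s in similarities]
print(similarities)  # [1.0, 0.0]
print(distances)     # [0.0, 1.0]
```

Cosine similarity ignores vector length, so an embedding and a scaled copy of it count as identical.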

def pairwise_ham(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Hamming similarity.

Compute an array of Hamming similarity between an array (of categorical descriptors) of products and their respective neighbours.

Args

ndarray
first 2-D input array of products,
approximate_neighbors
second 2-D input array of neighbours.

Returns

The array of Hamming distance between ndarray and approximate_neighbors.
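
For categorical descriptors, the Hamming distance is commonly taken as the fraction of positions at which two records differ, with similarity as its complement. The sketch below uses that normalised form; whether pairwise_ham normalises in the same way is an assumption, and hamming_distance and the sample records are hypothetical.

```python
def hamming_distance(u, v):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

# Hypothetical categorical descriptors: colour, size, material.
product = ["red", "M", "cotton"]
neighbor = ["red", "L", "cotton"]

distance = hamming_distance(product, neighbor)
similarity = 1.0 - distance
print(distance, similarity)  # one mismatch out of three positions
```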

def pairwise_maha(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray, vi: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Mahalanobis similarity.

Compute an array of Mahalanobis similarity between an array (of numerical descriptors) of products and their respective neighbours.

Args

ndarray
first 2-D input array of products,
approximate_neighbors
second 2-D input array of neighbours,
vi
the pseudo-inverse of the covariance matrix.

Returns

The array of Mahalanobis distance between ndarray and approximate_neighbors.
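
The Mahalanobis distance between vectors u and v is sqrt((u - v)^T VI (u - v)), where VI is the (pseudo-)inverse of the covariance matrix. The sketch below evaluates that formula for a single pair with plain loops. The function name and the diagonal covariance in the example are assumptions chosen so the result is easy to check by hand.

```python
import math

def mahalanobis(u, v, vi):
    """sqrt((u - v)^T vi (u - v)) for vectors u, v and a square matrix vi."""
    diff = [a - b for a, b in zip(u, v)]
    # Matrix-vector product vi @ diff, written out row by row.
    vi_diff = [sum(row[j] * diff[j] for j in range(len(diff))) for row in vi]
    return math.sqrt(sum(d * w for d, w in zip(diff, vi_diff)))

# Hypothetical covariance diag(4, 1), so its inverse is diag(0.25, 1).
vi = [[0.25, 0.0], [0.0, 1.0]]
origin = [0.0, 0.0]

print(mahalanobis([2.0, 0.0], origin, vi))  # 1.0: two units along the high-variance axis
print(mahalanobis([0.0, 2.0], origin, vi))  # 2.0: two units along the low-variance axis
```

The same Euclidean offset counts for less along a feature with larger variance, which is what makes the measure suitable for numerical descriptors on different scales.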

def process_new_product(new_product_index: int, new_products_text_df: pandas.core.frame.DataFrame, new_products_num_df: pandas.core.frame.DataFrame, new_products_cat_df: pandas.core.frame.DataFrame, original_text_df: pandas.core.frame.DataFrame, original_num_df: pandas.core.frame.DataFrame, original_cat_df: pandas.core.frame.DataFrame, graph_of_prods: igraph.Graph, products_with_group: pandas.core.frame.DataFrame, max_iterations: int = 10, text_weight: float = 1.0, categorical_weight: float = 1.0, numerical_weight: float = 1.0, nb_neighbours: int = 20, vi: numpy.ndarray = array([], dtype=float64), group_column: str = 'GroupID', index_column: str = 'prodmap_index', sampling_fraction: float = 0.03) ‑> tuple[list[int], numpy.ndarray]

Finds closest known products to newly added products.

Args

new_product_index
index of a single newly added product,
new_products_text_df
textual DataFrame of new products,
new_products_num_df
numerical DataFrame of new products,
new_products_cat_df
categorical DataFrame of new products,
original_text_df
textual DataFrame of known products,
original_num_df
numerical DataFrame of known products,
original_cat_df
categorical DataFrame of known products,
graph_of_prods
graph of known products based on similarity information,
products_with_group
known products with their assigned group,
max_iterations
maximum number of iterations performed by the algorithm in search of the best set of known products. The algorithm stops early once the best set is found,
nb_neighbours
number of candidates to find.
vi
inverse of the covariance matrix of the original numerical DataFrame, used to compute Mahalanobis similarity.
group_column
name of column from products_with_group used for stratified sampling.
index_column
name of column containing product indexes in products_with_group.
sampling_fraction
fraction of products returned by stratified sampling.

Returns

The list of indexes of the closest candidates and the array of their similarities.
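
The group_column, index_column, and sampling_fraction arguments refer to stratified sampling, which draws the same fraction from every group so that small groups are still represented. How process_new_product samples internally is not documented here; the sketch below shows only the general technique. stratified_sample, the rule of keeping at least one product per group, and the fixed seed are assumptions, and plain dictionaries stand in for DataFrame rows.

```python
import math
import random

def stratified_sample(rows, group_key, index_key, fraction, seed=0):
    """Sample about `fraction` of the rows from each group, at least one per group."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_group = {}
    for row in rows:
        by_group.setdefault(row[group_key], []).append(row[index_key])
    sampled = []
    for group in sorted(by_group):
        members = by_group[group]
        n = max(1, math.ceil(fraction * len(members)))
        sampled.extend(rng.sample(members, n))  # sampling without replacement
    return sampled

# Hypothetical known products: 8 in group "A", 4 in group "B".
rows = [{"prodmap_index": i, "GroupID": "A" if i < 8 else "B"} for i in range(12)]
sample = stratified_sample(rows, "GroupID", "prodmap_index", fraction=0.25)
print(len(sample))  # 3: two indexes from group A and one from group B
```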

def similarity(input_df: pandas.core.frame.DataFrame, cat_columns: Optional[List[str]] = None, num_columns: Optional[List[str]] = None, text_embedding: Optional[pandas.core.frame.DataFrame] = None, k_approx_neighbors: int = 400) ‑> pandas.core.frame.DataFrame

Build a table of similar products based on numerical, categorical and textual descriptors.

Args

input_df
input DataFrame of products containing the categorical and numerical descriptor columns,
cat_columns
list of categorical column names,
num_columns
list of numerical column names,
text_embedding
DataFrame of products * textual features,
k_approx_neighbors
number of approximate nearest neighbors to return.

Returns

The DataFrame of similar products.