Module pyfx.pricingscience.similarity
Functions and utilities for product similarity analysis.
Contains formulas to evaluate distances between products based on textual, categorical, or numerical features. Based on SentenceTransformer to extract textual features of products from textual fields (i.e. description). Based on Fast Approximate Nearest Neighbor search to identify similar products. Based on the Leiden algorithm to extract a partition from the graph of nearest neighbors, i.e. groups of similar products.
Functions
def approx_top_knn(text_embedding: Optional[numpy.ndarray], cat_array: Optional[numpy.ndarray], num_array: Optional[numpy.ndarray], top_k: int, batch_queries: bool = True, text_n_neighbor_connect: int = 30, cat_n_neighbor_connect: int = 30, num_n_neighbor_connect: int = 30, text_query_rate: float = 0.1, cat_query_rate: float = 0.1, num_query_rate: float = 0.1) ‑> Optional[numpy.ndarray]
-
Identify closest neighbours.
Identify which products in an array of products * descriptors are neighbors, using fast approximate nearest neighbor search.
Args
text_embedding
- 2-D input array of products * textual features,
cat_array
- 2-D input array of products * categorical features,
num_array
- 2-D input array of products * numerical features,
top_k
- number of nearest neighbors to return,
batch_queries
- use parallelism for queries,
text_n_neighbor_connect
- number of neighbors to use for ANN search on textual features. Higher values increase accuracy at the cost of time,
cat_n_neighbor_connect
- number of neighbors to use for ANN search on categorical features. Higher values increase accuracy at the cost of time,
num_n_neighbor_connect
- number of neighbors to use for ANN search on numerical features. Higher values increase accuracy at the cost of time,
text_query_rate
- backtracking parameter that increases accuracy at the cost of time during ANN search on textual features,
cat_query_rate
- backtracking parameter that increases accuracy at the cost of time during ANN search on categorical features,
num_query_rate
- backtracking parameter that increases accuracy at the cost of time during ANN search on numerical features.
Returns
The array of approximate nearest neighbors across all kinds of features.
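As a point of comparison, the following is a minimal sketch of an exact (brute-force) top-k search using NumPy only. It is not the approximate search that approx_top_knn performs; the helper name exact_top_knn, the Euclidean metric, and the assumed output layout (one row of neighbor indices per product) are illustrative assumptions, not part of this module.

```python
import numpy as np


def exact_top_knn(features: np.ndarray, top_k: int) -> np.ndarray:
    """Hypothetical helper: exact top_k nearest rows for each row of `features`."""
    # Pairwise squared Euclidean distances, shape (n, n).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # Exclude each product from its own neighbour list.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the first top_k column indices.
    return np.argsort(dist, axis=1, kind="stable")[:, :top_k]


# Four toy products described by two numerical features.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]])
neighbors = exact_top_knn(features, top_k=1)
print(neighbors)
```

A brute-force search like this costs O(n^2) in time and memory, which is the reason an approximate method is preferable on large catalogues.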
def extract_embeddings(text_list_: List[str], text_transformer_: str) ‑> pandas.core.frame.DataFrame
-
Create the embeddings from a list of texts using a specific transformer.
Args
text_list_
- the list of texts to process,
text_transformer_
- the name of the text transformer to use; see https://www.sbert.net/docs/pretrained_models.html for other available transformers.
Returns
A DataFrame of the embeddings
def get_leiden_partition_opti_modularity(prod_graph: igraph.Graph, _max_communities_size: int) ‑> igraph.Graph
-
Build the modularity-maximising partition of a graph using the Leiden algorithm.
Args
prod_graph
- original graph built using iGraph library,
_max_communities_size
- upper bound on the size of each community.
Returns
The graph containing optimal partition information.
def get_leiden_partition_with_resolution_parameter(prod_graph: igraph.Graph, _resolution_param: float, _max_communities_size: int) ‑> igraph.Graph
-
Build a partition of a graph using the Leiden algorithm.
Args
prod_graph
- original graph built using iGraph library,
_resolution_param
- higher resolutions lead to more communities and lower resolutions lead to fewer communities,
_max_communities_size
- upper bound on the size of each community.
Returns
The graph containing partition information.
def pairwise_cos(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray
-
Compute an array of Cosine similarity.
Compute an array of Cosine similarity between an array (of textual descriptors) of products and their respective neighbours.
Args
ndarray
- first 2-D input array of products,
approximate_neighbors
- second 2-D input array of neighbours.
Returns
The array of Cosine distance between ndarray and approximate_neighbors.
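The computation can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the module's implementation: it assumes the neighbours are given as an (n, k) array of integer row indices into the feature array, and the helper name rowwise_cosine_similarity is hypothetical. The sketch returns similarities; the corresponding cosine distance is 1 minus the similarity.

```python
import numpy as np


def rowwise_cosine_similarity(features: np.ndarray, neighbor_idx: np.ndarray) -> np.ndarray:
    """Hypothetical helper: cosine similarity between each row and its k neighbours."""
    # Normalise each row to unit length (zero vectors are left unchanged).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)
    # unit[neighbor_idx] has shape (n, k, d); take the dot product with each
    # product's own unit vector to get an (n, k) array of similarities.
    return np.einsum("nd,nkd->nk", unit, unit[neighbor_idx])


features = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
neighbor_idx = np.array([[1, 2], [0, 2], [0, 1]])  # assumed layout: row indices
cos_sim = rowwise_cosine_similarity(features, neighbor_idx)
cos_dist = 1.0 - cos_sim
```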
def pairwise_ham(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray
-
Compute an array of Hamming similarity.
Compute an array of Hamming similarity between an array (of categorical descriptors) of products and their respective neighbours.
Args
ndarray
- first 2-D input array of products,
approximate_neighbors
- second 2-D input array of neighbours.
Returns
The array of Hamming distance between ndarray and approximate_neighbors.
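A minimal NumPy sketch of a normalised Hamming distance follows. It assumes categorical features are already encoded as integers and that neighbours are given as an (n, k) array of row indices; both the layout and the helper name rowwise_hamming_distance are assumptions made for illustration.

```python
import numpy as np


def rowwise_hamming_distance(categories: np.ndarray, neighbor_idx: np.ndarray) -> np.ndarray:
    """Hypothetical helper: fraction of differing features between each row and its neighbours."""
    # categories[neighbor_idx] has shape (n, k, d); compare it feature by
    # feature with each product's own row, broadcast to (n, 1, d).
    mismatches = categories[:, None, :] != categories[neighbor_idx]
    # Average over the feature axis to get the share of mismatching features.
    return mismatches.mean(axis=2)


categories = np.array([[0, 1, 2], [0, 1, 3], [4, 5, 6]])
neighbor_idx = np.array([[1, 2], [0, 2], [0, 1]])  # assumed layout: row indices
ham_dist = rowwise_hamming_distance(categories, neighbor_idx)
```

Under this convention a distance of 0 means all categorical features match and 1 means none do; a similarity can be obtained as 1 minus the distance.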
def pairwise_maha(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray, vi: numpy.ndarray) ‑> numpy.ndarray
-
Compute an array of Mahalanobis similarity.
Compute an array of Mahalanobis similarity between an array (of numerical descriptors) of products and their respective neighbours.
Args
ndarray
- first 2-D input array of products,
approximate_neighbors
- second 2-D input array of neighbours,
vi
- the pseudo-inverse of the covariance matrix.
Returns
The array of Mahalanobis distance between ndarray and approximate_neighbors.
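The role of vi can be illustrated with a short NumPy sketch. It assumes neighbours are given as an (n, k) array of row indices and that vi is obtained with numpy.linalg.pinv from the sample covariance of the numerical features; the helper name rowwise_mahalanobis is hypothetical and the module's own implementation may differ.

```python
import numpy as np


def rowwise_mahalanobis(features: np.ndarray, neighbor_idx: np.ndarray, vi: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Mahalanobis distance between each row and its neighbours."""
    # Difference between each product and each of its neighbours, shape (n, k, d).
    delta = features[:, None, :] - features[neighbor_idx]
    # Quadratic form delta^T . vi . delta for every (product, neighbour) pair.
    squared = np.einsum("nkd,de,nke->nk", delta, vi, delta)
    # Clip tiny negative values caused by rounding before taking the root.
    return np.sqrt(np.maximum(squared, 0.0))


features = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
# Pseudo-inverse of the covariance matrix (columns are the variables).
vi = np.linalg.pinv(np.cov(features, rowvar=False))
neighbor_idx = np.array([[1, 2], [0, 3], [3, 0], [2, 1]])  # assumed layout: row indices
maha = rowwise_mahalanobis(features, neighbor_idx, vi)
```

Using the pseudo-inverse rather than a plain inverse keeps the computation defined when the covariance matrix is singular, for example when two numerical columns are perfectly correlated.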
def process_new_product(new_product_index: int, new_products_text_df: pandas.core.frame.DataFrame, new_products_num_df: pandas.core.frame.DataFrame, new_products_cat_df: pandas.core.frame.DataFrame, original_text_df: pandas.core.frame.DataFrame, original_num_df: pandas.core.frame.DataFrame, original_cat_df: pandas.core.frame.DataFrame, graph_of_prods: igraph.Graph, products_with_group: pandas.core.frame.DataFrame, max_iterations: int = 10, text_weight: float = 1.0, categorical_weight: float = 1.0, numerical_weight: float = 1.0, nb_neighbours: int = 20, vi: numpy.ndarray = array([], dtype=float64), group_column: str = 'GroupID', index_column: str = 'prodmap_index', sampling_fraction: float = 0.03) ‑> tuple[list[int], numpy.ndarray]
-
Finds closest known products to newly added products.
Args
new_product_index
- index of one of the newly added products,
new_products_text_df
- textual DataFrame of new products,
new_products_num_df
- numerical DataFrame of new products,
new_products_cat_df
- categorical DataFrame of new products,
original_text_df
- textual DataFrame of known products,
original_num_df
- numerical DataFrame of known products,
original_cat_df
- categorical DataFrame of known products,
graph_of_prods
- graph of known products based on similarity information,
products_with_group
- known products with their assigned group,
max_iterations
- maximum number of iterations performed by the algorithm in its search for the best set of known products. The algorithm stops early if the best set is found,
nb_neighbours
- number of candidates to find.
vi
- inverse of the covariance matrix of the original numerical DataFrame, used to compute Mahalanobis similarity.
group_column
- name of the column in products_with_group used for stratified sampling.
index_column
- name of the column containing product indexes in products_with_group.
sampling_fraction
- fraction of products returned by stratified sampling.
Returns
A tuple containing the list of indexes of the closest candidates and the array of their similarities.
def similarity(input_df: pandas.core.frame.DataFrame, cat_columns: Optional[List[str]] = None, num_columns: Optional[List[str]] = None, text_embedding: Optional[pandas.core.frame.DataFrame] = None, k_approx_neighbors: int = 400) ‑> pandas.core.frame.DataFrame
-
Build a table of similar products based on numerical, categorical and textual descriptors.
Args
input_df
- input DataFrame of products and their descriptors,
cat_columns
- list of categorical column names,
num_columns
- list of numerical column names,
text_embedding
- DataFrame of products * textual features,
k_approx_neighbors
- number of approximate nearest neighbors to return.
Returns
The DataFrame of similar products.