wordviz package

Submodules

wordviz.clustering module

wordviz.clustering.create_clusters(vectors: ndarray, n_clusters: int = 5, method: str = 'kmeans') → Tuple[ndarray, ndarray | None, ndarray]

Performs clustering on embeddings for visualization purposes. If the input vectors have more than 2 dimensions, dimensionality reduction is applied first.

Parameters:

vectors (np.ndarray) – Array of embeddings to cluster.
n_clusters (int, default=5) – Number of clusters to generate (used only for k-means).
method (str, default='kmeans') – Clustering method to use (‘kmeans’ or ‘dbscan’).

Returns:

labels (np.ndarray) – Cluster labels assigned to each vector.
centers (np.ndarray or None) – Coordinates of cluster centers (k-means only; None for DBSCAN).
reduced_emb (np.ndarray) – 2D reduced embeddings used for clustering and plotting.
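
To make the return values concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on points that are already 2D. It is not the wordviz implementation, which is not shown in this reference; the function name kmeans_2d and the choice of the first points as initial centers are assumptions made for illustration. It returns one label per point plus the center coordinates, which is the part of the contract that DBSCAN cannot provide because a density-based method has no centroids.

```python
import math

def kmeans_2d(points, n_clusters=2, n_iter=10):
    """Plain Lloyd's algorithm on 2D points; returns (labels, centers)."""
    # Deterministic start (an assumption): the first n_clusters points are the initial centers.
    centers = [tuple(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centers

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels, centers = kmeans_2d(points, n_clusters=2)
```

Here labels has one entry per input point and centers has one coordinate pair per cluster, mirroring the labels and centers entries described above.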

wordviz.dim_reduction module

wordviz.dim_reduction.reduce_dim(vectors: ndarray, method: str = 'pca', n_dimensions: int = 2, dist: str = 'euclidean', **kwargs) → ndarray

Reduces the input vectors to 2 or 3 dimensions for visualization, using the specified method.

Parameters:

vectors (np.ndarray) – Input high-dimensional data to reduce.
method (str, default='pca') –
Dimensionality reduction algorithm to apply:
- ‘pca’: Principal Component Analysis
- ‘tsne’: t-Distributed Stochastic Neighbor Embedding
- ‘umap’: Uniform Manifold Approximation and Projection
- ‘isomap’: Isometric Mapping
- ‘mds’: Multidimensional Scaling
n_dimensions (int, default=2) – Number of output dimensions. Must be 2 or 3.
dist (str, default='euclidean') – Distance metric for methods that require a distance matrix (MDS). Ignored for other methods.
**kwargs (dict) – Additional parameters passed to the selected dimensionality reduction algorithm.

Returns:

np.ndarray – Embeddings reduced to the specified number of dimensions.
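
The documented constraints on method and n_dimensions can be checked before calling the function. The sketch below is a hypothetical pre-flight helper (check_reduce_args is not part of wordviz); whether reduce_dim itself raises an error or falls back to a default on invalid input is not stated in this reference.

```python
# Method names as listed in the parameter description above.
SUPPORTED_METHODS = {"pca", "tsne", "umap", "isomap", "mds"}

def check_reduce_args(method="pca", n_dimensions=2):
    """Hypothetical check mirroring the documented constraints of reduce_dim."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(SUPPORTED_METHODS)}"
        )
    if n_dimensions not in (2, 3):
        raise ValueError("n_dimensions must be 2 or 3")
    return method, n_dimensions

checked = check_reduce_args("umap", 3)
```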

wordviz.loading module

class wordviz.loading.EmbeddingLoader

Bases: object

Loads word or sentence embeddings.

embeddings_raw

KeyedVectors object for static embeddings.

Type: Any

embeddings

Array of embeddings.

Type: np.ndarray

tokens

Representative elements for the embeddings in natural language (words, sentences, or other elements to visualize).

Type: list of str

dimension

Dimensionality of the embeddings.

Type: int

type

Type of embedding:
- ‘word’: Word embeddings
- ‘sentence’: Sentence/document/passage embeddings
- ‘word_context’: Word embeddings in different contexts
- ‘custom’: User-defined

Type: str

download_zip(url, filename): Downloads a zip file from the given URL.

export_embedding(source_path, dest_folder): Saves a pretrained embeddings file locally.

get_cache_dir()

get_embedding(token): Returns the embedding for a user-supplied string, using the KeyedVectors object.

list_available_pretrained(): Prints a list of the pretrained embeddings provided by the package.

load_contextual(embeddings, labels, embedding_type='sentence') → ndarray

Loads embeddings from contextual models.

Parameters:

embeddings (various formats) –
Accepted types:
- numpy.ndarray
- torch.Tensor
- List[List[float]]
labels (list of str) – Labels corresponding to the embeddings, one per row.
embedding_type (str, default='sentence') –
- ‘sentence’: Sentence/document/passage embeddings
- ‘word_context’: Word embeddings in different contexts
- ‘word’: Word embeddings
- ‘custom’: User-defined

Returns:

Loaded embedding matrix (n_labels x dimension).

Return type:

np.ndarray
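
Because the result is an n_labels x dimension matrix, the input needs one label per embedding row and rows of equal length. The sketch below checks this for the List[List[float]] case; check_contextual_input is a hypothetical helper written for illustration, and the validation that load_contextual actually performs is not documented here.

```python
def check_contextual_input(embeddings, labels):
    """Hypothetical helper: verify a list-of-lists embedding matrix against its labels."""
    rows = [[float(x) for x in row] for row in embeddings]
    if len(rows) != len(labels):
        raise ValueError(f"{len(rows)} embeddings but {len(labels)} labels")
    dims = {len(row) for row in rows}
    if len(dims) != 1:
        raise ValueError("all embeddings must have the same dimension")
    # (n_labels, dimension), the shape of the matrix described under Returns.
    return len(rows), dims.pop()

shape = check_contextual_input(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    ["a cat sat", "a dog ran"],
)
```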

load_from_file(path: str, format: str) → ndarray

Loads word embeddings from a file in .txt, .vec, or .bin format.

Parameters:

path (str) – Path to the embedding file.
format (str) – Format of the embedding model: ‘word2vec’, ‘fasttext’, or ‘glove’.

Returns:

np.ndarray – Loaded embedding matrix.

Notes
-----
- GloVe files are first converted to word2vec format.
- FastText binary files are supported via Facebook’s native loader.
- Loaded tokens are stored in self.tokens.
- The embedding matrix is stored in self.embeddings.
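
For context on the GloVe note: the word2vec text format starts with a header line giving the vocabulary size and the vector dimension, which GloVe text files lack. The sketch below shows that conversion on in-memory lines. It is an illustration only; glove_to_word2vec_lines is a hypothetical name, and how wordviz performs the conversion internally is not shown in this reference.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vocab_size> <dimension>' header used by word2vec text files."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each GloVe row is a token followed by its space-separated components.
    dimension = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dimension}"] + rows

glove = ["cat 0.1 0.2 0.3\n", "dog 0.4 0.5 0.6\n"]
converted = glove_to_word2vec_lines(glove)
```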

load_pretrained(model: str, lang: str, source: str, dimension: str, save_file: bool = False, export_dir: str = None) → ndarray

Downloads and loads a pretrained embedding model from an online source.

Parameters:

model (str) – Name of the embedding model (‘word2vec’, ‘fasttext’, etc.).
lang (str) – Language code of the embedding (‘en’, ‘it’).
source (str) – Data source (‘wiki’, ‘cc’).
dimension (str or int) – Embedding dimensionality (e.g., ‘300’).
save_file (bool, default=False) – If True, saves the embedding to the specified export directory.
export_dir (str, optional) – Path to the directory where the file will be exported (used if save_file=True).

Returns:

Loaded embedding matrix (n_words x dimension).

Return type:

np.ndarray

subset(n: int = 1000, strategy: str = 'first', random_seed: int = None)

Create a subset of the current embeddings and tokens. Useful for speeding up visualizations or managing memory with large embedding spaces.

Parameters:

n (int, default=1000) – Number of embeddings to retain. If n exceeds the total number of available embeddings, all are retained.
strategy (str, default='first') –
Selection strategy:
- ‘first’: select the first n embeddings in original order.
- ‘random’: select n random embeddings.
random_seed (int, optional) – Seed for reproducible random sampling (only used if strategy is ‘random’).

Updates
-------
self.tokens_subset (list of str) – List of selected token strings.
self.embeddings_subset (np.ndarray) – Corresponding selected embedding vectors.
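
The two strategies can be sketched with the standard library as follows. This is an illustration of the documented behavior, not the wordviz source; take_subset is a hypothetical standalone function that returns the selection instead of storing it on the loader.

```python
import random

def take_subset(tokens, embeddings, n=1000, strategy="first", random_seed=None):
    """Hypothetical standalone version of the 'first' and 'random' strategies."""
    n = min(n, len(tokens))  # n larger than the vocabulary keeps everything
    if strategy == "first":
        idx = list(range(n))
    elif strategy == "random":
        rng = random.Random(random_seed)  # a fixed seed gives a reproducible sample
        idx = rng.sample(range(len(tokens)), n)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Tokens and vectors are selected with the same indices so they stay aligned.
    return [tokens[i] for i in idx], [embeddings[i] for i in idx]

tokens = ["a", "b", "c", "d"]
embeddings = [[0.0], [1.0], [2.0], [3.0]]
first_tokens, first_vectors = take_subset(tokens, embeddings, n=2)
sample_a = take_subset(tokens, embeddings, n=2, strategy="random", random_seed=7)
sample_b = take_subset(tokens, embeddings, n=2, strategy="random", random_seed=7)
```

With the same seed, sample_a and sample_b contain the same tokens in the same order.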

use_subset(n: int = 1000): Returns the embedding subset. If no subset exists yet, creates one of n tokens (1000 by default) and returns it.

wordviz.plotting module

class wordviz.plotting.BaseVisualizer(loader)

Bases: object

get_theme(theme='light1')

list_theme_colors(): Prints a list of the themes provided by the package.

map_colors(labels, theme): Automates color and legend-label mapping for clusters applied to embeddings.

select_sparse_labels(embeddings, n): Uses clustering to select n well-distributed labels to display.

class wordviz.plotting.Visualizer(loader)

Bases: BaseVisualizer

interactive_embeddings(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)

DEPRECATED: This method will be renamed to plot_interactive in a future release.

Parameters:

red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – Whether to display grid lines.
theme (str, default='light1') – Plot color theme.
title (str, optional) – Title of the plot. If None, no title is shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (plotly.graph_objects.Figure)

plot_clusters(n_clusters: int = 5, method: str = 'kmeans', red_method: str = 'auto', show_centers: bool = False, grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)

Creates a 2D scatterplot of clustered embeddings using a clustering algorithm.

Parameters:

n_clusters (int, default=5) – Number of clusters to generate.
method (str, default='kmeans') – Clustering method to use (‘kmeans’ or others supported by create_clusters).
red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
show_centers (bool, default=False) – If True, displays cluster centers on the plot.
grid (bool, default=True) – Whether to display grid lines.
theme (str, default='light1') – Plot color theme.
title (str, optional) – Title of the plot. If None, no title is shown.
nlabels (int, default=0) – Number of token labels to display on the plot.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)

plot_dendrogram(label_fontsize=10, grid=False, use_subset=False, n: int = 500)

Creates a 2D circular dendrogram of clustered embeddings using hierarchical clustering. This first version of the function does not include title and theme parameters. Adapted from code generated by Claude Sonnet 4.5.

    Z = linkage(reduced_emb, method='complete')
    clusters = fcluster(Z, t=n_clusters, criterion='maxclust')
    clusters_colors, legend_labels = self.map_colors(clusters)

    Z2 = dendrogram(Z, labels=tokens, no_plot=True)
    labels = [v[1] for v in legend_labels.values()]

    rt.plot(
        Z2,
        colorlabels={'cluster': clusters_colors},
        colorlabels_legend={'cluster': {
            'colors': clusters_colors,
            'labels': labels,
        }},
        fontsize=6,
    )

Parameters:

n_clusters (int, default=8) – Number of clusters to generate.
red_method (str, default='auto') – Dimensionality reduction method for better interpretability (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (matplotlib.figure.Figure)

plot_embeddings(red_method: str = 'auto', grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)

Creates a simple static 2D scatterplot of the embeddings.

Parameters:

red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply.
title (str, optional) – Title to display on the plot.
nlabels (int, default=0) – Number of word labels to display. If 0, no labels are shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)

plot_heatmap(use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)

Creates a heatmap showing every vector component of every word.

Parameters:

dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.
use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.
n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.
theme (str, default='light1') – Plot color theme to use.
title (str, optional) – Title for the heatmap. If None, a default title is assigned.

Returns:

fig

Return type:

plotly.graph_objects.Figure

plot_interactive(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)

Creates an interactive 2D scatterplot of embeddings using Plotly.

Parameters:

red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – Whether to display grid lines.
theme (str, default='light1') – Plot color theme.
title (str, optional) – Title of the plot. If None, no title is shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (plotly.graph_objects.Figure)

plot_similarity(target_word: str, dist: str = 'cosine', n: int = 10, red_method: str = 'pca', grid: bool = True, theme: str = 'light1', title: str = None)

Creates a scatterplot showing the most similar words to a target word.

Parameters:

target_word (str) – The word for which to find and plot the most similar words.
dist (str, default='cosine') – Distance metric to use when computing word similarity.
n (int, default=10) – Number of similar words to display.
red_method (str, default='pca') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.).
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply to the plot.
title (str, optional) – Title to display. If None, a default title will be generated.

Returns:

fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)

plot_similarity_heatmap(dist: str = 'cosine', use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)

Creates a heatmap showing pairwise distances between word embeddings.

Parameters:

dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.
use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.
n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.
theme (str, default='light1') – Plot color theme to use.
title (str, optional) – Title for the heatmap. If None, a default title is assigned.

Returns:

fig

Return type:

plotly.graph_objects.Figure
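
The data behind such a heatmap is a square, symmetric matrix with one row and one column per token. The sketch below builds one with the standard library. It uses Euclidean distance for brevity rather than the documented default of ‘cosine’, and pairwise_distances is a hypothetical name, not the wordviz function.

```python
import math

def pairwise_distances(vectors):
    """Return an n x n matrix of Euclidean distances between all pairs of vectors."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            # Distances are symmetric, so each pair is computed once and mirrored.
            matrix[i][j] = matrix[j][i] = d
    return matrix

matrix = pairwise_distances([(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)])
```

The cost grows quadratically with the number of tokens, which is one reason the method defaults to use_subset=True.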

plot_topography(red_method: str = 'auto', use_subset: bool = True, grid: bool = True, theme: str = 'light1', title: str = None)

Plots word embeddings as a topographical map, using dimensionality reduction to preserve word distances in the representation. This makes it possible to visualize word density in the space.

Parameters:

red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
use_subset (bool, default=True) – If True, uses a subset of the embeddings for visualization. Recommended for this plot when the embedding set is large.
grid (bool, default=True) – If True, shows grid lines on the plot.
theme (str, default='light1') – The plot theme to use, which controls the colors of the plot.
title (str, optional) – Title of the plot. If not provided, a default title is used.

Returns:

fig

Return type:

plotly.graph_objs.Figure

similarity_heatmap(dist: str = 'cosine', use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)

DEPRECATED: This method will be renamed to plot_similarity_heatmap in a future release.

Parameters:

dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.
use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.
n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.
theme (str, default='light1') – Plot color theme to use.
title (str, optional) – Title for the heatmap. If None, a default title is assigned.

Returns:

fig

Return type:

plotly.graph_objects.Figure

wordviz.plotting3d module

class wordviz.plotting3d.Visualizer3D(loader)

Bases: BaseVisualizer

plot_clusters(n_clusters=5, method='kmeans', red_method='auto', show_centers=False, grid=True, theme='light1', title=None, nlabels=0, use_subset=False)

Creates a 3D scatterplot of clustered embeddings using a clustering algorithm.

Parameters:

n_clusters (int, default=5) – Number of clusters to generate.
method (str, default='kmeans') – Clustering method to use (‘kmeans’ or others supported by create_clusters).
red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
show_centers (bool, default=False) – If True, displays cluster centers on the plot.
grid (bool, default=True) – Whether to display grid lines.
theme (str, default='light1') – Plot color theme.
title (str, optional) – Title of the plot. If None, no title is shown.
nlabels (int, default=0) – Number of token labels to display on the plot.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (plotly.graph_objects.Figure)

Notes:

In 3D plotting, Plotly tends to rely on the GPU to render a high number of elements and labels, so this function may not work properly with a full embedding set.

plot_embeddings(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)

Creates an interactive 3D scatterplot of embeddings using Plotly.

Parameters:

red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – Whether to display grid lines.
theme (str, default='light1') – Plot color theme.
title (str, optional) – Title of the plot. If None, no title is shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (plotly.graph_objects.Figure)

Notes:

In 3D plotting, Plotly tends to rely on the GPU to render a high number of elements and labels, so this function may not work properly with a full embedding set.

plot_similarity(target_word: str, dist: str = 'cosine', n: int = 10, red_method: str = 'pca', grid: bool = True, theme: str = 'light1', title: str = None)

Creates a dynamic 3D scatterplot showing the most similar words to a target word.

Parameters:

target_word (str) – The word for which to find and plot the most similar words.
dist (str, default='cosine') – Distance metric to use when computing word similarity.
n (int, default=10) – Number of similar words to display.
red_method (str, default='pca') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.).
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply to the plot.
title (str, optional) – Title to display. If None, a default title will be generated.

Returns:

fig

Return type:

plotly.graph_objects.Figure

plot_static(red_method: str = 'auto', grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)

Creates a simple static 3D scatterplot of the embeddings.

Parameters:

red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply.
title (str, optional) – Title to display on the plot.
nlabels (int, default=0) – Number of word labels to display. If 0, no labels are shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)

wordviz.similarity module

wordviz.similarity.compute_distances(X, metric='euclidean')

wordviz.similarity.n_most_similar(loader: EmbeddingLoader, target_word: str, dist: str = 'cosine', n: int = 10) → Tuple[List[str], ndarray, List[float]]

Finds the n most similar words to a given target word using a specified distance metric.

Parameters:

loader (EmbeddingLoader) – An instance of the embedding loader containing word vectors.
target_word (str) – The word for which to find the most similar neighbors.
dist (str, default='cosine') – The distance metric to use. Options include ‘cosine’, ‘euclidean’, etc.
n (int, default=10) – The number of most similar words to retrieve.

Returns:

words (list of str) – The most similar words found.
vectors (np.ndarray) – Embedding vectors corresponding to the most similar words.
distances (list of float) – Distances from the target word to each of the most similar words.
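
The lookup amounts to ranking every other token by its distance to the target and keeping the n closest. The sketch below does this with cosine distance on plain lists. It is an illustration under stated assumptions, not the wordviz implementation: nearest_words and cosine_distance are hypothetical names, and the real function takes an EmbeddingLoader rather than separate token and vector lists.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def nearest_words(tokens, vectors, target_word, n=10):
    """Return (words, vectors, distances) for the n tokens closest to target_word."""
    target = vectors[tokens.index(target_word)]
    scored = [
        (cosine_distance(target, vec), tok, vec)
        for tok, vec in zip(tokens, vectors)
        if tok != target_word  # the target itself is excluded from its neighbors
    ]
    scored.sort(key=lambda item: item[0])  # smallest distance first
    top = scored[:n]
    return [t for _, t, _ in top], [v for _, _, v in top], [d for d, _, _ in top]

tokens = ["king", "queen", "apple"]
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
words, vecs, dists = nearest_words(tokens, vectors, "king", n=2)
```

In this toy vocabulary "queen" ranks ahead of "apple" because its vector points in a direction closer to that of "king".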

wordviz.similarity.word_distance(loader: EmbeddingLoader, word1: str, word2: str, dist: str = 'cosine') → float

Computes the distance between two user-supplied words. Also supports sentence distance.

Parameters:

loader (EmbeddingLoader) – Object used to load the embeddings.
word1 (str) – First word.
word2 (str) – Second word.
dist (str, default='cosine') –
Type of distance to use:
- ‘braycurtis’
- ‘canberra’
- ‘chebyshev’
- ‘cosine’
- ‘dot’
- ‘euclidean’
- ‘manhattan’
- ‘pearson’

Returns:

distance

Return type:

float
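
Several of the listed metrics have standard textbook definitions, sketched below on plain Python sequences. This is an illustration only: distance and METRICS are hypothetical names, and the exact formulas wordviz uses (in particular for ‘dot’ and ‘pearson’, which are omitted here) are not given in this reference.

```python
import math

# Standard definitions; the cosine entry assumes non-zero vectors.
METRICS = {
    "euclidean": lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))),
    "manhattan": lambda u, v: sum(abs(a - b) for a, b in zip(u, v)),
    "chebyshev": lambda u, v: max(abs(a - b) for a, b in zip(u, v)),
    "cosine": lambda u, v: 1.0
    - sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v)),
}

def distance(u, v, dist="cosine"):
    """Look up the metric by name and apply it to two equal-length vectors."""
    if dist not in METRICS:
        raise ValueError(f"unsupported metric: {dist!r}")
    return METRICS[dist](u, v)

d_euclidean = distance((0.0, 0.0), (3.0, 4.0), "euclidean")
d_manhattan = distance((0.0, 0.0), (3.0, 4.0), "manhattan")
d_chebyshev = distance((0.0, 0.0), (3.0, 4.0), "chebyshev")
d_cosine = distance((1.0, 0.0), (0.0, 1.0))
```

The same pair of vectors gives a different value under each metric, so distances are only comparable when computed with the same dist setting.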

Module contents