wordviz package

Submodules

wordviz.clustering module

wordviz.clustering.create_clusters(vectors: ndarray, n_clusters: int = 5, method: str = 'kmeans') -> Tuple[ndarray, ndarray | None, ndarray]

Performs clustering on embeddings for visualization purposes. If the input vectors have more than 2 dimensions, dimensionality reduction is applied first.

Parameters:

vectors : np.ndarray

Array of embeddings to cluster.

n_clusters : int, default=5

Number of clusters to generate (used only for k-means).

method : str, default='kmeans'

Clustering method to use ('kmeans' or 'dbscan').

Returns:

labels : np.ndarray

Cluster labels assigned to each vector.

centers : np.ndarray or None

Coordinates of cluster centers (only for k-means; None for DBSCAN).

reduced_emb : np.ndarray

2D reduced embeddings used for clustering and plotting.
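As a hedged illustration of how the returned labels and centers might be consumed, the sketch below counts cluster sizes and copes with centers being None. The helper name summarize_clusters is hypothetical and not part of wordviz. Treating the label -1 as noise is an assumption borrowed from the common DBSCAN convention; this reference does not say how wordviz labels noise points.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def summarize_clusters(
    labels: Sequence[int],
    centers: Optional[Sequence[Sequence[float]]],
) -> Dict[str, object]:
    """Summarize a (labels, centers) pair shaped like the documented return.

    Hypothetical helper for illustration only.
    """
    # Assumption: -1 marks noise points (common DBSCAN convention).
    sizes = Counter(label for label in labels if label != -1)
    return {
        "sizes": dict(sizes),
        "n_noise": sum(1 for label in labels if label == -1),
        # centers is None for methods without centroids (e.g. dbscan).
        "has_centers": centers is not None,
    }


# Made-up inputs standing in for k-means and DBSCAN results.
kmeans_like = summarize_clusters(
    [0, 0, 1, 2, 1], [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
)
dbscan_like = summarize_clusters([0, 0, -1, 1, 1], None)
```

Checking centers against None before plotting keeps the same calling code usable for both clustering methods.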

wordviz.dim_reduction module

wordviz.dim_reduction.reduce_dim(vectors: ndarray, method: str = 'pca', n_dimensions: int = 2, dist: str = 'euclidean', **kwargs) -> ndarray

Reduces the dimensionality of the input vectors for visualization, using the specified method.

Parameters:

vectors : array

Input high-dimensional data to reduce.

method : str, default='pca'

Dimensionality reduction algorithm to apply. Options are:

  • 'pca': Principal Component Analysis

  • 'tsne': t-Distributed Stochastic Neighbor Embedding

  • 'umap': Uniform Manifold Approximation and Projection

  • 'isomap': Isometric Mapping

  • 'mds': Multidimensional Scaling

n_dimensions : int, default=2

Number of output dimensions. Must be 2 or 3.

dist : str, default='euclidean'

Distance metric for methods that require a distance matrix (MDS). Ignored for other methods.

**kwargs : dict

Additional parameters passed to the selected dimensionality reduction algorithm.

Returns:

np.ndarray

Embeddings reduced to the specified number of dimensions.
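The documented constraints on method and n_dimensions can be checked before calling the function. The following is a minimal sketch, assuming a caller wants to fail early. The helper check_reduce_args and the constant SUPPORTED_METHODS are hypothetical, and this reference does not say how reduce_dim itself reacts to invalid arguments.

```python
SUPPORTED_METHODS = {"pca", "tsne", "umap", "isomap", "mds"}


def check_reduce_args(method: str = "pca", n_dimensions: int = 2) -> None:
    """Raise ValueError for arguments outside the documented options.

    Hypothetical pre-check, not part of wordviz.
    """
    if method not in SUPPORTED_METHODS:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(SUPPORTED_METHODS)}"
        )
    if n_dimensions not in (2, 3):
        raise ValueError("n_dimensions must be 2 or 3")


# A valid combination passes silently.
check_reduce_args("umap", 3)
```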

wordviz.loading module

class wordviz.loading.EmbeddingLoader

Bases: object

Loads word or sentence embeddings.

embeddings_raw

Static embeddings stored in KeyedVectors format.

Type:

Any

embeddings

Array of embeddings.

Type:

np.ndarray

tokens

Natural-language items that the embeddings represent (words, sentences, or other elements to visualize).

Type:

list of str

Type:

list of str

dimension

Dimensionality of the embeddings.

Type:

int

type

Type of embedding:

  • 'word': Word embeddings

  • 'sentence': Sentence/document/passage embeddings

  • 'word_context': Word embeddings in different contexts

  • 'custom': User-defined

Type:

str

download_zip(url, filename)

Downloads a zip file from a URL.

export_embedding(source_path, dest_folder)

Saves a pretrained embeddings file locally.

get_cache_dir()

get_embedding(token)

Returns the embedding for a user-supplied string, looked up through the KeyedVectors object.

list_available_pretrained()

Prints a list of the pretrained embeddings provided by the package.

load_contextual(embeddings, labels, embedding_type='sentence') -> ndarray

Loads embeddings from contextual models.

Parameters:
  • embeddings (various formats) – Embeddings to load, in one of the following formats:

    • numpy.ndarray

    • torch.Tensor

    • List[List[float]]

  • labels (list of str) – Labels corresponding to the embeddings.

  • embedding_type (str) – Type of embedding:

    • 'sentence': Sentence/document/passage embeddings

    • 'word_context': Word embeddings in different contexts

    • 'word': Word embeddings

    • 'custom': User-defined

Returns:

Loaded embedding matrix (n_labels x dimension).

Return type:

np.ndarray
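The three accepted input formats can be normalized before checking that each embedding has a label. The sketch below is one possible approach, not the wordviz implementation. The helper to_rows is hypothetical, and it relies on the fact that both numpy.ndarray and torch.Tensor expose a tolist() method.

```python
from typing import Any, List, Sequence


def to_rows(embeddings: Any, labels: Sequence[str]) -> List[List[float]]:
    """Convert supported inputs to float rows and validate the shape.

    Hypothetical helper; wordviz may normalize its inputs differently.
    """
    # numpy.ndarray and torch.Tensor both provide .tolist(); plain lists do not.
    raw = embeddings.tolist() if hasattr(embeddings, "tolist") else list(embeddings)
    rows = [[float(x) for x in row] for row in raw]
    if len(rows) != len(labels):
        raise ValueError(f"{len(rows)} embeddings but {len(labels)} labels")
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all embeddings must have the same dimension")
    return rows


# Two made-up sentence embeddings of dimension 3.
matrix = to_rows([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], ["a cat", "a dog"])
shape = (len(matrix), len(matrix[0]))  # (n_labels, dimension)
```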

load_from_file(path: str, format: str) -> ndarray

Loads word embeddings from a file in .txt, .vec, or .bin format.

Parameters:
  • path (str) – Path to the embedding file.

  • format (str) – Format of the embedding model: ‘word2vec’, ‘fasttext’, or ‘glove’.

Returns:

Loaded embedding matrix.

Return type:

np.ndarray

Notes:

  • GloVe files are first converted to word2vec format.

  • FastText binary files are supported via Facebook's native loader.

  • Loaded tokens are stored in self.tokens.

  • The embedding matrix is stored in self.embeddings.
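The GloVe note above refers to a small format difference: the word2vec text format starts with a header line giving the vocabulary size and the vector dimension, which GloVe text files lack. The sketch below illustrates that conversion on in-memory lines. The helper glove_to_word2vec_lines is hypothetical, and this reference does not describe how wordviz performs the conversion internally.

```python
from typing import List


def glove_to_word2vec_lines(glove_lines: List[str]) -> List[str]:
    """Prepend the word2vec text header to GloVe-style lines.

    Hypothetical helper for illustration; assumes tokens contain no spaces.
    """
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each row is "<token> <v1> <v2> ...", so dimension = field count - 1.
    dimension = len(rows[0].split()) - 1
    return [f"{len(rows)} {dimension}"] + rows


converted = glove_to_word2vec_lines(["cat 0.1 0.2 0.3\n", "dog 0.4 0.5 0.6\n"])
```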

load_pretrained(model: str, lang: str, source: str, dimension: str, save_file: bool = False, export_dir: str = None) -> ndarray

Downloads and loads a pretrained embedding model from an online source.

Parameters:
  • model (str) – Name of the embedding model (‘word2vec’, ‘fasttext’, etc.).

  • lang (str) – Language code of the embedding (‘en’, ‘it’).

  • source (str) – Data source (‘wiki’, ‘cc’).

  • dimension (str or int) – Embedding dimensionality (e.g., ‘300’).

  • save_file (bool, default=False) – If True, saves the embedding to the specified export directory.

  • export_dir (str, optional) – Path to the directory where the file will be exported (used if save_file=True).

Returns:

Loaded embedding matrix (n_words x dimension).

Return type:

np.ndarray

subset(n: int = 1000, strategy: str = 'first', random_seed: int = None)

Create a subset of the current embeddings and tokens. Useful for speeding up visualizations or managing memory with large embedding spaces.

Parameters:
  • n (int, default=1000) – Number of embeddings to retain. If n exceeds the total number of available embeddings, all are retained.

  • strategy (str, default='first') –

    Selection strategy:
    • 'first': select the first n embeddings in original order.

    • 'random': select n random embeddings.

  • random_seed (int, optional) – Seed for reproducible random sampling (only used if strategy is ‘random’).

Updates:

  • self.tokens_subset (list of str) – List of selected token strings.

  • self.embeddings_subset (np.ndarray) – Corresponding selected embedding vectors.
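The two selection strategies can be sketched in a few lines. The function take_subset below is a hypothetical stand-in that returns the selection instead of storing it on the loader. Whether wordviz keeps randomly sampled items in their original order is not stated in this reference, so the ordering in the 'random' branch is an assumption.

```python
import random
from typing import List, Optional, Sequence, Tuple


def take_subset(
    tokens: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    n: int = 1000,
    strategy: str = "first",
    random_seed: Optional[int] = None,
) -> Tuple[List[str], List[Sequence[float]]]:
    """Select n aligned (token, embedding) pairs. Illustrative sketch only."""
    n = min(n, len(tokens))  # asking for more than exists keeps everything
    if strategy == "first":
        indices = list(range(n))
    elif strategy == "random":
        # A dedicated Random instance leaves the global RNG state untouched.
        indices = random.Random(random_seed).sample(range(len(tokens)), n)
    else:
        raise ValueError(f"unknown strategy {strategy!r}")
    # Index both sequences with the same indices so pairs stay aligned.
    return [tokens[i] for i in indices], [embeddings[i] for i in indices]


tokens = ["a", "b", "c", "d"]
vectors = [[0.0], [1.0], [2.0], [3.0]]
first_tokens, first_vectors = take_subset(tokens, vectors, n=2)
```

Passing the same random_seed twice yields the same selection, which is what makes a sampled plot reproducible.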

use_subset(n: int = 1000)

Returns the embedding subset. If no subset exists yet, creates one of n embeddings (1000 by default) and returns it.

wordviz.plotting module

class wordviz.plotting.BaseVisualizer(loader)

Bases: object

get_theme(theme='light1')

list_theme_colors()

Prints a list of the available themes provided by the package.

map_colors(labels, theme)

Automates color and legend-label mapping for clusters of embeddings.

select_sparse_labels(embeddings, n)

Uses clustering to select n well-distributed labels to display.

class wordviz.plotting.Visualizer(loader)

Bases: BaseVisualizer

interactive_embeddings(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)

DEPRECATED: This method will be renamed to plot_interactive in a future release.

Parameters:

red_method : str, default='auto'

Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

grid : bool, default=True

Whether to display grid lines.

theme : str, default='light1'

Plot color theme.

title : str, optional

Title of the plot. If None, no title is shown.

use_subset : bool, default=False

If True, uses the embedding subset instead of the full embeddings.

Returns:

fig : plotly.graph_objects.Figure

plot_clusters(n_clusters: int = 5, method: str = 'kmeans', red_method: str = 'auto', show_centers: bool = False, grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)

Creates a 2D scatterplot of clustered embeddings using a clustering algorithm.

Parameters:

n_clusters : int, default=5

Number of clusters to generate.

method : str, default='kmeans'

Clustering method to use ('kmeans' or others supported by create_clusters).

red_method : str, default='auto'

Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

show_centers : bool, default=False

If True, displays cluster centers on the plot.

grid : bool, default=True

Whether to display grid lines.

theme : str, default='light1'

Plot color theme.

title : str, optional

Title of the plot. If None, no title is shown.

nlabels : int, default=0

Number of token labels to display on the plot.

use_subset : bool, default=False

If True, uses the embedding subset instead of the full embeddings.

Returns:

fig : matplotlib.figure.Figure

ax : matplotlib.axes.Axes

plot_dendrogram(label_fontsize=10, grid=False, use_subset=False, n: int = 500)

Creates a 2D circular dendrogram of clustered embeddings using hierarchical clustering. This first version of the function does not include title and theme parameters. Adapted from code generated by Claude Sonnet 4.5.

    Z = linkage(reduced_emb, method='complete')
    clusters = fcluster(Z, t=n_clusters, criterion='maxclust')
    clusters_colors, legend_labels = self.map_colors(clusters)

    Z2 = dendrogram(Z, labels=tokens, no_plot=True)
    labels = [v[1] for v in legend_labels.values()]

    rt.plot(
        Z2,
        colorlabels={'cluster': clusters_colors},
        colorlabels_legend={'cluster': {
            'colors': clusters_colors,
            'labels': labels,
        }},
        fontsize=6,
    )

Parameters:

n_clusters : int, default=8

Number of clusters to generate.

red_method : str, default='auto'

Dimensionality reduction method for better interpretability ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

use_subset : bool, default=False

If True, uses the embedding subset instead of the full embeddings.

Returns:

fig : matplotlib.figure.Figure

plot_embeddings(red_method: str = 'auto', grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)

Creates a simple static 2D scatterplot of the embeddings.

Parameters:
  • red_method (str, default='auto') – Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

  • grid (bool, default=True) – If True, displays a background grid on the plot.

  • theme (str, default='light1') – Color theme to apply.

  • title (str, optional) – Title to display on the plot.

  • nlabels (int, default=0) – Number of word labels to display. If 0, no labels are shown.

  • use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

  • fig (matplotlib.figure.Figure)

  • ax (matplotlib.axes.Axes)

plot_heatmap(use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)

Creates a heatmap showing every vector component of every word.

Parameters:
  • dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.

  • use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.

  • n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.

  • theme (str, default='light1') – Plot color theme to use.

  • title (str, optional) – Title for the heatmap. If None, a default title is assigned.

Returns:

fig

Return type:

plotly.graph_objects.Figure

plot_interactive(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)

Creates an interactive 2D scatterplot of embeddings using Plotly.

Parameters:

red_method : str, default='auto'

Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

grid : bool, default=True

Whether to display grid lines.

theme : str, default='light1'

Plot color theme.

title : str, optional

Title of the plot. If None, no title is shown.

use_subset : bool, default=False

If True, uses the embedding subset instead of the full embeddings.

Returns:

fig : plotly.graph_objects.Figure

plot_similarity(target_word: str, dist: str = 'cosine', n: int = 10, red_method: str = 'pca', grid: bool = True, theme: str = 'light1', title: str = None)

Creates a scatterplot showing the most similar words to a target word.

Parameters:
  • target_word (str) – The word for which to find and plot the most similar words.

  • dist (str, default='cosine') – Distance metric to use when computing word similarity.

  • n (int, default=10) – Number of similar words to display.

  • red_method (str, default='pca') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.).

  • grid (bool, default=True) – If True, displays a background grid on the plot.

  • theme (str, default='light1') – Color theme to apply to the plot.

  • title (str, optional) – Title to display. If None, a default title will be generated.

Returns:

  • fig (matplotlib.figure.Figure)

  • ax (matplotlib.axes.Axes)

plot_similarity_heatmap(dist: str = 'cosine', use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)

Creates a heatmap showing pairwise distances between word embeddings.

Parameters:
  • dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.

  • use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.

  • n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.

  • theme (str, default='light1') – Plot color theme to use.

  • title (str, optional) – Title for the heatmap. If None, a default title is assigned.

Returns:

fig

Return type:

plotly.graph_objects.Figure
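The quantity such a heatmap displays is a square matrix of pairwise distances. As a rough sketch, assuming cosine distance defined as one minus cosine similarity and non-zero input vectors, the matrix could be computed as follows. The function names are illustrative and are not part of wordviz.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build the symmetric matrix whose entry [i][j] is the distance i -> j."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Three toy 2D vectors: two orthogonal axes and their diagonal.
matrix = pairwise_distances([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The matrix grows quadratically with the number of embeddings, which is consistent with use_subset defaulting to True for this plot.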

plot_topography(red_method: str = 'auto', use_subset: bool = True, grid: bool = True, theme: str = 'light1', title: str = None)

Plots word embeddings as a topographical map, using dimensionality reduction to preserve word distances in the representation. This makes it possible to visualize word density in the space.

Parameters:
  • red_method (str, default='auto') – Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

  • use_subset (bool, default=True) – If True, uses a subset of the embeddings for visualization. This is recommended in this plot for larger embeddings.

  • grid (bool, default=True) – If True, shows grid lines on the plot.

  • theme (str, default='light1') – The plot theme to use, which controls the colors of the plot.

  • title (str, optional) – Title of the plot. If not provided, a default title is used.

Returns:

fig

Return type:

plotly.graph_objects.Figure

similarity_heatmap(dist: str = 'cosine', use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)

DEPRECATED: This method will be renamed to plot_similarity_heatmap in a future release.

Parameters:
  • dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.

  • use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.

  • n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.

  • theme (str, default='light1') – Plot color theme to use.

  • title (str, optional) – Title for the heatmap. If None, a default title is assigned.

Returns:

fig

Return type:

plotly.graph_objects.Figure

wordviz.plotting3d module

class wordviz.plotting3d.Visualizer3D(loader)

Bases: BaseVisualizer

plot_clusters(n_clusters=5, method='kmeans', red_method='auto', show_centers=False, grid=True, theme='light1', title=None, nlabels=0, use_subset=False)

Creates a 3D scatterplot of clustered embeddings using a clustering algorithm.

Parameters:

n_clusters : int, default=5

Number of clusters to generate.

method : str, default='kmeans'

Clustering method to use ('kmeans' or others supported by create_clusters).

red_method : str, default='auto'

Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

show_centers : bool, default=False

If True, displays cluster centers on the plot.

grid : bool, default=True

Whether to display grid lines.

theme : str, default='light1'

Plot color theme.

title : str, optional

Title of the plot. If None, no title is shown.

nlabels : int, default=0

Number of token labels to display on the plot.

use_subset : bool, default=False

If True, uses the embedding subset instead of the full embeddings.

Returns:

fig : plotly.graph_objects.Figure

Notes:

For 3D plots, Plotly tends to rely on the GPU when rendering a large number of points and labels, so this function may not work properly with a full embedding set.

plot_embeddings(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)

Creates an interactive 3D scatterplot of embeddings using Plotly.

Parameters:

red_method : str, default='auto'

Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

grid : bool, default=True

Whether to display grid lines.

theme : str, default='light1'

Plot color theme.

title : str, optional

Title of the plot. If None, no title is shown.

use_subset : bool, default=False

If True, uses the embedding subset instead of the full embeddings.

Returns:

fig : plotly.graph_objects.Figure

Notes:

For 3D plots, Plotly tends to rely on the GPU when rendering a large number of points and labels, so this function may not work properly with a full embedding set.

plot_similarity(target_word: str, dist: str = 'cosine', n: int = 10, red_method: str = 'pca', grid: bool = True, theme: str = 'light1', title: str = None)

Creates a dynamic 3D scatterplot showing the most similar words to a target word.

Parameters:
  • target_word (str) – The word for which to find and plot the most similar words.

  • dist (str, default='cosine') – Distance metric to use when computing word similarity.

  • n (int, default=10) – Number of similar words to display.

  • red_method (str, default='pca') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.).

  • grid (bool, default=True) – If True, displays a background grid on the plot.

  • theme (str, default='light1') – Color theme to apply to the plot.

  • title (str, optional) – Title to display. If None, a default title will be generated.

Returns:

fig

Return type:

plotly.graph_objects.Figure

plot_static(red_method: str = 'auto', grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)

Creates a simple static 3D scatterplot of the embeddings.

Parameters:
  • red_method (str, default='auto') – Dimensionality reduction method to apply ('pca', 'tsne', 'umap', etc.). If 'auto', looks for a cached reduction; if None, runs PCA.

  • grid (bool, default=True) – If True, displays a background grid on the plot.

  • theme (str, default='light1') – Color theme to apply.

  • title (str, optional) – Title to display on the plot.

  • nlabels (int, default=0) – Number of word labels to display. If 0, no labels are shown.

  • use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.

Returns:

  • fig (matplotlib.figure.Figure)

  • ax (matplotlib.axes.Axes)

wordviz.similarity module

wordviz.similarity.compute_distances(X, metric='euclidean')

wordviz.similarity.n_most_similar(loader: EmbeddingLoader, target_word: str, dist: str = 'cosine', n: int = 10) -> Tuple[List[str], ndarray, List[float]]

Finds the n words most similar to a given target word using a specified distance metric.

Parameters:
  • loader (EmbeddingLoader) – An instance of the embedding loader containing word vectors.

  • target_word (str) – The word for which to find the most similar neighbors.

  • dist (str, default='cosine') – The distance metric to use. Options include ‘cosine’, ‘euclidean’, etc.

  • n (int, default=10) – The number of most similar words to retrieve.

Returns:

  • words (list of str) – The most similar words found.

  • vectors (np.ndarray) – Embedding vectors corresponding to the most similar words.

  • distances (list of float) – Distances from the target word to each of the most similar words.
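The three return values line up index by index: the i-th word, its vector, and its distance from the target. A minimal sketch of that nearest-neighbour lookup is shown below. It works on a plain dictionary rather than an EmbeddingLoader, defaults to Euclidean distance via math.dist for brevity (the documented default is 'cosine'), and excludes the target word from its own neighbours, which is an assumption. The name n_nearest is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def n_nearest(
    vocab: Dict[str, Vector],
    target: str,
    n: int = 10,
    metric: Callable[[Vector, Vector], float] = math.dist,
) -> Tuple[List[str], List[Vector], List[float]]:
    """Return (words, vectors, distances) for the n closest words.

    Illustrative stand-in; not the wordviz implementation.
    """
    target_vec = vocab[target]
    # Assumption: the target itself is excluded from its own neighbours.
    scored = sorted(
        (metric(target_vec, vec), word)
        for word, vec in vocab.items()
        if word != target
    )[:n]
    words = [word for _, word in scored]
    return words, [vocab[w] for w in words], [d for d, _ in scored]


# Toy 2D vocabulary with made-up coordinates.
vocab = {
    "king": [1.0, 1.0],
    "queen": [1.0, 0.9],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.0],
}
words, vectors, distances = n_nearest(vocab, "king", n=2)
```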

wordviz.similarity.word_distance(loader: EmbeddingLoader, word1: str, word2: str, dist: str = 'cosine') -> float

Computes the distance between two user-supplied words. Also supports sentence distance.

Parameters:
  • loader (EmbeddingLoader) – Object used to load the embeddings.

  • word1 (str) – First word of the pair.

  • word2 (str) – Second word of the pair.

  • dist (str, default='cosine') – Type of distance to use:

    • 'braycurtis'

    • 'canberra'

    • 'chebyshev'

    • 'cosine'

    • 'dot'

    • 'euclidean'

    • 'manhattan'

    • 'pearson'

Returns:

distance

Return type:

float
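For readers unfamiliar with the less common metric names, the sketch below spells out three of them using their standard definitions: Manhattan is the sum of absolute differences, Chebyshev is the largest absolute difference, and Bray-Curtis is the sum of absolute differences divided by the sum of absolute sums. The METRICS table and the vector_distance helper are hypothetical and operate on raw vectors. How wordviz dispatches on dist internally is not described in this reference.

```python
from typing import Callable, Dict, Sequence

Vector = Sequence[float]

# Name -> function table; only three of the documented options are sketched.
METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "manhattan": lambda u, v: sum(abs(a - b) for a, b in zip(u, v)),
    "chebyshev": lambda u, v: max(abs(a - b) for a, b in zip(u, v)),
    "braycurtis": lambda u, v: (
        sum(abs(a - b) for a, b in zip(u, v))
        / sum(abs(a + b) for a, b in zip(u, v))
    ),
}


def vector_distance(u: Vector, v: Vector, dist: str = "manhattan") -> float:
    """Look up the metric by name and apply it to two vectors."""
    try:
        metric = METRICS[dist]
    except KeyError:
        raise ValueError(f"unsupported distance {dist!r}") from None
    return metric(u, v)


u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 3.0]
manhattan = vector_distance(u, v, "manhattan")
chebyshev = vector_distance(u, v, "chebyshev")
braycurtis = vector_distance(u, v, "braycurtis")
```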
