wordviz package
Submodules
wordviz.clustering module
- wordviz.clustering.create_clusters(vectors: ndarray, n_clusters: int = 5, method: str = 'kmeans') -> Tuple[ndarray, ndarray | None, ndarray]
Performs clustering on embeddings for visualization purposes. If the input vectors have more than 2 dimensions, dimensionality reduction is applied first.
Parameters:
- vectors : np.ndarray
Array of embeddings to cluster.
- n_clusters : int, default=5
Number of clusters to generate (used only for k-means).
- method : str, default=’kmeans’
Clustering method to use (‘kmeans’ or ‘dbscan’).
Returns:
- labels : np.ndarray
Cluster labels assigned to each vector.
- centers : np.ndarray or None
Coordinates of cluster centers (only for k-means; None for dbscan).
- reduced_emb : np.ndarray
2D reduced embeddings used for clustering and plotting.
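The labels and centers returned for k-means can be pictured with a toy version of Lloyd's algorithm. This is only a sketch in pure Python: the helper name kmeans_2d, the fixed seeding from the first points, and the iteration cap are assumptions made for illustration, and the source does not state which implementation wordviz uses internally.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans_2d(
    points: Sequence[Point], n_clusters: int, n_iter: int = 20
) -> Tuple[List[int], List[Point]]:
    """Toy Lloyd's algorithm (hypothetical helper, not the wordviz code)."""
    if not 0 < n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    # Assumption: seed the centers with the first n_clusters points.
    centers = [tuple(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point takes the label of its nearest center.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers
```

A density-based method such as DBSCAN has no centroids, which is consistent with centers being None in that case.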
wordviz.dim_reduction module
- wordviz.dim_reduction.reduce_dim(vectors: ndarray, method: str = 'pca', n_dimensions: int = 2, dist: str = 'euclidean', **kwargs) -> ndarray
Reduces the dimensionality of the input vectors for visualization, using the specified method.
Parameters:
- vectors : array
Input high-dimensional data to reduce.
- method : str, default=’pca’
Dimensionality reduction algorithm to apply. Options are:
  - ‘pca’: Principal Component Analysis
  - ‘tsne’: t-Distributed Stochastic Neighbor Embedding
  - ‘umap’: Uniform Manifold Approximation and Projection
  - ‘isomap’: Isometric Mapping
  - ‘mds’: Multidimensional Scaling
- n_dimensions : int, default=2
Number of dimensions for output. Must be 2 or 3.
- dist : str, default=’euclidean’
Distance metric for methods that require a distance matrix (MDS). Ignored for other methods.
- **kwargs : dict
Additional parameters passed to the selected dimensionality reduction algorithm.
Returns:
- np.ndarray
Embeddings reduced to the specified number of dimensions.
wordviz.loading module
- class wordviz.loading.EmbeddingLoader
Bases: object
Loads word or sentence embeddings.
- embeddings_raw
KeyedVectors format for static embeddings
- Type:
Any
- embeddings
Array of embeddings
- Type:
np.ndarray
- tokens
Representative elements for the embeddings in natural language (words, sentences, or other elements to visualize)
- Type:
list of str
- dimension
Dimensionality of the embeddings.
- Type:
int
- type
Type of embedding:
  - ‘word’: Word embeddings
  - ‘sentence’: Sentence/document/passage embeddings
  - ‘word_context’: Word embeddings in different contexts
  - ‘custom’: User-defined
- Type:
str
- get_embedding(token)
Returns the embedding for a string given by the user, looked up through the KeyedVectors object.
- load_contextual(embeddings, labels, embedding_type='sentence') -> ndarray
Loads embeddings from contextual models.
- Parameters:
embeddings (various formats) – Accepted formats:
  numpy.ndarray
  torch.Tensor
  List[List[float]]
labels (list of str) – Labels corresponding to the embeddings.
embedding_type (str) – Type of embedding:
  ‘sentence’: Sentence/document/passage embeddings
  ‘word_context’: Word embeddings in different contexts
  ‘word’: Word embeddings
  ‘custom’: User-defined
- Returns:
Loaded embedding matrix (n_labels x dimension).
- Return type:
np.ndarray
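The input contract described above (several accepted formats, one label per row) can be sketched as a small normalization step. The helper to_matrix and its error messages are hypothetical and only illustrate the kind of checks such a loader might perform. The sketch relies on the fact that both numpy.ndarray and torch.Tensor expose a tolist() method.

```python
from typing import Any, List, Sequence

VALID_TYPES = {"sentence", "word_context", "word", "custom"}


def to_matrix(
    embeddings: Any, labels: Sequence[str], embedding_type: str = "sentence"
) -> List[List[float]]:
    """Hypothetical helper: normalize embeddings to a list-of-lists matrix."""
    if embedding_type not in VALID_TYPES:
        raise ValueError(f"unknown embedding_type: {embedding_type!r}")
    # Array-like objects (NumPy arrays, torch tensors) provide tolist();
    # plain nested lists are used as they are.
    rows = embeddings.tolist() if hasattr(embeddings, "tolist") else embeddings
    matrix = [[float(x) for x in row] for row in rows]
    if len(matrix) != len(labels):
        raise ValueError("expected exactly one label per embedding row")
    if len({len(row) for row in matrix}) > 1:
        raise ValueError("all embeddings must share the same dimension")
    return matrix
```

The resulting matrix has one row per label, matching the documented n_labels x dimension shape.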
- load_from_file(path: str, format: str) -> ndarray
Loads word embeddings from a file in .txt, .vec, or .bin format.
- Parameters:
path (str) – Path to the embedding file.
format (str) – Format of the embedding model: ‘word2vec’, ‘fasttext’, or ‘glove’.
- Returns:
np.ndarray – Loaded embedding matrix.
Notes
-----
- GloVe files are first converted to word2vec format.
- FastText binary files are supported via Facebook’s native loader.
- Loaded tokens are stored in self.tokens.
- Embedding matrix is stored in self.embeddings.
- load_pretrained(model: str, lang: str, source: str, dimension: str, save_file: bool = False, export_dir: str = None) -> ndarray
Downloads and loads a pretrained embedding model from an online source.
- Parameters:
model (str) – Name of the embedding model (‘word2vec’, ‘fasttext’, etc.).
lang (str) – Language code of the embedding (‘en’, ‘it’).
source (str) – Data source (‘wiki’, ‘cc’).
dimension (str or int) – Embedding dimensionality (e.g., ‘300’).
save_file (bool, default=False) – If True, saves the embedding to the specified export directory.
export_dir (str, optional) – Path to the directory where the file will be exported (used if save_file=True).
- Returns:
Loaded embedding matrix (n_words x dimension).
- Return type:
np.ndarray
- subset(n: int = 1000, strategy: str = 'first', random_seed: int = None)
Create a subset of the current embeddings and tokens. Useful for speeding up visualizations or managing memory with large embedding spaces.
- Parameters:
n (int, default=1000) – Number of embeddings to retain. If n exceeds the total number of available embeddings, all are retained.
strategy (str, default='first') – Selection strategy:
  ‘first’: select the first n embeddings in original order.
  ‘random’: select n random embeddings.
random_seed (int, optional) – Seed for reproducible random sampling (only used if strategy is ‘random’).
Updates
--------
self.tokens_subset (list of str) – List of selected token strings.
self.embeddings_subset (np.ndarray) – Corresponding selected embedding vectors.
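The selection rules above can be sketched in pure Python. The function take_subset is a hypothetical stand-in, not the wordviz method: it returns the selection instead of storing it on the loader, and sorting the random indices to keep the original order is an assumption of this sketch.

```python
import random
from typing import List, Optional, Sequence, Tuple


def take_subset(
    tokens: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    n: int = 1000,
    strategy: str = "first",
    random_seed: Optional[int] = None,
) -> Tuple[List[str], List[Sequence[float]]]:
    """Hypothetical helper mirroring the documented selection rules."""
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and embeddings must have the same length")
    n = min(n, len(tokens))  # n larger than the vocabulary keeps everything
    if strategy == "first":
        indices = list(range(n))
    elif strategy == "random":
        rng = random.Random(random_seed)  # a fixed seed gives a repeatable draw
        # Assumption: keep the sampled items in their original order.
        indices = sorted(rng.sample(range(len(tokens)), n))
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [tokens[i] for i in indices], [embeddings[i] for i in indices]
```

Calling the sketch twice with the same random_seed yields the same selection, which is the reproducibility the parameter is documented to provide.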
wordviz.plotting module
- class wordviz.plotting.BaseVisualizer(loader)
Bases: object
- class wordviz.plotting.Visualizer(loader)
Bases: BaseVisualizer
- interactive_embeddings(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)
DEPRECATED: This method will be renamed to plot_interactive in a future release.
Parameters:
- red_method : str, default=’auto’
Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
- grid : bool, default=True
Whether to display grid lines.
- theme : str, default=’light1’
Plot color theme.
- title : str, optional
Title of the plot. If None, no title is shown.
- use_subset : bool, default=False
If True, uses the embedding subset instead of the full embeddings.
Returns:
fig : plotly.graph_objects.Figure
- plot_clusters(n_clusters: int = 5, method: str = 'kmeans', red_method: str = 'auto', show_centers: bool = False, grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)
Creates a 2D scatterplot of clustered embeddings using a clustering algorithm.
Parameters:
- n_clusters : int, default=5
Number of clusters to generate.
- method : str, default=’kmeans’
Clustering method to use (‘kmeans’ or others supported by create_clusters).
- red_method : str, default=’auto’
Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
- show_centers : bool, default=False
If True, displays cluster centers on the plot.
- grid : bool, default=True
Whether to display grid lines.
- theme : str, default=’light1’
Plot color theme.
- title : str, optional
Title of the plot. If None, no title is shown.
- nlabels : int, default=0
Number of token labels to display on the plot.
- use_subset : bool, default=False
If True, uses the embedding subset instead of the full embeddings.
Returns:
fig : matplotlib.figure.Figure
ax : matplotlib.axes.Axes
- plot_dendrogram(label_fontsize=10, grid=False, use_subset=False, n: int = 500)
Creates a 2D circular dendrogram of clustered embeddings using hierarchical clustering. This first version of the function does not include title and theme parameters. Adapted from code generated by Claude Sonnet 4.5.
    Z = linkage(reduced_emb, method='complete')
    clusters = fcluster(Z, t=n_clusters, criterion='maxclust')
    clusters_colors, legend_labels = self.map_colors(clusters)
    Z2 = dendrogram(Z, labels=tokens, no_plot=True)
    labels = [v[1] for v in legend_labels.values()]
    rt.plot(
        Z2,
        colorlabels={'cluster': clusters_colors},
        colorlabels_legend={'cluster': {'colors': clusters_colors, 'labels': labels}},
        fontsize=6,
    )
Parameters:
- n_clusters : int, default=8
Number of clusters to generate.
- red_method : str, default=’auto’
Dimensionality reduction method for better interpretability (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
- use_subset : bool, default=False
If True, uses the embedding subset instead of the full embeddings.
Returns:
fig : matplotlib.figure.Figure
- plot_embeddings(red_method: str = 'auto', grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)
Creates a simple static 2D scatterplot of the embeddings.
- Parameters:
red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply.
title (str, optional) – Title to display on the plot.
nlabels (int, default=0) – Number of word labels to display. If 0, no labels are shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.
- Returns:
fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
- plot_heatmap(use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)
Creates a heatmap showing the value of every vector component for every word.
- Parameters:
dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.
use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.
n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.
theme (str, default='light1') – Plot color theme to use.
title (str, optional) – Title for the heatmap. If None, a default title is assigned.
- Returns:
fig
- Return type:
plotly.graph_objects.Figure
- plot_interactive(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)
Creates an interactive 2D scatterplot of embeddings using Plotly.
Parameters:
- red_method : str, default=’auto’
Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
- grid : bool, default=True
Whether to display grid lines.
- theme : str, default=’light1’
Plot color theme.
- title : str, optional
Title of the plot. If None, no title is shown.
- use_subset : bool, default=False
If True, uses the embedding subset instead of the full embeddings.
Returns:
fig : plotly.graph_objects.Figure
- plot_similarity(target_word: str, dist: str = 'cosine', n: int = 10, red_method: str = 'pca', grid: bool = True, theme: str = 'light1', title: str = None)
Creates a scatterplot showing the most similar words to a target word.
- Parameters:
target_word (str) – The word for which to find and plot the most similar words.
dist (str, default='cosine') – Distance metric to use when computing word similarity.
n (int, default=10) – Number of similar words to display.
red_method (str, default='pca') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.).
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply to the plot.
title (str, optional) – Title to display. If None, a default title will be generated.
- Returns:
fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
- plot_similarity_heatmap(dist: str = 'cosine', use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)
Creates a heatmap showing pairwise distances between word embeddings.
- Parameters:
dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.
use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.
n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.
theme (str, default='light1') – Plot color theme to use.
title (str, optional) – Title for the heatmap. If None, a default title is assigned.
- Returns:
fig
- Return type:
plotly.graph_objects.Figure
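The matrix behind a pairwise-distance heatmap can be sketched without any plotting. The helper pairwise_distances is hypothetical, and Euclidean distance (math.dist) is used as the default here only to keep the sketch short, whereas the documented default for the method is cosine. The sketch also assumes the metric is symmetric and zero between a vector and itself.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def pairwise_distances(
    vectors: Sequence[Vector],
    metric: Callable[[Vector, Vector], float] = math.dist,
) -> List[List[float]]:
    """Return the full n x n matrix of distances between all vector pairs."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(vectors[i], vectors[j])
            matrix[i][j] = d
            matrix[j][i] = d  # distances are symmetric, so mirror the value
    return matrix
```

The matrix grows quadratically with the number of embeddings, which is one reason a subset (use_subset=True, n=500 by default) is a sensible starting point.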
- plot_topography(red_method: str = 'auto', use_subset: bool = True, grid: bool = True, theme: str = 'light1', title: str = None)
Plots word embeddings as a topographical map, using dimensionality reduction to preserve the distances between words in the representation. This makes it possible to visualize word density in the space.
- Parameters:
red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
use_subset (bool, default=True) – If True, uses a subset of the embeddings for visualization. This is recommended in this plot for larger embeddings.
grid (bool, default=True) – If True, shows grid lines on the plot.
theme (str, default='light1') – The plot theme to use, which controls the colors of the plot.
title (str, optional) – Title of the plot. If not provided, a default title is used.
- Returns:
fig
- Return type:
plotly.graph_objs.Figure
- similarity_heatmap(dist: str = 'cosine', use_subset: bool = True, n: int = 500, theme: str = 'light1', title: bool = None)
DEPRECATED: This method will be renamed to plot_similarity_heatmap in a future release.
- Parameters:
dist (str, default='cosine') – Distance metric to use for computing similarity between embeddings.
use_subset (bool, default=True) – If True, uses a subset of the embeddings. Otherwise, uses the full set.
n (int, optional) – Number of embeddings to subset. Ignored if a subset already exists and use_subset is True.
theme (str, default='light1') – Plot color theme to use.
title (str, optional) – Title for the heatmap. If None, a default title is assigned.
- Returns:
fig
- Return type:
plotly.graph_objects.Figure
wordviz.plotting3d module
- class wordviz.plotting3d.Visualizer3D(loader)
Bases: BaseVisualizer
- plot_clusters(n_clusters=5, method='kmeans', red_method='auto', show_centers=False, grid=True, theme='light1', title=None, nlabels=0, use_subset=False)
Creates a 3D scatterplot of clustered embeddings using a clustering algorithm.
Parameters:
- n_clusters : int, default=5
Number of clusters to generate.
- method : str, default=’kmeans’
Clustering method to use (‘kmeans’ or others supported by create_clusters).
- red_method : str, default=’auto’
Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
- show_centers : bool, default=False
If True, displays cluster centers on the plot.
- grid : bool, default=True
Whether to display grid lines.
- theme : str, default=’light1’
Plot color theme.
- title : str, optional
Title of the plot. If None, no title is shown.
- nlabels : int, default=0
Number of token labels to display on the plot.
- use_subset : bool, default=False
If True, uses the embedding subset instead of the full embeddings.
Returns:
fig : plotly.graph_objects.Figure
Notes:
In 3D plots, Plotly.py tends to rely on the GPU when rendering a large number of elements and labels, so this function may not work properly with a full embedding set.
- plot_embeddings(red_method='auto', grid=True, theme='light1', title=None, use_subset=False)
Creates an interactive 3D scatterplot of embeddings using Plotly.
Parameters:
- red_method : str, default=’auto’
Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
- grid : bool, default=True
Whether to display grid lines.
- theme : str, default=’light1’
Plot color theme.
- title : str, optional
Title of the plot. If None, no title is shown.
- use_subset : bool, default=False
If True, uses the embedding subset instead of the full embeddings.
Returns:
fig : plotly.graph_objects.Figure
Notes:
In 3D plots, Plotly.py tends to rely on the GPU when rendering a large number of elements and labels, so this function may not work properly with a full embedding set.
- plot_similarity(target_word: str, dist: str = 'cosine', n: int = 10, red_method: str = 'pca', grid: bool = True, theme: str = 'light1', title: str = None)
Creates a dynamic 3D scatterplot showing the most similar words to a target word.
- Parameters:
target_word (str) – The word for which to find and plot the most similar words.
dist (str, default='cosine') – Distance metric to use when computing word similarity.
n (int, default=10) – Number of similar words to display.
red_method (str, default='pca') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.).
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply to the plot.
title (str, optional) – Title to display. If None, a default title will be generated.
- Returns:
fig
- Return type:
plotly.graph_objects.Figure
- plot_static(red_method: str = 'auto', grid: bool = True, theme: str = 'light1', title: str = None, nlabels: int = 0, use_subset: bool = False)
Creates a simple static 3D scatterplot of the embeddings.
- Parameters:
red_method (str, default='auto') – Dimensionality reduction method to apply (‘pca’, ‘tsne’, ‘umap’, etc.). If ‘auto’, searches for a cached reduction; if None, runs PCA.
grid (bool, default=True) – If True, displays a background grid on the plot.
theme (str, default='light1') – Color theme to apply.
title (str, optional) – Title to display on the plot.
nlabels (int, default=0) – Number of word labels to display. If 0, no labels are shown.
use_subset (bool, default=False) – If True, uses the embedding subset instead of the full embeddings.
- Returns:
fig (matplotlib.figure.Figure)
ax (matplotlib.axes.Axes)
wordviz.similarity module
- wordviz.similarity.n_most_similar(loader: EmbeddingLoader, target_word: str, dist: str = 'cosine', n: int = 10) -> Tuple[List[str], ndarray, List[float]]
Finds the n words most similar to a given target word using a specified distance metric.
- Parameters:
loader (EmbeddingLoader) – An instance of the embedding loader containing word vectors.
target_word (str) – The word for which to find the most similar neighbors.
dist (str, default='cosine') – The distance metric to use. Options include ‘cosine’, ‘euclidean’, etc.
n (int, default=10) – The number of most similar words to retrieve.
- Returns:
words (list of str) – The most similar words found.
vectors (np.ndarray) – Embedding vectors corresponding to the most similar words.
distances (list of float) – Distances from the target word to each of the most similar words.
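The three return values can be illustrated with a small nearest-neighbour ranking. This is a sketch under stated assumptions: nearest_words and the plain dict vocabulary are hypothetical stand-ins for the loader, only cosine distance is implemented, and excluding the target word from its own neighbours is a choice of this sketch rather than documented behaviour.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_words(
    vocab: Dict[str, Vector], target_word: str, n: int = 10
) -> Tuple[List[str], List[Vector], List[float]]:
    """Hypothetical helper: rank the vocabulary by distance to the target."""
    target = vocab[target_word]  # raises KeyError for an unknown word
    scored = sorted(
        (cosine_distance(target, vec), word)
        for word, vec in vocab.items()
        if word != target_word  # assumption: the target itself is excluded
    )[:n]
    words = [word for _, word in scored]
    vectors = [vocab[word] for word in words]
    distances = [d for d, _ in scored]
    return words, vectors, distances
```

Because the candidates are sorted by distance, the returned lists run from the closest word to the farthest of the n retained.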
- wordviz.similarity.word_distance(loader: EmbeddingLoader, word1: str, word2: str, dist: str = 'cosine') -> float
Computes the distance between two words given by the user. Also supports distances between sentences.
- Parameters:
loader (EmbeddingLoader) – Object used to load embeddings
word1 (str) – First word of the pair.
word2 (str) – Second word of the pair.
dist (str, default='cosine') – Type of distance to use:
  - ‘braycurtis’
  - ‘canberra’
  - ‘chebyshev’
  - ‘cosine’
  - ‘dot’
  - ‘euclidean’
  - ‘manhattan’
  - ‘pearson’
- Returns:
distance
- Return type:
float
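A few of the listed metrics can be written out directly to show what the dist argument selects. The sketch below covers only four of the names (‘cosine’, ‘euclidean’, ‘manhattan’, ‘chebyshev’) using their standard definitions. The function vector_distance is hypothetical and works on raw vectors rather than on words looked up through a loader.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean(u: Vector, v: Vector) -> float:
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u: Vector, v: Vector) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u: Vector, v: Vector) -> float:
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine": cosine,
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
}


def vector_distance(u: Vector, v: Vector, dist: str = "cosine") -> float:
    """Hypothetical dispatcher: pick a metric by name and apply it."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    if dist not in METRICS:
        raise ValueError(f"unsupported distance: {dist!r}")
    return METRICS[dist](u, v)
```

Cosine distance ignores vector length and compares direction only, while the other three depend on the magnitude of the coordinate differences, so the choice of dist can change which words appear closest.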