
Commit 17f40a3

julien-c and lhoestq authored
fix some broken links (#7859)
* fix some broken links
* some more

Co-authored-by: Quentin Lhoest <[email protected]>
1 parent cf647ab commit 17f40a3

17 files changed: +42 / -42 lines changed

docs/source/dataset_card.mdx

Lines changed: 1 addition & 1 deletion
@@ -24,4 +24,4 @@ Creating a dataset card is easy and can be done in just a few steps:
 YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.

-Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
+Feel free to take a look at the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli), [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/tblard/allocine) dataset cards as examples to help you get started.

docs/source/faiss_es.mdx

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ FAISS retrieves documents based on the similarity of their vector representation
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset('crime_and_punish', split='train[:100]')
+>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
 >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
 ```

@@ -62,7 +62,7 @@ FAISS retrieves documents based on the similarity of their vector representation
 7. Reload it at a later time with [`Dataset.load_faiss_index`]:

 ```py
->>> ds = load_dataset('crime_and_punish', split='train[:100]')
+>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
 >>> ds.load_faiss_index('embeddings', 'my_index.faiss')
 ```
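Note: taken together with the `add_faiss_index` docstring changes further down, the full workflow with the renamed dataset ID looks roughly like this sketch, where `embed` stands in for whatever encoder produces the vectors (the guide itself uses a DPR context encoder and tokenizer):

```py
>>> from datasets import load_dataset
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> # `embed` is assumed to return a 1-D numpy array for a piece of text
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
>>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')
>>> # later: reload the dataset and the saved index
>>> ds = load_dataset('community-datasets/crime_and_punish', split='train[:100]')
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
```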

docs/source/image_load.mdx

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ When you load an image dataset and call the image column, the images are decoded
 ```py
 >>> from datasets import load_dataset, Image

->>> dataset = load_dataset("beans", split="train")
+>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train")
 >>> dataset[0]["image"]
 ```

@@ -33,7 +33,7 @@ You can load a dataset from the image path. Use the [`~Dataset.cast_column`] fun
 If you only want to load the underlying path to the image dataset without decoding the image object, set `decode=False` in the [`Image`] feature:

 ```py
->>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False))
+>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
 >>> dataset[0]["image"]
 {'bytes': None,
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/bean_rust/bean_rust_train.29.jpg'}
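A short usage note on the snippet above: with the namespaced `AI-Lab-Makerere/beans` ID you can toggle decoding off and back on with `cast_column` (a sketch, not part of the diff):

```py
>>> from datasets import load_dataset, Image

>>> # decode=False: each example carries only the raw bytes and/or the file path
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]["path"]
>>> # cast back with decode=True to get PIL images again
>>> dataset = dataset.cast_column("image", Image(decode=True))
>>> dataset[0]["image"]  # a PIL.Image.Image
```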

docs/source/loading.mdx

Lines changed: 1 addition & 1 deletion
@@ -327,7 +327,7 @@ Select specific rows of the `train` split:
 ```py
 >>> train_10_20_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[10:20]")
 ===STRINGAPI-READINSTRUCTION-SPLIT===
->>> train_10_20_ds = datasets.load_dataset("bookcorpu", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
+>>> train_10_20_ds = datasets.load_dataset("rojagtap/bookcorpus", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))
 ```

 Or select a percentage of a split with:
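The hunk stops at that sentence; for reference, percentage-based selection with the renamed `rojagtap/bookcorpus` ID would look roughly like this (a sketch, not the exact snippet from the full guide):

```py
>>> train_50_pct_ds = datasets.load_dataset("rojagtap/bookcorpus", split="train[:50%]")
>>> # equivalent ReadInstruction form
>>> train_50_pct_ds = datasets.load_dataset("rojagtap/bookcorpus", split=datasets.ReadInstruction("train", to=50, unit="%"))
```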

docs/source/object_detection.mdx

Lines changed: 2 additions & 2 deletions
@@ -8,14 +8,14 @@ To run these examples, make sure you have up-to-date versions of [albumentations
 pip install -U albumentations opencv-python
 ```

-In this example, you'll use the [`cppe-5`](https://huggingface.co/datasets/cppe-5) dataset for identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.
+In this example, you'll use the [`cppe-5`](https://huggingface.co/datasets/rishitdagli/cppe-5) dataset for identifying medical personal protective equipment (PPE) in the context of the COVID-19 pandemic.

 Load the dataset and take a look at an example:

 ```py
 >>> from datasets import load_dataset

->>> ds = load_dataset("cppe-5")
+>>> ds = load_dataset("rishitdagli/cppe-5")
 >>> example = ds['train'][0]
 >>> example
 {'height': 663,
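Since the surrounding guide goes on to build an `albumentations` pipeline over this dataset, here is a minimal sketch of such a pipeline; the particular transforms are illustrative and may differ from the full guide (cppe-5 boxes are COCO-style `[x, y, width, height]`):

```py
>>> import albumentations as A

>>> transform = A.Compose(
...     [A.Resize(480, 480), A.HorizontalFlip(p=0.5), A.RandomBrightnessContrast(p=0.5)],
...     bbox_params=A.BboxParams(format="coco", label_fields=["category"]),
... )
```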

docs/source/quickstart.mdx

Lines changed: 1 addition & 1 deletion
@@ -288,7 +288,7 @@ pip install -U albumentations opencv-python
 ## NLP

-Text needs to be tokenized into individual tokens by a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). For the quickstart, you'll load the [Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/glue/viewer/mrpc) training dataset to train a model to determine whether a pair of sentences mean the same thing.
+Text needs to be tokenized into individual tokens by a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). For the quickstart, you'll load the [Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc) training dataset to train a model to determine whether a pair of sentences mean the same thing.

 **1**. Load the MRPC dataset by providing the [`load_dataset`] function with the dataset name, dataset configuration (not all datasets will have a configuration), and dataset split:
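To illustrate step 1 with the updated `nyu-mll/glue` ID, a small sketch; the `bert-base-uncased` tokenizer is an arbitrary choice for illustration:

```py
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer

>>> dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda examples: tokenizer(examples["sentence1"], examples["sentence2"], truncation=True), batched=True)
```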

docs/source/stream.mdx

Lines changed: 2 additions & 2 deletions
@@ -160,11 +160,11 @@ You can split your dataset one of two ways:
 🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the `num_shards` parameter in [`~IterableDataset.shard`] to determine the number of shards to split the dataset into. You'll also need to provide the shard you want to return with the `index` parameter.

-For example, the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):
+For example, the [amazon_polarity](https://huggingface.co/datasets/fancyzhx/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):

 ```py
 >>> from datasets import load_dataset
->>> dataset = load_dataset("amazon_polarity", split="train", streaming=True)
+>>> dataset = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
 >>> print(dataset)
 IterableDataset({
     features: ['label', 'title', 'content'],
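Building on the snippet above, actually taking one of the four shards uses the `num_shards` and `index` parameters mentioned in the prose (a sketch):

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> # keep only the first of the 4 shards (Parquet files)
>>> shard_0 = dataset.shard(num_shards=4, index=0)
```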

docs/source/use_with_jax.mdx

Lines changed: 2 additions & 2 deletions
@@ -195,11 +195,11 @@ part.
 The easiest way to get JAX arrays out of a dataset is to use the `with_format('jax')` method. Lets assume
 that we want to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) available
-at the HuggingFace Hub at https://huggingface.co/datasets/mnist.
+at the HuggingFace Hub at https://huggingface.co/datasets/ylecun/mnist.

 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("mnist")
+>>> ds = load_dataset("ylecun/mnist")
 >>> ds = ds.with_format("jax")
 >>> ds["train"][0]
 {'image': DeviceArray([[ 0, 0, 0, ...],
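A brief usage sketch of the same flow with the namespaced MNIST ID, slicing a small batch of JAX arrays:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("ylecun/mnist")
>>> ds = ds.with_format("jax")
>>> batch = ds["train"][:8]
>>> batch["image"].shape  # (8, 28, 28) as a jax.numpy array
```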

docs/source/use_with_numpy.mdx

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ at the HuggingFace Hub at https://huggingface.co/datasets/mnist.
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("mnist")
+>>> ds = load_dataset("ylecun/mnist")
 >>> ds = ds.with_format("numpy")
 >>> ds["train"][0]
 {'image': array([[ 0, 0, 0, ...],
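And the NumPy counterpart, here also restricting formatting to specific columns via the optional `columns` argument (a sketch):

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("ylecun/mnist")
>>> ds = ds.with_format("numpy", columns=["image", "label"])
>>> ds["train"][0]["image"].shape  # (28, 28) as a numpy array
```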

src/datasets/arrow_dataset.py

Lines changed: 5 additions & 5 deletions
@@ -1970,7 +1970,7 @@ def class_encode_column(self, column: str, include_nulls: bool = False) -> "Data
 ```py
 >>> from datasets import load_dataset
->>> ds = load_dataset("boolq", split="validation")
+>>> ds = load_dataset("google/boolq", split="validation")
 >>> ds.features
 {'answer': Value('bool'),
  'passage': Value('string'),
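For the docstring above, a compact sketch of what `class_encode_column` does with the renamed `google/boolq` ID:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("google/boolq", split="validation")
>>> ds = ds.class_encode_column("answer")
>>> ds.features["answer"]  # now a ClassLabel feature instead of a plain bool Value
```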
@@ -4725,7 +4725,7 @@ def train_test_split(
 >>> ds = ds.train_test_split(test_size=0.2, seed=42)

 # stratified split
->>> ds = load_dataset("imdb",split="train")
+>>> ds = load_dataset("stanfordnlp/imdb",split="train")
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
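The stratified example in the docstring continues past this hunk; condensed, with the renamed `stanfordnlp/imdb` ID, it amounts to something like:

```py
>>> from datasets import load_dataset

>>> ds = load_dataset("stanfordnlp/imdb", split="train")
>>> # stratify_by_column keeps the label proportions the same in both splits
>>> splits = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
```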
@@ -6175,15 +6175,15 @@ def add_faiss_index(
 Example:

 ```python
->>> ds = datasets.load_dataset('crime_and_punish', split='train')
+>>> ds = datasets.load_dataset('community-datasets/crime_and_punish', split='train')
 >>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line']}))
 >>> ds_with_embeddings.add_faiss_index(column='embeddings')
 >>> # query
 >>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
 >>> # save index
 >>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

->>> ds = datasets.load_dataset('crime_and_punish', split='train')
+>>> ds = datasets.load_dataset('community-datasets/crime_and_punish', split='train')
 >>> # load index
 >>> ds.load_faiss_index('embeddings', 'my_index.faiss')
 >>> # query
@@ -6314,7 +6314,7 @@ def add_elasticsearch_index(
 ```python
 >>> es_client = elasticsearch.Elasticsearch()
->>> ds = datasets.load_dataset('crime_and_punish', split='train')
+>>> ds = datasets.load_dataset('community-datasets/crime_and_punish', split='train')
 >>> ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
 >>> scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)
 ```
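One way to sanity-check renames like these without downloading any data is to resolve each new repo ID against the Hub; a rough sketch (the list below is just the IDs touched in this commit):

```py
>>> from huggingface_hub import dataset_info

>>> renamed = [
...     "stanfordnlp/snli", "abisee/cnn_dailymail", "tblard/allocine",
...     "community-datasets/crime_and_punish", "AI-Lab-Makerere/beans",
...     "rojagtap/bookcorpus", "rishitdagli/cppe-5", "nyu-mll/glue",
...     "fancyzhx/amazon_polarity", "ylecun/mnist", "google/boolq", "stanfordnlp/imdb",
... ]
>>> for repo_id in renamed:
...     dataset_info(repo_id)  # raises if the repo does not exist on the Hub
```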
