easygraph.datasets package



easygraph.datasets.citation_graph module

Cora, Citeseer and Pubmed datasets.

Note (lingfan): the dataset loading and preprocessing code follows tkipf/gcn: https://github.com/tkipf/gcn/blob/master/gcn/utils.py

class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: EasyGraphBuiltinDataset

The citation graph dataset, including cora, citeseer and pubmed. Nodes represent papers and edges represent citation relationships.

  • name (str) – name can be ‘cora’, ‘citeseer’ or ‘pubmed’.

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
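The phrase "transformed before every access" can be illustrated with a minimal sketch. `ToyDataset` and `add_self_loops` below are hypothetical names invented for this illustration, and the graph is a plain adjacency dict rather than a real graph object; this is not the library's implementation, only the access pattern the parameter description suggests.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that holds a single graph."""

    def __init__(self, graph, transform=None):
        self._graph = graph
        self._transform = transform

    def __getitem__(self, idx):
        if idx != 0:
            raise IndexError("this dataset holds exactly one graph")
        # The stored graph is left untouched; the transform runs on each access.
        if self._transform is None:
            return self._graph
        return self._transform(self._graph)

    def __len__(self):
        return 1


def add_self_loops(graph):
    """Return a new adjacency dict with a self-loop added to every node."""
    return {node: sorted(set(nbrs) | {node}) for node, nbrs in graph.items()}


raw = {0: [1], 1: [0, 2], 2: [1]}
dataset = ToyDataset(raw, transform=add_self_loops)
transformed = dataset[0]  # the transform is applied here, not at construction
```

Because the transform returns a new object, the stored graph stays unchanged and every `dataset[0]` call yields a freshly transformed copy.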


Override this method to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.
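A cache check of this kind might look like the sketch below. The method name `has_cache`, both class names and the cache file name are assumptions made for illustration; none of them appear in the text above, so check the actual base class before relying on them.

```python
import os
import tempfile


class BaseDataset:
    """Hypothetical base class, standing in for the real one."""

    def __init__(self, save_dir):
        self.save_dir = save_dir

    def has_cache(self):
        # Default behaviour described above: report that no cache exists.
        return False


class FileCachedDataset(BaseDataset):
    """Hypothetical subclass that looks for a processed file on disk."""

    cache_name = "processed_graph.bin"  # illustrative file name

    def has_cache(self):
        return os.path.exists(os.path.join(self.save_dir, self.cache_name))


with tempfile.TemporaryDirectory() as tmp:
    dataset = FileCachedDataset(tmp)
    before = dataset.has_cache()
    # Simulate a previous run having saved the processed dataset.
    with open(os.path.join(tmp, FileCachedDataset.cache_name), "wb") as f:
        f.write(b"")
    after = dataset.has_cache()
    default = BaseDataset(tmp).has_cache()
```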

property num_classes
property num_labels

Loads input data from the data directory and reorders the graph for better locality.

  • ind.name.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;

  • ind.name.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;

  • ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as scipy.sparse.csr.csr_matrix object;

  • ind.name.y => the one-hot labels of the labeled training instances as numpy.ndarray object;

  • ind.name.ty => the one-hot labels of the test instances as numpy.ndarray object;

  • ind.name.ally => the labels for instances in ind.name.allx as numpy.ndarray object;

  • ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;

  • ind.name.test.index => the indices of test instances in graph, for the inductive setting as list object.
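As a rough illustration of the `ind.name.graph` format, the sketch below turns a `{index: [index_of_neighbor_nodes]}` dict into an edge list, optionally adding reverse edges in the spirit of the `reverse_edge` parameter. The function name `adjacency_to_edges` is hypothetical, and the real loader may build its graph differently.

```python
from collections import defaultdict


def adjacency_to_edges(adj, reverse_edge=True):
    """Flatten {node: [neighbors]} into a sorted list of unique (src, dst) pairs."""
    edges = set()
    for src, neighbors in adj.items():
        for dst in neighbors:
            edges.add((src, dst))
            if reverse_edge:
                # Store the opposite direction too; the set drops duplicates.
                edges.add((dst, src))
    return sorted(edges)


# Toy graph in the same shape as ind.name.graph.
graph = defaultdict(list)
graph[0].extend([1, 2])
graph[1].append(0)

edges_with_reverse = adjacency_to_edges(graph)
edges_as_stored = adjacency_to_edges(graph, reverse_edge=False)
```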

property reverse_edge
property save_name
class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Citeseer citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 3703 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.


  • Nodes: 3327

  • Edges: 9228

  • Number of Classes: 6

  • Label Split:

    • Train: 120

    • Valid: 500

    • Test: 1000

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.


Number of label classes




The node feature is row-normalized.

In the citeseer dataset, there are some isolated nodes in the graph. These isolated nodes are added as zero vectors at the corresponding positions.
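Row normalization with zero-vector rows can be sketched in plain Python. The sketch assumes that "row-normalized" means dividing each row by its sum and that all-zero rows are left as zeros; the function name `row_normalize` is hypothetical and the library's own preprocessing may differ in detail.

```python
def row_normalize(features):
    """Scale each row to sum to 1; all-zero rows are kept as zero vectors."""
    normalized = []
    for row in features:
        total = sum(row)
        if total == 0:
            # An isolated node represented by a zero vector: avoid dividing by zero.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


features = [
    [1, 1, 2],
    [0, 0, 0],  # zero vector, as used for an isolated node
    [0, 5, 0],
]
result = row_normalize(features)
```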


>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>> # get node feature
>>> feat = g.ndata['feat']
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>> # get labels
>>> label = g.ndata['label']
class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: EasyGraphBuiltinDataset

A mini-dataset for binary classification task using Cora.

Once loaded, it has the following members:

  • graphs – list of DGLGraph

  • pmpds – list of scipy.sparse.coo_matrix

  • labels – list of numpy.ndarray

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.


Override this method to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.


Override this method to implement your own logic for processing the input data.

property save_name
class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Cora citation network dataset.

Nodes represent papers and edges represent citation relationships. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task: predicting the category of a given paper.


  • Nodes: 2708

  • Edges: 10556

  • Number of Classes: 7

  • Label split:

    • Train: 140

    • Valid: 500

    • Test: 1000

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.


Number of label classes




The node feature is row-normalized.


>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>> # get node feature
>>> feat = g.ndata['feat']
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>> # get labels
>>> label = g.ndata['label']
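One way the split masks are typically used is to restrict an evaluation metric to the selected nodes. The sketch below uses plain Python lists in place of tensors; `masked_accuracy` and the toy values are hypothetical and only illustrate how a boolean mask selects nodes.

```python
def masked_accuracy(predictions, labels, mask):
    """Accuracy computed only over the nodes where mask is True."""
    selected = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    if not selected:
        raise ValueError("mask selects no nodes")
    correct = sum(1 for p, y in selected if p == y)
    return correct / len(selected)


# Toy values for a five-node graph.
labels = [0, 1, 2, 1, 0]
predictions = [0, 1, 1, 1, 2]
train_mask = [True, True, False, False, False]
test_mask = [False, False, True, True, True]

train_acc = masked_accuracy(predictions, labels, train_mask)
test_acc = masked_accuracy(predictions, labels, test_mask)
```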
class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]

Bases: CitationGraphDataset

Pubmed citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 500 dimensions. The dataset is designed for the node classification task: predicting the category of a given publication.


  • Nodes: 19717

  • Edges: 88651

  • Number of Classes: 3

  • Label Split:

    • Train: 60

    • Valid: 500

    • Test: 1000

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.


Number of label classes




The node feature is row-normalized.


>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>> # get node feature
>>> feat = g.ndata['feat']
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>> # get labels
>>> label = g.ndata['label']
easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get CiteseerGraphDataset

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type:


easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get CoraGraphDataset

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type:


easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]

Get PubmedGraphDataset

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

Return type:


easygraph.datasets.get_sample_graph module


Returns the undirected graph of blogcatalog.


get_graph_blogcatalog – The undirected graph instance of blogcatalog from dataset: https://github.com/phanein/deepwalk/blob/master/example_graphs/blogcatalog.mat

Return type:




Returns the undirected graph of Flickr dataset.


get_graph_flickr – The undirected graph instance of Flickr from dataset: http://socialnetworks.mpi-sws.mpg.de/data/flickr-links.txt.gz

Return type:




Returns the undirected graph of Karate Club.


get_graph_karateclub – The undirected graph instance of karate club from dataset: http://vlado.fmf.uni-lj.si/pub/networks/data/Ucinet/UciData.htm

Return type:




Returns the undirected graph of Youtube dataset.


get_graph_youtube – The undirected graph instance of Youtube from dataset: http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz

Return type:



easygraph.datasets.gnn_benchmark module

class easygraph.datasets.gnn_benchmark.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: GNNBenchmarkDataset

‘Computer’ part of the AmazonCoBuy dataset for node classification task.

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets


  • Nodes: 13,752

  • Edges: 491,722 (note that the original dataset has 245,778 edges, but DGL adds the reverse edges and removes the duplicates, hence the different number)

  • Number of classes: 10

  • Node feature size: 767
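The operation named in the edge-count note (adding reverse edges, then removing duplicates) can be sketched on a toy edge list. The function name `symmetrize` is hypothetical, and the sketch makes no claim about how the specific counts above arise; it only shows that the resulting number of directed edges need not be exactly twice the input count.

```python
def symmetrize(edges):
    """Add the reverse of every edge and drop duplicate (src, dst) pairs."""
    directed = set()
    for u, v in edges:
        directed.add((u, v))
        directed.add((v, u))
    return directed


# Five input edges: (0, 1) already has its reverse, and (2, 3) is listed twice.
original = [(0, 1), (1, 0), (1, 2), (2, 3), (2, 3)]
result = symmetrize(original)
num_edges = len(result)
```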

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.


Number of classes for each node.




>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
property num_classes

Number of classes.

Return type:


easygraph.datasets.graph_dataset_base module

easygraph.datasets.karate module

class easygraph.datasets.karate.KarateClubDataset(transform=None)[source]

Bases: EasyGraphDataset

Karate Club dataset for Node Classification

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/

Karate Club dataset statistics:

  • Nodes: 34

  • Edges: 156

  • Number of Classes: 2


transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.


Number of node classes




>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']
property num_classes

Number of classes.


Override this method to implement your own logic for processing the input data.

easygraph.datasets.ppi module

PPIDataset for inductive learning.

class easygraph.datasets.ppi.LegacyPPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: PPIDataset

Legacy version of PPI Dataset

class easygraph.datasets.ppi.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: EasyGraphBuiltinDataset

Protein-Protein Interaction dataset for inductive node classification

A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. 20 graphs for training, 2 for validation and 2 for testing.

Reference: http://snap.stanford.edu/graphsage/


  • Train examples: 20

  • Valid examples: 2

  • Test examples: 2

  • mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’

  • raw_dir (str) – Directory where the raw input data is downloaded to, or where it is already stored. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.


Number of labels for each node




Node labels




Node features




>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
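Because each node carries several binary labels, evaluation is multi-label rather than single-class. The sketch below computes a micro-averaged F1 score over plain 0/1 lists. The choice of micro-F1 is an assumption (the text above does not prescribe a metric), and `micro_f1` is a hypothetical helper rather than part of the library.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (node, label) pairs of 0/1 indicators."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    # With no positives anywhere, define the score as 0.0 instead of dividing by zero.
    return 2 * tp / denom if denom else 0.0


# Two nodes with three labels each (the real dataset has 121 labels per node).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
score = micro_f1(y_true, y_pred)
```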

Override this method to implement your own logic for deciding whether a cached dataset exists.

Returns False by default.

property num_labels

Override this method to implement your own logic for processing the input data.


Override this method to implement your own logic for saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save dgl graph into files and use dgl.data.utils.save_info to save extra information into files.

easygraph.datasets.utils module

easygraph.datasets.utils.download(url, path=None, overwrite=True, sha1_hash=None, retries=5, verify_ssl=True, log=True)[source]

Download a given URL.

Code borrowed from mxnet/gluon/utils.py

  • url (str) – URL to download.

  • path (str, optional) – Destination path to store downloaded file. By default stores to the current directory with the same name as in url.

  • overwrite (bool, optional) – Whether to overwrite the destination file if it already exists. By default always overwrites the downloaded file.

  • sha1_hash (str, optional) – Expected sha1 hash in hexadecimal digits. Will ignore existing file when hash is specified but doesn’t match.

  • retries (integer, default 5) – The number of times to attempt downloading in case of failure or non 200 return codes.

  • verify_ssl (bool, default True) – Verify SSL certificates.

  • log (bool, default True) – Whether to print the progress for download
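The interaction between `overwrite` and `sha1_hash` described above can be sketched without any network access. `check_sha1` and `needs_download` are hypothetical helpers written for this illustration; they model only the decision of whether an existing file should be fetched again, not the download or retry logic itself.

```python
import hashlib
import os
import tempfile


def check_sha1(filename, sha1_hash):
    """Return True if the file's SHA-1 digest matches the expected hex string."""
    sha1 = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest() == sha1_hash.lower()


def needs_download(path, overwrite=True, sha1_hash=None):
    """Decide whether the file at `path` should be (re)downloaded."""
    if overwrite or not os.path.exists(path):
        return True
    # An existing file is ignored when a hash is given but does not match.
    if sha1_hash is not None and not check_sha1(path, sha1_hash):
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"hello")
    good = hashlib.sha1(b"hello").hexdigest()
    results = (
        needs_download(path, overwrite=False, sha1_hash=good),      # file is valid
        needs_download(path, overwrite=False, sha1_hash="0" * 40),  # hash mismatch
        needs_download(path, overwrite=True, sha1_hash=good),       # forced
    )
```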


The file path of the downloaded file.

Return type:


easygraph.datasets.utils.extract_archive(file, target_dir, overwrite=False)[source]

Extract archive file.

  • file (str) – Absolute path of the archive file.

  • target_dir (str) – Target directory of the archive to be uncompressed.

  • overwrite (bool, default False) – Whether to overwrite the contents inside the directory. By default, existing contents are not overwritten.
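A minimal sketch of the `overwrite` behaviour, using the `overwrite=False` default from the signature and handling only zip archives. `extract_zip` is a hypothetical helper; skipping extraction when the target directory already exists is an assumption about the intended semantics, not a confirmed description of the library function.

```python
import os
import tempfile
import zipfile


def extract_zip(file, target_dir, overwrite=False):
    """Extract a zip archive; return True if extraction actually ran."""
    if os.path.exists(target_dir) and not overwrite:
        # Assumed semantics: leave an existing directory untouched.
        return False
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(file) as archive:
        archive.extractall(target_dir)
    return True


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("hello.txt", "hi")
    target = os.path.join(tmp, "out")
    first = extract_zip(archive_path, target)                   # directory is new
    second = extract_zip(archive_path, target)                  # skipped
    forced = extract_zip(archive_path, target, overwrite=True)  # re-extracted
    extracted = os.path.exists(os.path.join(target, "hello.txt"))
```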


Generate a mask tensor according to the backend in use. For torch, it will create a bool tensor.

  • mask (numpy ndarray) – Input mask tensor.


Get the absolute path to the download directory.


dirname – Path to the download directory

Return type:



Module contents