easygraph.datasets.citation_graph module#

Cora, CiteSeer, and PubMed citation graph datasets.

class easygraph.datasets.citation_graph.CitationGraphDataset(name, raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]#

Bases: EasyGraphBuiltinDataset

The citation graph dataset, including Cora, CiteSeer and PubMed. Nodes represent papers and edges represent citation relationships.

Parameters:
  • name (str) – name can be ‘Cora’, ‘CiteSeer’ or ‘PubMed’.

  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.
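As a rough illustration of what reverse_edge=True implies, the sketch below adds the reverse of every directed edge to a plain edge list. The helper name add_reverse_edges, the tuple-based edge representation, and the removal of duplicates are assumptions made for this example; they are not taken from the EasyGraph implementation.

```python
def add_reverse_edges(edges):
    """Return the edges plus the reverse of each one, without duplicates.

    `edges` is an iterable of (source, target) pairs. This helper is
    hypothetical and only illustrates the idea behind `reverse_edge=True`.
    """
    seen = set()
    result = []
    for src, dst in edges:
        # Visit the edge itself, then its reverse.
        for edge in ((src, dst), (dst, src)):
            if edge not in seen:
                seen.add(edge)
                result.append(edge)
    return result


citations = [(0, 1), (1, 2)]
print(add_reverse_edges(citations))
```

With reverse edges present, a citation can be followed in both directions, which is usually what message-passing models on citation graphs expect.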

Attributes:
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Load input data from the data directory and reorder the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]#

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.
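One plausible way to override has_cache() is to check whether the processed file already exists on disk. The sketch below uses a hypothetical stand-in class, CachedDatasetSketch, with a save_path attribute; it does not inherit from EasyGraphBuiltinDataset and only illustrates the idea.

```python
import os
import tempfile


class CachedDatasetSketch:
    """Hypothetical stand-in for a dataset class; not part of EasyGraph."""

    def __init__(self, save_path):
        self.save_path = save_path

    def has_cache(self):
        # Treat the dataset as cached when the processed file already exists.
        return os.path.exists(self.save_path)


with tempfile.TemporaryDirectory() as tmp_dir:
    dataset = CachedDatasetSketch(os.path.join(tmp_dir, "processed.bin"))
    print(dataset.has_cache())  # nothing has been written yet
    with open(dataset.save_path, "wb"):
        pass  # create an empty placeholder file
    print(dataset.has_cache())  # the placeholder file now exists
```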

property num_classes#
property num_labels#
process()[source]#

Load input data from the data directory and reorder the graph for better locality.

  • ind.name.x => the feature vectors of the training instances as a scipy.sparse.csr.csr_matrix object;

  • ind.name.tx => the feature vectors of the test instances as a scipy.sparse.csr.csr_matrix object;

  • ind.name.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.name.x) as a scipy.sparse.csr.csr_matrix object;

  • ind.name.y => the one-hot labels of the labeled training instances as a numpy.ndarray object;

  • ind.name.ty => the one-hot labels of the test instances as a numpy.ndarray object;

  • ind.name.ally => the labels for instances in ind.name.allx as a numpy.ndarray object;

  • ind.name.graph => a dict in the format {index: [index_of_neighbor_nodes]} as a collections.defaultdict object;

  • ind.name.test.index => the indices of test instances in the graph, for the inductive setting, as a list object.
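Loaders for this file layout commonly stack the allx and tx matrices and then move each test row to the node position listed in test.index. The sketch below shows that step on small dense arrays. The function name merge_features is hypothetical, and the sketch assumes the test indices form a contiguous range directly after the allx rows; whether process() follows exactly this procedure is not stated here.

```python
import numpy as np


def merge_features(allx, tx, test_index):
    """Stack training and test features and restore the node order.

    Hypothetical helper. Assumes `test_index` is a permutation of the
    contiguous range that starts right after the rows of `allx`.
    """
    allx = np.asarray(allx, dtype=float)
    tx = np.asarray(tx, dtype=float)
    test_index = np.asarray(test_index)
    features = np.vstack([allx, tx])
    # Rows of tx arrive in the order given by test_index; move each row to
    # the position of the node it belongs to.
    features[test_index, :] = features[np.sort(test_index), :]
    return features


allx = [[1, 0], [0, 1], [1, 1]]   # nodes 0, 1, 2
tx = [[5, 5], [7, 7]]             # first row is node 4, second row is node 3
print(merge_features(allx, tx, test_index=[4, 3]))
```

The labels in ally and ty can be merged the same way, so that row i of both matrices refers to node i of the graph.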

property reverse_edge#
property save_name#
class easygraph.datasets.citation_graph.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]#

Bases: CitationGraphDataset

Citeseer citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 3703 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.

Statistics:

  • Nodes: 3327

  • Edges: 9228

  • Number of Classes: 6

  • Label Split:

    • Train: 120

    • Valid: 500

    • Test: 1000
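A label split of this kind is usually represented as one boolean mask per subset, with one entry per node. The sketch below builds such masks for the sizes listed above. The helper make_mask and the index ranges are purely illustrative assumptions; the real split indices come from the dataset files.

```python
import numpy as np


def make_mask(num_nodes, indices):
    """Return a boolean mask that is True at the given node indices."""
    mask = np.zeros(num_nodes, dtype=bool)
    mask[np.asarray(indices)] = True
    return mask


num_nodes = 3327
# Illustrative index ranges only; they are not the actual CiteSeer split.
train_mask = make_mask(num_nodes, range(0, 120))
val_mask = make_mask(num_nodes, range(120, 620))
test_mask = make_mask(num_nodes, range(num_nodes - 1000, num_nodes))

print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

Nodes that fall in none of the three masks are unlabeled for training purposes but still contribute features and edges to the graph.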

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes#

Number of label classes

Type:

int

Notes

The node feature is row-normalized.

The CiteSeer dataset contains some isolated nodes. These nodes are added to the feature matrix as zero vectors at their corresponding positions.
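Row normalization divides every feature row by its sum, so that each non-empty row sums to 1. The sketch below shows one common way to do this while leaving all-zero rows (such as those of isolated nodes) untouched. The function name row_normalize is hypothetical, and the sketch uses a dense array for brevity; it is not taken from the EasyGraph source.

```python
import numpy as np


def row_normalize(features):
    """Scale each row to sum to 1; rows that sum to 0 stay all-zero."""
    features = np.asarray(features, dtype=float)
    row_sum = features.sum(axis=1, keepdims=True)
    # Replace zero sums by 1 so that zero rows are divided safely.
    safe_sum = np.where(row_sum == 0, 1.0, row_sum)
    return features / safe_sum


features = [[1, 3], [0, 0], [2, 2]]
print(row_normalize(features))
```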

Examples

>>> from easygraph.datasets.citation_graph import CiteseerGraphDataset
>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Attributes:
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Load input data from the data directory and reorder the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.citation_graph.CoraBinary(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]#

Bases: EasyGraphBuiltinDataset

A mini-dataset for binary classification task using Cora.

After loading, it has the following members:

  • graphs : list of Graph

  • pmpds : list of scipy.sparse.coo_matrix

  • labels : list of numpy.ndarray

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.
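The phrase "transformed before every access" can be read as: the stored graph stays unchanged and the transform runs each time an item is fetched. The sketch below illustrates that pattern with a hypothetical container class and a toy adjacency-dict graph; TransformOnAccessSketch and add_self_loops are assumptions for illustration, not EasyGraph classes or functions.

```python
class TransformOnAccessSketch:
    """Hypothetical container that applies a transform on every access."""

    def __init__(self, graphs, transform=None):
        self._graphs = list(graphs)
        self._transform = transform

    def __len__(self):
        return len(self._graphs)

    def __getitem__(self, idx):
        graph = self._graphs[idx]
        if self._transform is None:
            return graph
        # The stored graph is left as-is; callers see the transformed copy.
        return self._transform(graph)


def add_self_loops(graph):
    """Toy transform on an adjacency dict {node: [neighbors]}."""
    return {node: sorted(set(nbrs) | {node}) for node, nbrs in graph.items()}


dataset = TransformOnAccessSketch([{0: [1], 1: [0]}], transform=add_self_loops)
print(dataset[0])
```

Because the transform runs on access rather than at load time, changing it does not require reprocessing or re-caching the dataset.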

Attributes:
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Overwrite to realize your own logic of processing the input data.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

has_cache()[source]#

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

process()[source]#

Overwrite to realize your own logic of processing the input data.

property save_name#
class easygraph.datasets.citation_graph.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]#

Bases: CitationGraphDataset

Cora citation network dataset.

Nodes represent papers and edges represent citation relationships. Each node has a predefined feature with 1433 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given paper.

Statistics:

  • Nodes: 2708

  • Edges: 10556

  • Number of Classes: 7

  • Label split:

    • Train: 140

    • Valid: 500

    • Test: 1000

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes#

Number of label classes

Type:

int

Notes

The node feature is row-normalized.

Examples

>>> from easygraph.datasets.citation_graph import CoraGraphDataset
>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Attributes:
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Load input data from the data directory and reorder the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

class easygraph.datasets.citation_graph.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None, reorder=False)[source]#

Bases: CitationGraphDataset

Pubmed citation network dataset.

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature with 500 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.

Statistics:

  • Nodes: 19717

  • Edges: 88651

  • Number of Classes: 3

  • Label Split:

    • Train: 60

    • Valid: 500

    • Test: 1000

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

  • reorder (bool) – Whether to reorder the graph using reorder_graph(). Default: False.

num_classes#

Number of label classes

Type:

int

Notes

The node feature is row-normalized.

Examples

>>> from easygraph.datasets.citation_graph import PubmedGraphDataset
>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
Attributes:
hash

Hash value for the dataset and the setting.

name

Name of the dataset.

num_classes
num_labels
raw_dir

Raw file directory that contains the input data folder.

raw_path

Directory that contains the input data files.

reverse_edge
save_dir

Directory to save the processed dataset.

save_name
save_path

Path to save the processed dataset.

url

Get url to download the raw dataset.

verbose

Whether to print information.

Methods

download()

Automatically download data and extract it.

has_cache()

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

load()

Overwrite to realize your own logic of loading the saved dataset from files.

process()

Load input data from the data directory and reorder the graph for better locality.

save()

Overwrite to realize your own logic of saving the processed dataset into files.

easygraph.datasets.citation_graph.load_citeseer(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]#

Get CiteseerGraphDataset

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

Return type:

CiteseerGraphDataset

easygraph.datasets.citation_graph.load_cora(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]#

Get CoraGraphDataset

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

Return type:

CoraGraphDataset

easygraph.datasets.citation_graph.load_pubmed(raw_dir=None, force_reload=False, verbose=True, reverse_edge=True, transform=None)[source]#

Get PubmedGraphDataset

Parameters:
  • raw_dir (str) – Directory where the raw files are downloaded to, or which already contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • reverse_edge (bool) – Whether to add reverse edges in graph. Default: True.

  • transform (callable, optional) – A transform that takes in a Graph object and returns a transformed version. The Graph object will be transformed before every access.

Return type:

PubmedGraphDataset