How to use Cypher projection in Neo4j Graph Data Science library on a Lord of the Rings social… Towards Data Science – Medium

Learn all about cypher projection and how to use it to project virtual graphs

Tomaz Bratanic

Continuing the introduction of the Neo4j Graph data science library graph catalog feature from the last blog post, where we took a deep dive in Native projections, we will take a look at Cypher projection in this blog post. It is the more expressive variant of the graph projection and allows us to project virtual graphs as well.

We will be using some procedures from the APOC library in our import statement. I have gotten so used to always having APOC around, that I rarely even mention it individually. It contains many awesome procedures, and it is hard to imagine life without it.

LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/ontologies/ontology.csv" as row FIELDTERMINATOR "\t"
WITH row, CASE row.type WHEN 'per' THEN 'Person'
WHEN 'gro' THEN 'Group'
WHEN 'thin' THEN 'Thing'
WHEN 'pla' THEN 'Place' END as label
CALL apoc.create.nodes(['Node',label], [apoc.map.clean(row,['type','subtype'],[null,""])]) YIELD node
WITH node, row.subtype as class
MERGE (c:Class{id:class})
MERGE (node)-[:PART_OF]->(c);
UNWIND ['1','2','3'] as book
LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/tables/networks-id-volume" + book + ".csv" AS row
MATCH (source:Node{id:coalesce(row.IdSource,row.Source)})
MATCH (target:Node{id:coalesce(row.IdTarget,row.Target)})
CALL apoc.create.relationship(source, "INTERACTS_" + book,
{weight:toInteger(row.Weight)}, target) YIELD rel
RETURN distinct true;

The general syntax to use cypher projection is:

CALL gds.graph.create.cypher(
graphName: String,
nodeQuery: String,
relationshipQuery: String,
configuration: Map
)

Node query is a cypher statement used to describe the nodes we want to project. It must return the internal ids of the nodes and optionally any of their properties. Relationship query, on the other hand, describes the relationships we want to project. The cypher statement should return internal ids of source and target nodes of relationships, and optionally their type and any of their properties.

We will begin with a simple scenario and project the entire graph in memory. Adding the column type in the relationship query allows the data science engine to distinguish between relationship types, which in turn gives us an option to filter relationships when executing algorithms.

CALL gds.graph.create.cypher(
'whole-graph',
// nodeQuery
'MATCH (n) RETURN id(n) AS id',
// relationshipQuery
'MATCH (n)-[r]->(m) RETURN id(n) AS source, id(m) AS target, type(r) as type'

As in the previous blog post, we will start with the weakly connected components algorithm. It is used to examine how many islands or disconnected components there are in our graph, which will help us better understand results from different graph algorithms. Also, sometimes we might want to run other graph algorithms only on the largest connected component.

CALL gds.wcc.stream('whole-graph')
YIELD nodeId, componentId
RETURN componentId, count(*) as size
ORDER BY size DESC limit 10

Results

As there is only one connected component in our graph, we don’t have to worry about skewed results from other graph algorithms.

Drop the projected graph from the catalog

After each analysis, we will release the projected graph from memory.

CALL gds.graph.drop('whole-graph');

Next, we are going to project an undirected weighted graph. Let’s take a look at how does the native projection handle undirected relationships:

  • : each relationship is projected in both natural and reverse orientation

To produce an undirected relationship with cypher projection, we project a single relationship in both directions, effectively allowing the graph data science engine to traverse it in both directions. Let’s take a look at the following example to gain a better understanding. We create two nodes with a single connection between them.

CREATE (:Test)-[:REL]->(:Test);

To project the relationship in both directions, we only have to omit the direction of the relationship in our MATCH statement, and that’s it. A tiny but crucial detail!

MATCH (n:Test)-[r]-(m:Test)
RETURN id(n) as source, id(m) as target;

Results

We will also demonstrate a favorable way to project multiple node labels using a UNION statement in the node query.

CALL gds.graph.create.cypher(
'undirected_interactions',
// nodeQuery
'MATCH (n:Person) RETURN id(n) AS id
UNION
MATCH (n:Thing) RETURN id(n) as id',
// relationshipQuery (notice no direction on relationship)
'MATCH (n)-[r:INTERACTS_1|:INTERACTS_2|:INTERACTS_3]-(m)
RETURN id(n) AS source, id(m) AS target,
type(r) as type, r.weight as weight')

Random walk algorithm

To stray away from the common graph algorithms like PageRank or Louvain, let’s use the Random Walk algorithm. In essence, it mimics how a drunk person would traverse our graph. It is commonly used in the node2vec algorithms. We define Frodo as the start node and then walk five random steps twice.

MATCH (n:Node{Label:'Frodo'})
CALL gds.alpha.randomWalk.stream('undirected_interactions',
{start:id(n), steps:5, walks:2})
YIELD nodeIds
RETURN [nodeId in nodeIds | gds.util.asNode(nodeId).Label] as result

Results

You will get different results as it is a random walk algorithm, after all.

Triangle count and clustering coefficient

Another useful algorithm for analyzing social networks is Triangle Count and Clustering Coefficient algorithm. A triangle composes of three nodes, where each node has a relationship to the other two. The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. This allows us to estimate how tightly-knit nodes in our graph are.

CALL gds.alpha.triangleCount.write('undirected_interactions',
{relationshipTypes:['INTERACTS_1'],
writeProperty:'triangles',
clusteringCoefficientProperty:'clustering'})
YIELD nodeCount, triangleCount, averageClusteringCoefficient

Results

The global or average clustering coefficient is 0.54, which means that the persons in our graph are quite tightly-knit. We can also look at individuals and their local clustering coefficients.

MATCH (p:Person)
RETURN p.Label as person,
p.clustering as coefficient,
p.triangles as triangles
ORDER BY coefficient DESC LIMIT 5

Results

Let’s visualize Thráin and his connections to observe how they form a tight community.

All of Thrain’s contacts have also interacted with each other, hence the 1.0 value of the clustering coefficient.

Drop the projected graph from the catalog

CALL gds.graph.drop('undirected_interactions');

Up until this point, all the above graph analysis could be done with the native projection. Cypher projection can project graphs that exist only at query time or when we want to use more advanced filtering than just node labels or relationship types.

Categorical PageRank is a concept first introduced by Kenny Bastani in his blog post. I have also written a blog post about it using the Graph algorithms library. Now it is time to demonstrate it with the Graph Data Science library as well.

Source: Categorical Pagerank post by Kenny Bastani

The idea behind it is pretty simple. As shown in the example above, we have a graph of pages that have links between each other and might also belong to one or more categories. To better understand the global pagerank score of nodes in a network, we can breakdown our graph into several subgraphs, one for each category and execute the pagerank algorithm on each of that subgraphs. We store results as a relationship property between category and pages. This way, we can break down which are the contributing categories to page’s global pagerank score.

We will start by assuming that each interaction is a positive endorsement(I know it’s actually not, but let’s pretend). We will breakdown our graph into several subgraphs by the class(men, elves) that the characters belong. For example, when calculating the pagerank for the category of men, all nodes will be considered. Still, only relationships that come from characters that belong to the class of men will be considered.

CALL gds.graph.create.cypher(
'categorical_men',
// nodeQuery
'MATCH (n:Person) RETURN id(n) AS id
UNION
MATCH (n:Thing) RETURN id(n) as id',
// relationshipQuery
'MATCH (c:Class)<-[:PART_OF]-(n)-[r:INTERACTS_1|:INTERACTS_2|:INTERACTS_3]-(m)
// Use the parameter
WHERE c.id = $class
RETURN id(n) AS source, id(m) AS target,
r.weight as weight,type(r) as type',
{parameters: { class: 'men' }})

Let us now run the weighted pageRank on this graph.

CALL gds.pageRank.stream('categorical_men',
{relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Label as name, score
ORDER BY score DESC LIMIT 5

Results

Gandalf comes out on top, with Aragorn following very closely in the second place. I am wondering if Aragorn has the most support from men in the third book, as he becomes their king at the end.

CALL gds.pageRank.stream('categorical_men',
{relationshipTypes:['INTERACTS_3'],
relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Label as name, score
ORDER BY score DESC LIMIT 5

Results

As predicted, Aragorn takes the lead. Frodo is no longer on the list as he is quite isolated from everybody in the third book and walks alone with Sam to Mount Doom. To be honest, if you have Sam by your side, you are never lonely though.

Drop the projected graph from the catalog

CALL gds.graph.drop('categorical');

We have only looked at the class of men and calculated the categorical pagerank for that specific subgraph. It would be very time consuming if we projected a subgraph for each class separately. That is why we will project a subgraph for each class in a single named graph using virtual relationship types. In the relationship query, we have the option to return whatever we feel like as the column type. We will return the original relationship type combined with the class of the person to create virtual relationship types. This will allow us to calculate the categorical pagerank for each class.

CALL gds.graph.create.cypher(
'categorical_virtual',
'MATCH (n:Person) RETURN id(n) AS id
UNION
MATCH (n:Thing) RETURN id(n) as id',
'MATCH (c:Class)<-[:PART_OF]-(n)-[r:INTERACTS_1|:INTERACTS_2|:INTERACTS_3]-(m)
RETURN id(n) AS source, id(m) AS target,
type(r) + c.id as type, r.weight as weight')

We can now calculate the categorical pagerank for the class of elves.

CALL gds.pageRank.stream('categorical_virtual', 
{relationshipTypes:['INTERACTS_1elves','INTERACTS_2elves','INTERACTS_3elves'],
relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Label as name, score
ORDER BY score DESC LIMIT 5

Results

If we want to calculate categorical pagerank for each class and store the results, we can use the same approach as in the original blog post, where we saved the results in the relationship between a class and a person.

MATCH (c:Class)
CALL gds.pageRank.stream('categorical_virtual',
{relationshipTypes:['INTERACTS_1'+c.id,'INTERACTS_2'+c.id,'INTERACTS_3'+c.id],
relationshipWeightProperty:'weight'})
YIELD nodeId, score
WITH c, gds.util.asNode(nodeId) as node, score
// 0.15 is the default value for nodes with no incoming links
WHERE score > 0.151
CREATE (c)-[:PAGERANK{score:score}]->(node)

We can now get the top three members for each class based on their categorical pagerank score.

MATCH (c:Class)-[s:PAGERANK]->(p:Person)
WITH c, p, s.score as pagerank
ORDER BY pagerank DESC
RETURN c.id as class,
collect(p.Label)[..3] as top_3_members

Result

Frodo leads with four first places, Gandalf has three first places, and Sam has two. Most of the results are self-explanatory, like the support from dwarf class, where Gandalf leads, with Gimli himself being a dwarf in second place, so he probably has interactions with all the other dwarves. Because Legolas hangs out with Gimli so much, he gets to be in the third place. On the other hand, Sam and Frodo venture into the heart of Orc land by themselves, and have the most interactions with Orcs (not really an endorsement).

Drop the projected graph from the catalog

CALL gds.graph.drop('categorical_virtual');

Cypher projection is very expressive and allows us to project any subgraph of the stored graph as well as any virtual graph that exists only at query time. We are mostly limited only by our imagination. Try out the Graph Data Science library for yourselves and let us know if you have any suggestions or feedback.

As always, the code is available on GitHub.

[chinadefi提示]点击查看原链接

What do you think?

发表评论

电子邮件地址不会被公开。 必填项已用*标注

Loading…

0

Comments

0 comments

A complete NLP classification pipeline in scikit-learn Towards Data Science – Medium

A simpler experimentation workflow with Jupyter, Papermill, and MLflow Towards Data Science – Medium