Learning on Graphs Conference 2024 Notes (Complete)

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p01 P01_Day_1-Yusu_Wang_keynote__KG-LLM__heterophilic_graphs_tutorial__orals -BV1k9pAzpE8S_p1-

Okay. All right, I will let everyone in. Right, now I can see a lot of people. Alright.

Good morning, good afternoon, good evening, wherever you are. Welcome to the tutorial session on integrating knowledge graphs and large language models. You are most welcome to engage with your organizers and each other in the Zoom chat, the Zoom Q&A, or on the tutorial-discussions Slack channel. This session will be one and a half hours long; because there is another event following this session, please do end on time, thank you very much. Please note that the tutorial will be recorded and uploaded to our YouTube channel for those who are unable to join or who would like to rewatch the session in the future, so if you would prefer not to be recorded, you are welcome to keep your cameras off and just participate privately in the chat.

Now without further ado, I will hand it over to our tutorial organizers. Hello, so yeah.

Thanks to everyone who joined our tutorial on integrating knowledge graphs and large language models for advancing scientific research.

So, yeah, I should say good morning, good afternoon and good evening, wherever you are.

I'm Jiaoyan. I will give an overview of this tutorial and then present the first part.

And we have another two speakers: Dr. Qiang Zhang, who will give the second part, and Dr. Zaiqiao Meng, who will give the third part.

So at the same time, let me start. Yeah, this is some brief information about myself.

I'm a lecturer at the Department of Computer Science, University of Manchester.

I mainly work on knowledge graphs, knowledge representation and Semantic Web techniques, and the integration of these symbolic techniques with machine learning and large language models.

You can find my email here, so if you have a question, you can also email me. Okay, before we give the three parts, we want to first give you some idea of why we want to combine knowledge graphs and large language models.

Large language models are very successful in many aspects. For example, they have general knowledge, they are very good at language processing, and they have high generalizability. But we also know that large language models still have some problems, like implicit knowledge, hallucination, indecisiveness, and their black-box nature. They may also be short of some kind of domain-specific knowledge, or short of new knowledge.

But we find that knowledge graphs can actually complement large language models in these aspects.

Knowledge graphs have structured knowledge, which is usually quite decisive; knowledge graphs can support interpretable decision making; knowledge graphs can have domain-specific knowledge; and it is also easy for them to deal with evolving knowledge. So this is the general perspective on why we try to combine knowledge graphs and large language models.

I also want to give some perspective from the dimension of scientific knowledge and text understanding.

So you see, this is a very typical pipeline of using a large language model for task addressing: we usually pre-train a large language model on a lot of corpora, fine-tune it with task-specific data, and then use the large language model for prediction or generation.

However, there are several problems in scientific knowledge and text understanding.

The first is newly discovered knowledge. We may find some new scientific phenomenon, and how can we let the large language model know these new phenomena and new facts? And also, you know, knowledge will be updated: we have new facts, and these new facts will replace the old knowledge and update human knowledge. Then how can we deal with this updated knowledge? For both newly discovered knowledge and updated knowledge, there may not be enough training samples; then how can we let the large language model capture this new knowledge?

The third issue is long-tail knowledge. There are, for example, rare diseases, or phenomena that do not frequently appear. How can we get enough training samples to let the language model capture this knowledge?

So all these three issues are big problems when applying large language models to scientific text processing.

We can utilize knowledge graphs to deal with all these things, because a knowledge graph should be easier to update. So if we have some mechanism to combine knowledge graphs and large language models, we can more easily deal with all these three issues.

After giving the motivation, I want to give the presentation of the first part: knowledge graphs for science. This part includes three components. The first is about the knowledge graph definition and some core concepts. Then I will give a simple case of using a knowledge graph for ecotoxicological effect prediction. And finally, I will very briefly review knowledge graphs for life science and the challenges of life science knowledge graphs.

Let's start from the definition of a knowledge graph. In 2012, Google proposed the term knowledge graph: it is a kind of knowledge base used by Google to support its search engine results. For example, if you search for Manchester Baby on Google, it returns not only the web pages about Manchester Baby, but also the structured data on the right side, such as the date introduced, developer and memory; all this information is supported by Google's knowledge graph.

In this definition of knowledge graph, knowledge actually refers to instances and facts. Instances could be entities like Manchester Baby; facts could be relational knowledge, like the developers of Manchester Baby include Tom Kilburn.

So such a knowledge graph is quite close to, or its definition is quite equivalent to, linked structured data: we link the data, and they form a multi-relational graph. Then I want to give a knowledge representation perspective on the definition of a knowledge graph.

We need different knowledge representation languages to build a knowledge graph.

The first one is RDF, the Resource Description Framework. It is used to model facts with three components: subject, predicate and object. For example, we can have Manchester Baby as subject, hasDeveloper as predicate and Tom Kilburn as object; then we can describe the fact that Manchester Baby was developed by Tom Kilburn.

You can find that if we have a set of RDF triples, they can form a graph, which is multi-relational, as shown on the right side: we can have the relations among the three entities about Manchester Baby and its developers.
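To make the triple idea concrete, here is a minimal sketch (my own illustration, not from the talk) of encoding these facts as RDF triples with the Python rdflib library; the namespace and property names are assumptions for illustration.

```python
# A minimal sketch of building the Manchester Baby facts as RDF triples with
# rdflib (assuming rdflib is installed). The namespace and the hasDeveloper
# property are illustrative, not from an existing vocabulary.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/kg/")
g = Graph()

# Each fact is a (subject, predicate, object) triple.
g.add((EX.Manchester_Baby, EX.hasDeveloper, EX.Tom_Kilburn))
g.add((EX.Manchester_Baby, EX.hasDeveloper, EX.Frederic_Williams))
g.add((EX.Tom_Kilburn, EX.worksWith, EX.Frederic_Williams))

# A set of such triples forms a multi-relational graph.
for s, p, o in g:
    print(s, p, o)
```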

Just as databases have schemas to describe their metadata, RDF facts also have RDF Schema to describe their meta information. For example, we can define that Alan Turing is a mathematician, Tom Kilburn is a computer scientist, and mathematician and computer scientist both belong to scientist. We can also define the domain and range of each property, to constrain what they connect.

Besides a schema, we can have another more advanced language called the Web Ontology Language (OWL). It is widely used to define taxonomies and vocabularies. For example, the ontology called FoodOn defines all kinds of food terminologies; another example is SNOMED CT, which is widely used for defining all kinds of medical terms. You may also know the Gene Ontology and the Disease Ontology. On the right side, I show a picture of a segment of the FoodOn ontology; you can see the categorization of food products from plant food product down to leaf classes such as gluten soy bread.

A Web Ontology Language ontology can not only describe such a taxonomy, but also constrain logical relationships, which are underpinned by description logic. That means it can define logical relationships with logical operators like conjunction, disjunction, existential quantifier, universal quantifier and negation. For example, we can define that food material is equivalent to environmental material in conjunction with something that has a role of food. And we can also define that the cardinality of hasParent is two, which means every person should have exactly two parents.
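To make these two example axioms concrete, here is roughly how they would look in description logic notation (my reconstruction with assumed class and property names, not the exact axioms from the slides):

```latex
% Equivalence using conjunction and an existential restriction:
\mathrm{FoodMaterial} \equiv \mathrm{EnvironmentalMaterial} \sqcap \exists\, \mathrm{hasRole}.\mathrm{Food}

% Exact cardinality: every person has exactly two parents:
\mathrm{Person} \sqsubseteq \;=\!2\; \mathrm{hasParent}.\mathrm{Person}
```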

So what is a knowledge graph? Let's revisit this term. By Google's definition, it is actually a set of RDF facts, and they form a multi-relational graph.

Many people would think that we can add a schema to the RDF facts, so a schema could also be part of a knowledge graph. And many researchers in the KR community, the Semantic Web community, or the traditional AI community would also think that ontologies, or ontologies in OWL, could be regarded as a knowledge graph. That means this is a graph in conjunction with some reasoning support, or we can say an OWL ontology is a kind of logic-equipped knowledge graph.

After we know what a knowledge graph is, let's take another look at why we try to use knowledge graphs. Here are some good features of knowledge graphs, with regard to traditional data management systems like relational databases.

First, it is more intuitive: it does not rely on foreign keys but uses a graph view, and it can support a schema with a logic-equipped ontology. It is usually used to connect things of the world instead of isolated local data. It is also more flexible and extensible: you can easily merge two graphs by linking the entities in their nodes. It also supports rule languages for reasoning over the graph, like Datalog rules, and it can support other features that utilize the characteristics of the graph, for example navigation over the graph, calculating similarity over the graph, and utilizing the structure of the graph for solving various kinds of problems.

So now, knowledge graphs are good, but people often complain:

constructing a knowledge graph is very costly. So there are different ways of constructing knowledge graphs.

The first is directly using crowdsourcing. One example is Wikidata, which is continuously added to and edited by volunteers all over the world. There are some other domain ontologies, like the Gene Ontology mentioned above, created by human experts. Of course, we can also create a knowledge graph from existing formal or structured data; for example, the knowledge graph DBpedia is built utilizing the infoboxes of Wikipedia.

So briefly, the first solution for knowledge graph construction is crowdsourcing, while the second solution is more automatic. Researchers in open information extraction or web mining try to automatically extract instances, relations and facts from unstructured text or semi-structured web data. They utilize techniques like named entity recognition and linking, and relation extraction. They may also utilize the features of the data on the web, like the structure of the HTML page, to extract knowledge from web resources. This is automatic knowledge graph construction.

As you know, much of the data of enterprises or individuals is now stored in some kind of structured or semi-structured format, for example databases, web tables, Excel sheets, CSV files. So how can we construct a knowledge graph from all these existing resources? There are two solutions. One is that we can define some rule-based or agent-based solutions to transform tables into a knowledge graph.

This builds the graph from scratch. Another solution is knowledge graph population, which is perhaps more practical. Assume, as in this figure, that we have an existing knowledge graph about Formula 1: we have some racing drivers, as well as their countries and teams. And we have another table with more information about Formula 1.

We can do two steps to populate the knowledge graph with the table. The first is matching: we can match the type of the first column with the concept of racing driver, and we can match the relationship between the first column and the second column to the property racesFor.

Then we may find that some cells in this table have no corresponding knowledge in the knowledge graph. For example, Hamilton has no corresponding entity. Then we can extract this new entity from the table, and we can extract the fact that Hamilton lives in England as a new fact into the knowledge graph. So this is knowledge graph construction, or population, utilizing structured or semi-structured data.
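Here is a toy sketch (purely illustrative, with hypothetical entity, concept and property names) of the two-step population process just described: first match columns to KG concepts and properties, then add the cells and facts that have no counterpart in the knowledge graph.

```python
# Toy sketch (illustrative only) of table-to-KG population:
# (1) match table columns to KG concepts/properties,
# (2) add the cells and facts that have no counterpart in the KG.
existing_entities = {"Verstappen", "Red Bull", "Netherlands"}  # hypothetical KG content
kg_facts = set()

# Step 1: matching (assumed to be produced by a matching model).
column_concept = {0: "RacingDriver", 1: "Team"}
column_relation = {(0, 1): "racesFor"}

table = [
    ["Verstappen", "Red Bull"],
    ["Hamilton", "Mercedes"],  # Hamilton has no corresponding entity in the KG yet
]

# Step 2: population with new entities and new facts.
for row in table:
    for cell in row:
        if cell not in existing_entities:
            existing_entities.add(cell)  # extract the new entity, e.g. Hamilton
    kg_facts.add((row[0], column_relation[(0, 1)], row[1]))  # add the new fact

print(column_concept, existing_entities, kg_facts, sep="\n")
```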

Of course, we may have several knowledge graphs or ontologies. How can we integrate some knowledge graphs or ontologies into a new knowledge graph? There are several steps. One is mapping: we need to find the equivalent entities between two knowledge graphs. Then we need to do modularization: we need to extract a subset of the knowledge from one knowledge graph as the knowledge we want for building the new knowledge graph. And we also need to do something like canonicalization: for example, if you have two entities which are regarded as equivalent, you need to merge the two entities to make the new knowledge graph more formally represented.

So this is data integration for knowledge graph construction.

Here is a small advertisement for our own work on DeepOnto. This is a language-model-based ontology engineering library. It provides tools that you can use for ontology alignment, for ontology completion, and also for ontology population utilizing external resources like documents. To implement these tools, we also provide basic APIs for ontology processing. Originally, ontology processing APIs are mostly based on Java, so we re-implement or package these APIs in Python, such that they can be easily integrated with NLP or deep learning libraries for various kinds of functions.

The second component of part one is a simple case study of utilizing a knowledge graph for ecotoxicological effect assessment.

Ecotoxicological effect assessment at the Norwegian Institute for Water Research is based on experiments. Assume we have some exposure to chemicals in the water; then we can compute some risk quotients and find some potential effects between species and chemicals. Then we go back to the lab and test these potential effects between species and chemicals, and we find that some do have effects and some do not. We can then go back again to combine exposure and effects, analyse other related chemicals and other species, and do another round of experiments.

But when there are more species and more chemicals to consider, the combination space is big and you need a lot of experiments for testing. It is a huge waste of experimental resources, and it also has some ethical issues.

So we proposed to use knowledge graph embedding and link prediction to address this problem. This is a PhD project.

The idea is to construct a knowledge graph with the species knowledge and the chemicals knowledge, together with some existing experimental results about chemical-species effects. We build this knowledge graph, and we can then predict further links between species and chemicals to find the potential effects. In this way, we can do a filtering before we run experiments.

In this project, we mainly focus on one effect, which is called mortality. This is the lethal concentration of the chemical for 50% of the test population, where death or the loss of the next generation is measured at 48 hours.

So how can we construct this knowledge graph? We need to select the sources. For the ecotoxicological side, we select the ECOTOX ecotoxicology database, which covers about 1 million tests, 12,000 compounds (mainly chemicals) and 13,000 species. Only about 0.6% of the chemical-species pairs have experimental results, which means for 99.4% of the pairs the effect knowledge is missing, and we need to use the embedding model to predict them. For the biological side, we use the NCBI Taxonomy of species to get more features of the species, and we also use the Encyclopedia of Life. For the chemical side, we use three ontologies: PubChem, ChEMBL and MeSH.

Then how can we integrate the different resources? First, we use the resources on Wikidata, which provides some mappings between species from the Encyclopedia of Life, the ecotoxicology database and species in the NCBI Taxonomy, for example. Second, we use ontology alignment tools: we use LogMap and AML, which are two very typical ontology alignment systems based on lexical matching, indexing and symbolic reasoning. This project was conducted five years ago, and now we have some alternatives, like BERTMap, introduced above in DeepOnto.

对啊。So this is a picture of the knowledge graph we constructed。

So this knowledge graph actually includes three parts。

The one part is about is a sub sub graph about taxonomy。

The second part is a sub graph about chemicals, and the third third knowledge graph is about third sub graph is about the effects between chemicals and species。

And this knowledge graph is, is managed as audio of triples, and it can。

it can be accessed by sparkle queries。 And we have open sourced this knowledge graph on。

On thato and also, as well as the codes for constructing this knowledge graph。So this。

here are some triples about this knowledge graph。 I think I can omit them。

And this is a picture about a segment of knowledge graph around a specific test。

So this is an experimental test。 And you can find it is it is for specific this specific chemical。

And this is it is for specific this。😊,Fcious, and you can find the。

this testing it happens in the water。 And also the result, this is the result。

whether it has a mortality relationship under specific concentration。With the large graph。

we can do link prediction. We adopt two solutions. One is very naive: just feed the embeddings pre-trained by some embedding algorithm like TransE into a fully connected neural network. Here we use two fully connected network variants. One is quite simple: it just concatenates the embeddings of the chemical and the species and feeds them into a fully connected layer. The other adds a separate fully connected layer for the embedding of the chemical and of the species, concatenates the outputs, and further feeds them into another fully connected layer.

The second solution is end-to-end: it simultaneously trains the knowledge graph embeddings and the multilayer perceptron. It has three losses. The first loss is the embedding loss over the triples of the chemical sub-knowledge-graph; the second loss is for the triples of the species sub-knowledge-graph; and the third loss compares the predicted output of the multilayer perceptron with the effect in the effect sub-knowledge-graph. That means we train the multilayer perceptron together with the chemical and species sub-knowledge-graph embeddings.
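As a rough illustration of the first, naive solution, here is a minimal PyTorch sketch (my reconstruction, not the authors' code) where pre-trained embeddings of a chemical and a species, e.g. from TransE, are concatenated and passed through fully connected layers to predict the effect; all dimensions and names are assumptions.

```python
# Minimal sketch of the "naive" solution: concatenate pre-trained KG
# embeddings (e.g., TransE vectors) of a chemical and a species, and feed
# them to a small MLP that predicts whether a mortality effect is likely.
import torch
import torch.nn as nn

class EffectPredictor(nn.Module):
    def __init__(self, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, chem_emb: torch.Tensor, species_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([chem_emb, species_emb], dim=-1)  # concatenate the two embeddings
        return torch.sigmoid(self.mlp(x))               # probability of an effect

# Toy usage with random vectors standing in for pre-trained TransE output.
chem, species = torch.randn(4, 100), torch.randn(4, 100)
model = EffectPredictor()
print(model(chem, species).shape)  # torch.Size([4, 1])
```

In the end-to-end variant described above, the same prediction head would be trained jointly with the two sub-knowledge-graph embedding losses instead of using frozen pre-trained embeddings.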

We explored quite a few knowledge graph embeddings, ranging from TransE to RotatE, and we also explored different sampling strategies. In the end, we can take a look at a segment of the results: we find that, under the best setting, the sensitivity could be larger than 0.9 and the specificity could be larger than 0.75, under the combination of these two embedding algorithms,

right. Before I go to the third component, I would like to discuss using knowledge graphs for addressing such scientific discovery problems from a knowledge graph perspective.

The first aspect is knowledge graph construction. It is always hard to construct a knowledge graph covering all the data sources. Currently we use some structured data and existing knowledge graphs, but can we consider literature and reports? They may include new scientific knowledge, and how can we extract these new scientific facts and add them into the knowledge graph? There are also quite a few specific experimental systems, which use specific languages for annotating the output results; how can we extract these results and insert them into the knowledge graph? And then, how can we consider multi-modal data besides the RDF data: can we consider text, images, or even sequences like gene sequences?

For link prediction, of course, we try to get better results, which means not only accuracy but also explanation. Can we use more advanced embedding techniques? Can we consider large language models for knowledge graph embedding together with interpretation? And can we consider symbolic reasoning in conjunction with these embedding-based techniques?

In the end, I will use two minutes to introduce a brief review and the challenges of knowledge graphs for life science. There is just one page to introduce this, and it summarizes a recent perspective paper called "Knowledge Graphs for the Life Sciences: Recent Developments, Challenges and Opportunities". It includes three parts: a literature review of knowledge graph construction and management in life science, of life science knowledge discovery, and of knowledge graphs for explainable AI.

In this perspective paper, we reviewed several kinds of knowledge graphs: schema-less knowledge graphs, schema-based knowledge graphs, simple ontologies like taxonomies, and expressive OWL ontologies with logics. We also discussed quite a few challenges, like scalability in scientific discovery together with knowledge graph construction, human interaction and explainability, multi-domain and multi-modality, and also representation learning: how to combine symbolic and sub-symbolic representations.

Finally, I need to mention that this perspective paper is published in a new journal called Transactions on Graph Data and Knowledge. This is a fully open-access journal; in the Semantic Web community, this new journal is intended to replace the original Journal of Web Semantics, as the latter is not open access.

Okay, so this is all for my part, and perhaps I do not have time to answer questions; you can ask questions by emailing me, or you can ask questions in the Slack channel. I am on the Slack channel, and we have created a channel exactly for this tutorial, so you can ask questions in that channel. We will also upload the slides. So, yeah, that's all for my part. And if we have time in the end, after all three parts, we will also be in the Zoom, and you can also ask questions at that time.

So I would like to give the floor to Dr. Qiang Zhang for the second part. Okay, thank you. Yes.

So, can you hear me? And you are also able to see my screen, I believe. Can you see my screen? Yes, it looks good to me. Okay, cool. So let's get started; let's move on to the second part of this tutorial.

Hi everyone, my name is Qiang Zhang, from Zhejiang University.

For the second part of this tutorial, I would like to guide you through this part, in which we will focus on scientific large language models.

To give you some background, I'm currently a tenure-track assistant professor at Zhejiang University. Before that I got my PhD degree from University College London in the UK, and I also worked as a postdoctoral research fellow at UCL. Even earlier, I got my master's degree from the Chinese Academy of Sciences and my bachelor's degree from Shan University. My research interest is in data-efficient machine learning, foundation models, as well as some applications in natural language processing, knowledge graphs and AI for science.

so my research interest is in data efficient machine learning。

foundational models as well as some applications in natural learning process in knowledge graphs and AI for science。

Okay, so。In this, so here's the road map for these second part。

I will start with a brief introductory presentation of scientific large language models followed by some technical details of how large language models can be leveraged and so。

Can be leveraged to deal with scientific data, such as proteins and moleculars。

And in the in the in the third section, I will summarize the current technical challenges in this field and will also point out some promising direction research directions in the future。

So let's get started with the first section。 so as we know large language models have revolutionized artificial general intelligence。

so it originates from the natural language processing area before the emergence of large language models。

people used upstream techniques such as syntactic parsing and semantic analysis to try to make computers understand human languages. On top of that, researchers could conduct downstream tasks such as question answering, knowledge extraction and complex reasoning. But after the emergence of large language models, a lot of things have changed. With large language models such as GPT, GLM and LLaMA, people don't need to deal with the previous upstream techniques; they can simply use ChatGPT or Claude to answer questions, extract knowledge and build knowledge graphs.

After that, more advanced large language model systems have been developed and proposed. For example, HuggingGPT is one of these more complex model systems. In the HuggingGPT framework, GPT acts as a controller: it conducts task planning when it receives a user request, it selects proper AI models according to their function descriptions and executes each subtask, and then it gathers the responses from these AI models and finally outputs the answer that the user expects. Another model is AnyGPT, which was published earlier this year: it tokenizes multimedia data into discrete tokens, and in that way it can achieve any-to-any transformation between multimedia data such as images, text, video, audio and so on.

And a lot of you must have heard of Sora, which is developed by OpenAI. It is a very complicated multimodal large language model, and it can generate coherent videos according to human textual descriptions, which is really amazing.

But general-purpose large language models are still kind of limited. Although they are good at understanding, processing and generating human languages, they are not able to understand scientific data such as protein sequences, genomes and chemical molecules.

Suppose we know the name of a protein, we also know the gene, and we know the sequence of the protein, and we input all of this information into a general-purpose large language model such as ChatGPT and ask the model: what is the solubility of this protein? Well, in most cases these large language models are not able to give a proper answer. This is basically because these models were not properly trained with such scientific corpora. In the words of Ludwig Wittgenstein, "the limits of my language mean the limits of my world": because these general-purpose large language models were not trained with this scientific data, it is reasonable that they are not able to properly understand and deal with it.

That's why people try to extend the limits of general-purpose large language models and make them grasp the ability to deal with scientific data. So basically,

there are some similarities between human languages and scientific languages. In biology, researchers and scientists usually deal with proteins and genes. For proteins, there are 20 amino acids, which make up the vocabulary, while for genes there are four different types of nucleobases, and these nucleobases make up the vocabulary of the genome language.

Similarly, in chemistry, chemical scientists have also formulated their own molecular language systems, for example SMILES, DeepSMILES and SELFIES. They have different grammars and vocabularies, but basically they use different symbols and characters to denote atoms, chemical bonds and substructures, and they also have their own structures to define the validity of a molecule.

Okay, so when we want to develop a scientific large language model, we usually make a comparison with human-language large language models. In human language processing, there is a very

fundamental hypothesis, which is called the distributional semantics hypothesis: it means that the meaning of a word is determined by its context. With this hypothesis, when researchers try to model the joint distribution of a sequence, they have developed basically two types of architectures: one is BERT and the other is GPT. For the BERT architecture, people mask some tokens and then train a model to recover the masked tokens according to the provided context. With the GPT architecture, the model is trained to autoregressively generate the next token using the previously generated ones.
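A toy illustration of the difference between the two objectives, applied to a short amino-acid "sentence" (no real model is trained here; this is just to show what the inputs and targets look like):

```python
# BERT-style masking versus GPT-style next-token prediction on a toy
# protein sequence over the amino-acid vocabulary. Purely illustrative.
import random

sequence = list("MKTAYIAKQR")  # toy protein "sentence"

# BERT-style: mask some tokens; a model would be trained to recover them.
masked = sequence.copy()
mask_positions = random.sample(range(len(masked)), k=2)
for i in mask_positions:
    masked[i] = "[MASK]"
print("masked input :", " ".join(masked))
print("targets      :", {i: sequence[i] for i in mask_positions})

# GPT-style: the model predicts each next token from the previous ones.
for t in range(1, len(sequence)):
    context, target = sequence[:t], sequence[t]
    print(f"context {''.join(context):<10} -> next token {target}")
```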

Okay. Similarly, we could also develop such architectures, but we need to consider whether the distributional hypothesis still holds when we deal with molecules and proteins. A lot of researchers say maybe: we can assume this hypothesis holds and continue with it. This motivates a lot of researchers to study scientific large language models, and I will introduce you to some technical advancements in this field.

The scope of scientific large language models can be very broad. First of all, we can definitely include human language in this area, but such human language could be somewhat different from the general domain: we usually use text and human languages from research papers, from patents or from textbooks. Then in chemistry we see chemical molecular languages, and in biology we usually need to deal with protein language and gene language. Of course, in some cases we can combine or mix up these different types of scientific language, and then we have a multimodal scientific language system. Finally, we can also talk about scientific agents and embodied AI,

which are kind of different from the general domain. Yes. Yeah.

Hey, Qiang. Okay, fine. We lost the voice, right? Yes, we lost the sound; I was just double-checking that it wasn't just me. Qiang, can you hear us? Hello, Chen? I think he encountered some Internet problem. Sure, yeah, I'm messaging him, let's see. I think we can also answer some questions if you have any; some participants messaged me and I have replied to one question, so if you have more questions, you can also ask. Okay, Qiang is here. Okay, and we can continue. Hi, can you hear me? Yeah. Okay, sorry about the Internet connection. So where should I continue my talk, from this slide or even earlier, here or here? We got to the part where you were talking about scientific agents and embodied AI. Okay, sure, cool, yeah.

Yeah, so because of the broad scope of scientific large language models and the limited time, I'm not able to cover every part, so I'm just going to focus on two aspects: one is scientific text and the other is biological proteins.

Let's start with scientific text. Basically, scientific text is still text, it is human language, and researchers have developed a very common way of handling it. For example,

in biology and chemistry, researchers usually use BERT, GPT or GLM architectures to pre-train and then fine-tune their models. This table summarizes some good models in biology, chemistry and even some broader comprehensive areas. It shows the names of these models, the publication time, the number of parameters, the base model they used, the pre-training datasets, and whether these models are open source or not. Here we also list some general corpora that are often used in biology and chemistry.

Usually I would like to introduce bioioB, which is pretrained on Pubcam and PMC and fine tune on some task specific data sets。

so it is good at understanding text, it is good at named entityT recognition。

relation instruction and question answering, so with the named entityT recognition and relation instruction。

people can use it to build knowledge graphs。Okay, and。

Decoder only architecture is another commonly used architecture。For so for this architecture。

I would like to introduce the div series model。 It, it is, it is based on the 7 B base model。

and researchers have tried to。Build a fine tuning datas。And some question and answering data sets。

some instruction data sets to fine too the base model, So at the end。

the Darwin MDP model is very good at answering scientific questions。And go at summarization, okay。

In terms of the encoder-decoder architecture, SciGLM is a good model, published early this year. The authors of this paper propose a self-reflective instruction dataset to make the model reflect on its output and try to correct the misleading parts of the output. They have also examined whether a larger model leads to better performance, and the answer is yes; this figure shows the results.

Okay. And when we try to evaluate such textual scientific large language models,

researchers have developed a lot of benchmarks. This table summarizes the typically used benchmarks, including MMLU, C-Eval, AGIEval, SciQ, Xiezhi, SciEval, ScienceQA, SciBench, and SciAssess. Basically, these datasets contain questions and answers from different subjects and different levels, and they can be in different formats. One benchmark that I want to highlight is SciKnowEval: it draws ideas from the traditional Chinese philosopher Confucius, and there are five different levels in this evaluation benchmark, which are knowledge coverage, knowledge exploration, reasoning, safety, and application. According to this benchmark, the GPT-series models are still among the best-performing models.

So that is the textual scientific large language models; the other topic is protein language models. A protein language is kind of similar to human language: it can be formulated as a sequence of symbols, but it has its own characteristics. A protein has four different levels of structure: the primary structure is a sequence of amino acids, and it also has secondary, tertiary and quaternary structures. This table summarizes the recently published protein large language models. They can still be organized into three different architectures: encoder-only, decoder-only, and encoder-decoder. We also list the publication time, the parameters of these models, the base model they use, the pre-training dataset, what kind of capabilities they have, and whether these models are open source or not. Their parameters are usually at the level of 100 million, 1B, 10B or even 100B, but most of them are at the level of 1B to 10B.

The training data can come from public databases such as UniRef, Pfam and Swiss-Prot. With these pre-trained large language models, researchers can do protein function prediction, protein family prediction, and amino-acid contact prediction. People can also use them to design novel protein sequences, for example sequence optimization, protein de novo design, and inverse folding.

For the encoder-only models, I would like to highlight the ESM model from Meta AI. It is very similar to the BERT architecture, but it also incorporates evolutionary information from the protein domain. The ESM model basically bakes the sequence information into the pre-training objective function. Another recent model is PromptProtein, which tries to incorporate more structural information into the pre-training objectives. Here there are three different pre-training objective functions: one is still very similar to BERT, masked language modeling, but the other two are 3D structure prediction and quaternary structure prediction, and in that way more structural information is incorporated during the pre-training stage.
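As a minimal usage sketch (assuming the HuggingFace transformers library and the public facebook/esm2_t6_8M_UR50D checkpoint), this is roughly how one would obtain per-protein embeddings from an encoder-only protein language model such as ESM-2 for downstream prediction tasks:

```python
# Embed a protein sequence with a small public ESM-2 checkpoint.
# Checkpoint name and pooling choice are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-residue hidden states into one vector per protein,
# which can then feed downstream tasks such as function prediction.
protein_embedding = outputs.last_hidden_state.mean(dim=1)
print(protein_embedding.shape)
```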

In terms of the decoder-only architecture, the ProGen model is a very good one. It was published in Nature Biotechnology last year. It follows the GPT architecture: it accepts control tags as input and generates protein sequences as output, and the generated protein sequences are supposed to have the functions specified by the control tag, so there is a one-to-many correspondence. Because of this strong capability, and because the generated protein sequences are usually novel, it has been widely applied to enzyme design and therapeutic protein design, like antibody design.

Okay, and for the encoder-decoder architecture,

antibody body design。Okay, and for the encoder decor architecture。

so extreme more P G L M is a very large scale protein large language model。

It has the 100 B parameters, which I believe is the largest。The the largest protein language model。

and it adopts the DLM architecture, which allows it to。To。

Do the protein understanding tasks and also generate normal protein sequences。Okay。

This table summarizes the datasets for protein pre-training and benchmarking. For pre-training, researchers usually use the UniRef datasets and also some structure-related datasets like AlphaFold DB. In terms of benchmarking, researchers can use discriminative tasks like classification and regression, but the generation ability of large language models is more attractive and more interesting, and this leads to another problem: how can we evaluate the generated protein sequences? There are some computation-based evaluation metrics, such as novelty, diversity, foldability and recovery. With these metrics, computational scientists can compute whether the generated protein sequences are novel or not, how similar the generated sequences are to existing sequences, and whether these protein sequences could fold into a given 3D structure, and so on. But such computation-based evaluation metrics are not that reliable, and in most cases we still need to do wet-lab experiments.

Okay, so those are the technical details about scientific

we still need to do we level experiments。Okay, so that is the technical details about scientific。

large language models。 and then I will move on to the third section。

and I will summarize the challenges and some perspectives in this field。

So we have talked about scientific symbols and languages, for example, chemical moleculars。

proteins and genomes, they have their own structures。

they have their own vocabularies and grammars and accordingly researchers have developed some molecular languages。

genome languages and protein languages and some multimod scientific languages。

There are still some interesting problems that we can explore in the future。 For example。

compared to natural language, large language model。

the skill and quality of the and fine data sets in scientific data it is not that gold and more importantly。

theres there's a lack of cross cross model data sets。 So we must build such data by ourselves。

If we want to build some if we want to develop some multimodal large language scientific large language model。

Okay, and then we need to deal with the longer sequences。

Usually protein sequence and genome sequence contains so they are much longer。

they have like hundreds or even thousands of tokens in a sequence。

which is much longer than human language sentence Okay and then we we can somehow we must incorporate and properly utilize the threeD information and then the autoregressive learning problem could also be improved because well human human speak word by word from left to the right or from left or from right to the left。

but protein sequences and genomes are not so theres no it is not unique directional okay and for model evaluation。

we still。Most in most cases, we still rely on we lab experiments。

but I think a good way is to try to reduce such dependency。And make the iteration faster Okay。

and we also need to think about data privacy modelbi and give equal access to different organizations and and researchers。

Okay, so to help you have a better understanding of this tutorial, I will also give you some links and some relevant materials. Here is a companion survey that we have uploaded to arXiv, and also a GitHub repository and some other surveys, like a general introduction to AI for science, chemical molecules and biological proteins. Okay, I believe that is the end of this second part. Thank you very much for your attention and participation.

The next part will be handed over to Dr. Zaiqiao Meng. Okay, thank you. Let me share my screen. Yeah,

I think I will need to stop sharing. Can you see my screen? Okay. Right. Okay, so hello, good morning, good afternoon, good evening everyone. My name is Zaiqiao; I'm a lecturer at the University of Glasgow. I'm doing research on AI for biomedicine, information retrieval, knowledge graphs, large language models, LLM-based agents, and AI for scientific discovery. In my part, I will be covering some knowledge incorporation frameworks for LLMs, KG integration for scientific NLP tasks, and also scientific prediction tasks.

In part one, Jiaoyan has already discussed the pros and cons of LLMs and knowledge graphs. In general, combining knowledge graphs and LLMs has three paradigms. The first paradigm is KG-enhanced LLMs, which use KGs to provide factual knowledge or domain ontologies for LLMs, to improve their accuracy or symbolic reasoning abilities; the output of such a paradigm is an LLM. The second paradigm is LLM-augmented knowledge graphs, where we use LLMs to generate knowledge or to do some language processing to improve generalization abilities, and the output will be KG-related tasks over the knowledge graphs. The third paradigm is synergized LLMs plus KGs, where we use LLMs to learn the neural representation of KGs and the KGs provide facts to the LLMs. In this tutorial, I will only cover KG-enhanced LLMs, which incorporate KGs during different stages of LLMs for the purpose of enhancing the understanding of the knowledge learned by LLMs. This includes topics such as KG-enhanced pre-training, KG integration during fine-tuning, knowledge editing, in-context learning, etc.

So actually, integrating KG technologies can occur at any stage of the development of LLMs.

For example, it can happen during the pre-training stage, where we can indirectly use knowledge graphs as part of the pre-training corpus; some examples include pre-trained KG embeddings and MolXPT. It can also happen during post-training (I mean after pre-training), where we can design a separate objective specifically for knowledge graphs or other information; examples include SapBERT, where synonym information is injected into a BERT model to enhance its entity-linking ability, and also MoP (Mixture of Partitions) and OntoProtein, etc. It can also occur during the fine-tuning stage, where we fine-tune the LLMs directly with KG labels so that we can inject knowledge for some downstream tasks; examples include FusionDTI, and SapBERT also has a lot of applications where it can be fine-tuned for downstream tasks. And it can happen during the inference stage, where we don't need to do any training but just put the knowledge into the LLMs as a prompt. Especially at the current stage we have retrieval-augmented generation: we can retrieve some knowledge graph facts or subgraphs and then use RAG to enhance the performance and inject knowledge into the language models; some examples include MedRAG and BioRAG.
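Here is a minimal, generic sketch (not any specific system mentioned above) of what inference-time KG injection looks like: retrieve a few triples relevant to the query, verbalize them, and prepend them to the LLM prompt. The retriever and the toy triples are purely illustrative.

```python
# Minimal RAG-style KG injection at inference time: retrieve triples
# relevant to the query and place them in the prompt. The retriever here is
# a naive keyword-overlap stand-in for a real dense/sparse index.
def retrieve_triples(query, kg, k=3):
    scored = [(sum(w.lower() in " ".join(t).lower() for w in query.split()), t) for t in kg]
    return [t for s, t in sorted(scored, reverse=True)[:k] if s > 0]

kg = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("metformin", "treats", "type 2 diabetes"),
]

query = "Does aspirin interact with warfarin?"
triples = retrieve_triples(query, kg)
context = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)

prompt = (
    "Answer the question using the knowledge graph facts below.\n"
    f"Facts:\n{context}\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # this prompt would then be sent to the LLM of choice
```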

In the second part of the tutorial, Dr. Zhang has already discussed the bi-encoder. The bi-encoder offers efficient encodings of individual entities, which can speed up retrieval and computation, but may sacrifice fine-grained interactions between the two inputs; so it achieves higher efficiency compared to the cross-encoder. The cross-encoder encodes the entities jointly and captures more detailed interactions, at the cost of greater computation resources and time, but it can achieve higher effectiveness than bi-encoder models. These are the basic architectures that we can use to inject knowledge.

Here is an example of using bi-encoders for drug-target interaction prediction. Here we try to model the interactions between the protein and the drug: we use the SaProt protein encoder to encode amino acids into embeddings, and for the drugs, which are actually in the SELFIES representation (a kind of string representation of molecules), we encode them with a molecular language model such as SELFormer. The outputs are then passed to a token-level fusion module to effectively learn the fine-grained information for the drug-target interactions.
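A simplified PyTorch sketch of the bi-encoder idea for drug-target interaction (not the actual FusionDTI code; in the real model the two encoders would be pre-trained protein and molecular language models and the fusion is token-level):

```python
# Bi-encoder drug-target interaction sketch: encode protein tokens and drug
# (SELFIES) tokens independently, pool, then fuse and score the interaction.
import torch
import torch.nn as nn

class BiEncoderDTI(nn.Module):
    def __init__(self, prot_vocab=30, drug_vocab=60, dim=64):
        super().__init__()
        self.prot_emb = nn.Embedding(prot_vocab, dim)
        self.drug_emb = nn.Embedding(drug_vocab, dim)
        self.prot_enc = nn.GRU(dim, dim, batch_first=True)
        self.drug_enc = nn.GRU(dim, dim, batch_first=True)
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, prot_tokens, drug_tokens):
        p, _ = self.prot_enc(self.prot_emb(prot_tokens))  # encode protein independently
        d, _ = self.drug_enc(self.drug_emb(drug_tokens))  # encode drug independently
        p, d = p.mean(dim=1), d.mean(dim=1)                # pool token states
        return torch.sigmoid(self.fusion(torch.cat([p, d], dim=-1)))  # interaction score

model = BiEncoderDTI()
prot = torch.randint(0, 30, (2, 50))  # toy amino-acid token ids
drug = torch.randint(0, 60, (2, 20))  # toy SELFIES token ids
print(model(prot, drug).shape)         # torch.Size([2, 1])
```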

MolXPT is another example, using a cross-encoder, for molecular property prediction. It is a unified language model of text and molecules, pre-trained on SMILES strings wrapped by text. In this process, text and SMILES are tokenized separately, the text-SMILES interaction is integrated during the pre-training stage, and the model is then further fine-tuned on specific downstream tasks such as molecular property prediction and

molecule-text generation. In terms of the fine-tuning stage, apart from full fine-tuning, there are many integration techniques for LLMs which aim to enhance efficiency. These techniques are also called parameter-efficient tuning; some examples are adapters, prefix tuning, LoRA, BitFit and prompt tuning. These techniques only fine-tune a small set of parameters, keeping the original parameters fixed, and try to reduce the computational cost. As we can see, across different resource settings they actually achieve quite different performance for different data sizes. They are particularly useful when we fine-tune very large language models, in low-resource settings, or when computational resources are limited.
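A minimal sketch of parameter-efficient tuning with LoRA, assuming the HuggingFace transformers and peft libraries; the base model and target modules are illustrative choices:

```python
# LoRA-style parameter-efficient tuning: only the small low-rank adapter
# matrices are trained while the base model stays frozen.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

lora_config = LoraConfig(
    r=8,                                # rank of the low-rank update
    lora_alpha=16,
    target_modules=["query", "value"],  # BERT attention projections
    lora_dropout=0.1,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
# The wrapped model can now be fine-tuned as usual (e.g., with the Trainer API).
```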

Knowledge editing is a particular setting of knowledge integration for LLMs. The motivation of knowledge editing is that LLMs are notoriously prone to hallucination, can become outdated, and their factual knowledge decays, so we should be able to adjust specific behaviors of pre-trained language models. This knowledge editing normally happens during the post-training stage, and it sits at the intersection of techniques like prompt tuning, continual learning and machine unlearning. Given a piece of knowledge, it tries to locate the information within particular neurons and then conducts operations such as insertion, modification or erasure, so that the knowledge in the LLM can be updated.

So this always happens during the post-training stage.

there are many national links processing tasks related to scientific discovery。

including like question answer, entity linking, document classification, summarization。

note generations or hypothesis generation of large growth compil and reasoning。

so I will describe some introduce some examples related to these tasks。

So here is an example of the KG integration for the clinical text state generation。

so the colion is a largely important framework for the clinical date generation they use like topics from the KGs or topics from others and cells from as to formulate the prompts and they use the synthetic data and they use the synthetic data to be fine tuneed fors to further improve the general like domestic specific knowledge。

啊。So this like integration is actually done during the plauning stage。

Here is another example of KG integration during the fine-tuning stage. This is a KG-based LLM agent for complex medical QA that leverages the non-codified knowledge of LLMs and the structured, codified knowledge of medical concepts. In this model, upon seeing a query, the work generates some relevant triples by using the knowledge stored in the LLM, and these triples are then verified against a grounded knowledge base to filter out erroneous information, ensuring that only relevant information contributes to the final answer. It also uses a LoRA-like parameter-efficient tuning strategy, so it happens in the fine-tuning stage and uses LoRA as the technique.

And here is an example of using KG-enhanced LLMs for question answering tasks. It contains four stages. The first is efficient construction: labeling a limited set of examples and developing an LLM-based knowledge graph extraction system to construct the KG. Then they use a pre-learning technique with LoRA, which is a kind of post-training stage, to integrate such domain-specific knowledge graphs via LoRA-based fine-tuning. Then they do another stage of supervised fine-tuning, which involves retrieving some subgraphs from these domain-specific KGs, modifying the input accordingly, and performing supervised tuning. In the final stage, the LLMs also act as evaluators, providing feedback on the knowledge correctness, so that the model can better align with the domain knowledge.

So this work actually uses both the post-training and fine-tuning stages to integrate the knowledge information, and LoRA is used to enhance the efficiency of tuning.

In terms of the inference stage, retrieval-augmented generation is widely used to obtain knowledge-enhanced generation for current LLMs. BioRAG is an example that can adaptively select a knowledge source and domain-specific tools to advance biology question-answering and reasoning tasks. In this model, they use LLMs to do a kind of retrieval selection, selecting suitable tools or retrieval methods to retrieve the domain-specific documents; then they use another prompt to formulate the question for retrieval, combine the retrieval results as the context for the next step, and then do a kind of chain-of-thought reasoning before generating the answer. This actually happens at the inference stage; there is no tuning, and it is especially useful for current LLMs.

Sometimes we need to deal with very large-scale knowledge graphs, for example UMLS, which is one of the largest biomedical knowledge graphs, containing millions of concepts and more than 20 million relations. So how do we deal with such a large-scale graph? The Mixture-of-Partitions (MoP) model gives an idea of divide and conquer: they use a partition algorithm, for example METIS, a graph partitioning algorithm, to partition the graph into small subgraphs, and infuse the domain-specific knowledge of each subgraph into a separate adapter. For the fine-tuning stage, they add a mixture layer on top of these adapters and fine-tune on domain-specific applications. Using this scheme, we are able to deal with large-scale knowledge graphs; this also uses parameter-efficient tuning, and it happens during a post-training stage before fine-tuning.
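A simplified PyTorch sketch of the mixture-of-partitions idea (not the actual MoP implementation): one small residual adapter per KG partition, plus a learned mixture (gating) layer that combines them for a given hidden state.

```python
# Mixture-of-adapters sketch: each KG partition has its own adapter, and a
# gating layer mixes the adapter outputs for each input hidden state.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual bottleneck adapter

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim=256, num_partitions=4):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(dim) for _ in range(num_partitions)])
        self.gate = nn.Linear(dim, num_partitions)  # mixture layer over partitions

    def forward(self, h):
        weights = torch.softmax(self.gate(h), dim=-1)              # (batch, num_partitions)
        outputs = torch.stack([a(h) for a in self.adapters], -1)   # (batch, dim, num_partitions)
        return (outputs * weights.unsqueeze(1)).sum(dim=-1)

h = torch.randn(8, 256)  # hidden states from a frozen (bio)LM layer
print(MixtureOfAdapters()(h).shape)  # torch.Size([8, 256])
```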

Now let's also look at some KG integration examples for scientific prediction tasks. There are many scientific prediction tasks, including gene-disease association prediction, protein function prediction, drug repurposing, drug-target interactions, protein-protein interactions, text-to-molecule generation, and amino-acid contact prediction. I will briefly introduce some of these tasks and describe how we can use LLMs and knowledge graphs to improve their effectiveness.

Here is an example of the gene-disease association prediction task, which is the process of identifying which genes are involved in a disease. For a gene we often do not have a very good representation, but for some genes we can map them to a protein, so we can use the protein sequence, the amino-acid sequence. For the disease, apart from text we don't have much other information, but we can use the semantic distance between different diseases, so we can use the text as the disease representation. One bi-encoder model for the gene-disease association task uses a BERT-based model to encode the disease description into text embeddings, and uses a protein language model to encode the gene, represented by its protein sequence, into the same vector space; it then uses a fusion module with a prediction head. So this is a bi-encoder model, and it is also combined with post-training, because they use both a general gene-disease association corpus and further fine-tune on task-specific benchmarks.

OntoProtein is a model that can deal with many different protein-related tasks, such as protein function prediction and protein-protein interactions. This model constructs a novel large-scale knowledge graph that consists of the Gene Ontology, its related proteins, and the gene-annotation text for protein sequences, all described in a graph. This KG is then integrated into a language model using contrastive learning, to jointly optimize the knowledge graph and protein embedding spaces during a post-training stage. So this happens during the post-training stage, and then they need to fine-tune on some domain-specific tasks, such as protein structure or protein function prediction, or protein-protein interactions. So it combines post-training and fine-tuning.

This is a task we call amino-acid contact prediction, which is the process of identifying which amino-acid residues in a protein are close together. The model here, called KeAP, is trained on a knowledge graph that consists of five million triples (ProteinKG25), and it explores the knowledge graph at a more granular level by applying cross-attention to the sequence of protein embeddings and the language embeddings. It can also be trained using a masked language modeling head, and it is also a bi-encoder architecture that encodes the protein and the language information in parallel.

The drug repurposing is a task that widely used to seek to identify new drugs for new targets for drugs that has already approved for the treatment of the existing disease。

Traditionally it was addressed by graph neural network models such as TxGNN, which is a graph foundation model for zero-shot drug repurposing。 It identifies therapeutic candidates even for diseases with limited treatment options or existing drugs。 It is trained on a medical knowledge graph and uses graph neural networks and metric learning to rank drugs as potential indications。

But now, with the development of large language models, particularly ChatGPT, this task can also be addressed by using ChatGPT。 This work leveraged ChatGPT to recommend drugs for the Alzheimer's disease repurposing task。 It further evaluated the potential efficacy of the ten most frequently suggested drugs generated by ChatGPT, using electronic health record data to validate their effectiveness, and the conclusion suggests that ChatGPT can actually generate very high-quality hypotheses for drug repurposing。

So, previously Dr。 Zhang has already mentioned that for language models we actually need to deal with different modalities of the knowledge graphs。 For example, we have proteins, which can be represented by amino acids, and we have drugs, which are represented by molecules。 How to combine these different modalities into a language model to do different tasks is actually challenging, but it has been researched in many papers; they are basically architectures to merge this multimodal information from knowledge graphs into language models。

The first we call multimodal contrastive learning。 One example is actually borrowed from computer vision, the CLIP model, which learns the alignment between text and images; here we try to learn representations by aligning the protein representation with the drug representation, so both foundation models — the multiple foundation models — will be fine-tuned, for example with a contrastive loss。
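A rough CLIP-style sketch of this first architecture (an illustration under the stated analogy, not a specific published model): two modality encoders, here hypothetical `protein_encoder` and `molecule_encoder`, each followed by a projection head into a shared space, with a learnable temperature as in CLIP; both towers are fine-tuned with a contrastive loss like the one sketched earlier。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProteinMoleculeCLIP(nn.Module):
    """CLIP-style two-tower model: project protein and molecule embeddings into a
    shared space and score all pairs in the batch (matched pairs on the diagonal)."""
    def __init__(self, protein_encoder, molecule_encoder, prot_dim, mol_dim, shared_dim=512):
        super().__init__()
        self.protein_encoder = protein_encoder    # hypothetical protein foundation model
        self.molecule_encoder = molecule_encoder  # hypothetical molecule foundation model
        self.prot_proj = nn.Linear(prot_dim, shared_dim)
        self.mol_proj = nn.Linear(mol_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.66))  # ~log(1/0.07), as in CLIP

    def forward(self, proteins, molecules):
        p = F.normalize(self.prot_proj(self.protein_encoder(proteins)), dim=-1)
        m = F.normalize(self.mol_proj(self.molecule_encoder(molecules)), dim=-1)
        return self.logit_scale.exp() * p @ m.t()  # (B, B) similarity logits
```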

The second architecture tries to align to a central modality。 This is also borrowed from current visual language models, where image or video data can be mapped into the unified representation of a text language model。 This can be applied in the scientific discovery domain as well: we can, for example, map the representations of molecules and proteins from molecule and protein language models into a single text language model, for example LLaMA; then we can keep the LLaMA model fixed and fine-tune those protein language models and molecule language models。
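A minimal sketch of this "central modality" idea, under stated assumptions: the text LLM (a LLaMA-style model) stays frozen, while the modality encoders and small trainable projection layers are trained so that protein/molecule embeddings land in the LLM's token-embedding space and can be consumed as soft tokens alongside text。 The module names here are hypothetical。

```python
import torch
import torch.nn as nn

class ModalityToLLMProjector(nn.Module):
    """Trainable linear adapters that map protein / molecule embeddings into the
    (frozen) LLM's token-embedding space, producing 'soft tokens' to prepend to text."""
    def __init__(self, prot_dim, mol_dim, llm_hidden_dim):
        super().__init__()
        self.prot_proj = nn.Linear(prot_dim, llm_hidden_dim)
        self.mol_proj = nn.Linear(mol_dim, llm_hidden_dim)

    def forward(self, prot_emb, mol_emb, text_token_embs):
        # prot_emb: (B, n_p, prot_dim), mol_emb: (B, n_m, mol_dim)
        # text_token_embs: (B, n_t, llm_hidden_dim) from the frozen LLM's embedding table
        soft_prot = self.prot_proj(prot_emb)
        soft_mol = self.mol_proj(mol_emb)
        # concatenate soft tokens with text tokens; the result is fed to the frozen LLM
        return torch.cat([soft_prot, soft_mol, text_token_embs], dim=1)
```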

The third architecture is a kind of transformation across modalities, which they call bridge models: they do a transformation between two different modalities, but they update the parameters only of the transformation (bridge) model。

Apart from these, LLM agents are a very hot topic recently, and they can also be applied to the scientific discovery domain。

There are two main paradigms for these scientific discovery agents。 The first is specialized language models, which are trained on scientific data and are typically tailored to scientific tasks such as molecule-related tasks, protein-related tasks, and gene-related tasks。 These models are used as tools to perform specific tasks, in which users provide the information required for a task and then the model outputs the prediction。 The other paradigm is general-purpose language models, which are trained on diverse text from different materials,

including scientific papers, arXiv preprints, and so on。 They are fine-tuned on reasoning, planning, or information retrieval tasks, but they serve as assistants, allowing the user to use plain language to interact with the model。

In scientific discovery, another key challenge is actually collaboration, because science is a collaborative process。 So this paper uses the idea of building connections from the text data: they try to uncover unseen connections in the scientific data so that they can use different experts to run a pipeline of research, and even use multi-agent systems for reasoning collaborations。

Coscientist is another example, using an LLM agent to do not only generation tasks but also to manipulate physical-world experiments。 It actually builds a connection between the language model and the hardware API documentation, so that it can conduct experiments in the physical world, building a bridge between the generative world and the physical world。

So yeah, to summarize this tutorial: we have covered three parts on LLMs and knowledge graphs for scientific discovery。 We introduced different knowledge graphs and motivated why we need to integrate knowledge graphs into LLMs; we also introduced different knowledge graphs in the life sciences and described some challenges and opportunities。 Dr。 Zhang has also introduced some scientific LLMs, including bi-encoder models and cross-encoder models, and in my part I introduced some KG–LLM integration frameworks and the pipelines — pre-training, post-training, fine-tuning, and parameter-efficient tuning — and described how to use them in real scientific NLP tasks and prediction tasks。

I will conclude this tutorial by leaving you with the following takeaways。 What do KGs bring to LLMs? KGs can provide enhanced knowledge representation, improve explainability, reasoning, and inference, and integrating knowledge graphs into LLMs can increase accuracy and reduce hallucinations。

When we consider how to effectively incorporate knowledge graphs into LLMs, we need to consider many factors, including the backbone models — whether they are protein, molecule, or text models — and the encoder models, which can be bi-encoder models or cross-encoder models。 We also need to consider at what stage to integrate the knowledge graphs, and the integration techniques, such as LoRA, contrastive learning, or LLM agents。So yeah,

those are some references, and this is the end of the tutorial。

I'm not sure if we still have time to take any questions。Perhaps one quick question。All right, well。

in case there are questions later, please feel free to continue the discussion in the tutorial discussion channel on Slack, or by directly contacting the organizers of this tutorial。

All right, we are now at time for the transition。 First of all, thank you, tutorial organizers, for this amazing tutorial — we really appreciate the time and care that you've clearly put into making this tutorial really accessible and informative。 Again, please feel free to continue discussions in Slack or by directly contacting the organizers, and if you haven't seen it already, as a reminder we also have a tutorial feedback form — we greatly appreciate any feedback you have and we will of course share it with the tutorial organizers as well。 So with that, I will now transition to Titania。

😊,Alright, I'm going to bring my screen sharing up。And is my audio good?Yes。Alright。So we've already started LOG, but officially welcome to the Learning on Graphs conference,

its third edition now in 2024。😊,And we wanted to kick this off with just some housekeeping。

So I think all of you are aware of this, but just to repeat it and for those on YouTube。

join our Slack — all communication will be via Slack, and all the links for Zoom, GatherTown, et cetera, are on Slack as well。😊,And the schedule on the LOG website is always up to date,

and you know we've tried to put it in a Google sheet that people can access easily, and it has all the time slots around the world。😊,Everything is on Zoom, everything is recorded, it's also live streamed on YouTube and it will go up on YouTube afterwards as well, so you can catch everything later too。 And lastly, poster sessions and socials will use GatherTown, so most of your posters should already be imported into GatherTown, and if they're not then we're actively in the process of doing so。 If there are any doubts, or if there's anything that's not clear about presenting at LOG or participating in LOG, reach out via Slack or via the LOG conference Google Group email and we will be sure to reply。

😊,So。Our mission with LOG has always been to advance graph and geometric machine learning, right。

And this is a field that's growing rapidly and changing as well, growing in practical impact。

as well as growing in terms of the size of the community。😊。

So the purpose of LOG was to have this really accessible-to-all and free-to-attend virtual event with global talks and top research in this field, and to also bring local communities together, to have these local meetups and to foster a sense of community and togetherness around these research topics that we're all passionate about。

😊,And lastly, our third goal is review quality。 So as a new field, you know。

we want to incentivize and have accountability in the review process, right。

And we've been trying to take initiatives to do so, primarily through trying to curate the reviewer pool and the area chair pool and giving out monetary rewards as well,

and we'll give you more details about that very soon。😊。

I think the heart of LOG, and something that's really amazing to see, is 15 local meetups this year。😊。

And this is around the globe in Europe, the US, Asia as well and the Middle East。

Thank you to all the organizers for putting these local meetups up。

Thank you to the local meetup chairs as well。 I just came today from the New Delhi LOG meetup, and it was amazing — 300 people showed up。 There were great posters, there were great talks, there were panel discussions, controversial thoughts — it was amazing。 I really enjoyed it。

And I think this is what makes LOG really special。😊,And。Next,

I just wanted to say a word for our keynote speakers。

so today we'll have Professor Yusu Wang, followed by Zachary Ulissi, Xavier Bresson, and Alden Heung, who've prepared really exciting talks covering theory, methods, and cutting-edge applications as well。 So I hope you're all excited for that。 I just wanted to also note the different time slot for Xavier's keynote and the subsequent session, which is to accommodate those of us in Asia as well。

😊,And finally, a big thank you to our programme chairs, Guy Wolf and Smita Krishnaswamy。 Without them, none of this would happen, and they've prepared a fantastic program, which they're going to tell us all about。So I'm going to stop sharing, and Smita will be taking over。😊,Okay, can you see my slides?And hear me?

Okay, cool, okay, thanks everybody, thanks for joining — you can see the slides?😊,I can see that, yeah。

Oh, you can? Okay。Thanks everybody for joining today。 We're really excited about a LOG that's already going on, and as you saw, a lot of the meetups have happened, so people are in the swing of things。 I just want to start off and kind of reiterate that the reason all of us are here — the organizers, the steering committee, and the participants — is of course that graphs, and learning based on graphs, are increasing in importance with every passing year。 This is a chart that's actually showing a graph of the arXiv citation network — the arXiv citation network, which itself is a big graph — and we see that the number of machine learning papers with abstracts containing the word "graph" has been steadily increasing; the only reason 2024 looks lower is that it stops in August, but by December this will zoom up。😊,So this is a topic of increasing importance, and we should sort of be proud of ourselves that we're in the weeds of it。

As you guys know, graphs are very important to the world around us, whether it be in social science, in molecular and biomedical science — in things that can help us discover drugs and targets — but actually also in many other fields, including materials discovery, market modeling, finance, etc。 And so this is what makes this a community that increases in importance with every passing year: there are more and more data sets, on many, many domains, that continue to be added in the world, and most of them can be modeled as graphs。 But not only that — the notion of a graph, of course,

comes from mathematics, so there's a very well-developed mathematical foundation that we can take advantage of, and this combination of a mathematical foundation and a high degree of application, I think, will only increase the importance of learning on graphs and of understanding what we're learning。 The mathematical foundation, you know, as Tiittanya mentioned, can rest on geometry but also many other things: there's graph theory of many varieties, there's of course geometry — Riemannian and algebraic, and discrete — and also topology, which is well developed on graphs, especially in TDA。

But increasingly we also see connections with Markovian and stochastic processes, diffusion processes on graphs, and of course graph signal processing, which is at the core of a lot of the GNN-based work that we see — so our conference also encompasses all of these topics。 Here's a fun LOG paper-title word cloud: you see the biggest word here is "knowledge", so graphs are a way of encoding knowledge and reasoning about knowledge, which is at the underpinning of machine learning and data science, and we see of course GNNs — graphs in the context of neural networks — gaining importance。 Other terms I want to point out are "representation" — there are a lot of things in graphs that can be represented: nodes, edges, the entire graph itself — giving us information about different layers of the data; it's naturally multi-scale, coarse-grained information。😊,And some of the applications that I mentioned you'll see in here: you'll see "molecular", you'll see "large language", "social", "property", "cancer", "structural"。 So this word cloud is sort of an embodiment of where LOG is going this year, so I hope you're very excited to join in, and with that I'm going to pass it over to Guy, who can talk a little bit about the statistics of the papers we saw and what we were looking at。

Thank you, Smita。 I'm going to actually try to share here — if you let me share my screen, we'll see if it works。

Hopefully everyone can see。You see my screen now?Yes。So yeah — this year we have full papers, abstracts, and TMLR papers, in order to give exposure to multiple avenues of research。 When we look at the acceptance rate, it's at about 40% for full papers, which is a slightly more rigorous process, and 56% for abstracts, in order to allow a little bit more innovation and also submissions that are not necessarily going to proceedings but can expose things that will later be fully developed in other venues。 TMLR papers are published papers that are given a spotlight here, and that is why they have a high acceptance rate —

we already rely on them being accepted and peer-reviewed there。We also have here the division of orals from each of these categories, balancing the different sources of papers coming in, and posters, which still allow people to present their work without necessarily having a slot in the program。 This is a very big operation — you can see here how many area chairs and reviewers we have; we thank all of them, it is not easy to go through all of this process and select the best papers。

And when we look at the scores, out of 10, you can see here that the average score for a poster is on the positive side, around six, which is basically above the acceptance threshold, and for orals we expect more confidence and a full, clear accept。 The confidence score is roughly where we want it to be — anyone who reviews knows that we hesitate to put five, fully confident, since there's always a possibility of missing something, but at the same time we don't want educated guesses — so we're very happy with where the quality of review ended up in terms of these statistics。

The average review length, you can see a distribution。

we have a relatively good amount of text coming in from reviewers in order to justify their decisions。

you know we can always do better but at the same time we have to remember the limited time and all the other duties especially with our review process being relatively short。

In parallel with ICLR and other conferences, people have a lot of workload, so we definitely appreciate those who took the time to write lengthy reviews to help us make the decisions。

In order to improve, we also rate the reviewers — the area chairs are helping us with this, and again we thank the area chairs for it。 Most of the reviewers were meeting expectations。 For those that exceed expectations we will be giving out awards, as we will mention in a few moments; for those who are below expectations, we will work on improving our program committee in subsequent years — so we definitely make decisions based on these ratings for the future, to ensure we improve our review quality。

Based on all of this, hopefully you will agree with us that we have a very exciting program with our 12 orals。 We have four sessions, each of them with three orals, and they are spread over time zones — of course based on our audience, mostly concentrated in North America and Europe, but we also have a session dedicated to our audience in Asia — and on top of that we have of course the poster sessions and the tutorials to complete the program。

And the keynotes — of course, this is a team effort。 We have a very big advisory board, including also Smita, who has the dual role of program chair and advisor; they ensure the consistency of the program across years — LOG is an annual event — so we thank all of them for being engaged in ensuring the longevity of this endeavor。

And we also have an organizing committee — just like Chaitanya said, without the program chairs and the reviewers and the area chairs none of this would happen, and without the organizing committee none of it would happen either。 They were extremely helpful throughout the process; Smita and I relied on them for a lot of the help behind the scenes。 So I definitely want to highlight them, also for operating OpenReview — you can see here Lu and Han and Titania and many others; they were there on a regular basis helping us operate everything, and they deserve many, many thanks for making all this possible。And finally,

I can say that on the final day you have something to look forward to: we will be announcing a best paper award that we will select, and we will also recognize area chairs and reviewers that exceeded expectations — as I said, we have 30-plus best reviewer awards and also three top area chairs — so you can look forward to that。

And yeah, and with that, I think our opening remarks are done。

so I'll give the mic back to Jeania for the next parts。Thanks, Guy。 Thanks, Smita。

So now we have a bit of a break — we have around 10 minutes until Professor Yusu Wang's keynote, and

😊,I think, if you like, you know, we're on Slack and we can chat there。And yeah, we can just wait for more people to join the Zoom — go and tell your friends that the keynote is happening now — and we'll also wait for Professor Wang to join, help her get set up, and so on。

Hello。😔,Hello, how are you?😊,Good, how are you doing?All right, thanks — sounds good。Everything all right。Yeah, I think it works well — I can hear you loud and clear。Excellent, all right — at the allotted time then; sure, I'll mute myself for now。😊,As you like, sounds good。Alright。I think we're ready to start soon。

So welcome back, everyone, and I'll hand it over to Yuquaang to introduce the speaker。So hello, everyone, welcome to our LOG conference this year。 And today we are very happy to have Professor Yusu Wang from UC San Diego to give us a talk about the size and length generalization of neural models via algorithmic alignment。

Professor Wang has done many good works on graph learning, graph theory, spectral methods, and so on。 You are very welcome to our session, and we are looking forward to your talk。All right,

Yeah, so thanks for the introduction and also thanks for having me — I'm really glad to be here。Well, I know that this conference is Learning on Graphs, so we're going to see graph neural networks and graphs very soon, but only in the later half of this talk — so bear with me first。

😊,I'm actually going to start with one step back and talk about algorithms。

which, as a systematic procedure for solving problems, really has a very long history, you know, since ancient times — like the Euclidean algorithm to compute the greatest common divisor — okay, so that's an algorithmic procedure。

so that's an algorithm procedure。😊,Now, in the last 60,70 decades。

the modern computation power brought to us by computers really has led to a paradigm shift in algorithm。

both in terms of how we view algorithm and how we design them, so for example。

now when we think about algorithm, we usually view it as a computational algorithm。

computer algorithms。😊,Now with the impressive power that we witnessed by modern AII。

especially very recently those broader by the large language models。

it's natural also to ask whether by fusing modern AI and the power of data。

whether this could lead us yet another revolution to agz design。So in particular。

let's not forget that in computer science, classical algorithm design really has been one of the cornerstones of the last 40, 50 years, so there's a huge literature on that, and this gives us many very beautiful, deep insights about the mathematical structures behind the problems, elegant algorithmic frameworks, many approximation algorithms with theoretical guarantees, etc。😊。

But at the same time, not all theoretically sound algorithms readily transfer to practice — they may have nice asymptotic time complexity, but they are still not necessarily practical。 Furthermore, with algorithms it's usually hard for us to really articulate what the patterns are, what the structures in the data are, and further to leverage them。 On the other hand, as I mentioned, we have witnessed the amazing power of modern AI in learning and discovering hidden patterns from data, as well as leveraging them for the task at hand。 But at the same time, in general it's not an easy question to know whether a learned model will generalize, especially generalize out of distribution, and in fact many times we do not even know whether a given specific architecture even has the capacity to implement the specific, task-specific algorithm at hand。 So one of the things that I've been really interested in over the last couple of years is how we can combine

the best of both worlds to advance this neural algorithm design。

So how can we combine algorithmic ideas and insights with neural networks to develop more powerful frameworks that can learn from and adapt to data? In this talk, what I'm going to focus on is the specific question of size or length generalization of neural models。 So basically the question is: can a neural model with only bounded complexity, in terms of the number of parameters, nevertheless generalize to inputs of arbitrary sizes and lengths? Okay, so that's the question。

So you want a neural model because once it's trained, you have to freeze it。 The neural model has only bounded complexity, which is independent of the input size, but you want it to work for inputs of any length。So — I'm using "size" generalization instead of "length"; you will have seen that recent years have had a lot of work on, for example, length generalization for large language models。 I'm using the term "size" because the input doesn't have to be a sequence, so the size is the cardinality。

😊,Now note that this size generalization really is one very fundamental property of computational algorithms: if I give you a sorting algorithm, then it has to be able to sort any input sequence of any length, whether it has seen it before or not。 So this is also highly desirable for neural algorithmic models。 Now there are multiple challenges involved, at different levels。 The simplest type of question is what I call expressivity: given a specific neural architecture, does it have the capacity — namely, can I set up the parameters so that this model can indeed achieve size generalization — and, importantly in practice, do I have a practical model that has this expressivity?

😊,The next layer of the question is: well, your model might have the capacity to achieve size generalization, to work for any input size, but if you give me a trained model, do I know whether it generalizes or not? Do I have an efficient way to certify, to prove correctly, that a given trained model indeed has this property and can work with any input correctly? I call this the certification problem, because you are trying to certify a given model。 And, very importantly in practice, it's not just any certificate: very often, in order for this to also be useful for training, we actually want the low loss to somehow reflect the ability of size generalization — so can a certain loss, can low loss, imply that?

Okay。😊,The last type of question — this is the most challenging one — is an optimization question: namely, can we train, through for example standard gradient descent or other optimization methods, a neural model so that it produces a final model that can generalize to any input? So obviously, top-down, the questions become more and more challenging, and the optimization one is particularly challenging — even in the classical statistical learning domain we don't have many tasks for which we can provide provably guaranteed optimization。 So,

in this talk, I'm going to tell you about some of our recent work along this direction, on the size generalization of neural models。 The main theme is to use alignment with certain algorithmic structures to help us achieve size generalization — you can also view this as an algorithmic inductive bias。 In particular, I'm going to give you three small vignettes: the first two focus on the first, expressivity-type question, and the last one is on certification。 I'm going to go a little faster on the first two examples, just to give you a taste of how they work, and hopefully spend most of the time on the last, most challenging one, which is to leverage alignment between

graph neural networks and the Bellman-Ford procedure to certify that we can indeed learn the Bellman-Ford procedure。

😊,All right, and feel free to interrupt me with any questions, I think I can see the chat window。

All right。But before I start, let me say that theoretically there has been a lot of very interesting recent work on understanding this expressivity problem — essentially understanding the computational limitations of neural models, especially transformers: fundamentally, what they can compute and what they cannot compute — for example, showing that a transformer can simulate finite automata, and so on。 Compared with those lines of work, you can think of our approaches as more problem-driven, with the goal of having an effective and practical model to tackle the given problem at hand, but also with a certain theoretical guarantee — the size generalization guarantee。 So you can view those lines of work as computational complexity and computability studies, while our goal is algorithm design — we want this neural algorithm design

Okay, to solve your given problem, which can be a hard problem。

to solve, say, a combinatorial optimization problem at hand, and I would like to have a neural model to tackle that。

So there are different challenges involved。 Okay, now this is also closely related to all the interesting work on neural algorithmic reasoning, as well as, in the last few years, a line of work on using machine learning for combinatorial optimization problems — so this is closely related, but our focus, again, is to pay attention to, to be careful about, this size generalization issue。 And ultimately, as I said earlier, one would like to go beyond expressivity and representation power — one would like to be able to optimize a neural model to achieve that size generalization — so the second part of my talk, on this neural Bellman-Ford, will be one step towards that goal。

All right, so I'll start with the first example。 This is joint work with my colleague Andrew Kahng and students Ra Nam and Qin Yi Yang。

it's called a rectilinear shineino tree problem very briefly, it sounds very simple。

you're given a bunch of points say in the plain euclan space here at the blue points and you just want to connect them using a tree structure and you want to minimize the length of the tree the total length of the tree。

okay it's called a shineer tree because you allow to add additional nodes like this redin nodes that you see here。

😊,OkayAnd it's called recallinear because here we're going to measure the distance using L1 norm so but this really doesn't matter whether whichever norm you use。

it doesn't change the hardness of the problem Okay so this rectallinear line tree will consider the specific version of rectallinear using L1 distance because it's coming from motivated by whsi chip design implications where the rectallinear line tree is usually used as a first step to give the global routing of the wires for a net when you put them on the chip。

I mentioned that the problem is NP-hard, but it's actually very easy to get a constant-factor approximation algorithm。 However, in practice, given that this is directly tied to the wiring cost and the cost of the chip, people really try to improve this, make it as accurate as possible, and there are many different heuristic approaches developed。 The best theoretical approximation algorithm is a beautiful PTAS given by Sanjeev Arora, which gives a (1+ε)-approximation for this problem。 But as I said, there are heuristic algorithms, and in recent years there are also machine learning-based approaches; there is one reinforcement learning-based approach for this problem。

😊,Okay。Now, the theoretical algorithm, the PTAS — even though it's a (1+ε)-approximation, we'll see later that it is not practical, so it hasn't been implemented。 In practice, FLUTE is what's been used most commonly。

All right, now this is the question that we want to solve, and on the surface it sounds strange: this is an NP-hard problem — how can we have a neural model with only a constant number of parameters yet applicable to input point sets of arbitrary size? How can that be achieved?

Well, the key idea is that maybe what we can do is not do it as a pure end-to-end neural model; we actually mix this with some outer-level algorithmic framework — we let the algorithm also carry a lot of the weight in handling the complexity。 In particular, what you're going to see is that we're going to have this outer framework, which is an existing algorithmic framework, but inside we're going to call neural modules, and those neural modules are going to have only bounded size, independent of the input。

Okay, and this is why this uses the algorithmic alignment idea, because we're now using some algorithmic framework。😊,In particular, what I'm going to tell you next, very briefly, is a mixed neural-algorithmic version of this best theoretical algorithm, the PTAS proposed by Arora。😊,I will not get into detail, I just want to give you the high-level ideas。 First, how does the theoretical algorithm work? It actually has a very simple and beautiful idea behind it。 Let's say we're just looking at the two-dimensional case — all the points are in the plane — and I want to find the Steiner tree connecting them。

😊,First we put a bounding box around the points, and then we build a quadtree on top of that, which just keeps subdividing this bounding box until every cell contains only one point。 And for technical reasons, to get the theoretical bound, we actually need to randomly shift this quadtree — so the result is a randomized algorithm。

Then we're going to build this Steiner tree by choosing the Steiner points in a bottom-up manner。 You don't have to worry about the details; the key idea in this PTAS is that if I just look at a particular cell in the quadtree — that's what you see here — I really don't care what the tree looks like inside。 In order for me to build the tree, I only need to know — this is the key idea by Arora — how the tree leaves the cell, how it exits the sides of the cell。 And in particular, I don't have to check all the possible places where it can leave; I can just focus on those orange points called portals。 So you assume that the tree you're constructing is going to leave the cell only through those portals, and I only need to know the cost for each of those

exit patterns。 To build that cost, you can do this in a bottom-up manner: when I consider a parent quadtree cell, I can just assemble it from the exit patterns of its four child cells。 So this is the dynamic programming step, where you simply take those portal configurations from the child cells and use them to assemble the portal configuration for the parent cell, and then you go all the way up to the root; at the root cell you pick the portal configuration that gives you the lowest cost, and then you can propagate back down to retrieve the optimal Steiner tree — and that's the algorithm。 So once you build the quadtree, the dynamic programming only takes linear time times the time you need to inspect all the portal configurations at each dynamic programming step。

So this is just the theoretical guarantee adapted from Arora's paper。 Now, the algorithm is actually very simple conceptually — so why hasn't it been implemented? This is because in this dynamic programming step you still have to inspect all possible portal configurations, which is a huge number; even though it's a constant, it's something like O(1) to the power of O(1), so you cannot afford to enumerate all possible portal configurations exiting a specific cell — it is just far too expensive。 So the idea is quite natural: how about we just replace this little unit by a neural network? And the nice thing is that it doesn't matter which cell you're looking at — the neural network will always take the four child cells' portal configurations as input and produce the parent cell's portal configuration。 And instead of enumerating all possible portal configurations,

😊,you can try to learn a latent representation of the portal configuration — this is independent of the input size N, and hopefully also much, much smaller than explicitly enumerating all the configurations。 So that's the main idea。 In fact, you need more than just one little neural model; you need four of them, to handle the leaf case, the dynamic programming step, the top (root) case, and also the retrieval case。 But the key is that each of them only works locally: each of them is independent of the input, has only bounded size, and your outer algorithm is the dynamic programming algorithm, which calls these neural modules multiple — potentially a linear number of — times。 And once these little components are learned, they generalize to any input size, because you're just repeatedly calling them。
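To make the structure of this "outer algorithm calls small neural modules" idea concrete, here is a highly simplified sketch。 It is not the system from the talk: it assumes a hypothetical `leaf_net` that embeds a leaf cell's points into a latent "portal configuration" and a `merge_net` that combines the four children in the dynamic programming step; the real system also has top-level and retrieval modules and is trained on the Steiner-tree objective。

```python
import torch
import torch.nn as nn

class NeuralQuadtreeDP(nn.Module):
    """Sketch of the neuro-algorithmic pattern: a classical bottom-up quadtree
    dynamic program whose per-cell computations are handled by small, fixed-size
    neural modules, so the same learned modules are reused for inputs of any size."""
    def __init__(self, dim=64):
        super().__init__()
        # embeds the points of a leaf cell into a latent 'portal configuration'
        self.leaf_net = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        # DP step: combine the four children's latent configurations into the parent's
        self.merge_net = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.readout = nn.Linear(dim, 1)   # e.g. predicted tree cost at the root

    def encode_cell(self, cell):
        # cell is a dict: {"children": [4 cells] or None, "points": (k, 2) tensor}
        if cell["children"] is None:       # leaf cell (assumed to contain >= 1 point)
            return self.leaf_net(cell["points"]).mean(dim=0)
        child_embs = [self.encode_cell(c) for c in cell["children"]]   # four child cells
        return self.merge_net(torch.cat(child_embs, dim=-1))           # DP merge step

    def forward(self, root_cell):
        return self.readout(self.encode_cell(root_cell))
```

The point of the sketch is that both modules have a fixed parameter count, while the recursion over the quadtree handles inputs of arbitrary size。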

So your training can be restricted to very small-size inputs, because you just want to train these components well。And I will not get into the results here in detail, but roughly speaking,

we only train this on point sets of size 200, but then we test it on sizes up to 5000。 And as we can see — red is the best; the bigger the number, the better, this is improvement — as the size becomes bigger, we actually get more improvement over the other existing algorithms。

And it's also much faster, because it's a neural approach。So that's the first approach: the problem is hard and we kind of mix — we use an existing algorithm to help us design a neural model to tackle it。So that's where the algorithmic alignment comes in to help us。Questions on this?All right,

so now let me give you the second example。 Here is the ultimate goal — this is joint work with my students E Samantha and Chen。

😊,So here the high-level question is to compare complex objects, such as point clouds sampled from some geometric shapes that you want to compare, or maybe you want to find the mean or median of a collection of complex objects。 In particular, here I'm just going to focus on having a neural approximation of the Wasserstein distance between two Euclidean point sets。 So the question is: what is the right architecture so that we can efficiently approximate this distance?

What you'll see later is that we're going to use this alignment with a sketching-based approximation algorithm to help us reduce the model complexity to be constant, and you also have to put some consideration into what's right in your architecture, so that you make sure the resulting architecture can really approximate such functions。

All right, I assume I don't really need to introduce the Wasserstein distance。 The setup is that this is basically a distance you can use to compare distributions supported on the same metric space; it's commonly used in machine learning。 I'm going to focus on the L1 (1-)Wasserstein distance, but it works for any Lp Wasserstein distance。 This is also called the Earth Mover's Distance in computer vision; all of these are special cases of optimal transport。 You can think that I have two distributions in, say, Euclidean space, and I want to transport one to the other with minimal cost, where the cost is measured as the total amount of mass you have to move, weighted by the distance you're moving it。
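For reference, a standard way to write the 1-Wasserstein distance between two discrete distributions (weighted point sets) $\mu=\sum_i a_i\,\delta_{x_i}$ and $\nu=\sum_j b_j\,\delta_{y_j}$ with equal total mass is the following optimal-transport linear program:

```latex
W_1(\mu,\nu) \;=\; \min_{\pi \geq 0} \; \sum_{i,j} \pi_{ij}\,\lVert x_i - y_j \rVert
\quad \text{subject to} \quad \sum_{j} \pi_{ij} = a_i, \qquad \sum_{i} \pi_{ij} = b_j .
```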

😊,Now in particular I'm going to focus on distributions induced from weighted points, so you're really comparing two weighted point sets。 If the two input weighted point sets are of size, let's say, n, then computing the distance exactly takes on the order of n³ log n time, but you can also use Sinkhorn, the entropic regularization, which takes quadratic time。 What we want, though, is a bounded-size neural model to approximate this even more efficiently。 And note that once you have a neural model to approximate the Wasserstein distance, it is differentiable, so you can now use it much more easily in a machine learning pipeline where the Wasserstein distance is used, say, as part of the loss。

All right, I should emphasize that here I'm only approximating the Wasserstein distance itself, not the optimal transport map that induces that distance, which is a harder problem — here I just want to know the distance。Okay。

Since we want to approximate this distance, how do I model it? For simplicity, let me just assume that my input point sets A and B all lie in some hypercube in D-dimensional space and the cardinality is at most n。 We don't strictly need this, but it just makes it easier to describe。

but this is just make it easier to describe。So now you can each of the points sets now is basically can be represented by this tensor where you have n rows。

each row represents the coordinates of a point, and you have n number of points。Okay。

so you can this way, you can view that as this largest in distance simply as that you can be two point sets like this and I spit out a single distance。

okay, a real value, which is the largest in distance。😊。

But obviously this function has some constraints。 The function has to be symmetric with respect to the two inputs, because if I give you (B, A) or (A, B), the function value is the same。 Furthermore, for each of these factors A and B, the function has to be invariant to the permutation of the points within each factor — so you have two permutations: if I permute the input points, I should get the same distance; it doesn't change the distance value。 In other words, what we're approximating is a function that satisfies these conditions。

So this is a special case of what we call an SFGI function here — the more general version allows you to have not just two objects but K objects, say if you want to compute the min of K objects, and allows other group actions instead of just the permutation group。 But you don't have to worry about this more general concept; basically the key is that we're interested in approximating this function from

This product space to R and this function should satisfy all the symmetries。All right。

Now, we have seen a lot of really beautiful work in geometric deep learning on handling all kinds of symmetries, and using those works it's very easy to see this first observation: because this function is symmetric with respect to the two inputs A and B, there exists a function phi — which has to be G-invariant to the group action, in this case meaning phi has to be permutation invariant — such that you can just apply the same operation phi to each of the inputs, phi of A and phi of B, and then add them up; this is how you make the function symmetric。 Then you follow with another function rho outside。 So this just follows from existing work on handling symmetries。

And what this suggests is that we have a simple architecture to approximate the Wasserstein distance: this is just a Siamese network plus an MLP at the end。 In particular, for each of the input point sets you first go through this phi, then you sum the two results, and then you apply another MLP at the end to the sum of the images after phi。 So it seems to be very easy — we have the architecture。

we have the architecture。Well, you we just use this。

the issue is that right now we don't know what's a composite of this。

whether they're really of bounded size or not, maybe they need to grow as your input A and B become bigger okay in particular the note that this by itself has to be permutation invariant has to be invariant to the order of input points in each of the point sets okay and we know how to handle that I mean we have many models that give us permutation invariant achieve permutation invariant。

deep set is one, but you can use some transformer and so on okay the issue is that if we just directly say we're going to use a deep set here。

the dimension, the latent dimension of a deep set at least what we know right now the best dimension actually depends on input size N okay so we cannot argue this module is a bounded size。

Okay。And this is where the alignment to an algorithm comes into the picture。 Here is actually a very simple sketching idea to approximate the Wasserstein distance between two point sets。 We have these black points — that's the input point set。 Instead of using these points directly, imagine that I have something called an epsilon-cover of my domain, in this case a hypercube: the epsilon-cover is just a bunch of points, in this case the blue ones, such that the union of the epsilon-balls around them covers the entire space。 And then

note that you can assign these black points to the centers of this epsilon-cover, and so on this epsilon-cover you get a bunch of weighted points, where the weight of each point depends on how many black points are assigned to it。 The key is that the size of this weighted point set is only the so-called covering number, which is independent of n and only depends on the space your points are coming from — it's a property of the space。 So if you have Euclidean space, or something else called a doubling-dimension space, then you can bound this covering number。 In other words, I have this map H that maps every input point set to

I have a this H that map every input points to。Bunded the representation now okay。

this way to the point sets or bounded the size such that Now if you give me two point sets instead of computing the ws in distance between the original two point sets。

I only need to compute a wides in distance between this two way to the point sets。 Okay。

this in this output and that give you an approximation of the ws in distance with additivearrow Epsilon。

Okay, so。Furthermore, this map, that map this input points to this weighted point sets of bounded size actually can be expressed in this specific form。

so this is a nice form because this can be written as you only need a function applied to each individual point instead of to the entire point sets and then sum it up。

Okay, now this is important for building the model。 With this, we are finally ready for the final model; it's still a Siamese network, but conceptually you can just think that every input point set first goes through this mapping to produce the weighted points, and the mapping itself has this form — in other words, I only need to learn a pointwise function。 Once I have the weighted points, I use the earlier Siamese network to send them through another function phi; now this phi doesn't have any special condition, but it's operating on a bounded-size input。 Then I sum the outputs and apply an MLP at the end。 So all of these neural modules now only need to work on bounded sizes that depend only on the input space where your points are sampled from, independent of n。 So you get the expressivity of this final model: it has the capacity to approximate the Wasserstein distance of any two input point clouds。

Okay。And so these are some results, and we compare with Sinkhorn; these are some other neural-based approaches, and our approach works better, sometimes better than Sinkhorn as well, and all these neural approaches are much, much faster than the Sinkhorn approximation。Okay, this is the training time, not the inference time — the inference time is very small。So yeah, okay。All right, so that's the second example, where we used alignment with a sketching algorithm to try to reduce the size of the model that approximates the Wasserstein distance between point clouds。Questions?

Um, there is a question in the Q&A session。Okay。Right, so indeed — the previous approach, number one, the heuristic algorithm FLUTE, was actually highly optimized for small sizes。 In fact, for sizes smaller than 9 it even has a lookup table that remembers, for different types of configurations, how to solve them quickly。 So that's why, as the input becomes larger, this heuristic approach actually works worse。 And the same thing — what's interesting about the previous machine learning-based approach using reinforcement learning is that that model also works really well when the input size is small, but my suspicion is that since the reinforcement learning is only trained on small instances, when you increase the size it may not have really learned the right strategy that extends to large input sizes。 That's why, as the size increases, the performance becomes worse。Yes, okay, thank you, yeah。All right,

so now I want to go to the last part of the talk, which goes beyond just expressivity: really trying to see whether a given model can generalize to different sizes or not。Okay, but first note that the efficiency of a neural model is often affected by the interplay of the neural architecture and the task structure。

So for example, the work by Sanford et al。 shows that the transformer, because of self attention and also the parallelism, is actually much more effective at simulating the k-hop induction head — which is essentially a kind of reasoning — than RNN-based approaches; that's one reason they believe the transformer is more effective, is better。

Now it has also been observed, both empirically and with some theoretical justification, that when you have this kind of alignment of the neural architecture with a certain algorithmic structure or task structure, it facilitates algorithmic reasoning and out-of-distribution generalization。 In particular, in the graph learning community we have the idea that GNNs are intuitively dynamic programmers — a series of works observed that they are naturally aligned with procedures such as the Bellman-Ford procedure, or Dijkstra, and so on。 But in general it's not very clear how to really articulate the precise benefit when we have this alignment of, let's say, the graph neural network with Bellman-Ford — can we articulate the

advantage of having that alignment?And that's what we hope to solve: what can the alignment of graph neural networks with the Bellman-Ford procedure really give us in terms of size generalization? That's what I'm going to talk about next。 Well, for this audience I assume I don't need to introduce graph neural networks。 There are many different families of graph learning models that are commonly used; one widely used family is the message-passing graph neural network, and there are graph transformers, higher-order tensor-based models like IGN, etc。 I'm going to focus on the message-passing graph neural network, in which, roughly speaking, you maintain at every node some feature and at each layer you keep updating this feature。 The way you update the feature is that every node collects messages from its neighbors, and then it uses the messages and features from the neighbors to update and get a new feature。 Okay, so this is

a very common generic framework。 In particular, at the k-th layer, every node will first collect the features of its neighbors, then aggregate them in a certain way, and then use this aggregated information together with its own feature to produce a new feature at this node v。

😊,You can set up these aggregation and update functions in different ways; they are going to be learned, and often they're represented by, let's say, MLPs。 The only condition is that the aggregation function should be permutation invariant — it shouldn't depend on the order of the neighbors。 Now, to achieve this permutation invariance we could do something similar to DeepSets, namely sum up all the neighbors' features and then follow with another function。 This is one of the most common strategies people use in GNNs — just the sum over the neighbors, or sometimes the average instead of the sum。 Another way to achieve permutation invariance is to use the max of all the neighbors' features。

the average and the max for example okay。I'm going to consider this max version。😊,Okay。

as the aggregation function in my GNN model — and instead of max I'm going to take the min, but it's symmetric, it's the same thing。 This is because using this min-pooling layer locally at each node is better aligned with the Bellman-Ford procedure。 In particular, the GNN I'm looking at is pretty much the standard GNN; the only difference is that for the pooling I'm using the min over the neighbors。

So I'm going to look at this formulation: at each GNN layer, you simply take the min over all the neighbors' features, also take the node's own feature, and then update。So that's the formulation of our graph neural network。Okay。😊,All right,

and it's easy to see — well, first let's define the Bellman-Ford step, following the previous work by Dudzik and Veličković。 In graph theory and algorithms, we know this is how one algorithm computes shortest paths: at every node we have some original distance estimate, and the new distance estimate is just the minimum of the original distance estimate and, looking over all the neighbors, the best way to reach the node from one of its neighbors。 So we call this the Bellman-Ford step。
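In symbols, one Bellman-Ford step updates the distance estimate $d_v$ at each node $v$ as

```latex
d_v^{\mathrm{new}} \;=\; \min\!\Big(\, d_v,\ \min_{u \in N(v)} \big( d_u + w(u,v) \big) \Big),
```

where $N(v)$ denotes the neighbors of $v$ and $w(u,v)$ the weight of edge $(u,v)$。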

😊,Okay, and it's very easy to see that one can set up the update MLP and the aggregate MLP so that the GNN exactly simulates this Bellman-Ford step。Okay, so that's the alignment — that's why, intuitively, a graph neural network has a natural alignment, a structural resemblance, with the Bellman-Ford procedure。
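A minimal sketch of such a min-aggregation message-passing layer (illustrative, not the exact parameterization analyzed in the talk): each node takes the min over messages computed from its neighbors' features and edge weights, then updates with an MLP; with appropriately chosen weights this layer can reduce to the Bellman-Ford step above。

```python
import torch
import torch.nn as nn

class MinAggregationGNNLayer(nn.Module):
    """Message-passing layer with min pooling, aligned with the Bellman-Ford step:
    h_v <- update( h_v, min_{u in N(v)} msg(h_u, w_uv) )."""
    def __init__(self, dim=16):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(dim + 1, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, edge_index, edge_weight):
        # h: (num_nodes, dim); edge_index: (2, num_edges) with rows (src, dst);
        # edge_weight: (num_edges, 1)
        src, dst = edge_index
        messages = self.msg(torch.cat([h[src], edge_weight], dim=-1))     # (num_edges, dim)
        agg = torch.full_like(h, float("inf"))                            # start from +inf
        agg.scatter_reduce_(0, dst.unsqueeze(-1).expand_as(messages),
                            messages, reduce="amin", include_self=True)   # min over incoming
        agg = torch.where(torch.isinf(agg), torch.zeros_like(agg), agg)   # nodes with no edges
        return self.update(torch.cat([h, agg], dim=-1))
```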

Okay。But, um, that's not our goal。 The goal is: it's easy to see that I can simulate Bellman-Ford, but can a GNN really learn either the one-step Bellman-Ford or the K-step Bellman-Ford procedure correctly, from training over only a finite set of training graphs? So you give me training graphs and we train our model — has it really learned Bellman-Ford, or did it just fit the input training graphs and memorize everything? That's the question we hope to tackle。 Okay, so in particular,

the setup we assume for our GNN is: we are given a graph where each node's initial feature is just some initial shortest-path distance estimate, and the goal of the GNN is to predict the new distance estimate at each node after performing K Bellman-Ford steps。 The training loss, intuitively — you have a GNN and a set of training graphs — is basically defined as the average distance prediction error at each node, across all the nodes in all of the training graphs。 So that's the training loss。

All right。So, is it hopeful to get this size generalization? Well, let's first look at a very simple case — this is a warm-up case。 Let's assume that my GNN right now is what I call a one-layer, small GNN, so we have only one single layer; furthermore, for that single layer, this is how we set up the GNN: each of the update and aggregation functions is just a one-layer MLP whose latent dimension is essentially as small as possible — the smallest model that can simulate Bellman-Ford — and the activation is ReLU。

😊,So very interestingly, in this case you can actually show the following about the low loss: if you train your neural network so that the loss is low — say the loss is at most epsilon — then your parameters have to satisfy these conditions。Okay, so what does this mean? Well, if this

is equal to exactly 0, then these are actually all solutions for Bellman-Ford。 In other words, if this equals 0, they will exactly implement the Bellman-Ford procedure。 So there are an infinite number of ways to implement Bellman-Ford using this neural network; if this is zero, they are going to implement Bellman-Ford。So here we're saying that we're not deviating too much from one of those infinitely many solutions, basically。 Okay, and this further implies that once you train your model to this low loss, you can apply your model

basically。 Okay, and this further implies as a result that now once you train your model with this low loss。

now you apply your model。To any input graph, with any positive edge weights。

you will be able to guarantee to predict the Belmon Ford step correctly with the arrow Abpsilom。

okay?What's most interesting is that you only need to train your model on this there exist only a very small training set of only eight graphs okay and your loss is evaluated on this eight graph and this is not just existence。

you can actually easily this eight graph looks like this Four of them looks like this form and the four of them looks like this okay you just need to take this eight graph and if your neuraler network is trained to have low loss over this eight graph then it's guaranteed to generalize。

So intuitively, why is this happening? What happens is that, because of the alignment between Bellman-Ford and this GNN, if you ignore the ReLU it's almost like a linear system of equations。 So you're choosing these training graphs essentially to enforce that the ReLU is not really playing a role, and then you can kind of solve this linear system。 That's the intuition — not exactly so, but that's why this alignment really helps us achieve this。

Okay, but this strongly relies on the fact that my neural network is of this very small size: the latent dimension is very small, so there is not much room for things to go wrong。In general, though, you want to allow any neural network, because I cannot control how a user designs their graph neural network。So the question is: what if we over-parameterize the graph neural network, What if the GNN can have an arbitrary number of layers, the MLPs inside the update and aggregate modules can also have multiple layers, and the width of each layer, the latent dimension D, can also be large, so it has essentially arbitrary width and depth。

So no, this is not empirical, this is a theoretical guarantee: for these eight specifically chosen graphs, if your loss on them is low, then you are guaranteed to generalize。

The intuition for why we choose these eight graphs is as I explained earlier: you almost have a linear system, but there is a ReLU inside that can spoil things, so you choose those graphs to mitigate the effect of the ReLU。

So this is not an empirical result, but you will see later that it does have an empirical implication for training the graph neural network。

Okay, so that addresses the question in the Q&A; there is another one。

Yeah, there is one from Rick。No, that's an excellent question。I mean, this holds in general; in fact, even for this setting, for the next step you will see that I have to change my loss a little bit。I think this is a great question, and I do not think that in general there are many architectures where low loss necessarily implies generalization, or even just a good solution。

A lot of the work in the neural-network optimization community shows that the optimal solution can have low loss, but the opposite direction is not obvious。There has to be a certain structure in the problem at hand and in your neural network, and they have to, in some sense, match each other in order to get such a clean result。

Even in this case, when I come to the over-parameterized setting I have to change my loss a little bit。I think it is a good question and also a hard question; I don't know whether we have a general recipe that can easily translate these results to other problems。

All right, okay, so now we have the over-parameterized case。As I said, there are now too many ways that things can go wrong, so I actually have to change the loss a little bit: I have to use a regularized loss。

Basically, because the model is over-parameterized, I have to artificially ask for sparse solutions。That is what you see here: I add an L0 sparsity term to regularize the loss。Okay,

but with this regularized loss, let's still consider just one-step Bellman-Ford。Again you get a theorem, a theoretical guarantee。

This uses the same eight graphs we saw earlier。Your training set can actually be bigger than these eight graphs, you can have additional graphs, but it needs to contain these eight; let's say that is the total composition of the training set。

What the theorem says is that if the regularization parameter, this weight, is chosen correctly, sufficiently small, then as long as the regularized loss of your neural model is within epsilon of the optimum, the smallest value achievable, you again get the same size-generalization guarantee: for any input positively weighted graph, applying this neural model approximates the one-step Bellman-Ford update with additive error。

In other words, it is very similar to the previous result, except that now you need to use this regularized loss, and you also have conditions on this coefficient here。As you can see, the more graphs you use, or the more complex your neural model becomes, the stronger the condition, so you have less and less margin。But if the coefficient is sufficiently small, you get the same generalization guarantee。
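A minimal sketch of this regularized objective is given below, reusing the hypothetical `training_loss` from the earlier sketch; the theorem is stated for an L0 penalty, and the L1 surrogate shown in the comment is the practical substitute mentioned later in the talk.

```python
def regularized_loss(gnn, graphs, lam, k=1):
    # Task loss (as in the earlier sketch) plus a sparsity penalty on the weights.
    # The theorem uses an L0 penalty (number of non-zero parameters); since L0 is
    # not differentiable, L1 is the practical surrogate used in the experiments.
    task = training_loss(gnn, graphs, k)
    sparsity = sum(p.abs().sum() for p in gnn.parameters())
    return task + lam * sparsity
```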

So both of these results are for one-step Bellman-Ford。The next question, of course, is whether this works for multiple steps: what if I want to train a single module that predicts K Bellman-Ford steps directly, You could already use the previous result by stacking such a module and running it K times, but then the error also accumulates, so you may want to go directly to K Bellman-Ford steps without any intermediate supervision。Your training data would just be the input graph with some initial distance estimates, and the label is the distance estimate after K Bellman-Ford steps, and you want to see whether you can train this correctly。Now the problem becomes much more complicated, and we cannot use the same eight very simple graphs anymore; I have to choose

my graphs much more carefully to make this work。Still, you can explicitly construct a training set of order K graphs, so the total number is of order K, and they still have a very simple form, as you can see from the ones I listed here。With this training set you get a guarantee similar to the one before: you can have an over-parameterized GNN where the number of layers is at least the number of steps K, and, using the regularized loss, you get size generalization as long as the loss is sufficiently small。

Okay, these are all the theoretical results。As I said, they actually have an interesting practical implication。Let's say I am going to train my graph neural network just to predict the Bellman-Ford procedure。In these very preliminary experiments I am looking at a single Bellman-Ford step, and the neural network has a single layer, but the latent dimension is much bigger than needed; here the latent dimension is 128。

We train three cases, shown in this plot。In one, the training set is exactly the eight graphs that our theory predicts I should use to evaluate my loss。We also have two other settings where I add additional graphs to the training set, so the training set has 32 or 64 graphs, but they all include the eight graphs needed。For the training loss I cannot really use L0 regularization, so instead I use L1 regularization。

In the plot, the x axis is the training epoch, and the curves show the parameters of the GNN; the different colors mean different training-set sizes: green is when I use just the 8 training graphs, blue is 32 training graphs, and red is 64。

What this picture shows is that, to implement Bellman-Ford, one of the sparse solutions is that one parameter goes to one and all the others go to zero; that is the kind of sparse solution you expect。And what you can see is that for the green curves, roughly speaking, most parameters become zero around here, and this one becomes one。But for the blue it converges later, and for the red

it converges much later。In other words, when I train my neural network in this particular case, more training graphs are actually not helping: if I use the correct eight graphs, it converges to the optimal, sparse solution much faster than when I add additional ones。The reason is that we know these eight graphs contain the right signal, and adding more graphs essentially blurs that signal。I find this a very peculiar, very interesting phenomenon, because in classical statistical learning, where we have no other assumptions and assume the training and test data come from the same distribution, more training data is always more beneficial。Here, interestingly, from the optimization perspective, if I just use the correct smaller set it converges much faster, because we know what our target task is。Okay,

so this is all showing similar behavior。All right, there are many interesting questions that follow immediately from this。Our theoretical guarantees are for L0 regularization, but in practice L1 works pretty well empirically; could L1 actually be sufficient to get a theoretical bound as well, And here I chose those eight graphs, or the order-K graphs, very carefully in order for the theory to work, but we have some intuition that makes us believe I may not need to choose them so carefully: certain random graphs of bounded size look like they would be sufficient。Why is this interesting, why would I want to just use random graphs, Imagine that in the future I am going to train a neural model that is capable of implementing different procedures, Bellman-Ford and others, and

you want a training set that is useful for all of those different procedures。If a certain randomized training set suffices, then I can just use the same training set for all of them。Of course, the optimization question is wide open: even though we said that once your loss goes below the threshold you are guaranteed to generalize, how can I guarantee that using SGD I can really achieve a low loss, That is not clear。

All right, so the three examples I wanted to give you are all related to size generalization, which is an important family of out-of-distribution generalization。To me, in modern applications of AI and ML, this out-of-distribution generalization is perhaps one of the most important questions。For example, I am involved in TILOS, which is an institute for optimization, and very often you have very hard problems,

say in chip design or robotics, where the problem comes down to NP-hard combinatorial optimization, and you do not have the luxury of training over a lot of large examples;

you really hope that whatever we learn from smaller instances can generalize to large instances。So this is a very crucial challenge we are facing, and I gave you three examples where we use ideas and insights

from algorithms to help us tackle the size-generalization problem。But in general,

I think this all sits within a broader research direction I would like to advocate: combining algorithmic insight with machine learning to help us design better, more principled neural algorithms。I am hoping that neural algorithms really give us a paradigm shift in how we think about algorithms, and many interesting but challenging open problems remain, such as how to use effective and efficient neural models to solve hard problems; optimization in particular is wide open。There are also other questions: in the examples I gave, we usually have to put special consideration into the given problem at hand, but can there

be a more universal recipe that works for a broad class of problems, instead of having to design one specifically for each input, On the other hand, in algorithm design we usually do design a special algorithm for each individual task。

And what about composition, Suppose you have many neural algorithm modules,

each of which can approximate a certain algorithmic task with a generalization guarantee; what happens when we compose them, You could then imagine leaving it to machine learning to learn the ways to combine those neural modules, but we need a theory to understand in what cases the composition of those modules still has generalization guarantees, or other guarantees, given that each individual module has those properties。And in my examples the neural module itself is just a bounded feed-forward module, but you could use more powerful approaches,

such as recurrent neural networks or chain of thought, which allow you to have a longer memory,

and what can we say in those cases, So many interesting problems remain, and I hope this attracts more people to work in this direction。Thank you for your attention。

Yeah。So thank you very much, Yusu, for a very interesting and exciting talk; it has inspired some thoughts and discussion during the session。

Due to our time limit, because we are moving on to the next session and have already run over by several minutes, we would like to thank you again, and we hope any audience members interested in Yusu's work and recent progress can get in touch, and follow our conference as well;

we hope you can join the conference over the next few days。

Thank you, Yusu。Thank you, yeah。Okay, I will now hand the microphone to Haitao and Alex

for the next oral presentation session。Thanks。So, hi everyone,

I'm the session chair of this oral session, I'm Haitao from Michigan State University。

And let us welcome the first speaker, Floriano Tori。So, Floriano, are you here, Yeah,

can you hear me。Yeah, yeah, I think you can check and share the screen。

So Floriano Tori is a PhD student from the Vrije Universiteit Brussel,

and today his oral talk is about a very interesting paper

called "The Effectiveness of Curvature-Based Rewiring and the Role of Hyperparameters in GNNs Revisited"。

Yeah, I think I can see it, great, yeah。Yes, perfect, thank you very much, I was about to ask。Okay,

so good afternoon everyone, or good morning depending on where you are listening from。

My name is Floriano, and indeed I'm a PhD student at the Vrije Universiteit Brussel in Belgium, and today I will be presenting some of our work re-analyzing curvature-based rewiring in graph neural networks and the role of hyperparameters in GNNs, revisited。

Now, first, let me acknowledge the performance that graph neural networks have shown on graph-structured data in recent years, which is really due to the ability of GNNs to use the implicit geometric prior underlying the data, the graph structure, which is crucial in order to solve all kinds of graph problems。Now,

one of the issues that hinders graph neural networks is over-squashing。

Graph neural networks use message passing, so over-squashing appears when a lot of messages have to pass through one single edge, or through a small local region of the graph。

The messages passing through these edges are compressed, so there is a loss of information; we then speak of bottlenecks, and the edges through which this over-squashing occurs are called bottleneck edges。The graph neural network loses performance because these messages no longer contain the information necessary to solve the task。

Altering the graph structure has therefore become an important approach to alleviate this over-squashing problem, and this graph modification can happen in different ways, but today I will focus on rewiring, by which I mean the targeted addition or removal of edges in order to improve message passing。

Specifically, I will focus on discrete-graph-curvature rewiring, where we use discrete graph curvatures to detect local bottlenecks in graphs。For each edge in a graph we can assign a curvature value, and depending on whether this value is positive, zero or negative, it is associated with a certain local topology of the graph, around which message passing behaves a bit differently。If an edge has negative curvature, we can associate a more tree-like local topology with it, and such edges indeed correspond more to edges responsible for over-squashing, because the tree structure grows the number of neighbors, so a lot of messages have to pass through a few very specific edges, which causes over-squashing to happen。

Now, the question is: how can we compute these curvature notions,

One of the seminal papers in this field is the paper by Topping et al., where they proposed Balanced Forman curvature, a notion that depends on three different quantities around an edge。Suppose we want to compute the curvature of the orange edge between node 1 and node 2。We would have to count the number of three-cycles, which are the common neighbors of nodes 1 and 2; the number of four-cycles, which are neighbors of node 1 and neighbors of node 2 that are connected to each other, and importantly we only count four-cycles that have no diagonal inside, so for example there is no edge between the neighbor of node 1 and node 2, or vice versa; and finally a gamma-max factor, which captures a sort of degeneracy: here, for instance, node 1 has a neighbor that is responsible for two four-cycles around the orange edge, and the degeneracy of the four-cycles for which that node is responsible is captured by gamma max。Using these local properties around the edge,

we can associate a curvature value with the orange edge, which will be negative, positive or zero。Balanced Forman curvature has a minimal value of minus 2 and increases according to the presence of these local structures。
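To make the three ingredients concrete, here is a hedged sketch that only counts the local quantities the speaker describes (triangles, four-cycles without diagonals, and gamma max) for an undirected networkx graph; the exact way these counts are combined into the Balanced Forman curvature value follows Topping et al. and is not reproduced here.

```python
import networkx as nx

def bfc_ingredients(G, i, j):
    # Local quantities around edge (i, j): 3-cycles, 4-cycle neighbours without
    # diagonals, and the degeneracy factor gamma_max (illustrative sketch only).
    Ni, Nj = set(G[i]) - {j}, set(G[j]) - {i}
    triangles = Ni & Nj                      # common neighbours of i and j
    side_i, side_j = Ni - triangles, Nj - triangles
    sq_i = {k for k in side_i if any(G.has_edge(k, w) for w in side_j)}
    sq_j = {w for w in side_j if any(G.has_edge(k, w) for k in side_i)}
    gamma_max = max(
        [sum(G.has_edge(k, w) for w in side_j) for k in sq_i] +
        [sum(G.has_edge(k, w) for k in side_i) for w in sq_j] + [0]
    )
    return len(triangles), len(sq_i), len(sq_j), gamma_max
```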

One of the main results of the paper by Topping et al. is what we call the over-squashing theorem: if we have a graph neural network producing embedding vectors,

we can bound the Jacobian of message passing using this curvature notion。On the left-hand side, highlighted in orange, we have the Jacobian of message passing; it measures how sensitive node k is to messages arriving from node i that pass through a selected edge between i and j。For this edge (i, j) we can compute the curvature value,

and it turns out that delta, which measures how far the curvature value is from the minimal value of minus 2, can be used to bound this Jacobian of message passing: the smaller delta is,

the closer the edge's curvature is to minus 2, but also the lower the bound on the Jacobian,

which means that over-squashing happens because node k is losing sensitivity to the messages from node i

passing through the edge (i, j)。And it is really important that this links the negatively curved edges in a dataset to over-squashing over these edges。

So we can use this theory to rewire graphs。How can we do this, We can use stochastic discrete Ricci flow。That means that if, for example,

the red edge here is detected as the most negatively curved edge in the graph and we want to rewire around it,

we look at all possible other edges that could be added in order to create either three-cycles or four-cycles around this edge,

and, based on how much each candidate improves the curvature of the red edge,

we select one of them, the green one in this case,

in order to improve message passing and improve the curvature value of the red edge。In this way we can rewire the graph to help message passing around these negatively curved edges, which should be the ones responsible for over-squashing。
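A minimal sketch of one such rewiring step is given below; it is in the spirit of stochastic discrete Ricci flow rather than a faithful reimplementation, and both `curvature(G, u, v)` (any edge-curvature notion) and `candidate_edges(G, i, j)` (the edges that would close a 3- or 4-cycle at the selected edge) are assumed helper functions.

```python
import math
import random

def curvature_rewiring_step(G, curvature, candidate_edges, temperature=1.0):
    # 1) locate the most negatively curved edge (the bottleneck candidate)
    i, j = min(G.edges(), key=lambda e: curvature(G, *e))
    base = curvature(G, i, j)
    # 2) score every candidate addition by the curvature improvement it yields
    scored = []
    for u, v in candidate_edges(G, i, j):
        H = G.copy()
        H.add_edge(u, v)
        scored.append((curvature(H, i, j) - base, (u, v)))
    if not scored:
        return G
    # 3) sample one addition, softmax-weighted by its improvement
    weights = [math.exp(max(gain, 0.0) / temperature) for gain, _ in scored]
    _, (u, v) = random.choices(scored, weights=weights, k=1)[0]
    G.add_edge(u, v)
    return G
```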

Now, there is one aspect of the theorem that I did not discuss yet,

which is that there is also a condition on the selected edge。

If we have the edge (i, j), with a curvature gap denoted by delta, well,

this delta, which again is the difference between minus 2,

the most negative possible curvature, and the curvature of the edge (i, j), has to satisfy two things:

it has to be smaller than one over the square root of the maximum of the degrees of nodes i and j, where d_i and d_j denote the degrees of i and j, and it has to be smaller than one over gamma max。

This condition is important because, mathematically, it is required in order to prove the final part of the theorem, namely that the Jacobian can then be bounded by this delta value。So once we have an edge and a delta value, it is necessary to check

this delta condition before we can actually make a statement about the over-squashing nature of this edge, namely whether it is a bottleneck or not。It turns out that when we perform rewiring these conditions are never checked:

when selecting edges to rewire around,

these conditions are not explicitly checked in the process。And if we look at the Texas dataset,

which was used to evaluate rewiring, and rewire 89 edges,

it turns out that none of them actually satisfy what we call condition 2,

the condition on delta; that is, none of the selected edges satisfy the requirement that delta be smaller than one over gamma max and smaller than one over the square root of the maximum of the degrees。

Now, it turns out that this condition 2 is very stringent, since none of the edges satisfy it,

but mathematically there is a softer sufficient condition that can be applied to obtain the same result, which we call condition 2B:

the condition on the degrees can be replaced by a condition on the number of three-cycles, that is, the number of triangles, and the condition then becomes that delta has to be smaller than one over the number of three-cycles and smaller than one over gamma max。
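For concreteness, a hedged sketch of this "theorem check" for a single edge is shown below; the variable names are illustrative, and zero counts are clamped to 1 purely to avoid division by zero in the sketch.

```python
import math

def satisfies_conditions(delta, deg_i, deg_j, n_triangles, gamma_max):
    # delta: gap between the edge's curvature and the minimum value of -2.
    # Condition 2 is the original one from the over-squashing theorem;
    # condition 2B is the softer sufficient variant based on triangle counts.
    bound_gamma = 1.0 / max(gamma_max, 1)
    cond2 = delta < 1.0 / math.sqrt(max(deg_i, deg_j)) and delta < bound_gamma
    cond2b = delta < 1.0 / max(n_triangles, 1) and delta < bound_gamma
    return cond2, cond2b
```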

And indeed this is a softer condition: out of the 89 edges rewired for the Texas dataset,

around 7% satisfied condition 2B during rewiring, which is still very low。

So we still see that most edges selected during rewiring satisfy neither condition 2 nor condition 2B, which really limits their interpretation as over-squashing edges, because, again,

this condition is needed in order to interpret the final part of the theorem as saying that the edge is an over-squashing edge。

Next, we asked whether this might be a temporal effect, namely that edges that do satisfy the condition are selected earlier, and only later are edges selected that do not。In the figure we show all edges that do not satisfy condition 2B, which are the 83 such edges from the dataset

selected to be rewired; on one axis we have the curvature value and on the x axis the number of three-cycles around the edge, and the dotted line indicates one over the number of three-cycles,

the threshold the edge has to satisfy。Any edge above the dotted line does not satisfy the triangle part of the condition, while anything below the dotted line also fails the condition, but due to the gamma-max part。

We color-coded all edges selected during rewiring according to when they were selected in the rewiring process: the lighter the color, the earlier the edge was selected, and the darker, the later。

We see that the edges that fail the condition are selected throughout the entire rewiring process, so this is definitely not a saturation effect where such edges are only selected at the end of the rewiring;

you see that this occurs during the entire process。So, again, we have edges that do not satisfy the condition, and rewiring these edges does not, strictly according to the theorem, necessarily address over-squashed information, because we cannot use the subsequent result of the theorem since this condition is not satisfied。

Now we can ask where this problem stems from。If we look at all the curvature values of the edges in the Texas dataset,

not only the rewired ones but all edges in the dataset,

we see that minus 2 is indeed the minimal value for the curvature,

but very few edges are actually close to this minimal value。Recall that delta, which indicates how close the curvature is to minus 2, has to be small because it needs to be interpreted as a bound on the Jacobian,

but here we see that very few edges are actually close to this minus-2 bound, which really limits their interpretation as over-squashing edges。Now, of course,

this is not a dataset-specific artifact: we checked multiple datasets that are used to evaluate rewiring methods, and we see that this problem occurs across all of them。

There are some outliers, for example citation networks such as Cora, which do have a higher rate of around 70% of selected edges satisfying the condition,

but in general most datasets only have around 30% of selected edges that satisfy it,

and this is problematic because we might be rewiring edges that are not necessarily over-squashing information according to the theorem。

And for all other datasets we also checked that this is not a temporal effect,

but that it happens at any point during the rewiring。So now we arrive at a sort of contradiction: these rewiring algorithms are used, and better accuracies are actually reported using these methods,

but given the theorem and the fact that the selected edges do not satisfy its conditions,

we can ask what is happening, why are these better accuracies reported, when technically these edges are not necessarily over-squashing。

So we set out to analyze the dependency of all these rewiring techniques on hyperparameters。

We decided to analyze not only Balanced Forman curvature, BFC,

but also other types of curvature, namely Jost-Liu curvature and augmented Forman curvature,

and we also looked at variants of these curvatures, denoted by a subscript;

for example, the subscript 3 denotes Balanced Forman curvature without computing four-cycles, which are computationally more

expensive。We performed a very large hyperparameter sweep over all of these while rewiring the datasets, and we also compared against the performance obtained with no rewiring。Instead of focusing only on the best setup obtained,

we decided to look at the distribution of all results obtained during the hyperparameter sweep; the distribution shown here denotes how often a certain final test accuracy was obtained across hyperparameter setups。

And we see that the rewiring methods do not really differ from the performance that no rewiring offers:

at a distribution level, no rewiring really follows the same performance as rewiring the graph。

There are some outliers as well, but even if we look,

for example, at the top 10 percent, so not all hyperparameter configurations but only the top 10%,

we see very little variation in performance。

And even if we look at the results that are real outliers in this distribution,

the top 10 performing results, we see very little variation there as well,

and rewiring seemingly does not offer that big an advantage over no rewiring。Of course, we tested this on various datasets, so this was not only for Texas; we analyzed it for all the datasets presented

in the papers。It seems that the results obtained from these rewiring methods are more linked to finding the optimal hyperparameter setup than to a real structural shift in the performance distribution due to rewiring,

so it does not seem that rewiring brings the benefits one would expect, and, given that the edges selected during rewiring do not satisfy the condition, there does indeed seem to be a kind of disconnect there。

So there are two real takeaways from our work。One is the mismatch between theory and experimental verification: at no point are we questioning the validity of the theorem from Topping et al., mathematically it is a very nice result linking over-squashing to negatively curved edges; we simply saw that on empirical datasets, when applying these rewiring techniques, the selected edges do not satisfy the condition, which really limits their interpretation as over-squashing information。I think it would be really interesting to see whether these methods could be augmented with a sort of theorem check, to verify that we are actually rewiring edges that are over-squashing information。

The second takeaway concerns the hyperparameters, where we saw a high dependency of the results on hyperparameters,

so it is hard to distinguish the effect of rewiring from the effect of finding the ideal hyperparameter setup in this context。

And finally, I do want to address that the datasets we used are themselves under discussion in the literature, and we are certainly aware of this,

but these datasets were used to evaluate the rewiring methods originally,

so we thought it was interesting to look at these datasets and their performance from a different angle, both from the theoretical perspective and from the hyperparameter perspective;

it is definitely a point to work on, to see how this holds on other datasets。

I want to take the time to thank my collaborators on this work,

and I also want to thank the reviewers from LoG and all the organizers, because this was a really nice review round and we got some really nice suggestions from the reviewers as well。

If you have any questions, feel free to ask them, or you can visit us at booth number two,

or you can find the link to our code on GitHub。I want to thank you all for listening。Okay,

hi, thanks for the great talk。I have one general question about the rewiring part。

I think this is very interesting work, but with rewiring we only focus on the data side, we are trying to reorganize the data, and the thing is we still have the graph neural network, right,

I mean, the performance is determined by both the graph neural network and the data rewiring。

So do you think different kinds of graph neural network architectures, or even using a graph transformer, could somehow influence your experimental results,

Well, regarding the architectures,

I think there are definitely different performances depending on which architecture we use,

but underlying it all the rewiring really aims to find the optimal data structure to apply these architectures to。We had some experiments on other architecture types as well, and we saw the same pattern, that there is no distributional shift from rewiring, so it is not very architecture dependent。

But I think it is interesting to look at, and it is something we were considering as well: there are both rewiring hyperparameters, like how many times we rewire the graph and which edges we select, and learning hyperparameters for the architecture,

like how many layers and what learning rate, and it would be interesting to see the dependency on the different types of hyperparameters, splitting between the learning and the data modification。But indeed it is an interesting point to separate the dataset part from the learning part, for sure。

Oh, thanks for that。Yeah, let me see。

I think there are no further questions。Yeah, okay, thank you。

And I think we can move to the next oral talk, which is Nicholas's。Yeah, can you share your screen。

Okay, so let's welcome Nicholas。

Sorry, it's recorded, oh my god。Yeah, so let's welcome Nicholas,

who is a PhD student from TUM, and today he will introduce his work,

"Expressivity and Generalization: Fragment-Biases for Molecular GNNs"。Yeah, Nicholas,

hi, can you hear me, We can hear you and can see the screen。So hi, I'm Nicholas,

and I'm going to present our paper "Expressivity and Generalization:

Fragment-Biases for Molecular GNNs", which I developed together with my amazing colleagues from the Technical University of Munich。

Okay, so let's first look at the following two graphs, which could represent molecules。

They are obviously not the same, and you don't have to be a chemist to know that they will probably have different chemical properties。

But for traditional GNNs, as you might know, they are indistinguishable。

This problem is known as the limited expressivity of traditional GNNs,

and it has two facets。First, the ability to distinguish non-isomorphic graphs, as in the example, is bounded by the famous Weisfeiler-Leman test。

Secondly, and connected to the first point,

traditional GNNs are blind to almost all substructures; a substructure could be, for example,

a six-ring like in the example above, or a five-ring。

And this is a big problem, as substructures are very important in many different domains,

especially in chemistry and for molecules, which we will focus on。

So how can we deal with this problem and increase expressivity, The first approach is to adapt the model architecture so that the model is able to learn more complicated functions by itself and can learn to recognize substructures。

This approach is followed by so-called higher-order GNNs。

They depart from having just one representation per individual node and instead learn representations for higher-order constructs, like every k-tuple of nodes。

This comes, of course, with an increase in complexity,

which can be quite large and can make them unsuitable for tasks with larger graphs。

On the other hand, we have very well-established theory with which we can accurately describe and relate their expressivity to expressivity measures like the k-WL test。

In terms of expressivity they are better than traditional GNNs,

but especially in terms of substructure recognition

they are still limited and often can only recognize smaller substructures, like smaller rings,

but not a six-ring, which is very important in chemistry。

It has also been shown that they perform quite poorly on generalization tasks,

that is, generalizing to out-of-distribution data, and they are also susceptible to adversarial attacks。

Instead of trying to adapt the model architecture,

we can also just provide more information to the model, and this is done by what we call fragment-biased GNNs。

Here we first fragment the graph into substructures, and this substructure information is then given as an explicit inductive bias to the model。

These models often retain the linear complexity of traditional GNNs,

but we don't really have much theory that could relate or compare existing fragment-biased GNNs to each other,

so we can't say much about their expressivity, and there is also not a lot of work on generalization for fragment-biased GNNs。So in this work we focus on fragment-biased GNNs, with a particular emphasis on expressivity and generalization。

So if we want to build a fragment-biased GNN, we have to answer two main questions。First,

how should a graph be fragmented, that is, what are the substructures we want to give to our model as inductive bias, And secondly,

how are we going to use this substructure information in our model。

Okay, let's first focus on the fragmentation。In all generality, a fragmentation scheme looks something like this: we have a predefined set of substructures, which we call our vocabulary, and the fragmentation scheme then identifies occurrences of the substructures from our vocabulary in the given input graph。

So which substructures should we consider, On the one hand,

we of course want to include all important substructures,

because this gives the model valuable information。

What counts as important obviously depends on the domain and the task at hand; for the chemical domain,

rings could be important, maybe junctions too, and you could think of many other important substructures。

But on the other hand, we don't want to be too fine-grained, because the fragmentation should still facilitate generalization across diverse graph structures。

So even though a particular substructure might be very important for a particular task,

we probably shouldn't include it, because it might only appear once or twice in the complete dataset, which would make generalizing to data from other distributions, or finding similarities between different graphs, difficult。

So we somehow have to find a balance between a fragmentation that is too fine-grained and one that is too coarse-grained。

We propose a rings-paths fragmentation that tries to sit somewhere in the middle。

It works like this: we take a molecular graph, we first extract a minimal cycle basis,

and the remaining edges are then connected to form maximally long uninterrupted paths。With this,

we are able to fragment the complete graph using only two types of substructures,

basically rings and paths。
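A minimal sketch of this idea, assuming an undirected networkx molecular graph, is given below; grouping the leftover edges into connected components is a simplification of the "maximally long uninterrupted paths" construction in the paper, so treat it as an illustration rather than the exact procedure.

```python
import networkx as nx

def rings_paths_fragmentation(G):
    # Ring fragments come from a minimum cycle basis; edges not covered by any
    # ring are grouped into path-like fragments (simplified here as the connected
    # components of the leftover edges).
    rings = [frozenset(cycle) for cycle in nx.minimum_cycle_basis(G)]

    def in_some_ring(u, v):
        return any(u in ring and v in ring for ring in rings)

    leftover = nx.Graph([(u, v) for u, v in G.edges() if not in_some_ring(u, v)])
    paths = [frozenset(component) for component in nx.connected_components(leftover)]
    return rings, paths
```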

Okay, now that we have fragmented our molecular graph,

how are we going to use this information in our model, We first have to encode it。

What is traditionally used in fragment-biased GNNs is a simple one-hot encoding,

which means that similar fragments get completely different encodings, since every fragment type gets its own encoding。

And this might be a problem, as we might suspect that a 5-ring and a 6-ring are actually quite similar, share similar properties, and should get similar encodings。

Additionally, a one-hot encoding only supports a fixed number of fragments。Our ordinal encoding, on the other hand,

is designed with the idea in mind that similar fragments should also get similar encodings。

We achieve this by splitting the encoding into two parts: the first part only encodes the class of the fragment,

so whether it is a path or a ring, and it is shared across all paths and across all rings。

The second part encodes the size of the fragment, and is simply a learned embedding scaled by the size。

With this, similar fragments actually get similar encodings, and the model is able to transfer knowledge between similar fragments,

as we will see later。Additionally, we can now support infinitely many different rings and paths。
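To illustrate the two-part idea, here is a hedged sketch; the dimensionality and the fixed `size_direction` stand in for the learned embedding and are assumptions for illustration only.

```python
import torch

def ordinal_fragment_encoding(frag_class, frag_size, size_direction):
    # Part 1: class of the fragment, shared by all rings / all paths.
    # Part 2: one direction scaled by the fragment size, so a 5-ring and a
    # 6-ring receive nearby encodings.
    class_part = torch.zeros(2)
    class_part[{"ring": 0, "path": 1}[frag_class]] = 1.0
    return torch.cat([class_part, frag_size * size_direction])

# Example usage with a stand-in for the learned size direction:
size_direction = torch.randn(8)
enc_ring5 = ordinal_fragment_encoding("ring", 5, size_direction)
enc_ring6 = ordinal_fragment_encoding("ring", 6, size_direction)
```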

Okay, now that we have encoded our fragment information,

how are we actually going to use it in our model,

There are three natural ways in which this can be done, all of which already appear in the literature。

The first and simplest is to concatenate the fragment information as a node feature onto the existing node features and then apply a normal GNN。

Secondly, we could have a learned representation for each fragment that exchanges messages with the underlying atoms, and vice versa。

And lastly, we could connect neighboring fragment representations to form a higher-level graph in which additional messages are sent。

So which of those approaches ensures the maximal expressivity,

It is not immediately clear whether this higher-level abstraction actually comes with an increase in expressivity, and whether the higher-level graph is actually more expressive than just using node features。

So we need some measure of expressivity with which we can compare them, because it is often difficult to compare the models directly with each other。

We take inspiration from the classical Weisfeiler-Leman test,

which is used for GNNs without any fragmentation,

and we develop corresponding variants that take the additional fragment information into account。

With this, we are actually able to show that the expressivity strictly increases。

And as a nice side effect, we are also able to compare existing fragment-biased GNNs, because most of them can be bounded by one of these tests, and in turn we get a sort of hierarchy

among the existing fragment-biased GNNs。Okay, now that we have answered all the questions,

let's put everything together and build our own model,

which we call FragmentNet。Given a molecular graph, we first apply our rings-paths fragmentation。

Then we use our ordinal encoding to get representations for each fragment。

And for message passing, we use a higher-level graph to ensure maximal expressivity。

So we have in total four kinds of messages: messages on the original graph,

messages from the higher-level to the lower-level graph,

messages the other way round, and messages on the higher-level graph。
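A minimal sketch of one round of these four message types is shown below; the module names, the sum aggregation, and the ReLU messages are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FourWayMessagePassing(nn.Module):
    # One round of: atom->atom (molecular graph), atom->fragment,
    # fragment->atom, and fragment->fragment (higher-level graph).
    def __init__(self, d):
        super().__init__()
        self.mix = nn.ModuleDict({k: nn.Linear(2 * d, d)
                                  for k in ["aa", "af", "fa", "ff"]})

    def _pass(self, key, h_src, h_dst, edges):
        src, dst = edges  # (2, num_edges) long tensor of (source, destination) indices
        msg = torch.relu(self.mix[key](torch.cat([h_src[src], h_dst[dst]], dim=-1)))
        return h_dst + torch.zeros_like(h_dst).index_add(0, dst, msg)

    def forward(self, h_atom, h_frag, atom_edges, membership, frag_edges):
        # membership: (2, num_memberships) with rows (atom index, fragment index)
        h_atom = self._pass("aa", h_atom, h_atom, atom_edges)           # atom -> atom
        h_frag = self._pass("af", h_atom, h_frag, membership)           # atom -> fragment
        h_atom = self._pass("fa", h_frag, h_atom, membership.flip(0))   # fragment -> atom
        h_frag = self._pass("ff", h_frag, h_frag, frag_edges)           # fragment -> fragment
        return h_atom, h_frag
```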

Okay, so let's come to the empirical evaluation: expressivity,

some common benchmarks, and lastly generalization。

We empirically test the expressivity by the model's ability to count chemically important substructures。

As you can see in this table, our model is able to count simple substructures like this five-ring and this four-path perfectly,

and this should not be a big surprise, as those fragments are actually part of our vocabulary,

so we provide this information directly to the model, and

it is no surprise that it can learn to count them。But we are actually also able to count more complicated substructures that were not part of the original vocabulary,

and this holds not only for the ones shown here but for a wide range of substructures。

And this perhaps points to a solution to the conflict mentioned before:

we don't have to include all important substructures,

we just have to include enough of them so that the model can learn to combine the existing substructures and thereby learn more complicated ones。

It might be a bit like in the German language, where you can combine simple words to build really complex ones, like Donaudampfschifffahrtsgesellschaftskapitän。

Okay, let's come to the empirical performance。Here we focus on the popular penalized logP regression on ZINC and the long-range peptides benchmarks。

FragmentNet is the best among all fragment-biased GNNs, and at least achieves performance comparable to the state-of-the-art model,

which is a graph transformer, GRIT。Okay, lastly, let's come to the generalization abilities。

We test the generalization abilities by removing certain molecules from the training set。

We want to test the ability to generalize to molecules containing completely unseen fragments。

For this, we remove all molecules containing seven-rings from the training set,

while the test set still contains molecules with seven-rings,

so the model has to generalize to molecules containing a completely unseen fragment。

Here our ordinal encoding really helps, and we are able to greatly outperform the state-of-the-art model;

this shows that we can actually transfer knowledge from similar fragments

with our ordinal encoding。And in our paper we not only show that we improve generalization for unseen fragments,

but also that we are better in the setting where molecules contain very rare fragments,

so at the tail of the distribution, and also on molecules from a completely different distribution。

So, in summary, we present FragmentNet, a robust and highly expressive fragment-biased GNN

that retains the linear complexity of traditional GNNs。

We develop theory with which we can compare the expressivity of existing fragment-biased GNNs, and within this theory the higher-level graph achieves

the highest degree of expressivity。We show that our ordinal encoding actually helps generalization,

and we achieve state-of-the-art performance, at least among fragment-biased GNNs。And yeah,

you can find our code and our full paper under this QR code。Thanks for your attention and thanks to the organizers, and now I'm happy to answer questions if you have any。

Yeah, Nicholas, thanks for a great talk。So I have a small question。Recently,

more and more people are talking about foundation models

in the molecular domain, and,

maybe I'm not so familiar with it, but I want to ask:

do you think this kind of FragmentNet is a more suitable way to observe better scaling behavior, because I have seen some papers

working in this direction。Yeah, thanks。I am actually not sure。

I think FragmentNet provides a really strong inductive bias through this additional fragmentation,

and I think it is especially important if your dataset is smaller,

since then the model actually needs this inductive bias; for bigger models

it might not be needed。And we already saw that the transformer,

which doesn't have this strong inductive bias, was actually able to outperform us on those bigger datasets。

Oh, gotcha。Let me see if there are any further questions。I think there are no further questions。

And thanks, Nicholas, for a great talk。Yeah, thank you。

And next, let's welcome Tamara。So, yeah, can you open your mic and share your screen,

yeah。

Yeah, so Tamara is a PhD student from the University of Antwerp,

and today she is going to introduce her very interesting work, a neuro-symbolic framework for answering graph pattern queries in knowledge graphs。

Tamara, can you hear me, Yeah, I think so, yeah,

yeah, okay, I can hear you。Thank you very much for the introduction。Yes, I will be presenting our method for answering graph pattern queries over knowledge graphs。

So this is joint work with Michael Cochez and Daniel Daza from the Vrije Universiteit Amsterdam,

Pablo Barceló, Juan Reutter and Miguel Romero from the Pontificia Universidad Católica de Chile,

and Floris Geerts and me from the University of Antwerp。

I will start by briefly introducing the object of study of our work, knowledge graphs。In knowledge graphs,

the data is represented as entities and relationships between these entities。

in this example, we want to know we might want to retrieve the people who have friends who lives in Santiago。

To do this, there is。Many efficient or not so efficient algorithms。

but we just explore the graph and get the get the。😊,The answer is we need。

so we start from Santiago, we look who lives in Saniago after we look who are friends with these people。

But one problem that we have with this approach is that in practice。

most knowledge graphs tend to be incomplete, so the answers we get by applying these techniques might not be all the answers。

all the relevant answers that we know that we want from the graph。😊。

So this problem has been tackled by the my children and community and the models that they use。

they combine reasoning over knowledge graphs in particular link prediction with in between the evaluation of the query itself。

so actually the train models that can evaluate queries while predicting some some of the possible missing links I will outline very general how these models work。

😊,In general, I have a query like the one I showed you before。

they start from the constants of the queries in this case, the constant Ianttiago, they predict。😊。

The people that lives in San Diego, of course these include the links that are indeed in the graph and the links that might be missing from it after in this case we will do another prediction from this set。

we do an extra prediction to see who are friends with these people。😊。

And then we can answer the query。This method can be summarized or can be described as a bottom up method of evaluating the queries where again we start from the constants and we go up up until the root or the target node of the query。

sadly not all queries can be evaluated in this fashion。

actually just three like queries can be evaluated like this three like queries in particular three like queries that have constants in the leaves and the the only three variable or the target variable sorry in the。

😊,So what do we propose, we propose a strategy or a framework to leverage attach these methods by keeping the essence but allowing them to evaluate a broader classes ofqueries。

😊,So how do we do this and this is specifically what I want to talk with you today。

our premier andra I will present it after I will talk about a bit about the implementation we use the particular implementation and the results and some final remarks。

😊,So in a nutshell the framework andravel given a query。

the first question you ask is you look at the shape of the query you want to tree like if it is tree like then you feed it into a model like the one I described before and you get the answers if it's not tree like then it goes inside of our little unraveling process where from this nonT like query we approximated using tree like queries or tree like approximations and after we fit these approximations into the model and we get the answers。

😊,So what kind of approximations are, of course they are three like and I will now illustrate how to get this。

😊,This approximation, so let's take the triangle query in this case。

our target variable is x and we have relation R S and T。😊,So let's go and travel the triangle query。

We start by the target variable in this case x and we will go nagating the query let me start going from x to set via the t relation so I will go from x to set after from set to y here in this case I'm using the inverse of the s relation and then I go from y to x okay and can I can keep going and this make this query as long as I want after I can do the other direction I go from x to y from y to set and from set to x and this way I have3 like approximation of the triangle query I can do several approximations depending on how how deep I want this approximation to be so for instance here I have the unraveling of depth1 to3 and you can。

You can go on and make it make them as deeper as you want。

So the the nice thing about this and traveling about this approximation is that they enjoy some strong theoretical warranties as the first two theoretical warranties I want to discuss are safety and conservativeness safety means that。

😊,The approximation we are getting our over approximation, which means we don't have。

we don't get false negative every every。😊,So, yeah, we don't get fast negative every。

Every answer to the original query is also an answer to the approximation and the other property is conservativeness is that if we apply this unraveing process to a query that is like we get a query that is equivalent to the original one。

😊,Second and maybe more more important is the optimality for any quality we have that thetra this is the the。

😊,The query that we get by unraveling this query is the best possible over approximation from if you consider any other tree like approximation。

then untraveling is the best one。😊,And finally, one other interesting property is that as I said before。

these are over approximations which have no false negative but might have fast positive but you can go as you go further as you go deeper in these approximations they start refining refining refining and potentially when you go deep enough the approximation might tend to be better again this is。

😊,These are theoretical warranties or theoretical properties of the untraraveling map。😊。

So let's dive into the experiment here we have two main question that we want to address。

the first one is how do these theoretical properties of the unraveling techniques translate into practice and the second one is in general how is the performance of unra in the benchmark in the citizen benchmark query benchmark and knowledge graphs benchmark for complex query answering。

😊,So in a particular implementation we use and we presented it in the paper。

we use as the underlying。complex query answering model we use GNN QE。

GNNQE is neuro symbolmbolical model that uses as the link predictor it uses GNN N VFNes。😊。

And for the other logical operators, it uses face set operators, basically all them。

the operators are done using face sets and then faceologic operations over these face sets。😊。

For the experimental setup following previous work。

we focus to address the performance of the model we focus on ranking metrics this is。😊,The model。

Will give us a score for all possible answers after the answers or the possible answers are ranked according to this scores。

and we calculate the ranking metrics such as the neurociprocal rank or the hitet K by comparing this with the ground through。

let's say with the real answers of the quedries。😊,And for the query sets, we also。

Include the traditional tree like queries used for the benchmark for complex query answering。

but also and as we are more focused on the approximation schema。

we include some cyclic queries in particular we include nine cyclic patterns that would introduced in a paper from last year from the real first or the logic query and we also introduce fresh new cyclic queries which in this case have one special property that they don't have constant in between the query。

let's say everything is existentially quant either existentially quantified or the target variable。

😊,So we evaluate our model on these three sets ofqui。What the experimental result says。

well the first thing we want to know is how the things go from theory to practice in particular。

this property I talk about you。😊,I told Europe about some minute ago about the depth of the unrivalings。

theoretically the deeper we go the better the approximation should be。

but we see in practice and especially when focusing on this ranking metrics that this is not the case we did some further exploration and we see two possible explanations for this。

the first one is that in the benchmark knowledge graphs。

we don't tend to see the cases where we actually need longer。😊。

Longer approximations or longer queries to get the real answers this is one and the second come from one inherentrant weakness of several complex query answering models and in particular the one we use in our implementation DNA and Qe is that the longer the query the performance tends to get a bit worse this is because you start concatednating successive projections this is success in predictions and the errors start proagating。

😊,Along the query, which can。Counterbalence, the benefit one might get by using longer approximation so we series for the in particular here I'm showing the result for the triangle query and the square query on the data set on a free waste data set。

😊,Okay, so now let's move into the the real numbers the real performance of。

Of unravel for cycling queries, we see that in in the。😊。

To the rest of our knowledge first paper that addresses thecyclic query problem which was published last year and they also introduced these data sets。

we evaluate our strategy in these data sets and we see that for two out of the three benchmark data sets in this case for the three waste ones。

our model outer forms,😊,The other model of the model before was the state of the art forcycl queries and remains very competitive also for the nail data set。

😊,Moreover more, sorry for the cyclic queries without constant。

we see that our model in general outperforms two baseline that we that we consider in this case。

it allperformance on pretty much all of the queries except for one query maybe or so on the nail data sets so we see that。

😊,That actually our framework is a viable strategy for answering pschic quis。

which is the main purpose of the work.

As final remarks, I want to state that Unravel provides a framework that takes existing models, which have the limitation of only being able to evaluate tree-like queries, and allows them to evaluate a broader class of queries, and that it is actually a viable strategy: instead of hand-crafting solutions or moving to more expensive models to evaluate cycles, one can use these approximations with really nice performance.

So thank you very much for your time and your attention.

I will be happy to answer any of your questions.

Yeah, it's a great talk; it opens up new directions, or new sub-directions, in knowledge graph query answering. So I have a very quick question: for long queries, what is the time efficiency across those tasks? It seems very expensive, especially for the higher-order problems; I think NBFNet in particular hits this problem quite often, so I'm very curious about that part.

Yes, well, in general I think the efficiency cost comes from two ends. One is the projection itself, let's say the link prediction, and the other is actually evaluating the query. One advantage of models like the one I introduced is that they operate over sets, so they don't have to instantiate every single entity and do the link prediction one by one; they operate over sets. That is why this might be an interesting framework: you take the complex query, make it tree-like, and then you can exploit the efficiency of working with sets instead of addressing every single entity when doing the predictions and the projections.

Okay, thank you. Yeah, let me check if there are other questions here. Oh, I think there are no further questions. Thanks for the great talk; we can discuss in the posters afterwards anyway. Thank you. Yeah, okay, thank you.

Okay, so thanks everyone for attending. This is the end of this oral talk session, and the next part will be a tutorial about heterophilic graphs, which will start half an hour later. Okay, thanks for attending.

Hello. Yeah. So hello everyone, welcome to the LoG conference and the tutorial on heterophilic graph learning. I'm Xiaoxin and I will be the host for this session. Today we are excited to have a fantastic lineup of speakers who will present the latest progress in heterophilic graph learning; now let's welcome our first speaker.

Okay, so I can hear some echo. Okay, so now let's welcome our first speaker, Sitao, to present the introduction and background knowledge. Sitao is a PhD student at McGill; he is working on graph representation learning and AI for science. So, Sitao, you can start with your presentation.

Okay, thank you for the introduction. Hello everyone, my name is Sitao, I'm a postdoc at Mila. Today I'm very happy to give this tutorial on heterophilic graph learning together with two brilliant speakers, Jinhua and Qincheng. I would also like to give special thanks to our co-assistant Jiaqi and our advisors.

So, what is heterophily and why should we care about it in graph learning? Before we introduce heterophily,

we need to know the concept of homophily. Homophily, from ancient Greek, means "same", "common", or "friendship, love". It is a concept from sociology and evolutionary biology, and it states the tendency that individuals with similar characteristics find it easier to communicate and bond with each other.

There are many examples of the homophily phenomenon in the world, for example "birds of a feather flock together" or assortative mating. In real life there are also homophily phenomena; for example, in social media, people with similar ideology, religion, interests, education background, race and ethnicity are more likely to connect with or follow each other.

So what is homophily in graph learning, or more specifically, homophily in message passing?

啊。It isInvest passing homely is a principle or assumption that is implicitly imposed in our v passing process。

as shown in this figure。For the indiscernable boundary node, if we have a homophiic graph structure。

this kind of structure will provide extra useful information into the aggregated node features over the original node features。

So that the indistinguishable will become distinguishishable after the measured passing。

so such kind of relational inductive bias is thought to be a major contributor to the superiority of GNs over traditional neural networks。

or various tasks, especially on node level tasks。So hyperphi means lack a homophily or low homophily。

More specifically, nodes with different labels, from different classes, are more likely to be connected. If we do message passing on such a graph structure, nodes from different classes are more likely to be connected and their features are more likely to be mixed, so that nodes from different classes become indistinguishable, and this makes it harder for the GNN to classify them.

There are also some empirical observations. It was found that some graph-aware models underperform their corresponding graph-agnostic models on some datasets, for example Cornell, Wisconsin, Texas, and Film: a simple MLP can outperform baseline GNNs such as GCN, GAT and GraphSAGE (a minimal sketch of this graph-aware versus graph-agnostic comparison is shown below).
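The following is a minimal, self-contained sketch (not the presenters' code) of the kind of comparison being described: train a one-hop GCN-style model and an MLP with the same hidden size on the same node features and labels, and check whether using the graph helps. The random data, sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, c = 200, 16, 3                       # nodes, feature dim, classes (assumptions)
x = torch.randn(n, d)
y = torch.randint(0, c, (n,))
A = (torch.rand(n, n) < 0.05).float()      # toy random adjacency
A = ((A + A.t()) > 0).float() + torch.eye(n)
A_hat = A / A.sum(dim=1, keepdim=True)     # row-normalized adjacency with self-loops

class GCN(torch.nn.Module):                # graph-aware: aggregates neighbors
    def __init__(self):
        super().__init__()
        self.lin1, self.lin2 = torch.nn.Linear(d, 32), torch.nn.Linear(32, c)
    def forward(self, x):
        h = torch.relu(self.lin1(A_hat @ x))
        return self.lin2(A_hat @ h)

class MLP(torch.nn.Module):                # graph-agnostic: ignores the graph
    def __init__(self):
        super().__init__()
        self.lin1, self.lin2 = torch.nn.Linear(d, 32), torch.nn.Linear(32, c)
    def forward(self, x):
        return self.lin2(torch.relu(self.lin1(x)))

def accuracy(model):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return (model(x).argmax(1) == y).float().mean().item()

print("GCN:", accuracy(GCN()), "MLP:", accuracy(MLP()))
```

On a homophilic graph the graph-aware model typically wins this comparison; on a malignantly heterophilic graph the MLP can win, which is exactly the diagnostic the tutorial discusses.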

And heterophily is considered one of the main reasons for this kind of performance degradation: the datasets marked in red were found to have low homophily values. This means there do exist "bad" benchmark graphs, and we need to be careful about them.

So, can we recognize those bad graphs with a scalar metric? Which set of graphs are the really bad graphs, and how can we categorize them? In the next two sections, Qincheng will introduce these two topics. Let's welcome Qincheng.

I will stop sharing here. Thank you, Sitao. In this section we will introduce homophily metrics.

First, we have metrics based on graph-label consistency: edge homophily, node homophily, class homophily and adjusted homophily. These definitions are all based on linear, feature-independent graph-label consistency, so if the metric value is small it indicates inconsistency between the graph and the labels, and the graph structure is expected to have a negative effect on the performance of GNNs. For example, here we give the definitions of edge homophily and node homophily: edge homophily measures the proportion of edges that connect two nodes in the same class, and node homophily evaluates the average per-node proportion of neighbors with the same label (both are sketched in code below).
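As a minimal illustration of these two definitions (a sketch under the usual conventions, not the presenters' code), edge homophily and node homophily can be computed from an edge list and a label vector as follows:

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    edges = np.asarray(edges)
    return float(np.mean(labels[edges[:, 0]] == labels[edges[:, 1]]))

def node_homophily(edges, labels, num_nodes):
    """Average over nodes of the fraction of same-label neighbors."""
    same = np.zeros(num_nodes)
    deg = np.zeros(num_nodes)
    for u, v in edges:                       # treat each pair as a directed edge u -> v
        deg[u] += 1
        same[u] += float(labels[u] == labels[v])
    mask = deg > 0                           # ignore isolated nodes
    return float(np.mean(same[mask] / deg[mask]))

# Toy usage: a 4-node graph with labels [0, 0, 1, 1].
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
labels = np.array([0, 0, 1, 1])
print(edge_homophily(edges, labels), node_homophily(edges, labels, 4))
```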

Then we also have similarity-based metrics. These metrics are based on some measurement of local similarity over node features; for example, generalized edge homophily uses the cosine similarity between node features.

So far, the metrics we have introduced are all based on pairwise comparison. The third type is based on neighborhood identifiability, or informativeness. These metrics are based on the neighborhood distribution instead of pairwise comparison; they are nonlinear and independent of node features. An example is label informativeness, which distinguishes different connectivity patterns by measuring how informative the labels of the neighbors are about the label of a node.

The last type is hypothesis-testing-based performance metrics; for this type, please find the reference here for details. Basically, it uses the p-value of a hypothesis test as the metric, measuring the node distinguishability of the aggregated features compared with the original features. This metric is the first one that can capture nonlinear, feature-dependent information.

So we now have various kinds of homophily metrics, which are basically proposed to identify harmful heterophilic datasets.

Next, we would like to compare them. The purpose is to see whether these metrics can identify good or bad graphs. The approach is to use the performance of baseline GNN models on synthetic graphs, since synthetic graphs can be generated at various homophily levels. For each homophily level we can then check whether the homophily metrics agree with the GNN performance: if the metric values agree with the GNN performance, we can say it is a good metric.

Specifically, the steps are summarized here. First, we generate synthetic graphs with various homophily levels. Then, for each generated graph, nodes are randomly split into training, validation and test sets. Next, we train the baseline models on the graph and calculate the metric values for each graph. Finally, we compare the curves of the metric values and of the GNN performance with respect to the homophily level, to see whether there is an agreement between the metrics and the GNN performance.

In this comparison we generate synthetic graphs using the following frameworks: the regular graph, preferential attachment, and GenCAT. The first one, the regular graph, is generated according to a target homophily level and a base dataset: for each node it uniformly generates some random intra-class edges and some inter-class edges, and node features are sampled from the corresponding class of the base dataset. The preferential attachment method uses a coefficient to control the probability of creating intra-class edges and samples node features from overlapping 2D Gaussians. The last one, the GenCAT model, generates graphs based on edge-connection-proportion coefficients and a base dataset, where a larger coefficient means fewer intra-class edges, and it uses an algorithm to generate node features and edges from the coefficients and the base dataset. A minimal sketch of generating a graph with a controlled homophily level is given below.
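The following sketch is an illustrative simplification of ours, not any of the three frameworks above: given labels, wire each node to a same-class neighbor with probability `h` and to a different-class neighbor otherwise, so `h` directly controls the homophily level.

```python
import numpy as np

def synthetic_edges(labels, h, edges_per_node=5, rng=None):
    """Generate edges so that roughly a fraction `h` of them are intra-class."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    edges = []
    for u in range(len(labels)):
        same = np.where(labels == labels[u])[0]
        diff = np.where(labels != labels[u])[0]
        for _ in range(edges_per_node):
            pool = same if rng.random() < h else diff
            v = int(rng.choice(pool))
            if v != u:
                edges.append((u, v))
    return edges

labels = np.repeat([0, 1, 2], 50)          # 150 nodes, 3 balanced classes
for h in (0.1, 0.5, 0.9):
    e = np.array(synthetic_edges(labels, h))
    print(h, float(np.mean(labels[e[:, 0]] == labels[e[:, 1]])))
```

One can then recompute the homophily metrics and train a small GNN on each generated graph, following the workflow described above.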

We conducted the experiments, and these are the results. Here each column corresponds to a generation method. The first row shows the GNN performance, where the green and blue lines are GCN and SGC, and the second row shows the metric values; in each plot the x-axis represents the homophily level. Then, for each generation method in a column, we compare the GNN performance curve in the upper figure with the metric curves in the lower figure.

From these curves we have several observations, and we conclude that current homophily metrics are not good enough, because when we compare the GNN performance curves and the metric curves, the correlation between the different metrics and the GNN performance differs across generators. For example, on the regular graphs in the first column, the GNN performance curve in figure (a) is U-shaped, so a good homophily metric should also have this kind of U-shaped curve. When we check the metric curves in figure (d), for the regular graphs aggregation homophily is probably the best one; however, on the other two types of synthetic graphs the agreement between aggregation homophily and the GNN performance is not as good. Therefore we still need to find a good homophily metric, and an ideal homophily metric should show the same correlation with GNN performance on graphs generated by all three methods. So, in future research, a newly proposed metric should be tested on synthetic graphs from all three generation methods to get a comprehensive evaluation and comparison.

Here we provide the repository for the synthetic experiments; it can be accessed by scanning the QR code. We also built a Colab notebook, and here we show how it works. For each generation method the workflow is the same, and here we use the regular graph as the example. First, we choose the coefficients which control the homophilic patterns of the graph, and we use the Cora dataset as the base dataset. It will generate 10 graphs using the chosen coefficients. Then we evaluate all the metrics on these graphs, and finally we get a plot showing all the metric values, where the position of each circle stands for the mean metric value and the size of the circle stands for the variance.

In the next section we discuss benchmark datasets. In the previous section we found that current homophily metrics are not good enough, so for distinguishing good graphs from bad graphs, the only standard we have for now is the actual performance on the datasets. That means we compare the performance of so-called graph-aware models and graph-agnostic models to know whether the graph structure is beneficial or not.

So, in this section we summarize and categorize the popular benchmark datasets for heterophilic graph learning. Here, for example, GCN is a graph-aware model and its corresponding graph-agnostic model is MLP-2, since the only difference between GCN and MLP-2 is whether the graph is used; similarly, the corresponding graph-agnostic model for SGC-1 is MLP-1. We examine 27 popular datasets, compute the edge and node homophily on them, and report the performance of baseline graph-aware and graph-agnostic models. We say that if the graph-aware model performs better than the graph-agnostic model, then the graph structure is beneficial for the GNN. Based on this result,

we can observe three kinds of patterns among heterophilic datasets. For the first category, the edge homophily and node homophily are very low, and the graph-aware model performs worse than the corresponding graph-agnostic model. These are the most difficult datasets for GNNs, so we call them malignant heterophily: the graph structure provides harmful information in the feature aggregation step. For the second category, the graph-aware model performs better than the graph-agnostic model, but the node and edge homophily values are still very small, so we still regard them as heterophilic datasets and call them benign heterophily: in this case, heterophilic aggregation is actually beneficial for GNNs. Now we look at the third type, where the graph-aware and graph-agnostic models give inconsistent comparison results; for example, on the filtered Squirrel dataset, MLP-2 performs better than GCN, but SGC-1 performs better than MLP-1. We call these ambiguous heterophily datasets: the underlying synergy between graph structure and model nonlinearity influences the GNN performance together.

Here we summarize all these results. Our tutorial repository also provides code to train these heterophily-specific GNNs on the 27 datasets.

Let's go through the workflow together. First, we choose a GNN model from the menu; for example, we choose ACM-GCN. Then, on the left panel of this page, it shows the step where we choose the datasets, and the right panel shows that we are training the model on these datasets. Next, we will introduce heterophily-specific models.

Okay, thanks, Qincheng. In the next session I will introduce some popular methods to address the heterophily problem.

We mainly summarize the ten most popular kinds of methods. The first one is ego–neighbor separation: we encode the ego embedding and its aggregated neighborhood embedding separately, because they are likely to be dissimilar in the heterophily setting. Zhu and others proved that concatenation is a better combination function than averaging for generalization on heterophilic graphs, and they proposed H2GCN with this ego–neighbor separation (a minimal sketch of such a layer is given below).
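A minimal sketch of the ego–neighbor separation idea (illustrative, not the H2GCN implementation): the node's own representation and its mean-aggregated neighborhood representation are transformed separately and concatenated instead of averaged together.

```python
import torch

class EgoNeighborLayer(torch.nn.Module):
    """Encode ego and aggregated-neighbor embeddings separately, then concatenate."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.ego = torch.nn.Linear(in_dim, out_dim)
        self.nbr = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: row-normalized adjacency WITHOUT self-loops,
        # so the neighbor summary does not mix in the ego features.
        h_ego = self.ego(x)
        h_nbr = self.nbr(adj_norm @ x)
        return torch.cat([h_ego, h_nbr], dim=-1)   # concatenation, not averaging

# Toy usage with 4 nodes and 8-dimensional features (shapes are assumptions).
x = torch.randn(4, 8)
A = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
adj_norm = A / A.sum(dim=1, keepdim=True).clamp(min=1)
print(EgoNeighborLayer(8, 16)(x, adj_norm).shape)   # torch.Size([4, 32])
```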

The second method is signed message passing. Before we introduce it, we need two concepts: frequency and low-pass filtering. The frequencies are the eigenvalues of the graph Laplacian: small eigenvalues correspond to smooth eigenvectors and large eigenvalues correspond to non-smooth eigenvectors. A filter is called low-pass if it does not significantly affect the low-frequency, smooth part of the signal but attenuates the magnitude of the high-frequency, non-smooth components. The neighborhood aggregation step, i.e., multiplication with the normalized adjacency matrix, can be seen as a special form of low-pass filter which captures the low-frequency, smooth information in the input node features. However, Bo and others found that high-frequency information is also important to capture the differences between nodes, and that especially in the heterophilic setting we need to combine the low-frequency and high-frequency signals. They proposed FAGCN, which is quite similar to GAT but allows the attention scores to be negative, so that negative edge weights can propagate the high-frequency information. Also,

Chien and others use generalized PageRank, which associates a learnable weight with each propagation step at different hops; the weights can be positive or negative, and their signs can adapt to heterophilic and homophilic structures. Li and others proposed GloGNN, which estimates a coefficient matrix to model the relative importance of nodes; the coefficient matrix Z is learnable and can be obtained by solving an optimization problem on local feature similarity and multi-hop graph structure similarity. This Z matrix is different from a simple attention or self-attention mechanism: it allows negative values and is more efficient to compute, so it can address the heterophily problem (a minimal sketch of signed message passing follows below).
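A minimal sketch of the signed message passing idea (illustrative; not FAGCN's exact architecture): edge weights are produced by a learnable score passed through tanh, so they can be negative, and negative weights push neighbors' features apart instead of averaging them together.

```python
import torch

class SignedPropagation(torch.nn.Module):
    """Message passing with learnable edge weights in [-1, 1]."""
    def __init__(self, dim):
        super().__init__()
        self.score = torch.nn.Linear(2 * dim, 1)

    def forward(self, x, edge_index):
        src, dst = edge_index                      # edges as (source, target) index tensors
        w = torch.tanh(self.score(torch.cat([x[src], x[dst]], dim=-1)))  # signed weight per edge
        out = torch.zeros_like(x)
        out.index_add_(0, dst, w * x[src])         # sum of signed messages into each target node
        return x + out                             # keep ego features, add signed aggregation

# Toy usage: 3 nodes, 4 directed edges (shapes are assumptions).
x = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
print(SignedPropagation(8)(x, edge_index).shape)   # torch.Size([3, 8])
```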

The third method is to use a high-pass filter directly, together with a node-wise channel mixing mechanism. It has been shown that high-pass filtering is effective in certain heterophilic situations. Luan and others also found that different nodes can have different local heterophily levels: as shown in this figure, some nodes have a very low local homophily level while others have a very high one, and these distributions differ across datasets. Therefore different nodes may need information from different channels: some need the low-pass channel and some the high-pass channel. They developed the adaptive channel mixing mechanism, which includes low-pass, high-pass and identity channels together in each GNN layer, with a learnable node-wise mixing that combines the channel information adaptively for each node (sketched below).
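A minimal sketch of the adaptive channel mixing idea (illustrative; not the exact ACM-GCN implementation): a low-pass channel aggregates neighbors, a high-pass channel takes the difference between a node and its neighbors, an identity channel keeps the raw features, and per-node softmax weights mix the three.

```python
import torch

class AdaptiveChannelMix(torch.nn.Module):
    """Node-wise mixing of low-pass, high-pass and identity channels."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.low = torch.nn.Linear(in_dim, out_dim)
        self.high = torch.nn.Linear(in_dim, out_dim)
        self.iden = torch.nn.Linear(in_dim, out_dim)
        self.gate = torch.nn.Linear(3 * out_dim, 3)

    def forward(self, x, adj_norm):
        h_low = self.low(adj_norm @ x)             # low-pass: smooth over neighbors
        h_high = self.high(x - adj_norm @ x)       # high-pass: difference from neighbors
        h_id = self.iden(x)                        # identity: keep the node's own features
        alpha = torch.softmax(self.gate(torch.cat([h_low, h_high, h_id], dim=-1)), dim=-1)
        return alpha[:, 0:1] * h_low + alpha[:, 1:2] * h_high + alpha[:, 2:3] * h_id

# Toy usage (sizes are assumptions).
x = torch.randn(5, 8)
A = torch.rand(5, 5).round()
adj_norm = A / A.sum(dim=1, keepdim=True).clamp(min=1)
print(AdaptiveChannelMix(8, 16)(x, adj_norm).shape)   # torch.Size([5, 16])
```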

Here is a visualization of the outputs. The input features are pretty noisy, with no visible pattern, and if we feed them to a GCN the output still has no clear boundaries between classes. However, if we feed them into ACM-GCN, the low-pass channel still shows no clear pattern, but in the high-pass and identity channels some patterns already start to emerge, and through the node-wise channel mixing, in the final output we can see very clear boundaries between the different classes, especially compared with the output of the original GCN. Through an ablation study, ACM is demonstrated to boost the performance of the baselines, and it can surpass the SOTA models on almost all datasets.

The fourth method is selective message passing: we impose a selection mechanism in the message passing framework so that the model can learn to retain the useful, relevant information and discard the detrimental parts of the node embeddings. For example, in GBK-GNN the authors use two kernels in the GNN, one for homophilic node pairs and the other for heterophilic node pairs, plus a learnable gate implemented by an MLP which decides which of the two kernels should be applied for each node pair. In FSGNN, the authors use softmax as a regularizer and soft selector for the features aggregated from neighborhoods at different hop distances. And this year, Finkelshtein and others proposed cooperative GNNs, in which each node can choose different actions from an action set, including listen, broadcast, listen-and-broadcast, and isolate, so different nodes have the freedom to choose what they want to do in message passing, which gives more flexibility and more expressive power to the message passing framework.

The fifth method is spectral GNNs. A spectral GNN is a type of network built on graph signal filtering in the spectral domain; people try to develop more expressive spectral filters to enhance the aggregation function for better performance on heterophilic graphs. For example, He and others proposed BernNet, which uses Bernstein polynomials to learn arbitrary graph spectral filters; Wang and others proposed JacobiConv, which uses the Jacobi polynomial basis thanks to its orthogonality and flexibility to adapt to a wide range of weight functions; and He and others proposed ChebNetII, which fixes ChebNet with Chebyshev interpolation, enhancing the original Chebyshev polynomial approximation and reducing the Runge phenomenon (a minimal polynomial-filter sketch follows below).
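A minimal sketch of a learnable polynomial spectral filter (illustrative; using the plain monomial basis rather than the Bernstein, Jacobi or Chebyshev bases named above): the output is a learnable combination of successive propagation steps with the normalized adjacency, and negative learned coefficients act like high-pass terms.

```python
import torch

class PolyFilter(torch.nn.Module):
    """y = sum_k theta_k * (A_hat^k x) with learnable coefficients theta_k."""
    def __init__(self, order=3):
        super().__init__()
        self.theta = torch.nn.Parameter(torch.ones(order + 1) / (order + 1))

    def forward(self, x, adj_norm):
        out = self.theta[0] * x
        h = x
        for k in range(1, len(self.theta)):
            h = adj_norm @ h                 # one more propagation step
            out = out + self.theta[k] * h
        return out

# Toy usage (sizes are assumptions).
x = torch.randn(6, 4)
A = torch.rand(6, 6).round()
adj_norm = A / A.sum(dim=1, keepdim=True).clamp(min=1)
print(PolyFilter(order=3)(x, adj_norm).shape)    # torch.Size([6, 4])
```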

The sixth method is to go beyond local information. The logic is: since the local information is uninformative, or even harmful, why not try the information from distant nodes? There are mainly two ways to encode distant information. The first is multi-hop information, which we can obtain with deep GNNs such as JKNet or APPNP. There is also a special method called LINKX: it is pretty simple, it encodes the adjacency matrix with an MLP, and the authors show it can capture the monophily information discussed in social science, so it can be considered a kind of multi-hop method (a minimal sketch is given below).
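A minimal sketch of the LINKX-style idea (illustrative and simplified relative to the actual LINKX model): one MLP embeds each node's row of the adjacency matrix, another embeds the node features, and the two are combined before a final classifier.

```python
import torch

class LinkxLike(torch.nn.Module):
    """Separately embed adjacency rows and node features, then combine."""
    def __init__(self, num_nodes, feat_dim, hidden, num_classes):
        super().__init__()
        self.mlp_a = torch.nn.Sequential(torch.nn.Linear(num_nodes, hidden), torch.nn.ReLU(),
                                         torch.nn.Linear(hidden, hidden))
        self.mlp_x = torch.nn.Sequential(torch.nn.Linear(feat_dim, hidden), torch.nn.ReLU(),
                                         torch.nn.Linear(hidden, hidden))
        self.out = torch.nn.Sequential(torch.nn.ReLU(), torch.nn.Linear(hidden, num_classes))

    def forward(self, adj_dense, x):
        return self.out(self.mlp_a(adj_dense) + self.mlp_x(x))

# Toy usage (dense adjacency only for illustration; sizes are assumptions).
n, d, c = 10, 8, 3
adj = torch.rand(n, n).round()
x = torch.randn(n, d)
print(LinkxLike(n, d, 16, c)(adj, x).shape)       # torch.Size([10, 3])
```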

Another way to encode global information is to connect distant nodes in a latent space, that is, to rebuild the neighborhood using latent node embeddings. For example, in Geom-GCN, Pei and others precompute unsupervised node embeddings in a latent space, redefine geometric relations, and reconnect nodes, so that some distant nodes can be directly connected. However, this year Xin and others found that we cannot increase the receptive scope blindly, because distant nodes are not always beneficial on heterophilic graphs.

For example, in this figure the curves show the proportion of k-hop neighbors that share the same label as the ego node. The Actor dataset, the red curve, is a heterophilic graph: among the one-hop neighbors, only around 20% of the local neighbors share the ego node's label, and when you increase the receptive field to k-hop neighbors the proportion stays almost the same. That is, multi-hop neighbors are not necessarily better, so we cannot just blindly increase the receptive scope; we need some adaptive mechanism for the receptive field. Therefore Lu and others proposed flexible diffusion scopes with PD-GNN: they introduce a new class of parameterized Laplacian matrices which offer more flexibility in controlling the diffusion distance between nodes than the conventional graph Laplacian, allowing long-range information to be captured adaptively through diffusion on the graph.

The seventh method is graph structure learning. Since GNNs are highly sensitive to the quality of the given graph, why not just optimize the graph structure so that the GNN can learn on a good graph instead of a bad one? There are two main settings for graph structure learning: precomputed graph rewiring and end-to-end structure learning. In the precomputed setting, graph structure optimization and model training are two separate processes and are not integrated into a single differentiable pipeline; in end-to-end learning, the structure optimization is inserted into the model training pipeline and the whole pipeline is differentiable. For precomputed graph rewiring, people usually define a metric to guide the rewiring. For example, Gong and others propose homophily-oriented rewiring: they define a neighborhood homophily value which measures the label consistency in a neighborhood, and neighbors are grouped and aggregated based on this metric. Similarly, Li and others proposed a distance-oriented rewiring method, others proposed a similarity-oriented rewiring method, and Troy and others proposed an information-theory-based method to do the graph rewiring.

To do the graph very。So for the end to end structure learning。

Zhao and others used a neural edge predictor which can learn to promote intraclass edges and demo intraclass edges during the model training process。

And Wu and others propose a probabilistic based method。

he and others proposed the causal inference based training and Ye and others proposed an edge requirement for the structure learning。

So although it is found effective in some scenarios。

there is also some controversies around graph structure learning, for example。

Joehou and others inherently find that there' is no significant correlation between the homophily of the learn graph and the performance of the tasks which challenge the common belief that higher homophily can lead to better performance。

And Zg and others provide a theoretical analysis showing that the similarity based re method provides no information being on those specific tasks and questions the effectiveness and necessity of infrastructure learning。

The eighth method is the physics-informed methods, which incorporate physical constraints or physical laws into graph learning, for example repulsive or attractive forces; such physical laws can help model the complicated interactions between nodes, including heterophilic relations. For example, in Allen-Cahn message passing the authors use the Allen-Cahn force, which is associated with a particle system influenced by both attractive and repulsive forces. There are some other methods: Choi and others proposed GREAD, which includes a reaction-diffusion layer combining standard reaction equations from the natural sciences with additional reaction terms; Zhao and others proposed convection-diffusion equations, which use a diffusion term for homophilic neighbors and a convection term for heterophilic message propagation; and Park and others proposed a reverse heat diffusion process, using the reverse of heat diffusion to learn node states at past time steps, which are far from the equilibrium states of the diffusion process.

The ninth method is enhanced information diffusion, that is, boosting the expressivity of GNNs by modifying the information diffusion process in the aggregation function with more advanced, sophisticated mechanisms. For example, in Ordered GNN the authors align the hierarchy of the rooted tree of a central node with ordered neurons in the representation, making specific blocks of neurons targeted for message passing within specific hops. In HalfHop, the authors use an edge up-sampling method that adds slow nodes on each edge to mediate the communication between a source and a target node. There are other methods as well; for example, Bodnar and others propose a sheaf-based convolution that uses a richer sheaf diffusion process, and Micheli proposed the Graph Echo State Network, which uses the reservoir computing paradigm: it recursively fuses input and latent neighborhood features with randomly initialized, untrained input-to-reservoir and reservoir recurrent weights.

The tenth method is graph transformers. The self-attention in a graph transformer essentially reconstructs a learnable, fully connected graph structure; transformers have been shown to capture long-range dependencies between nodes and can be more effective than message-passing networks on some heterophily problems. However, it has been empirically found that there is still a big gap between the performance of graph transformers and the SOTA GNNs on the heterophily problem. There are some efforts that try to make graph transformers work on heterophilic graphs. For example, Bo and others proposed Specformer: they encode the range of eigenvalues via positional encoding to capture both the magnitude and the relative difference of frequencies, and they design a decoder with a bank of learnable bases to capture high-frequency graph signals. Lee and others proposed MPFormer, where they aggregate the center node and the different hops of neighborhood information, treating the node features as tokens and serializing the tokens as sequences, which captures the distinction between the ego node and the neighborhood nodes.

There are also some other methods, for example compatibility-matrix-based methods. A compatibility matrix models the connection probabilities between each pair of classes; it is a fine-grained class-wise relation, and it is quite powerful when node features are incomplete. Zhu and others proposed CPGNN, which models the label correlations through a compatibility matrix and propagates a prior belief estimation into the GNN using that matrix. Zhong and others proposed a method that uses a constraint-enhanced compatibility matrix to construct the desired neighborhood messages (a minimal sketch of estimating a compatibility matrix is given below).
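A minimal sketch of what a compatibility matrix is (illustrative; not CPGNN's training procedure): estimate, from the labeled edges, the probability that a node of class i connects to a node of class j.

```python
import numpy as np

def compatibility_matrix(edges, labels, num_classes):
    """H[i, j] = empirical probability that an edge from class i points to class j."""
    H = np.zeros((num_classes, num_classes))
    for u, v in edges:
        H[labels[u], labels[v]] += 1
    row_sums = H.sum(axis=1, keepdims=True)
    return H / np.clip(row_sums, 1, None)

# Toy usage: a heterophilic pattern where class 0 mostly links to class 1.
edges = [(0, 2), (1, 3), (2, 0), (3, 1), (0, 3)]
labels = np.array([0, 0, 1, 1])
print(compatibility_matrix(edges, labels, 2))
```

In the learnable setting discussed in the Q&A below, a matrix of this shape is estimated during training rather than computed from ground-truth labels.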

Another interesting direction is to use edge directionality. Rossi and others show that treating the edges as directed increases the effective homophily of the graph, suggesting a potential performance gain from the correct use of directionality information. The logic behind it is that if we have an edge from A to B, it is possible that we do not have an edge from B to A, so we should allow this kind of flexibility in our message passing instead of only assuming an undirected, symmetric connection between A and B. They design Dir-GNN, which accounts for the edge directionality by separate aggregations of the incoming and outgoing edges (sketched below).

Luan and others conducted a fair and comprehensive evaluation of almost all of the above popular methods, and from their experimental results only high-pass filtering with adaptive channel mixing and selective message passing were verified to be really effective for heterophily. For many of the other SOTA GNNs, a lot of them just sacrifice their ability on homophilic graphs to achieve relatively better performance on heterophilic graphs, for example H2GCN, GPR-GNN, BernNet, LINKX and GGCN, and this trade-off implies that those methods are not universally effective. They also observe some serious scalability issues: for example, GGCN, GBK-GNN and FSGNN suffer from severe out-of-memory problems, which means we should consider the computational complexity when designing a heterophily-specific model.

Any questions on this section? We will also have a question session after the tutorial.

Hi Sitao, do you mind turning back to page 46? I have a question about the compatibility matrix: how do you define it? Is it a learnable matrix or a heuristic one?

I think it is learnable; it is estimated through the learning process. And since we have the labels, we also have a ground-truth one; I remember that in the original paper the authors compared the ground-truth compatibility matrix with the learned one and found that it can indeed be learned from the data.

Okay, got it, thank you. Are there any other questions from the audience? If not, let's go ahead.

Okay. So in the following section I will introduce the theoretical analysis around homophily and heterophily, which mainly investigates how homophily and heterophily impact the behavior of GNNs.

Let's start from the shortcomings of the homophily metrics. The old homophily metrics mainly capture graph-label consistency, that is, they measure the proportion of edges that connect nodes from the same class. However, metrics based on such a principle fail to explain some low-homophily cases. For example, in this figure, message passing on a bipartite graph behaves differently: all the nodes from the red class connect to the blue class and all the nodes from the blue class connect to the red class, so there are zero intra-class edges. Based on the old definition of the homophily metrics, this graph has zero homophily, which would mean it is a very bad graph. However, after one step of aggregation the nodes just switch colors and are still perfectly distinguishable for the classifier. This means graph-label consistency alone is not enough to understand GNN behavior, and we need a different perspective on graph structure and homophily (a tiny numerical illustration is given below).
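A tiny numerical illustration of this point (a toy construction of ours, not from the slides): on a two-class bipartite graph with zero edge homophily, mean aggregation simply swaps the class-wise feature means, so the two classes remain linearly separable.

```python
import numpy as np

# 4 nodes: nodes 0,1 are class "red", nodes 2,3 are class "blue".
labels = np.array([0, 0, 1, 1])
x = np.array([[1.0, 0.0], [1.0, 0.0],      # red nodes have feature (1, 0)
              [0.0, 1.0], [0.0, 1.0]])     # blue nodes have feature (0, 1)

# Bipartite adjacency: every red node connects only to blue nodes and vice versa.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)

h = A_norm @ x                              # one step of mean aggregation
print(h)                                    # red nodes now look like (0, 1), blue like (1, 0)
# Edge homophily is 0, yet the classes are still perfectly separable after aggregation.
```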

Luan and others propose to study homophily from a post-aggregation node-similarity perspective. They define the post-aggregation node similarity matrix, the S matrix, and a new aggregation similarity score. This score mainly calculates the proportion of nodes that are more similar to nodes from their own class than to nodes from other classes, and they define a new metric based on this score. They conduct synthetic experiments to see the agreement between the metrics and GNN performance. For a good metric we expect to see a monotonically increasing curve of GNN performance against the metric: a low metric value should indicate bad GNN performance and a high metric value good GNN performance. However, under the old homophily metrics, for example edge homophily, node homophily or class homophily, we observe a U-shaped curve, which means these old metrics cannot explain GNN performance in the low-homophily area. The newly proposed metric, although it has some variance, looks monotonically increasing, which means it is a better metric than the previous ones for explaining GNN performance.

Another perspective, by Ma and others, is that as long as nodes within the same class share similar neighborhood patterns, their embeddings will become similar after aggregation. For example, nodes 1 and 2 are from the blue class, and they both connect to the orange, yellow and green classes: all their connections are heterophilic, but the heterophilic patterns are the same, so after aggregation their embeddings remain similar and they are still classified into the same class. Therefore they claim that homophily is not a necessity for GNNs to perform well. However, such an analysis has a deficiency: it only considers intra-class node distinguishability and ignores inter-class node distinguishability. For example, suppose we have node 3, which is from the green class, and node 3 has a similar neighborhood pattern to nodes 1 and 2: after aggregation, node 3's embedding will become similar to those of nodes 1 and 2 and it will be classified into the blue class, which is bad. Therefore, Luan and others propose that we need to consider both intra-class and inter-class node distinguishability, and the ideal case is that intra-class node distinguishability is smaller than inter-class node distinguishability; for example, nodes 1, 2 and 4 satisfy this ideal condition. Based on this claim,

the authors propose to quantify node distinguishability on a newly proposed toy model, CSBM-H. In CSBM-H there is a parameter that controls the homophily level of the generated graph, and they compute two metrics to quantify node distinguishability: the probabilistic Bayes error and the negative generalized Jeffreys divergence. Plotting the relation between the homophily level and node distinguishability on this toy model, they find that a medium level of homophily has a more detrimental effect on node distinguishability than extremely low levels of homophily.

They call this the mid-homophily pitfall. Zheng and others find that the current understanding of homophily is still not comprehensive, because all the above analyses only consider the label aspect of graph data, while graph-structured data has three basic elements: the structure information, the node features, and the node labels; so there are two missing aspects to study. Zheng and others disentangle the effect of homophily into three aspects, label, structure and feature homophily, and claim that the synergy between these three components provides a more complete view of the impact of homophily on GNN performance. They define label homophily as the consistency of node labels across the topology, structure homophily as the consistency of node structural information within the same class, and feature homophily as the dependency of the structure-agnostic node features on the topology. Based on these three definitions, they conduct an analysis on a newly proposed graph generation model called CSBM-3H, and derive a new metric called Tri-Hom, which considers all three predefined aspects. Through tests on 31 real-world graph datasets and comparison with 17 existing homophily metrics, Tri-Hom is verified to have a significantly higher correlation with GNN performance than the existing metrics.

All right. Mao and others proposed a fine-grained analysis of homophily. They found that both homophily and heterophily patterns exist within a single graph, and a GNN can outperform an MLP on one pattern but fail on the other. For example, in the figures on the arXiv dataset, the y-axis is the MLP performance minus the GCN performance: columns below the zero line mean the MLP underperforms GCN, and columns above the zero line mean the MLP outperforms GCN. Mao and others study this difference over finer homophily intervals and find that on arXiv the MLP underperforms GCN in the high-homophily intervals, while on Squirrel the MLP outperforms GCN across the high-homophily intervals. They show that GNNs perform admirably on nodes with the majority neighborhood patterns, that is, homophilic nodes within homophilic graphs and heterophilic nodes within heterophilic graphs, while struggling on nodes with the minority patterns.

They analyze this performance disparity and propose a rigorous, non-i.i.d., PAC-based generalization bound, revealing that there exists a distribution shift between the training and test nodes, and this distribution shift leads to the performance disparity. They identify the homophily-ratio difference between training and test nodes, that is, a graph structure shift, as a new graph out-of-distribution scenario.

There are some other interesting studies exploring the relation between distribution shift and homophily; for example, Loveland and others and Zhao and others study shifts between local and global homophily, and Zhu and others study the shift between input features and output labels.

Heterophily is also related to other problems GNNs suffer from, for example the over-smoothing problem: as GNNs get deeper, the node representations are gradually smoothed out and information loss occurs. Yan and others take a unified view to explain over-smoothing and heterophily simultaneously by profiling nodes with two metrics, the relative degree of a node and the node-level heterophily. Based on these two metrics, they manage to characterize the conditions for over-smoothing and heterophily and predict GNN performance. Here we would like to emphasize the difference between over-smoothing and heterophily: over-smoothing only happens in deep GNNs, not in shallow ones, but heterophily causes performance degradation to all kinds of GNNs, no matter whether deep or shallow.

Heterophily is also found to be related to over-squashing. Over-squashing means that as the number of layers increases, the information from the exponentially growing receptive field is compressed into fixed-length node vectors, which causes information loss for messages from distant nodes. The Jacobian of the node representations is a popular tool to study the over-squashing problem. Rubin and others use the Jacobian matrix to develop a unified theoretical framework to understand the combined effect of heterophily and over-squashing; they name it homophilic bottlenecking, meaning the bottleneck between nodes of the same type.

There are also some other interesting theoretical findings. For example, Choi and others point out two drawbacks of signed message passing: under conditions where nodes from different classes share a high similarity, signed message passing will decrease the separability between nodes, and it will also increase the uncertainty and instability of the model.

Shi and others study the double descent phenomenon in graph learning. Double descent means that increasing model complexity first reduces the test risk and then leads to higher risk due to overfitting; however, if you continue to scale the model complexity into the over-parameterized regime, the risk decreases again, hence the name double descent. Shi and others study the relation between double descent and homophily: they predict the existence of double descent in GNNs and show that negative message-passing weights can result in better performance on heterophilic graphs.

Lee and others study feature shuffling in graph networks, that is, how randomly shuffling features among nodes of the same class affects GNN performance. They found that feature shuffling reduces the dependency between graph topology and features; it can boost GNN performance on a homophilic graph, but not on a heterophilic graph. Okay, any questions

or comments on this section?

Hi Sitao, there is a question in the Q&A box, could you please check? Yes, so the question is: what is the intuition behind the better performance of GNNs versus MLPs on benign heterophilic graphs?

Yeah. Since what we are studying is to find out the bad graph structures for message passing, and it was found that not all heterophilic graphs are bad for message passing, through these experiments we would like to find out which heterophilic graphs lead to the worst performance for the network and which do not. The benign heterophilic graphs are those that do not lead to bad performance of the network; for the ambiguous ones it depends, in some cases they lead to bad performance and in other cases they do not; and the malignant ones always lead to bad performance. So the malignant heterophilic datasets are the really bad graphs that we would like to pick out to test your models. That is why we would like to categorize these heterophilic graphs. Yeah, that's it.

Yeah, thank you for addressing this question. All right. So, in this section,

I will introduce some heterophily-related applications. The most popular one is detection tasks, that is, fraud, anomaly and bot detection. Detection tasks aim to identify and isolate abnormal or malicious nodes or subgraphs in a network, which has significant impact on the security, privacy and robustness of real applications on network data. For graph-based fraud detection, the fraud nodes are usually surrounded by the normal nodes they have been cheating, so we inherently have heterophilic structures and they play an important role. As the figure on the right shows, the red nodes are fraudulent users and the blue nodes are legitimate users, and we can see that the fraudulent and legitimate users are densely connected to each other, so we have a highly heterophilic graph for fraud detection.

It mainly identify the real observations that deviate significantly from the majority of the objects in relational and structured data。

The abnormal users are sparse and connect to the majority of normal nodes which lead to the hydrophilic structure。

For the bo detection, the bos are automated programs that mimic human behaviors and they usually have some malicious purpose。

for example, this spread, some misinformation, some hates。

violence or something bad and those bos having shown that will intentionally interact more with normal people。

therefore it also has some catophilic relations。Hetroophilly is also closely related to some traditional graph learning tasks。

for example graph coloring. Graph coloring assigns different colors to the nodes of a graph such that no adjacent nodes share the same color. Wang and others recognize the graph coloring task as a heterophily problem and introduce negative message passing into a physics-inspired graph network, which allows nodes to exchange information with their neighbors in a negative way so that node features become more dissimilar after each message passing layer. There are also some other graph learning tasks related to heterophily, for example link prediction, graph clustering, graph classification, and recommender systems.

There are also some computer vision tasks with heterophilic scenarios, for example point cloud segmentation. Point clouds are widely used for 3D data in 3D computer vision, and point cloud segmentation aims to divide an unclassified target point cloud into separate regions with different attributes or functions. However, most popular methods for point cloud segmentation implicitly assume homophily and ignore the heterophilic nature of some edges, especially at the boundary regions. As shown in the figure on the right, around the boundary nodes there exists local heterophilic structure, and ignoring it undermines the distinguishability of the node representations.

诶。Urban networks are often hyper as nodes with different attributes or functions may have strong connections。

for instance, roads with different traffic conditions。

land uses or geographic locations can influence each other。

And urban computing involves designing and optimizing the structures and functions of urban networks and can be better learned by GNS with Ha。

Brain networks also have a half of the leak structure in it。嗯。

The brain the brain regions of interest that is ROI respond to nodes and the connect features represent the edges and from this kind of graph sample we can see that there exists both homoe and heavily structure and。

Because this seminari can physically attach to each other。

Such interplay between homomophily and catphily will poses challenges to the analysis of brain networks。

Yeah, that's it。ForFor the application, those applications share some similar properties。

that is first you use message passing all these applications and in these applications you have。

The graph that knows with different attributes were densely connected, so in such scenario。

Cattrophi will play an important role。So。Any questions on the applications?ok。So in a next step。

Yeah, okay, in the next section we will introduce the challenges and future directions of heterophilic graph learning. Welcome.

Yeah, heterophily can have an impact in heterogeneous graph learning, hypergraph learning, temporal graph learning, and also molecule generation. Heterogeneous graphs refer to graphs with multiple types of nodes or edges, but the current heterophily studies only focus on graphs with a single type of nodes or edges. So what we need to do is establish a complete benchmark for heterophilic heterogeneous graphs, and we also need to bring in more heterophily-based metrics for heterogeneous graphs.

The second challenge is the usage of homophily on temporal graphs. Heterophilic GNNs are mostly investigated on classification tasks over static graphs, while node-level regression over dynamic graphs is under-explored. In this case we have time-varying graph homophily values: as shown in the figure, from graph A to B to C the graph structure, and therefore the homophily value, changes over time, so we need to find a better way to define homophily on temporal graphs as time goes by, and we also need a complete benchmark for heterophily on temporal graphs.

The third is heterophily on hypergraphs. Hypergraphs are graphs where an edge can connect more than two nodes, and heterophily has been shown to be an even more common phenomenon in hypergraphs than in simple graphs. Previous studies have not addressed how to define homophily metrics on hypergraphs, and we also need to establish better benchmarks for hypergraphs and the related heterophilic learning.

Heterophily could also have an impact on fairness in graph representation learning. Loveland and others review the link between group fairness and local homophily and discover that not all node neighborhoods are equal: those dominated by a single category of sensitive attribute often face challenges in achieving fair treatment, particularly when local class homophily and sensitive-attribute homophily diverge. Also, Luo and others introduced a node-injection-based fairness attack designed with a homophily-increase principle; it can significantly undermine the fairness of mainstream GNNs, including fairness-aware GNNs, by injecting only 1% of the nodes. And Cao and others studied the evolution of recommendation fairness over time and showed that extreme values of homophily or heterophily can be detrimental to recommendation fairness in the long run, even when group sizes are balanced in the data; they also demonstrated that promoting homophily in heterophilic networks and heterophily in homophilic networks can improve the fairness of recommendations.

Also, the idea of homophily can be used in molecule design. The last figure shows the node-level homophily of the generated molecules; similar to commonly used measures like the partition coefficient, the synthetic accessibility score or drug-likeness, the homophily value can be used as a measure of diversity to compare the generated distribution with the real training distribution, that is, to give a measure of the diversity of the generated molecules.

Okay, I'll hand it back to Sitao.

Thank you, Will. There is a question from Vicky. She asks: are the metrics measuring how well a GNN may work, or the heterophily of the graph? A GNN may not work well for a number of other reasons.

In the old days, those metrics were only a measure of homophily or heterophily of the graph, but since we introduced this concept into graph neural networks and try to identify the good and bad graphs with these metrics, they have gradually become metrics for anticipating the performance of GNNs on those graphs. More specifically, we would like to use these metrics to measure how much the aggregation step helps the neural network. So nowadays, although people still use the words homophily or heterophily, I would prefer the term performance metric, since homophily and heterophily alone are provably not good enough to identify the good and bad graphs; they are just baselines to estimate the performance of GNNs on a given graph.

Yeah, there are indeed other reasons apart from heterophily that can influence GNN performance, for example over-squashing, over-smoothing and others. That is why, when we try to find the bad graphs, we need to compare a graph-aware model with its corresponding graph-agnostic model. For example, if you compare a 10-layer GCN with a 2-layer MLP, find that the 10-layer GCN performs badly, and claim that the graph is bad, that is unfair, because the 10-layer GCN might suffer from over-smoothing rather than from a bad graph structure. Therefore, you should compare a 2-layer GCN with a 2-layer MLP to see whether the graph is good or bad: at least try to remove the other effects as much as you can, so that you can identify whether the message passing step is good or not for the graph at hand. Yeah.

Thank you for the answer, and are there any questions from the audience? Okay,

so there is a question from Slack. Okay, let me check; you can also check the chat box in Zoom and the tutorial discussion channel. It is about a heterogeneous heterophily metric and the intuition behind it. Oh, let me share the slide.

Yeah, here is the question on the slide: "Wonderful session, thank you. Any thoughts on next steps for defining metrics for heterogeneous heterophilic graphs? Also, any intuition for when we should consider these techniques: always apply a homophily metric, or only investigate when a GNN underperforms?"

So, for a heterogeneous heterophily metric,

I think such a metric would mainly be useful for metapath-based heterogeneous GNNs, since different metapaths play different roles in the whole heterogeneous pipeline. Because you have several different sub-channels, that is, metapath-based subgraphs, for a heterogeneous graph, we need to combine the metric values from the different metapath-based subgraphs to capture the overall effect on a single heterogeneous GNN. How to combine them, and how the different subgraphs influence each other, is I think the main step in defining a metric for heterogeneous GNNs. There are some initial experiments and definitions, for example taking the maximum homophily metric over all the subgraphs or the average of the metrics of the subgraphs, but I would not consider them conclusive. There is still a long way to go: at the very least you need to show that there is a relation between your defined metric and the performance of heterogeneous GNNs; if you can find such a relation, I would say you have found a good metric.

As for when we should consider these techniques: I think the metrics should be computed before you train any graph model, especially for large-scale models. For example, if a model needs a week of training and you don't know whether you should use the graph structure or not, you should first test whether the graph structure is good before training, because you don't want to waste time. I first studied this because my supervisor told me that one of her students had used a graph network in reinforcement learning and found that it performed worse than a simple MLP, and asked what was happening with the network. Back in early 2020 I really did not know the reason; in the old days I thought GNNs were universally effective for any task. But after the work in the 2020 NeurIPS paper on heterophily, I recognized that there do exist graphs that provide bad graph structure. So we should use these metrics to test the graph before really training the graph model, especially when training is expensive: if training takes 30 seconds you may not care about the training cost, but if it takes a week you don't want to waste your computational resources, so it is better to first compute the homophily metric.

The next question: are graph transformers less susceptible to the issues that arise from

Do you means。Okay, do you mean graph transformer?Might。Be better to deal with。

The problem fromic grass。So what do you mean the less susceptible?To the issues。

Do means potentially better on。How complete for us。IfYeah, if this is your question。啊。I would say no。

Because, as I show in the slides, Müller and others conducted very comprehensive experiments, and they show that on heterophilic graphs the graph transformer still has a long way to go for solving the heterophily problem. They connect every pair of nodes in the graph; however, this kind of fully connected graph is not always good, and the graph embedding and edge embedding are not good enough to provide the information to the self-attention to distinguish heterophilic edges from homophilic edges. Therefore, the graph transformer cannot learn a good enough graph for message passing, I think. Nowadays, I think Specformer by Bo and others is quite impressive; that is the first transformer-like model that performs pretty well on heterophilic graphs, and I remember from their ablation study that their newly defined graph node encoding is the most effective component for improving performance. Therefore, to boost graph transformers on heterophilic graphs, we can maybe study it from the node-encoding perspective. Yeah, any questions from the audience? Okay, so,

I think that's all for our session on heterophilic graph learning. Thank you to all our speakers for sharing a comprehensive review of heterophilic graph learning, and thanks to all the audience for your engagement. We hope you have enjoyed this session and enjoy the rest of the LoG conference.

Yeah, thank you so much. As a reminder, we have a tutorial feedback form which I will put in the Zoom chat right now; we absolutely appreciate your feedback, so please share as much as you are willing to share. And with that, this brings us to the end of today's LoG conference. We'll see you tomorrow for another day; we'll begin with another exciting tutorial, this one on geometric generative models, so please join us on Zoom or the YouTube live stream for another lively discussion. Have a great rest of your day.


Graph Machine Learning Conference: LOG 2024: Tutorial on Integrating Knowledge Graphs and Large Language Models

Overview

In this tutorial, we learn how to combine knowledge graphs with large language models to advance scientific research. We discuss how knowledge graphs can compensate for the shortcomings of large language models in explicit knowledge, interpretability, and updating of domain knowledge, and we introduce concrete integration frameworks, technical details, and applications in scientific discovery tasks.


Part 1: Scientific Knowledge Graphs

Knowledge graph definition and core concepts

A knowledge graph is a multi-relational graph for representing structured knowledge. It consists of entities (nodes) and relations (edges), and describes facts as (subject, predicate, object) triples, for example (Manchester Baby, developer, Tom Kilburn).

A knowledge graph contains not only factual data but can also include a schema describing its metadata, as well as a more expressive ontology. Ontologies are defined in languages such as OWL (Web Ontology Language) and can describe taxonomies (e.g., food classification) and complex logical relations (e.g., "every person has exactly two parents").

Advantages and construction of knowledge graphs

Compared with traditional databases, knowledge graphs are more intuitive, flexible, and easier to extend and integrate. They support graph-based navigation, similarity computation, and rule-based reasoning.

There are several ways to construct a knowledge graph:

  1. Crowdsourced construction: e.g., Wikidata.
  2. Automatic construction: extracting knowledge from unstructured text or semi-structured web data.
  3. Construction from existing resources: converting structured data such as databases and tables into a knowledge graph.
  4. Knowledge graph integration: merging multiple existing knowledge graphs.

Case study: ecotoxicological effect assessment

We illustrate the application of knowledge graphs with a concrete case. In ecotoxicology, assessing the effect of chemicals on species requires a large number of experiments. To reduce experimental cost, we constructed a knowledge graph containing species, chemicals, and their known effect relations.

Method: We integrated data sources such as an ecotoxicology database, the NCBI taxonomy, and several chemical ontologies into a unified knowledge graph. We then used knowledge graph embedding and link prediction techniques to predict unknown chemical-species effect relations, enabling effective screening before experiments.

Models: We tried two approaches:

  1. Naive approach: feed knowledge graph embeddings (e.g., TransE) into a fully connected neural network for prediction.
  2. End-to-end approach: jointly train the knowledge graph embeddings and the effect predictor by combining several loss terms (chemical-subgraph loss, species-subgraph loss, effect-prediction loss).

Experimental results show that this approach can effectively predict potential ecotoxicological effects.
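As a rough illustration of the "naive" approach above (this sketch is not from the tutorial; the embedding dimension, layer sizes, and the use of random tensors as stand-ins for pre-trained TransE embeddings are all hypothetical), pre-trained entity embeddings of a chemical and a species are concatenated and fed to a small feed-forward classifier:

```python
import torch
import torch.nn as nn

class EffectPredictor(nn.Module):
    """Binary classifier over a (chemical, species) pair of KG embeddings."""
    def __init__(self, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, chem_emb: torch.Tensor, species_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two pre-trained entity embeddings and score the pair.
        return torch.sigmoid(self.mlp(torch.cat([chem_emb, species_emb], dim=-1)))

# Toy usage with random stand-ins for pre-trained TransE embeddings.
model = EffectPredictor()
prob_effect = model(torch.randn(4, 128), torch.randn(4, 128))  # shape [4, 1]
```

In the end-to-end variant, these embeddings would be trained jointly with the predictor under the combined losses described above.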

Challenges and opportunities for life science knowledge graphs

Knowledge graphs in the life sciences face many challenges, including:

  • Scalability: how to efficiently build and manage large-scale knowledge graphs.
  • Human interaction and interpretability: how to make model decisions more transparent.
  • Multi-modal and multi-domain integration: how to integrate multi-modal data such as text, images, and gene sequences.
  • Representation learning: how to combine symbolic representations (knowledge graphs) with sub-symbolic representations (embeddings).

Despite these challenges, knowledge graphs hold great potential for advancing scientific discovery and building interpretable AI.


Part 2: Scientific Large Language Models

The previous part introduced the foundations and applications of knowledge graphs; in this part we look at how large language models can be used to process scientific data.

Overview of scientific large language models

General-purpose large language models (such as the GPT series) excel at natural language processing but struggle to understand scientific data such as protein sequences, genomes, and chemical formulas, because they lack training on these domain-specific "languages" (e.g., amino acid sequences, SMILES strings).

Scientific data has its own "language" systems:

  • Biology: the protein language (20 amino acids) and the gene language (4 nucleotides).
  • Chemistry: molecular languages (e.g., SMILES, SELFIES), with their own grammar and vocabulary.

Based on the distributional semantics hypothesis (a word's meaning is determined by its context), researchers have developed large language models for scientific data with BERT-like (masked language modeling) and GPT-like (autoregressive generation) architectures.

Technical details: text and protein models

Scientific large language models span many domains; this tutorial focuses on scientific text and protein models.

Scientific text models
These models are pre-trained and fine-tuned on scientific literature and textbook text from biology, chemistry, and other fields.

  • Encoder models (e.g., BioBERT): strong at text understanding; usable for named entity recognition, relation extraction, and question answering, and helpful for building knowledge graphs.
  • Decoder-only models (e.g., Galactica): fine-tuned on scientific corpora; good at answering scientific questions and summarization.
  • Encoder-decoder models (e.g., SciGLM): trained with a self-reflective instruction dataset, able to reflect on and correct output errors.

Protein language models
A protein sequence can be viewed as a "language" composed of amino acid symbols.

  • Encoder models (e.g., ESM): BERT-like, incorporating protein evolutionary information.
  • Decoder-only models (e.g., ProGen): GPT-like; can generate novel protein sequences with specified functions from control tags, with applications to enzyme and antibody design.
  • Encoder-decoder models (e.g., xTrimoPGLM): very large (100 billion parameters), able both to understand proteins and to generate new sequences.

Evaluating generated protein sequences is a major challenge; final validation still relies on wet-lab experiments.

Challenges and future directions

The development of scientific large language models faces the following challenges:

  1. Data quality: high-quality, cross-modal scientific datasets are scarce.
  2. Long sequences: protein and genome sequences are far longer than natural language sentences.
  3. Using 3D information: how to effectively incorporate the 3D structure of molecules and proteins.
  4. Autoregressive learning: the generation order of protein sequences is less clearly defined than in natural language.
  5. Evaluation dependence: heavy reliance on wet-lab validation slows iteration.
  6. Data privacy and fairness: data privacy, model bias, and equal access across institutions must be considered.

Breakthroughs in these areas are needed to accelerate scientific discovery.


Part 3: Frameworks for Integrating Knowledge Graphs and Large Language Models

The first two parts introduced knowledge graphs and scientific large language models; this part explores how to integrate them effectively.

Integration paradigms and stages

There are three main paradigms for integrating knowledge graphs and large language models:

  1. KG-enhanced LLMs: use KGs to provide factual knowledge or symbolic reasoning to the LLM; the output is an enhanced LLM.
  2. LLM-enhanced KGs: use LLMs to generate knowledge or perform language processing; the output is an enhanced KG or KG-task results.
  3. LLM-KG synergy: the LLM learns representations of the KG, and the KG supplies factual knowledge to the LLM.

This tutorial focuses mainly on the first paradigm, i.e., injecting knowledge graphs at different stages of the LLM to enhance its knowledge understanding.

Integration can happen at several stages:

  • Pre-training: use the KG as part of the pre-training corpus (e.g., Pre-KGE, MoMu).
  • Post-training: design specific objectives to inject knowledge into an already pre-trained model (e.g., SAIL, OntoProtein).
  • Fine-tuning: fine-tune the LLM on downstream tasks with KG information (e.g., FusDTA, RoSA).
  • Inference: no training; supply KG information as context via prompt engineering or retrieval-augmented generation (RAG) (e.g., MedRAG, BioRAG).

Integration architectures: dual encoders and cross encoders

At the model architecture level, there are two main options (a minimal sketch follows the list):

  • Dual encoder: encode each entity separately (e.g., a protein with a protein language model and a drug with a molecular model); efficient, but with weaker interaction between the two.
  • Cross encoder: jointly encode the entity pair (e.g., a protein-drug pair); captures finer interactions but at higher computational cost.
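A minimal sketch of the two architectures (the modules, dimensions, and usage below are hypothetical placeholders, not the models used in the talks):

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Encode protein and drug separately, then score the pair with a dot product."""
    def __init__(self, protein_encoder: nn.Module, drug_encoder: nn.Module):
        super().__init__()
        self.protein_encoder, self.drug_encoder = protein_encoder, drug_encoder

    def forward(self, protein_feats, drug_feats):
        p = self.protein_encoder(protein_feats)   # [batch, dim]
        d = self.drug_encoder(drug_feats)         # [batch, dim]
        return (p * d).sum(dim=-1)                # interaction score per pair

class CrossEncoder(nn.Module):
    """Encode the concatenated pair jointly; richer interaction, higher cost."""
    def __init__(self, joint_encoder: nn.Module, dim: int):
        super().__init__()
        self.joint_encoder = joint_encoder
        self.head = nn.Linear(dim, 1)

    def forward(self, pair_feats):
        h = self.joint_encoder(pair_feats)        # [batch, dim] pooled pair representation
        return self.head(h).squeeze(-1)

# Toy usage with linear "encoders" over pre-computed features (placeholders).
dual = DualEncoder(nn.Linear(64, 32), nn.Linear(16, 32))
scores = dual(torch.randn(8, 64), torch.randn(8, 16))  # [8]
```

With a dual encoder, all protein and drug embeddings can be pre-computed and reused; a cross encoder must be re-run for every candidate pair, which is why it is usually reserved for re-ranking a short list.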

Examples of knowledge graph integration

Below are examples of KG integration in different scientific tasks:

Scientific NLP tasks

  • Clinical text generation: build prompts from topics and units in the KG, and fine-tune the LLM with synthetic data.
  • Complex medical question answering: the LLM generates relevant triples, which are verified against a knowledge base; answers are then generated with RoSA parameter-efficient fine-tuning.
  • Question answering and reasoning: combine post-training (e.g., CaLM) with fine-tuning; retrieve KG subgraphs and modify the input, perform supervised fine-tuning, and use an LLM as an evaluator to provide feedback.

Scientific prediction tasks

  • Gene-disease association prediction: a dual encoder, with BERT encoding the disease description and a protein language model encoding the gene (protein sequence), followed by prediction.
  • Protein function prediction: build a large-scale KG of the Gene Ontology and proteins, and jointly optimize the KG and protein embedding spaces via contrastive learning.
  • Drug repurposing: use LLMs such as GPT-4 to generate drug repurposing hypotheses and validate them against electronic health record data.

Handling multi-modality and agents

Scientific data often involves multiple modalities (text, sequences, structures). Integration approaches include (a sketch of the first option follows this list):

  1. Multi-modal contrastive learning: align representations of different modalities (CLIP-style).
  2. Aligning to a central modality: map other modalities into the representation space of the text modality.
  3. Cross-modal translation models: learn translation functions between modalities.
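A minimal sketch of the CLIP-style contrastive objective from option 1, assuming paired embeddings from two modality encoders (the encoders themselves are omitted; this is a generic symmetric InfoNCE loss, not code from the tutorial):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb: torch.Tensor, other_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (text, other-modality) embeddings.

    text_emb, other_emb: [batch, dim] outputs of the two modality encoders.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = text_emb @ other_emb.t() / temperature          # [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```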

Scientific agents
LLM agents hold great promise for scientific discovery and can be divided into:

  • Specialized scientific LLMs: customized for specific tasks (e.g., molecules, proteins).
  • General-purpose LLM assistants: trained on diverse text; capable of planning, reasoning, and information retrieval, allowing users to interact in natural language.
    Collaboration and experiment operation are key to scientific discovery; existing work explores multi-agent collaborative research, as well as agents that connect LLMs to hardware APIs to operate real-world experiments.

Summary

In this tutorial, we covered the three core parts of integrating knowledge graphs with large language models.

First, we introduced the core concepts of knowledge graphs, how they are constructed, and their value and challenges in scientific domains such as ecotoxicology.

Second, we discussed scientific large language models, how they handle domain-specific data such as scientific text and protein sequences, and the technical challenges they currently face.

Finally, we examined frameworks for integrating knowledge graphs with large language models, including the different integration stages (pre-training, fine-tuning, inference), architecture choices (dual encoder, cross encoder), and their concrete applications in scientific NLP and prediction tasks.

Key takeaways

  • Knowledge graphs can improve LLM accuracy and interpretability and reduce hallucination.
  • Effective integration requires considering the base model (text, molecule, protein), the encoder architecture, the integration stage, and the integration techniques (e.g., RAG, contrastive learning, agents).
  • This integration provides a powerful new paradigm for advancing scientific discovery.


This tutorial content is compiled from the related talks at the Learning on Graphs Conference (LOG 2024).

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p02 P02_Day_2-Zachary_Ulissi_keynote__Geometric_Generative_Models_tutorial__orals -BV1k9pAzpE8S_p2-

Is it that this voting panel comes back every two seconds, or no? Oh no, is it because people are joining? Okay. Perfect, I think it's 3 pm, we can get started. Hello everyone, welcome to day two of the LoG Conference 2024. Day one was very exciting with all the participation, and we hope it increases over the next three days.

Okay. Hello. What happened? Yes, so today we have a tutorial; the first session today is a tutorial on geometric generative models from Joey Bose, Heli Ben-Hamu, and Alex Tong. Hello. Is it working, can you hear? Yeah, I think you cut out for a second, but we can hear you now. Should we start? Yes.

Now, Joey is a postdoc in Michael Bronstein's group at Oxford and he did his PhD at Mila; he has worked a lot on geometric deep learning and generative models and has produced some instrumental work in this field. Alex is an incoming assistant professor at Duke University, currently a postdoc with Yoshua Bengio, and has also worked a lot on optimal transport and geometric models for single-cell biology. Heli is an incoming researcher at FAIR and currently a final-year PhD student with Yaron Lipman, and she is also involved in the flow matching paper. All of our organizers today have a lot of experience with workshop organization, participation, and talks, and we hope that they are able to distill all the complex concepts today in a very accessible manner. I'm very much looking forward to the tutorial; you can get started now.

Okay, thanks for the kind introduction, and thanks to the people joining online as well as the people physically in the room at Oxford. This tutorial is going to be in three parts and it's about geometric generative models; I'll talk for the first few minutes before passing it on to Heli, who will take over from there. Before we get started, the first thing I want to talk about is why we even want to consider generative models beyond images and text. Generative models have recently been adapted to domains like proteins, robotics, information geometry, and even climate modeling. These are all domains with rich geometric structure that you are not usually afforded when you take generative models built for text and images, and as a result we need to rebuild tools to tackle these problems from the ground up and from rigorous first principles. This is the gateway point for the tutorial, where we build the tools to show you how a lot of the familiar concepts can be elevated to geometric structures and how we actually build generative models on top of these complex geometries.

So having said that, this tutorial is going to be essentially three things。

One of them is that it's going to be an introduction to the main geometric concepts that we actually use to build these models. Second, we're going to try to distill a lot of the key practices that people are currently using to make more efficient and numerically stable models. And finally, we're going to figure out how to actually code some of these things and see how some of the complex geometric operations translate into actual lines of PyTorch code. What this is not: it's not a crash course on differential geometry, and it's not meant to be super rigorous on the math side, although we will cover some math concepts. It's not a rundown of all of our papers, or of all the papers in this field and what the state-of-the-art models are, although we will touch on some design decisions people use to build those models. And finally, it's not a history lesson on all prior attempts to tackle geometric data with generative models; it's really tailored to what people are doing in practice now, in the modern generative modeling setup, and how this is elevating the field forward.

So having said that, the tutorial is broken down into three parts, roughly an hour each. The first part is going to be by Heli, the second part by me, and the third part by Alex. The first part is a primer on modern generative models, then we move on to geometry, and then to actually bringing the pieces together and building these geometric generative models.

Wait one second, sorry, this is the wrong slide deck. Yeah, sorry, that was the wrong slide deck, but this is the correct one now. So yeah, the first part is going to be by Heli, the second part is going to be by me, which is how you build geometric tools, and finally Alex will bring everything together, combining geometry and generative models in practice. With that I'll pass it on to Heli; Heli, please feel free to take over screen control.

Thank you.

Okay, so。

Hold on, let me go back; I'm going to share my screen. Do you see the slides okay? Yes, I believe so. Okay, so we're going to begin with the first part of the talk, about simulation-free generative models.

So let's dive in. We're going to start by setting up the problem that we want to solve in generative modeling, and the setting is as follows: we have some unknown data distribution, which we denote as q, and we typically have access to it via a finite set of samples, which we denote as x1; it will become clear later in the talk why this subscript one appears here. The setting is that we only have access to a finite set of samples from that unknown distribution, and our goal is to learn a sampler from that unknown distribution q,

meaning that we want to learn some function that is able to efficiently generate novel samples from the unknown data distribution that we're interested in。

So this is in general, the goal of the generative modeling problem。Now, in deep geneative modeling。

what we do is, of course, learn a neural network with parameters theta. A generative model can be abstracted, from my point of view, into a tuple of two objects:

so the first object psi of theta is the generator。

which is a function that allows sampling generating new samples from the learned data distribution。

And p of theta is the underlying density that is defined by this generator function. For some deep generative modeling algorithms we may or may not have access to this underlying density, but in any case our goal is always the same: we want to find parameters theta such that the underlying density that our generator creates approximates the target density q. So let's try to put it into a more visual scheme: we have our finite set of samples from the target distribution q, and the way we typically build a generative model is by choosing some easy-to-sample source distribution, which we denote as p, with samples denoted x0.

And the generator takes samples from the source distribution and transform them into samples that ideally would be distributed as the target。

And as for the modeling of the density, the underlying density, for example。

in this case it would look like this, and if we have access to this quantity this P of theta。

then we would want it to approximate Q。Now, the question is how do we model a psi of theta。

how do we model our generator, because there are so many ways to do that。

So in deep generative modeling we will go quickly over these three paradigms, and then we will continue to the paradigms that we're going to talk about today, which are flow matching and diffusion. For autoregressive models, we have a likelihood-based model which sequentially generates the signal; the generation is sequential, meaning that the generation cost depends on the dimension of the signal we want to generate. This works very well for language modeling, as all of you may know from LLMs, but it is less trivial to use, for example, for images or even for geometric applications, where the ordering of the pixels or of the coordinates of the elements in our data is not very clear.

The second modeling paradigm, which was pretty much the most dominant one until the appearance of diffusion models, is GANs. In GANs the generator is simply a neural network that takes latent vectors from some latent space, typically of lower dimension than the data, and transforms them into data samples. Training this generator involves defining a discriminator, which receives samples from both the data distribution and the generator and needs to predict whether they are real or fake. Training this model involves solving a minimax problem, which can be very hard; it is a delicate training procedure, and we do not have access to computing the likelihood, which could be interesting for some scientific applications, but it enables fast sampling compared to autoregressive models, for example.

And the last paradigm is VAEs, where we define an autoencoder, except now the latent vector is sampled from some distribution that we constrain while training. It also enables fast sampling, but we do not have access to the likelihood either. So far it has shown inferior performance compared to, for example, GANs, so it has not become such a dominant generative model yet. In this talk, we are going to focus on dynamical systems as generative models.

And the way that we're going to view our generative model is。

By a time dependent process so a dynamical system describes the evolution of point samples in time。

so now we augment our generator with an extra dimension and we have a time dependent generator。

And sampling actually becomes simulating: given a source distribution, we sample a point x0 and simulate to generate a sample at time t. The choice of 1 here is completely arbitrary, but let's say that at time 1 we want to reach our target q, and that's why the subscript one,

so we need to learn the components that allow us to simulate this process such that we sample from the target distribution。

Now, there are two main dynamical systems that people have worked with to develop generative models, and each of these equations defines the instantaneous, or infinitesimal, change that a point makes in time. Here the infinitesimal change in x, the position, is dictated by a velocity field that tells us in which direction we need to take the infinitesimal step, and this defines an ordinary differential equation. For diffusion models we have a stochastic differential equation; the first part is pretty much the same as the ODE, except instead of calling this the velocity, it is called the drift of the SDE, and then the stochastic part comes in: we have a diffusion coefficient, which is also a deterministic function, and here we consider only diffusion coefficients that depend on time and not on the position x, and a Brownian motion, which is the stochastic part; you can imagine that at every time step you add a little bit of Gaussian noise to your process.
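Written out, the two dynamical systems described above are (standard notation, reconstructed from the spoken description rather than copied from the slides):

```latex
\text{ODE (flow):}\quad \frac{\mathrm{d}x_t}{\mathrm{d}t} = u_t(x_t),
\qquad
\text{SDE (diffusion):}\quad \mathrm{d}x_t = f_t(x_t)\,\mathrm{d}t + g_t\,\mathrm{d}W_t,
```

where u_t is the velocity field, f_t the drift, g_t the time-only diffusion coefficient, and W_t a Brownian motion.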

Let's see what it means to simulate from these processes. Starting with x0 we want to solve the initial value problem, and this amounts to basically taking the integral over the velocity: starting at x0 we consider the trajectory that the point follows, reaching its final point at time t, and at every time on this trajectory the velocity field, as we already said, points in the direction in which we need to take the next infinitesimal step. This is a deterministic function, so if we start from x0 again and take this integral, we will end up at the same xt. On the contrary, for diffusion, although we write it as an integral, it is not a standard integral, because we have integration over Brownian motion, so it is a stochastic object. Simulating this process would look something like this, and if we simulate it again, starting from the same x0 and taking this integration once more, we will not end up at the same position. This is important to note: compared to flows, where we define a deterministic map, here we have a stochastic process, a random variable defined by this simulation.

Now, these are integrals, and working on a computer we cannot simulate them exactly, so an important thing to mention is that we have ways to compute these integrals numerically, and there are many standard ways to do that. I think the simplest solvers for both ordinary differential equations and stochastic differential equations are the Euler and the Euler-Maruyama methods, which basically say that at each time stamp, if we want to move to x at t plus some small delta t, which is the discretization interval we integrate over, we take a step in the velocity direction times the time interval, the discretization interval. For diffusion it's similar, except now we have, again, the Brownian motion term.
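In symbols, the two one-step updates just mentioned are (standard forms, not copied from the slides):

```latex
\text{Euler:}\quad x_{t+\Delta t} = x_t + \Delta t \, u_t(x_t),
\qquad
\text{Euler--Maruyama:}\quad x_{t+\Delta t} = x_t + \Delta t \, f_t(x_t) + g_t \sqrt{\Delta t}\,\epsilon,\ \ \epsilon \sim \mathcal{N}(0, I).
```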

Okay. But we want to build a generative model out of this, so the question is: where are the probabilities? When we build a generative model we want to be able to make the probability density of our model match our target.

And。In fact, both these dynamical systems have characterization of how either the velocity field or the drift and the diffusion coefficient are tied to the evolution of the probability density in time。

so if we start with samples from a source distribution P and we simulate their trajectories according to the velocity or according to the SDE that is defined by the drift and the diffusion coefficient。

then we know how the probability densities will change in time. And this is a major thing; in fact, for flows, if we have the velocity u_t, maybe we can build a generative model out of it: we can have u_t and make the probability density evolve to the one we desire, and the same goes for diffusion. So, can we build a generative model with this? The answer for flows is quite immediately yes,

but for diffusion we need one more thing。And we're going to go over it。So。In diffusion。

the stochastic differential equation actually only describes the forward direction okay。

so if we start from the data distribution, simulating this equation, simulating this SDE, only adds more and more noise,

eventually at infinite time, reaching a stationary distribution of this stochastic process。Now。

If we want to build a generative model, we need to be able to simulate the reverse one, right。

we're going to start from a noise sample and then transform it into a data sample。And。To do that。

there is a reverse process that has the same marginal distributions, meaning that p_t, the distribution at time t, is the same for both the forward and the reverse process, but it depends on where we start the simulation. The reverse SDE is formulated with one extra thing that we need, and that is the score function, the gradient of the log density at time t. Now, for learning a score-based model, it has been shown in the diffusion and score-matching literature that we can learn the marginal score by regressing to conditional scores, and I intentionally write here p_data and not q, because the time conventions in diffusion and in flows are reversed. To be clear, here we take a sample x from the data, simulate the forward process, and that yields the conditional distribution conditioned on a data sample.
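For reference, the reverse-time SDE and the denoising score matching objective being described here are, in their standard form (reconstructed, not taken from the slides):

```latex
\mathrm{d}x_t = \big[f_t(x_t) - g_t^2 \,\nabla_{x}\log p_t(x_t)\big]\,\mathrm{d}t + g_t\,\mathrm{d}\bar{W}_t,
\qquad
\mathcal{L}_{\mathrm{DSM}} = \mathbb{E}_{t,\;x\sim p_{\mathrm{data}},\;x_t\sim p_t(\cdot\mid x)}
\big\| s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t\mid x)\big\|^2 ,
```

where the bar over W denotes reverse-time Brownian motion and s_theta is the learned score network.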

What makes these models simulation-free is that for certain SDEs we have ways to, first, actually sample from this conditional distribution without having to simulate the forward SDE, because there are closed-form expressions, and second, we don't have to simulate the process at all or backpropagate through it during training. We're going to use similar principles when we build a flow matching model, so keep that in mind. Okay, so to build a diffusion model, what we need is to learn the score. A few things to note here: we can only use this with a Gaussian source distribution, which is a limitation of diffusion models that will be lifted when we use flows, and we'll also discuss why it is beneficial to use flows over diffusion for geometric applications. Also, the solution only reaches the source distribution asymptotically, so we have to simulate our process for a very long time, ideally to infinity, and this incurs errors when we want to simulate the reverse process, because we do not really reach the source distribution.

Now we move on to flows. With flows, things are actually a bit simpler, because under some very reasonable regularity assumptions on the velocity field, it defines a flow psi_t, and this flow function defines a diffeomorphism of the space, which means that it is smooth and has a smooth inverse, and the inverse is defined simply by the negative of the velocity field. So compared to diffusion models, where simulating forward in time and simulating backward in time involve different components and objects, here all we need is the velocity field.

嗯。Which makes things a little bit simpler when we want to build a generative model with these。🤧Okay。

So to build a generative model with flows, what we need is to learn the velocity field。

And in fact it is it can define a universal transformation between densities what it means is that we can start off from any source distribution to any target distribution and we are guaranteed that exists some velocity field that transfers between these two and this is compared to the limitation of a Gaussian source for diffusion。

And the second thing is that it is defined on a finite time interval, and again。

we don't have these errors in the boundary of small mismatch between our distributions。

So we're now going to deep dive into building a flow matching model。

we're going to revisit diffusion once in a while, but the main focus of the talk is going to be on flow models。

So。We said that we have the flow ODE that describes the motion of point samples in time。

and correspondingly we have the continuity equation。

which couples the velocity with the evolution of the probability density in time, and we can see it visualized here in these nice animations. Now, the goal when we want to build a flow generative model is to find a velocity field u_t such that the terminal distribution p_1 matches our target. Okay, so we have refined a bit the goal of the generative modeling problem for the flow case.

Using flows as generative models was originally introduced in the neural ODE paper by Chen et al., and what they suggested was to use the fact that we can compute the log-likelihood of the model: a little algebraic massage of the continuity equation gives us the equation here on the right, and this basically means that if we can compute the log-density of our source distribution, which is true if we chose it to be a Gaussian,

then we can integrate over the divergence of the velocity field to get the log density of the model at every desired time。

And if we can compute the log likelihood of our model。

then why not use the maximum likelihood objective to train it?Sounds like a good idea。However。

It requires a lot of computational overhead. Why? First, we need to simulate x_t during training, because we need to evaluate the velocity field at every x_t; furthermore, we need to take derivatives of our velocity field, and we don't have a very efficient way to compute the divergence in high dimensions, so we resort to an estimator, which is unbiased but still an estimator. But the main thing is that we also have to backpropagate through this simulation, so it becomes computationally very heavy to even train such models, and flows were not scaled up much until the introduction of flow matching-like models.

So here we are. Flow matching was introduced at the same time as stochastic interpolants and rectified flows; they were all published at the same ICLR conference in 2023. Although they eventually arrive at the same method, they came from slightly different perspectives, and we're going to follow the flow matching perspective on building a simulation-free method for training flows.

Okay. So the basic idea in flow matching is as follows. We said that we have the continuity equation, which couples the velocity and the distribution over time, and the core principle we're going to follow in proposing our training objective is the fact that the velocity field generates a probability path if and only if it satisfies the continuity equation,

so that means that if we could build some probability path that starts from a source distribution and reaches an approximation of our target distribution。

Then。We could regress through the velocity field, and if we have the velocity field。

then we have a way to generate samples from that distribution。

But to do that we need to do two things so the first thing is we have to build this target probability path and it needs to satisfy the boundary conditions starting at the source and at the end reaching an approximation of the target and we need to have the generating velocity field U。

The second thing that we need is to find a tractable optimization objective for this because at this point。

a few of you may wonder: okay, but if you have your velocity field in hand,

why do you need to learn it?Right, so we're not going to have it in hand。

we're going to find some other way to do that and that's going to be involved in finding the tractable optimization objective。

😊,So we're going to start with building our target probability paths and that's going to guide us towards finding our tractable objective。

As well. So, we build a probability path by building conditional probability paths;

so our boundary conditions are, as we said before。

and we use the law of total probability to define our probability path as marginalization over some conditional probabilities。

Where if we choose our probability path, for example, to satisfy that at time zero。

it is simply the source distribution and at time1 it takes all of the samples to the x1 sample from the data that it is conditioned on and if we plug in these two into these integral we will easily see that it satisfies the boundary conditions。
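In formulas, the construction just described is (standard flow matching notation, reconstructed from the description):

```latex
p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, \mathrm{d}x_1,
\qquad
p_0(x \mid x_1) = p(x), \quad p_1(x \mid x_1) \approx \delta_{x_1}(x),
```

so that p_0 = p (the source) and p_1 is approximately q (the target).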

Just for notation, so we understand what we are saying: this one is called the marginal path, and this one the conditional path.

Now we're going to see how we're going to build it, the velocity field that generates it。So。

For the conditional path, there exists some conditional velocity field that generates it。

And we can express the marginal velocity field that generates our marginal probability path。

in terms of the conditional velocity field. So we turn the problem of finding the marginal velocity field into marginalizing over a conditional velocity field that is potentially easy to compute or find. Okay, so,

Looking at this ugly integral。😊,we are not done yet。

we do need to further find an tractable objective。

but I want to note that here we could also marginalize over any condition and some useful different conditioning that could be used is to instead of conditioning over only the X1 over only over the data points we could condition on both a source sample and a target sample and this will open up the option to define different kinds of couplings between the source in the target。

it could look into these two papers below that for example suggested solving mini batch optimal transport couplings between the source and the target to make the learned trajectories。

Straighter and hence。To have a faster inference solving when simulating from such flows。

And another option is to condition an x not。Which will。

As we will see later connects to the epsilon prediction or noise prediction in diffusion models。🤧嗯。

And just for notational simplicity, by Bayes' rule we can switch all this around to just reverse the conditioning,

so here we have PT of Z given x so it's like simpler to write it down。Now。

We need to find our conditional flow matching loss that will allow us to actually train such a model。

So。Similar to in score matching, as we said that we could regress to a conditional score to learn our marginal score。

it turns out that also for flow matching we have this equivalent, so equivalent sorry。

so the minimizer of the flow matching loss is the same minimizer as the conditional flow matching loss and actually only by building conditional velocities we could learn the marginal one that generates our PT our desired PT。
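The two losses being compared are, in standard notation (reconstructed, not copied from the slides):

```latex
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; x_t \sim p_t} \big\| v_\theta(x_t, t) - u_t(x_t) \big\|^2,
\qquad
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\; z \sim q,\; x_t \sim p_t(\cdot \mid z)} \big\| v_\theta(x_t, t) - u_t(x_t \mid z) \big\|^2,
```

and the key fact is that they differ by a constant independent of theta, so they have the same gradients and the same minimizer; the tractable conditional loss can therefore be minimized in place of the intractable marginal one.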

So we're almost done: we built our target probability path, we defined the velocity field that we want to match, and we found a tractable objective. The last bit that we're missing is: how do we compute the conditional velocity for a given conditional probability path? The way we do it, and this is relatively easy to construct, is to build these conditional paths by building the conditional flows. The conditional flow simply tells us, if we take a point at time zero, where it will end up at time t, and the derivative of the flow is the velocity field; they are connected by the flow ODE. We build the conditional flows such that at time zero they leave the point in place, and at time one they take all the points to the conditioning sample x1. Okay, so we have the conditional probability path,

the corresponding velocity field that generates it。

and the corresponding flow that is defined by this velocity field。ok。Now。

A popular instance of conditional flows that is widely used is affine conditional flows. They are called affine because they are an affine combination of x1 and x0, where alpha_t and sigma_t are time-dependent coefficients, and in order to satisfy the boundary conditions we need alpha_t to be 0 and sigma_t to be 1 at time zero, and vice versa at time one. To get the velocity field, we write down the flow ODE: the time derivative of the flow equals the velocity field evaluated at the flow, conditioned on x1. Computing this derivative, we get that it is simply the time derivative of alpha_t times x1 plus the time derivative of sigma_t times x0, where the dot denotes the derivative with respect to time. If we then solve for x0 and plug it back in, so that we can evaluate the velocity field at every position in space and not just on trajectories of the flow, we end up with the expression for the conditional velocity field.
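Spelled out, the affine conditional flow and the velocity field obtained from it are (standard derivation, reconstructed from the spoken description):

```latex
\psi_t(x_0 \mid x_1) = \alpha_t x_1 + \sigma_t x_0,
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\psi_t(x_0 \mid x_1) = \dot\alpha_t x_1 + \dot\sigma_t x_0,
\qquad
u_t(x \mid x_1) = \Big(\dot\alpha_t - \frac{\dot\sigma_t}{\sigma_t}\,\alpha_t\Big) x_1 + \frac{\dot\sigma_t}{\sigma_t}\, x ,
```

where the last expression follows by substituting x_0 = (x - alpha_t x_1) / sigma_t, with boundary conditions alpha_0 = 0, sigma_0 = 1 and alpha_1 = 1, sigma_1 = 0.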

In fact, we can have various prediction types that are related to flow matching. If we condition our velocity field on x1 and then marginalize (so the conditioning variable z is x1), we end up with a velocity field that is a function of x and of what people typically call the x-prediction, or the denoiser, and we can transform from the x-prediction to the velocity field with a very simple affine function. The same goes for epsilon-prediction: if we condition on x0 and marginalize, we get the expression that connects the velocity field and the epsilon-prediction. And lastly, if we condition on both, we have an expression that depends on both. So in fact,

we have shown that all of these types of predictions are transferable and we can move from one to another quite easily。

One instance of a choice of conditional flows, which was proposed in the flow matching paper, is to use affine conditionals that connect x0 and x1 in a straight line: it is a linear interpolation, with alpha_t equal to t and sigma_t equal to 1 minus t, and the corresponding velocity field is x1 minus x0. Okay. We could also instantiate

flow matching for a Gaussian source distribution, of course, and in that case x_t will be distributed as a Gaussian, and we can also instantiate the variance-exploding and variance-preserving schedules that are popularly used in diffusion models. So we see that we have lost nothing in terms of the expressiveness of our model when we move to flows: we can also create those paths with flow models, but on the other hand we cannot create, for example, the conditional optimal transport paths with diffusion. Just to visualize how these trajectories differ: these are conditional trajectories, the black points are initially sampled from x0, and we see how the conditional flow moves them to x1, the red point. With conditional OT we see that these are straight lines, and with variance preserving the trajectories sometimes overshoot the sample and go past it.

Okay, so we have defined our way to build a generative model; let's put it into a recipe that we can follow. Given a set of samples from our target distribution q, we need to construct a probability path p_t

such that at time zero, it is the source distribution and at time one。

it will reach an approximation of our target。And the way that we build these probability paths is by building conditional flows。

🤧And。To learn our generative model, we need to learn a velocity field UT with the conditional flow matching loss regressing to conditional velocities。

our goal is to build marginal flows that generate PT。Okay。

so now we're going to have a short coding session, trying to put these things into practice. So, one second.

Do you see my screen? Yes. All right. Okay. So we're going to build a very simple example of a flow matching model on a two-dimensional dataset, and we're going to follow our recipe for building this model: we get our data, then construct the probability path by defining a conditional flow, then write down the training loop, and eventually we also sample from it. I have pre-cloned the repo; we're going to need it for our data.

And here we have defined the architectures or these are standards set up for neural network training。

we're not going to go over it。So we define our architecture, which is a simple MLP。

and then we define a model that will represent the velocity field that we're learning and the optimizer that we're going to use。

And now let's define our data. In the assets directory we have two images, one that says LOG and one that says 2024, and we're going to create two distributions based on masks of these images; here we have a function that samples uniformly from the white digits or letters, so let's sample from our data. Oh, what happened? I'm not in the right directory. Okay, I'm not sure where; I'll just use an absolute path so we don't waste much time on this. I don't know what happened. Okay, something is wrong here and I'm not sure what it is. Okay, so let's try the other way around: we'll try to connect to a hosted runtime. There aren't any GPUs. Okay. Do you have any questions in the meantime while I try to solve this problem? Okay, so we're going to try it this way. Okay, Joey, do you maybe want to go ahead and continue, and we'll try to get back to it later? Yeah, so I can take over the second part,

but for what it's worth, some of the people in the audience say that it's working for them, so maybe it's stochastic. I don't know; you can see that the images exist at the path and they open. So another thing we could do is just go over this; I'm sorry, I don't know why it doesn't work, but let's go over it quickly. If you sample from these masks that I showed you, you get source samples that write out LOG, and the target samples write out 2024, and we're going to try to learn a flow model from the LOG distribution to the 2024 distribution.

Okay, so this is the data, and now we're going to define our conditional flows. For conditional optimal transport we have a very simple expression for our flow: x_t is (1 minus t) times x0 plus t times x1, and the velocity field is the derivative

With respect to time, yielding x1 minus x0。So to write down the training loop。

so we need to take our expectations over time uniformly and we need to sample points from our source and our target distribution。

so this is the function that samples uniformly from the mask。

We're going to compute our conditional flow x_t and the conditional velocity field u_t, and run the model

To compute the neural network prediction of the velocity field and our loss is simply an L2。

Distance between the predicted velocity field and the target conditional velocity field。
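Since the live demo broke, here is a minimal sketch of the training step just described. The mask samplers from the notebook are replaced by uniform placeholders, and the architecture and hyperparameters are illustrative assumptions, not the notebook's actual code:

```python
import torch
import torch.nn as nn

def sample_source(n):   # placeholder for the notebook's "LOG" mask sampler
    return torch.rand(n, 2)

def sample_target(n):   # placeholder for the notebook's "2024" mask sampler
    return torch.rand(n, 2) + 1.0

# Velocity field network v_theta(x, t): input (x, t) in R^3, output a 2D velocity.
model = nn.Sequential(nn.Linear(3, 128), nn.SiLU(),
                      nn.Linear(128, 128), nn.SiLU(),
                      nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(batch_size=1024):
    x0, x1 = sample_source(batch_size), sample_target(batch_size)
    t = torch.rand(batch_size, 1)              # t ~ Uniform[0, 1]
    xt = (1 - t) * x0 + t * x1                 # conditional-OT interpolation
    ut = x1 - x0                               # target conditional velocity
    pred = model(torch.cat([xt, t], dim=-1))   # predicted velocity at (x_t, t)
    loss = ((pred - ut) ** 2).mean()           # conditional flow matching loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```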

So we train this for a while, and in the meantime we were supposed to write down our Euler solver that will allow us to simulate from the model. The Euler solver, as presented in the slides, gives us a recursive formula for evolving x_t in time: x at the next step equals x_t plus the velocity field times the discretization interval we chose, and this discretization interval is the difference between the end time and the start time divided by the number of steps we want to take; we also need to advance t as we take our steps.
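And a matching sketch of that Euler sampler, continuing the hedged example above (again, not the notebook's actual code):

```python
import torch

@torch.no_grad()
def euler_sample(model, x0, n_steps=100, t0=0.0, t1=1.0):
    """Integrate dx/dt = v_theta(x, t) from t0 to t1 with fixed-step Euler."""
    x, t = x0.clone(), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        t_col = torch.full((x.shape[0], 1), t)
        x = x + dt * model(torch.cat([x, t_col], dim=-1))
        t += dt
    return x

# e.g. samples = euler_sample(model, sample_source(2000))
```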

If everything works and we take our model and sample from the Euler sampler, we can see that we generate samples that write out 2024. And eventually we can animate the point trajectories; this should work, I think, because it's independent. Here we see the flow going from the source distribution, writing out LOG, to the target distribution, 2024. I think it's a nice illustration of the flexibility of flow matching: we can use any source and any target distribution and learn a flow model between them.

Okay。So that was less successful than expected, but let's move on。

I have like two more slides and we're done. Okay, can you see my screen? Oh, it's the other screen. Okay. So flow matching has been extended to various interesting applications;

the one we're also going to focus on in the next parts is flow matching for general geometries, and it has also been used for designing models that are equivariant to permutations and all sorts of symmetries. I think the most prominent use of flow matching is its adoption in the Stable Diffusion 3 model in place of diffusion. It has also been used for controlled generation and to solve inverse problems, also exploiting the fact that we can use any source distribution (any prior) that we want, to learn different types of generative models.

So to sum up this part, we've shown that flows are powerful generative models when they're supervised adequately。

for example with flow matching, and it gives us a flexible framework for training generative flows. Compared to diffusion, they also offer improved sampling speed and stability when we use the conditional optimal transport paths instead of, for example, the variance-preserving ones.

Some open challenges: we still have to simulate the process, so sampling a point from the generative model involves taking multiple forward passes through the neural network, which can be prohibitive; it's still a challenge to learn a one-step model that performs as well as these models without distillation. The second thing is to scale these to other domains: flow matching has been very successful for images, but on language, for example, it still does not match the performance of autoregressive models.

So on that note, thank you for listening, and Joey, you can take it over.

Thanks Heli for the great first part of the tutorial. Let me start sharing my screen. Well, let's go here. In the meantime, if there are any questions we can take some questions now; otherwise we can jump straight to the second part of this tutorial, which is about how we build in geometry for machine learning. Okay, if there are no questions, I'll get straight to the point and get started.

So when we think of generative models on manifolds。

the first question you have to ask yourself is which of the current ingredients actually remain the same and which need to be rethought from first principles. For instance, our data previously lived in regular Euclidean space; now the first thing to think about is what happens to the data when you parameterize it on a manifold. How do we construct probability paths? Before, they were quite simple, they were straight lines, but on manifolds there's no such easy notion of a straight line, or the conventional notions have to be adapted. How do you build conditional flows, and how do we even define velocity fields? All these things need to be elevated to manifolds. This brings in many questions, the first being how you even represent a point on a manifold; this is a design decision. Also, in regular flow matching or in diffusion we had a Gaussian source distribution; unfortunately, for manifolds there is no canonical definition of a Gaussian distribution (there are obviously analogues of it), so the choice of a source distribution has to be adapted as well. There are different choices, and it's unclear which is always the superior one. More importantly, when you define your flow map, in regular Euclidean flows we could easily write alpha_t times x1 plus sigma_t times x0; unfortunately, manifolds don't have this vector space structure, so we cannot do addition, and this immediately breaks. Even defining straight flows is a challenge. Lastly, the notion of velocity and vector fields also needs to be generalized to manifolds. We're going to tackle these points, and in doing so we're going to take the point of view of geometry: what do we need to invoke from geometry to define these concepts for generative models?

The first thing is: how do we even characterize points on the manifold, that is, the data distribution that we want to represent? To do that we first have to define what a manifold is. For the purposes of this presentation, and this is not going to be extremely rigorous but it will be serviceable, you can think of a manifold as a smooth topological space that locally looks like a vector space but globally, when you glue together these patches, can look totally different. As a result, we can say that each data point belongs to this topological space. But even then you have to think about a few considerations when you do generative modeling: how does the curvature of the manifold affect numerical stability; how do we even parameterize this manifold (there's more than one choice: extrinsic coordinates, intrinsic coordinates, and others); and finally, what additional structure do we need to impose to actually do generative modeling? Just having a smooth topological space is not enough; we need to add additional structure, and we'll see that the structure we really need is the notion of a metric, which allows us to define distances as well as angles and norms.

So moving forward。So let's be a bit more formal about what a manifold is so a manifold as I mentioned earlier。

there are local patches that look like Euclidean space. More formally, these are called charts: each chart is an open subset U_i with a corresponding homeomorphism phi_i that takes you from this chart to a corresponding Euclidean space. In the picture here we have some manifold and we are covering it with these subsets, which may intersect, and from each subset there is a chart, a map phi, that takes you to an equivalent Euclidean space. The reason this is nice is that we know how to do deep learning and generative modeling in Euclidean space, so we want to exploit this fact to do generative modeling on manifolds. One thing to note is that when you define charts that cover the manifold, we assume the charts are smooth, so we've already added some structure to the problem; smoothness means that the charts and the maps between charts are C-infinity, that is, infinitely continuously differentiable, and this is an important criterion because the vector fields that we will define have to satisfy this property.

So in this picture, if I define a subset U_i, I have a chart that takes me to a corresponding Euclidean space, and if there is another chart U_j, there is a corresponding homeomorphism phi_j that takes me to its own Euclidean space. Now the key condition that holds everything together is how we glue charts together. More specifically, if the charts intersect, we need to handle the overlap region; for this chart-to-chart gluing you can just follow the arrows: apply phi_i inverse to go back to the manifold and then phi_j to land in the other chart. So there is a compatibility condition, which says that whenever charts intersect, there's a smooth way of going from chart to chart, and as a result you can glue together a manifold by picking charts that cover the entire manifold, where each chart has its own phi map which defines a corresponding local

In space。And that's kind of the high level idea。So having said that。

how do we actually represent the manifold? There is more than one way of representing the same geometric object; the two main ways I'll talk about are extrinsic versus intrinsic. What do I mean by extrinsic? Extrinsic means you take a manifold and embed it in a higher-dimensional Euclidean space. In the picture here we have a sphere in R3, with R3 being the ambient space and the sphere being S2, and we say that the sphere lives in, or is embedded in, this higher-dimensional Euclidean space. More formally, you can embed any manifold by defining what's called an inclusion map: the inclusion map is a function iota that takes every point on the manifold and sends it to an equivalent point in R^m.

And the general principle, when you think about parameterizations of manifolds for generative modeling, is that we want to think like a deep learner, which is to say: let's make it as close as possible to something that already works, that we know is scalable, and we know how to work really well with Euclidean spaces. So extrinsic parameterizations of manifolds are very nice for this exact reason, because you can use the coordinate system of the ambient space to run a lot of your deep learning machinery. One quick advantage here is that the ambient space is a vector space, so a lot of the familiar operations like addition are permissible in the ambient space, which may not be permissible on a general manifold.

so that's something to think about。So another equivalent choice is an intrinsic coordinate system so when I say intrinsic I mean a local coordinate system where you just choose charts and the charts that cover the manifold and you do all of your computation in these charts so a very simple example for the sphere because the sphere is such an illustrative example is that you can do something called stereographic projection and you can cover the sphere with two charts and these charts are essentially divide the north pole in the south pole so if you take a point on the upper hemisphere of the sphere and you start drawing a line from that point and see where it intersects the plane that cuts the sphere in half that's called the stereographic projection of the sphere and you need two of these charts so u plus a new negative to cover the top half of the sphere and the negative of the sphere so immediately there's a problem with this parameterization not from a mathematical point of view but from a machine learning point of view the problem is that you slowly reach numerical instability as you get to the poles。

When you're thinking about doing geneative modeling, using these charts。

you have to be conscious of this fact because for normal geometric applications。

when we actually do numerical computation this could be a problem so as a result one preferred solution is to maybe perhaps use the extrinsic coordinate system as opposed to the intrinsic coordinate system on the sphere。

Another equally valid choice of coordinate system for the sphere is what's called an almost global coordinate system: any point on the sphere can be represented by some radius from the origin, some angle theta, and some angle phi; if you give me these three things I can pick out any point on the sphere, so it is a valid parameterization. Many times this is also going to be a favorable parameterization. Again, one thing you have to consider when you pick a parameterization is its potential drawbacks. One drawback here is that, because we're using angles, we have to use trigonometric functions, and these trigonometric functions, as well as their inverses, are not always numerically stable; sometimes during computation we have to compute inverses of these trigonometric functions, and for small angles they can be numerically unstable.

Okay so the next object that we really need to define is that for diffusion models and flow matching models。

everything is related to defining velocity fields and vector fields。

so to generalize the notion of velocity and vector fields we need to define a tangent vector so a tangent vector is essentially for every point P on a manifold you define a function that takes you to RN and we want to kind of essentially say oh。

a tangent vector is tangent to a point on the manifold so in this picture over here we have the point P and a tangent vector is you can think of it as like living on the tangent plane attached to the point P。

Now we'll have a more formal definition, but the picture is。

Essentially the same for all types of manifold, you want to define a tangent vector that's essentially tangent to some sort of curve at a particular point。

😡,And if you have a local coordinate system, we can always represent the tangent vector using the local basis。

so in this picture over here, imagine your tangent space。

which is the set of all possible tangent vectors at a point P you have a basis vector so you can represent each tangent vector at this point using the equivalent tangent basis and this is what we call a local basis to represent the tangent vector。

😡,So a bit more about tangent spaces, so a bit more formally, a tangent space。

you can define it as like by first defining a parametric curve, which is time dependent。

if you pick a curve gamma that goes from, let's say, minus 1 to 1, such that it passes through the point P at t equals zero,

you can think of a tangent vector being the time derivative at that point on the curve。

And as a result, you can think of this as the DDT of the curve gamma evaluated at t equals0。

so that's formally what a tangent vector is: it's literally tangent to a curve that passes through that point.

😡,Now, as a result of this, because we have the tangent vector and we have a local chart at that point。

we can always say: we can take the basis vectors in R^d and pull them back using the map phi inverse to define an equivalent set of basis vectors for the tangent space.

😡,So this gives you a local parameterization for how to define a tangent vector using intrinsic coordinates。

So a bit more formally, if you have a point p, the local coordinates at p are given to you by the differential, say D phi_p, which pushes forward tangent vectors from the tangent space at p to the equivalent space R^d, and you pull back by taking phi inverse, so each basis tangent vector of the tangent space can be represented by pulling back, through phi inverse,

the equivalent Rd vectors, and that gives you a tangent basis。

so as a result we can take the vector V in the tangent space and rewrite it using the coordinate system of the tangent space。

Right。I told you how to define tangent vectors using the local coordinate system。

but I also said that extrinsic parameterizations of the manifold may be favorable for some computation。

so the natural question is: can we exploit this extrinsic parameterization to define tangent vectors in the ambient space? The answer is yes, and the way we do that is to note that you can always embed your manifold in a higher-dimensional ambient space, and because we did it through the inclusion map, which gives you for every point a corresponding point in the ambient space, we can also express the tangent vectors of the manifold in the coordinate system of the ambient space: we take the phi inverse map from the chart and compose it with iota, so we take the local basis vectors e_i, apply the iota-and-phi-inverse composition, and we obtain an equivalent way of representing tangent vectors, but in the coordinate system of the ambient space. This is super powerful and super cool, because you can essentially do all your computation in the ambient space using the basis vectors of the ambient space.

So this gives you a natural question, which is: how should we parameterize velocity fields? The velocity field is what we care about when we are doing flow matching, and equivalently for diffusion. How should we parameterize it?

One direct option is to output directly in the local chart, so that the output of the network lives directly in the tangent space at a point p.

Alternatively, we can output directly in the extrinsic basis; because the extrinsic basis is built from the tangent basis, the output will automatically lie on the tangent space.

Or finally, we can take a velocity field in the ambient space and project it onto the tangent space, if we have a projection matrix P.

In pictures it looks like this: we have the local parameterization, then the extrinsic parameterization where we change basis using the basis vectors of the ambient space, or alternatively we output in the ambient space without caring where the manifold is, as long as we have a projection matrix to project back to the tangent space at a point p. For certain manifolds like the sphere, we have this projection in closed form, but for many manifolds it can be too difficult or too expensive to compute, or not available in closed form. As a result, which method you use in practice really depends on which type of manifold you are working with: for the sphere the projection is easy to compute, but for another manifold we may not have it.
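As a concrete illustration of the third option (projecting an ambient velocity back to the tangent space), here is a minimal sketch for the unit sphere, where the projection has the simple closed form P = I - x x^T; the function name and usage are just illustrative, not code from the tutorial notebook.

```python
import torch

def project_to_sphere_tangent(x, v):
    """Project an ambient-space vector v onto the tangent space of the unit
    sphere at x (x is assumed to have unit norm). Since P = I - x x^T,
    P v = v - <x, v> x."""
    return v - (x * v).sum(-1, keepdim=True) * x

# Hypothetical usage: an arbitrary ambient vector becomes tangent at x.
x = torch.nn.functional.normalize(torch.randn(3), dim=-1)
v = torch.randn(3)
v_tan = project_to_sphere_tangent(x, v)
print((x * v_tan).sum())  # ~0, i.e. v_tan is orthogonal to x
```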

So now that we have talked about tangent spaces as well as smooth manifolds, what other structure do we need to actually build flow matching and diffusion models?

The first thing to realize is that when we build flow matching, the loss function requires us to compare vector fields, or velocity fields rather, and in this example over here we used the two-norm.

The problem is that on manifolds we do not have an equivalent version of the norm yet, so we do not know how to compute this quantity; we need to invoke additional structure to enable that. More importantly, we do not even know how to compute x_t: x_t is the point along the straight-line path we had in Euclidean space, telling us where the particle moving from time zero to time one currently is.

So the main structure we need to add is the notion of a Riemannian metric, and this metric is going to be super useful because it allows us to define an equivalent norm on the manifold, as well as distances and other useful quantities. More formally, a Riemannian metric is an assignment of inner products that varies smoothly over the manifold: the inner product consumes two tangent vectors, it is defined at a particular point p, and it varies smoothly from point to point on the manifold.

And this is a very important geometric structure because for spheres, the way you define the sphere's geometry is to say that the inner product of two tangent vectors gives you a positively curved quantity. So 1/k is essentially the curvature constant of the sphere, and it is a constant-curvature manifold because no matter where you are on the sphere, the metric is the same.

So I just want to reiterate one more time that choosing a metric is a choice.

We decided to add this choice because we wanted to be able to compute the loss function over here; if we were doing other geometric deep learning operations, maybe we would not need to make this choice. More importantly, you can equip the same space with a different metric, and it is a choice made by the modeler: we could have picked a different Riemannian metric on the space and we would have had a different modeling problem. Sometimes you choose a metric based on convenience rather than a prescription.

So having chosen a metric, a Riemannian manifold is formally a smooth manifold M equipped with a metric g, and this metric g is defined everywhere on the manifold.

Common examples of Riemannian manifolds are of course Euclidean space, where the metric is just the identity; the sphere, where the metric is 1/k at every point; and hyperbolic spaces, which are the foil to the sphere, with metric minus 1/k. Here is an example of a hyperbolic space: over here is the hyperboloid model of hyperbolic geometry (there is more than one model of hyperbolic geometry), but you can see that this space is negatively curved rather than positively curved like the sphere.

So why are metrics important? The Riemannian metric is important because, as I will explain in a second, it enables us to compute a lot of important quantities: we can compute lengths of vectors (the norm), we can compute distances, and we can compute angles. Again, the metric consumes two vectors on the tangent space, u and v, and to use it in practice it is just u transpose G v, where G is the matrix representation of the Riemannian metric; G is a positive definite matrix, and it is generally different at each point on the manifold.

So how do we actually use a metric to define norms? The way we define a norm formally is as the square root of the inner product of a vector with itself: the square root of the inner product of u with u is the norm of u, and it measures the length of the vector.

Similarly, you can define angles as follows: cos theta between two tangent vectors u and v is their inner product divided by the product of their norms.
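A minimal sketch of how the metric matrix G turns into norms and angles in code; the function names are mine, and G is assumed to be the positive definite matrix representation of the metric at the point in question.

```python
import torch

def riemannian_inner(u, v, G):
    """Inner product <u, v>_G = u^T G v for a metric matrix G at a point."""
    return u @ G @ v

def riemannian_norm(u, G):
    """Norm induced by the metric: ||u||_G = sqrt(<u, u>_G)."""
    return torch.sqrt(riemannian_inner(u, u, G))

def riemannian_angle(u, v, G):
    """cos(theta) = <u, v>_G / (||u||_G * ||v||_G)."""
    return riemannian_inner(u, v, G) / (riemannian_norm(u, G) * riemannian_norm(v, G))

# Hypothetical usage with the Euclidean metric G = I (recovers the usual norm and angle).
G = torch.eye(2)
u, v = torch.tensor([1.0, 0.0]), torch.tensor([1.0, 1.0])
print(riemannian_norm(u, G), riemannian_angle(u, v, G))
```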

And once you have this norm, we can define the loss function for flow matching a bit more crisply: what changes between the top expression and the bottom one is the addition of the metric, and now the norm is computed with respect to that metric.

So now we can actually compute this loss, or the norm inside this loss rather.

Now, the last thing we need to introduce is how to measure distances on a manifold. The way you do that is: if you pick two points on the manifold, say p and q, you want to find a curve gamma that passes through p and q and is the shortest path between them.

The curve gamma that is the shortest path between p and q is called the geodesic, and it is also the straightest path in the manifold sense.

The main idea for constructing the distance is to take a parametric curve and measure the length of its tangent vector as we traverse the curve: we integrate the norm, from time 0 to 1, of the tangent vector to the curve as we travel from p to q. Because we have a metric, we can measure the norm of that tangent vector along the entire parametric curve, and the distance on the manifold is defined by the curve that minimizes this integral.
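Written out, and assuming the curve is parameterized on [0, 1], the length functional and the induced geodesic distance being described look roughly like this (a standard reconstruction, not copied from the slide):

```latex
\mathrm{Length}(\gamma) \;=\; \int_0^1 \big\| \dot{\gamma}(t) \big\|_{g(\gamma(t))} \, dt,
\qquad
d_g(p, q) \;=\; \inf_{\gamma(0)=p,\ \gamma(1)=q} \mathrm{Length}(\gamma)
```

and the curve achieving the infimum is the geodesic.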

So it is a very natural definition of a distance. A couple of facts about it: it is indeed the shortest path, as I mentioned, and that shortest path is called a geodesic. It is also the straightest path, in the sense that it is straightest with respect to the metric.

And because it is the straightest path, it is also the most kinetically optimal path: traversing it gives you the lowest kinetic energy, and if you pick any other curve between p and q, you necessarily incur a higher kinetic energy, i.e. more energy is needed to traverse that path.

Sometimes it is a bit confusing to think about geodesics, but it becomes really easy when you visualize them. Over here I have a visualization of the probability simplex: the probability simplex is a manifold where each point in the interior is a categorical distribution, and I want to draw a path between this blue dot and this orange dot, where each point along the path is again a categorical distribution. If I draw a linear path, you can see the line is straight; but if I change the metric on the space to another metric, called the Fisher-Rao metric, you can see the path bends, and this bent path is now the shortest path with respect to that different metric.

And the main thing I want to highlight here is that different metrics induce different geodesics: you can have the same space, but by choosing a different metric you have changed the geometry of the problem. So the Riemannian metric is really a very important gadget.

Okay, so because we picked a Riemannian metric, we can compute distances, angles, and norms; what does that actually allow us to do?

If you recall, in the flow matching setup we had to compute this loss function, which is the L2 regression of the target velocity field against the one you are learning.

Also, because we can compute distances, we can now state an equivalence between two objectives: we can use the x1-prediction (target-prediction) framework and say that we just want to minimize the distance between the predicted endpoint and the ground-truth endpoint. This is a completely equivalent loss function to the first one, except that it uses a different parameterization: your network predicts the endpoint rather than the velocity at an intermediate step.

Similarly, for diffusion we want to predict the score function and regress onto it, and we can do that now because we have an equivalent norm.

As a result, we can also define the analogue of target prediction for diffusion models, which is called epsilon prediction: we predict the noise added to the sample.

And I remind you, all of this is connected in the Euclidean setting to an ODE and an SDE.

However, one thing that is still not clear so far is how we actually integrate on the manifold, because we do not yet know how to simulate this ODE outside Euclidean spaces. The reason is again that the manifold by itself does not have a vector space structure, so we do not yet know how to take an Euler step, or an Euler-Maruyama step for SDEs.

To make this a bit more crisp: when you simulate flows and diffusion models in Euclidean settings, the Euler integration says that the next step is x_t plus the velocity prediction times some time step delta t; we just move along this direction for time delta t and we get to the next point. Similarly, for diffusion models we have a similar update where you move along the direction controlled by your score function, plus you add a little bit of Brownian motion, and then you keep going along this path.

Now, what breaks for manifolds? You need the next point to be guaranteed to lie on the manifold, and if you just naively add two things, you will not stay on the manifold.

And finally, for the SDE use case, the hard part is how to define Brownian motion on a manifold: at the very beginning I told you there is no canonical definition of the Gaussian distribution, and even though there is always a way to define Brownian motion, it may not be easy to compute in closed form, and for many manifolds of interest it is actually a very tricky object to compute.

All right, so to do the actual Euler integration, or its equivalent, we need to introduce a couple more manifold operations. The first question is how we move between the tangent space and the manifold in both directions, and finally how we move vectors between tangent spaces. The first operation is the exponential map, which takes a tangent vector and finds the corresponding point on the manifold.

The logarithmic map is the inverse of the exponential map: you take a point on the manifold and bring it back to the tangent space.

And the last operation is called parallel transport, which tells you, if you move a vector from one tangent space to another, what curve it has to follow so that you do not change the vector, in some sense.

So the first operation, as I mentioned, is the exponential map, which tells you: if you give me a tangent vector at a point p, how do I find the corresponding point on the manifold?

In pictures the exponential map is very simple: it says that if I start at a point p and my instantaneous velocity is v, and I find the curve gamma that corresponds to this velocity vector, where will this curve take me if I follow it for time one? If I follow this curve in the direction of v for time one, which point on the manifold do I land on?

The output of the exponential map is therefore a point on the manifold, reached by travelling one unit of time along this curve gamma, and conceptually this is similar to addition in Euclidean space.

For spherical geometry, the closed-form expression of the exponential map is this nasty-looking term over here, so already you can see that computing it is a bit more involved than just adding in Euclidean space, and this operation will of course change depending on which manifold you are on.
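For concreteness, here is a minimal sketch of that closed-form exponential map on the unit sphere, written in the extrinsic (ambient-space) representation; the function name is mine, and in practice you would likely use a library implementation such as geoopt's, discussed later.

```python
import torch

def sphere_expmap(x, v, eps=1e-8):
    """Exponential map on the unit sphere embedded in R^{d+1}:
    exp_x(v) = cos(||v||) x + sin(||v||) v / ||v||,
    where x has unit norm and v is tangent at x (i.e. <x, v> = 0)."""
    norm_v = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.cos(norm_v) * x + torch.sin(norm_v) * v / norm_v
```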

Now, the inverse of the exponential map, in most cases, is the logarithmic map, and the logarithmic map is exactly the same idea but in reverse. Yeah, so there's a question in the audience. Yes?

[Audience question, partly inaudible: isn't this expensive to compute?] It can be expensive, yes. Oh sorry, can you also repeat the question? Yes, of course, let me just go back to the slide. The question was whether the operation and the expression at the bottom, for the sphere, is expensive to compute at all points. In the case of the sphere it is not super expensive, because it is just cosines, but for other manifolds like hyperbolic spaces what becomes tricky is that this turns into an inverse cosine, and it becomes numerically unstable near the boundaries of the manifold. So this is a very good question.

Sorry, can you speak up? [Another audience question, partly inaudible.] So the question was whether this is defined for every p on the manifold. Yes, the exponential map is defined for every point: every point on a Riemannian manifold has a corresponding exponential map, and most of the time it also has a corresponding logarithmic map.

So thanks for the question. The logarithmic map is the inverse of the exponential map, and it is well defined only in a local neighborhood where the exponential map is a diffeomorphism. The idea is the following: if you start at a point q and follow the curve backwards in time, which corresponding vector v would you have started from at p? That is the intuitive definition of the logarithmic map, and the expression below is the logarithmic map for the sphere; you can see the expression is very different from the one for the exponential map, because it is doing something completely different.
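A minimal sketch of the corresponding logarithmic map on the unit sphere, again in the extrinsic representation; note the clamping of the inner product, which is exactly the kind of numerical-stability trick mentioned above (the function name is mine).

```python
import torch

def sphere_logmap(x, y, eps=1e-7):
    """Logarithmic map on the unit sphere: the tangent vector v at x such that
    exp_x(v) = y. Its length is the geodesic distance arccos(<x, y>) and its
    direction is the component of y orthogonal to x."""
    inner = (x * y).sum(-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    dist = torch.acos(inner)                  # geodesic distance on the sphere
    direction = y - inner * x                 # project y onto the tangent space at x
    direction = direction / direction.norm(dim=-1, keepdim=True).clamp_min(eps)
    return dist * direction
```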

So the last operation I want to introduce is the notion of parallel transport. (Oh sorry, this slide should say parallel transport, not log map.) The idea is the following: if I have two tangent spaces at p and q, and I want to move a vector v from the tangent space at p to the tangent space at q, how do I move it so that I do not essentially change the vector in some sense? And I am being intentionally vague about what I mean by "change".

The idea is that if you take two vectors v and w and compute their inner product at every point along the curve, it stays the same: there is no change in the inner product, so the metric is respected. This gives you the corresponding vector v transported along the curve gamma into the tangent space at q. Parallel transport is a unique operation and it is also reversible, which means you can take the inverse parallel transport, which takes a vector at q and transports it back to a vector at p along the same curve.

Right, so these manifold operations may seem a bit esoteric, and it is unclear why we need them, so let me tie them back into the generative modeling use case. The target velocity in the Euclidean case was simply u_t(x | z), but if you write out the actual expression it is the time derivative of x_t. This corresponds to a particular choice of path, and it is essentially telling you: if I pick a straight line between x0 and x1, what is the velocity of a particle travelling at constant speed between x0 and x1, evaluated at a particular point? So really, the straight line is a geodesic in the Euclidean sense.

So how do we lift this to the Riemannian setting? In the Riemannian setting we have a corresponding expression, which says: take the log map at the current point along the geodesic (remember, the log map gives you a tangent vector) and divide it by 1 minus t, and that tells you how fast you need to go. So again, a straight line in Euclidean space is a geodesic, and in the Riemannian setting we have a corresponding expression as well.

So how would you actually calculate this in practice? One example is the manifold SO(3), the group of 3D rotations.

These are orthogonal matrices in 3D. The logarithmic map for SO(3) happens to be the matrix logarithm, whose input is a rotation matrix: that is how we take a rotation matrix and transform it back to its corresponding tangent vector. Unfortunately, if you compute the matrix logarithm naively, you have to evaluate an infinite power series truncated at some number of terms, which is a very expensive operation. So even though mathematically it is very easy to define, in practice it can be very difficult to compute.

And this is a place where, from a machine learning point of view, you have to make a very conscious design decision, which is to say: I can change the representation of my input. Instead of representing 3D rotations as rotation matrices, I can change to another representation, the axis-angle representation, which is simply the expression over here: a vector omega whose direction is the axis of rotation and whose length is the rotation angle.

As a result this is a completely different representation of the same object, but the logarithmic map for this representation is much easier to compute; you do not have to deal with an infinite power series of matrix powers.

And in practice, when we implement things on SO(3), we prefer this representation because it is much faster to compute.
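As an illustration of why the axis-angle route is simple, here is a minimal sketch (my own illustrative code, not from the tutorial) that recovers the axis-angle vector directly from a proper rotation matrix R using the standard closed form, with no matrix power series. It is numerically delicate near angles of 0 or pi, which is exactly the kind of boundary instability discussed earlier.

```python
import torch

def so3_log_axis_angle(R, eps=1e-7):
    """Log map of SO(3) expressed as an axis-angle vector omega in R^3.
    Uses theta = arccos((trace(R) - 1) / 2) and reads the axis off the
    skew-symmetric part (R - R^T) / (2 sin(theta))."""
    trace = R.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos_theta = ((trace - 1.0) / 2.0).clamp(-1 + eps, 1 - eps)
    theta = torch.acos(cos_theta)
    # Vector built from the skew-symmetric part of R.
    skew = torch.stack([R[..., 2, 1] - R[..., 1, 2],
                        R[..., 0, 2] - R[..., 2, 0],
                        R[..., 1, 0] - R[..., 0, 1]], dim=-1)
    axis = skew / (2.0 * torch.sin(theta)).unsqueeze(-1).clamp_min(eps)
    return theta.unsqueeze(-1) * axis  # omega = theta * unit axis
```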

All right, so now that we have developed intuition for the manifold operations, we can formally define how to do inference on manifolds. Intuitively, the one expression that changes is: instead of doing addition, we apply the exponential map at x_t. We compute everything in the tangent space (our u_t is a tangent vector, multiplied by delta t), and then we take the exponential map to get to the next step.

Similarly, for diffusion models, we also take the exponential map of the term inside the expression over here. The other thing to note is that the z_t, which previously was a regular normal distribution, is now a tangent normal distribution, i.e. a normal distribution in the tangent space at the point. And remember, the tangent space is itself a Euclidean space, so this is equivalent to a Euclidean normal.
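A minimal sketch of that manifold Euler step for the flow ODE, written against geoopt's generic manifold interface (the expmap/projx/proju calls are part of geoopt's manifold API; the step function and variable names are mine):

```python
import torch
import geoopt

@torch.no_grad()
def riemannian_euler_step(manifold, x_t, u_t, dt):
    """One Euler step of the flow ODE on a manifold:
    x_{t+dt} = exp_{x_t}(dt * u_t), where u_t is a tangent vector at x_t."""
    return manifold.expmap(x_t, dt * u_t)

# Hypothetical usage on the unit sphere with a random "velocity".
sphere = geoopt.manifolds.Sphere()
x = sphere.projx(torch.randn(3))        # a point on the sphere
u = sphere.proju(x, torch.randn(3))     # a tangent vector at x
x_next = riemannian_euler_step(sphere, x, u, dt=0.01)
```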

Right, so the one thing we still need to talk about, for diffusion models specifically, is the score function: the score function is the gradient of log p_t(x), but we have not yet defined how to compute a gradient on a manifold, so this is one of the last pieces we need to complete the story.

The way we compute gradients is to lift the gradient operation to a Riemannian gradient, which has a very similar interpretation to the Euclidean gradient: if you give me a smooth function on the manifold, the Riemannian gradient is the direction of steepest ascent, critically with respect to the metric g.

The formal definition is that the Riemannian gradient is the vector grad_g f such that its inner product with any tangent vector v equals the differential of f at that point applied to v; the vector satisfying this is the unique tangent vector grad_g f in the tangent space at that point.

In Euclidean spaces, the equivalent understanding would be the directional derivative.

However, this definition is a bit abstract, so how do we use it in practice? When you want to express the Riemannian gradient in coordinates, operationally it just means that you take the matrix representation of the metric, compute its inverse, and multiply it by the Euclidean gradient.
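In code, that operational recipe is just a linear solve against the metric matrix; here is a minimal sketch (names are mine), with an arbitrary positive definite G standing in for the metric at the current point.

```python
import torch

def riemannian_gradient(euclidean_grad, G):
    """Riemannian gradient in local coordinates: grad_g f = G^{-1} @ grad f.
    G is the (positive definite) matrix representation of the metric at the
    current point, and euclidean_grad is the ordinary coordinate gradient."""
    return torch.linalg.solve(G, euclidean_grad)

# Hypothetical usage: with G = I this reduces to the Euclidean gradient.
G = torch.eye(3)
g = torch.tensor([1.0, 2.0, 3.0])
print(riemannian_gradient(g, G))
```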

So this gives you an operational tool for using the Riemannian gradient in actual computations, because everything about the geometry of the problem is encoded in the matrix representation of the metric G.

Similarly, we might want to compute the log density at a point. One of the benefits of flow matching models is that you can compute the change of variables, i.e. the change of log density, along the trajectory.

But to do this we need an equivalent version of the divergence operation: just like there is a Euclidean divergence, there is a Riemannian divergence.

The main idea is very simple. The divergence of a vector field tells you how much the vector field acts as a source or a sink at a point, and we can compute it by invoking the metric as follows: in a coordinate chart, you take the square root of the determinant of G, the metric, and you correct each term inside the sum by this factor, component-wise for each component of the vector field.
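Written out (a standard reconstruction of the coordinate formula being described, with |G| the determinant of the metric and v^i the components of the vector field in the chart):

```latex
\operatorname{div}_g(v) \;=\; \frac{1}{\sqrt{|G|}} \sum_i \partial_i\!\left( \sqrt{|G|}\; v^i \right)
```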

So what would this look like for Euclidean spaces? Remember the metric for Euclidean space is just the identity, so both of these factors are one, and you recover the Euclidean divergence operation.

And in machine learning we prefer to compute this divergence in a chart, because it is a very simple operation to code up.

All right, so the summary so far is that when we build deep learning models on manifolds, we need to take care of certain things. The first is that each manifold requires different design considerations. The first tip is to pick a parameterization of the manifold that makes the problem as close as possible to a Euclidean one, because we know how to do things in Euclidean space really well, and staying close to the Euclidean setting means the rest of the deep learning pipeline keeps working.

The other thing to be careful about is that certain manifold operations may be inherently numerically unstable, especially close to the boundary of the manifold, so you may want to reparameterize or pick a different geometry where the problem does not persist.

And the final tip is that, at least for manifold use cases, diffusion seems harder to do than flow matching. The reason is that you have to compute the score, which is a Riemannian gradient, and for diffusion you also have to define Brownian motion on the manifold, which is not trivial for certain manifolds. So the real question to ask yourself when doing generative modeling on manifolds is: do you really need an SDE, or can you get away with just an ODE? Most of the time an ODE will give you more bang for your buck.

Finally, there is no canonical Gaussian distribution on a manifold, so choosing the prior is also a design decision. Many people choose a uniform prior over the entire manifold because it is easy to do, but flow matching is compatible with any prior, so you can pick any source distribution and go to any target distribution. This is not possible for diffusion models on manifolds, because there you have to pick a Gaussian-like prior.

So I'll take a couple of questions now before we go to live coding, and as we move to live coding I want to quickly recommend a couple of Python packages that allow you to do these geometric computations a lot more simply. So I actually had a question. Go ahead.

Yeah, in your experience, are there problems for which the intrinsic representation actually helps compared to the extrinsic one?

Yeah, that's a very good question. The intrinsic representation helps whenever it is hard to find a projection from the extrinsic representation: a lot of the time the projection operator can be very nasty to compute, or expensive, or you may not have it in closed form, and in those cases you prefer an intrinsic representation.

Thanks. It's a good question, and unfortunately the answer varies from manifold to manifold.

Any other questions before we jump to coding?

Not anything that I see in the chat, but yeah, we can go ahead. Let's try to code some things; let's try to connect the runtime. Okay, let's see... all right, better. Yeah, all right.

So we're going to try to code something very simple, just to give you some intuition for some of the things we've talked about.

The first thing we're going to code up, because we talked a lot about its geometry, is the sphere: we're going to visualize some of the geometric concepts on the sphere.

So the first thing is that we just import a bunch of packages, and if it cooperates, we will continue. Okay, so it's running now. We'll also use the geoopt package, because this is the one package that has a lot of the nice manifold operations already written for you, and it's extremely simple to use, almost stupidly simple, so I want to illustrate the beauty of this package (and it's not my package).

All right, so as this is running: we're going to take the first example to be the sphere embedded in R^{d+1}, so this is an extrinsic representation. By definition, every point in R^{d+1} whose norm is equal to one can be thought of as living on the unit sphere.

And to do that, we're first going to code up the inner product on the sphere. The inner product on the sphere is essentially the Riemannian metric, so we're going to code up the Riemannian metric on the sphere. The metric will take in two arguments x and y, both tangent vectors, and we want to compute an inner product between them. The first thing I do is check that x and y are actually valid points; next, because we're embedding the sphere in R^{d+1}, we can use the metric of the ambient space. This means the inner product here is extremely simple, and already I see that Copilot is helping me out over here.

It's essentially the regular inner product in Euclidean space, where you take x times y and sum across the dimensions, and this is what we're going to implement. So it's stupidly simple, and the reason is that we've chosen to embed the sphere in R^{d+1}, the ambient space. If we had chosen an intrinsic parameterization, we would have to be a bit more careful.

As a result of this, the norm on the sphere can just exploit the Riemannian metric. Remember, the norm is the square root of the inner product of the vector with itself. So how do we do it? We use the sphere inner-product function we've defined earlier, and the norm is the square root of this inner product. That's simply the norm induced by the Riemannian metric.

Finally, as I mentioned earlier, this manifold is embedded in R^{d+1}, so we need to be able to project to the manifold. Any point in the ambient space can be projected to the manifold by simply dividing it by its norm, so I'll just code it up like this: each point divided by its norm, plus some small epsilon to ensure numerical stability, projects to the sphere directly.
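The live-coded helpers are not reproduced verbatim in the recording, but based on the description they look roughly like this sketch (the function names are mine):

```python
import torch

def sphere_inner_product(x, y):
    """Riemannian metric on the sphere embedded in R^{d+1}: in the extrinsic
    representation it is just the ambient Euclidean inner product."""
    return (x * y).sum(dim=-1)

def sphere_norm(x):
    """Norm induced by the metric: sqrt(<x, x>)."""
    return torch.sqrt(sphere_inner_product(x, x))

def project_to_sphere(x, eps=1e-8):
    """Project any ambient-space point onto the unit sphere by dividing by its norm."""
    return x / (sphere_norm(x).unsqueeze(-1) + eps)
```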

So while this runs: another equally valid parameterization of the sphere is the polar coordinate system, where we pick two angles, theta and phi, and this allows you to define any point of the sphere at a radius r from the origin using those angles. If you want to convert from these polar coordinates to an x, y, z coordinate system, the first coordinate is sin(theta) cos(phi), the second is sin(theta) sin(phi), and the z coordinate is cos(theta). As a result, we can use this coordinate system to visualize things on the sphere.

I've already written a function that does this for me, so I won't code it up from scratch, but essentially what it does is plot a shaded version of the unit sphere for you.

Okay, so now that we've done that, we can also see the difference between this intrinsic coordinate system and an extrinsic coordinate system. I have another function over here which visualizes the sphere by generating a lot of random points and then projecting them to the sphere by dividing by their norm. As a result, you will see it gives you an equally valid parameterization of the sphere: each of these points is now on the sphere, as you can see. I generated a random set of a thousand points, divided by the norm, and I'm on the hypersphere.

Okay. The next thing we're going to talk about is the other manifold operations: the exponential map, the logarithmic map, and parallel transport. Each of these is a bit more complicated, so it's easier to code them up with the package geoopt. For geoopt, what I've already done for you is define this sphere class, and as you can see the exponential map is simply self.expmap, where the first argument is the point on the sphere and the second argument is the velocity you follow to reach the endpoint of the exponential map. Correspondingly we have the log map, the distance, and the parallel transport on the sphere. The beauty of this package is that instead of coding up these equations from scratch, we have simple one-line expressions that give you the entire computation for free.

Okay, so given that we have the exp map, the log map, and parallel transport, I want to define a geodesic interpolant on the sphere. The geodesic interpolant is the actual interpolant we will use to do flow matching, so we need to be able to compute: if you give me two points, say x0 and x1, what is the corresponding point x_t if I follow this path?

The geodesic interpolant is quite simple to code up because we've defined our functions before. We want to find x_t, and we're just going to copy this expression over here: the expression says we use the sphere exp map at x0, applied to t times the sphere log map at x0 of x1. So what is this saying? We're starting at the point x0, and then we want to move toward x1, along the geodesic that links them together, for time t. And that's how you get x_t.
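As a sketch of that interpolant using geoopt's Sphere (the expmap/logmap/projx calls are geoopt's manifold API; the wrapper function name is mine):

```python
import torch
import geoopt

sphere = geoopt.manifolds.Sphere()

def geodesic_interpolant(x0, x1, t):
    """x_t = exp_{x0}(t * log_{x0}(x1)): move from x0 toward x1 along the
    geodesic for a fraction t of the total geodesic time."""
    return sphere.expmap(x0, t * sphere.logmap(x0, x1))

# Hypothetical usage: the midpoint (t = 0.5) of the geodesic between two points.
x0 = sphere.projx(torch.randn(3))
x1 = sphere.projx(torch.randn(3))
x_half = geodesic_interpolant(x0, x1, 0.5)
```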

Okay, so let's actually try to visualize a geodesic. I have two points on the hypersphere, these two red dots, and now I want to compute the geodesic between them, which is going to be a curved path between these two points.

To compute the geodesic, I take a discretization of time: 200 steps between 0.01 and 0.99, and I compute the trajectory, which is to say I compute the point x_t at each of these times. Quite simply, because we have our geodesic interpolant, we consume the points x0 and x1, give it a time parameter, compute the point, and append it to the trajectory.

If you do this, we get a curved path between these two points, and this path is the shortest path between them. Very simple to do, and if I were training a flow matching model, I would take the time derivative of this path, i.e. of a particle moving along it.

So that was the sphere. In the time remaining I also want to expose you to another geometry, which is not too different from the sphere: toroidal geometry, so it's going to be a torus in 3D. A torus in 3D looks similar to a sphere but is geometrically quite different: we again have two angles, theta and phi, and we have two different radii, a minor radius r and a major radius R, and a torus is essentially the product manifold of two circles.

If you take the product manifold of two circles, you have essentially what a torus is. So we can just generate some torus data; I've created a torus for you, and it essentially looks like this. It's quite a fat torus, and I'll generate random points on it and do the same trick as before, which is to say I want to visualize geodesics on a torus.

So what do the corresponding geodesics on a torus look like? We're going to use the geoopt package, and the torus is essentially going to be a product of circle manifolds (oops, this should be 2; okay, correct, one second). Yeah, so it's a product manifold of two S1 circles, and we're going to make a geodesic interpolant again.

The geodesic interpolant is essentially the same idea as before; the only thing we change compared to the sphere is the manifold, but the rest of the operations remain the same: exp map, t times log map, everything else is the same.

As you can see over here, when you do this, you get a corresponding geodesic that takes you from this point, wraps around the torus, and gets you to this point over here. Same idea, same thing, and that's the beauty of the geoopt package: it abstracts away all the nasty analytic expressions and gives you a very clean interface to code this up.

The last thing I want to mention is that the tutorial also contains a very detailed example of hyperbolic geometry, which I will not go through because it is a lot trickier to explain, but I invite you to try it yourself. There are a lot more cool things over there, such as the hyperbolic inner product and, if I run everything, the equivalent of Gaussian distributions: defining a Gaussian distribution on hyperbolic space, and how that changes based on the geometry of the problem.

So that's where I want to leave you all. I will take any further questions for this part of the tutorial;

if not, I'll pass it on to Alex afterwards. So there is a question in the audience: could you please tell me which part of the code you were talking about? I'm actually not sure if the audience is allowed to unmute themselves to ask questions, but I think the question is whether there are cases other than SO(3) where the matrix representation has fewer issues with the logarithmic map.

So SO(3) is a special case of a broader class: SO(3) is specifically a Lie group, and for matrix Lie groups the logarithmic map is the matrix logarithm, so you will have the same issues computing the matrix logarithm for any such Lie group, not just SO(3): SO(n), O(n), and other groups too. But that's a good question.

There's another question: that Euclidean geometries are preferred because we have better intuition, but aren't spherical and "harmonical" systems also well understood? I'm not sure what "harmonical" means here. I'm not sure about the question, but I think Euclidean geometry is preferred because of the numerical stability it offers, so if those systems have equivalent numerical stability I would imagine it's okay, but my intuition tells me they won't. So they actually meant hyperbolic systems? Yeah, so that's a very good question. For hyperbolic geometry, which I didn't go through and which you should definitely try on your own, we have to do a lot of additional tricks to make things numerically stable: for example, you have to clamp more aggressively and use a smaller epsilon value, because we are dealing with hyperbolic sine and cosine functions and they blow up really quickly. The reason is that each hyperbolic sine and cosine is built from e to the power x, so the value of x controls how big the result can be, and if x is greater than about 40 you run into floating point precision issues. That's why we have to clamp these things. Yeah.

I hope that answers the question. Yeah, I don't think there are any more questions; all right, in that case I will pass it on to Alex. Alex, if you want to share your screen.

Your audio is off. Can you hear us? Yes. Thanks. Where did my cursor go?

Oh, okay. Yeah, so this is the third part, where we're talking about how to mash these things together:

the geometric part and the generative part, and we'll mostly focus on flow matching, because I think it's the easiest one; we can talk about why that's the case in a bit. So I'm going to talk about applications, and categories of applications where this is easier and where it's harder, then talk about how to go from the Euclidean case of flow matching to Riemannian flow matching.

Then we'll look at a case study in protein design, in particular AlphaFold2 and how it represents these types of manifolds for proteins, because I think that gives us an interesting idea about which coordinate spaces are useful, and when intrinsic versus extrinsic parameterizations are useful in practice. Then flow matching on a torus, which is one of the cases that comes up in proteins but is very easy to realize; we'll go through that one with code. And finally some practical tips for these types of models.

Cool, yeah. So we looked at this slide, but basically the first category is where you have a nice logarithmic map and exponential map and you can parameterize geodesics, so everything is fairly easy; these nicely parameterized manifolds are used very widely. This first category is, I would say, where most of the focus in geometric generative models is, because it's much easier. The second category is non-parametric manifolds, where things are data-driven or the shapes are not easy: it's much harder to parameterize, and you don't know where the exp map is or what the geodesic is. These are also very interesting; for instance, one place this comes up is single-cell data, where data-driven manifolds are quite interesting but much harder to deal with. So geometric generative models have mostly focused on the first case, but I'll give hints on how to do the second.

Cool, so Euclidean to Riemannian flow matching. Okay, a notation reminder: z is the conditioning, and we're picking endpoints here, so it's a tuple (x0, x1). Q is the distribution over x0 and x1, and most of the time this is independent, which means you pick x0 and x1 independently. Then in the Euclidean case the easiest thing is just to take a straight line between the two, a convex combination with respect to t, i.e. a linear interpolation, and when you do that you get a velocity that is very easy: x1 minus x0. So here it is depicted: you pick your random pair of points from your distribution, you pick an x_t in the middle, and you have a vector field that goes from x0 to x1 in unit time.

Okay, so for Riemannian flow matching, we'll use the same first two parts: we take an x0 and x1 in the Riemannian setting (exactly what those objects are depends on your coordinate system), and we take the same independent distribution over them.

Then the interesting part begins: how do you get x_t? These subtraction and addition operations are not defined on Riemannian manifolds, as Joe was saying, so instead we have to use the exponential map and the logarithmic map. So this is the formula for x_t, and I'll give the intuition for what it means in a second. And for the flow, what we have is u_t = log_{x_t}(x_1) / (1 - t). Okay, so why is this the right thing to do? On a sphere you can look at the interpolation between x0 and x1. So what is this saying?

Maybe it's easiest to visualize in the Euclidean setting. For Euclidean space we know that the exponential map is just a + b and the logarithmic map is b - a, and you can rearrange these into a form that looks exactly like Riemannian flow matching. If you expand the exp map you get x0 plus something, and the log map gives t times (x1 - x0), so this is the same formula: you can rearrange it back into the original convex combination. The intuition is that you start at x0, you add t times the flow you would need to get from x0 to x1 in time one, and that gives your x_t; you move for t amount of time in the direction of x1.

And then for the flow, you say: I want to go from x_t to x1 in time 1 minus t, and so that's the flow computed with respect to x_t and x1.
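Collecting the two formulas being described (a reconstruction of what is on the slide, in the notation used so far):

```latex
x_t \;=\; \exp_{x_0}\!\big( t \, \log_{x_0}(x_1) \big),
\qquad
u_t(x_t \mid x_1) \;=\; \frac{\log_{x_t}(x_1)}{1 - t}
```

With the Euclidean exp and log maps these reduce to x_t = (1 - t) x0 + t x1 and u_t = x1 - x0, which is exactly the rearrangement in the intuition above.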

And so this is how it works in general; in specific cases there are many ways to parameterize these maps, so it depends, but this is the recipe. One thing to note is that this requires very efficient evaluation of your exponential and logarithmic maps, so a lot of the tricks are about how to do that efficiently, and for some manifolds you simply can't. If you can't, you have to do something more complicated, where you integrate x_t over time and then compute the time derivative of x_t, i.e. the u_t here, from that.

So yeah, those are the basics: all you need to know to go from Euclidean to Riemannian flow matching is to take the first half, which is very simple convex combinations, and use the exponential and logarithmic maps instead.

Cool, so are there any questions before moving on from this part? There is one.

Okay, so one piece we mentioned before is the log-likelihood computation. In the Euclidean setting it is very useful to compute the log-likelihood at time one given your vector field and your prior, and this slide gives you the calculation, but it requires the divergence with respect to the metric. So you can use the Riemannian divergence to calculate the log-likelihood at any time using this formula; in the Euclidean case it simplifies because the determinant of G is just one, so you get the standard form back. So that's all you need: you can do flow matching, and then you can calculate the log-likelihood if you would like, using this formula.
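For reference, the instantaneous change-of-variables formula being referred to has the same shape as in the Euclidean case, just with the Riemannian divergence (a reconstruction, not copied from the slide; densities are taken with respect to the Riemannian volume):

```latex
\frac{d}{dt} \log p_t(x_t) = -\operatorname{div}_g\!\big(u_t\big)(x_t)
\quad\Longrightarrow\quad
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \operatorname{div}_g\!\big(u_t\big)(x_t)\, dt
```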

Cool, okay. So we've talked about how you go from Euclidean to Riemannian flow matching, but how is this useful in practice? I'm going to talk a little bit about protein design and the representations there, and then we can look deeper into the different choices we can make.

Okay, so in protein design we're basically trying to design proteins, which I'll describe in a bit, that satisfy some properties. Why do this? Well, proteins can do a lot of interesting things: mostly medicines, vaccines, maybe something for climate change; we'll see what the applications are, but there are many reasons to do this.

Okay, so what might you want to design for? There are many objectives, and you can think of them as all being functions of the structure, and a little bit of the sequence. What does that mean? There is a function you want to optimize, and you want to generate structures that satisfy these properties.

Okay, so what is a protein? I guess that's the real question. Proteins are named as sequences because they are repeating patterns of molecules. There are a lot of combinations: they are built up from about 20 building blocks, so they are sequences over a vocabulary of size 20, and the sequence folds into some structure, that structure has some function, and that function does something. You would like to figure out: let's make a new protein, and what does it do?

Okay, but how do we represent proteins? Proteins are really these very long chain-like molecules; the backbone forms a sequence, and what the sequence really tells you is what the side chains are, i.e. what hangs off the backbone. So you have a couple of representations. One is a cloud of atoms: you look at the 3D coordinates of each atom location, so this is a sequence of length n, with roughly 19 atoms per amino acid if you count the hydrogens, and three coordinates each. You could represent them in Euclidean space, and we could in fact do Euclidean flow matching, or diffusion, or whatever, on all of this; in fact this is currently done this way as well, and we'll talk about that in a minute.

The other way of representing this, which I think is very interesting, is as a set of reference frames in SE(3): a central location for each amino acid and then a rotation around it, which tells you the orientation of that piece of backbone; that's what these triangles here are. Those are the main pieces, and then you also have the side chains, where you have a series of rotations off the main backbone, i.e. a series of torsion angles; that's what these little green things represent: how each piece is rotated relative to the previous torsion angle.

Okay, so why would you ever choose this sort of crazy representation? I guess there are a couple of reasons. This is just a different depiction: there are 20 different amino acids, with different colors for different types of molecules and different lengths, and the backbone frames parameterize where the C-alpha atom, the main one, is, plus the rotation around it.

So why would you want this representation? There are a couple of reasons. One is that it really enables us to use our prior knowledge about how these molecules are structured: we know these bond lengths pretty much precisely, so instead of having to figure out exactly how far apart two bonded atoms are, you can say, well, I know exactly how far they are from each other and how they rotate, and figure out where an atom should be relative to the previous one, without having to work out in 3D space whether it is too far away or too close.

So that's one reason. Another is that this is a bunch of fairly independent components, which makes it easier to learn from a machine learning perspective. You could also parameterize this, for instance, as all torsion angles, doing torsion angles for the backbone too, so you have one long series of torsion angles for the entire thing. That is a possible parameterization, but it's not as good because it's very global: if you make a small change in one place, it can change things all along the protein, so you change a small angle somewhere and it may cause a clash somewhere else. That is harder to learn. So this representation is doubly good: the components are largely independent, and it builds in our prior knowledge about structures.

Great. Okay, so AlphaFold is a model, which helped win the recent chemistry Nobel Prize, that goes from a sequence to a 3D structure. We're mostly going to be talking about the last part, the structure module, because that's the part with a lot of geometry in it. Everything else doesn't really have much geometry; it's mostly big embedding spaces, either of the length of the protein or of the length squared of the protein, and those are the main processing components. But the last part uses the geometry that we know about proteins, and that's what builds in the SE(3) and SO(3) spaces.

Okay, so what does that look like? Why do we care about this in terms of the generative modeling framework? AlphaFold2 is not really a generative model, I would say: it goes from a sequence to a structure, although you could claim it is one in some ways, because it starts from a structure and then does a few steps to refine it, so maybe a conditional generative model. But the real reason we're interested in it is that almost all protein generative models right now use the same structure module as the backbone of the network for reasoning about 3D space.

Cool, so the structure module of AlphaFold2 takes in backbone frames and a representation of the protein, a single and a pair representation, and outputs updated backbone frames and also angles for the side chains.

And this is the code for it. Thinking about this a little: s is the single representation, z is the pair representation, we take in these identity rigids, we slowly update the rigids, and then we have an angle resnet which takes the current representation and gives you angles back. So at the end you get angles and updated rigids in this very nice form. But we can look at the exact shapes here, which is interesting.

The rigids here are represented as n_residues by seven, and the angles are seven by two; these seem like magic numbers, and I'll explain where they come from in a second.

Okay, so why are they called rigids? Rigids here are elements of SE(3), i.e. a translation and a rotation, and the seven comes from the translation part being in R3 and the rotation part being a quaternion representing a rotation in SO(3). What's interesting is that the model directly says: here are seven coordinates for the element of SE(3), and it outputs those seven coordinates directly, treating them as intrinsic parameters that the model outputs.

The second part is the angles. Angles live in SO(2), so they are rotations. What's interesting is that there are seven angles here, but why is it outputting two dimensions for each? Well, it turns out what AlphaFold actually does...

Okay, let me get to this. It actually predicts a point in R2 and then projects it to the unit circle, which is a very interesting parameterization; that is one choice you can make. And why do they choose it? They choose to predict a point in R2 and project it to the circle instead of, say, directly predicting a point on the circle. This is another choice you can make, and it gets at the question of which parameterization you should pick for your manifold. We'll go over exactly why they might have chosen this in a minute. Okay.

So here, what you do for the angle resnet is you take in the representation, run it through a residual network to get a 2D prediction, and you divide by its norm; all that does is give you a point on the unit circle, which you can always transform into the torsion angle you want.
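A minimal sketch of that kind of angle head (this is my own illustrative module, not OpenFold/AlphaFold code): predict an unconstrained point in R2 per torsion angle, then normalize it onto the unit circle.

```python
import torch
import torch.nn as nn

class AngleHead(nn.Module):
    """Predict n_angles points in R^2 and project each onto the unit circle,
    so every output row is a valid (sin, cos)-style representation of an angle."""
    def __init__(self, d_model, n_angles=7, eps=1e-8):
        super().__init__()
        self.linear = nn.Linear(d_model, n_angles * 2)
        self.n_angles = n_angles
        self.eps = eps

    def forward(self, s):
        out = self.linear(s).view(*s.shape[:-1], self.n_angles, 2)
        # Divide by the norm: project the unconstrained R^2 prediction onto S^1.
        return out / (out.norm(dim=-1, keepdim=True) + self.eps)

# Hypothetical usage: one residue embedding of size 128 -> 7 unit-circle points.
head = AngleHead(d_model=128)
angles = head(torch.randn(1, 128))
print(angles.shape, angles.norm(dim=-1))  # torch.Size([1, 7, 2]), norms ~1.0
```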

Okay, so flow matching on a torus. The torus is interesting. What do we mean by a torus? This picture is the one-dimensional torus, i.e. a circle; the figure mostly shows the two-dimensional torus, and of course AlphaFold is using a seven-dimensional one. So what is the right parameterization for this kind of thing? There are many choices. One that's nice is to parameterize rotation matrices, so the rotation around the center as a 2-by-2 rotation matrix; you could also directly parameterize how far along the circle you are, as an angle in radians; or you could do the other thing, which is to predict a point in R2 and then project it down. So the real question is which one you should pick, and I think it depends on the application; we'll go through which one might be good in what setting.

Yeah, okay. Now we'll go to coding the torus, and show how to go from the Euclidean flow matching example from the previous part to Riemannian flow matching on the torus, and think about the parameterization of the torus and the parameterization of the vector field.

Okay. It does not give me my cursor. Okay, all right. So this is the same notebook that Kelly was showing. You can see we have this MLP. Yes, okay, excellent. So we have an MLP, and we have the same model. If you've run this on the LoG 2024 point cloud, you get distributions changing over time; that's what we saw a little while ago. Now we'll see how to transform this and use it in the Riemannian setting.

Okay, so for the torus flow matching, the first thing we're going to do is make a little change in the data set, and use some nice plotting functions.

This allows us to convert between spaces: how do you make a surface of the torus in 3D and then visualize it?

Okay, so this is the part where we start. We take the same source distribution that we had, we do some scaling and shifting to get it in the right place, and then we can visualize it; it's the same-looking LoG 2024.

But we can also visualize it on the torus in 3D. There are many different parameterizations of the torus; this one is a projection onto 2D, i.e. looking at the intrinsic coordinates in two dimensions. Think of it as: if you wrap the picture around, identifying the top and the bottom, that's what you would get.

And this is the visualization in 3D. Where did they go? Okay, I'm somehow deep inside the torus. Yes, well, there is a torus here, but it seems I can't zoom out; trust me, there's a LoG 2024 here somewhere. Okay, we will zoom out, hopefully. Alex, maybe press the home button, it should reset the view. The home-shaped button, excellent. Perfect, thank you. Okay, yes, this is an excellent torus.

So yeah, there's the "2024" on top and the "LoG" on the bottom. I put these here because the "LoG" is near the bottom at negative pi and the "2024" is near the top. There are many different choices here: all we require from flow matching is that it matches the source distribution at time zero and the target distribution at time one, so there are many flow choices we could use. In fact, we can use the Euclidean flow on the torus, i.e. Euclidean flow matching in the intrinsic parameters of the torus, and still get a valid answer. So this is the first thing we're going to do.

Okay, so I'm going to define a couple of useful things here. This is a symmetric modulus, which we'll use in a second; it's not used yet. And here's how you would do it in the Euclidean case: let's define the Euclidean log map, which is just x1 minus x0.

And then we can calculate x_t in the Euclidean setting, using the formula from the slides. Let's see. Okay, so looking at this, for Riemannian flow matching we're defining x_t as, in this case, x0 plus t times the log map from x0 to x1. So this is the correct x_t, the same thing as in the Euclidean setting, and get_ut is also the same. All we do here is reformulate the simple Euclidean version into something that looks like the log map, so that when we change the log map and the exponential map we'll get torus flow matching.
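For reference, here is a minimal sketch of those Euclidean conditional flow matching quantities written through a log map, so that swapping the log and exponential maps later gives the torus version. The names `euclidean_logmap`, `get_xt`, and `get_ut` are illustrative, not necessarily the notebook's exact code:

```python
import torch

def euclidean_logmap(x0, x1):
    # In Euclidean space the log map is just the displacement vector.
    return x1 - x0

def get_xt(x0, x1, t, logmap=euclidean_logmap):
    # Conditional flow: x_t = x0 + t * log_{x0}(x1); in Euclidean space the
    # exponential map is plain addition, so this is linear interpolation.
    return x0 + t * logmap(x0, x1)

def get_ut(xt, x1, t, logmap=euclidean_logmap):
    # Target velocity: the direction toward x1, rescaled to arrive at time 1.
    return logmap(xt, x1) / (1.0 - t)
```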

So this is exactly the same, so I'll start running this。

And the same parameterization. So what is this doing? It's basically pretending that we're in 2D: we're just projecting the torus into Euclidean space and then doing plain flow matching. Because the flow doesn't go outside of the torus, we're still moving along the torus, but we won't cross the pi to negative-pi boundary. So this is interesting: we'll get a flow, but we won't go the fastest way; we'll go around in the Euclidean sense.

So this is running. Oh wait, I'm sorry, it's not running, because I did a hacky trick where I just time.sleep so that it wouldn't die. Let's see. Yes, okay, so I'll try this again. Yes, okay, so we're running this. Yeah,

so in the beginning of this we lost the t forward, and so we couldn't run it; that was the problem. Okay, so this is running, and we'll see what Euclidean flow matching looks like on the torus if you were to do it that way. Then we can start coding the Riemannian flow matching, with the exponential map that is actually correct and follows geodesics, so that one can cross the boundary from pi to negative pi and go around the other way.

Okay, so yeah. Euler integration: it's the same thing; we'll also use the same Euler integration for this.

Yeah, so this should be done shortly, and then we'll talk about the changes you need for the torus. Okay, so for the torus, what might you need to change? I would say the log map definitely is going to need to change, and then the computation of x_t is going to change.

And I think the computation of u_t probably shouldn't need to change, and we'll see why that is in a moment. This should be done here. Okay, so yes, we have the flow now, and this plots the trajectories. We started at the LoG, and then these plots show the trajectories, but notice that the LoG, if it were going the closest way, would go downwards and around the boundary here.

And yeah, we can see this in different types of animations; I guess the interesting one is the 3D one on the torus, we'll see that in a second. The basic idea here is that flows are nice because you can choose whichever one you want, and as long as it matches the two distributions you're good to go. So is this one the easiest to learn? That's the next question,

and probably not, but yes, no worries. And this always runs slower when people are watching. Okay, so you can see two things here. In the animation we can see it moving up, and then on the torus we can see that it goes the long way around: on the inside there's nothing, but it goes around the outside of the torus here.

Okay, cool, so how do we get to Riemannian flow matching? For this we're going to use this symmetric modulus function. So what is it actually saying and why do we need it? Our x0 and x1 are values between negative pi and pi.

What we would like to do is go the correct direction around the circle. We have two points x0 and x1, and if we do what we currently do, we always go in the same direction: if this is the boundary between pi and negative pi, when we do x1 minus x0 we always go around the circle this way, so all the vectors, even if the points were, say, here and here, would go around the circle this way, and that would be a very long flow. What the symmetric modulus that we added here does

is pick the right way. Okay, so the math is a little bit complicated, but think about the case where x0 is just on one side of the pi / negative-pi boundary and x1 is just on the other side. The plain difference x1 minus x0 is then almost a full 2 pi, but after the symmetric modulus it becomes a small number with the right sign. So what this does is basically say which direction around the circle is closest and point the arrow in that closer direction; you have two choices, and this picks the right one.

So yeah, that's the first thing we need to do: this is the log map, which tells you how to get from x0 to x1 as fast as possible on the torus. And then the other part we need is to add a symmetric modulus here, which says that no matter what I do, I want to stay in the range of negative pi to pi. These should be the only two changes we need for this, and we will start training again.

And u_t doesn't need to change, right. What does u_t say? It says: take my current x_t and my current x1, which are both locations on the unit circle, do a log map to say which direction is fastest, and then say I should get there in the remaining time. So it effectively constructs this vector and then rescales it so I arrive in the time that is left, essentially one over one minus t.
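For concreteness, here is a minimal sketch of those two changes on the torus: a symmetric modulus and a log map that picks the shorter way around, reusing the same get_ut formula. The names are illustrative and assume angles stored in [-pi, pi):

```python
import math
import torch

def sym_mod(x, period=2 * math.pi):
    # Wrap values back into [-pi, pi): the "symmetric modulus" from the notebook.
    return (x + period / 2) % period - period / 2

def torus_logmap(x0, x1):
    # Shortest signed angular difference from x0 to x1 on each circle,
    # i.e. go whichever way around the boundary is closer.
    return sym_mod(x1 - x0)

def get_xt_torus(x0, x1, t):
    # Geodesic interpolant on the torus, wrapped back into [-pi, pi).
    return sym_mod(x0 + t * torus_logmap(x0, x1))

def get_ut_torus(xt, x1, t):
    # Same formula as the Euclidean case, only the log map changed.
    return torus_logmap(xt, x1) / (1.0 - t)
```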

Okay, so all this runs. Now, all we need to change in this function, since we're using these two coordinates, is to make sure that it stays within negative pi and pi. We just add this symmetric modulus here, which says: take my x_t update and wrap it around if you need to, so if you go above pi or below negative pi you just come back around the circle. Cool,

so that defines the Euler integration on the circle instead of on R^2. Everything else is the same, except we need the wrap. Okay, next; this should be finishing up.
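A minimal sketch of that Euler integration with the wrap, assuming the `sym_mod` helper above and a network with a `model(x, t)` signature (both assumptions, not the notebook's exact interface):

```python
import torch

def euler_integrate_circle(model, x, t_grid):
    # Integrate the learned vector field with simple Euler steps, wrapping each
    # update back onto [-pi, pi) so the trajectory stays on the intrinsic torus.
    xs = [x]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        with torch.no_grad():
            v = model(x, t0 * torch.ones(x.shape[0], 1))
        x = sym_mod(x + dt * v)
        xs.append(x)
    return torch.stack(xs)
```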

Okay, so it looked like this before, where we pretended it was Euclidean space, but this one should treat it like a torus and go the shorter way around. And here's the updated plot: you can see the LoG now mostly goes down and around and then back up to the top to the 2024. Yeah, here we can animate it, and then make a plot of the torus again. Yes, a question?

Yeah, that's a good question. Yes, sorry, Jacob was asking why some of the points go the long way around.

And I would say it's because for some of the points that is the closer way around, basically. If you're going from the very top of the LoG to the very bottom of the 2024, then it's closer than going around the other way, so depending on exactly where the points are placed, you go one way or the other. One interesting thing is that these paths look strange, and maybe you want shorter paths on average: this does not give you optimal transport paths, but if you did a pairing based on which points were closer overall, instead of an independent pairing, then you would probably get paths that mostly went the nearby, shorter way. I guess it depends on where they are; if you move the LoG up a little bit, it'll probably shift and half will go each way, so it mixes. Cool. Yeah,

so you have the animation, and you can see that some split up and some split down; it's kind of interesting, they do a bunch of mixing and then they all appear in the 2024 and slowly gain resolution again. Yeah. And you can see the trajectories here; they look very different: instead of going around the outside, most of them go through the inside. The 2024 here looks fairly ugly, but I think it's just because it's squished. Cool. Okay, let's see.

So that's the main thing. Now I want to talk a little bit about parameterization. There are a couple of different ways to parameterize these functions. Here we say: let's make our network output something in two dimensions, which we clip so that it matches u_t. What is it outputting? Something in the range of the logarithmic map of the unit circle, so a velocity vector around the circle: if it's positive it goes one way, if it's negative the other. It's outputting in the tangent space on the manifold, which is interesting. But you could parameterize it other ways; you could instead have the model predict x1, for instance. So instead of using this as v_t, we could say that the model outputs some predicted x1, right?

If we did that instead, this becomes target prediction, and then we can calculate v_t from it. We can use the same get_ut function, which says how fast I would need to go, give me the vector that takes me from x_t to x1 in the remaining time. So I can use that function again and say v_t should be equal to get_ut applied to x_t and the predicted x1, which takes me from x_t to x1_pred in the amount of time left, right?

Okay, so this is a very small change, but now it's very interesting, because our model is outputting something on the manifold, a point on the manifold, and then we're calculating the vector to get there. So this is a different type of parameterization, and it's interesting because oftentimes, for different kinds of manifolds, it's easier to predict something on the manifold than something in the tangent space.

This is a common trick: we have many models that are good at outputting the object directly, but not so many that are good at outputting the tangent space. In fact, for proteins on SO(3) for instance, almost all the models try to predict the nice-looking protein, the predicted x1, and then calculate the velocity to get there in the time remaining. That's just because of numerics, how you parameterize these models, and leveraging existing work that is good at predicting molecular shapes, or whatever manifold shapes you have.

Yes, so this should work just as well; now you have a v_t, and this lets you parameterize your model this way.
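A hedged sketch of that target-prediction parameterization, reusing the `get_ut_torus` helper from the sketch above; the `model(xt, t)` signature returning a point in [-pi, pi)^d is an assumption for illustration:

```python
def vt_from_x1_prediction(model, xt, t):
    # The network predicts an endpoint on the manifold instead of a velocity;
    # the velocity is then recovered with the same get_ut formula as before.
    x1_pred = model(xt, t)
    vt = get_ut_torus(xt, x1_pred, t)
    return vt, x1_pred
```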

And the final thing would be to change it so that you regress against x1 directly: instead of putting the loss on the velocities, you can instead put a loss on the x1 prediction, so between x1_pred and x1.

And oftentimes you want a scaling factor here. It turns out that in Euclidean space the right scaling factor is something like one over one minus t; in manifold space it's much less clear, oftentimes it's fairly unknown, but it's a heuristic, one of these loss-weighting things. All of them should converge, so in principle it doesn't matter, although some work much better than others in practice. So you can always do this trick where you predict x1 instead, and oftentimes it's easier to get well-behaved losses in this space than in others. For the torus it doesn't matter much, but it can for more complicated manifolds.
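A minimal sketch of that x1-regression loss; the 1/(1-t)-style weighting is the Euclidean heuristic mentioned above and should be treated as a tunable choice, not the notebook's exact loss:

```python
def x1_regression_loss(x1_pred, x1, t, eps=1e-3):
    # Compare predicted and true endpoints through the torus log map so the
    # error respects the wrap-around, with an optional 1/(1 - t) weighting.
    err = torus_logmap(x1_pred, x1)
    weight = 1.0 / (1.0 - t).clamp(min=eps)
    return (weight * err.pow(2)).mean()
```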

Cool. Yeah, so that's mostly what I wanted to say with this example. Are there any questions on this before I move on to best practices? Yeah. Okay. All right, so we did this parameterization on the torus, of the points and of the vector field. Okay,

so some ideas. Equivariance is very easy in this simple torus space. The torus can live in many dimensions: we showed a 2D plot and a 3D plot, but you could also do a 4D one, for instance, if you parameterize with unit vectors; if the network outputs two points in R^2, you get a four-dimensional representation. So there are many different extrinsic versions of the torus.

Let's see. Some parameterizations are better, I would say, for code, because you make fewer errors. Yes, you could use the rotation matrix, and in fact some do parameterize the torus that way, but oftentimes it's harder; some are easier, some are more stable, and the best one depends on your application. As for intrinsic versus extrinsic: for the torus, mostly people use the middle one or the one on the right. The middle one is nice because it's very simple, you output exactly one dimension per torus dimension, but it has this discontinuity at negative pi to pi, so some people worry that the network needs to output the same thing at pi and at negative pi, and that's oftentimes a little hard to enforce. What's nice about unit vectors is that there's no discontinuity: you go one dimension higher and everything is smooth, although you do have to output twice as many dimensions, so there are memory constraints and more code.

I guess this is one of the reasons why we often go one dimension higher: to make it smooth. You could work in the minimum dimension, but oftentimes that's not very smooth when you have a closed manifold, and in fact that's the same thing that's happening in AlphaFold2. SO(3) can be represented in many ways; quaternions are one of them, and one thing that's nice is that they're very smooth. You could also represent rotations as a thing in R^3, the axis-angle representation, but this also has a discontinuity, so in practice it's often better to represent it in one more dimension where everything is smooth than to try to minimize dimensions. Yeah, so I think it's a careful balance: SO(3) can also be represented as rotation matrices, nine elements, but that's many extra elements. So it's a balance; smooth but with few extra dimensions is probably right.

And the final thing is vector field parameterization. Most of the code that I've seen does this target prediction: the network outputs x1 and then you calculate the vector field that gets you there in the correct amount of time. This can be equivalent in many senses, especially in Euclidean space, both analytically and practically; in manifold spaces it's almost never equivalent analytically, but it often learns just as well. Yeah, so choosing: this is for whether you do an intrinsic versus extrinsic parameterization, and I think choosing the smooth one is probably best, although for simple examples either choice usually works.

Okay, so, practical conclusions. In Riemannian spaces we mostly prefer flows, because diffusions are really hard to define except in very simple settings; it's much easier to find paths than diffusions. Among manifold spaces there are the parameterizable ones, which are the ones we've mostly talked about today, and then the ones that are not parameterizable, and those are much harder, as far as we know. And finally, the parameterization really matters; I can show you an example of what doesn't work, but generally making it more Euclidean is good. Okay,

so here's an example of what doesn't work that I was trying to code. Equivalently, you should be able to do this parameterization where you output x-y and project, so I made this XY-MLP earlier; all it does is output twice the dimensions, so a point in R^2 for every dimension of the torus. You can normalize it, get the predicted theta back, and return the vector field. In theory this works, but in practice it works very poorly, and I think it's a numerics thing: when you backpropagate through the arctangent, this clip, and this division, maybe there's a problem there. You could definitely do this better, but it is a different sort of parameterization for the torus.
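For illustration, a sketch of that x-y parameterization (the one the speaker found numerically unstable): the network outputs a point in R^2 per torus dimension, which is normalized onto the unit circle and converted back to an angle. The wrapper class and the `net(x, t)` signature are assumptions:

```python
import torch

class XYParameterization(torch.nn.Module):
    def __init__(self, net):
        super().__init__()
        self.net = net  # assumed to map (x, t) -> (batch, 2 * dim)

    def forward(self, x, t):
        out = self.net(x, t)
        xy = out.view(out.shape[0], -1, 2)
        # Project onto the unit circle, then read the predicted angle back out.
        xy = xy / xy.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        theta_pred = torch.atan2(xy[..., 1], xy[..., 0])
        # Convert the predicted endpoint into a velocity, as before.
        return get_ut_torus(x, theta_pred, t)
```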

The same things apply, but when you do it, it just collapses. So the different parameterizations definitely matter a lot; while many things should work in principle, stable training is the hard part, and it's a bit of playing around to figure out what works best. I don't think anyone has a general rule for exactly how you should parameterize these things. But I guess the closest you can get to a general rule is this: making it more Euclidean, making it more like a mean squared error, is the best we can do. That's the general suggestion, but it depends on the application. Okay, so yeah,

the final thing I want to talk about is what we don't yet know how to do and what the open questions are. Okay, let's go back to this. One of the big things is: do you need equivariance? AlphaFold2 has all these equivariant pieces built in, but if you look at AlphaFold3, it in fact removes them and directly parameterizes the representations in R^n and then the atoms in their coordinate space, so they totally skip this.

Well, this is a question of whether you need equivariance at scale. I think this is still a very open problem; people have opinions either way, and it's under debate exactly when you need these types of models, especially when you scale up, because you can softly enforce equivariance in other ways. Yeah.

And this parameterization piece I think is the key to many of the geometric generative models: how do you parameterize the model in a stable way? Nobody has really analyzed the learning dynamics carefully; there are lots of tricks that people use, but nobody says why they really work, or has looked at loss landscapes or anything like that, as far as I know. So I think this is an interesting question: can we get better at figuring out, a priori, what sort of parameterization you might want to use for your specific problem?

And I guess this is getting back to the first part: when should you use these tricks versus just parameterizing in the ambient space, or embedding into an ambient space if you can, and doing it that way? It depends on whether you want this as a hard constraint or whether you want to softly enforce it, and how much you need it to hold.

And then, well, Joey's been asking this question for a while: what are we going to do after flows and diffusions? I guess this is a good question: can we get more efficient architectures? There are some problems with the current models, especially in terms of how you do inference quickly; on Riemannian manifolds in particular it's very unclear if you can do better than many integration steps. And the final thing is these non-parameterizable manifolds. I think there are very interesting ideas about data-driven manifolds, but we don't know how to handle them very efficiently other than integrating ODEs and matching them; still very useful, but not as efficient as when you have these nice parameterizations.

Cool, yeah, so that's the end of what I had. I guess we're quite early, but so it goes. Thanks. Any other questions? Yeah. Hi Alex, there is a question on the Slack, from Adris.

I can read it out loud so everyone can hear it. Adris asks: a question from the third coding section of the tutorial; for points on the cut locus of the torus, or of any generic manifold, how do you determine where these points go? Do you make any systematic choice, or does any arbitrary choice make sense?

That's a great question. I mean, in this code, we do whatever Python's modulus does. Oh, it stopped sharing my screen, so I'll share again. So in this case, we literally use Python's modulus operator, which I believe says that if you are exactly at the boundary you stay on the positive side; so anything exactly on the border goes towards the positive side. But in practice these are real numbers, so it pretty much doesn't matter what you do on the exact boundary.

But there is this issue: you parameterize the manifold, and in theory, if the model learns correctly, it should do something very smooth and Lipschitz at the border, but in practice, of course, you can look at it and this is not always the case, right? Yeah, it could be parameterized better. Maybe I can add a bit about this issue: since you have these points, like the cut locus of the torus, your conditional flow would have a discontinuity at the points of the cut locus, so in general this

violates the conditions that you need on the velocity field for everything to be well defined. But in practice, since we're using neural networks, which are smooth, typically, if you use a smooth activation function, it kind of averages out in the learning and in the smoothness of the network. So that's another point to add about discontinuities.

You could also treat this by zeroing out the velocity field around these points and smoothing out the behavior there, but in practice I don't think anyone has seen a critical effect of this, because these are typically zero-measure events, so they don't affect the outcome of the generative model much.

Yeah, he says thanks for the answer. Does anyone else have any questions? You can ask in the Q&A on Zoom, in the chat, or on the Slack; we are tracking questions there. There is a follow-up question for Heli based on the previous one: does this also apply to singularities? What do you mean by singularities?

Maybe you can write it in the Slack, since it's a telephone game otherwise, because they are listening to the talk. In the meantime, someone asked for the code and the slides; I think Joey already sent them in the tutorial Slack, but they will also be available, I guess, on your website. Yeah, so it will be publicly available. Adris is answering right now in the follow-up.

Okay, you have access to it. Okay. Can you see this, Heli? Yeah, yeah, I can see it; we can see it. So I guess these are similar issues to what happened on the torus, for example, because on the sphere that's where the hairy ball theorem comes in: you cannot have a velocity field that does not zero out at some point.

So for this specific instance of the sphere, I think you could also do something like what I suggested previously and smooth out your conditional flow around the singularity points, which in this case would be the cut locus: on the sphere that is the antipodal point of the x1 point you condition on, so you would have an issue only at the antipodal point. So then again, you could, and it's a bit artificial, modify your conditional flow so that an epsilon neighborhood of that singular point does not flow to x1; that way you would also have a nonzero density there. Because in the current setup, imagine you start from the antipodal point and just kind of sweep the whole sphere toward x1: we actually have points where the conditional flow is not defined and where the conditional density is zero.

So you could smooth it out; again, I'm not sure how much practical relevance it has, but maybe it does incur some issues. Does that answer your question? I think it does. Yeah. Great. All right. Right, so,

I think if there are no further questions, then on behalf of the organizers we would like to thank all of you, and the tutorial organizers, Joey, Heli, and Alexander, the three of you, for a nice tutorial. It was about three hours long, which we know is a lot, but there were different parts, it was self-contained, easy to follow, and also came with a lot of code, which I think is really valuable for the field.

We would also like to remind people that there is a feedback form for the tutorials, which I will send right now in the channels and here in the chat; it helps a lot, both for the tutorial organizers and the conference organizers, to know how we can improve the experience, especially this virtual experience. Yeah, thank you very much to all of you. Thanks. Thank you. Yeah,

and we will come back in a bit more than 15 minutes with the next keynote session by Zachary Ulissi, so maybe we can have a break right now for 15 or 20 minutes and come back; in Europe that is six o'clock, so in 20 minutes. Thank you very much, all of you, for attending the conference. Bye bye. Hello. No, you can't hear me, right? Okay,

sorry, I had some problems. Yeah, I was saying thank you very much all for coming; welcome, Professor Ulissi, and thank you very much for being here, we are very pleased to have you at LoG giving a keynote.

Just a quick note before the keynote: we have oral presentations and also poster presentations afterwards, so we have a long program ahead. With no further ado, I will present Professor Ulissi. Professor Ulissi is a research scientist at Meta on the FAIR chemistry team, and he is an adjunct professor of chemical engineering at CMU. He did his PhD at MIT and a postdoc at Stanford. The work that Professor Ulissi does is on AI for chemistry and climate applications, models to understand and design nanoscale interfaces, usually using machine learning. Some examples are computational methods for catalysis, machine learning models to predict material properties, and active learning methods to guide these systems. So with no further ado, I will leave the stage to Professor Ulissi; thank you very much.

Super, thanks so much for the invitation. It's also fun to be here; I know a bunch of the students who worked with me at Carnegie Mellon really appreciated the LoG conferences and the mentorship and other things for some of their projects.

Um, so, today I wanted to talk about a couple of recent data sets and efforts that we've released openly that I think might be relevant, especially for people as they start to think about downstream applications of these really cool GNNs and other graph learning techniques that have been developed over the past couple of years. Specifically, I'm going to touch on two: one is OMat24 and the other is OCx24,

and I'll get into exactly what we did there and how that might be useful for some of these energy challenges that are so important to the world today。

So just a little bit of background on what FAIR and the FAIR chemistry team are, for those who aren't super familiar with the work. FAIR is Fundamental AI Research at Meta; most of the things you've probably seen from FAIR are things like Llama 1, 2, and 3, vision models like Segment Anything, embodied AI agents inside of robots, and various language translation methods.

One nice thing about being at FAIR is that there is a general commitment to open and reproducible research, so we can work with the community, open-source as much as we can, and move super fast; it's been really fun to see how much the community has helped on some of these open science challenges. Another nice thing about being at FAIR is obviously lots of GPUs for training large models, but we also have access to a lot of idle data centers for running simulations, and that has focused a lot of our efforts on how we build the right computational data sets to help train the graph models and other fancy GNNs that we want to use for chemistry. The chemistry team is about 15 people; we mostly work on open science things like catalysis and direct air capture and other materials science problems, but more recently some of our team has also helped internally with display materials for headsets and AR/VR applications.

Okay, so really what I want to talk about today is focused on catalysis. For those of you for whom it might have been a while since your last chemistry class, a catalyst is basically something you use in order to make a chemical transformation go faster.

And this is specifically important if we want to try and electrify various processes around the world to address challenges in climate change。

The most well-known process is that you take renewable energy and charge batteries, and that can help with grid applications or light EVs, so small cars; this is super well established, and we see it in our day to day all the time. But there's a ton of other uses of chemistry and materials, other sorts of industrial applications, that we also need to find ways to decarbonize. Things like heating of buildings: I'm sitting in a house,

I think it has a natural gas system; I wish I could find ways of electrifying that. In agriculture, ammonia takes a huge amount of energy across the world to make the fertilizer that we need. Trucks are harder to electrify, it's really hard to build an electric plane, and there are consumer products like plastics and industrial things like concrete and other applications; all of these also need to be electrified. And there's not one way that we can do it.

There are a lot of chemical methods that we can use to produce these other things that are important。

One way that we might do it is to use renewable energy along with building blocks like CO2 or water, and then upcycle those into more valuable things like pure hydrogen, fuels for aviation or heavy-duty transportation, chemicals as precursors for plastics, or concrete and steel for industry.

And so I'm not going to go into a ton of detail on exactly the mechanisms and how we design the catalyst。

but I'm happy to follow afterwards if any of you have questions。

I just want to get across that a lot of these challenges that we need to work on for sustainability are fundamentally catalyst problems because these are all things that we need to know how to make。

So as one example, steam methane reforming: we want to make hydrogen, so we might take methane and water and make CO2 and hydrogen. A traditional thermal process would do that, but along with the hydrogen that you make, you also make a bunch of CO2, which is something we want to try to avoid. Now,

The equivalent electrochemical process is we might take water and electricity and split that into hydrogen gas and oxygen gas。

And this is a much cleaner process in that it just has water as an input and hydrogen and oxygen as outputs; oxygen is all around us anyway. But of course we need to find a way of actually doing this fast, and so we often think about things like platinum surfaces to help us split the water and get the hydrogen; that's what the catalyst is.

For something like sustainable aviation fuel, if any of you have heard of that: again, planes are really hard to electrify, so an alternative to trying to get batteries into planes is to use renewable electricity and precursors like carbon monoxide or CO2, along with the hydrogen we formed up here, in order to make higher hydrocarbons or alcohols, basically jet fuel. Again, this is a way of doing it more sustainably: that jet fuel can then be used inside of a plane, and the key is that it came from renewable sources and renewable electricity. So these are just a couple of applications.

Now, when you actually try to discover a new catalyst for one of these processes, it's really hard, and there are several key steps along the way. The first is that you have to be able to identify potential catalyst materials: which materials am I going to consider, are they surfaces or little nanoparticles or some porous material? There's not one right answer; people have thought about all of these over the past few decades in this field. The second is that we need to find ways of prioritizing and screening them, basically which ones are most worth our time. And finally,

once we make some predictions here, we need ways of actually testing these in lab to say。

what are the best ones?And have we actually discovered something that can be used in a real industrial process?

Now, most of the work in our group has been focused on this screening process, and I think we've made a lot of progress. I'm not going to go into a ton of detail, but I just want to highlight, for anyone who hasn't seen the work, that we've been running open competitions and leaderboards for the machine learning community to introduce heterogeneous catalysis and why these things are important. If you go to opencatalystproject.org there are live leaderboards; we see submissions every few months, and it's been super fun to see the community propose new models for the space. I'll get into the datasets a little bit, but the datasets are also really big; these are really interesting graph learning challenges.

And it's also been fun because we've had academic, industrial, and other competitors submit models. For example, Google DeepMind, Microsoft Research Asia, and Tencent AI Lab have all contributed models that ended up on the leaderboard, which has been super fun. We also run challenges at places like NeurIPS to keep the ideas fresh, and periodically have new challenges to help the models keep moving further; the last one was in 2023, we're not going to have one in 2024, but I think we'll have one in 2025.

Okay。So at the end of the day, we need data in order to train these large models。

we want models that are large and generalizable and work across many。

many different types of chemistry。Our group has released a bunch。

So OC20 and OC22 are for catalysis and inorganic materials, and ODAC is a data set of materials for direct air capture, basically taking CO2 out of the air and then trying to store it. We have a more recent data set, OMat24, on inorganic materials; I'll talk about that in more detail. All of these are done with electronic structure calculations, basically really high-fidelity physical simulations, and there are hundreds of millions of examples. This makes it a really interesting graph challenge: we don't just want graph models that work on small data sets, we want graph models that can scale to hundreds of millions of examples, and I think that has really helped push the field forward in terms of how fast and scalable these methods have to be.

These were really expensive to compute something like a billion CPU hours in total。

And all of these have been open source for commercial or non-commercial uses。Okay。So today, again。

I'm going to talk about two things。 One is Omat 24。

this data set to help us with this first task of identifying potential catalyst materials。

And then later I'll talk about OCx24, which is basically a data set combining experimental and computational catalyst discovery for that final step. I'm going to skip over the middle step, only because we've had previous talks at LoG on screening catalysts and a bunch of other public demos and lectures.

So if you have questions about this middle step, feel free to follow up afterwards。

and I'm happy to point you in the direction of more resources。😊。

So let's talk a little bit in detail about inorganic materials discovery.

So for those of you who don't have a materials background, which I would expect is probably most。

that's fine, a simplified view of the world is basically we have materials that are made 100% of zinc。

or 100% of copper; this is basically a simple little phase diagram of zinc and copper. And if we want to hypothesize a new material, like zinc-copper with a specific structural form,

and this is a graph conference so you can already see that these look sort of like graphs depending on how you define a bond inside of one of these systems。

I hypothesize a structure, I calculate the energy with my favorite physical simulation, and now this forms a convex hull. This is basically a phase diagram, and what it says is that anything up here will end up being more stable as a linear combination of pure copper and zinc-copper.

Now if I come up with a new hypothesis for a different structure。

so this is the same 50-50 zinc-copper, but with a different graph structure, or a different crystal structure, and this one is lower in energy. Now I have updated my hull to say this one is better than that one, and so I no longer need to consider the first one; it is either metastable or unstable, and I'll get back to that in a second. So this is now the best material at 50-50, and anything up here will decompose into a combination of zinc-copper and 100% copper. Okay,

if I hypothesize a better structure at a different composition, say 75% zinc and 25% copper, now this one gets used and the convex hull gets updated.

And this is basically the process of materials discovery。Now, of course。

real life is not quite this simple; this is an oversimplified view of reality. A really common counter-example in materials science is diamond versus graphite.

For any of you who have thought about jewelry, I'm sure you've heard the phrase "diamonds are forever".

That basically implies that if you have a diamond, that's the most stable thing and that will last forever。

But both of these are 100% pure carbon, the graphite in the pencil is also 100% pure carbon。

And if you do a calculation at room temperature and pressure, sort of like all around us, actually

what you find is that graphite is the more stable structure。

And so what this says is if you wait an infinite number of years。

eventually the diamonds that we have will decompose into graphite. Now, in practice,

this process is so slow that it's not something we need to worry about。

I'm not worried about my wife's diamond ring decomposing into graphite on our time scales;

the barrier to go from this to this is really large so it would take a really, really。

really long time for that to happen, but nonetheless if I waited a long enough amount of time。

eventually I would form the most stable structure which is graphite。

Now, the other key here is that in practice these phase diagrams are very dependent on the conditions that you test at, and so if you look at really high temperatures and really high pressures, like you see in the mantle of the earth,

That's actually where diamonds are formed and so at those specific conditions。

actually diamonds are the most stable structure, and so in the mantle of the earth the diamonds that are formed will last forever。

But as soon as you go to room temperature, again, bring them up to the surface。

graphite becomes the more stable structure. I'm bringing up these examples just to help you understand what it means to discover an inorganic material, because this is super relevant to catalysis. There's been so much excitement about inorganic materials discovery and graph networks that we need to keep these ideas in mind as we actually start to use these things in practice.

Okay, so bringing this all together: basically, we want to calculate formation energies, that is, how stable a material is, and we're going to use graph networks to do that.

The materials that are most stable along this hull we're going to call stable materials. Everything that is very close to this hull we're going to call metastable: maybe we can make it, maybe we can't, it's a little hard to know how fast it will decompose. Diamond I would lump into that category at room temperature and pressure. There's usually a rule of thumb of about 0.1 eV per atom that people use to say how close is close to this hull.

And then anything that is much higher in energy than that we would say is probably unstable. Again, it doesn't mean you can't make it, but it means it's unlikely and probably not going to show up in your life.

So when we propose a new material, either with some generative model or with our favorite heuristic, we can classify it based on these methods: things up here will be unstable, things close to the hull will be metastable, and if you find a new material that sits below the hull of the currently known materials, that's when you can say: okay, computationally I have discovered a new material, and maybe that's when I go and try to get my experimental friends to make it.
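To make the rule of thumb concrete, here is a toy sketch of that classification by energy above the convex hull; the 0.1 eV/atom tolerance is the heuristic from the talk, and the function name is illustrative:

```python
def classify_stability(e_above_hull_ev_per_atom, tol=0.1):
    # On or below the hull: stable; within ~0.1 eV/atom: metastable; else unstable.
    if e_above_hull_ev_per_atom <= 0.0:
        return "stable"
    if e_above_hull_ev_per_atom <= tol:
        return "metastable"
    return "unstable"
```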

And a couple of the metrics that the graph network community has used to track these things are the stability rate (if I hypothesize a bunch of new materials, what percentage are actually stable, that is, below the hull) and the rate of materials that are stable, unique, and novel, that is, interesting, new, and not hypothesized by anyone in the past.
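A minimal sketch of how such a stability rate could be computed from a list of energies above the hull (illustrative only; the benchmark's exact definition, including how uniqueness and novelty are checked, may differ):

```python
def stability_rate(e_above_hull_list, tol=0.0):
    # Fraction of proposed structures on or below the hull; raising tol to
    # ~0.1 eV/atom gives a metastability rate instead.
    if not e_above_hull_list:
        return 0.0
    hits = sum(1 for e in e_above_hull_list if e <= tol)
    return hits / len(e_above_hull_list)
```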

Okay. So let's talk a little bit about how to predict these formation energies; again, this is one of the areas where graph networks have really changed the game and are being used day to day now. We're really lucky to have physical simulations that we can use as ground truth. They're not perfect, but they are very helpful in practice. Basically, we're trying to model all the electrons inside of one of these systems: we use computational chemistry codes to solve things like the Schrodinger equation, or use things like density functional theory to approximate it and solve that instead. The really nice thing about these quantum chemistry codes is that although they're quite slow, they run on classical computers, and if you have a large enough data center you can run a lot of them in parallel.

One of the things to keep in mind is that these simulations are really, really, really slow。

Every time you want to calculate a property of your material, it's something like 100,000 CPU-seconds per calculation. If you look at most of the large generalizable GNNs used in this space today, trained on millions or hundreds of millions of structures, most of them run on the time scale of about a tenth of a GPU-second per calculation, so they are already about a million times faster. Again, I can't thank the graph learning community enough for helping with these tasks; it's been really remarkable how much progress has been made in the past few years in this space.

Okay. So there are already some nice benchmarks by others in the field on whether or not models are useful for computational materials discovery. If you go to Matbench Discovery, which is organized by the Materials Project, an open science organization at Lawrence Berkeley National Lab, they've had this challenge up for about a year or a year and a half now, and most of the top methods on the leaderboard are various types of graph networks: M3GNet, CHGNet, MACE, SevenNet, or EquiformerV2 (I'll go into that one in a minute). All of these are various types of message passing or graph networks, and each has a really cool paper associated with it. M3GNet came from Shyue Ping Ong's group at UCSD, CHGNet came from Gerbrand Ceder's group at Lawrence Berkeley National Lab, MACE is Gabor Csanyi's model from the University of Cambridge, SevenNet is a recent modification of NequIP from a well-known group in Korea, and Orb is a model developed by Orbital Materials, a startup in the space. So it's really cool to see a lot of recent progress.

One of the metrics here is the mean absolute error for calculating the formation energy, the MAE。

The units are electron volts per atom, and 0.026 is pretty good already. If we go back and look at one of these hulls, I said that if you want to classify whether something is metastable or not, you need to be within about 0.1 eV per atom. 0.1 is significantly larger than 0.036, so already we're seeing a lot of progress;

these models are getting pretty helpful for day to day discovery。

All of these models listed here are trained on the same data set, the same set of about 1.6 million structures that came from about 150,000 simulations, where we were doing little local relaxations. And all of this training data came from the Materials Project; again, a really well known and very helpful open science effort in the materials community that's been running for over a decade now.

Another thing to keep in mind with all of these is that when you're training on millions of structures, it's not surprising that the graph networks have also gotten quite large.

If you look at some of the models from a few years ago, they were reasonably small, hundreds of thousands of parameters in these GNNs. But as you go to larger and larger systems,

larger and larger data sets, again, not surprisingly。

the models themselves have also gotten larger and deeper and more expressive。

and so it's not uncommon to see tens of millions of parameters now just to give you a sense of scale in these sort of systems。

Now, of course, all of these are trained on the same training data set。

and so another thing that you could ask is what if we just add more training data?

So I'm going to go into that in a second。Let's talk a little bit about what this model is。

So "EqV2 Small DeNS" is basically a specific architecture, EquiformerV2, and DeNS is a denoising strategy that we use for augmentation.

So we actually didn't come up with this specific architecture。

EquiformerV2 was developed in collaboration with Tess Smidt's group at MIT; she's super well known in the equivariant GNN space.

I don't want to go through all the details of this network basically I just want to get across the input is a 3D graph。

the output is something like the energy or the force, the force is a vector。

There's usually a bunch of message passing steps in these systems where basically all the atoms are talking to all of their nearby neighbors and we repeat that message passing step a few times。

And then at the end we have an output head that takes these learned representations from the GNN and tries to predict the final properties of interest.

There's a lot of work that's gone into how to make these things more scalable and more efficient;

for example, there are different convolution strategies: eSCN was a model from the FAIR chemistry team that came up with a better, faster, more scalable convolution strategy, and that is used inside of EquiformerV2.

A lot of this work was done by Yi-Lun Liao; he was an intern on the FAIR chemistry team over the past couple of summers, and V2 is basically taking the original Equiformer model from Tess's group and scaling it up to work on hundreds of millions of examples. Now, the other thing I mentioned was this denoising strategy for augmentation, and it comes from a limitation of most of the data sets that have been available: in the inorganic community, most of these simulations are local relaxations,

local optimizations of a structure: you start with a hypothetical structure, you calculate the forces, and you move downhill to try to find a lower-energy structure. So the data themselves are usually not incredibly diverse; a lot of the structures along the way are very similar.

So we've been thinking about how to help in this case. One opportunity is various types of denoising or data augmentation strategies; one of them is called DeNS. This is up on arXiv; it was basically an intern project from Yi-Lun, not this past summer but the summer before. Given a structure, you modify it, and then the goal of the model in this augmentation scheme is: please tell me the original structure

that had this specific set of forces that I know to be true. This augmentation gives an auxiliary task that allows us to train on other points nearby while still using the same original training data; it's basically an augmentation strategy. You can do it for a relaxed point too: if the structure is mostly relaxed, I can bump it quite a bit, and then the goal of the model is: please tell me the structure nearby that has a very, very small force, and hopefully it will map back to the original structure.
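To illustrate the idea (not the exact DeNS recipe), here is a sketch of a denoising-style augmentation: rattle the atomic positions and train the model to recover the original structure; `noise_scale` is an arbitrary illustrative value:

```python
import torch

def denoising_augmentation(positions, noise_scale=0.05):
    # Perturb atomic positions; the auxiliary task is to predict the
    # displacement that undoes the perturbation (i.e. recover the original,
    # low-force structure from the rattled one).
    noise = noise_scale * torch.randn_like(positions)
    noisy_positions = positions + noise
    target_displacement = -noise
    return noisy_positions, target_displacement
```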

Another thing we could do is say: well, maybe this is just a problem with the underlying data set; why don't we just try to build data sets that are far from equilibrium? If diversity is good, why don't we just go ahead and make data sets that are more diverse?

And that's specifically what we did with this OMat24 data set. OMat24 is massive: it's more than 100 million non-equilibrium structures, and it's useful across many different application areas like catalysis, optics, and electronics; inorganic materials are used in many, many different areas of materials science. These were structures made by taking known stable structures and then using really high-temperature sampling strategies to get really far from the original equilibrium;

we also ran some short-timescale molecular dynamics at 1000 Kelvin or 3000 Kelvin, really, really hot, or we rattled the atoms a bunch, in order to come up with other hypothetical structures that we think are going to be useful for training the GNNs we care about.

This data again is all open source; it's hosted on Hugging Face, and the models trained on the data are also available on Hugging Face.

If you use this larger data set, you can train much larger GNNs, and because of the additional training data you can do quite a bit better. If you just trained on the original MPtrj data set of 1.6 million structures, I said you could get an MAE of about 0.036 electron volts per atom.

If you train a very similar model that's larger, 90 million parameters, because we now have more data (the total data set is about 100 million structures), you can get that down to about 0.02 electron volts per atom, which is starting to get really close to the underlying errors associated with DFT.

That is, the models are so good that they are approximately as accurate as the noise in the underlying training data itself, which is really amazing. These models are openly available now; anyone in the world can go ahead and use them, and now that this data set is open, I expect we'll see more models beyond EquiformerV2 as others in the community try to scale these things even further.

Okay, another direction we can go is to say: well, maybe we're limited by hypothetical materials; why don't we work with generative AI to propose new structures? One example is motivated by Meta's Movie Gen, which was released earlier this fall, basically a state-of-the-art movie and audio generation system.

Our team has been working with some of the people behind Movie Gen to apply various types of models to materials discovery, and these are also areas where a lot of others in the community have made really interesting progress. One approach that I'll talk about in a little more detail is that we can fine-tune language models; that's a reasonable strategy. We can implement flow matching methods for these graph systems (that's what this GIF is showing), or we can try to combine the best of both worlds; I'll talk a little bit about that FlowLLM model, which is already up on arXiv and will be at NeurIPS in just a couple of weeks in December.

And the cool connection here is that it's been really fun to see some of the ideas developed at Meta for things like Movie Gen carry over: specifically, the audio generation system in Movie Gen is a flow matching method developed by others in FAIR, like Ricky Chen and Yaron Lipman and others in the community. We can take these same flow matching methods that were originally developed for audio generation and apply them to materials, so it's been super fun to see this back and forth between the different communities.

So let's talk a little bit about what some of these structure generation systems might look like.

On the left is one system that we worked on, inspired by some work from Alan Aspuru-Guzik's group at the University of Toronto. Nate Gruver did this; Nate is a PhD student at NYU, and he was interning with us, not last summer but the summer before. Basically, we took Llama 2 models and fine-tuned them to generate these structures by saying: please give me the XYZ coordinates and the shape of the system, the lattice cell. The really fun thing was that at the time we were able to show that these language models were competitive with some of the models that were available for structure generation, like CDVAE. The source code for this is up on GitHub, and the paper appeared at ICLR this year.

Anuroop Sriram and others in our group, and Ben Kurt Miller, who also recently joined, worked on another system, taking flow matching methods, like I said, from FAIR and applying them to materials generation. They were able to show that they could do even better than a model that had come out in the intervening time called DiffCSP. Basically, what this model is doing is deforming the structure to find an even better crystal structure given a hypothesis,

which is the initial one. If you look at the original structure (most of you probably haven't done enough materials science to tell), the original ones are not very realistic,

but by the end they're starting to look fairly ordered and it looks pretty close to maybe an oxide system。

these little red atoms are probably oxygen atoms and these blue ones are probably some sort of metal。

One of the really interesting observations that we saw was also that this chemistry language model seemed to work really well for telling us what types of materials to make。

and this flow matching method seemed to do a really good job of putting the atoms in the right place。

And so a more recent model, which we call FlowLLM, basically combines these two for the best of both worlds:

we basically use a language model to tell us what sort of structures we should be trying to make。

like what the composition is, or maybe an initial guess at the crystal structure. Then, once we have an initial guess, we use flow matching, with the Riemannian flow matching method that was talked about previously, to come up with a denoised structure that is even more stable and better than the original initial guess. Again, this is up on arXiv now; I think the source code is also up or will be shortly, and it will be presented at NeurIPS in a couple of weeks.

Okay. And again, one of the really fun things about this area right now is that there's not one right method for generating graphs for this system.

These ideas and strategies are synergistic and that means that we're going to see probably a lot of progress in the next couple of years as people mix and match these different methods。

So if you take this crystal generation LLM-based method and the flow matching one and combine them, you get a method that is way better at predicting stable structures than either of these alone. To me this is an opportunity for methodological improvements in the space: I think it says the field is wide open, and it also says that a lot of the work going into other methods for generative AI in language, image, and video is probably going to be applicable to graph generation as well.

Okay, so I want to switch gears a little bit and talk a little bit about how we make data sets and test materials by catalysts in a lab。

And specifically, we have this OCx24 data set that we recently released, which I think will help the community make some progress. So OCx24 is basically Open Catalyst Experiments 2024 — the x stands for experimental;

it's a combined computational experimental data set。

that we think is going to be useful to help us make AI-driven discovery in catalysis even more meaningful.

And I would say the data set and the methods that we release as part of this paper are relatively simple。

And so I think there's going to be a lot of low-hanging fruit, and progress from the external community in making these even better.

And I hope that someone on this call comes up with an even better strategy than what we came up with。

Okay, so in summary: basically, what this data set is, is a computational data set of little molecules sitting on surfaces that we think are correlated with experimental catalysis. They were calculated for about 19,000 stable or metastable materials, and I'll talk about where we got those from. With each of these materials, we formed many, many different hypothetical catalyst surfaces and placed these little adsorbates on them. And so in total, we used some of the state-of-the-art graph neural networks that we developed as part of the Open Catalyst project to run hundreds of millions of simulations, then ran DFT calculations on the best of those; and that forms the computational part of this data set.

So these are things that we can calculate with simulations, things that we think are going to be relevant to the final experimental testing.

On the experimental side, I'm going to go through in a little bit more detail。

basically we have two different synthesis techniques,

Lots of characterization to help us identify what materials we made in the testing platform so that we can actually say。

what is this catalyst doing?So this data set was done for two different types of chemical reactions。

Both of which I touched on at the beginning of this presentation, so one is hydrogen evolution。

given water and electricity split it to make hydrogen gas and oxygen gas, this is useful for making。

again, hydrogen as an energy source for a hydrogen fuel cell.

And the second is the CO2 reduction reaction, CO2 plus hydrogen plus electricity。

Makes fuels or hydrocarbons or plastic precursors。

So, how do we know which material is actually going to be stable as a catalyst? If we go to an open database like the Materials Project, there are about 150,000 structures, and about 60,000 of those have actually been synthesized experimentally.

Most of those are not actually going to be stable when you actually put them in water and you apply a potential and try to make this thing into a real catalyst。

And so if you screen for stability, basically you find that even though we know about 150,000 materials, under reaction conditions, when this thing is actually happening, only about 6,000 will stick around for a long time; the other 144,000 will probably decompose under those conditions. So 6,000 is a good starting point,

but we also wanted to try and expand this number a little bit we wanted even more possibilities。

And so in addition to the Materials Project, we also considered two other databases, Alexandria and OQMD. And in total, if you take all three of these together and you screen for the ones that are going to be stable, again under these conditions,

you find about 19,000 materials that are probably going to be okay to actually test, if you can synthesize them and get them into a real catalytic reactor.
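As a rough illustration of that screening step — the entries, field names, and threshold below are hypothetical placeholders, not the exact pipeline:

```python
# Hedged sketch of a stability screen over candidate materials pulled from
# Materials Project / Alexandria / OQMD (loading step omitted; fields made up).

def screen_for_stability(entries, max_e_above_hull=0.1):
    """Keep materials close enough to the convex hull to plausibly survive
    under reaction conditions (in water, under an applied potential)."""
    stable = []
    for entry in entries:
        if entry["e_above_hull"] <= max_e_above_hull and entry["stable_in_water"]:
            stable.append(entry["material_id"])
    return stable

entries = [
    {"material_id": "mp-0001", "e_above_hull": 0.02, "stable_in_water": True},
    {"material_id": "mp-0002", "e_above_hull": 0.35, "stable_in_water": False},
]
print(screen_for_stability(entries))  # -> ['mp-0001']
```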

Okay. So what do we want to do? The dream would be basically: we have 19,000 materials; we order a whole bunch of those, make a whole bunch, test them all, and then release an experimental data set. Unfortunately,

real life in real catalysis is just quite a bit more complicated。So what do we actually have to do?

Um, so first, we had to try and figure out which of the 19,000 we actually wanted to make,

either with some screening strategy or some sampling strategy ahead of time。Next。

We had to go and work with a bunch of experimental partners to try and find synthesis techniques to actually make the materials that we care about。

Then we had to find a way to get those things that we made, in nanoparticle form, to actually stick to the electrode and the membrane so that we could use them in a catalytic reactor.

And often we had to repeat this process a few times, since not every way of making the materials is compatible with how we were getting them to actually stick to the electrode.

And finally, even once you get them on the electrode and you run the reaction。

you have to characterize the materials to make sure that you actually made what you wanted to make。

so if I came up with a hypothetical structure, I don't want to test something random。

I actually want to test the structure so that I know that I can match computational experimental data at the same time。

Often I don't actually make。What I was hoping to make。

and so I might have to go back and tune or change or come up with a new synthesis strategy and then repeat this whole process again。

And finally, on the timescale of years, we were able to make a couple hundred examples of these materials; these couple hundred are the ones that were released in OCx24 and the ones that were actually tested at scale.

So what this process looks like is basically there's two strategies。

chemical reduction and spark ablation. Chemical reduction is a technique that was done at the University of Toronto; spark ablation is a method from a startup called VSParticle. Again, we had characterization and lots of these electrolyzers in parallel to basically try and test these things. This is a picture of a chemical reduction robot, a Chemspeed system; this is a VSParticle machine that's doing the spark ablation to make these arbitrary nanoparticles in this little film form. And finally, this is a picture of the testing apparatus that basically says: how good are these at actually making a real catalyst and turning CO2 into something more valuable?

And the sort of data that we get out looks something like this。

so at three different applied voltages, different conditions。

how much energy are we putting into the system。These catalysts will make varying amounts of hydrogen or methane or CO or ethylene or other chemicals。

So for example, for this copper nanoparticle system, at low applied voltages it makes mostly CO, and at very high applied potentials it makes mostly hydrogen. And depending on the synthesis strategy, we get qualitatively similar results, but not quantitatively the same: so for example, between chemical reduction and spark ablation you get some small differences in how much CO you make at intermediate current densities.

So this is really just to give you an idea of what the data looks like, what the outputs are, what we are trying to fit; and these are all things that we want to correlate with the graph networks and their predictions on the catalyst side. Okay. In total, we made about 500 materials.

200 of those were clean enough, after deduplication, for testing — an order of magnitude or two larger than any other data set that we're aware of for these sorts of electrochemical conditions at high current density that has been released openly.

You can see that some areas are much denser than others。

so we have a lot of zinc and silver and copper and gold systems; some areas are harder — there are not that many stable materials containing gallium or indium, so these are a bit less common. That's just to give you an idea of what the distribution of the data set looks like.

And one really cool example, again, just going back to what we're using these graph networks for。

I want to highlight some of the results for the hydrogen evolution reaction. I've said this a couple of times, but basically the hydrogen evolution reaction, HER, is water plus electricity forming hydrogen gas and oxygen gas. And if you just take really cheap but simple descriptors based on the composition of the catalyst — like what is the atomic mass of the elements, and the radii, and the electronegativity — you can fit the experimental data (on the y axis, the predicted voltage, against the experimental voltage) and you get a reasonable parity plot, with an MAE of about 0.09. I would say it's not the most beautiful linear regression and set of residuals that I've ever seen, but nonetheless the R squared is about 0.5, which is somewhat reasonable in this space.
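A minimal sketch of that kind of descriptor-based fit — the descriptor values and voltages below are made up (with so few points the toy fit is nearly exact; the talk's real-data numbers are MAE ≈ 0.09 and R² ≈ 0.5):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical composition descriptors per catalyst:
# [mean atomic mass, mean atomic radius, mean electronegativity]
X = np.array([
    [65.4, 1.42, 1.65],
    [107.9, 1.65, 1.93],
    [63.5, 1.45, 1.90],
    [197.0, 1.74, 2.54],
])
y = np.array([-0.45, -0.62, -0.51, -0.70])  # measured voltages (made up)

model = LinearRegression().fit(X, y)        # same linear model as in the talk
pred = model.predict(X)
print(mean_absolute_error(y, pred), r2_score(y, pred))
```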

If you then go and predict all the other materials that are hypothetically possible。

you see that there's a whole bunch of materials that are predicted to be way better than the ones that we tested in gray。

And specifically, the material that is most well known for this reaction, platinum — which was not included in the data set — shows up as a star;

basically what this model is saying is there should be a whole bunch of materials that are better than platinum。

Which is hypothetically exciting. Now, what was really exciting to us is that if, instead of using these really simple composition features, we used features that came from these graph networks trained on the Open Catalyst project — things like hydrogen or OH adsorption energies — feeding into the same linear model, we get a fit that looks qualitatively very similar: the parity plot is very similar, the R squared is also about 0.5 to 0.6, and the MAE is about the same.

But when it comes time to make predictions, this is a much more physically meaningful and realistic graph: it has the characteristic volcano shape that shows up in the catalysis literature. Basically what it says is that there's a sweet spot here where catalysts like platinum happen to lie. And this platinum catalyst is, in this model, predicted to be one of the best known materials,

and this actually matches what happens in real life。

There's a couple of other materials here that we think are a little bit better that we're trying to test。

But we're not making predictions that there are going to be ones that are way better; and if you actually tested most of the materials that came out of the original model, I think you would find that almost all of them are actually not better than platinum in practice.

And so this is basically a long-winded way of saying that we were really happy that some of the models and GNNs developed by the academic community for the Open Catalyst project seemed to have better extrapolation and more physically meaningful outcomes than if we had just used really naive descriptors at the beginning of this project.

Okay, so in summary, I hope I was able to convey that this is a super exciting time to be at the interface of AI and chemistry and materials science. The graph learning community has just made a ton of progress in the past few years; some of these models are really helpful now, and we're starting to think more about downstream applications and all the other things that have to go right for materials discovery in this space.

This has really been a community effort, it's not any one group;

it's been this sort of competition and back and forth and open leader boards that have driven a lot of this progress。

which has been really fun to see。Another fun thing has been。

Models — sorry, models that were developed for the Open Catalyst Project leaderboard also seem to be correlated with models that are state of the art for other areas, like inorganic materials discovery; and so I showed that basically EquiformerV2, a model that is currently state of the art for catalysts, can also be made state of the art for inorganic materials. We've released a new data set for the community, OMat24, that I think will drive a lot of progress in the next year or two to come up with even better models.

And finally, I touched on some emerging work on high throughput experimental data sets。

These are hard: finding materials and testing them across many, many different types of elements and compositions is really complicated. We've released the first data set from our group in this space, OCx24, and I expect there will be others from the community.

I think we're really at the beginning of this journey and I think there's a lot of opportunity for graph networks to help in this area as well either by helping with the original sort of synthetic simulation data sets and making those more accurate or perhaps directly helping us predict experimental and computational results in making the bridge together。

Finally, I just want to say that everything I talked about today was 100% a team effort. It's been awesome to see all the collaborators work together to enable these new data sets and systems. We have a large group at FAIR; we also have really close collaborators at Carnegie Mellon, like John Kitchin, and collaborators at the University of Toronto, like Ted Sargent and others. And finally, I didn't touch much on the direct air capture work, but I mentioned it — that's mostly been in collaboration with Georgia Tech. We've also had some recent interns here who have contributed to a lot of these projects. So with that, I'll stop there, happy to take any questions; and if I don't address anything and you have questions afterwards, feel free to shoot me an email or follow up on our GitHub page and post issues if you see anything interesting. Thanks so much.

Thanks so much. That was very insightful, very exciting. I've learned a lot, and I think you have conveyed very well how all these graph-based methods help to speed up the process and how fast they are compared to other computational methods. So I think it's very, very interesting and very insightful. If anyone wants to ask questions,

we have the chat open; I think there is one question on Slack as well.

Raphael Artello asks you a question: would it be interesting to predict the parasitic processes within materials discovery? — Yeah,

I think what you're asking, but correct me if I'm wrong is basically。

If you're trying to make these things, you might make a bunch of other things by accident that may be parasitic in that you're trying to make something。

but there are some other processes that you don't want to happen. And that really gets to: real life is often not about the thermodynamics — which is the really simplified view of inorganic materials that I was talking about before; I'm just going back here — but actually about the kinetics of these different processes that are all happening at the same time, and often the one that's fastest wins. So the reason I, and the rest of the field, have focused on the thermodynamic view of the world — just what is most stable — is that it is way simpler, and it was easier to make models and data sets for those challenges. But

now that we have these, we need to go the next step and think about these kinetic processes in parallel that you're talking about, and I think that's where the community is heading.

So using the graph networks that have been developed to say what is the rate of reaction to go from A to B to C and what are all the other processes that's exactly what you're going to have to do to get models that fit the experimental data even better than the current ones so we're just at the beginning stages of that。

One of the reasons there's been so little work in that area is just that these simulations were so expensive and so slow and so annoying to converge in the past; we just weren't asking the more physically meaningful questions that we were hoping to. But now that we have these GNNs, I think we can really move very fast. — That's a great answer.

He says thanks as well. Yeah, also Chaitanya had a question — that's also a question that came to my mind when you were showing the table about the data sets and architectures: how responsible are the data sets versus the architectures for improving performance in the field? So the exact question is: how do you weight the data sets or architectures in terms of importance in pushing the frontiers in materials discovery?

Yeah, that's a great question. So the short answer is: I think all of these have to happen in parallel, and they have happened in parallel. What I mean by that is, if we had gone back five years and said, okay, the only thing we're going to do is methods development for GNNs,

I think we would have made progress, but we wouldn't necessarily have come up with architectures that were going to scale to hundreds of millions of examples. And so the ones that are the best right now on these really large leaderboards are also the ones that are simultaneously expressive and fast at scale, and that's important in practice. On the flip side,

if we had only done data sets and we had never done any work on the GNN side。

the original GNNs five years ago in the materials space, like CGCNN, were so cool to see and way ahead of their time in terms of how they were thinking about the world, but they were clearly not expressive enough to really get at these sorts of kinetic questions — if you move atoms a little bit, what are the properties going to be?

And so I think, again, it has to be both at the same time。

There's a limit to how far we can push the data generation side。

so a billion hours of compute for all these different data sets and hundreds of millions of examples。

We can continue to do that for a while, but like realistically we're not going to get 100 times more data here。

there's sort of like an upper limit on what's possible。

so we either have to get smarter about how we generate the data into choose new structures。

Maybe that'll be more active learning or other methods of。

Selecting points that are the most meaningful or it could be higher fidelity simulations。

These are approximations of reality; they aren't perfect.

Probably there will be smaller data sets of really, really accurate calculations that will come next, so there's still a lot of work on the data set side as well. And I also just want to convey: if you look at the Open Catalyst project leaderboard,

we've seen a steady stream of new model architectures every few months, and I think there's going to be another one that'll show up pretty soon on the leaderboard.

I hope that that continues for a while every time another one comes from either us the community it gets me really excited I don't know what's going to happen next year in all honesty like obviously we're trying some random experiments to see other cool things that we think might be interesting but again a lot of the advances have come from the academic community and we've been surprised so I really hope someone on this call comes up with the next architecture next year。

Yeah, definitely, definitely. So before we finish, Sam asks: what do you see as the biggest challenge facing the field over the next few years? — Yeah,

I think there are so many things that have to happen here. I would say maybe the most important is really on the experimental data set side. Going through all this work, we came up with systems that were, I would say, quite industrially relevant: the way that we were testing these catalyst materials was maybe not exactly the thing that you would use at full scale in industry, but they were pretty close to the current densities that you would see; they were pretty meaningful.

We did hundreds of examples here, and I showed you。

These are definitely not perfect models that have been developed so far; there's a lot of scatter here.

And so I think there's a ton of room to make these data sets even bigger。

And that's going to be important if we want more data driven methods in the actual experimental discovery process。

And so that's, I would say, not wildly creative on my part — there are a ton of people in the community who are pushing on this. It could be high-throughput systems,

but if you do high throughput you also need to make sure it's still industrially relevant。

it could be automated robotic systems to basically do these things faster。

it could be other ways of testing, probably it's going to be a combination of all the above。

But I think it's going to be really important to make these experimental discovery processes statistically significant. If you look at the way most of it has been done in the past, it's been individual papers and individual compositions, a few at a time, and it's really hard to make progress with data-driven methods when that's the state of the art. So these are all hard problems.

To be honest, it's not clear if it's going to happen in one or two years, or two to five years, or five to ten years,

but that's clearly a direction the field has to go。Yeah, thank you very much。

I think we are running out of time, so I would like to thank you again。

I think you transmitted all the excitement about the field; it has been a very, very insightful talk. So thanks, Professor Ulissi.

Thanks so much for the invitation. — Thanks, thanks, our pleasure. And we are going to move to the orals, so I will pass the mic to Yanang.

Hey, yeah guys, yes, yes, I'm here. We're just trying to make Steve a co-host as well. So, all right, I guess we'll just begin because we are a bit behind schedule. All right guys, welcome — this is the second oral presentation. My name is Ymba, I'm from Cornell University, and I'll be today's host for the presentation session. With that, let's welcome our first speaker, Christopher Blöcker, who is currently a postdoc at the University of Würzburg in Germany; he will be presenting Comparing Hierarchical Network Partitions Based on Relative Entropy.

So Christopher, are you here? — Have you made him co-host? Yeah, I just did. — All right, yes, I'm here, and now I can also see it. Sorry. — Yeah, whenever you're ready. Yeah,

just I'm currently at the local meetup in Germany and in case you hear some noise in the background then you know what's going on。

It's a poster session going on at the moment and some band is rehearsing for a little bit of music later today。

But okay, let's go with the presentation. I am going to talk about comparing hierarchical network partitions based on relative entropy, which is joint work with Ingo Scholtes. So, just to start with the too-long-didn't-listen version:

So we can partition networks in many different ways, which means that we often compare partitions。

And the problem that we see with common partition similarity scores is that they ignore link patterns。

Me like the Jakard index and variance based on mutual information。

So what we do is we propose a measure that we call flow divergence。

which is a link away our partition similarity score and based on the combination of describing random walks and the ideas behind the Ko pipeline divergence。

So, comparing partitions. Let's say we have some items, and they are partitioned in a certain way — let's say this is our reference partition — and now we have another way of partitioning them, let's say like this. Then we can take, for example, the Jaccard index to compute how similar those two partitions of the same elements are. Basically what happens here is that we look at all the elements that are blue and check how many of those agree — in this case, two out of four blue elements agree between the two partitions.

And to compare those, we go through the colors and check, between partitions A and B, for each color, how well they agree; and we might have to search through all the different groups, because the colors might be flipped somehow in one of the partitions. Here, the Jaccard index would tell us that the agreement between B and A is 1/2.

And we can also come up with other ways of partitioning the same elements where the Jaccard index will tell us that the agreement is the same — so these two partitions, C and D, have the same agreement with A as B has. And similar things would happen for mutual-information-based measures, essentially because they are considering set overlaps.

essentially because they are considering set overlaps。

But we're interested in comparing network partitions. So if I take these partitions from before and draw them as a network, interpreting what we have as communities in this network, now we would commonly just use the same measure — again the Jaccard index — to compare: how well does B align with this reference A, how well does C align, how well does D align? And the Jaccard index would tell us that, well, they all agree to the same extent. And then

You should ask the question, well, but what about the links?

Because if we consider these partitions as community structures。

then there should be some difference between C and D. For B and C, we can see, okay, there is a lot of symmetry, so they are probably the same; but there should be a distinction between D and the other ones.

So our idea is here to combine ideas behind the KL divergence and random walks on networks。

So the KL divergence — I'm sure that many of you have heard what it is and know it,

but just to recap: let's say we have a set X of some letters — A, B, C, D, E — and they appear according to P. And let's say now we have a sequence of those letters coming from some source, and we want to describe the sequence. What we can do is design a Huffman code

based on the frequencies and use one code word per symbol. How many bits do we need per symbol, in theory? Well, that is given by the entropy of this distribution P — in this case, 1.96 bits. Now,

let's say we have an estimate of P that is Q, perhaps based on the frequency of the symbols that we have observed。

And assuming that these are the two frequencies with which the symbols appear。

we would design a different code to encode the sequences。

We would then encode it like this and would get an entropy of 2.17. What the KL divergence does is tell us: if we use a code that is optimized for Q to describe symbols that are distributed according to P, how many bits extra are we going to pay? What we pay is the entropy of P plus the KL divergence between P and Q — 0.15 bits — and that is different from the difference between the two entropies. Okay, so the KL divergence tells us how much we pay extra for using the wrong code, so to say, by assuming a different distribution than the correct one.
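As a quick worked example of this recap, with made-up distributions P and Q (not the exact ones from the slide):

```python
import numpy as np

def entropy(p):
    """Optimal bits per symbol when encoding symbols distributed as p."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """Extra bits per symbol paid for using a code optimized for q
    on symbols actually distributed as p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

p = [0.4, 0.3, 0.15, 0.1, 0.05]  # true symbol frequencies (hypothetical)
q = [0.2, 0.2, 0.2, 0.2, 0.2]    # estimated frequencies used for the code

print(entropy(p))            # cost with the right code
print(entropy(p) + kl(p, q)) # cost actually paid with the Q-optimized code
```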

Now, let's take a look at random walks. We have a graph, we take a random walk on it,

let's take a look at random walks。 Now we just, we have a graph。 We take a random walk on it。

and now it's this random walk that generates a sequence of symbols for us。

and these symbols are now the nodes。We can describe the sequence of nodes in a similar way by assigning code words to the nodes and then using one codeword first step of the random walker to say where the random walker is。

The description length for that in this case where we don't have communities is the entropy over the node of visit rates。

so we actually don't need to simulate a random walk or assign code words that is just for illustration what we are actually interested in are the visit rates for the notes。

Which in undirected graphs, we can compute them directly。

otherwise we might compute them with the power iteration。

perhaps using Pat or other means of calculating a stationary distribution。
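A minimal sketch of computing those visit rates by power iteration (the iteration count is illustrative, and for directed graphs one would typically add PageRank-style teleportation to guarantee convergence):

```python
import numpy as np

def visit_rates(A, n_iter=1000):
    """Stationary visit rates of a random walk on adjacency matrix A.
    For undirected graphs they are simply deg(u) / (2 * num_edges)."""
    T = A / A.sum(axis=1, keepdims=True)   # row-stochastic transitions t_uv
    p = np.full(A.shape[0], 1 / A.shape[0])
    for _ in range(n_iter):                # power iteration
        p = p @ T
    return p
```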

And if we then combine these two equations, one and two, we can essentially rewrite the entropy in a shape that looks like a random walk: we have the random walker at a node u with probability p_u, and it takes steps from node u to node v with a certain probability t_uv, and the number of bits that we need to describe this step is -log2 of p_v, if we don't have communities.

But we are actually interested in communities, because we want to compare network partitions. So this cost in bits, that -log2 of p_v, we're going to replace by something that we call here s(M, u, v): s is simply a function that tells us, given a partition M, stepping from u to v, the rate at which the random walker does this, and -log2 of that gives us the number of bits that we need. And we're going to use the map equation, which is an information-theoretic objective function for community detection, to obtain such an s that tells us how much those steps cost. So let's get back to this network. Now,

if we look at it, we see there are three communities。

What we can do then, to obtain a more efficient description of a random walk, is to assign code words that are unique within communities. But now we also need to add module boundaries — community boundaries — with specific code words to enter and exit the communities, so that the description becomes uniquely decodable and we can communicate which community we are in; and when I say I visit the node that has the name 00, we know by the context of the community we're in which one I mean.

And if we then take the same walk again and encode the same sequence, we would do it with these code words, where the colors indicate from which module the code words come. And

the code length, so the number of bits that we need per step, is given by the map equation, which is shown here on the right. Essentially it's a weighted combination of entropies: at the front we have the entropy of switching between modules, where Q is the rate at which we need to use those code words; and then this sum contains one entropy term for each of the modules, with a weighting factor that corresponds to the fraction of time that the random walker spends in that module. If we were interested in detecting communities, we would minimize this by searching through all possible assignments of nodes to communities and selecting the one that gives us the shortest description.
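For concreteness, here is a minimal sketch of the two-level map equation codelength, assuming the node visit rates and module exit rates are already given (in the talk they come from the stationary random walk; the numbers below are made up):

```python
import numpy as np

def H(rates):
    """Entropy (bits) of a set of rates, normalized to a distribution."""
    r = np.asarray([x for x in rates if x > 0], dtype=float)
    r = r / r.sum()
    return -np.sum(r * np.log2(r))

def map_equation(p, q, modules):
    """Two-level map equation: L(M) = q_total * H(exit rates)
    + sum over modules m of p_m * H(exit rate and visit rates in m).
    p: node visit rates, q: per-module exit rates,
    modules: list of node-index lists, one per module."""
    q_total = sum(q)
    index_term = q_total * H(q) if q_total > 0 else 0.0
    module_term = 0.0
    for m, nodes in enumerate(modules):
        rates = np.concatenate(([q[m]], p[nodes]))  # exit + in-module visit rates
        module_term += rates.sum() * H(rates)
    return index_term + module_term

p = np.array([0.1, 0.15, 0.1, 0.2, 0.25, 0.2])  # visit rates (made up)
q = [0.05, 0.08]                                # module exit rates (made up)
print(map_equation(p, q, [[0, 1, 2], [3, 4, 5]]))
```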

One nice property of the map equation is that we can actually rewrite it so that it looks like the random walk description that we saw on the previous slides: we have a random walker at a node u, it takes a transition from u to v, and the cost — the rate at which the random walker does this — is now given by a function s, which I'll show in a moment; it says, based on these modules, what is the rate of the random walker, given these modules, stepping from u to v. Again, we don't need to simulate an actual random walk.

And now that we have these communities, we can visualize the coding structure of the network as a tree annotated with code words. But again, we don't actually need code words; we are interested in the stationary node visit rates and the transition rates between modules. The function s basically looks at this coding structure and says: okay, given this coding structure — these transition and visit rates for the random walker, normalized by module — what is the rate at which the random walker goes, let's say, from the green module to node 1, or is in the green module and goes to node 10, and so on.

Now, the partition dissimilarity score that we propose — we call it flow divergence — is, as I mentioned, based on combining the KL divergence with descriptions of random walks on networks. The idea is that we want to measure how many extra bits we need to describe the random walk if we use, so to say, an estimate B of the network's true partition A. So we designate one partition as the true one and then look at another one that we compare to it. Going back to the example from the beginning, we take the reference partition on the left, look at the coding scheme that is derived from it, and use the one on the right to compare.

So essentially, here on the left we have again the KL divergence, and on the right side the map equation, rewritten in the shape of a random walk. We combine those into something that looks like a combination of the KL divergence and the map equation, to answer the question: how many extra bits do we need? But this is not quite what we want, because if we now substitute in the definitions, we will see that we only get a difference between two partition quality scores. So what we're actually going to do — there's a small difference here — is use the partition-dependent transition rates, which are derived from this coding scheme that I have drawn; it's a tree like this one,

and we basically simply normalize them properly. Now, you might look at this and think: oh my God, we have a double sum over the nodes, that doesn't look very efficient. But it actually turns out that because of the regularities and redundancies in this coding structure, we only need to consider m times n pairs of communities and nodes — and typically there are much fewer communities than nodes — so we don't need to consider the full n squared pairs of nodes, because of the redundancies in the community structure.
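As a sketch — under a loose reading of the construction, with the exact, properly normalized rates left to the paper — the score has the shape of an expected extra coding cost per step:

```python
import numpy as np

def flow_divergence(p, t, s_ref, s_est):
    """Flow-divergence-style score (illustrative only): expected extra bits
    per step when the walk (visit rates p, transition rates t) is encoded
    with the coding rates s_est of partition B instead of s_ref of the
    reference partition A. Dense matrices here are purely for illustration;
    the paper exploits module structure to avoid the full double sum."""
    total = 0.0
    for u in range(len(p)):
        for v in range(len(p)):
            if t[u, v] > 0:
                total += p[u] * t[u, v] * np.log2(s_ref[u, v] / s_est[u, v])
    return total  # 0 when both partitions induce identical coding rates
```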

So if we now look at the initial example again: if we use the map equation to measure which partition is best, we see the reference partition A is best; next are B and C, which have the same code length of 3.73 bits; and D is the worst, in this case 4.47 bits.

Now, when we use our partition dissimilarity score to compare pairs of partitions, we see in the first row of the table that A is completely identical to A; B and C have the same dissimilarity; and D is — well, not quite half a bit, but maybe half a bit — more similar to A than B and C are. The reason for that is that the overlap between A and D is on the nodes with higher degree — the nodes with a higher visit rate, where the random walker has a higher probability of being — so the resulting coding structure for describing the random walk changes less compared to partitions B and C.

So here we see that just because a partition has a better quality score — in this case B and C versus D — it doesn't need to be more similar to the reference partition.

And we can also do the same thing with hierarchical partitions: here we have two partitions, where one has just a single level of communities and the other has two levels of communities. Then we can again look at those coding schemes and make the comparisons.

And because I think I'm running out of time, I would like to conclude. Our motivation was that communities are based on link patterns, but common partition similarity scores typically ignore links, because they weren't designed for this application. Our solution is to combine the ideas behind describing random walks with the KL divergence, to develop a link-aware partition similarity score. It is asymmetric, and it does not require matching communities between partitions, like we had with the Jaccard index where we need to find the best match between the two partitions. And we can naturally compare hierarchical partitions, because the approach is based on random walks and a coding scheme that lets us describe transitions between any pair of nodes.

So — thank you very much for your attention. — All right, thank you so much, Christopher;

so we still have a few minutes for Q&A, I think there is one on our Zoom。

which is: how does this differ from metapath-prior-based approaches?

I'm afraid I'm not aware of that approach, so I might not be able to answer that question。

Could you say it again? I think I cannot see the — I'll check the chat, perhaps. No, I don't see the message in the chat. — You can't see the message? It's in the Q&A function. Okay, so I think it was asked by VK, and if you are here, I think I can make you a co-host to let you clarify. Yes, if you would like, I think you can now use your microphone to clarify your question if you want.

So, how does this differ from metapath-prior-based approaches? — Since I'm not familiar with it, I'm going to look it up; and perhaps, if the one who asked the question is interested, just drop me an email and we can discuss it offline, because right now I believe I cannot answer that question, unfortunately.

I guess, from my understanding, metapath-based approaches are like a more restricted version of your random-walk-based ones, because metapaths have stipulated schemes for those walks, whereas with random walks you don't really require the node types at each step of the walk. So that's my understanding. Okay, cool — so do we have any more questions coming?

Yes, we can still take one or two. All right. I guess if there's no question at this point, I do have a question. First, a comment: I think it's a very nice, intuitive idea to use the random walk to encode the structures of the communities, and also to use PageRank to efficiently compute the visit probabilities of the random walks, which makes the metric quite efficient and computable — so I think that's very nice. And one question I would have is: have you tried your measure on some real-world data sets and tested some community detection algorithms' performance based on your metric, and what kind of insights can we possibly get from doing that?

Well, I have looked a little bit into what happens if we have limited data, and what usually happens is that community detection algorithms say that, okay, now it's easier to find communities — there's missing data, so it's kind of like, okay, now we overfit. But as this happens, the flow divergence actually becomes larger: what you might think you gain in compressibility, you're actually getting further and further away from, let's say, the true partition, if we can say that we have something like that.

All right, thanks. So I guess we had another question, from Raphael Artello. I think the question is: would it be interesting to check arbitrary lengths? I'm not sure if that's clear enough for you to answer. — Arbitrary lengths?

I'm not sure — perhaps that relates to the random walks. I'm actually considering the statistics of random walks in the limit, so there is no fixed length, if that was the question; perhaps I didn't make that clear enough in the presentation.

Okay, thanks. So I guess with that we will conclude the first presentation by Christopher — thank you very much for your time, that was an excellent talk — and we will move on to our second presentation, which is going to be given by Antonin Joly, who is currently a PhD student at CNRS in France. He will be presenting Graph Coarsening with Message-Passing Guarantees, so I'll now transfer the co-host to Antonin. — Oh, hi everyone, can you hear me?

I can hear you, okay. — Okay, fine. So hi everyone, I'm Antonin Joly and I'm here to present our work Graph Coarsening with Message-Passing Guarantees, which has been accepted this year to NeurIPS. In recent years, graph neural networks have struggled with large graphs, such as those used in recommender systems;

one of the techniques employed to solve this issue is graph coarsening, which consists in reducing the size of the graph by mapping the nodes of the original graph to super-nodes in a coarsened graph. But we might wonder whether training a GNN on a coarsened graph is provably close to training it on the original graph.

So I will begin by introducing some notation on graph coarsening; I will use the notation of Loukas, with some generalization. Graph coarsening is defined on an original graph and produces a coarsened graph G_c and a matrix Q: the coarsening matrix Q defines the mapping from the original graph to the coarsened graph — if Q_ij has a nonzero value, it means that node j of the original graph is mapped to node i in the coarsened graph. Then, for any signal living on the original graph, such as a feature, we can define the coarsened signal x_c with the coarsening matrix Q, and we can define the uplifted signal in the original dimension — the reconstructed signal. To define this uplifted signal we use a lifting matrix Q+, which is the Moore-Penrose pseudoinverse of the matrix Q. Then we can define the projection operator P, which is the coarsen-then-lift operator.
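In code, this notation looks roughly as follows, assuming a hard assignment of original nodes to super-nodes (the assignment below is made up):

```python
import numpy as np

assign = np.array([0, 0, 1, 1, 1, 2])  # super-node of each original node (hypothetical)
n, N = assign.max() + 1, len(assign)

Q = np.zeros((n, N))                   # coarsening matrix: Q[i, j] != 0 iff node j -> super-node i
Q[assign, np.arange(N)] = 1.0
Q_plus = np.linalg.pinv(Q)             # lifting matrix: Moore-Penrose pseudoinverse of Q

x = np.random.randn(N)                 # a signal on the original graph
x_c = Q @ x                            # coarsened signal
x_hat = Q_plus @ x_c                   # uplifted / reconstructed signal
P = Q_plus @ Q                         # projection operator: coarsen, then lift
assert np.allclose(P @ x, x_hat)
```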

Now I would like to introduce a very classical spectral guarantee, introduced by Andreas Loukas; it is named the restricted spectral approximation (RSA) constant. It is the supremum, over all signals living in the low frequencies of the Laplacian, of the difference between the signal and the reconstructed signal, projected with P. It can be seen as a quality measure for the coarsening, and many classical coarsening algorithms aim to minimize this spectral quantity.

Now let me just briefly recall that GNNs rely on neighborhood propagation: the node features at a step l depend on the node features at the previous step and a multiplication by a so-called propagation matrix, which we denote S; it can be the adjacency matrix with different kinds of normalization.
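A minimal sketch of this propagation step (with the identity activation, this is exactly the SGC setting discussed below):

```python
import numpy as np

def propagate(S, X, num_layers):
    """Linear message passing: X^{l+1} = S @ X^l. With the identity as the
    activation, a GNN layer reduces to this multiplication followed by a
    learned linear map (the SGC, simplified graph convolution, setting)."""
    for _ in range(num_layers):
        X = S @ X
    return X
```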

But we might wonder: what is the best choice of propagation matrix on the coarsened graph? We can think of using the same normalization as in the original graph, or a weighted version proposed in prior work, but with these classical choices, spectral guarantees on the coarsening do not lead to message-passing guarantees.

To solve that issue, we propose a new propagation matrix on the coarsened graph which depends on the coarsening matrix and the lifting matrix; you can remark that even if S is a symmetric matrix, our propagation matrix is oriented. You can see, on a very small toy example with the propagation matrix being the adjacency, the original graph, the coarsened graph, and then our propagation matrix, which has the same weights as the coarsened adjacency but divided by the size of the source super-nodes; so it is a weighted propagation which is independent of the coarsening algorithm used to produce the coarsening.
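Here is a toy illustration; note that S_c = Q S Q+ is one reading of the construction described here (reweighting the coarsened adjacency by source super-node size), and the exact definition is in the paper:

```python
import numpy as np

A = np.array([[0., 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])            # toy symmetric adjacency S (made up)
assign = np.array([0, 0, 0, 1])         # two super-nodes of unequal size
Q = np.zeros((2, 4))
Q[assign, np.arange(4)] = 1.0

S_c = Q @ A @ np.linalg.pinv(Q)         # candidate coarsened propagation matrix
print(S_c)                              # [[2, 1], [1/3, 0]]: cross-weights divided
                                        # by the size of the source super-node
print(np.allclose(S_c, S_c.T))          # False: oriented even though S is symmetric
```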

With this new propagation matrix, we are able to bound the difference between the original signal propagated on the original graph and the coarsened signal propagated on the coarsened graph with our propagation matrix and then uplifted to the original dimension. The key point is that this upper bound depends on a spectral guarantee, namely the RSA constant: that means that if we have a good coarsening algorithm with strong spectral guarantees, we will have a better bound and a better reconstructed signal. To obtain this theorem, we make additional assumptions that are easily verified for most of the propagation matrices and Laplacians used.

So here you can see an illustration of this theorem; it is the same plot shown twice, once with a log scale and once with a linear scale. You can see our theoretical bound in dotted lines; it is very tight for small coarsening ratios, and our propagation matrix is the only one below that bound, which emphasizes how good the bound is for small coarsening ratios. We compare it to the two classical choices I introduced before — the same normalization and the weighted one — and we also compare two symmetric versions of our matrix, one with the coarsening matrix Q and one with the lifting matrix Q+.

So, if we make some additional assumptions about the loss function and the activation function, we are able to extend the propagation theorem to GNN training, and to bound the difference between the loss of the GNN trained on the coarsened graph and the loss of the GNN trained on the original graph; this upper bound still depends on the RSA constant. But the activation function assumption is quite strong, and for now it is only verified for the activation function being the identity — that is, the SGC network, simplified graph convolution. Additional work could be to extend this assumption to other activation functions and thus other GNNs.

To illustrate this, we conducted several GNN trainings on the well-known Cora and Citeseer datasets and on the 100-times-larger graph Reddit, for node classification. You can see that for the SGC networks, which verify all our assumptions, the benefits of our propagation matrix are most evident, and it outperforms all the other choices of propagation matrix. When we relax the assumption on the activation function and look at the results with a classical GCN, our results are a bit more mixed: our propagation matrix is still competitive, especially for high coarsening ratios, but some other choices might be a good idea too, such as the weighted version. To conclude,

I would say that in this work we directly linked the spectral (RSA) guarantees to message passing: we proposed a new propagation matrix that gives message-passing guarantees, and then GNN training guarantees, on the coarsened graph. This is only the case for this propagation matrix, which has the particularity of being oriented, but it is independent of the coarsening algorithm used to produce the coarsening. Future work could be to explore the activation function assumption, especially because the commutation property it requires is quite a strong assumption; other work could also relax the additional RSA assumptions we made. Thank you for your attention;

if you have any questions, I would be glad to answer them, and you can also see me at the poster session if you have any other question. — All right, so thanks so much for your exciting talk, and I think we still have quite a few minutes for questions. Yes. So I guess we can wait a little while for people to bring up questions. I guess I had a quick question: can you elaborate a bit on this slide — when you use GCN instead of SGC to run the experiments, we can see that one of the baseline choices actually gets a better result than yours. I was probably a little bit lost; can you elaborate a bit on why that happens and what insight we can get from that?

Do you mean why we have less good results with GCN compared to SGC? — Yes, yes, like when the coarsening method is not your proposed one but the baseline one. — Oh yeah,

this is mainly because we prove our theorem about the GNN under a strong assumption on the activation function, namely that the activation function and the lifting matrix commute for all the features, and that is for now only verified for the activation function being the identity — that's the SGC network. But for the GCN we have a nonlinearity, a ReLU, and this breaks our assumption; that's why the benefits of our method are less evident there, because we cannot bound the training of the GNN. And the weighted version is very similar to the classical normalization, but you put more weight on the self-loop, and more weight if the cluster of the coarsening is heavier at this node; that's why it's a bit better than the classical normalization and more interpretable in that case, and that's maybe why it sometimes has better results.

I see. So does that give us some empirical insight: when we do not have the nice guarantee on the activation function, as in SGC, what kind of coarsening setup could we potentially use? — Yeah, in fact our method is more of a post-processing step: you can use any coarsening algorithm, and it will give you a matrix Q, a matrix Q+, and a coarsened graph G_c; then, when you decide to train a GNN on your coarsened graph, you can use different choices of propagation matrix, like the same normalization or the weighted version. So I would say: if you go with SGC, you should definitely try our propagation matrix; otherwise, I think you should try our matrix and the weighted version, which is kind of a smarter choice than the same normalization. — All right, thanks for the answer.

I think we do have one question from the audience。

So the question is: what are the limits of that propagation function? — What do you mean by limits, more particularly? — Simply, like, the main drawbacks of the method. — Okay, I would say there are two.

Obviously, I would say the strong assumption on the activation function is one of the biggest limits. We also make another assumption, that the features live in the low frequencies of the graph; that is an assumption we classically make for graph coarsening. It means the method would work better on homophilous graphs than on heterophilous graphs, because if you merge two nodes into the same super-node but they have different labels, you will have a lot of difficulty distinguishing them. So we make this assumption about the features living in the low frequencies of the Laplacian, on the preserved space, and this is another, stronger assumption. But as we use spectral guarantees, it is quite classical. I hope I answered your question.

I hope I answered your question。Yes, I think。Yeah, so guess we can dismiss this question。All right。

so if there are no more questions coming up at this point, I guess we will just。

Conclude our presentation for this paper here and thanks so much again Ena for your presentation and again。

if the audience have more questions coming up, you can also feel free to post it in our Slack channel as well。

And — all right, so I guess we'll proceed to our next speaker. Our last presentation today is going to be given by Mikhail Mironov, who is currently a researcher at Yandex. The title of the presentation, as written here on the slide, is Revisiting Graph Homophily Measures. So Mikhail, very nice to meet you — whenever you are ready.

Nice to meet you too, can you hear me? — Yes, I can hear you well. — And can you see my slides? — Yes. — Good, perfect. Okay, so my name is Mikhail Mironov; this is joint work with Liudmila Prokhorenkova, named Revisiting Graph Homophily Measures, and yes, we both work at Yandex.

So, homophily is a property of a graph whose nodes are divided into some classes: if nodes of the same class tend to be connected, we say that the graph is homophilic; if nodes of different classes tend to be connected, we say that the graph is heterophilic. So the question is how to formally define this property and how to measure it.

One important comment here: measuring homophily is not the same as measuring GNN performance. It's common knowledge that GNNs tend to work better on homophilic graphs, but actually, if I take a fully heterophilic graph with only two classes, like here on the slide, clearly any GNN will work well on it — it will be able to guess the labels — and there is other evidence that GNNs also work well on some very heterophilic graphs. Another argument is that homophily was discussed as a property long before GNNs were a thing. So when we're talking about homophily, we don't want to just predict GNN performance; basically, what we want is to capture our intuition about which graphs are more homophilic: if I see two graphs and I think that one graph is more homophilic, I expect the measure to give the same answer — that this graph is more homophilic.

So we'll do it in the following way: we will go over previously used homophily measures, we will discuss some of their properties, and basically we will formulate what properties we actually want from a good homophily measure. As I said, we conclude that every previous measure lacks at least one of these properties, and we provide a measure which has all of them.

So let's see the previous homophily measures. There is a lot of notation here and it's hard to understand from scratch, so let me just describe in words what measures we have; here on the right there are four measures.

The first one is edge homophily. Basically, let's say I call an edge homophilic if it connects two nodes of the same class. I take the graph, calculate the number of homophilic edges, and divide it by the total number of edges — it's the simplest measure possible, and it's called edge homophily. The second measure is a bit different: here, for each node, I calculate the number of neighbors of the same class and divide it by the degree of this node; in this case I can say this is a measure of how homophilic this particular node is, and afterwards I average it over all nodes — this is called node homophily.

The next one is a bit complicated, so let me just give a sketch: basically, I take all nodes of the same class, calculate the number of their neighbors of the same class, and divide by the total degree of all nodes of this class — this is kind of a homophily of the class — and after that I do some adjustment of it and average it over classes; this is class homophily.

The last one is adjusted homophily, and it's an interesting one. This part on the right, D_k divided by two times the number of edges: D_k is the total degree of a class (the sum of the degrees over the class), and two times the number of edges is the total degree of the whole graph, so D_k / 2|E| is the fraction of degree corresponding to my class. If I drew the edges completely randomly, then I would expect the fraction of homophilic edges falling inside this particular class to be (D_k / 2|E|) squared; so in the numerator, the right part is basically the expected fraction of homophilic edges. Then I take h_edge, the real observed fraction of homophilic edges, subtract the expected fraction of homophilic edges, and do a normalization by the denominator. So that's the idea of adjusted homophily.
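For concreteness, here is a small sketch computing edge, node, and adjusted homophily from an edge list; the formulas follow the descriptions above:

```python
import numpy as np

def homophily_measures(edges, labels):
    """Edge, node, and adjusted homophily of an undirected graph.
    edges: list of (u, v) pairs; labels: class label per node."""
    labels = np.asarray(labels)
    n = len(labels)
    deg = np.zeros(n)
    same = np.zeros(n)
    for u, v in edges:
        deg[u] += 1; deg[v] += 1
        if labels[u] == labels[v]:
            same[u] += 1; same[v] += 1

    h_edge = same.sum() / (2 * len(edges))           # fraction of homophilic edges
    h_node = np.mean(same[deg > 0] / deg[deg > 0])   # average per-node homophily

    # Adjusted homophily: subtract the fraction expected under degree-preserving
    # random wiring, then normalize.
    two_m = 2 * len(edges)
    expected = sum((deg[labels == c].sum() / two_m) ** 2 for c in np.unique(labels))
    h_adj = (h_edge - expected) / (1 - expected)
    return h_edge, h_node, h_adj
```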

Now let's see some properties. We will look at them for edge homophily, and based on that we will formulate the properties we really want from a homophily measure.

Edge homophily is just the fraction of homophilic edges in the graph. So what's good about this measure? First of all, if all edges are homophilic, then I get the fixed number one: for any graph, if all edges are homophilic, I get the same number — a good upper bound, which is one. The same goes for fully heterophilic graphs: if all edges are heterophilic I get 0, no matter what the graph is. Perfect.

The last one is a kind of monotonicity: if I have a graph and I add a homophilic edge, then edge homophily increases; or if I have a graph and I delete a heterophilic edge, then edge homophily increases as well. So these are the kinds of good properties which we want from a homophily measure.

Now let's see the downside. Consider three graphs, all of them Erdős–Rényi models: basically I have a bunch of nodes, and for each pair of nodes the probability of an edge is, I don't know, one third or whatever — some constant — so the edges are completely independent of the labels. Suppose we have only two classes and the class balance is 50/50, with 150 nodes in the first class and 150 nodes in the second class; then on average half of the edges will be homophilic, so edge homophily will be 0.5. But if I have the same Erdős–Rényi model with two classes and a class balance of 90 to 10, and again draw edges at random, the expected edge homophily is 0.82 (that is, 0.9 squared plus 0.1 squared). And if I have five uniform classes, with 20% of the nodes in each class, and I draw edges randomly, then the expected edge homophily will be 0.20.
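A one-line check of these three values, using the fact that for label-independent edges the expected edge homophily is (roughly) the sum of squared class proportions:

```python
# Expected edge homophily under label-independent (Erdos-Renyi) wiring:
# sum_k p_k^2 over the class proportions p_k.
for props in ([0.5, 0.5], [0.9, 0.1], [0.2] * 5):
    print(props, sum(p * p for p in props))   # -> 0.5, 0.82, 0.2
```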

But it's clear that, from an intuitive perspective, all of these graphs should have the same kind of neutral homophily, because in all of them the structure of the graph itself has nothing to do with the labels, so it would be reasonable to expect the same number in all three cases. So this is a drawback of edge homophily: it basically means that if I have graphs with different class balances or different numbers of classes, then edge homophily is not reliable for comparing which of those data sets is more homophilic or heterophilic. So let's see what properties we want, based on these examples. The first three are exactly what we mentioned before: if all edges are homophilic we want some fixed upper bound (maximal agreement); if all edges are heterophilic we want some fixed lower bound (minimal agreement); and we want monotonicity, which says that

adding homophilic edges or removing heterophilic edges must increase the measure.

The next property is the key one, exactly the property that edge homophily lacks: constant baseline.

Here is how we formulate it. We want to say that if the structure of the graph is independent of the labels,

then the measure takes some constant value. We do it in the following way: we fix all node degrees

and all node labels, we cut all the edges, so now every node has not edges but half-edges (stubs) sticking out of it, and we rewire the edges randomly. So we preserve the degrees and the labels,

but all edges are drawn randomly. In this case we expect to get some fixed base value.

So we want our measure, for every graph rewired this way, to have an expected value equal to this base value.
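As an illustration of this constant-baseline test (my own sketch of the procedure just described, not the speaker's code), one can cut every edge into two stubs, shuffle them, and re-pair them; this preserves degrees and labels while making the wiring random:

    import random

    def rewire_preserving_degrees(edges):
        # Cut every edge into two half-edges (stubs), shuffle them, and re-pair them randomly.
        stubs = [node for edge in edges for node in edge]
        random.shuffle(stubs)
        return list(zip(stubs[0::2], stubs[1::2]))

    # Averaging a homophily measure over many such rewirings estimates its baseline value;
    # a measure with the constant-baseline property should give the same value regardless of
    # class balance or number of classes.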

Okay, there are two other, minor properties; let me just mention them. One is that if I formally add one more class which is not present in the graph (there is no node of this class, I just formally increase the number of classes by one), then the homophily measure shouldn't change. The second is that if I rename or reorder the classes, the homophily measure shouldn't change. Most of these properties were formulated by Platonov et al., including the crucial constant-baseline property.

Now let's see which of these properties the existing measures have. First of all, the first three,

edge homophily, node homophily and class homophily, do not have constant baseline, so basically

they are not reliable for comparing datasets with different class balances.

The next measure is adjusted homophily, which was formulated by Platonov et al., and it incorporates the constant baseline:

it is basically what I described before; the numerator is zero in expectation if we draw the edges randomly, because on the left is the observed number of homophilic edges and on the right the expected number of homophilic edges. So it has constant baseline,

but unfortunately this measure does not have minimal agreement, and it is not monotone for some lower values of homophily.

Our main contribution is that we formulated a new measure, unbiased homophily,

which has all of these properties. So let's look at it; we will basically stare at this formula for nearly the rest of the talk and

think about why it satisfies the properties we described.

We denote by c_ii the fraction of homophilic edges inside class i, and by c_ij half of the heterophilic edges between classes i and j.

Now we introduce the formula below. First of all, it depends on some coefficient alpha,

but for now let's ignore the second term; it is more technical and not really important. The crucial term is the first one, shown in black, so let's see what it means and why it makes sense at all.

First, suppose we have no heterophilic edges. Then all c_ij are zero. In this case

the numerator and the denominator are basically the same value, so we get one,

and the measure equals one, so we have maximal agreement. Perfect. The same for minimal agreement:

suppose we have no homophilic edges (sorry, I said heterophilic),

so all c_ii are equal to zero and all edges are heterophilic. In this case

all the square roots are zeros, and the numerator is basically the negative of the denominator, so we get minus one,

so we get minimal agreement. I'll say nothing about monotonicity here, but it can be proven that the measure is monotone in most cases.

Now let's look at the most important part: why this measure has the constant baseline, the main property that

the previous measures lack, excluding adjusted homophily.

Suppose we are in the setting where we cut all edges and rewire them randomly,

and we know the fraction of degree belonging to class i and the fraction of degree belonging to class j.

What can I say about the coefficients c_ii and c_ij? I know the fraction of half-edges (stubs) that correspond to the first class and to the second class;

so what can I say about c_ii and c_ij? First of all,

c_ii will be just the square of the fraction D_i / 2|E|: it is the probability that both endpoints are picked from the stubs of class i. For c_jj it is the same, and for c_ij, which is half of the heterophilic edges, it is D_i times D_j divided by (2|E| times 2|E|).

Sorry. Okay, so the conclusion is that, based on these formulas,

if we are in this random-rewiring situation, then we expect the square root of c_ii c_jj to be equal to c_ij.

This proves that the expected value of the numerator is zero, which is good:

it proves the constant baseline. So yes,

another intuition for interpreting it: if I know how many homophilic edges I have, that is, I know c_ii and c_jj,

then I can expect the number of heterophilic edges to be the square root of c_ii c_jj.

So this square root is the expectation, and c_ij is what I actually observe.

I take the difference, and if the difference is zero,

then I am exactly in this random situation, which is the constant baseline; if it is above,

the graph is more homophilic, and if it is below, it is more heterophilic. Okay.

Now let's see what the second term is about and why we need it at all. The problem with the first term is the following. Suppose only one c_ii is greater than zero, say c_11, and every other c_ii (c_22 and so on) is zero, so we have homophilic edges only in one class.

Then you see that all the square roots will be zeros,

and in this case the first term will be just minus one, because the numerator is simply minus the denominator. That is clearly not what we want: it violates minimal agreement, and it violates monotonicity, because if I now increase c_11 a bit,

the first term is still minus one. The second term fixes this:

in this special case, when only one c_ii is bigger than zero,

the second term handles it. I won't dig deeper, but what you can also note is that

the second term can be made arbitrarily small, because it is multiplied by some alpha, and alpha can be any positive number.

Okay. So this is our formula, which has all the desirable properties for every alpha greater than zero,

but in practice, since alpha can be arbitrarily small, we recommend just dropping the second term and using the first one, because it is the most crucial and has nearly all the properties; it only violates monotonicity and minimal agreement in this rare special case. Other than that, the first term works well and we recommend using it. By the way,

we give an example of a measure which satisfies all the properties,

but we do not claim it is the unique such measure; clearly there may be other measures which also satisfy these properties.

Okay, so here again is the review table with properties and measures. You can see that

unbiased homophily with alpha greater than zero has all the properties, and when we drop the second term we only lose a little, marked with a small star next to minimal agreement and monotonicity.

I think this is nearly it. In our paper we also include examples of desirable and undesirable behavior of homophily measures, and we compare them on different synthetic and real datasets.

We also tried to solve the same problem for directed graphs. By the way, the measure can also be applied to weighted graphs and works the same way,

but for directed graphs, unfortunately, the properties themselves contradict each other: it can be proven that if we translate these properties directly to the directed case,

there will be a contradiction, so we need some other properties.

We will also have a poster session tomorrow, so come there and ask us questions; we will be glad to hear them. Thank you,

that's it.

All right, great, thanks so much for the talk. We are actually running behind schedule and the poster session has already begun, but because we started late we still have a few minutes, so I guess we can take one or two questions if there are any.

As usual, I'll give people a little time to type in questions,

and I'll use this small gap to ask my own question.

You started by motivating the relationship between homophily measures and the performance of GNNs. Now that you have proposed your own homophily measure, are there any insights you can share about the relationship between your measure and GNN performance?

Oh, we haven't measured that yet, and in general I actually tried to keep this work separate from GNN performance, because, as I said,

GNNs work well for strongly homophilic graphs, but for strongly heterophilic graphs they also perform surprisingly well, so the drop in performance is in between, where homophily is neutral. I see.

And the other question I had was:

do you have statistics from running your measure on real-world datasets,

like some of the very popular benchmarks that we use?

It is not statistics as such; we have values of the different measures for different real benchmarks, and we can see that sometimes, let's say,

edge homophily says that one benchmark is more homophilic than another, and we can see this happens simply because there is some class imbalance or something like that,

whereas our measure compares them in the way we expect intuitively, which is what we formalized with our properties.

Yeah, it would be nice if you had some

counterintuitive examples, things we think should be homophilic

but your measure says are not; that would be very nice to show. Yes, but in some sense that would not be nice, because it would say the measure is actually bad:

it would mean something was not accounted for,

that our intuition wasn't captured by the properties. But yes, if you find such examples it would be a step forward;

maybe you would find some other properties. Yes. So I think we do not have more questions coming in.

If anyone wants to have more discussion,

please feel free to post questions and comments in our Slack channel. With that, I think we will finish today's oral presentations. Thanks again very much for this talk. Our poster session is underway right now in our Gather Town hall, so I encourage those of you who are interested to move there. Thanks very much again to everyone for coming.

Graph Machine Learning Conference: P02: Geometric Generative Models Tutorial

Overview

In this tutorial we will study geometric generative models. We start with the basics of modern generative models, then look at how geometric concepts can be incorporated into these models, and finally at how to actually build and implement geometric generative models. The tutorial aims to give beginners a clear, practical introduction.


Part 1: Simulation-Free Generative Models

The previous section introduced the overall structure of this tutorial; in this section we look at the basic problem setup of generative modeling and at modern approaches.

The Generative Modeling Problem

We have an unknown data distribution ( Q ), which we can usually only access through a finite set of samples ( \{x_1, x_2, ..., x_n\} ). Our goal is to learn a sampler that can efficiently generate new samples from this unknown distribution.

In deep generative modeling we learn a neural network with parameters ( \theta ). A generative model can be abstracted as a tuple of two objects:

  • Generator ( \psi_\theta ): a function that allows generating new samples from the learned data distribution.
  • Underlying density ( p_\theta ): the density defined by this generator function.

Our goal is to find parameters ( \theta ) such that the underlying density ( p_\theta ) created by our generator approximates the target density ( Q ).

Generative Modeling Paradigms

The main paradigms of deep generative modeling are:

  • Autoregressive models: likelihood-based models that generate the signal sequentially. The generation order depends on the signal dimensions; this works well for language modeling, but is less straightforward for images or geometric applications, where the ordering of pixels or data elements is not obvious.
  • Generative adversarial networks: the generator is a neural network mapping latent vectors from a low-dimensional latent space to data samples. Training requires defining a discriminator and solving a min-max problem, which can be hard to train, but sampling is fast.
  • Variational autoencoders: define an autoencoder whose latent vectors are constrained to be samples from some distribution. Sampling is fast, but performance typically lags behind GANs.

In this tutorial we will focus on dynamical systems as generative models.

Dynamical Systems as Generative Models

We view the generative model as a time-dependent process. A dynamical system describes how point samples evolve over time. We introduce a time-dependent generator ( \psi_\theta(t, x_0) ). Sampling becomes simulation: given a source distribution, we sample an initial point ( x_0 ) and simulate the process so that at time ( t=1 ) we arrive at the target distribution ( Q ).

Two main kinds of dynamical systems are used to build generative models:

  1. Flow models: described by an ordinary differential equation.
    [
    dx_t = u_t(x_t) dt
    ]
    where ( u_t ) is the velocity field, defining the direction of each infinitesimal step. This is a deterministic process.

  2. Diffusion models: described by a stochastic differential equation.
    [
    dx_t = f_t(x_t) dt + g_t dW_t
    ]
    where ( f_t ) is the drift term, ( g_t ) is the diffusion coefficient, and ( W_t ) is Brownian motion. This is a stochastic process.

Flow Matching

Flow matching is a simulation-free method for training flow models. The core idea: a velocity field generates a probability path if and only if it satisfies the continuity equation. So if we can construct a probability path that starts at the source distribution and ends up approximating the target distribution, we can learn the generative model by regressing the velocity field.

The key steps for building a flow matching model are as follows (a minimal training sketch follows this list):

  1. Construct the target probability path: we build the marginal probability path ( p_t(x) ) from conditional probability paths. The boundary conditions are the source distribution ( p_0 ) at time ( t=0 ) and an approximation of the target distribution ( q ) at time ( t=1 ).
    [
    p_t(x) = \int p_t(x|z) q(z) dz
    ]
    where ( z ) is the conditioning variable (e.g. a data point ( x_1 )).

  2. Define the conditional flow and velocity field: for the conditional path ( p_t(x|z) ) there exists a conditional velocity field ( u_t(x|z) ) that generates it. The marginal velocity field is obtained by marginalizing the conditional one.
    [
    u_t(x) = \frac{\int u_t(x|z) p_t(x|z) q(z) dz}{p_t(x)}
    ]

  3. Find a tractable training objective: it can be shown that minimizing the flow matching loss is equivalent to minimizing the conditional flow matching loss, which gives a tractable objective.
    [
    \mathcal{L}_{CFM}(\theta) = \mathbb{E} \left[ \| u_\theta(t, x_t) - u_t(x_t|z) \|^2 \right]
    ]
    where ( u_\theta ) is the neural-network velocity field we learn.

  4. Construct the conditional flow: a popular choice is the affine conditional flow.
    [
    \phi_t(x|z) = \alpha_t z + \sigma_t x_0
    ]
    where ( \alpha_t ) and ( \sigma_t ) are time-dependent coefficients. The corresponding conditional velocity field is:
    [
    u_t(\phi_t(x|z)|z) = \dot{\alpha}_t z + \dot{\sigma}_t x_0
    ]
    In particular, choosing the linear path (conditional optimal transport), ( \alpha_t = t, \sigma_t = 1-t ), the velocity field simplifies to ( u_t = z - x_0 ).
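As a concrete illustration of steps 1-4 with the linear (conditional optimal transport) path, here is a minimal PyTorch-style training sketch. It only illustrates the recipe above: VelocityNet and its sizes are placeholder choices, not part of the tutorial.

    import torch
    import torch.nn as nn

    class VelocityNet(nn.Module):
        # Placeholder network u_theta(t, x): concatenate time and state, predict a velocity.
        def __init__(self, dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim))

        def forward(self, t, x):
            return self.net(torch.cat([t, x], dim=-1))

    def cfm_step(model, optimizer, x1):
        # One conditional flow matching step with the linear path x_t = (1 - t) x0 + t x1.
        x0 = torch.randn_like(x1)                      # sample from the source (Gaussian) distribution
        t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
        xt = (1 - t) * x0 + t * x1                     # point on the conditional path
        target = x1 - x0                               # conditional velocity for the linear path
        loss = ((model(t, xt) - target) ** 2).mean()   # conditional flow matching loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()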

Flow Matching Summary

In this section we saw how to build generative models with the flow matching framework. The core steps are:

  1. Given samples from a target distribution ( Q ), construct a probability path ( p_t ) that equals the source distribution at ( t=0 ) and approximates ( Q ) at ( t=1 ).
  2. Define this path via conditional flows (e.g. affine conditional flows).
  3. Learn a velocity field ( u_\theta ) with the conditional flow matching loss, regressing it onto the conditional velocity field.
  4. The learned velocity field defines the marginal flow that generates the target probability path ( p_t ).

Compared with diffusion models, flow matching is more flexible (arbitrary source and target distributions can be used), is defined on a finite time interval, and typically offers faster sampling and better stability.


Part 2: Tools from Geometric Machine Learning

The previous section introduced flow matching generative models; in this section we look at how to bring geometric structure into these models.

Challenges of Generative Modeling on Manifolds

When the data lives on a manifold rather than in Euclidean space, many concepts need to be rethought:

  • Data representation: how do we represent points on the manifold?
  • Probability paths: how do we construct "straight-line" paths on the manifold?
  • Conditional flows: how do we define affine combinations on the manifold?
  • Velocity fields: how do we define and parameterize vector fields on the manifold?
  • Source distribution: there is no standard definition of a Gaussian distribution on a manifold.

Manifold Basics

To do generative modeling we need to equip the manifold with some structure.

  • Manifold definition: a manifold is a smooth topological space that locally looks like a vector space (via charts) but may have a different global structure.
  • Parameterization
    • Extrinsic parameterization: embed the manifold into a higher-dimensional Euclidean space ( \mathbb{R}^m ). The advantage is that familiar vector operations are available in the ambient space.
    • Intrinsic parameterization: use local charts covering the manifold. This can suffer from numerical instabilities (e.g. the stereographic projection near the poles).
  • Tangent spaces and tangent vectors: at every point ( p ) of the manifold there is a tangent space ( T_pM ) containing all vectors tangent to the manifold at that point. Tangent vectors can be formally defined as derivatives of curves through the point.
  • Riemannian metric: to compute the norms appearing in the loss (such as ( \| u_\theta - u_t \|^2 )) we need an inner product on the tangent spaces. A Riemannian metric ( g_p ) smoothly assigns an inner product to the tangent space at each point ( p ). It lets us define lengths of vectors, angles, and distances between points.
    • Norm: ( \|v\|_g = \sqrt{g_p(v, v)} )
    • Distance: the shortest path between two points is called a geodesic, and its length is determined by the metric.

Key Operations on Manifolds

To integrate on a manifold (i.e. to simulate an ODE), we need three key operations (a concrete example on the sphere follows this list):

  1. Exponential map ( \text{Exp}_p(v) ): given a point ( p ) and a tangent vector ( v ), move along the geodesic determined by ( v ) for unit time, arriving at a point on the manifold. Analogous to the addition ( p + v ) in Euclidean space.
  2. Logarithmic map ( \text{Log}_p(q) ): the inverse of the exponential map. Given points ( p ) and ( q ), it returns a vector ( v ) in the tangent space ( T_pM ) such that ( \text{Exp}_p(v) = q ), giving the "direction" and "distance" from ( p ) to ( q ).
  3. Parallel transport ( \Gamma_{p \to q}(v) ): moves a tangent vector ( v ) from the tangent space at ( p ) to the tangent space at ( q ) along a curve (usually a geodesic), preserving certain invariants (such as inner products).
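As a concrete example (my own sketch using the standard closed-form maps for the unit sphere embedded in ( \mathbb{R}^n ), not code from the tutorial), the exponential and logarithmic maps can be written as:

    import numpy as np

    def sphere_exp(p, v, eps=1e-12):
        # Exp_p(v): move from p along the geodesic with initial velocity v (v tangent at p).
        norm_v = np.linalg.norm(v)
        if norm_v < eps:
            return p
        return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v

    def sphere_log(p, q, eps=1e-12):
        # Log_p(q): tangent vector at p pointing towards q, with length equal to the geodesic distance.
        cos_theta = np.clip(np.dot(p, q), -1.0, 1.0)
        theta = np.arccos(cos_theta)
        if theta < eps:
            return np.zeros_like(p)
        direction = q - cos_theta * p
        return theta * direction / np.linalg.norm(direction)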

Flow Matching on Manifolds

We can now lift Euclidean flow matching to Riemannian manifolds.

  • Target-prediction parameterization: in Euclidean space, the loss can be either velocity-field regression or target-point regression. On a manifold we can likewise minimize the geodesic distance between the predicted target point ( \hat{x}_1 ) and the true target point ( x_1 ).
    [
    \mathcal{L} = \mathbb{E} \left[ d_g( \hat{x}_1, x_1)^2 \right]
    ]
    where ( d_g ) is the distance induced by the metric ( g ).

  • Conditional flows on manifolds: the Euclidean linear interpolation ( x_t = (1-t)x_0 + t x_1 ) is replaced by geodesic interpolation.
    [
    x_t = \text{Exp}_{x_0} \left( t \cdot \text{Log}_{x_0}(x_1) \right)
    ]
    The corresponding velocity field is:
    [
    u_t(x_t) = \frac{\text{Log}_{x_t}(x_1)}{1 - t}
    ]

  • ODE integration on manifolds: the Euler step ( x_{t+\Delta t} = x_t + u_t(x_t) \Delta t ) is replaced by:
    [
    x_{t+\Delta t} = \text{Exp}_{x_t} \left( u_t(x_t) \Delta t \right)
    ]
    This ensures that the integration path stays on the manifold. A minimal sampling-loop sketch follows this list.
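Putting the pieces together, sampling on a manifold is just Euler integration with the exponential map. This sketch reuses the sphere_exp and sphere_log functions sketched above and assumes a trained target-prediction network predict_x1 (a placeholder name):

    def sample_on_sphere(x0, predict_x1, num_steps=100):
        # Integrate the geodesic flow matching ODE on the sphere with exponential-map Euler steps.
        x, dt = x0, 1.0 / num_steps
        for i in range(num_steps):
            t = i * dt
            x1_hat = predict_x1(t, x)                       # network predicts the target point
            u = sphere_log(x, x1_hat) / max(1.0 - t, dt)    # velocity field u_t = Log_x(x1_hat) / (1 - t)
            x = sphere_exp(x, u * dt)                       # Exp-based Euler step keeps x on the sphere
        return x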

Summary: Geometric Tools for Generative Models

In this section we covered the core concepts needed to build geometric tools for generative models:

  1. Manifold representation: choose an extrinsic or intrinsic parameterization, weighing numerical stability against computational convenience.
  2. Riemannian metric: add a metric structure to define distances, angles, and norms, which is essential for computing loss functions.
  3. Manifold operations: the exponential map, logarithmic map, and parallel transport are the basic operations for simulating dynamical systems on manifolds.
  4. Flow matching on manifolds: by using geodesics as conditional paths and replacing the Euler integration step with an exponential-map step, the flow matching framework extends naturally to Riemannian manifolds.

For generative modeling on manifolds, flow matching is usually simpler than diffusion models, because it avoids the extra complexity of defining Brownian motion and score functions on the manifold.


Part 3: Combining Geometry and Generative Models in Practice

The previous section introduced geometric tools on manifolds; in this section we look at how to combine these tools with generative models and apply them to real problems.

From Euclidean to Riemannian Flow Matching

To generalize the Euclidean flow matching formulas to Riemannian manifolds, the key step is to replace vector addition and subtraction by the exponential and logarithmic maps.

  • Euclidean flow matching (recap)
    • Conditional flow: ( x_t = (1-t)x_0 + t x_1 )
    • Velocity field: ( u_t(x_t) = x_1 - x_0 )
  • Riemannian flow matching
    • Conditional flow (geodesic interpolation):
      [
      x_t = \text{Exp}_{x_0} \left( t \cdot \text{Log}_{x_0}(x_1) \right)
      ]
    • Velocity field:
      [
      u_t(x_t) = \frac{\text{Log}_{x_t}(x_1)}{1 - t}
      ]

Case Study: Protein Design and Manifold Representations

Protein design is an important application area for geometric generative models. A protein can be represented as:

  • An atomic point cloud: the 3D coordinates of all atoms. This can be handled in Euclidean space but ignores known chemical constraints.
  • Rigid-body frames (in ( SE(3) )): the position and orientation of each amino-acid residue, capturing the backbone geometry.
  • Torsion angles: the angles describing side-chain rotations, which live on a torus ( \mathbb{T}^n ).

Models such as AlphaFold2 use ( SE(3) ) and torus representations, encoding geometric priors (e.g. bond lengths and bond angles) into the representation, which makes the model easier to train.

Example: Flow Matching Encodings on the Torus

Take the torus ( \mathbb{T}^1 ) (the circle) as an example to illustrate how much the choice of parameterization matters.

  • Angle parameterization: output the angle ( \theta \in [-\pi, \pi) ) directly. The problem is the discontinuity at the boundary between ( \pi ) and ( -\pi ).
  • Unit-vector parameterization: output the two-dimensional unit vector ( (\cos\theta, \sin\theta) ). This is a smooth embedding without boundary discontinuity, at the cost of doubling the output dimension.

When implementing flow matching on the torus, the key points are (a small sanity check follows this list):

  1. Logarithmic map: compute the shortest signed step from ( \theta_0 ) to ( \theta_1 ), handling the wrap-around modulo ( 2\pi ).
    import numpy as np

    def log_map(theta0, theta1):
        # Signed angular difference, wrapped to [-pi, pi) so that we take the shortest path
        diff = theta1 - theta0
        return (diff + np.pi) % (2 * np.pi) - np.pi
    
  2. Exponential map: add the tangent vector back to the angle and handle the wrap-around.
    def exp_map(theta, v):
        # Move by v along the circle and wrap the result back to [-pi, pi)
        return (theta + v + np.pi) % (2 * np.pi) - np.pi
    
  3. ODE integration: update with the exponential map.
    theta_next = exp_map(theta_current, velocity * dt)
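A quick sanity check of the wrap-around handling (the values are only an illustration):

    theta0, theta1 = 3.0, -3.0                  # nearly antipodal, across the boundary at pi
    v = log_map(theta0, theta1)                 # about +0.28: the short way crosses pi rather than 0
    assert abs(exp_map(theta0, v) - theta1) < 1e-9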
    

Practical Tips and Open Challenges

  • Choice of parameterization: prefer parameterizations that are closer to Euclidean and smoother (e.g. unit vectors rather than raw angles), even at slightly higher dimension; this usually gives better training stability.
  • Vector-field parameterization: in practice, many models choose the "target prediction" parameterization, where the network directly predicts the target point ( \hat{x}_1 ) and the required velocity field is then computed via the logarithmic map, ( u_t = \text{Log}_{x_t}(\hat{x}_1)/(1-t) ). This is usually more stable.
  • Equivariance: for symmetry groups such as ( SE(3) ), building equivariant architectures is important, although recent models (e.g. AlphaFold3) also show that operating directly in ambient coordinates is possible.
  • Open challenges
    • Sampling speed: flow matching still requires simulating an ODE, so sampling involves many network forward passes; one-step generation remains a challenge.
    • Non-parametric manifolds: for data-driven manifolds without an analytic form, how to define geometric operations and generative models is still an open problem.
    • Loss landscapes: how different parameterizations affect the training dynamics and loss landscapes of generative models on manifolds has not been analyzed in depth.

Summary

In this tutorial we walked through the full pipeline of geometric generative models:

  1. Generative modeling basics: flow matching is a powerful, flexible generative modeling framework that lets us learn deterministic transformations between arbitrary source and target distributions.
  2. Geometric tools: manifolds, Riemannian metrics, exponential and logarithmic maps are the concepts needed to extend generative models to non-Euclidean data.
  3. Putting it into practice: we saw how to generalize the flow matching formulas to Riemannian manifolds and, through the protein design example, how to choose among different geometric representations. We also discussed concrete implementation details on the torus and practical tips.

Geometric generative models are a rapidly developing field, providing powerful tools for protein design, robotics, climate modeling, and many other scientific applications. We hope this tutorial has given you a solid starting point and sparked your interest in exploring the area.

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p03 P03_Integrating_Large_Language_Models_and_Graph_Neural_Networks -BV1k9pAzpE8S_p3-

All right, I think we can start. I'll keep it very brief: it's my pleasure to introduce Xavier Bresson. Xavier is an associate professor at the National University of Singapore, and

he needs no introduction. He is a pioneer of graph deep learning,

has some of the most cited works in the field, and has organized several conferences and tutorials,

including being a huge supporter of the Learning on Graphs conference, and he has won some of the biggest individual grants for this research.

So Xavier, take it away. All right, thank you for the introduction. Welcome everyone, and thank you for joining. I'm excited to talk to you about graph neural networks and large language models;

this is collaborative work with Xiaoxin He, Bryan Hooi, Thomas Laurent,

Yann LeCun and other collaborators. Okay, here is the outline of the talk:

I will first introduce large language models, and then I will raise the question of whether we still need graph neural networks in the era of LLMs.

I will review the advantages and limitations of large language models and graph neural networks to identify tasks where their combination can be useful, and in particular I will present two works:

the first uses LLMs to improve GNN reasoning, and the second uses GNNs for LLM reasoning. Then I will conclude.

We live in very exciting times with the deep learning revolution, compared to when I was doing my PhD.

Computer vision completely changed with the introduction of deep learning, as you know;

that was in 2012, with the ImageNet challenge. The architecture at that time was a convolutional neural network, which is still a very powerful architecture: AlexNet,

eight layers, 60 million parameters, two GPUs. And industry quickly understood

that this was going to be very productive for making money: it works very well for automatic recommendation, it works pretty well for autonomous vehicles (that is coming), and of course for surveillance it is a very powerful tool.

The revolution for natural language processing came a bit later, in 2022; the dataset was essentially the internet,

and the architecture was a transformer, with far more layers and 175 billion parameters compared to AlexNet.

The number of GPUs was also much larger, going from 2 to around 10,000,

and it really created a new industry that we call generative AI. For everyday tasks, for example

coding, text summarization, content creation, dialogue systems,

everything is automated now with these systems. People gave a name to these large pretrained networks trained on massive datasets:

foundation models. So we now have this generative AI industry booming.

For text generation we have proprietary LLMs such as GPT, Gemini from Google, and Claude

from Anthropic. We also have open-source LLMs, thanks notably to Meta and Yann LeCun, so we are able to manipulate these LLMs ourselves, which is fantastic for researchers: Llama 3, Mistral, Gemma from Google. The datasets are huge, but something interesting is that the architecture has not changed much since 2020: it is still a decoder, with a few improvements, but basically still a transformer. The compute scale changed a lot, though: around 20,000 GPUs, which would cost something like 4 billion US dollars today if you wanted to buy them yourself, and the next generation is going to be even larger, with 100,

000 H100 GPUs. For image generation we also have proprietary models like Midjourney, Stable Diffusion and DALL-E;

for researchers we are less lucky because we don't have full access, and the

open-source models are also quite huge. The architecture, somewhat surprisingly, is still a transformer with some variants, called a vision transformer, and we use diffusion models; this is now the standard way to generate new data.

These LLMs have been trained on the internet, basically, and we know the internet is a network: a network connecting web pages, and each web page is basically a text document with a lot of information.

If you look at Wikipedia, for example, you have links inside a page connecting to other Wikipedia pages; the same if you take a JSON file,

and there are a lot of JSON files on the internet describing the structure of

websites; when you open a JSON file you will see links, a lot of links. So what this means is that LLMs have been trained on graph data, and a lot of graph data, because the internet is not only the text but also the connections between the texts.

So they have learned relationships between text data.

Because they have been trained on such huge, massive datasets, they are able to identify,

for example, relationships between entities if you ask them, and also to predict node labels in a network, for example a network of scientific articles. So when LLMs arrived, with Xiaoxin He and my collaborators we had to ask the question: do we still need GNNs for reasoning over graph-structured data? Because LLMs have seen everything; they have been trained on the internet, so they know graph data.

To tell you the truth, I was quite worried. In this talk I will focus on text-attributed graphs, what we call textual graphs: each node is connected to other nodes (here, for example, a knowledge graph), the feature of each node is text, and the edge information, the connection between two nodes, is also text. So it is basically a topological graph with a lot of text information on top of it.

We will not consider non-textual graphs such as molecular graphs; that is not the goal of this presentation. First of all, what we did was: let's look at a popular dataset and try to predict with an LLM. The dataset we used was the one from OGB, OGBN-arxiv, a network of scientific papers. The number of nodes is about 170,

000 and the number of edges is about 1 million. The task is to predict the class of each paper, with 40 classes, and the node features are, for each node (an article), the title and the abstract. The results are as follows. If you use a GNN

trained on the bag-of-words features given by the OGB library, you get about 70% accuracy. If you look at the OGB leaderboard, the state-of-the-art model is GLEM, a combination of a language model and a graph neural network, with an accuracy of 76.

6 percent. Then what we simply did, back in 2022, was to take the best LLM at the time, GPT-3.

5, and ask it, given the title, the abstract and the 40 classes, to predict the correct class; it was able to reach 73.

5% accuracy. The good news is that this is not better than the state of the art using GNNs, although it is close,

and I think that was a big relief for me: we still need GNNs in the era of LLMs.

Also interesting is that the OGB dataset was probably part of the training data of GPT-3.

5; remember, GPT-3.5 was trained on the internet and this data was probably part of it, so it has likely seen the test set. This is what I mean.

Okay, now let's see how we can combine graph neural networks and large language models. For this we need to identify the strengths and limitations of these techniques. Large language models are really impressive: they accurately model the language distribution, predicting the next word given the context, and they have been trained on essentially all of human knowledge that is on the internet. The way I see it: you have human knowledge, and then you have this LLM, which is some kind of AI oracle. The oracle knows everything, it has seen everything, and we humans go to the oracle and ask: give me an answer to my question.

It will answer something; you can ask the oracle anything.

But the problem with an oracle is that, because it has so much knowledge,

you need to ask very precise questions, otherwise

it will not give you the answer that you want. But still,

that was very impressive at the time: in 2020, when GPT-3.5

was released, it had very strong zero-shot capabilities. And the scaling laws, despite what some people say today, have not yet saturated for training or inference: bigger networks, larger datasets, longer training, longer search at inference still improve the results, and it is easy to see why companies keep buying more and more GPUs, so we know the scaling law is not dead.

The limitations: because we often don't know how to prompt an LLM correctly, it will make errors, which we politely call hallucinations,

but these are basically very bad answers. They also have limited logical reasoning: for example, Terence Tao, a very strong mathematician, tried the latest version of GPT and concluded they are actually quite limited; again, you need to give them very precise prompts to make them work. What OpenAI has tried recently is to improve that by learning

chain-of-thought, for example, and by running search algorithms at inference; this mitigates the limitation, but it is still not there. They also have limited graph reasoning capabilities,

even if they have seen the test set of the graph task during training. For GNNs,

I think the strength is basically this: if you have a graph like this and you have a question,

for example, is the Mona Lisa in the same city as Alice's friend Bob, then here you have the Mona Lisa,

you have the city, you have Alice, you have Bob,

and by doing multiple layers of GNN you basically learn a multi-hop path that goes through the solution of your task. GNNs are very good at that, and they are effective for many different modalities: we are talking about text now, but they are also very good, for example,

in physics, biology, combinatorial optimization and chemistry. And this was a very good year for AI in chemistry: there was a Nobel Prize in Chemistry for AlphaFold, and AlphaFold, as you know, is essentially a graph transformer, predicting the pairwise distances between residues in amino acid sequences, so it is basically a graph neural network. The limitations of GNNs: they lack a graph foundation model at the scale of natural language processing and computer vision. Of course the community is working hard on that, and it is very interesting to push in this direction, but

there is this emergent-property argument, which means we need to go far beyond the current scale of training data and compute to get something very powerful, and we are not there yet. The problem is the datasets: we don't have a large dataset available; OGB is still comparatively small compared to ImageNet, which has 150 gigabytes of images.

The hardware for running sparse linear algebra is not optimized, so it is much slower than standard dense operations, and existing pretrained GNNs, partly because of that, are small: they are not

billions of parameters, they are millions of parameters. I think another limitation today is that industry has not yet found a compelling application of GNNs. Industry really drives AI research and AI products; why do we have GPT today? Because industry was interested in deep learning and in developing products. It is not yet clear how to make profitable products from GNNs; it will come, I think, but it is not there yet.

Combining LLMs and GNNs basically means developing a jointly trained text-and-graph foundation model. This is of course a very attractive,

very promising idea, but today I think the issue is that there is a huge imbalance between the knowledge coming from text, from LLMs, and the knowledge coming from GNNs. What we would like is an architecture where the text goes through an LLM, which processes it and gives us some vectors, and likewise the graph goes through a GNN, which gives us some vectors, and then these vectors are

processed together with self-attention or cross-attention and, for example,

we do text generation. The fact that there is such a huge difference between these two domains

makes this very challenging. What it means is that we need to tailor the combination of LLMs and GNNs to get value out of it. For example,

we can use the vast knowledge of an LLM to try to improve the performance on small-scale text-attributed

graphs, and we can also do the reverse: we can use a knowledge graph, for example,

to constrain the LLM to give more precise responses, reducing hallucinations this way.

This is what we will do next: we will review two works we have done, really focusing on text reasoning tasks. The first work uses LLMs to enhance GNN reasoning; it takes LLM reasoning abilities to improve GNN predictions,

and it is actually pretty effective and robust. The second paper is GNN-enhanced LLM reasoning: a foundational work where we try to put together the benefits of LLMs, GNNs, and something we call GraphRAG, which I will explain.

Okay, let's look at the first technique.

The technique is called TAPE. The idea is to use LLM knowledge to improve the quality of the node features in a text-attributed graph (TAG): if we have better node features, then we can predict

with higher accuracy. The question is how to extract

information from an LLM for a specific TAG task. To do that,

we are going to prompt the LLM, because the LLM has accumulated so much knowledge: we would like to prompt for the prediction of the LLM,

but at the same time we also want to understand its reasoning.

So given, for example, an article (the title and the abstract), we will ask the LLM to predict the class, but at the same time we will ask it to give us its reasoning: why did you decide on this prediction?

You can call this reasoning an explanation if you want.

Now that we have these sequences of words, the abstract, title, explanation and prediction,

this is not something we can directly use with GNNs, so what we need is a mapping from sequence to vector. We want to take this input sequence of words and output a D-dimensional vector that summarizes the information and improves

the expressivity of this node feature. Remember that in this example a node is an article,

this is another article, and what we want to predict is whether there is a relationship; sorry,

what we want to predict is the class. Okay. So what we propose:

we propose something that leverages both proprietary and open-source LLMs.

This technique is a kind of integrator between a closed LLM and an open LLM.

The closed LLM can be GPT or Gemini; we know that these closed, proprietary LLMs are actually better than the open ones. You can look at the LLM leaderboards and you will see that the top two are always GPT and Gemini, so unfortunately for researchers, proprietary LLMs are better than open ones. But the problem with closed LLMs is that they only provide sequences of words: they do not provide the vectors that we want in order to train the GNN. In contrast, open-source LLMs like Llama

or Gemma provide the text but also the vectors, so we have access to everything inside the architecture:

the hidden vectors, the hidden features, but also the outputs; everything is given to us.

So what we decided to do in 2022 (at that time GPT-3.

5 was the best closed LLM, so that is the one we used): we take our article, which is node i in the graph, and given the title and the abstract we query the LLM and get the explanation and the prediction. Then we convert this sequence of words, for example the explanation, into a vector by using

a language model, a small one, which is DeBERTa in this case;

it has about 129 million parameters. Let me zoom in: we have a sequence of word tokens, which is basically the explanation, and,

as in any transformer architecture, we can add a class token, something that summarizes the sequence. We give this as input, go through the transformer layers,

and we take the class token after the L transformer layers; then we go through an MLP,

a small MLP, fine-tuned on the training set. On the training set we know the correct class,

so we fine-tune the MLP on the correct classes of the training set. The MLP here is small, only two layers:

the first layer gives us some features, and then a second layer maps to the number of classes, which can be, say, 40

for OGBN-arxiv. The output of the first layer is a feature vector, and this is going to be our enriched feature:

it represents the input sequence we have here, and it is very tailored

to the task you want to solve. This way we get an enriched feature for the explanation, an enriched feature for the title and abstract, and we can also have a prediction feature.
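As a rough sketch of this pipeline (not the authors' code): prompt an LLM for a prediction plus explanation, encode the returned text with a small pretrained language model, and fine-tune a two-layer MLP head on the training labels. Here query_llm is a placeholder for whatever chat API is used, and the model name and layer sizes are only example choices:

    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel

    def build_prompt(title, abstract, classes):
        return (f"Title: {title}\nAbstract: {abstract}\n"
                f"Which of these categories does the paper belong to: {', '.join(classes)}?\n"
                "Answer with the category, then explain your reasoning.")

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
    encoder = AutoModel.from_pretrained("microsoft/deberta-base")
    head = nn.Sequential(nn.Linear(encoder.config.hidden_size, 256), nn.ReLU(), nn.Linear(256, 40))

    def node_feature(text):
        # Encode the LLM explanation text; the first-position hidden state summarizes the sequence.
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        hidden = encoder(**batch).last_hidden_state[:, 0]   # [1, hidden_size]
        return head[0](hidden)                              # first MLP layer output = enriched node feature

    # explanation = query_llm(build_prompt(title, abstract, CLASSES))   # placeholder LLM call
    # feature = node_feature(explanation)   # later used as a node feature for the GNN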

If we were doing this in 2024, we would replace the small DeBERTa language model with a large language model. The reason we can do that at a university is that if you take,

for example, Llama 2 or Gemma, you can fine-tune them with the very nice technique of LoRA, low-rank fine-tuning; with my small GPUs I am actually able to fine-tune a large language model like Llama 2 this way, which is great. So you just keep the best proprietary LLM, whichever one you like, and then, instead of DeBERTa, you can use Llama 2, for example.

Okay, so once you have your enriched node features,

you train your GNN with these new node features (you can use your favorite GNN)

and then you make the prediction. So now we can compare the quality of node features.

We will compare shallow features, language-model features and large-language-model features. The shallow features: the OGB datasets already come with nicely hand-crafted features,

for example skip-gram features for OGBN-arxiv, and with those you get 70% accuracy on the test set in only four minutes,

which is very fast, and I think this is a very good baseline. Now,

as I said before, the state of the art was GLEM, which trains a language model and a GNN simultaneously, and this model got the best accuracy, around 76.

6%, but because you need to train your language model, it takes more time: about 9.

2 hours. So at that time, before the introduction of ChatGPT, there was a huge tradeoff: if you want to increase your accuracy, you need to pay the price of much more computational hardware and training time.

Now, what we suggested: we use the LLM, we prompt it,

we translate the output into vectors, and then we fine-tune. The accuracy was around 75% and it took only a few hours. Interestingly, once we published the paper we reached the top of the leaderboard, and then other techniques used the same approach, of course a little better; even today (I was surprised when I looked recently) the top three models for OGBN-arxiv are based on this TAPE-style technique.

Of course, one reviewer said: we cannot trust your results, because OGBN-arxiv is part of the LLM training data. So what we did is produce a new dataset, not OGBN-arxiv but a new arXiv dataset we call arXiv-2023, available for download, with around 77,000 papers and the same number of classes, and we reach the same conclusion: nothing changes. Again, the reason is that an LLM, even if it has seen the test set, is not able to reason very strongly with the graph structure, so you still need the GNN to use the topological relationships to make good predictions.

We did some ablation studies, and what we observe is that no specific feature is better than the others:

it is actually the combination that matters. So, to conclude this work:

we can use the LLM's knowledge and reasoning abilities to enhance the node features of the TAG and fine-tune them to the specific task. What we did is not end-to-end; we don't train everything together. We first generate good features

and then we train the GNN. I think this is in line with

the current trend for LLMs: a basic LLM is not trained end-to-end either; there are several stages, self-supervised pretraining, supervised fine-tuning, a reward model, and finally

reinforcement learning, and each stage is done independently, because each stage is well defined and easy to train, and it is very stable. This is exactly the same conclusion we have here:

we can make it very stable and efficient, and whatever is stable usually gives better performance.

Also interesting is that we can leverage both proprietary and open-source LLMs,

so we have the best of both worlds. And, of course, some people will like that

it is interpretable, because you can read the reasoning words of the LLM.

Okay, now let me go to the second technique I want to introduce.

It is called G-Retriever, and the idea, as I told you before, is that the LLM is very powerful and knows everything, but it is going to make errors because we never have a perfect prompt in some sense,

so we need to constrain, to regularize, the LLM response into a much smaller space.

To do that we are going to use a TAG, a text-attributed graph such as a knowledge graph, and it will force the LLM to answer with respect to this TAG.

The key question is, of course, how do we extract pertinent information from

a graph and force the LLM to be more focused? To do that we are going to use tokens. I think everybody understands by now that, because of the transformer architecture, it is very convenient to pass everything as input: the input of your LLM can of course be your query tokens, but it can also be other information, like visual tokens and also graph tokens. So this is what we are going to do here: we will use two kinds of tokens. The first one, which I will explain in the following slides, is a graph-encoder token, and the second is a set of text-based graph tokens.

The graph-encoder token is something very natural: for example, in molecular science,

you want to represent your graph as one vector and then use this vector to predict the property you want. Here it is the same idea: we can select any favorite GNN, apply multiple graph learning layers to compute deep node hidden features, take the mean over the nodes, and then apply a small MLP on top. This gives a graph-encoder token that summarizes your topological graph and its features in one vector of d dimensions.
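A minimal sketch of such a graph-encoder token in PyTorch Geometric (my own illustration of the idea just described, with arbitrary layer sizes):

    import torch.nn as nn
    from torch_geometric.nn import GCNConv, global_mean_pool

    class GraphTokenEncoder(nn.Module):
        # Deep GNN -> mean pooling over nodes -> small MLP, producing one vector per graph.
        def __init__(self, in_dim, hidden_dim, llm_dim):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)
            self.conv2 = GCNConv(hidden_dim, hidden_dim)
            self.proj = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                      nn.Linear(hidden_dim, llm_dim))   # project into the LLM embedding space

        def forward(self, x, edge_index, batch):
            h = self.conv1(x, edge_index).relu()
            h = self.conv2(h, edge_index).relu()
            g = global_mean_pool(h, batch)      # one vector per graph
            return self.proj(g)                 # the graph-encoder token fed alongside the word embeddings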

The other thing, of course, is that an LLM wants word tokens: the way it processes information is with words. So if we want to tap into the knowledge of LLMs, we need to transform the graph and its features into a sequence of natural-language tokens. For example, here you have a graph, and you can represent it with a textual representation, such as:

"the graph G is a set of directed edges (i, j), where node i points to node j"; here the graph G is defined with edges 0→4, 1→6 and so on. So you have a one-to-one mapping between the mathematical representation of the graph and this text representation.
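For instance, one such one-to-one text rendering of an edge list could look like this (an illustrative sketch, not the exact template from the paper):

    edges = [(0, 4), (1, 6)]   # directed edges i -> j
    text = "G is a directed graph where " + "; ".join(
        f"node {i} points to node {j}" for i, j in edges) + "."
    # -> "G is a directed graph where node 0 points to node 4; node 1 points to node 6."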

However, there are many ways to use language to represent a graph. For example, in this paper there is a nice example showing many different representations: the text-based representation of a graph is not unique, which can be an issue. The other issue is what I would call lack of text equivalence: if you swap one representation for another, you will probably get different results. We would like some kind of equivalence across text representations, but we don't have it.

The other problem is scalability when you represent a graph with text. If you take an open-source LLM,

the context window is limited: for example, the one we use in academia, Llama 2, has a limit of about 4,000 tokens. If your graph is small, no problem, but if your graph is, say, the Wikipedia graph, with a huge number of nodes and edges, this is not something you can do; there is a scalability issue.

And of course LLMs are prone to hallucination: here we have an example where an LLM produces nodes and edges that do not actually exist in the knowledge graph we use.

For example, this edge does not exist in the graph, but it is still part of the vocabulary, so the LLM can give you this entity, which does not exist.

So what we propose: first, we apply a GraphRAG step (I will explain how) to retrieve a subgraph from a possibly large text-attributed graph

that is relevant to the user's query. Step two:

we concatenate the user query, the graph-encoder token and the text-based graph tokens to create the input sequence of the LLM. Step three: the LLM generates an answer. Step four: we train everything; we simultaneously train the GNN parameters and fine-tune the LLM parameters using LoRA. The good thing is that with LoRA you are only using 0.

5% of the 7 billion parameters, which is only about 35 million, and the GNN has something like 5 million, so it is not that much to train; this is something we can do in academia.
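In code, this kind of parameter-efficient fine-tuning typically looks like the following sketch with the Hugging Face peft library (the hyperparameters and model name here are arbitrary examples, not the settings from the paper):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
    llm = get_peft_model(llm, lora)
    llm.print_trainable_parameters()   # only a small fraction of the 7B weights is trainable;
                                       # the GNN encoder parameters are trained jointly alongside these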

The GraphRAG we propose is graph retrieval-augmented generation. RAG is very popular today, and here we want to extend it to graphs; the main problem is of course scalability, so let me tell you how we solve it.

The first step is indexing.

We have a text-attributed graph, and we take

the node features, which are text, and likewise the edge features,

and we apply a pretrained, frozen language model (at that time we used a small language model, but we could use a large one now), and we get a d-dimensional representation of all the nodes and edges, which we store in a graph database. Today we are lucky: there are many open graph frameworks and databases that can be used,

for example PyG and DGL, but also LlamaIndex, Microsoft's GraphRAG, and NebulaGraph; there are also proprietary graph databases like Neo4j.

So we do that, and we have vector representations of the nodes and the edges. The second step is retrieval: given a query from the user, for example "what is the name of Justin Bieber's brother", we represent the query with the same language model we used for the nodes and edges, and then we simply do a similarity evaluation; here we use the cosine metric, and this way we retrieve the top-k nodes and edges from the graph database. We get something like this, which is a somewhat noisy subgraph.
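A minimal sketch of this retrieval step (illustrative only), assuming the node/edge embeddings and the query embedding come from the same frozen text encoder:

    import torch
    import torch.nn.functional as F

    def retrieve_top_k(query_emb, item_embs, k=10):
        # Cosine similarity between the query and every node (or edge) embedding; keep the top k.
        sims = F.cosine_similarity(query_emb.unsqueeze(0), item_embs, dim=-1)
        top = torch.topk(sims, k=min(k, item_embs.shape[0]))
        return top.indices, top.values   # indices of the most relevant items and their scores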

The next step is to extract a smaller graph that carries the most relevant information. The way we do that is by solving a Prize-Collecting Steiner Tree problem. Starting from the retrieved graph, each node has a prize; the higher the prize,

the more you want that node to be in the final tree. You want to maximize the collected prizes, but you also don't want to take everything, so there is a penalty given by the cost of the solution (it should be a minus sign here, sorry), so that you don't take too many nodes:

the cost is basically the number of nodes you keep. When you solve this combinatorial optimization problem, which is NP-hard but admits approximate solutions (for example via semidefinite programming), you get something like this: a directed tree, which usually has a root node and flows out from it, and of course it wants to keep the high-prize nodes. This is the original Steiner

tree formulation; we can modify it, because here there were only node prizes,

but when we do graph learning there are also edge features that we want to use, so we can easily incorporate edge information: we also have a prize for each edge (you see, this edge is more important than that one), we modify the combinatorial optimization problem a little, and we can solve it with very fast techniques; the approximation runs in roughly linear time.
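Here is a sketch of how the retrieved similarities can be turned into prizes and costs for a Prize-Collecting Steiner Tree solver. The solver call itself is left abstract: solve_pcst is a placeholder name (in practice one would plug in an approximate PCST implementation), since the exact API is not shown in the talk.

    import numpy as np

    def build_pcst_inputs(node_sims, edge_sims, edge_cost=1.0):
        # Higher query similarity -> higher prize; every kept edge pays a constant cost,
        # which penalizes large subgraphs.
        node_prizes = np.maximum(node_sims, 0.0)
        edge_prizes = np.maximum(edge_sims, 0.0)
        edge_costs = np.full(len(edge_sims), edge_cost)
        return node_prizes, edge_prizes, edge_costs

    # node_prizes, edge_prizes, edge_costs = build_pcst_inputs(node_sims, edge_sims)
    # kept_nodes, kept_edges = solve_pcst(edges, node_prizes, edge_prizes, edge_costs)   # placeholder solver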

If we compare standard RAG with GraphRAG: in standard RAG

you have your knowledge database and you extract however many relevant documents you want;

in GraphRAG we have a graph database instead, and we extract a much smaller but very relevant subgraph related to the user's query.

Now I come back to the input tokens of the LLM.

The first one, again, is the graph-encoder token, which I already talked about:

an MLP on the mean of the node hidden features from the last layer of the GNN. It goes here, and since these are learnable parameters,

everything can be updated by backpropagation. For the text-based part,

we have two sequences of input words: the first one is the user's query,

"what is the name of Justin Bieber's brother", and the second one is the textualization of the retrieved subgraph.

This, again, is important in order to tap into the LLM's abilities.

So we take the textual graph representation,

it goes through the tokenizer, and any LLM at its first layer performs the word embedding,

so we get token embeddings and then it goes through the LLM. The training

is very standard: we give these input tokens,

we go through the L transformer layers, and the system generates the response autoregressively. Because we know the label from the training set, we can fine-tune the system to give the right answer: we compute the cross-entropy between the generated answer and the ground truth, then do the backward pass to compute the gradients and update the parameters of the system, the GNN but also the LLM.

In summary, G-Retriever is composed of four steps: we have GraphRAG,

which does subgraph retrieval related to the user's query; then the computation of graph tokens using a GNN; then

response generation, once we go through the input tokens and the transformer layers; and finally model training.

Here we try to combine the best of all worlds, and I think it is done in a very natural way. We had to adapt existing datasets and create a new benchmark to evaluate this task of reasoning over text-attributed graphs:

we defined ExplaGraphs, SceneGraphs and WebQSP.

The main results are basically that (there are many numbers here,

but you should focus on this) if you use our G-Retriever,

you beat an LLM alone (query in, answer out), you beat GNN-only prompt tuning, and you also beat LoRA fine-tuning of the LLM alone. So it really pays to combine the best of these worlds.

On scalability: we are now able, for example,

to reduce the number of tokens and the number of nodes we use by between 83% and 99%;

instead of taking the whole Wikipedia graph, we only take a very small subgraph of the Wikipedia network.

Of course, the question we were really interested in is whether it improves hallucination. We compare with the baseline, which is just the LLM with the query, and we evaluate manually:

we take queries and responses and check by hand whether there are missing nodes or wrongly added nodes, and the same for the edges, and we see that with G-Retriever you really reduce the hallucination.

Ablation studies: what we observed, and this was very interesting, is that the token given by the GNN and the text tokens of the graph contribute roughly equally. The information coming from the GNN and the information coming from the textualized graph are both important; they are complementary and together they really improve everything. It makes sense: the GNN is pretty good at extracting graph information, while the LLM has very strong capabilities with text representations, so the two are complementary.

In conclusion: if we want to unlock the LLM's capabilities, we need to use graphs represented as tokens as well,

and combining everything, the LLM, the GNN and the GraphRAG, provides superior performance;

it is not only the LLM that does the work, there are many actors. G-Retriever is effective,

efficient, and mitigates hallucination. Here are the paper and the code, and I really invite you to read the blog post by Xiaoxin He; she has done terrific work here.

She is the main researcher on this project, and she received the 2024 Google scholarship for it. I really invite you to read the blog post if you want a high-level introduction.

Also, the PyTorch Geometric team included G-Retriever in their library, which is very nice. Something else that is interesting is to come back to the question: can we use GNNs to improve products? I think there is an opportunity here. If we look at the history of web search engines, everything started with Google PageRank, which did not use any text processing;

then there was word2vec, then they used language models, and they also used RAG;

and finally we have Gemini, which is an LLM plus retrieval: if the LLM does not know the answer, it falls back on retrieval when you prompt Gemini. The next step, hopefully, would be to use GNNs and some text-attributed knowledge graph; that would be useful for web search engines. What is attractive here, I think, is that everything is integrated and learnable, so it is deep.

The next step we are working on with Xiaoxin is: we have the text features,

we have the graph features, and the last feature you want is the image information. Again, today we have the transformer, so you can take your image, decompose it into patches, represent each patch by a vector (this is the visual token), and then go through the transformer layers and backpropagate to also update the parameters of your vision transformer.

And that's it. Thank you so much for being here, and I'm happy to take any questions.

So if the participants have any questions, you can type them into the chat or Slack. And

perhaps to get us started, I had a question. I thought that how

you propose to tokenize everything, essentially, and fine-tune these language models with LoRA

is really exciting; it really unlocks multimodality.

You did have a slide about graph tokenization, and that there is no canonical order.

So how do you deal with that in this work?

How do you decide how the graph is tokenized in the end? Are you talking about this one?

Yeah, where you had shown the graph slide as well, right?

I'm not sure which one you mean. This slide, is this how you textualize the graph?

So the textualization, where is it? This one. Yes. So the textualization is arbitrary.

It's completely arbitrary. I think Bryan Perozzi has a nice paper on this, and it is an issue: as you said, you would want a kind of canonical representation.

And there is none. What is, for the LLM that has been trained, the best representation to extract the most information? This is impossible to say. So not only do you not have a unique representation, but you still need some textual representation, otherwise you cannot access the LLM's knowledge — and at the same time you have no idea whether this representation is good or not. It's like a prompt: there is no way to know the quality of your prompt before you try. For example — I don't know if you noticed — there is a change in performance if you put the title after the abstract in the prompt.

All these models have this issue: the way you prompt, you are going to get very different results. There should be no difference whether you put the title before or after the abstract, but there is one, so you need to play with it. I was only half joking when I talked about these LLMs: they have seen everything, all of human knowledge, and you can ask them anything — but if you are not precise, if you don't know how to ask the question, they will never give you the right answer. So the only way, I think, to bypass this is,

basically, to live with the non-uniqueness of the text representation, but then to fine-tune.

Before LoRA, I was quite pessimistic about this technique. I thought it would be very hard to find a good representation: at the end you have some vectors, and you need to align those vectors with the graph vectors and everything else, and you would never get something good because the models were not trained this way. But if you do LoRA fine-tuning, then you have a way to align this vector information. It's not perfect, and so far it also doesn't quite make sense if you do the

following — yeah, for example, let's go to the end. In some sense it doesn't make sense to combine the visual tokens and graph tokens with the word tokens, because they are very different modalities. So what you do with the MLP that you put here is train it to align the visual space and the graph space with the text space. If the LLM is frozen, you basically don't do anything else: you hope for the best, you hope that in this space the alignment is good enough to give you good precision. I don't think it ever quite is, but there are so many parameters that eventually it gives you something decent. Now that we can fine-tune with LoRA, what you do is force all these spaces to align.

And this is why it works better when you fine-tune. — Yeah, thanks. I mean, thanks for all the details; it's always exciting to discuss research this way.

Do you have time for one more question? — Yes, sure. — Yeah,

so I guess another thing to discuss would be: if we can convince industry of the potential of these approaches — and I'm certainly excited — do you think it is possible in the future that we have LLMs which are, let's say, more implicitly able to handle graph-structured data? Here we had to fine-tune with LoRA and so forth, but do you think there is a possibility of having graphs as one of the pretraining modalities?

This is what the community is trying to do today. Let me go to the slide — this is exactly what the community is trying to do: can we build a graph foundation model? Then you would be able to pretrain your graph in different modalities and recombine it with other modalities. What we do today is use the LLM's self-attention layers, but what we would like ultimately is to train very powerful foundation models — a foundation model for graphs, a foundation model for vision, a foundation model for text — and then combine them using these transformer layers for different tasks. For industry, I think that would be the best way to go. The problem is that today only LLMs have so much

knowledge and so many parameters. It's very imbalanced: there is no way today, I think, that we can build a powerful product with GNNs alone — the GNN today is just the cherry on top, in some sense. It's not that we are missing training data; the world is full of graph data, there is no issue there. The question is how you make a product that would excite industry — Meta, Google and so on — such that tomorrow they say: OK, let's do something like ChatGPT, but for graphs. There was already, at some point, someone asking "why isn't GNN used more in industry?", and I think that was a very interesting

discussion. The only industry that really likes GNNs today is biology, because you cannot do LLMs for biology. The only field that can make money from this is basically biology — and we have the Nobel Prize, so there is a huge promise that deep learning and graph neural networks can make money. Of course, it's a promise, it's not there yet, but I think that is where the money is today for GNNs. We would of course like to do it for text as well — there is no reason we cannot — but,

as long as we don't have industry on board — and we need to be honest with ourselves — industry drives the big AI improvements, like GPT. GPT could never have been developed in academia. Of course academia is very important for the ideas — for example, diffusion models were developed by a PhD student in Germany — so ideas come from academia, but scaling up and making a societal change comes from industry. It's part of the pipeline. If we are not able to convince people to use GNNs for text, then either GNNs are not good enough, or we are not doing a good job of finding the right products.

Nice, yeah. I think this work is definitely going in that direction, right.

There's a question from the audience. The question is as follows: for graph RAG, could it be possible that retrieved nodes and edges are mutually exclusive and don't actually form connected subgraphs? If so, is there a methodology to ensure that the subgraph is connected?

Yeah, this is a very good question. I think this is, again, completely arbitrary:

depending on the task you want to solve, sometimes you may want a directed subgraph, sometimes an undirected one — there are many ways to retrieve a subgraph. A minimum spanning tree is the simplest, but then the output size cannot be controlled: it is tied to the number of nodes. Here we wanted something smaller, so I would say everything is possible as long as you know the task you want to optimize. In our case we wanted a small subgraph we could extract quickly — something better than the minimum spanning tree — and this is why we used a Steiner tree. But again, you can use a different subgraph-extraction method that fits your objective; there is no problem with that. You just need a process to extract a subgraph — that's very important — and this process has to run in linear time, because if you do that on

large graphs — and at the end, again, we are talking about a product — you want to do it on the fly, so you want something very, very fast. That is, I think, the only condition. Of course, a linear-time approximation is never as precise, but if you take enough nodes and edges it should be good enough for your application.

Yeah, that makes sense. Thanks for all the detailed answers, and thanks for the wonderful talk. I think it's really exciting how all these technologies come together — with graph RAG and graph vector databases, I think we're on the cusp of hopefully the kinds of breakthroughs you talked about. Thanks for the talk.

Sure, thank you very much.

Graph Machine Learning Conference: P03: Integrating Large Language Models with Graph Neural Networks

In this session, we study how to combine large language models with graph neural networks to tackle reasoning tasks on text-attributed graphs. We explore two complementary approaches: using the knowledge of large language models to boost the performance of graph neural networks, and using graph structure to constrain large language models and reduce their hallucinations.


Introduction

It is a pleasure to introduce Professor Xavier Bresson. He is an associate professor at the National University of Singapore, a pioneer of graph deep learning with several highly cited papers; he has organized many conferences and tutorials and is a strong supporter of the Learning on Graphs conference.

Welcome, everyone. I am delighted to discuss the combination of graph neural networks and large language models with you. This is joint work with collaborators including Xiaoxin He, Bryan Hooi, Thomas Laurent, and Yann LeCun.

Here is the outline of the talk. First, I will introduce large language models. Then I will pose a question: in the LLM era, do we still need graph neural networks? I will review the strengths and weaknesses of LLMs and GNNs to identify scenarios where combining the two can pay off. Concretely, I will present two works: the first uses LLMs to improve GNN reasoning, and the second uses GNNs to improve LLM reasoning. I will then conclude.


Large Language Models and Graph Neural Networks: Strengths and Limitations

We are living through an exciting deep learning revolution. Computer vision was transformed in 2012 with the introduction of AlexNet. The revolution in natural language processing came a bit later, arriving in 2022 with the Transformer architecture and large-scale pretrained models, which gave birth to the new generative-AI industry.

These large language models, known as foundation models, are trained on massive amounts of Internet data. The Internet itself is a network of web pages and links, which means that during training, LLMs have already been exposed to a great deal of graph-structured data and have learned relations between pieces of text.

A natural question is therefore: for reasoning over graph-structured data, do we still need dedicated graph neural networks, given that LLMs seem to have "seen everything"?

To investigate this question, we focus on text-attributed graphs, in which every node and every edge is associated with text. We used a popular dataset, OGBN-Arxiv, a network of scientific papers where the task is to predict a paper's category from its title and abstract.

The experimental results are as follows:

  • A standard GNN (with bag-of-words features) achieves about 70% accuracy.
  • The SOTA model at the time (combining a language model with a GNN) reached about 76.6% accuracy.
  • Zero-shot prediction with the top LLM at the time (GPT-3.5) achieved about 73.5% accuracy.

The good news is that the LLM did not beat the GNN-based SOTA model. This was a relief: in the LLM era, we still need GNNs. Interestingly, the OGBN-Arxiv dataset was most likely already part of GPT-3.5's training data, meaning the LLM had in some sense "seen" the test set, yet its performance was still limited.


Combining LLMs and GNNs: Ideas and Challenges

To combine the two effectively, we need to be clear about their respective strengths and limitations.

Strengths and limitations of large language models

  • Strengths: they model language distributions precisely, hold a huge knowledge base covering human knowledge on the Internet, have strong zero-shot ability, and their performance keeps improving with scale.
  • Limitations: they suffer from hallucination, have limited logical reasoning ability, are weak at graph reasoning tasks, and prompt engineering strongly affects their results.

Strengths and limitations of graph neural networks

  • Strengths: they excel at reasoning tasks on graphs by aggregating multi-hop neighborhood information via message passing, apply to many modalities (physics, biology, chemistry, ...), and have achieved breakthrough success in some domains (e.g., AlphaFold).
  • Limitations: there is no "graph foundation model" comparable to those in NLP or CV, training data is relatively small, hardware is poorly optimized for sparse operations, parameter counts are typically in the millions (not billions), and industry has not yet found a killer application that makes money at scale.

Combining LLMs and GNNs would mean developing a joint text-and-graph foundation model. This is an attractive but very challenging idea, mainly because of the huge imbalance in knowledge and model scale between the text world (LLMs) and the graph world (GNNs).

A more pragmatic approach is therefore to tailor the combination to a specific task. We can:

  1. use the broad knowledge of LLMs to improve GNN performance on small text-attributed graphs; or
  2. use graph structures such as knowledge graphs to constrain LLMs, making their responses more precise and less prone to hallucination.

Next, we present two concrete works that follow these two ideas.


Work 1: TAPE - Using LLMs to Enhance GNN Reasoning

The previous section discussed why combination is needed; this section looks at the first concrete approach: using LLM knowledge to enhance GNN reasoning on text-attributed graphs.

The core idea of this work is to use the knowledge of LLMs to improve the quality of node features in a text-attributed graph. With better node features, the GNN can make more accurate predictions.

The key question is: how do we extract information relevant to a specific graph task from the LLM?

Our solution is to prompt the LLM. We ask the LLM not only to make a prediction but also to provide its reasoning, i.e., an explanation.

The concrete pipeline is as follows (a code sketch follows the formula below):

  1. For a node (e.g., a paper), we feed its textual features (title, abstract) to an LLM (such as GPT-3.5), prompting it to predict the category and explain why.
  2. We convert the resulting explanation text, the prediction text, and the original text into vector representations using a smaller open-source language model (such as BERT).
  3. We add a lightweight MLP head on top of this small language model and fine-tune it on the training set, so that its output vectors better serve the downstream classification task. The fine-tuned vectors are the enhanced node features.
  4. Finally, we train an arbitrary GNN model on these enhanced node features.

Technical detail: we combine a closed-source LLM (e.g., GPT, which is powerful but only returns text) with an open-source LM (e.g., LLaMA, whose internal vectors are accessible). The closed-source LLM generates high-quality textual explanations; the small open-source model converts this text into fine-tunable, task-relevant vectors. Today, in 2024, we can fine-tune larger open-source LLMs (such as LLaMA-2) with techniques like LoRA in place of the earlier small model to get better results.

Formula
For node i, the enhanced feature h_i' is obtained as
h_i' = MLP_\theta( LM_\phi( Prompt(Text_i) ) )
where Prompt(Text_i) is the prompt given to the closed-source LLM, LM_\phi is the open-source language model, and MLP_\theta is a lightweight multi-layer perceptron for task adaptation.
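A minimal sketch of this pipeline in PyTorch, assuming hypothetical helpers `ask_llm` (a call to the closed-source LLM) and `lm_encode` (the frozen open-source LM); both are stand-ins for illustration, not TAPE's released code:

```python
import torch
import torch.nn as nn

def ask_llm(text: str) -> str:
    """Placeholder for a closed-source LLM call (e.g. GPT-3.5)
    that returns a prediction plus an explanation."""
    return f"Prediction and explanation for: {text}"

def lm_encode(text: str, dim: int = 768) -> torch.Tensor:
    """Placeholder for LM_phi: a small open LM (e.g. BERT) mapping text to a vector."""
    torch.manual_seed(hash(text) % 2**31)  # deterministic stub embedding
    return torch.randn(dim)

class TaskAdapter(nn.Module):
    """MLP_theta: maps frozen LM embeddings to task-adapted node features."""
    def __init__(self, dim=768, n_classes=40):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_classes))
    def forward(self, h):
        return self.mlp(h)

text_i = "Title: ... Abstract: ..."          # node i's raw text
h_i = lm_encode(ask_llm(text_i))             # LM_phi(Prompt(Text_i))
logits = TaskAdapter()(h_i)                  # MLP_theta, fine-tuned on the train set
# h_i (or an intermediate layer of the adapter) becomes the enhanced feature
# on which any GNN is then trained.
```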

Experimental results
On the OGBN-Arxiv dataset:

  • Baseline GNN (skip-gram features): 70% accuracy, about 4 minutes of training.
  • SOTA at the time (GLEM): 76.6% accuracy, about 9.2 hours of training.
  • Our TAPE method: 75.5% accuracy, about 3 hours of training.
    TAPE strikes a good balance between accuracy and efficiency, and at one point topped the leaderboard. To rule out data-leakage concerns, we built a new dataset, TAPE-Arxiv 2023, and the conclusion still holds.

Summary of this work
We can use the knowledge and reasoning ability of LLMs to enhance the node features of text-attributed graphs and adapt them to a specific task. The method is not end-to-end: it first generates high-quality features independently and then trains the GNN, which makes it very stable and efficient. It also combines the strengths of closed- and open-source LLMs, and offers some interpretability.


Work 2: G-Retriever - Using GNNs to Enhance LLM Reasoning

The previous section showed how LLMs can empower GNNs; this section explores the reverse direction: using graph structure to empower and constrain LLMs, improving their performance on graph-related question answering.

The core idea: LLMs are knowledgeable but easily hallucinate or err under poor prompting. We can use a text-attributed graph (such as a knowledge graph) to constrain the LLM's response space, making its answers more focused and accurate.

The key question is: how do we extract the relevant information from the graph and force the LLM to attend to it?

Our solution is to feed graph information to the LLM as additional input tokens. We use two kinds of tokens (sketched in code below):

  1. Graph-encoder token: a GNN encoder processes the retrieved subgraph; pooling plus an MLP yields a single vector summarizing the graph's topology and features.
  2. Textualized-graph tokens: the subgraph structure is described in natural language (e.g., "Graph G contains edges: A->B, B->C") and fed in as a text sequence.
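A rough sketch of both token types, assuming PyTorch Geometric; the dimensions and module names are illustrative, and the actual G-Retriever implementation in PyG may differ:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphTokenEncoder(nn.Module):
    """Token type 1: GNN -> pooling -> MLP projector into the LLM's
    embedding space (dims are illustrative assumptions)."""
    def __init__(self, in_dim=768, hidden=1024, llm_dim=4096):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.proj = nn.Sequential(nn.Linear(hidden, llm_dim), nn.ReLU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, x, edge_index):
        h = self.conv2(self.conv1(x, edge_index).relu(), edge_index)
        batch = torch.zeros(x.size(0), dtype=torch.long)  # single graph
        g = global_mean_pool(h, batch)                    # (1, hidden)
        return self.proj(g)                               # one soft "graph token"

def textualize(node_names, edges):
    """Token type 2: the subgraph rendered as plain text."""
    return "Graph edges: " + ", ".join(f"{node_names[u]} -> {node_names[v]}"
                                       for u, v in edges)
```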

Textualizing a graph has challenges, however: the representation is not unique, it lacks invariance, and for large graphs it may exceed the LLM's context window.

We therefore propose a four-step framework, G-Retriever (a sketch of the final assembly follows this list):

  1. Graph retrieval-augmented generation: first, retrieve from the (possibly very large) text-attributed graph the subgraph most relevant to the user query. This solves the scalability problem.
  2. Token computation: compute the graph-encoder token and the textualized-graph tokens.
  3. Response generation: concatenate the user query, the graph-encoder token, and the textualized-graph tokens, and feed them to the LLM to generate the answer.
  4. Joint fine-tuning: fine-tune the LLM's parameters with LoRA while training the GNN encoder's parameters. Because LoRA updates only a small number of parameters, the whole model can be trained with the limited resources available in academia.
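A hedged sketch of steps 3 and 4, using HuggingFace `transformers` and `peft`; the model name, target modules, and helper function are assumptions for illustration, not the paper's released code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "meta-llama/Llama-2-7b-hf"   # illustrative backbone choice
tok = AutoTokenizer.from_pretrained(name)
llm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
# Wrap the frozen LLM with LoRA adapters; only these small matrices train.
llm = get_peft_model(llm, LoraConfig(r=8, lora_alpha=16,
                                     target_modules=["q_proj", "v_proj"],
                                     task_type="CAUSAL_LM"))

def build_inputs(graph_token, query, graph_text):
    """graph_token: (1, 1, llm_dim) soft token from the GNN encoder above."""
    ids = tok(graph_text + "\nQuestion: " + query, return_tensors="pt").input_ids
    text_emb = llm.get_input_embeddings()(ids)          # (1, T, llm_dim)
    return torch.cat([graph_token, text_emb], dim=1)    # prepend the graph token

# During training, gradients flow into the LoRA adapters *and* the GNN
# encoder, while the base LLM weights stay frozen.
```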

Graph retrieval-augmented generation in detail

  1. Indexing: encode the text features of all nodes and edges with a pretrained language model and store the vectors in a graph database.
  2. Retrieval: encode the user query as a vector too, and retrieve the most relevant nodes and edges from the graph database by vector similarity (e.g., cosine similarity), forming a "noisy" subgraph (see the retrieval sketch below).
  3. Refinement: because the retrieved nodes may be disconnected, we use a Steiner tree algorithm which, given node weights (importance) and edge costs, finds a connected subgraph with high total weight and low total cost. We modified it to account for both node and edge importance, extracting a small, connected, and relevant subgraph.
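A sketch of the retrieval step (step 2); the Steiner-tree refinement of step 3 is only summarized in the comment, since a full solver does not fit in a few lines. Function names are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_emb, node_embs, k=10):
    """Step 2 sketch: rank nodes by cosine similarity to the query.
    query_emb: (1, d); node_embs: (N, d).
    (Edge retrieval works the same way on edge-text embeddings.)"""
    sims = F.cosine_similarity(query_emb, node_embs)  # (N,)
    return sims.topk(k).indices   # indices of the k most relevant nodes

# Step 3 then runs a (Prize-Collecting) Steiner Tree solver, using the
# similarity scores as node prizes and edge costs that penalize irrelevant
# edges, returning a small *connected* subgraph instead of isolated hits.
```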

Training and results
We built new benchmark datasets to evaluate such tasks. The results show that G-Retriever outperforms baselines that use only the LLM, only the GNN, or only a LoRA-fine-tuned LLM. It significantly reduces LLM hallucination, and our analysis shows that the contributions of the graph-encoder token and the textualized-graph tokens are complementary and equally important.

Summary of this work
To unlock LLMs on graph reasoning tasks, graph information must be given to the model as tokens, both vectorized and textualized. Combining the LLM, the GNN, and graph retrieval-augmented generation achieves superior performance. The method is efficient, scalable, and effectively mitigates hallucination.


Summary and Outlook

In this session, we studied how to combine large language models with graph neural networks to tackle complex reasoning tasks on text-attributed graphs.

We first analyzed the strengths and weaknesses of each, then presented two complementary paradigms:

  1. TAPE: uses the broad knowledge of LLMs to generate explanations that enhance GNN node features, improving prediction accuracy.
  2. G-Retriever: uses GNNs and graph retrieval to extract relevant information from a knowledge graph and feeds it to the LLM as additional constraining input, producing more accurate, less hallucinated responses.

Both methods demonstrate the great potential of cross-modal combination. Looking ahead, a natural extension is to incorporate visual information and build text-graph-vision multimodal foundation models. Industry trends (such as the evolution of search engines: PageRank -> Word2Vec -> BERT -> RAG -> Gemini) also suggest that deeply integrated, learnable systems combining graph neural networks and language models will be an important direction for next-generation intelligent systems.

Although building a general graph foundation model remains challenging, by combining the strengths of LLMs and GNNs in a targeted way, we have already taken a solid step toward more powerful and reliable graph machine learning systems.

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p04 P04_Day_4-Alden_Hung_keynote__Neural_Algorithmic_Reasoning_tutorial__orals -BV1k9pAzpE8S_p4-

Can you hear me? — Yes. I can do a quick introduction and then you can get started, I think.

Hello everyone, welcome to Day 4 of the LoG conference, and thank you for your participation so far. Today we have a keynote speaker to kick-start the day, and we are very thankful to Dr. Alden Hung, who is a principal research scientist at Isomorphic Labs and a core contributor to AlphaFold 3, and who will deliver a keynote on this very topic. Dr. Alden Hung has a background in both neuroscience and computer programming, having won many competitions along the way; he obtained his PhD from Johns Hopkins and a postdoctoral fellowship from the NIH. About nine years ago he moved to DeepMind to work on reinforcement learning, and now he has set his sights on drug discovery with the AlphaFold team. Without further ado, I will let Dr. Alden Hung take over from here. Thank you for giving a talk here. — Thank you, thank you. Hi,

everybody. I'm Alden Hung, and I'm representing Isomorphic Labs today to present our work on rational drug design with AlphaFold 3.

At Isomorphic Labs, our goal is to advance human health by reimagining drug discovery with the power and pace of artificial intelligence, and in this talk I'm going to show how AlphaFold 3 has been the main driving force at Isomorphic Labs over the last few years.

So how can we use artificial intelligence to redefine drug design?

First, a little background. We all love the exponential growth of technology: everybody here is familiar with Moore's Law, things getting faster and faster on the hardware side, with GPU performance growing exponentially as well. On the other side, things are very different in the drug discovery industry.

On the upper left-hand side — if you can see my cursor — this is the number of drugs approved every year, and you can see it is not a large number: on the order of a few dozen drugs approved per year. Meanwhile, the cost of bringing each drug to market keeps increasing year over year.

So there is Eroom's Law — Moore's Law reversed — for the drug discovery industry, which says it is getting exponentially more expensive to bring a new drug to market.

So can we use artificial intelligence to make this better?

Let me quickly introduce the bottlenecks in drug discovery.

Here we see what a standard drug discovery pipeline looks like, from left to right. The starting phases are called target discovery and target validation, which sit on the biology side: identifying which protein or molecular target we should aim at for intervention. In the middle is the chemistry side, called hit identification and lead optimization: once you have identified the target, how do you find molecules that interact with and intervene on the target protein? The later stages are the preclinical and clinical trials.

There are various bottlenecks here, and I will focus on bottlenecks one and two today. Bottleneck one is understanding the disease biology: can we use AlphaFold to understand the biology better? And even more importantly, bottleneck two: can we use AlphaFold to help us design better molecules? There are other bottlenecks, but I will focus on these two today.

but I will focus on one in two today。So one thing I would like to emphasize is structure prediction。

Str biology is a key enabler in drug discovery。 There's two part of this。

The first part is the structure prediction make us to have a better understanding of the biology。

So, for example, for in this particular example, we are seeing how the protein is interacting with the DNA here so we can know what is the function looks like And also more so we can also use the structure prediction to help us doing this we call the rational drug design。

Which means if we understand the 3D structure of the target protein like here, this is the。

The gray area here with this cavity is a protein with the empty space we call the pocket。

And you can starting from a small part of this ligan, which is a small molecule。

You can gradually grow it into a bigger part。And eventually。

growing into a full drug like ligan that is binding to this target protein and stopping intervening the function of this protein。

We published our latest model, AlphaFold 3, earlier in the year; it is a joint collaboration between colleagues at Isomorphic Labs and Google DeepMind. It is groundbreaking work building on the legacy of AlphaFold and AlphaFold 2, and this new model is able to predict the structure of not just proteins but also DNA, RNA, and — very importantly for drug discovery — small molecules.

Technically, it is a diffusion model, and it sets a new state of the art across all these different metrics. I will next go deeper into the details of how AlphaFold 3 makes this work.

So, an overview of AlphaFold 3. Let's start by reviewing what AlphaFold 2 is. AlphaFold 2 was the first foundation model for biology. It works as follows: given a sequence of amino acids — as you can see on the left, just a string of characters — the model generates the 3D coordinates of every single atom of the protein structure from this sequence alone. It can also use a genetic database search via multiple sequence alignment, which is very important information for folding the protein. That is AlphaFold 2. The limitation is that it only works for protein sequences, and in fact only for a single protein chain.

In the year after AlphaFold 2's first release in 2021, my colleagues released AlphaFold-Multimer in 2022; its extra advantage is that it can fold multiple protein chains, as you can see in the middle panel. And earlier this year we published AlphaFold 3, which, as mentioned, can fold all kinds of biomolecules — DNA, RNA, proteins, and any small molecule, as long as you can provide the molecular graph.

Okay, so at a very high level — the TL;DR: AlphaFold 3 is a diffusion model. AlphaFold 3 can fold all these different kinds of biomolecular entities. AlphaFold 3 produces better results on protein–ligand co-folding than classical docking algorithms, which I will discuss in more detail later. It also gives better protein–protein folding predictions — in particular, it reaches much better results on antibody–antigen interactions compared to AlphaFold 2 — and it works very well for nucleic acid folding.

Here is an overview of the AlphaFold 3 model. I will not go through every single block on this page, but I will convey some high-level ideas about how the model works in the next few slides.

First of all, the model takes input as in AlphaFold 2, but now, besides protein sequences, it can also take other input modalities — for example ligand molecules, and also DNA and RNA. It is a conditional generative model; a diffusion model, to be specific.

The conditioning works like this: the input goes through featurization, which also includes running a multiple sequence alignment for each protein or DNA/RNA sequence as extra information for the model. Then there is the trunk part of the computation, which generates a trunk embedding, and this trunk embedding is the input to the diffusion module — the head of the model that performs the diffusion-style generation.

Very interestingly, we also have a confidence head. This is something we already had in AlphaFold 2, and it is still very helpful in AlphaFold 3. This head takes both the embedding from the trunk and the sampled structure, and its job is to predict how good the generated structures are, so it can give the user a sense of how confident the model is. I will go into more detail about how this confidence head works later.

Okay, so in comparison with AlphaFold 2 — if you are familiar with it — what is new in AlphaFold 3 and why does it work better? This is a diagram from the AlphaFold 2 paper. The main change is that we replaced the middle part of the model, which used to be the Evoformer, with a simpler and more uniform architecture called the Pairformer, which I will describe in more detail later. The second important change is that we replaced the very complex structure module, which did fancy equivariant computation, with a much simpler diffusion head — and it turns out to work quite well.

Okay, one thing I would like to detail is the computing unit: what is the token, and how do we represent the information? This is somewhat like a language model with tokens, but it is really more like a graph — a fully connected graph. Each token can also be thought of as a node in a graph.

So what is a token in AlphaFold 3? It can be a residue, i.e., a chunk of atoms in a standard protein, DNA, or RNA. As you may know, DNA and RNA have a standard vocabulary of bases, and there are 20 standard amino acids for proteins, so in those cases you can encode things compactly: each node is a single residue. But for a general ligand — any general molecule — you cannot use such a vocabulary anymore, because there is a vast number of different molecules. There we decided the best approach is to encode things at the single-atom level, so that any possible molecule can be expressed. You therefore have a hybrid description: some nodes represent a larger chunk of atoms, and some nodes represent a single atom.

The computation operates on a 2D tensor — number of tokens by number of tokens by channels. One way to think about this is as a fully connected graph, with a lot of computation happening on that fully connected graph. Okay, so just to give an example:

what do I mean by a residue as a token? For example, glycine, a standard amino-acid residue: that single token covers multiple atom positions. But take ATP, a very common molecule in the body — because it is not in the standard vocabulary, it is flattened out and represented with one token per atom. So this ATP molecule of thirty-something atoms spans thirty-something tokens in our representation.
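To illustrate the pair representation just described — an N × N × C tensor over the fully connected token graph — here is a tiny PyTorch sketch; the sizes and the outer-sum initialization are illustrative assumptions, not the AlphaFold 3 implementation:

```python
import torch
import torch.nn as nn

num_tokens, c_single, c_pair = 5, 64, 128        # illustrative sizes
single = torch.randn(num_tokens, c_single)        # one feature vector per token

# Pair representation: an N x N x C tensor over the fully connected token
# graph. A common way to initialize it is an "outer sum" of projected
# per-token features, so entry pair[i, j] carries everything the model
# believes about the relation between tokens i and j.
proj_i = nn.Linear(c_single, c_pair)
proj_j = nn.Linear(c_single, c_pair)
pair = proj_i(single)[:, None, :] + proj_j(single)[None, :, :]
print(pair.shape)  # torch.Size([5, 5, 128])
```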

Okay, now let's move on to featurization. As I mentioned, we take all the inputs — the sequences, the ligands, and other information — and run some further computation. We do template matching, to see whether there are templates we can retrieve from a database to help the prediction. One very important information source is the genetic search, which returns the multiple sequence alignment — a very strong and helpful signal telling you which two protein residues are co-evolving and may be in contact with each other. For the small molecules, we generate free-floating conformers, which serve as a good reference signal for what those molecules look like, and we fold them together with the protein. That is how we featurize the input.

Then the input goes to the trunk, the part we call the Pairformer,

and the Pairformer does two types of computation. One is the triangular update, which updates the inputs and outputs of a particular node — every signal flowing into the node and every signal flowing out to the other nodes. The other is axial attention: because this is a very large 2D matrix, we cannot do standard transformer attention over all N-squared tokens, so we choose a subset of tokens — every single row, or every single column — and attend within that subset. This is repeated for multiple blocks. It is basically a simplification of the Evoformer, if you are familiar with it from AlphaFold 2.
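A minimal sketch of the axial-attention idea on a pair tensor (the triangular updates are omitted); shapes are illustrative, and this is not the AlphaFold 3 implementation:

```python
import torch
import torch.nn as nn

# Instead of attending over all N^2 pair entries at once, each row of the
# N x N grid attends only within itself; a second pass does the same per column.
N, C, heads = 5, 128, 4
pair = torch.randn(N, N, C)
attn = nn.MultiheadAttention(C, heads, batch_first=True)

rows, _ = attn(pair, pair, pair)   # dim 0 acts as the batch: per-row attention
pair_t = pair.transpose(0, 1)
cols, _ = attn(pair_t, pair_t, pair_t)            # same trick, per column
pair = rows + cols.transpose(0, 1)                # combine both axes
print(pair.shape)  # torch.Size([5, 5, 128])
```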

Now we get down to the diffusion module, which is a head on top of the trunk. During training, the diffusion module receives the noised ground truth; at inference time it starts from random Gaussian noise. It takes the trunk embedding as input, and this trunk information guides the diffusion head to denoise the noisy input toward the ground-truth target. It is trained with standard diffusion-model training. One interesting thing to mention: we are not doing any kind of equivariant diffusion here; we just do something very simple — randomly rotating and translating the target as data augmentation — and it turns out to work just fine.

As for the losses of the diffusion model: the first part is a mean-squared-error loss on the atom coordinates — the standard diffusion loss, where you denoise your current noisy input toward the target and compute the MSE. We also have other loss terms which turn out to be very helpful, even though they may not be fully physically grounded: an extra weight on the bound ligand, to make sure its atoms end up in the correct location, and an LDDT loss, which cares about the distances between pairs of atoms.
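A sketch of how these loss terms could combine, with made-up weights (the real AlphaFold 3 losses are more involved; this only illustrates the structure just described):

```python
import torch

def diffusion_loss(pred, target, ligand_mask, ligand_weight=10.0):
    """Illustrative combined loss; pred/target are (A, 3) atom coordinates."""
    # 1) Standard denoising MSE on atom coordinates, with bound-ligand atoms
    #    upweighted so the model gets the binding pose right.
    w = ligand_mask.float() * (ligand_weight - 1.0) + 1.0       # (A,)
    mse = (w * ((pred - target) ** 2).sum(-1)).mean()

    # 2) An LDDT-style term: penalize errors in pairwise atom distances,
    #    which cares about local geometry rather than global placement.
    d_pred = torch.cdist(pred, pred)
    d_true = torch.cdist(target, target)
    dist_term = (d_pred - d_true).abs().mean()

    return mse + dist_term

atoms = 30
loss = diffusion_loss(torch.randn(atoms, 3), torch.randn(atoms, 3),
                      torch.rand(atoms) > 0.7)   # mask marking ligand atoms
```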

Okay, now let's move on to AlphaFold 3 training. The full training loop looks like this: we have the network, the diffusion inference here, and the ground truth here. With the ground truth and the denoised output from the diffusion model, we compute a loss — the loss I mentioned on the previous page, i.e., the mean-squared-error loss and the two other losses. Another part of the loss is on the confidence module. The confidence module predicts, about the model itself, how good the generated molecular structures are. It has three parts: the pLDDT, which was already in AlphaFold 2, and two other terms which are quite similar to each other — the predicted distance error and the predicted aligned error. Both of them estimate: how large is the error in the predicted distance between two tokens?

Here we show a little of what training looks like. Training takes on the order of two days, and then we do some extra fine-tuning. In the fine-tuning stage we increase the crop size. One thing we have to do when training AlphaFold is crop the input data, because the biomolecules we use can be really large — on the scale of several thousand tokens — and at training time we cannot fit such long token sequences on our current hardware. So we crop the input into smaller parts, and it turns out this works really well: we can learn from part of the input structure — like learning from part of an image — and it generalizes fine when we run inference on the full structure.
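A sketch of the simplest, contiguous variant of such cropping (real pipelines also use spatial crops around interfaces; names and sizes are illustrative):

```python
import torch

def random_crop(tokens, coords, crop_size):
    """Take a random contiguous window of tokens so the example fits on the
    accelerator; at inference, the full structure is used instead."""
    n = tokens.size(0)
    if n <= crop_size:
        return tokens, coords
    start = torch.randint(0, n - crop_size + 1, ()).item()
    return tokens[start:start + crop_size], coords[start:start + crop_size]

tokens, coords = torch.randn(3000, 128), torch.randn(3000, 3)
t, c = random_crop(tokens, coords, crop_size=384)
```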

The main additions at this later fine-tuning stage are some extra losses, but most importantly we make the crop size bigger. In the figure on the right-hand side you can see how the different metrics behave. The y-axis is always the LDDT, which is how we measure structure quality, shown for different kinds of entities. The intra-ligand metric reaches a high value and plateaus quite early, but other metrics, like protein–protein interaction, converge much more slowly and actually benefit from fine-tuning with a larger crop. We use a weighted sum of all these metrics to decide the best time to save the final model.

Okay, one thing I mentioned very quickly on the previous page about the training losses, and which I want to highlight further here, is the "permute ground truth" step. This is a training detail, but quite interesting: this particular block, permuting the ground truth, is actually an important and tricky part of the AlphaFold design.

The idea is this: when you generate your samples, how do you match your prediction to the target? In this particular example there are two identical chains — we call that a homodimer — and there are also two ligands. When you denoise and generate your prediction, you need to permute the assignment between your samples and the targets to compute the correct loss. You cannot just assume that generated chain one matches target chain one; you need to run through all the permutations. The same permutation handling is also necessary for the small molecules: for example, in this particular molecule, with a benzene ring and an oxygen, atoms 0 to 5 are equivalent under these two different permutations. So one needs to be very careful to consider all possible permutations when computing the loss correctly.
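A sketch of the idea, brute-forcing over chain permutations (feasible for small numbers of identical chains; the actual implementation is more careful and also handles symmetry-equivalent atoms within a molecule):

```python
import torch
from itertools import permutations

def permuted_loss(pred_chains, true_chains):
    """With identical chains (e.g. a homodimer), the assignment of predicted
    chains to ground-truth chains is ambiguous, so take the minimum loss
    over all assignments."""
    best = None
    for perm in permutations(range(len(true_chains))):
        loss = sum(((p - true_chains[j]) ** 2).mean()
                   for p, j in zip(pred_chains, perm))
        best = loss if best is None else min(best, loss)
    return best

chains = [torch.randn(50, 3) for _ in range(2)]        # a 2-chain homomer
print(permuted_loss([c + 0.1 for c in chains], chains[::-1]))
```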

Okay, now I'll move on to some AlphaFold 3 results, which is probably the most exciting part. This is a full summary page of how the model performs on the ligand set, on nucleic acids, on covalent modifications, and on proteins; I will go into each subcategory over the next few pages.

First, let's focus on protein–ligand. In this category we rely on a dataset called PoseBusters, from a third-party paper. Here we compare AlphaFold 3 with a few other recent approaches — RoseTTAFold All-Atom, DiffDock, and some other methods — and also with classical docking methods like Vina or Gold. The y-axis is the percentage of poses with RMSD below 2 Å.

This measures how well you can place the ligand correctly in the protein pocket. The blue bar is the performance of AlphaFold 3; you can see it performs very well.

One thing I want to mention is that AlphaFold 3 is fundamentally different from all the classical docking methods and from some of the machine-learning methods like DiffDock, in the sense that those methods assume you already have the protein structure, and the only degrees of freedom are how you place your ligand within that already rigid protein structure. AlphaFold 3 does what we call co-folding: we generate the protein structure and the ligand together. So potentially — is there a question? no, sorry — this can even generate a protein with a ligand where, if you folded the protein by itself, there would be a cryptic pocket: the pocket only opens up when you run the model with the protein and the ligand together, and then you see the ligand going into the pocket nicely. Okay, so this is performing really well.

As you can see here, we overlay the model's prediction and the ground truth, and they overlap very well; I will show more examples later.

One thing I want to show as a highlight — or as directions for future work — is this set of quality-control checks people run in the PoseBusters benchmark. There are two things to highlight where AlphaFold 3 is not perfect yet. The first is tetrahedral chirality: when a molecule has an atom bonded to four other atoms, there is a chirality, and the question is whether your predicted molecule has the correct one. AlphaFold 3 does OK here but is not perfect, whereas an approach like DiffDock gets this right by construction; this is something we are trying to improve. The second is the intermolecular distance between protein and ligand: when you fold the structure, do the protein and ligand collide because they are placed too close to each other? AlphaFold 3 is not at 100% here, but very close. As for the performance on DNA and RNA —

these are new capabilities, and it performs really well, I would say. In general it is not better than an approach called AIchemy_RNA2, which uses human input, but as a pure machine-learning approach this is the state of the art. As you can see in this example, it nicely folds this DNA along with the protein and gives very reasonable confidence predictions, shown on the right.

I will explain a bit more about what the confidence prediction means in a later slide. Here I also want to highlight another new feature in AlphaFold 3 that was not in AlphaFold 2: covalent modifications. Proteins undergo post-translational modifications (PTMs): you usually have the 20 standard amino-acid residues, but sometimes, after the protein has been translated, it gets modified — for example, an extra phosphate group is attached to a residue. The original AlphaFold 2 could not model this, but in AlphaFold 3 our approach is to treat the modified residue like a ligand, so we have a way to handle these molecules. Here you can see a specific amino-acid residue which should correctly be SEP, while the standard one would be SER; the difference between the two, as you can see, is the extra phosphate group. AlphaFold 3 models this change correctly, and it is of key importance in this particular example: if you don't model it, you cannot place this peptide into the correct position, whereas AlphaFold 3 models the complex well and puts everything in the right place.

Okay, and finally I want to show protein–protein interaction prediction. This is when you have two or more protein chains and you want to model how they interact and connect with each other. This is, in general, much harder than folding a single chain, so it is a harder task. One particularly interesting result is that we can model antibody–antigen interactions much better. The way we do this well is by running the model multiple times — scaling at inference time. We simply use multiple seeds, running the model many times, each time with a different subsampling of the MSA. The key reason this works is the confidence head: the model can use it to choose among, say, 1000 generated samples, and it can indeed do so effectively. As you can see, as we draw more samples, we get better performance metrics.

And this is just an example showing that in this particular case we are able to fold an antibody with its antigen. Okay, and finally,

this part is about checking accuracy with the confidence metrics. This again shows the importance of these metrics, as I mentioned for antibody–antigen folding. The confidence is very important and very useful information for a chemist looking at the structure. For example, when we fold this particular structure of four different chains — one in magenta, one in orange, one in blue, and one in gray — the predicted-error matrix shows how confident the model is about the different parts of the chains. We see high confidence within each chain, which is not surprising: as mentioned earlier, it is always easier to fold a single chain correctly. The more interesting part is between chains — darker green means higher confidence. The model tells you: I am very confident about the relation between chains A and F, and also between chains C and D, but less confident about the relations between A and C, or A and D. And this actually makes sense, because this whole complex is floppy in the middle: the upper and lower halves can swing freely, so the model correctly indicates that it cannot be confident about those terms.

Okay, with all these descriptions of AlphaFold 3, I now want to move on to the application side: how, at Isomorphic Labs, we use AlphaFold 3 to enable rational drug design. One very important point is that we want to make sure this works on novel targets and on novel protein–ligand interactions. We don't want a model that just memorizes the dataset; we want it to generalize to new protein targets for which we don't yet have a structure, and to new molecules that have not been designed yet — our model should still be able to predict the structure of that novel protein with a novel ligand. And we actually use the model in the day-to-day discovery pipeline at Isomorphic Labs.

In this first example we show a novel target, not in the training set. In this kind of protein target there is usually more than one pocket, meaning a ligand folded with this protein could potentially go to different places. In this particular example, we correctly fold the ligand into the correct pocket of this protein, which is not in the training data.

Here I would like to show you a short video of another example of how we use the model. Sorry. This is an animation of a protein target called TIM-3. First we fold it by itself, and now you can see us folding it with a particular ligand. The blue part is the ligand, and the orange part is the chain of the protein. You can see that whenever we fold a different ligand in, the protein part changes slightly. This is what I emphasized earlier: what we do is co-folding — we do not keep the protein rigid and just dock the ligand onto it. This makes it possible for the whole complex to find the lowest-energy, most plausible state; it is not the case that you fold a rigid protein first and then just dock your ligand onto it.

Here I want to highlight another very cool example: a very complex ternary structure. As you can see, this is a really complex case — there are two protein chains here. In this view, the white is the ground truth, and the blue and pink are the structure predictions from the model; visually, they basically overlap extremely well. This is quite challenging, because it is a structure where you have two protein chains with a ligand sitting in the middle, interacting with both sides of the protein. We fold this very well, and one thing I would like to highlight with this example is that we can actually use the confidence metrics to correctly select the best folded result. In the lower-right scatter plot, the x-axis is the quality of the result — the protein–ligand pocket RMSD, lower is better — and the y-axis is the confidence score. We generate a bunch of different predicted structures, some better than others, but the top one — the most confident structure by the confidence score — is indeed the one with the lowest RMSD. So this is a really nice example of our confidence metrics helping us select the best possible samples.

Okay. We also have the AlphaFold Server, which is publicly available for everybody to use; this is a server launched by our colleagues at Google DeepMind. You can find it on the Internet, and feel free to leave feedback — I think a lot of people are actively using the web server. We also have the GitHub code, available for academic use, which you can find on GitHub. Okay,

let me now quickly go through some not-fully-solved areas and limitations of AlphaFold 3.

The first is conformational coverage, which is actually quite interesting. In this particular example, called Serbl here, we have two structures we call the apo and holo states: the apo state means you have the protein by itself, and the holo state means you have the protein together with a ligand — if you look closely, there is a ligand here. This is a case where the protein adopts different conformations: the silver one on the left-hand side is the ground truth, and the blue is the model prediction. It is a very hard example because, when there is no ligand around, the two sides of the protein are far apart from each other, but when the ligand comes in, the structure closes up and this part takes a different conformation. The current AlphaFold 3 cannot model that correctly: it models the ligand-bound case fine, but even when there is no ligand it still generates this wrong state.

This will be something to work on in the future. The other thing is hallucination, which is quite tricky. This happens with the diffusion model in AlphaFold 3; it was less of a problem in AlphaFold 2, because there we had other losses handling it. Say you have a protein where the ground truth is this particular part — both AlphaFold 2 and AlphaFold 3 model this part fine — but the protein also has large parts that are what we call disordered regions. AlphaFold 2 would correctly output them as what we call spaghetti, just loosely placed in space. AlphaFold 3, however, can hallucinate, like a language model does: it will try to fold them as if they had a defined structure. This is not ideal, as it can affect, say, protein–protein or protein–ligand interaction prediction, and it is something to think about how to deal with later.

Then there is another issue: we also observe that when running inference on large structures, the outputs sometimes overlap with themselves, or different chains overlap with each other. This may be related to the initial noise in the diffusion process, or there may be some other reason; it is something to keep looking at. Okay, so — what's next?

I'm going to try to wrap up the talk. The first thing I want to highlight again is how AlphaFold relates to graph machine learning. First of all, biomolecules are naturally graphs in physical space. As described earlier, the Pairformer, as you can see on the right, is basically a fully connected graph, and each token is either an atom or a group of atoms, i.e., a residue in the protein; what we are trying to do is model all these interactions.

The way AlphaFold handles this is by working on the fully connected graph, and that is probably the best thing we can do at the moment. The reason is that before you do any folding, you don't know which molecules — which atoms — are close to each other, so you cannot quite impose a local constraint. You cannot have a more meaningful, local physical constraint edge that says: I know this atom is close to that atom, and those two atoms are very far apart, so I don't need any edge computation there. Before folding, you should not have any prior on whether two atoms will be close or far apart. So the way we handle this is, I would say, a little bit brute-force: we just assume the graph is fully connected, so every token — every atom — can look at every other one. This turns out to work just fine, but the caveat is that it is very compute- and memory-heavy, and, as mentioned earlier, it means that at training time we need to crop to a smaller chunk. If there were a better, sparser representation, we could train on a bigger crop — or even on the full structure — and that would certainly be desirable, because it would remove the train/evaluation distributional shift.

And just to mention: protein–protein interaction, or any other kind of biomolecular interaction, like protein–DNA and protein–RNA interactions, can also be represented as a graph, and it would be very interesting to represent and understand these interactions that way.

There are some future directions we are pursuing beyond AlphaFold 3. The first is function prediction. Structure is an important thing, but it is not the end goal: we actually want to use the structure to help predict function. A very nice example is a paper called AlphaMissense, published by our colleagues at Google DeepMind, which uses AlphaFold 2 to predict which single-point mutations in a protein sequence may be causing a functional change of the protein. At Isomorphic Labs we want to build a full suite of machine learning models to help predict function and the modulation of function. We want to go all the way from the lowest level, at the highest resolution — starting from quantum chemistry and molecular dynamics: can we use this kind of simulated data as synthetic training data to get an even better AlphaFold-style structure prediction model? And once we have a good model for structural biology, how can we use it to build predictive interaction models, like a protein interactome? Can we go from there to predicting the biology — to a virtual cell — and eventually to predicting the toxicity or the effect of a drug on the human body or an animal model?

OK, in conclusion: we are very excited about the progress,

and we think understanding biomolecular structure is the key. We want to learn how the molecular machine works and, as shown earlier, to rationally develop modulators to change its behavior. Understanding structure is the key for drug discovery, and AlphaFold 3 has made this not just a dream — it's a dream come true. It actually works, and we can now fold all biomolecules: proteins, DNA, RNA, and ligands. This diffusion model works very well, trained on all the PDB data, and we showed that it works on novel protein–ligand pairs. We are actively using AlphaFold 3 day to day for rational drug discovery: we use it to predict novel protein–ligand co-structures, and this speeds up our drug discovery pipeline. We don't need to wait for a crystal structure — which usually takes at least months to verify a structure — we can just run the model and say with very strong confidence that the ligand goes into this particular pocket with this particular pose. Okay,

so at the end I would just like to share something about Isomorphic Labs. We are a very interdisciplinary company — you can see we have very diverse profiles of people — and we are always looking for passionate machine learning researchers and engineers to join us. As you can see, we have a very happy working environment. I will stop here; thank you for your attention.

Thank you for the talk. There are a few questions people have posted in the chat. The first one, which I think you might be expecting, is about the choice between equivariance and data augmentation: what were the main motivating factors for going with data augmentation versus equivariance? Is it something you tested in practice and just found to work better, or was it also somehow engineering-motivated? Basically, any thoughts on that?

Sorry, go ahead. — No, no, I was just saying: basically, whether you had any thoughts on that. — Yeah. Maybe I should just stay on this slide — it's probably not related to the answer. So yes, this is an expected question, as you say. At first, data augmentation was simply easier to implement, so we started with it. We actually did investigate using equivariant architectures — equivariant GNNs or equivariant transformers, that kind of thing — and in our experience it did not work better: it works as well, but not better. I had assumed it could potentially be more data-efficient, but I'd say in practice data augmentation works out fine and equally well, and for simplicity we just went ahead with the simpler design. Yeah.

The other question: one thing is that we don't actually have much data on how a protein evolves through time, so to say, compared to the amount of static structures we have. Are there good ways to also capture the temporal evolution of proteins — say, conformational changes during docking — directions along those lines, ways to model this well? — Can you state the question again? — The question is basically that one does not have easily available data for the actual dynamics and temporal evolution of proteins, but are there ways to go about modeling this? — Right, okay, so yes, that is

a very interesting topic, and I think also a very active one in the field. It is actually related to one of the limitations I mentioned earlier — not being able to model the two states correctly. Currently this is indeed tricky, because we mainly rely on the PDB dataset, which consists of crystal structures — static snapshots. And this dataset was not curated for machine learning: it represents 50 years of hard work by structural biologists, who deposited whatever they found important, so the database has some biases. That is one issue. One thing I find very promising, and that several groups are looking into, is how to combine machine learning approaches with molecular dynamics: can you use molecular dynamics to simulate different states, and learn from those states? I think one holy grail — a very hard problem — is whether you can have something like AlphaFold 3 as a generative model that actually generates the different conformational states with the correct Boltzmann distribution. That would be the next big breakthrough for protein modeling, yeah.

I agree. The last question is about transfer-learning capabilities when you use multiple modalities. Protein–nucleic-acid data is typically much scarcer than, say, single-protein or protein–protein interaction data. Did you actually see benefits compared to training a model from scratch on protein–nucleic acids, as opposed to this co-training between different modalities?

Right, I think this is a very good question, and actually something we have been looking into recently. I don't think we have very clear evidence — this is not part of the publication — but I would say there is good evidence that when we co-train on proteins, DNA, RNA, and ligands together, we get better results on all of the modalities. If we, for example, train a model only on nucleic acids, it performs OK, but worse than a co-trained model like AlphaFold 3. This is kind of interesting, because it is not entirely intuitive — they use different vocabularies — but I think co-training with the much bigger protein dataset probably moves the model into a better parameter space, which also helps the other modalities, yeah. — Thank you. I don't see any more questions.

But I guess we can wait for a couple of minutes just to see if anyone has anything. — Sure, yes. — Maybe not — but I would like to thank you for your time and also for the wonderful talk. — Thank you so much, and thank you everyone for your attendance. Thank you.

Okay, so we are now moving to the next session, which will be a tutorial session. I think we can have a five-minute break for people who need to grab a cup of coffee or tea; in the meantime we will set everything up for the tutorial. Just to make sure everything is working properly — Petar, can you hear me? — Hello. — Yeah. We are all in the same room, so only I am unmuted right now, but we will alternate during the talk. — That sounds just great. — Hello. Maybe you want to try sharing the screen, to check that everything is working as expected? — Of course, let's do it. Just a second. — How about this? — Yeah, perfect. Great. I think we can still wait five more minutes and then we can start. — Sounds good to me. — Okay, so I think we can start; it's 4 pm UK time, right on time.

So it's now time for the last tutorial of the season, titled Neural Algorithmic Reasoning, Part Two: From Graphs to Language.
This tutorial is a follow-up to the previous tutorial at LoG 2022,
and I'm very excited to see the progress the field has undergone in this relatively short time span.

For this tutorial, we have a big list of brilliant speakers。

Petar Veličković is a staff research scientist at Google DeepMind and an affiliated lecturer at the University of Cambridge.
Olga Koslova is a software engineer at Google DeepMind.
Federico Barbero is a third-year PhD student at the University of Oxford under the supervision of Michael Bronstein.

Larisa Markeeva is a research engineer at Google DeepMind, and Alex Vitvitskyi is also a research engineer at Google DeepMind,
and last but not least, Wilfried Bounsi is also a research engineer at Google DeepMind. Remember,

these sessions are live-streamed on both Zoom and YouTube; the sessions are recorded and will be made available on YouTube afterwards.

Please interact with the speakers using the QA tool from Zoom or by dropping questions on our Slack channel。

Now without further ado, let's welcome the speakers. Please, Petar. All right,

thank you very much Steve for the kind introduction, and thank you so much to everybody on the LoG conference organizing committee for giving us the platform to talk to you all about neural algorithmic reasoning for a second time. After a two-year break, a lot of really exciting things have happened, and we could not be more excited to share all of these great advancements with you all. My name is Petar Veličković, I'm a senior staff research scientist at Google DeepMind and an affiliated lecturer at the University of Cambridge, and I will be kicking us off. Today we have a lot of exciting things in store for us, so let's dive right in. The first part of the tutorial will be about neural algorithmic reasoning on graphs, and it will be given by myself and Olga Koslova, who will give the hands-on tutorial aspect of this work later.

Before we dive in, just a note: thank you all so much to all attendees who are joining us along on this journey. We would just like to call out, and this was hopefully communicated to all the LoG attendees, that this tutorial does have a light prerequisite on the previous tutorial that was given at LoG 22, for which you can find all the materials on our website; we will also upload all of the materials of this tutorial to the same website sometime after it's finished. It's not a must, it's a light prerequisite, but it does help a little bit with the understanding. Another thing that we recommend for the later Colab session is, if you don't already have a Hugging Face token, to please create one; it will allow you to more efficiently go through the tutorial in real time. But once again, don't worry if you haven't done this, you will be able to follow along and all the materials will be available for you to reproduce later.

As Steve mentioned, there are many modalities in which you can ask questions to us, but especially because there are so many of us and we are actually all in the same room right now,

we will prefer asynchronous questions so all of us will be monitoring the Slack channel and whenever we see a question that's like appropriate for one of us to answer we will do that so we highly encourage you if you have a question for any of the speakers at any point。

just raise it in the Slack channel, the tutorial discussionsions channel and one of us will be with you shortly to answer the question and lastly but not least if you enjoyed it or did not enjoy the tutorial and you have any feedback for us please remember to give it at the end I believe Michelle will circulate the link in the Slack after the tutorial is over we would highly appreciate it and it will mean a lot for us when we prepare future iterations of this tutorial。

Okay, so let's get stuck into this. Let's talk a little bit about neural algorithmic reasoning, and specifically I would like to kick off with a few generic words on all of the key building blocks of the term neural algorithmic reasoning, so that hopefully, even if you haven't come across this term before, you can get a feel for what it is that we are trying to do here. So first, maybe let's dive into algorithms. Already "algorithm" is something that, when I was going to high school, I thought was a super well-defined term, but then I went into the textbook Introduction to Algorithms, which is the authoritative text for this topic, and I saw how an algorithm is defined there, and as you can see it's not really well defined even in the authoritative textbook. It is merely mentioned as any well-defined computational procedure that takes some value as input and produces some value as output; there's not really a lot more to go on. In fact, they also mention

Points like it's thus a sequence of computational steps that transforms the input into the output and it can be specified in many ways in English as a computer program。

hardware design, and so on, the only requirement, whatever that is is that the specification must provide a precise description of the procedure that you're trying to follow。

So this doesn't give us a lot to go on, so maybe it's even better if we ground it in a particular example, and one classical example of an algorithmic problem for which we design algorithms is the sorting problem. To be more specific, in the sorting problem you are given as input a sequence of n numbers a_1, a_2, ..., a_n.

And what you're asked to produce as output is any permutation of these elements。

so this will be a_1', a_2', ..., a_n', which is a permutation of the original elements such that, for some defined inequality (less than or equal), these elements are arranged in sorted order. This is a very fundamental task which is a building block of many other things we want to do in computer science, and it should come as no surprise that it has been studied a lot by previous literature, and conversely there are many algorithms that solve this problem.

One algorithm that you might have heard of, one of the first algorithms typically encountered in an undergraduate algorithms course, is insertion sort. This is an algorithm that gradually sorts the list by scanning it left to right; at every step it looks at the current element and inserts it into the partially sorted list up to that position, and this is repeated one step at a time until the list is sorted. This is something you can nicely visualize with a step-by-step trajectory of how the algorithm works. So here we have an input list 5, 2, 4, 3, 1, and gradually, as insertion sort goes left to right, it readjusts these pointers,
the green pointers of the list, such that eventually you end up with the list 1, 2, 3, 4,
5, and it's a well-defined computational procedure for which you can prove that it will necessarily terminate with the final list being sorted.
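As a quick illustration, here is a minimal Python sketch of insertion sort as just described (the variable names are purely illustrative; this is not the CLRS implementation):

```python
def insertion_sort(a):
    """Sort the list in place by scanning left to right and inserting each
    element into the already-sorted prefix, as described above."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger elements of the sorted prefix one slot to the right.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

print(insertion_sort([5, 2, 4, 3, 1]))  # [1, 2, 3, 4, 5]
```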

Okay, so that's algorithms, but we talk about algorithmic reasoning。

so we need to talk also a little bit about reasoning and。

I want to do this in a way that doesn't open a can of worms because reasoning is such a loaded term in modern artificial intelligence and if you ask 100 people what reasoning is。

you're very likely going to get 100 different answers。

so I'm going to make it very clear what we, as in our group at Google DeepMind, mean when we say reasoning.

So to us, reasoning is a robust procedure for solving problem instances and really the keyword is robust。

Many of the other things are not so important here。

So what this means in practice is I actually don't require this procedure to be fully accurate I claim that humans are also capable of reasoning and as you know humans very often perform only approximate reasoning。

especially when you have only partial data available to you right and furthermore。

maybe this one is a little bit more controversial but we don't require the reasoning to be provided in a symbolic manner。

that is, in a way that a human can interpret what's going on. We would also be satisfied with a reasoning system that does all of its computations in a high-dimensional latent space, without an easy way for us to interpret how it's reaching its decisions. Really, the keyword for us is robustness: it should behave consistently across all instances of the problem that we care about. So if it makes mistakes, it should make mistakes in a predictable manner, and when we talk about

Predictability across all problem instances and we're talking about systems trained from data。

this implies that what we care about is out of distribution or OOD generalization This is really the essence of the core problem behind what makes reasoning hard。

And。This hopefully explains algorithms and reasoning in isolation the way we see them。

but why should we look at algorithmic reasoning as an approach? To see that, we can talk a little bit about how we can even evaluate out-of-distribution generalization, and that's something that's quite tricky to evaluate with any specific data you might have at your disposal today. In fact, you can think about what the desiderata are for data that you can use to evaluate such generalization: you need to be able to measure generalization anywhere inside your distribution, and you need to be able to generate outputs reliably and efficiently for any input you care about. I argue that these items together imply, by definition, that what you need is for there to be some algorithm that produces your data, because a reliable, efficient generator of outputs for any given input is by definition an algorithm, right?

And so now we know why algorithmic reasoning, but why do we care about introducing neural networks into the mix because algorithms themselves are capable of doing this really well and in fact。

they're very robust。In order to quickly justify this I'm going to prime you with a question。

find me an optimal path from A to B, I'm giving you no additional context whatsoever just what would you do if I told you find me the optimal path from A to B now if you are a theoretical computer scientist。

chances are you will see this question and you will react in a very singular manner: you're going to assume that what I'm giving you is a weighted graph on which I'm asking you to find shortest paths from a given source, and you're going to diligently take out your favorite shortest-path-finding algorithm, like Dijkstra's, and use that algorithm to find all the shortest paths emanating from a particular node of the graph. This is at least how I would intuitively react, as someone who studied computer science at undergrad, and I would argue many others would probably react to this question the same way.

But I have never actually told you that this problem gives you a nice welldefined weighted graph with a single scalar weight per edge right in reality very often especially when we care about solving problems in the real world。

there exists some real world problem that underlies this and that might be, for example。

a real-world vehicle routing problem where you have to deal with not a nice, well-defined weighted graph with a single scalar per edge,

but rather you have the status of all the cars and traffic, how fast they're moving。

you might have data from the phones in the cars, there might be various roadblocks, traffic lights。

weather conditions, all making these estimates much harder. And traditionally,

how people would go ahead and solve this heuristically is we would manually try to take this complexity of the natural world and map it into something that an algorithm can operate on, because an algorithm does require the input to be in a very rigid state in order to act in an appropriate manner.

But okay we really cannot hope to always do that manually so we could also try neural networks This aligns with the modern paradigm of tool use where your neural network processes the complex real world data and produces inputs for an algorithm but actually and you can look at the first NR tutorial for justification behind this neither of these are likely to be successful for every problem you care about because there's bottlenecks inherently involved in invoking the tool and basically it doesn't matter if you have an algorithm that would probably correctly solve your problem if you're executing it on inputs that are incorrectly mapped this is really the problem garbage in garbage out。

it doesn't matter what the procedure is and therefore this is the justification behind NR for higher flexibility to allow us to be more flexible in this we can actually try to train neural networks that more closely mimic what algorithms would do and therefore get the best of both worlds。

both a neural network that can process data coming from a more rich input space。

And also have it be more robust out of distribution like an algorithm would and trying to tread this really careful line in a way that satisfies the tradeoffs the way we want it to is really the essence of NAR or neural algorithmic reasoning。

Now okay so far I'm just talking to you about reasoning in general。

but this is a learning-on-graphs conference and I did promise you the first part of this tutorial would focus on graphs, so why do we care about graphs specifically? Once again this is an argument that we explore in a lot more detail in the original tutorial, but here is just one very simple picture that illustrates why graph neural networks are a particularly good representation for dealing with these problems. On the left-hand side you see a classical algorithm for finding shortest paths, Bellman-Ford. It iteratively updates the distances to each one of our nodes, the d variables, by always computing them as the minimal way to reach a neighbour: the distance to that neighbour d, plus taking the edge from that neighbour to the target node; that's really what this minimum formula on the left is encoding. On the right-hand side you have the computations of a typical graph neural network, and you can see how there's this very nice correspondence between the steps of what the algorithm is doing and the data flow of the GNN. You can imagine

the different variables, the d_u's, as node features; you can imagine the act of adding the edge weight onto these variables as computing a message function; and you can imagine the act of aggregating across all neighbours as the aggregation function of the GNN. There is this very nice correspondence between solving different subproblems and aggregating them together that lines up across both GNNs and classical algorithms, and in fact this is something we can formalize. A landmark paper from ICLR 2020 by Keyulu Xu and others in Stefanie Jegelka's group at MIT studied this in a lot of detail in the "What Can Neural Networks Reason About?" paper, where they show that graph neural networks are a good choice here because they align so well with the paradigm of dynamic programming,

which is a very generic paradigm you can use to solve all sorts of classical algorithmic problems. Additionally, this is a concept we explored in subsequent work; in fact, together with Andrew Dudzik we published
"Graph Neural Networks are Dynamic Programmers" at NeurIPS a couple of years back,
which explicitly explores a category-theoretic setup that aligns graph neural networks and dynamic programming.
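To make the correspondence concrete, here is a minimal Python sketch (not the CLRS code) of Bellman-Ford written so that each relaxation round looks like one round of message passing: d[u] plays the role of a node feature, d[v] + w is the message along edge (v, u), and the min over incoming edges is the aggregation.

```python
import math

def bellman_ford(num_nodes, edges, source):
    """edges: list of (v, u, weight) tuples. Returns shortest distances from source."""
    d = [math.inf] * num_nodes
    d[source] = 0.0
    for _ in range(num_nodes - 1):               # at most n-1 "message passing" rounds
        new_d = list(d)
        for v, u, w in edges:                    # message along edge (v, u): d[v] + w
            new_d[u] = min(new_d[u], d[v] + w)   # aggregate incoming messages with min
        d = new_d
    return d

print(bellman_ford(4, [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)], 0))
# [0.0, 1.0, 3.0, 4.0]
```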

So hopefully this justifies why should we be doing neural algorithmic reasoning and also why should we be using graph neural networks to do that。

but I've mentioned to you the core idea is this algorithmic alignment: design parts of your model to align with parts of the thing you're trying to fit. So let me give you a few basic groundwork steps on how to do that. Here is our standard graph neural network equation: we are updating node features from an input space X to a latent space H; we're doing so by computing a message function ψ across every single edge of our graph; we're aggregating all messages using our aggregator function ⊕; and once we are done with all of the aggregated messages, we combine them into the final node embedding using the update function φ. So what are some different ways in which we can take this neural network expression and make it better aligned to certain algorithms? Well, there's a variety of things we can do. One thing we can modify are the parametric functions,
the ψs and the φs. One very beautiful example I still love to give on this is from Hao Tang,

which was published at NeurIPS 2020 as part of the IterGNN paper,
which notices that many algorithms such as path-finding are homogeneous, and that means they have this very nice property that if you multiply all the inputs by a scalar,
lambda, the outputs are exactly the same, just scaled up by that same scalar. Shortest paths are a classical example of this: if you multiply all of the edge lengths of your graph by a factor of lambda,
you will still have exactly the same shortest paths, it's just that the lengths of the shortest paths will be multiplied by lambda. And this means that if we're fitting a homogeneous task like this,
and if we make our neural network components ψ and φ themselves compute homogeneous functions,
we're going to have a much better time. It turns out that to make a standard MLP homogeneous, all you have to do is get rid of the bias vector, so make its layers pure linear transformations without the affine part, and that's one of the things that Hao Tang and others do in this paper to great success. So it's just one example of a really simple but beautiful modification you can do to a GNN to make it fit this particular class of algorithms better.
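A tiny NumPy sketch of the point about biases (an assumed setup, not the code from the paper): with ReLU activations, dropping the bias vectors makes a two-layer MLP positively homogeneous, so scaling the input by lambda > 0 scales the output by the same factor.

```python
import numpy as np

def mlp(x, weights, biases=None):
    """Simple MLP with ReLU hidden activations; biases are optional."""
    h = x
    for i, w in enumerate(weights):
        h = h @ w
        if biases is not None:
            h = h + biases[i]
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)   # ReLU is itself positively homogeneous
    return h

rng = np.random.default_rng(0)
ws = [rng.normal(size=(4, 8)), rng.normal(size=(8, 1))]
x, lam = rng.normal(size=(4,)), 3.0
print(np.allclose(mlp(lam * x, ws), lam * mlp(x, ws)))   # True without biases
```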

We might also want to change the aggregator, and actually this is something that was my first incursion into the field, at ICLR 2020. Once again you see the Bellman-Ford example in the lower right corner;
all we did really was notice that since the Bellman-Ford algorithm is doing minimization, if we choose an aggregator that aligns with that, like taking the max, it will be much easier to learn the required message functions than if you take a more expressive aggregator like sum. I'd say nowadays this feels like a very normal decision; back in 2020 it was definitely contrary to public wisdom, and it was well known, especially around the time when the GIN paper came out, that sum was the most expressive aggregator, then average, then max, and max was seen as kind of the least useful of all of them. But it turns out that for algorithmic reasoning that kind of bias is exactly what you need for a lot of problems, and max is now a relatively standard choice in a lot of these problems, but back then it was a bit of blasphemy.

Another thing we can modify are the input features themselves, to make the problem more algorithmically easy for the model. One standard example of this are n-body simulations in physics, like the early work on interaction networks from Peter Battaglia and others. The idea here is: when you want to fit n-body systems and their trajectories, often the forces you need to model between them follow an inverse square law; gravitation, electrostatic forces, they all decay with the square of the distance. So if you put just raw distances as your edge features, you might not have such a good time, because your message function, the function computing the force, now has to compute an inverse square rule, and this is quite tricky.

But if instead we feed the inverse square of the distance as the features, now。

suddenly the required message function is linear in your features and it's much easier to extrapolate。

So even something as simple as being mindful of your feature engineering can make a big difference。
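A toy illustration of that feature-engineering point (the masses and distances here are purely hypothetical numbers): with the raw distance r as the edge feature, the message function has to learn something close to 1/r^2, whereas if we precompute 1/r^2 the target force is simply linear in the feature.

```python
import numpy as np

G, m1, m2 = 6.674e-11, 5.0, 3.0          # made-up masses and the gravitational constant
r = np.linspace(0.5, 10.0, 5)            # pairwise distances between bodies
raw_feature = r                           # hard: the network must learn ~ 1/r**2
engineered_feature = 1.0 / r**2           # easy: the force is linear in this feature
force = G * m1 * m2 * engineered_feature
print(force)
```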

😊,And lastly, you can modify the computation graphs。

so the neighborhood of nodes over which you compute your embeddings。

and one standard example of this is the Pointer Graph Network that we published at NeurIPS 2020,
where we explicitly force the edges of our model, we rewire our graph, to align with a pointer-based data structure like the disjoint-set union, and this gives us a much better theoretical time complexity.

So these are all the various things we can do to a model。

but let's say you've now proposed a really interesting thing update to a model and you want to test it out or write a paper about it。

how would you go about doing it back in the day it was really tricky every paper had to design its own dataset it was a bit like the Wild West in there it was really hard to compare results across different publications which prompted us to propose what is now known as the CLRS benchmark and really it's just a collection of these classical algorithmic skills we might want our model to respect in various ways so as the name implies CLRS 30 it's a collection of 30 benchmarks tasks from the CLRS textbook introduction to algorithms and even though it's only 30 algorithms it already spans a wide variety of skills sorting searching pathfinding string matching and even geometric algorithms。

And this is a publicly available benchmark that we published at ICML a couple years back。

you can access it, play with it, use it for any data you might want for any publication。

we've already had a lot of very happy users over the years。

we hope that whatever the things that something we say today will resonate with you and lead to you also using this benchmark。

So what does CLRS offer to you, it actually offers a unified graph representation for a variety of algorithms here you see Belman Ford that we discussed before as well as matrix chain multiplication。

a classical dynamic programming algorithm, and you see how the library gives you access to the entire step- by step trajectory of what this algorithm is doing as it's gradually computing shortest paths or optimal ways to multiply matrices。

or also sequential algorithms are represented as graphs。

so the already mentioned insertion sort is represented as well as a string matching algorithm like the naive string match which tries to find the first occurrence of one string in another。

So all of these algorithms are specified using a fixed number of probes and those just track the various variables that the algorithm is using during its execution and those variables may be used either as input or the model might be queried to predict them as output or they might be used both for input and output so the model might be asked to track their state over the lifetime of the algorithm and once you specify these probes。

this uniquely determines what the data is going to look like what the encoder and decoder architecture should look like and what the loss function should be so CLRS 30 is not really just a single data。

it's a data generator and the baseline generator because you can specify any new algorithmic task and once you use the CLRS language to specify the probes of that task。

the library will take care of everything for you so it will design the machine learning model。

design the batching, design the loss functions and Olga will show you a lot of this later on in the tutorial like you basically don't have to do almost any machine learning if you set things up the way you want to。

And I want to briefly walk you through the representation of CLRS 30 because as I said we have this common graph representation so specifically we're going to be looking at insertion sort again and I'll mention to you the six probes that are being recorded in the insertion sort specification the first such probe is the position probe we have this as a nice tiebreaker for all of our different algorithms and a way to nicely encode positional information so a position probe is just something that's given as input in each node of the graph and it's a scalar it's just a scalar encoding what is the index of each one of my five items in this case。

Then we also have the keys probe; note, I'm highlighting in the bottom trajectory what these different probes are in the picture.
So keys are the actual input node scalars telling you what the values to sort are, so the 5,
2, 4, 3, 1 will be encoded inside the keys probe. Then what we want to predict as output is the final state of the sorted array,

these green pointers at the end telling you what is the predecessor of each node in the sorted list。

so this is now at the output stage for each node we're predicting a pointer to the previous node。

Now, of course, we can track how these pointers evolve over the lifetime of the entire data structure。

and this gives us the pred_h probe, which is also a node pointer,

but it's a hint type meaning we track it over the lifetime of the algorithm for every node in the list。

what its predecessor is over time as more and more items get sorted。

and this is all exposed to the model so it can use it as input predicted as output。

whatever is most preferable。And lastly, we have two additional node masks which tell you of certain elements that the algorithm is currently focusing on the first one。

the blue pointer here I is telling you for the currently considered item where in the partially sorted list it will land。

And also, you have another sweeping pointer, which just goes left to right over the list。

telling you what's the current element being sorted into place。

So all of these are exposed in the setup for the model。And as already implied here。

all of the probes can be either inputs, outputs or hints or live on nodes, graphs or edges。

and they can also be of different types like scalar pointer and mask。

and inputs and outputs are fixed during execution whereas hints are something where you have whole trajectories of them during the lifetime of the algorithm and they actually specify the algorithm。

Note that all sorting algorithms, as long as there are no duplicates, will have exactly the same inputs and outputs;

the only place where they differ is how the intermediate states will change。

Now as part of these slides we have this nice animation showing you how these representations travel through the model and how they get embedded into nodes。

how they get embedded into edge features and how all the mask features also get encoded and how the GNN then processes them I'm skipping quickly through this because this exact same animation exists in the original NAR tutorial I'm just leaving it here it's going to be in the slides that are already shared it's going to be here for your convenience you can easily see all of these things that CLRS does automatically for you all of the data passing you see here you don't have to implement it yourself once you've specified the probes of an algorithm。

the CLRS benchmark takes care of all of this for you all of these losses。

data passings encoders decoderrs and so on are all done for you。

Now of course because we already mapped everything to the same graph representation。

why should we stop at learning just one algorithm? Why not have one graph neural network that can learn all 30 of them? And this is exactly what we tried to do in some prior work. What you imagine here is you have this unified graph processor network, and for each one of your tasks, like sorting, pathfinding or finding the convex hull of a set of points, you have a separate encoder function f_i and a separate decoder function g_i; those map to and from this latent space, and they're generally really simple. The question was: can you train a single architecture like this? It turns out you can. We developed an upgraded model called Triplet-GMPNN, which we used as the foundation of our generalist neural algorithmic learner that we published at the inaugural LoG two years ago, and here you can see the performances of the multitask single architecture in orange, across the 30 algorithms, out of distribution, versus the baseline
single-task experts in blue, and you can see there are some differences:

sometimes the multitask model is much better, sometimes it's much worse。

but on average the performance of the two are quite comparable and therefore it led us to believe that we can indeed train just one model that can fit all 30 of these algorithms at the level of a single task expert。

So this is roughly everything that led us to 2022, when the original NAR tutorial was given.
So what did we miss? What happened in all this time since 2022 that is worth paying attention to?

So, slowly and steadily: starting from Triplet-GMPNNs, which augmented the normal GNN equation with message passing over explicit triplets and gating mechanisms, and also Sinkhorn-style decoders for tasks requiring permutation outputs, this already achieved a really solid performance across the 30 algorithms, as you can see on the left. Since 2022,

this research just kept growing. The first such instance, published at ICLR 2023, were relational transformers from MSR, which show what you need to change inside the transformer architecture to make them a bit more competitive on these tasks;

specifically most transformers don't have an easy mechanism to incorporate the edge features which are really important for graph tasks on NAR so what you do is you just let the edge features constitute part of the query and key vectors and value vectors and this suddenly allows you to have a more competitive transformer architecture。

Additionally, G-ForgetNets, published at this year's ICLR,
actually leverage the idea that these algorithms are designed to be Markovian, and therefore they have an explicit gating mechanism which actually allows the model

Explicitly forget some proportion of the previous embeddings of their processor network。

and this led to a lot of really outstanding results, as you can see in the table。

a lot of the algorithms are now at out of distribution performance above 90% accuracy。

And most recently, we've done something blasphemic, which is published this year at LoG。

the RNAR model, the recurrent neural algorithmic reasoner,

where we actually set the aggregation function to an LSTM。

therefore dropping all notion of permutation equivariance, which is very beautiful and important for graph tasks.
Well, we argue that in CLRS this might actually not be such a bad thing to do, because for a lot of these algorithms, like list algorithms and so on, we have an explicit canonical node order that we can feed into our LSTM. And we get not only really good results on certain list algorithms, but actually a double-digit, 80-plus percent performance on Quickselect, which my good friend Michael Galkin called out in a recent blog post as very unlikely to happen this year;

well it turns out all we had to do was drop permutation symmetry and we had a model that's capable of solving Quickselect。

😊,And actually when you take all of these results together across the state of the art models。

there are only three algorithms left where we don't have an above-80% score on the original out-of-distribution test sets used by CLRS: that's Floyd-Warshall,
Knuth-Morris-Pratt, and strongly connected components (hint, hint);

these are directions you might be interested to focus on in the immediate term to help us completely crush the existing test set and then in the future we're going to have to have an even bigger distribution shift because this one is just too small for a lot of these specialized GNN executors。

Beyond these main architectural advances, we've also done work to endow graph neural networks with memory mechanisms, and two works that I think are worth calling out here are the recursive algorithmic reasoning module, from Dulhan Jayalath, Jonas Jürß and myself, that was published at LoG last year and endows GNNs with a stack; and neural priority queues, which I published with Rishabh Jain and Pietro Liò at an ICML workshop last year, which endow GNNs with a priority-queue-style data structure.
So these both endow GNNs with a persistent memory component, which can be leveraged for more precise and more robust answers, even across bigger distribution shifts.

Another thing that's quite important is I've mentioned how critical these trajectories were and the hints to the system performance。

in fact, there have been several papers arguing that if you just naively learn to predict the next hint given the previous one, it might not be so good. So Sadegh Mahdavi had this great TMLR paper where they basically showed that for many classic CLRS algorithms, not using the hints was sometimes a good idea.

This prompted several other works, like this paper from Gleb Rodionov and Liudmila Prokhorenkova showing that you can actually have carefully handcrafted supervised losses that let go of hints entirely but still allow you to execute out of distribution with great accuracy;
this was published at NeurIPS last year. And also recently,

because many algorithms really require you to find the fixed point of a certain operation。

Dobrik Georgiev, JJ Wilson and Davide Buffelli have shown, in a forthcoming NeurIPS paper that will be presented next week, that you can use deep equilibrium models to greatly stabilize the training dynamics of these models.

Now, of course this might paint a slightly daunting picture for these intermediate states so should we just get rid of them altogether I would actually argue we had some papers recently that say that hints can take you a long way if you use them properly so if you don't just do simple next hint prediction so one approach that we used last year was to show that。

😊,Actually, the trajectories of algorithms will often have exactly the same initial few steps if you transform the input in some predictable ways。

So here, for depth-first search, we have this trajectory exploring from node 1 to node 2 to node 3; node 3 has no neighbours, so we go back to node 2 and explore node 4. That next step, exploring node 4, will still be exactly the same even if we attach all of these different nodes and edges to the graph, and we might want to exploit that to design a more finely crafted contrastive objective. Specifically, the Hint-ReLIC method, which Beatrice Bevilacqua published at ICML last year in Hawaii, basically uses this to design a contrastive learning objective where we contrast a step in the original hint trajectory against an incorrect step in an augmented graph, and this gives us a very rich space of contrastive learning to use; actually, using this we set one of the current state-of-the-art results. Another recent, very exciting result that will be presented

next week, the OpenBook NAR method, not only uses the hints from the actual input that you're trying to predict the output for, but also gives the model access to the entire training set of all of our sequences;

that's why it's called open book so typically when you as a human tries to execute an algorithm very often you will look at examples of previous executions to ground your predictions better and it turns out if you give your model explicit access to the training set of possible reference sequences。

this can give you a much better NAR performance once again a more non-trivial way in which these hints can be used。

And I mentioned previously we can align all these different parts, parametric functions。

aggregations, features, computation graph, you might wonder is there anything else left to do。

turns out there is and it wasn't even in this plot。

there was this implicit clock which says that you always have features at time T and they all work together to compute features at time T plus1 and this is perfectly synchronized。

but this might not be an assumption we might always want to hold because many algorithms we might want to fit are asynchronous meaning that we can only update a handful of variables at each computation step。

Yet all of our models, GNNs transformers and otherwise are fully synchronous。

they update all of our nodes everywhere all the time and if you think about it。

what kinds of out-of-distribution generalization can you get from message and update functions in such a model? They have to learn an identity function almost everywhere and then highly complex functions somewhere else, and that's quite tricky. So better alignment with asynchronous algorithms might be a very useful future direction, and we had this paper with Valerie Engelmayer and Dobrik Georgiev at last year's LoG showing that this might just mean: let's try to execute algorithms that are as parallel as possible, and that's just better for GNNs to do. That's easy to prove, elegant, scalable, but obviously not all algorithms are embarrassingly parallel, so we can't always do this.

So there have been a variety of methods, like GwAC and co-operative GNNs, that used explicitly asynchronous message passing, where at every step you only pass some of the messages and you only receive some of the messages. This obviously directly solves our problem inside the architecture, but it's quite tricky to scale on modern hardware, which typically requires things to be done in parallel, and there are many discrete decisions to make when deciding which messages to send and which to receive. So we actually worked on asynchronous algorithmic alignment with Andrew Dudzik, Tamara von Glehn and Razvan Pascanu, where we tried to tread the middle ground: we designed GNNs that are still synchronous, so efficient to execute on a GPU or TPU, but provably invariant under various asynchronous execution traces. This leads to a pretty cool monoid-equivariance-style result, and it hits a sweet spot between a method that is theoretically sound, scalable on modern hardware, and feasible to implement, so I highly encourage you to check it out if you're interested; it was published

at last year's LoG as a spotlight. Lastly, what about different applications of NAR, or more theoretical understanding?

how has that developed in the last two years in terms of places where NAR has been deployed。

the four areas I'd like to call out are: we used NARs to get better predictions in neuroscience,
predicting blood vessel types in the mouse brain; this was the Dual Algorithmic Reasoning paper with Danilo Numeroso and Davide Bacciu that we published at ICLR 2023.

Then NARs have broken into computational biology: Dobrik Georgiev and others have published a paper which uses these NARs for trajectory inference over gene expression data.
They are also seeing usage in harder combinatorial optimization problems:
the NAR-for-combinatorial-optimisation paper from Dobrik Georgiev and Danilo Numeroso, published at last year's LoG, shows how we can pretrain a model on polynomial-time algorithms and then use that model as a great foundation for NP-hard problems and get better performance at fitting them.

Lastly, together with Efimia Panagiotaki and several other collaborators at Oxford,

we worked on using these NARs to get better performance for an algorithm that's critical for point cloud registration and robotics by basically automating the steps of that algorithm better。

What about theory? Well, we believe that fitting algorithmic invariants requires stepping beyond geometric deep learning and modelling things that are not just purely equivariant functions. This prompted us to develop the categorical deep learning framework together with Bruno Gavranović, Paul Lessard, Andrew Dudzik, Tamara von Glehn and João Araújo; we published this at ICML earlier this year in Vienna. Then the great work from Artur Back de Luca and Kimon Fountoulakis at Waterloo has shown how you can use looped transformers to exactly simulate many NAR procedures that you might care about; of course this still doesn't tell us exactly how to find such looped transformer parameters, but it's a great encouraging step that such classes of architectures are heading in the right direction.

Then, two really exciting works on quantifying out-of-distribution generalization better:
this work from Andrea Sluka, Martinquis and others, released on arXiv earlier this year,
is a great step towards quantifying out-of-distribution generalization, and there is also a great step from Bruno Ribeiro, Jure Leskovec and co-authors in the GraphMETRO paper, which improves out-of-distribution generalization through a rigorous, motif-based approach.

Lastly, I'd like to call out two recent quality of life improvements to CLRS that Olga will tell you all about in her tutorial to follow。

The first one: in the basic CLRS benchmark we always use fully connected graphs, and the SALSA-CLRS work from Minder and others in Roger Wattenhofer's group at ETH, published at LoG last year, actually shows that if you use sparse graphs, in the algorithms where that is possible, you can generalize to much bigger structures than what CLRS allows in and of itself. This is a public, PyTorch-based fork of CLRS that you can find on GitHub. And last but not least, through the work of Vladimir Mirjanić and others that was published at LoG last year, we now have a nice way to visualize what these algorithmic executors are doing over time and plot their trajectories in PCA space, sometimes uncovering really cool patterns in how these models process data. Once again, all the code for generating these beautiful plots is publicly available on GitHub, and we'll also tell you all about it in the forthcoming tutorial.

That was it and I hope you enjoyed the first part of the talk of the tutorial for NAR over graphs I'm now going to hand over to Olga Koslova who will tell you all about a coab tutorial that covers a lot of these concepts thank you so much and Olga the stage is yours。

All right. Hello, everyone. Give me a second to share my screen and we can start.
Yeah. Okay.

Yeah, so hello everyone, my name is Olga and I'm a software engineer at Google DeepMind, and basically in this tutorial we will walk through code snippets of how you can
train an NAR model on the CLRS benchmark; we will look into the novel architectures and multi-algorithm training,
and we will also explore the libraries that Petar just mentioned.

To make the experiment kind of clear, I will restart the session。

And we will walk through this colab. It starts from the installations,
which I conveniently ran while Petar was giving his talk, so we don't need to actually run them.

Then we import the libraries that we need and initialize the JAX random number generator.

Here we have two helper functions: one prints the data sample and another draws the plots for train losses and validation accuracy.

We don't need to get into them, but they are helpful to us。

Now let's recap briefly how you generated the data based on the tutorial two years ago。

you would create a sampler for each algorithm specifying the number of nodes in each sample。

And you would iterate over it, and it would look something like this. You will see this spec,

which basically shows you all the variables used in algorithm。

And the data batch will look somewhat like this, containing the lengths, the hints, the inputs,

the outputs。What we can do now is we can use these create sampler functions。

And as we see in documentation, we can use it to create sample for training, validation and testing。

setting the train lengths, the validation and test lengths, and the algorithms that we want to use,

Same as I do here。The parameters I choose don't really have specific reasoning for that。

apart from the fact that I choose the validation lengths in distribution and the test lengths out of distribution.

And these are just two algorithms I like, for shortest paths.
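For reference, building samplers with the public CLRS library looks roughly like the sketch below, following the library's README; the colab's own create-samplers helper wraps something like this, and the exact argument values here are assumptions rather than the colab code.

```python
import clrs

# One sampler per algorithm and split; `length` controls the number of nodes.
train_sampler, spec = clrs.build_sampler(name='bellman_ford',
                                         num_samples=1000, length=16)
test_sampler, _ = clrs.build_sampler(name='bellman_ford',
                                     num_samples=32, length=64)   # out of distribution

feedback = train_sampler.next(32)   # a batch of inputs, hints and outputs
print(spec)                         # the probe specification for this task
```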

So yeah let's print data batches from the new samplers and we see that we have two batches。

one for each algorithm。And the spec list is also changed, now we have two of them。

Now we need to create the model, and we do this exactly as we did before, with a small change: we have added new models,
for example the Triplet-GMPNN model that Petar mentioned. What's really great
is that you can use it as easily as you did before.

Let's see how we can load it and use it。So yeah here we specify the parameters of the model。

Then we pass it to the CLRS baseline model, and then we initialize it;
this is a JAX thing: in order to start using the model, you first initialize it using an example batch of data.
Here, I will just save a checkpoint of the just-initialized model, and I will use it in the second part.

So this is the train loop, I will start the training first and then walk you through the function。

Basically what we do: first we initialize some lists for losses and accuracies.

Then we do the for loop for like 500 steps, and each loop, we first do the training step。

where we go through all the samplers for all the algorithms, do like the feedback step。

and collect the losses. We also evaluate the model once every 50 steps; in this case,
we also go through every validation sampler and evaluate the model on it, basically.

And sometimes if the current accuracy is much higher than the best so far。

we will run the evaluation of test set as well and save it and save the model if it's the best one。
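Schematically, the loop she is describing looks something like the sketch below; the model and evaluation helpers here are dummy stand-ins (the real colab uses the CLRS baseline model and the samplers created above), so treat all names as illustrative assumptions.

```python
import random

class DummyModel:
    def feedback(self, batch):          # stand-in for one gradient step; returns the loss
        return random.random()

def evaluate(model, split):             # stand-in for the validation/test evaluation loop
    return random.random()

model = DummyModel()
algorithms = ["bellman_ford", "dijkstra"]
losses = {a: [] for a in algorithms}
best_val, best_test = 0.0, 0.0

for step in range(500):
    for algo in algorithms:                          # one training step per algorithm
        losses[algo].append(model.feedback(batch=algo))
    if step % 50 == 0:                               # evaluate every 50 steps
        val_acc = evaluate(model, "val")
        if val_acc > best_val:                       # if best so far, also run the test
            best_val = val_acc                       # set and save the checkpoint
            best_test = evaluate(model, "test")
print(best_val, best_test)
```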

Yeah, this is how the training goes. It's not that fast on this Colab, so we will need to wait a minute.

And then we will draw the plots to visualize what's going on。In terms of loss and accuracy。

A few more steps. Yep, right, so we see the losses decreasing for both algorithms,
that's Bellman-Ford and Dijkstra, and the accuracy improving for both of them as well.

Now let's look into the library which allows you to visualize the trajectories of each sample。

So, for the latent-spaces library, we will need to do a few more imports. Okay.
So what we do here is we choose an algorithm on which we will see what happens,
and I will create a kind of test dataset for that.

Another thing that I actually forgot to mention when I initialized the model。Let me get back to it。

It is that here we add the debug=True parameter; this is a new feature which was just added,
and it allows you, apart from the current predictions and hints, to also return the embeddings from each time step, so we will use them for visualization.
And yeah, if it's set to false, which is the default, the model will behave as it did before.
So yeah, given that we know the model is set to debug,
we expect the third output of the model's predict call to be these stacked embeddings.

I will actually walk through it in a second。Basically, what happens in this function?

We first initialize lists for the trajectories and for the lengths.

Then we go through all the samples in which we have like, yeah, approximately 16,000。

We dump these trajectories, then we group them all together in one list.
Then we concatenate them along the batch axis. The next part is very, very confusing,

so let's look at it step by step. The trajectory shape is (time steps, batch, nodes, features).
What this means is that for each time step we have a batch of samples,
each containing 14 nodes and 128 features. So we gather these all together,
then we collect them for all samples, getting all the trajectories.

And what we do then is we reduce the nodes axis。Because it's quite tricky to track embeddings across many nodes。

So yeah, we want to have just one point for them. We do it with
the element-wise max, because this is basically the same aggregator that is used in the CLRS processors.

So what we do next, the transpose, is basically just to change some dimensions afterwards,
so it will be easier to use with the library. The two experiments we will do here: we will build these plots for the trained model and for the just-initialized one, and see what we have.
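In NumPy terms, the shape manipulation she just described looks roughly like this (the sizes are the hypothetical ones mentioned above):

```python
import numpy as np

t, b, n, f = 10, 4, 14, 128                    # time steps, batch, nodes, features
num_batches = 8
trajs = [np.random.randn(t, b, n, f) for _ in range(num_batches)]

all_traj = np.concatenate(trajs, axis=1)       # concatenate along the batch axis
reduced = all_traj.max(axis=2)                 # element-wise max over the nodes axis
per_sample = reduced.transpose(1, 0, 2)        # (samples, time, features) for plotting
print(per_sample.shape)                        # (32, 10, 128)
```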

Okay。All right, we have all the data for both of them now let's visualize。

Briefly going through this function while the plots are being drawn:

These lines are just purely technical; we just create some folders for the plots,
and then set up some experiment settings, like the plotting settings. Here we have 16,000 samples.
Obviously, they will have different trace lengths, so we need to select only those which have equal trace length.
In this case, we want to choose the most frequent one and keep only those lengths;
so we choose the most frequent length, and then we filter the data and the lengths using it.

And then we pass it all to the plotting functions from the library。And here's what we have。

So for the just-initialized model, these two plots: what we see here is that we aggregate the graph embedding for each step
and project it into 2D or 3D space using PCA, and we draw a line between the different steps;
we choose 50 samples and draw 50 different lines, one per sample, across the different timestamps, to see how it changes
through time. And for the untrained model we see that nothing really useful happens; the samples diverge,

there is no clear solution here。But on the other side。This is a small spoiler。

for the trained model, we see that
these samples actually converge in one place: they start from the blue and converge towards the right.
We can also see that the distance the samples move from the initial point is quite big in the beginning and then becomes smaller and smaller.

In the second plot, instead of having one PCA model for all the points,
we fit a different PCA model for every timestamp, and put the timestamp on its own axis, giving this
kind of representation. So, as we see, for the untrained model
the points are quite spread out, while for the trained model
they are much more concentrated. Well, still not all in one spot, but we trained only for 500 steps,

that's very little, but we already see the difference。

And these are just a few of the plotting functions from the library;

we encourage you to go and explore what's available there and also check the paper it has much better description of what's going on。

And the last part that we will walk through is the sparse training。

We need to restart the kernel here, and I will explain why. So we need to import
everything we need again, but the good thing is that we only need the imports
and Part Three, so we can skip Part One and Part Two. Yeah, we will explore the SALSA-CLRS library here.

The good thing is that the execution will be faster。

But it has only four algorithms available from CLRS-30, plus two additional ones, which you can check in the paper.
And another significant difference is that it uses PyTorch. This is just for you to know,

not advantage or disadvantage, just a different thing。

And the experiment we will set up here is that we will。

run a small evaluation on like 10 samples for the CLRS model and for the SALSA-CLRS model and datasets, and check the RAM usage.

And the reason why we needed to restart is because there is no easy way to reset the JAX memory
once it starts; so we just restart the runtime and run from here.

I will launch the code first and then explain what was going on because it takes about three to five minutes to actually run it。

So we initialized the model this we've seen before。

And then we have a for loop where we create data samples with sizes from like 2 up to 4,000 and run a short evaluation loop;
so it's a really short one, the number of samples is just ten, and we do it for the BFS algorithm, and choose
the PGN model for both CLRS and SALSA-CLRS. We measure the RAM usage this way:
in JAX we measure the bytes in use and divide by this number to get gigabytes from bytes.
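The measurement she describes can be sketched like this (assuming a GPU backend where JAX exposes memory_stats; the exact keys can differ across backends):

```python
import jax

def gpu_memory_in_gb():
    stats = jax.devices()[0].memory_stats()     # per-device allocator statistics
    return stats["bytes_in_use"] / (1024 ** 3)  # bytes -> gigabytes

print(f"{gpu_memory_in_gb():.2f} GB in use")
```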

Yeah, here we will need to wait for about three more minutes, because the bigger the size,
the longer we need to wait. And I expect that we might even hit a
runtime error because we take too much memory; as you can see, it already grows significantly
with every step. While we wait, I will describe the SALSA-CLRS test function as well.

So in these lines we just remove the standard logging because we don't need it in this experiment。

so the output will be as nice as this。Then we initialized the model。

loading it from the config, and building the datasets to init the model. Then we do the same loop:
we reset the torch settings here, build the data, go through it,
measure the usage and save it. Then we still have some time to wait. We evaluate the last bit.

Yeah, and afterwards, what we will do, we will plot the Ram usage for both of them together and see。

how significant the difference is. This usually takes quite long,
but hopefully right now it will be quite fast, because the most significant part is the dataset generation.
But because I already ran this colab while Petar was giving the talk, it should be cached.
So do expect that if you launch it for the first time, it will take much longer than it did for me.

And another note I forgot to mention right at the beginning: we are using the GPU. First of all,
because we are limited on time here and we want you to actually see what's going on, and secondly,
because the example in the SALSA-CLRS library for RAM usage
also uses the GPU device. Yeah, this is super fast here. And that's the plot. So basically, this is a

re-implementation of the plot from the paper. The blue line is the CLRS line,
the orange one is the SALSA-CLRS result, both for PGN models. The x-axis is the size of the graph.
The blue line stops here because later it hits an out-of-memory error,
but despite that you can see that the difference is very significant, and we can see that
using SALSA-CLRS is a much more memory-efficient way. Obviously it's up to you what to use,

but we wanted to highlight that there are some advantages and disadvantages of different libraries。

And I think we're here with this I will stop, this is all that I wanted to tell you。

I am happy to answer questions in the Slack channel, so please post them.

And right now I will hand over to Federico。

We'll be using the same laptop here, so we will just switch places. Hello. Yeah.
Hello. So let me just share my screen. Here I am. Okay,

hopefully everyone can see it or else you should probably tell me now。 So yeah, I'm Federico。

I'm a third year PhD student in the University of Oxford。 and yeah。

I'm happy to be presenting the part on language models。😊。

So I appreciate that a lot of people are coming from a graph neural networks perspective,

so maybe not everyone is like super familiar with language models。

so I'll give a very brief intro to them。So most language models actually use transformers。

at least most of the ones that are deployed, and they're autoregressive; we'll look into why this is important.

And the way they work is they essentially take text and they split it into tokens。

And the tokens are kind of like really the basic unit that this language model uses and the process of turning text into tokens is done by something called a tokenizer。

and this tokenizer is kind of independent from the training of the language model but they are quite sophisticated。

and they're actually very important for the performance of the language model。

And then these tokens are really just like IDs that the tokenizer gives to text to split it and then the part of the language model is kind of taking these token IDs and turning them into like token embeddings that then the transformer can process meaningfully。

And the language model is really a function taking token and token embeddings or really it's a function from text to text。

of course, but there's some kind of intermediate process in which you turn text to these tokens。

tokens are processed into final embeddings and these final embeddings are somehow processed to tokens or to text again。

And this autoregressive process means that really the language model, at each step, is predicting a probability distribution over what it thinks the next token should be, and then it's doing this repeatedly, so you really only generate one token at a time. Maybe we can illustrate a bit more concretely what this means: here we have this input, which is "what is two plus two".

And this tokenization is, I believe, the one that ChatGPT would be doing. So it's splitting words, and usually digits are treated independently, so numbers will have their own token, or maybe multiple tokens. Then you feed this to the language model and extract the next token, which would be "The" in this case. Then what you do is re-feed this into the language model: you've generated a new token, and you re-feed the new sequence where you added this token to get another token, and then you do this again and you get another token, and so on. And so from "What is two plus two" the model can generate "The answer is 4." Here even spaces can have their own tokens; tokenization is actually very complicated. It's kind of a dark art that makes language models work.
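To make the token-by-token loop concrete, here is a minimal sketch of greedy autoregressive decoding with a Hugging Face causal language model; the model name and prompt are only placeholders for illustration, not the models discussed in the talk.

```python
# Minimal sketch of greedy autoregressive decoding (model name is illustrative).
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Text -> token IDs (the tokenizer is independent of the model's training).
input_ids = tokenizer("What is two plus two?", return_tensors="pt").input_ids

for _ in range(8):  # generate up to 8 new tokens, one at a time
    logits = model(input_ids).logits              # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()              # greedy: most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))             # token IDs -> text again
```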

And these transformers are usually called decoder-only transformers. The "decoder" part essentially means they use what's called a causal mask, and people working on GNNs will be quite familiar with masking; it's how you encode the topology of the graph. The causal mask really just means we have a lower-triangular mask, and this is important because it says that tokens can only attend to tokens coming before them. This is quite useful, at least as a training trick, because if you didn't do this it would be hard to train models: the attention would be looking ahead, so if you're trying to predict the next token you could easily overfit to this and not generalize at all.

And importantly, these transformers are actually extremely deep; the 400-billion-parameter Llama model, for instance, has more than 120 layers. These are extremely deep transformers, and it's remarkable that we can build such deep models and they work so well.

So just in equations, again, this is all relatively simple. We first compute raw activations, and these are based on the queries and keys; here I'm showing the computation where you fix a token and look at all the keys, and these activations are simply the dot product between the two. People can modify this a bit, but let's say here you're not using positional encodings. Then you apply the softmax, which is simply the standard way of taking a vector in R^n and turning it into a probability mass function over n things, essentially. And then you take a linear combination: you take the values and linearly combine them based on the attention, and importantly this sum only looks behind you, which is exactly what comes from the causal mask.
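Written out, the causal attention computation described above is (a sketch, ignoring positional encodings and the output projection):

```latex
% Raw activations for a fixed query token i against the keys j
a_{ij} = q_i \cdot k_j
% Softmax over the allowed (past) positions only, due to the causal mask
\alpha_{ij} = \frac{\exp(a_{ij})}{\sum_{j' \le i} \exp(a_{ij'})}, \qquad j \le i
% Output: linear combination of values, summing only over positions behind i
o_i = \sum_{j \le i} \alpha_{ij} \, v_j
```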

Okay, so now we can start talking about reasoning, and in this talk I want to give more of an overview, because I want to introduce people to a lot of different concepts which I think are actually very related to what people are working on in graph neural networks.

One of the pivotal papers, which is quite recent but extremely popular now, is this idea of chain of thought, and it comes from a pretty nice observation. These examples are from a dataset called GSM8K, which is a dataset of simple math questions, usually word problems and so on. If you give a model a prompt like the one on the left, you can see that it gets the answer wrong; but instead what you can do is make it try to explain what it's doing. The difference between left and right is that on the left you give it an example in which you just ask it to spit out the answer, "The answer is 11", while on the right you also show it how it can potentially reach the answer. In this sense the model is trying to explain itself while answering the question, and somehow this helps with performance.

Now, there's also like people use this kind of different terminology。

So there's like a difference between scratchpad and like k shot prompting。

So K shot prompting is like in this sense, in this example, you have like K is one。

So before the model answers this question about the cafeteria, you give it an example about tennis. That is why it's one example, so one-shot; if you gave it two, it would be two-shot prompting, and so on.

and then there's also this idea of scratchpad。 So in some sense。

scratchpad is like allowing the model to use additional tokens to do computation。 So for example。

in some sense, in this output, the model is doing some kind of scratchpad computation, where it's not just telling you the answer, it's also saying how it's arriving at the answer.

These two are usually used at the same time, but there's a slight difference between the two, as the sketch below illustrates.
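Here is a hypothetical sketch of how such a one-shot, scratchpad-style prompt can be assembled as plain text; the wording paraphrases the tennis and cafeteria examples mentioned above rather than quoting the dataset.

```python
# One-shot (k=1) prompting: one worked example precedes the real question.
# Scratchpad / chain of thought: the exemplar spells out intermediate steps
# instead of only the final answer.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
question = (
    "Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
prompt = exemplar + question  # feed this string to the language model
```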

And actually I think this relates to a lot of work that people do in graph neural networks, namely looking at expressive power and connecting it to the Weisfeiler-Leman test and so on. People have done similar things in analyzing the expressivity of chain of thought, so analyzing whether allowing the model to compute intermediate tokens actually helps theoretically, and the answer is yes. Perhaps this is not super surprising if you look at transformers from the point of view of automata or Turing machines: you can view this chain-of-thought process as allowing the model to have extra tape, extra memory, and so on. So in this plot from a recent ICLR paper,

what they show is that, in one direction, you can increase the amount of chain of thought the model is able to do, so that it captures a broader class of algorithms; in fact it can fit even polynomial-time algorithms. In the other direction, you can give it more embedding size, which is kind of like giving it more memory. So these are two directions in which chain of thought can help. Of course, the embedding size is essentially constant, while with chain of thought you have more freedom. But this is a way in which people have theoretically shown that chain of thought has real benefits. These analyses usually come with restrictions: they assume hard attention, where attention only attends to a single value, or sometimes a softer kind of hard attention in which you can attend to multiple values but with uniform weights. So you need to make some simplifications, but it's a nice computational model, and I don't want to go too much deeper into this,

but I'm linking here some papers. For example, there's a whole line of work about complexity theory, which is what I was just talking about, but there's also very interesting work introducing a new kind of programming language called RASP. RASP is essentially a programming language that mimics the operations a transformer can do, and people have observed, for example, that if the task you want to solve has a small equivalent RASP program, then the transformer will be able to generalize better, or at least fit the task better. So it's nice that people are going in this direction of trying to formally understand transformers, and also these connections to programming languages: what kind of programs can they implement? For example, the Tracr work shows how, given a RASP program, to get an equivalent transformer out of it. I think these kinds of works are quite interesting, and I'm assuming they will be interesting to people who work on GNN expressivity.

Okay, so now let's talk a bit about length generalization. As was mentioned earlier, we're really interested in building language models capable of performing robust computation, and for the scope of this talk we can restrict ourselves to length generalization. This is the ability to extrapolate from short problem instances to longer ones. It's probably clear why you'd want this: you want your model to generalize. But there are also some more subtle advantages. In general, shorter-sequence data is more abundant and easier to get, and it's also computationally more efficient to train on shorter data. So it's not just a nice thing to have; there are also practical considerations that make a model that's good at this quite helpful.

And to this end there's this benchmark that Larisa and others in the group have worked on, the CLRS-Text benchmark, which I guess Larisa will talk more about in the Colab part. Essentially, CLRS-Text is a direct translation of the graph-based CLRS benchmark into text. In this case there's an insertion sort example: you give it the initial state, then you can give it the trace of the algorithm and ask it to predict the final step, or you can ask it to predict the entire trace and the final step. So in some sense this is taking the graph CLRS benchmark and literally translating it into something an LLM can symbolically try to manipulate. And this is quite nice because a lot of language-model benchmarks, like GSM8K, are word problems, and they're harder to generalize in terms of making them longer, more difficult, or with more inputs, while in this kind of benchmark it's absolutely trivial: if you want to make it more difficult, you can just give it a longer array to sort. This is why procedural generation is very useful when you're trying to benchmark models robustly and check generalization this way.
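To see why procedural generation makes length generalization easy to test, here is a toy sketch in the spirit of CLRS-Text (not the actual benchmark code): it writes an insertion-sort prompt, its trace, and its answer for any requested array length.

```python
import random

def insertion_sort_sample(length: int, seed: int = 0) -> dict:
    """Toy CLRS-Text-style sample: prompt + trace + final answer for insertion sort."""
    rng = random.Random(seed)
    arr = [rng.randint(0, 99) for _ in range(length)]
    trace, a = [], arr[:]
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
        trace.append(a[:])              # intermediate state after inserting a[i]
    return {
        "question": f"insertion_sort:\ninitial: {arr}\ntrace and final array:",
        "trace": trace,
        "answer": a,
    }

# Making the task harder is trivial: just ask for a longer array.
print(insertion_sort_sample(length=8))
print(insertion_sort_sample(length=16))
```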

So here there's an interesting plot from the paper. First of all, this green line is Gemini 1.5 Flash attempting to solve the CLRS-Text benchmark, and you can see that performance is essentially zero. This model is huge compared to this 2-billion-parameter Gemma model, which was instead specifically tuned for the task, and you can see that the Gemma model, even though it's much smaller, has been tuned on it and is much, much better at the task. These red dots are the in-distribution dots, while everything else is out of distribution: by in distribution I mean the model has been trained on these sizes, and where you don't have red dots it has not seen these sizes. On the right, around sizes 35 to 40, is the extrapolation regime; the other ones are interpolation. So you can see that, yeah, Gemma seems to be much better, and there's also this "+RPE", which stands for a type of positional-encoding improvement that also helps generalization; I'll talk a bit more about that later. But here you can see, for many datasets (this is one model being trained to solve all of them simultaneously), that tuning of course helps, but you can also see, for example in naive string matching and in a lot of the others, that in extrapolation, so in length generalization, the models really drop in performance quite quickly.

Yeah, so I spoke a bit about position encodings. Position encodings really impact length generalization; for example, the ones shown in the plots are these randomized positional encodings (at the end I have references to everything, so you can find everything I'm talking about). A lot of people have looked at position encodings and tried to develop ones that generalize better, and the main intuition is that for length generalization you really care about how these position encodings behave in the limit, or at least as your sequence gets larger and larger; FIRE and RPE are nice examples of this kind of careful study. People have also developed embeddings specifically for certain operations, for example the Abacus embeddings, which are designed specifically for arithmetic-type operations. However, most LLMs still use the RoPE positional embeddings, which are probably the most common at the moment.

And RoPE was motivated by the claim that it decays the activations as the distance between tokens increases; this is the plot they showed, in which you can see this decay as the relative distance increases. But in our very recent work we show that this does not really have to happen; it only happens when the queries and keys are the same vector, and we have a more mathematical analysis of this. What we actually show in our work is that RoPE probably helps length generalization for a different reason: for example, it allows these models to build very specific attention heads. Here you can see that heads five and eight are very distinct from the other patterns, and this is what we show in our paper; we really dive deep into what's happening with these embeddings. I won't go into much more detail, but there's a reference here if you want to check it out. My main point is that position encodings play a huge role in length generalization, and understanding them is extremely valuable, because they're still quite poorly understood.

And I want to talk a bit about information propagation in these transformer models, and I want to show a motivating example, which is a counting task. Here we ask Gemini 1.5: can you sum 1 plus 1 plus 1, etc. The x-axis is how long this 1-plus-1 sequence is, and the y-axis is the absolute error, so how far off the answer is. You can see that already at this task Gemini is really bad, and in fact it's quite bad at many related problems: here, for example, just counting the ones in a sequence of ones. Similarly, we tried chain of thought, few-shot learning, trying to make it use a scratchpad, and it doesn't really work. And these are huge, huge models; Gemini 1.5 was at the time the best Gemini model we could find. So there's a question here: is this a very fundamental issue, or do we just need more training, or for some reason are these models not trained on summing one plus one plus one,

etc.? So to check this, let me show you a nice experiment. Say we prompt Gemini with "how many ones are in the following sequence" and give it K ones, and then "how many ones are in the following sequence" and give it K+1 ones. The two prompts differ only in that the second has an extra one at the end. I plot the norm difference between the last-token representations at the end of the transformer (here I'm using the Gemma 7-billion-parameter model), and you can see that as K grows, the representations get closer and closer. And this is an issue because at some point, and you can see this BF16 precision line, if the sequence is too long the two sequences will have the same output representation, so the model is forced to make a mistake. Essentially, the two representations become indistinguishable as K grows; the difference falls below the floating-point precision. This is what we call representational collapse, which is in some sense analogous to over-smoothing; there are some slight differences, but it is a kind of over-smoothing result, and we make this more formal (there are a few assumptions you need to make). But yeah, we pretty much verify that this occurs in Gemma 7B, and this points to an issue in transformers: just due to this effect, counting in these kinds of prompts is forced to break at some point.
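A sketch of this kind of measurement is shown below, using a small Hugging Face model as a stand-in for Gemma 7B (the model name and prompt format are illustrative, not the exact ones from the paper): it compares the final-layer hidden state of the last token for a prompt with k ones versus k+1 ones.

```python
# Sketch: measure how close the last-token representations of two prompts get
# as the sequence grows (illustrative model, not the one used in the paper).
import torch
from transformers import AutoTokenizer, AutoModel

name = "gpt2"  # stand-in; the talk uses Gemma 7B
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def last_hidden(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).last_hidden_state     # (1, seq_len, hidden)
    return h[0, -1]                            # final-layer state of the last token

for k in (10, 100, 500):
    a = last_hidden("How many ones are in: " + "1 " * k)
    b = last_hidden("How many ones are in: " + "1 " * (k + 1))
    print(k, (a - b).norm().item())            # expected to shrink as k grows
```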

And there's also a similar effect which is due to the causal mask, again this lower-triangular type of mask. Information still has to reach this y_5: at each layer, at some point you need to extract y_5, which is the representation of the last token at the last layer, because this is what you then softmax to get the next token; the softmax only looks at this last token. And the basic intuition is that, say you're token v_5 and you want your representation to survive into y_5: there's only one path for your representation to survive, and it requires the attention coefficients from v_5 into v_5 to be quite large. Instead, if you're at v_1, there are many, many paths your representation can take that allow it to survive. So in this diagram the blue starts to dominate simply because it has more paths that reach y_5 compared to the red, and you can see this nicely in the illustration. A corollary, and we have some experiments on this in the paper, is that copying the last token is harder for language models than copying the first, because the information from the last token has a much harder time surviving into the representation y_5.

We have many more details in the paper, but one of the takeaways is that phenomena like over-squashing and over-smoothing are actually also present in language models, and in many ways they're analogous. This connection between language models, or at least transformers, and graph neural networks has been known, but what's very interesting to me is that you can start studying these phenomena from the point of view of text: how they relate to tokenization, to text, and so on. And the fact that they then imply a failure case on some problem is, I think, something that people working on graph neural networks are quite interested in, as an application of these over-squashing and over-smoothing results.

Okay, and now I would like to talk, finally, about the softmax function. The softmax is really one of these key functions in machine learning: it takes a vector and turns it into a probability mass function, and it has nice properties, of course, relating to entropy and so on. And it has been shown in several previous works, and in this work, that some tasks may suffer from this dispersion of the softmax. By dispersion I mean that as you add more tokens, the softmax is forced to lower the attention to the existing tokens, simply because everything has to sum to one; and tokens cannot have zero probability, because the only way for them to be assigned zero probability is if their activation is negative infinity. Activations are, of course, all bounded, so this cannot really happen.

For example, in this plot, what we show is that for this max-retrieval task, as you keep doubling the sequence length, at the start your max retrieval correctly attends to the correct token, but as soon as you go into length generalization, so out of distribution, the softmax has to disperse more and more, until it's essentially uniform. This points to the fact that the softmax function makes it quite challenging to be robust at very sharp types of operations.

And a nice way to formalize this is perhaps through entropy. Entropy is really just measuring how close to uniform your distribution is: the highest-entropy distribution is exactly the uniform distribution. This Shannon entropy is defined on discrete distributions, because on continuous distributions it's actually not as well behaved, but the softmax gives us discrete distributions, so we're fine. Here what we plot is: we give Gemma some inputs and grow the underlying sequence size; the sequences are similar to the ones in our "Transformers Need Glasses" paper, so, can you count the number of ones, and you give it a growing sequence of ones. You can see in these plots that the entropy in the attention heads grows as the number of items increases, which shows how the distribution becomes more and more uniform simply as a function of the number of items given to the model. Sometimes this can be used to argue that things like copying, or other very sharp operations, cannot be robust unless things go to infinity or negative infinity, so that the softmax can build delta functions; but of course we cannot do this. In fact, what we show in the paper is that what really bounds you is the spectral norm of the query and key weight matrices: these are the ones that tell you how sharp your softmax can be, because, the softmax being translation invariant, all you really care about is the range that the softmax inputs can take, and this range is precisely given by the spectral norms of these weight matrices, because they define the range of the dot products.
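A tiny numerical illustration of this dispersion, with bounded logits, is sketched below: as the number of attended items n grows, the maximum attention weight shrinks and the entropy climbs toward that of the uniform distribution, log(n).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
for n in (8, 64, 512, 4096):
    logits = rng.uniform(-5.0, 5.0, size=n)   # bounded activations
    p = softmax(logits)
    entropy = -(p * np.log(p)).sum()
    # max weight decays, entropy approaches log(n) (the uniform value)
    print(n, p.max(), entropy, np.log(n))
```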

So, for example, we show that changing the temperature or similar tricks can be used to artificially sharpen the softmax, and if you're interested in this kind of work and in the relationship between softmax, entropy, and out-of-distribution generalization, I invite you to check out our preprint, this paper here. Okay, so just as a summary,

we studied decoder-only transformers, and we saw, for example, techniques like chain of thought, and we described how these can provide additional expressive power, not only empirically but also theoretically. We really focused on length generalization, which is the problem where you have a task whose number of inputs can grow and you want to be able to solve it regardless of the number of inputs, and we saw how CLRS-Text is a nice way to benchmark this; it's a natural paradigm to test length generalization. We also saw how position encodings play a big role in length generalization, which is, I think, a very interesting and active research area. Then there are things like the causal mask, and how it relates to over-squashing due to the topology, or representational collapse, so a kind of over-smoothing idea; and we also saw that there are issues in the softmax, the softmax being dispersive, which points to issues in generalization just at the architectural level. Of course, it's very interesting now to take these findings and try to improve current models, and that is what we've been trying to do in our work: point out issues and then try to somehow fix them, now that we have a better understanding.

I have also left a bunch of references; of course, these are not exhaustive, but hopefully they're a good way to get started: some on chain-of-thought expressivity, on RASP, on length generalization, on position encodings, and other miscellaneous ones you might find interesting. A lot of them we did talk about in the slides, but not all of them, so hopefully these slides act as a nice resource for people trying to get into this kind of research.

And now I'll hand over to Larisa for the Colab. So I'll stop sharing. Yeah, and thanks so much.

I think we can't hear you. Larisa, sorry, can you hear me? Larisa, sorry to interrupt you. Yeah, Larisa, can you hear me? Because we can't hear you. Let's try one more time, okay?

No, yes, perfect. Yeah, sorry, I tried to contact you and the organizers too. So is it okay now? Yeah, I can hear you loud and clear. Okay, great, sorry for this. What about screen sharing, does it work now? Yeah, we can see it.

Okay, let's start over again; apologies for this. So, okay, attempt number two: hello again, my name is Larisa,

I'm a research engineer at DeepMind, and today I'm going to show you a Colab which should gear you up with everything you need to conduct your own experiments with CLRS and large language models. To start, I want to emphasize that I use a Colab Pro version to prevent any preemptions, because fine-tuning a large language model is a compute-demanding thing; that's why I use Colab Pro. You can try the free version, it's not a problem, but sometimes you might get preempted and have to wait until the Colab scheduler gives you another batch of compute. For this Colab I use a T4 GPU kernel; you can find it if you click Connect and change the runtime. My setting is T4 GPU with the high-RAM option, because you usually need a lot of memory in these experiments and it's better to have more than less.

As the first step, we just install all the Python packages we need, and after they are installed, we are good to go for setting up the environment. First we are going to set up the Hugging Face token. Here's the instruction for how to do this: you need to go to the Hugging Face website, set up the token, and then put it into a special form, just to save a key which allows you to access the Hugging Face repository.

The reason we use Hugging Face is that it's actually an industry-level standard right now for large language models: it is the most popular library, it contains a lot of different models and a lot of different pretrained weights, and it's very easy to work with. That's why we chose it. As soon as you have your secret key installed and the Hugging Face token is loaded into the Colab secrets, you can use it: you just need to import the corresponding library and then ask it to give you the token.
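In a Colab this typically amounts to a couple of lines like the following sketch, assuming you saved the token under the name HF_TOKEN in the Colab secrets panel:

```python
# Read the Hugging Face access token stored in Colab's "Secrets" panel.
from google.colab import userdata

hf_token = userdata.get("HF_TOKEN")  # name must match the secret you created
```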

After this, we need to do some minor environment setup. It isn't very tricky: I just switch off the preallocation behaviour in JAX, to make more memory available on the GPU, because JAX usually tends to preallocate quite a lot of memory for its own operations, and we don't really need that here, so I switch it off by setting this value, and I create random-number-generator keys for further use. I'll run this for you. And finally, before I carry on to loading the model, I just want to make sure that everything is set up right and we are really dealing with GPUs, because this is very important; it's very hard to fine-tune a model on CPU, for example. As you can see, I checked our default backend and it is GPU, which means we're good to go.
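Those setup steps amount to something like the sketch below: disabling JAX's memory preallocation, creating a PRNG key, and checking that the default backend is a GPU.

```python
import os
# Stop JAX from grabbing most of the GPU memory up front.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

import jax

rng_key = jax.random.PRNGKey(0)   # random-number-generator key for later use
print(jax.default_backend())      # should print "gpu" if the runtime is set up correctly
```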

The loading and creation of the model using Hugging Face is pretty straightforward. You need to choose the model name from the Hugging Face repository index; in our case this is the Google Gemma 2B model. Then I set up some quantization parameters, which is just a technical part, and then I load the tokenizer. You can think of the tokenizer as a program which translates text into numbers for the model and vice versa, to do the backward transform after prediction. As you can see, I use the Hugging Face token to access the tokenizer, because it's loaded from the Hugging Face repository. After getting the tokenizer, I load the model: I pass the model ID, which is the name from the repository index, I pass the quantization parameters, name the device, and finally provide the access token so the function can load the weights. This model has only two billion parameters,

but in fact it takes around five gigabytes of memory, so yeah, it's quite big, and it usually takes a few minutes to load everything into your Colab. As soon as everything is loaded, we can experiment with it. So while it's loading, I can show you the test prompt. Once the model is loaded, you can ask it about something just to verify whether everything is fine with the model, whether our setup in terms of accelerators, GPUs, etc. works well or whether we have some problems. For this purpose you can ask the model something; in this case I asked the model what it would be if it could be whatever it wants to be, and funnily enough it started advertising some Netflix series to me, which is pretty funny.
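Put together, the loading step looks roughly like the sketch below; the model id and quantization settings are one plausible choice, not necessarily the exact ones in the Colab, and the token is the one read from the Colab secrets above.

```python
import torch
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

hf_token = userdata.get("HF_TOKEN")
model_id = "google/gemma-2b"                 # name from the Hugging Face repo index
bnb_config = BitsAndBytesConfig(load_in_4bit=True,       # quantization parameters
                                bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_token,
)

# Quick sanity check: ask the model something and see that generation runs.
inputs = tokenizer("What would you be if you could be anything?",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```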

Now, good news: we actually have our model loaded, and we can conduct our experiments. But before we try to teach the model something, we need data. So in the following section we are going to create our own version of the CLRS-Text dataset and then use it as a source of data for the fine-tuning process. Earlier in the talk, you might have noticed that CLRS has a variety of algorithms implemented, so it can generate samples for many different tasks to ask the model to solve; there are 30 different algorithms we can employ for this in CLRS. Some of them are pretty trivial, for example just searching for the minimum of an array, but some of them are pretty advanced; for example, we have optimal BST. There are many different things and different types: you have some sorting algorithms, string-related algorithms,

and many of the others are graph algorithms. But let's imagine we choose something and want to generate our first dataset. To do so, I provided a function which does this; it's defined there. I'm not going to go into depth about the code because it's quite self-explanatory; if you'd like, you can just read through it and use it. That's why I'll jump straight to the main call which creates the data. In this example we want to create a very simple dataset which comprises two algorithms, Bellman-Ford and BFS (which were in the previous tutorial), and use two lengths for each of these algorithms, sizes 8 and 16. For each unique combination of algorithm name and task length, we are going to generate around 100 samples.

Then, when all samples are generated, we split the initial big dataset into sub-datasets, where the train split takes 80% of all samples and the validation and test splits take 10% each. Once everything is generated, you receive the data object; this is a very typical data type for Hugging Face and PyTorch, and it contains the train, test, and validation datasets and provides all the interfaces you need to run the Hugging Face fine-tuning functions. So it's a nice wrapper which frees you from any hassle with data once you have it.
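A minimal sketch of how such a split can be assembled with the Hugging Face datasets library is shown below; generate_samples is a hypothetical stand-in for the sample-generation function defined in the Colab.

```python
from datasets import Dataset, DatasetDict

# `samples` stands in for the list of generated CLRS-Text examples,
# each a chat-style record with a user question and an assistant answer.
samples = generate_samples(algorithms=["bellman_ford", "bfs"],
                           lengths=[8, 16], n_per_combo=100)  # hypothetical helper

full = Dataset.from_list(samples)
split = full.train_test_split(test_size=0.2, seed=0)             # 80% train
holdout = split["test"].train_test_split(test_size=0.5, seed=0)  # 10% / 10%

data = DatasetDict(train=split["train"],
                   validation=holdout["train"],
                   test=holdout["test"])
print(data)
```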

And now let's look at the actual content of this data. There I print one sample, the first sample from the train dataset, and we can see that there are two records inside the sample: the first is marked with the user role and the second with the assistant role. The user text sets up the task, what we are going to ask the model, and the assistant record is the answer we expect to get. I'll explain why the data is formatted this way a little later. The main idea is that, since we have roles, this is a special way to signal to the model which part of the text is expected to be the user input and which part is expected to be the model output, the model's answer to the text. But obviously large language models usually take plain text as input, which is why we need a special function that turns this into proper text. For this purpose,

I apply a technique called chat templating; you can read more about it via this link. Generally speaking, what it does is take your initial sample and turn it into plain text, plus some special tags which add extra signals for the large language model. For example, here we have the same Bellman-Ford sample, but after applying the tokenizer's chat template to this data we have a nice plain text with special symbols: BOS means "beginning of sequence", so this marks the beginning of the sample. Then we have a start-of-turn and an end-of-turn; these two tags separate the different roles from each other and let the model understand where the input from the user is and where the part is that we suppose the model has already predicted and are trying to continue through autoregressive prediction.
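With a Hugging Face tokenizer this is essentially a one-liner, sketched below; the message structure mirrors the user/assistant records shown above, and the placeholder strings stand in for the generated task and answer text.

```python
from transformers import AutoTokenizer

# The Gemma tokenizer loaded earlier (gated repo, so the HF token may be required).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

messages = [
    {"role": "user", "content": "bellman_ford: ... (the task description)"},
    {"role": "assistant", "content": "... (the expected answer)"},
]

# Turn the role-tagged records into plain text with the model's special tags
# (BOS, start_of_turn / end_of_turn for Gemma-style templates).
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```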

So now we have our text formatted, and the only thing left to do before we start the fine-tuning process is to define a function which will automatically convert an original sample into the tokenized version I just showed you. For this purpose I defined this function; it's very simple, I just take the input sample and convert it. Let's see how this works: I took the first sample of the train data, and, to be fair with the model, I don't want to show it the answer, so I remove the assistant answer from the sample and convert it into text, and I get a really nice plain-text prompt. Then I feed it into our model, and as you can see, things don't go well, because the model has never seen this format before and has most probably never tried to solve a problem like this. It makes an attempt to answer our question, but utterly fails. That is to say, we might want to have a model which is specialized,

that is to say, fine-tuned on our data, to conduct our further experiments. So for this purpose, we are going to run a fine-tuning process using LoRA. LoRA is a low-rank adaptation technique which allows you to fine-tune a large language model relatively cheaply: if you try to fine-tune the entire model, it has so many weights that it most probably won't fit into the accelerator, and it might take a really long time to train such a big model from scratch. That's why people came up with a simpler and cheaper way of doing this. First, we only train a subset of layers, and for each such layer we take the layer and freeze it: we don't change it during training. But in parallel to this layer, we create two smaller layers. The outer dimension of these layers is the usual size D, but the internal dimension is a bottleneck of size R, and R is usually extremely small; that's why you have "low rank" in the name "low-rank adapter". You have two sequential layers of low rank, and in fact you train them instead of the frozen weights. As you pass data through the frozen layer and through this pair of layers during the fine-tuning step, you just sum the outputs together, and this is your new representation, which is passed on to the next layer. When the fine-tuning process is over, you multiply A by B and you get a new matrix of size D by D, the same size as the original weight matrix of the layer, and add it to the original frozen weights; you then have new weights for the layer, which you replace in the original checkpoint, and that is your fine-tuned model.
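In equations, the LoRA idea for a single frozen weight matrix is (a standard formulation, not specific to this Colab):

```latex
% Frozen pretrained weight W_0 \in \mathbb{R}^{d \times d}; trainable low-rank factors
% B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times d} with r \ll d.
h = W_0 x + \frac{\alpha}{r} B A x
% After fine-tuning, the adapter is folded back into the checkpoint:
W' = W_0 + \frac{\alpha}{r} B A
```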

Youll find your there。So and actually when you have your model loaded。

when you have your data prepared, it's pretty simple to kick off the process。

you actually define Laurara config where you say how small the rank of your lower rank adaptator is。

What kind of layers you are going to train and what kind of layers you are going to train for more details I encourage you actually to go to this tutorial which describes in depth all possible detail and tweaks you might use to find you the model。

Yeah, and as soon as your lower confi and the rest of the things are assembled together。

you can just kick off the SFT trainer function and train your model and it takes approximately maybe 10 minutes or even less four minutes。

To see the results so there we did very little steps, we actually did 200 steps。

but we can see that the last function went down and that is to say that you know model inhibited some changes in this behavior otherwise for our loss function won't change。
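The kick-off looks roughly like the sketch below, using the peft and trl libraries; the rank, target modules, and step count are illustrative choices, the exact trainer arguments vary a bit across trl versions, and model, data, and to_chat_text refer to the objects and conversion function created in the earlier steps (the last name is hypothetical).

```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank adapter
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # layers that get adapters
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,                           # the quantized Gemma model loaded earlier
    train_dataset=data["train"],
    peft_config=lora_config,
    formatting_func=to_chat_text,          # sample -> chat-templated text (defined earlier)
    args=TrainingArguments(output_dir="gemma-clrs-lora",
                           max_steps=200,
                           per_device_train_batch_size=2),
)
trainer.train()  # a decreasing loss indicates the adapters are changing the behaviour
```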

And now we want to see the difference between what we used to have, that pretty strange answer, and what we have now. For this purpose, let's feed one more sample into our model. As we can see, the model (we've only done a few steps) may not get the answer right, but it is now doing its best to follow the structure we provided, even after very few steps; 200 steps is very little for the model. For some reason I don't know, the model also decided to identify itself by a different name than before; let me show you this place. But overall there is definitely a positive tendency in its responses, so if you fine-tune your model for longer, you will definitely get much, much better results.

And finally, I just want to show you a few tools which might be useful for your research. For example, you may have seen the beautiful pictures in Federico's paper; many of them were obtained with the Penzai library, which allows you to visualize the internals of models, so you can see, for example, the weights of heads, the activations, and so on. This helps a lot in understanding what's going on inside and why the model responds to your question one way or another. The library works not only with Google models but with all models presented in the Hugging Face repository, so if you'd like to do similar visualizations for your research, definitely give it a shot. And the rest is just a very short piece of code that allows you to save your model after fine-tuning, load it back, and feed data into it again, for example to continue your research and your investigations into reasoning and related things.

So actually, this is all I wanted to show you. I hope you enjoyed our presentation, and now I think I hand over to Alex and Wilfred, right? So, yeah.

Can you hear me well? Yes. Thank you, and thank you for attending the tutorial. In this part I would like to talk about neurosymbolic reasoning as well as neural algorithmic reasoning, specifically on language and graphs. In this short and final part of the tutorial, we will tie both parts together and describe how GNNs can alleviate several of the issues

of the decoder-only transformer. My name is Alex, and I'm a research engineer at Google DeepMind. I would like to start with motivation. We are all experiencing how these models are coming into every corner of society, but surprisingly they are very unstable and fragile when it comes to out-of-distribution reasoning. One of the recent papers, GSM-Symbolic, makes a very simple modification of the GSM8K dataset, basically just varying names and values, and we see a huge drop in performance for all modern large language models; even worse, with just irrelevant information added to the problems, performance drops by up to 65%. And that is only a slight shift of distribution; if you consider harder reasoning problems, for example chess or puzzles, we always see that as we scale the problem to bigger and bigger numbers of combinations, the performance of such models drops rapidly.

like performance of such models drops rapidly。😊,So and I think there's certain similarities with how people perceive reality and how they think。

like just just small just example, like go on the first board。

like if it's like average probably co, we will spot immediately that he or she can fork queen。

But on the second word when we need to have three moves to basically do the same it's already harder and requires something so it's deliberate thinking and like point in and reasoning so it's certainly true that like more and more data I mean there in being a training like becoming a master Grand master you can start recognizing this position the same proof is LOMs with a huge amount of data you definitely like create the bar and you can。

kindind of automatically detect and solve certain problems。

but this does not imply that you solve reasoning because like you still can face bigger problems or still you can go out of distribution so I would claim that if any size of whs that will be always problems that unsolvable like and go puzzle always we will need to find like certain solution to like reasoning it definitely there are many ways of doing this but the problem is there and we wont I think go is try to solve and the problems which are not entirely training set so basically going out of distribution another small example I think there was recently is zebra logicic benchmarking logic or reasoning ability of language models which measure different performance of different language models on grid puzzles so grid puzzle is。

A grid puzzle is a simple puzzle: a bunch of logical statements, and you need to find, for example, a matching, like who lives in which house or who owns certain things. Testing on humans, we already see that as the dimensionality of the puzzles grows, the time required for a human to solve them grows dramatically, so we would expect the same for LLMs: they need more resources or better algorithms to solve this problem, or probably both. We can take inspiration from the fundamental work of Daniel Kahneman on thinking fast and slow, which defined the concepts of system one and system two. System one is the fast, automatic, quick system, which operates effortlessly with no sense of voluntary control; this is the intuitive side of humans. System two requires attention, effort, and logical step-by-step reasoning, so it's frequently associated with agency or concentration.

So, combining those, and assuming that LLMs cover a significant part of the fundamental pattern-matching side, I think reasoning still has a way to go. I would give two quotes. One is from Leslie Valiant, who made fundamental contributions to learning theory: he claims that the most fundamental aspects of intelligent cognitive behavior are the ability to learn from experience and the ability to reason from what has been learned. And a quote from Yoshua Bengio, which is maybe old but I think still holds true, is that current training systems are weak once they are required to generalize beyond the training distribution.

If we briefly look at the characteristics of system one and system two, we can see that they are very much complementary. If system one is fast and in a sense implicit, system two is slow in the sense that it requires multiple steps or is algorithmic in nature. Similarly, we may think of system one as parallel processing, while system two usually has a sequential component to it. The same goes for compute: system one is roughly fixed compute, while system two definitely requires a variable amount of compute depending on the problem sizes you need to solve. Even an energetic human, with an average reaction time of around 200 milliseconds, definitely cannot solve a complex problem in a couple of hundred milliseconds; you need to apply deliberate, logical, step-by-step reasoning. And the same goes for in-distribution versus out-of-distribution behaviour, pattern matching, and so on. So I think it's a very nice idea

so like I think it's very nice and kind of a nice idea to。😊。

To combine those systems togetherza and definitely was done before in many different ways。

I would argue that like Al zero system like which combines some Montetecarro research。

And like deep learning network is part is one of the approaches also like tuuring machine narrow tuuring machines。

which emulate。Algorithms or alpha code maybe we can in sense things that like system one construct system two。

So there are many definitely many different approaches。

but many of them are very fruitful and promising and today we want to work in one of those like we work from our group where it now and Vi is going to present in detail this work So yeah please please take driver。

Can you hear me? Yes, we can hear you. Okay, so hi everyone, I'm Wilfred, a research engineer in the foundational research unit at Google DeepMind, and today I'm going to... Can everyone hear me? Yeah, we can hear you, but we don't see anything, just you talking. I can't see my slides; I'm not sure, Alex, are you still in control of them? Sorry, we can't hear you. Can you hear me? Yeah, we can hear you, we just see your face, not your slides. My slides? Okay, let me just share the slides then. I'm not used to Zoom, so it's a bit... it should be a green "share screen" button at the bottom of the interface. So can you see them? Yeah, great, we can see it.

Okay, perfect; sorry, I'm really not used to Zoom. So as I was saying, hi everyone, I'm Wilfred, a research engineer in the foundational research unit at Google DeepMind, and I'm going to talk to you about our work on augmenting large language models with neural algorithmic reasoners.

The motivation behind combining LLMs and NARs is pretty straightforward. LLMs, as you all know, are really good at communicating in natural language, but they still struggle with robust algorithmic reasoning; on the other hand, NARs have been shown to be quite good at this kind of robust reasoning on symbolic, graphical inputs. So the big question we tried to answer was: can we get the best of both worlds, namely, can we get a model that robustly solves algorithmic reasoning problems specified in text? To answer this question, we proposed a hybrid model whose architecture is depicted here: it interleaves transformer layers with the NAR and implements a unidirectional cross-attention mechanism between the NAR's nodes and edges and the transformer's tokens; essentially, the cross-attention updates the transformer token representations using the computations performed by the neural algorithmic reasoner.

The two datasets we used for this study were CLRS-30 and the CLRS-Text dataset, which I'm assuming you are familiar with by now since they were covered earlier in the tutorial, but if not, we have a reminder of what a sample looks like. Here you can see a sample for the insertion sort algorithm: the green part corresponds to the input, the traces are in blue, and the target against which we evaluate the model is in red. For this study we didn't really use traces; we were trying to predict the target, so the output given the input directly.

We built the TransNAR from a transformer model of 70 million parameters equipped with RoPE plus randomized positional encodings; Federico mentioned how these randomized positional encodings can help with out-of-distribution generalization, and in this study we wanted the transformer baseline to be as strong as possible in terms of out-of-distribution generalization, so we equipped both models with randomized positional encodings. We trained both models, the transformer and the TransNAR, for seven epochs on the training data, which included training sizes of 4, 8, and 12. We then evaluated both models on out-of-distribution sizes: 10, which tested the interpolation capabilities of the models, and 14, which tested the extrapolation capabilities.

We considered both the case where the base model was pretrained and the case where it was not pretrained, so randomly initialized; in practice this corresponds to whether you are in a fine-tuning regime or in a from-scratch training regime. Our results showed that the TransNAR significantly outperformed the baseline transformer on most categories of algorithms, as you can see here, in both the extrapolation and the interpolation regimes, and this was consistent whether or not the base transformer was pretrained.

So at this point we've established that we can indeed build a hybrid model that is good at solving algorithmic problems specified in text. But the obvious issue is that it requires two input streams at inference time, text and graph, when for a lot of problems of interest we only have one at inference time, usually text. So the question is: can we do better, can we get rid of the second input stream, can we get rid of the requirement of having this accompanying graph representation of the problem? To this end we explored distillation: we tried to see if we could distill the TransNAR model into a pure transformer model in such a way that the transformer recovers the improved out-of-distribution capabilities of the TransNAR.

So here are some of the experimental details for the distillation. As I just mentioned, the teacher was the TransNAR model we had just trained, so the one trained on sizes 4, 8, and 12. The overall student loss was computed as a convex combination of the ground-truth next-token prediction loss, restricted to in-distribution problem sizes only, and the teacher distillation loss, a cross-entropy between the teacher logits and the student predictions; these were weighted in a convex linear way with a factor alpha.
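As a sketch, the combined objective can be written as follows; the tensor shapes and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids, alpha, in_dist_mask):
    """Convex combination of ground-truth CE (in-distribution only) and teacher CE.

    student_logits, teacher_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len);
    in_dist_mask: (batch, seq_len) float mask, 1 where ground-truth supervision applies.
    """
    # Ground-truth next-token loss, only where the sample is in-distribution.
    ce = F.cross_entropy(student_logits.transpose(1, 2), target_ids, reduction="none")
    gt_loss = (ce * in_dist_mask).sum() / in_dist_mask.sum().clamp(min=1)

    # Teacher supervision: cross-entropy between teacher distribution and student.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    distill_loss = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()

    return alpha * distill_loss + (1.0 - alpha) * gt_loss
```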

The student training sizes included 4, 8, and 12, for which both the ground-truth and the teacher supervision were provided, and we also provided teacher supervision for sizes 10 and 14, so the teacher could show the student how to behave in the out-of-distribution setting; but because we provided signal at these sizes, we call this regime "soft OOD". The student test sizes included 6 and 10, which tested the interpolation capability, and 14 and 16, which tested the extrapolation capability; further, because 10 and 14 were seen through the teacher's logits, we refer to them, as mentioned above, as soft out-of-distribution sizes. We evaluated on these soft out-of-distribution sizes, but we also evaluated on the sizes truly never seen by the student, 6 and 16; these were never seen by the student, neither via the ground-truth loss nor via the teacher supervision, so they are the true out-of-distribution sizes.

So this is it for the experimental details of the distillation experiments. And here are the results, again shown for the various out-of-distribution regimes, including the interpolation regime, which here corresponds to 6 and 10, and the extrapolation regime, which corresponds to 14 and 16. We already mentioned that 10 and 14 are the soft OOD sizes, because we provided teacher supervision for these lengths, but we also have 6 and 16, for which we never provided any supervision signal to the student, so these are the true out-of-distribution test, if you will. What you're looking at here is a comparison of the performance of the different models depending on the value of alpha, where alpha is simply the coefficient of the teacher supervision in the final distillation loss. An alpha of zero means there is no teacher supervision, and it is therefore the baseline: a pure transformer model trained on the in-distribution cross-entropy loss only. Any value of alpha greater than zero is a distilled model, because it benefits from the teacher supervision. The baseline here corresponds to the red bar, and you can see that, pretty much across the board and for the various algorithms considered, the red bar is most of the time the smallest, which means the transformers distilled from the TransNAR are most of the time better than the regular transformer.

So at this point we've shown that we can build this TransNAR hybrid architecture that performs better than the transformer, and we've also shown that we can use it to train a pure transformer model that is better than a transformer trained only on the in-distribution sizes of your data. Let's now look at the limitations of the approach. The first one to highlight is the dependence on initial graph representations: even though with distillation we were able to drop the graph input at inference time, we still need graph inputs to train the TransNAR before distilling it into a pure transformer. So we still depend on having one-to-one graph inputs, graph inputs that map one-to-one with the text inputs, and that's not always the case; that's one limitation of the study. The second limitation regards the distillation experiments that we ran: you could see in the results that it's not always clear which distillation factor to use; it looks like it depends on the task, and it wasn't particularly clear how to determine it automatically. So a natural follow-up is to investigate how to find the optimal alpha for a given scenario, and also maybe to consider ensemble decoding techniques, where, for example, you would train multiple distilled models with different alphas and then decode from this ensemble of distilled models using techniques usually used in the field, such as majority voting, or even doing something on the weights of the different models in your ensemble: you can, for example, do weight averaging and then decode from the model that results from this averaging. So yeah, exactly how to choose the distillation factor, and more generally how to deploy the distilled transformer models, is still open.

Okay, so, moving on to the conclusions and next steps. Just as a reminder, we have introduced the TransNAR model, which is a hybrid architecture combining the language understanding of a transformer with the robust reasoning of a pretrained GNN-based neural algorithmic reasoner. We have shown through evaluation on CLRS-Text that this TransNAR is much better than a transformer in terms of out-of-distribution generalization, and we've also shown that we can distill the TransNAR into a transformer-only model which is also better than a transformer trained only on in-distribution data. We hope that future work will investigate extensions of interest, notably in mathematics, logical inference, and commonsense reasoning. That's it, thank you. So now I will show a small demo Colab integrating the TransNAR and the Gemma model. Give me a second, I will try to share the screen.

Okay, so in this Colab we will have a small demo of how we can combine the Gemma model and the TransNAR-style GNN-based architecture. It's based on the work that was just presented. The first things we are going to do are the same: we need to install a bunch of packages and the PyTorch dependencies. We do need to use a Hugging Face token, because Gemma is an open-weights model hosted on Hugging Face, so to access and download the model we need an HF token; you can configure it here, and if you have a Hugging Face account, you can get one from your settings under tokens. Then we download the packages and set up the environment the same way.

so most of experiments in the paper were done on 70 million model here but the general is bigger just few parameters you can is this a2 billion model2 billion parameters。

meaning I embedding parameters。It has the 18 wire,8 heads and like 2568 head size。So we okay。

Okay, let me double-check whether it has downloaded already。Yeah, so after loading the model we can inspect its content and structure。As you see, it is composed of 18 Gemma decoder layers; each layer is an attention block plus a feed-forward block, with rotary embeddings and RMS normalization。You can briefly explore the configuration as well through the model config。After loading the model, we can run a small test to verify that the model works as expected; let me run it。Yes, it may take a little bit of time。

Okay, so in the meantime we can explore the model and the parameters of the pretrained checkpoint。

Okay, let me check the status。Okay, the model and checkpoint look fine, so let me double-check。Yeah, okay, so inference does work: we use the generate method, and if we supply a prompt we get something out。
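The inference check mentioned here is essentially the following (the prompt text is arbitrary, and the exact decoding arguments in the notebook may differ)。

```python
inputs = tokenizer("Sort the list: 5, 2, 9, 1.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```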

So now we want to quickly integrate the TransNAR part。It is a bit different from the paper, because we are using a pretrained model: we just monkey-patch the Gemma forward method so that we can supply the TransNAR hidden states。

In the original paper, there is a frozen GNN-based algorithmic processor that provides information to the corresponding layers in the transformer; here we simplify things and just use precomputed NAR hidden states stored as input data。What we will do is create, basically as in TransNAR, a cross-attention block after each layer of the transformer。So we can simply insert a bunch of these layers, which doubles the number of layers, and we can verify that the model has changed。

Checking, yes: we now have a causal LM model in which, after every second layer, there is a cross-attention layer that provides information to the transformer。In this cross-attention, the queries come from the token stream and the keys and values come from the GNN embeddings。
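A minimal PyTorch sketch of what was just described: a cross-attention block whose queries come from the token states and whose keys and values come from precomputed NAR embeddings, interleaved after every decoder layer。The module below and the surrounding wiring are simplified placeholders, not the notebook's exact monkey-patch。

```python
import torch
import torch.nn as nn

class NARCrossAttention(nn.Module):
    """Cross-attention from token states (queries) to NAR node embeddings (keys/values)."""
    def __init__(self, d_model: int, d_nar: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_nar, d_model)          # map NAR embeddings to model width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, nar_embeddings):
        g = self.proj(nar_embeddings)                   # (batch, num_nodes, d_model)
        out, _ = self.attn(query=x, key=g, value=g)
        return self.norm(x + out)                       # residual connection

# One trainable cross-attention block per (frozen) decoder layer.
# `decoder_layers`, d_model=2048 and d_nar=128 are placeholders for the Gemma/NAR internals.
cross_layers = nn.ModuleList(
    NARCrossAttention(d_model=2048, d_nar=128) for _ in range(len(decoder_layers))
)
```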

Okay, so we can maybe do a few checks。These are the input IDs; this is what the tokens are converted to。After tokenization, every token gets an ID (in this Gemma model the vocabulary has about 256,000 tokens); each ID then goes through the embedding and is provided as input to the model。So here we have the precomputed NAR activations。

Right now they are on Google Drive, but they will probably move somewhere else, because these are the TransNAR activations, so you can store them together with the data。

The data is basically CLRS-style data, which consists of a query, an answer, and the algorithm name。Give me a second: we can load the training and validation datasets and verify their content。

And using this data we can basically run our model, and with an appropriate loss we can also train it。Okay, just a few checks: we can get the data (the formatting is the same as what was used in the original work), tokenize this formatting, go through the forward pass, and add a small training loop here。
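The small training loop mentioned at the end could look roughly like this; the dataloader, the patched forward call, and the freezing of the base model are simplified assumptions rather than the notebook's exact code。

```python
import torch
import torch.nn.functional as F

for p in model.parameters():
    p.requires_grad_(False)                      # keep the pretrained Gemma frozen
params = [p for layer in cross_layers for p in layer.parameters()]
optimizer = torch.optim.AdamW(params, lr=1e-4)   # train only the cross-attention blocks

for batch in train_loader:                       # assumed CLRS-style batches
    input_ids, labels, nar_embeddings = batch    # prompt tokens, answer tokens, NAR activations
    logits = patched_forward(input_ids, nar_embeddings)  # placeholder for the monkey-patched forward
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```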

Yeah, I think that is basically everything I wanted to show。I think TransNAR is very promising and, as Wilfried described, there are a lot of extensions we can build on it。

All right, so that was the last part of the last session of our tutorial。We have finished with about 30 minutes to spare。We have tried to be quite active in answering questions live on Slack, but if there are any other things you would like us to address while the whole team is in the virtual room, so to speak, please let us know; we would be very happy to talk more about this。As you have seen throughout these two and a half hours, we are generally very passionate about reasoning, so this is a topic we love to talk about at length。

And we hope you enjoyed it; please do leave some feedback at the end。

That will really help us prepare future editions even better。Yeah, absolutely, thank you very much, and thank you also for finishing on time。While we leave some time for the audience to type questions, maybe I can go with a very broad, general question: what are, in your opinion, the next steps? Where do you see this going between now and next year?

Okay, that is a great question。It feels like something all of us should try to answer, so maybe I will start and then we can go in the order in which we gave our answers: Olga, Federico, Alex, Wilfried。So, in my opinion,

well, what I am personally most excited about is the future of these NAR methods and their applications。

I feel like with LLMs, as we described in Federico's part and also through the TransNAR that Alex and Wilfried talked to you about, we have started to show how a tiny GNN, which in reality is much smaller than what a big language model might contain, can carry a disproportionately large amount of knowledge in a way that does not have a lot of bottlenecks and can therefore be more readily used by these models。So already with very small GNN-style interventions you can make a big difference in the reasoning capability of the big models that are currently used in practice on arbitrary data。I am generally quite excited about what we can do when we alleviate those bottlenecks for real, especially around long-context queries。TransNAR, you should think of it as the first prototype in hopefully a series of approaches that try to address this problem。We do not think of TransNAR as the exact solution, but we do think of it as something that

will hopefully inspire future solutions。So I would say look out for more of the same; I am really curious to see what will happen there。That is my opinion at least。Olga, would you like to say a few words?

Sorry。I will try; hopefully you can hear me correctly now。

So I am not much of a researcher myself, to be honest; I am a software engineer, and I do not have deep knowledge of this topic。From my personal perspective, I am just excited to observe what people in the area will continue to do and what they will achieve。

And at the same time, I wish that there will be good engineering practices: that the code will not be painful to work with, that you can actually go to the code base mentioned in a paper, launch it, and it will work, and that there will be documentation and comments。I just wish everyone a good engineering day-to-day life, and I hope this will become better, not worse, over time。

I am jumping in from Alex's computer。I think there is a lot of need to understand to what extent the models we currently have are capable of performing robust computation, because to me it is not super clear。I feel like a lot of the work we have done shows that at some point things break, and figuring out how to fix that is, to me, the most interesting next step。I think it is not obvious at all, but that is currently what I am interested in: this combination of better understanding the failure cases and then pushing forward given that better understanding, essentially improving given our better understanding。

Okay, I hope the sound works now。I think I am still talking through Alex's laptop, right? No? Okay, fine。Yes,

speaking about my perspective, I totally agree with Federico: it is very interesting to find blind spots or weak spots of the model and come up with architectural solutions, because right now we are very, very restricted by this autoregressive paradigm。

Another thing which I find personally very interesting is mechanistic interpretability techniques, which allow you to understand better what is going on inside the model。Because right now, despite there already being a pretty solid body of literature about how transformers work, it is still sometimes wandering in darkness: you do not really know what is going on inside, you just feed something in and hope that you receive some meaningful answer from the model, and possibly hope that you will be able to fix it on the fly。I do not think that approach is reliable for systems where quality, interpretability, or the decisions themselves are critical; I do not think that is a sufficient level of reliability right now。Over to you, Alex, I guess。

Yes, okay, I am back。

I think we have an ultimate and very interesting goal of solving out-of-distribution reasoning, and as Federico and Larisa said, we definitely need a much better understanding of the limitations of the architecture。We do know that small things, for example the position of an operand or the way numbers are tokenized, sometimes make a significant difference。

I do not believe that such incremental fixes will always be the way。So I would really envision some hybrid architecture

which would be very powerful in terms of reasoning but also very powerful in terms of pattern matching, and I would like to see the whole system tightly and maybe elegantly integrated。TransNAR is a great system, but maybe it can be made simpler and at the same time more powerful。So I do believe that there is still a lot of work and a lot of interest in this area。

Thank you。And Wilfried, do you want to add your opinion on this? Any thoughts from Wilfried? Are you muted?

Yeah, sorry。I think I mostly share Alex's perspective on this, and it also echoes what Federico was saying: we exhibited a few pitfalls and the fundamental reasons why those pitfalls exist, and TransNAR alleviates them。But maybe there need to be some more fundamental architectural changes。It would be great if models were designed in a more principled way, so that we actually know what they can do earlier in the process; discovering pitfalls post hoc is a bit unsatisfying, and if you build models from theory, you have a better understanding of what they can do earlier on。More related to what we have been working on, I am very excited about efforts like what we did with TransNAR, because I think it is a more principled way of bringing in robust reasoning abilities, as opposed to patching things up with more high-quality data: we are really borrowing stronger capabilities from models that have been shown to be better aligned for reasoning。For me that provides more guarantees than trying to build better data mixtures and so on。So I would really like to see this kind of idea investigated more, and also applied more widely than on the CLRS-Text dataset,

which is obviously a great starting point。It would be great to see how to deploy these ideas on datasets where the concept of an accompanying graph, a problem-defining graph, is not so clear。Obviously this is challenging, because how do you do that? But it could open up, as I was saying, more principled ways of making our language models more reasoning-aligned, and it would also be more cost-effective in the long run, because the way reasoning is currently pursued is not particularly cost-effective: people are scaling up compute, not just at training time but also at inference time。Maybe we could circumvent all that; maybe with better models we would not need all those things。

That is awesome。All right, great, so thank you all for giving those great answers。

I am actually particularly happy that Olga called out the engineering standards in current code bases。I think you can definitely tell the difference, when you want to make something principled, between a code base which is easy to work with and one which is not, and we are trying to play our part as much as possible in upholding those standards。

CLRS itself is a really huge library, at least by machine-learning-benchmark standards, and we are trying our best to make it accessible。We always receive a lot of feedback, which is very useful, and I guess we will use this opportunity to say: if anything we said today inspires you to try out CLRS or one of its variants and you have any hiccups, even if you manage to do what you wanted to do, please let us know。We are always on the lookout to make the benchmark more accessible and easier to use for more and more people。

Yeah, I think there are plenty of things to do, both from an engineering point of view, from an algorithmic point of view, and also from a task-design point of view, so thank you very much for your opinions on this。Maybe one very quick final question; you do not need to do a round of answers, whoever feels like it can answer, although I am very happy to hear multiple points of view。Regarding the choice of architecture, or the proposal of new architectures that are better aligned with reasoning or with the inductive bias you aim to achieve for your task:

where do you see the prior human knowledge that we have about the task? A very frequent trend in, for example, deep learning is to rely as little as possible on human knowledge and let the network learn by itself。Where do you stand on this? Do you think that for now we should find better ways to integrate human knowledge, human priors, or human constraints, or do we just hope the networks learn those from data?

Honestly, I do think this is the kind of question where more people will want to answer; I think it is a pretty interesting question。We actually discussed it a little bit in Slack as well, so it is a great question, thanks for asking, Steve。

I will give my personal point of view, but I think all six of us came into this field from different perspectives, so there is a chance that others will vehemently disagree with what I am about to say, so please feel free to shout if that is the case。I would say, first of all, that what you say is completely right: when you look at a lot of the things we talked about in the tutorial today, they boil down to understanding your problem and then applying that understanding to design a better architecture, a better training-data featurization, a better gradient-descent or training-data regime。Sometimes there is really heavy math behind that particular step, but for all intents and purposes you could summarize it as:

understand your problem, and then use that understanding to do better。I think there are some really cool papers, which we did not mention today but which were mentioned in the original NAR blog post two years ago, that basically say this is inevitable if you want out-of-distribution generalization。I am referring to previous work from Bruno Ribeiro's group, with Beatrice Bevilacqua and others, on size generalization, which basically says that in order to generalize you need a causal model of your test data。In the extreme this is literally a causal model of your test data, but if you want generalization to arbitrary test data, then it translates into some causal assumption about either the structure of the task or the structure of your model, or more often than not both。So I would say it is inevitable。Now,

obviously we are not just keen to solve the algorithmic challenge; remember that at the beginning I motivated all of this by saying we want the architectures to be a bit closer to algorithmic computation, but we do not want them to become algorithms。There are many reasons why algorithms are limited, and we do not want the models to inherit those limitations; rather, we want them to take the good parts of algorithms while not losing what makes them good at handling various kinds of generic, noisy data。

So what I would say NAR is all about, and where I stand in this area, is that we are treading an interesting line between models becoming more specialized for the task and models becoming more general。Our ethos is that there exists a sweet spot where the model is just specialized enough to be hugely more beneficial with the same number of training data points you have for your problem, while still being general enough to be applicable to, say, general language problems, which is where TransNAR, for example, comes in。So we believe there is a sweet spot, and that sweet spot is not reached with just plain transformers。In fact, when you think about it,

transformers already incorporate some symmetries: there are explicit object boundaries, and attention is inspired by key-query mechanisms; this kind of content-based attention was already present in neural Turing machines, which predate transformers by a few years。So computer-science-inspired priors have made their way into

all architectures, including the transformer itself, so why would this be the global optimum? Why can't we improve some things a little bit more while staying generic? One good example of this is our asynchronous algorithmic alignment paper with Andrew Dudzik, Razvan Pascanu, and Tamara von Glehn, where, as I described briefly in the tutorial, we build an architecture that is invariant to asynchronous execution order while still being synchronous itself, so it still runs on a scalable GPU or TPU, but you get this very nice symmetry property that would otherwise not be satisfied。So I would say that is, for the most part, what the philosophy should be in the field, but I would like to see if my co-presenters feel the same way; maybe I will prompt Federico to speak first。

Yeah, I agree。I think the sweet spot is a good way to put it。What came to mind is maybe this strange analogy: if I am building cars and I want a car that can both move heavy objects and go reasonably fast, I build a pickup truck; but if I just want a car that is very fast, I build a Formula One car, and if I want a car that moves heavy objects, I build a tractor。Very specialized things will work better for the things they are designed to do, but I view these directions as somewhat orthogonal: you can either go in the direction of making a hyper-specialized system that is very good at

doing algorithmic reasoning, or you can go in the other direction and just make a generally nice car。

So yeah, I think the two views are complementary, but it makes sense that your pickup truck will never be as fast as a Formula One car, and your Formula One car will never be good off-road。So yeah, it is probably just a game of trade-offs, I guess, as you were saying。

Any thoughts from anybody else on the team?

Yeah, I think there may be a Pareto front here: we still do not want to build a thousand different car models。Even, if I remember correctly, in reinforcement learning, if you take environments that are very far apart, it is hard to make one system work for both; but for families of similar environments they benefit each other, because you transfer something from one to another and in the end you get better performance。And the better we understand the system, the better we understand its limitations, and I think this sphere is becoming bigger and bigger。It is probably impossible to build a universal system, but there may be certain optima that cover huge areas of expertise。So I still believe we can do better: the transformer is a great model, but for certain problems I think we still need to do better。Even all the work on tool usage, with LLMs calling external tools, shows that there is

demand, and that there exist problems people want to solve。

There are definitely different ways of doing it, but I think having a good understanding of neural architectures and trying to find an optimal one is a very interesting and very promising direction in general。

Yeah, very interesting, very interesting。Does somebody else want to give an opinion? Does anyone else want to speak, Larisa, Wilfried, any thoughts, or are you just happy with how we have described the entire field?

Okay。Yeah, please。Okay, so I think that we can end it here and leave a ten-minute break before the next session of orals, so that the audience can also relax a bit and grab a cup of coffee or a snack; I will personally go for a snack now。I want to thank all the speakers again very much for their very insightful and very interesting tutorial。

I hope the audience enjoyed it as well, and I will say: see you in ten minutes from now, or rather nine minutes from now, for the next session of orals。Thank you all again。

Thank you so much, everyone。Hope you enjoyed it。Okay, I guess let us hope the next speaker joins at some point。

I would not attempt to pronounce that name when announcing the paper。Oh, here we go。Let me just make you co-host, in any case。So my notion was that Decomposing Force Fields as Flows on Graphs is the first paper we are having now,

but maybe that is okay, maybe I wrote it down wrong。Sorry, no, no, no: I thought the gradient scarcity paper was first。I mean, yes, it is gradient scarcity, so that is on me。Well, this is great, because Nicolas I can actually pronounce。

Good。But you have a minute to practice the other one。Okay。

Have you been enjoying the other sessions of this event? What is it called, the LoG conference?

Yeah, yeah。I mean, there was a poster session on Wednesday; yesterday I could not make it, but today we had the Paris local meetup。Oh, nice。Is Alexandra still organizing that one? Yeah, together with several of us; I was also an organizer。It was at Télécom, and it was really good。

That's nice。Paris now has, or rather has always had, a whole bunch of people in this area。

I wanted to add: make sure to send us some pictures from the meetup so that we can collect a very nice portfolio of meetup pictures。Oh yeah, yeah, I know some people took pictures; maybe there is something on Twitter, I am not sure, but I will definitely pass the message on if they are not here。

I think the local meetups are probably the nicest part of the LoG conference now。I would say I like it: it is a nice combination of meeting in person and seeing people from afar as well。

Let us see what the in-person version next year brings to the table。

I definitely hope that local meetups will still happen even though the main conference is in person, because not everybody is able to travel。

So I would be very much against abolishing the local meetups, because I think that is kind of the point of LoG, being accessible to everyone。Yeah, yeah。Okay, but then, Steve, do you think we should get started with the oral, or what is your take?

Maybe we can try to wait one more minute; I just made an announcement on Slack, so let us see if we manage to catch some more people, and then I think we can start。What do you think, do you want me to start screen sharing to test things? It will not hurt。

Okay。Sharing the whole screen is usually safest。Yeah, is it good that way, if I share the screen like this? Yep, we can see the whole screen。It is just that at the top we have the black bar of the LoG meeting console。

You can press hide there if you want, but it does not really matter。Do you see the hide button now, in the menu that you just opened? Yeah, okay, okay, it was in the menu。Hide video panel。Oh no, that is hiding something else。Whatever, I think if I just do not interact with it, it is going to disappear。Yeah, and it is in the black area of your slides anyway, so who cares?

Okay, so we will just use whatever works。I mean, if you are a computer scientist it is fine, you do not need to know this stuff。Anyway, Steve, interrupt me if we should start。Oh, by the way, literally none of your slides are obstructed by the bar。No, no, no, but I prefer not to touch anything。But then I would say let us get started, if that is reasonable, and with that,

yeah, I do not really have much to say except that we are starting the orals now, and I hope we are all excited about them。The first one is by Nicolas Keriven, a name I am sure we have heard a bunch of times if we are in the learning-on-graphs area, and I am sure he can tell you a lot more about his paper than just the title, which I will now read: Gradient Scarcity with Bilevel Optimization for Graph Learning。So let's go, Nicolas, what is the deal? Yeah, okay, thanks a lot。So thanks a lot to the organizers for this wonderful conference, and especially, this was a paper on the TMLR track, which is new this year and which I think is a really wonderful idea。

So this work is mostly the work of my former student Hashem Ghanem, but he could not be here today, so I am really presenting on his behalf; this is his work。So yeah, this work is about gradient scarcity with bilevel optimization in graph learning; we will see briefly what this means。

Okay, let me just check the time。Okay,

so we all know what we are doing here: we are looking at transductive semi-supervised learning。Very simple, one of the most basic tasks we all do when learning on graphs: you have a graph, you have a set of nodes V, and within these nodes you have a subset of labeled nodes that I call V_train, which have labels, and you want to predict the other labels。

Before GNNs, one of the most classical ways of doing this was what is called Laplacian regularization, in the sense that we directly look for the labels of the unlabeled nodes by optimizing directly over this vector of labels,

such that you compose your cost function with a cost over the training labels,

with some cost L here, so regression, classification, whatever, and you add a regularizer such that the predicted labels are smooth on the graph。This is under some kind of homophily assumption on the graph, and what is nice, especially when you do regression, is that this type of problem is very nice to solve:

it has a closed-form expression that involves the Laplacian of the graph, D minus A, the classical Laplacian, because the regularizer is just the vector of labels computed against the Laplacian of the graph。
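In symbols, one standard way to write the Laplacian-regularization problem just described (notation chosen here for concreteness; the talk's slides may use slightly different symbols) is

```latex
\hat{y} \;=\; \arg\min_{y \in \mathbb{R}^{|V|}} \;
\ell\big(y_{V_{\mathrm{train}}},\, y_{\mathrm{train}}\big)
\;+\; \lambda\, y^{\top} L\, y,
\qquad L = D - A,
```

where the second term penalizes label differences across edges; for a squared loss, the minimizer is obtained by solving a linear system involving the Laplacian L。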

Now, we all know that the modern way of doing this is instead to fit some parametric model, a GNN。Here I will consider a classical message-passing GNN, so at each layer a node communicates only with its neighbors, but the message-passing scheme can be anything; it really does not matter for the rest of the talk。

It is just a message-passing GNN。The prediction here is that you optimize the GNN on the training labels, you get the best possible weights for this GNN, and that is how you predict the whole vector of labels。Note that I have emphasized that this predicted label vector depends on the adjacency matrix A of the graph; written out, this is the inner problem below。
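Written out with the same assumed notation, the GNN-based prediction is the inner problem

```latex
\theta^{*}(A) \;=\; \arg\min_{\theta}\;
\ell\big(f_{\theta}(X, A)_{V_{\mathrm{train}}},\, y_{\mathrm{train}}\big),
\qquad
\hat{y}(A) \;=\; f_{\theta^{*}(A)}(X, A),
```

where f_theta is the message-passing GNN and X are the node features; the notation makes the dependence of the prediction on A explicit。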

And you guessed it: this is because we are going to do graph learning, so this A will actually change over time under some optimization。

Okay。So graph learning is now a well-established field in some sense, even though it is very active and many questions remain under scrutiny, including the one I am presenting today。It comes from the observation that the graph you observe, the adjacency matrix, is usually noisy in some sense: you can have missing links, you can have spurious links, the graph can be homophilic or heterophilic, and so on。

So there are many methods for graph learning,

where graph learning here means: I am going to modify the underlying graph used to make my prediction, with the Laplacian or the GNN。There are many ways of doing this: it can be supervised or unsupervised, with some prior that you put on your graph; it can be differentiable, or a purely discrete algorithm on the graph, such that the graph becomes better in some way。

Here, as in the title, I am going to define it as a bilevel optimization problem; we will see in a minute why。So, for the supervised bilevel setting, assume that you have another set of labeled nodes V_out, where "out" refers to the outer optimization problem (the first one was the inner optimization problem)。It is just another set of labeled nodes: if you only have one labeled set, you divide it between training and validation as we usually do, and the validation set plays the role of the outer set。Right,

you are going to optimize, this time over the graph, so over the adjacency matrix, a cost function that measures the prediction error on these new labels。

I think the key point is that this cost function includes as a term the predicted labels with respect to this adjacency matrix; that is why it is a bilevel optimization problem, because the prediction itself involves an optimization。If you remember, on the first slide I optimized the GNN to predict the labels given A, and here I optimize over A a cost that has as a term the optimal labels with respect to A。The inner model here can be either the GNN or the simple classical Laplacian regularization。
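Putting the two levels together, the bilevel graph-learning problem described here can be sketched as follows (R is a generic graph regularizer; the paper's exact formulation may differ):

```latex
\min_{A \in \mathcal{A}} \;
\ell_{\mathrm{out}}\big(\hat{y}(A)_{V_{\mathrm{out}}},\, y_{\mathrm{out}}\big)
\;+\; \rho\, R(A)
\quad \text{s.t.} \quad
\hat{y}(A) \text{ solves the inner problem (GNN training or Laplacian regularization).}
```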

Okay。Of course, you guessed that it is usually a bad idea to directly optimize over A as a full dense matrix, so what we usually do is restrict the set of learnable graphs; it can be more or less complicated, as we will see on the next slide。

One way of seeing this is that complete graphs are, of course, too costly, but also very prone to overfitting: you can show very simply that if you optimize over the complete graph and you can learn any edge, then the optimization will discard a lot of edges around nodes where you do not have labels。This is a bit related to gradient scarcity, which we will see just after。

You can also add some regularization; it is usually a good idea, again for the same reason, namely that you want to fight overfitting if you learn all the edges of the graph。

A few remarks。This graph learning program was often formulated as a joint optimization problem over the labels and the graph, so min-min: you minimize over the graph and over the labels in the same cost function。This is a marginal remark: any joint optimization can always be formulated very simply as a bilevel optimization (just min-min), but a bilevel optimization is not always a joint optimization, especially when the two cost functions are different and the inner problem is something complicated and non-convex, like training a GNN。

And in practice, if the inner optimization problem itself requires back-propagation in the code, like training a GNN, then optimizing the outer cost function requires back-propagating through the back-propagation, what is called higher-order back-propagation。If you have never seen the term, there are dedicated PyTorch libraries for higher-order back-propagation, so back-propagating through back-propagation。
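As a toy illustration of differentiating through an inner optimization (this is generic unrolled differentiation, not the paper's code), plain PyTorch can take one inner gradient step with create_graph=True and then back-propagate through it; libraries such as higher automate the same idea for longer inner loops。

```python
import torch

# Hypothetical toy setup: w stands for the GNN weights, a for learnable edge weights.
w = torch.randn(10, requires_grad=True)
a = torch.rand(20, requires_grad=True)

def inner_loss(w, a):            # training loss of the inner model, given the graph
    return ((w.sum() * a.sum()) - 1.0) ** 2

def outer_loss(w, a):            # validation loss used to learn the graph
    return ((w.mean() + a.mean()) - 0.5) ** 2

# One unrolled inner gradient step; create_graph=True keeps the computation graph
# so we can later differentiate *through* this update (higher-order back-propagation).
g_w = torch.autograd.grad(inner_loss(w, a), w, create_graph=True)[0]
w_new = w - 0.1 * g_w

# The outer gradient with respect to the edge weights flows through the inner update.
g_a = torch.autograd.grad(outer_loss(w_new, a), a)[0]
print(g_a.shape)
```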

Okay。So just a few words on what we can choose for this problem。I am not going to cover the whole field here, but for the set of matrices you want to learn, given that the complete graph is not a good idea, you may want to learn just the weights of existing edges: you observe some edges and you optimize over their weights。This has a few names; usually it is called edge refinement or edge learning。

It may also be a good idea to add some new edges, in case the observed graph has missing links。A compromise between the observed graph and the complete graph is to learn edges up to k hops (or r hops) in the observed adjacency matrix: you learn the weights of the existing edges, but you also learn new weights between nodes that are two or three hops from each other, which can be a good way to add a few shortcuts to the graph。

It is also a good idea to add a regularizer; there are many choices for this。

And, as you guessed, the modern way of learning a graph is to use a parametric model itself and optimize over the parameters of this model, so it is usually some type of graph-to-graph GNN: it computes some kind of node embedding, for instance, and then predicts edges with an MLP that takes the node embeddings as input。That is usually the modern way of doing graph learning。But to do some theoretical analysis,

especially of gradient scarcity (and I am going to explain what gradient scarcity is), consider edge refinement: we just want to learn the edge weights of the existing graph。Take as a prediction model a GNN with K layers; I will start with the GNN here, because Laplacian regularization is actually more complicated for what I want to say。

If you take as a model a GNN with K layers, then you can see that it has a finite receptive field of size K, since it is a message-passing GNN。

As such, the edges that are more than K hops from the training nodes do not play any role; this is well known, they do not appear in the cost function。Note that this is well known at inference: when you predict labels, you know that some edges are not taken into account。But when you optimize over the graph, this becomes a problem, because those edges are supposedly being trained: you compute gradients for them。Gradient scarcity refers to the fact that, from this very simple observation, if you consider the edges that are more than K hops away from the training nodes (K hops in terms of shortest paths),

then the gradients of the outer cost function that is supposed to learn the graph are exactly zero with respect to these edges, so they never receive any gradient。This is what we call gradient scarcity。
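Stated compactly, for a K-layer message-passing GNN the claim is

```latex
\mathrm{dist}\big((i,j),\, V_{\mathrm{train}}\big) \;>\; K
\quad\Longrightarrow\quad
\frac{\partial\, \ell_{\mathrm{out}}}{\partial A_{ij}} \;=\; 0,
```

so edges outside the receptive field of the labeled nodes never receive a learning signal。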

It was known for joint optimization; in this paper we prove it for bilevel optimization。The proof is essentially the same; it is a bit more complicated because some terms that were zero before are no longer zero, but everything works out and you still have gradient scarcity, without surprise, for bilevel optimization。Okay。

There are many ways of counteracting gradient scarcity; I am not going to go over all of them here, but I will talk a bit about Laplacian regularization。If you do not know it: Laplacian regularization, unlike a GNN, has an infinite receptive field, so the whole graph usually comes into play when you compute the Laplacian regularizer and predict the labels。So do we have gradient scarcity in this case? It was not really clear, because unlike a GNN it does not have a strict size-K receptive field。

Well, yes, in some sense: you can bound things by an eigenvalue of a certain matrix (I am skipping the details here), and what you get is that the gradients with respect to edges that are far from the training nodes decrease exponentially with the distance to the training nodes。

Okay, so it is not exactly zero as before, as for the GNN, but the gradients decrease exponentially, and depending on the case this decay can be fast or can reach pretty far。

What is interesting is that the eigenvalue of this matrix, which determines the rate of decrease, is linked to the distribution of the training nodes and its relationship with the frequencies of the Laplacian。I am not giving details here, but basically, if the training nodes are very low-frequency, very grouped together, then the decrease is slower, but then you have large distances in your graph to the training set, because it is grouped and isolated from the rest。

And if the training set is more aligned with the high frequencies of the Laplacian, then the decrease is very fast, but the distance to the training set is usually (of course, you can find counterexamples) much, much smaller。So you have this kind of trade-off, which

raises a lot of open questions that we do not have answers to yet。But it is interesting, because it concerns something we do not often discuss: the distribution of the training nodes within the graph。

Okay, so among all the approaches I showed you before, there are some ways to mitigate gradient scarcity; of course, I was only talking about edge refinement over the existing graph。We are not going to go over all the details, and I am not proposing any new method to mitigate gradient scarcity, there are more than enough already。One interesting thing we observe is that, whatever method you choose to mitigate gradient scarcity, regularization or learning a graph-to-graph GNN, basically everything apart from plain edge refinement, graph learning on small graphs like Cora and similar medium-size graphs is very subject to overfitting。You can see in the table of results that we manage to get better test accuracy than simply predicting with the observed graph, but you still have a gap between training and test, at least in our experiments。Graph learning, if you are not careful,

leads to a lot of overfitting, because you learn the edges between the training nodes, everyone is happy, and the test nodes are not taken care of。

And the more parameters you have, of course, the more overfitting, if you are not a bit careful。

Okay, I am going to stop here。To conclude, this was a simple work, but it raises some interesting questions about graph learning, gradient scarcity, and the distribution of the training set within the graph and its relationship with the graph topology, which is a topic that I find is not often discussed but has some interesting theoretical consequences。

And I am just ending my talk by saying that I recently got a grant to work on graph machine learning, so if you know any students who would be happy to join us, at the university or CNRS, do not hesitate to send me a message。Thank you。

Sorry, I was on mute。But yeah, that's great。As it happens, I am cooking for the Thanksgiving weekend tomorrow, so I support that。But then, for the next speaker, we would welcome having him or them here, so if you are around, just shoot me a message in the chat, because I cannot find you; next we would like to hear about Towards a GNN Framework for Combinatorial Optimization Problems。I see Frederik, there we go。

I made you co-host, Frederik, which means that you can start screen sharing and things like that。Hello, hello。And with that, Nicolas, thanks for a nice talk; I am sure someone might reach out about those grants and working with you。But with that, thanks, Nicolas, and

let's move on to the next one。And yeah, are you all set? Do you quickly want to say something so we can check that we can hear you?

Yeah, can you hear us? Yeah, that works very well。Right, I will be presenting together with my first author Semih。

And yes, I hope everything is fine, and in that case we will start。I think our talk was scheduled in ten minutes,

but we can already start, I guess。Absolutely, let's go。Right, okay,

so today we are going to present our latest work, Towards a General Recipe for Combinatorial Optimization with Multi-Filtered GNNs。Frederik and I are going to present this together, but we would like to take the opportunity to thank our coauthors, Stefan Horoi, Michael Perlmutter, and Guy Wolf, of course。Okay, so let's get started。

Many of us in the audience will be familiar, to different extents, with combinatorial optimization, but for the sake of completeness, let's do a very quick overview。In combinatorial optimization we are trying to find an optimal subset of objects from a finite set, and in this context we are going to think about combinatorial optimization over graphs, which is a very common case where the subsets are graph nodes; they could also be edges of the graph or paths over the graph, but basically some subset of graph objects。

There are many instantiations of combinatorial optimization on graphs, the most popular among many others being maximum clique, maximum cut, minimum dominating set, and perhaps the even better-known travelling salesman problem; there are also others, like vehicle routing, graph coloring, et cetera, so we are talking about a very large family of problems here。

The thing is, most of these combinatorial optimization problems are NP-hard or NP-complete, so especially when you scale your graphs to, say, tens of thousands or hundreds of thousands of nodes, generating exact solutions with conventional tools like mixed-integer programming, for example with the Gurobi solver, can be prohibitively expensive。That is where deep learning comes in: the idea is to generate approximate solutions more efficiently, just do a forward pass of a neural network and get predictions。

Supervised and unsupervised (as in self-supervised) methods,

as well as reinforcement learning approaches, are all viable; many of them have been tried and have been successful to different extents。

And since we are talking about combinatorial optimization on graphs, GNNs obviously emerge as a natural candidate。But conventional GNNs typically assume that your information is very local and essentially treat graph smoothing as an inductive bias, and what we know for a fact is that this assumption does not always work well for combinatorial optimization, where you also need high-frequency information, depending on the task at hand。We are going to explore this in more depth shortly。

So let's look at a few problems we mentioned on the previous slide。In the maximum cut problem, you have your vertex set and you want to partition it into two subsets such that you maximize the number of edges that connect nodes from different subsets; you want a cut with as many edges as possible。

In the maximum clique problem, you want to find the largest node subset that is fully connected, which is by definition a clique。

And in the minimum dominating set problem, you want to find the smallest subset of nodes such that every node in the graph is either in this subset or neighbors it。From a GNN perspective, you can think of it as follows: if you do a one-step diffusion over the graph from the dominating set, you reach every node in the graph。That is the main idea of MDS。

that's the main idea in MDS。Okay, so I'm going to pass the to to Fred who's going to take us through the methodology we employ here。

😡,Thanks, Sammy。So yeah, talking about the methodology。

we have here an input graph to our pipeline which is then equipped with intrinsic node features。

Those encode the structure of the surrounding area about a node can be, for example, the node degree。

the clustering coefficient or the eccentricity。This information goes from an MLP decoder。

which then feeds into our multifil to GNN。The output of this GNN is on the node level and we feed it for an MLP decoder together with a sigoid。

So the output of the learned part of our pipeline is a vector。

which we can see as a probability vector。That gives us for every node and probability to be part of the solution。

Looking at the case of the maximum cut problem, we then have an unsupervised loss function that we use for training。

The way to think about the probabilities is that we want to bipartition our node set so that we cut as many edges as possible, and the probability for every node tells us where it goes: if it is smaller than 0.5, we put the node in the first part of the partition; if it is bigger than 0.5,

we put it in the second part of the partition。

How can we train with that? The first step is to apply the transform 2p minus 1 to our output vector, which moves our values from the interval [0, 1] to the interval [-1, 1]。

What the quadratic form then does is sum over all the edges, and we get a negative contribution to our loss if the two nodes connected by an edge have different signs, i.e. are in different parts of the partition, and a positive contribution if they have the same sign。

This is the learned part of the pipeline, and once we have learned these probabilities, we can pass them through a rule-based decoder, where we simply separate the nodes of the graph according to the sign of 2p minus 1。
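In formulas, with p_i the predicted probability for node i, the unsupervised max-cut loss just described is (up to constants and sign conventions)

```latex
\mathcal{L}_{\mathrm{cut}}(p) \;=\; \sum_{(i,j)\in E} (2p_i - 1)(2p_j - 1),
```

which is negative on edges whose endpoints land on opposite sides and positive otherwise; the rule-based decoder then assigns node i to a side according to the sign of 2p_i - 1。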

Similarly, we also have an unsupervised loss function to train this model for the maximum clique problem。

Here our loss function acts as a surrogate for the true objective of the maximum clique problem; you can see this at the bottom。The true objective can be seen as follows: assuming we knew all the cliques in the graph, we could, for each clique, count how many edges it contains and then pick the one with the most。This translates into our surrogate loss function: the first loss term encourages the model to push probability onto nodes that are within the clique, while the second term encourages nodes with high probability to be highly connected, and together this loss function approximately matches the true objective of the maximum clique problem。

To decode a solution from this, we order the nodes of the graph by decreasing probability, initialize the solution with the highest-probability node, and then add nodes to that set in order of decreasing probability, each time checking whether we still have a clique。
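One common way to write such a two-term clique surrogate, matching the description above but not necessarily the paper's exact weighting, is

```latex
\mathcal{L}_{\mathrm{clique}}(p) \;=\;
-\sum_{(i,j)\in E} p_i\, p_j
\;+\; \beta \sum_{(i,j)\notin E,\; i\neq j} p_i\, p_j,
```

where the first term rewards probability mass on connected pairs, the second penalizes mass on non-adjacent pairs, and beta trades the two off。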

Lastly, for the minimum dominating set problem, we want to find a minimal set such that every node of the graph is at most one step away from that set。Again we have a two-term loss function, where the first term, the L1 norm of the probability vector, enforces sparsity, encouraging the model to put as few nodes as possible in the solution set,

while the second term makes sure that every node of the graph has at least one node with high probability next to it。

And the decoder functions very similarly to the one for maximum clique: we again order the nodes by decreasing probability and then construct a solution step by step, going through the nodes in that order。
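A plausible instantiation of the two-term MDS loss just described (again, the paper's exact form may differ) is

```latex
\mathcal{L}_{\mathrm{MDS}}(p) \;=\; \|p\|_1
\;+\; \beta \sum_{i \in V}\;\prod_{j \in N(i)\cup\{i\}} \big(1 - p_j\big),
```

where the L1 term encourages a small set and the second term vanishes only when every node has itself or a neighbor with probability close to one。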

What is interesting about this is that all these problems are NP-hard, so it is prohibitively difficult to get enough ground-truth solutions for supervised training。But the nice thing is that our unsupervised loss functions allow us to train these models without access to even a single ground-truth solution。

The second nice thing is that our rule-based decoders are constraint-preserving: even though we cannot guarantee that the solution of our model will be optimal, we can always be certain that it will satisfy the constraints。For instance, for maximum clique, we can make sure that our approximation is a clique, although we cannot guarantee that it is the maximum clique; and similarly for the minimum dominating set, the solution we propose will be a dominating set, although it may not be minimal。With that,

I will move on to our modeling approach,

which is a multi-filter GNN, where, as you can see here, in each layer of this model our input node features go through different pathways, each of which can be seen as a separate GNN layer。

The ones you see here are traditional averaging operations, which use different scales of the support matrix, the matrix we use for message passing。For instance, if we take S to the power of one, this is the most traditional aggregation mechanism, which we know from, for example, GCN, GAT, and many others, and it just aggregates information around every node from its one-step neighborhood。

But we also have higher powers of the support matrix: for S squared we aggregate from the two-step neighborhood, and so on。So even though those are very traditional filters, each of those channels gives us access to a different receptive field around each node。

Then we complement those filters with what we call comparison operations, and those are filters where we replace the support matrix by scattering-style diffusion wavelets。

Those encode fundamentally different operations compared to what we know from traditional GNNs, because here we take differences between different scales of the support matrix。For the Psi-1 diffusion wavelet, we first aggregate around every node from the one-step neighborhood, which you see on the left, and for S squared we aggregate from the two-step neighborhood, and then we take the difference; so this diffusion wavelet compares, at every node, two neighborhoods of different size。

And this carries on to higher-order wavelets, for example Psi-2, where the scales of the support matrix are powers of two: there we compare a neighborhood of size two around every node to a neighborhood of size four。
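A dense, illustrative sketch of the filter bank just described: powers of an assumed support matrix S for the averaging channels, and dyadic differences of those powers for the wavelet (comparison) channels。This is not the authors' implementation, which would use sparse operations and per-channel learned transformations。

```python
import torch

def filter_bank(S: torch.Tensor, X: torch.Tensor, J: int = 3):
    """Compute averaging and diffusion-wavelet filter responses (dense sketch).

    S: (n, n) normalized support matrix (e.g. a random-walk or lazy diffusion matrix).
    X: (n, d) node features.
    Returns low-pass responses S^{2^j} X and band-pass responses (S^{2^{j-1}} - S^{2^j}) X.
    """
    # Dyadic powers S, S^2, S^4, ..., S^{2^J}
    powers = [S]
    for _ in range(J):
        powers.append(powers[-1] @ powers[-1])

    low_pass = [P @ X for P in powers]                 # averaging channels
    wavelets = [(powers[j - 1] - powers[j]) @ X        # comparison channels Psi_j
                for j in range(1, J + 1)]
    return low_pass, wavelets
```

In the model described in the talk, each node then attends over these per-channel responses rather than simply concatenating them。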

The magic of this model is in how we combine, at the node level, the information from these different filters。We have two attention mechanisms, where at every node the model can decide which of the filter responses it wants to attend to。This is very different from what we know from, for example, graph attention networks, where every node attends to its neighbors; here each node attends to different filter responses。For instance, you can see at this node that it can choose which information is most important for it, and this can be different for each node of the graph。

So what this model is capable of doing is learning very complex local functions that depend on the local node neighborhood, which is very helpful for the problems we are dealing with here。Additionally, it is very important to have the decoupled attention mechanism you see here: one attention mechanism for the averaging operations and one for the comparison operations。We show theoretically in the paper how this is an upgrade compared to former architectures, which had just a single attention mechanism that throws together all the filter responses at the same time。

We show that in such cases the filter responses of the averaging operations always outweigh those of the comparison operations and limit the expressivity of the model。

With that, we move on to the experimental results with Semih。

So before we actually look at the results: what are our baselines, what are the tasks, and what models are we comparing against? Again, we are testing on max-cut, max-clique, and minimum dominating set, and we ran experiments on sets of small and large graphs; this is going to be relevant in a second。The small graphs have about 200 to 300 nodes and the larger ones, I believe, 800 to 1,000 nodes each。

First we have non-deep-learning pipelines: Gurobi, a general-purpose optimizer, greedy algorithms, and mean-field annealing algorithms。Then, among the more directly comparable baselines, we have the Erdős GNN, a seminal GNN from a few years ago by Karalias and Loukas, which is basically a slightly altered GIN suited to combinatorial optimization and gets good results across several tasks as well。

There is the annealed model, which is essentially the Erdős GNN with some max-entropy annealing regularization that helps the optimization process in certain cases。

There is Scattering GCN-style prior work that also leverages these diffusion-wavelet-based band-pass filters, but with a different mechanism。And finally, we have the GFlowNet-based solver, GFN, from a paper from last year,

I believe, which proves to be quite successful across these tasks as well, with the caveat that it is much more computationally expensive to train and to run inference with。

We see in our results that GCON is the best self-supervised learning method on both max-cut and MDS, overall on the large graphs but also on the smaller ones,

whereas on max-clique we are pretty much on par with GFN, or GFN proves to be slightly better。Overall we are either the best method or the second best, which is in line with our aim of arriving at a model that is consistently good across a large variety of combinatorial optimization tasks, which may require different inductive biases than just low-pass filtering from a graph-signal-processing perspective。What is more, GCON, while being this successful, has similar scalability properties to other MPNNs like the Erdős GNN, which makes it much faster to both train and test compared to methods like the GFlowNet-based solver, and much,

much faster than, say, the Gurobi solver。The bottleneck here is not

the inference pass of the GNN but rather the decoder, and when your decoder is very fast, which our implementation is for max-cut, we not only beat a time-budgeted Gurobi, we also gain roughly a 130x speed-up on max-cut compared to Gurobi。To run this on these large graphs, 500 of them, took us about half a minute, whereas for Gurobi it is more than an hour, just to put it in context。

And we see that this decoupling of low- and high-pass filters is essential for max-cut and MDS, whereas the max-clique task seems to rely more on low-pass information。But overall, we have arrived at a network that can pick and choose the types of information it wants to focus on, on a node-by-node basis,

as well as depending on the task, so we arrive at something that is very flexible。

Finally, we ran some generalization experiments, and we see that GCON is actually quite adept at generalizing across graph sizes。In the table you see that we ran two types of generalization experiments: as I mentioned, we have small and large datasets for each task, so we test what happens if we train on the small dataset and try to extrapolate to the large one。The percentage differences you see are the drop in performance when you train on, say, the small graphs and test on the large ones, compared to training and testing on the large graphs。Overall we have less than a 6% performance drop, which is small when you put it into perspective against how much better GCON was doing compared to standard message-passing-based methods。

We see that GCON can outperform in-distribution GNN methods like the Erdős GNN even when trained on graphs of different sizes than it is tested on。

We also see that the GFlowNet is quite successful at generalization, whereas the Erdős GNN and the annealed model, these more conventional GNN-based methods, struggle quite a bit, even though the variance per task changes a lot。So okay,

what are the main takeaways we have here?Perhaps the most important thing to notice is that like conventional GNNs like the biases associated with conventional GNNs is not really compatible fully with several CO tasks where high through information is also relevant so we need something that can leverage both types and here we introduce Gcom which leverages again these both types of information。

making a powerful architecture for a very large array of CO problems。As we showed。

GCon can outperform both other GNNs as well as GFlowNets, and even challenges Gurobi, while retaining the scalability and the sparsity of a standard GNN method。

And finally, we discuss this in more context and like in the paper。

but we see that Gcon is like the overall framework we presented is extendable to many optimization problems that we don't cover in this paper。

including ones that are based on edges or paths, like, say, the travelling salesman problem。The only thing you need to do is to introduce a modular loss and decoder that are appropriate for your task; you can keep the model backbone as is and get close-to-optimal performance with the GCon framework。

So that concludes our presentation, but we're more than happy to take any questions and you know discuss this in further。

Awesome, thank you very much。It seems we do not have any questions in the chat。

but we have the QR codes and yeah, I'm sure people might be happy to check out the paper further and take away some more details。With that, do you have any last words that you want to say about the paper, or should we call it a day and move on to the next?

I think we're good I think we covered everything and yeah always feel free to reach out to us we have some ideas about future work as well so this is probably not the last you hear of this work but yeah thanks a lot thank you。

Great, thank you very much。😊 Sorry, I forgot the name。

Do you want to say the names of your co-authors? Frederick and Sammy。Thank you very much。

Thank you, Frederick, and you're welcome。Let's get to the next oral, and I think that is by you。

😊,But if not, then whoever else is next, please write me a message so I can make a cool host and you can start screen sharing。

But we are also a little bit in front of the schedule, so it might be happening that yeah。

it might be happening that our third presenter is still taking a little while。"I'm presenting closing remarks" — no, it's not my turn, I thought we had a third。Okay, sorry,

neural incomplete factorization as the third oral。Do we have anyone? Oh yeah,

Paul is here now。I couldn't send you a message for some reason, but no,

seems we're good to go。Yep, sorry about that, I maybe should have called your name sooner,

but then let's just get straight into it so we get as much time as possible for your presentation。

So sorry。All right, thank you very much Hannes, it's very nice to be here。So my name is Paul, I'm a PhD student at Uppsala University in Sweden, and I have the pleasure to talk today about our recent work and our paper on using graph neural networks in order to learn preconditioners for the conjugate gradient method,

that's a topic I've been working on together with my supervisors。

Ozan Öktem from KTH in Stockholm and Jens Sjölund from Uppsala, for the last roughly two years。

But let's take a step back and first talk about sort of learning to optimize。

so using machine learning models in order to accelerate the solution methods for numerical optimization problems。

And so the idea is that we often have to solve similar problems over and over again。

And we can then use sort of the similarity between different optimization problems in order to train a machine learning model。

And then later, we can amortize the cost of this training of the model and creating of the model by solving many related problems in the future。

And so this paradigm is also known as learning to learn or meta learning when the underlying optimization problem we're considering is actually neural network optimization itself。

that's quite a famous paradigm; a more general related framework is AutoML,

where we're interested in learning to optimize general machine learning algorithms。

So let's make those a little bit more concrete, so usually we start with a parameterized optimization problem。

so we want to minimize some function f with some optimization variables x,

and we have some parameters y。

That is just the context of our optimization problem。

We then assume some unknown probability distribution over these parameters。

The goal then is to learn a machine learning model G,

which takes the parameters y of our optimization problem and some learnable parameters, and produces either directly the solution to the optimization problem, or some augmentation of a classical optimization algorithm。

And we can then use different kinds of loss functions in order to train this model。

so either we can go towards end to end learning or we can use theory in order to inspire our loss functions in order to get a better performance。

The ultimate goal is then to solve some optimization problems fast for new parameters that we might obtain in an online fashion。

So we are at the Learning on Graphs conference here, right, so it's natural that we might assume that our learning-to-optimize model will be parameterized with a graph neural network。

And here we especially want to highlight sort of the connection between graph neural networks and classical numerical linear algebra methods。

So I guess everybody here is familiar with the classic graph neural networks that operate on or process data that lives on a graph,

but just to recap real quick, the classic message passing framework。

consists of three consecutive steps。In the first step we compute an edge embedding for every directed edge in our graph, based on the old edge features and the adjacent nodes。

Then in the next step, we aggregate the adjacent edge embeddings for each node using some permutation-invariant function,

and in the final step we update the node embeddings。Both these updates, of edge embeddings and node embeddings, are usually implemented using small fully connected neural networks。

And since these neural networks only operate on the building blocks of our graph,

a graph neural network can operate on graphs of any size,

as we've all seen in the previous presentations。
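For readers who want the three-step recap as code, here is a minimal sketch of one generic message-passing step。It is an illustration of the framework described above, not the speaker's implementation; the MLP names and feature dimensions are assumptions。

```python
import torch

def message_passing_step(x, edge_index, edge_attr, msg_mlp, upd_mlp):
    """One generic message-passing step: edge update, permutation-invariant
    aggregation, node update. `msg_mlp` and `upd_mlp` are small fully
    connected networks supplied by the caller (illustrative only)."""
    src, dst = edge_index            # [num_edges] source / destination indices
    # 1) edge embedding from the two endpoint nodes and the old edge feature
    messages = msg_mlp(torch.cat([x[src], x[dst], edge_attr], dim=-1))
    # 2) permutation-invariant aggregation (here: sum) of incoming messages
    agg = torch.zeros(x.size(0), messages.size(-1))
    agg.index_add_(0, dst, messages)
    # 3) node update from the old node state and the aggregated messages
    return upd_mlp(torch.cat([x, agg], dim=-1))
```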

So how does linear algebra come in? Well, every matrix A corresponds exactly to an adjacency matrix that we can interpret as a graph;

moreover, many numerical algorithms, and in particular numerical linear algebra algorithms, can be described as graph problems。

And we use this insight in order to construct our GN architecture later on。

And so the optimization problem we're looking at as the title already says。

is the conjugate gradient method。The conjugate gradient method aims to solve linear equation systems

Ax = b, and we can also see this as minimizing an unconstrained quadratic objective,

Where both A and B are sort of the parameters of our optimization problem。

And X is the optimization variable we want to solve for。

And so the conjugate gradient method is a quite famous iterative method to solve these equation systems we obtain,

where we assume that our matrix A is square and symmetric and also positive definite,

so that means our quadratic objective itself is strongly convex。

And the method is especially effective when we have a large scale and very sparse problems。

So one very common setting where this occurs is the discretization of partial differential equations with finite element methods。

The convergence speed of this method typically depends on the condition number of our matrix A and also the distribution of the eigenvalues of the matrix。

So a classical heuristic that's very important for these solvers is finding good preconditioners,

so the idea is that we want to find a matrix M which usually also has to be symmetric and positive definite and, for practical reasons, also sparse,

That improves the spectral properties of our system。

so instead of solving our original system A x equal B。

we might solve a different system where we multiply, for example, from the left with our matrix M。

And so here we have to make a trade-off between the time spent to compute the preconditioner versus the acceleration we obtain in the conjugate gradient method;

so if we use, for example, just the identity matrix as preconditioner,

we obviously don't obtain any speed up since we just recovered the original problem。

On the other hand of the spectrum, if we use the inverse matrix itself as a preconditioner。

we recover a direct method, so we don't need any additional iterative steps anymore。

So usually preconditioners are somewhere in between these two extremes。The Jacobi method,

for example, approximates A using a diagonal matrix, which we then know how to invert。
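As a small, self-contained illustration of the preconditioning idea (not taken from the talk), the sketch below runs SciPy's conjugate gradient solver with and without a Jacobi (diagonal) preconditioner on a random sparse SPD system; the matrix construction is arbitrary。

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

# Build a random sparse symmetric positive definite system (illustrative only).
n = 1000
A = sp.random(n, n, density=0.01, format="csr")
A = A @ A.T + 10 * sp.identity(n)          # A A^T is PSD; shifting makes it SPD
b = np.random.rand(n)

inv_diag = 1.0 / A.diagonal()              # Jacobi preconditioner: M^{-1} ~ diag(A)^{-1}
M = LinearOperator((n, n), matvec=lambda x: inv_diag * x)

x_plain, _ = cg(A, b)                      # unpreconditioned CG
x_prec, _ = cg(A, b, M=M)                  # preconditioned CG, usually fewer iterations
```

The Jacobi preconditioner is cheap to build but only mildly effective, which is exactly the gap the incomplete factorization and learned preconditioners discussed next try to close。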

And another class of quite popular algebraic preconditioning techniques are the incomplete factorization preconditioners。

so there we start computing a factorization of our matrix A, but do not fill in all

the elements of that factor。

And our method is highly inspired by that by including a new component into this incomplete factorization。

So the idea is to parameterize a learning-to-optimize model that takes a matrix A and produces a suitable preconditioner。

And as I said on my previous slide, we need both symmetric positive definiteness as well as sparsity constraints for our preconditioning matrix,

for the output of our neural network。The sparsity constraints I'm going to talk about later;

let's first focus on how we make sure that our preconditioner itself is symmetric and positive definite。

So instead of directly mapping to a matrix M, we actually map to a lower triangular factor

L。And for a lower triangular factor, we know exactly how we can make it positive definite,

and that is exactly by enforcing positivity of the diagonal elements,

which we can achieve by just using a suitable activation function in our neural network。

In order to then recover our preconditioner, we can just multiply the lower triangular matrix

together with its transpose, similar to the incomplete Cholesky factorization itself。
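The following is a rough sketch of this construction, assuming the GNN has already produced one scalar per retained edge; function and variable names are hypothetical and the sparse assembly is only indicative of the idea, not the authors' code。

```python
import torch
import torch.nn.functional as F

def assemble_preconditioner_factor(edge_index, edge_vals, n):
    """Hedged sketch: assemble a lower triangular factor L from per-edge GNN
    outputs, enforcing strictly positive diagonal entries so that
    M = L @ L.T is symmetric positive definite."""
    row, col = edge_index
    mask = row >= col                          # keep only the lower triangle
    row, col, vals = row[mask], col[mask], edge_vals[mask]
    diag = row == col
    # softplus on diagonal outputs guarantees positive diagonal entries
    vals = torch.where(diag, F.softplus(vals) + 1e-6, vals)
    L = torch.sparse_coo_tensor(torch.stack([row, col]), vals, (n, n))
    return L  # the preconditioner M = L @ L.T is never formed densely in practice
```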

So, as I said, there's a very strong connection between graph neural networks and matrices,

but the problem here is that we cannot actually use the matrix A directly for the message passing。Why is that? If we just use A directly for the message passing, the outputs that we get from the network, in terms of edge embeddings, are typically not symmetric anymore, even though we take in a matrix A which is symmetric。That is just due to the fact that the edge updates usually take into account both

the nodes that the edge is connecting, in a directed fashion。The other problem,

which is way more severe, is actually that we use only the lower triangular part of the output,

and that breaks the permutation equivariance we usually obtain from the neural network。So where in the

naive framework we would first use message passing and then take the lower triangular part of our message passing output,

we instead suggest to first transform our matrix to a lower triangular matrix and use the graph of this lower triangular matrix for our message passing。

The problem then becomes that by only using the lower triangular matrix,

we basically only let information flow in one direction。

and that is from low order nodes to higher order nodes where we in the beginning。

fixed the node ordering。In order to circumvent that。

we first do a step with the lower triangular part of our matrix using message passing。

and then in the second step we use the upper triangular part for our message passing, reusing the same edge embeddings for the second step。

So now we can obtain a valid factorization of our matrix, which is positive definite,

but now we have to ensure that we can also train this factorization so that it resembles a neural incomplete factorization。

And so how we phrase this is that we want to minimize the distance between our original matrix A and the learned factorization that we obtain from the neural network。

Which then we make subject to sparsity constraints。So since we only input edges in the beginning。

these are exactly the non zero elements we obtain in the end in our output matrix。

And we can encode here also different types of sparsity constraints,

so the one I've shown here in the upper objective is exactly the one that doesn't allow any fill-in,

so exactly the elements that are nonzero in our original matrix A will also be nonzero in the learned preconditioner。

We can also allow additional fill-in using some classical heuristics from numerical linear algebra;

for example, we can use the sparsity pattern corresponding to A squared or to higher powers of A。

It is also possible to then learn actually sparse outputs by adding an L1 penalty term to the objective function itself,

so we start with a bigger sparsity pattern and add an L1 penalty to our objective in order to get a sparse output again from the graph neural network。

But here we, of course, have to make a trade-off between the time we need for the forward pass of our graph neural network and how good the approximation is that we get out in the end。

UmSo in order to actually allow efficient training。

we need to do some additional computational tricks。

One key problem here is actually that the Frobenius norm objective scales computationally quite poorly, since it requires matrix-matrix multiplication。

And so for very large scale matrices, this is extremely costly on one hand and on the other hand。

we need to compute the gradients with respect to these, which then just blows up our memory。

So instead of minimizing the Frobenius norm, we propose to minimize an unbiased approximation using the Hutchinson trace estimator:

you can just sample a vector with i.i.d. Gaussian elements and multiply it with both of our matrices in order to get an approximation of the Frobenius norm that only relies on

matrix-vector multiplications, which we can compute very efficiently。
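A minimal sketch of such a stochastic training objective is given below, assuming the factor L from before; it uses dense tensors to keep the example short, whereas in practice A and L would be sparse and only matrix-vector products would be formed。

```python
import torch

def stochastic_frobenius_loss(A, L, num_samples=1):
    """Hedged sketch of a Hutchinson-style objective: for z ~ N(0, I),
    E_z ||(A - L L^T) z||^2 equals ||A - L L^T||_F^2, so the Frobenius-norm
    loss can be estimated with matrix-vector products only."""
    n = A.shape[0]
    loss = 0.0
    for _ in range(num_samples):
        z = torch.randn(n, 1, dtype=A.dtype)
        residual = A @ z - L @ (L.T @ z)   # only mat-vec products are needed
        loss = loss + residual.pow(2).sum()
    return loss / num_samples
```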

So this allows us to have very efficient training。And if we look at the difference here on the right-hand side of the figure, between using our stochastic approximation as a training objective or the full Frobenius norm as a training objective,

we see that in terms of performance on the validation set, where we always compute the full Frobenius norm,

it is very marginal。During inference, we can also show that our model is very,

very efficient and only requires linear time complexity in the number of nonzero elements in our matrix。

So let's look at some results。So we evaluate our methods on two different synthetic data sets。

so these are basically the generating distributions over our different problem classes。

so one is using random matrices of size 10,000 by 10,000

with approximately 1% nonzero elements in each matrix。

And the other problem is motivated by an example I already talked about earlier: a Poisson partial differential equation discretized on different two-dimensional domains using the finite element method,

and there we evaluate our method on problems from size 20,000 up to 500,000。But here,

one key goal is also to show the size generalization of our model。

so we only train with problems up to size 150,000 by 150,000。In general, it's however not

easy to generate suitable datasets for those problems,

And so we have to sort of trade off the difference between the problems within a data set and the generalization abilities of this model。

and so future work is needed in this area in order to address these issues。

so these are really mostly synthetic datasets we are evaluating on here。

So for the synthetic dataset, we can see that our learned approach with additional sparsity is around 40% faster than incomplete Cholesky,

the fastest other baseline we're testing against。So if you're interested in solving many,

many random matrices, then our method is quite good。

So for the partial differential equation data, we can see on the left hand side we can see how long it takes to actually compute the preconditioner。

so that is just the pass through our model and here we can very nicely see the linear scaling of a model that we also showed theoretically。

And compared to incomplete Cholesky, we're significantly faster in computing the preconditioner,

so our linear complexity grows at a better rate compared to the baseline method。

On the right hand side, we see the solving time of the actual optimization problem。

and there we see that both methods basically scale the same。

Yeah, then let me give some outlook and future work。

So as I said before right now, most of these problems are fairly synthetic。

so future work includes extending these methods to other problems such as interior point methods and Gaussian processes,

where we actually have to solve linear equation systems over and over again within these methods。

Additionally, we can learn additional heuristics for these preconditioners。

like extending the sparsity pattern and learning the sparsity pattern directly or additionally learning the reordering of the matrix in the forward pass。

In our forthcoming paper, which is going to be presented at the Northern Lights conference at the beginning of next year,

we've also extended the learned preconditioning to other iterative schemes such as GMRES, and have improved the training objective

to ensure a connection to the large and small singular values。

So the development of data-driven preconditioning really has just begun,

and the story doesn't end at only linear equation systems。

but preconditioners can be important for any type of optimization problem。

So let me give a quick summary。Heuristics such as preconditioners in optimization algorithms can be replaced by data-driven approaches, as we've shown in this paper,

and preconditioning is therefore a very natural form of learning to optimize, since

it can be applied to all kinds of different optimization problems。

And maybe most interesting for the graph community is that graph neural networks are an extremely natural computational backend for numerical linear algebra, so there are a lot of different things we can use graph neural networks for in order to improve the linear algebra side and exploit this connection between GNNs and matrices to speed up optimization problems。Yeah, thank you, and I'm happy to take any questions。

No, the thanks is from us, Paul, this was an excellent, awesome finish to this conference,

I think, with this last oral session。And yeah, this, I think, marks the end of the day,

except for the closing remarks。And let's get to those。

And I'm pretty sure we're going to hear some exciting stuff。 There are like some, some news。

say, are you announcing some things, is that right?

Yeah, thank you。Oh, let me share my screen。The reviewer awards, did we already announce those?

Yeah, we haven't; I'll be announcing the awards then, that's pretty exciting, who gets the monies, let's see。Of course we're grateful for any review, but there are some especially highlight-worthy reviews and they should be rewarded。But then, yes, let's go。

😊,Exactly, thank you so much。😊,So thank you everyone for attending the Learning on Graphs conference this year, and here comes the end of our conference。I'll be presenting the closing remarks and the awards for the best paper,

top ACs and top reviewers。😊,So just to recap, the Learning on Graphs conference was established with the core idea of presenting the latest advances in graph and geometric ML。We are a community-driven conference and we want to build a community that is truly accessible to all, by making the conference free to attend and in a virtual format;

we also emphasize local meetups, where researchers in GraphML can meet up locally around the globe;

we also focus on review quality by providing monetary rewards to the top reviewers as an incentive to improve the review quality。

So this year's program spans over four days and we had four keynote speakers, five tutorials。

12 orals and two poster sessions, with 16 local meetups,

and we are truly thankful to everyone participating in our conference, presenting their research in GraphML, which spans such a large and

Graph Machine Learning Conference: P04: Neural Algorithmic Reasoning Tutorial and Orals

In this session we will study the core concepts of Neural Algorithmic Reasoning (NAR), its latest advances, and its applications in graph machine learning and large language models. Starting from the basics of graph neural networks (GNNs), we will work up to how algorithmic reasoning capabilities can be integrated into modern AI models.

Overview

Neural algorithmic reasoning aims to combine the flexibility of neural networks with the robustness of classical algorithms in order to solve complex problems, particularly when facing out-of-distribution (OOD) data. This tutorial has two parts: the first focuses on neural algorithmic reasoning on graphs; the second explores the extension from graphs to language and how to combine NAR with large language models (LLMs).


Part 1: Neural Algorithmic Reasoning on Graphs

In the previous section we introduced the basic motivation for neural algorithmic reasoning. In this section, we look at how graph neural networks can be used to carry out algorithmic reasoning.

Defining Algorithms and Reasoning

An algorithm is any well-defined computational procedure that takes some values as input and produces some values as output. For example, the insertion sort algorithm scans a list step by step, inserting each element into its correct position within the already sorted portion.

Reasoning, in our view, is a robust procedure for solving problems. The key is robustness: even if the procedure is not perfectly accurate or interpretable, as long as its behaviour is consistent and predictable across problem instances, and in particular performs well on out-of-distribution data, it can be regarded as reasoning.

Why Graph Neural Networks?

Graph neural networks have a natural structural correspondence with many classical algorithms, especially dynamic programming. For example, the update step of the Bellman-Ford shortest path algorithm closely resembles the message passing and aggregation operations in a GNN.

Bellman-Ford update:
[
d_v^{(t+1)} = \min_{u \in \mathcal{N}(v)} \left( d_u^{(t)} + w_{uv} \right)
]
where (d_v) is the distance estimate of node (v) and (w_{uv}) is the edge weight.

GNN update:
[
\mathbf{h}_v^{(l+1)} = \phi \left( \mathbf{h}_v^{(l)}, \bigoplus_{u \in \mathcal{N}(v)} \psi\left(\mathbf{h}_v^{(l)}, \mathbf{h}_u^{(l)}, \mathbf{e}_{uv}\right) \right)
]
where (\phi) is the update function, (\psi) is the message function, and (\bigoplus) is the aggregation function (e.g. sum or max).

This correspondence makes GNNs an ideal choice for learning to execute algorithms.
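To make the alignment concrete, here is a toy sketch (added for illustration, not part of the tutorial material) showing that one Bellman-Ford relaxation is exactly a message-passing step with min aggregation over incoming edges.

```python
import numpy as np

def bellman_ford_step(dist, edges, weights):
    """One Bellman-Ford relaxation written as a message-passing step:
    each node aggregates (min) the messages d_u + w_uv from its in-neighbours."""
    new_dist = dist.copy()
    for (u, v), w in zip(edges, weights):
        new_dist[v] = min(new_dist[v], dist[u] + w)   # message + min aggregation
    return new_dist

# tiny example: 0 -> 1 -> 2, plus a direct (longer) 0 -> 2 edge
edges = [(0, 1), (1, 2), (0, 2)]
weights = [1.0, 1.0, 5.0]
dist = np.array([0.0, np.inf, np.inf])
for _ in range(2):                  # two rounds suffice for this graph
    dist = bellman_ford_step(dist, edges, weights)
print(dist)                          # [0. 1. 2.]
```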

The CLRS-30 Benchmark

To systematically evaluate neural algorithmic reasoning models, we introduced the CLRS-30 benchmark. It covers 30 algorithms from the classic textbook Introduction to Algorithms (CLRS), spanning sorting, searching, path finding, string matching, geometric algorithms, and more.

CLRS-30 represents each algorithm on a graph and tracks various probes (variables) during the algorithm's execution. Probes can be inputs, outputs, or hints. Hints record the trajectory of intermediate states during execution, providing crucial signal for model learning.

Here is an example of how insertion sort is represented in CLRS:

  • Position probe: the index of each node (scalar input).
  • Key probe: the values to be sorted (node scalar input).
  • Predecessor probe: the predecessor of each node in the sorted list (node pointer output).
  • Hint probe: the trajectory of each node's predecessor pointer as the algorithm executes (node pointer hint).

CLRS-30 automatically handles data generation, encoder/decoder design, loss computation, and so on, greatly simplifying the research workflow.

Progress in Model Architectures

Since 2022, several improved model architectures have appeared in this area:

  1. Relational Transformer: incorporating edge features into the query, key, and value vectors makes Transformers more competitive on NAR tasks.
  2. GFlowNet: exploits the Markov property of algorithms, introducing a gating mechanism that explicitly forgets part of the historical embeddings, achieving excellent performance.
  3. RENAR: uses an LSTM in the aggregation function, giving up permutation equivariance, but stands out on certain list algorithms (such as quickselect).

At present, on the original out-of-distribution test set of CLRS-30, only three algorithms remain below 80% performance: Floyd-Warshall, Knuth-Morris-Pratt string matching, and strongly connected components. This points the way for future research.

Beyond Architectures: Memory, Training, and Theory

  • Memory mechanisms: equip GNNs with persistent memory components such as stacks (recursive algorithmic reasoning modules) or priority queues (neural priority queues) to handle more complex tasks and larger distribution shifts.
  • Making use of hints: naively predicting the next hint may not work well. Improvements include:
    • Contrastive learning: design objectives that contrast correct hint steps against incorrect steps from augmented graphs.
    • Open-book NAR: let the model consult example trajectories from the training set at inference time.
  • Asynchronous algorithmic alignment: many algorithms are asynchronous, while standard GNNs are synchronous. GNNs were designed that execute efficiently on GPUs yet are invariant to asynchronous execution traces, striking a balance between theory and practice.
  • Applications and theory: NAR has been applied to neuroscience, computational biology, combinatorial optimization, and robotics. On the theory side, frameworks such as categorical deep learning and looped Transformers are helping us understand the expressivity and generalization of NAR models at a more fundamental level.

Tools and Visualization

  • Salsa-CLRS: a PyTorch-based fork of CLRS that supports training on sparse graphs and generalizes to much larger graphs than the original CLRS setup.
  • Latent-space visualization library: visualizes how the model's internal embeddings evolve over time while executing an algorithm, helping us understand what the model has learned.


Part 2: From Graphs to Language: Algorithmic Reasoning in Large Language Models

In the previous section we explored neural algorithmic reasoning on graphs in depth. In this section, we look at how to extend these ideas to language models.

Large Language Model Basics

Most modern LLMs are based on the decoder-only Transformer architecture. They split text into tokens and predict the next token autoregressively, one at a time. At their core is the causal attention mask, which ensures that each token can only attend to the tokens before it.

Attention formula (simplified):
For query (\mathbf{q}_i), keys (\mathbf{k}_j), and values (\mathbf{v}_j), the (causal) attention output is:
[
\text{Attention}(\mathbf{q}_i, \{\mathbf{k}_j\}, \{\mathbf{v}_j\}) = \sum_{j \le i} \frac{\exp(\mathbf{q}_i \cdot \mathbf{k}_j)}{\sum_{l \le i} \exp(\mathbf{q}_i \cdot \mathbf{k}_l)} \mathbf{v}_j
]
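The formula above corresponds to the following minimal single-head causal attention sketch (added for illustration; scaling, multi-head splitting, and projections are omitted).

```python
import torch

def causal_attention(q, k, v):
    """Minimal single-head causal attention matching the simplified formula:
    token i only attends to tokens j <= i."""
    scores = q @ k.transpose(-2, -1)                  # [seq, seq] dot products
    seq_len = scores.size(-1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # forbid attending to the future
    return torch.softmax(scores, dim=-1) @ v          # weighted sum of values
```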

Chain of Thought and Length Generalization

  • Chain of thought: asking the model to output its reasoning steps before giving the final answer. This has been shown to significantly improve performance on complex tasks such as maths problems. In theory, chain of thought gives the model an extra "computational tape", increasing its computational expressivity.
  • Length generalization: the ability of a model trained on shorter problem instances to solve longer ones. This is crucial in practice, since training on short sequences is more efficient and such data is more plentiful.
  • The CLRS-Text benchmark: translates the algorithms in CLRS-30 directly into textual form, providing an ideal platform for systematically evaluating the algorithmic reasoning and length generalization of LLMs.

Challenges in Language Models

Although LLMs excel at pattern matching, they still face challenges in robust algorithmic reasoning:

  1. Positional encodings: critical for length generalization. Encodings such as RoPE are widely used, but how they generalize is still not fully understood; research shows that different positional encoding strategies significantly affect performance on long sequences.
  2. Representation collapse: analogous to over-smoothing in graph neural networks, on extremely long sequences the last-layer representations of different inputs can become indistinguishable, so the model is bound to make mistakes.
  3. Causal masking and information propagation: because of the causal mask, information from tokens near the end of the sequence has a harder time propagating into the final output representation, similar to over-squashing in graph neural networks.
  4. Softmax dispersion: softmax forces all probabilities to sum to 1. As sequences grow longer, the attention weights are forced to spread over more tokens, making it hard to maintain sharp attention on specific tokens, which hurts the robustness of exact operations such as copying.

Neuro-symbolic Combination: The TransNAR Model

To combine the language understanding ability of LLMs with the robust reasoning ability of NAR, the TransNAR hybrid model was proposed.

Model architecture
After each Transformer layer, TransNAR introduces a one-way cross-attention mechanism that fuses the node and edge representations produced by a pretrained GNN-based NAR processor into the Transformer's token representations; a rough sketch of this fusion step is shown below.
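This sketch only illustrates the idea of tokens attending one-way to NAR node embeddings; dimensions, module names, and the residual fusion are assumptions, not the exact TransNAR implementation.

```python
import torch
import torch.nn as nn

class TokenToNodeCrossAttention(nn.Module):
    """Hedged sketch of the fusion step: Transformer token embeddings attend
    (one-way) to node embeddings from a pretrained NAR processor."""
    def __init__(self, d_token, d_node, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_token, num_heads=n_heads,
                                          kdim=d_node, vdim=d_node,
                                          batch_first=True)

    def forward(self, token_emb, node_emb):
        # queries come from text tokens, keys/values from the (frozen) NAR nodes
        fused, _ = self.attn(query=token_emb, key=node_emb, value=node_emb)
        return token_emb + fused   # residual fusion into the token stream
```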

Advantages

  • On the CLRS-Text benchmark, TransNAR significantly outperforms a pure Transformer in out-of-distribution generalization.
  • Through knowledge distillation, TransNAR's ability can be transferred to a pure Transformer student; even without graph inputs, this student generalizes better than a Transformer trained only on in-distribution data.

Limitations and Future Directions

  • Training TransNAR still requires paired text-graph data.
  • Hyperparameters in the distillation process (such as the weight of the teacher supervision) need to be chosen carefully.
  • Future work includes extending such methods to broader tasks such as mathematical and logical reasoning, and exploring more efficient ways of fusing the two architectures.

Summary

In this session we studied the core ideas behind neural algorithmic reasoning and how the field has developed.

  • Part 1: we reviewed how graph neural networks can learn to imitate classical algorithms, emphasized algorithmic alignment and the importance of the CLRS benchmark, and surveyed recent progress in architectures, training techniques, theoretical understanding, and applications.
  • Part 2: we discussed the opportunities and challenges of bringing algorithmic reasoning into large language models, including chain of thought, length generalization, positional encodings, and issues with internal representations. Finally, we introduced the TransNAR model, which demonstrates the potential of neuro-symbolic combinations for improving LLM reasoning.

Neural algorithmic reasoning is a lively interdisciplinary area connecting deep learning, algorithm theory, and software engineering. Future directions include: architectures that fuse reasoning and perception more fundamentally, more comprehensive benchmarks, a deeper understanding of generalization, and applying NAR principles to a broader range of real-world problems. We encourage everyone to use open-source tools such as CLRS to explore this exciting field.

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p05 P05_Day_3__Part_1_2-Tutorial_on_Graph_Deep_Learning_for_Time_Series_Processing -BV1k9pAzpE8S_p5-

Global models, which is what is typically done in deep learning: here we have a single model that is trained on many time series coming from different sources。This is similar to what nowadays people call foundation models,

but usually here this scope can be more limited like for instance to time series coming from a specific domain。

The advantage here, again, as anticipated before, is the sample efficiency of the approach, and this sample efficiency allows for building more complex models,

more complex architectures that can be used to process the input time series。

The common downside of these two approaches, at least in their standard implementations, is that they both neglect the dependencies that might exist among the time series。

There are few methods that have been used in the literature to deal with this beyond the graph representations we will be talking about later。

and a naive option, the straightforward thing to do, would be to consider the input collection of time series as a single very large multivariate time series,

but this clearly has severe scalability issues, since it suffers from the curse of dimensionality,

so it results in high sample complexity and poor computational scalability。

We can instead consider models that operate on sets of time series。

keeping the parameters of the model shared among this time series。

And an example of this is like attention based architectures, like transformers。

where in this case attention would be computed with respect to the spatial dimension rather than the temporal axis。

And this approach can work pretty well, but as you know。

like if you are familiar with also like with the static graphs。

we know that using the transformers to process this kind of data can work pretty well。

but clearly the downside is that we are not exploiting any prior on the structure of the dependencies and also on the sparsity of these dependency structure。

There are also other methods that have been used in the literature that for instance rely on dimensionality reduction。

and here the idea is that we can extract some shared latent factors from this large collection of time series and use these factors to condition a global model。

And these can work well if data are low rank in certain application。

But the downside of this is that we are losing the local, fine-grained information,

which instead as we will see is captured very well by the graph based approach and of course。

also these approaches in cases of like very large data。

very large collections can suffer from the same scalability issues of the other approaches。

Now we can start talking about these graph based representations, so as anticipated。

the idea is to use a graph to represent the functional dependencies among the time series and use this graph as an inductive bias for our learning model。

We can use the graph adjacency matrix to model these dependencies, and the adjacency matrix

can be both asymmetric and dynamic, so it can vary at each time step。

Together with the adjacency matrix, we can also have edge attributes, which can be dynamic on their own

and can be both categorical and numerical。

Finally we can go back to the traffic example。Here again the structure, so the adjacency matrix, can be extracted from the structure of the road network, and you can have attributes or weights associated to the edges, for example to encode the road distance。In this case a dynamic topology can help, for instance,

in taking into account modifications in the structure of the traffic network。

This is a sort of summary of all the information that we have available at each time step。

so we have our target time series together with the exogenous variables, which can be both dynamic and static, plus the relational information that we just added to the setting。

So now the idea is to use this relational side information to condition our predictors,

our forecasting architectures; these relationships, as we will see, can act as a regularization to localize the predictions with respect to each node, and in particular

they can be used to prune spurious correlations that might be the result of not taking this structure into account。

Furthermore, these approaches are far more scalable than standard multivariate models because, again,

we can keep the parameters of the model shared among the time series we are processing。

in fact we can use these kind of architectures to forecast and to process any subset of the correlated time series。

In particular, the kind of graph neural networks that have been developed to process these data are called spatiotemporal graph neural networks, to refer to the fact that propagation in these models happens across both time and space, and in particular we will focus on those models based on the message passing framework。

And we will do that by considering this template architecture, which is composed of an encoder,

which simply encodes the observations, each observation independently at each time step and node。

and the encoder is then followed by a stack of spatiotemporal message passing layers,

which are the only components of the architecture where the propagation through time and space happens。

The representations extracted by the special temporal message passing blocks can then be mapped to predictions by a decoding block。

Looking at finer details of this, you see that for the encoder the processing happens at the level of the single time step and the single node;

the resulting sequence of observations is then processed by the spatiotemporal message passing layers。

The decoder, again, will then operate at the level of the single node and time step to map these representations to predictions。
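The template just described can be sketched as follows; this is an illustrative skeleton with assumed module and argument names, not the library's or the speakers' exact code, and the spatiotemporal message-passing block is left as a placeholder since concrete designs are discussed next。

```python
import torch.nn as nn

class STGNNTemplate(nn.Module):
    """Hedged sketch of the template: a node/step-wise encoder, a stack of
    spatiotemporal message-passing (STMP) blocks, and a node-wise decoder."""
    def __init__(self, in_dim, hidden_dim, horizon, stmp_layers):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)     # applied per node, per step
        self.stmp = nn.ModuleList(stmp_layers)           # propagate over time and space
        self.decoder = nn.Linear(hidden_dim, horizon)    # readout per node

    def forward(self, x, edge_index, edge_weight=None):
        # x: [batch, time, nodes, features]
        h = self.encoder(x)
        for layer in self.stmp:
            h = layer(h, edge_index, edge_weight)
        return self.decoder(h[:, -1])                    # predict from the last state
```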

We can then have several implementations of these spatiotemporal message passing blocks,

and we can see this as a sort of generalization of standard message passing layers, where instead of having static representations associated to each node we have sequences of representations。What we need to do is to modify the standard operators that compose a message passing architecture so that they can process sequences。

Clearly, there are many possible designs that exist to do this and that can be somehow matched with the requirements that we have for each specific application。

So the next step would be that of characterizing the possible design paradigms for these spatiotemporal blocks,

but this would be a good moment for asking any questions or clarify any doubt before moving on so if you have any feel free to use the chat or just unute yourself and ask us directly。

Okay。No questions, but yeah, anyway, if you feel like some some step of the discussion is not clear if you're free to use the chat and we will keep an eye on it。

Okay, so deep, there are as I was saying, there are several, okay, we have a question yeah right。

Oh sorry, they can't unmute themselves。So you want to read out the question? Yeah, yeah, okay,

so the question is whether it's possible to use STGNNs on multivariate signals with irregular sampling。Okay,

so I think we are referring to irregular time series。This is,

This is a good question and there are methods that can be used to model these kind of irregular observations over time in space。

and we will touch on them in the second part when we will be talking about dealing with missing data。

but yeah, there are methods to do that。The other question is whether the encoder and decoder are also a design choice, and yes they are: you have different operators that you could use。In practice, in many cases, since these are just a first encoding transformation and later on just a readout, they are usually implemented with standard MLPs, but in principle you can use whatever you want, basically。

Okay。So as I was saying, there are several methods, several... okay, one last question: is it similar to the embeddings in transformers that encode positional information too?

we will talk about this in a moment, there is something similar to this that is actually used often in this kind of models。

but we will arrive at that point。So for what concerns

the different designs, we can distinguish between time-and-space models, where the temporal and spatial processing cannot be factorized into separate steps,

so they happen jointly。We then have time-then-space models,

which instead, as the name suggests, factorize the processing of the temporal and spatial dimensions into two separate steps, and finally we have the space-then-time approach, which is basically time-then-space in reverse order。

For the time-and-space models, as I was anticipating, the idea is that

the propagation of representations through time and space happens in a way that means that, in the resulting model architecture,

this processing cannot be factorized into separate stages, and there are several ways of implementing these architectures;

for instance, one standard way is to integrate message passing into sequence modeling architectures。

we will see some examples of this。And the other approach is instead to use sequence modeling operators directly to compute messages。

And finally, there are also product graph representations,

which are basically a way of getting a static graph representation from these collections of time series, so that we can use standard message passing operators on it。

So one standard example, and actually also from a chronological perspective this is probably, to the best of our knowledge,

the first model that has been introduced to do this: we can start by looking at a standard GRU cell, so a standard gated recurrent neural network, and you can see that here clearly each time series is processed independently from the others。

We can easily get a spatiotemporal graph neural network version of this by implementing each gate of the recurrent cell using message passing。This is also very similar to the first kind of graph neural networks that were used to process static graph data, but in this case, at each update of the cell states,

We perform a message passing at a different time step。

so we use this network to update the states of the cell at each time step by reading information from the neighbors。

And these kinds of time-and-space models are known as graph convolutional recurrent neural networks,

and there have been many version of these, in particular。

many different architectures that have been tried to implement these by using different message passing blocks。

In particular, one very popular architecture of this kind is known as the diffusion convolutional recurrent neural network,

which is basically a GCRNN where the message passing operators are implemented by a bidirectional diffusion convolution,

where basically the idea is to have different weights for incoming and outgoing edges。

And this was one of the first models that have been used to process time series and has led to many follow up and it's still quite popular and quite effective。
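A minimal sketch of such a graph-convolutional GRU cell is given below; it is in the spirit of GCRNN/DCRNN but is not any library's exact implementation, and `graph_conv_factory` stands in for whatever message-passing operator (e.g. a Chebyshev or diffusion convolution) one wants to plug into the gates。

```python
import torch
import torch.nn as nn

class GraphConvGRUCell(nn.Module):
    """Hedged sketch: a GRU cell where each gate is computed with a graph
    convolution instead of a dense layer (time-and-space processing)."""
    def __init__(self, graph_conv_factory):
        super().__init__()
        self.update_gate = graph_conv_factory()
        self.reset_gate = graph_conv_factory()
        self.candidate = graph_conv_factory()

    def forward(self, x_t, h, edge_index, edge_weight=None):
        xh = torch.cat([x_t, h], dim=-1)
        z = torch.sigmoid(self.update_gate(xh, edge_index, edge_weight))
        r = torch.sigmoid(self.reset_gate(xh, edge_index, edge_weight))
        xrh = torch.cat([x_t, r * h], dim=-1)
        c = torch.tanh(self.candidate(xrh, edge_index, edge_weight))
        return z * h + (1 - z) * c      # new hidden state for this time step
```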

The second example is spatiotemporal convolutional networks,

where what we are doing is just alternating temporal convolutional filters with spatial convolutions, that is, with convolutions on the graph。

So basically what you are doing is applying a standard temporal convolution at each node separately (you can use any kind of 1D temporal convolutional filter you want), and then following this with a step of message passing。By stacking many of these layers you get an architecture where the receptive field gets larger at each layer with respect to both time and space。

In particular, there is STGCN, which was the first architecture of this type to be introduced in the literature, and here basically the blocks are stacks of gated temporal convolutions with polynomial graph filters。Similarly to the recurrent STGNNs,

also for these kinds of models there have been many different implementations;

in particular there is one model, called Graph WaveNet, that has been around since 2019, is still very popular and still performs very well in benchmarks。The idea there is to basically use more advanced convolutional operators, also including dilation, to increase the receptive field of the model

faster。The third example of time-and-space models is those models that use sequence modeling operators to compute messages。Here is a simple example where the messages are computed using a TCN: in this case, basically, what you will be doing is to just concatenate the sequences of observations at the two neighboring nodes and then apply a convolutional filter on the resulting sequence。

Clearly you can use whatever sequence modeling operator that you want and for instance。

there are many examples of these kind of architectures that use attention operators and in this case this will be cross attention since you will be computing attention between sequences of two different nodes。

We then have the product graph representations and these models。

these representations come from the simple observation that you can see the temporal dimension as a line graph and combine in some way。

I mean there are several ways that one could consider。

combine this temporal graph with the spatial graph。And then, again,

process the resulting representation using a standard message passing net。For instance。

you can consider a standard Cartesian product, where basically

each node at each time step will be connected to its neighbors and to itself at the previous time step。

There are also other ways of wiring this graph that you might use, like the Kronecker product, and in this case the idea is to connect each node to its neighbors at the previous time step。

You can also come up with many different methods of wiring this graph and actually I think this is a direction where further studies would be needed also to understand the properties of these product graph representations。
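The sketch below shows one way to materialize a Cartesian space-time product graph as an edge list; the indexing convention (node (v, t) mapped to index t * N + v) is an assumption chosen only for illustration。

```python
import torch

def cartesian_product_graph(edge_index, num_nodes, num_steps):
    """Hedged sketch: edge list of a Cartesian space-time product graph.
    Spatial edges are replicated at every time step; temporal edges connect
    each node to itself at the next time step."""
    spatial, temporal = [], []
    for t in range(num_steps):
        offset = t * num_nodes
        spatial.append(edge_index + offset)            # same-time spatial edges
        if t + 1 < num_steps:
            nodes = torch.arange(num_nodes)
            temporal.append(torch.stack([nodes + offset,
                                         nodes + offset + num_nodes]))
    return torch.cat(spatial + temporal, dim=1)        # [2, num_product_edges]
```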

We can then look at the time-then-space approach。This one is very simple to understand, and the resulting models are very simple to build: the idea here is to just embed each sequence separately into a vector representation and then perform message passing on the resulting graph。

And the advantage here is that this is very easy to implement and also computationally efficient because at the training time again。

you're not performing message passing at each time step, but only at the last one。

and we can also reuse all of the operators that we already know to process sequences and graphs。

The downside of this is that this two-step procedure might introduce a very serious information bottleneck, and as you know this can make some of the issues that we commonly have on graphs, such as over-smoothing and over-squashing, even more serious; it could make it very difficult to propagate longer-range information across space and time。

And also since you are performing message passing only with respect to a single graph representation。

this would also make it more difficult to account for changes in the topologies and also for dynamic edge attributes。

Then we have the space-then-time approach which, as anticipated, is basically the other way around: here you are performing message passing at each time step separately and then encoding the sequence of representations with a sequence model。These approaches have been used quite often in the literature, but they still have this kind of bottleneck from the factorized processing, which might make it more difficult to propagate information, and they do not enjoy the same computational advantages of time-then-space models, since you are computing message passing for each single time step。

So now I see, maybe there's one question。Okay, yeah, there are also, okay,

the question is about if we have other options to model the temporal dimension rather than using a line graph。

yeah, for sure, like I think you can consider any representation that you want as long as you take care of somehow preserving the structure of the data。

also one thing that I have not said is that when you perform message passing on that kind of graph,

you should take care of considering that temporal and spatial edges have a very different meaning;

for instance, you could use different weights to process the different messages, or things like that。

but yeah in principle, I think it's the general idea is that of using these static graph representations to process these data and I think the optimal way of well the optimal representation that one might use is kind of an open question or rather using different representations might have different properties that align well or not with your problem。

Is there a method that uses time-then-space and then mixes them? Well, you can do all sorts of wild things; the problem is that with the usual time-then-space approach, after you have encoded the temporal information you have a static representation, because you perform message passing only on the representation extracted by the encoder。So you have time series going in in the first step, and then a graph coming out which is later processed。Of course you can imagine something slightly different, maybe adding some kind of skip connections or whatever, but I also think what you describe, time-then-space mixed, is somehow similar

to the time-and-space models that I was talking about when we were discussing the convolutional approach;

in that case they work exactly like this, so they interleave the temporal and spatial filters to progressively increase the receptive field。

Okay, then I think we can move on and then later on we can go back, we can answer further questions。

So, this globality and locality thing I've been talking about quite a bit in the presentation plays a big role also in STGNNs。

So STGNNs, at least in the standard implementation with the template that we have been talking about,

they are global models, so they can handle arbitrary sets, so they are inductive models。

And they can use the dependency structures to provide further conditioning on the predictions on the forecasts。

However, this further conditioning might not be enough, and the model can struggle in modeling all the heterogeneous dynamics that might be present in the time series collection。As a result,

the model might need very long observation windows and a very high model capacity in order to account for all of these different heterogeneous dynamics。

One way around this has been that the literature has somehow followed is to consider hybrid global and local architectures。

One straightforward way of implementing a hybrid architecture would be, for example,

looking again at our template architecture, to turn some components of this architecture into local ones。

For instance, you could give the encoder and decoder parameters that are specific to the time series that will be processed there。

Clearly, the resulting models would be able to capture these local effects much more easily;

you can see this as similar to using a backbone model and having some layers that are fine-tuned, or specific to the kind of data that you want to process。

And of course, the downside is that if you want to process many time series using this approach,

this would result in a very large number of local parameters, depending on how these encoder and decoder blocks are implemented。

One way to amortize this, of course, is to simply consider node embeddings,

which are just vectors of learnable parameters that can be associated to each node。

and this slightly goes back to the previous question: you can see these node embeddings as a sort of learnable positional encodings,

but what they are actually doing here is not encoding the position of the node in the graph, but rather modeling the local components, the specific characteristics of each time series。

These learnable vectors can be fed into the encoder and decoder, and as I was saying, they amortize the learning of these local, time-series-specific processing blocks,

Allowing for keeping most of the models parameters shared。Clearly。

we still have the downside of having a number of learnable parameters that in this way scales linearly with the number of time series that we are processing。

and so there are intermediate solutions that one might consider such as learning embeddings for clusters of time series rather than for each single sequence。
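The hybrid global-local idea can be sketched as follows; this is an assumed minimal design (names and shapes are illustrative), where a shared encoder is concatenated with a learnable embedding per node, and in transfer learning one would freeze the shared weights and fine-tune only the embeddings。

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    """Hedged sketch of a hybrid global-local encoder: shared (global) weights
    plus one learnable embedding per node (local)."""
    def __init__(self, in_dim, emb_dim, hidden_dim, num_nodes):
        super().__init__()
        self.node_emb = nn.Parameter(torch.randn(num_nodes, emb_dim) * 0.1)
        self.mlp = nn.Linear(in_dim + emb_dim, hidden_dim)   # shared weights

    def forward(self, x):
        # x: [batch, time, nodes, features]
        b, t, n, _ = x.shape
        emb = self.node_emb.view(1, 1, n, -1).expand(b, t, -1, -1)
        return torch.relu(self.mlp(torch.cat([x, emb], dim=-1)))
```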

One other issue that is very important to consider here is that when we move to a hybrid architecture, the resulting model is not inductive anymore。

This might sound like a downside, and it actually is, so you definitely lose some flexibility, but in practice you also gain something in transfer learning scenarios, so in those scenarios where you have some data from the target domain that you can use to fine-tune your model。

In particular, having a hybrid architecture allows for keeping most of the global components of the model shared

and just fine-tuning the local parts, in this case

just fine-tuning the embeddings。And in particular, it has been shown in the literature that regularizing these node embeddings can facilitate this transfer learning procedure even further。

We can start and we can have a look at some empirical results。

So these four are very popular benchmarks for correlated time series forecasting in the uppermost part of the table here。

we have some baselines, some reference architectures that have been developed by using the template that we just saw。In particular we have these GCRNNs,

which are basically time-and-space models, and then we also have different implementations of time-then-space models。

And then we have also three architectures from the state of the art。

So we can see that when we add these local embeddings to the baselines,

when we make these models hybrid, so when we introduce some part that models the characteristics of each time series separately,

we see that performance can improve by a wide margin,

so much so that with these very simple architectures。

in many cases you are able to match the performance of much more complex state of the art models。

Also much deeper with many more parameters。We can then have a look at some transfer learning results。

so here what we did basically was taking four data from four different traffic networks。

we trained a model on three of these networks and then performed the transfer step on the fourth one。

The first observation is that the fine tuned model。

so the models after transfer learning perform much better than the zero-shot inductive model;

this is somehow to be expected, and we might also expect this gap to narrow if we use more data to train the model。

But the other very interesting thing that we can see is that the performance of the hybrid architecture, where just the local components are fine-tuned,

fit on the new data, is much better than that of the fine-tuned global model。In particular,

the variational and clustering approaches are regularizations that have been studied in the literature, in the paper cited here, and by using these regularizations,

performance in transfer can be improved even further。

So now we have reached the end of the first part, we formalized the problem of processing a correlated time series。

and we show we can use graph representations for modeling the dependencies among them。

we discussed the forecasting problem and these important distinction between global and local deep learning models for time series。

which is a crucial aspect of the whole thing。And we then saw approaches to building spatiotemporal graph neural networks and the associated trade-offs。

Before discussing the challenges, we will look at software implementations of the above;

in particular, Ivan will be presenting a tutorial on a library that we developed, but before that we can answer some questions and have a short five-minute break。

Would it be interesting to analyze the causes of the local behavior? Yeah, it would definitely be interesting。There are different things that you can do; one analysis that we did in this paper here, for instance, was to cluster the node embeddings (the clustering was actually done end-to-end) and then look at the resulting time series clusters, and you could see that the corresponding time series were quite different。In particular we used a dataset of load consumption, and there it was

easy to see that those time series had very different behaviors, for example,

like very different consumption patterns in the same days of the week。

but yeah I think this is you could probably also do something which is a more fine grained if you also yeah if you use some way of representing some local components which is more interpretable。

but I think this is kind of an open question, yeah。

you can definitely try to use these somehow to analyze the different behaviors。

The other question is: the time series are encoded into a static graph, so is this a universal problem in both local and global models, and can it be addressed with an autoregressive context? So, okay,

can it be solved with the name cultural aggressive context。So okay。

the fact that the time series are collapsed into a static graph representation is typical of time-then-space models。

This does not happen in time-and-space approaches, as we saw before, like for instance

with recurrent graph convolutional networks, there the representation is not collapsed。

But this thing of having these global and local aspects is orthogonal to the models;

I think it's even orthogonal to graph-based approaches。

It does play an important role here, since you want to take the dependencies among time series into account,

but this is something orthogonal to the specific architecture that you are using。

Could you comment on which approach (time-and-space, time-then-space, or space-then-time) is commonly used for traffic monitoring?

Well, these, again, there are many architectures they use many different that are wired in many different ways。

so the literature is very, very rich in this regard and there are also surveys that try to classify all of these different approaches。

I'd say the time and space approach is the one that is most commonly used。

but this also depends on the kind of constraints that you have。

particularly if you have computational constraints, the time-then-space approach works pretty well, but whether that approach works well also depends on the kind of data: if you really need to model longer-range interactions, a time-then-space approach might not be enough。

My suggestion, and later on we will try to come up with some guidelines, is that you can definitely start by using time-then-space as a first try, since again it's much easier to train, much more scalable, and it gives you good results; we will also talk about quantifying what good results mean。

It's fine, it's actually much easier to use。Is there any loss function that takes into account the variance and error for STGNNs?

Well, I think you are referring to... well, again, you can fit these models using

basically any kind of error function that you want;

for instance, if you want to take the variance into account, and by taking the variance into account I mean quantifying the variance, you can train a probabilistic model。As I was saying, this is not covered by the tutorial, we are just considering point predictors for simplicity, but there are many resources on how to build probabilistic time series forecasting models, and I'd say that this is pretty much orthogonal to the discussion we are having here。But yeah, for sure there are methods that you can use to quantify uncertainty。

Okay, so I think we can have the the short five minutes break and no problem, no problem。

nothing to be。

Okay, so hello everyone, we can continue with the second part。

so now we start the tutorial, or actually I think we can skip this slide with the tutorial information, you can find it there。

Okay, sorry。Okay,Just click here, Okay, are you okay。Okay, yeah, sorry。

So in this demo, in this part, we will see how we can actually build spatiotemporal GNNs, and we will use for this purpose

an open-source library that we developed in our lab, which is called Torch Spatiotemporal。You have all the links here: we have the documentation website and the GitHub repo。You have all the links in the PDF online as well as the QR code, so I will leave it here for a couple of seconds if you want to open it from the QR code。

Yeah, so we will move back to... sorry。Okay。As I was saying, now we can,

we will start by... let me attach to the... okay。So in this notebook,

what we will see is to explore all the functionalities that are available in TSL。

so Torch Spatiotemporal, which is a library relying on PyTorch and PyTorch Geometric, and also PyTorch Lightning, to ease all the research, or in general to,

okay, sorry, to foster and accelerate research on spatiotemporal data processing using graph neural networks。

So let's start by installing all the necessary dependencies;

here we are simply setting up the environment。What I suggest is to have an environment in which you install

your preferred PyTorch version, and then install Torch Spatiotemporal afterwards; it will install all the other dependencies。

So at the moment here, we are using torch version 2.5.1。Okay, this is just, let's say,

let's say routine code just to import everything we will need and to have a look at which version of these dependencies we are using now。

In the meantime, let me just briefly introduce the library。TSL is not just a collection of layers。

let's say it's not just, if you want something which is easily plug-and-play with all the possible implementations of STGNNs, it might not be the right library;

what we instead want it to cover is the entire pipeline, from data acquisition to the downstream task。

So we start by loading a given dataset, and on the data we can apply all the usual preprocessing that is needed when we want to do forecasting, for instance,

like scaling and normalization of the data, or tools for resampling the data according to a given sampling rate, and so on。

We also have a strong focus on data coming from the real world,

so all the real time series where each observation is associated with a specific time step in the real life。

Also tools to handle missing data and other irregularities (we will see more in the second part of the tutorial), and then of course we have all the parts about modeling and inference, such as all the layers that we might need and ways to build our own STGNN。

So let's start by loading a data from the ones available in TSL, we have a wide array of data sets。

we have some datasets for traffic forecasting,

some data sets for air quality or energy analytics。

Most of these data sets usually are widely used benchmarks in research in this special temporal data processing community。

We will start by using MetrLA, which is the dataset that Andrea was showing before in the tables;

it's a traffic dataset, so it contains data from 207 loop detectors in LA。

The sampling rate is five minutes and we have approximately four months of data。

We load the dataset simply, as I think you are already used to with other libraries: we import it from tsl.datasets and download it into a specific folder.

And here we can see that we have approximately 34,000 time steps, which corresponds to five-minute sampling over four months; we have 207 nodes, so the detectors, and for each node we have just one channel, which is the detected speed. So each time series is univariate, although we have 207 of them. Okay.
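A minimal sketch of the loading step just described, with the dataset class and argument as in the TSL documentation (exact signatures may differ across versions):

```python
# Hedged sketch: download MetrLA into a local folder and inspect its shape.
from tsl.datasets import MetrLA

dataset = MetrLA(root='./data')  # downloaded on first use
print(dataset)  # ~34,000 time steps x 207 nodes x 1 channel (speed)
```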

Let's print some statistics and information about this dataset: we know that the sampling rate is five minutes, and we know that there are missing values, approximately 8% of the data.

We can in principle have exogenous variables, so data that come with the dataset, and among these we have a matrix that tells us the distance between the sensors.

So this covariate here, dataset.covariates, is, let's say, a storage for all the information that we might need for our task but that is not the target of our downstream task. In this case, we have the distance between each pair of nodes, which should in theory be the road distance from one node to another.

We can have a look at it: what this matrix is storing is basically that from node 0 to node 1 and up to node 9 we don't have a path, essentially, so the distance is infinity, while from node 1 to node 2 we have this distance, in kilometers if I'm not wrong. No, maybe it's less. Anyway,

now we can also print the dataset to see how it looks. Here we have the node IDs, we have just one channel, and for each of these nodes we have an observation every five minutes, as I was saying before. We start in March, and these are the readings for each of the sensors. Okay, fine.

How can we build a graph out of this? Because what we have now is the only information available: the distances between sensors and the data. We have different ways in which we can build a graph when we don't have a given ground-truth graph; we will see later in the tutorial more complex ways to build a graph, but here what we will do is essentially compute a similarity score between nodes starting from their distance. We will use a Gaussian kernel based on the distance, just to convert the distance into a measure of similarity, so the closer the nodes, the higher the score; before, the diagonal was zero, now it's one.

Okay, this is a standard way, let's say, to build a graph based on distance; it has been used in several works. Now let's use this method instead. What we did here, essentially, is to get a similarity score out of the distance using this function, which is embedded in the dataset: each dataset can implement its own similarity methods, and we have some default ones. This get_connectivity instead applies all the usual post-processing to this similarity matrix to make it a real adjacency matrix. So now we say: let's threshold the data, so all the entries below 0.1 are considered zero, meaning no edge between the two nodes; we remove self-loops; and we normalize the weights along a given axis, so either the incoming or the outgoing edges.

And we want the layout to be edge_index; this is very much PyTorch Geometric language, let's say. The edge index is simply a list of edges, stored sparsely rather than as a dense matrix. We can have a look at it: we have roughly 1500 edges with edge weights, so this is our weighted adjacency matrix. Here we have some operations just to convert it back to a dense matrix and to check that the sparse weights of this matrix are still the same edge weights as before.
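A sketch of the graph-construction call just described; the keyword names follow the TSL documentation, but treat them as assumptions to check against your installed version:

```python
# Similarity from distances (Gaussian kernel, implemented by the dataset),
# then post-processing into a sparse, weighted connectivity.
connectivity = dataset.get_connectivity(
    threshold=0.1,          # entries below 0.1 become zero (no edge)
    include_self=False,     # remove self-loops
    normalize_axis=1,       # normalize weights along one axis
    layout="edge_index")    # PyTorch Geometric-style sparse layout

edge_index, edge_weight = connectivity
print(edge_index.shape, edge_weight.shape)  # [2, E] and [E]
```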

Okay. Now, what we had so far is a dataset which is still not in PyTorch; it mainly uses NumPy and pandas, which gives, let's say, a more user-friendly view of the dataset on which we can do our standard analysis. But when it comes to training a model and feeding this data to a neural model, we have to switch to the PyTorch interface, and this SpatioTemporalDataset basically comes to our help. Here what we do is pass to this dataset the target variable, since it wants to know what the goal of our task is, basically what we want to predict in the case of forecasting; the connectivity matrix;

any covariates or exogenous variables, if we have them; Andrea mentioned for instance the encoding of the time of day that might be used, but at the moment, in this example, we don't have any, though we can add them easily; the mask telling whether data is available or not; and then these parameters here, window, horizon, and stride, that decide how the dataset is split into windows.

These windows are then fed to our model. For those who are not familiar with the sliding-window approach: what is usually done when we have a long time series is to split it into windows, feeding the model with just a small window of data and then predicting, let's say, the next k steps. In this case the number of steps ahead is defined by the horizon parameter, the length of the look-back window is defined by the window parameter, and then we have a stride parameter that defines how many time steps there are between one sample and the next. Okay, here you have, let's say, all the parameters that you can pass to this class and the effect they will have on your time series.
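A hedged sketch of this step, with the class and arguments as in the TSL documentation (window = look-back length, horizon = steps to predict, stride = offset between consecutive samples):

```python
from tsl.data import SpatioTemporalDataset

torch_dataset = SpatioTemporalDataset(
    target=dataset.dataframe(),   # the variable we want to forecast
    connectivity=connectivity,    # (edge_index, edge_weight) built above
    mask=dataset.mask,            # which observations are valid
    horizon=12,                   # predict the next 12 steps (one hour)
    window=12,                    # condition on the previous 12 steps
    stride=1)                     # one new sample per time step
print(torch_dataset)              # windowed samples, nodes, channels
```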

So if there are questions, please just write them in the channel; I will have a look at the channel from time to time. Okay, sorry,

that was about a previous question, okay. Okay, so, this dataset: once we create it, we have basically split the long time series into smaller time series, and all of these are now different samples of our dataset. So now we have this number of samples here, and we still have the nodes and channels as before;

each of the samples will have 12 time steps inside. And we can actually have a look at it: yes, we have 12 time steps and still the same number of nodes and features. Okay, we can go on.

I mean, here we have the torch dataset, the dataset that we just built, and you can access any of its samples as if it were a list; in this case we are getting the first sample in the list.

And this data object that is returned is basically an extension of the torch_geometric Data object, but wrapped in our custom TSL Data object. We built upon the original Data, and here we have two inner dictionary-like storage objects in which we store all the input variables and all the target variables. The inputs are the ones that are given to the model, and the targets are the ones that we instead want as output. You don't have to specify all this information when you build your dataset; you have maximum flexibility to edit all of it, but if you don't, the most common procedure is adopted: your time series is split, and the target time series becomes both the input, named x, and the target, named y.

Okay. Now, here we can see what the input contains: the x, the edge index, and the edge weights. So basically the inputs to the model are the same that we have seen in the tutorial, the x and the adjacency matrix. And the target is just the y; again, we are not predicting a graph as output but just time series; the graph is just a means that we use for processing our input.

Okay. In this case we can check if we have a mask, and we do; the mask tells whether data is missing or not, and in this case all the data we are seeing is not missing. We can also see if there are transforms: the transforms here are the scalers that we usually use to scale the data, for normalization or applying the MinMax scaler, but we will see them very soon. Okay, there is other information I can skip; this tutorial has much more detail, trying to explain all the steps you might be interested in,

but we can skip some of it to get to the core part. An important thing is that the batching of the dataset is done very efficiently by the data module, and you can access batches of data as if it were a list, as I said before. Here we are getting the first five samples in the dataset, and they are batched into a single object. The advantage of this approach comes when we are dealing with static graphs: the relationships among the time series are not changing or evolving over time, so we can assume that we have a single graph. In this case we can have this advantageous batching, where we are basically stacking the time series but not the edges, so the graph is the same for all of them. Okay, in this part here we are basically splitting the dataset into training, validation, and test, and we are fitting our standard scaler, the one that removes the mean and divides by the standard deviation. Okay.

When we actually call dm.setup(), we are fitting all this information. This dm here is the SpatioTemporalDataModule which, for those who are familiar with PyTorch Lightning, is an instance of the PyTorch Lightning data module.
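A sketch of this split-and-scale step; the class names follow the TSL documentation, and the split sizes and batch size here are illustrative assumptions:

```python
from tsl.data.datamodule import SpatioTemporalDataModule, TemporalSplitter
from tsl.data.preprocessing import StandardScaler

scalers = {'target': StandardScaler(axis=(0, 1))}   # fit over time and nodes
splitter = TemporalSplitter(val_len=0.1, test_len=0.2)

dm = SpatioTemporalDataModule(dataset=torch_dataset,
                              scalers=scalers,
                              splitter=splitter,
                              batch_size=64)
dm.setup()   # fits the scalers on the training split and builds the splits
```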

Okay, now let's go directly to the model part. As customary in all these libraries, we have all the neural network parts inside the submodule tsl.nn.

Here we have a collection of different layers that can be used; blocks, which, let's say, implement logic a bit more complex than layers, such as encoders or decoders for models; and we also have some ready-to-use models.

Now, for this tutorial we will design a custom STGNN, and this will be done following the time-then-space paradigm: what we want to do is embed all the temporal information into a single vector per node, so that we end up with a node-attributed static graph, and then we do the message passing on this single graph. We will also make use of node embeddings, so parameters which are specific to each node, and this will make our STGNN a global-local model.

So let's have a look at it; I hope I executed all the important cells. This is the model, the time-then-space model. Here we have the node embedding table, which has a smaller size with respect to the hidden size we are using.

We have the encoding step; as we said before, it's just a feature encoding step, so it takes as input a single time step and returns a single time step, applied pointwise.

Then we have the time part, which in this case is a GRU, and we will take only the last state of this GRU, so the encoding of the time series. Then, for the spatial part,

we will use the diffusion convolution, which is the operator also used in the Diffusion Convolutional Recurrent Neural Network, DCRNN. In the end we will have a single vector for each node, and the decoding part is done by an MLP. So the logic is here, the one that I just mentioned; a condensed sketch is shown below, and then we can go ahead.
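A condensed sketch of such a time-then-space model; the layer names (NodeEmbedding, RNN, DiffConv) follow the TSL documentation, but the exact signatures and the hyperparameter values here are assumptions to verify against your version:

```python
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange
from tsl.nn.blocks.encoders import RNN
from tsl.nn.layers import NodeEmbedding, DiffConv

class TimeThenSpaceModel(nn.Module):
    def __init__(self, input_size, n_nodes, horizon,
                 hidden_size=32, rnn_layers=1, gnn_kernel=2):
        super().__init__()
        self.encoder = nn.Linear(input_size, hidden_size)    # pointwise encoding
        self.node_emb = NodeEmbedding(n_nodes, hidden_size)  # local parameters
        self.time_nn = RNN(input_size=hidden_size, hidden_size=hidden_size,
                           n_layers=rnn_layers, cell='gru',
                           return_only_last_state=True)      # time first...
        self.space_nn = DiffConv(in_channels=hidden_size,
                                 out_channels=hidden_size,
                                 k=gnn_kernel)               # ...then space
        self.decoder = nn.Linear(hidden_size, input_size * horizon)
        self.rearrange = Rearrange('b n (t f) -> b t n f', t=horizon)

    def forward(self, x, edge_index, edge_weight):
        # x: [batch, time, nodes, features]
        x = self.encoder(x) + self.node_emb()          # add node embeddings
        h = self.time_nn(x)                            # -> [batch, nodes, hidden]
        z = self.space_nn(h, edge_index, edge_weight)  # one round of message passing
        x_out = self.decoder(z)                        # -> [batch, nodes, t*f]
        return self.rearrange(x_out)                   # [batch, horizon, nodes, f]
```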

We can make our own variant by changing the number of layers, for instance two or n layers here; we can increase the hidden size and the embedding size, and the number of message-passing layers.

Okay, let's have a look at the model now. This is our model and this is the number of parameters we have at the moment.

I also want to show you how the data looks for our input, so what we are giving as input to the model. First of all, this is what the model sees for a time series: let's say we are at node zero of sample zero, and these are the 12 time steps we are giving as input to the model; we want to predict the next 12 in this case.

And this is what a single, let's say, local model would usually see: the time series of one node as input, and its output.

What we can do instead, thanks to the graph processing, is to also have as input the set of time series of the neighbors of the node. Here we have 11 nodes, so maybe it's not that easy to see something that can be helpful for the prediction, but the rationale behind it is that using these time series to forecast the signal on the right actually brings improvements with respect to forecasting the signal using only the signal on the left.

These values on the left-hand side are normalized, as you can see from the scale, which is not the same as the target's in this case, but for convenience we also plot the scaled target here. So, there is a question: "In the first part of the tutorial we were discussing that STGNNs for traffic modeling such as DCRNN are time-and-space models; here we implement a time-then-space model. Is the result the same? Please correct me if I misunderstood." So, these are different approaches, different paradigms. DCRNN is indeed a time-and-space model, as we briefly mentioned before, but we will have a deeper discussion right after the notebook. A time-then-space model is faster and more scalable, and this is also why we chose to have this one in the demo, also for efficiency purposes; but the performance of these approaches is not as bad as one might think compared to

other time-and-space models. So, in the end, we will see; I mean, we are not looking at something which is, let's say, a naive network; it's something that theoretically can work very, very well. Okay.

Now we go to the training part. In tsl.engines we have a Predictor module, which is a Lightning module that can be used to ease, let's say, the burden of defining the training loop, the validation loop, and the evaluation pipeline in general. We can import the metrics that we define in our library, which are usually the metrics used in traffic forecasting and in forecasting in general.
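Putting the pieces together, a hedged sketch of this setup (the Predictor wraps model, loss, metrics, and optimizer into a Lightning module; class and metric names follow tsl.engines and tsl.metrics, while the trainer settings and metric steps are illustrative assumptions):

```python
import torch
import pytorch_lightning as pl
from tsl.engines import Predictor
from tsl.metrics.torch import MaskedMAE

predictor = Predictor(
    model_class=TimeThenSpaceModel,
    model_kwargs=dict(input_size=1, n_nodes=207, horizon=12),
    optim_class=torch.optim.Adam,
    optim_kwargs={'lr': 0.001},
    loss_fn=MaskedMAE(),
    metrics={'mae': MaskedMAE(),
             'mae_at_15': MaskedMAE(at=2),     # 3rd step = 15 minutes ahead
             'mae_at_60': MaskedMAE(at=11)})   # 12th step = 1 hour ahead

trainer = pl.Trainer(max_epochs=2, limit_train_batches=100)
trainer.fit(predictor, datamodule=dm)
trainer.test(predictor, datamodule=dm)   # evaluate on the test split
```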

In the meantime, okay, I'm starting the training now. We won't do a very long training; it's just to see, also in this case, that the computational burden of a time-then-space model is not that big compared to other time series forecasting methods.

Here what the library is saying is that we are giving as input to this model the x, so the input, the edge index, and the edge weights.

We can have a look at what's happening during training now. This is the train loss, so the model is learning something. I invite you, if you are interested, to tune some hyperparameters here and train, let's say, your own STGNN on this example. For the sake of time, this is basically everything; I think we can simply stop the training here.

Let's just do the evaluation part, to see what happens in the end. Here we are loading, let's say, the best model, which of course is not a fully trained model at the moment, and we are testing it on the same splits that are usually used in publications.

And here we can see the results. This test MAE is the mean absolute error over the entire horizon; the one at 15 means after 15 minutes, so three time steps in the future, while at 60 it's after one hour, so 12 time steps in the future. All these metrics are here because we defined them here; you can have your own, and you can monitor whatever you want by changing the metrics here.

Okay, so this is pretty much it. This demo is available in the slides and also on the website of the tutorial; the library is online, you have it on GitHub and Read the Docs. If there are questions, I'm happy to take questions on this part, and if there are no questions, I will move to the second part of the tutorial. Okay, so, which one is it. Okay, here. Okay.

So, I hope you're still here with me. In this second part of the tutorial we will have a look at the challenges that are inherent to this problem setting. We will look at four challenges. The first one is scalability: how we can deal with large collections of time series, so how we can process the data when we have lots of them.

A second problem, which also came up in the questions before, is what to do when we have missing or misaligned observations in our time series. Then Daniele will cover the latent graph learning part, so what to do when the underlying graph is not known: how we can learn the dependencies existing between the time series, and also how we can evaluate these graph-based models, how we can understand whether a model is optimal or not. So let's start with the scalability part.

Scalability is actually a feature, right? Because by using graphs, what we can achieve is a single inductive model, so a global model, while still conditioning on related time series in a sparse fashion: we are not giving as input to the model the entire set of time series, nor just one of them, but the ones that we think are the most relevant for the prediction of the target time series.

And in doing so, we reduce the cost of the usual operation of considering all the other time series from being quadratic in the number of time series to being linear in the number of edges, so in the dependencies that we found among the time series.

But scalability can also be an issue here, because we have data that, as the name suggests, span two dimensions: the first being the spatial dimension, so the number of time series that we have, and the other being the time dimension, so the number of time steps in each time series. In real-world applications,

it's not that uncommon to deal with high-frequency time series and also with large-scale collections of time series. This is for instance the case in smart cities, with traffic forecasting or environmental monitoring, for instance if we want to monitor the air quality in open environments; or this can also be the case in finance, where we have prices sampled at the level of seconds or even faster, for a large number of stocks, or of entities in general. So,

usually the problem is that we have a large amount of data and we want to process it all, and this is particularly true when we want to account for the long-range dependencies that might exist in time, in space, or in both, for instance if we want to capture something which is very far away in time and space.

Now let's have a look at the actual computational complexity of STGNNs. Consider a general formulation of a time-and-space model: there are very different implementations of time-and-space models, but in general they need some node-wise temporal processing, and they also have a spatial processing that usually scales with the number of time steps. If we consider, for instance, a graph-based recurrent neural network, at each time step we do L layers of message passing, so what happens is that we scale, let's say, with the number of message-passing operations times the number of input time steps.

The first step towards improving this scalability is given by the time-then-space models, in which only the temporal processing is still done node-wise, but then we just have to do the message passing on a single graph, as if we had just a single time step. In this case we have a computational complexity that is additive with respect to edges and time steps, instead of multiplicative. And as we've also seen before, this is an advantage that space-then-time models instead do not have. Okay.
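In symbols, a rough comparison of the two paradigms (a sketch, omitting constants and feature sizes; T is the number of input steps, |E| the number of edges, L the message-passing depth):

```latex
% time-and-space: L message-passing layers at every one of the T steps
\mathcal{O}\big(T \cdot L \cdot |E|\big)
% time-then-space: temporal encoding first, then message passing once
\mathcal{O}\big(T + L \cdot |E|\big)
```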

This is quite a good achievement, but it might not be enough when we have very large graphs or very long-range dependencies. So what else can be done? A possible solution could be to reduce the computational complexity by considering subgraphs of the full network. This can be done, for instance, by selecting a subset of target nodes and then considering the ego graph induced on this subset of nodes, up to a given order of the neighborhood, and then just doing the training on this subset. Another option could be to rewire the graph to reduce the number of edges; since we are now scaling not with the number of nodes but with the edges, we can apply some further sparsification there.

All these methods are actually borrowed from the static graph community; some examples are, for instance, GraphSAGE, DropEdge, and many others.

The problem is that the subsampling might break long-range dependencies: if we consider a very small k, then we might lose some important dependency, while if we use a large k we might end up with the original graph or something similar. Another disadvantage is that the learning signal may be noisy,

so we might have problems optimizing our network. Another option is to instead move part of the computation before training, so ahead of training. This helps us remove the burden of the heavy lifting from something that we would otherwise need to do repeatedly during training;

an example is represented by SIGN, which is an architecture that precomputes representations of the neighborhood of a node and then allows us to sample the obtained features as if they were i.i.d.

These works were done to enable scalability on large static graphs, and a possible extension to the spatiotemporal setting is given by

this other work, SGP, in which we also have a further precomputation over time: besides the precomputation of the node features, so these partial neighborhood features, we also have an encoding vector that tells us what the history of each node has been. This can be done through an echo state network, and it is actually done through an echo state network in this work, which is basically a deep RNN with randomized weights.

Then, after this encoding, we propagate these encodings over the graph using powers of a given graph shift operator, so a function of our adjacency matrix. And here we have an example of the architecture, just to have a visual representation of it: the time series go inside this big recurrent network, which gives as output a new encoding at each time step, and this is propagated over space in a non-trainable fashion.

Now we reduce the cost of the training step, which is now independent of the length of the window, the number of nodes, and the number of edges; the training step is now basically constant, because we can sample these features from this newly constructed dataset as if they were i.i.d. And the performance is not actually bad: it matches the state of the art.

But there are some downsides, the first one being the fact that we are now extracting more and more features, so the dimension of the vector which goes inside the MLP, the downstream network, is now much bigger than the initial time series. Also, it's more reliant on hyperparameter selection, as expected, because now many parts of the network are not trained. Finally,

another option to reduce the computational complexity is to use some coarse-grained representation of the input: instead of using the time series at the initial sampling rate in each step of the processing, we can reduce the resolution, both in time and in space actually. To do it in space we can rely, for instance, on graph pooling, which comprises techniques to reduce the size of the graph by associating a subset of nodes with a supernode in a new graph.

Now what we can do is reduce the number of operations needed to reach the same receptive field that, on the original graph, would require a deeper network; the downside is that we are introducing bottlenecks in the propagation of information.

So, for the first part, on this scalability challenge, that's it. Let's see if there are questions on this part; I'm happy to answer. I hope everything was clear. Okay,

so I will move to the second challenge, which is dealing with missing data, leaving just a couple of seconds in case there are some questions. Okay. So, until now

we assumed that we were dealing with complete sequences, so that at each time step and for each node we observe a valid value. This is, of course, not the case in real-world applications, where

we usually have missing data, due to different reasons: faults of the sensors, which can be transient or permanent, so we can actually lose a sensor at a certain point; asynchronicity among the time series, which results in missing values appearing in a non-synchronous way across the time series; or any error in general.

The problem is that most forecasting methods do not consider this issue; they assume they deal with complete sequences. So what we need now is a way to fill in these missing values, to reconstruct the missing data that we have in the input.

The problem of time series imputation is precisely the problem of estimating the missing observations in a sequence of values. Here we can see the same figure as before, but now some values are missing, and what we added is an auxiliary binary variable, this mask M, which denotes whether a value is missing or available. In the end, what we want to do is provide an estimate for all those values for which the mask was zero in the input.

There are different missing-data patterns; we provide a taxonomy based on this conditional distribution, so on the distribution of the mask conditioned on other values of the mask. The point-missing case is similar to the missing-completely-at-random case: here we are saying that the probability of a point being missing is the same across nodes and time steps, which basically amounts to associating a Bernoulli with the same constant parameter to each node and time step. This is something that usually models, let's say, the communication errors we might have in a remote sensing application.

In the block-missing case, instead, this distribution is not independent from missing data at other nodes or time steps, so we might have a block of missing data. In the case of temporal block missing, we might have, for instance, a fault that generates consecutive missing values; we might have spatial block missing if, for instance, we have a blackout in a region of the sensor network; or a combination of them.

When we have missing values, we need to make some adjustments when we optimize the parameters of our model. What we want is to compute the loss only on the valid observations that we have, so the loss function we are using, for instance the MSE, is weighted by the mask, just to say: let's take only the values that are really available in our problem. This loss is usually the one used for forecasting with missing values, or to compute a reconstruction loss on the data we have when instead we do imputation. Sometimes we might need to inject some missingness to train or evaluate our model, so we mark some of the observations we do have as missing, and in this way we obtain ground-truth labels.

And of course, these data cannot be used by the model to obtain the imputations.
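A minimal sketch of such a mask-weighted loss in plain PyTorch (the function name and tensor shapes are illustrative):

```python
import torch

def masked_mae(y_hat: torch.Tensor, y: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Mean absolute error computed only where mask == 1."""
    err = torch.abs(y_hat - y) * mask            # zero-out missing entries
    return err.sum() / mask.sum().clamp(min=1)   # average over valid points
```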

In the deep learning literature, there are different approaches to tackle this problem, and one of the prominent ones is to use autoregressive models, for instance RNNs.

In this case, what happens is basically that we process our input sequence, and as soon as we find a missing value we impute it using the prediction coming from the recurrent neural network, so using the observations we had before.

This is effective in exploiting all the past information that we have at a single node, and we can also account for future observations if we use, for instance, a bidirectional model.

But if you look at this process, what you can easily imagine is that if we have a temporal sequence of missing values, we are continuously providing forecasts for those values using predictions from the RNN, so we might have compounding of errors along the temporal block. Another downside is that this approach struggles to capture the nonlinear space-time dependencies that might be present in the data.

So again, what we can do is use the relational information that we have to condition the model: instead of having a separate model for each time series, or one for the entire set of time series, we can have a model that can be applied to any time series and that takes as input our graph, to specify the relationships and dependencies in the data.

An example of this is represented by GRIN, where, similarly to what we did with the graph convolutional RNN for forecasting, we integrate the graph processing into the autoregressive approach we have just seen.

In general, what we do in these approaches is to model this distribution, the probability of a datum given the entire sequence, factorizing it into three steps: the first two model the distribution conditioned on the information at previous and at subsequent time steps, which is what is typically done by a bidirectional RNN, the model we just saw before. What message passing allows us to do is to also consider related concurrent observations.

This is powerful, for instance, when we have long sequences of missing values at a given node but still have information at neighboring nodes.

Imputation is usually done as a preprocessing step, as I said before, for a downstream task, for instance forecasting. This is often necessary due to the nature of forecasting models, which expect complete sequences as input, so the usual pipeline is to impute the missing values and then proceed with the forecast afterwards; but this of course might introduce biases due to errors in the imputations.

Another use case is that of using an imputation model in place of a forecasting model: we can imagine having a longer sequence than the one we have, but full of missing values at the end. In this case we can essentially adopt imputation methods that are not meant for this. But this is of course a workaround, so the performance might be poor due to the absence of values: if the methods we use rely strongly on values, for instance at the boundaries,

this is not a good choice. What we can do instead is take a more direct approach, where we want to avoid this reconstruction step altogether and directly deal with irregular observations. In this case, what we do is have a forecasting STGNN which is tailored and designed to deal with sequences that might be incomplete.

Here we have several benefits, the first being that we can now learn directly how to leverage only the valid observations, so we don't need to impute or carry estimates of the missing values, and this is learned precisely for the downstream task at hand. Also, we avoid the computational burden that usually comes with imputation as a preprocessing step.

Besides imputation, another important topic that can be considered is that of virtual sensing, which is the practice of estimating unmeasured states using the data and the model that we have. If we have a set of nodes, we might be interested in knowing what the corresponding observation would be at a given node that we do not have in our data.

Here we can actually see the power of graphs in doing so, because we can exploit the relational dependencies to condition the estimates: this is what we do to condition the estimates on data that are close in some space, which in this case is the sensor space, the space where the time series live. And thanks to the inductive property of message passing, we can add new nodes and edges very easily, and this is also useful

in applications where sensing has a cost. An application we can see here is basically an example of taking methods designed just for imputation and instead applying them to virtual sensing: in this case we simply add a fictitious node to our data, without any observations, and we see what happens when we want to reconstruct, to infer, the corresponding time series; this is done with a graph imputation model.

The results are not bad, so we can still recover something, as we can see from the figure; but of course there are several assumptions needed: we need a high degree of homogeneity between the sensors, we need to assume that a given node can actually be reconstructed from its neighbors, and many more.

This concludes the part on the first two challenges. Before leaving the stage to Daniele, I'm here to answer any questions; so if you have a question on the scalability part or the missing data part, I'm happy to answer. Okay, I think that's it, so I will leave the floor to Daniele. Okay,

so hello everyone, I'm Daniele Zambon, despite the name that you probably see on screen, which is still Andrea Cini's.

So here we are; I'm now going to talk about this other problem, related to latent graph learning, which stems from the fact that all we have seen so far relies on having some relations that span or connect our time series. But what if such information is not available?

Of course, we can imagine scenarios where we have several time series but no relations are given to us; or, for example, we do have some relations but do not know all of them; or the information given to us is not trustworthy, so we cannot assume it to be reliable enough and we want to learn the relations from data. This possibility of learning relations from the data, from the time series that we have, is something that holds the potential to apply

all that we have seen so far also to other scenarios. Okay, sorry, this is just someone in the waiting room, I cannot admit them anyway;

so I will move on. Okay, so, I was saying: learning these types of relations from data. We have seen that we have some time series and we want to deploy or develop a model that is able to extract some information in different ways. What is central to this topic is the fact that, in order to really rely on graph neural networks, we expect the graph that we extract from data to be in some sense sparse, so that we can rely on all the advantages that we have seen in the previous presentations: low computation, scalability linear in the number of edges, and not having to scale quadratically with the number of nodes.

This can also somehow serve as a sort of regularizer for our attention mechanism, so that we can remove or drop several connections that we know, or find out, are not relevant to making our predictions better, and focus only on the most relevant ones. Lastly, to introduce this topic, I want to say that this graph construction process goes under different terms in the literature: we may find "graph structure learning" as the most used one, but other names can be appropriate as well. In this case I'm using "latent graph learning" because it's more informative for what we are doing here: we have time series from which we want to extract some relations that are, in some sense, behind or underneath the time series, and that condition the realizations of the time series that we are seeing.

The first approach, probably the simplest one, is: okay, we have a collection of time series, and we compute some sort of similarity between all pairs of time series; for example, we can take the Pearson correlation or Granger causality between the time series. In this sense we construct, from the time series, a matrix that stores all these similarities; and then, as I was saying, since we may really be interested in extracting something sparse, we may need to apply some sort of thresholding on top of this method. So first we decide what similarity should mean for us, and then, on the graph that has been extracted, we use our STGNNs,

or any graph-based model, actually; a small sketch of this recipe follows.
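A sketch of the similarity-plus-threshold approach (the array layout and the threshold value are illustrative assumptions):

```python
import numpy as np

def correlation_graph(x: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Build a sparse weighted graph from a [time, nodes] array."""
    corr = np.corrcoef(x.T)                              # [nodes, nodes] Pearson matrix
    adj = np.where(np.abs(corr) > threshold, corr, 0.0)  # keep strong links only
    np.fill_diagonal(adj, 0.0)                           # remove self-loops
    return adj
```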

Another approach instead assumes that this graph is not something that we compute on top of our time series, but is actually something that in some sense determined the realization of the time series that we have available. This typically relies on some sort of assumption, like that of signal smoothness: we assume that our signal is smooth, in the sense that time series that are more similar to each other are so because they are more related to each other, being generated by some sort of infrastructure underneath. By optimizing specific losses, like the one I'm showing here, which is basically the total variation of the signals that we have, we can try to recover the topology that,

in some sense, determined our observations. Of course, the optimization of this type of losses may require constraints: here we are optimizing with respect to L, on the left-hand side of the equation, or A, on the right-hand side. On the left-hand side, L has to be a Laplacian, so we need to constrain the matrix L that we are optimizing, or finding, to indeed be a valid Laplacian. This is actually also interesting because, in some sense, it allows us to enforce some sort of sparsity as well within this optimization problem.

This family of approaches is commonly derived within the framework of graph signal processing, which of course provides strong guarantees in this respect. A hedged reconstruction of such an objective is given below.
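One common form of such a smoothness objective (a reconstruction under stated assumptions; the exact regularizer Omega and constraints vary across methods):

```latex
% Total-variation / smoothness objective over the graph Laplacian L,
% given the signal matrix X; Omega(L) is a method-specific regularizer.
\min_{L \in \mathcal{L}} \; \operatorname{tr}\!\left(X^{\top} L X\right) + \alpha\, \Omega(L),
\qquad
\mathcal{L} = \left\{ L = L^{\top} \succeq 0 \;:\; L\mathbf{1} = \mathbf{0},\;\; L_{ij} \le 0 \;\, \forall\, i \neq j \right\}
```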

What we are more concerned with in this presentation is task-oriented latent graph learning, where we tackle this problem in a more, let me say, integrated way: we learn the relations so that our model, in the end, performs best at the task we are considering. So we optimize a downstream task, and possibly we want to learn everything end to end. This differs from the previous approaches, where we basically make a first step in which we learn the relations, and only then do we apply them along with the model that we are optimizing.

For this task-oriented latent graph learning we have two main approaches. The first one is more direct: our adjacency matrix, so the graph we are trying to learn, is modeled as a real N-by-N matrix, where N is the number of nodes, and we try to optimize it in order to, as I said, maximize the performance on the downstream task. The second one is a probabilistic approach where, instead of a deterministic matrix, we learn a random variable A distributed according to some distribution p_phi that we model in some way we will see shortly. Often this is done mainly to allow our random variable A, our graph, to be a discrete object: we often constrain p_phi to produce random variables A which are binary, so with entries in the set {0, 1}. Again, as I mentioned,

the key challenge here, in the deep learning optimization framework, is to retrieve A in a way that it is, in the end, a sparse object; this allows us not only to find the few relations that are very relevant to the task at hand, but also to keep our computation sparse and therefore efficient. This is particularly non-trivial in the techniques we mainly use nowadays, which are based on gradient-based optimization.

The direct approach, as I said, is the non-probabilistic one, as opposed to the other. Here our latent variable A is modeled as a function xi of some edge-score parameters Phi. As a starting point we can think of Phi as free parameters, so floating-point real numbers in an N-by-N matrix; these numbers are completely free parameters, but then it is the role of the function xi to transform these parameters into an adjacency matrix. This could be any function: for example, a sigmoid, if we want to maintain differentiability, or we may just threshold and say all positive numbers go to 1 and the negative ones go to zero. Another thing: I said that Phi could be a matrix of free parameters, but this is only one of the possibilities. It could indeed be an N-by-N matrix of parameters, but it could also itself be a function of other parameters, or, for example, a function of the input itself; in this case we represent in Phi explicitly the dependency on some input data x and some other parameters.

On top of this Phi we apply the xi function, which allows us to enforce different types of structures. Of course, the first one you are probably thinking of is making A a binary matrix, but that is not the only option: we can construct, for example, a K-NN graph, so that we design graphs with bounded degree, again, for example, for scalability reasons; or some other structure that is more interesting or relevant for the application, for example a tree, a directed acyclic graph, or any other structure, as long as we have a way to enforce it. A small sketch follows.
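A small sketch of the direct approach (all the names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

n_nodes = 50
phi = nn.Parameter(torch.randn(n_nodes, n_nodes))  # free edge scores

def xi_sigmoid(phi: torch.Tensor) -> torch.Tensor:
    # Differentiable, but dense: a continuous relaxation of the adjacency.
    return torch.sigmoid(phi)

def xi_threshold(phi: torch.Tensor) -> torch.Tensor:
    # Discrete and sparse, but gradients cannot flow through the cut edges.
    return (phi > 0).float()
```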

Another important and often-used trick to model these edge scores, intended to address the fact that they are in principle quadratic in the number of nodes, is to exploit a factorization of Phi. In this case we have two other matrices, Z_S for the source nodes and Z_T for the target nodes. We can interpret these two matrices, which are smaller in terms of number of parameters, as embeddings of the nodes; by means of these embeddings, and either directly a dot product or some function of the two, we can extract the parameters Phi, and then in the end xi again produces an adjacency matrix A. Again, Z_S and Z_T can be free parameters themselves, or once again functions of the input data.
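A sketch of the factorized variant, which reduces the parameters from O(N^2) to O(N*d) (sizes are illustrative):

```python
import torch
import torch.nn as nn

n_nodes, emb_size = 50, 16
z_s = nn.Parameter(torch.randn(n_nodes, emb_size))  # source-node embeddings
z_t = nn.Parameter(torch.randn(n_nodes, emb_size))  # target-node embeddings

phi = z_s @ z_t.T          # [n_nodes, n_nodes] edge scores from dot products
adj = torch.sigmoid(phi)   # xi as before, e.g. a sigmoid relaxation
```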

There are several advantages to this approach, mainly the fact that it is easy to implement; the possible parameterizations are virtually infinite, in the sense that all the layers available in our deep learning frameworks can be used; and because these are standard deep learning solutions, they are supposed to work immediately once applied within the optimization tools we have available. On the downside, since the free parameters are in the end real numbers, we usually end up with N-squared computational complexity, which, as you can imagine,

does not scale well if we have a large number of time series to consider altogether, especially during training, where we haven't yet reached a point where the graph can be considered a sparse entity but are still considering all possible connections that may exist; in that case the computation is quadratic in the number of nodes.

Indeed, one can leave the parameters at N-squared and then apply some sparsification on top of that, but that sparsification is actually cutting all the connections that are zero, removing them from the computational graph; in that sense the gradients cannot flow through those paths, and so it's harder to train this type of parameterization.

Finally, given the two shortcomings I just mentioned, this parameterization requires some extra care to be able to find what we are actually looking for.

The second family of models relies on a probabilistic approach, where, as I anticipated, we create a parametric distribution p_phi for our adjacency matrix A. There are different parameterizations for this p_phi; probably the most straightforward one is the one on the left, where we assume that all the edges of our graph are independent of each other, so we can model each edge as a Bernoulli. In this sense, A_ij, the component of the adjacency matrix relating to edge (i, j), is simply a Bernoulli, independent from all the others, whose parameter is given by the sigmoid of the free parameter, or edge score, phi_ij.

In the box on the right we have another family of distributions; as you can see, changing this distribution allows us to rely on certain assumptions or to force some structural prior on the graph we are optimizing. In this box we have a distribution that forces graphs with a fixed degree, in this case K: basically, for each node we have a list of edge scores, phi_i1 to phi_iN, and based on these we compute a softmax, which in some sense gives us the parameters of a categorical distribution; then from this categorical distribution we can sample K nodes without replacement, and this gives us the neighborhood of node i. Indeed, this is the case of sampling without replacement with exactly K nodes; extensions can be designed so that we rely on assumptions such as having at most K nodes. But yeah, these are just two examples, sketched below.
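A sketch of these two example distributions (the Gumbel-top-k trick used here for sampling without replacement is one standard way to implement the second):

```python
import torch

def sample_bernoulli(phi: torch.Tensor) -> torch.Tensor:
    """A_ij ~ Bernoulli(sigmoid(phi_ij)), independently per edge."""
    return torch.bernoulli(torch.sigmoid(phi))

def sample_fixed_degree(phi: torch.Tensor, k: int) -> torch.Tensor:
    """For each node i, sample k neighbors without replacement with
    probabilities softmax(phi_i), via the Gumbel-top-k trick."""
    gumbel = -torch.log(-torch.log(torch.rand_like(phi)))
    neighbors = torch.topk(phi + gumbel, k, dim=-1).indices
    adj = torch.zeros_like(phi)
    adj.scatter_(-1, neighbors, 1.0)
    return adj
```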

but yeah, so these are just two examples。Okay, clearly here I talked as if P were three parameters indeed all the parameterizations that we have seen for the direct methods such as the dependency on the input data or the possibility of a factorizing fee as the inner product or the similarity of node embeddings are still feasible and actually could be advisable in certain scenarios and here on the bottom right we also see a way to enforce。

Some dependence is on top of these parameters free from the input data so that we design a probability distribution that is conditioned on exogenous variables。

for example, or yeah the input x itself。So learning graph distribution may not be as simple as in the other case and so we should pay some attention of what we are doing here so basically we can construct several type of losses to optimize our distribution P of fee。

The first, which we see on the left, basically averages the losses of predictions based on, ideally, all possible realizations A from the distribution p_phi: we sample an adjacency matrix, we produce a prediction x-hat, and then we compare, for example with the mean squared error or the mean absolute error, that single prediction with respect to the actual observation, denoted here as x without the hat. Then we average all the losses we collect, and this is our loss. On the right-hand side we have a slightly different approach, where we basically swap the expectation and the loss: we make predictions for several A, take the average among all the predictions, and only then assess the discrepancy with the realization we have observed.

This seems similar, but just to let you know, there are some theoretical results regarding the right-hand-side losses; in general it would be more advisable to use those if you are really looking for a probabilistic model of your predictions. Here I have just taken the expectation, but it need not be the expectation; it could be other functions, like the median, and so on.

A more general approach, in the spirit of a probabilistic model, is probably to design a loss that is constructed as a discrepancy between the predictive distribution, p_theta(x-hat), so the distribution of our prediction x-hat as determined by our latent variable A, and the distribution of our observations. This delta, a discrepancy or divergence, could be the Kullback-Leibler divergence, the continuous ranked probability score, or an energy distance; there are several of them. This is just to say that there are different losses, but all of them share one crucial potential issue: if we want to optimize this directly with gradient-based optimization, we need to take the gradient with respect to these phi parameters, and these are exactly the same parameters over which we are integrating our loss, the ones that appear at the subscript of the expectation. There are, of course, analytical solutions for certain scenarios and certain losses, but many times these are infeasible. Monte Carlo approaches, which are also widely used, may not always be feasible either. Why? Because if we take a sample, for example in the top-left loss, if we take a sample A and then compute the loss and want to take the gradient with respect to phi, there is no connection in the computational graph towards phi. For these reasons, there are some strategies to design Monte Carlo estimators.

One approach relies on the reparameterization trick, which is basically a smart way to rewrite our distribution p_phi(A): what we are actually rewriting is A itself, as a function of our parameters phi and another component,

another random variable with its own distribution, which is sort of fixed, so that in the end our optimization goes through, inside the function g, towards the parameters phi, leaving the randomness inside the random variable epsilon. This in some sense decouples the randomness, originating from epsilon, from the parameters, which are the ones we want to learn. This is a smart choice, used very much in practice: in the end you can see that the gradient with respect to phi can be moved inside the expectation, which is now with respect to epsilon, and this time the loss function does depend on the parameters.

This is actually practical, and the only major downside that needs to be pointed out is that it needs to rely on a sort of continuous relaxation in order to deal with discrete or sparse adjacency matrices. And if we need a continuous relaxation, then the computation, at least during training, cannot be sparse. A sketch of such a relaxation is below.
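A sketch of such a relaxation for the Bernoulli case, the so-called binary concrete distribution: randomness enters only through a fixed logistic noise epsilon, so gradients can flow to phi (the function name and temperature value are illustrative):

```python
import torch

def relaxed_bernoulli(phi: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Pathwise-differentiable relaxation of Bernoulli(sigmoid(phi))."""
    u = torch.rand_like(phi)
    eps = torch.log(u) - torch.log1p(-u)             # logistic noise (the epsilon)
    return torch.sigmoid((phi + eps) / temperature)  # values in (0, 1), dense
```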

Conversely, another approach, which is the one we tackled with Andrea, relies on score-function gradient estimators. This type of estimator relies on a smart trick where we rewrite the loss on the left-hand side of this equation as the one on the right-hand side: while the expectation is still taken with respect to p_phi, the gradient is now shifted towards the right-hand side, so it is taken with respect to the log-likelihood of our graph, log p_phi(A). This is really convenient for us because it allows us to keep the computation sparse within the loss, or actually through our message-passing operators: A is just a sample, if you are already thinking about Monte Carlo approximations, a sample from p_phi, and it can remain discrete and sparse throughout the computation of our prediction x-hat and of the loss, whereas all the gradient part is deferred to the differentiation of the log-likelihood. The first evident disadvantage of these score-function gradient estimators, and it is evident if you ever try to use any of them, is that they often bring a lot of variance into our gradients, which in the end can slow down the training quite a bit.

However, for many applications there are variance reduction techniques; these can be general-purpose ones or more specific, tailored to the use case at hand, like in our case. But as already mentioned, the most interesting part is the fact that the computation can remain sparse.

As you can see, in both scenarios, so with the pathwise approximation using the reparameterization trick or with the score-function formulation, we can take samples either from epsilon or directly from our adjacency matrix A, and then take the gradient after having drawn one or more samples. A sketch of the score-function estimator is below.
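A sketch of the score-function (REINFORCE-style) estimator for the Bernoulli case; the surrogate-loss construction is a standard trick, and the helper names are illustrative:

```python
import torch

def score_function_loss(phi: torch.Tensor, loss_fn) -> torch.Tensor:
    """Surrogate whose gradient w.r.t. phi is the score-function estimate."""
    probs = torch.sigmoid(phi)
    adj = torch.bernoulli(probs).detach()        # discrete, sparse sample
    loss = loss_fn(adj)                          # forward pass stays sparse
    log_p = (adj * torch.log(probs + 1e-9)
             + (1 - adj) * torch.log(1 - probs + 1e-9)).sum()
    # loss.detach() * log_p carries the gradient to phi; adding loss keeps
    # the gradient w.r.t. any other model parameters inside loss_fn.
    return loss.detach() * log_p + loss
```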

This slide shows that there is indeed some computational advantage: here we are comparing the training time and GPU memory during training for methods based on the score-function gradient estimator against methods based on the reparameterization trick.

Because we are decoupling the loss from the likelihood, and the gradient is taken only with respect to the log-likelihood, the message-passing operations, which are the ones that greatly benefit from sparse graphs, are what allows us to keep the memory footprint under control, as well as the training time. You can see it easily: with a growing number of nodes, we quickly reach out-of-memory issues here. Yeah,

so, again, this is to stress that there are some setbacks in using these estimators, but there are also several solutions; you may need to devise your own solution for specific cases, but there are solutions for these high-variance setbacks.

Lastly, on these topics, what I wanted to stress is that we have used probabilistic models where the main purpose so far was just to obtain discrete or sparse graphs out of the learning process; but we still have a probabilistic model, so why don't we look at the edge probabilities that we learn, and see if these can actually be insightful for our application, extracting some relevant information out of them? Of course, this is not necessarily the case in all scenarios, because the training and learning process may not be well posed; but anyway, if we are able to obtain such edge probabilities, they indeed provide valuable insights for explainability purposes and, on top of that, in the end, also better-informed decision making.

One interesting issue in this setup is that graphs in this case are latent variables, and as such, in real data we don't have any observation of them. This makes it extremely hard to evaluate whether the edge probabilities that we extract from data through our end-to-end learning process are actually meaningful in any way. This is an inherent problem: it's really hard to think of a dataset where we have such ground truth, allowing us to assess whether the learned probabilities are actually calibrated to the true hidden ones. What we can do, of course, is visualize them and see if an expert can confirm that they are meaningful or reasonable; but what we decided to

study in more detail are learning guarantees that allow us to somehow bypass this problem. We study, from a theoretical perspective, whether we are able to learn the latent distribution in a way that is calibrated to an unknown model that generated the data.

What we find is that if we are able to minimize certain losses, losses that are equal to zero only if the two distributions we see here are identical, then what we are saying is that the distribution of the output of the data-generating process is perfectly matched by the distribution output by our predictive model. If this happens, what we can say is that, under certain assumptions, and this is not true by default nor in general, the probability distribution associated with the latent variable must also be equal to the true one that generated the observed data. While this is not true in general for graphs, the conditions under which these results hold are actually much milder, which is reassuring: operating with graphs in this way can indeed provide interesting and reliable edge probabilities.

So yeah, this concludes the part related to latent graph learning, so if you have any questions at this point,

I'm happy to take them. Otherwise, I will move on to the last challenge. So yeah, there is one question, I have two questions actually: assume we have a prior graph created by an expert in the field,

can latent graph learning create a better graph compared to such a prior graph? This is the first question.

The second question is: in case prior knowledge is required, for example for a physical system,

can we incorporate the prior graph knowledge into the learning pipeline? Okay, so yeah,

these are two extremely relevant questions. So regarding the first one,

whether they relate or not to each other depends, first of all,

on the approach that you followed while extracting this graph. So, for example,

if you are computing this graph as the Pearson correlation of the time series, what you have is the Pearson correlation between the time series, so a measure, or estimate, of the linear correlation of all pairs of time series. If instead you are following the latter approach based on task-oriented optimization,

so basically optimizing the downstream task, what we will obtain is simply the best graph, at least according to the optimization procedure that we devised to optimize the downstream performance,

so in this sense it is what Andrea was referring to earlier, in the sense that it is

really oriented to optimizing the final performance of the model. So,

depending on the graph that you are providing or considering as a prior,

this may or may not match. Typically you can reinterpret the extracted relations as answering the question: would this relation

help me make a better prediction for that time series or not?

And then, by enforcing some sort of sparsity prior, where we try to keep the degree of each node small,

we are trying to promote solutions that look only at the most relevant relations. Yeah,

so this is for the first question. Regarding the second one, in case prior knowledge is required, can we incorporate it: yeah, so there are different ways, so the short answer is yes,

as long as you can design a probability distribution, for example P(A), that reflects that prior.

This could be, for example: if you know that you have several physical sensors deployed in an environment, and you know that sensors too far apart in this physical environment have no relation at all,

then you can enforce it in your probability distribution by simply setting those parameters to zero.
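A minimal sketch of this idea, with hypothetical names (a `coords` tensor of sensor positions and a `max_dist` threshold are assumptions for illustration, not part of the talk):

```python
import torch

def masked_edge_probs(theta, coords, max_dist=10.0):
    """Zero out edge probabilities between sensors that are physically too far apart."""
    dist = torch.cdist(coords, coords)        # pairwise physical distances, shape (N, N)
    allowed = (dist <= max_dist).float()      # prior: only nearby sensors may interact
    probs = torch.sigmoid(theta) * allowed    # disallowed edges get probability exactly 0
    return probs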

This of course extends a little bit beyond that. For example,

if you know that the graph you are looking for should be a sort of tree,

or tree-like, then this is trickier to parameterize, but in principle it is still possible. So, for example,

in our case of k-NN graphs, ours and also others', it is not always easy to do that; or rather, it is easy to design but not necessarily easy to optimize.

For example, for score-function gradient estimators, what you also have to do is take the gradient of the log-likelihood, okay, but the log-likelihood of a random extraction of k nodes without replacement is not so easy to write down, and so not easy to optimize analytically, so in our case we relied on smart optimization from other people who found a numerical approximation of that, and it turned out to work well.

But so, depending on the prior that you want to enforce, this can lead to easier or harder solutions,

or parameterizations. Okay. So, oh, thank you. So okay, I think that if there are no other questions,

I can move on to the last part, which is model quality assessment。

This last part addresses the point that, okay, we have all these models, we have several of them.

Okay, so we have all these models, we have our raw data,

we have our predictions, and now the last question, of course, is always: okay, but how good is my model? Indeed, our losses already provide us with a metric of goodness of fit, but it's interesting to see that as the number of time series grows, and the length of these time series grows,

then we may also have intricate interdependencies among these time series, which can make a single number out of our training process a little bit restrictive for understanding what is going on and whether everything is working as expected. So beyond asking whether our model is good,

we also want to ask whether our model is optimal, and whether there are regions or aspects, for example the temporal processing

or the spatial processing, that would benefit from some further design improvements. And lastly,

of course, the question is: okay, now that I have identified where the model may need improvements, I need solutions or guidelines to act on, in order to make these improvements. Indeed, here I spoke about optimality;

optimality in general depends on the criteria that you choose; it could be, for example, as I said,

optimizing a predictive loss like mean absolute error, but it need not be like that.

So we will see why and how relational information, also within the task of assessing the quality of our models, can be relevant, or actually the solution that makes this process feasible. So a typical scenario that we are all facing is that we decide whether model f_A is better than model f_B based on the performance that we see on the test set, and then we say that model f_A is better than model f_B if the performance that we see from one model is statistically better than the other, and we say that a model f_A is, within a family of models at least, the optimal one if there is no other model that is better than it. So this is pretty standard, and what we may encounter often is that, okay, say that I have this model f,

I may consider it good enough, or actually I may say it seems to be working appropriately,

but I hope to make a better model. How can I say whether this model is the optimal one, or whether I should move forward and keep working on it in order to improve it and find a better model? So this is a question that is hard to answer, because either we are still in the process of changing our models and finding new ones until we find an actually better one, but if we don't find it, we still don't know whether we should keep looking or not, unless we have, of course, some prior knowledge of, for example, the level of noise in our observed data, which somehow sets the baseline that we want to reach.

Another interesting approach to model quality assessment, and this is why I introduced the previous slide, is that of studying correlations among the time series. Studying correlations is a way to understand whether, in our predictions, or actually in the data that we are trying to model, there is information that we were not able to extract from the data. In fact, consider here the prediction residuals, so the difference between our prediction y-hat and the true, actual observation from the monitored environment. This difference allows us to better understand whether our model is optimal or not. In particular, if we see that there are some dependencies among the residuals, it means that there is structural information left in the data, in our residuals, that our

model was not able to extract, so in this sense there is a margin for improving our model even further, because there is structural information left in that data.

Residual correlation analysis, in that sense, is independent of the type of performance metric that we are interested in.

Of course, it does not tell us how much we can improve; it just says, look, there is some data that is correlated, so probably you could make it better, but it doesn't tell you how much. But in the end this is also interesting because it doesn't need to be a comparative analysis: I don't need two models to say, okay, looking at those two, which one is better; it's an absolute answer in some sense.

So here we're talking about correlations. Most of the research so far has focused on serial correlation,

so correlation along the temporal dimension, but there are also works that study correlation in the spatial dimension. Here I'm presenting an approach that is spatiotemporal,

so it addresses space and time at the same time, and this work basically proceeds as follows: from our data,

we have, at every time step, our graph and our observations, so we can construct a spatiotemporal graph by connecting the nodes not only along the spatial dimension but also along the temporal dimension, and we look at a simple statistic of correlation, the sign function that you are seeing here, so the sign of the inner product of the residuals.

This is probably the simplest way to check whether there is a direct or indirect correlation between two residuals, so by averaging these signs we are designing

a pretty simple and straightforward test that allows us to understand whether there is some correlation left in the data.

Well, these are split into two parts, because one addresses spatial correlation, the one in red, and the other addresses only temporal correlation.

So here we have these w's, which are the weights, which are basically non-zero for all the edges of the graph and can encode a sort of influence, or importance, of the different edges.

So the most important part here is that while studying correlation in principle we should look at all possible correlations existing among the residuals, whose number grows quadratically both in time and in the number of nodes, here we are only focusing on the most relevant ones, which are

all the possible connections that are more likely to lead to correlation,

so this is why, again, graphs allow us to design statistical tests that remain statistically powerful even though the data dimensionality grows a lot. In the end, thanks to the fact that we use a sign, this leads to a distribution-free statistical test, and it also scales linearly in the number of nodes. Okay, so this is a test that can be applied globally at the graph level, considering space and time altogether, or it can be split, looking only at the spatial or the temporal part at once.
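As a very rough, simplified sketch of such a sign statistic (not the exact test from the paper; `residuals` and `edge_index` are hypothetical placeholders, and the weights w are taken uniform):

```python
import torch

def spatiotemporal_sign_statistic(residuals, edge_index):
    """residuals: (T, N) prediction residuals; edge_index: (2, E) spatial edges."""
    src, dst = edge_index
    # spatiotemporal term: residual at node i, time t, vs its neighbor j at time t+1
    st = torch.sign(residuals[:-1, src] * residuals[1:, dst]).mean()
    # purely temporal term: residual at node i, time t, vs the same node at time t+1
    tmp = torch.sign(residuals[:-1] * residuals[1:]).mean()
    return st, tmp  # values far from 0 hint at correlation left in the residuals
```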

The test can also go into finer detail by looking at single nodes, or for example single time steps, or even be more localized, both in space and time.

So this actually concludes the parts where we addressed challenges that have not been fully resolved, but on which substantial research has been carried out, and now I will move on very quickly to future directions that we see as interesting and relevant for future research,

so the first and foremost is probably this one about hierarchical modeling, where, as we have seen so far, the STGNNs, or the models that we have considered so far, look at the data as they are, so with the same fine-grained scale with which the data comes,

but we could also consider higher-order dependencies by creating hierarchical structures, like pooling of nodes, or even along the temporal dimension.

This allows us to consider coarser-grained scales,

but also, at the same time, to reach information farther apart within our data. And in the end,

from an application standpoint, it is also important in certain scenarios to produce forecasts at an aggregated level.

Think, for example, of power load forecasting: we may not be interested in single-household predictions,

but rather at a more aggregated level, and in the end this also allows us, as mentioned here, to reconcile all these different predictions made at different levels in order to build a more reliable predictor.

Another interesting approach is that of considering state-space models.

These models have the intrinsic advantage of being in some sense Markovian, in the sense that observations, or predictions, are made only from the current state of our system, and this state is updated from the previous state at every time step and could possibly also be driven by an input graph.

So this decoupling of all three dimensions, inputs, states, and outputs, allows us to have different types of graphs with different relations, which can allow us, either in the former setup of hierarchical modeling or for other reasons, to have different graphs for different purposes, and this also makes it possible to make predictions only when needed.

If we need to, we can prolong the state updates, and if we observe, for example,

a certain point observation from the modeled system, so a true observation of Y at time t,

we can also feed it back, in the same way as a Kalman filter operates for linear systems:

we can feed this information back into our system, update and improve our state estimate, and propagate forward with an improved estimate of the system.

Another relevant scenario is that of inductive learning, where several issues can occur in our monitored environment:

for example, we can see changes in topology, we can see new nodes being added to our network, for which, for example,

we don't have node embeddings yet, so we need to make some

adaptation in order to apply our model to these new settings, or even transfer the entire model to a new network.

This is useful not only for forecasting, but for all sorts of applications related to this type of data, and of course,

as we change the data that we are dealing with, this can also result in performance degradation, which is what we are trying to keep under control.

So lastly, benchmarking is another topic that needs to be addressed in the future,

at least according to us. We have some large, open datasets;

these span mainly energy and traffic flows, where this type of data, and this type of models, have been designed for since the beginning,

but of course this does not yet provide a comprehensive benchmarking environment,

at least in my opinion. We also have some software, like Torch Spatiotemporal, which Andrea and Ivan have designed, but also BasicTS, which are useful resources to try to standardize model implementation, evaluation, cross-validation, and so on and so forth,

but yeah, we are not close to what, for example, the Open Graph Benchmark is now.

So, all in all, to conclude: what we have seen is a framework for modeling time series

that combines deep learning for time series and deep learning for graphs.

This combination of the two has turned out to be extremely fruitful in several applications;

it doesn't always have to be like that, but we have seen that the inclusion of relational biases in our models was really a game changer in several tasks.

And this not only allows us to improve performance, but also, at a finer level,

we have the possibility to share parameters, so we get a better ratio between training data and model complexity, and we can also overcome issues that may be present in the data,

for example missing data or other irregularities in our data. Finally,

I feel, or we feel, safe to suggest global-local models as a good starting point for modeling applications

producing data similar to the ones that we have described to you so far.

Of course, we have discussed the challenges. I also want, lastly, to point out our tutorial paper,

which is the one written here at the bottom, and also the Torch Spatiotemporal library. I hope you find both of them useful; we are pretty proud of both, and if you have any feedback on them, please write to us. So I hope, in the end, to have

triggered some interest in these topics, in all these models and also these solutions.

I think there is huge room for exchange of ideas with related fields, so please reach out to us anytime if you have ideas that you want to talk to us about, and here I leave you with a link to our group,

the Graph Machine Learning Group in Lugano, Switzerland, where you can also find a list of publications,

all the materials of this tutorial, and of course also our contact information.

So at last, don't forget to fill in the feedback form for the tutorial, which you can find on the Slack, and probably we can also post it here in the chat. So thank you everyone; if you still have any questions, please ask, I'm happy to address them

the best I can. Oh, Sandra, over to you. Okay, yes, so the feedback form is in the chat right now.

There's a question that came up in the Q&A if you want to take a look. On the Q&A or on Slack, you mean?

On the Zoom Q&A, I can read it out loud and paste it in the chat。

The question is in scenarios where multiple time series have irregular time intervals。

some occurring more frequently than others, what are the recommended approaches for data representation and graph neural network architecture to effectively learn interdependencies among these time series?

😊, Okay, do you want to answer? Okay, so it very much depends, maybe you have other takes on it, but it very much depends on the scale,

if you will, that you require your predictions to be at. For example,

if you really need fine-grained predictions, then it's probably easier to create some sort of upsampling of the less frequent time series,

but this need not be the case, because we have seen, for example, when you're trying to combine,

say, covariates or exogenous variables, like those coming from the weather, that you want to encode in a traffic prediction model,

maybe the hour is relevant, to identify peak hours, but maybe the minutes are not so relevant, so in the end what you bring in is just an upsampling of that information, and you use it to make your predictions. In that sense it is not necessarily a big problem. Of course,

if each time series is less regular, so it produces observations asynchronously or at very irregular time steps, then there are models that are probably more appropriate to address them. They could be, for example, continuous-time models, which try to address this irregularity, so they take data as they come and then propagate them

along the temporal dimension, and they don't need such regular information coming in all the time. This is, I'm assuming, when the time series are not at the same sampling rate, right? Yeah, yeah, I pretty much agree. So, like, if you have some preprocessing that you can do to deal with irregularity, and this preprocessing can be something as complex as fitting some imputation model or some rescaling, or as simple as an upsampling if it's just a matter of having different frequencies for the observations, then it's fine. If it's something more structural, where the observations are by nature intermittent or something like that, then continuous-time models are the way to go. Yeah, for example, in traffic forecasting, the data that we have is actually

the passage of a car under a certain sensor, I don't know what to call it, and so you have, yeah, in principle asynchronous observations,

but in the end the data that you have available collapses the information within, I would say, five-minute windows, yeah, five-minute windows, and then they count how many cars have passed through that lane within five minutes. So in the end this is another way to actually downsample the data that you have. It depends a lot on the

application, I would say, but on the other end of the spectrum,

if we are trying to deal with a scenario more similar to temporal graphs,

then probably those techniques are not the most appropriate,

and then continuous-time models are probably the way to go. Awesome, there's also one more question,

but I'm not sure if it has already been answered, so I put it in the chat again. They ask: when can it be beneficial to learn an adjacency matrix that depends on time as well,

for example where we drop or create new edges between time steps, and are there works that do that? Okay, yeah,

I think, like, the NRI paper, the neural relational inference paper, that was cited in the slides and was the graph learning approach based on the reparameterization trick: in that one, the graph that you are learning is conditioned on each window of observations, so it is dynamic, like the actual output of the model is conditioned on the observations at specific time steps.

Regarding when this can be beneficial, again, it's really problem dependent:

if you have a physical system where, like, the position of the sensors,

for instance, changes over time, then having something like that would be really useful; I mean, that is the only case where it would make sense to consider learning this kind of dependency. If instead what you are modeling is something more static, then it might be okay to just learn something which is static.

Another, I think, important dimension of this is how fast these changes in the topology are happening, because if changes are happening slowly over time,

then what you could do is something similar to continual learning, where the graph that you learn is static,

but you also have some mechanism to have it adapt, so that after a while

you can check whether that kind of topology is still valid,

still giving you adequate performance, and if not, you can update it. Yeah.

So, any final questions from the audience? All right, if not, thank you all for the fantastic tutorial; we really appreciate the time and care you've clearly put into making this tutorial so accessible and informative.

Please feel free, everyone, to continue discussions in the tutorial discussion channel on Slack or by directly contacting the organizers. As a reminder, the tutorial feedback form is linked in the chat; we really appreciate any feedback that you may have, and we will also share this feedback with the organizers themselves.

😊, So that brings us to the end of this first part of today's conference. Please join us later for a keynote from Xavier Bresson and oral presentations. Thank you very much, have a great rest of your day.

Thank you, thank you, thank you all, bye.

Graph Machine Learning Conference: P05: Deep Learning for Spatiotemporal Graphs Tutorial - Part 1

In this session, we will learn how to use graph neural networks to process correlated time series data. We will start with the problem definition, introduce the concepts of global versus local models, and then dive into how to build spatiotemporal graph neural networks.

Overview: Processing Correlated Time Series

The core problem we face is processing a collection of interrelated time series. Traditional deep learning approaches typically adopt a global model, i.e., a single model trained on many time series from different sources. This is similar to what people now call "foundation models", though the scope here may be more limited, e.g., restricted to time series from a specific domain.

The advantage of this approach is its sample efficiency, which allows us to build more complex model architectures to process the input time series.

However, a shared drawback of the two standard implementations above is that they both ignore the dependencies that may exist between the time series. Besides the graph-based representations we will discuss later, the literature offers a few other ways to handle this.

A simple and direct approach is to treat the whole input collection as one very large multivariate time series. This clearly has severe scalability problems, since it suffers from the curse of dimensionality, leading to high sample complexity and poor computational scalability.

We can instead consider models that operate on the collection of time series while keeping model parameters shared across them. Attention-based architectures (such as Transformers) are one example; in that case attention is computed over the spatial dimension rather than along the time axis. This can work well, but the drawback is that we do not exploit any prior knowledge about the dependency structure and its sparsity.

Other approaches in the literature rely, for example, on dimensionality reduction. The idea is to extract some shared latent factors from this large collection of time series and use them to condition a global model. If the data has a low-rank character, this can work well in some applications. Its drawback is that we lose the local, fine-grained information that graph-based methods capture well. Of course, with very large amounts of data these methods can also face the same scalability issues as the others.

Graph-Based Representations

Now we can move on to these graph-based representations. As mentioned, the idea is to use a graph to represent the functional dependencies between the time series, and to use this graph as an inductive bias for the models we learn.

We can model these dependencies with an adjacency matrix. This adjacency matrix can be asymmetric, and it can be dynamic, i.e., it can change over time T. Besides the adjacency matrix, we can also have edge attributes, which may themselves be dynamic and can be categorical or numerical.

Taking traffic as an example, the structure of the adjacency matrix can be extracted from the structure of the road network, and we can associate attributes or weights with the edges, for instance encoding road distances. In this setting a dynamic topology can help us account for changes in the structure of the traffic network.

The figure below summarizes all the information available at each time step: the target time series, exogenous variables that can be dynamic or static, and the relational information we have just added to the setup.

The idea now is to use this relational side information to condition our predictor (i.e., the forecasting architecture). These relations can act as a form of regularization, localizing the predictions with respect to each node. In particular, they can be used to remove spurious correlations that could arise from not accounting for this structure. Moreover, these methods are more scalable than standard multivariate models, because we can keep model parameters shared across the time series being processed. In fact, we can use this kind of architecture to forecast and process any subset of the correlated time series.

The graph neural networks developed specifically for this kind of data are called spatiotemporal graph neural networks, meaning that propagation in the model happens across both time and space. We will focus on models based on the message-passing framework.

We will do this by considering the following template architecture: it consists of an encoder that encodes the observations at each time step and each node independently; the encoder is followed by a stack of spatiotemporal message-passing layers, the only components of the architecture where temporal and spatial propagation happens; the representations extracted by the spatiotemporal message-passing blocks can then be mapped to predictions by a decoding block.

We can look at this process in more detail. The encoder operates at the level of a single time step and a single node. The resulting sequences of representations are then processed by the spatiotemporal message-passing layers. The decoder again operates at the level of single nodes and time steps, mapping these representations into predictions.

The spatiotemporal message-passing block can be implemented in many ways. We can think of it as a generalization of a standard message-passing layer, with the difference that here each node is associated not with a static representation but with a sequence of representations. We therefore need to modify the standard operators that make up a message-passing architecture so that they can operate on sequences.

Clearly there are many possible designs to achieve this, and they can be matched to the requirements of each specific application. The next step is to describe the possible design paradigms for these spatiotemporal blocks, but before moving on, this is a good moment for questions or clarifications.

Design Paradigms for Spatiotemporal Blocks

Regarding the different designs, we can distinguish several families of models.

  • Time-and-space (fused) models: in these models the temporal and spatial processing cannot be factorized into separate steps; they happen jointly.
  • Time-then-space models: as the name suggests, these factorize the processing of the temporal and spatial dimensions into two separate steps.
  • Space-then-time models: essentially the reverse order of the time-then-space models.

For the fused time-and-space models, as mentioned before, the way representations propagate through time and space means that the processing of the resulting architecture cannot be factorized into separate stages. There are several ways to implement such architectures.

Here are a few standard examples of implementing fused time-and-space models.

  • Integrating message passing into a sequence-modeling architecture: for example, we can start from a standard GRU cell, where each time series is processed independently. By implementing each gate of the recurrent cell with message passing, we easily obtain a spatiotemporal GNN version of it. These models are known as graph convolutional recurrent neural networks. A very popular architecture is the Diffusion Convolutional Recurrent Neural Network, where the message-passing operator is implemented as a bidirectional diffusion convolution, the idea being to treat incoming and outgoing edges differently.
  • Spatiotemporal convolutional networks: here we simply alternate temporal convolutional filters with spatial convolutions on the graph. Basically, a standard temporal convolution (any 1D convolutional filter) is applied to each node separately, followed by one step of message passing. Stacking several such layers yields an architecture whose receptive field grows with each layer in both time and space. STGCN was the first such architecture introduced in the literature. Another model is Graph WaveNet, which uses more advanced convolutional operators (including dilated convolutions) to increase the receptive field of the model.
  • Computing messages with sequence-modeling operators: a simple example is to compute messages with a TCN. Basically, the time series (sequences of observations) of two adjacent nodes are concatenated, and a convolutional filter is applied to the resulting sequence. Clearly you can use any sequence-modeling operator you like. For instance, there are many such architectures using attention operators; in that case it is cross-attention, since attention is computed between the sequences of two different nodes.

We also have product-graph representations. These stem from a simple observation: the temporal dimension can be seen as a line graph, which can be combined in some way with the spatial graph. The resulting representation is then processed, again, with a standard message-passing network. For example, one can consider the standard Cartesian product, where each node at every time step is connected to its neighbors and to itself at the previous time step. There are other ways to wire this graph, such as the Kronecker product, where the idea is to connect each node to its neighbors at the previous time step. Many other ways of connecting this graph can be devised; the sketch below illustrates the two products just mentioned.
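A small illustrative sketch of the two product graphs (the toy spatial and temporal graphs are made up purely for illustration):

```python
import numpy as np

def cartesian_product_adj(A_s, A_t):
    """Each (t, i) connects to spatial neighbors (t, j) and to itself at neighboring times (t', i)."""
    N, T = A_s.shape[0], A_t.shape[0]
    return np.kron(np.eye(T), A_s) + np.kron(A_t, np.eye(N))

def kronecker_product_adj(A_s, A_t):
    """Each (t, i) connects to spatial neighbors at neighboring time steps (t', j)."""
    return np.kron(A_t, A_s)

# Example: a 3-step temporal line graph and a 3-node fully connected spatial graph
A_t = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A_s = np.ones((3, 3)) - np.eye(3)
A_prod = cartesian_product_adj(A_s, A_t)   # (T*N) x (T*N) adjacency for message passing
```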

Next, the time-then-space approach. This approach is easy to understand and leads to simple models. The idea is to first embed each sequence separately into a vector representation and then perform message passing on the resulting graph. The advantages are ease of implementation and computational efficiency, since at training time message passing does not have to be performed at every time step, only at the last one. We can also reuse all the known operators for processing sequences and graphs. The drawback is that this two-step process can introduce a severe information bottleneck, and since message passing is performed only on a single graph representation, it is harder to account for changes in topology and for dynamic edge attributes. A minimal sketch of such a model follows below.
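A minimal sketch of a time-then-space model, assuming PyTorch and PyTorch Geometric are available (layer choices and names are illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # assumed available

class TimeThenSpace(nn.Module):
    def __init__(self, in_dim, hidden_dim, horizon):
        super().__init__()
        self.temporal = nn.GRU(in_dim, hidden_dim, batch_first=True)  # per-node sequence encoder
        self.spatial = GCNConv(hidden_dim, hidden_dim)                # single message-passing step
        self.decoder = nn.Linear(hidden_dim, horizon)

    def forward(self, x, edge_index):
        # x: (num_nodes, window, in_dim), one window of observations per node
        _, h = self.temporal(x)                       # h: (1, num_nodes, hidden_dim)
        h = h.squeeze(0)
        h = torch.relu(self.spatial(h, edge_index))   # message passing only on the final embeddings
        return self.decoder(h)                        # (num_nodes, horizon) forecasts
```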

Then there is the space-then-time approach, which is basically the other order: message passing is performed at each time step separately, and the sequence of representations is then encoded with a sequence model. These methods are often used in the literature, but they still suffer from this bottleneck and factorized processing, which can make information propagation harder, and they do not have the computational advantage of time-then-space models, since message passing has to be computed at every time step.

Globality and Locality

Globality and locality play an important role throughout this presentation. STGNNs, at least as implemented with the template we discussed, are global models, so they can process arbitrary collections and are inductive models. They can exploit the dependency structure to further condition the predictions.

However, this extra conditioning may not be enough, and the model may struggle to capture all the heterogeneous dynamics that may be present in the collection of time series. As a result, the model may require very long observation windows and very large model capacity to account for all these different, heterogeneous dynamics.

One workaround followed in the literature is to consider hybrid global-local architectures. A straightforward way to implement a hybrid architecture is, for example, to go back to our template architecture and make some components local. For instance, the encoder and decoder parameters can be made specific to the time series being processed. The resulting model will capture these local effects more easily. The drawback, of course, is that if you want to use this approach to process many time series, it leads to a large number of local parameters.

One way to mitigate this problem is simply to consider node embeddings, i.e., learnable parameter vectors that can be associated with each node. You can think of these node embeddings as a kind of learnable positional encoding, except that here they do not really encode the position of the node in the graph; rather, they are responsible for modeling the specific characteristics of each time series. These learnable vectors can be fed to the encoder and the decoder; they amortize the learning of these local, time-series-specific processing blocks and allow most of the model parameters to remain shared. Of course, we still have the drawback that the number of learnable parameters scales linearly with the number of time series we want to process. One can therefore consider compromises, such as learning embeddings for clusters of time series instead of for each individual series; a minimal sketch is given below.
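A minimal sketch of this idea (all names are illustrative assumptions; only the embedding table is local, the rest stays shared across series):

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, num_nodes, in_dim, emb_dim, hidden_dim):
        super().__init__()
        self.node_emb = nn.Embedding(num_nodes, emb_dim)          # local, per-series parameters
        self.encoder = nn.Linear(in_dim + emb_dim, hidden_dim)    # shared (global) weights

    def forward(self, x, node_ids):
        # x: (num_nodes, in_dim) observations at one time step
        e = self.node_emb(node_ids)                                # (num_nodes, emb_dim)
        return torch.relu(self.encoder(torch.cat([x, e], dim=-1)))
```

In a transfer-learning setup, one would keep `encoder` frozen (the global part) and fine-tune only `node_emb` on the new network.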

Another important issue to consider is that when we move to hybrid architectures, the resulting model is no longer inductive. This sounds like a drawback, but in transfer learning scenarios it also brings benefits. In particular, having a hybrid architecture allows keeping the shared parameters of the model (the global part) shared, while fine-tuning only the local part (e.g., the embeddings). The literature shows that regularizing these node embeddings can further facilitate this transfer learning process.

Empirical Results

We can look at some empirical results. Below are some very popular benchmarks for correlated time series forecasting. At the top of the table we have some baseline reference architectures developed with the template we just saw, in particular a GCRNN (basically a fused time-and-space model) and different implementations of time-then-space models. Then we also have three architectures from the state of the art.

We can see that when we add these local embeddings (i.e., when even the baseline models become hybrid, introducing components that model the characteristics of each time series individually), performance improves significantly, to the point that with these very simple architectures you can, in many cases, match the performance of more complex state-of-the-art models with many more parameters.

Then we can look at some transfer learning results. Basically, we took data from four different traffic networks, trained a model on three of them, and then performed the transfer step on the fourth. The first observation is that the fine-tuned model (the model after transfer learning) performs much better than the zero-shot inductive model, which is somewhat expected. Another very interesting phenomenon is that the hybrid architectures in which only the local components are fine-tuned perform far better than the fine-tuned global model. Moreover, the variational and clustering regularizations studied in the literature (the papers cited here) can further improve transfer performance.

Summary

In this session we learned how to deal with correlated time series. We formalized the problem of processing correlated time series and showed how graph representations can be used to model the dependencies between them. We discussed the forecasting problem and the important distinction between global and local deep learning models for time series. Finally, we explored approaches for building spatiotemporal graph neural networks and the trade-offs involved.

Before discussing the challenges, we will look at software implementations of the above. Next comes a tutorial on a library we developed. Before that, we can take some questions and a short five-minute break.

图机器学习会议 | Learning On Graphs Conference 2024 p06 P06_Day_3__Part_2_2-Xavier_Bresson_keynote__oral_presentations -BV1k9pAzpE8S_p6-

So they also have limited logical reasoning. For example, there is Terence Tao,

who is a very strong mathematician, and he tried, you know, the latest version of GPT and said basically, oh, they are actually very limited; again, you need to give them a lot of precise prompts to make it work. So they have limited logical reasoning, so what OpenAI tried to do recently is to improve that by learning, you know,

chain of thought, for example, and also, for inference, running search algorithms, so this improves this limitation,

but it's still not there. There are also limited graph understanding capabilities,

even, again, if they have seen the test set of the graph task during training. For GNNs,

I think the strength is basically this: if you have a graph like this and you have a question,

for example, "is the Mona Lisa in the same city as Alice's friend Bob", so you have here the Mona Lisa,

you have the city, you have Alice, you have Bob, so,

😊,

if you do multiple layers of GNN, what you will do is basically learn a multi-hop path that will go through the solution of your task, so they are very good at doing that, and they are very effective for many different modalities. You know, now we're talking about text, but they are also very good for, for example, physics, biology, combinatorial optimization, and also chemistry, and we know that this was a very good year for AI in chemistry: there was a Nobel Prize in chemistry for AlphaFold, and AlphaFold, as you know, is essentially a transformer predicting the pairwise distances between residues in amino acid sequences, so this is basically a graph neural network. The limitations for GNNs are basically that they lack a graph foundation model at the scale of natural language processing and computer vision; of course the community is working a lot on that, it's very interesting to push in this direction, but,

you know, there is this emergent property, which basically means that we need to go beyond a certain scale of training data and compute to get something very powerful, so we are not there yet. The problem is basically, you know, the datasets: we don't have large datasets available, OGB is still comparatively small compared to ImageNet, which has 150 gigabytes of images;

the hardware for running sparse linear algebra is not optimized, it's much slower than standard dense operations; existing pre-trained GNNs, basically because of that, are small, they are not

billions of parameters, this is basically millions of parameters. And I think also something which is a limitation today is that industry has not yet found some interesting application of GNNs, because industry really drives, you know, AI research and AI products. Why do we have GPT today? Because industry got interested in deep learning and then also developed, you know, products. So it's not yet clear, you know, how to make profitable stuff out of GNNs. It will come, I think, but it's not there yet.

So combining an LLM and GNNs basically means developing a joint training, a joint text and graph foundation model. This is of course a very attractive idea,

a very promising idea, but today I think the issue is that there is a very huge imbalance between the knowledge coming from text, from LLMs, and the knowledge coming from GNNs. So of course what we would like to do is some kind of architecture where we have the text, then the text will go through an LLM, it will process it, it will give us some vectors; the same also for the graph, it will go to the GNN, it will process it, it will give us some vectors, and then the vectors will be,

😊, sorry, will be processed together with self-attention or cross-attention, and then, for example,

we do text generation. So the fact that, you know, we have a huge difference between these two domains,

I think this is very challenging. So what it means is basically that we need to tailor, you know, the combination of LLMs and GNNs to get some value out of it.

So, for example, what we can do is use the LLM, using the vast knowledge of the LLM, and then try to improve the performance on small-scale text-attributed

graphs, and we can also do the reverse: you know, we can use a knowledge graph, for example,

to constrain the LLM to give more precise responses, okay, so reducing hallucinations this way.

So this is what we will do next: we will show you two works that we have done, and this is really focusing on text reasoning tasks. In the first work we use LLMs to enhance GNN reasoning, so this is a work that basically takes the LLM reasoning abilities to improve GNN predictions,

😊, and it's actually pretty effective and robust. The second paper we will review is basically a GNN with LLM reasoning,

so this is a foundational work where we try to put together all the benefits of LLMs, GNNs, and also something that we call GraphRAG, which I will explain.

😊, Okay, so let's first see the first technique here.

The technique is called TAPE, and here the idea is to use LLM knowledge to improve the quality of the node features in a TAG, a text-attributed graph, okay, so if we have better node features then we will be able to predict, you know, with more accuracy,

😊, with higher accuracy. So the question is, again, how do we extract

information from an LLM for a specific TAG task? 😊, So to do that,

we are going to prompt the LLM, because the LLM, again, has accumulated so much knowledge. What we would like to do is to prompt for the prediction of the LLM,

but at the same time we also want to understand its reasoning. Okay, so we will ask: given, for example,

an article, where the article is the title and the abstract, we will ask the LLM a question, basically to predict the class, but at the same time we will ask the LLM to give us, you know, its reasoning, so why did you decide on this prediction, okay?

So we also get the reasoning, an explanation if you want, okay?

So now that we have, you know, the sequence of words for the abstract, title, explanation, and prediction,

this is not something that we can directly use with GNNs, so what we need basically is a mapping from sequence to vector, okay? We want to take this input sequence of words and output a D-dimensional vector that summarizes this information, and that will basically improve the

expressivity of this node feature. So remember that in this example a node is an article, okay,

this is another article, and then what we want is to predict if there is a relationship, 😊, and,

and what we want to predict is, you know, the class. 😊, Okay. So, what we propose:

we propose something that is going to leverage both proprietary and open-source LLMs.

This technique is a kind of integrator between a closed LLM and an open LLM. Okay,

so the closed LLM can be GPT, it can be Gemini; we know that these closed, proprietary LLMs are actually better than the open ones, okay, so you can go on the leaderboard of LLMs and you see that the top two are always GPT and Gemini, okay. So unfortunately for researchers, proprietary LLMs are better than the open ones, but the problem with the closed LLMs is basically that they only provide sequences of words, right, they don't provide the vectors that we want to train the GNN. In contrast, if you look at open-source LLMs like Llama

or Gemma, basically they are going to provide the text but also the vectors, right, so we have access to everything inside the architecture:

we have the hidden vectors, the hidden features, but also, you know, the output; everything is given to us.

So what we decided to do in 2022: at that time GPT-3.5 was the best closed LLM, so this is the one that we used. So we have our article, this is the node i in the graph; given the title and the abstract, we query, we get the explanation, we get the prediction, and then what we do is convert the sequence of words in the explanation and the prediction by using

a language model, also a small one, which is DeBERTa in this situation;

it has 129 million parameters. And here, let me zoom in. So what we do is basically: we have our sequence of word tokens, so this is basically the explanation, if you want, and here we have,

you know, in any transformer architecture you can have a class token, basically something that will summarize the sequence of words. So we give these as input, we go through the transformer layers, we get the output, and then we take the class token after the L transformer layers; then we go through an MLP, a small MLP, to fine-tune on the training set, okay? For the training set we know the correct class, so we basically want to fine-tune the MLP on the correct classes of the training set. So the MLP here is a small one, only two layers:

the first layer basically gives us some features, and then it goes through another layer to get the number of classes, so for example that can be 7 if it is Cora, or 40 if it is OGBN-arXiv.

So this guy here is going to be a vector of, say, seven dimensions,

so this is actually going to be our enriched feature, so this feature will represent the input sequence that we have here, okay, and this is very tailored to

the task that you want to solve, okay. So this way we are able to get enriched features for the explanations, enriched features for, you know, the title and abstract, and then we can also have a prediction feature.

Okay, so what should we do if we are in 2024? Actually, we should replace, you know, the smaller DeBERTa language model by now using a large language model, okay. Why can we do that at a university? Basically, if you take for example Llama 2 or Gemma, you can fine-tune them using this very nice technique of LoRA, okay, so low-rank fine-tuning, and basically with my small GPUs I'm actually able to fine-tune, you know, a large language model like Llama 2, so this is very great, okay. So basically what we would do is just change the proprietary LLM with the best one, you take the one that you like, and then here, instead of using DeBERTa, you can use Llama 2, for example.
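As a rough sketch of this feature-extraction step, here is a minimal, hypothetical version (not the authors' code; the model name, the 40-class head, and the example texts are assumptions for illustration), using the HuggingFace `transformers` API:

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel  # assumed available

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
lm = AutoModel.from_pretrained("microsoft/deberta-base")
# small 2-layer MLP head mapping the summary token to the class logits (e.g. 40 for ogbn-arxiv)
head = nn.Sequential(nn.Linear(lm.config.hidden_size, 256), nn.ReLU(), nn.Linear(256, 40))

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = lm(**batch).last_hidden_state      # (batch, length, hidden)
    return hidden[:, 0]                         # [CLS]-style summary token per sequence

# the LM and head are fine-tuned with cross-entropy on the labeled training nodes;
# the resulting hidden vectors (or the class logits) become the enriched node features
features = encode(["LLM explanation for paper 1 ...", "LLM explanation for paper 2 ..."])
enriched = head(features)                       # e.g. 40-dimensional enriched features
```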

😊, Okay, so once you have your enriched node features,

what you're going to do is train your GNN, okay, with these new node features, so you can use your favorite GNN, okay,

and then you can make the prediction. So we can now compare the quality of the node features:

we are going to look at shallow features, language model features, and large language model features, okay. So the shallow features: if you look at the OGB datasets, they have already designed some nice hand-crafted features for each dataset,

for example skip-gram for OGBN-arXiv. And what happens is that you get 70% accuracy on the test set in only four minutes,

okay, so this is very fast, and I think this is a very good baseline for model performance. Now,

as I said before, the state of the art was GLEM, which was training simultaneously a language model, okay, and also a GNN, and basically this model got the best accuracy, 76.5%, but of course, because you need to train, you know, your language model,

this is going to take more time, so it takes 9.2 hours, okay. So here, at that time, before the introduction of ChatGPT, there was really a huge tradeoff: if you want to increase your accuracy, you need to pay the price of, you know, much more computational hardware but also training time.

So now, this is what we suggested: basically we use this LLM, we prompt it,

we translate into vectors, and then we fine-tune, so accuracy was around 75 percent, and it took much less time to do that. Interestingly, once we published the paper, we got to the top of the leaderboard, and then there were other techniques that used the same approach, of course a little better, and even today, actually, I was surprised, I looked at it recently, even today the top three models for OGBN-arXiv are basically based on this type of technique.

Okay, so of course one reviewer said, oh, we cannot trust your results because OGBN-arXiv is part of the LLM training dataset, so we cannot trust you. So what we did basically is that we produced a new arXiv dataset, not OGB but an arXiv dataset, we called it tape-arxiv23, so it's available to download; we have 77,000 papers, the same number of classes, and basically we reach the same conclusion, nothing changes when we do that. And again, this also shows that an LLM, even if it has seen, you know, the test set, is not able to reason very strongly with graph structure, so you still need a GNN to use the topological relationships to make good predictions.

We did some ablation studies; what we observe is that there is no specific feature which is better than the others,

it's actually the combination which is important. 😊,

So, to conclude for this work: basically we can use the LLM knowledge and its reasoning abilities to enhance the node features of the TAG, and also fine-tune them to the specific task, okay. So what we did is something that is not end-to-end, we don't train everything together; we first generate good features

and then we train the GNN. So this is, I think, along

the trend now of LLMs: basically LLMs are not trained end-to-end either, there are like four steps: pre-training, supervised fine-tuning, a reward model, and finally

reinforcement learning, okay. So each step is done independently, because each step is very clear to train, and I think, you know, it's very stable, and this is exactly the same conclusion that we have here.

😊, We can make it very stable and efficient, and everything which is stable usually gives us better performance, okay. And also something interesting is that we can leverage both proprietary and open-source LLMs, so we have the best of the two worlds.

😊, It's also, of course, yeah, so some people will like it,

it's interpretable, because you can see the reasoning words of the LLM. 😊,

Okay, so now let me go to the second technique that I want to introduce,

which is called G-Retriever, and the idea, as I told you before, is that the LLM is very powerful, it knows everything, but the problem is that it is going to make errors, because we never have a good prompt in some sense,

so we need to constrain, to regularize, you know, the LLM response into a much smaller space. 😊,

And to do that, we are actually going to use, you know, a TAG, a text-attributed graph, like a knowledge graph, and it will force the LLM basically to answer with respect to this TAG, okay. So the key question is, of course, how do we extract pertinent information from, you know,

a graph, and force the LLM to be more focused? Okay, so to do that we are going to use tokens. I think now everybody understands that, because of the transformer architecture, what is very nice is that you can feed everything as an input, right? So the input of your LLM can of course be your query tokens, but it can also be other information, like visual tokens, and also graph tokens. So this is what we're going to do here: we are going to use two kinds of tokens. The first one, and I will explain this in the following slides, is a graph encoding token, which is going to be here, and then text-based tokens, which are going to be here, okay.

you want to represent your graph as one vector and then you use this one vector to make a prediction for the property that you want so here this is the same idea so we can select any ferator GNN we apply multipleip graph learning layers to compute very deep node hidden feature you compute then the mean over on the node and then you apply an MP a small MP on that so this way you will have a graph encode token that summarize your topological graph and also the feature on your graph and with one vector okay so this is going to be this guy of d dimensions。

The other thing, of course, is that an LLM 😊 wants to use word tokens, right? The way it processes information is by using words, so if we want to tap, you know, the knowledge of LLMs, we basically need to transform the graph and its features into a sequence of natural-language tokens; this is important. So for example, here you have a graph, and you can represent this graph by some textual representation, for example: a graph G is a set of directed edges, defined by (i, j), where node i points to node j, so here the graph G is defined with edges like (0, 4), (1, 6), and so on, okay? So you have a one-to-one mapping between this mathematical representation of the graph and this text representation of the graph.
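For concreteness, a tiny hypothetical textualization helper (the edge list is made up purely for illustration):

```python
# one possible edge-list textualization of a small directed graph
edges = [(0, 4), (1, 6), (4, 6)]  # illustrative edges only

def graph_to_text(edges):
    lines = [f"node {i} points to node {j}" for i, j in edges]
    return "G is a graph with directed edges: " + "; ".join(lines) + "."

prompt_graph = graph_to_text(edges)
# "G is a graph with directed edges: node 0 points to node 4; node 1 points to node 6; ..."
```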

However, there are many ways to use language to represent a graph. For example, in this paper they have this nice example showing that you have many representations, so the text-based representation of a graph is not unique, okay, and this can be an issue. The other thing, as I said, is that it is not text-invariant, in the sense that if you swap this representation for that one, you will probably get different results, okay? So we would want to have some kind of invariance for the text representation, but we don't have it.

The other thing is a scalability issue when you want to represent a graph with text, okay? If you take open-source LLMs,

the context window is limited, right? So for example, if you use the one that we use in academia, Llama 2, the limitation is basically 4,000 tokens, so if your graph is small, no problem, but if your graph is, for example, the Wikipedia graph, then this is a huge number of nodes and a huge number of edges, so this is not something that you can do; there is a scalability issue.

Of course, LLMs are prone to hallucination, so for example here we have an example where an LLM can produce nodes and edges that do not actually exist in the knowledge graph that we use. So in the vocabulary of the graph, you know, this entity here for example doesn't exist, and still the LLM can actually give you this entity, which doesn't exist.

So what we propose is basically: first, we are going to apply GraphRAG (I'm going to explain how we do that) to retrieve a subgraph from a possibly large text-attributed graph,

a subgraph which is going to be relevant to the query of the user. Okay, step two:

we are going to concatenate the user query, but also the graph encoding token and the text-based graph tokens, to create the input sequence of the LLM, okay. Then step three, the LLM will generate an answer, and step four, we will train everything, so we will simultaneously train the GNN parameters and also fine-tune the LLM parameters using LoRA. And the good thing is that, you see, if you use LoRA, you are only using 0.5% of the 7 billion parameters, so it's only 35 million, and the GNN has something like 5 million, so it's not that much actually to train; this is something we can do in academia.

So the GraphRAG that we propose, this is a graph retrieval-augmented generation: RAG is very popular today, and here we want to extend it to graphs. The main problem is of course the scalability, so I'm going to tell you how we solve this problem.

😊, Yeah, let me maybe go here. So the first thing is to do indexing.

So what we do is that we have a text-attributed graph, so we are going to take

the node features, these are text node features, and we also have the edge features, the same,

and we apply a pre-trained and frozen large language model, or a small language model (at that time we just used a small language model, but we could use a large language model now). So you do that and you get some D-dimensional representations of all the nodes and of the edges, and you store them in a database, in a graph database. Today we are lucky, there are many open graph databases that can be used, for example

PyG, DGL, but also LlamaIndex, also Microsoft, and NebulaGraph; there are also some proprietary graph databases like Neo4j.

Okay, so we do that, so we take this and we have vector representations of the nodes and the edges. Okay, the second thing is that we are going to do retrieval: given a query from the user, we will represent the query, for example "what is the name of Justin Bieber's brother"; we use the same, you know, language model to represent the query as we did for the nodes and the edges, okay. And then what we do is basically just a similarity-metric evaluation, so here we just use the cosine metric, and we can retrieve this way the top-k nodes and edges from the graph database, okay. So we get something like this, which is a noisy, I would say, subgraph.
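A minimal sketch of this retrieval step (tensor names are placeholders; the embeddings are assumed to come from the same frozen language model used for indexing):

```python
import torch

def retrieve_top_k(query_emb, node_embs, edge_embs, k_nodes=10, k_edges=10):
    """Cosine-similarity retrieval of the most relevant nodes and edges for a query."""
    q = query_emb / query_emb.norm()
    node_scores = (node_embs / node_embs.norm(dim=1, keepdim=True)) @ q
    edge_scores = (edge_embs / edge_embs.norm(dim=1, keepdim=True)) @ q
    top_nodes = node_scores.topk(k_nodes).indices
    top_edges = edge_scores.topk(k_edges).indices
    return top_nodes, top_edges   # indices forming the "noisy" candidate subgraph
```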

😊, And then the next step is basically to extract just, you know, a smaller graph which has the most relevant information, and that's it, okay. The way we do that: we are going to solve the Prize-Collecting Steiner Tree problem, okay. The Prize-Collecting Steiner Tree is basically a tree: you start from this original graph, and each node has a prize, and the higher the prize, the more the node wants to be in the final tree. So you want to maximize the prize,

but you also don't want to take everything, so you are going to have a penalty with the cost of the solution that you want, so usually, you don't want to, or it should be a minus, sorry, it should be minus here, so you don't want to take too many nodes, so

basically this is the number of nodes that you have here, okay. So when you solve this combinatorial optimization problem, which is NP-hard, but you can get an approximate solution using semidefinite programming, SDP, you will get something like that, okay. So this is a directed tree, and you see that a directed tree usually has some kind of root node with everything flowing out, and of course it wants to take the larger nodes, okay. So this is the original Steiner

tree; we can modify it, because here there were only the nodes,

but of course when we do graph learning there are also the edge features that we want to use, and we can easily incorporate edge information, okay. So we have a prize also for the edges; you see, this is for example an edge which is more important than this edge here, and we just, you know, modify the combinatorial optimization problem a little bit, and we can solve it using very fast techniques, so the approximation is a nearly linear-time approximation.

Okay, so if we compare the standard RAG with the GraphRAG:

with the standard RAG you basically have your knowledge database and then you extract, you know, the number of relevant documents that you want;

😊, for the GraphRAG here, we also have a graph database, and we are going to extract a much smaller but very relevant graph, related to the query of the user, okay.

😊, So now I'm coming back to the tokens of the input LLM. The first one, again, is the graph encoding token; I already talked about this,

so this is, you know, an MLP on the mean of the node hidden features of the last layer of the GNN, and it's going to be here,

so of course, since you have learnable parameters here,

everything can be changed, you know, by backpropagation. 😊, For the text-based part,

basically we will have two sequences of input words: the first one is the query of the user,

which is "what is the name of Justin Bieber's brother",

and the second one is the textualization of the graph,

okay, so this is, this is something, again, that is important 😊 for tapping the LLM ability.

Okay, so we do that, so we have the graph representation,

it will go through the text embedder, since any LLM has this first layer that does the word embedding, okay,

so here we take the embeddings, and then it will go through, you know, the LLM. So the training,

the training is very standard: we give these input tokens,

we have the L transformer layers, and then the system generates the response recursively. Because we know the label, using a training set, we can basically fine-tune the system to give us the right answer, okay. So we compute the cross-entropy between the generated answer and the ground truth, and then we do the backward pass to compute the gradients and update the parameters of the system, so for the GNN but also for the LLM.
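As a rough sketch of the parameter-efficient part of this training setup (assuming the HuggingFace `transformers` and `peft` packages; the model id and LoRA hyperparameters are placeholders, not necessarily the authors' settings):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model  # assumed available

# frozen base LLM plus small LoRA adapters: only a fraction of a percent of the
# 7B parameters becomes trainable, and it is optimized jointly with the GNN encoder
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()   # only the LoRA weights are trainable

# a single optimizer (e.g. AdamW) can then cover both the LoRA weights and the GNN
# graph-token encoder, trained on the cross-entropy of the generated answer
```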

Okay, so in summary, G-Retriever is composed of four steps: we have GraphRAG, which is here,

which is for subgraph retrieval related to the query of the user; then we have the computation of graph tokens using a GNN; we have also,

yes, response generation, once we go through the input tokens and also the transformer layers; and finally, model training, okay.

So here we try to combine the best of all worlds together, and I think this is done in a very natural way. Yeah, we had to change the existing datasets and create a new benchmark to evaluate this task of doing reasoning on text-attributed graphs.

Yeah, so we have ExplaGraphs, scene graphs, and WebQSP.

So the main results are basically that, okay, here there are many things, but you should only focus on this:

if you use our technique, G-Retriever, you will basically be able to beat the case where you only use the LLM, so you have your LLM, your query, and then you get the answer; if you only do GNN prompt tuning, we still do better; and if you only do LoRA fine-tuning of the LLM, we still do better, okay. So this is really trying to combine the best of these worlds.

Okay, so in terms of scalability, basically we are now able to reduce by 83% to 99% the number of tokens, the number of nodes, that we use, so instead of taking, you know, the whole Wikipedia graph, we only take a very small subgraph of the Wikipedia network.

Yeah, and of course the one we were interested in is: does it improve hallucination? So we compare with the baseline, which is just the LLM with the query, and we checked manually:

what we did is that we took one of the queries and one of the responses, and we looked manually at whether, you know, there are some missing nodes or wrongly added nodes, and the same for the edges, and we see that if you use G-Retriever, it really reduces the hallucination.

Ablation studies: what we observed, basically, and that was very interesting, is that the tokens given by the GNN and the text tokens of the graph actually contribute equally, okay. So the information coming from the GNN and the information coming from the text of the graph through the LLM are basically both important, okay, so they are complementary, they really improve everything. I think it makes sense, of course, because you have your GNN, and we know that they are pretty good at extracting graph information, but also the LLM has a very strong capability with, you know, text representations, so the two are complementary.

😊, So in conclusion, if we want to unlock the LLM capabilities,

we need to use graphs represented as tokens as well,

but actually it is combining everything, so the LLM, the GNN, and the GraphRAG, that provides superior performance,

so it's not only the LLM capability that does the work, it's actually many actors. So G-Retriever is effective,

efficient, and mitigates hallucination. So here are the paper and the code, and I really invite you to read the blog post by Xiaoxin He; she has done terrific work here,

she's the main researcher in this project and actually received the 2024 Google scholarship for this, and I really invite you to read the blog post if you want a high-level introduction to this work.

So also, yeah, the PyTorch Geometric team, you know, included G-Retriever, so this is very nice. And I think something which is also very interesting is to look at, again, we always go back to the question: can we use GNNs to improve products, right? And I think here there is an opportunity. If we look at the history of web search engines, everything started with, you know, Google's PageRank; they didn't use any text processing.

Then there was word2vec, they used word2vec, then they used a language model, they used also RAG, and finally we have Gemini, okay. So Gemini is LLM plus RAG, so the LLM, if it doesn't know the answer, will look at RAG when you prompt Gemini. So the next step, hopefully, would be to use a GNN and, you know, some text-attributed graph, a knowledge graph, that would be useful for web search engines. So what is attractive here, I think, is that everything is integrated and learnable, so this is, you know, the deep learning way.

So the next step, of course, which we are working on with Xiaoxin, is basically, okay, so we have the text features, we have the graph features, so the last feature you want is basically the image, introducing the image information. So again, today we have transformers, so what you can do is take your image, decompose it into patches, and then each patch can be represented by just a vector, so this is the visual token, and then you go through the transformer layers, and you also backpropagate to update the representation, the parameters, of your visual transformer.

And that's it, so thank you so much, 😊, yeah, for being there, and I'm happy to take any questions.

So if the participants have any questions, you can type those into the chat or Slack. And yeah,

to perhaps get us started, I had a question. And yeah, I wondered, like, how

you propose to tokenize everything, essentially, and fine-tune these language models with LoRA,

this is really exciting, right? Like it really unlocks multimodality. You,

you did have a slide about graph tokenization, right, and that there is no canonical order,

so how do you deal with that in this work? You know, like,

how do you decide how the graph is tokenized in the end? 😊, Are you talking about this one?

Yeah, like where you'd shown the "talk like a graph" slide as well, right?

I'm not sure if you mentioned this slide, right, so is this how you kind of textualize the graph?

So textualization, where is it? 😊, This one? Yes, yeah, yeah. So textualization is arbitrary,

it's completely arbitrary. So I think Bryan Perozzi has this nice paper, and this is an issue, right,

so you would want, as you said, a canonical representation, and there is not one.

Okay, so what is, for the LLM that has been trained, the best representation to extract the most information? This is impossible to say. So not only do you not have a unique representation, but you still need to have some word representation, otherwise you cannot access the LLM knowledge, but at the same time you have no idea, you know, whether this representation is good or not. So this is like prompting, right: there is no way to know the quality of your prompt before you try. So for example, what we did, I don't know if you noticed, but there is a change of performance in the prompt

if you put the title after the abstract. So I think all these models have this issue: the way you prompt, you are going to get very different results. Here there should be no difference, right, if you put the title before the abstract, but there is a difference, so you need to play with it, and this is, you know, I was half joking when I talked about these LLMs: they have seen everything, they have seen all of human knowledge, and you can ask them anything, but if you are not precise, if you don't know how to ask them the question, they will never give you the right answer. So the only way, I think, to bypass this is,

😊,Basically, to get along with the non uniqueness of the text representation, but then to find tune。

So before LoRA I was quite pessimistic about this technique. I thought, it's very hard: at the end you have some vectors, and you need to align those vectors with the graph vectors and everything else, but you will never get something good because they have not been trained this way. But if you do this LoRA fine-tuning, then you have a way to align this vector information. It's not perfect, and without fine-tuning it doesn't fully make sense,

for example, let's go to the end: in some sense it doesn't make sense

to combine the visual tokens and the graph tokens with the word tokens, because they are very different modalities. So with the MLP that you put here, you are trying to align the visual space and the graph space with the text space, and then, if this part is frozen, you basically don't do anything else: you hope for the best, you hope that in this space the alignment is good enough to give you good precision. It's never really the case, I don't think, but there are so many parameters that eventually you get something good. Now that we can fine-tune with LoRA, what you do is basically force all these spaces to align.

And this is why it works better when you fine tune。Yeah, thanks, I mean, thanks for all the details。

It's always exciting to discuss research this way. Do we have time for one more question?

Yeah, sure. Yeah, so I guess another thing to discuss would be: if we can convince industry of the potential of these approaches, and I'm certainly excited,

do you think it's possible in the future that we have LLMs which are, let's say, more implicitly able to handle graph-structured data? Here we had to fine-tune with LoRA and so forth, right,

but do you think there's a possibility to have graphs as one of the pretraining modalities? Yeah,

this is what the community is trying to do today, right, so let me go to this slide.

This is exactly what the community is trying to do right。

it's basically can we do some graph foundation model?

So you would be able to pretrain your graph in different modalities,

and then you recombine them with other modalities. What we do today is that here we are using the LLM self-attention layers.

Right, but what we would like to do ultimately is to train some very powerful foundation models: a foundation model for graphs, a foundation model for vision,

a foundation model for text, and then you combine them using these transformer layers for different tasks. I think for industry that would be the best way to go. The problem is that today only LLMs have so much

knowledge and so many parameters; it's very imbalanced. There is no way, I think, that today we can make a powerful product with GNNs only; the GNN today is just the cherry on top, in some sense. It's not that we are missing training data, the web is full of graph-structured data, there is no issue with that. The question is how you make a product that would excite industry, like Meta, Google and so on, so that tomorrow they say, okay, let's do something like ChatGPT but for graphs. It's not clear, and also at some point someone was asking, you know, why isn't GNN a hype in industry? That was very interesting, I think.

So today the only industry that really likes GNNs is biology,

right, because you cannot do LLMs for biology today,

so the only field that can make money from this is basically biology, and so we have the Nobel Prize. There is a huge promise that this deep learning, these neural networks and so on, can make money; of course it's a promise, it's not there yet, but I think this is where the money is for GNNs today. Of course we would like to do it also for text, there is no reason we cannot do that, but

as long as we don't have industry on board... I mean, we need to be honest with ourselves: industry drives the big AI,

the big AI improvements, like GPT. GPT could never have been developed in academia. Of course academia is very important for the ideas, like for example diffusion models, which were developed by a PhD student in Germany and so on, so the ideas come from academia,

but the scaling up and making a societal change come from industry, so it's part of the pipeline. If we are not able to convince people to use GNNs for text, then either GNNs are not good enough or we are not doing a good job of finding the right products.

Nice, yeah. I think this work is definitely going in that direction, right.

There's a quick... Yeah, there's a question from the audience; the question is as follows: for graph RAG,

could it be possible that retrieved nodes and edges are mutually exclusive and don't actually form connected subgraphs?

If so, is there a methodology to ensure that the subgraph is connected?

Yeah, so this is a very good question. I think this is completely arbitrary again, depending on the task you want to solve. Sometimes maybe you want a directed subgraph, sometimes undirected; there are many ways to retrieve a subgraph. The minimum spanning tree is the simplest one, but then in some sense the output size cannot be controlled, it is just the number of nodes. Here we wanted something smaller, so I would say everything is possible as long as you know the task you want to optimize. Here we used this one, but again, that was directed, that was for us, and we also wanted a small size so we could do it quickly; we wanted something better than the minimum spanning tree, and this is why we used a Steiner-tree-style extraction. But again,

I think you can use whatever subgraph extraction is going to fit your objective, there is no problem with that. You just need a process to extract a subgraph, that's very important, and this process has to be linear time, because if you do that on

a large graph, I mean, at the end again, we are talking about a product, you want to do it

on the fly, so you want something very, very fast。That's I think the only condition。

So of course a linear-time approximation is never as precise,

but if you take enough nodes and edges it should be good enough, I think for your application。

Yeah, that makes sense, thanks for all the detailed answers, and thanks for the wonderful talk. I think it's really exciting how all these technologies come together, and with graph RAG and graph vector databases I think we're on the cusp of hopefully the kinds of breakthroughs you talked about.

Thanks for the good talk. Sure, thank you very much again, everyone, for the invitation.

Yeah, and again, to all our presenters for the next session, sorry for the delay;

we had technical issues, and I'm going to hand it over to our next session chair, and she'll be chairing the oral presentations today.

Can you make me the host?

Yeah, thanks. So hi, sorry about the slight delay; because we started about 15 minutes late, we will push everything back by 15 minutes.

We have three exciting papers to share during our oral presentations, and the first one is Decomposing Force Fields as Flows on Graphs Reconstructed from Stochastic Trajectories, presented by Raan.


Yeah, can you share your screen please? Good. Whenever you're ready, it's all yours. Okay,

so you can hear me? Yeah. Okay, great. So thank you very much for the invite; I'm going to be talking about this paper, Decomposing Force Fields as Flows on Graphs Reconstructed from Stochastic Trajectories,

which is a paper in this year's Learning on Graphs.

So this paper focuses on stochastic processes and nonequilibrium steady states.

In their simplest form, continuous-space stochastic processes are described by SDEs, or Langevin processes,

which are stochastic differential equations given by some drift term F and some diffusive term D.

And whilst these are the dynamics of a single trajectory, their density dynamics,

so the evolution of the probability density,

the probability of a trajectory being at position x at time t, evolves according to the very famous Fokker-Planck equation.

And when the density is stationary, this means this left-hand side, dp/dt, is 0,

and then the process is said to be in a steady state。

And a steady state can either be an equilibrium, so thermal equilibrium, or a nonequilibrium, and that depends on whether this probability flux J is 0 or not:

if this probability flux J is 0 in the steady state,

then you have an equilibrium steady state, and if it is nonzero then you have a nonequilibrium steady state.
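For readers without the slides, here is a hedged reconstruction of the equations being referenced, in standard notation rather than the talk's exact symbols:

```latex
\mathrm{d}X_t = F(X_t)\,\mathrm{d}t + \sqrt{2D}\,\mathrm{d}W_t,
\qquad
\frac{\partial p(x,t)}{\partial t} = -\nabla \cdot J(x,t),
\qquad
J = F\,p - D\,\nabla p .
```

A steady state means \(\partial p/\partial t = 0\); it is an equilibrium steady state if the flux \(J\) vanishes and a nonequilibrium steady state otherwise.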

So nonequilibrium processes are very important in a range of domains,

but particularly biology, because biological systems maintain themselves in a nonequilibrium steady state to avoid thermal equilibrium, where thermal equilibrium is a situation in which they are no longer dissipating any heat to their environment, and it's synonymous with death. So here are just three examples of biological systems that maintain themselves out of equilibrium: you have red blood cells, well, the outer membranes of red blood cells, flickering with nonequilibrium dynamics;

human brain dynamics, here illustrated with just the first two principal components, showing these strong fluxes in steady state; and similarly in cell movements.

And these non equilibrium processes are also the basis of diffusion models in machine learning research。

They've gotten a lot of interest recently. So I'm going to focus on what's called the Helmholtz-Hodge decomposition:

if you have some nonequilibrium stochastic process,

it can be decomposed into reversible and irreversible components。

So the reversible component is stochastic and it basically is the part of the drift that balances the diffusion。

so you have some noisy fluctuations and then you have some drift that is going to maintain you on the stationary distribution。

And then you have the irreversible component, which is driving this rotation around the stationary distribution; you can see that here it's a circle. So you put these two things together, and what you get is a nonequilibrium process, on the left-hand side, that clearly is evolving over some kind of Mexican-hat or donut-type distribution,

but with a particular nonequilibrium rotation around this circle.

And so if you're interested in non equilibrium dynamics。

you're really interested in this irreversible component and how to extract it.

So if you limit this to the case of isotropic diffusion。

so this is where diffusion is happening exactly in the x and y coordinates, or where your diffusion is a multiple of the identity,

then the process is said to be reversible if and only if the drift is conservative. In the case of isotropic diffusion we can basically write F as J over p plus some gradient flow; if J over p is 0,

then F is just a gradient flow and so it's conservative, and then your process has no irreversible component. So this is quite a nice characterization of equilibrium versus nonequilibrium stochastic processes.
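Written out, the characterization just described is, assuming isotropic diffusion \(D = D\,\mathbb{I}\) (a hedged sketch, notation mine):

```latex
F(x) \;=\; \underbrace{\frac{J_{\mathrm{ss}}(x)}{p_{\mathrm{ss}}(x)}}_{\text{irreversible}}
\;+\; \underbrace{D\,\nabla \ln p_{\mathrm{ss}}(x)}_{\text{reversible gradient part}} ,
```

so the drift is a pure gradient flow, and the process is reversible, exactly when the stationary flux \(J_{\mathrm{ss}}\) is zero.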

Now if your process has non isotropic diffusion, you can just change coordinates so that you're in a basis or in a coordinate system that has isotropic diffusion。

Okay, so now I'm going to take a slight turn towards graphs and simplicial complexes

and show you the Hodge decomposition on a simplicial complex or graph.

So I'm sure you all know what a graph is, but a simplicial complex can be defined automatically by basically taking higher-order simplices from the cliques:

a set of three nodes that are all connected becomes a triangle, et cetera;

you can define higher-order complexes. Now an edge flow, or flow along a simplicial complex, is just a function defined on the edges that is alternating,

so if you just give a real value for each edge, with the convention that if you flip the edge you take minus that value, so x_ij = -x_ji,

then you've defined a flow on the edges of your simplicial complex.

Now these flows have a decomposition into three components: you can decompose this edge flow x into a curl part, a gradient part, and a harmonic remainder, and this closely mirrors the decomposition we had for stochastic processes. The gradient of a flow defined on the vertices basically means you have some function f defined on each vertex, and you take the difference between two vertices as the flow along that edge; then you have the curl of a flow defined on the triangles, and then you have some harmonic remainder. I'm skipping over lots of the algebraic geometry here,

but this is a well-known decomposition in signal processing for graphs and simplicial complexes.

So if your complex is specifically a triangulation then we know this harmonic remainder disappears。

So here we have an example of an edge flow defined on a general simplicial complex, and it breaks down into these three components; if you sum the three components you get back the edge flow: you have the gradient,

the curl and the harmonic part。And on the right, you have specifically a triangulation。

so you only have triangles and your edge flow breaks only into a gradient and a curl part。

And this decomposition has been used in many signal processing applications; here are just two examples that I chose:

you have this ocean drifter data or you have this RNA velocity field。

just two examples of Hodge decompositions in action.

So how are we going to bridge this decomposition of a stochastic process with the decomposition of a flow on a simplicial complex? The way we're going to bridge between the two is by considering the Hodge decomposition of a discrete-state Markov process.

A discrete-state Markov process tells you the evolution of some probability, where you have the probability of being in some discrete state i at time t.

And this is defined by this matrix of rates, the Q matrix,

which basically contains all the transition rates from state i to state j.

So this has a natural mapping to a graph, because you can say that the vertices in your graph are the states of your discrete-state process and transitions occur over edges.

So the real key here is: how do we define an edge flow that is analogous to the drift of the SDE, so that if I decompose that edge flow with the discrete Hodge decomposition,

it looks like the decomposition of the SDE with the Helmholtz-Hodge decomposition?

And it turns out that the correct way to do this is by taking the log of the square root of the ratio of these two rates,

which seems like a bit of an arbitrary choice, but it's mathematically motivated because, one, it is naturally alternating, so it satisfies the definition of an edge flow, and more importantly,

it satisfies that the Markov process is reversible and in equilibrium

if and only if x is conservative, so it is the gradient of a scalar potential。

And that marries up perfectly well with what we had in the continuous state。

So our continuous-state stochastic process is reversible if and only if F is a gradient flow, and our discrete-state process is reversible if and only if x is a gradient flow, and so we're going to use this relationship to define basically a mathematical pipeline.

So here's the idea: we're going to approximate a stochastic differential equation with a discrete-state process that we're going to infer directly from stochastic trajectories.

We're then going to decompose the discrete state process。

and this is going to recover the reversible and irreversible components of the SDE directly from these stochastic trajectories。

So here's our pipeline. We're going to begin with a number of trajectories, which we're just going to consider to be a point cloud in phase space. Here I've just got two trajectories, but you might have up to a hundred, living in some phase space; here I've illustrated it in two dimensions.

Now we basically want to get a grid that represents where our points live, and we want this to have a uniform density,

so we perform farthest-point subsampling and we get this uniform-density subsample; you can see here I've basically just picked out a uniform set of points that I can then grid in a way that will capture the manifold in my space.

So we triangulate these purple subsampled points, and what we get is the Delaunay triangulation, which is a very standard triangulation.

I've put some references here if you're unfamiliar with these techniques,

but the Delaunay triangulation is a natural way of basically filling in triangles in this space.
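A minimal sketch of this gridding step with NumPy/SciPy; the function and parameter names are illustrative and not taken from the paper's code:

```python
import numpy as np
from scipy.spatial import Delaunay

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy farthest-point subsampling of an (N, d) point cloud."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(n_samples - 1):
        chosen.append(int(dist.argmax()))
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
    return points[chosen]

# Trajectories pooled into a single point cloud (fake 2-D data for illustration).
cloud = np.random.randn(5000, 2)
centres = farthest_point_sampling(cloud, n_samples=200)
mesh = Delaunay(centres)        # triangles whose duals become the discrete states
print(mesh.simplices.shape)     # (n_triangles, 3) vertex indices per triangle
```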

And so what we get is a triangular mesh or a grid where you have these midpoints labeled in pink dots and you have the actual vertices labeled as purple dots。

So now we want to use these as our discrete states, these triangular bins。

And so all we have to do is basically follow our trajectories through these bins and count the number of transitions from each state j to i, and then the maximum likelihood estimator is very simple: it's just the number of transitions from j to i divided by the holding time in state j.

And so this way, you can see that I followed my trajectories around,

and I've just inferred a discrete-state stochastic process directly from some trajectories.
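A hedged sketch of this estimation step, assuming each trajectory has already been converted to a sequence of bin indices sampled every `dt` (variable names are mine, not the paper's):

```python
import numpy as np

def count_transitions(bin_sequence, dt, n_states):
    """Transition counts n[j, i] and holding times t[j] along one trajectory."""
    counts = np.zeros((n_states, n_states))
    holding = np.zeros(n_states)
    for a, b in zip(bin_sequence[:-1], bin_sequence[1:]):
        holding[a] += dt
        if a != b:
            counts[a, b] += 1
    return counts, holding

def edge_flow(counts, holding, i, j, eps=1e-12):
    """x_ij = log sqrt(q_ij / q_ji) with q_ij = n_ij / t_i; antisymmetric by construction."""
    q_ij = counts[i, j] / max(holding[i], eps)
    q_ji = counts[j, i] / max(holding[j], eps)
    return 0.5 * np.log((q_ij + eps) / (q_ji + eps))
```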

Okay so what we have in this case actually is you have transitions going from the midpoint of each triangular bin to the adjacent triangular bin。

but we actually want our discrete states to be vertices, not midpoints。

so we take the dual tessellation, or dual triangulation, which is a very natural simplicial complex; it basically flips things so that the midpoints are now vertices and the transitions now occur over edges.

So now we have what we wanted: a simplicial complex, or triangulation, where our discrete states are vertices and the transitions of our stochastic process are edges.

So all that's left to do is to define our edge flow using our maximum likelihood estimates of the transition rates,

taking the log of the square root of the ratio, and we have a flow that represents the drift of our stochastic differential equation on this simplicial complex.

So now we can decompose the edge flow. The discrete version is very simple to decompose, it's basically just an LSQR problem, and I'm skipping some of the mathematics again,

but this is because the Hodge decomposition decomposes into orthogonal spaces, which basically means you can find some function f whose gradient is most similar to x, and that is the scalar potential that solves the Helmholtz-Hodge decomposition. Here grad is basically a sparse matrix, f is some vector in the space of the number of nodes, and x is a vector in the space of the number of edges, so you just solve this problem with LSQR, you get f, and then you can recover the curl part by taking the remainder, x minus this gradient part.

And so we can decompose this edge flow into two parts, curl-star phi and grad f,

where curl-star phi represents all of the irreversible currents in our stochastic process and grad f represents the reversible component.
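A hedged sketch of this least-squares step with SciPy's sparse LSQR solver; the signed node-to-edge incidence matrix plays the role of the gradient operator (construction details and names are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import lsqr

def hodge_split(edges, x, n_nodes):
    """Split an edge flow x into a gradient part and a rotational remainder.

    edges : list of (i, j) index pairs, one per oriented edge
    x     : edge-flow values aligned with `edges` (x_ji = -x_ij implied)
    """
    rows, cols, vals = [], [], []
    for e, (i, j) in enumerate(edges):
        rows += [e, e]; cols += [j, i]; vals += [1.0, -1.0]   # (grad f)_e = f_j - f_i
    grad = coo_matrix((vals, (rows, cols)), shape=(len(edges), n_nodes)).tocsr()

    f = lsqr(grad, x)[0]      # scalar potential: argmin_f || grad f - x ||_2
    x_grad = grad @ f         # reversible (gradient) component
    x_rot = x - x_grad        # curl (+ harmonic) remainder: the irreversible part
    return f, x_grad, x_rot
```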

Okay, so now we're going to try and see if this actually works with some numerical experiments.

So a first sanity check is your reversible components should be even under time reversal and your irreversible components should be odd under time reversal。

So we basically take one ensemble of stochastic trajectories, in this case from a linear Ornstein-Uhlenbeck process,

and we basically construct our edge flow going forward in time and decompose it。

And then we do it the same, but with the trajectories turned backwards in time。

And because you have the same meshes, you have the same edges, but the flow is what changes。

So we can look at the correlation between the flow when it's going forward in time and when it's going backward in time。

And what you get is that the gradient flow is strongly positively correlated。

which suggests it is even under time reversal, and that the curl-star flow is strongly negatively correlated under time flipping,

which suggests it is odd under time reversal. So the sanity checks pass, and it suggests that we are actually capturing reversible and irreversible currents with these two parts.

So as a general measure going forward of the level of nonequilibrium, or the level of irreversibility,

we're going to take the proportion of the curl flow over the whole flow,

so basically how big your curl component is relative to the size of the curl component plus the size of the gradient component.
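In symbols, the measure being described is (a hedged reading of the talk, up to the choice of norm):

```latex
P_C \;=\; \frac{\lVert \operatorname{curl}^{*}\phi \rVert}
               {\lVert \operatorname{curl}^{*}\phi \rVert + \lVert \operatorname{grad} f \rVert},
```

so \(P_C = 0\) for a purely reversible gradient flow, and it grows as the rotational, irreversible part dominates.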

And we're going to see if this metric P_C lines up with what we expect.

So we begin with two solvable processes: the Ornstein-Uhlenbeck, which is a linear process,

which we know converges to a Gaussian distribution, and the planar limit cycle,

which is a nonlinear process that we know converges to a Mexican-hat-style distribution.

And these have been parameterized so that there's some parameter theta which causes rotation around the stationary distribution,

so it allows me to change how much irreversible current there is in each of these systems。

So I do this, and I can basically compare P_C in the reconstructed simplicial-complex flow to P_C in the exact solution,

where the exact solution is just approximated on some square grid.

So I'm not exactly comparing like with like here,

but the idea is that I'm looking at how much of the exact solution is curl and how much of it is reversible.

And what you can see is that we capture the correct qualitative dynamics,

which is that as I increase theta, I'm increasing the level of curl, of irreversible flow, in my exact solution,

but also in our reconstruction, so our method is able to capture how nonequilibrium your dynamics are relative to similar systems.

So these are two solvable examples, but let's consider two unsolvable, nonlinear examples in the form of the stochastic Van der Pol oscillator and the stochastic Rössler attractor. In these cases I haven't got an exact form for the stationary distribution, and so I can't perform the Helmholtz-Hodge decomposition exactly.

But I can use my parameter theta as a proxy measure for how much irreversible curl flow I have.

So I do this again and I increase theta, and you can see that in the Van der Pol oscillator

I am capturing quite well that as I increase theta,

I'm increasing the level of curl, or the level of irreversibility, in my edge flow.

In the Rössler attractor I have the same behavior, but it's slightly noisier because the Rössler attractor is in 3D,

so I have to use slightly larger, coarser triangles in order to get good estimates, and so you get more noise, but the qualitative behavior is still there:

you increase theta and you increase the proportion of curl in the reconstructed flow.

So with these numerical experiments it suggests that our method actually works。

so we're able to capture the level of curl flow in the original data in our reconstruction.

So we're going to apply it to some real world examples。

And so we're going to focus on two case studies where we know the kind of level of ground truth。

So the first one is flickering red blood cells: you can have healthy red blood cells, which have a high level of dissipation, and you can have synthetically ATP-depleted red blood cells,

which we call passive, which have much lower dissipation; that means they're more reversible than their active or healthy counterparts, and this is something that has been shown in real-world data recently, it came out in Science.

And then there's a result that is much, much older but also well known,

which is that arrhythmic or aging human heartbeats have lower irreversibility, or lower curl flow, than healthy counterparts. In the first case the data is just a univariate time series, the outer membrane position of a red blood cell over time, and in the human heartbeats

it's a univariate time series in the form of an ECG, an electrocardiogram.

So if I have these univariate time series, in both the case of the red blood cells and the heartbeat dynamics,

I want to basically reconstruct phase-space dynamics, because that's where my method actually works, and so we just use a very simple time-delay embedding into two dimensions, where we pick tau with a very standard method, the first zero crossing of the autocorrelation.

And so this way I'm basically able to take my univariate time series and get some dynamics in phase space for both healthy and depleted, or passive, red blood cells, as well as healthy and arrhythmic heartbeats.
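A minimal sketch of that delay embedding; the lag rule (first zero crossing of the autocorrelation) is the one described in the talk, while everything else here is illustrative:

```python
import numpy as np

def first_zero_crossing_lag(x):
    """Lag tau at which the autocorrelation of x first becomes non-positive."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    below = np.nonzero(acf <= 0)[0]
    return int(below[0]) if below.size else 1

def delay_embed(x, tau, dim=2):
    """Embed a univariate series into dim-dimensional phase space."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[k * tau: k * tau + n] for k in range(dim)], axis=1)

# e.g. a membrane-position or ECG trace -> a 2-D phase-space trajectory
signal = np.sin(np.linspace(0, 60, 3000)) + 0.1 * np.random.randn(3000)
traj = delay_embed(signal, tau=first_zero_crossing_lag(signal), dim=2)
```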

So there's one caveat here,

which is that in my framework and in my numerical examples I assumed isotropic diffusion。

so I assumed that diffusion was happening exactly in the x and y coordinates, that D was a multiple of the identity. In real-world data you can't assume that to be true, so we have to estimate the diffusion tensor and then do that coordinate change I mentioned earlier.

So what we do is use this formula; it looks a bit ugly,

but it's just the maximum likelihood estimator for the diffusion in each triangle. You follow the trajectories, and then within each triangle you just compute this quantity, where the product in the middle is a tensor product, and it gives you the estimate of the diffusion matrix in that triangle. I'm assuming that it's constant over all triangles, so I take the average, and then I change coordinates so that my data is now evolving with isotropic diffusion.
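A hedged sketch of the estimator and the coordinate change; here the per-triangle estimates are collapsed into one global average, and the 1/(2Δt) scaling is the standard increment-based MLE rather than the paper's exact formula:

```python
import numpy as np

def estimate_diffusion(traj, dt):
    """D ≈ (1 / (2 * dt)) * mean over increments of (dX outer dX)."""
    dX = np.diff(traj, axis=0)                               # (T-1, d) increments
    return (dX[:, :, None] * dX[:, None, :]).mean(axis=0) / (2.0 * dt)

def isotropize(traj, D):
    """Change coordinates y = D^{-1/2} x so the diffusion becomes isotropic."""
    evals, evecs = np.linalg.eigh(D)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T             # symmetric D^{-1/2}
    return traj @ W                                          # W is symmetric, so W == W.T
```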

And then I can apply my framework as before. And so what do we find? We find what we expected to find,

which is that healthy red blood cells have a significantly higher level of P_C, our irreversible curl flow, compared to passive red blood cells. This is using ensemble bootstrapping, so I'm able to generate a lot of ensembles to run this on and check the significance of the differences, and you can see that we have significance at the highest level for healthy versus passive in this example. And then we have similar results with our human heartbeats: we see that healthy

heartbeats are significantly more irreversible, have significantly more curl flow, than their arrhythmic or unhealthy counterparts.

And this supports the previous results in the literature。

but using our new kind of graph-based framework. Okay, so to recap:

we developed a method that leverages the discrete form of the Helmholtz-Hodge decomposition on a simplicial complex to approximate the continuous form of the Helmholtz-Hodge decomposition of an SDE.

We validated that our approach captured irreversible currents in both solvable and unsolvable stochastic processes。

And then we applied our approach to flickering red blood cells and human heartbeats to confirm that there are elevated levels of irreversible currents in healthy conditions compared to impaired conditions,

either passive red blood cells with ATP depletion, or arrhythmic heartbeats with myocardial infarction,

which is the specific disease that we were looking at here. So in conclusion,

our approach breaks new ground at the interface of graph signal processing,

particularly with applications to stochastic dynamics and biophysics。

which is not something that is commonly looked at with graph signal processing techniques。

and it prompts more mathematical work to look at the relationship between the discrete or graph-based form of this Hodge decomposition, which is getting very popular, and the continuous form, which is well known both for SDEs and for functions, which can be decomposed similarly with the Helmholtz-Hodge decomposition.

So thank you for listening. All that's left is to thank my co-authors: my supervisors Alain Goriely and Renaud Lambiotte at the University of Oxford,

Morten Kringelbach at the Centre for Eudaimonia, and my co-authors

David Beers and Alexander Strang. If you're interested, please read the paper and reach out.

Thank you so much for the fantastic talk. Is there any question from the audience? I think we have time to take one question.

I have one question to start with. I find it very interesting that you have the applications in red blood cells and heartbeats; can you elaborate on more practical uses of your model in other biophysics?

So in terms of practical use: there's a lot of interest in nonequilibrium dynamics in biology;

these are just a couple of papers from recent years。Specifically。

I don't know if it has any clinical applications at this level。

but it's more about the energetics: being able to capture from data basically how much dissipation,

energy dissipation, entropy dissipation, is happening in biological systems. This is quite difficult to do from dynamics; there are a number of different techniques, and this is just one more in the quiver of techniques for analyzing nonequilibrium dynamics.

We picked those two examples because they were situations where we kind of knew the ground truth, or where there were other methods that had been used to get a ground truth, so we could see if our method actually worked. There are other exploratory avenues for looking at nonequilibrium dynamics; specifically, most of my own work is in neuroscience, where we're looking at nonequilibrium dynamics in the brain as a signature of cognition and of consciousness.

Thank you. I think we have one quick question: how could we use this framework to perhaps understand diffusion generative models, do you have any insights? Okay, yeah, I'm certainly not an expert on diffusion models; I know that they focus on nonequilibrium stochastic processes as well.

If you like, I've seen recent work from this month,

actually looking at the dynamical regimes that diffusion models go through。

In theory this could be applied to the output of any stochastic system, as long as it evolves in a reasonably sized phase space,

a reasonable number of dimensions in phase space. In terms of diffusion models, you could apply it to the trajectories that diffusion models take;

however, when you have access to the equations, it's not necessary to use these kinds of data-driven techniques,

so I think with a diffusion model, from what I understand, you might have access to the actual model form, in which case you can use basically the exact decomposition,

the exact mathematics, without having to go to a data-driven method.

Got it, thank you so much again for the great talk. Thank you, thank you for your time, and thanks to everyone for listening.

Now we have the second presentation, from Julia; the title is A Cosmic-Scale Benchmark for Symmetry-Preserving Data Processing.

Yeah, whenever you're ready。

Oh, thank you, and thanks for the invite to speak. So hi everyone, I'm Julia, and I'm a PhD student at MIT, and today I'm excited to present our work on a cosmic-scale benchmark for symmetry-preserving data processing, which was performed as part of the NSF Institute for AI and Fundamental Interactions, IAIFI for short.

So let's start with a bit of background on one of the flagship observations in cosmology。

which is galaxy clustering. And this is where the positions of galaxies and also their associated properties are measured by cosmological surveys,

such as the one shown in the image here, and cosmologists are generally interested in studying the spatial distributions of galaxies because they can tell us a great deal about the underlying structure of the universe and also about the physical processes that shaped it。

And the data from these observations is usually very complex and high dimensional。

And so traditionally, cosmologists would resort to using summary statistics, like, say, the two-point correlation function, which at a high level just quantifies the likelihood of finding pairs of galaxies separated by a given distance.

And so while these types of statistics have been incredibly useful。

they also come with the drawback of potentially losing information about higher order correlations in the data。

And so this drawback motivates the development of machine learning tools that can reliably extract information from these types of surveys and process these point clouds。

And besides the downstream cosmological tasks of interest。

Galaxy cluster data has several characteristics that make it a valuable benchmark for stress testing。

graph neural networks and other machine learning methods。

So the first characteristic is the large point-cloud cardinality. Most atomistic systems that you might encounter in other scientific domains where you use graph processing,

like, say, small molecules, only have around 10 to 100 nodes per graph,

whereas these data sets are sometimes on the order of over 10^6 points per point cloud.

And so this presents unique challenges when it comes to scalability。

The second characteristic is the presence of information across spatial scales. On one hand we have gravitational forces that cause matter to cluster and create these short-range correlations,

And on the other hand, we also have the growth of structures that could have been initially in causal contact。

but might now be spatially separated, and this creates long-range correlations.

And so this means that we need to use methods that can capture both local and global information。

And finally, the third characteristic is the symmetry structure in the data。

So since the universe is homogeneous and isotropic, its properties are spatially uniform。

and so this means that the distribution of galaxies should exhibit Euclidean symmetry or in other words invariance to translations。

rotations, and reflections. And so this means that this data could also be useful for benchmarking E(3)-equivariant neural networks.

So taking all of these characteristics as motivation。

our first contribution is the curation of a point-cloud data set from existing cosmological N-body simulations,

along with an easy-to-use interface for accessing the data set.

And we also introduce a JAX-based code repository,

which implements common equivariant neural network architectures all in one consistent framework.

And finally, we systematically evaluate the performance of various graph neural networks on downstream tasks using this data set, with a particular focus on equivariant models. Okay, so let's now talk about the data set in more detail.

So it's derived from an existing suite of N-body cosmological simulations called the Quijote simulations.

And these simulations take as input a set of five cosmological parameters that describe the conditions of the universe。

like say the matter density or the rate of expansion。

And then they follow the evolution of millions of dark matter particles in a periodic volume。

And it's important to note that since this volume is periodic。

what this means is that a point that's say on the very right edge of the box is actually very close to a point that's on the left edge。

and so it'll be very important for our models to also account for these periodic boundary conditions when computing quantities like distances。

And so by uniformly sampling these five parameters from some pre specified ranges。

we can obtain different spatial distributions of dark matter corresponding to different possible universes。

And these simulations are very computationally expensive to run。

and this emphasized the need for simulation efficient methods to extract information from these resulting data sets。

And in the subset of the data that we collected from these simulations, we have a total of 12,384 point clouds in our data set. And so we next post-processed the simulations to identify dark matter halos,

which are gravitationally bound structures that host galaxies。

and then these halos are identified using an existing halo finding algorithm。

And we then select the 5,000 most massive halos in each simulation to construct our point clouds。

And each point in the point cloud comes with a set of node wise features, which are the position。

velocity and angular momentum vectors, as well as the mass of each dark matter halo。

And also each point cloud is labeled with a set of global features corresponding to the inputted cosmological parameters。

And the two parameters that we focus on later in our benchmarking study are the matter density, Omega_m,

and the sigma_8 parameter, which is the root-mean-squared matter fluctuation averaged over a sphere of fixed radius.

And so for example, here we see two example point clouds in our data set with different parameter values。

and they're especially different in Omega_m. And in particular,

we see that the point cloud with low matter density on the left appears to be more clustered than the one with larger Omega_m on the right.

And for each point cloud in our data set, we also compute the two-point correlation function, or 2PCF, summary statistic, to use as a baseline for benchmarking our models.

And I'll discuss this quantity a bit more in the next slide。

So the two point correlation function is essentially a measure of the excess probability of finding pairs of galaxies at a given distance R apart relative to a random distribution。

So on the right, we see the corresponding 2PCF plots for the example point clouds.

So the one on the left, with smaller Omega_m and hence more clustering, has higher 2PCF values at larger r than the one on the right.

So essentially the values of the 2PCF at large r capture long-range information, while those at small r capture information on small scales.

And so how do we represent this function as an input to our models? Well,

we can represent it as a vector by creating bins for increasing values of R and then computing the excess probability within each bin。

And so in practice, we actually use just 24 logarithmically spaced bins to construct the vector for each point cloud。
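As a rough illustration of how such a binned 2PCF vector can be computed for a periodic box, here is the simple "natural" estimator, DD/RR − 1, with analytic randoms; the paper's exact estimator and bin edges may differ, and the names below are placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

def two_pcf_vector(pos, box, n_bins=24, r_min=0.5, r_max=150.0):
    """Binned 2PCF xi(r) for points pos (N, 3) in a periodic cube of side `box`.

    r_max should stay below box / 2 so periodic pair counting remains valid.
    """
    edges = np.logspace(np.log10(r_min), np.log10(r_max), n_bins + 1)
    tree = cKDTree(pos, boxsize=box)                          # periodic distances
    cum = np.array([tree.count_neighbors(tree, r) for r in edges], dtype=float)
    dd = np.diff(cum) / 2.0                                   # unordered pairs per shell
    n = len(pos)
    shell_vol = 4.0 / 3.0 * np.pi * np.diff(edges ** 3)
    rr = 0.5 * n * (n - 1) * shell_vol / box ** 3             # expectation for a uniform field
    return dd / rr - 1.0                                      # one xi value per bin
```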

And we actually use these two PCF vectors as a baseline for our graph level prediction tasks where we train an MLP on these handcrafted summaries。

and then we compare the performance against our GNN models and test their ability to extract more information than traditional summary statistics。

So let's now look at the benchmarking tasks that we consider. The first set of tasks is graph-level parameter prediction,

where, given the positions of the galaxies as inputs,

the task is then to recover the scalar cosmological parameters that served as input to the simulation。

And we focus on predicting Omega_m, which depends on long-range correlations, and on sigma_8,

which depends on short-range correlations. And we additionally consider a node-level task where,

again, given only the positions, the task is to predict the velocity vector at each point。

And this is a pretty challenging task, since we're only considering the 5,000 heaviest halos in our data set. So to benchmark GNNs on these tasks,

we first convert the point cloud into a graph by adding edges to the K nearest neighbors for each node。
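A hedged sketch of this graph-construction step, using SciPy's periodic KD-tree so that the box's periodic boundary conditions are respected; k and the box size are placeholders, not the paper's settings:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_edges_periodic(pos, box, k=20):
    """Sender/receiver index arrays for a k-NN graph under periodic boundaries."""
    tree = cKDTree(pos, boxsize=box)
    _, nbrs = tree.query(pos, k=k + 1)     # k+1 because each point's nearest neighbour is itself
    senders = np.repeat(np.arange(len(pos)), k)
    receivers = nbrs[:, 1:].reshape(-1)
    return senders, receivers

def periodic_displacement(pos, senders, receivers, box):
    """Minimum-image relative positions, useful as rotation-covariant edge features."""
    d = pos[receivers] - pos[senders]
    return d - box * np.round(d / box)
```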

And all the models that we benchmark are essentially versions of the same core message passing framework。

which can be represented as the classic edge-level, node-level,

and optionally graph-level update functions. And so we benchmark the following commonly used message-passing architectures,

three of which are E(3)-equivariant, and this means that they satisfy the condition that when you apply a transformation such as a translation,

rotation, or reflection to the input, the output is transformed accordingly.

And each of these architectures achieves this constraint differently。

which I won't go into detail about for the sake of time,

but I will say that one thing to note is that the SEGNN and NequIP both use steerable feature representations that rely on spherical harmonics,

and they both use this parameter l_max to control their maximum degree. And finally,

we also compare our models to PointNet++, which operates directly on the point clouds and processes them in a hierarchical fashion.

And we also found that the use of radial basis functions as edge features was actually vital to the performance of our equivariant models。

And so what this does is essentially transform the scalar distances into

a richer, higher-dimensional representation by projecting the relative distances onto a basis of n radial Bessel functions with some fixed radial cutoff c.

And we use Bessel functions because they were also used in NequIP.
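For concreteness, the sinusoidal radial Bessel basis popularized by DimeNet and NequIP looks like the following; the exact normalization and any smooth cutoff envelope may differ from what was used here, so treat this as a sketch:

```python
import numpy as np

def bessel_rbf(r, n_basis=8, cutoff=150.0):
    """Expand distances r onto n_basis radial Bessel functions.

    e_n(r) = sqrt(2 / c) * sin(n * pi * r / c) / r, for n = 1..n_basis.
    """
    r = np.clip(np.asarray(r, dtype=float), 1e-9, None)[..., None]   # (..., 1), avoid r = 0
    n = np.arange(1, n_basis + 1, dtype=float)                       # (n_basis,)
    return np.sqrt(2.0 / cutoff) * np.sin(n * np.pi * r / cutoff) / r
```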

Okay, so for our benchmarking experiments, we trained these models on the graph-level and node-level prediction tasks, for which we use a randomly sampled subset of 2,048 point clouds for training,

as well as validation and test sets of size 512。And the lowest mean squared error that's achieved for each task is shown in bold in the table。

So, looking across all three tasks, equivariant models outperform the non-equivariant GNN and PointNet++. However,

for the matter density prediction task that requires long-range information,

none of the models beat the baseline, which is simply training an MLP on the 2PCF vector.

And this is perhaps surprising, since these models should be able to capture higher-order correlations compared to the 2PCF,

but it is also perhaps unsurprising, since message-passing GNNs are known to struggle with propagating long-range information.

And so we can actually probe this further by seeing which information is present in the 2PCF that is not also captured by the learned graph embedding,

and we do this by concatenating the 2PCF vector with the graph embedding in the final readout MLP.

And so the 2PCF becomes an additional global feature in this way.

And so we do this for our best-performing model, which is the SEGNN with l_max equal to 2,

and unsurprisingly, since there's strictly more information being fed into the model now。

the performance improves over both the SEGNN and the 2PCF individually.

But we can take it a step further and only consider using parts of the two PCF。

namely the components corresponding to small r and large r. And breaking it up this way,

we see that using only the small-scale components does not help the Omega_m prediction task,

but when we use the large scale components, the performance is almost as good as when using the full vector。

And so this essentially verifies the fact that the missing information in our models is precisely the long range correlations。

And I think it's really nice that this data set and this type of analysis gives us an easy way to test whether a model captures the long range information or not。

And also, given the importance of simulation efficiency for these cosmological tasks。

we also benchmark our models across different training set sizes。

going from only four simulations to nearly the full 12,000。

And we see that the SEGNN models in particular show better performance across all training sample sizes。

And NequIP also appears to do well in both the really data-scarce and the asymptotic regimes.

So just to summarize the takeaways: we curate a data set consisting of simulated galaxies whose spatial distribution is informative of the underlying cosmological model.

And we show that both the graph-level and node-level tasks can benefit from the use of equivariant models,

which were also found to be more simulation-efficient. And additionally,

the 2PCF summary statistic outperformed the GNNs in inferring a parameter that is sensitive to long-range correlations.

And therefore we think this benchmark would be a great target for methods that aim to mitigate these issues associated with long-range information preservation;

for example, it might make sense to test transformers on this as future work. And so with that,

I'd like to thank my amazing collaborators and mentors。

and also to plug our repo on GitHub, where you can find the link to the data set and the implementations of all of the previously described methods in JAX.

So that's it for me and thanks for listening。Thank you so much for the enlightening talk and very interesting benchmark。

There is some discussion among the audience. So one of the questions was about the existence of dark matter particles,

but I think one of your co-authors has answered it. So, well, I guess if there's no follow-up question, and since we are short on time,

we have another question about the results: the tables show that the EGNN model fails drastically; is there any intuition why?

Yeah, so I think that it might be due to a lack of model capacity,

like it has the fewest parameters of all these models。

but it also could just be not enough time spent on hyperparameter tuning. So I'm not sure, yeah.

So you mentioned there are two takeaways from the results: the first one, equivariance matters, and the second one, long-range information also matters;

do you have any idea how we can tackle both by combining them? Because I think long-range dependence is a well-studied topic in the GNN community as well.

Yeah, yeah, I definitely think it makes sense to try something like an equivariant transformer, and I think we actually have an implementation of one in our repository, but due to time constraints we didn't include it in the benchmark,

so I think that would be a good combination of the two. Yeah, thank you so much for the talk.

Okay, that's a reasonable question: so PointNet++ is long-range, but not equivariant.

Yeah, that's right, yeah. PointNet++ is also nice because it processes the data in this hierarchical way,

so I think that's also something worth exploring, because it seems unsatisfying to try to basically pool the node representations from like 5,000 nodes into just one vector at the end. So yeah, thank you so much for the talk. Thank you.

Now let's welcome our third presentation, from Saloian. Yeah,

and the paper title is On the Equivalence of Graph Convolution and Mixup.

Hello, everyone. Can you see my screen? Yes, okay. Yeah, thank you everyone for attending this talk at this special time;

happy holidays. To begin, I'll start my presentation on the equivalence of graph convolution and mixup;

this paper was a collaboration with co-authors from Case Western, Meta AI, UCLA, and Rice.

Yeah at the beginning I would like to introduce the basic idea of this paper。

We build a connection between graph convolution and mixup. They are two different deep learning techniques, but based on our observation they have an inherent relation. Yeah, let's get started.

In case some of the audience are not very familiar with graph convolutional neural networks, I'll introduce the original graph convolution a little bit. At the beginning we have a graph, we have a target node A here, and when we do graph convolution it is actually a neighbor aggregation: it aggregates the neighbors' features into A's feature. In the very original graph convolutional network we just average, maybe do a weighted average, and then we get an updated node feature for A, which is highlighted in green; the feature after aggregation is [4, 3] here.

So it is acknowledged that graph convolution enhances or enriches the representation of the target node A, because it aggregates the features from its neighbors; in this way the neighbors' features enhance or enrich the target node. This is why we previously thought graph convolution works. In this paper we found another explanation.

Suppose we focus on the supervised node classification task in the left figure. Suppose we only have a one-layer graph convolutional network and we consider the label of node A; here the label is [0, 1, 0], a one-hot label, so node A belongs to the second class. So this is a one-hot label, and we use this label to supervise the aggregated feature, [4, 3] here.

But when we consider the idea of neighbor aggregation, or graph convolution, the [4, 3], the aggregated feature, is obtained from the original features of A, B, C, D, highlighted in orange. So, because we use the label [0, 1, 0] to supervise the aggregated feature [4, 3], and the [4, 3] is obtained from the original four features, in some way this also means that we use the label [0, 1, 0] to supervise the original node features, the original node features of A, B, C, D, in orange. So this is the basic idea of our paper.

This means we use node A's label to train nodes B, C, D's features, so in this way we use the target node's label to train its neighbors' features. This is why we think graph convolution works so well.

Yeah, based on this, we have such an equivalent training data set. In the left figure, this is the original graph convolutional network: we use the label [0, 1, 0] to supervise the aggregated feature, and we get equivalent training data; actually this data has four samples. The first one is node A with label [0, 1, 0], then a neighbor node with its original feature and the label [0, 1, 0] as well, and the same for nodes C and D. So, based on this, we ask: why does graph convolution work?

This is because the label-less neighbors of a target node are trained with the target node's label. So this means A's neighbors are trained with A's label, and this is why we think graph convolutional networks work. So based on this basic idea, we build the connection to the mixup technique: when we do mixup with the equivalent training data, we have four data samples here;

we do mixup on these four samples and we get an updated sample, a new feature and a new label, and we can see this is the same as the updated feature of A. So, after the mixup, node A's feature and label are the same as after graph convolution, and this is the relation between graph convolution and mixup. Yeah, based on this basic idea, we propose our claim: graph convolution is equivalent to a mixup strategy. So this is, I think, a very simple idea.

This is the whole framework of our work. At the beginning we do a homophily relabeling, which means we assign the target node's label to its neighbors.

Maybe I need to introduce the graph a little bit first: we have four nodes here and the target node is the blue node. Only the target node has a label, y_0 here. At the first step we relabel the neighbors, the red node and the green node, with the label y_0 as well, the same label as the target node, and then based on this we do a mixup. We can see that finally the features and the labels of the target node, the blue node, are the same after graph convolution and after mixup. So based on this simple strategy, or assumption, which we call homophily relabeling, we build the connection between graph convolution and mixup.

So the main results of our paper are here: graph convolution is equivalent to mixup, at training time as well as at test time, under two mild and reasonable conditions, homophily relabeling and test-time mixup. I will introduce these briefly later in the experimental part.

In the previous slides we showed the basic idea of our paper intuitively; actually, mathematically, graph convolution and mixup are also very similar. We can see the first two equations: the first line is the mathematical expression of graph convolution. We update the feature based on the target node's neighbors and take the average, but we keep the target node's label as the original one, and then we get the updated feature. For mixup there is only one difference: we also mix up the labels, we also mix up y. So this is the difference. So when we reformulate the graph neural network as a mixup, all we need to do is relabel the neighbors' labels to the target node's label, which I highlight in red in the second equation. So once we do this homophily relabeling, setting the target node's neighbors' labels to be the target node's label, we totally reformulate graph convolution as a mixup. Yeah,

so in this slide we built the connection between graph convolution and mixup mathematically, under one condition, homophily relabeling.
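A hedged write-up of the two expressions being compared, with notation chosen to match the verbal description rather than the paper's exact equations: for a target node \(i\) with neighbourhood \(\mathcal{N}(i)\) and uniform weights \(\lambda_j\),

```latex
\text{graph convolution:}\quad
\tilde{x}_i = \sum_{j \in \mathcal{N}(i)\cup\{i\}} \lambda_j\, x_j
\ \ \text{supervised by } y_i,
\qquad
\text{mixup:}\quad
(\tilde{x}, \tilde{y}) = \Big(\sum_{j} \lambda_j\, x_j,\ \sum_{j} \lambda_j\, y_j\Big).
```

With homophily relabeling, every neighbour is assigned \(y_j := y_i\), so \(\tilde{y} = y_i\) and the mixup sample coincides with the graph-convolved training pair.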

Yeah, next I show some experimental results. It's a very simple experiment, just to verify whether our claim is correct or not: this is the prediction accuracy for the node classification task on three very popular graph data sets, Cora, Citeseer, and PubMed.

We only focus on two accuracies here. On these three data sets, the average accuracy of the graph convolutional neural network is 83-point-something, but when we use the MLP together with the mixup, we get a similar result, 83.57. So the average accuracies on these three data sets are quite similar. So in this slide we connect graph convolution and mixup experimentally.

And in this experiment we show the results when we use homophily relabeling with a multi-layer perceptron, versus the results of graph convolution. We only focus on the red line and the green line: the red line is the result of the original graph convolutional neural network, and the green line is the MLP using homophily relabeling. We simply relabel the neighbors, apply mixup on this new data, and then train a multi-layer perceptron. We can see the accuracies are quite similar. The x-axis is the ratio of training data, we use different ratios of training nodes, and the y-axis is the node classification accuracy. We can see that on these four data sets, homophily relabeling achieves similar performance to graph convolution.

And we also propose a test-time mixup, again compared against graph convolution. Test-time mixup means that we train a multi-layer perceptron using the original data, only the originally labeled nodes in the graph, and then we do mixup at test time: we train the multi-layer perceptron and then reuse the trained weights in a graph convolutional neural network to do inference. We see the results, again the red line and the green line, and we can see that the green line and the red line are quite similar, so in this experiment we show that test-time mixup also achieves similar performance to graph convolution.

Actually, this phenomenon was also found by some previous works, as noted at the bottom of this slide. And I also did some experiments to investigate test-time mixup further, basically focusing on the decision boundary. We focus on the figures on the left-hand side: these are basically two classes, and each point here represents one node. The left figure is the result of the MLP, and this figure is the MLP plus test-time mixup.

We can see they have some difference here。we plot the decision boundary for different classes。

we focus on the second rules, and we can see after test time mix up, the nodes。

the representation of nodes become more and more similar。And as well for the for the class zero。

it has a similar phenomena and in this way, it's made the note。Far away from the decision boundary。

it will in this way it will increase the accuracy and as well on the figures in the right hand we have the similar results。

个 this this figure this。This rule。This column is the class。Is it a class?Is a class one。

And then this this rule is the results of multi layerer procedure。

and this rule is the this column is the test time mix up with MLLP。

So we can see is also get the similar results is make the different class far away more far away from the decision boundarying。

Also, in the last experiment, we combined the homophily relabel and test-time mixup together. We can see the results of the graph convolutional neural network and of the mixup strategy, again in red and in green. Combining these two strategies together, the results are quite similar, except for one dataset here, which gives a somewhat different result; but for the others it's quite similar, practically even the same, across different training data split ratios. Yeah.

In this presentation I presented the basic idea of our paper: we build a connection between graph convolution and mixup. For more details you can refer to our paper, which you can get via the QR code. We did a lot of other experiments in the paper as well, and we also provide some mathematical analysis. I can take questions.

Thank you so much for the great talk, it's a really interesting connection between graph convolution and mixup.

😊 We also want to add that this talk is our first LoG TMLR-track oral, so congratulations. Thank you. There's one question: does your definition of GCN in the experiments use residual connections and batch or layer norm between layers? Oh, we don't, because 😊 in this paper we only investigated the original graph convolutional neural network, so we only use the normalized adjacency matrix, multiply it by the node features, and then apply an activation function; this is one layer. We don't use other strategies because we want to keep it simple, and this is also the original formulation of graph convolution.
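For reference, the one-layer formulation being described is the standard Kipf-Welling graph convolution (written out here by me; the slide itself is not reproduced in the transcript):

```latex
H = \sigma\!\left(\hat{A}\,X\,W\right), \qquad
\hat{A} = \tilde{D}^{-1/2}\,(A + I)\,\tilde{D}^{-1/2},
% where X are the node features, W the learnable weights, \sigma a nonlinearity,
% and \hat{A} the symmetrically normalized adjacency matrix with self-loops.
```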

Got it. And do you have any intuition on how the connection with mixup can contribute to improving GCN architectures or proposing new architectures to tackle bottlenecks, for example in heterophilic settings? Okay, this is a great question. I think the benefit of this relation, or connection, is first that we can use mixup to accelerate graph convolution, because originally we need to use the whole adjacency matrix, which is a computational bottleneck, but when we use mixup we only need to train a multi-layer perceptron. This is one benefit. And for heterophilic graphs, you mean, right? Actually I didn't do experiments on heterophilic graphs, but you can see from the mathematical connection between graph convolution and mixup that we do not assume the graph must be homophilic. So I think for heterophilic graphs we need to do more investigation and experiments; maybe it extends to that kind of graphs as well.

Thank you very much again for the fantastic talk。😊。

Thank you and I think this wraps up today's oral presentation thank you everyone for joining and sorry for the delay at the start。

😊 Thank you again for the talk. Thank you so much, happy Thanksgiving. 😊 You too, thank you.

Ah, before we close, 😔 I'm not sure what the best thing to do is,

but I'm guessing it's better to like stop the YouTube live stream and stop the record。

Graph Machine Learning Conference: P06: Xavier Bresson Keynote and Oral Presentations

In this session, we cover Professor Xavier Bresson's keynote at the Learning On Graphs Conference, along with the core content of the three oral presentations that followed. We will discuss combining large language models with graph neural networks, graph signal processing for non-equilibrium dynamics, benchmarking equivariant models on cosmological datasets, and the theoretical connection between graph convolution and the MixUp data augmentation technique.

Overview

The keynote and oral presentations cover frontier directions in graph machine learning. First, Professor Xavier Bresson discusses how to combine the strengths of large language models and graph neural networks to handle tasks on text-attributed graphs. Then, three oral presentations showcase innovative applications of graph methods in biophysics, cosmology, and theoretical analysis. We walk through the core ideas, methods, and contributions of each work.


Combining Large Language Models and Graph Neural Networks

In the previous section we outlined the content of this session; in this section we look at how Professor Xavier Bresson analyzes the strengths and weaknesses of large language models and graph neural networks and proposes ways to combine them.

Large language models have limitations on graph tasks. Although they hold vast amounts of knowledge, their logical reasoning is limited and they require precise prompts to work. Even when they have seen the test set of a graph task during training, their ability to handle graph structure remains insufficient.

In contrast, graph neural networks have a natural advantage for graph-structured data. For example, for a graph containing relationships between people and cities, when asked "Is Bob, the friend of Mona and Alice, in the same city?", a GNN can learn, through multiple layers of message passing, the multi-hop path connecting the answer to the question. GNNs are very effective across many domains, including text, physics, biology, combinatorial optimization, and chemistry. The 2024 Nobel Prize in Chemistry was awarded for AlphaFold, whose core is an equivariant Transformer predicting pairwise distances between residues of an amino-acid sequence, which is essentially a graph neural network.

However, GNNs also have limitations. There is currently no graph foundation model at the scale of those in natural language processing or computer vision. The crux lies in data scale, hardware optimization, and industrial applications. Existing datasets (e.g., OGB) are still small compared with image datasets (e.g., ImageNet's 150 GB). Hardware for sparse algebraic operations is not optimized and runs far slower than standard dense matrix operations. Existing pre-trained GNNs have parameter counts in the millions rather than billions. Moreover, industry has not yet found compelling, profitable applications for GNNs, which limits investment in related research.

Combining LLMs and GNNs means developing a jointly trained text-and-graph foundation model, a very attractive and promising idea. The current challenge is the huge imbalance between textual knowledge (from LLMs) and graph knowledge (from GNNs). The ideal architecture would process text with an LLM to obtain vectors, process the graph with a GNN to obtain vectors, and then jointly process these vectors with self-attention or cross-attention to generate text. However, reconciling two fields that differ so much is very challenging.

We therefore need tailored ways of combining LLMs and GNNs to extract value. For example, the broad knowledge of an LLM can be exploited to improve performance on small text-attributed graphs. Conversely, a knowledge graph can be used to constrain an LLM so that it generates more precise answers and hallucinates less.

Here are two concrete ways to combine them:

  1. Using LLMs to enhance GNNs: use the knowledge and reasoning ability of an LLM to improve the quality of node features in the graph, thereby improving the GNN's predictions.
  2. Using GNNs to enhance LLMs: use a graph (e.g., a knowledge graph) to constrain and regularize the LLM's responses so that they focus on graph-relevant information and hallucinate less.

Next, we introduce these two technical paths in detail.


Technique 1: TAPE - Enhancing GNNs with LLM Knowledge

In the previous section we introduced two ways of combining LLMs with GNNs; in this section we look at the first technique, TAPE. Its core idea is to use the LLM's knowledge to improve the quality of node features in a text-attributed graph, so that the GNN can make more accurate predictions.

The key question is how to extract task-specific information from the LLM for a given graph task. Our approach is to obtain the LLM's prediction and its reasoning through prompt engineering. For example, given the title and abstract of an academic article, we ask the LLM not only to predict its category but also to give the reasoning or explanation behind that prediction.

This gives us a title, an abstract, an explanation, and a prediction, all consisting of word sequences. However, a GNN cannot process word sequences directly. We need a mapping that turns an input word sequence into a D-dimensional vector, summarizing the information and strengthening the node feature representation. In this example, a node represents an article, and the goal is to predict the relationships between nodes or the node categories.

We propose a technique that combines closed-source and open-source LLMs. Closed-source LLMs (e.g., GPT, Claude) usually perform better, but they only provide text sequences, not internal vectors, which makes them hard to use for training a GNN. Open-source LLMs (e.g., LLaMA, Gemma), by contrast, provide not only text but also all internal information such as hidden vectors.

Our scheme is: use the best closed-source LLM available at the time (e.g., GPT-3.5) to process the articles and obtain text sequences for the explanation and the prediction. Then, use a smaller open-source language model (e.g., BERT, 129 million parameters) to turn these text sequences into vectors. Concretely, feed a sequence into the Transformer, take the representation of the [CLS] token after L Transformer layers, and fine-tune it together with a small multi-layer perceptron on the training set. This MLP is trained against the correct class labels, and finally outputs a task-informed, customized D-dimensional vector as the enhanced node feature. We can generate such features separately for the explanation, the title, the abstract, and the prediction.
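A minimal sketch of that feature-extraction step, assuming a HuggingFace BERT encoder (the model name, hidden size, and the 40-class head are my illustrative choices, not necessarily those of the paper):

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Encode LLM-generated text (title/abstract/explanation) with BERT and
# fine-tune a small MLP head on the node-classification labels.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

class ClsHead(nn.Module):
    def __init__(self, dim=768, num_classes=40):  # 40 classes as in ogbn-arxiv (assumption)
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, num_classes))
    def forward(self, cls_vec):
        return self.mlp(cls_vec)

head = ClsHead()
texts = ["<title + abstract + LLM explanation for one paper>"]  # hypothetical input
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
cls_vec = encoder(**batch).last_hidden_state[:, 0]  # [CLS] token after the last layer
logits = head(cls_vec)                              # trained with cross-entropy on the known labels
node_feature = cls_vec.detach()                     # enhanced feature later handed to the GNN
```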

In 2024, one can replace the small BERT model with a large language model (e.g., LLaMA 2) and use LoRA for efficient fine-tuning, which is feasible with the limited GPU resources available in academia.

Once the enhanced node features are available, you can use them to train any GNN model you like and make predictions.

We compared the quality of different node features. On the OGB-arXiv dataset, the predefined bag-of-words features used as a baseline reach 70% test accuracy in 4 minutes, which is a good performance reference. The then state-of-the-art model GLEM trains the language model and the GNN jointly and reaches 76.5% accuracy, but needs 9.2 hours of training, a huge performance/compute trade-off.

Our TAPE method, which prompts the LLM, turns the outputs into vectors, and then fine-tunes, reaches 75.5% accuracy with a short training time. After publication this technique topped the leaderboard, and subsequent techniques adopted a similar approach. Even today, the top three models on the OGB-arXiv leaderboard are essentially based on this idea.

To verify that the results are reliable (and to avoid the objection that the LLM may have seen the test set), we created a new dataset, TAPE-arXiv-23, containing 77,000 papers; the conclusions remain unchanged. This again shows that even if an LLM has seen the test set, without graph-structural reasoning it still needs a GNN to exploit the topology in order to make good predictions.

Ablation studies show that no single feature is better than the others; it is the combination of features that matters most.

Conclusion of this work: we can use the knowledge of an LLM and its reasoning ability to enhance node features of text-attributed graphs and tailor them to a specific task. Our method is not end-to-end: we first generate good features and then train the GNN. This matches the current trend in LLM training (instruction tuning, reward modeling, reinforcement learning), where each step is done separately, which is stable and efficient. The method can exploit the strengths of both closed-source and open-source LLMs, and it is interpretable because you can see the LLM's reasoning.


Technique 2: G-Retriever - Enhancing LLMs with Graphs

In the previous section we learned how to enhance a GNN with an LLM; in this section we look at the reverse direction: how to use a graph to enhance an LLM and reduce its hallucinations. This technique is called G-Retriever.

LLMs are powerful but can go wrong with poorly chosen prompts. We need to constrain their responses to a more precise space. To this end, we use a text-attributed graph (such as a knowledge graph) to force the LLM's answers to be grounded in this graph.

The key question is how to extract relevant information from the graph and make the LLM more focused. Our approach is to represent all information as input tokens for the Transformer. The input includes not only query tokens but can also include visual tokens and graph tokens. Here, we use two kinds of tokens: a graph-encoding token and textualized graph tokens.

Graph-encoding token: for graph learning, representing the graph as a single vector is natural. We can pick any graph encoder GNN, apply several layers of graph learning to compute deep node hidden features, average over all node features, and pass the result through a small MLP to obtain the graph-encoding token. This D-dimensional vector summarizes the graph's topology and node features.

Textualized graph tokens: LLMs process information through word tokens. To exploit the LLM's knowledge, we need to convert the graph and its features into a sequence of natural-language tokens. For example, a graph can be described in text as "graph G is a set of directed edges defined as ...". However, the way a graph is expressed in language is not unique, which can cause problems. Moreover, the textual representation is not invariant (changing the description order may give different results), and it has scalability issues (e.g., LLaMA 2's context window is limited to roughly 4,000 tokens, which cannot hold large graphs). LLMs also tend to hallucinate, possibly generating nodes or edges that do not exist in the graph.

Our proposed G-Retriever consists of four steps:

  1. Graph retrieval-augmented generation: first, retrieve a subgraph relevant to the user query from a possibly very large text-attributed graph.
  2. Token construction: concatenate the user query, the graph-encoding token, and the textualized graph tokens to form the LLM's input sequence.
  3. Answer generation: the LLM generates the answer from the input sequence.
  4. Joint training: train the GNN parameters and fine-tune the LLM parameters with LoRA at the same time. With LoRA, only 0.5% of the 7 billion parameters (i.e., 35 million) need to be fine-tuned; adding roughly 5 million GNN parameters, the total is affordable for academia.

Graph retrieval-augmented generation: its core challenge is scalability. Our solution is as follows:

  • Indexing: for a text-attributed graph, use a pre-trained, frozen (large) language model to produce D-dimensional vector representations for all nodes and edges, and store them in a graph database (e.g., Neo4j).
  • Retrieval: given a user query, represent it as a vector with the same language model, then use a metric such as cosine similarity to retrieve the top-K relevant nodes and edges from the graph database, forming a "noisy" subgraph (a minimal retrieval sketch follows this list).
  • Refinement: to extract an information-dense, smaller subgraph, solve a weighted Steiner tree problem. The problem maximizes the total weight (importance) of the selected nodes while penalizing the cost of the solution (number of nodes). It is NP-hard, but can be solved approximately with semidefinite programming, yielding a directed tree. The problem can easily be modified to incorporate edge-importance information, and a fast, linear-time approximate solution is available.
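A minimal sketch of the retrieval step (my own illustration, not the official G-Retriever code): embed the query with the same frozen text encoder used to index nodes and edges, then keep the top-K most similar nodes as the noisy subgraph.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query_emb: torch.Tensor,   # (d,) query embedding
                  node_embs: torch.Tensor,   # (num_nodes, d) indexed node embeddings
                  k: int = 10):
    # cosine similarity between the query and every indexed node
    sims = F.cosine_similarity(query_emb.unsqueeze(0), node_embs, dim=-1)  # (num_nodes,)
    topk = torch.topk(sims, k=min(k, node_embs.size(0)))
    return topk.indices, topk.values  # node ids and similarity scores

# usage (hypothetical tensors):
# idx, scores = retrieve_topk(query_emb, node_embs, k=20)
# the retrieved nodes/edges are then refined with the Steiner-tree step described above
```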

Compared with standard RAG (which retrieves relevant documents), graph RAG retrieves a smaller but highly relevant subgraph from the graph database.

Back to the LLM's input tokens: the graph-encoding token is as described above; the textualized tokens include the text sequence of the user query and the textualized sequence of the subgraph. These tokens pass through the word-embedding layer and are fed into the LLM.

Training is standard: given the input tokens, pass them through the Transformer layers and generate the response autoregressively. Using the ground-truth answers from the training set, compute the cross-entropy loss and backpropagate to update the GNN and LLM parameters.

Summary: G-Retriever consists of four steps: graph RAG (for subgraph retrieval), graph-encoding token computation, response generation, and joint model training. We combine the strengths of each component in a very natural way. To evaluate the task, we created new benchmark datasets. The main results show that G-Retriever beats baselines that use only an LLM, only GNN prompt tuning, or only LoRA fine-tuning of the LLM. In terms of scalability, it reduces the number of tokens/nodes used by 83% to 99%. More importantly, it effectively reduces LLM hallucinations. Ablation studies show that the graph-encoding token from the GNN and the textual tokens from the graph contribute comparably; the two are complementary and jointly improve performance.

Conclusion: to unlock graph capabilities in LLMs, graphs also need to be represented as tokens. Combining LLMs, GNNs, and graph RAG provides excellent performance. G-Retriever is effective, efficient, and mitigates hallucination. The technique has been integrated into the PyTorch Geometric library. Looking at the history of search engines (PageRank -> Word2Vec -> language models -> RAG -> Gemini), the next step is to integrate GNNs and text-attributed knowledge graphs into web search engines, making them fully learnable end-to-end systems. Future work will also bring in image information, representing image patches as visual tokens with a vision Transformer and processing them together with text and graph tokens.


Oral Presentation 1: Decomposing Force Fields into Flows on Graphs Reconstructed from Stochastic Trajectories

In the previous section we discussed combining LLMs and GNNs; in this section we move to the first oral presentation and see how graph signal processing techniques can be used to analyze non-equilibrium stochastic processes.

This presentation is concerned with stochastic processes and non-equilibrium steady states. A stochastic process in continuous space is described by a stochastic differential equation with a drift term and a diffusion term. While this describes the dynamics of a single trajectory, the time evolution of its probability density follows the well-known Fokker-Planck equation. When the density reaches a steady state, the process is stationary. A steady state can be an equilibrium state (thermal equilibrium) or a non-equilibrium state, depending on whether the probability current is zero.

Non-equilibrium processes are important in many fields, especially in biology, because biological systems avoid reaching thermal equilibrium (i.e., death) by maintaining themselves in non-equilibrium steady states. Examples include non-equilibrium fluctuations of the red-blood-cell membrane, human brain dynamics, and cell motility.

Helmholtz-Hodge decomposition: any non-equilibrium stochastic process can be decomposed into a reversible part and an irreversible part. The reversible part is stochastic, with drift balancing diffusion; the irreversible part drives the system to circulate around the stationary distribution. Under the assumption of isotropic diffusion, the process is reversible if and only if the drift is conservative (i.e., a gradient flow), in which case the irreversible component is zero.

Now we shift perspective to graphs and simplicial complexes. A simplicial complex can be defined by building higher-order simplices (e.g., triangles) from cliques. An edge flow is an alternating function defined on the edges. Such an edge flow can be decomposed into three parts: a gradient part, a curl part, and a harmonic remainder. This is analogous to the decomposition of the stochastic process. If the complex is a triangulation, the harmonic remainder vanishes.

How do we connect the decomposition of a stochastic process with the decomposition of flows on a simplicial complex? We build the bridge by considering the Hodge decomposition of a discrete-state Markov process. The states of a discrete-state Markov process map to the vertices of a graph, and the transitions correspond to edges. The key is to define an edge flow analogous to the drift of the SDE. Mathematically, the correct definition involves the logarithm of the ratio of the square roots of the transition rates. This ensures that the Markov process is reversible if and only if this edge flow is a gradient flow, matching the continuous case exactly.

Method Pipeline

  1. Approximation: approximate the stochastic differential equation with a discrete-state process inferred directly from the stochastic trajectories.
  2. Decomposition: decompose this discrete-state process, thereby recovering the reversible and irreversible components of the SDE from the stochastic trajectories.

The concrete steps are:

  • Start from the point cloud of the trajectory in phase space.
  • Perform farthest-point sampling to obtain a uniformly dense subsample of points.
  • Triangulate the subsampled points (e.g., with a Delaunay triangulation) to form a triangular mesh.
  • Treat each triangular region as a discrete state and estimate transition rates by counting the transitions of the trajectory between regions, thereby inferring a discrete-state stochastic process.
  • Take the dual of the triangulation, so that states correspond to vertices and transitions to edges.
  • Define the edge flow using the maximum-likelihood estimates of the transition rates.
  • Finally, decompose the edge flow into a gradient part and a curl part by solving a least-squares problem, corresponding to the reversible and irreversible currents respectively (a sketch of this decomposition follows this list).
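A rough sketch of that least-squares decomposition in standard combinatorial Hodge notation (my own write-up; B_1 is the node-edge incidence matrix and B_2 the edge-triangle incidence matrix, and sign conventions may differ from the paper's):

```latex
% The edge flow f on the oriented edges of the complex is split as
f \;=\; \underbrace{B_1^{\top}\,\phi}_{\text{gradient part}}
\;+\; \underbrace{B_2\,\psi}_{\text{curl part}}
\;+\; h ,

% with the potentials obtained from the least-squares problems
\phi \;=\; \arg\min_{\phi}\;\lVert f - B_1^{\top}\phi \rVert_2^2 ,
\qquad
\psi \;=\; \arg\min_{\psi}\;\lVert f - B_2\,\psi \rVert_2^2 ,

% and h the harmonic remainder, which vanishes when the complex is a triangulation.
```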

Numerical Validation

  • Time-reversal symmetry check: the reversible component should be even under time reversal and the irreversible component should be odd. The experiments confirm this.
  • Irreversibility measure: the fraction of the total flow carried by the curl flow is defined as the irreversibility measure. On linear and nonlinear solvable processes (e.g., the Ornstein-Uhlenbeck process, a planar limit cycle), the method correctly captures the irreversible flow growing as the driving parameter increases.
  • Nonlinear examples: on the stochastic van der Pol oscillator and the Rössler attractor, the method also qualitatively captures the increasing trend of irreversibility with the parameter.

Practical Applications
We applied the method to two biophysics cases with known ground truth:

  1. Red-blood-cell membrane fluctuations: healthy (active) red blood cells have higher dissipation (i.e., are more irreversible) than ATP-depleted (passive) ones.
  2. Human heartbeat: healthy heartbeats have higher irreversibility than arrhythmic ones.

For these univariate time-series data, we reconstructed the phase space using time-delay embedding. Since the diffusion of real data may be anisotropic, we estimated the diffusion tensor and applied a coordinate transformation so that the isotropy assumption holds. Applying our method, the results confirm that the level of irreversible flow in the healthy condition is significantly higher than in the impaired condition, consistent with the literature.

Summary: we developed a method that approximates the continuous decomposition of an SDE via a discrete Helmholtz-Hodge decomposition on simplicial complexes. The method captures irreversible currents for both solvable and non-solvable stochastic processes, and has been successfully applied to red-blood-cell and heartbeat data, confirming higher irreversibility levels in the healthy condition. This work opens new ground at the intersection of graph signal processing, stochastic dynamics, and biophysics.


Oral Presentation 2: A Cosmic-Scale Benchmark for Symmetry-Preserving Data Processing

In the previous section we saw graph methods applied to analyzing dynamical systems; in this section we turn to cosmology and a large-scale benchmark dataset for testing equivariant graph neural networks.

A central observable in cosmology is galaxy clustering, i.e., measuring the positions and properties of galaxies. Studying the spatial distribution of galaxies reveals the underlying structure and physical processes of the universe. These observations are typically very complex and high-dimensional. Traditionally, cosmologists use summary statistics such as the two-point correlation function, but these can lose information contained in higher-order correlations. This has motivated machine learning tools to extract information from such data more reliably.

Galaxy clustering data have several characteristics that make them a valuable benchmark for stress-testing GNNs and other methods:

  1. Large point clouds: each point cloud can contain more than 10^6 points, challenging scalability.
  2. Information across spatial scales: there are short-range correlations due to gravity and long-range correlations due to structure growth, requiring methods that capture both local and global information.
  3. Symmetry structure: because the universe is homogeneous and isotropic, the galaxy distribution should exhibit Euclidean symmetry (invariance to translations, rotations, and reflections), making it well suited for benchmarking E(3)-equivariant neural networks.

Based on this, our contributions include:

  • Curating a point-cloud dataset from existing cosmological N-body simulations and providing an easy-to-use access interface.
  • Introducing a JAX-based codebase implementing several common equivariant neural network architectures.
  • Systematically evaluating various GNNs on the downstream tasks of this dataset, with a particular focus on equivariant models.

Dataset details: the data come from the N-body simulation suite called Quijote. The simulation input is a set of 5 cosmological parameters describing the conditions of the universe; the evolution of millions of dark-matter particles is then followed inside a periodic volume. We collected 12,384 simulations in total. Post-processing identifies dark-matter halos (gravitationally bound structures hosting galaxies), and the 5,000 most massive halos of each simulation are selected to build the point cloud. Each point carries features such as position, velocity, angular momentum, and mass. Each point cloud is labeled with the input cosmological parameters (e.g., the matter density Ω_m and the fluctuation amplitude σ_8). We also computed the two-point correlation function of each point cloud as a baseline.

Benchmark Tasks

  1. Graph-level parameter prediction: given the galaxy positions, predict the cosmological parameters Ω_m (which depends on long-range correlations) and σ_8 (which depends on short-range correlations).
  2. Node-level velocity prediction: given only the positions, predict the velocity vector of each point.

Method: we convert the point cloud into a graph with k-nearest neighbors. The benchmarked models include several common message-passing architectures, three of which are E(3)-equivariant. All models use radial basis functions as edge features, which is crucial for the performance of the equivariant models (a minimal construction sketch follows).
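A minimal sketch of such a graph construction (my own illustration with scikit-learn; k, the number of radial basis functions, and the length scale are arbitrary example values, not the benchmark's settings):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph_with_rbf(pos: np.ndarray, k: int = 10, n_rbf: int = 16, r_max: float = 50.0):
    """Build a k-NN graph over halo positions with Gaussian radial-basis edge features."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(pos)  # +1: each point is its own nearest neighbor
    dist, idx = nbrs.kneighbors(pos)
    src = np.repeat(np.arange(len(pos)), k)
    dst = idx[:, 1:].reshape(-1)                          # drop the self-neighbor
    d = dist[:, 1:].reshape(-1)
    centers = np.linspace(0.0, r_max, n_rbf)              # assumed RBF centers / length scale
    edge_feat = np.exp(-((d[:, None] - centers[None, :]) ** 2) / (r_max / n_rbf) ** 2)
    return np.stack([src, dst]), edge_feat                # edge_index (2, E), edge features (E, n_rbf)
```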

Experimental Results

  • On all three tasks, the equivariant models outperform the non-equivariant GNNs and PointNet++.
  • However, on the Ω_m prediction task, which depends on long-range information, none of the models beat the baseline of training an MLP on the 2PCF vector. This indicates that GNNs struggle to capture long-range information.
  • Concatenating the 2PCF vector with the graph embedding improves performance. Further analysis shows that using only the large-scale part of the 2PCF already nearly matches using the full vector, confirming that it is precisely the long-range correlation information that the models are missing.
  • In experiments with different training-set sizes, the SEGNN model performs better at all sizes, showing higher simulation efficiency.

Summary: we introduced a dataset of simulated galaxies whose spatial distribution reflects the underlying cosmological model. We showed that both graph-level and node-level tasks benefit from equivariant models, and that equivariant models are more simulation-efficient. At the same time, the 2PCF summary statistic outperforms GNNs when inferring parameters sensitive to long-range correlations, which makes this benchmark an ideal target for testing methods designed to alleviate long-range message-passing problems (such as Transformers).


Oral Presentation 3: On the Equivalence of Graph Convolution and MixUp

In the previous section we looked at the cosmology benchmark; in this section we move to the last presentation, an interesting theoretical finding: an inherent connection between graph convolution and the MixUp data augmentation technique.

The core operation of a graph convolutional neural network is neighbor aggregation: the features of a target node's neighbors are aggregated to update the node's representation, which is believed to enrich the target node's representation.

We give an alternative interpretation from the perspective of supervised node classification. Suppose we have a one-layer GCN and we focus on node A's label (e.g., the one-hot vector [0,1,0]). During training, we use this label to supervise the new feature of node A obtained after aggregation. But this new feature is aggregated from the original features of nodes A, B, C, and D. This means we are effectively also using node A's label to supervise the original features of its neighbors B, C, and D. In other words, the target node's label is used to train its neighbors' features, which may be one reason GCNs work well.

Based on this observation, we can construct an equivalent training set for the original graph convolution. For node A and its neighbors, each node gets the same label [0,1,0] as A together with its own original feature. This gives us multiple data samples.

Now, if we apply the MixUp data augmentation technique to this equivalent data, i.e., linearly interpolate the samples' features and labels, then the new feature and label generated for node A are identical to node A's feature and label after the graph-convolution update. Therefore, graph convolution is equivalent to a MixUp strategy under a specific condition (namely, relabeling the neighbors with the target node's label).

Method framework: we propose the "homophily relabel" assumption, i.e., relabeling the target node's neighbors with the target node's label. Under this assumption, applying MixUp to the relabeled data yields a target-node feature and label identical to the result of the graph-convolution operation.

Mathematical expression: the graph-convolution formula updates the node feature but keeps the node label unchanged, whereas MixUp interpolates features and labels simultaneously. If we apply the "homophily relabel" to the neighbors so that their labels equal the target node's, the graph-convolution formula can be rewritten in MixUp form.

Experimental Validation

  1. On classic citation-network datasets, an MLP with MixUp reaches node-classification accuracy similar to a GCN.
  2. Using homophily relabel with a multi-layer perceptron yields performance curves similar to a GCN across different training-data ratios.
  3. Test-time MixUp: train an MLP using only the originally labeled nodes, then at test time use the trained weights inside a GCN for inference (i.e., test-time aggregation); the performance is similar to a GCN. Experiments also show that test-time MixUp pushes the representations of nodes from different classes away from the decision boundary in feature space.
  4. Combining homophily relabel with test-time MixUp gives performance highly similar to a GCN.

Conclusion and significance: we establish a theoretical connection between graph convolution and MixUp. This connection matters because of:

  • Computational speed-up: MixUp avoids the expensive adjacency-matrix multiplication during training (only an MLP needs to be trained), which may accelerate graph convolution.
  • Theoretical understanding: it provides a new perspective on how GCNs work.
  • Extensibility: although the experiments are mainly on homophilic graphs, the mathematical framework does not assume homophily, so applications to heterophilic graphs can be explored in the future.

Summary

In this session, we covered several frontier directions in graph machine learning.

We started from Professor Xavier Bresson's keynote, which explored two technical paths for combining large language models and graph neural networks to exploit their respective strengths: TAPE (using an LLM to enhance GNN node features) and G-Retriever (using graph structure to constrain an LLM and reduce hallucination). These works demonstrate the great potential of cross-modal fusion.

The three oral presentations then showcased the breadth of applications and theoretical depth of graph methods:

  1. The first presentation uses Hodge decomposition on graphs to analyze stochastic trajectories, successfully extracting non-equilibrium signals from biophysical data (red-blood-cell fluctuations, heartbeats) and connecting graph signal processing with stochastic dynamics.
  2. The second presentation builds a cosmic-scale point-cloud dataset for benchmarking equivariant graph neural networks, revealing the current shortcomings of GNNs in capturing long-range dependencies and providing a clear target for future methods.
  3. The third presentation discovers a theoretical equivalence between the graph-convolution operation and the MixUp data augmentation technique, offering new insight into how graph convolution works and how more efficient algorithms might be designed.

Together, these works reflect the vitality and value of graph machine learning in theoretical innovation, cross-domain applications, and tackling practical challenges.

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p07 P07_Day_3__Part_1_2-Tutorial_on_Graph_Deep_Learning_for_Time_Series_Processing -BV1k9pAzpE8S_p7-

Welcome to the tutorial session on graph deep learning for time series processing, forecasting, reconstruction and analysis. You are most welcome to engage with your organizers and each other in the Zoom chat function, the Zoom Q&A, or on the tutorial discussions channel on Slack. Please note that the tutorial will be recorded and then uploaded to our YouTube channel for those who are unable to join or who would like to rewatch the session in the future; if you prefer not to be recorded, you are welcome to turn off your camera and participate in the chat.

😊,Now, without further ado, I will hand it over to our tutorial organizers。

Hi hello hello everyone thank you for being here and for attending the tutorial tutorial so I'm Andrea and I will be the speaker for the first part of the tutorial so these slides are available at the link that you can also find in the chat that we have just posted if you want to follow along。

So let's start with a bit of introduction, so there are many applications。

where basically that are characterized by data that are continuously produced over time by multiple data sources。

by multiple sensors, there are many examples of this, we have like traffic monitoring systems。

sensor networks in smart cities and also this is typical of many energy analytics applications。

but also many applications in science, engineering in general, but also financial market analysis。

for instance。And all of these applications have one thing in common that is that this time series。

these temporal data are characterized by reach dependency structure。

So the standard deep learning approach over the years to process time series in particular to forecast。

time series has been that one of training a single neural network on large collection of this related time series。

Within this framework, each time series is processed independently from the others。

while the parameters of the model are shared. The main advantage of this approach is its sample efficiency, since the model gets trained on data that come from many different sources, but the main downside is that in this way the dependencies, this rich dependency structure that might exist among the time series, are neglected.

What we will see in the tutorial is how we can use graph deep learning to go beyond these limitations of the state of the art. As you know, since you are attending this conference, graph deep learning operators and graph representations allow us to embed these dependencies directly into the processing as inductive biases, and as we will see, the resulting models can operate on sets or subsets of correlated time series while still keeping the parameters of the model shared.

Unfortunately, as we will also discuss, there are inherent challenges in applying this processing framework to data coming from the real world。

So, as anticipated, the tutorial will present advances coming from the combination of deep learning approaches for time series processing and deep learning on graphs. In particular, we have a threefold objective: the first objective is to provide a comprehensive framework to build these graph-based deep learning time series processing models, and at the same time we also want to provide methods to address the challenges and potential pitfalls in using these approaches in practice.

And we will also be providing tools and guidelines to use these approaches to tackle the real world application and also for developing new methods of this kind。

The presentation is complemented by a demo, so by a coding session that will happen in the middle of the tutorial in the middle of the tutorial and the tutorial is also available in a paper format on archive and you find the reference here。

So it's also important to say what this tutorial is not about。

so this tutorial is not about processing sequences of interactions as typically done some time when processing temporal networks。

but as anticipated here, graphs will be a tool, representational tool that we will use to represent the possibly dynamic relationships among a collection of time series。

So this is an outline: in the first part we will formalize the problem setting we will be operating on, that of correlated time series processing. We will also present how graph-based representations can be used in this context, and then we will see how to build graph neural network architectures to process this kind of data; in particular we'll talk about an important distinction regarding these models, which is the one between global and local models.

Then Ivan will be delivering software demo, so the decoding session and both Ivan and Daniela will discuss the challenges in the second part of the tutorial。


Let's start with the first part and with the problem setting。So。

For what concerns the kind of data that we want to process。

we assume to have available a collection D of N correlated time series, and each one of the time series can be multivariate, so we can have multiple observations at each time step for each time series. Together with the target time series, we can have covariates; these are the exogenous variables that might be somehow helpful in doing the processing, and these covariates can be dynamic but also static, in which case we talk about attributes. We will use capital letters to denote the stacked representations across the collection of time series, and refer to the dimension that spans the collection as the spatial dimension.

So what are correlated time series, why the correlated in the name。

so we assume that data are generated by a time-invariant stochastic process and that there exist Granger-causality-like dependencies among the time series,

which means basically that the forecast, the predictions for one specific time series can be made more accurate by taking into account observations that related the time series in the collection。

Furthermore, we also assume that all of these time series are homogeneous, so that the observables for each time series share the same physical meaning,

And that the observations happen synchronously over time and also regularly。

we don't assume instead that these observations are generated by the same process for each time series。

And so it's important to say that while this assumption might sound very restrictive in practice。

they can usually be met with appropriate preprocessing steps, and they can also be relaxed as we will see in the second part of the tutorial.

Now we can look at an example. Again, traffic monitoring systems are a classic example of this kind of scenario: you can imagine having many sensors spread around a traffic network, and each sensor will acquire some measurements regarding the traffic at its position, like the average speed of vehicles, the number of vehicles, or things like that; exogenous variables in this case can be date/time information or also weather predictions.

For what concerns the static covariates so the attributes。

you can have information regarding like the type of lane or the sensor is placed on or the presence of traffic lights or things like that。

Clearly, in these kind of scenarios, there are strong dependencies among the resulting a collection of time series。

and this dependency will reflect the structure of the road network。

So let's start to look at one of the main task, main processing task that one is usually interested in performing when dealing with the time series。

which is forecasting。And the objective here is pretty much straightforward。

so the objective is that from a window of past observations。

we want to forecast the observations for the next H time steps ahead. In particular,

we want to do that by learning a parametric model of the process generating the data。

And in particular in the tutorial, we will focus on point predictors。

while probabilistic methods can also be considered。In particular with point predictors。

these are usually trained by minimizing a cost function that measures the forecasting error, and using different cost functions you get predictors of different quantities: for instance, using the MSE, the mean squared error, you are predicting the mean of the stochastic process, while using, for example, the MAE, the mean absolute error, you get predictions of the median of the process.
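As a quick reminder of why the choice of loss determines what is predicted (a standard result, stated here in my own notation):

```latex
% Minimizing the squared error recovers the (conditional) mean,
\hat{y}^{\,\mathrm{MSE}} = \arg\min_{\hat{y}} \ \mathbb{E}\big[(y-\hat{y})^2\big] = \mathbb{E}[\,y\,],
% while minimizing the absolute error recovers the (conditional) median,
\hat{y}^{\,\mathrm{MAE}} = \arg\min_{\hat{y}} \ \mathbb{E}\big[\,|y-\hat{y}|\,\big] = \mathrm{median}(y).
```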

a first important distinction that we need to make when talking about forecasting architecture is that between local and global models。

So local models are the family of models that have been typically used in the past, for many decades, to forecast time series using statistical methods like the Box-Jenkins approach, and the idea here is that you have a model whose parameters are specific to each single time series that you want to forecast. The advantage of this approach is that the resulting model can easily handle heterogeneity in the dynamics of the time series in the collection, since it is tailored to that specific sequence, to that specific process. The downside is inefficiency, both in terms of computation, since you would have to train many different models, but also in terms of sample complexity, because you are using samples from a single time series to fit a model.

At the opposite end of the spectrum we have global models。

which is what is typically done in deep learning and here we have that we have a single model that is trained on many time series coming from different sources。

this is similar to what nowadays people call foundation models somehow。

but usually here this scope can be more limited like for instance to time series coming from a specific domain。

The advantage here, again, is clearly, as anticipated before, the sample efficiency of the approach, and this sample efficiency allows for building more complex models,

more complex architectures that can be used to process the input time series。

The common downside of these two approaches at least in their standard implementations。

is that they both neglect the dependencies that might exist among the time series。

There are few methods that have been used in the literature to to deal with this beyond the the graph representations we will be talking about later and the the naive option。

The straightforward thing to do would be to consider the input collection of time series as a single, very large multivariate time series, but this clearly has severe scalability issues, since it suffers from the curse of dimensionality, so it results in high sample complexity and poor computational scalability.

We can instead consider models that operate on sets of time series。

keeping the parameters of the model shared among this time series。

And an example of these is like attention based architectures like transformers, where in this case。

attention would be computed with respect to the spatial dimension rather than the temporal axis and this approach can work pretty well but as you know。

like if you are familiar with also like with the static graphs。

we know that using the transformers to process these kind of data can work pretty well。

but clearly the downside is that we are not exploiting any prior on the structure of the dependencies and also on the sparsity of these dependency structure。

There are also other methods that have been used in the literature that for instance rely on dimensionality reduction。

and here are the idea is that we can extract some shared latent factors from this large collection of time series and use these factors to condition a global model。

And these can work well if data are low rank in certain applications.

But the downside of this is that we are losing the local fine grained information。

which instead as we will see, is captured very well by the graph based approach and of course。

also these approaches in cases of like very large data set。

very large collections can suffer from the same scalability issues of the other approaches。

Now we can start talking about these graph based representations, so as anticipated。

the idea is to use a graph to represent the functional dependencies among the time series and use this graph as an inductive bias for our learning model. We can use the adjacency matrix to model these dependencies, an adjacency matrix which can be both asymmetric and dynamic, so it can vary with t, with each time step. Together with the adjacency matrix, we can also have edge attributes,

which can be dynamic on their own, and that can be both categorical and numerical。

Finally we can go back to the traffic example and here again we will have that the structure so the graph。

the adjacency matrix can be extracted from the structure of the road network, and you can have attributes or weights associated with the edges, also to encode the road distance; and here a dynamic topology can help, for instance,

for taking into account modifications in the traffic network in the structure of the traffic network。

This is a sort of summary of all the information that we have available at each time step。

so we have our target time series, together with the exogenous variables that can be both dynamic and static, plus the relational information that we just added to the setting.

So now the idea is to use this relational side information to to condition our predictors。

our forecasting architecture, and these relationships as we will see。

can act as a regularization to localize the predictions with respect to each node; in particular, these can be used to prune any spurious correlations that might result from not taking this structure into account. Furthermore, these approaches are far more scalable than standard multivariate models, because again we can keep the parameters of the model shared among the time series we are processing; in fact, we can use these kinds of architectures to forecast and process any subset of the correlated time series. In particular, the kind of graph neural networks that have been developed to process these data are called spatiotemporal graph neural networks, to refer to the fact that propagation in these models happens across both time and space, and in particular we will focus on those models based on the message passing framework.

And we will do that by considering this template architecture where we have which is composed by an encoder。

which simply encodes the observations, each observation independently each time step and node。

and the encoder is then followed by a stack of spatiotemporal message passing layers, which are the only components of the architecture where the propagation through time and space happens. The representations extracted by the spatiotemporal message passing blocks can then be mapped to predictions by a decoding block.

Looking at the finer details of this, you see that for the encoder the processing happens at the level of the single time step and single node; the resulting sequence of observations is then processed by the spatiotemporal message passing layers. The decoder, again, operates at the level of the single node and time step to map these representations to predictions.

We can then have several implementations of these spatiotemporal message passing blocks, and we can see this as a sort of generalization of standard message passing layers: instead of having static representations associated with each node, here we have sequences of representations, so what we need to do is to modify the standard operators that compose a message passing architecture so that they can process sequences.

Clearly, there are many possible designs that exist to do this and that can be somehow matched with the requirements that we have for each specific application。

So the next step would be that of characterizing the possible design paradigms for these spatiotemporal blocks, but this would be a good moment for asking any questions or clarifying any doubts before moving on; if you have any, feel free to use the chat or just unmute yourself and ask directly. Okay, no questions, but yeah, anyway, if some step of the discussion is not clear, feel free to use the chat and we will keep an eye on it.

Okay, so deep, there are as I was saying, there are several, okay, we have a question, yeah right。

Oh sorry, they can unmute themselves. So if you want to read out the question, yeah, okay: so, regarding correlation, is it possible to use STGNNs with multivariate signals? Okay, so I think we are referring to irregular time series.

This is a good question and there are methods that can be used to model these kind of irregular observations over time and space。

and we will touch on them in the second part when we will be talking about dealing with the missing data。

but yeah, there are methods to do that. Then, the other question is: is the encoder/decoder also a design choice? And yes, it is; you have different operators that you could use. In practice, in many cases, since these are just a first encoding, a first transformation, and later on just a readout, they are usually implemented with standard MLPs, but in principle you can use whatever, basically. Okay. So, as I was saying, there are several methods; is it okay, one last one: is it similar to embeddings in transformers encoding positional information too? We will talk about this in a moment;

there is something similar to this that is actually used often in this kind of models。

but we will arrive to that point。So for what concerns?

The different designs we can distinguish between time and space models and here in these kind of models。

the temporal and spatial processing cannot be factorized into separate steps。

so they happen jointly. We then have time-then-space models, which instead, as the name suggests, factorize the processing of the temporal and spatial dimensions into two separate steps; and finally we have the space-then-time approach, which is basically time-then-space in reverse order.

For the time and space, as I was anticipating here, it is that in this model。

the propagation of representation through time and space happens in a way that means that the resulting model architecture。

this processing cannot be factorized into separate stages。

And there are several ways of implementing these architectures, for instance。

one standard way is to integrate message passing into sequence modeling architectures。

we will see some examples of this。And the other approach is instead to use sequence modeling operators directly to compute messages。

computing each message with a sequence model in a single layer. And finally, there are also product graph representations, which are basically a way of getting a static graph representation from these collections of time series, so that we can use standard message passing operators on it.

So one standard example (and actually, from a chronological perspective, this is probably, to the best of our knowledge, the first model introduced to do this): we can start from looking at a standard GRU cell, a standard gated recurrent neural network, and you can see that here each time series is processed independently from the others. We can easily get a spatiotemporal graph neural network version of this by implementing each gate of the recurrent cell using message passing. This is also very similar to the first kind of graph neural networks that were used to process static graph data, but in this case, at each update of the cell state we perform message passing at a different time step, so we use this network to update the states of the cell at each time step by reading information from the neighbors. These kinds of time-and-space models are known as graph convolutional recurrent neural networks,
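A minimal sketch of the idea (my own simplified illustration using PyTorch Geometric's GCNConv as the message-passing block; real implementations such as DCRNN use different operators):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GraphGRUCell(nn.Module):
    """GRU cell where each gate is computed with message passing over the graph."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.update_gate = GCNConv(in_dim + hid_dim, hid_dim)
        self.reset_gate = GCNConv(in_dim + hid_dim, hid_dim)
        self.candidate = GCNConv(in_dim + hid_dim, hid_dim)

    def forward(self, x_t, h, edge_index):
        # x_t: (num_nodes, in_dim) observations at time t; h: (num_nodes, hid_dim) previous state
        z = torch.sigmoid(self.update_gate(torch.cat([x_t, h], dim=-1), edge_index))
        r = torch.sigmoid(self.reset_gate(torch.cat([x_t, h], dim=-1), edge_index))
        c = torch.tanh(self.candidate(torch.cat([x_t, r * h], dim=-1), edge_index))
        return (1 - z) * h + z * c

# usage: loop over the time steps of the window, calling the cell once per step on the whole graph
```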

and there have been many version of these, in particular。

many different architectures that have been tried to implement these by using different message passing blocks。

In particular, one very popular architecture of this type is known as the diffusion convolutional recurrent neural network, which is basically a GCRNN where the message passing operators are implemented by a bidirectional diffusion convolution, where the idea is to have different weights for incoming and outgoing edges.

And this was one of the first models that have been used to process time series and has led to many follow up and it's still quite popular and quite effective。

The second example is spatial temporal convolutional networks where here what we are doing is just alternating temporal convolutional filters with spatial convolutions。

so with convolutions on the graph。So basically here what you are doing is applying standard the temporal convolution at each node separately and again you can use any kind of temporal convolutional filter that you want。

this is this is going to be a 1D convolutional filter。

And then follow this with a step of message passing by stacking many of this layer。

you get an architecture where the receptive field will get larger at each layer with respect to both time and space。

In particular, there is STGCN, which was the first architecture of this type to be introduced in the literature, and here the blocks are stacks of gated temporal convolutions with polynomial graph filters. Similarly to the recurrent STGNNs, for these kinds of models there have also been many different implementations; in particular, there is one model called Graph WaveNet, which has been around since 2019, is still very popular, and still performs very well in benchmarks; the idea is to use more advanced convolutional operators, also including dilation, to increase the receptive field of the model faster. The third example of time-and-space models is those models that use sequence modeling operators to compute messages. Here is a simple example where the messages are computed using a TCN; in this case, basically, what you do is just concatenate the sequences of observations at the two neighboring nodes and then apply a convolutional filter on the resulting sequence.

Clearly you can use whatever sequence modeling operator that you want and for instance。

there are many examples of these kind of architectures that use attention operators and in this case this will be cross attention since you will be computing attention between sequences of two different nodes。

We then have the product graph representations and these models。

these representations come from the simple observation that you can see the temporal dimension as a line graph and combine in some way。

I mean there are several ways that one could consider。

combining this temporal graph with the spatial graph, and then again

process the resulting representation using a standard message passing net。For instance。

you can consider a standard Cartesian product where basically here each node will be, well。

each node that each time step will be connected to its neighbors and to itself at the previous time step。

There are also other ways of wiring this graph that you might use, like the Kronecker product,

and in this case the idea is to connect each node to its neighbors at the previous time step。

You can also come up with many different methods of wiring this graph and actually I think this is a direction where further studies would be needed also to understand the properties of these product graph representations。

We can then see the time then space approach and this one is very simple also to understand that also the resulting models are very simple to build and the idea here is to just embed each sequence separately into a vector representation and then perform message passing on the resulting graph。

And the advantage here is that this is very easy to implement and also computationally efficient because at the training time again。

you're not performing message passing a each time step, but only at the last one。

and we can also use all of the operators that we already know to process sequences and graphs.
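A minimal sketch of the time-then-space template (my own illustration, not a specific published architecture): a shared GRU encodes each node's window into a single vector, and message passing is then applied once to those static node embeddings.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class TimeThenSpace(nn.Module):
    def __init__(self, in_dim, hid_dim, horizon):
        super().__init__()
        self.temporal = nn.GRU(in_dim, hid_dim, batch_first=True)  # shared across all nodes
        self.spatial = GCNConv(hid_dim, hid_dim)                    # message passing on the encoded graph
        self.readout = nn.Linear(hid_dim, horizon)

    def forward(self, x, edge_index):
        # x: (num_nodes, window, in_dim) -- one window of observations per node
        _, h = self.temporal(x)            # h: (1, num_nodes, hid_dim), last hidden state per node
        h = h.squeeze(0)
        h = torch.relu(self.spatial(h, edge_index))
        return self.readout(h)             # (num_nodes, horizon) point forecasts
```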

The downside of this is that this two-step procedure might introduce a very serious information bottleneck, and as you know this can make some of the issues that we commonly have on graphs, such as over-smoothing and over-squashing, even more serious,

like it could make very difficult to propagate longer range information across space and time。

And also, since you are performing message passing only with respect to a single graph representation。

this would also make it more difficult to account for changes in the topologies and also for dynamic edge attributes。

Then we have the space-then-time approach, which, as anticipated, is basically the other way around: here you perform message passing at each time step separately and then encode the sequence of representations with a sequence model. These approaches have been used quite often in the literature, but they still have this kind of bottleneck, and this kind of factorized processing might make it more difficult to propagate information; moreover, they do not enjoy the same computational advantages as time-then-space models, since you are computing message passing at every single time step.

So now I see, maybe there's one question. Okay, yeah:

the question is about if we have other options to model the temporal dimension rather than using a line graph。

yeah for sure, like I think you can consider any representation that you want as long as you take care of somehow preserving the structure of the data。

also one thing that I' have not said is that when also you perform message passing on that kind of graph。

you should take care of。Considering that temporal and spatial edges have a very different meaning。

like for instance, you could use different ways to process the different messages or things like that。

but yeah in principle, I think it's the general idea is that of using these static graph representations to process these data and I think the optimal way of well the optimal representation that one might use is kind of an open question or rather using different representations might have different properties that my align well or not with your problem。

Is there a method that uses time-then-space and then mixes them? Well, you can do all sorts of wild things. The problem is that with the usual time-then-space approach, what happens is that after you have included the temporal information you have a static representation, because you perform message passing only on the representations extracted by the encoder; so you have time series going in in the first step, the time step, and then you have a graph coming out which is later processed. Of course you can imagine something slightly different, maybe adding some kind of skip connections or whatever, but yeah, I think this thing that you say, time-then-mixed, is somehow similar to the time-and-space models that I was talking about when discussing the convolutional approach; in that case, they work exactly like this, they interleave the temporal and spatial filters to progressively increase the receptive field.

Okay, then I think we can move on and then later on we can go back, we can answer further questions。

So。This globality and locality thing I've been talking about quite a bit in the presentation plays a big role also in SDGNs。

So STGNs at least understand and implementation with these template that we have been talking about。

they are global models so they can handle arbitrary neural sets, so they are inductive models。

And they can use the dependency structures to provide further conditioning on the predictions on the forecasts。

However, this further conditioning might not be enough, and the model can struggle in modeling all the heterogeneous dynamics that might be present in the time series collection; as a result, the model might need very long observation windows and a very high model capacity in order to account for all of these different heterogeneous dynamics.

One way around this has been that the literature has somehow followed is to consider hybrid global and local architectures。

One straightforward way of implementing hybrid architecture would be to just, for example。

like looking again at our template architecture to turn some components of this architecture into local。

For instance, you could make the encoder and decoder have parameters that are specific to the time series that will be processed there.

The clearly the resulting models would be able to capture these local effects much more easily。

you can see this as kind of similar to having to using like a backbone model and have some layers that are fine tuned or that are specific to the kind of data that you want to process there。

And of course, the downside is that if you want to process many time series using this approach, this would result in a very large number of local parameters, depending on how these encoder and decoder blocks are implemented. One way to amortize this cost is to simply consider node embeddings,

which are just vectors of learnable parameters that can be associated to each node。

and this slightly goes back to the previous question and you can see this kind of node embeddings as a sort of learnable positional encodings。

but what they are doing here actually is not encoding the position of the node in the graph。

but actually taking care of modeling the local components of modeling the specific characteristic of each time series。

These learnable components, these learnable vectors can be fed into the encoder and decor。

and as I was saying, they amortize the learning of these local time series specific processing blocks。

Allowing for keeping most of the models parameters shared。Clearly。

we still have the downside of having a number of learnable parameters that in this way scales linearly with the number of time series that we are processing。

and so there are intermediate solutions that one might consider such as learning embeddings for clusters of time series rather than for each single sequence。
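A minimal sketch of the node-embedding idea (my own illustration): a table of learnable per-node vectors concatenated to the input of an otherwise shared encoder.

```python
import torch
import torch.nn as nn

class LocalEmbeddingEncoder(nn.Module):
    """Shared (global) encoder conditioned on learnable per-node (local) embeddings."""
    def __init__(self, num_nodes, in_dim, emb_dim, hid_dim):
        super().__init__()
        self.node_emb = nn.Parameter(torch.randn(num_nodes, emb_dim) * 0.01)  # local parameters
        self.encoder = nn.Linear(in_dim + emb_dim, hid_dim)                   # shared parameters

    def forward(self, x):
        # x: (num_nodes, in_dim) observations at one time step
        z = torch.cat([x, self.node_emb], dim=-1)
        return torch.relu(self.encoder(z))

# In a transfer-learning scenario, one can freeze self.encoder (the global part)
# and fine-tune only self.node_emb (the local part) on the target domain.
```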

One other issue that is very important to consider here is that when we move to hybrid architecture。

the resulting model is not inductive anymore. This might sound like a downside, and actually it is, so you definitely lose some flexibility, but in practice you also gain something in transfer learning scenarios, i.e., in those scenarios where you have some data from the target domain that you can use to fine-tune your model.

In particular having an hybrid architecture allows for keeping the shared parameters of the model shared。

well sorry, keeping most of the part global components of the model shared and just fine tune the local parts。

in this case, just fine tune the embeddings。And in particular has been shown in the literature that regularizing these node embeddings can facilitate this transfer learning procedure even further。

We can start and we can have a look at some empirical results。

So these four are very popular benchmarks for correlated time series forecasting in the upper most part of the table here。

we have some baselines some reference architecture that have been developed by using the the template that we just saw in particular。

we have these GCRNNs, which are basically time-and-space models,

and then we also have different implementations of time then space models。

And then we have also three architectures from the state of the art。

So we can see that when we add these local embeddings, so for the baselines。

when we make these models, hybrid, so when we introduce some part that is modeling the characteristic of each time series separately。

we see that performance can improve by a wide margin,

so much so that with these very simple architectures。

in many cases you are able to match the performance of much more complex state of the art models。

Also much deeper with many more parameters。We can then have a look at some transfer learning results。

so here what we did basically was taking four data from four different traffic networks。

we trained a model on three of these networks and then performed the transfer step on the fourth one。

The first observation is that the fine tuned model。

so the models after transfer learning perform much better than the zero shot inductive model。

this is somehow to be expected, and we might also expect this gap to be wider if we used more data to train the model.

But the other very interesting thing that we can see is that the performance of the hybrid architecture where just the local components are fine tuned are fit on the new data are much better than those of the fine tuned global model。

In particular, these variational and clustering thing are regularizations that have been studied in the literature this paper cited here and with by using these regularizations。

performance in transfer can be improved even further。

So now we have reached the end of the first part, we formalized the problem of processing a correlated time series。

and we show we can use graph representations for modeling the dependencies among them。

we discussed the forecasting problem and this important distinction between global and local deep learning models for time series。

which is a crucial aspect of the whole thing. We then saw approaches to building spatiotemporal graph neural networks and the associated trade-offs.

Before discussing the challenges, we will look at software implementations of the above in particular。

Ivan will be presenting a tutorial of the library that we developed,

but before that we can answer some questions and have a short five minutes break。

It would be interesting to analyze the causes of the local behavior? Yeah, that would definitely be interesting. There are different things that you can do; one analysis that we did in this paper here, for instance, was to cluster the node embeddings (well, actually the clustering was done end-to-end), and then we looked at the resulting time series clusters, and you could see that the corresponding time series were quite different. In particular, we used a dataset of load consumption, and there it was easy to see that those time series had very different behaviors,

for example like very different consumption patterns and in the same days of the week。

but yeah I think this is you could probably also do something which is more fine grained if you also yeah。

if you use some way of representing some local components which is more interpretable。

but I think this is kind of an open question, yeah you can definitely try to use these somehow to analyze the different behaviors。

The other question is: so the static graph, after the time series are encoded into a graph, is it a universal problem in both local and global models, and can it be solved with an autoregressive context? So, okay, the fact that the time series are collapsed into a static graph representation is typical of time-then-space models. This does not happen in time-and-space approaches, as we saw before, like, for instance,

with recurrent graph convolutional networks, there the representation is not collapsed。

But this thing of having these global and local aspects is orthogonal to the models; I mean, it's even orthogonal to graph-based approaches. It does play an important role here, since you want to take the dependencies among the time series into account, but this is something orthogonal to the specific architecture that you are using.

Could you comment on which approach (time-and-space, time-then-space, or space-then-time) is commonly used for traffic monitoring?

Well, these, again, there are many architectures they use many different that are wired in many different ways。

so the literature is very, very rich in this regard and there are also surveys that try to classify all of these different approaches。

I'd say the time and space approach is the one that is most commonly used。

but this also depends on the kind of constraints that you have。

particularly if you have computational constraints the time then space approach works pretty well but I think the fact that that approach can work pretty well also depends on the kind of data right so if you really need to model a longer range interaction in that case a time then space approach might not be enough。

My suggestion, and later on we will try to come up with some guidelines, is that you can definitely start from a time-then-space model as a first try, since again it's much easier to train, much more scalable, and it gives you good results; and we will also talk about quantifying, well, seeing what good results mean. Then it's fine; it's actually much easier to use. Is there any loss function that takes into account the variance of the error for STGNNs?

Well, I think you are referring to, well, again, you can。Fit these models using。

Basically any kind of error function that you want, for instance。

if you want to take the variance into account, and by taking the variance into account I mean quantifying the variance, you can train a probabilistic model. As I was saying, this is not covered by the tutorial, we are just considering point predictors for simplicity, but there are many resources on how to build probabilistic time series forecasting models, and I'd say that this is pretty much orthogonal to the discussion we are having here; yeah, for sure there are methods that you can use to quantify uncertainty.

Okay, so I think we can have the the short five minutes break and no problem, no problem。

nothing to be sorry about, we will have the five minutes break and we will be back with the tutorial at yeah yeah。

In five minutes。

Okay, so hello everyone, we can continue with the second part。

so now we have the tutorial, or actually I think we can skip this slide with the tutorial information, you can find it there.

Okay, sorry。

Okay, okay, just click here. Okay, are you okay? Okay, yeah, sorry. So in this demo, in this part, we will see how we can actually build spatiotemporal GNNs, and we will use for this purpose an open source library that we developed in our lab, which is called Torch Spatiotemporal. You have all the links here: the documentation website and the GitHub repo; you have all the links in the PDF online as well as the QR code, so I will leave it here for a couple of seconds if you want to open it from the QR code.

Yeah, so we will move back, sorry. Okay. So, as I was saying, we will start by, let me attach to the notebook, okay. So in this notebook,

what we will see is to explore all the functionalities that are available in TSL。

so in Torch Spatiotemporal, which is a library relying on PyTorch and PyTorch Geometric, and also PyTorch Lightning, to ease, okay, sorry, to foster and accelerate research on spatiotemporal data processing using graph neural networks. So let's start by installing all the necessary dependencies: here we are simply setting up the environment. What I suggest is to have an environment in which you install your favorite PyTorch version and then install Torch Spatiotemporal afterwards, and it will install all the other dependencies. At the moment here we are using torch version 2.5.1. Okay, this is just,

let's say routine code just to import everything we will need and to have a look at which version of these dependencies we are using now。

In the meantime, let me just briefly introduce the library so。

TSL is not just a collection of layers; let's say, if you want something which is easily plug-and-play and has all the possible implementations of STGNNs available, it might not be the right library. What we want instead is to cover the entire pipeline, from data acquisition to the downstream task: we start by loading a given dataset, and on the data we can apply all the usual preprocessing that is needed when we want to do forecasting, for instance, like scaling and normalization of the data, or all the tools for resampling the data according to the given sampling rate, and to cluster data.

We also have a strong focus on data coming from the real world, so real time series where each observation is associated with a specific time step in real life; also tools to handle missing data and other irregularities (we will see something more in the second part of the tutorial); and then, of course, we have all the parts about modeling and inference, such as all the layers that we might need and ways to build our own STGNNs.

So let's start by loading a data from the ones available in TSL, we have a wide array of data sets。

we have some data sets on traffic forecasting for traffic forecasting。

some data sets for air quality or energy analytics。

Most of these datasets are widely used benchmarks in the spatiotemporal data processing research community. We will start by using MetrLA, which is a dataset that Andrea was showing before in the tables; it's a traffic dataset, so it contains data from 207 loop detectors in LA. The sampling rate is five minutes and we have approximately four months of data.

So we load the dataset simply, as I think you are already used to with other libraries: we import the dataset from tsl.datasets and we download it into a specific folder.
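Roughly, that step looks like this (following the style of the TSL documentation; argument names may differ across versions):

```python
from tsl.datasets import MetrLA

# download (if needed) and load the METR-LA traffic dataset into a local folder
dataset = MetrLA(root='./data')
print(dataset)  # should report roughly 34,000 time steps, 207 nodes, 1 channel
```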

And here we can see that we have approximately 34,000 time steps, which corresponds to five-minute steps over four months; we have 207 nodes, the detectors, and for each node we have just one channel, the detected speed, so each time series is univariate, although we have 207 of them. Okay, so let's print some statistics, some information about this dataset:

so we know that the sampling rate is five minutes。

we know that there are missing values and approximately they are at 8% of the data。

We can in principle have exogenous variables (the dataset comes with exogenous variables), and among these we have a matrix that tells us the distance between the sensors. So this covariate here, dataset.covariates, is a way,

let's say it's a storage for all information that we might need for our task。

but that are not the target of our downstream task。So in this case。

we have the distance between each of the nodes, which should be in theory the road distance from a node to each other。

We can have a look at it, so here what what these metrics is storing is basically that from the node0 to node1 and until9 we don't have any we don't have a path essentially so the distance is infinity while for instead one to two。

we have this distance in kilometers if I'm not wrong。No, maybe, no, maybe it's less。Anyway。

now we can also print the dataset to see how it looks: here we have the nodes, these are the node IDs, we have just one channel, and for each of these nodes we have an observation every five minutes, as I was saying before; we start in March, and these are the readings for each sensor. Okay, fine.

How can we build a graph out of this? Because the only information we have now is the distance between sensors and the data themselves. There are different ways to build a graph when we don't have a given ground-truth graph; later in the tutorial we will see more complex ways to learn a new graph, but here what we will do is essentially compute a similarity score between nodes starting from their distance. We use a Gaussian kernel on the distance to convert it into a measure of similarity, so the closer the nodes, the higher the score: before, the diagonal was zero, now it is one. This is a standard way to build a graph based on distance; it has been used in several works.

Now let's use this method instead: what we did is to get a similarity score out of the distance using this function, which is embedded in the dataset, so each dataset can implement its own similarity methods, and we have some default ones. The get_connectivity method, instead, applies the usual post-processing to the similarity matrix to turn it into an actual adjacency matrix: we threshold the values, so everything below 0.1 is set to zero (no edge between the two nodes), we remove self-loops, we normalize the weights along a given axis (either incoming or outgoing edges), and we ask for the edge_index layout. This is PyTorch Geometric language: the edge index is simply a list of edges, stored in a dense tensor. We can have a look at it: we have roughly 1,500 edges, with edge weights, so this is our weighted adjacency matrix. Here we have some operations just to convert it back to a dense matrix and to check that its sparse weights are still the same edge weights as before.
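A sketch of this graph-construction step, following the public TSL API (dataset.get_similarity / dataset.get_connectivity); the argument values mirror the ones described above and may differ slightly from the demo's.

```python
# Similarity from road distances via the dataset's default (Gaussian-kernel) method.
sim = dataset.get_similarity("distance")

# Post-process the similarity matrix into a sparse, weighted connectivity.
connectivity = dataset.get_connectivity(
    threshold=0.1,        # drop edges with similarity below 0.1
    include_self=False,   # remove self-loops
    normalize_axis=1,     # normalize weights along one axis
    layout="edge_index",  # PyTorch Geometric-style (edge_index, edge_weight)
)
edge_index, edge_weight = connectivity
print(edge_index.shape, edge_weight.shape)
```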

Okay. Now, what we have so far is a dataset that is still not in PyTorch: it mainly uses NumPy and pandas, which gives a more user-friendly view of the data, and we can run our standard analyses on it. But when it comes to training a model and feeding these data to a neural network, we have to switch to the PyTorch interface, and this is where the SpatioTemporalDataset comes to our help. What we do is pass to it the target variable, so it knows what the goal of our task is (what we want to predict, in the case of forecasting), the connectivity matrix, and any covariate or exogenous variable if we have them. Andrea was mentioning, for instance, the encoding of the time of day that might be used here; in this example we don't have any, but we can add them easily. Then there is the mask, telling whether data are available or not, and finally the parameters window, horizon and stride, which decide how the dataset is split into windows that are then fed to our model.

For those who are not familiar with the sliding-window approach: when we have a long time series, we usually split it into windows, feed the model with a small window of data, and then predict the next K steps. The number of steps ahead is defined by the horizon parameter, the length of the look-back window by the window parameter, and the stride parameter defines how many time steps there are between one sample and the next. Okay, here you have all the parameters you can pass to this class and the effect they will have on your time series.
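A sketch of this windowing step, based on the public TSL API (tsl.data.SpatioTemporalDataset); the window/horizon/stride values follow the 12-step setting mentioned later in the demo.

```python
from tsl.data import SpatioTemporalDataset

torch_dataset = SpatioTemporalDataset(
    target=dataset.dataframe(),  # the target time series
    connectivity=connectivity,   # (edge_index, edge_weight) built above
    mask=dataset.mask,           # which observations are valid
    window=12,                   # 12-step look-back window (1 hour)
    horizon=12,                  # predict the next 12 steps (1 hour)
    stride=1,                    # a new sample every time step
)
print(torch_dataset)
```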

If there are questions, please write them in the channel; I will have a look at the channel from time to time. Okay, I'm sorry, that was a previous question. Okay, so once we create this dataset we have basically split the long time series into smaller time series, and each of these is now a different sample of our dataset, so we have this number of samples here. We still have the nodes and channels as before, and each sample has 12 time steps inside. We can have a look at it, and yes, we have 12 time steps and still the same number of nodes and features. Okay, we can move on.

Here we have the torch dataset, the one we just built, and you can access any sample as if it were a list; in this case we are getting the first sample. The data object that is returned is basically an extension of the PyTorch Geometric Data object, wrapped in our custom TSL data object. We built upon the original Data, and here we have two inner dictionary-like storages in which we keep all the input variables and all the target variables: the inputs are the ones given to the model, and the targets are the ones we want as output. You don't have to specify all this information when you build your dataset; you have maximum flexibility to edit all of it, but if you don't, the most common procedure is adopted: your time series is split, and the target time series becomes both the input, named x, and the target, named y.

Okay. Here we can see what the input contains: the x, the edge index and the edge weights, so the inputs to the model are the same ones we have seen earlier in the tutorial, the X and the adjacency matrix. The target is just the y; again, we are not predicting graphs as output but just time series, and the graph is just a means we use for processing the input.
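A sketch of this inspection step; the attribute names follow the TSL data object just described (sample.input / sample.target), though the exact fields may vary across versions.

```python
sample = torch_dataset[0]  # index the dataset like a list

print(sample.input.x.shape)           # [window, nodes, channels], e.g. [12, 207, 1]
print(sample.input.edge_index.shape)  # the graph given to the model
print(sample.input.edge_weight.shape)
print(sample.target.y.shape)          # [horizon, nodes, channels], e.g. [12, 207, 1]
```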

Okay. In this case we can check whether we have a mask, and we do: the mask tells whether data are missing or not, and here all the data we are seeing are valid. We can also check whether there are transforms: the transforms are the scalers we usually use to scale the data, like standardization or a min-max scaler, but we will see them very soon. Okay, there is other information I can skip; this tutorial has much more material trying to explain all the steps you might be interested in, but we can skip some of it to get to the core part.

An important thing is that the batching of the dataset is done very efficiently, and you can access a batch of data as if it were a list, as I said before: here we are getting the first five samples, and they are batched into a single object. The advantage of this approach comes when we are dealing with static graphs, i.e., when the relationships among the time series are not changing or evolving over time: we can assume a single graph, and in this case batching is cheap, since we are stacking the time series but not the edges, and

the graph is the same for all of them. Okay, in this part we are splitting the dataset into training, validation and test, and fitting our standard scaler, the one that removes the mean and scales the data to unit variance. When we actually call dm.setup(), we fit all this information; this dm here is the SpatioTemporalDataModule and, for those familiar with PyTorch Lightning, it is an instance of a Lightning DataModule.
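A sketch of this split-and-scale step, based on the public TSL API (SpatioTemporalDataModule, StandardScaler); the split sizes and batch size are illustrative assumptions, and dataset.get_splitter is the helper we assume provides the temporal splits.

```python
from tsl.data import SpatioTemporalDataModule
from tsl.data.preprocessing import StandardScaler

# Standard scaler fitted over the time and node dimensions of the target.
scalers = {'target': StandardScaler(axis=(0, 1))}

dm = SpatioTemporalDataModule(
    dataset=torch_dataset,
    scalers=scalers,
    splitter=dataset.get_splitter(val_len=0.1, test_len=0.2),
    batch_size=64,
)
dm.setup()  # fits the scaler on the training split and materializes the splits
print(dm)
```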

Okay, now let's go directly to the model part. As is customary in these libraries, all the neural network parts are inside the submodule tsl.nn. Here we have a collection of layers, blocks, which implement logic a bit more complex than single layers, such as encoders or decoders for our model, and also some ready-to-use models.

For this tutorial we will design a custom STGNN following the time-then-space paradigm: what we want to do is to embed all the temporal information of each node into a single vector, so that we end up with a static graph with one vector per node, and then do the message passing on this single graph. We will also make use of node embeddings, i.e., parameters that are specific to each node, and this will make our STGNN a global-local model.

So let's have a look; I hope I executed all the important cells. This is the model, the time-then-space model. Here we have the node embedding table, which has a smaller size with respect to the hidden size we are using; then we have the encoding step which, as we said before, is just a feature encoding step: it takes as input a single time step and returns a single time step, applied pointwise. Then we have the temporal part, which in this case is a GRU, and we take only its last state, i.e., the encoding of the whole time series. For the spatial part we use the diffusion convolution, the operator also used in the Diffusion Convolutional Recurrent Neural Network (DCRNN). In the end we have a single vector for each node and the decoding part, which is done by an MLP. The logic is the one I just described, so we can go ahead. We can customize it by changing the number of layers, for instance setting two or more layers here, and we can increase the hidden size, the embedding size and the number of spatial message-passing layers.
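To make the architecture just described concrete, here is a minimal time-then-space sketch in the spirit of the TSL quickstart, assuming the tsl.nn building blocks mentioned above (NodeEmbedding, a GRU encoder, DiffConv); layer names, signatures and hyperparameters are best-effort assumptions rather than the demo's exact code.

```python
import torch.nn as nn
from einops.layers.torch import Rearrange  # reshape helper; TSL depends on einops
from tsl.nn.blocks.encoders import RNN
from tsl.nn.layers import NodeEmbedding, DiffConv


class TimeThenSpaceModel(nn.Module):
    def __init__(self, input_size: int, n_nodes: int, horizon: int,
                 hidden_size: int = 32, rnn_layers: int = 1, gnn_kernel: int = 2):
        super().__init__()
        # Pointwise feature encoder: applied independently at every time step.
        self.encoder = nn.Linear(input_size, hidden_size)
        # Learnable per-node parameters (the "local" part of the global-local model).
        self.node_embeddings = NodeEmbedding(n_nodes, hidden_size)
        # Temporal part: a GRU whose last state summarizes each node's window.
        self.time_nn = RNN(input_size=hidden_size, hidden_size=hidden_size,
                           n_layers=rnn_layers, cell='gru',
                           return_only_last_state=True)
        # Spatial part: diffusion convolution (the DCRNN operator) on the static graph.
        self.space_nn = DiffConv(in_channels=hidden_size, out_channels=hidden_size,
                                 k=gnn_kernel)
        # Readout mapping each node vector to the forecast horizon.
        self.decoder = nn.Linear(hidden_size, input_size * horizon)
        self.rearrange = Rearrange('b n (h f) -> b h n f', h=horizon)

    def forward(self, x, edge_index, edge_weight):
        # x: [batch, time, nodes, features]
        x_enc = self.encoder(x)                        # pointwise encoding
        x_emb = x_enc + self.node_embeddings()         # add node-specific embeddings
        h = self.time_nn(x_emb)                        # [batch, nodes, hidden]
        z = self.space_nn(h, edge_index, edge_weight)  # one round of message passing
        x_out = self.decoder(z)                        # [batch, nodes, horizon * features]
        return self.rearrange(x_out)                   # [batch, horizon, nodes, features]
```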

Okay, let's have a look at the model now. This is our model and this is the number of parameters we have at the moment. I also want to show you what the input data look like. What are we giving to the model? First of all, this is what the model sees as a time series: say we are at node zero, sample zero, and these are the 12 time steps we give as input; we want to predict the next 12. This is what a single, local, univariate model would usually see: the time series of one node as input and its continuation as output. What we can do instead, thanks to the graph processing, is to also give as input the set of time series of the neighbors of the node. Here we have 11 nodes, so maybe it's not that easy to see something that can be helpful for the prediction, but the rationale is that using these time series to forecast the signal on the right actually helps: it gives improvements with respect to forecasting the signal using only the signal on the left. The values on the left-hand side are normalized, as you can see from the scale, which is not the same as the target's in this case, but for convenience

There is a question: "In the first part of the tutorial we discussed that STGNNs for traffic modeling, such as DCRNN, are time-and-space models; here we implement a time-then-space model. Is the result the same? Please correct me if I misunderstood." So, these are different approaches, different paradigms: DCRNN is indeed a time-and-space model, as we briefly mentioned before, but we will have a deeper discussion right after the notebook. A time-then-space model is faster and more scalable, which is also why we chose it for this demo, for efficiency purposes; but the performance of these approaches is not as bad as one might think compared to

other time-and-space models. So, in the end, we are not looking at a naive network; it is something that can in principle work very well. Okay, now let's go to the training part. In tsl.engines we have a Predictor module, a Lightning module that can be used to ease the burden of defining the training loop, the validation loop, and the evaluation pipeline in general. We can import the metrics defined in the library, usually the ones used in traffic forecasting and in forecasting in general. In the meantime, okay, I'm starting the training now. We won't do a very long training; it's just to see that the computational burden of a time-then-space model is not that big compared to other time series forecasting methods. Here the library is telling us that we are giving to this model the X, i.e., the input, the edge index and the weight of the edges.
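A sketch of this training setup, based on the public TSL API (tsl.engines.Predictor, tsl.metrics.torch) and PyTorch Lightning; hyperparameters, metric names and the shortened training schedule are illustrative assumptions, not the demo's exact values.

```python
import torch
import pytorch_lightning as pl
from tsl.engines import Predictor
from tsl.metrics.torch import MaskedMAE, MaskedMAPE

model = TimeThenSpaceModel(input_size=1, n_nodes=207, horizon=12)

loss_fn = MaskedMAE()                      # masked loss: ignores missing values
metrics = {'mae': MaskedMAE(),
           'mape': MaskedMAPE(),
           'mae_at_15': MaskedMAE(at=2),   # 3rd step ahead  = 15 minutes
           'mae_at_60': MaskedMAE(at=11)}  # 12th step ahead = 1 hour

predictor = Predictor(model=model,
                      optim_class=torch.optim.Adam,
                      optim_kwargs={'lr': 1e-3},
                      loss_fn=loss_fn,
                      metrics=metrics)

trainer = pl.Trainer(max_epochs=5)         # short run, as in the demo
trainer.fit(predictor, datamodule=dm)
trainer.test(predictor, datamodule=dm)
```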

We can have a look at what's happening during training now. This is the training loss, so the model is learning something. But I invite you, if you are interested, to tune some hyperparameters here and train your own STGNN on this example. That is basically everything, so I think we can simply stop the training here and do the evaluation part, just to see what happens in the end. Here we are loading the best model, which of course is not a fully trained model at the moment, and we are testing it on the same splits usually used in publications.

Here we can see the results: test_mae is the mean absolute error over the entire horizon; "at 15" means after 15 minutes, i.e., three time steps into the future, while "at 60" is after one hour, i.e., 12 time steps into the future. All these metrics appear because we defined them here; you can have your own and monitor whatever you want by changing the metrics. Okay, this is pretty much it. The demo is available in the slides and also on the website of the tutorial; the library is online, you can find it on GitHub and Read the Docs. If there are questions, I'm happy to take questions on this part, and if there are none, I will move to the second part. Okay. Thank you.

Okay. I hope you're still here with me. In this second part of the tutorial we will have a look at the challenges that are inherent to this problem setting. We will look at four challenges. The first one is scalability: how we can deal with large collections of time series, i.e., how we can process the data when we have lots of them. A second problem, which also came up in the questions before, is what to do when we have missing or misaligned observations in our time series. Then Daniele will cover the latent graph learning part: what to do when the underlying graph is not known, how we can learn the dependencies existing between the time series, and also how we can evaluate these graph-based models, i.e., how we can understand whether a model is optimal or not. So let's start with the scalability part.

Scalability is actually a feature, right? Because by using graphs, what we can achieve is to have a single inductive model, a global model, while still conditioning on related time series in a sparse fashion: we are not giving the model the entire set of time series as input, nor just one of them, but only the ones we think are most relevant for the prediction of the target time series. In doing so, we reduce the cost of the usual operation of considering all the other time series from being quadratic in the number of time series to being linear in the number of edges, i.e., in the dependencies we found among the time series.

But scalability can also be an issue here, because the data, as the name suggests, span two dimensions: the spatial dimension, i.e., the number of time series we have, and the temporal dimension, i.e., the number of time steps in each time series. In real-world applications it is not uncommon to deal with high-frequency and large-scale time series: this is the case, for instance, in smart cities, with traffic forecasting or environmental monitoring (say, monitoring air quality in open environments), or in finance, where we may have prices at the level of seconds or even finer for a large number of stocks, or entities in general.

So usually the problem is that we have a large amount of data and we want to process them all together, and this is particularly true when we want to account for long-range dependencies that might exist in time, in space, or in both, for instance if we want to capture something which is very far away in time and space.

Now let's have a look at the actual computational complexity of STGNNs. If we consider a time-and-space model in a general formulation (there are very different implementations of time-and-space models), in general they need some node-wise temporal processing, and they also have a spatial processing that scales with the number of time steps: if we consider, for instance, a graph-based recurrent neural network, at each time step we perform L layers of message passing. What happens is that we scale with the number of message-passing operations times the number of time steps in input.

The first step towards improving scalability is given by time-then-space models, in which only the temporal processing is still done node-wise, but the message passing is then done on a single graph, as if we had just a single time step. In this case we have an additive computational complexity with respect to edges and time steps, instead of a multiplicative one. And, as we have also seen before, this is not an advantage that space-then-time models have.
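As a rough sketch of this contrast in complexity terms (the notation here is ours, not from the slides): with $T$ input steps, $N$ nodes, $|E|$ edges and $L$ message-passing layers per spatial block, a time-and-space model costs roughly

\mathcal{O}\big(T \cdot (N + L\,|E|)\big) \quad \text{(message passing at every step)},

while a time-then-space model costs roughly

\mathcal{O}\big(T\,N + L\,|E|\big) \quad \text{(temporal encoding, then one spatial pass)},

which is the additive rather than multiplicative dependence on time steps and edges mentioned above.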

Okay, this is quite a good achievement, but it might not be enough when we have very large graphs or very long-range dependencies. So what else can be done? A possible solution is to reduce the computational complexity by considering subgraphs of the full network. This can be done, for instance, by selecting a subset of target nodes, considering the ego-graph induced on this subset up to a given order of the neighborhood, and then training only on this subset. Another option is to rewire the graph to reduce the number of edges: since we are now scaling with the number of edges rather than nodes, we can apply further sparsification there.

All these methods are actually borrowed from the static graph community; examples are GraphSAGE, DropEdge and many others. The problem is that subsampling might break long-range dependencies: if we consider a very small K we might lose some important dependency, while if we use a large K we might end up with the original graph or something similar. Another disadvantage is that the learning signal may be noisy, so we might have problems in optimizing our network.

Another option, instead, is to move part of the computation before training, i.e., ahead of training. This helps us remove the heavy-lifting part from something we would otherwise need to repeat during training. An example is SIGN, an architecture that precomputes representations of the neighborhood of a node and then lets us sample the obtained features as if they were i.i.d. samples. These works were done to enable scalability on large static graphs.

A possible extension to the spatiotemporal setting is given by works such as SGP, in which we also have a further precomputation over time: besides precomputing the node and neighborhood features, we also have an encoding vector that summarizes the history of each node. This is done through an echo state network, basically a deep RNN with randomized weights, and after this encoding we propagate these encodings over the graph using powers of a given graph shift operator, i.e., a function of our adjacency matrix. Here we have an example of the architecture, to give a visual representation: the time series go inside this big recurrent network, which outputs a new encoding at each time step, and this is propagated over space in a non-trainable fashion.

Now we reduce the cost of the training step, which becomes independent of the length of the window, the number of nodes and the number of edges; the training step is now basically constant, because we can sample features from the newly constructed dataset as if they were i.i.d., and the performance is not bad at all: it matches the state of the art.

But there are some downsides, the first being that we are extracting more and more features, so the dimension of the vector that goes into the MLP, the downstream network, is now much bigger than the initial time series. Also, the method is more reliant on hyperparameter selection, as expected, because many parts of the network are not trained.

Finally, another option to reduce the computational complexity is to use a coarse-grained representation of the input: instead of using the time series at the original sampling rate in each step of the processing, we can reduce the resolution, both in time and in space. To do it in space we can rely, for instance, on graph pooling, i.e., techniques that reduce the size of the graph by associating subsets of nodes with supernodes in a new graph. In this way we reduce the number of operations needed to reach the same receptive field that a deeper network would need on the original graph, but the downside is that we introduce bottlenecks in the propagation of information.

So that's it for the first challenge, scalability; let's see if there are questions on this part, I'm happy to answer. I hope everything was clear. Okay, so I will move to the second challenge, which is dealing with missing data,

leaving just a couple of seconds in case there are questions. Okay. Until now, we assumed we were dealing with complete sequences, so that at each time step and for each node we observe a valid value. This is of course not the case in real-world applications, where we usually have missing data due to different reasons: sensor faults, which can be transient or permanent (we can actually lose a sensor at some point), asynchronicity among the time series, which results in values missing in a non-synchronous way across the time series, or any error in general.

The problem is that most forecasting methods do not consider this; they assume complete sequences, so we need a way to fill in these missing values, to reconstruct the missing data in the input. The problem of time series imputation is precisely that of estimating the missing observations in a sequence of values. Here we have the same figure as before, but now some values are missing and we added an auxiliary binary variable, the mask M, which denotes whether a value is missing or available. In the end, what we want to do is to provide an estimate for all those values for which the mask was zero in the input.

There are different patterns of missing data; we provide a taxonomy based on this conditional distribution, i.e., the distribution of the mask conditioned on the other values of the mask.

The point-missing case is similar to the missing-completely-at-random setting: the probability of a point being missing is the same across nodes and time steps, which basically amounts to associating a Bernoulli with the same constant to each node and time step; this usually models the communication errors we might have in a remote sensing application. In the block-missing case, instead, this distribution is not independent of missing data at other nodes or time steps, so we might have a block of missing data: in the temporal block-missing case, for instance, a fault generates consecutive missing values; we might have spatial block missing if, for instance, we have a blackout in a region of the sensor network, or a combination of the two.

When we have missing values, we need to make some adjustments when optimizing the parameters of our model. Here what we want to do is to compute the loss only on the valid observations we have: the loss function we use, for instance the MSE, is weighted by the mask, so that we only take into account the values that are really available. This loss is the one usually used for forecasting with missing values, or to compute a reconstruction loss on the available data. When instead we do imputation, sometimes we need to inject some missingness to train or evaluate our model: we mark some of the available observations as missing and in this way we obtain ground-truth labels. Of course, these values cannot then be used by the model to obtain the imputations.
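As a worked form of the masked objective just described (notation ours): with mask entries m_t^i ∈ {0, 1} and an element-wise error \ell (e.g., squared or absolute error),

\mathcal{L} = \frac{\sum_{t,i} m_t^i \, \ell\big(\hat{x}_t^i, x_t^i\big)}{\sum_{t,i} m_t^i},

so that missing entries contribute neither to the numerator nor to the normalization.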

In the deep learning literature there are different approaches to tackle this problem, and one of the prominent ones is to use autoregressive models, for instance RNNs. In this case, we process the input sequence and, as soon as we find missing values, we impute them using the prediction coming from the recurrent neural network, i.e., using the observations we had before. This is effective in exploiting all the past information available at a single node, and we can also account for future observations if we use, for instance, a bidirectional model.

But if you look at this process, you can easily imagine that when we have a long temporal block of missing values we keep providing forecasts for those values using predictions from the RNN, so we might have compounding of errors along the temporal block. Another downside is that this approach struggles to capture nonlinear spatial and temporal dependencies that might be present in the data.

So, again, what we can do is to use the relational information we have to condition the model: instead of having one model per time series, or a single monolithic model for the entire set of time series, we can have a model that can be applied to any time series and takes as input our graph, which specifies the relational dependencies in the data. An example of this is GRIN, where, similarly to what we did with the graph convolutional RNN for forecasting,

we integrate the graph processing into the autoregressive approach we have just seen. In general, what we do in these approaches is to model this distribution, the probability of a datum given the entire sequence, in three steps: the first two consist of modeling the distribution conditioned on the information at the previous or subsequent time steps, which is what a bidirectional RNN typically does, i.e., the model we saw just before. What message passing adds is the ability to also consider related concurrent observations, and this is powerful, for instance, when we have long sequences of missing values at a given node but still have information at neighboring nodes.

Imputation is usually done as a preprocessing step, as I said before, for a downstream task such as forecasting, and this is often necessary because of the nature of forecasting models, which expect complete sequences as input. So the usual pipeline is to impute the missing values and then proceed with the forecast afterwards. But this, of course, might introduce biases due to errors in the imputations.

Another use case is that of using an imputation model in place of a forecasting model: we can imagine having a sequence longer than the one we have, but empty at the end, i.e., full of missing values. In this case we can adopt imputation methods even though they were not meant for this. This is of course a workaround, and the performance might be poor due to the absence of values at the boundaries; if the methods we use rely strongly on boundary values, this is not a good choice. What we can do instead is to take a more direct approach, where we avoid this reconstruction step altogether and directly deal with irregular observations.

In this case, we have a forecasting STGNN that is tailored and designed to deal with sequences that might be incomplete. Here we have several benefits, the first being that we can now learn directly how to leverage only the valid observations, so we don't need to impute or carry around estimates of the missing values, and this is done precisely for the downstream task at hand. Also, we avoid the computational burden that usually comes with imputation as a preprocessing step.

Besides imputation, another important topic is that of virtual sensing, which is the practice of estimating unmeasured states using the data and the model we have: given a set of nodes, we might be interested in knowing what the corresponding observation would be at a node for which we have no data. Here we can really see the power of graphs, because we can exploit the relational dependencies to condition the estimates on data that are close in some space, in this case the sensor space, i.e., the space where the time series live. Thanks to the inductive property of message passing, we can insert new nodes and edges very easily, and this is useful in applications where sensing has a cost.

The application we see here is an example of taking methods designed for imputation and applying them to virtual sensing: we simply add a fictitious node to our data, without any observation, and we see what happens when we try to infer the corresponding time series; this is done with a graph imputation model. The results are not bad, so we can still recover something, as we can see from the figure, but of course several assumptions are needed: a high degree of homogeneity between the sensors, the assumption that a node can actually be reconstructed using only its neighbors, and many more.

This concludes the part on the first two challenges. Before leaving the stage to Daniele, I'm here to answer any questions on the scalability part or the missing data part. Okay, I think that's it, so I will leave the floor to Daniele.

So hello everyone, I'm Daniele Zambon; despite the name you probably still see on screen being Andrea's, here we are. I'm now going to talk about another problem, latent graph learning, which stems from the fact that everything we have seen so far relies on having some relations connecting our time series. But what if such information is not available? We can imagine scenarios where we have several time series but no relations are given to us, or we have some relations but do not know all of them, or the information given to us is not trustworthy, so we cannot assume it is reliable enough and we want to learn the relations from data. This possibility of learning relations from the time series we have holds the potential to apply all that we have seen so far to other scenarios as well.

Okay, I see another... okay, no, sorry, this is just someone in the waiting room, I cannot admit them anyway, so I will move on. So, as I was saying, we want to learn these relations from data: we have time series and we want to devise a model that is able to extract this information. What is central to this topic is that, in order to really rely on graph neural networks, we expect the graph we extract from data to be sparse in some sense, so that we can rely on all the advantages we have seen in the previous presentations: low computational cost, scalability linear in the number of edges, and not having to scale quadratically with the number of nodes. This can also serve as a sort of regularizer for an attention mechanism, so that we can drop connections we find are not relevant for improving the predictions and focus only on the most relevant ones.

Lastly, to introduce this topic, note that this graph reconstruction process goes under different names in the literature: graph structure learning is probably the most used, but other names are appropriate as well. Here I use latent graph learning because it is more informative for what we are doing: we have time series and we want to extract relations that are, in some sense, underneath the time series and that condition the realizations of the time series we are observing.

The first approach, probably the simplest one, is the following: we have a collection of time series and we compute some similarity between all pairs of them, for example the Pearson correlation or Granger causality. In this way we construct a matrix that stores all these similarities; then, since, as I was saying, we may be interested in extracting something sparse, we may need to apply some thresholding on top of this. So first we decide what similarity means for us, and then, on the extracted graph, we use our STGNNs, or any graph-based model, actually.

Another approach, instead, assumes that the graph is not something we compute on top of our time series, but something that has, in some sense, determined the realizations of the time series we have available. This typically relies on assumptions such as signal smoothness: we assume that the signal is smooth, in the sense that time series that are more similar to each other are so because they are more related, i.e., generated by some structure underneath; by optimizing specific losses, like the one shown here, which is basically the total variation of the signals, we can try to recover the topology that, in some sense, determined our observations. Of course, the optimization of this type of loss may require constraints: here we are optimizing with respect to L, and L has to be a Laplacian, so we need to constrain the matrix we are looking for to indeed represent a Laplacian. This is also interesting because, in some sense, it allows us to enforce some sparsity as well within the optimization problem. This family of approaches is commonly derived within the framework of graph signal processing, which of course offers strong guarantees in this respect.
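As a worked example of a smoothness objective of the kind mentioned above (a standard formulation from the graph signal processing literature; the exact loss on the slide may differ):

\min_{L \in \mathcal{L}} \; \operatorname{tr}\!\left(X^\top L X\right) + \beta\, \Omega(L), \qquad \operatorname{tr}\!\left(X^\top L X\right) = \tfrac{1}{2} \sum_{i,j} w_{ij}\, \lVert x_i - x_j \rVert_2^2,

where X stacks the node signals, \mathcal{L} is the set of valid Laplacians, w_{ij} are the corresponding edge weights and \Omega is a regularizer; minimizing the total-variation term pushes strongly connected nodes to have similar signals.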

What we are more concerned with in this presentation is task-oriented latent graph learning, where we tackle the problem in a more integrated way: we learn the relations so that, in the end, the model performs best at the task we are considering. We optimize a downstream task, and possibly we want to learn everything end to end. This differs from the previous approaches, where we first learn the relations and only then use them together with the model we are optimizing.

For task-oriented latent graph learning we have two main approaches. The first is a more direct one, where the graph we are trying to learn is modeled as a real N-by-N matrix, with N the number of nodes, and we optimize it in order to, as I said, maximize the performance on the downstream task. The second is a probabilistic approach: instead of learning a deterministic matrix, we learn a random variable A distributed according to some distribution p_phi, modeled in a way we will see shortly; often this is done to allow our random variable A, our graph, to be a truly discrete object, so we constrain p_phi to produce matrices A that are binary, i.e., with entries in {0, 1}.

Again, as I mentioned, the key challenge in this deep learning framework is to retrieve A in such a way that it is, in the end, a sparse object: this allows us not only to find the few relations that are most relevant to the task at hand, but also to keep the computation sparse and therefore efficient. This is particularly non-trivial with the techniques we mainly use nowadays, which are based on gradient-based optimization.

The direct approach, as I said, is non-probabilistic, as opposed to the other one. Here the latent variable A is modeled as a function C of some edge-score parameters phi: as a starting point we can think of phi as an N-by-N matrix of free real-valued parameters, and it is then the role of the function C to transform these parameters into an adjacency matrix. C could be any function: for example a sigmoid, if we want to maintain differentiability, or we may simply threshold, sending all positive entries to 1 and the negative ones to 0. Another remark: I said phi could be a matrix of free parameters, but this is only one possibility; it could also be itself a function of other parameters or, for example, of the input itself. In that case we write explicitly the dependency of phi on some input data x and on some other parameters theta.

On top of this phi we apply the function C, which allows us to enforce different types of structure. The first one you probably think of is making A a binary matrix, but that is not the only option: we can enforce, for example, a K-nearest-neighbor graph, so that we construct graphs with bounded degree, again, for instance, for scalability reasons, or some other structure more relevant for the application, for example a tree or a directed acyclic graph, or anything else, as long as we have a way to enforce that structure.

Another important and often used trick to model these edge scores, intended to address the fact that they are in principle quadratic in the number of nodes, is to exploit some factorization of phi. In this case we have two other matrices, Z_s for the source nodes and Z_t for the target nodes; we can interpret these two matrices, which are smaller in terms of number of parameters, as embeddings of the nodes. By means of these embeddings, either directly through a product or through some function of the two, we can extract the parameters phi and then, through C again, produce the adjacency matrix A. Again, Z_s and Z_t can be free parameters themselves or, once again, functions of the input data.
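As a worked form of the factorized parameterization just described (notation ours):

\Phi = Z_s Z_t^\top, \qquad Z_s, Z_t \in \mathbb{R}^{N \times d}, \; d \ll N, \qquad A = C(\Phi), \;\; \text{e.g. } A_{ij} = \sigma(\phi_{ij}) \;\text{ or }\; A_{ij} = \mathbb{1}\,[\phi_{ij} > \tau],

which reduces the number of free parameters from N^2 to 2Nd.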

There are several advantages to this approach, mainly that it is easy to implement: the possible parameterizations are virtually endless, in the sense that we can use any of the layers available in our deep learning framework, and all these deep learning solutions are supposed to work immediately once plugged into the optimization tools we have available.

On the downsides, since these free parameters are in the end real numbers, we usually end up with N-squared computational complexity, which, as you can imagine, does not scale well if we want to consider a large number of time series altogether, especially during training, where we have not yet reached a point where the graph can be considered sparse and are still considering all possible connections; in that case the computation is quadratic in the number of nodes. Indeed, one can leave the parameters as N-squared and then apply some sparsification on top, but that sparsification cuts all the connections that are zero, removing them from the computational graph, so the gradients cannot flow through those paths, and it becomes harder to train this type of parameterization. Finally, given the two shortcomings I just mentioned, this parameterization requires some extra care to actually find what we are looking for.

The second family of models relies on a probabilistic approach where, as I anticipated, we build a parametric distribution p_phi for our adjacency matrix A. There are different parameterizations for this p_phi; probably the most straightforward one is the one on the left, where we assume that all edges of our graph are independent of each other, so we can model each edge as a Bernoulli: A_ij, the component of the adjacency matrix relating to the edge (i, j), is simply a Bernoulli, independent of all the others, whose parameter is given by the sigmoid of the free parameter, or edge score, phi_ij.

In the box on the right we have another family of distributions: as you can see, changing the distribution allows us to rely on certain assumptions or to force structural priors on the graph we are optimizing. This distribution forces graphs with a fixed degree, in this case K: for each node we have a list of edge scores phi_i1 to phi_iN; based on these we compute a softmax, which gives us the parameters of a categorical distribution, and from this categorical distribution we sample K nodes without replacement, giving us the neighborhood of node i. This is the case of sampling exactly K nodes without replacement; extensions can be designed to rely on assumptions such as having at most K neighbors.
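As a worked form of the two edge distributions just described (notation ours):

p_\phi(A) = \prod_{i,j} p_\phi(A_{ij}), \qquad A_{ij} \sim \mathrm{Bernoulli}\big(\sigma(\phi_{ij})\big) \quad \text{(independent edges)};

for the fixed-degree case, the neighborhood of node i is obtained by sampling K nodes without replacement from \mathrm{Categorical}\big(\mathrm{softmax}(\phi_{i1}, \dots, \phi_{iN})\big).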

These are just two examples. Clearly, I spoke here as if phi were free parameters; indeed, all the parameterizations we have seen for the direct methods, such as the dependency on the input data or the factorization of phi as the inner product (or some similarity) of node embeddings, are still feasible, and actually advisable in certain scenarios. On the bottom right we also see a way to enforce a dependence of these parameters phi on the input data, so that we design a probability distribution conditioned on exogenous variables or, for example, on the input x itself.

Learning graph distributions may not be as simple as in the other case, so we should pay attention to what we are doing here. Basically, we can construct several types of losses to optimize our distribution p_phi. The first, on the left, averages the loss of our predictions over, ideally, all possible realizations A from the distribution: we sample an adjacency matrix, produce a prediction x-hat, and compare it, for example via the mean squared error or the mean absolute error, with the actual observation, denoted x without the hat; then we average all the losses we collect, and this is our objective. On the right-hand side we have a slightly different approach, where we swap the expectation and the loss: we make predictions for several A, take the average of the predictions, and only then assess the discrepancy with the observed realization. These may seem similar, but there are theoretical results suggesting that the right-hand-side loss is in general more advisable if you are really looking for a probabilistic model of the predictions; here I took the expectation, but it need not be the expectation, it could be other functions, like the median and so on.

A more general approach, in the spirit of probabilistic models, is to design a loss constructed as a discrepancy between the predictive distribution, i.e., the distribution of our prediction x-hat as determined by the latent variable A, and the distribution of our observations. This discrepancy or divergence could be the Kullback-Leibler divergence, the continuous ranked probability score, or an energy distance; there are several of them. This is just to say that there are different losses, but all of them share one crucial potential issue: if we want to optimize them directly with gradient-based optimization, we need the gradient with respect to the phi parameters, and these are exactly the parameters that index the expectation over which the loss is integrated. There are, of course, analytical solutions for certain scenarios or certain losses, but many times these are unfeasible, and naive Monte Carlo approaches may not be feasible either. Why? Because if we take a sample, for example with the top-left loss, if we sample A, compute the loss and want the gradient with respect to phi, there is no connection in the computational graph towards phi. For these reasons there are strategies to design Monte Carlo estimators.

One approach relies on the reparameterization trick, which is basically a smart way to rewrite the distribution p_phi of A, or rather to rewrite A itself, as a function of our parameters phi and of another component, another random variable epsilon, which has its own, fixed distribution. In the end the optimization goes through the function g towards the parameters phi, while the randomness lives inside the random variable epsilon. This, in some sense, decouples the randomness originating from epsilon from the parameters we want to learn. This is very much used in practice: you can see that the gradient with respect to phi can be moved inside the expectation with respect to epsilon, and now the loss function does depend on the parameters.
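As a worked form of the reparameterization trick just described (notation ours):

A = g_\phi(\varepsilon), \;\; \varepsilon \sim p(\varepsilon) \quad\Longrightarrow\quad \nabla_\phi\, \mathbb{E}_{A \sim p_\phi}\big[\ell(A)\big] = \mathbb{E}_{\varepsilon \sim p(\varepsilon)}\big[\nabla_\phi\, \ell\big(g_\phi(\varepsilon)\big)\big],

where g_\phi is typically a continuous relaxation of the discrete sampling step (for example, a Gumbel-softmax-style relaxation for approximately binary edges).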

This is practical, and the major downside to point out is that, in the end, it needs to rely on continuous relaxations in order to handle discrete or sparse adjacency matrices; and if we need a continuous relaxation, then the computation, at least during training, cannot be sparse.

Conversely, another approach, the one we tackled with Andrea, relies on score-function gradient estimators. This type of estimator relies on a smart trick where we rewrite the loss on the left-hand side of this equation as the one on the right-hand side: while the expectation is still taken with respect to p_phi, the gradient is now shifted to the right-hand side and taken with respect to the log-likelihood of the graph, log p_phi(A).
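As a worked form of the score-function (REINFORCE-style) estimator just described (notation ours):

\nabla_\phi\, \mathbb{E}_{A \sim p_\phi}\big[\ell(A)\big] = \mathbb{E}_{A \sim p_\phi}\big[\ell(A)\, \nabla_\phi \log p_\phi(A)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \ell\big(A^{(s)}\big)\, \nabla_\phi \log p_\phi\big(A^{(s)}\big), \quad A^{(s)} \sim p_\phi,

so each sampled A^{(s)} can remain discrete and sparse; only the log-likelihood needs to be differentiated.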

This is really convenient for us because it allows us to keep the computation sparse, within the loss and actually through our message-passing operators: A is just a sample, and, if you are already thinking of the Monte Carlo approximation, a sample from p_phi can remain discrete and sparse throughout the computation of the prediction x-hat and of the loss, whereas all the gradient part is deferred to the differentiation of the log-likelihood.

The first evident disadvantage of these score-function gradient estimators, which is clear if you have ever tried to use one, is that they often bring a lot of variance into the gradients, which in the end can slow down training quite a bit. However, for many applications there are variance-reduction techniques, either general-purpose ones or techniques tailored to the use case at hand, like in our case; and, as already mentioned, the most interesting part is that the computation can remain sparse.

As you can see, in both scenarios, the pathwise approximation based on the reparameterization trick and the score-function formulation, we can take samples, either of epsilon or directly of our adjacency matrix, and then take the gradient after having drawn one or more samples.

This slide shows that there is indeed a computational advantage: here we are comparing the training time and GPU memory during training for methods based on score-function gradient estimators versus methods based on the reparameterization trick. Because we decouple the loss from the likelihood, and the gradient is taken only with respect to the log-likelihood, the message-passing operations, which are the ones that benefit the most from sparse graphs, are the ones that allow us to keep the memory footprint under control, as well as the training time; you can see that, as the number of nodes grows, the other approach easily runs into out-of-memory issues here. Yeah, again, this is to stress that there are some setbacks in using these estimators, but there are also solutions; you may need to devise your own for specific cases, but there are solutions for these high-variance setbacks.

So lastly on these topics what I wanted to stress is that okay we have used probabilistic models the main purpose so far was just to obtain some discrete or sparse graphs out of these learning process but still we have a probabilistic models so why don't we look at the edge probabilities that we learn and see if these actually can be insightful for us for our application and so to exc some relevant information out of them of course this is not necessarily the case in all scenarios because the training and the learning process be well pose but anyways if we are able to。

To obtain such edge probabilities, indeed this provides valuable insights for explainability purpose and also in the end on top of that also achieve better informed decision making。

One interesting issue in this setup is that okay graphs in this case are Latin variables so and as such in real data we don't have any observation about them so this makes extremely hard for us to evaluate whether the edge probabilities that we are extracting from data throughout our end to end learning process are actually meaningful in any way。

so this is an inherent problem so we it's really harder to think of our dataset set where we have such ground truth in order to assess whether the learned probabilities are actually calibrated to the true hidden ones so what we can do is of course visualized and see if expert can confirm that this is meaningful or reasonable but what we decided to。

Study more in detail is learning guarantees that allows us to some way bypass this problem。

so we study from a theoretical perspective, whether we are able to in some ways to learn the Latin distribution and learning in a way that is calibrated to to one unknown model that existed。

so what we find out is that if we are able to minimize a certain losses。

losses that allows us to say that that are equal to zero only if the two distributions that we see here are actually identical。

if we are able to minimize them, what we are saying is that we are perfectly matching the distribution of the output from the data generating process so is matched by the distribution that is output。

from our predictive model okay so if this happens what we can say is that under certain assumption this is not by default and is not in general true。

but under some assumption what we have is we know that also the probability distribution associated with the la and variable needs to be equal to the true one that generated the observed data so while this is not true in general for graphs that are stronger actually not stronger but milder these conditions for which these results hold are much milder so this is reassuring that indeed operating with graphs and in such in this way be could provide interesting and reliable edge probabilities。

So yeah, this concludes the part related to latent graph learning, so if you have any questions at this point,

I'm happy to take them. Otherwise, I will move on to the last challenge.

So yeah, there is one question; I have two questions, actually. Assume we have a prior graph created by an expert in the field.

Could latent graph learning create a better graph compared to such a prior graph? This is the first question.

The second question is: in case prior knowledge is required, e.g., for a physical system,

can we incorporate the prior graph knowledge into the learning pipeline? Okay, so yeah,

these are two extremely relevant questions. So, regarding the first one:

whether they relate or not to each other depends, first of all,

on the approach that you followed while extracting this graph. So, for example,

if you are computing this graph as the Pearson correlation of the time series, what you have is the Pearson correlation between the time series, so a measure or estimate of the linear correlation of all pairs of time series. If you're following the last approach, based on task-oriented optimization,

so basically optimizing the downstream task, what we will obtain is simply the best graph, at least according to the optimization procedure that we devised to optimize the downstream performance. So in this sense it is what Andrea was describing earlier, in the sense that it is

really oriented to optimizing the final performance of the model. So,

depending on the graph that you are given, or that you are considering as a prior,

this may or may not match. So typically you can reinterpret the relations that are extracted as: would this relation

help me make a better prediction for that time series or not?

And then, by enforcing some sort of sparsity prior, where we try to keep the degree of each node small,

we are trying to promote solutions that look only at the most relevant relations. Yeah,

so this is for the first question. Regarding the second one, whether prior knowledge can be incorporated: yeah, there are different ways, so the short answer is yes,

as far as you can design a probability distribution, for example a distribution P over graphs, that reflects that prior.

This could be, for example: if you know that you have several physical sensors deployed in an environment, and you know that sensors too far apart in this physical environment have no relation at all,

then you can enforce it in your probability distribution by simply setting those parameters to zero. This of course extends a little bit beyond that; for example,

if you know that the graph that you're looking for should be a sort of tree,

or tree-like, then this is trickier to parameterize, but still, in principle, it is possible. So, for example, in our case of k-NN graphs (ours and also others'), it is not always easy to do that; or rather, it is easy to design but not necessarily easy to optimize. For example, for score-function gradient estimators, what you also have to do is take the gradient of the log-likelihood, but the log-likelihood of a random extraction of k nodes without replacement is not so easy to write down, and so not easy to optimize analytically. So in our case we relied on smart optimization from other people, who found a numerical approximation of that, and it turned out to work well.

But so, depending on the prior that you want to enforce, this can lead to easier or harder solutions,

or parametrizations. Okay. So, oh, thank you. So, okay, I think that if there are no other questions,

I can move on to the last part, which is model quality assessment.

This last part addresses the point that, okay, we have all these models. We have several.

Okay, so we have all these models, we have our raw data,

we have our predictions; now the last question, of course, is always: okay, but how good is my model? Indeed, our losses already provide us with a metric of goodness of fit, but it's interesting to see that as the number of time series grows,

as the length of these time series grows, and as we may also have intricate interdependencies among these time series, a single number out of our training process can be a little bit restrictive for understanding what is going on and whether everything is working as expected.

So beyond asking whether our model is good, we also want to ask whether our model is optimal, and whether there are regions or aspects,

for example the temporal processing or the spatial processing, that would benefit from some further design improvements.

And lastly, of course, the question is: okay, now that I have identified where the model may need improvements, I need solutions or guidelines to act on in order to make these improvements. Indeed, here I spoke about optimality; optimality in general depends on the criterion that you choose. It could be, for example, as I said,

optimizing a predictive loss like the mean absolute error, but it need not be like that.

So we will see why and how relational information, also within the task of assessing the quality of our models, can be relevant, or actually the solution that makes this process feasible. A typical scenario that we all face is that we decide whether model f_A is better than model f_B based on the performance that we see on the test set, and we say that model f_A is better than model f_B if the performance that we see from one model is statistically better than the other. We say that a model f_A is, in general, at least within a family of models, the optimal one if there is no other model that is better than it. So this is pretty standard, and what we may often encounter is: okay, say that I have this model f_A,

I may consider it good enough, or I may say it seems to be working appropriately, but I hope to make a better model. How can I say whether this model is the optimal one, or whether I should keep working on it even further in order to improve it and find a better one? This is a question that is hard to answer, because we are still in the process of changing our models and finding a new one, until we find an actually better one; but if we don't find it, we still don't know whether we should keep looking for it or not, unless we have, of course, some prior knowledge of, for example, the level of noise that we have on our observed data, which somehow sets the baseline that we want to reach.

Another interesting approach to model quality assessment, and this is why I introduced the previous slide, is that of studying correlations among the time series. Studying correlations is a way to understand whether, in our predictions, or actually in the data that we are trying to model, there is information that we were not able to extract from the data.

In fact, consider these prediction residuals, so the difference between our prediction and the true, actual observation from the monitored environment. This difference allows us to better understand whether our model is optimal or not. In particular, if we see that there are some dependencies among the residuals, it means that there is structural information left in the data, in our residuals, that our

model was not able to extract. So in this sense, there is a margin for improving our model even further, because there is structural information left in the data.

So residual correlation analysis, in that sense, is independent of the type of performance metric that we are interested in.

Of course, it does not tell us how much we can improve;

it just says, look, there is some data that is correlated, so probably you could make it better, but it doesn't tell you by how much.

But in the end this is also interesting because it doesn't need to be a comparative analysis:

I don't need two terms to say, okay, looking at those two, which one is better; it's an absolute answer in some sense.

So here we're talking about correlations. Most of the research so far has focused on serial correlation,

so correlation along the temporal dimension, but there are also works that study correlation along the spatial dimension. Here I'm presenting an approach that is spatio-temporal,

so it addresses space and time at the same time. These works go as follows: basically, from our data we have, at every time step, our graph and our observations, so we can construct a spatio-temporal graph by connecting edges not only along the spatial dimension but also along the temporal dimension, and by looking at simple sign statistics of correlation, the sign functions that you're seeing here, so the sign of the inner product of the residuals.

This is probably the simplest way to check whether there is any direct or indirect correlation between two residuals. So, by taking these averages over all the signs, we are designing a

pretty simple and straightforward test that allows us to understand whether there is some correlation left in the data.

Well, these are split into two parts, because one addresses spatial correlation (the one in red) and the other addresses only temporal correlation. Here we have these w's, which are the weights, which are basically non-zero for all the edges of the graphs, and can assign a sort of influence or importance to different edges.

So the most important part here is that, while studying correlation in principle we should look at all possible correlations existing among the residuals, which grow quadratically both in the number of time steps and in the number of nodes, here we are only focusing on the most relevant ones, which are

all the possible connections that are more likely to lead to correlation. So this is, again, why graphs allow us to design statistical tests that are statistically powerful even though the data dimensionality grows a lot. In the end, thanks to the fact that we have a sign, this leads us to distribution-free statistical tests, and these also scale linearly in the number of nodes. Okay, so this is a test that can be applied globally, at graph level, considering space and time altogether; it can be split to look only at space or only at time at once,

but it can also go into finer detail by looking at single nodes, or for example single time steps, or even more localized regions both in space and time.
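As a small illustration of the kind of check just described, here is a sketch (with assumed shapes and a made-up random graph; it is not the authors' exact test statistic) that averages signs of residual products over spatial edges and over consecutive time steps:

```python
# Sketch: sign statistics on prediction residuals; values near zero are what
# we expect when no spatial or temporal correlation is left in the residuals.
import numpy as np

def sign_statistics(residuals, adj):
    """residuals: (T, N) prediction residuals; adj: (N, N) binary adjacency."""
    src, dst = np.nonzero(adj)
    spatial_signs = np.sign(residuals[:, src] * residuals[:, dst])   # same step, graph edges
    temporal_signs = np.sign(residuals[1:, :] * residuals[:-1, :])   # same node, adjacent steps
    return spatial_signs.mean(), temporal_signs.mean()

rng = np.random.default_rng(0)
res = rng.standard_normal((500, 10))                # white residuals (null hypothesis)
adj = (rng.random((10, 10)) < 0.2).astype(int)
np.fill_diagonal(adj, 0)
print(sign_statistics(res, adj))
```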

So this actually concludes the parts where we are addressing challenges that have not been fully resolved, but on which substantial research has been carried out, and now I will move on very quickly to future directions that we see as interesting and relevant for future research.

So the first and foremost is probably this one about hierarchical modeling, where, as we have seen so far, the STGNNs, or the models that we have considered so far, look at the data as they are, so with the same fine-grained scale with which the data comes, but we could also consider

higher-order dependencies by creating hierarchical structures, like pooling of nodes, or even along the temporal dimension.

This allows us to consider coarser-grained scales,

but also, at the same time, to reach information that is farther apart within our data. And in the end,

from an application standpoint, it is also important in certain scenarios to produce forecasts at an aggregated level.

Think, for example, of power load forecasting: we may not be interested in single-household predictions, but maybe in a more aggregated level.

And in the end it also allows us, as mentioned here, to reconcile all these different predictions made at different levels in order to make a more reliable predictor.

Another interesting approach is that of considering state-space models.

These state-space models have the intrinsic advantage of being, in some sense, Markovian, in the sense that observations or predictions are made only from the current state of our system, and this system is updated from the previous state at every time step and could possibly also be driven by an input graph.

So this decouples all three dimensions, inputs, states and outputs, and allows us to have different types of graphs with different relations, which, either in the former setup of hierarchical modeling or for other reasons, lets us use different graphs for different purposes. This is also amenable in the sense that it allows us to make predictions only when needed.

If you need to prolong the state update, we can do it. If we observe, for example,

a certain point observation from the modeled system, so a true observation of y at time t,

we can also feed it back, in the same way as a Kalman filter operates for linear systems:

we can feed this information back to our system, update and improve our state estimate, and propagate forward with an improved modelization of the system.

Another relevant scenario is that of inductive learning, where several issues can occur in our monitored environment:

for example, we can see changes in topology, we can see new nodes being added to our network, for which, for example, we don't have node embeddings yet,

so we need to make some adaptation in order to apply our model to these new settings, or even transfer the entire model to a new network.

This is useful not only for forecasting, but for all sorts of applications related to this type of data, and of course,

as we change the data that we are dealing with, this can also result in performance degradation, while we try to keep performance as expected.

So lastly, benchmarking is another topic that needs to be addressed in the future,

at least according to us. We have some large datasets, open datasets;

these span mainly energy and traffic flows, which is what this type of model has been designed for since the beginning,

but this of course doesn't yet provide a comprehensive benchmarking environment,

at least in my opinion. We also have some software, like Torch Spatiotemporal, which Andrea and others have designed, but also BasicTS; these are useful resources for trying to standardize model implementation, evaluation,

cross-validation and so on and so forth, but yeah, we are not close to what, for example,

the Open Graph Benchmark is now. So, all in all, to conclude, what we have seen is a framework for modeling time series,

and this combines deep learning for time series and deep learning for graphs.

This combination of the two has turned out to be extremely fruitful in several applications.

It doesn't always need to be like this, but we have seen that the inclusion of relational biases in one model was really a game changer in several tasks.

And this not only allows us to improve performance, but also, at a finer level,

we have the possibility to share parameters, so we have a better ratio between training data and model complexity, and we can also overcome all those issues that may be present in the data,

for example missing data or other irregularities in our data. Finally,

I, or rather we, would suggest global-local models as a safe starting point for modeling applications

producing data similar to the ones that we have described to you so far.

So of course we have discussed the challenges. I also want, lastly, to point you to our tutorial paper,

which is the one written here at the bottom, and also the library Torch Spatiotemporal. I hope you find both of them useful; we are pretty proud of both of them, and if you have any feedback on them, please also write to us. So I hope in the end to have

brought to you some interest in these topics, in all these models and also these solutions.

I think there is huge room for exchange of ideas with related fields, so please reach out to us anytime if you have ideas that you want to talk to us about, and here I leave you with a link to our group, the Graph Machine Learning Group in Lugano,

Switzerland, where you can also find a list of publications,

all the materials of this tutorial, and of course also our contact information.

So, at last, don't forget to fill in the feedback form for the tutorial, which you can find on the Slack, right, and probably we can also post it here in the chat. So thank you everyone; if you still have any questions, please ask, and I'm free to address them

the best I can. Okay, yes, so the feedback form is in the chat right now.

There's a question that came up in the Q&A if you want to take a look. The Q&A on Slack, you mean?

On the Zoom Q&A, I can read it out loud and paste it in the chat。

The question is in scenarios where multiple time series have irregular time intervals。

some occurring more frequently than others, what are the recommended approaches for data representation and graph neural network architecture to effectively learn interdependencies among these time series?

😊,Okay, do you want to answer? Okay, so it very much depends, maybe you have other takes on it, it very much depends on the level of the scale,

if you will, that you require your predictions to be at. For example,

if you really need fine-grained predictions, then it's probably easier to create some sort of upscaling of the less frequent time series,

but this need not be the case, because we have seen, for example,

when you're trying to combine, say, covariates or exogenous variables, like those coming from the weather, or you want to encode in a traffic prediction model

maybe the hour, so as to identify peak hours, but maybe minutes are not so relevant; so in the end what you bring in is just an upscaling of that information, and you use it to make your predictions, and in that sense it is not necessarily a problem.

Of course, if each time series is less regular, so it produces observations asynchronously or at very irregular time steps, then there are models that are probably more appropriate to address them; these could be, for example, continuous-time models, which try to address these irregularities, so they take data as they come and then they propagate

along the temporal dimension, but they don't need such regular information coming all the time. Here I'm assuming the time series are not at the same sampling rate, right? Yeah, yeah, I pretty much agree. So if you have some preprocessing that you can do to deal with the irregularity, and this preprocessing can also be something as complex as fitting some imputation model or some rescaling, or as simple as an upsampling, if it's just a matter of having different frequencies of observation, then it's fine. If it's something more structural, if the observations are by nature intermittent or something like that, then continuous-time models are the way to go. For example, in traffic forecasting, the data that we have is actually

the passage of a car under a certain sensor, I don't know how to call it, so you have, yeah, in principle asynchronous observations, but in the end the data that we have available collapses the information within, mostly I would say, five-minute windows, yeah, five-minute windows, and then they count how many cars have passed through that lane within five minutes; so in the end this is another way to actually downsample the data that you have. It depends a lot on the application, I would say, but at the other end of the spectrum, if we are trying to deal with a scenario more similar to temporal graphs, then probably those techniques are not the most appropriate,

and then continuous-time models, I think, have an edge. Awesome, there's also one more question, but I'm not sure if it has already been answered; I put it in the chat again. They say: when can it be beneficial to learn an adjacency matrix that depends on time as well, for example where we drop or create new edges between time steps, and are there works that do that?

Okay. Yeah, I think the NRI paper, the Neural Relational Inference paper that was cited in the slides and was the graph-learning approach based on the reparametrization trick: in that one, the graph that you're learning is conditioned on each window of observations, so it is dynamic, in the sense that the actual output of the model is conditioned on the observations at specific time steps.

Regarding when these can be beneficial, again, it's really problem dependent.

Like, if you have a physical system where the position of the sensors, for instance,

changes over time, then having something like that would be really useful, I mean.

The only case where it would make sense to consider not learning this kind of dependence is if what you are modeling is instead something more static;

then it might be okay to just learn something which is static. Another, I think, important dimension of this is how fast these changes in the topology are happening, because if changes are happening slowly over time,

then what you could do is something similar to continual learning, where the graph that you learn is static,

but you also have some mechanism to have it adapt,

so that after a while you can check whether that kind of topology is still valid, still giving you adequate performance, and if not you can update it. And yeah.

Awesome, any final questions from the audience? All right, if not,

thank you all for the fantastic tutorial; we really appreciate the time and care you've clearly put into making this tutorial so accessible and informative.

Please feel free, everyone, to continue discussions in the tutorial discussion channel or by directly contacting the organizers. As a reminder, the tutorial feedback form is linked in the chat; we really appreciate any feedback that you may have, and we will also share this feedback with the organizers themselves.

😊,So that brings us to the end of this first part of today's conference.

Please join us later for a keynote from Zavia Breen and the oral presentations. Thank you very much, have a great rest of your day.

Thank you, thank you, thank you all. Bye.

Graph Machine Learning Conference: P07: Tutorial on Graph Deep Learning for Time Series Processing

Overview

In this tutorial, we learn how to combine graph deep learning with time series processing techniques in order to handle multivariate time series data with complex dependencies. We start from the problem definition, introduce graph-based representations, build spatiotemporal graph neural networks, and discuss challenges and solutions encountered in practical applications.


1: Problem Definition and Graph-Based Representations

In the previous section we outlined the tutorial contents; in this section we look at the concrete problem setting.

We assume we have a collection D of n correlated time series. Each time series can be multivariate, i.e., there can be multiple observations at every time step. Besides the target time series, we can also have covariates, which can be dynamic or static attributes.

We use capital letters to denote representations stacked across the collection of time series, and we call the dimension spanning the collection the spatial dimension.
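To fix ideas, here is a small sketch of one plausible array layout for such a collection (the shapes and variable names are illustrative assumptions, not notation from the tutorial):

```python
# Illustrative layout of a collection of n correlated time series with covariates.
import numpy as np

T, n, d = 1000, 207, 1        # time steps, number of time series, channels per step
X = np.random.randn(T, n, d)  # target observations, stacked along the spatial dimension
U = np.random.randn(T, n, 4)  # dynamic exogenous covariates (e.g., time-of-day encodings)
V = np.random.randn(n, 3)     # static node-level attributes (e.g., type of road lane)
print(X.shape, U.shape, V.shape)
```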

What are correlated time series? We assume the data are generated by an underlying stochastic process and that causal dependencies exist among the time series. This means that predictions for a particular time series can be made more accurate by taking into account the observations of the correlated time series in the collection.

Furthermore, we assume that all time series are homogeneous, i.e., the observations of every time series have the same physical meaning, and that observations are synchronous and regular in time. We do not assume that every time series is generated by the same process.

Although these assumptions may sound restrictive in practice, they can usually be satisfied through appropriate preprocessing steps and can be relaxed in the second part of the tutorial.

Example: traffic monitoring systems
Traffic monitoring systems are a classic example of this kind of scenario. Imagine many sensors distributed over a traffic network, each acquiring traffic measurements at its location, such as the average speed or the number of vehicles. Exogenous variables can be date-time information or weather forecasts. Static covariates can include information such as the lane type or the presence of traffic lights. In these scenarios, strong dependencies exist among the time series in the collection, and these dependencies reflect the structure of the road network.


2: Forecasting Task and Model Taxonomy

In the previous section we defined the problem; in this section we look at one of the core tasks: forecasting.

The goal of forecasting is straightforward: from a window of past observations, we want to predict the observations of the next h time steps. Specifically, we want to do this by learning a parametric model of the data-generating process.

In this tutorial we focus on point forecasters, although probabilistic approaches can also be considered. Point forecasters are usually trained by minimizing a cost function that measures the prediction error. Using different cost functions yields predictions of different quantities: for example, using the mean squared error (MSE) predicts the mean of the stochastic process, while using the mean absolute error (MAE) predicts its median.
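This mean-versus-median behavior is easy to verify numerically; the following small check (purely illustrative, with a made-up sample) fits a single constant prediction under each loss:

```python
# Fitting one constant under MSE recovers the mean; under MAE it recovers the median.
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])           # skewed sample of observations
grid = np.linspace(0.0, 12.0, 12001)               # candidate constant predictions
c_mse = grid[np.argmin([np.mean((y - c) ** 2) for c in grid])]
c_mae = grid[np.argmin([np.mean(np.abs(y - c)) for c in grid])]
print(c_mse, y.mean())      # ~3.7, the sample mean
print(c_mae, np.median(y))  # ~2.5, the sample median
```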

Now, when discussing forecasting architectures, the first important distinction we need to make is between local models and global models.

Local models are the family of models traditionally used to forecast time series over the past decades, for example with statistical approaches like the Box-Jenkins methodology. The idea here is that the model's parameters are specific to each single time series you want to process. The advantage of this approach is that, since the model is tailored to a specific series or a specific process, it can easily handle heterogeneity in the dynamics of the time series in the collection. The drawback is inefficiency, both computational (many different models need to be trained) and in terms of sample complexity (only the samples of a single time series are used to fit each model).

At the other end of the spectrum we have global models, which is the approach usually adopted in deep learning. Here we have a single model trained on many time series from different sources. This is similar to what people nowadays call foundation models, but usually with a more limited scope, for example restricted to time series from a specific domain. The advantage here is clearly sample efficiency, and this sample efficiency allows building more complex architectures to process the input time series.

The common drawback of both approaches, in their standard implementations, is that they ignore the dependencies that may exist among the time series.

Besides the graph-based representations we will discuss later, the literature offers some approaches to deal with this. The most direct is to treat the input collection of time series as one very large multivariate time series, but this clearly has severe scalability problems, since it suffers from the curse of dimensionality, leading to high sample complexity and poor computational scalability.

We can also consider models that operate on the collection of time series while keeping the model parameters shared across the series. Attention-based architectures such as the Transformer are an example of this class of models; in this case, attention is computed with respect to the spatial dimension rather than the time axis. This approach can work well, but the drawback is that we are not exploiting any prior knowledge about the dependency structure or its sparsity.

Other approaches used in the literature rely, for example, on dimensionality reduction. The idea is to extract some shared latent factors from this large collection of time series and use these factors to condition a global model. If the data have a low-rank structure, as in some applications, this approach can work well. The drawback is that we lose the local, fine-grained information that graph-based methods can capture well. Of course, with very large datasets these methods can also suffer from the same scalability issues as the others.


3: Graph-Based Representations and Spatiotemporal Graph Neural Networks

In the previous section we introduced forecasting models; in this section we look at how to use graphs to represent and process the dependencies among time series.

As mentioned before, the idea is to use a graph to represent the functional dependencies among the time series and to use this graph as an inductive bias for our learning model.

We can model these dependencies with an adjacency matrix, which can be asymmetric and dynamic, i.e., it can change from time step to time step. Besides the adjacency matrix, we can also have edge attributes, which themselves can be dynamic and can be categorical or numerical.

Going back to the traffic example, the structure of the graph (the adjacency matrix) can be extracted from the structure of the road network, and attributes or weights associated with the edges can encode road distances. In this case, a dynamic topology can help account for modifications of the traffic network structure.

This is a summary of all the information available at each time step: we have the target time series, exogenous variables that can be dynamic or static, plus the relational information we have just added to the setting.

The idea now is to use this relational side information to condition our forecaster (the forecasting architecture). As we will see, these relations can act as a regularizer, localizing predictions with respect to each node. In particular, these relations can be used to remove spurious correlations that might arise if this structure were not taken into account. Moreover, these methods are more scalable than standard multivariate models, because we can keep the model parameters shared across the time series being processed. In fact, we can use this kind of architecture to forecast and process any subset of the correlated time series.

Concretely, the graph neural networks developed to process these data are called spatiotemporal graph neural networks, indicating that in these models propagation happens across both time and space. We will focus in particular on models based on the message-passing framework.

We will do this by considering a template architecture, consisting of an encoder that simply encodes the observations at each time step and for each node, followed by a stack of spatiotemporal message-passing layers, which are the only components of the architecture where propagation across time and space happens. The representations extracted by the spatiotemporal message-passing blocks can then be mapped to predictions by a decoding block.

Looking at finer detail, for the encoder, processing happens at the level of a single time step and a single node. The resulting sequence of observations is then processed by the spatiotemporal message-passing layers. The decoder again operates at the level of a single node and time step, mapping these representations to predictions.

The spatiotemporal message-passing blocks can have several implementations; we can view them as a generalization of standard message-passing layers in which, instead of each node having an associated static representation, it has an associated sequence of representations. So what we need to do is modify the standard operators composing the message-passing architecture so that they can handle sequences.

Clearly, there are many possible designs that can do this, and they can be matched to the requirements of each specific application.

The next step is to describe possible design paradigms for these spatiotemporal blocks, but before moving on, this is a good time for questions or clarifications.


4: Design Paradigms for Spatiotemporal Blocks

In the previous section we introduced the basic architecture of spatiotemporal graph neural networks; in this section we look at the different design paradigms for their core component, the spatiotemporal blocks.

We can distinguish several design paradigms:

  1. Time-and-space models: in these models, temporal and spatial processing cannot be factorized into separate steps; they happen jointly.
  2. Time-then-space models: as the name suggests, they factorize the processing of the temporal and spatial dimensions into two separate steps.
  3. Space-then-time models: essentially the reverse order of time-then-space models.

For time-and-space models, as mentioned before, representations are propagated through time and space in a way that makes the processing of the resulting model architecture impossible to factorize into separate stages.

There are several ways to implement these architectures. For example, one standard approach is to integrate message passing into a sequence-modeling architecture. Another approach is to use sequence-modeling operators directly to compute the messages within a single layer. Finally, there are product-graph representations, which are basically a way to obtain a static graph representation from this collection of time series, so that we can use standard message-passing operators on it.

The product-graph representation stems from a simple observation: you can view the temporal dimension as a line graph and combine it in some way with the spatial graph. The resulting representation is then processed again with standard message-passing networks. For example, you can consider a standard Cartesian product, where each node at every time step is connected to its neighbors and to its own state at the previous time step. There are other ways of wiring this graph, for example the Kronecker product, where the idea is to connect each node to its neighbors at the previous time step. You can come up with many different ways of connecting this graph.
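As a concrete illustration of the two product constructions just mentioned, here is a small sketch (illustrative only; the node counts and the random spatial graph are made up):

```python
# Building Cartesian and Kronecker space-time product graphs from a spatial
# adjacency A (N x N) and a directed temporal line graph L (T x T).
import numpy as np

N, T = 4, 3
A = np.random.rand(N, N) < 0.4
A = np.triu(A, 1)
A = (A | A.T).astype(int)                            # simple symmetric spatial graph
L = np.diag(np.ones(T - 1), k=1).astype(int)         # temporal line graph (t -> t+1)
I_N, I_T = np.eye(N, dtype=int), np.eye(T, dtype=int)

# Cartesian product: spatial edges within each step, plus self-links across steps
A_cartesian = np.kron(I_T, A) + np.kron(L, I_N)
# Kronecker product: each node connects to its spatial neighbors at the next step
A_kronecker = np.kron(L, A)
print(A_cartesian.shape, A_kronecker.shape)          # both (N*T, N*T)
```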

Then there are time-then-space approaches, which are also easy to understand, and the resulting models are easy to build. The idea is to embed each sequence separately into a vector representation and then perform message passing on the resulting graph. The advantage is that this is very easy to implement and computationally efficient, because at training time you do not need to perform message passing at every time step, but only at the last one. We can also reuse all the operators we already know for processing sequences and graphs. The drawback of this approach is that this two-step processing can introduce a very severe information bottleneck and can make some problems we commonly encounter on graphs, such as over-smoothing and over-squashing, even worse. Moreover, since you perform message passing only on a single graph representation, it also becomes harder to account for changes in topology and dynamic edge attributes.

Then there are space-then-time approaches, which are basically the other way around: you perform message passing at each time step separately and then encode the sequence of representations with a sequence model. These approaches are often used in the literature, but they still suffer from this bottleneck and this factorized processing that may make information propagation harder, and they do not have the computational advantage of time-then-space models, since message passing needs to be computed for every single time step.


5: Globality, Locality and Hybrid Architectures

In the previous section we discussed the different design paradigms; in this section we look at the important role of globality and locality in spatiotemporal graph neural networks.

At least in the template implementation we have discussed, spatiotemporal graph neural networks are global models, so they can process arbitrary sets of nodes and are inductive models. They can use the dependency structure to provide further conditioning for the predictions.

However, this additional conditioning may not be enough, and the model may struggle to capture all the heterogeneous dynamics that may exist in the collection of time series. As a consequence, the model may need a very long observation window and very high model capacity to account for all the different heterogeneous dynamics that might be present.

One workaround followed in the literature is to consider hybrid global-local architectures.

A straightforward way to implement a hybrid architecture is, looking again at our template architecture, to turn some architectural components into local ones. For instance, you can make the parameters of the encoder and decoder specific to the time series that will be processed there. Clearly, the resulting model will be able to capture these local effects more easily. You can see this as analogous to using a backbone model with some layers fine-tuned or specialized for the kind of data you want to process. Of course, the drawback is that if you want to process many time series with this approach, it results in a large number of local parameters, depending on how these encoder and decoder blocks are implemented.

Node embeddings are one way to mitigate this cost: node embeddings are simply learnable parameter vectors associated with each node. You can think of these node embeddings as a kind of learnable positional encoding, but what they actually do here is not encode the position of the node in the graph; rather, they are local components responsible for modeling the specific characteristics of each time series. These learnable vectors can be fed into the encoder and decoder and, as noted, they amortize the learning of these local, time-series-specific processing blocks, allowing most of the model parameters to remain shared. Clearly, we still have the drawback that the number of learnable parameters scales linearly with the number of time series we want to process. Some intermediate solutions can therefore be considered, such as learning embeddings for clusters of time series rather than for every single series.

Another very important issue to consider here is that when we move to hybrid architectures, the resulting model is no longer inductive. This sounds like a drawback, and indeed it is, since you certainly lose some flexibility, but in practice you also gain something in transfer learning scenarios. In scenarios where you can use some data from the target domain to fine-tune the model, having a hybrid architecture allows you to keep the shared parameters of the model (i.e., most of the global components) shared and fine-tune only the local parts (in this case, only the embeddings). In particular, the literature shows that regularizing these node embeddings can further facilitate this transfer learning process.


6: Software Demo and Library Introduction

In the previous section we explored the theoretical architectures; in this section we look at how to actually build spatiotemporal graph neural networks with a software library.

We will use an open-source library developed in our lab called Torch Spatiotemporal (TSL). TSL is not just a collection of layers; it is designed to cover the whole pipeline from data acquisition to downstream tasks. It includes tools for loading datasets, applying preprocessing (such as scaling and normalization), handling missing data and irregularities, as well as all the components for modeling and inference.

Overview of the demo steps:

  1. Installation and imports: install the necessary dependencies and import the library.
  2. Loading data: use TSL to load a traffic dataset (e.g., METR-LA), which contains traffic speed data from 207 loop detectors in Los Angeles, sampled every 5 minutes.
  3. Data exploration: inspect the basic information, statistics, and covariates of the dataset (such as distances between sensors).
  4. Graph construction: when no relational graph is readily available, a graph can be built from the data; for example, compute a similarity matrix from sensor distances with a Gaussian kernel and turn it into a sparse adjacency matrix (edge-index format) via thresholding, removal of self-loops, normalization, and so on (see the code sketch right after this list).
  5. Creating the spatiotemporal dataset: split the long time series into sliding-window samples for model training. This involves defining the look-back window length, the forecasting horizon, and the stride.
  6. Defining the model: we will build a custom time-then-space model (a minimal code sketch of such a model appears at the end of this section). The model will include:
    • A node embedding table, used to capture the local characteristics of each time series.
    • An encoder: a simple feature-encoding step (such as a linear layer or MLP) applied to every node and time step.
    • A temporal part: a sequence model (such as a GRU) that processes each node's time series and takes its last hidden state as that node's temporal representation.
    • A spatial part: message passing over the graph structure (for example, using diffusion convolution) to aggregate neighbor information.
    • A decoder: an MLP that maps the aggregated node representations to predictions.
  7. Training and evaluation: set up the training loop, validation, and testing with the PyTorch Lightning framework. We will evaluate model performance with common forecasting metrics (e.g., MAE, RMSE).
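The following sketch illustrates the graph-construction step (step 4); it is illustrative only, with hypothetical sensor positions, and is not the exact preprocessing shipped in TSL:

```python
# Turning pairwise sensor distances into a sparse adjacency via a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
coords = rng.random((207, 2)) * 10.0                 # hypothetical sensor positions
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

sigma = dist[dist > 0].std()
W = np.exp(-(dist ** 2) / (2 * sigma ** 2))          # Gaussian kernel similarity
W[W < 0.1] = 0.0                                     # threshold small weights
np.fill_diagonal(W, 0.0)                             # remove self-loops

edge_index = np.array(np.nonzero(W))                 # (2, num_edges) COO edge index
edge_weight = W[edge_index[0], edge_index[1]]
print(edge_index.shape, edge_weight.shape)
```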

This demo illustrates the complete pipeline of building and training a spatiotemporal graph neural network with the TSL library, highlighting its ease of use and flexibility.
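To make step 6 of the demo concrete, here is a minimal time-then-space forecaster written directly in PyTorch (an illustrative sketch with made-up layer sizes and a dense random adjacency; TSL provides more complete, ready-made versions of these building blocks):

```python
# Sketch of a time-then-space model: per-node encoder + GRU, then one round of
# message passing over a dense, row-normalized adjacency, then an MLP decoder.
import torch
import torch.nn as nn

class TimeThenSpaceModel(nn.Module):
    def __init__(self, n_nodes, in_dim, hidden, horizon):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(n_nodes, hidden) * 0.01)  # node embeddings
        self.encoder = nn.Linear(in_dim + hidden, hidden)             # per node and step
        self.gru = nn.GRU(hidden, hidden, batch_first=True)           # temporal part
        self.mp_lin = nn.Linear(hidden, hidden)                       # spatial part
        self.decoder = nn.Sequential(nn.ReLU(), nn.Linear(hidden, horizon))

    def forward(self, x, adj):
        # x: (batch, time, nodes, features); adj: (nodes, nodes), row-normalized
        b, t, n, _ = x.shape
        e = self.emb.expand(b, t, n, -1)
        h = torch.relu(self.encoder(torch.cat([x, e], dim=-1)))
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, -1)
        h = self.gru(h)[1][-1].reshape(b, n, -1)       # last hidden state per node
        h = h + torch.relu(adj @ self.mp_lin(h))       # one round of message passing
        return self.decoder(h)                         # (batch, nodes, horizon)

model = TimeThenSpaceModel(n_nodes=207, in_dim=1, hidden=32, horizon=12)
x = torch.randn(8, 24, 207, 1)
adj = torch.softmax(torch.randn(207, 207), dim=-1)
print(model(x, adj).shape)  # torch.Size([8, 207, 12])
```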


7: Challenge 1: Scalability

In the previous section we went through the software practice; in this section we look at the first major challenge when applying graph deep learning to time series: scalability.

Scalability is actually a feature, because with graphs we can have a single inductive model (a global model) while still conditioning on the correlated time series in a sparse way. We do not feed the model the whole collection of time series, nor just one of them, but rather the time series we believe are most relevant for predicting the target series. By doing so, we reduce the usual cost of attending to all other time series from quadratic in the number of time series to linear in the number of edges, i.e., in the dependencies we have identified.

But scalability can also be a problem, because the data span two dimensions: the first is the spatial dimension (the number of time series we have) and the other is the temporal dimension (the number of time steps in each series). In real-world applications it is not uncommon to deal with high-frequency time series and large collections, for example in traffic forecasting or environmental monitoring in smart cities, or in finance.

The usual problem is that we have a lot of data that we would like to process together, especially when we want to account for long-range dependencies that may exist in time, in space, or in both.

Computational complexity of spatiotemporal graph neural networks
If we consider a generic formulation of a time-and-space model, they typically require some node-wise temporal processing and also have a spatial processing cost that usually scales with the number of time steps. For example, for graph-based recurrent neural networks, at each time step we perform L layers of message passing, so the number of message-passing operations is multiplied by the number of input time steps.

A first step towards improving this scalability is offered by time-then-space models, where only the temporal processing remains node-wise and message passing is performed only on a single graph (as if we had a single time stamp). In this case, the computational complexity is additive in the number of edges and time steps rather than multiplicative. As we have seen before, space-then-time models do not have this advantage.

Further solutions
When the graph is very large or there are very long-range dependencies, this may not be enough. Other possible solutions include:

  • Subgraph sampling: consider subgraphs of the full network, e.g., select a subset of target nodes and then take only the ego-graph of the neighborhood up to a given order K for this subset, and train only on it. Another option is to reduce the number of edges via graph rewiring. These methods are essentially borrowed from the static-graph community. The issue is that subsampling may destroy long-range dependencies if we use a very small K, while with a large K we may end up with the original graph or something close to it. Another drawback is that the learning signal can be noisy.
  • Precomputation: move part of the computation before training (a minimal sketch of this idea follows this list). For example, the SIGN architecture precomputes certain representations of the node neighborhoods, which then allows us to sample the obtained features as if they were IID samples. An extension to the spatiotemporal setting is GraphGPS, where, in addition to precomputing node features, we also pre-encode each node's history through a state network (such as a deep RNN). We can then propagate these encodings over the graph using powers of a given graph-shift operator. This reduces the cost of a training step, making it independent of the window length, the number of nodes, or the number of edges. The drawback is that we now extract more and more features, so the input vector to the downstream network is much larger than the initial time series, and the approach depends more on hyperparameter choices.
  • Coarser-grained representations: at each processing step, we can reduce the temporal and spatial resolution of the input. For space, we can rely on graph pooling techniques, which reduce the size of the graph by associating subsets of nodes with super-nodes in a new graph. Doing so reduces the number of operations needed to reach the same receptive field as on the original graph, but it also introduces bottlenecks in information propagation.
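A minimal sketch of the precomputation idea (illustrative assumptions throughout: a random graph, a random per-node history encoder, and made-up sizes): encode each node's history once, then propagate the encodings with powers of a graph-shift operator before training, so the downstream model can treat rows as IID samples.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, H, K = 50, 200, 16, 3              # nodes, steps, encoding size, propagation order
X = rng.standard_normal((N, T))

# stand-in for a (possibly pretrained or random) temporal encoder applied per node
W_enc = rng.standard_normal((T, H)) / np.sqrt(T)
Z = np.tanh(X @ W_enc)                   # (N, H) history encodings

A = (rng.random((N, N)) < 0.1).astype(float)
S = A / np.maximum(A.sum(1, keepdims=True), 1)   # row-normalized graph-shift operator

features, Zk = [Z], Z
for _ in range(K):
    Zk = S @ Zk                          # k-hop propagated encodings
    features.append(Zk)
features = np.concatenate(features, axis=1)      # (N, (K+1)*H) precomputed inputs
print(features.shape)
```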

8: Challenge 2: Handling Missing Data

In the previous section we discussed scalability; in this section we look at the second challenge: handling missing data.

So far we have assumed we are dealing with complete sequences, i.e., at every time step and every node we observe a valid value. Of course, this is not the case in real-world applications; due to sensor failures (which may be temporary or permanent), asynchrony between time series, or other errors, we typically have missing data.

The problem is that most forecasting methods do not account for this; they assume complete sequences. We therefore need a way to fill in these missing values, i.e., to reconstruct the missing data in the input.

The time series imputation problem is exactly the problem of estimating missing observations in a time series. We have an auxiliary binary mask variable M indicating whether a value is missing or available. Ultimately, we want to provide an estimate for all values where the mask is zero.

Types of missing data

  • Point missing: similar to the missing-completely-at-random case; the probability that a point is missing is the same across nodes and time steps.
  • Block missing: the distribution of missingness is not independent of the missing data at other nodes or time steps. For example, we may have temporal blocks of consecutive missing values, or blocks of partially missing data caused by outages in a region.

When there are missing values, we need to make some adjustments when optimizing the model parameters. We only want to compute the loss on the valid observations we actually have. For example, the MSE loss is then weighted by the mask, to indicate that we only consider the values that are truly available. This loss is typically used for forecasting with missing values, or to compute the reconstruction loss on the data we do have when doing imputation. Sometimes, to train or evaluate our models, we may need to inject some missing values so that we obtain ground-truth labels; of course, these data must not be used by the model to obtain the imputed values.
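As a small illustration of such a mask-weighted loss (TSL ships masked metrics of this kind; the version below is a simplified sketch with made-up tensor shapes), only positions where the mask is 1 contribute to the objective:

```python
import torch

def masked_mae(y_hat, y, mask):
    # mask: 1 = observed, 0 = missing; missing positions are excluded from the loss
    err = torch.abs(y_hat - y) * mask
    return err.sum() / mask.sum().clamp(min=1)

y = torch.randn(32, 12, 207, 1)
y_hat = torch.randn_like(y)
mask = (torch.rand_like(y) > 0.2).float()
print(masked_mae(y_hat, y, mask))
```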

Deep learning approaches

  • Autoregressive models, e.g., RNNs: when we encounter a missing value, we use the prediction of a recurrent neural network to impute it. This effectively exploits all the past information available for a single node and, with a bidirectional model, also future observations. However, if we have a block of missing values, we keep using the RNN's own predictions to update and provide estimates for those values, so errors may accumulate along the temporal block. Another drawback is that this approach struggles to capture the nonlinear spatial and temporal dependencies that may exist in the data.
  • Exploiting relational information: we can use the relational information we have to condition the model. For example, the GRIN model integrates graph processing into the autoregressive approach we just saw. Message passing allows us to take correlated concurrent observations into account. This is very powerful when a given node has a long sequence of missing values but information is still available at neighboring nodes.

Imputation is often used as a preprocessing step for a downstream task (e.g., forecasting). This is usually necessary because forecasting models expect complete sequences as input, so the typical pipeline is to first impute the missing values and then forecast. Of course, this can introduce biases due to errors in the imputation.

Another use case is to use an imputation model in place of a forecasting model. We can imagine having a sequence longer than the one we have, but full of missing values. In this case we can also adopt imputation methods that were not inherently designed for this purpose. This is of course a workaround, and performance may be poor due to the lack of values; if the method we use relies strongly on the values (e.g., on the boundaries), this is not a good option.

A more direct approach is to avoid this reconstruction step and process the irregular observations directly. In this case, we use a forecasting spatiotemporal graph neural network specifically designed to handle possibly incomplete values. There are several benefits: we can directly learn how to exploit only the valid observations, we do not need to impute or carry around some estimate of the missing values, and this is done specifically for the downstream task at hand. Moreover, we also avoid the computational burden typically brought by the imputation preprocessing step.

Besides imputation, another important topic to consider is virtual sensing, i.e., the practice of using the data in our model to estimate states that cannot be measured. Given a set of nodes, we may want to know the corresponding observations at a node for which we have no data. Here we can see the power of graphs in doing this, because we can exploit the relational dependencies to condition the estimate; this essentially conditions the estimate on data that are close in some space, in this case the sensor space. Thanks to the inductive nature of message passing, we can introduce new nodes and edges very easily. This is also useful in applications where deploying physical sensors is costly.


9: Challenge 3: Latent Graph Learning

In the previous section we discussed missing data; in this section we look at the third challenge: latent graph learning.

Everything we have seen so far relies on actually having some relations connecting our time series, but what if such information is not available? We can imagine scenarios where we have multiple time series but no relations are given; or we have some time series and some relations, but we do not know all of them; another scenario is that the information given to us is not trustworthy, so we cannot assume it is reliable enough, and we want to learn the relations from data. Being able to learn relations from the time series we have thus allows us to apply everything seen so far to these other scenarios as well.

Learning such relations is important because, to really rely on graph neural networks, we expect the graph extracted from data to be sparse in some sense, so that we can rely on all the advantages introduced earlier, such as low computational cost and scalability (linear in the number of edges rather than quadratic in the number of nodes). To some extent this can also act as a regularizer on our attention mechanism: we can remove or discard the many connections we know or find to be irrelevant for improving predictions and focus only on the most relevant ones.

Approaches to latent graph learning

  1. Similarity-based methods: compute some similarity between all pairs of time series (e.g., Pearson correlation or Granger causality), construct a matrix storing all these similarities, and then possibly apply some thresholding to obtain a sparse graph.
  2. Graph-signal-processing methods: assume that the graph in some sense underlies the realizations of the time series we have available. These usually rely on assumptions such as signal smoothness and try to recover the topology that determines our observations by optimizing a specific loss (such as the total variation of the signal).
  3. Task-oriented latent graph learning: treat the problem in a more integrated way, where we learn the relations so that our model ultimately performs best on the task at hand. We optimize the downstream task loss and possibly want end-to-end learning. This differs from the previous methods, where we essentially first train a model or learn the relations and then apply them, or consider them alongside the model being optimized.

For task-oriented latent graph learning there are two main approaches:

  • Direct approaches: the graph is modeled as a real matrix A ∈ R^(n×n), which we try to optimize to maximize downstream task performance. The graph A can be modeled as a function C of some edge-score parameters Φ; Φ can be a matrix of learnable parameters or a function of the input data. The function C is used to enforce different kinds of structure (e.g., a binary matrix obtained via thresholding, a K-NN graph, a tree, and so on). To deal with the number of parameters scaling with the square of the number of nodes, factorizations are typically exploited, e.g., using embedding matrices Z_s and Z_t for source and target nodes, from which the parameters Φ are obtained through a product or a function.
  • Probabilistic approaches: learn a random variable A distributed according to some parametric distribution P_Φ. Typically, this allows A to be a discrete object (e.g., a binary matrix). The key challenge is retrieving a sparse A within gradient-based optimization. This is usually done through the reparametrization trick or score-function gradients. The reparametrization trick allows gradients to flow to the parameters Φ through a function g, but it usually requires continuous relaxations, so the computation during training cannot be sparse. Score-function gradients allow the computation to remain sparse, since A can stay discrete and sparse while the gradient is shifted to differentiating the log-likelihood, but they may suffer from high gradient variance (a minimal sketch of the reparametrized variant follows this list).
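The following sketch illustrates a factorized edge-score parametrization combined with a binary-concrete (Gumbel-style) relaxation, i.e., the reparametrization-trick route that trades sparsity during training for low-variance gradients; all names and sizes are illustrative assumptions, not the tutorial's exact formulation:

```python
# Factorized edge scores Phi = Z_s @ Z_t^T with a relaxed, differentiable sample of A.
import torch

n, d, tau = 30, 8, 0.5
Z_s = torch.nn.Parameter(torch.randn(n, d) * 0.1)   # source-node embeddings
Z_t = torch.nn.Parameter(torch.randn(n, d) * 0.1)   # target-node embeddings

def sample_soft_adjacency():
    phi = Z_s @ Z_t.t()                              # edge scores (log-odds)
    u = torch.rand(n, n).clamp(1e-6, 1 - 1e-6)
    logistic = torch.log(u) - torch.log(1 - u)       # logistic noise
    return torch.sigmoid((phi + logistic) / tau)     # dense but differentiable sample

A_soft = sample_soft_adjacency()
print(A_soft.shape, A_soft.requires_grad)            # (n, n), True: gradients reach Z_s, Z_t
```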

The main purpose of learning a graph distribution is usually just to obtain a discrete or sparse graph from the learning process, but since we do have a probabilistic model, we can also look at the learned edge probabilities and see whether they provide valuable insights for our application, enabling explainability and better-informed decision making. One interesting issue is that, since the graph is a latent variable, in real data we have no observations of it, which makes it very hard to assess whether the edge probabilities extracted from data through the end-to-end learning process are meaningful in any way. We can study learning guarantees from a theoretical perspective to somehow bypass this problem. We found that if we can minimize certain losses (losses that are zero only if the two distributions are identical), then under certain assumptions the probability distribution associated with the latent variable must also equal the true distribution that generated the observed data. This is reassuring: operating with graphs in this way can indeed provide interesting and reliable edge probabilities.


10: Challenge 4: Model Quality Assessment and Future Directions

In the previous section we explored latent graph learning; in this section we look at the last challenge, model quality assessment, and briefly discuss future directions.

We have all these models, raw data, and predictions, and the final question is of course: how good is my model? Our loss already provides a measure of goodness of fit, but as the number of time series grows, as their length grows, and as intricate interdependencies exist among them, the single number produced by the training process can be too limited to understand what is going on and whether everything is working as expected. Beyond asking whether our model is good, we also want to ask whether it is optimal, and whether there are regions or aspects (e.g., the temporal or the spatial processing) that would benefit from further design improvements. Finally, once we have identified where the model may need improvements, we need solutions or guidelines to act on and actually make those improvements.

Optimality generally depends on the criterion you choose, for example optimizing a predictive loss such as the mean absolute error, but it need not be like that.

Correlation analysis
Studying correlations among the time series is a way to understand whether, in our predictions or in the data we are trying to model, there is information we were not able to extract from the data. Specifically, if we see dependencies among the residuals (the difference between predictions and actual observations), it means there is structural information left in the residuals that our model was unable to extract. In this sense, there is room for further improving the model, because structural information remains in the data. Correlation analysis is therefore independent of the kind of performance metric we are interested in. Of course, it does not tell us how much we can improve; it just says there is some correlated data, so we could probably do better. But in the end this is also interesting because it does not need to be a comparative analysis: we do not need two models to say which one is better; it is an absolute answer in some sense.

Most research so far has focused on serial correlation (correlation along the temporal dimension), but there are also works studying correlation along the spatial dimension. The approach presented here is spatiotemporal, addressing space and time at the same time. It works as follows: from our data we have, at every time step, a graph and observations, so we can construct a spatiotemporal graph by connecting edges not only along the spatial dimension but also along the temporal one. By looking at a simple sign statistic (the sign of the inner product of residuals), we design a very simple and direct test for whether some correlation is left in the data. The test splits into two parts, one addressing spatial correlation (the part in red) and the other addressing only temporal correlation. While in principle studying correlation would require looking at all possible correlations among the residuals (quadratic in the number of nodes and time steps), here we focus only on the most relevant ones, i.e., the connections most likely to lead to correlation. This is why graphs allow us to design statistically powerful tests even though the data dimensionality grows a lot. Finally, thanks to the sign, this yields distribution-free statistical tests that scale linearly in the number of nodes. The test can therefore be applied globally at graph level, considering space and time altogether; it can be split to look only at space or only at time; and it can go into finer detail by looking at single nodes, single time steps, or even more localized regions in both space and time.

Future directions

  1. Hierarchical modeling: consider higher-order dependencies by creating hierarchical structures, such as pooling of nodes or even pooling along the temporal dimension. This allows us to consider coarser-grained scales while also reaching information farther apart in the data.
  2. State-space models: they have the intrinsic advantage of being Markovian in some sense; observations or predictions are made only from the current state of the system, which is updated from the previous state at every time step and can also be driven by an input graph. Decoupling the three dimensions (input, state, and output) allows us to have different kinds of graphs with different relations, which can be used for hierarchical modeling or for other purposes, with different graphs for different goals.
  3. Inductive learning: several issues can occur in the monitored environment, e.g., topology changes or new nodes added to the network (for which, for example, we do not yet have node embeddings), so we need some adaptation to apply the model to these new settings, or even transfer the entire model to a new network.
  4. Benchmarking: another topic that needs to be addressed. There are some large open datasets, mainly covering energy and traffic flows, which this kind of model was designed for from the beginning, but this does not yet provide a comprehensive benchmarking environment. There is also some software, such as Torch Spatiotemporal, but it is not yet close to the level of the Open Graph Benchmark.

Summary

In this tutorial we learned a framework for modeling time series that combines deep learning for time series and deep learning for graphs. The combination of the two has proved extremely fruitful in several applications. It not only allows us to improve performance but, at a finer level, also lets us share parameters, obtaining a better ratio between training data and model complexity, while overcoming issues that may be present in the data, such as missing data or other irregularities. Finally, we suggested global-local models as a safe starting point for modeling this kind of application. We discussed the open challenges and pointed to the tutorial paper and the Torch Spatiotemporal library as useful resources.

Graph Machine Learning Conference | Learning On Graphs Conference 2024 p08 P08_Closing_Remarks -BV1k9pAzpE8S_p8-

So this year we had 16 local meetups around the globe。

some of them have already taken place and some of them will be happening in a few weeks time。

and we are grateful to all the organizers of the local meetups, and we look forward to seeing the photos you share from around the globe at your local meetups.

So this is the third edition of LOG, and we've already had three years of LOG in the virtual format, and we're truly excited to announce that our LOG next year will be the first in-person LOG conference, happening at UCLA in Los Angeles in the USA. We're really excited about this, and the success of LOG could not be made possible without everyone participating in and contributing to the LOG conference over the past three years. We will still be offering the virtual format as an option, and we will be announcing more details when the time comes, so we look forward to seeing every one of you in Los Angeles next year for our first in-person LOG conference ever.

😊,So if you really enjoyed LOG this year and you want to contribute to our conference, either by becoming a committee member or a reviewer,

we will be sending out our public calls in the next few weeks, so stay tuned for our public calls for new committee members and reviewers. If you want to submit a paper or host tutorials next year, or have any kind of feedback, we welcome you to fill out the feedback form; we will also be sending out the link in the Slack channel.

Any kind of feedback is welcome, and we are committed to constantly improving our LOG conference over the years.

So here comes the exciting part: I will be announcing the awards for the best papers,

top ACs and top reviewers. So this year we had 194 submissions, with 79 posters, and out of all of these wonderful papers,

the best paper goes to... drum roll, please... "Decomposing force fields as flows on graphs reconstructed from stochastic trajectories".

Congratulations to the authors. The paper was selected due to its significance: by modeling the dynamics of a stationary Langevin process as a discrete-state Markov process over a graph representation of the phase space, this paper unlocks new applications of graph-based learning. The authors showcase this by decomposing the dynamics into reversible and irreversible components to study biological systems. Congratulations to the authors for this truly remarkable achievement.

We also have a best paper honorable mention, and this goes to "Unravel:

a neuro-symbolic framework for answering graph pattern queries in knowledge graphs".

This paper introduces a framework for efficiently answering arbitrary graph pattern queries over incomplete knowledge graphs,

encompassing both tree-like and cyclic queries, which serves as a key advance in neuro-symbolic frameworks for answering graph pattern queries in knowledge graphs. Congratulations again to the authors for their wonderful paper.

And our conference could not be made possible without the contributions and service of our area chairs, for coordinating between the authors and reviewers and for writing their final recommendations. This year we had three top area chairs, and these awards go to Basing Ri,

Pear Vi Covich and Robin Walters.

We'd like to thank them for their active engagement throughout the reviewing process, their interaction with the reviewers, and their timely meta-reviews that support their recommendations and our decisions. Thank you so much for your service.

😊。

We also have three area chair honorable mentions, and these go to

Michael Kohi, Haga Marron and Michael Parramat, and we'd like to thank them for ensuring all official reviews were received for their batches and for serving as emergency ACs.

Thank you so much again for your service to LOG.😊。

And now here come the top reviewers. We'd like to thank all the reviewers of our conference, who volunteered to take time out to read papers,

write reviews and engage with the authors. And this year, we have 32 top reviewers.

Thank you so much for writing the reviews and contributing to our event; we will contact you individually for the monetary rewards, and thank you for your high-quality reviews that keep our review quality high at our LOG conference this year.

😊。

So here comes the end of our conference. Thank you so much again for attending and participating in our conference.

We'd like to thank everyone from the graph and geometric ML community for making the LOG conference possible, and we'd like to see you at our LOG conference next year, happening in Los Angeles, USA. We welcome any kind of feedback, so here's the link for the feedback form; feel free to let us know anything you think about our event. So thank you so much again, and see you next year at UCLA.

😊。

Graph Machine Learning Conference: P08: Closing Remarks and Outlook 🎉

In this section, we review the closing session of the 2024 Learning On Graphs (LOG) Conference. The content covers a summary of the conference outcomes, the announcement of the major awards, and the outlook for future editions of the conference.

Review of Global Community Activities 🌍

This year, 16 local meetups were organized around the globe.

Some of the meetups have already taken place, and others will be held in the coming weeks.

We sincerely thank all the local meetup organizers and look forward to seeing the meetup photos shared from around the world.

Conference History and Future Outlook 🚀

In the previous part we reviewed this year's community activities; in this part we look at the development of the conference itself. This is the third edition of the LOG conference. Having run in a virtual format for three years, we are very excited to announce that next year's LOG conference will be held in person for the first time, at UCLA in Los Angeles, USA.

The success of LOG would not be possible without every participant and contributor over the past three years. A virtual participation option will still be available, and more details will be announced when the time comes. We look forward to seeing everyone at the first in-person LOG conference in Los Angeles next year.

Invitation to Contribute 📣

If you really enjoyed this year's LOG conference and would like to contribute as a committee member or reviewer, public calls will be sent out in the coming weeks.

Here are several ways to get involved:

  • Watch for our public calls for new committee members and reviewers.
  • Submit a paper or host a tutorial next year.
  • Or provide any kind of feedback by filling in the feedback form; the link will also be posted in the Slack channel.

We welcome any kind of feedback and are committed to continuously improving the LOG conference.

Best Paper and Contributor Awards 🏆

Now for the exciting part: the announcement of the best paper, top area chair, and top reviewer awards.

This year the conference received 194 submissions, 79 of which were presented as posters. Among all these excellent papers, the Best Paper Award went to "Decomposing force fields as flows on graphs reconstructed from stochastic trajectories".

The paper was selected for its significance: by modeling the dynamics of a stationary Langevin process as a discrete-state Markov process over a graph representation of the phase space, it unlocks new applications of graph-based learning. The authors showcase this by decomposing the dynamics into reversible and irreversible components to study biological systems. Congratulations to the authors on this outstanding achievement.

There was also a Best Paper Honorable Mention, awarded to "Unravel: a neuro-symbolic framework for answering graph pattern queries in knowledge graphs".

The paper introduces a framework for efficiently answering arbitrary graph pattern queries over incomplete knowledge graphs, covering both tree-like and cyclic queries, a key advance for neuro-symbolic frameworks for answering graph pattern queries in knowledge graphs. Congratulations again to the authors on their excellent paper.

Acknowledging Outstanding Service 🙏

The conference would not be possible without the contributions and service of the area chairs, who coordinate between authors and reviewers and write the final recommendations. This year there were three top area chairs: Basing Ri, Pear Vi Covich and Robin Walters.

We thank them for their active engagement throughout the reviewing process, their interaction with the reviewers, and their timely meta-reviews supporting their recommendations and the final decisions. Thank you very much for your service.

Three area chairs also received honorable mentions: Michael Kohi, Haga Marron and Michael Parramat.

We thank them for ensuring that all official reviews were received for their batches and for serving as emergency area chairs. Thank you again for your service to the LOG conference.

Acknowledging Top Reviewers ✨

Now for the top reviewers. We thank all the reviewers of the conference, who volunteered their time to read papers, write reviews, and engage with the authors.

This year, there were 32 top reviewers.

Thank you very much for writing the reviews and contributing to the event; you will be contacted individually about the monetary rewards. Thank you for the high-quality reviews that kept the review quality of this year's LOG conference at a high level.

Conference Wrap-Up and Feedback 📝

The conference is coming to an end. Thank you again for attending and participating.

We thank every member of the graph and geometric machine learning community for making the LOG conference possible, and we look forward to seeing everyone at next year's LOG conference in Los Angeles, USA.

Any kind of feedback is welcome; here is the link to the feedback form, so feel free to share any thoughts about the event. Thank you again, and see you next year at UCLA.


In this section we covered the closing remarks of the 2024 Graph Machine Learning Conference: a review of the global community activities, the conference's evolution from a virtual to an in-person format, the ways to contribute to future editions, and the announcement of the best paper, top area chair, and top reviewer awards. The session closed by thanking all participants, collecting feedback, and announcing the venue of next year's in-person conference.
