I suggest you consider adding the following content:
## Introduce the database we use: Neo4j
## Introduce our use of the AWS cloud computing platform
## Data
Data source, schema, size
https://www.openacademic.ai/oag/
## Building the graph database
A graph database can store any kind of data using a few simple concepts:
1. Nodes – graph data records
2. Relationships – connect nodes
3. Properties – named data values
For our project, we build the graph as follows:
1. Nodes: papers and authors.
2. Relationships: paper A references paper B (these relationships can be added from the `references` field); author A writes paper B.
3. Properties: fields other than `references` can all be taken as properties (name and value pairs).
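
The mapping above can be sketched in Python as follows. This is only an illustration: the field names `id` and `authors` (a list of objects with a `name`), the relationship type names `WRITES` and `REFERENCES`, and the helper `paper_to_graph` are assumptions, not taken from the dataset documentation. The sketch also assumes the author list is turned into nodes rather than kept as a property.

```python
def paper_to_graph(record: dict) -> dict:
    """Split one paper record into nodes and relationships (hypothetical helper)."""
    paper_id = record["id"]  # assumed field name
    # Every field except `references` becomes a property; `authors` is
    # also excluded here because authors are modeled as separate nodes.
    properties = {k: v for k, v in record.items()
                  if k not in ("references", "authors")}
    nodes = [("Paper", paper_id, properties)]
    relationships = []
    for author in record.get("authors", []):
        name = author["name"]  # assumed author structure
        nodes.append(("Author", name, {"name": name}))
        relationships.append((name, "WRITES", paper_id))
    for cited_id in record.get("references", []):
        relationships.append((paper_id, "REFERENCES", cited_id))
    return {"nodes": nodes, "relationships": relationships}
```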
## Data import
Because our dataset is very large, efficient import is important. We use Neo4j's `neo4j-admin import` tool.
https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/
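
The import tool reads CSV files whose header row marks the ID, label, and relationship endpoint columns. A minimal sketch of producing such files from paper records might look like this. The helper name `write_import_csv`, the column name `paperId`, the choice of exported properties, and the `REFERENCES` type name are assumptions for illustration.

```python
import csv
import io

def write_import_csv(papers, nodes_out, rels_out):
    """Write Paper nodes and REFERENCES relationships as import-tool CSV."""
    node_writer = csv.writer(nodes_out)
    rel_writer = csv.writer(rels_out)
    # Header rows use the import tool's :ID / :LABEL / :START_ID / :END_ID / :TYPE markers.
    node_writer.writerow(["paperId:ID", "title", "n_citation:int", ":LABEL"])
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for paper in papers:
        node_writer.writerow([paper["id"], paper.get("title", ""),
                              paper.get("n_citation", 0), "Paper"])
        for cited_id in paper.get("references", []):
            rel_writer.writerow([paper["id"], cited_id, "REFERENCES"])

# Example with in-memory files; real runs would write to disk.
nodes_file, rels_file = io.StringIO(), io.StringIO()
write_import_csv(
    [{"id": "p1", "title": "A", "n_citation": 3, "references": ["p2"]}],
    nodes_file, rels_file)
```

The resulting files would then be passed to `neo4j-admin import` through its `--nodes` and `--relationships` options; the exact invocation depends on the Neo4j version, so check the manual linked above. References that point at papers missing from the node file may need to be filtered out first.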
## Computing paper PageRank
https://github.com/neo4j-contrib/neo4j-graph-algorithms
After the computation finishes, every paper node in the Neo4j database has one additional property named `pagerank`.
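
To make the idea concrete, here is a small pure-Python sketch of PageRank by power iteration on a citation graph. It is not the implementation used by the Neo4j library, whose details and score scaling may differ; the damping factor 0.85, the iteration count, and the handling of papers that cite nothing are assumptions of this sketch.

```python
def pagerank(references, damping=0.85, iterations=20):
    """Compute PageRank for a graph given as {paper: [cited papers]}."""
    nodes = set(references)
    for cited in references.values():
        nodes.update(cited)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every paper starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            cited = references.get(node, [])
            if cited:
                # A paper passes its rank equally to the papers it cites.
                share = damping * rank[node] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper that cites nothing spreads its rank over all papers.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": []})
```

In this toy graph, paper `c` is cited by both other papers and therefore ends up with the highest score.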
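## Determining paper topics
First we determine which topics exist: we use MapReduce to count keyword frequencies and take the high-frequency keywords as the topics.
We used Hadoop Python streaming to count the frequencies. Because the data volume is very large, Hadoop's distributed computation lets us do this efficiently.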
This part can introduce **Hadoop**, **MapReduce**, **Python streaming**, and related topics.
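
A keyword count with Hadoop streaming typically consists of a mapper that emits `keyword<TAB>1` lines and a reducer that sums the counts per keyword. The sketch below shows both stages as plain Python functions. It assumes one JSON paper record per input line with a `keywords` list, which is an assumption about the data layout, and the lowercasing step is only a placeholder for the real keyword cleaning.

```python
import json
from itertools import groupby

def mapper(lines):
    """Emit one 'keyword<TAB>1' line per keyword of each JSON paper record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            paper = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        for keyword in paper.get("keywords", []):
            keyword = keyword.strip().lower()
            if keyword:
                yield f"{keyword}\t1"

def reducer(sorted_lines):
    """Sum counts per keyword; input must be sorted by keyword,
    which is how Hadoop delivers it to the reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for keyword, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{keyword}\t{total}"

# Local simulation of map -> sort -> reduce.
records = ['{"keywords": ["Deep Learning"]}', '{"keywords": ["deep learning"]}']
counts = list(reducer(sorted(mapper(records))))
```

In an actual streaming job, each function would live in its own script that reads `sys.stdin` and prints to standard output, and Hadoop would perform the sort between the two stages.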
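In this process we need to preprocess the keywords, mainly data cleaning and normalizing their form, as described in the document I sent you earlier: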
https://docs.qq.com/doc/BGww0r2bdHlv0mp4qG2bKveX1rxMLY1i2JfX3IQmKC2Cjyb92DNwhy29cH8t176uIT0ev6iT1
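
As a rough illustration of what cleaning and form normalization could look like, the sketch below lowercases a keyword, treats hyphens, underscores, and slashes as spaces, drops other punctuation, and collapses whitespace. These specific rules and the helper name `normalize_keyword` are assumptions. The actual rules are the ones in the document linked above.

```python
import re

def normalize_keyword(raw):
    """Return a cleaned form of a keyword, or None if nothing usable remains."""
    text = raw.strip().lower()
    text = re.sub(r"[-_/]+", " ", text)       # hyphens, underscores, slashes -> space
    text = re.sub(r"[^a-z0-9 ]+", "", text)   # drop other characters (ASCII-only assumption)
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text or None
```

With these rules, variants such as `Deep-Learning` and `deep  learning` map to the same form, so they are counted as one keyword.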
Through the steps above, we obtain the set of topic words.
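
One simple way to turn the keyword counts into a topic set is to keep the most frequent keywords. The sketch below assumes the counts come as `keyword<TAB>count` lines. The cutoff `top_n` and the helper name `select_topic_words` are hypothetical, since the notes do not say how "high-frequency" is defined.

```python
def select_topic_words(count_lines, top_n=100):
    """Return the set of the top_n most frequent keywords."""
    counts = []
    for line in count_lines:
        keyword, _, count = line.rstrip("\n").rpartition("\t")
        if keyword:
            counts.append((keyword, int(count)))
    # Sort by descending count, breaking ties alphabetically for determinism.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return {keyword for keyword, _ in counts[:top_n]}
```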
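Each paper is assigned topics by checking whether its keywords appear in the topic word set. Every topic of a paper is added to that paper's node in the Neo4j database as a label. A paper can have multiple topic words.
## Database query examples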
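
The assignment step can be sketched as below: one function picks out the keywords of a paper that are topic words, and another builds a Cypher statement that adds them as labels. The property name `id` used to find the paper and both helper names are assumptions of this sketch.

```python
def paper_topics(keywords, topic_words):
    """Return the paper's keywords that are topic words, in order, without duplicates."""
    topics = []
    for keyword in keywords:
        key = keyword.strip().lower()
        if key in topic_words and key not in topics:
            topics.append(key)
    return topics

def label_statement(topics):
    """Build a Cypher statement adding one label per topic to the paper with id $id."""
    if not topics:
        return None
    # Labels cannot be passed as query parameters, so they are written inline
    # in backticks; a backtick inside a label is escaped by doubling it.
    labels = "".join(":`" + topic.replace("`", "``") + "`" for topic in topics)
    return "MATCH (p:Paper {id: $id}) SET p" + labels

topics = paper_topics(["Deep Learning", "robots"], {"deep learning"})
statement = label_statement(topics)
```

Labels containing spaces, such as `deep learning`, need the backtick quoting, which is also why the query examples write them in backticks.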
top 10 papers with topic deep learning and neural network ordered by pagerank value
```sql
match (p:`deep learning`:`neural network`)
return p.title, p.pagerank
order by p.pagerank desc limit 10;
```
top 10 papers with topic algorithm design and analysis ordered by number of citations
```sql
match (p:`algorithm design and analysis`)
return p.title, p.n_citation
order by p.n_citation desc limit 10;
```
top 10 papers that reference or are referenced by 'Random search for hyper-parameter optimization'
ordered by pagerank value
```sql
match (a:Paper) -- (b:Paper {title: 'Random search for hyper-parameter optimization'})
return a.title, a.pagerank
order by a.pagerank desc limit 10;
```
top 10 papers that reference ‘Random search for hyper-parameter optimization’
ordered by pagerank value
```sql
match (a:Paper) --> (b:Paper {title: 'Random search for hyper-parameter optimization'})
return a.title, a.pagerank
order by a.pagerank desc limit 10;
```
top 10 papers referenced by ‘Random search for hyper-parameter optimization’
ordered by pagerank value
```sql
match (a:Paper) <-- (b:Paper {title: 'Random search for hyper-parameter optimization'})
return a.title, a.pagerank
order by a.pagerank desc limit 10;
```