# Mosaic: Processing a Trillion-Edge Graph on a Single Machine **Steffen Maass**, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys'17 June 8, 2017 #### Large-scale graph processing is ubiquitous #### One Trillion Edges: Graph Processing at Facebook-Scale Avery Ching Facebook 1 Hacker Lane Menlo Park, California aching@fb.com Sergey Edunov Maja Facebook Hacker Lane 1 Hacker Lane Menlo Park, California Menlo Pa edunov@fb.com majakabi thetis Sambavi Muthukrishnan Maja Kabiljo Facebook 1 Hacker Lane Menlo Park, California majakabiljo@fb.com Dionysios Logothetis Facebook 1 Hacker Lane Menlo Park, California dionysios@fb.com Facebook 1 Hacker Lane Menlo Park, California sambavim@fb.com #### ABSTRACT Analyzing large graphs provides valuable insights for social networking and specific properties in content ranking and recommendations. While numerous graph processing systems graphs of up to 6.08 edges, they often face significant difficulties in scaling to much larger graphs. Industry graphs can be two orders of magnitude larger - hundreds to the linear or up to one trillion edges. In addition to scalability commonly graphs of the control of the control of the control of the commonly graph concessing workflows than previously evalua project to run Facebook-scale graph applications in the summer of 2012 and is still the case today. | Table 1: Popular ben | | raphs. | |------------------------|----------|--------| | Graph | Vertices | Edges | | LiveJournal [9] | 4.8M | 69M | | Twitter 2010 [31] | 42M | 1.5B | | UK web graph 2007 [10] | 109M | 3.7B | | Yahoo web [8] | 1.4B | 6.6B | #### Social networks #### Large-scale graph processing is ubiquitous #### One Trillion Edges: Avery Ching Facebook 1 Hacker Lane Menlo Park, California aching@fb.com > Dionysios Fac 1 Hac Menlo Pa dionysic #### ABSTRACT Analyzing large graphs provides valuable is nontent networking and web companies in content networking and web companies in content networking and web companies. While numerous graph ps discussion of up to 6.6B edges, they often face ficulties in scaling to much large graphs, can be two orders of magnitude larger—thous or up to one trillion edges. In addit challenges, real world applications often recomplex ranhe processing world-from sthan #### Social networks # How to apply de Bruijn graphs to genome assembly Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler Affiliations | Corresponding author Nature Biotechnology 29, 987-991 (2011) | doi:10.1038/nbt.2023 Published online 08 November 2011 A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling a contiguous genome from billions of short sequencing reads into a tractable computational problem. #### Genome analysis #### Large-scale graph processing is ubiquitous #### One Trillion Edges: Avery Ching Facebook 1 Hacker Lane Menlo Park, California aching@fb.com > Dionysios Fac 1 Hac Menlo Pa dionysic #### ABSTRACT Analyzing large graphs provides valuable i networking and web companies in content ommendations. While numerous graph pr have been developed and evaluated on avagraphs of up to 6.6B edges, they often facficulties in scaling to much larger graphs. can be two orders of magnitude larger lions or up to one trillion edges. In addit, challenges, real world applications often re complex graph processing workflows than #### Social networks ### nature biotechnology Home | Current issue | News & cor NATURE BIOTECHNOLOGY | CO #### How to apply de E assembly Phillip E C Compeau, Pavel A P Affiliations | Corresponding aut Nature Biotechnology 29, 987-99 Published online 08 November 20 A mathematical concept known assembling a contiguous genor computational problem. #### Genome analysi home > archive > issue > computs Graph-powered Machine Learning at Google Thursday, October 06, 2016 Posted by Suith Ravi, Staff Research Scientist, Google Research Recently, there have been significant advances in Machine Learning that enable computer systems to solve complex real-world problems. One of those advances is Google's large scale, graph-based machine learning platform, built by the Expander team in Google Research. A technology that is behind many of the Google products and features you may use everyday, graph-based machine learning is a powerful tool that can be used to power useful features such as reminders in Inbox and smart messaging in Allo, or used in conjunction with deep neural networks to power the latest image recognition system in Google Photos. Learning with Minimal Supervision Graphs enable Machine Learning # Terabytes of RAM on multiple sockets #### Intel Unveils Plans for Knights Mill, a Xeon Phi for Deep Learning More sockets, More by Corl Pasinetti on July 29, 2015 Michael Feldman | August 18, 2016 01:33 CEST Tweet 0 6 Share 0 SGI UV 300H 20-Socket Applian Announcing the first 20-socket At the Intel Developer Forum (IDF) this week in San Francisco, Intel revealed it is working on a new Xeon Phi processor aimed at deep learning applications. Diane Bryant, executive VP and GM of Intel's Data Center Group, unveiled the new chip, known SGI announced today that the SGI4 as Knights Mill, during her IDF keynote address on Wednesday. controlled availability at 20-sockets single node. Asserting the value of SAP's close collaboration with syste simplicity for enterprises moving to SGI UV 300H is a specialized offering server line for in-memory computing enterprises to further unlock value real-time, boost innovation, and lov HANA. Featuring a highly differentia architecture, the system delivers significant advantages for businesses running 4 SAP HANA (SAP S/4HANA) and co extreme scale. The single-node sim enterprises eliminate overhead asso environments, streamline high avail seamlessly as data volumes grow w performance. Integrated with the recently annour SPS10, SGI UV 300H capitalizes on between SAP, Intel and SGI to option workloads on multicore NUMA (nonaccess) systems. This enables enter #### Terabytes of sockets Powerful many-core coprocessors by Cori Pasinetti on July 29, 2015 Michael Feldman | August 18, 2016 Tweet 0 f Share 0 SGI UV 300H 20-Socket Applian Announcing the first 20-socket controlled availability at 20-sockets single node. Asserting the value of SAP's close collaboration with syste simplicity for enterprises moving to SGI UV 300H is a specialized offering server line for in-memory computing enterprises to further unlock value real-time, boost innovation, and lov HANA. Featuring a highly differentia architecture, the system delivers significant advantages for businesses running 4 SAP HANA (SAP S/4HANA) and co extreme scale. The single-node sim enterprises eliminate overhead asso environments, streamline high avail seamlessly as data volumes grow w performance. Integrated with the recently annour SPS10, SGI UV 300H capitalizes on between SAP, Intel and SGI to option workloads on multicore NUMA (nonaccess) systems. This enables enter Terabytes of sockets At the Intel Developer Forum (II at deep learning applications, D SGI announced today that the SGI4 as Knights Mill, during her IDF Intel is introducing a new family of enterprise PCIe SSDs with the aim of outperforming their existing DC P3600 series and even beating the DC P3700 series in many metrics. To do this, they've essentially put two P3600 SSDs on to one expansion card and widened the interface to 8 lanes of PCIe 3.0. While this does Powerful ma come across as a bit of a quick and dirty solution, it is a very straightforward way for Intel to deliver higher performance, albeit at the cost of sharply increased power consumption. Fast, large-capacity Non-volatile Memory #### Steffen Maass Posted in Intel Storage SSDs PCIe SSD Enterprise SSDs #### Take advantage of heterogeneous machine to process tera-scale graphs Integrated with the recently annour SPS10. SGI UV 300H capitalizes on between SAP, Intel and SGI to option workloads on multicore NUMA (nonaccess) systems. This enables enter Terabytes of sockets Intel is introducing a new family of enterprise PCIe SSDs with the aim of outperforming their existing DC P3600 series and even beating the DC P3700 series in many metrics. To do this, they've essentially put two P3600 SSDs on to one expansion card and widened the interface to 8 lanes of PCIe 3.0. While this does Powerful ma come across as a bit of a quick and dirty solution, it is a very straightforward way for Intel to deliver higher performance, albeit at the cost of sharply increased power consumption. Fast, large-capacity Non-volatile Memory #### Table of contents - Graph Processing: Sample Application - 2 Design - Mosaic Architecture - Graph Encoding - API - Secondary Expression Secondary Expression ### Graph Processing: Applications - Community Detection - Find Common Friends - Find Shortest Paths - Estimate Impact of Vertices (webpages, users, ...) - ... #### Example: Pagerank Calculate impact of each vertex $$Pagerank_v = \alpha * \left( \sum_{u \in Neighborhood(v)} \frac{Pagerank_u}{degree_u} \right) + (1 - \alpha)$$ - Simple Algorithm: - In each iteration, distribute current impact along out-edges, weighted by degree - ullet Sum up all incoming impacts $\Rightarrow$ new impact for next iteration - Weight new impact with regularization factor $\alpha = 0.85$ - Repeat until no changes #### 1) Initialize 2) Propagate along outgoing edges 3) Sum up incoming contributions 4) Apply regularization: x \* 0.85 + 0.15 5) Update outgoing edges for second iteration #### 6) Repeat until stabilized ### Mosaic: Design space #### Graph Processing has many faces: - Single Machine - Out-of-core - In memory - Cluster - Out-of-core - In memory ### Mosaic: Design space #### Graph Processing has many faces: - Single Machine - Out-of-core ⇒ Cheap, but potentially slow - In memory $\Rightarrow$ Fast, but limited graph size - Cluster - Out-of-core ⇒ Large graphs, but expensive & slow - In memory $\Rightarrow$ Large graphs & fast, but *very* expensive ### Mosaic: Design space #### Graph Processing has many faces: - Single Machine - Out-of-core ⇒ Cheap, but potentially slow - In memory ⇒ Fast, but limited graph size - Cluster - Out-of-core ⇒ Large graphs, but expensive & slow - In memory ⇒ Large graphs & fast, but very expensive - ⇒ Single machine, out-of-core is most cost-effective - ⇒ Goal: Good performance and large graphs! ### Mosaic: Design goals #### Goal Run algorithms on very large graphs on a single machine using coprocessors #### Enabled by: - Common, familiar API (vertex/edge-centric) - Encoding: Lossless compression - Cache locality - Processing on isolated subgraphs #### Architecture of Mosaic - Usage of Xeon Phi & NVMe - Involvement of Host ### Graph encoding: Idea #### Compression Split graph into subgraphs, use local (short) identifiers #### Cache locality - Inside subgraphs: Sort by access order - Between subgraphs: Overlap vertex sets # Background: Column first - Locality for write - Multiple sequential reads ### Background: Row first - Locality for read - Multiple sequential writes ### Background: Hilbert order - Space-filling curve - Provides locality between adjacent data points - Convert graph to set of tiles - 1) Start with adjacency Matrix: - Convert graph to set of tiles - 2) Use first edge in tile $T_1$ : - Convert graph to set of tiles - 3) Use more edges in tile $T_1$ : - Convert graph to set of tiles - 4) Use more edges in tile $T_1$ : - Convert graph to set of tiles - 5) Use more edges in tile $T_1$ : - Convert graph to set of tiles - 6) Next edges do not fit in $T_1$ anymore, construct $T_2$ : ### Locality with Hilbert-ordered tiles Overlapping sets of sources and targets ⇒ Better locality than row-first or column-first ## From global to local: Data structure 1) Split original graphs into two subgraphs: ### From global to local: Data structure 2) Internal data structure of $T_2$ : ### From global to local: Data structure 3) Compress edges: Compressed sparse rows ⇒ Efficient, local encoding, sequentially accessed # From global to local: Summary - Better locality - Efficient encoding of local graphs - Effect: up to 68% reduction in data size: | Graph | #vertices | #edges | Raw data | Mosaic size (red.) | |----------------|-----------|-----------|------------|----------------------------------------| | *rmat24 | 16.8 M | 0.3 B | 2.0 GB | 1.1 GB (-45.0%) | | twitter | 41.6 M | 1.5 B | 10.9 GB | $7.7 \text{GB} \left(-29.4\%\right)$ | | *rmat27 | 134.2 M | 2.1 B | 16.0 GB | 11.1 GB (-30.6%) | | uk2007-05 | 105.8 M | 3.7 B | 27.9 GB | 8.7 GB (-68.8%) | | hyperlink14 | 1,724.6 M | 64.4 B | 480.0 GB | $152.4\mathrm{GB}\left(-68.3\%\right)$ | | *rmat-trillion | 4,294.9 M | 1,000.0 B | 8,000.0 GB | 4,816.7 GB (-39.8%) | | | | | | | - Pull: Gather per edge information - Reduce: Combine results from multiple subgraphs - Apply: Calculate non-associative regularization #### Edge-centric operation | zge e | | ie operanon | | |-----------------------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------| | Local graph<br>processing on Tile | | // On edge processor (co-processor) // Edge e = (Vertex src, Vertex tgt) def Pull(Vertex src, Vertex tgt): return src.val / src.out_degree | | | Loc | | // On edge processor/global reducers (both) def Reduce(Vertex v <sub>1</sub> , Vertex v <sub>2</sub> ): return v <sub>1</sub> .val + v <sub>2</sub> .val | graph<br>sing | | | 8<br>9<br>10 | // On global reducers (host)<br><b>def Apply</b> (Vertex v):<br>$v.val = (1 - \alpha) + \alpha \times v.val$ | Global graj<br>processin | #### Vertex-centric operation Formula: $$Pagerank_v = \alpha * \left(\sum_{u \in Neighborhood(v)} \frac{Pagerank_u}{degree_u}\right) + (1 - \alpha)$$ #### 1) Start with $T_1$ 2) Execute Pull along all edges in $T_1$ 3) Reduce all updates from $T_1$ onto next state #### 4) Switch to $T_2$ #### 5) Pull all updates #### 6) Reduce updates from $T_2$ 7) All tiles processed, Apply processed updates 8) Switch *current* and *next* state, clear *next* state for second iteration #### **Evaluation** #### Questions: - Preprocessing Cost - Performance (in comparison) - Impact of Design Decisions - Scalability #### **Evaluation - Setup** - Single Server: - 2 sockets, 12 cores each - 768GB RAM - 4 Xeon Phi (KNC, 61 cores) - 6 NVMe (1.2TB each) - 7 Algorithms - 6 Datasets (3 real world, 3 synthetic) #### Preprocessing - Mosaic needs explicit preprocessing step - 2-4 min for small datasets, 51 minutes for webgraph, 31 hours for trillion edges - But: Can be amortized during execution: - GridGraph: Mosaic faster after - twitter: 20 iterations uk2007: 8 iterations - X-Stream: Mosaic faster after - twitter: 8 iterationsuk2007: 5 iterations #### Performance comparison • Comparison to other single machine engines with Pagerank: ### Performance comparison • Comparison to other single machine engines with Pagerank: $\Rightarrow$ Mosaic outperforms other system by 2.7×to 58.6× ## Hilbert-ordered tiles: Cache locality • Cache misses and execution times for three different strategies ⇒ Hilbert-ordered tiles have up to 45% better cache locality, up to 43% reduction in runtime ## Evaluation - Scalability #### Dimensions - Add Xeon Phis/NVMes - Add threads on each Xeon Phi ⇒ Mosaic scales well when adding threads/Xeon Phis #### Conclusion - Mosaic, a graph processing engine for trillion edge graphs on a single machine - Hilbert-ordered tiles allow: - Enable localized processing on coprocessors - Optimizes cache locality - Enables compression Code is open-source: https://github.com/sslab-gatech/mosaic Thank you! #### Evaluation - Selective scheduling - Skip subgraphs without active vertices - Especially useful for traversal algorithms: BFS, Connected Components, . . . $\Rightarrow$ 2.2×speedup for BFS on twitter graph ## Evaluation - Dynamic load balancing - Effect of choosing the wrong load balancing scheme - Mosaic has light-weight balancing scheme $\Rightarrow$ Up to 5.8×improvement in running time by choosing correct load balancing scheme #### Mosaic architecture - Host and Xeon Phi component - Streaming-based design