Exploring LLVM Bitcode interactively
Published on
While working on a tool for software analysis, I find myself looking into the bitcode quiet often. It works OK when there is one small file, but it’s incredibly annoying when it comes to real-world projects which have tens and hundreds of files.
To simplify my life, I built a tool that converts LLVM Bitcode into the GraphML format: llvm2graphml.
What is GraphML
GraphML is an XML-based file format for storing graphs. The beautiful part is that it supported by many tools: you can use Neo4J, Cassandra, or TinkerPop to mine data or things like yEd or Gephi to visualize it.
My use-case is graph databases.
What is Graph Database
To understand what a graph database is to think of SQLite but for property graphs. And a property graph is simply a graph where each vertex (or node) and edge may have several key-value properties.
The classical example: there is a number of people in the graph and they have some relationship, e.g.: ‘Alice -> knows -> Bob’, ‘Bob -> friends-with -> Eve’, etc. In this case, we can model a query like “Find friends of people whom Alice knows” in the form of a query language:
graph.vertex('person').has('name', 'Alice').edge('knows').edge('friends-with')
Each step narrows down the search space:
- from a graph get all the vertices labeled ‘person’
- among those select the ones that have the property ’name’ with the value ‘Alice’
- from the vertices select nodes through edges labeled ‘knows’
- and from what’s left pick all the nodes reachable through the edges labeled ‘friends-with’
Note: this is an imaginary, simplified query language, but you’ve got the idea.
llvm2graphml
Let me walk you through an example of how to use llvm2graphml. To follow along you need to install llvm2graphml itself (prebuilt packages available for macOS and Ubuntu) and Gremlin Console from Apache TinkerPop project.
There are essentially three steps:
- Create
main.llfile with the following content:
; main.ll
define i32 @increment(i32 %x) {
%result = add i32 %x, 1
ret i32 %result
}
2. Run llvm2graphml to emit the GraphML file:
> llvm2graphml --output-dir=/tmp main.ll
[info] More details: /tmp/llvm2graphml-38dfea.log
[info] Loading main.ll
[info] Saved result into /tmp/llvm.graphml.xml
[info] Shutting down
3. Create the database from the GraphML file
Start console:
> gremlin-console/bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin>
Create the database:
gremlin> graph = TinkerGraph.open()
gremlin> g = graph.traversal()
gremlin> g.io("/tmp/llvm.graphml.xml").read()
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:12 edges:27], standard]
Now go and run some queries!
Example queries
List all modules:
gremlin> g.V().hasLabel('module').valueMap().unfold()
==>moduleIdentifier=[main.ll]
List all functions:
gremlin> g.V().hasLabel('function').valueMap().unfold()
==>argSize=[1]
==>basicBlockCount=[1]
==>name=[increment]
==>isDeclaration=[false]
==>isVarArg=[false]
==>isIntrinsic=[false]
==>numOperands=[0]
==>instructionCount=[2]
Count all the instructions:
gremlin> g.V().hasLabel('instruction').groupCount().by('opcode').unfold()
==>ret=1
==>add=1
Explore the types:
gremlin> g.V().hasLabel('type').valueMap('typeID').unfold()
==>typeID=[label]
==>typeID=[pointer]
==>typeID=[function]
==>typeID=[integer]
==>typeID=[void]
Find a function with an argument called ‘x’:
gremlin> g.V().has('argument', 'name', 'x').out('function').valueMap('name')
==>[name:[increment]]
Et cetera, et cetera, et cetera…
Some numbers
These are just some numbers mined from the libLLVMCore.a.
How many
| Number of functions | 71 019 |
| Number of basic blocks | 172 621 |
| Number of instructions | 1 212 322 |
| Number of types | 122 220 |
Top 10 instructions:
| call | 290 495 |
| load | 214 769 |
| store | 167 640 |
| alloca | 154 922 |
| br | 96 848 |
| getelementptr | 78 622 |
| ret | 67 729 |
| bitcast | 62 760 |
| icmp | 20 624 |
| phi | 9 716 |
Top 10 biggest functions:
| llvm::UpgradeIntrinsicCall(llvm::CallInst*, llvm::Function*) | 14033 |
| llvm::Intrinsic::getAttributes(llvm::LLVMContext&, unsigned int) | 8420 |
| ShouldUpgradeX86Intrinsic(llvm::Function*, llvm::StringRef) | 3635 |
| llvm::LLVMContextImpl::~LLVMContextImpl() | 2181 |
| UpgradeIntrinsicFunction1(llvm::Function*, llvm::Function*&) | 2006 |
| (anonymous namespace)::Verifier::visitIntrinsicCall(unsigned int, llvm::CallBase&) | 1887 |
| (anonymous namespace)::AssemblyWriter::printInstruction(llvm::Instruction const&) | 1869 |
| llvm::ConstantFoldBinaryInstruction(unsigned int, llvm::Constant*, llvm::Constant*) | 1244 |
| upgradeAVX512MaskToSelect(llvm::StringRef, llvm::IRBuilder | 1073 |
| llvm::ConstantFoldGetElementPtr(llvm::Type*, llvm::Constant*, bool, llvm::Optional | 1055 |
Resources
Here are some links if you want to learn more about Gremlin Queries and what’s possible:
Next steps
Currently, the project is in its very early days, and many features are missing, to name a few: specific properties on instructions and values, def-use chains and other connections, complex constants (such as vectors of structs), and many more.
With that being said - contributions are welcome!