Exploring LLVM Bitcode interactively
Published on
While working on a tool for software analysis, I find myself looking into the bitcode quiet often. It works OK when there is one small file, but it’s incredibly annoying when it comes to real-world projects which have tens and hundreds of files.
To simplify my life, I built a tool that converts LLVM Bitcode into the GraphML format: llvm2graphml.
What is GraphML
GraphML is an XML-based file format for storing graphs. The beautiful part is that it supported by many tools: you can use Neo4J, Cassandra, or TinkerPop to mine data or things like yEd or Gephi to visualize it.
My use-case is graph databases.
What is Graph Database
To understand what a graph database is to think of SQLite but for property graphs. And a property graph is simply a graph where each vertex (or node) and edge may have several key-value properties.
The classical example: there is a number of people in the graph and they have some relationship, e.g.: ‘Alice -> knows -> Bob’, ‘Bob -> friends-with -> Eve’, etc. In this case, we can model a query like “Find friends of people whom Alice knows” in the form of a query language:
graph.vertex('person').has('name', 'Alice').edge('knows').edge('friends-with')
Each step narrows down the search space:
- from a graph get all the vertices labeled ‘person’
- among those select the ones that have the property ’name’ with the value ‘Alice’
- from the vertices select nodes through edges labeled ‘knows’
- and from what’s left pick all the nodes reachable through the edges labeled ‘friends-with’
Note: this is an imaginary, simplified query language, but you’ve got the idea.
llvm2graphml
Let me walk you through an example of how to use llvm2graphml
. To follow along you need to install llvm2graphml
itself (prebuilt packages available for macOS and Ubuntu) and Gremlin Console from Apache TinkerPop project.
There are essentially three steps:
- Create
main.ll
file with the following content:
; main.ll
define i32 @increment(i32 %x) {
%result = add i32 %x, 1
ret i32 %result
}
2. Run llvm2graphml
to emit the GraphML file:
> llvm2graphml --output-dir=/tmp main.ll
[info] More details: /tmp/llvm2graphml-38dfea.log
[info] Loading main.ll
[info] Saved result into /tmp/llvm.graphml.xml
[info] Shutting down
3. Create the database from the GraphML file
Start console:
> gremlin-console/bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin>
Create the database:
gremlin> graph = TinkerGraph.open()
gremlin> g = graph.traversal()
gremlin> g.io("/tmp/llvm.graphml.xml").read()
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:12 edges:27], standard]
Now go and run some queries!
Example queries
List all modules:
gremlin> g.V().hasLabel('module').valueMap().unfold()
==>moduleIdentifier=[main.ll]
List all functions:
gremlin> g.V().hasLabel('function').valueMap().unfold()
==>argSize=[1]
==>basicBlockCount=[1]
==>name=[increment]
==>isDeclaration=[false]
==>isVarArg=[false]
==>isIntrinsic=[false]
==>numOperands=[0]
==>instructionCount=[2]
Count all the instructions:
gremlin> g.V().hasLabel('instruction').groupCount().by('opcode').unfold()
==>ret=1
==>add=1
Explore the types:
gremlin> g.V().hasLabel('type').valueMap('typeID').unfold()
==>typeID=[label]
==>typeID=[pointer]
==>typeID=[function]
==>typeID=[integer]
==>typeID=[void]
Find a function with an argument called ‘x’:
gremlin> g.V().has('argument', 'name', 'x').out('function').valueMap('name')
==>[name:[increment]]
Et cetera, et cetera, et cetera…
Some numbers
These are just some numbers mined from the libLLVMCore.a
.
How many
Number of functions | 71 019 |
Number of basic blocks | 172 621 |
Number of instructions | 1 212 322 |
Number of types | 122 220 |
Top 10 instructions:
call | 290 495 |
load | 214 769 |
store | 167 640 |
alloca | 154 922 |
br | 96 848 |
getelementptr | 78 622 |
ret | 67 729 |
bitcast | 62 760 |
icmp | 20 624 |
phi | 9 716 |
Top 10 biggest functions:
llvm::UpgradeIntrinsicCall(llvm::CallInst*, llvm::Function*) | 14033 |
llvm::Intrinsic::getAttributes(llvm::LLVMContext&, unsigned int) | 8420 |
ShouldUpgradeX86Intrinsic(llvm::Function*, llvm::StringRef) | 3635 |
llvm::LLVMContextImpl::~LLVMContextImpl() | 2181 |
UpgradeIntrinsicFunction1(llvm::Function*, llvm::Function*&) | 2006 |
(anonymous namespace)::Verifier::visitIntrinsicCall(unsigned int, llvm::CallBase&) | 1887 |
(anonymous namespace)::AssemblyWriter::printInstruction(llvm::Instruction const&) | 1869 |
llvm::ConstantFoldBinaryInstruction(unsigned int, llvm::Constant*, llvm::Constant*) | 1244 |
upgradeAVX512MaskToSelect(llvm::StringRef, llvm::IRBuilder | 1073 |
llvm::ConstantFoldGetElementPtr(llvm::Type*, llvm::Constant*, bool, llvm::Optional | 1055 |
Resources
Here are some links if you want to learn more about Gremlin Queries and what’s possible:
Next steps
Currently, the project is in its very early days, and many features are missing, to name a few: specific properties on instructions and values, def-use chains and other connections, complex constants (such as vectors of structs), and many more.
With that being said - contributions are welcome!