I have a university project in which I have to disassemble binary files, so I tried Capstone. I spent weeks trying with Java but it didn't work, so since yesterday I have been teaching myself a little Python. To read the binary I tried:
file = open('binary_file')
content = file.readlines()
This comes from the question "Reading Binary File (.out) in Python and disassemble with Capstone"; the disassembly instructions come from the Capstone documentation: http://www.capstone-engine.org/lang_python.html
I have the reference output from an online disassembler, and it has more than 13,000 lines. When I run my script I get only one (0x1000: sc 0x2b). I can't find the mistake; it all looks right to me, but I have very little experience with Python or Capstone.
By the way, the test code from the Capstone page works fine, so I don't think anything is wrong with the installation.
Code:
from capstone import *

# Read the whole MIPS binary as raw bytes
file = open('C:/...sth', 'rb')
content = file.read()

ergebnism = open("C:/.../ergebnis.txt", "w")
mi = Cs(CS_ARCH_MIPS, CS_MODE_MIPS32)
# Print each decoded instruction and write it to the result file
for i in mi.disasm(content, 0x1000):
    print("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
    ergebnism.write("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
    ergebnism.write("\n")
#for (address, size, mnemonic, op_str) in mi.disasm_lite(content,0x1000):
#    print("0x%x:\t%s\t%s" %(address, mnemonic, op_str))
ergebnism.close()

# Read the whole PowerPC binary as raw bytes
file2 = open('C:/...erdb', 'rb')
content2 = file2.read()

ergebnisp = open("C:/.../ergebnisp.txt", "w")
pp = Cs(CS_ARCH_PPC, CS_MODE_64)
# Disassemble content2 (the second file), not content from the first file
for i in pp.disasm(content2, 0x1000):
    print("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
    ergebnisp.write("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
    ergebnisp.write("\n")
#for (address, size, mnemonic, op_str) in pp.disasm_lite(content2, 0x1000):
#    print("0x%x:\t%s\t%s" %(address, mnemonic, op_str))
ergebnisp.close()
Disassembler libraries like Capstone treat everything you feed them as instruction bytes, but normal binaries tend to contain a lot of other things besides instructions. Most of them start with some kind of header, not with code.
Hence some analysis is needed to determine which parts of a binary are code, data, resources, relocation tables and whatnot, and to feed only the actual code (i.e. the instruction bytes) to the disassembler engine. This analysis is also needed to determine certain environmental parameters for the code to be disassembled, like the address at which it will be loaded by the operating system, the address of the entry point, or relocations that need to be applied.
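To give an idea of what the simplest form of such analysis looks like, here is a minimal sketch that lists the executable sections of a file. It assumes the binary is an ELF file, which the question does not state; other containers (PE, Mach-O, ...) need different parsing. The function name executable_sections is made up for this illustration, and only Python's standard struct module is used.

```python
import struct

SHF_EXECINSTR = 0x4  # ELF section flag: section holds machine instructions

def executable_sections(data):
    """Return (load_address, file_offset, size) for each executable ELF section.

    Hypothetical helper for illustration; assumes `data` is a complete ELF file.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                     # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big endian
    if is64:
        hdr_fmt, sh_fmt = endian + "HHIQQQIHHHHHH", endian + "IIQQQQ"
    else:
        hdr_fmt, sh_fmt = endian + "HHIIIIIHHHHHH", endian + "IIIIII"
    # The fixed header fields follow the 16-byte e_ident array.
    fields = struct.unpack_from(hdr_fmt, data, 16)
    e_shoff, e_shentsize, e_shnum = fields[5], fields[10], fields[11]
    sections = []
    for i in range(e_shnum):
        # Only the first six fields of each section header are needed here:
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size
        _name, _type, flags, addr, offset, size = struct.unpack_from(
            sh_fmt, data, e_shoff + i * e_shentsize)
        if flags & SHF_EXECINSTR:
            sections.append((addr, offset, size))
    return sections
```

For each returned tuple you would then pass only content[offset:offset + size] to Capstone's disasm(), with addr as the start address instead of a guessed value like 0x1000. This ignores relocations, the entry point and files without section headers, which is exactly the kind of work the full-blown tools do for you.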
This analysis is automatically done by programs like IDA (of which there is a free version); these are often called 'disassemblers' but in fact the raw instruction disassembly logic makes up only a tiny part of their analysis capabilities. For more information, have a look at the topic Disassembler for batch/automated processing over on Reverse Engineering.
Of course, all this is moot if your binary files contain only raw instruction streams...
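A quick way to check which case you are in is to look at the first few bytes of the file. The sketch below compares them against the magic numbers of a few common executable containers; the table is deliberately incomplete and the function name container_format is made up for this illustration.

```python
# Leading byte signatures of a few common executable containers (not exhaustive).
MAGIC_NUMBERS = {
    b"\x7fELF": "ELF",
    b"MZ": "PE/DOS (Windows)",
    b"\xfe\xed\xfa\xce": "Mach-O 32-bit",
    b"\xce\xfa\xed\xfe": "Mach-O 32-bit",
    b"\xfe\xed\xfa\xcf": "Mach-O 64-bit",
    b"\xcf\xfa\xed\xfe": "Mach-O 64-bit",
}

def container_format(data):
    """Return the name of the recognised container format, or None."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None
```

Calling container_format(content) on the bytes read in the question tells you whether the file starts with one of these headers. A result of None does not prove that the file is a raw instruction stream, only that none of the listed headers was found.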