Fix wchar array encoding in structs #1348
Emperor-RE wants to merge 5 commits into qilingframework:dev from
|
Thanks for the quick reaction! |
|
As a follow-up, this problem also seems to be occurring with struct.ref assignments. Here is an example from my own project where the same problem occurs (when assigning fdata_obj.cFileName), even after my patch: I'll await your reply to see whether this also needs its own patch or whether we need to find another solution with regard to the Python versions. |
|
I tested that on Python 3.9 running on top of WSL, and it behaves like 3.8 (that is, it uses excessive padding), so it is not a matter of versions after all. I believe that could be a result of the default encoding, where Python on POSIX sets it to UTF-8 by default [means that any string you provide it is already implemented as a UTF-8 buffer under the hood, which gets re-padded by My Windows Python encoding is set to cp1252 (Western Europe), so it might be the reason it works OK there.
>>> import locale ; locale.getpreferredencoding()
'UTF-8'
>>> import ctypes ; bytes(ctypes.create_unicode_buffer("hello", 8))
b'h\x00e\x00l\x00l\x00o\x00\x00\x00\x00\x00\x00\x00'
>>>
>>> import locale ; locale.getpreferredencoding()
'cp1252'
>>> import ctypes ; bytes(ctypes.create_unicode_buffer("hello", 8))
b'h\x00e\x00l\x00l\x00o\x00\x00\x00\x00\x00\x00\x00'
So I am a bit confused here. I am pretty much convinced this is an encoding / locale issue, but I cannot figure out how to determine that in order to work around it. |
|
Yeah, I'm pretty sure it has to do with the encoding of arrays. When I tested it with some dummy code that called bytes() separately for each ctypes.c_wchar, that did result in proper encoding (with proper \x00 padding). Calling bytes() on a ctypes_c_wchar_array object results in double encoding (with \x00\x00\x00 padding). |
|
I think it depends on whether your Python build is compiled using UCS2 or UCS4. Can you check which type you are running? |
|
Both show a result of ctypes.sizeof(ctypes.c_wchar). One returns 4 ("extra padding"), while the other returns 2 ("normal padding"). I do believe this has something to do with the locale, but I can't figure it out. |
|
I've spent quite some time following this up, and the most logical conclusion I can think of is that ctypes follows the "internal" Python representation for the host OS: Linux/Mac represent a wchar_t as 4 bytes, while Windows represents it as 2 bytes. |
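To make the host dependence concrete, here is a minimal sketch that compares the host-sized buffer produced by ctypes with an explicitly encoded UTF-16LE buffer. The helper name `encode_utf16le_fixed` is made up for illustration and is not part of Qiling or of this PR.

```python
import ctypes


def encode_utf16le_fixed(text: str, nchars: int) -> bytes:
    """Hypothetical helper: encode text as a null-padded buffer of
    nchars 2-byte UTF-16LE code units, independent of the host wchar_t."""
    raw = text.encode("utf-16-le")
    size = nchars * 2
    if len(raw) > size:
        raise ValueError("text does not fit into the buffer")
    return raw.ljust(size, b"\x00")


# Size of wchar_t as seen by ctypes: typically 4 on Linux/macOS, 2 on Windows.
host_wchar_size = ctypes.sizeof(ctypes.c_wchar)

# The ctypes buffer uses the host wchar_t size, so its length varies by OS.
host_bytes = bytes(ctypes.create_unicode_buffer("hello", 8))

# The explicitly encoded buffer is always 2 bytes per character.
portable_bytes = encode_utf16le_fixed("hello", 8)

print(host_wchar_size, len(host_bytes), len(portable_bytes))
```

On a host with a 4-byte wchar_t the first buffer is twice as long as the second, which matches the "extra padding" seen above.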
|
No doubt about that. |
|
I just pushed a fix that adds a wrapper class to substitute c_wchar arrays and represent them as an array of c_ubytes under the hood. This should ensure that they are always handled as UCS2 regardless of the host OS. I tested it: it passes the struct tests of qiling and it works with the binary I'm emulating 👍 |
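For readers following along, the general idea of backing a wide-string field with c_ubyte storage could look roughly like the sketch below. This is not the code from the PR: the `FindData` structure, the `_cFileName` storage field, and the 260-character length are assumptions chosen only for illustration.

```python
import ctypes

NCHARS = 260  # assumed field length in 2-byte characters, for illustration only


class FindData(ctypes.Structure):
    """Hypothetical struct whose wide-string field is stored as raw bytes."""

    _fields_ = [("_cFileName", ctypes.c_ubyte * (NCHARS * 2))]

    @property
    def cFileName(self) -> str:
        # Decode the raw storage as UTF-16LE and cut at the first NUL.
        raw = bytes(self._cFileName)
        return raw.decode("utf-16-le").split("\x00", 1)[0]

    @cFileName.setter
    def cFileName(self, value: str) -> None:
        raw = value.encode("utf-16-le")
        size = NCHARS * 2
        if len(raw) > size:
            raise ValueError("string does not fit into the field")
        # Null-pad to the full field size and copy into the struct.
        padded = raw.ljust(size, b"\x00")
        self._cFileName = (ctypes.c_ubyte * size).from_buffer_copy(padded)


fdata_obj = FindData()
fdata_obj.cFileName = "hello.txt"
print(fdata_obj.cFileName, ctypes.sizeof(FindData))
```

Because the storage is a plain byte array, the struct size and the serialized bytes no longer depend on the host wchar_t size.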
|
Why did you add +2 for null terminator? I think you are confusing it with |
|
My bad, that indeed causes misalignment. For my use case that didn't matter, but I'll change it to the proper length in bytes in a new commit. I have a different question about some of the other Windows struct types: could you elaborate on why you internally represent a lot of the string pointer types as the STRING type in qiling/os/windows/api.py? For example |
|
Tagging an argument as However, I will add this to my fix backlog [probably through OS resolver handlers]. |


Structs that use the ctypes.c_wchar_array type were being double encoded, causing the bytes to be padded with '\x00\x00\x00' instead of just '\x00'. I added a few lines to struct.py that fix this.
Example code that shows the problem by reading KUSER_SHARED_DATA->NtSystemRoot: