Decode a UTF-8 string that may be cut off
Let's say we got some bytestring through a socket, but it was cut off in the middle of a UTF-8 character.
We can simulate this:
    bs = "приклад".encode('utf-8')[:-1]  # last byte was lost
    print(bs)
    #< b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xba\xd0\xbb\xd0\xb0\xd0'
    bs.decode('utf-8')
    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 12: unexpected end of data
So, is this string completely undecodable? Can't we just get the part that arrived intact?
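Don't worry, I have a solution. Works with Python 3.x and 2.7! A codecs stream reader decodes every complete character and holds back the incomplete trailing bytes instead of raising an error: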
    import io
    import codecs

    bs = b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xba\xd0\xbb\xd0\xb0\xd0'
    stream = io.BytesIO(bs)
    stream_reader = codecs.getreader('utf-8')(stream)
    print(stream_reader.read())
    #< прикла
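If the data really does arrive from a socket in several pieces, an incremental decoder may be a better fit than wrapping everything in a stream. The sketch below is a variation on the approach above, not part of the original recipe. It assumes Python 3, and the helper name `decode_chunks` is made up for illustration. It uses `codecs.getincrementaldecoder`, which keeps the bytes of an unfinished character in an internal buffer until the next chunk completes them.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks (hypothetical helper).

    Returns the decoded text plus any trailing bytes that do not yet
    form a complete character.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # final defaults to False, so an incomplete trailing sequence
        # is buffered instead of raising UnicodeDecodeError.
        parts.append(decoder.decode(chunk))
    # getstate() returns (buffered_bytes, flag); the first item is
    # whatever is still waiting for more input.
    leftover = decoder.getstate()[0]
    return "".join(parts), leftover


data = "приклад".encode("utf-8")
# Split in the middle of a character, and drop the very last byte.
text, leftover = decode_chunks([data[:5], data[5:-1]])
print(text)      # прикла
print(leftover)  # b'\xd0'
```

Returning the leftover bytes lets the caller decide what to do with them: wait for more data, or treat them as an error once the connection is closed.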