Initialization vectors: Android - Predictive text exclusions in Samsung devices

Tuesday, June 11, 2019

Android - Predictive text exclusions in Samsung devices

Short version

Samsung keyboard predictive text exclusions are located in the following location and database:
The following query at the
can be used as a template to extract text exclusion entries from the RemoveListManager database.
  • RemoveList
    • Fields: removed_word, time_word_added
SwiftKey predictive text exclusions are located in the following location and text file:
  • Blacklist
    • Text file contents are composed of one excluded word per line. 

Long Version

The following discussion shows how excluded words from the Samsung keyboard's predictive text are stored in Samsung Android devices. 

Predictive text options in Samsung Android phones
Predictive text is an Android feature that learns the user's most used typed words and presents them as options for autocomplete. For example if I type the city name "San Juan" with regularity on my device the next time I start typing "San" the predictive text option will volunteer the full name "San Juan" as an option to complete the word for me. Auto-predictive text saves the user time since I can type the word full word in just 3 taps (San + tap on the suggestion or tap spacebar for auto complete) instead of the 8 taps needed for the full name to be spelled out.

What happens when the keyboard is giving you a suggestion for a word that you don't want constantly for a set of initial letters? Imagine that currently instead of typing "San Juan" constantly you find yourself typing "San Lorenzo" instead since you moved to a new city. Every time you type "San" you get the suggestion "San Juan" instead of "San Lorenzo". By long pressing the suggestion box the keyboard gives you the option to stop suggesting the pressed word moving forward.

What happens to the long pressed word that will now not be suggested anymore? The same process is used for text exclusions for the SwiftKey keyboard app. Where does these excluded or blacklisted words reside? Why would finding these items be of importance to the forensic analyst?

Samsung Keyboard Analysis

On Samsung Android devices data related to keyboard configurations reside in the data/data/ directory. The following image shows the contents of the aforementioned directory.

Database folder exist when a blacklisted word exists

Notice in the image above how the databases folder is highlighted. This directory did not exist until I added a word to be excluded on the Samsung keyboard. Within this directory resides the SQLite database named RemoveListManager. Within the database the RemovedList table keeps the excluded word list.

In the previous image the word EMBASSIES was excluded. Notice the added time. While doing testing the actual time of addition was 2019-06-11 09:13:51. There is a difference of 4 hours. It is my assumption that the date shown is UTC time in human readable format. This underlines how important it is to test your conclusions. On your case work it is key to duplicate the environment by getting a similar phone to the original and do your own testing. 

Samsung SwiftKey Analysis

Just like the Samsung Keyboard, the SwiftKey keyboard app keeps pertinent data in the data/data/ directory. 

As seen in the previous image the user directory is where the excluded word list resides. The following image shows the contents of the blacklist text file.

For this analysis the creation and modified dates can be used to show when the list was created first and modified last. Dates for words excluded that are between the first and last entries on the list lack a timestamp or a way to infer it.

Why does this matter?

Excluded words are voluntary user generated events. When a user decides to exclude a word it is because it constantly gives suggestions to a list of words that are constantly typed. What would a list of excluded words that mostly contains terms related to child exploitation tell the digital forensic analyst? What can the analyst infer the user was typing? When was the exclusion made? Can the timestamp be correlated to a location by the use of another type of artifact on the device? User generated events tend to have relevance to our analysis and should be sought out and aggregated. We can make out the forest by getting at all those trees.

As always I can be reached on Twitter @AlexisBrignoni and email 4n6[at]abrignoni[dot]com.