Is almost half a million words in an ArrayList<String> too much?
I put together a ~400,000 word list for my password generating program. It has 3 modes which choose words from this list, and I fear that the program will lag or be slow to load. Do I need to add some extra efficiency lines of code to keep it snappy? Here is what I'm doing so far:
import java.util.ArrayList;
import java.util.List;

public class WordData {
    public List<String> wordList;

    // Instance initializer: runs each time a WordData object is created.
    {
        wordList = new ArrayList<String>();
        wordList.add("cat");
        wordList.add("dog");
        wordList.add("bird");
        // blah blah
    }
}
I am coding all of the words into a separate class in the program to keep the list of words out of reach of people looking to add to their brute force lists. NetBeans has a macro recorder, so I will let it run overnight while it wraps each word in the proper syntax. Am I overloading the wordList? Having to add each one of the words I specified would take a while, I think. Accessing it from a file makes no difference either. The current text file is about 4MB.
By kavon89a at 2007-11-27 11:07:10

> I put together a ~400,000 word list for my password
> generating program. It has 3 modes which choose words
> from this list, and I fear that the program will lag
> or be slow to load. Do I need to add some extra
> efficiency lines of code to keep it snappy?
That's for you to decide. When you run it, does it run fast enough for your requirements?
If you're using this in the sense of "if the word list contains the password, then bad password," then a Set's contains method will be faster than a List's.
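To illustrate the Set suggestion, here is a minimal sketch. The class name WordLookup and its method are made up for the example, and it assumes the word list is stored in lower case. A HashSet answers contains() with a hash lookup, while ArrayList.contains() has to walk the list entry by entry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the original program.
public class WordLookup {
    private final Set<String> words;

    public WordLookup(List<String> wordList) {
        // Copy the list into a HashSet once; lookups are then hash based.
        this.words = new HashSet<>(wordList);
    }

    // Assumes the stored words are lower case.
    public boolean isKnownWord(String candidate) {
        return words.contains(candidate.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        WordLookup lookup = new WordLookup(Arrays.asList("cat", "dog", "bird"));
        System.out.println(lookup.isKnownWord("Dog"));
        System.out.println(lookup.isKnownWord("1HmP?t"));
    }
}
```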
> I am coding all of the words into a separate class in
> the program to keep the list of words out of reach of
> people looking to add to their brute force lists.
Huh?
If I'm running this program on my computer, or writing code that uses that class, then I can see all the words that are there.
> Am I overloading the wordList?
What do you mean?
> Having to
> add each one of the words I specified would take a
> while, I think. Accessing it from a file makes no
> difference either. The current text file is about 4MB.
Reading a 4MB file to build a 400k entry list or set should be pretty quick, but the only way you'll know if it's fast enough for you is to try it.
However, I would NOT put 400,000 string constants into my .java file. I'm pretty sure it won't even compile if you do. DEFINITELY load those words from a file, or store them in a DB and use a SQL query to see if the user's password is present in the DB.
jverda at 2007-7-29 13:20:17

400,000 add() calls in one initializer are going to overrun the maximum size of a method (the JVM allows at most 64 KB of bytecode per method). You shouldn't be trying to keep that many words in a class. Use a file.
What you think you are doing about security is not secure. Anyone can decompile a class.
So, read from the file. Then add everything to wordList, probably with a loop.
How do I get started on that? ;)
http://java.sun.com/docs/books/tutorial/
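As a starting point, reading the file and adding each line to the list in a loop might look roughly like this. It is only a sketch: the class name WordFileLoader is invented here, and it assumes one word per line in a UTF-8 text file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical loader; assumes one word per line, UTF-8 encoded.
public class WordFileLoader {
    public static List<String> load(Path file) throws IOException {
        List<String> wordList = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {   // skip blank lines
                    wordList.add(word);
                }
            }
        }
        return wordList;
    }
}
```

Timing a call to load() on the real 4MB file is the quickest way to find out whether it is fast enough.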
Rather than loading the words into an array, it might be better to leave them in the file and use RandomAccessFile to jump around the file.
Or maybe not. It depends on what you're trying to do.
The list will always be accessed randomly and turned into a String. Then I need to do a little modifying of the string before returning it to the GUI.
If you just need to select a random word, it might be worth opening the file with RandomAccessFile, then use seek() to a random location, read to the end of the line, and then read and return the next line.
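A rough sketch of that seek-and-read idea follows. The class and method names are invented, and it assumes a single-byte encoding such as ASCII with one word per line, since RandomAccessFile.readLine() turns each byte into one char. When the seek lands in the last line, this version wraps around to the first line, which is my own choice and not something specified above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

// Hypothetical helper; assumes one word per line in a single-byte encoding.
public class RandomWordPicker {
    public static String pick(String path, Random random) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            if (length == 0) {
                throw new IOException("word file is empty");
            }
            // Jump to a random byte offset in [0, length).
            long position = (long) (random.nextDouble() * length);
            file.seek(position);
            file.readLine();                 // discard the line we landed in (probably partial)
            String word = file.readLine();   // the next complete line
            if (word == null) {              // we landed in the last line: wrap to the start
                file.seek(0);
                word = file.readLine();
            }
            return word;
        }
    }
}
```

Note that the choice is not perfectly uniform: a word that follows a long line is picked more often than one that follows a short line. For password generation that bias may or may not matter.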
When you say you want to keep people from using them in "their brute force lists", what do you mean? You think a hacker is going to take your list of words and use them in their brute force password basher thing? If they are standard dictionary words, you can rest assured said hacker already has them.
So what is the list for?
If it's for password creation, taking a word and adding some random numbers or whatever, that's fine, I guess.
If it's related to validation of passwords to keep someone from using a regular word, then I would think you probably want people to mix in some numbers and punctuation, in which case you need to validate that numbers and punctuation are included. And if they are, that already excludes every word on your list, which makes the list unnecessary.
Maybe the goal is to check passwords for known real words, so you can reject the password?
If so, you could do a binary search on the word list, provided it is sorted. Again you can use RandomAccessFile for that rather than reading the words into an array, although an array might be worth it if you're going to be doing this a lot (like, hundreds of thousands of times).
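For the in-memory variant, a minimal sketch using Collections.binarySearch could look like the following. The class name SortedWordList is made up, and the list is sorted once up front because binary search only works on sorted data. Doing the same thing directly against a file is more involved and is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper around a sorted copy of the word list.
public class SortedWordList {
    private final List<String> sorted;

    public SortedWordList(List<String> words) {
        sorted = new ArrayList<>(words);
        Collections.sort(sorted);   // binary search requires sorted input
    }

    public boolean contains(String candidate) {
        // binarySearch returns a non-negative index only when the key is found.
        return Collections.binarySearch(sorted, candidate) >= 0;
    }
}
```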
Also there may already be an API for this. Check and see.
Keep in mind that it could be harder than just checking to see if the password is a plain word. "dog" would be caught by a brute force attack, and so would "cat", but so would "dogcat", perhaps.
> Maybe the goal is to check passwords for known real
> words, so you can reject the password?
>
Maybe,
If so, OP, why go to all the trouble to do that when you can enforce passwords such as "1HmP?t" with a lot less fuss?
If I had to stick with this approach I would use the Jazzy spell check API, which is available on SourceForge. It has a large DB of words for UK/US English, and if the word entered passes the spell check you could reject it as a password, because it is more than likely a plain English word.
I am not comparing anything with anything at all.
The program chooses a random word from a huge 400k word list, then modifies that word and sends the modified word to the GUI. Would the best way to do this be a RandomAccessFile approach which sets a String to the value it chooses, after which I do my modifying and return it to the GUI?
Though I'm notoriously awful at this, I will attempt to add up
how much memory you will need.
Every String has the following:
private final char value[];
private final int offset;
private final int count;
private int hash;
That is 4 * 4 bytes = 16
If there are 8 letters in every String, that's 8 * 2 bytes = 16
That's 400,000 * 32 bytes = 12.2 MB
+400,000 * 4 bytes for a reference to every String = total ~14 MB
That's ~1/4 of the RAM.
You may want to watch your memory.
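The arithmetic above can be written out as a small calculation. This is only a back-of-the-envelope sketch with invented names: it uses the same per-String figures as the estimate (four 4-byte fields, 2 bytes per char, a 4-byte reference) and ignores object headers and array overhead, so the real footprint would be somewhat higher.

```java
// Hypothetical calculator reproducing the rough estimate above.
public class StringMemoryEstimate {
    public static long estimateBytes(int wordCount, int averageLength) {
        long fieldBytes = 4L * 4;                // four 4-byte fields per String
        long charBytes = 2L * averageLength;     // 2 bytes per char
        long referenceBytes = 4;                 // one reference per String in the list
        return wordCount * (fieldBytes + charBytes + referenceBytes);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(400_000, 8);
        System.out.println(bytes + " bytes");
        System.out.println(bytes / (1024.0 * 1024.0) + " MB");
    }
}
```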
RandomAccessFile will work, but I think you would have to guarantee that
every entry is the same length. Pick a max length like 10.
Have the first two chars be the String length, then pad the rest:
04xxxxxxbird
03xxxxxxxdog
06xxxxchoose
This way you can jump around really easily in units of 14 * char length (12 characters per record plus the carriage return and line feed, \r\n).
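A sketch of reading such fixed-width records is below. The class and method names are invented, and it assumes a single-byte encoding (ASCII), so each record is exactly 14 bytes: 2 length digits, 10 padded characters, then \r\n.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical reader/writer for the padded record layout shown above.
public class FixedWidthWords {
    static final int LENGTH_DIGITS = 2;
    static final int MAX_WORD = 10;
    static final int RECORD_BYTES = LENGTH_DIGITS + MAX_WORD + 2; // + "\r\n" = 14

    // Builds one record, e.g. "bird" -> "04xxxxxxbird\r\n".
    public static String format(String word) {
        if (word.length() > MAX_WORD) {
            throw new IllegalArgumentException("word too long: " + word);
        }
        StringBuilder record = new StringBuilder(String.format("%02d", word.length()));
        for (int i = word.length(); i < MAX_WORD; i++) {
            record.append('x');               // padding
        }
        return record.append(word).append("\r\n").toString();
    }

    // Reads the word stored in record number index (0-based).
    public static String read(RandomAccessFile file, long index) throws IOException {
        file.seek(index * RECORD_BYTES);
        byte[] record = new byte[LENGTH_DIGITS + MAX_WORD];
        file.readFully(record);
        String text = new String(record, StandardCharsets.US_ASCII);
        int wordLength = Integer.parseInt(text.substring(0, LENGTH_DIGITS));
        return text.substring(text.length() - wordLength);
    }
}
```

Picking a random word then comes down to choosing a random index below file.length() / RECORD_BYTES.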
You wouldn't need each record to be the same length to use RandomAccessFile; just be sure to discard the first line after you seek to a particular position. However, you'd have to be sure that the data file has fixed-width character encoding, and if the width > 1 then you'd have to take that into account when you seek.
Of course, arguably choosing words at random is a poor way to generate passwords, because they're easy to guess for brute force attacks. It's not clear what the OP was hoping to achieve in this respect.
this is for passwords? oh lord.
whats wrong with:
for( random password size ) { string += random char }
?
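Spelled out in Java, that loop might look something like this. The class name and the character set are arbitrary choices for the example, and SecureRandom is used here on the assumption that the result is meant to be hard to predict.

```java
import java.security.SecureRandom;

// Hypothetical generator; the alphabet is an arbitrary example.
public class RandomPassword {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!?#$%&*+-";
    private static final SecureRandom RANDOM = new SecureRandom();

    public static String generate(int length) {
        StringBuilder password = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            // Append one randomly chosen character from the alphabet.
            password.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return password.toString();
    }
}
```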
> this is for passwords? oh lord.
>
> whats wrong with:
>
> for( random password size ) { string += random char
> }
>
> ?
That's what I was thinking.
Well, one problem with random strings of characters is that people find them hard to remember, and then they need to request password resets, and/or they write the passwords down, and those can be security holes right there.
Using a sentence and a semi-replicable way to turn that sentence into a seemingly-random sequence of characters seems to work better.
In my experience what generally happens is a new user is assigned a password from a random password generator which has an immediate expiry date. So when the user logs on for the first time they are forced to change it and they can pick one they can remember.
Well, that's fine, but then you have to worry about whether they choose passwords like their name, or their user ID, or "parishilton" or the like.
I could be wrong, but it seems the OP's requirement is to generate a password and not to check whether a chosen password is valid. If the OP needs to generate a random password, then the algorithm suggested in reply #15 would be sufficient.
> Well, that's fine, but then you have to worry about
> whether they choose passwords like their name, or
> their user ID, or "parishilton" or the like.
That's where the "include some numbers and/or punctuation" rule comes in, and it's easy to validate without word lists.
Just generate a random string of bytes and convert them to hexadecimal. "1f4c3df826bf" looks random enough.
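A minimal sketch of that idea, with invented names: six random bytes give a 12-character hexadecimal string of the same shape as the example above.

```java
import java.security.SecureRandom;

// Hypothetical helper: random bytes rendered as lower-case hexadecimal.
public class HexToken {
    public static String generate(int byteCount) {
        byte[] bytes = new byte[byteCount];
        new SecureRandom().nextBytes(bytes);
        StringBuilder hex = new StringBuilder(byteCount * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes still print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```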
Might I suggest ...
The dictionary check is overkill... KISS it.
http://en.wikipedia.org/wiki/KISS_principle
1. Set the initial password to userName + " " + currentTime
and set it to expired, so the user has to change it the first time they log in.
It's secure enough, and simple, because almost all users log in as soon as they receive their account, just to try it out. If you're paranoid (which is healthy) then destroy the new account if no one logs into it after 24 hours (or so).
2. Validate new passwords against a sequence of regexes to ensure that every password
* has a length of at least 8 characters;
* contains at least 2 alphas; and
* contains at least 2 non-alphas.
3. Don't allow reuse of the last 12 passwords; and
4. Force the user to change their password at least every three months, and every month if you're (a) nervous, (b) a banker, or (c) military. Every week if you're CIA. Every day if you're Aljazeera.
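The three checks in step 2 could be sketched with regexes along these lines. The class name is invented, and "alpha" is taken to mean the ASCII letters A-Z and a-z, which is an assumption on my part.

```java
import java.util.regex.Pattern;

// Hypothetical validator for the rules in step 2; "alpha" means A-Z or a-z here.
public class PasswordRules {
    private static final Pattern MIN_LENGTH =
            Pattern.compile(".{8,}", Pattern.DOTALL);
    private static final Pattern TWO_ALPHAS =
            Pattern.compile(".*[A-Za-z].*[A-Za-z].*", Pattern.DOTALL);
    private static final Pattern TWO_NON_ALPHAS =
            Pattern.compile(".*[^A-Za-z].*[^A-Za-z].*", Pattern.DOTALL);

    public static boolean isAcceptable(String password) {
        // Every rule must match the whole password.
        return MIN_LENGTH.matcher(password).matches()
                && TWO_ALPHAS.matcher(password).matches()
                && TWO_NON_ALPHAS.matcher(password).matches();
    }
}
```

Steps 3 and 4 need stored password history and expiry dates, so they are not covered by this sketch.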