Hello, I had an online test at a company and didn't succeed at it (they gave me a time limit). This was the test:

Problem:
1) You have a bunch of log files in:
• C:\CloudShareCodeChallenge\Challenge
• Sample data with expected results (below) in C:\CloudShareCodeChallenge\Sample
2) Each file may contain two types of records, user records and payment records;
you must process all files
3) User records are in the form of UR,<User Identifier>,<First name>,<Last name>
• User records are guaranteed to be unique
4) Payment records are in the form of PR,<Payment Identifier>,<User Identifier>,<Amount>
• There is a bug in the payments processor, and the same payment MAY appear more than once
(hence *double booking*)
• You should only take into account one payment per payment id (see the parsing sketch after this list).
5) Important notes:
• You may assume all records are correct (parsing-wise): exactly one comma between fields, no decimal
point in the amount, and no missing fields, so you may use split() etc. for that.
• Files are not ordered, so depending on how you read the files you may get a payment record
before the corresponding user record
• Each file name contains a two-letter prefix field, e.g. records_M2_2TFA7.log; only payment records starting
with those two characters will be in that file
• Payments with the same prefix may span more than one file.
• All log files must be processed
• Keep it clean and simple
• If you feel you must add a comment, rewrite the code to be self-explanatory instead.
• All the clues/notes above are important
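
Here is a minimal sketch of how I now think a single line could be handled, following the format rules above. The class and field names are my own, not from the test:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal per-line handling sketch; names are my own assumptions.
class LineHandler {
    final Map<String, String> userNames = new HashMap<>();  // user id -> "First Last"
    final Map<String, Long> userTotals = new HashMap<>();   // user id -> summed amount
    final Set<String> seenPaymentIds = new HashSet<>();     // for spotting double bookings
    long totalPayments = 0;
    long duplicatePayments = 0;

    void handle(String line) {
        String[] f = line.split(",");            // safe: the notes guarantee clean fields
        if (f[0].equals("UR")) {                 // UR,<User Identifier>,<First name>,<Last name>
            userNames.put(f[1], f[2] + " " + f[3]);
        } else {                                 // PR,<Payment Identifier>,<User Identifier>,<Amount>
            totalPayments++;
            if (seenPaymentIds.add(f[1])) {      // first occurrence of this payment id
                userTotals.merge(f[2], Long.parseLong(f[3]), Long::sum);
            } else {
                duplicatePayments++;             // a double booking
            }
        }
    }
}

My worry is that one seenPaymentIds set across all payment lines may not fit in memory at the scale described further down; more on that at the end of the post.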


6) Your tasks:

• Provide the list of the top 10 paying users: Full name and amount
• Provide the most common and least common first names with count
Note: There may be more than one name
• Provide the number of *double booking* payments, either:
• as thousandths of a percent of the total number of payments, with 3 decimal places
(see the worked check right after this list):
• for 5% you should show 50.000
• for 0.2345% you should show 2.345
• or as a fraction: 12,3435 / 100,000,000
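
Working backwards from the sample results below, the displayed number seems to be simply duplicates / total × 1000, printed with 3 decimal places; that reproduces both examples above and the sample line (the numbers 235 and 5120 come from the sample, everything else is my assumption):

public class RatioCheck {
    public static void main(String[] args) {
        long duplicates = 235, total = 5120;         // the numbers from the sample below
        double ratio = duplicates * 1000.0 / total;  // fraction x 1000 matches 50.000 and 2.345 too
        System.out.printf("Payments Ratio = %.3f (%d / %d)%n", ratio, duplicates, total);
        // prints: Payments Ratio = 45.898 (235 / 5120)
    }
}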

Bonus Questions:
• How would you write the generator for the above problem
• What would you take under consideration, what would be the challenges
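
I can only guess at the bonus, but a toy generator might emit users first, then payments, re-emitting a few payment lines on purpose and shuffling so nothing is ordered. Everything here, including the counts and id formats, is my own assumption:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ToyGenerator {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<String> lines = new ArrayList<>();
        String[] userIds = new String[100];
        for (int i = 0; i < userIds.length; i++) {
            userIds[i] = "U" + i;
            lines.add("UR," + userIds[i] + ",First" + i + ",Last" + i);
        }
        for (int p = 0; p < 5000; p++) {
            String payment = "PR,P" + p + "," + userIds[rnd.nextInt(userIds.length)]
                    + "," + (1 + rnd.nextInt(1000));
            lines.add(payment);
            if (rnd.nextDouble() < 0.05) lines.add(payment);  // inject a double booking
        }
        Collections.shuffle(lines, rnd);                      // files must not be ordered
        lines.forEach(System.out::println);
    }
}

I suppose the real challenges would be splitting the payments across 545 files by their two-letter prefix, letting some prefixes span several files, and keeping the duplicate rate controllable so the expected results can be computed.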

Sample data processing results:
Payments Ratio = 45.898 (235 / 5120)

Most common name(s): Hugh: 2
Most common name(s): Julio: 2
Most common name(s): Nelson: 2
Most common name(s): Sofia: 2
Most common name(s): Zelma: 2

Least common name(s): Annabelle: 1
Least common name(s): Chandra: 1
Least common name(s): Clayton: 1
Least common name(s): Cody: 1
Least common name(s): Gay: 1
Least common name(s): Guy: 1
Least common name(s): Jerri: 1
Least common name(s): Jessie: 1
Least common name(s): Julianne: 1
Least common name(s): Serena: 1

Top paying users:
Nelson Sensabaugh : 42669
Hugh Maginnis : 39375
Annabelle Glade : 30252
Jerri Bartee : 26393
Chandra Bottorff : 21221
Julianne Deller : 19624
Zelma Lubinsky : 17051
Nelson Caicedo : 16191
Sofia Lucena : 13779
Guy Gratton : 12150

There were 545 files, and each one contained approximately 260,000 lines. Each line can be a payment record, looking like this one: PR,U3VT2406XY5AC2,I8MKX,53 (of course the Payment Identifier, User Identifier, and Amount are not the same for each line).

And there can be lines which represent a user record; those lines look like this one: UR,B7R1W,Nelson,Hornyak (of course the User Identifier, First name, and Last name are not the same for each line).
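
At 545 × ~260,000 ≈ 140 million lines, I assume nothing can be loaded whole, so each file has to be streamed line by line. A sketch using the LineHandler class from earlier in the post:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FileProcessor {
    // Reads one log file line by line so memory stays flat regardless of file size.
    static void process(Path file, LineHandler handler) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            for (String line; (line = reader.readLine()) != null; ) {
                handler.handle(line);
            }
        }
    }
}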

They gave me an Eclipse environment and said to do it in Java (regular Java programming), but I didn't know what approach or data structures to use. Do I make a HashMap with more than 100 million entries? Do I use different things? What algorithms should I use, and how? I just graduated, so unfortunately I didn't succeed at this test, but I really want to know how to do it. Below is the rough sketch I've come up with since.
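
My guess (and it is only a guess) is that the two-letter note is the key: duplicates of a payment id can only appear in files sharing that prefix, so if the files are processed one prefix group at a time, the duplicate-detection set never has to hold more than one group's ids at once, instead of all ~140 million. The prefix extraction below assumes file names like records_M2_2TFA7.log:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Challenge {
    public static void main(String[] args) throws IOException {
        Map<String, String> userNames = new HashMap<>();   // user id -> full name
        Map<String, Long> userTotals = new HashMap<>();    // user id -> deduplicated total
        long totalPayments = 0, duplicates = 0;

        // Group files such as records_M2_2TFA7.log by the "M2" part of the name.
        Map<String, List<Path>> filesByPrefix;
        try (Stream<Path> files = Files.list(Paths.get("C:\\CloudShareCodeChallenge\\Challenge"))) {
            filesByPrefix = files.collect(
                    Collectors.groupingBy(p -> p.getFileName().toString().split("_")[1]));
        }

        for (List<Path> group : filesByPrefix.values()) {
            Set<String> seenIds = new HashSet<>();         // discarded after each prefix group
            for (Path file : group) {
                try (BufferedReader reader = Files.newBufferedReader(file)) {
                    for (String line; (line = reader.readLine()) != null; ) {
                        String[] f = line.split(",");
                        if (f[0].equals("UR")) {
                            userNames.put(f[1], f[2] + " " + f[3]);
                        } else {
                            totalPayments++;
                            if (seenIds.add(f[1])) {
                                userTotals.merge(f[2], Long.parseLong(f[3]), Long::sum);
                            } else {
                                duplicates++;              // a double booking
                            }
                        }
                    }
                }
            }
        }

        System.out.printf("Payments Ratio = %.3f (%d / %d)%n",
                duplicates * 1000.0 / totalPayments, duplicates, totalPayments);

        userTotals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(userNames.get(e.getKey()) + " : " + e.getValue()));

        Map<String, Long> firstNameCounts = new HashMap<>();
        userNames.values().forEach(full -> firstNameCounts.merge(full.split(" ")[0], 1L, Long::sum));
        long max = Collections.max(firstNameCounts.values());
        long min = Collections.min(firstNameCounts.values());
        firstNameCounts.forEach((name, c) -> {
            if (c == max) System.out.println("Most common name(s): " + name + ": " + c);
        });
        firstNameCounts.forEach((name, c) -> {
            if (c == min) System.out.println("Least common name(s): " + name + ": " + c);
        });
    }
}

Is this roughly the intended approach, or is there a better way to handle the duplicate detection at that scale?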