Thread: Problem reading content from a PDF and writing it to a Word document in Java

  1. #1
    Junior Member
    Join Date
    Dec 2013
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Problem reading content from a PDF and writing it to a Word document in Java

    I am having a problem reading content (text and images) from a PDF file and writing it to a Word document. The Word document ends up containing junk characters instead of the original data. I am using itext-1.4.8.jar and itextpdf-5.0.jar. Any help is appreciated.
    Here is my code:
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.IOException;
     
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.ContentByteUtils;
    import com.lowagie.text.Document;
    import com.lowagie.text.DocumentException;
    import com.lowagie.text.Paragraph;
    import com.lowagie.text.rtf.RtfWriter2;
     
    public class Check1 {
     
        public static void main(String[] args) throws FileNotFoundException,
                IOException, DocumentException {
     
            PdfReader reader = new PdfReader(
                    "/home/mujafar/Desktop/NPTEL Transcription Guidelines.pdf");
            int n = reader.getNumberOfPages();
            System.out.println("total no of pages:::" + n);
     
            Document document = new Document();
     
            RtfWriter2.getInstance(document, new FileOutputStream(
                    "/home/mujafar/Desktop/file.docx"));
            System.out.println("file created");
            document.open();
            byte[] bytes;
            for (int i = 1; i <= n; i++) {
     
                bytes = ContentByteUtils.getContentBytesForPage(reader, i);
     
                String s = new String(bytes);
                document.add(new Paragraph(s));
     
                document.newPage();
     
            }
     
            document.close();
        }
    }
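
One likely reason for the junk output: the bytes returned by ContentByteUtils.getContentBytesForPage are the page's raw content stream, which holds PDF drawing operators rather than plain text, so new String(bytes) copies those operators into the document. The sketch below uses only the standard library and a hand-written, hypothetical content stream (not taken from the PDF in the question) to show what such bytes look like and where the visible text sits inside them. The class and method names are made up for illustration. Real streams often carry encoded glyph codes and compressed data, so a proper text extractor is still needed in practice.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamDemo {

    // Hypothetical, simplified page content stream: operators, not plain text.
    static final String SAMPLE_STREAM =
            "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n0 -14 Td\n(World) Tj\nET\n";

    // Naive extraction of the literal strings drawn with the Tj operator.
    // Assumes no escaped or nested parentheses; real PDFs need a real parser.
    static List<String> literalStrings(String stream) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj").matcher(stream);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = SAMPLE_STREAM.getBytes(StandardCharsets.ISO_8859_1);
        // This is the kind of text that lands in the output file
        // when the raw bytes are turned into a String.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
        // The readable words are only a small part of the stream.
        System.out.println(literalStrings(SAMPLE_STREAM));
    }
}
```

Note also that RtfWriter2 writes RTF, so the output file is an RTF document regardless of its .docx extension.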


  2. #2
    Junior Member
    Join Date
    Feb 2013
    Posts
    1
    My Mood
    Busy
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Problem reading content from a PDF and writing it to a Word document in Java

    I used itext-4.2.0.jar and itextpdf-5.4.5.jar, with com.lowagie.text.pdf.parser.PdfTextExtractor to read the PDF. The code below works fine for me. I hope it helps.

    package packJava;
     
    import java.io.FileNotFoundException;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter; 
    import com.lowagie.text.DocumentException;
    import com.lowagie.text.pdf.PdfReader;
    import com.lowagie.text.pdf.parser.PdfTextExtractor;
     
    public class MainClass {
     
        public static void main(String[] args) throws FileNotFoundException,
                IOException, DocumentException {
            PdfReader reader = new PdfReader(
                    "D:\\docs\\33745_The Immortals of Meluha - Chapter 1.pdf");
            int n = reader.getNumberOfPages();
            PdfTextExtractor parser = new PdfTextExtractor(reader);
            // Append the extracted text to the output file.
            // The content is plain text, despite the .doc extension.
            FileWriter write = new FileWriter("D:\\docs\\file4.doc", true);
            PrintWriter print_line = new PrintWriter(write);
            // Pages are numbered 1..n, so the loop must include page n.
            for (int i = 1; i <= n; i++) {
                print_line.printf("%s", parser.getTextFromPage(i));
            }
            print_line.close();
            write.close();
        }
    }
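
Two details of this kind of loop are easy to get wrong: page numbers run from 1 to n inclusive (a bound of i < n would skip the last page), and the writer should be closed even if extraction fails part-way. A minimal sketch of both points, using only the standard library, follows. The pageText function is a hypothetical stand-in for a call such as parser.getTextFromPage(i), and the class and method names are assumptions for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.function.IntFunction;

public class PageLoopDemo {

    // Writes the text of pages 1..pageCount (inclusive) to the target writer.
    // pageText is a hypothetical stand-in for the real per-page extraction call.
    static void writeAllPages(int pageCount, IntFunction<String> pageText, Writer target) {
        // try-with-resources closes the writer even if pageText throws.
        try (PrintWriter out = new PrintWriter(target)) {
            for (int i = 1; i <= pageCount; i++) {
                out.println(pageText.apply(i));
            }
        }
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        writeAllPages(3, i -> "text of page " + i, buffer);
        System.out.print(buffer);
    }
}
```

In real use the target would be a FileWriter for the output path, and pageText would delegate to the extractor.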



  3. #3
    Junior Member
    Join Date
    Dec 2013
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Problem reading content from a PDF and writing it to a Word document in Java

    Thanks for the reply. I had already tried it that way, but it only extracts the text from the PDF file. It does not preserve spacing, font sizes, font styles, etc.; it just grabs the text and prints it. I want to reproduce the content exactly (text, images, spacing and tables, i.e. everything as it appears in the PDF). I hope that makes my problem clear. My code is below. Thanks in advance.

    package bis.proj.samp;
     
    import java.io.File;
    import java.io.FileOutputStream;
     
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    import com.lowagie.text.Document;
    import com.lowagie.text.Paragraph;
    //import com.lowagie.text.pdf.PdfContentByte;
    import com.lowagie.text.rtf.RtfWriter2;
     
     
     
    public class ReadPdfFile {
     
        public static void main(String[] args) {
            try {
     
                Document document = new Document();
     
                File file = new File("/home/mujafar/Desktop/file.doc");
                if (!file.exists())
                    file.createNewFile();
     
                RtfWriter2.getInstance(document, new FileOutputStream("/home/mujafar/Desktop/file.doc"));
                System.out.println("file created");
                document.open();
     
                PdfReader reader = new PdfReader("/home/mujafar/Desktop/ Guidelines.pdf");
                int n = reader.getNumberOfPages();
                System.out.println("total no of pages:::" + n);
     
                String s = "";
                for (int i = 1; i <= n; i++) {
                    // Returns the page text only; fonts, images and tables are not included.
                    s = PdfTextExtractor.getTextFromPage(reader, i);
                    System.out.println("string:::" + s);
                    System.out.println("====================");
     
                    document.add(new Paragraph(s));
                    document.newPage();
                }
     
                document.close();
     
                System.out.println("completed");
            } catch (Exception de) {
                // Report failures instead of silently swallowing them.
                de.printStackTrace();
            }
        }
     
    }
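
Plain text extraction returns characters only, so any font size or style has to be captured separately and written out explicitly. Because RtfWriter2 produces RTF, the sketch below builds a small RTF document by hand with the standard library to show how size and bold are encoded in that format. The Run class and its fields are hypothetical; how to obtain these values from the PDF is not shown here, and images and tables are outside the scope of this sketch.

```java
import java.util.Arrays;
import java.util.List;

public class RtfSketch {

    // Hypothetical holder for a run of text plus the formatting to preserve.
    static final class Run {
        final String text;
        final int fontSizePt;
        final boolean bold;

        Run(String text, int fontSizePt, boolean bold) {
            this.text = text;
            this.fontSizePt = fontSizePt;
            this.bold = bold;
        }
    }

    // Escapes RTF special characters; non-ASCII characters become \uN? escapes.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);
            } else if (c == '\n') {
                sb.append("\\line ");
            } else if (c > 127) {
                // \uN takes a signed 16-bit decimal value followed by a fallback character.
                sb.append("\\u").append((int) (short) c).append('?');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Emits one paragraph per run; braces keep each run's formatting local to it.
    static String toRtf(List<Run> runs) {
        StringBuilder sb = new StringBuilder(
                "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n");
        for (Run r : runs) {
            sb.append("{\\f0\\fs").append(r.fontSizePt * 2); // \fs is in half-points
            if (r.bold) {
                sb.append("\\b");
            }
            sb.append(' ').append(escape(r.text)).append("}\\par\n");
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Heading", 18, true),
                new Run("Body text with {braces}.", 11, false));
        System.out.println(toRtf(runs));
    }
}
```

Saving the returned string to a file with an .rtf extension gives a document that Word can open; each run keeps the size and weight it was given.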

  4. #4
    Super Moderator Norm
    Join Date
    May 2010
    Location
    Eastern Florida
    Posts
    25,140
    Thanks
    65
    Thanked 2,720 Times in 2,670 Posts

    Re: Problem reading content from a PDF and writing it to a Word document in Java

    Is this thread related? http://www.javaprogrammingforums.com...text-java.html
    If you don't understand my answer, don't ignore it, ask a question.
