Hi All,
I am new to this forum and this is my first thread. I am new to Java as well. This is my requirement:
I have some input text like this:
[NP The/DT U/NNP ] P/. [NP Workers/NNPS April/NNP skip/NN ] [PP to/TO ] [NP main/JJ skip/NN ] [PP to/TO ] [NP sidebar/NN ] [NP The/DT U/NNP ] P/. Workers/NNPS [NP This/DT site/NN ] [VP is/VBZ ] [ADJP open/JJ ] [PP for/IN ] [NP posting/VBG and/CC comments/NNS ] [PP by/IN ] [NP all/DT rank/NN and/CC file/NN administrative/JJ employees/NNS ] [PP of/IN ] [NP the/DT University/NNP ] [PP of/IN ] [NP the/DT Philippines/NNPS ] and/CC [NP the/DT Philippine/NNP General/NNP Hospital/NNP The/NNP National/NNP University/NNP Hospital/NNP ] [ADVP especially/RB ] [NP the/DT officers/NNS and/CC members/NNS ] [PP of/IN ] [NP the/DT All/NNP U/NNP ] P/. [NP Workers/NNPS Union/NNP ] [NP Friday/NNP April/NNP Stop/NNP Paying/NNP Nuke/NNP Plant/NNP Debt/NNP SC/NNP Justice/NNP Urges/NNPS Gov't/NNP ] [VP Posted/VBD pm/VBN ] [NP Mla/NNP time/NN April/NNP By/NNP Vincent/NNP Cabreza/NNP Inquirer/NNP News/NNP Service/NNP Published/NNP ] [PP on/IN ] [NP page/NN A/NNP ] [PP of/IN ] [NP the/DT Apr/NNP ] But/CC [NP Puno/NNP ] [VP points/VBZ ] [PRT out/RP ] [SBAR that/IN ] [NP the/DT US/NNP law/NN ] [VP bars/VBZ ] [NP the/DT towns/NNS ] [PP from/IN ] [VP issuing/VBG ] [NP new/JJ taxes/NNS ] [VP to/TO pay/VB ] [PP for/IN ] [NP their/PRP$ debts/NNS ] unsafe/JJ www/WRB -----etc-----------------
I need to convert the text into the following format (this is the desired output):
The DT B-NP
U NNP I-NP
P
Workers NNPS B-NP
April NNP I-NP
skip NN I-NP
to TO B-PP
main JJ B-NP
skip NN I-NP
to TO B-PP
sidebar NN B-NP
The DT B-NP
U NNP I-NP
P
Workers NNPS
......... etc .......
I have written code to do this transformation, but its output does not match the desired format above.
I am using a regex to solve the problem:
Pattern p = Pattern.compile("\\[(\\p{Alpha}+) +(\\p{Graph}+)/(\\p{Alpha}+)(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))? ]+(?:(\\./. |\\./.$))?(?: +(\\./. |\\./.$))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?:(\\p{Alnum}+)/(\\p{Alpha}+))?", Pattern.MULTILINE);
The regex is long because I have built it up to capture every kind of word sequence inside the brackets []. But it fails to produce output when it sees "But/CC " or a similar pattern in my text, whereas for the second kind, such as "unsafe/JJ", it does produce output.
I am printing the output like this:
while (matcher.find()) {
    // System.out.println();
    System.out.println("For: " + matcher.group());
    System.out.println(matcher.group(2) + "\t" + matcher.group(3) + "\tB-" + matcher.group(1));
    if (matcher.group(4) != null) {
        System.out.println(matcher.group(4) + "\t" + matcher.group(5) + "\tI-" + matcher.group(1));
    }
    -------etc---------------------------------------
So currently my output (which is wrong) looks like this (with no gaps after a sentence):
The DT B-NP
U NNP I-NP
Workers NNPS B-NP
April NNP I-NP
skip NN I-NP
to TO B-PP
main JJ B-NP
skip NN I-NP
to TO B-PP
sidebar NN B-NP
The DT B-NP
U NNP I-NP
This DT B-NP
site NN I-NP
is VBZ B-VP
-------
You can see that some words, such as "P", have been omitted entirely.
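One way to see exactly which text is being dropped: Matcher.find() silently skips anything the pattern cannot match, so the gaps between successive matches are the omitted words. The following is only a debugging sketch. The class name SkippedTextReport, the method skipped, and the simplified CHUNK pattern are hypothetical stand-ins, not the regex from this post; the real pattern could be passed to the same method instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkippedTextReport {
    // Hypothetical, simplified stand-in for the large pattern:
    // it matches one whole bracketed chunk such as "[NP the/DT Apr/NNP ]".
    static final Pattern CHUNK = Pattern.compile("\\[[^\\]]*\\]");

    // Returns every non-blank stretch of text that lies between matches,
    // i.e. the text that find() stepped over without reporting.
    static List<String> skipped(Pattern p, String text) {
        List<String> gaps = new ArrayList<>();
        Matcher m = p.matcher(text);
        int last = 0; // end offset of the previous match
        while (m.find()) {
            String gap = text.substring(last, m.start()).trim();
            if (!gap.isEmpty()) {
                gaps.add(gap);
            }
            last = m.end();
        }
        String tail = text.substring(last).trim(); // text after the final match
        if (!tail.isEmpty()) {
            gaps.add(tail);
        }
        return gaps;
    }

    public static void main(String[] args) {
        String sample = "[NP the/DT Apr/NNP ] But/CC [NP Puno/NNP ] [VP points/VBZ ] unsafe/JJ";
        System.out.println(skipped(CHUNK, sample)); // prints [But/CC, unsafe/JJ]
    }
}
```

Running this kind of report against the real input would list each token that never reaches the println calls.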
So I have 2 requirements:
1. How can I capture tokens such as "But/CC" that are not inside brackets?
2. In the input text there is a blank line after every sentence. I need the output to have a line break after each sentence as well, matching the input file. [There should also be a line break after P/., as there is in the input.]
Please refer to the desired output shown earlier in this post. I need a regex-based solution for this. Please help me modify my regex or write a new one.
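A possible direction for both requirements, as a rough sketch rather than a definitive answer: instead of one pattern that describes a whole bracketed chunk, a small pattern can match one token at a time (a chunk opener like "[NP", a closing "]", or a word/TAG pair), while the loop remembers which chunk is open. This rests on several assumptions that are not confirmed by the post: tokens outside brackets get the conventional label "O" (the desired output may want something else there), each non-blank input line is one sentence, and one blank line is emitted after each such line. The names ChunkToIob, TOKEN, and convert are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkToIob {
    // Alternatives are tried left to right at each position:
    //   group 1     : chunk opener, e.g. "[NP" captures "NP"
    //   "]"         : chunk closer (no capture group)
    //   groups 2, 3 : word and POS tag, split at the last "/" of the token
    static final Pattern TOKEN =
        Pattern.compile("\\[(\\p{Alpha}+)|\\]|(\\S+)/([^\\s/]+)");

    // Converts bracketed chunk text to "word<TAB>POS<TAB>label" lines.
    static List<String> convert(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches any line break (Java 8+)
            if (line.trim().isEmpty()) {
                continue;                         // blank input lines only separate sentences
            }
            String chunk = null;                  // label of the open chunk, null when outside brackets
            boolean first = false;                // true until the first word of the chunk is emitted
            int before = out.size();
            Matcher m = TOKEN.matcher(line);
            while (m.find()) {
                if (m.group(1) != null) {         // "[NP": a chunk starts
                    chunk = m.group(1);
                    first = true;
                } else if (m.group(2) != null) {  // word/TAG, inside or outside a chunk
                    // Assumption: tokens outside brackets are labelled "O".
                    String label = (chunk == null) ? "O" : (first ? "B-" : "I-") + chunk;
                    first = false;
                    out.add(m.group(2) + "\t" + m.group(3) + "\t" + label);
                } else {                          // "]": the chunk ends
                    chunk = null;
                }
            }
            if (out.size() > before) {
                out.add("");                      // assumption: one blank line after each sentence
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "[NP the/DT Apr/NNP ] But/CC [NP Puno/NNP ] [VP points/VBZ ]\n\n"
                      + "[NP The/DT U/NNP ] P/.";
        for (String row : convert(sample)) {
            System.out.println(row);
        }
    }
}
```

Because every word/TAG token is matched on its own, the number of words in a chunk no longer matters, and tokens such as But/CC, unsafe/JJ, or P/. are handled by the same branch as the bracketed ones. Text with no slash, such as the "-----etc-----" filler, is skipped.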
Thanks!