Require searching only for file content and not metadata

Topic author: Khare, Kushal (MIND) (posted 2019/08/26 20:52)

Hello Guys!
This is Kushal Khare, a new addition to the user list. I started working with Solr a few days ago to implement it in my project.
Now, I have the basics done, and reached the query stage.
My problem is that I need to restrict Solr to search only the file content and not the metadata. I have gone through various articles on the internet, but could not find any help.
Therefore, I hope I could get some solutions here.
Thanks! Waiting for a response!



Reply by: Christopher Schultz (posted 2019/08/26 22:16)


Kushal,

On 8/26/19 07:52, Khare, Kushal (MIND) wrote:

> This is Kushal Khare, a new addition to the user list. I started
> working with Solr a few days ago to implement it in my project.
>
> Now, I have the basics done, and reached the query stage.
>
> My problem is that I need to restrict Solr to search only the file
> content and not the metadata. I have gone through various articles
> on the internet, but could not find any help.
>
> Therefore, I hope I could get some solutions here.

How are you querying Solr? Are you querying from a web application? From
a thick-client application? Directly from a web browser?

What do you consider "metadata" versus "content"? To Solr, everything
is the same...


Reply by: Khare, Kushal (MIND) (posted 2019/08/27 16:28)

Chris,
What I have done is: I created a core, used the post tool to index the documents from my file system, and then moved to the Solr Admin UI for querying.
By 'metadata' vs. 'content', I mean that I want only the field 'text' to be searched, instead of all the fields that Solr creates by itself, like author name, last modified, creator, id, etc.
I simply want Solr to search only the content inside the document (the body of the document) and not all the fields. For example, if I search for 'Kushal', it should return a document only if it has the word in its content, not because its author name or owner is Kushal.
Hope that is clearer than before. Please help me with this!

Thank you!
Kushal Khare
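
(For reference: restricting a search to one field needs no special indexing setup; the query itself can name the field, or the default search field can be set with the df parameter. A minimal SolrJ sketch under this thread's assumptions: stand-alone Solr at localhost, a core named 'tika', and an indexed field named 'text'.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FieldRestrictedQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/tika").build();

        // Qualify the query with the field name so only 'text' is searched.
        // An unqualified query falls back to the default field (df), which
        // may be a catch-all that also contains metadata.
        SolrQuery q = new SolrQuery("text:Kushal");
        // Equivalent: q.setQuery("Kushal"); q.set("df", "text");

        QueryResponse rsp = client.query(q);
        System.out.println("Hits: " + rsp.getResults().getNumFound());
        client.close();
    }
}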


Reply by: Khare, Kushal (MIND) (posted 2019/08/27 22:18)

Basically, the problem I am facing is that I am getting the textual content plus other metadata in my text field, but I want only the textual content written inside the document.
I tried various ExtractingRequestHandler configurations, but none of them worked for me.
Please help me resolve this, as I am badly stuck.
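
(One knob worth knowing about here: the extract handler can remap or discard Tika fields through its fmap.* and uprefix parameters. A sketch using SolrJ's ContentStreamUpdateRequest; the core name, file path, and the 'ignored_*' dynamic field are assumptions, not taken from this thread.)

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractContentOnly {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/tika").build();

        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("D:\\docs\\sample.pdf"), "application/pdf");

        // Map Tika's extracted body into the 'text' field...
        req.setParam("fmap.content", "text");
        // ...and shunt unknown (metadata) fields onto a prefix the schema
        // ignores, e.g. <dynamicField name="ignored_*" type="ignored"/>.
        req.setParam("uprefix", "ignored_");
        req.setParam("literal.id", "sample.pdf");
        req.setParam("commit", "true");

        client.request(req);
        client.close();
    }
}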


Reply by: Yogendra Kumar Soni (posted 2019/08/28 07:38)

It will be easier to parse the documents and create the content, metadata, and other required fields yourself, instead of using the default post tool. You will have better control over what goes into which field. For example:
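
(The point is that Tika hands the body text and the metadata back separately, so the caller decides what gets indexed. A minimal sketch; the full program posted later in this thread does the same thing at scale.)

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class BodyOnly {
    public static void main(String[] args) throws Exception {
        BodyContentHandler textHandler = new BodyContentHandler(-1); // -1: no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(args[0])) {
            new AutoDetectParser().parse(in, textHandler, metadata, new ParseContext());
        }
        // Body and metadata arrive separately; index only the body.
        System.out.println("Body text: " + textHandler.toString());
        System.out.println("Author (metadata, not indexed): " + metadata.get("Author"));
    }
}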


Reply by: Khare, Kushal (MIND) (posted 2019/08/28 15:10)

Could anyone please help me with how to use this approach? I humbly request all the users to please help me get through this.
Thanks!


Reply by: Jörn Franke (posted 2019/08/28 16:24)

You need to provide a little more detail. What is your schema? How is the document structured? Where does the metadata come from?

Have you read the Solr reference guide? Have you read a book about Solr?


Reply by: Shawn Heisey (posted 2019/08/28 17:47)

On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:

> Basically, the problem I am facing is that I am getting the textual
> content plus other metadata in my text field, but I want only the
> textual content written inside the document.
> I tried various ExtractingRequestHandler configurations, but none of
> them worked for me.
> Please help me resolve this, as I am badly stuck.

Controlling exactly what gets indexed in which fields is likely going to
require that you write the indexing software yourself -- a program that
extracts the data you want and sends it to Solr for indexing.

We do not recommend running the Extracting Request Handler in production
-- Tika is known to crash when given some documents (usually PDF files
are the problematic ones, but other formats can cause it too), and if it
crashes while running inside Solr, it will take Solr down with it.

Here is an example program that uses Tika for rich document parsing. It
also talks to a database, but that part could be easily removed or modified:

https://lucidworks.com/post/indexing-with-solrj/

Thanks,
Shawn

Reply by: Khare, Kushal (MIND) (posted 2019/08/28 19:54)

Yes, I have already gone through the reference guide. It is thanks to the guide and the documentation that I have reached this stage.
Well, I am indexing rich document formats like .docx, .pptx, .pdf, etc.
The metadata I am talking about is that Solr currently puts all the data such as author, editor, and content-type details of the documents into the text field, along with the textual content, and what I want is to separate them.
I also tried using the ExtractingRequestHandler and understood fmap.content in Tika, but still can't reach the desired output.


Reply by: Khare, Kushal (MIND) (posted 2019/08/28 19:59)

I already tried this example; I am currently working on it. I have compiled the code and it is indexing the documents, but it is not adding anything to the field 'text', and it is not returning any metadata either.
In doc.addField("text", textHandler.toString()); the textHandler.toString() is blank for all 40 documents. All I am getting is the 'id' and '_version_' fields.

This is the code that I tried:

package mind.solr;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;

public class solrJExtract {

    private HttpSolrClient client;
    private long start = System.currentTimeMillis();
    private AutoDetectParser autoParser;
    private int totalTika = 0;
    private int totalSql = 0;

    private Collection<SolrInputDocument> docList = new ArrayList<>();

    public static void main(String[] args) {
        try {
            solrJExtract idxer = new solrJExtract("http://localhost:8983/solr/tika");
            idxer.doTikaDocuments(new File("D:\\docs"));
            idxer.endIndexing();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private solrJExtract(String url) throws IOException, SolrServerException {
        // HttpSolrClient is for stand-alone Solr; use a SolrCloud-aware
        // client (CloudSolrClient) if you are running SolrCloud.
        client = new HttpSolrClient.Builder(url)
                .withConnectionTimeout(10000)
                .withSocketTimeout(60000)
                .build();

        // The binary parser is used by default for responses.
        client.setParser(new XMLResponseParser());

        // One of the ways Tika can be used to attempt to parse arbitrary files.
        autoParser = new AutoDetectParser();
    }

    // Just a convenient place to wrap things up.
    private void endIndexing() throws IOException, SolrServerException {
        if (docList.size() > 0) { // Are there any documents left over?
            client.add(docList, 300000); // Commit within 5 minutes
        }
        client.commit(); // Only needs to be done at the end;
                         // commitWithin should do the rest.
                         // Could even be omitted assuming
                         // commitWithin was specified.
        long endTime = System.currentTimeMillis();
        System.out.println("Total Time Taken: " + (endTime - start) +
                " milliseconds to index " + totalSql +
                " SQL rows and " + totalTika + " documents");
    }

    /**
     * Tika processing here.
     */
    // Recursively traverse the filesystem, parsing everything found.
    private void doTikaDocuments(File root) throws IOException, SolrServerException {
        // Simple loop for recursively indexing all the files
        // in the root directory passed in.
        for (File file : root.listFiles()) {
            if (file.isDirectory()) {
                doTikaDocuments(file);
                continue;
            }
            // Get ready to parse the file.
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // Tim Allison noted the following, thanks Tim!
            // If you want Tika to parse embedded files (attachments within
            // your .doc or any other embedded files), you need to send in
            // the AutoDetectParser in the ParseContext:
            // context.set(Parser.class, autoParser);

            InputStream input = new FileInputStream(file);

            // Try parsing the file. Note we haven't checked at all to
            // see whether this file is a good candidate.
            try {
                autoParser.parse(input, textHandler, metadata, context);
            } catch (Exception e) {
                // Needs better logging of what went wrong in order to
                // track down "bad" documents.
                System.out.println(String.format("File %s failed", file.getCanonicalPath()));
                e.printStackTrace();
                continue;
            }
            // Just to show how much meta-data there is and what form it's in.
            dumpMetadata(file.getCanonicalPath(), metadata);

            // Index just a couple of the meta-data fields.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getCanonicalPath());

            // Crude way to get known meta-data fields.
            // Also possible to write a simple loop to examine all the
            // metadata returned and selectively index it and/or
            // just get a list of them.
            // One can also use the Lucidworks field mapping to
            // accomplish much the same thing.
            String author = metadata.get("Author");
            // if (author != null) { doc.addField("author", author); }

            doc.addField("text", textHandler.toString());
            // doc.addField("meta", metadata.get("Last_Modified"));
            docList.add(doc);
            ++totalTika;

            // Completely arbitrary, just batch up more than one document
            // for throughput!
            if (docList.size() >= 1000) {
                // Commit within 5 minutes.
                UpdateResponse resp = client.add(docList, 300000);
                if (resp.getStatus() != 0) {
                    System.out.println("Some horrible error has occurred, status is: " +
                            resp.getStatus());
                }
                docList.clear();
            }
        }
    }

    // Just to show all the metadata that's available.
    private void dumpMetadata(String fileName, Metadata metadata) {
        System.out.println("Dumping metadata for file: " + fileName);
        for (String name : metadata.names()) {
            System.out.println(name + ":" + metadata.get(name));
        }
        System.out.println("........xxxxxxxxxxxxxxxxxxxxxxxxx..........");
    }
}

Also, I am attaching the solrconfig.xml and managed-schema.xml for my collection. Please look at them and suggest where I am going wrong.
I can't even see the text field in the query results, even though its stored parameter is true.
Any help would really be appreciated.
Thanks!
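
(A quick way to check whether 'text' really is stored, without reading the XML by hand, is the SolrJ Schema API; a sketch, with the core name assumed.)

import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

public class CheckTextField {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/tika").build();
        // Ask the Schema API for the 'text' field definition; if it reports
        // stored=false, the field can be searched but never returned.
        SchemaResponse.FieldResponse rsp = new SchemaRequest.Field("text").process(client);
        Map<String, Object> field = rsp.getField();
        System.out.println("text field definition: " + field);
        client.close();
    }
}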


Reply by: Khare, Kushal (MIND) (posted 2019/08/28 20:03)

Attaching managed-schema.xml


Reply by: Erick Erickson (posted 2019/08/28 20:20)

Attachments are aggressively stripped on this mailing list; you'll have to either post the file someplace and provide a link, or paste the relevant sections into the e-mail.

You're not getting any metadata because you're not adding any metadata to the documents with
doc.addField("metadatafield1", value_of_metadata_field1);

The only thing ever in the doc is what you explicitly put there. At this point it's just "id" and "text".

As for why text isn't showing up: does the schema have stored="true" for the field? And when
you query, are you specifying &fl=text? text is usually a catch-all field in the default schemas with
this definition:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

Since stored=false, well, it's not stored, so it can't be returned. If you're successfully searching on
that field but not getting it back in the "fl" list, this is almost certainly a stored="false" issue.

As for why you might have gotten all the metadata in this field with the post tool, check
that there are no “copyField” directives in the schema that automatically copy other data
into text.
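
(For example, default schemas have shipped with catch-all rules along these lines; if something similar is present, every extracted metadata field lands in text as well. A schema sketch, not necessarily the exact rule in this collection:)

<!-- Copies every field into the catch-all; removing or narrowing -->
<!-- rules like this keeps metadata out of the 'text' field.      -->
<copyField source="*" dest="text"/>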

Best,
Erick

On Aug 28, 2019, at 7:03 AM, Khare, Kushal (MIND) Kushal.Khare@mind-infotech.com wrote:

Attaching managed-schema.xml

-----Original Message-----
From: Khare, Kushal (MIND) [mailto:Kushal.Khare@mind-infotech.com]
Sent: 28 August 2019 16:30
To: solr-user@lucene.apache.org
Subject: RE: Require searching only for file content and not metadata

I already tried this example, I am currently working on this. I have complied the code, it is indexing the documents. But, it is not adding any thing to the field - text . Also, not giving any metadata.
doc.addField("text", textHandler.toString()); --> here, textHandler.toString() is blank for all the 40 documents. All I am getting is the 'id' & 'version' field.

This is the code that I tried :

package mind.solr;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;

public class solrJExtract {

    private HttpSolrClient client;
    private long start = System.currentTimeMillis();
    private AutoDetectParser autoParser;
    private int totalTika = 0;
    private int totalSql = 0;

    @SuppressWarnings("rawtypes")
    private Collection docList = new ArrayList();

    public static void main(String[] args) {
        try {
            solrJExtract idxer = new solrJExtract("http://localhost:8983/solr/tika");
            // Backslashes must be escaped in Java string literals; "D:\docs" does not compile.
            idxer.doTikaDocuments(new File("D:\\docs"));
            idxer.endIndexing();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private solrJExtract(String url) throws IOException, SolrServerException {
        // HttpSolrClient talks to a stand-alone Solr node;
        // use CloudSolrClient instead for SolrCloud.
        client = new HttpSolrClient.Builder(url)   // use the URL passed in
                .withConnectionTimeout(10000)
                .withSocketTimeout(60000)
                .build();

        // The binary (javabin) parser is used by default for responses.
        client.setParser(new XMLResponseParser());

        // One of the ways Tika can be used to attempt to parse arbitrary files.
        autoParser = new AutoDetectParser();
    }

    // Just a convenient place to wrap things up.
    @SuppressWarnings("unchecked")
    private void endIndexing() throws IOException, SolrServerException {
        if (docList.size() > 0) { // Are there any documents left over?
            client.add(docList, 300000); // Commit within 5 minutes
        }
        client.commit(); // Only needs to be done at the end,
                         // commitWithin should do the rest.
                         // Could even be omitted assuming commitWithin was specified.
        long endTime = System.currentTimeMillis();
        System.out.println("Total Time Taken: " + (endTime - start) +
                " milliseconds to index " + totalSql +
                " SQL rows and " + totalTika + " documents");
    }

    /**
     * Tika processing here.
     * Recursively traverse the filesystem, parsing everything found.
     */
    private void doTikaDocuments(File root) throws IOException, SolrServerException {
        // Simple loop for recursively indexing all the files
        // in the root directory passed in.
        for (File file : root.listFiles()) {
            if (file.isDirectory()) {
                doTikaDocuments(file);
                continue;
            }
            // Get ready to parse the file.
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // Tim Allison noted the following, thanks Tim!
            // If you want Tika to parse embedded files (attachments within your
            // .doc or any other embedded files), you need to send in the
            // autodetectparser in the parsecontext:
            // context.set(Parser.class, autoParser);

            // Try parsing the file. Note we haven't checked at all
            // whether this file is a good candidate.
            try (InputStream input = new FileInputStream(file)) {
                autoParser.parse(input, textHandler, metadata, context);
            } catch (Exception e) {
                // Needs better logging of what went wrong in order to
                // track down "bad" documents.
                System.out.println(String.format("File %s failed", file.getCanonicalPath()));
                e.printStackTrace();
                continue;
            }
            // Just to show how much meta-data there is and what form it's in.
            dumpMetadata(file.getCanonicalPath(), metadata);

            // Index just a couple of the meta-data fields.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getCanonicalPath());
            // Crude way to get known meta-data fields.
            // Also possible to write a simple loop to examine all the
            // metadata returned and selectively index it, and/or
            // just get a list of them.
            // One can also use the Lucidworks field mapping to
            // accomplish much the same thing.
            String author = metadata.get("Author");
            /*
            if (author != null) {
                doc.addField("author", author);
            }
            */
            doc.addField("text", textHandler.toString());
            //doc.addField("meta", metadata.get("Last_Modified"));
            docList.add(doc);
            ++totalTika;

            // Completely arbitrary, just batch up more than one document
            // for throughput!
            if (docList.size() >= 1000) {
                // Commit within 5 minutes.
                UpdateResponse resp = client.add(docList, 300000);
                if (resp.getStatus() != 0) {
                    System.out.println("Some horrible error has occurred, status is: " +
                            resp.getStatus());
                }
                docList.clear();
            }
        }
    }

    // Just to show all the metadata that's available.
    private void dumpMetadata(String fileName, Metadata metadata) {
        System.out.println("Dumping metadata for file: " + fileName);
        for (String name : metadata.names()) {
            System.out.println(name + ":" + metadata.get(name));
        }
        System.out.println("........xxxxxxxxxxxxxxxxxxxxxxxxx..........");
    }
}

Also, I am attaching the solrconfig.xml & managed-schema.xml for my collection. Please look at them and suggest where I am going wrong.
I can't even see the text field in the query result, even though the stored parameter is true.
Any help would really be appreciated.
Thanks !

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: 28 August 2019 14:18
To: solr-user@lucene.apache.org
Subject: Re: Require searching only for file content and not metadata

On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:

Basically, the problem I am facing is that I am getting the textual content + other metadata in my text field, but I want only the textual content written inside the document.
I tried various update/extract request handler configurations, but none of them worked for me.
Please help me resolve this, as I am badly stuck on it.

Controlling exactly what gets indexed in which fields is likely going to require that you write the indexing software yourself -- a program that extracts the data you want and sends it to Solr for indexing.

We do not recommend running the Extracting Request Handler in production
-- Tika is known to crash when given some documents (usually PDF files are the problematic ones, but other formats can cause it too), and if it crashes while running inside Solr, it will take Solr down with it.

Here is an example program that uses Tika for rich document parsing. It also talks to a database, but that part could be easily removed or modified:

https://lucidworks.com/post/indexing-with-solrj/

Thanks,
Shawn



返信投稿者:Khare, Kushal (MIND) (2019/08/28 20:21 投稿)

CURRENTLY, I AM GETTING

"text" :
[" \n \n date 2019-06-24T09:52:33Z \n cp:revision 5 \n Total-Time 1 \n extended-properties:AppVersion 15.0000 \n stream_content_type application/vnd.openxmlformats-officedocument.presentationml.presentation \n meta:paragraph-count 18 \n meta:word-count 20 \n extended-properties:PresentationFormat Widescreen \n dc:creator Khare, Kushal (MIND) \n extended-properties:Company MIND \n Word-Count 20 \n dcterms:created 2019-06-18T07:25:29Z \n dcterms:modified 2019-06-24T09:52:33Z \n Last-Modified 2019-06-24T09:52:33Z \n Last-Save-Date 2019-06-24T09:52:33Z \n Paragraph-Count 18 \n meta:save-date 2019-06-24T09:52:33Z \n dc:title PowerPoint Presentation \n Application-Name Microsoft Office PowerPoint \n extended-properties:TotalTime 1 \n modified 2019-06-24T09:52:33Z \n Content-Type application/vnd.openxmlformats-officedocument.presentationml.presentation \n Slide-Count 2 \n stream_size 32234 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.microsoft.ooxml.OOXMLParser \n creator Khare, Kushal (MIND) \n meta:author Khare, Kushal (MIND) \n meta:creation-date 2019-06-18T07:25:29Z \n extended-properties:Application Microsoft Office PowerPoint \n meta:last-author Khare, Kushal (MIND) \n meta:slide-count 2 \n Creation-Date 2019-06-18T07:25:29Z \n xmpTPg:NPages 2 \n resourceName D:\\docs\\DemoOutput.pptx \n Last-Author Khare, Kushal (MIND) \n Revision-Number 5 \n Application-Version 15.0000 \n Author Khare, Kushal (MIND) \n publisher MIND \n Presentation-Format Widescreen \n dc:publisher MIND \n PowerPoint Presentation \n \n slide-content \n Hello. This is just for Demo! \n If you find it anywhere, throw it away !\nA.W.A.Y away away away away away Away AWAY! \n \n \n A.W.A.Y once again ! \n \n \n \n \n \n \n \n \n \n \n \n \n \n slide-master-content \n slide-content \n A.W.A.Y \n \n away \n \n slide-master-content \n embedded /docProps/thumbnail.jpeg "],

WHAT I WANT :

"text" :
["\n slide-content \n Hello. This is just for Demo! \n If you find it anywhere, throw it away !\nA.W.A.Y away away away away away Away AWAY! \n \n \n A.W.A.Y once again ! \n \n \n \n \n \n \n \n \n \n \n \n \n \n slide-master-content \n slide-content \n A.W.A.Y \n \n away \n \n slide-master-content \n embedded /docProps/thumbnail.jpeg "],

"meta" : ["\n \n date 2019-06-24T09:52:33Z \n cp:revision 5 \n Total-Time 1 \n extended-properties:AppVersion 15.0000 \n stream_content_type application/vnd.openxmlformats-officedocument.presentationml.presentation \n meta:paragraph-count 18 \n meta:word-count 20 \n extended-properties:PresentationFormat Widescreen \n dc:creator Khare, Kushal (MIND) \n extended-properties:Company MIND \n Word-Count 20 \n dcterms:created 2019-06-18T07:25:29Z \n dcterms:modified 2019-06-24T09:52:33Z \n Last-Modified 2019-06-24T09:52:33Z \n Last-Save-Date 2019-06-24T09:52:33Z \n Paragraph-Count 18 \n meta:save-date 2019-06-24T09:52:33Z \n dc:title PowerPoint Presentation \n Application-Name Microsoft Office PowerPoint \n extended-properties:TotalTime 1 \n modified 2019-06-24T09:52:33Z \n Content-Type application/vnd.openxmlformats-officedocument.presentationml.presentation \n Slide-Count 2 \n stream_size 32234 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n X-Parsed-By org.apache.tika.parser.microsoft.ooxml.OOXMLParser \n creator Khare, Kushal (MIND) \n meta:author Khare, Kushal (MIND) \n meta:creation-date 2019-06-18T07:25:29Z \n extended-properties:Application Microsoft Office PowerPoint \n meta:last-author Khare, Kushal (MIND) \n meta:slide-count 2 \n Creation-Date 2019-06-18T07:25:29Z \n xmpTPg:NPages 2 \n resourceName D:\\docs\\DemoOutput.pptx \n Last-Author Khare, Kushal (MIND) \n Revision-Number 5 \n Application-Version 15.0000 \n Author Khare, Kushal (MIND) \n publisher MIND \n Presentation-Format Widescreen \n dc:publisher MIND \n PowerPoint Presentation \n"]

返信投稿者:Khare, Kushal (MIND) (2019/08/28 20:47 投稿)

Yup! I have already made stored=true for text. I will see to it. No worries.

BUT, I really need HELP with separating the content & metadata. I checked, but there isn't any field that is copying values into the 'text' field.
The only definition I have for text is :

For this: doc.addField("metadatafield1", value_of_metadata_field1);
I added the author name, etc. in the code, but am not getting those fields. Also, doc.addField("text", textHandler.toString()); has a blank value in it.

Please help !

返信投稿者:Khare, Kushal (MIND) (2019/08/28 21:01 投稿)

If I try to add any metadata to a field like this:

doc.addField("meta", metadata.get("dc_creator"));

1. I don't get that field in the results, though it has been created. And following is the definition in the schema :

2. When I check the value in my code using System.out.println(metadata.get("dc_creator")); --> I get 'null'.
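Judging by the metadata dump posted earlier in this thread, Tika's key is "dc:creator" (with a colon), not "dc_creator", which is why get() returns null. A minimal check, re-using the Metadata object from the indexing code:

// The key uses a colon, exactly as dumpMetadata() prints it: "dc:creator ..."
String creator = metadata.get("dc:creator");
if (creator == null) {
    // Unsure of the exact key? List everything Tika extracted:
    for (String name : metadata.names()) {
        System.out.println(name + " -> " + metadata.get(name));
    }
} else {
    doc.addField("meta", creator);
}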


返信投稿者:Khare, Kushal (MIND) (2019/08/29 17:54 投稿)

Erick,
I am using the code that I posted yesterday, but I am not getting anything in textHandler.toString(). Please check my snippet once and guide me, because I think I am very close to my requirement yet stuck here. I also debugged the code: it is not going inside doTikaDocuments() and is throwing a NullPointerException.
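A likely cause, sketched against the loop in doTikaDocuments(): File.listFiles() returns null when the path doesn't exist, isn't a directory, or isn't readable (note that "D:\docs" must be written "D:\\docs" in Java source), and iterating over that null throws a NullPointerException. A defensive version of the loop entry:

File[] children = root.listFiles();
if (children == null) {
    // Missing path, not a directory, or unreadable: nothing to index.
    System.out.println("Cannot list directory: " + root.getAbsolutePath());
    return;
}
for (File file : children) {
    // ... parse and index each file as before ...
}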


返信投稿者:Erick Erickson (2019/08/29 20:27 投稿)

I already provided feedback; you haven't shown any attempt to follow up on it.

Best,
Erick


Reply posted by: Khare, Kushal (MIND) (posted 2019/08/29 20:47)

I have been working on the same issue, trying to find out why I am not getting any data in the TextHandler or the Metadata.
For that, I first tried creating just a parser to extract content from the documents using the Tika AutoDetectParser. Finally, I found out that I was missing a jar (the Tika parser implementations are bundled in tika-app; with only tika-core on the classpath, AutoDetectParser silently produces empty content). So this separate plain-text parser now works for me. But when I try to run the code that I shared with you, some classes are missing. That's probably a jar conflict.

PLAIN TEXT PARSING CODE:

package mind.solr;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParsingExample {

  public void parseExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try (InputStream stream = ParsingExample.class.getResourceAsStream("/TestDocx.docx"))
    // Note: backslashes must be escaped in Java string literals.
    try (FileInputStream fin = new FileInputStream("D:\\docs\\TestA3.docx")) {
      parser.parse(fin, handler, metadata);

      String text = handler.toString();
      System.out.println("output :" + text);
    }
  }

  public static void main(String[] args) throws IOException, SAXException, TikaException {
    ParsingExample ps = new ParsingExample();
    ps.parseExample();
    //System.out.println("output :" + out);
  }
}

JARS USED:
solr-solrj-8.0.0.jar
tika-app-1.8.jar

I finally get the document content in the handler.

But now, when I move to my Solr indexing code to run it and define my fields for the extracted content, I get the following error:

Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(SSLConnectionSocketFactory.java:146)
at org.apache.solr.client.solrj.impl.HttpClientUtil$DefaultSchemaRegistryProvider.getSchemaRegistry(HttpClientUtil.java:235)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createPoolingConnectionManager(HttpClientUtil.java:260)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:255)
at org.apache.solr.client.solrj.impl.HttpSolrClient.<init>(HttpSolrClient.java:201)
at org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:964)
at mind.solr.solrJExtract.<init>(solrJExtract.java:50)
at mind.solr.solrJExtract.main(solrJExtract.java:35)

I found that it's caused by an HttpClient jar conflict (tika-app bundles its own copies of the httpclient/httpcore classes, which can clash with the versions SolrJ expects), but I am unable to resolve it. One fix I have read about is to put the httpclient, httpcore and httpmime jars from Solr's dist/solrj-lib directory ahead of tika-app on the classpath.
I request your help in pinning down the exact cause and how it could be resolved.
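
In case it helps the diagnosis, here is a small sketch (not part of my original code) that prints which jar the conflicting classes are actually loaded from; if the HttpClient classes resolve to tika-app-1.8.jar instead of a standalone httpclient/httpcore jar, that would confirm the conflict:

// Diagnostic sketch: print the jar each class was loaded from.
// getCodeSource() can be null for JDK bootstrap classes, but not for jar-loaded ones.
System.out.println(org.apache.http.conn.ssl.SSLConnectionSocketFactory.class
    .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.http.HttpEntity.class
    .getProtectionDomain().getCodeSource().getLocation());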

Thanks!

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 29 August 2019 16:57
To: solr-user@lucene.apache.org
Subject: Re: Require searching only for file content and not metadata

I already provided feedback, you haven’t evidenced any attempt to follow up on it.

Best,
Erick

On Aug 29, 2019, at 4:54 AM, Khare, Kushal (MIND) Kushal.Khare@mind-infotech.com wrote:

Erick,
I am using the code that I posted yesterday, but I am not getting anything in 'textHandler.toString()'. Please check my snippet once and guide me, because I think I am very close to my requirement yet stuck here. I also debugged my code: it is not going inside doTikaDocuments() and is throwing a NullPointerException.
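
One thing I am checking (an assumption on my part, since I don't have the full stack trace handy): File.listFiles() returns null when the path does not exist or is not a readable directory, and the for-each loop would then throw a NullPointerException before any parsing happens. A small guard would make that failure explicit:

File[] files = root.listFiles();
if (files == null) {
  // listFiles() returns null if 'root' is not a directory or cannot be read.
  throw new IOException("Cannot list directory: " + root.getAbsolutePath());
}
for (File file : files) {
  // ... parse as before ...
}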

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 28 August 2019 16:50
To: solr-user@lucene.apache.org
Subject: Re: Require searching only for file content and not metadata

Attachments are aggressively stripped by the mailing list; you'll have to either post the files someplace and provide a link, or paste the relevant sections into the e-mail.

You're not getting any metadata because you're not adding any metadata to the documents with doc.addField("metadatafield1", value_of_metadata_field1);

The only thing ever in the doc is what you explicitly put there. At this point it's just "id" and "text".

As for why text isn't showing up, does the schema have 'stored="true"' for the field? And when you query, are you specifying &fl=text? text is usually a catch-all field in the default schemas with this definition:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

Since stored=false, well, it's not stored so can't be returned. If you're successfully searching on that field but not getting it back in the "fl" list, this is almost certainly a stored="false" issue.
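
For example (a sketch against a default-style schema; adjust the field type to whatever your managed-schema actually uses), making the field stored and reloading the core lets it come back in results. Note that already-indexed documents must be reindexed for the change to take effect:

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

Then a query that requests it explicitly, e.g.:

http://localhost:8983/solr/tika/select?q=text:searchterm&fl=id,text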

As for why you might have gotten all the metadata in this field with the post tool, check that there are no “copyField” directives in the schema that automatically copy other data into text.
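
For example, default configsets often ship with a catch-all rule along these lines (the exact source and destination vary by version), which would funnel every field -- extracted metadata included -- into the search field:

<copyField source="*" dest="text"/>

If your schema has one of these pointed at text, removing or narrowing it is what keeps the metadata out of the content field.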

Best,
Erick

On Aug 28, 2019, at 7:03 AM, Khare, Kushal (MIND) Kushal.Khare@mind-infotech.com wrote:

Attaching managed-schema.xml

-----Original Message-----
From: Khare, Kushal (MIND) [mailto:Kushal.Khare@mind-infotech.com]
Sent: 28 August 2019 16:30
To: solr-user@lucene.apache.org
Subject: RE: Require searching only for file content and not metadata

I already tried this example; it is what I am currently working from. I have compiled the code and it is indexing the documents, but it is not adding anything to the field 'text', and it is not giving any metadata either.
doc.addField("text", textHandler.toString()); --> here, textHandler.toString() is blank for all 40 documents. All I am getting is the 'id' & '_version_' fields.

This is the code that I tried:

package mind.solr;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;

public class solrJExtract {

  private HttpSolrClient client;
  private long start = System.currentTimeMillis();
  private AutoDetectParser autoParser;
  private int totalTika = 0;
  private int totalSql = 0;

  @SuppressWarnings("rawtypes")
  private Collection docList = new ArrayList();

  public static void main(String[] args) {
    try {
      solrJExtract idxer = new solrJExtract("http://localhost:8983/solr/tika");
      idxer.doTikaDocuments(new File("D:\\docs"));
      idxer.endIndexing();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private solrJExtract(String url) throws IOException, SolrServerException {
    // Create a SolrCloud-aware client to send docs to Solr
    // Use something like HttpSolrClient for stand-alone
    client = new HttpSolrClient.Builder(url)
        .withConnectionTimeout(10000)
        .withSocketTimeout(60000)
        .build();

    // binary parser is used by default for responses
    client.setParser(new XMLResponseParser());

    // One of the ways Tika can be used to attempt to parse arbitrary files.
    autoParser = new AutoDetectParser();
  }

  // Just a convenient place to wrap things up.
  @SuppressWarnings("unchecked")
  private void endIndexing() throws IOException, SolrServerException {
    if (docList.size() > 0) { // Are there any documents left over?
      client.add(docList, 300000); // Commit within 5 minutes
    }
    client.commit(); // Only needs to be done at the end,
    // commitWithin should do the rest.
    // Could even be omitted
    // assuming commitWithin was specified.
    long endTime = System.currentTimeMillis();
    System.out.println("Total Time Taken: " + (endTime - start) +
        " milliseconds to index " + totalSql +
        " SQL rows and " + totalTika + " documents");
  }

  /**
   * Tika processing here.
   */
  // Recursively traverse the filesystem, parsing everything found.
  private void doTikaDocuments(File root) throws IOException, SolrServerException {
    // Simple loop for recursively indexing all the files
    // in the root directory passed in.
    for (File file : root.listFiles()) {
      if (file.isDirectory()) {
        doTikaDocuments(file);
        continue;
      }
      // Get ready to parse the file.
      ContentHandler textHandler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();
      // Tim Allison noted the following, thanks Tim!
      // If you want Tika to parse embedded files (attachments within your .doc or any other embedded
      // files), you need to send in the autodetectparser in the parsecontext:
      // context.set(Parser.class, autoParser);
      InputStream input = new FileInputStream(file);
      // Try parsing the file. Note we haven't checked at all to
      // see whether this file is a good candidate.
      try {
        autoParser.parse(input, textHandler, metadata, context);
      } catch (Exception e) {
        // Needs better logging of what went wrong in order to
        // track down "bad" documents.
        System.out.println(String.format("File %s failed", file.getCanonicalPath()));
        e.printStackTrace();
        continue;
      }
      // Just to show how much meta-data and what form it's in.
      dumpMetadata(file.getCanonicalPath(), metadata);
      // Index just a couple of the meta-data fields.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", file.getCanonicalPath());
      // Crude way to get known meta-data fields.
      // Also possible to write a simple loop to examine all the
      // metadata returned and selectively index it and/or
      // just get a list of them.
      // One can also use the Lucidworks field mapping to
      // accomplish much the same thing.
      String author = metadata.get("Author");
      /*
      if (author != null) {
        doc.addField("author", author);
      }
      */
      doc.addField("text", textHandler.toString());
      //doc.addField("meta", metadata.get("Last_Modified"));
      docList.add(doc);
      ++totalTika;
      // Completely arbitrary, just batch up more than one document
      // for throughput!
      if (docList.size() >= 1000) {
        // Commit within 5 minutes.
        UpdateResponse resp = client.add(docList, 300000);
        if (resp.getStatus() != 0) {
          System.out.println("Some horrible error has occurred, status is: " +
              resp.getStatus());
        }
        docList.clear();
      }
    }
  }

  // Just to show all the metadata that's available.
  private void dumpMetadata(String fileName, Metadata metadata) {
    System.out.println("Dumping metadata for file: " + fileName);
    for (String name : metadata.names()) {
      System.out.println(name + ":" + metadata.get(name));
    }
    System.out.println("........xxxxxxxxxxxxxxxxxxxxxxxxx..........");
  }
}

Also, I am attaching the solrconfig.xml & managed-schema.xml for my collection. Please look at them & suggest where I am going wrong.
I can't even see the text field in the query results, even though its stored parameter is true.
Any help would really be appreciated.
Thanks !

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: 28 August 2019 14:18
To: solr-user@lucene.apache.org
Subject: Re: Require searching only for file content and not metadata

On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:

Basically, the problem I am facing is that I am getting the textual content plus other metadata in my text field, but I want only the textual content written inside the document.
I tried various update/extract request handler configurations, but none of them worked for me.
Please help me resolve this, as I am badly stuck.

Controlling exactly what gets indexed in which fields is likely going to require that you write the indexing software yourself -- a program that extracts the data you want and sends it to Solr for indexing.

We do not recommend running the Extracting Request Handler in production -- Tika is known to crash when given some documents (usually PDF files are the problematic ones, but other formats can cause it too), and if it crashes while running inside Solr, it will take Solr down with it.

Here is an example program that uses Tika for rich document parsing. It also talks to a database, but that part could be easily removed or modified:

https://lucidworks.com/post/indexing-with-solrj/

Thanks,
Shawn


