URL Mapping

Topic created by: Michael Cizmar (posted 2020/05/29 01:11)
Reply by: Karl Wright (posted 2020/05/29 01:41)

Hi,

There are provisions in the URL canonicalization part of the world for
removal of session information from the URL. It only knows about some
kinds of widely used sessions: Java app server sessions, for example,
Broadvision sessions, etc. If you can convince me that your session
information is (a) uniquely identifiable, and (b) commonly used, the proper
approach is to incorporate session removal in this framework. Please let
me know.

Karl

On Thu, May 28, 2020 at 12:11 PM, Michael Cizmar (michael.cizmar@mcplusa.com) wrote:

I've got a really long URL with a bunch of unnecessary session query
string parameters. I've been trying unsuccessfully to map it to the same
URL without the session.

An example of the URL is below. I thought I could do this:

URL map regular expression:

(.*)\/!ut

replacement configuration:

So the goal would be for the URL to be:

http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/

But the URL gets rejected.

Sample crawl URL:

http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/!ut/p/a1/rZHLTsMwEEV_hS6yjDx5OWZpdRFImzYCAYk3lZM6D5TYSWoqPh8HFu2GQhHejEeae-aOLmIoQ0zyY1tz3SrJu7lneLfdBtTxI1iRhzsMFEfrpZ_6AFFoBnIzAN88Cj_pXxBDrJR60A3KeS2kvimV1KZaMKhJ886C8U1pIeSkOtNM3Pz5QewO3IJG9WIGDGW7RzkB7hZFIWxyyx3bL8LAJo6L7QoELitMPAH7r4WXLefmpvBkOoqfiTHth6vYTRxIAT1eufMy8D74Z2DqXg2Mf5Fz-zqOjJq05nzeNcr-FpchuVOyTGpjkOvGbmWlUHYmQtmZCGWfoqF_6omHq83G5gUBL-iOa0oXiw9FOxLu/dl5/d5/L0lJS2FZcHBpbW1LYVlwcGltbVlwcGchIS9vSHd3QUFBSXdpRUFJSkRBQ1VZaUVJVTVCZ09DbFFBQUlBQVNvU0FyUnFBQURBQWF0QXdMTzlRQUFFQUJ3WWVBR0tTQUFDa0k1Z21HU3dTaXJTQUFDZ0s5ZzBIUS80SmlHcGhxRWFoR29ScUVhbEdwaC9aNl9PTzVBMTRHMEs4Ukg2MEE2R0xDNFA0MDBHNy9hZ2VudCBjb250ZW50JTBwb3J0YWwlMHF1b3RlZW5yb2xsJTBkaWdzIC0gcXVvdGluZyAgZW5yb2xsbWVudCAoaW5kaXZpZHVhbCkvZjQ0YmEyOWUtODQwOC00YjFlLTg4MzktMTFlMjI4NDgxYTVhL2RpZ3MgLSBxdW90aW5nICBlbnJvbGxtZW50IChpbmRpdmlkdWFsKQ
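To make the mapping attempt concrete, here is a minimal sketch of what the pattern above captures. It uses Python's `re` module purely for illustration (the crawler itself is Java), and the session suffix is shortened to a stand-in. The thread does not say whether the URL mapper anchors the pattern to the whole URL, so both behaviours are shown; the `(.*)/!ut/.*` variant is an assumption, not a confirmed fix.

```python
import re

# Shortened stand-in for the sample crawl URL; the real session suffix is far longer.
url = ("http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/"
       "digs%20-%20quoting%20%20enrollment%20(individual)/!ut/p/a1/rZHL/dl5/d5/L0lJ")

# The pattern from the post ("\/" is just an escaped "/").
pattern = re.compile(r"(.*)\/!ut")

# Unanchored search: group 1 is everything before "/!ut".
prefix = pattern.search(url).group(1)

# Whole-string match: returns None, because text follows "/!ut".
whole = pattern.fullmatch(url)

# Hypothetical variant that also consumes the session suffix, so it can
# match the entire URL; the trailing "/" is re-added to get the target form.
variant = re.fullmatch(r"(.*)/!ut/.*", url)
target = variant.group(1) + "/"

print(prefix)
print(whole)
print(target)
```

If the mapper does require a whole-URL match, the original pattern would never match a URL of this shape, which is one thing worth checking before looking elsewhere.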

Reply by: Michael Cizmar (posted 2020/05/29 01:47)

The "!ut" followed by a bunch of session information comes from WebSphere Portal. Some information about it here:
https://books.google.com/books?id=bqAXnpmj5LwC&pg=PA180&lpg=PA180&dq=%22!ut%22+session+variables+websphere#v=onepage&q=%22!ut%22%20session%20variables%20websphere&f=false

I'll look at making a change to the web crawler to support this, like the BV and ASP.NET handling.


Reply by: Karl Wright (posted 2020/05/29 02:03)

Thanks! It's far better to implement this than to try and hack it. A
general way of removing session information with regular expressions is
probably not going to cut it either, so for now it's got to be in Java.

Karl
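As a rough illustration of what scheme-specific session removal could look like, here is a minimal sketch. It is written in Python only for brevity; the real change would live in the Java web connector, and none of the names or rules below are taken from the ManifoldCF source. The function `strip_session_info` and both patterns are hypothetical: one rule for `;jsessionid=` path parameters and one for the WebSphere Portal `!ut` segment discussed above.

```python
import re

# Hypothetical rules for illustration only; not ManifoldCF's actual logic.
# Java app server session id carried as a path parameter.
_JSESSIONID = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)
# WebSphere Portal "!ut" segment and everything after it in the path.
_WEBSPHERE_STATE = re.compile(r"/!ut/[^?#]*")


def strip_session_info(url: str) -> str:
    """Return url with the session markers recognised above removed."""
    url = _JSESSIONID.sub("", url)
    # Replace with "/" so the remaining page path still ends in a slash.
    url = _WEBSPHERE_STATE.sub("/", url, count=1)
    return url


print(strip_session_info("http://localhost:8080/portal/page/!ut/p/a1/abc/dl5/d5/xyz"))
print(strip_session_info("http://example.com/app/index.jsp;jsessionid=ABC123?x=1"))
```

Each rule targets one uniquely identifiable session scheme rather than trying to be a general-purpose pattern, which is why new schemes need their own explicit handling.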

Reply by: Michael Cizmar (posted 2020/05/29 02:40)

Right. Another case that I'm exploring: crawling an internal site and wanting a load-balanced URL. So you would crawl something like this:

http://mystaging-server.myco.com/index.html

and then want to change it to:

https://www.myco.com/index.html

Is that a better fit for the URL mapper?

--

Michael Cizmar
Managing Director

p: 312.585.6396

d: 312.585.6286
twitter: @michaelcizmar http://twitter.com/michaelcizmar

http://www.mcplusa.com/

The information contained in this communication is confidential, private, proprietary, or otherwise privileged and is intended only for the use of the addressee. This e-mail is intended only for the person or entity to whom it is directed. Unauthorized use, disclosure, distribution or copying is strictly prohibited and may be unlawful. If you are not the intended recipient, please notify us immediately and permanently delete this e-mail and any attachments.


Reply by: Karl Wright (posted 2020/05/29 03:12)

That's a much better case for using the url mapper, yes.
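The staging-to-public rewrite can be sketched as a single regular expression plus a replacement. The example below is in Python for illustration and does not use ManifoldCF's own mapping syntax; the function name `map_to_public` is hypothetical, and the host names are the ones from the question.

```python
import re

# Match the staging origin at the start of the URL and capture the rest.
STAGING = re.compile(r"^http://mystaging-server\.myco\.com(/.*)?$")


def map_to_public(url: str) -> str:
    """Rewrite a staging URL to its public equivalent; leave others alone."""
    m = STAGING.match(url)
    if m is None:
        return url
    # An empty path becomes "/" so the result is still a well-formed URL.
    return "https://www.myco.com" + (m.group(1) or "/")


print(map_to_public("http://mystaging-server.myco.com/index.html"))
```

Unlike session removal, this mapping depends only on a fixed, literal prefix, so a regular expression describes it completely.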

